|
CG:
You really need a mnemonic Bank ID that can be mapped to a standard bank name. We have to do this type of standardization often. It never works perfectly in new situations.
I'd experiment first with different length constraints on the SOUNDEX() function. Once you have a relation of soundex values and standard bank names, then you may be able to use one of the similarity functions (edit distance or rearrangement cost [e.g., SPEDIS()]) to check to see if the relation contains a close enough match to a standard name. If not, it would have to be reconciled off-line.
Here's a rough cut of how you might create a relation of standardized bank names and soundex values:
data banks;
infile cards pad truncover;
input @1 loanid $char8. @10 class $4. @16 Bank_Name $char39. @55 Type $char20.
;
datalines;
BF486261 Corp Credit Lyonnais Mandated Arranger
BF486265 Corp JP Morgan Bookrunner
BF486265 Corp JP Morgan Mandated Arranger
BF486265 Corp KFW Kreditanstalt Fur Wiederaufbau Mandated Arranger
BF486269 Corp Societe Generale Agent(s)
BF486269 Corp KFW Kreditanstalt Fur Wiederaufbau Arranger(s)
BF486269 Corp Societe Generale Bookrunner
BF486269 Corp Societe Generale Mandated Arranger
BF486273 Corp Bank of Tokyo-Mitsubishi UFJ Ltd Agent(s)
BF486273 Corp Landesbank Baden-Wuerttemberg Arranger(s)
BF486273 Corp Bank of Tokyo-Mitsubishi UFJ Ltd Bookrunner
BF486273 Corp Bank of Tokyo-Mitsubishi UFJ Ltd Lender(s)
BF486273 Corp Landesbank Baden-Wuerttemberg Lender(s)
BF486273 Corp Bank of Tokyo-Mitsubishi UFJ Ltd Mandated Arranger
BF497955 Corp Societe Generale Bookrunner
BF497955 Corp Societe Generale Mandated Arranger
BF500064 Corp Royal Bank of Scotland Agent(s)
BF500064 Corp Royal Bank of Scotland Bookrunner
BF500064 Corp BNP Paribas Group Lender(s)
BF500064 Corp HSBC Financial Products (France) SNC Mandated Arranger
BF500064 Corp Royal Bank of Scotland Mandated Arranger
BF500797 Corp ING Bank NV Bookrunner
BF500797 Corp ING Bank NV Mandated Arranger
BF500797 Corp ING Bank NV Mandated Arranger
BF500797 Corp ING Bank NV Mandated Arranger
BF500797 Corp ING Bank NV Mandated Arranger
BF500797 Corp ING Bank NV Mandated Arranger
BF500797 Corp ING Bank NV Mandated Arranger
BF500797 Corp ING Bank NV Mandated Arranger
BF500797 Corp ING Bank NV Mandated Arranger
BF500797 Corp ING Bank NV Mandated Arranger
BF500797 Corp ING Bank NV Mandated Arranger
BF500797 Corp ING Bank NV Mandated Arranger
BF500797 Corp ING Bank NV Mandated Arranger
BF501049 Corp Credit Lyonnais Agent(s)
BF501049 Corp CIT Group Lender(s)
BF501049 Corp Credit Lyonnais Lender(s)
BF501049 Corp ING Groep NV Lender(s)
BF501049 Corp CIT Group Mandated Arranger
BF501049 Corp Credit Lyonnais Mandated Arranger
BF501049 Corp ING Groep NV Mandated Arranger
LN059143 Corp Banca Commerciale Italiana Lender(s)
LN059143 Corp Banca di Roma (France) Lender(s)
LN059143 Corp Banco Bilbao Vizcaya Argentaria Lender(s)
LN059143 Corp Banco Espanol de Credito Lender(s)
LN059143 Corp Bank One NA Lender(s)
LN059143 Corp Barclays Bank PLC (US) Lender(s)
LN059143 Corp Deutsche Bank AG Lender(s)
LN059143 Corp Dexia Banque SA Lender(s)
LN059143 Corp ING Bank NV/United States Lender(s)
LN059143 Corp Rabobank International Lender(s)
LN059143 Corp Raiffeisen Zentralbank Oesterreich AG Lender(s)
LN059143 Corp Sanpaolo IMI SpA Lender(s)
LN059143 Corp Unicredito Italiano Lender(s)
LN059143 Corp Deutsche Bank AG Mandated Arranger
LN059143 Corp IntesaBci SpA Mandated Arranger
;
run;
proc sql;
create table BankStan as
select distinct Bank_Name,substr(soundex(Bank_Name),1,8) as stanBankName
from banks
group by stanBankName having Bank_Name=min(Bank_Name)
;
quit;
The subsequent comparisons of observed bank names to standardized bank names using SPEDIS() or other similarity functions would come next, and would return the standardized name. I am assuming here that the bank name doesn't change with Type.
S
-----Original Message-----
From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of ChrisG
Sent: Wednesday, December 16, 2009 9:40 AM
To: SAS-L@LISTSERV.UGA.EDU
Subject: matching names + cleaning data
Hi guys
Thank you in advance for any clue, advice and so on ...
Here is my big issue
I have data on the composition of bank loan syndicates (i.e. several
banks make a loan together to a borrower)
I have a variable LOANID to identify the loan, a variable BANK_NAME
for the name of the bank, and TYPE to classify the "rank" of the bank
in the syndicate
I have copy paste a portion of the data below :
loanid Bank_Name Type
BF486261 Corp Credit Lyonnais Mandated Arranger
BF486265 Corp JP Morgan Bookrunner
BF486265 Corp JP Morgan Mandated Arranger
BF486265 Corp KFW Kreditanstalt Fur Wiederaufbau Mandated Arranger
BF486269 Corp Societe Generale Agent(s)
BF486269 Corp KFW Kreditanstalt Fur Wiederaufbau Arranger(s)
BF486269 Corp Societe Generale Bookrunner
BF486269 Corp Societe Generale Mandated Arranger
BF486273 Corp Bank of Tokyo-Mitsubishi UFJ Ltd Agent(s)
BF486273 Corp Landesbank Baden-Wuerttemberg Arranger(s)
BF486273 Corp Bank of Tokyo-Mitsubishi UFJ Ltd Bookrunner
BF486273 Corp Bank of Tokyo-Mitsubishi UFJ Ltd Lender(s)
BF486273 Corp Landesbank Baden-Wuerttemberg Lender(s)
BF486273 Corp Bank of Tokyo-Mitsubishi UFJ Ltd Mandated Arranger
BF497955 Corp Societe Generale Bookrunner
BF497955 Corp Societe Generale Mandated Arranger
BF500064 Corp Royal Bank of Scotland Agent(s)
BF500064 Corp Royal Bank of Scotland Bookrunner
BF500064 Corp BNP Paribas Group Lender(s)
BF500064 Corp HSBC Financial Products (France) SNC Mandated Arranger
BF500064 Corp Royal Bank of Scotland Mandated Arranger
BF500797 Corp ING Bank NV Bookrunner
BF500797 Corp ING Bank NV Mandated Arranger
BF500797 Corp ING Bank NV Mandated Arranger
BF500797 Corp ING Bank NV Mandated Arranger
BF500797 Corp ING Bank NV Mandated Arranger
BF500797 Corp ING Bank NV Mandated Arranger
BF500797 Corp ING Bank NV Mandated Arranger
BF500797 Corp ING Bank NV Mandated Arranger
BF500797 Corp ING Bank NV Mandated Arranger
BF500797 Corp ING Bank NV Mandated Arranger
BF500797 Corp ING Bank NV Mandated Arranger
BF500797 Corp ING Bank NV Mandated Arranger
BF500797 Corp ING Bank NV Mandated Arranger
BF501049 Corp Credit Lyonnais Agent(s)
BF501049 Corp CIT Group Lender(s)
BF501049 Corp Credit Lyonnais Lender(s)
BF501049 Corp ING Groep NV Lender(s)
BF501049 Corp CIT Group Mandated Arranger
BF501049 Corp Credit Lyonnais Mandated Arranger
BF501049 Corp ING Groep NV Mandated Arranger
LN059143 Corp Banca Commerciale Italiana Lender(s)
LN059143 Corp Banca di Roma (France) Lender(s)
LN059143 Corp Banco Bilbao Vizcaya Argentaria Lender(s)
LN059143 Corp Banco Espanol de Credito Lender(s)
LN059143 Corp Bank One NA Lender(s)
LN059143 Corp Barclays Bank PLC (US) Lender(s)
LN059143 Corp Deutsche Bank AG Lender(s)
LN059143 Corp Dexia Banque SA Lender(s)
LN059143 Corp ING Bank NV/United States Lender(s)
LN059143 Corp Rabobank International Lender(s)
LN059143 Corp Raiffeisen Zentralbank Oesterreich AG Lender(s)
LN059143 Corp Sanpaolo IMI SpA Lender(s)
LN059143 Corp Unicredito Italiano Lender(s)
LN059143 Corp Deutsche Bank AG Mandated Arranger
LN059143 Corp IntesaBci SpA Mandated Arranger
I need to do 2 things (not sure in what order to make things less
complicated / more efficient)
1. i need to harmonize the names of the banks
as you can see in the data above, for instance i have sometimes ING
Bank NV and sometimes ING Bank NV/United States
I want to have always let say ING Bank NV or ING or whatever but one
name for any possible "variation" of a company which belongs to the
ING Group
I tried some proc transpose then some compged and so on but honestly i
have the impression im going nowhere...
let say for instance that at the end of that step i would like to get
something like that (with unique and simple names of banks):
loanid Bank_Name Type
BF486261 Corp Credit Lyonnais Mandated Arranger
BF486265 Corp JP Morgan Bookrunner
BF486265 Corp JP Morgan Mandated Arranger
BF486265 Corp KFW Kreditanstalt Mandated Arranger
BF486269 Corp Societe Generale Agent(s)
BF486269 Corp KFW Kreditanstalt Arranger(s)
BF486269 Corp Societe Generale Bookrunner
BF486269 Corp Societe Generale Mandated Arranger
BF486273 Corp Bank of Tokyo-Mitsubishi Agent(s)
BF486273 Corp Landesbank Baden-Wuerttemberg Arranger(s)
BF486273 Corp Bank of Tokyo-Mitsubishi Bookrunner
BF486273 Corp Bank of Tokyo-Mitsubishi Lender(s)
BF486273 Corp Landesbank Baden-Wuerttemberg Lender(s)
BF486273 Corp Bank of Tokyo-Mitsubishi Mandated Arranger
BF497955 Corp Societe Generale Bookrunner
BF497955 Corp Societe Generale Mandated Arranger
BF500064 Corp Royal Bank of Scotland Agent(s)
BF500064 Corp Royal Bank of Scotland Bookrunner
BF500064 Corp BNP Paribas Lender(s)
BF500064 Corp HSBC Mandated Arranger
BF500064 Corp Royal Bank of Scotland Mandated Arranger
BF500797 Corp ING Bank Bookrunner
BF500797 Corp ING Bank Mandated Arranger
BF500797 Corp ING Bank Mandated Arranger
BF500797 Corp ING Bank Mandated Arranger
BF500797 Corp ING Bank Mandated Arranger
BF500797 Corp ING Bank Mandated Arranger
BF500797 Corp ING Bank Mandated Arranger
BF500797 Corp ING Bank Mandated Arranger
BF500797 Corp ING Bank Mandated Arranger
BF500797 Corp ING Bank Mandated Arranger
BF500797 Corp ING Bank Mandated Arranger
BF500797 Corp ING Bank Mandated Arranger
BF500797 Corp ING Bank Mandated Arranger
BF501049 Corp Credit Lyonnais Agent(s)
BF501049 Corp CIT Group Lender(s)
BF501049 Corp Credit Lyonnais Lender(s)
BF501049 Corp ING Bank Lender(s)
BF501049 Corp CIT Group Mandated Arranger
BF501049 Corp Credit Lyonnais Mandated Arranger
BF501049 Corp ING Bank Mandated Arranger
LN059143 Corp Banca Commerciale Italiana Lender(s)
LN059143 Corp Banca di Roma Lender(s)
LN059143 Corp Banco Bilbao Vizcaya Argentaria Lender(s)
LN059143 Corp Banco Espanol de Credito Lender(s)
LN059143 Corp Bank One Lender(s)
LN059143 Corp Barclays Bank Lender(s)
LN059143 Corp Deutsche Bank Lender(s)
LN059143 Corp Dexia Banque Lender(s)
LN059143 Corp ING Bank Lender(s)
LN059143 Corp Rabobank International Lender(s)
LN059143 Corp Raiffeisen Zentralbank Lender(s)
LN059143 Corp Sanpaolo IMI Lender(s)
LN059143 Corp Unicredito Italiano Lender(s)
LN059143 Corp Deutsche Bank Mandated Arranger
LN059143 Corp IntesaBci SpA Mandated Arranger
so if anyone have a clue, thanks a lot in advance !
2. i need to clean the data in the following way
as you can see, for each loan, the name of a bank can appear several
times
let me pick that one :
BF486273 Corp Bank of Tokyo-Mitsubishi UFJ Ltd Agent(s)
BF486273 Corp Landesbank Baden-Wuerttemberg Arranger(s)
BF486273 Corp Bank of Tokyo-Mitsubishi UFJ Ltd Bookrunner
BF486273 Corp Bank of Tokyo-Mitsubishi UFJ Ltd Lender(s)
BF486273 Corp Landesbank Baden-Wuerttemberg Lender(s)
BF486273 Corp Bank of Tokyo-Mitsubishi UFJ Ltd Mandated Arranger
here Bank of Tokyo-Mitsubishi UFJ Ltd appears 4 times because it acts
in different roles in that particular bank syndicate (as Agent,
Bookrunner, Lender and Mandated Arranger)
I need to keep only one occurrence of this bank for that loan
as the hierarchy is Agent > Mandated Arranger > Lender, if this bank
is agent, i don't care that it is arranger or lender
i want to keep it just once as agent
same applies to the second bank of that loan
Landesbank Baden-Wuerttemberg appears 2 times, once as arranger and
once as lender
again as arranger > lender, i just want to keep it as arranger
to get at the end something like that
BF486273 Corp Bank of Tokyo-Mitsubishi UFJ Ltd Agent(s)
BF486273 Corp Landesbank Baden-Wuerttemberg Arranger(s)
again thanks in advance for any advice / clue
Best regards
CG
|