LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (November 2011, week 5)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:         Wed, 30 Nov 2011 23:15:06 +0000
Reply-To:     "DUELL, BOB" <bd9439@ATT.COM>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         "DUELL, BOB" <bd9439@ATT.COM>
Subject:      Re: Data Pull from Large databases
Comments: To: Ahmet Duman <Bir_Yoklu@YAHOO.COM>
In-Reply-To:  <201111302231.pAULnqpQ012763@waikiki.cc.uga.edu>
Content-Type: text/plain; charset="us-ascii"

Hi Ahmet,

You say your "largedata" contains an index on "subject_id"? If so, have you tried reading it with a key option?

In other words, after you create and de=dup your data set x, try this:

data z; set x; set largedata key=subject_id; run;

Of course, I'm assuming "largedata" is a SAS data set. You also say they are "relational" so perhaps I misunderstood. If you really are talking about accessing a RDBMS, tell us which database your are using and how you are connection to it.

Bob

-----Original Message----- From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of Ahmet Duman Sent: Wednesday, November 30, 2011 2:32 PM To: SAS-L@LISTSERV.UGA.EDU Subject: Data Pull from Large databases

I am dealing with large scale databases and often need to create subsets using subject id's. These databases are relational databases (EMR and Claims) and depending on the nature of the study, I use codes (ICD, CPT etc..) or demographics to identify the subject id's. The databases I use are often indexed by subject id's.

Here is my problem: When I am dealing with a large datasets, I can not use merge in data step or sql because of the extremely large size of the data.

If I have a small set of subject id's (lets say less than 3000), I am creating a macro variable and using this macro variable in a data step to pull the data;

data x; set code; where code in ('xxx.xx'); keep subject_id; run; proc sort data = x out = y nodupkey; by = subject_id; run;

proc sql; select subject_id into:sid seperated by ',' from y; quit;

data z; set largedata; where subject_id in (&sid); run;

This is quite fast and much much more efficient than datastep or sql merge since an index variable is created for subject_id in largedata. When I have more than 3000 subject id's however, I have to split them into multiple categories and repeat the code for each macro variable since macro variable exceeds the sas limits. Here is the new code for lets say for 12,000 subject id's;

data x; set code; where code in ('xxx.xx'); keep subject_id; run; proc sort data = x out = y nodupkey; by = subject_id; run;

proc sql; select subject_id into:sid1 seperated by ',' from y (firstobs=00001 obs = 03000); proc sql; select subject_id into:sid2 seperated by ',' from y (firstobs=03001 obs = 06000); proc sql; select subject_id into:sid3 seperated by ',' from y; (firstobs=06001 obs = 09000); proc sql; select subject_id into:sid4 seperated by ',' from y; (firstobs=09001 obs = 12000); quit;

data z1; set largedata; where subject_id in (&sid1); run; data z2; set largedata; where subject_id in (&sid2); run; data z3; set largedata; where subject_id in (&sid3); run; data z4; set largedata; where subject_id in (&sid4); run;

Data z; set z1 z2 z3 z4; run;

As you can see, it is managable if subject id is not large but the code becomes inefficient especially if you are dealing with 100K or millions of subjects.

Is there an easy way to deal with this problem? I really appreciate for your helps in advance.

Ahmet


Back to: Top of message | Previous page | Main SAS-L page