Date: Fri, 19 Dec 2008 17:05:49 -0600
Reply-To: Mary <mlhoward@avalon.net>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Mary <mlhoward@AVALON.NET>
Subject: Re: Aggregate and individual-level data analysis
Content-Type: text/plain; charset="iso-8859-1"
Actually, this is a bit better.
First what you'd need is the population of the zip codes that
are in your disease surveillance dataset. If it isn't too many,
then try http://www.city-data.com/ and it will give you the population,
or there are other databases that you can obtain this from
(I would think that they would originally come from the U.S.
Gov. Census).
Then you've got this disease surveillance dataset, which is
by individual (or could be even lower data, such as
individual claims). If it is lower aggregate data than
the individual level, I think you would want to aggregate
it to one record per individual with that person's zipcode.
Then once you get it to that point you could count the
number of ID's within zip code:
Now to get everybody that doesn't have the disease. Let's
start with the population counts:
Then we can get rates per population.
Now once we've got that, we can turn to the other data set,
and merge in your counts of child abuse.
Then merge that in with your population data on the diseases.
Then you'll have a table with the disease counts and abuse
counts, along with their rates, in one table. Perhaps you
can do a correlation on the two rates or other analysis
at that point.
data detail;
infile cards missover;
input id zip_code;
cards;
001 52402
001 52402
002 52402
002 52242
;
run;
proc sql;
create table individual_level as
select distinct id, zip_code
from detail
order by zip_code, id;
quit;
proc sql;
create table disease_counts as
select '1' as disease_status, zip_code, count(id) as disease_count
from individual_level
group by zip_code
order by zip_code;
quit;
data pop_counts;
infile cards missover;
input zip_code population;
cards;
52402 39912
52244 42212
52241 50000
52242 49000
;
run;
proc sort data=disease_counts;
by zip_code;
run;
proc sort data=pop_counts;
by zip_code;
run;
proc sql;
create table disease_rates as
select pop_counts.zip_code,
pop_counts.population,
disease_counts.disease_count,
case when disease_counts.disease_count=. then 0 else disease_count/(population/10000) end as
disease_rate format 7.2
from pop_counts
left outer join
disease_counts
on pop_counts.zip_code=disease_counts.zip_code;
quit;
data child_abuse_counts;
infile cards missover;
input zip_code abuse_cases;
cards;
52402 1
52244 0
52241 2
52242 1
;
run;
proc sql;
create table disease_rates_with_child as
select disease_rates.zip_code,
disease_rates.population,
disease_rates.disease_count,
child_abuse_counts.abuse_cases,
disease_rates.disease_rate,
case when child_abuse_counts.abuse_cases=. then 0 else child_abuse_counts.abuse_cases/(population/10000) end as
abuse_rate format 7.2
from disease_rates
left outer join
child_abuse_counts
on disease_rates.zip_code=child_abuse_counts.zip_code;
quit;
proc corr data=disease_rates_with_child;
var disease_rate abuse_rate;
run;
-Mary
----- Original Message -----
From: Rieza Soelaeman
To: SAS-L@LISTSERV.UGA.EDU
Sent: Friday, December 19, 2008 3:44 PM
Subject: Re: Aggregate and individual-level data analysis
Ok, so that was a silly little example. I apologize that it's not as good
as my other silly little examples :(. I guess I just wanted to create an
example that is similar enough to my situation, which involves confidential
data. So I have this disease surveillance dataset, in which everyone in
this dataset has the disease (this is what Dataset1 was alluding to). A
colleague has asked me to run some analyses to see if there is a correlation
between the prevalence of the disease with child abuse. Now the data on
child abuse she gave me was Child Protective Services (CPS) data, which is
aggregate data. It's a 1-pager Excel spreadsheet with the zipcodes as rows
and as columns, the type of CPS intervention (5 categories, this is where I
came up with the idea of BMI classification). Essentially, it's a proc freq
of zipcode*(category of CPS intervention). That is all I have to work
with. Like I said, I do have individual level data from the disease
surveillance system, I just have very limited data to compare it against.
I hope this clarifies things a bit. I've forgotten everything I learned
about aggregate data analysis since I left school 6 months ago :(.
Thanks guys!
On Fri, Dec 19, 2008 at 4:13 PM, Mary <mlhoward@avalon.net> wrote:
> Yes, Peter is right, you would need to have information on the people
> at the football game who did not buy food (it is only food at the football
> game, Peter, not food altogether--presumably this does not include
> the food at tailgating and so forth before or after the game). You'd
> have to have the zip codes of the people who didn't buy food in
> order to compare those who bought food versus not.
>
> -Mary
> ----- Original Message -----
> From: Peter Flom
> To: SAS-L@LISTSERV.UGA.EDU
> Sent: Friday, December 19, 2008 2:52 PM
> Subject: Re: Aggregate and individual-level data analysis
>
>
> Rieza wrote
>
> > Yet another question from me.
> >
> > Suppose I have 2 datasets:
> >
> > 1. Dataset1--Contains individual-level data on who bought food at a
> > concession stand during a football game
> > 2. Dataset2--Contains aggregate data on prevalence of obesity (bmi
> >=30)
> > and overweight (bmi >=25) by zip code
> >
> >
> > Dataset1 looks roughly like this:
> > name zip code John 78530 Jane 78531 Angie 78532 Eileen 78530 Tim
> 78530
> > Bob 78532
> >
> > et cetera...let's say there are 3000 people in this dataset, all of
> these
> > people bought food.
> >
> >
> > Dataset2 looks roughly like this:
> > zip code overwt obese 78530 500 200 78531 600 500 78532 100 50
> >
> > Supposing I wanted to know if there was a correlation between buying
> food
> > and obesity, what procedure can I run? Notice that overweight and
> obese
> > are BMI classifications, so really, Dataset2 represents data from 1950
> > respondents. I get a feeling that I need to disaggregate Dataset2,
> because
> > I was kicking myself in the head when I tried to turn Dataset1 into an
> > aggregate dataset, and finding it impossible (and stupid) to try to
> plot the
> > data...
> >
> > As always, I welcome and appreciate any suggestions on how to tackle
> this.
>
>
> I don't think there is much you can do here .... because everyone in
> dataset 1 bought food.
>
> But maybe ....
>
> Are the people in data set 1 a random sample from each ZIP code?
> Do you know the total population of each ZIP code?
> Do you know what percentage of people buy food? (I'm thinking it's close
> to everyone, one way or another .... How does someone never buy food?)
>
> Peter
>
> Peter L. Flom, PhD
> Statistical Consultant
> www DOT peterflom DOT com
>
--
Rieza H Soelaeman, MPH