I'm not sure there is a quick way of doing this; in my last job I had 3000 SNPs. You might be able to generate code to produce the IF statements, but with only 50 SNPs that won't really be worth the effort compared with writing this out for each one:
if rs300005='AA' then rs300005_num=0;
else if rs300005='AB' then rs300005_num=1;
else if rs300005='BB' then rs300005_num=2;
else rs300005_num=.;
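If writing the statement out 50 times gets tedious, a pair of arrays in a DATA step can apply the same recoding to every SNP in one loop. This is only a sketch: the data set names (snps, snps_num) and the second and third SNP names are made up for illustration, and it assumes every genotype is already stored as 'AA', 'AB' or 'BB' as in the statement above.

```sas
data snps_num;                         /* hypothetical output data set */
    set snps;                          /* hypothetical input data set  */

    /* Character SNPs and their numeric counterparts, in the same order. */
    /* Only rs300005 comes from the example above; the rest are made up. */
    array snp_chr {*} $ rs300005 rs300010 rs300020;
    array snp_num {*} rs300005_num rs300010_num rs300020_num;

    do i = 1 to dim(snp_chr);
        select (snp_chr{i});
            when ('AA') snp_num{i} = 0;
            when ('AB') snp_num{i} = 1;
            when ('BB') snp_num{i} = 2;
            otherwise   snp_num{i} = .;   /* missing or unexpected genotype */
        end;
    end;

    drop i;
run;
```

If the genotypes are stored as nucleotide pairs that differ from SNP to SNP (AA/AG/GG for one, CC/CT/TT for another), the WHEN values would have to vary per SNP, and you would have to decide which homozygote counts as 0 for each one.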
A bigger problem is whether it is appropriate to run a SNP as a continuous rather than a categorical variable. I had a great deal of difficulty interpreting the output when I attempted this: is being heterozygous for a SNP halfway between being homozygous for allele 1 and homozygous for allele 2? What does it mean if the effect is significant? What I arrived at is that SNPs are not really continuous; there are three categories, much like a color variable with the values red, blue and green, and not like a Likert scale of disagree, neutral, agree. So you'd need to question the wisdom of running them as ordinal variables rather than categorical variables.
Although I did do this initially, in going to press we decided NOT to use it; it was just too difficult to defend in a publication. You might consider a logistic regression predicting disease or no disease, with disease as your dependent variable and a categorical SNP as your independent variable (or one of them), since the independent variable can have three categories.
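As a rough sketch of that last suggestion, the SNP can go on a CLASS statement so that SAS treats its three genotypes as categories instead of a 0/1/2 score. The data set name (snps), the outcome variable disease (assumed to be coded 1 = disease, 0 = no disease) and the choice of 'AA' as the reference genotype are all assumptions for illustration.

```sas
proc logistic data=snps;                    /* hypothetical data set        */
    /* Treat the character SNP as a three-level categorical predictor, */
    /* comparing each genotype with the assumed reference 'AA'.        */
    class rs300005 (ref='AA') / param=ref;

    /* disease is a hypothetical 0/1 outcome; model the probability of 1. */
    model disease (event='1') = rs300005;
run;
```

With a continuous outcome such as a biomarker level, the same idea should carry over to PROC GLM with the SNP on its CLASS statement.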
From: Lance Smith <medicaltrial@GMAIL.COM>
Subject: Using character variales as continuous variables
Date: Wed, 10 Mar 2010 15:53:02 -0800
I have a database of 50 SNP variables. Each SNP variable has 3 levels
let’s say AA, AG, GG. The levels vary with different SNPs, so another
one may be CC CT and TT and still another may be AA AC and CC.
I also have levels of four markers that are on a continuous scale.
I need to do univariate linear regression to predict the level of
biomarkers using each SNP separately.
Thus I need to do 50*4 = 200 univariate linear regressions.
The SNPs need to be recoded to 0,1,2 for the regression as we want to
treat them as a continuous variable with the heterozygotes (AG or CT
or AC) coded as 1.
Is there a way to efficiently do the recoding to 0,1,2 in SAS without
having to recode all the 50 SNPs separately? Or is there a way to tell
SAS to treat them as continuous variables even though they are coded
as character variables?