```
Date: Sat, 29 Mar 2008 22:13:14 -0700
Reply-To: Phil Holman
Sender: "SAS(r) Discussion"
From: Phil Holman
Subject: Re: help with thoughts about the chi square test of independence
Comments: To: sas-l@uga.edu

"jenmoocat" wrote in message
news:6ec0cceb-9ffb-4487-aea5-6850694dfcda@h11g2000prf.googlegroups.com...
> Hello all.
> I hope some of you can shed some light on a problem I am having.
> I do have a stats degree, but I got it over 15 years ago.
> I've used the web to try to research this idea, but I don't really see
> it addressed....
>
>
> I was tasked with creating an audit function.
> Part of a process flow is to randomly assign customers to one of two
> groups.
> We want to make sure that the customers in group 1 look like the
> customers in group 2.
> I thought that a chi-square test of independence could be a way to do
> this.
>
> I chose a couple of factors that define our customers: age, tenure,
> risk-score (for example).
> I then perform the chi-square test of independence on each factor
> separately.
> In each case, I am essentially posing the null hypothesis that the
> factor is independent of group membership:
> age is unrelated to group membership, tenure is independent of group
> membership, etc...
> In my thinking, if the null hypothesis is true along all of the
> factors of importance, then the two groups have truly been populated
> randomly.
>
> In the actual mechanics of the test, I have tens of thousands (if not
> hundreds of thousands) of observations.
> I then bin the factor --- break age down into 9 groups for example:
> under 18
> 18 to 25
> 25 to 35
> etc....
>
> In that way I then get two distributions: the distribution of group 1
> by age and the distribution of group 2 by age.
> I have read in the statistics literature that, because the chi-square
> test by nature is sensitive to sample size, the significance level of
> such a test should be something like 0.01, rather than the more common
> 0.05.
>
> So I perform my test on the independence of age and group membership.
> I graph the two histograms together, so I can get a visual aid. And I
> also calculate the chi-square statistic...
>
> And I have found that even small differences will cause the null
> hypothesis to be rejected.
>
> In the data below, if you graph the two histograms together, they line
> up very closely.
> The data, eyeballed, looks as if age is independent from group
> membership.
> However, the calculated chi-square stat is 46, compared to the
> critical value of 21 for 9 degrees of freedom and a significance level
> of 0.01. The p-value is minuscule. I interpret this to mean that the
> probability of the calculated chi-square stat (or seeing these two
> histograms) if the null hypothesis of independence were true, is very
> tiny.
>
> age    group 1    group 2
>   1         86         77
>   2        415        440
>   3      1,559      1,577
>   4      5,810      5,751
>   5     22,450     22,000
>   6     26,182     26,182
>   7     16,947     16,947
>   8      5,336      6,000
>   9        184        168
>  10          8         11
>
> My bosses think that the test is not good at these high numbers and
> are thinking about scrapping it.

What gives with age group 8? If it wasn't for that one age group, your
X^2 value would be ~10.

Phil H
```
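One way to check which age group drives the statistic is to split the Pearson chi-square into one contribution per row. The sketch below is only an illustration in plain Python, not code from the thread: the function name `chi_square_by_row` and the variable names are made up here, and the counts are copied from the table in the quoted post.

```python
# Counts from the quoted post: one (group 1, group 2) pair per age bin.
counts = [
    (86, 77), (415, 440), (1559, 1577), (5810, 5751), (22450, 22000),
    (26182, 26182), (16947, 16947), (5336, 6000), (184, 168), (8, 11),
]


def chi_square_by_row(table):
    """Return each row's contribution to the Pearson chi-square statistic.

    Hypothetical helper written for this illustration.
    """
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(col_totals)
    contributions = []
    for row in table:
        row_total = sum(row)
        contrib = 0.0
        for observed, col_total in zip(row, col_totals):
            # Expected count under independence of age bin and group.
            expected = row_total * col_total / grand_total
            contrib += (observed - expected) ** 2 / expected
        contributions.append(contrib)
    return contributions


contributions = chi_square_by_row(counts)
chi2 = sum(contributions)
# Degrees of freedom for an r x c table: (r - 1) * (c - 1).
dof = (len(counts) - 1) * (len(counts[0]) - 1)

for age_bin, contrib in enumerate(contributions, start=1):
    print(f"age bin {age_bin:2d}: {contrib:7.2f}")
print(f"chi-square = {chi2:.2f} on {dof} degrees of freedom")
```

With these counts the total should land close to the 46 reported in the post, with age bin 8 supplying most of it. Note that dropping a row also changes the expected counts for the remaining rows, so the statistic without bin 8 has to be recomputed on the reduced table rather than read off by subtraction.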
