Date: Sat, 29 Mar 2008 22:13:14 0700
Reply-To: Phil Holman <piholmanc@YOURSERVICE.UGA.EDU>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Phil Holman <piholmanc@YOURSERVICE.UGA.EDU>
Subject: Re: help with thoughts about the chi square test of independence
"jenmoocat" <sollje2002@yahoo.com> wrote in message
news:6ec0cceb9ffb4487aea56850694dfcda@h11g2000prf.googlegroups.com...
> Hello all.
> I hope some of you can shed some light on a problem I am having.
> I do have a stats degree, but I got it over 15 years ago.
> I've used the web to try to research this idea, but I don't really see
> it addressed....
>
>
> I was tasked with creating an audit function.
> Part of a process flow is to randomly assign customers to one of two
> groups.
> We want to make sure that the customers in group 1 look like the
> customers in group 2.
> I thought that a chi-square test of independence could be a way to do
> this.
>
> I chose a couple of factors that define our customers: age, tenure,
> riskscore (for example).
> I then perform the chi-square test of independence on each factor
> separately.
> In each case, I am essentially posing the null hypothesis that the
> factor is independent of group membership:
> age is unrelated to group membership, tenure is independent of group
> membership, etc...
> In my thinking, if the null hypothesis is true along all of the
> factors of importance, then the two groups have truly been populated
> randomly.
>
> In the actual mechanics of the test, I have tens of thousands (if not
> hundreds of thousands) of observations.
> I then bin the factor - break age down into 9 groups for example:
> under 18
> 18 to 25
> 25 to 35
> etc....
>
> In that way I then get two distributions: the distribution of group 1
> by age and the distribution of group 2 by age.
> I have read in the statistics literature that, because the chi-square
> test by nature is sensitive to sample size, the significance level of
> such a test should be something like 0.01, rather than the more common
> 0.05.
>
> So I perform my test on the independence of age and group membership.
> I graph the two histograms together, so I can get a visual aid. And I
> also calculate the chi-square statistic...
>
> And I have found that even small differences will cause the null
> hypothesis to be rejected.
>
> In the data below, if you graph the two histograms together, they line
> up very closely.
> The data, eyeballed, looks as if age is independent from group
> membership.
> However, the calculated chi-square stat is 46, compared to the
> critical value of 21 for 9 degrees of freedom and a significance level
> of 0.01. The p-value is minuscule. I interpret this to mean that the
> probability of the calculated chi-square stat (or of seeing these two
> histograms), if the null hypothesis of independence were true, is very
> tiny.
>
> age   group 1   group 2
>   1        86        77
>   2       415       440
>   3     1,559     1,577
>   4     5,810     5,751
>   5    22,450    22,000
>   6    26,182    26,182
>   7    16,947    16,947
>   8     5,336     6,000
>   9       184       168
>  10         8        11
>
> My bosses think that the test is not good at these high numbers and
> are thinking about scrapping it.
What gives with age group 8? If it wasn't for that one age group, your
X^2 value would be ~10.
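
To see where the 46 comes from, here is a rough Python sketch (not SAS, and not taken from the original audit code) that splits the Pearson statistic into per-age-group contributions using the counts from the quoted table. The function and variable names are made up for illustration. It also recomputes the statistic from scratch with age group 8 left out, which is one possible way to read "if it wasn't for that one age group".

```python
# Counts copied from the table in the quoted post (age groups 1..10).
group1 = [86, 415, 1559, 5810, 22450, 26182, 16947, 5336, 184, 8]
group2 = [77, 440, 1577, 5751, 22000, 26182, 16947, 6000, 168, 11]


def chi_square_contributions(col1, col2):
    """Return each row's contribution to the Pearson chi-square statistic
    for a two-column contingency table (hypothetical helper)."""
    n1, n2 = sum(col1), sum(col2)
    n = n1 + n2
    contributions = []
    for o1, o2 in zip(col1, col2):
        row_total = o1 + o2
        e1 = row_total * n1 / n  # expected count under independence
        e2 = row_total * n2 / n
        contributions.append((o1 - e1) ** 2 / e1 + (o2 - e2) ** 2 / e2)
    return contributions


contrib = chi_square_contributions(group1, group2)
full_stat = sum(contrib)
# 1-based age group with the largest single contribution.
worst_bin = max(range(len(contrib)), key=lambda i: contrib[i]) + 1

# Recompute with age group 8 dropped (list index 7); margins change too.
keep = [i for i in range(len(group1)) if i != 7]
stat_without_8 = sum(chi_square_contributions(
    [group1[i] for i in keep], [group2[i] for i in keep]))

for age_bin, c in enumerate(contrib, start=1):
    print(f"age group {age_bin:2d}: contribution {c:6.2f}")
print(f"full statistic:          {full_stat:.2f}")
print(f"largest contribution:    age group {worst_bin}")
print(f"full minus group 8 term: {full_stat - contrib[7]:.2f}")
print(f"recomputed without 8:    {stat_without_8:.2f}")
```

Either way you slice it (subtracting group 8's term, or dropping the row and recomputing), what is left should come out well under the critical value, so nearly all of the rejection is coming from that one row.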
Phil H
