Separating populations and estimating line-fit parameters

Question

Given a dataset containing two populations, each of which can be described by a linear relationship between two variables in each sample with high $R^2$, how does one separate the two populations (and incidentally compute the line fit)?

This is fairly easy to do graphically: just create a scatterplot and the two lines are pretty apparent. But how does one do this algorithmically?

More generally, given a dataset containing an unknown number $n$ of populations, each of which can be fit to a line with some lower bound on $R^2$ (e.g., 0.95), how does one separate the data into the minimum number of populations satisfying the $R^2$ criterion?

Kaylee Evans · Accepted Answer

Step 1

If I understand correctly, you have two relations, $Y = a_1 + b_1 X$ for one population and $Y = a_2 + b_2 X$ for a second population, and you would like to merge them.

Step 2

If my hypothesis is correct, build a model $Y = a + bX + cZ$ in which $Z$ is 1 for observations belonging to the first population, 2 for the second population, and so on.
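A minimal sketch of fitting the model $Y = a + bX + cZ$ from Step 2 by ordinary least squares with NumPy. It assumes the population label $Z$ is already known for every observation, which the question does not provide. The function name `fit_indicator_model` and the noise-free data are hypothetical and chosen only for illustration. Because the model has a single slope $b$, it can only represent parallel lines whose intercepts differ by $c$.

```python
import numpy as np

def fit_indicator_model(x, y, z):
    """Least-squares fit of y = a + b*x + c*z; returns (a, b, c)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    z = np.asarray(z, dtype=float)
    # Design matrix columns: intercept, X, and the population code Z.
    design = np.column_stack([np.ones_like(x), x, z])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return tuple(coef)

# Hypothetical noise-free data: two parallel lines, labels assumed known.
x = np.array([0.0, 1.0, 2.0, 3.0, 0.0, 1.0, 2.0, 3.0])
z = np.array([1, 1, 1, 1, 2, 2, 2, 2])
y = 1.0 + 2.0 * x + 3.0 * z

a, b, c = fit_indicator_model(x, y, z)
print(a, b, c)
```

If the two populations may have different slopes, an interaction column (the product of X and Z) would have to be added to the design matrix; that goes beyond the model written in this answer.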

Antwan Perez · Accepted Answer

Step 1

See below for a better way, but how about a very naive iterative algorithm:

1. Use the dataset and estimate a line fit: $y = a + bX$.
2. Throw out the observation with the greatest distance to the estimated line, i.e., remove $\arg\max_i (y_i - a - b X_i)^2$ if $\max_i (y_i - a - b X_i)^2 > t \ge 0$, where $t$ is some threshold, and continue at (1.); otherwise stop.

Once you stop, the data set that remains should lie close to some line; that would be population 1. Everything you threw out should be population 2. But I am pretty sure this will not always converge to a good classification. (This depends on $t$, but also on how noisy the data is.) You can check whether the classification was successful by looking at the fit for all the observations you threw out in the iterative procedure; if a line fits them well, it worked.

Step 2

Otherwise, there are some clustering methods that find groups that belong together in your data. In particular, you could look at k-means clustering (which is a less naive way to do the above, it seems). If you choose such a method, find one where you can use your knowledge that the relationship within clusters is linear, and that you have two clusters. It will improve classification considerably.
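The naive iterative procedure from Step 1 can be sketched as follows. This is only an illustration under stated assumptions: the function name `trim_to_line`, the threshold value, and the synthetic noise-free data are hypothetical, and `np.polyfit` is used as the ordinary least-squares line fit.

```python
import numpy as np

def trim_to_line(x, y, t):
    """Refit a line and drop the worst point until the largest squared
    residual is at most t. Returns (keep_mask, a, b) for y = a + b*x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    keep = np.ones(x.size, dtype=bool)
    while True:
        # np.polyfit returns the highest degree first: slope, then intercept.
        b, a = np.polyfit(x[keep], y[keep], 1)
        sq_resid = (y - a - b * x) ** 2
        sq_resid[~keep] = -np.inf  # ignore points already thrown out
        worst = int(np.argmax(sq_resid))
        # Stop at the threshold; never trim below the two points a line needs.
        if sq_resid[worst] <= t or keep.sum() <= 2:
            return keep, a, b
        keep[worst] = False

# Hypothetical data: 20 points on y = 1 + 2x and 3 points on y = 30 + 2x.
x1 = np.arange(20.0)
x2 = np.array([5.0, 10.0, 15.0])
x = np.concatenate([x1, x2])
y = np.concatenate([1.0 + 2.0 * x1, 30.0 + 2.0 * x2])

keep, a, b = trim_to_line(x, y, t=1e-6)

# Check suggested in the answer: fit a line to everything thrown out.
b2, a2 = np.polyfit(x[~keep], y[~keep], 1)
print(keep.sum(), a, b, a2, b2)
```

On this easy example the second population is small and far from the first line, so the trimming isolates it. With two populations of similar size, or with noisy data, the first fit can fall between the two lines and the procedure may discard points from both, which is the convergence caveat raised above.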

