Separating populations and estimating line-fit parameters

Tara Mayer


2022-10-31

Given a dataset containing two populations, each of which can be described by a linear relationship between two variables in each sample with high $R^2$, how does one separate the two populations (and incidentally compute the line-fit)?
This is fairly easy to do graphically - just create a scatterplot and the two lines are pretty apparent. But how does one do this algorithmically?
More generally, given a dataset containing an unknown number $n$ of populations, each of which can be fit to a line with some lower bound on $R^2$ (e.g., 0.95), how does one separate the data into the minimum number of populations satisfying the $R^2$ criterion?

Answer & Explanation

Kaylee Evans

Beginner | 2022-11-01 | Added 20 answers

Step 1
If I understand correctly, you have two relations, $Y = a_1 + b_1 X$ for one population and $Y = a_2 + b_2 X$ for a second population, and you would like to merge them.
Step 2
If my hypothesis is correct, build a model $Y = a + bX + cZ$ in which $Z$ is 1 for observations belonging to the first population, 2 for the second population, and so on.
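As a rough illustration of the model above, here is a minimal sketch that fits $Y = a + bX + cZ$ by ordinary least squares using only the Python standard library. It assumes the population label $Z$ of every observation is already known, which the question does not provide, and note that with a single coefficient $c$ the two populations share the slope $b$ and differ only in intercept. The function names and the data are made up for the example.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [A | b], copied so the inputs are not modified.
    M = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def fit_with_indicator(xs, ys, zs):
    """Least-squares fit of Y = a + b*X + c*Z via the normal equations."""
    rows = [(1.0, x, z) for x, z in zip(xs, zs)]  # design matrix D
    # Normal equations: (D^T D) beta = D^T y.
    dtd = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    dty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(3)]
    return solve_linear_system(dtd, dty)  # [a, b, c]


# Made-up data: both groups have slope 2; intercepts are 1 (Z=1) and 4 (Z=2).
xs = [0, 1, 2, 3, 0, 1, 2, 3]
zs = [1, 1, 1, 1, 2, 2, 2, 2]
ys = [1 + 2 * x if z == 1 else 4 + 2 * x for x, z in zip(xs, zs)]
a, b, c = fit_with_indicator(xs, ys, zs)
print(round(a, 6), round(b, 6), round(c, 6))
```

With this coding of $Z$, the fitted intercept of the first population is $a + c$ and that of the second is $a + 2c$. Populations with different slopes would need an extra term such as $dXZ$, which this sketch does not include.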
Antwan Perez

Beginner | 2022-11-02 | Added 6 answers

Step 1
See below for a better way, but how about a very naive iterative algorithm:
1. Use the dataset and estimate a line fit: $y = a + bX$.
2. Throw out the observation with the greatest distance to the estimated line, i.e., remove $\arg\max_i (y_i - a - b X_i)^2$ if $\max_i (y_i - a - b X_i)^2 > t \ge 0$, where $t$ is some threshold, and continue at (1.); otherwise stop.
Once you stop, the data that remain should lie close to some line; that would be population 1. Everything you threw out should be population 2. But I am fairly sure this will not always converge to a good classification. (That depends on $t$, but also on how noisy the data are.) You can check whether the classification was successful by looking at the fit for all the observations you threw out in the iterative procedure; if a line fits them well, it worked.
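The two steps above can be sketched in plain Python as follows. This is only an illustration of the naive procedure, not a robust implementation: the names `fit_line` and `trim_to_line`, the threshold value, and the data are all assumptions made for the example, and the data are noise-free with a small, well-separated second population, which is a favourable case.

```python
def fit_line(points):
    """Ordinary least-squares fit y = a + b*x; returns (a, b)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    b = sxy / sxx  # fails if all x values are equal (not handled here)
    return mean_y - b * mean_x, b


def trim_to_line(points, t):
    """Drop the worst-fitting point until every squared residual is <= t."""
    kept, removed = list(points), []
    while len(kept) > 2:
        a, b = fit_line(kept)
        sq_res = [(y - a - b * x) ** 2 for x, y in kept]
        worst = max(range(len(kept)), key=sq_res.__getitem__)
        if sq_res[worst] <= t:
            break  # all remaining points are within the threshold
        removed.append(kept.pop(worst))
    return kept, removed


# Made-up data: ten points on y = 1 + 2x and three points on y = 30 - x.
population_1 = [(x, 1 + 2 * x) for x in range(10)]
population_2 = [(x, 30 - x) for x in range(3)]
kept, removed = trim_to_line(population_1 + population_2, t=1.0)

print("kept:", kept, "fit:", fit_line(kept))
print("removed:", removed, "fit:", fit_line(removed))
```

Fitting a line to `removed` afterwards is the check suggested above: if those points also lie close to a line, the split is plausible.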
Step 2
Otherwise, there are clustering methods that find groups of observations that belong together in your data. In particular, you could look at k-means clustering (which, it seems, is a less naive way to do the above). If you choose such a method, find one where you can use your knowledge that the relationship within clusters is linear and that you have two clusters. It will improve classification considerably.
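One possible way to use the linearity within clusters is a k-means-style loop in which each cluster is represented by a fitted line instead of a centroid: refit one line per cluster, reassign every point to the line with the smallest squared residual, and repeat until the labels stop changing. The sketch below is an assumption about what such a method could look like, not a method named in this answer. The function names, the initial labelling, and the data are made up. Like k-means, the loop depends on its starting labels and can settle on a poor split, particularly when the lines cross inside the data range.

```python
def fit_line(points):
    """Ordinary least-squares fit y = a + b*x; returns (a, b)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    b = sxy / sxx  # fails if all x values are equal (not handled here)
    return mean_y - b * mean_x, b


def cluster_by_lines(points, labels, k=2, max_iter=100):
    """Alternate between fitting one line per cluster and relabelling points."""
    labels = list(labels)
    for _ in range(max_iter):
        lines = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if len(members) < 2:
                raise ValueError("a cluster collapsed; try other initial labels")
            lines.append(fit_line(members))
        # Assign each point to the line with the smallest squared residual.
        new_labels = [
            min(range(k), key=lambda j: (y - lines[j][0] - lines[j][1] * x) ** 2)
            for x, y in points
        ]
        if new_labels == labels:
            break  # converged: no point changed cluster
        labels = new_labels
    return labels, lines


# Made-up data: six points on y = x followed by six points on y = 10 + 2x.
points = [(x, x) for x in range(6)] + [(x, 10 + 2 * x) for x in range(6)]
# Rough initial guess with two deliberate mistakes: (4, 4) and (5, 5).
initial = [0, 0, 0, 0, 1, 1] + [1] * 6
labels, lines = cluster_by_lines(points, initial)
print(labels)
print(lines)
```

In practice one would run the loop from several different initial labellings and keep the result with the smallest total squared residual.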
