Identifying the k points in 2D geographic space which are 'most distant' from each other I have a

Santino Bautista

Santino Bautista

Answered question

2022-06-19

Identifying the k points in 2D geographic space which are 'most distant' from each other
I have a set of DNA samples from Y plants in a given geographic area. I'm going to be doing DNA sequencing on individuals in this population (and a number of other, separate populations), however due to financial constraints I'm unable to perform sequencing on all Y individuals. I've decided that I can afford to sequence k (out of Y) in each separate area.
I'd like to select the k samples which are farthest apart/most geographically distributed within the Y samples I collected. Samples that are taken from plants in close proximity to each other are more likely to be closely related, and I'd like to sequence what are ultimately the most genetically diverse samples for my later analysis.
So, the question is: given a set of Y points/samples in 2d geographic space (lat/long coordinates), how do I select the k points that are most geographically distributed/distant from each other?
As I've explored this a bit, I think the problem I'm really having is how to define 'distance' or 'most distributed'. Some of the metrics I've though of (e.g. maximum average distance between the k points) result in really unintuitive point selection in certain cases (for example, if two points are right next to each other but far away from another cluster of all the other points, the two points will be included even if they're almost on top of each other).
I have a feeling that this is not an uncommon problem, and there must be good answers out there. I'll add that I have access to a large cluster and I can brute force the problem to some extent.
I'd appreciate any and all advice or discussion; this is an interesting theoretical topic to me.

Answer & Explanation

Blaine Foster

Blaine Foster

Beginner2022-06-20Added 33 answers

I'm not sure how to find the best solution, but I would suggest using a genetic algorithm to get a decent approximation. Such an algorithm works like this:
Suppose we have a function f : P ( Y ) R which calcuates the 'fitness' of a particular subset of Y. I'll discuss possible fitness functions later. Then, let us choose n random samples of Y with size k which we will call the first generation of samples. The idea is that given the i th generation, we will generate the ( i + 1 ) th generation which on average, should have a better fitness level.
To do this, we first determine a way to produce the offspring of any two sets A and B. One way to do this is to simply randomly choose a k sized subset C A B. Furthermore, we will have that any offspring may additionally have a mutation at some rate α, which in this case would mean we would randomly swap one element of C with the one elemtent of the complement of C
To get a new generation, we simply produce n offspring, making it more likely that offspring will be produced between 'fitter' parents. For example we could just produce offspring between parents in the top quartile of the current generation. Continuing this process, we should eventually produce a very good approximation of the fittest subset of Y, but in polynomial time.
I'm not sure if I'm interpreting what you want correctly, but you could calculate f this way: given A Y, let f ( A ) = 1 a 1 a 2 for ( a 1 , a 2 ) A × A and a 1 a 2 , where a lower value of f ( A ) would correspond to a better fitness level.

Do you have a similar question?

Recalculate according to your conditions!

New Questions in High school statistics

Ask your question.
Get an expert answer.

Let our experts help you. Answer in as fast as 15 minutes.

Didn't find what you were looking for?