
Emmanuel Giles



2022-11-02

Using a "population" consisting of probabilities to predict accuracy of sample
Since I'm not sure the title explains my question well enough, I've come up with an example myself:
Let's say I live in a country where every citizen goes to work every day and has the choice of going by bus or by train (every citizen makes this choice anew each day; almost no citizens always go by train and never by bus, or vice versa).
I've done a lot of sampling and I have data on the behaviour of one million citizens over the past 1000 days. So I calculate, per citizen, the "probability" of going by train on a single day. I can also calculate the average of those probabilities over all citizens; let's say the average probability of a citizen going by train is 0.27. I found that most citizens have tendencies around this number (for example, most citizens have an individual probability of going by train between 0.22 and 0.32).
Now I start sampling an unknown person (known to live in the same country). After asking him on 10 consecutive days whether he went by train or by bus, I know that this person went to work by train 4 times and by bus 6 times.
My final question: how can I use my (accurate) data on one million citizens to approximate this person's probability of going by train?
I know that I can do the calculation the other way around: the probability of this event occurring, given that this person's REAL probability is 0.4, is $\binom{10}{4} \cdot 0.4^4 \cdot 0.6^6 \approx 25\%$. I could calculate this for every candidate probability between 0.00 and 1.00 in steps of 0.01 (so 0%, 1%, ..., 100%, without any numbers in between) and add the results up, which gives about 910%. I could then rescale this to 100% (dividing everything by 9.1, so our 25% becomes ~2.75%) and come up with a weighted sum, $2.75\% \cdot 0.4 + X\% \cdot 0.41 + \dots$, but this must be wrong since I'm not taking my accurate samples of the population into account.
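The arithmetic in the last paragraph can be checked with a short sketch. It assumes a grid of whole percentages and uses only the Python standard library; the function and variable names are illustrative. Rescaling the likelihoods this way amounts to treating every candidate probability as equally plausible beforehand, which is why the population data never enters the result.

```python
from math import comb

def binomial_likelihood(p, k=4, n=10):
    """Probability of k train days out of n if the true daily probability is p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Candidate probabilities 0.00, 0.01, ..., 1.00 (whole percentages only).
grid = [i / 100 for i in range(101)]
likelihoods = [binomial_likelihood(p) for p in grid]

total = sum(likelihoods)                      # about 9.09, i.e. roughly 910%
weights = [lik / total for lik in likelihoods]  # rescaled so they sum to 1
weighted_mean = sum(w * p for w, p in zip(weights, grid))

print(f"likelihood at p=0.40:      {binomial_likelihood(0.4):.4f}")
print(f"sum over the grid:         {total:.4f}")
print(f"rescaled weight at p=0.40: {weights[40]:.4f}")
print(f"weighted mean of p:        {weighted_mean:.4f}")
```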

Answer & Explanation

Savion Chaney


Beginner, 2022-11-03, added 14 answers

Step 1
These are two different probability distributions.
Using all the data for the country, you are calculating the probability that a citizen picked at random from all the citizens will take the train.
Using the data for one particular citizen, you end up with the probability that that citizen will take the train.
In both cases the maximum likelihood estimator for the probability of taking the train is
$\hat{p} = \frac{\sum_{i=1}^{N} x_i}{\sum_{i=1}^{N} n_i}$
where $N$ is the total number of samples, $x_i$ is the number of times the train was taken in sample $i$, and $n_i$ is the size of sample $i$.
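As a minimal sketch of this estimator, assuming each sample is stored as a pair (train days, observed days): the function name `pooled_mle` and the data below are hypothetical and only illustrate the formula.

```python
def pooled_mle(samples):
    """Maximum likelihood estimate of P(train) from (x_i, n_i) pairs.

    x_i = number of train days in sample i, n_i = number of observed days.
    """
    total_train = sum(x for x, _ in samples)
    total_days = sum(n for _, n in samples)
    return total_train / total_days

# Hypothetical data: three samples of different sizes.
samples = [(3, 10), (5, 20), (2, 10)]

print(pooled_mle(samples))      # estimate from all samples together
print(pooled_mle(samples[:1]))  # estimate from the first sample only
```

The same function covers both cases in the answer: pass every citizen's data for the country-wide estimate, or a single citizen's data for that citizen's estimate.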
Step 2
Let's say you have the data for Sarah and Bob, the only two citizens in the country. The data shows Sarah has taken the train 4 times out of 10 and Bob has taken the train 6 times out of 10.
Then
$\hat{p}_1 = P(\text{Sarah takes the train}) = 0.4$
$\hat{p}_2 = P(\text{Bob takes the train}) = 0.6$
but
$\hat{p} = P(\text{Citizen takes the train}) = P(\text{Picking Sarah}) \, P(\text{Sarah takes the train}) + P(\text{Picking Bob}) \, P(\text{Bob takes the train}) = 0.5 \times 0.4 + 0.5 \times 0.6 = 0.5$
This agrees with the estimator given earlier, $(4 + 6)/(10 + 10) = 0.5$; the two coincide here because Sarah and Bob have the same number of observed days.
Karley Castillo


Beginner, 2022-11-04, added 3 answers

Step 1
After some research I found my answer, which, looking back, was obvious: Bayesian probability. More specifically: posterior distributions.
This doesn't solve the matter completely, but it points in the right direction for anyone looking for the same answer in the future:
Step 2
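To calculate the probability that a citizen goes by train 25% of the time, given that we observed the citizen going by train 4 times out of 10, we can write:
$P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)}$
where $A$ is the event that the citizen goes by train 25% of the time and $B$ is the event that the citizen is observed to go by train 4 out of 10 times. $P(A)$ is the prior, which is where the population data comes in, and $P(B \mid A)$ is the binomial probability of the observation for a citizen whose probability is 25%.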
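One way to carry this out for every candidate probability at once is sketched below. It rests on assumptions that are not in the post: the population is summarised by a Beta distribution with mean 0.27 and a standard deviation of 0.05 (a guess at "most citizens between 0.22 and 0.32"), chosen because a Beta prior combines with binomial data in closed form. With the real data, the distribution of the one million per-citizen probabilities could serve as the prior instead. The helper name `beta_from_mean_sd` is illustrative.

```python
def beta_from_mean_sd(mean, sd):
    """Method-of-moments Beta(a, b) with the given mean and standard deviation."""
    concentration = mean * (1 - mean) / sd**2 - 1
    return mean * concentration, (1 - mean) * concentration

# Population summary from the question: mean 0.27.
# ASSUMPTION: standard deviation 0.05, not stated in the post.
a, b = beta_from_mean_sd(0.27, 0.05)

# Observation for the unknown person: 4 train days, 6 bus days.
train_days, bus_days = 4, 6

# Beta prior + binomial data gives a Beta posterior (conjugate update).
post_a, post_b = a + train_days, b + bus_days
posterior_mean = post_a / (post_a + post_b)

print(f"prior:     Beta({a:.2f}, {b:.2f}), mean {a / (a + b):.3f}")
print(f"posterior: Beta({post_a:.2f}, {post_b:.2f}), mean {posterior_mean:.3f}")
```

Under these assumptions the estimate lands between the population average of 0.27 and the observed 4/10, and it moves closer to the person's own frequency as more days are observed.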
