miklintisyt

Answered question

2022-10-30

Detecting corrupted data in birthdates of a population
I have a population of N birthdates. Let's assume that birthdates are uniformly distributed over the year.
I'm concerned that some of these records have been corrupted, for example by someone pasting over filtered rows in Excel, or by some other error.
I would like a test to identify those records in N that share a birthdate which is over-represented in the data, indicating that they may have false dates. Any record might have been corrupted with any date, but I'm assuming the nature of the corruption was to overwrite the dates on a bunch of records with a single (false) date.
If I count the number of records on each date, what is the count above which I should suspect that some of the dates on those records are false? Obviously random variation means that the counts of records will not be exactly N/365 for each date, but how much higher does a count need to be on any given date for me to be 95% confident that I'm not just seeing random variation?

Answer & Explanation

dwubiegrw

Beginner | 2022-10-31 | Added 13 answers

Step 1
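Without corruption, the number of recorded birthdays $X_i$ on each day $i$ is binomial, $X_i \sim B(N, p)$, with $p^{-1} = 365 = n$. For $b > p$, the union bound over the $n$ days followed by the Chernoff bound gives
$$P(\exists\, i : X_i \ge N b) \le n\, P(X_i \ge N b) \le n \exp\bigl(-N D(b \,\|\, p)\bigr),$$
so, for a significance level $\alpha = 0.05$, you want to pick $b$ so that
$$\exp\bigl(-N D(b \,\|\, p)\bigr) = \frac{\alpha}{n} \approx 1.37 \times 10^{-4}.$$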
Step 2
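Hence
$$D(b \,\|\, p) = b \log\frac{b}{p} + (1 - b) \log\frac{1 - b}{1 - p} = -\frac{\log(\alpha / n)}{N} = K \approx 2.22 \times 10^{-6}.$$
Write $b = p(1 + \epsilon)$ and expand the left-hand side in $\epsilon$ (and $p$) to get:
$$\epsilon \approx \sqrt{2 K n} = \sqrt{\frac{2 n \log(n / \alpha)}{N}} \approx 0.04.$$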
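The expansion can be spelled out as follows. This is a sketch that keeps only the leading terms, using the same symbols $b$, $p$, $\epsilon$, $K$ and $n$ as above. Substituting $b = p(1 + \epsilon)$:
$$D(b \,\|\, p) = p(1 + \epsilon)\log(1 + \epsilon) + (1 - p - p\epsilon)\log\Bigl(1 - \frac{p\epsilon}{1 - p}\Bigr).$$
Using $(1 + \epsilon)\log(1 + \epsilon) = \epsilon + \tfrac{1}{2}\epsilon^2 + O(\epsilon^3)$ for the first term and $\log(1 - x) = -x - \tfrac{1}{2}x^2 + O(x^3)$ with $x = p\epsilon/(1 - p)$ for the second:
$$D(b \,\|\, p) = \Bigl(p\epsilon + \frac{p\epsilon^2}{2}\Bigr) + \Bigl(-p\epsilon + \frac{p^2\epsilon^2}{2(1 - p)}\Bigr) + O(p\epsilon^3) = \frac{p\,\epsilon^2}{2(1 - p)} + O(p\epsilon^3) \approx \frac{\epsilon^2}{2n}.$$
Setting this equal to $K$ and solving for $\epsilon$ gives $\epsilon \approx \sqrt{2 K n}$.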
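Hence if the empirical frequency of any birthday exceeds its expected value by more than about 4%, that is, if the count on any date exceeds roughly $(N/365)(1 + \epsilon)$, the excess is unlikely to be random variation at $\alpha = 0.05$, and the records on that date should be treated as suspect.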
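To check the numbers, here is a minimal Python sketch of the threshold calculation described above. It solves $D(b \,\|\, p) = K$ by bisection and compares the result with the small-$\epsilon$ approximation. The function names (`kl_bernoulli`, `chernoff_threshold`, `suspicious_dates`) are hypothetical, and $N = 4{,}000{,}000$ in the usage note is an assumption back-solved from $K \approx 2.22 \times 10^{-6}$; the question as quoted does not state $N$.

```python
import math
from collections import Counter


def kl_bernoulli(b, p):
    """Relative entropy D(b || p) between Bernoulli(b) and Bernoulli(p), in nats."""
    return b * math.log(b / p) + (1 - b) * math.log((1 - b) / (1 - p))


def chernoff_threshold(N, n=365, alpha=0.05):
    """Per-date count threshold from the union + Chernoff bound.

    Returns K, the small-epsilon approximation sqrt(2*K*n), the numerically
    solved epsilon, and the count threshold N * b with b = p * (1 + epsilon).
    """
    p = 1.0 / n
    K = math.log(n / alpha) / N
    lo, hi = p, 1.0 - 1e-12
    if K >= kl_bernoulli(hi, p):
        # D(b || p) stays below log(1/p) on (p, 1), so for very small N
        # the bound cannot be pushed down to alpha / n.
        raise ValueError("N is too small for this bound to give a threshold")
    # D(b || p) is increasing in b for b > p, so bisection finds the root.
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if kl_bernoulli(mid, p) < K:
            lo = mid
        else:
            hi = mid
    b = hi
    return {
        "K": K,
        "eps_approx": math.sqrt(2 * K * n),
        "eps_exact": b / p - 1,
        "count_threshold": N * b,
    }


def suspicious_dates(dates, n=365, alpha=0.05):
    """Return {date: count} for dates whose count reaches the threshold.

    `dates` is any sequence of hashable date keys, e.g. (month, day) tuples.
    """
    counts = Counter(dates)
    threshold = chernoff_threshold(len(dates), n, alpha)["count_threshold"]
    return {d: c for d, c in counts.items() if c >= threshold}
```

Calling `chernoff_threshold(4_000_000)` should give an `eps_approx` close to the 0.04 quoted above, with `eps_exact` slightly larger because the approximation drops higher-order terms. `suspicious_dates` applies the same threshold to a list of date keys. Because both the union bound and the Chernoff bound are upper bounds, the resulting threshold is conservative.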
