
Ayanna Jarvis

2022-10-23

Normalizing sample data to unit variance
Suppose you have data points $X_1, \dots, X_n$ drawn from some distribution $P$ (regard them as i.i.d. random variables), and let the distribution be "well-behaved" so that the mean is $\mu$ and the variance is $\sigma^2$. Now suppose that we want to normalize our data, i.e., find a mapping $X_i \mapsto Y_i$ such that the $Y_i$ are i.i.d. draws from a distribution with mean $0$ and variance $1$. Suppose that we don't know the true values $\mu, \sigma^2$; it's still not too hard to make the mean zero by using the map
$$X_i \mapsto Y_i = X_i - \hat{\mu}, \qquad \hat{\mu} = \frac{1}{n}\sum_{i=1}^n X_i.$$
Indeed, $\hat{\mu}$ is the sample mean, and it's quite easy to check that
$$\mathbb{E}[Y_i] = 0.$$
Now, if I want to do the same for the variance, the natural method would be something like
$$X_i \mapsto Y_i = \frac{X_i - \hat{\mu}}{\hat{\sigma}}, \qquad \hat{\sigma}^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \hat{\mu})^2.$$
By convergence in distribution, we know that as $n \to \infty$, $Y_i \to (X_i - \mu)/\sigma$ in distribution, but the variance doesn't seem to be normalized for arbitrary $n$.
Question. Is it actually possible to normalize the variance of a sample data set without knowing the true mean and variance?
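As a quick numerical check of the map above (a sketch using NumPy; the normal test distribution, its parameters, and the seed are arbitrary choices, not part of the question):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20
x = rng.normal(loc=5.0, scale=2.0, size=n)  # pretend mu=5, sigma=2 are unknown

mu_hat = x.mean()
sigma_hat = x.std(ddof=1)        # the 1/(n-1) estimator from the question
y = (x - mu_hat) / sigma_hat

print(y.mean())                  # ~ 0 up to floating-point error
print(y.var(ddof=1))             # the 1/(n-1) sample variance of Y is exactly 1
print(y.var(ddof=0))             # the 1/n sample variance is (n-1)/n, not 1
```

So the map does pin the *sample* statistics (mean 0, and variance 1 for whichever denominator matches the one used in $\hat{\sigma}$), which is a different thing from the $Y_i$ being draws from a variance-1 distribution.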

Answer & Explanation

Jean Deleon

Answered 2022-10-24

Step 1
It's possible to "normalize" the data so that the sample mean is 0 and the sample variance is 1, but normalizing by the sample mean and variance breaks the independence of the resulting statistics: each post-map observation depends on the sample statistics, which in turn depend on all the other observations. This interdependence also causes the sample variance, if calculated as $\frac{1}{n}\sum_{i=1}^n (X_i - \hat{\mu})^2$, to be smaller on average than the population variance. This is why the formula $\frac{1}{n-1}\sum_{i=1}^n (X_i - \hat{\mu})^2$ is generally used instead.
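The downward bias of the $1/n$ estimator is easy to see by simulation (a sketch; the sample size, true variance, and trial count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials = 5, 200_000
sigma2 = 4.0                                  # true population variance

x = rng.normal(0.0, np.sqrt(sigma2), size=(trials, n))
var_n = x.var(axis=1, ddof=0)                 # 1/n denominator
var_n1 = x.var(axis=1, ddof=1)                # 1/(n-1) denominator

print(var_n.mean())    # ~ sigma2 * (n-1)/n = 3.2 -- biased low
print(var_n1.mean())   # ~ sigma2 = 4.0 -- unbiased
```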
Step 2
So we either use the formula with $n$ in the denominator, in which case our scale estimate is biased and doesn't match what we would get from the base distribution, or we use $n-1$, in which case the $\frac{1}{n}$-sample variance of the normalized data is not going to be 1. And neither choice of denominator addresses the interdependence issue (this issue is also why we use a $t$-distribution rather than a normal one for hypothesis testing when the base distribution is normal).
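The denominator trade-off can be shown directly (a sketch; the exponential test distribution and seed are arbitrary choices, and in NumPy the two denominators correspond to `ddof=0` and `ddof=1`):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10
x = rng.exponential(scale=3.0, size=n)

# Standardize with the 1/n estimator: the 1/n sample variance of Y
# is exactly 1, but that estimator is biased for the population variance.
y0 = (x - x.mean()) / x.std(ddof=0)
print(y0.var(ddof=0))              # ~ 1.0

# Standardize with the 1/(n-1) estimator: the scale estimate is unbiased,
# but the 1/n sample variance of Y is (n-1)/n, not 1.
y1 = (x - x.mean()) / x.std(ddof=1)
print(y1.var(ddof=0))              # ~ (n-1)/n = 0.9
```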
