vittorecostao1

2022-06-21

I'm reading The Elements of Statistical Learning. I have a question about the curse of dimensionality.

In section 2.5, p.22:

Consider N data points uniformly distributed in a p-dimensional unit ball centered at the origin. Suppose we consider a nearest-neighbor estimate at the origin. The median distance from the origin to the closest data point is given by the expression:

$d(p,N)={(1-\frac{1}{{2}^{1/N}})}^{1/p}.$

For N=500, p=10, $d(p,N)\approx 0.52$, more than halfway to the boundary. Hence most data points are closer to the boundary of the sample space than to any other data point.

I accept the equation. My question is: how do we deduce this conclusion?
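For reference, the quoted value is easy to check numerically. A quick sketch of my own (not from the book), just plugging N=500, p=10 into the formula:

```python
# Evaluate the book's median-distance formula d(p, N) = (1 - (1/2)**(1/N))**(1/p).
def d(p, N):
    return (1 - 0.5 ** (1 / N)) ** (1 / p)

print(round(d(10, 500), 2))  # 0.52, matching the book
```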

jarakapak7

Beginner · 2022-06-22 · Added 14 answers

This is exercise 2.3, which they mention there.

PDF means Probability Density Function; CDF means Cumulative Distribution Function. As we are considering continuous distributions, the former is the derivative of the latter.

The volume of a ball of radius $r$ in ${\mathbb{R}}^{p}$ is ${\omega}_{p}{r}^{p}$, where ${\omega}_{p}$ is a constant depending only on $p$, given in shorthand by

${\omega}_{p}=\frac{{\pi}^{p/2}}{(p/2)!}.$

As a result, the probability that a point drawn uniformly in the unit ball lies within distance x of the origin is the volume of the ball of radius x divided by the volume of the unit ball. The factors of ${\omega}_{p}$ cancel, so the CDF is

$F(x)={x}^{p},\quad 0\le x\le 1.$
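This CDF can be checked empirically. A sketch of my own (not from the book): sample points uniformly in the unit ball for p = 3 by rejection from the enclosing cube, and compare the fraction within distance x of the origin against the claimed x^p.

```python
import random

random.seed(0)
p, n_samples, x = 3, 200_000, 0.7
inside = 0    # points accepted into the unit ball
within_x = 0  # accepted points within distance x of the origin
while inside < n_samples:
    pt = [random.uniform(-1, 1) for _ in range(p)]
    r2 = sum(c * c for c in pt)
    if r2 <= 1:              # point landed inside the unit ball
        inside += 1
        if r2 <= x * x:      # within distance x of the origin
            within_x += 1

print(within_x / inside)     # should be close to x**p = 0.343
```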

The corresponding PDF is the derivative,

$f(x)=p{x}^{p-1},\quad 0\le x\le 1.$

From page 150, section 4.6 of Introduction to Mathematical Statistics by Hogg and Craig, the marginal (individual) PDF of ${y}_{1}$, the smallest order statistic (the minimum) of $n$ points with CDF $F$ and PDF $f$, is

$g(y)=n{(1-F(y))}^{n-1}f(y).$

In our case that gives

$g(y)=n{(1-{y}^{p})}^{n-1}p{y}^{p-1},$

which can be readily integrated to give the CDF

$G(y)=1-{(1-{y}^{p})}^{n}.$
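This distribution of the minimum is also easy to sanity-check by simulation, a sketch of my own: if $F(y)={y}^{p}$, then $Y={U}^{1/p}$ with $U$ uniform on [0, 1] has that CDF, so we can simulate the minimum of n such draws and compare the empirical $P(\min\le {y}_{0})$ with $G({y}_{0})$.

```python
import random

random.seed(1)
p, n, trials, y0 = 10, 500, 5_000, 0.5
hits = 0
for _ in range(trials):
    # Minimum of n draws, each with CDF y**p via inverse-transform sampling.
    m = min(random.random() ** (1 / p) for _ in range(n))
    if m <= y0:
        hits += 1

g_exact = 1 - (1 - y0 ** p) ** n
print(hits / trials, g_exact)  # empirical vs. exact G(0.5); they should agree closely
```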

The mean, or expected value, of y is a messy integral. The median, by contrast, is simply the value of the (continuous) random variable y such that G(y)=1/2: if you repeated this experiment many times, the probability of getting a minimum smaller than the median is 50 percent, and the probability of getting one larger is also 50 percent. For the traditional bell curve, the median and the mean are likely to be pretty close. In this case, a polynomial density restricted to a short interval, I am not sure the median and mean are necessarily close to each other.

I do not see how you can read this book without a full semester calculus-based course in mathematical statistics.

Solve G(y)=1/2: from $1-{(1-{y}^{p})}^{n}=\frac{1}{2}$ we get ${(1-{y}^{p})}^{n}=\frac{1}{2}$, so ${y}^{p}=1-{2}^{-1/n}$ and $y={(1-{2}^{-1/n})}^{1/p}$, which is their expression (with $n=N$).
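Putting it all together, a final sketch of my own: the empirical median of the simulated minimum should match the closed form ${(1-{2}^{-1/n})}^{1/p}$.

```python
import random

random.seed(2)
p, n, trials = 10, 500, 2_001
# One trial = minimum of n draws with CDF y**p (inverse-transform sampling).
mins = sorted(min(random.random() ** (1 / p) for _ in range(n)) for _ in range(trials))
empirical_median = mins[trials // 2]
closed_form = (1 - 2 ** (-1 / n)) ** (1 / p)
print(empirical_median, closed_form)  # both about 0.52
```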
