Log predictive density asymptotically in predictive information criteria for Bayesian models

Gabriella Sellers 2022-06-13 Answered
Log predictive density asymptotically in predictive information criteria for Bayesian models
I am reading Andrew Gelman's paper Understanding predictive information criteria for Bayesian models, and I quote the relevant passage below:
Under standard conditions, the posterior distribution, $p(\theta \mid y)$, approaches a normal distribution in the limit of increasing sample size (see, e.g., DeGroot, 1970). In this asymptotic limit, the posterior is dominated by the likelihood (the prior contributes only one factor, while the likelihood contributes $n$ factors, one for each data point), and so the likelihood function also approaches the same normal distribution.
As sample size $n \to \infty$, we can label the limiting posterior distribution as $\theta \mid y \to \mathrm{N}(\theta_0, V_0/n)$. In this limit, the log predictive density is
$$\log p(y \mid \theta) = c(y) - \frac{1}{2}\left( k \log(2\pi) + \log\left|V_0/n\right| + (\theta - \theta_0)^T (V_0/n)^{-1} (\theta - \theta_0) \right),$$
where $c(y)$ is a constant that only depends on the data $y$ and the model class but not on the parameters $\theta$.
The limiting multivariate normal distribution for $\theta$ induces a posterior distribution for the log predictive density that ends up being a constant (equal to $c(y) - \frac{1}{2}\left(k \log(2\pi) + \log|V_0/n|\right)$) minus $\frac{1}{2}$ times a $\chi^2_k$ random variable, where $k$ is the dimension of $\theta$, that is, the number of parameters in the model. The maximum of this distribution of the log predictive density is attained when $\theta$ equals the maximum likelihood estimate (of course), and its posterior mean is at a value $\frac{k}{2}$ lower.
For actual posterior distributions, this asymptotic result is only an approximation, but it will be useful as a benchmark for interpreting the log predictive density as a measure of fit.
With singular models (e.g. mixture models and overparameterized complex models more generally) a set of different parameters can map to a single data model, the Fisher information matrix is not positive definite, plug-in estimates are not representative of the posterior, and the distribution of the deviance does not converge to a $\chi^2$ distribution. The asymptotic behavior of such models can be analyzed using singular learning theory (Watanabe, 2009, 2010).
Sorry for the long paragraph. The things that confuse me are:
1. Why does it seem like we know the posterior distribution $p(\theta \mid y)$ first, and then use it to find $\log p(y \mid \theta)$? Shouldn't we get the model, $\log p(y \mid \theta)$, first?
2. What does the highlighted line "its posterior mean is at a value $\frac{k}{2}$ lower" mean? My understanding is that since there is a term $\frac{1}{2}\chi^2_k$ in the expression, and the expectation of $\chi^2_k$ is $k$, the mean ends up $\frac{k}{2}$ lower. But $\frac{k}{2}$ lower than what?
3. How does $\log p(y \mid \theta)$ serve as a measure of fit? I can see that there is a mean squared error (MSE) term in this expression, but it is an MSE of the parameter $\theta$, not of the data $y$.
Thanks for any help!

Answers (1)

Jayce Bates
Answered 2022-06-14 Author has 18 answers
1. When we look at the posterior distribution, we are concerned with two contributing factors: the prior and the likelihood. Since we are looking at the asymptotic limit $n \to \infty$, we know that the influence of the prior is negligible. We can model the limiting $\theta \mid y$ as $\mathrm{N}(\theta_0, V_0/n)$ to ascertain the behavior of the posterior from the likelihood. In the excerpt you have provided, this is merely a heuristic for measuring the fit of your model, because the log predictive density is approximately inferred from the likelihood. So to answer your question: we kind of are.
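As a sketch of where the $\chi^2_k$ comes from (this step is not spelled out in the excerpt, but it follows from standard properties of the multivariate normal): if $\theta \mid y \sim \mathrm{N}(\theta_0, V_0/n)$, standardizing the draw turns the quadratic form into a sum of $k$ squared standard normals,
$$ z = (V_0/n)^{-1/2}(\theta - \theta_0) \sim \mathrm{N}(0, I_k) \quad\Longrightarrow\quad (\theta - \theta_0)^T (V_0/n)^{-1} (\theta - \theta_0) = z^T z \sim \chi^2_k, $$
so that, plugging into the expression above,
$$ \log p(y \mid \theta) = c(y) - \tfrac{1}{2}\bigl(k \log(2\pi) + \log|V_0/n|\bigr) - \tfrac{1}{2}\chi^2_k. $$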
2. The author is saying that the log predictive density, whose posterior distribution is $c(y) - \frac{1}{2}\left(k \log(2\pi) + \log|V_0/n|\right) - \frac{1}{2}\chi^2_k$, is maximized when $\theta$ equals the maximum likelihood estimate. So we differentiate the log predictive density and set the derivative equal to zero in order to solve for the value of $\theta$ that maximizes it. The posterior distribution of the log predictive density then has a mean that is $\frac{k}{2}$ less than this maximum possible value, i.e. $\frac{k}{2}$ less than the value attained at the maximum likelihood estimate. We expect this because the expectation of
$$\left[c(y) - \tfrac{1}{2}\bigl(k \log(2\pi) + \log|V_0/n|\bigr)\right] - \left[c(y) - \tfrac{1}{2}\bigl(k \log(2\pi) + \log|V_0/n|\bigr) - \tfrac{1}{2}\chi^2_k\right] = \tfrac{1}{2}\chi^2_k$$
is $\frac{k}{2}$.
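Here is a minimal numerical check of that $\frac{k}{2}$ gap, written as a Python sketch; the values of $k$, $n$, $\theta_0$, and $V_0$ are made up purely for illustration (they are not from the paper), and $c(y)$ is set to 0 since it does not affect the gap.

import numpy as np

rng = np.random.default_rng(0)

# Made-up illustrative values (not from the paper): k parameters,
# limiting posterior N(theta_0, V_0/n), and c(y) taken to be 0.
k, n = 5, 200
theta_0 = np.zeros(k)
V0_over_n = np.eye(k) / n
precision = np.linalg.inv(V0_over_n)

def log_pred_density(theta):
    # log p(y | theta) under the asymptotic normal approximation, with c(y) = 0
    quad = (theta - theta_0) @ precision @ (theta - theta_0)
    return -0.5 * (k * np.log(2 * np.pi) + np.log(np.linalg.det(V0_over_n)) + quad)

# Draw theta from the limiting posterior and evaluate the log predictive density.
draws = rng.multivariate_normal(theta_0, V0_over_n, size=50_000)
lpd = np.array([log_pred_density(th) for th in draws])

max_lpd = log_pred_density(theta_0)     # maximum, attained at theta = theta_0 (the MLE)
print(max_lpd - lpd.mean())             # ~ k/2 = 2.5: the posterior mean sits k/2 below the maximum
print((2 * (max_lpd - lpd)).mean())     # ~ k = 5: 2*(max - lpd) behaves like a chi^2_k variable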
3. I hope that the answer to this question is made clearer by the previous two answers. The log predictive density is larger when the model assigns higher probability to the observed data, so a higher value indicates a better fit; the quadratic term in $\theta$ just describes how much you lose, relative to the maximum likelihood fit, by evaluating the density at a $\theta$ away from $\theta_0$. Ultimately, the idea is that these approximations will work well if the model is a good fit.
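As a toy illustration of "higher log predictive density = better fit" (again made up for this answer, not taken from the paper): evaluate the same simulated data under a normal model fitted to it and under a misspecified normal model, and compare the two log densities.

import numpy as np

rng = np.random.default_rng(1)

# Made-up toy data: 100 observations from N(2, 1).
y = rng.normal(loc=2.0, scale=1.0, size=100)

def normal_logpdf(x, mu, sigma):
    # log density of N(mu, sigma^2) evaluated at each point of x
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)

# Log predictive density of the data under each model (summed over observations).
lpd_fitted = normal_logpdf(y, y.mean(), y.std()).sum()   # normal model fitted to the data
lpd_wrong = normal_logpdf(y, 0.0, 1.0).sum()             # misspecified model with the wrong location

print(lpd_fitted, lpd_wrong)  # the better-fitting model gets the larger log predictive density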
Alright, I did my best. I hope that this helps a little bit.