
myntfalskj4 2022-07-04 Answered
Reparameterization of hyperprior distribution
I'm reading *Bayesian Data Analysis* by Gelman et al., and I'm having a lot of trouble interpreting the following part of the book (note: the rat tumor rate $\theta$ in the text below has $\theta \sim \mathrm{Beta}(\alpha, \beta)$):
Choosing a standard parameterization and setting up a 'noninformative' hyperprior distribution. Because we have no immediately available information about the distribution of tumor rates in populations of rats, we seek a relatively diffuse hyperprior distribution for $(\alpha, \beta)$. Before assigning a hyperprior distribution, we reparameterize in terms of $\operatorname{logit}\!\left(\frac{\alpha}{\alpha+\beta}\right) = \log(\alpha/\beta)$ and $\log(\alpha+\beta)$, which are the logit of the mean and the logarithm of the 'sample size' in the beta population distribution for $\theta$. It would seem reasonable to assign independent hyperprior distributions to the prior mean and 'sample size,' and we use the logistic and logarithmic transformations to put each on a $(-\infty, \infty)$ scale. Unfortunately, a uniform prior density on these newly transformed parameters yields an improper posterior density, with an infinite integral in the limit $(\alpha+\beta) \to \infty$, and so this particular prior density cannot be used here.
In a problem such as this with a reasonably large amount of data, it is possible to set up a 'noninformative' hyperprior density that is dominated by the likelihood and yields a proper posterior distribution. One reasonable choice of diffuse hyperprior density is uniform on $\left(\frac{\alpha}{\alpha+\beta},\ (\alpha+\beta)^{-1/2}\right)$, which when multiplied by the appropriate Jacobian yields the following density on the original scale,
$$p(\alpha, \beta) \propto (\alpha+\beta)^{-5/2},$$
and on the natural transformed scale:
$$p\left(\log(\alpha/\beta),\ \log(\alpha+\beta)\right) \propto \alpha\beta\,(\alpha+\beta)^{-5/2}.$$
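As a side note, the identity $\operatorname{logit}\!\left(\frac{\alpha}{\alpha+\beta}\right) = \log(\alpha/\beta)$ used in the quote is easy to verify symbolically. Here is a quick sympy sketch of my own (not from the book):

```python
# Sanity check (my own, not from the book): the logit of the beta mean
# alpha/(alpha+beta) equals log(alpha/beta).
import sympy as sp

alpha, beta = sp.symbols('alpha beta', positive=True)

m = alpha / (alpha + beta)        # mean of Beta(alpha, beta)
ratio = sp.cancel(m / (1 - m))    # argument of the logit: m / (1 - m)

# ratio simplifies to alpha/beta, hence logit(m) = log(alpha/beta)
print(ratio)
```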
My problem is especially with the bolded parts of the text.
Question (1): What exactly does the author mean by "is uniform on $\left(\frac{\alpha}{\alpha+\beta},\ (\alpha+\beta)^{-1/2}\right)$"?
Question (2): What is the appropriate Jacobian?
Question (3): How does the author arrive at the original-scale and transformed-scale priors?
To me the book hides many details under the hood, and the seemingly ambiguous text makes it difficult for a beginner to follow.
P.S. if you need more information, or me to clarify my questions please let me know.

Answers (1)

furniranizq
Answered 2022-07-05 Author has 20 answers
I figured the solution out myself, so I'm going to share it here in case anyone else bumps into the same part of Gelman's book (pages 110-111).
Answer (1):
$$p\left(\frac{\alpha}{\alpha+\beta},\ (\alpha+\beta)^{-1/2}\right) = \text{constant} \propto 1.$$
Answer (2):
When the author talks about the "appropriate Jacobian," he means the absolute value of the determinant of the Jacobian matrix in the change-of-variables formula for density functions:
$$p(\phi) = p(\theta)\left|\det\left(\frac{d\theta}{d\phi}\right)\right|$$
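To make the formula concrete, here is a toy one-dimensional illustration of my own (not from the book): take $\theta \sim \mathrm{Exponential}(1)$ and $\phi = \log\theta$, and read off the density of $\phi$ directly from the formula.

```python
# Toy 1-D example of the change-of-variables formula p(phi) = p(theta)|d theta/d phi|
# (my own illustration, not from the book): theta ~ Exponential(1), phi = log(theta).
import sympy as sp

theta, phi = sp.symbols('theta phi', real=True)

p_theta = sp.exp(-theta)              # density of theta on (0, oo)
theta_of_phi = sp.exp(phi)            # inverse transform: theta = e^phi
jac = sp.diff(theta_of_phi, phi)      # d theta / d phi = e^phi (> 0, so |.| is free)

# density of phi: e^{phi - e^phi}, a proper density on the whole real line
p_phi = p_theta.subs(theta, theta_of_phi) * jac
print(sp.simplify(p_phi))
```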
Answer (3):
The author simply applies the change of variables formula two times. We know that
$$p(\gamma, \delta) = p\left(\gamma(\alpha, \beta),\ \delta(\alpha, \beta)\right) = p\left(\frac{\alpha}{\alpha+\beta},\ (\alpha+\beta)^{-1/2}\right) = \text{constant} \propto 1.$$
If we denote $\theta = (\gamma, \delta)$ and $\phi = (\alpha, \beta)$, with $\gamma(\alpha,\beta) = \frac{\alpha}{\alpha+\beta}$ and $\delta(\alpha,\beta) = (\alpha+\beta)^{-1/2}$, then:
$$\det\left(\frac{d\theta}{d\phi}\right) = \begin{vmatrix} \frac{\partial\gamma}{\partial\alpha} & \frac{\partial\gamma}{\partial\beta} \\ \frac{\partial\delta}{\partial\alpha} & \frac{\partial\delta}{\partial\beta} \end{vmatrix} = \begin{vmatrix} \frac{\beta}{(\alpha+\beta)^2} & -\frac{\alpha}{(\alpha+\beta)^2} \\ -\frac{1}{2}(\alpha+\beta)^{-3/2} & -\frac{1}{2}(\alpha+\beta)^{-3/2} \end{vmatrix} = -\frac{1}{2}(\alpha+\beta)^{-5/2},$$
so that $\left|\det\left(\frac{d\theta}{d\phi}\right)\right| = \frac{1}{2}(\alpha+\beta)^{-5/2}$.
From change of variables formula we get:
$$p(\alpha, \beta) = p\left(\frac{\alpha}{\alpha+\beta},\ (\alpha+\beta)^{-1/2}\right)\left|\det\left(\frac{d\theta}{d\phi}\right)\right| = \text{constant}\cdot\tfrac{1}{2}(\alpha+\beta)^{-5/2} \propto (\alpha+\beta)^{-5/2},$$
and there it is: the prior on the original scale.
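If you want to double-check that determinant without grinding through the algebra by hand, a quick symbolic computation (my own check, using sympy) reproduces it:

```python
# Symbolic check (my own) of the first Jacobian determinant with sympy.
import sympy as sp

alpha, beta = sp.symbols('alpha beta', positive=True)

gamma = alpha / (alpha + beta)                 # prior mean
delta = (alpha + beta) ** sp.Rational(-1, 2)   # (alpha + beta)^(-1/2)

J = sp.Matrix([[sp.diff(gamma, alpha), sp.diff(gamma, beta)],
               [sp.diff(delta, alpha), sp.diff(delta, beta)]])

# signed determinant is -(1/2)(alpha+beta)^(-5/2); its absolute value
# is the Jacobian factor used in the change of variables
det = sp.simplify(J.det())
print(det)
```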
For the alternative scale, by using change of variables in exactly the same manner:
$$p(\alpha, \beta) = p\left(\log(\alpha/\beta),\ \log(\alpha+\beta)\right)\left|\det\left(\frac{d\theta}{d\phi}\right)\right|,$$
where this time $\gamma(\alpha, \beta) = \log(\alpha/\beta)$ and $\delta(\alpha, \beta) = \log(\alpha+\beta)$. For the Jacobian determinant we get:
$$\det\left(\frac{d\theta}{d\phi}\right) = \begin{vmatrix} \frac{\partial\gamma}{\partial\alpha} & \frac{\partial\gamma}{\partial\beta} \\ \frac{\partial\delta}{\partial\alpha} & \frac{\partial\delta}{\partial\beta} \end{vmatrix} = \begin{vmatrix} \frac{1}{\alpha} & -\frac{1}{\beta} \\ \frac{1}{\alpha+\beta} & \frac{1}{\alpha+\beta} \end{vmatrix} = \frac{1}{\alpha\beta},$$
so we get:
$$p(\alpha, \beta) \propto (\alpha+\beta)^{-5/2} = p\left(\log(\alpha/\beta),\ \log(\alpha+\beta)\right)\cdot\frac{1}{\alpha\beta},$$
or
$$p\left(\log(\alpha/\beta),\ \log(\alpha+\beta)\right) \propto \alpha\beta\,(\alpha+\beta)^{-5/2}.$$
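This second Jacobian determinant, and the transformed prior that follows from it, can be double-checked symbolically in the same way (again my own sympy check, not part of the book):

```python
# Symbolic check (my own) of the second Jacobian determinant with sympy.
import sympy as sp

alpha, beta = sp.symbols('alpha beta', positive=True)

gamma = sp.log(alpha / beta)
delta = sp.log(alpha + beta)

J = sp.Matrix([[sp.diff(gamma, alpha), sp.diff(gamma, beta)],
               [sp.diff(delta, alpha), sp.diff(delta, beta)]])

det = sp.simplify(J.det())   # equals 1/(alpha*beta)

# transformed prior: p(alpha, beta) / |det|, proportional to
# alpha * beta * (alpha + beta)^(-5/2)
p_transformed = sp.simplify((alpha + beta) ** sp.Rational(-5, 2) / det)
print(det, p_transformed)
```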


