Data come from a normally distributed population with standard deviation 2.6. What sample size is needed to ensure that, with 99% probability, the sample mean will be in error by at most 0.25?

Clara Dennis
2022-11-19

dilettato5t1

Answered 2022-11-20

Step 1

It seems you are asking about the margin of error of a confidence interval for the mean. The general formula is

$\overline{X}\pm {z}^{\ast}\cdot \frac{\sigma}{\sqrt{n}}$

where $\overline{X}$ is the sample mean, $\sigma $ is the population standard deviation, ${z}^{\ast}$ is the critical value from the standard normal distribution (which depends on the confidence level), and n is the sample size.

We want the term on the right, the margin of error (denoted ME below), to be less than or equal to a fixed size. Thus we need

${z}^{\ast}\cdot \frac{\sigma}{\sqrt{n}}\le ME$

Rearranging to solve for n gives

$n\ge {\left(\frac{{z}^{\ast}\sigma}{ME}\right)}^{2}$

The critical value for a 99% confidence interval is ${z}^{\ast}=2.5758$ (the 99.5th percentile of a Standard Normal distribution to 4 decimal places). Then using your values of $\sigma =2.6$ and $ME=0.25$ gives

$n\ge {\left(\frac{(2.5758)(2.6)}{0.25}\right)}^{2}\approx 717.6$

Thus $n=718$ suffices. The way you stated the problem suggested that 2.6 is the population standard deviation. If it is actually the sample standard deviation, the formula is slightly different: in that case the confidence interval is given by

$\overline{X}\pm {t}_{df}^{\ast}\cdot \frac{s}{\sqrt{n}}$

where the degrees of freedom are $df=n-1$. The main issue here is that the critical t value depends not only on the confidence level but also on the sample size. For a given confidence level, ${t}_{df}^{\ast}>{z}^{\ast}$, so the value ${z}^{\ast}=2.5758$ used above slightly underestimates the required sample size. Using $n=718$ as a first estimate gives $df=718-1=717$, and for a 99% confidence interval ${t}_{717}^{\ast}=2.5827$, which yields $n\ge 722$; a further iteration does not change the result.

Step 2

The other issue is that s itself depends on the sample size. A larger sample should simply give a better estimate of $\sigma $, the population standard deviation, but it can still affect the sample size needed to achieve the desired ME. Except in the case of very small samples, neither of these sources of inaccuracy will be significant. Use the sample size that produced s to determine the critical t value; the resulting estimate of the required sample size should then be at least large enough, since ${t}_{df}^{\ast}$ decreases as n (and thus df) increases.
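As a sanity check, the computation above can be scripted (a sketch, assuming `scipy` is available for the normal and t quantiles; `sigma` and `me` are the question's values):

```python
# Required n for ME <= 0.25 at 99% confidence: first with the normal critical
# value, then iterating with Student's t as if 2.6 were a sample SD.
import math
from scipy import stats

sigma, me, conf = 2.6, 0.25, 0.99

z = stats.norm.ppf(1 - (1 - conf) / 2)   # 99.5th percentile, ~2.5758
n_z = math.ceil((z * sigma / me) ** 2)   # 718

n = n_z
while True:
    t = stats.t.ppf(1 - (1 - conf) / 2, df=n - 1)
    n_new = math.ceil((t * sigma / me) ** 2)
    if n_new == n:
        break
    n = n_new                            # converges to 722

print(n_z, n)
```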

asked 2022-10-20

Discrete version of continuous SIR model

I'm working with a SIR infection model, which is

$$\begin{array}{rcl}\frac{dS}{dt}& =& -\beta IS\\ \frac{dI}{dt}& =& \beta IS-\gamma I\\ \frac{dR}{dt}& =& \gamma I\end{array}$$

in continuous time, where S, I, and R are the proportion of Susceptible, Infected, and Recovered, respectively.

However, since I am working with fixed-width discrete-time data, I think it would be more appropriate to modify the equations accordingly. I know this is incorrect (based on getting negative values for $\hat{\gamma}$ and $\hat{\beta}$, neither of which should be negative):

$$\begin{array}{rcl}{S}_{t+1}& =& -\beta {I}_{t}{S}_{t}\\ {I}_{t+1}& =& \beta {I}_{t}{S}_{t}-\gamma {I}_{t}\\ {R}_{t+1}& =& \gamma {I}_{t}\end{array}$$

Ultimately, I would like to estimate $\beta $ and $\gamma $ by doing regression on

$${I}_{t+1}=\beta ({I}_{t}{S}_{t})+\gamma (-{I}_{t})+{U}_{t},\text{ where }{U}_{t}\sim N(0,{\sigma}^{2})$$

or whatever the discrete version is.
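A hedged sketch of one standard discretization, forward Euler with step $\Delta t$: each update adds the change to the current value, whereas the recursions above replace it (the missing $S_t$, $I_t$, $R_t$ carry-over terms are one plausible source of the sign problems). Parameter values below are purely illustrative:

```python
# Forward-Euler discretization of the SIR equations (illustrative parameters).
beta, gamma, dt = 0.5, 0.1, 0.1   # assumed values, not from the question
S, I, R = 0.99, 0.01, 0.0

for _ in range(1000):
    dS = -beta * I * S * dt
    dI = (beta * I * S - gamma * I) * dt
    dR = gamma * I * dt
    S, I, R = S + dS, I + dI, R + dR   # current value + change, not replacement

# The increments cancel, so S + I + R stays 1 up to floating-point error.
print(S, I, R)
```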

asked 2022-11-06

I haven't used Bayes' theorem much before so any help would be greatly appreciated.

Suppose you are given the following data: 5% of the population have heart disease.

If you have heart disease, the probability that you have high blood pressure is 90%

If you do not have heart disease, the probability that you have high blood pressure is 15%

What is the probability that a person chosen at random from the population has high blood pressure?

$$P(B)=P(B|H)P(H)+P(B|{H}^{\prime})P({H}^{\prime})$$

$$P(B)=(.9)(.05)+(.15)(.95)=.1875$$

Using Bayes Theorem calculate the probability that the person has heart disease, if they have high blood pressure.

$$P(H|B)=\frac{P(B|H)P(H)}{P(B)}$$

$$(.9)(.05)/.1875=.24$$

Using Bayes Theorem calculate the probability that the person has heart disease, if they do not have high blood pressure.

$$P(H|{B}^{\prime})=\frac{P({B}^{\prime}|H)P({B}^{\prime})}{P(H)}$$

When I substitute into this part, I get an invalid answer.
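Note that Bayes' theorem for the last part reads $P(H|{B}^{\prime})=\frac{P({B}^{\prime}|H)P(H)}{P({B}^{\prime})}$; in the expression as written, $P(H)$ and $P({B}^{\prime})$ appear swapped, which would explain the invalid answer. A quick numeric check:

```python
# Checking the arithmetic above, with Bayes' theorem for the last part written
# as P(H|B') = P(B'|H) P(H) / P(B').
pH = 0.05       # P(heart disease)
pB_H = 0.90     # P(high BP | disease)
pB_noH = 0.15   # P(high BP | no disease)

pB = pB_H * pH + pB_noH * (1 - pH)    # total probability: 0.1875
pH_B = pB_H * pH / pB                 # 0.24

# P(B'|H) = 1 - P(B|H) and P(B') = 1 - P(B).
pH_noB = (1 - pB_H) * pH / (1 - pB)   # ~0.00615
print(pB, pH_B, pH_noB)
```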

asked 2022-10-28

Modelling wealth with a Pareto distribution: how do I estimate the parameters?

I wish to create a function that will estimate the wealth of a person in the United States. It would be used to make a table with each decile and their estimated wealth.

This estimate will be based on very rudimentary data, and is only for personal interest. The data is:

- The total wealth of the bottom 90% is equal to the total wealth of the top 0.1%.

- Both proportions have 22% of the total wealth.

- The total wealth under the distribution is $80 trillion.

- The total population is 160 million households.

Given this data, how would I create parameter estimates for the exponent and scale of a Pareto distribution? What would be f(x), where x is from (0,1) and the result is the wealth of someone richer than that proportion of people? For example, f(0.1) is someone richer than or equal to exactly 10% of the least wealthy and could equal 1,000 dollars; f(0.5) is the median wealth and could be 200,000 dollars; f(0.9999) is richer than 99.99% and would be somewhere in the tens or hundreds of millions of dollars.
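As a hedged sketch under strong assumptions: for a pure Pareto distribution with tail index $\alpha >1$, the richest fraction q holds share ${q}^{1-1/\alpha}$ of total wealth, so the top-0.1% condition alone determines $\alpha $, and the scale then follows from total wealth per household. (The bottom-90% condition would imply a different $\alpha $, so a single Pareto cannot match both figures exactly.)

```python
# Fit a pure Pareto using only the "top 0.1% holds 22%" condition.
# Top-fraction-q wealth share is q**(1 - 1/alpha); mean is alpha*x_m/(alpha-1).
import math

q, share = 0.001, 0.22
alpha = 1 / (1 - math.log(share) / math.log(q))   # ~1.28

mean = 80e12 / 160e6                  # $500,000 total wealth per household
x_m = mean * (alpha - 1) / alpha      # Pareto scale (minimum wealth)

def f(x):
    """Wealth at quantile x in (0,1): x_m * (1 - x)**(-1/alpha)."""
    return x_m * (1 - x) ** (-1 / alpha)

print(alpha, x_m, f(0.5), f(0.9999))
```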

asked 2022-11-15

A man tests for HIV. What is the predictive probability that his second test is negative?

In a population, the HIV prevalence is estimated to be $\lambda $. For a new test for HIV:

- $\theta $ is the probability that an HIV-positive person tests positive;

- $\eta $ is the probability that an HIV-negative person tests positive.

A person takes the test to check whether they have HIV, and tests positive.

What is the predictive probability that he tests negative on the second test?

Assumption: Repeat tests on the same person are conditionally independent.

From my notes predictive probability is given as:

$P(\stackrel{~}{Y}=\stackrel{~}{y}|Y=y)=\int p(\stackrel{~}{y}|\tau )\,p(\tau |y)\,d\tau $, where $\stackrel{~}{Y}$ is the unknown observable, y is the observed data, and $\tau $ the unknown parameter.

I am interested in the probability that the second test is negative, given that the first test is positive, without knowing whether the man really has HIV.

To facilitate this I define:

${y}_{1}$ as the event of the first test being positive and

$\stackrel{~}{{y}_{2}}$ as the second test being negative

Would this adaptation of the formula given above be the correct/best approach to this problem?

$p(\stackrel{~}{{y}_{2}},{y}_{1}|\tau )=p(\stackrel{~}{{y}_{2}}|\tau )p({y}_{1}|\tau )p(\tau )$ and this is really $\propto p(\stackrel{~}{{y}_{2}}|\tau )p(\tau |{y}_{1})$

I've gotten for the $p(\tau |{y}_{1})$ from Bayes' theorem:

$p(\tau |{y}_{1})=\frac{p(\tau )p({y}_{1}|\tau )}{p({y}_{1})}=\frac{\lambda \theta}{\lambda \theta +\eta (1-\lambda )}$

How could I then find $p(\stackrel{~}{{y}_{2}}|\tau )$? Is this the correct approach?
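Under the conditional-independence assumption, the predictive probability can be computed by marginalizing over the two disease states; a numeric sketch with made-up values for $\lambda $, $\theta $, $\eta $:

```python
# P(second test negative | first positive), marginalizing over the unknown
# disease status. Values for lam, theta, eta are hypothetical.
lam, theta, eta = 0.01, 0.95, 0.05   # prevalence, sensitivity, false-pos rate

p_first_pos = lam * theta + (1 - lam) * eta
# Conditional independence given status: multiply per-status probabilities.
p_pos_then_neg = lam * theta * (1 - theta) + (1 - lam) * eta * (1 - eta)
p_neg_given_pos = p_pos_then_neg / p_first_pos
print(p_neg_given_pos)
```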

asked 2022-11-03

Bayesian Statistics - Basic question about prior

I am trying to get an understanding of Bayesian statistics. My intuition tells me that in the expression for the posterior

$$p(\vartheta |x)=\frac{p(x|\vartheta )p(\vartheta )}{{\int}_{\mathrm{\Theta}}p(x|\theta )p(\theta )d\theta}$$

the term $p(\vartheta )$ is the marginal of the joint distribution $p(\vartheta ,x)$. It is obtained by

$$p(\vartheta )={\int}_{X}p(\vartheta |x){p}_{X}(x)dx$$

where ${p}_{X}(x)$ should be the marginal distribution of the observable data. Does that make sense?

To this point it makes sense with this example: offering somebody car insurance without knowing the person's style of driving (determined by $\vartheta \in \mathrm{\Theta}$), we can still use the nation's car-crash statistics as our prior, which is a pdf on $\mathrm{\Theta}$. That would be the marginal distribution of "driving styles" across the population.

Maybe I am just oversimplifying here, because my resources did not mention this.
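The displayed identity is ordinary marginalization of the joint $p(\vartheta ,x)$; here is a small discrete sanity check (with an arbitrary made-up joint table):

```python
# Discrete check of p(theta) = sum_x p(theta|x) p(x): marginalizing the joint
# either way recovers the same p(theta). Arbitrary 2x3 joint table.
import numpy as np

joint = np.array([[0.10, 0.20, 0.05],    # rows: theta, cols: x
                  [0.30, 0.15, 0.20]])

p_x = joint.sum(axis=0)                  # marginal of x
p_theta = joint.sum(axis=1)              # marginal of theta (direct)

p_theta_given_x = joint / p_x            # columns sum to 1
p_theta_via_mix = p_theta_given_x @ p_x  # sum_x p(theta|x) p(x)

print(p_theta, p_theta_via_mix)
```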

asked 2022-11-04

Determining sample size of a set of boolean data where the probability is not 50%

I'll lay out the problem as a simplified puzzle of what I am attempting to calculate. I imagine some of this may seem fairly straightforward to many but I'm starting to get a bit lost in my head while trying to think through the problem.

Let's say I roll a 1000-sided die until it lands on the number 1. Let's say it took me 700 rolls to get there. I want to prove that the first 699 rolls were not number 1 and obviously the only way to deterministically do this is to include the first 699 failures as part of the result to show they were in fact "not 1".

However, that's a lot of data I would need to prove this. I would have to include all 700 rolls, which is a lot. Therefore, I want to probabilistically demonstrate the fact that I rolled 699 "not 1s" prior to rolling a 1. To do this, I decide I will randomly sample my "not 1" rolls to reduce the set to a statistically significant, yet more wieldy number. It will be good enough to demonstrate that I very probably did not roll a 1 prior to roll 700.

Here are my current assumptions about the state of this problem:

- My initial experiment of rolling until success is one of geometric distribution.

- However my goal for this problem is to demonstrate to a third party that I am not lying, therefore the skeptical third party is not concerned with geometric distribution but would view this simply as a binomial distribution problem.

A lot of sample size calculators exist on the web. They are all based around binomial distribution from what I can tell. So here's the formula I am considering:

$$n=\frac{N\times X}{X+N-1}$$

$$X=\frac{{Z}_{\alpha /2}^{2}\times p\times (1-p)}{\mathrm{MOE}^{2}}$$

n is sample size

N is population size

Z is the critical value ($\alpha $ is 1 minus the confidence level, expressed as a probability)

p is sample proportion

MOE is margin of error

As an aside, the website where I got this formula says it implements "finite population correction", is this desirable for my requirements?

Here is the math executed on my numbers above. I will use ${Z}_{\alpha /2}=2.58$ for $\alpha =0.01$, $p=0.001$ and $\mathrm{MOE}=0.005$. As stated above, $N=699$, on account of there being 699 failure cases that I would like to sample with a certain level of confidence.

Based on my understanding, what this math will do is recommend a sample size that will show, with 99% confidence, that the sample result is within 0.5 percentage points of reality.

Doing the math, $X=265.989744$ and $n=192.8722086653\approx 193$, implying that I can have a sample size of 193 to fulfill this confidence level and interval.

My main question is whether my assumption about $p=\frac{1}{1000}$ is valid. If it's not, and I use the conservative $p=0.5$, then my sample size shoots up to $\approx 692$. So I would like to know if my assumptions about what sample proportion actually is are correct.

More broadly, am I on the right track at all with this? From my attempt at demonstrating this probabilistically to my current thought process, is any of this accurate at all?
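For what it's worth, the quoted numbers can be reproduced directly from the two formulas (values exactly as stated in the question):

```python
# X: infinite-population sample size; n: with finite population correction.
z, p, moe, N = 2.58, 0.001, 0.005, 699

X = z**2 * p * (1 - p) / moe**2   # ~265.99
n = N * X / (X + N - 1)           # ~192.87, i.e. 193 after rounding up

print(X, n)
```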

asked 2022-10-31

Separating populations and estimating line-fit parameters

Given a dataset containing two populations, each of which can be described by a linear relationship between two variables in each sample with high ${R}^{2}$, how does one separate the two populations (and incidentally compute the line-fit)?

This is fairly easy to do graphically - just create a scatterplot and the two lines are pretty apparent. But how does one do this algorithmically?

More generally, given a dataset containing an unknown number n of populations, each of which can be fit to a line with some lower bound on ${R}^{2}$ (e.g., .95), how does one separate the data into the minimum number of populations satisfying the ${R}^{2}$ criterion?
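One algorithmic approach (a sketch, not the only option) is RANSAC-style peeling: repeatedly fit a line through two random points, keep the candidate with the most inliers, least-squares refit on those inliers, remove them, and repeat for the next line. Demonstrated on synthetic two-line data with assumed parameters:

```python
# Separate a two-line mixture by RANSAC peeling (synthetic data, made-up lines).
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
pop = rng.integers(0, 2, 200)                       # hidden population labels
y = np.where(pop == 0, 2 * x + 1, -x + 8) + rng.normal(0, 0.2, 200)

def ransac_line(x, y, n_trials=200, tol=0.5):
    """Return (slope, intercept) and inlier mask of the best-supported line."""
    best = None
    for _ in range(n_trials):
        i, j = rng.choice(len(x), size=2, replace=False)
        if x[i] == x[j]:
            continue
        a = (y[j] - y[i]) / (x[j] - x[i])
        b = y[i] - a * x[i]
        inliers = np.abs(y - (a * x + b)) < tol
        if best is None or inliers.sum() > best.sum():
            best = inliers
    return np.polyfit(x[best], y[best], 1), best    # refit on inliers

coef1, in1 = ransac_line(x, y)
coef2, in2 = ransac_line(x[~in1], y[~in1])          # peel, fit the remainder

print(coef1, coef2)   # should land near the generating lines
```

The same peeling loop extends to an unknown number of populations: keep extracting lines until the refit on the remaining points no longer meets the ${R}^{2}$ threshold.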
