Expert Assistance for Population Data: Comprehensive Resources and Practice Problems

Recent questions in Population Data
College Statistics · Answered question
Clara Dennis · 2022-11-12

I am not a mathematician, so go easy on me. I'm a programmer.
I have a database that I got from the Internet (USDA National Nutrient Database for Standard Reference), detailing the amount of each nutrient in each of a few thousand foodstuffs. I wanted to write a program that would be able to create a maximally nutritious meal based on this data.
For each nutrient, I have a target and two penalties - one for going over and one for going under the target (since, for example, it's a lot worse to get too much saturated fat than not enough). The goal is to minimize the sum of the penalties.
The meal can select from all the thousands of foodstuffs, but can only contain five or six.
I wrote the program in Java, implemented a genetic algorithm, specified my requirements, and let it run. It produced recommendations that were pure poison, and didn't seem to improve with time.
Maybe I just don't get genetic algorithms? Let's see what I did...
1) Create a population of randomly generated meals.
2) Normalize each one so it has 2000 calories, by multiplying the amount of each foodstuff proportionally.
3) Select the best 10% of meals to be parents.
4) Create a new generation - a few random ones to avoid local minima, the rest created by combining the foodstuffs and amounts from the parents.
5) GOTO 2.
What other algorithm can I try? Someone advised me to use the simplex algorithm, but I can't seem to explain to it (the implementation in Apache Commons Math) what my fitness function is. He claimed it would be a natural fit, though, and I have even heard of someone who used simplex for exactly this.
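There is a reason simplex keeps coming up: if you temporarily drop the "only five or six foodstuffs" restriction, the whole problem is a linear program. Each nutrient's over/under shortfall can be captured with a pair of non-negative slack variables, and the penalty sum becomes a linear objective. Below is a minimal sketch against the Apache Commons Math 3 linear optimizer; the two-food, two-nutrient numbers are made up purely for illustration.

```java
import org.apache.commons.math3.optim.PointValuePair;
import org.apache.commons.math3.optim.linear.*;
import org.apache.commons.math3.optim.nonlinear.scalar.GoalType;

import java.util.ArrayList;
import java.util.List;

public class MealLp {
    public static void main(String[] args) {
        // Toy data (made up): nutrientPerUnit[j][i] = amount of nutrient j
        // in one unit of food i. Real values would come from the USDA database.
        double[][] nutrientPerUnit = {
            {3.0, 1.0},   // nutrient 0, e.g. protein
            {0.5, 2.0}    // nutrient 1, e.g. saturated fat
        };
        double[] target       = {50.0, 20.0};
        double[] overPenalty  = {1.0, 10.0};  // going over sat. fat hurts more
        double[] underPenalty = {5.0,  1.0};

        int nFoods = 2, nNutrients = 2;
        // Variables: x_0..x_{nFoods-1} (food amounts), then over_j, then under_j.
        int nVars = nFoods + 2 * nNutrients;

        // Objective: minimize sum_j overPenalty[j]*over_j + underPenalty[j]*under_j.
        double[] obj = new double[nVars];
        for (int j = 0; j < nNutrients; j++) {
            obj[nFoods + j]              = overPenalty[j];
            obj[nFoods + nNutrients + j] = underPenalty[j];
        }
        LinearObjectiveFunction f = new LinearObjectiveFunction(obj, 0);

        // For each nutrient j: sum_i a[j][i]*x_i - over_j + under_j = target_j.
        // (A fixed-calorie row, e.g. sum_i cal_i*x_i = 2000, slots in the same way.)
        List<LinearConstraint> constraints = new ArrayList<>();
        for (int j = 0; j < nNutrients; j++) {
            double[] c = new double[nVars];
            for (int i = 0; i < nFoods; i++) c[i] = nutrientPerUnit[j][i];
            c[nFoods + j]              = -1;  // over_j soaks up any excess
            c[nFoods + nNutrients + j] =  1;  // under_j soaks up any shortfall
            constraints.add(new LinearConstraint(c, Relationship.EQ, target[j]));
        }

        PointValuePair sol = new SimplexSolver().optimize(
                f, new LinearConstraintSet(constraints),
                GoalType.MINIMIZE, new NonNegativeConstraint(true));

        System.out.println("Total penalty: " + sol.getValue());
        System.out.println("Food amounts:  " + java.util.Arrays.toString(
                java.util.Arrays.copyOf(sol.getPoint(), nFoods)));
    }
}
```

The catch is the cardinality limit: "at most five or six foods" is an integer constraint that plain simplex cannot express, so in practice one either solves the LP over a shortlisted subset of foods or reaches for a mixed-integer solver.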

College Statistics · Answered question
atgnybo4fq · 2022-11-04

Determining sample size of a set of boolean data where the probability is not 50%
I'll lay out the problem as a simplified puzzle of what I am attempting to calculate. I imagine some of this may seem fairly straightforward to many but I'm starting to get a bit lost in my head while trying to think through the problem.
Let's say I roll a 1000-sided die until it lands on the number 1. Let's say it took me 700 rolls to get there. I want to prove that the first 699 rolls were not number 1 and obviously the only way to deterministically do this is to include the first 699 failures as part of the result to show they were in fact "not 1".
However, that's a lot of data I would need to prove this. I would have to include all 700 rolls, which is a lot. Therefore, I want to probabilistically demonstrate the fact that I rolled 699 "not 1s" prior to rolling a 1. To do this, I decide I will randomly sample my "not 1" rolls to reduce the set to a statistically significant, yet more wieldy number. It will be good enough to demonstrate that I very probably did not roll a 1 prior to roll 700.
Here are my current assumptions about the state of this problem:
- My initial experiment of rolling until success is one of geometric distribution.
- However my goal for this problem is to demonstrate to a third party that I am not lying, therefore the skeptical third party is not concerned with geometric distribution but would view this simply as a binomial distribution problem.
A lot of sample size calculators exist on the web. They are all based around binomial distribution from what I can tell. So here's the formula I am considering:
$$n = \frac{N \times X}{X + N - 1}$$
$$X = \frac{Z_{\alpha/2}^{2} \times p \times (1 - p)}{\mathrm{MOE}^{2}}$$
n is sample size
N is population size
$Z$ is the critical value ($\alpha$ is $1 - \text{confidence level}$, expressed as a probability)
p is sample proportion
MOE is margin of error
As an aside, the website where I got this formula says it implements "finite population correction"; is this desirable for my requirements?
Here is the math executed on my numbers above. I will use $Z_{\alpha/2} = 2.58$ for $\alpha = 0.01$, $p = 0.001$, and $\mathrm{MOE} = 0.005$. As stated above, $N = 699$, on account of there being 699 failure cases that I would like to sample with a certain level of confidence.
Based on my understanding, what this math will do is recommend a sample size that will show, with 99% confidence, that the sample result is within 0.5 percentage points of reality.
Doing the math, $X = 265.989744$ and $n = 192.8722086653 \approx 193$, implying that I can have a sample size of 193 to fulfill this confidence level and interval.
My main question is whether my assumption of $p = \frac{1}{1000}$ is valid. If it's not, and I use the conservative $p = 0.5$, then my sample size shoots up to 692. So I would like to know if my assumptions about what the sample proportion actually is are correct.
More broadly, am I on the right track at all with this? From my attempt at demonstrating this probabilistically to my current thought process, is any of this accurate at all?
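As a sanity check on the arithmetic, here is a small sketch of the same calculation (the class and method names are mine, not from any library); it reproduces both figures from the question:

```java
public class SampleSize {
    /**
     * Sample size via the binomial formula with finite population correction:
     * X = Z^2 * p * (1-p) / MOE^2,  n = N*X / (X + N - 1).
     */
    static double sampleSize(double z, double p, double moe, double populationN) {
        double x = (z * z * p * (1 - p)) / (moe * moe);
        return (populationN * x) / (x + populationN - 1);
    }

    public static void main(String[] args) {
        // Values from the question: 99% confidence (Z = 2.58), p = 1/1000,
        // MOE = 0.5%, N = 699 failure rolls.
        System.out.println(sampleSize(2.58, 0.001, 0.005, 699)); // ~192.87 -> 193
        // Conservative worst case p = 0.5 for comparison:
        System.out.println(sampleSize(2.58, 0.5, 0.005, 699));   // ~691.76 -> 692
    }
}
```

The finite population correction is what keeps the conservative case at 692 rather than the roughly 66,564 the uncorrected formula would demand, so for a population of only 699 it is very much desirable.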

College Statistics · Answered question
kituoti126 · 2022-11-02

Solve PDE using method of characteristics with non-local boundary conditions.
Given a population model described by the following linear first-order PDE in $u(a,t)$, with constants $b$ and $\mu$:
$$u_a + u_t = -\mu t u, \qquad a, t > 0$$
$$u(a, 0) = u_0(a), \qquad a \geq 0$$
$$u(0, t) = F(t) = b \int_0^\infty u(a, t)\, da$$
We can split the integral in two with our non-local boundary data:
$$F(t) = b \int_0^t u(a, t)\, da + b \int_t^\infty u(a, t)\, da$$
Choosing the characteristic coordinates ( ξ , τ ) and re-arranging the expression to form the normal to the solution surface we have the following equation with initial conditions:
$$(u_a, u_t, -1) \cdot (1, 1, -\mu t u) = 0$$
$$a(0) = \xi, \qquad t(0) = 0, \qquad u(0) = u_0(\xi)$$
Characteristic equations:
$$\frac{da}{d\tau} = 1, \qquad \frac{dt}{d\tau} = 1, \qquad \frac{du}{d\tau} = -\mu t u$$
Solving each of these ODEs in $\tau$ gives the following (writing $c_1, c_2, c_3$ for constants of integration, to avoid clashing with the boundary term $F(t)$):
$$(1)\ da = d\tau \qquad (2)\ dt = d\tau \qquad (3)\ du = -\mu t u\, d\tau$$
From (1) and (2):
$$a = \tau + c_1(\xi), \qquad t = \tau + c_2(\xi)$$
Applying the initial conditions:
$$a = \tau + \xi, \qquad t = \tau$$
Substituting $t = \tau$ into (3) and separating variables:
$$du = -\mu \tau u\, d\tau$$
$$\frac{1}{u}\, du = -\mu \tau\, d\tau$$
$$\ln u = -\frac{1}{2}\mu \tau^2 + c_3(\xi)$$
$$u = G(\xi)\, e^{-\frac{1}{2}\mu \tau^2}$$
$$u = u_0(\xi)\, e^{-\frac{1}{2}\mu \tau^2}$$
Substituting back the original coordinates, we can rewrite this expression with the coordinate change:
$$\xi = a - t, \qquad \tau = t$$
$$u(a,t) = u_0(a - t)\, e^{-\frac{1}{2}\mu t^2}$$
Now this is where I get stuck: how do I use the boundary data to come up with a well-posed solution?
$$u(0,t) = u_0(-t)\, e^{-\frac{1}{2}\mu t^2} = b \int_0^t u(a,t)\, da + b \int_t^\infty u(a,t)\, da$$
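A sketch of the usual way out, hedged on the signs reconstructed above: the formula $u(a,t) = u_0(a-t)e^{-\frac{1}{2}\mu t^2}$ is only valid for $a > t$, where characteristics trace back to the initial line $t = 0$; writing $u_0(-t)$ for $a < t$ is exactly where the attempt above breaks down. In that region the characteristics instead trace back to the boundary $a = 0$ at time $t - a$, where $u = F(t-a)$, and the non-local boundary condition becomes an equation for $F$ itself:

```latex
% Region a > t: characteristics reach the initial line t = 0.
u(a,t) = u_0(a-t)\, e^{-\frac{1}{2}\mu t^2}, \qquad a > t
% Region a < t: characteristics reach the boundary a = 0 at time t - a;
% integrating du/d\tau = -\mu(\tau + t - a)\,u along them gives
u(a,t) = F(t-a)\, e^{-\frac{1}{2}\mu\left(t^2 - (t-a)^2\right)}, \qquad a < t
% Substituting both pieces into F(t) = b\int_0^\infty u(a,t)\,da yields a
% Volterra integral equation that, together with u_0, determines F(t):
F(t) = b \int_0^t F(t-a)\, e^{-\frac{1}{2}\mu\left(t^2 - (t-a)^2\right)} da
     \; + \; b\, e^{-\frac{1}{2}\mu t^2} \int_t^\infty u_0(a-t)\, da
```

Such renewal-type Volterra equations are typically solved by stepping forward in $t$, since the right-hand side only uses values of $F$ at earlier times; that is what makes the problem well-posed.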

College Statistics · Answered question
Emmanuel Giles · 2022-11-02

Using a "population" consisting of probabilities to predict accuracy of sample
Since I'm not sure if the title explains my question well enough I've come up with an example myself:
Let's say I live in a country where every citizen goes to work every day, and every citizen chooses between going by bus or by train (every citizen makes this choice anew each day; there are almost no citizens who always go by train and never by bus, or vice versa).
I've done a lot of sampling and I have data on one million citizens about their behaviour over the past 1000 days. So, I calculate the "probability" per citizen of going by train on a single day. I can also calculate the average of those calculated probabilities over all citizens; let's say the average probability of a citizen going by train is 0.27. I figured that most citizens have tendencies around this number (most citizens have an individual probability between 0.22 and 0.32 of going by train, for example).
Now, I started sampling an unknown person (known, though, to be living in the same country), and after asking him on 10 consecutive days whether he went by train or by bus, I know that this person went to work by train 4 times and by bus 6 times.
My final question: how can I use my (accurate) data on one million citizens to approximate this person's probability of going by train?
I know that if I do the calculation the other way around, i.e., calculate the probability of this event occurring given that I know this person's REAL probability is 0.4, the result is $\binom{10}{4} \times 0.4^4 \times 0.6^6 \approx 25\%$. I could calculate this probability for all possible probabilities between 0.00 and 1.00 (so 0%, 1%, ..., 100%, without any numbers in between), and these sum to about 910%. I could rescale this to 100% (dividing by 9.1) and scale all the other percentages accordingly (so our 25% becomes ~2.75%) and come up with a weighted sum: $2.75\% \times 0.4 + X\% \times 0.41 + \ldots$, but this must be wrong, since I'm not taking my accurate samples of the population into account.
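The weighted-sum idea is in fact one step away from a Bayesian update: the missing ingredient is to weight each candidate probability by how common it is in the population (a prior built from the million-citizen data) rather than implicitly treating all 101 values as equally likely. A minimal sketch of that grid computation, where `prior[]` stands in for a histogram of the per-citizen probabilities; the bell-shaped curve used to fake that histogram here is purely illustrative:

```java
public class TrainProbability {
    public static void main(String[] args) {
        int trainDays = 4, observedDays = 10; // 4 train days out of 10 asked

        // Prior over p on a 1% grid. In practice this histogram would be
        // built from the per-citizen probabilities in the million-person
        // data set; here it is faked as a bell shape centred on 0.27 with
        // most mass between 0.22 and 0.32 (illustrative assumption only).
        int n = 101;
        double[] prior = new double[n];
        double priorSum = 0;
        for (int i = 0; i < n; i++) {
            double p = i / 100.0;
            prior[i] = Math.exp(-0.5 * Math.pow((p - 0.27) / 0.05, 2));
            priorSum += prior[i];
        }
        for (int i = 0; i < n; i++) prior[i] /= priorSum;

        // Posterior over p is proportional to prior(p) times the binomial
        // likelihood of seeing 4 train days in 10: C(10,4) * p^4 * (1-p)^6.
        double[] posterior = new double[n];
        double norm = 0;
        for (int i = 0; i < n; i++) {
            double p = i / 100.0;
            double likelihood = 210 * Math.pow(p, trainDays)
                                    * Math.pow(1 - p, observedDays - trainDays);
            posterior[i] = prior[i] * likelihood;
            norm += posterior[i];
        }

        // Posterior mean = this person's estimated train probability.
        double mean = 0;
        for (int i = 0; i < n; i++) mean += (i / 100.0) * posterior[i] / norm;
        System.out.printf("Estimated P(train) for this person: %.3f%n", mean);
    }
}
```

The estimate lands between the population average 0.27 and the observed frequency 0.4, closer to 0.27, because ten observations carry little weight against a tightly concentrated population prior.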

Studying population data is an important tool for understanding the world and its inhabitants. World population data shows how populations change over time, and equations applied to it can answer questions about growth rates, birth and death rates, and life expectancy. Population data also indicates what resources a given population needs, and it reveals the economic and social trends shaping that population. With its help, we can gain insight into the current and future state of our world.