$\begingroup$

Let's say I have a population of size 1M, and I took a sample of 10k. For each individual in the 10k sample, I recorded an observation x. Afterwards I subsampled the 10k sample 1000 times with replacement and calculated each subsample's mean(x). Now I have a distribution of 1000 means from the subsamples. I can calculate the mean, standard deviation, etc. of this distribution of means. Now, how should I go about estimating the mean of the entire population with a 95% confidence interval?

Is it just mean of the 1k subsamples ± 1.96* sd(1k subsamples)?

Additional info: I subsampled the 10k sample 1000 times, each time taking 10k elements from the original sample with replacement.

$\endgroup$
  • $\begingroup$ Do we need to know the size of the subsamples? $\endgroup$ Mar 17, 2017 at 19:40
  • $\begingroup$ The subsamples are also of size 10k: 100% subsampling with replacement. $\endgroup$
    – totoromeow
    Mar 17, 2017 at 20:02

1 Answer

$\begingroup$

Here is some background on nonparametric bootstrapping.

Suppose you have a sample $X_1, \dots, X_n$ from a population with unknown (but assumed existing) mean $\mu.$ If you knew the distribution of $V = \bar X - \mu,$ then you could find lower and upper values $L$ and $U$, respectively, such that $P(L \le \bar X - \mu \le U) = 0.95,$ and after obvious algebraic manipulation $P(\bar X - U \le \mu \le \bar X - L) = 0.95,$ so that a 95% confidence interval for $\mu$ would be $(\bar X - U, \bar X - L).$
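For readers who want the algebra spelled out, the manipulation uses only two steps: multiply the inequality by $-1$ (which reverses its direction), then add $\bar X$ throughout. Nothing beyond the symbols already defined is assumed.

$$L \le \bar X - \mu \le U \iff -U \le \mu - \bar X \le -L \iff \bar X - U \le \mu \le \bar X - L.$$

Note that the upper cutoff $U$ ends up in the lower confidence limit and $L$ in the upper one.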

However, because you do not know the distribution of $V$, you enter the 'bootstrap world' to seek suitable estimates of $L$ and $U.$ Here we (temporarily) use $\mu^* = \bar X$ as a proxy for the actual population mean $\mu.$ We take a large number $B$ of re-samples of size $n$ with replacement from the sample, and find $\bar X_i^*$ for each. Then we cut 2.5% from each tail of the re-sampled distribution of the $V_i^* = \bar X_i^* - \mu^*$ to get estimates $L^*$ and $U^*$ of $L$ and $U,$ respectively.

Returning to the 'real world' we use $(\bar X - U^*, \bar X - L^*)$ as a 95% bootstrap CI for $\mu.$ Notice that here $\bar X$ has returned to its original role as the sample mean of the original data.


Example (with code): For simplicity, using smaller samples than in your example, I generate a sample of size $n = 200$ from $\mathsf{Norm}(\mu = 50, \sigma = 7)$ to use as (fake) data. Then I take $B = 10,000$ bootstrap re-samples to get an approximate 95% CI for $\mu.$

In the R code below, I use .re instead of * to denote quantities based on re-sampling. I have used a for-loop instead of the more elegant structures available in R, in case you are not familiar with R. In case you are familiar with R, I have included the seeds I used for the pseudorandom number generator so you can replicate what I have done.

set.seed(1234); n = 200; x = rnorm(n, 50, 7)    # fake normal data
a.obs = mean(x);  s.obs = sd(x); pm = c(-1,1)
# df kept at 99 as posted; the usual choice is n - 1 = 199,
# which has a slightly smaller t quantile
a.obs + pm*qt(.975, 99)*s.obs/sqrt(n)
## 48.59325 50.59812    # standard t conf int, assuming normal data
summary(x)
##   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  30.01   44.58   48.80   49.60   53.87   71.31 

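set.seed(1235)
B = 10^4;  v.re = numeric(B)
for(i in 1:B) {
   a.re = mean(sample(x, n, replace=TRUE))   # mean of one re-sample of size n
   v.re[i] = a.re - a.obs }                  # V* = re-sampled mean - observed mean
L = quantile(v.re, .025);  U = quantile(v.re, .975)   # L* and U*
a.obs - U; a.obs - L                         # lower and upper CI limits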
##    97.5% 
## 48.59839 
##     2.5% 
## 50.59273 
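For readers who are familiar with R, the for-loop can be replaced by `replicate()`. This is only a sketch of the same computation in a different idiom; the names `LU` and `ci` are mine, not part of the code above. Exact digits may differ from the output shown above in R 3.6.0 or later, where the default sampling algorithm used by `sample()` changed.

```r
set.seed(1234)
n <- 200
x <- rnorm(n, 50, 7)          # same fake data as above
a.obs <- mean(x)

set.seed(1235)
B <- 10^4
# replicate() evaluates the expression B times and returns a numeric vector
v.re <- replicate(B, mean(sample(x, n, replace = TRUE))) - a.obs

LU <- quantile(v.re, c(.025, .975), names = FALSE)   # estimates L* and U*
ci <- c(a.obs - LU[2], a.obs - LU[1])                # (xbar - U*, xbar - L*)
ci
```

Passing both probabilities to `quantile()` at once and dropping the names avoids the confusing "97.5%" label on the lower limit.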

This procedure would protect against bias if the data were from a skewed distribution. It assumes that the empirical CDF of the data approximates the population CDF for a sufficiently large $n.$
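To see the skewed case in action, here is a minimal sketch under assumptions of my own: hypothetical right-skewed data drawn from an exponential distribution with mean 10, an arbitrary seed, and the names `boot.ci` and `t.ci`. The t interval is symmetric about the sample mean by construction, while the bootstrap interval is free to be asymmetric.

```r
set.seed(2024)
n <- 200
x <- rexp(n, rate = 1/10)      # hypothetical right-skewed data, true mean 10
a.obs <- mean(x)

B <- 10^4
v.re <- replicate(B, mean(sample(x, n, replace = TRUE))) - a.obs
LU <- quantile(v.re, c(.025, .975), names = FALSE)   # estimates L* and U*

boot.ci <- c(a.obs - LU[2], a.obs - LU[1])                     # bootstrap CI
t.ci <- a.obs + c(-1, 1) * qt(.975, n - 1) * sd(x) / sqrt(n)   # t CI, df = n - 1
rbind(boot.ci, t.ci)
```

Comparing the distances from `a.obs` to each end of `boot.ci` shows how much asymmetry the bootstrap picked up from the data.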

In some cases, the bootstrap CI can be a little shorter than the traditional t confidence interval. The t interval assumes normality and so 'contemplates' the existence of possible values in both tails not occurring in our sample of size $n.$ By contrast, the bootstrap CI uses only the values which lie inside $(30.01, 71.31).$


Notes: (a) The idea behind your suggested procedure assumes normal data. It offers no protection against bias from skewed data. Also, the standard deviations $S^*$ of the re-samples will estimate the population SD $\sigma$, so you'd need to use $\bar S^*/\sqrt{n}$ instead.

(b) Your procedure is more like a parametric bootstrap. If you have normal data, I do not see the point of bootstrapping because the traditional t CI would give about the same results--with greater accuracy and less fuss. In my view, the only reason to use a parametric bootstrap would be for data known to be from a distribution other than normal (perhaps Laplace, gamma, or Weibull) where the procedures required for exact CIs are computationally messy or may be subject to debate.
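As a rough illustration of what a parametric bootstrap looks like, here is a sketch that assumes the data are exponential (a one-parameter special case of the gamma family, chosen only to keep the code short). The data, the seed, and the names `rate.hat` and `par.ci` are hypothetical. The key difference from the nonparametric version is that re-samples are drawn from the fitted model rather than from the data themselves.

```r
set.seed(101)
n <- 200
x <- rexp(n, rate = 1/10)          # hypothetical data, assumed exponential
a.obs <- mean(x)
rate.hat <- 1 / a.obs              # maximum likelihood estimate of the rate

B <- 10^4
# each re-sample is simulated from the fitted exponential model
v.re <- replicate(B, mean(rexp(n, rate = rate.hat))) - a.obs
LU <- quantile(v.re, c(.025, .975), names = FALSE)   # estimates L* and U*
par.ci <- c(a.obs - LU[2], a.obs - LU[1])            # (xbar - U*, xbar - L*)
par.ci
```

If the exponential assumption is wrong, this interval inherits that error, which is why the choice of model needs to be justified by what is known about the data.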

If you want to describe in a Comment any doubts you have about the nature of your data, or your specific reason for using bootstrap techniques, I would try to respond accordingly.

$\endgroup$
  • $\begingroup$ Thank you so much! The observation I'm recording in the population is not normally distributed. It is strongly skewed to the right. However, the means of the observations taken from the bootstrap sub-samples are normally distributed. I was also told that I can use the standard error of the mean of my sample to estimate the confidence interval, without the whole bootstrap process. Essentially, mean of the sample ± 1.96*standard error. When I tried this technique, the result was very close to the bootstrap values. But I don't understand why the z-score would still work when the distribution is not normal. $\endgroup$
    – totoromeow
    Mar 21, 2017 at 18:22
  • $\begingroup$ By the Central Limit Theorem, sample means of non-normal data are nearly normal, especially when the population is symmetrical; for large $n$ if not symmetrical. Roughly speaking, $T = \frac{\bar X - \mu}{S/\sqrt{n}}$ is approximately normally distributed if $n$ is moderately large and the population isn't too far from normal. $\endgroup$
    – BruceET
    Mar 21, 2017 at 20:08
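The Central Limit Theorem point in the last comment can be checked by simulation. The sketch below assumes a right-skewed exponential population with mean 1 and uses the question's sample size of 10k; the number of simulated samples `m`, the seed, and the helper `skew()` are my own choices for illustration.

```r
set.seed(42)
n <- 10000                          # sample size, as in the question
m <- 1000                           # number of simulated samples (my choice)
# means of m independent samples from an exponential population with mean 1
xbar <- replicate(m, mean(rexp(n, rate = 1)))

# simple moment-based skewness measure
skew <- function(z) mean((z - mean(z))^3) / sd(z)^3

skew.obs <- skew(rexp(n, rate = 1))   # single observations (theoretical value: 2)
skew.means <- skew(xbar)              # sample means: should be near 0
se.scaled <- sd(xbar) * sqrt(n)       # should be near the population SD, 1
c(skew.obs, skew.means, se.scaled)
```

The individual observations stay strongly skewed, while the sample means are close to symmetric with spread near $\sigma/\sqrt{n}$, which is why mean ± 1.96*standard error lands close to the bootstrap interval for a sample this large.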
