
Intermediate Statistics

Section 6 Sampling I

Subsection 6.1 Sampling Random Variables

Let \((\Omega,P)\) be a probability space. Recall that an element \((\omega_1,\omega_2,\ldots,\omega_n)\) of \(\Omega^n\) is called a random sample from \(\Omega\text{.}\) We extend this sampling language to random variables on \(\Omega\) as follows. Given a random variable \(X\colon \Omega\to \R\text{,}\) let \(X_1,X_2,\ldots,X_n\) be random variables given by
\begin{equation} X_k(\omega_1,\omega_2,\ldots,\omega_n)=X(\omega_k)\tag{6.1} \end{equation}
for \(1\leq k\leq n\text{.}\) We call the collection \((X_1,X_2,\ldots,X_n)\) a (random) sample of \(X\text{.}\) Intuitively, the variables \(X_k\) behave like independent copies of \(X\text{.}\) That this is the case is the content of the following proposition.
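To make the construction concrete, here is a minimal Python sketch (the space \(\Omega\text{,}\) the probabilities, and the variable \(X\) below are invented for illustration): it draws an outcome \((\omega_1,\omega_2,\ldots,\omega_n)\) from \(\Omega^n\) and evaluates each \(X_k\) on it according to (6.1).

```python
# A minimal sketch of equation (6.1). The space Omega, the probability
# function P, and the variable X below are invented for illustration.
import random

omega = ["a", "b", "c"]                   # Omega
prob  = [0.5, 0.3, 0.2]                   # P
X     = {"a": 0.0, "b": 1.0, "c": 4.0}    # a random variable X: Omega -> R

n = 5
# An outcome in Omega^n is a sequence (omega_1, ..., omega_n) drawn
# independently according to P; X_k evaluates X on the k-th coordinate.
sample_point = random.choices(omega, weights=prob, k=n)
X_values = [X[w] for w in sample_point]   # (X_1, ..., X_n) at this outcome
print(sample_point, X_values)
```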

Checkpoint 6.2.

Next, we consider random variables \(\overline{X}\) and \(s^2\text{,}\) called the sample average and the sample variance, defined as follows.
\begin{align} \overline{X} & =\frac{1}{n}\sum_k X_k\tag{6.7}\\ s^2 & = \frac{1}{n-1}\sum_k (X_k-\overline{X})^2\tag{6.8} \end{align}
The quantity \(s\) (the square root of the sample variance) is called the sample standard deviation. The quantity
\begin{equation} n\overline{X}=\sum_k X_k\tag{6.9} \end{equation}
is called the sample sum. In general, any random variable that can be written as a function of \(X_1,X_2,\ldots,X_n\) is called a (sample) statistic. The sample statistics \(n\overline{X},\overline{X},s^2\) play a key role in sampling theory. Here are some important properties.

Proposition 6.3.

Let \(\mu_X\) and \(\sigma_X^2\) denote the mean and variance of \(X\text{.}\) Then
\begin{align} E(\overline{X}) & = \mu_X\tag{6.10}\\ \var(\overline{X}) & = \frac{\sigma_X^2}{n}\tag{6.11}\\ E(n\overline{X}) & = n\mu_X\tag{6.12}\\ \var(n\overline{X}) & = n\sigma_X^2\tag{6.13}\\ E(s^2) & = \sigma_X^2\tag{6.14} \end{align}
Comments and terminology. The last equation (6.14) explains the strange-looking denominator \(n-1\) in (6.8). The square roots of the variances (6.11) and (6.13) are usually denoted \(\sigma_{\overline{X}}\) and \(\sigma_{n\overline{X}}\text{,}\) respectively. The quantity \(\sigma_{\overline{X}}\) is called the standard error for the sample average. The quantity \(\sigma_{n\overline{X}}\) is called the standard error for the sample sum.
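The unbiasedness claim \(E(s^2)=\sigma_X^2\) in (6.14) can be checked numerically. Here is a minimal Python simulation sketch, using an invented three-point space: it averages \(s^2\) over many samples and compares the result with \(\var(X)\text{.}\)

```python
# A numerical check that the n-1 denominator in (6.8) makes s^2 unbiased:
# the average of s^2 over many samples should approach Var(X).
# The space, probabilities, and variable X are invented for illustration.
import random

omega = ["a", "b", "c"]
prob  = [0.5, 0.3, 0.2]
X     = {"a": 0.0, "b": 1.0, "c": 4.0}

mu  = sum(p * X[w] for w, p in zip(omega, prob))              # E(X)
var = sum(p * (X[w] - mu) ** 2 for w, p in zip(omega, prob))  # Var(X)

n, trials = 5, 100_000
s2_sum = 0.0
for _ in range(trials):
    xs = [X[w] for w in random.choices(omega, weights=prob, k=n)]
    xbar = sum(xs) / n                                        # (6.7)
    s2_sum += sum((x - xbar) ** 2 for x in xs) / (n - 1)      # (6.8)

print(var, s2_sum / trials)   # the two numbers should be close
```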

Checkpoint 6.4.

Subsection 6.2 Simple random sample variables

Let \((\Omega,P)\) be a finite probability space with \(N=|\Omega|\) and with constant probability function \(p(\omega)=\frac{1}{N}\) for all \(\omega\in\Omega\text{.}\) Recall that \(\Omega^{n\ast}\) denotes the set of all one-to-one sequences of length \(n\) in \(\Omega\text{,}\) called simple random samples of size \(n\) taken from \(\Omega\text{.}\) Recall that the probability function \(p_{\Omega^{n\ast}}\) is constant, with value \(\frac{1}{N(N-1)\cdots (N-n+1)}\text{.}\) Given a random variable \(X\colon \Omega \to \R\text{,}\) we define sample random variables \(X_1,X_2,\ldots,X_n\) on \(\Omega^{n\ast}\) by the same formula (6.1) as for ordinary sample variables. We call the \(n\)-tuple \((X_1,X_2,\ldots,X_n)\) of variables on \(\Omega^{n\ast}\) a simple random sample of size \(n\) of \(X\text{.}\) As for ordinary samples, the simple random sample variables look like copies of \(X\text{.}\) However, in contrast with the ordinary sample case, the simple random sample variables \(X_k\) are dependent. Here are some key properties.

Proposition 6.5.

As for ordinary samples, we have \(E(\overline{X})=\mu_X\) and \(E(n\overline{X})=n\mu_X\text{.}\) The variances, however, acquire an extra factor:
\begin{align} \var(\overline{X}) & = \frac{\sigma_X^2}{n}\cdot \frac{N-n}{N-1}\tag{6.19}\\ \var(n\overline{X}) & = n\sigma_X^2\cdot \frac{N-n}{N-1}\tag{6.20} \end{align}
Terminology: Simple random samples are also called samples taken without replacement, or survey samples. This refers to the applied scenario in which the probability space \(\Omega\) models a human population, where each individual in the population has the same chance of being selected for a survey (or some kind of measurement). Once surveyed, that individual will not be surveyed again; in other words, survey samples produce one-to-one sequences. As for ordinary samples, the quantities \(\sigma_{\overline{X}},\sigma_{n\overline{X}}\) (the square roots of the variances (6.19) and (6.20), respectively) are called standard errors. The quantity \(\sqrt{\frac{N-n}{N-1}}\) that occurs in both standard errors is called the correction factor for the standard errors when sampling without replacement (or for simple random sampling, or for survey sampling).
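The correction factor can be observed in a simulation. Here is a minimal Python sketch, using an invented population of \(N=8\) values: it draws many simple random samples of size \(n=3\) (sampling without replacement) and compares the observed variance of \(\overline{X}\) with \(\frac{\sigma_X^2}{n}\cdot\frac{N-n}{N-1}\) from (6.19).

```python
# A check of the correction factor in (6.19): for samples of size n drawn
# without replacement from a population of size N, the variance of the
# sample average is (sigma^2/n) * (N-n)/(N-1).
# The population values below are invented for illustration.
import random

population = [2.0, 3.0, 3.0, 5.0, 8.0, 9.0, 11.0, 14.0]   # values of X
N, n = len(population), 3

mu  = sum(population) / N
var = sum((x - mu) ** 2 for x in population) / N   # sigma_X^2 (equal weights)

trials = 100_000
xbars = []
for _ in range(trials):
    xs = random.sample(population, n)   # simple random sample: no repeats
    xbars.append(sum(xs) / n)

mean_xbar = sum(xbars) / trials
var_xbar  = sum((v - mean_xbar) ** 2 for v in xbars) / trials

print(var_xbar, (var / n) * (N - n) / (N - 1))   # should be close
```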

Checkpoint 6.6.

Subsection 6.3 Some important variables arising from sampling

Let \(X\) be a Bernoulli variable with \(p=P(X=1)\) and \(q=1-p=P(X=0)\text{.}\) Let \(X_1,X_2,X_3,\ldots\) be an infinite sequence of samples of \(X\text{.}\) Each random variable \(S_n=\sum_{k=1}^n X_k\) is called a binomial random variable. Let \(G\) be defined by \(G(X_1,X_2,\ldots) = k\) if \(k\) is the lowest index such that \(X_k=1\text{,}\) that is, if \(0=X_1=X_2=\cdots =X_{k-1}\) and \(X_k=1\text{.}\) The variable \(G\) is called a geometric random variable. It is easy to show that the cdf \(F_G\) is given by \(F_G(k) = 1-q^k\) for \(k=1,2,3,\ldots\text{.}\) The continuous function \(F\colon \R\to\R\) given by \(F(x)=1-q^x\) for \(x\geq 0\) (and \(F(x)=0\) for \(x<0\)) satisfies the properties of a distribution function. By the fact alluded to in the opening paragraph of Subsection 5.1, it follows that there exists a random variable \(H\) such that \(F\) is the distribution function \(F_H\) of \(H\text{.}\) The variable \(H\) is called the exponential random variable with parameter \(\lambda=-\ln q\text{,}\) since \(1-q^x=1-e^{-\lambda x}\text{.}\)
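Here is a minimal Python sketch of the geometric variable (the parameter \(p=0.3\) is invented for illustration): it repeatedly tosses a \(p\)-coin, records the lowest index \(k\) with \(X_k=1\text{,}\) and compares the empirical cdf with \(F_G(k)=1-q^k\text{.}\)

```python
# A sketch of the geometric variable G: toss a p-coin until the first
# success and record its index; the empirical cdf should match 1 - q^k.
# The parameter p = 0.3 is invented for illustration.
import random

p = 0.3
q = 1.0 - p

def geometric_trial():
    """Return the lowest index k with X_k = 1 in a fresh Bernoulli sequence."""
    k = 1
    while random.random() >= p:   # X_k = 0 with probability q
        k += 1
    return k

trials = 100_000
draws = [geometric_trial() for _ in range(trials)]
for k in (1, 2, 5):
    empirical = sum(1 for g in draws if g <= k) / trials
    print(k, empirical, 1 - q ** k)   # empirical cdf vs F_G(k) = 1 - q^k
```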

Checkpoint 6.7.

  1. Find the probability function, the distribution function, the mean, and the variance for the binomial distribution.
  2. Find the probability function, the distribution function, the mean, and the variance for the geometric distribution.
  3. Find the probability density function, the mean, and the variance for the exponential distribution.
The (standard) normal variable plays a key role in sampling theory, due to the Central Limit Theorem, which we study below. In this paragraph we describe how the standard normal variable arises from a limit process applied to a sequence of binomial variables. Let \(Y\) be a Bernoulli variable with \(P(Y=1)=1/2=P(Y=0)\text{.}\) It is easy to check that \(\mu_Y=1/2\) and \(\sigma_Y=1/2\text{.}\) Let \(Y_1,Y_2,Y_3,\ldots\) be an infinite sequence of samples of \(Y\text{,}\) and let \(S_n\) be the binomial variable \(S_n=Y_1+Y_2+\cdots+Y_n\text{.}\) Using the sampling formulas (6.12) and (6.13), we have \(E(S_n)=n\mu\) and \(\var(S_n)=n\sigma^2\text{,}\) where \(\mu=\mu_Y\) and \(\sigma=\sigma_Y\text{.}\) Let \(T_n=(S_n-n\mu)/(\sigma\sqrt{n})\) be the normalized version of \(S_n\text{.}\) It turns out that there exists a limit function \(\Phi = \lim_{n\to\infty}F_{T_n}\) of the distribution functions of the \(T_n\text{,}\) which means that \(\Phi(x) = \lim_{n\to\infty} F_{T_n}(x)\) for every real number \(x\text{.}\) The limit function \(\Phi\) satisfies the properties of a distribution function. By the fact alluded to in the opening paragraph of Subsection 5.1, it follows that there exists a random variable \(Z\) such that \(\Phi\) is the distribution function \(F_Z\) of \(Z\text{.}\) The variable \(Z\) is called the (standard) normal variable. It is also called a Gaussian variable, in honor of C.F. Gauss. The standard normal distribution has mean \(0\) and standard deviation \(1\text{,}\) and has probability density function
\begin{equation} f(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}.\tag{6.22} \end{equation}
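Here is a minimal Python sketch of the limit just described (the values of \(n\) and the number of trials are chosen arbitrarily): it estimates the cdf of \(T_n\) by simulation and compares it with \(\Phi\text{,}\) computed via the error function as \(\Phi(x)=\frac{1}{2}\left(1+\operatorname{erf}(x/\sqrt{2})\right)\text{.}\)

```python
# A sketch of the limit F_{T_n} -> Phi: estimate the cdf of the normalized
# binomial T_n by simulation and compare with the standard normal cdf.
# The choices n = 200 and trials = 20_000 are arbitrary.
import math
import random

mu, sigma = 0.5, 0.5          # Bernoulli(1/2) mean and standard deviation
n, trials = 200, 20_000

def T_n():
    s = sum(random.random() < 0.5 for _ in range(n))    # S_n
    return (s - n * mu) / (sigma * math.sqrt(n))        # normalized sum

draws = [T_n() for _ in range(trials)]
for x in (-1.0, 0.0, 1.0):
    empirical = sum(1 for t in draws if t <= x) / trials
    phi = 0.5 * (1 + math.erf(x / math.sqrt(2)))        # Phi(x)
    print(x, empirical, phi)   # the two columns should roughly agree
```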

Checkpoint 6.8.