Intermediate Statistics

Section 8 Regression

The general goal of regression analysis is to find a “nice” function whose graph provides a “good fit” to a set of data points. In this section, we explain how regression works for 2-dimensional data, that is, points in the plane, and we give the details for the linear function that best fits the given data.

Subsection 8.1 Measuring fit

Let \(X,Y\colon \Omega\to \R\) be random variables on a probability space \(\Omega\text{.}\) Given a function \(y=f(x)\text{,}\) we define the variable \(e\colon \Omega\to \R\) by
\begin{equation*} e=Y-f(X). \end{equation*}
(the letter \(e\) stands for error). The overall fit of \(f\) to the data is measured by the standard deviation \(\sigma_e\) of the error variable \(e\text{.}\) The idea is that if \(\sigma_e\) is small, then \(f\) is a good fit, in the sense that the overall error incurred by using the values \(f(X)\) to predict the values of \(Y\) is small. The quantity \(\sigma_e\) is called the rms error for the function \(f\text{.}\) (The acronym rms stands for “root-mean-square”, which comes from the formula for standard deviation: the square root of the mean of the squared deviations.)

Checkpoint 8.1.

Plot the graph of \(y=f(x)=x^2\) on the interval \([1,3]\text{.}\) Make up random variables \(X,Y\) on \(\Omega=\{a,b,c\}\text{,}\) with probability function given by \(p(\omega)=1/3\) for \(\omega\in \Omega\text{.}\) Choose your variable \(X\) so that its values are in the interval \([1,3]\text{,}\) and choose your variable \(Y\) so that the points \((X(a),Y(a)), (X(b),Y(b)), (X(c),Y(c))\) lie near, but not exactly on, the graph \(y=x^2\text{.}\) Calculate \(\sigma_e\text{.}\) Repeat this procedure for two or three more data sets of three points each in a way that illustrates how the size of \(\sigma_e\) relates to your visual sense of “fit”.
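For a check on the arithmetic, here is a minimal Python sketch of the \(\sigma_e\) computation; the data values below are one arbitrary choice, not part of the exercise.

import math

# One arbitrary choice of data for Checkpoint 8.1: three equally likely
# outcomes a, b, c, with the points (X(w), Y(w)) near, but not on, y = x^2
p = 1/3
X = [1.0, 2.0, 3.0]   # values X(a), X(b), X(c), all in [1, 3]
Y = [1.2, 3.8, 9.1]   # values Y(a), Y(b), Y(c), near x^2

def f(x):
    return x**2       # the candidate fitting function

e = [y - f(x) for x, y in zip(X, Y)]              # error e = Y - f(X)
mean_e = sum(p * ei for ei in e)                  # E(e)
var_e = sum(p * ei**2 for ei in e) - mean_e**2    # var(e) = E(e^2) - E(e)^2
sigma_e = math.sqrt(var_e)                        # rms error
print(sigma_e)                                    # small value = good fit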

Subsection 8.2 Linear regression

The regression line for \(Y\) on \(X\) (also called the line of best fit, or the line of least squares error) is the linear function \(y=f(x)=mx+b\) that has the smallest value of \(\sigma_e\text{,}\) among all possible lines, for a given pair of random variables \(X,Y\text{.}\) It is not obvious, at first, that such a function should exist, or that it should be unique. But it turns out that there is precisely one such best fit line.
It is straightforward to use the linearity properties of expected value to solve for the values of \(m,b\) that minimize the expression \(\sigma_e^2=E([Y-(mX+b)]^2)\text{.}\) First consider the case where \(X,Y\) are both standardized variables, that is, where \(E(X)=E(Y)=0\) and \(\var(X)=\var(Y)=1\text{.}\) One quickly obtains the minimum possible value \(\sigma_e=\sqrt{1-\covar(X,Y)^2}\text{,}\) realized by the optimizing values \(b=0\) and \(m=\covar(X,Y)\text{.}\) The regression line for standardized variables is the following.
\begin{align} y=\covar(X,Y)x \amp \amp (\text{for standardized variables } X,Y)\tag{8.1} \end{align}
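To see where these values come from, expand \(\sigma_e^2\) using linearity of expected value; for standardized variables we have \(E(X)=E(Y)=0\text{,}\) \(E(X^2)=E(Y^2)=1\text{,}\) and \(\covar(X,Y)=E(XY)\text{:}\)
\begin{align*} \sigma_e^2 \amp = E([Y-(mX+b)]^2)\\ \amp = E(Y^2)-2mE(XY)-2bE(Y)+m^2E(X^2)+2mbE(X)+b^2\\ \amp = 1-2m\covar(X,Y)+m^2+b^2\\ \amp = (m-\covar(X,Y))^2+b^2+1-\covar(X,Y)^2. \end{align*}
The last expression is smallest when \(m=\covar(X,Y)\) and \(b=0\text{,}\) with minimum value \(\sigma_e^2=1-\covar(X,Y)^2\text{.}\)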
For the general case, the regression line is given by
\begin{align} \frac{y-\mu_Y}{\sigma_Y}=\frac{\covar(X,Y)}{\sigma_X \sigma_Y}\frac{x-\mu_X}{\sigma_X} \amp \amp (\text{for any variables } X,Y)\tag{8.2} \end{align}
and the rms error for the regression line is \(\sigma_e=\sigma_Y\sqrt{1-r^2}\text{,}\) where \(r=\frac{\covar(X,Y)}{\sigma_X \sigma_Y}\) is called the (Pearson) correlation coefficient. The regression line passes through the point \((\mu_X,\mu_Y)\text{,}\) called the point of averages, and has slope equal to \(r\frac{\sigma_Y}{\sigma_X}\text{.}\)
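As a quick translation of these formulas, here is a small Python helper; the name regression_line and its argument names are ours, for illustration only.

import math

def regression_line(mu_x, mu_y, sigma_x, sigma_y, covar_xy):
    """Slope, intercept, and rms error of the regression line for Y on X,
    computed from the five summary quantities appearing in (8.2)."""
    r = covar_xy / (sigma_x * sigma_y)       # correlation coefficient
    m = r * sigma_y / sigma_x                # slope
    b = mu_y - m * mu_x                      # line passes through (mu_x, mu_y)
    sigma_e = sigma_y * math.sqrt(1 - r**2)  # rms error of the line
    return m, b, sigma_e

For standardized variables (means \(0\text{,}\) standard deviations \(1\)) this reduces to the line \(y=\covar(X,Y)\,x\) of (8.1).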

Checkpoint 8.2.

In practice, as is often the case with applications of sampling theory, we do not have full knowledge of the random variables \(X,Y\text{,}\) but we have a sample
\begin{equation*} ((x_1,y_1),(x_2,y_2),\ldots,(x_n,y_n)) \end{equation*}
where \((x_k,y_k)=(X(\omega_k),Y(\omega_k))\) for some sample \((\omega_1,\omega_2,\ldots,\omega_n)\) from \(\Omega\text{.}\) In this case, we must estimate the quantities \(\mu_X,\mu_Y,\sigma_X,\sigma_Y,\covar(X,Y)\) from the finite data set. To do this, we use the sample average \(\overline{X}\) (see (6.7)) and the sample standard deviation \(s\) (see (6.8)) to estimate \(\mu_X\) and \(\sigma_X\text{,}\) respectively. Similarly, we use the sample mean and sample standard deviation for \(Y\) to estimate \(\mu_Y,\sigma_Y\text{.}\) To estimate \(\covar(X,Y)\text{,}\) we use
\begin{equation} \covar_{\small \rm samp}(X,Y) = \frac{1}{n-1}\sum_{k=1}^n (x_k-\overline{X})(y_k-\overline{Y}).\tag{8.3} \end{equation}
Read Section 2.2 of [3] for further vocabulary and facts about the regression line.
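For concreteness, here is a Python sketch of these estimates; the data lists xs and ys below are made up for illustration.

import math

xs = [1.0, 2.0, 2.5, 3.0, 4.0]   # sample values of X
ys = [2.1, 3.9, 5.2, 6.1, 8.0]   # sample values of Y
n = len(xs)

x_bar = sum(xs) / n              # sample average, estimates mu_X
y_bar = sum(ys) / n              # sample average, estimates mu_Y

# sample standard deviations (n-1 convention, as in (6.8))
s_x = math.sqrt(sum((x - x_bar)**2 for x in xs) / (n - 1))
s_y = math.sqrt(sum((y - y_bar)**2 for y in ys) / (n - 1))

# sample covariance, equation (8.3)
covar_samp = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

r = covar_samp / (s_x * s_y)     # estimated correlation coefficient
m = r * s_y / s_x                # estimated slope
b = y_bar - m * x_bar            # estimated intercept
print(m, b, r)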

Exercises 8.3 Exercises

1.

Work through all of the examples in Section 1.6.2 and all of the problems in Section 1.6.3 of [3].