Intermediate Statistics

Section 8 Regression

The general goal of regression analysis is to find a “nice” function whose graph provides a “good fit” to a set of data points. In this section, we explain how regression works for 2-dimensional data, that is, points in the plane, and we give the details for the linear function that best fits the given data.

Subsection 8.1 Measuring fit

Let \(X,Y\colon \Omega\to \R\) be random variables on a probability space \(\Omega\text{.}\) Given a function \(y=f(x)\text{,}\) we define the variable \(e\colon \Omega\to \R\) by
\begin{equation*} e=Y-f(X). \end{equation*}
(the letter \(e\) stands for error). The overall fit of \(f\) to the data is measured by the standard deviation \(\sigma_e\) of the error variable \(e\text{.}\) The idea is that if \(\sigma_e\) is small, then \(f\) is a good fit, in the sense that the overall error incurred by using the values \(f(X)\) to predict the values of \(Y\) is small. The quantity \(\sigma_e\) is called the rms error for the function \(f\text{.}\) (The acronym rms stands for “root-mean-square”, which comes from the formula for standard deviation: the square root of the mean of the squared deviations.)

Checkpoint 8.1.

Plot the graph of \(y=f(x)=x^2\) on the interval \([1,3]\text{.}\) Make up random variables \(X,Y\) on \(\Omega=\{a,b,c\}\text{,}\) with probability function given by \(p(\omega)=1/3\) for \(\omega\in \Omega\text{.}\) Choose your variable \(X\) so that its values are in the interval \([1,3]\text{,}\) and choose your variable \(Y\) so that the points \((X(a),Y(a)), (X(b),Y(b)), (X(c),Y(c))\) lie near, but not exactly on, the graph \(y=x^2\text{.}\) Calculate \(\sigma_e\text{.}\) Repeat this procedure for two or three more data sets of three points each in a way that illustrates how the size of \(\sigma_e\) relates to your visual sense of “fit”.
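For a check on the arithmetic, here is a minimal Python sketch of the \(\sigma_e\) computation; the data values below are one arbitrary choice, not part of the exercise.

import math

# One arbitrary choice of data for Checkpoint 8.1: three equally likely
# outcomes a, b, c, with the points (X(w), Y(w)) near, but not on, y = x^2
p = 1/3
X = [1.0, 2.0, 3.0]   # values X(a), X(b), X(c), all in [1, 3]
Y = [1.2, 3.8, 9.1]   # values Y(a), Y(b), Y(c), near x^2

def f(x):
    return x**2       # the candidate fitting function

e = [y - f(x) for x, y in zip(X, Y)]              # error e = Y - f(X)
mean_e = sum(p * ei for ei in e)                  # E(e)
var_e = sum(p * ei**2 for ei in e) - mean_e**2    # var(e) = E(e^2) - E(e)^2
sigma_e = math.sqrt(var_e)                        # rms error
print(sigma_e)                                    # small value = good fit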

Subsection 8.2 Linear regression

The regression line for \(Y\) on \(X\) (also called the line of best fit, or the line of least squares error) is the linear function \(y=f(x)=mx+b\) that has the smallest value of \(\sigma_e\text{,}\) among all possible lines, for a given pair of random variables \(X,Y\text{.}\) It is not obvious, at first, that such a function should exist, or that it should be unique. But it turns out that there is precisely one such best fit line.
It is straightforward to use the linearity properties of expected value to solve for the values of \(m,b\) that minimize the expression \(\sigma_e^2=E([Y-(mX+b)]^2)\text{.}\) First consider the case where \(X,Y\) are both standardized variables, that is, where \(E(X)=E(Y)=0\) and \(\var(X)=\var(Y)=1\text{.}\) One quickly obtains the minimum possible value \(\sigma_e=\sqrt{1-\covar(X,Y)^2}\text{,}\) realized by the optimizing values \(b=0\) and \(m=\covar(X,Y)\text{.}\) The regression line for standardized variables is the following.
\begin{align} y=\covar(X,Y)x \amp \amp (\text{for standardized variables } X,Y)\tag{8.1} \end{align}
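To see where these values come from, expand \(\sigma_e^2\) using linearity of expected value; for standardized variables we have \(E(X)=E(Y)=0\text{,}\) \(E(X^2)=E(Y^2)=1\text{,}\) and \(\covar(X,Y)=E(XY)\text{:}\)
\begin{align*} \sigma_e^2 \amp = E([Y-(mX+b)]^2)\\ \amp = E(Y^2)-2mE(XY)-2bE(Y)+m^2E(X^2)+2mbE(X)+b^2\\ \amp = 1-2m\covar(X,Y)+m^2+b^2\\ \amp = (m-\covar(X,Y))^2+b^2+1-\covar(X,Y)^2. \end{align*}
The last expression is smallest when \(m=\covar(X,Y)\) and \(b=0\text{,}\) with minimum value \(\sigma_e^2=1-\covar(X,Y)^2\text{.}\)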
For the general case, the regression line is given by
\begin{align} \frac{y-\mu_Y}{\sigma_Y}=\frac{\covar(X,Y)}{\sigma_X \sigma_Y}\frac{x-\mu_X}{\sigma_X} \amp \amp (\text{for any variables } X,Y)\tag{8.2} \end{align}
and the rms error for the regression line is \(\sigma_e=\sigma_Y\sqrt{1-r^2}\text{,}\) where \(r=\frac{\covar(X,Y)}{\sigma_X \sigma_Y}\) is called the (Pearson) correlation coefficient. The regression line passes through the point \((\mu_X,\mu_Y)\text{,}\) called the point of averages, and has slope equal to \(r\frac{\sigma_Y}{\sigma_X}\text{.}\)
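As a quick translation of these formulas, here is a small Python helper; the name regression_line and its argument names are ours, for illustration only.

import math

def regression_line(mu_x, mu_y, sigma_x, sigma_y, covar_xy):
    """Slope, intercept, and rms error of the regression line for Y on X,
    computed from the five summary quantities appearing in (8.2)."""
    r = covar_xy / (sigma_x * sigma_y)       # correlation coefficient
    m = r * sigma_y / sigma_x                # slope
    b = mu_y - m * mu_x                      # line passes through (mu_x, mu_y)
    sigma_e = sigma_y * math.sqrt(1 - r**2)  # rms error of the line
    return m, b, sigma_e

For standardized variables (means \(0\text{,}\) standard deviations \(1\)) this reduces to the line \(y=\covar(X,Y)\,x\) of (8.1).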

Checkpoint 8.2.

In practice, as is often the case with applications of sampling theory, we do not have full knowledge of the random variables \(X,Y\text{,}\) but we have a sample
\begin{equation*} ((x_1,y_1),(x_2,y_2),\ldots,(x_n,y_n)) \end{equation*}
where \((x_k,y_k)=(X(\omega_k),Y(\omega_k))\) for some sample \((\omega_1,\omega_2,\ldots,\omega_n)\) from \(\Omega\text{.}\) In this case, we must estimate the quantities \(\mu_X,\mu_Y,\sigma_X,\sigma_Y,\covar(X,Y)\) from the finite data set. To do this, we use the sample average \(\overline{X}\) (see (6.7)) and the sample standard deviation \(s\) (see (6.8)) to estimate \(\mu_X\) and \(\sigma_X\text{,}\) respectively. Similarly, we use the sample mean and sample standard deviation for \(Y\) to estimate \(\mu_Y,\sigma_Y\text{.}\) To estimate \(\covar(X,Y)\text{,}\) we use
\begin{equation} \covar_{\small \rm samp}(X,Y) = \frac{1}{n-1}\sum_{k=1}^n (x_k-\overline{X})(y_k-\overline{Y}).\tag{8.3} \end{equation}
Read Section 2.2 of [3] for further vocabulary and facts about the regression line.
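For concreteness, here is a Python sketch of these estimates; the data lists xs and ys below are made up for illustration.

import math

xs = [1.0, 2.0, 2.5, 3.0, 4.0]   # sample values of X
ys = [2.1, 3.9, 5.2, 6.1, 8.0]   # sample values of Y
n = len(xs)

x_bar = sum(xs) / n              # sample average, estimates mu_X
y_bar = sum(ys) / n              # sample average, estimates mu_Y

# sample standard deviations (n-1 convention, as in (6.8))
s_x = math.sqrt(sum((x - x_bar)**2 for x in xs) / (n - 1))
s_y = math.sqrt(sum((y - y_bar)**2 for y in ys) / (n - 1))

# sample covariance, equation (8.3)
covar_samp = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

r = covar_samp / (s_x * s_y)     # estimated correlation coefficient
m = r * s_y / s_x                # estimated slope
b = y_bar - m * x_bar            # estimated intercept
print(m, b, r)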

Exercises 8.3 Exercises

1.

Work through all of the examples in Section 1.6.2 and all of the problems in Section 1.6.3 of [3].