

Linear Regression
An Introduction


Prof. David Bernstein
James Madison University

Computer Science Department
bernstdh@jmu.edu


Motivation
  • A Quiz:
    • Are you smarter than the average person?
  • From Our Study of Statistical Inference:
    • If "you" is plural (i.e., refers to the entire class) then we can make use of the sample mean and standard deviation and conduct a \(t\) test
  • Today's Question:
    • Why are you so smart?
Some History
  • Francis Galton's Original (1886) Usage:
    • Average height of sons of tall fathers was less than the height of the fathers
    • Average height of sons of short fathers was more than the height of the fathers
    • Tall and short sons regress towards the average
    • The article - "Regression towards Mediocrity in Hereditary Stature"
  • An Example of Modern Usage:
    • How does the average height of sons depend on the height of their fathers?
Some History (cont.)

[Figure: Galton's data on the heights of fathers and sons]
Regression Analysis
  • Defined:
    • Study of the dependence of one variable (called the endogenous or dependent variable) on one or more other variables (called the exogenous or independent or explanatory variables)
  • Be Careful:
    • It identifies a statistical dependence not causation
  • Relation to the Discipline of Statistics:
    • Statistical estimation determines model parameters based on empirical data
    • Regression analysis is a statistical estimation technique
Getting Started - A Model with One Explanatory Variable
  • Notation:
    • \(X\) denotes the explanatory variable
    • \(Y\) denotes the dependent variable
  • The Model:
    • \(Y = f(X)\)
Our Focus
  • Models that are Linear in Parameters:
    • \(Y = \alpha + \beta X\)
  • The Parameters to Estimate:
    • \(\alpha\) and \(\beta\)
  • An Important Observation:
    • The model is linear in the parameters. So, the explanatory variable can be anything (e.g., it can be income-squared, it can involve logarithms, etc...)
The Data
  • What We Collect:
    • Observations of \(X\) and corresponding observations of \(Y\)
  • Notation:
    • \(X_i\) denotes the \(i\)th observation of the explanatory variable
    • \(Y_i\) denotes the corresponding \(i\)th observation of the dependent variable
Some Uninteresting Special Cases
  • Exactly One Observation, \((Y_1, X_1)\):
    • Since it takes two points to define a line, we can't estimate both \(\alpha\) and \(\beta\)
  • Exactly Two Observations, \((Y_1, X_1)\) and \((Y_2, X_2)\)
    • We have two equations and two unknowns (so we can find both \(\alpha\) and \(\beta\)), but not enough data to obtain statistically significant results
An Unlikely Special Case
  • A Perfect Fit:
    • We have many observations and they are all on a single line
  • Why this is Unlikely:
    • The theory/model is unlikely to be perfect
    • It is difficult to measure some variables
    • Intrinsic randomness in the dependent variable
The Common Cases
  • A Typical Scatter Plot:
    • [Scatter plot of the observations \((X_i, Y_i)\)]
  • Mathematical Representation:
    • \(Y_i = \alpha + \beta X_i + \epsilon_i \; \; \; i = 1, \ldots, n\)
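
A minimal sketch of generating data from this model in Python (the parameter values, sample size, and distributions below are arbitrary choices for illustration, not part of the slides):

import numpy as np

# Hypothetical values chosen only for illustration
alpha, beta, sigma, n = 2.0, 0.5, 1.0, 50

rng = np.random.default_rng(0)
X = rng.uniform(60.0, 80.0, size=n)        # observations of the explanatory variable
epsilon = rng.normal(0.0, sigma, size=n)   # the error term
Y = alpha + beta * X + epsilon             # observations of the dependent variable
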
Our Goal
  • In General:
    • Choose the "best" line (i.e., find the values of \(\alpha\) and \(\beta\) that give the "best" fit)
  • Least Absolute Deviation:
    • \[\min \sum_{i=1}^{n} | \epsilon_i |\]
  • Least Squared Deviation:
    • \[\min \sum_{i=1}^{n} \epsilon_i^2\]
Which Approach to Use?
  • An Observation:
    • Both have good properties
    • Both can be solved numerically
  • An Advantage of Least Squares:
    • We can solve for \(\alpha\) and \(\beta\) in closed form using calculus
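
A sketch contrasting the two criteria, assuming NumPy and SciPy are available (the data arrays are placeholders for illustration):

import numpy as np
from scipy.optimize import minimize

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

def sum_abs_dev(params):
    # least absolute deviation objective for a candidate line
    a, b = params
    return np.sum(np.abs(Y - a - b * X))

def sum_sq_dev(params):
    # least squared deviation objective for a candidate line
    a, b = params
    return np.sum((Y - a - b * X) ** 2)

# Both criteria can be minimized numerically; least squares also has a closed form
lad = minimize(sum_abs_dev, x0=[0.0, 1.0], method="Nelder-Mead")
ols = minimize(sum_sq_dev, x0=[0.0, 1.0], method="Nelder-Mead")
print(lad.x, ols.x)   # estimates of (alpha, beta) under each criterion
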
Nerd Humor
[Comic courtesy of xkcd]
Deriving the Least Squares Estimators - Linear Models

For linear models through the origin (i.e., with just \(\beta\)) we have:

\[ \min \sum_{i=1}^n (Y_i - \beta X_i)^2 = (Y_1 - \beta X_1)^2 + \cdots + (Y_n - \beta X_n)^2 \]

which we need to differentiate, set to 0, and solve.

\[ \frac{d}{d \beta} \sum_{i=1}^{n}(Y_i - \beta X_i)^2 = \frac{d (Y_1 - \beta X_1)^2}{d\beta} + \cdots + \frac{d (Y_n - \beta X_n)^2}{d\beta} \] \[ = 2 \cdot (-X_1) \cdot (Y_1 - \beta X_1) + \cdots + 2 \cdot (-X_n) \cdot (Y_n - \beta X_n) \] \[ = 2 \sum_{i=1}^{n} -X_i (Y_i - \beta X_i) = -2 \sum_{i=1}^{n} X_i Y_i + 2 \beta \sum_{i=1}^{n} X_i^2 \]

Deriving the Least Squares Estimators - Linear Models (cont.)

So, at the minimum:

\[ \sum_{i=1}^n X_i Y_i = \beta \sum_{i=1}^n X_i^2 \]

which implies that, at the minimum, \(\beta = \frac{\sum_{i=1}^n X_i Y_i}{\sum_{i=1}^n X_i^2}\)
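
A one-line sketch of this estimator in Python (the data arrays are placeholders for illustration):

import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0])
Y = np.array([1.9, 4.2, 5.8, 8.1])

# Least squares slope for a line through the origin
beta = np.sum(X * Y) / np.sum(X ** 2)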

Deriving the Least Squares Estimators - Affine Models

For affine models (i.e., with both \(\alpha\) and \(\beta\)) we need to take partial derivatives, set them equal to 0, and solve.

Starting with \(\alpha\):

\[ \frac{\partial}{\partial \alpha} \sum_{i=1}^{n}(Y_i - \alpha - \beta X_i)^2 = 2 \sum_{i=1}^{n}(Y_i - \alpha - \beta X_i) \cdot -1 \] At the minimum: \[ \sum_{i=1}^{n}Y_i - \sum_{i=1}^{n} \alpha - \sum_{i=1}^{n} \beta X_i = 0 \Rightarrow n \cdot \alpha = \sum_{i=1}^{n}Y_i - \beta \sum_{i=1}^{n}X_i \] Dividing by \(n\): \[ \alpha = \overline{Y} - \beta \overline{X} \] where \(\overline{Y}\) and \(\overline{X}\) are the mean of \(Y\) and \(X\) respectively.

Deriving the Least Squares Estimators (cont.)

Now solving for \(\beta\):

\[ \frac{\partial}{\partial \beta} \sum_{i=1}^{n}(Y_i - \alpha - \beta X_i)^2 = 2 \sum_{i=1}^{n}(Y_i - \alpha - \beta X_i) \cdot -X_i \] At the minimum: \[ \sum_{i=1}^{n}Y_i X_i - \sum_{i=1}^{n}\alpha X_i - \sum_{i=1}^{n} \beta X_i^2 = 0 \Rightarrow \sum_{i=1}^{n}Y_i X_i - \alpha \sum_{i=1}^{n} X_i - \beta \sum_{i=1}^{n}X_i^2 = 0 \] Substituting \(\alpha = \overline{Y} - \beta \overline{X}\) \[ \sum_{i=1}^{n}Y_i X_i - \overline{Y} \sum_{i=1}^{n} X_i + \beta \overline{X} \sum_{i=1}^{n} X_i - \beta \sum_{i=1}^{n}X_i^2 = 0 \] which implies \[ \sum_{i=1}^{n}X_i(Y_i - \overline{Y}) + \beta \sum_{i=1}^{n} X_i (\overline{X} - X_i) = 0 \] Solving for \(\beta\): \[ \beta = \frac{\sum_{i=1}^{n}X_i(Y_i - \overline{Y})}{\sum_{i=1}^{n} X_i (X_i - \overline{X})} \]
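
These two formulas translate directly into code; a sketch (the data arrays are placeholders for illustration):

import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

x_bar, y_bar = X.mean(), Y.mean()
beta = np.sum(X * (Y - y_bar)) / np.sum(X * (X - x_bar))  # slope estimate
alpha = y_bar - beta * x_bar                              # intercept estimate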

The Probabilistic Nature of Regression
  • Errors in the Process:
    • The model is a simplification of reality
    • There is error in the collection and measurement of the data
  • The Implications:
    • Each observation of \(Y\) (i.e., each \(Y_i\)) is a random variable
    • We should think about the properties of \(\epsilon\) and what this means for the model \(Y_i = \alpha + \beta X_i + \epsilon_i\)
The Probabilistic Nature (cont.)
  • Assumptions:
    • The error term has zero expected value [i.e., \(E(\epsilon_i)=0\)], and constant variance [i.e., \(E(\epsilon_i^2)=\sigma^2\)]
    • The random variables \(\epsilon_i\) are uncorrelated
  • Implications for OLS Estimators:
    • \(\text{Var}(\beta) = \frac{\sigma^2}{\sum(X_i - \overline{X})^2}\)
    • \(\text{Var}(\alpha) = \sigma^2 \left[ \frac{\sum X_i^2} {n \sum(X_i - \overline{X})^2} \right]\)
Statistical Significance of the Estimators
  • We Can Estimate \(\sigma^2\):
    • \(s^2 = \frac{\sum_{i=1}^{n}\epsilon_i^2}{n-2}\)
  • We Can Test Hypotheses:
    • Using \(s^2\) in place of \(\sigma^2\) we can determine whether \(\alpha\) and \(\beta\) are significantly different from 0
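
A sketch of these calculations, assuming NumPy and SciPy (for the \(t\) distribution) are available; the data arrays are placeholders for illustration:

import numpy as np
from scipy import stats

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])
n = len(X)

x_bar, y_bar = X.mean(), Y.mean()
beta = np.sum(X * (Y - y_bar)) / np.sum(X * (X - x_bar))
alpha = y_bar - beta * x_bar

residuals = Y - alpha - beta * X
s2 = np.sum(residuals ** 2) / (n - 2)                   # estimate of sigma^2

var_beta = s2 / np.sum((X - x_bar) ** 2)
var_alpha = s2 * np.sum(X ** 2) / (n * np.sum((X - x_bar) ** 2))

# t statistics for H0: alpha = 0 and H0: beta = 0
t_alpha = alpha / np.sqrt(var_alpha)
t_beta = beta / np.sqrt(var_beta)
p_beta = 2 * stats.t.sf(abs(t_beta), df=n - 2)          # two-sided p-value for beta
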
Multiple Regression
  • A Question:
    • What can we do if there are multiple explanatory variables?
  • The Answer:
    • We can do the same kind of thing for models of the form:
    • \(Y_i = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki} + \epsilon_i\)
    • where \(X_{1}\) denotes the first independent variable and \(X_{1i}\) denotes the \(i\)th observation of the first independent variable, etc...
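
A sketch of the multiple-regression case using the matrix form of least squares (the design matrix and response below are placeholders for illustration):

import numpy as np

# Columns: a constant for alpha, then the explanatory variables X_1 and X_2
X = np.array([[1.0, 2.0, 1.0],
              [1.0, 3.0, 0.0],
              [1.0, 5.0, 2.0],
              [1.0, 7.0, 1.0],
              [1.0, 8.0, 3.0]])
Y = np.array([3.1, 3.9, 6.8, 8.2, 10.1])

# Minimizes the sum of squared deviations; b[0] is alpha, b[1:] are the betas
b, rss, rank, sv = np.linalg.lstsq(X, Y, rcond=None)
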
Goodness of Fit
[Scatter plots contrasting a good fit and a bad fit]
Goodness of Fit (cont.)
  • Percentage of the Total Variation that is Explained:
    • \[ R^2 = 1 - \frac{\sum_{i=1}^{n}\epsilon_i^2}{\sum_{i=1}^{n}(Y_i - \overline{Y})^2} \]
  • Correcting for the Number of Explanatory Variables (\(k\)):
    • \[ \overline{R}^2 = 1 - \frac{\sum_{i=1}^{n}\epsilon_i^2/(n-k)}{\sum_{i=1}^{n}(Y_i - \overline{Y})^2/(n-1)} \]
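
A sketch of both measures, assuming the residuals and the number of explanatory variables \(k\) are already available:

import numpy as np

def r_squared(Y, residuals):
    # fraction of the total variation in Y that the model explains
    total = np.sum((Y - Y.mean()) ** 2)
    return 1.0 - np.sum(residuals ** 2) / total

def adjusted_r_squared(Y, residuals, k):
    # corrects for the number of explanatory variables k
    n = len(Y)
    total = np.sum((Y - Y.mean()) ** 2)
    return 1.0 - (np.sum(residuals ** 2) / (n - k)) / (total / (n - 1))
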
There's Always More to Learn