Determine which explanatory variables have a significant effect on the mean of the quantitative response variable.


Simple Linear Regression

Simple linear regression is a good analysis technique when the data consists of a single quantitative response variable \(Y\) and a single quantitative explanatory variable \(X\).

Overview

Mathematical Model

The true regression model assumed by a regression analysis is given by

\(Y_i\) The response variable. The “i” denotes that this is the y-value for individual “i”, where “i” is 1, 2, 3,… and so on up to \(n\), the sample size. \(=\) This states that we are assuming \(Y_i\) was created, or is “equal to” the formula that will follow on the right-hand-side of the equation. \(\underbrace{\overbrace{\beta_0}^\text{y-intercept} + \overbrace{\beta_1}^\text{slope} X_i \ }_\text{true regression relation}\) The true regression relation is a line, a line that is typically unknown in real life. It can be likened to “God’s Law” or “Natural Law”. Something that governs the way the data behaves, but is unknown to us. \(+\) This plus sign emphasizes that the actual data, the \(Y_i\), is created by adding together the value from the true line \(\beta_0 + \beta_1 X_i\) and an individual error term \(\epsilon_i\), which allows each dot in the regression to be off of the line by a certain amount called \(\epsilon_i\). \(\overbrace{\epsilon_i}^\text{error term}\) Error term for each individual \(i\). The error terms are “random” and unique for each individual. This provides the statistical relationship of the regression. It is what allows each dot to be different, while still coming from the same line, or underlying law. \(\quad \text{where}\) Some extra comments are needed about \(\epsilon_i\). \(\ \overbrace{\epsilon_i \sim N(0, \sigma^2)}^\text{error term normally distributed}\) The error terms \(\epsilon_i\) are assumed to be normally distributed with constant variance. Pay special note that the \(\sigma\) does not have an \(i\) in it, so it is the same for each individual. In other words, the variance is constant. The mean of the errors is zero, which causes the dots to be spread out symmetrically both above and below the line.


The estimated regression line obtained from a regression analysis uses \(\hat{Y}_i\), pronounced “y-hat”, and is written as

\(\hat{Y}_i\) The estimated average y-value for individual \(i\) is denoted by \(\hat{Y}_i\). It is important to recognize that \(Y_i\) is the actual value for individual \(i\), and \(\hat{Y}_i\) is the average y-value for all individuals with the same \(X_i\) value. \(=\) The formula for the average y-value, \(\hat{Y}_i\) is equal to what follows… \(\underbrace{\overbrace{\ b_0 \ }^\text{y-intercept} + \overbrace{b_1}^\text{slope} X_i \ }_\text{estimated regression relation}\) Two things are important to notice about this equation. First, it uses \(b_0\) and \(b_1\) instead of \(\beta_0\) and \(\beta_1\). This is because \(b_0\) and \(b_1\) are the estimated y-intercept and slope, respectively, not the true y-intercept \(\beta_0\) and true slope \(\beta_1\). Second, this equation does not include \(\epsilon_i\). In other words, it is the estimated regression line, so it only describes the average y-values, not the actual y-values.


Note: see The Mathematical Model under the Explanation tab for details about these equations.
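The two equations above can be seen in action with a small simulation: pick values for \(\beta_0\), \(\beta_1\), and \(\sigma\) (the numbers below are made up purely for illustration), generate data from the true model, and let lm() compute the estimates \(b_0\) and \(b_1\).

```r
set.seed(42)                              # make the simulation reproducible
n       <- 100                            # sample size
beta_0  <- 2                              # true y-intercept (unknown in real life)
beta_1  <- 3.5                            # true slope (unknown in real life)
sigma   <- 1.8                            # constant standard deviation of the errors
X       <- runif(n, 0, 10)                # explanatory values
epsilon <- rnorm(n, mean = 0, sd = sigma) # epsilon_i ~ N(0, sigma^2)
Y       <- beta_0 + beta_1 * X + epsilon  # the actual data, dots off the true line
sim.lm  <- lm(Y ~ X)                      # b_0 and b_1 estimate beta_0 and beta_1
coef(sim.lm)                              # should land close to 2 and 3.5
```

Because the errors are random, \(b_0\) and \(b_1\) will be close to, but not exactly equal to, the true \(\beta_0\) and \(\beta_1\).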

Hypotheses


\[ \left.\begin{array}{ll} H_0: \beta_1 = 0 \\ H_a: \beta_1 \neq 0 \end{array} \right\} \ \text{Slope Hypotheses}^{\quad \text{(most common)}}\quad\quad \]

\[ \left.\begin{array}{ll} H_0: \beta_0 = 0 \\ H_a: \beta_0 \neq 0 \end{array} \right\} \ \text{Intercept Hypotheses}^{\quad\text{(sometimes useful)}} \]


If \(\beta_1 = 0\), then the model reduces to \(Y_i = \beta_0 + \epsilon_i\), which is a flat line. This means \(X\) does not improve our understanding of the mean of \(Y\) if the null hypothesis is true.

If \(\beta_0 = 0\), then the model reduces to \(Y_i = \beta_1 X_i + \epsilon_i\), a line going through the origin. This means the average \(Y\)-value is \(0\) when \(X=0\) if the null hypothesis is true.

Assumptions

This regression model is appropriate for the data when five assumptions can be made.

  1. Linear Relation: the true regression relation between \(Y\) and \(X\) is linear.

  2. Normal Errors: the error terms \(\epsilon_i\) are normally distributed with a mean of zero.

  3. Constant Variance: the variance \(\sigma^2\) of the error terms is constant (the same) over all \(X_i\) values.

  4. Fixed X: the \(X_i\) values can be considered fixed and measured without error.

  5. Independent Errors: the error terms \(\epsilon_i\) are independent.

Note: see Residual Plots & Regression Assumptions under the Explanation tab for details about checking the regression assumptions.

Interpretation

The slope is interpreted as, “the change in the average y-value for a one unit change in the x-value.” It is not the average change in y. It is the change in the average y-value.

The y-intercept is interpreted as, “the average y-value when x is zero.” It is often not meaningful, but is sometimes useful. It depends on whether x being zero is meaningful within the context of your analysis. For example, knowing the average price of a car with zero miles is useful. However, pretending to know the average height of adult males who weigh zero pounds is not useful.


R Instructions

Console Help Command: ?lm()

Perform the Regression

mylm This is some name you come up with that will become the R object that stores the results of your linear regression lm(...) command.  <-  This is the “left arrow” assignment operator that stores the results of your lm() code under the name mylm. lm( lm(…) is an R function that stands for “Linear Model”. It performs a linear regression analysis for Y ~ X. Y Y is your quantitative response variable. It is the name of one of the columns in your data set. ~ The tilde symbol ~ is used to tell R that Y should be treated as the response variable that is being explained by the explanatory variable X. X, X is the quantitative explanatory variable (at least it is typically quantitative but could be qualitative) that will be used to explain the average Y-value.  data = NameOfYourDataset NameOfYourDataset is the name of the dataset that contains Y and X. In other words, one column of your dataset would be your response variable Y and another column would be your explanatory variable X. ) Closing parenthesis for the lm(…) function.
summary(mylm) The summary command allows you to print the results of your linear regression that were previously saved under the name mylm.
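As a concrete sketch, R’s built-in cars dataset works well, with dist as the response variable and speed as the explanatory variable (the name mylm is just a placeholder you choose):

```r
# Regress stopping distance (dist) on speed using the built-in cars dataset.
mylm <- lm(dist ~ speed, data = cars)
summary(mylm)   # shows b_0, b_1, their t-tests and p-values, R-squared, and more
```

The Coefficients table of summary(mylm) contains \(b_0\) and \(b_1\) along with the p-values for the intercept and slope hypothesis tests described earlier.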


Check Assumptions 1, 2, 3, and 5

par( The par(…) command stands for “Graphical PARameters”. It allows you to control various aspects of graphics in Base R. mfrow= This stands for “multiple frames filled by row”, which means, put lots of plots on the same row, starting with the plot on the left, then working towards the right as more plots are created. c( The combine function c(…) is used to specify how many rows and columns of graphics should be placed together. 1, This specifies that 1 row of graphics should be produced. 3 This states that 3 columns of graphics should be produced. ) Closing parenthesis for c(…) function. ) Closing parenthesis for par(…) function.
plot( This version of plot(…) will actually create several regression diagnostic plots by default. mylm, This is the name of an lm object that you created previously. which= This allows you to select “which” regression diagnostic plots should be drawn. 1 Selecting 1 would give the residuals vs. fitted values plot only. : The colon allows you to select more than just one plot. 2 Selecting 2 also gives the Q-Q Plot of residuals. Alternatively, you could use which=1 to get just the residuals vs. fitted values plot, then use qqPlot(mylm$residuals) to create a fancier Q-Q Plot of the residuals. ) Closing parenthesis for plot(…) function.
plot( This version of plot(…) will be used to create a time-ordered plot of the residuals. The order of the residuals is the original order of the x-values in the original data set. If the original data set doesn’t have an order, then this plot is not interesting. mylm The lm object that you created previously. $ This allows you to access various elements from the regression that was performed. residuals This grabs the residuals for each observation in the regression. ) Closing parenthesis for plot(…) function.
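Putting the annotated pieces above together for the cars regression gives a short, runnable sketch (mylm is the lm object from the previous step):

```r
mylm <- lm(dist ~ speed, data = cars)   # the regression being diagnosed
par(mfrow = c(1, 3))                    # one row of three plots
plot(mylm, which = 1:2)                 # residuals vs. fitted, Q-Q plot of residuals
plot(mylm$residuals)                    # residuals in the original row order
```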


Plotting the Regression Line

To add the regression line to a scatterplot use the abline(...) command:

plot( The plot(…) function is used to create a scatterplot with a y-axis (the vertical axis) and an x-axis (the horizontal axis). Y This is the “response variable” of your regression. The thing you are interested in predicting. This is the name of a “numeric” column of data from the data set called YourDataSet. ~ The tilde “~” is used to relate Y to X and can be found on the top-left key of your keyboard. X,  This is the explanatory variable of your regression. It is the name of a “numeric” column of data from YourDataSet. data= The data= statement is used to specify the name of the data set where the columns of “X” and “Y” are located. YourDataSet This is the name of your data set, like KidsFeet or cars or airquality. ) Closing parenthesis for plot(…) function.
abline( This stands for “a” (intercept) “b” (slope) line. It is a function that allows you to add a line to a plot by specifying just the intercept and slope of the line. mylm This is the name of an lm(…) that you created previously. Since mylm contains the slope and intercept of the estimated line, the abline(…) function will locate these two values from within mylm and use them to add a line to your current plot(…). ) Closing parenthesis for abline(…) function.

You can customize the look of the regression line with

abline( This stands for “a” (intercept) “b” (slope) line. It is a function that allows you to add a line to a plot by specifying just the intercept and slope of the line. mylm, This is the name of an lm(…) that you created previously. Since mylm contains the slope and intercept of the estimated line, the abline(…) function will locate these two values from within mylm and use them to add a line to your current plot(…). lty= The lty= stands for “line type” and allows you to select between 0=blank, 1=solid (default), 2=dashed, 3=dotted, 4=dotdash, 5=longdash, 6=twodash. 1, This creates a solid line. Remember, other options include: 0=blank, 1=solid (default), 2=dashed, 3=dotted, 4=dotdash, 5=longdash, 6=twodash. lwd= The lwd= allows you to specify the width of the line. The default width is 1. Using lwd=2 would double the thickness, and so on. Any positive value is allowed. 1, Default line width. To make a thicker line, use 2 or 3… To make a thinner line, try 0.5, but 1 is already pretty thin. col= This allows you to specify the color of the line using either a name of a color or rgb(.5,.2,.3,.2) where the format is rgb(percentage red, percentage green, percentage blue, percent opaque). “someColor” Type colors() in R for options. ) Closing parenthesis for abline(…) function.

You can add points to the regression with…

points( This is like plot(…) but adds points to the current plot(…) instead of creating a new plot. newY  newY should be a column of values from some data set. Or, use points(newX, newY) to add a single point to a graph. ~ The tilde links Y to X in the plot. newX,  newX should be a column of values from some data set. It should be the same length as newY. If just a single value, use points(newX, newY) instead. data=YourDataSet,  If newY and newX come from a dataset, then use data= to tell the points(…) function what data set they come from. If newY and newX are just single values, then data= is not needed. col=“skyblue”, This allows you to specify the color of the points using either a name of a color or rgb(.5,.2,.3,.2) where the format is rgb(percentage red, percentage green, percentage blue, percent opaque). pch=16 This allows you to specify the type of plotting symbol to be used for the points. Type ?pch and scroll half way down in the help file that appears to learn about other possible symbols. ) Closing parenthesis for points(…) function.
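A compact sketch combining plot(), abline(), and points() on the cars data (the line style, colors, and the extra point at (14, 75) are arbitrary choices for illustration):

```r
mylm <- lm(dist ~ speed, data = cars)              # the regression to draw
plot(dist ~ speed, data = cars)                    # the scatterplot
abline(mylm, lty = 2, lwd = 2, col = "steelblue")  # dashed, thicker, colored line
points(14, 75, col = "firebrick", pch = 16)        # one extra point at (14, 75)
```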

To add the regression line to a scatterplot using the ggplot2 approach, first ensure:

library(ggplot2) or library(tidyverse)

is loaded. Then, use the geom_smooth(method = "lm") command:

ggplot( Every ggplot2 graphic begins with the ggplot() command, which creates a framework, or coordinate system, that you can add layers to. Without adding any layers, ggplot() produces a blank graphic. YourDataSet,  This is simply the name of your data set, like KidsFeet or starwars. aes( aes stands for aesthetic. Inside of aes(), you place elements that you want to map to the coordinate system, like x and y variables. x =  “x = ” declares which variable will become the x-axis of the graphic, your explanatory variable. Both “x= ” and “y= ” are optional phrases in the ggplot2 syntax. X, This is the explanatory variable of the regression: the variable used to explain the mean of y. It is the name of the “numeric” column of YourDataSet.  y =  “y= ” declares which variable will become the y-axis of the graphic. Y This is the response variable of the regression: the variable that you are interested in predicting. It is the name of a “numeric” column of YourDataSet. ) Closing parenthesis for aes(…) function. ) Closing parenthesis for ggplot(…) function. + The + allows you to add more layers to the framework provided by ggplot(). In this case, you use + to add a geom_point() layer on the next line.
  geom_point() geom_point() allows you to add a layer of points, a scatterplot, over the ggplot() framework. The x and y coordinates are received from the previously specified x and y variables declared in the ggplot() aesthetic. + Here the + is used to add yet another layer to ggplot().
  geom_smooth( geom_smooth() is a smoothing function that you can use to add different lines or curves to ggplot(). In this case, you will use it to add the least-squares regression line to the scatterplot. method =  Use “method = ” to tell geom_smooth() that you are going to declare a specific smoothing function, or method, to alter the line or curve. “lm”, lm stands for linear model. Using method = “lm” tells geom_smooth() to fit a least-squares regression line onto the graphic. The regression line is modeled using y ~ x, which variables were declared in the initial ggplot() aesthetic. There are several other methods that could be used here.  se = FALSE se stands for “standard error”. Specifying FALSE turns this feature off. When TRUE, a gray band showing the “confidence band” for the regression is shown. Unless you know how to interpret this confidence band, leave it turned off. ) Closing parenthesis for the geom_smooth() function.

There are a number of ways to customize the appearance of the regression line:

ggplot( Every ggplot2 graphic begins with the ggplot() command, which creates a framework, or coordinate system, that you can add layers to. Without adding any layers, ggplot() produces a blank graphic. cars,  This is simply the name of your data set, like KidsFeet or starwars. aes( aes stands for aesthetic. Inside of aes(), you place elements that you want to map to the coordinate system, like x and y variables. x =  “x = ” declares which variable will become the x-axis of the graphic, your explanatory variable. Both “x= ” and “y= ” are optional phrases in the ggplot2 syntax. speed,  This is the explanatory variable of the regression: the variable used to explain the mean of y. It is the name of a “numeric” column of the cars data set. y =  “y= ” declares which variable will become the y-axis of the graphic. dist This is the response variable of the regression: the variable that you are interested in predicting. It is the name of a “numeric” column of the cars data set. ) Closing parenthesis for aes(…) function. ) Closing parenthesis for ggplot(…) function.  + The + allows you to add more layers to the framework provided by ggplot(). In this case, you use + to add a geom_point() layer on the next line.
  geom_point() geom_point() allows you to add a layer of points, a scatterplot, over the ggplot() framework. The x and y coordinates are received from the previously specified x and y variables declared in the ggplot() aesthetic.  + Here the + is used to add yet another layer to ggplot().
  geom_smooth( geom_smooth() is a smoothing function that you can use to add different lines or curves to ggplot(). In this case, you will use it to add the least-squares regression line to the scatterplot. method =  Use “method = ” to tell geom_smooth() that you are going to declare a specific smoothing function, or method, to alter the line or curve. “lm”, lm stands for linear model. Using method = “lm” tells geom_smooth() to fit a least-squares regression line onto the graphic. The regression line is modeled using y ~ x, which variables were declared in the initial ggplot() aesthetic.  size = 2, Use size = 2 to adjust the thickness of the line to size 2.  color = “orange”, Use color = “orange” to change the color of the line to orange.
  linetype = “dashed”, Use linetype = “dashed” to change the solid line to a dashed line. Some linetype options include “dashed”, “dotted”, “longdash”, “dotdash”, etc.  se = FALSE se stands for “standard error”. Specifying FALSE turns this feature off. When TRUE, a gray band showing the “confidence band” for the regression is shown. Unless you know how to interpret this confidence band, leave it turned off. ) Closing parenthesis for the geom_smooth() function.

In addition to customizing the regression line, you can customize the points, add points, add lines, and much more.

ggplot( Every ggplot2 graphic begins with the ggplot() command, which creates a framework, or coordinate system, that you can add layers to. Without adding any layers, ggplot() produces a blank graphic. cars,  This is simply the name of your data set, like KidsFeet or starwars. aes( aes stands for aesthetic. Inside of aes(), you place elements that you want to map to the coordinate system, like x and y variables. x =  “x = ” declares which variable will become the x-axis of the graphic, your explanatory variable. Both “x= ” and “y= ” are optional phrases in the ggplot2 syntax. speed,  This is the explanatory variable of the regression: the variable used to explain the mean of y. It is the name of a “numeric” column of the cars data set. y =  “y= ” declares which variable will become the y-axis of the graphic. dist This is the response variable of the regression: the variable that you are interested in predicting. It is the name of a “numeric” column of the cars data set. ) Closing parenthesis for aes(…) function. ) Closing parenthesis for ggplot(…) function.  + The + allows you to add more layers to the framework provided by ggplot(). In this case, you use + to add a geom_point() layer on the next line.
  geom_point( geom_point() allows you to add a layer of points, a scatterplot, over the ggplot() framework. The x and y coordinates are received from the previously specified x and y variables declared in the ggplot() aesthetic. size = 1.5, Use size = 1.5 to change the size of the points.  color = “skyblue”, Use color = “skyblue” to change the color of the points to Brother Saunders’ favorite color.  alpha = 0.5 Use alpha = 0.5 to change the transparency of the points to 0.5. ) Closing parenthesis of geom_point() function.  + The + allows you to add more layers to the framework provided by ggplot().
  geom_smooth( geom_smooth() is a smoothing function that you can use to add different lines or curves to ggplot(). In this case, you will use it to add the least-squares regression line to the scatterplot. method =  Use “method = ” to tell geom_smooth() that you are going to declare a specific smoothing function, or method, to alter the line or curve. “lm”, lm stands for linear model. Using method = “lm” tells geom_smooth() to fit a least-squares regression line onto the graphic. The regression line is modeled using y ~ x, which variables were declared in the initial ggplot() aesthetic.  color = “navy”, Use color = “navy” to change the color of the line to navy blue.  size = 1.5, Use size = 1.5 to adjust the thickness of the line to 1.5.  se = FALSE se stands for “standard error”. Specifying FALSE turns this feature off. When TRUE, a gray band showing the “confidence band” for the regression is shown. Unless you know how to interpret this confidence band, leave it turned off. ) Closing parenthesis of geom_smooth() function.  + The + allows you to add more layers to the framework provided by ggplot().
  geom_hline( Use geom_hline() to add a horizontal line at a specified y-intercept. You can also use geom_vline(xintercept = some_number) to add a vertical line to the graph. yintercept = Use “yintercept =” to tell geom_hline() that you are going to declare a y intercept for the horizontal line.  75 75 is the value of the y-intercept. , color = “firebrick” Use color = “firebrick” to change the color of the horizontal line to firebrick red. , size = 1, Use size = 1 to adjust the thickness of the horizontal line to size 1.
             linetype = “longdash” Use linetype = “longdash” to change the solid line to a dashed line with longer dashes. Some linetype options include “dashed”, “dotted”, “longdash”, “dotdash”, etc. , alpha = 0.5 Use alpha = 0.5 to change the transparency of the horizontal line to 0.5. ) Closing parenthesis of geom_hline function.  + The + allows you to add more layers to the framework provided by ggplot().
  geom_segment( geom_segment() allows you to add a line segment to ggplot() by using specified start and end points. x = “x =” tells geom_segment() that you are going to declare the x-coordinate for the starting point of the line segment.  14, 14 is a number on the x-axis of your graph. It is the x-coordinate of the starting point of the line segment.  y = “y =” tells geom_segment() that you are going to declare the y-coordinate for the starting point of the line segment.  75, 75 is a number on the y-axis of your graph. It is the y-coordinate of the starting point of the line segment.  xend = “xend =” tells geom_segment() that you are going to declare the x-coordinate for the end point of the line segment.  14, 14 is a number on the x-axis of your graph. It is the x-coordinate of the end point of the line segment.  yend = “yend =” tells geom_segment() that you are going to declare the y-coordinate for the end point of the line segment.  38, 38 is a number on the y-axis of your graph. It is the y-coordinate of the end point of the line segment.
               size = 1 Use size = 1 to adjust the thickness of the line segment. , color = “lightgray” Use color = “lightgray” to change the color of the line segment to light gray. , linetype = “longdash” Use linetype = “longdash” to change the solid line segment to a dashed one. Some linetype options include “dashed”, “dotted”, “longdash”, “dotdash”, etc. ) Closing parenthesis for geom_segment() function.  + The + allows you to add more layers to the framework provided by ggplot().
  geom_point( geom_point() can also be used to add individual points to the graph. Simply declare the x and y coordinates of the point you want to plot. x = “x =” tells geom_point() that you are going to declare the x-coordinate for the point.  14, 14 is a number on the x-axis of your graph. It is the x-coordinate of the point.  y = “y =” tells geom_point() that you are going to declare the y-coordinate for the point.  75 75 is a number on the y-axis of your graph. It is the y-coordinate of the point. , size = 3 Use size = 3 to make the point stand out more. , color = “firebrick” Use color = “firebrick” to change the color of the point to firebrick red. ) Closing parenthesis of the geom_point() function.  + The + allows you to add more layers to the framework provided by ggplot().
  geom_text( geom_text() allows you to add customized text anywhere on the graph. It is very similar to the base R equivalent, text(…). x = “x =” tells geom_text() that you are going to declare the x-coordinate for the text.  14, 14 is a number on the x-axis of your graph. It is the x-coordinate of the text.  y = “y =” tells geom_text() that you are going to declare the y-coordinate for the text.  84, 84 is a number on the y-axis of your graph. It is the y-coordinate of the text.  label = “label =” tells geom_text() that you are going to give it the label.  “My Point (14, 75)”, “My Point (14, 75)” is the text that will appear on the graph.
            color = “navy” Use color = “navy” to change the color of the text to navy blue. , size = 3 Use size = 3 to change the size of the text. ) Closing parenthesis of the geom_text() function.  + The + allows you to add more layers to the framework provided by ggplot().
  theme_minimal() Add a minimalistic theme to the graph. There are many other themes that you can try out.
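Assembled as one runnable sketch (requires the ggplot2 package; all of the customization values, including the point at (14, 75), are arbitrary choices for illustration):

```r
library(ggplot2)

ggplot(cars, aes(x = speed, y = dist)) +
  geom_point(size = 1.5, color = "skyblue", alpha = 0.5) +       # customized points
  geom_smooth(method = "lm", color = "navy", size = 1.5,
              se = FALSE) +                                       # regression line
  geom_hline(yintercept = 75, color = "firebrick", size = 1,
             linetype = "longdash", alpha = 0.5) +                # horizontal line
  geom_segment(x = 14, y = 75, xend = 14, yend = 38,
               size = 1, color = "lightgray",
               linetype = "longdash") +                           # line segment
  geom_point(x = 14, y = 75, size = 3, color = "firebrick") +     # one extra point
  geom_text(x = 14, y = 84, label = "My Point (14, 75)",
            color = "navy", size = 3) +                           # text label
  theme_minimal()                                                 # minimal theme
```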


Accessing Parts of the Regression

Finally, note that the mylm object contains several items, which you can list with names(mylm), including:

mylm$coefficients Contains two values. The first is the estimated \(y\)-intercept. The second is the estimated slope.

mylm$residuals Contains the residuals from the regression in the same order as the actual dataset.

mylm$fitted.values The values of \(\hat{Y}\) in the same order as the original dataset.

mylm$… several other things that will not be explained here.
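For example, with the cars regression from earlier:

```r
mylm <- lm(dist ~ speed, data = cars)
mylm$coefficients        # b_0 (the estimated y-intercept) and b_1 (the slope)
head(mylm$residuals)     # one residual per row of the original data
head(mylm$fitted.values) # the Y-hat values, in the original row order
names(mylm)              # everything stored inside mylm
```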


Making Predictions

predict( The R function predict(…) allows you to use an lm(…) object to make predictions for specified x-values. mylm, This is the name of a previously performed lm(…) that was saved into the name mylm <- lm(...).  data.frame( To specify the values of \(x\) that you want to use in the prediction, you have to put those x-values into a data set, or more specifically, a data.frame(…). X= The value for X= should be whatever x-variable name was used in the original regression. For example, if mylm <- lm(dist ~ speed, data=cars) was the original regression, then this code would read speed = instead of X=… Further, the value of \(Xh\) should be some specific number, like speed=12 for example. Xh The value of \(Xh\) should be some specific number, like 12, as in speed=12 for example. ) Closing parenthesis for the data.frame(…) function. ) Closing parenthesis for the predict(…) function.

predict( The R function predict(…) allows you to use an lm(…) object to make predictions for specified x-values. mylm, This is the name of a previously performed lm(…) that was saved into the name mylm <- lm(...).  data.frame( To specify the values of \(x\) that you want to use in the prediction, you have to put those x-values into a data set, or more specifically, a data.frame(…). X= The value for X= should be whatever x-variable name was used in the original regression. For example, if mylm <- lm(dist ~ speed, data=cars) was the original regression, then this code would read speed = instead of X=… Further, the value of \(Xh\) should be some specific number, like speed=12 for example. Xh The value of \(Xh\) should be some specific number, like 12, as in speed=12 for example. ), Closing parenthesis for the data.frame(…) function.  interval= This optional command allows you to specify if the predicted value should be accompanied by either a confidence interval or a prediction interval. “prediction” This specifies that a prediction interval will be included with the predicted value. A 95% prediction interval is expected to capture the \(Y_i\) values of 95% of individuals that have the specific \(X\)-value used in the prediction. ) Closing parenthesis of the predict(…) function.

predict( The R function predict(…) allows you to use an lm(…) object to make predictions for specified x-values. mylm, This is the name of a previously performed lm(…) that was saved into the name mylm <- lm(...).  data.frame( To specify the values of \(x\) that you want to use in the prediction, you have to put those x-values into a data set, or more specifically, a data.frame(…). X= The value for X= should be whatever x-variable name was used in the original regression. For example, if mylm <- lm(dist ~ speed, data=cars) was the original regression, then this code would read speed = instead of X=… Further, the value of \(Xh\) should be some specific number, like speed=12 for example. Xh The value of \(Xh\) should be some specific number, like 12, as in speed=12 for example. ), Closing parenthesis for the data.frame(…) function.  interval= This optional command allows you to specify if the predicted value should be accompanied by either a confidence interval or a prediction interval. “confidence” This specifies that a confidence interval for the prediction should be provided. This is of use whenever your interest is in just estimating the average y-value, not the actual y-values. ) Closing parenthesis of the predict(…) function.
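A concrete sketch with the cars regression, predicting stopping distance at a speed of 12 mph:

```r
mylm <- lm(dist ~ speed, data = cars)
predict(mylm, data.frame(speed = 12))                           # point estimate
predict(mylm, data.frame(speed = 12), interval = "prediction")  # for individuals
predict(mylm, data.frame(speed = 12), interval = "confidence")  # for the mean
```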


Finding Confidence Intervals for Model Parameters

confint( The R function confint(…) allows you to use an lm(…) object to compute confidence intervals for one or more parameters (like \(\beta_0\) or \(\beta_1\)) in your model. mylm, This is the name of a previously performed lm(…) that was saved into the name mylm <- lm(...).  level = “level =” tells the confint(…) function that you are going to declare at what level of confidence you want the interval. The default is “level = 0.95”. If you want to find 95% confidence intervals for your parameters, then just run confint(mylm).  someConfidenceLevel someConfidenceLevel is simply a confidence level you choose when you want something other than a 95% confidence interval. Some examples of appropriate levels include 0.90 and 0.99. ) Closing parenthesis for confint(…) function.
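For example, with the cars regression:

```r
mylm <- lm(dist ~ speed, data = cars)
confint(mylm)                 # 95% intervals for beta_0 and beta_1 (the default)
confint(mylm, level = 0.90)   # 90% intervals instead
```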


Explanation

Linear regression has a rich mathematical theory behind it. This is because it uses a mathematical function and a random error term to describe the regression relation between a response variable \(Y\) and an explanatory variable called \(X\).

Expand each element below to learn more.

Regression Cheat Sheet (Expand)


The Mathematical Model (Expand)

\(Y_i\), \(\hat{Y}_i\), and \(E\{Y_i\}\)


Interpreting the Model Parameters (Expand)

\(\beta_0\) (intercept) and \(\beta_1\) (slope), estimated by \(b_0\) and \(b_1\), interpreted as…


Residuals and Errors (Expand)

\(r_i\), the residual, estimates \(\epsilon_i\), the true error…


Assessing the Fit of a Regression (Expand)

\(R^2\), SSTO, SSR, and SSE…


Residual Plots & Regression Assumptions (Expand)

Residuals vs. fitted-values, Q-Q Plot of the residuals, and residuals vs. order plots…


Estimating the Model Parameters (Expand)

How to get \(b_0\) and \(b_1\): least squares & maximum likelihood…


Estimating the Model Variance (Expand)

Estimating \(\sigma^2\) with MSE…


Transformations (Expand)

\(Y'\), \(X'\), and returning to the original space…


Inference for the Model Parameters (Expand)

t test formulas, sampling distributions, confidence intervals, and F tests…


Prediction and Confidence Intervals for \(\hat{Y}_h\) (Expand)

predict(…, interval=“prediction”)…


Lowess (and Loess) Curves (Expand)

A non-parametric approach to estimating \(E\{Y_i\}\)



Examples: bodyweight, cars


Multiple Linear Regression

Multiple regression allows for more than one explanatory variable to be included in the modeling of the expected value of the quantitative response variable \(Y_i\). There are infinitely many possible multiple regression models to choose from. Here are a few “basic” models that work as building blocks to more complicated models.

Overview

Select a model to see interpretation details, an example, and R Code help.

\[ Y_i = \overbrace{\underbrace{\beta_0 + \beta_1 X_i}_{E\{Y_i\}}}^\text{Simple Model} + \epsilon_i \]


The Simple Linear Regression model uses a single x-variable once: \(X_i\).

Parameter Effect
\(\beta_0\) Y-intercept of the Model
\(\beta_1\) Slope of the line

\[ Y_i = \overbrace{\underbrace{\beta_0 + \beta_1 X_i + \beta_2 X_i^2}_{E\{Y_i\}}}^\text{Quadratic Model} + \epsilon_i \]


The Quadratic model uses the same \(X\)-variable twice, once with a \(\beta_1 X_i\) term and once with a \(\beta_2 X_i^2\) term. The \(X_i^2\) term is called the “quadratic” term.

Parameter Effect
\(\beta_0\) Y-intercept of the Model.
\(\beta_1\) Controls the x-position of the vertex of the parabola by \(\frac{-\beta_1}{2\cdot\beta_2}\).
\(\beta_2\) Controls the concavity and “steepness” of the Model: negative values face down, positive values face up; large values imply “steeper” parabolas and low values imply “flatter” parabolas. Also involved in the position of the vertex, see \(\beta_1\)’s explanation.

(Show Example…)
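A minimal sketch of fitting the quadratic model in R (mtcars and the variable hp are illustrative choices, not prescribed by the text above). Note that I() is required so that ^2 is treated as arithmetic rather than formula syntax:

```r
# Fit the quadratic model Y_i = b0 + b1*hp + b2*hp^2 + e_i
quad.lm <- lm(mpg ~ hp + I(hp^2), data = mtcars)
coef(quad.lm)  # b0, b1, b2

# x-position of the vertex of the fitted parabola: -b1 / (2 * b2)
b <- coef(quad.lm)
-b["hp"] / (2 * b["I(hp^2)"])
```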

\[ Y_i = \overbrace{\underbrace{\beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{1i} X_{2i}}_{E\{Y_i\}}}^\text{Two-lines Model} + \epsilon_i \]

\[ X_{2i} = \left\{\begin{array}{ll} 1, & \text{Group B} \\ 0, & \text{Group A} \end{array}\right. \]

The so-called “two-lines” model uses a quantitative \(X_{1i}\) variable and a 0,1 indicator variable \(X_{2i}\). It is a basic example of how a “dummy variable” or “indicator variable” can be used to turn qualitative variables into quantitative terms. In this case, the indicator variable \(X_{2i}\), which is either 0 or 1, produces two separate lines: one line for Group A, and one line for Group B.

Parameter Effect
\(\beta_0\) Y-intercept of the Model.
\(\beta_1\) Controls the slope of the “base-line” of the model, the “Group 0” line.
\(\beta_2\) Controls the change in y-intercept for the second line in the model as compared to the y-intercept of the “base-line” line.
\(\beta_3\) Called the “interaction” term. Controls the change in the slope for the second line in the model as compared to the slope of the “base-line” line.

(Show Example…)
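A sketch of the two-lines model in R, using mtcars where am is already a 0,1 indicator (0 = automatic, 1 = manual transmission), so it can play the role of \(X_{2i}\) directly. This is the same fit shown as Problem 3 in the Class Examples below:

```r
# Fit the two-lines model: quantitative qsec, indicator am, and their interaction
two.lm <- lm(mpg ~ qsec + am + qsec:am, data = mtcars)
b <- coef(two.lm)  # b0, b1, b2, b3

# Group A (am = 0) line:  intercept b0,      slope b1
# Group B (am = 1) line:  intercept b0 + b2, slope b1 + b3
c(interceptA = unname(b[1]), slopeA = unname(b[2]),
  interceptB = unname(b[1] + b[3]), slopeB = unname(b[2] + b[4]))
```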

\[ Y_i = \overbrace{\underbrace{\beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{1i}X_{2i}}_{E\{Y_i\}}}^\text{3D Model} + \epsilon_i \]

The so-called “3D” regression model uses two different quantitative x-variables, an \(X_{1i}\) and an \(X_{2i}\). Unlike the two-lines model where \(X_{2i}\) could only be a 0 or a 1, this \(X_{2i}\) variable is quantitative, and can take on any quantitative value.

Parameter Effect
\(\beta_0\) Y-intercept of the Model
\(\beta_1\) Slope of the line in the \(X_1\) direction.
\(\beta_2\) Slope of the line in the \(X_2\) direction.
\(\beta_3\) Interaction term that allows the model, which is a plane in three-dimensional space, to “bend”. If this term is zero, then the regression surface is just a flat plane.

(Show Example…)
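A sketch of the 3D model in R (hp and wt are illustrative quantitative predictors from mtcars, not prescribed by the text):

```r
# Two quantitative x-variables plus their interaction give a "bendable" plane
threeD.lm <- lm(mpg ~ hp + wt + hp:wt, data = mtcars)
coef(threeD.lm)  # b0, b1, b2, and the interaction term b3
```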


The coefficient \(\beta_j\) is interpreted as the change in the expected value of \(Y\) for a unit increase in \(X_{j}\), holding all other variables constant, for \(j=1,\ldots,p-1\). Note that this interpretation only applies when \(X_j\) does not also appear in an interaction or polynomial term of the model.

See the Explanation tab for details about possible hypotheses here.


R Instructions

NOTE: These are general R Commands for all types of multiple linear regression. See the “Overview” section for details about the R Commands for a specific multiple linear regression model.

Console Help Command: ?lm()

Finding Variables

pairs( The pairs(…) function creates a scatterplot matrix showing every pair of columns in the supplied data set. cbind( cbind(…) stands for “column bind” and combines its arguments, column by column, into a single data set. Res =  Res is a name you choose for the new column that will hold the residuals. mylm$ The $ pulls a specific piece out of the previously saved lm(…) object named mylm. residuals,  residuals is the piece being pulled out: the residuals from the current regression. YourDataSet), YourDataSet is the name of your data set, whose columns are combined with the Res column.  panel =  “panel =” tells pairs(…) which function to use when drawing each panel of the scatterplot matrix. panel.smooth,  panel.smooth adds a smoothed trend curve to each scatterplot, making patterns in the residuals easier to spot.  col =  “col =” tells pairs(…) how to color the points. as.factor( as.factor(…) turns a variable into a factor so that each of its unique values is assigned its own color. YourDataSet$ YourDataSet is again the name of your data set, and the $ selects one column from it. X)) X is the name of the (typically qualitative) column you want to color by. The two closing parentheses complete as.factor(…) and pairs(…).
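For instance, a hedged sketch using mtcars (the model and coloring variable are illustrative choices):

```r
# Fit a current model, then plot its residuals against every column of the
# data set to look for additional x-variables that explain leftover patterns.
mylm <- lm(mpg ~ qsec, data = mtcars)
pairs(cbind(Res = mylm$residuals, mtcars),
      panel = panel.smooth,
      col = as.factor(mtcars$am))
```

A column of the matrix where Res shows a clear trend suggests that variable may be worth adding to the model.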

Perform the Regression

Everything is the same as in simple linear regression except that more variables are allowed in the call to lm().

mylm <- lm( mylm is some name you come up with to store the results of the lm() test. Note that lm() stands for “linear model.” Y Y must be a “numeric” vector of the quantitative response variable.  ~  The tilde “~” separates the response variable \(Y\) (on its left) from the explanatory variables (on its right). X1 + X2 X1 and X2 are the explanatory variables. These can either be quantitative or qualitative. Note that R treats “numeric” variables as quantitative and “character” or “factor” variables as qualitative. R will automatically recode qualitative variables to become “numeric” variables using a 0,1 encoding. See the Explanation tab for details.  + X1:X2 X1:X2 is called the interaction term. See the Explanation tab for details.  + …, The … emphasizes that as many explanatory variables as are desired can be included in the model.  data = YourDataSet) YourDataSet is the name of your data set.
summary( The summary(…) function displays the estimated coefficients, their standard errors, t tests and p-values, the residual standard error, and \(R^2\) for a regression. mylm mylm is the name of the previously saved lm(…) object to be summarized. ) Closing parenthesis for summary(…) function.
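For instance (mtcars is an illustrative data set; this is the same model fit shown as Problem 1 in the Class Examples below):

```r
# Multiple regression of mpg on qsec (quantitative) and am (0,1 indicator)
mylm <- lm(mpg ~ qsec + am, data = mtcars)
summary(mylm)  # coefficients, t tests, residual standard error, R-squared
```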


Plotting the Regression Lines

ggplot( ggplot() provides a framework which you add layers to. Layers usually take the form of different geoms, like geom_point() or geom_boxplot(). Running ggplot() without any layers simply produces a blank graph. data =  “data =” tells ggplot() that you are going to declare the data set that you will use in the graph. YourDataSet, YourDataSet is literally the name of the data set that you want to use to make the graph, like mtcars or KidsFeet.  aes( aes stands for aesthetic. Inside of aes(), you place elements that you want to map to the framework, like x and y variables. x =  “x =” declares which variable will become the x-axis of the graphic, your explanatory variable. Both “x =” and “y =” are optional phrases in the ggplot2 syntax. \(X_{1}\), \(X_{1}\) is an explanatory variable of the regression, and is typically quantitative or numeric. It is the name of a “numeric” column of YourDataSet.  y =  “y =” declares which variable will become the y-axis of the graphic. \(Y\), \(Y\) is the response variable of the regression: the variable that you are interested in predicting. It is the name of a “numeric” column of YourDataSet.  color = “color =” tells ggplot() that you are about to declare the way in which you will color the elements in the graph. Notice that, in this case, the color argument goes inside of the aesthetic. This is because you are coloring by a variable.  factor( Using factor() tells R to treat qualitative variables as factors, where each unique value or category of the variable is a level. \(X_{2}\) \(X_{2}\) is a qualitative explanatory variable of the regression. It is the name of a column of YourDataSet. Casting \(X_{2}\) as a factor() ensures that each of its categories is given its own color and its own regression line. ) Closing parenthesis of factor(). ) Closing parenthesis of aes(). ) Closing parenthesis of ggplot().  + The + allows you to add more layers to the framework provided by ggplot(). In this case, you use + to add a geom_point() layer on the next line.
  geom_point() geom_point() allows you to add a layer of points, a scatterplot, over the ggplot() framework. The x and y coordinates are received from the previously specified x and y variables declared in the ggplot() aesthetic.  + Here the + is used to add yet another layer to ggplot().
  geom_smooth( geom_smooth() is a smoothing function that you can use to add lines or curves to ggplot(). In this case, you will use it to add the least-squares regression line to the scatterplot. method = Use “method =” to tell geom_smooth() that you are going to declare a specific smoothing function, or method, to alter the line(s) or curve(s).  “lm”, lm stands for linear model. Using method = “lm” tells geom_smooth() to fit a least-squares regression line (or multiple lines) to the data. In this case, a separate line is fit for each group defined by \(X_2\), using the x and y variables declared in the initial ggplot() aesthetic, which is equivalent to the two-lines model \(Y\) ~ \(X_1\) + \(X_2\) + \(X_1\):\(X_2\). There are several other methods that could be used here.  se = FALSE se stands for “standard error.” Specifying FALSE turns this feature off. When TRUE, a gray band showing the “confidence band” for the regression is shown. Unless you know how to interpret this confidence band, leave it turned off. ) Closing parenthesis for geom_smooth().

Customize the look:

ggplot( ggplot() provides a framework which you add layers to. Layers usually take the form of different geoms, like geom_point() or geom_boxplot(). Running ggplot() without any layers simply produces a blank graph. data =  “data =” tells ggplot() that you are going to declare the data set that you will use in the graph. YourDataSet, YourDataSet is literally the name of the data set that you want to use to make the graph, like mtcars or KidsFeet.  aes( aes stands for aesthetic. Inside of aes(), you place elements that you want to map to the framework, like x and y variables. x =  “x =” declares which variable will become the x-axis of the graphic, your explanatory variable. Both “x =” and “y =” are optional phrases in the ggplot2 syntax. \(X_{1}\), \(X_{1}\) is an explanatory variable of the regression, and is typically quantitative or numeric. It is the name of a “numeric” column of YourDataSet.  y =  “y =” declares which variable will become the y-axis of the graphic. \(Y\), \(Y\) is the response variable of the regression: the variable that you are interested in predicting. It is the name of a “numeric” column of YourDataSet.  color = “color =” tells ggplot() that you are about to declare the way in which you will color the elements in the graph. Notice that, in this case, the color argument goes inside of the aesthetic. This is because you are coloring by a variable.  factor( Using factor() tells R to treat qualitative variables as factors, where each unique value or category of the variable is a level. \(X_{2}\) \(X_{2}\) is a qualitative explanatory variable of the regression. It is the name of a column of YourDataSet. Casting \(X_{2}\) as a factor() ensures that each of its categories is given its own color and its own regression line. ) Closing parenthesis of factor(). ) Closing parenthesis of aes(). ) Closing parenthesis of ggplot().  + The + allows you to add more layers to the framework provided by ggplot(). In this case, you use + to add a geom_point() layer on the next line.
  geom_point() geom_point() allows you to add a layer of points, a scatterplot, over the ggplot() framework. The x and y coordinates are received from the previously specified x and y variables declared in the ggplot() aesthetic.  + Here the + is used to add yet another layer to ggplot().
  geom_smooth( geom_smooth() is a smoothing function that you can use to add lines or curves to ggplot(). In this case, you will use it to add the least-squares regression line to the scatterplot. method = Use “method =” to tell geom_smooth() that you are going to declare a specific smoothing function, or method, to alter the line(s) or curve(s).  “lm”, lm stands for linear model. Using method = “lm” tells geom_smooth() to fit a least-squares regression line (or multiple lines) to the data. In this case, a separate line is fit for each group defined by \(X_2\), using the x and y variables declared in the initial ggplot() aesthetic, which is equivalent to the two-lines model \(Y\) ~ \(X_1\) + \(X_2\) + \(X_1\):\(X_2\). There are several other methods that could be used here.  se = FALSE se stands for “standard error.” Specifying FALSE turns this feature off. When TRUE, a gray band showing the “confidence band” for the regression is shown. Unless you know how to interpret this confidence band, leave it turned off. ) Closing parenthesis for geom_smooth().  + Here the + is used to add yet another layer to ggplot().
  scale_color_manual( This function allows you to override the default ggplot colors and specify your own colors. values = This tells scale_color_manual() that you are about to declare the values for the custom colors.  c( This is the combine, or “backpack” function, that combines values into a vector or list. Put the necessary number of color values in c(). “color1”, “color2” These are literally the colors that you want to show up in your plot, like “skyblue” or “orange.” ) Closing parenthesis of the c() function. ) Closing parenthesis of the scale_color_manual() function.  + Here the + is used to add yet another layer to ggplot().
  labs( labs stands for labels. Modify the labels on your graph with the labs() function. title = “YourTitle”, Declare the title with title = and then write the title in quotes.
    subtitle = “YourSubtitle”, Declare a subtitle with subtitle = and then write the subtitle in quotes.
    caption = “Caption”, Declare a caption with caption = and then write the caption in quotes. By default, captions appear at the bottom right of the graphic.
    x = “X”, Use x = to declare the label for the x-axis of your graph. “X” is the actual label that you want to appear on the x-axis.
    y = “Y”, Use y = to declare the label for the y-axis of your graph. “Y” is the actual label that you want to appear on the y-axis.
    color = “Legend Title” Use color = to tell labs() that you are going to modify the color legend title. ) Closing parenthesis for labs().  + Here the + is used to add yet another layer to ggplot().
  theme_bw() Add a theme to ggplot().


Making Predictions

predict( The R function predict(…) allows you to use an lm(…) object to make predictions for specified x-values. mylm, This is the name of a previously performed lm(…) that was saved into the name mylm <- lm(...). newdata = data.frame( To specify the values of \(x\) that you want to use in the prediction, you have to put those x-values into a data set, or more specifically, a data.frame(…). \(X_1\)= The value for \(X_1\)= should be whatever x-variable name was used in the original regression. For example, if mylm <- lm(mpg ~ hp + am + hp:am, data=mtcars) was the original regression, then this code would read hp = instead of X1 =. \(X_{1h}\), The value of \(X_{1h}\) should be some specific number, like 123, as in hp=123 for example. \(X_2\)= This is the name of the second x-variable, say am. \(X_{2h}\)) Since the am column can only be a 1 or 0, we would try am=1 for example, or am=0. ) Closing parenthesis for predict(…).

predict( The R function predict(…) allows you to use an lm(…) object to make predictions for specified x-values. mylm, This is the name of a previously performed lm(…) that was saved into the name mylm <- lm(...). data.frame( To specify the values of \(x\) that you want to use in the prediction, you have to put those x-values into a data set, or more specifically, a data.frame(…). \(X_1\)= The value for \(X_1\)= should be whatever x-variable name was used in the original regression. For example, if mylm <- lm(mpg ~ hp + am + hp:am, data=mtcars) was the original regression, then this code would read hp = instead of X1 =. \(X_{1h}\), The value of \(X_{1h}\) should be some specific number, like 123, as in hp=123 for example. \(X_2\)= This is the name of the second x-variable, say am. \(X_{2h}\)), The value of \(X_{2h}\) should again be a specific number; since the am column can only be a 1 or 0, we would use am=1 or am=0.  interval = “prediction”) Specifying interval = “prediction” tells predict(…) to return a prediction interval, an interval that estimates where the y-value of a single new individual with these x-values is likely to fall.

predict( The R function predict(…) allows you to use an lm(…) object to make predictions for specified x-values. mylm, This is the name of a previously performed lm(…) that was saved into the name mylm <- lm(...). data.frame( To specify the values of \(x\) that you want to use in the prediction, you have to put those x-values into a data set, or more specifically, a data.frame(…). \(X_1\)= The value for \(X_1\)= should be whatever x-variable name was used in the original regression. For example, if mylm <- lm(mpg ~ hp + am + hp:am, data=mtcars) was the original regression, then this code would read hp = instead of X1 =. \(X_{1h}\), The value of \(X_{1h}\) should be some specific number, like 123, as in hp=123 for example. \(X_2\)= This is the name of the second x-variable, say am. \(X_{2h}\)), The value of \(X_{2h}\) should again be a specific number; since the am column can only be a 1 or 0, we would use am=1 or am=0.  interval = “confidence”) Specifying interval = “confidence” tells predict(…) to return a confidence interval for the average y-value, \(E\{Y_h\}\), of all individuals with these x-values. Confidence intervals are narrower than prediction intervals.
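Putting the three calls together as a hedged sketch (mtcars, qsec = 18, and am = 1 are example choices, not prescribed values):

```r
mylm <- lm(mpg ~ qsec + am, data = mtcars)

predict(mylm, data.frame(qsec = 18, am = 1))                           # point estimate
predict(mylm, data.frame(qsec = 18, am = 1), interval = "prediction")  # one new individual
predict(mylm, data.frame(qsec = 18, am = 1), interval = "confidence")  # the mean E{Y_h}
```

Both interval calls return columns fit, lwr, and upr; the prediction interval is always the wider of the two.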


Explanation

Assessing the Model Fit (Expand)

\(R^2\), adjusted \(R^2\), AIC, BIC…


Model Validation (Expand)

Verifying a model’s ability to generalize to new data…


Interpretation (Expand)

\(\beta_j\) is the change in the average y-value…


Added Variable Plots (Expand)

When to add another \(X\)-variable to the model…


Inference for the Model Parameters (Expand)

t Tests and F tests in multiple regression…




Class Examples

Now let’s apply your knowledge to three different regression summaries from three different, but similar, models.

Use the starter code in each example.

Problem 1: Equal Slopes Model

(different intercepts)

## 
## Call:
## lm(formula = mpg ~ qsec + am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.3447 -2.7699  0.2938  2.0947  6.9194 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -18.8893     6.5970  -2.863  0.00771 ** 
## qsec          1.9819     0.3601   5.503 6.27e-06 ***
## am            8.8763     1.2897   6.883 1.46e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.487 on 29 degrees of freedom
## Multiple R-squared:  0.6868, Adjusted R-squared:  0.6652 
## F-statistic:  31.8 on 2 and 29 DF,  p-value: 4.882e-08

Problem 2: Equal Intercepts Model

(different slopes)

## 
## Call:
## lm(formula = mpg ~ qsec + qsec:am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.3306 -2.2453  0.1917  2.3112  6.9815 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -15.30050    6.22260  -2.459   0.0201 *  
## qsec          1.78149    0.34186   5.211 1.41e-05 ***
## qsec:am       0.50958    0.06994   7.286 5.04e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.364 on 29 degrees of freedom
## Multiple R-squared:  0.7086, Adjusted R-squared:  0.6885 
## F-statistic: 35.26 on 2 and 29 DF,  p-value: 1.716e-08

Problem 3: Full Model

(different slopes & different intercepts)

## 
## Call:
## lm(formula = mpg ~ qsec + am + qsec:am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.4551 -1.4331  0.1918  2.2493  7.2773 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  -9.0099     8.2179  -1.096  0.28226   
## qsec          1.4385     0.4500   3.197  0.00343 **
## am          -14.5107    12.4812  -1.163  0.25481   
## qsec:am       1.3214     0.7017   1.883  0.07012 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.343 on 28 degrees of freedom
## Multiple R-squared:  0.722,  Adjusted R-squared:  0.6923 
## F-statistic: 24.24 on 3 and 28 DF,  p-value: 6.129e-08
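Starter code that reproduces the three fits above:

```r
lm1 <- lm(mpg ~ qsec + am, data = mtcars)            # Problem 1: equal slopes
lm2 <- lm(mpg ~ qsec + qsec:am, data = mtcars)       # Problem 2: equal intercepts
lm3 <- lm(mpg ~ qsec + am + qsec:am, data = mtcars)  # Problem 3: full model
summary(lm1); summary(lm2); summary(lm3)
```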

Examples: Civic vs. Corolla, Cadillacs