3.3 Modeling Linear Relationships with Regression

Learning Objectives

Upon completion of this section, you should be able to

Construct a scatter plot
Calculate the correlation coefficient to describe the linear relationship between two quantitative variables
Create a linear regression model between two quantitative variables using a spreadsheet
Interpret a linear regression model in applications

Modeling Data

When we have bivariate data (data with two quantitative variables) that has a linear relationship, it is rare for all the data points to fall exactly on a straight line. If they did, we could use a little algebra to find the equation of the line that goes through all the points (as we did earlier). Since this typically does not happen, we have to find another way to model the data with a linear equation. One method that you may have used in the past is picking two points whose line appears to fit the data best. Since two people can come up with different solutions for the same data set by picking different points, a better method is needed. The method we are going to look at fits the data by minimizing the sum of the squared vertical distances from the points to the line. The resulting line is called a Line of Best Fit, Least Squares Regression Line, or Trendline.

Warning: Always check first that the relationship between the two variables looks linear before finding the regression line. If it doesn't look linear, you may want to use a different type of equation to model the data. Most spreadsheet programs, graphing calculators, and statistical software packages offer a variety of functions to choose from for the model. We will start this section by constructing scatter plots to visually inspect the relationship between the variables.

Scatter Plots

Before we take up the discussion of linear regression and correlation, we need to examine a way to display the relation between two quantitative variables $x$ and $y$ . By way of illustration, let's consider something with which we may be familiar: height and weight. If we were to randomly select several 25-year-old men and measure the height and weight of each one, we might obtain a collection of $(x, y)$ pairs something like this:

Table 1. Sample of 25 year old men's height and weight.
Height (in)	68	69	70	70	71	72	72	72	73	73	74	75
Weight (lb)	151	146	157	164	171	160	163	180	170	175	178	188

A plot of these data is shown below in "Plot of Height and Weight Pairs." Such a plot is called a scatter diagram or scatter plot. Looking at the plot, it is evident that there is a linear relationship between height $x$ and weight $y$ , but not a perfect one. The points appear to follow a line, but not exactly. There is an element of randomness present. We also observe that as the height increases, so does the weight. We say there is a positive correlation between the two variables.

Plot of Height and Weight Pairs Google Sheet for data

Please be aware that certain images and Google documents may not be properly viewed on mobile devices.

Scatter Plot

A scatter plot is a visualization of data points on a Cartesian coordinate system from two variables that are connected to each other (observations from the same subject in a study). Each pair of data values is represented by a symbol on the coordinate system (typically a dot).

To construct a scatter plot, pick one variable to be represented on the x-axis; the values of the other variable will be represented on the y-axis. If one variable would explain the change in the observed value of the other variable, we call it the independent variable (or explanatory variable) and put it on the x-axis. Plot each ordered pair in the data set as a point on the Cartesian coordinate system.

Example 1

Construct a scatter plot for the data given in the table below where the age $x$ in months and the number of words in the vocabulary $y$ is measured for 5 children.

Example 1 Table 1. Sample of five children's ages and number of words in the vocabulary.
$x$ , age in months	12	13	16	16	18
$y$ , number of words in vocabulary	5	9	18	22	31

Solution

Each column represents an ordered pair for the data set. We would expect the age of the child to explain the number of words, so age is the independent (explanatory) variable and it is appropriate to put it on the $x$ -axis. Plot the following data points: $(12, 5), (13, 9), (16, 18), (16, 22), (18, 31)$ .

When constructing the scatter plot, it is important to include labels on the x- and y-axes to indicate what the values represent. You also want to make sure the scale is consistent on each axis (equal distance between two values represents the same distance in each case on the number line). On the graph below we can see a close linear relationship between the age of the child and the number of words in the vocabulary with a positive slope.

Scatter plot of vocabulary size versus age: Google Sheet for Example 1


Not all scatter plots show linear relationships. The scatterplot below shows the results of an experiment conducted by Galileo on projectile motion. In the experiment, Galileo rolled balls down an incline and measured how far they traveled as a function of the release height. It is clear from the scatterplot that the relationship between "Release Height" and "Distance Traveled" is not described well by a straight line: if you drew a line connecting the lowest point and the highest point, all of the remaining points would be above the line. The data are better fit by a parabola.

description in text above — Galileo's data showing a non-linear relationship.

Scatter plots that show linear relationships between variables can differ in several ways including the slope of the line about which they cluster and how tightly the points cluster about the line. A statistical measure of the strength of the relationship between two quantitative variables that takes these factors into account is the subject of the section "Correlation Coefficient."

Correlation Coefficient

The official name for the correlation coefficient is a bit of a mouthful: the "Pearson product-moment correlation coefficient," developed by Karl Pearson in the 1900s. The correlation coefficient is a measure of the strength of the linear relationship between two variables. It is referred to as Pearson's correlation or simply as the correlation coefficient. If the relationship between the variables is not linear, then the correlation coefficient does not adequately represent the strength of the relationship between the variables. It is always important to check the actual scatter plot to see if it is appropriate to calculate the correlation coefficient.

The symbol for Pearson's correlation is the Greek letter "rho" or "ρ" when it is measured in the population and "r" when it is measured in a sample. Because we will be dealing almost exclusively with samples, we will use r to represent Pearson's correlation unless otherwise noted.

Pearson's r can range from -1 to 1. An r of -1 indicates a perfect negative linear relationship between variables, an r of 0 indicates no linear relationship between variables, and an r of 1 indicates a perfect positive linear relationship between variables.

The scatterplots below, "Linear Relationships of Varying Strengths," illustrate linear relationships of varying strengths between two variables $x$ and $y$ . It is visually apparent that in the situation in panel (a), $x$ could serve as a useful predictor of $y$ ; it would be less useful in the situation illustrated in panel (b); and in the situation of panel (c) the linear relationship is so weak as to be practically nonexistent. The correlation coefficient is a number computed directly from the data that measures the strength of the linear relationship between the two variables $x$ and $y$ .

Linear Relationships of Varying Strengths: panel (a) shows a scatterplot with tightly packed data falling almost exactly on a line, panel (b) shows data more loosely clustered around a line, and panel (c) shows a more random bubble of data with only a faint linear structure.

If you have Java enabled on your browser you can practice guessing correlation coefficient values with this simulator at: http://onlinestatbook.com/2/describing_bivariate_data/pearson_demo.html

Calculating the correlation coefficient by hand can be challenging when just looking at the formula. Thankfully most spreadsheet programs and calculators will do this for you. The definition below will show the formula, but in general software is used to do the calculations over doing the work by hand as we are often using data sets with hundreds or thousands of values.

Correlation Coefficient, r

The correlation coefficient for data based on a sample from a population is found by,

$r = \frac{n Σ (x y) - (Σ x) (Σ y)}{\sqrt{[n Σ x^{2} - {(Σ x)}^{2}] [n Σ y^{2} - {(Σ y)}^{2}]}}$

where n = the number of paired data points.

Remember that the Greek letter sigma, "Σ", means to sum up the values. We did similar calculations when finding the mean and standard deviation.

In the next example we will revisit the age of a child in months and the child's vocabulary to find the correlation coefficient. The formula will be used, but information on how to use a spreadsheet to do this calculation is also provided. Using a spreadsheet program will significantly decrease the time that is needed to do these computations and is recommended.

Example 2

Find the correlation coefficient for the data given in the table below where the age $x$ in months and the number of words in the vocabulary $y$ is measured for 5 children.

Example 2 Table 1. Sample of five children's ages vs number of words in their vocabulary.
$x$ , age in months	12	13	16	16	18
$y$ , number of words in vocabulary	5	9	18	22	31

Solution

To start we are going to rearrange the table and add in columns that are needed in the formula. We need to sum up all the values of each variable, the squared values, and the products. The table below shows the appropriate columns added and on the bottom row the sum of that column was found.

Example 2 Table 2. Columns added to table 1 for the squared values and product of x and y.
	x	y	x²	y²	xy
	12	5	144	25	60
	13	9	169	81	117
	16	18	256	324	288
	16	22	256	484	352
	18	31	324	961	558
SUM	75	85	1149	1875	1375

Now that we have the values, we will put each total from the SUM row into the appropriate location in the formula for the correlation coefficient, r. Keep in mind that n represents the number of paired values, so $n = 5$ .

$\begin{array}{l} r = \frac{n Σ (x y) - (Σ x) (Σ y)}{\sqrt{[n Σ x^{2} - {(Σ x)}^{2}] [n Σ y^{2} - {(Σ y)}^{2}]}} \\ r = \frac{5 (1375) - (75) (85)}{\sqrt{([5 (1149) - {(75)}^{2}] [5 (1875) - {(85)}^{2}])}} \\ r = \frac{500}{\sqrt{[120] [2150]}} \\ r = \frac{500}{\sqrt{258000}} \\ r \approx .984 \end{array}$

Visually we can see the scatterplot below:

Link for Example 2 using Desmos


If we decide to use a spreadsheet program like Google Sheets or Excel, we can use a function called "CORREL" to have the correlation coefficient computed for the data. A view of each with the table and the function used is given below.

Excel: Enter the data into either rows or columns. In the image below we see the table as originally presented, with values of the same variable in rows. To find the value of r we enter "=CORREL(Arrayx,Arrayy)" into another cell on the spreadsheet, where Arrayx is the range of cells holding just the data values for age and Arrayy is the range of cells holding the data values for number of words (vocabulary). The output is seen to match the value we computed above, $r \approx .984$ .

Excel view of data with the use of Correl function to find correlation — Excel view of computation of r

Google Sheets: The instructions are exactly the same for Google Sheets.

Google Sheets view of data with the use of Correl function to find correlation — Google Sheets view of computation of r: Google Sheets for example 2 with correl function being used
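For readers who prefer a scripting language, the same computation can be sketched in Python. This is only an illustration under stated assumptions: the function name `pearson_r` and the variable names are our own choices, not part of the spreadsheet workflow above. The sketch applies the sums formula for r to the Example 2 data and prints the column totals so they can be compared with the SUM row of the table.

```python
import math

def pearson_r(xs, ys):
    """Sample correlation coefficient r, computed with the sums formula."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must contain the same number of values")
    n = len(xs)                                   # number of paired data points
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_x2 = sum(x * x for x in xs)               # sum of squared x-values
    sum_y2 = sum(y * y for y in ys)               # sum of squared y-values
    sum_xy = sum(x * y for x, y in zip(xs, ys))   # sum of products
    print(sum_x, sum_y, sum_x2, sum_y2, sum_xy)   # totals for the SUM row
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

ages = [12, 13, 16, 16, 18]    # x, age in months
words = [5, 9, 18, 22, 31]     # y, number of words in vocabulary

r = pearson_r(ages, words)     # prints 75 85 1149 1875 1375
print(round(r, 3))             # 0.984
```

The function can be reused on any pair of equal-length lists, which makes it a convenient way to check a hand computation.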

Now that we have seen how to calculate the correlation coefficient, $r$ , we will look at the properties to better understand what a given value of $r$ represents.

Properties of Correlation Coefficient, r

1. The value of $r$ lies between -1 and 1, inclusive: $- 1 \leq r \leq 1$

2. The sign of $r$ indicates the direction of the linear relationship between $x$ and $y$ :

i. If $r < 0$ then $y$ tends to decrease as $x$ is increased. We call that a negative correlation.
ii. If $r > 0$ then $y$ tends to increase as $x$ is increased. We call that a positive correlation.

(a) A scatter plot showing data with a positive correlation. $0 < r < 1$
(b) A scatter plot showing data with a negative correlation. $- 1 < r < 0$
(c) A scatter plot showing data with zero correlation. $r = 0$

Three scatter plots with lines of best fit. The first scatterplot shows points ascending from the lower left to the upper right. The line of best fit has positive slope. The second scatter plot shows points descending from the upper left to the lower right. The line of best fit has negative slope. The third scatter plot of points form a horizontal pattern. The line of best fit is a horizontal line.

3. The size of $| r |$ indicates the strength of the linear relationship between $x$ and $y$ :

i. If $| r |$ is near 1 (that is, if $r$ is near either 1 or -1) then the linear relationship between $x$ and $y$ is strong.

ii. If $| r |$ is near 0 (that is, if $r$ is near 0 and of either sign) then the linear relationship between $x$ and $y$ is weak.

Below are six different scatter plots to show the behavior of the correlation coefficient with different spread of data. Take note of both the sign of r and the size of r for how the distribution of points change the value of r.

(a) scatter plot with $r = - 1$ showing a perfect linear association with negative slope.
(b) scatter plot with $r = - 0.94$ showing a close linear association with negative slope.
(c) scatter plot with $r = 0.08$ showing a relationship that has almost no linear relationship.
(d) scatter plot with $r = 1$ showing a perfect linear association with positive slope.
(e) scatter plot with $r = 0.86$ showing a close linear association with a positive slope.
(f) scatter plot that has a perfect relationship that is not linear with $r = 0$ .
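Case (f) can be checked numerically. The sketch below is a hypothetical illustration, not data from the text: it builds points that lie exactly on the parabola $y = x^{2}$ for x-values symmetric about 0, and applies the sums formula for r. The helper name `pearson_r` is our own.

```python
import math

def pearson_r(xs, ys):
    """Correlation coefficient r computed with the sums formula from this section."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Illustrative values (an assumption for this sketch): y is determined
# exactly by x, but the relationship is a parabola, not a line.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r_parabola = pearson_r(xs, ys)
print(r_parabola)  # 0.0
```

Even though every point sits exactly on the curve, r is 0 because r measures only the linear part of a relationship. This is why the scatter plot should be inspected before r is interpreted.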

Try it Now 1

Calculate the correlation coefficient for the sample of 25-year-old men's height and weight given in the table below.

Sample of 25-year-old men's height and weight.
Height (in)	68	69	70	70	71	72	72	72	73	73	74	75
Weight (lb)	151	146	157	164	171	160	163	180	170	175	178	188

Hint 1

If calculating with the formula start with the table that includes squared values and products of the two values. If calculating with technology start with copying and pasting the table into Google Sheets or Excel, then use the correl function to find the correlation coefficient.

Answer

To calculate the correlation using the formula, start with the table below.

Try it Now 1. Table 2. Columns added to table 1 for the squared values and product of x and y.
	x	y	x²	y²	xy
	68	151	4624	22801	10268
	69	146	4761	21316	10074
	70	157	4900	24649	10990
	70	164	4900	26896	11480
	71	171	5041	29241	12141
	72	160	5184	25600	11520
	72	163	5184	26569	11736
	72	180	5184	32400	12960
	73	170	5329	28900	12410
	73	175	5329	30625	12775
	74	178	5476	31684	13172
	75	188	5625	35344	14100
SUM	859	2003	61537	336025	143626

$\begin{array}{l} r = \frac{12 (143626) - (859) (2003)}{\sqrt{([12 (61537) - {(859)}^{2}] [12 (336025) - {(2003)}^{2}])}} \\ r = \frac{2935}{\sqrt{563 \cdot 20291}} \\ r = \frac{2935}{\sqrt{11423833}} \\ r \approx 0.868 \end{array}$

We found the correlation coefficient to be approximately 0.868.

By Google Sheets:

Copy the table into Google Sheets and use the function correl. We get a correlation coefficient of approximately 0.868, matching the value found by hand. See screenshot below. Google Sheets for Try it Now 1

Shows data table in Google sheets using the correl function to find correlation coefficient. Data in cells A1:M2 with labels in column A. Correl function written as =correl(B2:M2,B1:M1) in cell B4.

Least Squares Regression Line

Once a scatterplot of the data has been drawn and we visually verify a linear model is a good fit (and perhaps the correlation coefficient $r$ computed to quantitatively verify the linear trend), the next step in the analysis is to find the straight line that best fits the data.

Least Squares Regression Line

Given a collection of pairs $(x, y)$ of numbers (in which not all the x-values are the same), there is a line $\hat{y} = m x + b$ that best fits the data in the sense of minimizing the sum of the squared vertical distances (errors) from the data points to the regression line. The slope, m, and y-intercept, b, are computed using the formulas below:

$\begin{array}{l} m = r \frac{s_{y}}{s_{x}} \\ b = \bar{y} - m \bar{x} \end{array}$

where
$r$ is the correlation coefficient,
$s_{y}$ is the sample standard deviation of the $y$ -variable data,
$s_{x}$ is the sample standard deviation of the $x$ -variable data,
$\bar{y}$ is the sample mean of the $y$ -variable data, and
$\bar{x}$ is the sample mean of the $x$ -variable data.

Note: You must find "m" first in order to find "b" if you are calculating this by hand.

In some statistics books you may see the form of the line written as $\hat{y} = a + b x$ , where b would represent the slope and a is the y-intercept.

The symbol $\hat{y}$ is read " $y$ hat" and is the estimated value of $y$ . It is the value of $y$ obtained using the regression line. It is not generally equal to the value of $y$ from the data. The interesting fact is that there is only one regression line that minimizes those squared distances (unless all the x-values are exactly the same).
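To make "minimizing the sum of the squared errors" concrete, here is a minimal Python sketch using the five-children data from Example 2. It compares the least squares line with a line drawn through two hand-picked points, the informal method mentioned at the start of this section. Two assumptions to note: the helper name `sum_squared_errors` is our own, and the slope is computed from the sums as $m = \frac{n Σ (x y) - (Σ x) (Σ y)}{n Σ x^{2} - {(Σ x)}^{2}}$ , which is algebraically equivalent to $m = r \frac{s_{y}}{s_{x}}$ .

```python
def sum_squared_errors(xs, ys, m, b):
    """Sum of squared vertical distances between the data and the line y = m*x + b."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))

ages = [12, 13, 16, 16, 18]    # x, age in months
words = [5, 9, 18, 22, 31]     # y, number of words in vocabulary
n = len(ages)

# Least squares slope and intercept computed from the sums.
sum_x, sum_y = sum(ages), sum(words)
sum_xy = sum(x * y for x, y in zip(ages, words))
sum_x2 = sum(x * x for x in ages)
m_ls = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b_ls = sum_y / n - m_ls * sum_x / n

# A competing line through two hand-picked data points, (12, 5) and (18, 31).
m_two = (31 - 5) / (18 - 12)
b_two = 5 - m_two * 12

sse_ls = sum_squared_errors(ages, words, m_ls, b_ls)
sse_two = sum_squared_errors(ages, words, m_two, b_two)
print(round(sse_ls, 2), round(sse_two, 2))  # 13.33 19.0
```

The two-point line passes exactly through two of the data points, yet its total squared error is larger. Any other choice of slope and intercept would also give a larger total than the least squares line.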

Calculating the trend line by hand is not typically done. In the first example below we revisit the data of five children, where we compared the age in months to the number of words in the child's vocabulary. The work for the trend line based on the formula is shown first, followed by the spreadsheet functions that produce the same line.

Example 3

Construct the least squares regression line for the data given in the table below where the age $x$ in months and the number of words in the vocabulary $y$ is measured for 5 children.

Example 3 Table 1. Sample of five children's ages vs number of words in their vocabulary.
$x$ , age in months	12	13	16	16	18
$y$ , number of words in vocabulary	5	9	18	22	31

Solution

In our work from the last section we already found the correlation coefficient, $r$ , to be approximately 0.984. Find the sample mean and standard deviation for each variable (this can be done from the formulas or by using technology - the formula method is shown below).

$\begin{array}{l} \bar{x} = \frac{12 + 13 + 16 + 16 + 18}{5} = 15 \\ s_{x} = \sqrt{\frac{{(12 - 15)}^{2} + {(13 - 15)}^{2} + {(16 - 15)}^{2} + {(16 - 15)}^{2} + {(18 - 15)}^{2}}{5 - 1}} \approx 2.45 \\ \bar{y} = \frac{5 + 9 + 18 + 22 + 31}{5} = 17 \\ s_{y} = \sqrt{\frac{{(5 - 17)}^{2} + {(9 - 17)}^{2} + {(18 - 17)}^{2} + {(22 - 17)}^{2} + {(31 - 17)}^{2}}{5 - 1}} \approx 10.37 \end{array}$

Now use the formula and find the slope and intercept for the regression line.

$\begin{array}{l} m = r \frac{s_{y}}{s_{x}} \approx .984 (\frac{10.37}{2.45}) \approx 4.165 \\ b = \bar{y} - m \bar{x} \approx 17 - 4.165 (15) = - 45.475 \\ \hat{y} = m x + b \\ \hat{y} = - 45.475 + (4.165) x \end{array}$

The scatter plot and regression line are shown below.

Scatter plot of vocabulary size versus age with the regression line: Google Sheet for Example 3


To find the linear regression line using Excel or Google Sheets we can use the following two functions, which give the slope of the regression line and the $y$ -intercept: SLOPE and INTERCEPT. For both functions we enter the arrays of data values, putting the y-variable values first, as the function prompt in the program instructs.

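Excel: enter =SLOPE(known_y's, known_x's) in one cell and =INTERCEPT(known_y's, known_x's) in another, where known_y's is the range of cells holding the $y$ -values and known_x's is the range of cells holding the $x$ -values.

Google Sheets: the functions are entered the same way, as =SLOPE(data_y, data_x) and =INTERCEPT(data_y, data_x).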
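The hand computation in this example can also be reproduced with a short Python sketch. It is an illustration only, and the variable names are our own. It uses the standard library's `statistics.mean` and `statistics.stdev` (the sample standard deviation, dividing by n - 1) together with the formulas $m = r \frac{s_{y}}{s_{x}}$ and $b = \bar{y} - m \bar{x}$ .

```python
import math
import statistics

ages = [12, 13, 16, 16, 18]    # x, age in months
words = [5, 9, 18, 22, 31]     # y, number of words in vocabulary
n = len(ages)

# Correlation coefficient from the sums formula.
sum_x, sum_y = sum(ages), sum(words)
sum_xy = sum(x * y for x, y in zip(ages, words))
sum_x2 = sum(x * x for x in ages)
sum_y2 = sum(y * y for y in words)
r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)

# Sample means and sample standard deviations.
x_bar, y_bar = statistics.mean(ages), statistics.mean(words)
s_x, s_y = statistics.stdev(ages), statistics.stdev(words)

m = r * s_y / s_x        # slope
b = y_bar - m * x_bar    # y-intercept
print(round(m, 3), round(b, 3))  # 4.167 -45.5
```

The values differ slightly from the hand results of 4.165 and -45.475 because the hand computation rounds r, $s_{x}$ , and $s_{y}$ before multiplying, while the code carries full precision throughout.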

Finding a linear regression line can be done with software very easily, so the focus from this point is on when it should be found and how to use it (interpret the values).

Understanding the Linear Regression Line

To better understand the linear regression line, let us look at another example. A random sample of 11 statistics students produced the following data, where $x$ is the third exam score out of 80, and $y$ is the final exam score out of 200. Can you predict the final exam score of a random student if you know the third exam score? The table below shows the results for the 11 students.

Table showing the scores on the final exam based on scores from the third exam.
x (third exam score)	65	67	71	71	66	75	67	70	71	69	69
y (final exam score)	175	133	185	163	126	198	153	163	159	151	159

The third exam score, $x$ , is the independent variable and the final exam score, $y$ , is the dependent variable. Before we jump in to find the regression line it is best to plot the data to confirm a linear relationship looks visible.

This is a scatter plot of the data provided. The third exam score is plotted on the x-axis, and the final exam score is plotted on the y-axis. The points form a somewhat strong, positive, linear pattern. — Scatter plot showing the scores on the final exam based on scores from the third exam.

In the graph we can see that the data falls close to a straight line with a positive slope. The least squares regression line for the third-exam/final-exam example has the equation:

$\hat{y} = - 173.51 + 4.83 x$

The scatter plot of exam scores with a line of best fit. — The graph of the linear regression line for the third-exam/final-exam example.

The slope of the line, $m$ , describes how changes in the variables are related. It is important to interpret the slope of the line in the context of the situation represented by the data. You should be able to write a sentence interpreting the slope in plain English.

Interpretation of the slope

The slope of the regression line tells us how the dependent variable ( $y$ ) changes for every one unit increase in the independent ( $x$ ) variable, on average.

THIRD EXAM vs FINAL EXAM EXAMPLE
Slope: The slope of the line is $m = 4.83$ .

Interpretation: For a one-point increase in the score on the third exam, the final exam score increases by 4.83 points, on average.

Notice that in the interpretation we refer to the value of the slope (4.83) to be the change in the dependent variable for a one unit change of the independent variable (third exam scores). Visually we think of the slope as how far in the $y$ -direction the line moves for a one unit increase in value of the $x$ variable.

What about the $y$ -intercept?

Interpretation of $y$ -intercept

The $y$ -intercept represents the predicted value when $x = 0$ . Sometimes this is a starting amount, but at other times the value may make no sense in terms of the application.

THIRD EXAM vs FINAL EXAM EXAMPLE
$y$ -intercept: The $y$ -intercept of the line is $b = - 173.51$ .

Interpretation: In this case the $y$ -intercept does not have any meaning as it is impossible to have an exam score that is below 0. What happened here is that we are trying to interpret something outside of a reasonable region of interest based on the data collected and we ended up with a value that just doesn't make sense in the application.

We should only use the regression line to predict values within the range of the given data set. For instance, in our example with children's vocabulary and age, it is okay to use the linear regression line to predict the vocabulary of a child at age 15 months, since 15 months is between the lowest and highest recorded ages. Using that regression line to predict the vocabulary at age 24 months is not appropriate, as we don't know if the linear relationship continues as the child gets older. It most likely will not, as a child's language skills tend to develop faster as they get older (until a certain point). Trying to use the linear regression line outside of the data range it was constructed from is called extrapolation.

Interpolation and Extrapolation

The process of using the least squares regression equation to estimate the value of $y$ at a value of $x$ that lies within the range of the $x$ -values in the data set that was used to form the regression line is called interpolation.

The process of using the least squares regression equation to estimate the value of $y$ at a value of $x$ that does not lie in the range of the $x$ -values in the data set that was used to form the regression line is called extrapolation. It is an invalid use of the regression equation that can lead to errors, hence should be avoided.

Going back to our third exam/final exam example suppose we want to estimate, or predict, the mean final exam score of statistics students who received 73 on the third exam. The exam scores ( $x$ -values) range from 65 to 75. Since 73 is between the $x$ -values 65 and 75, substitute $x = 73$ into the equation. Then:

$\hat{y} = - 173.51 + 4.83 (73) = 179.08$

We predict that statistics students who earn a grade of 73 on the third exam will earn a grade of 179.08 on the final exam, on average.

If we wanted to use the linear regression line to predict the final exam score for a student who scored a 90 on the third exam we will get an unexpected result as we are now trying to use the model outside of the given range for the data set (our third exam scores were between 65 and 75 for the regression line). Make the substitution $x = 90$ into the equation and see what happens:

$\hat{y} = - 173.51 + 4.83 (90) = 261.19$

The final-exam score is predicted to be 261.19, but the largest possible final-exam score is 200, so this extrapolated prediction is not meaningful.
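One way to guard against accidental extrapolation is to build the range check into the prediction itself. The Python sketch below is only an illustration: the function name `predict_final_exam` and its `x_min` and `x_max` parameters are hypothetical, and raising an error is just one possible design. It uses the regression line $\hat{y} = - 173.51 + 4.83 x$ and the observed third-exam range of 65 to 75.

```python
def predict_final_exam(third_exam_score, x_min=65, x_max=75):
    """Predict the final exam score, refusing to extrapolate.

    Uses the line y-hat = -173.51 + 4.83x from the third-exam/final-exam
    example. x_min and x_max are the smallest and largest third exam
    scores in the data used to fit the line.
    """
    if not (x_min <= third_exam_score <= x_max):
        raise ValueError(
            f"x = {third_exam_score} is outside the data range "
            f"[{x_min}, {x_max}]; predicting here would be extrapolation"
        )
    return -173.51 + 4.83 * third_exam_score

print(round(predict_final_exam(73), 2))  # 179.08 (interpolation)

try:
    predict_final_exam(90)               # outside 65 to 75
except ValueError as err:
    print(err)
```

A gentler alternative would be to return the value along with a warning, but stopping outright makes it harder to overlook a prediction such as 261.19 on a 200-point exam.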

Try it Now 2

Data was collected on the relationship between the number of hours per week practicing a musical instrument and scores on a math test. The data had values ranging from 2 hours to 14 hours of musical practice time per week. The linear regression line is as follows:

$\hat{y} = 72.5 + 2.8 x$

What would you predict the score on a math test would be for a student who practices a musical instrument for five hours a week?

Hint 1

Before using the linear regression line verify that we are not extrapolating outside of the given data region. We are trying to predict the score on the math test for the given number of hours being five. Since five hours is within the given set of hours of practice used in the data set it is safe to use the linear regression line.

Hint 2

Evaluate the linear regression line by substituting $x = 5$ into the equation ( $x$ being the number of hours practiced).

Answer

Evaluate at $x = 5$ since the value falls within the given data set range used to construct the linear regression line.

$\hat{y} = 72.5 + 2.8 (5) = 86.5$

If a student practices the musical instrument for five hours a week, the predicted math test score is 86.5.

Exercises

For a certain class, the relationship between the amount of time spent studying and the test grade earned was examined. It was determined that as the amount of time they studied increased, so did their grades. Is this a positive or negative association?
Answer

This is a positive association (positive correlation).
For this same class, the relationship between the amount of time spent studying and the amount of time spent socializing per week was also examined. It was determined that the more hours they spent studying, the fewer hours they spent socializing. Is this a positive or negative association?
Answer

This is a negative association (negative correlation).
In each case state whether you expect the two variables x and y indicated to have positive, negative, or zero correlation.
1. the number $x$ of pages in a book and the age $y$ of an adult author
2. the number $x$ of pages in a book and the age $y$ of the intended reader
3. the weight $x$ of an automobile and the fuel economy $y$ in miles per gallon
4. the weight $x$ of an automobile and the reading $y$ on its odometer
5. the amount $x$ of a sedative a person took an hour ago and the time $y$ it takes them to respond to a stimulus
Answer
1. We would expect zero correlation.
2. We would expect a positive correlation (as this would include younger readers as well).
3. We would expect a negative correlation.
4. We would expect zero correlation.
5. We would expect a positive correlation (a larger dose of sedative tends to go with a longer response time).
In each case state whether you expect the two variables x and y indicated to have positive, negative, or zero correlation.
1. the temperature $x$ outside and the number $y$ of eegees (a cold, frozen, typically fruity drink) sold per day
2. the average length $x$ of time that calls to a retail call center are on hold one day and the number $y$ of calls received that day
3. the length $x$ of a regularly scheduled commercial flight between two cities and the headwind $y$ encountered by the aircraft
4. the value $x$ of a house and its size $y$ in square feet
5. the average temperature $x$ on a winter day and the energy consumption $y$ of the furnace to heat a house.
Answer
1. We would expect a positive correlation.
2. We would expect a positive correlation.
3. We would expect zero correlation.
4. We would expect a positive correlation.
5. We would expect a negative correlation.
True/False: The correlation in real life between height and weight is $r = 1$ .
Answer

False
True/False: It is possible for variables to have $r = 0$ but still have a strong association.
Answer

True
True/False: Two variables with a correlation of 0.3 have a stronger linear relationship than two variables with a correlation of -0.7.
Answer

False
True/False: A correlation of $r = 1.2$ is not possible.
Answer

True

For the sample data

$x$	1	3	5	7	9
$y$	9	6	6	4	2

Draw the scatter plot
Find the correlation coefficient (round answer to two decimal places)
Find the linear regression line.

Answer

Scatter plot below includes linear regression line.

Scatter plot of y vs x with linear regression line: Google Sheet for Exercise
Correlation coefficient is $- 0.97$
The equation for the linear regression line is $\hat{y} = 9.4 - 0.8 x$

For the sample data

$x$	1	4	4	5	10	11	13	18
$y$	2	5	4	7	15	12	16	15

Draw the scatter plot
Find the correlation coefficient (round results to four decimal places)
Find the linear regression line. (round results to four decimal places)

Answer

Scatter plot below includes linear regression line.

Scatter plot of y vs x with linear regression line: Google Sheet for Exercise
Correlation coefficient is $0.9166$
The equation for the linear regression line is $\hat{y} = 2.0297 + 0.9055 x$

For the sample data

$x$	4	5	6	7	8
$y$	9	12	15	18	21

Draw the scatter plot
Find the correlation coefficient
Find the linear regression line.

Answer

Scatter plot below includes linear regression line.

Scatter plot of y vs x with linear regression line: Google Sheet for Exercise
Correlation coefficient is $1.0$
The equation for the linear regression line is $\hat{y} = - 3 + 3 x$

For the sample data

$x$	1	3	5	7	9
$y$	26	24	22	20	18

Draw the scatter plot
Find the correlation coefficient
Find the linear regression line.

Answer

Scatter plot below includes linear regression line.

Scatter plot of y vs x with linear regression line: Google Sheet for Exercise
Correlation coefficient is $- 1.0$
The equation for the linear regression line is $\hat{y} = 27 - x$

The curb weight $x$ (in hundreds of pounds) and braking distance $y$ (in feet), at 50 miles per hour on dry pavement, were measured for five vehicles, with results shown in the table.

$x$	25	27.5	32.5	35	45
$y$	105	125	140	140	150

Find the linear regression line. Round values to four decimal places.
Interpret the slope in terms of the application.
If the curb weight is 3000 lbs use the linear regression model to predict the braking distance. Round answer to two decimal places.
Would it be appropriate to use the linear regression line to predict the braking distance of a vehicle with a curb weight of 4000 lbs? What about 6000 lbs?

Answer

Start by constructing the scatter plot (shown below) and finding the correlation coefficient. The correlation coefficient is 0.8835, which indicates a positive linear relationship, so it makes sense to find the least squares regression line. The least squares regression line is $\hat{y} = 66.3402 + 1.9897 x$.

[Figure: scatter plot of braking distance $y$ vs. curb weight $x$ with the linear regression line]
The slope was found to be approximately 1.99. This means for each increase of 100 lbs of the curb weight of the vehicle we expect on average the braking distance to increase by 1.99 ft.
If the curb weight is 3000 lbs, then this corresponds to $x = 30$ (which is within the data range of 25 to 45). Evaluate the linear regression line to get: $\begin{array}{l} \hat{y} = 66.3402 + 1.9897 x \\ \hat{y} = 66.3402 + 1.9897 (30) \\ \hat{y} \approx 126.03 \end{array}$ The estimated braking distance is 126.03 ft for a vehicle with a curb weight of 3000 lbs traveling at 50 miles per hour.
It is appropriate to estimate the braking distance for a vehicle with a curb weight of 4000 lbs, but not for 6000 lbs. The data range goes from 2500 lbs to 4500 lbs, so 4000 lbs falls inside the range (interpolation) while 6000 lbs falls well outside it (extrapolation).
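The range check and the prediction can be combined in a short Python sketch. The coefficients and the data range are taken from the answer above. The helper name `predict_braking_distance` and the choice to return `None` outside the data range are assumptions made for illustration.

```python
# Coefficients of the regression line from the answer above
INTERCEPT = 66.3402
SLOPE = 1.9897
# Range of the observed curb weights, in hundreds of pounds
X_MIN, X_MAX = 25, 45

def predict_braking_distance(weight_lbs):
    """Return the predicted braking distance in feet, or None when the
    curb weight lies outside the observed data range (extrapolation)."""
    x = weight_lbs / 100  # convert pounds to hundreds of pounds
    if not (X_MIN <= x <= X_MAX):
        return None
    return INTERCEPT + SLOPE * x

for weight in (3000, 4000, 6000):
    result = predict_braking_distance(weight)
    if result is None:
        print(f"{weight} lbs: outside the data range, no prediction")
    else:
        print(f"{weight} lbs: about {result:.2f} ft")
```

Refusing to predict outside the data range is one possible design choice. Another would be to return the value together with a warning, since the model can still be evaluated there even though the result may not be reliable.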

The height $x$ at age 2 and the height $y$ at age 20, both in inches, for ten women are shown in the table below.

$x$	31.3	31.7	32.5	33.5	34.4	35.2	35.8	32.7	33.6	34.8
$y$	60.7	61.0	63.1	64.2	65.9	68.2	67.6	62.3	64.9	66.8

Find the linear regression line. Round values to four decimal places.
Interpret the slope in terms of the application.
Use the linear regression line (if appropriate) to predict the height of a woman at age 20 if her height at age 2 was 25 inches. Round answer to two decimal places.
Use the linear regression line (if appropriate) to predict the height of a woman at age 20 if her height at age 2 was 35 inches. Round answer to two decimal places.

Answer

Start by constructing the scatter plot (shown below) and finding the correlation coefficient. The correlation coefficient is 0.9819, which indicates a strong positive linear relationship, so it makes sense to find the least squares regression line. The least squares regression line is $\hat{y} = 5.9694 + 1.7437 x$.

[Figure: scatter plot of height at age 20 $y$ vs. height at age 2 $x$ with the linear regression line]
The slope was found to be approximately 1.74. This means for each one inch increase in the height at age 2 we expect on average the height at age 20 to increase by 1.74 inches.
It may not be appropriate to estimate the height at age 20 with the linear regression model when the height at age 2 is 25 inches, because 25 inches falls below the data range of 31.3 to 35.8 inches (extrapolation).
It is appropriate to estimate the height at age 20 when using an input of 35 inches for the height at age 2, as it falls within the data range of 31.3 to 35.8 inches. Evaluate the linear regression model at $x = 35$ : $\begin{array}{l} \hat{y} = 5.9694 + 1.7437 x \\ \hat{y} = 5.9694 + 1.7437 (35) \\ \hat{y} \approx 67.00 \end{array}$ The model predicts that a woman whose height was 35 inches at age 2 will have a height of about 67.00 inches at age 20.

The age $x$ and resting heart rate $y$ were measured for ten men, with the results shown in the table below.

$x$	20	23	30	37	35	45	51	55	60	63
$y$	72	71	73	74	74	73	72	79	75	77

Find the linear regression line. Round values to four decimal places.
Interpret the slope in terms of the application.
Use the linear regression line (if appropriate) to predict the resting heart rate of a 25-year-old male. Round answer to two decimal places.
Use the linear regression line (if appropriate) to predict the resting heart rate of an 85-year-old male. Round answer to two decimal places.

Answer

Start by constructing the scatter plot (shown below) and finding the correlation coefficient. The correlation coefficient is 0.7090, which indicates a positive linear relationship, so it makes sense to find the least squares regression line. The least squares regression line is $\hat{y} = 69.2215 + 0.1140 x$.

[Figure: scatter plot of resting heart rate $y$ vs. age $x$ with the linear regression line]
The slope was found to be approximately 0.114. This means that for each one-year increase in age we expect, on average, the resting heart rate to increase by 0.114.
It is appropriate to estimate the resting heart rate when using an input of 25 years old, as it falls within the data range of 20 to 63 years. Evaluate the linear regression model at $x = 25$ : $\begin{array}{l} \hat{y} = 69.2215 + 0.1140 x \\ \hat{y} = 69.2215 + 0.1140 (25) \\ \hat{y} \approx 72.07 \end{array}$ The model predicts a resting heart rate of approximately 72.07 at age 25.
It would not be appropriate to use the model for age 85, as the ages in the data range from 20 to 63 (extrapolation).

This page contains modified content from David Lippman, "Math In Society, 2nd Edition." Licensed under CC BY-SA 4.0.
This page contains modified content from "Beginning Statistics (v. 1.0)." Download for free; license information can be found at: https://2012books.lardbucket.org/. Licensed under CC BY-NC-SA 3.0.
This page contains content by Robert Foth, Math Faculty, Pima Community College, 2021. Licensed under CC BY 4.0.
