# The document Arizona Residential Property Valuation System, published by the Arizona Department of Revenue, describes how county assessors use computerized systems to value single-family residential properties for property tax purposes. a) Obtain a scatterplot for the data. b) Decide whether finding a regression line for the data is reasonable. If so, then also do parts (c)-(f). c) Determine and interpret the regression equation for the data. d) Identify potential outliers and influential observations. e) In case a potential outlier is present, remove it and discuss the effect. f) In case a potential influential observation is present, remove it and discuss the effect.

Question
Scatterplots

2021-02-25

Given: $$\displaystyle n = \text{sample size} = 44$$

a) Lot size is on the horizontal axis and value is on the vertical axis.

b) It is reasonable to find a regression line for the data if there is no strong curvature present in the scatterplot. There is no strong curvature in the scatterplot of part (a), so it is reasonable to find a regression line for the data.

c) Let us first determine the necessary sums:
$$\displaystyle \sum x_i = 101.93$$
$$\displaystyle \sum x_i^2 = 241.5221$$
$$\displaystyle \sum y_i = 19689$$
$$\displaystyle \sum x_i y_i = 45973.23$$

Next, we can determine $$\displaystyle S_{xx}$$ and $$\displaystyle S_{xy}$$:
$$\displaystyle S_{xx} = \sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n} = 241.5221 - \frac{101.93^2}{44} = 5.3920$$
$$\displaystyle S_{xy} = \sum x_i y_i - \frac{\left(\sum x_i\right)\left(\sum y_i\right)}{n} = 45973.23 - \frac{101.93 \cdot 19689}{44} = 361.8716$$

The estimate b of the slope $$\displaystyle \beta$$ is the ratio of $$\displaystyle S_{xy}$$ and $$\displaystyle S_{xx}$$:
$$\displaystyle b = \frac{S_{xy}}{S_{xx}} = \frac{361.8716}{5.3920} = 67.1128$$

The mean is the sum of all values divided by the number of values:
$$\displaystyle \overline{x} = \frac{\sum x_i}{n} = \frac{101.93}{44} = 2.3166$$
$$\displaystyle \overline{y} = \frac{\sum y_i}{n} = \frac{19689}{44} = 447.4773$$

The estimate a of the intercept $$\displaystyle \alpha$$ is the average of y decreased by the product of the estimate of the slope and the average of x:
$$\displaystyle a = \overline{y} - b\,\overline{x} = 447.4773 - 67.1128 \cdot 2.3166 = 292.0043$$

General least-squares equation: $$\displaystyle \hat{y} = \alpha + \beta x$$. Replace $$\displaystyle \alpha$$ by $$\displaystyle a = 292.0043$$ and $$\displaystyle \beta$$ by $$\displaystyle b = 67.1128$$ in the general least-squares equation:
$$\displaystyle \hat{y} = a + bx = 292.0043 + 67.1128x$$

Interpretation: each additional unit of lot size is associated with an estimated average increase of 67.1128 units in value.

d) There appears to be a potential outlier, because the rightmost point lies much further to the right than the other points in the graph. This point also appears to be a potential influential observation, because it is possible that it pulls the regression line down.

e) Let us first determine the necessary sums after removing the potential outlier, so that n = 43:
$$\displaystyle \sum x_i = 98.31$$
$$\displaystyle\sum\ {{x}_{{{i}}}^{{{2}}}}={228.4177}$$
$$\displaystyle\sum\ {y}_{{{i}}}={19314}$$
$$\displaystyle \sum x_i y_i = 44615.73$$

Next, we can determine $$\displaystyle S_{xx}$$ and $$\displaystyle S_{xy}$$:
$$\displaystyle S_{xx} = \sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n} = 228.4177 - \frac{98.31^2}{43} = 3.6536$$
$$\displaystyle S_{xy} = \sum x_i y_i - \frac{\left(\sum x_i\right)\left(\sum y_i\right)}{n} = 44615.73 - \frac{98.31 \cdot 19314}{43} = 458.5360$$

The estimate b of the slope $$\displaystyle \beta$$ is the ratio of $$\displaystyle S_{xy}$$ and $$\displaystyle S_{xx}$$:
$$\displaystyle b = \frac{S_{xy}}{S_{xx}} = \frac{458.5360}{3.6536} = 125.5024$$

The mean is the sum of all values divided by the number of values:
$$\displaystyle \overline{x} = \frac{\sum x_i}{n} = \frac{98.31}{43} = 2.2863$$
$$\displaystyle \overline{y} = \frac{\sum y_i}{n} = \frac{19314}{43} = 449.1628$$

The estimate a of the intercept $$\displaystyle \alpha$$ is the average of y decreased by the product of the estimate of the slope and the average of x:
$$\displaystyle a = \overline{y} - b\,\overline{x} = 449.1628 - 125.5024 \cdot 2.2863 = 162.2293$$

General least-squares equation: $$\displaystyle \hat{y} = \alpha + \beta x$$. Replace $$\displaystyle \alpha$$ by $$\displaystyle a = 162.2293$$ and $$\displaystyle \beta$$ by $$\displaystyle b = 125.5024$$ in the general least-squares equation:
$$\displaystyle \hat{y} = a + bx = 162.2293 + 125.5024x$$

This regression line is much steeper than the regression line in part (c), and thus the outlier strongly influences the regression line.

f) The potential influential observation is the same point as the potential outlier removed in part (e), so the calculation is identical. With n = 43 the sums are $$\displaystyle \sum x_i = 98.31$$, $$\displaystyle \sum x_i^2 = 228.4177$$, $$\displaystyle \sum y_i = 19314$$ and $$\displaystyle \sum x_i y_i = 44615.73$$, which give $$\displaystyle S_{xx} = 3.6536$$, $$\displaystyle S_{xy} = 458.5360$$, $$\displaystyle b = 125.5024$$ and $$\displaystyle a = 162.2293$$:
$$\displaystyle \hat{y} = a + bx = 162.2293 + 125.5024x$$

This regression line is much steeper than the regression line in part (c), and thus the potential influential observation strongly influences the regression line.
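The arithmetic in parts (c), (e) and (f) can be checked with a short script. The following is a minimal sketch in Python that works only from the summary sums quoted in the answer, since the raw data are not reproduced here. The function name `fit_from_sums` and the dictionary keys are illustrative choices, not part of the original solution.

```python
def fit_from_sums(n, sum_x, sum_x2, sum_y, sum_xy):
    """Return (a, b) for the least-squares line y-hat = a + b*x,
    computed from summary sums as in the worked answer."""
    s_xx = sum_x2 - sum_x ** 2 / n        # S_xx
    s_xy = sum_xy - sum_x * sum_y / n     # S_xy
    b = s_xy / s_xx                       # slope estimate
    a = sum_y / n - b * (sum_x / n)       # intercept: y-bar - b * x-bar
    return a, b


# Sums quoted in part (c): all 44 observations.
full = dict(n=44, sum_x=101.93, sum_x2=241.5221, sum_y=19689, sum_xy=45973.23)
# Sums quoted in parts (e)/(f): 43 observations, one point removed.
reduced = dict(n=43, sum_x=98.31, sum_x2=228.4177, sum_y=19314, sum_xy=44615.73)

a_full, b_full = fit_from_sums(**full)
a_red, b_red = fit_from_sums(**reduced)

# The removed observation can be recovered as the difference of the sums.
x_removed = full["sum_x"] - reduced["sum_x"]
y_removed = full["sum_y"] - reduced["sum_y"]

print(f"all points:    y-hat = {a_full:.4f} + {b_full:.4f} x")
print(f"point removed: y-hat = {a_red:.4f} + {b_red:.4f} x")
print(f"removed point: x = {x_removed:.2f}, y = {y_removed:.0f}")
```

Subtracting the two sets of sums suggests that the removed observation is the point with lot size 3.62 and value 375. This is consistent with the description in part (d) of a point far to the right that pulls the line down.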

### Relevant Questions

Does a higher state per capita income equate to a higher per capita beer consumption? From the document Survey of Current Business, published by the U.S. Bureau of Economic Analysis, and from the Brewer’s Almanac, published by the Beer Institute, we obtained data on personal income per capita, in thousands of dollars, and per capita beer consumption, in gallons, for the 50 states and Washington, D.C. a) Obtain a scatterplot for the data. b) Decide whether finding a regression line for the data is reasonable. If so, then also do parts (c)-(f). c) Determine and interpret the regression equation for the data. d) Identify potential outliers and influential observations. e) In case a potential outlier is present, remove it and discuss the effect. f) In case a potential influential observation is present, remove it and discuss the effect.
Polychlorinated biphenyls (PCBs), industrial pollutants, are known to be carcinogens and a great danger to natural ecosystems. As a result of several studies, PCB production was banned in the United States in 1979 and by the Stockholm Convention on Persistent Organic Pollutants in 2001. One study, published in 1972 by R. Risebrough, is titled “Effects of Environmental Pollutants Upon Animals Other Than Man”. In that study, 50 Anacapa pelican eggs were collected and measured for their shell thickness, in millimetres (mm), and concentration of PCBs, in parts per million (ppm). a) Obtain a scatterplot for the data. b) Decide whether finding a regression line for the data is reasonable. If so, then also do parts (c)-(f). c) Determine and interpret the regression equation for the data. d) Identify potential outliers and influential observations. e) In case a potential outlier is present, remove it and discuss the effect. f) In case a potential influential observation is present, remove it and discuss the effect.
The magazine Consumer Reports publishes information on automobile gas mileage and variables that affect gas mileage. In one issue, data on gas mileage (in miles per gallon) and engine displacement (in liters) were published for 121 vehicles. a) Obtain a scatterplot for the data. b) Decide whether finding a regression line for the data is reasonable. If so, then also do parts (c)-(f). c) Determine and interpret the regression equation for the data. d) Identify potential outliers and influential observations. e) In case a potential outlier is present, remove it and discuss the effect. f) In case a potential influential observation is present, remove it and discuss the effect.
Polychlorinated biphenyls (PCBs), industrial pollutants, are known to be a great danger to natural ecosystems. In a study by R. W. Risebrough titled “Effects of Environmental Pollutants Upon Animals Other Than Man” (Proceedings of the 6th Berkeley Symposium on Mathematics and Statistics, VI, University of California Press, pp. 443-463), 60 Anacapa pelican eggs were collected and measured for their shell thickness, in millimeters (mm), and concentration of PCBs, in parts per million (ppm). a) Obtain a scatterplot for the data. b) Decide whether finding a regression line for the data is reasonable. If so, then also do parts (c)–(f). c) Determine and interpret the regression equation for the data. d) Identify potential outliers and influential observations. e) In case a potential outlier is present, remove it and discuss the effect. f) In case a potential influential observation is present, remove it and discuss the effect.