Question

# What is the correlation coefficient and what is its significance? Explain why the correlation coefficient is bounded between -1 and 1. How does the correlation coefficient show itself in scatterplots? Aside from probabilistic models, explain what least squares line fitting is.


2021-02-02
Step 1
Correlation:
Correlation is a measure that indicates the “go-togetherness” of two data sets, that is, how closely the values of one variable move together with the values of the other. It is denoted by r. The value of the correlation coefficient lies between –1 and +1. A value of +1 indicates a perfect linear relationship in which both variables move in the same direction. A value of –1 indicates a perfect linear relationship in which the variables move in opposite directions. A value of zero indicates that there is no linear relationship between the two data sets.
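One way to see why r cannot leave the interval from –1 to +1 is the Cauchy-Schwarz inequality. The sketch below writes Pearson's r in its deviation form, which is algebraically equivalent to the usual computational formula. The symbols $$a_i$$ and $$b_i$$ are shorthand introduced here for illustration; they are not notation from the original answer.
$$a_i=x_i-\bar{x},\qquad b_i=y_i-\bar{y},\qquad r=\frac{\sum a_i b_i}{\sqrt{\left(\sum a_i^{2}\right)\left(\sum b_i^{2}\right)}}$$
$$\left(\sum a_i b_i\right)^{2}\le\left(\sum a_i^{2}\right)\left(\sum b_i^{2}\right)\;\Longrightarrow\;r^{2}\le 1\;\Longrightarrow\;-1\le r\le 1$$
Equality holds only when every $$b_i$$ is the same multiple of $$a_i$$, that is, when all the points lie exactly on a straight line. This is why r = +1 or r = –1 corresponds to a perfect linear relationship.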
Correlation coefficient, r:
Karl Pearson’s product-moment correlation coefficient, or simply Pearson’s correlation coefficient, is a measure of the strength of the linear association between two variables and is denoted by r or $$r_{xy}$$.
The coefficient of correlation $$r_{xy}$$ between two variables x and y for the bivariate data set $$(x_i,y_i)$$ for $$i=1,2,3,\ldots,n$$ is given below:
$$r_{xy}=\frac{n\sum xy-\left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n\sum x^{2}-\left(\sum x\right)^{2}\right]\times\left[n\sum y^{2}-\left(\sum y\right)^{2}\right]}}$$
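As a minimal sketch of how this formula can be evaluated, the Python function below computes r from the five sums that appear in it. In the denominator it subtracts the square of the sum, $$\left(\sum x\right)^{2}$$, from n times the sum of squares. The function name `pearson_r` and the sample values are hypothetical, chosen only for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation computed with the sums (computational) formula."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)

    numerator = n * sum_xy - sum_x * sum_y
    # Note: (sum of x)^2, not the sum of x^2, is subtracted in each bracket.
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    if denominator == 0:
        raise ValueError("r is undefined when either variable is constant")
    return numerator / denominator


# Hypothetical data, made up for illustration only.
hours = [1, 2, 3, 4, 5]
score = [2, 4, 5, 4, 5]
r = pearson_r(hours, score)
print(round(r, 4))
```

For this made-up sample the numerator is 30 and the denominator is the square root of 1500, so r is roughly 0.77, a fairly strong positive linear association.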
Step 2
Scatterplot and correlation:
A scatterplot is a type of data display that shows the relationship between two numerical variables. Each member of the data set is plotted as a point whose (x, y) coordinates relate to its values for the two variables.
When the y variable tends to increase as the x variable increases, it can be said that there is a positive correlation between the variables. In other words, when the points on the scatterplot produce a lower left to upper right pattern, there is a positive correlation between the variables.
When the y variable tends to decrease as the x variable increases, it can be said that there is a negative correlation between the variables. In other words, when the points on the scatterplot produce an upper left to lower right pattern, there is a negative correlation between the variables.
When all the points on a scatterplot lie on a straight line, it can be said that there is a perfect correlation between the two variables.
A scatterplot in which the points do not have a linear trend (either positive or negative) is said to show zero or near-zero correlation.
Form of the association between variables:
The form of the association describes whether the data points follow a linear pattern or a more complicated curve. If a straight line would do a reasonable job of summarizing the overall pattern in the data, then the association between the two variables is linear.
Direction of association:
If the values of one variable tend to increase as the values of the other variable increase, then the direction is positive. If the values of one variable tend to decrease as the values of the other variable increase, then the direction is negative.
Strength of the association:
The association is said to be strong if all the points are close to a straight line. It is said to be weak if the points are widely scattered about the line, and it is said to be moderate if the data points are moderately close to the line.
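The direction, form, and strength described above can be checked numerically. The sketch below is an illustration under stated assumptions: the data sets are small and hand-built, not real measurements, and the helper name `correlation` is hypothetical. It computes r for points that hug a rising line, points that scatter widely around the same line, points on a falling line, and points on a symmetric U-shaped curve.

```python
import math


def correlation(xs, ys):
    """Pearson r computed from deviations about the two means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hand-built illustrative data (an assumption, not real measurements).
x = [1, 2, 3, 4, 5, 6]
wiggle = [1, -1, 1, -1, 1, -1]                      # alternating offsets

tight = [xi + 0.1 * w for xi, w in zip(x, wiggle)]  # close to the line y = x
loose = [xi + 2.0 * w for xi, w in zip(x, wiggle)]  # far from the line y = x
falling = [-xi for xi in x]                         # exactly on a falling line
curved = [(xi - 3.5) ** 2 for xi in x]              # symmetric U shape

r_tight = correlation(x, tight)      # strong positive association
r_loose = correlation(x, loose)      # weaker positive association
r_falling = correlation(x, falling)  # perfect negative association
r_curved = correlation(x, curved)    # clear pattern, but no linear trend

for name, value in [("tight", r_tight), ("loose", r_loose),
                    ("falling", r_falling), ("curved", r_curved)]:
    print(name, round(value, 3))
```

The U-shaped case gives r of essentially zero even though y depends on x completely. A near-zero r therefore rules out a linear trend, not every kind of relationship.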
Step 3
Least squares line fitting:
Regression analysis estimates the relationship among variables. That is, it estimates the relationship between one dependent variable and one or more independent variables.
The general form of the first-order regression model is $$y=\beta_{0}+\beta_{1}x+\epsilon$$, where the variable y is the dependent variable that is to be modelled or predicted, the variable x is the independent variable that is used to predict the dependent variable, $$\beta_{0}$$ and $$\beta_{1}$$ are the intercept and slope of the line, and ε is the error term. The value of y predicted by the fitted line is denoted by $$\hat{y}$$.
The difference between the observed value of y and the predicted value of y, denoted $$\hat{y}$$, is called the residual. Hence, the residual is $$y-\hat{y}$$.
A straight line satisfies the least-squares property if the sum of the squares of its residuals is the smallest sum possible. The regression line is the straight line that satisfies this property, and in that sense it is the line that “best” fits the points in a scatterplot.
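The least-squares property can be illustrated with a short Python sketch. It uses the standard closed-form solution for a single predictor: the slope is the sum of products of deviations divided by the sum of squared x deviations, and the intercept makes the line pass through the point of means. The names `fit_line`, `sum_squared_residuals`, `b0`, and `b1` and the sample data are assumptions made for this example.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x  # line passes through (mean_x, mean_y)
    return intercept, slope


def sum_squared_residuals(xs, ys, intercept, slope):
    """Sum of squared residuals, (observed y - predicted y)^2, for a line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, made up for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

b0, b1 = fit_line(xs, ys)
best = sum_squared_residuals(xs, ys, b0, b1)

# Two competing lines: one shifted up, one slightly steeper.
shifted = sum_squared_residuals(xs, ys, b0 + 0.5, b1)
tilted = sum_squared_residuals(xs, ys, b0, b1 + 0.1)

print(round(b0, 3), round(b1, 3), round(best, 3))
print(best < shifted, best < tilted)
```

For this sample the fitted line is approximately ŷ = 2.2 + 0.6x, with a sum of squared residuals of about 2.4. Both competing lines give a larger sum, which is what the least-squares property requires.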