Regression & Correlation
- This tool performs a simple linear regression and correlation analysis on a set of paired variables xi,yi. Pearson's and Spearman's correlation coefficients are also computed, and the nonadditivity of these coefficients is demonstrated.
- The xi,yi values to be submitted must be separated by spaces or commas. To use the tool, enter one variable pair per line, pressing the Enter key at the end of each line so the pairs are recognized individually. You may want to try the default example first.
- This tool complements our Standardizer tool, which analyses a one-variable data set.
- The tool decomposes the input data set into two new sets of size n and degrees of freedom υ = n - 1
X = {x1, x2,... xn} (1)
Y = {y1, y2,... yn} (2)
and then fits (1) and (2) to the simple linear regression model
y = β0 + β1x (3)
where (3) predicts a y value for every x = xi. This is why (3) is called a regression equation.
- β0 is the intercept of the regression line.
- β1 = rp(sy/sx) is the slope of the regression line, where rp is Pearson's Correlation Coefficient and sx and sy are the variables' standard deviations.
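A minimal sketch of how these quantities combine into a fitted line (our own function name and sample data; sample standard deviations use the n - 1 denominator, and the intercept follows from the least-squares property that the fitted line passes through (x̄, ȳ)):

```python
from statistics import mean, stdev

def fit_line(x, y):
    """Fit y = b0 + b1*x via b1 = rp*(sy/sx) and b0 = y_bar - b1*x_bar."""
    x_bar, y_bar = mean(x), mean(y)
    sx, sy = stdev(x), stdev(y)            # sample standard deviations (n - 1)
    n = len(x)
    cov = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (n - 1)
    rp = cov / (sx * sy)                   # Pearson's correlation coefficient
    b1 = rp * (sy / sx)                    # slope
    b0 = y_bar - b1 * x_bar                # intercept: line passes through the means
    return b0, b1, rp

# perfectly linear data y = 2x, so b0 = 0, b1 = 2, rp = 1 (up to float error)
b0, b1, rp = fit_line([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
```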
- The tool also computes the following statistics:
- Spearman Rank-Order Correlation Coefficient, rs
- Pearson Product-Moment Correlation Coefficient, rp
- Covariance, COV(x,y) = (1/υ)Σ(xi - x̄)(yi - ȳ)
- Signal (Determination Coefficient), rp2
- Noise (Indetermination Coefficient), 1 - rp2
- Signal/Noise Ratio, rp2/(1 - rp2)
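The last three statistics all follow from rp; a quick sketch (the value 0.9 is an arbitrary illustration, not tool output):

```python
rp = 0.9
signal = rp ** 2        # determination coefficient, rp2
noise = 1 - signal      # indetermination coefficient, 1 - rp2
snr = signal / noise    # signal/noise ratio
```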
The tool uses an algorithm that converts values to ranks and averages any ties that might be present. This comes in handy when we need to compute a Spearman's rs from ranks with a large number of ties. See the following Notes section.
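A sketch of a tie-averaging ranking routine consistent with that description (the function name is our own):

```python
def average_ranks(values):
    """Convert values to ranks 1..n, assigning tied values the mean of their ranks."""
    srt = sorted(values)
    rank_of = {}
    i = 0
    while i < len(srt):
        j = i
        while j < len(srt) and srt[j] == srt[i]:
            j += 1
        # positions i+1 .. j hold the same value; give them the mean of those ranks
        rank_of[srt[i]] = (i + 1 + j) / 2
        i = j
    return [rank_of[v] for v in values]

print(average_ranks([10, 20, 20, 30]))  # [1.0, 2.5, 2.5, 4.0]
```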
Notes on Pearson's and Spearman's Correlation Coefficients
Pearson's Correlation Coefficient, rp, can be defined as the covariance between two variables normalized by their standard deviations (Rodgers & Nicewander, 1988):

rp = COV(x,y)/(sx·sy) (4)

where (4) incorporates variability information between and within variables.
Spearman's Correlation Coefficient, rs, can be computed as a function of the rank differences di between paired ranks (Rx, Ry),

rs = 1 - 6Σdi2/(n(n2 - 1)) (5)

or as Pearson's coefficient computed from the ranks,

rs = COV(Rx,Ry)/(sRx·sRy) (6)
To some extent, the di term in (5) accounts for variability information between ranks from different sets, but says nothing about their variability and frequency within a given set.
Therefore, (5) is only valid for ranks free of ties. A workaround consists in introducing a smoothing correction by averaging ties. With few ties, (5) and (6) tend to agree to two decimal places, but as the number of ties increases, the coefficient computed with (5) becomes increasingly overestimated and eventually useless.
A better treatment consists in using (6) which accounts for between/within ranks variability and even works after averaging any number of ties. When there is a discrepancy between the results obtained from (5) and (6), one should always accept those from (6). To illustrate this point, consider the sets:
X = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Y = [1, 1, 1, 1, 1, 1, 1, 1, 1, 2]

Using (5), rs = 0.6364 ≈ 0.64, whereas with (6), rs = 0.5222 ≈ 0.52.
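A numeric check of this example (a sketch; the tie-averaged ranks of Y, nine 1s averaged to rank 5, are written out directly):

```python
from math import sqrt

Rx = [float(r) for r in range(1, 11)]  # ranks of X: 1..10, no ties
Ry = [5.0] * 9 + [10.0]                # ranks of Y with the nine tied 1s averaged to 5

n = len(Rx)

# Equation (5): rank-difference formula
d2 = sum((a - b) ** 2 for a, b in zip(Rx, Ry))
rs_5 = 1 - 6 * d2 / (n * (n * n - 1))

# Equation (6): Pearson's coefficient computed from the ranks
mx, my = sum(Rx) / n, sum(Ry) / n
num = sum((a - mx) * (b - my) for a, b in zip(Rx, Ry))
den = sqrt(sum((a - mx) ** 2 for a in Rx) * sum((b - my) ** 2 for b in Ry))
rs_6 = num / den

print(round(rs_5, 4), round(rs_6, 4))  # 0.6364 0.5222
```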
A refreshing consequence of computing rs as in (6) is that we can now verify its result from β1, the slope of the simple linear regression model (3), or from the model's coefficient of determination, rs2.
For all practical purposes, one can safely avoid (5) altogether and use (6) when computing correlations from ranks.
On the Non-Additivity of Pearson's and Spearman's Correlation Coefficients
The default example given on this page was intentionally selected to illustrate that, for ranks free of ties or with ties averaged, a Spearman's correlation coefficient is a Pearson's correlation coefficient (Garcia, 2017). Since the latter is a cosine value, and cosines are not additive, neither of these coefficients can be arithmetically added and averaged. This is easy to demonstrate:
- Take any two sets of paired ranks. If there are any ties, average these within each set.
- Mean center each set by subtracting the set mean from each element.
- Treating each set as a vector, compute the cosine of the angle between them, cos(θ). This is commonly known as the cosine similarity between the vectors.
- Convince yourself that cos(θ) = rs.
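The steps above can be sketched numerically (the tie-free ranks below are an arbitrary illustration):

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

# paired ranks, free of ties
Rx = [1, 2, 3, 4, 5]
Ry = [2, 1, 4, 3, 5]

# mean-center each set
mx, my = sum(Rx) / len(Rx), sum(Ry) / len(Ry)
xc = [r - mx for r in Rx]
yc = [r - my for r in Ry]

cos_theta = cosine(xc, yc)

# Spearman's rs via the difference formula (valid here: no ties)
n = len(Rx)
d2 = sum((a - b) ** 2 for a, b in zip(Rx, Ry))
rs = 1 - 6 * d2 / (n * (n * n - 1))

print(cos_theta, rs)  # both equal 0.8
```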
Early articles in the literature on correlation theory failed to recognize the nonadditivity of Pearson's and Spearman's correlation coefficients. Sadly, this is sometimes reflected in research articles, textbooks, and online publications.
Some authors arbitrarily convert correlations into Fisher Z scores and back without realizing that these so-called Fisher r-to-Z and Z-to-r transformations are only valid if the data is bivariate normally distributed (Zimmerman, Zumbo, & Williams, 2003), something hard to achieve since, by their own statistical nature, correlation coefficients are biased estimators.
A solution to the problem of computing averages from nonadditive quantities consists in applying the Self-Weighting Model or SWM (Garcia, 2012). The model works in the presence or absence of rank ties and bivariate normality, and is not limited to correlation estimators. A two-part tutorial on SWM is available (Garcia, 2015a; 2015b).
- Our tool also returns the following one-variable statistics:
- Degrees of Freedom, υ = n - 1
- Averages, x̄ = (1/n)Σxi and ȳ = (1/n)Σyi
- Sums of Squared Deviations, Σ(xi - x̄)2 and Σ(yi - ȳ)2
- Variances, sx2 = (1/υ)Σ(xi - x̄)2 and sy2 = (1/υ)Σ(yi - ȳ)2
- Standard Deviations, sx and sy
- Variability Coefficients, (sx/x̄)·100 and (sy/ȳ)·100
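These one-variable statistics can be sketched as follows (function name and sample data are our own; variances use the υ = n - 1 denominator):

```python
from math import sqrt

def one_variable_stats(data):
    """Degrees of freedom, mean, squared deviations, variance, sd, variability %."""
    n = len(data)
    v = n - 1                                # degrees of freedom
    avg = sum(data) / n
    ss = sum((x - avg) ** 2 for x in data)   # sum of squared deviations
    var = ss / v                             # sample variance
    sd = sqrt(var)
    cv = (sd / avg) * 100                    # variability coefficient, percent
    return {"df": v, "mean": avg, "ss": ss, "variance": var, "sd": sd, "cv": cv}

stats = one_variable_stats([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```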
Methods and algorithms for calculating the above are available elsewhere (MathWorld, 2017; Wikipedia, 2017a; 2017b; 2017c).
- Anyone who needs to perform a regression or correlation analysis.
- In (3), prove that β1 = rp(sy/sx) = COV(x,y)/sx2.
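A numeric check of that identity (a sketch with arbitrary data, not a proof):

```python
from math import sqrt

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]

n = len(x)
v = n - 1
xb, yb = sum(x) / n, sum(y) / n
cov = sum((a - xb) * (b - yb) for a, b in zip(x, y)) / v
sx2 = sum((a - xb) ** 2 for a in x) / v
sy2 = sum((b - yb) ** 2 for b in y) / v
sx, sy = sqrt(sx2), sqrt(sy2)
rp = cov / (sx * sy)

b1_from_rp = rp * (sy / sx)      # beta1 = rp * (sy/sx)
b1_from_cov = cov / sx2          # beta1 = COV(x,y) / sx^2
# least-squares slope computed directly
b1_ls = sum((a - xb) * (b - yb) for a, b in zip(x, y)) / sum((a - xb) ** 2 for a in x)

assert abs(b1_from_rp - b1_from_cov) < 1e-12
assert abs(b1_from_rp - b1_ls) < 1e-12
```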
- Garcia, E. (2012). The Self-Weighting Model. Communications in Statistics - Theory and Methods, 41(8), 1421-1427. Taylor and Francis, London.
- Garcia, E. (2015a). The Self-Weighting Model Tutorial: Part 1.
- Garcia, E. (2015b). The Self-Weighting Model Tutorial: Part 2.
- Garcia, E. (2017). On the Nonadditivity of Correlation Coefficients - Part 1: Pearson's r and Spearman's rs.
- MathWorld (2017). Least Squares Fitting.
- Rodgers, J. L., & Nicewander, W. A. (1988). Thirteen ways to look at the correlation coefficient. The American Statistician, 42(1), 59-66.
- Wikipedia (2017a). Simple linear regression.
- Wikipedia (2017b). Algorithms for calculating variance.
- Wikipedia (2017c). Covariance.
- Zimmerman, D. W., Zumbo, B. D., & Williams, R. H. (2003). Bias in estimation and hypothesis testing of correlation. Psicológica, 24, 133-158.
Feedback
Contact us for any suggestion or question regarding this tool.