Regression & Correlation
- This tool performs a simple linear regression and correlation analysis on a set of paired variables xi,yi. Pearson's and Spearman's correlation coefficients are also computed, and the nonadditivity of these coefficients is demonstrated.
- The xi,yi values to be submitted must be separated by spaces or commas. To use the tool, enter one variable pair per line, pressing the Enter key at the end of each line so the pairs are recognized individually. You may want to try the default example first.
- This tool complements our Standardizer tool, which analyses a one-variable data set.
- The tool decomposes the input data set into two new sets of size n and degrees of freedom υ = n - 1
X = {x1, x2,... xn} (1)
Y = {y1, y2,... yn} (2)
and then fits (1) and (2) to the simple linear regression model
y = β0 + β1x (3)
where (3) predicts a y value for every x = xi. This is why (3) is called a regression equation.
- β0 is the intercept of the regression line.
- β1 = rp(sy/sx) is the slope of the regression line, where rp is Pearson's Correlation Coefficient and sx and sy are the variables' standard deviations.
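A minimal sketch of how these quantities combine into a fitted line (our own function name and sample data; sample standard deviations use the n - 1 denominator, and the intercept follows from the least-squares property that the fitted line passes through (x̄, ȳ)):

```python
from statistics import mean, stdev

def fit_line(x, y):
    """Fit y = b0 + b1*x via b1 = rp*(sy/sx) and b0 = y_bar - b1*x_bar."""
    x_bar, y_bar = mean(x), mean(y)
    sx, sy = stdev(x), stdev(y)            # sample standard deviations (n - 1)
    n = len(x)
    cov = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (n - 1)
    rp = cov / (sx * sy)                   # Pearson's correlation coefficient
    b1 = rp * (sy / sx)                    # slope
    b0 = y_bar - b1 * x_bar                # intercept: line passes through the means
    return b0, b1, rp

# perfectly linear data y = 2x, so b0 = 0, b1 = 2, rp = 1 (up to float error)
b0, b1, rp = fit_line([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
```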
- The tool also computes the following statistics:
- Spearman Rank-Order Correlation Coefficient, rs
- Pearson Product-Moment Correlation Coefficient, rp
- Covariance, COV(x,y) = (1/υ)Σ(xi - x̄)(yi - ȳ)
- Signal (Determination Coefficient), rp2
- Noise (Indetermination Coefficient), 1 - rp2
- Signal/Noise Ratio, rp2/(1 - rp2)
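The last three statistics all follow from rp; a quick sketch (the value 0.9 is an arbitrary illustration, not tool output):

```python
rp = 0.9
signal = rp ** 2        # determination coefficient, rp2
noise = 1 - signal      # indetermination coefficient, 1 - rp2
snr = signal / noise    # signal/noise ratio
```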
The tool uses an algorithm that converts values to ranks and averages any ties that might be present. This comes in handy when we need to compute a Spearman's rs from ranks with a large number of ties. See the following Notes section.
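A sketch of a tie-averaging ranking routine consistent with that description (the function name is our own):

```python
def average_ranks(values):
    """Convert values to ranks 1..n, assigning tied values the mean of their ranks."""
    srt = sorted(values)
    rank_of = {}
    i = 0
    while i < len(srt):
        j = i
        while j < len(srt) and srt[j] == srt[i]:
            j += 1
        # positions i+1 .. j hold the same value; give them the mean of those ranks
        rank_of[srt[i]] = (i + 1 + j) / 2
        i = j
    return [rank_of[v] for v in values]

print(average_ranks([10, 20, 20, 30]))  # [1.0, 2.5, 2.5, 4.0]
```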
Notes on Pearson's and Spearman's Correlation Coefficients
Pearson's Correlation Coefficient, rp, can be defined as the covariance between two variables normalized by their standard deviations (Rodgers & Nicewander, 1988):

rp = COV(x,y)/(sx·sy) (4)

where (4) incorporates variability information between and within variables.
Spearman's Correlation Coefficient, rs, can be computed as a function of the rank differences di between paired ranks (Rx, Ry),

rs = 1 - 6Σdi2/(n(n2 - 1)) (5)

or as Pearson's coefficient computed from the ranks,

rs = COV(Rx,Ry)/(sRx·sRy) (6)
To some extent, the di term in (5) accounts for variability information between ranks from different sets, but says nothing about their variability and frequency within a given set.
Therefore, (5) is only valid for ranks free of ties. A workaround consists in introducing a smoothing correction by averaging ties. With few ties, (5) and (6) tend to agree to two decimal places, but as the number of ties increases, the coefficient computed with (5) becomes increasingly overestimated and eventually useless.
A better treatment consists in using (6) which accounts for between/within ranks variability and even works after averaging any number of ties. When there is a discrepancy between the results obtained from (5) and (6), one should always accept those from (6). To illustrate this point, consider the sets:
X = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Y = [1, 1, 1, 1, 1, 1, 1, 1, 1, 2]

Using (5), rs = 0.6364 ≈ 0.64, whereas with (6), rs = 0.5222 ≈ 0.52.
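A numeric check of this example (a sketch; the tie-averaged ranks of Y, nine 1s averaged to rank 5, are written out directly):

```python
from math import sqrt

Rx = [float(r) for r in range(1, 11)]  # ranks of X: 1..10, no ties
Ry = [5.0] * 9 + [10.0]                # ranks of Y with the nine tied 1s averaged to 5

n = len(Rx)

# Equation (5): rank-difference formula
d2 = sum((a - b) ** 2 for a, b in zip(Rx, Ry))
rs_5 = 1 - 6 * d2 / (n * (n * n - 1))

# Equation (6): Pearson's coefficient computed from the ranks
mx, my = sum(Rx) / n, sum(Ry) / n
num = sum((a - mx) * (b - my) for a, b in zip(Rx, Ry))
den = sqrt(sum((a - mx) ** 2 for a in Rx) * sum((b - my) ** 2 for b in Ry))
rs_6 = num / den

print(round(rs_5, 4), round(rs_6, 4))  # 0.6364 0.5222
```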
A refreshing consequence of computing rs as in (6) is that we can now verify its result from β1, the slope of the simple linear regression model (3), or from the model's coefficient of determination, rs2.
For all practical purposes, one can safely avoid (5) altogether and use (6) when computing correlations from ranks.
On the Non-Additivity of Pearson's and Spearman's Correlation Coefficients
The default example given on this page was intentionally selected to illustrate that, for ranks free of ties or with ties averaged, a Spearman's correlation coefficient is a Pearson's correlation coefficient (Garcia, 2017). Since the latter is a cosine value, and cosines are not additive, neither of these coefficients can be arithmetically added and averaged. This is easy to demonstrate:
- Take any two sets of paired ranks. If there are any ties, average these within each set.
- Mean center each set by subtracting the set mean from each element.
- Treating each set as a vector, compute the cosine of the angle between them, cos(θ). This is commonly known as the cosine similarity between the vectors.
- Convince yourself that cos(θ) = rs.
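The steps above can be sketched numerically (the tie-free ranks below are an arbitrary illustration):

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

# paired ranks, free of ties
Rx = [1, 2, 3, 4, 5]
Ry = [2, 1, 4, 3, 5]

# mean-center each set
mx, my = sum(Rx) / len(Rx), sum(Ry) / len(Ry)
xc = [r - mx for r in Rx]
yc = [r - my for r in Ry]

cos_theta = cosine(xc, yc)

# Spearman's rs via the difference formula (valid here: no ties)
n = len(Rx)
d2 = sum((a - b) ** 2 for a, b in zip(Rx, Ry))
rs = 1 - 6 * d2 / (n * (n * n - 1))

print(cos_theta, rs)  # both equal 0.8
```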
Early articles in the literature on correlation theory failed to recognize the nonadditivity of Pearson's and Spearman's correlation coefficients. Sadly, this is sometimes reflected in research articles, textbooks, and online publications.
Some authors arbitrarily convert correlations into Fisher Z scores and back without realizing that these so-called Fisher r-to-Z and Z-to-r transformations are only valid if the data is bivariate normally distributed (Zimmerman, Zumbo, & Williams, 2003), something hard to achieve since, by their own statistical nature, correlation coefficients are biased estimators.
A solution to the problem of computing averages from nonadditive quantities consists in applying the Self-Weighting Model or SWM (Garcia, 2012). The model works in the presence or absence of rank ties and bivariate normality, and is not limited to correlation estimators. A two-part tutorial on SWM is available (Garcia, 2015a; 2015b).
- Our tool also returns the following one-variable statistics:
- Degrees of Freedom, υ = n - 1
- Averages, x̄ = (1/n)Σxi and ȳ = (1/n)Σyi
- Sums of Squared Deviations, Σ(xi - x̄)2 and Σ(yi - ȳ)2
- Variances, sx2 = (1/υ)Σ(xi - x̄)2 and sy2 = (1/υ)Σ(yi - ȳ)2
- Standard Deviations, sx and sy
- Variability Coefficients, (sx/x̄)·100 and (sy/ȳ)·100
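These one-variable statistics can be sketched as follows (function name and sample data are our own; variances use the υ = n - 1 denominator):

```python
from math import sqrt

def one_variable_stats(data):
    """Degrees of freedom, mean, squared deviations, variance, sd, variability %."""
    n = len(data)
    v = n - 1                                # degrees of freedom
    avg = sum(data) / n
    ss = sum((x - avg) ** 2 for x in data)   # sum of squared deviations
    var = ss / v                             # sample variance
    sd = sqrt(var)
    cv = (sd / avg) * 100                    # variability coefficient, percent
    return {"df": v, "mean": avg, "ss": ss, "variance": var, "sd": sd, "cv": cv}

stats = one_variable_stats([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```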
Methods and algorithms for calculating the above are available elsewhere (MathWorld, 2017; Wikipedia, 2017a; 2017b; 2017c).
- Anyone who needs to perform a regression or correlation analysis.
- In (3), prove that β1 = rp(sy/sx) = COV(x,y)/sx2.
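A numeric check of that identity (a sketch with arbitrary data, not a proof):

```python
from math import sqrt

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]

n = len(x)
v = n - 1
xb, yb = sum(x) / n, sum(y) / n
cov = sum((a - xb) * (b - yb) for a, b in zip(x, y)) / v
sx2 = sum((a - xb) ** 2 for a in x) / v
sy2 = sum((b - yb) ** 2 for b in y) / v
sx, sy = sqrt(sx2), sqrt(sy2)
rp = cov / (sx * sy)

b1_from_rp = rp * (sy / sx)      # beta1 = rp * (sy/sx)
b1_from_cov = cov / sx2          # beta1 = COV(x,y) / sx^2
# least-squares slope computed directly
b1_ls = sum((a - xb) * (b - yb) for a, b in zip(x, y)) / sum((a - xb) ** 2 for a in x)

assert abs(b1_from_rp - b1_from_cov) < 1e-12
assert abs(b1_from_rp - b1_ls) < 1e-12
```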
- Garcia, E. (2012). The Self-Weighting Model. Communications in Statistics - Theory and Methods, 41(8), 1421-1427. Taylor and Francis, London.
- Garcia, E. (2015a). The Self-Weighting Model Tutorial: Part 1.
- Garcia, E. (2015b). The Self-Weighting Model Tutorial: Part 2.
- Garcia, E. (2017). On the Nonadditivity of Correlation Coefficients - Part 1: Pearson's r and Spearman's rs.
- MathWorld (2017). Least Squares Fitting.
- Rodgers, J. L., & Nicewander, W. A. (1988). Thirteen ways to look at the correlation coefficient. The American Statistician, 42(1), 59-66.
- Wikipedia (2017a). Simple linear regression.
- Wikipedia (2017b). Algorithms for calculating variance.
- Wikipedia (2017c). Covariance.
- Zimmerman, D. W., Zumbo, B. D., & Williams, R. H. (2003). Bias in estimation and hypothesis testing of correlation. Psicológica, 24, 133-158.
Feedback
Contact us for any suggestion or question regarding this tool.