Cosine Similarity Calculator
- This tool computes the traditional cosine similarity between two data sets, Y and X, of the same size n. The tool also computes the centered cosine similarity, which is then transformed into a distance metric.
- In addition, the tool computes the following statistics: the correlation coefficient, the determination coefficient, the alienation coefficient, and a t-statistic at ν = n - 2 degrees of freedom.
- To use the tool, enter the data set values, one value per line, ending each line by pressing the Enter key. Try the example provided, in which the data reported by Swinscow on the height (Y) and pulmonary anatomical dead space (X) of 15 children are analyzed (Swinscow, 1997).
- Cosine Similarity Defined
The cosine of the angle between two vectors, cos(θ), is a special measure of association known as cosine similarity. This is a measure of the resemblance between two data sets.
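For two data sets x and y of size n, the standard definition (the expression referenced below as (1)) is:

```latex
\cos\theta \;=\; \frac{\mathbf{x}\cdot\mathbf{y}}{\lVert\mathbf{x}\rVert\,\lVert\mathbf{y}\rVert}
\;=\; \frac{\sum_{i=1}^{n} x_i\,y_i}{\sqrt{\sum_{i=1}^{n} x_i^{2}}\,\sqrt{\sum_{i=1}^{n} y_i^{2}}}
```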
If θ = 0°, cos(θ) = 1 and the vectors point in the same direction. By contrast, if θ = 90°, cos(θ) = 0 and the vectors point in perpendicular directions; i.e., the vectors are orthogonal.
Orthogonality is a special case of linear independence. By definition, linearly independent variables are those whose vectors do not fall along the same line.
Table 1 depicts possible types of angles that can be obtained from (1).
Table 1. Types of Angles

| Type | Angle |
|---|---|
| acute | θ < 90° |
| right | θ = 90° |
| obtuse | 90° < θ < 180° |
| straight | θ = 180° |
| reflex | 180° < θ < 360° |
| perigon | θ = 360° |
- Centered Cosine Similarity
Replacing the variables in (1) with their deviations from the mean, x'ᵢ = xᵢ - x̄ and y'ᵢ = yᵢ - ȳ, leads to what is known as the centered cosine similarity.
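As a concrete sketch (plain Python, no dependencies), the uncentered and centered cosine similarities can be computed as follows; the data values are made up for illustration and are not the Swinscow example:

```python
import math

def cosine_similarity(x, y):
    """Uncentered cosine similarity between two equal-size data sets."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def centered_cosine_similarity(x, y):
    """Cosine similarity of the mean-centered data; identical to Pearson's r."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return cosine_similarity([a - mean_x for a in x],
                             [b - mean_y for b in y])

# Made-up illustrative values:
X = [110, 116, 124, 129, 131, 138, 142]
Y = [44, 31, 43, 45, 56, 79, 57]
print(cosine_similarity(X, Y), centered_cosine_similarity(X, Y))
```

Note that for all-positive data the uncentered similarity is typically close to 1 even when the centered similarity is modest.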
Mean-centering can change the angle between vectors, and hence their cosine similarity (Garcia, 2015a; Rodgers, Nicewander, & Toothaker, 1984), exposing the true nature of the variables. As shown in Table 2, the following can happen to variable vectors after centering:
- If they are perpendicular and become non-perpendicular, the variables are orthogonal but not uncorrelated.
- If they are not perpendicular and become perpendicular, the variables are uncorrelated but not orthogonal.
- If they are perpendicular and remain perpendicular, the variables are both orthogonal and uncorrelated.
- If they are not perpendicular and remain non-perpendicular, the variables are neither orthogonal nor uncorrelated, although their angle can still change.
Table 2. Types of Variables Before and After Centering

| Uncentered | Centered | Variables Type |
|---|---|---|
| perpendicular, θ = 90° | not perpendicular, θ ≠ 90° | orthogonal, not uncorrelated |
| not perpendicular, θ ≠ 90° | perpendicular, θ = 90° | uncorrelated, not orthogonal |
| perpendicular, θ = 90° | perpendicular, θ = 90° | orthogonal, uncorrelated |
| not perpendicular, θ ≠ 90° | not perpendicular, θ ≠ 90° | not orthogonal, not uncorrelated |
Our cosine similarity tool can detect these changes.
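A quick numeric illustration of the first two cases in Table 2, using arbitrary toy vectors chosen for this sketch:

```python
def dot(x, y):
    """Dot product of two equal-size sequences."""
    return sum(a * b for a, b in zip(x, y))

def center(x):
    """Replace each value by its deviation from the mean."""
    m = sum(x) / len(x)
    return [a - m for a in x]

# Case 1: perpendicular before centering, not after
# -> orthogonal but not uncorrelated.
x1, y1 = [2, 1, -1], [1, 0, 2]
assert dot(x1, y1) == 0                            # orthogonal
assert abs(dot(center(x1), center(y1))) > 1e-9     # yet correlated

# Case 2: not perpendicular before centering, perpendicular after
# -> uncorrelated but not orthogonal.
x2, y2 = [1, 2, 3], [1, 2, 1]
assert dot(x2, y2) != 0                            # not orthogonal
assert abs(dot(center(x2), center(y2))) < 1e-12    # yet uncorrelated
```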
- Transforming Cosine Similarities into Distances
One of the most interesting problems in data mining and cluster analysis is the transformation of similarities into distances without violating the triangle inequality condition required of a distance metric.
Some of the transformations found in the literature are based on heuristics and tricks of the trade, or based on assumptions applicable to a given knowledge domain. This topic is discussed in one of our tutorials (Garcia, 2015b).
We have incorporated into our calculator tool a simple methodology that easily transforms cosine similarities into distances. It all boils down to mean-centering the variables.
It can be shown that a centered cosine similarity is the same as Pearson's correlation coefficient, r:
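In standard notation, this identity reads:

```latex
r \;=\; \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}
{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^{2}}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^{2}}}
```

which is exactly the cosine formula applied to the mean-centered data.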
Pearson's r is a measure of the strength of the linear correlation between two variables. Since Pearson's r is a cosine, it is not possible to arithmetically add, subtract, or average r values.
We would expect uncentered and centered cosine similarities to differ in some way, and they do. Uncentered and centered cosine similarities are both measures of the strength of the linear association between two variables. However, a centered cosine similarity also measures the correlation between them, while an uncentered cosine similarity says nothing about how linearly correlated the variables are.
Since a centered cosine similarity is a correlation coefficient, its transformation into a distance metric is straightforward. Squaring r, we obtain the coefficient of determination, r², which denotes the proportion of the variability of the dependent variable y that can be attributed to its linear relation with the independent variable x.
Subtracting r² from 1, we obtain the alienation coefficient, also known as the undetermination coefficient, 1 - r². This coefficient measures the proportion of variance not shared by x and y; i.e., an alienation coefficient measures the lack of relationship between two variables (Abdi, 2007; Vogt, 2009). Some authors define the alienation coefficient as the square root of 1 - r². We prefer to call sqrt(1 - r²) simply a correlation distance, dc, since squaring leads to a distance metric on the unit circle; i.e., dc² + r² = 1.
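A minimal sketch of these derived quantities, using a hypothetical correlation coefficient for illustration:

```python
import math

def correlation_distance(r):
    """d_c = sqrt(1 - r**2), so that d_c**2 + r**2 = 1 (unit circle)."""
    return math.sqrt(1.0 - r * r)

r = 0.846                      # hypothetical correlation coefficient
r2 = r * r                     # coefficient of determination
alienation = 1.0 - r2          # alienation (undetermination) coefficient
d_c = correlation_distance(r)  # correlation distance
assert abs(d_c**2 + r**2 - 1.0) < 1e-12
```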
Van Dongen & Enright (2012) call the right side of (4) the absolute correlation distance. Unfortunately, this can cause some confusion, since some authors give the same name to 1 - |r|, even though that expression is not a distance metric (Chen, Ng, Lin, Jiang, & Li, 2019).
Dividing (3) by (4) and multiplying the result by the square root of the number of degrees of freedom, ν = n - 2, yields the t-statistic of significance.
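The numbered equation is not reproduced here, but this is the standard t-statistic for a correlation coefficient; the r and n values below are hypothetical:

```python
import math

def t_statistic(r, n):
    """t = r * sqrt(n - 2) / sqrt(1 - r**2), at nu = n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)

# Hypothetical values: r = 0.846 with n = 15 paired observations (nu = 13).
t = t_statistic(0.846, 15)
# The two-tailed 5% critical value for nu = 13 is about 2.160, so a t-value
# near 5.7 would lead us to reject the null hypothesis.
```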
With (5), we can perform the following test of significance. We first state the null hypothesis H0 that there is no linear relationship between the paired variables (r = 0); i.e., we hypothesize that r is not statistically different from zero.
Next, we calculate the t-statistic with (5) and compare it with the critical value listed in a statistical t-table at the appropriate degrees of freedom and the desired level of confidence. If such a table is not readily available, you may want to use our t-Values Calculator or Student-t Table Generator tools.
If tcalculated > tcritical, we reject H0. In other words, there is a significant relationship between the paired variables.
As we can see, centering provides the basis for transforming cosine similarities into distances using solid statistical concepts.
- Researchers, teachers, students, or anyone working with data sets.
- Users with a basic knowledge of statistics.
- What is the difference, if any, between orthogonal and uncorrelated sets?
- Does an uncentered cosine similarity of zero mean that paired variables are uncorrelated? Explain.
- Are the following sets orthogonal, uncorrelated, or both? Why?
Y = [1,-5,3,-1]
X = [5,1,1,3]
- Abdi, H. (2007). Coefficients of Correlation, Alienation, and Determination. The University of Texas at Dallas. In Encyclopedia of Measurement and Statistics. Neil Salkind (Ed.). Sage.
- Chen, J., Ng, Y. K., Lin, L., Jiang, Y., and Li, S. (2019). On Triangular Inequalities of Correlation-based Distances for Gene Expression Profiles. bioRxiv. DOI: 10.1101/582106.
- Van Dongen, S. and Enright, A. J. (2012). Metric Distances Derived from Cosine Similarity and Pearson and Spearman Correlations. arXiv.org. See also ResearchGate copy.
- Garcia, E. (2015a). A Cosine Similarity Tutorial.
- Garcia, E. (2015b). A Tutorial on Distance and Similarity.
- Rodgers, J. L., Nicewander, W. A., & Toothaker, L. (1984). Linearly Independent, Orthogonal, and Uncorrelated Variables. The American Statistician, Vol. 38, No. 2, pp. 133-134.
- Swinscow, T. D. V. (1997). Statistics at Square One, p. 75. BMJ Publishing Group. See also The BMJ site.
- Vogt, W. P. (2009). Dictionary of Statistics & Methodology: A Nontechnical Guide for the Social Sciences, p. 6. Sage, London.
Contact us for any suggestion or question regarding this tool.