Cosine Similarity Calculator
- This tool computes uncentered and centered cosine similarities between paired data sets of the same size n.
- The tool also computes the following statistics: correlation coefficient, determination coefficient, alienation coefficient, and a t-statistic.
- To use the tool, enter data set values, one value per line, pressing the Enter key after each value. You may want to try the example provided, in which the height (Y) and pulmonary anatomical dead space (X) of 15 children are analyzed (Swinscow, 1997).
- Cosine Similarity Defined
The cosine of the angle between two vectors, cos(θ), is a special measure of association known as cosine similarity. This is a measure of the resemblance between two data sets:

cos(θ) = Σxᵢyᵢ / (√Σxᵢ² · √Σyᵢ²)    (1)
If θ = 0°, cos(θ) = 1 and the vectors point in the same direction. By contrast, if θ = 90°, cos(θ) = 0 and the vectors point in perpendicular directions; that is, the vectors are orthogonal.
Orthogonality is a special case of linear independence. By definition, linearly independent variables are those whose vectors do not fall along the same line.
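The uncentered cosine similarity of (1) can be computed directly from its definition. The sketch below is illustrative only and is not part of the tool; the vector values are made up.

```python
import math

def cosine_similarity(x, y):
    """Uncentered cosine similarity, per (1): cos(theta) = sum(x*y) / (|x| * |y|)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

# Vectors pointing in the same direction: cos(theta) ≈ 1
print(cosine_similarity([1, 2, 3], [2, 4, 6]))
# Perpendicular (orthogonal) vectors: cos(theta) = 0
print(cosine_similarity([1, 0], [0, 1]))
```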
Table 1 depicts possible types of angles that can be obtained from (1).
Table 1. Types of Angles

| Type     | Angle             |
|----------|-------------------|
| acute    | θ < 90°           |
| right    | θ = 90°           |
| obtuse   | 90° < θ < 180°    |
| straight | θ = 180°          |
| reflex   | 180° < θ < 360°   |
| perigon  | θ = 360°          |

- Centered Cosine Similarity
Subtracting the mean values in (1), x'ᵢ = xᵢ − x̄ and y'ᵢ = yᵢ − ȳ, leads to what is known as the centered cosine similarity:

cos(θ') = Σ(xᵢ − x̄)(yᵢ − ȳ) / (√Σ(xᵢ − x̄)² · √Σ(yᵢ − ȳ)²)    (2)
Mean-centering can change the angle, and therefore the cosine similarity, between vectors, exposing the true nature of paired variables (Garcia, 2015a; Rodgers, Nicewander, & Toothaker, 1984).
As shown in Table 2, the following can happen to variable vectors after centering:
- If they are perpendicular and become not perpendicular, the variables are orthogonal, but not uncorrelated.
- If they are not perpendicular and become perpendicular, the variables are uncorrelated, but not orthogonal.
- If they are perpendicular and remain perpendicular, the variables are both orthogonal and uncorrelated.
- If they are not perpendicular and remain not perpendicular, the variables are neither orthogonal nor uncorrelated, although their angle can still change.
Table 2. Types of Variables Before and After Centering

| Uncentered                 | Centered                   | Variables Type                  |
|----------------------------|----------------------------|---------------------------------|
| perpendicular, θ = 90°     | not perpendicular, θ ≠ 90° | orthogonal, not uncorrelated    |
| not perpendicular, θ ≠ 90° | perpendicular, θ = 90°     | uncorrelated, not orthogonal    |
| perpendicular, θ = 90°     | perpendicular, θ = 90°     | orthogonal, uncorrelated        |
| not perpendicular, θ ≠ 90° | not perpendicular, θ ≠ 90° | not orthogonal, not uncorrelated|

Our cosine similarity tool detects these changes.
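The first row of Table 2 can be demonstrated with a small made-up example: two vectors that are perpendicular before centering (orthogonal) but perfectly correlated after centering (not uncorrelated). This is an illustrative sketch, not the tool's implementation.

```python
import math

def cosine(x, y):
    """Uncentered cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def center(v):
    """Subtract the mean from every component."""
    m = sum(v) / len(v)
    return [a - m for a in v]

x, y = [1.0, 0.0], [0.0, 1.0]
print(cosine(x, y))                  # 0: perpendicular, so orthogonal
print(cosine(center(x), center(y)))  # ≈ -1 after centering: r = -1, so not uncorrelated
```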
- Transforming Cosine Similarities to Distances
One of the most interesting problems in cluster analysis is the transformation of similarities into distances without breaking the triangle inequality condition for a distance metric. Some of the transformations found in the literature are based on heuristics and tricks of the trade, or on assumptions applicable to a given knowledge domain. This topic is discussed in one of our tutorials (Garcia, 2015b).
We have incorporated into our tool a simple methodology that transforms cosine similarities into distances. It all boils down to mean-centering the variables.
It can be shown that Pearson's correlation coefficient, r, is a centered cosine similarity:

r = Σ(xᵢ − x̄)(yᵢ − ȳ) / (√Σ(xᵢ − x̄)² · √Σ(yᵢ − ȳ)²)    (3)
Pearson's r is a measure of the strength of the linear correlation between two variables. If the variables are presented as ranks, Pearson's r is the same as Spearman's correlation coefficient, ρ.
Due to the cosine nature of r values, we cannot add, subtract, or average them, because the result would not itself be a cosine. The same goes for ρ values. The Self-Weighting Model provides a workaround for computing averages from nonadditive values (Garcia, 2012).
That said, both uncentered and centered cosine similarities measure the strength of the linear association between two variables. However, only centered cosine similarities also measure the correlation between them; uncentered cosine similarities do not.
Since a centered cosine similarity is a correlation coefficient, its transformation into a distance metric is straightforward. Squaring r, we obtain the coefficient of determination, r², which denotes the proportion of the variability of the dependent variable y that can be attributed to its linear relation with the independent variable x.
Subtracting r² from 1, we obtain the alienation coefficient, also known as the undetermination coefficient, 1 − r². This coefficient measures the proportion of variance not shared by x and y; i.e., an alienation coefficient measures the lack of relationship between two variables (Abdi, 2007; Vogt, 2009). Some authors define the alienation coefficient as the square root of 1 − r². We prefer to call √(1 − r²) simply a correlation distance, dc:

dc = √(1 − r²)    (4)
because squaring leads to a distance metric on the unit circle; i.e., dc² + r² = 1. This distance is a measure of dissimilarity. Again, this transformation is not possible with uncentered cosine similarities.
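Equation (4) and the unit-circle identity dc² + r² = 1 are easy to verify numerically; the r value below is made up for illustration.

```python
import math

def correlation_distance(r):
    """Correlation distance per (4): dc = sqrt(1 - r^2)."""
    return math.sqrt(1.0 - r * r)

r = 0.8
dc = correlation_distance(r)
print(dc)             # ≈ 0.6
print(dc**2 + r**2)   # ≈ 1.0, the unit-circle identity dc^2 + r^2 = 1
```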
Van Dongen & Enright (2012) call the right side of (4) the absolute correlation distance. Unfortunately, this causes some confusion, since some authors use the same name for 1 − |r|, even though that expression is not a distance metric (Chen, Ng, Lin, Jiang, & Li, 2019).
Dividing (3) by (4) and multiplying the result by the square root of the number of degrees of freedom, υ = n − 2, yields the t-statistic of significance:

t = r√υ / √(1 − r²)    (5)
With (5), we can perform the following test of significance. We first state the null hypothesis H0 that there is no linear relationship between the paired variables (r = 0); that is, we hypothesize that r is not statistically different from zero.
Next, we calculate the t-statistic with (5) and compare it with the critical value listed in a statistical t-table at the appropriate degrees of freedom and the desired level of confidence. If such a table is not readily available, you may want to use our t-Values Calculator or Student-t Table Generator tools.
If tcalculated > tcritical, we reject H0 and conclude that there is a significant relationship between the paired variables.
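The test can be sketched as follows. The values r = 0.846 and n = 15 are illustrative assumptions (chosen to match the scale of a 15-observation example); the two-tailed critical value for 13 degrees of freedom at the 95% confidence level is about 2.160.

```python
import math

def t_statistic(r, n):
    """t-statistic of significance per (5): t = r * sqrt(v) / sqrt(1 - r^2), v = n - 2."""
    v = n - 2
    return r * math.sqrt(v) / math.sqrt(1.0 - r * r)

t = t_statistic(0.846, 15)   # illustrative r and n; v = 13 degrees of freedom
t_critical = 2.160           # approximate two-tailed critical value, 13 df, 95% confidence
print(t > t_critical)        # t ≈ 5.7 > 2.160, so we reject H0
```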
As we can see, in addition to revealing the true nature of paired variables and correlation coefficients, centering provides the basis for transforming cosine similarities into distances using solid statistical concepts.
- Researchers, teachers, students, or anyone working with data sets.
- Users with a basic knowledge of statistics.
- What is the difference, if any, between orthogonal and uncorrelated sets?
- Are paired variables uncorrelated if their uncentered similarity is zero? Why?
- Are the following sets orthogonal, uncorrelated, or both? Explain.
Y = [1,-5,3,-1]
X = [5,1,1,3]
- Abdi, H. (2007). Coefficients of Correlation, Alienation, and Determination. The University of Texas at Dallas. In Encyclopedia of Measurement and Statistics. Neil Salkind (Ed.). Sage.
- Chen, J., Ng, Y. K., Lin, L., Jiang, Y., and Li, S. (2019). On Triangular Inequalities of Correlation-based Distances for Gene Expression Profiles. bioRxiv. DOI: 10.1101/582106.
- van Dongen, S. and Enright, A. J. (2012). Metric Distances Derived from Cosine Similarity and Pearson and Spearman Correlations. arXiv.org. See also ResearchGate copy.
- Garcia, E. (2015a). A Cosine Similarity Tutorial.
- Garcia, E. (2015b). A Tutorial on Distance and Similarity.
- Garcia, E. (2012). The Self-Weighting Model. Communications in statistics. Theory and methods. 2012, Vol 41, Num 7-9, pp 1421-1427. Taylor & Francis, London. See also ResearchGate.net version.
- Rodgers, J. L., Nicewander, W. A., Toothaker, L. (1984). Linearly Independent, Orthogonal, and Uncorrelated Variables. The American Statistician, Vol. 38, No. 2. Pp 133-134.
- Swinscow, T. D. V. (1997). Statistics at Square One, p. 75. The BMJ, 1997. See also The BMJ site.
- Vogt, W. P. (2009). Dictionary of Statistics & Methodology: A Nontechnical Guide for the Social Sciences. p6. Sage. London.
Feedback
Contact us with any suggestions or questions regarding this tool.