Cosine Similarity Calculator

Compute cosine similarity and distance measures between data sets and analyze their variables.


Cosine Similarity Defined
The cosine of the angle between two vectors, cos(θ), is a special measure of association known as cosine similarity. This is a measure of the resemblance between two data sets.
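\[ \cos(\theta) = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}} \tag{1} \]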

If θ = 0°, cos(θ) = 1, so the vectors point in the same direction. By contrast, if θ = 90°, cos(θ) = 0 and the vectors are perpendicular; that is, they are orthogonal.
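The two limiting cases above can be checked numerically. The following is a minimal sketch, not the tool's own code; the function names `cosine_similarity` and `angle_degrees` are illustrative assumptions, and the inputs are plain Python lists of numbers.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length numeric sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)

def angle_degrees(x, y):
    """Angle between x and y in degrees (hypothetical helper)."""
    # Clamp to [-1, 1] so floating-point drift cannot break acos.
    c = max(-1.0, min(1.0, cosine_similarity(x, y)))
    return math.degrees(math.acos(c))

# Parallel vectors: the cosine is (approximately) 1, the angle 0 degrees.
print(cosine_similarity([1, 2, 3], [2, 4, 6]))
# Perpendicular vectors: the cosine is 0, the angle 90 degrees.
print(angle_degrees([1, 0], [0, 1]))
```

Because of floating-point rounding, results should be compared with a tolerance (for example `math.isclose`) rather than with `==`.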

Orthogonality is a special case of linear independence. By definition, linearly independent variables are those whose vectors do not fall along the same line.

Table 1 depicts possible types of angles that can be obtained from (1).

Table 1. Types of Angles.
Type	acute	right	obtuse	straight	reflex	perigon
Angle	θ < 90°	θ = 90°	180° > θ > 90°	θ = 180°	360° > θ > 180°	θ = 360°

Centered Cosine Similarity
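Subtracting mean values in (1), x'_i = x_i - x̄ and y'_i = y_i - ȳ, leads to what is known as the centered cosine similarity.

\[ \cos(\theta_{\text{centered}}) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \tag{2} \]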

Mean-centering can change the angle, and therefore the cosine similarity, between vectors, exposing the true nature of paired variables (Garcia, 2015a; Rodgers, Nicewander, & Toothaker, 1984).
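To illustrate how centering changes the result, here is a small sketch that computes both the uncentered and the centered cosine similarity for the same pair of variables. The helper names (`center`, `centered_cosine_similarity`) and the sample data are assumptions made for this example only.

```python
import math

def cosine_similarity(x, y):
    """Uncentered cosine similarity of two equal-length sequences."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def center(values):
    """Subtract the arithmetic mean from every value."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def centered_cosine_similarity(x, y):
    """Cosine similarity computed after mean-centering both variables."""
    return cosine_similarity(center(x), center(y))

# Illustrative data, not taken from the source.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 6]

print("uncentered:", cosine_similarity(x, y))
print("centered:  ", centered_cosine_similarity(x, y))
```

For this sample the two values differ, which is the change in angle described above. Note that the centered version is undefined when either variable is constant, since its centered vector is then the zero vector.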

As shown in Table 2, the following can happen to variable vectors after centering:

If they are perpendicular, they can become non-perpendicular, so the variables are orthogonal but not uncorrelated.
If they are not perpendicular, they can become perpendicular, so the variables are uncorrelated but not orthogonal.
If they are perpendicular and remain perpendicular, the variables are orthogonal and uncorrelated.
If they are not perpendicular and remain not perpendicular, the variables are neither orthogonal nor uncorrelated, although their angle can change.

Table 2. Types of Variables Before and After Centering
Uncentered	Centered	Variables Type
perpendicular, θ = 90°	not perpendicular, θ ≠ 90°	orthogonal, not uncorrelated
not perpendicular, θ ≠ 90°	perpendicular, θ = 90°	uncorrelated, not orthogonal
perpendicular, θ = 90°	perpendicular, θ = 90°	orthogonal, uncorrelated
not perpendicular, θ ≠ 90°	not perpendicular, θ ≠ 90°	not orthogonal, not uncorrelated
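The four rows of Table 2 can be reproduced with small hand-picked vectors. The sketch below is one possible way to classify a pair of variables; the function `classify`, the absolute tolerance `tol`, and the example vectors are all assumptions for illustration, and the tolerance would need adjusting for data on a very different scale.

```python
def center(values):
    """Subtract the arithmetic mean from every value."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def dot(x, y):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(x, y))

def classify(x, y, tol=1e-12):
    """Return (orthogonal, uncorrelated) as a pair of booleans.

    orthogonal:   the raw vectors have a zero dot product.
    uncorrelated: the mean-centered vectors have a zero dot product.
    """
    orthogonal = abs(dot(x, y)) <= tol
    uncorrelated = abs(dot(center(x), center(y))) <= tol
    return orthogonal, uncorrelated

# One illustrative pair per row of Table 2.
examples = [
    ([1, 0], [0, 1]),          # orthogonal, not uncorrelated
    ([1, 2, 3], [3, 0, 3]),    # uncorrelated, not orthogonal
    ([1, -1, 0], [1, 1, -2]),  # orthogonal and uncorrelated
    ([1, 2, 3], [1, 2, 4]),    # neither
]

for x, y in examples:
    print(x, y, classify(x, y))
```

Testing the dot products directly, rather than the cosines, avoids a division by zero when one of the centered vectors is the zero vector.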

Our cosine similarity tool detects these changes.

Transforming Cosine Similarities to Distances
One of the most interesting problems in cluster analysis relates to the transformation of similarities to distances without breaking the triangle inequality condition for a distance metric.
Some of the transformations found in the literature are based on heuristics and tricks of the trade, or based on assumptions applicable to a given knowledge domain. This topic is discussed in one of our tutorials (Garcia, 2015b).

We have incorporated into our tool a simple methodology that transforms cosine similarities to distances. It all boils down to mean-centering the variables.

It can be shown that Pearson's correlation coefficient, r, is a centered cosine similarity:

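\[ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \tag{3} \]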

Pearson's r is a measure of the strength of the linear correlation between two variables. If the variables are presented as ranks, Pearson's r is the same as Spearman's correlation coefficient, ρ.
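The relationship between Pearson's r and Spearman's ρ can be sketched as follows: ρ is obtained by replacing the values with their ranks and then computing r as a centered cosine similarity. The helper names and the tie-handling choice (average ranks) are assumptions of this example, not a description of the tool's internals.

```python
import math

def pearson_r(x, y):
    """Pearson's r computed as a centered cosine similarity."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    xc = [a - mean_x for a in x]
    yc = [b - mean_y for b in y]
    dot = sum(a * b for a, b in zip(xc, yc))
    return dot / (math.sqrt(sum(a * a for a in xc)) *
                  math.sqrt(sum(b * b for b in yc)))

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho as Pearson's r applied to the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

# A monotonic but nonlinear relationship (illustrative data).
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
print("Pearson r:   ", pearson_r(x, y))
print("Spearman rho:", spearman_rho(x, y))
```

For this monotonic sample, ρ is 1 up to rounding while r is below 1, because the ranks are perfectly linearly related even though the raw values are not.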

Due to the cosine nature of r values, we cannot add, subtract, or average them because the result won't be a cosine. The same goes for ρ values. The Self-Weighting Model provides a workaround for computing averages from nonadditive values (Garcia, 2012).

That said, uncentered and centered cosine similarities are measures of the strength of the linear association between two variables. However, centered cosine similarities also measure the correlation between them. You cannot do this with uncentered cosine similarities.

Since a centered cosine similarity is a correlation coefficient, its transformation to a distance metric is straightforward. Squaring r, we obtain a coefficient of determination, r², which denotes the proportion of the variability of the dependent variable y that can be attributed to its linear relation with the independent variable x.

Subtracting r² from 1 we obtain the alienation coefficient, also known as the undetermination coefficient, 1 - r². This coefficient measures the proportion of variance not shared by x and y; i.e., an alienation coefficient measures the lack of relationship between two variables (Abdi, 2007; Vogt, 2009). Some authors define the alienation coefficient as the square root of 1 - r². We prefer to call sqrt(1 - r²) just a correlation distance, d_c

\[ d_c = \sqrt{1 - r^2} \tag{4} \]

because squaring leads to a distance metric on the unit circle; i.e., d_c² + r² = 1. This distance is a measure of dissimilarity. Again, we cannot do this transformation with uncentered cosine similarities.
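As a quick numerical check of the identity d_c² + r² = 1, the sketch below computes d_c = sqrt(1 - r²) for a few correlation values. The function name `correlation_distance` and the chosen r values are assumptions for illustration.

```python
import math

def correlation_distance(r):
    """Correlation distance d_c = sqrt(1 - r^2) for a correlation r."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie in [-1, 1]")
    return math.sqrt(1.0 - r * r)

# d_c is 0 for perfect correlation (r = +1 or -1) and 1 when r = 0.
for r in (-1.0, -0.6, 0.0, 0.6, 1.0):
    d_c = correlation_distance(r)
    print(f"r = {r:+.1f}  d_c = {d_c:.4f}  d_c^2 + r^2 = {d_c * d_c + r * r:.4f}")
```

Note that r and -r give the same distance, so this transformation treats strong negative and strong positive correlations as equally close.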

Dongen & Enright (2012) call the right side of (4) the absolute correlation distance. Unfortunately, this causes some confusion, since some authors use the same name for 1 - |r|, even though that expression is not a distance metric (Chen, Ng, Lin, Jiang, & Li, 2019).

Dividing (3) by (4) and multiplying the result by the square root of the number of degrees of freedom υ, where υ = n - 2, yields the t-statistic of significance.

\[ t = \frac{r\,\sqrt{n - 2}}{\sqrt{1 - r^2}} \tag{5} \]

With (5), we can perform the following test of significance. We first state the null hypothesis H₀ that there is no linear relationship between the paired independent random variables (r = 0). That is, we make the hypothesis that r is not statistically different from zero.

Next, we calculate the t-statistic with (5) and compare it with the critical value listed in a statistical t-table at the appropriate degrees of freedom and desired level of confidence. If such a table is not readily available, you may want to use our t-Values Calculator or Student-t Table Generator tools.

If t_calculated > t_critical, we reject H₀ and conclude that there is a significant linear relationship between the paired variables.
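The test procedure can be sketched in a few lines. This example assumes a two-tailed test, so it compares the absolute value of t with the critical value; the critical value itself is not computed here but passed in by the caller, as read from a t-table. The function names and the sample r and n values are hypothetical.

```python
import math

def t_statistic(r, n):
    """t = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 degrees of freedom."""
    if n <= 2:
        raise ValueError("at least 3 paired observations are required")
    if abs(r) >= 1.0:
        raise ValueError("t is undefined when |r| = 1")
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)

def is_significant(r, n, t_critical):
    """Two-tailed decision (an assumption): reject H0 when |t| > t_critical."""
    return abs(t_statistic(r, n)) > t_critical

# Two-tailed critical value for 8 degrees of freedom at the 0.05 level,
# as listed in standard t-tables.
T_CRITICAL_DF8 = 2.306

for r in (0.8, 0.5):
    t = t_statistic(r, n=10)
    verdict = "reject H0" if is_significant(r, 10, T_CRITICAL_DF8) else "fail to reject H0"
    print(f"r = {r}, n = 10, t = {t:.3f}: {verdict}")
```

With ten paired observations, r = 0.8 exceeds the critical value while r = 0.5 does not, which shows how the same procedure can lead to different conclusions for moderately different correlations at small sample sizes.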

As we can see, in addition to revealing the true nature of paired variables and correlation coefficients, centering provides the basis for transforming cosine similarities to distances using solid statistical concepts.
