Binary Similarity Calculator
- This tool currently computes 72 similarity measures from two binary data sets of the same size.
- A dynamic counter helps users check the size of the sets.
- Nonbinary sets must first be converted to binary (e.g., Yes=1 and No=0).
- Example: If A={Yes,No,Yes,Yes,No} and B={Yes,Yes,No,Yes,Yes}, then A=10110 and B=11011.
- This tool complements our Binary Distance Calculator.
- This tool was first released in 2007. It was expanded and relaunched in 2013 to include more similarity measures, as listed by Choi et al. (2010), IBM Knowledge Center (2011), Stata (2007), Hayek (1994), and Tulloss (1997).
- In 2016, we added more similarity measures to the tool, including those proposed by Consonni & Todeschini (2012) and Todeschini et al. (2012).
- Some similarity measures fall outside the [0, 1] range. To compare them properly, they can be rescaled using the procedure described by Todeschini et al. (2012).
- Before making similarity-distance transformations, you may want to read our companion tutorial, which discusses this topic (Garcia, 2015).
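The encoding, counting, and rescaling steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the three measures shown (Jaccard, simple matching, and the Pearson phi coefficient) use their standard textbook definitions and are not necessarily the exact formulas implemented by this tool, and `rescale` is a generic linear mapping onto [0, 1] that may differ from the procedure of Todeschini et al. (2012).

```python
from math import sqrt


def to_binary(values, positive="Yes"):
    """Encode a nonbinary set as 0/1 integers (positive label -> 1)."""
    return [1 if v == positive else 0 for v in values]


def contingency(x, y):
    """Count matches and mismatches between two equal-size binary sets.

    a: 1 in both, b: 1 in x only, c: 1 in y only, d: 0 in both.
    """
    if len(x) != len(y):
        raise ValueError("both sets must have the same size")
    pairs = list(zip(x, y))
    a = pairs.count((1, 1))
    b = pairs.count((1, 0))
    c = pairs.count((0, 1))
    d = pairs.count((0, 0))
    return a, b, c, d


def jaccard(a, b, c, d):
    # Ignores joint absences (d); undefined when a + b + c == 0.
    return a / (a + b + c)


def simple_matching(a, b, c, d):
    # Fraction of positions where the two sets agree.
    return (a + d) / (a + b + c + d)


def pearson_phi(a, b, c, d):
    # Ranges over [-1, 1]; undefined when any marginal total is zero.
    return (a * d - b * c) / sqrt((a + b) * (a + c) * (b + d) * (c + d))


def rescale(s, lower, upper):
    """Assumed generic linear rescaling of s from [lower, upper] to [0, 1]."""
    return (s - lower) / (upper - lower)


# The example sets from the description above.
A = to_binary(["Yes", "No", "Yes", "Yes", "No"])    # [1, 0, 1, 1, 0]
B = to_binary(["Yes", "Yes", "No", "Yes", "Yes"])   # [1, 1, 0, 1, 1]

counts = contingency(A, B)                           # (2, 1, 2, 0)
phi = pearson_phi(*counts)

print(counts)
print(jaccard(*counts))                              # 0.4
print(simple_matching(*counts))                      # 0.4
print(round(phi, 3))                                 # -0.408
print(round(rescale(phi, -1.0, 1.0), 3))             # 0.296
```

Working from the four counts rather than from the raw sets keeps each measure a one-line formula, which is convenient when many measures are computed from the same pair of sets.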
Notes
- Identical measures known by different names have been consolidated into a single record.
- Different measures that share the same name have been numbered as versions where necessary.
- For instance, Hayek (1994) reported that Rogot-Goldberg is the same as Sokal-Sneath version 4. Stata (2007) reported that Anderberg versions 1 and 2 were first proposed by Sokal and Sneath, while version 3 is given by Choi et al. (2010).
- By contrast, Fager-McGowan versions 1, 2, and 3 are listed as in Choi et al. (2010), Hayek (1994), and Hayes (1978), respectively.
- On the other hand, Hayek (1994) and Choi et al. (2010) listed five versions of Sokal-Sneath, while Todeschini et al. (2012) listed four. Versions 4 and 5 in the former references correspond to versions 3 and 4 in the latter. This versioning is shown in parentheses in the tool's output.
- Note that Sokal-Sneath version 4 as given in Choi et al. (2010) contains a typo.
- Please let us know how we can improve, enhance, or correct this tool.
- This tool is intended for data miners, teachers, or anyone who needs to compare or grade data sets.
References
- Choi, S., Cha, S., and Tappert, C. C. (2010). A Survey of Binary Similarity and Distance Measures. Systemics, Cybernetics and Informatics, Vol. 8, 1, 43-48.
- Consonni, V. and Todeschini, R. (2012). New Similarity Coefficients for Binary Data. MATCH Commun. Math. Comput. Chem. 68, 581-592.
- Garcia, E. (2015). A Tutorial on Distance and Similarity.
- Hayek, L. C. (1994). Analysis of Amphibian Biodiversity Data. Chapter 9. In: Measuring and Monitoring Biological Diversity. Standard Methods for Amphibians. W. R. Heyer et al., eds. Smithsonian Institution, Washington, D. C.
- Hayes, W. B. (1978). Some Sampling Properties of the Fager Index for Recurrent Species Groups. Vol. 59, No. 1.
- IBM Knowledge Center (2011). Distances Similarity Measures for Binary Data. See also Sokal and Sneath Similarity Measure 3.
- Stata (2007). Stata Manuals.
- Todeschini, R., Consonni, V., Xiang, H., Holliday, J., Buscema, M., and Willett, P. (2012). Similarity Coefficients for Binary Chemoinformatics Data: Overview and Extended Comparison Using Simulated and Real Data Sets. J. Chem. Inf. Model. 52 (11).
- Tulloss, R. E. (1997). Assessment of Similarity Indices for Undesirable Properties and a new Tripartite Similarity Index Based on Cost Functions. Offprint from Palm, M. E. and I. H. Chapela, eds. 1997. Mycology in Sustainable Development: Expanding Concepts, Vanishing Borders. pp 122-143.
Feedback
Contact us with any suggestions or questions regarding this tool.