SIMCALC - Binary Similarity Calculator and Vector Analyzer
A tool for measuring similarity between binary strings.
About SIMCALC
SIMCALC is a calculator designed to compute similarity measures between any two binary strings.
To use the calculator, users should be familiar with Data Mining and similarity analysis. Since strings are treated as vectors, the tool also works as a vector analyzer. The figure at the right depicts sample results for the 1010101 and 1010101 vectors. (Try with dissimilar vectors.)
Who could benefit from this tool?
Scholars
IR/Statistics teachers, students, and researchers can use this calculator for classroom demonstrations or to compare results and exams of the Right (1), Wrong (0) type.
Investigators
Investigators and testers can use it to examine possible cases of duplicated content, fraud, or plagiarism.
Marketers
Marketing and sales executives can use the tool to score consumers' satisfaction questionnaires of the Yes (1), No (0) type.
Business Intelligence Analysts
Analysts can use it to extract patterns and correlations from polls, surveys, and similar intelligence instruments.
Instructions
To use SIMCALC, enter any two strings, one per textarea. These must be:
- binary; i.e., consisting of ones (1) and zeros (0).
- of identical length.
- real and non-negative.
Any non-binary character will be dynamically removed.
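The input rules above can be sketched in a few lines of Python. This is only an illustration of the stated behavior, not SIMCALC's actual code; the function names clean_binary and same_length are hypothetical.

```python
import re


def clean_binary(text: str) -> str:
    """Remove every character that is not '0' or '1' (hypothetical helper)."""
    return re.sub(r"[^01]", "", text)


def same_length(x: str, y: str) -> bool:
    """True when both cleaned strings are non-empty and of identical length."""
    return len(x) > 0 and len(x) == len(y)


# Stray characters are dropped before the lengths are compared.
x = clean_binary("10a1 0-101")
y = clean_binary("1010101")
print(x, y, same_length(x, y))
```

Cleaning before checking the length matters: two inputs that look unequal may become comparable once the non-binary characters are removed.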
How SIMCALC Works
A dynamic programming routine checks that the data entered has the required length and format.
At the beginning, and as the user enters the data, SIMCALC generates a contingency table of positive/negative matches and mismatches. The calculator uses this table to create a results table, wherein different similarity measures are displayed.
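A contingency table of this kind can be sketched as follows. We assume the conventional labels a (1/1 positive matches), b (1/0 mismatches), c (0/1 mismatches), and d (0/0 negative matches); the page does not state which labels SIMCALC uses, so the names Table and contingency are illustrative only.

```python
from collections import namedtuple

# a: both 1, b: first 1 and second 0, c: first 0 and second 1, d: both 0
Table = namedtuple("Table", "a b c d")


def contingency(x: str, y: str) -> Table:
    """Count matches and mismatches between two equal-length binary strings."""
    if len(x) != len(y):
        raise ValueError("strings must be of identical length")
    a = b = c = d = 0
    for p, q in zip(x, y):
        if p == "1" and q == "1":
            a += 1
        elif p == "1" and q == "0":
            b += 1
        elif p == "0" and q == "1":
            c += 1
        elif p == "0" and q == "0":
            d += 1
        else:
            raise ValueError("strings must contain only 0 and 1")
    return Table(a, b, c, d)


print(contingency("1010101", "1010101"))
print(contingency("1100", "1010"))
```

For the identical sample vectors 1010101 and 1010101, all seven positions are matches (four positive, three negative), so b and c are both zero.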
When both strings are of identical length, the right column of the results table changes from pale yellow to pale blue, signifying the end of the analysis.
Measures Computed by SIMCALC
The following measures are computed: Sokal-Michener, Jaccard, Russell-Rao, Hamann, Sorensen, antiDice, Sneath-Sokal, Rogers-Tanimoto, Ochiai, Yule, Anderberg, Kulczynski, Pearson's Phi, and Gower2.
Some of these measures are known in Data Mining by different names: e.g., Sokal-Michener as Simple Matching, Sorensen as Dice, and so forth. In addition, some of these take values between 0 and 1 while others, like the Hamann, Yule, and Pearson's Phi coefficients, take values between -1 and +1.
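To make the two value ranges concrete, here is a minimal sketch of several of these measures, written from their common textbook definitions in terms of the counts a (1/1), b (1/0), c (0/1), and d (0/0). These formulas are an assumption on our part: the page does not list the exact expressions SIMCALC uses, several measures exist in more than one variant (the Yule coefficient shown is Yule's Q), and the function names are hypothetical.

```python
import math


def _ratio(num: float, den: float) -> float:
    """Divide, returning NaN when the denominator is zero (undefined measure)."""
    return num / den if den else math.nan


def similarity_measures(a: int, b: int, c: int, d: int) -> dict:
    """Textbook binary similarity coefficients from a 2x2 contingency table."""
    n = a + b + c + d
    return {
        # Measures that take values between 0 and 1
        "simple_matching": _ratio(a + d, n),          # Sokal-Michener
        "jaccard": _ratio(a, a + b + c),
        "russell_rao": _ratio(a, n),
        "dice": _ratio(2 * a, 2 * a + b + c),         # Sorensen
        "rogers_tanimoto": _ratio(a + d, a + d + 2 * (b + c)),
        "ochiai": _ratio(a, math.sqrt((a + b) * (a + c))),
        # Measures that take values between -1 and +1
        "hamann": _ratio((a + d) - (b + c), n),
        "yule_q": _ratio(a * d - b * c, a * d + b * c),
        "phi": _ratio(
            a * d - b * c,
            math.sqrt((a + b) * (c + d) * (a + c) * (b + d)),
        ),
    }


for name, value in similarity_measures(1, 1, 1, 1).items():
    print(f"{name}: {value:.4f}")
```

With a = b = c = d = 1, the three signed coefficients come out as 0 while Simple Matching is 0.5, which shows why values from the two groups should not be compared directly.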
The tool does not list the classic Gower's Coefficient since for binary data this reduces to Jaccard's Coefficient. The tool does support a vendor-defined coefficient termed Gower2, but this should not be mistaken for the classic Gower's Coefficient.
SIMCALC also computes other measures like the Dot Product, the Cosine Coefficient, and the Hamming Distance. The first two have widespread use in Data Mining. The Hamming Distance is a distance metric rather than a similarity measure, but we have included it for historical reasons and because it is closely related to some of the aforementioned measures.
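These three quantities follow directly from treating each string as a vector of 0s and 1s. The sketch below uses the standard definitions; the helper names are hypothetical and not taken from SIMCALC.

```python
import math


def to_vector(bits: str) -> list:
    """Turn a binary string such as '1100' into a list of ints."""
    return [int(ch) for ch in bits]


def dot_product(x: list, y: list) -> int:
    """Sum of element-wise products; for binary vectors, the count of 1/1 matches."""
    return sum(p * q for p, q in zip(x, y))


def cosine(x: list, y: list) -> float:
    """Dot product divided by the product of the vector lengths (NaN if undefined)."""
    norm = math.sqrt(dot_product(x, x) * dot_product(y, y))
    return dot_product(x, y) / norm if norm else math.nan


def hamming(x: list, y: list) -> int:
    """Number of positions at which the two vectors differ."""
    if len(x) != len(y):
        raise ValueError("vectors must be of identical length")
    return sum(p != q for p, q in zip(x, y))


x, y = to_vector("1100"), to_vector("1010")
print(dot_product(x, y), cosine(x, y), hamming(x, y))
```

For binary vectors the Hamming Distance equals the number of mismatches, so dividing it by the vector length gives 1 minus the Simple Matching coefficient, and the Cosine Coefficient coincides with the Ochiai coefficient. This is the sense in which the distance is closely related to the similarity measures.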
Although we have designed SIMCALC for classroom demonstration purposes, it can handle vectors consisting of thousands of elements without crashing a browser. We have not tested how much data the tool can tolerate, though. Generally speaking, the calculator can be used as a handy tool for obtaining quick results or for double-checking results.
Important Notes
Transforming Similarities and Distances
According to Luke (1), a dissimilarity metric, D, can be formed by taking
(Eq 1) D = 1 - S
where S is a similarity measure and D is referred to as a distance. This implies
(Eq 2) S = 1 - D
For these transformations to be valid, S must take values between 0 and 1. One could try to normalize similarity values before converting them, but not everyone agrees with Eqs 1 and 2. For instance, Lin, in An Information-Theoretic Definition of Similarity (2), and Toit et al., in Graphical Exploratory Data Analysis (3), state that a distance can be converted into a similarity by a transformation of the form:
(Eq 3) S = 1/(1 + D)
Toit et al. also state that the reverse process, transforming similarities into distances, is not so obvious because of the triangle inequality, which must be satisfied by a distance metric. Their arguments are based on the following reasoning.
Let dij be the element ij of a distance matrix D and let sij be the ij element of a similarity matrix S. Assuming that the similarity matrix is positive semi-definite,
(Eq 4) dij = (sii - 2*sij + sjj)^(1/2)
This is the standard transformation from S to D, and results in a Euclidean distance matrix. In the special case where sii = sjj = 1, Eq 4 reduces to
(Eq 5) dij = (2*(1 - sij))^(1/2)
Note that Eqs 1 and 2 are not based on the triangle inequality.
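The transformations in Eqs 1, 3, 4, and 5 can be compared numerically with a short sketch. The function names are hypothetical, and the code simply evaluates the equations as written above; it makes no claim about which transformation is appropriate for a given measure.

```python
import math


def distance_linear(s: float) -> float:
    """Eq 1: D = 1 - S, only meaningful when S lies between 0 and 1."""
    if not 0 <= s <= 1:
        raise ValueError("S must lie between 0 and 1")
    return 1 - s


def similarity_reciprocal(d: float) -> float:
    """Eq 3: S = 1 / (1 + D), for a non-negative distance D."""
    if d < 0:
        raise ValueError("D must be non-negative")
    return 1 / (1 + d)


def distance_euclidean(s_ii: float, s_ij: float, s_jj: float) -> float:
    """Eq 4: dij = (sii - 2*sij + sjj)^(1/2).

    Assumes a positive semi-definite similarity matrix, so the
    quantity under the square root is not negative.
    """
    return math.sqrt(s_ii - 2 * s_ij + s_jj)


s = 0.75
d1 = distance_linear(s)                 # Eq 1
d5 = distance_euclidean(1.0, s, 1.0)    # Eq 5 is Eq 4 with sii = sjj = 1
print(d1, d5)

# Eq 3 is not the inverse of Eq 1: the round trip does not return s.
print(similarity_reciprocal(d1))
```

Starting from S = 0.75, Eq 1 gives D = 0.25, but feeding that distance into Eq 3 yields S = 0.8 rather than 0.75. The transformations are therefore not interchangeable, which is one reason to avoid mixing them.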
Considering that similarity can be defined according to specific models, arbitrarily transforming similarities into distances and vice versa is inadvisable. As noted by Lin (2), the problem with the many definitions of similarity is that each is defined for a particular knowledge or model domain, or tied to a specific problem or application. In addition, some are based on assumptions that are not clearly stated. Consequently, arbitrarily converting between similarities and distances compounds many of the problems listed by Lin (2) and can lead to errors.
References
1. Luke, B. T.; Clustering Binary Objects.
2. Lin, D.; An Information-Theoretic Definition of Similarity.
3. du Toit, S.H.C.; Steyn, A.G.W.; Stumpf, R.H.; Graphical Exploratory Data Analysis; Chapter 3, p. 77; Springer-Verlag, 1986.
Comments and Feedback
Have questions or suggestions relevant to this tool? We would like to hear from you. Drop us an email.

