Dr. Işık Barış Fidaner
Chemical compounds show great structural variety, both within a single compound and from one compound to another. To make principled computational methods applicable to compounds and molecules, certain properties of their structures are systematically encoded and published as standard bit strings called molecular fingerprints.
One such standard is the CACTVS Substructure Key Fingerprint, which is published on the PubChem website for millions of chemical compounds. It consists of 881 bits, each of which is a binary feature that encodes a particular chemical property. See the official documentation for the specification of these properties.
Thanks to molecular fingerprints, chemical compounds can be compared structurally just by looking at their bit strings. The fingerprints supply a predetermined set of binary features, which makes it practical to apply combinatorial approaches to the computational detection of structural similarities. Given the massive number and great diversity of the molecules found in nature, molecular fingerprints are an invaluable source of information.
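As a small illustration of comparing compounds through their bit strings alone, the sketch below lists the positions at which two fingerprints disagree. The function name `differing_bits` and the 8-bit toy strings are made up for this example; real CACTVS fingerprints have 881 bits.

```python
def differing_bits(fp_a: str, fp_b: str) -> list:
    """Return the positions at which two equal-length bit strings differ."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    return [i for i, (a, b) in enumerate(zip(fp_a, fp_b)) if a != b]


# Toy 8-bit fingerprints (hypothetical values, not taken from PubChem).
fp_x = "11010010"
fp_y = "11010110"
fp_z = "00101101"

# Positions where x and y disagree.
print(differing_bits(fp_x, fp_y))
# Number of disagreeing positions (the Hamming distance) between x and z.
print(len(differing_bits(fp_x, fp_z)))
```

The fewer positions two fingerprints disagree on, the more structural features the two compounds share.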
A computational approach that aims to detect the similarity groups among a given set of entities is traditionally called Clustering.
One such clustering method is Entropy Agglomeration [1, 2, 3, 4], which is implemented in a Python program called REBUS 2.0. Like the well-known Hierarchical Agglomerative Clustering, it produces a dendrogram that shows the groups of similar entities. It works by choosing, at each step, the clusters with minimum entropy.
The dendrogram below this text was produced by REBUS 2.0 based on CACTVS Substructure Key Fingerprints. It clusters 3000+ compounds in PubChem that include “glucose” in their names. Since PubChem provides several synonyms for each compound, the first synonym is taken to represent the compound. The IUPAC name is used for compounds that have no synonyms. The computations to produce this dendrogram took several days.
The algorithm first takes the chemical compounds that have identical fingerprints and immediately clusters them together with zero entropy. As you can see on the dendrogram, several branches display zero entropy. As the molecular fingerprints differ from one another, the algorithm begins to record positive entropy values at the dendrogram’s bifurcation points. The clustering procedure proceeds, based on the minimum entropy principle, until all chemical compounds are joined together in one big cluster.
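The procedure described above can be sketched in a few lines of Python. This is a brute-force illustration and not the REBUS 2.0 code. It assumes that the entropy of a cluster is the projection entropy from the cited paper: each bit is treated as a block of the compounds that have it set, and a block covering a fraction p of the cluster contributes p log(1/p). The compound names and 4-bit fingerprints are made up.

```python
import math
from itertools import combinations


def projection_entropy(fingerprints, subset):
    """Entropy of the bit features restricted to the compounds in `subset`.

    Assumed formula: a bit set in `count` of the `n` compounds adds
    (count / n) * log(n / count). Bits set in all or none add zero.
    """
    n = len(subset)
    n_bits = len(fingerprints[subset[0]])
    total = 0.0
    for bit in range(n_bits):
        count = sum(fingerprints[name][bit] == "1" for name in subset)
        if 0 < count < n:
            total += (count / n) * math.log(n / count)
    return total


def entropy_agglomeration(fingerprints):
    """Repeatedly merge the pair of clusters whose union has minimum entropy."""
    clusters = [(name,) for name in fingerprints]
    merges = []  # (merged cluster, entropy at that bifurcation point)
    while len(clusters) > 1:
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: projection_entropy(fingerprints, pair[0] + pair[1]),
        )
        entropy = projection_entropy(fingerprints, a + b)
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)
        merges.append((a + b, entropy))
    return merges


# Hypothetical toy data: A and B have identical fingerprints.
toy = {"A": "1100", "B": "1100", "C": "1110", "D": "0011"}
merges = entropy_agglomeration(toy)
for cluster, entropy in merges:
    print(cluster, round(entropy, 4))
```

In this toy run the identical fingerprints A and B are merged first with zero entropy, and every later merge records a positive value. Trying every pair at every step is far too slow for thousands of 881-bit fingerprints; the sketch only shows the principle.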
Note: The PubChem website provides a clustering method based on the Tanimoto (Jaccard) index, which is the size of the intersection of two sets divided by the size of their union. It produces a different dendrogram based on a different principle.
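For comparison, the Tanimoto index of two fingerprints can be computed by treating each one as the set of positions where the bit is 1. The sketch below uses made-up 4-bit strings, and returning 1.0 for two all-zero fingerprints is a convention chosen for this example only.

```python
def tanimoto(fp_a: str, fp_b: str) -> float:
    """Tanimoto (Jaccard) index of two bit strings of equal length."""
    on_a = {i for i, bit in enumerate(fp_a) if bit == "1"}
    on_b = {i for i, bit in enumerate(fp_b) if bit == "1"}
    union = on_a | on_b
    if not union:
        # Both fingerprints are all zeros; treated as identical here.
        return 1.0
    return len(on_a & on_b) / len(union)


# Two shared bits out of three set bits overall.
print(tanimoto("1100", "1110"))
```

Unlike an entropy computed over a whole cluster, this index is defined only for a pair of fingerprints.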
 There is a short video tutorial on this page:
 Fidaner, I. B. & Cemgil, A. T. (2013) Summary Statistics for Partitionings and Feature Allocations. In Advances in Neural Information Processing Systems, 26.