Fingerprint Similarity and Entropy

When comparing chemical compounds by their fingerprints, the standard approach is to take a pair of compounds and compute a “similarity” score by applying the Tanimoto index to their molecular fingerprints. Usually, the two compounds are said to be “similar” if the computed index is at least 0.85.
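As a minimal sketch of that pairwise computation, here is the Tanimoto index on fingerprints stored as sets of on-bit positions. The function name, the toy fingerprints, and the convention for two empty fingerprints are choices made for this illustration and are not taken from any particular toolkit.

```python
SIMILARITY_THRESHOLD = 0.85  # the conventional cutoff mentioned above


def tanimoto(a, b):
    """Tanimoto index of two fingerprints given as sets of on-bit positions."""
    union = len(a | b)
    if union == 0:
        # Assumption: two empty fingerprints are treated as identical.
        return 1.0
    return len(a & b) / union


# Toy fingerprints (made-up bit positions, not real PubChem keys).
fp_x = {1, 2, 3, 5, 8, 13, 21}
fp_y = {1, 2, 3, 5, 8, 13, 34}

score = tanimoto(fp_x, fp_y)
print(f"Tanimoto = {score:.3f}, similar = {score >= SIMILARITY_THRESHOLD}")
```

The two toy fingerprints share six of eight distinct bits, so they fall below the 0.85 cutoff even though they differ in only one bit each.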

However, similarity is a very general concept that a single pairwise index such as Tanimoto cannot fully capture. Here, I propose another approach to similarity:

Let’s articulate the self-similarity of a set of compounds in terms of entropy. If a set of compounds is perfectly self-similar, i.e. all of its compounds have exactly identical fingerprints, its entropy should be zero. If the set contains a variety of compounds with diverse fingerprints, its entropy should be larger. These conditions can be met by computing projection entropy [1] for the set of compounds.

To demonstrate the use of projection entropy, we (1) start with an empty set, (2) add three similar compounds to the set one at a time, (3) add a dissimilar compound to the set, and see how the projection entropy of the set changes after each addition.
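The steps above can be sketched in a few lines of Python. This is a reading of the entropy definition in [1], where each fingerprint bit is treated as a feature and a feature present in m of the n compounds contributes (m/n)·log(n/m). The function name and the toy fingerprints are hypothetical, and no normalization is applied, so the numbers it prints should not be expected to match the values reported in this post.

```python
import math
from collections import Counter


def projection_entropy(fingerprints, subset):
    """Sum over features of (m/n) * log(n/m), where n = len(subset) and
    m = number of compounds in the subset that have the feature."""
    n = len(subset)
    if n == 0:
        return 0.0
    counts = Counter(bit for name in subset for bit in fingerprints[name])
    # Features shared by all n compounds contribute log(1) = 0.
    return sum((m / n) * math.log(n / m) for m in counts.values())


# Toy fingerprints (made-up bits): three near-identical compounds and one outlier.
toy = {
    "a": {1, 2, 3},
    "b": {1, 2, 3, 4},
    "c": {1, 2, 3, 5},
    "outlier": {9},
}

order = ["a", "b", "c", "outlier"]
# Entropy of the empty set, then after each compound is added.
history = [projection_entropy(toy, order[:k]) for k in range(len(order) + 1)]
print([round(h, 3) for h in history])
```

On this toy data the entropy is zero for the empty set and for a single compound, rises slowly while the similar compounds are added, and rises most when the outlier joins.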

I represented the compounds by their 881-bit “CACTVS Substructure Key Fingerprint” from the PubChem website. See its documentation.

In the first test, we add the compounds glucose, sucrose, fructose, and methanol, in that order. Here’s the entropy after each addition: 0.000, 0.049, 0.062, 0.207.

In the second test, we add the compounds methanol, ethanol, propanol, and glucose, in that order. Here’s the entropy after each addition: 0.000, 0.035, 0.050, 0.227.


In both tests, the entropy stays relatively low while the first three compounds, which are similar to each other, are added, and then it jumps upward when the fourth compound, which is dissimilar to the first three, is added. The fourth compound is an outlier in both cases.

The fingerprints of sugars and alcohols are quite asymmetrical. Compare glucose and methanol to see this asymmetry: there are 8 features in both, 1 feature in methanol only, and 34 features in glucose only. Sugars have many more features than alcohols, yet projection entropy still distinguishes these two classes properly.
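To make the asymmetry concrete, the sketch below counts shared and exclusive features for two fingerprints and plugs the glucose and methanol counts quoted above into the Tanimoto formula, which gives 8 / (8 + 1 + 34) ≈ 0.19. The bit positions are placeholders chosen only to reproduce those counts; they are not the real CACTVS keys.

```python
def overlap_counts(a, b):
    """Return (features in both, features only in a, features only in b)."""
    return len(a & b), len(a - b), len(b - a)


# Placeholder fingerprints that reproduce the counts from the text:
# 8 shared features, 1 methanol-only feature, 34 glucose-only features.
methanol = set(range(0, 9))                          # bits 0..8
glucose = set(range(0, 8)) | set(range(100, 134))    # bits 0..7 and 100..133

both, methanol_only, glucose_only = overlap_counts(methanol, glucose)
tanimoto_value = both / (both + methanol_only + glucose_only)

print(both, methanol_only, glucose_only)
print(round(tanimoto_value, 3))
```

A pairwise index this low only says that the two compounds differ; it does not show that most of the difference comes from one side.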

I also applied Entropy Agglomeration to the six compounds (glucose, sucrose, fructose, methanol, ethanol, and propanol), and it clustered the sugars and the alcohols correctly.
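A rough sketch of the greedy procedure, as it is described in [1], might look like this: start with every compound in its own cluster and repeatedly merge the pair of clusters whose union has the lowest projection entropy. The un-normalized entropy, the tie-breaking (the first lowest-entropy pair found wins), the function names, and the toy fingerprints are all assumptions made for this illustration; this is not the REBUS implementation, and the toy data are not real PubChem fingerprints.

```python
import math
from collections import Counter
from itertools import combinations


def projection_entropy(fingerprints, subset):
    """Sum over features of (m/n) * log(n/m) for the compounds in subset."""
    n = len(subset)
    if n == 0:
        return 0.0
    counts = Counter(bit for name in subset for bit in fingerprints[name])
    return sum((m / n) * math.log(n / m) for m in counts.values())


def entropy_agglomeration(fingerprints):
    """Greedy bottom-up merging; returns the merged clusters in merge order."""
    clusters = [frozenset([name]) for name in fingerprints]
    merges = []
    while len(clusters) > 1:
        # Pick the pair whose union has the lowest projection entropy.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: projection_entropy(fingerprints, pair[0] | pair[1]),
        )
        merged = a | b
        clusters = [c for c in clusters if c != a and c != b] + [merged]
        merges.append(merged)
    return merges


# Toy fingerprints: feature-rich "sugars" and feature-poor "alcohols".
core = set(range(1, 11))
toy = {
    "sugar_1": core,
    "sugar_2": core | {11},
    "sugar_3": core | {12},
    "alcohol_1": {1, 20},
    "alcohol_2": {1, 20, 21},
    "alcohol_3": {1, 20, 22},
}

merges = entropy_agglomeration(toy)
for step, cluster in enumerate(merges, start=1):
    print(step, sorted(cluster))
```

On this toy data the two classes are each assembled completely before the final merge joins them, which is the behaviour one would hope to see on the six real compounds.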


[1] Fidaner, I. B. & Cemgil, A. T. (2013) Summary Statistics for Partitionings and Feature Allocations. In Advances in Neural Information Processing Systems, 26.

See: REBUS 2.0

Download presentation

