Entropy as a measure of relevance//irrelevance

Update: REBUS 2.0 is released!

Entropy Agglomeration (EA) is the most useful algorithm you can imagine. If it is not cited or used, that is only because the established scientific paradigms cannot conceive of its meaning.

In fact, the idea is very simple:

In EA, entropy is a measure of relevance//irrelevance.

— Subsets of elements that either appear together or disappear together in the blocks have low entropy. Those elements are “relevant” to each other: they literally “lift up again” each other.

— Subsets of elements that partly appear and partly disappear in the blocks have high entropy. Those elements are “irrelevant” to each other: they literally “don’t lift up again” each other.
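
To make the two cases concrete, here is a minimal Python sketch. It assumes the projection entropy definition as I read it from the 2013 paper cited below: project every block onto the chosen subset S, then sum (k/n) log(n/k) over the projected blocks, where k is the number of elements of S in the block and n = |S|. The function name `projection_entropy` and the toy blocks are made up for illustration; this is not the REBUS code.

```python
import math

def projection_entropy(blocks, subset):
    """Entropy of `blocks` after projecting each block onto `subset`.

    Assumed definition: sum of (k/n) * log(n/k) over blocks, where
    k = |block ∩ subset| and n = |subset|. Blocks that contain all of
    the subset (k == n) or none of it (k == 0) contribute nothing.
    """
    subset = set(subset)
    n = len(subset)
    total = 0.0
    for block in blocks:
        k = len(subset & set(block))
        if 0 < k < n:
            total += (k / n) * math.log(n / k)
    return total

# Hypothetical toy data: each block could be a paragraph, each element a word.
blocks = [
    {"a", "b", "c"},
    {"a", "b"},
    {"c", "d"},
    {"a", "b", "d"},
]

# "a" and "b" always appear together or disappear together: entropy is zero.
print(projection_entropy(blocks, {"a", "b"}))

# "a" and "c" are split by three of the four blocks: entropy is positive.
print(projection_entropy(blocks, {"a", "c"}))
```

In this toy data only the blocks that split a subset add to its entropy, which is why the pair that always moves together scores exactly zero.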

This is all visible in the results of the analysis of James Joyce’s Ulysses: https://arxiv.org/abs/1410.6830

In this setup, entropy becomes a measure of relevance//irrelevance, literally and by definition: https://en.wiktionary.org/wiki/relevant


I. B. Fidaner & A. T. Cemgil (2013) “Summary Statistics for Partitionings and Feature Allocations.” In Advances in Neural Information Processing Systems (NIPS) 26. Paper: http://papers.nips.cc/paper/5093-summary-statistics-for-partitionings-and-feature-allocations (the reviews are available on the website)

I. B. Fidaner & A. T. Cemgil (2014) “Clustering Words by Projection Entropy,” accepted to NIPS 2014 Modern ML+NLP Workshop. Paper: http://arxiv.org/abs/1410.6830 Software Webpage: https://fidaner.wordpress.com/science/rebus/


The grid of all possible entropy values is a universal constant:


EA is a hierarchical clustering algorithm that outputs dendrograms. I have a few examples to show what the outputs look like:

Clustering of (I) plants and (II) fungi according to their occurrences in studies on mycorrhizal fungi.

Clustering of dinosaurs according to the occurrences of their recorded phenotypic characteristics.

Clustering of central wavelengths according to their occurrences in a known set of exoplanets.

Clustering of the well-known Iris dataset. 149 of the 150 flowers were successfully clustered. (This last example employs additional wrapper code that categorizes the numerical features given in the dataset.)

Clustering of Last.fm tags. Part 1 and Part 2.
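
The agglomeration behind dendrograms like these can be sketched in a few lines of Python. This is a simplified illustration, not the REBUS implementation: it assumes a single set of blocks, the projection entropy definition from the 2013 paper (sum of (k/n) log(n/k) over projected blocks), and a greedy rule that repeatedly merges the two clusters whose union has the lowest projection entropy. The function names and the toy data are hypothetical.

```python
import math
from itertools import combinations

def projection_entropy(blocks, subset):
    """Assumed definition: sum of (k/n) * log(n/k) over projected blocks."""
    n = len(subset)
    total = 0.0
    for block in blocks:
        k = len(subset & block)
        if 0 < k < n:
            total += (k / n) * math.log(n / k)
    return total

def entropy_agglomeration(blocks):
    """Greedy bottom-up merging; returns the merge list of a dendrogram.

    Each entry is (left cluster, right cluster, entropy of their union).
    """
    blocks = [frozenset(b) for b in blocks]
    elements = sorted(set().union(*blocks))
    clusters = [frozenset([e]) for e in elements]  # start from singletons
    merges = []
    while len(clusters) > 1:
        # Pick the pair whose union has the lowest projection entropy.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: projection_entropy(blocks, pair[0] | pair[1]),
        )
        merged = a | b
        merges.append((sorted(a), sorted(b), projection_entropy(blocks, merged)))
        clusters = [c for c in clusters if c not in (a, b)] + [merged]
    return merges

# Hypothetical toy data: four blocks over the elements a, b, c, d.
blocks = [{"a", "b", "c"}, {"a", "b"}, {"c", "d"}, {"a", "b", "d"}]
for left, right, entropy in entropy_agglomeration(blocks):
    print(left, "+", right, "at entropy", round(entropy, 4))
```

Each merge and its entropy correspond to one junction of the dendrogram and its height. The exhaustive pair search is kept for readability; it recomputes every pair at every step, so it only suits small element sets.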

