Categorization of numerical features: A test

Part I

I ran a test on a benchmark dataset: the well-known Iris dataset contains 4 numerical features (measured to a precision of 0.1) for three species of flowers, Iris setosa, Iris virginica and Iris versicolor, with 50 samples each:

1) I categorized all numerical values using an integral overlap parameter m: each number is allowed to overlap with its neighboring values by up to m times the precision (m*0.1).

2) I applied Entropy Agglomeration to the resulting categorical features and examined the branches of the EA dendrograms it produced.
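The categorization in step 1 can be sketched as follows. This is only my illustration of one plausible reading: each value is snapped to a grid of width 0.1 and mapped to the set of grid categories within m steps of its own, so that neighboring values share categories when they are close. The function name and the grid-index representation are assumptions, not the author's code.

```python
def categorize(values, m, precision=0.1):
    """Map each numerical value to a set of overlapping grid categories.

    A value v is snapped to its index k on the precision grid, and the
    categories {k-m, ..., k+m} are emitted, so values within m*precision
    of each other share at least one category.
    """
    cats = []
    for v in values:
        k = round(v / precision)  # index of v on the precision grid
        cats.append({k + d for d in range(-m, m + 1)})
    return cats
```

With m=0 the categorization degenerates to plain discretization; larger m makes nearby measurements indistinguishable, which is what lets a set-based method like Entropy Agglomeration pick up similarity between samples.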

I find the result satisfactory: Iris setosa is easily separated. Some groups of Iris virginica and Iris versicolor are separated, while others remain confused. The confusion is localized at certain branches of the dendrogram.

For example, the output for m=11: [EA dendrogram figure]

The results can be improved. Making the categorized numbers overlap by a fixed multiple of the precision is the simplest approach. Much more interesting categorizations can be constructed, e.g. by setting several overlap parameters and assigning a coefficient of importance to each of them.

Part II

I repeated the previous experiment with a modified algorithm and saw a great improvement.

The Entropy Agglomeration algorithm is able to cluster 149 of the 150 samples according to the 3 species they belong to, without any prior information about the samples' species labels.

All species can be distinguished by cutting exactly 6 branches of the output dendrogram for m=34: [EA dendrogram figure]
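"Cutting branches" of a dendrogram to obtain flat clusters is a standard operation, and a generic sketch may clarify it. The snippet below uses SciPy's hierarchical clustering on random stand-in data; EA builds its dendrogram by its own criterion, so average linkage here is only a placeholder for the tree structure, and the data is not the actual categorized Iris features.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cut_tree

# Stand-in for the 150 categorized Iris samples (4 features each).
rng = np.random.default_rng(0)
features = rng.random((150, 4))

# Build a dendrogram (average linkage as a placeholder for EA's tree),
# then cut it so that exactly 3 flat clusters remain.
Z = linkage(features, method="average")
labels = cut_tree(Z, n_clusters=3).ravel()  # one cluster label per sample
```

Cutting k branches near the root of a binary dendrogram leaves k+1 subtrees, so separating the three species by 6 cuts means some species are recovered as the union of several subtrees rather than as a single branch.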

Modified algorithm:

1) All numerical values are categorized with (a) an integral overlap parameter m, and (b) a corresponding sequence of coefficients that decreases linearly from 1 to 0 over the neighborhood distances 0 to m+1. If a numerical category receives several coefficients, the maximum one (the one coming from the nearest neighbor) is assigned to that category.

2) Entropy Agglomeration is applied to the resulting categorical features, and the resulting dendrogram is seen to distinguish the three species.
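The coefficient assignment in step 1 can be sketched as follows, under the same grid-category reading as before. The linear decay 1 - d/(m+1) for distance d reproduces the stated range, reaching 0 exactly at distance m+1; the function name and the dict-of-weights representation are my own illustration.

```python
def weighted_categories(value, m, precision=0.1):
    """Map a value to {category_index: coefficient}.

    The coefficient decays linearly from 1 (at distance 0) toward 0
    (at distance m+1). When a category can be reached from several
    neighbors, the maximum coefficient -- the one from the nearest
    neighbor -- is kept.
    """
    k = round(value / precision)  # index of the value on the precision grid
    weights = {}
    for d in range(-m, m + 1):
        coef = 1.0 - abs(d) / (m + 1)
        cat = k + d
        weights[cat] = max(weights.get(cat, 0.0), coef)
    return weights
```

Compared with Part I, where every category in the overlap window counted equally, this weighting lets distant neighbors contribute less, which matches the reported improvement.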

Part III

I implemented a simpler and better algorithm with two parameters that achieves the same good result on the Iris flower dataset.
