REBUS 2.0 is released!

Entropy as a measure of relevance/irrelevance

____________________________________________

~~~ TECHNICAL PRESENTATION ~~~

**1)** Summary Statistics for Partitionings and Feature Allocations (NIPS 2013)

**2)** Clustering Words by Projection Entropy (NIPS 2014 Modern ML+NLP Workshop)

*Işık Barış Fidaner & Ali Taylan Cemgil*

____________________________________________

**Errata for “Summary Statistics for Partitionings and Feature Allocations”**

1) “Cumulative Occurence Distribution” should be spelt as Cumulative Occurrence Distribution.

2) In the definitions, “i∈{1,…,n}” should be i∈{1,…,|Z|} and i∈{1,…,|F|}.

3) On page 6, the second sentence should be “For large n, weighted per-element information reaches its maximum near |B| ≈ n/e ≈ 0.37 n (Figure 6).” instead of “n/2”.
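
As a numerical sanity check of item 3, the sketch below assumes that the weighted per-element information of a block B among n elements has the form (|B|/n)·log(n/|B|). That formula is an assumption here, not a quotation from the paper, and the function names are illustrative only.

```python
import math

def weighted_information(block_size, n):
    """Assumed form: (|B|/n) * log(n/|B|), i.e. one summand of the entropy."""
    return (block_size / n) * math.log(n / block_size)

def argmax_block_size(n):
    """Block size in 1..n that maximizes the assumed weighted information."""
    return max(range(1, n + 1), key=lambda b: weighted_information(b, n))

n = 1000
best = argmax_block_size(n)
# The maximizer sits next to n/e, well away from n/2.
print(best, n / math.e, n / 2)
```

Under this assumption the continuous function x·log(1/x) has its maximum at x = 1/e, which is why the corrected sentence reads n/e rather than n/2.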

____________________________________________

**Standart sapmada niye N-1?**

Niye bir örneklemin standart sapması s’yi hesaplarken kareler toplamını (aynı örneklemin ortalamasını hesaplarken yaptığınız gibi) N’e bölmek yerine N-1’e bölersiniz?

Niyesi şu.

____________________________________________

**Why N-1 in standard deviation?**

Why do you compute the standard deviation s of a sample set by dividing a summation by N-1, instead of dividing it by N, as you would do in computing the mean of this very same sample set?

Here is why.
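
One common way to see it (not necessarily the argument given in the linked explanation) is to average both estimators over every possible sample. The sketch below does this exactly for a small made-up population, sampling with replacement; the population and the helper names are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product

def sum_sq_dev(sample):
    """Sum of squared deviations from the sample's own mean."""
    mean = Fraction(sum(sample), len(sample))
    return sum((x - mean) ** 2 for x in sample)

def expected_estimates(population, n):
    """Average the /n and /(n-1) variance estimates over all samples of size n."""
    samples = list(product(population, repeat=n))
    total = len(samples)
    div_n = sum(sum_sq_dev(s) / n for s in samples) / total
    div_n1 = sum(sum_sq_dev(s) / (n - 1) for s in samples) / total
    return div_n, div_n1

population = [1, 2, 3, 4, 5, 6]  # illustrative: faces of a die
mu = Fraction(sum(population), len(population))
true_var = sum((x - mu) ** 2 for x in population) / len(population)

div_n, div_n1 = expected_estimates(population, 3)
print(true_var, div_n, div_n1)
```

Dividing by N underestimates the population variance by the factor (N-1)/N on average, because deviations are measured from the sample mean, which is itself fitted to the sample. Dividing by N-1 removes that bias.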

____________________________________________

**Öbekleme Problemine Bayesci Bir Yaklaşım Ve Gen İfadesi Analizinde Uygulanması**

*Işık Barış Fidaner, PhD Tezi*

Bu tezde gen ifadesi zaman serisi verisinden bilgi çıkarılması için yöntemler araştırılmıştır. Bu zaman serileri altta yatan biyolojik mekanizmalara dair dolaylı ölçümler sağlar, bu yüzden analizlerde istatistiksel modelleme tekniklerine yoğunca başvurulur. Özellikle popüler bir analiz yaklaşımı, ifade profili benzerliklerine göre genleri öbeklemektir. Fakat bilimsel veri analizi açısından öbekleme güçlü bir metodoloji gerektirir ve Bayesci nonparametri bu konuda gelecek vaat eden bir çerçeve sağlar.

Bu bağlamda, iki yeni Bayesci nonparametrik model geliştirildi: Standart sonsuz karışım modelini genişleten Sonsuz Çokyönlü Karışım (IMM); ve karışım bileşenlerinde gen ifadesi zaman serilerine uyarlanmış özgül bir yapıyı varsayım alan Parçalı Doğrusal Dizilerin Sonsuz Karışımı (IMPLS). Bayesci paradigmada gen analizi için anahtar nesne, model ve gözlemler verildiğinde, bölüntüler üzerindeki sonsal dağılımdır. Fakat, bölüntüler üzerinde bir sonsal dağılım oldukça karmaşık bir nesnedir. Burada Markov zinciri Monte Carlo çıkarımı uygulayarak gen bölüntülerinin sonsal dağılımından bir örneklem elde ediyoruz, ve geliştirdiğimiz sezgisel iki-aşamalı öbekleme yaklaşımı ile sonsal örneklemi işliyoruz. Bölüntüler üzerindeki dağılımların analizi için entropi toplaşması (EA) adını verdiğimiz alternatif, yeni bir yaklaşım da geliştirildi. EA’nın bölüntülerden ve daha genelde özellik atamalarından oluşan örneklemlerin yorumlanmasında kullanışlı olduğu gösterildi.

Öbekleme metodolojisinin değerlendirilmesinde iki farklı sahada ayrı deneyler gerçekleştirilmiştir. Birincisinde, edebi bir metnin (James Joyce’un Ulysses’i) paragrafları EA ile analiz edilerek sözcükleri arasındaki bağlamsal ilişkiler ortaya çıkarılmıştır. İkinci olarak, biyoenformatik uygulamasında, sonuçta çıkan öbeklerin amaca uygunluğunu değerlendirmek için standart çoklu hipotez testi uygulanmış, bir gen ontolojisine ait terimlerle kodlanmış önceki biyolojik bilgilerle karşılaştırılmıştır. Geliştirilen metodolojinin entegre edildiği eksiksiz süreç akışı CLUSTERnGO (CnG) dört fazdan oluşur (Yapılandırma, Çıkarım, Öbekleme, Değerlendirme). CnG’nin işlem hattının tamamı bir yazılım paketi olarak geliştirilmiş ve GNU Genel Kamusal Lisansı altında yayınlanmıştır.

____________________________________________

**A Bayesian Approach To The Clustering Problem With Application To Gene Expression Analysis**

*Işık Barış Fidaner, PhD Thesis*

This thesis investigates methods for extraction of information from gene expression time series data. These time series provide indirect measurements about the underlying biological mechanisms, hence their analysis heavily depends on statistical modelling techniques. One particularly popular analysis approach is clustering genes by their similarity of expression profiles. However, for scientific data analysis, clustering requires a rigorous methodology and Bayesian nonparametrics provides a promising framework.

In this context, two novel Bayesian nonparametric models were developed: Infinite Multiway Mixture (IMM), which extends the standard infinite mixture model; and Infinite Mixture of Piecewise Linear Sequences (IMPLS), which assumes a specific structure for its mixture components, tailored towards gene expression time series. In the Bayesian paradigm, the key object for gene analysis is the posterior distribution over partitionings, given the model and observed data. However, a posterior distribution over partitionings is a highly complicated object. Here, we apply Markov Chain Monte Carlo (MCMC) inference to obtain a sample from the posterior distribution of gene partitionings, and we implement a heuristic, two-stage clustering approach to process the posterior samples. An alternative, novel approach for the analysis of distributions over partitionings, which we named entropy agglomeration (EA), is also developed. EA is shown to be useful in interpreting sample sets of partitionings and, more generally, feature allocations.

Separate experiments from two different domains were conducted to evaluate the clustering methodology. In the first one, the paragraphs of a literary text (Ulysses by James Joyce) were analysed by EA to reveal the contextual relations among its words. In the second, a bioinformatics application, we evaluate the relevance of the resulting clusters by applying standard multiple hypothesis testing to compare them against previous biological knowledge encoded in terms of a Gene Ontology. The developed methodology is integrated into a complete workflow, CLUSTERnGO (CnG), which consists of a four-phase pipeline (Configuration, Inference, Clustering, Evaluation). The entire processing pipeline of CnG was implemented as a software package and is released under the GNU General Public License.

Defended before the jury on March 25; delivered to the Institute on April 7.

DOI: 10.13140/RG.2.1.3545.6888

____________________________________________

**CLUSTERnGO: a user-defined modelling platform for two-stage clustering of time-series data**

*Işık Barış Fidaner, Ayça Cankorur-Çetinkaya, Duygu Dikicioğlu, Betül Kırdar, Ali Taylan Cemgil, Stephen G. Oliver*

**Motivation:** Simple bioinformatic tools are frequently used to analyse time-series datasets regardless of their ability to deal with transient phenomena, limiting the meaningful information that may be extracted from them. This situation requires the development and exploitation of tailor-made, easy-to-use and flexible tools designed specifically for the analysis of time-series datasets.

**Results:** We present a novel statistical application called CLUSTERnGO, which uses a model-based clustering algorithm that fulfils this need. This algorithm involves two components of operation. Component 1 constructs a Bayesian non-parametric model (Infinite Mixture of Piecewise Linear Sequences) and Component 2 applies a novel clustering methodology (Two-Stage Clustering). The software can also assign biological meaning to the identified clusters using an appropriate ontology. It applies multiple hypothesis testing to report the significance of these enrichments. The algorithm has a four-phase pipeline. The application can be executed using either command-line tools or a user-friendly Graphical User Interface. The latter has been developed to address the needs of both specialist and non-specialist users. We use three diverse test cases to demonstrate the flexibility of the proposed strategy. In all cases, CLUSTERnGO not only outperformed existing algorithms in assigning unique GO term enrichments to the identified clusters, but also revealed novel insights regarding the biological systems examined, which were not uncovered in the original publications.

**Availability and implementation:** The C++ and QT source codes, the GUI applications for Windows, OS X and Linux operating systems and user manual are freely available for download under the GNU GPL v3 license at http://www.cmpe.boun.edu.tr/content/CnG.

(Bioinformatics @ Oxford Journals)

____________________________________________

**Dynamic Proteomic Profiling of Extra-Embryonic Endoderm Differentiation in Mouse Embryonic Stem Cells**

*Claire M. Mulvey, Christian Schröter, Laurent Gatto, Duygu Dikicioğlu, Işık Barış Fidaner, Andy Christoforou, Michael J. Deery, Lily T. Y. Cho, Kathy K. Niakan, Alfonso Martinez-Arias, Kathryn S. Lilley*

During mammalian preimplantation development, the cells of the blastocyst’s inner cell mass differentiate into the epiblast and primitive endoderm lineages, which give rise to the fetus and extra-embryonic tissues, respectively. Extra-embryonic endoderm (XEN) differentiation can be modeled in vitro by induced expression of GATA transcription factors in mouse embryonic stem cells. Here, we use this GATA-inducible system to quantitatively monitor the dynamics of global proteomic changes during the early stages of this differentiation event and also investigate the fully differentiated phenotype, as represented by embryo-derived XEN cells. Using mass spectrometry-based quantitative proteomic profiling with multivariate data analysis tools, we reproducibly quantified 2,336 proteins across three biological replicates and have identified clusters of proteins characterized by distinct, dynamic temporal abundance profiles. We first used this approach to highlight novel marker candidates of the pluripotent state and XEN differentiation. Through functional annotation enrichment analysis, we have shown that the downregulation of chromatin-modifying enzymes, the reorganization of membrane trafficking machinery, and the breakdown of cell–cell adhesion are successive steps of the extra-embryonic differentiation process. Thus, applying a range of sophisticated clustering approaches to a time-resolved proteomic dataset has allowed the elucidation of complex biological processes which characterize stem cell differentiation and could establish a general paradigm for the investigation of these processes.

(STEM CELLS @ Wiley Online Library)

____________________________________________

**İzdüşüm Entropisi ile Sözcüklerin Öbeklenmesi**

*Işık Barış Fidaner, Ali Taylan Cemgil*

Entropi toplaşması (ET) algoritmasını bir edebiyat metnindeki sözcüklerin öbeklenmesine uygulamaktayız. ET, öğe kümelerinde parçalılığı niceleyebilen izdüşüm entropisi (İE) fonksiyonunu en küçük yapan hasis bir toplaşma yordamıdır. Metin, uygulamada bir özellik atamasına, sözcüklerin metnin paragraflarında bulunmalarını temsil eden bir bileşimsel nesneye indirgenmektedir. Deney sonuçları indirgeme ve basitliğine rağmen ET’nin metindeki sözcükler arasında belirgin ilişkilerin yakalanmasında kullanışlı olduğunu göstermektedir. Python’da yazılan bu yordam, REBUS adıyla bir özgür yazılım olarak yayınlanmıştır.

____________________________________________

**Clustering Words by Projection Entropy**

*Işık Barış Fidaner, Ali Taylan Cemgil*

We apply entropy agglomeration (EA), a recently introduced algorithm, to cluster the words of a literary text. EA is a greedy agglomerative procedure that minimizes projection entropy (PE), a function that can quantify the segmentedness of an element set. To apply it, the text is reduced to a feature allocation, a combinatorial object that represents the word occurrences in the text’s paragraphs. The experimental results demonstrate that EA, despite its reduction and simplicity, is useful in capturing significant relationships among the words in the text. This procedure was implemented in Python and published as free software: REBUS.
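
The reduction described above can be sketched as follows. This is not the REBUS source code: it assumes that each paragraph becomes one block holding its distinct words, and that the projection entropy of a word subset S sums (|B∩S|/|S|)·log(|S|/|B∩S|) over the blocks B that intersect S. The toy text and all function names are made up for illustration.

```python
import math
import re

def feature_allocation(text):
    """One block per paragraph: the set of distinct lowercase words in it."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    return [set(re.findall(r"[a-z']+", p.lower())) for p in paragraphs]

def projection_entropy(blocks, subset):
    """Entropy of the blocks after projecting them onto `subset`.

    Each non-empty intersection of size k contributes (k/n) * log(n/k),
    where n is the size of the subset.
    """
    subset = set(subset)
    n = len(subset)
    total = 0.0
    for block in blocks:
        k = len(block & subset)
        if k:
            total += (k / n) * math.log(n / k)
    return total

# Illustrative toy text: four one-line "paragraphs".
toy_text = "the cat sat\n\nthe cat ran\n\na dog barked\n\na dog slept"
blocks = feature_allocation(toy_text)

together = projection_entropy(blocks, {"the", "cat"})  # always co-occur
apart = projection_entropy(blocks, {"cat", "dog"})     # never co-occur
print(together, apart)
```

Words that always appear in the same paragraphs get zero entropy, while words that never share a paragraph get a larger value, so a greedy procedure that merges the lowest-entropy subsets first groups contextually related words.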

(accepted to NIPS 2014 Modern ML+NLP Workshop – paper – arxiv.org – python code)

____________________________________________

**Bölüntüler ve Özellik Atamaları için Özet İstatistikleri**

*Işık Barış Fidaner, Ali Taylan Cemgil*

Öbekleme için sonsuz karışım modelleri sıklıkla kullanılır. Bu modellerde karışım atamalarının sonsalından Monte Carlo yöntemiyle örnekleme yapmak veya eniyileme ile *maksimum a posteriori* çözümünü bulmak mümkündür. Ne var ki bazı problemlerde sonsal dağınıktır ve örneklenen bölüntüleri yorumlamak zordur. Bu makalede bölüntü ve özellik ataması örneklemlerinin temsili için blok büyüklüklerine dayalı yeni istatistikler tanıtmaktayız. Öğeler arası parçalılığı nicelemek için öğe-temelli bir entropi tanımı geliştirmekteyiz. Sonra bu bilgiyi özetleyip görselleştirecek *entropi toplaşması* adlı basit bir algoritma önermekteyiz. Önerilen istatistiklerin pratik kullanımı birkaç sonsuz karışım sonsalında ve bir özellik ataması veri kümesinde yapılan deneylerle gösterilmektedir.

____________________________________________

**Summary Statistics for Partitionings and Feature Allocations**

*Işık Barış Fidaner, Ali Taylan Cemgil*

Infinite mixture models are commonly used for clustering. One can sample from the posterior of mixture assignments by Monte Carlo methods or find its *maximum a posteriori* solution by optimization. However, in some problems the posterior is diffuse and it is hard to interpret the sampled partitionings. In this paper, we introduce novel statistics based on block sizes for representing sample sets of partitionings and feature allocations. We develop an element-based definition of entropy to quantify segmentation among their elements. Then we propose a simple algorithm called *entropy agglomeration* (EA) to summarize and visualize this information. Experiments on various infinite mixture posteriors as well as a feature allocation dataset demonstrate that the proposed statistics are useful in practice.
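
The abstract does not spell out the steps of the algorithm. The sketch below assumes a greedy procedure that starts from singletons and repeatedly merges the pair of subsets whose union has the lowest mean projected entropy over the sampled partitionings. The sample set is a made-up toy example and the function names are illustrative.

```python
import math
from itertools import combinations

def entropy(partition, subset):
    """Entropy of `partition` projected onto `subset`: sum of (k/n) * log(n/k)."""
    n = len(subset)
    sizes = [len(block & subset) for block in partition]
    return sum((k / n) * math.log(n / k) for k in sizes if k)

def mean_entropy(samples, subset):
    """Average projected entropy over all sampled partitionings."""
    return sum(entropy(z, subset) for z in samples) / len(samples)

def entropy_agglomeration(samples, elements):
    """Greedy merging; returns (merged_subset, mean_entropy) pairs in merge order."""
    clusters = [frozenset([e]) for e in elements]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: mean_entropy(samples, pair[0] | pair[1]))
        merged = a | b
        merges.append((merged, mean_entropy(samples, merged)))
        clusters = [c for c in clusters if c not in (a, b)] + [merged]
    return merges

# Illustrative sample set: three partitionings of the elements 1..4.
samples = [
    [{1, 2}, {3, 4}],
    [{1, 2}, {3}, {4}],
    [{1, 2, 3}, {4}],
]
merges = entropy_agglomeration(samples, [1, 2, 3, 4])
for subset, h in merges:
    print(sorted(subset), round(h, 3))
```

The merge order, together with the entropy at each merge, is the information a dendrogram would display: elements 1 and 2 share a block in every sample, so they merge first at zero entropy.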

(presented in NIPS 2013, Lake Tahoe – paper – poster – arxiv.org – papers.nips.cc – reviews and rebuttal – bibtex)

____________________________________________

**Parçalı doğrusal dizilerin sonsuz karışımı
[Infinite mixture of piecewise linear sequences]**

*Işık Barış Fidaner, Ali Taylan Cemgil*

Bu çalışmada, kısa zaman serilerini bölüntülemek için bir sonsuz karışım modeli öneriyoruz. Bileşenleri parçalı doğrusal diziler olan bu modeli Çin lokantası süreci ile inşa ediyoruz ve gözlem atamaları üzerindeki sonsal dağılımı daraltılmış Gibbs örneklemesi ile hesaplıyoruz. Parçalı bir doğrusal dizi, gözlemlerden daha az parametre ile ifade edilmektedir. Dolayısıyla, olabilirliğin ortalama parametresi, bileşen parametreleri üzerinde bir matris dönüşümü ile elde edilmektedir. Bu matris, parçalı doğrusal diziyi tanımlayan kurallara göre oluşturulmaktadır.

[In this paper, we present an infinite mixture model to partition short time series data. Components of this mixture model are piecewise linear sequences. The model is constructed using a Chinese restaurant process, and the posterior distribution over the sample assignments is calculated using collapsed Gibbs sampling. A piecewise linear sequence is represented by fewer parameters than its observations. Thus, the mean parameter of the likelihood is obtained by applying a matrix transformation to the component parameters. This matrix is constructed according to the rules that define our piecewise linear sequences.]
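
To illustrate the idea of such a matrix transformation, the sketch below assumes the simplest possible rule: the component parameters are values at a few knot positions, and the mean at every other time point is linearly interpolated between neighbouring knots. The paper's actual construction rules are not reproduced here; the knot positions, parameter values and function names are hypothetical.

```python
def interpolation_matrix(knots, length):
    """Matrix with one row per time point 0..length-1 and one column per knot.

    Row t holds the linear-interpolation weights of the two knots around t.
    Assumes knots are increasing, knots[0] == 0 and knots[-1] == length - 1.
    """
    matrix = [[0.0] * len(knots) for _ in range(length)]
    for t in range(length):
        for j in range(len(knots) - 1):
            left, right = knots[j], knots[j + 1]
            if left <= t <= right:
                w = (t - left) / (right - left)
                matrix[t][j] = 1.0 - w
                matrix[t][j + 1] = w
                break
    return matrix

def mean_sequence(matrix, params):
    """Matrix-vector product: the full-length mean from the few parameters."""
    return [sum(a * p for a, p in zip(row, params)) for row in matrix]

knots = [0, 2, 4]        # hypothetical knot positions
theta = [0.0, 4.0, 0.0]  # hypothetical component parameters (values at the knots)
mean = mean_sequence(interpolation_matrix(knots, 5), theta)
print(mean)
```

Here three parameters determine a mean sequence of five time points, which is the sense in which a piecewise linear sequence uses fewer parameters than it has observations.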

(presented in SİU 2012 (Sinyal İşlemeleri ve Uygulamaları) – ieeexplore)

____________________________________________

**Infinite Multiway Mixture Model with Factorized Latent Parameters**

*Işık Barış Fidaner, Ali Taylan Cemgil*

In this paper, we develop an infinite multiway mixture model, whose parameters are represented as a tensor factorization. We define a D-way Poisson mixture, where a large observed tensor X is generated by the mixture proportions pi_d and a smaller latent tensor Theta, which is represented as a factorization of M latent factors Theta_m of varying dimensionalities. We first derive an EM algorithm for the finite mixture. Then, we formulate an infinite multiway mixture, and propose an MCMC method to sample the assignments.

(presented in Bayesian Nonparametrics Workshop in NIPS 2011, Granada – paper)

____________________________________________

**Off-Axis Stereo Projection and Head Tracking for a Horizontal Display**

*Başar Uğur, Ali Vahit Şahiner, Işık Barış Fidaner*

In this work, a head-tracking based stereoscopic horizontal display system is presented. The hardware components of the system are a helmet used for head tracking, 3D shutter glasses for stereoscopic viewing, a FireWire stereo camera for locating the 3D coordinates of the LEDs on the helmet by stereo reconstruction, and a table in which a stereoscopic projector feeds 3D images in stereo onto a semi-transparent glass via a large mirror. The system runs under Ubuntu Linux. The main purpose of this project is to create a realistic visualization of 3D models that are often viewed horizontally, such as architectural models or terrain data. Head tracking is achieved by using images captured at 50 FPS. Three-dimensional stereo pairs are rendered at 1024×768 resolution and 120 Hz.
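
The system's rendering code is not shown here. As a minimal sketch of the usual off-axis (asymmetric frustum) construction, the code below assumes the screen lies in the plane z = 0, the tracked head position is given in the same coordinates with z > 0, and the two eyes are offset along the x axis. All names and numeric values are illustrative assumptions.

```python
def off_axis_frustum(eye, screen_left, screen_right, screen_bottom, screen_top, near):
    """Frustum bounds (left, right, bottom, top) at the near plane.

    The screen rectangle lies in the plane z = 0; the eye is at (ex, ey, ez), ez > 0.
    The screen edges are scaled from the eye's distance down to the near distance.
    """
    ex, ey, ez = eye
    scale = near / ez
    return ((screen_left - ex) * scale,
            (screen_right - ex) * scale,
            (screen_bottom - ey) * scale,
            (screen_top - ey) * scale)

def stereo_frusta(head, eye_separation, screen, near):
    """One frustum per eye, with the eyes offset along x around the head position."""
    hx, hy, hz = head
    half = eye_separation / 2.0
    left_eye = (hx - half, hy, hz)
    right_eye = (hx + half, hy, hz)
    return (off_axis_frustum(left_eye, *screen, near),
            off_axis_frustum(right_eye, *screen, near))

screen = (-0.5, 0.5, -0.4, 0.4)  # hypothetical screen edges in metres
left, right = stereo_frusta((0.2, 0.0, 1.0), 0.065, screen, 0.1)
print(left)
print(right)
```

The four bounds are the kind of values an OpenGL-style frustum call takes, with the view additionally translated to the eye position. Recomputing them whenever the tracked head moves keeps the image geometrically consistent with the physical screen.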

(accepted to “Salon de ACE” in ACE 2009, Athens – paper)

____________________________________________

**Estimating review score from words**

*Işık Barış Fidaner*

In this study, we created a machine learning database that contains several thousand samples in three categories by collecting review quotes and the corresponding scores from Metacritic. As a second step, we extracted the words used in these quotes from reviews of music albums, movies and TV shows. Then, according to a few statistics, we selected a small set of words that could carry the most score information. This selection procedure is confirmed by the lists of positive and negative words we generated. These lists are very meaningful to a human examiner and show a clear relation between the words and the corresponding mean scores. In the last step, we applied linear regression and SVM regression for machine learning. The experiments show that we can achieve an estimation accuracy from ±11 to ±20 points, depending on the dataset and the algorithm used.

(project report and presentation prepared in 2009)

____________________________________________

**Perspectives on Walking in an Environment**

*Işık Barış Fidaner*

Walking is basically a sequence of body movements that allow a person to move forward. A gait cycle has a periodic structure that consists of a certain sequence of phases, which are in turn composed of sub-phases of movement. This structure imposes a general constraint that limits the relative limb angles and positions in each phase of the movement. However, a walking person also continuously interacts with her environment and adjusts her walking accordingly. The environmental interaction in gait can be considered as a communication that takes place through a number of input and output channels.

(project report and presentation prepared in 2009)

____________________________________________

**Tracking Human Motion by using Motion Capture Data**

*Işık Barış Fidaner*

Tracking an active human body and understanding the nature of his/her activity in a video is a very complex problem. Psychological experiments show that human subjects can easily track and extract several variables from a video, including what the person is doing, the gender of the person, the emotional status, the identity of the person, and even whether the person is oneself; all from a small amount of dynamic information. This is made possible by the biological vision and cognition system, which includes inborn mechanisms for empathizing with other humans and also adaptively incorporates several years of experience. Therefore, to build intelligent systems ourselves, we must find a way to algorithmically incorporate the information about the physical nature of the human body and the dynamical structure of different human activities.

(project report written in 2009)

____________________________________________

**A Study on Particle Level Sets and Navier-Stokes Solver
For Water Simulations**

*Koray Balcı, Işık Barış Fidaner*

In this paper we present a survey on the simulation of fluids in the computer graphics domain, focusing especially on methods that use the level-set approach. We review the mathematical foundations that govern the behaviour of fluids under the influence of physics in an environment, and present the state of the art in surface tracking methods that drive the data for animation. We focus on a specific paper from Fedkiw et al. that uses particle level sets for surface tracking as the basis for our own studies. We also show our early results and present our progress in the domain.

(project report written in 2006)

____________________________________________

**Combined color and texture tracking for video post-editing**

*Işık Barış Fidaner, Lale Akarun*

Tracking a human head in a complicated scene with changing object pose, illumination conditions, and many occluding objects is the subject of this paper. We present a general tracking algorithm, which uses a combination of object color statistics and object texture features with motion estimation. The object is defined by an ellipse window that is initially selected by the user. Color statistics are obtained by calculating the object color histogram in the YCrCb space, with more resolution reserved for chroma components. In addition to the conventional discrete color histogram, a novel method, uniform fuzzy color histogram (UFCH), is proposed. The object texture is represented by lower frequency components of the object's discrete cosine transform (DCT), and local binary patterns (LBP). Using the tracker, the performance of different features and their combinations is tested. The tracking procedure is based on constant velocity motion estimation by condensation particle filter, in which the sample set is obtained by the translation of the object window. Histogram comparison is based on the Bhattacharyya coefficient, and DCT comparison is calculated by sum of squared differences (SSD). Similarity measures are joined by combining their independent likelihoods. As the combined tracker follows different features of the same object, it is an improvement over a tracker that makes use of only color statistics or texture information. The algorithm is tested and optimized on the specific application of embedding interactive object information to movies.
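
The two comparison measures named above are standard and can be sketched in a few lines. How the paper turns them into likelihoods is not stated in the abstract, so the exponential mapping and its scale parameters below are hypothetical, as are the function names and toy inputs.

```python
import math

def normalize(hist):
    """Scale a histogram so that its bins sum to 1."""
    total = sum(hist)
    return [h / total for h in hist]

def bhattacharyya_coefficient(p, q):
    """Sum of sqrt(p_i * q_i) for two normalized histograms; 1.0 means identical."""
    return sum(math.sqrt(a * b) for a, b in zip(p, q))

def ssd(x, y):
    """Sum of squared differences between two coefficient vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def combined_likelihood(color_similarity, texture_ssd,
                        color_scale=10.0, texture_scale=1.0):
    """Hypothetical mapping: exp(-scale * distance) per feature, then a product.

    Multiplying the two terms treats the features as independent.
    """
    color_like = math.exp(-color_scale * (1.0 - color_similarity))
    texture_like = math.exp(-texture_scale * texture_ssd)
    return color_like * texture_like

reference = normalize([1, 1, 2])   # toy color histogram of the tracked object
candidate = normalize([1, 2, 1])   # toy histogram of a candidate window
similarity = bhattacharyya_coefficient(reference, candidate)
distance = ssd([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])  # toy DCT coefficients
print(similarity, distance, combined_likelihood(similarity, distance))
```

In a particle filter, a score of this kind would weight each candidate window, so that windows matching both the color and the texture of the object dominate the next sample set.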

(presented in ISCIS 2008, İstanbul – paper)

____________________________________________

**A Survey on Variational Image Inpainting,
Texture Synthesis and Image Completion**

*Işık Barış Fidaner*

In this survey, techniques developed in three distinct but related fields of study, variational image inpainting, texture synthesis and image completion, are investigated. Variational image inpainting involves filling narrow gaps in images. Though there are challenging alternative methods, best results are obtained by PDE-based algorithms. Texture synthesis is reproduction of a texture from a sample. Firstly, statistical model based methods were proposed for texture synthesis. Then pixel and patch-based sampling techniques were developed, preserving texture structures better than statistical methods. Image completion algorithms deal with the problem of filling larger gaps that involve both texture and image structure. This is a more general field of study that emerged by the combination of variational image inpainting and texture synthesis. State-of-the-art image completion techniques are exemplar-based methods that are inspired by greedy image-based texture growing algorithms, and the global image completion approach that was recently proposed to solve quality problems in exemplar-based image completion.

(manuscript written in 2007)

____________________________________________

**Çok Boyutlu Cisimlerin İzdüşümlerinin ve Arakesitlerinin Alınması
[Projections and Cross-Sections of Multidimensional Objects]**

*Eser Aygün, Işık Barış Fidaner*

Bilimde önem kazanan çok boyut kavramı, üç boyutlu beyinlerimiz için anlaşılırlıktan uzaktır. Çok boyutlu cisimleri ve arakesitlerini görselleştirmek, bu kavramı anlaşılır kılmak için önemli bir adım olacaktır.

[The concept of multidimensionality has gained importance in science, but it is far from being conceivable for our three dimensional brains. Visualisation of multidimensional objects and their cross-sections will be an important step to make this concept conceivable.]

(2nd place in TÜBİTAK Project Competition 2001, Ankara – project report)