Explanatory Data Analysis group


Our research on Explanatory Data Analysis can be roughly grouped into four (overlapping) themes, described below. You might also be interested in our list of concrete research projects.

Pattern-based Modelling
Sushi patterns
Sushi patterns found with Semiring Rank Matrix Factorisation

Patterns are ideally suited to describe and explain structure in data, but traditional pattern mining approaches usually result in too many patterns: the infamous pattern explosion. This issue can be addressed by constructing pattern-based models that accurately yet succinctly capture the relevant structure in the data. We often, but not always, use information-theoretic principles to this end (see below). Pattern-based modelling can be applied to many data types and tasks.
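
To make the pattern explosion concrete, here is a minimal sketch that counts frequent itemsets in a toy dataset by brute force. The function name `frequent_itemsets`, the toy transactions, and the support threshold are assumptions made for illustration only; this is not one of the group's algorithms, and exhaustive enumeration like this is only feasible for a handful of items.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset occurring in at least
    `min_support` transactions (brute-force enumeration, toy data only)."""
    items = sorted(set().union(*transactions))
    frequent = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            # Support = number of transactions that contain the whole candidate.
            support = sum(1 for t in transactions if set(candidate) <= t)
            if support >= min_support:
                frequent[candidate] = support
    return frequent


# Hypothetical toy data: five transactions over four items.
transactions = [
    {"a", "b", "c", "d"},
    {"a", "b", "c", "d"},
    {"a", "b", "c"},
    {"b", "c", "d"},
    {"a", "d"},
]

patterns = frequent_itemsets(transactions, min_support=2)
print(len(transactions), "transactions ->", len(patterns), "frequent itemsets")
```

Even on this tiny example the result contains more patterns than there are transactions, which is why a succinct pattern-based model is preferable to the full list of frequent patterns.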

Recent papers that fit this theme (and do not use information theory) include Explaining Deviating Subsets through Explanation Networks and Semiring Rank Matrix Factorisation.

Information Theoretic Data Mining
The Krimp algorithm
Krimp selects few patterns that describe the data.

Information theory has proven very successful in data mining and is now widely used. Prime examples are pattern-based modelling approaches based on the Minimum Description Length (MDL) or Maximum Entropy principles. The MDL principle can be paraphrased as "Induction by Compression": compressing data equates to learning from data. Using the MDL principle has various benefits, such as naturally avoiding overfitting by trading off how well the model compresses the data against the complexity of the model itself.
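
The trade-off can be illustrated with a small two-part MDL sketch on binary sequences. It compares a parameter-free fair-coin model with a Bernoulli model whose parameter is fitted to the data. The function names and the model cost of 0.5 · log2(n) bits for one fitted parameter (a common textbook approximation) are assumptions chosen for illustration, not the encoding used by Krimp or any specific paper.

```python
import math


def bernoulli_data_bits(bits, p):
    """Code length in bits of `bits` under a Bernoulli(p) model."""
    ones = sum(bits)
    zeros = len(bits) - ones
    total = 0.0
    if ones:
        total -= ones * math.log2(p)
    if zeros:
        total -= zeros * math.log2(1 - p)
    return total


def two_part_length(bits, fit_parameter):
    """Total description length L(M) + L(D | M) in bits."""
    n = len(bits)
    if not fit_parameter:
        # Fair coin: nothing to describe in the model, 1 bit per symbol.
        return float(n)
    p = sum(bits) / n
    # Assumed cost of encoding one fitted parameter (textbook approximation).
    model_bits = 0.5 * math.log2(n)
    return model_bits + bernoulli_data_bits(bits, p)


skewed = [1] * 90 + [0] * 10   # clear structure: mostly ones
balanced = [1, 0] * 50          # no structure a single parameter can exploit

for name, data in [("skewed", skewed), ("balanced", balanced)]:
    simple = two_part_length(data, fit_parameter=False)
    fitted = two_part_length(data, fit_parameter=True)
    print(f"{name}: fair coin {simple:.1f} bits, fitted model {fitted:.1f} bits")
```

For the skewed sequence the fitted model pays for its parameter and still compresses better, so MDL selects it. For the balanced sequence the extra parameter buys nothing, and the simpler model wins.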

Recent papers include Subjective Interestingness of Subgraph Patterns, on community detection using the Maximum Entropy principle, and Association Discovery in Two-View Data, on two-view summarisation using the MDL principle.

Interactive Data Mining
Mine, Interact, Learn, Repeat.

Because it is often hard to define upfront which patterns are 'interesting', it can be very helpful to keep a human in the loop. By visualising data and patterns and asking the domain expert for feedback, it is possible to learn and model the user's preferences. This can be paraphrased as "Mine, Interact, Learn, Repeat". Interactive data mining aims to discover more interesting patterns with little effort from the user.
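
The loop can be sketched in a few lines. Everything in the example below is an assumption made for illustration: the mining step is stubbed by a fixed list of candidate patterns, each pattern is described by two made-up features (length and coverage), the user is simulated by a function that always prefers higher coverage, and preferences are learned with a simple perceptron-style update on pairwise feedback. It is not the method of the papers listed in this section.

```python
from itertools import combinations


def interactive_loop(features, ask_user, rounds=5, learning_rate=1.0):
    """Learn linear preference weights from pairwise user feedback.

    features: dict mapping pattern name -> list of feature values.
    ask_user: callable (a, b) -> the name the user prefers.
    """
    names = list(features)
    dim = len(next(iter(features.values())))
    weights = [0.0] * dim

    def score(name):
        return sum(w * x for w, x in zip(weights, features[name]))

    for _ in range(rounds):
        # "Mine": here simply every pair of the fixed candidate patterns.
        for a, b in combinations(names, 2):
            # "Interact": ask which of the two patterns is more interesting.
            preferred = ask_user(a, b)
            other = b if preferred == a else a
            # "Learn": update only when the model disagrees with the user.
            if score(preferred) <= score(other):
                for i in range(dim):
                    delta = features[preferred][i] - features[other][i]
                    weights[i] += learning_rate * delta

    ranking = sorted(names, key=score, reverse=True)
    return weights, ranking


# Hypothetical patterns with features [length, coverage].
features = {"A": [1, 0.9], "B": [3, 0.5], "C": [2, 0.2]}


def simulated_user(a, b):
    """Stand-in for a domain expert who prefers higher coverage."""
    return a if features[a][1] >= features[b][1] else b


weights, ranking = interactive_loop(features, simulated_user)
print("learned weights:", weights)
print("ranking:", ranking)
```

In a real system the feedback would come from a person rather than a function, so the number of questions asked per round matters far more than in this sketch.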

Recent papers include Learning what matters – Sampling interesting patterns, on combining preference learning, pattern mining, and sampling; and Flexible constrained sampling with guarantees for pattern mining, on efficient and flexible sampling.

Applied Data Science
Manufacturing car body parts
Anomaly detection for manufacturing.

There is no better way to demonstrate the potential of explanatory data analysis than by applying our algorithms and tools to real-world applications. We are interested in both scientific and industrial applications. The scientific application domains we are most experienced with are bioinformatics and psychology, but we are expanding to the health domain and law. We have experience with industrial domains such as aviation, manufacturing, and design simulations.

Recent papers include Towards Data Driven Process Control in Manufacturing Car Body Parts and Simultaneous discovery of cancer subtypes and subtype features by molecular data integration.