Explanatory Data Analysis group
We are always looking for excellent and motivated students interested in doing research on explanatory data analysis and becoming a member of the EDA group. For Bachelor students this usually means doing the BSc thesis; for Master students, the MSc thesis or a research project. Moreover, we sometimes have a vacancy for a fully funded position as a PhD student or postdoc. Below you will find examples of what others say about working with Matthijs, a list of available student projects, and a list of previous student projects.
Are you interested in doing a research project in the EDA group? Contact us.
What others say
There is no better way to find out what it is like to work with someone than to hear it from others who have actually worked with him. This is what others say about working with Matthijs.
Matthijs was there from Day 0: he supervised my master thesis (which later turned into my first paper). I could always rely on Matthijs to provide an idea, a pointer to relevant literature, a lesson in writing or making slides, time for writing and reading our joint papers, general advice on how to survive a PhD, an enthusiasm boost, and many other things. These four and a half years would have been awfully harder without Matthijs!
It was a pleasure to work with Matthijs. Reflecting on the past few months, I can say that his patience and pro-active feedback dared me to ask more questions and to improve my work. Matthijs can be characterised as a reliable person with eye for detail. He was able to patiently explain situations if not immediately understood. He also helped me to analyse situations from different perspectives. With this, I am grateful that he has taught me how to practically apply the theories I learned during my bachelor program.
It was a great pleasure to share the office with Matthijs for a number of years, where we could have a lot of coffee breaks and useful discussions. [..] Matthijs also trained me how to give presentations, how to critically evaluate the performance of algorithms, how to write research papers and many other skills for data miners. Many thanks for the nights you stayed up late with me and Siegfried to finish papers on time!
Available student projects
Below is a list of suggested projects for BSc and MSc students; you are also free to propose your own project related to explanatory data analysis.
| Type | Project title | Together with |
|------|---------------|---------------|
| MSc | Information Theoretic Data Mining for Movement Data | Daniela Gawehns |
| MSc | Interesting scatterplot discovery | Lincen Yang |
| MSc | Conditional density estimation | Lincen Yang |
| MSc | Human-Guided Data Mining for Data Journalism | BBC Data Unit, Lincen Yang |
| MSc | Anonymizing medical data: developing a generalization algorithm that optimizes p-privacy | Shannon Kroes, Dr. Mart Janssen |
| MSc | Compression-based Pattern Mining from Spatio-Temporal Graphs | Dr. Mitra Baratchi |
Previous student projects
Below is a list of recent research projects that were supervised by Matthijs while he was affiliated with LIACS; before that he supervised six MSc thesis projects at Utrecht University and KU Leuven. (Types: MSc = Master thesis; Mp = Master project (other than thesis); BSc = Bachelor thesis.)
|MSc||2019||Evaluating Interpretability of Rule-based Classifiers
When building prediction models for fields involving high-stakes decision-making, it is not sufficient for a model to just give accurate predictions. Decision makers require models to be interpretable, meaning that they can understand why certain predictions are made for different instances. There exist many interpretable models for which it is claimed that people can easily understand how they work, while the interpretability of those models has not been adequately tested. In this thesis, we evaluate the interpretability of rule-based classifiers. Using eight measures of interpretability, we evaluate rule-based classifiers generated by four rule learning algorithms, both on benchmark and real-world data. We find that RIPPER seems to have poor interpretability. Results show that SBRL and FRL are more compact in structure, while CART is powerful when considering the dependency of rules. Experiments demonstrate that SBRL and CART can conform well to domain knowledge. We demonstrate that SBRL, CART, and FRL are useful classifiers. SBRL would be suitable if the user's main goal is to make predictions and then understand the explanation of a certain prediction. CART would be suitable when the user's main goal is to understand knowledge in data. FRL would be suitable if users would like to know what conditions could lead to a high probability of occurrence of an event. Additionally, results indicate that the number of rules and AIC are not good measures for the selection of rule-based classifiers.
|BSc||2019||Analysing time series data of children speaking in public to learn more about social anxiety
The aim of this thesis is to extract information from mixed time series and static data, in order to learn more about the development of stress responses during public speaking by adolescents. This is done with a descriptive data mining technique called subgroup discovery, which yields subsets of the data set described by conditions, also called rules. The time series data consist of ECG data collected during a public speaking task. The static data were collected by means of a questionnaire. In the first part, we present the results of an experiment that tests whether the quality measure can be tuned to obtain a desirable trade-off between subgroup size and the deviation of the target. These show that such tuning is possible. The expert with whom we are collaborating asked us to tune the quality measure such that it finds subsets that have at least 60 participants. The results of the second experiment, which compares a tuned version of the quality measure to default versions of the quality measures, show that the tuned version is the only quality measure that returns subgroups with a subgroup size around 60. So it is indeed better, for the purpose of this thesis, to use a tuned version instead of a default version of the quality measures. In the second part, we present the results of an experiment that tests whether there is an observed effect or correlation indicating that the obtained subgroups are unlikely to be due to chance. These show that at least the top-100 subgroups with nominal targets were significant. The results of the second experiment show that a higher heart rate, amplitude, and age are associated with a higher social anxiety score. A lower heart rate variability and increasing BMI are also associated with a higher social anxiety score.
The aim of this thesis is to offer useful insights so that psychologists can get a better understanding of adolescents. Finally, I recommend further research in order to improve and obtain new insights.
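The trade-off between subgroup size and target deviation described in this abstract can be illustrated with a small sketch. The abstract does not state which quality measure was tuned, so the family below (subgroup size raised to an exponent `a`, multiplied by the difference in target means) is an assumption, and the scores and subgroup memberships are made-up toy data.

```python
from statistics import mean

def subgroup_quality(target, in_subgroup, a=0.5):
    """Score a subgroup as size**a * (subgroup mean - overall mean).

    a = 0 ignores subgroup size; larger a favours larger subgroups.
    This measure family is an assumption, not the thesis's own measure.
    """
    members = [t for t, inside in zip(target, in_subgroup) if inside]
    if not members:
        return 0.0
    return len(members) ** a * (mean(members) - mean(target))

# Hypothetical anxiety scores for ten participants.
scores = [1, 2, 2, 3, 3, 4, 5, 6, 8, 9]
small = [False] * 8 + [True] * 2   # 2 participants, large deviation
large = [False] * 4 + [True] * 6   # 6 participants, smaller deviation

for a in (0.0, 1.0):
    print(a, subgroup_quality(scores, small, a), subgroup_quality(scores, large, a))
```

With `a = 0` the small, strongly deviating subgroup scores higher; with `a = 1` the larger subgroup does. Tuning the exponent is one plausible way to steer the search towards subgroups of a desired size.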
|MSc||2019||Stimulating the Adoption of A/B Testing in a Large-Scale Agile Environment
In today's business world, software companies like banks focus on making informed decisions in order to satisfy customer needs whilst also succeeding in business. One way to make informed decisions is to adopt A/B testing across the organisation. However, the adoption rate of A/B testing at case company ING—a large international bank—is low on average. In this thesis, we use an exploratory mixed case study to find out how to stimulate the adoption of A/B testing throughout the company. We do this through a survey with 295 respondents, a focus group with 13 participants, and by building a model for investigating the effects of using pre-experiment data. Our findings show that advanced functionalities and high discoverability are key incentives for increasing the adoption rate of the experiment platform at our case company. Last, we provide practical guidelines for improving the rate of significant A/B tests. We show that by using pre-experiment data—user behaviour data prior to the start of an experiment—the power of A/B tests increases.
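To give an idea of why pre-experiment data can increase the power of an A/B test, here is a minimal sketch of regression-style covariate adjustment. The abstract does not name the model that was built, so this particular adjustment is an assumption, and the data are synthetic.

```python
import random
from statistics import mean, variance

def covariate_adjust(metric, pre_metric):
    """Remove the part of the metric that pre-experiment behaviour explains."""
    pre_mean = mean(pre_metric)
    metric_mean = mean(metric)
    cov = sum((x - pre_mean) * (y - metric_mean)
              for x, y in zip(pre_metric, metric)) / (len(metric) - 1)
    theta = cov / variance(pre_metric)
    return [y - theta * (x - pre_mean) for x, y in zip(pre_metric, metric)]

# Synthetic users: behaviour during the experiment closely follows
# behaviour before the experiment, plus some noise.
rng = random.Random(0)
pre = [rng.gauss(10, 3) for _ in range(2000)]
during = [x + rng.gauss(0, 1) for x in pre]
adjusted = covariate_adjust(during, pre)

print(variance(during), variance(adjusted))
```

The adjusted metric keeps the same mean but has a much smaller variance, so the same difference between variants can be detected with fewer users or with higher power.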
|MSc||2018||Pre-Schoolers during Recess: Dynamic Patterns in Face-to-Face Interactions
When you ask a pre-schooler what she learned at school, she might say something like: a new song or writing my name. Maybe they went outside and collected plants or practiced a new sport. While the games and basic maths, the reading and writing lessons are a necessary foundation for their later careers at school, children also develop their social competence by being around other children during recess. Developmental psychologists try to understand how children form groups, how they interact with each other, and how this behavior can be linked to a child's development. Using radio frequency emitting devices, researchers collected proximity data of 97 children during morning and afternoon recess. We were asked to explore the data to generate new hypotheses. This exploration should go beyond aggregated statistics to provide new insights. In this thesis, we show how finding recurrent subgroups in the data set can be formalized in terms of a pattern mining task in a dynamic network. Our main contribution is to highlight the need for a temporal perspective when analyzing dynamic networks of face-to-face interactions. We also show that information theoretic principles can be used to find recurring structures in the provided data and present a proof-of-concept.
|MSc||2018||MDL-based Map Segmentation for Trajectory Mining
Discretization is a key issue in urban trajectory pattern mining tasks. By assuming that regions with different functions will probably have different densities of visiting people, we propose to segment the city map, and hence discretize trajectory data, by finding region boundaries based on strong density changes. We solve the map segmentation problem as a model selection problem, using the existing MDL-histogram approach. We also propose a heuristic algorithm so that we can apply MDL-histogram to 2-dimensional data (longitude and latitude). Finally, we validate our approach and algorithm by simulation studies and on taxi trajectories from New York City.
|BSc||2018||Topic modelling and clustering for error recognition in system logs
Our data is a collection of server logs. The server logs have been aggregated but have no labels to indicate what type they are. This thesis focuses on the discovery and clustering of these server logs, in particular logs containing the term 'error'. The research makes use of the unsupervised machine learning technique Latent Dirichlet Allocation (LDA). We extract the server logs and transform them using a standard data preprocessing pipeline. This yields a dataset that we call the corpus, in which the logs are the documents. Using the best practices from Blei, the founder of LDA, we create multiple models that vary only in the topic count. The models are evaluated using multiple metrics. Topic modelling distinguishes itself by being one of the few machine learning techniques that depend on the human readability of their models. We take a look at the topics generated by the models and conclude that a human has a hard time understanding the topics. Clustering the documents based on their most probable topic shows that models only have a few dominant topics where the bulk of the documents go. The clustering performs well at lower levels according to the silhouette coefficient. At the end of the thesis we do not recommend topic modelling for latent topic discovery on server logs. Topic models of server logs are not human readable, and applying semantic analysis metrics does not help much. The clustering, however, appears to create solid clusters when using low topic counts.
|Stephan van der Putten|
|BSc||2018||Predicting the Discharge Date of Patients using Classification Techniques
The expenses of hospitals are increasing due to medical and technological development. The hospital's objective is to improve the quality of health care while reducing costs. The length of stay is one of the performance indicators that is used as a measure of efficiency of a hospital. It is essential to know the discharge date in advance to optimize the use of human resources and facilities. This thesis aims to research how the discharge date can be predicted more accurately using various machine learning classification techniques. Several classification models have been used to predict the discharge date. In particular, classification and regression tree, random forest, naive Bayes, and support vector machine models have been applied and compared using the F1-score. Based on the results, the classification and regression tree appears to be the most insightful model to predict the discharge date. This model has been tuned and visualized. Moreover, this research shows how challenging it is to work with raw medical data of patients. It gives an insight into how the data is extracted and represented. It also shows how missing values are imputed using different methods.
|BSc||2018||Comparing ways of finding patterns in flight delays
The number of people traveling by airplane has been increasing over the last couple of years. Combine this with the effects of climate change, which brings more unpredictable and more extreme weather, and it is easy to see that it is a difficult task for airlines to make sure their flights arrive according to schedule. Studies have been done on finding patterns in flight data and weather data, for example on finding causes of flight delays and comparing quality measures to use with subgroup discovery. This work focuses on the best way to find patterns in flight data and weather data that say something about flight delays. The dataset used in this research consists of flight data with corresponding weather data of domestic flights in the United States in the year 2016. Only data of United Airlines was used, from the airports of Denver, Tampa and San Diego. The Diverse Subgroup Set Discovery (DSSD) algorithm was used for subgroup discovery. With this algorithm, experiments were done in which two quality measures, two equal frequency discretization techniques, three different bin sizes and two search strategies were compared, resulting in 72 conducted experiments. In conclusion, experiments showed that two ways are particularly interesting for finding subgroups with high delays. The first way results in smaller subgroups with particularly high delays, which can, for example, be used for outlier detection. The second way results in bigger subgroups with lower (but still above average) delays. This way can, for example, be used for capturing the effects of changes in strategy and policy made by the management of air carriers.
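As a rough illustration of equal frequency discretization, the sketch below picks cut points so that each bin receives about the same number of values. The abstract does not specify which two variants were compared, so this is just one simple variant, and the delay values are made up.

```python
def equal_frequency_edges(values, n_bins):
    """Return n_bins - 1 cut points so each bin gets about the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // n_bins] for i in range(1, n_bins)]

def assign_bin(value, edges):
    """Index of the bin a value falls into (values equal to an edge go right)."""
    return sum(value >= edge for edge in edges)

# Hypothetical arrival delays in minutes.
delays = [0, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144]
edges = equal_frequency_edges(delays, 3)

counts = [0, 0, 0]
for d in delays:
    counts[assign_bin(d, edges)] += 1

print(edges, counts)
```

Unlike equal-width bins, the cut points follow the data: skewed attributes such as delays still end up with evenly filled bins, which keeps the conditions that describe subgroups from being dominated by a few extreme values.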
|Mp||2017||Understanding Flight Delays
Commercial aviation is likely the most complex transportation system. This is because the three main parts of the system (the passengers, the on-board crew and the aircraft itself) must be coordinated through a tight schedule. A disturbance in one of the schedules can propagate through other schedules and can lead to big delays. The delays can lead to lost money due to increased expenses for the crew, extra fuel costs and loss of demand from passengers. In this research paper we search for relations between arrival delays and the corresponding flight and weather data. To this end, a data set of all domestic 2016 American Airlines flights with corresponding weather data is assembled. On this data set, subgroup discovery algorithms are applied to search for interesting attributes that correlate with the delay. In the experiment, subgroup discovery was done on the data set using multiple different search strategies and quality measures. We show that, as expected, rare weather phenomena such as thunderstorms and heavy snowfall can lead to heavy delays. Other common attributes that are associated with delays are a high dew point and high wet bulb temperature.
|BSc||2017||Diverse subgroup discovery for big data
Nowadays, the amount of data used in different fields is increasing. A consequence is that the task of data mining becomes more and more difficult. Highly redundant result sets and a big search space are the biggest problems when it comes to handling big data. The task of Subgroup Discovery (SD) is to discover interesting subgroups that show a deviating behavior with respect to some target property of interest. The existing Diverse Subgroup Set Discovery (DSSD) algorithm aims to solve the two above problems by considering subgroup sets rather than individual subgroups. During the process of searching for the subgroups, their covers (each representing a subset of the dataset) should be kept in memory. One of the approaches to deal with these subsets is to cache all covers, but this becomes a problem when we analyse big datasets because of large memory usage. In our study, we consider five subset implementation techniques and conduct experiments on 38 datasets. The results show that the empirical performance of the different techniques depends on the characteristics of the data. We investigate how nine such properties of the data can be used to predict runtime and memory usage. To this end, we built linear regression models using WEKA. The results show that the predictions of the memory usage models are more accurate than the predictions of the runtime models.
|BSc||2017||Analysing data to predict processing times of vessels and barges in liquid bulk terminals
Effective berth scheduling allows terminals to maximise their utilisation of berth space and thus meet their terminal throughput targets. We propose a way to overcome the hard task of manually predicting processing times in liquid bulk terminals. We aim to reduce waiting times and thus optimise scheduling. Ultimately, the goal is to design a model that can accurately predict the processing times of vessels and barges in a liquid bulk terminal. Data from EuroTankTerminal, Rotterdam, is pre-processed and analysed to uncover problems and audit errors. Following this process, we compare different types of machine learning regression algorithms: linear regression, random forests, support vector machines, and XGBoost. Our results show that XGBoost performs well at predicting berth occupation times from historic data. Further study should be directed towards the challenging infrastructure in liquid bulk terminals, as certain rules and conditions based on the individual terminals, such as pipeline availability, could add information to the model and further improve prediction results.
|BSc||2017||Data-Driven Estimation of Consultation Time using Regression
Doctors are able to give the right vaccines and inform customers about their travel destinations to ensure a prepared journey. To know exactly how much time is needed for these appointments, we can use historic data to perform analyses and make predictive models. Throughout the years, appointment data and GPS data have been collected. However, these data have not yet been used for estimating consultation time. The aim of this thesis is therefore to research how regression techniques can be used to accurately estimate appointment time based on historic data. We look at which data can be used for the estimation. Data can then be interpreted to see what effects the attributes have on time duration. We also look at which regression models can be used to accurately model the consultation time. For this, we introduce the use of (ensemble) regression trees and model trees. Different experiments are conducted to compare the regression models and decide on the most suitable model. Based on these results, the model tree comes out as the most insightful model to make time predictions. Moreover, this research also opens up the discussion on how much models should be relied on. It shows that human factors are important influencers in models and thus raises questions about the extent to which models can be used for human decision making. Data cannot explain every event, and data-driven predictions may conflict with future policies.
|BSc||2016||On the Description-driven Community Mining algorithm
A more powerful classifier was used to replace the original one in the Description-driven Community Mining (DCM) algorithm, and a set of comparison experiments was carried out to test whether the resulting MDCM (modified Description-driven Community Mining) algorithm is better than DCM. By analyzing the results, we obtained several conclusions. First, a more powerful classifier (implying that more homophily holds in the community) does not necessarily mean better structure in some cases. Second, the sacrifice of homophily in structure is not large, but cannot be ignored. Finally, future work to further improve the performance of MDCM is discussed, along with guidance on when and where to use MDCM.
|BSc||2016||An Attempt to Detecting Patterns Among Children on the Playground using Attributed Graph Mining
In this thesis we consider an attributed graph miner to find patterns among children who play on the playground. These patterns may provide us with a deeper understanding of the impact that the social-emotional skills of a child have on his social interactions. With a pattern mining approach we hope to find unexplored information that was not located by the previously used statistical approach. As our pattern mining method we chose CoPaM, an attributed graph miner that returns connected vertices with cohesive attributes. Firstly, we discuss the data pre-processing required to prepare the dataset as input for a dynamic social network whose vertices are associated with features. After that we examine the functionality and output of CoPaM. Next we visualize the output which gave an interesting insight into the interactions of children and provided a graphic overview of the data. Additionally, while analyzing the output of CoPaM we stumbled upon the fact that CoPaM was designed for a static instead of a dynamic attributed graph which caused a rise in the output of found patterns. To cope with this rise we focused on the frequency of each feature, the most prominent patterns and the pattern with the most vertices or features during post-processing of the output. In conclusion, a child’s capacity to calm down or to be calmed down seemed to be the most prominent feature that was present in groups. Nevertheless, after visualizing and analyzing the output of CoPaM there seem to be no strong patterns in the data that present a correlation between social-emotional skills and social development.
|BSc||2016||Application of Subgroup Discovery: Finding patterns using heart rate time series
Social anxiety is a disorder that could severely hamper the life of individuals. It is still unclear when social anxiety develops. At the time of writing, a study is being conducted with the goal of finding out when social anxiety develops and how it can go astray. Understanding the relation between physiological data and the heart rate behaviour could lead to new insight. In this thesis we use subgroup discovery to find out what could cause unusual behaviour of the heart rate time series. Due to the high dimensionality of time series, we use the Exceptional Model Mining Framework. To enable subgroup discovery we first preprocess the data and extract relevant features. Then we develop a quality function that we use to compare generated subgroups. The quality function uses the statistical z-score derived from the distance between subgroups using the Euclidean distance. The quality measure has been implemented using Cortana, as it already contains an implementation of the Exceptional Model Mining Framework and is open source. To test our quality function we generated several synthetic datasets. Results show that the quality measure is able to find deviating heart rates and score them accordingly. Using the quality measure on the SAND data shows us that no patterns exist with a strongly deviating heart rate. The findings of the domain expert show us that the biggest differences occur in the first minute. This is interesting for future research.
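The idea of a z-score quality function built on Euclidean distance can be sketched as follows. The abstract does not give the exact formula, and the thesis implementation was done in Cortana, so everything here is an assumption: the function names are hypothetical, the baseline of random equally sized subgroups is one possible way to obtain a z-score, and the heart rate series are synthetic.

```python
import math
import random
from statistics import mean, pstdev

def mean_series(series_list):
    """Point-wise mean of equally long time series."""
    return [mean(points) for points in zip(*series_list)]

def euclidean(a, b):
    """Euclidean distance between two equally long series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def zscore_quality(all_series, subgroup_idx, n_random=200, seed=0):
    """z-score of the subgroup-vs-rest distance against random subgroups of equal size."""
    def distance(idx):
        chosen = set(idx)
        inside = [s for i, s in enumerate(all_series) if i in chosen]
        outside = [s for i, s in enumerate(all_series) if i not in chosen]
        return euclidean(mean_series(inside), mean_series(outside))

    rng = random.Random(seed)
    baseline = [distance(rng.sample(range(len(all_series)), len(subgroup_idx)))
                for _ in range(n_random)]
    return (distance(subgroup_idx) - mean(baseline)) / pstdev(baseline)

# Synthetic heart rate series: 20 participants, 5 time points each;
# the first 5 participants have a heart rate that is 10 bpm higher.
data_rng = random.Random(1)
series = [[70 + data_rng.gauss(0, 1) for _ in range(5)] for _ in range(20)]
for i in range(5):
    series[i] = [v + 10 for v in series[i]]

deviating = zscore_quality(series, [0, 1, 2, 3, 4])
ordinary = zscore_quality(series, [10, 11, 12, 13, 14])
print(deviating, ordinary)
```

The subgroup that contains exactly the elevated series receives a clearly higher score than a subgroup of ordinary series, which mirrors the synthetic-data check described in the abstract.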
|BSc||2016||Analysing flight recorder data: A data-driven safety analysis of mixed fleet flying
A major airline recently acquired a new airplane: the Boeing 787. To achieve more operational efficiency, this plane is flown by pilots already flying the Boeing 777. However, the airline wants to make sure that this practice does not influence flight safety. This is done by analyzing the landings, which led to the following research question: does mixed fleet flying of Boeing 777 and Boeing 787 airplanes influence landing performance on Boeing 777 airplanes? Previous research on machine learning and flight recorder data focused almost exclusively on detecting anomalies. We use machine learning techniques on Boeing 777 flight recorder data to determine if there is a difference in performance between mixed fleet flying pilots and regular pilots, more specifically in the landing phase of the flight. We used both features proposed by experts and automatically constructed features. Although our techniques were able to distinguish the two subtypes of Boeing 777 airplanes as a proof of concept, a substantial difference in pilot performance was not found in this data set using the techniques presented in this research. These findings support the idea that mixed fleet flying of Boeing 787 and Boeing 777 airplanes does not impact pilot performance.