Explanatory Data Analysis group
We are always looking for excellent and motivated students interested in doing research on explanatory data analysis and becoming a member of the EDA group. For Bachelor students this usually means doing the BSc thesis; for Master students, the MSc thesis. Moreover, we sometimes have a vacancy for a fully-funded position as a PhD candidate or postdoctoral researcher. Below you will find examples of what others say about working with Matthijs, a list of available student projects, and a list of previous student projects.
Are you interested in doing a research project in the EDA group? Contact us.
What others say
There is no better way to find out what it is like to work with someone than to hear it from others who have actually worked with them. This is what others say about working with Matthijs.
Matthijs is a great supervisor! I really appreciated the combination of letting me manage my master's thesis independently, while providing me with sharp and constructive feedback and direction. He taught me to always keep the scientific goal of my research in mind, but also to place it in a broader perspective. His way of communicating is supportive, clear and friendly, and he always makes time to answer questions or to provide feedback. Therefore, I would like to thank Matthijs for his time and support during my thesis, it was a pleasure to work with you!
Prof. Van Leeuwen has always been there to comfort and motivate me in my arduous and challenging Ph.D. journey. Although I was miles away from him, in India, he always had time for me. Numerous discussions and weekly Skype meetings with him made me overcome every obstacle I faced during this course. He always went beyond expectations, even when his newborn baby needed all his attention. Whenever required, he even dedicated his personal time to advise me on my thesis. He always encouraged me to take new initiatives and make decisions in my research work. I am blessed to have such a devoted professor who guided me through one phase of my life, and his teachings will always stay with me in every phase.
The first praises go to my supervisor, Matthijs van Leeuwen, whose guidance helped me traverse the sinuous paths of research. You not only inspired me as a researcher, but as a person. I could always count on you, both professionally and personally. You have never pre-emptively judged my choices and I am deeply honoured to have been your student.
Available student projects
Below is a list of suggested thesis research projects for BSc and MSc students; you are also free to propose your own project related to explanatory data analysis.
|Type||Project title||Together with|
|BSc||Spatial Subgroup Discovery||Ioanna Papagianni|
|MSc||Information-theoretic trajectory mining||Lincen Yang|
|MSc||Information-theoretic rule learning for eXplainable AI (XAI)||Lincen Yang|
|MSc||Conditional density estimation||Lincen Yang|
Previous student projects
Below is a list of recent research projects that were supervised by Matthijs while he was affiliated with LIACS. (Types: MSc = Master thesis; Mp = Master project (other than thesis); BSc = Bachelor thesis.)
|BSc||2021||Weather as a Trigger for Migraines: Predictive Modelling on Migraine E-Diaries using Machine Learning
Background: Migraine is a prevalent, multifactorial brain disease, characterized by an intense form of headache, typically causing pain in one half of the head, nausea, and visual changes. It affects approximately 1 out of 7 adults annually. Much is known about the pathology of migraine, but there is still a lot of speculation about its triggers; 53% of migraineurs indicate weather to be a trigger for attacks. Related work: Research into the correlation between weather and migraine is contradictory. This is explained by the existence of a subgroup of migraine patients with weather-dependent migraine, yet little research has been done on identifying this subgroup. This research expands the current work on the relation between migraine and the weather by using machine learning to predict the start of migraine attacks and thereby identify migraineurs with weather-dependent migraine. Data: For the migraine data, LUMC LUMINA e-diaries are used. From these diaries, the age of patients, the patients' indication of whether they have weather-dependent migraine, and the start dates of migraine attacks were used. The weather data was publicly available from the KNMI; temperature, sunshine, precipitation, air pressure, cloudiness, and relative humidity were taken into account. Methods: First, the data was preprocessed, selecting the usable patients. The 860 selected patients were linked with the weather data using zip codes. These patients were divided into 5 folds using a stratified train-test split, where the stratification was based on age, the number of months in the diary, and whether the patient indicated to have weather-dependent migraine. Random forest and support vector machine models were then applied to the data using 5-fold cross-validation. Results: The predictions have a mean accuracy over all folds of 60.5% for random forest and 59.7% for support vector machine.
Also, a subgroup of patients with a sensitivity > 0.5 and a specificity > 0.5 was analysed; this subgroup contains approximately 10% of the patients. Random forest and support vector machine show a linear correlation when plotting the per-patient accuracy of both models. Discussion & limitations: Random forest is better at finding a pattern in the subgroup, yielding 71 more patients with a sensitivity > 0.5 and specificity > 0.5 than support vector machine. No distinctive pattern could be found in the baseline characteristics of the subgroup. Conclusion & further research: Using classification models, we found that for approximately 10% of the patients we were able to make predictions with a sensitivity > 0.5 and a specificity > 0.5 based solely on weather and weather changes. This shows that there are patterns in weather conditions that could be a trigger for patients with weather-dependent migraine.
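The stratified fold construction described in this abstract can be sketched in a few lines of Python. The patient records and stratification key below are hypothetical, and a real analysis would typically use a library implementation such as scikit-learn's StratifiedKFold:

```python
from collections import defaultdict

def stratified_folds(patients, key, n_folds=5):
    """Assign patients to folds so that each stratum (as defined by
    `key`) is spread as evenly as possible across the folds."""
    strata = defaultdict(list)
    for p in patients:
        strata[key(p)].append(p)
    folds = [[] for _ in range(n_folds)]
    i = 0
    for stratum in strata.values():
        for p in stratum:  # deal each stratum out round-robin
            folds[i % n_folds].append(p)
            i += 1
    return folds

# Hypothetical patient records: (id, age_group, weather_dependent)
patients = [(n, n % 3, n % 2 == 0) for n in range(20)]
folds = stratified_folds(patients, key=lambda p: (p[1], p[2]))
```

Each fold then serves once as the test set in 5-fold cross-validation, while the remaining folds are used for training.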
|MSc||2021||Graph pattern mining for blockchain networks
Addressing the growth of the field of blockchain and cryptocurrencies, this study analyses datasets of two cryptocurrencies, Bitcoin and Ethereum, for suspicious transaction activities. We represent the blockchain data as a graph network and execute graph pattern mining methods to search for suspicious transaction patterns. We developed two graph pattern mining algorithms to discover suspicious transaction patterns following a circular or diamond-shaped structure. This research also conducts a quantitative evaluation of the observed patterns and investigates their relation with blockchain metrics. To assess the observed patterns, we propose a novel locality-sensitive-hashing-based method named relevance evaluation: a statistical evaluation method for graph patterns that conducts various statistical tests and helps to classify graph patterns as usual or unusual observations. Results of this study unveil the relation between suspicious transaction activities, such as money laundering, and cryptocurrencies.
|BSc||2021||Can a human-computer hybrid outperform the LTM tiling algorithm?
Data mining is becoming increasingly important in our world. In a wide range of fields, understanding the constantly growing amount of gathered data is essential. For mining binary data sets specifically, the concept of tiles was introduced as a potentially interesting type of pattern. A tile represents some knowledge about the data set it is a part of. Finding the minimum tiling of a data set, i.e., a tiling that covers the entire data set with the smallest number of tiles, is a problem that has been proven to be NP-hard. Nonetheless, efforts have been made to design algorithms that find good tilings. The Large Tile Mining (LTM) algorithm was introduced to mine large tiles in binary databases. LTM can be used to construct a tiling by iteratively mining the largest tile from a given database. The resulting tiling is often close to the minimum tiling, but improvements can still be made. Previous research has shown that human-computer interaction can improve data mining performance. We try to tackle the minimum tiling problem by letting humans cooperate with LTM to produce a better tiling than can be achieved by using LTM alone. To this end, we have constructed a web application that allows users to construct a tiling while getting suggestions from LTM. The experimental results show that a small percentage of people manage to produce a better tiling than the one produced by iteratively applying LTM. We also discuss the drawbacks of this approach, mainly the limits of human attention span and the fact that this approach is unlikely to scale well.
|BSc||2021||Querying Frequent Itemsets in the Browser
The aim of this thesis is to investigate how non-experts can effectively query a set of frequent itemsets to find interesting patterns in their data set. Frequent itemset mining is a technique for finding repeating patterns in data that consists of rows of transactions, such as purchases in a supermarket; these patterns consist of items that often occur together in the transactions. This information can be used by domain experts to, for instance, optimize the positioning of items in a store to increase sales. Currently, it is difficult for domain experts to find these frequent itemsets, because the tools that can do this are not very accessible to people without a background in data mining. Moreover, even when a domain expert obtains the frequent itemsets, the resulting list is usually too large to look through without any help. This thesis tries to make it easier for domain experts to find frequent itemsets and to exploit their knowledge by letting them filter through the itemsets using an easy-to-use query language. To this end, we propose a browser-based tool that obtains the frequent itemsets from a given data set and allows the user to filter through them. The program runs completely in the browser, without offloading anything to a server, to ensure the safety of the user's data. We also performed a preliminary user study to evaluate the usability and effectiveness of the query language. From these results, we can say that the query language is effective for people with a background in computer science.
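For intuition, frequent itemset mining can be sketched with a naive counter over all small itemsets; practical tools use algorithms such as Apriori or FP-growth to avoid enumerating every candidate. The baskets below are made up:

```python
from itertools import combinations
from collections import Counter

def frequent_itemsets(transactions, min_support, max_size=3):
    """Naively count all itemsets up to `max_size` items and keep
    those occurring in at least `min_support` transactions."""
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))
        for k in range(1, max_size + 1):
            for itemset in combinations(items, k):
                counts[itemset] += 1
    return {s: c for s, c in counts.items() if c >= min_support}

# Hypothetical supermarket transactions
baskets = [["bread", "milk"], ["bread", "butter"],
           ["bread", "milk", "butter"], ["milk"]]
frequent = frequent_itemsets(baskets, min_support=2)
# e.g. ("bread", "milk") occurs in 2 of the 4 baskets
```

A query language over such a result set then amounts to filtering this dictionary, for instance by support threshold, itemset size, or the presence of a specific item.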
|MSc||2021||Explainable AI for Predicting ICU Readmission
Intensive Care Unit (ICU) readmission is a serious adverse event associated with high mortality rates and costs. Prediction of ICU readmission could support physicians in their decision to discharge patients from the ICU to lower care wards. Due to increasing ICU data availability, Artificial Intelligence (AI) models in the form of machine learning (ML) algorithms can be used to build high-performing decision support tools. To have an impact on patient outcomes, these decision support tools should have high discriminative performance and should be explainable to the ICU physician. The goal of this thesis was to compare several types of ML models on predictive performance and explainability for the prediction of ICU readmission for discharge decision support. The scientific paper that aims to answer this question can be found in Part III of this thesis. In a broader perspective, we proposed a framework for the development and implementation of clinically valuable AI-based decision support. First, a systematic review was conducted to examine current literature on ML prediction models for ICU readmission (Part I). We concluded that previously developed models reported inappropriate performance metrics and were not implemented in clinical practice. Furthermore, previous work did not compare explainable outcomes in terms of patient factors contributing to the risk of readmission between models. Secondly, we conducted a questionnaire among ICU physicians to investigate current discharge practices and their attitude towards the use of AI tools in their work processes (Part II). Although not all physicians agreed that the decision to discharge ICU patients is complex, most of them do believe in the clinical value of an AI-based discharge decision support tool. Thirdly, we developed several prediction models for ICU readmission and compared them on discriminative performance, calibration properties, and explainability (Part III).
We concluded that advanced ML models did not outperform logistic regression in terms of discriminative performance and calibration properties. However, the explanations of XGBoost, a state-of-the-art ML algorithm, were more in line with the ICU physician’s clinical reasoning compared to logistic regression and neural networks. Lastly, we designed a study protocol to prospectively evaluate the predictive performance of Pacmed Critical, a CE-certified AI-based discharge decision support tool, and that of the ICU physician (Part IV). This thesis contributed to making the step from developing high-performing prediction models to clinical adoption of an ICU discharge decision support system. Due to small differences in discriminative power and calibration properties between models, the model best explainable to the physician and most in line with clinical reasoning should be chosen for decision support. Before final implementation, impact on patient outcomes and costs will need to be studied in prospective trials.
|Siri van der Meijden|
|Mp||2021||Associations between migraine events and weather patterns
This introductory research project is an exploration of what is possible using data science applied to the field of migraine research. We adapted a statistical pattern mining method to investigate whether associations exist between migraine attacks and weather changes. We provide an adaptation of the DSPM-MTC framework and explain our approach to labeling the data. The results indicate that weather patterns from univariate weather variables are not associated with the onset of a migraine attack; however, there may still be hidden associations due to the limitations of our approach. We therefore discuss each limitation and provide possible alternative methods or adaptations to alleviate it.
|BSc||2020||Accessibility of subgroup discovery on housing data through bar visualization
In this thesis we investigate whether visualizing the results of subgroup discovery on a housing data set makes them more accessible, i.e., more easily understandable for laymen. Subgroup discovery is a data mining technique that can find subsets in a data set with similar and interesting behavior towards a target variable. The results of such an experiment can give relevant insights into the data, but they are often displayed in a text format, which might be hard to interpret for non-experts. Using the bar visualization technique on subgroup discovery results might make them more accessible to people with no data mining affinity. Two surveys, one showing regular text results and the other showing the visualized results, asked respondents questions about the results. This yields an impression of their understanding of the results, and thus of the accessibility of the results. The surveys found that the bar visualization did not make the subgroup discovery results significantly more accessible for non-experts. In general, respondents with experience in computer science appeared to be more confident interpreting the results than those without this experience, regardless of how the results were presented. For the age groups that were large enough to give an impression of differences, no substantial differences in understanding were found. A suggestion for further research is to use more data sets, in order to better understand differences in the accessibility of subgroup discovery results when using the bar visualization. Having more responses to the surveys could possibly show that age makes a difference. Using different subgroup discovery result sets and/or other visualization methods may also provide interesting insights in this area of research.
|MSc||2020||A Random Forest Approach for Dealing with Missingness: a Case Study in Primary Care Data
In this master’s thesis we propose a novel random forest method developed specifically for dealing with missing data, for cases where classical methods of handling missing data, such as imputation, are undesirable due to the introduction of bias. We compare this novel method, called Lost in the Forest (LITF), with classical random forest methods trained on data in which missingness has been handled in various ways: imputation via the mean, k-nearest neighbors, or Multivariate Imputation by Chained Equations, and using a dummy variable that indicates missingness of a value. We first perform this comparison on a large routine primary care data set, predicting five-year risk of cardiovascular events. Imputation of important risk predictors in these data yields biased results when compared with the de facto gold standard, indicating a possible missing-not-at-random mechanism. Although the performance of LITF is comparable to that of other random forest methods, it provides additional insight into the contribution of continuous variables to the prediction without the need for imputation, thus avoiding possible bias. Validation of our methodology on two clinical validation data sets with varying extents of missingness confirms these results.
|Teddy Etoeharnowo||MSc||2019||Evaluating Interpretability of Rule-based Classifiers
When building prediction models for fields involving high-stakes decision-making, it is not sufficient for a model to just give accurate predictions. Decision makers require models to be interpretable, meaning that they can understand why certain predictions are made for different instances. There exist many interpretable models for which it is claimed that people can easily understand how they work, while the interpretability of those models has not been adequately tested. In this thesis, we evaluate the interpretability of rule-based classifiers. Using eight measures of interpretability, we evaluate the interpretability of rule-based classifiers generated by four rule learning algorithms, on both benchmark and real-world data. We find that RIPPER seems to have poor interpretability. Results show that SBRL and FRL are more compact in structure, while CART is powerful when considering the dependency of rules. Experiments demonstrate that SBRL and CART can conform well to domain knowledge. We demonstrate that SBRL, CART, and FRL are useful classifiers: SBRL is suitable if the user's main goal is to make predictions and then understand the explanation of a given prediction; CART is suitable when the user's main goal is to understand knowledge in the data; and FRL is suitable if users would like to know which conditions could lead to a high probability of occurrence of an event. Additionally, results indicate that the number of rules and AIC are not good measures for the selection of rule-based classifiers.
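Structural measures of this kind are straightforward to compute for a rule list. The sketch below shows two simple proxies, rule count and average conditions per rule, on a hypothetical rule list; these are illustrative only, not the thesis's actual set of eight measures:

```python
def rule_list_complexity(rules):
    """Two simple structural interpretability proxies for a rule list:
    the number of rules and the average number of conditions per rule."""
    n_rules = len(rules)
    n_conditions = sum(len(conds) for conds, _ in rules)
    return n_rules, n_conditions / n_rules

# Hypothetical rule list: (list of conditions, predicted class)
rules = [
    ([("age", ">", 60), ("bmi", ">", 30)], "high risk"),
    ([("smoker", "==", True)], "high risk"),
    ([], "low risk"),  # default rule, no conditions
]
size, avg_conds = rule_list_complexity(rules)
```

Smaller values on both proxies usually mean a model is easier to read, although, as the thesis notes, compactness alone does not guarantee interpretability.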
|BSc||2019||Analysing time series data of children speaking in public to learn more about social anxiety
The aim of this thesis is to extract information from mixed time series and static data, in order to learn more about the development of stress responses during public speaking by adolescents. This is done using a descriptive data mining technique called subgroup discovery, which results in subsets of the entire data set described by conditions, also called rules. The time series data consists of ECG data collected during a public speaking task; the static data was collected by means of a questionnaire. In the first part, we present the results of an experiment investigating whether we can tune the quality measure such that we obtain a desirable trade-off between subgroup size and the deviation of the target. These show that such tuning is indeed possible. The expert with whom we are collaborating indicated that the quality measure should be tuned such that it finds subgroups of at least 60 participants. The results of the second experiment, which compares a tuned version of the quality measure to a default version, show that the tuned version is the only one that returns subgroups with a size around 60. So it is indeed better, for the purpose of this thesis, to use a tuned version instead of a default version of the quality measure. In the second part, we present the results of an experiment investigating whether there is an observed effect or correlation indicating that the obtained subgroups are unlikely to be based on luck. These show that at least the top-100 subgroups with nominal targets were significant. The results of the second experiment show that a higher heart rate, amplitude, and age are associated with a higher social anxiety score; a lower heart rate variability and increasing BMI are also associated with a higher social anxiety score.
The aim of this thesis is to offer useful insights so that psychologists can gain a better understanding of adolescents. Finally, I recommend further research in order to improve and obtain new insights.
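The trade-off between subgroup size and target deviation that subgroup discovery quality measures capture can be illustrated with weighted relative accuracy (WRAcc), a standard measure for binary targets. The data, column names, and threshold below are made up for illustration:

```python
def wracc(data, target, condition):
    """Weighted relative accuracy of the subgroup defined by `condition`:
    coverage * (target share in subgroup - target share overall)."""
    subgroup = [row for row in data if condition(row)]
    if not subgroup:
        return 0.0
    coverage = len(subgroup) / len(data)
    p_sub = sum(target(row) for row in subgroup) / len(subgroup)
    p_all = sum(target(row) for row in data) / len(data)
    return coverage * (p_sub - p_all)

# Hypothetical rows: (air_pressure_hPa, target)
data = [(990, 1), (995, 1), (1000, 0), (1010, 0),
        (1015, 0), (985, 1), (1005, 0), (1012, 1)]
q = wracc(data, target=lambda r: r[1], condition=lambda r: r[0] < 1000)
```

A search algorithm evaluates many candidate conditions with such a measure and keeps the highest-scoring ones; tuning the measure, as done in the thesis above, shifts how strongly size is weighed against deviation.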
|MSc||2019||Stimulating the Adoption of A/B Testing in a Large-Scale Agile Environment
In today's business world, software companies like banks focus on making informed decisions in order to satisfy customer needs whilst also succeeding in business. One way to make informed decisions is to adopt A/B testing across the organisation. However, the adoption rate of A/B testing at case company ING, a large international bank, is low on average. In this thesis, we use an exploratory mixed-methods case study to find out how to stimulate the adoption of A/B testing throughout the company. We do this through a survey with 295 respondents, a focus group with 13 participants, and by building a model for investigating the effects of using pre-experiment data. Our findings show that advanced functionalities and high discoverability are key incentives for increasing the adoption rate of the experiment platform at our case company. Lastly, we provide practical guidelines for improving the rate of significant A/B tests. We show that by using pre-experiment data, i.e., user behaviour data collected prior to the start of an experiment, the power of A/B tests increases.
|MSc||2018||Pre-Schoolers during Recess: Dynamic Patterns in Face-to-Face Interactions
When you ask a pre-schooler what she learned at school, she might say something like: a new song, or writing my name. Maybe she went outside and collected plants or practiced a new sport. While the games and basic maths, the reading and writing lessons, are a necessary foundation for her later career at school, children also develop their social competence by being around other children during recess. Developmental psychologists try to understand how children form groups, how they interact with each other, and how this behavior can be linked to a child's development. Using radio-frequency-emitting devices, researchers collected proximity data of 97 children during morning and afternoon recess. We were asked to explore the data to generate new hypotheses, going beyond aggregated statistics to provide new insights. In this thesis, we show how finding recurrent subgroups in the data set can be formalized as a pattern mining task in a dynamic network. Our main contribution is to highlight the need for a temporal perspective when analyzing dynamic networks of face-to-face interactions. We also show that information-theoretic principles can be used to find recurring structures in the provided data and present a proof-of-concept.
|MSc||2018||MDL-based Map Segmentation for Trajectory Mining
Discretization is a key issue in urban trajectory pattern mining tasks. By assuming that regions with different functions will likely have different densities of visiting people, we propose to segment the city map, and hence discretize trajectory data, by finding region boundaries based on strong density changes. We cast the map segmentation problem as a model selection problem, using the existing MDL-histogram approach. We also propose a heuristic algorithm so that we can apply MDL-histogram to 2-dimensional data (longitude and latitude). Finally, we validate our approach and algorithm through simulation studies and on taxi trajectories from New York City.
|BSc||2018||Topic modelling and clustering for error recognition in system logs
Our data is a collection of server logs that have been aggregated but carry no labels indicating their type. This thesis focuses on the discovery and clustering of these server logs, with an emphasis on logs containing the term 'error'. The research makes use of the unsupervised machine learning technique Latent Dirichlet Allocation (LDA). We extract the server logs and transform them using a standard data preprocessing pipeline, yielding a corpus in which each log is a document. Following best practices from Blei, who introduced LDA, we create multiple models that vary only in the topic count, and evaluate them using multiple metrics. Topic modelling distinguishes itself as one of the few machine learning techniques whose models depend on human readability. Looking at the topics generated by the models, we conclude that a human has a hard time understanding them. Clustering the documents based on their most probable topic shows that the models have only a few dominant topics to which the bulk of the documents are assigned; this clustering performs well in terms of silhouette coefficient at low topic counts. At the end of the thesis, we do not recommend topic modelling for latent topic discovery on server logs: the topics are not human-readable, and applying semantic analysis metrics does not help much. The clustering, however, appears to produce solid clusters when using low topic counts.
|Stephan van der Putten|
|BSc||2018||Predicting the Discharge Date of Patients using Classification Techniques
The expenses of hospitals are increasing due to medical and technological development. The hospital's objective is to improve the quality of health care while reducing costs. The length of stay is one of the performance indicators used as a measure of a hospital's efficiency, and it is essential to know the discharge date in advance to optimize the use of human resources and facilities. This thesis aims to research how the discharge date can be predicted more accurately using various classification machine learning techniques. Several classification models have been used to predict the discharge date; in particular, classification and regression trees, random forests, naive Bayes, and support vector machines have been applied and compared using the F1-score. Based on the results, the classification and regression tree appears to be the most insightful model for predicting the discharge date; this model has been tuned and visualized. Moreover, this research shows how challenging it is to work with raw medical data of patients. It gives an insight into how the data is extracted and represented, and shows how missing values are imputed using different methods.
|BSc||2018||Comparing ways of finding patterns in flight delays
The number of people traveling by airplane has been increasing over the last couple of years. Combine this with the effects of climate change, which lead to more unpredictable and extreme weather, and it is easy to see that it is a difficult task for airlines to make sure their flights arrive according to schedule. Studies have been done on finding patterns in flight data and weather data, for example on finding causes of flight delays and comparing quality measures to use with subgroup discovery. This work focuses on finding the best way to find patterns in flight data and weather data that reveal something about flight delays. The dataset used in this research consists of flight data with corresponding weather data for domestic flights in the United States in the year 2016; only data for United Airlines flights from the airports of Denver, Tampa, and San Diego was used. The Diverse Subgroup Set Discovery (DSSD) algorithm was used for subgroup discovery. With this algorithm, experiments were done in which two quality measures, two equal-frequency discretization techniques, three different bin sizes, and two search strategies were compared, resulting in 72 conducted experiments. In conclusion, the experiments showed that there are two ways that are particularly interesting for finding subgroups with high delays. The first results in smaller subgroups with particularly high delays, which can, for example, be used for outlier detection. The second results in bigger subgroups with lower (but still above-average) delays, which can, for example, be used for capturing the effects of changes in strategy and policy made by the management of air carriers.
|Mp||2017||Understanding Flight Delays
Commercial aviation is likely the most complex transportation system, because its three main parts, the passengers, the on-board crew, and the aircraft itself, must be coordinated through a tight schedule. A disturbance in one of the schedules can propagate through the other schedules and lead to big delays. These delays cost money due to increased expenses for the crew, extra fuel costs, and loss of demand from passengers. In this research paper we search for relations between arrival delays and the corresponding flight and weather data. To this end, a data set of all domestic 2016 American Airlines flights with corresponding weather data is assembled, on which subgroup discovery algorithms are applied to search for interesting attributes that correlate with the delay. In the experiments, subgroup discovery was performed on the data set using multiple search strategies and quality measures. We show that, as expected, rare weather phenomena such as thunderstorms and heavy snowfall can lead to severe delays. Other common attributes associated with delays are a high dew point and a high wet-bulb temperature.
|BSc||2017||Diverse subgroup discovery for big data
Nowadays, the amount of data used in different fields is increasing, and as a consequence the task of data mining becomes more and more difficult. Highly redundant result sets and a large search space are the biggest problems when it comes to handling big data. The task of Subgroup Discovery (SD) is to discover interesting subgroups that show deviating behavior with respect to some target property of interest. The existing Diverse Subgroup Set Discovery (DSSD) algorithm aims to solve the two problems above by considering subgroup sets rather than individual subgroups. During the search for subgroups, their covers, each representing a subset of the dataset, must be kept in memory. One approach is to cache all these covers, but this becomes a problem when analysing big datasets because of large memory usage. In our study, we consider five subset implementation techniques and conduct experiments on 38 datasets. The results show that the empirical performance of the different techniques depends on the characteristics of the data. We investigate how nine such properties of the data can be used to predict runtime and memory usage. To this end, we built linear regression models using WEKA. The results show that the predictions of the memory usage models are more accurate than those of the runtime models.
|BSc||2017||Analysing data to predict processing times of vessels and barges in liquid bulk terminals
Effective berth scheduling allows terminals to maximise their utilisation of berth space and thus meet their terminal throughput targets. We propose a way to overcome the hard task of manually predicting processing times in liquid bulk terminals. We aim to reduce waiting times and thus optimise scheduling. Ultimately, the goal is to design a model that can accurately predict the processing times of vessels and barges in a liquid bulk terminal. Data from EuroTankTerminal, Rotterdam, is pre-processed and analysed to uncover problems and audit errors. Following this process, we compare different types of machine learning regression algorithms: linear regression, random forests, support vector machines, and XGBoost. Our results show that XGBoost performs well at predicting berth occupation times from historic data. Further study should be directed towards the challenging infrastructure in liquid bulk terminals, as rules and conditions specific to individual terminals, such as pipeline availability, could add information to the model and further improve prediction results.
|BSc||2017||Data-Driven Estimation of Consultation Time using Regression
Doctors administer the appropriate vaccines and inform customers about their travel destinations to ensure a well-prepared journey. To know exactly how much time is needed for these appointments, we can use historic data to perform analyses and build predictive models. Throughout the years, appointment and GPS data have been collected; however, these data have not yet been used for estimating consultation time. The aim of this thesis is therefore to research how regression techniques can be used to accurately estimate appointment time based on historic data. We examine which data can be used for the estimation, and interpret the data to see what effects the attributes have on appointment duration. We also investigate which regression models can be used to accurately model consultation time; for this, we introduce the use of (ensemble) regression trees and model trees. Different experiments are conducted to compare the regression models and determine the most suitable one. Based on these results, the model tree comes out as the most insightful model for making time predictions. Moreover, this research opens up the discussion on how much one should rely on models: it shows that human factors are important influences and thus raises the question to what extent models can be used for human decision making. Data cannot explain every event, and data-driven predictions may conflict with future policies.
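A model tree, unlike a plain regression tree, fits a linear model in each leaf rather than predicting a constant. The following is a minimal one-split sketch in plain Python (closed-form simple linear regression per leaf); real model-tree learners such as M5 select splits and prune far more carefully:

```python
# Illustrative sketch of a model tree with one split: each leaf holds
# a simple linear regression y = a + b*x instead of a constant.
# Toy data and the split point are made up for illustration.

def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

def model_tree(xs, ys, split):
    left = [(x, y) for x, y in zip(xs, ys) if x <= split]
    right = [(x, y) for x, y in zip(xs, ys) if x > split]
    a1, b1 = fit_line(*zip(*left))
    a2, b2 = fit_line(*zip(*right))
    def predict(x):
        a, b = (a1, b1) if x <= split else (a2, b2)
        return a + b * x
    return predict

# Duration grows at a different rate for short vs long appointment types.
xs = [1, 2, 3, 4, 10, 11, 12, 13]
ys = [5, 7, 9, 11, 40, 43, 46, 49]
predict = model_tree(xs, ys, split=5)
# Left leaf learns y = 3 + 2x, right leaf y = 10 + 3x.
```

Because each leaf exposes an interpretable linear formula, such trees tend to be more insightful than constant-leaf regression trees, which matches the thesis's conclusion.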
|BSc||2016||On the Description-driven Community Mining algorithm
A more powerful classifier was used to replace the original one in the Description-driven Community Mining (DCM) algorithm, and a set of comparison experiments was carried out to test whether the modified algorithm, MDCM, outperforms DCM. By analysing the results, we draw several conclusions. First, a more powerful classifier, implying that more homophily holds in the community, does not necessarily mean better structure in all cases. Second, the structure sacrificed for homophily is limited, but cannot be ignored. Finally, we discuss future work to further improve the performance of MDCM, as well as guidance on when and where to use it.
|BSc||2016||An Attempt at Detecting Patterns Among Children on the Playground using Attributed Graph Mining
In this thesis we apply an attributed graph miner to find patterns among children who play on the playground. These patterns may provide a deeper understanding of the impact that a child's social-emotional skills have on their social interactions. With a pattern mining approach we hope to find information that was not uncovered by the previously used statistical approach. As our pattern mining method we choose CoPaM, an attributed graph miner that returns connected vertices with cohesive attributes. First, we discuss the data pre-processing required to turn the dataset into a dynamic social network whose vertices are associated with features. After that we examine the functionality and output of CoPaM. Next we visualise the output, which gives interesting insight into the interactions of children and provides a graphic overview of the data. While analysing the output, we found that CoPaM was designed for a static rather than a dynamic attributed graph, which inflated the number of patterns found. To cope with this, during post-processing we focus on the frequency of each feature, the most prominent patterns, and the patterns with the most vertices or features. In conclusion, a child's capacity to calm down or be calmed down seemed to be the most prominent feature present in groups. Nevertheless, after visualising and analysing the output of CoPaM, there appear to be no strong patterns in the data that indicate a correlation between social-emotional skills and social development.
|BSc||2016||Application of Subgroup Discovery: Finding patterns using heart rate time series
Social anxiety is a disorder that can severely hamper the lives of individuals, yet it is still unclear when it develops. At the time of writing, a study is being conducted with the goal of finding out when social anxiety develops and how it can go astray. Understanding the relation between physiological data and heart rate behaviour could lead to new insights. In this thesis we use subgroup discovery to find out what could cause unusual behaviour in heart rate time series. Due to the high dimensionality of time series, we use the Exceptional Model Mining framework. To enable subgroup discovery, we first preprocess the data and extract relevant features. Then we develop a quality function that we use to compare generated subgroups. The quality function uses a statistical z-score derived from the Euclidean distance between subgroups. The quality measure has been implemented in Cortana, as it is open source and already contains an implementation of the Exceptional Model Mining framework. To test our quality function, we generated several synthetic datasets. The results show that the quality measure is able to find deviating heart rates and score them accordingly. Applying the quality measure to the SAND data shows that no patterns exist with a strongly deviating heart rate. The findings of the domain expert show that the biggest differences occur in the first minute, which is interesting for future research.
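The z-score idea behind such a quality function can be sketched as follows, under the simplifying assumption that a subgroup is scored by how far its mean heart rate deviates from the overall mean, measured in standard errors. The thesis derives its z-score from Euclidean distances between time series; this sketch only illustrates the underlying principle:

```python
import math

# Illustrative z-score quality measure: a subgroup scores higher the
# further its mean deviates from the overall mean, in standard errors.
# Simplified relative to the thesis, which works with Euclidean
# distances between heart-rate time series rather than scalar means.

def z_score_quality(subgroup_values, all_values):
    n = len(subgroup_values)
    mean_all = sum(all_values) / len(all_values)
    var_all = sum((v - mean_all) ** 2 for v in all_values) / len(all_values)
    mean_sg = sum(subgroup_values) / n
    # Standard error of a size-n sample mean under the overall distribution.
    se = math.sqrt(var_all / n)
    return abs(mean_sg - mean_all) / se

# Toy heart rates; the subgroup is hypothetical (e.g. "during a stressful
# task") and chosen to deviate clearly from the rest.
heart_rates = [60, 62, 61, 59, 60, 95, 97, 96, 61, 60]
subgroup = [95, 97, 96]

q = z_score_quality(subgroup, heart_rates)  # clearly deviating, so q is large
```

A subgroup discovery search would then rank candidate subgroups by this score and report only those whose deviation is statistically notable.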
|BSc||2016||Analysing flight recorder data: A data-driven safety analysis of mixed fleet flying
A major airline recently acquired a new airplane: the Boeing 787. To achieve more operational efficiency, this plane is flown by pilots already flying the Boeing 777. However, the airline wants to make sure that this practice does not influence flight safety. This is done by analysing landings, which led to the following research question: does mixed fleet flying of Boeing 777 and Boeing 787 airplanes influence landing performance on Boeing 777 airplanes? Previous research on machine learning and flight recorder data focused almost exclusively on detecting anomalies. We use machine learning techniques on Boeing 777 flight recorder data to determine whether there is a difference in performance between mixed fleet flying pilots and regular pilots, specifically in the landing phase of the flight. We use both features proposed by experts and automatically constructed features. Although our techniques were able to distinguish the two subtypes of Boeing 777 airplanes as a proof of concept, no substantial difference in pilot performance was found in this data set using the techniques presented in this research. These findings support the idea that mixed fleet flying of Boeing 787 and Boeing 777 airplanes does not impact pilot performance.