Explanatory Data Analysis group

Information Theoretic Data Mining '18-'19

Part of: MSc Computer Science, MSc Computer Science: Data Science
Matthijs van Leeuwen
Hugo Manuel Proença
  • 28.08.18 The Blackboard pages for this course are now available; please go there for further updates after enrolling.
  • 06.08.18 The schedule for the 2018-2019 edition of this course has been announced; attendance of all course meetings is mandatory. Note that this course has a strict limit on the number of students that can participate. Enroll by 1) registering for the course in uSis; and 2) e-mailing the lecturer to confirm your participation.

How can we gain insight from data? How can we discover and explain structure in data if we don't know what to expect? What is the optimal model for our data? How do we develop principled algorithms for exploratory data mining? To answer these questions, we study and discuss the state of the art in the relatively young research area of information theoretic data mining. We focus on theory, problems, and algorithms, not on implementation and experimentation.

Contents and schedule

The following provides an overview of the contents and schedule of the course; abbreviations for class types are explained below. Slides and literature will be made available on Blackboard.

| # | Date | Type | Topic | Mandatory | Optional |
|---|------|------|-------|-----------|----------|
| 1 | Thu 6 Sep | L | Introduction | [2](1.2-1.6) | [1](Ch1-6) |
| 2 | Thu 13 Sep | L | Kolmogorov complexity | [2](1.7-1.8,2-2.1.1,8.3-8.3.3,8.4) [4] | [5,6] |
| 3 | Thu 20 Sep | L | The Minimum Description Length principle | [3](Ch1-3,5) | |
| | Thu 27 Sep | | No class (self-study) | | |
| | Thu 4 Oct | | No class (self-study) | | |
| 4 | Thu 11 Oct | L | Pattern-based modelling | [7] | [8] |
| 5 | Thu 18 Oct | S | Coding for exploratory data analysis | [7,9,10] | [11-13] |
| | Thu 25 Oct | | No class (self-study) | | |
| 6 | Thu 1 Nov | S | Finding good models | [7,10,14] | [11,15] |
| 7 | Thu 8 Nov | L | The Maximum Entropy principle | [16] | [17,18] |
| | Thu 15 Nov | | No class (topic selection for presentations and essays, individual tutoring) | | |
| 8 | Thu 22 Nov | P | Presentations #1 | | |
| 9 | Thu 29 Nov | P | Presentations #2 | | |
| 10 | Thu 6 Dec | P | Presentations #3 | | |
| 11 | Thu 13 Dec | P | Presentations #4 | | |
| 12 | Thu 20 Dec | P | Presentations #5 | | |
| | Fri 4 Jan 2019 | | Essay submission deadline | | |

Class types explained

L: Lecture; just sit back, pay attention, and ask questions when needed or useful; you can read the literature afterwards.
S: Seminar; your active contribution is expected, so prepare by reading the mandatory literature in advance.
P: Student presentations; your active contribution is expected, but there is no need to prepare (unless you present, of course).

Attendance of all twelve course meetings is mandatory. The final mark is composed of participation in discussions (15%); presentation (including Q&A) (35%); and essay (50%).
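
The weighting above can be illustrated with a minimal sketch. The three weights come from the course description; the assumptions that all components are marked on the same scale and combined as a plain weighted sum, as well as the names `WEIGHTS` and `final_mark` and the example marks, are purely illustrative and not official course policy.

```python
# Weights as stated in the course description.
# Assumption: components share one grading scale and are combined linearly.
WEIGHTS = {"participation": 0.15, "presentation": 0.35, "essay": 0.50}


def final_mark(marks):
    """Return the weighted sum of the component marks."""
    missing = set(WEIGHTS) - set(marks)
    if missing:
        raise ValueError(f"missing component marks: {sorted(missing)}")
    return sum(WEIGHTS[name] * marks[name] for name in WEIGHTS)


# Made-up marks, for illustration only.
example = {"participation": 8.0, "presentation": 7.0, "essay": 6.5}
print(round(final_mark(example), 2))
```

Because the essay carries half of the weight, a one-point change in the essay mark moves the result by 0.5, compared with 0.15 for participation.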

  1. Wasserman, L. All of Statistics - A Concise Course in Statistics, Springer, 2004.
  2. Li, M & Vitányi, PMB. An Introduction to Kolmogorov Complexity and Its Applications (3rd ed), Springer, 2008.
  3. Grünwald, P. The Minimum Description Length Principle, Springer, 2007.
  4. Campana, BJL & Keogh, EJ. A Compression-Based Distance Measure for Texture. In: Proceedings of SIAM Data Mining (SDM'10), 2010.
  5. Faloutsos, C & Megalooikonomou, V. On data mining, compression, and Kolmogorov complexity. In: Data Min. Knowl. Discov. 15(1):3-20, 2007.
  6. Cilibrasi, R & Vitányi, PMB. Clustering by Compression. In: IEEE Transactions on Information Theory 51(4):1523-1545, 2005.
  7. Vreeken, J, van Leeuwen, M & Siebes, A. Krimp: Mining Itemsets that Compress. In: Data Min. Knowl. Discov. 23(1):169-214, 2011.
  8. van Leeuwen, M & Vreeken, J. Mining and Using Sets of Patterns through Compression. In: CC Aggarwal & J Han, eds, Frequent Pattern Mining, Springer, 2014.
  9. Budhathoki, K & Vreeken, J. The Difference and the Norm – Characterising Similarities and Differences between Databases. In: Proceedings of European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), pp 206-223, 2015.
  10. van Leeuwen, M & Galbrun, E. Association Discovery in Two-View Data. In: IEEE Transactions on Knowledge and Data Engineering 27(12):3190-3202, 2015.
  11. Tatti, N & Vreeken, J. The Long and the Short of It: Summarising Event Sequences with Serial Episodes. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pp 462-470, 2012.
  12. Koutra, D, Kang, U, Vreeken, J & Faloutsos, C. VoG: Summarizing and Understanding Large Graphs. In: Proceedings of the SIAM International Conference on Data Mining (SDM), pp 91-99, 2014.
  13. Tatti, N & Vreeken, J. Finding Good Itemsets by Packing Data. In: Proceedings of the IEEE International Conference on Data Mining (ICDM), pp 588-597, 2008.
  14. Smets, K & Vreeken, J. SLIM: Directly Mining Descriptive Patterns. In: Proceedings of the SIAM International Conference on Data Mining (SDM), pp 236-247, 2012.
  15. Siebes, A & Kersten, R. A Structure Function for Transaction Data. In: Proceedings of the SIAM International Conference on Data Mining (SDM), pp 558-569, 2011.
  16. De Bie, T. An Information Theoretic Framework for Data Mining. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pp 564-572, 2011.
  17. De Bie, T. Maximum entropy models and subjective interestingness: an application to tiles in binary databases. In: Data Min. Knowl. Discov. 23(3):407-446, 2011.
  18. van Leeuwen, M, De Bie, T, Spyropoulou, E & Mesnage, C. Subjective Interestingness of Subgraph Patterns. In: Machine Learning 105(1):41-75, 2016.