Data Mining A.A. 2018/19
DM 1: Foundations of Data Mining (6 CFU)
Instructors:
- Dino Pedreschi
- KDD Laboratory, Università di Pisa and ISTI - CNR, Pisa
Teaching assistant:
- Riccardo Guidotti
- KDD Laboratory, Università di Pisa and ISTI - CNR, Pisa
DM 2: Advanced topics on Data Mining and case studies (6 CFU)
Instructors:
- Mirco Nanni, Dino Pedreschi
- KDD Laboratory, Università di Pisa and ISTI - CNR, Pisa
DM: Data Mining (9 CFU)
Instructors:
- Dino Pedreschi, Anna Monreale
- KDD Laboratory, Università di Pisa and ISTI - CNR, Pisa
Teaching assistant:
- Riccardo Guidotti
- KDD Laboratory, Università di Pisa and ISTI - CNR, Pisa
News
- Results of DM2 exam of 10.07.2019 are out: Results
- Results of DM2 exam of 19.06.2019 are out: Results
- Results of DM2 2nd mid-term exam are out: Results
- Due to a strike, the lesson of May 31st, 2019 will not take place. In the course calendar you can find some additional material that might be useful to you, especially for the project.
There are no other classes scheduled, so the course is now officially closed. Thanks to all my students for attending, and good luck with the exam.
- The project for DM2 is out!
- Results of DM2 mid-term exam are out: Results
- Last exam session on Feb 14. Please register your name here: https://doodle.com/poll/6dgc5du4fgpnbyyx
- Results of the written exam of February: dm_evaluation_1819_-_appello-feb.pdf
- Results of the written exam of January: dm_evaluation_1819-jan-session.pdf
- Dates for exam registration: (a) Jan 21: slots 14 - 15, 16 - 17; (b) Jan 22: slot 10 - 11; (c) Jan 23: slot 09 - 10. Location: Monreale's office.
- I set up 3 days for the oral exam: 25, 28, 29 January. Other dates will be available after the written exam of February. To book your oral exam, please use the doodle, indicating your surname and name: https://doodle.com/poll/3wunys9yd8s9q8ay
- Final results including project evaluation are available here: dm_evaluation_1819.pdf. If you do not find your evaluation, please write an email to Anna Monreale.
- New project is available!
- Results of the second mid-term test. After evaluating the project, we will propose an average grade based on the first and second mid-term tests and the project.
- Get clusters from a scipy dendrogram: https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.fcluster.html#scipy.cluster.hierarchy.fcluster
- Help for installing the PyFIM library: https://anaconda.org/conda-forge/pyfim, https://pypi.org/project/fim/, http://www.borgelt.net/pyfim.html
- Results of the first mid-term test.
- Students need to decide the group composition for the project and fill in this spreadsheet by October 1, 2018. A heterogeneous composition with respect to the master's degree is strongly recommended. Each group can have 3 or 4 members.
Learning goals
… a new kind of professional has emerged, the data scientist, who combines the skills of software programmer, statistician and storyteller/artist to extract the nuggets of gold hidden under mountains of data. Hal Varian, Google’s chief economist, predicts that the job of statistician will become the “sexiest” around. Data, he explains, are widely available; what is scarce is the ability to extract wisdom from them.
Data, data everywhere. The Economist, Special Report on Big Data, Feb. 2010.
The wide availability of data from relational databases, the web and other sources motivates the study of data analysis techniques that enable a better understanding and easier use of the results in decision-making processes. The goal of the course is to provide an introduction to the basic concepts of the knowledge extraction process, the main data mining techniques and the related algorithms. Particular emphasis is given to methodological aspects, presented through several classes of paradigmatic applications such as Market Basket Analysis, market segmentation and fraud detection. Finally, the course introduces the privacy and ethical aspects of applying inference techniques to data, which the analyst must be aware of. The course consists of the following parts:
- the basic concepts of the knowledge extraction process: data understanding and preparation, types of data, measures and similarity of data;
- the main data mining techniques (association rules, classification and clustering), studied in both their formal and implementation aspects;
- some case studies in marketing and customer management support, fraud detection and epidemiological studies;
- the last part of the course introduces the privacy and ethical aspects of applying inference techniques to data, which the analyst must be aware of.
Reading about the "data scientist" job
- Data, data everywhere. The Economist, Feb. 2010 download
- Data scientist: The hot new gig in tech, CNN & Fortune, Sept. 2011 link
- Welcome to the yotta world. The Economist, Sept. 2011 download
- Data Scientist: The Sexiest Job of the 21st Century. Harvard Business Review, Sept 2012 link
- Il futuro è già scritto in Big Data. Il Sole 24 Ore, Sept 2012 link
- Special issue of Crossroads - The ACM Magazine for Students - on Big Data Analytics download
- Peter Sondergaard, Gartner, Says Big Data Creates Big Jobs. Oct 22, 2012: YouTube video
- Towards Effective Decision-Making Through Data Visualization: Six World-Class Enterprises Show The Way. White paper at FusionCharts.com. download
Hours and Rooms
DM1 & DM
Classes
Day of Week | Hour | Room |
---|---|---|
Monday | 14:00 - 16:00 | Room C1 |
Wednesday | 14:00 - 16:00 | Room C1 |
Friday | 11:00 - 13:00 | Room C1 |
Office hours:
- Prof. Pedreschi: Monday 14:00 - 16:00, Department of Computer Science
- Prof. Monreale: by appointment, Room 374/DO, Dept. of Computer Science.
- Dr. Guidotti: class-appointment (see calendar)
DM 2
Classes
Day of week | Hour | Room |
---|---|---|
Thursday | 14 - 16 | A1 |
Friday | 16 - 18 | C1 |
Office hours:
- Nanni: by appointment via email, c/o ISTI-CNR
Learning Material
Textbook
- Pang-Ning Tan, Michael Steinbach, Vipin Kumar. Introduction to Data Mining. Addison Wesley, ISBN 0-321-32136-7, 2006
- Chapters 4, 6 and 8 are also available on the publisher's website.
- Berthold, M.R., Borgelt, C., Höppner, F., Klawonn, F. Guide to Intelligent Data Analysis. Springer Verlag, 1st Edition, 2010. ISBN 978-1-84882-259-7
- Laura Igual et al. Introduction to Data Science: A Python Approach to Concepts, Techniques and Applications. 1st ed. 2017 Edition.
- Jake VanderPlas. Python Data Science Handbook: Essential Tools for Working with Data. 1st Edition.
Slides of the classes
- The slides used in the course will be inserted in the calendar after each class. Most of them are taken from the slides provided by the textbook's authors: Slides for "Introduction to Data Mining".
Past Exams
- Some texts of past exams on DM1 (6 CFU):
- Some solutions of past exams containing exercises on KNN and Naive Bayes classifiers, DM1 (9 CFU):
- Some exercises (partially with solutions) on sequential patterns and time series can be found in the following exam texts from recent years:
- Some very old exercises (some of them with solutions) are available here, most of them in Italian, and not all of them on topics covered in this year's program:
Data mining software
- KNIME The Konstanz Information Miner. Download page
- Python - Anaconda (2.7 version!!!): Anaconda is the leading open data science platform powered by Python. Download page (the following libraries are already included)
- Scikit-learn: Python library with tools for data mining and data analysis. Documentation page
- Pandas: an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. Documentation page
- WEKA: Data Mining Software in Java. University of Waikato, New Zealand. Download page
Class calendar (2018/2019)
First part of course, first semester (DM1 - Data mining: foundations & DM - Data Mining)
| No. | Day | Room | Topic | Learning material | Instructor |
|---|---|---|---|---|---|
| 1. | 19.09 14:00-16:00 | C1 | Overview. Introduction. | 1.2018-dm-overview.pdf | Pedreschi |
| 2. | 20.09 16:00-18:00 | C1 | Introduction | | Pedreschi |
| | 21.09 11:00-13:00 | C1 | Lecture canceled | | Pedreschi |
| 3. | 24.09 14:00-16:00 | C1 | KDD Process & Applications. Data Understanding. | DM + Applications DU | Monreale |
| 4. | 26.09 14:00-16:00 | C1 | Data Understanding. Data Preparation | | Monreale |
| 5. | 28.09 11:00-13:00 | C1 | Introduction to Python, Knime | intro_knime intro_python | Monreale/Guidotti |
| 6. | 01.10 14:00-16:00 | C1 | Data Preparation | Data Preparation | Monreale |
| 7. | 03.10 14:00-16:00 | C1 | Clustering: Introduction and Centroid-based clustering | 4.basic_cluster_analysis-intro-kmeans.pdf | Monreale |
| | 05.10 11:00-13:00 | C1 | Lecture canceled | | |
| 8. | 08.10 14:00-16:00 | C1 | Knime - Python: Data Understanding | du_knime du_python | Guidotti |
| 9. | 10.10 14:00-16:00 | C1 | Clustering: K-means & Hierarchical | 5.basic_cluster_analysis-hierarchical.pdf | Pedreschi |
| | 12.10 11:00-13:00 | C1 | Lecture canceled for IF | | |
| 10. | 15.10 14:00-16:00 | C1 | Clustering: DBSCAN | 6.basic_cluster_analysis-dbscan-validity.pdf | Pedreschi |
| 11. | 17.10 14:00-16:00 | C1 | Clustering: Validity | | Pedreschi |
| 12. | 19.10 11:00-13:00 | C1 | Discussion on Projects - DU | | Guidotti |
| 13. | 22.10 14:00-16:00 | C1 | Exercises for mid-term test | Tool for Dm ex: Didactic Data Mining Ex. Clustering PDF Ex. Clustering PPTX | Monreale |
| 14. | 24.10 14:00-16:00 | C1 | Knime - Python: Clustering | clustering_knime clustering_python | Guidotti |
| 15. | 26.10 11:00-13:00 | C1 | Exercises for mid-term test | Ex. Clustering PPTX - complete Ex. Clustering PDF - complete Exercises DU ex-silhouette.pdf | Monreale |
| 16. | 05.11 14:00-16:00 | C1 | Classification/1 | 7.chap3_basic_classification.ppt | Monreale |
| 17. | 07.11 14:00-16:00 | C1 | Classification/2 | | Monreale |
| | 09.11 11:00-13:00 | C1 | Lecture canceled | | |
| 18. | 12.11 14:00-16:00 | C1 | LAB: Classification | knime_classification python_classification | Guidotti |
| 19. | 14.11 14:00-16:00 | C1 | Pattern Mining | Explanation of classification/ML models Pattern mining Intro Apriori Algorithm for Pattern/AR Mining | Pedreschi |
| 20. | 16.11 11:00-13:00 | C1 | Pattern Mining | | Pedreschi |
| 21. | 19.11 14:00-16:00 | C1 | Exercises for the mid-term | ex-second-midterm.pdf | Monreale |
| 22. | 21.11 14:00-16:00 | C1 | Lab Pattern Mining + Discussion Clustering | knime_pattern python_pattern https://anaconda.org/conda-forge/pyfim, https://pypi.org/project/fim/, http://www.borgelt.net/pyfim.html | Guidotti/Pedreschi |
| The next lectures are dedicated to the 9-credit DM course | | | | | |
| 23. | 23.11 11:00-13:00 | C1 | Alternative methods for Pattern Mining. Privacy in DM | fp-growth.pdf | Monreale |
| 24. | 26.11 14:00-16:00 | C1 | Alternative methods for Clustering. Privacy in DM | 1-alternative-clustering.pdf | Monreale |
| 25. | 28.11 14:00-16:00 | C1 | Privacy in DM. Transactional Clustering | 2-transactionalclustering.pdf privacydt.pdf Papers on Clustering | Monreale |
| 26. | 30.11 11:00-13:00 | C1 | Alternative methods for classification/1 | K-Nearest Neighbors & Naive Bayes | Pedreschi |
| 27. | 03.12 14:00-16:00 | C1 | Alternative methods for classification/2 | Ensemble methods Wisdom of the crowd & Ensemble methods Galton's Vox Populi | Pedreschi |
| 28. | 05.12 14:00-16:00 | C1 | Alternative methods for classification/3 | | Pedreschi |
| 29. | 07.12 11:00-13:00 | C1 | Exercises on clustering and classification | CLOPE K-mode KNN & NB | Monreale |
| 30. | 10.12 14:00-16:00 | C1 | Exercises on Second part - all students | | Monreale |
| 31. | 12.12 14:00-16:00 | C1 | Final Discussion on Project - all students | | Pedreschi/Guidotti |
| 32. | 14.12 11:00-13:00 | C1 | Lecture canceled | | |
Second part of course, second semester (DMA - Data mining: advanced topics and case studies)
| No. | Day | Room | Topic | Learning material | Instructor (default: Nanni) |
|---|---|---|---|---|---|
| 1. | 21.02.2019 14:00-16:00 | A1 | Introduction + Sequential patterns/1 | Introduction, Sequential patterns | |
| 2. | 22.02.2019 16:00-18:00 | C1 | Sequential patterns/2 | | |
| 3. | 01.03.2019 16:00-18:00 | C1 | Sequential patterns/3 | Sample exercises (fixed) | |
| 4. | 07.03.2019 14:00-16:00 | A1 | Sequential patterns/4 | Sequential pattern tools: Link to SPMF + Sample datasets, Python2 GSP educational implementation (source), PrefixSpan-py (requires Python3) | |
| 5. | 08.03.2019 16:00-18:00 | C1 | Time series/1 | Time series | |
| 6. | 14.03.2019 14:00-16:00 | A1 | Time series/2 | Overview on DM for time series, DTW paper by Sakoe and Chiba, 1978 | |
| 7. | 15.03.2019 16:00-18:00 | C1 | Time series/3 | | |
| 8. | 21.03.2019 14:00-16:00 | A1 | Time series/4 | Preprocessing in Python DTW in Python | |
| 9. | 22.03.2019 16:00-18:00 | C1 | Time series/5 | | |
| 10. | 28.03.2019 14:00-16:00 | A1 | Exercises for mid-term exam | Exercises from past exams | |
| 11. | 29.03.2019 16:00-18:00 | C1 | Exercises for mid-term exam | Exercises from past exams (with some solutions) | |
| | 04.04.2019 16:00-18:00 | A1 + E | mid-term exam | | |
| 11. | 11.04.2019 14:00-16:00 | A1 | Classification: alternative methods/1 | kNN and Bayes classifier | |
| 12. | 12.04.2019 16:00-18:00 | C1 | Classification: alternative methods/2 | NN and SVM, Exercises | |
| | | | Lecture canceled | | |
| 13. | 03.05.2019 16:00-18:00 | C1 | Classification: alternative methods/3 | | |
| 14. | 09.05.2019 14:00-16:00 | A1 | Classification: alternative methods/4 | Ex. on NNs and SVM, Ex. on KNN and Naive Bayes | |
| 15. | 10.05.2019 16:00-18:00 | C1 | Classification: Model Evaluation | Model performances | |
| 16. | 16.05.2019 14:00-16:00 | A1 | Classification: Model Evaluation | Unbalanced data, Classification weights | |
| 17. | 17.05.2019 16:00-18:00 | C1 | Classification: alternative methods/5 | Ensembles, Homeworks! | |
| 18. | 23.05.2019 14:00-16:00 | A1 | Exercises + Outlier detection/1 | Ex. on Lift chart, Ex. on Ensembles, Outlier detection | |
| 19. | 24.05.2019 16:00-18:00 | C1 | Outlier detection/2 | Ex. on outliers, Ex. from past exams | |
| | 31.05.2019 | | Due to a strike, the lesson will not take place. | For your convenience, here is some material you can use: Examples of classification and validation in Python, Examples of outlier detection in Python, CRISP-DM guidelines. Feel free to contact me if you need clarifications. Remark: the CRISP-DM model will not be part of the exam program. | |
| | 06.06.2019 16:00-18:00 | E (+A1) | mid-term exam | 2nd mid-term of last year and its solutions (careful: they were not double-checked). | |
Exams
Exam DM part I (DMF)
The exam is composed of three parts:
- A written exam, with exercises and questions about methods and algorithms presented during the classes. It can be replaced by the first and second mid-term tests of November and December.
- An oral exam (optional), which includes: (1) discussing the project report with a group presentation; (2) discussing topics presented during the classes, including the theory of the parts already covered by the written exam. It is optional for students who pass the written part through the mid-term tests only.
- A project, consisting of exercises that require the use of data mining tools for data analysis. Exercises include: data understanding, clustering analysis, frequent pattern mining, and classification. The project has to be carried out by groups of min 3, max 4 people, using Knime, Python or a combination of them. The results of the different tasks must be reported in a single paper. The total length of this paper must be max 20 pages of text including figures. The paper must be emailed to datamining [dot] unipi [at] gmail [dot] com. Please use “[DM 2018-2019] Project 2” in the subject. Students who decide to do the project during the summer exam sessions will find the dataset of the project online after 31/05/2019. In this case the project must be delivered at least 2 days before the oral exam.
Tasks of the project:
- Data Understanding (Collective discussion on: 19/10/2018): Explore the dataset with the analytical tools studied and write a concise “data understanding” report describing data semantics, assessing data quality, the distribution of the variables and the pairwise correlations. (see Guidelines for details)
- Clustering analysis (Collective discussion on: 21/11/2018): Explore the dataset using various clustering techniques. Carefully describe your decisions for each algorithm and the advantages provided by the different approaches. (see Guidelines for details)
- Classification (Collective discussion on: 12/12/2018): Explore the dataset using classification trees and random forest. Use them to predict the target variable. (see Guidelines for details)
- Association Rules (Collective discussion on: 12/12/2018): Explore the dataset using frequent pattern mining and association rules extraction. Then use them to predict a variable, either to replace missing values or to predict the target variable. (see Guidelines for details)
- Project 1
- Dataset: Credit Card Default
- Assigned: 01/10/2018
- Deadline: 05/01/2019, 09/01/2019
- Project 2
- Dataset: Telco Customer Churn
- Assigned for the summer session.
- Deadline: 2 days before the oral exam.
Guidelines for the project are here.
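For the Association Rules task listed above, the two quantities that drive rule extraction are support and confidence. The following is a minimal pure-Python sketch of both, under stated assumptions: the transactions and the item names (`late_payment=yes`, `limit=low`, `default=yes`) are made-up toy values, not taken from the project dataset, and the rule enumeration is a brute-force illustration rather than Apriori or the PyFIM implementation suggested for the labs.

```python
from itertools import combinations

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def association_rules(transactions, min_support, min_confidence):
    """Brute-force enumeration of rules X -> y with a single-item consequent.

    Only meant for tiny examples: it tests every itemset of size >= 2.
    """
    items = sorted(set().union(*transactions))
    rules = []
    for size in range(2, len(items) + 1):
        for candidate in combinations(items, size):
            sup = support(candidate, transactions)
            if sup == 0 or sup < min_support:
                continue
            for consequent in candidate:
                antecedent = frozenset(candidate) - {consequent}
                # confidence(X -> y) = support(X and y) / support(X)
                conf = sup / support(antecedent, transactions)
                if conf >= min_confidence:
                    rules.append((tuple(sorted(antecedent)), consequent, sup, conf))
    return rules

# Hypothetical toy transactions: each record is a set of "attribute=value" items.
transactions = [
    frozenset({"late_payment=yes", "limit=low", "default=yes"}),
    frozenset({"late_payment=yes", "limit=low", "default=yes"}),
    frozenset({"late_payment=no", "limit=high", "default=no"}),
    frozenset({"late_payment=yes", "limit=high", "default=no"}),
    frozenset({"late_payment=no", "limit=low", "default=no"}),
]

rules = association_rules(transactions, min_support=0.4, min_confidence=0.9)
for antecedent, consequent, sup, conf in rules:
    print(antecedent, "->", consequent, f"support={sup:.2f} confidence={conf:.2f}")
```

A rule whose consequent is a value of the target variable can then be used as a simple predictor: a record matching the antecedent is assigned the consequent. On a real dataset a dedicated frequent pattern miner is the sensible choice, since brute-force enumeration grows exponentially with the number of items.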
Exam DM part II (DMA)
The exam is composed of four parts:
- A written exam, with exercises and questions about methods and algorithms presented during the classes. It can be replaced by the first and second mid-term tests of April and June.
- A small online test for the data ethics part. The test can be taken at the following link: Link to "First Aid for Data Scientist" web site (pwd: datamining_2018). Register and enroll in the “First Aid for Data Scientist” course. Take the quizzes of the 3 units. Then download your certificate and send it to mirco [dot] nanni [at] isti [dot] cnr [dot] it before the oral exam.
- An oral exam, which includes: (1) discussing the project report with a group presentation; (2) discussing topics presented during the classes, including the theory of the parts already covered by the written exam.
- A project, consisting of exercises that require the use of data mining tools for data analysis. Exercises include: sequential patterns, time series, classification (alternative methods and validation), outlier detection. The project has to be carried out by groups of max 3 people, using Knime, Python, other software or a combination of them. The results of the different tasks must be reported in a single paper. The total length of this paper must be max 20 pages of text including figures. The project must be delivered at least 2 days before the oral exam.
- Dataset: the data is a time series dataset on air quality, which can be downloaded here: Dataset.
- Task 1: Time series: Consider only attribute “PT08.S1(CO)” and split the corresponding time series into daily series, deleting those with too many missing values (value = -200) and fixing the others in some way. Also make sure that all time series have 24 values. Compute a clustering (with an algorithm of your choice) based on DTW and on Euclidean distance, and compare the results.
- Task 2: Sequential patterns: discover contiguous sequential patterns of length at least 4. Before that, the time series should be discretized in some way.
- Task 3: Classification methods: define a target variable “WE” for the time series dataset, set to “true” for weekend days and “false” for the others. Test the K-NN classification method using DTW as the distance measure, and at least one other classification method using the 24 values as separate variables.
- Task 4: Outlier detection: from the original dataset (i.e. the raw records with all attributes, not the time series built only on the “PT08.S1(CO)” attribute), identify the top 1% outliers. Adopt at least two different methods belonging to different families (e.g. model-based, distance-based, density-based, angle-based, …) to identify the 1% of input records with the highest likelihood of being outliers, and compare the results. Before doing the analysis, the records containing missing values should be deleted to avoid trivial results.
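Tasks 1 and 3 above both compare DTW with Euclidean distance. The sketch below is one possible pure-Python illustration, under stated assumptions: it uses the unconstrained dynamic-programming form of DTW with a squared point cost (no Sakoe-Chiba window), the three series are made-up 8-point toys rather than 24-value daily series from the dataset, and the labels `"peak"` and `"flat"` are hypothetical stand-ins for the “WE” target.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two series of equal length."""
    if len(a) != len(b):
        raise ValueError("series must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw(a, b):
    """Dynamic Time Warping distance (squared point cost, no window)."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of the first i points of a
    # with the first j points of b
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i][j] = d + min(cost[i - 1][j],      # repeat b[j-1]
                                 cost[i][j - 1],      # repeat a[i-1]
                                 cost[i - 1][j - 1])  # advance both
    return math.sqrt(cost[n][m])

def knn_predict(train, query, distance, k=1):
    """Majority label among the k training series closest to `query`."""
    neighbours = sorted(train, key=lambda pair: distance(pair[0], query))[:k]
    labels = [label for _, label in neighbours]
    return max(set(labels), key=labels.count)

# Toy series: the same bump at two different times, and a constant series.
peak = [0, 0, 1, 3, 1, 0, 0, 0]
shifted = [0, 0, 0, 0, 1, 3, 1, 0]
flat = [1, 1, 1, 1, 1, 1, 1, 1]

train = [(shifted, "peak"), (flat, "flat")]
print("DTW       ->", knn_predict(train, peak, dtw))
print("Euclidean ->", knn_predict(train, peak, euclidean))
```

On this toy data DTW aligns the two bumps and treats `peak` and `shifted` as identical, while Euclidean distance penalizes the time shift and finds `flat` closer. The same distance functions can feed a clustering algorithm that accepts a precomputed distance matrix. The quadratic cost of plain DTW is negligible for 24-value series.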
Exam sessions
Mid-term exams
Exam | Date | Hour | Place | Notes | Marks |
---|---|---|---|---|---|
DM1: First Mid-term 2018 | 30.10.2018 | 11-13 | Room C1, L1, N1 | Please, use the system for registration: https://esami.unipi.it/ | Results |
DM1: Second Mid-term 2018 | 18.12.2018 | 11-13 | Room C1, L1, N1 | Please, use the system for registration: https://esami.unipi.it/ | |
DM2: First Mid-term 2019 | 04.04.2019 | 16-18 | Room A1, E | Please, use the system for registration: https://esami.unipi.it/ Text + Solutions | Results |
DM2: Second Mid-term 2019 | 06.06.2019 | 16-18 | Room E (+ A1 if needed) | Please, use the system for registration: https://esami.unipi.it/ Text | Results |
Regular exam sessions
Session | Date | Time | Room | Notes | Marks |
---|---|---|---|---|---|
1. | 16.01.2019 | 14:00 - 18:00 | Room E | | |
2. | 06.02.2019 | 14:00 - 18:00 | Room E | | |
3. | 19.06.2019 | 09:00 - 13:00 | Room A1 | Oral exam on DM1 by 15 July. If you cannot make that date, you can take the oral exam in September. | Results |
4. | 10.07.2019 | 09:00 - 13:00 | Room A1 | Oral exam on DM1 by 15 July. If you cannot make that date, you can take the oral exam in September. | Results |
Extra sessions A.A. 2017/18
Date | Time | Room | Notes | Results |
---|---|---|---|---|