Data Mining, Academic Year 2016/17
DM 1: Foundations of Data Mining
Instructors:
- Dino Pedreschi, Anna Monreale
- KDD Laboratory, Università di Pisa and ISTI - CNR, Pisa
Teaching assistant:
- Riccardo Guidotti
- KDD Laboratory, Università di Pisa and ISTI - CNR, Pisa
DM 2: Advanced topics on Data Mining and case studies
Instructors:
- Mirco Nanni, Dino Pedreschi
- KDD Laboratory, Università di Pisa and ISTI - CNR, Pisa
News
- The results of the written exam for DM2 held on September 6th, 2017 are out! Link: Results DM2 6.9.2017. The students interested in having the oral exam in this session are requested to contact the instructor at mirco [dot] nanni [at] isti [dot] cnr [dot] it. We remind you that the project report must be submitted at least 2 days before the oral exam.
- The results of the written exam for DM2 held on July 4th, 2017 are out! Link: Results DM2 4.7.2017. The students interested in having the oral exam in this session are requested to contact the instructor at mirco [dot] nanni [at] isti [dot] cnr [dot] it. We remind you that the project report must be submitted at least 2 days before the oral exam.
- The results of the written exam for DM2 held on June 13th, 2017 are out! Link: Results 13.6.2017. The students interested in having the oral exam in this session are requested to contact the instructor at mirco [dot] nanni [at] isti [dot] cnr [dot] it. We remind you that the project report must be submitted at least 2 days before the oral exam.
- The results of the 2nd mid-term held on June 6th, 2017 are out! Link: Results 6.6.2017.
- Reminder: 2nd mid-term exam is on 6.6.2017 at 11.00 a.m. in rooms A and B.
- All the projects for DM2 are now available! Check the exam section.
- The results of the mid-term held on April 7th, 2017 are out! Link: Results 7.4.2017.
- The lecture of May 5, 2017 is cancelled due to the review of a European project in which the instructors are involved.
- Erratum: Date for mid-term exam is out: April 7th, 2017 at 11:00 (instead of 9:00) in Rooms A1 + C1.
- New dates for DM1 oral exams: 2017/02/21 15:00 (3 seats available); 2017/02/23 09:30 (6 seats available) office of Prof. Pedreschi. Please, send an email to BOTH me and Riccardo to book the oral exam.
- Results DM1 Written Exam 2017/02/08: Results DM1
- Oral Exam Calendar 2017/02/13 14.00 (seats available), 2017/02/15 14.00 (seats available) office of Prof. Pedreschi. If you need to take the oral exam before 2017/02/17 but these dates do not fit your timetable, please contact us as soon as possible. The next oral exam will be in June.
- Results DM1 Written Exam 2017/01/19: Results DM1. For students who want to take the oral exam, the first available date is 2017/01/30 at 10.00 in the office of Prof. Pedreschi. Please write an email to anna [dot] monreale [at] unipi [dot] it if you want to take the exam on Monday. Other dates will be scheduled and published on Monday.
- Oral Exam Calendar 2017/01/23 14.00 (seats available), 2017/01/24 14.00 (completed), 2017/01/30 10.00 (seats available) office of Prof. Pedreschi.
- 2017/01/23: Shtjefni, Cei; 2017/01/24: Inversi, Savasta, Semeraro, Bonfanti, Tanga, Briganti, Di Sarli, Pioli
- A new project is now available! (see Exam section for details) We recommend following the guidelines.
- Results of the second mid-test of DM1: Results-21Dec2016. To open the file you need a password that I will send you by email. If you did not receive the email you can request it by email to: anna [dot] monreale [at] unipi [dot] it. RULES: 1) Students having an AVG Mark in the file may take the oral exam. 2) Students having a mark >= 18 in only one of the tests can take the written exam for the part without a sufficient mark. 3) For the oral exam, students must come to the written exam and decide the dates of the oral exam together with the teachers. It is also possible to take the oral exam during the written exam.
- Project deadline extension 23.59 of 10/01/2016
- Results of the first mid-test of DM1: Results-04Nov2016. To open the file you need a password that you can request by email to: anna [dot] monreale [at] unipi [dot] it
- To be included in the course mailing list for urgent communications, please send as soon as possible a mail to anna [dot] monreale [at] unipi [dot] it with the following data: subject= “DM1” and text: name and surname
- Datasets for exercises - Iris: http://archive.ics.uci.edu/ml/datasets/Iris, Titanic: https://www.kaggle.com/c/titanic/data, Adult: https://archive.ics.uci.edu/ml/datasets/Adult
- The first project is now available! Details in the Exam Section.
Learning goals
… a new kind of professional has emerged, the data scientist, who combines the skills of software programmer, statistician and storyteller/artist to extract the nuggets of gold hidden under mountains of data. Hal Varian, Google’s chief economist, predicts that the job of statistician will become the “sexiest” around. Data, he explains, are widely available; what is scarce is the ability to extract wisdom from them.
Data, data everywhere. The Economist, Special Report on Big Data, Feb. 2010.
The wide availability of data from relational databases, the web, and other sources motivates the study of data analysis techniques that enable a better understanding and easier use of the results in decision-making processes. The goal of the course is to provide an introduction to the basic concepts of the knowledge extraction process, to the main data mining techniques, and to the related algorithms. Particular emphasis is placed on methodological aspects, presented through several classes of paradigmatic applications such as Market Basket Analysis, market segmentation, and fraud detection. Finally, the course introduces the privacy and ethical aspects inherent in the use of inference techniques on data, of which the analyst must be aware. The course consists of the following parts:
- the basic concepts of the knowledge extraction process: data understanding and preparation, forms of data, data measures and similarity;
- the main data mining techniques (association rules, classification, and clustering), studied in both their formal and implementation aspects;
- some case studies in marketing and customer management support, fraud detection, and epidemiological studies;
- the last part of the course introduces the privacy and ethical aspects inherent in the use of inference techniques on data, of which the analyst must be aware.
Reading about the "data scientist" job
- Data, data everywhere. The Economist, Feb. 2010 download
- Data scientist: The hot new gig in tech, CNN & Fortune, Sept. 2011 link
- Welcome to the yotta world. The Economist, Sept. 2011 download
- Data Scientist: The Sexiest Job of the 21st Century. Harvard Business Review, Sept 2012 link
- Il futuro è già scritto in Big Data. Il Sole 24 Ore, Sept 2012 link
- Special issue of Crossroads - The ACM Magazine for Students - on Big Data Analytics download
- Peter Sondergaard, Gartner, Says Big Data Creates Big Jobs. Oct 22, 2012: YouTube video
- Towards Effective Decision-Making Through Data Visualization: Six World-Class Enterprises Show The Way. White paper at FusionCharts.com. download
Hours and Rooms
DM 1
Classes
Day of Week | Hour | Room |
---|---|---|
Monday | 11:00 - 13:00 | Room C |
Friday | 14:00 - 16:00 | Room A1 |
Office hours:
- Prof. Pedreschi/Monreale: Monday 14:00 - 16:00, Dipartimento di Informatica
DM 2
Classes
Day of week | Hour | Room |
---|---|---|
Tuesday | 16:00 - 18:00 | B |
Friday | 16:00 - 18:00 | B |
Office hours:
- Nanni: by appointment via email, c/o ISTI-CNR
Learning Material
Textbook
- Pang-Ning Tan, Michael Steinbach, Vipin Kumar. Introduction to Data Mining. Addison Wesley, ISBN 0-321-32136-7, 2006
- Chapters 4, 6 and 8 are also available at the publisher's Web site.
- Berthold, M.R., Borgelt, C., Höppner, F., Klawonn, F. Guide to Intelligent Data Analysis. Springer Verlag, 1st Edition, 2010. ISBN 978-1-84882-259-7
Slides of the classes
- The slides used in the course will be inserted in the calendar after each class. Most of them are taken from the slides provided by the textbook's authors: Slides for "Introduction to Data Mining".
Past Exams
- Some exercises (partially with solutions) on sequential patterns and time series can be found in the following exam texts from past years:
- Some very old exercises (some of them with solutions) are available here, most of them in Italian, and not all of them on topics covered in this year's program:
Data mining software
- KNIME The Konstanz Information Miner. Download page
- Python - Anaconda (2.7 version!!!): Anaconda is the leading open data science platform powered by Python. Download page (the following libraries are already included)
- Scikit-learn: python library with tools for data mining and data analysis Documentation page
- Pandas: pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. Documentation page
- WEKA Data Mining Software in JAVA. University of Waikato, New Zealand Download page
Class calendar (2016-2017)
First part of course, first semester (DMF - Data mining: foundations)
No. | Day | Room | Topic | Learning material | Instructor |
---|---|---|---|---|---|
1. | 19.09.2016 11:00-13:00 | C | Canceled | - | |
2. | 23.09.2016 14:00-16:00 | A1 | Introduction | Course Overview, DM Introduction | Monreale |
3. | 26.09.2016 11:00-13:00 | C | Data Understanding | 3.dataunderstanding.pdf 3.data-understanting-appendix.pdf | Monreale |
4. | 30.09.2016 14:00-16:00 | A1 | Data Preparation | 4.data_preparation.pdf | Monreale |
5. | 03.10.2016 11:00-13:00 | C | Introduction to Python, Knime | python_tutorial.zip | Monreale/Guidotti |
6. | 07.10.2016 14:00-16:00 | A1 | Exercises on Data Understanding. | exercises-dm1.pdf | Monreale/Guidotti |
7. | 10.10.2016 11:00-13:00 | C | Centroid-based methods. | dm2014_clustering_intro.pdf dm2014_clustering_kmeans.pdf | Monreale |
8. | 14.10.2016 14:00-16:00 | A1 | Hierarchical methods. Density-Based Clustering | dm2014_clustering_hierarchical.pdf knime_slides_mains.pdf | Monreale |
9. | 17.10.2016 11:00-13:00 | C | Knime - Python: Data Understanding | python_data_understanding.zip knime_data_manipulation_iris.zip knime_data_manipulation_adult.zip | Monreale/Guidotti |
10. | 21.10.2016 14:00-16:00 | A1 | Clustering Validation | dm2014_clustering_validation.pdf | Monreale |
11. | 24.10.2016 11:00-13:00 | C | Knime - Python: Clustering | HC with Group Average exercises-clustering.pdf knime_clustering_iris.zip titanic_clustering.ipynb.zip | Monreale/Guidotti |
12. | 28.10.2016 14:00-16:00 | A1 | Exercises on Clustering | HC with Group Average exercises-clustering.pdf | Monreale/Guidotti |
| 04.11.2016 9:00-11:00 | A | First Mid-term test | | Monreale/Guidotti |
13. | 07.11.2016 11:00-13:00 | C | Frequent Patterns & Association Rules | 4-5tdm-restructured_assoc.pdf | Monreale |
14. | 11.11.2016 14:00-16:00 | A1 | Event on Big Data: Aula Magna | ||
15. | 14.11.2016 11:00-13:00 | C | Frequent Patterns & Association Rules | ||
16. | 18.11.2016 14:00-16:00 | A1 | Knime - Python: Frequent Pattern & Association Rules | knime_pattern.zip knime_pattern_titanic2.zip titanic_frequent_patterns.ipynb.zip (http://www.borgelt.net/apriori.html) | |
17. | 21.11.2016 11:00-13:00 | C | Classification | chap4_basic_classification.pdf | |
18. | 25.11.2016 14:00-16:00 | A1 | Classification | ||
19. | 28.11.2016 11:00-13:00 | C | Classification | ||
20. | 02.12.2016 14:00-16:00 | A1 | Exercises on Patterns & Classification | ||
21. | 05.12.2016 11:00-13:00 | C | Canceled | ||
22. | 09.12.2016 14:00-16:00 | A1 | Canceled | ||
23. | 12.12.2016 11:00-13:00 | C | Exercises on Patterns & Classification | knime_classification_iris.zip titanic_classification.ipynb.zip | Guidotti / Pedreschi |
24. | 16.12.2016-18.12.2015 | A1 | Knime - Python: Classification | | Guidotti / Pedreschi |
| 21.12.2016 9:00-11:00 | A | Second Mid-term test | | Monreale/Guidotti |
Second part of course, second semester (DMA - Data mining: advanced topics and case studies)
No. | Day | Room | Topic | Learning material | Instructor (default: Nanni) |
---|---|---|---|---|---|
1. | 21.02.2017 16:00-18:00 | B | Introduction + Sequential patterns/1 | Introduction, Sequential patterns | Nanni + Pedreschi |
2. | 24.02.2017 16:00-18:00 | B | Sequential patterns/2 | ||
3. | 28.02.2017 16:00-18:00 | B | Sequential patterns/3 | Link to SPMF, a tool for seq. patterns and sample dataset. Exercises: Text 1 and Text 2 | |
| | cancelled | |||
4. | 07.03.2017 16:00-18:00 | B | Time series/1 | Time series | |
5. | 10.03.2017 16:00-18:00 | B | Time series/2 | Python examples, Knime examples, link to sounds dataset (source: speech recognition example) | |
6. | 14.03.2017 16:00-18:00 | B | Time series/3 | Python examples/2 | |
7. | 17.03.2017 16:00-18:00 | B | Time series/4 | Python examples/3, Knime example | |
8. | 21.03.2017 16:00-18:00 | B | DM Process/1 | Example AMRP (also described in this report, in Italian), CRISP-DM, Link to the CRISP-DM 1.0 guide (by SPSS) | |
9. | 24.03.2017 16:00-18:00 | B | DM Process/2 | Intro_CRM Churn | |
10. | 28.03.2017 16:00-18:00 | B | DM Process/3 | Collective churn analysis, Promotions, Sophistication. Sample reports made by students and (loosely) following CRISP-DM: Report 1 (Italian), Report 2 (English), Report 3 (Italian). Exercise on CRISP-DM: understanding churn | |
| B | Cancelled | |||
11. | 04.04.2017 16:00-18:00 | B | Exercises | Exercise on Understanding churn (with a solution). See also exercises in section Past Exams | |
| 07.04.2017 11:00-13:00 | A1 + C1 | Mid-term exams | ||
12. | 21.04.2017 16:00-18:00 | B | Classification: alternative methods/1 | slides on K-nearest neighbours and Naive Bayes | Pedreschi |
13. | 28.04.2017 16:00-18:00 | B | Classification: alternative methods/2 | slides on Artificial Neural Networks and Support Vector Machines | Pedreschi |
14. | 02.05.2017 16:00-18:00 | B | Classification: alternative methods/3 | slides on ensemble methods, slides on the wisdom of the crowds, and the original 1907 Nature paper by Francis Galton, "Vox populi" | Pedreschi |
15. | | Lecture canceled | |||
16. | 09.05.2017 16:00-18:00 | B | Classification: validation methods/1 | Slides from P. Adamopoulos, Slides from J.F. Ehmke | |
17. | 12.05.2017 16:00-18:00 | B | Classification: validation methods/2 | Imbalanced data & evaluation, Knime sample classification & evaluation, Python sample classification & evaluation | |
18. | 16.05.2017 16:00-18:00 | B | Classification: validation methods/3 | ||
19. | 19.05.2017 16:00-18:00 | B | Exercises | Ex. from past exams 1, Ex. from past exams 2, Mixed Exercises, Lift chart | |
20. | 23.05.2017 16:00-18:00 | B | Outlier Detection/1 | Slides from SDM2010 tutorial | |
21. | 26.05.2017 16:00-18:00 | B | Outlier Detection/2 | Python examples, Knime examples, link to ELKI framework, test dataset for ELKI | |
22. | 30.05.2017 16:00-18:00 | B | Exercises | Exercises on outlier detection, Exercises on ensembles and ROC/Lift chart | |
| 06.06.2017 11:00-13:00 | A + B | Mid-term exams | ||
Exams
Exam DM part I (DMF)
The exam is composed of three parts:
- A written exam, with exercises and questions about methods and algorithms presented during the classes. It can be replaced by the first and second mid-term tests of November and December.
- An oral exam, that includes: (1) discussing the project report with a group presentation; (2) discussing topics presented during the classes, including the theory of the parts already covered by the written exam.
- A project, consisting of exercises that require the use of data mining tools for data analysis. Exercises include: data understanding, clustering analysis, frequent pattern mining, and classification. The project must be carried out by groups of at most 3 people, using Knime, Python or a combination of them. The results of the different tasks must be reported in a single paper. The total length of this paper must be at most 20 pages of text including figures. The project must be delivered at least 2 days before the oral exam. The paper must be emailed to datamining [dot] unipi [at] gmail [dot] com. Please use “[DM 2016-2017] Project” in the subject. Tasks of the project:
- Data Understanding (Assigned on: 17/10/2016): Explore the dataset with the analytical tools studied and write a concise “data understanding” report describing data semantics, assessing data quality, the distribution of the variables and the pairwise correlations.
- Clustering analysis (Assigned on: 14/11/2016): Explore the dataset using various clustering techniques. Carefully describe your decisions for each algorithm and the advantages provided by the different approaches. (see Guidelines for details)
- Association Rules (Assigned on: 21/11/2016): Explore the dataset using frequent pattern mining and association rules extraction. Then use them to predict a variable, either to replace missing values or to predict the hotel type. (see Guidelines for details)
- Classification (Assigned on: 12/12/2016): Explore the dataset using classification trees and random forests. Use them to predict the hotel type. (see Guidelines for details)
- Project 1
- Dataset: Expedia (Hotel Recommendations)
- Assigned: 11/01/2017
- Deadline: 11/02/2017 (extended to 23.59 of 13/02/2017)
- Hint: if the dataset is too big to be analyzed by your computer, you can use a representative sample of the entire dataset. You must specify in the project report how you selected this sample and justify your choices.
- Project 2
- Dataset: Pima (Diabetes Detection)
- Assigned: 05/05/2017
- Deadline: 02/06/2017 (extended to 23.59 of 05/06/2017)
Guidelines for the project are here.
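The hint for Project 1 asks that any sample be representative and that its selection be documented in the report. A minimal sketch of one way to do this follows, assuming Python 3 and a stratified draw with a fixed random seed. The function name `stratified_sample`, the seed value, and the `hotel_type` column are hypothetical and used only for illustration; they are not part of the official guidelines.

```python
import random
from collections import defaultdict


def stratified_sample(rows, key, fraction, seed=42):
    """Draw a reproducible sample that keeps the proportions of `key`."""
    rng = random.Random(seed)  # fixed seed, so the report can state it
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    sample = []
    for label in sorted(groups):
        members = groups[label]
        # Keep at least one row per group so rare values are not lost.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Toy rows standing in for the real dataset; "hotel_type" is a hypothetical column.
rows = [{"id": i, "hotel_type": "A" if i < 80 else "B"} for i in range(100)]
sample = stratified_sample(rows, key=lambda r: r["hotel_type"], fraction=0.1)
```

On the toy rows the sample keeps the original 80/20 split (8 rows of type A and 2 of type B). Reporting the seed, the fraction, and the stratification variable is one possible way to justify the choice in the project report.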
Exam DM part II (DMA)
The exam is composed of three parts:
- A written exam, with exercises and questions about methods and algorithms presented during the classes. It can be replaced by the first and second mid-term tests of April and June.
- An oral exam, that includes: (1) discussing the project report with a group presentation; (2) discussing topics presented during the classes, including the theory of the parts already covered by the written exam.
- A project, consisting of exercises that require the use of data mining tools for data analysis. Exercises include: sequential patterns, time series, classification (alternative methods and validation), outlier detection. The project must be carried out by groups of at most 3 people, using Knime, Python, other software or a combination of them. The results of the different tasks must be reported in a single paper. The total length of this paper must be at most 20 pages of text including figures. The project must be delivered at least 2 days before the oral exam.
- Sequential patterns. Apply sequential pattern mining (with temporal constraints, if needed) to a dataset that encodes 100 of Bach's chorales as sequences of numbers. Two files are provided: one encodes only notes (MIDI pitch integer numbers), the other encodes notes & durations as a single number (number = duration*100 + note). Objective: Find the top-5 most frequent sequences with at least 5 notes, and the top-5 contiguous sequences (i.e. contiguous strings of notes) with at least 4 notes. Repeat the experiments on both datasets, using appropriate algorithms and adjusting parameters. Dataset: Preprocessed data, see also the original data and further details on the UCI page.
- Time series. You are given a dataset of the homicides recorded in the USA over 35 years, expressed as time series of yearly counts of homicides for each state. You are asked to look for similarities across the states. Objectives: check whether there is some periodicity in the time series; look for clusters over the time series using (i) DBSCAN with DTW, (ii) DBSCAN with Euclidean distance, (iii) K-means with Euclidean distance, each time searching for the best parameters and commenting on the results. In case of empty values (i.e. no records provided for some year/month in some state), fill them with a zero or another reasonable value. Dataset: download it from Kaggle (11 MB, zipped); you can also (optionally) use the following preprocessing Python script to extract the relevant data from the dataset.
- Classification. Using the Titanic dataset with target variable “Survived”, extract one classification model for each of the following approaches: kNN, SVM, neural networks, naive Bayes. For at least one of them, perform a search to select the parameters that optimize accuracy. For all the others, simply choose reasonable parameter values. After dividing the dataset into training and test sets, provide for all the models the confusion matrix, accuracy, precision & recall for the positive class.
- Outlier detection. Given the 2-d dataset provided below, apply the distance-based outlier detection method (DB(epsilon,n)), fitting the parameters so as to obtain 5% outliers. On the same dataset apply the LOF method, and select the top 5% outliers according to it. Compare the outputs, showing the differences and trying to explain the results. Dataset: download here.
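As an illustration of the outlier detection task, the sketch below shows one possible reading of DB(epsilon, n) in plain Python 3. It assumes that a point is an outlier when fewer than n other points lie within distance epsilon of it; treating n as an absolute count rather than a fraction is an assumption, not something stated above. The helper names (`db_outliers`, `outlier_fraction`, `tune_epsilon`) and the toy points are hypothetical.

```python
import math


def db_outliers(points, epsilon, n):
    """Flag each 2-d point that has fewer than n other points within epsilon."""
    flags = []
    for i, (px, py) in enumerate(points):
        neighbours = sum(
            1
            for j, (qx, qy) in enumerate(points)
            if j != i and math.hypot(px - qx, py - qy) <= epsilon
        )
        flags.append(neighbours < n)
    return flags


def outlier_fraction(flags):
    """Share of points flagged as outliers."""
    return sum(flags) / len(flags)


def tune_epsilon(points, n, target, candidates):
    """Pick the candidate epsilon whose outlier share is closest to target."""
    return min(
        candidates,
        key=lambda eps: abs(outlier_fraction(db_outliers(points, eps, n)) - target),
    )


# Made-up toy data: a tight cluster plus one distant point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
flags = db_outliers(points, epsilon=0.5, n=2)
best_eps = tune_epsilon(points, n=2, target=0.2, candidates=[0.05, 0.5, 10.0])
```

On the real dataset the target would be 0.05 and the candidate grid much finer. The quadratic scan is kept for readability only. For the LOF part, scikit-learn offers `sklearn.neighbors.LocalOutlierFactor`, and the ELKI framework linked in the calendar is another option.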
Exam sessions
Mid-term exams
Test | Date | Hour | Place | Notes | Marks |
---|---|---|---|---|---|
First Mid-term 2016 | 4.11.2016 | 9:00 - 11:00 | Room A | ||
Second Mid-term 2016 | 21.12.2016 | 9:00 - 11:00 | Room A | ||
Test | Date | Hour | Place | Notes | Marks |
---|---|---|---|---|---|
1st Mid-term 2017 | 7.4.2017 | 11:00 - 13:00 | Rooms A1 + C1 | Solutions | Results 7.4.2017 |
2nd Mid-term 2017 | 6.6.2017 | 11:00 - 13:00 | Rooms A + B | Solutions | Results 6.6.2017 |
Regular exam sessions
Session | Date | Time | Room | Notes | Solutions | Marks |
---|---|---|---|---|---|---|
1. | 19 Jan 2017 | 09:00 | C | On the same date we will define the dates for the oral exam. | ||
2. | 08 Feb 2017 | 14:00 | C | On the same date we will define the dates for the oral exam. | ||
3. | 08 June 2017 | 14:00 | A1 | (1) Oral exam of DM1 for students who already have a mark for the written exam of DM1. (2) Oral exam of DM2 for students who already have a mark for the written exam of DM2. Please use the registration system: https://esami.unipi.it/ | ||
4. | 09 June 2017 | 10:00 | A1 | (1) Oral exam of DM1 for students who already have a mark for the written exam of DM1. (2) Oral exam of DM2 for students who already have a mark for the written exam of DM2. Please use the registration system: https://esami.unipi.it/ | ||
5. | 13 June 2017 | 11:00 | A1 | Written exam of DM1/DM2. On the same date we will hold the oral exam for students who already have a written mark, and we will define the dates for the oral exam. Please use the registration system: https://esami.unipi.it/ | Solutions | Results DM2 13.6.2017 |
6. | 04 July 2017 | 09:00 | A1 | Written exam of DM1/DM2. On the same date we will hold the oral exam for students who already have a written mark, and we will define the dates for the oral exam. Please use the registration system: https://esami.unipi.it/ | Solutions | Results DM2 4.7.2017 |
7. | 06 September 2017 | 09:00 | A1 | Written exam of DM1/DM2. On the same date we will hold the oral exam for students who already have a written mark, and we will define the dates for the oral exam. Please use the registration system: https://esami.unipi.it/ | Solutions | Results DM2 6.9.2017 |
Extra sessions, Academic Year 2015/16
Date | Time | Room | Notes | Results |
---|---|---|---|---|