
Data Mining A.Y. 2018/19

DM 1: Foundations of Data Mining (6 CFU)

Instructors:

Teaching assistant:

DM 2: Advanced topics on Data Mining and case studies (6 CFU)

Instructors:

DM: Data Mining (9 CFU)

Instructors:

Teaching assistant:

News

Learning goals

… a new kind of professional has emerged, the data scientist, who combines the skills of software programmer, statistician and storyteller/artist to extract the nuggets of gold hidden under mountains of data. Hal Varian, Google’s chief economist, predicts that the job of statistician will become the “sexiest” around. Data, he explains, are widely available; what is scarce is the ability to extract wisdom from them.

Data, data everywhere. The Economist, Special Report on Big Data, Feb. 2010.

The wide availability of data from relational databases, the web, and other sources motivates the study of data analysis techniques that enable a better understanding and an easier use of the results in decision-making processes. The course aims to provide an introduction to the basic concepts of the knowledge extraction process, to the main data mining techniques, and to the related algorithms. Particular emphasis is placed on methodological aspects, presented through paradigmatic classes of applications such as market basket analysis, market segmentation, and fraud detection. Finally, the course introduces the privacy and ethical issues inherent in the use of inference techniques on data, of which the analyst must be aware. The course consists of the following parts:

  1. the basic concepts of the knowledge extraction process: data understanding and preparation, forms of data, data measures and similarity;
  2. the main data mining techniques (association rules, classification, and clustering), studied in both their formal and implementation aspects;
  3. case studies in marketing and customer management support, fraud detection, and epidemiological studies;
  4. the privacy and ethical issues inherent in the use of inference techniques on data, of which the analyst must be aware.

Reading about the "data scientist" job

Hours and rooms

DM1 & DM

Classes

Day of Week Hour Room
Monday 14:00 - 16:00 Room C1
Wednesday 14:00 - 16:00 Room C1
Friday 11:00 - 13:00 Room C1

Office hours:

DM 2

Classes

Day of week Hour Room
Thursday 14 - 16 A1
Friday 16 - 18 C1

Office hours:

Learning Material

Textbook

Slides of the classes

The slides used during the course will be added to the calendar after each lecture. Most of them are taken from those provided by the authors of the textbook: Slides for "Introduction to Data Mining"

Past Exams

* Some texts of past exams for DM1 (6 CFU):

* Some solutions of past exams, containing exercises on KNN and Naive Bayes classifiers, for DM1 (9 CFU):

* Some exercises (partially with solutions) on sequential patterns and time series can be found in the following exam texts from recent years:

Data mining software

Class calendar (2018/2019)

First part of course, first semester (DM1 - Data mining: foundations & DM - Data Mining)

Day Room Topic Learning material Instructor
1. 19.09 14:00-16:00 C1 Overview. Introduction. 1.2018-dm-overview.pdf Pedreschi
2. 20.09 16:00-18:00 C1 Introduction Pedreschi
21.09 11:00-13:00 C1 Lecture canceled Pedreschi
3. 24.09 14:00-16:00 C1 KDD Process & Applications. Data Understanding. DM + Applications DU Monreale
4. 26.09 14:00-16:00 C1 Data Understanding. Data Preparation Monreale
5. 28.09 11:00-13:00 C1 Introduction to Python, Knime intro_knime intro_python Monreale/Guidotti
6. 01.10 14:00-16:00 C1 Data Preparation Data Preparation Monreale
7. 03.10 14:00-16:00 C1 Clustering: Introduction and Centroid-based clustering 4.basic_cluster_analysis-intro-kmeans.pdf Monreale
05.10 11:00-13:00 C1 Lecture canceled
8. 08.10 14:00-16:00 C1 Knime - Python: Data Understanding du_knime du_python Guidotti
9. 10.10 14:00-16:00 C1 Clustering: K-means & Hierarchical 5.basic_cluster_analysis-hierarchical.pdf Pedreschi
12.10 11:00-13:00 C1 Lecture canceled for IF
10. 15.10 14:00-16:00 C1 Clustering: DBSCAN 6.basic_cluster_analysis-dbscan-validity.pdf Pedreschi
11. 17.10 14:00-16:00 C1 Clustering: Validity Pedreschi
12. 19.10 11:00-13:00 C1 Discussion on Projects - DU Guidotti
13. 22.10 14:00-16:00 C1 Exercises for mid-term test Tool for Dm ex: Didactic Data Mining Ex. Clustering PDF Ex. Clustering PPTX Monreale
14. 24.10 14:00-16:00 C1 Knime - Python: Clustering clustering_knime clustering_python Guidotti
15. 26.10 11:00-13:00 C1 Exercises for mid-term test Ex. Clustering PPTX - complete Ex. Clustering PDF - complete Exercises DU ex-silhouette.pdf Monreale
16. 05.11 14:00-16:00 C1 Classification/1 7.chap3_basic_classification.ppt Monreale
17. 07.11 14:00-16:00 C1 Classification/2 Monreale
09.11 11:00-13:00 C1 Lecture canceled
18. 12.11 14:00-16:00 C1 LAB: Classification knime_classification python_classification Guidotti
19. 14.11 14:00-16:00 C1 Pattern Mining Explanation of classification/ML models Pattern mining Intro Apriori Algorithm for Pattern/AR Mining Pedreschi
20. 16.11 11:00-13:00 C1 Pattern Mining Pedreschi
21. 19.11 14:00-16:00 C1 Exercises for the mid-term ex-second-midterm.pdf Monreale
22. 21.11 14:00-16:00 C1 Lab Pattern Mining + Discussion Clustering knime_pattern python_pattern https://anaconda.org/conda-forge/pyfim, https://pypi.org/project/fim/, http://www.borgelt.net/pyfim.html Guidotti/Pedreschi
The following lectures are dedicated to the DM course (9 CFU)
23. 23.11 11:00-13:00 C1 Alternative methods for Pattern Mining. Privacy in DM fp-growth.pdf Monreale
24. 26.11 14:00-16:00 C1 Alternative methods for Clustering. Privacy in DM 1-alternative-clustering.pdf Monreale
25. 28.11 14:00-16:00 C1 Privacy in DM. Transactional Clustering 2-transactionalclustering.pdf privacydt.pdf Papers on Clustering Monreale
26. 30.11 11:00-13:00 C1 Alternative methods for classification/1 K-Nearest Neighbors & Naive Bayes Pedreschi
27. 03.12 14:00-16:00 C1 Alternative methods for classification/2 Ensemble methods Wisdom of the crowd & Ensemble methods Galton's Vox Populi Pedreschi
28. 05.12 14:00-16:00 C1 Alternative methods for classification/3 Pedreschi
29. 07.12 11:00-13:00 C1 Exercises on clustering and classification CLOPE K-mode KNN & NB Monreale
30. 10.12 14:00-16:00 C1 Exercises on Second part - all students Monreale
31. 12.12 14:00-16:00 C1 Final Discussion on Project - all students Pedreschi/Guidotti
32. 14.12 11:00-13:00 C1 Cancelled

Second part of course, second semester (DMA - Data mining: advanced topics and case studies)

Day Room Topic Learning material Instructor (default: Nanni)
1. 21.02.2019 14:00-16:00 A1 Introduction + Sequential patterns/1 Introduction, Sequential patterns
2. 22.02.2019 16:00-18:00 C1 Sequential patterns/2
3. 01.03.2019 16:00-18:00 C1 Sequential patterns/3 Sample exercises (fixed)
4. 07.03.2019 14:00-16:00 A1 Sequential patterns/4 Sequential pattern tools: Link to SPMF + Sample datasets, Python2 GSP educational implementation (source), PrefixSpan-py (requires Python3)
5. 08.03.2019 16:00-18:00 C1 Time series/1 Time series
6. 14.03.2019 14:00-16:00 A1 Time series/2 Overview on DM for time series, DTW paper by Sakoe and Chiba, 1978
7. 15.03.2019 16:00-18:00 C1 Time series/3
8. 21.03.2019 14:00-16:00 A1 Time series/4 Preprocessing in Python DTW in Python
9. 22.03.2019 16:00-18:00 C1 Time series/5
10. 28.03.2019 14:00-16:00 A1 Exercises for mid-term exam Exercises from past exams
11. 29.03.2019 16:00-18:00 C1 Exercises for mid-term exam Exercises from past exams (with some solutions)
04.04.2019 16:00-18:00 A1 + E mid-term exam
11. 11.04.2019 14:00-16:00 A1 Classification: alternative methods/1 kNN and Bayes classifier
12. 12.04.2019 16:00-18:00 C1 Classification: alternative methods/2 NN and SVM, Exercises
02.05.2019 14:00-16:00 A1 Cancelled
13. 03.05.2019 16:00-18:00 C1 Classification: alternative methods/3
14. 09.05.2019 14:00-16:00 A1 Classification: alternative methods/4 Ex. on NNs and SVM, Ex. on KNN and Naive Bayes
15. 10.05.2019 16:00-18:00 C1 Classification: Model Evaluation Model performances
16. 16.05.2019 14:00-16:00 A1 Classification: Model Evaluation Unbalanced data, Classification weights
17. 17.05.2019 16:00-18:00 C1 Classification: alternative methods/5 Ensembles, Homeworks!
18. 23.05.2019 14:00-16:00 A1 Exercises + Outlier detection/1 Ex. on Lift chart, Ex. on Ensembles, Outlier detection
19. 24.05.2019 16:00-18:00 C1 Outlier detection/2 Ex. on outliers, Ex. from past exams
20. 31.05.2019 16:00-18:00 C1 Due to a strike, the lesson will not take place. For your convenience, here is some material you can use: Examples of classification and validation in Python, Examples of outlier detection in Python, CRISP-DM guidelines. Feel free to contact me if you need clarifications. Remark: the CRISP-DM model will not be part of the exam program.
06.06.2019 16:00-18:00 E (+A1) mid-term exam 2nd mid-term of last year and its solutions (careful: they were not double-checked).

Exams

Exam DM part I (DMF)

The exam is composed of three parts:

Tasks of the project:

  1. Data Understanding (Collective discussion on: 19/10/2018): Explore the dataset with the analytical tools studied and write a concise “data understanding” report describing data semantics, assessing data quality, the distribution of the variables and the pairwise correlations. (see Guidelines for details)
  2. Clustering analysis (Collective discussion on: 21/11/2018): Explore the dataset using various clustering techniques. Carefully describe your decisions for each algorithm and the advantages provided by the different approaches. (see Guidelines for details)
  3. Classification (Collective discussion on: 12/12/2018): Explore the dataset using classification trees and random forest. Use them to predict the target variable. (see Guidelines for details)
  4. Association Rules (Collective discussion on: 12/12/2018): Explore the dataset using frequent pattern mining and association rule extraction. Then use the rules to predict a variable, either to replace missing values or to predict the target variable. (see Guidelines for details)

Guidelines for the project are here.
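
As a rough illustration of what task 4 above involves, the following minimal Python sketch finds frequent itemsets by brute-force enumeration and derives association rules with their confidence. The toy transactions, the thresholds, and the function names `frequent_itemsets` and `association_rules` are hypothetical and chosen only for illustration. This is not the Apriori implementation used in the course, and for real datasets the tools linked in the calendar (Knime, pyfim) are the more appropriate choice.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support.

    Brute-force enumeration by increasing size; suitable for toy data only.
    """
    n = len(transactions)
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        found = False
        for candidate in combinations(items, size):
            cset = frozenset(candidate)
            support = sum(1 for t in transactions if cset <= t) / n
            if support >= min_support:
                result[cset] = support
                found = True
        if not found:
            # No frequent itemset of this size, hence none of a larger size
            # (support is anti-monotone).
            break
    return result


def association_rules(itemsets, min_confidence):
    """Return (antecedent, consequent, support, confidence) tuples."""
    rules = []
    for itemset, support in itemsets.items():
        for k in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), k):
                a = frozenset(antecedent)
                # Every subset of a frequent itemset is frequent, so it is in the dict.
                confidence = support / itemsets[a]
                if confidence >= min_confidence:
                    rules.append((a, itemset - a, support, confidence))
    return rules


# Hypothetical toy transactions (not taken from the course dataset).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

itemsets = frequent_itemsets(transactions, min_support=0.5)
rules = association_rules(itemsets, min_confidence=0.8)

for antecedent, consequent, support, confidence in rules:
    print(sorted(antecedent), "->", sorted(consequent),
          f"support={support:.2f} confidence={confidence:.2f}")
```

On this toy data the only rule that passes both thresholds is butter -> bread. A rule of this kind could then be used to fill in a missing value or to predict the target variable whenever its antecedent is present in a record.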

Exam DM part II (DMA)

The exam is composed of three parts:

Exam sessions

Mid-term exams

Exam Date Hour Place Notes Marks
DM1: First Mid-term 2018 30.10.2018 11-13 Room C1, L1, N1 Please use the system for registration: https://esami.unipi.it/ Results
DM1: Second Mid-term 2018 18.12.2018 11-13 Room C1, L1, N1 Please use the system for registration: https://esami.unipi.it/
DM2: First Mid-term 2019 04.04.2019 16-18 Room A1, E Please use the system for registration: https://esami.unipi.it/ Text + Solutions Results
DM2: Second Mid-term 2019 06.06.2019 16-18 Room E (+ A1 if needed) Please use the system for registration: https://esami.unipi.it/ Text Results

Regular exam sessions

Session Date Time Room Notes Marks
1. 16.01.2019 14:00 - 18:00 Room E
2. 06.02.2019 14:00 - 18:00 Room E
3. 19.06.2019 09:00 - 13:00 Room A1 Oral exam on DM1 by 15 July. If you cannot take it by that date, you can take the oral exam in September. Results
4. 10.07.2019 09:00 - 13:00 Room A1 Oral exam on DM1 by 15 July. If you cannot take it by that date, you can take the oral exam in September. Results

Extra sessions A.Y. 2017/18

Date Time Room Notes Results

Previous years