Instructors - Docenti:
Teaching assistant - Assistente:
Instructors:
Instructors:
Teaching assistant - Assistente:
… a new kind of professional has emerged, the data scientist, who combines the skills of software programmer, statistician and storyteller/artist to extract the nuggets of gold hidden under mountains of data. Hal Varian, Google’s chief economist, predicts that the job of statistician will become the “sexiest” around. Data, he explains, are widely available; what is scarce is the ability to extract wisdom from them.
Data, data everywhere. The Economist, Special Report on Big Data, Feb. 2010.
The wide availability of data from relational databases, the web, and other sources motivates the study of data analysis techniques that enable a better understanding and easier use of results in decision-making processes. The course aims to provide an introduction to the basic concepts of the knowledge discovery process, the main data mining techniques, and the related algorithms. Particular emphasis is placed on methodological aspects, presented through a few classes of paradigmatic applications such as Market Basket Analysis, market segmentation, and fraud detection. Finally, the course introduces the privacy and ethical issues inherent in the use of inference techniques on data, of which the analyst must be aware. The course consists of the following parts:
Classes - Lezioni
Day of Week | Hour | Room |
---|---|---|
Lunedì/Monday | 14:00 - 16:00 | Aula C1 |
Mercoledì/Wednesday | 14:00 - 16:00 | Aula C1 |
Venerdì/Friday | 11:00 - 13:00 | Aula C1 |
Office hours - Ricevimento:
Classes - Lezioni
Day of week | Hour | Room |
---|---|---|
Thursday | 14:00 - 16:00 | A1 |
Friday | 16:00 - 18:00 | C1 |
Office hours - Ricevimento:
The slides used during the course will be added to the calendar at the end of each lecture. They are largely taken from those provided by the authors of the textbook: Slides for "Introduction to Data Mining"
* Some texts of past exams on DM1 (6 CFU):
* Some solutions of past exams containing exercises on KNN and Naive Bayes classifiers, DM1 (9 CFU):
* Some exercises (partially with solutions) on sequential patterns and time series can be found in the following exam texts from recent years:
| # | Day | Room | Topic | Learning material | Instructor |
|---|---|---|---|---|---|
| 1. | 19.09 14:00-16:00 | C1 | Overview. Introduction. | 1.2018-dm-overview.pdf | Pedreschi |
| 2. | 20.09 16:00-18:00 | C1 | Introduction | | Pedreschi |
| | 21.09 11:00-13:00 | C1 | Lecture canceled | | Pedreschi |
| 3. | 24.09 14:00-16:00 | C1 | KDD Process & Applications. Data Understanding. | DM + Applications, DU | Monreale |
| 4. | 26.09 14:00-16:00 | C1 | Data Understanding. Data Preparation | | Monreale |
| 5. | 28.09 11:00-13:00 | C1 | Introduction to Python, KNIME | intro_knime, intro_python | Monreale/Guidotti |
| 6. | 01.10 14:00-16:00 | C1 | Data Preparation | Data Preparation | Monreale |
| 7. | 03.10 14:00-16:00 | C1 | Clustering: Introduction and Centroid-based Clustering | 4.basic_cluster_analysis-intro-kmeans.pdf | Monreale |
| | 05.10 11:00-13:00 | C1 | Lecture canceled | | |
| 8. | 08.10 14:00-16:00 | C1 | KNIME - Python: Data Understanding | du_knime, du_python | Guidotti |
| 9. | 10.10 14:00-16:00 | C1 | Clustering: K-means & Hierarchical | 5.basic_cluster_analysis-hierarchical.pdf | Pedreschi |
| | 12.10 11:00-13:00 | C1 | Lecture canceled for IF | | |
| 10. | 15.10 14:00-16:00 | C1 | Clustering: DBSCAN | 6.basic_cluster_analysis-dbscan-validity.pdf | Pedreschi |
| 11. | 17.10 14:00-16:00 | C1 | Clustering: Validity | | Pedreschi |
| 12. | 19.10 11:00-13:00 | C1 | Discussion on Projects - DU | | Guidotti |
| 13. | 22.10 14:00-16:00 | C1 | Exercises for mid-term test | Tool for DM exercises: Didactic Data Mining, Ex. Clustering PDF, Ex. Clustering PPTX | Monreale |
| 14. | 24.10 14:00-16:00 | C1 | KNIME - Python: Clustering | clustering_knime, clustering_python | Guidotti |
| 15. | 26.10 11:00-13:00 | C1 | Exercises for mid-term test | Ex. Clustering PPTX - complete, Ex. Clustering PDF - complete, Exercises DU, ex-silhouette.pdf | Monreale |
| 16. | 05.11 14:00-16:00 | C1 | Classification/1 | 7.chap3_basic_classification.ppt | Monreale |
| 17. | 07.11 14:00-16:00 | C1 | Classification/2 | | Monreale |
| | 09.11 11:00-13:00 | C1 | Lecture canceled | | |
| 18. | 12.11 14:00-16:00 | C1 | Lab: Classification | knime_classification, python_classification | Guidotti |
| 19. | 14.11 14:00-16:00 | C1 | Pattern Mining | Explanation of classification/ML models Pattern mining Intro Apriori Algorithm for Pattern/AR Mining | Pedreschi |
| 20. | 16.11 11:00-13:00 | C1 | Pattern Mining | | Pedreschi |
| 21. | 19.11 14:00-16:00 | C1 | Exercises for the mid-term | ex-second-midterm.pdf | Monreale |
| 22. | 21.11 14:00-16:00 | C1 | Lab: Pattern Mining + Discussion on Clustering | knime_pattern, python_pattern, https://anaconda.org/conda-forge/pyfim, https://pypi.org/project/fim/, http://www.borgelt.net/pyfim.html | Guidotti/Pedreschi |
| The next lectures are dedicated to the 9-credit DM course | | | | | |
| 23. | 23.11 11:00-13:00 | C1 | Alternative methods for Pattern Mining. Privacy in DM | fp-growth.pdf | Monreale |
| 24. | 26.11 14:00-16:00 | C1 | Alternative methods for Clustering. Privacy in DM | 1-alternative-clustering.pdf | Monreale |
| 25. | 28.11 14:00-16:00 | C1 | Privacy in DM. Transactional Clustering | 2-transactionalclustering.pdf, privacydt.pdf, Papers on Clustering | Monreale |
| 26. | 30.11 11:00-13:00 | C1 | Alternative methods for classification/1 | K-Nearest Neighbors & Naive Bayes | Pedreschi |
| 27. | 03.12 14:00-16:00 | C1 | Alternative methods for classification/2 | Ensemble methods Wisdom of the crowd & Ensemble methods Galton's Vox Populi | Pedreschi |
| 28. | 05.12 14:00-16:00 | C1 | Alternative methods for classification/3 | | Pedreschi |
| 29. | 07.12 11:00-13:00 | C1 | Exercises on clustering and classification | CLOPE K-mode KNN & NB | Monreale |
| 30. | 10.12 14:00-16:00 | C1 | Exercises on Second part - all students | | Monreale |
| 31. | 12.12 14:00-16:00 | C1 | Final Discussion on Project - all students | | Pedreschi/Guidotti |
| 32. | 14.12 11:00-13:00 | C1 | Lecture canceled | | |
| # | Day | Room | Topic | Learning material | Instructor (default: Nanni) |
|---|---|---|---|---|---|
| 1. | 21.02.2019 14:00-16:00 | A1 | Introduction + Sequential patterns/1 | Introduction, Sequential patterns | |
| 2. | 22.02.2019 16:00-18:00 | C1 | Sequential patterns/2 | | |
| 3. | 01.03.2019 16:00-18:00 | C1 | Sequential patterns/3 | Sample exercises (fixed) | |
| 4. | 07.03.2019 14:00-16:00 | A1 | Sequential patterns/4 | Sequential pattern tools: Link to SPMF + Sample datasets, Python 2 GSP educational implementation (source), PrefixSpan-py (requires Python 3) | |
| 5. | 08.03.2019 16:00-18:00 | C1 | Time series/1 | Time series | |
| 6. | 14.03.2019 14:00-16:00 | A1 | Time series/2 | Overview on DM for time series, DTW paper by Sakoe and Chiba, 1978 | |
| 7. | 15.03.2019 16:00-18:00 | C1 | Time series/3 | | |
| 8. | 21.03.2019 14:00-16:00 | A1 | Time series/4 | Preprocessing in Python, DTW in Python | |
| 9. | 22.03.2019 16:00-18:00 | C1 | Time series/5 | | |
| 10. | 28.03.2019 14:00-16:00 | A1 | Exercises for mid-term exam | Exercises from past exams | |
| 11. | 29.03.2019 16:00-18:00 | C1 | Exercises for mid-term exam | Exercises from past exams (with some solutions) | |
| | 04.04.2019 16:00-18:00 | A1 + E | Mid-term exam | | |
| 11. | 11.04.2019 14:00-16:00 | A1 | Classification: alternative methods/1 | kNN and Bayes classifier | |
| 12. | 12.04.2019 16:00-18:00 | C1 | Classification: alternative methods/2 | NN and SVM, Exercises | |
| | | | Lecture canceled | | |
| 13. | 03.05.2019 16:00-18:00 | C1 | Classification: alternative methods/3 | | |
| 14. | 09.05.2019 14:00-16:00 | A1 | Classification: alternative methods/4 | Ex. on NNs and SVM, Ex. on KNN and Naive Bayes | |
| 15. | 10.05.2019 16:00-18:00 | C1 | Classification: Model Evaluation | Model performances | |
| 16. | 16.05.2019 14:00-16:00 | A1 | Classification: Model Evaluation | Unbalanced data, Classification weights | |
| 17. | 17.05.2019 16:00-18:00 | C1 | Classification: alternative methods/5 | Ensembles, Homework! | |
| 18. | 23.05.2019 14:00-16:00 | A1 | Exercises + Outlier detection/1 | Ex. on Lift chart, Ex. on Ensembles, Outlier detection | |
| 19. | 24.05.2019 16:00-18:00 | C1 | Outlier detection/2 | Ex. on outliers, Ex. from past exams | |
| | | | Due to a strike, the lesson will not take place. For your convenience, here is some material you can use: Examples of classification and validation in Python, Examples of outlier detection in Python, CRISP-DM guidelines. Feel free to contact me if you need clarifications. Remark: the CRISP-DM model will not be part of the exam program. | | |
| | 06.06.2019 16:00-18:00 | E (+A1) | Mid-term exam | 2nd mid-term of last year and its solutions (careful: they were not double-checked). | |
The exam is composed of three parts:
Tasks of the project:
Guidelines for the project are here.
The exam is composed of three parts:
Exam | Date | Hour | Place | Notes | Marks |
---|---|---|---|---|---|
DM1: First Mid-term 2018 | 30.10.2018 | 11-13 | Room C1, L1, N1 | Please use the registration system: https://esami.unipi.it/ | Results |
DM1: Second Mid-term 2018 | 18.12.2018 | 11-13 | Room C1, L1, N1 | Please use the registration system: https://esami.unipi.it/ | |
DM2: First Mid-term 2019 | 04.04.2019 | 16-18 | Room A1, E | Please use the registration system: https://esami.unipi.it/ Text + Solutions | Results |
DM2: Second Mid-term 2019 | 06.06.2019 | 16-18 | Room E (+ A1 if needed) | Please use the registration system: https://esami.unipi.it/ Text | Results |
Session | Date | Time | Room | Notes | Marks |
---|---|---|---|---|---|
1. | 16.01.2019 | 14:00 - 18:00 | Room E | | |
2. | 06.02.2019 | 14:00 - 18:00 | Room E | | |
3. | 19.06.2019 | 09:00 - 13:00 | Room A1 | Oral exam on DM1 by 15 July. If you cannot take it by that date, you can take the oral exam in September. | Results |
4. | 10.07.2019 | 09:00 - 13:00 | Room A1 | Oral exam on DM1 by 15 July. If you cannot take it by that date, you can take the oral exam in September. | Results |
Date | Time | Room | Notes | Results |
---|---|---|---|---|