Data Mining A.A. 2017/18
DM 1: Foundations of Data Mining (6 CFU)
Instructors:
- Dino Pedreschi
- KDD Laboratory, Università di Pisa and ISTI - CNR, Pisa
Teaching assistant:
- Riccardo Guidotti
- KDD Laboratory, Università di Pisa and ISTI - CNR, Pisa
DM 2: Advanced topics on Data Mining and case studies (6 CFU)
Instructors:
- Mirco Nanni, Dino Pedreschi
- KDD Laboratory, Università di Pisa and ISTI - CNR, Pisa
DM: Data Mining (9 CFU)
Instructors:
- Dino Pedreschi, Anna Monreale
- KDD Laboratory, Università di Pisa and ISTI - CNR, Pisa
Teaching assistant:
- Riccardo Guidotti
- KDD Laboratory, Università di Pisa and ISTI - CNR, Pisa
News
- Next Exam Session: September 13, 9:00, room A1.
- Oral exam DM1: a possible date for the DM1 oral exam in September is Sept 5, if you have already taken the written part. Please send me an email at anna [dot] monreale [at] unipi [dot] it if you want to take the oral exam.
- ERRATA CORRIGE: The results of the written exam for DM2 of June 12 have been updated (a few marks slightly increased): DM2 Results.
- The results of the written exam for DM2 of June 12 are available: DM2 Results.
- The results of the (definitely too easy) second midterm exam for DM2 (June 1) are available: Results.
- DM2 projects: the exercises on classification and outlier detection are out.
- The results of the first midterm exam for DM2 (April 10) are available: Results.
- DM2 projects: the exercises on time series and sequential patterns are out.
- DM2: first mid-term exam fixed to April 10 from 16:00 to 18:00, room E.
- Today's class (March 1st, 2018) is cancelled due to weather conditions.
- Last date for the oral exam in the winter session: 23 Feb. 2018. Please book here: https://doodle.com/poll/a9y85z3ztmvz2grr (max 6 people)
- Next dates for the oral exam: a) 29 Jan 9:00-13:00, Pedreschi's office; b) 30 Jan 9:00-13:00, Monreale's office; c) 06 Feb 9:00-13:00, room of the written exam; d) 09 Feb 9:00-13:00, Monreale's office; e) 14 Feb 9:00-13:00, Monreale's office. Please fill in the doodle: https://doodle.com/poll/qp42cqcbi4cy95f9
- The results of the written exam of 17.01.2018 are online: results-2018-1-17.pdf
- Mid-term Exam question time: Wed 15th November 16.00-17.00 in Anna Monreale's office.
- The results of the first mid-term are online: 2017-10-30-first-midterm.pdf. For any problems or questions about the written exam, please contact Riccardo Guidotti and Anna Monreale.
- Please fill in the doodle about the project groups. In the Participant field, insert the list of surnames of the group members. I sent the doodle link by email. If you do not have access to the DM group, please send me an email following the instructions below.
- Please send an email to anna [dot] monreale [at] unipi [dot] it with:
- subject: DATA MINING
- content: your name, your surname, your studentID, the credits of your exam (12CFU, 6CFU, 9CFU)
Learning goals
… a new kind of professional has emerged, the data scientist, who combines the skills of software programmer, statistician and storyteller/artist to extract the nuggets of gold hidden under mountains of data. Hal Varian, Google’s chief economist, predicts that the job of statistician will become the “sexiest” around. Data, he explains, are widely available; what is scarce is the ability to extract wisdom from them.
Data, data everywhere. The Economist, Special Report on Big Data, Feb. 2010.
The wide availability of data from relational databases, the web and other sources motivates the study of data analysis techniques that allow a better understanding and an easier use of the results in decision-making processes. The aim of the course is to provide an introduction to the basic concepts of the knowledge discovery process, to the main data mining techniques and to the related algorithms. Particular emphasis is given to the methodological aspects, presented through some classes of paradigmatic applications such as market basket analysis, market segmentation and fraud detection. Finally, the course introduces the privacy and ethical aspects of using inference techniques on data, of which the analyst must be aware. The course consists of the following parts:
- the basic concepts of the knowledge discovery process: data understanding and preparation, types of data, measures and data similarity;
- the main data mining techniques (association rules, classification and clustering), studied in both their formal and implementation aspects;
- some case studies in marketing and customer management support, fraud detection and epidemiological studies;
- the last part of the course introduces the privacy and ethical aspects of using inference techniques on data, of which the analyst must be aware.
Reading about the "data scientist" job
- Data, data everywhere. The Economist, Feb. 2010 download
- Data scientist: The hot new gig in tech, CNN & Fortune, Sept. 2011 link
- Welcome to the yotta world. The Economist, Sept. 2011 download
- Data Scientist: The Sexiest Job of the 21st Century. Harvard Business Review, Sept 2012 link
- Il futuro è già scritto in Big Data. Il Sole 24 Ore, Sept 2012 link
- Special issue of Crossroads - The ACM Magazine for Students - on Big Data Analytics download
- Peter Sondergaard, Gartner, Says Big Data Creates Big Jobs. Oct 22, 2012: YouTube video
- Towards Effective Decision-Making Through Data Visualization: Six World-Class Enterprises Show The Way. White paper at FusionCharts.com. download
Hours and rooms
DM1 & DM
Classes
Day of Week | Hour | Room |
---|---|---|
Wednesday | 14:00 - 16:00 | C1 |
Thursday | 16:00 - 18:00 | C1 |
Friday | 11:00 - 13:00 | A1 |
Office hours:
- Prof. Pedreschi: Monday 14:00 - 16:00, Dipartimento di Informatica
- Prof. Monreale: Thursday 14:00 - 16:00, Dipartimento di Informatica
- Dr. Guidotti: Wednesday 16:00 - 18:00, Dipartimento di Informatica
DM 2
Classes
Day of week | Hour | Room |
---|---|---|
Thursday | 14:00 - 16:00 | A1 |
Friday | 16:00 - 18:00 | C1 |
Office hours:
- Nanni: by appointment via email, c/o ISTI-CNR
Learning material
Textbook
- Pang-Ning Tan, Michael Steinbach, Vipin Kumar. Introduction to Data Mining. Addison Wesley, ISBN-13: 978-0-13-312890-1, 2018 (Second Edition)
- Chapters 4, 6 and 8 are also available at the publisher's Web site.
- Berthold, M.R., Borgelt, C., Höppner, F., Klawonn, F. Guide to Intelligent Data Analysis. Springer Verlag, 1st Edition, 2010. ISBN 978-1-84882-259-7
- Laura Igual et al. Introduction to Data Science: A Python Approach to Concepts, Techniques and Applications. 1st ed. 2017 Edition.
- Jake VanderPlas. Python Data Science Handbook: Essential Tools for Working with Data. 1st Edition.
Slides of the classes
- The slides used in the course will be added to the calendar after each class. Most of them are taken from the slides provided by the textbook's authors: Slides for "Introduction to Data Mining".
Past Exams
- Some texts of past exams for DM1 (6CFU):
- Some solutions of past exams containing exercises on KNN and Naive Bayes classifiers, DM1 (9CFU):
- Some exercises (partially with solutions) on sequential patterns and time series can be found in the following exam texts from past years:
- Some very old exercises (some of them with solutions) are available here. Most of them are in Italian, and not all of them are on topics covered in this year's program:
Data mining software
- KNIME The Konstanz Information Miner. Download page
- Python - Anaconda (2.7 version!!!): Anaconda is the leading open data science platform powered by Python. Download page (the following libraries are already included)
- Scikit-learn: Python library with tools for data mining and data analysis. Documentation page
- Pandas: pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. Documentation page
- WEKA Data Mining Software in JAVA. University of Waikato, New Zealand Download page
Class calendar (2017-2018)
First part of course, first semester (DM1 - Data mining: foundations & DM - Data Mining)
# | Day | Room | Topic | Learning material | Instructor |
---|---|---|---|---|---|
1. | 20.09.2017 14:00-16:00 | C1 | Introduction | | Pedreschi |
2. | 21.09.2017 16:00-18:00 | C1 | Introduction | Introduction | Pedreschi |
3. | 22.09.2017 11:00-13:00 | A1 | Lecture canceled | | Pedreschi |
4. | 27.09.2017 14:00-16:00 | C1 | Data Understanding | Data Understanding. For this topic we suggest: “Guide to Intelligent Data Analysis” | Monreale |
5. | 28.09.2017 16:00-18:00 | C1 | Introduction to Python, Knime | python_tutorial knime_tutorial | Monreale/Guidotti |
6. | 29.09.2017 11:00-13:00 | A1 | Data Understanding | | Pedreschi |
7. | 04.10.2017 14:00-16:00 | C1 | Data Preparation | 4.data_preparation.pdf | Pedreschi |
8. | 05.10.2017 16:00-18:00 | C1 | Data Preparation | | Pedreschi |
9. | 06.10.2017 11:00-13:00 | A1 | Canceled | | |
10. | 11.10.2017 14:00-16:00 | C1 | Knime - Python: Data Understanding | Pandas knime_data_understanding python_data_understanding | Pedreschi/Guidotti |
11. | 12.10.2017 16:00-18:00 | C1 | Clustering analysis: Centroid-based methods. | dm2014_clustering_intro.pdf dm2014_clustering_kmeans.pdf | Pedreschi |
12. | 13.10.2017 11:00-13:00 | A1 | Hierarchical methods. | dm2014_clustering_hierarchical.pdf | Pedreschi |
13. | 18.10.2017 14:00-16:00 | C1 | Clustering analysis: Density-based methods. Exercises on Data Understanding | dm2014_clustering_dbscan.pdf exercises-dm1.pdf | Monreale/Guidotti |
14. | 19.10.2017 16:00-18:00 | C1 | Exercises on Clustering | Online Didactic Data Mining | Monreale/Guidotti |
15. | 20.10.2017 11:00-13:00 | A1 | Knime - Python: Clustering | knime_clustering python_clustering | Monreale/Guidotti |
16. | 25.10.2017 14:00-16:00 | C1 | Clustering Validation | dm2014_clustering_validation.pdf | Monreale |
17. | 26.10.2017 16:00-18:00 | C1 | Exercises on Clustering | 2016-01-18-dm1-prima.pdf dm-clustering.pdf | Monreale |
18. | 27.10.2017 11:00-13:00 | A1 | Canceled | | |
 | 30.10.2017 14:00-18:00 | A1,C1 | First Mid-term test | | |
19. | 08.11.2017 14:00-16:00 | C1 | Frequent Pattern & Association Rules | restructured_assoc.pdf Chapter 6 of textbook (skip sections 6.4.2, 6.5, 6.6, 6.7.2, 6.8) | Pedreschi |
20. | 09.11.2017 16:00-18:00 | C1 | Frequent Pattern & Association Rules | | Pedreschi |
21. | 10.11.2017 11:00-13:00 | A1 | Knime - Frequent Patterns & Association Rules | knime_pattern_mining python_pattern_mining Borgelt Web Page | Guidotti / Pedreschi |
22. | 15.11.2017 14:00-16:00 | C1 | Classification/1 | 11.chap4_basic_classification.pdf | Pedreschi |
23. | 16.11.2017 16:00-18:00 | C1 | Classification/2 | | Monreale |
24. | 17.11.2017 11:00-13:00 | A1 | Knime - Python: Classification | knime_classification python_classification | Guidotti/Pedreschi |
25. | 22.11.2017 14:00-16:00 | C1 | Classification/3 | | Pedreschi |
26. | 23.11.2017 16:00-18:00 | C1 | Exercises on Classification & Frequent Patterns | exercises-c-ar.pdf | Guidotti/Pedreschi |
 | 24.11.2017 11:00-13:00 | A1 | Canceled. The next lectures are dedicated to the 9-credit DM course | | |
27. | 29.11.2017 14:00-16:00 | C1 | Alternative methods for clustering | 1-alternative-clustering.pdf | Monreale |
28. | 30.11.2017 16:00-18:00 | C1 | Transactional Clustering | 2-transactionalclustering.pdf exercises-clustering-rock.pdf Papers on Clustering | Monreale |
29. | 01.12.2017 11:00-13:00 | A1 | Alternative methods for classification/1 | K-NN & Naive Bayes | Pedreschi |
30. | 06.12.2017 14:00-16:00 | C1 | Alternative methods for classification/2 | Ensemble methods Wisdom of the crowd & Ensemble methods Galton's Vox Populi | Pedreschi |
31. | 07.12.2017 16:00-18:00 | C1 | Exercises on clustering and classification | exercises-clope.pptx exercises_classification_3cfu.pdf | Monreale |
32. | 13.12.2017 14:00-16:00 | C1 | Alternative method for frequent patterns and AR | fp-growth.pdf | Monreale |
33. | 14.12.2017 16:00-18:00 | C1 | Alternative methods for classification/2 | Rule-based classification | Pedreschi |
34. | 15.12.2017 11:00-13:00 | A1 | Exercises on the second part of the course | esercitazione20171215 | Guidotti/Pedreschi |
35. | 20.12.2017 14:00-17:00 | A1,C1 | Second Mid-term test: See Mid-term section for details | | |
Second part of course, second semester (DMA - Data mining: advanced topics and case studies)
# | Day | Room | Topic | Learning material | Instructor (default: Nanni) |
---|---|---|---|---|---|
1. | 22.02.2018 14:00-16:00 | A1 | Introduction + Sequential patterns/1 | Introduction Sequential patterns | |
2. | 23.02.2018 16:00-18:00 | C1 | Sequential patterns/2 | | |
 | | A1 | Cancelled | | |
3. | 02.03.2018 16:00-18:00 | C1 | Sequential patterns/3 | Exercises from past exams: dm2_exam.2017.10.30.pdf dm2_mid-term_exam.2017.04.07.pdf | |
4. | 08.03.2018 14:00-16:00 | A1 | Sequential patterns/4 + Time series/1 | Sequential pattern tools: Link to SPMF + sample dataset, Python educational implementation (source), Knime example. Slides: Time Series (updated) | |
 | | C1 | Cancelled | | |
5. | 15.03.2018 14:00-16:00 | A1 | Time series/2 | Python preprocessing, Python DTW | |
6. | 16.03.2018 16:00-18:00 | C1 | Time series/3 | Book chapter about DTW (from Meinard Müller's book) | |
7. | 22.03.2018 14:00-16:00 | A1 | Time series/4 | Python structural distances. Exercises from past exams: dm2_exam.2017.10.30.pdf dm2_mid-term_exam.2017.04.07.pdf | |
8. | 23.03.2018 16:00-18:00 | C1 | Exercises | Exercises from past exams | |
 | 10.04.2018 16:00-18:00 | E | Mid-term exam | | |
9. | 12.04.2018 14:00-16:00 | A1 | Classification: alternative methods/1 | Slides on the wisdom of the crowds, Original 1907 Nature paper by Francis Galton "Vox populi" | Pedreschi |
10. | 13.04.2018 16:00-18:00 | C1 | Classification: alternative methods/2 | Slides on K-nearest neighbours and Naive Bayes | |
11. | 19.04.2018 14:00-16:00 | A1 | Classification: alternative methods/3 | Slides on ANNs and Support Vector Machines | |
12. | 20.04.2018 16:00-18:00 | C1 | Classification: exercises | Exercises from past exams | |
13. | 26.04.2018 14:00-16:00 | A1 | Classification: evaluation/1 | Model performances, Unbalanced classes and Scoring models | |
14. | 27.04.2018 16:00-18:00 | C1 | Classification: evaluation/2 | Classification weights, Lift chart examples. Homework! | |
15. | 03.05.2018 14:00-16:00 | A1 | DM process/1 | Python sample classification & evaluation, Example AMRP (also described in this report, in Italian) | |
16. | 04.05.2018 16:00-18:00 | C1 | DM process/2 | CRISP-DM, Sample project with CRISP-DM, Link to "First Aid for Data Scientist" web site (pwd: datamining_2018; contains slides and the quizzes to pass, see exam instructions). | |
17. | 10.05.2018 14:00-16:00 | A1 | Outlier detection/1 | Slides | |
18. | 11.05.2018 16:00-18:00 | C1 | Outlier detection/2 | | |
19. | 17.05.2018 14:00-16:00 | A1 | Outlier detection/3 | Python examples, Knime examples, link to ELKI framework, test dataset for ELKI | |
20. | 18.05.2018 16:00-18:00 | C1 | Exercises | Ex on outlier detection, Ex on classification | |
21. | 25.05.2018 16:00-18:00 | C1 | Exercises | exercises_25.05.2018.zip | |
 | 01.06.2018 16:00-18:00 | E | 2nd Mid-term exam | | |
 | 08.06.2018 14:00-17:00 | C | Oral exams | Reserved for students who passed the mid-term written exams | |
Exams
Exam DM part I (DMF)
The exam is composed of three parts:
- A written exam, with exercises and questions about methods and algorithms presented during the classes. It can be replaced by the first and second mid-term tests of October and December.
- An oral exam that includes: (1) discussing the project report with a group presentation; (2) discussing topics presented during the classes, including the theory of the parts already covered by the written exam.
- A project, consisting of exercises that require the use of data mining tools for data analysis. Exercises include: data understanding, clustering analysis, frequent pattern mining, and classification. The project must be carried out by groups of at most 4 people, using Knime, Python or a combination of them. The results of the different tasks must be reported in a single paper of at most 20 pages, including figures. The project must be delivered at least 2 days before the oral exam. The paper must be emailed to datamining [dot] unipi [at] gmail [dot] com. Please use “[DM 2017-2018] Project” as the subject. Tasks of the project:
- Data Understanding (Assigned on: 03/10/2017): Explore the dataset with the analytical tools studied and write a concise “data understanding” report describing data semantics and assessing data quality, the distribution of the variables and the pairwise correlations.
- Clustering analysis (Assigned on: 14/11/2017): Explore the dataset using various clustering techniques. Carefully describe your decisions for each algorithm and the advantages provided by the different approaches. (see Guidelines for details)
- Association Rules (Assigned on: 21/11/2017): Explore the dataset using frequent pattern mining and association rule extraction. Then use the rules to predict a variable, either to replace missing values or to predict whether an employee will leave prematurely. (see Guidelines for details)
- Classification (Assigned on: 12/12/2017): Explore the dataset using classification trees and random forests. Use them to predict whether an employee will leave prematurely. (see Guidelines for details)
- Project 1
- Dataset: HRA (Human Resources Analytics)
- Assigned: 29/09/2017
- Deadline: 06/01/2018
Guidelines for the project are here.
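As an illustration of what the Data Understanding task asks for (data quality, distributions and pairwise correlations), here is a minimal sketch in plain Python. The records, the column names (`hours`, `projects`, `satisfaction`) and the helper functions `summarize` and `pearson` are invented for this example and are not taken from the HRA dataset or from the project guidelines.

```python
import math
import statistics

# Toy records standing in for the project dataset; column names and
# values are invented for illustration. None marks a missing value.
records = [
    {"hours": 150, "projects": 2, "satisfaction": 0.80},
    {"hours": 200, "projects": 4, "satisfaction": 0.60},
    {"hours": 250, "projects": 6, "satisfaction": 0.40},
    {"hours": 300, "projects": 8, "satisfaction": None},
]


def summarize(rows, column):
    """Return the number of missing values, the mean and the standard deviation."""
    values = [r[column] for r in rows if r[column] is not None]
    return {
        "missing": len(rows) - len(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
    }


def pearson(rows, col_a, col_b):
    """Pearson correlation over the rows where both columns are present."""
    pairs = [(r[col_a], r[col_b]) for r in rows
             if r[col_a] is not None and r[col_b] is not None]
    xs, ys = zip(*pairs)
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


columns = ["hours", "projects", "satisfaction"]
summary = {c: summarize(records, c) for c in columns}
correlations = {
    (a, b): pearson(records, a, b)
    for i, a in enumerate(columns)
    for b in columns[i + 1:]
}

for column, stats in summary.items():
    print(column, stats)
for pair, r in correlations.items():
    print(pair, round(r, 3))
```

With Pandas, one of the libraries listed in the software section, `DataFrame.describe()` and `DataFrame.corr()` give comparable summaries in one call each; the plain-Python version only makes the computation explicit.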
Exam DM part II (DMA)
The exam is composed of three parts:
- A written exam, with exercises and questions about methods and algorithms presented during the classes. It can be replaced by the first and second mid-term tests of April and June.
- A small online test for the data ethics part. The test can be taken at the following link: Link to "First Aid for Data Scientist" web site (pwd: datamining_2018). Register, and enroll in the “First Aid for Data Scientist” course. Take the quizzes of the 3 units. Then, download your certificate and send it to mirco [dot] nanni [at] isti [dot] cnr [dot] it before the oral exam.
- An oral exam that includes: (1) discussing the project report with a group presentation; (2) discussing topics presented during the classes, including the theory of the parts already covered by the written exam.
- A project, consisting of exercises that require the use of data mining tools for data analysis. Exercises include: sequential patterns, time series, classification (alternative methods and validation), outlier detection. The project must be carried out by groups of at most 3 people, using Knime, Python, other software or a combination of them. The results of the different tasks must be reported in a single paper of at most 20 pages, including figures. The project must be delivered at least 2 days before the oral exam:
- Time series: given the 50+ year history of the stock values of a company, split it into years and study their similarities, also using clustering. Objectives: compare similarities, compute clustering. Dataset: IBM stocks (source: Yahoo Finance), including a Python snippet to read and split the data.
- Sequential patterns: discover patterns over the stock value time series above. Before that, preprocess the data by splitting it into monthly time series and discretizing them in some way. Objective: find motif-like patterns (i.e. frequent contiguous subsequences) of length at least 4 days. Dataset: same as the point before.
- (Alternative) Classification methods: test different classification methods over a simple classification problem. Dataset: the UCI Abalone dataset, containing various features of abalones, including the age, to be inferred from the number of rings. Objective: (i) discard the “Infant” abalones; (ii) discretize the attribute “Number of rings” into 2 classes; (iii) try at least 3 different classification methods (among those discussed in DM2, including ensemble methods) on the resulting dataset, using the discretized number of rings as class, and evaluate them with cross-validation.
- Outlier detection: from the Abalone dataset used above, identify the top 1% outliers. Objective: adopt at least two different methods belonging to different families (e.g. model-based, distance-based, density-based, angle-based, …) to identify the 1% of input records with the highest likelihood of being outliers, and compare the results. Dataset: same as the point before.
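The sequential pattern task above leaves the discretization open ("in some way"). The following is one possible sketch, not the method required by the course: it encodes each day-to-day relative change as a symbol and counts the contiguous subsequences shared by several monthly series. The symbols `U`/`D`/`S`, the 1% threshold, the helper names `discretize` and `frequent_contiguous`, and the toy prices are all assumptions made for illustration.

```python
from collections import Counter


def discretize(prices, threshold=0.01):
    """Map consecutive price changes to symbols: U (up), D (down), S (stable)."""
    symbols = []
    for previous, current in zip(prices, prices[1:]):
        change = (current - previous) / previous
        if change > threshold:
            symbols.append("U")
        elif change < -threshold:
            symbols.append("D")
        else:
            symbols.append("S")
    return "".join(symbols)


def frequent_contiguous(sequences, length=4, min_support=2):
    """Return the contiguous subsequences of the given length, with the number
    of sequences containing them, keeping those that reach min_support."""
    support = Counter()
    for sequence in sequences:
        # A set ensures each pattern is counted once per sequence.
        windows = {sequence[i:i + length]
                   for i in range(len(sequence) - length + 1)}
        support.update(windows)
    return {p: s for p, s in support.items() if s >= min_support}


# Invented toy "monthly" price series; real data would come from the project dataset.
months = [
    [100, 102, 104, 102, 100, 100, 103],
    [50, 51, 52, 51, 50, 50, 50],
    [80, 80, 78, 76, 78, 80, 82],
]
encoded = [discretize(m) for m in months]
patterns = frequent_contiguous(encoded, length=4, min_support=2)
print(encoded)
print(patterns)
```

In this sketch the support of a pattern is the number of monthly series that contain it at least once. Each symbol describes the change between two consecutive days, so a pattern of 4 symbols spans 5 days; to cover "at least 4", the same function can be called again with larger values of `length`.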
Exam sessions
Mid-term exams
Exam | Date | Hour | Place | Notes | Marks |
---|---|---|---|---|---|
First Mid-term 2017 | 30.10.2017 | 14:00 - 17:00 | Room A1, C1 | Please use the system for registration: https://esami.unipi.it/ | |
Second Mid-term 2017 | 20.12.2017 | 14:00 - 17:00 | Room A1, C1 | Please use the system for registration: https://esami.unipi.it/ | |
DM2: first mid-term 2018 | 10.04.2018 | 16:00 - 18:00 | Room E | Please use the system for registration: https://esami.unipi.it/ | Results |
DM2: second mid-term 2018 | 01.06.2018 | 16:00 - 18:00 | Room E | Please use the system for registration: https://esami.unipi.it/ | Results |
DM2: oral exam 2018 | 08.06.2018 | 14:00 - 17:00 | Room C | For students who passed the mid-term written exams. Please use the system for registration: https://esami.unipi.it/ | |
Regular exam sessions
Session | Date | Time | Room | Notes | Marks |
---|---|---|---|---|---|
1. | 10 Jan 2018 | 09:00 | C1 | Oral exam for students who passed the mid-term exam and delivered the project work. https://esami.unipi.it/ | |
2. | 17 Jan 2018 | 09:00 | A1 | Written exam. On the same date we will set the dates for the next oral exams. https://esami.unipi.it/ | |
3. | 06 Feb 2018 | 09:00 | C | Written exam. On the same date we will set the dates for the next oral exams. https://esami.unipi.it/ | |
4. | 12 June 2018 | 09:00 | A1 | Written exam. On the same date we will set the dates for the next oral exams. https://esami.unipi.it/ | DM2 Results |
5. | 3 July 2018 | 09:00 | A1 | Written exam. On the same date we will set the dates for the next oral exams. https://esami.unipi.it/ | |
6. | 13 September 2018 | 09:00 | A1 | Written exam. On the same date we will set the dates for the next oral exams. https://esami.unipi.it/ | |
Extra sessions A.A. 2016/17
Date | Time | Room | Notes | Results |
---|---|---|---|---|
30.10.2017 | 14:00 - 18:00 | Room A1, C1 | | |
20.12.2017 | | Room A1, C1 | | |