Data Mining A.A. 2014/15
DM 1: Foundations of Data Mining
Instructors:
- Dino Pedreschi, Fosca Giannotti
- KDD Laboratory, Università di Pisa and ISTI - CNR, Pisa
Teaching assistant:
- Letizia Milli
- KDD Laboratory, Università di Pisa and ISTI - CNR, Pisa
DM 2: Advanced topics on Data Mining and case studies
Instructors:
- Mirco Nanni, Anna Monreale
- KDD Laboratory, Università di Pisa and ISTI - CNR, Pisa
News
- [21/09/2015] The results of the written exam for DM II of 09.09.2015 are available here: DMII-Results-Sept
- [22/07/2015] The results of the written exam for DM II of 17.07.2015 are available here: Results
- [06/07/2015] The results of the written exam for DM II of 26.06.2015 are available here: Results
- [30/06/2015] The results of the written exam for DM I of 26.06.2015 are available here: result
- [14/06/2015] The results of the written exam for DM II of 05.06.2015 are available here: Results 05.06.2015
- [18/05/2015] The results of the midterm exam for DM II are available here: Results 13.04.2015
- [03/03/2015] The midterm test for DM I and the special session for exams for DM II will take place on April 13th, 2015 in room C1, at 9 a.m.
- [21/02/2015] Results of the DM I written exam are available here: Data Mining I:result
- [19/02/2015] The first lesson of Data Mining 2 will take place on Monday, Feb. 23rd, in room N1.
- [16/02/2015] The next oral exams will be on Monday 23 February 2015 at 11:00 and Monday 2 March 2015 at 11:00 at Pedreschi's office. Note that you have to send an email (milli [at] di [dot] unipi [dot] it or dino [dot] pedreschi [at] di [dot] unipi [dot] it) to register for the oral exam.
- [21/01/2015] Results of the DM I written exam are available here: Data Mining I: Results of written exam, January 19, 2015
- [19/01/2015] The next oral exams will be on Monday 19 January 2015 at 9:00 and Thursday 29 January 2015 at 14:00 at Pedreschi's office. Note that you have to send an email (milli [at] di [dot] unipi [dot] it or dino [dot] pedreschi [at] di [dot] unipi [dot] it) to register for the oral exam.
- [15/12/2014] The texts for the third and fourth exercises have been released. Deadline: three days before the oral exam.
- [12/12/2014] Today's lesson is cancelled due to a strike. It is moved to Monday 15/12/2014 at 16:00.
- [12/12/2014] The evaluation of the second homework is online
- [12/12/2014] The evaluation of the first homework is online
- [24/11/2014] On 27 and 28 November the KDD Lab holds its annual workshop, open to everyone interested: KddLab Workshop
- [07/11/2014] The texts for the first and second exercises have been released. Deadline: 28/11/2014.
- [17/10/2014] Extra exam session for Academic Year 2013/2014: Friday 7 November 2014, 9:00-11:00, room C1
- Request for collaboration on the scientific research project MOTUS - Mobility and Tourism in Urban Scenarios. Devote 2 hours of your time next Thursday, 16 October, at the Dipartimento di Informatica to testing and evaluating new mobility apps. Details and registration: focus group MOTUS
- New hours: Monday 16:00-18:00, Room C; Friday 14:00-16:00, Room A1
- Because of instructor commitments made before the schedule change, the class of Monday 13 October will begin at 16:30.
- [07/10/2014] The Doodle poll to decide whether to move the Thursday class is online here. Please state your availability by the class of Thursday 9 October.
- [07/10/2014] You are all invited to the Internet Festival event “Big Data e la mobilità del futuro” on Saturday 11 October, from 10:00 to 18:30, in the Aula Magna of Polo Fibonacci (building E). Big Data e la Mobilità del Futuro
- [25/09/2014] Today's class is replaced by the BRIGHT event at CNR in Pisa - Big Data Tales Notte dei Ricercatori al CNR di Pisa
Learning goals
… a new kind of professional has emerged, the data scientist, who combines the skills of software programmer, statistician and storyteller/artist to extract the nuggets of gold hidden under mountains of data. Hal Varian, Google’s chief economist, predicts that the job of statistician will become the “sexiest” around. Data, he explains, are widely available; what is scarce is the ability to extract wisdom from them.
Data, data everywhere. The Economist, Special Report on Big Data, Feb. 2010.
The wide availability of data from relational databases, the web and other sources motivates the study of data analysis techniques that allow a better understanding and an easier use of the results in decision-making processes. The goal of the course is to provide an introduction to the basic concepts of the knowledge extraction process, to the main data mining techniques and to the related algorithms. Particular emphasis is given to methodological aspects, presented through some paradigmatic classes of applications such as Market Basket Analysis, market segmentation and fraud detection. Finally, the course introduces the privacy and ethical aspects inherent in the use of inference techniques on data, of which the analyst must be aware. The course consists of the following parts:
- the basic concepts of the knowledge extraction process: data study and preparation, forms of data, data measures and similarity;
- the main data mining techniques (association rules, classification and clustering). Both the formal and the implementation aspects of these techniques are studied;
- some case studies in marketing and customer management support, fraud detection and epidemiological studies;
- the last part of the course introduces the privacy and ethical aspects inherent in the use of inference techniques on data, of which the analyst must be aware.
Reading about the "data scientist" job
- Data, data everywhere. The Economist, Feb. 2010 download
- Data scientist: The hot new gig in tech, CNN & Fortune, Sept. 2011 link
- Welcome to the yotta world. The Economist, Sept. 2011 download
- Data Scientist: The Sexiest Job of the 21st Century. Harvard Business Review, Sept 2012 link
- Il futuro è già scritto in Big Data. Il Sole 24 Ore, Sept 2012 link
- Special issue of Crossroads - The ACM Magazine for Students - on Big Data Analytics download
- Peter Sondergaard, Gartner, Says Big Data Creates Big Jobs. Oct 22, 2012: YouTube video
- Towards Effective Decision-Making Through Data Visualization: Six World-Class Enterprises Show The Way. White paper at FusionCharts.com. download
Hours and rooms
DM 1
Classes
Day | Hours | Room |
---|---|---|
Monday | 16:00 - 18:00 | Room C |
Friday | 14:00 - 16:00 | Room A1 |
Office hours:
- Prof. Pedreschi: Monday 14:00 - 16:00, Dipartimento di Informatica
- Giannotti/Milli: appointment by email, c/o ISTI-CNR
DM 2
Classes
Day of week | Hour | Room |
---|---|---|
Monday | 9:00 - 11:00 | Room N1 |
Thursday | 9:00 - 11:00 | Room A1 |
Office hours:
- Nanni / Monreale: appointment by email, c/o ISTI-CNR
Learning Material
Textbook
- Pang-Ning Tan, Michael Steinbach, Vipin Kumar. Introduction to Data Mining. Addison Wesley, ISBN 0-321-32136-7, 2006
- Chapters 4, 6 and 8 are also available at the publisher's Web site.
Slides of the classes
- The slides used in the course will be added to the calendar after each class. Most of them are taken from the slides provided by the textbook's authors: Slides for "Introduction to Data Mining".
Past exams
- In addition to the texts and (where available) the solutions of the exams of recent years, the following exercises proposed in previous years are available.
Data mining software
- KNIME The Konstanz Information Miner. Download page
- WEKA Data Mining Software in JAVA. University of Waikato, New Zealand Download page
Class calendar (2014-2015)
First part of course, first semester (DMF - Data mining: foundations)
# | Day | Room | Topic | Learning material | Instructor |
---|---|---|---|---|---|
1. | 25.09.2014 14:00-16:00 | B | Intro: data mining & knowledge discovery process | Textbook, Chapt. 1 dm_intro-2011.pdf | Pedreschi |
2. | 26.09.2014 16:00 | CNR | BRIGHT event at CNR in Pisa - Big Data Tales | | Pedreschi |
3. | 02.10.2014 14:00-16:00 | B | Intro: data mining & knowledge discovery process | Textbook, Chapt. 1 dm_intro-2011.pdf | Pedreschi |
4. | 03.10.2014 14:00-16:00 | A1 | Intro: data mining & knowledge discovery process | Textbook, Chapt. 1 dm_intro-2011.pdf | Pedreschi |
5. | 09.10.2014 14:00-16:00 | B | Data: types and basic measures | Textbook, Chapt. 2 chap2_data_new.pdf | Pedreschi |
6. | 10.10.2014 14:00-16:00 | A1 | Data: types and basic measures | Textbook, Chapt. 2 chap2_data_new.pdf | Pedreschi |
7. | 13.10.2014 14:00-16:00 | B | Data: types and basic measures | Textbook, Chapt. 2 chap2_data_new.pdf | Pedreschi |
8. | 17.10.2014 14:00-16:00 | A1 | Cancelled | | Pedreschi |
9. | 20.10.2014 14:00-16:00 | B | Exploratory data analysis and data understanding. | Textbook, Chapt. 3 chap3_data_exploration.pdf | Pedreschi |
10. | 24.10.2014 14:00-16:00 | A1 | Clustering analysis. Centroid-based methods | Textbook, Chapt. 8 dm2014_clustering_intro.pdf dm2014_clustering_kmeans.pdf | Pedreschi |
11. | 27.10.2014 14:00-16:00 | B | Clustering analysis. Hierarchical methods | Textbook, Chapt. 8 dm2014_clustering_hierarchical.pdf | Pedreschi |
12. | 31.10.2014 14:00-16:00 | A1 | Tutorial on KNIME | Slides: knime_slides_dm.pdf Workflows: data-manipulation_iris.zip data-manipulation_adult.zip clustering_iris.zip | Pedreschi |
13. | 10.11.2014 14:00-16:00 | B | Clustering analysis. Density-based methods | Textbook, Chapt. 8 dm2014_clustering_dbscan.pdf | Pedreschi |
14. | 14.11.2014 14:00-16:00 | A1 | Classification and predictive methods | Textbook, Chapt. 4 chap4_basic_classification.pdf | Pedreschi |
15. | 17.11.2014 14:00-16:00 | B | Classification. Decision trees | Textbook, Chapt. 4 chap4_basic_classification.pdf | Pedreschi |
16. | 21.11.2014 14:00-16:00 | A1 | Classification. Decision trees | Textbook, Chapt. 4 chap4_basic_classification.pdf | Pedreschi |
17. | 24.11.2014 14:00-16:00 | B | Classification. Validation and Weka & KNIME Lab | Workflows: decisiontreeiris.zip decisiontreeadult.zip decisiontreeadultoverfitting.zip | Milli |
18. | 28.11.2014 14:00-16:00 | A1 | Classification. Rule-based and Bayesian methods | Textbook, Chapt. 4 chap4_basic_classification.pdf | Pedreschi |
19. | 01.12.2014 14:00-16:00 | B | Frequent Pattern Mining. | Textbook, Chapt. 6 2-3tdm-restructured_assoc_2013.pdf | Pedreschi |
20. | 05.12.2014 14:00-16:00 | A1 | Association Rule Mining | Textbook, Chapt. 6 2-3tdm-restructured_assoc_2013.pdf | Pedreschi |
21. | 12.12.2014 14:00-16:00 | A1 | Cancelled due to strike | | Pedreschi |
22. | 15.12.2014 14:00-16:00 | B | Association Rule Mining and KNIME | Workflow: FP and AR | Monreale |
Second part of course, second semester (DMA - Data mining: advanced topics and case studies)
# | Day | Room | Topic | Learning material | Instructor |
---|---|---|---|---|---|
1. | 23.02.2015 09:00-11:00 | N1 | Introduction + Sequential patterns / 1 | Sequential Patterns - Slides | Nanni |
2. | 26.02.2015 09:00-11:00 | A1 | Sequential patterns / 2 | Link to Tool for seq. patterns | Nanni |
3. | 02.03.2015 09:00-11:00 | N1 | Graph mining | Slides | Nanni |
- | 05.03.2015 09:00-11:00 | A1 | - | | |
4. | 09.03.2015 09:00-11:00 | N1 | Advanced Classification Methods / 1 | Slides | Monreale |
5. | 12.03.2015 09:00-11:00 | A1 | Advanced Classification Methods / 2 | | Monreale |
6. | 16.03.2015 09:00-11:00 | N1 | Advanced Classification Methods / 3 | Exercises on Classification | Monreale |
7. | 19.03.2015 09:00-11:00 | A1 | Time series / 1 | Slides | Nanni |
8. | 23.03.2015 09:00-11:00 | N1 | Time series / 2 | Example of DTW in R | Nanni |
9. | 26.03.2015 09:00-11:00 | A1 | Exercises | Exercises from past exams | Nanni |
10. | 30.03.2015 09:00-11:00 | N1 | Exercises | | Monreale |
11. | 02.04.2015 09:00-11:00 | A1 | Exercises | | Monreale |
- | 03-07.04.2015 | | EASTER HOLIDAYS | | |
- | 13.04.2015 09:00-11:00 | C1 | Midterm test | | |
12. | 16.04.2015 09:00-11:00 | A1 | Case study: CRM - Customer Segmentation + CRISP-DM | AMRP & Stulong CRISP-DM | Nanni |
13. | 23.04.2015 09:00-11:00 | A1 | Case study: CRM - Churn Analysis | Intro CRM Churn ST-Churn | Nanni |
14. | 27.04.2015 09:00-11:00 | N1 | Case study: CRM - Promotions and Sophistication | Promotions Sophistication | Nanni |
15. | 30.04.2015 09:00-11:00 | A1 | Spatiotemporal analysis / 1 | ST Analysis REF: Survey paper | Nanni |
16. | 04.05.2015 09:00-11:00 | N1 | Spatiotemporal analysis / 2 | | Nanni |
17. | 07.05.2015 09:00-11:00 | A1 | Case study: Spatiotemporal analysis / 1 + Projects presentation | Case study 1 Projects | Nanni |
18. | 11.05.2015 09:00-11:00 | N1 | Case study: Spatiotemporal analysis / 2 | Case study 2 | Nanni |
19. | 14.05.2015 09:00-11:00 | A1 | Spatiotemporal analysis / 3 | ST Classification | Nanni |
20. | 18.05.2015 09:00-11:00 | N1 | Outlier detection | Slides from SDM2010 tutorial | Nanni |
21. | 21.05.2015 09:00-11:00 | A1 | Ethical Issues in Data Analytics | Slides | Monreale |
22. | 25.05.2015 09:00-11:00 | N1 | Ethical Issues in Data Analytics / Fraud Detection Case Study | | Monreale |
Exams
Exam DM part I (DMF)
The exam consists of a written test and an oral test:
- The written test consists essentially of exercises on the methods and algorithms seen in class. The texts of past exams are regularly put online and can be used as a general reference. The written test can be replaced by the two midterm tests: if both are passed, the average of their marks is the mark with which the student is admitted to the oral test, unless the student takes the written exam again, in which case the most recent mark replaces the previous ones (for better or worse). It is not possible to retake a single midterm test during the regular exam sessions. For the academic year 2013-2014, the midterm tests are replaced by a series of exercises proposed during the course.
- The oral test covers the more theoretical aspects of the course (definitions, methods, algorithms, etc.) discussed in class, or consists of a discussion of a bibliography agreed with the instructors.
Exam DM part II (DMA)
The exam is composed of three parts:
- A written exam, with exercises and questions about classification (advanced topics), sequential patterns, graph mining and time series.
- A project, assigned among those proposed during the classes, or proposed by the students themselves. In the latter case, they are invited to submit a short project proposal (max. 1 page) describing the data to use and the analysis objectives. The work done should be summarized in a report, to be sent to the teachers at least 2 days before the oral exam. The proposed projects are the following:
- An oral exam that includes: (1) discussing the project report with a group presentation (15 minutes for the whole group); (2) discussing topics presented during the classes, including the theory of the parts already covered by the written exam.
Exercises 2014-2015
Exercises DM First Part
Guidelines for the homework are here.
- Data Understanding: Thyroid Disease Data Set. Assigned on: 07.11.2014. To be completed by: 28.11.2014. Send papers (3 pages max of text, figures excluded) by email to datamining [dot] unipi [at] gmail [dot] com. Use “[DM] exercise 1” in the subject. Download the Thyroid Disease Data Set (in CSV format, zipped). This data set is one of several databases about thyroid disease available at the UCI repository, http://archive.ics.uci.edu/ml/datasets/Thyroid+Disease, where you can also find the data description. Explore the dataset with the analytical tools of KNIME or Weka (or whatever you like) and write a concise “data understanding” report describing the data semantics and assessing data quality, the distribution of the variables and the pairwise correlations.
- Clustering analysis: Thyroid Disease Data Set. Assigned on: 07.11.2014. To be completed by: 28.11.2014. Send papers (3 pages max of text, figures excluded) by email to datamining [dot] unipi [at] gmail [dot] com. Use “[DM] exercise 2” in the subject. Download the Thyroid Disease Data Set (in CSV format, zipped). Perform an adequate data understanding phase, and then a clustering analysis with any of the studied methods, using an appropriate subset of variables. Determine an adequate number of clusters, if any, and try to explain the properties of the discovered clusters (or else argue why this dataset does not exhibit a clustering structure).
- Market Basket Analysis: SuperMarket dataset. Assigned on: 15.12.2014. To be completed by: three days before the oral exam. Send papers (3 pages max of text, figures excluded) by email to datamining [dot] unipi [at] gmail [dot] com. Use “[DM] exercise 3” in the subject. Download the SuperMarket dataset (in CSV format, zipped). Given a database of customer transactions of a supermarket, find the sets of frequently co-purchased items and analyse the most interesting association rules that can be derived from the frequent patterns. Provide a short document which illustrates the input dataset, the adopted frequent pattern algorithm and the association rule analysis, discussing your findings related to the most interesting rules. The database is composed of two files: (1) transactions.csv, containing the customer transactions, where each row contains a SCONTRINO_ID (transaction code) and a COD_MKT_ID (the code of the item purchased); (2) segments-description.csv, containing the full description of each item. For each COD_MKT_ID you can find information about the CATEGORY, SECTOR, AREA, SEGMENT and so on. Perform the analysis at the segment level.
- Classification: Serie A player statistics dataset. Assigned on: 15.12.2014. To be completed by: three days before the oral exam. Send papers (3 pages max of text, figures excluded) by email to datamining [dot] unipi [at] gmail [dot] com. Use “[DM] exercise 4” in the subject. Download the dataset here: serieA_dataset (in CSV format, zipped). The dataset contains two files: serieA_aggregated_player.csv, which contains the statistics aggregated for each player during the last soccer championship, and serieA_events.csv, which contains, for each match, all the statistics for each player. You can choose the file that you prefer. Objective: finding decision trees to predict the position of a player (Defender, Goalkeeper, Forward, Midfielder). The paper has to illustrate the input dataset, some analyses for the data understanding, the adopted classification methodology and the decision tree validation and interpretation.
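As a rough illustration of the Market Basket Analysis exercise above, the following Python sketch groups transaction rows into baskets at the segment level, computes the support of single segments and pairs, and derives rules with their confidence. It is only a minimal sketch under stated assumptions: the rows, the item codes and the segment names are invented toy values (not taken from transactions.csv or segments-description.csv), the function names are hypothetical, and itemsets are enumerated by brute force rather than with Apriori or FP-growth, so it is suitable only for tiny inputs.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_baskets(rows, segment_of):
    """Group (SCONTRINO_ID, COD_MKT_ID) rows into one set of segments per transaction."""
    baskets = defaultdict(set)
    for transaction_id, item_code in rows:
        baskets[transaction_id].add(segment_of[item_code])
    return list(baskets.values())


def frequent_itemsets(baskets, min_support, max_size=2):
    """Return {itemset: support} for itemsets of up to max_size segments."""
    counts = Counter()
    for basket in baskets:
        items = sorted(basket)  # sorted so that each itemset has one canonical tuple
        for size in range(1, max_size + 1):
            counts.update(combinations(items, size))
    total = len(baskets)
    return {s: c / total for s, c in counts.items() if c / total >= min_support}


def pair_rules(supports, min_confidence):
    """Derive rules lhs -> rhs from frequent pairs, with support and confidence."""
    rules = []
    for itemset, support in supports.items():
        if len(itemset) != 2:
            continue
        for lhs, rhs in (itemset, itemset[::-1]):
            # confidence(lhs -> rhs) = support(lhs and rhs) / support(lhs)
            confidence = support / supports[(lhs,)]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return rules


# Toy data (hypothetical): rows mimic transactions.csv, the dict mimics the
# COD_MKT_ID -> SEGMENT mapping of segments-description.csv.
rows = [(1, "a1"), (1, "b1"), (2, "a2"), (2, "b1"), (3, "a1"), (4, "b1"), (4, "c1")]
segment_of = {"a1": "pasta", "a2": "pasta", "b1": "tomato sauce", "c1": "milk"}

baskets = build_baskets(rows, segment_of)
supports = frequent_itemsets(baskets, min_support=0.5)
rules = pair_rules(supports, min_confidence=0.6)
for lhs, rhs, support, confidence in rules:
    print(f"{lhs} -> {rhs}: support={support:.2f}, confidence={confidence:.2f}")
```

Mapping item codes to segments before counting is what "at the segment level" amounts to here: two different pasta items fall into the same segment and therefore count towards the same itemset. On the real dataset, a dedicated frequent pattern implementation (for example the KNIME or Weka nodes used in class) would be the more realistic choice.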
Exam sessions
Mid-term exams
Session | Date | Hour | Place | Notes | Marks |
---|---|---|---|---|---|
Mid-term 2015 | Monday 13.04.2015 | 9.00 | Room C1 | | |
Regular exam sessions
Session | Date | Time | Room | Notes | Results |
---|---|---|---|---|---|
1. | Monday 19 January 2015 | 9.00 | C | Results of written exam | |
1. | Wednesday 21 January 2015 | 9.00 | Pedreschi's office | oral exam. Send an email to register for the oral exam | |
1. | Thursday 29 January 2015 | 14.00 | Pedreschi's office | oral exam. Send an email to register for the oral exam | |
2. | Monday 16 February 2015 | 9.00 | C | Results of written exam | |
2. | Monday 23 February 2015 | 11.00 | Pedreschi's office | oral exam. Send an email to register for the oral exam | |
2. | Monday 2 March 2015 | 11.00 | Pedreschi's office | oral exam. Send an email to register for the oral exam | |
3. | Friday 05 June 2015 | 14.00 | C | Results of written exam |
Session | Date | Time | Room | Notes | Results |
---|---|---|---|---|---|
1. | Monday 19 January 2015 | 9.00 | C | ||
2. | Monday 16 February 2015 | 9.00 | C | ||
3. | Friday 05.06.2015 | 14.00 | C | ||
4. | Friday 26.06.2015 | 14.00 | C | ||
5. | Friday 17.07.2015 | 9.00 | C | ||
6. | Wednesday 09.09.2015 | 9.00 | C |
Extra sessions A.A. 2013/14
Date | Time | Room | Notes | Results |
---|---|---|---|---|
7 November 2014 | 9:00-11:00 | C1 | | |