Data Mining (309AA) - 9 CFU A.Y. 2020/2021

Instructor:

Anna Monreale
- KDDLab, Università di Pisa
- anna [dot] monreale [at] unipi [dot] it

Teaching Assistant:

Francesca Naretto
- KDDLab, SNS, Pisa
- francesca [dot] naretto [at] sns [dot] it

News

[01.10.2020] The lecture on 09.10.2020 is canceled.
[09.09.2020] The course will be held online. Please use this link to join the class: https://teams.microsoft.com/l/team/19%3a8f6779bab74f4368ba7ce1c2b092346d%40thread.tacv2/conversations?groupId=8da15095-b6e5-41c1-a894-d418aed3983e&tenantId=c7456b31-a220-47f5-be52-473828670aa1

Learning Goals

Fundamental concepts of knowledge discovery from data
Data understanding
Data preparation
Clustering
Classification & Regression
Pattern Mining and Association Rules
Outlier Detection
Time Series Analysis
Sequential Pattern Mining
Ethical Issues

Hours and Rooms

Classes

Day of Week	Hour	Room
Wednesday	09:00 - 10:45	Online
Thursday	09:00 - 10:45	Online
Friday	11:00 - 12:45	Online

Office hours

Anna Monreale: Wednesday, 11:00-13:00, online using Teams (appointment by email)
Francesca Naretto: Monday, 15:00-18:00, online using Teams (appointment by email)

Learning Material

Textbooks

Pang-Ning Tan, Michael Steinbach, Vipin Kumar. Introduction to Data Mining. Addison Wesley, ISBN 0-321-32136-7, 2006
- http://www-users.cs.umn.edu/~kumar/dmbook/index.php
- Chapters 4, 6 and 8 are also available at the publisher's Web site.
Berthold, M.R., Borgelt, C., Höppner, F., Klawonn, F. Guide to Intelligent Data Analysis. Springer Verlag, 1st Edition, 2010. ISBN 978-1-84882-259-7
Laura Igual et al. Introduction to Data Science: A Python Approach to Concepts, Techniques and Applications. 1st Edition, 2017.
Jake VanderPlas. Python Data Science Handbook: Essential Tools for Working with Data. 1st Edition.

Slides

The slides used in the course will be added to the calendar after each class. Most of them come from the slides provided by the textbook's authors: Slides for "Introduction to Data Mining".

Software

Python - Anaconda (Python 3.7 version required): Anaconda is an open data science platform powered by Python. Download page (the libraries listed below are already included)
Scikit-learn: Python library with tools for data mining and data analysis. Documentation page
Pandas: an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. Documentation page
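
A quick way to confirm that the environment matches the requirements above is to print the Python version and the versions of the two libraries. The snippet below is only a minimal sketch: the helper name `report_versions` is hypothetical (it is not part of the course material), and it assumes only that scikit-learn is imported under the name `sklearn` and pandas under the name `pandas`.

```python
import importlib
import sys


def report_versions(module_names):
    """Return {module name: version string}, with None for modules that cannot be imported."""
    versions = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            # The library is not installed in the active environment.
            versions[name] = None
        else:
            # Most scientific libraries expose __version__; fall back to "unknown".
            versions[name] = getattr(module, "__version__", "unknown")
    return versions


if __name__ == "__main__":
    # The course asks for Python 3.7 (via Anaconda).
    print("Python", sys.version.split()[0])
    for name, version in report_versions(["sklearn", "pandas"]).items():
        print(name, version if version is not None else "NOT INSTALLED")
```

If a library is reported as not installed, it is most likely that the script is running outside the Anaconda environment.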

Class Calendar (2020/2021)

First Semester

	Day	Topic	Learning material	References
1.	16.09 09:00-10:45	Overview. Introduction to KDD	1-overview.pdf 1-intro-dm.pdf	Chap. 1 Kumar Book
2.	17.09 09:00-10:45	Data Understanding	Slides DU	Chap. 2 Kumar Book and additional resource of Kumar Book: Exploring Data. If you have the first ed. of the Kumar Book, this is Chap. 3
3.	18.09 09:00-10:45	Data Preparation	3-data_preparation.pdf	Chap. 2 Kumar Book
4.	23.09 09:00-10:45	Data Preparation: Transformations & PCA	3-data_preparation.pdf	Chap. 2 Kumar Book, Appendix B Dimensionality Reduction (only PCA)
5.	24.09 09:00-10:45	Data Similarities. Introduction to Clustering.	4-data_similarity.pdf 5-basic_cluster_analysis-intro.pdf	Data Similarity is in Chap. 2 while Clustering is in Chap. 7
6.	25.09 11:00-12:45	LAB: Data Understanding in Python	Very basic notions on Python Notebook on Data Understanding tipsdata.zip
7.	30.09 09:00-10:45	Center-based clustering: kmeans	6-basic_cluster_analysis-kmeans-variants.pdf	Chap. 7 Kumar Book
8.	01.10 09:00-10:45	Center-based clustering: Bisecting K-means, Xmeans, EM	Same slides as the previous lecture	Chap. 7 Kumar Book, Clustering & Mixture Models xmeans.pdf
9.	02.10 11:00-12:45	Hierarchical clustering	7.basic_cluster_analysis-hierarchical.pdf ex._hierarchical-clustering.pdf	Chap. 7 Kumar Book
10.	07.10 09:00-10:45	Density based clustering	8.basic_cluster_analysis-dbscan-validity.pdf	Chap. 7 Kumar Book
11.	08.10 09:00-10:45	Lab: clustering + Project Assignment	py-clustering.zip
	09.10 11:00-12:45	Lecture canceled
12.	14.10 09:00-10:45	Classification Problem + Decision trees	9.chap3_basic_classification-2020.pdf	Chap. 3 Kumar Book
13.	15.10 09:00-10:45	Only 30 minutes of Discussion on the project due to connection problems		Chap. 3 Kumar Book
14.	16.10 11:00-12:45	Decision Tree + Classifier Evaluation		Chap. 3 Kumar Book
15.	21.10 09:00-10:45	Evaluation Methods for Classification Models	9.chap3_basic_classification-2020.pdf	Chap. 3 Kumar Book + Chap. 4 Kumar Book
16.	22.10 09:00-10:45	Statistical tool for model evaluation + Rule based classification	10-rule-based-clussifiers.pdf	Chap. 3 Kumar Book + Chap. 4 Kumar Book
17.	23.10 11:00-12:45	Rule based classification + Instance-based Classification	11-knn.pptx	Chap. 4 Kumar Book
18.	28.10 09:00-10:45	Naive Bayesian Classifier + Ensemble Classifiers	12-naive_bayes.pdf 13_ensemble_2020.pdf	Chap. 4 Kumar Book
19.	29.10 09:00-10:45	SVM & NN	14_svm_2020.pdf 15_neural_networks_2020.pdf	Chap. 4 Kumar Book
20.	30.10 11:00-12:45	MLNN & Lab on Classification	Python Notebook for classification	Chap. 4 Kumar Book
21.	04.11 09:00-10:45	Regression & Association Rule Mining	16_linear_regression.pdf 17_association_analysis.pdf	Regression: Appendix D in Kumar Book. Association Rules: Chap. 5 Kumar Book
22.	05.11 09:00-10:45	Association Rule Mining		Association Rules: Chap. 5 Kumar Book
23.	06.11 11:00-12:45	Sequential Pattern Mining	18_sequential_patterns_2020.pdf	Chap. 6 Kumar Book
24.	11.11 09:00-10:45	Ethics in AI & Privacy	19_ethics_privacy.pdf	Report in Trustworthy AI
25.	12.11 09:00-10:45	Ethics in AI & Privacy		Overview on Privacy allegato11-cpdp13.pdf Privacy by design
26.	13.11 11:00-12:45	Ethics in AI & Privacy, Explainability	20_explainability_2020.pdf
27.	18.11 09:00-10:45	Explainability	20_explainability_2020.pdf	Material: LORE LIME Survey ABELE
28.	19.11 09:00-10:45	Anomaly Detection	21_anomaly_detection_2020.pdf	Chap. 9 of Kumar Book
29.	20.11 11:00-12:45	Anomaly Detection	anomalydetection.ipynb.zip	Chap. 9 of Kumar Book
30.	25.11 09:00-10:45	Time Series Similarity	22_time_series_similarity.pdf	Overview on DM for time series, DTW paper by Sakoe and Chiba, 1978
31.	26.11 09:00-10:45	Time Series Clustering	22_time_series_similarity.pdf
32.	27.11 11:00-12:45	Lab on Association Rules and Sequential Pattern Mining	patterns.zip
33.	02.12 09:00-10:45	Time Series: Motif Discovery	23_time_series_motif_shapelets.pdf	randomproj.pdf matrixprofile.pdf
34.	03.12 09:00-10:45	Time Series: Shapelets Discovery + Ex. DTW + Subsequences + Thesis available	23_time_series_motif_shapelets.pdf ex-dtw-sequences.pdf Thesis Proposals	shaplet.pdf
	04.12 11:00-12:45	Lecture canceled
35.	09.12 09:00-10:45	Paper Presentation
36.	10.12 09:00-10:45	Paper Presentation
37.	11.12 11:00-12:45	Paper Presentation

Exams

Mid-term Project

The project consists of data analyses carried out with data mining tools. It must be done by a team of 2 or 3 students, using Python. The guidelines specify the tasks to address. Results must be reported in a single paper of at most 20 pages of text, including figures. Students must deliver both the paper (single column) and well-commented Python notebooks.

The first part of the project consists of the assignments described here: Project Description
- Dataset: customer_supermarket.csv.zip
- Deadline: the first part must be delivered by ~~November 5th, 2020~~ November 12th, 2020.
The second part of the project consists of the assignment Task 3 described here: Updated Project Description
- Deadline: the second part must be delivered by January 4th, 2021.
The third part of the project consists of the assignment Task 4 described here: Final Project Description
- Deadline: January 4th, 2021 (strict). Prepare a single zip folder that also contains the material of the previously submitted tasks (even if they have already been submitted). The project description file contains the detailed instructions for delivering all the tasks in the final submission.

Project to be delivered during the exam sessions

Students who did not deliver the above project by January 4th, 2021 must ask the teacher for a new project by email.

Paper Presentation (OPTIONAL)

Students may present a research paper (made available by the teacher) during the last week of the course. This presentation is OPTIONAL: students who give the paper presentation can skip the oral exam with open questions and only need to present the project (see the Oral Exam section below).

Oral Exam

Project presentation (with slides) – 10 minutes: mandatory for all the students
Open questions on the entire program: mandatory for all students except those who opted for the paper presentation, for whom they are optional.

Exam Dates

TBD

Exam Sessions

TBD

Reading About the "Data Scientist" Job

… a new kind of professional has emerged, the data scientist, who combines the skills of software programmer, statistician and storyteller/artist to extract the nuggets of gold hidden under mountains of data. Hal Varian, Google’s chief economist, predicts that the job of statistician will become the “sexiest” around. Data, he explains, are widely available; what is scarce is the ability to extract wisdom from them.

Data, data everywhere. The Economist, Special Report on Big Data, Feb. 2010.

Data, data everywhere. The Economist, Feb. 2010 download
Data scientist: The hot new gig in tech, CNN & Fortune, Sept. 2011 link
Welcome to the yotta world. The Economist, Sept. 2011 download
Data Scientist: The Sexiest Job of the 21st Century. Harvard Business Review, Sept 2012 link
Il futuro è già scritto in Big Data. Il Sole 24 Ore, Sept 2012 link
Special issue of Crossroads - The ACM Magazine for Students - on Big Data Analytics download
Peter Sondergaard, Gartner, Says Big Data Creates Big Jobs. Oct 22, 2012: YouTube video

Towards Effective Decision-Making Through Data Visualization: Six World-Class Enterprises Show The Way. White paper at FusionCharts.com. download

Previous years

DM-2019/20