Data Mining A.Y. 2024/25
DM1 - Data Mining: Foundations (6 CFU)
Instructors:
Teaching Assistant
DM2 - Data Mining: Advanced Topics and Applications (6 CFU)
Instructors:
Teaching Assistant
News
[07.09.2024] Past years' lectures available at
link
[02.09.2024] Lectures will start on Monday 30 September 2024 at 11:00 in room C1.
[02.09.2024] Lectures will be held in person only. Recordings of past years' lectures can be found at the bottom of this web page.
[02.09.2024] Project Groups
link
[11.09.2023] MS Teams
link
Learning Goals
DM1
Fundamental concepts of knowledge discovery from data.
Data understanding
Data preparation
Clustering
Classification
Pattern Mining and Association Rules
Sequential Pattern Mining
Hours and Rooms
DM1
Classes
Day of Week | Hour | Room |
Monday | 11:00 - 13:00 | C1 |
Tuesday | 14:00 - 16:00 | C1 |
Office hours:
Prof. Pedreschi
Prof. Guidotti
DM2
Classes
Day of Week | Hour | Room |
Monday | 09:00 - 11:00 | C |
Wednesday | 11:00 - 13:00 | C |
Office hours:
Learning Material
Textbook
Pang-Ning Tan, Michael Steinbach, Vipin Kumar. Introduction to Data Mining. Addison Wesley, ISBN 0-321-32136-7, 2006
Chapters 3, 5 and 7 are also available at the publisher's Web site.
Berthold, M.R., Borgelt, C., Höppner, F., Klawonn, F. Guide to Intelligent Data Analysis. Springer Verlag, 1st Edition, 2010. ISBN 978-1-84882-259-7
Laura Igual et al. Introduction to Data Science: A Python Approach to Concepts, Techniques and Applications. 1st ed., 2017.
Slides
Software
Python - Anaconda (>3.7): Anaconda is the leading open data science platform powered by Python.
Download page (the following libraries are already included)
Scikit-learn: python library with tools for data mining and data analysis
Documentation page
Pandas: pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
Documentation page
Other software for Data Mining
Class Calendar (2024/2025)
First Semester (DM1 - Data Mining: Foundations)
| Day | Time | Room | Topic | Material | Lecturer |
| 16.09.2024 | | | No Lecture | | |
| 17.09.2024 | | | No Lecture | | |
| 23.09.2024 | | | No Lecture | | |
| 24.09.2024 | | | No Lecture | | |
01. | 30.09.2024 | 11-13 | C1 | Overview, Introduction | Intro | Pedreschi |
02. | 01.10.2024 | 14-16 | C1 | Lab. Introduction to Python | Python Basics | Pedreschi |
03. | 07.10.2024 | 11-13 | C1 | Data Understanding | Data Understanding | Pedreschi |
04. | 08.10.2024 | 14-16 | C1 | Data Understanding & Preparation | Data Understanding, Data Preparation | Pedreschi |
05. | 14.10.2024 | 11-13 | C1 | Data Preparation & Similarity | Data Preparation, Data Similarity | Pedreschi |
06. | 15.10.2024 | 14-16 | C1 | Lab. Data Understanding | Data Understanding | Pedreschi |
07. | 21.10.2024 | 11-13 | C1 | Introduction to Clustering, K-Means | Intro Clustering, K-Means | Pedreschi |
08. | 22.10.2024 | 14-16 | C1 | Centroid-based Clustering | K-Means | Pedreschi |
09. | 28.10.2024 | 11-13 | C1 | Hierarchical Clustering & Density-based Clustering | Hierarchical Clustering, Density-based Clustering | Pedreschi |
10. | 29.10.2024 | 14-16 | C1 | Lab. Clustering | Clustering | Pedreschi |
11. | 04.11.2024 | 11-13 | C1 | Ex. Clustering | ExClustering | Guidotti |
12. | 05.11.2024 | 14-16 | C1 | Intro Classification & kNN | Intro Classification, kNN | Guidotti |
13. | 11.11.2024 | 11-13 | C1 | Naive Bayes, Exercises | Naive Bayes | Guidotti |
14. | 12.11.2024 | 14-16 | C1 | Model Evaluation, Lab. Classification (kNN,NB) | Model Evaluation, Classification | Guidotti |
15. | 14.11.2024 | 9-11 | C1 | Decision Tree Classifier | Decision Tree | Guidotti |
16. | 18.11.2024 | 11-13 | C1 | Decision Tree Classifier | Decision Tree | Guidotti |
17. | 19.11.2024 | 14-16 | C1 | Decision Tree Classifier | Decision Tree | Guidotti |
18. | 21.11.2024 | 9-11 | C1 | Decision Tree Classifier Exercises and Lab | Decision Tree, Classification | Guidotti |
19. | 25.11.2024 | 11-13 | C1 | Regression & Lab. Regression | Regression, Regression, IMDb Rating | Guidotti |
20. | 26.11.2024 | 14-16 | C1 | Intro Pattern Mining and Apriori | Pattern Mining | Pedreschi |
Second Semester (DM2 - Data Mining: Advanced Topics and Applications)
Exams
How and Where:
The exam is oral only and takes place at the teacher's office or in a previously designated classroom.
It will be held online, on the 420AA Data Mining course channel, only at the student's request and in accordance with current legislation.
When:
The start dates of the three exam sessions are/will be published on the online platform https://esami.unipi.it/. Within each session, we will identify dates and slots in order to distribute the oral exams. The dates and slots for taking the exam will be published on the course page by the end of May. Each student must also register on https://esami.unipi.it/.
The oral exam can only be taken after the project has been delivered. The project must be delivered one week before the date on which you want to take the exam. Group oral discussions, organized by project group, are preferred so that the discussion of each project can be handled in parallel. It is not mandatory to take the oral exam together with the other members of the group.
If the oral exam is not passed, it cannot be retaken for 20 days. If the project is not considered sufficient, it must be redone on a new dataset or on a substantially updated version of the current one.
What:
The oral test will evaluate the practical understanding of the algorithms. The exam will evaluate three aspects.
Understanding of the theoretical aspects of the topics addressed during the course. The student may be required to write formulas or pseudocode. During the explanations, the student can use pen and paper.
Understanding of the algorithms illustrated during the course and their practical implementation. You will be asked to solve one or more simple exercises. The text will be shown on the teacher's screen and/or copied to Miro. The student will have to use pen and paper (or, if online, Miro: https://miro.com/) to show how the exercise is solved.
Discussion of the project with questions from the teacher regarding unclear aspects, questionable steps or choices.
Final Mark: for the 12-credit exam, the final mark will be obtained as the average mark of DM1 and DM2.
Exam Booking Periods
1st Appello: from TBD to TBD
2nd Appello: from TBD to TBD
3rd Appello: from TBD to TBD
4th Appello: from TBD to TBD
5th Appello: from TBD to TBD
6th Appello: from TBD to TBD
Exam Booking Agenda
When registering for the oral exam please specify in the notes DM1 if you do not want to do DM2 (that is assumed by default). After having booked for DM1 please contact Prof. Pedreschi to agree on the exam date (put Prof. Guidotti and Andrea Fedele in cc). There will be no agenda for DM1.
1st Appello - DM1 & DM2: from TBD to TBD (deliver project by TBD)
2nd Appello - DM1 & DM2: from TBD to TBD (deliver project by TBD)
3rd Appello - DM1 & DM2: from TBD to TBD (deliver project by TBD)
4th Appello - DM1 & DM2: from TBD to TBD (deliver project by TBD)
5th Appello - DM1 & DM2: from TBD to TBD (deliver project by TBD)
6th Appello - DM1 & DM2: from TBD to TBD (deliver project by TBD)
Do not forget to complete the course evaluation!
Exam DM1
The exam is composed of two parts:
A project, which consists of exercises requiring the use of data mining tools for data analysis. Exercises include: data understanding, clustering analysis, pattern mining, and classification (guidelines will be provided for more details). The project must be carried out by groups of 2 to 3 people, using Python or any other data mining software. The results of the different tasks must be reported in a single paper of at most 20 pages of text, including figures. The paper must be emailed to andrea [dot] fedele [at] phd [dot] unipi [dot] it and riccardo [dot] guidotti [at] unipi [dot] it. Please use “[DM1 2024-2025] Project” in the subject.
Dataset
Assigned: 15/10/2024
MidTerm Submission: 15/11/2024 22/11/2024 (+0.5) (half project required, i.e., Data Understanding & Preparation and Clustering)
Final Submission: 31/12/2024 (+0.5) one week before the oral exam (complete project required).
DM1 Project Guidelines
See Project Guidelines.
Exam DM2
The exam is composed of two parts:
A project, which consists of exercises requiring the use of data mining tools for data analysis. Exercises include: imbalanced learning, dimensionality reduction, outlier detection, advanced classification/regression methods, time series analysis/clustering/classification (guidelines will be provided for more details). The project must be carried out by 1 to 3 people, using Python or any other data mining software. The results of the different tasks must be reported in a single paper of at most 30 pages of text, including figures. The paper must be emailed to andrea [dot] fedele [at] phd [dot] unipi [dot] it and riccardo [dot] guidotti [at] unipi [dot] it. Please use “[DM2 2023-2024] Project” in the subject.
Dataset
Assigned: TBD
MidTerm Submission: 07/05/2024 (Modules 1 and 2; for TS classification, non-DL-based models only)
Final Submission: one week before the oral exam (complete project required, also with DL-based models for TS classification).
Dataset: TBD
DM2 Project Guidelines
See Project Guidelines.
Past Exams
Reading About the "Data Scientist" Job
… a new kind of professional has emerged, the data scientist, who combines the skills of software programmer, statistician and storyteller/artist to extract the nuggets of gold hidden under mountains of data. Hal Varian, Google’s chief economist, predicts that the job of statistician will become the “sexiest” around. Data, he explains, are widely available; what is scarce is the ability to extract wisdom from them.
Data, data everywhere. The Economist, Special Report on Big Data, Feb. 2010.
Data, data everywhere. The Economist, Feb. 2010
download
Data scientist: The hot new gig in tech, CNN & Fortune, Sept. 2011
link
Welcome to the yotta world. The Economist, Sept. 2011
download
Data Scientist: The Sexiest Job of the 21st Century. Harvard Business Review, Sept 2012
link
Il futuro è già scritto in Big Data. Il Sole 24 Ore, Sept 2012
link
Special issue of Crossroads - The ACM Magazine for Students - on Big Data Analytics
download
Peter Sondergaard, Gartner, Says Big Data Creates Big Jobs. Oct 22, 2012:
YouTube video
Towards Effective Decision-Making Through Data Visualization: Six World-Class Enterprises Show The Way. White paper at FusionCharts.com.
download
Previous years