====== Data Analytics for Digital Health (DAD) ======

**Instructors:**

  * **Anna Monreale**
    * KDDLab, Università di Pisa
    * [[anna.monreale@unipi.it]]
  * **Francesca Naretto**
    * KDDLab, Università di Pisa
    * [[francesca.naretto@unipi.it]]

====== News ======

====== Learning Goals ======

  * Fundamental concepts of data and knowledge discovery
  * Data types in healthcare data and public databases
  * Data understanding
  * Data preparation
  * Clustering
  * Classification
  * Rule-based methods
  * Outlier detection
  * Time series analysis
  * Sequential pattern mining

====== Hours and Rooms ======

**Classes**

^ Day of Week ^ Hour ^ Room ^
| Monday | 09:00 - 11:00 | Room FIB PS4 |
| Wednesday | 14:00 - 16:00 | Room C |
| Friday | 11:00 - 13:00 | Room FIB PS4 |

**Office hours:**

  * Anna Monreale: Thu 09:00-11:00, online via Teams or in her office (appointment by email).
  * Francesca Naretto: Mon 11:00-13:00, online via Teams or in her office (appointment by email).

A [[https://teams.microsoft.com/l/team/19%3AYicRl7qo_TVGdu-QzXkPsV78YMyBSz-DUvdz3AJMoUI1%40thread.tacv2/conversations?groupId=d4217229-2988-44de-bbd8-6f4be6224ffa&tenantId=c7456b31-a220-47f5-be52-473828670aa1|Teams Channel]] will be used ONLY to post news, Q&A, and other course-related material. Lectures are held in person only and will **NOT** be live-streamed.

====== Learning Material ======

===== Textbook =====

  * Pang-Ning Tan, Michael Steinbach, Vipin Kumar. **Introduction to Data Mining**. Addison Wesley, ISBN 0-321-32136-7, 2006
    * [[http://www-users.cs.umn.edu/~kumar/dmbook/index.php]]
    * Chapters 4, 6 and 8 are also available at the publisher's Web site.
  * Jake VanderPlas. **[[http://shop.oreilly.com/product/0636920034919.do| Python Data Science Handbook: Essential Tools for Working with Data.]]** 1st Edition.
  * Python basics: {{ :magistraleinformatica:dmi:python_basics.ipynb.zip | Very basic notions on Python}}

===== Slides =====

  * The slides used in the course will be added to the calendar after each class. Most of them are drawn from the slides provided by the textbook's authors: [[http://www-users.cs.umn.edu/~kumar/dmbook/index.php#item4|Slides for "Introduction to Data Mining"]].

===== Software =====

  * Python - Anaconda (version 3.7 or later): Anaconda is the leading open data science platform powered by Python. [[https://www.anaconda.com/distribution/| Download page]] (the following libraries are already included)
    * Scikit-learn: Python library with tools for data mining and data analysis. [[http://scikit-learn.org/stable/ | Documentation page]]
    * Pandas: an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. [[http://pandas.pydata.org/ | Documentation page]]

====== Class Calendar (2024/2025) ======

===== First Semester =====

^ ^ Day ^ Topic ^ Learning material ^ References ^ Video Lectures ^ Teacher ^
|1. | 16.09 | Overview. Introduction to KDD + Data Types | {{ :digitalhealth:0-overview.pdf | Overview}} {{ :digitalhealth:1-intro-da-dm-tecs.pdf |Introduction to DADH}} {{ :digitalhealth:2-data_understanding.pdf |Data Understanding}}| Chap. 1 Kumar Book | |Monreale |
|2. | 18.09 | Data Understanding for tabular data | Data Understanding slides from the previous lecture | Chap. 2 Kumar Book and additional resource of the Kumar Book: [[https://www-users.cs.umn.edu/~kumar001/dmbook/data_exploration_1st_edition.pdf|Data Exploration Chap.]] In the first edition of the Kumar Book this is Chap. 3 | |Monreale |
|3. | 20.09 | Data Preparation for tabular data | {{ :digitalhealth:3-data_preparation_dad.pdf |}} | Chap. 2 Kumar Book and additional resource of the Kumar Book: [[https://www-users.cs.umn.edu/~kumar001/dmbook/data_exploration_1st_edition.pdf|Data Exploration Chap.]] In the first edition of the Kumar Book this is Chap. 3 | |Monreale |
|4. | 23.09 | Data Understanding and Preparation for images | {{ :digitalhealth:4-data-understanding_images.pdf |}}|Digital Image Processing, 3rd edition, Rafael Gonzalez, Richard Woods | | Naretto |
|5. | 25.09 | Data Understanding and Preparation for images and Time Series | {{ :digitalhealth:5-data-understanding_ts.pdf |}}| | | Naretto |
|6. | 27.09 | Data Understanding and Preparation for Time Series + Python Lab | {{ :digitalhealth:integrazione.zip | Intro to Python}}| | | Naretto |
|7. | 30.09 | Data Understanding and Preparation for Tabular Data: Python Lab | {{:digitalhealth:Data_Und.zip}} | | | Naretto |
|8. | 02.10 | Data Understanding and Preparation for Images and Time Series: Python Lab | {{:digitalhealth:Data_Und.zip}} | | | Naretto |
|9. | 04.10 | Data Management and Data Warehousing | {{ :digitalhealth:6-dw.pdf |}}| | | Monreale |
|9. | 07.10 | Data Management and Data Warehousing | {{ :digitalhealth:6-dw.pdf |}}| | | Monreale |
|9. | 09.10 | Data Reporting - Project presentation | {{ :digitalhealth:6-dw.pdf |}}| | | Monreale |
|10. | 14.10 | Clustering: intro and k-means | {{ :digitalhealth:6-basic_cluster_analysis-intro.pdf |}} {{ :digitalhealth:6-basic_cluster_analysis-kmeans.pdf |}}| Chapter 7, Introduction to Data Mining, 2nd Edition by Tan, Steinbach, Karpatne, Kumar| | Naretto |
|11. | 16.10 | Clustering: k-means and hierarchical| {{ :digitalhealth:7.basic_cluster_analysis-Hierarchical.pdf |}}|Chapter 7, Introduction to Data Mining, 2nd Edition by Tan, Steinbach, Karpatne, Kumar | | Naretto |
|12. | 21.10 | Clustering: k-means variants and density-based approaches|{{ :digitalhealth:11-basic_cluster_analysis-kmeans-variants.pdf |}} {{ :digitalhealth:10-basic_cluster_analysis-dbscan.pdf |}} |Chapter 7, Introduction to Data Mining, 2nd Edition by Tan, Steinbach, Karpatne, Kumar | | Monreale |
|13. | 23.10 | Clustering: Validity|{{ :digitalhealth:12-basic_cluster_analysis-validity.pdf |}} |Chapter 7, Introduction to Data Mining, 2nd Edition by Tan, Steinbach, Karpatne, Kumar | | Monreale |
|14. | 25.10 | Clustering and similarity for Images| {{ :digitalhealth:3.2-Clustering_images.pdf |}}| | | Naretto |
|15. | 28.10 | Clustering and similarity for Time Series| {{ :digitalhealth:8_time_series_similarity_2024.pdf |}}| Time Series Analysis and Its Applications. Robert H. Shumway and David S. Stoffer. 4th edition| | Naretto |
|16. | 30.10 | Python Lab: Clustering| {{ :digitalhealth:clustering_diabetes.zip|}} {{ :digitalhealth:images_similarity.zip|}} {{ :digitalhealth:timeseries_similarity_clustering.zip|}} {{ :digitalhealth:clustering_tabular_tips.zip|}}| | | Naretto |
|17. | 04.11 | Python Lab: Clustering + Frequent Pattern Mining| {{ :digitalhealth:17_association_analysis.pdf |}}| | | Naretto, Monreale |
|18. | 06.11 | Frequent Pattern Mining| Same slides as the previous lecture| | |Monreale |
|19. | 08.11 | Sequential Pattern Mining| {{ :digitalhealth:18_sequential_patterns_2024.pdf |}}| | |Monreale |
|20. | 11.11 | Python Lab: FPM + SPM|{{ :digitalhealth:AR_SPM.zip |}} | | |Naretto |
|21. | 13.11 | Classification for tabular data| | | |Naretto |
|22. | 15.11 | Classification for tabular data|{{ :digitalhealth:10-KNN.pdf |}} | | |Naretto |
|23. | 18.11 | Classification for tabular data|{{ :digitalhealth:10-lg.pdf |}} | | |Naretto |
|24. | 20.11 | Project| | | |Monreale, Naretto |
|25. | 22.11 | Classification for tabular data|{{ :digitalhealth:10-Rule-Based-Classifiers.pdf |}} {{ :digitalhealth:11_2021-naive_bayes.pdf|}}| | |Naretto |
|26. | 25.11 | Classification for tabular data|{{ :digitalhealth:13_ensemble_2023.pdf |}}| | |Naretto |

====== Exams ======

**Project**

The project consists of data analyses carried out with data mining tools. It must be done in Python by a team of 2 students. The guidelines specify the tasks to address. Results must be reported in a single paper of at most 25 pages, including figures. Students must deliver both the paper (single column) and well-commented Python notebooks.

  * The first part of the project consists of the **assignments** described here: {{ :digitalhealth:data_analytics_for_digital_health_project_du_cl.pdf |Project Description: DU and Clustering}}
    - **Dataset**: [[https://unipiit.sharepoint.com/:f:/s/a__td_65366/EmBuMOhFMZVMiWh9cZT_VrMB5DJ8Xf6s5w4m0bPmwx_5jA?e=PvGLMs|Dataset Material]]
    - **Deadline**: the first part must be delivered by **December 2nd, 2024**, through a Teams assignment.

**Students who do not deliver the above project by Dec 31, 2024 must ask the teachers by email for a new project. The newly assigned project will require about 20 days of work and, once delivered, will be discussed during the oral exam.**

**Oral Exam**

  * **Project presentation** (with slides), 15 minutes: mandatory for all students, with questions to check understanding of the details of any part of the project.
  * **Open questions on the entire program**

**How to register for the oral exam?**

At https://esami.unipi.it/ you can find the exam dates: one in January and one in February. Each student must register for one of the 2 dates. These are not the dates of the oral exam or of the project delivery; the list of registered students will be used to organize the exam dates. After the registration deadline we will share a calendar for the oral exam.