Big Data Analytics A.A. 2019/20
Instructors:
- Fosca Giannotti, Luca Pappalardo
- KDD Laboratory, Università di Pisa and ISTI-CNR, Pisa
- Monday 16:00 - 18:00 Aula Fib L1
- Thursday 14:00 - 16:00 Aula Fib L1
NOTICE: ON NOVEMBER 18TH UNIVERSITY IS CLOSED AND ALL LESSONS ARE SUSPENDED DUE TO WEATHER ALERT
Registration to the course: build up teams of 3 or 4 students and register to the course here, by September 29th: http://bit.ly/bda_19_20_registration.
Preferences for datasets: fill in this form: http://bit.ly/bda_19_20_bidding_datasets. Use the same email address you used during registration. The form has 5 choices, each listing five datasets. Your preferred dataset is the one chosen in “Prima scelta” (first choice), your second preferred dataset is the one chosen in “Seconda scelta” (second choice), and so on.
Assignment of datasets:
- Car Crash: KSGK
- Credit Risk: A2P
- Ted Talks: Data Worms
- Soccer Match Events: The big data theory
- Injury Forecasting: FortheGretagood
- Reddit: LastGroup
First midterm presentation: The first midterm presentation (data understanding and project proposal) will be on October 24th (Soccer Match Events, Credit Risk, Ted Talks) and October 31st (Injury Forecasting, Car Crash, Reddit).
- presentation: prepare a presentation describing the data understanding and a proposal of the problem you want to solve. The presentation should last 20 minutes (+ 10 minutes for questions). The presentation must be sent through the Google Form (see below) in PDF format;
- report: the report must be written in LaTeX, using this template: latex_template_bda.zip. It must be at most 5 pages long. Summarize the data understanding, then describe and motivate your project proposal. A zipped folder (.zip file) containing the .tex file, the .cls file and the files of all figures/plots must be sent through the Google Form. In the report, include the title of your project and the names of your team members.
- python code: the Python code in .ipynb format (Jupyter Notebook) used to generate the computations and the plots must be sent through the Google Form. Please document your notebooks adequately using Markdown.
- Google Form: upload the material by October 23rd (Soccer Match Events, Credit Risk, Ted Talks) and October 30th (Injury Forecasting, Car Crash, Reddit) using this form: https://forms.gle/qhMz8K6HFDFi7uDu9
Second midterm presentation: 18 November (Soccer Match Events, Credit Risk, Ted Talks) and 25 November (Injury Forecasting, Car Crash)
- presentation: prepare a presentation describing the model(s) you created to solve the predictive problem and explaining how you evaluate them. The presentation should last 20 minutes (+ 10 minutes for questions). The presentation must be sent through the Google Form (see below) in PDF format;
- report: the report must be written in LaTeX, using this template: latex_template_bda.zip. It must be at most 10 pages long. Extend/modify the previous report and describe the model(s) you created and how you evaluate them. A zipped folder (.zip file) containing the .tex file, the .cls file and the files of all figures/plots must be sent through the Google Form. In the report, include the title of your project and the names of your team members.
- python code: the Python code in .ipynb format (Jupyter Notebook) used to generate the computations and the plots must be sent through the Google Form. Please document your notebooks adequately using Markdown.
- Google Form: upload the material by November 17th (Soccer Match Events, Credit Risk, Ted Talks) and November 24th (Injury Forecasting, Car Crash) using this form: https://forms.gle/cF7ps7ycde8UMLm68
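As a rough illustration of what "how you evaluate them" could look like in a notebook, the sketch below compares a model against a trivial baseline with cross-validation and then reports metrics on a held-out test set. The dataset (scikit-learn's built-in breast cancer data), the random forest, and the balanced-accuracy metric are assumptions chosen for illustration only; they are not prescribed by the course.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_score, train_test_split

# Toy dataset standing in for a project dataset (assumption for illustration).
X, y = load_breast_cancer(return_X_y=True)

# Keep a held-out test set that is never used during model selection.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# A trivial baseline (always predicts the most frequent class) gives the
# score that any real model should beat.
baseline = DummyClassifier(strategy="most_frequent")
model = RandomForestClassifier(n_estimators=100, random_state=0)

# 5-fold cross-validation on the training set only.
baseline_cv = cross_val_score(baseline, X_train, y_train, cv=5,
                              scoring="balanced_accuracy")
model_cv = cross_val_score(model, X_train, y_train, cv=5,
                           scoring="balanced_accuracy")

# Final check on the held-out test set.
model.fit(X_train, y_train)
report = classification_report(y_test, model.predict(X_test))

print(f"baseline CV balanced accuracy: {baseline_cv.mean():.3f}")
print(f"model CV balanced accuracy:    {model_cv.mean():.3f}")
print(report)
```

Reporting the baseline next to the model makes it easier to judge whether a score is actually good for the problem at hand.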
Paper presentation:
- Each student will present a paper on Big Data Analytics in a 7-minute talk (+ 3 minutes for questions). The paper presentations are scheduled on December 2nd and 5th. The list of papers to choose from is here: http://bit.ly/BDA_paper_bidding.
- The paper assigned to each student, and the date of presentation, are here: http://bit.ly/32YftoN
- During the presentation (with slides) you should highlight the following aspects: the data set used, the feature engineering and/or selection (if any), the problem addressed, the models/algorithms used to solve the problem, and finally how the authors explain the model constructed (if any).
Third midterm presentation: 9 December (Soccer Match Events, Credit Risk, Ted Talks) and 12 December (Injury Forecasting, Car Crash)
- presentation: prepare a presentation in which you show how to interpret the model(s) you created and how to explain the model's reasoning. If your best model is not easily interpretable (e.g., you use a neural network), use a less accurate but more interpretable model for the interpretation. Example: if you use a decision tree (or similar), you can show the feature importance and show its structure to describe the rules it is composed of; if you use a (logistic or linear) regressor, you can show the values of the computed weights; if you use a geometric model, you can perform a dimensionality reduction and try to interpret it in two or three dimensions. Also provide examples of how to interpret the model on specific records that are correctly classified and on records that are incorrectly classified by your model. Example: if you use a decision tree, show the rule that is used for classifying that record.
- The presentation should last 20 minutes (+ 10 minutes for questions). The presentation must be sent through the Google Form (see below) in PDF format;
- report: the report must be written in LaTeX, using this template: latex_template_bda.zip. It must be at most 15 pages long. Extend/modify the previous report. A zipped folder (.zip file) containing the .tex file, the .cls file and the files of all figures/plots must be sent through the Google Form. In the report, include the title of your project and the names of your team members.
- python code: the Python code in .ipynb format (Jupyter Notebook) used to generate the computations and the plots must be sent through the Google Form. Please document your notebooks adequately using Markdown.
- Google Form: upload the material by December 9th, 13:00 (Soccer Match Events, Credit Risk, Ted Talks) and December 12th, 13:00 (Injury Forecasting, Car Crash) using this form: https://forms.gle/DAFN9d2cVZHEcSCM9
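The interpretation approaches listed for the third midterm presentation (feature importance and rules of a decision tree, weights of a logistic regressor, dimensionality reduction, and the rule applied to a single record) can be sketched with scikit-learn as follows. This is a minimal sketch on the built-in Iris data, which is an assumption standing in for a project dataset; the variable names and the choice of PCA for the dimensionality reduction are illustrative, not part of the course material.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data standing in for a project dataset (assumption for illustration).
data = load_iris()
feature_names = list(data.feature_names)
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=0, stratify=data.target
)

# 1) Decision tree: feature importances and the full set of learned rules.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
importances = dict(zip(feature_names, tree.feature_importances_))
rules = export_text(tree, feature_names=feature_names)

# 2) Rule used to classify one specific test record.
record = X_test[:1]
node_ids = tree.decision_path(record).indices  # nodes visited, root to leaf
path = []
for node in node_ids[:-1]:  # the last node is the leaf, which has no test
    feat = tree.tree_.feature[node]
    thr = tree.tree_.threshold[node]
    op = "<=" if record[0, feat] <= thr else ">"
    path.append(f"{feature_names[feat]} {op} {thr:.2f}")
predicted = tree.predict(record)[0]
is_correct = bool(predicted == y_test[0])

# 3) Logistic regression: one weight per (class, feature) pair.
# Features are standardized so that the weights are comparable.
scaler = StandardScaler().fit(X_train)
logreg = LogisticRegression(max_iter=1000).fit(scaler.transform(X_train), y_train)
weights = logreg.coef_

# 4) Dimensionality reduction to two dimensions, e.g. for a scatter plot.
X_2d = PCA(n_components=2).fit_transform(scaler.transform(X_train))

print(rules)
print("importances:", importances)
print("rule for first test record:", " AND ".join(path))
print("predicted:", predicted, "| correct:", is_correct)
```

The same per-record rule extraction can be repeated on a misclassified record by picking an index where `tree.predict(X_test)` differs from `y_test`, if any such record exists.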
Learning goals
In our digital society, every human activity is mediated by information technologies and hence leaves digital traces behind. These massive traces are stored in public or private repositories: phone call records, movement trajectories, soccer-logs and social media records are all examples of “Big Data”, a novel and powerful “social microscope” to understand the complexity of our societies. The analysis of big data sources is a complex task that requires knowledge of several technological and methodological tools. This course has three objectives:
- introduce students to the emerging field of big data analytics and social mining;
- introduce students to the technological scenario of big data, such as programming tools to analyze big data, query NoSQL databases, and perform predictive modeling;
- guide students in the development of an open-source and reproducible big data analytics project, based on the analysis of real-world datasets.
Module 1: Big Data Analytics and Social Mining
In this module, analytical methods and processes are presented through exemplary case studies in challenging domains, organized according to the following topics:
- The Big Data Scenario and the new questions to be answered
- Sport Analytics:
- Soccer data landscape and injury prediction
- Analysis and evolution of sports performance
- Mobility Analytics
- Mobility data landscape and mobility data mining methods
- Understanding Human Mobility with vehicular sensors (GPS)
- Mobility Analytics: Novel Demography with mobile-phone data
- Social Media Mining
- The social media data landscape: Facebook, LinkedIn, Twitter, Last.fm
- Sentiment analysis: an example from human migration studies
- Discussion on ethical issues of Big Data Analytics
- Well-being and Nowcasting
- Nowcasting influenza with retail market data
- Predicting well-being from human mobility patterns
- Paper presentations by students
Module 2: Big Data Analytics Technologies
This module provides students with the technologies to collect, manipulate and process big data. In particular, the following tools will be presented:
- Python for Data Science
- The Jupyter Notebook: developing open-source and reproducible data science
- MongoDB: fast querying and aggregation in NoSQL databases
- GeoPandas: analyze geo-spatial data with Python
- Scikit-learn: machine learning in Python
- Keras: deep learning in Python
Module 3: Laboratory for Interactive Project Development
During the course, teams of students will be guided in the development of a big data analytics project. The projects will be based on real-world datasets covering several thematic areas. Discussions and presentations will take place in class at different stages of the project.
- Data Understanding and Project Formulation
- Mid Term Project Results
- Final Project results
Calendar
16/09 (Mod. 1) Introduction to the course, The Big Data scenario lesson1_introduction_to_the_course.pdf
20/09 NO LESSON
23/09 (Mod. 2) Python for Data Science and the Jupyter Notebook: developing open-source and reproducible data science
- How to install Jupyter notebook: https://jupyter.readthedocs.io/en/latest/install.html
- Python notebooks: http://bit.ly/bda_19_20_notebooks_1
27/09 (Mod. 3) Presentation of datasets for projects: http://bit.ly/bda_19_20_datasets
30/09 (Mod. 2) Scikit-learn: programming tools for data mining: http://bit.ly/bda_notebooks_2
03/10 (Mod. 2) GeoPandas and scikit-mobility: analyze trajectory data in Python: geopandas.zip
07/10 (Mod. 2) PyMongo and MongoDB: fast querying and aggregation in NoSQL databases: mongodb.zip
10/10 NO LESSON
14/10 (Mod. 1) Soccer data landscape and injury prediction: bda_1920_sports_analytics.pdf
17/10 (Mod. 1) Performance evaluation: from human evaluations to data-driven algorithms: bda_performance_evaluation.pdf
21/10 (Mod. 1) Nowcasting well-being with Big Data: bda_wellbeing.pdf
24/10 (Mod. 3) Project presentation - first group of teams
28/10 NO LESSON
31/10 (Mod. 3) Project presentation - second group of teams
07/11 (Mod. 3) Discussion and group working on projects
11/11 (Mod. 3) Discussion and group working on projects
18/11 (Mod. 3) All lessons suspended for weather alert
21/11 (Mod. 1) Forecasting influenza with retail market data
25/11 (Mod. 3) Project advancements - first and second group of teams
28/11 (Mod. 1) Explainability and interpretation of machine learning models
02/12 (Mod. 3) Paper presentations
05/12 (Mod. 3) Paper presentations
09/12 (Mod. 3) Project advancements - first group of teams
12/12 (Mod. 3) Project advancements - second group of teams
Final exams (appelli) are scheduled on:
- January 14th, 2020. Room N1, 09:00 - 14:00
- February 6th, 2020. Room L1, 09:00 - 14:00