Big Data Analytics A.A. 2017/18
Instructors:
- Fosca Giannotti, Roberto Trasarti
- KDD Laboratory, Università di Pisa and ISTI - CNR, Pisa
Learning goals
In our digital society, every human activity is mediated by information technologies, so every activity leaves digital traces that can be stored in some repository: phone call records, transaction records, web search logs, movement trajectories, social media texts and tweets. Every minute, humans produce, consciously or not, an avalanche of “big data” that represents a novel, accurate digital proxy of social activities at global scale. Big data provide an unprecedented “social microscope”, a novel opportunity to understand the complexity of our societies, and a paradigm shift for the social sciences. The objective of the course is twofold: (1) an introduction to the emergent field of big data analytics and social mining, which acquires and analyzes big data from multiple sources in order to discover the patterns and models of human behavior that explain social phenomena; and (2) an introduction to the technological scenario of scalable analytics.
Intro lectures
Lecture 1: Course Presentation, Course organization, Big Data Landscape: Opportunities, risks, big data sources, challenges.
Slides: https://goo.gl/WztPDg
Technologies lectures:
Lecture 1: Overview/Recall parallel computing. Slides: https://goo.gl/eCwz7G
Lecture 2: Introduction to Hadoop and Map-Reduce Patterns. Slides: https://goo.gl/kukSQx https://goo.gl/efVLKD
Lecture 3: HDFS and Spark (LAB). Slides: https://goo.gl/eD5p6c
Lecture 4-5-6: Data Analytics with Spark (LAB). Slides (the last slides of Lecture 3, with exercises): https://goo.gl/AQJXhD
Lecture 7-8-9: Data Mining with Spark and MLlib (LAB). Slides: https://goo.gl/HJEQwT, Materials: https://goo.gl/VxAEhi
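As a rough illustration of the Map-Reduce pattern introduced in Lecture 2, the sketch below counts words in plain Python on a single machine. It is not taken from the course slides and does not use Hadoop or Spark; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are hypothetical and chosen only to mirror the three stages of the pattern.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as a framework would do
    between the map and reduce stages."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three stages in sequence (hypothetical single-process driver)."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    groups = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in groups.items())
```

In a real cluster the map and reduce functions run in parallel on different nodes and the shuffle moves data across the network; here everything runs in one process, so only the structure of the pattern is shown.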
Methodological scenarios lectures:
Lecture 1-2: What is possible to observe with Mobile Phone Data? Formulation of novel questions to be answered: estimating population, understanding city dynamics, estimating unemployment or gender distribution, wellbeing; the complexity of feature construction; model construction; new mining algorithms; validation strategies.
Slides: https://goo.gl/fULiAu, https://goo.gl/UZEPdu
Lecture 3-4: What is possible to observe with GPS data? Formulation of novel questions to be answered: understanding human mobility; the complexity of feature construction; new model construction; new mining algorithms; validation strategies.
Slides: https://goo.gl/ztUvLd
Lecture 5-6: What is possible to observe with Social Media Data? Formulation of novel questions to be answered: understanding sentiment, wellbeing, happiness; the complexity of feature construction; new model construction; new mining algorithms; validation strategies.
Lecture 7: What is possible to observe with IoT Data? Formulation of novel questions to be answered: understanding performance in sport; the complexity of feature construction; new model construction; new mining algorithms; validation strategies.
Datasets
The datasets overview: https://goo.gl/fyAjth The datasets folder: https://goo.gl/nPd6HT
Solutions for the tech midterms are in the exercises folder of the datasets.
Calendar
18/09 - (Intro) Course Presentation, Big Data Landscape
22/09 - (Tech) Overview/Recall parallel computing
25/09 - (Method) What is possible to observe with Mobile Phone Data? (i)
29/09 - (Method) What is possible to observe with Mobile Phone Data? (ii)
02/10 - (Tech) Introduction to Hadoop and Design Patterns (Lab)
06/10 - Cancelled!
09/10 - (Tech) Managing HDFS and Introduction to Spark (Lab) and Datasets Presentation
13/10 - (Tech) Data Analytics with Spark (Lab)
16/10 - (Tech) Data Analytics with Spark (Lab)
20-23/10 - No Class (Time to practice!)
27/10 - (Tech) Data Analytics with Spark (Lab)
30/10 - Mid-term Tech I - starts at 16:30; you will have 1 hour and 30 minutes
06/11 - (Tech) Data Mining with Spark and MLlib (Lab) (i)
10/11 - (Method) What is possible to observe with GPS data? (i)
13/11 - (Tech) Data Mining with Spark and MLlib (Lab) (ii)
17/11 - (Method) What is possible to observe with GPS data? (ii)
20/11 - Discussing the final project proposal - Collective discussion (not evaluated)
24/11 - (Tech) Data Mining with Spark and MLlib (Lab) (iii)
27/11 - (Method) What is possible to observe with Social Media Data? (i)
01/12 - (Method) What is possible to observe with Social Media Data? (ii)
04/12 - (Method) What is possible to observe with GPS data? (iii)
11/12 - Cancelled due to weather
15/12 - Discussing the final project proposal - Collective discussion (not evaluated) and (Method) What is possible to observe with IoT data? Examples from sport
18/12 - Mid-term Tech II
12/01 - 14:00 @ CNR (Entrance 20 - Room C36b) - Mid-term Tech part I and/or II (second chance; send an e-mail before 07/01 if you want to take it)
22/01 - 16/02 - Final Project and Discussion: 14:00 @ CNR (Entrance 20 - Room C40)
Exam
The two mid-terms count for 40% of the final grade; the remaining 60% comes from the evaluation of the Project and the Discussion (prepare some slides to present your project). If the mid-term results are not sufficient, it is possible to take a final test on technologies.
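The 40/60 weighting can be sketched as a small calculation. This is only an illustration under two assumptions that the page does not state: that the two mid-terms count equally within their 40% share, and that all marks are on the same scale. The function name `final_grade` is hypothetical, and the actual grading procedure is decided by the instructors.

```python
def final_grade(midterm_1, midterm_2, project):
    """Combine marks with the 40% / 60% weights given for the exam.

    Assumption (not stated on the course page): the two mid-terms are
    averaged with equal weight before the 40% weight is applied.
    """
    midterm_average = (midterm_1 + midterm_2) / 2.0
    return 0.4 * midterm_average + 0.6 * project
```

For example, under these assumptions `final_grade(24, 28, 30)` weights a mid-term average of 26 at 40% and a project mark of 30 at 60%, giving approximately 28.4.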
The following table describes the expected content of a project:
Laboratories
Students should bring their own laptops (especially for the technology lectures).
Software & Links
- Python website: http://www.python.it/download/ (install version 2.x; do not install 3.x). Instructions: https://goo.gl/yBRjkG
- Installing Hadoop single node on your machine (without VM): https://goo.gl/KGME9t (Linux/OS) https://goo.gl/7Bkcnr (Win)
- Spark: http://spark.apache.org/downloads.html (can be installed without Hadoop)
Virtual Machines:
- http://hortonworks.com/products/hortonworks-sandbox/#install (hortonworks VM root/hadoop http://127.0.0.1:8888 or ssh [email protected] -p 2222)
- http://www.cloudera.com/downloads/quickstart_vms/5-8.html (Cloudera VM cloudera/cloudera)
- https://www.virtualbox.org/ (VirtualBox, a virtual machine manager)