Big Data Analytics, Academic Year 2015/16
Instructors:
- Fosca Giannotti, Roberto Trasarti
- KDD Laboratory, Università di Pisa and ISTI-CNR, Pisa
Learning goals
In our digital society, every human activity is mediated by information technologies. Therefore, every activity leaves behind digital traces that can be stored in some repository: phone call records, transaction records, web search logs, movement trajectories, social media texts and tweets. Every minute, an avalanche of “big data” is produced by humans, consciously or not, that represents a novel, accurate digital proxy of social activities at global scale. Big data provide an unprecedented “social microscope”, a novel opportunity to understand the complexity of our societies, and a paradigm shift for the social sciences. This course is an introduction to the emerging field of big data analytics and social mining, aimed at acquiring and analyzing big data from multiple sources for the purpose of discovering the patterns and models of human behavior that explain social phenomena. The focus is on what can be learnt from big data in different domains (mobility and transportation, urban planning, demographics, economics, social relationships, opinion and sentiment, etc.) and on the analytical and mining methods that can be used. An introduction to scalable analytics is also given, using the “map-reduce” paradigm.
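As a rough illustration of the “map-reduce” paradigm mentioned above, the following sketch counts words in a handful of text records using plain Python. It is not Hadoop or Spark code and is not taken from the course material: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the sample records are hypothetical, chosen only to show the three conceptual steps on a single machine.

```python
from collections import defaultdict


def map_phase(records):
    """Map: emit a (word, 1) pair for every word in every record."""
    for record in records:
        for word in record.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values of each key into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(records):
    """Chain the three steps; a real framework would distribute each one."""
    return reduce_phase(shuffle_phase(map_phase(records)))


if __name__ == "__main__":
    # Made-up sample records, standing in for tweets or log lines.
    sample = ["big data big cities", "data in motion"]
    print(word_count(sample))
```

In a distributed setting the map and reduce steps run in parallel on different machines and the shuffle moves data across the network; the sketch only keeps the logical structure.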
Module 1: Big Data Analytics and Social Mining
In this module, complex analytical methods and processes are presented through exemplary case studies in challenging domains, organized according to the following topics:
- Big Data scenario
- Opportunities, risks, Big data sources, social sensing challenges
- Big data analytics and social mining methods
- Mobility data analytics: understanding human mobility with GPS traces
- Mobility data analytics: understanding City dynamics with GSM traces
- Novel Demographic Indicators with GSM traces
- Social Media Mining
- Sport Analytics
- Ethics issues in Big Data Analytics
Module 2: Scalable Data Analytics Technologies
This module provides students with the technologies to collect, manipulate and process big data. In particular, the following tools will be presented:
- R
- Web Scraping
- Hadoop
- Spark and MLlib
- Hive: schema and data storage
Module 3: Student projects
Students will carry out analytical projects in teams. In-class discussions and presentations are planned at different stages of project execution.
Hours and rooms
Classes
Day | Time | Room |
---|---|---|
Monday | 11:00 - 13:00 | Aula Fib N1 |
Friday | 09:00 - 11:00 | Aula Fib L1 |
Office hours:
- Roberto Trasarti: Friday, 11:30 - 12:30, CNR Room C36b (Entrance n. 20)
Learning material
- F. Giannotti et al. A planetary nervous system for social mining and collective awareness. The European Physical Journal Special Topics 214 (1), 49-75, 2012.
- F. Giannotti et al. Big Data Analytics: towards a European research agenda. http://www.ercim.eu/news/387-ercim-white-paper-on-big-data-analytics
- Agrawal et al. Challenges and Opportunities with Big Data 2011-1 (2011). Cyber Center Technical Reports. Paper 1. http://docs.lib.purdue.edu/cctech/1
- Data, data everywhere. The Economist, Special Report on Big Data, February 2010.
- Foster Provost, Tom Fawcett. Data Science for Business.
- Andrea Ceron, Luigi Curini, Stefano Iacus. Social Media e Sentiment Analysis: L'evoluzione dei fenomeni sociali attraverso la rete.
- Introducing MapReduce in Java: http://www.rohitmenon.com/index.php/introducing-mapreduce-part-i/
Software & Links
- Data challenges website: https://www.kaggle.com/
- Python website: http://www.python.it/download/ (install version 2.x, not 3.x). Instructions: https://goo.gl/yBRjkG
- Scrapy webpage: http://scrapy.org/
- Installing Hadoop single node on your machine (without VM): https://goo.gl/KGME9t (Linux/OS) https://goo.gl/7Bkcnr (Win)
- Spark http://spark.apache.org/downloads.html (Can be installed without hadoop)
Two alternative VMs:
- http://www.cloudera.com/content/cloudera/en/downloads/quickstart_vms/cdh-5-3-x.html (cloudera VM - Cloudera/Cloudera)
- http://hortonworks.com/products/hortonworks-sandbox/#install (hortonworks VM root/hadoop - http://127.0.0.1:8888)
Class calendar
# | Day | Room | Topic | Materials/Notes | Student Presentation | Instructor |
---|---|---|---|---|---|---|
1. | 21.09, 11-13 | N1 | MOD1 Course Presentation | https://goo.gl/4HKzkf | | Giannotti/Trasarti |
2. | 25.09, 09-11 | L1 | MOD1 Big data scenario | https://goo.gl/0xY0a7 | | Giannotti |
3. | 02.10, 09-11 | L1 | MOD2 Introduction to Hadoop | https://goo.gl/0UiFg8 | | Trasarti |
4. | 05.10, 11-13 | N1 | MOD2 Hadoop Ecosystem & Design Patterns | https://goo.gl/rmRZ5C https://goo.gl/9GSlgZ | | Trasarti |
5. | 09.10, 09-11 | L1 | MOD2 Hadoop ground level [LAB] | https://goo.gl/aOn0rm https://goo.gl/luhYzB | | Trasarti |
6. | 12.10, 11-13 | N1 | MOD2 Analyzing data with Python [LAB] | https://goo.gl/7NgdVE | | Trasarti* |
7. | 19.10, 11-13 | N1 | MOD2 Web Scraping [LAB] | https://goo.gl/8SojKJ | (P1) | Trasarti* |
8. | 23.10, 09-11 | L1 | MOD1 Understanding Human Mobility with GPS traces | https://goo.gl/km9199 https://goo.gl/6tjhyJ https://goo.gl/k5XRLj https://goo.gl/u6Q04b | (P2) | Giannotti |
9. | 26.10, 11-13 | N1 | MOD1 City dynamics with Mobile Phone Traces | | | Giannotti |
10. | 30.10, 09-11 | L1 | MOD1 Novel Demography with Phone Traces | Project Assignment https://goo.gl/11ZfYm | | Giannotti |
11. | 09.11, 11-13 | N1 | MOD2 Hive [LAB] | Student Groups definition https://goo.gl/CLFgJV | (P3) | Trasarti |
12. | 13.11, 09-11 | L1 | MOD2 Spark [LAB] | https://goo.gl/niOL5z | (P4) | Trasarti |
13. | 16.11, 11-13 | N1 | MOD1 Sport analytics | https://goo.gl/ntt1S8 | (P5) | Giannotti |
14. | 20.11, 09-11 | L1 | Project proposal presentations | | | Giannotti/Trasarti |
15. | 23.11, 11-13 | N1 | MOD2 Data Mining with Spark [LAB] | https://goo.gl/Xoz6Hl | (P6) | Trasarti |
16. | 27.11, 09-11 | L1 | MOD2 Introducing R [LAB] | https://goo.gl/98dF4x | | Trasarti |
17. | 30.11, 11-13 | N1 | Project alignment | | (P7,8) | Trasarti |
18. | 04.12, 09-11 | L1 | MOD1 Sentiment analysis | https://goo.gl/Sf8KDL | | Giannotti* |
19. | 11.12, 09-11 | L1 | MOD3 Open Lab/Discussion [LAB] | Project preliminary results (Taxi Group) | (P9,10,11) | Giannotti/Trasarti |
20. | 14.12, 11-13 | N1 | MOD3 Open Lab/Discussion [LAB] | Project preliminary results (Reddit and Crime Groups) | | Giannotti/Trasarti |
- * - A guest will join the lesson to share his/her expertise on the topic.
- [LAB] - Bring your laptop to class; practical examples will be shown.
Exam
The exam consists of two parts:
- A project, either chosen among those proposed during the classes or proposed by the students themselves. In the latter case, students are invited to submit a short project proposal (max. 2 pages) describing the data to be used and the analysis objectives, and to prepare a 15-minute presentation. The work done should be summarized in a report (max. 10 pages), to be sent to the teachers at least one week before the oral exam (project discussion) with a 30-minute presentation. https://goo.gl/ike9vi
- An oral exam, which includes: (1) a discussion of the project report with a group presentation (15 minutes for the whole group); (2) a short presentation describing the analytical process from a research paper (10 minutes for each student).
Paper assignment
# | Paper | Link | Student | Discussion day |
---|---|---|---|---|
P1. | Twitter as an indicator for whereabouts of people? | https://goo.gl/Vk7Sox | Florencio Paucar Sedano | 19/10 |
P2. | Explaining International Migration with Skype data | https://goo.gl/IlJSmm | Pierluca Serra | 23/10 |
P3. | Big Data System for Analyzing Risky Procurement Entities | https://goo.gl/N2u3Lx | Marco Vicariucci | 9/11 |
P4. | Detecting and understanding big events in big cities (GSM data) | https://goo.gl/sGDeZ3 | Tommaso Inghirami | 13/11 |
P5. | CoMobile – Human Mobility with Mobile Phone | https://goo.gl/ZqVKB8 | William Tisdall | 16/11 |
P6. | Estimating Potential Customers Anywhere and Anytime Based on Location-Based Social Networks | https://goo.gl/rJKVqQ | Martina Vasapollo | 23/11 |
P7. | Product Assortment and Customer mobility | https://goo.gl/gwUbwy | Victoria Kotova | 30/11 |
P8. | Analyzing traffic with GPS using R | https://goo.gl/RHVOZR | Andrea Buccarella | 30/11 |
P9. | Tweet Sentiment: From Classification to Quantification | http://goo.gl/tWe3xm | Alessandro Marrella | 11/12 |
P10. | Small Area Model-Based Estimators Using Big Data | https://goo.gl/ZIRzLU | Raffaele Grezzi | 11/12 |
P11. | The purpose of motion (GPS data) | https://goo.gl/clw8vd | Filippo Todeschini | 11/12 |
The presentation must consist of 5 slides:
- Data description
- Problem statement
- Data manipulation
- The analytical process and the results
- Validation
Please send the slides to both of us by e-mail (or a link if the size is over 5 MB).