====== Big Data Analytics, Academic Year 2015/16 ======

Instructors:

  * **Fosca Giannotti, Roberto Trasarti**
  * KDD Laboratory, Università di Pisa and ISTI - CNR, Pisa
  * [[http://www-kdd.isti.cnr.it]]
  * [[fosca.giannotti@isti.cnr.it]]
  * [[roberto.trasarti@isti.cnr.it]]

====== Learning goals ======

In our digital society, every human activity is mediated by information technologies. Every activity therefore leaves digital traces that can be stored in a repository: phone call records, transaction records, web search logs, movement trajectories, social media texts and tweets. Every minute, an avalanche of “big data” is produced by humans, consciously or not, which represents a novel, accurate digital proxy of social activities at global scale. Big data provide an unprecedented “social microscope”, a novel opportunity to understand the complexity of our societies, and a paradigm shift for the social sciences.

This course is an introduction to the emergent field of big data analytics and social mining, aimed at acquiring and analyzing big data from multiple sources for the purpose of discovering the patterns and models of human behavior that explain social phenomena. The focus is on what can be learnt from big data in different domains (mobility and transportation, urban planning, demographics, economics, social relationships, opinion and sentiment, etc.) and on the analytical and mining methods that can be used. An introduction to scalable analytics is also given, using the “map-reduce” paradigm.

=== Module 1: Big Data Analytics and Social Mining ===

In this module, complex analytical methods and processes are presented through exemplary case studies in challenging domains, organized according to the following topics:

  * Big Data scenario
  * Opportunities, risks, Big data sources, social sensing challenges
  * Big data analytics and social mining methods
  * Mobility data analytics: understanding human mobility with GPS traces
  * Mobility data analytics: understanding city dynamics with GSM traces
  * Novel Demographic Indicators with GSM traces
  * Social Media Mining
  * Sport Analytics
  * Ethics issues in Big Data Analytics

=== Module 2: Scalable Data Analytics Technologies ===

This module gives students the technologies to collect, manipulate and process big data. In particular, the following tools will be presented:

  * R
  * Web Scraping
  * Hadoop
  * Spark and MLlib
  * Hive: schema and data storage

=== Module 3: Student projects ===

Students will carry out analytical projects in teams. Discussions and presentations in class are planned at different stages of the project.

====== Hours and Rooms ======

**Classes**

^ Day ^ Time ^ Room ^
| Monday | 11:00 - 13:00 | Room Fib N1 |
| Friday | 09:00 - 11:00 | Room Fib L1 |

**Office hours:**

  * Roberto Trasarti: Friday, 11:30 - 12:30, CNR Room: C36b (Entrance n.20)

====== Learning Material ======

  * F. Giannotti et al. A planetary nervous system for social mining and collective awareness. The European Physical Journal Special Topics 214 (1), 49-75, 2012
  * F. Giannotti et al. Big Data Analytics: towards a European research agenda. http://www.ercim.eu/news/387-ercim-white-paper-on-big-data-analytics
  * Agrawal et al. Challenges and Opportunities with Big Data 2011-1 (2011). Cyber Center Technical Reports. Paper 1. http://docs.lib.purdue.edu/cctech/1
  * Data, data everywhere. The Economist, Special Report on Big Data, February 2010.
  * Foster Provost, Tom Fawcett. Data Science for Business.
  * Andrea Ceron, Luigi Curini, Stefano Iacus. Social Media e Sentiment Analysis: L'evoluzione dei fenomeni sociali attraverso la rete.
  * Introducing Map-Reduce in Java: http://www.rohitmenon.com/index.php/introducing-mapreduce-part-i/

===== Software & Links =====

  * Data challenges website: https://www.kaggle.com/
  * Python website: http://www.python.it/download/ (install version 2.x; do not install 3.x). Instructions: [[https://goo.gl/yBRjkG]]
  * Scrapy webpage: http://scrapy.org/
  * Installing Hadoop single node on your machine (without VM): https://goo.gl/KGME9t (Linux/OS) https://goo.gl/7Bkcnr (Win)
  * Spark: http://spark.apache.org/downloads.html (can be installed without Hadoop)
  * R: https://www.r-project.org/

Two alternative VMs:

  * http://www.cloudera.com/content/cloudera/en/downloads/quickstart_vms/cdh-5-3-x.html (Cloudera VM - Cloudera/Cloudera)
  * http://hortonworks.com/products/hortonworks-sandbox/#install (Hortonworks VM root/hadoop - http://127.0.0.1:8888)

====== Class calendar ======

^ ^ Day ^ Room ^ Topic ^ Materials/Notes ^ Student Presentation ^ Instructor ^
|1.| 21.09, 11-13 | N1 | MOD1 Course Presentation | [[https://goo.gl/4HKzkf]] | | Giannotti/Trasarti |
|2.| 25.09, 09-11 | L1 | MOD1 Big data scenario | [[https://goo.gl/0xY0a7]] | | Giannotti |
|3.| 02.10, 09-11 | L1 | MOD2 Introduction to Hadoop | [[https://goo.gl/0UiFg8]] | | Trasarti |
|4.| 05.10, 11-13 | N1 | MOD2 Hadoop Ecosystem & Design Patterns | [[https://goo.gl/rmRZ5C]] [[https://goo.gl/9GSlgZ]] | | Trasarti |
|5.| 09.10, 09-11 | L1 | MOD2 Hadoop ground level [LAB] | [[https://goo.gl/aOn0rm]] [[https://goo.gl/luhYzB]] | | Trasarti |
|6.| 12.10, 11-13 | N1 | MOD2 Analyzing data with Python [LAB] | [[https://goo.gl/7NgdVE]] | | Trasarti* |
|7.| 19.10, 11-13 | N1 | MOD2 Web Scraping [LAB] | [[https://goo.gl/8SojKJ]] | (P1) | Trasarti* |
|8.| 23.10, 09-11 | L1 | MOD1 Understanding Human Mobility with GPS traces | [[https://goo.gl/km9199]] [[https://goo.gl/6tjhyJ]] [[https://goo.gl/k5XRLj]] [[https://goo.gl/u6Q04b]] | (P2) | Giannotti |
|9.| 26.10, 11-13 | N1 | MOD1 City dynamics with Mobile Phone Traces | | | Giannotti |
|10.| 30.10, 09-11 | L1 | MOD1 Novel Demography with Phone Traces | Project Assignment [[https://goo.gl/11ZfYm]] | | Giannotti |
|11.| 09.11, 11-13 | N1 | MOD2 Hive [LAB] | Student Groups definition [[https://goo.gl/CLFgJV]] | (P3) | Trasarti |
|12.| 13.11, 09-11 | L1 | MOD2 Spark [LAB] | [[https://goo.gl/niOL5z]] | (P4) | Trasarti |
|13.| 16.11, 11-13 | N1 | MOD1 Sport analytics | [[https://goo.gl/ntt1S8]] | (P5) | Giannotti |
|14.| 20.11, 09-11 | L1 | Project proposal presentations | | | Giannotti/Trasarti |
|15.| 23.11, 11-13 | N1 | MOD2 Data Mining with Spark [LAB] | [[https://goo.gl/Xoz6Hl]] | (P6) | Trasarti |
|16.| 27.11, 09-11 | L1 | MOD2 Introducing R [LAB] | [[https://goo.gl/98dF4x]] | | Trasarti |
|17.| 30.11, 11-13 | N1 | Project alignment | | (P7,8) | Trasarti |
|18.| 04.12, 09-11 | L1 | MOD1 Sentiment analysis | [[https://goo.gl/Sf8KDL]] | | Giannotti* |
|19.| 11.12, 09-11 | L1 | MOD3 Open Lab/Discussion [LAB] | Project preliminary results (Taxi Group) | (P9,10,11) | Giannotti/Trasarti |
|20.| 14.12, 11-13 | N1 | MOD3 Open Lab/Discussion [LAB] | Project preliminary results (Reddit and Crime Groups) | | Giannotti/Trasarti |

  * * - A guest will attend the lesson to share his/her expertise on the topic.
  * [LAB] - Bring your laptop to class; practical examples will be shown.

====== Exam ======

The exam consists of two parts:

  * A **project**, assigned from among those proposed during the classes or proposed by the students themselves. In the latter case, students are invited to submit a short project proposal (max. 2 pages) describing the data to use and the analysis objectives, and to prepare a presentation of 15 minutes. The work done should be summarized in a report (max. 10 pages), to be sent to the teachers at least a week before the oral exam (project discussion) with a presentation of 30 minutes. [[https://goo.gl/ike9vi]]
  * An **oral exam**, which includes: (1) a discussion of the project report with a group presentation (15 minutes for the whole group); (2) a short presentation describing the analytical process from a research paper (10 minutes for each student).

=== Papers assignment ===

^ ^ Paper ^ Link ^ Student ^ Discussion day ^
|P1. | Twitter as an indicator for whereabouts of people? | [[https://goo.gl/Vk7Sox]] | Florencio Paucar Sedano | 19/10 |
|P2. | Explaining International Migration with Skype data | [[https://goo.gl/IlJSmm]] | Pierluca Serra | 23/10 |
|P3. | Big Data System for Analyzing Risky Procurement Entities | [[https://goo.gl/N2u3Lx]] | Marco Vicariucci | 9/11 |
|P4. | Detecting and understanding big events in big cities (GSM data) | [[https://goo.gl/sGDeZ3]] | Tommaso Inghirami | 13/11 |
|P5. | CoMobile – Human Mobility with Mobile Phone | [[https://goo.gl/ZqVKB8]] | William Tisdall | 16/11 |
|P6. | Estimating Potential Customers Anywhere and Anytime Based on Location-Based Social Networks | [[https://goo.gl/rJKVqQ]] | Martina Vasapollo | 23/11 |
|P7. | Product Assortment and Customer mobility | [[https://goo.gl/gwUbwy]] | Victoria Kotova | 30/11 |
|P8. | Analyzing traffic with GPS using R | [[https://goo.gl/RHVOZR]] | Andrea Buccarella | 30/11 |
|P9. | Tweet Sentiment: From Classification to Quantification | [[http://goo.gl/tWe3xm]] | Alessandro Marrella | 11/12 |
|P10. | Small Area Model-Based Estimators Using Big Data | [[https://goo.gl/ZIRzLU]] | Raffaele Grezzi | 11/12 |
|P11. | The purpose of motion (GPS data) | [[https://goo.gl/clw8vd]] | Filippo Todeschini | 11/12 |

The presentation **must** be 5 slides:

  - Data description
  - Problem statement
  - Data manipulation
  - The analytical process and the results
  - Validation

Please send the slides to both of us by e-mail (or a link if the size is over 5 MB).