====== Laboratory on Algorithms for Big Data 2014/2015 ====== Some useful infos on the course: * **Teacher**: [[http://www.di.unipi.it/~ferragin/|Paolo Ferragina]] and [[http://zola.di.unipi.it/rossano/|Rossano Venturini]] * **CFU**: 6 * **Period**: First semester * **Language**: English * **Lectures schedule:** Tuesday 9-11 (N1) and Thursday 16-18 (C). * **Question time:** After lectures or by appointment. * **Allowed Degree**: Master degree in Computer Science, or Computer Science and Networking, or Business Informatics * **News:** about this course will be distributed via a [[http://twitter.com/FerraginaTeach | Twitter-channel]]. * **Official lectures schedule:** The schedule and content of the lectures is available below and with the [[http://unimap.unipi.it/registri/dettregistriNEW.php?re=154453::::&ri=9142|official registro]]. \\ ====== Goals and opportunities for students ====== The course consists of a first part of lectures describing advanced algorithms and data structures (3 CFU), and a laboratory in the second part (3 CFU) in which the students will deploy these techniques to develop a software project. The students will select their projects among a set of proposals by some **IT companies** which are challenging from an algorithmic perspective. These companies will also contribute to identify/construct significant **datasets** that will help in testing the proposed algorithmic solutions. The course will provide the opportunity of * facing with difficult algorithmic problems of practical interest involving big data; * evaluating the impact of efficient algorithmic solutions in the design of software managing big data; * implementing advanced software by using powerful and sophisticated libraries; * getting in touch with some companies for internships, scholarships, or thesis proposals. ====== Exam ====== ^ Date ^ Room ^ Text ^ | 22/01/2015, 09:00 | L1 | {{:magistraleinformatica:lad:lad14:lab150122.doc|text}} | | 13/02/2015, 09:00 | L1 | {{:magistraleinformatica:lad:lad14:lab150213.doc|text}} | | 05/06/2015, 09:00 | L1 | {{:magistraleinformatica:lad:lad14:lab150605.doc|text}} | | 29-06-2015, 09:00 | L1 | text | | 20-07-2015, 09:00 | L1 | text | | 10-09-2015, 09:00 | L1 | text | The exam consists of three parts: Project 70%, Written/oral test 20%, Project presentation 10%. Students can attend the written/oral test before the presentation/development of the project. As far as the two projects are concerned, we list below the datasets for each of them. These are //warm-up datasets// of moderate size, yet sufficient to start designing your solutions and be concerned with time/space efficiency issues. We will provide you larger datasets to test the final codes you'll produce. **Project 1.** The [[https://www.dropbox.com/s/ba28q2k3vcif498/DatasetsProject1.zip?dl=0|dataset]] consists of the following parts: * Dictionary of English terms (3.5Mb, 354.984 terms); * Dictionary of titles of Wikipedia articles (116 Mb, 6.971.668 titles). This dataset has been modified by introducing errors; * Queries obtained by introducing errors in the dictionary of terms above (2.126.355 queries); * Queries obtained by taking the Wikipedia titles (376.304 queries). Few related [[https://www.dropbox.com/s/ci49poek5armjf3/papersP1.zip?dl=0|papers]]. **Project 2.** The [[https://mega.co.nz/#!XFslwRjA!G_JptVb3EeFVAKvPUTxjZ9ISMKoP7IuFhr6hdt5iyYg|dataset]] consists of 2.8Gb of 7zipped web pages drawn from the UK-GOV2 collection, and in [[http://commoncrawl.org/navigating-the-warc-file-format/|WARC format]]. Some techniques that you can use for your project are described in these [[https://dl.dropboxusercontent.com/u/7999075/Progetto%202%20-%20papers.zip|papers]]. Software libraries are indicated [[http://didawiki.cli.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/lad/lad14/algorithmengineeringlab_-_projects.docx|here]], plus few additional softwares from students listed below: * {{:magistraleinformatica:lad:lad14:libwarc-75524e4.tgz|WARC parser}} (by Andrea Cardaci). ====== Background====== If you wish to refresh your mind on basic Algorithms and Data Structures, we suggest you to look at the well-known book [[http://mitpress.mit.edu/catalog/item/default.asp?ttype=2&tid=11866|Introduction to Algorithms]] by Cormen-Leiserson-Rivest-Stein, third edition. Moreover, you can look at the content of the [[http://didawiki.cli.di.unipi.it/doku.php/magistraleinformaticanetworking/ae/start|Algorithm Engineering course]] by Paolo Ferragina. Anyway, depending on the topics addressed in the lectures you'll be pointed out to the suitable chapters of Ferragina's notes. ====== Lectures ====== We'll often do not project slides for teaching, but use //mainly// the //old-fashioned// blackboard. Most of the content of the course will be covered by notes, sometime we'll use parts of papers/books. ^ Date ^ Lecture ^ Biblio ^ | 24/09/2014 | Introduzione al corso: temi, progetti e svolgimento dell'esame. | {{:magistraleinformatica:lad:lad14:algorithmengineeringlab_-_projects.docx|progetti assegnati}}, {{:magistraleinformatica:lad:lad14:il_corso.pptx|slides}} | | 25/09/2014 | (Compact) Trie, Ternary Search Trie, e Patricia Trie. | Sezioni 7.1, 7.4 e 7.5 di queste [[http://didawiki.cli.di.unipi.it/lib/exe/fetch.php/magistraleinformaticanetworking/ae/ae2013/chap07.pdf|note]]. [[http://www.drdobbs.com/database/ternary-search-trees/184410528|Questo]] articolo su Dr.Dobb's.| | 30/09/2014 | Strutture dati succinte: rank/select, Elias-Fano, alberi (LOUDS e BP). | [[https://www.dropbox.com/s/oejl7rvvsw7qu5y/SuccinctDS.pdf?dl=0|Lucidi]] sufficienti per lo studio. | | 02/10/2014 | Strutture dati per l'indicizzazione compressa di testi: Suffix Array, Trasformata di Burrows-Wheeler, FM-index (count, locate, extract). | {{:magistraleinformatica:lad:lad14:03._bwt_e_fmi.pptx|Lucidi}} sufficienti per lo studio. | | 07/10/2014 | Compressed Permuterm Index e Dictionary search with prefix, suffix, substring, prefixSuffix match. Hashing: chaining (with time/space complexity), Universal hashing, Minimal ordered perfect hash. | {{:magistraleinformatica:lad:lad14:04_b._hashing_min_perfect_hash_short_.pptx|Lucidi1}} e {{:magistraleinformatica:lad:lad14:04_a._permuterm.ppt|Lucidi2}} sufficienti per lo studio. | | 09/10/2014 | Alberi succinti DFUDS. Descrizione Progetto 1 e possibili soluzioni. | [[https://www.dropbox.com/s/5tai57kqcvoh1ro/Project1.pdf?dl=0|Lucidi]] | | 14/10/2014 | Permuting e Sorting: I/O-bottleneck, binary mergesort, multi-way mergesort. | {{:magistraleinformatica:lad:lad14:6._sorting.ppt|Lucidi}} e [[http://didawiki.cli.di.unipi.it/lib/exe/fetch.php/magistraleinformaticanetworking/ae/ae2013/chap_05.pdf|capitolo 5]] note di P. Ferragina (no 5.1.2, e no 5.2 in poi). | | 16/10/2014 | Near-duplicate document detection: problem definition and comments, Karp-Rabin-fingerprint, Shingling, Jaccard similarity of sets, document sketches, locality sensitive hashing, the detection process. | {{:magistraleinformatica:lad:lad14:07_a._minhash-shingle.ppt|Slides}} e {{:magistraleinformatica:lad:lad14:lsh.pdf|parte di capitolo}}. | | 21/10/2014 | An introduction to data compression: Gamma/Delta codes, Huffman code, Lempel-Ziv 1977, and Burrows-Wheeler transform. | {{https://www.dropbox.com/s/q5i70eybz0217ii/datacompression.pptx?dl=0|Slides}}. Sect. 9.1 of these {{http://didawiki.cli.di.unipi.it/lib/exe/fetch.php/magistraleinformaticanetworking/ae/ae2013/chap_09.pdf|notes}}, Sect. 10.1 of these {{http://didawiki.cli.di.unipi.it/lib/exe/fetch.php/magistraleinformaticanetworking/ae/ae2013/chap_10.pdf|notes}}, and Sect. 11.1 of these {{http://didawiki.cli.di.unipi.it/lib/exe/fetch.php/magistraleinformaticanetworking/ae/ae2013/chap_11.pdf|notes}}. | | 23/10/2014 | Discussion on the projects. | | | 28/10/2014 | Clustering: soft/hard clustering, bottom-up and top-down approaches, various metrics, K-means. Discussion on Project 2| Slides on {{:magistraleinformatica:lad:lad14:07_b.clustering.ppt|clustering}}. Slides on the {{:magistraleinformatica:lad:lad14:il_progetto_n._2.pptx|second project}}, and {{:magistraleinformatica:lad:lad14:progetto_2_-_papers.zip|papers to read}}.| | 30/10/2014 | Discussion on the projects. | | | 11/11/2014 | Discussion on the projects: To generate k random positions, p(i), one could use the computation p(i) = h1(x) + i h2(x), where x is a random number and h1/h2 are [[https://sites.google.com/site/murmurhash/|MurmurHash]] functions. | | | 13/11/2014 | Discussion on the projects. | | | 18/11/2014| Discussion on the projects: Venturini's office. | | | 20/11/2014| Discussion on the projects: Venturini's office. | | | 25/11/2014| Discussion on the projects: Venturini's office. | | | 27/11/2014| Discussion on the projects: Ferragina's office. | | | 02/12/2014| Discussion on the projects: Ferragina's office. | | | 04/12/2014| Discussion on the projects: Ferragina's office. | | | 09/12/2014| Discussion on the projects: Ferragina's office. | | | 11/12/2014| Discussion on the projects: Venturini's office. | | | 16/12/2014| Discussion on the projects: Ferragina's office. | |