Indice
Laboratory on Algorithms for Big Data 2014/2015
Some useful infos on the course:
- Teacher: Paolo Ferragina and Rossano Venturini
- CFU: 6
- Period: First semester
- Language: English
- Lectures schedule: Tuesday 9-11 (N1) and Thursday 16-18 (C).
- Question time: After lectures or by appointment.
- Allowed Degree: Master degree in Computer Science, or Computer Science and Networking, or Business Informatics
- News: about this course will be distributed via a Twitter-channel.
- Official lectures schedule: The schedule and content of the lectures is available below and with the official registro.
Goals and opportunities for students
The course consists of a first part of lectures describing advanced algorithms and data structures (3 CFU), and a laboratory in the second part (3 CFU) in which the students will deploy these techniques to develop a software project. The students will select their projects among a set of proposals by some IT companies which are challenging from an algorithmic perspective. These companies will also contribute to identify/construct significant datasets that will help in testing the proposed algorithmic solutions.
The course will provide the opportunity of
- facing with difficult algorithmic problems of practical interest involving big data;
- evaluating the impact of efficient algorithmic solutions in the design of software managing big data;
- implementing advanced software by using powerful and sophisticated libraries;
- getting in touch with some companies for internships, scholarships, or thesis proposals.
Exam
Date | Room | Text |
---|---|---|
22/01/2015, 09:00 | L1 | text |
13/02/2015, 09:00 | L1 | text |
05/06/2015, 09:00 | L1 | text |
29-06-2015, 09:00 | L1 | text |
20-07-2015, 09:00 | L1 | text |
10-09-2015, 09:00 | L1 | text |
The exam consists of three parts: Project 70%, Written/oral test 20%, Project presentation 10%. Students can attend the written/oral test before the presentation/development of the project.
As far as the two projects are concerned, we list below the datasets for each of them. These are warm-up datasets of moderate size, yet sufficient to start designing your solutions and be concerned with time/space efficiency issues. We will provide you larger datasets to test the final codes you'll produce.
Project 1. The dataset consists of the following parts:
- Dictionary of English terms (3.5Mb, 354.984 terms);
- Dictionary of titles of Wikipedia articles (116 Mb, 6.971.668 titles). This dataset has been modified by introducing errors;
- Queries obtained by introducing errors in the dictionary of terms above (2.126.355 queries);
- Queries obtained by taking the Wikipedia titles (376.304 queries).
Few related papers.
Project 2. The dataset consists of 2.8Gb of 7zipped web pages drawn from the UK-GOV2 collection, and in WARC format. Some techniques that you can use for your project are described in these papers. Software libraries are indicated here, plus few additional softwares from students listed below:
- WARC parser (by Andrea Cardaci).
Background
If you wish to refresh your mind on basic Algorithms and Data Structures, we suggest you to look at the well-known book Introduction to Algorithms by Cormen-Leiserson-Rivest-Stein, third edition.
Moreover, you can look at the content of the Algorithm Engineering course by Paolo Ferragina. Anyway, depending on the topics addressed in the lectures you'll be pointed out to the suitable chapters of Ferragina's notes.
Lectures
We'll often do not project slides for teaching, but use mainly the old-fashioned blackboard. Most of the content of the course will be covered by notes, sometime we'll use parts of papers/books.
Date | Lecture | Biblio |
---|---|---|
24/09/2014 | Introduzione al corso: temi, progetti e svolgimento dell'esame. | progetti assegnati, slides |
25/09/2014 | (Compact) Trie, Ternary Search Trie, e Patricia Trie. | Sezioni 7.1, 7.4 e 7.5 di queste note. Questo articolo su Dr.Dobb's. |
30/09/2014 | Strutture dati succinte: rank/select, Elias-Fano, alberi (LOUDS e BP). | Lucidi sufficienti per lo studio. |
02/10/2014 | Strutture dati per l'indicizzazione compressa di testi: Suffix Array, Trasformata di Burrows-Wheeler, FM-index (count, locate, extract). | Lucidi sufficienti per lo studio. |
07/10/2014 | Compressed Permuterm Index e Dictionary search with prefix, suffix, substring, prefixSuffix match. Hashing: chaining (with time/space complexity), Universal hashing, Minimal ordered perfect hash. | Lucidi1 e Lucidi2 sufficienti per lo studio. |
09/10/2014 | Alberi succinti DFUDS. Descrizione Progetto 1 e possibili soluzioni. | Lucidi |
14/10/2014 | Permuting e Sorting: I/O-bottleneck, binary mergesort, multi-way mergesort. | Lucidi e capitolo 5 note di P. Ferragina (no 5.1.2, e no 5.2 in poi). |
16/10/2014 | Near-duplicate document detection: problem definition and comments, Karp-Rabin-fingerprint, Shingling, Jaccard similarity of sets, document sketches, locality sensitive hashing, the detection process. | Slides e parte di capitolo. |
21/10/2014 | An introduction to data compression: Gamma/Delta codes, Huffman code, Lempel-Ziv 1977, and Burrows-Wheeler transform. | Slides. Sect. 9.1 of these notes, Sect. 10.1 of these notes, and Sect. 11.1 of these notes. |
23/10/2014 | Discussion on the projects. | |
28/10/2014 | Clustering: soft/hard clustering, bottom-up and top-down approaches, various metrics, K-means. Discussion on Project 2 | Slides on clustering. Slides on the second project, and papers to read. |
30/10/2014 | Discussion on the projects. | |
11/11/2014 | Discussion on the projects: To generate k random positions, p(i), one could use the computation p(i) = h1(x) + i h2(x), where x is a random number and h1/h2 are MurmurHash functions. | |
13/11/2014 | Discussion on the projects. | |
18/11/2014 | Discussion on the projects: Venturini's office. | |
20/11/2014 | Discussion on the projects: Venturini's office. | |
25/11/2014 | Discussion on the projects: Venturini's office. | |
27/11/2014 | Discussion on the projects: Ferragina's office. | |
02/12/2014 | Discussion on the projects: Ferragina's office. | |
04/12/2014 | Discussion on the projects: Ferragina's office. | |
09/12/2014 | Discussion on the projects: Ferragina's office. | |
11/12/2014 | Discussion on the projects: Venturini's office. | |
16/12/2014 | Discussion on the projects: Ferragina's office. |