Study, design and analysis of IR systems that efficiently and effectively process, mine, search, cluster and classify documents coming from textual as well as any unstructured domain. In the lectures, we will:
The exam will consist of a written test, plus an oral discussion on the exercises.
Date | Topic | Refs | |
---|---|---|---|
22/09/2015 | Introduction to the course: modern IR, not just search engines! Boolean retrieval model. Document-term matrix. Inverted list: dictionary + postings. How to implement AND, OR and NOT queries, and their time complexities. The structure of a search engine. | Slides, Chap. 1 of [MRS]. | |
24/09/2015 | Web search engines: difficulties in their design, and their epochs. The Web graph: some useful structural properties (such as the Bow Tie). Crawling: problems and algorithmic structure. An example: Mercator. | Slides, Sections 19.1, 19.2, 19.4, 20.1, 20.2 of [MRS]. | |
29/09/2015 | A few useful algorithmic techniques for crawling the Web (and not only that!): Bloom Filter and Consistent Hashing. | Slides. Sect 20.3 and 20.4 of [MRS]. For doubts on Bloom Filter see paper. | |
01/10/2015 | Compressed storage of the Web graph. Compressed storage of documents: LZ-based compression. | Slides, Sect 19.1 and 19.2 of [MRS], and Sect 1.1 and 2.2 of Ferragina's notes. | |
06/10/2015 | MTF (with proof of its 2-optimality), RLE, and the Burrows-Wheeler Transform with bzip. | Slides of the previous lecture, Sect 2.1 and 2.3 of Ferragina's notes above. | |
08/10/2015 | Storage and transmission of a single file or a group of files: Delta compression (Zdelta), File Synchronization (rsync, zsync), and Set Reconciliation. | Slides. | |
13/10/2015 | Parsing: tokenization, normalization, lemmatization, stemming, thesauri. Statistical properties of texts: Zipf's law (classical and generalized), Heaps' law, Luhn's consideration. | Slides. Sect. 2.1, 2.2 and 5.1 of [MRS]. | |
15/10/2015 | The issue of hierarchical memories: I/O-model. Index construction: multi-way mergesort, BSBI and SPIMI. Sketch of MapReduce. Distributed indexing: Term-based vs Doc-based partitioning. | Slides. Chapter 4 of [MRS]. | |
20/10/2015 | Dynamic indexing. Posting list compression, codes: gamma, delta, variable bytes. PForDelta. | Slides. Sect 5.3 of [MRS] and Ferragina's notes (only the coders presented in class). | |
22/10/2015 | Exact search: hashing with chaining, universal hashing, cuckoo hashing. Prefix search: compacted trie, front coding, 2-level indexing. Edit distance via brute-force approach, or Dynamic Programming (possibly weighted). | Slides. | |
27/10/2015 | Overlap measure with k-gram index. Edit distance with k-gram index. One-error match. Wild-card queries (permuterm, k-gram). Phonetic match. Context-sensitive match. | Slides. Chap 3 of [MRS]. | |
29/10/2015 | Query processing: skip pointers (with solution based on dynamic programming), caching, phrase queries. Zone index and tiered index. | Slides. Sect. 2.3 and 2.4 of [MRS]. | |
10/11/2015 | The auto-complete problem and its solutions for the top-1, top-2, ..., top-k strings. Rank/Select data structures: B untouched and Elias-Fano's approach with B compressed. | Slides. | |
12/11/2015 | Exercises. | ||
13/11/2015 | Text-based ranking: Dice, Jaccard, tf-idf. Vector space model. Storage of tf-idf and use for computing document-query similarity. Other exercises. | Sect 6.2 and 6.3 from [MRS]. | |
17/11/2015 | Fast top-k retrieval: high idf, champion lists, many query-terms, fancy hits, clustering. Relevance feedback, Rocchio, pseudo-relevance feedback, query expansion. | Slides. Chap 7 and 9 of [MRS] | |
19 and 20/11/2015 | Lab on Lucene. You need to configure your laptop as follows: a Linux system (it may be a virtual machine) with a Debian-like OS (e.g. Ubuntu 15.10), a working Internet connection from the Polo's room, at least 5GB of free disk space and 2GB of RAM, and httrack and pylucene installed (this can be done with sudo apt-get update and sudo apt-get install python-lucene httrack). In collaboration with Marco Cornolti. | Slides (crawling), Slides (Lucene) | |
24/11/2015 | Performance measures: precision, recall, F1 and user happiness. Random Walks. Link-based ranking: PageRank and personalized PageRank. | Slides. Chap 8 and 21 from [MRS]. | |
26/11/2015 | CoSimRank and HITS. Recommendation systems and Web advertising. | Slides only. | |
27/11/2015 | Projections to smaller spaces: Latent Semantic Indexing (LSI). Random Projections: Johnson-Lindenstrauss Lemma and its applications. | Slides. Chap 18 from [MRS]. | |
01/12/2015 | Semantic-annotation tools: basics, Wikipedia structure, TAGME and other annotators. How to evaluate those systems. Various approaches to text representation. | Slides. | |
03/12/2015 | More on topic annotators and their applications. Clustering: flat, hierarchical, soft, hard. K-means, optimal bisect, hierarchical - max, min, avg, centroid. | Slides. Chap 16 and 17 of [MRS]. | |
10/12/2015 | Locality-sensitive hashing: basics, Hamming distance, Jaccard similarity, sketch of the main theorem. | Slides. Sect 19.6 of [MRS]. | |
11/12/2015 | Exercise | ||
15/12/2015 | Exercise |
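
As a small illustration of the topic of the first lecture (inverted lists and AND queries), here is a minimal sketch in Python. It is not taken from the course material: the function names `build_index` and `intersect` and the toy documents are assumptions made only for this example. Posting lists are kept sorted by document id, so an AND query can be answered by a linear merge in time proportional to the sum of the two list lengths.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to the sorted list of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def intersect(p1, p2):
    """AND of two sorted posting lists via a linear merge: O(len(p1) + len(p2))."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # The document contains both terms: keep it and advance both lists.
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out


# Toy collection (hypothetical), document ids are the list positions 0..3.
docs = [
    "web search engines",
    "boolean retrieval model",
    "web graph and crawling",
    "search the web graph",
]
index = build_index(docs)
result = intersect(index["web"], index["graph"])
print(result)
```

An OR query can be handled with the same scan by emitting every id seen in either list, still in linear time; a NOT query needs the complement with respect to the whole collection, which is why it is usually evaluated in the form "A AND NOT B".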