magistraleinformatica:ir:ir12:start
Indice
Information Retrieval - Academic Year 2012/2013
General Information
- Teacher : Paolo Ferragina
- Course ID: 289AA
- CFU: 6 (first semester)
- Language: English
- Lectures Schedule: Monday and Tuesday 14-16, both in room L1
- Question time: by appointment.
- Official Lecture's Log: The schedule and content of the lectures is available with the official registro.
- News about this course will be distributed via a Tweeter-channel
Goals
Study, design and analysis of IR systems which are efficient and effective to process, mine, search, cluster and classify documents, coming from textual, html or XML data collections. In particular, we will:
- describe the main components of a modern search engine: Crawler, Parser, Compressor, Indexer, Query resolver, Results Ranker, Results Classifier/Clusterer;
- present and use in the Lab some interesting Open-Source Tools for IR applications, such as Lucene and Web graph;
- introduce some basic algorithmic techniques which are now ubiquitous in any IR application for data classification, compression, clustering, projection, and sketching.
Exam
Project assigned during the course, plus an oral discussion concerning with the project and the course.
Books, notes, ...
Content of the Lectures
Date | Argument | Refs |
---|---|---|
24/09/2012 | Boolean retrieval model. Matrix document-term. Inverted list: dictionary + postings. How to implement an AND, OR and NOT queries, and their time complexities. The issue of hierarchical memories: I/O-model. | Chapter 1 of [MRS] |
25/09/2012 | Parsing: tokenization, normalization, lemmatization, stemming, thesauri. Statistical properties of texts: Zipf law: classical and generalized, Heaps law, Luhn's consideration. | Sect. 2.1, 2.2 and 5.1 of [MRS]. Slides of this day |
01/10/2012 | NO Lecture | |
02/10/2012 | Index construction: multi-way mergesort, BSBI and SPIMI. | Chapter 4 of [MRS]. Slides. |
08/10/2012 | Distributed and dynamic indexing. Query processing: skip pointers, caching, phrase queries | Chapter 4, Sect. 2.3, 2.4 of [MRS]. Slides. |
09/10/2012 | wild-cards (permuterm, k-gram, dynamic programming). Zone indexes. | Sect. 3.2, 3.3, 6.1 of [MRS]. |
15/10/2012 | Posting list compression, codes: gamma, delta, variable bytes, PForDelta. | Sect. 5.3 of [MRS]. Read also this paper. |
16/10/2012 | Notion of Entropy and prefix-free codes. Huffman and Arithmetic coding. LZ77 and gzip. Hashing with chaining, cuckoo hashing. | pag. 21-36, 52-56, 74-79 of [MG]. Slides. Paper on cuckoo, just description (no proof). |
18/10/2012 | Bloom filter and its applications. Prefix-search: trie, front-coding and two-level indexing. Text-based ranking: dice, jaccard, tf-idf. Vector space model. Storage of tf-idf and use for computing document-query similarity. | Sect 6.2, 6.3 from [MRS]. Paper on bloom filter. Slides. |
22/10/2012 | Fast top-k retrieval: high idf, champion lists, many query-terms, fancy hits, clustering. Relevance feedback, Rocchio, pseudo-relevance feedback, query expansion. | Chap 7 and 9 from [MRS]. Slides. |
23/10/2012 | Web search engines: difficulties for their design, the web-graph and its properties, how to compute the SCC I/O-efficiently, the size of the web and the estimation of the relative sizes of SE, the storage of the web-graph. | Chap 19.1 and 19.2, 19.5, 20.4 from [MRS]. Slides. |
25/10/2012 | Crawling and consistent hashing. Link-based ranking: pagerank and HITS and weighted variants. Evaluation measures: precision, recall, F1. Recommendation Systems. | Chap 20.1, 20.2, 21 and 8 from [MRS]. Slides. |
29/10/2012 | Some notes on XML search engines and Web advertising. | chap 10 from [MRS]. Slides. |
30/10/2012 | Latent Semantic Indexing and Random projections. Exact and near-document duplication: shingling, fingerprinting and min-wise permutations. Semantic-annotation tools. | chap 18 and 19.6 from [MRS]. Slides. |
05/11/2012 | Software project | description |
08/11/2012 | Clustering: flat, hierarchical, soft, hard. K-means, optimal bisect, hierarchical - max, min, avg, centroid. Co-clustering. | Chap 16 and 17 of [MRS], slides. |
12/11/2012 | Framework map-reduce, hadoop, with examples. | By Claudio Lucchese, slides and some code. |
13/11/2012 | Beyond hadoop: pig and giraffe. | By Fabrizio Silvestri, slides. |
19/11/2012 | Project discussion | |
20/11/2012 | Lab at home | |
26/11/2012 | Lab at home | |
27/11/2012 | Lab at home | |
03/12/2012 | Lab at home | |
04/12/2012 | Lab at home | |
10/11/2012 | Lab at home | |
11/11/2012 | Lab at home | |
17/11/2012 | Lab at home | |
18/11/2012 | Lab at home |
magistraleinformatica/ir/ir12/start.txt · Ultima modifica: 16/07/2013 alle 08:29 (12 anni fa) da Paolo Ferragina