magistraleinformatica:ir:ir11:start
Indice
Information Retrieval - Academic Year 2011/2012
General Information
- Teacher : Paolo Ferragina
- Course ID: 289AA
- CFU: 6 (first semester)
- Language: English
- Lectures Schedule: Wednesday 9-11 and 16-18, both in room N1
- Question time: by appointment.
- Official Lecture's Log: The schedule and content of the lectures is available with the official registro.
- News about this course will be distributed via a Tweeter-channel
Goals
Study, design and analysis of IR systems which are efficient and effective to process, mine, search, cluster and classify documents, coming from textual, html or XML data collections. In particular, we will:
- describe the main components of a modern search engine: Crawler, Parser, Compressor, Indexer, Query resolver, Results Ranker, Results Classifier/Clusterer;
- present and use in the Lab some interesting Open-Source Tools for IR applications, such as Lucene and Web graph;
- introduce some basic algorithmic techniques which are now ubiquitous in any IR application for data classification, compression, clustering, projection, and sketching.
Exam
Project assigned during the course, plus an oral discussion concerning with the project and the course.
Books, notes, ...
- C.D. Manning, P. Raghavan, H. Schutze. Introduction to Information Retrieval. Cambridge University Press, 2008. [ link ]
- Chapter 2 on “Text compression” of Managing Gigabytes, I.H. Witten and A. Moffat and T.C. Bell, Morgan Kauffman, Second edition, 1999.
Content of the Lectures
Date | Argument | Refs |
---|---|---|
21/09/2011, morning | Boolean retrieval model. Matrix document-term. Inverted list: dictionary + postings. How to implement an AND, OR and NOT queries, and their time complexities. The issue of hierarchical memories: I/O-model. | Chapter 1 of [MRS] |
21/09/2011, afternoon | Parsing: tokenization, normalization, lemmatization, stemming, thesauri. Statistical properties of texts: Zipf law: classical and generalized, Heaps law, Luhn's consideration. | Sect. 2.1, 2.2 and 5.1 of [MRS]. Slides of this day |
28/09/2011, morning | Index construction: multi-way mergesort, BSBI and SPIMI, distributed indexing, dynamic indexing. | Chapter 4 of [MRS] |
28/09/2011, afternoon | Query processing: skip pointers, caching, phrase queries, wild-cards (permuterm, k-gram, dynamic programming) | Sect. 2.3, 2.4, 3.2, 3.3 of [MRS]. Slides of this day |
05/10/2011, morning | Zone indexes. Posting list compression, codes: gamma, delta, variable bytes, PForDelta. Document compression: Entropy, Prefix-free codes, Huffman. | Chapter 6.1, 5.3 of [MRS], and this paper. |
05/10/2011, afternoon | Document compression: Arithmetic coding, Lempel-Ziv77, gzip. Exact string search: hashing with chaining, cuckoo hashing. Prefix string search: trie, front-compression, 2-level indexing. | pag. 21-36, 52-56, 74-79 of [MG]. Sect 3.1 and 5.2 from [MRS]. Paper on cuckoo, just description (no proof). Slides of this day |
12/10/2011, morning | Bloom filter. Text-based ranking: dice/jaccard measures, tf-idf, vector space model, cosine similarity. | Sect 6.2, 6.3 from [MRS]. Paper on bloom filter. |
12/10/2011, afternoon | Top-k retrieval: high idf, champion lists, many-terms, fancy hits, clustering. Relevance feedback (Rocchio), pseudo-relevance feedback, query expansion. Quality of a search engine: precision, recall, F1. A sketch of recommendation systems and XML search engines. | Chap 7, 8, 9, 10 from [MRS]. Slides of this day |
19/10/2011, afternoon | Web search engines: difficulties for their design, the web-graph and its properties, how to compute the SCC I/O-efficiently, the size of the web and the estimation of the relative sizes of SE, the storage of the web-graph. | Chap 19.1 and 19.2, 19.5, 20.4 from [MRS]. Slides of this day |
26/10/2011, morning | Crawling and consistent hashing. Link-based ranking: PageRank, HITS, and weighted variants. | Chap 20.1, 20.2 and 21 from [MRS]. |
26/10/2011, afternoon | Latent Semantic Indexing and Random projections. Exact and near-document duplication: shingling, fingerprinting and min-wise permutations. | Chap 18 from [MRS]. Slides of this day |
27/10/2011, afternoon | Lucene in action, and the project on AIRPIM | Slides on lucene and Slides on the Airpim's project |
09/11/2011, morning | The other project on Twitter. | Slides on the Twitter's project, also in pdf. |
09/11/2011, afternoon | Lab at home: to collaborate in the project development with your colleagues please use the following facebook-group | |
16/11/2011, morning | Discussion on the Twitter's project and the assignment of papers to InfoUma students. | |
16/11/2011, afternoon | Lab at home | |
23/11/2011, morning | Lab at home | |
23/11/2011, afternoon | Lab at home | |
30/11/2011, morning | Lab at home | |
30/11/2011, afternoon | Discussion on the projects, current state | |
07/12/2011, morning | Exercises on the theory part | |
07/12/2011, afternoon | Lab at home |
magistraleinformatica/ir/ir11/start.txt · Ultima modifica: 06/05/2015 alle 13:25 (10 anni fa) da Paolo Ferragina