Information Retrieval - Academic Year 2012/2013

General Information

Teacher : Paolo Ferragina
Course ID: 289AA
CFU: 6 (first semester)
Language: English
Lectures Schedule: Monday and Tuesday 14-16, both in room L1
Question time: by appointment.
Official Lecture's Log: The schedule and content of the lectures is available with the official registro.
News about this course will be distributed via a Tweeter-channel

Goals

Study, design and analysis of IR systems which are efficient and effective to process, mine, search, cluster and classify documents, coming from textual, html or XML data collections. In particular, we will:

describe the main components of a modern search engine: Crawler, Parser, Compressor, Indexer, Query resolver, Results Ranker, Results Classifier/Clusterer;

present and use in the Lab some interesting Open-Source Tools for IR applications, such as Lucene and Web graph;

introduce some basic algorithmic techniques which are now ubiquitous in any IR application for data classification, compression, clustering, projection, and sketching.

Exam

Project assigned during the course, plus an oral discussion concerning with the project and the course.

Date	Text of the exam	Results
10 January 2013	text	.
12 February 2013	text	.
25 June 2013	text	.
16 July 2013	text	.

Books, notes, ...

[MRS] C.D. Manning, P. Raghavan, H. Schutze. Introduction to Information Retrieval. Cambridge University Press, 2008. [ link ]
[MG] Chapter 2 “Text compression” of Managing Gigabytes, I.H. Witten and A. Moffat and T.C. Bell, Morgan Kauffman, Second edition, 1999.

Content of the Lectures

Date	Argument	Refs

24/09/2012	Boolean retrieval model. Matrix document-term. Inverted list: dictionary + postings. How to implement an AND, OR and NOT queries, and their time complexities. The issue of hierarchical memories: I/O-model.	Chapter 1 of [MRS]
25/09/2012	Parsing: tokenization, normalization, lemmatization, stemming, thesauri. Statistical properties of texts: Zipf law: classical and generalized, Heaps law, Luhn's consideration.	Sect. 2.1, 2.2 and 5.1 of [MRS]. Slides of this day
01/10/2012	NO Lecture
02/10/2012	Index construction: multi-way mergesort, BSBI and SPIMI.	Chapter 4 of [MRS]. Slides.
08/10/2012	Distributed and dynamic indexing. Query processing: skip pointers, caching, phrase queries	Chapter 4, Sect. 2.3, 2.4 of [MRS]. Slides.
09/10/2012	wild-cards (permuterm, k-gram, dynamic programming). Zone indexes.	Sect. 3.2, 3.3, 6.1 of [MRS].
15/10/2012	Posting list compression, codes: gamma, delta, variable bytes, PForDelta.	Sect. 5.3 of [MRS]. Read also this paper.
16/10/2012	Notion of Entropy and prefix-free codes. Huffman and Arithmetic coding. LZ77 and gzip. Hashing with chaining, cuckoo hashing.	pag. 21-36, 52-56, 74-79 of [MG]. Slides. Paper on cuckoo, just description (no proof).
18/10/2012	Bloom filter and its applications. Prefix-search: trie, front-coding and two-level indexing. Text-based ranking: dice, jaccard, tf-idf. Vector space model. Storage of tf-idf and use for computing document-query similarity.	Sect 6.2, 6.3 from [MRS]. Paper on bloom filter. Slides.
22/10/2012	Fast top-k retrieval: high idf, champion lists, many query-terms, fancy hits, clustering. Relevance feedback, Rocchio, pseudo-relevance feedback, query expansion.	Chap 7 and 9 from [MRS]. Slides.
23/10/2012	Web search engines: difficulties for their design, the web-graph and its properties, how to compute the SCC I/O-efficiently, the size of the web and the estimation of the relative sizes of SE, the storage of the web-graph.	Chap 19.1 and 19.2, 19.5, 20.4 from [MRS]. Slides.
25/10/2012	Crawling and consistent hashing. Link-based ranking: pagerank and HITS and weighted variants. Evaluation measures: precision, recall, F1. Recommendation Systems.	Chap 20.1, 20.2, 21 and 8 from [MRS]. Slides.
29/10/2012	Some notes on XML search engines and Web advertising.	chap 10 from [MRS]. Slides.
30/10/2012	Latent Semantic Indexing and Random projections. Exact and near-document duplication: shingling, fingerprinting and min-wise permutations. Semantic-annotation tools.	chap 18 and 19.6 from [MRS]. Slides.
05/11/2012	Software project	description
08/11/2012	Clustering: flat, hierarchical, soft, hard. K-means, optimal bisect, hierarchical - max, min, avg, centroid. Co-clustering.	Chap 16 and 17 of [MRS], slides.
12/11/2012	Framework map-reduce, hadoop, with examples.	By Claudio Lucchese, slides and some code.
13/11/2012	Beyond hadoop: pig and giraffe.	By Fabrizio Silvestri, slides.
19/11/2012	Project discussion
20/11/2012	Lab at home
26/11/2012	Lab at home
27/11/2012	Lab at home
03/12/2012	Lab at home
04/12/2012	Lab at home
10/11/2012	Lab at home
11/11/2012	Lab at home
17/11/2012	Lab at home
18/11/2012	Lab at home