Information Retrieval - Academic Year 2011/2012

General Information

Teacher : Paolo Ferragina
Course ID: 289AA
CFU: 6 (first semester)
Language: English
Lectures Schedule: Wednesday 9-11 and 16-18, both in room N1
Question time: by appointment.
Official Lecture's Log: The schedule and content of the lectures is available with the official registro.
News about this course will be distributed via a Tweeter-channel

Goals

Study, design and analysis of IR systems which are efficient and effective to process, mine, search, cluster and classify documents, coming from textual, html or XML data collections. In particular, we will:

describe the main components of a modern search engine: Crawler, Parser, Compressor, Indexer, Query resolver, Results Ranker, Results Classifier/Clusterer;

present and use in the Lab some interesting Open-Source Tools for IR applications, such as Lucene and Web graph;

introduce some basic algorithmic techniques which are now ubiquitous in any IR application for data classification, compression, clustering, projection, and sketching.

Exam

Project assigned during the course, plus an oral discussion concerning with the project and the course.

Date	Text of the exam	Results
07/12/2011	pre-exam
11/01/2012	text
1/02/2012	text
8/06/2012	text
23/07/2012	text
03/09/2012	text

Books, notes, ...

C.D. Manning, P. Raghavan, H. Schutze. Introduction to Information Retrieval. Cambridge University Press, 2008. [ link ]
Chapter 2 on “Text compression” of Managing Gigabytes, I.H. Witten and A. Moffat and T.C. Bell, Morgan Kauffman, Second edition, 1999.

Content of the Lectures

Date	Argument	Refs

21/09/2011, morning	Boolean retrieval model. Matrix document-term. Inverted list: dictionary + postings. How to implement an AND, OR and NOT queries, and their time complexities. The issue of hierarchical memories: I/O-model.	Chapter 1 of [MRS]
21/09/2011, afternoon	Parsing: tokenization, normalization, lemmatization, stemming, thesauri. Statistical properties of texts: Zipf law: classical and generalized, Heaps law, Luhn's consideration.	Sect. 2.1, 2.2 and 5.1 of [MRS]. Slides of this day
28/09/2011, morning	Index construction: multi-way mergesort, BSBI and SPIMI, distributed indexing, dynamic indexing.	Chapter 4 of [MRS]
28/09/2011, afternoon	Query processing: skip pointers, caching, phrase queries, wild-cards (permuterm, k-gram, dynamic programming)	Sect. 2.3, 2.4, 3.2, 3.3 of [MRS]. Slides of this day
05/10/2011, morning	Zone indexes. Posting list compression, codes: gamma, delta, variable bytes, PForDelta. Document compression: Entropy, Prefix-free codes, Huffman.	Chapter 6.1, 5.3 of [MRS], and this paper.
05/10/2011, afternoon	Document compression: Arithmetic coding, Lempel-Ziv77, gzip. Exact string search: hashing with chaining, cuckoo hashing. Prefix string search: trie, front-compression, 2-level indexing.	pag. 21-36, 52-56, 74-79 of [MG]. Sect 3.1 and 5.2 from [MRS]. Paper on cuckoo, just description (no proof). Slides of this day
12/10/2011, morning	Bloom filter. Text-based ranking: dice/jaccard measures, tf-idf, vector space model, cosine similarity.	Sect 6.2, 6.3 from [MRS]. Paper on bloom filter.
12/10/2011, afternoon	Top-k retrieval: high idf, champion lists, many-terms, fancy hits, clustering. Relevance feedback (Rocchio), pseudo-relevance feedback, query expansion. Quality of a search engine: precision, recall, F1. A sketch of recommendation systems and XML search engines.	Chap 7, 8, 9, 10 from [MRS]. Slides of this day
19/10/2011, afternoon	Web search engines: difficulties for their design, the web-graph and its properties, how to compute the SCC I/O-efficiently, the size of the web and the estimation of the relative sizes of SE, the storage of the web-graph.	Chap 19.1 and 19.2, 19.5, 20.4 from [MRS]. Slides of this day
26/10/2011, morning	Crawling and consistent hashing. Link-based ranking: PageRank, HITS, and weighted variants.	Chap 20.1, 20.2 and 21 from [MRS].
26/10/2011, afternoon	Latent Semantic Indexing and Random projections. Exact and near-document duplication: shingling, fingerprinting and min-wise permutations.	Chap 18 from [MRS]. Slides of this day
27/10/2011, afternoon	Lucene in action, and the project on AIRPIM	Slides on lucene and Slides on the Airpim's project
09/11/2011, morning	The other project on Twitter.	Slides on the Twitter's project, also in pdf.
09/11/2011, afternoon	Lab at home: to collaborate in the project development with your colleagues please use the following facebook-group
16/11/2011, morning	Discussion on the Twitter's project and the assignment of papers to InfoUma students.
16/11/2011, afternoon	Lab at home
23/11/2011, morning	Lab at home
23/11/2011, afternoon	Lab at home
30/11/2011, morning	Lab at home
30/11/2011, afternoon	Discussion on the projects, current state
07/12/2011, morning	Exercises on the theory part
07/12/2011, afternoon	Lab at home