Indice

Information Retrieval - Academic Year 2023/2024

General Information


Goals

Study, design and analysis of IR systems which are efficient and effective to process, mine, search, cluster and classify documents, coming from textual as well as any unstructured domain. In the lectures, we will:


Schedule of the Lectures

Week Schedule
Day Time Slot Room
Monday 11:00 - 13:00 Room C (Polo Fibonacci)
Tuesday 11:00 - 13:00 Room C (Polo Fibonacci)

Exams

The exam will consist of a written test including two parts: exercises and “oral” questions. The exam is passed if in both parts the student gets a sufficient score: expressed as the sum of two reported grades.

The first (exercises) and the second (theory questions) parts of the exam can be split into different exam dates, even of different exam sessions. The exam dates are the ones indicated in the calendar on ESAMI. In the case that the second part is not passed or the student abandons the exam, (s)he can keep the rank of the first exam, but this may occur just once. The second time this happens, the rank of the first part is dropped, and the student has to do both parts again.

Date Room Text Notes
13/11/2023, start at 14:00 room E Mid-term exam. text, solution, results
15/12/2023, start at 9:00 room E Final-term exam. text, solution, results Students must register at esami.unipi.it on the exam date of January and write “final-term” in the notes.
16/01/2024, start at 9:00 room C text, solution
07/02/2024, start at 11:00 room C text, solution Results. Students accepting their vote have to send an email to Ferragina, with that statement. Correction occurs on Wednesday, 14th February, at 17:30 in the Teams' room of the course. When you are there, please write on the chat that you are “waiting”.
27/05/2024, start at 11:00 room C text, solution
21/06/2024, start at 11:00 room C text, solution
11/07/2024, start at 11:00 room C text, solution
04/09/2024, start at 09:00 room C text, solution

Materials for study


Lectures

Video-lectures of last year are available at the link and they are linked just for reference if you wish to re-check something you listened in class. This year, lectures are in presence and the program of the course could be different.

Date Topics Refs
18.09.2023 Introduction to the course: modern IR, not just search engines! Boolean retrieval model. Matrix document-term. Inverted list: dictionary + postings. How to implement AND, OR, and NOT queries, and their time complexities. Slides and sketch of the lab activities.
Chapter 1 of [MRS]
19.09.2023 Skip pointers, Phrase queries, biword index, and positional index. Sect 2.3 and 2.4 of [MRS]
25.09.2023 Zone indexes. Web search engine: its structure, difficulties in their design, and their epochs. The Web graph: some useful structural properties (such as Bow Tie). Crawling: problems and algorithmic structure. An example: Mercator. The Bloom filter: definition, time/space complexity, and error bound. Slide1 and Slide2.
Sections Sections 19.1, 19.2, 19.4, 20.1, 20.2 of [MRS].
For doubts on Bloom Filters see paper.
26.09.2023 Spectral Bloom Filter: single minimum, recurring minimum, two-level SBF. Consistent Hashing. Web graph compression: properties of the web, power laws, and compressing the adjacency lists. Sect 20.3 and 20.4 of [MRS], and this page and note for consistent hashing.
03.10.2023 The issue of hierarchical memories: I/O-model. Index construction: multi-way mergesort, BSBI and SPIMI. Sketch on MapReduce. Distributed indexing: Term-based vs Doc-based partitioning. Dynamic indexing: two indexes, a cascade of indexes (logarithmic method). Slides.
Chapter 4 of [MRS].
09.10.2023 Locality-sensitive hashing: basics, hamming distance, proof of the probability bounds. Use in an off-line and in an on-line setting. Comparison between LSH and K-means for the clustering problem. Slides.
10.10.2023 Exact-duplicate documents: Karp-Rabin's rolling hash (with properties and error probability). Near-duplicate documents: Shingling, Jaccard similarity, min-hashing. Min-hashing (with prob property). LSH on integer vectors. Cosine similarity among vectors of real features. Sect 19.6 of [MRS]
16.10.2023 Compressed storage of documents: LZ-based compression. Storage and Transmission of single/group of file(s): Delta compression (Zdelta). Slide.
17.10.2023 File Synchronization (rsync, zsync). Parsing: tokenization, normalization, lemmatization, stemming, thesauri. Statistical properties of texts: Zipf law: classical and generalized, Heaps law, Luhn's consideration. Sect. 2.1, 2.2 and 5.1.
Suggest reading this paper on rsynch and zsync.
Slides on text parsing.
18.10.2023 Extra lecture (room C, 14 - 16): Exercises on the topics of the previous lectures Video.
23.10.2023 Exercises. LSH for cosine similarity estimation.
24.10.2023 Soft AND, Caching, and Tiered index. Keyword extraction: statistical, chi-square test. Rake tool. Query processing: soft-AND. Exact search: hashing. Prefix search: compacted trie, front coding, 2-level indexing. Sect. 5.2 of [MRS].
Slides.
30.10.2023 Edit distance with e-errors via brute-force approach, or Dynamic Programming (possibly weighted). Overlap measure with k-gram index. An index for e-error matches based on k-gram index (with false positives, no false negatives). Phonetic match.
31.10.2023 Wild-card queries: Permuterm. Scoring of the candidates.
06.11.2023 Posting list compression, codes: gamma, variable bytes (t-nibble), Simple9, Group varint. Sect. 5.3 of [MRS].
Slides.
Video
07.11.2023 Exercises
13.11.2023 MIDTERM exam: 14-16 in room E You have to register here by 8 november 2023.
14.11.2023 Posting list compression: PForDelta, Elias-Fano.
Text-based ranking: dice, jaccard, tf-idf.
Sect 6.2 of [MRS].
Slides
20.11.2023 Vector space model and cosine similarity doc-doc and query-doc. Storage of tf-idf and use for computing document-query similarity. Fast top-k retrieval: many query terms, high idf, champion lists, fancy hits, clustering. Sect 6.3 and 7 of [MRS]
21.11.2023 Exact Top-K: WAND and blocked-WAND.
27.11.2023 Relevance feedback, Rocchio, pseudo-relevance feedback, query expansion. Performance measures: precision, recall, F1, DCG and NDCG. Exercise on (Blocked) WAND. Sect 8.1-8.3 and 9 [MRS].
28.11.2023 Random Walks. Link-based ranking: PageRank. Sect 21.1 and 21.2 of [MRS].
Slides.
04.12.2023 Topic-specific PageRank. Personalized PageRank. HITS. Application to Text Summarization. Exercise on PageRank. Sect 21.2.3 and 21.3 of [MRS].
05.12.2023 Projections to smaller spaces: Latent Semantic Indexing (LSI). Exercise on TF-IDF. Chapter 18 of [MRS].
Slides.
11.12.2023 Extra lecture (11-13, room C): Exercises on posting list compression, Personalized PageRank, WAND and Blocked WAND. Lab on Entity linkers: TagMe. Slides.
12.12.2023 Extra lecture (11-13, room C): Lab on ElasticSearch, please bring your own laptop and make sure you have a working installation of a recent version of Python and Anaconda. Use them to create a clean environment specific for this course. The only required package is ElasticSearch: install it via pip install elasticsearch==7.14.0. All material of the lecture is in this repo Slides.
15.12.2023 Final-term exam (9-11, room E)