#ir2010
Study, design and analysis of IR systems which are efficient and effective to process, mine, search, cluster and classify documents, coming from textual, html or XML data collections. We will describe the main components of a modern search engine: Crawler, Parser, Compressor, Indexer, Query resolver, Results Ranker, Results Classifier/Clusterer; present and use in the Lab some effective Open-Source Tools (e.g. Lucene and Octave), and introduce some basic algorithmic techniques which are now ubiquitous in any IR application: data streaming, data sketching, data projection, etc..
Project assigned during the course, plus an oral discussion concerning with the project and the course material.
# Hours | Argument | Refs | Speaker |
---|---|---|---|
2 | Introduction: large data collections and new algorithmic challanges? I/O-efficient sorting as a universal tool to manage large datasets. | notes | |
2 | The Web Graph: Properties, storage (compression). | BV's paper, 19.1-19.2 in [MRS] | |
2 | The Web Graph: visits via (semi-)external algorithms. | ||
2 (+4 home) | Lab: Web Graph Library | web site | Ugo Scaiella |
2 | Search Engines: structure, crawling. | chap 20 in [MRS] | |
2 | The Bloom Filter and its Spectral variant. | paper | |
2 | Search Engines: Inverted Lists = Dictionary + Postings (statistical models: Zipf, Luhn). Parsing. Construction. | chaps 2.1-2.2, 4, 5.1 in [MRS] | |
2 | Search Engines: Dictionary indexing (Hashing vs trie). Front-coding and LPFC | chaps 5.1-5.2 in [MRS] | |
3 | Search Engines: IL compressed storage, LZ-compression, ZSync et al | chaps 5.3 in [MRS] and notes | |
3 | Search Engines: Document storage via Huffman, its canonical variant and Huffword. A simple word-encoder | pp. 21-41 in [WMB] | |
2 | Search Engines: Text-based ranking. Recommendation Systems (sketch). Quality of the results: Precision, Recall, F-measure. | chap 6 and 8 in [MRS] | |
2 (+4 home) | Lab: Lucene in action | web site | Ugo Scaiella |
4 | Search Engines: Link-based ranking; PageRank, HITS, Salsa. | chap 21 in [MRS] | Gianna del Corso |
2 | Reputation Systems | Gianna del Corso | |
2 | Latent Semantic Indexing and Random Projections. | chap 18 in [MRS] | |
4 | Clustering: sketch of k-means, MST-based, max-cut, spectral. | chap 16 and 17 in [MRS] | |
2 | Suffix Tree and Array for Text Mining | slides | |
1 | Web SPAM and Web Advertising | slides | 19.3 in [MRS] |
2 | Doc duplication and set similarity: shingles, sketches, min-wise permutations | chap 19.6 in [MRS] | |
8 | Automated text categorization | paper | Fabrizio Sebastiani |
00 | Query-Log mining | ||
00 | Advanced data structures: Locality Preserving Hashing |