====== Statistics for Data Science (628PP) A.Y. 2023/24 ======

=====Instructor=====

  * **Salvatore Ruggieri**
  * Università di Pisa
  * [[http://pages.di.unipi.it/ruggieri/]]
  * [[salvatore.ruggieri@unipi.it]]
  * **Office hours:** Tuesdays, 14:00 - 16:00 or by appointment, at the Department of Computer Science, room 321/DO, or via Teams.

=====Hours and rooms=====

^ Day of Week ^ Hour ^ Room ^
| Tuesday | 16:00 - 18:00 | Fib-C |
| Thursday | 14:00 - 16:00 | Fib-C |
| Friday | 11:00 - 13:00 | Fib-C |

A [[https://teams.microsoft.com/l/team/19%3ASjRtZgcEvEp6qBlbjmDPwXmns29MUOiYFtYTgIh2t-g1%40thread.tacv2/conversations?groupId=180796ed-55c5-443c-8f6f-cabd89c6db0d&tenantId=c7456b31-a220-47f5-be52-473828670aa1|Teams channel]] is used to post news, notes, Q&A, and other material related to the course.

Lectures are held in person only and will **NOT** be live-streamed, but recordings of the lectures, or of those from previous years, will be made available here for non-attending students.

=====Pre-requisites=====

Students should be comfortable with most of the topics on mathematical calculus covered in:

  * **[P]** J. Ward, J. Abdey. **Mathematics and Statistics**. University of London, 2013. __Chapters 1-8 of Part 1__.

Extra lessons refreshing these notions may be scheduled in the first part of the course.

=====Mandatory Teaching Material=====

The following are //mandatory textbooks//:

  * **[T]** F.M. Dekking, C. Kraaikamp, H.P. Lopuhaä, L.E. Meester. **A Modern Introduction to Probability and Statistics**. Springer, 2005.
  * **[R]** P. Dalgaard. **Introductory Statistics with R**. 2nd edition, Springer, 2008.
  * selected chapters of other books for advanced topics

=====Software=====

  * [[https://cran.r-project.org/|R]]
  * [[https://www.rstudio.com/|R Studio]]

=====Preliminary program and calendar=====

  * [[https://esami.unipi.it/programma.php?c=61293&aa=2023|Preliminary program]].
  * [[https://didattica.di.unipi.it/laurea-magistrale-in-data-science-and-business-informatics/orario-magistrale-data-science-business-informatics/|Calendar of lessons]].

=====Exams=====

__//There are no mid-terms//.__ The exam consists of a written part and an oral part. The written part consists of exercises and questions on the topics of the course. Each question is assigned a number of points, for a total of 30 points. Example written texts: **{{ :mds:sds:sds_sample1.pdf | sample1}}**, **{{ :mds:sds:sds_sample2.pdf | sample2}}**. Students are admitted to the oral part if they receive a grade of at least 18 points. The oral part consists of a critical discussion of the written part and of open questions and problem solving on the topics of the course (both theory and R programming). In particular, students must demonstrate that they can summarize both the theory and the software related to any of the lessons, using the slides and R scripts of the lessons.

Registration to exams is mandatory (**beware of the registration deadline!**): [[https://esami.unipi.it/esami2/|register here]].

The dates below are only for the written test (normal exam). Dates for project discussion are included in the project description.\\

^ Date ^ Hour ^ Room ^ Notes ^
| 6/11/2024 | 11:00 - 13:00 | Riunioni Ovest, Dept. Computer Science | [[https://didattica.di.unipi.it/en/appelli-straordinari/|Extra-ordinary exam]] |

=====Student project=====

  * The project replaces the written part of the examination
  * {{:mds:sds:s4ds.project.2024.pdf |Project description and rules and Q&A}}.
  * [[http://131.114.72.230/sds/video/s4ds.project.2024.mp4|Recording of project description (.mp4)]]

=====Class calendar=====

Lessons will **NOT** be live-streamed, but recordings of past years are available here for non-attending students.\\
To watch the recordings online, you must be connected to the [[https://start.unipi.it/en/help-ict/vpn/|unipi.it VPN]].
Alternatively, right-click on the link and download the whole file, then watch it locally on your device using e.g. [[http://www.videolan.org/vlc/|VLC media player]].

Slides and R scripts might be updated **after the classes** to align with the actual content of the lessons and to correct typos. Be sure to download the updated versions.

^ # ^ Date ^ Room ^ Topic ^ Mandatory teaching material ^
|01| 20/02 16-18| Fib-C | Introduction. Probability and independence. [[http://131.114.72.230/sds/video/sds01_20220215.mp4|rec01 (.mp4)]] | **[T]** Chpts. 1-3 {{:mds:sds:s4ds01.pdf|slides01 (.pdf)}}|
|02| 22/02 14-16| Fib-C | R basics. [[http://131.114.72.230/sds/video/sds02_20220217.mp4|rec02 (.mp4)]] | **[R]** Chpts. 1,2.1-2.3 {{:mds:sds:s4ds02.pdf|slides02 (.pdf)}}, {{:mds:sds:s4ds02.r|script02 (.R)}}|
|03| 23/02 11-13| Fib-C | Bayes' rule and applications. [[http://131.114.72.230/sds/video/sds03_20220218.mp4|rec03 (.mp4)]] | **[T]** Chpt. 3 {{:mds:sds:s4ds03.pdf|slides03 (.pdf)}}, {{:mds:sds:s4ds03.r|script03 (.R)}}|
|04| 27/02 16-18 | Fib-C | Discrete random variables. [[http://131.114.72.230/sds/video/sds04_20220222.mp4|rec04 (.mp4)]] | **[T]** Chpts. 4, 9.1, 9.2, 9.4 **[R]** Chpt. 3 {{:mds:sds:s4ds04.pdf|slides04 (.pdf)}}, {{:mds:sds:s4ds04.r|script04 (.R)}}|
|05| 29/02 14-16 | Fib-C | Discrete random variables (continued). [[http://131.114.72.230/sds/video/sds05_20220224.mp4|rec05 (.mp4)]] | |
|06| 01/03 11-13 | Fib-C | Refresher: derivatives and integrals. [[http://131.114.72.230/sds/video/sds06_20220225.mp4|rec06 (.mp4)]] | **[P]** Chpts. 1-8 {{:mds:sds:s4ds06.pdf|slides06 (.pdf)}}, {{:mds:sds:s4ds06.r|script06 (.R)}}|
|07| 05/03 16-18 | Fib-C | R data access and programming. [[http://131.114.72.230/sds/video/sds07_20220301.mp4|rec07 (.mp4)]] | **[R]** Chpts. 2.3,2.4 {{:mds:sds:s4ds07.zip|script07 (.zip)}} |
|08| 07/03 14-16 | Fib-C | Continuous random variables. [[http://131.114.72.230/sds/video/sds08_20220303.mp4|rec08 (.mp4)]] | **[T]** Chpts. 5, 9.2-9.4 **[R]** Chpt. 3 {{:mds:sds:s4ds08.pdf|slides08 (.pdf)}}, {{:mds:sds:s4ds08.r|script08 (.R)}}|
|09| 08/03 11-13 | Fib-C | Expectation and variance. Computations with random variables. [[http://131.114.72.230/sds/video/sds09_20220304v2.mp4|rec09 (.mp4)]] | **[T]** Chpts. 7,8 {{:mds:sds:s4ds09.pdf|slides09 (.pdf)}}, {{:mds:sds:s4ds09.r|script09 (.R)}}|
|10| 12/03 16-18| Fib-C | Expectation and variance. Computations with random variables (continued). Moments. Functions of random variables. [[http://131.114.72.230/sds/video/sds10_20220308v3.mp4|rec10 (.mp4)]] | **[T]** Chpts. 9-11 {{:mds:sds:s4ds10.pdf|slides10 (.pdf)}}, {{:mds:sds:s4ds10.zip|script10 (.zip)}} |
|11| 14/03 14-16| Fib-C | Functions of random variables (continued). Distances between distributions. [[http://131.114.72.230/sds/video/sds11_20240314.mp4|rec11 (.mp4)]] | {{:mds:sds:murphychpt6.pdf|Murphy's book}} Chpt. 6 {{:mds:sds:s4ds11.pdf|slides11 (.pdf)}}, {{:mds:sds:s4ds11.R|script11 (.R)}} |
|12| 15/03 11-13 | Fib-C | Simulation. [[http://131.114.72.230/sds/video/sds12_20220311v2.mp4|rec12 (.mp4)]] | **[T]** Chpts. 6.1-6.2 {{:mds:sds:s4ds12.pdf|slides12 (.pdf)}}, {{:mds:sds:s4ds12.r|script12 (.R)}} {{:mds:sds:s4ds12_sol07.r|script12_sol07 (.R)}}|
|13| 19/03 16-18 | Fib-C | Power laws and Zipf's law. [[http://131.114.72.230/sds/video/sds13_20220315.mp4|rec13 (.mp4)]] | [[https://arxiv.org/pdf/cond-mat/0412004.pdf | Newman's paper]] Sects. I, II, III(A,B,E,F) {{:mds:sds:s4ds13.pdf|slides13 (.pdf)}}, {{:mds:sds:s4ds13.r|script13 (.R)}}|
|14| 21/03 14-16| Fib-C | Law of large numbers. The central limit theorem. [[http://131.114.72.230/sds/video/sds14_20220317.mp4|rec14 (.mp4)]] | **[T]** Chpts. 13-14 {{:mds:sds:s4ds14.pdf|slides14 (.pdf)}}, {{:mds:sds:s4ds14.R|script14 (.R)}} |
|15| 22/03 11-13 | Fib-C | Graphical summaries. Kernel Density Estimation. [[http://131.114.72.230/sds/video/sds15_20220322.mp4|rec15 (.mp4)]] | **[T]** Chpt. 15, **[R]** Chpt. 4 {{:mds:sds:s4ds15.pdf|slides15 (.pdf)}}, {{:mds:sds:s4ds15.r|script15 (.R)}}|
|16| 26/03 16-18| Fib-C | Numerical summaries. [[http://131.114.72.230/sds/video/sds16_20220324.mp4|rec16 (.mp4)]] | **[T]** Chpt. 16, **[R]** Chpt. 4 {{:mds:sds:s4ds16.pdf|slides16 (.pdf)}}, {{:mds:sds:s4ds16.r|script16 (.R)}} |
|17| 28/03 14-16 | Fib-C | Data preprocessing in R. Estimators. [[http://131.114.72.230/sds/video/sds17_20220325.mp4|rec17 (.mp4)]] | **[R]** Chpt. 10, **[T]** Chpts. 17.1-17.3 {{:mds:sds:s4ds17.r|script17 (.R)}}, {{ :mds:sds:dataprep.r | dataprep.R}} |
|18| 04/04 14-16 | Fib-C | Unbiased estimators. Efficiency and MSE. [[http://131.114.72.230/sds/video/sds18_20220329.mp4|rec18 (.mp4)]] | **[T]** Chpts. 19, 20 {{:mds:sds:s4ds18.pdf|slides18 (.pdf)}}, {{:mds:sds:s4ds18.r|script18 (.R)}} |
|19| 05/04 11-13 | Fib-C | Maximum likelihood estimation. [[http://131.114.72.230/sds/video/sds19_20220331.mp4|rec19 (.mp4)]] | **[T]** Chpt. 21 {{ :mds:sds:s4dsln.pdf |}} Chpt. 1 {{:mds:sds:s4ds19.pdf|slides19 (.pdf)}}, {{:mds:sds:s4ds19.r|script19 (.R)}} |
|20| 09/04 16-18 | Fib-C | Linear regression. Least squares estimation. [[http://131.114.72.230/sds/video/sds20_20220405.mp4|rec20 (.mp4)]] | **[T]** Chpts. 17.4,22 **[R]** Chpt. 6 {{ :mds:sds:s4dsln.pdf |}} Chpt. 2 {{:mds:sds:s4ds20.pdf|slides20 (.pdf)}}, {{:mds:sds:s4ds20.r|script20 (.R)}} |
|21| 11/04 14-16 | Fib-C | Non-linear and multiple linear regression. [[http://131.114.72.230/sds/video/sds21_20220407.mp4|rec21 (.mp4)]] | **[R]** Chpts. 12.1,13,16.1-16.2 {{ :mds:sds:s4dsln.pdf |}} Chpt. 2 {{:mds:sds:s4ds21.pdf|slides21 (.pdf)}}, {{:mds:sds:s4ds21.R|script21 (.R)}} |
|22| 12/04 11-13 | Fib-C | Issues with linear regression. Logistic regression. [[http://131.114.72.230/sds/video/sds22_20220408.mp4|rec22 (.mp4)]] | **[R]** Chpts. 12.1,13,16.1-16.2 {{:mds:sds:s4ds22.pdf|slides22 (.pdf)}}, {{:mds:sds:s4ds22.zip|script22 (.zip)}} |
|23| 16/04 16-18 | Fib-C | Statistical decision theory. [[http://131.114.72.230/sds/video/sds23_20220412.mp4|rec23 (.mp4)]] | {{ :mds:sds:s4dsln.pdf |}} Chpt. 4 {{:mds:sds:s4ds23.pdf|slides23 (.pdf)}}, {{:mds:sds:s4ds23.r|script23 (.R)}} |
|24| 18/04 14-16 | Fib-C | Statistical decision theory (continued). [[http://131.114.72.230/sds/video/sds24_20220421.mp4|rec24 (.mp4)]] | |
|25| 19/04 11-13 | Fib-C | Statistical decision theory (continued). Project presentation. | |
|26| 23/04 16-18| Fib-C | Confidence intervals: mean, proportion, linear regression. [[http://131.114.72.230/sds/video/sds26_20220422.mp4|rec26 (.mp4)]] | **[T]** Chpts. 23.1,23.2,23.4,24.3,24.4 {{ :mds:sds:s4dsln.pdf |}} Chpt. 3 {{:mds:sds:s4ds26.pdf|slides26 (.pdf)}}, {{:mds:sds:s4ds26.r|script26 (.R)}} |
|27| 30/04 16-18| Fib-C | Confidence intervals (continued). Bootstrap and resampling methods. [[http://131.114.72.230/sds/video/sds27_20220426.mp4|rec27 (.mp4)]] | **[T]** Chpts. 18.1-18.3,23.3 {{:mds:sds:s4ds27.pdf|slides27 (.pdf)}}, {{:mds:sds:s4ds27.r|script27 (.R)}} |
|28| 02/05 14-16| Fib-C | Bootstrap and resampling methods (continued). [[http://131.114.72.230/sds/video/sds28_20220428.mp4|rec28 (.mp4)]] | |
|29| 03/05 11-13| Fib-C | Hypothesis testing. One-sample tests of the mean and application to linear regression. [[http://131.114.72.230/sds/video/sds29_20220429.mp4|rec29 (.mp4)]] | **[T]** Chpts. 25,26,27, **[R]** Chpts. 5.1,5.2 {{ :mds:sds:s4dsln.pdf |}} Chpt. 3.3 {{:mds:sds:s4ds29.pdf|slides29 (.pdf)}}, {{:mds:sds:s4ds29.r|script29 (.R)}} |
|s03| 07/05 16-18| Fib-C | //Mandatory seminar:// Introduction to causal modeling and reasoning. Speakers: I. Beretta and M. Cinquini. [[http://131.114.72.230/sds/video/sds_s03_20240507.mp4|rec_s03 (.mp4)]] | {{:mds:sds:s4ds_s03.pdf|slides_s03 (.pdf)}}|
|30| 09/05 14-16| Fib-C | One-sample tests of the mean and application to linear regression (continued). Classifier performance metrics in R. [[http://131.114.72.230/sds/video/sds30_2022mix.mp4|rec30 (.mp4)]] | {{:mds:sds:s4ds30.pdf|slides30 (.pdf)}}, {{:mds:sds:s4ds30.r|script30 (.R)}} |
|31| 10/05 11-13| Fib-C | Two-sample tests of the mean and applications to classifier comparison. [[http://131.114.72.230/sds/video/sds31_2022mix.mp4|rec31 (.mp4)]] | **[T]** Chpt. 28, **[R]** Chpts. 5.3-5.7 {{:mds:sds:s4ds31.pdf|slides31 (.pdf)}}, {{:mds:sds:s4ds31.r|script31 (.R)}} |
|32| 14/05 16-18| Fib-C | Multiple-sample tests of the mean and applications to classifier comparison. [[http://131.114.72.230/sds/video/sds32_2022mix.mp4|rec32 (.mp4)]] | **[R]** Chpt. 7 {{:mds:sds:s4ds32.pdf|slides32 (.pdf)}}, {{:mds:sds:s4ds32.r|script32 (.R)}} |
|33| 16/05 14-16| Fib-C | Fitting distributions. Testing independence/association. [[http://131.114.72.230/sds/video/sds33_2022mix.mp4|rec33 (.mp4)]] | **[R]** Chpt. 8 {{ :mds:smd:ks.pdf | K-S}}, {{:mds:sds:s4ds33.pdf|slides33 (.pdf)}}, {{:mds:sds:s4ds33.r|script33 (.R)}} |
|34| 17/05 11-13| Fib-C | Fitting distributions. Testing independence/association (continued). Project Q&A. | |
|35| 21/05 16-18| Fib-C | Project Q&A. | |

=====Seminars of past years=====

In some years, speakers were invited to give a seminar on advanced topics. Here is a list of the seminars held in past years.

^ # ^ Date ^ Room ^ Topic ^ Teaching material ^
|s01| 04/05/2022 9-11| Gerace+Teams | Bias in statistics and causal reasoning. Speaker: prof. Fabrizia Mealli [[http://131.114.72.230/sds/video/sds_s01_20220504.mp4|rec_s01 (.mp4)]] | {{:mds:sds:s4ds_s01.pdf|slides_s01 (.pdf)}} [[https://statistics.fas.harvard.edu/files/statistics-2/files/statistical_paradises_and_paradoxes.pdf|Optional reading]] |
|s02| 04/05/2022 11-13| Gerace+Teams | Bias in statistics and causal reasoning (continued). Speaker: prof. Fabrizia Mealli [[http://131.114.72.230/sds/video/sds_s02_20220504.mp4|rec_s02 (.mp4)]] | |

=====Past years=====

  * [[mds:sds:2022|Statistics for Data Science A.Y. 2022/23]]

Moreover, this 9 ECTS course replaces an older 6 ECTS version: [[mds:smd: |Statistical Methods for Data Science A.Y. 2020/21 (500PP)]]. The 6 ECTS version is discontinued. Students having the 6 ECTS version in their study plan can still take the 6 ECTS version exam for the A.Y. 2021/22, 2022/23 and 2023/24. However, there will be no specific project for the 6 ECTS version.