Statistics

Statistical analysis aims to understand relevant characteristics of (sub)populations or phenomena described by the collected data. This field of data science and artificial intelligence offers a wealth of statistical models and techniques that summarize data into understandable quantities and features with a direct interpretation in, and impact on, real-world situations. It includes methods for collecting data (e.g. survey sampling, experimental design), modeling data (e.g. time-series, linear, non-linear, and generalized mixed models, survival and reliability analysis), and exploring or learning from data (e.g. cluster analysis, discriminant analysis, resampling). The field is characterized by its ability to analyze complex data sets and to properly describe or address the inherent uncertainty in the observed data (variability in data collection, measurement errors, incompleteness, outliers, heterogeneity, systematic biases).

Choosing the appropriate statistical method for a specific data set requires knowledge, experience, skill, and practice. Large and complex data sets can often be approached in many different ways, and it is imperative that the user knows the strengths and weaknesses of each statistical analysis for the task at hand. It is impossible for a single statistical analysis to address every aspect of such a data set, since statistical methods are not always rich enough to capture them all. Knowing which aspects of the data can, given the chosen statistical models and methods, be ignored or treated as nuisance is an important part of statistical analysis.

The trajectory contains the following courses:

  • 2AMS11 - Survival analysis for Data Scientists
  • 2DD23 - Time series analysis and forecasting (TSF)
  • 2DI70 - Statistical learning theory (SLT)
  • 2AMS30 - Network Statistics for Data Science

The courses in this trajectory teach you how statistical methods can be used to analyze and model temporal and big data sets, and how to learn from this type of data. For each topic, several approaches are presented and compared on what information they can and cannot extract from the data. The courses also discuss how to evaluate the fit of methods and techniques to the data (e.g. goodness-of-fit, outlier detection, likelihood information criteria), teaching students how to assess the appropriateness of statistical models and techniques. All courses use real, complex data sets that have not been cleaned, in order to mimic realistic situations. For the analysis of temporal data, we offer two courses (LDA and TSF). LDA provides several modeling, estimation, and hypothesis testing techniques (e.g. mixed effects models) that are well suited to data sets with relatively few repeated measurements and variables and a relatively large number of units. TSF focuses on models that can handle temporal data over longer time periods for a limited number of units and variables. SLT presents algorithms for data analysis and discusses how to choose the complexity of the statistical models.