Learning Goals
Learning goals for our course topics are listed below. Use these to guide your synthesis of video and reading material.
Introduction to Statistical Machine Learning
- Formulate research questions that align with regression, classification, or unsupervised learning tasks
Evaluating Regression Models
- Create and interpret residuals vs. fitted and residuals vs. predictor plots to identify potential modeling improvements and to address ethical concerns
- Interpret MSE, RMSE, MAE, and R-squared in a contextually meaningful way
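To make these metrics concrete, here is a minimal Python sketch that computes each one directly from residuals; the data values are hypothetical, chosen only for illustration.

```python
import numpy as np

# Hypothetical observed values and model predictions.
y = np.array([3.0, 5.0, 7.5, 9.0, 11.0])
y_hat = np.array([2.5, 5.5, 7.0, 9.5, 10.0])

residuals = y - y_hat
mse = np.mean(residuals ** 2)            # mean squared error
rmse = np.sqrt(mse)                      # back on the scale/units of y
mae = np.mean(np.abs(residuals))         # mean absolute error; less sensitive to outliers
r_squared = 1 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)
```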
Overfitting and cross-validation
- Explain why training/in-sample model evaluation metrics can provide a misleading view of true test/out-of-sample performance
- Accurately describe all steps of cross-validation to estimate the test/out-of-sample version of a model evaluation metric
- Explain what role CV has in a predictive modeling analysis and its connection to overfitting
- Explain the pros/cons of higher vs. lower k in k-fold CV in terms of sample size and computing time
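A minimal sketch of the full k-fold procedure, assuming generic `fit` and `predict` callables (these names are placeholders, not from any particular library):

```python
import numpy as np

def kfold_mse(x, y, k, fit, predict):
    """Estimate test MSE by k-fold CV. `fit` and `predict` stand in for
    any modeling routine: fit(x_tr, y_tr) -> model, predict(model, x_te)."""
    rng = np.random.default_rng(0)
    idx = rng.permutation(len(y))             # shuffle before splitting
    folds = np.array_split(idx, k)            # k roughly equal folds
    fold_mses = []
    for test_idx in folds:
        train_idx = np.setdiff1d(idx, test_idx)
        model = fit(x[train_idx], y[train_idx])   # train on the other k-1 folds
        preds = predict(model, x[test_idx])       # evaluate on the held-out fold
        fold_mses.append(np.mean((y[test_idx] - preds) ** 2))
    return float(np.mean(fold_mses))          # average the fold-level MSEs
```

For instance, for a simple linear model you could pass `fit=lambda a, b: np.polyfit(a, b, 1)` and `predict=np.polyval`. Note the tradeoff in the last goal above: a larger k means each model trains on nearly the full sample, at the cost of more model fits.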
Subset selection
- Clearly describe the forward and backward stepwise selection algorithms and explain why they are examples of greedy algorithms
- Compare best subset and stepwise algorithms in terms of optimality of output and computation time
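The greedy character of forward stepwise selection is easiest to see in code: each chosen predictor is locked in and never reconsidered. A sketch, assuming a numeric predictor matrix `X` (function and variable names are illustrative):

```python
import numpy as np

def forward_stepwise(X, y, max_vars):
    """Greedy forward selection: at each step, add the single predictor
    that most reduces the residual sum of squares (RSS)."""
    n, p = X.shape
    selected, remaining = [], list(range(p))
    for _ in range(max_vars):
        best_rss, best_j = np.inf, None
        for j in remaining:                       # try each candidate once
            cols = selected + [j]
            A = np.column_stack([np.ones(n), X[:, cols]])
            beta, *_ = np.linalg.lstsq(A, y, rcond=None)
            rss = np.sum((y - A @ beta) ** 2)
            if rss < best_rss:
                best_rss, best_j = rss, j
        selected.append(best_j)                   # commit: never revisited
        remaining.remove(best_j)
    return selected
```

Best subset would instead search all 2^p predictor combinations: guaranteed to find the lowest-RSS model of each size, but at exponential cost, whereas stepwise fits only on the order of p^2 models.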
LASSO (shrinkage/regularization)
- Explain how ordinary and penalized least squares are similar and different with regard to (1) the form of the objective function and (2) the goal of variable selection
- Explain how the lambda tuning parameter affects model performance and how this is related to overfitting
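Written side by side, the two objective functions differ only in the penalty term (standard notation, not tied to a specific textbook):

```latex
\text{OLS:}\quad \min_{\beta}\ \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2
\qquad
\text{LASSO:}\quad \min_{\beta}\ \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda\sum_{j=1}^{p}|\beta_j|
```

At lambda = 0 the LASSO reduces to OLS; as lambda grows, coefficients shrink and some reach exactly zero, performing variable selection. Too small a lambda risks overfitting, while too large a lambda underfits, which is why lambda is typically tuned with cross-validation.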
KNN Regression and the Bias-Variance Tradeoff
- Clearly describe / implement by hand the KNN algorithm for making a regression prediction
- Explain how the number of neighbors relates to the bias-variance tradeoff
- Explain the difference between parametric and nonparametric methods
- Explain how the curse of dimensionality relates to the performance of KNN (not in the video; will be discussed in class)
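A minimal by-hand sketch of KNN regression in one dimension (variable names are illustrative):

```python
import numpy as np

def knn_predict(x_train, y_train, x_new, k):
    """KNN regression for a single prediction: average the responses of
    the k training points closest to x_new."""
    dists = np.abs(x_train - x_new)          # 1-D distance for clarity
    nearest = np.argsort(dists)[:k]          # indices of the k closest points
    return np.mean(y_train[nearest])         # nonparametric: no fitted equation
```

A small k yields a flexible, wiggly fit (low bias, high variance); a large k averages over more neighbors (high bias, low variance). With many predictors, distances grow and the "nearest" neighbors are no longer near, which is the curse of dimensionality.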
Modeling Nonlinearity: Polynomial Regression and Splines
- Explain the advantages of splines over global transformations and other types of piecewise polynomials
- Explain how splines are constructed by drawing connections to variable transformations and least squares
- Explain how the number of knots relates to the bias-variance tradeoff
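The connection to variable transformations and least squares can be made concrete with a truncated power basis. A sketch for a linear spline with hypothetical knots; higher-degree splines extend the same idea:

```python
import numpy as np

def linear_spline_basis(x, knots):
    """Transform x into spline features: x itself plus one hinge term
    (x - knot)_+ per knot. Ordinary least squares on these columns
    fits a piecewise line that stays continuous at every knot."""
    cols = [np.ones_like(x), x]
    for knot in knots:
        cols.append(np.maximum(0.0, x - knot))   # zero to the left of the knot
    return np.column_stack(cols)

# Hypothetical data: fit with plain least squares on the transformed features.
x = np.linspace(0, 10, 50)
y = np.sin(x) + np.random.default_rng(1).normal(0, 0.2, size=x.size)
B = linear_spline_basis(x, knots=[2.5, 5.0, 7.5])
beta, *_ = np.linalg.lstsq(B, y, rcond=None)
y_hat = B @ beta
```

Each added knot contributes one more transformed column, so more knots mean more flexibility: lower bias but higher variance.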
Local Regression and Generalized Additive Models
- Clearly describe the local regression algorithm for making a prediction
- Explain how the bandwidth (span) relates to the bias-variance tradeoff
- Describe some different formulations for a GAM (how the arbitrary functions are represented)
- Explain how to make a prediction from a GAM
- Interpret the output from a GAM
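A sketch of local regression prediction at a single point, using tricube weights as in LOESS (the span handling and names here are illustrative, not a specific library's implementation):

```python
import numpy as np

def local_regression_predict(x, y, x0, span):
    """Predict at a single point x0: keep the fraction `span` of training
    points nearest x0, weight them with a tricube kernel, fit a weighted
    least squares line, and read off the line's value at x0."""
    m = max(4, int(np.ceil(span * len(x))))    # local neighborhood size
    d = np.abs(x - x0)
    keep = np.argsort(d)[:m]                   # m nearest points
    u = d[keep] / max(d[keep].max(), 1e-12)    # scaled distances in [0, 1]
    w = (1 - u ** 3) ** 3                      # tricube weights, 0 at the edge
    A = np.column_stack([np.ones(m), x[keep]])
    W = np.diag(w)
    b = np.linalg.solve(A.T @ W @ A, A.T @ W @ y[keep])  # weighted least squares
    return b[0] + b[1] * x0
```

A smaller span uses fewer, closer points (low bias, high variance); a larger span smooths more (high bias, low variance). A GAM then models the response as a sum of one smooth function per predictor, e.g. y = b0 + f1(x1) + ... + fp(xp), so a prediction just adds up the fitted function values.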
Logistic regression
- Use a logistic regression model to make hard (class) and soft (probability) predictions
- Interpret non-intercept coefficients from logistic regression models in the data context
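A minimal sketch of both prediction types from hypothetical fitted coefficients:

```python
import numpy as np

# Hypothetical fitted model: log(odds of y = 1) = b0 + b1 * x.
b0, b1 = -3.0, 0.8

def predict_prob(x):
    """Soft prediction: invert the log-odds to get a probability."""
    return 1 / (1 + np.exp(-(b0 + b1 * x)))

def predict_class(x, threshold=0.5):
    """Hard prediction: threshold the probability."""
    return (predict_prob(x) >= threshold).astype(int)
```

For interpretation, exp(b1) = e^0.8, roughly 2.23, is the multiplicative change in the odds of the outcome for a one-unit increase in x.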
Evaluating classification models
- Calculate (by hand from confusion matrices) and contextually interpret overall accuracy, sensitivity, and specificity
- Construct and interpret plots of predicted probabilities across classes
- Explain how a ROC curve is constructed and the rationale behind AUC as an evaluation metric
- Appropriately use and interpret the no-information rate to evaluate accuracy metrics
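Each of these metrics can be computed by hand from a confusion matrix. A worked example with hypothetical counts:

```python
# Hypothetical confusion matrix (rows = actual, columns = predicted):
#               pred +   pred -
# actual +        40       10      (TP, FN)
# actual -         5       45      (FP, TN)
tp, fn, fp, tn = 40, 10, 5, 45

accuracy    = (tp + tn) / (tp + fn + fp + tn)   # 0.85
sensitivity = tp / (tp + fn)                    # 0.80: true positive rate
specificity = tn / (tn + fp)                    # 0.90: true negative rate

# No-information rate: accuracy from always predicting the majority class.
nir = max(tp + fn, fp + tn) / (tp + fn + fp + tn)   # 0.50 here
```

An accuracy of 0.85 is only impressive relative to the no-information rate; here the model beats always-guess-the-majority (0.50) by a wide margin.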
Decision trees
- Clearly describe the recursive binary splitting algorithm for tree building for both regression and classification
- Compute the weighted average Gini index to measure the quality of a classification tree split
- Compute the sum of squared residuals to measure the quality of a regression tree split
- Explain how recursive binary splitting is a greedy algorithm
- Explain how different tree parameters relate to the bias-variance tradeoff
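A worked example of the weighted average Gini index for one hypothetical candidate split:

```python
def gini(counts):
    """Gini index of one node: 1 minus the sum of squared class proportions."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

# Hypothetical split: left node has 30 of class A and 10 of class B;
# right node has 5 of class A and 55 of class B.
left, right = [30, 10], [5, 55]
n_left, n_right = sum(left), sum(right)
n = n_left + n_right

weighted_gini = (n_left / n) * gini(left) + (n_right / n) * gini(right)
# gini(left)  = 1 - (0.75**2 + 0.25**2) = 0.375
# gini(right) = 1 - ((5/60)**2 + (55/60)**2) ~ 0.153
# weighted: 0.4 * 0.375 + 0.6 * 0.153 ~ 0.242   (lower = purer split)
```

Recursive binary splitting computes this for every candidate split and greedily keeps the one with the lowest value; for a regression tree, replace the Gini index with each node's sum of squared residuals.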
Bagging and random forests
- Explain the rationale for bagging
- Explain the rationale for selecting a random subset of predictors at each split (random forests)
- Explain how the size of the random subset of predictors at each split relates to the bias-variance tradeoff
- Explain the rationale for and implement out-of-bag error estimation for both regression and classification
- Explain the rationale behind the random forest variable importance measure and why it is biased towards quantitative predictors (in class)
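The OOB bookkeeping can be sketched in a few lines. To keep the sketch short, it uses a simple fitted line as a stand-in base learner (a real random forest grows a tree on each bootstrap sample, with a random predictor subset at each split); the data and names are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 2 * x + rng.normal(0, 1, 200)            # hypothetical data

n, B = len(y), 500
oob_sum = np.zeros(n)                        # running sum of OOB predictions
oob_count = np.zeros(n)                      # how many learners left each point out

for _ in range(B):
    boot = rng.integers(0, n, n)             # bootstrap sample (with replacement)
    oob = np.setdiff1d(np.arange(n), boot)   # points this learner never saw
    coefs = np.polyfit(x[boot], y[boot], 1)  # stand-in base learner (a line)
    oob_sum[oob] += np.polyval(coefs, x[oob])
    oob_count[oob] += 1

seen = oob_count > 0
oob_mse = np.mean((y[seen] - oob_sum[seen] / oob_count[seen]) ** 2)
```

Each bootstrap sample omits roughly 37% of the observations, so every point accumulates predictions from learners that never saw it: an out-of-sample error estimate with no separate validation set. For classification, replace the averaged prediction with a majority vote and MSE with the misclassification rate.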
K-means clustering
- Clearly describe / implement by hand the k-means algorithm
- Describe the rationale for how clustering algorithms work in terms of within-cluster variation
- Describe the tradeoff of more vs. fewer clusters in terms of interpretability
- Implement strategies for interpreting / contextualizing the clusters
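A by-hand sketch of the algorithm (Lloyd's algorithm), assuming a numeric feature matrix `X`:

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Alternate (1) assigning each point to its nearest centroid and
    (2) moving each centroid to the mean of its assigned points."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # Distance from every point to every centroid: shape (n, k).
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)                  # assignment step
        for j in range(k):                         # update step
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids
```

Both steps can only decrease the total within-cluster variation, which is why the algorithm settles down; the result still depends on the randomly chosen starting centroids.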
Hierarchical clustering
- Clearly describe / implement by hand the hierarchical clustering algorithm
- Compare and contrast k-means and hierarchical clustering in their outputs and algorithms
- Interpret cuts of the dendrogram for single and complete linkage
- Describe the rationale for how clustering algorithms work in terms of within-cluster variation
- Describe the tradeoff of more vs. fewer clusters in terms of interpretability
- Implement strategies for interpreting / contextualizing the clusters
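One way to explore linkage choices and dendrogram cuts in code, using SciPy (a sketch for exploration, not a required tool for the course):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))                 # hypothetical data

Z_single = linkage(X, method="single")       # merge by closest pair of points
Z_complete = linkage(X, method="complete")   # merge by farthest pair of points

# "Cutting" each dendrogram to obtain 3 clusters:
labels_single = fcluster(Z_single, t=3, criterion="maxclust")
labels_complete = fcluster(Z_complete, t=3, criterion="maxclust")
```

`scipy.cluster.hierarchy.dendrogram(Z_complete)` draws the tree itself; cutting lower yields more, smaller clusters. Unlike k-means, the number of clusters is chosen after the fact by where you cut, and the algorithm merges clusters bottom-up rather than iterating assignments.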