Final Project
Requirements
Scope:
You will be analyzing a dataset using both supervised and unsupervised learning tools.
Finding data:
Your dataset should allow you to perform a supervised learning analysis (regression or classification) and an unsupervised learning analysis (clustering or dimension reduction). The following resources are good places to start looking for data:
- Kaggle
- Tidy Tuesday GitHub repository
- UCI Machine Learning Repository
- Harvard Dataverse
- Google Dataset Search
- Macalester’s librarians are a great resource too! They can help you find data aligning with your interests. You can make an appointment using the Ask Us page on the library website.
What qualities should you look for in a dataset?
- Should contain an outcome variable that you would be interested in modeling
- Should contain roughly 10 sensible predictors or more (ID variables aren’t useful; latitude and longitude are generally not useful as predictors)
- NOT data where the main predictor is a time variable (e.g., year, day). This usually is a situation where the goal is to forecast a trend into the future. (This type of research question can’t be handled well using methods that we’ll cover–appropriate tools are covered in STAT 452: Correlated Data.)
Collaboration:
- You may work in teams of up to 3 members. Individual work is fine.
- The homework assignments will indicate whether work for that assignment should be submitted individually or if just one team member should submit work.
- There will be a required synthesis of the weekly homework investigations at the end of the course. If working on a team, this should be done in groups, rather than individually.
Final deliverables:
Each group will be required to give a practice presentation and a final presentation. The instructor will provide feedback on the practice presentation to incorporate into the final presentation. (See our syllabus on Moodle for details.)
Presentation requirements:
- Length: Aim for a 10-12 minute presentation.
- Clarify your research questions.
- Describe your data.
- Explain how your supervised and unsupervised learning investigations help to answer your research questions.
- Describe your methods: What models were fit? How were models evaluated?
- Describe your results in the context of your research questions.
- Report and interpret evaluation metrics and plots.
- Report and interpret variable importance measures and any other modeling output that helps you answer your research question. (e.g., Explorations of predicted values from the model, coefficient estimates)
- Comment on any cautions that have to be kept in mind when analyzing the data or interpreting the results. Draw from insights you have taken from our Data Ethics readings over the course of the semester.
Class time during the week of 4/17 - 4/21 will be devoted to working on these presentations and getting peer review.
During the week of 4/24 - 4/28, we will not have class, but your group will sign up for a time to meet with me during class time to present your draft presentation. I will give you feedback to incorporate into a final, recorded version of your presentation which will be due on Moodle by Friday 5/5 at midnight.