Final Project Instructions
Overview
For your final project, you will apply the statistical machine learning methods that we’ve covered in class to answer scientific questions pertaining to datasets(s) that are interesting to you.
You can work in groups of up to 3 students, and you can choose your own groups.
Some sources for datasets are listed below. If you are interested in a particular topic and would like help finding data, let me know!
- Kaggle
- Google Dataset Search
- UCI Machine Learning Repository
- fivethirtyeight
- The Pudding
- Awesome Public Datasets
Deliverables
The final project is worth 120 points total with the breakdown below. See sections below for more details on each part.
- Initial check-in (20 points)
- Final report (60 points)
- Final presentation (40 points)
Initial check-in
DUE: This must be completed the week of 4/8-4/12.
Each group will schedule a time to meet with me to present (informally) the following items involved in starting up the project.
- Provide an overview of the information in the dataset and the main scientific question(s) being investigated.
- Provide justification that the data is cleaned and prepared in such a way that your machine learning explorations can proceed. This could include (but is not restricted to) missing value imputation, creation of relevant variables, and generally ensuring correct coding of variables.
Go to this Google Doc to sign up for a time to do this check-in with me in my office.
Final report
DUE: Monday, May 13 at midnight on Moodle.
The final report is meant to communicate the insights gained from your machine learning analysis to a statistically-savvy audience. The final report will be a knitted HTML file.
You should make sure to include the following:
Introduce your data and research question(s). Describe your research plan so that readers can easily follow your thought process and the flow of the report. Please also include key results at the beginning so that readers know what to watch for. Also mention any important data cleaning or preparation. NOTE: it is not necessary (or advisable) to report absolutely every small step that you take.
Exploratory work and basic analyses. Use graphs to illustrate interesting relationships that are important to your final analyses. DO NOT just show a bunch of graphs because you can. You should label and discuss every graph you include. There is no required number to include. The graphs should help us understand your analysis process and illuminate key features of the data.
Describe the analysis process. Method comparison and sensitivity analyses are absolutely CRUCIAL to good scientific work. To that end, you MUST compare at least 2 different methods from class in answering your scientific questions. It is important to report what you tried, but do so SUCCINCTLY.
Summarize the results succinctly. Reiterate why we should be interested in this analysis. Depending on the project, you might do this in different ways. Maybe there were important relationships you learned about. Or maybe you have a nice way to predict something useful. Let us know what we’ve learned and why it’s important/neat/interesting.
Cite all sources. This includes your data source (including URL if applicable), any articles behind the data source (e.g. Pudding and FiveThirtyEight datasets), and any sources that you drew inspiration from (for context, code, etc.) There is no specific formatting requirement for citations.
Specific length and formatting requirements:
- Hand in (1) ONE knitted HTML or PDF file and (2) the underlying Rmd file(s).
- Maximum of 1000 words of text (not code). Do a word count of the knitted file. If your report seems long, I will check the word count and deduct points.
- Make sure that the knitted file looks neat (e.g. text following a figure starts on a new line) and is organized logically with headers (e.g.
# Introduction
). Do not show code in the knitted file. I can read the code if I want to by looking at your Rmd file(s). Hide the code by adding
echo=FALSE
to your code chunks like this:{r echo=FALSE}
Make sure that any figures you include are big enough to see clearly. You can adjust plot height and width by adding
fig.width=12
andfig.height=8
(fiddle with your own numbers) to your code chunks like this:{r echo=FALSE, fig.width=12, fig.height=8}
Final presentation
WHEN: These will occur during your final exam slot:
Section 01: Monday 5/13 8:00am-10:00am
Section 02: Thursday 5/9 10:30am-12:30pm
WHERE: Library 250 (classroom right next to the Idea Lab)
- You will put together slides to present your findings to your classmates. The aspects that you should emphasize for the presentation are the same 4 aspects that you should emphasize for the final report (detailed above).
- Aim to have a 10-12 minute presentation with time for questions at the end. In total each presentation should take about 15-20 minutes. In Library 250, there is space for 4 groups to present at a time.
- 10 points of your presentation grade comes from you giving good qualitative feedback to your classmates. You will be assigned to watch other groups’ presentations and give feedback via a Google form. I will synthesize the comments AND look at your slides to provide a grade for each groups’ presentation (30 points).
- View the schedule indicating when each group is presenting and what other groups you have been assigned to evaluate.
- If you would prefer to not evaluate or be evaluated by certain people, let me know, and I’ll make the changes to the schedule. No questions asked.
- Google form for providing feedback