CST383 - Week 4
- YZ

- Jun 1, 2021

This week we learned how to inspect and clean data so it is ready for visualization and prediction. First, we covered data acquisition, including where to find data and the different formats datasets come in. Next, we learned how to identify and fix missing data. Searching for NaN values gives a good idea of where data is missing. Rows with missing values can either be removed, or the gaps can be imputed with reasonable values such as the median or mode. Once the missing data is handled, the next step is to look for bad data, such as extreme outliers that don't fit with the rest of the values. We also learned how to convert categorical data to numeric values. The last step is preprocessing, which includes tasks like normalization and checking the correlation between columns. This module gave a great overview of the first things to look for in a dataset and the initial work needed to prepare the data for later use.
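To make those steps concrete, here is a minimal sketch of how they might look in pandas. The toy DataFrame, its column names, the 1.5 * IQR outlier rule, and the z-score normalization are all my own assumptions for illustration; the course may have used different data and different techniques.

```python
import numpy as np
import pandas as pd

# Made-up toy data, for illustration only (not a course dataset)
df = pd.DataFrame({
    "age": [18.0, 19.0, np.nan, 21.0, 20.0, 95.0],
    "school": ["A", "B", "A", None, "B", "A"],
    "grade": [12.0, 15.0, 9.0, 14.0, np.nan, 11.0],
})

# 1. Find missing data: count NaN values in each column
missing_counts = df.isna().sum()

# 2. Impute: median for numeric columns, mode for the categorical one
df["age"] = df["age"].fillna(df["age"].median())
df["grade"] = df["grade"].fillna(df["grade"].median())
df["school"] = df["school"].fillna(df["school"].mode()[0])

# 3. Flag extreme outliers in "age" with the 1.5 * IQR rule (one common choice)
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)
clean = df[~is_outlier].copy()

# 4. Convert the categorical column to numeric indicator columns
clean = pd.get_dummies(clean, columns=["school"], dtype=int)

# 5. Normalize the numeric columns to mean 0 and standard deviation 1
for col in ["age", "grade"]:
    clean[col] = (clean[col] - clean[col].mean()) / clean[col].std()

# 6. Look at the pairwise correlation between columns
corr = clean.corr()
print(missing_counts)
print(corr.round(2))
```

Dropping rows with `df.dropna()` would be the alternative to step 2 when imputing does not make sense.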
We also had homework on the topic of the previous week, probability. Using data about false and negative measles tests, we calculated marginal and conditional probabilities. Additionally, there was a second homework with 8 probability questions.
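As a reminder of what marginal and conditional probabilities look like in practice, here is a small sketch in plain Python. The counts are invented for illustration and are not the homework data, and the helper names `marginal` and `conditional` are my own.

```python
# Hypothetical counts of (health status, test result), invented for illustration
counts = {
    ("sick", "positive"): 45,
    ("sick", "negative"): 5,
    ("healthy", "positive"): 30,
    ("healthy", "negative"): 920,
}
total = sum(counts.values())


def marginal(index, value):
    """P(status = value) when index is 0, P(result = value) when index is 1."""
    return sum(n for key, n in counts.items() if key[index] == value) / total


def conditional(status, result):
    """P(status | result) = P(status and result) / P(result)."""
    joint = counts[(status, result)] / total
    return joint / marginal(1, result)


p_sick = marginal(0, "sick")                             # 50 / 1000
p_positive = marginal(1, "positive")                     # 75 / 1000
p_sick_given_positive = conditional("sick", "positive")  # 45 / 75

print(p_sick, p_positive, p_sick_given_positive)
```

The marginal probability sums over the other variable, while the conditional probability divides the joint probability by the marginal of the condition.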
Lastly, we formed a group and submitted a proposal for the final project. I am working with Jose Herrera Gallegos, David Szabo, and Jomar Baluyot. We chose a dataset from Kaggle about student alcohol consumption. We plan to use it to predict students' academic success based on their alcohol use. I am glad we were able to form a great group and choose our topic so we can get started.

