Finding data

All of the data for CSE 150 can be found within the data4 packages in our GitHub group byuidatascience and on our Google Drive.

We have built multiple data sets that will help you answer the questions in the case studies. You should read the data description and look at the available files to find the files you need before you do too much data wrangling.

Repository details

These repositories are built using the DataPushR R package, and they have the following structure.

  • There is a data-raw folder that has varied file formats of the curated data. All data is the same, but they are stored in varied formats for use with different software and programming languages. this folder can have more data than is available in the R package.
  • Each of these repositories is stored as an R package and can be installed in R using devtools::install_github("byuidatascience/data4[NAME]"). If done, the data objects in the data folder will be available to the R user. We recommend using data4[NAME]:: structure to access the available data.
  • Each data set is documented with the data.md file in the repository and as a help file in R. You can find the data sources here.
  • There is an R script in the data-raw folder that documents what was done with the source data before it was created in the repository.

Case Study 2: Lego my data

You will be making your own data set from bags of LEGO in class or by using the virtual LEGO bags.

Case Study 3: What is normal about marathons?

This repository has marathon times for the majority of marathons in the US. It is largely various subsets of data from Wu’s data. One addition is the inclusion of spatial coordinates for the locations of some of the races.

Case Study 4: What is a healthy child?

This repository has varied height and weight measurements of children from various countries under the age of 2. Also, we have the WHO table of mean and standard deviation values for each day of age. This data can be used to teach about z-scores and the normal distribution.

Case Study 5: Does your birthday make you better at sports?

This repository has the birthdates of many professional athletes. It also has the number of births by month and day of the month for the US and the varied athlete groups. This data can be used to teach about the Matthew Effect.

Case Study 6: Catch me, if you can?

The repository has varied data sets that can be used to teach about Benford’s law. They can be used to explain the Chi-Square Goodness of Fit process.

Case Study 7: Can you help me with my data problem?

This repository has tuberculosis data from the World Health Organization (WHO). It is messy by intention. This data can be used to teach tidy principles.