All of the data for CSE 150 can be found within the data4 packages in our GitHub group byuidatascience and on our Google Drive.
We have built multiple data sets that will help you answer the questions in the case studies. You should read the data description and look at the available files to find the files you need before you do too much data wrangling.
These repositories are built using the DataPushR R package, and they have the following structure.
data-raw: a folder that holds the curated data in varied file formats. The data is the same in every format; the formats differ so the data can be used with different software and programming languages. This folder can hold more data than is available in the R package.
R package: the repository can be installed as an R package with devtools::install_github("byuidatascience/data4[NAME]"). Once installed, the data objects in the data folder are available to the R user. We recommend using the data4[NAME]:: structure to access the available data.
data.md: the data descriptions are in the data.md file in the repository and as a help file in R. You can find the data sources here.
Scripts: the data-raw folder also holds a script that documents what was done with the source data before it was created in the repository.

You will be making your own data set from bags of LEGO in class or by using the virtual LEGO bags.
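The installation step described in the repository structure above can be sketched as follows. The helper name data4_repo and the argument name are hypothetical and only build the GitHub path; [NAME] stays a placeholder for whichever repository you need. The install and lookup calls are commented out because they need a network connection and a real repository name.

```r
# Hypothetical helper: builds the GitHub path for a data4 repository.
# `name` is the suffix that replaces [NAME] in data4[NAME].
data4_repo <- function(name) {
  paste0("byuidatascience/data4", name)
}

# Usage sketch (replace "[NAME]" with a real suffix before running):
# repo <- data4_repo("[NAME]")
# devtools::install_github(repo)       # install the repository as an R package
# data(package = "data4[NAME]")        # list the data objects it provides
# help(package = "data4[NAME]")        # open the package help index
```

After installation, individual objects can be reached with the data4[NAME]:: prefix recommended above.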
This repository has marathon times for the majority of marathons in the US. It consists largely of various subsets of Wu’s data. One addition is spatial coordinates for the locations of some of the races.
This repository has varied height and weight measurements of children under the age of 2 from various countries. It also has the WHO table of mean and standard deviation values for each day of age. This data can be used to teach about z-scores and the normal distribution.
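As a minimal sketch of the z-score idea mentioned for this data, the calculation below compares one measurement with a reference mean and standard deviation. The three numbers are hypothetical values chosen for illustration; they are not taken from the WHO table or from the repository.

```r
# Hypothetical values for illustration only (NOT from the WHO table).
child_weight_kg <- 9.0   # observed weight of one child
ref_mean_kg     <- 8.5   # reference mean for the child's age in days
ref_sd_kg       <- 1.0   # reference standard deviation for that age

# z-score: how many standard deviations the child is from the mean.
z_score <- (child_weight_kg - ref_mean_kg) / ref_sd_kg

# Share of a normal distribution that falls below this z-score.
percentile <- pnorm(z_score)

z_score
percentile
```

With real data, the mean and standard deviation would be looked up by the child's age in days before computing the z-score.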
This repository has the birthdates of many professional athletes. It also has the number of births by month and day of the month for the US and the varied athlete groups. This data can be used to teach about the Matthew Effect.
The repository has varied data sets that can be used to teach about Benford’s law. They can be used to explain the Chi-Square Goodness of Fit process.
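The Chi-Square Goodness of Fit process for Benford’s law can be sketched as below. This is an illustration under stated assumptions: it uses the powers of two as stand-in data rather than a data set from the repository, and the variable names are hypothetical.

```r
# Stand-in data (NOT from the repository): the first 1000 powers of two.
x <- 2^(1:1000)

# Leading digit of each value.
leading_digit <- floor(x / 10^floor(log10(x)))

# Observed counts for the digits 1 through 9.
observed <- tabulate(leading_digit, nbins = 9)

# Benford's law: P(d) = log10(1 + 1/d) for d = 1, ..., 9.
benford_p <- log10(1 + 1 / (1:9))

# Goodness-of-fit test of the observed counts against Benford's proportions.
fit <- chisq.test(x = observed, p = benford_p)
fit
```

A large p-value means the test finds no evidence against Benford’s law; a small one suggests the leading digits do not follow it.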
This repository has tuberculosis data from the World Health Organization (WHO). It is intentionally messy. This data can be used to teach tidy principles.
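One tidy principle, a single column per variable, can be sketched as below. The small table is hypothetical and is not the WHO data; it assumes the tidyr package is installed and only shows the kind of reshaping a messy table might need.

```r
library(tidyr)

# Hypothetical messy table: the year is stored in the column names.
messy <- data.frame(
  country    = c("A", "B"),
  cases_2019 = c(10, 20),
  cases_2020 = c(12, 18)
)

# Move the year out of the column names into its own column,
# giving one row per country and year.
tidy <- pivot_longer(
  messy,
  cols         = -country,
  names_to     = "year",
  names_prefix = "cases_",
  values_to    = "cases"
)

tidy
```

The result has one row for each country and year, with the count in a single cases column.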