T-tests
- One Sample t.test(NameOfYourData$Y, mu = YourNull, alternative = YourAlternative, conf.level = 0.95)
- Paired Sample t.test(NameOfYourData$Y1, NameOfYourData$Y2, paired = TRUE, mu = YourNull, alternative = YourAlternative, conf.level = 0.95)
Wilcoxon Tests
- Signed-Rank wilcox.test(object1, object2, mu = YourNull, alternative = YourAlternative, paired = TRUE, conf.level = 0.95)
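
As a minimal sketch of the calls listed above, the following uses the built-in airquality and sleep datasets. The null values (10 and 0) and the two-sided alternatives are illustrative assumptions, not values taken from this book.

```r
# One-sample t test: is the mean wind speed different from 10 mph? (10 is an assumed null value)
one_sample <- t.test(airquality$Wind, mu = 10, alternative = "two.sided", conf.level = 0.95)

# Paired data: sleep records extra hours of sleep for the same 10 patients under two drugs.
drug1 <- sleep$extra[sleep$group == 1]
drug2 <- sleep$extra[sleep$group == 2]

# Paired t test and Wilcoxon signed-rank test on the same pairs.
paired_t <- t.test(drug1, drug2, paired = TRUE, mu = 0,
                   alternative = "two.sided", conf.level = 0.95)
signed_rank <- wilcox.test(drug1, drug2, paired = TRUE, mu = 0,
                           alternative = "two.sided", exact = FALSE)  # exact = FALSE because of ties

one_sample
paired_t
signed_rank
```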

Y is a single quantitative variable of interest. This would be like “heights” of BYU-Idaho students. X is a qualitative (categorical) variable of interest like “gender” that has just two groups “A” and “B”. So this logo represents situations where we would want to compare heights of male (group A) and female (group B) students.

Quantitative Y | Categorical X (2 Groups)

Data Summaries

Numerical

Mean (average) - Center of Mass - mean(x)
Median - Middle Data Point - median(x)
Quartiles (five-number summary) - Spread of Data - quantile(x, percentile) or summary(x)
Standard Deviation - Measures how spread out the data are from the mean - sd(x)
n - Sample size - n()

— Use group_by(y)

Function from tidyverse package
Normally used with the summarise or mutate
- Qualitative in Group By
- Quantitative in Summarize
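
A short sketch of the group_by pattern, assuming the dplyr package (part of the tidyverse) is installed. The built-in mtcars dataset stands in for your own data: am (transmission type, two groups) is the qualitative variable and mpg is the quantitative one.

```r
library(dplyr)  # assumption: dplyr is installed

# Qualitative variable goes in group_by(); quantitative variable goes in summarise().
mpg_by_am <- mtcars %>%
  group_by(am) %>%
  summarise(mean_mpg   = mean(mpg),
            median_mpg = median(mpg),
            sd_mpg     = sd(mpg),
            n          = n())

mpg_by_am
```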

Graphics

Tests

T-tests
- Independent t.test(Y ~ X, data = YourData, mu = YourNull, alternative = YourAlternative, conf.level = 0.95)
Wilcoxon
- Rank Sum (Mann-Whitney) wilcox.test(Y ~ X, data = YourData, mu = YourNull, alternative = YourAlternative, conf.level = 0.95)
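
As an illustration of the two formula-style calls above, this sketch compares mpg between the two transmission groups (am) in the built-in mtcars dataset. The dataset and the null value of 0 are assumptions chosen for the example. Note that t.test runs the Welch (unequal variance) version by default.

```r
# Independent samples t test: Y ~ X with a two-group X.
indep_t <- t.test(mpg ~ am, data = mtcars, mu = 0,
                  alternative = "two.sided", conf.level = 0.95)

# Wilcoxon rank sum (Mann-Whitney) test on the same data.
# exact = FALSE avoids the warning R gives when tied values are present.
rank_sum <- wilcox.test(mpg ~ am, data = mtcars, mu = 0,
                        alternative = "two.sided", exact = FALSE)

indep_t
rank_sum
```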

Y is a single quantitative variable of interest, like “heights” of BYU-Idaho students. X is a categorical (qualitative) variable like which Math 221 you took, 221A, 221B, or 221C. In other words, X has three or more groups. So “Classrank” could be X, with groups “Freshman”, “Sophomore”, “Junior”, and “Senior”.

Quantitative Y | Categorical X (3+ Groups)

Data Summaries

Numerical

Minimum - Smallest occurring data value - min(x)
Maximum - Largest occurring data value - max(x)
Mean (average) - Center of Mass - mean(x)
Median - Middle Data Point - median(x)
Quartiles (five-number summary) - Spread of Data - quantile(x, percentile) or summary(x)
Standard Deviation - Measures how spread out the data are from the mean - sd(x)
n - Sample size - n()

— Use favstats(x ~ y, data=YourDataSet)

x is a numeric vector of data values that represents the quantitative response variable.
y is a qualitative grouping variable defining which groups each value in x belongs to.
YourDataSet is the name of your data set.

Graphics

Tests

ANOVA
- One-way aov(y ~ A, data=YourDataSet)
- Two-way aov(y ~ A+B+A:B, data=YourDataSet)
- Block Design aov(y ~ Block+A+B+A:B, data=YourDataSet)
Kruskal-Wallis
- Rank Sum kruskal.test(x ~ g, data=YourDataSet)
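
A minimal sketch of the one-way ANOVA and Kruskal-Wallis calls, using the built-in PlantGrowth dataset (an assumption for illustration), where group has three levels and weight is quantitative.

```r
# One-way ANOVA: quantitative weight explained by a three-level factor.
plant_aov <- aov(weight ~ group, data = PlantGrowth)
summary(plant_aov)

# Kruskal-Wallis rank sum test: the nonparametric counterpart.
plant_kw <- kruskal.test(weight ~ group, data = PlantGrowth)
plant_kw
```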

Y is a single quantitative variable of interest, like “height”. X is another single quantitative variable of interest, like “shoe-size”. This would imply we are using “shoe-size” (X) to explain “height” (Y).

Quantitative Y | Quantitative X

Data Summaries

Numerical

Correlation - the strength and direction of the association - cor(x,y)
Summary - Ten Number Summary - favstats(y ~ x, data=YourDataSet)

Graphics

Tests

Simple Linear Regression lm(Y ~ X, data = YourDataSet)
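
For example, with the built-in cars dataset (chosen here only for illustration), stopping distance can play the role of Y and speed the role of X:

```r
# Correlation: strength and direction of the linear association.
speed_dist_cor <- cor(cars$speed, cars$dist)

# Simple linear regression of dist (Y) on speed (X).
cars_lm <- lm(dist ~ speed, data = cars)

speed_dist_cor
summary(cars_lm)  # coefficients, standard errors, and p-values
```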

Y is a single quantitative variable of interest, like height. While we could use an X1 of “shoe-size” to explain height, we might also want to use a second x-variable, X2, like “gender” to help explain height. Further x-variables could also be used.

Quantitative Y | Multiple X

Data Summaries

Numerical

Summary
- Five Number Summary - summary(y)
- Ten Number Summary - favstats(y ~ x, data=YourDataSet)

Graphics

Tests

Multiple Linear Regression lm(Y ~ X1 + X2 + X1:X2 + ..., data = YourDataSet)
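
A sketch of the same pattern with two explanatory variables and their interaction. The mtcars variables used here (wt as X1, am as X2) are assumptions for illustration only.

```r
# Multiple linear regression: mpg explained by weight, transmission, and their interaction.
mtcars_lm <- lm(mpg ~ wt + am + wt:am, data = mtcars)

summary(mtcars_lm)  # one row per term: intercept, wt, am, wt:am
```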

Y is a single categorical (qualitative) variable of interest where 1 (success) or 0 (failure) are the only possible values for Y. This would be like “getting an A in Math 325” where 1 means you got an A and 0 means you didn’t. We might use an explanatory variable X of “height” to see if taller students are more likely to get an A in Math 325 than shorter students. (They aren’t, if you were wondering.)

Binomial Y | Quantitative X

Data Summaries

Numerical

Summary - Ten Number Summary - favstats(y ~ x, data=YourDataSet)

Graphics

Tests

Simple Logistic Regression Model glm(Y ~ X, data = YourDataSet, family = binomial)
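
As a hedged illustration, the am column of the built-in mtcars dataset is already coded 0/1, so it can stand in for a binomial Y, with weight (wt) as the quantitative X. These variable choices are assumptions for the example.

```r
# Simple logistic regression: probability that am equals 1, explained by wt.
am_glm <- glm(am ~ wt, data = mtcars, family = binomial)
summary(am_glm)

# Predicted probability that am equals 1 for a car with wt = 3 (3000 lbs).
prob_at_3 <- predict(am_glm, newdata = data.frame(wt = 3), type = "response")
prob_at_3
```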

Y is a single categorical (qualitative) variable of interest where 1 (success) or 0 (failure) are the only possible values for Y. This would be like “getting an A in Math 325” where 1 means you got an A and 0 means you didn’t. We might use an explanatory variable X1 of “height” and a second explanatory variable X2 of “gender” to try to predict whether or not a student will get an A in Math 325.

Binomial Y | Multiple X

Data Summaries

Numerical

Summary
- Five Number Summary - summary(y)
- Ten Number Summary - favstats(y ~ x, data=YourDataSet)

Graphics

Tests

Multiple Logistic Regression Model glm(Y ~ X1 * X2 * ... , data = YourDataSet, family = binomial)

Y is a single categorical variable of interest, like gender. X is another categorical variable of interest, like “hair color”. This type of data would help us understand if men or women are more likely to have certain hair colors than the other gender.

Categorical Y | Categorical X

Data Summaries

Numerical

Table - table(x)

Graphics

Tests

Chi-squared Test of Independence
- chisq.test(x)
- chisq.test(table(Dataset$variable, Dataset$variable))
Nonparametric Chi-squared Test chisq.test(x, simulate.p.value=TRUE)
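
A minimal sketch using the built-in HairEyeColor table, collapsed to hair color by sex to match the example described above. The dataset choice and the seed are assumptions for illustration.

```r
# Collapse the 3-way table to a 2-way table of Hair (rows) by Sex (columns).
hair_by_sex <- margin.table(HairEyeColor, c(1, 3))
hair_by_sex

# Chi-squared test of independence.
hair_chisq <- chisq.test(hair_by_sex)
hair_chisq

# Nonparametric version: the p-value is simulated rather than taken from the chi-squared distribution.
set.seed(325)  # arbitrary seed so the simulated p-value is reproducible
hair_chisq_sim <- chisq.test(hair_by_sex, simulate.p.value = TRUE)
hair_chisq_sim
```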

Making Inferences

This comes from the Making Inference page of this book

It is common to only have a sample of data from some population of interest. Using the information from the sample to reach conclusions about the population is called making inference. When statistical inference is performed properly, the conclusions about the population are almost always correct.

Hypothesis Testing

One of the great focal points of statistics concerns hypothesis testing. Science generally agrees upon the principle that truth must be uncovered by the process of elimination. The process begins by establishing a starting assumption, or null hypothesis ($H_0$). Data is then collected and the evidence against the null hypothesis is measured, typically with the $p$-value. The $p$-value becomes small (gets close to zero) when the evidence is extremely different from what would be expected if the null hypothesis were true. When the $p$-value is below the significance level $\alpha$ (typically $\alpha=0.05$) the null hypothesis is abandoned (rejected) in favor of a competing alternative hypothesis ($H_a$).

Managing Decision Errors

When the p-value approaches zero, one of two things must be occurring. Either an extremely rare event has happened or the null hypothesis is incorrect. Since the second option, that the null hypothesis is incorrect, is the more plausible option, we reject the null hypothesis in favor of the alternative whenever the p-value is close to zero. It is important to remember that rejecting the null hypothesis could however be a mistake.

	$H_0$ True	$H_0$ False
Reject $H_0$	Type I Error	Correct Decision
Fail to Reject $H_0$	Correct Decision	Type II Error

Type I Error (Significance Level, Confidence and α)

Defined as rejecting the null hypothesis when it is actually true
A hypothesis test controls the probability of a Type I Error
The typical value of α is 0.05
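Confidence is defined as 1−α, the probability of not rejecting the null hypothesis when it is actually true (the complement of the Type I Error probability)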
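
The idea that a hypothesis test controls the probability of a Type I Error can be checked with a small simulation. This is only a sketch under assumed settings: normal data, samples of size 20, 2000 repetitions, and an arbitrary seed. The null hypothesis is true in every repetition, so every rejection is a Type I Error.

```r
set.seed(325)   # arbitrary seed for reproducibility
alpha <- 0.05

# Draw 2000 samples for which H0: mu = 0 is true, and record each p-value.
p_values <- replicate(2000, t.test(rnorm(20, mean = 0, sd = 1), mu = 0)$p.value)

# Fraction of samples in which a true null hypothesis was rejected.
type1_rate <- mean(p_values < alpha)
type1_rate   # should land close to alpha
```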

Type II Errors (β, and Power)

Defined as failing to reject the null hypothesis when it is actually false
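The probability of a Type II Error is β, and its value is often unknown
Power is defined as 1−β, the probability of rejecting the null hypothesis when it is actually false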

Sufficient Evidence

Statistics comes into play with hypothesis testing by defining the phrase “sufficient evidence.” When there is “sufficient evidence” in the data, the null hypothesis is rejected and the alternative hypothesis becomes the working hypothesis.

There are many statistical approaches to this problem of measuring the significance of evidence, but in almost all cases, the final measurement of evidence is given by the p-value of the hypothesis test. The p-value of a test is defined as the probability of the evidence being as extreme or more extreme than what was observed assuming the null hypothesis is true. This is an interesting phrase that is at first difficult to understand.

The “as extreme or more extreme” part of the definition of the p-value comes from the idea that the null hypothesis will be rejected when the evidence in the data is extremely inconsistent with the null hypothesis. If the data is not extremely different from what we would expect under the null hypothesis, then we will continue to believe the null hypothesis. Although, it is worth emphasizing that this does not prove the null hypothesis to be true.

Evidence not Proof

Hypothesis testing allows us a formal way to decide if we should “conclude the alternative” or “continue to accept the null.” It is important to remember that statistics (and science) cannot prove anything, just show evidence towards. Thus we never really prove a hypothesis is true, we simply show evidence towards or against a hypothesis.

Probability and Odds

Probability

The probability that an event will occur is the fraction of times you expect to see that event in many trials.

\[ \text{Probability} = \frac{\overbrace{x}^\text{successes}}{\underbrace{n}_\text{total attempts}} \]

Odds

The odds are defined as the probability that the event will occur divided by the probability that the event will not occur.

\[ \text{Odds For} = \frac{\overbrace{x}^\text{successes}}{\underbrace{n-x}_\text{fails}} \]

\[ \text{Odds Against} = \frac{\overbrace{n-x}^\text{fails}}{\underbrace{x}_\text{successes}} \]
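
A small worked example of these formulas, with made-up counts (3 successes in 10 attempts):

```r
x <- 3    # successes (illustrative value)
n <- 10   # total attempts (illustrative value)

probability  <- x / n          # 3/10
odds_for     <- x / (n - x)    # 3 successes to 7 fails
odds_against <- (n - x) / x    # 7 fails to 3 successes

# The odds also equal P(event) divided by P(not event).
odds_from_probability <- probability / (1 - probability)

c(probability = probability, odds_for = odds_for, odds_against = odds_against)
```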

Calculating the p-Value

Recall that the p-value measures how extremely the data (the evidence) differs from what is expected under the null hypothesis. Small p-values lead us to discard (reject) the null hypothesis. The p-value is the probability of obtaining a result (called a test statistic) at least as extreme as the one you calculated, assuming the null hypothesis is true.

A p-value can be calculated whenever we have two things.

A test statistic, which is a way of measuring how “far” the observed data is from what is expected under the null hypothesis.
The sampling distribution of the test statistic, which is the theoretical distribution of the test statistic over all possible samples, assuming the null hypothesis was true.

A distribution describes how data is spread out. When we know the shape of a distribution, we know which values are possible, but more importantly which values are most plausible (likely) and which are the least plausible (unlikely). The p-value uses the sampling distribution of the test statistic to measure the probability of a test statistic being as extreme or more extreme than the one observed.
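
To make the two ingredients concrete, here is a sketch that computes a one-sample t test p-value by hand and compares it with t.test. The data (airquality$Wind) and the null value of 10 are assumptions for illustration.

```r
y   <- airquality$Wind
mu0 <- 10                 # assumed null value
n   <- length(y)

# 1. Test statistic: how far the sample mean is from mu0, in standard errors.
t_stat <- (mean(y) - mu0) / (sd(y) / sqrt(n))

# 2. Sampling distribution under H0: a t distribution with n - 1 degrees of freedom.
#    Two-sided p-value: probability of a statistic at least as extreme in either tail.
p_manual <- 2 * pt(-abs(t_stat), df = n - 1)

# The built-in function gives the same answer.
p_builtin <- t.test(y, mu = mu0)$p.value
all.equal(p_manual, p_builtin)
```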

All p-value computation methods can be classified into two broad categories, parametric methods and nonparametric methods.

Methods

Parametric Methods

Parametric methods assume that, under the null hypothesis, the test statistic follows a specific theoretical parametric distribution. Parametric methods are typically more statistically powerful than nonparametric methods, but necessarily force more assumptions on the data.

Parametric distributions are theoretical distributions that can be described by a mathematical function. There are many theoretical distributions.

Four of the most widely used parametric distributions are:

Normal Distribution (Z-Distribution)
- It is a theoretical distribution that approximates the distributions of many real life data distributions
Chi Squared Distribution
- It only allows for values that are greater than or equal to zero
- It has a few real life applications, but by far its greatest use is theoretical
t Distribution
- A close friend of the normal distribution
- It is used extensively in hypothesis testing
F Distribution
- It is the ratio of two chi squared random variables that are each divided by their respective degrees of freedom
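
In R, tail probabilities for these four distributions come from pnorm, pchisq, pt, and pf. The sketch below uses arbitrary illustrative values and also shows two standard relationships between the distributions: the square of a standard normal value follows a chi squared distribution with 1 degree of freedom, and the square of a t value follows an F distribution with 1 numerator degree of freedom.

```r
z <- 1.96                     # illustrative z value
p_norm  <- 2 * pnorm(-abs(z))                        # two-sided normal p-value
p_chisq <- pchisq(z^2, df = 1, lower.tail = FALSE)   # same probability via chi squared

t_stat <- 2.5                 # illustrative t value
df     <- 12                  # illustrative degrees of freedom
p_t <- 2 * pt(-abs(t_stat), df = df)                           # two-sided t p-value
p_f <- pf(t_stat^2, df1 = 1, df2 = df, lower.tail = FALSE)     # same probability via F

c(p_norm = p_norm, p_chisq = p_chisq, p_t = p_t, p_f = p_f)
```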

Parametric Tests

Regression Models

Nonparametric Methods

Nonparametric methods place minimal assumptions on the distribution of data. They allow the data to “speak for itself.” They are typically less powerful than the parametric alternatives, but are more broadly applicable because fewer assumptions need to be satisfied.

Nonparametric Tests

Permutation Tests

This comes from the Permutation Tests page of this book
Random Testing - A nonparametric approach to computing the p-value for any test statistic in just about any scenario.
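
A minimal sketch of a permutation test, assuming the difference in group means as the test statistic, the built-in mtcars data, 2000 random shuffles, and an arbitrary seed. Shuffling the group labels mimics a world where the null hypothesis (no difference between groups) is true.

```r
set.seed(325)  # arbitrary seed for reproducibility

# Observed test statistic: mean mpg for am = 1 minus mean mpg for am = 0.
observed <- diff(tapply(mtcars$mpg, mtcars$am, mean))

# Shuffle the group labels many times and recompute the statistic each time.
perm_diffs <- replicate(2000, {
  shuffled <- sample(mtcars$am)
  diff(tapply(mtcars$mpg, shuffled, mean))
})

# Two-sided p-value: share of shuffled statistics at least as extreme as the observed one.
p_value <- mean(abs(perm_diffs) >= abs(observed))
p_value
```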

R Tools

This comes from the R Commands page and R Markdown Hints page of this book

A Quote to Remember

“Knowing a language doesn’t make you a data scientist, just like knowing English doesn’t make you a poet. You will also need to have analytics and visualization capabilities.”

What is R?

R is a successor to the S language, with its first stable release (version 1.0.0) in 2000. It is heavily used by trained statisticians and researchers. Thanks to RStudio (established in 2010), data scientists also use this software for their work.


R Commands

`?` The Help Command

Getting help in R is easy.

Usage

?something

This command pulls up the help file for whatever you write in the place of something.

Examples

?cars

The ? is the quick way to access the help function in R. The name of a dataset, like cars, can be typed after it to open the help file for that dataset.

?data

The name of an R function, like data, can also be used to open the help file for that function.

?mean

The mean function computes the mean of a column of quantitative data. Typing ?mean opens the help file for that function.

`$` The Selection Operator

Once you have a dataset, you need to be able to access columns from it.

Usage

DataSetName$ColumnName

The $ operator allows you to access the individual columns of a dataset.

Tip: think of the data set as a “store” from which you “purchase” a column using “money”: $.

Example Code

airquality$Wind

airquality is the airquality dataset. This could be the name of any dataset instead of airquality.
$ grabs the column, or variable, from the dataset to be used. This is typically used when computing, say, the mean (or other statistic) of a single column of the data.
Wind is the name of a column. The name of any column of the dataset can be entered after the dollar sign. In the airquality dataset, this includes: Ozone, Solar.R, Wind, Temp, Month, or Day as shown by View(airquality).

This allows you to compute things about that column, like the mean or standard deviation.

mean(airquality$Wind)

The mean function computes the mean of a column of quantitative data, here the Wind column of the airquality dataset.

sd(airquality$Wind)

The sd function computes the standard deviation of a column of quantitative data, here the Wind column of the airquality dataset.

See Numerical Summaries for more stats functions like mean() and sd().

`<-` The Assignment Operator

Being able to save your work is important!

Usage

Keyboard Shortcut: Alt -

NameYouCreate <- some R commands

<- (the less than symbol < followed by a hyphen -) is called the assignment operator and lets you store the results of some R commands into an object called NameYouCreate.
NameYouCreate is any name that begins with a letter, but can use numbers, periods, and underscores thereafter. To use spaces in the name, you must use `your Name` encased in back-ticks, but this is not recommended.

Example Code

cars2 <- cars

cars2 is the name of the object we are creating. In this case, we are making a copy of the cars dataset, so it is logical to call it cars2, but it could be bob, c2 or any name you wanted to use. Just be careful to not use names that are already in use!
<- takes whatever is on the right-hand side and saves it into the name written on the left-hand side.
cars is the dataset being copied to cars2 so that we can change cars2 without changing the original cars dataset.

cars2$ftpersec <- cars2$speed * 5280 / 3600

cars2 is the new copy of the cars dataset that we just created.
$ftpersec shows that the $ selection operator can be used to create a new column in a dataset when used with the <- assignment operator.
cars2$speed * 5280 / 3600 converts the miles per hour of the cars2 speed column into feet per second because there are 5280 feet in a mile and 60 minutes in an hour and 60 seconds in a minute.

View(cars2)

The cars2 dataset now contains a 3rd column called ftpersec. Compare this to the original cars dataset to see how it changed.

c( ) The Combine Function

Think of this function as the “back-pack” function, just like putting different books into one back-pack.

Usage

c(value 1, value 2, value 3, ... )

The c( ) function combines values into a single object called a “vector”.
values 1, 2, 3, ... can be numbers or characters, i.e., words, but must be all of one type or the other.
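
A quick sketch of what "all of one type" means in practice. The vector names below are made up for the example. When numbers and characters are mixed, R does not stop with an error; it converts every value to character.

```r
numbers <- c(8, 9, 7, 8)          # all numbers
words   <- c("red", "blue")       # all characters
mixed   <- c(1, "two", 3)         # mixed: the numbers are converted to characters

class(numbers)   # "numeric"
class(words)     # "character"
class(mixed)     # "character"
```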

Example Code

Classlist <- c("Jackson", "Jared", "Jill", "Jane")

Classlist is a new object being created using the assignment operator <- that will contain the four names listed inside c( ).
c( ) is the combine function. It is being used in this case to group character values representing names of students into a single object named Classlist.
"Jackson", "Jared", "Jill", "Jane" are the values we are grouping into the object named Classlist.

Ages <- c(8, 9, 7, 8)

Ages is an object created with the assignment operator <- that will contain the ages of each student on the Classlist.
8, 9, 7, 8 are the values, separated by commas, that are being grouped together. In this case, numbers are being grouped together.
Always close off your functions in R with a closing parenthesis.

Colors <- c("red", "blue", "green", "yellow")

Colors is an object created with the assignment operator <- that will have one color for each student on the Classlist.
"red", "blue", "green", "yellow" are the values, separated by commas, that are being grouped together. In this case, characters are being grouped together.

table( )

This is a way to quickly count how many times each value occurs in a column or columns.

Usage

table(NameOfDataset$columnName)

table(NameOfDataset$columnName1, NameOfDataset$columnName2)

The table( ) function counts how many times each value in a column of data occurs.
NameOfDataset is the name of a data set, like cars or airquality or KidsFeet.
columnName is the name of a column from the data set.
columnName1 and columnName2 are two different names of columns from the data set.

Example Code

speedCounts <- table(cars$speed)
speedCounts

speedCounts is a new object being created using the assignment operator <- that will contain the counts of how many times each “speed” occurs in the cars data set speed column.
table( ) is being used in this case to count how many times each speed occurs in the cars data set speed column.
cars is the name of the data set.
$ is used to access a given column from the data set.
speed is the name of the column we are interested in from the cars data set.
Always close off your functions in R with a closing parenthesis.
Typing the name of an object, as in the second line, will print the results to the screen.

## 
##  4  7  8  9 10 11 12 13 14 15 16 17 18 19 20 22 23 24 25 
##  2  2  1  1  3  2  4  4  4  3  2  3  4  3  5  1  1  4  1

Notice how the speed of “4” occurs 2 times, same for the speed of 7, but the speed of 8 only occurs 1 time and so on with the other speeds. The first row of the output is the value from the speed column. The number on the second line shows how many times that value occurred in the speed column.

library(mosaic)
birthdays <- table(KidsFeet$sex, KidsFeet$birthmonth)
birthdays

library(mosaic) is needed to access the KidsFeet data set that is used in this example. If you don’t have the mosaic library, you will need to run install.packages("mosaic") to install it first. From then on, you can open mosaic to use it with the command library(mosaic). You need only install packages once. You must library them each time you wish to use them.
birthdays is a new object being created using the assignment operator <- that will contain the counts of how many birthdays occur in each month for each gender in the KidsFeet dataset.
table( ) is being used in this case to count how many birthdays occur in each month for children of each gender.
KidsFeet$sex is the column we are interested in becoming the rows of our final table.
The comma separates the two columns of the data set you want to table.
KidsFeet$birthmonth is the column we are interested in becoming the columns of our final table.
Typing the name of an object, as in the last line, will print the results to the screen.

##    
##     1 2 3 4 5 6 7 8 9 10 11 12
##   B 1 2 3 2 1 1 2 2 2  1  1  2
##   G 1 1 5 1 1 3 1 0 3  1  1  1

The left column contains the “sex” values of “B” and “G” (Boy and Girl).

The top row contains the birthmonths (1 through 12).

The numbers within the row of the table next to the “B” show how many Boys had birthdays in each month of the year.

The numbers within the row of the table next to the “G” show how many Girls had birthdays in each month of the year.

filter( )

Used to reduce a dataset to a smaller set of rows than the original dataset contained.

Usage

filter(NameOfDataset, columnName filteringRules)

filter() is the function that filters out certain rows of the dataset.
NameOfDataset is the name of a dataset, like cars or airquality or KidsFeet.
columnName is the name of one of the columns from the dataset. You can use colnames(NameOfDataset) or View(NameOfDataset) to see the names.
filteringRules consists of some Logical Expression (see table below) that selects only the rows from the original dataset that meet the criterion.
Controls which rows are used – By Filtering Rows

Filtering Rule	Logical Expression
Equals one “thing”	`columnName` `==` `something`
Equals Any Of Several “things”	`columnName` `%in%` `c(something1,something2,...)`
Not Equal (one thing)	`columnName` `!=` `something`
Not Equals Any of (several things)	`!columnName` `%in%` `c(something1,something2,...)`
Less Than	`columnName` `<` `value`
Less Than or Equal to	`columnName` `<=` `value`
Greater Than	`columnName` `>` `value`
Greater Than or Equal to	`columnName` `>=` `value`
AND	`expression1` `&` `expression2`
OR	`expression1` `|` `expression2`
Equals `NA`	`is.na(columnName)`
Not `NA`	`!is.na(columnName)`
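
A few of the logical expressions from the table, sketched on the built-in airquality dataset so the example runs without the KidsFeet data. It assumes the dplyr package (part of the tidyverse) is installed. The object names and cutoff values are made up for the example.

```r
library(dplyr)  # assumption: dplyr is installed; it provides filter()

# Equals any of several things: rows from June, July, or August.
summer <- filter(airquality, Month %in% c(6, 7, 8))

# Not equal to any of several things: every other month.
not_summer <- filter(airquality, !Month %in% c(6, 7, 8))

# AND: both conditions must hold.
hot_and_calm <- filter(airquality, Temp > 90 & Wind < 8)

# OR: either condition may hold.
may_or_sept <- filter(airquality, Month == 5 | Month == 9)

# Not NA: keep only rows where Ozone was actually recorded.
has_ozone <- filter(airquality, !is.na(Ozone))

nrow(airquality)  # compare with nrow() of each filtered dataset
```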

Example Code

library(tidyverse)
library(mosaic)

The tidyverse library is needed to access the filter function used in the following example codes.
The mosaic library is needed to access the KidsFeet data set used in the following example codes.

Equals one “thing”…

Kids87 <- filter(KidsFeet, birthyear == 87)

Kids87 is a name we made up. The assignment operator <- will save the reduced version of the KidsFeet dataset created by the filter(...) function into this name.
filter is a function from library(tidyverse) that reduces the number of rows in the KidsFeet dataset by filtering according to certain criteria.
birthyear is a quantitative column of the KidsFeet dataset that we want to use to reduce the dataset.
== 87 is the “filtering rule”. It filters the data down to just those children who had a birthyear equal to 87.
Always close off your functions in R with a closing parenthesis.

KidsBoys <- filter(KidsFeet, sex == "B")

KidsBoys is a name we made up for the reduced version of the KidsFeet dataset.
sex is a categorical column of the KidsFeet dataset that we want to use to reduce the dataset.
== "B" is the “filtering rule”. It filters the data down to just those children who are boys. Words must be quoted, like "B", but numeric values are just typed directly.

Equals Any of Several “things”…

KidsSummer <- filter(KidsFeet, birthmonth %in% c(6,7,8))

KidsSummer is a name we made up for the reduced version of the KidsFeet dataset.
birthmonth is the column of the KidsFeet dataset that we want to use to reduce the dataset.
%in% c(6,7,8) is the “filtering rule”. It will filter the data down to just those children who were born during the summer, i.e., birthmonth equal to either 6, 7, or 8. Notice how the c( ) function is being used to combine the values of 6, 7, and 8 together into a single list of numbers.

Does not equal one thing…

KidsNotJosh <- filter(KidsFeet, name != "Josh")

KidsNotJosh is a name we made up for the reduced version of the KidsFeet dataset.
name is the column of the KidsFeet dataset that we want to use to reduce the dataset.
!= "Josh" is the “filtering rule”. It will filter the data down to just those children who are NOT named “Josh”. In this case, it removed just two students who were named “Josh”.

Less than…

KidsLength24 <- filter(KidsFeet, length < 24)

KidsLength24 is a name we made up for the reduced version of the KidsFeet dataset.
length is the column of the KidsFeet dataset that we want to use to reduce the dataset.
< 24 is the “filtering rule”. It will filter the data down to just those children who have a foot length less than 24 cm.

Less than or equal to…

KidsLessEq24 <- filter(KidsFeet, length <= 24)

KidsLessEq24 is a name we made up for the reduced version of the KidsFeet dataset.
length is the column of the KidsFeet dataset that we want to use to reduce the dataset.
<= 24 is the “filtering rule”. It will filter the data down to just those children who have a foot length less than or equal to 24 cm.

Greater than…

KidsWider9 <- filter(KidsFeet, width > 9)

KidsWider9 is a name we made up for the reduced version of the KidsFeet dataset.
width is the column of the KidsFeet dataset that we want to use to reduce the dataset.
> 9 is the “filtering rule”. It will filter the data down to just those children who have a foot width greater than 9 cm.

Greater than or equal to…

KidsWiderEq9 <- filter(KidsFeet, width >= 9)

KidsWiderEq9 is a name we made up for the reduced version of the KidsFeet dataset.
width is the column of the KidsFeet dataset that we want to use to reduce the dataset.
>= 9 is the “filtering rule”. It will filter the data down to just those children who have a foot width greater than or equal to 9 cm.

The “and” statement…

GirlsWide9 <- filter(KidsFeet, sex == "G" & width > 9)

GirlsWide9 is a name we made up for the reduced version of the KidsFeet dataset.
sex == "G" is the first “filtering rule”. It will filter the data down to just those children who are girls.
& is the AND statement. It joins two filtering criteria together into a single criterion where both conditions must be met. In this case, it ensures we get only girls with foot widths greater than 9 cm.
width > 9 is the second “filtering rule”. It will filter the data down to just those children who have a foot width greater than 9 cm.

The “or” statement…

KidsWinter <- filter(KidsFeet, birthmonth <= 2 | birthmonth >= 11)

KidsWinter <- : KidsWinter is a name we made up. The assignment operator <- will save the reduced version of the KidsFeet dataset created by the filter(...) function into this name.
filter(KidsFeet, : “filter” is a function from library(tidyverse) that reduces the number of rows in the KidsFeet dataset by filtering according to certain criteria.
birthmonth : The first column of the KidsFeet dataset that we want to use to reduce the dataset.
<= 2 : This is the first “filtering rule”. It will filter the data down to just those children who were born in January or February.
| : The | is the OR statement. It joins two filtering criteria together into a single criterion where meeting either condition is enough. In this case, it keeps any child born in January, February, November, or December.
birthmonth : The second column of the KidsFeet dataset that we want to use to reduce the dataset. In this case, it is the same as the first column.
>= 11 : This is the second “filtering rule”. It will filter the data down to just those children who were born in November or December.
) : Always close off your functions in R with a closing parenthesis.
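The four filtering rules above can be tried side by side. The sketch below is only an illustration: it assumes library(dplyr) is installed (it is loaded by library(tidyverse)), and `kids` is a made-up name for eight rows copied from the KidsFeet output printed on this page, so that the code runs without library(mosaic).

```r
library(dplyr)

# Hypothetical eight-row sample of KidsFeet (values copied from the tables on this page).
kids <- data.frame(
  name       = c("David", "Lars", "Zach", "Josh", "Caitlin", "Eleanor", "Julie", "Heather"),
  birthmonth = c(5, 10, 12, 1, 6, 5, 11, 3),
  width      = c(8.4, 8.8, 9.7, 9.8, 8.8, 9.3, 9.3, 9.5),
  sex        = c("B", "B", "B", "B", "G", "G", "G", "G"),
  stringsAsFactors = FALSE
)

wider9     <- filter(kids, width > 9)                           # strictly greater than 9
widerEq9   <- filter(kids, width >= 9)                          # 9 itself would also be kept
girlsWide9 <- filter(kids, sex == "G" & width > 9)              # both conditions must hold
winterKids <- filter(kids, birthmonth <= 2 | birthmonth >= 11)  # either condition is enough

wider9$name      # "Zach" "Josh" "Eleanor" "Julie" "Heather"
girlsWide9$name  # "Eleanor" "Julie" "Heather"
winterKids$name  # "Zach" "Josh" "Julie"
```

In this small sample no child has a width of exactly 9, so `>` and `>=` happen to keep the same rows; on the full KidsFeet data they differ.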

select( )

Used to select out certain columns from a dataset.

Usage

select(NameOfDataset, listOfColumnNames)

select( ) is the function that selects out certain columns of the dataset.
NameOfDataset is the name of a dataset, like cars or airquality or KidsFeet.
listOfColumnNames is a vector of names of columns from the dataset, usually supplied inside a combine c(...) statement.
Controls the columns being used – By Selecting Columns

Example Code

KidsNameBirth <- select(KidsFeet, c(name, birthyear, birthmonth))

KidsNameBirth <- : KidsNameBirth is a name we made up. The assignment operator <- will save the reduced version of the KidsFeet dataset created by the select(...) function into this name.
select(KidsFeet, : “select” is a function from library(tidyverse) that selects out specified columns from the original dataset in the order specified.
c(name, birthyear, birthmonth) : The columns of the KidsFeet dataset that we want to select out of the original dataset. Notice how the concatenation function c(...) is used to list out the columns we want.
) : Always close off your functions in R with a closing parenthesis.

KidsBigLength <- select(KidsFeet, c(name, length))

KidsBigLength <- : KidsBigLength is a name we made up. The assignment operator <- will save the reduced version of the KidsFeet dataset created by the select(...) function into this name.
select(KidsFeet, : “select” is a function from library(tidyverse) that selects out specified columns from the original dataset in the order specified.
c(name, length) : The columns of the KidsFeet dataset that we want to select out of the original dataset. Notice how the concatenation function c(...) is used to list out the columns we want.
) : Always close off your functions in R with a closing parenthesis.
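As a small sketch of select( ) on a dataset that ships with R, the code below uses mtcars instead of KidsFeet (an assumption made only so that nothing beyond library(dplyr) is needed). The names carsSmall, carsSmall2, and carsNoWt are made up for this illustration.

```r
library(dplyr)

carsSmall  <- select(mtcars, c(mpg, cyl, wt))  # same c(...) style as the examples above
carsSmall2 <- select(mtcars, mpg, cyl, wt)     # c(...) is optional; columns can be listed directly
carsNoWt   <- select(mtcars, -wt)              # a minus sign drops a column and keeps the rest

names(carsSmall)  # "mpg" "cyl" "wt"
ncol(carsNoWt)    # 10 (mtcars has 11 columns)
```

Both spellings of the column list give the same result, so use whichever reads more clearly to you.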

%>% The Pipe Operator

Just like the pipes in your kitchen sink, the pipe operator takes “water from the sink” and “sends it down to somewhere else.”

Usage

Keyboard Shortcut: Ctrl + Shift + M

NameOfDataset %>%

some R commands that follow on the next line

%>%, the pipe operator, is created by typing percent symbols % on both sides of a greater than symbol >. It lets you take whatever is on the left of the symbol and “pipe it down into” some R commands that follow on the next line.
NameOfDataset is the name of a dataset, like cars or airquality or KidsFeet.

Note: you should load library(tidyverse) before using the %>% operator.

Example Code

Kids2 <- KidsFeet %>%
  filter(birthyear == 87) %>%
  select(c(name, birthyear, length))

Kids2 <- : This provides a name for the new reduced version of the KidsFeet dataset that is going to be created by the combined use of filter(...) and select(...).
KidsFeet : KidsFeet is a dataset found in library(mosaic).
%>% : The pipe operator that will send the KidsFeet dataset down inside of the code on the following line.
filter( : “filter” is a function from library(tidyverse) that allows us to reduce the number of rows in the KidsFeet dataset by filtering according to certain criteria.
birthyear : Represents the column of data that we want to use to reduce the rows of the dataset.
== 87 : This is the “filtering rule”. It will filter the data down to just those children who had a birthyear equal to 87.
) : Always close off your functions in R with a closing parenthesis.
%>% : The pipe operator that will send the filtered version of the KidsFeet dataset down inside of the code on the following line.
select( : “select” is a function from library(tidyverse) that selects out specified columns from the current dataset in the order specified.
c(name, birthyear, length) : The columns of the filtered KidsFeet dataset that we want to select. Notice how the concatenation function c(...) is used to list out the columns we want.
) : Always close off your functions in R with a closing parenthesis.
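One way to see what the pipe is doing is to write the same steps twice, once nested and once piped. This is a sketch only; it assumes library(dplyr) is installed and uses the built-in airquality dataset rather than KidsFeet. The names nested and piped are made up.

```r
library(dplyr)

# Nested version: has to be read from the inside out.
nested <- select(filter(airquality, Month == 5), c(Ozone, Temp, Day))

# Piped version: reads from top to bottom, one step per line.
piped <- airquality %>%
  filter(Month == 5) %>%
  select(c(Ozone, Temp, Day))

identical(nested, piped)  # TRUE: the pipe only changes how the code reads
```

The pipe places whatever is on its left into the first argument of the function on its right, which is why the dataset name does not appear inside filter( ) or select( ) in the piped version.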

summarise( ) and group_by( )

Compute numerical summaries on data or on groupings within the data.

Usage

NameOfDataset %>%
  summarise(nameYouLike = some_stats_function(columnName))

NameOfDataset %>%
  group_by(columnGroupsName) %>%
  summarise(nameYouLike = some_stats_function(columnName))

NameOfDataset is the name of a dataset, like cars or airquality or KidsFeet.
%>% is the pipe operator that “pipes data” down into R commands on the next line.
group_by(...) is an R function from library(tidyverse) that groups data according to a specified column (or columns).
summarise(...) is an R function from library(tidyverse) that computes numerical summaries on data or groups of data.
columnGroupsName is the name of a column that represents qualitative (categorical) data. This column is used to separate the dataset into little datasets, one “little dataset” for each group or category in the columnGroupsName column.
nameYouLike is just that. Some name you come up with.
some_stats_function(...) is a stats function like mean(...), sd(...), n(...), and so on.
columnName is the name of a column from the dataset that you want to compute numerical summaries on.
Quantitative in Summarize
Qualitative in Group By

Example Code

aveLength
24.72

KidsFeet %>%
  summarise(aveLength = mean(length),
            sdLength = sd(length),
            sampleSize = n())

KidsFeet : KidsFeet is a dataset found in library(mosaic).
%>% : The pipe operator that will send the KidsFeet dataset down inside of the code on the following line.
summarise( : “summarise” is a function from library(tidyverse) that allows us to compute numerical summaries on data.
aveLength : A name we came up with that will store the results of the numerical summary.
= mean(length), : This computes the mean(...) of the length column from the KidsFeet dataset.
sdLength : A name we came up with that will store the results of the numerical summary.
= sd(length), : This computes the sd(...) of the length column from the KidsFeet dataset.
sampleSize : A name we came up with that will store the results of the numerical summary.
= n( ) : This computes n( ), the sample size, by counting the rows of the KidsFeet dataset.
) : Always close off your functions in R with a closing parenthesis.

aveLength	sdLength	sampleSize
24.72	1.318	39

KidsFeet %>%
  group_by(sex) %>%
  summarise(aveLength = mean(length),
            sdLength = sd(length),
            sampleSize = n())

KidsFeet : KidsFeet is a dataset found in library(mosaic).
%>% : The pipe operator that will send the KidsFeet dataset down inside of the code on the following line.
group_by( : “group_by” is a function from library(tidyverse) that allows us to split the dataset up into “little groups” according to the column specified.
sex : “sex” is a column from the KidsFeet dataset that records the gender of each child.
) : Always close off your functions in R with a closing parenthesis.
%>% : The pipe operator that will send the KidsFeet dataset, now grouped by sex, down inside of the code on the following line.
summarise( : “summarise” is a function from library(tidyverse) that allows us to compute numerical summaries on data.
aveLength : A name we came up with that will store the results of the numerical summary.
= mean(length), : This computes the mean(...) of the length column within each group.
sdLength : A name we came up with that will store the results of the numerical summary.
= sd(length), : This computes the sd(...) of the length column within each group.
sampleSize : A name we came up with that will store the results of the numerical summary.
= n( ) : This computes n( ), the sample size, by counting the rows in each group.
) : Always close off your functions in R with a closing parenthesis.

For more uses of summarise(...) and group_by(...) see the Example codes on the various “R Instructions” of the Numerical Summaries page.
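The same group_by(...) and summarise(...) pattern can be tried on a dataset that comes with R. The sketch below assumes library(dplyr) is installed and uses mtcars, with the cyl column (4, 6, or 8 cylinders) standing in for the qualitative grouping column. The names byCyl, aveMpg, sdMpg, and sampleSize are made up.

```r
library(dplyr)

byCyl <- mtcars %>%
  group_by(cyl) %>%                    # one "little dataset" per number of cylinders
  summarise(aveMpg     = mean(mpg),    # mean of mpg within each group
            sdMpg      = sd(mpg),      # standard deviation of mpg within each group
            sampleSize = n())          # number of rows in each group

byCyl             # one row for each value of cyl
byCyl$sampleSize  # 11 7 14
```

Leaving out the group_by(cyl) line would give a single row that summarises all 32 cars at once.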

mutate( )

Transform a column of data.

Usage

NameOfDataset %>%
  mutate(nameYouLike = some_transformation)

NameOfDataset is the name of a dataset, like cars or airquality or KidsFeet.
%>% is the pipe operator that “pipes data” down into R commands on the next line.
nameYouLike is just that. Some name you come up with that will be the name of a new column in the dataset.
some_transformation is just that. See the example codes for ideas.
Create or Transform columns in data

Example Code

mtcars2 <- mtcars %>%
  mutate(cyl_factor = as.factor(cyl),
         weight = wt * 1000)

mtcars2 <- : mtcars2 is a new dataset we are creating that will contain all of the mtcars dataset along with a couple of new columns we are creating.
mtcars : mtcars is a dataset found in base R. Typing View(mtcars) and ?mtcars in the console will help you learn more about the dataset.
%>% : The pipe operator that will send the mtcars dataset down inside of the code on the following line.
mutate( : “mutate” is a function from library(tidyverse) that allows us to transform columns of data.
cyl_factor = as.factor(cyl), : “cyl_factor” is a name we came up with that will store the results of the transformation of the “cyl” column. Here we are simply converting the “cyl” column from type numeric to a factor. Treating the “cyl” column as a factor could be useful in certain situations.
weight = wt * 1000 : “weight” is a name we came up with that will store the results of the transformation of the “wt” column. Taking a closer look with ?mtcars shows us that wt is in 1000 lbs. Here we are just multiplying each row in the column by 1000.
) : Closing parenthesis for the mutate(…) function.

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	cyl_factor	weight
21	6	160	110	3.9	2.62	16.46	0	1	4	4	6	2620
21	6	160	110	3.9	2.875	17.02	0	1	4	4	6	2875
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	4	2320
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	6	3215
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	8	3440
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	6	3460
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	8	3570
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	4	3190
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	4	3150
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4	6	3440
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	6	3440
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	8	4070
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	8	3730
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	8	3780
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	8	5250
10.4	8	460	215	3	5.424	17.82	0	0	3	4	8	5424
14.7	8	440	230	3.23	5.345	17.42	0	0	3	4	8	5345
32.4	4	78.7	66	4.08	2.2	19.47	1	1	4	1	4	2200
30.4	4	75.7	52	4.93	1.615	18.52	1	1	4	2	4	1615
33.9	4	71.1	65	4.22	1.835	19.9	1	1	4	1	4	1835
21.5	4	120.1	97	3.7	2.465	20.01	1	0	3	1	4	2465
15.5	8	318	150	2.76	3.52	16.87	0	0	3	2	8	3520
15.2	8	304	150	3.15	3.435	17.3	0	0	3	2	8	3435
13.3	8	350	245	3.73	3.84	15.41	0	0	3	4	8	3840
19.2	8	400	175	3.08	3.845	17.05	0	0	3	2	8	3845
27.3	4	79	66	4.08	1.935	18.9	1	1	4	1	4	1935
26	4	120.3	91	4.43	2.14	16.7	0	1	5	2	4	2140
30.4	4	95.1	113	3.77	1.513	16.9	1	1	5	2	4	1513
15.8	8	351	264	4.22	3.17	14.5	0	1	5	4	8	3170
19.7	6	145	175	3.62	2.77	15.5	0	1	5	6	6	2770
15	8	301	335	3.54	3.57	14.6	0	1	5	8	8	3570
21.4	4	121	109	4.11	2.78	18.6	1	1	4	2	4	2780

Kids3 <- KidsFeet %>%
  mutate(season = case_when(
    birthmonth %in% c(12, 1, 2) ~ "Winter",
    birthmonth %in% c(3, 4, 5) ~ "Spring",
    birthmonth %in% c(6, 7, 8) ~ "Summer",
    birthmonth %in% c(9, 10, 11) ~ "Fall"
  ))

Kids3 <- : Kids3 is a new dataset we are creating that will contain all of the KidsFeet dataset along with a new column we are creating.
KidsFeet : KidsFeet is a dataset found in library(mosaic).
%>% : The pipe operator that will send the KidsFeet dataset down inside of the code on the following line.
mutate( : “mutate” is a function from library(tidyverse) that allows us to transform columns of data.
season = case_when( : “season” is a name we came up with that will store the results of the transformation of the “birthmonth” column. The case_when(…) function from library(tidyverse) allows us to perform more complicated transformations with columns. Each line in the body of case_when(…) is of the form logical expression ~ "newValueName".
birthmonth %in% c(12, 1, 2) ~ "Winter", : This statement says that we want the values in the column “birthmonth” that are equal to 12, 1, or 2 to be assigned the value “Winter” in the new “season” column.
birthmonth %in% c(3, 4, 5) ~ "Spring", : This statement says that we want the values in the column “birthmonth” that are equal to 3, 4, or 5 to be assigned the value “Spring” in the new “season” column.
birthmonth %in% c(6, 7, 8) ~ "Summer", : This statement says that we want the values in the column “birthmonth” that are equal to 6, 7, or 8 to be assigned the value “Summer” in the new “season” column.
birthmonth %in% c(9, 10, 11) ~ "Fall" : This statement says that we want the values in the column “birthmonth” that are equal to 9, 10, or 11 to be assigned the value “Fall” in the new “season” column.
) : Closing parenthesis of the case_when(…) function.
) : Closing parenthesis for the mutate(…) function.

name	birthmonth	birthyear	length	width	sex	biggerfoot	domhand	season
David	5	88	24.4	8.4	B	L	R	Spring
Lars	10	87	25.4	8.8	B	L	L	Fall
Zach	12	87	24.5	9.7	B	R	R	Winter
Josh	1	88	25.2	9.8	B	L	R	Winter
Lang	2	88	25.1	8.9	B	L	R	Winter
Scotty	3	88	25.7	9.7	B	R	R	Spring
Edward	2	88	26.1	9.6	B	L	R	Winter
Caitlin	6	88	23	8.8	G	L	R	Summer
Eleanor	5	88	23.6	9.3	G	R	R	Spring
Damon	9	88	22.9	8.8	B	R	L	Fall
Mark	9	87	27.5	9.8	B	R	R	Fall
Ray	3	88	24.8	8.9	B	L	R	Spring
Cal	8	87	26.1	9.1	B	L	R	Summer
Cam	3	88	27	9.8	B	L	R	Spring
Julie	11	87	26	9.3	G	L	R	Fall
Kate	4	88	23.7	7.9	G	R	R	Spring
Caroline	12	87	24	8.7	G	R	L	Winter
Maggie	3	88	24.7	8.8	G	R	R	Spring
Lee	6	88	26.7	9	G	L	L	Summer
Heather	3	88	25.5	9.5	G	R	R	Spring
Andy	6	88	24	9.2	B	R	R	Summer
Josh	7	88	24.4	8.6	B	L	R	Summer
Laura	9	88	24	8.3	G	R	L	Fall
Erica	9	88	24.5	9	G	L	R	Fall
Peggy	10	88	24.2	8.1	G	L	R	Fall
Glen	7	88	27.1	9.4	B	L	R	Summer
Abby	2	88	26.1	9.5	G	L	R	Winter
David	12	87	25.5	9.5	B	R	R	Winter
Mike	11	88	24.2	8.9	B	L	R	Fall
Dwayne	8	88	23.9	9.3	B	R	L	Summer
Danielle	6	88	24	9.3	G	L	R	Summer
Caitlin	7	88	22.5	8.6	G	R	R	Summer
Leigh	3	88	24.5	8.6	G	L	R	Spring
Dylan	4	88	23.6	9	B	R	L	Spring
Peter	4	88	24.7	8.6	B	R	L	Spring
Hannah	3	88	22.9	8.5	G	L	R	Spring
Teshanna	3	88	26	9	G	L	R	Spring
Hayley	1	88	21.6	7.9	G	R	R	Winter
Alisha	9	88	24.6	8.8	G	L	R	Fall

Kids4 <- KidsFeet %>%
  mutate(lengthIN = length / 2.54,
         widthIN = width / 2.54,
         lengthSplit = ifelse(length < median(length),
                              "Under 50th Percentile",
                              "50th Percentile or Greater"),
         gender = case_when(
           sex == "B" ~ "Boy",
           sex == "G" ~ "Girl"
         ))

Kids4 <- : Kids4 is a new dataset we are creating that will contain all of the KidsFeet dataset along with several new columns we are creating.
KidsFeet : KidsFeet is a dataset found in library(mosaic).
%>% : The pipe operator that will send the KidsFeet dataset down inside of the code on the following line.
mutate( : “mutate” is a function from library(tidyverse) that allows us to transform columns of data.
lengthIN = length / 2.54, : “lengthIN” is a name we came up with that will store the results of the transformation of the “length” column. This is just converting the length data from cm to inches.
widthIN = width / 2.54, : “widthIN” is a name we came up with that will store the results of the transformation of the “width” column. This is just converting the width data from cm to inches.
lengthSplit = ifelse(length < median(length), "Under 50th Percentile", "50th Percentile or Greater"), : “lengthSplit” is a name we came up with that will store the results of the ifelse(…) function. The ifelse(…) function in this case is being used to split the length column by the median of that column. The ifelse(…) function is of the form ifelse(LogicalCondition, valueIfConditionTrue, valueIfConditionFalse).
gender = case_when( : “gender” is a name we came up with that will store the results of the transformation of the “sex” column. The case_when(…) function from library(tidyverse) allows us to perform more complicated transformations with columns. Each line in the body of case_when(…) is of the form logical expression ~ "newValueName".
sex == "B" ~ "Boy", : This part of the case_when(…) function is being used to change the value of “B” to “Boy”.
sex == "G" ~ "Girl" : This part of the case_when(…) function is being used to change the value of “G” to “Girl”.
) : Closing parenthesis for the case_when(…) function.
) : Closing parenthesis for the mutate(…) function.

Notice the addition (on the right) of four new columns: lengthIN, widthIN, lengthSplit, and gender.

name	birthmonth	birthyear	length	width	sex	biggerfoot	domhand	lengthIN	widthIN	lengthSplit	gender
David	5	88	24.4	8.4	B	L	R	9.606	3.307	Under 50th Percentile	Boy
Lars	10	87	25.4	8.8	B	L	L	10	3.465	50th Percentile or Greater	Boy
Zach	12	87	24.5	9.7	B	R	R	9.646	3.819	50th Percentile or Greater	Boy
Josh	1	88	25.2	9.8	B	L	R	9.921	3.858	50th Percentile or Greater	Boy
Lang	2	88	25.1	8.9	B	L	R	9.882	3.504	50th Percentile or Greater	Boy
Scotty	3	88	25.7	9.7	B	R	R	10.12	3.819	50th Percentile or Greater	Boy
Edward	2	88	26.1	9.6	B	L	R	10.28	3.78	50th Percentile or Greater	Boy
Caitlin	6	88	23	8.8	G	L	R	9.055	3.465	Under 50th Percentile	Girl
Eleanor	5	88	23.6	9.3	G	R	R	9.291	3.661	Under 50th Percentile	Girl
Damon	9	88	22.9	8.8	B	R	L	9.016	3.465	Under 50th Percentile	Boy
Mark	9	87	27.5	9.8	B	R	R	10.83	3.858	50th Percentile or Greater	Boy
Ray	3	88	24.8	8.9	B	L	R	9.764	3.504	50th Percentile or Greater	Boy
Cal	8	87	26.1	9.1	B	L	R	10.28	3.583	50th Percentile or Greater	Boy
Cam	3	88	27	9.8	B	L	R	10.63	3.858	50th Percentile or Greater	Boy
Julie	11	87	26	9.3	G	L	R	10.24	3.661	50th Percentile or Greater	Girl
Kate	4	88	23.7	7.9	G	R	R	9.331	3.11	Under 50th Percentile	Girl
Caroline	12	87	24	8.7	G	R	L	9.449	3.425	Under 50th Percentile	Girl
Maggie	3	88	24.7	8.8	G	R	R	9.724	3.465	50th Percentile or Greater	Girl
Lee	6	88	26.7	9	G	L	L	10.51	3.543	50th Percentile or Greater	Girl
Heather	3	88	25.5	9.5	G	R	R	10.04	3.74	50th Percentile or Greater	Girl
Andy	6	88	24	9.2	B	R	R	9.449	3.622	Under 50th Percentile	Boy
Josh	7	88	24.4	8.6	B	L	R	9.606	3.386	Under 50th Percentile	Boy
Laura	9	88	24	8.3	G	R	L	9.449	3.268	Under 50th Percentile	Girl
Erica	9	88	24.5	9	G	L	R	9.646	3.543	50th Percentile or Greater	Girl
Peggy	10	88	24.2	8.1	G	L	R	9.528	3.189	Under 50th Percentile	Girl
Glen	7	88	27.1	9.4	B	L	R	10.67	3.701	50th Percentile or Greater	Boy
Abby	2	88	26.1	9.5	G	L	R	10.28	3.74	50th Percentile or Greater	Girl
David	12	87	25.5	9.5	B	R	R	10.04	3.74	50th Percentile or Greater	Boy
Mike	11	88	24.2	8.9	B	L	R	9.528	3.504	Under 50th Percentile	Boy
Dwayne	8	88	23.9	9.3	B	R	L	9.409	3.661	Under 50th Percentile	Boy
Danielle	6	88	24	9.3	G	L	R	9.449	3.661	Under 50th Percentile	Girl
Caitlin	7	88	22.5	8.6	G	R	R	8.858	3.386	Under 50th Percentile	Girl
Leigh	3	88	24.5	8.6	G	L	R	9.646	3.386	50th Percentile or Greater	Girl
Dylan	4	88	23.6	9	B	R	L	9.291	3.543	Under 50th Percentile	Boy
Peter	4	88	24.7	8.6	B	R	L	9.724	3.386	50th Percentile or Greater	Boy
Hannah	3	88	22.9	8.5	G	L	R	9.016	3.346	Under 50th Percentile	Girl
Teshanna	3	88	26	9	G	L	R	10.24	3.543	50th Percentile or Greater	Girl
Hayley	1	88	21.6	7.9	G	R	R	8.504	3.11	Under 50th Percentile	Girl
Alisha	9	88	24.6	8.8	G	L	R	9.685	3.465	50th Percentile or Greater	Girl

airquality2 <- airquality %>%
  mutate(Month_Full = month(Month, label = TRUE, abbr = FALSE))

airquality2 <- : airquality2 is a new dataset we are creating that will contain all of the airquality dataset along with a new column we are creating.
airquality : airquality is a dataset found in base R. Typing View(airquality) and ?airquality in the console will help you learn more about the dataset.
%>% : The pipe operator that will send the airquality dataset down inside of the code on the following line.
mutate( : “mutate” is a function from library(tidyverse) that allows us to transform columns of data.
Month_Full = : “Month_Full” is a name we came up with that will store the results of the transformation of the “Month” column.
month( : month(…) is from library(lubridate). With label = TRUE it changes the “Month” column from type integer to an ordered factor of month names.
Month, : “Month” is the “Month” column from airquality.
label = TRUE, : “label = TRUE” tells month(…) to change the month numbers to abbreviated month names.
abbr = FALSE : “abbr = FALSE” changes the abbreviated month names to the full month names.
) : Closing parenthesis for the month(…) function.
) : Closing parenthesis for the mutate(…) function.
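To check what kind of column month(…) creates, the short sketch below runs the airquality example and inspects the result. It assumes library(dplyr) and library(lubridate) are installed. The month names themselves depend on your computer's language settings, so they are not shown here.

```r
library(dplyr)
library(lubridate)

airquality2 <- airquality %>%
  mutate(Month_Full = month(Month, label = TRUE, abbr = FALSE))

is.ordered(airquality2$Month_Full)  # TRUE: the new column is an ordered factor
nlevels(airquality2$Month_Full)     # 12: one level per month, even though the data only cover months 5 to 9
```

Because the column is an ordered factor, plots and tables will list the months in calendar order rather than alphabetical order.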

Other case_when( ) Uses

case_when(length > 25 & width > 9 ~ "Long and Wide",
          length < 25 & width > 9 ~ "Short and Wide",
          length > 25 & width < 9 ~ "Long and Thin",
          length < 25 & width < 9 ~ "Short and Thin")
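One thing to watch for in the case_when(…) above: a row that sits exactly on a cutoff (length equal to 25 or width equal to 9) matches none of the four rules and is given NA. The sketch below shows this on a hypothetical three-row data frame called feet (the values are invented for illustration) and adds a catch-all final rule, TRUE ~ "On a cutoff", as one possible way to handle such rows. It assumes library(dplyr) is installed.

```r
library(dplyr)

# Hypothetical data; the second row has a width of exactly 9.
feet <- data.frame(length = c(26, 24, 27), width = c(9.5, 9, 8.5))

feet2 <- feet %>%
  mutate(
    size = case_when(
      length > 25 & width > 9 ~ "Long and Wide",
      length < 25 & width > 9 ~ "Short and Wide",
      length > 25 & width < 9 ~ "Long and Thin",
      length < 25 & width < 9 ~ "Short and Thin"
    ),
    sizeWithFallback = case_when(
      length > 25 & width > 9 ~ "Long and Wide",
      length < 25 & width > 9 ~ "Short and Wide",
      length > 25 & width < 9 ~ "Long and Thin",
      length < 25 & width < 9 ~ "Short and Thin",
      TRUE ~ "On a cutoff"   # catch-all: used when no earlier rule matched
    )
  )

feet2$size              # "Long and Wide" NA "Long and Thin"
feet2$sizeWithFallback  # "Long and Wide" "On a cutoff" "Long and Thin"
```

Another option is to change one side of each comparison to >= or <= so that every row falls into exactly one category.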

replace_na( ) Function

replace_na(columnName, value)

as.numeric( ) Function

as.numeric(columnName)

as.character( ) Function

as.character(columnName)

as.factor( ) Function

as.factor(columnName)
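These four helper functions are usually used inside mutate(…). The sketch below is an illustration on the built-in airquality dataset, whose Ozone column contains missing values. It assumes library(dplyr) and library(tidyr) are installed (replace_na( ) comes from tidyr, which library(tidyverse) also loads). The new column names are made up, and filling missing readings with 0 is done here only to show the mechanics, not as a recommended analysis step.

```r
library(dplyr)
library(tidyr)

air2 <- airquality %>%
  mutate(
    OzoneFilled = replace_na(Ozone, 0L),     # every NA in Ozone becomes 0
    MonthFactor = as.factor(Month),          # numbers treated as categories
    MonthText   = as.character(Month),       # numbers stored as text, like "5"
    MonthNumber = as.numeric(MonthText)      # text converted back to numbers
  )

sum(is.na(airquality$Ozone))   # 37 missing values in the original column
sum(is.na(air2$OzoneFilled))   # 0 missing values after replace_na( )
```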

arrange( )

Arrange data by a certain column, or columns, i.e. “sort” the data.

Usage

NameOfDataset %>%
  arrange(columnName1)

Note: arrange(columnName1, columnName2, ...) is also possible.

NameOfDataset is the name of a dataset, like cars or airquality or KidsFeet.
%>% is the pipe operator that “pipes data” down into R commands on the next line.
arrange(...) is an R function from library(tidyverse) that sorts a data set by the values in the given column.
columnName1 is the name of the column from the dataset that you want to sort by first.
columnName2 is the name of a second column from the dataset, used to break ties among rows that share the same value of columnName1.
... implies that you can arrange by as many columns as you want.

Example Code

KidsFeet %>%
  arrange(birthmonth)

KidsFeet : KidsFeet is a dataset found in library(mosaic).
%>% : The pipe operator that will send the KidsFeet dataset down inside of the code on the following line.
arrange( : “arrange” is an R function from library(tidyverse) that sorts a data set by the values in the given column.
birthmonth : birthmonth is the name of one of the columns of the KidsFeet data set. Specifying this name will cause the data to be sorted by birthmonth from 1 to 12.
) : Always close off your functions in R with a closing parenthesis.

name	birthmonth	birthyear	length	width	sex	biggerfoot	domhand
Josh	1	88	25.2	9.8	B	L	R
Hayley	1	88	21.6	7.9	G	R	R
Lang	2	88	25.1	8.9	B	L	R
Edward	2	88	26.1	9.6	B	L	R
Abby	2	88	26.1	9.5	G	L	R
Scotty	3	88	25.7	9.7	B	R	R
Ray	3	88	24.8	8.9	B	L	R
Cam	3	88	27	9.8	B	L	R
Maggie	3	88	24.7	8.8	G	R	R
Heather	3	88	25.5	9.5	G	R	R
Leigh	3	88	24.5	8.6	G	L	R
Hannah	3	88	22.9	8.5	G	L	R
Teshanna	3	88	26	9	G	L	R
Kate	4	88	23.7	7.9	G	R	R
Dylan	4	88	23.6	9	B	R	L
Peter	4	88	24.7	8.6	B	R	L
David	5	88	24.4	8.4	B	L	R
Eleanor	5	88	23.6	9.3	G	R	R
Caitlin	6	88	23	8.8	G	L	R
Lee	6	88	26.7	9	G	L	L
Andy	6	88	24	9.2	B	R	R
Danielle	6	88	24	9.3	G	L	R
Josh	7	88	24.4	8.6	B	L	R
Glen	7	88	27.1	9.4	B	L	R
Caitlin	7	88	22.5	8.6	G	R	R
Cal	8	87	26.1	9.1	B	L	R
Dwayne	8	88	23.9	9.3	B	R	L
Damon	9	88	22.9	8.8	B	R	L
Mark	9	87	27.5	9.8	B	R	R
Laura	9	88	24	8.3	G	R	L
Erica	9	88	24.5	9	G	L	R
Alisha	9	88	24.6	8.8	G	L	R
Lars	10	87	25.4	8.8	B	L	L
Peggy	10	88	24.2	8.1	G	L	R
Julie	11	87	26	9.3	G	L	R
Mike	11	88	24.2	8.9	B	L	R
Zach	12	87	24.5	9.7	B	R	R
Caroline	12	87	24	8.7	G	R	L
David	12	87	25.5	9.5	B	R	R

KidsFeet %>%
  arrange(desc(birthmonth))

KidsFeet : KidsFeet is a dataset found in library(mosaic).
%>% : The pipe operator that will send the KidsFeet dataset down inside of the code on the following line.
arrange( : “arrange” is an R function from library(tidyverse) that sorts a data set by the values in the given column.
desc( : This causes the arranging to be done in descending order (highest to lowest).
birthmonth : birthmonth is the name of one of the columns of the KidsFeet data set. Because it is wrapped in desc(...), the data will be sorted by birthmonth from 12 down to 1.
) : Closing parenthesis for the desc(…) function.
) : Closing parenthesis for the arrange(…) function.

name	birthmonth	birthyear	length	width	sex	biggerfoot	domhand
Zach	12	87	24.5	9.7	B	R	R
Caroline	12	87	24	8.7	G	R	L
David	12	87	25.5	9.5	B	R	R
Julie	11	87	26	9.3	G	L	R
Mike	11	88	24.2	8.9	B	L	R
Lars	10	87	25.4	8.8	B	L	L
Peggy	10	88	24.2	8.1	G	L	R
Damon	9	88	22.9	8.8	B	R	L
Mark	9	87	27.5	9.8	B	R	R
Laura	9	88	24	8.3	G	R	L
Erica	9	88	24.5	9	G	L	R
Alisha	9	88	24.6	8.8	G	L	R
Cal	8	87	26.1	9.1	B	L	R
Dwayne	8	88	23.9	9.3	B	R	L
Josh	7	88	24.4	8.6	B	L	R
Glen	7	88	27.1	9.4	B	L	R
Caitlin	7	88	22.5	8.6	G	R	R
Caitlin	6	88	23	8.8	G	L	R
Lee	6	88	26.7	9	G	L	L
Andy	6	88	24	9.2	B	R	R
Danielle	6	88	24	9.3	G	L	R
David	5	88	24.4	8.4	B	L	R
Eleanor	5	88	23.6	9.3	G	R	R
Kate	4	88	23.7	7.9	G	R	R
Dylan	4	88	23.6	9	B	R	L
Peter	4	88	24.7	8.6	B	R	L
Scotty	3	88	25.7	9.7	B	R	R
Ray	3	88	24.8	8.9	B	L	R
Cam	3	88	27	9.8	B	L	R
Maggie	3	88	24.7	8.8	G	R	R
Heather	3	88	25.5	9.5	G	R	R
Leigh	3	88	24.5	8.6	G	L	R
Hannah	3	88	22.9	8.5	G	L	R
Teshanna	3	88	26	9	G	L	R
Lang	2	88	25.1	8.9	B	L	R
Edward	2	88	26.1	9.6	B	L	R
Abby	2	88	26.1	9.5	G	L	R
Josh	1	88	25.2	9.8	B	L	R
Hayley	1	88	21.6	7.9	G	R	R
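The note in the Usage section says that arrange(columnName1, columnName2, ...) is also possible. Here is a minimal sketch of that form, assuming library(dplyr) is installed and using the built-in mtcars dataset. The name sorted is made up.

```r
library(dplyr)

sorted <- mtcars %>%
  arrange(cyl, desc(mpg))   # sort by cyl (low to high), then by mpg (high to low) within each cyl

head(sorted[, c("cyl", "mpg")])
```

The first column listed does the main sorting, and each later column is only used to order rows that are tied on the columns before it.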

pander( )

Makes output of most commands “beautiful”.

Usage

library(pander) then…

pander(someCode)

someCode %>%

pander( )

Note: pander(stuff, caption="Some useful caption", ...) is also possible.

someCode is exactly that, some coding you have done that creates output that you want displayed nicely.
%>% is the pipe operator that “pipes data” down into R commands on the next line.
pander(...) is an R function from library(pander) that makes most R output look nice.
... other useful commands like split.table=Inf.

Example Code

pander(table(KidsFeet$sex, KidsFeet$birthmonth), caption = "Counts of Birthdays by Month")

pander( : pander is an R function that makes output look nice.
table(KidsFeet$sex, KidsFeet$birthmonth), : Code that makes a table of how many boys and girls were born in each month of the year.
caption = "Counts of Birthdays by Month" : The caption = " " command is very useful for giving your output a small title.
) : Always close off your functions in R with a closing parenthesis.

Counts of Birthdays by Month
	1	2	3	4	5	6	7	8	9	10	11	12
B	1	2	3	2	1	1	2	2	2	1	1	2
G	1	1	5	1	1	3	1	0	3	1	1	1

KidsFeet %>%
  group_by(sex) %>%
  summarise(aveLength = mean(length),
            sdLength = sd(length),
            sampleSize = n()) %>%
  pander(caption = "Doesn't that look nice?")

KidsFeet : KidsFeet is a dataset found in library(mosaic).
%>% : The pipe operator that will send the KidsFeet dataset down inside of the code on the following line.
group_by( : “group_by” is a function from library(tidyverse) that allows us to split the dataset up into “little groups” according to the column specified.
sex : “sex” is a column from the KidsFeet dataset that records the gender of each child.
) : Always close off your functions in R with a closing parenthesis.
%>% : The pipe operator that will send the KidsFeet dataset, now grouped by sex, down inside of the code on the following line.
summarise( : “summarise” is a function from library(tidyverse) that allows us to compute numerical summaries on data.
aveLength : A name we came up with that will store the results of the numerical summary.
= mean(length), : This computes the mean(...) of the length column within each group.
sdLength : A name we came up with that will store the results of the numerical summary.
= sd(length), : This computes the sd(...) of the length column within each group.
sampleSize : A name we came up with that will store the results of the numerical summary.
= n( ) : This computes n( ), the sample size, by counting the rows in each group.
) : Always close off your functions in R with a closing parenthesis.
%>% : The pipe operator that will send the table of summaries down inside of the code on the following line.
pander( : The pander function will make the output of the above code look nice.
caption = "Doesn't that look nice?" : The caption that will be shown as a small title above the table.
) : Always close off your functions in R with a closing parenthesis.

Doesn’t that look nice?
sex	aveLength	sdLength	sampleSize
B	25.11	1.217	20
G	24.32	1.33	19

What it looks like if you don't pander:

# A tibble: 2 x 4
  sex   aveLength sdLength sampleSize
  <fct>     <dbl>    <dbl>      <int>
1 B       25.105   1.21676         20
2 G       24.3211  1.33024         19

R Markdown

Think of an R Markdown File, or Rmd for short, as a command center. You write commands, then Knit the file, and an html output file is created according to your commands.

Hide Cheatsheet

The above tabs were created with the code:

## {.tabset .tabset-pills .tabset-fade}

### Hide Cheatsheet

### Show

Text and image were placed here...

Show

Carefully read through all parts of this image to learn…

Creating Links

To make a link use the code [Name of Link](addressForLink).

Linking to parts of your textbook:

[Numerical Summaries](NumericalSummaries.html) becomes Numerical Summaries
[Boxplots](GraphicalSummaries.html#boxplots) becomes Boxplots

Linking to outside resources:

[R Colors](http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf) becomes R Colors

Creating Headers

There are six available sizes of headings you can use in an Rmd file (left in image) that show up as shown below (right in image).

Emphasizing Words

To italicize a word use the asterisk (Shift 8): *italicize*. To bold a word use the double asterisk: **bold**. The backtick can be used to highlight words by placing backticks on each side of a word: `highlight`.

Lists

Simple Lists

To achieve the result:

* This is the first item.
* This is the second.
* This is the third.

Use the code:

To achieve the result:
  
* This is the first item.
* This is the second.
* This is the third.

Numbered Lists

To achieve the result:

1. This is the first item.
2. This is the second.
3. This is the third.

Use the code:

To achieve the result:
  
1. This is the first item.
2. This is the second.
3. This is the third.

Lettered Lists

To achieve the result:

A) This is the first item.
B) This is the second.
C) This is the third.

Use the code:

To achieve the result:
  
A) This is the first item.
B) This is the second.
C) This is the third.

Nested Lists

To achieve the result:

1. What is $2+2$?
    a. 4
    b. 8
2. What is $3\times5$?
    a. 14
    b. 15

Use the code:

1. What is $2+2$?
    a. **4**
    
    b. 8
  
2. What is $3\times5$?
    a. 14
    
    b. **15**

Tables

There are many ways to make tables in R Markdown. Here is a simple way to make a “pipe” table.

| Name          | Age           | Gender       | 
|---------------|---------------|--------------|
| Jill          | 8             |  Female      |
| Jack          | 9             |  Male        |

Name	Age	Gender
Jill	8	Female
Jack	9	Male

Insert a Picture

To add a picture to your document, say some notes you took down on paper from class,

Use the code: ![](./Images/insertPictureNotes.jpg) to get…

Themes

Notice in the YAML (at the top of the RMD file) there is a line that reads:

“theme: cerulean”

Other possible themes are

“default”, “cerulean”, “journal”, “flatly”, “readable”, “spacelab”, “united”, and “cosmo”.

You can also change the highlighting by adding the line “highlight: tango” to the YAML as follows.

---
title: "Markdown Hints"
output: 
  html_document:
    theme: cerulean
    highlight: tango
---

Other highlighting options are

“default”, “tango”, “pygments”, “kate”, “monochrome”, “espresso”, “zenburn”, “haddock”, and “textmate”.

More Information

Go to the rmarkdown.rstudio.com website for more information on how to use R Markdown.

LaTeX

LaTeX is used in Rmd files to write math equations.

Use single dollar signs, as in $x=5$, to write $x=5$, or $z=\frac{x-\mu}{\sigma}$ to write $z=\frac{x-\mu}{\sigma}$. For a nicely centered equation, put double dollar signs $$ $$ on separate lines:

$$
  z = \frac{\bar{x}-\mu}{\frac{\sigma}{\sqrt{n}}}
$$

to get \[ z = \frac{\bar{x}-\mu}{\frac{\sigma}{\sqrt{n}}} \]

$$
  H_0: \mu_1 = \mu_2
$$
$$ 
  H_a: \mu_1 \neq \mu_2
$$

to get \[ H_0: \mu_1 = \mu_2 \] \[ H_a: \mu_1 \neq \mu_2 \]

Symbol list:

Symbol	LaTeX Math Code
$\alpha$	$\alpha$
$\beta$	$\beta$
$\sigma$	$\sigma$
$\epsilon$	$\epsilon$
$\bar{x}$	$\bar{x}$
$\hat{Y}$	$\hat{Y}$
$=$	$=$
$\ne$	$\ne$ or $\neq$
$>$	$>$
$<$	$<$
$\ge$	$\ge$
$\le$	$\le$
$\{ \}$	$\{ \}$
$\text{Type just text}$	$\text{Type just text}$
$\overbrace{Y_i}^\text{label}$	$\overbrace{Y_i}^\text{label}$
$\underbrace{Y_i}_\text{label}$	$\underbrace{Y_i}_\text{label}$
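
Several of these symbols can be combined in one centered equation. The following is an illustrative sketch only: the regression-style model and the labels "response" and "predictor" are assumptions chosen to exercise the symbols, not an equation taken from this page.

```markdown
$$
  \underbrace{Y_i}_\text{response} = \beta_0 + \beta_1 \underbrace{X_i}_\text{predictor} + \epsilon_i \quad \text{where } \epsilon_i \sim N(0, \sigma^2)
$$
```

Here `\underbrace` attaches a label beneath a term, `\text` switches to plain text inside the equation, and `\quad` adds horizontal space before the trailing condition.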

Resources

The R for Data Science Book is used for data wrangling and visualization.

The Data Wrangling and Visualization website is a great resource for R.

Data Intuition and Insight is an introduction to data science using visualization and statistical inference.

Supplemental Readings - covers statistical concepts
MATH 221 - Intro to Stats

R Documentation

Decision	\(H_0\) True	\(H_0\) False
Reject \(H_0\)	Type I Error	Correct Decision
Fail to Reject \(H_0\)	Correct Decision	Type II Error

Table of Contents

Response Variables

One Quantitative Response Variable Y

Data Summaries

Numerical

Graphics

Tests

Quantitative Y | Categorical X (2 Groups)

Data Summaries

Numerical

Graphics

Tests

Quantitative Y | Categorical X (3+ Groups)

Data Summaries

Numerical

Graphics

Tests

Quantitative Y | Quantitative X

Data Summaries

Numerical

Graphics

Tests

Quantitative Y | Multiple X

Data Summaries

Numerical

Graphics

Tests

Binomial Y | Quantitative X

Data Summaries

Numerical

Graphics

Tests

Binomial Y | Multiple X

Data Summaries

Numerical

Graphics

Tests

Categorical Y | Categorical X

Data Summaries

Numerical

Graphics

Tests

Making Inferences

Hypothesis Testing

Managing Decision Errors

Type I Error (Significance Level, Confidence and α)

Type II Errors (β, and Power)

Sufficient Evidence

Evidence not Proof

Probability and Odds

Probability

Odds

Calculating the p-Value

Methods

Parametric Methods

Parametric Tests

Nonparametric Methods

Nonparametric Tests

R Tools

A Quote to Remember

What is R?

Get Started with R

R Cheatsheets

R Notes

R Commands

? The Help Command

$ The Selection Operator

<- The Assignment Operator

c( ) The Combine Function

table( )

filter( )

select( )

%>% The Pipe Operator

summarise( ) and group_by( )

mutate( )

arrange( )

pander( )

R Markdown

`?` The Help Command

`$` The Selection Operator

`<-` The Assignment Operator