Chapter 1 Introductory Statistics in R

Before we begin our dive into Statistical Modeling, we first review content from your introductory statistics course and introduce fundamental elements of the R programming language (R Core Team, 2019). The learning objectives of this unit include:

  • Review of the fundamental concepts of populations, samples, parameters and statistics.
  • Review of exploratory and descriptive statistics: numerical quantities and graphics.
  • An introduction and overview of the R software for statistical computing.
  • Basic statistical inference through a two-sample t-test and confidence interval.

1.1 Goals of a statistical analysis

The language R is our tool to facilitate investigations into data and the process of making sense of it. This is what the science of statistics is all about.

Statistics can be broadly defined as the study of variation. We collect sample data from a parent population of interest typically because, even though we are ultimately interested in characteristics of the population, it is infeasible to collect data from every member of the population. Good statistical practice demands that we collect sample data from the population in such a manner as to ensure that the sample is representative; i.e., that it does not present a systematic bias in its representation.

There are two issues that arise with respect to variation in such data:

  1. Measurements of the same attribute will vary from sample member to sample member. This is natural and is to be expected. This random (or stochastic) component of a sample is typically modeled through probabilistic assumptions (think the Normal distribution from your Intro Stat course).

  2. Any summary characteristic computed from the sample (known as a sample statistic) will only be an estimate of the corresponding summary characteristic for the population as a whole (known as a population parameter), since only a small subset of the population is used in its calculation. Thus, sample statistics will vary from their corresponding population parameters. In fact, different samples from the same population will produce different estimates of the same parameter, so sample statistics also vary from sample to sample.

This variation gives rise to the concept of uncertainty in any findings that we make about a population based on sample data. A meaningful statistical analysis helps us quantify this uncertainty.

The goals of a statistical analysis typically are

  • to make sense of data in the face of uncertainty.
  • to meaningfully describe the patterns of variation in collected sample data, and use this information to make reliable inferences/generalizations about the parent population.

1.2 Before you begin an analysis

The practice of statistics starts with an underlying problem or question. This will be a question with context. Some examples may include:

  • Are fish sensitive to Ultraviolet light?
  • What socio-economic variables and demographics predict college graduation rates?
  • Can we determine if there truly is a home field advantage in sports?
  • Does a new experimental drug improve over an existing drug?

From there, data are collected through an experiment, observational study, or survey. You then proceed with a data analysis and finish with statistical and contextual conclusions. Unfortunately, it is common for inexperienced statisticians to start with a complex analysis without ever considering the objectives of the study, or whether the data are appropriate for the chosen analysis and whether the underlying question can be answered.

We recommend you follow the guidance (outlined below) provided in Chapter 1 of the text by Faraway (2004) – Look before you leap!

Problem formulation - To perform statistical analysis correctly, you should:

  1. Understand the background. Statisticians often work in collaboration with researchers in other disciplines and need to understand something about the subject area. You should regard this as an opportunity to learn something new. In essence, a Statistician must be a renaissance (wo)man. Some level of elementary expertise is needed in the subject area of study.

  2. Understand the objectives. Again, often you will be working with a collaborator who may not be clear about what the objectives are. Beware of conducting “fishing expeditions”: if you look hard enough, you’ll almost always find something, but that something may just be a coincidence of no consequence. Also, sometimes an analysis far more complicated than what is really needed is performed. You may find sometimes that simple descriptive statistics are all that are really required. Even if a more complex analysis is necessary, simple descriptive statistics can provide a valuable supplement.

  3. Translate the problem into statistical terms. This can be a challenging step and oftentimes requires creativity on the part of the statistician (in other words, not every problem fits into a neat little box like in an intro stats course!). Care must be taken at this step.

Data Collection - It is also vitally important to understand how the data was collected. Consider:

  • Are the data observational or experimental? Are the data a “sample of convenience,” or were they obtained via a designed experiment or random sampling? How the data were collected has a crucial impact on what conclusions can be meaningfully made.
  • Is there “non-response?” The data you don’t see may be just as important as the data you do see! This gets to the issue of a sample being representative of the intended target population.
  • Are there missing values? This is a common problem that is troublesome and time consuming to deal with. Data that are “missing” according to some pattern are particularly troublesome.
  • How are the data coded? In particular, how are the qualitative or categorical variables represented? This has important impact on the implementation of statistical methods in R.
  • What are the units of measurement? Sometimes data are collected or represented using far more digits than are necessary. Consider rounding if this will help with the interpretation. For example, starting an R session with the command options(digits=4) will display output to four significant digits (a brief example follows this list).
  • Beware of data entry errors. This problem is all too common, and is almost a certainty in any real dataset of at least moderate size. Perform some data sanity checks.
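
To illustrate the digits option mentioned above, here is a brief sketch of its effect; the effect is on how values are printed, not on how they are stored.

options(digits = 4)   # display four significant digits in printed output
pi                    # printed as 3.142 rather than 3.141593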

1.3 Data frames

A statistical dataset in R is known as a data frame, or data.frame, and is a rectangular array of information (essentially a table or matrix). Rows of a data frame constitute different observations in the data, and the columns constitute different variables measured.

Imagine we record the name, height and age of all the students in the class: the first column records each student's name, the second column their height, and the third column their age. Each row corresponds to a particular student and each column to a variable.
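
As a small illustration (with entirely made-up values), such a data frame could be entered directly in R:

# A hypothetical class roster: one row per student, one column per variable
students <- data.frame(Name   = c("Ada", "Ben", "Cara"),
                       Height = c(165, 180, 172),   # height in centimeters
                       Age    = c(19, 20, 21))
students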

1.3.1 Built-in data

R packages come bundled with many pre-entered datasets for exploration purposes. For example, here is the data frame stackloss from the datasets package (which loads automatically upon starting R):

stackloss
##    Air.Flow Water.Temp Acid.Conc. stack.loss
## 1        80         27         89         42
## 2        80         27         88         37
## 3        75         25         90         37
## 4        62         24         87         28
## 5        62         22         87         18
## 6        62         23         87         18
## 7        62         24         93         19
## 8        62         24         93         20
## 9        58         23         87         15
## 10       58         18         80         14
## 11       58         18         89         14
## 12       58         17         88         13
## 13       58         18         82         11
## 14       58         19         93         12
## 15       50         18         89          8
## 16       50         18         86          7
## 17       50         19         72          8
## 18       50         19         79          8
## 19       50         20         80          9
## 20       56         20         82         15
## 21       70         20         91         15

This data frame contains 21 observations on 4 different variables. Note the listing of the variable names; this is known as the header.

Another famous dataset is Edgar Anderson's Iris data:

head(iris, n=10)
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1           5.1         3.5          1.4         0.2  setosa
## 2           4.9         3.0          1.4         0.2  setosa
## 3           4.7         3.2          1.3         0.2  setosa
## 4           4.6         3.1          1.5         0.2  setosa
## 5           5.0         3.6          1.4         0.2  setosa
## 6           5.4         3.9          1.7         0.4  setosa
## 7           4.6         3.4          1.4         0.3  setosa
## 8           5.0         3.4          1.5         0.2  setosa
## 9           4.4         2.9          1.4         0.2  setosa
## 10          4.9         3.1          1.5         0.1  setosa
tail(iris, n=7)
##     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
## 144          6.8         3.2          5.9         2.3 virginica
## 145          6.7         3.3          5.7         2.5 virginica
## 146          6.7         3.0          5.2         2.3 virginica
## 147          6.3         2.5          5.0         1.9 virginica
## 148          6.5         3.0          5.2         2.0 virginica
## 149          6.2         3.4          5.4         2.3 virginica
## 150          5.9         3.0          5.1         1.8 virginica

Note the head() function prints the first handful of observations of the dataset, here the first 10 rows. The tail() function prints the last \(n\) rows (here specified to be 7). The dataset is actually much bigger, which can be determined with the dim() function,

dim(iris)
## [1] 150   5

There are 150 observations on 5 variables (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species).

1.3.2 Types of Data

You’ll notice in the above examples we see a mix of numeric and character values in the data frames. In R, every object has a type, or class. To discover the type of a specific variable, use the class function,

class(iris)
## [1] "data.frame"

We see that the class of the object iris is a data.frame consisting of 150 rows and 5 columns. You should also note that the first four columns are numeric while the last column appears to be characters. We can explore the class of each variable by specifying the name of the column of the dataset using the $ notation. First, let’s get the list of variables in the dataset using the names() function.

names(iris)
## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"
class(iris$Sepal.Length)
## [1] "numeric"
class(iris$Species)
## [1] "factor"

The class of Sepal.Length is numeric, which should not be too surprising given the values are numbers. The class of Species, on the other hand, is factor. In R, a factor is a categorical variable; here it happens to be labeled with the scientific species name, but R treats it as a category.

Note: The factor class can be quite important when performing statistics. Whether a variable is of class factor or numeric can cause statistical methods to behave differently.
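
As a quick illustration (with hypothetical values) of why the distinction matters, compare how R summarizes the same values stored as a numeric vector versus as a factor:

grp <- c(1, 2, 2, 3)
summary(grp)           # numeric: min, quartiles, mean, max
summary(factor(grp))   # factor: a count of each category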

1.3.3 Importing datasets into R

In practice you will often encounter datasets from other external sources. These are most commonly created or provided in formats that can be read in Excel. The best formats for importing data files into R are space-delimited text (typically with the file extension .txt), tab-delimited text (file extension .txt or .tsv), or comma-separated values (file extension .csv). R can handle other formats, but these are the easiest to work with.

There are multiple functions to import a data file in R. The tools we will outline here include many functions in the readr package (Wickham and Hester, 2021), part of the tidyverse. We recommend reading through the Vignette by running the command vignette("readr") in the console.

1.3.3.1 read_table() – for Space delimited files

One common tool is using the read_table() command. Below are two examples:

  1. Read a file directly from the web. The below code imports the file univadmissions.txt (details about the data are below) from a data repository into the current R session, and assigns the name uadata to the data frame object:
site <- "https://tjfisher19.github.io/introStatModeling/data/univadmissions.txt"
class(site)
## [1] "character"
options(readr.show_col_types = FALSE)
uadata <- read_table(site, col_types = cols())
head(uadata)
## # A tibble: 6 x 5
##      id gpa.endyr1 hs.pct   act  year
##   <dbl>      <dbl>  <dbl> <dbl> <dbl>
## 1     1       0.98     61    20  1996
## 2     2       1.13     84    20  1996
## 3     3       1.25     74    19  1996
## 4     4       1.32     95    23  1996
## 5     5       1.48     77    28  1996
## 6     6       1.57     47    23  1996

A few notes on the above code. In the first line, we create a string of characters (that happens to be a web address) and assign it to the R object site using the <- notation. You should note that the class of the object site is character. The third line, options(readr.show_col_types = FALSE), is entirely optional and tells the readr package to suppress some extra output. The key step is applying the function read_table() to the file specified in the site variable; we tell R to make a best guess at the variable types with the col_types = cols() statement (alternatively, we can specify all the variable types; see help(cols)). Lastly, the output of the function is assigned to uadata, again using the <- notation. We will use this same data again below.

  2. Another option is to download the file to your computer, then read it. This is typically what you will do for homework and class examples. You could go to the URL above and save the data file to your default working directory on your computer. After saving it into a folder (or directory), you can read the file with code similar to:
uadata <- read_table("univadmissions.txt", col_types = cols())

If you save the file to a directory on your computer other than the local directory (where your R code is stored), you must provide the path in the read_table() command:

uadata <- read_table("c:/My Documents/My STA363 Files/univadmissions.txt", col_types = cols())

The function read_table() automatically converts a space-delimited text file to a data frame in R. By default (the col_names = TRUE argument), read_table() treats the top line of the file as the variable names rather than data values.

1.3.3.2 read_csv() – for Comma Separated Value files

Reading in a Comma Separated Value (CSV) file is very similar to the above examples. This format is becoming the standard for transferring data files. An example of some non-executed code is provided below:

uadata_csv <- read_csv("c:/My Documents/My STA363 Files/univadmissions.csv", col_types = cols())

Note: the file univadmissions.csv does not exist but this code has been included for example purposes.

1.3.3.3 load() – Loading an existing R data.frame

If data has already been input and processed into R as a data.frame, it can be saved in a format designed for R using the save() function and then can be reloaded using load().

load("someData.RData")

1.4 Referencing data from inside a data frame

We saw above that we can access a specific variable within a dataset using the $ notation. We can also access specific observations in the dataset based on rules we may wish to implement. As a working example we will consider the university admissions data we read in above.

Example: University admissions (originally in Kutner et al. (2004)).

The director of admissions at a state university wanted to determine how accurately students’ grade point averages at the end of their freshman year could be predicted by entrance test scores and high school class rank. The academic years covered 1996 through 2000. This example uses the univadmissions.txt dataset in the repository. The variables are as follows:

  • id – Identification number
  • gpa.endyr1 – Grade point average following freshman year
  • hs.pct – High school class rank (as a percentile)
  • act – ACT entrance exam score
  • year – Calendar year the freshman entered university

This is the first dataset we input above, in the uadata data frame.

The tidyverse package (Wickham, 2017) is actually a wrapper for many R packages. One of the key packages we use for data processing is the dplyr package (Wickham, François, et al., 2021). It includes many functions that are similar to SQL (for those who know it). These tools tend to be very readable and easy to understand. First we will load the package and then take a glimpse of the dataset with which we are working:

library(tidyverse)   # Load the necessary package
glimpse(uadata)      # Take a glimpse at the data
## Rows: 705
## Columns: 5
## $ id         <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, ~
## $ gpa.endyr1 <dbl> 0.98, 1.13, 1.25, 1.32, 1.48, 1.57, 1.67, 1.73, 1.75, 1.76,~
## $ hs.pct     <dbl> 61, 84, 74, 95, 77, 47, 67, 69, 83, 72, 83, 66, 76, 88, 46,~
## $ act        <dbl> 20, 20, 19, 23, 28, 23, 28, 20, 22, 24, 27, 21, 23, 27, 21,~
## $ year       <dbl> 1996, 1996, 1996, 1996, 1996, 1996, 1996, 1996, 1996, 1996,~

A key feature of the tidyverse is the pipe command, %>%. As an example, consider the case of extracting students with a GPA greater than 3.9:

  • Take the data frame uadata (read in above) and pipe it (send it) to a function that will filter
  • Then filter it based on some criteria, say gpa.endyr1 > 3.9.
uadata %>% 
  filter(gpa.endyr1 > 3.9)
## # A tibble: 29 x 5
##       id gpa.endyr1 hs.pct   act  year
##    <dbl>      <dbl>  <dbl> <dbl> <dbl>
##  1   138       3.91     99    28  1996
##  2   139       3.92     93    28  1996
##  3   140       4        99    25  1996
##  4   141       4        92    28  1996
##  5   142       4        99    33  1996
##  6   266       3.92     85    15  1997
##  7   267       3.92     63    23  1997
##  8   268       3.92     98    23  1997
##  9   269       3.93     99    29  1997
## 10   270       3.96     99    29  1997
## # ... with 19 more rows

We can also use dplyr commands for selecting specific variables using the select function. Below we select the ACT scores for students who have a first year GPA greater than 3.9 and with a graduating year of 1998.

uadata %>% 
  filter(gpa.endyr1 > 3.9, 
         year==1998) %>%
  select(act)
## # A tibble: 6 x 1
##     act
##   <dbl>
## 1    28
## 2    31
## 3    31
## 4    21
## 5    27
## 6    32

Note in the above, we take the uadata and pipe it to a filter, which selects only those observations with gpa.endyr1 greater than 3.9 and year equal (==) to 1998, then we pipe that subset to the select function that only returns the variable act.

IMPORTANT NOTE - R is an evolving language and as such occasionally has some wonky behavior. In fact, there are multiple versions of both the filter and select functions in R (for example, the stats package also defines a filter function). If you ever use one and receive a strange error, first check for a typo. If the error still occurs, it is likely that R is using a different version of filter or select. You can always force R to use the dplyr version by specifying the package, as in the example below:

uadata %>% 
  dplyr::filter(gpa.endyr1 > 3.9, 
                year==1998) %>%
  dplyr::select(act)

1.5 Missing values and computer arithmetic in R

Sometimes data from an individual is lost or not available. R indicates such occurrences using the value NA. Missing values are important to account for, because we cannot simply ignore them when performing functions that involve all values. For example, we cannot find the mean of the elements of a data vector when some are set to NA. Consider the simple case of a vector of 6 values with one NA.

x <- c(14, 18, 12, NA, 22, 16)
class(x)
## [1] "numeric"
mean(x)
## [1] NA

Here, the object x is a vector of class numeric. We will not be working with vectors much in this class (a column within a data frame can be considered a vector) but we do so here to demonstrate functionality.

Since one of the observations is missing, NA, the mean of the vector is also “not available” or NA.
This is not incorrect: logically speaking there is no way to know the mean of these observations when one of them is missing. However, in many cases we may still wish to calculate the mean of the non-missing observations.

Many R functions have a na.rm= logical argument, which is set to FALSE by default. It basically asks if you want to remove the missing values before applying the function. It can be set to TRUE in order to remove the NA terms. For example,

mean(x, na.rm=TRUE)    # mean of the non-missing elements
## [1] 16.4

The number system used by R has the following ways of accommodating results of calculations on missing values or extreme values:

  • NA: Not Available. Any data value, numeric or not, can be NA. This is what you use for missing data. Always use NA for this purpose.

  • NaN: Not a Number. This is a special value that only numeric variables can take. It is the result of an undefined operation such as 0/0.

  • Inf: Infinity. Numeric variables can also take the values -Inf and Inf. These are produced by the low-level arithmetic of all modern computers by operations such as 1/0. (You should not think of these values as real infinities, but rather as what the correct calculation would yield if the computer could handle extremely large, or extremely small near-0, numbers; that is, results larger than the largest numbers the computer can hold, roughly \(10^{308}\).)
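
A brief demonstration of these special values (each comment notes the value R returns):

0/0       # NaN (undefined operation)
1/0       # Inf
-1/0      # -Inf
NA + 1    # NA -- missing values propagate through arithmetic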

Scientific notation. This is a shorthand way of displaying very small or very large numbers. For example, if R displays 3e-8 as a value, it really means \(3\times 10^{-8} = 0.00000003\).

In general, you should convert numbers from scientific notation into decimal notation when writing findings in your reports. For really small numbers (like 0.00000003), it will typically suffice to write “< 0.0001” instead.
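
If you prefer decimal notation in your own output, a small sketch of two built-in options:

format(3e-8, scientific = FALSE)   # returns the character string "0.00000003"
options(scipen = 999)              # discourage scientific notation in printed output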

1.6 Exploratory Data Analysis (EDA)

The material presented in this section is typically called descriptive statistics in Intro Statistics books. This is an important step that should always be performed prior to a formal analysis. It looks simple but it is vital to conducting a meaningful analysis. EDA comprises:

  • Numerical summaries (means, standard deviations, five-number summaries, correlations)
  • Graphical summaries
    • One variable: boxplots, histograms, density plots, etc.
    • Two variables: scatterplots, side-by-side boxplots, overlaid density plots
    • Many variables: scatterplot matrices, interactive graphics

We look for outliers, data-entry errors and skewness or unusual distributions using EDA. Are the data distributed as you expect? Getting data into a form suitable for analysis by cleaning out mistakes and aberrations is often time consuming. It often takes more time than the data analysis itself! In this course, data will usually be ready to analyze, but we will occasionally intermix messy data. You should realize that in practice you will rarely receive clean data.

1.6.1 Numeric Summaries

1.6.1.1 Measures of centrality and quantiles

Common descriptive measures of centrality and quantiles from a sample include

  • Sample mean – typically denoted by \(\bar{x}\), this is implemented in the function mean()
  • Sample median – this is implemented in the function median()
  • Quantiles – Quantiles, or sample percentiles, can be found using the quantile() function. Three common values are \(Q_1\) (the sample \(25^\mathrm{th}\) percentile, or first quartile), \(Q_2\) (the \(50^\mathrm{th}\) percentile, second quartile or median) and \(Q_3\) (the sample \(75^\mathrm{th}\) percentile or third quartile).
  • Min and Max – the Minimum and Maximum values of a sample can be found using the min() or max() functions.

Each of these was covered in your introductory statistics course. We refer you to that material for more details.
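
As a short sketch using the admissions data read in earlier, each of these functions can be applied to a single column via the $ notation:

mean(uadata$act)                                    # sample mean
median(uadata$act)                                  # sample median
quantile(uadata$act, probs = c(0.25, 0.5, 0.75))    # Q1, Q2 (median), Q3
min(uadata$act)                                     # minimum
max(uadata$act)                                     # maximum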

1.6.1.2 Measuring variance and variability

Earlier we defined statistics as the study of variation in data. The truth is that everything naturally varies:

  • If you measure the same characteristic on two different individuals, you would potentially get two different answers (subject variability).
  • If you measure the same characteristic twice on the same individual, you would potentially get two different answers (measurement error).
  • If you measure the same characteristic on the same individual but at different times, you would potentially get two different answers (temporal variability).

Because everything varies, determining that things vary is simply not very interesting (what a tongue twister!). What we need is a way of discriminating between variation that is scientifically interesting and variation that just reflects the natural background fluctuations that are always there. This is what the science of statistics is for.

Knowing the amount of variation that we would expect to occur just by chance alone when nothing scientifically interesting is going on is key. If we then observe bigger differences than we would expect by chance alone, we say the result is statistically significant. If instead we observe differences no larger than we would expect by chance alone, we say the result is not statistically significant.

Although covered in your introductory statistics course, due to importance we review measuring the variance of a sample.

Variance

A variance is a numeric descriptor of the variation of a quantity. Variances will serve as the foundation for much of what we use in making statistical inferences in this course. In fact, a standard method of data analysis for many of the problems we will encounter is known as analysis of variance (abbreviated ANOVA).

You were (or should have been) exposed to variance in your introductory statistics course. What is crucial now is that you know what you are finding the variance of; i.e., what is the “quantity” we referred to in the preceding paragraph? Consider the following example:

  • Example. Suppose you collect a simple random sample of n fasting blood glucose measurements from a population of diabetic patients. The measurements are denoted \(x_1\), \(x_2\), \(\ldots\), \(x_n\). What are some of the different “variances” that one might encounter in this scenario?

    1. The \(n\) sampled measurements vary around their sample mean \(\bar{x}\). This is called the sample variance (traditionally denoted \(s^2\)). In R, the function is var().
    2. All fasting blood glucose values in the population vary around the true mean of the population \(\mu\). This is called the population variance (usually denoted \(\sigma^2\)). Note that \(\sigma^2\) cannot be calculated without a census of the entire population, so \(s^2\) serves as an estimate of \(\sigma^2\).
    3. \(\bar{x}\) is only a sample estimate of \(\mu\), so we should expect that it will vary depending on which \(n\) individual diabetic patients get randomly chosen into our sample (i.e., with a different sample we will calculate a different sample mean). This is called the variance of \(\bar{x}\), and is denoted \(\sigma_{\bar{x}}^2\). This variance, unlike the sample or population variance, describes how sample estimates of \(\mu\) vary around \(\mu\) itself. We will need to estimate \(\sigma_{\bar{x}}^2\), since it cannot be determined without a full survey of the population. The estimate of \(\sigma_{\bar{x}}^2\) is denoted \(s_{\bar{x}}^2\).

The variances cited above are each very different things. We could dream up even more if we wanted to: for example, each different possible random sample of \(n\) patients will produce a different sample variance \(s^2\). Since these would all be estimates of the population variance \(\sigma^2\), we could even find the variance of the sample variances!

Some formulas:

\[\textbf{Sample Variance: } s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2, \textrm{ for sample size } n\]

\[\textbf{Estimated Variance of } \bar{x}: s^2_{\bar{x}} = \frac{s^2}{n}\]

Three important notes:

  1. The general form of any variance is a sum of squared deviations, divided by a quantity related to the number of independent elements in the sum known as degrees of freedom. So in general,

\[\textrm{Variance} = \frac{\textrm{sum of squares}}{\textrm{degrees of freedom}} = \frac{SS}{\textrm{df}}\]

  2. The formula for the population variance \(\sigma^2\) is only given for completeness. We will never use it, since \(\sigma^2\) cannot be calculated without a population census.

  3. When considering how a statistic (like \(\bar{x}\)) varies around the parameter it is estimating (\(\mu\)), it makes sense that the more data you have, the better it should perform as an estimator. This is precisely why the variance of \(\bar{x}\) is inversely related to the sample size. As \(n\) increases, the variance of the estimate decreases; i.e., the estimate will be closer to the thing it is estimating. This phenomenon is known as the law of large numbers.

Standard deviations and standard errors

While variances are the usual items of interest for statistical testing, often we will find that we would like a description of variation that is in the original units of the quantity being studied. Variances are always in units squared, so if we take their square roots, we are back to the original units. The square root of a variance is called a standard deviation (SD). The square root of the variance of an estimate is called the standard error (SE) of the estimate:

\[\textbf{Sample Standard Deviation: } s = \sqrt{s^2} = \sqrt{\frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2}, \textrm{ for sample size } n\]

\[\textbf{Standard error of } \bar{x}: s_{\bar{x}} = \sqrt{s^2_{\bar{x}}} = \sqrt{\frac{s^2}{n}} = \frac{s}{\sqrt{n}}\]
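
A minimal sketch of these quantities computed for the ACT scores in the uadata data frame:

n <- length(uadata$act)
var(uadata$act)              # sample variance s^2
sd(uadata$act)               # sample standard deviation s
sd(uadata$act)/sqrt(n)       # estimated standard error of the sample mean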

1.6.2 Numeric Summaries in R

In this class we will be using dplyr for data handling. We can pipe (%>%) our data frame into a summarize function to calculate numeric summaries.

uadata %>% 
  pivot_longer(everything()) %>% 
  group_by(name) %>%
  summarize(across(everything(), 
                   list(Mean=mean, SD=sd, Min=min, Median=median, Max=max) ) )
## # A tibble: 5 x 6
##   name       value_Mean value_SD value_Min value_Median value_Max
##   <chr>           <dbl>    <dbl>     <dbl>        <dbl>     <dbl>
## 1 act             24.5     4.01      13           25           35
## 2 gpa.endyr1       2.98    0.635      0.51         3.05         4
## 3 hs.pct          77.0    18.6        4           81           99
## 4 id             353     204.         1          353          705
## 5 year          1998       1.40    1996         1998         2000

For now, ignore the code that is used; we will describe the pivot_longer() and group_by() functions below. Presented here are the mean, standard deviation, minimum, median and maximum of each numeric variable in the data frame. For example, the mean freshman year-end GPA over the period 1996-2000 was 2.98; the median over the same period was 3.05, suggesting the GPA distribution is slightly skewed left (not surprising). The range of the GPA distribution is from 0.51 to 4. The median ACT score for entering freshmen was 25, and the range of ACT scores was from 13 to 35.

This display introduces a few keys points you need to be aware of:

  • We have the code spread across multiple lines with most lines ending with the piping operator %>%. This is done (and recommended) for readability. When performing an analysis, you often will revisit the code at a later time. Readable code is paramount to reproducibility.
  • Since we applied the command to the entire data frame (everything()), we also see numeric summaries of the variable id, even though it is meaningless to do so. (note: just because a computer outputs it, that doesn’t mean it’s useful)
  • If you look carefully, you’ll see that year is treated as a number (e.g., 1996, 1997, etc.). However, that doesn’t necessarily mean it should be treated as a numeric variable! It is meaningless to find numeric summaries like means or medians for the year, because it is really a qualitative (categorical) variable in the context of these data (again, just because a computer outputs it, that doesn’t mean it is contextually correct).

These are common mistakes to make. It is as simple as this: when R encounters a variable whose values are entered as numbers in a data frame, R will treat the variable as if it were numeric (whether that is contextually appropriate or not). If R encounters a variable whose values are character strings, R will treat the variable as a categorical factor. But in a case like year above, you have to explicitly tell R that the variable is really a factor since it was coded with numbers. To do this, we will mutate the variables id and year.

uadata <- uadata %>%
  mutate(id=as.factor(id),
         year=as.factor(year))

Here we tell R to take the uadata data, pipe it to the mutate function (which allows us to create and/or replace variables by changing/mutating them). We then assign the variables id and year as versions of themselves that are being treated as factors (or categories).

We can then check the type of each variable by summarizing over all the variables using the class function.

uadata %>% 
  summarize(across(everything(), class))
## # A tibble: 1 x 5
##   id     gpa.endyr1 hs.pct  act     year  
##   <chr>  <chr>      <chr>   <chr>   <chr> 
## 1 factor numeric    numeric numeric factor

Now we can check the summary values again for only the numeric variables (note the use of select(where(is.numeric))).

uadata %>% 
  select(where(is.numeric)) %>%
  pivot_longer(everything()) %>% 
  group_by(name) %>%
  summarize(across(everything(), 
                   list(Mean=mean, SD=sd, Min=min, Median=median, Max=max) ) )
## # A tibble: 3 x 6
##   name       value_Mean value_SD value_Min value_Median value_Max
##   <chr>           <dbl>    <dbl>     <dbl>        <dbl>     <dbl>
## 1 act             24.5     4.01      13           25           35
## 2 gpa.endyr1       2.98    0.635      0.51         3.05         4
## 3 hs.pct          77.0    18.6        4           81           99

For the categorical variables, the main numeric measure we can consider is the count (or proportions) of each category. To do this in R, we tell it to treat each category as a group and then count within the group. We will group_by the year and then within that year, count the number of observations. We also add a new variable (using mutate) that is the proportion.

uadata %>%
  group_by(year) %>%
  summarize(Count=n() ) %>%
  mutate(Prop = Count/sum(Count))
## # A tibble: 5 x 3
##   year  Count  Prop
##   <fct> <int> <dbl>
## 1 1996    142 0.201
## 2 1997    131 0.186
## 3 1998    154 0.218
## 4 1999    141 0.2  
## 5 2000    137 0.194

Note the power of using the tidyverse tools (Wickham, 2017). We are only scratching the surface of what can be done here. For more information, we recommend looking into the Wrangling chapters in the book R for Data Science (Wickham and Grolemund, 2017).

1.6.3 Graphical Summaries

We begin a graphical summary with a matrix of univariate and bivariate plots using the ggpairs() function in the GGally package (Schloerke et al., 2018). The results can be seen in Figure 1.1.

library(GGally)
ggpairs(uadata, columns=2:5)

Figure 1.1: Pairs plot showing the relationship between the variables gpa.endyr1, hs.pct, act and year.

Coding note: Here we specify to plot using only the variables in columns 2, 3, 4 and 5 (the 2:5). We do this because the id variable in uadata is an identifier that uniquely identifies each observation – it is not a variable of interest – so we do not want to plot it.

You will note that by default the function provides a collection of univariate and bivariate plots (Figure 1.1). We recap each of the plots provided (details on each is provided below).

  • On the diagonal of the matrix we see univariate plots – density plots for the numeric variables and a bargraph of frequencies for categorical variables.
  • In the upper right triangle of the matrix we see numeric measures of the amount of correlation between each pair of numeric variables (Pearson correlation). In the case of a numeric variable with a categorical variable, side-by-side boxplots are provided.
  • In the bottom left triangle of the matrix, scatterplots show the relationship between pairs of numeric variables. We also see histograms of each numeric variable stratified by category. Now we will discuss each of the elements in more detail.

1.6.4 Distribution of Univariate Variables

The ggpairs function provides a look at the distributional shape of each variable using what is known as a density plot (this can be considered a more modern, and less subjective, histogram). The ggpairs function builds off a function known as ggplot() from the ggplot2 package (Wickham, 2016) (included in tidyverse).

We can build density plots directly using the geom_density() function as seen in the code below. The resulting density plot is in Figure 1.2.

ggplot(uadata) + 
  geom_density(aes(x=gpa.endyr1))

Figure 1.2: Density plot showing left-skewed distribution for the gpa.endyr1 variable

You should note the + operator here plays a role similar to the %>% pipe. The trick to working with plotting on the computer is to think in terms of layers. The first line, ggplot(uadata), sets up a plot object based on the existing dataset uadata. The + says to add a layer; this layer is a density plot (geom_density) whose aesthetics include an \(x\)-axis variable, gpa.endyr1.

To generate even more complex distributional plots, we first need to perform some data processing. In the tidyverse there is a package called tidyr (Wickham, 2021) built around two main functions: pivot_longer() and pivot_wider(). The purpose of the pivot_longer() function is to collect variables in a dataset and create a tall (or longer) dataset (this is typically known as going from wide-to-tall or wide-to-long format). We specify the variables (or columns) we want to pivot (here the c(gpa.endyr1, hs.pct, act) part of the code), the name of the new variable storing the column names (here called var_names), and the name of the variable that will store the values (here called value). The pivot_wider() function essentially does the opposite, turning a tall data frame into a wide format.

uadata.tall <- uadata %>%
  pivot_longer(c(gpa.endyr1, hs.pct, act), names_to="var_names", values_to="value" )
head(uadata)
## # A tibble: 6 x 5
##   id    gpa.endyr1 hs.pct   act year 
##   <fct>      <dbl>  <dbl> <dbl> <fct>
## 1 1           0.98     61    20 1996 
## 2 2           1.13     84    20 1996 
## 3 3           1.25     74    19 1996 
## 4 4           1.32     95    23 1996 
## 5 5           1.48     77    28 1996 
## 6 6           1.57     47    23 1996
head(uadata.tall)
## # A tibble: 6 x 4
##   id    year  var_names  value
##   <fct> <fct> <chr>      <dbl>
## 1 1     1996  gpa.endyr1  0.98
## 2 1     1996  hs.pct     61   
## 3 1     1996  act        20   
## 4 2     1996  gpa.endyr1  1.13
## 5 2     1996  hs.pct     84   
## 6 2     1996  act        20
dim(uadata)
## [1] 705   5
dim(uadata.tall)
## [1] 2115    4

Take a minute and step through the above code and output to understand what the code did. The data are now in a tall, or long, format. We can now utilize the ggplot2 package for more advanced plotting. Here we want to draw a density plot of value in the dataset uadata.tall, but with a separate panel for each level of var_names (act, gpa.endyr1, and hs.pct). Figure 1.3 provides the result of the below code.

ggplot(uadata.tall) +
  geom_density(aes(x=value)) + 
  facet_grid(.~var_names, scales="free_x") + 
  theme_bw()

Figure 1.3: Side-by-side density plots showing the relative distributions of act, gpa.endyr1 and hs.pct.

Notes: The facet_grid option tells ggplot2 to draw a separate density plot for each observed value of var_names. The theme_bw() specifies the plotting style (there are many others! We will typically use theme_bw(), theme_classic() or the default behavior, which corresponds to theme_grey()). The scales="free_x" option tells the plot that each \(x\)-axis is on a different scale. Feel free to modify the code to see how the plot changes!

We can see that GPA and high school percentile rankings are skewed left, and that ACT scores are reasonably symmetrically distributed.

Alternatively, we could build histograms to make the comparison.

ggplot(uadata.tall) +
  geom_histogram(aes(x=value), binwidth = 2.5) + 
  facet_grid(var_names~., scales="free_y") + 
  theme_bw()

Stacked histograms showing the relative distributions of act, gpa.endyr1 and hs.pct.

Note the similarities and differences between Figure 1.3 and the stacked histograms above. Depending on whether we facet on the \(y\)-axis or \(x\)-axis, we can get different appearances.

Bar Graphs

The best visual tool for simple counts is the bar graph (note: you should never use a pie chart). The below code generates a simple (and fairly boring) bar graph (Figure 1.4) displaying the number of observations for each year in the uadata.

ggplot(uadata) + 
  geom_bar(aes(x=year) )

Figure 1.4: Bar graph displaying the number of observations in the uadata per year.

We see here that nothing too interesting is happening. A fairly uniform distribution of counts is seen across years.

1.6.5 Descriptive statistics and visualizations by levels of a factor variable

While the above summaries are informative, they overlook the fact that we have several years of data and that it may be more interesting to see how things might be changing year to year. This is why treating year as a factor is important, because it enables us to split out descriptions of the numeric variables by levels of the factor.

One of the simplest and best visualization tools for comparing distributions is the side-by-side boxplot, which displays information about quartiles and potential outliers. The below code generates Figure 1.5.

ggplot(uadata) +
  geom_boxplot(aes(x=year, y=act) ) +
  labs(title="ACT scores of Incoming Freshmen",
       x="Year",
       y="Composite ACT Score") + 
  theme_classic()

Figure 1.5: Side-by-side Box-whiskers plots showing the distribution of ACT scores in the uadata per year.

We see here the code is similar to the examples above except now we call the geom_boxplot function. Also note we are using theme_classic() which has some subtle differences compared to theme_bw() used above. Lastly, we include proper labels and titles on our plot here for a more professional appearance.

Next we construct side-by-side boxplots by year for each of the three numeric variables in the data frame. We utilize the taller or longer format of the admissions data in uadata.tall and the resulting plot is in Figure 1.6.

ggplot(uadata.tall) + 
  geom_boxplot(aes(x=year, y=value) ) +
  facet_grid(var_names~., scales="free_y") + 
  theme_dark()

Figure 1.6: Side-by-side Box-whiskers plots showing the distributions of the variables act, gpa.endyr1 and hs.pct for each year in the uadata dataset.

Note here that the placement of var_names in the facet_grid statement has changed compared to the previous side-by-side density plots (the code used to generate Figure 1.3); now the plots are stacked vertically. Also note another theme option, theme_dark() (obviously you will NOT want to use this theme if planning to print the document).

This is a nice compact visual description of the changes in patterns of response of these three variables over time.
All three variables appear pretty stable over the years; the left skews in GPA and high school percentile rankings are still evident here (note the low outliers flagged in each year for these two variables).

We now revisit calculating summary statistics. Extensive grouped numeric summary statistics are available by using the group_by() and summarize() functions in the tidyverse. The group_by function tells R to treat all rows sharing a value of the specified variable (in this case, var_names) as a group, and the summarize function is then applied within each group. The below example is a more detailed version of the summary table from above.

summary.table <- uadata.tall %>% 
  group_by(var_names) %>%
  summarize(Mean=mean(value),
            StDev=sd(value),
            Min=min(value),
            Q1=quantile(value, prob=0.25),
            Median=median(value),
            Q3=quantile(value, prob=0.75),
            Max=max(value) )
kable(summary.table)
var_names      Mean      StDev    Min      Q1   Median     Q3   Max
act        24.54326   4.013636  13.00  22.000    25.00  28.00    35
gpa.endyr1  2.97731   0.634528   0.51   2.609     3.05   3.47     4
hs.pct     76.95319  18.633938   4.00  68.000    81.00  92.00    99

Here, we first pipe the uadata.tall into a group_by() function grouping by the var_names variable. Then we take that output and pipe it to the summarize function, which allows us to calculate various summary statistics for each of the groups. In the above we calculate the mean and standard deviation of each group, along with the 5-number summary. Lastly, we wrap the constructed table (it's technically a data frame) in the kable() function from the knitr package (literally, a knitr table) for a nice display; load the package with library(knitr) if it is not already attached.

In the next example we group by both the year and the variables of interest.

summary.table2 <- uadata.tall %>%
  group_by(var_names, year) %>%
  summarize(Mean=mean(value),
            StDev=sd(value))
kable(summary.table2)
var_names   year      Mean      StDev
act         1996  24.69014   3.885202
act         1997  24.09924   3.818689
act         1998  24.50649   4.276775
act         1999  24.83688   4.395816
act         2000  24.55475   3.609464
gpa.endyr1  1996   2.92873   0.652700
gpa.endyr1  1997   2.97366   0.633515
gpa.endyr1  1998   3.00033   0.654368
gpa.endyr1  1999   3.02599   0.600662
gpa.endyr1  2000   2.95520   0.632227
hs.pct      1996  78.29578  15.894059
hs.pct      1997  76.64122  20.196751
hs.pct      1998  74.64935  19.983173
hs.pct      1999  79.45390  17.045098
hs.pct      2000  75.87591  19.534828

1.6.6 Descriptive statistics and visualizations for two numeric variables

For a pair of numeric variables the amount of linear association between the variables can be measured with the Pearson correlation. The Pearson correlation assesses the strength and direction of any linear relationship existing between the GPA, ACT and HS rank variables (Figure 1.1).

Recall from introductory statistics that the Pearson correlation \(r\) between two variables \(x\) and \(y\) is given by the formula \[r = \frac{1}{n-1}\sum_{i=1}^n\left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)\] Here \(r\) is a unitless measure with the property that \(-1 \leq r \leq 1\).
Values of \(r\) near \(\pm 1\) indicate a near perfect (positive or negative) linear relationship between \(x\) and \(y\). A frequently used rule of thumb is that \(|r| > 0.80\) or so qualifies as a strong linear relationship. We can calculate the Pearson correlation with the cor function. When applied to a data.frame, or matrix, it displays a correlation matrix. In the below code we specify to only calculate the correlation based on the 3 numeric variables.

cor(uadata[,c("gpa.endyr1", "hs.pct", "act")])
##            gpa.endyr1   hs.pct      act
## gpa.endyr1   1.000000 0.398479 0.365611
## hs.pct       0.398479 1.000000 0.442507
## act          0.365611 0.442507 1.000000

All pairwise correlations in this example are of modest magnitude (\(r \approx 0.4\)) (see Figure 1.1 or the above correlation matrix).

To visualize the relationship between numeric variables, we typically use a scatterplot.

ggplot(uadata) + 
  geom_point(aes(x=act, y=gpa.endyr1) ) +
  labs(x="Composite ACT Score",
       y="Grade point average following freshman year") +
  theme_minimal()

Figure 1.7: Scatterplot showing the relationship between the variables act and gpa.endyr1 in the uadata dataset.

You should note in Figure 1.7 the overall positive relationship (higher ACT scores tend to correspond to higher GPAs). We also note the rounding of the ACT scores (always a whole number) resulting in these vertical striations in the plot.

We can attempt to visually address the striations with jittering (some visual random perturbations) in Figure 1.8. There, we specify no jittering on the vertical axis (thus, height=0) and let ggplot2 use the default behavior horizontally.

ggplot(uadata) + 
  geom_jitter(aes(x=act, y=gpa.endyr1), height=0 ) +
  labs(x="Composite ACT Score",
       y="Grade point average following freshman year") +
  theme_minimal()

Figure 1.8: Jittered scatterplot showing the relationship between the variables act and gpa.endyr1 in the uadata dataset.

1.7 Sampling distributions: describing how a statistic varies

As described in the discussion on variability above, each of the numeric quantities calculated has some associated uncertainty. It is not enough to know just the variance or standard deviation of an estimate. Those are just ways to get a measure of the inconsistency in an estimate’s value from sample to sample of a given size \(n\).

If we wish to make a confident conclusion about an unknown population parameter using a sample statistic (e.g., \(\bar{x}\)), we also need to know something about the general pattern of behavior that statistic would take if we had observed it over many, many different potential random samples.

This long-run behavior pattern is described via the probability distribution of the statistic, also known as the sampling distribution of the statistic.

It is important to remember that we usually only collect one sample, so only one sample estimate will be available in practice. However, knowledge of the sampling distribution of our statistic or estimate is important because it will enable us to assess the reliability or confidence we can place in any generalizations we make about the population using our estimate.
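
Although we will rely on theory rather than simulation, a small simulation sketch can make the idea concrete: treat the 705 ACT scores in uadata as a stand-in "population," repeatedly draw samples of size 25, and examine how the sample mean varies (the seed and sample sizes below are arbitrary choices).

set.seed(363)                 # arbitrary seed for reproducibility
xbar.many <- replicate(2000, mean(sample(uadata$act, size = 25)))
sd(xbar.many)                 # empirical variability of x-bar across samples
sd(uadata$act)/sqrt(25)       # compare with the standard error formula s/sqrt(n)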

Two useful sampling distributions: Normal and \(t\)

You should be very familiar with normal distributions and \(t\)-distributions from your introductory statistics course. In particular, both serve as useful sampling distributions to describe the behavior of the sample mean \(\bar{x}\) as an estimator for a population mean \(\mu\). We briefly review them here.

Normal distributions. A variable \(X\) that is normally distributed is one that has a symmetric bell-shaped density curve. A normal density curve has two parameters: its mean \(\mu\) and its variance \(\sigma^2\). Notationally, we indicate that \(X\) is normal by writing \(X \sim N(\mu, \sigma^2)\).

The normal distribution serves as a good sampling distribution model for the sample mean \(\bar{x}\) because of the powerful Central Limit Theorem, which states that if \(n\) is sufficiently large, the sampling distribution of \(\bar{x}\) will be approximately normal, regardless of the shape of the distribution of the original parent population from which you collect your sample (when making a few assumptions).

t distributions. The formula given by \[t_0 = \frac{\bar{x}-\mu}{{SE}_{\bar{x}}}\]

counts up the number of standard error units between the true value of \(\mu\) and its sample estimate \(\bar{x}\); it can be thought of as the standardized distance between a true mean value and its sample mean estimate. If the original population was normally distributed, then the quantity \(t_0\) will possess the \(t\)-distribution with \(n - 1\) degrees of freedom. Degrees of freedom is the only parameter of a \(t\)-distribution. Notationally, we indicate this by writing \(t_0 \sim t(\textrm{df})\).

\(t\)-distributions look a lot like the standard normal distribution, \(N(0, 1)\): they both center at 0 and are symmetric and bell-shaped (see below).
The key difference is due to the fact that the quantity \(t\) includes two sample summaries in its construction: the sample mean \(\bar{x}\), and \(SE_{\bar{x}}\), which incorporates the sample standard deviation \(s\).
Because of this, \(t\) is more variable than \(N(0, 1)\).
However, as \(n\) (and hence degrees of freedom) increases, \(s\) becomes a better estimate of the true population standard deviation, so the distinction ultimately vanishes with larger samples. This can all be visualized in Figure 1.9.


Figure 1.9: Overlaid distribution functions showing the relationship between a standard normal distribution and that of a \(t\) distribution with 3, 5 and 10 degrees of freedom. As the degrees of freedom increases, the \(t\) distribution becomes more normal
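
The code that produced Figure 1.9 is not shown; one way such a figure could be drawn with ggplot2 (a sketch, assuming tidyverse/ggplot2 is loaded as earlier, with degrees of freedom matching the caption) is:

ggplot(data.frame(x = c(-4, 4)), aes(x = x)) +
  stat_function(fun = dnorm, aes(color = "N(0,1)")) +
  stat_function(fun = dt, args = list(df = 3),  aes(color = "t(3)")) +
  stat_function(fun = dt, args = list(df = 5),  aes(color = "t(5)")) +
  stat_function(fun = dt, args = list(df = 10), aes(color = "t(10)")) +
  labs(x = "Value", y = "Density", color = "Distribution") +
  theme_bw()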

The \(t\)-distribution is usually the relevant sampling distribution when we use a sample mean \(\bar{x}\) to make inferences about the corresponding population mean \(\mu\). Later in the course, the need to make more complicated inferences will require us to develop different sampling distribution models to facilitate the statistics involved. These will be presented as needed.

1.8 Two-sample inference

A key concept from your introductory statistics course that builds off sampling distribution theory is the comparison of the mean from two populations based on the sample means using a two-sample t-test. Consider testing \[H_0:\ \mu_A = \mu_B ~~~\textrm{versus}~~~ H_A:\ \mu_A <\neq> \mu_B\] where the notation \(<\neq>\) represents one of the common alternative hypotheses from your introductory statistics course (less than, not equal to, or greater than) and, \(\mu_A\) is the population mean from population \(A\) and \(\mu_B\) is the population mean from population \(B\). We statistically test this hypothesis via \[t_0 = \frac{\bar{x}_A - \bar{x}_B}{{SE}_{\bar{x}_A-\bar{x}_B}}\]

where \({SE}_{\bar{x}_A-\bar{x}_B}\) is the standard error of the difference of sample means \(\bar{x}_A\) and \(\bar{x}_B\). It can be estimated in one of two ways (depending if we assume the population variances are equal or not) and we reference your intro stat class for details (pooled versus non-pooled variance). In R, we can perform this test, and construct the associated \((1-\alpha)100\%\) confidence interval for \(\mu_A-\mu_B\), \[\bar{x}_A - \bar{x}_B \pm t_{\alpha/2} {SE}_{\bar{x}_A-\bar{x}_B}\] using the function t.test(). Here \(t_{\alpha/2}\) is the \(1-\alpha/2\) quantile from the t-distribution (the degrees of freedom depend on your assumption about equal variance).

Consider comparing the ACT scores for incoming freshmen in 1996 to 2000 using the university admission data. Perhaps we have a theory that ACT scores were higher in 1996 than 2000. That is, we are formally testing \[H_0:\ \mu_{1996} = \mu_{2000} ~~~\textrm{versus}~~~ H_A:\ \mu_{1996} > \mu_{2000}\]

First we need to select a subset of the data.

uadata.trim <- uadata %>%
  filter(year %in% c(1996, 2000) )

Here we filter the uadata so that only years in the vector 1996, 2000 are included. Now that we have a proper subset of the data, we can compare the ACT scores between the two years and build a 98% confidence interval for the mean difference.

t.test(act ~ year, data=uadata.trim, conf.level=0.98, 
       alternative="greater", var.equal=TRUE)
## 
##  Two Sample t-test
## 
## data:  act by year
## t = 0.3013, df = 277, p-value = 0.382
## alternative hypothesis: true difference in means between group 1996 and group 2000 is greater than 0
## 98 percent confidence interval:
##  -0.791858       Inf
## sample estimates:
## mean in group 1996 mean in group 2000 
##            24.6901            24.5547

We are 98% confident that the mean difference between 1996 ACT scores and 2000 ACT scores is in the interval (-0.792, \(\infty\)). Further, we lack significant evidence (\(p\)-value \(\approx 0.3817\)) to conclude that the average ACT score in 1996 is greater than that in 2000 for incoming freshmen (also note that the value 0 is included in the confidence interval).

A few things to note about the code above.

  • We use the notation act ~ year, this tells R that the act scores are a function of year.
  • We specify the data set to be used to perform the test (data=uadata.trim).
  • We specify the confidence level (by default it is 0.95) and the alternative hypothesis (by default it is a not-equal test).
  • Lastly, we set var.equal=TRUE so the standard error (and degrees of freedom) are calculated under the assumption that the variance of the two populations is equal (pooled variance); a sketch of the non-pooled alternative follows this list.
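
For comparison, if we omit var.equal=TRUE, R defaults to the Welch (non-pooled) version of the test, which does not assume equal population variances; a minimal sketch (output not shown):

t.test(act ~ year, data = uadata.trim, conf.level = 0.98,
       alternative = "greater")        # Welch two-sample t-test by default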

References

Faraway, J. J., Linear Models with R, Taylor & Francis, from https://books.google.com/books?id=fvenzpofkagC, 2004.
Kutner, M. H., Nachtsheim, C., Neter, J. and Li, W., Applied Linear Statistical Models, New York, NY: McGraw-Hill Irwin, 2004.
R Core Team, R: A Language and Environment for Statistical Computing, Vienna, Austria: R Foundation for Statistical Computing, from https://www.R-project.org/, 2019.
Schloerke, B., Crowley, J., Cook, D., Briatte, F., Marbach, M., Thoen, E., Elberg, A. and Larmarange, J., GGally: Extension to 'ggplot2', from https://CRAN.R-project.org/package=GGally, 2018.
Wickham, H., ggplot2: Elegant Graphics for Data Analysis, Springer-Verlag New York, from https://ggplot2.tidyverse.org, 2016.
Wickham, H., tidyr: Tidy Messy Data, from https://CRAN.R-project.org/package=tidyr, 2021.
Wickham, H., tidyverse: Easily Install and Load the 'Tidyverse', from https://CRAN.R-project.org/package=tidyverse, 2017.
Wickham, H., François, R., Henry, L. and Müller, K., dplyr: A Grammar of Data Manipulation, from https://CRAN.R-project.org/package=dplyr, 2021.
Wickham, H. and Grolemund, G., R for Data Science: Import, Tidy, Transform, Visualize, and Model Data, O'Reilly Media, from http://r4ds.had.co.nz/, 2017.
Wickham, H. and Hester, J., readr: Read Rectangular Text Data, from https://CRAN.R-project.org/package=readr, 2021.