day4.2-slides

EDS 212: Day 4, Lecture 2

Essential summary statistics and exploration

August 7th, 2025

Data types

Quantitative: numeric information
Qualitative: descriptions (usually words)

A bit deeper:

Continuous: measured values; a variable can take any of infinitely many possible values
Discrete: can only have certain values (e.g. counts)
Ordinal: order matters, but the difference between values isn’t known or equal (e.g. Likert Scale)
Binary: only two possible outcomes (yes/no, true/false, 1/0)
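As a rough sketch, these data types map onto common R vector classes. All values below are hypothetical and exist only to illustrate the classes; they are not from a real dataset.

```r
# Continuous: measured values stored as doubles (hypothetical flipper lengths, mm)
flipper_length_mm <- c(181.5, 190.2, 195.0)

# Discrete: counts stored as integers (hypothetical eggs per nest)
egg_count <- c(1L, 2L, 2L)

# Ordinal: an ordered factor, where order matters but spacing is unknown
agreement <- factor(
  c("agree", "neutral", "disagree"),
  levels = c("disagree", "neutral", "agree"),
  ordered = TRUE
)

# Binary: a logical vector with only two possible outcomes
is_adult <- c(TRUE, FALSE, TRUE)

class(flipper_length_mm)  # "numeric"
class(egg_count)          # "integer"
class(agreement)          # "ordered" "factor"
class(is_adult)           # "logical"

# Order comparisons are meaningful for ordinal data
agreement[1] > agreement[2]  # TRUE: "agree" ranks above "neutral"
```

Storing ordinal data as an ordered factor (rather than plain text) is one way to keep the ranking attached to the values.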

Quantitative data: continuous & discrete

Nominal, ordinal, binary data:

Data distributions

How can we describe how data are distributed?

Our starting points:

Shape / patterns / clusters (data visualization)
Central tendency (mean / median)
Spread & uncertainty (standard deviation / standard error / confidence interval)

Useful data visualizations

Histograms
Boxplots
Scatterplots

…then get even more involved:

Beeswarm
Marginal plots
Raincloud plots
Pairs plots

Histogram

A histogram is a graph of the frequency of observations within a series of bins (usually of equal size) for a variable.

Example: distribution of penguin flipper lengths for chinstrap penguins
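To make "frequency of observations within bins" concrete, here is a minimal base R sketch. The flipper lengths are hypothetical values chosen for illustration (not the penguin data), and the 10 mm bin width is an arbitrary choice.

```r
# Hypothetical flipper lengths (mm), for illustration only
flipper_length_mm <- c(181, 185, 190, 192, 193, 195, 196, 198, 201, 205, 210)

# Bin the values into equal-width bins without drawing the plot
bins <- hist(
  flipper_length_mm,
  breaks = seq(180, 210, by = 10),
  plot = FALSE
)

bins$breaks  # bin edges: 180 190 200 210
bins$counts  # observations per bin: 3 5 3
```

Dropping `plot = FALSE` draws the histogram itself. Changing the bin width changes the counts, so it is worth trying a few widths before settling on one.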

Boxplot

Most often:

Box extends to 1st and 3rd quartile observation values
Line at the median value
Whiskers extend to the last observation within 1 step of the box (1 step = 1.5 × interquartile range)
Anything beyond the whiskers is indicated with a dot at the observation value

Boxplot

Boxplot example:

# load packages: tidyr for pivot_longer(), ggplot2 for plotting ----
library(tidyr)
library(ggplot2)

# create vectors of Tallie & Molly miles logged ----
tallie <- c(1.0, 1.2, 1.8, 2.1, 2.4, 2.9, 3.4, 4.7, 5.1, 5.6, 7.8, 10.4, 15.4)
molly <- c(0.5, 0.4, 1.1, 1.2, 3.2, 2.1, 3.3, 2.3, 0.7, 0.9, 1.9, 3.5, 1.9)

# turn vectors into a long-format data frame that can be plotted ----
dog_miles <- data.frame(tallie, molly) |> 
  pivot_longer(cols = c(tallie, molly), names_to = "name", values_to = "miles")

# make boxplot of Tallie vs. Molly miles ----
ggplot(data = dog_miles, aes(x = miles, y = name)) +
  geom_boxplot()
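The boxplot pieces for Tallie's miles can also be worked out by hand. This is a sketch that assumes R's default quartile method in `quantile()` and the 1.5 × IQR step described earlier; the variable names are just for illustration.

```r
tallie <- c(1.0, 1.2, 1.8, 2.1, 2.4, 2.9, 3.4, 4.7, 5.1, 5.6, 7.8, 10.4, 15.4)

# Box: 1st quartile, median, 3rd quartile
q1 <- quantile(tallie, 0.25, names = FALSE)
med <- median(tallie)
q3 <- quantile(tallie, 0.75, names = FALSE)

# One step = 1.5 * interquartile range
step <- 1.5 * (q3 - q1)

# Whiskers: most extreme observations within one step of the box
lower_whisker <- min(tallie[tallie >= q1 - step])
upper_whisker <- max(tallie[tallie <= q3 + step])

# Observations beyond the whiskers are drawn as individual dots
outliers <- tallie[tallie < q1 - step | tallie > q3 + step]

c(q1 = q1, median = med, q3 = q3)  # 2.1 3.4 5.6
c(lower_whisker, upper_whisker)    # 1.0 10.4
outliers                           # 15.4
```

So Tallie's 15.4 mile day is the only observation that would appear as a dot beyond the whisker.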

Scatterplots

Always, always, always look at your data. It is the only way to make a responsible decision about an appropriate type of analysis.

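# scatterplot of bill dimensions, colored by species (requires the palmerpenguins package) ----
library(ggplot2)
ggplot(data = palmerpenguins::penguins, aes(x = bill_depth_mm, y = bill_length_mm)) +
  geom_point(aes(color = species))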

Summarizing data numerically

Central tendency
Variance and standard deviation
Standard error
Confidence interval

Mean

Average value of sample observations, calculated by summing all observation values and dividing by the number of observations. E.g. \(mean\;of\;3, 7, 17 = \frac{3+7+17}{3} = 9\)

Pros:

Average value is often a useful metric
Commonly reported

Cons:

Susceptible to outliers and skew
Subject to misinterpretation as “most likely value”

Median

Middle value when all observations are arranged in order. If you have an even number of values, the median is calculated as the average of the middle two values. E.g. \(median\;of\;3, 7, 17 = 7\)

Pros:

Less susceptible to skew and outliers
Better as sample size increases

Cons:

Doesn’t take into account the magnitude of all values
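A quick sketch of how the two measures respond to an outlier, using the values from the examples above. The second vector swaps 17 for a hypothetical extreme value of 170.

```r
# Values from the mean and median examples
x <- c(3, 7, 17)
mean(x)    # 9
median(x)  # 7

# Replace the largest value with a hypothetical extreme outlier
x_outlier <- c(3, 7, 170)
mean(x_outlier)    # 60: pulled strongly toward the outlier
median(x_outlier)  # 7: unchanged
```

The median ignores how far the extreme value sits from the rest, which is both its strength (robustness) and its weakness (it discards magnitude).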

The best way to describe the distribution of the data is to present the data itself.

Variance and standard deviation

Both are measures of data spread.

Variance

Reported in units of measurement squared

Standard deviation

Reported in units of measurement

Standard deviation is expressed in the same unit of measurement as the data, and therefore can be easier to interpret – for example it’s more intuitive to report that a group of people’s heights has a standard deviation of 3 inches, and less intuitive to report a variance of 9 square inches.

Variances add (standard deviations don’t). For example, if the total variance in height is 9 square inches and we say that sex explains 4 square inches of it, you know that the remaining 5 square inches are due to other factors.
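The height example can be checked with a few lines of arithmetic. This sketch assumes the total variance of 9 square inches from the example above and that the sources of variation are independent, which is what allows variances to be added.

```r
total_var <- 9   # total variance in height (square inches)
sex_var <- 4     # variance explained by sex (square inches)

# Variances add, so the remainder is due to other factors
other_var <- total_var - sex_var
other_var                        # 5

# Standard deviations do not add
sqrt(total_var)                  # 3
sqrt(sex_var) + sqrt(other_var)  # about 4.24, not 3
```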

See this discussion and this discussion

The variance measures the mathematical dispersion of data relative to the mean, and while theoretically correct, can be difficult to interpret in a real-world sense. Standard deviation is expressed in the same unit of measurement as the data (e.g. inches, rather than square inches), and therefore can be a more intuitive value to report.

Variance

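Variance: mean squared distance of observations from the mean (for a sample, the sum of squared distances is divided by \(n - 1\) rather than \(n\))

\[s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar x)^2}{n - 1}\]

Where \(s^2\) is the sample variance, \(x_i\) is a sample observation value, \(\bar x\) is the sample mean, and \(n\) is the number of observations.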

Calculate variance by hand (1/2)

Given these data: \(2, 4, 4, 6, 9\)

1. Calculate the mean:

\[ mean = \frac{2+4+4+6+9}{5}=\frac{25}{5}=5\]

2. Subtract the mean from each data point and square the result:

\[(2-5)^2 = (-3)^2 = 9\]
\[(4-5)^2 = (-1)^2 = 1\]
\[(4-5)^2 = (-1)^2 = 1\]
\[(6-5)^2 = (1)^2 = 1\]
\[(9-5)^2 = (4)^2 = 16\]

Calculate variance by hand (2/2)

Given these data: \(2, 4, 4, 6, 9\)

3. Sum the squared differences:

\[9+1+1+1+16 = 28\]

4. Divide by the number of data points minus 1 (\(n - 1\)):

\[Variance = \frac{28}{5-1} = 7\]

Alternatively, in R:

var(c(2, 4, 4, 6, 9))

[1] 7
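The by-hand steps can also be mirrored line by line in R, which is a useful check that `var()` divides by \(n - 1\). The intermediate variable names here are just for illustration.

```r
x <- c(2, 4, 4, 6, 9)

# Step 1: calculate the mean
x_bar <- mean(x)             # 5

# Step 2: subtract the mean from each value and square the result
squared_diffs <- (x - x_bar)^2
squared_diffs                # 9 1 1 1 16

# Step 3: sum the squared differences
sum_sq <- sum(squared_diffs) # 28

# Step 4: divide by n - 1
variance_by_hand <- sum_sq / (length(x) - 1)
variance_by_hand             # 7, matching var(x)
```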

Standard deviation

Also a measure of data spread, calculated by taking the square root of the variance.

In R:

sqrt(var(c(2, 4, 4, 6, 9)))

[1] 2.645751

# or alternatively, just use sd()
sd(c(2, 4, 4, 6, 9))

[1] 2.645751

Beware summary statistics alone . . .

Meet the Datasaurus Dozen

Same summary statistics, different distributions

Confidence interval

Confidence interval: a range of values (based on a sample) that, if we were to take multiple samples from the population and calculate the confidence interval from each, would contain the true population parameter X percent of the time.

What it’s NOT:

“There is a 95% chance that the true population parameter is between values X and Y.”

Confidence interval example

Mean shark length is 8.42 \(\pm\) 3.55 ft (mean \(\pm\) standard deviation), with a 95% confidence interval of [6.45, 10.39 ft] (n = 15).
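The slide does not say how this interval was calculated. As a sketch, a t-based interval built from the reported mean, standard deviation, and sample size reproduces the reported values; that method is an assumption, and it also shows where the standard error fits in.

```r
# Reported summary values for shark length (ft)
sample_mean <- 8.42
sample_sd <- 3.55
n <- 15

# Standard error of the mean: sd / sqrt(n)
se <- sample_sd / sqrt(n)

# Critical t value for a 95% interval with n - 1 degrees of freedom
t_crit <- qt(0.975, df = n - 1)

# Confidence interval: mean +/- t * standard error
ci <- sample_mean + c(-1, 1) * t_crit * se

round(se, 2)  # 0.92
round(ci, 2)  # 6.45 10.39
```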

What this DOES NOT mean: There is a 95% chance that the true population mean length is between 6.45 and 10.39 feet.

The true population mean is a fixed value and does not change – the CI either contains this true mean or it does not. There is no probability involved.

What this DOES mean: If we took many samples from the population (all n = 15) and calculated a 95% confidence interval from each, we would expect about 95% of those intervals to contain the true population mean.

This statement correctly describes the frequency with which we would expect CIs to capture the mean over many samples.
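A small simulation can make the "over many samples" interpretation tangible. This sketch assumes a hypothetical normally distributed population whose true mean and standard deviation are borrowed from the shark example; in real life the true mean is unknown, which is the whole point.

```r
set.seed(212)

# Hypothetical population parameters (assumed, for illustration only)
true_mean <- 8.42
true_sd <- 3.55
n <- 15
n_samples <- 1000

# For each sample: compute a 95% CI and record whether it contains the true mean
covers <- replicate(n_samples, {
  s <- rnorm(n, mean = true_mean, sd = true_sd)
  ci <- t.test(s, conf.level = 0.95)$conf.int
  ci[1] <= true_mean && true_mean <= ci[2]
})

# Proportion of intervals that captured the true mean (expected to be near 0.95)
coverage <- mean(covers)
coverage
```

Each individual interval either contains the true mean or it does not; the 95% describes how often the procedure succeeds across many samples.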

Communicating data summaries

The “Bar plots” philosophy: show as much as you can for the audience you’re presenting to
Summary statistics are often useful, but are a small part of the whole data story
Uncertainty is important! How can we responsibly communicate it?
All summaries are strongest when accompanied by additional data communication

Main text: 'Are your summary statistics hiding something?' On the left is an opaque gray bar plot with an error bar, looking mischievous while hiding individual data points in a net behind it. On the right is a transparent bar plot with an error bar, looking exposed, with individual data points exposed and dancing. The bottom four data points have arms that spell 'YMCA' as if dancing to the Village People.

Artwork by Allison Horst