1.4 Frequency Distributions

Categorical Variables

Frequency (count): the number of times a particular value occurs.

A frequency distribution lists each distinct value with its frequency.

A bar plot is a graphical representation of a frequency distribution. Each bar’s height equals the frequency of the corresponding category.

Relative frequency is the ratio of the frequency to the total number of observations.

\[ \text{relative frequency} = \frac{\text{frequency}}{\text{number of observations}} \]

This is also called the proportion.

The percentage can be obtained by multiplying the proportion by 100.

A relative frequency distribution lists each distinct value with its relative frequency.
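As a small illustration of these definitions, here is a sketch that builds a frequency distribution and a relative frequency distribution in Python. The `pets` data are made up for this example and do not come from the notes.

```python
from collections import Counter

# Hypothetical categorical data (made up for illustration): favorite pet of 10 people
pets = ["cat", "dog", "dog", "fish", "cat", "dog", "bird", "dog", "cat", "dog"]

freq = Counter(pets)   # frequency (count) of each distinct value
n = len(pets)          # total number of observations

# relative frequency = frequency / number of observations
rel_freq = {value: count / n for value, count in freq.items()}

# Print each distinct value with its frequency, proportion, and percentage
for value, count in freq.most_common():
    proportion = rel_freq[value]
    print(f"{value:<5} {count:>2}  {proportion:.2f}  {100 * proportion:.0f}%")
```

The relative frequencies always add up to 1 (equivalently, the percentages add up to 100), which is a quick check that nothing was miscounted.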

A dot plot shows a number line with dots drawn above the line. Each dot represents a single observation.

We would also like to be able to visualize larger, more complex data sets.
This is hard to do using a dot plot!
Instead, we can do this using bins, which group numeric data into equal-width consecutive intervals.

A random sample of weights (in lbs) from 12 cats:

\[ 6.2 \quad 11.6 \quad 7.2 \quad 17.1 \quad 15.1 \quad 8.4 \]
\[ 7.7 \quad 13.9 \quad 21.0 \quad 5.5 \quad 9.1 \quad 7.3 \]

There are lots of ways to break these into “bins”, but what about…

We’ve suggested bins

Each bin has an equal width of 5 (that’s good), but if we had a cat with a weight of exactly 15 lbs, would we use the second or third bin?

To remove this ambiguity, the bins must not overlap. Instead, we could use:

Now, a cat with a weight of 15.0 lbs would be placed in the third bin (but not the second).
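One way to make the "no overlap" rule concrete is to treat each bin as including its left endpoint but not its right endpoint. The sketch below assumes bin edges of 5, 10, 15, 20, 25 (width 5, chosen to match the description above; the exact edges are an assumption), and `bin_index` is a helper name invented for this example.

```python
# Cat weights (lbs) from the sample above
weights = [6.2, 11.6, 7.2, 17.1, 15.1, 8.4, 7.7, 13.9, 21.0, 5.5, 9.1, 7.3]

# Assumed bin edges: equal width of 5, starting at 5 (an assumption for illustration)
edges = [5, 10, 15, 20, 25]

def bin_index(x, edges):
    """Return i such that edges[i] <= x < edges[i + 1] (left endpoint included)."""
    for i in range(len(edges) - 1):
        if edges[i] <= x < edges[i + 1]:
            return i
    raise ValueError(f"{x} is outside the bins")

# Tally how many weights land in each bin
counts = [0] * (len(edges) - 1)
for w in weights:
    counts[bin_index(w, edges)] += 1

print(bin_index(15.0, edges))  # 2, i.e. the third bin (indices start at 0)
print(counts)                  # [7, 2, 2, 1]
```

Because every value satisfies exactly one of the conditions `edges[i] <= x < edges[i + 1]`, each observation is counted once and only once.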

We will visualize this using a histogram, which is a lot like a bar plot but for numeric data:

This is a frequency histogram because each bar height reflects the frequency of that bin.

We can also create a relative frequency histogram which displays the relative frequency instead of the frequency:

Notice that those last two histograms look the same except for the numbers on the vertical axis!
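To see why the two histograms share a shape, here is a rough text-only sketch. It again assumes bin edges of 5, 10, 15, 20, 25 with the left endpoint included (an assumption, not necessarily the bins used in the figures), and it draws each bar as a row of `#` characters.

```python
# Cat weights (lbs) from the sample above
weights = [6.2, 11.6, 7.2, 17.1, 15.1, 8.4, 7.7, 13.9, 21.0, 5.5, 9.1, 7.3]

# Assumed bin edges (width 5); each bin is [lo, hi)
edges = [5, 10, 15, 20, 25]
bins = list(zip(edges, edges[1:]))

# Frequency of each bin, then relative frequency = frequency / number of observations
n = len(weights)
counts = [sum(lo <= w < hi for w in weights) for lo, hi in bins]
rel = [c / n for c in counts]

# One row per bin: the bar length is the frequency; the relative frequency
# is the same bar relabeled on a different scale
for (lo, hi), c, r in zip(bins, counts, rel):
    print(f"[{lo:>2}, {hi:>2})  {'#' * c:<8} freq={c}  rel={r:.3f}")
```

Every relative frequency is the frequency divided by the same number (12 here), so the bars keep the same proportions and only the vertical scale changes.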

A histogram gives us insight into the shape of the data distribution, that is, how the values are distributed across the bins.

The part of the distribution that “trails off” to one or both sides is called a tail of the distribution.

When a histogram trails off to one side, we say it is skewed:
- right-skewed if it trails off to the right
- left-skewed if it trails off to the left

Data sets with roughly equal tails are symmetric.

We can also use a histogram to identify modes.

For numeric data, especially continuous variables, we think of modes as prominent peaks in the histogram.

Finally, we can also “smooth out” these histograms and use a smooth curve to examine the shape of the distribution.