Box Plots and Violin Plots

Introduction

Histograms are nice, but sometimes we need to show more of the data spread to tell our story. For this, we can use the box plot, the violin plot, or both. A box plot summarizes a distribution with five statistics: the minimum value, 25th percentile, median, 75th percentile, and maximum value. It is sometimes called a box-and-whisker plot because it looks like a box (spanning the 25th to the 75th percentile, with a line at the median) with whiskers (extending toward the minimum and maximum values). Oftentimes, a box plot is drawn with outliers separated from the whiskers. A point is typically flagged as an outlier if it lies more than 1.5 times the interquartile range (IQR, the 75th percentile minus the 25th percentile) above the 75th percentile or below the 25th percentile, and the whiskers then stop at the most extreme values that are not outliers. A violin plot is like a box plot with a histogram (or, more precisely, a kernel density plot) drawn on the side to show the shape of the distribution; a violin plot also usually extends to the actual maximum and minimum values in the dataset without filtering any potential outliers.
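To make these statistics concrete, here is a minimal sketch that computes the five-number summary and the 1.5 times IQR outlier fences using only Python's standard library. The scores list is made-up illustration data, not taken from the SAT dataset, and the helper name box_plot_stats is hypothetical. Quartile conventions differ slightly between libraries, so the exact values may not match what a given plotting library draws.

```python
import statistics


def box_plot_stats(values):
    """Return the five-number summary plus the 1.5 * IQR outlier fences."""
    # method="inclusive" interpolates linearly between neighbouring data
    # points, treating the data as the full population.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr

    # Points outside the fences are flagged as outliers; the whiskers
    # stop at the most extreme points that are still inside the fences.
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    inliers = [v for v in values if lower_fence <= v <= upper_fence]

    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        "whisker_low": min(inliers),
        "whisker_high": max(inliers),
        "outliers": outliers,
    }


# Made-up data with one extreme value (hypothetical, for illustration only)
scores = [2, 4, 4, 5, 6, 7, 8, 9, 10, 30]
stats = box_plot_stats(scores)
for name, value in stats.items():
    print(f"{name}: {value}")
```

In this made-up sample, the value 30 falls above the upper fence, so a box plot that flags outliers would draw it as a separate point and end the upper whisker at 10.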

We will use a dataset of 2012 SAT results for various New York high schools [1].

Column Name           | Data Type | Description
DBN                   | str       | A school identifier unique to each school
School Name           | str       | The name of the school
Number of Test Takers | int       | Total number of test takers
Critical Reading Mean | int       | The average Critical Reading section score
Mathematics Mean      | int       | The average Mathematics section score
Writing Mean          | int       | The average Writing section score

The Lie

A box plot is a summary of information; by definition, this means that we are losing some of the granularity of the data. Because we only see the quartiles, the range, and potential outliers, a box plot gives no indication of its sample size. Box Plot One and Box Plot Two look roughly the same, but they are both hiding secrets. Which one has the larger sample size? Of the three box plots, which do you think represents the best data?

Box Plot Three has the tightest data distribution, Box Plot One has the most spread and the most datapoints, and Box Plot Two is somewhere in the middle. Be careful if you only receive box plot data summaries because they can hide crucial data distribution information in the name of simplicity and statistical representation. Even though Box Plot One and Box Plot Two look very similar, you should be more suspicious of Box Plot Two given its smaller sample size.
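The sample-size problem can be demonstrated with a small sketch. The two lists below are made-up data (not the data behind Box Plots One through Three): one has 9 values and the other has 900, yet their five-number summaries are identical, so their box plots would look the same. The helper name five_number_summary is hypothetical.

```python
import statistics


def five_number_summary(values):
    """Return (min, 25th percentile, median, 75th percentile, max)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, median, q3, max(values))


# Made-up samples: the large one repeats every small-sample value 100 times
small_sample = list(range(1, 10))   # 9 values: 1 through 9
large_sample = small_sample * 100   # 900 values

small_summary = five_number_summary(small_sample)
large_summary = five_number_summary(large_sample)

print(f"n={len(small_sample)}: {small_summary}")
print(f"n={len(large_sample)}: {large_summary}")
print("Same box plot statistics:", small_summary == large_summary)
```

Nothing in the five summary statistics distinguishes the two samples, which is why it helps to report the sample size alongside any box plot.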

Imports

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import pandas as pd
import plotly.io as pio
pio.renderers.default = "browser"

df = pd.read_csv('https://raw.githubusercontent.com/alexkenan/pyviz/main/datasets/sat.csv')
df.dropna(inplace=True)

There are some blank entries in the data, so we drop them with the pandas dropna() method. We could edit the spreadsheet to remove these blanks, but it is always good practice to clean data in code since you can’t always control the data source!
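As a quick illustration of what dropna() does, the sketch below builds a tiny made-up DataFrame (the school names and scores are invented, not rows from the SAT file), counts the missing values per column, and then drops the incomplete rows. By default, dropna() returns a new DataFrame and leaves the original unchanged.

```python
import pandas as pd

# Made-up rows for illustration only; None marks a blank entry
toy = pd.DataFrame({
    "School Name": ["School A", "School B", "School C"],
    "Mathematics Mean": [404.0, None, 519.0],
    "Writing Mean": [363.0, 366.0, None],
})

# Count the blanks in each column before deciding how to clean
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Drop every row that has at least one missing value
cleaned = toy.dropna()
print(cleaned)

# Alternatively, only require the columns we actually plan to plot
math_only = toy.dropna(subset=["Mathematics Mean"])
print(len(toy), len(cleaned), len(math_only))
```

Checking isna().sum() first shows how many rows are about to be lost. The subset argument keeps rows whose blanks are in columns you do not need.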

Matplotlib’s boxplot() function will create a box plot for us. There are a few customizable parameters, but let’s see a basic box plot first:

data = [df['Critical Reading Mean'], df['Mathematics Mean'], df['Writing Mean']]
plt.boxplot(data)
plt.xticks(ticks=[1, 2, 3], labels=['Reading', 'Math', 'Writing'])
plt.title('Boxplots for NY SAT Testing')
plt.ylabel('Score')
plt.show()

If we didn’t mess with the x axis, the plot would show 1, 2, 3 instead of Reading, Math, Writing. It wouldn’t be the end of the world, but we can make it look nice regardless ...
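As one possible sketch of customizing the x-axis, the example below uses Matplotlib's object-oriented interface and puts each group's sample size in its tick label, which speaks to the sample-size concern raised earlier. The scores are made-up values so the snippet runs without downloading the dataset, and the helper name boxplot_with_counts is hypothetical.

```python
import matplotlib.pyplot as plt


def boxplot_with_counts(groups):
    """Draw one box per group, labeling each with its name and sample size."""
    names = list(groups)
    data = [groups[name] for name in names]
    labels = [f"{name}\n(n={len(values)})" for name, values in zip(names, data)]

    fig, ax = plt.subplots()
    ax.boxplot(data)
    # boxplot() places the boxes at x = 1, 2, 3, ... by default
    ax.set_xticks(list(range(1, len(names) + 1)))
    ax.set_xticklabels(labels)
    ax.set_ylabel("Score")
    ax.set_title("Box plots with sample sizes")
    return fig, ax


# Made-up scores for illustration only (not rows from the SAT dataset)
example_scores = {
    "Reading": [355, 383, 377, 414, 390],
    "Math": [404, 423, 402, 401, 433, 519],
    "Writing": [363, 366, 370, 359],
}

fig, ax = boxplot_with_counts(example_scores)
tick_labels = [tick.get_text() for tick in ax.get_xticklabels()]
print(tick_labels)
```

Call plt.show() after this to display the figure. With the real data, the same helper could be given the three score columns from df.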
