Apr 20, 2017

Histogram in R: A Tutorial

The repository containing all of the code and examples for building a histogram in R can be found at this link.

Building a histogram in R can quickly help you explore the contours of your data and see where revisions need to be made. It’ll allow you to quickly unearth insights from your data values and practice the first rudimentary steps of data science. Through this tutorial, you’ll be able to build a histogram in R using basic R commands.

In this tutorial, we will inspect the date distributions of two datasets using ggplot2's histogram functionality. Hot on the heels of the previous bar chart tutorial, found here, I will take a random sample of 5000 rows from the Met Museum artwork database and compare it to an equal-size sample from the Tate Museum artwork database (both samples are provided in the repo of this tutorial). These histograms will therefore give us a look into the yearly distributions of the artworks held in these museums.

As always, make sure you have ggplot2, the go-to data visualization library of R, installed and running. Here’s a tutorial on how to get ggplot2 installed if you need it. 
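If you are not sure whether ggplot2 is already installed, a minimal sketch like the one below checks first and only installs when needed. The CRAN mirror URL is my own choice for illustration, not something this tutorial prescribes, so swap in whichever mirror you normally use.

```r
# Install ggplot2 only if it is not already available on this machine.
# The mirror below is an assumption; any CRAN mirror should work.
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2", repos = "https://cloud.r-project.org")
}
```

Once that has run, loading the library works as usual.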

library(ggplot2)

Now, let’s import the data. I have provided two random sample subsets of each museum’s artworks datasets in the repo of this tutorial. Each subset contains a random 5000 rows from each collection.

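Make sure to import the CSVs as data frames, since the ggplot function expects one later on. read.csv() already returns a data frame, so the as.data.frame() wrapper below is a harmless safeguard rather than a requirement. An R data frame is one of the easiest data formats to manipulate with R functions.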

met <- as.data.frame(read.csv("MetObjects_5k-sample.csv"))
tate <- as.data.frame(read.csv("TateObjects_5k-sample.csv"))
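Before going further, it is worth confirming that the import gave you a data frame with a numeric date column, since a histogram cannot bin text. The sketch below uses a tiny inline CSV with made-up values in place of the real files, so the column contents are purely illustrative.

```r
# A tiny inline CSV standing in for the real files (the values are made up).
csv_text <- "Object.Begin.Date,Title
1885,Example A
-500,Example B"

toy <- read.csv(text = csv_text)

# read.csv() already returns a data frame, with or without as.data.frame().
is.data.frame(toy)

# The date column has to be numeric for geom_histogram() to bin it.
is.numeric(toy$Object.Begin.Date)

# str() gives a compact overview of every column and its type.
str(toy)
```

Running the same three checks on met and tate should tell you quickly whether the date columns came in as numbers.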

Now, we will filter the data by date. For our purposes, we want to make sure that the data we are investigating matches some sort of uniform criteria. For an early look, I want to make sure that the dates we are looking at fall between 2000 B.C. and 2016 C.E. (both endpoints excluded). This way, any non-matching outliers or mistakes will be omitted, and our histogram in R will give us a reasonable estimate of the shape of the data. To do this, use the subset() function:

# Keep rows dated strictly between 2000 B.C. (-2000) and 2016 C.E.
met.subset <- subset(met, Object.Begin.Date > -2000 & Object.Begin.Date < 2016)
tate.subset <- subset(tate, year > -2000 & year < 2016)

This process now ensures that we are looking at the exact same date windows despite the datasets being structured differently.
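To see exactly what this kind of filter keeps, here is a small sketch on a made-up data frame (the years are invented for illustration). Note that the comparisons are strict, so a value of exactly 2016 is dropped, and that subset() also drops rows where the date is missing.

```r
# Made-up years: two out-of-range values, one boundary value, and one NA.
toy <- data.frame(year = c(-3000, -1500, 1576, 1999, 2016, NA))

# Same condition as the filter above, applied to the toy column.
kept <- subset(toy, year > -2000 & year < 2016)

kept$year
range(kept$year)
```

Calling range() on the real subsets is a quick way to confirm that both museums now cover the same window.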

Ok, let’s plot a histogram in R!

I personally prefer building the plot with ggplot() and geom_histogram() rather than the “qplot” shortcut (which is also part of ggplot2), because the layered syntax is easier to customize. Here’s my code:

# Refer to the column by name inside aes(); ggplot2 looks it up in `data`.
ggplot(data = met.subset, aes(x = Object.Begin.Date)) +
  geom_histogram(binwidth = 10) +
  ggtitle("Art Object Date Distribution in the Met") +
  xlab("Year") + ylab("Number of Art Objects")

For “geom_histogram()”, the one argument you really need to think about is “binwidth.” What is that? A bin is the interval that each histogram bar covers, and the binwidth is the size of that interval in the units of the x-axis. If you leave it out, ggplot2 falls back to 30 bins and prints a message suggesting you pick a better value. Setting “binwidth = 1” on whole-number years would give every year its own bar. Since we’re working with dates, I prefer “binwidth = 10”, where each bin spans ten years. We also add axis labels with the xlab and ylab functions.

Here’s the output:

[Figure: histogram titled "Art Object Date Distribution in the Met"]
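As an aside on the binwidth argument, you can inspect the bins ggplot2 actually computed rather than reading them off the picture. The sketch below uses simulated years, not the museum data. It also passes boundary = 0 and closed = "left", which are my additions and not part of the plot above; as far as I know, without them the bin edges are not guaranteed to fall on calendar decades such as 1850 to 1859.

```r
library(ggplot2)

# Simulated years standing in for a museum date column (not the real data).
set.seed(42)
toy <- data.frame(year = sample(1850:2015, 500, replace = TRUE))

# boundary = 0 puts bin edges on multiples of the binwidth (1850, 1860, ...);
# closed = "left" makes each bin include its lower edge, e.g. 1850 to 1859.
p <- ggplot(toy, aes(x = year)) +
  geom_histogram(binwidth = 10, boundary = 0, closed = "left")

# layer_data() returns the values ggplot2 computed for the first layer.
bins <- layer_data(p, 1)
bins[, c("xmin", "xmax", "count")]
```

Each row of the result is one bar: xmin and xmax are its edges and count is its height, so the counts add up to the number of rows in the data.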

Well, the data clearly isn’t normally distributed. The x-axis shows the years, and the y-axis shows the number of art objects that fall into each ten-year bin. Most of the Met’s collection in this sample seems to skew towards the modern period (1850-2000). But what about the Tate collection? Would we get a similar result? Let’s take a look.

# Same plot for the Tate sample, using its `year` column.
ggplot(data = tate.subset, aes(x = year)) +
  geom_histogram(binwidth = 10) +
  ggtitle("Art Object Date Distribution in the Tate") +
  xlab("Year") + ylab("Number of Art Objects")

And the plot:

[Figure: histogram titled "Art Object Date Distribution in the Tate"]

The histograms are completely different! As we can see here, the Tate collection is a lot more specialized, as its earliest piece goes back only to 1576. It seems to have a strong mode in the 19th century. For comparison, the earliest piece in the Met sample sits right at our subset cut-off of 2000 B.C.

For that reason, the Tate is primarily a ‘modern’ collection, as it predominantly features art from 1800 onwards. If we were to become interested in studying modern art and wanted to put the two collections on a level playing field, we would have to re-subset!

# Keep only rows dated after 1850 and before 2016 in both collections.
met.modern <- subset(met, Object.Begin.Date > 1850 & Object.Begin.Date < 2016)
tate.modern <- subset(tate, year > 1850 & year < 2016)

And let’s plot. Here are the collections back-to-back:

[Figure: the Met and Tate histograms for the post-1850 subsets, back-to-back]
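The plotting code for this comparison is not shown, so here is one possible way to draw two histograms on a shared x-axis. It is a sketch under assumptions: the two data frames are simulated stand-ins for met.modern and tate.modern, and stacking them and using facet_wrap() is my choice, not necessarily how the figure above was made.

```r
library(ggplot2)

set.seed(1)
# Simulated stand-ins for met.modern and tate.modern (NOT the real samples).
met.modern  <- data.frame(Object.Begin.Date = sample(1851:2015, 300, replace = TRUE))
tate.modern <- data.frame(year = sample(1851:2015, 300, replace = TRUE))

# Stack both collections into one long data frame with a shared year column.
both <- rbind(
  data.frame(museum = "Met",  year = met.modern$Object.Begin.Date),
  data.frame(museum = "Tate", year = tate.modern$year)
)

# One panel per museum, stacked vertically so the decades line up.
p <- ggplot(both, aes(x = year)) +
  geom_histogram(binwidth = 10) +
  facet_wrap(~ museum, ncol = 1) +
  ggtitle("Art Object Date Distribution After 1850") +
  xlab("Year") + ylab("Number of Art Objects")
p
```

Drawing both panels from one data frame keeps the x-axis identical across museums, which makes the shapes easier to compare than two separately scaled plots.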

As we can see now that we have drawn a simple histogram in R for each, the Met’s modern art collection differs vastly from the Tate’s. Most of the Met art dates back to the period now known as the “fin de siècle”: the years leading up to the turn of the 20th century (1880-1900). The Tate, meanwhile, focuses on postwar and contemporary art (1950 to the present), and features very little art from before the 20th century.

So, thanks to histograms, we now know the shape of our data and the decade distribution of each art collection. If we were to continue studying these datasets, the histograms would give us a good understanding of what we are drawing on when we think about the Met or the Tate collections. For these reasons, histograms play a bigger role than just determining whether or not your data is normally distributed — a histogram plot usefully visualizes the skew of your data, and allows you to make informed assumptions based on the shape of your output.