Histograms are a useful way of inspecting how the values of a continuous dataset are distributed. They are often the fastest way to see whether your data is normally distributed. If the data is not normally distributed (as is often the case), histograms allow you to inspect the skew and bias of the dataset and account for it in your analysis.
All of the code and examples used to build the histograms in this tutorial can be found in the repository at this link.
How to Make a Histogram in R
Building a histogram in R can quickly help you explore the contours of your data and see where revisions need to be made. It’ll allow you to quickly unearth insights from your data values and practice the first rudimentary steps of data science. Through this tutorial, you’ll be able to build a histogram in R using basic R commands.
In this tutorial, we will inspect the date distributions of two datasets using ggplot2's histogram functionality. Hot on the heels of the previous bar chart tutorial, found here, I will compare a random sample of 5,000 rows from the Met Museum artwork database with an equal-size sample from the Tate Museum artwork database (both samples are provided in the repo of this tutorial). These histograms will therefore give us a look into the yearly distributions of the artworks held in these museums.
As always, make sure you have ggplot2, the go-to data visualization library of R, installed and running. Here’s a tutorial on how to get ggplot2 installed if you need it.
library(ggplot2)
Now, let’s import the data. I have provided a random sample of each museum’s artwork dataset in the repo of this tutorial. Each sample contains 5,000 randomly chosen rows from the collection.
Make sure to import the CSVs as data frames, since that will be important to the ggplot() function later on. An R data frame is one of the easiest data formats to manipulate with R functions. Note that read.csv() already returns a data frame, so the as.data.frame() wrapper is only a safeguard.
met <- as.data.frame(read.csv("MetObjects_5k-sample.csv"))
tate <- as.data.frame(read.csv("TateObjects_5k-sample.csv"))
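As a quick sanity check on the import step, here is a minimal sketch that writes a tiny made-up CSV to a temporary file and reads it back. The file contents and the toy object are invented for illustration, and the header "Object Begin Date" is only my assumption about what the raw Met column looks like. The sketch shows that read.csv() hands back a data frame, and that a header containing spaces is turned into a dotted name such as Object.Begin.Date by default.

```r
# Toy CSV, invented for illustration; the real sample files are much larger.
toy_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Title,Object Begin Date",
  "Vase,-500",
  "Portrait,1887",
  "Poster,1965"
), toy_path)

toy <- read.csv(toy_path)

# read.csv() returns a data frame, so no extra conversion is needed.
is.data.frame(toy)

# Spaces in header names become dots (check.names = TRUE is the default).
names(toy)

# The date column should be numeric for a histogram to make sense.
str(toy)
```

Running str() on met and tate in the same way is a cheap way to confirm that the date columns came through as numbers before plotting.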
Now, we will filter the data based on date. For our purposes, we want to make sure that the data we are investigating meets a uniform criterion. For an early look, I want to make sure that the dates we are looking at fall between 2000 B.C. and 2016 C.E. This way, any non-matching outliers or mistakes will be omitted, and our histogram in R will give us a reasonable estimate of the shape of the data. To do this, use the subset() function:
met.subset <- subset(met, Object.Begin.Date > -2000 & Object.Begin.Date < 2016)
tate.subset <- subset(tate, year > -2000 & year < 2016)
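To see what this kind of filter actually keeps, here is a small sketch on a made-up data frame. The titles and years are invented for illustration; the point is only that subset() with strict inequalities drops values outside the window, and that rows with a missing year are dropped as well.

```r
# Toy data frame, invented for illustration.
toy <- data.frame(
  title = c("Tablet", "Vase", "Portrait", "Poster", "Typo", "Undated"),
  year  = c(-2500, -500, 1887, 1965, 20015, NA)
)

# Keep rows strictly between 2000 B.C. and 2016 C.E.
# Column names can be used directly inside subset().
toy.subset <- subset(toy, year > -2000 & year < 2016)

# Titles that survive the filter.
toy.subset$title

# Number of rows dropped, including the row with a missing year.
nrow(toy) - nrow(toy.subset)
```

Comparing nrow() before and after the filter on the real samples is a simple way to check how many records the date window removes.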
This process now ensures that we are looking at the exact same date windows despite the datasets being structured differently.
Ok, let’s plot a histogram in R!
I personally prefer building the plot with ggplot2’s ggplot() and geom_histogram() functions rather than its quicker qplot() shortcut, because the full syntax allows more robust customization. Here’s my code:
ggplot(data = met.subset, aes(x = Object.Begin.Date)) +
  geom_histogram(binwidth = 10) +
  ggtitle("Art Object Date Distribution in the Met") +
  xlab("Year") +
  ylab("Number of Art Objects")
With the geom_histogram() function, the one specification worth setting is binwidth. What is that? A bin is simply the interval that each histogram bar covers, and binwidth sets how wide that interval is in the units of the x-axis. If you leave it out, ggplot2 falls back to 30 bins and prints a message suggesting that you pick a better value with binwidth. Since we’re working with dates, I prefer using binwidth = 10, so that each bin spans ten years, roughly a decade. We also add axis labels with the xlab() and ylab() functions.
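To make the idea of a bin concrete, here is a rough base R sketch of what counting by ten-year intervals means. The years are invented for illustration, and this is not how ggplot2 computes its bins internally; it only shows how the choice of width changes the number and height of the bars.

```r
# Toy begin dates, invented for illustration.
years <- c(1895, 1899, 1900, 1904, 1905, 1909, 1912)

# Width-10 bins: label each year with the start of its decade, then count.
by_decade <- table(floor(years / 10) * 10)
by_decade

# Width-50 bins: the same data collapses into fewer, taller bars.
by_half_century <- table(floor(years / 50) * 50)
by_half_century
```

One caveat: as far as I can tell from the ggplot2 documentation, geom_histogram() centers its bins on multiples of binwidth by default, so a bar may cover 1895 to 1905 rather than 1900 to 1910. If you want bars that line up with calendar decades, passing boundary = 0 alongside binwidth = 10 should do it.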
Here’s the output:
Well, it seems the data isn’t normally distributed. The x-axis represents years, and the y-axis represents the number of art objects whose begin date falls in each ten-year bin. Most of the Met’s collection seems to skew towards the modern period (1850-2000). But what about the Tate collection? Would we get a similar result? Let’s take a look.
ggplot(data = tate.subset, aes(x = year)) +
  geom_histogram(binwidth = 10) +
  ggtitle("Art Object Date Distribution in the Tate") +
  xlab("Year") +
  ylab("Number of Art Objects")
The histograms are completely different! As we can see here, the Tate collection is a lot more specialized: its earliest piece goes back only to 1576, and it seems to have a strong mode in the 19th century. For comparison, the earliest piece in the Met collection sits at our subset cut-off of 2000 B.C.
The Tate is therefore primarily a ‘modern’ collection, as it predominantly features art from 1800 onwards. If we wanted to study modern art and put the two collections on a level playing field, we would have to re-subset!
met.modern <- subset(met, Object.Begin.Date > 1850 & Object.Begin.Date < 2016)
tate.modern <- subset(tate, year > 1850 & year < 2016)
As we can see once we have drawn a simple histogram in R for each new subset, the Met’s modern art collection differs vastly from the Tate’s. Most of the Met art dates back to the period now known as the “fin de siècle”: the years leading up to the turn of the 20th century (1880-1900). The Tate, meanwhile, focuses on postwar and contemporary art (1950-current), and features very little art from before the 20th century.
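One way to put the two modern subsets side by side is to stack them into a single data frame and give each museum its own panel. The sketch below is only an illustration under a few assumptions: met.toy and tate.toy are invented stand-ins for met.modern and tate.modern, the museum column is a name I made up, and the requireNamespace() guard is there just so the snippet still runs on a machine without ggplot2.

```r
# Toy stand-ins for met.modern and tate.modern, invented for illustration.
# With the real data, swap in the two subsets created earlier.
met.toy  <- data.frame(Object.Begin.Date = c(1880, 1885, 1890, 1895, 1950))
tate.toy <- data.frame(year = c(1900, 1955, 1968, 1975, 1990))

# Stack both collections into one data frame with a shared "year" column.
modern <- rbind(
  data.frame(museum = "Met",  year = met.toy$Object.Begin.Date),
  data.frame(museum = "Tate", year = tate.toy$year)
)

# One panel per museum on a shared x-axis makes the shapes comparable.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(modern, aes(x = year)) +
    geom_histogram(binwidth = 10) +
    facet_wrap(~ museum, ncol = 1) +
    ggtitle("Art Object Date Distribution, 1850-2016") +
    xlab("Year") +
    ylab("Number of Art Objects")
  print(p)
}
```

Stacking the panels vertically keeps the years aligned, so a peak around 1890 in one collection and around 1970 in the other is easy to spot at a glance.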
We can see how R helps data scientists examine data through histograms. Thanks to them, we now know the shape of our data and the decade-by-decade distribution of each art collection. If we were to continue studying these datasets, the histograms would give us a good understanding of what we are drawing on when we think about the Met or the Tate collections. For these reasons, histograms do more than tell you whether or not your data is normally distributed: a histogram usefully visualizes the skew of your data and allows you to make informed assumptions based on the shape of your output.