The GitHub repository containing all of the code in this ggplot2 in R tutorial can be found here.
When I get my hands on a new dataset, I often want to take a quick look at the shape of the data and at preliminary results before developing my research any further. While many tutorials offer easy ways of plotting data in one way or another, few tutorials lead you through the first steps of data exploration in R. This ggplot2 in R tutorial will help you make sense of large datasets and give you a framework to do some exploratory graphing of your own.
This ggplot2 in R tutorial assumes that you have already installed R, an IDE of your choice (I use RStudio), as well as the ggplot2 package. All these programs and packages are easy to access and free to install, so if you don’t have them already, you can use this guide to figure out how to get started. Jupyter with R is the most intuitive way to start with R if you don’t have anything installed. You can install ggplot2 and other libraries using the install.packages command in R.
For the rest of the tutorial, I will be working on a sample dataset obtained from The Metropolitan Museum of Art in New York City. This dataset contains a set of metadata for all the artworks housed in the museum’s collection, and can be found on GitHub thanks to the Met Museum’s Open Access Initiative.
First things first: make sure you have installed your libraries. Insert the following lines of code at the top of your script.
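A minimal sketch of that setup step, assuming ggplot2 is the only package this tutorial needs. The conditional install and the CRAN mirror URL are my additions, not part of the original script:

```r
# Install ggplot2 only if it is not already available, then load it.
# The repos argument names a CRAN mirror so the call also works in
# non-interactive sessions; swap in any mirror you prefer.
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2", repos = "https://cloud.r-project.org")
}
library(ggplot2)
```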
You shouldn’t get any errors after running the code above if ggplot2 has been installed correctly.
Now, let’s read in the Metropolitan dataset, which is a raw CSV file.
met.collection <- read.csv(file = "~/Documents/Springboard-Blog/Springboard-Blog-Tutorials/data/MetObjects.csv")
Make sure you change the file path here to match the location of the file on your computer! Here’s a quick guide to how to import CSVs into R. You may also have to work with git-lfs (Git Large File Storage), an extension for managing large files in Git repositories, to get the CSV file we’re working with, as it exceeds 200 MB. Here’s a short tutorial on that.
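Before loading the full file, it can help to see what read.csv does on a tiny example. The sketch below writes a made-up three-row CSV to a temporary file and reads it back; the rows and the names toy_path and toy are illustrative, not taken from the Met data. It shows two behaviors worth knowing: read.csv replaces spaces in column headers with dots (presumably where a name like Artist.Nationality comes from), and blank text fields are kept as empty strings rather than turned into NA.

```r
# Write a tiny, made-up CSV to a temporary file so the example runs anywhere.
toy_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Object ID,Artist Nationality,City",
  "1,American,New York",
  "2,French,Paris",
  "3,,Paris"
), toy_path)

# stringsAsFactors = TRUE converts text columns to factors, which
# summary() and table() can then count level by level.
toy <- read.csv(file = toy_path, stringsAsFactors = TRUE)

dim(toy)    # number of rows and columns
names(toy)  # spaces in the headers have become dots
str(toy)    # column types; note the "" level in Artist.Nationality
```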
After R has ingested the table (it may take a while!), we can move to one of my favorite R functions: summary()!
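summary() is a great function because it looks at every column in your dataset and returns a compact set of statistics about it. If the column is made of numeric values, it returns the minimum, first quartile, median, mean, third quartile, and maximum (plus a count of missing values, if there are any).
If your data is composed of strings (such as in our case), the result depends on how the column was read. For a factor column, summary() returns the count of each of the most frequent values; for a plain character column, it only reports the length and type. Since R 4.0, read.csv keeps strings as characters by default, so pass stringsAsFactors = TRUE if you want the counts. The summary() function makes for a great first step for any exploratory data analysis using R.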
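A small sketch of these three cases on a made-up data frame (the column names height_cm and city are hypothetical and only serve the illustration):

```r
# Made-up data: one numeric column and one text column.
toy <- data.frame(
  height_cm = c(10.5, 22.0, 35.2, 18.1),
  city = c("Paris", "Paris", "New York", "London"),
  stringsAsFactors = FALSE
)

# Numeric column: minimum, quartiles, median, mean, and maximum.
summary(toy$height_cm)

# Character column: only length, class, and mode.
summary(toy$city)

# The same column as a factor: one count per distinct value.
summary(factor(toy$city))
```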
I decided to use the summary() function to narrow down where I should explore the data, since the dataset has 43 columns in total!
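With that many columns, the printed summary gets long. One complementary way to triage, sketched here on made-up data, is to count the distinct values in each column: columns with a handful of repeated values are natural candidates for bar plots, while ID-like columns with one value per row are not. This is my own heuristic for illustration, not a step from the original analysis.

```r
# Made-up stand-in for the real data frame.
toy <- data.frame(
  Object.ID = 1:6,
  Artist.Nationality = c("American", "American", "French", "", "French", "American"),
  City = c("New York", "Paris", "Paris", "", "Paris", "London"),
  stringsAsFactors = TRUE
)

ncol(toy)  # how many columns there are to look through

# Number of distinct values per column, smallest first.
distinct_counts <- sapply(toy, function(col) length(unique(col)))
sort(distinct_counts)
```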
This analysis pointed me to three interesting columns: which countries artists are from (their nationality), which cities they are from, and a column that records the number of artworks associated with a particular artist. While a lot of the most frequent values are unsurprising (the Met is an American museum, after all), some of the more interesting values are found in other columns, such as “City.” Paris, for instance, is the most frequent city for artworks across the whole collection, beating New York by a fairly wide margin, which suggests that Paris has been a particularly rich source of art for the museum.
Exploratory graphs of columns like these could help us find trends in the dataset that are ripe for further exploration. Let’s start with a bar plot of artists’ nationalities found in the Met Collection.
# Count how often each nationality appears, then sort from most to least frequent
nationality <- data.frame(table(met.collection$Artist.Nationality))
nationality <- nationality[order(nationality$Freq, decreasing = TRUE), ]
# Keep rows 2 through 11 (the first row is skipped)
df <- nationality[2:11, ]
ggplot(df, aes(x = Var1, y = Freq)) +
  geom_bar(stat = "identity", color = "black", fill = "grey") +
  labs(title = "Frequency by Country\n", x = "\nCountry", y = "Frequency\n") +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))  # rotate the x-axis labels
The above code creates a frequency table of all values found in the “Artist.Nationality” column in the dataframe, and then sorts it in descending order of frequency. I then take rows 2 through 11 (the ten most frequent values after the first row, which is skipped) and plot them as a bar graph, rotating the x-axis labels by 90 degrees to make them readable.
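To see what the table, sort, and slice steps do, here is a sketch on a made-up column. In this toy data the blank label is the most common value, which is one plausible reason to start the slice at row 2; whether the same holds for the real Met data is an assumption on my part.

```r
# Made-up nationality column with several blank entries.
toy <- data.frame(
  Artist.Nationality = c("American", "", "French", "American", "", "",
                         "", "Italian", "French", "American"),
  stringsAsFactors = TRUE
)

# table() counts each distinct value; data.frame() turns the counts into
# two columns, named Var1 (the value) and Freq (the count).
freq <- data.frame(table(toy$Artist.Nationality))

# Sort from most to least frequent.
freq <- freq[order(freq$Freq, decreasing = TRUE), ]
freq

# Skip the first row (the blank label here) and keep the next two.
top <- freq[2:3, ]
top
```

On the full dataset, the same steps produce the data frame that the bar plot draws.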
The resulting graph, found below, indicates several things: 1) the Met Collection is primarily an American collection, with some affinity for French artists; 2) the nationality labels need to be cleaned, especially the duplicate labels, so that the results can be read more easily.
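What that cleaning looks like depends on how the labels are actually duplicated, which I have not verified against the dataset. As a minimal sketch, assuming the duplicates come from stray whitespace and inconsistent capitalization (the values below are made up):

```r
# Made-up messy labels: extra spaces and inconsistent case.
raw <- c("American", " American", "american", "French ", "French")

# Strip leading and trailing whitespace.
clean <- trimws(raw)

# Capitalize the first letter and lowercase the rest.
clean <- paste0(toupper(substring(clean, 1, 1)), tolower(substring(clean, 2)))

# The five raw labels collapse into two clean ones.
table(clean)
```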
Let’s see if we can add nuance to the nationality data above by looking at the most popular cities of origin for the Met Collection Archives:
# Count how often each city appears, then sort from most to least frequent
city <- data.frame(table(met.collection$City))
city <- city[order(city$Freq, decreasing = TRUE), ]
# Keep rows 2 through 11 (the first row is skipped)
df <- city[2:11, ]
ggplot(df, aes(x = Var1, y = Freq)) +
  geom_bar(stat = "identity", color = "black", fill = "grey") +
  labs(title = "Frequency by City\n", x = "\nCity", y = "Frequency\n") +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))  # rotate the x-axis labels
Wow! Paris really does a number on New York and London. Venice, often thought of as a disproportionately large source of visual art, lags far behind the big culture capitals.
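One detail about these bar plots: ggplot2 draws a discrete x-axis in factor-level order, which is alphabetical for the output of table(), so sorting the data frame does not sort the bars. If you want the bars ordered by height, one option is reorder(). The sketch below uses made-up counts and geom_col(), which is shorthand for geom_bar(stat = "identity"):

```r
library(ggplot2)

# Made-up counts in the same shape as data.frame(table(...)): Var1 and Freq.
df <- data.frame(
  Var1 = factor(c("London", "New York", "Paris")),
  Freq = c(40, 70, 120)
)

# reorder() sorts the factor levels by -Freq, so the tallest bar comes first.
p <- ggplot(df, aes(x = reorder(Var1, -Freq), y = Freq)) +
  geom_col(color = "black", fill = "grey") +
  labs(title = "Frequency by City\n", x = "\nCity", y = "Frequency\n") +
  theme_classic()

print(p)
```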
Finally, after all of this geographic analysis, it might be worth knowing what time frame or period predominates in the Met Collection:
# Count how often each date label appears, then sort from most to least frequent
date <- data.frame(table(met.collection$Object.Date))
date <- date[order(date$Freq, decreasing = TRUE), ]
# Keep rows 3 through 11 (the first two rows are skipped)
df <- date[3:11, ]
ggplot(df, aes(x = Var1, y = Freq)) +
  geom_bar(stat = "identity", color = "black", fill = "grey") +
  labs(title = "Frequency by Date\n", x = "\nDate", y = "Frequency\n") +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))  # rotate the x-axis labels
The code above produces the plot below. The Met Collection is primarily composed of 19th- and 18th-century artworks, coming either from America or from Europe (mostly France or Italy). There seems to be a passing interest in art from ancient Egypt or Greece, but not much else by way of non-classical European artworks.
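Because table() counts each distinct date label separately, a century-level view needs the dates grouped first. A minimal sketch, assuming you have a numeric vector of years to work with (the year vector below is made up, and the arithmetic only holds for years of 1 CE or later):

```r
# Made-up years standing in for a numeric date column.
year <- c(1750, 1800, 1801, 1850, 1899, 1900, 1925)

# Years 1701 to 1800 belong to the 18th century, 1801 to 1900 to the 19th,
# and so on, hence the shift by one before the integer division.
century <- (year - 1) %/% 100 + 1

# Number of objects per century.
table(century)
```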
Through the use of R’s summary() function and the ggplot2 library, we’ve started breaking down a large dataset and looking for insights in this ggplot2 in R tutorial. That work is never finished in a proper data analysis: we urge you to take this ggplot2 in R tutorial and use it to dig for the insights you’d like to see.