{"id":2644,"date":"2017-04-20T18:43:22","date_gmt":"2017-04-21T01:43:22","guid":{"rendered":"https:\/\/www.springboard.com\/?p=2644"},"modified":"2023-07-08T17:56:05","modified_gmt":"2023-07-09T00:56:05","slug":"histogram-in-r","status":"publish","type":"post","link":"https:\/\/www.springboard.com\/blog\/data-science\/histogram-in-r\/","title":{"rendered":"Histograms in R: A Tutorial"},"content":{"rendered":"\n<p>Histograms are a useful way of inspecting a continuous dataset for the distribution of its values. In some ways, histograms can be the fastest way of understanding whether your data is normally distributed. If the data is not normally distributed, as is often the case, histograms allow one to inspect the skew and bias of the data set and account for it in&nbsp;an analysis.<\/p>\n\n\n\n<p><strong>The repository containing all of the code and examples needed to build the histograms in this tutorial can be found at this <a href=\"https:\/\/github.com\/Rogerh91\/Springboard-Blog-Tutorials\/tree\/master\/Histograms%20with%20R%20and%20ggplot2\" target=\"_blank\" rel=\"noreferrer noopener\">link<\/a>.<\/strong><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">How to Make a Histogram in R<\/h2>\n\n\n\n<p>Building a histogram in R is a quick way to explore the contours of your data and see where revisions need to be made. It&#8217;ll help you unearth insights from your data values and practice the first rudimentary steps of data science. By the end of this tutorial, you&#8217;ll be able to build a histogram in R using&nbsp;a handful of basic R and ggplot2 commands.<\/p>\n\n\n\n<p>In this tutorial, we will inspect the date distributions of two datasets using ggplot2&#8217;s histogram functionality. 
Hot on the heels of the previous bar chart tutorial, found <span class=\"c6\"><a href=\"https:\/\/www.springboard.com\/blog\/data-science\/ggplot2-in-r-tutorial\/\" target=\"_blank\" data-type=\"URL\" data-id=\"https:\/\/www.springboard.com\/blog\/data-science\/ggplot2-in-r-tutorial\/\" rel=\"noreferrer noopener\">here<\/a>, I will compare a random sample of 5,000 records from the Met Museum artwork database to an equal-size sample from the Tate Museum artwork database (both samples are provided in the repo of this tutorial). These histograms will therefore give us a look into the yearly distributions of the artworks held in these museums.<\/span><\/p>\n\n\n\n<p>As always, make sure you have ggplot2, the go-to data visualization&nbsp;library for R, installed and loaded. Here&#8217;s <a href=\"http:\/\/www.dummies.com\/programming\/r\/how-to-install-and-load-ggplot2-in-r\/\" target=\"_blank\" rel=\"noreferrer noopener\">a tutorial<\/a> on how to get ggplot2 installed if you need it.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><span class=\"c3\">library(ggplot2)<\/span><\/pre>\n\n\n\n<p id=\"h.joc300d25rr8\"><span class=\"c4\">Now, let\u2019s import the data. I have provided a random sample of each museum&#8217;s artwork dataset in the <a href=\"https:\/\/github.com\/Rogerh91\/Springboard-Blog-Tutorials\/tree\/master\/Histograms%20with%20R%20and%20ggplot2\" target=\"_blank\" rel=\"noreferrer noopener\">repo of this tutorial<\/a>. Each sample contains 5,000 randomly selected rows from its collection.<\/span><\/p>\n\n\n\n<p><span class=\"c4\">Make sure to import the CSVs as data frames, because the ggplot function we use later on works with data frames. 
An R data frame is one of the easiest data formats to manipulate with R functions.<\/span><\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><span class=\"c3\"># Load the two sample CSVs from the tutorial repo as data frames<\/span>\n<span class=\"c3\">met &lt;- as.data.frame(read.csv(\"MetObjects_5k-sample.csv\"))<\/span>\n<span class=\"c3\">tate &lt;- as.data.frame(read.csv(\"TateObjects_5k-sample.csv\"))<\/span><\/pre>\n\n\n\n<p id=\"h.a930ivglfxwe\"><span class=\"c4\">Next, we will filter the data by date. For our purposes, we want to make sure that the data we are investigating meets the same criteria in both datasets. For an early look, I want to make sure that the dates we are looking at fall between 2000 B.C. and 2016 C.E. This way, any non-matching outliers or mistakes will be omitted, and our histogram in R will give us a reasonable estimate of the shape of the data. To do this, use the subset() function:<\/span><\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><span class=\"c3\"># Keep only objects dated after 2000 B.C. and before 2016 C.E.<\/span>\n<span class=\"c3\">met.subset &lt;- subset(met, Object.Begin.Date &gt; -2000 &amp; Object.Begin.Date &lt; 2016)<\/span>\n<span class=\"c3\">tate.subset &lt;- subset(tate, year &gt; -2000 &amp; year &lt; 2016)<\/span><\/pre>\n\n\n\n<p id=\"h.ti6m40r1mctk\"><span class=\"c4\">This process ensures that we are looking at the exact same date window in both datasets, even though they are structured differently.<\/span><\/p>\n\n\n\n<p id=\"h.a188b3vjsy9p\"><span 
class=\"c4\">Ok, let\u2019s plot a histogram in R!<\/span><\/p>\n\n\n\n<p id=\"h.6ou5hjgf7odc\"><span class=\"c4\">I personally prefer building the plot with ggplot() and geom_histogram() as opposed to the quicker \u201cqplot\u201d shortcut, because the full syntax allows more robust customization. Here\u2019s my code:<\/span><\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><span class=\"c3\"># Histogram of Met object start dates in decade-wide bins<\/span>\n<span class=\"c3\">ggplot(data = met.subset, aes(x = Object.Begin.Date)) +<\/span>\n<span class=\"c3\">&nbsp; geom_histogram(binwidth = 10) +<\/span>\n<span class=\"c3\">&nbsp; ggtitle(\"Art Object Date Distribution in the Met\") + xlab(\"Year\") + ylab(\"Number of Art Objects\")<\/span><\/pre>\n\n\n\n<p id=\"h.saldry537qdm\"><span class=\"c4\">For the \u201cgeom_histogram()\u201d function, the one setting you should specify is \u201cbinwidth.\u201d What is that? A bin is an interval of values on the x-axis, and each bar counts the observations that fall inside that interval; the bin width is the size of the interval. If you leave it out, ggplot2 falls back to 30 bins and prints a message suggesting that you pick a better value. Since we\u2019re working with dates, I prefer using \u201cbinwidth = 10\u201d, where each bin represents a decade. We also add axis labels with the xlab and ylab functions.&nbsp;<\/span><\/p>\n\n\n\n<p id=\"h.b0i1s0oo95a\"><span class=\"c4\">Here\u2019s the output:<\/span><\/p>\n\n\n\n<p id=\"h.ui52ms8wb2ey\">Well, it seems the data <span class=\"c5\">isn\u2019t<\/span>&nbsp;normally distributed. The x-axis shows the years, and the y-axis shows the number of art objects that fall into each decade-wide bin. Most of the Met\u2019s collection seems to skew towards the modern period (1850-2000). But what about the Tate collection? Would we get a similar result? 
Let\u2019s take a look.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><span class=\"c3\"># Same plot for the Tate sample, again in decade-wide bins<\/span>\n<span class=\"c3\">ggplot(data = tate.subset, aes(x = year)) +<\/span>\n<span class=\"c3\">&nbsp; geom_histogram(binwidth = 10) +<\/span>\n<span class=\"c3\">&nbsp; ggtitle(\"Art Object Date Distribution in the Tate\") + xlab(\"Year\") + ylab(\"Number of Art Objects\")<\/span><\/pre>\n\n\n\n<p id=\"h.vsm3lygpyqgl\"><span class=\"c4\">The histograms are completely different! As we can see here, the Tate collection is a lot more specialized: its earliest piece dates back only to 1576, and it seems to have a strong mode in the 19th century. For comparison, the earliest piece in the Met sample sits right at our subset cut-off of 2000 B.C.<\/span><\/p>\n\n\n\n<p id=\"h.qec7zu2s10ov\"><span class=\"c4\">For that reason, the Tate is primarily a \u2018modern\u2019 collection, as it predominantly features art from 1800 onwards. If we wanted to study modern art and put the two collections on a level playing field, we would have to re-subset!<\/span><\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><span class=\"c3\"># Restrict both samples to objects dated after 1850 and before 2016<\/span>\n<span class=\"c3\">met.modern &lt;- subset(met, Object.Begin.Date &gt; 1850 &amp; Object.Begin.Date &lt; 2016)<\/span>\n<span class=\"c3\">tate.modern &lt;- subset(tate, year &gt; 1850 &amp; year &lt; 2016)<\/span><\/pre>\n\n\n\n<p id=\"h.10pa3i73rwod\">Once we plot a simple histogram in R for each of these subsets, we can see that the Met\u2019s modern art collection differs vastly from the Tate\u2019s. Most of the Met art dates back to the period now known as &#8220;<span class=\"c5\">fin de siecle&#8221;<\/span><span class=\"c4\">: the years prior to the turn of the 20th century (1880-1900). 
The Tate, meanwhile, focuses on postwar and contemporary art (1950-current), and features very little art from before the 20th century.<\/span><\/p>\n\n\n\n<p id=\"h.77emc1i7vc8w\">This shows how the R programming language can help <a href=\"https:\/\/www.springboard.com\/blog\/data-science\/what-does-a-data-scientist-do\/\" data-type=\"post\" data-id=\"24427\" target=\"_blank\" rel=\"noreferrer noopener\">data scientists<\/a> examine data with histograms, and how useful the technique is in the <a href=\"https:\/\/www.springboard.com\/blog\/data-science\/data-science-definition\/\" data-type=\"URL\" data-id=\"https:\/\/www.springboard.com\/blog\/data-science\/data-science-definition\/\" target=\"_blank\" rel=\"noreferrer noopener\">data science<\/a> field. Thanks to histograms, we now know the shape of our data and the decade distribution of each art collection. If we were to continue studying these <a href=\"https:\/\/www.springboard.com\/blog\/data-science\/free-public-data-sets-data-science-project\/\" data-type=\"URL\" data-id=\"https:\/\/www.springboard.com\/blog\/data-science\/free-public-data-sets-data-science-project\/\" target=\"_blank\" rel=\"noreferrer noopener\">datasets<\/a>, the histograms would give us a good understanding of what we are drawing on when we think about the Met or the Tate collections. For these reasons, histograms do more than tell you whether or not your data is normally distributed: a histogram plot visualizes the skew of your data and lets you make informed assumptions based on the shape of your output.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Histograms are a useful way of inspecting a continuous dataset for the distribution of its values. In some ways, histograms can be the fastest way of understanding whether your data is normally distributed. If the data is not normally distributed, as is often the case, histograms allow one to inspect the skew and bias [&hellip;]<\/p>\n","protected":false},"author":23,"featured_media":19145,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"_eb_attr":"","_eb_data_table":"","footnotes":""},"categories":[67],"tags":[],"marketing_tags":[],"class_list":{"0":"post-2644","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-data-science"},"acf":[],"_links":{"self":[{"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/posts\/2644"}],"collection":[{"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/users\/23"}],"replies":[{"embeddable":true,"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/comments?post=2644"}],"version-history":[{"count":3,"href":"https:\/\/www.spri
ngboard.com\/blog\/wp-json\/wp\/v2\/posts\/2644\/revisions"}],"predecessor-version":[{"id":47753,"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/posts\/2644\/revisions\/47753"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/media\/19145"}],"wp:attachment":[{"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/media?parent=2644"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/categories?post=2644"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/tags?post=2644"},{"taxonomy":"marketing_tags","embeddable":true,"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/marketing_tags?post=2644"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}