Text mining in R: a tutorial
This tutorial was built for people who want to learn the essential tasks required to process text for meaningful analysis in R, one of the most popular and open-source programming languages for data science. By the end of this tutorial, you’ll have developed the skills to read in large text files and derive meaningful insights that you can share from that analysis. You’ll have learned how to do text mining in R, an essential data mining skill. The tutorial is built to be followed along, with plenty of tangible code examples. The full repository with all of the files and data is here if you wish to follow along.
If you don’t have an R environment set up already, the easiest way to follow along is to use Jupyter with R. Jupyter offers an interactive R environment where you can easily modify inputs and see the outputs immediately, so you can get up to speed on text mining in R quickly.
Text mining definition
Natural languages (English, Hindi, Mandarin, etc.) are different from programming languages. The semantics, or meaning, of a statement depends on the context, the tone, and many other factors. Unlike programming languages, natural languages are ambiguous.
Text mining deals with helping computers understand the “meaning” of text. Common text mining applications include sentiment analysis, e.g. determining whether a Tweet about a movie is positive or negative, and text classification, e.g. classifying the mail you receive as spam or ham.
In this tutorial, we’ll learn about text mining and use some R libraries to implement some common text mining techniques. We’ll learn how to do sentiment analysis, how to build word clouds, and how to process your text so that you can do meaningful analysis with it.
R is succinctly described as “a language and environment for statistical computing and graphics,” which makes it worth knowing if you’re dabbling in data science, statistics, or exploratory data analysis. R has a wide variety of useful packages.
Here, we’ll focus on the R packages useful for understanding text and extracting insights from it.
In this tutorial, we will be using the following packages:
- RSQLite, ‘SQLite’ interface for R
- tm, a framework for text mining applications
- SnowballC, a text stemming library
- wordcloud, for making word cloud visualizations
- syuzhet, for text sentiment analysis
- ggplot2, one of the best data visualization libraries
- quanteda, for computing n-grams
You can install the aforementioned packages using the following command:
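# Install the packages used in this tutorial (a one-time setup step)
install.packages(c("RSQLite", "tm", "SnowballC", "wordcloud", "syuzhet", "ggplot2", "quanteda"))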
Preprocessing the text
Before we dive into analyzing text, we need to preprocess it. Text data contains white space, punctuation, stop words, etc. These characters do not convey much information and are hard to process. For example, English stop words like “the” and “is” tell you little about the sentiment of the text, the entities mentioned in it, or the relationships between those entities. Depending upon the task at hand, we deal with such characters differently; removing them helps focus the analysis on the important words.
Creating a word cloud
A word cloud is a simple yet informative way to understand textual data and to do text analysis. In this example, we will try to visualize Hillary Clinton’s emails. This will help us quantify the content of the emails, derive insights from them, and better communicate our results. Along the way, we’ll also learn about some data preprocessing steps that will be immensely helpful in other text mining tasks as well. Let’s start with getting the data. You can head over to Kaggle to download the dataset.
Let’s read the data and learn to implement the preprocessing steps.
library(RSQLite)
db <- dbConnect(dbDriver("SQLite"), "/Users/shubham/Documents/hillary-clinton-emails/database.sqlite")
# Get all the emails sent by Hillary
emailHillary <- dbGetQuery(db, "SELECT ExtractedBodyText EmailBody FROM Emails e INNER JOIN Persons p ON e.SenderPersonId = p.Id WHERE p.Name = 'Hillary Clinton' AND e.ExtractedBodyText != '' ORDER BY RANDOM()")
emailRaw <- paste(emailHillary$EmailBody, collapse = " // ")
The above code reads the “database.sqlite” file into R. SQLite is an embedded SQL database engine. Unlike most other SQL databases, SQLite does not have a separate server process; it reads and writes directly to ordinary disk files. So, you can read an SQLite file just as you would read a CSV or a text file. The same idea applies to any CSV, text, or other input file you can work with in R, though the reading function differs by format.
This guide shows how you would read different file formats, such as Excel, R, and .txt files, into R, as well as other data sources (including social media data).
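For comparison, here is a minimal sketch of what the same step might look like if the emails lived in a plain CSV file instead of SQLite. The file name emails.csv and its EmailBody column are hypothetical, purely for illustration:
# Hypothetical alternative: emails stored in a CSV file (file name and column name are assumptions)
emailsCsv <- read.csv("emails.csv", stringsAsFactors = FALSE)
emailRaw <- paste(emailsCsv$EmailBody, collapse = " // ")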
Here, we’ll use the package RSQLite to read in an SQLite file containing all of Hillary Clinton’s emails. Next, we query the column containing the email body text. Then we’ll be ready to do an analysis of the Clinton emails that shaped this political season.
We’ll perform the following steps to make sure we’re dealing with clean text:
- Convert the text to lower case, so that words like “write” and “Write” are considered the same word for analysis
- Remove numbers
- Remove English stop words, e.g. “the”, “is”, “of”, etc.
- Remove punctuation, e.g. “,”, “?”, etc.
- Eliminate extra white spaces
- Stem the text
Stemming is the process of reducing inflected (and sometimes derived) words to their word stem, base, or root form, e.g. changing “car”, “cars”, “car’s”, and “cars’” to “car”. This also collapses different tenses of a verb with the same semantic meaning, such as “digs”, “digging”, and “dig”.
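As a quick, standalone illustration, here is a minimal sketch using the wordStem() function from the SnowballC package (introduced more formally below); the sample words are arbitrary:
library("SnowballC")
# Reduce each word to its stem
wordStem(c("car", "cars", "digs", "digging", "dig"))
# Returns: "car" "car" "dig" "dig" "dig"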
One very useful package for performing the aforementioned steps is “tm”. The main structure for managing documents in tm is called a Corpus, which represents a collection of text documents.
# Transform and clean the text
library("tm")
docs <- Corpus(VectorSource(emailRaw))
Once we have our email corpus (all of Hillary’s emails) stored in the variable “docs”, we’ll want to transform the text within it using the techniques we discussed above, such as stemming and stop word removal. With the tm package, this can be done easily. Transformations are done via the tm_map() function: every transformation works on a single text document, and tm_map() simply applies it to all documents in the corpus. For example, to convert all the text of Hillary’s emails into lowercase at once, you’d use tm_map() as shown below.
# Convert the text to lower case
docs <- tm_map(docs, content_transformer(tolower))
# Remove numbers
docs <- tm_map(docs, removeNumbers)
# Remove common English stop words
docs <- tm_map(docs, removeWords, stopwords("english"))
# Remove punctuation
docs <- tm_map(docs, removePunctuation)
# Eliminate extra white spaces
docs <- tm_map(docs, stripWhitespace)
To stem text, we will need another library, known as SnowballC.
# Text stemming (reduces words to their root form)
library("SnowballC")
docs <- tm_map(docs, stemDocument)
# Remove additional, corpus-specific stop words
docs <- tm_map(docs, removeWords, c("clintonemailcom", "stategov", "hrod"))
A document-term matrix is an important representation for text mining tasks and an important concept in text analytics. Each row of the matrix is a document vector, with one column for every term in the entire corpus.
Naturally, some documents may not contain a given term, so this matrix is sparse. The value in each cell of the matrix is the term frequency, i.e. how often the term occurs in that document.
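To make the idea concrete, here is a toy sketch, separate from the email pipeline; the two short “documents” are made up:
# Toy example: a two-document corpus, unrelated to the emails
toy <- Corpus(VectorSource(c("the cat sat", "the dog sat and sat")))
# Each row is a document, each column a term, each cell a term frequency
inspect(DocumentTermMatrix(toy))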
tm makes it very easy to create this matrix. Note that the code below builds a TermDocumentMatrix, the transpose of a document-term matrix, with one row per term and one column per document; that is why we sum across rows to get each term’s frequency. With the matrix built, we can then proceed to build a word cloud for Hillary’s emails, highlighting which words are used most frequently.
dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)
head(d, 10)
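If you just want a quick look at the most common terms without building the full data frame, tm’s findFreqTerms() helper is handy; the threshold of 50 below is an arbitrary choice:
# Optional: list all terms that occur at least 50 times
findFreqTerms(dtm, lowfreq = 50)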
# Generate the word cloud
library("wordcloud")
library("RColorBrewer")
par(bg = "grey30")
png(file = "WordCloud.png", width = 1000, height = 700, bg = "grey30")
wordcloud(d$word, d$freq, col = terrain.colors(length(d$word), alpha = 0.9), random.order = FALSE, rot.per = 0.3)
title(main = "Hillary Clinton's Most Used Words in the Emails", font.main = 1, col.main = "cornsilk3", cex.main = 1.5)
dev.off()
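If the cloud comes out too crowded to read, wordcloud() also accepts a max.words argument that caps how many words are drawn, e.g. wordcloud(d$word, d$freq, max.words = 100).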
Sentiment analysis
Sentiment analysis is the process of determining whether a piece of writing is positive, negative, or neutral. Here, we’ll work with the package “syuzhet”.
Just as in the previous example, we’ll read the emails from the database.
Emails <- data.frame(dbGetQuery(db, "SELECT * FROM Emails"))
library('syuzhet')
“syuzhet” uses the NRC Emotion Lexicon, a list of words and their associations with eight emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) and two sentiments (negative and positive).
The get_nrc_sentiment function returns a data frame in which each row represents one element of the input text (here, one email). The columns include one for each emotion type as well as the positive and negative sentiment valence. It allows us to take a body of text and see which emotions it expresses, and also whether the overall sentiment is positive or negative.
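As a quick sanity check, you can run the function on a single made-up sentence to see the shape of the output:
# Illustration on one made-up sentence: returns a single row of emotion and sentiment counts
get_nrc_sentiment("I am thrilled about the trip, but a little scared of flying.")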
d <- get_nrc_sentiment(Emails$RawText)
td <- data.frame(t(d))
# rowSums sums each emotion's counts across the email columns
td_new <- data.frame(rowSums(td[2:7945]))
# Transformation and cleaning
names(td_new) <- "count"
td_new <- cbind("sentiment" = rownames(td_new), td_new)
rownames(td_new) <- NULL
# Keep the eight emotion rows, dropping the two sentiment rows
td_new2 <- td_new[1:8,]
Now, we’ll use “ggplot2” to create a bar graph in which each bar represents how prominent each emotion is in the text.
# Visualisation
library("ggplot2")
qplot(sentiment, data = td_new2, weight = count, geom = "bar", fill = sentiment) + ggtitle("Email sentiments")
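Note that qplot() is a convenience wrapper that has been deprecated in recent ggplot2 releases; an equivalent call in the standard ggplot() style would be:
ggplot(td_new2, aes(x = sentiment, y = count, fill = sentiment)) + geom_col() + ggtitle("Email sentiments")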
N-grams
You must have noticed YouTube’s auto-captioning feature. Auto-captioning is a speech recognition problem, and one part of generating captions automatically from audio is predicting which word comes after a given sequence of words. E.g.:
I’d like to make a …
Hopefully, you concluded that the next word in the sequence is “call”. We do this by first analyzing which words frequently co-occur, and we formalize this by introducing n-grams: an n-gram is a contiguous sequence of n items from a given sequence of text or speech. In other words, we’ll be finding collocations. A collocation is a sequence of words or terms that co-occur more often than would be expected by chance; an example is the term “very much”.
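Before reaching for a library, here is a minimal base-R sketch of what trigram extraction means; the sentence is just the example above, completed:
# Base-R illustration of trigrams: every contiguous run of three words
words <- c("I'd", "like", "to", "make", "a", "call")
trigrams <- sapply(1:(length(words) - 2), function(i) paste(words[i:(i + 2)], collapse = " "))
trigrams
# Returns: "I'd like to" "like to make" "to make a" "make a call"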
In this section, we’ll use the R package “quanteda” to compute bigrams and trigrams, i.e. commonly occurring sequences of two and three words.
library(tm)
library(RSQLite)
library(quanteda)
db <- dbConnect(dbDriver("SQLite"), "/Users/shubham/Documents/hillary-clinton-emails/database.sqlite")
# Get all the emails sent by Hillary
emailHillary <- dbGetQuery(db, "SELECT ExtractedBodyText EmailBody FROM Emails e INNER JOIN Persons p ON e.SenderPersonId = p.Id WHERE p.Name = 'Hillary Clinton' AND e.ExtractedBodyText != '' ORDER BY RANDOM()")
emails <- paste(emailHillary$EmailBody, collapse = " // ")
We will use quanteda’s collocations function to do so. Finally, we’ll remove stop words from the collocations so we get a clear view of the most frequently used two- and three-word sequences in Hillary’s emails.
collocations(emails, size = 2:3)
print(removeFeatures(collocations(emails, size = 2:3), stopwords("english")))
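One caveat: collocations() and removeFeatures() come from older quanteda releases; in recent versions this functionality has moved to textstat_collocations() (now in the companion quanteda.textstats package), so on a current install you may need to adapt these calls.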
We set out to show you how to perform some of the most common text mining tasks in R, with examples and sample code. Leave a comment below if you think we’re missing something or if you want to add something to this text mining in R discussion!
The author of this piece is Shubham Singh Tomar. Email firstname.lastname@example.org if you want to contribute to the blog.