Jun 15, 2017

Machine Learning with R: A Tutorial on Building Text Classifiers

In this tutorial, we will use a handful of R packages to run a quick classifier on some Amazon reviews. The classifier should be able to predict whether a review is positive or negative with a fairly high degree of accuracy. To provide a clear working example of what classification can be used for, the data, retrieved from the Stanford Network Analysis Project, has been parsed into small text chunks and labelled appropriately. You can retrieve these files from the GitHub repo linked here.

Before we begin, it is important to mention that data curation, making sure that your information is properly categorized and labelled, is one of the most important parts of the whole process. In machine learning, the quality of your labels will often dictate the accuracy of your model. It is therefore worth going over how these files have been organized: the “Training” directory contains 400 1-star book reviews labelled “Neg” (for negative) and 400 5-star book reviews labelled “Pos” (for positive). This is our “gold standard”: we know these reviews are positive or negative based on the stars that the user assigned when writing the review. We will use the files in the “Training” directory to train our classifier, which will then use what it learned to predict whether the reviews in our “Test” directory are negative or positive. In this way, we will develop a classifier that can predict whether an Amazon book review, or a similar short text, reflects a positive or a negative customer experience with a given product. Thinking more broadly, this process is a bare-bones, entry-level attempt at using R to learn from and make predictions about human writing. It is a good introductory use case for machine learning with R.

This tutorial was run in a Jupyter Notebook with an R kernel, but you can run it in any environment that executes R scripts.

We will be using the R “caret,” “tm,” and “kernlab” packages to parse and machine-read the data and then train the model. If you don’t have those packages, use the following commands to install them. For more instructions on how to install R packages, click here.

install.packages("kernlab")
install.packages("caret")
install.packages("tm")
install.packages("dplyr")
install.packages("splitstackshape")
install.packages("e1071")
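If some of these packages are already on your machine, reinstalling them is unnecessary. The following is a minimal sketch, not part of the original tutorial, that installs only what is missing. The helper name `install_missing` and its `dry_run` argument are made up for this illustration.

```r
# Packages used in this tutorial.
pkgs <- c("kernlab", "caret", "tm", "dplyr", "splitstackshape", "e1071")

# Hypothetical helper: find the packages in `wanted` that are not installed
# and install them, unless dry_run is TRUE.
install_missing <- function(wanted, dry_run = FALSE) {
  needed <- setdiff(wanted, rownames(installed.packages()))
  if (length(needed) > 0 && !dry_run) {
    install.packages(needed)
  }
  invisible(needed)
}

# With dry_run = TRUE nothing is installed; we only see what would be.
needed <- install_missing(pkgs, dry_run = TRUE)
print(needed)
```

Calling `install_missing(pkgs)` without `dry_run` performs the actual installation.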

The “dplyr” and “splitstackshape” packages will help us manipulate the data and organize it into a shape the model can use. Now we can load the installed libraries and start doing machine learning with R.

library("kernlab") 
library("caret") 
library("tm") 
library("dplyr") 
library("splitstackshape")
library("e1071")

Our first step is ingesting and cleaning all of the data. For that you will need the “tm” package, whose “VCorpus” and “tm_map” functions make our data usable to the classifier. Below is a fairly large chunk of code, but hopefully the annotations make it clear what is happening in R:

# Step 1. Ingest your training data and clean it.

train <- VCorpus(DirSource("Training", encoding = "UTF-8"), readerControl=list(language="English"))
train <- tm_map(train, content_transformer(stripWhitespace))
train <- tm_map(train, content_transformer(tolower))
train <- tm_map(train, content_transformer(removeNumbers))
train <- tm_map(train, content_transformer(removePunctuation))

# Step 2. Create your document term matrices for the training data.

train.dtm <- as.matrix(DocumentTermMatrix(train, control=list(wordLengths=c(1,Inf))))

# Step 3. Repeat steps 1 & 2 above for the Test set.

test <- VCorpus(DirSource("Test", encoding = "UTF-8"), readerControl=list(language="English"))
test <- tm_map(test, content_transformer(stripWhitespace))
test <- tm_map(test, content_transformer(tolower))
test <- tm_map(test, content_transformer(removeNumbers))
test <- tm_map(test, content_transformer(removePunctuation))
test.dtm <- as.matrix(DocumentTermMatrix(test, control=list(wordLengths=c(1,Inf))))

The code above should give you two new data matrices: “train.dtm,” containing all of the words from the “Training” folder, and “test.dtm,” containing all of the words from the “Test” folder. For the vast majority of the tutorial, we will work with “train.dtm” to create, train, and validate our model. Iterating on your training data before touching your test data is an essential part of doing machine learning with R.
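Because the training and test sets go through the same cleaning steps, those steps can be wrapped in one function. The sketch below is one possible way to do that; `clean_corpus` is a hypothetical helper name and the two reviews are made up so that the example runs without any files on disk. Unlike Step 1, it strips whitespace last so that gaps left by removed numbers and punctuation are collapsed as well.

```r
library("tm")

# Hypothetical helper wrapping the four cleaning steps from Step 1.
clean_corpus <- function(corpus) {
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, stripWhitespace)
  corpus
}

# Two made-up reviews standing in for the files in the "Training" folder.
toy <- VCorpus(VectorSource(c("Loved it!! 5 stars.", "Hated it. 1 star...")))

# Same document-term matrix settings as in Step 2.
toy.dtm <- as.matrix(DocumentTermMatrix(clean_corpus(toy),
                                        control = list(wordLengths = c(1, Inf))))
print(toy.dtm)
```

Each row of the printed matrix is one document and each column is one term, with the cell holding how often that term occurs in that document.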

Our next two steps are data manipulations that the classifier function needs in order to work: 1) making sure that both data sets have the same number of columns, which means keeping only the words that appear in both matrices, and 2) making sure that our data has a column indicating whether each file is “Neg” (negative) or “Pos” (positive). Since we know these values for the training data, we extract the labels from the original file names and store them in a “corpus” column. For our testing data, we do not have these labels, so we fill the column with a dummy value instead; the model’s predictions take its place later.
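To see what taking the overlapping words means in practice, here is a small base-R illustration on toy matrices. The term names, counts, and file names are invented for the example.

```r
# Toy term-count matrices: rows are documents, columns are terms.
train.toy <- matrix(c(1, 0, 2, 1, 0, 3), nrow = 2,
                    dimnames = list(c("Pos_1.txt", "Neg_1.txt"),
                                    c("book", "great", "awful")))
test.toy <- matrix(c(0, 1, 1, 1), nrow = 2,
                   dimnames = list(c("a.txt", "b.txt"),
                                   c("awful", "book")))

# Terms that occur in both matrices, in the order they appear in train.toy.
shared <- intersect(colnames(train.toy), colnames(test.toy))

# Subsetting both matrices by the same vector gives identical columns.
train.small <- train.toy[, shared, drop = FALSE]
test.small <- test.toy[, shared, drop = FALSE]
print(colnames(train.small))
print(colnames(test.small))
```

The term “great” is dropped because it never occurs in the toy test set, so the model could not use it when predicting.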

# Step 4. Keep only the terms (columns) that appear in both matrices

shared.terms <- intersect(colnames(train.dtm), colnames(test.dtm))
train.df <- data.frame(train.dtm[, shared.terms])
test.df <- data.frame(test.dtm[, shared.terms])

# Step 5. Retrieve the correct labels for training data and put dummy values for testing data

label.df <- data.frame(row.names(train.df))
colnames(label.df) <- c("filenames")
label.df<- cSplit(label.df, 'filenames', sep="_", type.convert=FALSE)
train.df$corpus<- label.df$filenames_1
test.df$corpus <- c("Neg")
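Step 5 relies on each training file name starting with its label followed by an underscore, which is what splitting on "_" and taking the first piece implies. The exact file names are not shown in this tutorial, so the ones below are hypothetical. As a rough base-R equivalent of the cSplit() call:

```r
# Hypothetical file names; the real ones come from row.names(train.df).
filenames <- c("Neg_0001.txt", "Pos_0001.txt", "Pos_0002.txt")

# Keep everything before the first underscore.
labels <- sub("_.*$", "", filenames)

# Storing the labels as a factor makes the two classes explicit.
labels <- factor(labels, levels = c("Neg", "Pos"))
print(table(labels))
```

Printing the table is a quick check that both classes were picked up and that no file name produced an unexpected label.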

If all of these steps run successfully, you are ready to start running your classifier! It is important to note that we will not be cross-validating the model in this tutorial. More advanced users and researchers should look into creating folds within the data and cross-validating the model across multiple cuts of the data, to be sure that the results are accurate.
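For readers who want to try cross-validation, the following is a rough sketch of k-fold cross-validation with ksvm(), not the procedure used in this tutorial. The helper name `cv_accuracy` is hypothetical, the folds are assigned at random without stratifying by class, and a small synthetic data frame stands in for “train.df” so that the example runs on its own.

```r
library("kernlab")

# Hypothetical helper: k-fold cross-validated accuracy for a data frame
# whose label column is a factor named `corpus`.
cv_accuracy <- function(df, k = 5, seed = 1) {
  set.seed(seed)
  # Randomly assign every row to one of k folds of near-equal size.
  fold <- sample(rep(seq_len(k), length.out = nrow(df)))
  sapply(seq_len(k), function(i) {
    # Train on all folds except fold i, then predict fold i.
    fit <- ksvm(corpus ~ ., data = df[fold != i, ], kernel = "rbfdot")
    pred <- predict(fit, df[fold == i, ])
    mean(as.character(pred) == as.character(df$corpus[fold == i]))
  })
}

# Synthetic stand-in for train.df: two made-up term-count columns.
set.seed(42)
toy <- data.frame(good = rpois(60, 2), bad = rpois(60, 2))
toy$corpus <- factor(ifelse(toy$good >= toy$bad, "Pos", "Neg"))

acc <- cv_accuracy(toy, k = 5)
print(acc)        # one accuracy per held-out fold
print(mean(acc))  # average across folds
```

Each fold is held out exactly once, so every accuracy value is computed on rows the model did not see during training.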

In any case, we will run a single validation pass and summarize it with the confusion matrix below, which reports metrics for measuring the accuracy of the predictive machine learning model we just built:

# Step 6. Train the model once on the training data, then inspect the results

df.train <- train.df
df.test <- train.df
df.model<-ksvm(corpus~., data= df.train, kernel="rbfdot")
df.pred<-predict(df.model, df.test)
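# Recent versions of caret require both arguments to be factors with the same levels
con.matrix<-confusionMatrix(df.pred, factor(df.test$corpus, levels=levels(df.pred)))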
print(con.matrix)

As you can see above, we are using the training dataframes for both training and testing our model. If the process runs successfully, you should see this output:

          Reference
Prediction Neg Pos
       Neg 343   0
       Pos  57 400

               Accuracy : 0.9288
                 95% CI : (0.9087, 0.9456)
    No Information Rate : 0.5
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.8575
 Mcnemar's Test P-Value : 1.195e-13

            Sensitivity : 0.8575
            Specificity : 1.0000
         Pos Pred Value : 1.0000
         Neg Pred Value : 0.8753
             Prevalence : 0.5000
         Detection Rate : 0.4288
   Detection Prevalence : 0.4288
      Balanced Accuracy : 0.9287

       'Positive' Class : Neg

In the simplest terms, the confusion matrix summarizes how well the model predicted the same files that it was trained on. The “Accuracy” field, for instance, gives the share of files the classifier labelled correctly: in our case, a very high 92.9% (743 of 800). Keep in mind that this figure comes from the training data itself, so it is likely to overstate how well the classifier will do on reviews it has not seen.
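The headline metrics can be recomputed by hand from the four counts in the table, which helps in reading the rest of the output. This is a small base-R sketch using the counts printed above; “Neg” is treated as the positive class, as the output states.

```r
# The four counts from the confusion matrix (rows: prediction, columns: reference).
cm <- matrix(c(343, 57, 0, 400), nrow = 2,
             dimnames = list(Prediction = c("Neg", "Pos"),
                             Reference = c("Neg", "Pos")))

# Accuracy: correct predictions (the diagonal) over all predictions.
accuracy <- sum(diag(cm)) / sum(cm)

# Sensitivity: share of truly negative reviews that were predicted "Neg".
sensitivity <- cm["Neg", "Neg"] / sum(cm[, "Neg"])

# Specificity: share of truly positive reviews that were predicted "Pos".
specificity <- cm["Pos", "Pos"] / sum(cm[, "Pos"])

# Balanced accuracy: the mean of sensitivity and specificity.
balanced <- (sensitivity + specificity) / 2

print(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, balanced = balanced))
```

All 57 errors are negative reviews predicted as positive, which is why sensitivity is lower than specificity.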

In a more advanced scenario, you would cross-validate the model by repeating this process on several “folds,” which are random, non-overlapping subsets of your training data. For the example used above, our classifier appears to be good at determining whether an Amazon book review is negative or positive, so we can move on to testing. We’ve built something useful with our new knowledge of machine learning with R, and now it’s time to put it to use! Luckily, running the model on your testing data requires only one small change, reassigning “df.test”:

# Step 7. Run the final prediction on the test data and re-attach file names.

df.test <- test.df
df.pred <- predict(df.model, df.test)
results <- as.data.frame(df.pred)
rownames(results) <- rownames(test.df)
print(results)

The code above calls predict() on the test data and stores the output in the “results” data frame. It then reattaches the original file names as the row names of that data frame and prints the predicted label for each file in your test directory.
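With more than a handful of test files, a summary and a saved copy are usually easier to work with than a long printout. The sketch below shows one way to do that in base R. The predictions and file names are made up, and the output file name “predictions.csv” is only an example.

```r
# Made-up predictions standing in for the `results` data frame from Step 7.
results <- data.frame(
  df.pred = factor(c("Pos", "Neg", "Pos"), levels = c("Neg", "Pos")),
  row.names = c("review_a.txt", "review_b.txt", "review_c.txt")
)

# Count how many reviews fell into each class.
counts <- table(results$df.pred)
print(counts)

# Move the file names into a regular column and write everything to a CSV.
out <- data.frame(file = rownames(results), prediction = results$df.pred)
out.path <- file.path(tempdir(), "predictions.csv")
write.csv(out, out.path, row.names = FALSE)
```

Writing to tempdir() keeps the example from touching your working directory; replace `out.path` with a path of your choice to keep the file.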

In conclusion, the process outlined above helps you build a quick-start classifier in R that can categorize the sentiment of online book reviews with a fairly high degree of accuracy. Such a classifier is useful when you have a large quantity of user-submitted text that needs to be analyzed for sentiment around a product or a service, and it can more generally help a researcher build an algorithm that automatically flags bad or good reviews for research or moderation purposes. We hope this tutorial has given you a sense of the power of machine learning with R!