{"id":2721,"date":"2017-06-15T17:02:59","date_gmt":"2017-06-16T00:02:59","guid":{"rendered":"https:\/\/www.springboard.com\/?p=2721"},"modified":"2023-07-07T02:40:48","modified_gmt":"2023-07-07T09:40:48","slug":"machine-learning-with-r","status":"publish","type":"post","link":"https:\/\/www.springboard.com\/blog\/data-science\/machine-learning-with-r\/","title":{"rendered":"Machine Learning With R: Building Text Classifiers"},"content":{"rendered":"\n<p><span style=\"font-weight: 400;\">In this tutorial, we will be using a host of R packages in order to run a quick classifier algorithm on some Amazon reviews. This classifier should be able to predict whether a review is positive or negative with a fairly high degree of accuracy. In an effort to provide a clear working example of what classification can be used for, this data, retrieved from the Stanford Network Analysis Project, has been parsed into small text chunks and labelled appropriately. You can retrieve these files from the Github repo linked <\/span><strong><a href=\"https:\/\/github.com\/Rogerh91\/Springboard-Blog-Tutorials\/tree\/master\/Machine%20Learning%20with%20R%20Tutorial\" target=\"_blank\" rel=\"noopener\">here<\/a><\/strong><span style=\"font-weight: 400;\">.<\/span><\/p>\n\n\n\n<p><em><strong>Related<\/strong>: <a href=\"https:\/\/www.springboard.com\/blog\/data-science\/text-mining-in-r\/\" target=\"_blank\" data-type=\"URL\" data-id=\"https:\/\/www.springboard.com\/blog\/data-science\/text-mining-in-r\/\" rel=\"noreferrer noopener\">Text Mining in R: A Tutorial<\/a><\/em><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">Before we begin, it is important to mention that data curation &#8212; making sure that your information is properly categorized and labelled &#8212; is one of the most important parts of the whole process! In machine learning, the labelling and classification of your data will often dictate the accuracy of your model. 
That being said, it is worth going over how these files have been organized and labelled: the \u201cTraining\u201d directory contains 400 1-star book reviews labelled \u201cNeg\u201d (for negative) and 400 5-star book reviews labelled \u201cPos\u201d (for positive). This is our \u201cgold standard\u201d: we know these reviews are positive or negative based on the stars that the user assigned to them when they wrote the review. We will use the files in the \u201cTraining\u201d directory to train our classifier, which will then use what it learned to predict whether the reviews in our \u201cTest\u201d directory are negative or positive. In this way, we will develop a machine-learned classifier that can accurately predict whether an Amazon book review, or any short text, reflects a positive or a negative customer experience with a given product. Thinking more broadly, this process reflects a bare-bones, entry-level attempt at using R to learn and make predictions about human writing. 
This is a very effective use case of machine learning with R.&nbsp;<\/span><\/p>\n\n\n\n<p><em>Want more data?\u00a0 Check out our <a href=\"https:\/\/www.springboard.com\/blog\/data-science\/free-public-data-sets-data-science-project\/\" target=\"_blank\" data-type=\"URL\" data-id=\"https:\/\/www.springboard.com\/blog\/data-science\/free-public-data-sets-data-science-project\/\" rel=\"noreferrer noopener\">list of free public datasets<\/a>.<\/em><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">A Look at Machine Learning in R<\/h2>\n\n\n\n<p>This tutorial was run in a <a href=\"http:\/\/blog.revolutionanalytics.com\/2015\/09\/using-r-with-jupyter-notebooks.html\" target=\"_blank\" rel=\"noopener\">Jupyter Notebook with an R kernel<\/a>.<\/p>\n\n\n\n<p>You can run it in any environment that executes R scripts.<\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">We will be using the R \u201ccaret,\u201d \u201ctm,\u201d and \u201ckernlab\u201d packages to parse the data and then train the model. If you don&#8217;t have those packages, use the following commands to install them. For more instructions on how to install R packages, click <a href=\"https:\/\/www.r-bloggers.com\/installing-r-packages\/\" target=\"_blank\" rel=\"noopener\">here<\/a>.<\/span><\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">install.packages(\"kernlab\")\ninstall.packages(\"caret\")\ninstall.packages(\"tm\")\ninstall.packages(\"dplyr\")\ninstall.packages(\"splitstackshape\")\ninstall.packages(\"e1071\")<\/pre>\n\n\n\n<p><span style=\"font-weight: 400;\">The \u201cdplyr\u201d and \u201csplitstackshape\u201d packages will help us manipulate the data and organize it so that the model can use it. 
Now, we can activate the installed libraries and start doing machine learning with R.&nbsp;<\/span><\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">library(\"kernlab\") \nlibrary(\"caret\") \nlibrary(\"tm\") \nlibrary(\"dplyr\") \nlibrary(\"splitstackshape\")\nlibrary(\"e1071\")<\/pre>\n\n\n\n<p><span style=\"font-weight: 400;\">Our first step is ingesting and cleaning all of the data. For that you will need the \u201ctm&#8221; package, which uses the \u201cVCorpus\u201d functions and \u201ctm_map\u201d functions to make our data usable to the classifier. Below is a fairly large chunk of code, but hopefully the annotation makes it fairly straightforward with what is happening in R:<\/span><\/p>\n\n\n\n<p><strong># Step 1. 
Ingest your training data and clean it.<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><span style=\"font-weight: 400;\">train &lt;- VCorpus(DirSource(\"Training\", encoding = \"UTF-8\"), readerControl=list(language=\"English\"))<\/span>\n<span style=\"font-weight: 400;\">train &lt;- tm_map(train, content_transformer(stripWhitespace))<\/span>\n<span style=\"font-weight: 400;\">train &lt;- tm_map(train, content_transformer(tolower))<\/span>\n<span style=\"font-weight: 400;\">train &lt;- tm_map(train, content_transformer(removeNumbers))<\/span>\n<span style=\"font-weight: 400;\">train &lt;- tm_map(train, content_transformer(removePunctuation))<\/span><\/pre>\n\n\n\n<p><strong># Step 2. Create the document-term matrix for the training data.<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><span style=\"font-weight: 400;\">train.dtm &lt;- as.matrix(DocumentTermMatrix(train, control=list(wordLengths=c(1,Inf))))<\/span><\/pre>\n\n\n\n<p><strong># Step 3. Repeat steps 1 &amp; 2 above for the Test set.<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><span style=\"font-weight: 400;\">test &lt;- VCorpus(DirSource(\"Test\", encoding = \"UTF-8\"), readerControl=list(language=\"English\"))<\/span>\n<span style=\"font-weight: 400;\">test &lt;- tm_map(test, content_transformer(stripWhitespace))<\/span>\n<span style=\"font-weight: 400;\">test &lt;- tm_map(test, content_transformer(tolower))<\/span>\n<span style=\"font-weight: 400;\">test &lt;- tm_map(test, content_transformer(removeNumbers))<\/span>\n<span style=\"font-weight: 400;\">test &lt;- tm_map(test, content_transformer(removePunctuation))<\/span>\n<span style=\"font-weight: 400;\">test.dtm &lt;- as.matrix(DocumentTermMatrix(test, control=list(wordLengths=c(1,Inf))))<\/span><\/pre>\n\n\n\n<p><span style=\"font-weight: 400;\">The code above should net you two new data matrices: one \u201ctrain.dtm,\u201d containing all of the words from the \u201cTraining\u201d folder, and a 
\u201ctest.dtm\u201d matrix, containing all of the words from the \u201cTest\u201d folder. For the vast majority of the tutorial, we will be working with \u201ctrain.dtm\u201d in order to create, train, and validate our model. Iterating with your training data and then working with your test data is an essential part of doing machine learning with R.&nbsp;<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">Our next two steps cover two aspects of data manipulation that the classifier function needs in order to work: 1) making sure that our data sets have the same number of columns, meaning that we keep only the words that appear in both matrices, and 2) making sure that our data has a column that indicates whether each file is \u201cNeg\u201d (negative) or \u201cPos\u201d (positive). Since we know these values for the training data, we have to separate out the labels from the original file names and add them to the data as a \u201ccorpus\u201d column. For our testing data, we do not have these labels, so we put in dummy values instead, since the model will predict them later. <\/span><\/p>\n\n\n\n<p><strong># Step 4. Make the test and train matrices share the same columns (find intersection)<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><span style=\"font-weight: 400;\">train.df &lt;- data.frame(train.dtm[,intersect(colnames(train.dtm), colnames(test.dtm))])<\/span>\n<span style=\"font-weight: 400;\">test.df &lt;- data.frame(test.dtm[,intersect(colnames(train.dtm), colnames(test.dtm))])<\/span><\/pre>\n\n\n\n<p><strong># Step 5. 
Retrieve the correct labels for training data and put dummy values for testing data<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><span style=\"font-weight: 400;\">label.df &lt;- data.frame(row.names(train.df))<\/span>\n<span style=\"font-weight: 400;\">colnames(label.df) &lt;- c(\"filenames\")<\/span>\n<span style=\"font-weight: 400;\">label.df &lt;- cSplit(label.df, 'filenames', sep=\"_\", type.convert=FALSE)<\/span>\n<span style=\"font-weight: 400;\">train.df$corpus &lt;- factor(label.df$filenames_1)<\/span>\n<span style=\"font-weight: 400;\">test.df$corpus &lt;- factor(\"Neg\", levels = levels(train.df$corpus))<\/span><\/pre>\n\n\n\n<p><span style=\"font-weight: 400;\">If all of these steps run successfully, you are ready to start running your classifier! Note that the labels are stored as a factor, which both the classifier and the confusion matrix expect for a classification task. It is important to note that we will not be running cross-validation of the model in this tutorial, although more advanced users and researchers should look into creating folds within the data and cross-validating the model across multiple cuts of the data in order to be sure that the results are reliable. <\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">In any case, this tutorial runs only one validation and summarizes it with the confusion matrix, below, which reports metrics for measuring the accuracy of the predictive machine learning model we just built:<\/span><\/p>\n\n\n\n<p><strong># Step 6. 
Train the model once on the training data and inspect the results<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><span style=\"font-weight: 400;\">df.train &lt;- train.df<\/span>\n<span style=\"font-weight: 400;\">df.train$corpus &lt;- as.factor(df.train$corpus)<\/span>\n<span style=\"font-weight: 400;\">df.test &lt;- df.train<\/span>\n<span style=\"font-weight: 400;\">df.model &lt;- ksvm(corpus~., data = df.train, kernel=\"rbfdot\")<\/span>\n<span style=\"font-weight: 400;\">df.pred &lt;- predict(df.model, df.test)<\/span>\n<span style=\"font-weight: 400;\">con.matrix &lt;- confusionMatrix(df.pred, df.test$corpus)<\/span>\n<span style=\"font-weight: 400;\">print(con.matrix)<\/span><\/pre>\n\n\n\n<p><span style=\"font-weight: 400;\">As you can see above, we are using the training data frame for both training and testing our model. If the process runs successfully, you should see output similar to this:<\/span><\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><span style=\"font-weight: 400;\"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Reference<\/span>\n\n<span style=\"font-weight: 400;\">Prediction Neg Pos<\/span>\n\n<span style=\"font-weight: 400;\"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Neg 343 &nbsp;&nbsp;0<\/span>\n\n<span style=\"font-weight: 400;\"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Pos &nbsp;57 400<\/span>\n\n<span style=\"font-weight: 400;\"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Accuracy : 0.9288<\/span>\n\n<span style=\"font-weight: 400;\"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;95% CI : (0.9087, 0.9456)<\/span>\n\n<span style=\"font-weight: 400;\"> &nbsp;&nbsp;&nbsp;No Information Rate : 0.5 
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<\/span>\n\n<span style=\"font-weight: 400;\"> &nbsp;&nbsp;&nbsp;P-Value [Acc &gt; NIR] : &lt; 2.2e-16 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<\/span><span style=\"font-weight: 400;\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<\/span>\n\n<span style=\"font-weight: 400;\"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Kappa : 0.8575 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<\/span>\n\n<span style=\"font-weight: 400;\"> Mcnemar's Test P-Value : 1.195e-13 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<\/span>\n\n<span style=\"font-weight: 400;\"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<\/span>\n<span style=\"font-weight: 400;\"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Sensitivity : 0.8575 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<\/span>\n\n<span style=\"font-weight: 400;\"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Specificity : 1.0000 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<\/span>\n\n<span style=\"font-weight: 400;\"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Pos Pred Value : 1.0000 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<\/span>\n\n<span style=\"font-weight: 400;\"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Neg Pred Value : 0.8753 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<\/span>\n\n<span style=\"font-weight: 400;\"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Prevalence : 0.5000 
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<\/span>\n\n<span style=\"font-weight: 400;\"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Detection Rate : 0.4288 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<\/span>\n\n<span style=\"font-weight: 400;\"> &nbsp;&nbsp;Detection Prevalence : 0.4288 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<\/span>\n\n<span style=\"font-weight: 400;\"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Balanced Accuracy : 0.9287 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<\/span>\n<span style=\"font-weight: 400;\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<\/span>\n<span style=\"font-weight: 400;\"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;'Positive' Class : Neg &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<\/span><\/pre>\n\n\n\n<p><span style=\"font-weight: 400;\">In the simplest terms possible, the confusion matrix provides you with the back-end output and analysis of the model\u2019s performance in predicting the same files that it was trained on. The \u201cAccuracy\u201d field, for instance, gives us a quick estimate of what percent of the files the classifier predicted correctly: in our case, it was at a very high 92.8%! 
That means that roughly 93 percent of the time the classifier correctly determined whether a file was positive or negative based on its content alone. Keep in mind that these are the same files the model was trained on, so this figure is likely more optimistic than what you would see on unseen reviews.<\/span><\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2017\/06\/cranium-2099083_640-300x182.png\" alt=\"machine learning with r\" class=\"wp-image-2727\"\/><\/figure>\n\n\n\n<p><span style=\"font-weight: 400;\">In a more advanced scenario, you would cross-validate your model by running the same process on several \u201cfolds,\u201d which are basically random subsets of your training data. For the example used above, our classifier appears to be pretty good at determining whether an Amazon book review is negative or positive, so we can move on to our testing. We&#8217;ve built something useful with our new knowledge of machine learning with R, and now it&#8217;s time to put it to use! Luckily, running the model on your testing data requires only one small change, namely the data frame assigned to \u201cdf.test\u201d:<\/span><\/p>\n\n\n\n<p><strong># Step 7. Run the final prediction on the test data and re-attach file names. <\/strong><\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><span style=\"font-weight: 400;\">df.test &lt;- test.df<\/span>\n<span style=\"font-weight: 400;\">df.pred &lt;- predict(df.model, df.test)<\/span>\n<span style=\"font-weight: 400;\">results &lt;- as.data.frame(df.pred)<\/span>\n<span style=\"font-weight: 400;\">rownames(results) &lt;- rownames(test.df)<\/span>\n<span style=\"font-weight: 400;\">print(results)<\/span><\/pre>\n\n\n\n<p><span style=\"font-weight: 400;\">The code above runs the predict() function on the test data and puts the output in the \u201cresults\u201d data frame. 
We can then reattach the original filenames to the rownames of the new results vector, and produce the machine learning predictions of your test directory.<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">In conclusion, the process of building something with machine learning with R, enumerated above, helps you build a quick-start classifier that can categorize the sentiment of online book reviews with a fairly high degree of accuracy. Such a classifier is useful when you have a large quantity of user-submitted text that needs to be analyzed for sentiments around a product or a service, and can more generally help a researcher build an algorithm that can weed out bad or good reviews automatically for either research or moderation purposes. R is an extremely important language in <a href=\"https:\/\/www.springboard.com\/blog\/data-science\/data-science-definition\/\" target=\"_blank\" data-type=\"URL\" data-id=\"https:\/\/www.springboard.com\/blog\/data-science\/data-science-definition\/\" rel=\"noreferrer noopener\">data science<\/a>. Every aspiring <a href=\"https:\/\/www.springboard.com\/blog\/data-science\/what-does-a-data-scientist-do\/\" target=\"_blank\" data-type=\"URL\" data-id=\"https:\/\/www.springboard.com\/blog\/data-science\/what-does-a-data-scientist-do\/\" rel=\"noreferrer noopener\">data scientist<\/a> should have a sound knowledge of R.<\/span> <span style=\"font-weight: 400;\">We hope this tutorial has given you a sense of the power of machine learning with R!<\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>In this tutorial, we will be using a host of R packages in order to run a quick classifier algorithm on some Amazon reviews. This classifier should be able to predict whether a review is positive or negative with a fairly high degree of accuracy. In an effort to provide a clear working example of 
[&hellip;]<\/p>\n","protected":false},"author":23,"featured_media":19144,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"_eb_attr":"","_eb_data_table":"","footnotes":""},"categories":[67],"tags":[],"marketing_tags":[],"class_list":{"0":"post-2721","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-data-science"},"acf":[],"_links":{"self":[{"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/posts\/2721"}],"collection":[{"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/users\/23"}],"replies":[{"embeddable":true,"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/comments?post=2721"}],"version-history":[{"count":3,"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/posts\/2721\/revisions"}],"predecessor-version":[{"id":47506,"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/posts\/2721\/revisions\/47506"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/media\/19144"}],"wp:attachment":[{"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/media?parent=2721"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/categories?post=2721"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/tags?post=2721"},{"taxonomy":"marketing_tags","embeddable":true,"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/marketing_tags?post=2721"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}