Machine Learning With R: Building Text Classifiers



In this tutorial, we will be using a host of R packages in order to run a quick classifier algorithm on some Amazon reviews. This classifier should be able to predict whether a review is positive or negative with a fairly high degree of accuracy. In an effort to provide a clear working example of what classification can be used for, this data, retrieved from the Stanford Network Analysis Project, has been parsed into small text chunks and labelled appropriately.

Before we start, note that data curation (ensuring that your data is appropriately organized and labelled) is one of the most important parts of the whole process! In machine learning, the labelling and organization of your data will often dictate the accuracy of your model. That being said, it is worth going over how these files have been organized and labelled: the “Training” directory contains 400 1-star book reviews labelled “Neg” (for negative) and 400 5-star book reviews labelled “Pos” (for positive). This is our “gold standard”: we know these reviews are positive or negative based on the stars that the user assigned to them when they wrote the review. We will use the files in the “Training” directory to train our classifier, which will then use what it learned from the training directory to predict whether the reviews in our “Test” directory are negative or positive. In this way, we will develop a machine-learned classifier that can accurately predict whether an Amazon book review, or any short text, reflects a positive or a negative customer experience with a given product. Thinking more broadly, this process is a bare-bones, entry-level attempt at using R to learn and make predictions about human writing. This is a very practical use case of machine learning with R.

A Look at Machine Learning in R

You can run the code in this tutorial in any environment that executes R scripts.

We will be using the R “caret,” “tm,” and “kernlab” packages to parse and machine-read the data and then train the model. If you don’t have those packages, install them with install.packages() before continuing; its help page (?install.packages) gives more instructions on how to install R packages.
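A minimal sketch of the installation step, assuming the five packages named in this tutorial are all that is needed. The helper missing_packages() is a hypothetical name introduced here for illustration, and the actual install.packages() call is left commented out so that nothing is downloaded unless you choose to.

```r
# Packages named in this tutorial.
required.pkgs <- c("caret", "tm", "kernlab", "dplyr", "splitstackshape")

# Hypothetical helper: return the entries of `pkgs` that are not installed yet.
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

to.install <- missing_packages(required.pkgs)
if (length(to.install) > 0) {
  message("Missing packages: ", paste(to.install, collapse = ", "))
  # Uncomment to install from CRAN (requires an internet connection):
  # install.packages(to.install)
}
```

Checking first avoids reinstalling packages that are already present.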


The “dplyr” and “splitstackshape” packages will help us manipulate the data and organize it in such a way that the model can make use of the data. Now, we can activate the installed libraries and start doing machine learning with R.
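Activating the libraries could look like the sketch below. It assumes the five packages named in this tutorial are the ones to load; the "loaded" vector is an illustrative name, not part of the original code. Using require() rather than library() is a choice made here so that a missing package is reported instead of stopping the script.

```r
# Packages named in this tutorial.
pkgs <- c("tm", "caret", "kernlab", "dplyr", "splitstackshape")

# Try to attach each package; require() returns TRUE on success, FALSE otherwise.
loaded <- vapply(pkgs, function(pkg) {
  suppressPackageStartupMessages(
    require(pkg, character.only = TRUE, quietly = TRUE)
  )
}, logical(1))

# Named logical vector: which packages are now attached.
print(loaded)
```

If every entry is TRUE, the rest of the tutorial can run as written. In a short script, a plain library(tm), library(caret), and so on does the same job.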


Our first step is ingesting and cleaning all of the data. For that you will need the “tm” package, which provides the “VCorpus” and “tm_map” functions to make our data usable to the classifier. Below is a fairly large chunk of code, but hopefully the comments make it fairly clear what is going on in R:

# Step 1. Ingest your training data and clean it.

train <- VCorpus(DirSource("Training", encoding = "UTF-8"), readerControl = list(language = "English"))
train <- tm_map(train, content_transformer(stripWhitespace))
train <- tm_map(train, content_transformer(tolower))
train <- tm_map(train, content_transformer(removeNumbers))
train <- tm_map(train, content_transformer(removePunctuation))

# Step 2. Create your document term matrices for the training data.

train.dtm <- as.matrix(DocumentTermMatrix(train, control=list(wordLengths=c(1,Inf))))

# Step 3. Repeat steps 1 & 2 above for the Test set.

test <- VCorpus(DirSource("Test", encoding = "UTF-8"), readerControl = list(language = "English"))
test <- tm_map(test, content_transformer(stripWhitespace))
test <- tm_map(test, content_transformer(tolower))
test <- tm_map(test, content_transformer(removeNumbers))
test <- tm_map(test, content_transformer(removePunctuation))
test.dtm <- as.matrix(DocumentTermMatrix(test, control=list(wordLengths=c(1,Inf))))

The code above should net you two new data matrices: a “train.dtm” matrix, containing all of the words from the “Training” folder, and a “test.dtm” matrix, containing all of the words from the “Test” folder. For the vast majority of the tutorial, we will be working with “train.dtm” in order to create, train, and validate our results. Iterating with your training data and then working with your test data is an essential part of doing machine learning with R.
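To make the idea of a document-term matrix concrete, here is a small sketch that builds one by hand in base R. It is a stand-in for illustration only and does not use the tm functions from the steps above; the two example reviews and their names are made up.

```r
# Two made-up reviews, named the way labelled training files might be named.
docs <- c(Pos_1 = "Great book, loved it!", Neg_1 = "Terrible book.")

# Same kind of cleaning as above: lower-case, drop punctuation and digits.
clean <- tolower(docs)
clean <- gsub("[[:punct:]]", "", clean)
clean <- gsub("[[:digit:]]", "", clean)

# Split each document into words and collect the overall vocabulary.
tokens <- strsplit(trimws(clean), "\\s+")
vocab <- sort(unique(unlist(tokens)))

# One row per document, one column per word, cells hold word counts.
toy.dtm <- t(vapply(tokens, function(words) {
  tabulate(match(words, vocab), nbins = length(vocab))
}, integer(length(vocab))))
colnames(toy.dtm) <- vocab

print(toy.dtm)
```

Each row is a document and each column a word, which is the same shape that as.matrix(DocumentTermMatrix(...)) returns for the real corpora.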

Our next two steps cover two important parts of the data manipulation process that we need in order to make sure that the classifier function works: 1) making sure that our data sets have the same number of columns, meaning that we only keep the words that appear in both matrices, and 2) making sure that our data has a column that states whether each file is “Neg” (negative) or “Pos” (positive). Since we know these values for the training data, we need to separate the labels out from the original file names and store them in the “corpus” column of the data. For our testing data, we don’t have these labels, so we put in dummy values instead (which will be filled in later).

# Step 4. Make test and train matrices of identical length (find intersection)

train.df <- data.frame(train.dtm[,intersect(colnames(train.dtm), colnames(test.dtm))])
test.df <- data.frame(test.dtm[,intersect(colnames(test.dtm), colnames(train.dtm))])

# Step 5. Retrieve the correct labels for training data and put dummy values for testing data

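label.df <- data.frame(row.names(train.df))
colnames(label.df) <- c("filenames")
# Split each file name on "_"; the first piece is the label ("Neg" or "Pos")
label.df <- cSplit(label.df, "filenames", sep = "_", type.convert = FALSE)
# Store the labels as a factor, which confusionMatrix() expects in recent caret versions
train.df$corpus <- factor(label.df$filenames_1)
# Placeholder label for the test set, whose real labels are unknown
test.df$corpus <- c("Neg")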
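Steps 4 and 5 can be tried on toy data first. The sketch below uses made-up matrices and file names, and it swaps in base R's strsplit() for cSplit() so that it runs without extra packages. The assumption that the label is the text before the first underscore comes from the split on "_" in Step 5.

```r
# Two toy document-term matrices with partly different vocabularies.
toy.train <- matrix(c(1, 0, 2,
                      0, 1, 1), nrow = 2, byrow = TRUE,
                    dimnames = list(c("Pos_1.txt", "Neg_1.txt"),
                                    c("book", "great", "plot")))
toy.test <- matrix(c(1, 1, 0), nrow = 1,
                   dimnames = list("unknown_1.txt",
                                   c("book", "plot", "slow")))

# Keep only the words that appear in both matrices.
shared <- intersect(colnames(toy.train), colnames(toy.test))
toy.train.df <- data.frame(toy.train[, shared, drop = FALSE])
toy.test.df  <- data.frame(toy.test[, shared, drop = FALSE])

# The label is assumed to be the part of the file name before the first "_".
labels <- vapply(strsplit(rownames(toy.train.df), "_", fixed = TRUE),
                 `[`, character(1), 1)
toy.train.df$corpus <- factor(labels)

print(toy.train.df)
print(toy.test.df)
```

The drop = FALSE argument keeps a one-row matrix from collapsing into a plain vector, which matters if a directory holds only a single file.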

If all of these steps run successfully, you are ready to start running your classifier! It is important to note that we will not be running cross-validation of the model in this tutorial. More advanced users and researchers should look into creating folds within the data and cross-validating the model across multiple cuts of the data, in order to be sure that the results are accurate.
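For readers who do want to try folds, a minimal base R sketch of the fold assignment is shown here. The choice of five folds and the seed are arbitrary assumptions, the split is not stratified by label, and the model fitting itself is left as a comment.

```r
set.seed(42)     # arbitrary seed so the split is reproducible

n.docs <- 800    # 400 "Neg" + 400 "Pos" training reviews
k <- 5           # number of folds (an arbitrary choice)

# Assign every row to exactly one fold, in random order.
fold.id <- sample(rep(seq_len(k), length.out = n.docs))

for (fold in seq_len(k)) {
  held.out   <- which(fold.id == fold)   # rows used for validation
  train.rows <- which(fold.id != fold)   # rows used for training
  cat(sprintf("Fold %d: train on %d rows, validate on %d rows\n",
              fold, length(train.rows), length(held.out)))
  # A model would be fit on train.df[train.rows, ] and
  # evaluated on train.df[held.out, ] here.
}
```

Averaging the accuracy over the held-out folds gives a less optimistic estimate than scoring the model on the same rows it was trained on.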

In any case, this model will only run one validation using the confusion matrix, below, which will spit out metrics for us to measure the accuracy of the predictive machine learning model we just built:

# Step 6. Train the model once on the training data and inspect the results

df.train <- train.df
df.test <- train.df
df.model <- ksvm(corpus ~ ., data = df.train, kernel = "rbfdot")
df.pred <- predict(df.model, df.test)
con.matrix <- confusionMatrix(df.pred, df.test$corpus)
con.matrix  # print the confusion matrix and its statistics

As you can see above, we are using the training dataframes for both training and testing our model. If the process runs successfully, you should see this output:


          Reference
Prediction Neg Pos
       Neg 343   0
       Pos  57 400

               Accuracy : 0.9288
                 95% CI : (0.9087, 0.9456)
    No Information Rate : 0.5
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.8575
 Mcnemar's Test P-Value : 1.195e-13

            Sensitivity : 0.8575
            Specificity : 1.0000
         Pos Pred Value : 1.0000
         Neg Pred Value : 0.8753
             Prevalence : 0.5000
         Detection Rate : 0.4288
   Detection Prevalence : 0.4288
      Balanced Accuracy : 0.9287

       'Positive' Class : Neg

In the simplest terms possible, the confusion matrix gives you the back-end output and analysis of the model’s performance in predicting the same files that it was trained on. The “Accuracy” field, for instance, gives us a quick estimate of what percent of the files the classifier predicted correctly: in our case, it was a high 92.9%! That means that roughly 93 percent of the time the classifier was successful in determining whether a file was positive or negative based solely on its content.
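The headline figures can be recomputed by hand from the four counts in the confusion matrix. The sketch below does that in base R, with “Neg” treated as the positive class as in the output above; the variable names are illustrative.

```r
# Counts from the confusion matrix above (rows = prediction, columns = truth).
conf <- matrix(c(343, 57, 0, 400), nrow = 2,
               dimnames = list(Prediction = c("Neg", "Pos"),
                               Reference  = c("Neg", "Pos")))

# Share of all reviews that were classified correctly.
accuracy <- sum(diag(conf)) / sum(conf)

# Of the truly negative reviews, how many were predicted "Neg".
sensitivity <- conf["Neg", "Neg"] / sum(conf[, "Neg"])

# Of the truly positive reviews, how many were predicted "Pos".
specificity <- conf["Pos", "Pos"] / sum(conf[, "Pos"])

# Of the reviews predicted "Pos", how many really were positive.
neg.pred.value <- conf["Pos", "Pos"] / sum(conf["Pos", ])

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, neg.pred.value = neg.pred.value), 4))
```

So all 57 errors are negative reviews that were predicted as positive, while no positive review was mislabelled as negative.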

In a more advanced scenario, you would want to cross-validate your model by running the same procedure on several more “folds,” which are essentially random subsets of your training data. For the model used above, it is clear that our classifier is quite good at determining whether an Amazon book review is negative or positive, so we can move on to our testing. We’ve built something useful with our new knowledge of machine learning with R, and now it’s time to put it to use! Fortunately, running the model on your testing data requires just one small change: the value of your “df.test” variable.

# Step 7. Run the final prediction on the test data and re-attach file names.

df.test <- test.df
df.pred <- predict(df.model, df.test)
results <- as.data.frame(df.pred)  # assumed form: a data frame holding the predictions
rownames(results) <- rownames(test.df)

The code above runs predict() with the trained model on the test data and places the results in the “results” data frame. We then reattach the original file names as the row names of the results, which gives us the machine learning predictions for every file in the test directory.
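Once the predictions are in a data frame, a quick summary and a saved copy are often useful. The sketch below uses made-up predictions and file names in place of the real output, and it writes to a temporary directory; the column name and output path are assumptions for illustration.

```r
# Made-up predictions standing in for the real output of predict().
toy.pred <- factor(c("Pos", "Neg", "Pos"), levels = c("Neg", "Pos"))

# One row per file, with the file names as row names.
toy.results <- data.frame(prediction = toy.pred,
                          row.names = c("review_a.txt",
                                        "review_b.txt",
                                        "review_c.txt"))

# How many reviews were predicted negative and positive.
print(table(toy.results$prediction))

# Save the predictions, file names included, for later inspection.
out.file <- file.path(tempdir(), "predictions.csv")
write.csv(toy.results, out.file)
```

With the real data, the same two calls would be applied to the data frame of test predictions.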

In conclusion, the process outlined above helps you build a quick-start classifier in R that can categorize the sentiment of online book reviews with a fairly high degree of accuracy. Such a classifier is useful when you have a large quantity of user-submitted text that needs to be analyzed for sentiments around a product or a service, and can more generally help a researcher build an algorithm that can weed out bad or good reviews automatically for either research or moderation purposes. We hope this tutorial has given you a sense of the power of machine learning with R!