Multinomial Logit Regression for Twitter Text Classification

In this notebook, we will apply multinomial logistic regression to classify Twitter text into three tones: “neutral”, “offensive language”, and “hate speech”.

First of all, let’s import the libraries that we will use in this notebook.

import libraries

library(glmnet)      # multinomial logit with lasso regularisation
library(RTextTools)  # corpus creation / document-term matrix

options(warn=-1) #turn off warnings
options(jupyter.plot_mimetypes = 'image/png')
Loading required package: Matrix
Loading required package: foreach
Loaded glmnet 2.0-5

Loading required package: SparseM

Attaching package: 'SparseM'

read data in

The data used in this notebook comprise a training set and a test set.

Now, let’s read the data in from the CSV files.

df <- read.csv("Twitter-hate_speech-labeled_data - Sheet1.csv")
test <- read.csv("Twitter-hate_speech-test_unlabeled - Sheet1.csv")
id | tweet_text | Label
2 | @NFLfantasy @Akbar_Gbaja Damn right, Akbar. | neutral
3 | @NFLfantasy @Akbar_Gbaja I've got backup | neutral
5 | "No one is going to replace the production of Odell Beckham Jr. He's let a lot of fantasy owners down."-@Akbar_Gbaja | neutral
7 | @Sheikh__Akbar It's actually PERFECT for you man! Always on the phone | neutral
10 | Was Trump right? Officers say pockets of Muslims celebrated 9/11 via @MailOnline | neutral
12 | Female Killer In Las Vegas Shouted “Allahu Akbar” As She Ran Over 40 Innocent People Last Night |
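A side note on read.csv: in the R versions current at the time of writing, read.csv converts string columns to factors by default, which can get in the way of later text processing. A minimal sketch of the behaviour, using an inline toy CSV rather than the files above:

```r
# toy CSV standing in for the real files, to show the stringsAsFactors behaviour
csv <- 'id,tweet_text,Label\n1,"hello world",neutral'

# stringsAsFactors = FALSE keeps the tweet text as plain character strings
toy <- read.csv(textConnection(csv), stringsAsFactors = FALSE)
class(toy$tweet_text)  # "character" rather than "factor"
```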

Now, let’s inspect the classes the tweets are labelled with, using the unique command in R, which tells us how many distinct classes there are in the training set and what they are.

unique(df$Label) #this will print out 3 classes that we will have to classify later.

[1] neutral  offensive language  hate speech

Let’s print out the number of rows in the training and test datasets.

nrow(df)
[1] 11418
nrow(test)
[1] 3024

merge df and test tweet text and create corpus

Now, we will bind the training data and test data together so that we can apply the corpus transformation to both later on.

data <- rbind(df, test)
nrow(data)
[1] 14442

use ‘RTextTools’ to transform the text into a corpus and remove stop words

Now we will build the word corpus, removing stop words as well as numbers in the process.

# prepare data
corpus <- create_matrix(data$tweet_text, language = "english", removeStopwords = TRUE,
    removeNumbers = TRUE, stemWords = FALSE, weighting = tm::weightTfIdf)
corpusmatrix <- as.matrix(corpus)
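For intuition, here is a simplified, self-contained sketch of the tf-idf weighting that create_matrix applies above (tm::weightTfIdf additionally normalises term counts by document length, which this toy version omits):

```r
# two toy documents
docs <- c("hate speech is bad", "free speech is good")
terms <- unique(unlist(strsplit(docs, " ")))

# term-frequency matrix: one row per document, one column per term
tf <- t(sapply(docs, function(d)
    table(factor(strsplit(d, " ")[[1]], levels = terms))))

# inverse document frequency: rare terms get higher weight, while terms
# appearing in every document (like "speech" here) get weight 0
idf <- log2(length(docs) / colSums(tf > 0))

tfidf <- sweep(tf, 2, idf, `*`)
```

Terms that occur everywhere carry no information for telling documents apart, which is exactly why tf-idf down-weights them.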

Now, we will randomly split the labelled data into a training set and a validation set, which we will use later to validate our model.

train_ind <- sample(seq_len(nrow(df)), size = floor(0.75 * nrow(df)))
dfmatrix <- corpusmatrix[1:nrow(df),]
trainmatrix <- dfmatrix[train_ind, ]
trainlabel <- as.factor(df$Label[train_ind])
validmatrix <- dfmatrix[-train_ind, ]
validlabel <- as.factor(df$Label[-train_ind])
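Note that sample makes the split random, so the exact rows chosen differ between runs; calling set.seed first makes the split reproducible. A toy sketch of the same 75/25 logic:

```r
# reproducible 75/25 split of 100 row indices
set.seed(123)  # fix the RNG so the split is identical on every run
n <- 100
train_ind <- sample(seq_len(n), size = floor(0.75 * n))
valid_ind <- setdiff(seq_len(n), train_ind)
length(train_ind)  # 75
length(valid_ind)  # 25
```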

train ‘glmnet’ multinomial

Now, we train a multinomial logit regression with lasso regularisation. The lasso penalty also performs variable elimination, shrinking the coefficients of uninformative terms to exactly zero.

multifit2 <- cv.glmnet(trainmatrix, trainlabel, family="multinomial", type.multinomial = "grouped", parallel = TRUE) # parallel = TRUE needs a registered foreach backend (e.g. doParallel)
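The variable elimination comes from the lasso's soft-thresholding step: any coefficient whose magnitude falls below the penalty is set exactly to zero. A minimal base-R illustration (the soft_threshold helper here is our own sketch, not part of glmnet):

```r
# soft-thresholding operator used in lasso coordinate descent
soft_threshold <- function(z, lambda) sign(z) * pmax(abs(z) - lambda, 0)

# coefficients smaller than lambda in magnitude are eliminated (set to 0)
soft_threshold(c(-2, -0.3, 0.1, 1.5), lambda = 0.5)
# -1.5  0.0  0.0  1.0
```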

plot fit

The following plot visualises cross-validated error across values of log lambda; lambda.min is the value that gives the least error.

plot(multifit2)

[plot: multinomial deviance against log(lambda)]

validate model

We will now validate the performance of our model on the validation set that we put aside earlier.

pvalid2 <- predict(multifit2, validmatrix , s="lambda.min", type="class")

confusion matrix

After applying our model to the validation set, let’s see how well it classifies. This can be done with the table command, which builds the “confusion matrix” for us automatically.

predicted_df <- data.frame(cbind(pvalid2, as.character(validlabel)))
colnames(predicted_df) <- c("predicted","actual")
table(predicted_df)
                     actual
predicted             hate speech neutral offensive language
  hate speech                 169       0                112
  neutral                      63    1357                151
  offensive language          278      40                685
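From the confusion matrix above we can read off overall accuracy: the diagonal holds the correctly classified tweets. Rebuilding the printed table by hand:

```r
labels <- c("hate speech", "neutral", "offensive language")
cm <- matrix(c(169,    0, 112,
                63, 1357, 151,
               278,   40, 685),
             nrow = 3, byrow = TRUE,
             dimnames = list(predicted = labels, actual = labels))

# correct predictions sit on the diagonal
accuracy <- sum(diag(cm)) / sum(cm)
round(accuracy, 3)  # 0.774
```

Most of the confusion is between “hate speech” and “offensive language”, while “neutral” tweets are classified well.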

predict classes

After all the hustle, let’s apply the model that we built to our test set. Then, let’s inspect the first few predictions.

pfit2 <- predict(multifit2, corpusmatrix[(nrow(df)+1):nrow(corpusmatrix),] , s="lambda.min", type="class")
test$Label <- pfit2
head(test)

id | tweet_text | Label
1 | @NFLfantasy @Akbar_Gbaja the last thing he cares about is fantasy owners | neutral
4 | @NFLfantasy @Akbar_Gbaja He also let his team down. It's a must win game. | neutral
6 | Man Shouting 'Allahu Akbar' Drives Into Crowd | neutral
8 | Nan (@ Akbar's in Leeds, West Yorkshire) | neutral
9 | VIDEOS: Immigrant Somali Muslim Woman Yelling Allahu Akbar Intentionally Ran Over 40 People On... |
11 | Was Trump right? Officers say pockets of Muslims celebrated 9/11 via @MailOnline | neutral
#get some samples of "hate speech"
head(subset(test, Label=="hate speech",))
id | tweet_text | Label
842 | @rwnc70 @JebBush LOL suck my motherfucking dick, faggot | hate speech
872 | @Alexcisneros69 Shut up you fucking faggot. | hate speech
886 | running with lfg faggots who think they know everything <<<<<<< | hate speech
902 | Bunch of slack jawed faggots around here! | hate speech
903 | @kalumevs @CallumHarries I've seen you two bench so don't pipe up #Faggots | hate speech
906 | Faggots? | hate speech
#get some samples of "offensive language"
head(subset(test, Label=="offensive language",))
id | tweet_text | Label
42 | Dem philly niggas b so great at that dressing them some fly niggas they make everything they put on look great | offensive language
182 | The same thing applies to these niggas too. How you gone get mad a bitch posted a picture of yo tiny ashy ass dick? | offensive language
188 | When you fuck darkskin girls in the winter time they ass cheeks Be ashy as hell | offensive language
208 | @EgyptTaughtMe @srslyab that's not even your skin tone and that's not even your body you ashy bitch | offensive language
226 | Nothing says "I’m a fat bastard" like wearing a T-shirt in a swimming pool. | offensive language
246 | That monochrome camera is brand new and costs about six grand. Bastard. | offensive language

last edit 25/10/2016
