How to approach a text classification problem (part 2/3)




This is the second part of the text classification series (see the first post).

In this post we are going to build models for the classification problem with:

  • scikit-learn: SVC, MultinomialNB, LogisticRegression.
  • TensorFlow: LSTM, CNN.

First we import the needed classes.
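
As a rough sketch (the exact import list is in the accompanying notebook, so treat this as an assumption), the general dependencies could look like this:

import re

import numpy as np
import pandas as pd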

Just as we did in the previous post, we read the text and clean it (this is explained in detail in the first post).
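
A minimal sketch of the cleaning step, assuming the data lives in a hypothetical DataFrame df with "text" and "category" columns (the real pipeline is described in the first post):

def clean_text(text: str) -> str:
    # lowercase, keep only letters, collapse whitespace
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

df["text_clean"] = df["text"].apply(clean_text)  # df is a placeholder name
print(df["category"].unique().tolist())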

['politics',
'exploration',
'intelligence',
'weapons',
'headhunters',
'transportation',
'logistics']
CPU times: user 4.79 s, sys: 34.9 ms, total: 4.83 s
Wall time: 5.18 s

Modeling

For model exploration we start with a baseline of models, testing MultinomialNB, SVC and logistic regression, all of them on a TF-IDF representation of the text. Cross-validation will be used to check which of them gives the best results, and the final test will be carried out on that one.

This baseline model will be compared to an LSTM model and a convolutional one.

We import all the sklearn dependencies.
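
A sketch of the scikit-learn imports used in the rest of this section (inferred from the models and metrics that appear below):

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC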

In order to train the models, we first vectorize the text. This generates an array with a non-zero entry for each word of the vocabulary that appears in the text.

We will also use TF-IDF (term frequency–inverse document frequency). This technique provides us with a numerical representation of the terms within the text.
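
A sketch of this step, assuming the cleaned texts are in df["text_clean"] (variable names are assumptions); the outputs below correspond to the vocabulary, a count row, the IDF weights and a TF-IDF row:

count_vect = CountVectorizer()
X_counts = count_vect.fit_transform(df["text_clean"])   # word counts per document

tfidf = TfidfTransformer()
X_tfidf = tfidf.fit_transform(X_counts)                  # TF-IDF weighting

print(list(count_vect.get_feature_names_out()[:10]))     # first vocabulary terms
print(X_counts[0].toarray()[0][:30])                     # counts for the first document
print(tfidf.idf_)                                        # IDF weight of each term
print(X_tfidf[0].toarray()[0][:30])                      # TF-IDF row for the first document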

['aa', 'aaa', 'aaaa', 'aaaaaaaaaaaa', 'aaah', 'aaef', 'aah', 'aalborg', 'aamir', 'aammmaaaazzzzzziinnnnggggg']
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
array([6.08295596, 5.70566172, 8.56786261, ..., 8.56786261, 8.56786261, 8.1623975 ])
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

How is TF-IDF calculated? The TF (term frequency) of a word is its frequency (i.e. the number of times it appears) in a document. Knowing the TF lets you see whether you are using a term too much or too little. For example, if a 100-word document contains the term "aalborg" 12 times, the TF for "aalborg" is 12/100 = 0.12.

The IDF (inverse document frequency) of a word is the measure of how significant that term is in the whole corpus.

Let's say the size of the corpus is 10,000,000 documents. If we assume there are 0.3 million documents that contain the term "aalborg", then the IDF is the logarithm of the total number of documents (10,000,000) divided by the number of documents containing the term "aalborg" (300,000), i.e. IDF = log(10,000,000 / 300,000) ≈ log(33.3). The TF-IDF score is then TF multiplied by IDF.
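
A quick sanity check of that arithmetic (base-10 logarithm here; scikit-learn uses the natural logarithm plus smoothing, so its values differ):

import math

tf = 12 / 100                              # "aalborg" appears 12 times in a 100-word document
idf = math.log10(10_000_000 / 300_000)     # ≈ 1.52
print(tf, idf, tf * idf)                   # TF-IDF ≈ 0.18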

Baseline models

To evaluate the performance of the three models we will use cross-validation: the data is split into several folds, the model is trained on all but one fold and validated on the remaining one, and this is repeated so that every fold serves once as the validation set. The number of folds is a parameter we choose.
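
A sketch of how that comparison can be run with cross_val_score (5 folds assumed; the fold count used originally is not shown):

models = [
    ("LogisticRegression", LogisticRegression(max_iter=1000)),
    ("MultinomialNB", MultinomialNB()),
    ("SVC", SVC()),
]

rows = []
for name, model in models:
    scores = cross_val_score(model, X_tfidf, df["category"], scoring="accuracy", cv=5)
    rows.extend({"model_name": name, "accuracy": s} for s in scores)

cv_df = pd.DataFrame(rows)
print(cv_df.groupby("model_name")["accuracy"].mean())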

model_name
LogisticRegression 0.934610
MultinomialNB 0.929186
SVC 0.943398
Name: accuracy, dtype: float64

After analyzing the three models we see that SVC performs better and more consistently than the others, so a more thorough evaluation will be carried out with this model.

To carry out this test, the TF-IDF representation will be used and the data will be split: 70% to train the model and 30% of held-out data, which the model has not seen, to check how well (or badly) it generalizes.
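
A sketch of that final evaluation (random_state is an assumption):

X_train, X_test, y_train, y_test = train_test_split(
    X_tfidf, df["category"], test_size=0.3, random_state=42
)

svc = SVC()
svc.fit(X_train, y_train)
y_pred = svc.predict(X_test)

print(confusion_matrix(y_test, y_pred))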

As can be seen in the confusion matrix, the model generalizes quite well, although the "headhunters" class shows a certain tendency to be predicted incorrectly as "exploration".

Next we will analyze precision, recall and f1-score.

But before analyzing it, we must understand what each value refers to:

  • precision: of the cases predicted as positive, the fraction that are actually positive. If false positives have a high cost, this is the metric to pay the most attention to.
  • recall: of the cases that are actually positive, the fraction that the model predicted as positive. If you need to reduce false negatives, this is the metric to pay the most attention to.
  • f1-score: the harmonic mean of the two, a balance between precision and recall.
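
Using the predictions from the split above, the report below can be produced like this:

print(classification_report(y_test, y_pred))
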
                precision    recall  f1-score   support

   exploration       0.96      0.95      0.96       199
   headhunters       0.91      0.97      0.94       179
  intelligence       0.96      0.94      0.95       133
     logistics       0.99      0.96      0.97       137
      politics       0.97      0.96      0.97       185
transportation       0.93      0.96      0.95       165
       weapons       0.99      0.94      0.97       163

      accuracy                           0.96      1161
     macro avg       0.96      0.96      0.96      1161
  weighted avg       0.96      0.96      0.96      1161

As we can see, the model behaves very well and is robust.

TensorFlow

Let’s check if there are GPUs available to make training more efficient.
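
The check that produces the output below can be done like this:

import tensorflow as tf
from tensorflow.python.client import device_lib

print("Num GPUs Available: ", len(tf.config.list_physical_devices("GPU")))
print(device_lib.list_local_devices())
print(tf.config.list_physical_devices("GPU"))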

Num GPUs Available:  1
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 11162430468200319055
, name: "/device:XLA_CPU:0"
device_type: "XLA_CPU"
memory_limit: 17179869184
locality {
}
incarnation: 233496321260097046
physical_device_desc: "device: XLA_CPU device"
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 6961726752
locality {
bus_id: 1
links {
}
}
incarnation: 15285252465848281027
physical_device_desc: "device: 0, name: GeForce RTX 2070, pci bus id: 0000:01:00.0, compute capability: 7.5"
, name: "/device:XLA_GPU:0"
device_type: "XLA_GPU"
memory_limit: 17179869184
locality {
}
incarnation: 17115642552350290797
physical_device_desc: "device: XLA_GPU device"
]


[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

If we have a GPU, we configure TensorFlow so that it can use the GPU's memory without a fixed limit.
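
A minimal sketch of that configuration, using memory growth so TensorFlow allocates GPU memory as it needs it:

gpus = tf.config.list_physical_devices("GPU")
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)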

In this case, due to the neural network architecture, instead of encoding the classes as a single number we will one-hot encode them: each column refers to one class, so each neuron in the output of the neural network is identified with a class, and the neuron with the highest value determines the predicted class.

So there will be 7 columns indicating which class each row belongs to, and therefore 7 output neurons.

In this particular problem we know that each text belongs to exactly ONE class, so each row can have one and only one column with a "1"; this is called multiclass classification. If a text could belong to more than one class, it would be multilabel classification.
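
A sketch of the one-hot encoding of the labels (pd.get_dummies is one simple option; tf.keras.utils.to_categorical on integer labels would work as well):

y_onehot = pd.get_dummies(df["category"]).values   # one column per class, a single 1 per row
print(y_onehot.shape)                              # (n_documents, 7)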

Let's tokenize again (as before, we vectorize the input), this time using the Keras Tokenizer class.
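
A sketch of the tokenization step that produces the output below (the out-of-vocabulary token name is an assumption):

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(oov_token="<unk>")
tokenizer.fit_on_texts(df["text_clean"])
sequences = tokenizer.texts_to_sequences(df["text_clean"])

print(sequences[0])                               # first text as a sequence of token ids
print(np.mean([len(s) for s in sequences]))       # average sequence length
print(max(len(s) for s in sequences))             # maximum sequence length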

[4, 4318, 127, 2696, 2, 1845, 906, 3803, 3, 4, 11450, 2640, 127, 2175, 2, 1329, 10290, 3, 4798, 7039, 8635, 4319, 145, 4798, 7039, 3,
1459, 416, 132, 1098]
151.14913414318946

6050

As we can see, the texts have an average length of about 151 tokens and a maximum of 6050, so to normalize the input length of the tokenized sequences we will pad or truncate them to a length of 200.

To test this architecture the tokenized data will be used. The data will be split with 70% to train the model and 30% to test; the model has not seen that 30%, so we can check how well or badly it generalizes.
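
A sketch of the padding and the 70/30 split on the tokenized data (padding/truncating choices and random_state are assumptions):

MAX_LEN = 200
X_seq = pad_sequences(sequences, maxlen=MAX_LEN, padding="post", truncating="post")

X_train_seq, X_test_seq, y_train_oh, y_test_oh = train_test_split(
    X_seq, y_onehot, test_size=0.3, random_state=42
)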

The model consists of:

  • an embedding layer with input size equal to the number of tokens + 1 (the extra 1 is for the "unknown" token, i.e. any token that is not in the vocabulary) and an embedding vector length of 64 (a hyperparameter of the model)
  • a bidirectional LSTM layer with 64 units (hyperparameter)
  • a bidirectional LSTM layer with 32 units (hyperparameter)
  • a layer of 64 (hyperparameter) fully connected neurons with ReLU activation
  • a dropout layer
  • a layer of 7 fully connected neurons with softmax activation

The last layer has 7 neurons, one per class to be predicted, and its activation is softmax, which produces a probability distribution over the neurons; the neuron with the highest probability corresponds to the predicted class.

The model is compiled with the "categorical_crossentropy" loss; for training, the optimizer used to compute the gradient updates is "adam" and the metric tracked is "accuracy".
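
A sketch of that architecture in Keras; the layer sizes follow the list above and match the parameter counts in the summary below, while the dropout rate is an assumption:

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Bidirectional, Dense, Dropout, Embedding, LSTM

vocab_size = len(tokenizer.word_index) + 1   # +1 as described above, for the "unknown"/padding index

lstm_model = Sequential([
    Embedding(vocab_size, 64),
    Bidirectional(LSTM(64, return_sequences=True)),
    Bidirectional(LSTM(32)),
    Dense(64, activation="relu"),
    Dropout(0.5),                            # rate not shown in the post; 0.5 assumed
    Dense(7, activation="softmax"),
])

lstm_model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])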

Training is run for 4 epochs using batches of 32 samples to avoid overloading memory.
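
The training call might look like this, validating against the held-out 30%:

history = lstm_model.fit(
    X_train_seq, y_train_oh,
    validation_data=(X_test_seq, y_test_oh),
    epochs=4,
    batch_size=32,
)
lstm_model.summary()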

Epoch 1/4
85/85 [==============================] - 5s 55ms/step - loss: 1.8506 - accuracy: 0.2142 - val_loss: 1.5143 - val_accuracy: 0.3463
Epoch 2/4
85/85 [==============================] - 3s 38ms/step - loss: 1.1420 - accuracy: 0.5251 - val_loss: 0.9627 - val_accuracy: 0.6503
Epoch 3/4
85/85 [==============================] - 3s 38ms/step - loss: 0.6128 - accuracy: 0.7843 - val_loss: 0.6739 - val_accuracy: 0.7933
Epoch 4/4
85/85 [==============================] - 3s 38ms/step - loss: 0.2878 - accuracy: 0.9184 - val_loss: 0.5204 - val_accuracy: 0.8562
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding (Embedding) (None, None, 64) 2457408
_________________________________________________________________
bidirectional (Bidirectional (None, None, 128) 66048
_________________________________________________________________
bidirectional_1 (Bidirection (None, 64) 41216
_________________________________________________________________
dense (Dense) (None, 64) 4160
_________________________________________________________________
dropout (Dropout) (None, 64) 0
_________________________________________________________________
dense_1 (Dense) (None, 7) 455
=================================================================
Total params: 2,569,287
Trainable params: 2,569,287
Non-trainable params: 0
_________________________________________________________________

As can be seen in the training curves, the model reaches high accuracy both in training and in validation. We must be careful with overfitting: if we overtrain the model it will no longer be able to generalize.

Let's move on to the convolutional model.

Convolutional

Epoch 1/4
43/43 [==============================] - 1s 28ms/step - loss: 1.8761 - accuracy: 0.2371 - val_loss: 1.6784 - val_accuracy: 0.3781
Epoch 2/4
43/43 [==============================] - 1s 21ms/step - loss: 0.6809 - accuracy: 0.8364 - val_loss: 0.3644 - val_accuracy: 0.8906
Epoch 3/4
43/43 [==============================] - 1s 21ms/step - loss: 0.0619 - accuracy: 0.9860 - val_loss: 0.3042 - val_accuracy: 0.9078
Epoch 4/4
43/43 [==============================] - 1s 21ms/step - loss: 0.0170 - accuracy: 0.9970 - val_loss: 0.2632 - val_accuracy: 0.9190

The model consists of:

  • a layer of embedings with the length of the tokens + 1 (adding 1 is for the “unknown” token, any token that is not within the token list) and a length of the embedding vector of 64 (this is a hyperparameter of the model)
  • a Convolutional layer of 1 dimension with 50 filters and a kernel dimension of 5 (length of the window in which the convolution is going to move)
  • a Flatten layer so that all the parameters are in the form of a list not an array
  • a layer of 100 (hyperparameter) fully connected neurons with reluctance activation
  • a layer of 7 fully connected neurons with softmax activation

As in the previous model, the output layer has 7 neurons because there are 7 classes; a sketch of the full architecture is shown below.
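
A sketch of the convolutional architecture in Keras (reusing vocab_size and MAX_LEN from above; the batch size of 64 is inferred from the 43 steps per epoch in the log):

from tensorflow.keras.layers import Conv1D, Flatten, Input

cnn_model = Sequential([
    Input(shape=(MAX_LEN,)),
    Embedding(vocab_size, 64),
    Conv1D(50, 5, activation="relu"),       # 50 filters, kernel size 5
    Flatten(),
    Dense(100, activation="relu"),
    Dense(7, activation="softmax"),
])

cnn_model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
cnn_model.fit(
    X_train_seq, y_train_oh,
    validation_data=(X_test_seq, y_test_oh),
    epochs=4,
    batch_size=64,
)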

The results are similar to the previous model, but this one behaves somewhat better in validation.

We are going to classify the test set to make an analysis of the confusion matrix and the metrics.

We take the position of the output neuron with the highest value (argmax) and map it back to the class it represents.
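
A sketch of that step, assuming the class names come from the same columns used for the one-hot encoding:

class_names = pd.get_dummies(df["category"]).columns   # same order as y_onehot

probs = cnn_model.predict(X_test_seq)
y_pred_labels = class_names[np.argmax(probs, axis=1)]
y_true_labels = class_names[np.argmax(y_test_oh, axis=1)]

print(confusion_matrix(y_true_labels, y_pred_labels))
print(classification_report(y_true_labels, y_pred_labels))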

                precision    recall  f1-score   support

   exploration       0.89      0.95      0.92       199
   headhunters       0.98      0.89      0.93       179
  intelligence       0.95      0.79      0.86       133
     logistics       0.93      0.92      0.92       137
      politics       0.97      0.97      0.97       185
transportation       0.81      0.97      0.88       165
       weapons       0.95      0.91      0.93       163

      accuracy                           0.92      1161
     macro avg       0.92      0.91      0.92      1161
  weighted avg       0.92      0.92      0.92      1161

As we can see, the transportation class has some shortcomings, but thanks to the embedding layer and the convolutions the model should be able to abstract the text better.

Annex: Embeddings

Embeddings are dense vector representations of words in which similar words have similar encodings and lie close to related words.

Let's see an example using the embedding layer from the last training run (the convolutional model). We take the tokenizer vocabulary and the weights of the embedding layer and save them to the vectors and metadata files, respectively.
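
A sketch of that export, following the usual TensorFlow embedding-projector recipe (file names are assumptions):

import io

embedding_layer = next(l for l in cnn_model.layers if isinstance(l, Embedding))
weights = embedding_layer.get_weights()[0]           # embedding matrix: (vocab_size, 64)

with io.open("vectors.tsv", "w", encoding="utf-8") as out_v, \
     io.open("metadata.tsv", "w", encoding="utf-8") as out_m:
    for word, index in tokenizer.word_index.items():
        out_m.write(word + "\n")
        out_v.write("\t".join(str(x) for x in weights[index]) + "\n")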

We can use https://projector.tensorflow.org/ to obtain a graphical projection of our embedding map: load the two files in the "Load" section and click the "A" icon to display the word labels.

As we can see, the label "religious" is close (in cosine distance) to the labels "satan", "jesus" and "independent".

The best-known example of embeddings, though, is the man-woman / king-queen relationship for gender, along with verb tenses and country-capital relationships.

Code (soon)

Contact

Buy me a coffee
