### Distributing learning for sentiment analysis with Pyspark


## The trade-off between different approaches in Pyspark in the speed of learning and accuracy

*In this article, I would like to show the trade-off between different Pyspark approaches in terms of training speed and accuracy on the task of identifying positive and negative tweets.*

Sentiment analysis is extremely useful in social media monitoring, as it allows us to gain an overview of the wider public opinion behind certain topics. This task has been widely researched and implemented in many different frameworks and programming languages.

In this article, I would like to show how we can use different approaches in Pyspark on Databricks, which metrics we can get, and, most importantly, how much time each approach takes. Training time is often the most important factor in real life, because new data and new cases arrive very fast, and we usually need to retrain our model almost every day. The second factor is prediction response time: it is very good if a model has 99.9% accuracy, but if its response time on an input data stream is too long, we lose all the benefits that machine and deep learning give us.

For this research I used the Twitter and Reddit Sentimental Analysis Dataset from Kaggle.

My first approach is *“HashingTF + IDF + Logistic Regression”*.

Let’s start to figure out what is it and how it works. Term frequency-inverse document frequency (TF-IDF) is a feature vectorization method widely used in text mining to reflect the importance of a term to a document in the corpus.

There are several variants on the definition of term frequency and document frequency. In MLlib, TF and IDF are separated to make them flexible.

**TF**: Both `HashingTF` and `CountVectorizer` can be used to generate term frequency vectors.

`HashingTF` is a `Transformer` which takes sets of terms and converts those sets into fixed-length feature vectors. In text processing, a “set of terms” might be a bag of words. `HashingTF` utilizes the hashing trick. A raw feature is mapped into an index (term) by applying a hash function. The hash function used here is MurmurHash 3. Then term frequencies are calculated based on the mapped indices. This approach avoids the need to compute a global term-to-index map, which can be expensive for a large corpus, but it suffers from potential hash collisions, where different raw features may become the same term after hashing. To reduce the chance of collision, we can increase the target feature dimension, i.e. the number of buckets of the hash table. Since a simple modulo on the hashed value is used to determine the vector index, it is advisable to use a power of two as the feature dimension; otherwise, the features will not be mapped evenly to the vector indices. The default feature dimension is 2^18 = 262,144. An optional binary toggle parameter controls term frequency counts. When set to true, all nonzero frequency counts are set to 1. This is especially useful for discrete probabilistic models that model binary, rather than integer, counts.
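The hashing trick itself is simple enough to illustrate in a few lines of plain Python. This is a conceptual sketch only: Spark uses MurmurHash 3, while here Python's built-in `hash` stands in for it.

```python
# Conceptual sketch of the hashing trick behind HashingTF.
# Note: Spark uses MurmurHash 3; Python's built-in hash() is used
# here purely for illustration.

def hashing_tf(terms, num_features=2**18):
    """Map a bag of terms to a fixed-length term-frequency vector
    (sparse representation: bucket index -> count)."""
    vector = {}
    for term in terms:
        index = hash(term) % num_features  # bucket index; collisions possible
        vector[index] = vector.get(index, 0) + 1
    return vector

tf = hashing_tf(["spark", "is", "fast", "spark"], num_features=16)
# Both occurrences of "spark" always hash to the same bucket,
# so at least one bucket holds a count of 2.
```

With only 16 buckets, distinct terms may collide into the same index, which is exactly why Spark's default dimension is as large as 2^18.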

**IDF**: `IDF` is an `Estimator` which is fit on a dataset and produces an `IDFModel`. The `IDFModel` takes feature vectors (generally created from `HashingTF`) and scales each feature. Intuitively, it down-weights features that appear frequently in a corpus.
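The down-weighting follows the smoothed IDF formula that Spark MLlib documents for `IDF`: idf(t) = log((m + 1) / (df(t) + 1)), where m is the number of documents and df(t) is the number of documents containing term t. A minimal pure-Python illustration:

```python
import math

def idf(num_docs, doc_freq):
    """Smoothed inverse document frequency, as in Spark MLlib's IDF."""
    return math.log((num_docs + 1) / (doc_freq + 1))

# A term appearing in every document gets weight 0 (no information),
# while a rare term gets a large weight.
common = idf(num_docs=100, doc_freq=100)  # appears everywhere -> 0.0
rare = idf(num_docs=100, doc_freq=1)      # appears once -> large weight
```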

This is the first part of the pipeline, the one responsible for text preprocessing. My first model is logistic regression, one of the simplest and fastest models in machine learning, but with good data preprocessing it can give good results. In statistics, the **logistic model** (or **logit model**) is used to model the probability of a certain class or event, such as the positive or negative class. Logistic regression finds its best predictions using the maximum likelihood technique. The sigmoid is a mathematical function that can take any real value between -∞ and +∞ and map it to a real value between 0 and 1. So if the outcome of the sigmoid function is more than 0.5 we classify the example as the positive class, and if it is less than 0.5 we classify it as the negative class.
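The sigmoid decision rule described above is easy to state in code (a generic illustration, not Spark-specific):

```python
import math

def sigmoid(z):
    """Map any real value to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def classify(z, threshold=0.5):
    """Positive class if sigmoid(z) exceeds the threshold, else negative."""
    return "positive" if sigmoid(z) > threshold else "negative"

# sigmoid(0) is exactly 0.5; large positive scores approach 1,
# large negative scores approach 0.
```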

As the result we have:

- Accuracy Score: 0.8771
- ROC-AUC: 0.8529
- Training time 56.79 seconds

The second approach is *“CountVectorizer + IDF + Logistic Regression”*. This approach is very similar to the previous one, but for text preprocessing I would like to use the `CountVectorizer` method. So, what is it, and how does it work?

`CountVectorizer` and `CountVectorizerModel` aim to help convert a collection of text documents to vectors of token counts. When an a-priori dictionary is not available, `CountVectorizer` can be used as an `Estimator` to extract the vocabulary and generate a `CountVectorizerModel`. The model produces sparse representations for the documents over the vocabulary, which can then be passed to other algorithms like LDA.

During the fitting process, `CountVectorizer` will select the top `vocabSize` words ordered by term frequency across the corpus. An optional parameter `minDF` also affects the fitting process by specifying the minimum number (or fraction if < 1.0) of documents a term must appear in to be included in the vocabulary. Another optional binary toggle parameter controls the output vector. If set to true, all nonzero counts are set to 1. This is especially useful for discrete probabilistic models that model binary, rather than integer, counts.
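Conceptually, `CountVectorizer` first learns a vocabulary and then counts tokens against it. Here is a pure-Python sketch of that two-step (fit, then transform) behavior; the function names are mine, not Spark's:

```python
from collections import Counter

def fit_vocabulary(docs, vocab_size=10, min_df=1):
    """Learn the top-`vocab_size` terms by corpus frequency, keeping only
    terms that appear in at least `min_df` documents (like vocabSize/minDF)."""
    term_freq = Counter()
    doc_freq = Counter()
    for doc in docs:
        term_freq.update(doc)
        doc_freq.update(set(doc))
    eligible = [t for t, _ in term_freq.most_common() if doc_freq[t] >= min_df]
    return eligible[:vocab_size]

def transform(doc, vocab):
    """Count-vector of a single document over the learned vocabulary."""
    counts = Counter(doc)
    return [counts[t] for t in vocab]

docs = [["spark", "is", "fast"], ["spark", "is", "scalable"]]
vocab = fit_vocabulary(docs, vocab_size=4, min_df=2)
# only "spark" and "is" appear in at least 2 documents
```

Unlike `HashingTF`, this requires a pass over the corpus to build the vocabulary, which is exactly why it is slower but collision-free.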

As the result we have:

- Accuracy Score: 0.8931
- ROC-AUC: 0.8708
- Training time 1.27 minutes

As you can see, the accuracy is higher, but we pay for it with training time.

The third approach is *“N-gram + Chi-Squared + Logistic Regression”*.

In this approach, I used N-gram as a technique for text preprocessing and Chi-Squared as an approach for feature selection.

Let’s start with N-grams. An n-gram is a sequence of n tokens (typically words) for some integer n. The `NGram` class can be used to transform input features into n-grams.

`NGram` takes as input a sequence of strings (e.g. the output of a `Tokenizer`). The parameter n is used to determine the number of terms in each n-gram. The output consists of a sequence of n-grams, where each n-gram is represented by a space-delimited string of n consecutive words. If the input sequence contains fewer than n strings, no output is produced. Spark does not automatically combine features from different n-grams, so I had to use `VectorAssembler` in the pipeline to combine the features I get from each n-gram size.
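What `NGram` produces is easy to see in plain Python (a conceptual sketch of the transform, not Spark code):

```python
def ngrams(tokens, n):
    """Return space-joined n-grams of consecutive tokens.
    If the input has fewer than n tokens, no output is produced."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["i", "love", "this", "movie"]
bigrams = ngrams(tokens, 2)   # ["i love", "love this", "this movie"]
trigrams = ngrams(tokens, 3)  # ["i love this", "love this movie"]
```

In the real pipeline, the vectorized features from the unigram, bigram, and trigram columns are then concatenated with `VectorAssembler`.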

The next step in this approach is feature selection. Feature selection is a process where you automatically select the features in your data that contribute most to the prediction variable or output you are interested in. The benefits of performing feature selection before modeling your data are:

- Avoids overfitting: less redundant data gives the model a performance boost and less opportunity to make decisions based on noise
- Reduces training time: less data means that algorithms train faster

*The Chi-square* test is used for categorical features in a dataset. We calculate Chi-square between each feature and the target and select the desired number of features with the best Chi-square scores. It determines if the association between two categorical variables of the sample would reflect their real association in the population.
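For a binary feature against a binary target, the Chi-square statistic can be computed from a 2x2 contingency table as sum((observed - expected)^2 / expected). A small illustrative sketch in plain Python (not Spark's `ChiSqSelector` implementation):

```python
def chi_square(table):
    """Chi-square statistic for a 2x2 contingency table
    given as [[a, b], [c, d]] of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# A feature perfectly correlated with the target scores high;
# a feature independent of the target scores zero.
dependent = chi_square([[30, 0], [0, 30]])
independent = chi_square([[15, 15], [15, 15]])
```

Features are then ranked by this score, and only the top ones are kept for the model.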

The machine learning model stays the same: logistic regression.

As the result we have:

- Accuracy Score: 0.9111
- ROC-AUC: 0.8921
- Training time 15.25 minutes

Let’s do the same but without feature selection. So, my fourth approach is *“N-gram + Logistic Regression”*. Here I would like to combine the power of the sequence analysis that **N-gram** provides with the speed of logistic regression.

As the result we have:

- Accuracy Score: 0.9022
- ROC-AUC: 0.8816
- Training time 1.54 minutes

As we can see, the accuracy drops a little, while the training time drops dramatically.

Ok, logistic regression is good, but what about using something more complicated: gradient-boosted trees? Let’s keep the same data preprocessing (N-gram + Chi-Squared) and add the power of the gradient-boosted classifier. So, my next iteration is *“N-gram + Chi-Squared + Gradient-boosted tree classifier”*.

Gradient-boosted trees (GBTs) are a popular classification and regression method using ensembles of decision trees. GBTs iteratively train decision trees to minimize a loss function. In gradient boosting, an ensemble of weak learners is used to improve the performance of a machine learning model. The weak learners are usually decision trees. Combined, their output results in better models. In gradient boosting, weak learners work sequentially. Each model tries to improve on the error from the previous model.
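The sequential error-correction idea can be shown with a tiny pure-Python example. This is an illustration of boosting on regression residuals with decision stumps as weak learners, not Spark's GBT implementation:

```python
def fit_stump(x, residuals):
    """Weak learner: best single-threshold split minimizing squared error."""
    best = None
    for t in sorted(set(x)):
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        if not left or not right:
            continue
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - (lmean if xi <= t else rmean)) ** 2
                  for xi, r in zip(x, residuals))
        if best is None or sse < best[0]:
            best = (sse, t, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda xi, t=t, l=lmean, r=rmean: l if xi <= t else r

def boost(x, y, rounds=5, lr=0.5):
    """Each round fits a stump to the current residuals
    and adds a shrunken copy of it to the ensemble's prediction."""
    pred = [0.0] * len(y)
    for _ in range(rounds):
        stump = fit_stump(x, [yi - pi for yi, pi in zip(y, pred)])
        pred = [pi + lr * stump(xi) for pi, xi in zip(pred, x)]
    return pred

x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.2, 0.9, 3.0, 3.2, 2.9]
pred = boost(x, y)  # squared error shrinks round by round
```

Each round fits a new weak learner to what the previous ensemble got wrong, which is exactly the "improve on the error of the previous model" behavior described above.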

As the result we have:

- Accuracy Score: 0.8003
- ROC-AUC: 0.7227
- Training time 52.408 minutes

For me, this is too long, and we also run a serious risk of the model overfitting, which GBTs can produce. GBT is a really good model, but in this case we have too many features for it, and the Chi-Squared approach may not be such a good fit here.

Let’s check the same pipeline but without Chi-Squared, just to see how much time we save if we exclude feature selection. So, the next approach is *“N-gram + Gradient-boosted tree classifier”*.

As the result we have:

- Accuracy Score: 0.7976
- ROC-AUC: 0.7185
- Training time 39.564 minutes

We really did reduce the time, but as I said before, this algorithm needs fewer features, so we lose a little bit in accuracy.

Let’s imagine that we don’t know, or wouldn’t like to use, any other feature selection technique in our pipeline, but still would like to increase the accuracy. In this case, we can use the power of deep learning.

In the next step, I would again keep the same data preprocessing pipeline but use a deep learning (MLP) model: *“N-gram + Chi-Squared + MLP”*.

Multilayer perceptron classifier (MLPC) is a classifier based on the feedforward artificial neural network. MLPC consists of multiple layers of nodes. Each layer is fully connected to the next layer in the network. Nodes in the input layer represent the input data. All other nodes map inputs to outputs by a linear combination of the inputs with the node’s weights `w` and bias `b` and applying an activation function. This can be written in matrix form for MLPC with `K+1` layers as follows:

y(x) = f_K(… f_2(w_2^T f_1(w_1^T x + b_1) + b_2) … + b_K)

Nodes in intermediate layers use the sigmoid (logistic) function:

f(z_i) = 1 / (1 + e^(-z_i))

Nodes in the output layer use the softmax function:

f(z_i) = e^(z_i) / Σ_{k=1}^{N} e^(z_k)

The number of nodes `N` in the output layer corresponds to the number of classes. MLPC employs backpropagation for learning the model. We use the logistic loss function for optimization and L-BFGS as an optimization routine.

In my case, I use 2 hidden layers with 64 and 32 neurons.
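The forward pass of such a network (sigmoid hidden layers, softmax output, matching the formulas above) can be sketched in pure Python. The hidden sizes mirror my configuration; the input size, the two output classes, and the random weights are illustrative:

```python
import math
import random

def sigmoid_layer(x, weights, biases):
    """One fully connected layer followed by the sigmoid activation."""
    return [1.0 / (1.0 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))
            for w, b in zip(weights, biases)]

def softmax(z):
    """Softmax over the output scores: probabilities summing to 1."""
    m = max(z)  # subtract max for numerical stability
    exps = [math.exp(zi - m) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

def random_layer(n_in, n_out, rng):
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

rng = random.Random(0)
sizes = [100, 64, 32, 2]  # input, two hidden layers (64 and 32), two classes
layers = [random_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]

x = [rng.random() for _ in range(sizes[0])]
h = x
for w, b in layers[:-1]:          # hidden layers: linear + sigmoid
    h = sigmoid_layer(h, w, b)
w, b = layers[-1]                 # output layer: linear scores + softmax
scores = [sum(wi * hi for wi, hi in zip(wrow, h)) + bi
          for wrow, bi in zip(w, b)]
probs = softmax(scores)
```

In Spark, the equivalent configuration would be passed as the `layers` parameter of `MultilayerPerceptronClassifier`, e.g. `[num_features, 64, 32, 2]`.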

As the result we have:

- Accuracy Score: 0.9261
- ROC-AUC: 0.9143
- Training time 70.2 minutes

This is the best result compared with the previous approaches, but the training time is far too long, and that is just for a small amount of data.

Let’s check the same pipeline but without Chi-Squared, just to see how much time we save if we exclude feature selection. So, the next approach is *“N-gram + MLP”*.

As the result we have:

- Accuracy Score: 0.9115
- ROC-AUC: 0.8975
- Training time 54.108 minutes

We really do save time, but if we compare the results with all the previous tests, we see that it is better to use an approach with logistic regression, which produces roughly the same accuracy in a much shorter time.

Here is a table with all the results:

| Approach | Accuracy | ROC-AUC | Training time |
| --- | --- | --- | --- |
| HashingTF + IDF + Logistic Regression | 0.8771 | 0.8529 | 56.79 s |
| CountVectorizer + IDF + Logistic Regression | 0.8931 | 0.8708 | 1.27 min |
| N-gram + Chi-Squared + Logistic Regression | 0.9111 | 0.8921 | 15.25 min |
| N-gram + Logistic Regression | 0.9022 | 0.8816 | 1.54 min |
| N-gram + Chi-Squared + GBT classifier | 0.8003 | 0.7227 | 52.408 min |
| N-gram + GBT classifier | 0.7976 | 0.7185 | 39.564 min |
| N-gram + Chi-Squared + MLP | 0.9261 | 0.9143 | 70.2 min |
| N-gram + MLP | 0.9115 | 0.8975 | 54.108 min |

# Conclusions

The conclusion is pretty simple: the faster you want your model to be, the simpler the model and the data preprocessing pipeline should be. In this article, I have used only a small part of the entire arsenal of data science methods and models.

For this task, you can also use TensorFlow or PyTorch frameworks with more complex architectures than MLP. To use the power of Databricks, you can also take a distributed training approach. Horovod is a distributed training framework for TensorFlow, Keras, and PyTorch. Databricks supports distributed deep learning training using HorovodRunner and the `horovod.spark` package. For Spark ML pipeline applications using Keras or PyTorch, you can use the `horovod.spark` estimator API.

You can also fine-tune language models for sentiment analysis tasks, for example BERT and its derivatives.

All the code can be found in the Git repository: link.

You can also read how to use Keras for sentiment analysis in my article “Fake news detector with deep learning approach (Part-II) Modeling”.
