An End-to-End Web Service Implementation for Text Classification using Word2Vec and LGBM




Before starting the code review, I'd like to talk about the data. Amazon product reviews, which can be found on Kaggle, were used for text classification. Each review has a score, and the application predicts this score. To simplify the problem, I grouped the scores into two groups, greater than 3 or not (the maximum score is 5, the minimum is 0), so the problem became a binary classification problem.
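As a rough sketch of that grouping step (the file name and the Score column name are assumptions based on the public Kaggle dataset, not taken from the original code):

import pandas as pd

# Hypothetical sketch: load the Kaggle reviews file and binarize the score.
# File name and column name ("Score") are assumptions.
df = pd.read_csv("Reviews.csv")
df["label"] = (df["Score"] > 3).astype(int)  # 1 if score > 3, else 0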

Controller

In the Controller, there are three main methods, which handle:

  • word2vec model training
  • classifier model training
  • prediction for a given text.

train_wv_model receives a text file for model training and returns the model identifier as a response. I use model identifiers in order to support multiple models for both the word2vec and the classification model. train_classifier_model also receives the text file, since I didn't want to store the file in the application; in addition to the file, the model identifier generated when the word2vec model was trained is passed in the request so that the vector representation of each text can be retrieved. Lastly, predict_text_class reads the text and the modelID from the request, and the review is classified by the model whose modelID is given.
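As an illustration only, a minimal controller along these lines could be sketched with Flask; the route paths mirror the cURL calls shown at the end, but Flask itself, the function bodies, and the service objects are my assumptions, not the original implementation.

from flask import Flask, request, jsonify

app = Flask(__name__)

# Placeholder service objects; the real application wires these up elsewhere.
word_embedding_service = ...
text_classifier_service = ...

@app.route("/api/v1/wv-model-training", methods=["POST"])
def train_wv_model():
    # Receive the raw text file and return the identifier of the trained word2vec model.
    file = request.files["file"]
    model_id = word_embedding_service.train_wv_model(file)
    return jsonify({"model_id": model_id})

@app.route("/api/v1/text-classifier-training", methods=["POST"])
def train_classifier_model():
    # Receive the text file plus the word2vec model identifier, return metrics and modelID.
    file = request.files["file"]
    model_id = request.form["model_id"]
    model_info = text_classifier_service.train_classifier_model(file, model_id)
    return jsonify(model_info)

@app.route("/api/v1/predict-text", methods=["POST"])
def predict_text_class():
    # Receive a review text and a modelID, return the predicted class.
    body = request.get_json()
    result = text_classifier_service.predict_text_class(body["text"], body["model_id"])
    return jsonify(result)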

In WordEmbeddingService, there are two main tasks: training a word2vec model and generating the vector representations of the reviews. The second method, create_word_embeddings, is called by the classifier to use the vectors as input.

When the text is obtained, it is lowercased, non-alphabetic tokens and stop words are removed, and each row is tokenized into a word list. After preprocessing is completed, the word2vec model is trained and saved under a datetime-based name. Afterwards, this modelID is returned to the Controller.
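A sketch of that step, assuming gensim for word2vec and NLTK's English stop-word list (both assumptions; the parameters and the save path are hypothetical):

import re
from datetime import datetime

from gensim.models import Word2Vec
from nltk.corpus import stopwords  # requires nltk.download("stopwords") once

STOP_WORDS = set(stopwords.words("english"))

def preprocess(texts):
    # Lowercase, keep alphabetic tokens, drop stop words, return a word list per row.
    tokenized = []
    for text in texts:
        tokens = re.findall(r"[a-z]+", text.lower())
        tokenized.append([t for t in tokens if t not in STOP_WORDS])
    return tokenized

def train_wv_model(texts, vector_size=100):
    # Train word2vec on the preprocessed reviews and save it under a datetime-based ID.
    sentences = preprocess(texts)
    model = Word2Vec(sentences=sentences, vector_size=vector_size, min_count=2)
    model_id = datetime.now().strftime("%Y%m%d%H%M%S")
    model.save(f"models/word2vec_{model_id}.model")  # hypothetical path
    return model_id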

In the second method, the vector representation of a given text is generated using the existing model. When a word is missing from the vocabulary, a default vector filled with zeros is used. Once all word vectors are fetched from the model, their average is taken to produce the final representation of the text.
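Assuming the same gensim model and the hypothetical save path used above, the averaging with a zero-vector fallback could look like this:

import numpy as np
from gensim.models import Word2Vec

def create_word_embeddings(tokens, model_id):
    # Average the word vectors of a tokenized review; unknown words map to zero vectors.
    model = Word2Vec.load(f"models/word2vec_{model_id}.model")  # hypothetical path
    dim = model.wv.vector_size
    vectors = [
        model.wv[token] if token in model.wv else np.zeros(dim)
        for token in tokens
    ]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)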

TextClassifierService handles the training of the LGBM model and the prediction of a given text using the trained classifier. The first method requires the text dataframe and a modelID to train the classifier; the second requires a vector representation instead of a dataframe.

train_classifier_model starts with text preprocessing. As I mentioned above, the same preprocessing is applied for word2vec model training, and it's implemented in WordEmbeddingService. Separating the preprocessing out of WordEmbeddingService would improve the code quality, but I chose to stop there. After preprocessing is completed, the list of vector representations is split into training and test sets. The model is trained, and the classification performance is measured using the AUC and F-score metrics. Since most classification data sets are imbalanced, I avoid using the accuracy metric. Regarding performance, no hyperparameter tuning is applied to either the language model or the classifier; you can also try different optimal-parameter search methods. Finally, the performance metrics and the modelID are saved into a model_info object, which is returned to the Controller.
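A sketch of that training step, assuming scikit-learn for the split and the metrics (the split ratio, the default LGBM parameters, and the save path are my assumptions):

import numpy as np
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, f1_score

def train_classifier_model(vectors, labels, model_id):
    # Train an LGBM classifier on the averaged word vectors and report AUC / F-score.
    X_train, X_test, y_train, y_test = train_test_split(
        np.array(vectors), np.array(labels), test_size=0.2, random_state=42
    )
    clf = lgb.LGBMClassifier()  # default parameters, no hyperparameter tuning
    clf.fit(X_train, y_train)

    proba = clf.predict_proba(X_test)[:, 1]
    preds = clf.predict(X_test)
    model_info = {
        "model_id": model_id,
        "auc": roc_auc_score(y_test, proba),
        "f_score": f1_score(y_test, preds),
    }
    clf.booster_.save_model(f"models/classifier_{model_id}.txt")  # hypothetical path
    return model_info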

The predict method does not contain any complex logic, because the complexity is carried by the predict_text_class proxy class, which I'm going to discuss last. It receives the vector representation of a text and a modelID to handle the prediction. Once the result object is created, it is sent to our final service.
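A minimal sketch of such a predict method, again with a hypothetical model path and a 0.5 decision threshold as assumptions:

import lightgbm as lgb

def predict(text_vector, model_id):
    # Classify a single averaged text vector with a previously trained LGBM model.
    booster = lgb.Booster(model_file=f"models/classifier_{model_id}.txt")  # hypothetical path
    score = booster.predict([text_vector])[0]  # probability of the positive class
    return {
        "model_id": model_id,
        "predicted_class": int(score >= 0.5),
        "score": float(score),
    }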

I created this service to make the code more readable and to reduce the complexity of the prediction implementation. As the other services do, it performs text preprocessing before prediction. The vector representation of each word is obtained from the word2vec model, then the representation of the whole text is generated by averaging the word vectors. At the end of the method, the classifier predicts the text's class, which is returned to the Controller.
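Putting it together, the proxy's flow could be sketched like this, reusing the hypothetical preprocess, create_word_embeddings, and predict helpers from the earlier sketches:

def predict_text_class(text, model_id):
    # Apply the same preprocessing as training, embed the review, then classify it.
    tokens = preprocess([text])[0]
    text_vector = create_word_embeddings(tokens, model_id)
    return predict(text_vector, model_id)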

In order to call these three endpoints, you can use the cURL commands below:

curl -X POST 'http://localhost:8080/api/v1/wv-model-training' \
--form 'file=@{file_location}'

curl -X POST 'http://localhost:8080/api/v1/text-classifier-training' \
--form 'file=@{file_location}' \
--form 'model_id={model_id_which_is_returned_from_the_first_method}'

curl -X POST 'http://localhost:8080/api/v1/predict-text' \
--header 'Content-Type: application/json' \
--data '{"text": "{any_review_you_want_to_classify}", "model_id": "{model_id_which_is_returned_from_the_first_method}"}'
