Text Classification using Transformers




This was all about how to write the building blocks of a Self-Attention Transformer from scratch in PyTorch. Let’s now move on to a real-world dataset we will use to train a Classification Transformer to classify a question into two categories: the class and the subclass into which the question falls.

2. Introduction to the Dataset and Data Preparation

There are many ways to achieve good accuracy when classifying a sentence or a paragraph into different categories. As the title suggests, in this series of blogs we will discuss one of the most talked-about and widely employed model architectures. Instead of using the transformers library by HuggingFace or any other pre-trained models, we will code a Multi-Head Self-Attention Transformer using PyTorch. To make things more fun and somewhat complicated, the dataset we will be training on has two sets of categories, and we will discuss and implement different approaches to build a good classification model that can classify text into two different sets of categories, each having several classes.

The dataset we will be using is a question classification dataset. The two sets of categories describe what type of answer a given question requires. You can find the dataset here.

For example, consider the question "What are liver enzymes?" This question requires a descriptive text, most suitably a definition. So here, the class is descriptive text and the subclass is definition.

How does the data look?

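Each line in the raw file carries the class, the subclass, and the question itself. Assuming the standard format of this dataset, the class and subclass are joined by a colon and followed by the question text, for example: DESC:def What are liver enzymes ?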

Data Preparation

You can do a wget with the link address of the download files to download the data. We will be using the tokenizers library to convert the question texts to tokens. Since the Hugging Face tokenizers library is written in Rust, it is faster than any pure-Python implementation, which is why we leverage it. You can also try the BytePairEncoding library available here to convert questions to tokens, but it is much slower than the Hugging Face tokenizers.

Once you have downloaded the data, we will clean the sentences and extract our class and subclass labels using the following steps:

  • Decoding a line from bytes to string
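A minimal sketch of this step (the file name and the latin-1 fallback encoding are assumptions; the raw file contains a few non-UTF-8 bytes, which is why we read it in binary mode and decode each line ourselves):

```python
# Read the raw training file in binary mode and decode each line to a string.
# "train_5500.label" and the latin-1 encoding are assumptions here.
with open("train_5500.label", "rb") as f:
    lines = [line.decode("latin-1").strip() for line in f]
```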
  • Strings to class, sub-class, and questions
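One way this step could look (the parse_line helper and the field names are illustrative, not the author's exact code):

```python
def parse_line(line):
    """Split a line like 'DESC:def What are liver enzymes ?'
    into its class, subclass, and question."""
    label, question = line.split(" ", 1)
    cls, subcls = label.split(":")
    return {"class": cls, "subclass": subcls, "question": question}

records = [parse_line(line) for line in lines]
```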
  • How will the data look now?
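With the sketch above, each record is now a small dictionary, for example:

```python
records[0]
# e.g. {'class': 'DESC', 'subclass': 'def', 'question': 'What are liver enzymes ?'}
```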
  • Converting a list of dictionaries to a dataframe
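A sketch of this step, assuming the records list from above:

```python
import pandas as pd

# One row per question, with its class and subclass labels as columns.
df = pd.DataFrame(records)
```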
  • Class names to indexes and vice-versa

In total there are 6 classes

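A sketch of the mapping step, using the classtoidx and idxtoclass names from this article (sorting the names is my own choice, to keep the indexes deterministic across runs):

```python
# Map the 6 class names to integer indexes and back.
classes = sorted(df["class"].unique())
classtoidx = {c: i for i, c in enumerate(classes)}
idxtoclass = {i: c for c, i in classtoidx.items()}
```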
  • Saving classtoidx and idxtoclass
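The mappings can be persisted with pickle so that training and inference share the same indexes (the file names are assumptions):

```python
import pickle

with open("classtoidx.pkl", "wb") as f:
    pickle.dump(classtoidx, f)
with open("idxtoclass.pkl", "wb") as f:
    pickle.dump(idxtoclass, f)
```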
  • Repeating the above two steps for subclasses

There are 47 subclasses in total

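The same recipe applies to the subclass column (the variable names are illustrative):

```python
# Map the 47 subclass names to integer indexes and back.
subclasses = sorted(df["subclass"].unique())
subclasstoidx = {c: i for i, c in enumerate(subclasses)}
idxtosubclass = {i: c for c, i in subclasstoidx.items()}
```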
  • Mapping the classes and subclasses inside the dataframe to their indexes
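One line per column does the trick, assuming the mappings built above:

```python
# Replace the string labels with their integer indexes.
df["class"] = df["class"].map(classtoidx)
df["subclass"] = df["subclass"].map(subclasstoidx)
```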
  • Tokenizing the question texts

We will convert the text to computer-understandable numbers, just as we did for the labels. The vocabulary file we use was obtained by training the BertWordPieceTokenizer on the wikitext data; the vocabulary size is 10k. You can download it from here.

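Loading the tokenizer could look like this (the vocab.txt file name is an assumption; it should point to the downloaded vocabulary file):

```python
from tokenizers import BertWordPieceTokenizer

# Load the 10k-entry WordPiece vocabulary trained on wikitext.
tokenizer = BertWordPieceTokenizer("vocab.txt", lowercase=True)
```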

Let’s start tokenizing. We separately keep a list that stores the number of tokens in every question. The longest question has 52 tokens, so we can safely set the maximum sequence length to 100.

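A sketch of the tokenization loop, assuming the tokenizer loaded above (the padding and truncation settings are my own choices matching the 100-token maximum):

```python
# Pad/truncate every question to a fixed length of 100 tokens.
tokenizer.enable_padding(length=100)
tokenizer.enable_truncation(max_length=100)

outputs, n_tokens = [], []
for question in df["question"]:
    enc = tokenizer.encode(question)
    outputs.append(enc.ids)                    # fixed-length token ids
    n_tokens.append(sum(enc.attention_mask))   # token count before padding
```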
  • Save the outputs list to a pickle
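For example (the file name is assumed):

```python
import pickle

# Save the token-id lists for the training notebook.
with open("outputs.pkl", "wb") as f:
    pickle.dump(outputs, f)
```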

This notebook can be followed to implement all of the above. The code for all the parts is available in this GitHub repo.

If this article helped you in any way and you liked it, please show your appreciation by sharing it with your community. If there are any mistakes, feel free to point them out in the comments below.
