Text Classification using Transformers
This was all about how to write the building blocks of a Self-Attention Transformer from scratch in PyTorch. Let's now move on to a real-world dataset that we will use to train a classification Transformer to sort a question into two categories: the class and the subclass in which the question falls.
2. Introduction to the Dataset and Data Preparation
There are many ways to achieve good accuracy when classifying a sentence or a paragraph into different categories. As the title suggests, in this series of blogs we will discuss one of the most talked-about and widely employed model architectures. Instead of using the transformers library by HuggingFace or any other pre-trained models, we will code a Multi-Head Self-Attention Transformer using PyTorch. To make things more fun and somewhat more complicated, the dataset we will be training on has two sets of categories, and we will discuss and implement different approaches to build a classification model that can assign text to two different sets of categories, each containing several classes.
The dataset we will be using is a question classification dataset. The two sets of categories provide information about what type of answer would be required for a question asked. You can find the dataset here.
For example, the question asked is "What are liver enzymes?" This question requires descriptive text, most suitably a definition. So here, the class is descriptive text and the subclass is definition.
How does the data look?
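A quick way to peek at the raw file is shown below. The filename is a placeholder for whatever the download step saved, and the exact label layout (a class:subclass label followed by the question, matching the liver-enzymes example above) is an assumption based on the description in this article:

import pickle

# Hypothetical filename -- use whatever the download step saved.
with open("train_5500.label", "rb") as f:
    for _ in range(3):
        print(f.readline().decode("latin-1").strip())

# Expected shape of each line (label, then the question), e.g.:
#   DESC:def What are liver enzymes ?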
Data Preparation
You can do a wget with the link address of the download files to fetch the data. We will be using the tokenizers library to convert the question texts to tokens; since the Hugging Face tokenizers library is written in Rust, it is faster than any pure-Python implementation, which is why we are leveraging it. You can also try the BytePairEncoding library available here to convert questions to tokens, but it is much slower than the Hugging Face tokenizers.
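If you prefer staying inside Python instead of shelling out to wget, something like this works as well. The URLs and filenames here are placeholders; substitute the actual download links from the dataset page:

import urllib.request

# Hypothetical filenames and URLs -- replace with the real download links.
files = {
    "train_5500.label": "https://example.com/train_5500.label",
    "test_500.label": "https://example.com/test_500.label",
}
for filename, url in files.items():
    urllib.request.urlretrieve(url, filename)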
Once you have downloaded the data, we will clean the sentences and extract our class and subclass labels using the following steps (a consolidated code sketch of the preparation steps appears after the list):
- Decoding a line from bytes to string
- Strings to class, sub-class, and questions
- How will the data look now?
- Converting a list of dictionaries to a dataframe
- Class names to indexes and vice-versa
In total there are 6 classes
- Saving classtoidx and idxtoclass
- Repeating the above two steps for subclasses
There are 47 subclasses in total
- Mapping the classes and subclasses inside the dataframe to their indexes
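Putting the preparation steps above together, a minimal sketch of the cleaning and label-mapping pipeline might look like this. The filename, the latin-1 decoding, and keeping the full label string as the subclass are assumptions; adjust them to match the notebook:

import pandas as pd

rows = []
# Hypothetical filename -- use whatever the download step saved.
with open("train_5500.label", "rb") as f:
    for raw in f:
        line = raw.decode("latin-1").strip()   # decode a line from bytes to string
        label, question = line.split(" ", 1)   # e.g. "DESC:def", "What are liver enzymes ?"
        cls = label.split(":")[0]              # coarse class, e.g. "DESC"
        rows.append({"class": cls, "subclass": label, "question": question})

# Convert the list of dictionaries to a dataframe
df = pd.DataFrame(rows)

# Class names to indexes and vice-versa (6 classes in total)
classtoidx = {c: i for i, c in enumerate(sorted(df["class"].unique()))}
idxtoclass = {i: c for c, i in classtoidx.items()}

# Repeat the two steps above for the 47 subclasses
subclasstoidx = {s: i for i, s in enumerate(sorted(df["subclass"].unique()))}
idxtosubclass = {i: s for s, i in subclasstoidx.items()}

# Map the classes and subclasses inside the dataframe to their indexes
df["class_idx"] = df["class"].map(classtoidx)
df["subclass_idx"] = df["subclass"].map(subclasstoidx)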
- Tokenizing the question texts
We will convert the text to numbers the computer can understand, just as we did for the labels. The vocabulary file was obtained by training the BertWordPieceTokenizer on the wikitext data, and the vocabulary size is 10k. You can download it from here.
Let's start tokenizing. We separately keep a list that stores the number of tokens in every question. The longest question is 52 tokens, so we can safely set the maximum sequence length to 100.
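Continuing from the dataframe sketch above, a minimal tokenization pass with the Hugging Face tokenizers library could look like the following. The vocabulary filename is a placeholder for the 10k wikitext vocabulary mentioned above:

from tokenizers import BertWordPieceTokenizer

# Hypothetical vocab filename -- point this at the downloaded 10k wikitext vocabulary.
tokenizer = BertWordPieceTokenizer("wikitext-vocab.txt", lowercase=True)

# Keep a separate list with the number of tokens for every question.
num_tokens = [len(tokenizer.encode(q).ids) for q in df["question"]]
print(max(num_tokens))  # the article reports a maximum of 52

# Pad and truncate everything to a fixed maximum sequence length of 100.
MAX_LEN = 100
tokenizer.enable_padding(length=MAX_LEN)
tokenizer.enable_truncation(max_length=MAX_LEN)

encodings = tokenizer.encode_batch(list(df["question"]))
outputs = [enc.ids for enc in encodings]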
- Save the outputs list to a pickle
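Saving the outputs list is a one-liner with pickle (the output path here is arbitrary):

import pickle

# Persist the tokenized questions so the training notebook can load them directly.
with open("outputs.pkl", "wb") as f:
    pickle.dump(outputs, f)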
This notebook can be followed to implement all of the above. The code for all the parts is available in this GitHub repo.
If this article helped you in any way and you liked it, please show your appreciation by sharing it with your community. If there are any mistakes, feel free to point them out in the comments below.