Text detection in a natural scene image
Detecting texts in a natural scene image using deep learning techniques.
This blog is an attempt to give an in-depth explanation of text localization techniques for real-world natural scene images. We will implement a text detection algorithm from scratch and use it to detect text.
Table of contents:
- Description
- Business Problem and use case
- Problem statement and expected solution
- Data Overview
- Deep learning approach
- EAST architecture and model description
- Training and loss
- Implementation and results
- Model quantization
- Further improvements and future work
- References
Description
Recognizing text in an image or scene has long been a challenging task. We humans have no problem detecting and understanding text in the real world, whatever its shape, size, orientation or color, even when the image suffers from noise, blur or low-light conditions.
Our challenge here is to detect text in a real-world image automatically.
Note that this is different from general OCR, where the text is structured and usually sits on an even background.
Business problem and use cases
If we can detect text in a noisy scene image, it opens the door to solutions for many other use cases such as text recognition and text translation, reducing the cost of manual labor in many sectors. For example, with a phone camera we can detect text in one language and translate it into English or the user's native language, so that they can read addresses or road names in a foreign environment for self-guidance. This can also be used in the advertising industry.
Another use case is accessibility: combining text detection that handles any format, font or orientation with advances in text-to-speech can help visually impaired people by reading detected text aloud to them.
Problem statement
Given a set of natural scene images containing text of different orientations, colors and sizes, we have to detect the text in them. Each detected word will be enclosed by a rotated bounding box that fits the text as tightly as possible.
Data source
The data is downloaded from the SynthText dataset. Annotations are given in a .mat file, from which we will extract them into a pandas dataframe.
Dataset overview
The dataset contains 858,750 synthetic scene-image files (.jpg) split across 200 directories, out of which we have randomly sampled 5,860 images for training. This number can be increased for more intensive training.
The annotation of each image contains the image name, the word-level bounding-box coordinates and the corresponding texts.
Bounding box coordinates -> word-level bounding boxes for each image, represented by tensors of size 2x4xNWORDS_i, where:
- the first dimension is 2, for x and y respectively,
- the second dimension corresponds to the 4 corner points (clockwise, starting from the top-left),
- and the third dimension, of size NWORDS_i, corresponds to the number of words in the i-th image.
For more details about the data, please visit the SynthText dataset page.
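As a rough sketch of the loading step (not necessarily the exact code used in this project), the .mat annotations can be flattened into a dataframe with scipy and pandas; the path 'SynthText/gt.mat' and the column names are assumptions that mirror the dataframe shown later in the implementation section:

```python
import numpy as np
import pandas as pd
import scipy.io as sio

# Load the SynthText annotation file (the path is an assumption).
gt = sio.loadmat('SynthText/gt.mat')

rows = []
for i in range(gt['imnames'].shape[1]):
    imname = gt['imnames'][0, i][0]
    word_bb = gt['wordBB'][0, i]            # shape 2 x 4 x NWORDS_i (2 x 4 for a single word)
    if word_bb.ndim == 2:                   # make the single-word case consistent
        word_bb = word_bb[:, :, np.newaxis]
    words = ' '.join(t.strip() for t in gt['txt'][0, i]).split()
    # One row per image: name, word-level texts and a list of 4x2 quads.
    boxes = [word_bb[:, :, k].T.tolist() for k in range(word_bb.shape[2])]
    rows.append({'imnames': imname, 'txt': words, 'bbox': boxes})

df = pd.DataFrame(rows)
print(df.head())
```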
Deep learning problem
Using a set of real-world scene images with word-level text annotated by bounding boxes, we have to train a deep learning model (a CNN) that, given a new image, detects each word separately. Different words on the same line are detected separately because the model learns to split consecutive words at the space between them.
We will use the EAST text detector architecture to implement our own CNN-based model, along with OpenCV for image processing and for generating the rotated bounding boxes.
Now we will understand each component used to solve this problem.
EAST — An Efficient and Accurate Scene Text detector
EAST is a simple yet powerful pipeline that yields fast and accurate text detection in natural scenes. The pipeline directly predicts words or text lines of arbitrary orientations and shapes, eliminating unnecessary intermediate steps with a single neural network.
Model Architecture:
The model is a fully-convolutional neural network adapted for text detection that outputs dense per-pixel predictions of words or text lines. This eliminates intermediate steps such as candidate proposal, text region formation and word partition. The post-processing steps only include thresholding and NMS on predicted geometric shapes.
EAST adopts the idea of the U-shaped architecture (U-Net) to merge feature maps gradually while keeping the up-sampling branches small. This yields a network that can utilize features from different levels; the U-Net-style connections let the model reuse information from earlier layers.
The model can be decomposed into three parts: the feature extractor stem, the feature merging branch and the output layer, as shown in the figure above.
Feature extractor stem:
In the original paper PVANet was used, but I will be using a pre-trained ResNet50 as the feature extractor stem. This branch extracts basic information from images such as shapes, edges, colors and patterns, which makes it easier for the rest of the network to learn the feature maps needed to detect text.
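A minimal sketch of such a stem in Keras, assuming 512x512 inputs and the standard TF 2.x ResNet50 layer names for the feature maps at strides 4, 8, 16 and 32 (the exact layers used in the project may differ):

```python
import tensorflow as tf

def build_stem(input_shape=(512, 512, 3)):
    # Pre-trained ResNet50 without the classification head.
    backbone = tf.keras.applications.ResNet50(
        include_top=False, weights='imagenet', input_shape=input_shape)
    # Feature maps at 1/4, 1/8, 1/16 and 1/32 of the input resolution.
    layer_names = ['conv2_block3_out', 'conv3_block4_out',
                   'conv4_block6_out', 'conv5_block3_out']
    outputs = [backbone.get_layer(name).output for name in layer_names]
    return tf.keras.Model(inputs=backbone.input, outputs=outputs, name='stem')
```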
Feature merging branch:
In each merging stage, the feature map from the previous stage is first fed to an unpooling layer to double its size and then concatenated with the current feature map from the stem. Next, a 1x1 convolution bottleneck cuts down the number of channels and reduces computation, followed by a 3x3 convolution that fuses the information and produces the output of that merging stage. After the last merging stage, a 3x3 convolution layer produces the final feature map of the merging branch and feeds it to the output layer.
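One merging stage could look roughly like this in Keras; bilinear up-sampling is used here in place of unpooling, which is a common substitution rather than necessarily the author's exact choice:

```python
from tensorflow.keras import layers

def merge_stage(x, skip, channels):
    # Double the spatial size of the deeper feature map ("unpooling").
    x = layers.UpSampling2D(size=2, interpolation='bilinear')(x)
    # Concatenate with the stem feature map at this resolution.
    x = layers.Concatenate(axis=-1)([x, skip])
    # 1x1 bottleneck to cut channels, then 3x3 to fuse the information.
    x = layers.Conv2D(channels, 1, padding='same', activation='relu')(x)
    x = layers.Conv2D(channels, 3, padding='same', activation='relu')(x)
    return x
```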
Output layer:
The final output layer contains several 1x1 convolutions that project the 32 feature-map channels into a 1-channel score map Fs and a multi-channel geometry map Fg. Here Fg has 5 output channels: 4 for the distances from a pixel to the edges of its text box and 1 for the rotation angle.
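A sketch of this head, assuming a 512x512 input so that the distance channels are scaled by 512 and the angle is squashed into roughly [-pi/4, pi/4], as in the EAST paper:

```python
import numpy as np
from tensorflow.keras import layers

def output_head(x, input_size=512):
    # 1-channel score map in [0, 1].
    score_map = layers.Conv2D(1, 1, activation='sigmoid', name='score_map')(x)
    # 4 distance channels (left, right, top, bottom), scaled to pixel units.
    dists = layers.Conv2D(4, 1, activation='sigmoid')(x)
    dists = layers.Lambda(lambda t: t * input_size, name='geo_dists')(dists)
    # 1 rotation-angle channel mapped to roughly [-pi/4, pi/4].
    angle = layers.Conv2D(1, 1, activation='sigmoid')(x)
    angle = layers.Lambda(lambda t: (t - 0.5) * np.pi / 2, name='geo_angle')(angle)
    geometry_map = layers.Concatenate(axis=-1, name='geometry_map')([dists, angle])
    return score_map, geometry_map
```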
Training and loss:
During training we have two outputs/targets on which we have to compute and minimize the loss.
Score map:
The positive area of the bounding quadrangle on the score map is designed to be roughly a shrunk version of the original one. It is essentially a binary mask in which the area covered by a bounding box is positive.
Geometry map:
The geometry map is either RBOX (rotated box) or QUAD. I have used the RBOX implementation, as it gave me better results than the QUAD-style geometry map. The final shape of the map is the same as the image shape.
For each pixel with a positive score in the score map, we calculate its distances to the 4 boundaries of the text box and put them in the 4 distance channels of the RBOX ground truth at the corresponding pixel of the geometry map. One more channel keeps the angle by which the box is rotated.
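To make the channel layout concrete, here is a toy example for a single axis-aligned box; real SynthText boxes are rotated quads and the positive region is shrunk, so this is only an illustration, with distances ordered left, right, top, bottom:

```python
import numpy as np

def toy_rbox_ground_truth(img_h, img_w, box):
    # box = (x1, y1, x2, y2) for one axis-aligned word; the angle channel stays 0.
    score = np.zeros((img_h, img_w), np.float32)
    geo = np.zeros((img_h, img_w, 5), np.float32)
    x1, y1, x2, y2 = box
    ys, xs = np.mgrid[y1:y2, x1:x2]
    score[y1:y2, x1:x2] = 1.0            # positive score inside the box
    geo[y1:y2, x1:x2, 0] = xs - x1       # distance to the left edge
    geo[y1:y2, x1:x2, 1] = x2 - xs       # distance to the right edge
    geo[y1:y2, x1:x2, 2] = ys - y1       # distance to the top edge
    geo[y1:y2, x1:x2, 3] = y2 - ys       # distance to the bottom edge
    return score, geo
```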
Loss function:
Loss for the score map: the paper uses a class-balanced cross-entropy loss for the score map, since there is a huge imbalance between text and background pixels.
However, I have used the Dice coefficient loss, which worked better for me. It is a commonly used loss function for semantic segmentation. It takes into account the global and local composition of pixels, thereby providing better boundary detection than a weighted cross-entropy.
The equation for it is as below:
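In its standard form, the Dice coefficient loss between the ground-truth score map Y and the predicted score map Y_hat is:

L_score = 1 - (2 * sum(Y * Y_hat) + eps) / (sum(Y) + sum(Y_hat) + eps)

where eps is a small constant for numerical stability.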
Loss for the geometry map: we cannot directly use an L1 or L2 loss to regress the 4 distances and 1 angle, because text sizes in natural scenes vary greatly, which would bias the loss towards larger and longer text regions. We need the regression loss to be scale-invariant.
So we use the IoU (Intersection over Union) loss, since it is invariant to object scale. Given the distances d1, d2, d3, d4 from a pixel to the left, right, top and bottom edges of the box, both the intersection and the union areas can be computed easily.
Next, the loss of rotation angle is computed as below:
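The figure that belonged here showed the angle loss from the EAST paper, which is the cosine distance between the predicted angle theta_hat and the ground-truth angle theta*:

L_theta = 1 - cos(theta_hat - theta*)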
Finally, the overall geometry loss is the weighted sum of the IoU loss and the angle loss, Lg = L_IoU + lambda * L_theta, where lambda is set to 20 in my implementation.
Training
The network is trained with the Adam optimizer. I started with a learning rate of 1e-3 until the 3rd epoch, then kept it at 1e-4 until the 30th epoch. After noticing that the loss was not decreasing much, I lowered the learning rate to 1e-5 after the 30th epoch. This was done using Keras callbacks, for example as sketched below.
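A sketch of that schedule with a LearningRateScheduler callback (the epoch boundaries mirror the description above; this is an assumed reconstruction, not the original code):

```python
import tensorflow as tf

def lr_schedule(epoch, lr):
    # 1e-3 for the first 3 epochs, 1e-4 until epoch 30, then 1e-5.
    if epoch < 3:
        return 1e-3
    if epoch < 30:
        return 1e-4
    return 1e-5

lr_callback = tf.keras.callbacks.LearningRateScheduler(lr_schedule, verbose=1)
# Passed to model.fit(..., callbacks=[lr_callback])
```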
The batch size was 24; each epoch trained on 4,688 images, while 1,172 of the total 5,860 images were kept for validation at each epoch. The model was trained for a total of 40 epochs.
Obtaining final results from predictions
The geometries obtained after thresholding the score map are then merged using NMS, i.e. Non-Maximum Suppression.
Since geometries from adjacent pixels are highly correlated and often belong to the same bounding box, we use locality-aware NMS instead of standard NMS.
In locality-aware NMS the geometries are merged row by row, and the merged bounding box is the average of the two given boxes, weighted by their scores at the corresponding coordinates of the score map. Pseudocode for the NMS is shown below:
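The pseudocode figure did not survive, so here is a Python-flavoured reconstruction following the description in the EAST paper; iou, weighted_merge and standard_nms are assumed helper routines (weighted_merge averages two quads weighted by their scores):

```python
def locality_aware_nms(geometries, iou_threshold=0.3):
    # geometries: row-by-row list of (quad, score) predictions.
    merged = []            # S in the paper
    prev = None            # p in the paper
    for geom in geometries:
        if prev is not None and iou(prev, geom) > iou_threshold:
            # Adjacent, overlapping geometries: merge them, weighted by score.
            prev = weighted_merge(prev, geom)
        else:
            if prev is not None:
                merged.append(prev)
            prev = geom
    if prev is not None:
        merged.append(prev)
    # A final pass of standard NMS over the merged candidates.
    return standard_nms(merged, iou_threshold)
```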
That covers everything we need to know about the architecture of the EAST text detection model. Now let us dive into the implementation.
Implementation
After importing the annotations, our final dataframe looks something like this:
We have the 'imnames' column with the image names, the word-level text in the 'txt' column and the corresponding bounding boxes ('bbox') as a list of lists.
Input Pipeline
We will build the input pipeline using TensorFlow's 'tf.data.Dataset' module.
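A minimal sketch of such a pipeline; the real one would also generate the score and geometry maps for each image, which is omitted here for brevity:

```python
import tensorflow as tf

def make_dataset(image_paths, batch_size=24, input_size=512):
    ds = tf.data.Dataset.from_tensor_slices(image_paths)

    def load_example(path):
        # Read and resize the image; the ground-truth score and geometry maps
        # would be produced alongside it in the full pipeline.
        img = tf.io.decode_jpeg(tf.io.read_file(path), channels=3)
        img = tf.image.resize(img, (input_size, input_size)) / 255.0
        return img

    return (ds.map(load_example, num_parallel_calls=tf.data.AUTOTUNE)
              .shuffle(256)
              .batch(batch_size)
              .prefetch(tf.data.AUTOTUNE))
```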
Loss:
Next, we define the losses for our model.
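Here is a sketch of the two losses described earlier as Keras-compatible functions. The channel order (left, right, top, bottom, angle) and the trick of packing the score mask as a sixth channel of y_true are my assumptions, not necessarily the author's exact implementation; lambda for the angle term is 20, as stated above:

```python
import tensorflow as tf

def dice_loss(y_true, y_pred, eps=1e-5):
    # 1 - Dice coefficient between ground-truth and predicted score maps.
    inter = tf.reduce_sum(y_true * y_pred)
    union = tf.reduce_sum(y_true) + tf.reduce_sum(y_pred)
    return 1.0 - (2.0 * inter + eps) / (union + eps)

def rbox_loss(y_true, y_pred, lam=20.0):
    # y_true carries 6 channels: 4 distances, the angle and the score mask.
    l_t, r_t, t_t, b_t, theta_t, mask = tf.split(y_true, 6, axis=-1)
    l_p, r_p, t_p, b_p, theta_p = tf.split(y_pred, 5, axis=-1)
    area_true = (l_t + r_t) * (t_t + b_t)
    area_pred = (l_p + r_p) * (t_p + b_p)
    # Width and height of the intersection rectangle.
    w_inter = tf.minimum(l_t, l_p) + tf.minimum(r_t, r_p)
    h_inter = tf.minimum(t_t, t_p) + tf.minimum(b_t, b_p)
    inter = w_inter * h_inter
    union = area_true + area_pred - inter
    iou_loss = -tf.math.log((inter + 1.0) / (union + 1.0))
    angle_loss = 1.0 - tf.cos(theta_p - theta_t)
    geo_loss = iou_loss + lam * angle_loss
    # Only pixels inside text regions contribute to the geometry loss.
    return tf.reduce_sum(geo_loss * mask) / (tf.reduce_sum(mask) + 1e-5)
```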
After everything is defined, we compile the model.
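Compilation could then look roughly like this; build_east is a hypothetical builder that wires together the stem, the merging branch and the output head, and the output names match the sketches above:

```python
model = build_east()  # hypothetical builder combining stem, merging branch and head
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    loss={'score_map': dice_loss, 'geometry_map': rbox_loss})
```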
Inference:
After the model is trained, we need to look at what it predicts. Inference is done with the code below.
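Since the original code block did not survive, here is a hedged sketch of the inference steps: predict, threshold the score map, decode rotated boxes from the geometry map, run NMS and draw the result with OpenCV. restore_rbox (decoding one pixel's distances and angle into a quad) and locality_aware_nms stand for the routines discussed earlier; the sketch also assumes the output maps are at the input resolution, whereas in practice they are often at 1/4 of it:

```python
import cv2
import numpy as np

def detect(model, image_path, input_size=512, score_thresh=0.8):
    image = cv2.imread(image_path)
    h, w = image.shape[:2]
    inp = cv2.resize(image, (input_size, input_size))[np.newaxis] / 255.0
    score, geometry = model.predict(inp)
    score, geometry = score[0, ..., 0], geometry[0]

    # Keep only pixels that are confidently text.
    ys, xs = np.where(score > score_thresh)
    # Decode each kept pixel's 4 distances + angle into a rotated box,
    # then merge overlapping boxes with locality-aware NMS.
    boxes = [restore_rbox(x, y, geometry[y, x]) for y, x in zip(ys, xs)]
    boxes = locality_aware_nms(list(zip(boxes, score[ys, xs])))

    # Draw the boxes, rescaled back to the original image size.
    sx, sy = w / input_size, h / input_size
    for quad, _ in boxes:
        pts = (np.array(quad) * [sx, sy]).astype(np.int32)
        cv2.polylines(image, [pts], isClosed=True, color=(0, 255, 0), thickness=2)
    return image
```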
The output of our trained model:
It also performs well on images whose text style and background are very different from our training data (SynthText), for example on an image from the ICDAR 2015 dataset.
As we can see from the results above, our model detects unstructured text in these images quite well, given the small amount of training and a very simple approach. Combined with other image processing techniques and more training, I am sure it would give even better results.
Model analysis:
After training our model we got the following losses on our train and test data.
Train loss:
Test loss:
Now let us analyze the distribution of the train and test losses per epoch during training.
As we can see, both the train and test loss distributions have their highest density around a loss of 0.3.
Let us categorize the losses into 3 categories:
We will analyze how the data is distributed, treating the lowest 33rd percentile of the training loss as the best category and the 75th percentile and above as the worst, since lower loss means better results.
Categorizing training loss:
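One way to do this with pandas, assuming a per-sample 'loss' column; the cut-offs follow the 33rd and 75th percentiles mentioned above:

```python
import numpy as np
import pandas as pd

def categorize_losses(loss_values):
    s = pd.Series(loss_values, name='loss')
    p33, p75 = np.percentile(s, [33, 75])
    # Lower loss is better: below the 33rd percentile is "best",
    # above the 75th percentile is "worst", everything in between "average".
    category = pd.cut(s, bins=[-np.inf, p33, p75, np.inf],
                      labels=['best', 'average', 'worst'])
    return pd.DataFrame({'loss': s, 'category': category})
```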