Extracting printed and handwritten text from pictures of cheques. We automated the workflow for an international bank, training a custom neural network as an end-to-end, data-driven solution. We provided a pipeline for increasing the engine’s accuracy with more labeled images, without requiring any programming knowledge, and reduced the time needed to go through cheques manually. The solution achieved an accuracy of 99% with a recall of 98%.
The client is an international bank. They wanted to simplify transactions for their users by allowing them to deposit cheques through a mobile app, without having to visit a branch.
The client wanted to develop an automated solution for remote cheque cashing through a companion mobile app. The app user takes a picture of the cheque with their smartphone camera, and the image is sent to the bank’s server, where the various cheque fields are extracted automatically. After that, a bank employee simply compares the image with the extracted fields before approving the transaction.
The bank first experimented with several paid and open-source solutions, but they ultimately failed on handwritten digits in pictures taken with a smartphone camera, which are far from the ideal conditions such models are usually trained on. To make the solution acceptable and actually save bank employees’ time, accuracy needed to be extremely high.
They reached out to BroutonLab to develop an engine for automated cheque scanning with high accuracy, based on the client’s labeled dataset. An additional requirement was that the engine’s accuracy should keep improving as more labeled images are acquired through app usage and human verification.
The solution can be divided into two parts: recognizing the printed fields on the cheque, and recognizing the handwritten amount.
The first part can rely on plenty of preceding work done by the community: there are many open-source solutions that handle printed-text detection and recognition with high quality.
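For illustration only, here is a minimal sketch of how an open-source OCR engine such as Tesseract (accessed via the pytesseract wrapper, which is our own choice for this example and is not named in the original) could read a printed field from a cropped cheque image:

```python
# Hypothetical sketch: reading a printed cheque field with an open-source OCR engine.
# Tesseract/pytesseract is used here only as an example of available tooling;
# the actual components used in the project are not specified in the original.
import cv2
import pytesseract

def read_printed_field(image_path: str) -> str:
    """Binarize a cropped field image and run open-source OCR on it."""
    image = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    # Simple Otsu thresholding to clean up smartphone-camera noise before OCR.
    _, binary = cv2.threshold(image, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # --psm 7 tells Tesseract to treat the crop as a single line of text.
    return pytesseract.image_to_string(binary, config="--psm 7").strip()

# Example usage on a hypothetical cropped "payee name" field:
# print(read_printed_field("cheque_payee_crop.png"))
```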
The hardest part is recognizing the amount that needs to be cashed because these values are handwritten in most cases. We trained a custom neural network to resolve this issue.
For training, validation, and testing, we used 500 thousand images of the cropped “amount” field, each paired with a label indicating what the handwritten amount is.
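As a rough sketch of how such a dataset can be wired up for training (the character set and file layout below are our own assumptions, not taken from the original), each cropped image is paired with its amount string encoded as a sequence of character indices:

```python
# Hypothetical dataset of cropped "amount" fields with string labels, encoded for CTC
# training. The alphabet and sample format are illustrative assumptions.
import torch
from torch.utils.data import Dataset
from PIL import Image

CHARS = "0123456789.,"                                   # assumed alphabet for amounts
CHAR_TO_IDX = {c: i + 1 for i, c in enumerate(CHARS)}    # index 0 is reserved for the CTC blank

class AmountFieldDataset(Dataset):
    def __init__(self, samples, transform=None):
        # samples: list of (image_path, amount_string) pairs
        self.samples = samples
        self.transform = transform

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        path, amount = self.samples[idx]
        image = Image.open(path).convert("L")            # grayscale crop of the amount field
        if self.transform is not None:
            image = self.transform(image)
        target = torch.tensor([CHAR_TO_IDX[c] for c in amount], dtype=torch.long)
        return image, target
```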
We researched different approaches to this problem and ran many experiments. In the end, we settled on an architecture trained using CTC loss.
Connectionist Temporal Classification (CTC) is an algorithm used to train deep neural networks for speech and handwriting recognition; its key novelty is that it does not require any alignment between the input and the output.
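A minimal sketch of how CTC loss is applied in practice, using PyTorch’s built-in nn.CTCLoss (the exact framework, tensor shapes, and alphabet size used in the project are not stated in the original and are illustrative here):

```python
# Minimal CTC training step in PyTorch; all shapes and sizes are illustrative.
import torch
import torch.nn as nn

T, N, C = 32, 4, 13   # time steps per image, batch size, number of classes (incl. CTC blank)

# Stand-in for the backbone's per-step output; a real model would produce this from the image.
log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(2)
targets = torch.randint(1, C, (N, 10), dtype=torch.long)   # encoded amount strings (no blank index)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 10, dtype=torch.long)

# CTC marginalizes over all possible alignments between the T output steps and the target
# string, so no per-character alignment annotation is needed in the training data.
ctc_loss = nn.CTCLoss(blank=0)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```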
The backbone of the architecture can be any neural network type: CNN, RNN, 3D-CNN, or Transformer. Each of them has its advantages, but we decided to ensemble five different networks into one large model.
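The original does not list the specific five networks, so as a hedged sketch, one common way to ensemble several CTC-trained backbones is to average their per-step probabilities before decoding:

```python
# Hypothetical ensembling sketch: average per-timestep probabilities from several backbones
# before CTC decoding. The actual five networks used in the project are not named in the source.
import torch
import torch.nn as nn

class CTCEnsemble(nn.Module):
    def __init__(self, backbones):
        super().__init__()
        # Each backbone is assumed to map an image batch to logits of shape (T, N, C).
        self.backbones = nn.ModuleList(backbones)

    def forward(self, images):
        # Average probabilities across ensemble members, then return log-probabilities for decoding.
        probs = torch.stack([m(images).softmax(dim=2) for m in self.backbones]).mean(dim=0)
        return probs.clamp_min(1e-9).log()
```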
At the time of development, image transformer models were not yet available; in future research, we plan to use them.
Ensembling, coupled with carefully selected hyperparameters, allowed us to significantly increase the quality and confidence of our model; as a result, we were able to achieve an accuracy of 99% and a recall of 98%. A useful property of our model is that it is possible to check how confident each prediction is. This allows us to filter out unreadable cheque images, which further improves the final quality.
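One simple way to obtain such a confidence score (our own illustrative sketch, not necessarily the exact method used in the project) is to multiply the per-step probabilities along the greedy CTC decoding path and reject predictions that fall below a threshold:

```python
# Hypothetical confidence check: score a greedy CTC decoding by the product of the
# per-step probabilities it follows, and route low-confidence images to manual review.
import torch

def greedy_ctc_confidence(log_probs: torch.Tensor) -> float:
    """log_probs: (T, C) log-probabilities for a single image."""
    best = log_probs.max(dim=1).values   # log-probability of the chosen class at each step
    return best.sum().exp().item()       # product of per-step probabilities

def is_readable(log_probs: torch.Tensor, threshold: float = 0.5) -> bool:
    # Images below the (assumed) threshold are treated as unreadable and reviewed manually.
    return greedy_ctc_confidence(log_probs) >= threshold
```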
The time needed to process a single cheque image was reduced from 10 minutes to 30 seconds. The short processing time and high accuracy led to a 5X increase in app downloads. Employees didn’t have to waste time going through cheque images and were able to focus on other, more important tasks.