Natural Language Processing resume parsing software to achieve high ATS accuracy

Leveraging NLP and deep learning to help a startup optimize job matching. Automated parsing of unstructured resumes extracts useful information about each candidate's experience, skills, and past employment, and a generated paragraph automatically summarizes the candidate's resume.

Client

The client is a Dublin-based startup that develops an AI-powered talent recommender engine for HR tech vendors, enterprises, and staffing agencies. Through resume parsing and analysis, information extraction, and candidate matching and ranking, the engine makes talent management easy. The startup was acquired by a leading US cloud recruiting platform.

Problem

The client wanted to develop a resume parser and summarizer that works with unstructured CVs sent by candidates as documents in various formats. The engine was expected to extract all important information about a candidate, including personal information, level of seniority, years spent with different companies, and experience with different skills. Finally, the engine had to summarize everything about the candidate in a short, generated paragraph.

The client had experimented with open-source NLP solutions, but the results were not satisfactory because the models weren't trained on relevant data. Accuracy was critical: a mistake during parsing or summarization could unfairly disqualify a candidate.

They reached out to BroutonLab to develop a custom engine using a relatively small labeled training dataset.

Example of extracted candidate information

Solution

The final solution can be broken down into several sub-problems:

  • Named Entity Recognition - we had to extract names, organizations, and skills from the candidate’s CV for further analysis.
  • Automatic Salary Prediction - we had to predict the potential salary for each combination of resume and job post, based on everything extracted in the previous step (a sketch of this idea follows the list).
  • Resume Summarization - finally, we had to summarize each resume in one paragraph so that a reviewer could evaluate the candidate without reading the entire document.
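
As an illustration of the salary prediction sub-problem, below is a minimal sketch of a regression model that scores a resume/job-post pair. The pre-computed fixed-size embeddings, layer sizes, and all names are illustrative assumptions, not the client's actual architecture.

    # Hypothetical salary-regression sketch: a small dense network over
    # pre-computed resume and job-post embeddings (e.g. averaged word vectors).
    import numpy as np
    from tensorflow import keras
    from tensorflow.keras import layers

    EMB_DIM = 300  # assumed embedding size

    resume_in = keras.Input(shape=(EMB_DIM,), name="resume_embedding")
    job_in = keras.Input(shape=(EMB_DIM,), name="job_post_embedding")

    x = layers.Concatenate()([resume_in, job_in])
    x = layers.Dense(128, activation="relu")(x)
    x = layers.Dense(64, activation="relu")(x)
    salary = layers.Dense(1, name="predicted_salary")(x)  # regression output

    model = keras.Model(inputs=[resume_in, job_in], outputs=salary)
    model.compile(optimizer="adam", loss="mse", metrics=["mae"])

    # Dummy data just to demonstrate the expected shapes.
    resumes = np.random.rand(32, EMB_DIM).astype("float32")
    jobs = np.random.rand(32, EMB_DIM).astype("float32")
    salaries = np.random.rand(32, 1).astype("float32") * 100_000
    model.fit([resumes, jobs], salaries, epochs=1, batch_size=8)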

Datasets

Within the scope of the project, two types of documents were analyzed:

  • Resumes
  • Job posts

Resumes and job posts were taken from public job boards. Some of the first paying clients also provided their own datasets of resumes and job posts to help us improve the engine's accuracy.

These documents were labeled using the outputs of several commercial resume and job-post parsers. This let us skip manually labeling each candidate's CV and job post, which would have required more time and a dedicated data labeling budget. A simplified sketch of this label-merging idea is shown below.
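
The sketch merges tags from several parsers by majority vote. The BIO tag sequences and the voting threshold are our assumptions; the actual aggregation logic used in the project may have differed.

    # Merge entity tags proposed by several commercial parsers, token by token.
    from collections import Counter

    def merge_weak_labels(parser_outputs):
        """parser_outputs: one BIO tag sequence per parser, all aligned
        to the same tokenization of the document."""
        merged = []
        for token_tags in zip(*parser_outputs):
            tag, votes = Counter(token_tags).most_common(1)[0]
            # Keep a tag only if a strict majority of parsers agree;
            # otherwise fall back to 'O' (outside any entity).
            merged.append(tag if votes > len(parser_outputs) // 2 else "O")
        return merged

    # Three parsers label the tokens "John Smith , Python developer".
    outputs = [
        ["B-PER", "I-PER", "O", "B-SKILL", "O"],
        ["B-PER", "I-PER", "O", "B-SKILL", "O"],
        ["B-PER", "O",     "O", "O",       "O"],
    ]
    print(merge_weak_labels(outputs))  # ['B-PER', 'I-PER', 'O', 'B-SKILL', 'O']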

Example of a bidirectional LSTM network (from What Is Bidirectional LSTM?)

Models

Recurrent and Convolutional Networks

To solve these NLP problems, we used stacked bidirectional GRU/LSTM recurrent layers. It's worth noting that this research was done before the introduction of the Transformer architecture from Google, which now often outperforms the methods presented below.

We also compared this approach with convolutional layers, which are generally faster than recurrent layers; however, the recurrent layers showed better accuracy, so we chose that architecture for the Named Entity Recognition, Salary Prediction, and Resume Summarization problems.
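
As a minimal sketch of such an architecture, the following Keras model stacks two bidirectional LSTM layers for token-level NER tagging. The vocabulary size, tag set, and layer dimensions are illustrative assumptions.

    from tensorflow import keras
    from tensorflow.keras import layers

    VOCAB_SIZE = 20_000  # assumed vocabulary size
    NUM_TAGS = 9         # e.g. BIO tags for PER / ORG / SKILL / TITLE + 'O'
    MAX_LEN = 256        # assumed maximum document length in tokens

    model = keras.Sequential([
        keras.Input(shape=(MAX_LEN,)),
        layers.Embedding(VOCAB_SIZE, 128, mask_zero=True),
        # Two stacked bidirectional recurrent layers; GRU can be swapped in.
        layers.Bidirectional(layers.LSTM(128, return_sequences=True)),
        layers.Bidirectional(layers.LSTM(64, return_sequences=True)),
        # Per-token softmax over the entity tag set.
        layers.TimeDistributed(layers.Dense(NUM_TAGS, activation="softmax")),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    model.summary()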

We experimented with different vector embeddings, including fastText and custom Word2vec models (see the sketch after this list). This helped us significantly reduce the following:

  • dependency on the amount of labeled data
  • the time needed for model training
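
The sketch below shows how such embeddings can be trained with gensim (assuming gensim >= 4.0); the tiny tokenized corpus stands in for the real resume data.

    from gensim.models import FastText, Word2Vec

    tokenized_resumes = [
        ["senior", "python", "developer", "with", "aws", "experience"],
        ["java", "engineer", "familiar", "with", "spring", "and", "sql"],
    ]

    w2v = Word2Vec(sentences=tokenized_resumes, vector_size=100,
                   window=5, min_count=1, workers=4)
    ft = FastText(sentences=tokenized_resumes, vector_size=100,
                  window=5, min_count=1)  # subword-aware, robust to typos

    print(w2v.wv["python"].shape)     # (100,)
    print(ft.wv["pythonista"].shape)  # fastText handles out-of-vocabulary words
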
Authors of NIPS papers clustered by the topic of their papers (from New Gensim feature: Author-topic modeling)

Topic modeling

Since we had to deal with a large number of texts, we used several unsupervised approaches to cluster texts by topic. In particular, we used BigARTM because it showed better performance and accuracy than other libraries such as Gensim.
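
A minimal BigARTM sketch, assuming the corpus has already been exported to Vowpal Wabbit format; the number of topics and score names are illustrative:

    import artm

    # Convert the raw collection into BigARTM's internal batch format.
    batch_vectorizer = artm.BatchVectorizer(data_path="corpus_vw.txt",
                                            data_format="vowpal_wabbit",
                                            target_folder="batches")

    model = artm.ARTM(num_topics=20, dictionary=batch_vectorizer.dictionary)
    model.scores.add(artm.TopTokensScore(name="top_tokens", num_tokens=10))
    model.fit_offline(batch_vectorizer=batch_vectorizer,
                      num_collection_passes=10)

    # Print the most probable tokens per discovered topic.
    for topic, tokens in model.score_tracker["top_tokens"].last_tokens.items():
        print(topic, tokens)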

Server

We helped develop and optimize the backend of the project. The trained neural networks were deployed as services written in Go, and communication between services used the gRPC protocol (sketched below), which increased the throughput and overall speed of the engine.
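
As a hedged illustration of the service interface, here is how a Python client could call such a parsing service over gRPC. The service name, the RPC, the message types, and the generated modules (resume_parser_pb2, resume_parser_pb2_grpc) are hypothetical; in practice they would be generated by protoc from the project's .proto contract.

    import grpc

    # Hypothetical modules generated by protoc from the service's .proto file.
    import resume_parser_pb2
    import resume_parser_pb2_grpc

    def parse_resume(path: str, address: str = "localhost:50051"):
        with open(path, "rb") as f:
            raw = f.read()
        with grpc.insecure_channel(address) as channel:
            stub = resume_parser_pb2_grpc.ResumeParserStub(channel)
            # ParseResume is a hypothetical unary RPC returning extracted entities.
            return stub.ParseResume(resume_parser_pb2.ParseRequest(document=raw))

    if __name__ == "__main__":
        print(parse_resume("candidate_cv.pdf"))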

Result

The client received an end-to-end solution for automated resume parsing and summarization, which helped them grow their market share by 20% and start working with larger clients; the share of Fortune 100 companies among their clients grew to 80%. After a while, the startup was successfully acquired by a large US cloud talent platform, launching its AI solution within a larger organization.