dataturks-engg / entity-recognition-in-resumes-spacy

Automatic Summarization of Resumes with NER -> Evaluate resumes at a glance through Named Entity Recognition

Home Page: https://medium.com/@dataturks/automatic-summarization-of-resumes-with-ner-8b97a5f562b

Python 100.00%
named-entity-recognition spacy-models resume-parser python annotation-tool labeling-tool text-annotation

entity-recognition-in-resumes-spacy's Introduction

Automatic Summarization of Resumes with NER

Evaluate resumes at a glance through Named Entity Recognition

*Shameless plugin: We are a data annotation platform that makes it super easy for you to build ML datasets. Just upload data, invite your team, and build datasets super quick. Check us out!*


This blog speaks about a field in Natural Language Processing and Information Retrieval called Named Entity Recognition, and how we can apply it to automatically generate summaries of resumes by extracting only chief entities like name, education background, skills, etc.

It is often observed that resumes are populated with excess information, much of it irrelevant to what the evaluator is looking for. The process of evaluating resumes in bulk therefore becomes tedious and hectic. Through our NER model, we can facilitate evaluation of resumes at a quick glance, thereby reducing the effort required to shortlist candidates from a pile of resumes.

What is Named Entity Recognition?

Named-entity recognition (NER) (also known as entity identification, entity chunking and entity extraction) is a sub-task of information extraction that seeks to locate and classify named entities in text into pre-defined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.
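As a quick illustration, here is a minimal sketch of NER in practice with spaCy's pretrained small English model (assuming spaCy and en_core_web_sm are installed; the exact labels may vary by model version):

    import spacy

    # Pretrained pipeline: python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")

    doc = nlp("Satya Nadella is the CEO of Microsoft, headquartered in Redmond.")
    for ent in doc.ents:
        print(ent.text, ent.label_)
    # e.g. "Satya Nadella PERSON", "Microsoft ORG", "Redmond GPE"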

NER systems have been created that use linguistic grammar-based techniques as well as statistical models such as machine learning. Hand-crafted grammar-based systems typically obtain better precision, but at the cost of lower recall and months of work by experienced computational linguists. Statistical NER systems typically require a large amount of manually annotated training data. Semi-supervised approaches have been suggested to avoid part of the annotation effort.

NER For Resume Summarization

Dataset:

The first task at hand, of course, is to create manually annotated training data to train the model. For this purpose, 220 resumes were downloaded from an online jobs platform. These documents were uploaded to our online annotation tool and manually annotated.

The tool automatically parses the documents and allows us to create annotations of the important entities we are interested in, generating JSON-formatted training data with each line containing the text corpus along with the annotations.

A snapshot of the dataset can be seen below:

A sample of the generated JSON-formatted data is as follows:
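The screenshots are not reproduced here. Schematically, each line of the export is a JSON record of the following shape (the field names match the training code quoted in the issues below; the sample offsets are taken from one of those issues, and the full resume text is elided):

    {
        "content": "... full text of the resume ...",
        "annotation": [
            {
                "label": ["Companies worked at"],
                "points": [{"start": 1749, "end": 1754, "text": "Oracle"}]
            }
        ]
    }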

The above dataset, consisting of 220 annotated resumes, can be found [here](https://dataturks.com/projects/abhishek.narayanan/Entity Recognition in Resumes). We train the model on 200 resumes and test it on the remaining 20.

Training the Model:

We use Python’s spaCy module for training the NER model. spaCy’s models are statistical and every “decision” they make — for example, which part-of-speech tag to assign, or whether a word is a named entity — is a prediction. This prediction is based on the examples the model has seen during training.

The model is then shown the unlabelled text and will make a prediction. Because we know the correct answer, we can give the model feedback on its prediction in the form of an error gradient of the loss function that calculates the difference between the training example and the expected output. The greater the difference, the more significant the gradient and the updates to our model.

When training a model, we don’t just want it to memorise our examples — we want it to come up with a theory that can be generalised across other examples. After all, we don’t just want the model to learn that this one instance of “Amazon” right here is a company — we want it to learn that “Amazon”, in contexts like this, is most likely a company. In order to tune the accuracy, we process our training examples in batches, and experiment with minibatch sizes and dropout rates.

Of course, it’s not enough to only show a model a single example once. Especially if you only have a few examples, you’ll want to train for a number of iterations. At each iteration, the training data is shuffled to ensure the model doesn’t make any generalisations based on the order of examples.

Another technique to improve the learning results is to set a dropout rate, a rate at which to randomly “drop” individual features and representations. This makes it harder for the model to memorise the training data. For example, a dropout of 0.25 means that each feature or internal representation has a 1/4 likelihood of being dropped. We train the model for 10 epochs and keep the dropout rate at 0.2.
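Putting this together, a minimal sketch of a spaCy 2.x-style training loop (the API style this repo uses) could look as follows. TRAIN_DATA is assumed to already be in spaCy's (text, {'entities': [(start, end, label), ...]}) format, and the batch size is illustrative:

    import random
    import spacy
    from spacy.util import minibatch

    nlp = spacy.blank("en")                  # blank English pipeline
    ner = nlp.create_pipe("ner")
    nlp.add_pipe(ner, last=True)

    # register every entity label seen in the training data
    for _, annotations in TRAIN_DATA:
        for ent in annotations.get("entities"):
            ner.add_label(ent[2])

    optimizer = nlp.begin_training()
    for itn in range(10):                    # 10 epochs, as described above
        random.shuffle(TRAIN_DATA)           # avoid order-based generalisations
        losses = {}
        for batch in minibatch(TRAIN_DATA, size=8):
            texts, annotations = zip(*batch)
            nlp.update(texts, annotations, sgd=optimizer,
                       drop=0.2, losses=losses)   # dropout rate 0.2
        print("Losses at iteration", itn, losses)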

Results and Evaluation of the Model:

The model is tested on 20 resumes and the predicted summarized resumes are stored as separate .txt files for each resume.

For each resume on which the model is tested, we calculate the accuracy score, precision, recall and F-score for each entity that the model recognizes. The values of these metrics for each entity are summed up and averaged to generate an overall score to evaluate the model on the test data consisting of 20 resumes. The entity-wise evaluation results can be observed below; the results have been predicted with commendable accuracy.
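One way to compute such per-entity scores is exact span matching between the model's predictions and the gold annotations (a sketch, not necessarily the exact scoring code used here):

    from collections import defaultdict

    def entity_scores(nlp, examples):
        """Per-label precision/recall/F1 over (text, {'entities': ...}) pairs."""
        tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
        for text, annotations in examples:
            gold = {(s, e, lab) for s, e, lab in annotations["entities"]}
            pred = {(ent.start_char, ent.end_char, ent.label_)
                    for ent in nlp(text).ents}
            for span in pred:
                (tp if span in gold else fp)[span[2]] += 1
            for span in gold - pred:
                fn[span[2]] += 1
        scores = {}
        for label in set(tp) | set(fp) | set(fn):
            p = tp[label] / (tp[label] + fp[label]) if tp[label] + fp[label] else 0.0
            r = tp[label] / (tp[label] + fn[label]) if tp[label] + fn[label] else 0.0
            f = 2 * p * r / (p + r) if p + r else 0.0
            scores[label] = {"precision": p, "recall": r, "f1": f}
        return scores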

A sample summary of an unseen resume of an employee from indeed.com, obtained through prediction by our model, is shown below:

Resume of an Employee of Microsoft from indeed.com

Summary of the above Resume

If you have any queries or suggestions, I would love to hear about it. Please write to me at [email protected].

*Shameless plugin: We are a data annotation platform that makes it super easy for you to build ML datasets. Just upload data, invite your team, and build datasets super quick. Check us out!*

Data Annotation Platform. Image Bounding, Document Annotation, NLP and Text Annotations. #HumanInTheLoop #AI, #TrainingData for #MachineLearning.

entity-recognition-in-resumes-spacy's People

Contributors

abhishek-narayanan-dataturks, dataturks

entity-recognition-in-resumes-spacy's Issues

Read annotated data with Doccano

Hi,
How can I read my annotated data with another tool named Doccano?
1) Here is the form of my annotated data:

"annotation": [
[
79,
99,
"Nom complet"
],

2) The code that I want to change to read my annotated data:

    import json

    with open('traindata.json', 'r', encoding='utf-8') as f:
        lines = f.readlines()

    for line in lines:
        data = json.loads(line)
        text = data['content']
        entities = []
        for annotation in data['annotation']:
            # only a single point in text annotation
            point = annotation['points'][0]
            labels = annotation['label']
            # handle both a list of labels and a single label
            if not isinstance(labels, list):
                labels = [labels]

            for label in labels:
                # dataturks indices are both inclusive [start, end],
                # but spaCy's are end-exclusive [start, end)
                entities.append((point['start'], point['end'] + 1, label))
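A possible adaptation for Doccano (a sketch: it assumes Doccano's JSONL export with a "labels" field of [start, end, label] triples, which can vary by Doccano version, and that the offsets are already end-exclusive like spaCy's, so no +1 is needed):

    import json

    def convert_doccano_to_spacy(path):
        training_data = []
        with open(path, "r", encoding="utf-8") as f:
            for line in f:
                data = json.loads(line)
                # e.g. {"text": "...", "labels": [[79, 99, "Nom complet"]]}
                entities = [(start, end, label)
                            for start, end, label in data["labels"]]
                training_data.append((data["text"], {"entities": entities}))
        return training_data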

E24 error

I'm getting the following error when I execute train.py

ValueError: [E024] Could not find an optimal move to supervise the parser. Usually, this means the GoldParse was not correct. For example, are all labels added to the model?

Anything I should be doing differently?
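A common trigger for E024 with this kind of data is entity spans that begin or end on whitespace; a widely shared workaround is to trim the spans before training (a sketch; verify it against your data):

    import re

    def trim_entity_spans(data):
        """Shrink (start, end, label) spans so they don't touch whitespace."""
        whitespace = re.compile(r'\s')
        cleaned = []
        for text, annotations in data:
            entities = []
            for start, end, label in annotations['entities']:
                while start < end and whitespace.match(text[start]):
                    start += 1
                while end > start and whitespace.match(text[end - 1]):
                    end -= 1
                if start < end:
                    entities.append((start, end, label))
            cleaned.append((text, {'entities': entities}))
        return cleaned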

ValueError: [E103]

I get the error mentioned below while training, even when I used the same code.

ValueError: [E103] Trying to set conflicting doc.ents: '(6861, 6870, 'Companies worked at')' and '(6305, 7258, 'Skills')'. A token can only be part of one entity, so make sure the entities you're setting don't overlap.
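spaCy cannot train on overlapping spans, so one workaround is to filter each document's annotations down to a non-overlapping subset before training (a sketch; which span to keep is a judgment call, here the longer one wins):

    def drop_overlapping(entities):
        """Keep a non-overlapping subset of (start, end, label) spans,
        preferring longer spans."""
        kept = []
        for start, end, label in sorted(entities,
                                        key=lambda e: e[1] - e[0],
                                        reverse=True):
            if all(end <= s or start >= e for s, e, _ in kept):
                kept.append((start, end, label))
        return sorted(kept)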

Input to test the model

Is it possible to test the model with a personal PDF resume? Or is there any function to convert a PDF resume into the "DataTurks JSON format"?
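The repo itself does not include a PDF converter; one option is to extract the text yourself and run the trained model over it (a sketch using the third-party pdfminer.six library; the model path is whatever directory the trained model was saved to):

    import spacy
    from pdfminer.high_level import extract_text  # pip install pdfminer.six

    text = extract_text("my_resume.pdf")      # plain text of the PDF
    nlp = spacy.load("path/to/saved_model")   # directory written by nlp.to_disk()
    for ent in nlp(text).ents:
        print(f"{ent.label_}: {ent.text}")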

It shows only one label result

Hi,

I've created a dataset with 5 labels (person, location, date, time, country) using the DataTurks NER tagging tool, and I trained my model using the dataturks-to-spacy and train-spacy scripts.

I get results, but it returns only the Person label. When I re-run the script a few times, I sometimes get the Date and Person labels.

I also realized that if I use 'testdata.json' (the project's test data) I get results for all labels, even if I trained the model with my own data.

I couldn't find what the reason for this could be.

Code: https://paste.ubuntu.com/p/7G28brTWgv/

Train File: https://paste.ubuntu.com/p/GGTFdGbHDN/
Test File: https://paste.ubuntu.com/p/29XWP6MRrp/

ERROR:root:Unable to process traindata.json

    error = 'NoneType' object is not iterable
    Traceback (most recent call last):
      File "", line 22, in convert_dataturks_to_spacy
        for annotation in data['annotation']:
    TypeError: 'NoneType' object is not iterable

    TypeError                                 Traceback (most recent call last)
    in
    ----> 1 train_spacy()

    in train_spacy()
         51
         52     # add labels
    ---> 53     for _, annotations in TRAIN_DATA:
         54         for ent in annotations.get('entities'):
         55             ner.add_label(ent[2])

    TypeError: 'NoneType' object is not iterable

I get this error after calling the function.

Saving/Loading Custom Dataset

Hi, I am trying to do inference with the given code. I am getting decent results when testing the code with testdata.json after using nlp.update(). The issue is that when I save the model to output_dir with nlp.to_disk() after training with nlp.update(), then load the trained model with nlp2.from_disk(output_dir) or nlp2 = spacy.load(output_dir) and test the model with nlp2, I get very wrong results. I also noticed that output_dir contains a number of files and folders instead of a single file (unlike Keras, where a saved model is a single '.h5' file). Am I missing something here? I am relatively new to spaCy.
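For reference, the save/load round trip itself is just the following (a sketch). A directory full of files is expected: spaCy persists a model as a directory (meta.json, vocab/, ner/, ...), not as a single file.

    # after training:
    nlp.to_disk("model_output")

    # later, in a fresh process:
    import spacy
    nlp2 = spacy.load("model_output")
    doc = nlp2("Some unseen resume text ...")
    print([(ent.text, ent.label_) for ent in doc.ents])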

Read another form of annotated data

How can I read my annotated data?
"annotation": [
[
79,
99,
"Nom complet"
],

  1. This is the annotated data format expected by the code:
    "annotation": [
    {
    "label": [
    "Companies worked at"
    ],
    "points": [
    {
    "start": 1749,
    "end": 1754,
    "text": "Oracle"
    }
    ]
    },

kernel died restarting

Is there any solution for this?

    Warning: Unnamed vectors -- this won't allow multiple vectors models to be loaded. (Shape: (0, 0))
    Statring iteration 0

    Kernel died, restarting

Info Needed...

Hi @DataTurks @abhishek-narayanan-dataturks ,
I cloned this repository and ran it successfully without any errors. May I know how to give a resume as input? I was unable to find the related code. Can you explain this?

Thanks and Regards,
Manikantha Sekhar

how to run this project

I don't know Python, and I also don't know about NER and spaCy, so can you please guide me? What do I need to install to run this? I want the resume parser functionality.

How to read the json file? After running this code it can't read the json file

In the repo you have given a JSON file, and when I run the code it can't read it. The Jupyter notebook just keeps loading after the main function call, and when I run it in PyCharm it throws the error below:

    ERROR:root:Unable to process traindata.json
    error = 'NoneType' object is not iterable
    Traceback (most recent call last):
      File "train.py", line 22, in convert_dataturks_to_spacy
        for annotation in data['annotation']:
    TypeError: 'NoneType' object is not iterable

    Traceback (most recent call last):
      File "train.py", line 123, in <module>
        train_spacy()
      File "train.py", line 53, in train_spacy
        for _, annotations in TRAIN_DATA:
    TypeError: 'NoneType' object is not iterable

    ERROR:root:Unable to process traindata.json
    error = 'NoneType' object is not iterable
    Traceback (most recent call last):
      File "", line 12, in convert_dataturks_to_spacy
        for annotation in data['annotation']:
    TypeError: 'NoneType' object is not iterable
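This error usually means some lines in traindata.json have "annotation": null (documents that were never annotated), so the loop has nothing to iterate over. A simple guard is to skip such lines when converting (a sketch):

    for line in lines:
        data = json.loads(line)
        if data.get('annotation') is None:   # unannotated document
            continue
        # ... proceed as in convert_dataturks_to_spacy ...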

spaCy can't train overlapping entities

ValueError: [E103] Trying to set conflicting doc.ents: '(549, 582, 'Designation')' and '(539, 581, 'Designation')'. A token can only be part of one entity, so make sure the entities you're setting don't overlap.

Accuracy Statistics

Hi,
Thanks for putting together this project. I have a question about the accuracy reporting. It seems that we are only reporting accuracy (and F1 scores etc.) for the last resume, and not aggregate scores. Is that a correct understanding?

ValueError

ValueError: [E103] Trying to set conflicting doc.ents: '(1476, 1501, 'Designation')' and '(1476, 1485, 'Companies worked at')'. A token can only be part of one entity, so make sure the entities you're setting don't overlap.

spaCy crashes

Hello,

Very nice work and great results.

Could you please help me with the spaCy part: it keeps breaking down, and it seems to be a very common issue. How did you solve it?

Thanks a lot.
