jianghu50 / resumeparser

This project forked from gogsbread/resumeparser

Resume Parser using rule and machine-learning based approach. Developed using framework provided by GATE

License: GNU Lesser General Public License v3.0

Shell 0.86% Perl 0.62% CSS 0.45% HTML 66.97% Groovy 0.30% Java 30.76% JavaScript 0.03%

resumeparser's Introduction

ResumeParser

Resume Parser using a hybrid machine-learning and rule-based approach that focuses on semantic rather than syntactic parsing. This is a console-based application.

###System:### Windows 8.1 (tested). Should also run on Windows 7.

###Framework:### GATE (https://gate.ac.uk/) - Open source language processing framework.
Apache Tika (http://tika.apache.org/) - Open source format handling framework.

###Pre-requisites:### Windows
PowerShell
git
Latest Java (jre8 tested)

###Installation:### Open PowerShell in Windows (Run -> powershell)

  1. git clone https://github.com/antonydeepak/ResumeParser.git
  2. cd ResumeParser
  3. cd ResumeTransducer
  4. $env:GATE_HOME="..\GATEFiles" (beware: this is a relative path, used for convenience, so it only resolves from inside ResumeTransducer.)

###Run\Test:### Run syntax:
> java -cp '.\bin\*;..\GATEFiles\lib\*;..\GATEFILES\bin\gate.jar;.\lib\*' code4goal.antony.resumeparser.ResumeParserProgram <input_file> [output_file]

Test:
> java -cp '.\bin\*;..\GATEFiles\lib\*;..\GATEFILES\bin\gate.jar;.\lib\*' code4goal.antony.resumeparser.ResumeParserProgram .\UnitTests\AntonyDeepakThomas.pdf antony_thomas.json
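If you want to drive the parser from another Java program instead of typing the command, the invocation above could be wrapped roughly as follows. This is only a sketch: the class name `ResumeParserLauncher` and its methods are hypothetical and not part of this repository; the classpath, main class, and `GATE_HOME` value are copied from the commands above, and the working directory is assumed to be `ResumeTransducer`.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of ResumeParser: launches the console app as a child process.
class ResumeParserLauncher {
    // Classpath and main class copied from the test command in this README.
    static final String CLASSPATH =
            ".\\bin\\*;..\\GATEFiles\\lib\\*;..\\GATEFILES\\bin\\gate.jar;.\\lib\\*";
    static final String MAIN_CLASS = "code4goal.antony.resumeparser.ResumeParserProgram";

    // Builds the argument list; outputFile may be null because it is optional in the run syntax.
    static List<String> buildCommand(String inputFile, String outputFile) {
        List<String> cmd = new ArrayList<>(
                Arrays.asList("java", "-cp", CLASSPATH, MAIN_CLASS, inputFile));
        if (outputFile != null) {
            cmd.add(outputFile);
        }
        return cmd;
    }

    // Runs the parser from the ResumeTransducer directory and returns its exit code.
    static int run(Path resumeTransducerDir, String inputFile, String outputFile)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(buildCommand(inputFile, outputFile));
        pb.directory(resumeTransducerDir.toFile());
        // Same relative GATE_HOME as in the installation steps (assumption: Windows paths).
        pb.environment().put("GATE_HOME", "..\\GATEFiles");
        pb.inheritIO(); // forward the parser's console output
        return pb.start().waitFor();
    }
}
```

Calling `ResumeParserLauncher.run(dir, ".\\UnitTests\\AntonyDeepakThomas.pdf", "antony_thomas.json")` would then mirror the test command. Each call still starts a fresh JVM, so it pays the GATE initialization cost every time.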

###Parser Capabilities:###

Supported formats: PDF, doc, docx, rtf, html, txt
Supported Resume Language: English

Output JSON format:

>	
{
  "title" : "",
  "gender" : "",
  "name" : {
    "first" : "Antony",
    "middle" : "Deepak",
    "last" : "Thomas"
  },
  "email" : [],
  "address" : [],
  "phone" : [],
  "url" : [],
  "work_experience" : [
    {
      "date_start" : "",
      "jobtitle" : "",
      "organization" : "",
      "date_end" : "",
      "text" : ""
    },
    {
      "<section_title>" : ""
    }
  ],
  "skills" : [
    {"<section_title_from_resume>" : "text"}
  ],
  "education_and_training" : [
    {"<section_title_from_resume>" : "text"}
  ],
  "accomplishments" : [
    {"<section_title_from_resume>" : "text"}
  ],
  "awards" : [
    {"<section_title_from_resume>" : "text"}
  ],
  "credibility" : [
    {"<section_title_from_resume>" : "text"}
  ],
  "extracurricular" : [
    {"<section_title_from_resume>" : "text"}
  ],
  "misc" : [
    {"<section_title_from_resume>" : "text"}
  ]
}

###Pros### a) Very powerful semantic parsing of resumes. I did not syntactically parse based on common styles or appearances of sections because these approaches do not scale.
b) Relies on proven grammar engines (GATE) and open source projects.

###Everything is not perfect### I tried my best not to let this blow up in the user's face, but here are some gotchas:
1) The file must have an extension matching one of the supported formats. I simply use the extension to pick the parser, and unknown formats are rejected with an error. I did not have time for MIME-type evaluation.
2) The engine has a one-time initialization cost, and technically it should be faster for subsequent files. However, I did not expose the capability to process a corpus, so every run incurs the same cost.
3) There is a log4j warning at the start. Did not have time to fix that :)
4) Page numbers are part of PDF files, so you will see "page 1", "page 2", ... "page n" in the output every now and then. This will improve as Apache Tika improves.
5) Some grammar parsing, especially identifying adjectives, is not up to par. I did not have time to try out other NL parsers such as Stanford NLP, but this should get better as the underlying engine improves over time.
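Gotcha 1 means it can be worth checking a file's extension before invoking the parser at all. A minimal sketch of such a pre-check is below. The class `SupportedFormats` is hypothetical and is not the parser's own code; the format list is taken from "Supported formats" above, and treating extensions case-insensitively is an assumption about how the parser behaves.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical pre-check, not the parser's implementation.
class SupportedFormats {
    // Extensions listed under "Supported formats" in this README.
    static final Set<String> SUPPORTED =
            new HashSet<>(Arrays.asList("pdf", "doc", "docx", "rtf", "html", "txt"));

    // Returns the lower-cased extension without the dot, or "" if the file name has none.
    static String extensionOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        int sep = Math.max(fileName.lastIndexOf('/'), fileName.lastIndexOf('\\'));
        // No dot in the last path segment, a leading dot only, or a trailing dot: no extension.
        if (dot <= sep + 1 || dot == fileName.length() - 1) {
            return "";
        }
        return fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    // Assumption: extension matching is case-insensitive.
    static boolean isSupported(String fileName) {
        return SUPPORTED.contains(extensionOf(fileName));
    }
}
```

For example, `SupportedFormats.isSupported(".\\UnitTests\\AntonyDeepakThomas.pdf")` is true, while a file with no extension or with an extension such as `.odt` is filtered out before the GATE startup cost is paid.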

###SourceCode structure:### \ResumeParser
-\ANNIEGazetterFiles
Contains all the compiled lists for common resume section titles
-\GATEFiles
Contains all the GATE libraries needed for NL processing
-\JAPEGrammars
Contains all the JAPE grammars for resume parsing.
-\ResumeTransducer
Console application written in Java

###How does the parser work?### The parser uses the English grammar engine provided by GATE through its ANNIE framework. The output is then transduced using grammar rules and lists written specifically for resume parsing. The JAPE grammar defines a generic set of rules that match popular ways of writing resumes. It takes proper nouns from the lists and applies the rules to them to identify entities. Explore the source code and read about GATE for more details. Also, feel free to ask questions.
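To make the "lists plus rules" idea concrete, here is a toy sketch in plain Java. It is not GATE, ANNIE, or JAPE, and it is not how this project is implemented: the class `SectionSketch`, the heading strings in its list, and the single rule are all made up for illustration. Only the output keys (`work_experience`, `skills`, `education_and_training`, `misc`) come from the JSON format above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Toy illustration only: a lookup list (gazetteer) plus one rule that uses it.
class SectionSketch {
    // Hypothetical gazetteer mapping resume headings to output keys.
    static final Map<String, String> GAZETTEER = new LinkedHashMap<>();
    static {
        GAZETTEER.put("work experience", "work_experience");
        GAZETTEER.put("employment history", "work_experience");
        GAZETTEER.put("skills", "skills");
        GAZETTEER.put("education", "education_and_training");
    }

    // Lower-cases a line and drops one trailing colon so "Skills:" matches "skills".
    static String normalize(String line) {
        return line.toLowerCase(Locale.ROOT).replaceAll(":$", "").trim();
    }

    // Rule: a line found in the gazetteer starts a new section;
    // every other non-empty line is attached to the current section.
    static Map<String, List<String>> split(String resumeText) {
        Map<String, List<String>> sections = new LinkedHashMap<>();
        String current = "misc"; // text before any known heading
        for (String line : resumeText.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            String key = GAZETTEER.get(normalize(trimmed));
            if (key != null) {
                current = key;
                sections.computeIfAbsent(current, k -> new ArrayList<>());
            } else {
                sections.computeIfAbsent(current, k -> new ArrayList<>()).add(trimmed);
            }
        }
        return sections;
    }
}
```

The real grammars work on GATE annotations (tokens, part-of-speech tags, gazetteer lookups) rather than raw lines, which is what lets them match on meaning instead of layout; the sketch only shows how a list and a rule cooperate.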

resumeparser's People

Contributors

gogsbread
