neoannqi / pii_advisor

This project forked from zax456/pii_advisor

BT3101 Business Analytics Capstone Project 2019/2020 AY Semester 1

Dockerfile 0.65% Makefile 0.60% Python 49.33% TSQL 1.27% Shell 1.43% Jupyter Notebook 46.71%

pii_advisor's Introduction

BT3101 Capstone Project: Personal Identifiable Information (PII) Advisor

Background:

Govtech’s MyCareersFuture team faces challenges in understanding unstructured data in resumes on Singapore's national job portal. They wish to improve how personally identifiable information (PII) in resumes is handled, which would minimise the cost of any data leakage or security breach.

In collaboration with NUS School of Computing (Business Analytics), Govtech has developed a PII advisor that alerts administrators to PII in resumes, flags and masks it, and enables documents to be classified according to their level of confidentiality.
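
To make the flag, mask and classify idea concrete, here is a minimal sketch in Python. It is not the project's implementation: the regular expressions, the `mask_pii` and `confidentiality_level` helpers, and the three-tier low/medium/high scheme are all assumptions made for illustration, and the actual advisor relies on a trained model rather than patterns alone.

```python
import re

# Illustrative patterns only (assumptions, not the project's real rules).
PII_PATTERNS = {
    # Singapore NRIC/FIN: prefix letter, 7 digits, checksum letter.
    "NRIC": re.compile(r"\b[STFG]\d{7}[A-Z]\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    # 8-digit local numbers starting with 6, 8 or 9.
    "PHONE": re.compile(r"\b[689]\d{7}\b"),
}


def mask_pii(text):
    """Replace each match with a [LABEL] tag and count matches per label."""
    counts = {}
    for label, pattern in PII_PATTERNS.items():
        text, found = pattern.subn(f"[{label}]", text)
        counts[label] = found
    return text, counts


def confidentiality_level(counts):
    """Hypothetical three-tier scheme based on which PII types were found."""
    if counts.get("NRIC", 0) > 0:
        return "high"
    if sum(counts.values()) > 0:
        return "medium"
    return "low"


masked, counts = mask_pii("Contact: S1234567D, jane@example.com, 91234567")
print(masked)                         # Contact: [NRIC], [EMAIL], [PHONE]
print(confidentiality_level(counts))  # high
```

Pattern matching of this kind only catches PII with a fixed shape. Names and addresses have no such shape, which is one reason a trained model is useful for resumes.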

Technical Documentation:

The main directory is split into 3 key components:

The software directory is the most important: it houses the productionised data science code, together with Docker configurations that can be launched to create an API that receives resume files and returns parsed results.

Key Commands

Building the Image

Run the following to build the Docker image:

make build;

Starting a complete environment

Run the following to run the image with a complete expected setup:

make start;

To run a sample:

curl -vv localhost:5000/0011.doc;
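
The same sample request can be made from Python. The sketch below assumes only what the curl command shows: the API listens on localhost:5000 and the file name forms the URL path. The helper names `build_sample_url` and `fetch_parsed_resume` are hypothetical, and because the response format is not documented here, the body is returned as plain text.

```python
from urllib.parse import quote
from urllib.request import urlopen


def build_sample_url(filename, host="localhost", port=5000):
    """Build the URL for a resume file, percent-encoding the file name."""
    return f"http://{host}:{port}/{quote(filename)}"


def fetch_parsed_resume(filename, timeout=30):
    """Request the parsed result for one resume and return the raw body.

    Assumes the environment from `make start` is already running.
    """
    with urlopen(build_sample_url(filename), timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

With the environment running, `print(fetch_parsed_resume("0011.doc"))` should show whatever the API returns for the sample file.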

More details are in the Software Engineering directory.

The data science folder stores the data science code and unit tests. It is mainly used for research and development on methods to improve parsing and on the machine learning models that were tested.

For our model, we used a pretrained model from Dataturks as our baseline. To create an improved model that works better with Singaporean resumes, the NUS team trained it further on local resumes.

This directory contains:

  • Dataturks' pre-trained model and application code

  • Unit tests

  • Some Jupyter notebooks used in the research and development phase of our project

Note that the files in Data Science are decoupled from those in Software Engineering, so changes made only in the Data Science directory are not propagated.

The key files in Software Engineering that contain code from here are convert_to_text.py and process_string.py.

The dashboard folder contains a prototype of what could be visualised after analysing the results of the PII advisor. These charts were conceptualised based on client feedback and available data.

Dashboard Screenshot

At the moment, there is no ETL process that converts the parsed resumes to dashboard visualisations.

Contributors

  • Govtech Project Lead: zephinzer (Joseph Matthias Goh)

  • NUS Students:

  • Lee Chen Yuan

  • Markus Ng

  • Ang Kian Hwee

  • Sheryl Ker

  • Tong Tsz Hin (Tony)
