iamadisri / auth-id

A hierarchical bi-LSTM model trained to identify the author of a given email (SMAI@IIIT-H 2017)

Python 100.00%
nlp nlp-machine-learning lstm-neural-network python2

auth-id's Introduction

E-Mail Author Identification

SMAI@IIIT-H (Monsoon 2017)

Team 15

Course Instructor

Project Mentor


Overview

Classify emails from the Enron email dataset by predicted authorship, and use the trained classifier to identify the authors of test samples.

Method

Enron Email Dataset

Available here, the dataset contains 0.5 million emails from about 150 users, who were employees of Enron.

The classifiers use the authors as classes and the emails as samples to be assigned to those classes by authorship.

Data Preparation

The number of author classes was fixed so as to maximise the number of emails per author while keeping the number of emails similar across author classes.

This number was found to be 10 authors with 800-1000 emails each.
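One plausible way to arrive at such a subset is sketched below in Python 3. The function name `select_balanced_authors`, its parameters, and the "take the most prolific authors, then cap each one" strategy are assumptions for illustration; the README does not say how the selection was actually done.

```python
from collections import defaultdict

def select_balanced_authors(emails, n_authors=10, max_per_author=1000):
    """Keep the n_authors most prolific authors, capped at max_per_author emails each.

    emails: iterable of (author, text) pairs (hypothetical input format).
    """
    by_author = defaultdict(list)
    for author, text in emails:
        by_author[author].append(text)
    # Most prolific first; ties are broken by name so the result is deterministic.
    ranked = sorted(by_author, key=lambda a: (-len(by_author[a]), a))
    # Capping the count keeps the class sizes close to each other.
    return {a: by_author[a][:max_per_author] for a in ranked[:n_authors]}

# Toy usage with made-up data.
emails = [("alice", "a1"), ("alice", "a2"), ("alice", "a3"),
          ("bob", "b1"), ("bob", "b2"), ("carol", "c1")]
subset = select_balanced_authors(emails, n_authors=2, max_per_author=2)
```

With the toy data, `carol` is dropped for having too few emails and `alice` is capped at two, so both remaining classes end up the same size.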

Cleaning

The Enron corpus contains all emails in raw form, including not only the message but also all the email metadata.

The data is cleaned to keep only the subject and body of each mail. All attached forward chains, including forwarded threads, are removed, along with salutations.

The data is also tokenised by word, sentence and paragraph, and is case normalised.
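The cleaning and tokenisation steps could look roughly like the following Python 3 sketch. The regular expressions, the helper names `clean_email` and `tokenise`, and the assumption that each message is single-part plain text are illustrative only, and are not taken from `dataProcessing.py`.

```python
import re
from email import message_from_string

# Assumed markers for the start of a forwarded or quoted chain.
FORWARD_RE = re.compile(
    r"^[ \t]*-+[ \t]*(Forwarded by|Original Message)\b.*$",
    re.IGNORECASE | re.MULTILINE,
)
# Assumed salutation pattern: a greeting word opening the first line.
SALUTATION_RE = re.compile(r"^[ \t]*(hi|hello|dear|hey)\b[^\n]*\n", re.IGNORECASE)

def clean_email(raw):
    """Return the lower-cased subject and body, without metadata, forwards or salutation."""
    msg = message_from_string(raw)
    subject = msg.get("Subject", "")
    body = msg.get_payload()  # a string for single-part messages
    match = FORWARD_RE.search(body)
    if match:
        body = body[:match.start()]  # drop everything from the forward marker on
    body = SALUTATION_RE.sub("", body.lstrip(), count=1)
    return (subject + "\n\n" + body).strip().lower()

def tokenise(text):
    """Split cleaned text into paragraphs, then sentences, then words."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return [
        [re.findall(r"[a-z0-9']+", s) for s in re.split(r"(?<=[.!?])\s+", p) if s]
        for p in paragraphs
    ]

raw = (
    "From: a@enron.com\n"
    "To: b@enron.com\n"
    "Subject: Gas Prices\n"
    "\n"
    "Hi Bob,\n"
    "Prices rose today. Call me!\n"
    "\n"
    "Thanks.\n"
    "---------------------- Forwarded by A on 01/01/2001 ----------------------\n"
    "Old text here.\n"
)
cleaned = clean_email(raw)
tokens = tokenise(cleaned)
```

Real Enron messages are messier than this toy input, so the patterns would need tuning against the corpus.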

Models

The following models have been implemented and tested:

CNN implementation

The CNN can identify groups of words and phrases that an author commonly uses. It also captures localized chunks of information, which is useful for finding phrasal units within long texts. There are three layers to the CNN:

  • First, the embedding layer generates a sequence of word-embeddings from a sequence of words
  • Second, the conv layer performs the convolution operation using 128 5x5 filters
  • Third, the dense layer is used for classification
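To make the convolution step concrete, here is a dependency-free Python 3 sketch of what a single filter computes as it slides over a sequence of word embeddings. The function name, the ReLU activation, and the max-over-time pooling are assumptions for illustration; the README only states that 128 filters are used, and `CNN.py` may differ.

```python
def conv_max_feature(embeddings, filt, bias=0.0):
    """Slide one filter over the word embeddings and keep its strongest response.

    embeddings: list of word vectors (lists of floats), one per word.
    filt: list of `window` weight vectors with the same dimension as the embeddings.
    """
    window = len(filt)
    best = 0.0  # ReLU floor: negative activations are clipped to zero
    for start in range(len(embeddings) - window + 1):
        total = bias
        for offset in range(window):
            word_vec = embeddings[start + offset]
            total += sum(w * x for w, x in zip(filt[offset], word_vec))
        best = max(best, total)  # max-over-time pooling
    return best

# Four words with 2-dimensional embeddings and a filter spanning two words.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
filt = [[1.0, 0.0], [0.0, 1.0]]
feature = conv_max_feature(embeddings, filt)
```

Each filter responds most strongly to one local word pattern, so running many filters yields one feature per pattern, which the dense layer can then use for classification.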

Bi-LSTM implementation

The Bi-LSTM is a commonly used technique for text classification.

LSTMs are a special kind of RNN which are more capable of remembering long term dependencies in a sequence. This gives more context to the classifier which helps in author identification while processing a sequence of text.

There are three layers to the model:

  • First, the embedding layer generates a sequence of word-embeddings from a sequence of words
  • Second, the bidirectional LSTM generates email embeddings from the sequence of word embeddings
  • Third, the dense layer performs the classification

Hierarchical Bi-LSTM implementation

LSTMs are known to work best for sequences of about 10-15 elements. In this implementation, however, the model takes the entire document: each LSTM still processes a short sequence (the words of one sentence, then the sentences of one email), while the overall context available for classification grows.

There are four layers to this model:

  • First, the embedding layer generates a sequence of word-embeddings from a sequence of words
  • Second, the first bidirectional LSTM generates sentence embeddings from the sequence of word embeddings
  • Third, the second bidirectional LSTM generates email embeddings from the sentence embeddings
  • Fourth, the dense layer performs the classification
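A hierarchical model of this kind needs each email as a fixed-size grid of sentences by words rather than one flat word sequence. The Python 3 sketch below shows one way to build that grid. The reserved ids `PAD` and `UNK`, the helper names, and the truncation limits are assumptions for illustration and are not taken from `HierLSTM.py`.

```python
PAD, UNK = 0, 1  # reserved ids for padding and unknown words (assumed convention)

def build_vocab(documents):
    """Map every word seen in the tokenised documents to an integer id >= 2."""
    vocab = {}
    for doc in documents:
        for sentence in doc:
            for word in sentence:
                vocab.setdefault(word, len(vocab) + 2)
    return vocab

def to_grid(doc, vocab, max_sents, max_words):
    """Turn one email (a list of sentences, each a list of words) into a
    max_sents x max_words grid of word ids, truncating and padding as needed."""
    grid = []
    for sentence in doc[:max_sents]:
        ids = [vocab.get(w, UNK) for w in sentence[:max_words]]
        grid.append(ids + [PAD] * (max_words - len(ids)))
    while len(grid) < max_sents:
        grid.append([PAD] * max_words)
    return grid

doc = [["prices", "rose", "today"], ["call", "me"]]
vocab = build_vocab([doc])
grid = to_grid(doc, vocab, max_sents=3, max_words=4)
```

Each row of the grid is what the first bidirectional LSTM would turn into a sentence embedding; the sequence of rows is what the second one would turn into an email embedding.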

Augmented Hierarchical Bi-LSTM implementation

This model appends stylometric features to the final document embedding in the hierarchical Bi-LSTM, right before it is passed on to the dense layer. The classification is now performed on these augmented document embeddings.
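The augmentation itself is a concatenation, sketched below in Python 3. The min-max scaling step is an assumption added here because raw stylometric counts and ratios live on very different scales; the README does not say whether `HierLSTM_withStylometry.py` rescales the features. All names are hypothetical.

```python
def min_max_scale(rows):
    """Scale each feature column to [0, 1] across all emails (assumed preprocessing)."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    # A constant column has zero span; fall back to 1.0 to avoid dividing by zero.
    spans = [(max(c) - min(c)) or 1.0 for c in columns]
    return [[(v - lo) / span for v, lo, span in zip(row, lows, spans)] for row in rows]

def augment(doc_embedding, features):
    """Append the stylometric features to the document embedding."""
    return list(doc_embedding) + list(features)

# Two made-up features (total words, unique-word ratio) for three emails.
raw_features = [[10, 0.25], [20, 0.5], [30, 0.75]]
scaled = min_max_scale(raw_features)
# A made-up 3-dimensional document embedding for the second email.
augmented = augment([0.1, -0.3, 0.7], scaled[1])
```

The dense layer then receives a vector whose length is the embedding size plus the number of stylometric features.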

Stylometry

The stylometric features extracted from the data and experimented with are:

  • Lexical
    1. Average sentence length
    2. Average word length
    3. Total number of words
    4. Ratio of unique words to total number of words
    5. Total number of characters
  • Syntactic
    1. Total number of function words
    2. Total number of personal pronouns
    3. Total number of adjectives
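Most of these features can be computed with a few lines of Python 3, as in the sketch below. The word lists are tiny illustrative samples, the dictionary keys are hypothetical names, and the sentence and word splitting rules are assumptions; the scripts under `feature_extraction_scripts/` may differ. Counting adjectives needs a part-of-speech tagger such as the one in nltk, so it is left out here.

```python
import re

# Tiny illustrative word lists; a real run would use much fuller ones.
FUNCTION_WORDS = {"the", "a", "an", "of", "to", "in", "and", "but", "or",
                  "on", "for", "with", "is", "at"}
PERSONAL_PRONOUNS = {"i", "me", "we", "us", "you", "he", "him", "she", "her",
                     "it", "they", "them"}

def stylometric_features(text):
    """Compute the lexical features and two of the syntactic features for one email."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    words = re.findall(r"[a-z']+", text.lower())
    n_words = len(words)
    return {
        "avg_sentence_length": n_words / len(sentences) if sentences else 0.0,
        "avg_word_length": sum(len(w) for w in words) / n_words if n_words else 0.0,
        "total_words": n_words,
        "unique_word_ratio": len(set(words)) / n_words if n_words else 0.0,
        "total_chars": len(text),
        "function_words": sum(w in FUNCTION_WORDS for w in words),
        "personal_pronouns": sum(w in PERSONAL_PRONOUNS for w in words),
    }

feats = stylometric_features("I sent the report to you. It is on the server.")
```

Returning a dictionary keeps the features labelled; flattening the values in a fixed order gives the vector that is appended to the document embedding.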

Dependencies

  1. python2
  2. numpy
  3. cPickle
  4. keras
  5. tensorflow
  6. nltk
  7. MySQL and mysqldb

Project Structure

root/

    | data_preprocessing_scripts/
        - dataProcessing.py

    | extracted_features/
        - adjperemail.txt
        - avgsentlenperemail.txt
        - avgwordlenperemail.txt
        - charsperemail.txt
        - funcwordsperemail.txt
        - perpronperemail.txt
        - stylometricVector.txt
        - uniqbytotperemail.txt
        - wordsperemail.txt

    | feature_extraction_scripts/
        - adjperemail.py
        - avgsentlenperemail.py
        - avgwordlenperemail.py
        - charsperemail.py
        - funcwordsperemail.py
        - perpronperemail.py
        - stylometricVector.py
        - uniqbytotperemail.py
        - wordsperemail.py

    | models/
        - CNN.py
        - HierLSTM_withStylometry.py
        - HierLSTM.py
        - LSTM_final.py

    - README.md

Contributors

iamadisri, karthikchintapalli, kritikalcoder
