Cassandra'22

The task was to build a machine learning model that estimates when an invoice will be paid. The estimate does not have to be an exact date and time; a rough estimate in days is sufficient.

Data Description

  • Description : The text description entered by the user when recording the invoice in the accounting system (string).
  • Vendor Name : The name of the vendor/supplier who provided the goods or services (string).
  • Created : The date and time at which the invoice details were entered in the accounting system (datetime).
  • Invoice Date : The date and time of the invoice. It represents when the goods or services were delivered to Traxes (datetime).
  • Due Date : The date and time when the invoice is due to be paid by Traxes (datetime).
  • Amount : The cost of the goods/service provided by vendors and due to be paid by Traxes (float).
  • Settled : The amount that has been paid by Traxes to vendors on the payment date (float).
  • Outstanding : The unpaid part of the invoice that Traxes is still required to pay (float).
  • Number of Days until Payment : The number of days after the Invoice Date at which the payment was made.

Our goal was to predict the Number of Days until Payment by training a machine learning model on the given data.

Graphs and Insights derived from Exploratory Data Analysis (EDA)

  • Most samples have an amount below 200 and a Number_of_days_until_payment between 40 and 60.
  • Higher amounts tend to be paid in fewer days.
  • Invoices generated on weekends tend to be paid in fewer days.
  • Days until payment are highest for invoices generated in October and lowest for those generated in July.

  • For most samples, the Due Date is within 10 days of the Invoice Date.
  • Payment takes longer when the gap between the Due Date and the Invoice Date is large.

  • The target is below 50 for most samples.
  • A few unusual samples have negative target values.

Data Preprocessing

  1. Replace NaN values with empty strings in the Description column.
  2. Text preprocessing on the Description text: we removed a stop list of 25 semantically non-selective words, taken from the list used by the Stanford NLP Group of words that are common in Reuters-RCV1.
  3. Count Vectorizer to numerically encode the text features: it transforms a text into a vector based on the frequency (count) of each word that occurs in the text.
  4. Converted dates into date-time format: the dataset had several date columns, such as Created and Invoice Date, that were stored as strings. We parsed them and split each into day, date, weekday, month and year so the model could use the individual components.
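
The four preprocessing steps above can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: the sample rows are made up, the derived column names (such as Invoice Date_weekday) are hypothetical, and the stop list is our reading of the 25-word Reuters-RCV1 list, so treat it as an assumption.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Made-up sample rows; column names follow the data description.
df = pd.DataFrame({
    "Description": ["Payment for the office chairs", None,
                    "Monthly cleaning of the office"],
    "Invoice Date": ["2021-10-02 09:30:00", "2021-07-15 14:00:00",
                     "2021-10-04 08:00:00"],
})

# Step 1: replace missing descriptions with empty strings.
df["Description"] = df["Description"].fillna("")

# Step 2: stop list of 25 common words (assumed contents, see lead-in).
STOP_WORDS = [
    "a", "an", "and", "are", "as", "at", "be", "by", "for", "from",
    "has", "he", "in", "is", "it", "its", "of", "on", "that", "the",
    "to", "was", "were", "will", "with",
]

# Step 3: encode each description as a vector of word counts.
vectorizer = CountVectorizer(stop_words=STOP_WORDS)
text_features = vectorizer.fit_transform(df["Description"])
vocabulary = sorted(vectorizer.vocabulary_)

# Step 4: parse the date strings and split them into components.
invoice_dt = pd.to_datetime(df["Invoice Date"])
df["Invoice Date_day"] = invoice_dt.dt.day
df["Invoice Date_weekday"] = invoice_dt.dt.weekday  # Monday=0 ... Sunday=6
df["Invoice Date_month"] = invoice_dt.dt.month
df["Invoice Date_year"] = invoice_dt.dt.year
```

The same date split would be repeated for Created and Due Date. The sparse matrix text_features has one row per invoice and one column per word in vocabulary.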

Feature Engineering

  1. Due_Invoice_delta: the difference between the Due Date and the Invoice Date.
  2. As the “Outstanding” column was mostly zero, we added a new feature called “Outstanding_zero”, which is 1 if Outstanding is zero and 0 otherwise.
  3. We had three continuous numerical columns: [“Outstanding”, “Amount”, “Settled”]. We took ratios of these three columns to create three new features.
  4. To make use of “Vendor_Name”, we computed the mean, median, minimum, maximum, standard deviation and count of the numerical columns for each unique Vendor_Name and added them as new features. These features help the model learn the properties of each vendor and how they affect the target.
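
A rough pandas sketch of these four features is shown below. The sample data is made up, and several details are assumptions rather than the project's actual choices: which pairs of columns form the three ratios, the replacement of zero denominators by NaN, and every new column name other than Due_Invoice_delta and Outstanding_zero.

```python
import numpy as np
import pandas as pd

# Made-up sample rows; column names follow the data description.
df = pd.DataFrame({
    "Vendor Name": ["Acme", "Acme", "Globex"],
    "Invoice Date": pd.to_datetime(["2021-10-01", "2021-10-05", "2021-07-10"]),
    "Due Date": pd.to_datetime(["2021-10-08", "2021-10-15", "2021-07-30"]),
    "Amount": [100.0, 300.0, 50.0],
    "Settled": [100.0, 200.0, 50.0],
    "Outstanding": [0.0, 100.0, 0.0],
})

# 1. Days between the due date and the invoice date.
df["Due_Invoice_delta"] = (df["Due Date"] - df["Invoice Date"]).dt.days

# 2. Flag invoices with nothing outstanding.
df["Outstanding_zero"] = (df["Outstanding"] == 0).astype(int)

# 3. Pairwise ratios (assumed pairs); zero denominators become NaN
#    so the division never produces infinities.
def safe_ratio(numerator, denominator):
    return numerator / denominator.replace(0, np.nan)

df["Settled_Amount_ratio"] = safe_ratio(df["Settled"], df["Amount"])
df["Outstanding_Amount_ratio"] = safe_ratio(df["Outstanding"], df["Amount"])
df["Outstanding_Settled_ratio"] = safe_ratio(df["Outstanding"], df["Settled"])

# 4. Per-vendor statistics of the numerical columns, joined back per row.
num_cols = ["Amount", "Settled", "Outstanding"]
stats = ["mean", "median", "min", "max", "std", "count"]
vendor_stats = df.groupby("Vendor Name")[num_cols].agg(stats)
vendor_stats.columns = [f"Vendor_{col}_{stat}" for col, stat in vendor_stats.columns]
df = df.join(vendor_stats, on="Vendor Name")
```

The standard deviation is NaN for vendors with a single invoice, so a model that cannot handle missing values would need those filled. When validating, it is usually safer to compute the vendor statistics on the training rows only.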

Training & Validation

  1. We used CatBoost, a high-performance algorithm for gradient boosting on decision trees.
  2. It supports categorical features without any preprocessing.
  3. It reduces overfitting through a novel gradient-boosting scheme.
  4. We evaluated the model with 5-fold cross-validation.
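
The evaluation loop could look roughly like the sketch below. It is an illustration on synthetic data and not the project's training script: the hyperparameters, the mean absolute error metric and the make_model helper are all assumptions. If the catboost package is not installed, the sketch falls back to scikit-learn's GradientBoostingRegressor so the loop still runs.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold

try:
    from catboost import CatBoostRegressor

    def make_model():
        # Assumed settings, chosen only to keep the example fast and quiet.
        return CatBoostRegressor(iterations=200, verbose=0, random_seed=0,
                                 allow_writing_files=False)
except ImportError:
    from sklearn.ensemble import GradientBoostingRegressor

    def make_model():
        # Stand-in used only when catboost is unavailable.
        return GradientBoostingRegressor(random_state=0)

# Synthetic stand-in for the engineered features and the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 30 + 5 * X[:, 0] + rng.normal(size=200)

# 5-fold cross-validation: train on four folds, score on the held-out fold.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, valid_idx in kf.split(X):
    model = make_model()
    model.fit(X[train_idx], y[train_idx])
    preds = model.predict(X[valid_idx])
    scores.append(mean_absolute_error(y[valid_idx], preds))

print("MAE per fold:", np.round(scores, 3))
print("Mean MAE:", round(float(np.mean(scores)), 3))
```

With real data, categorical columns such as Vendor Name can be handed to CatBoost directly through the cat_features argument of fit instead of being encoded by hand.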

Other Approaches we tried

  1. A classical neural network with one-hot encoding for the categorical features.
  2. Decision trees.
  3. A StackingRegressor combining DecisionTree, XGB, CatBoost and RandomForest; CatBoost alone still performed better.
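
For reference, a stacked model of this kind can be set up with scikit-learn as sketched below. To keep the example free of extra dependencies it stacks only a decision tree and a random forest; XGB and CatBoost regressors could be added to the estimators list in the same way. The synthetic data, the hyperparameters and the Ridge final estimator are assumptions, not the project's configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the engineered features and the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 30 + 5 * X[:, 0] - 3 * X[:, 1] + rng.normal(size=300)

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Base models produce out-of-fold predictions (cv=5), and the final
# estimator learns how to combine those predictions.
stack = StackingRegressor(
    estimators=[
        ("tree", DecisionTreeRegressor(max_depth=4, random_state=0)),
        ("forest", RandomForestRegressor(n_estimators=50, random_state=0)),
    ],
    final_estimator=Ridge(),
    cv=5,
)
stack.fit(X_train, y_train)
stack_preds = stack.predict(X_valid)
```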

Presentation slides explaining our approach

Competition Link
