
cashinvoicepredictor's Introduction

Problem Statement:

Part 1: Predict and identify the customers who have a very high delay in invoice-to-cash collection.
Part 2: Predict the weekly collection insight from a given date, i.e. the customers who are likely to pay within the next week of that date.
Part 3: Predict the sales of cash-based customers for the next 5 days from the current date.

Analytics

• The first problem: the dataset provided was the past 9 months' record of the BPR-only, B2B customers of Tata Steel.
• The dataset contained both cross-sectional and time-series features, i.e. it was a panel dataset, so neither basic machine-learning approaches nor a pure ARIMA model could be used on its own.
• The main challenge was to predict the minority class: the dataset contained over 110 thousand data points, of which more than 90 thousand belonged to the No Delay class (customer transactions that incurred no delay in invoice-to-cash collection), while only 224 transactions suffered a significant delay of more than 30 days.
  o The technique used to tackle this imbalance was to generate synthetic data points using the ADASYN algorithm.
  o Paper: ADASYN: Adaptive Synthetic Sampling Approach for Imbalanced Learning.
  o Authors: Haibo He, Yang Bai, Edwardo A. Garcia, and Shutao Li.
  o Source: IEEE, 2008.
  o A picture of the algorithm is attached.
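The oversampling step above can be sketched as follows. This is a minimal from-scratch illustration of the ADASYN idea (He et al., 2008), not the project's actual implementation; in practice a library such as imbalanced-learn would typically be used, and the function and parameter names here are illustrative:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def adasyn(X, y, minority=1, k=5, beta=1.0, seed=0):
    """Minimal ADASYN sketch: generate more synthetic minority points
    near samples that are 'hard to learn', i.e. samples surrounded by
    many majority-class neighbours."""
    rng = np.random.default_rng(seed)
    X_min = X[y == minority]
    n_maj, n_min = int((y != minority).sum()), len(X_min)
    G = int((n_maj - n_min) * beta)          # total synthetic points to create

    # r_i: fraction of majority points among the k nearest neighbours of x_i
    nn_all = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn_all.kneighbors(X_min)        # idx[:, 0] is the point itself
    r = (y[idx[:, 1:]] != minority).mean(axis=1)
    if r.sum() == 0:                         # no borderline minority points
        return X, y
    g = np.rint(r / r.sum() * G).astype(int) # per-point synthetic counts

    # interpolate between each x_i and a random minority-class neighbour
    nn_min = NearestNeighbors(n_neighbors=min(k + 1, n_min)).fit(X_min)
    _, idx_min = nn_min.kneighbors(X_min)
    synth = []
    for i, gi in enumerate(g):
        for _ in range(gi):
            j = rng.choice(idx_min[i][1:])
            lam = rng.random()
            synth.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    if synth:
        X = np.vstack([X, synth])
        y = np.concatenate([y, np.full(len(synth), minority)])
    return X, y
```

The adaptive part is the vector g: minority samples deep inside their own class get few or no synthetic copies, while samples on the class boundary get many.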

• The next task was feature engineering, where two columns were generated manually: the number of pending invoices prior to an invoice's posting date for every customer, and the amount of those pending invoices prior to the posting date for every customer.
• After feature engineering, the choice of model was important, this being a supervised classification problem. After checking the accuracy of three individual models, a weighted voting classifier was used to predict the final output.
• The three models used were Random Forest, AdaBoost with a decision tree as the base learner, and a Bagging classifier with KNN as the base estimator. Random Forest performed the best. Although the exact reasons are not established, a few probable ones are:
  o Random Forests can handle thousands of input variables without variable deletion.
  o They generate an internal unbiased estimate of the generalization error as the forest building progresses.
  o Prototypes can be computed that give information about the relation between the variables and the classification.

• The final output contained 3 classes: No Delay (customers with no invoice-to-cash delay); Low Delay (customers who pay within 7 days of the due date); High Delay (customers who do not pay within the first week).
• The dataset was split into 3 segments: Training 70%; Testing and Validation 15% each.
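The 70/15/15 split can be produced with two chained calls to scikit-learn's train_test_split (a common pattern; the exact mechanics used in the project are not stated):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)   # stand-in for the invoice feature matrix
y = np.zeros(1000)                   # stand-in for the delay labels

# first carve off 30%, then split that half-and-half into test/validation
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.30,
                                                    random_state=0)
X_test, X_val, y_test, y_val = train_test_split(X_rest, y_rest, test_size=0.50,
                                                random_state=0)
print(len(X_train), len(X_test), len(X_val))  # 700 150 150
```

Given the heavy class imbalance described above, passing stratify=y to both calls would normally be preferable so all three segments keep the same class proportions.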

• Accuracy:
  o Training accuracy: 99.35%
  o Testing accuracy: 97.28%


• The second problem: an interesting problem where we needed to predict the estimated clearing date.
• Two models were used to predict the estimated clearing date:
  o First: a model predicting the estimated clearing date from the net due date.
  o Second: a model predicting the estimated clearing date from the posting date.
  o The agreement of both models within a range of 5 days was used to predict the final number of invoices that would be cleared.
• In the second problem too, Random Forest outshone the other classifiers.
• The training and testing data comprised all data before the input posting date; the validation data comprised all data points whose invoices were raised before the posting date but had not been cleared.
  o Training accuracy for Model 1: 99.98%
  o Testing accuracy for Model 1: 94.04%
  o Training accuracy for Model 2: 99.36%
  o Testing accuracy for Model 2: 97.67%
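The 5-day agreement rule above might look like this in outline; the day-offset predictions are placeholders standing in for the outputs of the two Random Forest models:

```python
import numpy as np

def agreed_clearances(pred_from_due, pred_from_posting, tolerance_days=5):
    """Count invoices where the two models' estimated clearing dates
    (expressed as day offsets) agree within `tolerance_days`."""
    diff = np.abs(np.asarray(pred_from_due) - np.asarray(pred_from_posting))
    return int((diff <= tolerance_days).sum())

# illustrative predictions for 6 invoices: days-to-clearing according to
# the net-due-date model and the posting-date model respectively
model1 = [3, 10, 21, 7, 40, 15]
model2 = [5, 30, 19, 6, 33, 14]
print(agreed_clearances(model1, model2))  # → 4
```

Only the invoices on which both models roughly agree are counted toward the predicted number of clearances, which trades recall for precision.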

• The third problem: not as difficult as the rest. We used ARIMA to forecast the sales of individual customers for the next 5 days from a specified date.
  o ARIMA captures autocorrelation in the series by modelling it directly.
  o Lags of the stationarized series are the "autoregressive" (AR) terms; lags of the forecast errors are the "moving average" (MA) terms.
  o Intuitively, a stationary time series is characterised by its mean, variance and autocorrelation function (ACF); a useful result is that any function of a stationary time series is also stationary.
  o In time series analysis, the partial autocorrelation function (PACF) gives the partial correlation of a time series with its own lagged values, controlling for the values of the series at all shorter lags. It contrasts with the ACF, which does not control for other lags.
  o We plotted the PACF and the ACF and used them to choose p and q, respectively, in ARIMA(p, d, q); d was set by the number of differences needed to make the series stationary.
  o The terminology involved:
     p = number of auto-regressive terms
     d = number of non-seasonal differences
     q = number of moving-average terms
    These three parameters define ARIMA(p, d, q).
  o Advantages: a strong underlying mathematical theory makes it straightforward to produce predictive intervals, and the model is flexible in capturing many different patterns.
  o Disadvantages: no explicit seasonal indices; coefficients are hard to interpret and it is hard to explain "how the model works"; there is a danger of overfitting or mis-identification if not used with care.

cashinvoicepredictor's People

Contributors

soumyabrotobanerjee

