This project is forked from dsp-uga/edamame.

Home Page: https://dsp-uga.github.io/Edamame/

License: MIT License

Web Traffic Forecasting

This repository contains implementations of several web traffic time series forecasting algorithms, completed for CSCI 8360, Data Science Practicum at the University of Georgia, Spring 2018.

This project uses the time series of web visits to Wikipedia pages from the Kaggle competition Web Traffic Time Series Forecasting. The dataset contains the visit records of approximately 145,000 Wikipedia pages, covering 07/31/15 to 12/31/16 for training set 1 and 07/31/15 to 09/01/17 for training set 2. In the training sets, each row represents the visit series of one page and each column represents a day within the target period. The pages are categorized by name, project, access, and agent as follows:

  • Names: page names
  • Projects: website language, such as German (de), English (en), Spanish (es), French (fr), Japanese (ja), Russian (ru), Chinese (zh), mediawiki, commons.wikimedia
  • Access: type of access, as all-access, desktop, or mobile
  • Agents: type of agent, as all-agents or spider

In this repository, we offer two methods, built on different packages, to forecast the next two months of web visits for the 145k pages:

  1. Autoregressive Integrated Moving Average (ARIMA) model, using a repackaged itsm
  2. Long Short-Term Memory (LSTM) model, using Keras

Read more details about each algorithm and its applications in our WIKI tab, or visit our website, Edamame, to follow the process flow.

Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.

Prerequisites

Environment Setting

  1. Clone this repository
$ git clone https://github.com/dsp-uga/Edamame
$ cd Edamame
  2. Set up the environment
$ conda env create -f environments.yml -n edamame_env python=3.6
$ source activate edamame_env
  3. Install this repository as a package
$ python setup.py install

Running the tests

python -m [algorithm] [args-for-the-algorithm]
Algorithms
  • ARIMA: Running Autoregressive Integrated Moving Average model
  • LSTM: Running Long Short-term Memory model

Each folder contains one module that you can run with the command above. You can also import the ARIMA and LSTM modules in Python scripts as usual packages. Each module defines its own arguments; use help() for more details when running the algorithms.

Evaluation

Results are evaluated by the mean SMAPE (Symmetric Mean Absolute Percentage Error) score over the 145k pages. SMAPE is an alternative to MAPE when demand for items is zero or near zero: low-volume items would otherwise have unboundedly high error rates that skew the overall error, so SMAPE caps the per-item error at 200% and reduces the influence of low-volume items.
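As a minimal sketch (not the project's own code), the metric with its 200% cap and the usual Kaggle convention of scoring 0 when both actual and forecast are 0 can be computed as:

```python
import numpy as np

def smape(actual, forecast):
    """Mean SMAPE in percent, bounded at 200.

    A term where both actual and forecast are 0 contributes 0
    (the common Kaggle convention) rather than 0/0.
    """
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    denom = (np.abs(actual) + np.abs(forecast)) / 2.0
    diff = np.abs(forecast - actual)
    safe = np.where(denom == 0.0, 1.0, denom)  # dummy denominator where both are 0
    terms = np.where(denom == 0.0, 0.0, diff / safe)
    return 100.0 * terms.mean()
```

Note that any nonzero forecast against an all-zero actual series scores the maximum 200, which is exactly the outlier behavior discussed for the LSTM results below.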

Test Results

train_1 is the training set 1 of web visits from 07/31/15 to 12/31/16, and train_2 is the training set 2 of web visits from 07/31/15 to 09/01/17.

ARIMA

Preprocessing      Training set                   # of pages   Mean SMAPE
fill NaN with 0    train_1, high sd, stationary   1,867        39.6649
fill NaN with 0    train_1, high sd, stationary   2,075        39.4344
fill NaN with 0    train_1, high sd, stationary   2,358        38.8875

LSTM

Preprocessing      Model structure        Batch size   Epochs   Mean SMAPE
fill NaN with 0    LSTM(50) + Dense(60)   3,000        30       61.9849
fill NaN with 0    LSTM(50) + Dense(60)   5,000        30       61.2177
fill NaN with 0    LSTM(50) + Dense(60)   10,000       50       55.4024
fill NaN with 0    LSTM(50) + Dense(60)   10,000       70       53.8052
fill NaN with 0    LSTM(50) + Dense(60)   10,000       100      59.2045
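The LSTM(50) + Dense(60) models above map a window of past daily visits to the next 60 days. A numpy-only sketch of how one page's series could be sliced into such supervised (input, target) pairs (illustrative only; the window length and helper name are assumptions, and the actual training code uses Keras):

```python
import numpy as np

def make_windows(series, lookback=400, horizon=60):
    """Slice one page's daily series into (input, target) training pairs.

    Each input covers `lookback` past days; each target is the next
    `horizon` days, matching an LSTM + Dense(60) style model head.
    """
    x = np.asarray(series, dtype=float)
    inputs, targets = [], []
    for start in range(len(x) - lookback - horizon + 1):
        inputs.append(x[start:start + lookback])
        targets.append(x[start + lookback:start + lookback + horizon])
    return np.array(inputs), np.array(targets)
```

For a 500-day series this yields 41 overlapping training pairs, each predicting a two-month horizon.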

Discussion

ARIMA

  • Achieves a lower (better) mean SMAPE than LSTM and works well for short-run forecasts on high-frequency data
  • High cost and very time-consuming (about 100 days for 145k pages on training set 1)
  • Requires strict assumption checks before fitting:
    • stationarity check for the ARMA model
    • autocorrelation, seasonal components, and trend components for the ARIMA model
  • One page achieved a nice forecast with a SMAPE score of 7.7685 for ARIMA:
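The full itsm-based pipeline is too long to show here, but the autoregressive core of these models can be illustrated with a hand-rolled AR(1) fit (a hypothetical numpy-only sketch; function names are ours, and the project's actual models come from the repackaged itsm):

```python
import numpy as np

def fit_ar1(series):
    """Fit x_t = c + phi * x_{t-1} by ordinary least squares."""
    x = np.asarray(series, dtype=float)
    X = np.column_stack([np.ones(len(x) - 1), x[:-1]])  # intercept + lag-1
    coef, *_ = np.linalg.lstsq(X, x[1:], rcond=None)
    return coef  # (c, phi)

def forecast_ar1(series, c, phi, steps):
    """Iterate the fitted recurrence forward for `steps` days."""
    out, last = [], float(series[-1])
    for _ in range(steps):
        last = c + phi * last
        out.append(last)
    return out
```

Fitting one such recurrence per page is cheap; the cost reported above comes from the per-page assumption checks and model selection, not from a single least-squares solve.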



LSTM

  • Much faster than ARIMA (only 20 minutes for 20 epochs) and not sensitive to non-stationary data
  • Starts to forget what happened long ago (the limit here is about 400 days)
  • Below is an example of the SMAPE value distribution for the LSTM model; there are quite a few outliers with a SMAPE value of 200.



  • For the pages with a SMAPE value of 200, plotting the raw data against the predicted data shows the raw data are all 0. Inspecting the original data, we found quite a few pages with 0 visits throughout the entire time series.



Performance comparison between the two models




Further Improvement

ARIMA

  • The long running time cannot be fixed as long as we fit each page one by one. Detecting high autocorrelation values with a specific threshold and assigning the seasonal and trend parameters directly might reduce the time spent on the augmented Dickey-Fuller test, which is not as robust as expected.

LSTM

  • A good way to avoid the 200 SMAPE values would be to remove the pages with 0 visits throughout the entire training series. However, there are 752 such series in train_1, while only 38 pages remain 0 all the way to the end of our final prediction date. That leaves 714 pages for which we would have to make predictions out of nothing...
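The filtering proposed above could be sketched as follows (a hypothetical helper, not code from this repository):

```python
import numpy as np

def drop_all_zero_pages(visits):
    """Remove rows (pages) whose entire visit series is 0 or NaN.

    visits: 2-D array, one row per page, one column per day.
    Returns the NaN-filled array and the surviving row indices.
    """
    filled = np.nan_to_num(visits, nan=0.0)  # "fill NaN with 0" preprocessing
    keep = filled.sum(axis=1) > 0            # visits are nonnegative counts
    return filled[keep], np.flatnonzero(keep)
```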

  • It might also help to train different models for different page categories, for example separate models for pages in different languages.
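Grouping by category means splitting the Kaggle page identifier, which ends with _<project>_<access>_<agent> (format assumed from the competition data; page titles may themselves contain underscores, hence the right-split):

```python
def parse_page(identifier):
    """Split a Kaggle page identifier into its four category fields.

    The title itself may contain '_', so split from the right and
    take only the last three separators.
    """
    name, project, access, agent = identifier.rsplit("_", 3)
    return {"name": name, "project": project, "access": access, "agent": agent}
```

Applying this to every row would let us bucket the 145k series by language (project) before training one model per bucket.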

  • A solution to the memory issue of LSTM, proposed by the 1st-place winner of this Kaggle competition, is to feed information from a fixed period in the past into the LSTM model as additional features.

Authors

(Ordered alphabetically)

See the CONTRIBUTORS file for details.

License

This project is licensed under the MIT License - see the LICENSE.md file for details

