profbiyi / web-server-log-analysis-pyspark

This project forked from olalakul/web-server-log-analysis-pyspark

Web-Server-Log-Analysis-with-PySpark

This example demonstrates parsing (including incorrectly formatted strings) and analysis of web server log data.

Log lines may look like this:

  • local - - [24/Oct/1994:13:41:41 -0600] "GET index.html HTTP/1.0" 200 150
  • remote - - [27/Oct/1994:23:17:17 -0600] "GET index.html 200 3185
  • local - - [27/Oct/1994:15:28:10 -0600] "GET index.html Inch Nails HTTP/1.0" 404 -

Out of 726739 log lines, 723267 are parsed with protocol information, 1847 are parsed without protocol information, 1419 are of the "local index.html" type and carry no useful information, and 206 are left unparsed pending a further decision.
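The two-stage parsing described above can be sketched in plain Python. This is a minimal illustration, not the notebook's actual code: the regular expressions, the `parse_line` function, and the category names are assumptions made for this example, and the special "local index.html" category is not modeled. A strict pattern that requires the protocol is tried first, then a looser one that tolerates a missing protocol and closing quote.

```python
import re

# Hypothetical patterns written for this sketch; the notebook's regexes may differ.
# Strict form: the request ends with a protocol such as HTTP/1.0 and a closing quote.
WITH_PROTOCOL = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<path>.*) (?P<protocol>HTTP/\d\.\d)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)$'
)
# Loose form: no protocol, and the closing quote may be missing as well.
WITHOUT_PROTOCOL = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<path>.*?)"? '
    r'(?P<status>\d{3}) (?P<size>\d+|-)$'
)


def parse_line(line):
    """Return (category, fields) for a single log line."""
    line = line.strip()
    match = WITH_PROTOCOL.match(line)
    if match:
        return "with_protocol", match.groupdict()
    match = WITHOUT_PROTOCOL.match(line)
    if match:
        return "without_protocol", match.groupdict()
    # Nothing matched: keep the raw text so the line can be inspected later.
    return "unparsed", {"raw": line}


samples = [
    'local - - [24/Oct/1994:13:41:41 -0600] "GET index.html HTTP/1.0" 200 150',
    'remote - - [27/Oct/1994:23:17:17 -0600] "GET index.html 200 3185',
    'local - - [27/Oct/1994:15:28:10 -0600] "GET index.html Inch Nails HTTP/1.0" 404 -',
]
results = [parse_line(s) for s in samples]
```

In a PySpark job, a function like this would typically be mapped over the lines of the log file, and the resulting categories counted to produce a breakdown like the one above.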

The analysis includes:
  1. Step-by-step parsing of log lines to arrive at final "production" parsing code

  2. Exploratory data analysis and visualizations

  3. Analysis of "notFound" (404) response codes and visualizations

The data are taken from here. The code assumes that the file "calgary_access_log.gz" has been downloaded, gunzipped, and placed in the "data" subdirectory.

Figures are interactive online

Number of requests with various response codes over time (interactive in notebook)

Percentage of requests with various response codes over time

Counting requests with a certain size of returned content (interactive in notebook)

Counting requests with a certain percentage of notFound-code (interactive in notebook)

At the far left of this histogram are 8751 requests that never returned a "notFound" status (strictly speaking, each returned it no more than 1% of the times it was requested). Everything looks good for those. At the far right of this histogram are 2920 requests that were almost always not found (strictly speaking, more than 99% of the times each was requested). Those definitely require further investigation.
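The per-request notFound percentage behind this histogram can be sketched as follows. This is a plain-Python illustration under stated assumptions, not the notebook's PySpark code: the function names and the sample records are hypothetical, and each record is assumed to be a `(path, status)` pair. The thresholds follow the text: at most 1% for the left edge and more than 99% for the right edge.

```python
from collections import defaultdict


def not_found_percentages(records):
    """Map each requested path to the percentage of its requests that returned 404."""
    totals = defaultdict(int)
    not_found = defaultdict(int)
    for path, status in records:
        totals[path] += 1
        if status == 404:
            not_found[path] += 1
    return {path: 100.0 * not_found[path] / totals[path] for path in totals}


def split_extremes(percentages, low=1.0, high=99.0):
    """Return the paths at the left (<= low) and right (> high) edges of the histogram."""
    rarely_missing = sorted(p for p, pct in percentages.items() if pct <= low)
    mostly_missing = sorted(p for p, pct in percentages.items() if pct > high)
    return rarely_missing, mostly_missing


# Hypothetical records for illustration only.
records = [
    ("a.html", 200), ("a.html", 200),
    ("missing.gif", 404), ("missing.gif", 404),
    ("b.html", 200), ("b.html", 404),
]
percentages = not_found_percentages(records)
rarely_missing, mostly_missing = split_extremes(percentages)
```

Here `a.html` falls at the left edge, `missing.gif` at the right edge, and `b.html` (50%) in neither group.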
