
This project is forked from dom-devel/log-analysis-notes.


Some notes about constructing a log analyser for SEO log analysis. (Also includes some code snippets.)

License: GNU General Public License v3.0



# Help on constructing a tool for SEO log analysis

This is the companion code section for the Guide to Log Analysis with BigQuery.

Onwards. Our log processor needs to do two crucial things:

  1. Format the logs so they can be uploaded to BigQuery. (I will reference turning them into CSV files for the purposes of this tutorial, but BigQuery accepts multiple formats.)
  2. Double check which logs are Googlebot.

There are also several convenience things which are nice to do in code:

  1. Accept log files whether they're zipped or unzipped.
  2. Accept log files which are in an S3 bucket (often necessary if the log files are very large).
  3. Upload the log files to BigQuery.

## How to format the logs

I wrote a Python log parser which takes a regex string and a line of a log file and outputs a Python dictionary with the various values. You can find it under log_parser.py.
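As a sketch of the approach (the regex and the field names below are illustrative assumptions for the common Apache/nginx "combined" format, not the exact ones used in log_parser.py), a named-group regex turns one log line into a dictionary:

```python
import re

# Hypothetical regex for the "combined" log format; field names are assumptions.
COMBINED_LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

def parse_line(regex, line):
    """Match one log line against a named-group regex; return a dict or None."""
    match = regex.match(line)
    return match.groupdict() if match else None
```

Because the groups are named, the resulting dictionary keys double as column names for the eventual CSV.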

It's by far the least terrible and most reusable part of my code, which is why I uploaded it. (I've commented it heavily for any other amateur coders who may be reading this.)

I've also got several regex strings for parsing common log formats that may come in useful.

## How to double-check which logs are Googlebot

The primary way of identifying Googlebot is a reverse DNS lookup followed by a forward DNS lookup, as recommended by Google themselves:

https://support.google.com/webmasters/answer/80553?hl=en

This works fine for most scenarios. However, if you're performing historical analysis on logs which are a year old or more, Google will occasionally have rotated the IP addresses they use for crawling, and your reverse DNS lookup will fail.

In this case we can do an ASN lookup to discover the owner of that particular IP, because Google, as a large organisation, buys static IP ranges to crawl from.

In Python, we use the dns module to perform the reverse and forward DNS lookups, and the cymruwhois module to perform ASN lookups and discover the owner of the IP.
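A minimal sketch of the reverse-then-forward check, using only Python's standard socket module rather than the dns module (the helper names and the domain-suffix list are my own, based on Google's published guidance):

```python
import socket

# Google's docs say verified crawler hostnames end in googlebot.com or google.com.
GOOGLEBOT_DOMAINS = (".googlebot.com", ".google.com")

def is_google_hostname(hostname):
    """True if the hostname sits under one of Google's crawler domains."""
    return hostname.rstrip(".").endswith(GOOGLEBOT_DOMAINS)

def verify_googlebot(ip):
    """Reverse DNS the IP, check the domain, then forward-resolve the
    hostname and confirm it maps back to the same IP."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)       # reverse (PTR) lookup
    except OSError:
        return False
    if not is_google_hostname(hostname):
        return False
    try:
        forward_ips = {info[4][0] for info in socket.getaddrinfo(hostname, None)}
    except OSError:
        return False
    return ip in forward_ips                            # forward confirmation
```

The suffix check matters: a spoofed PTR record like `googlebot.com.attacker.example` must not pass, which is why the comparison is against dot-prefixed domain suffixes rather than a substring match.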

## How best to go about doing this?

The following is a rough spec of how my own code works.

A bash script works through the log files one by one, either selecting them from a folder or downloading them from an S3 bucket. (Working with them one by one was a deliberate decision: the server I run this on is a 1 GB VPS, without enough storage or processing power to run in parallel or hold an entire set of log files if the client has a particularly large website.)
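The per-file driver might be sketched like this in bash (the directory layout, script name and helper functions are assumptions, not the author's actual script):

```shell
#!/usr/bin/env bash
# Hypothetical per-file driver loop; paths and script names are assumptions.
set -euo pipefail

LOG_DIR="${1:-./logs}"
BOT_PATTERN="googlebot|mediapartners-google|adsbot-google"

# Pick grep for plain files, zgrep for gzipped ones.
choose_filter() {
  case "$1" in
    *.gz) echo "zgrep" ;;
    *)    echo "grep"  ;;
  esac
}

process_one() {
  local logfile="$1" filter
  filter="$(choose_filter "$logfile")"
  # grep/zgrep exit 1 when nothing matches, which is fine here.
  "$filter" -i -E "$BOT_PATTERN" "$logfile" > "${logfile}.bots" || true
  # python3 process_logs.py "${logfile}.bots"   # then parse and write CSV
}

for f in "$LOG_DIR"/*.log "$LOG_DIR"/*.log.gz; do
  if [ -e "$f" ]; then
    process_one "$f"
  fi
done
```

Handling one file at a time keeps peak disk and memory usage to roughly one (compressed) log file, which is what makes the 1 GB VPS workable.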

grep or zgrep is used to filter the logs down to the bots I'm interested in. For Googlebot, for example, the following regex will catch 99.9% of Googlebot hits:

```shell
zgrep -i -E "googlebot|mediapartners-google|adsbot-google"
```

The bash script then calls a Python script which processes the logs and outputs them as a CSV file. As the logs are processed line by line, every checked and validated Googlebot IP is saved to a standalone file. Each time the Python script runs, it checks the IP of each log line against the IPs in that file, to avoid repeating the DNS lookups, which are time-consuming.

The Python script also creates the BigQuery schema needed for the upload from the Python dictionary that is created. This is saved to a text file for the bash script to access.
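A naive sketch of deriving a BigQuery schema from one parsed record (the type-guessing rule here is an assumption; the real script may assign types differently):

```python
def bq_schema_from_record(record):
    """Build a BigQuery JSON-style schema from one parsed log record.
    Field names come from the dict keys; types are a naive guess from
    the Python value types (assumption), defaulting to STRING."""
    type_map = {int: "INTEGER", float: "FLOAT", bool: "BOOLEAN"}
    return [
        {"name": key, "type": type_map.get(type(value), "STRING"), "mode": "NULLABLE"}
        for key, value in record.items()
    ]
```

Serialised with `json.dump`, this list is in the shape that `bq load` accepts as a schema file.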

When the Python script is done, the bash script uploads the processed logs to Google Cloud Storage using gsutil before loading them into BigQuery with bq.
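The upload step might look like this (the bucket, dataset and file names are hypothetical; the commands are printed rather than executed here, as a dry run):

```shell
# Hypothetical names; printed as a dry run rather than executed.
BUCKET="gs://my-log-bucket"
TABLE="seo_logs.crawl_logs"
CSV="processed_logs.csv"
SCHEMA="schema.json"   # the schema file written by the Python script

upload_cmds() {
  echo "gsutil cp $CSV $BUCKET/"
  echo "bq load --source_format=CSV --skip_leading_rows=1 $TABLE $BUCKET/$CSV $SCHEMA"
}

upload_cmds
```

Loading from Cloud Storage rather than streaming from the VPS keeps the BigQuery ingest fast and avoids holding the whole CSV in memory locally.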
