caherrerapa / lacerate

This project forked from shivabhusal/lacerate


Opensource Ruby App to scrape Google search results and generate reports


lacerate's Introduction



Lacerate is an open-source Rails application released under the MIT License. It takes keywords from a CSV file, scrapes Google for each one, and extracts information used to generate analytics and reports for business and SEO analysis.

Information we extract from Google

  • Number of AdWords advertisers in the top position.
  • Number of AdWords advertisers in the bottom position.
  • Total number of AdWords advertisers on the page.
  • Display URLs (in green) of the AdWords advertisers in the top position.
  • Display URLs (in green) of the AdWords advertisers in the right side position.
  • Number of the non-AdWords results on the page.
  • URLs of the non-AdWords results on the page.
  • Total number of links (all of them) on the page.
  • Total number of search results for the keyword, e.g. About 21,600,000 results (0.42 seconds).
  • HTML code of the page/cache of the page.
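
As an illustration of the "total number of search results" item, here is a minimal sketch that turns a result-stats line into numbers. The names RESULT_STATS and parse_result_stats are hypothetical, and the pattern assumes the English format shown in the example above; other locales format this line differently.

```ruby
# Matches strings such as "About 21,600,000 results (0.42 seconds)".
# Hypothetical helper, not taken from the Lacerate code base.
RESULT_STATS = /([\d,]+)\s+results?\s*\(([\d.]+)\s+seconds\)/i

def parse_result_stats(text)
  match = RESULT_STATS.match(text)
  return nil unless match

  {
    total_results: match[1].delete(',').to_i, # "21,600,000" becomes 21600000
    seconds: match[2].to_f                    # "0.42" becomes 0.42
  }
end

stats = parse_result_stats('About 21,600,000 results (0.42 seconds)')
puts stats[:total_results] # prints 21600000
```

Returning nil for unrecognised text lets the caller decide whether a missing count is an error or simply a page without a stats line.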

Core Features

  • Live status updates in the dashboard via polling: a progress bar and live data updates
  • Uses multiple servers to boost performance
  • Scrapes Google search pages efficiently

Technical challenges

Preventing IP bans

Google lets ordinary users run as many searches as they like, but it does not tolerate bots. Suspicious activity can get our IP blacklisted, so we must avoid it. Imitating human search patterns is the best way we know to stay unnoticed by Google's bot detection. The things we are going to try:

  • Set a query rate limit
    • Experts recommend a minimum pause of 2 seconds between two consecutive queries.
  • Set your referrer URL
    • A genuine (human) request starts at google.com before the search begins, so the referrer of each request should likewise be set to google.com or similar.
  • Create unique user agents for your proxies
    • Use a familiar user agent, such as Google Chrome's, so that Google believes the request originated from a user's browser.
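
The three measures above can be sketched together in plain Ruby. This is a minimal illustration under stated assumptions, not Lacerate's actual implementation: Throttle, build_search_request, USER_AGENTS and MIN_GAP_SECONDS are hypothetical names, and the user agent strings are placeholders you would replace with ones matching real browsers.

```ruby
require 'net/http'
require 'uri'

# Placeholder pool of browser-like user agents (assumed values).
USER_AGENTS = [
  'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
].freeze

MIN_GAP_SECONDS = 2.0 # the 2-second pause mentioned above

# Enforces a minimum pause between consecutive calls to #wait.
# The clock and sleeper are injectable so the class can be tested without sleeping.
class Throttle
  def initialize(min_gap,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) },
                 sleeper: ->(seconds) { sleep(seconds) })
    @min_gap = min_gap
    @clock = clock
    @sleeper = sleeper
    @last_call = nil
  end

  def wait
    if @last_call
      remaining = @min_gap - (@clock.call - @last_call)
      @sleeper.call(remaining) if remaining > 0
    end
    @last_call = @clock.call
  end
end

# Builds (but does not send) a search request with a referrer and a user agent.
def build_search_request(keyword, user_agent: USER_AGENTS.sample)
  uri = URI('https://www.google.com/search')
  uri.query = URI.encode_www_form(q: keyword)

  request = Net::HTTP::Get.new(uri)
  request['Referer'] = 'https://www.google.com/'
  request['User-Agent'] = user_agent
  request
end
```

A scraping loop would call `throttle.wait` on a `Throttle.new(MIN_GAP_SECONDS)` instance before sending each request, for example with `Net::HTTP.start(uri.hostname, uri.port, use_ssl: true) { |http| http.request(request) }`.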

Redis Connection Limitation

The Redis instance we are using has a connection limit of 20, and we are using 5 servers to process the data.
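
To make the arithmetic concrete: 20 connections shared by 5 servers leaves at most 4 per server. The sketch below computes that budget. The method name and the `reserved` parameter are hypothetical; `reserved` stands in for any connections you set aside for other clients of the same Redis instance.

```ruby
# Splits a shared Redis connection limit evenly across servers.
# Hypothetical helper for planning, not part of Lacerate itself.
def redis_connections_per_server(limit:, servers:, reserved: 0)
  raise ArgumentError, 'servers must be positive' unless servers.positive?

  usable = limit - reserved
  raise ArgumentError, 'reserved exceeds limit' if usable.negative?

  usable / servers # integer division: any remainder stays unused
end

puts redis_connections_per_server(limit: 20, servers: 5) # prints 4
```

Sidekiq processes typically hold a few connections beyond their worker threads, so the per-server concurrency should sit somewhat below this budget.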


Speeding up searches

When there are thousands of keywords to process, querying from a single server (IP) with the necessary pauses is very time consuming. So the best approach is to:

  • employ additional servers to crawl via the internet.
    • for this, we plan to use 5 different Heroku servers/dynos with separate IPs
      • they will use the same Redis instance via a connection pool.
      • they will use the same PostgreSQL database, as they are part of the same system.
  • get rid of duplicate keywords
  • distribute the keywords evenly so that no work is repeated and all workers finish at around the same time.
    • it turns out you only have to enqueue jobs in Sidekiq; since all the Sidekiq instances share the same Redis instance, they pick jobs from the default queues and execute them. By the end of the run, all workers will have completed an almost equal number of jobs.
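
The deduplication and enqueueing steps above could look roughly like this. It is a sketch under assumptions: the CSV is taken to hold one keyword per row in the first column, duplicates are compared case-insensitively, and unique_keywords, enqueue_keywords and ScrapeWorker are hypothetical names.

```ruby
require 'csv'

# Returns the unique, non-empty keywords from CSV text.
# Assumes one keyword per row, in the first column.
def unique_keywords(csv_text)
  CSV.parse(csv_text)
     .map { |row| row.first.to_s.strip }
     .reject(&:empty?)
     .uniq(&:downcase) # "Ruby" and "ruby" count as the same keyword (assumption)
end

# Enqueues one job per keyword and returns the number of jobs enqueued.
# `enqueuer` is any callable; with Sidekiq it could be
# ->(keyword) { ScrapeWorker.perform_async(keyword) }, where ScrapeWorker
# is a hypothetical worker class.
def enqueue_keywords(keywords, enqueuer)
  keywords.each { |keyword| enqueuer.call(keyword) }
  keywords.size
end

keywords = unique_keywords("ruby\nRuby \n\nrails\n")
puts keywords.inspect # prints ["ruby", "rails"]
```

Because every job lands in the same shared queue, no explicit partitioning of keywords per server is needed; each worker simply takes the next job when it becomes free.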

For maintainers

  • Adding new elements
    • If you need to add view elements for a new feature, all the existing visual elements are available in our style guide. The style guide only appears in the development environment, as an extra sidebar menu item called "UI Elements/Style Guide". You can then either copy the code (in SLIM) from views/style_guides/*.slim or write it yourself. This is meant to keep the code consistent and speed up development.


Tools used

  • Ruby: ruby 2.3.1p112 (2016-04-26 revision 54768) [x86_64-linux]
  • Rails: Rails 5.0.2
  • ActiveModel::Serializer
    • ActiveModel::Serializer allows you to define which attributes and relationships you would like to include in your JSON response. It also acts as a presenter where you can define custom methods to display extra information or override how it’s displayed in your JSON.
  • Sidekiq - background job processor
  • Redis - used by Sidekiq as its data store

Known issues:

  • Responsiveness of Charts
    • Charts render correctly on any device when the page loads. However, if you load the page on a wide screen and then zoom in, you might see overflow.
    • If you zoom in and then refresh the page, it should appear just fine.
    • Issue: the chart does not re-render when you resize the window.

System dependencies

Configuration

  • see the application.yml.sample for sample environment variables with dummy data.

API documentation

  • The API is documented, and the documentation is mounted at http://host:port/dev/v1/
    • For production: https://lacera.herokuapp.com/dev/v1/
    • For development: http://localhost:3000/dev/v1/
    • Note: Make sure to include the trailing / in the URLs above. There appears to be an issue with the rspec_api_documentation library.

  • Our API documentation is easy to read and comprehensible.
  • If you update the test cases in spec/acceptance/**/*_spec.rb, then you will have to run rails docs:generate. This will generate the latest API doc in public/dev/v1.
  • Upside: Clients will always be able to see the latest version of the document.

OAuth 2 Guidelines


  • If you wish to use OAuth2 authentication via Facebook for your mobile apps or front-end app (in Ember), here is the workflow.
    • Send a GET request to users/auth/facebook.
    • It returns a redirect response (code 302) to Facebook; make that GET request and let the user authorize the app.
    • Facebook then responds with a redirect to /users/auth/facebook/callback carrying an authorization code; make that GET request, and in return you get an authentication_token and the user's email in JSON format.
    • The mobile app must send the authentication_token and the email with every request to Lacerate in the HTTP request headers.
    • Note: Content-Type → application/json; charset=utf-8 is a must.
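
A client could attach the credentials from the last step as follows. The README does not name the headers, so X-User-Email and X-User-Token below are assumptions, as are the method name authenticated_request and the example path; check the API documentation for the names the server actually expects.

```ruby
require 'net/http'
require 'uri'

# Builds (but does not send) a GET request carrying the credentials.
# Header names are assumed, not confirmed by the Lacerate README.
def authenticated_request(url, email:, token:)
  request = Net::HTTP::Get.new(URI(url))
  request['Content-Type'] = 'application/json; charset=utf-8'
  request['X-User-Email'] = email
  request['X-User-Token'] = token
  request
end

# '/keywords' is a hypothetical endpoint used only for illustration.
request = authenticated_request('http://localhost:3000/keywords',
                                email: 'user@example.com',
                                token: 'token-from-callback')
puts request['Content-Type'] # prints application/json; charset=utf-8
```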

Database creation

  • rails db:create

Database initialization

  • rails db:setup

How to run the test suite

  • rspec spec/

Services

  • ActiveJob using Sidekiq

Deployment instructions

  • deploy to heroku using git push heroku master

Maintainer of this project: Shiva Bhusal
