CrawlerDetect

About

CrawlerDetect is a Ruby version of the PHP class @CrawlerDetect.

It helps to detect bots/crawlers/spiders via the user agent and other HTTP headers. It is currently able to detect thousands of bots/spiders/crawlers.

Why CrawlerDetect?

Compared with other popular bot-detection gems:

|                                                  | CrawlerDetect | Voight-Kampff | Browser |
|--------------------------------------------------|---------------|---------------|---------|
| Number of bot patterns                           | >1000         | ~280          | ~280    |
| Number of HTTP headers checked                   | 10            | 1             | 1       |
| Number of bot-list updates (1st half of 2018)    | 14            | 1             | 7       |

In order to remain up-to-date, this gem does not accept any crawler data updates; any PRs to edit the crawler data should be offered to the original JayBizzle/CrawlerDetect project.

Requirements

  • Ruby: MRI 2.5+ or JRuby 9.3+.

Installation

Add this line to your application's Gemfile:

gem 'crawler_detect'

Basic Usage

CrawlerDetect.is_crawler?("Bot user agent")
# => true

Or if you need the crawler name:

detector = CrawlerDetect.new("Googlebot/2.1 (http://www.google.com/bot.html)")
detector.is_crawler?
# => true
detector.crawler_name
# => "Googlebot"

Rack::Request extension

Optionally, you can add additional methods to the request object:

request.is_crawler?
# => false
request.crawler_name
# => nil

It's more flexible to use request.is_crawler? rather than CrawlerDetect.is_crawler?, because it automatically checks 10 HTTP headers, not only HTTP_USER_AGENT.

The only thing you have to do is configure the Rack::CrawlerDetect middleware:

Rails

class Application < Rails::Application
  # ...
  config.middleware.use Rack::CrawlerDetect
end
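
With the middleware in place, controllers can branch on these request methods. A minimal sketch, assuming a Rails app with the middleware configured as above (PagesController and the tracking logic are hypothetical, not part of the gem):

```ruby
# Skip analytics tracking for bot traffic (illustrative example).
class PagesController < ApplicationController
  after_action :track_page_view

  private

  def track_page_view
    # request.is_crawler? is provided by the Rack::CrawlerDetect middleware
    return if request.is_crawler?

    Rails.logger.info("page view from #{request.user_agent}")
  end
end
```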

Rack

use Rack::CrawlerDetect
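
For a plain Rack app, the middleware can be wired up in config.ru. A minimal sketch, assuming the crawler_detect gem is installed (the response bodies are illustrative):

```ruby
# config.ru — minimal Rack app using the CrawlerDetect middleware
require "crawler_detect"

use Rack::CrawlerDetect

run lambda { |env|
  req = Rack::Request.new(env)
  body = req.is_crawler? ? "bot: #{req.crawler_name}" : "human"
  [200, { "content-type" => "text/plain" }, [body]]
}
```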

Configuration

In some cases you may want to use your own white-list, black-list, or list of HTTP headers used for user-agent detection.

This is possible via CrawlerDetect::Config. For example, you may have an initializer like this:

CrawlerDetect.setup! do |config|
  config.raw_headers_path    = File.expand_path("crawlers/MyHeaders.json", __dir__)
  config.raw_crawlers_path   = File.expand_path("crawlers/MyCrawlers.json", __dir__)
  config.raw_exclusions_path = File.expand_path("crawlers/MyExclusions.json", __dir__)
end

Make sure that your files are valid JSON. Look at the raw files which are used by default for more information.
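
Since a malformed file would break detection, it can be worth sanity-checking each custom file at boot before passing its path to CrawlerDetect.setup!. A stdlib-only sketch (valid_json_file? is a hypothetical helper, not part of the gem):

```ruby
require "json"

# Returns true if the file at `path` exists and parses as JSON,
# false otherwise.
def valid_json_file?(path)
  JSON.parse(File.read(path))
  true
rescue JSON::ParserError, Errno::ENOENT
  false
end
```

You could call this on each of the three configured paths in your initializer and fail fast if any returns false.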

Development

You can run rubocop / rspec with any Ruby version using Docker like this:

docker build --build-arg RUBY_VERSION=3.3 --build-arg BUNDLER_VERSION=2.5 -t crawler_detect:3.3 .
docker run -it crawler_detect:3.3 bundle exec rspec
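
The same image can run RuboCop (assuming the image was built with the command above):

```shell
docker run -it crawler_detect:3.3 bundle exec rubocop
```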

License

MIT License

Contributors

bss, dependabot[bot], loadkpi, olegsmelov, olleolleolle, pandaiolo, walski

