terorie / od-database-crawler

OD-Database Go crawler

License: GNU General Public License v2.0


OD-Database Crawler 🕷

  • Crawler for OD-Database
  • In production at https://od-db.the-eye.eu/
  • Over 880 TB actively crawled
  • Crawls HTTP open directories (standard Web Server Listings)
  • Gets name, path, size and modification time of all files
  • Lightweight and fast

Usage

Deploys

  1. With Config File (if config.yml found in working dir)

    • Download default config
    • Set server.url and server.token
    • Start with ./od-database-crawler server --config <file>
  2. With Flags or env

    • Flags and environment variables override the config file if it exists
    • Run with --help for a list of flags
    • Every flag is available as an environment variable: --output.crawl_stats ➡️ OD_OUTPUT_CRAWL_STATS
    • Start with ./od-database-crawler server <flags>
  3. With Docker

    docker run \
        -e OD_SERVER_URL=xxx \
        -e OD_SERVER_TOKEN=xxx \
        terorie/od-database-crawler
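
The flag-to-environment-variable naming used in the deploy options above can be sketched in Go. This is a minimal illustration inferred from the names listed in the flag reference (upper-case the flag, replace "." and "-" with "_", prefix with OD_). The helper envName is hypothetical and is not taken from the crawler's source.

```go
package main

import (
	"fmt"
	"strings"
)

// envName is a hypothetical helper that derives the environment variable
// name for a flag: strip leading dashes, replace "." and "-" with "_",
// upper-case the result and prefix it with "OD_".
func envName(flag string) string {
	flag = strings.TrimLeft(flag, "-")
	replacer := strings.NewReplacer(".", "_", "-", "_")
	return "OD_" + strings.ToUpper(replacer.Replace(flag))
}

func main() {
	// Flag names taken from the flag reference.
	flags := []string{"server.url", "--output.crawl_stats", "crawl.user-agent"}
	for _, f := range flags {
		fmt.Printf("%s -> %s\n", f, envName(f))
	}
}
```

A helper like this is handy when generating a docker run command or an env file from a list of flags, since the names do not have to be typed twice.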

Flag reference

Here are the most important config flags. For finer control, take a look at /config.yml.

| Flag | Environment | Description | Example |
|------|-------------|-------------|---------|
| server.url | OD_SERVER_URL | OD-DB Server URL | https://od-db.mine.the-eye.eu/api |
| server.token | OD_SERVER_TOKEN | OD-DB Server Access Token | Ask Hexa TM |
| server.recheck | OD_SERVER_RECHECK | Job Fetching Interval | 3s |
| output.crawl_stats | OD_OUTPUT_CRAWL_STATS | Crawl Stats Logging Interval (0 = disabled) | 500ms |
| output.resource_stats | OD_OUTPUT_RESOURCE_STATS | Resource Stats Logging Interval (0 = disabled) | 8s |
| output.log | OD_OUTPUT_LOG | Log File (none = disabled) | crawler.log |
| crawl.tasks | OD_CRAWL_TASKS | Max number of sites to crawl concurrently | 500 |
| crawl.connections | OD_CRAWL_CONNECTIONS | HTTP connections per site | 1 |
| crawl.retries | OD_CRAWL_RETRIES | Number of retries after a temporary failure (e.g. HTTP 429 or timeouts) | 5 |
| crawl.dial_timeout | OD_CRAWL_DIAL_TIMEOUT | TCP Connect timeout | 5s |
| crawl.timeout | OD_CRAWL_TIMEOUT | HTTP request timeout | 20s |
| crawl.user-agent | OD_CRAWL_USER_AGENT | HTTP Crawler User-Agent | googlebot/1.2.3 |
| crawl.job_buffer | OD_CRAWL_JOB_BUFFER | Number of URLs to keep in memory/cache, per job. The rest is offloaded to disk. Decrease this value if the crawler uses too much RAM. (0 = Disable Cache, -1 = Only use Cache) | 5000 |
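
The interval and timeout examples in the reference above (3s, 500ms, 8s, 20s) look like Go duration strings. Assuming they follow the syntax accepted by time.ParseDuration, the sketch below shows how such a value can be checked before it is passed to the crawler, including the "0 = disabled" convention used by the stats intervals. The helper parseInterval is hypothetical and is not part of the crawler.

```go
package main

import (
	"fmt"
	"time"
)

// parseInterval is a hypothetical helper. It parses a duration string such
// as "500ms" and reports whether the interval is enabled, treating a zero
// duration as "disabled".
func parseInterval(raw string) (time.Duration, bool, error) {
	d, err := time.ParseDuration(raw)
	if err != nil {
		return 0, false, fmt.Errorf("invalid interval %q: %w", raw, err)
	}
	return d, d > 0, nil
}

func main() {
	// Example values taken from the flag reference, plus one invalid input.
	for _, raw := range []string{"3s", "500ms", "0", "fast"} {
		d, enabled, err := parseInterval(raw)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%-6s parsed=%v enabled=%v\n", raw, d, enabled)
	}
}
```

Validating values this way before a deploy catches typos such as a missing unit early, instead of at crawler start-up.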
