
pycrawler's Introduction

Setup

  • Open settings.py and adjust database settings
  • DATABASE_ENGINE can be either "mysql" or "sqlite"
  • For sqlite, only DATABASE_HOST is used, and it should begin with a '/'
  • All other DATABASE_* settings are required for mysql
  • DEBUG mode makes the crawler print stats as they are generated during the crawl, along with other debug messages
  • LOGGING is a dictConfig dictionary that logs output to the console and to a rotating file. It works out of the box but can be modified
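As a rough illustration of the settings described above, a settings.py might look something like the sketch below. Only DATABASE_ENGINE, DATABASE_HOST, DEBUG, USE_COLORS and LOGGING are named in this README; the other DATABASE_* names, all values, and the check_database_settings helper are assumptions made for the example and are not taken from the project.

```python
# Illustrative settings sketch. Values are placeholders, not project defaults.
DATABASE_ENGINE = "sqlite"      # either "mysql" or "sqlite"
DATABASE_HOST = "/crawler.db"   # the only DATABASE_* setting used for sqlite
DATABASE_NAME = ""              # hypothetical name, mysql only
DATABASE_USER = ""              # hypothetical name, mysql only
DATABASE_PASS = ""              # hypothetical name, mysql only
DEBUG = True                    # print stats and debug messages while crawling
USE_COLORS = True               # colorize console messages by level


def check_database_settings(engine, host, **other):
    """Hypothetical helper: return a list of problems with the DB settings."""
    problems = []
    if engine not in ("mysql", "sqlite"):
        problems.append('DATABASE_ENGINE must be "mysql" or "sqlite"')
    elif engine == "sqlite":
        # For sqlite only the host is consulted, and it should begin with '/'.
        if not host.startswith("/"):
            problems.append("DATABASE_HOST should begin with '/' for sqlite")
    else:
        # For mysql every DATABASE_* setting must be filled in.
        if not host:
            problems.append("DATABASE_HOST is required for mysql")
        for name, value in other.items():
            if not value:
                problems.append(name + " is required for mysql")
    return problems
```

Calling check_database_settings(DATABASE_ENGINE, DATABASE_HOST) before starting a crawl would surface a misconfigured sqlite path early.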

Current State

  • The mysql engine is untested
  • In some situations the database is locked and queries cannot execute. This is presumably an issue only with sqlite's file-based approach
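One common way to soften "database is locked" errors with sqlite is to retry the statement after a short pause. The sketch below uses the standard library sqlite3 module directly, which is an assumption (the crawler may access the database through a different layer); execute_with_retry and its parameters are hypothetical names and not part of the project.

```python
import sqlite3
import time


def execute_with_retry(conn, sql, params=(), retries=5, delay=0.1):
    """Run a statement, retrying with backoff while the database is locked."""
    for attempt in range(retries):
        try:
            return conn.execute(sql, params)
        except sqlite3.OperationalError as exc:
            # Re-raise anything that is not a lock error, and give up
            # once the last attempt has failed.
            if "locked" not in str(exc) or attempt == retries - 1:
                raise
            time.sleep(delay * (2 ** attempt))  # exponential backoff
```

sqlite3.connect also accepts a timeout argument (in seconds) that controls how long a connection waits on a lock before raising, which may be enough on its own for light contention.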

Logging

  • DEBUG+ level messages are logged to the console, and INFO+ level messages are logged to a file.
  • By default, the file for logging uses a TimedRotatingFileHandler that rolls over at midnight
  • Setting DEBUG in the settings toggles whether or not DEBUG level messages are output at all
  • Setting USE_COLORS in the settings toggles whether or not messages output to the console use colors depending on the level.
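The logging behaviour described above could be expressed as a dictConfig dictionary along these lines. This is a minimal sketch, not the project's actual LOGGING value: build_logging_config, the "plain" formatter, the handler names and the crawler.log file name are all assumptions, and color handling for USE_COLORS is left out.

```python
import logging
import logging.config


def build_logging_config(debug=True, log_path="crawler.log"):
    """Hypothetical helper returning a dictConfig-style dictionary."""
    console_level = "DEBUG" if debug else "INFO"
    return {
        "version": 1,
        "disable_existing_loggers": False,
        "formatters": {
            "plain": {"format": "%(asctime)s %(levelname)s %(name)s: %(message)s"},
        },
        "handlers": {
            # DEBUG and above go to the console (when debug is enabled).
            "console": {
                "class": "logging.StreamHandler",
                "level": console_level,
                "formatter": "plain",
            },
            # INFO and above go to a file that rolls over at midnight.
            "file": {
                "class": "logging.handlers.TimedRotatingFileHandler",
                "level": "INFO",
                "formatter": "plain",
                "filename": log_path,
                "when": "midnight",
            },
        },
        "root": {"level": console_level, "handlers": ["console", "file"]},
    }
```

The dictionary would be applied once at startup with logging.config.dictConfig(build_logging_config(debug=DEBUG)).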

Misc

  • Designed to run on multiple machines that work together to collect info in a central DB
  • Links to be crawled are queued in the database, so any machine running the crawler against the central DB can pull from the same queue. This reduces crawling redundancy.
  • Thread pool approach to analyzing keywords in text.
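The two ideas above, a link queue kept in the database and a thread pool for keyword analysis, can be sketched as follows. This is an illustration under stated assumptions rather than the crawler's actual code: it uses the standard library sqlite3 module, and the queue table, its columns and every function name here are hypothetical.

```python
import sqlite3
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def make_queue(conn):
    # Hypothetical schema: one row per URL, flagged once a worker claims it.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS queue ("
        "id INTEGER PRIMARY KEY, url TEXT UNIQUE, claimed INTEGER DEFAULT 0)"
    )


def enqueue(conn, url):
    # INSERT OR IGNORE skips URLs that are already queued.
    with conn:
        conn.execute("INSERT OR IGNORE INTO queue (url) VALUES (?)", (url,))


def claim_next(conn):
    """Claim the oldest unclaimed URL, or return None if nothing was claimed."""
    with conn:
        row = conn.execute(
            "SELECT id, url FROM queue WHERE claimed = 0 ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        # The "AND claimed = 0" guard means only one worker wins a given row.
        cur = conn.execute(
            "UPDATE queue SET claimed = 1 WHERE id = ? AND claimed = 0", (row[0],)
        )
        return row[1] if cur.rowcount == 1 else None


def count_words(text):
    # Treat words longer than three characters as candidate keywords.
    return Counter(w for w in text.lower().split() if len(w) > 3)


def analyze(texts, workers=4):
    """Count keywords across several texts using a pool of worker threads."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for counts in pool.map(count_words, texts):
            total.update(counts)
    return total
```

Because a URL is marked as claimed in the shared database, a second machine calling claim_next moves on to the next unclaimed row, which is how a shared queue reduces redundant crawling. A caller that gets None after losing a race for a row can simply call claim_next again.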

pycrawler's People

Contributors

mswietochowski, sjparsons, theanti9, wilhall
