
link-extractor's Introduction

Simple producer/consumer web link extractor


Installation

Needed:

  • Python 3.5
  • Redis
  • Supervisord

Preferably inside a virtualenv:

pip install pip-tools (once)

pip-sync requirements*.txt (keeping the PyPI dependencies up-to-date)

Configuration

Custom settings, if any, go in extractor/settings_local.py. (The defaults should normally be sufficient.)

Usage

supervisord starts Supervisord with the background services (Redis and the Celery workers, i.e. the "consumers"). They can then be controlled with supervisorctl. Logs are stored in the log directory.

./app.py is the "producer". It expects a list of URLs on the standard input. The input is parsed for URLs, so you can also feed it HTML: ./app.py < index.html.
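The way the producer picks URLs out of arbitrary input can be sketched roughly as follows. This is only an illustration, not the project's actual parser: the URL_PATTERN regular expression and the find_urls helper are hypothetical names, and the assumption is that any http:// or https:// string in the input counts as a URL.

```python
import re

# Hypothetical pattern (an assumption, not taken from the project):
# match http:// or https:// up to the next whitespace, quote, or angle bracket.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")


def find_urls(text):
    """Return the URLs found in text, in order of appearance, without duplicates."""
    seen = set()
    urls = []
    for match in URL_PATTERN.finditer(text):
        url = match.group(0)
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```

With such a helper, a plain list of URLs and an HTML page are handled the same way, for example by calling find_urls(sys.stdin.read()).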

The consumers process each input URL: the referenced web page is downloaded and parsed for absolute URLs, which are then saved to a JSON file in the out directory. The output file name is the MD5 hash of the input URL.
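A minimal sketch of the consumer step described above, using only the Python standard library. The names AbsoluteLinkParser, output_filename and build_result are hypothetical, the download itself is left out, and it is an assumption that only the href attributes of a tags with an http or https scheme are collected. The project's real implementation may differ.

```python
import hashlib
from html.parser import HTMLParser
from urllib.parse import urlparse


class AbsoluteLinkParser(HTMLParser):
    """Collect absolute http(s) links from <a href="..."> tags (assumed behaviour)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Relative links have an empty scheme and are skipped.
            if name == "href" and value and urlparse(value).scheme in ("http", "https"):
                self.links.append(value)


def output_filename(url):
    """Name the output file after the MD5 hex digest of the input URL."""
    return hashlib.md5(url.encode("utf-8")).hexdigest() + ".json"


def build_result(url, html, version="0.1.0"):
    """Build a record with the same keys as the JSON shown in the example."""
    parser = AbsoluteLinkParser()
    parser.feed(html)
    return {"url": url, "links": parser.links, "version": version}
```

The record can then be written with json.dump to os.path.join("out", output_filename(url)).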

Example

$ supervisord  # if not already done before
$ ./app.py
http://example.com
Ctrl+D
$ jq < out/a9b9f04336ce0181a08e774e01113b31.json
{
  "url": "http://example.com",
  "links": [
    "http://www.iana.org/domains/example"
  ],
  "version": "0.1.0"
}

Testing

./test.sh (also generates a coverage report)

link-extractor's People

Contributors

garncarz
