
originlive / spider


A web crawler engine that gathers data, sorts it and outputs it to file. The goal is for it to be fully customizable and extensible, with possibilities for scripting the behaviour of the spiders.

License: MIT License

Languages: C++ 96.28%, C 3.42%, Shell 0.16%, Makefile 0.13%, Python 0.02%

spider's Introduction

SpideR

NOTE: A lot of functionality is still missing. This will be rectified eventually.

A web crawler engine that gathers data, sorts it and outputs it to file.
The goal is for it to be fully customizable and extensible, with the possibility of scripting the behaviour of the spiders. The emphasis is on speed and ease of embedding.

Requirements:
libcurl, the C++ wrapper curlpp, and gumbo-parser.
Downloads:
https://curl.haxx.se/libcurl/
http://www.curlpp.org/
https://github.com/google/gumbo-parser
On Arch Linux, curl is in the core repos, while libcurlpp and gumbo-git can be found in the AUR.

Installation: Compile from source. On Linux, the included makefile should work.

Settings:
Settings are set through the Settings.json file.
The settings currently available are:
textspeed : int - Currently does nothing.
depth : int - Determines how far the crawler follows the links it finds. Default is 1.
debug : bool - Setting this to 1 sets the verbose flag for the connection. Note that the verbose output currently ends up in the parsed data.
type : unchanged|small|firstcapital|fullcapital - Format in which the words are stored.
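As an illustration of the type setting, here is a minimal C++ sketch of how the four values could be applied to a word. It is not taken from the SpideR source: the names WordFormat, parse_word_format and format_word are hypothetical, and the interpretation of the values (small = all lowercase, firstcapital = first letter uppercase and the rest lowercase, fullcapital = all uppercase) is an assumption, as is the fallback to unchanged for unknown values. Only ASCII words are handled.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical enum mirroring the documented values of the "type" setting.
enum class WordFormat { Unchanged, Small, FirstCapital, FullCapital };

// Maps the setting string to the enum. Unknown values fall back to Unchanged
// (an assumption; the README does not say how bad values are handled).
WordFormat parse_word_format(const std::string& value) {
    if (value == "small") return WordFormat::Small;
    if (value == "firstcapital") return WordFormat::FirstCapital;
    if (value == "fullcapital") return WordFormat::FullCapital;
    return WordFormat::Unchanged;
}

// Applies the chosen format to a single ASCII word and returns the result.
std::string format_word(std::string word, WordFormat format) {
    // Go through unsigned char so std::tolower/std::toupper are well defined.
    auto lower = [](unsigned char c) { return static_cast<char>(std::tolower(c)); };
    auto upper = [](unsigned char c) { return static_cast<char>(std::toupper(c)); };

    switch (format) {
        case WordFormat::Unchanged:
            break;
        case WordFormat::Small:
            std::transform(word.begin(), word.end(), word.begin(), lower);
            break;
        case WordFormat::FirstCapital:
            std::transform(word.begin(), word.end(), word.begin(), lower);
            if (!word.empty()) {
                word[0] = upper(static_cast<unsigned char>(word[0]));
            }
            break;
        case WordFormat::FullCapital:
            std::transform(word.begin(), word.end(), word.begin(), upper);
            break;
    }
    return word;
}
```

Under these assumptions, calling format_word with the word "hELLo" and the firstcapital format would give "Hello".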

Use:
At the moment there are only 3 commands: help, connect and quit.
connect <url> - Attempts to connect to the specified site and gathers words and URLs according to the current settings.
Example:

connect www.google.com

Output goes to Output.txt.
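
To make the command format concrete, here is a rough C++ sketch of how an input line such as the example above could be split into a command and its argument. This is only a sketch and not how SpideR itself parses input: Command, parse_command and is_known_command are hypothetical names, and treating commands as case-sensitive and lowercase is an assumption.

```cpp
#include <sstream>
#include <string>

// Hypothetical result type: the command word plus its first argument.
struct Command {
    std::string name;      // e.g. "connect"
    std::string argument;  // e.g. "www.google.com"; empty for help and quit
};

// Splits a line on whitespace into the command name and one argument.
// Missing tokens simply leave the corresponding field empty.
Command parse_command(const std::string& line) {
    std::istringstream input(line);
    Command command;
    input >> command.name;
    input >> command.argument;
    return command;
}

// Accepts the three documented commands; connect additionally needs a URL.
bool is_known_command(const Command& command) {
    if (command.name == "help" || command.name == "quit") return true;
    return command.name == "connect" && !command.argument.empty();
}
```

With this sketch, the line "connect www.google.com" yields the name "connect" and the argument "www.google.com", while a bare "connect" is rejected because no URL was given.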

spider's People

Contributors

kingrodian, originlive, wolk


spider's Issues

Todo

TODO:

    • Make the program work with depth > 1. Fix the parsing of URLs.
    • Dump data to file regularly to keep the tree from growing too large.
    • An incorrect Settings.json should not crash the program; it should issue an error message instead.
    • Add a quit command.
    • When Settings.json is missing, the user should be given the option to generate a default one.
    • Add the ability to read robots.txt and avoid sites that should not be accessed (patterns like "?id=").
    • Make the robot more polite (make sure not to flood a site, and make this a default but optional setting).
    • Add the ability to deal with SSL or handshakes. (Curl does this.)
    • More customizability: make settings a polymorphic class that has functions to alter logic? (This part will be implemented with a scripting language.)
    • Better Linux console.
    • Sort out the libraries; they should be downloadable rather than embedded.
    • Make the program more of a state machine, polymorphic.
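
For the first item, one possible approach to depth > 1 is a breadth-first traversal that records how many links away from the start each page is. The C++ sketch below is an illustration only, not the project's implementation. It replaces fetching and parsing with a hypothetical in-memory LinkGraph (page URL mapped to the links found on it), and it assumes that depth 0 means only the start page and depth 1 means the start page plus the pages it links to directly. The function name crawl is also hypothetical.

```cpp
#include <map>
#include <queue>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for "fetch a page and extract its links".
using LinkGraph = std::map<std::string, std::vector<std::string>>;

// Returns every URL reachable from `start` by following at most `depth`
// links, in breadth-first order. The start page is at distance 0.
std::vector<std::string> crawl(const LinkGraph& graph,
                               const std::string& start,
                               int depth) {
    std::vector<std::string> visited_order;
    std::set<std::string> seen;  // prevents visiting a URL twice
    std::queue<std::pair<std::string, int>> frontier;

    seen.insert(start);
    frontier.push(std::make_pair(start, 0));

    while (!frontier.empty()) {
        const std::pair<std::string, int> current = frontier.front();
        frontier.pop();
        visited_order.push_back(current.first);

        // Pages at the maximum distance are recorded but not expanded.
        if (current.second >= depth) continue;

        const LinkGraph::const_iterator it = graph.find(current.first);
        if (it == graph.end()) continue;  // page has no known links

        for (const std::string& link : it->second) {
            if (seen.insert(link).second) {
                frontier.push(std::make_pair(link, current.second + 1));
            }
        }
    }
    return visited_order;
}
```

The seen set also keeps pages that link back to each other from causing an endless loop, which matters as soon as the depth is greater than 1.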
