goto-eof / octoscraper

A web scraper tool implemented in Rust.

License: GNU General Public License v3.0

Rust 100.00%
imagedownloader scraper webscraper

octoscraper's Introduction

        ___     _        __                                
        /___\___| |_ ___ / _\ ___ _ __ __ _ _ __   ___ _ __ 
       //  // __| __/ _ \\ \ / __| '__/ _` | '_ \ / _ \ '__|
      / \_// (__| || (_) |\ \ (__| | | (_| | |_) |  __/ |   
      \___/ \___|\__\___/\__/\___|_|  \__,_| .__/ \___|_|   
                                           |_|              

Description

OctoScraper is a multithreaded web scraper implemented in Rust.

Execute it

Download the executable from here and run it.

./octoscraper -w http://dodu.it -e .png,.PNG -d DIRECTORY_NAME -s 100 -t 90000 -i true -l 3 -a OctoScraper

Examples

Download MIDI and MP3 files from any domain, scanning the whole website

./octoscraper -w http://audiomidimania.com -oa true -sd false -r false

Download MIDI and MP3 files from the same domain, processing only the page passed as parameter

./octoscraper -w http://ininternet.org/midi_file.htm -oa true -sd true -r true

Download image files from any domain

./octoscraper -w https://wallpaper.mob.org/ -oi true -sd false -si 1000000

Download video files, processing only the page passed as parameter

./octoscraper -w http://www.w3schools.com/html/html5_video.asp -ov true -r true

where

argument | meaning | example value
-h | help |
-w | website URL, including http/https | http://dodu.it or http://audiomidimania.com or http://wallpaper.mob.org or http://www.w3schools.com/html/html5_video.asp
-sd | stay on the same domain | true
-oi | enable the image extractor | true
-ov | enable the video extractor | true
-oa | enable the audio extractor | true
-oo | enable the other-file extractor | true
-si | minimum image file size (in bytes) | 1000000
-sv | minimum video file size (in bytes) | 1000000
-sa | minimum audio file size (in bytes) | 1000000
-so | minimum other file size (in bytes) | 1000000
-ei | comma-separated list of image extensions | .jpg,.JPG,.png,.PNG
-ev | comma-separated list of video extensions | .ogg,.OGG,.MP4,.mp4
-ea | comma-separated list of audio extensions | .mp3,.MP3,.midi,.MIDI
-eo | comma-separated list of other file extensions | .zip,.ZIP,.exe,.EXE,.pdf,.PDF
-d | directory where files will be saved | Images
-s | sleep time in milliseconds before making each request | 1000
-t | download timeout | 90000
-i | insistent mode (retries until the download succeeds) | true
-l | download limit (by default it makes as many requests as possible) | 3
-a | user agent | OctoScraper
-c | enable a hash check on downloaded files to avoid duplicate downloads | true
-r | process only the root link (only one page) | false
-u | treat resources as unique by filename (1) or by link (2); allowed values: 1 or 2 | 1
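
To make the extension lists (-ei, -ev, -ea, -eo) and the -u option more concrete, here is a minimal Rust sketch of how such filtering could work. It is not taken from the OctoScraper source: the function names parse_extensions, strip_query, has_extension and unique_key are hypothetical. The matching is case-sensitive only because the example values list both .jpg and .JPG, which is an inference rather than documented behavior.

```rust
/// Splits a comma-separated list such as ".jpg,.JPG,.png,.PNG"
/// into individual extensions, skipping empty entries.
/// (Hypothetical helper, not part of OctoScraper.)
fn parse_extensions(list: &str) -> Vec<String> {
    list.split(',')
        .map(|e| e.trim().to_string())
        .filter(|e| !e.is_empty())
        .collect()
}

/// Returns the link without its query string or fragment.
fn strip_query(link: &str) -> &str {
    let end = link
        .find(|c: char| c == '?' || c == '#')
        .unwrap_or(link.len());
    &link[..end]
}

/// True when the link's path ends with one of the extensions.
/// Assumption: the comparison is case-sensitive.
fn has_extension(link: &str, extensions: &[String]) -> bool {
    let path = strip_query(link);
    extensions.iter().any(|ext| path.ends_with(ext.as_str()))
}

/// Key used to detect duplicates, mirroring the -u values:
/// 1 = by file name, anything else = by full link.
fn unique_key(link: &str, mode: u8) -> String {
    match mode {
        1 => strip_query(link)
            .rsplit('/')
            .next()
            .unwrap_or(link)
            .to_string(),
        _ => link.to_string(),
    }
}
```

Under these assumptions, two links that point to the same file name on different hosts would share a key with -u 1 and have distinct keys with -u 2.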

For developers

The reqwest crate needs the OpenSSL development package to build. On Debian-based systems:

sudo apt install libssl-dev

Run the application with your configuration:

cargo run -- -w http://dodu.it -e .png,.PNG -d DIRECTORY_NAME -s 100 -t 90000 -i true -l 3 -a OctoScraper

Tests

cargo test

Tested on Linux, macOS and Windows.

If any problems arise, feel free to contact me.
