
TorCrawl.py


Basic Information:

TorCrawl.py is a Python script to crawl and extract (regular or onion) webpages through the TOR network.

  • Warning: Crawling is not illegal, but violating copyright is. It’s always best to double-check a website’s T&C before crawling it. Some websites set up a robots.txt file to tell crawlers not to visit certain pages. This crawler allows you to go around that, but we always recommend respecting robots.txt.
  • Keep in mind: Extracting and crawling through the TOR network takes some time. That's normal behaviour; you can find more information here.

What makes it simple?

If you are a terminal maniac, you know that things have to be simple and clear. Passing output into other tools is necessary, and accuracy is key.

With a single argument you can read an .onion webpage (or a regular one) through the TOR network, and using pipes you can pass the output to any other tool you prefer.

$ torcrawl -u http://www.github.com/ | grep 'google-analytics'
    <meta name="google-analytics" content="UA-XXXXXX-X">

If you want to crawl the links of a webpage, use -c and BAM! You get a file with all the internal links. You can even use -d to set the crawl depth, and so on. There is also the -p argument to pause for some seconds before the next request.

$ torcrawl -v -u http://www.github.com/ -c -d 2 -p 2
# TOR is ready!
# URL: http://www.github.com/
# Your IP: XXX.XXX.XXX.XXX
# Crawler started from http://www.github.com/ with 2 depth crawl and 2 second(s) delay:
# Step 1 completed with: 11 results
# Step 2 completed with: 112 results
# File created on /path/to/project/links.txt

Installation:

To install this script, you need to clone the repository:

git clone https://github.com/MikeMeliz/TorCrawl.py.git

You'll also need to install the dependencies:

pip install -r requirements.txt

Of course, the TOR service is needed:

Debian/Ubuntu: apt-get install tor (for more distros and instructions)
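
If you want to verify that TOR is actually up before crawling, a quick check against its SOCKS proxy works. This is only a minimal sketch, assuming TOR listens on the default local port 9050 and that the requests and pysocks packages are installed (they are not necessarily part of this project's requirements):

    # check_tor.py -- sanity check that TOR's SOCKS proxy is reachable.
    # Assumption: TOR runs locally on the default SOCKS port 9050, and
    # `requests` + `pysocks` are installed (pip install requests pysocks).
    import requests

    PROXIES = {
        "http": "socks5h://127.0.0.1:9050",   # socks5h: resolve DNS through TOR too
        "https": "socks5h://127.0.0.1:9050",
    }

    try:
        r = requests.get("https://check.torproject.org/", proxies=PROXIES, timeout=30)
        if "Congratulations" in r.text:
            print("TOR is ready!")
        else:
            print("Connected, but traffic does not appear to exit through TOR.")
    except requests.RequestException as exc:
        print(f"TOR does not seem to be reachable on 127.0.0.1:9050: {exc}")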

Arguments:

Arg  Long                  Description
General:
-h   --help                Help
-v   --verbose             Show more information about the progress
-u   --url                 *.onion URL of webpage to crawl or extract
-w   --without             Without the use of the TOR relay
-f   --folder              The directory which will contain the generated files (@guyo13)
Extract:
-e   --extract             Extract page's code to terminal or file (Default: terminal)
-i   --input filename      Input file with URL(s) (separated by line)
-o   --output [filename]   Output page(s) to file(s) (for one page)
-y   --yara                Perform YARA keyword search (0 = search entire HTML object, 1 = search only text)
Crawl:
-c   --crawl               Crawl website (Default output on /links.txt)
-d   --cdepth              Set depth of crawl's travel (Default: 1)
-p   --pause               The length of time the crawler will pause (Default: 0)
-l   --log                 Log file with visited URLs and their response code
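
For contributors, the table above maps naturally onto Python's argparse. The following is only an illustrative sketch of how these flags could be declared; it is not TorCrawl's actual parser, and the option details may differ:

    # Illustrative only: a hypothetical argparse setup mirroring the flag table above.
    # (-h/--help is added automatically by argparse.)
    import argparse

    parser = argparse.ArgumentParser(
        description="Crawl and extract (regular or onion) webpages through TOR.")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="Show more information about the progress")
    parser.add_argument("-u", "--url", help="URL of webpage to crawl or extract")
    parser.add_argument("-w", "--without", action="store_true",
                        help="Without the use of the TOR relay")
    parser.add_argument("-f", "--folder",
                        help="Directory which will contain the generated files")
    parser.add_argument("-e", "--extract", action="store_true",
                        help="Extract page's code to terminal or file")
    parser.add_argument("-i", "--input", metavar="filename",
                        help="Input file with URL(s), one per line")
    parser.add_argument("-o", "--output", metavar="filename",
                        help="Output page(s) to file(s)")
    parser.add_argument("-y", "--yara", type=int, choices=[0, 1],
                        help="YARA search: 0 = whole HTML, 1 = text only")
    parser.add_argument("-c", "--crawl", action="store_true",
                        help="Crawl website (default output: links.txt)")
    parser.add_argument("-d", "--cdepth", type=int, default=1,
                        help="Depth of the crawl (default: 1)")
    parser.add_argument("-p", "--pause", type=float, default=0,
                        help="Seconds to pause between requests (default: 0)")
    parser.add_argument("-l", "--log", action="store_true",
                        help="Log visited URLs and their response codes")
    args = parser.parse_args()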

Usage:

As Extractor:

To just extract a single webpage to terminal:

$ python torcrawl.py -u http://www.github.com
<!DOCTYPE html>
...
</html>

Extract into a file (github.htm) without the use of TOR:

$ python torcrawl.py -w -u http://www.github.com -o github.htm
## File created on /script/path/github.htm

Extract to terminal and find only the line with google-analytics:

$ python torcrawl.py -u http://www.github.com | grep 'google-analytics'
    <meta name="google-analytics" content="UA-*******-*">

Extract to a file and find only the line with google-analytics using YARA:

$ python torcrawl.py -v -w -u https://github.com -e -y 0
...

Note: Update res/keyword.yar to search for other keywords. Use -y 0 for raw HTML searching and -y 1 for text-only search.
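
Under the hood, the keyword search boils down to compiling the rules file and matching it against what was fetched. A minimal sketch of that idea, assuming the yara-python package is installed; the function below is made up for illustration and is not TorCrawl's internal code:

    # Illustrative sketch of the -y behaviour, not TorCrawl's actual implementation.
    # Requires yara-python (pip install yara-python) and the res/keyword.yar rules file.
    import re
    import yara

    def matches_keywords(html: str, mode: int) -> bool:
        rules = yara.compile(filepath="res/keyword.yar")
        if mode == 1:
            # -y 1: crude text-only search -- strip tags before matching.
            data = re.sub(r"<[^>]+>", " ", html)
        else:
            # -y 0: search the raw HTML as-is.
            data = html
        return bool(rules.match(data=data))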

Extract a set of webpages (imported from file) to terminal:

$ python torcrawl.py -i links.txt
...

As Crawler:

Crawl the links of the webpage without the use of TOR, and also show verbose output (really helpful):

$ python torcrawl.py -v -w -u http://www.github.com/ -c
## URL: http://www.github.com/
## Your IP: *.*.*.*
## Crawler Started from http://www.github.com/ with step 1 and wait 0
## Step 1 completed with: 11 results
## File created on /script/path/links.txt

Crawl the webpage with a depth of 2 (2 clicks) and a 5-second pause before crawling the next page:

$ python torcrawl.py -v -u http://www.github.com/ -c -d 2 -p 5
## TOR is ready!
## URL: http://www.github.com/
## Your IP: *.*.*.*
## Crawler Started from http://www.github.com with step 2 and wait 5
## Step 1 completed with: 11 results
## Step 2 completed with: 112 results
## File created on /script/path/links.txt
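
The depth and pause options are easiest to picture as a breadth-first walk: depth 1 collects the links on the start page, depth 2 also visits each of those links, and the crawler sleeps between requests. A rough sketch of that logic, assuming requests and beautifulsoup4 are available; this is an approximation for illustration, not the project's crawler:

    # Rough, illustrative breadth-first crawl with depth and pause -- not TorCrawl's code.
    # Assumes `requests` and `beautifulsoup4` are installed.
    import time
    from urllib.parse import urljoin
    import requests
    from bs4 import BeautifulSoup

    def crawl(start_url: str, depth: int = 1, pause: float = 0.0) -> set:
        seen, frontier = set(), [start_url]
        for step in range(depth):
            next_frontier = []
            for url in frontier:
                html = requests.get(url, timeout=30).text
                for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
                    link = urljoin(url, a["href"])
                    if link not in seen:
                        seen.add(link)
                        next_frontier.append(link)
                time.sleep(pause)  # -p: wait before the next request
            print(f"# Step {step + 1} completed with: {len(seen)} results")
            frontier = next_frontier
        return seen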

As Both:

You can crawl a page and also extract the webpages into a folder with a single command:

$ python torcrawl.py -v -u http://www.github.com/ -c -d 2 -p 5 -e
## TOR is ready!
## URL: http://www.github.com/
## Your IP: *.*.*.*
## Crawler Started from http://www.github.com with step 1 and wait 5
## Step 1 completed with: 11 results
## File created on /script/path/FolderName/index.htm
## File created on /script/path/FolderName/projects.html
## ...

Note: The default (and, for now, only) file for the crawler's links is the links.txt document. Also, to extract right after the crawl you have to provide the -e argument.
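
Conceptually, the extract-after-crawl step just walks links.txt and writes each fetched page into the output folder, deriving a filename from the URL path. A small sketch of that idea, illustrative only (the helper and naming scheme below are assumptions, not TorCrawl's implementation):

    # Illustrative only: save each crawled URL from links.txt into a folder.
    import os
    from urllib.parse import urlparse
    import requests

    def extract_all(links_file: str = "links.txt", folder: str = "output") -> None:
        os.makedirs(folder, exist_ok=True)
        with open(links_file) as fh:
            for url in (line.strip() for line in fh if line.strip()):
                name = urlparse(url).path.strip("/").replace("/", "_") or "index"
                path = os.path.join(folder, name if "." in name else name + ".htm")
                with open(path, "w", encoding="utf-8") as out:
                    out.write(requests.get(url, timeout=30).text)
                print(f"## File created on {os.path.abspath(path)}")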

Following the same logic, you can pipe all these pages to grep (for example) and search for specific text:

$ python torcrawl.py -u http://www.github.com/ -c -e | grep '</html>'
</html>
</html>
...

As Both + Keyword Search:

You can crawl a page, perform a keyword search and extract the webpages that match the findings into a folder with a single command:

$ python torcrawl.py -v -u http://www.github.com/ -c -d 2 -p 5 -e -y 0
## TOR is ready!
## URL: http://www.github.com/
## Your IP: *.*.*.*
## Crawler Started from http://www.github.com with step 1 and wait 5
## Step 1 completed with: 11 results
## File created on /script/path/FolderName/index.htm
## File created on /script/path/FolderName/projects.html
## ...

Note: Update res/keyword.yar to search for other keywords. Use -y 0 for raw HTML searching and -y 1 for text-only search.

Demo:

[Demo GIF: peek 2018-12-08 16-11]

Contributors:

Feel free to contribute to this project! Just fork it, make any changes on your fork, and open a pull request against the current branch! Any advice, help or questions would be appreciated :shipit:

License:

“GPL” stands for “General Public License”. Using the GNU GPL requires that all released improved versions be free software. source & more

Changelog:

v1.21:
    * Fixed typos in delay (-d)
    * Fixed TypeError and IndexError
v1.2:
    * Migrated to Python3
    * Option to generate log file (-l)
    * PEP8 Fixes
    * Fix double folder generation (http:// domain.com)
