rivermont / spidy Goto Github PK

The simple, easy to use command line web crawler.

License: GNU General Public License v3.0

Python 97.91% Makefile 1.28% Dockerfile 0.81%

web-crawler web-spider python python3 crawling crawler

spidy's Introduction

spidy Web Crawler

Spidy (/spˈɪdi/) is the simple, easy to use command line web crawler.
Given a list of web links, it uses Python requests to query the webpages, and lxml to extract all links from the page.
Pretty simple!

Created by rivermont (/rɪvɜːrmɒnt/) and FalconWarriorr (/fælcʌnraɪjɔːr/), and developed with help from these awesome people.
Looking for technical documentation? Check out DOCS.md
Looking to contribute to this project? Have a look at CONTRIBUTING.md, then check out the docs.

🎉 New Features!

Multithreading

Crawl all the things! Run separate threads to work on multiple pages at the same time.
Such fast. Very wow.

PyPI

Install spidy with one line: pip install spidy-web-crawler!

Automatic Testing with Travis CI

Release v1.4.0 - #31663d3

spidy Web Crawler Release 1.4

How it Works

Spidy has two working lists, TODO and DONE.
'TODO' is the list of URLs it hasn't yet visited.
'DONE' is the list of URLs it has already been to.
The crawler visits each page in TODO, scrapes the DOM of the page for links, and adds those back into TODO.
It can also save each page, because datahoarding 😜.

Why It's Different

What sets spidy apart from other web crawling solutions written in Python?

Most of the other options out there are not web crawlers themselves, simply frameworks and libraries through which one can create and deploy a web spider for example Scrapy and BeautifulSoup. Scrapy is a Web crawling framework, written in Python, specifically created for downloading, cleaning and saving data from the web whereas BeautifulSoup is a parsing library that allows a programmer to get specific elements out of a webpage but BeautifulSoup alone is not enough because you have to actually get the webpage in the first place.

But with Spidy, everything runs right out of the box. Spidy is a Web Crawler which is easy to use and is run from the command line. You have to give it a URL link of the webpage and it starts crawling away! A very simple and effective way of fetching stuff off of the web.

Features

We built a lot of the functionality in spidy by watching the console scroll by and going, "Hey, we should add that!"
Here are some features we figure are worth noting.

Error Handling: We have tried to recognize all of the errors spidy runs into and create custom error messages and logging for each. There is a set cap so that after accumulating too many errors the crawler will stop itself.
Cross-Platform compatibility: spidy will work on all three major operating systems, Windows, Mac OS/X, and Linux!
Frequent Timestamp Logging: Spidy logs almost every action it takes to both the console and one of two log files.
Browser Spoofing: Make requests using User Agents from 4 popular web browsers, use a custom spidy bot one, or create your own!
Portability: Move spidy's folder and its contents somewhere else and it will run right where it left off. Note: This only works if you run it from source code.
User-Friendly Logs: Both the console and log file messages are simple and easy to interpret, but packed with information.
Webpage saving: Spidy downloads each page that it runs into, regardless of file type. The crawler uses the HTTP Content-Type header returned with most files to determine the file type.
File Zipping: When autosaving, spidy can archive the contents of the saved/ directory to a .zip file, and then clear saved/.

Tutorial

Using with Docker

Spidy can be easily run in a Docker container.

First, build the Dockerfile: docker build -t spidy .
- Verify that the Docker image has been created: docker images
Then, run it: docker run --rm -it -v $PWD:/data spidy
- --rm tells Docker to clean up after itself by removing stopped containers.
- -it tells Docker to run the container interactively and allocate a pseudo-TTY.
- -v $PWD:/data tells Docker to mount the current working directory as /data directory inside the container. This is needed if you want Spidy's files (e.g. crawler_done.txt, crawler_words.txt, crawler_todo.txt) written back to your host filesystem.

Spidy Docker Demo

Installing from PyPI

Spidy can be found on the Python Package Index as spidy-web-crawler.
You can install it from your package manager of choice and simple run the spidy command.
The working files will be found in your home directory.

Installing from Source Code

Alternatively, you can download the source code and run it.

Python Installation

The way that you will run spidy depends on the way you have Python installed.

Windows and Mac

There are many different versions of Python, and hundreds of different installations for each them.
Spidy is developed for Python v3.5.2, but should run without errors in other versions of Python 3.

Anaconda

We recommend the Anaconda distribution.
It comes pre-packaged with lots of goodies, including lxml, which is required for spidy to run and not including in the standard Python package.

Python Base

You can also just install default Python, and install the external libraries separately.
This can be done with pip:

pip install -r requirements.txt

Linux

Python 3 should come preinstalled with most flavors of Linux, but if not, simply run

sudo apt update
sudo apt install python3 python3-lxml python3-requests

Then cd into the crawler's directory and run python3 crawler.py.

Crawler Installation

If you have git or GitHub Desktop installed, you can clone the repository from here. If not, download the latest source code or grab the latest release.

Launching

Use cd to navigate to the directory that spidy is located in, then run:

python crawler.py

Running

Spidy logs a lot of information to the command line throughout its life.
Once started, a bunch of [INIT] lines will print.
These announce where spidy is in its initialization process.

Config

On running, spidy asks for input regarding certain parameters it will run off of.
However, you can also use one of the configuration files, or even create your own.

To use spidy with a configuration file, input the name of the file when the crawler asks

The config files included with spidy are:

blank.txt: Template for creating your own configurations.
default.cfg: The default version.
heavy.cfg: Run spidy with all of its features enabled.
infinite.cfg: The default config, but it never stops itself.
light.cfg: Disable most features; only crawls pages for links.
rivermont.cfg: My personal favorite settings.
rivermont-infinite.cfg: My favorite, never-ending configuration.

Start

Sample start log.

Autosave

Sample log after hitting the autosave cap.

Force Quit

Sample log after performing a ^C (CONTROL + C) to force quit the crawler.

How Can I Support This?

The easiest thing you can do is Star spidy if you think it's cool, or Watch it if you would like to get updates.
If you have a suggestion, create an Issue or Fork the master branch and open a Pull Request.

Contributors

See the CONTRIBUTING.md

The logo was designed by Cutwell
3onyc - PEP8 Compliance.
DeKaN - Getting PyPI packaging to work.
esouthren - Unit testing.
Hrily - Multithreading.
j-setiawan - Paths that work on all OS's.
michellemorales - Confirmed OS/X support.
petermbenjamin - Docker support.
quatroka - Fixed testing bugs.
stevelle - Respect robots.txt.
thatguywiththatname - README link corrections.

License

We used the Gnu General Public License (see LICENSE) as it was the license that best suited our needs.
Honestly, if you link to this repo and credit rivermont and FalconWarriorr, and you aren't selling spidy in any way, then we would love for you to distribute it.
Thanks!

spidy's People

Contributors

Stargazers

Watchers

spidy's Issues

Tests

I have no experience with writing Python tests, but it seems to be necessary when a program gets big.

Since the small parts of the crawler are already broken up into separate functions, one approach might be to test that each function runs without error and returns the expected type.

No error raised for incorrect input

Checklist

Same issue has not been opened before.

Expected Behavior

Raise an InputError and then stop the crawler.

Actual Behavior

No output, just exits straight to console.

Steps to Reproduce the Problem

Run crawler
Choose 'No' to config file.
Type incorrect type of input

Specifications

Crawler Version: 1.6.2
Platform: Ubuntu (16.04 LTS)
Python Version: 3.5.2
Dependency Versions: Latest

Feature and Bug Reports

Calling any passers-by to take a moment to submit feature requests or bugs, no matter how small!

Please see the README for a general overview of this project, docs.md for some outdated documentation, and CONTRIBUTING.md for some more words.

Fail crawling relative url and protocol

The crawler concat the child's uri relative to the parent :
https://mysite/folder/page
=> found : /js/main.js
https://mysite/folder/page/js/main.js

Same thing when a link doesn't have protocol declared :
https://mysite/folder/page
=> found : //subdomain.mysite/images/myimage.png
https://mysite/folder/page//subdomain.mysite/images/myimage.png

Install
apt-get install python3 python3-lxml python3-requests
apt-get install python3-pip python-pip
pip3 install spidy-web-crawler

Starting spidy Web Crawler version 1.6.5

Am i the only one with this problem ?

Thx for you help

Always download from wikipedia

GitHub Release last in 2017?

Can you please use GitHub Releases so I can Unwatch issues and only follow via Watch-> Custom->Releases.

Documentation Needed

CONTRIBUTING.md has some guidelines, but essentially there is simply a lot of stuff that needs filled out in the docs.

Also, if you would like to use another documentation format feel free. Listing everything is something I came up with in early development but it's probably not scalable.

PyPI Description Formatting

After multiple tries I have yet to get PyPI to format the README correctly. Current state can be viewed here.

At the moment, I followed this SO answer and convert README.md to ReStructuredText when running setup.py. However it looks like you cannot have relative links on a PyPI page, so I changed them all to https links to the GitHub files.

It's still displaying the RST as text and not rendering it, but I don't know where to go from here.

Cookie handling

Feature Description

Some cookie handling functionality would be pretty valuable. Setting cookies in the config file should be trivial to implement. An option to send a GET or POST every n scraped URLs / minutes, and apply the recieved cookies to subsequent request would be great too. I might do a PR if I have the time in the near future.

Checklist

This feature does not already exist.
This feature has not been requested before.

Respect robots.txt

There should be an option (disable-able) to ignore links that are forbidden by a site's robots.txt. Another library or huge regex might need to be used to parse out the domain the page is on.

Fails to crawl certain sites

Expected Behavior

crawl like hell

Actual Behavior

dies of an unknown error

Steps to Reproduce the Problem

echo "https://www.golem.de/"> ./crawler_todo.txt
spidy

What I've tried so far:

Using spidy

Autosave triggered by single thread and not global.

Checklist

Same issue has not been opened before.

Expected Behavior

All threads to stop as crawler prints info and saves files.

Actual Behavior

Once one thread reaches SAVE_COUNT links crawled, it saves while the other threads continue. This results in [CRAWL] logs in between [INFO] logs.

It seems like this is inefficient and could result in some saving errors.

Steps to Reproduce the Problem

Run crawler
Wait for the autosave cap to be hit.

Specifications

Crawler Version: 1.6.2
Platform: Ubuntu (16.04 LTS)
Python version: 3.5.2
Dependency Versions: All latest.

Docker is unusable

Expected Behavior

Docker should simplify things not make them harder

Actual Behavior

Docker is a strugle, you have to build image several times before it works:
It ignores configs that in /data directory, and use only those defaults which was in repo.
It creates results as root, etc etc.

What I've tried so far:

Best workaround is to use -v $PWD:/src/app/spidy/config/ but it still ugly

Crawler saving bad links as jumbled mess - sometimes

If you run the crawler and take a look at crawler_bad (or whatever the bad links file is for your configuration), it will have some links in byte form - another problem - but most lines will be single characters. Must fix.

Write comparison of spidy and similar projects

I have been asked multiple times what makes Spidy different from similar projects like Scrapy and BeautifulSoup. There is a section in the README, but it doesn't directly address other projects.

String Index Error on perfectly normal URLs

Checklist

Same issue has not been opened before.

Expected Behavior

No errors.

Actual Behavior

Seemingly randomly, crawling a url will fail with a

string index out of range

error. There doesn't seem to be anything wrong with the URLs:

http://www.denverpost.com/breakingnews/ci_21119904
https://www.publicintegrity.org/2014/07/15/15037/decades-making-decline-irs-nonprofit-regulation
https://cdn.knightlab.com/libs/timeline3/latest/js/timeline-min.js
https://github.com/rivermont/spidy/
https://twitter.com/adamwhitcroft

Steps to Reproduce the Problem

Run the crawler.
Wait a few seconds.

What I've tried so far

Raising the error gave the traceback:

Exception in thread Thread-4:
Traceback (most recent call last):
  File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.5/threading.py", line 862, in run
    self._target(*self._args, **self._kwargs)
  File "crawler.py", line 260, in crawl_worker
    if link[0] == '/':
IndexError: string index out of range

Specifications

Crawler Version: 1.6.0
Platform: Linux (Ubuntu 16.04 LTS)
Dependency Versions: All latest

Tests for multithreading

Feature Description

It would be great to have a check in tests.py for the multithreading and queue.

Checklist

This feature does not already exist.
This feature has not been requested before.

PyPI Package

It would be great to have spidy on the Package Index (the new one); installing would be only one command. I have tried to get it going, my efforts can be found on the pypi-dev branch.

There are some problems with imports, file saving, tests, etc.

While I would like to be the owner/maintainer of the package, some help getting it started would greatly appreciated.

Save robots.txt results

Currently, a request is sent for a site's robots.txt every time a link is crawled. It would be much faster if results of a robots.txt query were saved in some database. Only one request should need to be sent.

Issue and PR Templates

It would be a good idea to have templates for these. I have created TEMPLATES.md as a start but would love some more input.

I looked here to get started.

Linux version of crawler

So far spidy has been developed on Windows for Windows, but obviously that won't work.

After doing some testing, it seems that on Linux (Ubuntu 16.04, at least) the crawler interprets folders such as config/ with the slash as part of the folder name, whereas on Windows the slash is needed to indicate being a folder.

Update for Python 3.9+, deprecate Reppy dependency

Reppy doesn't work past Python 3.8 - seomoz/reppy#122, seomoz/reppy#132 - which means our robots.txt parser isn't working (#81).
Python 3.8 also reaches end-of-life next year so this needs to happen anyway.

Branch without any robots parser
Find replacement robots parser
Test whole program for 3.9-3.11 compatibility

Platform Support

We would love to confirm that Spidy will run on all systems, and fix any bugs that may be hidden!

Install spidy either from source, through PyPI, or a GitHub Release (instructions found in the README). Run the crawler, using any config file (preferably rivermont-infinite or heavy to test all features) or a custom configuration.
Report any bugs by opening a new Issue here.
Comment your Platform specs, Python version, spidy version, etc. No information is too much.

Tests not working properly

From tests.py.

test_make_words_given_string
- Fails with AttributeError: 'str' object has no attribute 'text'
- make_words needs to be passed a requests.models.Response object in order to extract to text properly, however I'm not sure there is a way to simulate that. Something like StringIO for files...?
- I feel like getting a page and passing it would take too long to be an acceptable test, but I could be wrong. Would take ~5 seconds.
test_mime_lookup_given_unknown_type
- Fails with crawler.HeaderError: Unknown MIME type: this_mime_doesn't_exist
- It should be caught with the assertRaises statement.

Failed crawl for http://www.frankshospitalworkshop.com/

$ docker run --rm -it -v $PWD:/data spidy
[01:01:33] [spidy] [WORKER #0] [INIT] [INFO]: Starting spidy Web Crawler version 1.6.5
[01:01:33] [spidy] [WORKER #0] [INIT] [INFO]: Report any problems to GitHub at https://github.com/rivermont/spidy
[01:01:33] [spidy] [WORKER #0] [INIT] [INFO]: Creating classes...
[01:01:33] [spidy] [WORKER #0] [INIT] [INFO]: Creating functions...
[01:01:33] [spidy] [WORKER #0] [INIT] [INFO]: Creating variables...
[01:01:33] [spidy] [WORKER #0] [INIT] [INFO]: Should spidy load settings from an available config file? (y/n):
n
[01:01:40] [spidy] [WORKER #0] [INIT] [INFO]: Please enter the following arguments. Leave blank to use the default values.
[01:01:40] [spidy] [WORKER #0] [INIT] [INPUT]: How many parallel threads should be used for crawler? (Default: 1):

[01:01:47] [spidy] [WORKER #0] [INIT] [INPUT]: Should spidy load from existing save files? (y/n) (Default: Yes):

[01:01:54] [spidy] [WORKER #0] [INIT] [INPUT]: Should spidy raise NEW errors and stop crawling? (y/n) (Default: No):

[01:01:55] [spidy] [WORKER #0] [INIT] [INPUT]: Should spidy save the pages it scrapes to the saved folder? (y/n) (Default: Yes):

[01:01:55] [spidy] [WORKER #0] [INIT] [INPUT]: Should spidy zip saved documents when autosaving? (y/n) (Default: No):

[01:01:57] [spidy] [WORKER #0] [INIT] [INPUT]: Should spidy download documents larger than 500 MB? (y/n) (Default: No):

[01:01:58] [spidy] [WORKER #0] [INIT] [INPUT]: Should spidy scrape words and save them? (y/n) (Default: Yes):

[01:01:59] [spidy] [WORKER #0] [INIT] [INPUT]: Should spidy restrict crawling to a specific domain only? (y/n) (Default: No):
y
[01:02:02] [spidy] [WORKER #0] [INIT] [INPUT]: What domain should crawling be limited to? Can be subdomains, http/https, etc.
http://www.frankshospitalworkshop.com/
[01:02:07] [spidy] [WORKER #0] [INIT] [INPUT]: Should spidy respect sites' robots.txt? (y/n) (Default: Yes):
y
[01:02:13] [spidy] [WORKER #0] [INIT] [INPUT]: What HTTP browser headers should spidy imitate?
[01:02:13] [spidy] [WORKER #0] [INIT] [INPUT]: Choices: spidy (default), Chrome, Firefox, IE, Edge, Custom:

[01:02:14] [spidy] [WORKER #0] [INIT] [INPUT]: Location of the TODO save file (Default: crawler_todo.txt):
/data/crawler_todo.txt
[01:02:24] [spidy] [WORKER #0] [INIT] [INPUT]: Location of the DONE save file (Default: crawler_done.txt):
/data/crawler_done.txt
[01:02:31] [spidy] [WORKER #0] [INIT] [INPUT]: Location of the words save file (Default: crawler_words.txt):
/data/crawler_words.txt
[01:02:38] [spidy] [WORKER #0] [INIT] [INPUT]: After how many queried links should the crawler autosave? (Default: 100):

[01:02:39] [spidy] [WORKER #0] [INIT] [INPUT]: After how many new errors should spidy stop? (Default: 5):

[01:02:40] [spidy] [WORKER #0] [INIT] [INPUT]: After how many known errors should spidy stop? (Default: 10):

[01:02:41] [spidy] [WORKER #0] [INIT] [INPUT]: After how many HTTP errors should spidy stop? (Default: 20):

[01:02:42] [spidy] [WORKER #0] [INIT] [INPUT]: After encountering how many new MIME types should spidy stop? (Default: 20):

[01:02:43] [spidy] [WORKER #0] [INIT] [INFO]: Loading save files...
[01:02:43] [spidy] [WORKER #0] [INIT] [INFO]: Successfully started spidy Web Crawler version 1.6.5...
[01:02:43] [spidy] [WORKER #0] [INIT] [INFO]: Using headers: {'User-Agent': 'spidy Web Crawler (Mozilla/5.0; bot; +https://github.com/rivermont/spidy/)', 'Accept-Language': 'en_US, en-US, en', 'Accept-Encoding': 'gzip', 'Connection': 'keep-alive'}
[01:02:43] [spidy] [WORKER #0] [INIT] [INFO]: Spawning 1 worker threads...
[01:02:43] [spidy] [WORKER #1] [INIT] [INFO]: Starting crawl...
[01:02:43] [reppy] [WORKER #0] [ROBOTS] [INFO]: Reading robots.txt file at: http://www.frankshospitalworkshop.com/robots.txt
[01:02:45] [spidy] [WORKER #1] [CRAWL] [ERROR]: An error was raised trying to process http://www.frankshospitalworkshop.com/equipment.html
[01:02:45] [spidy] [WORKER #1] [ERROR] [INFO]: An XMLSyntaxError occurred. A web dev screwed up somewhere.
[01:02:45] [spidy] [WORKER #1] [LOG] [INFO]: Saved error message and timestamp to error log file
[01:02:45] [spidy] [WORKER #0] [CRAWL] [INFO]: Stopping all threads...
[01:02:45] [spidy] [WORKER #0] [CRAWL] [INFO]: I think you've managed to download the entire internet. I guess you'll want to save your files...
[01:02:45] [spidy] [WORKER #0] [SAVE] [INFO]: Saved TODO list to /data/crawler_todo.txt
[01:02:45] [spidy] [WORKER #0] [SAVE] [INFO]: Saved DONE list to /data/crawler_todo.txt
[01:02:45] [spidy] [WORKER #0] [SAVE] [INFO]: Saved 0 words to /data/crawler_words.txt

spidy Logo Design

Possibly something reminiscent of the Python logo, but not obvious. Other than that, anything would be great.

Web Crawler GUI!

Having a clicky interface has been a goal for a long time now. There are many users who abhor the command line but are still interested in the tools that use them.

The remnants of a TkInter interface can be found in gui.py.
Some thoughts can be found in the docs here, as well as a wireframe sketch.

Command line arguments

Arguments for:

NOTES: Don't try to use sys.argv.

HEAD Request uses default Requests headers

The initial HEAD request sent to get the document's size is not using the set HTTP headers that the GET request is. Should be a simple fix.

Docker Container

I've been looking for a decent command-line web crawler for some time and came across this project. It seems very promising.

I have been working on getting this project working in a docker container.

Would you be interested in my contributing the dockerfile back to this project for anyone else who might be interested in the same?

Multiple HTTP Threads

Crawling would go much faster connecting to multiple pages at once.

Possible Problems:

Crawling same page twice
Corruption of save files if reading/writing at different places at the same time.

These should be solvable using mutexes.

unusable

hi, i tried to use spidy b.c. it looked promising.
Is it dead?

first:
sudo pip install -r requirements.txt
doest work, reppy is not installable (python 3.9)

snd:
Docker is a pita...
Please look into ConfigArgParse if you need config files BUT make sure that arguments can be used as well
with docker, there is no error log...
I ended with
docker run --rm -it -v $PWD:/data -w /data --entrypoint /src/app/spidy/crawler.py spidy
so that the error log is accessible (why is there no config option?!)

why is a suffix on the config file enforced? What is that? Windows?

thrd:
my config contained either an Ip or a hostname (resolved via /etc/hosts)
Spidy did not spider either.
For the hostname option it gave

ERROR: OSError
EXT: HTTPConnectionPool(host='example.com', port=80): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f4ae176ecc0>: Failed to establish a new connection: [Errno -2] Name or service not known',))

Seems that it doesnt respect /etc/hosts?!
But neither did the ip option work...
e.g. '192.168.1.55/wiki/'

Getting indent error when running on fresh install.

Getting the below error

  File "crawler.py", line 217
    package='reppy')
    ^
IndentationError: unexpected indent

when running python crawler.py

Incidentally, that line was changed recently in 3a62f5e

Expected Behavior

a project management platform that integrates with GitHub
If you make changes to crawler.py...

Actual Behavior

an project management platform hat integrates with GitHub
If you make changed to crawler.py...

Steps to Reproduce the Problem

change "an" to "a"
change "hat" to "that"
change "changed" to "changes"

What I've tried so far:

Simple spelling (typo) and grammar correction, also testing contribute process.

Specifications

Crawler Version: NA
Platform: NA
Python Version: NA
Dependency Versions: NA

Checklist

Same issue has not been opened before.

rivermont / spidy Goto Github PK

spidy's Introduction

spidy Web Crawler

🎉 New Features!

Multithreading

PyPI

Automatic Testing with Travis CI

Release v1.4.0 - #31663d3

Contents

How it Works

Why It's Different

Features

Tutorial

Using with Docker

Spidy Docker Demo

Installing from PyPI

Installing from Source Code

Python Installation

Windows and Mac

Anaconda

Python Base

Linux

Crawler Installation

Launching

Running

Config

Start

Autosave

Force Quit

How Can I Support This?

Contributors

License

spidy's People

Contributors

Stargazers

Watchers

Forkers

spidy's Issues

Checklist

Expected Behavior

Actual Behavior

Steps to Reproduce the Problem

Specifications

Feature Description

Checklist

Expected Behavior

Actual Behavior

Steps to Reproduce the Problem

What I've tried so far:

Checklist

Expected Behavior

Actual Behavior

Steps to Reproduce the Problem

Specifications

Expected Behavior

Actual Behavior

What I've tried so far:

Checklist

Expected Behavior

Actual Behavior

Steps to Reproduce the Problem

What I've tried so far

Specifications

Feature Description

Checklist

Expected Behavior

Actual Behavior

Steps to Reproduce the Problem

What I've tried so far:

Specifications

Checklist

Recommend Projects

Recommend Topics

Recommend Org

Jobs