learningequality / basiccrawler
Basic web crawler that automates website exploration and produces web resource trees.
Home Page: https://pypi.org/project/basiccrawler/
License: MIT License
After timing out six times on a bad website, the crawler crashes at line 560 (commit 79a81c4).
I can't find where Dummy404ResponseObject is defined; it's also referenced in https://github.com/fle-internal/te-sushi-chef/blob/master/te_chef.py [thanks, Google].
I was just about to do this when I saw this comment. Can you give me some pointers?
https://github.com/learningequality/ricecooker/blob/master/ricecooker/utils/caching.py has imports from cachecontrol. Do I just steal class CacheForeverHeuristic(BaseHeuristic), or is there something simpler?
By the way, I got a huge number of errors trying to pip install this because of all the dependencies and their dependencies.
After running for an hour or two the crawler can simply stop, mostly because of failed retries. It would be good if it wrote its results to the JSON file before exiting, and could resume from that file when told to do so.
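To make the request concrete, here is a minimal sketch of checkpoint-and-resume for a generic crawl loop. It is not basiccrawler's API: the names crawl_with_checkpoint, load_checkpoint, save_checkpoint, the visit callback, and the layout of the JSON state file are all hypothetical and chosen only for illustration.

```python
import json
import os
from collections import deque


def load_checkpoint(path):
    """Return the saved state dict, or None if no checkpoint file exists."""
    if not os.path.exists(path):
        return None
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def save_checkpoint(path, state):
    """Write the state to a temp file, then rename it over the old checkpoint."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # avoids leaving a half-written checkpoint behind


def crawl_with_checkpoint(start_urls, visit, path, resume=False):
    """Breadth-first crawl that always saves its progress before returning or raising.

    visit(url) is a caller-supplied function (hypothetical) that returns the
    list of links found at url and may raise on failure.
    """
    state = load_checkpoint(path) if resume else None
    if state is None:
        state = {"queue": list(start_urls), "done": {}}
    queue = deque(state["queue"])
    done = state["done"]
    try:
        while queue:
            url = queue[0]  # peek first, so a failing URL stays queued for the next run
            if url not in done:
                links = visit(url)
                done[url] = links
                queue.extend(link for link in links if link not in done)
            queue.popleft()
    finally:
        # Runs on normal exit and on any exception, so partial results survive.
        save_checkpoint(path, {"queue": list(queue), "done": done})
    return done
```

Calling crawl_with_checkpoint(start_urls, visit, "state.json", resume=True) after a crash would pick up at the URL that failed, keeping everything already collected. Whether the real crawler's tree structure can be serialized this simply is an open question.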
ERROR:crawler:ERROR 404 when getting url=https://www.cdc.gov/ncbddd/spanish/actearly/milestones/photolibrary/videos/1year/language/Trata-de-imitar-las-palabras-que-escucha-508.pdf
WARNING:crawler:HEAD request failed for url https://www.cdc.gov/ncbddd/spanish/actearly/milestones/photolibrary/videos/1year/language/Trata-de-imitar-las-palabras-que-escucha-508.pdf
Traceback (most recent call last):
  File "/usr/lib/python3.8/encodings/idna.py", line 165, in encode
    raise UnicodeError("label empty or too long")
UnicodeError: label empty or too long

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "./scan-cdc", line 26, in <module>
    channel_tree = crawler.crawl(limit=300000)
  File "/usr/local/lib/python3.8/dist-packages/basiccrawler/crawler.py", line 312, in crawl
    verdict, head_response = self.is_media_file(original_url)
  File "/usr/local/lib/python3.8/dist-packages/basiccrawler/crawler.py", line 195, in is_media_file
    head_response = self.make_request(url, method='HEAD')
  File "/usr/local/lib/python3.8/dist-packages/basiccrawler/crawler.py", line 409, in make_request
    response = self.SESSION.request(method, url, *args, timeout=timeout, **kwargs)
  File "/usr/lib/python3/dist-packages/requests/sessions.py", line 533, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/lib/python3/dist-packages/requests/sessions.py", line 646, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/cachecontrol/adapter.py", line 53, in send
    resp = super(CacheControlAdapter, self).send(request, **kw)
  File "/usr/lib/python3/dist-packages/requests/adapters.py", line 439, in send
    resp = conn.urlopen(
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 665, in urlopen
    httplib_response = self._make_request(
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 376, in _make_request
    self._validate_conn(conn)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 996, in _validate_conn
    conn.connect()
  File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 300, in connect
    conn = self._new_conn()
  File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 156, in _new_conn
    conn = connection.create_connection(
  File "/usr/lib/python3/dist-packages/urllib3/util/connection.py", line 61, in create_connection
    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
  File "/usr/lib/python3.8/socket.py", line 918, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
UnicodeError: encoding with 'idna' codec failed (UnicodeError: label empty or too long)
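
The traceback does not show which URL triggered the failure, but the message comes from Python's idna codec rejecting a hostname with an empty label or one longer than 63 characters. One possible workaround, sketched here under the assumption that such URLs can simply be skipped, is to check the hostname before making the HEAD request. The function name hostname_is_encodable is hypothetical and is not part of basiccrawler.

```python
from urllib.parse import urlsplit


def hostname_is_encodable(url):
    """Return True if the URL has a hostname that the idna codec accepts.

    This mirrors the encoding step that socket.getaddrinfo performs on the
    host, so a URL that fails here would be expected to fail there as well.
    """
    try:
        host = urlsplit(url).hostname
    except ValueError:
        # urlsplit rejects some malformed netlocs outright.
        return False
    if not host:
        # Relative links and schemes without a host have nothing to resolve.
        return False
    try:
        host.encode("idna")
    except UnicodeError:
        # Raised for an empty label or a label that is too long.
        return False
    return True
```

A crawler could then filter its candidates with something like [u for u in urls if hostname_is_encodable(u)], or log and skip the offending link. Catching UnicodeError around the request itself would be an alternative that avoids duplicating the check.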