jmg / crawley
Pythonic Crawling / Scraping Framework based on Non Blocking I/O operations.
Home Page: http://project.crawley-cloud.com
Implement a way to use a regex in the scraper's matching URLs.
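A minimal sketch of what regex-based URL matching could look like. The `url_patterns` attribute and `matches` method are illustrative names, not part of crawley's actual API:

```python
import re

# Hypothetical sketch: a scraper that tests each discovered URL against
# compiled regexes instead of plain string prefixes.
class RegexScraper(object):
    # One or more patterns the crawler checks every discovered URL against.
    url_patterns = [
        re.compile(r"^https?://example\.com/products/\d+$"),
        re.compile(r"^https?://example\.com/category/[\w-]+/?$"),
    ]

    @classmethod
    def matches(cls, url):
        """Return True if any pattern matches the given URL."""
        return any(p.match(url) for p in cls.url_patterns)

print(RegexScraper.matches("http://example.com/products/42"))  # True
print(RegexScraper.matches("http://example.com/about"))        # False
```

The crawler would call `matches()` for each URL it extracts and dispatch matching pages to this scraper's `scrape()` method.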
Hello:
Is there a way to access the URL of the current document inside the scrape() method of the scraper?
Thanks!
-K
Consider the possibility of making a simple web browser desktop application that allows end users to scrape web pages with a GUI. This app should show the web page to the user and let him select the needed data. Then it should generate a DSL template based on the selected data and submit it to the crawler.
Here's the code for a simple Qt-based web browser:
https://github.com/jmg/simple-web-browser
We can start from this code base.
Create a crawley project that demonstrates how to use the crawler's login and then scrape data behind session-protected pages.
Evaluate the possibility of using difflib to recognize similar HTML pages.
http://docs.python.org/library/difflib.html
Write some tests to check that it works correctly and reasonably fast.
Then we can write a "SmartCrawler" class which crawls the web searching for similar pages.
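The difflib idea above can be sketched in a few lines: `SequenceMatcher.ratio()` gives a similarity score in [0, 1] between two documents (for speed, `quick_ratio()` provides a cheaper upper bound):

```python
import difflib

def page_similarity(html_a, html_b):
    """Similarity ratio in [0, 1] between two HTML documents."""
    return difflib.SequenceMatcher(None, html_a, html_b).ratio()

# Two pages with the same template but different data...
page_one = "<html><body><h1>Product A</h1><p>price: 10</p></body></html>"
page_two = "<html><body><h1>Product B</h1><p>price: 12</p></body></html>"
# ...versus a structurally different page.
other = "<html><head><title>Blog</title></head><body>post text</body></html>"

print(page_similarity(page_one, page_two))  # close to 1.0
print(page_similarity(page_one, other))     # noticeably lower
```

A "SmartCrawler" could group pages whose ratio exceeds a threshold and apply the same scraper to all of them.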
Hi,
there are some missing dependencies on master branch.
If I try to use the shell, the following packages are missing:
I can't find anything in the documentation about how to use mongodb to save the crawled data. Am I missing something ?
We have a simple DSL designed and we're able to compile it into scraper classes.
Now we can finish integrating the run-time generated scrapers with the crawlers.
Write more tests and more complex DSL templates.
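A toy sketch of the idea of compiling a DSL template into a scraper class at run time. The `field -> xpath` syntax and the generated class layout here are purely illustrative, not crawley's actual DSL:

```python
# Illustrative DSL: each line maps a field name to an XPath expression.
TEMPLATE = """
title -> //h1/text()
price -> //span[@class='price']/text()
"""

def compile_template(name, template):
    """Parse a field->XPath template and build a scraper class dynamically."""
    fields = {}
    for line in template.strip().splitlines():
        field, _, xpath = (part.strip() for part in line.partition("->"))
        fields[field] = xpath
    # type() creates the class at run time, carrying the parsed mapping.
    return type(name, (object,), {"fields": fields})

ProductScraper = compile_template("ProductScraper", TEMPLATE)
print(ProductScraper.fields["title"])  # //h1/text()
```

A real compiler would attach extraction logic (e.g. evaluating each XPath against the fetched document), but the dynamic-class mechanism is the same.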
Replace the non-Pythonic method "inspect_module" in manager/utils with metaclasses in order to read the models and crawlers modules written by users. :-)
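A minimal sketch of the metaclass approach: every subclass registers itself at class-creation time, so no module inspection is needed to discover user-written models. The names `ModelRegistry` and `Model` are illustrative:

```python
# Sketch of metaclass-based discovery: user-defined classes register
# themselves on creation, replacing module inspection.
class ModelRegistry(type):
    models = []

    def __new__(mcs, name, bases, attrs):
        cls = super().__new__(mcs, name, bases, attrs)
        if bases:  # skip the abstract base class itself
            ModelRegistry.models.append(cls)
        return cls

class Model(metaclass=ModelRegistry):
    pass

# Anything the user writes is picked up automatically:
class Product(Model):
    pass

class Review(Model):
    pass

print([m.__name__ for m in ModelRegistry.models])  # ['Product', 'Review']
```

The manager can then iterate `ModelRegistry.models` instead of walking module attributes with `inspect_module`.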
Any ideas why I'm getting...
ImportError: cannot import name ScopedSession
Right now we're doing the HTTP requests without any delay. That can be a problem when sending thousands of requests to the same server.
The solution is to make delayed HTTP requests when we are overloading an external server (consider the algorithm to decide this).
Put the delay-time constant in a config file.
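One simple algorithm for the above is a per-host politeness delay: remember when each host was last hit and sleep only if the configured interval has not yet elapsed. This is a sketch; `DELAY_SECONDS` stands in for the constant that would live in the config file, and the `Throttle` class name is illustrative:

```python
import time
from urllib.parse import urlparse

# Assumed config constant; in crawley this would be read from settings.
DELAY_SECONDS = 1.0

class Throttle:
    """Sleep before re-hitting a host that was requested too recently."""

    def __init__(self, delay=DELAY_SECONDS):
        self.delay = delay
        self.last_hit = {}  # host -> timestamp of the last request

    def wait(self, url):
        host = urlparse(url).netloc
        last = self.last_hit.get(host)
        if last is not None:
            elapsed = time.time() - last
            if elapsed < self.delay:
                # Only sleep for the remainder of the interval.
                time.sleep(self.delay - elapsed)
        self.last_hit[host] = time.time()

throttle = Throttle(delay=0.1)
throttle.wait("http://example.com/a")  # first hit on the host: no sleep
throttle.wait("http://example.com/b")  # same host again: sleeps ~0.1 s
```

Requests to different hosts are not delayed by each other, so overall throughput stays high while no single server is hammered.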
I tried to use the shell command to test my XPaths, but it doesn't work.
$ crawley shell http://somewebsite.com/index.html
Traceback (most recent call last):
File "/home/maik/.virtualenvs/crawley/bin/crawley", line 4, in <module>
manage()
File "/home/maik/.virtualenvs/crawley/local/lib/python2.7/site-packages/crawley/manager/__init__.py", line 25, in manage
run_cmd(sys.argv)
File "/home/maik/.virtualenvs/crawley/local/lib/python2.7/site-packages/crawley/manager/__init__.py", line 18, in run_cmd
cmd.checked_execute()
File "/home/maik/.virtualenvs/crawley/local/lib/python2.7/site-packages/crawley/manager/commands/command.py", line 50, in checked_execute
self.execute()
File "/home/maik/.virtualenvs/crawley/local/lib/python2.7/site-packages/crawley/manager/commands/shell.py", line 30, in execute
response = crawler._get_data(url)
AttributeError: 'BaseCrawler' object has no attribute '_get_data'
Check the <a> tags of an HTML document and try to fix them. The URLs obtained from this process must be added to the list of crawled URLs.
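A sketch of the fixing step using the standard library's `urljoin`: relative, root-relative, and scheme-relative `href` values are resolved against the page URL, while non-document links are skipped. The `fix_links` helper is an illustrative name:

```python
from urllib.parse import urljoin

def fix_links(base_url, hrefs):
    """Normalize href values from <a> tags into absolute, crawlable URLs."""
    fixed = []
    for href in hrefs:
        href = href.strip()
        # Skip fragments and non-HTTP schemes: they are not document links.
        if not href or href.startswith(("#", "javascript:", "mailto:")):
            continue
        fixed.append(urljoin(base_url, href))
    return fixed

links = fix_links(
    "http://example.com/catalog/index.html",
    ["item1.html", "/about", "//cdn.example.com/x", "#top", "mailto:a@b"],
)
print(links)
# ['http://example.com/catalog/item1.html', 'http://example.com/about',
#  'http://cdn.example.com/x']
```

The resulting absolute URLs are then safe to append to the crawler's queue of crawled URLs.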
I'm using PyQuery, and I get wrong encoding detection for this page:
http://www1.abracom.org.br/cms/opencms/abracom/pt/associados/resultado_busca.html?nomeArq=0148.html
The problem is that the html has this meta tag:
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
But the page is actually UTF-8.
I get this info from the response headers:
Connection:close
Content-Length:29187
Content-Type:text/html;charset=UTF-8
Date:Fri, 11 Jul 2014 23:21:04 GMT
Last-Modified:Fri, 11 Jul 2014 23:21:05 GMT
Server:OpenCms/7.5.4
That's how the browser (Chrome) is able to guess the right encoding and display the page correctly. I work at a place that has to deal with a lot of different kinds of pages, and I can tell you this is far from a rare case (especially on Brazilian Portuguese websites), so it would be nice to fix this in crawley.
So far I've seen two solutions, as proposed in this answer on SO: using the chardet
module or UnicodeDammit
(from BeautifulSoup).
I've developed these two alternatives locally and tested them with PyQuery; they seem to fix the problem.
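Besides chardet and UnicodeDammit, a lightweight stdlib-only approach is to mimic what the browser does for this page: prefer the charset declared in the HTTP Content-Type response header over the (here, wrong) <meta> tag. This is a sketch, not crawley's implementation; `detect_charset` is a hypothetical helper:

```python
import re

def detect_charset(content_type_header, html_bytes, default="utf-8"):
    """Pick a charset: HTTP header first, then <meta> tag, then a default."""
    # 1) The response header wins when it names a charset (as Chrome does).
    match = re.search(r"charset=([\w-]+)", content_type_header or "", re.I)
    if match:
        return match.group(1).lower()
    # 2) Fall back to the charset= declaration near the top of the document.
    match = re.search(rb"charset=([\w-]+)", html_bytes[:2048], re.I)
    if match:
        return match.group(1).decode("ascii").lower()
    return default

print(detect_charset("text/html;charset=UTF-8", b""))  # utf-8
print(detect_charset(
    "text/html",
    b'<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">',
))  # iso-8859-1
```

For the page above, the header says `charset=UTF-8`, so the misleading iso-8859-1 meta tag is never consulted; chardet or UnicodeDammit would still be useful when neither source is trustworthy.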
I would like to hear your opinion on this issue and if you want, I can submit one of those solutions.
BTW, good work building crawley, I'm having a very nice time using it! Hope I can contribute somehow.