jmg / crawley
Pythonic Crawling / Scraping Framework based on Non Blocking I/O operations.
Home Page: http://project.crawley-cloud.com
Implement a way to use a regex in the scraper's matching URLs.
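A minimal sketch of what regex-based URL matching could look like. The `url_patterns` attribute and `matches` method are illustrative names, not part of crawley's actual API:

```python
import re

# Hypothetical sketch: a scraper that tests each discovered URL against
# compiled regexes instead of plain string prefixes.
class RegexScraper(object):
    # One or more patterns the crawler checks every discovered URL against.
    url_patterns = [
        re.compile(r"^https?://example\.com/products/\d+$"),
        re.compile(r"^https?://example\.com/category/[\w-]+/?$"),
    ]

    @classmethod
    def matches(cls, url):
        """Return True if any pattern matches the given URL."""
        return any(p.match(url) for p in cls.url_patterns)

print(RegexScraper.matches("http://example.com/products/42"))  # True
print(RegexScraper.matches("http://example.com/about"))        # False
```

The crawler would call `matches()` for each URL it extracts and dispatch matching pages to this scraper's `scrape()` method.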
Hello:
Is there a way to access the URL of the current document inside the scrape() method of the scraper?
Thanks!
-K
Consider the possibility of making a simple web browser desktop application that allows end users to scrape web pages with a GUI. This app should show the web page to the user and let him select the needed data. Then it should generate a DSL template based on the selected data and submit it to the crawler.
Here's the code for a simple Qt-based web browser:
https://github.com/jmg/simple-web-browser
We can start from this code base.
Create a crawley project that demonstrates how to use the crawler's login and then scrape data behind session-protected pages.
Evaluate the possibility of using difflib to recognize similar HTML pages.
http://docs.python.org/library/difflib.html
Write some tests to check that it works correctly and reasonably fast.
Then we can write a "SmartCrawler" class which crawls the web searching for similar pages.
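The difflib idea above can be sketched in a few lines: `SequenceMatcher.ratio()` gives a similarity score in [0, 1] between two documents (for speed, `quick_ratio()` provides a cheaper upper bound):

```python
import difflib

def page_similarity(html_a, html_b):
    """Similarity ratio in [0, 1] between two HTML documents."""
    return difflib.SequenceMatcher(None, html_a, html_b).ratio()

# Two pages with the same template but different data...
page_one = "<html><body><h1>Product A</h1><p>price: 10</p></body></html>"
page_two = "<html><body><h1>Product B</h1><p>price: 12</p></body></html>"
# ...versus a structurally different page.
other = "<html><head><title>Blog</title></head><body>post text</body></html>"

print(page_similarity(page_one, page_two))  # close to 1.0
print(page_similarity(page_one, other))     # noticeably lower
```

A "SmartCrawler" could group pages whose ratio exceeds a threshold and apply the same scraper to all of them.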
Hi,
there are some missing dependencies on master branch.
If I try to use the shell, the following packages are missing:
I can't find anything in the documentation about how to use mongodb to save the crawled data. Am I missing something ?
We have a simple DSL designed and we're able to compile it into scraper classes.
Now we can finish integrating the run-time generated scrapers with the crawlers.
Write more tests and more complex DSL templates.
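A toy sketch of the idea of compiling a DSL template into a scraper class at run time. The `field -> xpath` syntax and the generated class layout here are purely illustrative, not crawley's actual DSL:

```python
# Illustrative DSL: each line maps a field name to an XPath expression.
TEMPLATE = """
title -> //h1/text()
price -> //span[@class='price']/text()
"""

def compile_template(name, template):
    """Parse a field->XPath template and build a scraper class dynamically."""
    fields = {}
    for line in template.strip().splitlines():
        field, _, xpath = (part.strip() for part in line.partition("->"))
        fields[field] = xpath
    # type() creates the class at run time, carrying the parsed mapping.
    return type(name, (object,), {"fields": fields})

ProductScraper = compile_template("ProductScraper", TEMPLATE)
print(ProductScraper.fields["title"])  # //h1/text()
```

A real compiler would attach extraction logic (e.g. evaluating each XPath against the fetched document), but the dynamic-class mechanism is the same.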
Replace the non-Pythonic method "inspect_module" in manager/utils with metaclasses in order to read the models and crawlers modules written by users. :-)
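A minimal sketch of the metaclass approach: every subclass registers itself at class-creation time, so no module inspection is needed to discover user-written models. The names `ModelRegistry` and `Model` are illustrative:

```python
# Sketch of metaclass-based discovery: user-defined classes register
# themselves on creation, replacing module inspection.
class ModelRegistry(type):
    models = []

    def __new__(mcs, name, bases, attrs):
        cls = super().__new__(mcs, name, bases, attrs)
        if bases:  # skip the abstract base class itself
            ModelRegistry.models.append(cls)
        return cls

class Model(metaclass=ModelRegistry):
    pass

# Anything the user writes is picked up automatically:
class Product(Model):
    pass

class Review(Model):
    pass

print([m.__name__ for m in ModelRegistry.models])  # ['Product', 'Review']
```

The manager can then iterate `ModelRegistry.models` instead of walking module attributes with `inspect_module`.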
Any ideas why I'm getting...
ImportError: cannot import name ScopedSession
Right now we're doing the HTTP requests without any delay. That can be a problem when sending thousands of requests to the same server.
The solution is to make delayed HTTP requests when we are overloading an external server (consider the algorithm to decide this).
Put the delay-time constant in a config file.
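One simple algorithm for the above is a per-host politeness delay: remember when each host was last hit and sleep only if the configured interval has not yet elapsed. This is a sketch; `DELAY_SECONDS` stands in for the constant that would live in the config file, and the `Throttle` class name is illustrative:

```python
import time
from urllib.parse import urlparse

# Assumed config constant; in crawley this would be read from settings.
DELAY_SECONDS = 1.0

class Throttle:
    """Sleep before re-hitting a host that was requested too recently."""

    def __init__(self, delay=DELAY_SECONDS):
        self.delay = delay
        self.last_hit = {}  # host -> timestamp of the last request

    def wait(self, url):
        host = urlparse(url).netloc
        last = self.last_hit.get(host)
        if last is not None:
            elapsed = time.time() - last
            if elapsed < self.delay:
                # Only sleep for the remainder of the interval.
                time.sleep(self.delay - elapsed)
        self.last_hit[host] = time.time()

throttle = Throttle(delay=0.1)
throttle.wait("http://example.com/a")  # first hit on the host: no sleep
throttle.wait("http://example.com/b")  # same host again: sleeps ~0.1 s
```

Requests to different hosts are not delayed by each other, so overall throughput stays high while no single server is hammered.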
I tried to use the shell command to test my XPaths, but it doesn't work.
$ crawley shell http://somewebsite.com/index.html
Traceback (most recent call last):
File "/home/maik/.virtualenvs/crawley/bin/crawley", line 4, in <module>
manage()
File "/home/maik/.virtualenvs/crawley/local/lib/python2.7/site-packages/crawley/manager/__init__.py", line 25, in manage
run_cmd(sys.argv)
File "/home/maik/.virtualenvs/crawley/local/lib/python2.7/site-packages/crawley/manager/__init__.py", line 18, in run_cmd
cmd.checked_execute()
File "/home/maik/.virtualenvs/crawley/local/lib/python2.7/site-packages/crawley/manager/commands/command.py", line 50, in checked_execute
self.execute()
File "/home/maik/.virtualenvs/crawley/local/lib/python2.7/site-packages/crawley/manager/commands/shell.py", line 30, in execute
response = crawler._get_data(url)
AttributeError: 'BaseCrawler' object has no attribute '_get_data'
Check the <a> tags of an HTML document and try to fix them. The URLs obtained from this process must be added to the list of crawled URLs.
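A sketch of the fixing step using the standard library's `urljoin`: relative, root-relative, and scheme-relative `href` values are resolved against the page URL, while non-document links are skipped. The `fix_links` helper is an illustrative name:

```python
from urllib.parse import urljoin

def fix_links(base_url, hrefs):
    """Normalize href values from <a> tags into absolute, crawlable URLs."""
    fixed = []
    for href in hrefs:
        href = href.strip()
        # Skip fragments and non-HTTP schemes: they are not document links.
        if not href or href.startswith(("#", "javascript:", "mailto:")):
            continue
        fixed.append(urljoin(base_url, href))
    return fixed

links = fix_links(
    "http://example.com/catalog/index.html",
    ["item1.html", "/about", "//cdn.example.com/x", "#top", "mailto:a@b"],
)
print(links)
# ['http://example.com/catalog/item1.html', 'http://example.com/about',
#  'http://cdn.example.com/x']
```

The resulting absolute URLs are then safe to append to the crawler's queue of crawled URLs.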
I'm using PyQuery, and I get wrong encoding detection for this page:
http://www1.abracom.org.br/cms/opencms/abracom/pt/associados/resultado_busca.html?nomeArq=0148.html
The problem is that the html has this meta tag:
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
But the page is actually UTF-8.
I get this info from the response headers:
Connection:close
Content-Length:29187
Content-Type:text/html;charset=UTF-8
Date:Fri, 11 Jul 2014 23:21:04 GMT
Last-Modified:Fri, 11 Jul 2014 23:21:05 GMT
Server:OpenCms/7.5.4
That's how the browser (Chrome) is able to guess the right encoding and display the page correctly. I work at a place that has to deal with a lot of different kinds of pages, and I can tell you this is far from a rare case (especially on Brazilian Portuguese websites), so it would be nice to fix this in crawley.
So far I've seen two solutions, as proposed in this answer on SO: using the chardet
module or UnicodeDammit
(from BeautifulSoup).
I've developed these two alternatives locally and tested them with PyQuery; they seem to fix the problem.
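Besides chardet and UnicodeDammit, a lightweight stdlib-only approach is to mimic what the browser does for this page: prefer the charset declared in the HTTP Content-Type response header over the (here, wrong) <meta> tag. This is a sketch, not crawley's implementation; `detect_charset` is a hypothetical helper:

```python
import re

def detect_charset(content_type_header, html_bytes, default="utf-8"):
    """Pick a charset: HTTP header first, then <meta> tag, then a default."""
    # 1) The response header wins when it names a charset (as Chrome does).
    match = re.search(r"charset=([\w-]+)", content_type_header or "", re.I)
    if match:
        return match.group(1).lower()
    # 2) Fall back to the charset= declaration near the top of the document.
    match = re.search(rb"charset=([\w-]+)", html_bytes[:2048], re.I)
    if match:
        return match.group(1).decode("ascii").lower()
    return default

print(detect_charset("text/html;charset=UTF-8", b""))  # utf-8
print(detect_charset(
    "text/html",
    b'<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">',
))  # iso-8859-1
```

For the page above, the header says `charset=UTF-8`, so the misleading iso-8859-1 meta tag is never consulted; chardet or UnicodeDammit would still be useful when neither source is trustworthy.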
I would like to hear your opinion on this issue and if you want, I can submit one of those solutions.
BTW, good work building crawley, I'm having a very nice time using it! Hope I can contribute somehow.