jmg / crawley

Pythonic Crawling / Scraping Framework based on non-blocking I/O operations.

Home Page: http://project.crawley-cloud.com

Languages: Python 97.77%, Tcl 2.04%, Shell 0.19%

crawley's People

Contributors

bossiernesto, danielfv, dlitvakb, jmg, waffle-iron


crawley's Issues

Desktop Client Application

Consider the possibility of making a simple web-browser desktop application that allows end users to scrape web pages with a GUI. This app should show the web page to the user and let them select the data they need. Then it should generate a DSL template based on the selected data and submit it to the crawler.

Here's the code for a simple Qt-based web browser:

https://github.com/jmg/simple-web-browser

We can use this code as a starting point.
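
For reference, a minimal browser window of this kind could look like the sketch below (assuming PyQt4 with QtWebKit; the actual simple-web-browser code may be organized differently):

# Sketch only: a minimal Qt-based browser window, assuming PyQt4 + QtWebKit.
import sys
from PyQt4.QtGui import QApplication
from PyQt4.QtCore import QUrl
from PyQt4.QtWebKit import QWebView

app = QApplication(sys.argv)
view = QWebView()
view.load(QUrl("http://somewebsite.com/index.html"))  # page the user wants to scrape
view.show()
sys.exit(app.exec_())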

Missing dependencies

Hi,

there are some missing dependencies on the master branch.
If I try to use the shell, the following packages are missing:

  • pymongo
  • couchdb
  • PyQt4
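
A quick way to check which of them are importable (just a sketch; install the missing ones however you prefer):

# Sketch: report which of the shell's dependencies are missing.
for name in ("pymongo", "couchdb", "PyQt4"):
    try:
        __import__(name)
    except ImportError:
        print("missing dependency: %s" % name)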

Integrate the DSL with Crawlers

We've designed a simple DSL and we're able to compile it into scraper classes.
Now we can finish integrating the run-time generated scrapers with the crawlers.

Write more tests and more complex DSL templates.
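
As a rough illustration of the run-time generation step (the "field -> xpath" template syntax below is hypothetical, not the actual crawley DSL):

# Sketch: build a scraper class at run time from a parsed template.
template = """
title -> //h1/text()
body  -> //div[@id='content']//p/text()
"""

def compile_template(source, base=object):
    fields = {}
    for line in source.strip().splitlines():
        name, xpath = [part.strip() for part in line.split("->")]
        fields[name] = xpath
    # Build the scraper class dynamically; a crawler can then instantiate it.
    return type("GeneratedScraper", (base,), {"fields": fields})

Scraper = compile_template(template)
print(Scraper.fields)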

Delayed Requests

Right now we're making the HTTP requests without any delay. This can be a problem when sending thousands of requests to the same server.

The solution is to make delayed HTTP requests when we're overloading an external server (we need to decide on the algorithm for detecting this).

Put the delay-time constant in a config file.
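
A rough sketch of the throttling part (assuming Python 2, a hypothetical DELAY_SECONDS setting read from the config file, and leaving the overload-detection heuristic out):

# Sketch only: never hit a server faster than one request per DELAY_SECONDS.
import time
import urllib2

DELAY_SECONDS = 1.0  # would come from the project's config file

class DelayedRequester(object):

    def __init__(self, delay=DELAY_SECONDS):
        self.delay = delay
        self._last = 0.0

    def get(self, url):
        wait = self.delay - (time.time() - self._last)
        if wait > 0:
            time.sleep(wait)
        self._last = time.time()
        return urllib2.urlopen(url).read()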

Shell doesn't work

I tried to use the shell command to test my XPaths, but it doesn't work.

$ crawley shell http://somewebsite.com/index.html
Traceback (most recent call last):
  File "/home/maik/.virtualenvs/crawley/bin/crawley", line 4, in <module>
    manage()
  File "/home/maik/.virtualenvs/crawley/local/lib/python2.7/site-packages/crawley/manager/__init__.py", line 25, in manage
    run_cmd(sys.argv)
  File "/home/maik/.virtualenvs/crawley/local/lib/python2.7/site-packages/crawley/manager/__init__.py", line 18, in run_cmd
    cmd.checked_execute()
  File "/home/maik/.virtualenvs/crawley/local/lib/python2.7/site-packages/crawley/manager/commands/command.py", line 50, in checked_execute
    self.execute()
  File "/home/maik/.virtualenvs/crawley/local/lib/python2.7/site-packages/crawley/manager/commands/shell.py", line 30, in execute
    response = crawler._get_data(url)
AttributeError: 'BaseCrawler' object has no attribute '_get_data'

Check for broken links in A tags

Check the <a> tags of an HTML document and try to fix them. The URLs obtained from this process must be added to the list of crawled URLs.
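
Something along these lines could work (a sketch, assuming lxml and Python 2's urllib2/urlparse; the "fixing" step here only resolves relative links against the base URL):

# Sketch: collect <a> hrefs, resolve relative ones, and flag broken links.
import urllib2
import urlparse
from lxml import etree

def extract_links(base_url, html):
    tree = etree.HTML(html)
    return [urlparse.urljoin(base_url, href.strip())
            for href in tree.xpath("//a/@href")]

def is_broken(url):
    try:
        urllib2.urlopen(url)
        return False
    except (urllib2.URLError, ValueError):
        return True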

Wrong encoding detection

I'm using PyQuery, and I get wrong encoding detection for this page:

http://www1.abracom.org.br/cms/opencms/abracom/pt/associados/resultado_busca.html?nomeArq=0148.html

The problem is that the HTML has this meta tag:

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

But the page is actually UTF-8.

I get this info from the response headers:

Connection:close
Content-Length:29187
Content-Type:text/html;charset=UTF-8
Date:Fri, 11 Jul 2014 23:21:04 GMT
Last-Modified:Fri, 11 Jul 2014 23:21:05 GMT
Server:OpenCms/7.5.4

That's how the browser (Chrome) is able to guess the right encoding and display the page correctly. I work in a place that has to deal with many different kinds of pages, and I can tell this is far from a rare case (especially on Brazilian Portuguese websites), so it would be nice to fix this in crawley.

So far I've seen two solutions, as proposed in this answer on SO: using the chardet module or UnicodeDammit (from BeautifulSoup).

I've developed these two alternatives locally and tested them with PyQuery; they seem to fix the problem.
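
For reference, the two alternatives look roughly like this (a sketch only, not the actual patch):

# Sketch: decode the raw bytes before handing the markup to PyQuery,
# either with chardet or with BeautifulSoup's UnicodeDammit.
import chardet
from bs4 import UnicodeDammit
from pyquery import PyQuery

def parse_with_chardet(raw_bytes):
    encoding = chardet.detect(raw_bytes)["encoding"] or "utf-8"
    return PyQuery(raw_bytes.decode(encoding, "replace"))

def parse_with_unicode_dammit(raw_bytes):
    return PyQuery(UnicodeDammit(raw_bytes).unicode_markup)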

I would like to hear your opinion on this issue and, if you want, I can submit one of those solutions.

BTW, good work building crawley, I'm having a very nice time using it! Hope I can contribute somehow.
