scrapy-plugins / scrapy-splash
Scrapy+Splash for JavaScript integration
License: BSD 3-Clause "New" or "Revised" License
I'm a follower of the Scrapy framework, and I was lucky to find your scrapyjs project there.
But some web apps built with AngularJS do a lot of asynchronous JavaScript loading and rendering, so crawled pages do not come out well when using "webview.connect('load-finished', self.stop_gtk)" to determine whether a page has loaded.
Is there any other GTK event to connect to, or some delay function, to deal with this issue?
Hi and thanks for the great work.
I am currently experimenting with Splash and scrapyjs and stumbled upon some odd behavior.
If I call a URL containing "!" from scrapy:8050 (the start page), it renders without problems; the url argument is properly escaped.
If I call it from scrapyjs, the "!" leads to an "escaped_fragment=" in request.body; it seems it is not being correctly escaped by scrapyjs.
P.S.: The source was pip install scrapyjs
There are examples of using cookies in the docs, but no examples of setting the method and body. I think it would be useful to add them, or perhaps even add the following class (with a better name); with it, it is possible to use the full capabilities of scrapyjs without digging into Splash scripts:
class DefaultExecuteSplashRequest(SplashRequest):
    '''
    This is a SplashRequest subclass that uses minimal default script
    for the execute endpoint with support for POST requests and cookies.
    '''
    SPLASH_SCRIPT = '''
    function last_response_headers(splash)
        local entries = splash:history()
        local last_entry = entries[#entries]
        return last_entry.response.headers
    end

    function main(splash)
        splash:init_cookies(splash.args.cookies)
        assert(splash:go{
            splash.args.url,
            headers=splash.args.headers,
            http_method=splash.args.http_method,
            body=splash.args.body,
        })
        assert(splash:wait(0.5))
        return {
            headers=last_response_headers(splash),
            cookies=splash:get_cookies(),
            html=splash:html(),
        }
    end
    '''

    def __init__(self, *args, **kwargs):
        kwargs['endpoint'] = 'execute'
        splash_args = kwargs.setdefault('args', {})
        splash_args['lua_source'] = self.SPLASH_SCRIPT
        super(DefaultExecuteSplashRequest, self).__init__(*args, **kwargs)
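To make the proposed behavior concrete, here is a sketch of what the subclass does to the keyword arguments. A dummy stand-in base class (`SplashRequestStub`, my own name) replaces the real SplashRequest so the snippet runs without Scrapy installed; in a spider you would of course subclass the real class.

```python
# The Lua script above, abbreviated for the sketch.
SPLASH_SCRIPT = "function main(splash) ... end"

class SplashRequestStub(object):
    """Stand-in base: only records the init arguments."""
    def __init__(self, *args, **kwargs):
        self.init_args = args
        self.init_kwargs = kwargs

class DefaultExecuteSplashRequest(SplashRequestStub):
    def __init__(self, *args, **kwargs):
        kwargs['endpoint'] = 'execute'             # always the execute endpoint
        splash_args = kwargs.setdefault('args', {})
        splash_args['lua_source'] = SPLASH_SCRIPT  # inject the default script
        super(DefaultExecuteSplashRequest, self).__init__(*args, **kwargs)

# Usage, e.g. a POST with the body handled by the Lua script:
req = DefaultExecuteSplashRequest(
    'http://example.com/login',
    args={'http_method': 'POST', 'body': 'user=foo&pass=bar'},
)
```

The point is that user-supplied args are merged with, not overwritten by, the injected lua_source.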
Hello,
I am crawling a website with 10K pages. When I first crawl, all responses are 200 and everything is OK, but after a few minutes 504 Gateway Time-out appears, and after retrying 3 times Scrapy gives up. I set:
'CONCURRENT_REQUESTS':10,
'HTTPCACHE_ENABLED':True,
'DOWNLOAD_DELAY':5,
'CONCURRENT_REQUESTS_PER_IP':10,
and the endpoint is render.html:
'splash' : {
'endpoint' : 'render.html',
'args' : {'wait':1},
}
I am using:
* Scrapy version: 1.0.3
* Python: 2.7
* Docker server
How can I optimize my crawler and avoid the 504 errors?
At the moment you can use a proxy to connect to Splash by setting the proxy property of request.meta, but there is no way to set the proxy Splash will use for accessing the page.
Splash doesn't have a parameter to set the proxy yet (see scrapinghub/splash#160), but if it is implemented, this middleware should allow setting it somehow.
I propose that if proxy is set before the scrapyjs middleware runs, scrapyjs uses that proxy as Splash's outgoing proxy and clears the request.meta['proxy'] variable; if it's set after, it means the proxy will be used to connect to Splash instead. Note that you could then use one proxy to connect to Splash and another to connect to the page by setting the same property in two different middlewares (confusing... and at the same time intuitive).
We would, of course, also still support request.meta['splash']['args']['proxy'] and request.meta['splash']['args']['proxy_url'] if added, and raise an exception if different options are set and differ.
As an alternative, we could support only request.meta['splash']['args']['proxy_url'], but that would make it incompatible with other middlewares that might set the proxy and are not scrapyjs-aware. This would not require changes to scrapyjs, just implementing that parameter in Splash.
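To make the proposed rule concrete, here is a minimal sketch with plain dicts (no Scrapy imports). Both the helper name `route_proxy` and the `proxy_url` argument are hypothetical; `proxy_url` is the not-yet-existing Splash parameter discussed above.

```python
def route_proxy(meta):
    """Move request.meta['proxy'] into the (hypothetical) Splash proxy arg.

    If the middleware sees 'proxy' in meta, it treats it as Splash's
    outgoing proxy, clears meta['proxy'], and raises on conflicts.
    """
    splash = meta.setdefault('splash', {})
    splash_args = splash.setdefault('args', {})
    proxy = meta.pop('proxy', None)
    if proxy is not None:
        existing = splash_args.get('proxy_url')
        if existing is not None and existing != proxy:
            raise ValueError(
                "conflicting proxy settings: %r vs %r" % (existing, proxy))
        splash_args['proxy_url'] = proxy
    return meta
```

A proxy set *after* this step would simply stay in meta['proxy'] and be used for the Scrapy-to-Splash connection, as today.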
It's not an issue, just a query.
I was just wondering why scrapyjs has Lua as its scripting language and not Python?
What advantages does Lua provide over Python?
Would it not be nice to have an execute endpoint for Python as well?
I see the following warnings:
py.warnings - WARNING - /home/tsouras/Desktop/web2py/gluon/custom_import.py:74: ScrapyDeprecationWarning: Module scrapy.log
has been deprecated, Scrapy now relies on the builtin Python library for logging. Read the updated logging entry in the documentation to learn more.
return NATIVE_IMPORTER(oname, globals, locals, fromlist, level)
2016-02-27 23:20:35,840 - py.warnings - WARNING - /home/tsouras/Desktop/web2py/gluon/custom_import.py:74: ScrapyDeprecationWarning: Module scrapy.dupefilter
is deprecated, use scrapy.dupefilters
instead
return NATIVE_IMPORTER(oname, globals, locals, fromlist, level)
2016-02-27 23:20:35,841 - py.warnings - WARNING - /home/tsouras/Desktop/web2py/gluon/custom_import.py:74: ScrapyDeprecationWarning: Module scrapy.contrib.httpcache
is deprecated, use scrapy.extensions.httpcache
instead
return NATIVE_IMPORTER(oname, globals, locals, fromlist, level)
According to http://doc.scrapy.org/en/latest/news.html, the logging and the names of some modules were changed in Scrapy version 1.0.0.
I use Scrapy 1.0.5 and Linux Mint.
Hi.
I started the Splash docker image as the tutorial said.
If I enable scrapyjs with these lines in settings.py:
SPLASH_URL = 'http://127.0.0.1:8050'
DOWNLOADER_MIDDLEWARES = {
'scrapyjs.SplashMiddleware': 725,
}
DUPEFILTER_CLASS = 'scrapyjs.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapyjs.SplashAwareFSCacheStorage'
I got
[scrapy] DEBUG: Retrying <POST http://127.0.01:8050/render.html> (failed 1 times): 500 Internal Server Error
And no connection is registered in the Splash container.
Please help me.
Regards,
Giovanni
I used Splash for this url. However, it did not work.
def parse(self, response):
    url = 'http://www.gymboree.com/shop/dept_item.jsp?PRODUCT%3C%3Eprd_id=845524446071240&FOLDER%3C%3Efolder_id=2534374306289499&ASSORTMENT%3C%3East_id=1408474395917465&bmUID=kQCLfxQ&productSizeSelected=0&fit_type='
    yield scrapy.Request(url, self.parse_link, meta={
        'splash': {
            'args': {'png': 1},
        }
    })

def parse_link(self, response):
    body = json.loads(response.body)
    print body['url']
    import base64
    png_bytes = base64.b64decode(body['png'])
    with open('result.png', 'wb') as f:
        f.write(png_bytes)
But when the url was opened directly in the browser, it worked.
And I found that it also works when using a shortened url obtained from the Google URL Shortener.
I do not know why it does not work with the original url. I would like to know a solution. Please help me.
I try to use two yields:
def parse(self, response):
    ...
    yield scrapy.Request(url, self.parse_item, meta={'splash': ...})

def parse_item(self, response):
    ...
    yield scrapy.Request(url, self.parse_list, meta={'splash': ...})
And I got
Traceback (most recent call last):
File "/Users/hurryhx/Library/Python/2.7/lib/python/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
yield next(it)
GeneratorExit
Exception RuntimeError: 'generator ignored GeneratorExit' in <generator object iter_errback at 0x10a1c6c80> ignored
Unhandled error in Deferred:
2016-01-29 01:02:18 [twisted] CRITICAL: Unhandled error in Deferred:
But this code works without Splash (though of course it cannot deal with JavaScript).
How can I fix this?
$ pip install "scrapy<1.1" scrapy-splash
Collecting scrapy<1.1
Downloading Scrapy-1.0.6-py2-none-any.whl (291kB)
100% |████████████████████████████████| 296kB 396kB/s
Collecting scrapy-splash
Downloading scrapy_splash-0.7-py2.py3-none-any.whl (45kB)
100% |████████████████████████████████| 51kB 1.7MB/s
Collecting queuelib (from scrapy<1.1)
Downloading queuelib-1.4.2-py2.py3-none-any.whl
Collecting cssselect>=0.9 (from scrapy<1.1)
Downloading cssselect-0.9.1.tar.gz
Requirement already satisfied (use --upgrade to upgrade): lxml in /Users/rolando/miniconda3/envs/tmp-splash2/lib/python2.7/site-packages (from scrapy<1.1)
Requirement already satisfied (use --upgrade to upgrade): pyOpenSSL in /Users/rolando/miniconda3/envs/tmp-splash2/lib/python2.7/site-packages (from scrapy<1.1)
Requirement already satisfied (use --upgrade to upgrade): six>=1.5.2 in /Users/rolando/miniconda3/envs/tmp-splash2/lib/python2.7/site-packages (from scrapy<1.1)
Collecting w3lib>=1.8.0 (from scrapy<1.1)
Downloading w3lib-1.14.2-py2.py3-none-any.whl
Collecting Twisted>=10.0.0 (from scrapy<1.1)
Downloading Twisted-16.2.0.tar.bz2 (2.9MB)
100% |████████████████████████████████| 2.9MB 187kB/s
Requirement already satisfied (use --upgrade to upgrade): service-identity in /Users/rolando/miniconda3/envs/tmp-splash2/lib/python2.7/site-packages (from scrapy<1.1)
Collecting zope.interface>=3.6.0 (from Twisted>=10.0.0->scrapy<1.1)
Downloading zope.interface-4.1.3.tar.gz (141kB)
100% |████████████████████████████████| 143kB 47kB/s
Requirement already satisfied (use --upgrade to upgrade): attrs in /Users/rolando/miniconda3/envs/tmp-splash2/lib/python2.7/site-packages (from service-identity->scrapy<1.1)
Requirement already satisfied (use --upgrade to upgrade): pyasn1 in /Users/rolando/miniconda3/envs/tmp-splash2/lib/python2.7/site-packages (from service-identity->scrapy<1.1)
Requirement already satisfied (use --upgrade to upgrade): pyasn1-modules in /Users/rolando/miniconda3/envs/tmp-splash2/lib/python2.7/site-packages (from service-identity->scrapy<1.1)
Requirement already satisfied (use --upgrade to upgrade): setuptools in /Users/rolando/miniconda3/envs/tmp-splash2/lib/python2.7/site-packages/setuptools-22.0.5-py2.7.egg (from zope.interface>=3.6.0->Twisted>=10.0.0->scrapy<1.1)
Building wheels for collected packages: cssselect, Twisted, zope.interface
Running setup.py bdist_wheel for cssselect ... done
Stored in directory: /Users/rolando/Library/Caches/pip/wheels/1b/41/70/480fa9516ccc4853a474faf7a9fb3638338fc99a9255456dd0
Running setup.py bdist_wheel for Twisted ... done
Stored in directory: /Users/rolando/Library/Caches/pip/wheels/fe/9d/3f/9f7b1c768889796c01929abb7cdfa2a9cdd32bae64eb7aa239
Running setup.py bdist_wheel for zope.interface ... done
Stored in directory: /Users/rolando/Library/Caches/pip/wheels/52/04/ad/12c971c57ca6ee5e6d77019c7a1b93105b1460d8c2db6e4ef1
Successfully built cssselect Twisted zope.interface
Installing collected packages: queuelib, cssselect, w3lib, zope.interface, Twisted, scrapy, scrapy-splash
Successfully installed Twisted-16.2.0 cssselect-0.9.1 queuelib-1.4.2 scrapy-1.0.6 scrapy-splash-0.7 w3lib-1.14.2 zope.interface-4.1.3
$ ipython
In [1]: import scrapy_splash
---------------------------------------------------------------------------
ImportError Traceback (most recent call last)
<ipython-input-1-64c0fb72e7d6> in <module>()
----> 1 import scrapy_splash
/Users/rolando/miniconda3/envs/tmp-splash2/lib/python2.7/site-packages/scrapy_splash/__init__.py in <module>()
2 from __future__ import absolute_import
3
----> 4 from .middleware import (
5 SplashMiddleware,
6 SplashCookiesMiddleware,
/Users/rolando/miniconda3/envs/tmp-splash2/lib/python2.7/site-packages/scrapy_splash/middleware.py in <module>()
15 from scrapy.http.headers import Headers
16 from scrapy import signals
---> 17 from scrapy.utils.python import to_native_str
18
19 from scrapy_splash.responsetypes import responsetypes
ImportError: cannot import name to_native_str
In Scrapy, for default requests, we can set the priority_adjust value via RETRY_PRIORITY_ADJUST in settings.
In scrapy-splash, priority_adjust is hardcoded to 5. It should be taken from RETRY_PRIORITY_ADJUST instead.
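A sketch of the fix, with a plain dict standing in for crawler.settings so it runs standalone; the attribute name mirrors the one used in the middleware, while `SplashMiddlewareSketch` and `from_settings` are illustrative names, not the library's API.

```python
class SplashMiddlewareSketch(object):
    DEFAULT_PRIORITY_ADJUST = 5  # the currently hardcoded value

    def __init__(self, priority_adjust):
        self.rescheduling_priority_adjust = priority_adjust

    @classmethod
    def from_settings(cls, settings):
        # Fall back to the old behaviour when the setting is absent.
        return cls(int(settings.get('RETRY_PRIORITY_ADJUST',
                                    cls.DEFAULT_PRIORITY_ADJUST)))
```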
Hi all:
After I enable DEFAULT_REQUEST_HEADERS in settings.py, when SplashRequest builds the request headers it should use
{'Content-Type': 'application/json'}
since the Splash doc says: "Splash is controlled via HTTP API. For all endpoints below parameters may be sent either as GET arguments or encoded to JSON and POSTed with Content-Type: application/json header."
But the header gets replaced by the settings.py value, and then the whole request does not work.
What do you think about renaming this project to scrapy-splash, to follow the standard naming convention of Scrapy plugins?
In the past we had another scrapy-splash repository (which could conflict with this one), but that is now gone and removed for good.
I'm not able to install jswebkit on my system. I'm just wondering if you can point me to the version of jswebkit that works with scrapyjs. I'm running Python 2.7, Cython 0.19.1. Should I be using a lower version of Cython? Thanks so much.
I am a little nervous to post this, but can you please look at this SO question:
http://stackoverflow.com/questions/34656424/scrapyjs-example-from-official-github-not-running
I have exactly the same problem. It constantly times out, even though I tried the solutions from these:
Even opening Splash in the browser and rendering a simple page like google.com fails as well. I just pulled the new Docker image, but it is not working.
PS: I am nervous because for everyone else it seems to work fine. :)
Please add handling of dont_filter for SplashRequest, similar to the default Scrapy Request.
It is called very often (for each link, before the dupefilter), so optimizing it would be nice.
The SPLASH_URL = 'http://192.168.59.103:8050' provided does not connect, I get the following error:
DEBUG: Gave up retrying <POST http://192.168.59.103:8050/render.html> (failed 3 times): TCP connection timed out: 110: Connection timed out.
I can use the following command to run Splash in my browser:
$ sudo docker run -p 5023:5023 -p 8050:8050 -p 8051:8051 scrapinghub/splash
This makes Splash available at 0.0.0.0 on ports 8050 (HTTP), 8051 (HTTPS) and 5023 (telnet). However, I then cannot use the terminal to run my Scrapy spider.
I'm running Ubuntu 14.04.
Any help you can provide to get me up and running would be much appreciated.
def start_requests(self):
    url = 'http://www.locoyposter.com/site/index.html'
    script = """
    function main(splash)
        splash:add_cookie('PHPSESSID','nlakk0ghcimidnjhldeh17c3h1')
        splash:add_cookie('800019423mh','1464136489340')
        splash:add_cookie('800019423slid_263_95','1464136489340')
        splash:add_cookie('Example_auth','d873517cdae7e82bb2357b4ab458682da6dc3546s:62:\"b3d5n89xrLpbksm70p4b3OMUz+9BgNFyDIcCqYomE5ZvPjaqnVkS8QB/5ZWwMw\";')
        splash:set_user_agent('Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36 OPR/35.0.2066.92')
        assert(splash:go(splash.args.url))
        return splash:html()
    end
    """
    yield SplashRequest(url, self.parse_next, args={'lua_source': script, 'url': url}, endpoint='execute')
Error log:
2016-05-24 19:24:53 [scrapy] INFO: Spider opened
2016-05-24 19:24:53 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-05-24 19:24:53 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-05-24 19:25:23 [scrapy] DEBUG: Retrying <GET http://www.locoyposter.com/site/index.html via http://127.0.0.1:8050/execute> (failed 1 times): 504 Gateway Time-out
2016-05-24 19:25:53 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-05-24 19:25:53 [scrapy] DEBUG: Retrying <GET http://www.locoyposter.com/site/index.html via http://127.0.0.1:8050/execute> (failed 2 times): 504 Gateway Time-out
2016-05-24 19:26:19 [scrapy_splash.middleware] WARNING: Bad request to Splash: {u'info': {u'line_number': 16, u'message': u'Lua error: [string "..."]:16: network3', u'type': u'LUA_ERROR', u'source': u'[string "..."]', u'error': u'network3'}, u'type': u'ScriptError', u'description': u'Error happened while executing Lua script', u'error': 400}
2016-05-24 19:26:19 [scrapy] DEBUG: Crawled (400) <GET http://www.locoyposter.com/site/index.html via http://127.0.0.1:8050/execute> (referer: None)
2016-05-24 19:26:19 [scrapy] DEBUG: Ignoring response <400 http://www.locoyposter.com/site/index.html>: HTTP status code is not handled or not allowed
2016-05-24 19:26:19 [scrapy] INFO: Closing spider (finished)
But if I remove the splash:add_cookie calls, there are no errors.
Maybe it could work like FormRequest?
This won't work for #11, but with a custom request class / request creation function the user won't be required to add middleware, dupefilter, etc. to settings; also, API could be nicer.
Examples:
I think these helpers could be scrapy.Request subclasses.
See also: https://github.com/scrapinghub/scrapyjs/blob/master/scrapyjs/request.py
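A hypothetical sketch of what such a FormRequest-like helper could compute. The name `splash_form_meta` and the exact args layout are my own invention, not the library's API; the point is that form data gets url-encoded into a Splash POST body so users never build request.meta['splash'] by hand.

```python
from urllib.parse import urlencode

def splash_form_meta(formdata, endpoint='execute'):
    """Build a meta dict that submits a form through Splash via POST."""
    return {
        'splash': {
            'endpoint': endpoint,
            'args': {
                'http_method': 'POST',
                'body': urlencode(formdata),
                'headers': {
                    'Content-Type': 'application/x-www-form-urlencoded'},
            },
        }
    }
```

A subclass-based helper would just do this merging inside __init__, the way SplashRequest already merges its kwargs.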
I was looking through middleware.py and see this:
def process_request(self, request, spider):
    if 'splash' not in request.meta:
        return
    if request.method not in {'GET', 'POST'}:
        logger.warn(
            "Currently only GET and POST requests are supported by "
            "SplashMiddleware; %(request)s will be handled without Splash",
            {'request': request},
            extra={'spider': spider}
        )
        return request
    ....
    new_request = request.replace(
        url=splash_url,
        method='POST',
        body=body,
        headers=headers,
        priority=request.priority + self.rescheduling_priority_adjust
    )
    self.crawler.stats.inc_value('splash/%s/request_count' % endpoint)
    return new_request
This means any request sent from Scrapy to Splash will be a POST request. This is confirmed in the Splash logs.
The problem I have is that when curl sends something to Splash using GET, the JavaScript is executed and the page is returned correctly.
If I issue the same curl command with -X POST, the JavaScript is not executed (or takes too long) and I only get back the raw HTML.
Is there a reason this is not:
new_request = request.replace(
    url=splash_url,
    method=request.method,
    body=body,
    headers=headers,
    priority=request.priority + self.rescheduling_priority_adjust
)
If someone can offer me a workaround I would be forever grateful.
Here is the CURL request with GET (which works as expected):
sudo docker run -p 5023:5023 -p 8050:8050 -p :8051 scrapinghub/splash -v 3
curl -X GET 'http://localhost:8050/render.html?url=https://sapui5.hana.ondemand.com/sdk/docs/api/symbols/sap.html&iframe=1&html=1&png=1&width=1024&height=768&script=1&console=1&timeout=10&wait=0.5'
Last part of output from CURL:
/b><span class="description"></span></div><div class="sectionItem itemName namespace static"><b class="icon" title="SAP UxAP"><a href="sap.uxap.html">uxap</a></b><span class="description">SAP UxAP</span></div><div class="sectionItem itemName namespace static"><b class="icon" title="Chart controls based on the SAP BI CVOM charting library"><a href="sap.viz.html">viz</a></b><span class="description">Chart controls based on the SAP BI CVOM charting library</span></div></div></div></div></div></body></html>
Splash Log
2016-06-04 16:55:17.101804 [network-manager] Headers received for https://sapui5.hana.ondemand.com/sdk/docs/api/images/namespace_obj.png
2016-06-04 16:55:17.109265 [network-manager] Finished downloading https://sapui5.hana.ondemand.com/sdk/docs/api/images/namespace_obj.png
2016-06-04 16:55:17.109562 [render] [140142934044512] mainFrame().initialLayoutCompleted
2016-06-04 16:55:17.124551 [render] [140142934044512] loadFinished: ok
2016-06-04 16:55:17.124953 [render] [140142934044512] loadFinished: disconnecting callback 0
2016-06-04 16:55:17.125597 [render] [140142934044512] loadFinished; waiting 500ms
2016-06-04 16:55:17.125880 [render] [140142934044512] waiting 500ms; timer 140142934153688
2016-06-04 16:55:17.642614 [render] [140142934044512] wait timeout for 140142934153688
2016-06-04 16:55:17.642988 [render] [140142934044512] _loadFinishedOK
2016-06-04 16:55:17.643053 [render] [140142934044512] stop_loading
2016-06-04 16:55:17.643170 [render] [140142934044512] HAR event: _onPrepareStart
2016-06-04 16:55:17.643264 [render] [140142934044512] getting HTML
2016-06-04 16:55:17.643447 [render] [140142934044512] HAR event: _onHtmlRendered
2016-06-04 16:55:17.643556 [pool] [140142934044512] SLOT 0 is closing <splash.qtrender.HtmlRender object at 0x7f7591ce2cc0>
2016-06-04 16:55:17.643620 [render] [140142934044512] close is requested by a script
2016-06-04 16:55:17.646391 [render] [140142934044512] cancelling 0 remaining timers
2016-06-04 16:55:17.646845 [pool] [140142934044512] SLOT 0 done with <splash.qtrender.HtmlRender object at 0x7f7591ce2cc0>
2016-06-04 16:55:17.650720 [events] {"client_ip": "172.17.0.1", "rendertime": 1.519012451171875, "fds": 22, "user-agent": "curl/7.47.0", "load": [0.28, 0.15, 0.14], "status_code": 200, "_id": 140142934044512, "args": {"height": "768", "script": "1", "console": "1", "url": "https://sapui5.hana.ondemand.com/sdk/docs/api/symbols/sap.html", "timeout": "10", "uid": 140142934044512, "wait": "0.5", "png": "1", "html": "1", "iframe": "1", "width": "1024"}, "method": "GET", "qsize": 0, "timestamp": 1465059317, "path": "/render.html", "maxrss": 74432, "active": 0}
2016-06-04 16:55:17.651474 [-] "172.17.0.1" - - [04/Jun/2016:16:55:17 +0000] "GET /render.html?url=https://sapui5.hana.ondemand.com/sdk/docs/api/symbols/sap.html&iframe=1&html=1&png=1&width=1024&height=768&script=1&console=1&timeout=10&wait=0.5 HTTP/1.1" 200 5562 "-" "curl/7.47.0"
2016-06-04 16:55:17.651792 [pool] SLOT 0 is available
Here is the CURL request using POST which does not render the JavaScript:
curl -X POST 'http://localhost:8050/render.html?url=https://sapui5.hana.ondemand.com/sdk/docs/api/symbols/sap.html&iframe=1&html=1&png=1&width=1024&height=768&script=1&console=1&timeout=10&wait=0.5'
Last part of the curl output, quite different from the GET request:
The relevant lines from the rendered Twisted error page:
136        request_content_type = request.getHeader(b'content-type').decode('latin1')
137        supported_types = ['application/javascript', 'application/json']
builtins.AttributeError: 'NoneType' object has no attribute 'decode'
And the Splash log for this POST request:
2016-06-04 16:57:56.910757 [_GenericHTTPChannelProtocol,1,172.17.0.1] Unhandled Error
Traceback (most recent call last):
File "/usr/local/lib/python3.4/dist-packages/twisted/protocols/basic.py", line 571, in dataReceived
why = self.lineReceived(line)
File "/usr/local/lib/python3.4/dist-packages/twisted/web/http.py", line 1688, in lineReceived
self.allContentReceived()
File "/usr/local/lib/python3.4/dist-packages/twisted/web/http.py", line 1767, in allContentReceived
req.requestReceived(command, path, version)
File "/usr/local/lib/python3.4/dist-packages/twisted/web/http.py", line 768, in requestReceived
self.process()
--- <exception caught here> ---
File "/usr/local/lib/python3.4/dist-packages/twisted/web/server.py", line 183, in process
self.render(resrc)
File "/usr/local/lib/python3.4/dist-packages/twisted/web/server.py", line 234, in render
body = resrc.render(self)
File "/app/splash/resources.py", line 50, in render
return Resource.render(self, request)
File "/usr/local/lib/python3.4/dist-packages/twisted/web/resource.py", line 250, in render
return m(request)
File "/app/splash/resources.py", line 136, in render_POST
request_content_type = request.getHeader(b'content-type').decode('latin1')
builtins.AttributeError: 'NoneType' object has no attribute 'decode'
2016-06-04 16:57:56.930981 [-] "172.17.0.1" - - [04/Jun/2016:16:57:56 +0000] "POST /render.html?url=https://sapui5.hana.ondemand.com/sdk/docs/api/symbols/sap.html&iframe=1&html=1&png=1&width=1024&height=768&script=1&console=1&timeout=10&wait=0.5 HTTP/1.1" 500 5344 "-" "curl/7.47.0
Thanks for any input.
David
I have been running into many problems with this middleware, especially random freezes and performance issues. I solved some of the freezing problems by not using MongoDB in my pipeline.
However, now I have encountered a big problem that keeps coming back:
GLib - Too many open files.
I know what the error means; what I don't know is why it happens. It seems to keep happening in the gtk.main() loop.
Best regards,
Pontus
Is it possible to run scrapyjs with python-gtk2? Does it need changes in the code?
2016-04-27 15:18:54.748787 [events] {"_id": 140682213021232, "client_ip": "172.17.42.1", "args": {"cookies": [], "lua_source": "\nfunction main(splash)\n splash:go(splash.args.url)\n splash:wait(1)\n return splash:html()\nend\n", "url": "http://www.google.com/all", "uid": 140682213021232, "headers": {"Accept-Language": "en", "User-Agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36", "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"}}, "status_code": 200, "fds": 41, "maxrss": 140476, "method": "POST", "path": "/execute", "qsize": 0, "user-agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36", "rendertime": 18.67064356803894, "timestamp": 1461770334, "active": 0, "load": [0.21, 0.35, 0.41]}
2016-04-27 15:18:54.749018 [-] "172.17.42.1" - - [27/Apr/2016:15:18:53 +0000] "POST /execute HTTP/1.1" 200 258608 "-" "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36"
It seems that the timeout argument is still 30, not 3600.
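If the cap is on the server side, then (assuming the scrapinghub/splash Docker image and its --max-timeout option) both sides may need raising, since Splash clamps the per-request timeout to its own limit; a sketch:

```shell
# Raise Splash's server-side cap so a large per-request 'timeout' is accepted;
# the 'timeout' arg sent by the client is clamped by --max-timeout.
docker run -p 8050:8050 scrapinghub/splash --max-timeout 3600
```

With the server cap raised, pass the desired value as args={'timeout': 3600} on the request.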
AutoThrottle extension doesn't play nicely with scrapy-splash because it thinks requests take a very long time, and adjusts request rate accordingly.
I'm getting a "Can't convert object: depth limit is reached" error, as shown below:
==> default: 2015-04-22 12:52:01.126719 [-] ('enrich_from_lua_error', ScriptError(ValueError("Can't convert object: depth limit is reached",),), LuaError(u'ValueError("Can\'t convert object: depth limit is reached",)',))
Here is my scrapy/splash code:
def start_requests(self):
    script = """
    function main(splash)
        assert(splash:go(splash.args.url))
        splash:wait(20.0)
        return splash:evaljs("document.querySelectorAll('p.MSMRSTSignature_FullName.MSMRSTArticleItemTitle')")
    end
    """
    for url in self.start_urls:
        yield Request(url, self.parse, meta={
            'splash': {
                'endpoint': 'execute',
                'args': {'lua_source': script}
            }
        })
When I try my JavaScript code in the Chrome console, I get about 77 p elements with nested elements.
How can I fix this?
Had to change:
'splash': {
'endpoint': 'render.html',
'args': {'wait': 0.5}
}
to:
'splash': {
'endpoint': 'render.html',
'args': {'wait': '0.5'}
}
in the README. I assume the same rule applies to the rest of the examples in the readme.md file.
I ran into this issue while trying to interact with a page that had a <div>
which I needed to click to load additional data. I tried a Lua script that loaded jQuery and then clicked the element like so:
assert(splash:runjs("$('#element-id').click()"))
but it kept returning the same original HTML with no modifications. It took me a while to figure this out.
For anyone running into this issue, using basic javascript to simulate the click worked for me:
local fireClick = splash:jsfunc([[
    function() {
        var c = document.createEvent("MouseEvents"),
            el = document.getElementsByClassName(
                //some selector
            )[0];
        c.initMouseEvent("click",true,true,window,0,0,0,0,0,false,false,false,false,0,null);
        el.dispatchEvent(c);
    }
]])
fireClick()
When I try to import SplashRequest, an error occurs:
from scrapyjs import SplashRequest
Traceback (most recent call last):
File "", line 1, in
ImportError: cannot import name SplashRequest
but
import scrapyjs
works.
Metadata-Version: 1.1
Name: scrapyjs
Version: 0.2
Summary: JavaScript support for Scrapy using Splash
Home-page: https://github.com/scrapy-plugins/scrapy-splash
Author: Mikhail Korobov
Author-email: [email protected]
License: BSD
Location: /usr/local/lib/python2.7/dist-packages
Hi:
Sorry, I'm not really familiar with Scrapy, but I had to use scrapyjs to get rendered contents.
I noticed that you have a scrapy.Spider example, but I want to use CrawlSpider. So I wrote this:
class JhsSpider(CrawlSpider):
    name = "jhsspy"
    allowed_domains = ["taobao.com"]
    start_urls = ["https://ju.taobao.com/"]
    rules = [
        Rule(SgmlLinkExtractor(allow=(r'https://detail.ju.taobao.com/.*')), follow=False),
        Rule(SgmlLinkExtractor(allow=(r'https://detail.tmall.com/item.htm.*')), callback="parse_link"),
    ]

    def parse_link(self, response):
        le = SgmlLinkExtractor()
        for link in le.extract_links(response):
            yield scrapy.Request(link.url, self.parse_item, meta={
                'splash': {
                    'endpoint': 'render.html',
                    'args': {
                        'wait': 0.5,
                    }
                }
            })

    def parse_item(self, response):
        ...get items with response...
But I had some problems, and I'm not sure what caused them. So I want to know: is it the right way to yield requests like I did above?
Hello, I tried to use both the download handler and the middleware to scrape the PRICE from a page like this one, without success. I followed all your instructions but cannot get the content of the price field. My XPath is correct. Everything else can be scraped (even without the JS middleware) except the price. Any help would be much appreciated. Thanks.
I think there's a problem with removing cookies: if we get them via splash:get_cookies, then a cookie that is set to an empty string (as a result of a logout) will be absent from response.data['cookies'], and as a result it will stay in the cookiejar in SplashCookiesMiddleware: lopuhin@478db5d
I see three ways to fix this:
1. If a cookie is absent from response.data['cookies'], remove it from the cookiejar too.
2. Keep the cookie in response.data['cookies'] and add it to the cookiejar with an empty string value?
3. Get the removed cookies from the har log?
Not sure which way is the best. (1) is by far the easiest and looks very logical, but this might be different from what people will expect?
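Option (1) could look roughly like this. Plain dicts stand in for the cookiejar and for response.data['cookies'], so this is only a sketch of the rule, not the middleware's real data structures, and `sync_cookiejar` is an illustrative name.

```python
def sync_cookiejar(jar, response_cookies):
    """Keep only the cookies Splash still reports; drop the rest.

    `jar` maps cookie names to values; `response_cookies` is a list of
    {'name': ..., 'value': ...} dicts, like Splash's get_cookies output.
    """
    remaining = {c['name'] for c in response_cookies}
    for name in list(jar):
        if name not in remaining:
            del jar[name]  # cookie was cleared, e.g. by a logout
    for c in response_cookies:
        jar[c['name']] = c['value']
    return jar
```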
We are running a scrapy-splash instance where HTTP Auth is in place to grant access to the Splash instance.
After surfing and reading for a while, I came to know that there is no HTTP Auth support in the scrapy-splash API.
I tried using something like http://username:[email protected]
but Scrapy failed with the error that DNS "username" is not resolved.
After that I started hacking on the scrapyjs packages (middleware.py) and came to the realization that it can be done very efficiently with minimal changes to the codebase:
# in settings.py
DOWNLOADER_MIDDLEWARES = {
    'scrapyjs.SplashMiddleware': 725,
}
SPLASH_URL = 'URL'
SPLASH_USERNAME = 'username'
SPLASH_PASS = 'password'

# in scrapyjs/middleware.py
from w3lib.http import basic_auth_header

class SplashMiddleware(object):
    .....
    default_splash_username = ""
    default_splash_password = ""
    .....

    def __init__(self, crawler, splash_base_url, slot_policy):
        ....
        self.splash_username = crawler.settings.get('SPLASH_USERNAME', self.default_splash_username)
        self.splash_password = crawler.settings.get('SPLASH_PASS', self.default_splash_password)

    def process_request(self, request, spider):
        if self.splash_username != "" and self.splash_password != "":
            auth = basic_auth_header(self.splash_username, self.splash_password)
            request.headers['Authorization'] = auth
I think this will help many who have HTTP Auth in place for their Splash instance.
I can submit a PR if we agree that this is a useful add-on.
Replacing scrapyjs
Repository was already renamed.
Currently it is not straightforward to use FormRequest with scrapy-splash; one has to fall back to the request.meta['splash'] API.
Hi,
when I run Scrapy with Splash, it shows these errors:
Is this because my boot2docker IP is different from the Docker IP?
My Docker IP is 192.168.99.102;
the boot2docker IP is 192.168.59.103.
2016-02-23 22:42:48 [scrapy] INFO: Enabled item pipelines:
2016-02-23 22:42:48 [scrapy] INFO: Spider opened
2016-02-23 22:42:48 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-02-23 22:42:48 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-02-23 22:42:48 [scrapy] DEBUG: Retrying <POST http://192.168.59.103:8050/execute> (failed 1 times): Connection was refused by other side: 61: Connection refused.
2016-02-23 22:42:48 [scrapy] DEBUG: Retrying <POST http://192.168.59.103:8050/execute> (failed 2 times): Connection was refused by other side: 61: Connection refused.
2016-02-23 22:42:48 [scrapy] DEBUG: Gave up retrying <POST http://192.168.59.103:8050/execute> (failed 3 times): Connection was refused by other side: 61: Connection refused.
2016-02-23 22:42:48 [scrapy] ERROR: Error downloading <POST http://192.168.59.103:8050/execute>: Connection was refused by other side: 61: Connection refused.
2016-02-23 22:42:49 [scrapy] DEBUG: Retrying <POST http://192.168.59.103:8050/execute> (failed 1 times): Connection was refused by other side: 61: Connection refused.
2016-02-23 22:42:49 [scrapy] DEBUG: Retrying <POST http://192.168.59.103:8050/execute> (failed 2 times): Connection was refused by other side: 61: Connection refused.
2016-02-23 22:42:49 [scrapy] DEBUG: Gave up retrying <POST http://192.168.59.103:8050/execute> (failed 3 times): Connection was refused by other side: 61: Connection refused.
2016-02-23 22:42:49 [scrapy] ERROR: Error downloading <POST http://192.168.59.103:8050/execute>: Connection was refused by other side: 61: Connection refused.
2016-02-23 22:42:49 [scrapy] INFO: Closing spider (finished)
2016-02-23 22:42:49 [scrapy] INFO: Dumping Scrapy stats:
It should be possible to run Splash in the same event loop as Scrapy, similar to how it worked in the gtk-based scrapyjs.
We could create a middleware which adds a 'splash' meta key to all requests, or to all requests matching some pattern. It could also decode the results to make the whole thing more or less transparent.
Is this a good idea, or are explicit requests enough?
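The proposal above can be sketched as a tiny downloader middleware (the class name and URL pattern are hypothetical, per-project choices):

```python
import re

class AutoSplashMiddleware(object):
    # Requests whose URL matches this pattern get a 'splash' meta key;
    # requests that already carry one are left untouched.
    pattern = re.compile(r'example\.com')

    def process_request(self, request, spider):
        if 'splash' not in request.meta and self.pattern.search(request.url):
            request.meta['splash'] = {
                'endpoint': 'render.html',
                'args': {'wait': 0.5},
            }
```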
When Splash returns an error, the request is simply retried by Scrapy without any error message being printed. This is painful: a 400 from Splash usually means a mistake on your side, so you need to see the error message immediately to know what you did wrong. With the current middleware you just get a 400 in the logs, the request is usually retried by Scrapy's retry middleware, and there is no quick and easy way to see what the actual error was.
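One way to surface the error, sketched under the assumption that Splash 4xx responses carry a JSON body describing the problem (the helper name is hypothetical):

```python
import json

def describe_splash_error(status, body):
    # For 4xx responses, decode the Splash JSON error body so it can be
    # logged before the retry middleware hides it; return None otherwise.
    if 400 <= status < 500:
        try:
            detail = json.loads(body)
        except ValueError:
            detail = body
        return 'Splash returned %d: %r' % (status, detail)
    return None
```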
crawl information:
2016-04-06 05:16:56 [scrapy] INFO: Crawled 62 pages (at 6 pages/min), scraped 42 items (at 6 items/min)
2016-04-06 05:17:56 [scrapy] INFO: Crawled 56 pages (at 0 pages/min), scraped 36 items (at 0 items/min)
2016-04-06 05:18:56 [scrapy] INFO: Crawled 56 pages (at 0 pages/min), scraped 36 items (at 0 items/min)
2016-04-06 05:19:56 [scrapy] INFO: Crawled 62 pages (at 6 pages/min), scraped 42 items (at 6 items/min)
.......................
.......................
my code is here:
class JdSpider(scrapy.Spider):
    name = "jdscroll"
    start_urls = ["http://example.com", "http://example.com/foo"]

    def start_requests(self):
        script2 = """
        function main(splash)
            local click_element = splash:jsfunc([[
                function(){
                    var i = 0;
                    var t = setInterval(function(){
                        if (i < 15) {
                            window.scrollTo(100, i * 600);
                        } else {
                            window.clearInterval(t);
                        }
                        i += 1;
                    }, 800);
                }
            ]])
            splash.resource_timeout = 10.0
            splash:set_user_agent('Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36 OPR/35.0.2066.92')
            splash:set_custom_headers({
                [':host'] = 'search.jd.com',
                [':method'] = 'GET',
                [':version'] = 'HTTP/1.1',
                ['accept'] = 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
                ['accept-language'] = 'zh-CN,zh;q=0.8',
            })
            assert(splash:go(splash.args.url))
            click_element()
            splash:wait(5.5)
            return splash:html()
        end
        """
        for i in range(1, 101):
            url2 = 'http://search.jd.com/Search?keyword=苹果&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&page=%d&click=0' % i
            yield scrapy.Request(url2, self.parse_next, meta={
                'splash': {
                    'args': {'lua_source': script2, 'url': url2},
                    'endpoint': 'execute',
                }
            })
    def parse_next(self, response):
        # item['title'] = []
        # item['price'] = []
        self.header = {
            ':host': 'item.jd.com',
            ':method': 'GET',
            ':version': 'HTTP/1.1',
            'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
            'accept-language': 'zh-CN,zh;q=0.8',
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36 OPR/35.0.2066.92',
            'cache-control': 'max-age=0',
            'if-modified-since': 'Sun, 13 Mar 2016 03:01:20 GMT',
            'if-none-match': 'TURBO2-f6ee053be24d1412ffdc31ea5736c30e',
            'referer': 'http://search.jd.com/Search?keyword=苹果&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&page=1&click=0',
            'upgrade-insecure-requests': 1,
            'x-opera-requesttype': 'main-frame',
        }
        self.cookies = {
            '__jdv': r'122270672|direct|-|none|-',
            'ipLocation': r'%u5317%u4EAC',
            'areaId': r'1',
            'ipLoc-djd': r'1-72-2799-0',
            '__jda': r'122270672.1995789348.1456028093.1457791917.1457836622.5',
            '__jdb': r'122270672.7.1995789348|5.1457836622',
            '__jdc': r'122270672',
            '__jdu': r'1995789348',
            '_jrda': r'1',
            '_jrdb': r'1457839559736',
            '3AB9D23F7A4B3C9B': r'0c366f2124ff48f3bbc2b4f580f5877d1065706156',
            'thor': r'18B675F7378C3675F88CD078BF5D43195247693D4EA4B1CDE4EF4E66DE6D2D57947216E14331B23722623087D2D4FC9D1C30AD08690008428298D3FDC5A2B6DCD796953A25D88B59985BE35E3284DFB649B6371CCA803EFF49842DCCB87EBC9CB460DEE1877F240AD0E8CD2765FE81229BACEC814CE454FE76DA3E61BC402168E46A4113FA01DE54FB2305EB6D33DA4D',
            '_tp': r'PqLmWq%2BZjSCnGLeF2%2FywBw%3D%3D',
            'logining': r'1',
            'unick': r'%E9%A3%9E%E9%A9%ACfly',
            '_pst': r'xiaowenjie6434',
            'TrackID': r'1syOovo3tNb_sA4R5vbDFGhvSzevkcnpll7sAFW_zJbB2f0cYry8GLn8bbsQZt5u-CfBbNzLpOpoXCPFB9uSFXLyXGH2G70XuQ4UXXEJ6qwwskKPlime48Ono9LwRDCCs',
            'pinId': r'JZDTIH8Hx2RJY4yW2Xf1Mg',
            'pin': r'xiaowenjie6434',
        }
        # items = []
        select = response.xpath("//ul[@class='gl-warp clearfix']/li[@class='gl-item']/div/div[@class='p-name p-name-type-2']/a/@href")
        for i in select.extract():
            # urls = i.xpath("div[@class='p-name p-name-type-2']/a/@href").extract()[0]
            item_urls = urlparse.urljoin('http:', i)
            yield scrapy.Request(item_urls, callback=self.parse_item, dont_filter=True)

    def parse_item(self, response):
        item = TaobaoItem()
        item['title'] = response.xpath("//div[@id='name']/h1/text()").extract()
        # item['price'] = response.xpath("//strong[@id='jd-price']/text()").extract()
        item['url'] = response.url
        item['shopname'] = response.xpath("//div[@id='extInfo']/div[@class='seller-infor']/a[@class='name']/text()").extract()
        # item['title'] = response.body
        return item
This is the settings.py code:
BOT_NAME = 'taobao'
SPIDER_MODULES = ['taobao.spiders']
NEWSPIDER_MODULE = 'taobao.spiders'
DOWNLOADER_MIDDLEWARES = {
    'scrapyjs.SplashMiddleware': 725,
}
ITEM_PIPELINES = {
    'taobao.pipelines.mysql.MysqlWriter': 800,
}
RANDOMIZE_DOWNLOAD_DELAY = True
MYSQL_PIPELINE_URL = 'mysql://root:2955112@localhost:3306/properties'
CONCURRENT_REQUESTS = 16
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36 OPR/35.0.2066.92"]
DUPEFILTER_CLASS = 'scrapyjs.SplashAwareDupeFilter'
SPLASH_URL = 'http://127.0.0.1:8050/'
COOKIES_ENABLED = False
COOKIES_DEBUG = False
AWS_ACCESS_KEY_ID = ""
AWS_SECRET_ACCESS_KEY = ""
In Eclipse, SplashFormRequest.from_response is marked red as "Undefined variable from import: from_response" for scrapy-splash 0.4,
but it works during script execution.
Hi, I got a question.
Given a spider that has cookies from previous requests, this request scrapy.Request(url, callback=self.parse_result)
will be sent with headers/cookies included. When using Splash to render the page, I want it to use these headers as well. Is this currently supported?
I found that there is a headers argument for the render.html endpoint, but this option is only supported for application/json POST requests. What about GET requests?
Thanks in advance,
Canh
When processing the Splash response we could simply replace it with a new HtmlResponse generated from the html in the Splash response. This way the user does not have to worry about building an HTML response in the spider callback; she can forget about JS rendering entirely: when Splash is enabled she just gets a normal target response with the JS rendered (with the proper url, html, etc.) and can use the usual response.xpath without doing anything extra. Rendering HTML is probably the most common use case for the Splash middleware, so it is worth enabling this by default.
If there are other keys in the Splash response (besides html), we could pass them in meta, perhaps? The .har content could be converted to response headers.
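The decoding step proposed above can be sketched as splitting the Splash JSON payload into the rendered html (for a new HtmlResponse) and the remaining keys (e.g. har) destined for meta; the function name is hypothetical:

```python
import json

def split_splash_payload(raw_body):
    # Separate the rendered html from the other Splash result keys.
    data = json.loads(raw_body)
    html = data.pop('html', None)
    return html, data
```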
Say we want to add some cookies we got elsewhere to a request: we set request.cookies. But the format is HAR, which is not a native Python format. I think it would be convenient to allow setting cookies as a list of http.cookiejar.Cookie objects (or their __dict__), to avoid workarounds like TeamHG-Memex/undercrawler@27d87f2. Do you think it's worth adding, and if so, what formats should scrapy-splash support for request.cookies?
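The conversion in question can be sketched as follows, with the HAR field names assumed from the HAR cookie record (a minimal sketch, not the library's actual code):

```python
def cookie_to_har(cookie):
    # Translate a cookiejar-style cookie (attributes: name, value,
    # domain, path, secure) into a HAR-like dict.
    har = {'name': cookie.name, 'value': cookie.value}
    if getattr(cookie, 'domain', None):
        har['domain'] = cookie.domain
    if getattr(cookie, 'path', None):
        har['path'] = cookie.path
    if getattr(cookie, 'secure', False):
        har['secure'] = True
    return har
```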
url:http(s)://bbs.ngacn.cc/nuke.php?__lib=login&__act=login_ui
<a id="showcaptchad" href="javascript:void(0)" onclick="showCaptchad(this)">xxxxxxxx</a>
script = '''
function main(splash)
    splash.images_enabled = false
    splash.resource_timeout = 100.0
    splash:autoload('http://code.jquery.com/jquery-2.1.3.min.js')
    assert(splash:go(splash.args.url))
    splash:wait(1)
    splash:runjs("$('#showcaptchad').click()")
    splash:wait(0.5)
    return splash:html()
end
'''
It never works,
but
document.getElementById("showcaptchad").click()
works well.
1. Does jQuery's click() not work here?
2. How can I click links like
<a tabindex="-1" href="#J_Reviews" rel="nofollow" hidefocus="true" data-index="1">xxxxxxx<em class="J_ReviewsCount" style="display: inline;">62791</em></a>
or
<a href="?spm=a220o.1000855.0.0.JZj6pP&page=2" data-spm-anchor-id="a220o.1000855.0.0">2</a>
that don't contain id='xxx' or name='xxx'?
I can't use getElementById() or getElementsByName().
What should I do?
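For elements without an id or name, document.querySelector with a CSS attribute selector works inside splash:runjs. A minimal sketch (the selector is taken from the first anchor above; adapt it per page):

```python
# Hedged sketch: click an anchor that has no id/name by selecting it
# with a CSS attribute selector via document.querySelector.
click_script = """
function main(splash)
    assert(splash:go(splash.args.url))
    splash:wait(1)
    splash:runjs([[
        document.querySelector('a[href="#J_Reviews"]').click()
    ]])
    splash:wait(0.5)
    return splash:html()
end
"""
```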
edit: apologies, I created this on the wrong project. Have closed.
I am a newbie with Scrapy and I put the scrapyjs project into practice. It raises "raise error.ReactorAlreadyInstalledError("reactor already installed")
ReactorAlreadyInstalledError: reactor already installed" when I add "gtk2reactor.install()" to __init__.py. How can I make sure Scrapy uses the gtk2reactor with Twisted?