scrapy-plugins / scrapy-splash
Scrapy+Splash for JavaScript integration
License: BSD 3-Clause "New" or "Revised" License
I'm a follower of the Scrapy framework, and I was lucky to find your scrapyjs project there.
But some web apps built with AngularJS do a lot of asynchronous JavaScript loading and rendering, so crawled pages do not come out well when using "webview.connect('load-finished', self.stop_gtk)" to determine whether a page has loaded.
Is there any other GTK event to connect to, or some delay function, to deal with this issue?
Hi and thanks for the great work.
I am currently experimenting with Splash and scrapyjs and stumbled upon some odd behavior.
If I call a URL containing "!" from scrapy:8050 (the start page), it renders without problems; the url argument is properly escaped.
If I call it from scrapyjs, the "!" leads to an "escaped_fragment=" in request.body; it seems it is not being correctly escaped by scrapyjs.
P.S.: The source was pip install scrapyjs
There are examples of using cookies in the docs, but no examples of setting the method and body. I think it would be useful to add them, or perhaps even add the following class (with a better name); with it, it is possible to use the full capabilities of scrapyjs without digging into Splash scripts:
class DefaultExecuteSplashRequest(SplashRequest):
    '''
    This is a SplashRequest subclass that uses minimal default script
    for the execute endpoint with support for POST requests and cookies.
    '''
    SPLASH_SCRIPT = '''
    function last_response_headers(splash)
        local entries = splash:history()
        local last_entry = entries[#entries]
        return last_entry.response.headers
    end

    function main(splash)
        splash:init_cookies(splash.args.cookies)
        assert(splash:go{
            splash.args.url,
            headers=splash.args.headers,
            http_method=splash.args.http_method,
            body=splash.args.body,
        })
        assert(splash:wait(0.5))
        return {
            headers=last_response_headers(splash),
            cookies=splash:get_cookies(),
            html=splash:html(),
        }
    end
    '''

    def __init__(self, *args, **kwargs):
        kwargs['endpoint'] = 'execute'
        splash_args = kwargs.setdefault('args', {})
        splash_args['lua_source'] = self.SPLASH_SCRIPT
        super(DefaultExecuteSplashRequest, self).__init__(*args, **kwargs)
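To make the proposed behavior concrete, here is a sketch of what the subclass does to the keyword arguments. A dummy stand-in base class (`SplashRequestStub`, my own name) replaces the real SplashRequest so the snippet runs without Scrapy installed; in a spider you would of course subclass the real class.

```python
# The Lua script above, abbreviated for the sketch.
SPLASH_SCRIPT = "function main(splash) ... end"

class SplashRequestStub(object):
    """Stand-in base: only records the init arguments."""
    def __init__(self, *args, **kwargs):
        self.init_args = args
        self.init_kwargs = kwargs

class DefaultExecuteSplashRequest(SplashRequestStub):
    def __init__(self, *args, **kwargs):
        kwargs['endpoint'] = 'execute'             # always the execute endpoint
        splash_args = kwargs.setdefault('args', {})
        splash_args['lua_source'] = SPLASH_SCRIPT  # inject the default script
        super(DefaultExecuteSplashRequest, self).__init__(*args, **kwargs)

# Usage, e.g. a POST with the body handled by the Lua script:
req = DefaultExecuteSplashRequest(
    'http://example.com/login',
    args={'http_method': 'POST', 'body': 'user=foo&pass=bar'},
)
```

The point is that user-supplied args are merged with, not overwritten by, the injected lua_source.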
Hello,
I am crawling a website with 10K pages. When I first crawl, all responses are 200 and everything is OK, but after a few minutes 504 Gateway Time-out appears, and after retrying 3 times Scrapy gives up. I set:
'CONCURRENT_REQUESTS':10,
'HTTPCACHE_ENABLED':True,
'DOWNLOAD_DELAY':5,
'CONCURRENT_REQUESTS_PER_IP':10,
and the endpoint is render.html:
'splash' : {
'endpoint' : 'render.html',
'args' : {'wait':1},
}
I am using:
* Scrapy version: 1.0.3
* Python: 2.7
* Docker server
How can I optimize my crawler and avoid the 504 errors?
At the moment you can use a proxy to connect to Splash by setting the proxy property of request.meta, but there is no way to set the proxy Splash will use for accessing the page.
Splash doesn't have a parameter to set the proxy yet (see scrapinghub/splash#160), but if it is implemented, this middleware should allow setting it somehow.
I propose that if proxy is set before the scrapyjs middleware runs, scrapyjs uses that proxy as Splash's outgoing proxy and clears the request.meta['proxy'] variable; if it's set after, it means the proxy will be used to connect to Splash instead. Note that you could then use one proxy to connect to Splash and another to connect to the page by setting the same property in two different middlewares (confusing... and at the same time intuitive).
We would, of course, also still support request.meta['splash']['args']['proxy'] and request.meta['splash']['args']['proxy_url'] if added, and raise an exception if different options are set and differ.
As an alternative, we could support only request.meta['splash']['args']['proxy_url'], but that would make it incompatible with other middlewares that might set the proxy and are not scrapyjs-aware. This would not require changes to scrapyjs, just implementing that parameter in Splash.
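To make the proposed rule concrete, here is a minimal sketch with plain dicts (no Scrapy imports). Both the helper name `route_proxy` and the `proxy_url` argument are hypothetical; `proxy_url` is the not-yet-existing Splash parameter discussed above.

```python
def route_proxy(meta):
    """Move request.meta['proxy'] into the (hypothetical) Splash proxy arg.

    If the middleware sees 'proxy' in meta, it treats it as Splash's
    outgoing proxy, clears meta['proxy'], and raises on conflicts.
    """
    splash = meta.setdefault('splash', {})
    splash_args = splash.setdefault('args', {})
    proxy = meta.pop('proxy', None)
    if proxy is not None:
        existing = splash_args.get('proxy_url')
        if existing is not None and existing != proxy:
            raise ValueError(
                "conflicting proxy settings: %r vs %r" % (existing, proxy))
        splash_args['proxy_url'] = proxy
    return meta
```

A proxy set *after* this step would simply stay in meta['proxy'] and be used for the Scrapy-to-Splash connection, as today.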
It's not an issue, just a query.
I was just wondering why scrapyjs has Lua as its scripting language and not Python?
What advantages does Lua provide over Python?
Would it not be nice to have an execute endpoint for Python as well?
I see the following warnings:
py.warnings - WARNING - /home/tsouras/Desktop/web2py/gluon/custom_import.py:74: ScrapyDeprecationWarning: Module scrapy.log
has been deprecated, Scrapy now relies on the builtin Python library for logging. Read the updated logging entry in the documentation to learn more.
return NATIVE_IMPORTER(oname, globals, locals, fromlist, level)
2016-02-27 23:20:35,840 - py.warnings - WARNING - /home/tsouras/Desktop/web2py/gluon/custom_import.py:74: ScrapyDeprecationWarning: Module scrapy.dupefilter
is deprecated, use scrapy.dupefilters
instead
return NATIVE_IMPORTER(oname, globals, locals, fromlist, level)
2016-02-27 23:20:35,841 - py.warnings - WARNING - /home/tsouras/Desktop/web2py/gluon/custom_import.py:74: ScrapyDeprecationWarning: Module scrapy.contrib.httpcache
is deprecated, use scrapy.extensions.httpcache
instead
return NATIVE_IMPORTER(oname, globals, locals, fromlist, level)
According to http://doc.scrapy.org/en/latest/news.html, the logging and the names of some modules were changed in Scrapy version 1.0.0.
I use Scrapy 1.0.5 and Linux Mint.
Hi.
I started the Splash docker image as the tutorial said.
If I enable scrapyjs with these lines in settings.py:
SPLASH_URL = 'http://127.0.0.1:8050'
DOWNLOADER_MIDDLEWARES = {
'scrapyjs.SplashMiddleware': 725,
}
DUPEFILTER_CLASS = 'scrapyjs.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapyjs.SplashAwareFSCacheStorage'
I got
[scrapy] DEBUG: Retrying <POST http://127.0.01:8050/render.html> (failed 1 times): 500 Internal Server Error
And no connection is registered in the Splash container.
Please help me.
Regards,
Giovanni
I used Splash for this url. However, it did not work.
def parse(self, response):
    url = 'http://www.gymboree.com/shop/dept_item.jsp?PRODUCT%3C%3Eprd_id=845524446071240&FOLDER%3C%3Efolder_id=2534374306289499&ASSORTMENT%3C%3East_id=1408474395917465&bmUID=kQCLfxQ&productSizeSelected=0&fit_type='
    yield scrapy.Request(url, self.parse_link, meta={
        'splash': {
            'args': {'png': 1},
        }
    })

def parse_link(self, response):
    body = json.loads(response.body)
    print body['url']
    import base64
    png_bytes = base64.b64decode(body['png'])
    with open('result.png', 'wb') as f:
        f.write(png_bytes)
But when the url was opened directly in the browser, it worked.
And I found that it also works when using a shortened url obtained from the Google URL Shortener.
I do not know why it does not work with the original url. I would like to know a solution. Please help me.
I try to use two yields:
def parse(self, response):
    ...
    yield scrapy.Request(url, self.parse_item, meta={'splash': ...})

def parse_item(self, response):
    ...
    yield scrapy.Request(url, self.parse_list, meta={'splash': ...})
And I got
Traceback (most recent call last):
File "/Users/hurryhx/Library/Python/2.7/lib/python/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
yield next(it)
GeneratorExit
Exception RuntimeError: 'generator ignored GeneratorExit' in <generator object iter_errback at 0x10a1c6c80> ignored
Unhandled error in Deferred:
2016-01-29 01:02:18 [twisted] CRITICAL: Unhandled error in Deferred:
But this code works without Splash (though of course it cannot deal with JavaScript).
How can I fix this?
$ pip install "scrapy<1.1" scrapy-splash
Collecting scrapy<1.1
Downloading Scrapy-1.0.6-py2-none-any.whl (291kB)
100% |████████████████████████████████| 296kB 396kB/s
Collecting scrapy-splash
Downloading scrapy_splash-0.7-py2.py3-none-any.whl (45kB)
100% |████████████████████████████████| 51kB 1.7MB/s
Collecting queuelib (from scrapy<1.1)
Downloading queuelib-1.4.2-py2.py3-none-any.whl
Collecting cssselect>=0.9 (from scrapy<1.1)
Downloading cssselect-0.9.1.tar.gz
Requirement already satisfied (use --upgrade to upgrade): lxml in /Users/rolando/miniconda3/envs/tmp-splash2/lib/python2.7/site-packages (from scrapy<1.1)
Requirement already satisfied (use --upgrade to upgrade): pyOpenSSL in /Users/rolando/miniconda3/envs/tmp-splash2/lib/python2.7/site-packages (from scrapy<1.1)
Requirement already satisfied (use --upgrade to upgrade): six>=1.5.2 in /Users/rolando/miniconda3/envs/tmp-splash2/lib/python2.7/site-packages (from scrapy<1.1)
Collecting w3lib>=1.8.0 (from scrapy<1.1)
Downloading w3lib-1.14.2-py2.py3-none-any.whl
Collecting Twisted>=10.0.0 (from scrapy<1.1)
Downloading Twisted-16.2.0.tar.bz2 (2.9MB)
100% |████████████████████████████████| 2.9MB 187kB/s
Requirement already satisfied (use --upgrade to upgrade): service-identity in /Users/rolando/miniconda3/envs/tmp-splash2/lib/python2.7/site-packages (from scrapy<1.1)
Collecting zope.interface>=3.6.0 (from Twisted>=10.0.0->scrapy<1.1)
Downloading zope.interface-4.1.3.tar.gz (141kB)
100% |████████████████████████████████| 143kB 47kB/s
Requirement already satisfied (use --upgrade to upgrade): attrs in /Users/rolando/miniconda3/envs/tmp-splash2/lib/python2.7/site-packages (from service-identity->scrapy<1.1)
Requirement already satisfied (use --upgrade to upgrade): pyasn1 in /Users/rolando/miniconda3/envs/tmp-splash2/lib/python2.7/site-packages (from service-identity->scrapy<1.1)
Requirement already satisfied (use --upgrade to upgrade): pyasn1-modules in /Users/rolando/miniconda3/envs/tmp-splash2/lib/python2.7/site-packages (from service-identity->scrapy<1.1)
Requirement already satisfied (use --upgrade to upgrade): setuptools in /Users/rolando/miniconda3/envs/tmp-splash2/lib/python2.7/site-packages/setuptools-22.0.5-py2.7.egg (from zope.interface>=3.6.0->Twisted>=10.0.0->scrapy<1.1)
Building wheels for collected packages: cssselect, Twisted, zope.interface
Running setup.py bdist_wheel for cssselect ... done
Stored in directory: /Users/rolando/Library/Caches/pip/wheels/1b/41/70/480fa9516ccc4853a474faf7a9fb3638338fc99a9255456dd0
Running setup.py bdist_wheel for Twisted ... done
Stored in directory: /Users/rolando/Library/Caches/pip/wheels/fe/9d/3f/9f7b1c768889796c01929abb7cdfa2a9cdd32bae64eb7aa239
Running setup.py bdist_wheel for zope.interface ... done
Stored in directory: /Users/rolando/Library/Caches/pip/wheels/52/04/ad/12c971c57ca6ee5e6d77019c7a1b93105b1460d8c2db6e4ef1
Successfully built cssselect Twisted zope.interface
Installing collected packages: queuelib, cssselect, w3lib, zope.interface, Twisted, scrapy, scrapy-splash
Successfully installed Twisted-16.2.0 cssselect-0.9.1 queuelib-1.4.2 scrapy-1.0.6 scrapy-splash-0.7 w3lib-1.14.2 zope.interface-4.1.3
$ ipython
In [1]: import scrapy_splash
---------------------------------------------------------------------------
ImportError Traceback (most recent call last)
<ipython-input-1-64c0fb72e7d6> in <module>()
----> 1 import scrapy_splash
/Users/rolando/miniconda3/envs/tmp-splash2/lib/python2.7/site-packages/scrapy_splash/__init__.py in <module>()
2 from __future__ import absolute_import
3
----> 4 from .middleware import (
5 SplashMiddleware,
6 SplashCookiesMiddleware,
/Users/rolando/miniconda3/envs/tmp-splash2/lib/python2.7/site-packages/scrapy_splash/middleware.py in <module>()
15 from scrapy.http.headers import Headers
16 from scrapy import signals
---> 17 from scrapy.utils.python import to_native_str
18
19 from scrapy_splash.responsetypes import responsetypes
ImportError: cannot import name to_native_str
In Scrapy, for default requests, we can set the priority_adjust value via RETRY_PRIORITY_ADJUST in settings.
In scrapy-splash, priority_adjust is hardcoded to 5. It should be taken from RETRY_PRIORITY_ADJUST instead.
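A sketch of the fix, with a plain dict standing in for crawler.settings so it runs standalone; the attribute name mirrors the one used in the middleware, while `SplashMiddlewareSketch` and `from_settings` are illustrative names, not the library's API.

```python
class SplashMiddlewareSketch(object):
    DEFAULT_PRIORITY_ADJUST = 5  # the currently hardcoded value

    def __init__(self, priority_adjust):
        self.rescheduling_priority_adjust = priority_adjust

    @classmethod
    def from_settings(cls, settings):
        # Fall back to the old behaviour when the setting is absent.
        return cls(int(settings.get('RETRY_PRIORITY_ADJUST',
                                    cls.DEFAULT_PRIORITY_ADJUST)))
```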
Hi all:
After I enable DEFAULT_REQUEST_HEADERS in settings.py, when SplashRequest builds the request headers it should use
{'Content-Type': 'application/json'}
since the Splash doc says: "Splash is controlled via HTTP API. For all endpoints below parameters may be sent either as GET arguments or encoded to JSON and POSTed with Content-Type: application/json header."
But the header gets replaced by the settings.py value, and then the whole request does not work.
What do you think about renaming this project to scrapy-splash, to follow the standard naming convention of Scrapy plugins?
In the past we had another scrapy-splash repository (which could conflict with this one), but that is now gone and removed for good.
I'm not able to install jswebkit on my system. I'm just wondering if you can point me to the version of jswebkit that works with scrapyjs. I'm running Python 2.7, Cython 0.19.1. Should I be using a lower version of Cython? Thanks so much.
I am a little nervous to post this, but can you please look at this SO question:
http://stackoverflow.com/questions/34656424/scrapyjs-example-from-official-github-not-running
I have exactly the same problem. It constantly times out, even though I tried the solutions from these:
Even opening Splash in the browser and rendering a simple page like google.com fails as well. I just pulled the new Docker image, but it is not working.
PS: I am nervous because for everyone else it seems to work fine. :)
Please add handling of dont_filter for SplashRequest, similar to the default Scrapy Request.
It is called very often (for each link, before the dupefilter), so optimizing it would be nice.
The SPLASH_URL = 'http://192.168.59.103:8050' provided does not connect, I get the following error:
DEBUG: Gave up retrying <POST http://192.168.59.103:8050/render.html> (failed 3 times): TCP connection timed out: 110: Connection timed out.
I can use the following command to run Splash in my browser:
$ sudo docker run -p 5023:5023 -p 8050:8050 -p 8051:8051 scrapinghub/splash
This makes Splash available at 0.0.0.0 on ports 8050 (HTTP), 8051 (HTTPS) and 5023 (telnet). However, I then cannot use the terminal to run my Scrapy spider.
I'm running Ubuntu 14.04.
Any help you can provide to get me up and running would be much appreciated.
def start_requests(self):
    url = 'http://www.locoyposter.com/site/index.html'
    script = """
    function main(splash)
        splash:add_cookie('PHPSESSID','nlakk0ghcimidnjhldeh17c3h1')
        splash:add_cookie('800019423mh','1464136489340')
        splash:add_cookie('800019423slid_263_95','1464136489340')
        splash:add_cookie('Example_auth','d873517cdae7e82bb2357b4ab458682da6dc3546s:62:\"b3d5n89xrLpbksm70p4b3OMUz+9BgNFyDIcCqYomE5ZvPjaqnVkS8QB/5ZWwMw\";')
        splash:set_user_agent('Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36 OPR/35.0.2066.92')
        assert(splash:go(splash.args.url))
        return splash:html()
    end
    """
    yield SplashRequest(url, self.parse_next, args={'lua_source': script, 'url': url}, endpoint='execute')
Error log:
2016-05-24 19:24:53 [scrapy] INFO: Spider opened
2016-05-24 19:24:53 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-05-24 19:24:53 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-05-24 19:25:23 [scrapy] DEBUG: Retrying <GET http://www.locoyposter.com/site/index.html via http://127.0.0.1:8050/execute> (failed 1 times): 504 Gateway Time-out
2016-05-24 19:25:53 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-05-24 19:25:53 [scrapy] DEBUG: Retrying <GET http://www.locoyposter.com/site/index.html via http://127.0.0.1:8050/execute> (failed 2 times): 504 Gateway Time-out
2016-05-24 19:26:19 [scrapy_splash.middleware] WARNING: Bad request to Splash: {u'info': {u'line_number': 16, u'message': u'Lua error: [string "..."]:16: network3', u'type': u'LUA_ERROR', u'source': u'[string "..."]', u'error': u'network3'}, u'type': u'ScriptError', u'description': u'Error happened while executing Lua script', u'error': 400}
2016-05-24 19:26:19 [scrapy] DEBUG: Crawled (400) <GET http://www.locoyposter.com/site/index.html via http://127.0.0.1:8050/execute> (referer: None)
2016-05-24 19:26:19 [scrapy] DEBUG: Ignoring response <400 http://www.locoyposter.com/site/index.html>: HTTP status code is not handled or not allowed
2016-05-24 19:26:19 [scrapy] INFO: Closing spider (finished)
But if I remove the splash:add_cookie calls, there are no errors.
Maybe it could work like FormRequest?
This won't work for #11, but with a custom request class / request creation function the user won't be required to add middleware, dupefilter, etc. to settings; also, API could be nicer.
Examples:
I think these helpers could be scrapy.Request subclasses.
See also: https://github.com/scrapinghub/scrapyjs/blob/master/scrapyjs/request.py
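A hypothetical sketch of what such a FormRequest-like helper could compute. The name `splash_form_meta` and the exact args layout are my own invention, not the library's API; the point is that form data gets url-encoded into a Splash POST body so users never build request.meta['splash'] by hand.

```python
from urllib.parse import urlencode

def splash_form_meta(formdata, endpoint='execute'):
    """Build a meta dict that submits a form through Splash via POST."""
    return {
        'splash': {
            'endpoint': endpoint,
            'args': {
                'http_method': 'POST',
                'body': urlencode(formdata),
                'headers': {
                    'Content-Type': 'application/x-www-form-urlencoded'},
            },
        }
    }
```

A subclass-based helper would just do this merging inside __init__, the way SplashRequest already merges its kwargs.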
I was looking through middleware.py and see this:
def process_request(self, request, spider):
    if 'splash' not in request.meta:
        return
    if request.method not in {'GET', 'POST'}:
        logger.warn(
            "Currently only GET and POST requests are supported by "
            "SplashMiddleware; %(request)s will be handled without Splash",
            {'request': request},
            extra={'spider': spider}
        )
        return request
    ....
    new_request = request.replace(
        url=splash_url,
        method='POST',
        body=body,
        headers=headers,
        priority=request.priority + self.rescheduling_priority_adjust
    )
    self.crawler.stats.inc_value('splash/%s/request_count' % endpoint)
    return new_request
This means any request sent from Scrapy to Splash will be a POST request. This is confirmed in the Splash logs.
The problem I have is that when curl sends something to Splash using GET, the JavaScript is executed and the page is returned correctly.
If I issue the same curl command with -X POST, the JavaScript is not executed (or takes too long) and I only get back the raw HTML.
Is there a reason this is not:
new_request = request.replace(
    url=splash_url,
    method=request.method,
    body=body,
    headers=headers,
    priority=request.priority + self.rescheduling_priority_adjust
)
If someone can offer me a workaround I would be forever grateful.
Here is the CURL request with GET (which works as expected):
sudo docker run -p 5023:5023 -p 8050:8050 -p :8051 scrapinghub/splash -v 3
curl -X GET 'http://localhost:8050/render.html?url=https://sapui5.hana.ondemand.com/sdk/docs/api/symbols/sap.html&iframe=1&html=1&png=1&width=1024&height=768&script=1&console=1&timeout=10&wait=0.5'
Last part of output from CURL:
/b><span class="description"></span></div><div class="sectionItem itemName namespace static"><b class="icon" title="SAP UxAP"><a href="sap.uxap.html">uxap</a></b><span class="description">SAP UxAP</span></div><div class="sectionItem itemName namespace static"><b class="icon" title="Chart controls based on the SAP BI CVOM charting library"><a href="sap.viz.html">viz</a></b><span class="description">Chart controls based on the SAP BI CVOM charting library</span></div></div></div></div></div></body></html>
Splash Log
2016-06-04 16:55:17.101804 [network-manager] Headers received for https://sapui5.hana.ondemand.com/sdk/docs/api/images/namespace_obj.png
2016-06-04 16:55:17.109265 [network-manager] Finished downloading https://sapui5.hana.ondemand.com/sdk/docs/api/images/namespace_obj.png
2016-06-04 16:55:17.109562 [render] [140142934044512] mainFrame().initialLayoutCompleted
2016-06-04 16:55:17.124551 [render] [140142934044512] loadFinished: ok
2016-06-04 16:55:17.124953 [render] [140142934044512] loadFinished: disconnecting callback 0
2016-06-04 16:55:17.125597 [render] [140142934044512] loadFinished; waiting 500ms
2016-06-04 16:55:17.125880 [render] [140142934044512] waiting 500ms; timer 140142934153688
2016-06-04 16:55:17.642614 [render] [140142934044512] wait timeout for 140142934153688
2016-06-04 16:55:17.642988 [render] [140142934044512] _loadFinishedOK
2016-06-04 16:55:17.643053 [render] [140142934044512] stop_loading
2016-06-04 16:55:17.643170 [render] [140142934044512] HAR event: _onPrepareStart
2016-06-04 16:55:17.643264 [render] [140142934044512] getting HTML
2016-06-04 16:55:17.643447 [render] [140142934044512] HAR event: _onHtmlRendered
2016-06-04 16:55:17.643556 [pool] [140142934044512] SLOT 0 is closing <splash.qtrender.HtmlRender object at 0x7f7591ce2cc0>
2016-06-04 16:55:17.643620 [render] [140142934044512] close is requested by a script
2016-06-04 16:55:17.646391 [render] [140142934044512] cancelling 0 remaining timers
2016-06-04 16:55:17.646845 [pool] [140142934044512] SLOT 0 done with <splash.qtrender.HtmlRender object at 0x7f7591ce2cc0>
2016-06-04 16:55:17.650720 [events] {"client_ip": "172.17.0.1", "rendertime": 1.519012451171875, "fds": 22, "user-agent": "curl/7.47.0", "load": [0.28, 0.15, 0.14], "status_code": 200, "_id": 140142934044512, "args": {"height": "768", "script": "1", "console": "1", "url": "https://sapui5.hana.ondemand.com/sdk/docs/api/symbols/sap.html", "timeout": "10", "uid": 140142934044512, "wait": "0.5", "png": "1", "html": "1", "iframe": "1", "width": "1024"}, "method": "GET", "qsize": 0, "timestamp": 1465059317, "path": "/render.html", "maxrss": 74432, "active": 0}
2016-06-04 16:55:17.651474 [-] "172.17.0.1" - - [04/Jun/2016:16:55:17 +0000] "GET /render.html?url=https://sapui5.hana.ondemand.com/sdk/docs/api/symbols/sap.html&iframe=1&html=1&png=1&width=1024&height=768&script=1&console=1&timeout=10&wait=0.5 HTTP/1.1" 200 5562 "-" "curl/7.47.0"
2016-06-04 16:55:17.651792 [pool] SLOT 0 is available
Here is the CURL request using POST which does not render the JavaScript:
curl -X POST 'http://localhost:8050/render.html?url=https://sapui5.hana.ondemand.com/sdk/docs/api/symbols/sap.html&iframe=1&html=1&png=1&width=1024&height=768&script=1&console=1&timeout=10&wait=0.5'
Last part of the curl output, quite different from the GET request:
The relevant lines from the rendered Twisted error page:
136        request_content_type = request.getHeader(b'content-type').decode('latin1')
137        supported_types = ['application/javascript', 'application/json']
builtins.AttributeError: 'NoneType' object has no attribute 'decode'
And the Splash log for this POST request:
2016-06-04 16:57:56.910757 [_GenericHTTPChannelProtocol,1,172.17.0.1] Unhandled Error
Traceback (most recent call last):
File "/usr/local/lib/python3.4/dist-packages/twisted/protocols/basic.py", line 571, in dataReceived
why = self.lineReceived(line)
File "/usr/local/lib/python3.4/dist-packages/twisted/web/http.py", line 1688, in lineReceived
self.allContentReceived()
File "/usr/local/lib/python3.4/dist-packages/twisted/web/http.py", line 1767, in allContentReceived
req.requestReceived(command, path, version)
File "/usr/local/lib/python3.4/dist-packages/twisted/web/http.py", line 768, in requestReceived
self.process()
--- <exception caught here> ---
File "/usr/local/lib/python3.4/dist-packages/twisted/web/server.py", line 183, in process
self.render(resrc)
File "/usr/local/lib/python3.4/dist-packages/twisted/web/server.py", line 234, in render
body = resrc.render(self)
File "/app/splash/resources.py", line 50, in render
return Resource.render(self, request)
File "/usr/local/lib/python3.4/dist-packages/twisted/web/resource.py", line 250, in render
return m(request)
File "/app/splash/resources.py", line 136, in render_POST
request_content_type = request.getHeader(b'content-type').decode('latin1')
builtins.AttributeError: 'NoneType' object has no attribute 'decode'
2016-06-04 16:57:56.930981 [-] "172.17.0.1" - - [04/Jun/2016:16:57:56 +0000] "POST /render.html?url=https://sapui5.hana.ondemand.com/sdk/docs/api/symbols/sap.html&iframe=1&html=1&png=1&width=1024&height=768&script=1&console=1&timeout=10&wait=0.5 HTTP/1.1" 500 5344 "-" "curl/7.47.0
Thanks for any input.
David
I have been running into many problems with this middleware, especially random freezes and performance issues. I solved some of the freezing problems by not using MongoDB in my pipeline.
However, now I have encountered a big problem that keeps coming back:
GLib - Too many open files.
I know what the error means; what I don't know is why it happens. It seems to keep happening in the gtk.main() loop.
Best regards,
Pontus
Is it possible to run scrapyjs with python-gtk2? Does it need changes in the code?
2016-04-27 15:18:54.748787 [events] {"_id": 140682213021232, "client_ip": "172.17.42.1", "args": {"cookies": [], "lua_source": "\nfunction main(splash)\n splash:go(splash.args.url)\n splash:wait(1)\n return splash:html()\nend\n", "url": "http://www.google.com/all", "uid": 140682213021232, "headers": {"Accept-Language": "en", "User-Agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36", "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"}}, "status_code": 200, "fds": 41, "maxrss": 140476, "method": "POST", "path": "/execute", "qsize": 0, "user-agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36", "rendertime": 18.67064356803894, "timestamp": 1461770334, "active": 0, "load": [0.21, 0.35, 0.41]}
2016-04-27 15:18:54.749018 [-] "172.17.42.1" - - [27/Apr/2016:15:18:53 +0000] "POST /execute HTTP/1.1" 200 258608 "-" "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36"
It seems that the timeout argument is still 30, not 3600.
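If the cap is on the server side, then (assuming the scrapinghub/splash Docker image and its --max-timeout option) both sides may need raising, since Splash clamps the per-request timeout to its own limit; a sketch:

```shell
# Raise Splash's server-side cap so a large per-request 'timeout' is accepted;
# the 'timeout' arg sent by the client is clamped by --max-timeout.
docker run -p 8050:8050 scrapinghub/splash --max-timeout 3600
```

With the server cap raised, pass the desired value as args={'timeout': 3600} on the request.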
AutoThrottle extension doesn't play nicely with scrapy-splash because it thinks requests take a very long time, and adjusts request rate accordingly.
I'm getting a "Can't convert object: depth limit is reached" error, as shown below:
==> default: 2015-04-22 12:52:01.126719 [-] ('enrich_from_lua_error', ScriptError(ValueError("Can't convert object: depth limit is reached",),), LuaError(u'ValueError("Can\'t convert object: depth limit is reached",)',))
Here is my scrapy/splash code:
def start_requests(self):
    script = """
    function main(splash)
        assert(splash:go(splash.args.url))
        splash:wait(20.0)
        return splash:evaljs("document.querySelectorAll('p.MSMRSTSignature_FullName.MSMRSTArticleItemTitle')")
    end
    """
    for url in self.start_urls:
        yield Request(url, self.parse, meta={
            'splash': {
                'endpoint': 'execute',
                'args': {'lua_source': script}
            }
        })
When I try my JavaScript code in the Chrome console, I get about 77 p elements with nested elements.
How can I fix this?
Had to change:
'splash': {
'endpoint': 'render.html',
'args': {'wait': 0.5}
}
to:
'splash': {
'endpoint': 'render.html',
'args': {'wait': '0.5'}
}
in the README. I assume the same rule applies to the rest of the examples in the readme.md file.
I ran into this issue while trying to interact with a page that had a <div>
which I needed to click to load additional data. I tried a Lua script that loaded jQuery and then clicked the element like so:
assert(splash:runjs("$('#element-id').click()"))
but it kept returning the same original HTML with no modifications. It took me a while to figure this out.
For anyone running into this issue, using basic javascript to simulate the click worked for me:
local fireClick = splash:jsfunc([[
    function() {
        var c = document.createEvent("MouseEvents"),
            el = document.getElementsByClassName(
                //some selector
            )[0];
        c.initMouseEvent("click",true,true,window,0,0,0,0,0,false,false,false,false,0,null);
        el.dispatchEvent(c);
    }
]])
fireClick()
When I try to import SplashRequest, an error occurs:
from scrapyjs import SplashRequest
Traceback (most recent call last):
File "", line 1, in
ImportError: cannot import name SplashRequest
but
import scrapyjs
works.
Metadata-Version: 1.1
Name: scrapyjs
Version: 0.2
Summary: JavaScript support for Scrapy using Splash
Home-page: https://github.com/scrapy-plugins/scrapy-splash
Author: Mikhail Korobov
Author-email: [email protected]
License: BSD
Location: /usr/local/lib/python2.7/dist-packages
Hi:
Sorry, I'm not really familiar with Scrapy, but I had to use scrapyjs to get rendered contents.
I noticed that you have a scrapy.Spider example, but I want to use CrawlSpider. So I wrote this:
class JhsSpider(CrawlSpider):
    name = "jhsspy"
    allowed_domains = ["taobao.com"]
    start_urls = ["https://ju.taobao.com/"]
    rules = [
        Rule(SgmlLinkExtractor(allow=(r'https://detail.ju.taobao.com/.*')), follow=False),
        Rule(SgmlLinkExtractor(allow=(r'https://detail.tmall.com/item.htm.*')), callback="parse_link"),
    ]

    def parse_link(self, response):
        le = SgmlLinkExtractor()
        for link in le.extract_links(response):
            yield scrapy.Request(link.url, self.parse_item, meta={
                'splash': {
                    'endpoint': 'render.html',
                    'args': {
                        'wait': 0.5,
                    }
                }
            })

    def parse_item(self, response):
        ...get items with response...
But I had some problems, and I'm not sure what caused them. So I want to know: is it the right way to yield requests like I did above?
Hello, I tried to use both the download handler and the middleware to scrape the PRICE from a page like this one, without success. I followed all your instructions but cannot get the content of the price field. My XPath is correct. Everything else can be scraped (even without the JS middleware) except the price. Any help would be much appreciated. Thanks.
I think there's a problem with removing cookies: if we get them via splash:get_cookies, then a cookie that is set to an empty string (as a result of a logout) will be absent from response.data['cookies'], and as a result it will stay in the cookiejar in SplashCookiesMiddleware: lopuhin@478db5d
I see three ways to fix this:
1. If a cookie is absent from response.data['cookies'], remove it from the cookiejar too.
2. Keep the cookie in response.data['cookies'] and add it to the cookiejar with an empty string value?
3. Get the removed cookies from the har log?
Not sure which way is the best. (1) is by far the easiest and looks very logical, but this might be different from what people will expect?
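Option (1) could look roughly like this. Plain dicts stand in for the cookiejar and for response.data['cookies'], so this is only a sketch of the rule, not the middleware's real data structures, and `sync_cookiejar` is an illustrative name.

```python
def sync_cookiejar(jar, response_cookies):
    """Keep only the cookies Splash still reports; drop the rest.

    `jar` maps cookie names to values; `response_cookies` is a list of
    {'name': ..., 'value': ...} dicts, like Splash's get_cookies output.
    """
    remaining = {c['name'] for c in response_cookies}
    for name in list(jar):
        if name not in remaining:
            del jar[name]  # cookie was cleared, e.g. by a logout
    for c in response_cookies:
        jar[c['name']] = c['value']
    return jar
```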
We are running a scrapy-splash instance where HTTP Auth is in place to grant access to the Splash instance.
After surfing and reading for a while, I came to know that there is no HTTP Auth support in the scrapy-splash API.
I tried using something like http://username:[email protected]
but Scrapy failed with the error that DNS "username" is not resolved.
After that I started hacking on the scrapyjs packages (middleware.py) and came to the realization that it can be done very efficiently with minimal changes to the codebase:
# in settings.py
DOWNLOADER_MIDDLEWARES = {
    'scrapyjs.SplashMiddleware': 725,
}
SPLASH_URL = 'URL'
SPLASH_USERNAME = 'username'
SPLASH_PASS = 'password'

# in scrapyjs/middleware.py
from w3lib.http import basic_auth_header

class SplashMiddleware(object):
    .....
    default_splash_username = ""
    default_splash_password = ""
    .....

    def __init__(self, crawler, splash_base_url, slot_policy):
        ....
        self.splash_username = crawler.settings.get('SPLASH_USERNAME', self.default_splash_username)
        self.splash_password = crawler.settings.get('SPLASH_PASS', self.default_splash_password)

    def process_request(self, request, spider):
        if self.splash_username != "" and self.splash_password != "":
            auth = basic_auth_header(self.splash_username, self.splash_password)
            request.headers['Authorization'] = auth
I think this will help many who have HTTP Auth in place for their Splash instance.
I can submit a PR if we agree that this is a useful add-on.
Replacing scrapyjs
Repository was already renamed.
Currently it is not straightforward to use FormRequest with scrapy-splash; one has to fall back to the request.meta['splash'] API.
Hi,
when I run Scrapy with Splash, it shows these errors:
Is this because my boot2docker IP is different from the Docker IP?
My Docker IP is 192.168.99.102;
the boot2docker IP is 192.168.59.103.
2016-02-23 22:42:48 [scrapy] INFO: Enabled item pipelines:
2016-02-23 22:42:48 [scrapy] INFO: Spider opened
2016-02-23 22:42:48 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-02-23 22:42:48 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-02-23 22:42:48 [scrapy] DEBUG: Retrying <POST http://192.168.59.103:8050/execute> (failed 1 times): Connection was refused by other side: 61: Connection refused.
2016-02-23 22:42:48 [scrapy] DEBUG: Retrying <POST http://192.168.59.103:8050/execute> (failed 2 times): Connection was refused by other side: 61: Connection refused.
2016-02-23 22:42:48 [scrapy] DEBUG: Gave up retrying <POST http://192.168.59.103:8050/execute> (failed 3 times): Connection was refused by other side: 61: Connection refused.
2016-02-23 22:42:48 [scrapy] ERROR: Error downloading <POST http://192.168.59.103:8050/execute>: Connection was refused by other side: 61: Connection refused.
2016-02-23 22:42:49 [scrapy] DEBUG: Retrying <POST http://192.168.59.103:8050/execute> (failed 1 times): Connection was refused by other side: 61: Connection refused.
2016-02-23 22:42:49 [scrapy] DEBUG: Retrying <POST http://192.168.59.103:8050/execute> (failed 2 times): Connection was refused by other side: 61: Connection refused.
2016-02-23 22:42:49 [scrapy] DEBUG: Gave up retrying <POST http://192.168.59.103:8050/execute> (failed 3 times): Connection was refused by other side: 61: Connection refused.
2016-02-23 22:42:49 [scrapy] ERROR: Error downloading <POST http://192.168.59.103:8050/execute>: Connection was refused by other side: 61: Connection refused.
2016-02-23 22:42:49 [scrapy] INFO: Closing spider (finished)
2016-02-23 22:42:49 [scrapy] INFO: Dumping Scrapy stats:
It should be possible to run Splash in the same event loop as Scrapy, similar to how it worked in the gtk-based scrapyjs.
We could create a middleware which adds a 'splash' meta key to all requests, or to all requests matching some pattern. It could also decode the results to make the whole thing more or less transparent.
Is this a good idea, or are explicit requests enough?
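The proposal above can be sketched as a tiny downloader middleware (the class name and URL pattern are hypothetical, per-project choices):

```python
import re

class AutoSplashMiddleware(object):
    # Requests whose URL matches this pattern get a 'splash' meta key;
    # requests that already carry one are left untouched.
    pattern = re.compile(r'example\.com')

    def process_request(self, request, spider):
        if 'splash' not in request.meta and self.pattern.search(request.url):
            request.meta['splash'] = {
                'endpoint': 'render.html',
                'args': {'wait': 0.5},
            }
```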
When Splash returns an error, the request is simply retried by Scrapy without any error message being printed. This is painful: a 400 from Splash usually means a mistake on your side, so you need to see the error message immediately to know what you did wrong. With the current middleware you just get a 400 in the logs, the request is usually retried by Scrapy's retry middleware, and there is no quick and easy way to see what the actual error was.
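One way to surface the error, sketched under the assumption that Splash 4xx responses carry a JSON body describing the problem (the helper name is hypothetical):

```python
import json

def describe_splash_error(status, body):
    # For 4xx responses, decode the Splash JSON error body so it can be
    # logged before the retry middleware hides it; return None otherwise.
    if 400 <= status < 500:
        try:
            detail = json.loads(body)
        except ValueError:
            detail = body
        return 'Splash returned %d: %r' % (status, detail)
    return None
```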
crawl information:
2016-04-06 05:16:56 [scrapy] INFO: Crawled 62 pages (at 6 pages/min), scraped 42 items (at 6 items/min)
2016-04-06 05:17:56 [scrapy] INFO: Crawled 56 pages (at 0 pages/min), scraped 36 items (at 0 items/min)
2016-04-06 05:18:56 [scrapy] INFO: Crawled 56 pages (at 0 pages/min), scraped 36 items (at 0 items/min)
2016-04-06 05:19:56 [scrapy] INFO: Crawled 62 pages (at 6 pages/min), scraped 42 items (at 6 items/min)
.......................
.......................
my code is here:
class JdSpider(scrapy.Spider):
    name = "jdscroll"
    start_urls = ["http://example.com", "http://example.com/foo"]

    def start_requests(self):
        script2 = """
        function main(splash)
            local click_element = splash:jsfunc([[
                function(){
                    var i = 0;
                    var t = setInterval(function(){
                        if (i < 15) {
                            window.scrollTo(100, i * 600);
                        } else {
                            window.clearInterval(t);
                        }
                        i += 1;
                    }, 800);
                }
            ]])
            splash.resource_timeout = 10.0
            splash:set_user_agent('Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36 OPR/35.0.2066.92')
            splash:set_custom_headers({
                [':host'] = 'search.jd.com',
                [':method'] = 'GET',
                [':version'] = 'HTTP/1.1',
                ['accept'] = 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
                ['accept-language'] = 'zh-CN,zh;q=0.8',
            })
            assert(splash:go(splash.args.url))
            click_element()
            splash:wait(5.5)
            return splash:html()
        end
        """
        for i in range(1, 101):
            url2 = 'http://search.jd.com/Search?keyword=苹果&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&page=%d&click=0' % i
            yield scrapy.Request(url2, self.parse_next, meta={
                'splash': {
                    'args': {'lua_source': script2, 'url': url2},
                    'endpoint': 'execute',
                }
            })
    def parse_next(self, response):
        # item['title'] = []
        # item['price'] = []
        self.header = {
            ':host': 'item.jd.com',
            ':method': 'GET',
            ':version': 'HTTP/1.1',
            'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
            'accept-language': 'zh-CN,zh;q=0.8',
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36 OPR/35.0.2066.92',
            'cache-control': 'max-age=0',
            'if-modified-since': 'Sun, 13 Mar 2016 03:01:20 GMT',
            'if-none-match': 'TURBO2-f6ee053be24d1412ffdc31ea5736c30e',
            'referer': 'http://search.jd.com/Search?keyword=苹果&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&page=1&click=0',
            'upgrade-insecure-requests': 1,
            'x-opera-requesttype': 'main-frame',
        }
        self.cookies = {
            '__jdv': r'122270672|direct|-|none|-',
            'ipLocation': r'%u5317%u4EAC',
            'areaId': r'1',
            'ipLoc-djd': r'1-72-2799-0',
            '__jda': r'122270672.1995789348.1456028093.1457791917.1457836622.5',
            '__jdb': r'122270672.7.1995789348|5.1457836622',
            '__jdc': r'122270672',
            '__jdu': r'1995789348',
            '_jrda': r'1',
            '_jrdb': r'1457839559736',
            '3AB9D23F7A4B3C9B': r'0c366f2124ff48f3bbc2b4f580f5877d1065706156',
            'thor': r'18B675F7378C3675F88CD078BF5D43195247693D4EA4B1CDE4EF4E66DE6D2D57947216E14331B23722623087D2D4FC9D1C30AD08690008428298D3FDC5A2B6DCD796953A25D88B59985BE35E3284DFB649B6371CCA803EFF49842DCCB87EBC9CB460DEE1877F240AD0E8CD2765FE81229BACEC814CE454FE76DA3E61BC402168E46A4113FA01DE54FB2305EB6D33DA4D',
            '_tp': r'PqLmWq%2BZjSCnGLeF2%2FywBw%3D%3D',
            'logining': r'1',
            'unick': r'%E9%A3%9E%E9%A9%ACfly',
            '_pst': r'xiaowenjie6434',
            'TrackID': r'1syOovo3tNb_sA4R5vbDFGhvSzevkcnpll7sAFW_zJbB2f0cYry8GLn8bbsQZt5u-CfBbNzLpOpoXCPFB9uSFXLyXGH2G70XuQ4UXXEJ6qwwskKPlime48Ono9LwRDCCs',
            'pinId': r'JZDTIH8Hx2RJY4yW2Xf1Mg',
            'pin': r'xiaowenjie6434',
        }
        # items = []
        select = response.xpath("//ul[@class='gl-warp clearfix']/li[@class='gl-item']/div/div[@class='p-name p-name-type-2']/a/@href")
        for i in select.extract():
            # urls = i.xpath("div[@class='p-name p-name-type-2']/a/@href").extract()[0]
            item_urls = urlparse.urljoin('http:', i)
            yield scrapy.Request(item_urls, callback=self.parse_item, dont_filter=True)

    def parse_item(self, response):
        item = TaobaoItem()
        item['title'] = response.xpath("//div[@id='name']/h1/text()").extract()
        # item['price'] = response.xpath("//strong[@id='jd-price']/text()").extract()
        item['url'] = response.url
        item['shopname'] = response.xpath("//div[@id='extInfo']/div[@class='seller-infor']/a[@class='name']/text()").extract()
        # item['title'] = response.body
        return item
This is the settings.py code:
BOT_NAME = 'taobao'
SPIDER_MODULES = ['taobao.spiders']
NEWSPIDER_MODULE = 'taobao.spiders'
DOWNLOADER_MIDDLEWARES = {
    'scrapyjs.SplashMiddleware': 725,
}
ITEM_PIPELINES = {
    'taobao.pipelines.mysql.MysqlWriter': 800,
}
RANDOMIZE_DOWNLOAD_DELAY = True
MYSQL_PIPELINE_URL = 'mysql://root:2955112@localhost:3306/properties'
CONCURRENT_REQUESTS = 16
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36 OPR/35.0.2066.92"]
DUPEFILTER_CLASS = 'scrapyjs.SplashAwareDupeFilter'
SPLASH_URL = 'http://127.0.0.1:8050/'
COOKIES_ENABLED = False
COOKIES_DEBUG = False
AWS_ACCESS_KEY_ID = ""
AWS_SECRET_ACCESS_KEY = ""
In Eclipse, SplashFormRequest.from_response is marked red as "Undefined variable from import: from_response" for scrapy-splash 0.4,
but it works during script execution.
Hi, I got a question.
Given a spider that has cookies from previous requests, this request scrapy.Request(url, callback=self.parse_result)
will be sent with headers/cookies included. When using Splash to render the page, I want it to use these headers as well. Is this currently supported?
I found that there is a headers argument for the render.html endpoint, but this option is only supported for application/json POST requests. What about GET requests?
Thanks in advance,
Canh
When processing the Splash response we could simply replace it with a new HtmlResponse generated from the html in the Splash response. This way the user does not have to worry about building an HTML response in the spider callback; she can forget about JS rendering entirely: when Splash is enabled she just gets a normal target response with the JS rendered (with the proper url, html, etc.) and can use the usual response.xpath without doing anything extra. Rendering HTML is probably the most common use case for the Splash middleware, so it is worth enabling this by default.
If there are other keys in the Splash response (besides html), we could pass them in meta, perhaps? The .har content could be converted to response headers.
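The decoding step proposed above can be sketched as splitting the Splash JSON payload into the rendered html (for a new HtmlResponse) and the remaining keys (e.g. har) destined for meta; the function name is hypothetical:

```python
import json

def split_splash_payload(raw_body):
    # Separate the rendered html from the other Splash result keys.
    data = json.loads(raw_body)
    html = data.pop('html', None)
    return html, data
```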
Say we want to add some cookies we got elsewhere to a request: we set request.cookies. But the format is HAR, which is not a native Python format. I think it would be convenient to allow setting cookies as a list of http.cookiejar.Cookie objects (or their __dict__), to avoid workarounds like TeamHG-Memex/undercrawler@27d87f2. Do you think it's worth adding, and if so, what formats should scrapy-splash support for request.cookies?
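The conversion in question can be sketched as follows, with the HAR field names assumed from the HAR cookie record (a minimal sketch, not the library's actual code):

```python
def cookie_to_har(cookie):
    # Translate a cookiejar-style cookie (attributes: name, value,
    # domain, path, secure) into a HAR-like dict.
    har = {'name': cookie.name, 'value': cookie.value}
    if getattr(cookie, 'domain', None):
        har['domain'] = cookie.domain
    if getattr(cookie, 'path', None):
        har['path'] = cookie.path
    if getattr(cookie, 'secure', False):
        har['secure'] = True
    return har
```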
url:http(s)://bbs.ngacn.cc/nuke.php?__lib=login&__act=login_ui
<a id="showcaptchad" href="javascript:void(0)" onclick="showCaptchad(this)">xxxxxxxx</a>
script = '''
function main(splash)
    splash.images_enabled = false
    splash.resource_timeout = 100.0
    splash:autoload('http://code.jquery.com/jquery-2.1.3.min.js')
    assert(splash:go(splash.args.url))
    splash:wait(1)
    splash:runjs("$('#showcaptchad').click()")
    splash:wait(0.5)
    return splash:html()
end
'''
It never works,
but
document.getElementById("showcaptchad").click()
works well.
1. Does jQuery's click() not work here?
2. How can I click links like
<a tabindex="-1" href="#J_Reviews" rel="nofollow" hidefocus="true" data-index="1">xxxxxxx<em class="J_ReviewsCount" style="display: inline;">62791</em></a>
or
<a href="?spm=a220o.1000855.0.0.JZj6pP&page=2" data-spm-anchor-id="a220o.1000855.0.0">2</a>
that don't contain id='xxx' or name='xxx'?
I can't use getElementById() or getElementsByName().
What should I do?
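For elements without an id or name, document.querySelector with a CSS attribute selector works inside splash:runjs. A minimal sketch (the selector is taken from the first anchor above; adapt it per page):

```python
# Hedged sketch: click an anchor that has no id/name by selecting it
# with a CSS attribute selector via document.querySelector.
click_script = """
function main(splash)
    assert(splash:go(splash.args.url))
    splash:wait(1)
    splash:runjs([[
        document.querySelector('a[href="#J_Reviews"]').click()
    ]])
    splash:wait(0.5)
    return splash:html()
end
"""
```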
edit: apologies, I created this on the wrong project. Have closed.
I am a newbie with Scrapy and I put the scrapyjs project into practice. It raises "raise error.ReactorAlreadyInstalledError("reactor already installed")
ReactorAlreadyInstalledError: reactor already installed" when I add "gtk2reactor.install()" to __init__.py. How can I make sure Scrapy uses the gtk2reactor with Twisted?