scrapy / scrapyd

A service daemon to run Scrapy spiders

Home Page: https://scrapyd.readthedocs.io/en/stable/

License: BSD 3-Clause "New" or "Revised" License

Python 100.00%

scrapyd's Introduction

Scrapyd

Scrapyd is a service for running Scrapy spiders.

It allows you to deploy your Scrapy projects and control their spiders using an HTTP JSON API.

The documentation (including installation and usage) can be found at: http://scrapyd.readthedocs.org/

scrapyd's People

Contributors

adriencog, aniversarioperu, brucedone, bryant1410, caiooliveiraeti, cameronnemo, dangra, dfockler, digenis, ivoz, izeye, jayzeng, jpmckinney, jurov, laerte, llonchj, lookfwd, lostsnow, mbacho, mouday, mxdev88, my8100, nyov, pablohoffman, pawelmhm, rdowinton, redapple, utek, vimagick, xaqq

scrapyd's Issues

missing scrapyd-deploy on ubuntu

I followed the official installation guide (including adding the scrapyd APT repos), only to find that there is no scrapyd-deploy program.

I have scrapyd, and it seems to work: I can run the daemon and then, in my scrapy project folder, run:

scrapy deploy -p test_crawler

and get {"status": "ok", "spiders": ["spider1"]} in the scrapyd output, but there's nothing on localhost:6800, and the spider isn't actually running (nothing happens on the database side)

I'm sure there isn't anything wrong with my spider, I can run with scrapy crawl spider1 and have everything working.

Add scrapyd-client package

We should add a scrapyd-client package so users won't need to install the full scrapyd package just to deploy their code.

scrapyd overwriting FEED_URI

I'm using FEED_URI to store my data on an FTP server. The only problem is that scrapyd overwrites FEED_URI with a local file path.

Problem with --rundir when using relative directory paths

Hi

I run scrapyd with the --rundir option. (version 1.0.1)

I have the following issue.

  • I deploy an egg to scrapyd
  • egg and project are present (I can see them in listprojects.json)
  • I then restart scrapyd
  • egg and project are not present in listprojects.json
  • BUT: egg file is present in $rundir/eggs

I suspect the issue is to do with when the directory is changed.

It seems that SpiderScheduler will load the eggs/projects when initialised
https://github.com/scrapy/scrapyd/blob/1.0.1/scrapyd/scheduler.py#L12

But I think this is done before changing the working directory to --rundir.

To investigate, I added a couple of hacky print statements to SpiderScheduler

class SpiderScheduler(object):

    implements(ISpiderScheduler)

    def __init__(self, config):
        self.config = config
        import os; print "SpiderScheduler::__init__ current dir=" + os.getcwd()
        self.update_projects()

    def schedule(self, project, spider_name, **spider_args):
        q = self.queues[project]
        q.add(spider_name, **spider_args)

    def list_projects(self):
        import os; print "SpiderScheduler::list_projects current dir=" + os.getcwd()
        return self.queues.keys()

    def update_projects(self):
        self.queues = get_spider_queues(self.config)

And grepping logs for "SpiderScheduler" (after I restart and then make a call in my browser to listprojects.json)

SpiderScheduler::__init__ current dir=/opt/skuscraper
2014-12-11 15:22:40+0000 [HTTPChannel,0,10.10.9.220] SpiderScheduler::list_projects current dir=/var/scrapyd

So /opt/skuscraper is my project directory (with the scrapyd.conf).
But I want the working directory to be separate (so it doesn't put any extra files in the app directory), that is why I use /var/scrapyd as the run dir.

We can see that when the SpiderScheduler object is init'd, the current dir is /opt/skuscraper, so it can't find any eggs.
But after the app starts up, it uses /var/scrapyd.

So any eggs deployed after the app starts up are saved to /var/scrapyd/eggs, but when scrapyd is restarted, it loads its initial list of eggs from /opt/skuscraper (where they won't exist).
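
A minimal sketch of the ordering problem, using illustrative names rather than scrapyd's actual startup code: whatever builds the application has to change into --rundir before constructing SpiderScheduler, because the scheduler resolves its relative directories in __init__.

    import os

    from scrapyd.config import Config
    from scrapyd.scheduler import SpiderScheduler


    def build_scheduler(rundir=None):
        # Change into --rundir first, so that relative paths such as
        # eggs_dir = 'eggs' resolve under the run directory ...
        if rundir:
            os.chdir(rundir)
        # ... and only then construct the scheduler, whose __init__ resolves
        # the relative eggs/dbs directories against the current working directory.
        return SpiderScheduler(Config())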

spider exits immediately after a successful deployment

My spider runs perfectly using scrapy crawl my_crawler. I tried to deploy using the following commands:

First attempt

scrapyd
scrapyd-deploy my_crawler -p scrapy_crawler

It complains about some sort of permission error:
Packing version s3
Traceback (most recent call last):
  File "/usr/bin/scrapyd-deploy", line 269, in <module>
    main()
  File "/usr/bin/scrapyd-deploy", line 95, in main
    egg, tmpdir = _build_egg()
  File "/usr/bin/scrapyd-deploy", line 236, in _build_egg
    retry_on_eintr(check_call, [sys.executable, 'setup.py', 'clean', '-a', 'bdist_egg', '-d', d], stdout=o, stderr=e)
  File "/usr/lib/pymodules/python2.7/scrapy/utils/python.py", line 281, in retry_on_eintr
    return function(*args, **kw)
  File "/usr/lib/python2.7/subprocess.py", line 540, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python', 'setup.py', 'clean', '-a', 'bdist_egg', '-d', '/tmp/scrapydeploy-T4vslK']' returned non-zero exit status 1

Second

scrapyd
sudo scrapyd-deploy my_crawler -p scrapy_crawler

and the error:
Packing version s3
Deploying to project "scrapy_crawler" in http://localhost:6800/addversion.json
Server response (200): {"status": "error", "message": "[Errno 13] Permission denied: 'eggs/scrapy_crawler/s3.egg'"}

Last

sudo scrapyd
sudo scrapyd-deploy dianping_crawler -p scrapy_crawler

I got a successful deployment, but the spider doesn't run at all. I checked the job list only to find that the spider stopped as soon as it was opened, yet I can run it without error in my project folder using scrapy crawl my_crawler.

The test bash script

With scrapyd running under sudo, I managed to run this test script, and here's the output:

scrapyd dir: /tmp/test-scrapyd.IaQftNd
scrapy dir : /tmp/test-scrapy.w8SUzSy
testscarpy.sh: line 23: bin/scrapyd: No such file or directory
New Scrapy project 'testproj' created in:
    /tmp/test-scrapy.w8SUzSy/testproj
You can start your first spider with:
    cd testproj
    scrapy genspider example example.com
Packing version 1407924053
Deploying to project "testproj" in http://localhost:6800/addversion.json
Server response (200): {"status": "ok", "project": "testproj", "version": "1407924053", "spiders": 1}
{"status": "ok", "jobid": "b659134022d011e49ad3a0d3c11f87db"}
waiting 20 seconds for spider to run and finish...

And nothing else.

Cancel does not trigger shutdown handlers (on Windows)

When using the cancel REST API method, the crawler process is terminated without calling the registered shutdown handler (spider_closed), at least on Windows. This is my code:

import os

from scrapy import signals

# `engine` is a database engine object created elsewhere (not shown).


class SpiderCtlExtension(object):

    @classmethod
    def from_crawler(cls, crawler):
        ext = SpiderCtlExtension()

        ext.project_name = crawler.settings.get('BOT_NAME')
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)

        return ext

    def spider_opened(self, spider):
        sql = """UPDATE ctl_crawler
                 SET status = 'RUNNING'
                 WHERE jobid = '{}'""".format(os.getenv("SCRAPY_JOB"))
        engine.execute(sql)

    def spider_closed(self, spider, reason):
        print "CLOSE SPIDER"
        sql = """UPDATE ctl_crawler
                 SET status = '{}'
                 WHERE jobid = '{}'""".format(reason.upper(), os.getenv("SCRAPY_JOB"))
        engine.execute(sql)

The spider_opened method gets called, and the spider_closed method gets called when the crawl actually finishes. On a cancel, however, spider_closed is not called.

Another symptom is that the spider's log ends abruptly, without a log entry for the closing event. After going through the sources, I suspect the culprit is actually the way Twisted handles signals on Windows:

http://twistedmatrix.com/trac/browser/tags/releases/twisted-8.0.0/twisted/internet/_dumbwin32proc.py#L245

    def signalProcess(self, signalID):
        if self.closed:
            raise error.ProcessExitedAlready()
        if signalID in ("INT", "TERM", "KILL"):
            win32process.TerminateProcess(self.hProcess, 1)

If I understand correctly what is happening, you have the following setup:

  • Twisted container (created in scrapyd/launcher.py)
    • Crawler Process (created by scrapy/crawler.py)
      • Twisted container
        • Crawler

The issue here is that the outer Twisted container exits immediately, as is also indirectly stated here:
scrapy/scrapy#1001 (comment)

To fix this, it is necessary to somehow trigger a graceful shutdown of the Crawler Process, without terminating the outer container right away.

Arguments are urlencoded when passed to the spider

I was naively getting an argument from the command line and it was fine, but when I moved the project to scrapyd I noticed the arguments were being URL-encoded when sent via python-requests. I needed to use urlparse.unquote to get the normalized argument back.
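
A minimal sketch of the workaround described above, assuming a hypothetical spider that receives a single url argument (spider and argument names are illustrative):

    import urlparse

    import scrapy


    class UrlSpider(scrapy.Spider):
        name = "url_spider"

        def __init__(self, url=None, *args, **kwargs):
            super(UrlSpider, self).__init__(*args, **kwargs)
            # Undo the URL-encoding applied to the argument when it was
            # POSTed to scrapyd's schedule.json endpoint.
            self.start_urls = [urlparse.unquote(url)] if url else []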

Specific log file when launching a spider with scrapyd

Hi there,

I'm currently launching my spiders with scrapyd, and I have some spiders that I can launch several times with distinct arguments (I call these generic spiders). The thing is, I want distinct log files according to those arguments, and I can't find any configuration option to do that. So my first question is: is it possible to configure scrapyd to do so?

Then I took a look at your code and found some interesting lines where we could do that without being too intrusive, right here. It might be achieved by simply passing an optional log_file argument in the messages. Do you think that would be possible?

Thanks!

unable to locate txapp.py within virtualenv

Steps to reproduce:

  • git clone git@github.com:scrapy/scrapyd.git
  • virtualenv env
  • source env/bin/activate
  • python setup.py install
  • ./build/scripts-2.7/scrapyd

throws the following error:

(env)➜  scrapyd git:(master) ✗ ./build/scripts-2.7/scrapyd
Unhandled Error
Traceback (most recent call last):
  File "/Users/jayzeng/Projects/scrapyd/env/lib/python2.7/site-packages/Twisted-14.0.0-py2.7-macosx-10.9-x86_64.egg/twisted/application/app.py", line 642, in run
    runApp(config)
  File "/Users/jayzeng/Projects/scrapyd/env/lib/python2.7/site-packages/Twisted-14.0.0-py2.7-macosx-10.9-x86_64.egg/twisted/scripts/twistd.py", line 23, in runApp
    _SomeApplicationRunner(config).run()
  File "/Users/jayzeng/Projects/scrapyd/env/lib/python2.7/site-packages/Twisted-14.0.0-py2.7-macosx-10.9-x86_64.egg/twisted/application/app.py", line 376, in run
    self.application = self.createOrGetApplication()
  File "/Users/jayzeng/Projects/scrapyd/env/lib/python2.7/site-packages/Twisted-14.0.0-py2.7-macosx-10.9-x86_64.egg/twisted/application/app.py", line 441, in createOrGetApplication
    application = getApplication(self.config, passphrase)
--- <exception caught here> ---
  File "/Users/jayzeng/Projects/scrapyd/env/lib/python2.7/site-packages/Twisted-14.0.0-py2.7-macosx-10.9-x86_64.egg/twisted/application/app.py", line 452, in getApplication
    application = service.loadApplication(filename, style, passphrase)
  File "/Users/jayzeng/Projects/scrapyd/env/lib/python2.7/site-packages/Twisted-14.0.0-py2.7-macosx-10.9-x86_64.egg/twisted/application/service.py", line 405, in loadApplication
    application = sob.loadValueFromFile(filename, 'application', passphrase)
  File "/Users/jayzeng/Projects/scrapyd/env/lib/python2.7/site-packages/Twisted-14.0.0-py2.7-macosx-10.9-x86_64.egg/twisted/persisted/sob.py", line 203, in loadValueFromFile
    fileObj = open(filename, mode)
exceptions.IOError: [Errno 20] Not a directory: '/Users/jayzeng/Projects/scrapyd/env/lib/python2.7/site-packages/scrapyd-1.0.1-py2.7.egg/scrapyd/txapp.py'

Failed to load application: [Errno 20] Not a directory: '/Users/jayzeng/Projects/scrapyd/env/lib/python2.7/site-packages/scrapyd-1.0.1-py2.7.egg/scrapyd/txapp.py'

Installing scrapyd as a global dependency works without problems.

Install on Debian?

Hello,

There is a debian directory which is supposed to help build a package for Debian, but since the new version the package requires upstart, which would break most Debian installs. So I tried instead to install v0.17; although it seems to install everything properly, it crashes during the configuration step while trying to start the daemon. That makes sense, because it's trying to call /etc/init.d/scrapyd, which is no longer a file but a link to /lib/init/upstart-job, which does not exist.

Does anyone know how to properly install the scrapyd service on Debian? (I know how to install via pip and run scrapyd manually; I need it as a daemon service, as the former releases provided.)

can't start scrapyd with latest version of scrapy

Hi,

I have installed scrapy 0.25-1 and scrapyd is not working:

Unhandled Error
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/Twisted-14.0.0-py2.7-linux-x86_64.egg/twisted/application/app.py", line 642, in run
    runApp(config)
  File "/usr/lib/python2.7/site-packages/Twisted-14.0.0-py2.7-linux-x86_64.egg/twisted/scripts/twistd.py", line 23, in runApp
    _SomeApplicationRunner(config).run()
  File "/usr/lib/python2.7/site-packages/Twisted-14.0.0-py2.7-linux-x86_64.egg/twisted/application/app.py", line 376, in run
    self.application = self.createOrGetApplication()
  File "/usr/lib/python2.7/site-packages/Twisted-14.0.0-py2.7-linux-x86_64.egg/twisted/application/app.py", line 441, in createOrGetApplication
    application = getApplication(self.config, passphrase)
--- <exception caught here> ---
  File "/usr/lib/python2.7/site-packages/Twisted-14.0.0-py2.7-linux-x86_64.egg/twisted/application/app.py", line 452, in getApplication
    application = service.loadApplication(filename, style, passphrase)
  File "/usr/lib/python2.7/site-packages/Twisted-14.0.0-py2.7-linux-x86_64.egg/twisted/application/service.py", line 405, in loadApplication
    application = sob.loadValueFromFile(filename, 'application', passphrase)
  File "/usr/lib/python2.7/site-packages/Twisted-14.0.0-py2.7-linux-x86_64.egg/twisted/persisted/sob.py", line 210, in loadValueFromFile
    exec fileObj in d, d
  File "/usr/lib/python2.7/site-packages/scrapyd-1.0.1-py2.7.egg/scrapyd/txapp.py", line 3, in <module>
    application = get_application()
  File "/usr/lib/python2.7/site-packages/scrapyd-1.0.1-py2.7.egg/scrapyd/__init__.py", line 14, in get_application
    return appfunc(config)
  File "/usr/lib/python2.7/site-packages/scrapyd-1.0.1-py2.7.egg/scrapyd/app.py", line 37, in application
    webservice = TCPServer(http_port, server.Site(Root(config, app)), interface=bind_address)
  File "/usr/lib/python2.7/site-packages/scrapyd-1.0.1-py2.7.egg/scrapyd/website.py", line 33, in __init__
    servCls = load_object(servClsName)
  File "/usr/lib/python2.7/site-packages/Scrapy-0.25.1-py2.7.egg/scrapy/utils/misc.py", line 47, in load_object
    raise ImportError("Error loading object '%s': %s" % (path, e))
exceptions.ImportError: Error loading object 'scrapyd.webservice.Schedule': No module named txweb

Failed to load application: Error loading object 'scrapyd.webservice.Schedule': No module named txweb

Thanks.

Can't start the scrapyd service

I have scrapy-0.25 & scrapyd installed from the repository http://archive.scrapy.org/ubuntu.

scrapyd doesn't start from the command line (start scrapyd).
The error log contains this traceback:

$ cat /var/log/scrapyd/scrapyd.err 
Removing stale pidfile /var/run/scrapyd.pid
Unhandled Error
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/twisted/application/app.py", line 642, in run
    runApp(config)
  File "/usr/lib/python2.7/dist-packages/twisted/scripts/twistd.py", line 23, in runApp
    _SomeApplicationRunner(config).run()
  File "/usr/lib/python2.7/dist-packages/twisted/application/app.py", line 376, in run
    self.application = self.createOrGetApplication()
  File "/usr/lib/python2.7/dist-packages/twisted/application/app.py", line 441, in createOrGetApplication
    application = getApplication(self.config, passphrase)
--- <exception caught here> ---
  File "/usr/lib/python2.7/dist-packages/twisted/application/app.py", line 452, in getApplication
    application = service.loadApplication(filename, style, passphrase)
  File "/usr/lib/python2.7/dist-packages/twisted/application/service.py", line 405, in loadApplication
    application = sob.loadValueFromFile(filename, 'application', passphrase)
  File "/usr/lib/python2.7/dist-packages/twisted/persisted/sob.py", line 210, in loadValueFromFile
    exec fileObj in d, d
  File "/usr/lib/pymodules/python2.7/scrapyd/txapp.py", line 3, in <module>
    application = get_application()
  File "/usr/lib/pymodules/python2.7/scrapyd/__init__.py", line 14, in get_application
    return appfunc(config)
  File "/usr/lib/python2.7/dist-packages/promsoft/scrapyd/app.py", line 38, in application
    webservice = TCPServer(http_port, server.Site(Root(config, app)), interface=bind_address)
  File "/usr/lib/python2.7/dist-packages/promsoft/scrapyd/website.py", line 11, in __init__
    baseRoot.__init__(self, config, app)
  File "/usr/lib/pymodules/python2.7/scrapyd/website.py", line 27, in __init__
    servCls = load_object(servClsName)
  File "/usr/lib/pymodules/python2.7/scrapy/utils/misc.py", line 44, in load_object
    mod = import_module(module)
  File "/usr/lib/python2.7/importlib/__init__.py", line 37, in import_module
    __import__(name)
  File "/usr/lib/pymodules/python2.7/scrapyd/webservice.py", line 7, in <module>
    from scrapy.utils.txweb import JsonResource
exceptions.ImportError: No module named txweb

Failed to load application: No module named txweb

I see issue #60 is closed.
Maybe the scrapyd package needs updating?

Persist job data in Scrapyd

Job data in Scrapyd is only kept in memory and is thus lost when Scrapyd is restarted. We should store job data (in SQLite or something similar) so that it persists across restarts. This is particularly important for completed jobs.
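
A rough sketch of what such a store could look like, using SQLite from the standard library (table and column names are made up, not Scrapyd's actual schema):

    import sqlite3


    class SqliteFinishedJobs(object):
        """Keep finished-job records on disk so they survive a restart."""

        def __init__(self, database):
            self.conn = sqlite3.connect(database)
            self.conn.execute(
                "CREATE TABLE IF NOT EXISTS finished_jobs "
                "(project TEXT, spider TEXT, job TEXT, start_time TEXT, end_time TEXT)")
            self.conn.commit()

        def add(self, project, spider, job, start_time, end_time):
            self.conn.execute(
                "INSERT INTO finished_jobs VALUES (?, ?, ?, ?, ?)",
                (project, spider, job, start_time, end_time))
            self.conn.commit()

        def list(self, project=None):
            sql = "SELECT * FROM finished_jobs"
            args = ()
            if project:
                sql += " WHERE project = ?"
                args = (project,)
            return self.conn.execute(sql, args).fetchall()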

What is the official word on upcoming developments for deploying/managing executions on a cluster of node servers?

I have been following this topic for a while now, but it seems we don't have a clear understanding of the envisioned development timeline for managing and monitoring the execution of one or more scrapy jobs across a network of distributed node/edge execution servers.

I am dying to know whether you could provide some insight into the status of the Scrapyd branch (or, for that matter, any efforts related to deploying Scrapy spiders/workers over a network of distributed servers) beyond the closed-source "Scrapy Cloud" service.

We have been following these efforts weekly in hopes that these additions will be addressed in the near future. More specifically, I am referring to a solid direction nailing down a native ability to package up Scrapy spiders and deploy them (on a distributed network of clustered execution node/edge servers).

The reason we are so keen on these developments is that the logic down this path does not seem to be clear within the Scrapy framework and seems to be changing. We were looking to dedicate resources towards extending the core of this system, but that requires the desired direction/approach/logic to be nailed down first.

Basically, how does the dev team envision things moving forward down this path? EVERY time this topic comes up it seems to end with "Scrapy Cloud" which can't be customized, extended or installed locally.

Any insights which could be shared would be of great interest.

Specific areas of interest within this topic include:

  • registering new servers/nodes with a job execution management system;
  • registering settings/options for individual server nodes and/or groups of nodes;
  • some type of permissions (ACL-style) for executing, monitoring and deploying new jobs;
  • a method/approach for real-time monitoring of the execution pipelines for node groups, individual servers, workers, and jobs;
  • a scheduling system in a multi-server environment;
  • and some default or optimized method (redis/rabbitmq?) to establish a real-time database/data interface which is accessible and can be updated by/between individual node groups, servers, workers, or even individual jobs.

Could you share how far away we might be from answering some of these topics?

scrapyd on Windows fails to accept API requests

I use scrapyd on Windows. I installed it successfully, but when I try to use the API I get an error that is related to the environment variables.
It happens in the addversion call:

2014-11-12 17:52:54+0200 [HTTPChannel,4,127.0.0.1] 127.0.0.1 - - [12/Nov/2014:15:52:52 +0000] "POST /addversion.json HTT
P/1.1" 200 115 "-" "Python-urllib/2.7"
2014-11-12 17:54:41+0200 [HTTPChannel,5,10.10.1.203] Unhandled Error
        Traceback (most recent call last):
          File "C:\Python27\lib\site-packages\twisted\web\http.py", line 1618, in allContentReceived
            req.requestReceived(command, path, version)
          File "C:\Python27\lib\site-packages\twisted\web\http.py", line 773, in requestReceived
            self.process()
          File "C:\Python27\lib\site-packages\twisted\web\server.py", line 132, in process
            self.render(resrc)
          File "C:\Python27\lib\site-packages\twisted\web\server.py", line 167, in render
            body = resrc.render(self)
        --- <exception caught here> ---
          File "C:\Python27\lib\site-packages\scrapyd-1.0.1-py2.7.egg\scrapyd\webservice.py", line 17, in render
            return JsonResource.render(self, txrequest)
          File "C:\Python27\lib\site-packages\scrapyd-1.0.1-py2.7.egg\scrapyd\utils.py", line 19, in render
            r = resource.Resource.render(self, txrequest)
          File "C:\Python27\lib\site-packages\twisted\web\resource.py", line 216, in render
            return m(request)
          File "C:\Python27\lib\site-packages\scrapyd-1.0.1-py2.7.egg\scrapyd\webservice.py", line 33, in render_POST
            spiders = get_spider_list(project)
          File "C:\Python27\lib\site-packages\scrapyd-1.0.1-py2.7.egg\scrapyd\utils.py", line 114, in get_spider_list
            raise RuntimeError(msg.splitlines()[-1])
        exceptions.RuntimeError: Use "scrapy" to see available commands

It also happens in the schedule call (same error).

Looking at the file https://github.com/scrapy/scrapyd/blob/master/scrapyd/utils.py, I see this is related to the following lines of code:

env = os.environ.copy()
env['SCRAPY_PROJECT'] = project
if pythonpath:
    env['PYTHONPATH'] = pythonpath
pargs = [sys.executable, '-m', runner, 'list']
proc = Popen(pargs, stdout=PIPE, stderr=PIPE, env=env)
out, err = proc.communicate()
if proc.returncode:
    msg = err or out or 'unknown error'
    raise RuntimeError(msg.splitlines()[-1])

which means something in the interaction with the Windows environment raises an error.
I really appreciate any help.
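
As a debugging aid, the subprocess call that the quoted get_spider_list code makes can be reproduced by hand (a sketch mirroring the excerpt above; the project name is a placeholder, and the scrapy project must be importable, e.g. by running this from the project directory or setting PYTHONPATH):

    import os
    import sys
    from subprocess import PIPE, Popen

    # Mimic scrapyd's get_spider_list: run "python -m scrapyd.runner list"
    # with SCRAPY_PROJECT set, then inspect stdout/stderr directly to see
    # why listing spiders fails on this machine.
    env = os.environ.copy()
    env['SCRAPY_PROJECT'] = 'myproject'  # placeholder project name
    proc = Popen([sys.executable, '-m', 'scrapyd.runner', 'list'],
                 stdout=PIPE, stderr=PIPE, env=env)
    out, err = proc.communicate()
    print(out)
    print(err)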

Add doc for development/contributing

It may be my Python inexperience, but I'm not sure how to get my modified scrapyd code to run from the project root, or how to run the tests, or anything else.
Any help would be appreciated, thanks.

JsonLinesItemExporter

Hi,

I'm trying to export scrapyd's output to a JSON file.

marco@pc:~/crawlscrape/urls_listing$ curl http://localhost:6800/listversions.json?project="urls_listing"
{"status": "ok", "versions": []}
marco@pc:~/crawlscrape/urls_listing$ scrapyd-deploy urls_listing -p urls_listing
Packing version 1422294714
Deploying to project "urls_listing" in http://localhost:6800/addversion.json
Server response (200):
{"status": "ok", "project": "urls_listing", "version": "1422294714", "spiders": 1}

marco@pc:~/crawlscrape/urls_listing$ curl http://localhost:6800/schedule.json -d project=urls_listing -d spider=urls_grasping
{"status": "ok", "jobid": "0b4518bea58411e482bcc04a00090e80"}

And this is the log file:

2015-01-26 18:52:08+0100 [scrapy] INFO: Scrapy 0.24.4 started (bot: urls_listing)
2015-01-26 18:52:08+0100 [scrapy] INFO: Optional features available: ssl, http11
2015-01-26 18:52:08+0100 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'urls_listing.spiders', 'SPIDER_MODULES': ['urls_listing.spiders'], 'FEED_URI': '/var/lib/scrapyd/items/urls_listing/urls_grasping/0b4518bea58411e482bcc04\
a00090e80.jl', 'LOG_FILE': '/var/log/scrapyd/urls_listing/urls_grasping/0b4518bea58411e482bcc04a00090e80.log', 'BOT_NAME': 'urls_listing'}
2015-01-26 18:52:08+0100 [scrapy] INFO: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2015-01-26 18:52:08+0100 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, Red\
irectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-01-26 18:52:08+0100 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-01-26 18:52:08+0100 [scrapy] INFO: Enabled item pipelines:
2015-01-26 18:52:08+0100 [urls_grasping] INFO: Spider opened
2015-01-26 18:52:08+0100 [urls_grasping] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-01-26 18:52:08+0100 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-01-26 18:52:08+0100 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2015-01-26 18:52:08+0100 [urls_grasping] DEBUG: Redirecting (301) to <GET http://www.ilsole24ore.com/> from <GET http://www.sole24ore.com/>
2015-01-26 18:52:08+0100 [urls_grasping] DEBUG: Crawled (200) <GET http://www.ilsole24ore.com/> (referer: None)
2015-01-26 18:52:08+0100 [urls_grasping] DEBUG: Scraped from <200 http://www.ilsole24ore.com/>
        {'url': [u'http://www.ilsole24ore.com/ebook/norme-e-tributi/2015/ravvedimento/index.shtml',
                 u'http://www.ilsole24ore.com/ebook/norme-e-tributi/2015/ravvedimento/index.shtml',
                 u'http://www.ilsole24ore.com/cultura.shtml',
                 u'http://www.casa24.ilsole24ore.com/',
                 u'http://www.moda24.ilsole24ore.com/',
                 u'http://food24.ilsole24ore.com/',
                 u'http://www.motori24.ilsole24ore.com/',
                 u'http://job24.ilsole24ore.com/',
                 u'http://stream24.ilsole24ore.com/',
                 u'http://www.viaggi24.ilsole24ore.com/',
                 u'http://www.salute24.ilsole24ore.com/',
                 u'http://www.shopping24.ilsole24ore.com/',
                 u'http://www.radio24.ilsole24ore.com/',
                 u'http://america24.com/',
                 u'http://meteo24.ilsole24ore.com/',
                 u'https://24orecloud.ilsole24ore.com/',
                 u'http://www.ilsole24ore.com/feed/agora/agora.shtml',
                 u'http://www.formazione.ilsole24ore.com/',
                 u'http://nova.ilsole24ore.com/',
                 ......(omitted)
                 u'http://websystem.ilsole24ore.com/',
                 u'http://www.omniture.com']}
2015-01-26 18:52:08+0100 [urls_grasping] INFO: Closing spider (finished)
2015-01-26 18:52:08+0100 [urls_grasping] INFO: Stored jsonlines feed (1 items) in: /var/lib/scrapyd/items/urls_listing/urls_grasping/0b4518bea58411e482bcc04a00090e80.jl
2015-01-26 18:52:08+0100 [urls_grasping] INFO: Dumping Scrapy stats:
        {'downloader/request_bytes': 434,
         'downloader/request_count': 2,
         'downloader/request_method_count/GET': 2,
         'downloader/response_bytes': 51709,
         'downloader/response_count': 2,
         'downloader/response_status_count/200': 1,
         'downloader/response_status_count/301': 1,
         'finish_reason': 'finished',
         'finish_time': datetime.datetime(2015, 1, 26, 17, 52, 8, 820513),
         'item_scraped_count': 1,
         'log_count/DEBUG': 5,
         'log_count/INFO': 8,
         'response_received_count': 1,
         'scheduler/dequeued': 2,
         'scheduler/dequeued/memory': 2,
         'scheduler/enqueued': 2,
         'scheduler/enqueued/memory': 2,
         'start_time': datetime.datetime(2015, 1, 26, 17, 52, 8, 612923)}
2015-01-26 18:52:08+0100 [urls_grasping] INFO: Spider closed (finished)

But there is no output.json:
marco@pc:~/crawlscrape/urls_listing$ ls -a
. .. build project.egg-info scrapy.cfg setup.py urls_listing

In ~/crawlscrape/urls_listing/urls_listing, in items.py:

class UrlsListingItem(scrapy.Item):
    # define the fields for your item here like:
    #url = scrapy.Field()
    #url  = scrapy.Field(serializer=UrlsListingJsonExporter)
    url = scrapy.Field(serializer=serialize_url)
    pass

in pipelines.py I put:

# Imports assumed for the Scrapy 0.24-era APIs used below.
from scrapy import signals
from scrapy.contrib.exporter import JsonLinesItemExporter
from scrapy.xlib.pydispatch import dispatcher


class JsonExportPipeline(object):

    def __init__(self):
        dispatcher.connect(self.spider_opened, signals.spider_opened)
        dispatcher.connect(self.spider_closed, signals.spider_closed)
        self.files = {}

    def spider_opened(self, spider):
        file = open('%s_items.json' % spider.name, 'w+b')
        self.files[spider] = file
        self.exporter = JsonLinesItemExporter(file)
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

in settings.py I put:

BOT_NAME = 'urls_listing'

SPIDER_MODULES = ['urls_listing.spiders']
NEWSPIDER_MODULE = 'urls_listing.spiders'

FEED_URI = 'file://home/marco/crawlscrape/urls_listing/output.json'
#FEED_URI = 'output.json'
FEED_FORMAT = 'jsonlines'

FEED_EXPORTERS = {
    'jsonlines': 'scrapy.contrib.exporter.JsonLinesItemExporter',
}

What am I doing wrong?
Looking forward to your kind help.
Marco

scrapyd items_dir: when empty, gives access to the root-level folder

Moved from: scrapy/scrapy#324
Originally by: @ivandir

Hi,

The "items_dir" setting when left empty and used in the default scrapyd server you get from OS repo exposes the whole server filesystem starting with /

I am storing my parsed data in a database and don't need an items feed. When I set "items_dir" to an empty value and start the scrapyd server, clicking the /items/ URL on the web page leads me to my server's root path. The scrapyd server I am using is the default one installed with Ubuntu precise32.

(dev2)vagrant@precise32:/vagrant$ scrapy version -v
Scrapy : 0.16.3
lxml : 3.2.1.0
libxml2 : 2.7.8
Twisted : 12.2.0
Python : 2.7.3 (default, Aug 1 2012, 05:16:07) - [GCC 4.6.3]
Platform: Linux-3.2.0-23-generic-pae-i686-with-Ubuntu-12.04-precise

(dev2)vagrant@precise32:/vagrant$ apt-cache show scrapyd-0.16
Package: scrapyd-0.16
Source: scrapy-0.16
Version: 0.16.5+1369956345
Architecture: all
Maintainer: Scrapinghub Team [email protected]
Installed-Size: 93
Depends: scrapy-0.16 (>= 0.16.5+1369956345), python-setuptools
Conflicts: scrapyd, scrapyd-0.11
Provides: scrapyd
Homepage: http://scrapy.org/
Priority: optional
Section: python
Filename: pool/main/s/scrapy-0.16/scrapyd-0.16_0.16.5+1369956345_all.deb
Size: 4016
SHA256: 9a361297122a7149a3e91d8262303d8a1d27878c5e189b8cb9e4b15c2703a20a
SHA1: 6e8f00c084d6c94957c3b546c710c2cd91719498
MD5sum: fd51653e8d4524d9d89a07931e5748e2
Description: Scrapy Service
The Scrapy service allows you to deploy your Scrapy projects by building
Python eggs of them and uploading them to the Scrapy service using a JSON API
that you can also use for scheduling spider runs. It supports multiple
projects also.

The only modification I have made to the actual startup script is the location of twisted, so that it uses the one in my virtual environment.

exec ~/.myvirtualenv/bin/twistd -ny /usr/share/scrapyd/scrapyd.tac ..........

Here is the config file:

[scrapyd]
http_port = 6800
debug = off
max_proc = 0
max_proc_per_cpu = 4
eggs_dir = /var/lib/scrapyd/eggs
dbs_dir = /var/lib/scrapyd/dbs
items_dir =
logs_dir = /var/log/scrapyd
logs_to_keep = 5
runner = scrapyd.runner
application = scrapyd.app.application

[services]
schedule.json = scrapyd.webservice.Schedule
cancel.json = scrapyd.webservice.Cancel
addversion.json = scrapyd.webservice.AddVersion
listprojects.json = scrapyd.webservice.ListProjects
listversions.json = scrapyd.webservice.ListVersions
listspiders.json = scrapyd.webservice.ListSpiders
delproject.json = scrapyd.webservice.DeleteProject
delversion.json = scrapyd.webservice.DeleteVersion
listjobs.json = scrapyd.webservice.ListJobs

This is what my virtual environment contains (removing unnecessary modules not related to this discussion):
(dev2)vagrant@precise32:/vagrant$ pip freeze
Scrapy==0.16.3
Twisted==12.2.0

Thanks for your help.

scrapyd poll error

2013-02-26 16:02:56+0100 [-] Unhandled Error
        Traceback (most recent call last):
          File "/home/innuendo/.virtualenvs/nazya/local/lib/python2.7/site-packages/twisted/internet/base.py", line 805, in runUntilCurrent
            call.func(*call.args, **call.kw)
          File "/home/innuendo/.virtualenvs/nazya/local/lib/python2.7/site-packages/twisted/internet/task.py", line 218, in __call__
            d = defer.maybeDeferred(self.f, *self.a, **self.kw)
          File "/home/innuendo/.virtualenvs/nazya/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 138, in maybeDeferred
            result = f(*args, **kw)
          File "/home/innuendo/.virtualenvs/nazya/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 1214, in unwindGenerator
            return _inlineCallbacks(None, gen, Deferred())
        --- <exception caught here> ---
          File "/home/innuendo/.virtualenvs/nazya/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 1071, in _inlineCallbacks
            result = g.send(result)
          File "/home/innuendo/.virtualenvs/nazya/local/lib/python2.7/site-packages/scrapyd/poller.py", line 24, in poll
            returnValue(self.dq.put(self._message(msg, p)))
          File "/home/innuendo/.virtualenvs/nazya/local/lib/python2.7/site-packages/scrapyd/poller.py", line 33, in _message
            d = queue_msg.copy()
        exceptions.AttributeError: 'NoneType' object has no attribute 'copy'

can't pass start_urls argument via scrapyd API

I tried something like:

payload = {"project": settings['BOT_NAME'],
             "spider": crawler_name,
             "start_urls": ["http://www.foo.com"]}
response = requests.post("http://192.168.1.41:6800/schedule.json",
                           data=payload)

And when I check the logs, I get this error:

File "/usr/lib/pymodules/python2.7/scrapy/spider.py", line 53, in make_requests_from_url
    return Request(url, dont_filter=True)
  File "/usr/lib/pymodules/python2.7/scrapy/http/request/__init__.py", line 26, in __init__
    self._set_url(url)
  File "/usr/lib/pymodules/python2.7/scrapy/http/request/__init__.py", line 61, in _set_url
    raise ValueError('Missing scheme in request url: %s' % self._url)
exceptions.ValueError: Missing scheme in request url: h

Looks like only the first letter of "http://www.foo.com" is used as request.url, and I really have no idea why.

Update

Maybe start_urls should be a string instead of a list containing 1 element, so I also tried:

"start_urls": "http://www.foo.com"

and

"start_urls": [["http://www.foo.com"]]

only to get the same error.
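
A sketch of a common workaround: schedule.json arguments reach the spider as plain strings, so the spider can accept a delimited string and split it itself (spider name and delimiter are illustrative):

    import scrapy


    class StartUrlsSpider(scrapy.Spider):
        name = "start_urls_spider"

        def __init__(self, start_urls=None, *args, **kwargs):
            super(StartUrlsSpider, self).__init__(*args, **kwargs)
            # scrapyd delivers the argument as one string, so split it here
            # instead of expecting a real list.
            if start_urls:
                self.start_urls = start_urls.split(',')

It could then be scheduled with something like curl http://localhost:6800/schedule.json -d project=myproject -d spider=start_urls_spider -d start_urls=http://www.foo.com,http://www.bar.com.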

One scrapy job is running forever, can't be stopped

One scrapy job is running forever, can't be stopped.

  1. I have set the downloader timeout to 30.
  2. In my code, I check the running jobs periodically and, if one runs longer than some threshold, use http://localhost:6800/cancel.json to cancel it.

However, the job is still running. How can I debug this situation? (A bug in my code, scrapy, or scrapyd?) Any ideas? Thanks.

Finally, I ran sudo kill -SIGKILL 13902 to kill the process; sudo kill -SIGTERM 13902 doesn't work.

$ curl http://localhost:6800/cancel.json -d project=myproject -d job=5139c1ac8f0a11e4b0ed247703282fcc
{"status": "ok", "prevstate": "running"} <--- It returns status ok, however, the job can't be stopped.

$ ps aux|grep 13902
scrapy 13902 0.0 0.8 172860 64636 ? S 11:25 0:01 /usr/bin/python -m scrapyd.runner crawl mysider -a _job=5139c1ac8f0a11e4b0ed247703282fcc

scrapy.version
u'0.24.4'
scrapyd.version
'1.0.1'
log:
2014-12-29 11:25:26+0800 [scrapy] INFO: Scrapy 0.24.4 started (bot: myproject)
2014-12-29 11:25:26+0800 [scrapy] INFO: Optional features available: ssl, http11
2014-12-29 11:25:26+0800 [scrapy] INFO: Overridden settings:
{'COOKIES_DEBUG': True, 'NEWSPIDER_MODULE': 'myproject.spiders',
'FEED_URI':
'/var/lib/scrapyd/items/myproject/myspider/5139c1ac8f0a11e4b0ed247703282fcc.jl',
'SPIDER_MODULES': ['myproject.spiders'], 'RETRY_HTTP_CODES': [500,
502, 503, 504, 400, 408, 403, 404], 'BOT_NAME': 'myproject',
'DOWNLOAD_TIMEOUT': 30, 'COOKIES_ENABLED': False, 'LOG_FILE': <--------------------timeout
'/var/log/scrapyd/myproject/myspider/5139c1ac8f0a11e4b0ed247703282fcc.log',
'DOWNLOAD_DELAY': 2}
2014-12-29 11:25:26+0800 [scrapy] INFO: Enabled extensions:
FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService,
CoreStats, SpiderState
2014-12-29 11:25:26+0800 [scrapy] INFO: Enabled downloader
middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware,
RandomUserAgentMiddleware, RandomProxyMiddleware, RetryMiddleware,
DefaultHeadersMiddleware, MetaRefreshMiddleware,
HttpCompressionMiddleware, RedirectMiddleware,
ChunkedTransferMiddleware, DownloaderStats
2014-12-29 11:25:26+0800 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-12-29 11:25:26+0800 [scrapy] INFO: Enabled item pipelines: ImagesPipeline, WordpressPipeline, MySQLStorePipeline
2014-12-29 11:25:26+0800 [myspider] INFO: Spider opened
2014-12-29 11:25:26+0800 [myspider] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-12-29 11:25:26+0800 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6025
2014-12-29 11:25:26+0800 [scrapy] DEBUG: Web service listening on 127.0.0.1:6082
2014-12-29 11:45:52+0800 [scrapy] INFO: Received SIGTERM, shutting down gracefully. Send again to force
2014-12-29 11:46:22+0800 [scrapy] INFO: Received SIGTERM twice, forcing unclean shutdown <------------------------

$ sudo kill -SIGTERM 13902
$ ps aux|grep 13902
scrapy 13902 0.0 0.8 172860 64636 ? S 11:25 0:01 /usr/bin/python -m scrapyd.runner crawl myspider -a _job=5139c1ac8f0a11e4b0ed247703282fcc
touch 18543 0.0 0.0 18248 2204 pts/5 S+ 14:24 0:00 grep 13902
$ sudo kill -SIGKILL 13902
$ ps aux|grep 13902
touch 18551 0.0 0.0 18244 2204 pts/5 S+ 14:25 0:00 grep 13902

Authorization?

Are there any plans concerning authorization of API requests?

EggStorage doesn't pick the latest version of the project with GIT

When I deploy a project with version = GIT and later schedule the job, scrapyd sometimes picks some older version of the project.

If I am correct, this is due to the fact that FilesystemEggStorage sorts versions by the eggs' filenames and then picks the last one. This is fine for Mercurial and timestamp versioning, since both are sequential. But Git hashes are not ordered, so sometimes an older project version takes precedence.

Since scrapyd-deploy picks the Git version with git describe --always, the revisions are apparently supposed to be sequentially tagged (like v1.0 and so on), so that the eggs end up ordered correctly. I think this fact is worth mentioning in the documentation, don't you think?
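
A quick illustration of the ordering problem (hashes and tags are made up): sorting bare git describe --always outputs is lexicographic and unrelated to deploy order, while sequentially tagged output sorts as intended.

    # Untagged: "git describe --always" yields bare hashes; their sort order
    # has nothing to do with when each egg was deployed.
    versions = ['ed82cb8', '1f3a9c0', '9b1d442']      # deployed in this order
    print(sorted(versions))    # ['1f3a9c0', '9b1d442', 'ed82cb8'] -> newest deploy is not last

    # Tagged: sequential tags produce strings that do sort in deploy order.
    tagged = ['v1.0-2-g1f3a9c0', 'v1.1-0-g9b1d442', 'v1.2-5-ged82cb8']
    print(sorted(tagged))      # order matches deploy order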

Spider stats webservice JSON format

How would one go about implementing live feedback on what the spider is doing?
I have the logs that tell me, each second, pages per second and items scraped, but I would like to access that via a webservice.
Maybe I'm wrong, but I found in the scrapy docs that there is something that should do this, http://localhost:6080/stats/spider_stats, but it returns empty.
Any ideas are welcome.

Thanks!

edit: I would like to be able to do something like curl http://localhost:6800/stats.json -d project=default -d spider=somespider
and it would return {"pages_crawled": "650","pages_per_min":"342","items_scraped":"286","items_per_min":"156"}

Inconsistent spiders among different servers

We have 7 servers running scrapyd.
The repo is cloned to the first server and then deployed to the other 6 servers.


Already on 'master'
From github.com:....
e51238d...82cb8 master -> origin/master
Updating e51238d..82cb8

Packing version 27-01-2014-3012-ged82cb8
Deploying to project "1" in http://server1:6800/addversion.json
Server response (200):
{"status": "ok", "project": "1", "version": "27-01-2014-3012-ged82cb8", "spiders": 1893}

Packing version 27-01-2014-3012-ged82cb8
Deploying to project "1" in http://server2:6800/addversion.json
Server response (200):
{"status": "ok", "project": "1", "version": 27-01-2014-3012-ged82cb8"", "spiders": 1893}

Packing version 27-01-2014-3012-ged82cb8
Deploying to project "1" in http://server3:6800/addversion.json
Server response (200):
{"status": "ok", "project": "1", "version": "27-01-2014-3012-ged82cb8", "spiders": 1893}


But when I start the spiders with scrapyd on different servers, I get different results; some old code still remains in scrapyd.
This is despite the reported versions being the same:
user@server1:~/de-scrapy$ curl http://localhost:6800/listversions.json?project
...

{"status": "ok", "versions": ["27-01-2014-3012-ged82cb8-master",...

user@server6:~$ curl http://localhost:6800/listversions.json?project=1
{"status": "ok", "versions": ["27-01-2014-3012-ged82cb8-master",...

I tried restarting the scrapyd service, but that doesn't help.
The only working solution was to delete the project on each server and recreate it from scratch, but this doesn't guarantee that the problem won't occur again.

Please help with fixing this.
Thanks!

Using item exporters to export two items, the items don't seem to be created properly.

I have a spider that extracts two types of items. My item pipeline saves these in two separate files by using two item exporters (currently the standard CSV exporter). This approach creates the two files fine when just running scrapy crawl, but when using scrapyd's schedule it either doesn't create them or I can't find them. I know that scrapyd is running my spider correctly: it returns correct status codes and also creates one file containing one type of item in JSON format.

The pseudo-code of my pipeline is:

open_spider:
    file1 = open('filename1.csv','w')
    file2 = open('filename2.csv','w')
    exporter1 = CsvExporter(file1)
    exporter2 = CsvExporter(file2)

process_item(item):
    if isinstance(item, Item1):
        exporter1.export_item(item)
    else:
        exporter2.export_item(item)

I haven't changed any settings for scrapyd except items_dir a few times.

So, will scrapyd create files and whatnot in the same way as scrapy? Where will these be saved? Do I need to tell it to use my pipelines (it's using them under scrapy)? Basically, I don't understand why it isn't working; please help. Thank you.
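
One detail worth checking, offered as a guess rather than a confirmed diagnosis: files opened with relative paths are created in the scrapyd process's working directory rather than the project directory, so using absolute paths makes the output easy to locate. A sketch (the export directory is a placeholder):

    import os

    # Placeholder: any absolute, writable directory.
    EXPORT_DIR = '/var/lib/scrapyd/exports'


    class TwoFileExportPipeline(object):

        def open_spider(self, spider):
            # Absolute paths keep the CSV files in a known place, regardless
            # of the working directory scrapyd happened to be started from.
            self.file1 = open(os.path.join(EXPORT_DIR, 'filename1.csv'), 'wb')
            self.file2 = open(os.path.join(EXPORT_DIR, 'filename2.csv'), 'wb')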

Status of Scrapyd

The last "real" pull request on Scrapyd is almost 5 months old. There are 20 open issues and at least 4 pull requests passing tests that are ready to merge. Is there any reason for the lack of feedback from the project maintainers?

Error with Spider arguments sent using Scrapyd

I figured out the solution to this, but I thought I would put it somewhere for comment and documentation.
I was getting this error when scheduling spider runs through scrapyd with spider arguments.

    TypeError: __init__() got an unexpected keyword argument '_job'

It's not stated explicitly in the documentation, but when using scrapyd your spiders should accept any number of keyword arguments, which I did not realize. scrapyd was sending a _job keyword to the spider, which my spider couldn't handle. So, in short, your spider's __init__ should look like

    def __init__(self, *args, **kwargs):

and look for the keyword of the argument you want instead of

    def __init__(self, my_keyword="default"):

Hopefully this helps some other python newbies in the future.
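
A slightly fuller sketch of the pattern described above (class and argument names are illustrative):

    import scrapy


    class MySpider(scrapy.Spider):
        name = "my_spider"

        def __init__(self, my_keyword="default", *args, **kwargs):
            # Accept and forward any extra keyword arguments, such as the
            # _job argument that scrapyd adds when scheduling a run.
            super(MySpider, self).__init__(*args, **kwargs)
            self.my_keyword = my_keyword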

scrapyd doesn't inspect the config file in the user's home directory

I have installed scrapyd via pip.

I created my customized config file as ~/.scrapyd.conf, but it didn't work.

I inspected scrapyd/config.py. It seems that this function doesn't check ~/.scrapyd.conf:

    def _getsources(self):
        sources = ['/etc/scrapyd/scrapyd.conf', r'c:\scrapyd\scrapyd.conf']
        sources += sorted(glob.glob('/etc/scrapyd/conf.d/*'))
        sources += ['scrapyd.conf']
        scrapy_cfg = closest_scrapy_cfg()
        if scrapy_cfg:
            sources.append(scrapy_cfg)
        return sources

The version of my scrapyd is 1.0.1.
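
A minimal sketch of one possible fix (not a confirmed patch for scrapyd, just the quoted method with one extra source added):

    def _getsources(self):
        sources = ['/etc/scrapyd/scrapyd.conf', r'c:\scrapyd\scrapyd.conf']
        sources += sorted(glob.glob('/etc/scrapyd/conf.d/*'))
        sources += ['scrapyd.conf']
        # Also look for a per-user config file in the home directory.
        sources += [os.path.expanduser('~/.scrapyd.conf')]
        scrapy_cfg = closest_scrapy_cfg()
        if scrapy_cfg:
            sources.append(scrapy_cfg)
        return sources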

More details in the Jobs table

The Jobs table in the web interface is really bare. The Scrapy stats collector contains a lot of valuable data, which should be included in this table.

I see a few ways of accessing this data:

  1. Parsing logs. This seems like unnecessary work and will only give access to crawl statistics after a crawl has finished.
  2. Subclassing CrawlerProcess, overriding methods that start/stop the reactor, thus removing the need to launch scrapyd.runner as a separate process. This gives us direct access to crawler.stats.get_stats() and gives the added benefit of using only one reactor to run multiple crawls.
  3. Using scrapy.contrib.webservice.stats.StatsResource. This doesn't rely on an unstable API (unlike 2), but will force us to parse log files to determine the webservice port.

Scrapyd needs some useful upgrades aside from a prettier UI. Scheduling periodic crawls, queues, retrying, etc. They don't seem difficult to implement, but I don't have the time to do this myself and don't know if the community even has interest in Scrapyd.

Thoughts?
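
For what it's worth, a rough sketch of option 1 from the list above: pulling the integer counters out of the "Dumping Scrapy stats" block that ends each crawl log (the log path is a placeholder and the parsing is deliberately naive):

    import re

    STAT_LINE = re.compile(r"'([^']+)': (\d+)[,}]")


    def stats_from_log(path):
        """Collect integer counters from the 'Dumping Scrapy stats' block of a crawl log."""
        stats = {}
        in_dump = False
        with open(path) as logfile:
            for line in logfile:
                if 'Dumping Scrapy stats:' in line:
                    in_dump = True
                    continue
                if in_dump:
                    match = STAT_LINE.search(line)
                    if match:
                        stats[match.group(1)] = int(match.group(2))
                    if line.rstrip().endswith('}'):
                        break
        return stats

    # Example: stats_from_log('/var/log/scrapyd/myproject/myspider/<jobid>.log')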

scrapy server command is not available

Docs mention using "scrapy server" as a means of running scrapyd inside a Scrapy project, but this command was removed from Scrapy when Scrapyd was extracted.

Bug with --list-projects option (scrapyd-deploy script)

$ scrapyd-deploy -L Target

Traceback (most recent call last):
  File ".../.virtualenvs/project/bin/scrapyd-deploy", line 273, in <module>
    main()
  File "../.virtualenvs/project/bin/scrapyd-deploy", line 76, in main
    print os.linesep.join(projects)
TypeError: sequence item 0: expected string, int found
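
The traceback shows that the projects list contained an int rather than a string; a minimal sketch of a fix for that line of scrapyd-deploy would be to stringify each entry before joining (hedged, not the project's actual patch):

    print os.linesep.join(str(project) for project in projects)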

scrapyd deploy with slash in branch name causes problems

If you try to deploy a project with scrapyd-deploy and your current branch has a slash in its name, it will say the deploy was successful, but it doesn't actually update the project.

Example branch name: feature/some-feature

Changing the branch name solves the problem.

Performance problems with Scrapyd and single spider

Context

I am running scrapyd 1.1 + scrapy 0.24.6 with a single spider that crawls over many domains according to parameters. The development machine that hosts the scrapyd instance is an OS X Yosemite machine with 4 cores, and this is my current configuration:

[scrapyd]
max_proc_per_cpu = 75
debug = on

Output when scrapyd starts:

2015-06-05 13:38:10-0500 [-] Log opened.
2015-06-05 13:38:10-0500 [-] twistd 15.0.0 (/Library/Frameworks/Python.framework/Versions/2.7/Resources/Python.app/Contents/MacOS/Python 2.7.9) starting up.
2015-06-05 13:38:10-0500 [-] reactor class: twisted.internet.selectreactor.SelectReactor.
2015-06-05 13:38:10-0500 [-] Site starting on 6800
2015-06-05 13:38:10-0500 [-] Starting factory twisted.web.server.Site instance at 0x104b91f38
2015-06-05 13:38:10-0500 [Launcher] Scrapyd 1.0.1 started: max_proc=300, runner='scrapyd.runner'

Problem

I would like a setup that processes around N jobs simultaneously for a single spider, but scrapyd is "lazily" processing 1 to 4 at a time regardless of how many jobs are pending, and I still don't know why:

[screenshot "scrapyd2" omitted]

Any ideas?

More info:
http://stackoverflow.com/questions/30672910/parallelism-problems-with-scrapyd-and-single-spider-on-osx

SqlitePriorityQueue.pop() return None may crash Poller.poll()

Continued discussion of scrapy/scrapy#28 here since it is directly related to scrapyd.

I can think of a few different ways to handle this, but until I get more familiar with scrapy(d)'s internals, I'm not sure which is preferable:

  1. Implement fine-grained control over sqlite3's transactions
  2. Increase the timeout limit
  3. Test for an ongoing transaction beforehand

I suspect 1 would be most robust but 3 would be least intrusive.

Is it enough to change scrapyd/sqlite.py line 14 to:

self.conn = sqlite3.connect(self.database, isolation_level=None, check_same_thread=False)

? If so, we could also remove the explicit calls to commit(), since changes would be committed automatically after each execute().
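
A small standalone illustration of what isolation_level=None changes (plain sqlite3 behaviour, not scrapyd code): the connection runs in autocommit mode, so each execute() is committed immediately and no transaction stays open holding the lock.

    import sqlite3

    # Autocommit mode: no BEGIN is issued implicitly, so each statement is
    # committed as soon as it runs.
    conn = sqlite3.connect(':memory:', isolation_level=None, check_same_thread=False)
    conn.execute('CREATE TABLE queue (priority REAL, message BLOB)')
    conn.execute('INSERT INTO queue VALUES (?, ?)', (1.0, sqlite3.Binary(b'payload')))
    # No conn.commit() needed before reading the row back.
    print(conn.execute('SELECT count(*) FROM queue').fetchone()[0])  # -> 1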
