
my8100 / scrapydweb


Web app for Scrapyd cluster management, Scrapy log analysis & visualization, Auto packaging, Timer tasks, Monitor & Alert, and Mobile UI.

Home Page: https://github.com/my8100/files

License: GNU General Public License v3.0

Python 61.11% CSS 3.54% JavaScript 3.52% HTML 31.82%
scrapy scrapyd scrapyd-ui scrapyd-api scrapyd-admin scrapyd-manage log-parsing log-analysis scrapyd-monitor scrapyd-keeper

scrapydweb's People

Contributors

cclauss, my8100


scrapydweb's Issues

'jinja2.runtime.LoopContext object' has no attribute 'changed'

The web interface encounters a 500 error, and the log shows the following (loop.changed() was only added in Jinja2 2.10, so an older Jinja2 install is the likely cause):

{% if SCRAPYD_SERVERS_GROUPS[loop.index-1] and loop.changed(SCRAPYD_SERVERS_GROUPS[loop.index-1]) %}
jinja2.exceptions.UndefinedError: 'jinja2.runtime.LoopContext object' has no attribute 'changed'

Passing arguments to scraper

Is it possible to pass arguments to the scraper?

When I start my scraper from the command line I add -a symbol=APPL to the crawl command,
for example: scrapy crawl my_scaper -a symbol=APPL

I tried adding this to the additional text box, but it does not seem to be added to the command.
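
For reference, Scrapyd's own schedule.json endpoint treats any extra POST parameter as a spider argument, which is the API-level equivalent of "scrapy crawl my_scaper -a symbol=APPL". A minimal sketch (the host and project name are placeholders, and this is not scrapydweb's internal code):

# Minimal sketch: scheduling a run via Scrapyd's schedule.json.
# Extra POST parameters (here "symbol") are passed to the spider as arguments,
# i.e. the equivalent of `-a symbol=APPL` on the command line.
import requests

resp = requests.post(
    "http://localhost:6800/schedule.json",   # placeholder Scrapyd host
    data={
        "project": "my_project",             # placeholder project name
        "spider": "my_scaper",
        "symbol": "APPL",
    },
)
print(resp.json())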

[BUG] 'pip install logparser' on the current Scrapyd host

Describe the bug
The page keeps showing the hint "Running (0) 'pip install logparser' on the current Scrapyd host and get it started via command 'logparser' to show crawled_pages and scraped_items", even though logparser is already installed.

To Reproduce
Steps to reproduce the behavior:

  1. Go to 'http://ip:5000/1/dashboard/'
  2. See error

Expected behavior
Running (0) by LogParser: xxx seconds ago

Screenshots

Environment (please complete the following information):

  • OS: centOS 7.6
  • Python: 3.6.6
  • Browser Chrome 71

Additional context
I am sure I have run 'pip install logparser' on this machine (A). On another machine I can use it normally, and I used that machine's requirements.txt to pip install on machine (A).

What does Forcestop mean in this?

What does Forcestop mean here? Does it mean the crawler keeps running in the background? I'm a Chinese student; I like computer programming, but my English is poor and I can't translate this. I don't understand Forcestop. Does the translation mean the job is stopped forcibly, without a clean shutdown?

Email send failed, suggest split FROM_ADDR and EMAIL_USER

Dear friend,
I really like your scrapydweb project and use it in many places. But today I found that the email sender does not support our school's mailbox, because our school's email login user name differs from 'from_addr' and does not contain a suffix like '@xxx.edu.com'. I suggest splitting FROM_ADDR and EMAIL_USER, and changing 'smtp.login(from_addr, ...)' to 'smtp.login(email_user, ...)'.
Thanks.
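
A minimal sketch of the requested split (EMAIL_USER, FROM_ADDR and the SMTP host below are placeholders, not scrapydweb's actual setting names): log in with the SMTP account name while keeping a separate From address for the message itself.

# Sketch only: SMTP login user decoupled from the From address.
import smtplib
from email.mime.text import MIMEText

EMAIL_USER = "login_name"          # SMTP login, without an '@xxx.edu.com' suffix
EMAIL_PASSWORD = "password"
FROM_ADDR = "alerts@xxx.edu.com"   # address shown in the From header
TO_ADDRS = ["someone@example.com"]

msg = MIMEText("ScrapydWeb alert test")
msg["From"] = FROM_ADDR
msg["To"] = ", ".join(TO_ADDRS)
msg["Subject"] = "ScrapydWeb alert"

with smtplib.SMTP("smtp.xxx.edu.com", 25) as smtp:
    smtp.login(EMAIL_USER, EMAIL_PASSWORD)   # login with EMAIL_USER, not FROM_ADDR
    smtp.sendmail(FROM_ADDR, TO_ADDRS, msg.as_string())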

"max_instances" setting is not work

When I set a schedule on a spider, I set "max_instances" to 1 and "coalesce" to "True", but it does not seem to work. After a while, the spider has more than one instance running.

dashboard page not found

I cannot find the web page after I run scrapydweb. In the terminal, some 404 messages are shown.

Webpage screenshot: 404 page

Terminal screenshot: 404 log messages

Disable generating a configuration file on scrapydweb's first startup

When the scrapydweb command is launched for the first time, it prints:

>>> ScrapydWeb version: 1.1.0
>>> Use 'scrapydweb -h' to get help
>>> Main pid: 23712
>>> Loading default settings from d:\virtualenvs\foo--pw-wxg0\lib\site-packages\scrapydweb\default_settings.py

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
The config file 'scrapydweb_settings_v7.py' has been copied to current working directory.
Please add your SCRAPYD_SERVERS in the config file and restart scrapydweb.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

We can see that scrapydweb loads the default configuration file and starts up completely, so why generate another one in the current working directory? This feels redundant and unfriendly: if I switch directories (I'm in a virtual environment) and run it again, it generates yet another configuration file.

Since there is a default configuration, I think it should simply be used. Custom configuration should be specified via command-line parameters; a configuration file could of course also be passed with a parameter such as --setting settings.conf.

If specifying a configuration file on the command line every time is too troublesome, the way Scrapyd loads its configuration files could be borrowed, as sketched below.
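
A rough sketch of that suggestion, loosely modelled on how Scrapyd falls back through well-known config locations (the candidate paths below are assumptions for illustration, not scrapydweb's actual search order):

# Illustration only: use the first settings file found in a list of
# well-known locations, otherwise keep using the packaged default_settings.py.
import os

CANDIDATE_PATHS = [
    os.path.join(os.getcwd(), "scrapydweb_settings_v7.py"),
    os.path.expanduser("~/.scrapydweb/settings.py"),      # assumed location
    "/etc/scrapydweb/settings.py",                        # assumed location
]

def find_settings_file():
    for path in CANDIDATE_PATHS:
        if os.path.isfile(path):
            return path
    return None   # caller falls back to the packaged defaults

print(find_settings_file())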

ScrapydWeb process always gets killed

[2018-12-18 18:10:43,553] INFO in werkzeug: 127.0.0.1 - - [18/Dec/2018 18:10:43] "POST /2/log/stats/Qcc/qcc_tax/2018-12-18T16_01_06/?job_finished= HTTP/1.1" 200 -
[2018-12-18 18:16:31,385] INFO in werkzeug: 127.0.0.1 - - [18/Dec/2018 18:16:31] "POST /2/log/stats/Qcc/qcm/a2161188e7f911e8882b35d4ab7f5d7b/?job_finished= HTTP/1.1" 200 -
Killed

Impossible to start scrapydweb in Docker

I'm testing scrapydweb in Docker, but it doesn't work; I must be missing something.

Indeed, I get a 500 error: 'NoneType' object has no attribute 'group'

Basically, here is my Dockerfile:

FROM python:3.6-jessie

ENV TZ="Europe/Paris"

WORKDIR /app

RUN pip install scrapydweb

RUN cp /usr/local/lib/python3.6/site-packages/scrapydweb/default_settings.py /app/scrapydweb_settings_v7.py

EXPOSE 5000

CMD ["scrapydweb", "--disable_auth", "--disable_logparser", "--scrapyd_server=scrapyd:6800"]

And here are the full logs of scrapydweb when I go to localhost:5000:

scrapydweb_1     | [2019-01-20 21:14:53,789] ERROR in flask.app: Exception on /1/dashboard/ [GET]
scrapydweb_1     | Traceback (most recent call last):
scrapydweb_1     |   File "/usr/local/lib/python3.6/site-packages/flask/app.py", line 2292, in wsgi_app
scrapydweb_1     |     response = self.full_dispatch_request()
scrapydweb_1     |   File "/usr/local/lib/python3.6/site-packages/flask/app.py", line 1815, in full_dispatch_request
scrapydweb_1     |     rv = self.handle_user_exception(e)
scrapydweb_1     |   File "/usr/local/lib/python3.6/site-packages/flask/app.py", line 1718, in handle_user_exception
scrapydweb_1     |     reraise(exc_type, exc_value, tb)
scrapydweb_1     |   File "/usr/local/lib/python3.6/site-packages/flask/_compat.py", line 35, in reraise
scrapydweb_1     |     raise value
scrapydweb_1     |   File "/usr/local/lib/python3.6/site-packages/flask/app.py", line 1813, in full_dispatch_request
scrapydweb_1     |     rv = self.dispatch_request()
scrapydweb_1     |   File "/usr/local/lib/python3.6/site-packages/flask/app.py", line 1799, in dispatch_request
scrapydweb_1     |     return self.view_functions[rule.endpoint](**req.view_args)
scrapydweb_1     |   File "/usr/local/lib/python3.6/site-packages/flask/views.py", line 88, in view
scrapydweb_1     |     return self.dispatch_request(*args, **kwargs)
scrapydweb_1     |   File "/usr/local/lib/python3.6/site-packages/scrapydweb/jobs/dashboard.py", line 57, in dispatch_request
scrapydweb_1     |     return self.generate_response()
scrapydweb_1     |   File "/usr/local/lib/python3.6/site-packages/scrapydweb/jobs/dashboard.py", line 98, in generate_response
scrapydweb_1     |     _url_items = re.search(r"href='(.*?)'>", row['items']).group(1)
scrapydweb_1     | AttributeError: 'NoneType' object has no attribute 'group'
scrapydweb_1     | [2019-01-20 21:14:53,816] INFO in werkzeug: 192.168.48.1 - - [20/Jan/2019 21:14:53] "GET /1/dashboard/ HTTP/1.1" 500 -
scrapydweb_1     | [2019-01-20 21:14:55,497] INFO in werkzeug: 192.168.48.1 - - [20/Jan/2019 21:14:55] "GET / HTTP/1.1" 302 -
scrapydweb_1     | (the identical traceback repeats for each subsequent GET /1/dashboard/ request, each returning 500)

Any idea how to fix this?

Thanks
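
The traceback shows re.search(...).group(1) being called on a match that is None, i.e. the 'items' cell of some job row contains no href link. A defensive variant of that line (illustrative only, not the project's actual fix) would skip such rows instead of raising:

# Illustrative defensive version of the failing line in dashboard.py:
# only call .group(1) when the regex actually matched.
import re

def extract_items_url(row):
    match = re.search(r"href='(.*?)'>", row.get('items', ''))
    return match.group(1) if match else None

print(extract_items_url({'items': "<a href='http://scrapyd:6800/items/demo.jl'>Items</a>"}))
print(extract_items_url({'items': 'no link here'}))   # None instead of AttributeError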

Why didn't my timer task run?

Describe the bug
A clear and concise description of what the bug is.

To Reproduce
Steps to reproduce the behavior:

  1. Go to '...'
  2. Click on '....'
  3. Scroll down to '....'
  4. See error

Expected behavior
A clear and concise description of what you expected to happen.

Logs
Add logs of ScrapydWeb and Scrapyd (optional) when reproducing the bug. (Run ScrapydWeb with argument '--verbose' if its version >= 1.0.0)

Screenshots
If applicable, add screenshots to help explain your problem.

Environment (please complete the following information):

  • OS: [e.g. Win 10, macOS 10.14, Ubuntu 18, centOS 7.6, Debian 9.6 or Fedora 29]
  • Python: [e.g. 2.7, 3.6 or 3.7]
  • ScrapydWeb: [e.g. 0.9.9 or 1.0.0]
  • Scrapyd amount [e.g. 1 or 5]
  • Related settings [e.g. 'ENABLE_CACHE = True']
  • Browser [e.g. Chrome 71, Firefox 64 or Safari 12]

Additional context
Add any other context about the problem here.

Scrapydweb constantly using 100% CPU

I have a machine with two scrapyd instances and one scrapydweb running, scrapydweb is connected to both scrapyd instances. However, CPU usage of scrapydweb is very high all the time. Investigating a bit, I've seen that scrapydweb is constantly making requests (more than 2 per second) to both scrapyd instances, requesting logs (the request is also done twice, once asking for the uncompressed log and once for the compressed one).

Now my question is: why does scrapydweb need to constantly fetch the Scrapyd logs? Once it has fetched them, they aren't going to change.

How does ScrapydWeb handle route?

Hello, you're doing great work. I'm a beginner with Flask; I've mostly been playing with crawlers and rarely touch web frameworks. I want to add a new feature, such as creating crawler files on the web side and deploying them directly from scrapydweb. How can I add a route and link to it? Which of your .py files adds the routes, and where do I hook up my xx.html template? Maybe this is a silly question, but I'm a beginner and don't know the framework yet. I hope you can give me an answer. Thank you very much.
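
Not scrapydweb-specific, just a generic Flask illustration of registering a route that renders a template (all names below are placeholders; scrapydweb organizes its views differently):

# Generic Flask example: add a route and render an HTML template for it.
from flask import Flask, render_template

app = Flask(__name__)

@app.route('/my-page/')
def my_page():
    # expects a templates/my_page.html file next to this script
    return render_template('my_page.html', title='My new page')

if __name__ == '__main__':
    app.run(port=5001)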

TypeError: __init__() got an unexpected keyword argument 'jitter'

  • I can't add a timer task and get an error: __init__() got an unexpected keyword argument 'jitter'

Detail:
Traceback (most recent call last):
File "c:\programdata\anaconda3\envs\py36\lib\site-packages\scrapydweb\operations\schedule.py", line 479, in add_update_task
replace_existing=True, **self.task_data)
File "c:\programdata\anaconda3\envs\py36\lib\site-packages\apscheduler\schedulers\base.py", line 411, in add_job
'trigger': self._create_trigger(trigger, trigger_args),
File "c:\programdata\anaconda3\envs\py36\lib\site-packages\apscheduler\schedulers\base.py", line 905, in _create_trigger
return self._create_plugin_instance('trigger', trigger, trigger_args)
File "c:\programdata\anaconda3\envs\py36\lib\site-packages\apscheduler\schedulers\base.py", line 890, in _create_plugin_instance
return plugin_cls(**constructor_kwargs)
TypeError: __init__() got an unexpected keyword argument 'jitter'

kwargs for execute_task():
{
    "task_id": 1
}

task_data for scheduler.add_job():
{
    "coalesce": true,
    "day": "*",
    "day_of_week": "*",
    "end_date": null,
    "hour": "*",
    "id": "1",
    "jitter": 0,
    "max_instances": 1,
    "minute": "*/10",
    "misfire_grace_time": 600,
    "month": "*",
    "name": "test timeer tasks - edit",
    "second": "0",
    "start_date": null,
    "timezone": "Asia/Shanghai",
    "trigger": "cron",
    "week": "*",
    "year": "*"
}
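
For context, CronTrigger only accepts the jitter keyword in newer APScheduler releases (it was introduced around version 3.5, if memory serves), so this TypeError usually indicates an outdated APScheduler. A standalone sketch rebuilding the task_data above (illustrative, not scrapydweb's own scheduling code):

# Illustrative sketch of the cron trigger from the task_data above.
# CronTrigger(jitter=...) requires a recent APScheduler; older releases raise
# the "__init__() got an unexpected keyword argument 'jitter'" error.
from apscheduler.schedulers.background import BackgroundScheduler
from apscheduler.triggers.cron import CronTrigger

def execute_task(task_id):
    print("executing task", task_id)

scheduler = BackgroundScheduler()
trigger = CronTrigger(minute="*/10", second="0", timezone="Asia/Shanghai", jitter=0)
scheduler.add_job(execute_task, trigger=trigger, id="1", kwargs={"task_id": 1},
                  coalesce=True, max_instances=1, misfire_grace_time=600,
                  replace_existing=True)
scheduler.start()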

scrapyd always stops

Hello sir.
I have two servers: scrapydweb and scrapyd are running on 166, while 110 only runs scrapyd.
But on 110, scrapyd always stops after running for a while. Why? Thanks, sir.

(screenshot)

[PY2] UnicodeDecodeError raised when there are some files with illegal filenames in `SCRAPY_PROJECTS_DIR`

'ascii' codec can't decode byte 0xe2 in position 15: ordinal not in range(128)

Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/flask/app.py", line 2311, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/lib/python2.7/dist-packages/flask/app.py", line 1834, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/local/lib/python2.7/dist-packages/flask/app.py", line 1737, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/local/lib/python2.7/dist-packages/flask/app.py", line 1832, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python2.7/dist-packages/flask/app.py", line 1818, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/usr/local/lib/python2.7/dist-packages/flask/views.py", line 88, in view
    return self.dispatch_request(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/scrapydweb/operations/deploy.py", line 56, in dispatch_request
    self.get_modification_times()
  File "/usr/local/lib/python2.7/dist-packages/scrapydweb/operations/deploy.py", line 75, in get_modification_times
    timestamps = [self.get_modification_time(i) for i in self.project_paths]
  File "/usr/local/lib/python2.7/dist-packages/scrapydweb/operations/deploy.py", line 90, in get_modification_time
    for dirpath, dirnames, filenames in os.walk(path):
  File "/usr/lib/python2.7/os.py", line 296, in walk
    for x in walk(new_path, topdown, onerror, followlinks):
  File "/usr/lib/python2.7/os.py", line 296, in walk
    for x in walk(new_path, topdown, onerror, followlinks):
  File "/usr/lib/python2.7/os.py", line 286, in walk
    if isdir(join(top, name)):
  File "/usr/lib/python2.7/posixpath.py", line 73, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 15: ordinal not in range(128)
Linux-4.4.0-117-generic-x86_64-with-Ubuntu-16.04-xenial
2.7.12
1.2.0
0.8.1
2
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36
GET
ImmutableMultiDict([])
ImmutableMultiDict([])
ImmutableMultiDict([])

Can '!!!' be changed to [ERROR]?

Using '!!!' to describe error info is not very clear...

Just a suggestion...

Thank you for your effort on this awesome project.

The target machine actively refused the connection (由于目标计算机积极拒绝)

win10, anaconda python3.6:

{
    "auth": null,
    "message": "HTTPConnectionPool(host='127.0.0.1', port=6800): Max retries exceeded with url: /daemonstatus.json (Caused by NewConnectionError(': Failed to establish a new connection: [WinError 10061] 由于目标计算机积极拒绝,无法连接。',))",
    "status": "error",
    "status_code": -1,
    "url": "http://127.0.0.1:6800/daemonstatus.json"
}
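
This status usually just means that nothing is listening on 127.0.0.1:6800, i.e. Scrapyd itself is not running there. A quick, generic check (illustrative only, not part of scrapydweb) before pointing scrapydweb at that address:

# Quick connectivity check against the Scrapyd API.
# "[WinError 10061] connection refused" means no service is listening,
# so Scrapyd probably is not running on 127.0.0.1:6800.
import requests

try:
    resp = requests.get("http://127.0.0.1:6800/daemonstatus.json", timeout=5)
    print(resp.status_code, resp.json())
except requests.exceptions.ConnectionError as err:
    print("Scrapyd is not reachable:", err)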

Run timer tasks using the latest version

I have many spiders in one project. When I modify their common functions, I need to update the timer tasks one by one so that they use the latest version. It would be very helpful if timer tasks could always use the latest version of the spiders. 😉


SQLite database is locked occasionally when executing timer tasks concurrently

I use 'timer tasks' to schedule some spiders to run periodically. Sometimes, when a spider is scheduled to execute, it does not run and a "database is locked" error is thrown.

Logs:
[2019-04-30 00:00:27,840] ERROR in apscheduler: Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 1244, in _execute_context
cursor, statement, parameters, context
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/default.py", line 552, in do_execute
cursor.execute(statement, parameters)
sqlite3.OperationalError: database is locked

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/scrapydweb/operations/execute_task.py", line 170, in execute_task
task_executer.main()
File "/usr/local/lib/python3.6/site-packages/scrapydweb/operations/execute_task.py", line 43, in main
self.get_task_result_id()
File "/usr/local/lib/python3.6/site-packages/scrapydweb/operations/execute_task.py", line 70, in get_task_result_id
db.session.commit()
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/orm/scoping.py", line 162, in do
return getattr(self.registry(), name)(*args, **kwargs)
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/orm/session.py", line 1026, in commit
self.transaction.commit()
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/orm/session.py", line 493, in commit
self._prepare_impl()
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/orm/session.py", line 472, in _prepare_impl
self.session.flush()
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/orm/session.py", line 2451, in flush
self._flush(objects)
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/orm/session.py", line 2589, in _flush
transaction.rollback(_capture_exception=True)
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/util/langhelpers.py", line 68, in exit
compat.reraise(exc_type, exc_value, exc_tb)
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/util/compat.py", line 129, in reraise
raise value
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/orm/session.py", line 2549, in _flush
flush_context.execute()
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/orm/unitofwork.py", line 422, in execute
rec.execute(self)
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/orm/unitofwork.py", line 589, in execute
uow,
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/orm/persistence.py", line 245, in save_obj
insert,
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/orm/persistence.py", line 1120, in _emit_insert_statements
statement, params
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 988, in execute
return meth(self, multiparams, params)
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/sql/elements.py", line 287, in _execute_on_connection
return connection._execute_clauseelement(self, multiparams, params)
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 1107, in _execute_clauseelement
distilled_params,
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 1248, in _execute_context
e, statement, parameters, cursor, context
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 1466, in _handle_dbapi_exception
util.raise_from_cause(sqlalchemy_exception, exc_info)
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/util/compat.py", line 383, in raise_from_cause
reraise(type(exception), exception, tb=exc_tb, cause=cause)
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/util/compat.py", line 128, in reraise
raise value.with_traceback(tb)
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 1244, in _execute_context
cursor, statement, parameters, context
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/default.py", line 552, in do_execute
cursor.execute(statement, parameters)
sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) database is locked
[SQL: INSERT INTO task_result (task_id, execute_time, fail_count, pass_count) VALUES (?, ?, ?, ?)]
[parameters: (15, '2019-04-30 00:00:22.834299', 0, 0)]
(Background on this error at: http://sqlalche.me/e/e3q8)
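
A generic mitigation for occasional "database is locked" errors when several writers hit the same SQLite file is to give the sqlite3 driver a longer busy timeout. The sketch below uses plain SQLAlchemy and is not scrapydweb's configuration; switching to a client/server database would be another option.

# Generic SQLAlchemy + SQLite sketch: the 'timeout' connect arg makes sqlite3
# wait up to N seconds for a lock instead of failing with "database is locked".
from sqlalchemy import create_engine, text

engine = create_engine(
    "sqlite:////tmp/example.db",          # placeholder database path
    connect_args={"timeout": 15},         # seconds to wait on a locked database
)

with engine.begin() as conn:
    conn.execute(text("CREATE TABLE IF NOT EXISTS task_result (task_id INTEGER)"))
    conn.execute(text("INSERT INTO task_result (task_id) VALUES (1)"))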

Refresh crawl status after long time leads to memory error.

I have a crawl running, and 87000 seconds after the last refresh, the following error occurs when trying to refresh:

Traceback (most recent call last):
  File "/root/anaconda3/envs/scrapyd_env/lib/python3.7/site-packages/flask/app.py", line 2292, in wsgi_app
    response = self.full_dispatch_request()
  File "/root/anaconda3/envs/scrapyd_env/lib/python3.7/site-packages/flask/app.py", line 1815, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/root/anaconda3/envs/scrapyd_env/lib/python3.7/site-packages/flask/app.py", line 1718, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/root/anaconda3/envs/scrapyd_env/lib/python3.7/site-packages/flask/_compat.py", line 35, in reraise
    raise value
  File "/root/anaconda3/envs/scrapyd_env/lib/python3.7/site-packages/flask/app.py", line 1813, in full_dispatch_request
    rv = self.dispatch_request()
  File "/root/anaconda3/envs/scrapyd_env/lib/python3.7/site-packages/flask/app.py", line 1799, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/root/anaconda3/envs/scrapyd_env/lib/python3.7/site-packages/flask/views.py", line 88, in view
    return self.dispatch_request(*args, **kwargs)
  File "/root/anaconda3/envs/scrapyd_env/lib/python3.7/site-packages/scrapydweb/directory/log.py", line 94, in dispatch_request
    self.request_scrapy_log()
  File "/root/anaconda3/envs/scrapyd_env/lib/python3.7/site-packages/scrapydweb/directory/log.py", line 142, in request_scrapy_log
    self.status_code, self.text = self.make_request(self.url, api=False, auth=self.AUTH)
  File "/root/anaconda3/envs/scrapyd_env/lib/python3.7/site-packages/scrapydweb/myview.py", line 191, in make_request
    front = r.text[:min(100, len(r.text))].replace('\n', '')
  File "/root/anaconda3/envs/scrapyd_env/lib/python3.7/site-packages/requests/models.py", line 861, in text
    content = str(self.content, encoding, errors='replace')
MemoryError

The crawl itself seems to be running fine, though.
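
The MemoryError comes from decoding the entire (by now huge) Scrapy log in memory via r.text. As a generic illustration (not the project's actual fix), streaming the response avoids holding the whole body at once; the URL is a placeholder:

# Generic illustration: stream a potentially huge log instead of reading it
# all at once with r.text.
import requests

url = "http://localhost:6800/logs/project/spider/job.log"   # placeholder
with requests.get(url, stream=True, timeout=60) as r:
    first_chunk = next(r.iter_content(chunk_size=100), b"")
    print("first 100 bytes:", first_chunk.decode("utf-8", errors="replace").replace("\n", ""))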

User Guide | Q&A | 用户指南 | 问答

linux:HTTPConnectionPool(host='192.168.0.24', port=6801): Max retries exceeded with url: /listprojects.json (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7f0a78b2d828>: Failed to establish a new connection: [Errno 111] Connection refused',))
windows:HTTPConnectionPool(host='localhost', port=6801): Max retries exceeded with url: /jobs (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x0000000004589CC0>: Failed to establish a new connection: [WinError 10061] 由于目标计算机积极拒绝,无法连接。',))
How should I solve this?

The Scrapyd server port issue

In the https://github.com/my8100/scrapyd-cluster-on-heroku-scrapyd-app repo, the init.py file sets the Scrapyd port to 6800 like this:

PORT = int(os.environ.get('PORT', 6800))
with io.open("scrapyd.conf", 'r+', encoding='utf-8') as f:
    f.read()
    f.write(u'\nhttp_port = %s\n' % PORT)

When I deployed the Scrapyd server on heroku.com:
telnet pjhscrapyd1.heroku.com 6800 Failed
telnet pjhscrapyd1.heroku.com 6801 Failed
telnet pjhscrapyd1.heroku.com 80 Succeed

What's wrong with it?

Cannot save state when restarting scrapydweb

I'm running scrapydweb in Docker.
I can start a job and then see some statistics; I also see the finished jobs. It's perfect.

However, when I restart my container, I lose this state; for example, I no longer see the finished jobs.
=> What data should I persist in my Docker container so that I can see everything when I restart the container?

I've tried persisting /usr/local/lib/python3.6/site-packages/scrapydweb/data, but it doesn't seem to do the trick.

SpiderKeeper keeps all its state in a SpiderKeeper.db file, which is perfect for keeping state across container restarts.

Any idea how to achieve the same with scrapydweb?

Thanks again for your work!

To run ScrapydWeb in HTTPS mode

How do I enable HTTPS? I have a certificate file from Let's Encrypt and would like to start the server using HTTPS.

Is this possible at this time? It should be, since using basic auth without HTTPS is dangerous.
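
ScrapydWeb is a Flask app, so as a generic illustration (not necessarily how scrapydweb itself exposes this), Flask's built-in server can serve HTTPS when given a certificate/key pair; the Let's Encrypt paths below are placeholders:

# Generic Flask HTTPS illustration; the certificate paths are placeholders.
from flask import Flask

app = Flask(__name__)

@app.route('/')
def index():
    return 'served over HTTPS'

if __name__ == '__main__':
    app.run(
        host='0.0.0.0',
        port=5000,
        ssl_context=('/etc/letsencrypt/live/example.com/fullchain.pem',
                     '/etc/letsencrypt/live/example.com/privkey.pem'),
    )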

Use Timer Tasks to start crawling jobs concurrently

I created two timer tasks, both set to run every 5 seconds, but at any given time only one of them is executed. How can I get multiple jobs to run concurrently? Is this a bug, or is something wrong with my settings? Looking forward to an explanation, thanks.

Auto eggifying sets the folder name as the project name

Congratulations on your work. I started using it and I hope to contribute to the project.

One problem I noticed is that when using the 'auto eggifying' feature, the project name is set to the name of the folder containing the scrapy.cfg file.
This can cause problems. It should instead be the name declared in the scrapy.cfg file (e.g. 'project = xxx').
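
A sketch of the suggested behavior (illustrative only, not the project's implementation): read the project name declared in scrapy.cfg and fall back to the folder name only when it is missing.

# Illustrative sketch: prefer the 'project' value declared in scrapy.cfg
# (e.g. under [deploy]) over the folder name.
import os
from configparser import ConfigParser

def resolve_project_name(project_dir):
    cfg = ConfigParser()
    cfg.read(os.path.join(project_dir, "scrapy.cfg"))
    for section in cfg.sections():
        if cfg.has_option(section, "project"):
            return cfg.get(section, "project")
    return os.path.basename(os.path.normpath(project_dir))   # fallback: folder name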

How to add a Timer task with multiple execution times

For example: on working days (Monday to Friday), start once at 8:30 am and once at 5:00 pm each day.

At least several of the web-crawling companies I have worked at really do have such a requirement.
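
With cron-style triggers this usually means two schedules, because 8:30 and 17:00 have different minute fields; presumably the equivalent of two timer tasks in scrapydweb. An APScheduler-style sketch for illustration:

# Illustrative APScheduler sketch: fire Mon-Fri at 08:30 and again at 17:00.
# Two cron triggers are needed because the minute fields differ.
from apscheduler.schedulers.blocking import BlockingScheduler

def fire_crawl_job():
    print("start the crawl job here")

scheduler = BlockingScheduler()
scheduler.add_job(fire_crawl_job, 'cron', day_of_week='mon-fri', hour=8, minute=30)
scheduler.add_job(fire_crawl_job, 'cron', day_of_week='mon-fri', hour=17, minute=0)
scheduler.start()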

Feature request suggestions

  1. Allow adding Scrapyd hosts directly in the web management UI, including Scrapyd hosts served over HTTPS.
  2. Add Chinese language support.
  3. Make the timeout and retry count of the various requests configurable.

about timer task

(screenshots)

I want the task to run at 18:00:00, but the task runs as soon as I click [add task&fire right now]! What should I do? Thanks, sir.
