
my8100 / scrapydweb


Web app for Scrapyd cluster management, Scrapy log analysis & visualization, Auto packaging, Timer tasks, Monitor & Alert, and Mobile UI.

Home Page: https://github.com/my8100/files

License: GNU General Public License v3.0

Python 61.11% CSS 3.54% JavaScript 3.52% HTML 31.82%
scrapy scrapyd scrapyd-ui scrapyd-api scrapyd-admin scrapyd-manage log-parsing log-analysis scrapyd-monitor scrapyd-keeper

scrapydweb's People

Contributors

cclauss, my8100


scrapydweb's Issues

'jinja2.runtime.LoopContext object' has no attribute 'changed'

The web interface encounters a 500 error, and the log shows the following (loop.changed() was only added in Jinja2 2.10, so an older Jinja2 install is the likely cause):

{% if SCRAPYD_SERVERS_GROUPS[loop.index-1] and loop.changed(SCRAPYD_SERVERS_GROUPS[loop.index-1]) %}
jinja2.exceptions.UndefinedError: 'jinja2.runtime.LoopContext object' has no attribute 'changed'

Passing arguments to scraper

Is it possible to pass arguments to the scraper?

When I start my scraper from the command line I add -a symbol=APPL to the crawl command,
for example: scrapy crawl my_scaper -a symbol=APPL

I tried adding this to the additional text box, but it does not seem to be added to the command.
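
For reference, Scrapyd's own schedule.json endpoint treats any extra POST parameter as a spider argument, which is the API-level equivalent of "scrapy crawl my_scaper -a symbol=APPL". A minimal sketch (the host and project name are placeholders, and this is not scrapydweb's internal code):

# Minimal sketch: scheduling a run via Scrapyd's schedule.json.
# Extra POST parameters (here "symbol") are passed to the spider as arguments,
# i.e. the equivalent of `-a symbol=APPL` on the command line.
import requests

resp = requests.post(
    "http://localhost:6800/schedule.json",   # placeholder Scrapyd host
    data={
        "project": "my_project",             # placeholder project name
        "spider": "my_scaper",
        "symbol": "APPL",
    },
)
print(resp.json())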

[BUG] 'pip install logparser' on the current Scrapyd host

Describe the bug
The page keeps showing the hint "Running (0) 'pip install logparser' on the current Scrapyd host and get it started via command 'logparser' to show crawled_pages and scraped_items", even though logparser is already installed.

To Reproduce
Steps to reproduce the behavior:

  1. Go to 'http://ip:5000/1/dashboard/'
  2. See error

Expected behavior
Running (0) by LogParser: xxx seconds ago

Screenshots

Environment (please complete the following information):

  • OS: centOS 7.6
  • Python: 3.6.6
  • Browser Chrome 71

Additional context
I am sure I have run 'pip install logparser' on this machine (A). On another machine I can use it normally, and I used that machine's requirements.txt to pip install on machine (A).

What does Forcestop mean in this?

What does Forcestop mean here? Does it mean the crawler keeps running in the background? I'm a Chinese student; I like computer programming, but my English is poor and I can't translate this. I don't understand Forcestop. Does the translation mean the job is stopped forcibly, without a clean shutdown?

Email send failed, suggest split FROM_ADDR and EMAIL_USER

Dear friend,
I really like your scrapydweb project and use it in many places. But today I found that the email sender does not support our school's mailbox, because our school's email login user name differs from 'from_addr' and does not contain a suffix like '@xxx.edu.com'. I suggest splitting FROM_ADDR and EMAIL_USER, and changing 'smtp.login(from_addr, ...)' to 'smtp.login(email_user, ...)'.
Thanks.
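
A minimal sketch of the requested split (EMAIL_USER, FROM_ADDR and the SMTP host below are placeholders, not scrapydweb's actual setting names): log in with the SMTP account name while keeping a separate From address for the message itself.

# Sketch only: SMTP login user decoupled from the From address.
import smtplib
from email.mime.text import MIMEText

EMAIL_USER = "login_name"          # SMTP login, without an '@xxx.edu.com' suffix
EMAIL_PASSWORD = "password"
FROM_ADDR = "alerts@xxx.edu.com"   # address shown in the From header
TO_ADDRS = ["someone@example.com"]

msg = MIMEText("ScrapydWeb alert test")
msg["From"] = FROM_ADDR
msg["To"] = ", ".join(TO_ADDRS)
msg["Subject"] = "ScrapydWeb alert"

with smtplib.SMTP("smtp.xxx.edu.com", 25) as smtp:
    smtp.login(EMAIL_USER, EMAIL_PASSWORD)   # login with EMAIL_USER, not FROM_ADDR
    smtp.sendmail(FROM_ADDR, TO_ADDRS, msg.as_string())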

"max_instances" setting is not work

When I set a schedule on a spider, I set "max_instances" to 1 and "coalesce" to "True", but it does not seem to work. After a while, the spider has more than one instance running.

dashboard page not found

I cannot find the web page after I run scrapydweb. In the terminal, some 404 messages are shown.

Webpage screenshot: 404 page

Terminal screenshot: 404 log messages

Disable generating a configuration file on scrapydweb's first startup

When the scrapydweb command is launched for the first time, it prints:

>>> ScrapydWeb version: 1.1.0
>>> Use 'scrapydweb -h' to get help
>>> Main pid: 23712
>>> Loading default settings from d:\virtualenvs\foo--pw-wxg0\lib\site-packages\scrapydweb\default_settings.py

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
The config file 'scrapydweb_settings_v7.py' has been copied to current working directory.
Please add your SCRAPYD_SERVERS in the config file and restart scrapydweb.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

We can see that scrapydweb loads the default configuration file and starts up completely, so why generate another one in the current working directory? This feels redundant and unfriendly: if I switch directories (I'm in a virtual environment) and run it again, it generates yet another configuration file.

Since there is a default configuration, I think it should simply be used. Custom configuration should be specified via command-line parameters; a configuration file could of course also be passed with a parameter such as --setting settings.conf.

If specifying a configuration file on the command line every time is too troublesome, the way Scrapyd loads its configuration files could be borrowed, as sketched below.
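
A rough sketch of that suggestion, loosely modelled on how Scrapyd falls back through well-known config locations (the candidate paths below are assumptions for illustration, not scrapydweb's actual search order):

# Illustration only: use the first settings file found in a list of
# well-known locations, otherwise keep using the packaged default_settings.py.
import os

CANDIDATE_PATHS = [
    os.path.join(os.getcwd(), "scrapydweb_settings_v7.py"),
    os.path.expanduser("~/.scrapydweb/settings.py"),      # assumed location
    "/etc/scrapydweb/settings.py",                        # assumed location
]

def find_settings_file():
    for path in CANDIDATE_PATHS:
        if os.path.isfile(path):
            return path
    return None   # caller falls back to the packaged defaults

print(find_settings_file())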

ScrapydWeb process always gets killed

[2018-12-18 18:10:43,553] INFO in werkzeug: 127.0.0.1 - - [18/Dec/2018 18:10:43] "POST /2/log/stats/Qcc/qcc_tax/2018-12-18T16_01_06/?job_finished= HTTP/1.1" 200 -
[2018-12-18 18:16:31,385] INFO in werkzeug: 127.0.0.1 - - [18/Dec/2018 18:16:31] "POST /2/log/stats/Qcc/qcm/a2161188e7f911e8882b35d4ab7f5d7b/?job_finished= HTTP/1.1" 200 -
Killed

Impossible to start scrapydweb in Docker

I'm testing scrapydweb in Docker, but it doesn't work; I must be missing something.

Indeed, I get a 500 error: 'NoneType' object has no attribute 'group'

Basically, here is my Dockerfile:

FROM python:3.6-jessie

ENV TZ="Europe/Paris"

WORKDIR /app

RUN pip install scrapydweb

RUN cp /usr/local/lib/python3.6/site-packages/scrapydweb/default_settings.py /app/scrapydweb_settings_v7.py

EXPOSE 5000

CMD ["scrapydweb", "--disable_auth", "--disable_logparser", "--scrapyd_server=scrapyd:6800"]

And here are the full logs of scrapydweb when I go to localhost:5000:

scrapydweb_1     | [2019-01-20 21:14:53,789] ERROR in flask.app: Exception on /1/dashboard/ [GET]
scrapydweb_1     | Traceback (most recent call last):
scrapydweb_1     |   File "/usr/local/lib/python3.6/site-packages/flask/app.py", line 2292, in wsgi_app
scrapydweb_1     |     response = self.full_dispatch_request()
scrapydweb_1     |   File "/usr/local/lib/python3.6/site-packages/flask/app.py", line 1815, in full_dispatch_request
scrapydweb_1     |     rv = self.handle_user_exception(e)
scrapydweb_1     |   File "/usr/local/lib/python3.6/site-packages/flask/app.py", line 1718, in handle_user_exception
scrapydweb_1     |     reraise(exc_type, exc_value, tb)
scrapydweb_1     |   File "/usr/local/lib/python3.6/site-packages/flask/_compat.py", line 35, in reraise
scrapydweb_1     |     raise value
scrapydweb_1     |   File "/usr/local/lib/python3.6/site-packages/flask/app.py", line 1813, in full_dispatch_request
scrapydweb_1     |     rv = self.dispatch_request()
scrapydweb_1     |   File "/usr/local/lib/python3.6/site-packages/flask/app.py", line 1799, in dispatch_request
scrapydweb_1     |     return self.view_functions[rule.endpoint](**req.view_args)
scrapydweb_1     |   File "/usr/local/lib/python3.6/site-packages/flask/views.py", line 88, in view
scrapydweb_1     |     return self.dispatch_request(*args, **kwargs)
scrapydweb_1     |   File "/usr/local/lib/python3.6/site-packages/scrapydweb/jobs/dashboard.py", line 57, in dispatch_request
scrapydweb_1     |     return self.generate_response()
scrapydweb_1     |   File "/usr/local/lib/python3.6/site-packages/scrapydweb/jobs/dashboard.py", line 98, in generate_response
scrapydweb_1     |     _url_items = re.search(r"href='(.*?)'>", row['items']).group(1)
scrapydweb_1     | AttributeError: 'NoneType' object has no attribute 'group'
scrapydweb_1     | [2019-01-20 21:14:53,816] INFO in werkzeug: 192.168.48.1 - - [20/Jan/2019 21:14:53] "GET /1/dashboard/ HTTP/1.1" 500 -
scrapydweb_1     | [2019-01-20 21:14:55,497] INFO in werkzeug: 192.168.48.1 - - [20/Jan/2019 21:14:55] "GET / HTTP/1.1" 302 -
scrapydweb_1     | (the identical traceback repeats for each subsequent GET /1/dashboard/ request, each returning 500)

Any idea how to fix this?

Thanks
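
The traceback shows re.search(...).group(1) being called on a match that is None, i.e. the 'items' cell of some job row contains no href link. A defensive variant of that line (illustrative only, not the project's actual fix) would skip such rows instead of raising:

# Illustrative defensive version of the failing line in dashboard.py:
# only call .group(1) when the regex actually matched.
import re

def extract_items_url(row):
    match = re.search(r"href='(.*?)'>", row.get('items', ''))
    return match.group(1) if match else None

print(extract_items_url({'items': "<a href='http://scrapyd:6800/items/demo.jl'>Items</a>"}))
print(extract_items_url({'items': 'no link here'}))   # None instead of AttributeError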

Why didn't my timer task run?

Describe the bug
A clear and concise description of what the bug is.

To Reproduce
Steps to reproduce the behavior:

  1. Go to '...'
  2. Click on '....'
  3. Scroll down to '....'
  4. See error

Expected behavior
A clear and concise description of what you expected to happen.

Logs
Add logs of ScrapydWeb and Scrapyd (optional) when reproducing the bug. (Run ScrapydWeb with argument '--verbose' if its version >= 1.0.0)

Screenshots
If applicable, add screenshots to help explain your problem.

Environment (please complete the following information):

  • OS: [e.g. Win 10, macOS 10.14, Ubuntu 18, centOS 7.6, Debian 9.6 or Fedora 29]
  • Python: [e.g. 2.7, 3.6 or 3.7]
  • ScrapydWeb: [e.g. 0.9.9 or 1.0.0]
  • Scrapyd amount [e.g. 1 or 5]
  • Related settings [e.g. 'ENABLE_CACHE = True']
  • Browser [e.g. Chrome 71, Firefox 64 or Safari 12]

Additional context
Add any other context about the problem here.

Scrapydweb constantly using 100% CPU

I have a machine with two scrapyd instances and one scrapydweb running, scrapydweb is connected to both scrapyd instances. However, CPU usage of scrapydweb is very high all the time. Investigating a bit, I've seen that scrapydweb is constantly making requests (more than 2 per second) to both scrapyd instances, requesting logs (the request is also done twice, once asking for the uncompressed log and once for the compressed one).

Now my question is: why does scrapydweb need to constantly fetch the Scrapyd logs? Once it has fetched them, they aren't going to change.

How does ScrapydWeb handle route?

Hello, you're doing great work. I'm a beginner with Flask; I've mostly been playing with crawlers and rarely touch web frameworks. I want to add a new feature, such as creating crawler files on the web side and deploying them directly from scrapydweb. How can I add a route and link to it? Which of your .py files adds the routes, and where do I hook up my xx.html template? Maybe this is a silly question, but I'm a beginner and don't know the framework yet. I hope you can give me an answer. Thank you very much.
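
Not scrapydweb-specific, just a generic Flask illustration of registering a route that renders a template (all names below are placeholders; scrapydweb organizes its views differently):

# Generic Flask example: add a route and render an HTML template for it.
from flask import Flask, render_template

app = Flask(__name__)

@app.route('/my-page/')
def my_page():
    # expects a templates/my_page.html file next to this script
    return render_template('my_page.html', title='My new page')

if __name__ == '__main__':
    app.run(port=5001)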

TypeError: __init__() got an unexpected keyword argument 'jitter'

  • I can't add a timer task and get an error: __init__() got an unexpected keyword argument 'jitter'

Detail:
Traceback (most recent call last):
File "c:\programdata\anaconda3\envs\py36\lib\site-packages\scrapydweb\operations\schedule.py", line 479, in add_update_task
replace_existing=True, **self.task_data)
File "c:\programdata\anaconda3\envs\py36\lib\site-packages\apscheduler\schedulers\base.py", line 411, in add_job
'trigger': self._create_trigger(trigger, trigger_args),
File "c:\programdata\anaconda3\envs\py36\lib\site-packages\apscheduler\schedulers\base.py", line 905, in _create_trigger
return self._create_plugin_instance('trigger', trigger, trigger_args)
File "c:\programdata\anaconda3\envs\py36\lib\site-packages\apscheduler\schedulers\base.py", line 890, in _create_plugin_instance
return plugin_cls(**constructor_kwargs)
TypeError: __init__() got an unexpected keyword argument 'jitter'

kwargs for execute_task():
{
    "task_id": 1
}

task_data for scheduler.add_job():
{
    "coalesce": true,
    "day": "*",
    "day_of_week": "*",
    "end_date": null,
    "hour": "*",
    "id": "1",
    "jitter": 0,
    "max_instances": 1,
    "minute": "*/10",
    "misfire_grace_time": 600,
    "month": "*",
    "name": "test timeer tasks - edit",
    "second": "0",
    "start_date": null,
    "timezone": "Asia/Shanghai",
    "trigger": "cron",
    "week": "*",
    "year": "*"
}
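
For context, CronTrigger only accepts the jitter keyword in newer APScheduler releases (it was introduced around version 3.5, if memory serves), so this TypeError usually indicates an outdated APScheduler. A standalone sketch rebuilding the task_data above (illustrative, not scrapydweb's own scheduling code):

# Illustrative sketch of the cron trigger from the task_data above.
# CronTrigger(jitter=...) requires a recent APScheduler; older releases raise
# the "__init__() got an unexpected keyword argument 'jitter'" error.
from apscheduler.schedulers.background import BackgroundScheduler
from apscheduler.triggers.cron import CronTrigger

def execute_task(task_id):
    print("executing task", task_id)

scheduler = BackgroundScheduler()
trigger = CronTrigger(minute="*/10", second="0", timezone="Asia/Shanghai", jitter=0)
scheduler.add_job(execute_task, trigger=trigger, id="1", kwargs={"task_id": 1},
                  coalesce=True, max_instances=1, misfire_grace_time=600,
                  replace_existing=True)
scheduler.start()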

scrapyd always stops

Hello sir.
I have two servers: scrapydweb and scrapyd are running on 166, while 110 only runs scrapyd.
But on 110, scrapyd always stops after running for a while. Why? Thanks, sir.

(screenshot)

[PY2] UnicodeDecodeError raised when there are some files with illegal filenames in `SCRAPY_PROJECTS_DIR`

'ascii' codec can't decode byte 0xe2 in position 15: ordinal not in range(128)

Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/flask/app.py", line 2311, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/lib/python2.7/dist-packages/flask/app.py", line 1834, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/local/lib/python2.7/dist-packages/flask/app.py", line 1737, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/local/lib/python2.7/dist-packages/flask/app.py", line 1832, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python2.7/dist-packages/flask/app.py", line 1818, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/usr/local/lib/python2.7/dist-packages/flask/views.py", line 88, in view
    return self.dispatch_request(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/scrapydweb/operations/deploy.py", line 56, in dispatch_request
    self.get_modification_times()
  File "/usr/local/lib/python2.7/dist-packages/scrapydweb/operations/deploy.py", line 75, in get_modification_times
    timestamps = [self.get_modification_time(i) for i in self.project_paths]
  File "/usr/local/lib/python2.7/dist-packages/scrapydweb/operations/deploy.py", line 90, in get_modification_time
    for dirpath, dirnames, filenames in os.walk(path):
  File "/usr/lib/python2.7/os.py", line 296, in walk
    for x in walk(new_path, topdown, onerror, followlinks):
  File "/usr/lib/python2.7/os.py", line 296, in walk
    for x in walk(new_path, topdown, onerror, followlinks):
  File "/usr/lib/python2.7/os.py", line 286, in walk
    if isdir(join(top, name)):
  File "/usr/lib/python2.7/posixpath.py", line 73, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 15: ordinal not in range(128)
Linux-4.4.0-117-generic-x86_64-with-Ubuntu-16.04-xenial
2.7.12
1.2.0
0.8.1
2
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36
GET
ImmutableMultiDict([])
ImmutableMultiDict([])
ImmutableMultiDict([])

Can '!!!' be changed to [ERROR]?

Using '!!!' to describe error info is not very clear...

Just a suggestion...

Thank you for your effort on this awesome project.

The target machine actively refused the connection (由于目标计算机积极拒绝)

win10, anaconda python3.6:

{
    "auth": null,
    "message": "HTTPConnectionPool(host='127.0.0.1', port=6800): Max retries exceeded with url: /daemonstatus.json (Caused by NewConnectionError(': Failed to establish a new connection: [WinError 10061] 由于目标计算机积极拒绝,无法连接。',))",
    "status": "error",
    "status_code": -1,
    "url": "http://127.0.0.1:6800/daemonstatus.json"
}
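
This status usually just means that nothing is listening on 127.0.0.1:6800, i.e. Scrapyd itself is not running there. A quick, generic check (illustrative only, not part of scrapydweb) before pointing scrapydweb at that address:

# Quick connectivity check against the Scrapyd API.
# "[WinError 10061] connection refused" means no service is listening,
# so Scrapyd probably is not running on 127.0.0.1:6800.
import requests

try:
    resp = requests.get("http://127.0.0.1:6800/daemonstatus.json", timeout=5)
    print(resp.status_code, resp.json())
except requests.exceptions.ConnectionError as err:
    print("Scrapyd is not reachable:", err)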

Run timer tasks using the latest version

I have many spiders in one project. When I modify their common functions, I need to update the timer tasks one by one so that they use the latest version. It would be very helpful if timer tasks could always use the latest version of the spiders. 😉


SQLite database is locked occasionally when executing timer tasks concurrently

I use 'timer tasks' to schedule some spiders to run periodically. Sometimes, when a spider is scheduled to execute, it does not run and a "database is locked" error is thrown.

Logs:
[2019-04-30 00:00:27,840] ERROR in apscheduler: Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 1244, in _execute_context
cursor, statement, parameters, context
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/default.py", line 552, in do_execute
cursor.execute(statement, parameters)
sqlite3.OperationalError: database is locked

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/scrapydweb/operations/execute_task.py", line 170, in execute_task
task_executer.main()
File "/usr/local/lib/python3.6/site-packages/scrapydweb/operations/execute_task.py", line 43, in main
self.get_task_result_id()
File "/usr/local/lib/python3.6/site-packages/scrapydweb/operations/execute_task.py", line 70, in get_task_result_id
db.session.commit()
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/orm/scoping.py", line 162, in do
return getattr(self.registry(), name)(*args, **kwargs)
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/orm/session.py", line 1026, in commit
self.transaction.commit()
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/orm/session.py", line 493, in commit
self._prepare_impl()
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/orm/session.py", line 472, in _prepare_impl
self.session.flush()
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/orm/session.py", line 2451, in flush
self._flush(objects)
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/orm/session.py", line 2589, in _flush
transaction.rollback(_capture_exception=True)
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/util/langhelpers.py", line 68, in exit
compat.reraise(exc_type, exc_value, exc_tb)
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/util/compat.py", line 129, in reraise
raise value
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/orm/session.py", line 2549, in _flush
flush_context.execute()
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/orm/unitofwork.py", line 422, in execute
rec.execute(self)
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/orm/unitofwork.py", line 589, in execute
uow,
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/orm/persistence.py", line 245, in save_obj
insert,
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/orm/persistence.py", line 1120, in _emit_insert_statements
statement, params
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 988, in execute
return meth(self, multiparams, params)
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/sql/elements.py", line 287, in _execute_on_connection
return connection._execute_clauseelement(self, multiparams, params)
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 1107, in _execute_clauseelement
distilled_params,
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 1248, in _execute_context
e, statement, parameters, cursor, context
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 1466, in _handle_dbapi_exception
util.raise_from_cause(sqlalchemy_exception, exc_info)
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/util/compat.py", line 383, in raise_from_cause
reraise(type(exception), exception, tb=exc_tb, cause=cause)
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/util/compat.py", line 128, in reraise
raise value.with_traceback(tb)
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 1244, in _execute_context
cursor, statement, parameters, context
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/default.py", line 552, in do_execute
cursor.execute(statement, parameters)
sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) database is locked
[SQL: INSERT INTO task_result (task_id, execute_time, fail_count, pass_count) VALUES (?, ?, ?, ?)]
[parameters: (15, '2019-04-30 00:00:22.834299', 0, 0)]
(Background on this error at: http://sqlalche.me/e/e3q8)
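
A generic mitigation for occasional "database is locked" errors when several writers hit the same SQLite file is to give the sqlite3 driver a longer busy timeout. The sketch below uses plain SQLAlchemy and is not scrapydweb's configuration; switching to a client/server database would be another option.

# Generic SQLAlchemy + SQLite sketch: the 'timeout' connect arg makes sqlite3
# wait up to N seconds for a lock instead of failing with "database is locked".
from sqlalchemy import create_engine, text

engine = create_engine(
    "sqlite:////tmp/example.db",          # placeholder database path
    connect_args={"timeout": 15},         # seconds to wait on a locked database
)

with engine.begin() as conn:
    conn.execute(text("CREATE TABLE IF NOT EXISTS task_result (task_id INTEGER)"))
    conn.execute(text("INSERT INTO task_result (task_id) VALUES (1)"))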

Refresh crawl status after long time leads to memory error.

I have a crawl running, and 87000 seconds after the last refresh, the following error occurs when trying to refresh:

Traceback (most recent call last):
  File "/root/anaconda3/envs/scrapyd_env/lib/python3.7/site-packages/flask/app.py", line 2292, in wsgi_app
    response = self.full_dispatch_request()
  File "/root/anaconda3/envs/scrapyd_env/lib/python3.7/site-packages/flask/app.py", line 1815, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/root/anaconda3/envs/scrapyd_env/lib/python3.7/site-packages/flask/app.py", line 1718, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/root/anaconda3/envs/scrapyd_env/lib/python3.7/site-packages/flask/_compat.py", line 35, in reraise
    raise value
  File "/root/anaconda3/envs/scrapyd_env/lib/python3.7/site-packages/flask/app.py", line 1813, in full_dispatch_request
    rv = self.dispatch_request()
  File "/root/anaconda3/envs/scrapyd_env/lib/python3.7/site-packages/flask/app.py", line 1799, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/root/anaconda3/envs/scrapyd_env/lib/python3.7/site-packages/flask/views.py", line 88, in view
    return self.dispatch_request(*args, **kwargs)
  File "/root/anaconda3/envs/scrapyd_env/lib/python3.7/site-packages/scrapydweb/directory/log.py", line 94, in dispatch_request
    self.request_scrapy_log()
  File "/root/anaconda3/envs/scrapyd_env/lib/python3.7/site-packages/scrapydweb/directory/log.py", line 142, in request_scrapy_log
    self.status_code, self.text = self.make_request(self.url, api=False, auth=self.AUTH)
  File "/root/anaconda3/envs/scrapyd_env/lib/python3.7/site-packages/scrapydweb/myview.py", line 191, in make_request
    front = r.text[:min(100, len(r.text))].replace('\n', '')
  File "/root/anaconda3/envs/scrapyd_env/lib/python3.7/site-packages/requests/models.py", line 861, in text
    content = str(self.content, encoding, errors='replace')
MemoryError

The crawl itself seems to be running fine, though.
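
The MemoryError comes from decoding the entire (by now huge) Scrapy log in memory via r.text. As a generic illustration (not the project's actual fix), streaming the response avoids holding the whole body at once; the URL is a placeholder:

# Generic illustration: stream a potentially huge log instead of reading it
# all at once with r.text.
import requests

url = "http://localhost:6800/logs/project/spider/job.log"   # placeholder
with requests.get(url, stream=True, timeout=60) as r:
    first_chunk = next(r.iter_content(chunk_size=100), b"")
    print("first 100 bytes:", first_chunk.decode("utf-8", errors="replace").replace("\n", ""))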

User Guide | Q&A | 用户指南 | 问答

linux:HTTPConnectionPool(host='192.168.0.24', port=6801): Max retries exceeded with url: /listprojects.json (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7f0a78b2d828>: Failed to establish a new connection: [Errno 111] Connection refused',))
windows:HTTPConnectionPool(host='localhost', port=6801): Max retries exceeded with url: /jobs (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x0000000004589CC0>: Failed to establish a new connection: [WinError 10061] 由于目标计算机积极拒绝,无法连接。',))
How should I solve this?

The Scrapyd server port issue

In the https://github.com/my8100/scrapyd-cluster-on-heroku-scrapyd-app repo, the init.py file sets the Scrapyd port to 6800 like this:

PORT = int(os.environ.get('PORT', 6800))
with io.open("scrapyd.conf", 'r+', encoding='utf-8') as f:
    f.read()
    f.write(u'\nhttp_port = %s\n' % PORT)

When I deployed the Scrapyd server on heroku.com:
telnet pjhscrapyd1.heroku.com 6800 Failed
telnet pjhscrapyd1.heroku.com 6801 Failed
telnet pjhscrapyd1.heroku.com 80 Succeed

What's wrong with it?

Cannot save state when restarting scrapydweb

I'm running scrapydweb in Docker.
I can start a job and then see some statistics; I also see the finished jobs. It's perfect.

However, when I restart my container, I lose this state; for example, I no longer see the finished jobs.
=> What data should I persist in my Docker container so that I can see everything when I restart the container?

I've tried persisting /usr/local/lib/python3.6/site-packages/scrapydweb/data, but it doesn't seem to do the trick.

SpiderKeeper keeps all its state in a SpiderKeeper.db file, which is perfect for keeping state across container restarts.

Any idea how to achieve the same with scrapydweb?

Thanks again for your work!

To run ScrapydWeb in HTTPS mode

How do I enable HTTPS? I have a certificate file from Let's Encrypt and would like to start the server using HTTPS.

Is this possible at this time? It should be, since using basic auth without HTTPS is dangerous.
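
ScrapydWeb is a Flask app, so as a generic illustration (not necessarily how scrapydweb itself exposes this), Flask's built-in server can serve HTTPS when given a certificate/key pair; the Let's Encrypt paths below are placeholders:

# Generic Flask HTTPS illustration; the certificate paths are placeholders.
from flask import Flask

app = Flask(__name__)

@app.route('/')
def index():
    return 'served over HTTPS'

if __name__ == '__main__':
    app.run(
        host='0.0.0.0',
        port=5000,
        ssl_context=('/etc/letsencrypt/live/example.com/fullchain.pem',
                     '/etc/letsencrypt/live/example.com/privkey.pem'),
    )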

Use Timer Tasks to start crawling jobs concurrently

I created two timer tasks, both set to run every 5 seconds, but at any given time only one of them is executed. How can I get multiple jobs to run concurrently? Is this a bug, or is something wrong with my settings? Looking forward to an explanation, thanks.

Auto eggifying sets the folder name as the project name

Congratulations on your work. I started using it and I hope to contribute to the project.

One problem I noticed is that when using the 'auto eggifying' feature, the project name is set to the name of the folder containing the scrapy.cfg file.
This can cause problems. It should instead be the name declared in the scrapy.cfg file (e.g. 'project = xxx').
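
A sketch of the suggested behavior (illustrative only, not the project's implementation): read the project name declared in scrapy.cfg and fall back to the folder name only when it is missing.

# Illustrative sketch: prefer the 'project' value declared in scrapy.cfg
# (e.g. under [deploy]) over the folder name.
import os
from configparser import ConfigParser

def resolve_project_name(project_dir):
    cfg = ConfigParser()
    cfg.read(os.path.join(project_dir, "scrapy.cfg"))
    for section in cfg.sections():
        if cfg.has_option(section, "project"):
            return cfg.get(section, "project")
    return os.path.basename(os.path.normpath(project_dir))   # fallback: folder name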

How to add a Timer task with multiple execution times

For example: on working days (Monday to Friday), start once at 8:30 am and once at 5:00 pm each day.

At least several of the web-crawling companies I have worked at really do have such a requirement.
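
With cron-style triggers this usually means two schedules, because 8:30 and 17:00 have different minute fields; presumably the equivalent of two timer tasks in scrapydweb. An APScheduler-style sketch for illustration:

# Illustrative APScheduler sketch: fire Mon-Fri at 08:30 and again at 17:00.
# Two cron triggers are needed because the minute fields differ.
from apscheduler.schedulers.blocking import BlockingScheduler

def fire_crawl_job():
    print("start the crawl job here")

scheduler = BlockingScheduler()
scheduler.add_job(fire_crawl_job, 'cron', day_of_week='mon-fri', hour=8, minute=30)
scheduler.add_job(fire_crawl_job, 'cron', day_of_week='mon-fri', hour=17, minute=0)
scheduler.start()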

Feature request suggestions

  1. Allow adding Scrapyd hosts directly in the web management UI, including Scrapyd hosts served over HTTPS.
  2. Add Chinese language support.
  3. Make the timeout and retry count of the various requests configurable.

about timer task

(screenshots)

I want the task to run at 18:00:00, but the task runs as soon as I click [add task&fire right now]! What should I do? Thanks, sir.
