tonywangcn / scaleable-crawler-with-docker-cluster


A scalable and efficient crawler with a Docker cluster; crawl a million pages in 2 hours on a single machine.

Language: Python 100.00%
Topics: crawler, python, celery, rabbitmq, scaleable, cluster, distributed, docker

scaleable-crawler-with-docker-cluster's Introduction

scaleable-crawler-with-docker-cluster

A scalable and efficient crawler with a Docker cluster; crawl a million pages in 2 hours on a single machine. Check out the full documentation here.

scaleable-crawler-with-docker-cluster's People

Contributors

dependabot[bot], tonywangcn


scaleable-crawler-with-docker-cluster's Issues

Using celery==4.0.2 raises an async error; celery==4.2.x does not. Suggest changing the pinned Celery version.

The error is as follows:

worker_1    | /usr/local/lib/python2.7/site-packages/celery/platforms.py:793: RuntimeWarning: You're running the worker with superuser privileges: this is
worker_1    | absolutely not recommended!
worker_1    | 
worker_1    | Please specify a different user using the -u option.
worker_1    | 
worker_1    | User information: uid=0 euid=0 gid=0 egid=0
worker_1    | 
worker_1    |   uid=uid, euid=euid, gid=gid, egid=egid,
scaleable-crawler-with-docker-cluster_worker_2 exited with code 1
worker_1    | Traceback (most recent call last):
worker_1    |   File "/usr/local/bin/celery", line 10, in <module>
worker_1    |     sys.exit(main())
worker_1    |   File "/usr/local/lib/python2.7/site-packages/celery/__main__.py", line 14, in main
worker_1    |     _main()
worker_1    |   File "/usr/local/lib/python2.7/site-packages/celery/bin/celery.py", line 326, in main
worker_1    |     cmd.execute_from_commandline(argv)
worker_1    |   File "/usr/local/lib/python2.7/site-packages/celery/bin/celery.py", line 488, in execute_from_commandline
worker_1    |     super(CeleryCommand, self).execute_from_commandline(argv)))
worker_1    |   File "/usr/local/lib/python2.7/site-packages/celery/bin/base.py", line 281, in execute_from_commandline
worker_1    |     return self.handle_argv(self.prog_name, argv[1:])
worker_1    |   File "/usr/local/lib/python2.7/site-packages/celery/bin/celery.py", line 480, in handle_argv
worker_1    |     return self.execute(command, argv)
worker_1    |   File "/usr/local/lib/python2.7/site-packages/celery/bin/celery.py", line 412, in execute
worker_1    |     ).run_from_argv(self.prog_name, argv[1:], command=argv[0])
worker_1    |   File "/usr/local/lib/python2.7/site-packages/celery/bin/worker.py", line 221, in run_from_argv
worker_1    |     return self(*args, **options)
worker_1    |   File "/usr/local/lib/python2.7/site-packages/celery/bin/base.py", line 244, in __call__
worker_1    |     ret = self.run(*args, **kwargs)
worker_1    |   File "/usr/local/lib/python2.7/site-packages/celery/bin/worker.py", line 255, in run
worker_1    |     **kwargs)
worker_1    |   File "/usr/local/lib/python2.7/site-packages/celery/worker/worker.py", line 99, in __init__
worker_1    |     self.setup_instance(**self.prepare_args(**kwargs))
worker_1    |   File "/usr/local/lib/python2.7/site-packages/celery/worker/worker.py", line 122, in setup_instance
worker_1    |     self.should_use_eventloop() if use_eventloop is None
worker_1    |   File "/usr/local/lib/python2.7/site-packages/celery/worker/worker.py", line 241, in should_use_eventloop
worker_1    |     self._conninfo.transport.implements.async and
worker_1    |   File "/usr/local/lib/python2.7/site-packages/kombu/transport/base.py", line 127, in __getattr__
worker_1    |     raise AttributeError(key)
worker_1    | AttributeError: async
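
A likely cause, not confirmed in this thread, is that newer kombu releases renamed the transport flag async to asynchronous (async became a reserved word in Python 3.7), so Celery 4.0.2 fails to find the attribute at the kombu line shown above. Pinning a compatible Celery release in requirements.txt avoids the mismatch; the exact version below follows the issue title's suggestion and is not tested against this repo:

    celery==4.2.1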

MongoClient opened before fork

Last night, when I ran python -m test_celery.run_tasks, I got the warning and error below. Does anyone know how to fix this?

worker_1    | [2018-05-19 05:12:06,257: WARNING/ForkPoolWorker-3] /usr/local/lib/python2.7/site-packages/pymongo/topology.py:145: UserWarning: MongoClient opened before fork. Create MongoClient with connect=False, or create client after forking. See PyMongo's documentation for details: http://api.mongodb.org/python/current/faq.html#pymongo-fork-safe>
worker_1    |   "MongoClient opened before fork. Create MongoClient "

and

worker_1    | [2018-05-19 05:12:36,612: ERROR/ForkPoolWorker-3] Task test_celery.tasks.longtime_add[287b7121-cb07-42b5-868b-785e6aab74cc] raised unexpected: ServerSelectionTimeoutError('127.0.0.1:27017: [Errno 111] Connection refused',)
worker_1    | Traceback (most recent call last):
worker_1    |   File "/usr/local/lib/python2.7/site-packages/celery/app/trace.py", line 367, in trace_task
worker_1    |     R = retval = fun(*args, **kwargs)
worker_1    |   File "/usr/local/lib/python2.7/site-packages/celery/app/trace.py", line 622, in __protected_call__
worker_1    |     return self.run(*args, **kwargs)
worker_1    |   File "/app/test_celery/tasks.py", line 17, in longtime_add
worker_1    |     raise self.retry(exc=exc)
worker_1    |   File "/usr/local/lib/python2.7/site-packages/celery/app/task.py", line 668, in retry
worker_1    |     raise_with_context(exc)
worker_1    |   File "/app/test_celery/tasks.py", line 14, in longtime_add
worker_1    |     post.insert({'status':r.status_code,"creat_time":time.time()}) # store status code and current time to mongodb
worker_1    |   File "/usr/local/lib/python2.7/site-packages/pymongo/collection.py", line 2467, in insert
worker_1    |     with self._socket_for_writes() as sock_info:
worker_1    |   File "/usr/local/lib/python2.7/contextlib.py", line 17, in __enter__
worker_1    |     return self.gen.next()
worker_1    |   File "/usr/local/lib/python2.7/site-packages/pymongo/mongo_client.py", line 823, in _get_socket
worker_1    |     server = self._get_topology().select_server(selector)
worker_1    |   File "/usr/local/lib/python2.7/site-packages/pymongo/topology.py", line 214, in select_server
worker_1    |     address))
worker_1    |   File "/usr/local/lib/python2.7/site-packages/pymongo/topology.py", line 189, in select_servers
worker_1    |     self._error_message(selector))
worker_1    | ServerSelectionTimeoutError: 127.0.0.1:27017: [Errno 111] Connection refused

Currently, I have

pymongo==3.6.1
celery==4.1.0
python==3.5.2
Ubuntu==16.04

A Few Corrections and Tips

Thank you so much for open-sourcing the code and the detailed article.

I found a few changes that are needed to make it work:

RabbitMQ Host

In test_celery/celery.py you have to set the broker URL to broker='amqp://admin:mypass@rabbit:5672'

Running the tasks

You have to execute run_tasks inside the worker container instead of on the host machine:

  sudo docker exec -i -t scaleablecrawlerwithdockercluster_worker_1 /bin/bash
  python -m test_celery.run_tasks

Here scaleablecrawlerwithdockercluster_worker_1 is the container name; make sure you replace it with your own worker container's name or ID.

Access MongoDB from the host machine to see the results:

mongo --host 172.18.0.1 --port 27018

Question about docker-engine

In the Medium article you mention installing docker-engine. I think we only need to install docker and docker-compose. I installed both using the following steps:

1. Install Docker

Refer: https://docs.docker.com/engine/installation/linux/docker-ce/ubuntu/


    sudo apt-get remove docker docker-engine docker.io

    sudo apt-get update
    sudo apt-get install \
        apt-transport-https \
        ca-certificates \
        curl \
        software-properties-common


    curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
    sudo apt-key fingerprint 0EBFCD88

    sudo add-apt-repository \
       "deb [arch=amd64] https://download.docker.com/linux/ubuntu \
       $(lsb_release -cs) \
       stable"

    sudo apt-get update
    sudo apt-get install docker-ce
    sudo docker --version

2. Install Docker-compose

    sudo curl -L https://github.com/docker/compose/releases/download/1.18.0/docker-compose-`uname -s`-`uname -m` -o /usr/local/bin/docker-compose
    sudo chmod +x /usr/local/bin/docker-compose
    sudo docker-compose --version

Could you please update the same on the article?
