tonywangcn / scaleable-crawler-with-docker-cluster


A scalable and efficient crawler with a Docker cluster; crawls a million pages in 2 hours on a single machine.

Python 100.00%
crawler python celery rabbitmq scaleable cluster distributed docker

scaleable-crawler-with-docker-cluster's Issues

MongoClient opened before fork

Last night, when I ran python -m test_celery.run_tasks, I got the warning and the error below. Does anyone know how to fix this?

worker_1    | [2018-05-19 05:12:06,257: WARNING/ForkPoolWorker-3] /usr/local/lib/python2.7/site-packages/pymongo/topology.py:145: UserWarning: MongoClient opened before fork. Create MongoClient with connect=False, or create client after forking. See PyMongo's documentation for details: http://api.mongodb.org/python/current/faq.html#pymongo-fork-safe>
worker_1    |   "MongoClient opened before fork. Create MongoClient "

and

worker_1    | [2018-05-19 05:12:36,612: ERROR/ForkPoolWorker-3] Task test_celery.tasks.longtime_add[287b7121-cb07-42b5-868b-785e6aab74cc] raised unexpected: ServerSelectionTimeoutError('127.0.0.1:27017: [Errno 111] Connection refused',)
worker_1    | Traceback (most recent call last):
worker_1    |   File "/usr/local/lib/python2.7/site-packages/celery/app/trace.py", line 367, in trace_task
worker_1    |     R = retval = fun(*args, **kwargs)
worker_1    |   File "/usr/local/lib/python2.7/site-packages/celery/app/trace.py", line 622, in __protected_call__
worker_1    |     return self.run(*args, **kwargs)
worker_1    |   File "/app/test_celery/tasks.py", line 17, in longtime_add
worker_1    |     raise self.retry(exc=exc)
worker_1    |   File "/usr/local/lib/python2.7/site-packages/celery/app/task.py", line 668, in retry
worker_1    |     raise_with_context(exc)
worker_1    |   File "/app/test_celery/tasks.py", line 14, in longtime_add
worker_1    |     post.insert({'status':r.status_code,"creat_time":time.time()}) # store status code and current time to mongodb
worker_1    |   File "/usr/local/lib/python2.7/site-packages/pymongo/collection.py", line 2467, in insert
worker_1    |     with self._socket_for_writes() as sock_info:
worker_1    |   File "/usr/local/lib/python2.7/contextlib.py", line 17, in __enter__
worker_1    |     return self.gen.next()
worker_1    |   File "/usr/local/lib/python2.7/site-packages/pymongo/mongo_client.py", line 823, in _get_socket
worker_1    |     server = self._get_topology().select_server(selector)
worker_1    |   File "/usr/local/lib/python2.7/site-packages/pymongo/topology.py", line 214, in select_server
worker_1    |     address))
worker_1    |   File "/usr/local/lib/python2.7/site-packages/pymongo/topology.py", line 189, in select_servers
worker_1    |     self._error_message(selector))
worker_1    | ServerSelectionTimeoutError: 127.0.0.1:27017: [Errno 111] Connection refused

Currently, I have

pymongo==3.6.1
celery==4.1.0
python==3.5.2
Ubuntu==16.04
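
One way to act on the warning, as a minimal sketch: create the client lazily inside each worker process instead of at import time, so that the Celery prefork pool never inherits an already-open MongoClient. The helper names get_client and make_mongo_client and the host name mongo are assumptions for illustration, not code from this repository. The "Connection refused" on 127.0.0.1:27017 also suggests checking that the client points at the MongoDB container rather than at the worker container's own loopback address.

```python
import os

# Cached client and the PID of the process that created it.
_client = None
_client_pid = None


def get_client(factory):
    """Return a client created in the current process.

    If the cached client was created in a different process (for example,
    before a fork), a fresh one is built with `factory`.
    """
    global _client, _client_pid
    pid = os.getpid()
    if _client is None or _client_pid != pid:
        _client = factory()
        _client_pid = pid
    return _client


def make_mongo_client():
    """Build a MongoClient; imported lazily so this module loads without pymongo."""
    from pymongo import MongoClient

    # connect=False defers connecting until the first operation, as the
    # warning recommends. The host name "mongo" is an assumption: use the
    # MongoDB service name from your own docker-compose file.
    return MongoClient("mongodb://mongo:27017", connect=False)
```

Inside a task, get_client(make_mongo_client) would then take the place of a module-level MongoClient(...).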

Using celery==4.0.2 can raise an async error; celery==4.2.x does not. I suggest changing the Celery version.

The error is shown below:

worker_1    | /usr/local/lib/python2.7/site-packages/celery/platforms.py:793: RuntimeWarning: You're running the worker with superuser privileges: this is
worker_1    | absolutely not recommended!
worker_1    | 
worker_1    | Please specify a different user using the -u option.
worker_1    | 
worker_1    | User information: uid=0 euid=0 gid=0 egid=0
worker_1    | 
worker_1    |   uid=uid, euid=euid, gid=gid, egid=egid,
scaleable-crawler-with-docker-cluster_worker_2 exited with code 1
worker_1    | Traceback (most recent call last):
worker_1    |   File "/usr/local/bin/celery", line 10, in <module>
worker_1    |     sys.exit(main())
worker_1    |   File "/usr/local/lib/python2.7/site-packages/celery/__main__.py", line 14, in main
worker_1    |     _main()
worker_1    |   File "/usr/local/lib/python2.7/site-packages/celery/bin/celery.py", line 326, in main
worker_1    |     cmd.execute_from_commandline(argv)
worker_1    |   File "/usr/local/lib/python2.7/site-packages/celery/bin/celery.py", line 488, in execute_from_commandline
worker_1    |     super(CeleryCommand, self).execute_from_commandline(argv)))
worker_1    |   File "/usr/local/lib/python2.7/site-packages/celery/bin/base.py", line 281, in execute_from_commandline
worker_1    |     return self.handle_argv(self.prog_name, argv[1:])
worker_1    |   File "/usr/local/lib/python2.7/site-packages/celery/bin/celery.py", line 480, in handle_argv
worker_1    |     return self.execute(command, argv)
worker_1    |   File "/usr/local/lib/python2.7/site-packages/celery/bin/celery.py", line 412, in execute
worker_1    |     ).run_from_argv(self.prog_name, argv[1:], command=argv[0])
worker_1    |   File "/usr/local/lib/python2.7/site-packages/celery/bin/worker.py", line 221, in run_from_argv
worker_1    |     return self(*args, **options)
worker_1    |   File "/usr/local/lib/python2.7/site-packages/celery/bin/base.py", line 244, in __call__
worker_1    |     ret = self.run(*args, **kwargs)
worker_1    |   File "/usr/local/lib/python2.7/site-packages/celery/bin/worker.py", line 255, in run
worker_1    |     **kwargs)
worker_1    |   File "/usr/local/lib/python2.7/site-packages/celery/worker/worker.py", line 99, in __init__
worker_1    |     self.setup_instance(**self.prepare_args(**kwargs))
worker_1    |   File "/usr/local/lib/python2.7/site-packages/celery/worker/worker.py", line 122, in setup_instance
worker_1    |     self.should_use_eventloop() if use_eventloop is None
worker_1    |   File "/usr/local/lib/python2.7/site-packages/celery/worker/worker.py", line 241, in should_use_eventloop
worker_1    |     self._conninfo.transport.implements.async and
worker_1    |   File "/usr/local/lib/python2.7/site-packages/kombu/transport/base.py", line 127, in __getattr__
worker_1    |     raise AttributeError(key)
worker_1    | AttributeError: async
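
The traceback fails inside kombu while Celery looks up implements.async, which points to a mismatch between the installed celery and kombu versions rather than to the task code (this is an inference from the traceback, not something confirmed by the maintainer). A small sketch for checking what is actually installed in the worker image, assuming Python 3.8+ for importlib.metadata; the function name installed_versions is hypothetical:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(names=("celery", "kombu")):
    """Map each package name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(name, ver or "not installed")
```

Pinning both packages to a matching pair in the requirements file keeps a rebuild of the image from pulling in a newer kombu than the installed Celery expects.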

Few Corrections and Tips

Thank you so much for open-sourcing the code and for the detailed article.

I found a few things that need to be done to make it work:

RabbitMQ Host

In your test_celery/celery.py, you have to set the broker URL to broker='amqp://admin:mypass@rabbit:5672', so that the worker reaches RabbitMQ through the rabbit host name.
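
As a rough sanity check, the broker URL can be parsed to confirm that it does not point at the container's own loopback address, where no RabbitMQ is listening. This is only a sketch using the standard library; the function name broker_host_is_remote is hypothetical.

```python
from urllib.parse import urlsplit

# Broker URL from the tip above.
BROKER_URL = "amqp://admin:mypass@rabbit:5672"


def broker_host_is_remote(url):
    """Return False when the URL has no host or points at the local loopback."""
    host = urlsplit(url).hostname
    return host not in (None, "localhost", "127.0.0.1", "::1")


if __name__ == "__main__":
    parts = urlsplit(BROKER_URL)
    print(parts.hostname, parts.port, broker_host_is_remote(BROKER_URL))
```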

Running tasks

You have to execute run_tasks inside a worker container instead of on the host machine.

  sudo docker exec -i -t scaleablecrawlerwithdockercluster_worker_1 /bin/bash
  python -m test_celery.run_tasks

Here scaleablecrawlerwithdockercluster_worker_1 is the container name. Make sure you replace it with the name or ID of your own worker container.

Access MongoDB from the host machine to see the results.

mongo --host 172.18.0.1 --port 27018

Question about docker-engine:

In the Medium article, you mention installing docker-engine. I think we only need to install docker and docker-compose. I installed both using the following steps:

1. Install Docker

Refer: https://docs.docker.com/engine/installation/linux/docker-ce/ubuntu/


    sudo apt-get remove docker docker-engine docker.io

    sudo apt-get update
    sudo apt-get install \
        apt-transport-https \
        ca-certificates \
        curl \
        software-properties-common


    sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
    sudo apt-key fingerprint 0EBFCD88

    sudo add-apt-repository \
       "deb [arch=amd64] https://download.docker.com/linux/ubuntu \
       $(lsb_release -cs) \
       stable"

    sudo apt-get update
    sudo apt-get install docker-ce
    sudo docker --version

2. Install Docker-compose

    sudo curl -L https://github.com/docker/compose/releases/download/1.18.0/docker-compose-`uname -s`-`uname -m` -o /usr/local/bin/docker-compose
    sudo chmod +x /usr/local/bin/docker-compose
    sudo docker-compose --version

Could you please update the article accordingly?
