openzim / zimfarm

Farm operated by bots to grow and harvest new zim files

Home Page: https://farm.openzim.org

License: GNU General Public License v3.0

Python 74.17% JavaScript 4.30% HTML 0.55% CSS 0.29% Shell 2.25% Dockerfile 1.00% Vue 16.59% Mako 0.06% Jupyter Notebook 0.78%
flask python3 zim-files distributed-systems docker-images

zimfarm's Introduction

ZIM Farm

The ZIM farm (zimfarm) is a semi-decentralised software solution to build ZIM files efficiently. This means scraping Web contents, packaging them into a ZIM file and uploading the result to an online ZIM files repository.

How does it work?

The Zimfarm platform is a combination of different tools:

dispatcher

The dispatcher is a central database and API that records recipes (metadata of ZIM to produce) and tasks. It includes a scheduler that decides when a ZIM file should be recreated (based on the recipe) and a dispatcher that creates and assigns tasks to workers.
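As a rough illustration of the scheduling decision described above, here is a minimal sketch. The `Recipe` fields, the periodicity names and the intervals are assumptions made for this example; they are not taken from the Zimfarm code base.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List, Optional

# Hypothetical periodicity names and intervals; the real recipe schema may differ.
INTERVALS = {
    "monthly": timedelta(days=30),
    "quarterly": timedelta(days=90),
}


@dataclass
class Recipe:
    name: str
    periodicity: str
    last_run: Optional[datetime] = None  # None means the ZIM was never built


def is_due(recipe: Recipe, now: datetime) -> bool:
    """Return True when the recipe's ZIM file should be recreated."""
    if recipe.last_run is None:
        return True
    return now - recipe.last_run >= INTERVALS[recipe.periodicity]


def recipes_to_schedule(recipes: Iterable[Recipe], now: datetime) -> List[str]:
    """Names of the recipes for which a new task should be created."""
    return [recipe.name for recipe in recipes if is_due(recipe, now)]
```

A dispatcher loop could call `recipes_to_schedule` periodically and create one task per returned name.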

frontend

The frontend, available at farm.openzim.org, is a simple consumer of the API.

It is used to create, clone and edit recipes, but also to monitor the evolution of tasks and workers.

Anybody can use it in read-only mode.

workers

Workers are always-running computers which get assigned ZIM creation tasks by the dispatcher. If you are interested in providing us with worker resources, please read these instructions.

A worker is made of two software components:

worker-manager

The manager declares its available resources and configuration and receives the tasks assigned to it by the dispatcher. It is a very low-resource container whose job is to spawn task-worker containers.

task-worker

The task-worker is responsible for running a specific task. It is also a very low-resource container but, unlike the manager, one is spawned for each task assigned to the worker (the manager sets the concurrency based on available resources).

The task-worker's role is to start and monitor the scraper's container for the task and to spawn uploader containers for both created ZIM files and logs.

uploader

The uploader is instantiated by the task-worker to upload each created ZIM file individually, as well as the scraper container's log.

The uploader supports both SCP and SFTP. We are currently using SFTP for all uploads due to a slight speed gain.

The uploader is very fast and convenient (it can watch and resume files) but currently works only from files.

receiver

The receiver is a jailed OpenSSH server that receives scraper logs and ZIM files. It passes the ZIM files through a quarantine step using the zimcheck tool, which either puts them aside (invalid ZIM) or moves them to the public download server.

scrapers

Scrapers are the tools used to actually convert a scraping request (recorded in a Zimfarm recipe) into one or several ZIM files.

The most important one is the MediaWiki scraper, called mwoffliner, but there are many others, for Stack Exchange, Project Gutenberg, PhET and more.

Scrapers are not part of the Zimfarm. Those are completely independent projects for which the requirements to integrate into the Zimfarm are minimal:

  • Works completely off a docker image
  • Arguments should be set on the command line
  • ZIM output folder should be settable via an argument
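The three requirements above mean a scraper run can be described entirely as a `docker run` command line. The sketch below builds such a command; the helper name, the image name, the `/output` mount point and the output flag passed in the usage note are illustrative assumptions, not the Zimfarm's actual conventions.

```python
from typing import Dict, List


def build_scraper_command(
    image: str,
    command: str,
    flags: Dict[str, str],
    output_flag: str,
    host_output_dir: str,
    container_output_dir: str = "/output",  # assumed mount point
) -> List[str]:
    """Build the argv for running a scraper from its Docker image.

    Every scraper argument goes on the command line, and the ZIM output
    folder is passed through a dedicated flag pointing at a mounted volume.
    """
    argv = [
        "docker", "run", "--rm",
        "-v", f"{host_output_dir}:{container_output_dir}",
        image,
        command,
    ]
    for name, value in flags.items():
        argv += [f"--{name}", value]
    argv += [f"--{output_flag}", container_output_dir]
    return argv
```

For example, `build_scraper_command("example/scraper", "mwoffliner", {"mwUrl": "https://bm.wikipedia.org/"}, "outputDirectory", "/data/zim")` returns a list that could be handed to `subprocess.run`.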

How do I request a ZIM file?

ZIM file requests are handled in the zim-requests repository.

If there's already a scraper for the website you want to convert to ZIM, someone with editor access to the Zimfarm will create the recipe and in a few days, a ZIM file should be available.

zimfarm's People

Contributors

aniruddhachattopadhyay, automactic, benoit74, dependabot[bot], haksoat, jenskorte, kelson42, mahakporwal02, nemobis, popolechien, rgaudin, satyamtg

zimfarm's Issues

Add server side event monitor

It might be a good idea to add a server-side event listener instead of having workers constantly send status updates to the server.

All backend APIs should be protected using JWT

Currently, none of the resource APIs are protected.
I have added JWT token auth in the worker-mwoffliner branch. We need to verify tokens in all resource APIs and throw an exception if the token is missing or invalid.
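To make the verification step concrete, here is a minimal, standard-library-only sketch of checking an HS256-signed JWT. It is not the code from the worker-mwoffliner branch: the `InvalidToken` exception, the function names and the choice of HS256 are assumptions for illustration, and a production API would more likely rely on a maintained JWT library.

```python
import base64
import hashlib
import hmac
import json
import time


class InvalidToken(Exception):
    """Raised when a token is missing, malformed, forged or expired."""


def _b64url_decode(part: str) -> bytes:
    return base64.urlsafe_b64decode(part + "=" * (-len(part) % 4))


def _b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def sign_token(payload: dict, secret: bytes) -> str:
    """Create an HS256 token (only used here to exercise verify_token)."""
    header = {"typ": "JWT", "alg": "HS256"}
    parts = [_b64url_encode(json.dumps(p).encode("utf-8")) for p in (header, payload)]
    signing_input = ".".join(parts)
    signature = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return signing_input + "." + _b64url_encode(signature)


def verify_token(token, secret: bytes, now=None) -> dict:
    """Return the payload of a valid token, raise InvalidToken otherwise."""
    if not token:
        raise InvalidToken("token missing")
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
        header = json.loads(_b64url_decode(header_b64))
        payload = json.loads(_b64url_decode(payload_b64))
        signature = _b64url_decode(signature_b64)
    except (ValueError, TypeError) as exc:
        raise InvalidToken("token malformed") from exc
    if not isinstance(header, dict) or not isinstance(payload, dict):
        raise InvalidToken("token malformed")
    if header.get("alg") != "HS256":
        raise InvalidToken("unexpected algorithm")
    signing_input = f"{header_b64}.{payload_b64}".encode("ascii")
    expected = hmac.new(secret, signing_input, hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):
        raise InvalidToken("bad signature")
    exp = payload.get("exp")
    current = time.time() if now is None else now
    if not isinstance(exp, (int, float)) or exp <= current:
        raise InvalidToken("token expired")
    return payload
```

Each resource endpoint could call `verify_token` on the `token` request header before doing any work, and translate `InvalidToken` into a 401 response.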

Remove "docker" directory

This is useless, as all the necessary Dockerfiles are already in the subdirectories (depending on their respective duties).

How to sync file from worker back to dispatcher

I think eventually there are going to be two types of workers: local and remote.

  • local workers run on the same server / data center as the dispatcher and handle large ZIM file generation.
  • remote workers run in other geographic locations and mostly generate small ZIM files.

The way file is synced could be

Enhanced File Transfer

Currently we use plain FTP to transfer files from the worker to the warehouse. This is a security risk, as the username and user token are transferred in plain text.

We also discussed using rsync. The problem is that it is not possible to enter the password programmatically.


Here I propose another approach, inspired by GitHub (thanks @blajzer):

  • Users upload their public key to the dispatcher, which stores the key in its database (like the GitHub key management system).
  • The worker maps ~/.ssh on the host to the same path inside the container on startup.
  • The warehouse is an SSH server implemented using Paramiko.

When a worker needs to upload a file:

  1. the worker uses rsync over SSH or SFTP, either of which requires an SSH connection at some point
  2. the warehouse receives the SSH auth challenge
  3. the warehouse sends a request to the dispatcher to verify the username and key
  4. the dispatcher queries its database and verifies or rejects the username and key
  5. the warehouse accepts or rejects the auth challenge
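The verification round trip in steps 3 to 5 can be sketched as follows. This is a plain-Python model, not Paramiko code: the `Dispatcher` and `Warehouse` classes and their methods are hypothetical, the HTTP request between the two is replaced by a direct method call, and the proof that the client owns the private key is assumed to be handled by the SSH server itself.

```python
import base64
import hashlib
import hmac
from typing import Dict, Set


def fingerprint(public_key_line: str) -> str:
    """SHA256 fingerprint of an OpenSSH public key line ("type base64 [comment]")."""
    key_blob = base64.b64decode(public_key_line.split()[1])
    digest = hashlib.sha256(key_blob).digest()
    return "SHA256:" + base64.b64encode(digest).rstrip(b"=").decode("ascii")


class Dispatcher:
    """Stores the fingerprints of the public keys uploaded by each user."""

    def __init__(self) -> None:
        self._keys: Dict[str, Set[str]] = {}

    def register_key(self, username: str, public_key_line: str) -> None:
        self._keys.setdefault(username, set()).add(fingerprint(public_key_line))

    def verify(self, username: str, presented: str) -> bool:
        known = self._keys.get(username, set())
        return any(hmac.compare_digest(presented, fp) for fp in known)


class Warehouse:
    """Asks the dispatcher whether a presented key belongs to the user."""

    def __init__(self, dispatcher: Dispatcher) -> None:
        self._dispatcher = dispatcher

    def authenticate(self, username: str, public_key_line: str) -> bool:
        # In a real deployment this would be a network request to the dispatcher.
        return self._dispatcher.verify(username, fingerprint(public_key_line))
```

Storing fingerprints rather than full keys is a design choice of this sketch; the proposal above only says that the dispatcher stores the key.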

Deal with storage usage limits

We cannot assume that every worker will be able to build every possible ZIM file. One reason is that a worker might not have enough disk storage to build an extra-large ZIM file (like Wikipedia in English with videos, or StackOverflow).

For this reason we need:

  • To be able to specify the maximum storage available at the worker level
  • To specify the amount of storage necessary at the task level.

The dispatcher should then dispatch the jobs properly.
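One possible way to dispatch with these two values is sketched below. The `Worker` and `Task` classes and the best-fit policy are assumptions made for illustration; the issue does not prescribe a selection strategy.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Worker:
    name: str
    free_disk: int  # bytes of storage the worker declares as available


@dataclass
class Task:
    name: str
    required_disk: int  # bytes of storage the task is expected to need


def pick_worker(task: Task, workers: Iterable[Worker]) -> Optional[Worker]:
    """Pick a worker with enough storage, or None if no worker qualifies.

    Best fit: choose the smallest sufficient worker so that workers with
    a lot of storage stay available for extra-large ZIM files.
    """
    candidates = [w for w in workers if w.free_disk >= task.required_disk]
    return min(candidates, key=lambda w: w.free_disk, default=None)
```

With this policy a small task never occupies the only worker able to build a very large ZIM file, unless no smaller worker is available.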

Test task recovery

A task needs to be recovered / rerun in the following cases:

  • the worker running the task went offline
  • the task failed (within the retry count)
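The two recovery cases can be expressed as a small predicate, which is also convenient to unit test. The `Task` fields, the status names and the `max_retries` parameter are hypothetical and only serve this sketch.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class Task:
    name: str
    status: str            # assumed values: "running", "failed", "succeeded"
    worker: Optional[str]  # name of the worker the task was assigned to
    attempts: int          # number of times the task has been run so far


def needs_rerun(task: Task, online_workers: Set[str], max_retries: int) -> bool:
    """Return True when the task should be recovered / rerun."""
    # Case 1: the worker running the task went offline.
    if task.status == "running" and task.worker not in online_workers:
        return True
    # Case 2: the task failed and is still within the retry count.
    if task.status == "failed" and task.attempts < max_retries:
        return True
    return False
```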

Wrong ZIM file temporary name

Newer versions use foobar.zim as the temporary file when writing the ZIM file. This is a regression compared to the previous version and makes it impossible to know whether a ZIM file is ready or not. The file should be named foobar.zim.tmp until its creation is fully complete, and only then renamed to foobar.zim.
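The expected behaviour is the usual write-then-rename pattern. A minimal sketch follows; the function name and the way the content is supplied (an iterable of byte chunks) are assumptions, since the actual writer lives in the scraper.

```python
import os
from pathlib import Path
from typing import Iterable


def write_zim(final_path: Path, chunks: Iterable[bytes]) -> Path:
    """Write to <name>.zim.tmp and rename to <name>.zim once complete."""
    tmp_path = final_path.with_name(final_path.name + ".tmp")
    with open(tmp_path, "wb") as fh:
        for chunk in chunks:
            fh.write(chunk)
        fh.flush()
        os.fsync(fh.fileno())  # make sure the data is on disk before renaming
    # os.replace is atomic when both paths are on the same filesystem,
    # so foobar.zim only ever exists as a complete file.
    os.replace(tmp_path, final_path)
    return final_path
```

Any process watching the output folder can then treat the presence of a `.zim` file as a signal that the file is ready.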

Better ssl certificate handling

We currently manage Let's Encrypt certificates manually. It would be great to create a containerized solution.

See how download.kiwix.org currently does it: here and here

master does not start properly

I get a "502 Bad Gateway" for "http://localhost:8080/"

Here is the build log:

Successfully built 585bda2759cb
Creating zimfarm_redis_1
Creating zimfarm_dispatcher_frontend_1
Creating zimfarm_rabbit_1
Creating zimfarm_dispatcher_backend_1
Creating zimfarm_proxy_1
Creating zimfarm_worker_1
Attaching to zimfarm_dispatcher_frontend_1, zimfarm_redis_1, zimfarm_rabbit_1, zimfarm_dispatcher_backend_1, zimfarm_proxy_1, zimfarm_worker_1
redis_1                | WARNING: no logs are available with the 'none' log driver
rabbit_1               | WARNING: no logs are available with the 'none' log driver
dispatcher_frontend_1  | npm info it worked if it ends with ok
dispatcher_frontend_1  | npm info using npm@4.2.0
dispatcher_frontend_1  | npm info using node@v7.10.0
dispatcher_frontend_1  | npm info lifecycle [email protected]~prestart: [email protected]
dispatcher_frontend_1  | 
dispatcher_frontend_1  | > [email protected] prestart /app
dispatcher_frontend_1  | > npm run build
dispatcher_frontend_1  | 
dispatcher_frontend_1  | npm info it worked if it ends with ok
proxy_1                | WARNING: no logs are available with the 'none' log driver
dispatcher_backend_1   |  * Running on http://0.0.0.0:80/ (Press CTRL+C to quit)
dispatcher_backend_1   |  * Restarting with stat
dispatcher_frontend_1  | npm info using npm@4.2.0
dispatcher_frontend_1  | npm info using node@v7.10.0
dispatcher_frontend_1  | npm info lifecycle [email protected]~prebuild: [email protected]
dispatcher_frontend_1  | npm info lifecycle [email protected]~build: [email protected]
dispatcher_frontend_1  | 
dispatcher_frontend_1  | > [email protected] build /app
dispatcher_frontend_1  | > tsc -p src/
dispatcher_frontend_1  | 
worker_1               | Stopping redis-server: redis-server.
dispatcher_backend_1   |  * Debugger is active!
dispatcher_backend_1   |  * Debugger pin code: 958-657-090
worker_1               | Starting redis-server: redis-server.
worker_1               | /usr/local/lib/python3.5/dist-packages/celery/platforms.py:793: RuntimeWarning: You're running the worker with superuser privileges: this is
worker_1               | absolutely not recommended!
worker_1               | 
worker_1               | Please specify a different user using the -u option.
worker_1               | 
worker_1               | User information: uid=0 euid=0 gid=0 egid=0
worker_1               | 
worker_1               |   uid=uid, euid=euid, gid=gid, egid=egid,
worker_1               | [2017-06-18 09:41:50,741: ERROR/MainProcess] consumer: Cannot connect to amqp://admin:**@rabbit:5672//: [Errno 111] Connection refused.
worker_1               | Trying again in 2.00 seconds...
worker_1               | 
dispatcher_frontend_1  | node_modules/@types/jquery/index.d.ts(43,40): error TS1005: ',' expected.
dispatcher_frontend_1  | node_modules/@types/jquery/index.d.ts(2364,40): error TS1005: ',' expected.
dispatcher_frontend_1  | node_modules/@types/jquery/index.d.ts(2366,46): error TS1005: ',' expected.
dispatcher_frontend_1  | node_modules/@types/jquery/index.d.ts(2477,23): error TS1005: ',' expected.
dispatcher_frontend_1  | node_modules/@types/jquery/index.d.ts(2478,17): error TS1005: ',' expected.
dispatcher_frontend_1  | node_modules/@types/jquery/index.d.ts(2479,17): error TS1005: ',' expected.
dispatcher_frontend_1  | node_modules/@types/jquery/index.d.ts(3290,29): error TS1005: ',' expected.
dispatcher_frontend_1  | node_modules/@types/jquery/index.d.ts(3299,37): error TS1005: ',' expected.
dispatcher_frontend_1  | node_modules/@types/jquery/index.d.ts(3586,30): error TS1005: ',' expected.
dispatcher_frontend_1  | node_modules/@types/jquery/index.d.ts(3692,23): error TS1005: ',' expected.
dispatcher_frontend_1  | node_modules/@types/jquery/index.d.ts(3693,21): error TS1005: ',' expected.
dispatcher_frontend_1  | node_modules/@types/jquery/index.d.ts(3698,41): error TS1005: ',' expected.
dispatcher_frontend_1  | node_modules/@types/jquery/index.d.ts(3706,43): error TS1005: ',' expected.
dispatcher_frontend_1  | node_modules/@types/jquery/index.d.ts(3824,42): error TS1005: ',' expected.
dispatcher_frontend_1  | node_modules/@types/jquery/index.d.ts(3824,57): error TS1005: ',' expected.
dispatcher_frontend_1  | node_modules/@types/jquery/index.d.ts(3891,23): error TS1005: ',' expected.
dispatcher_frontend_1  | node_modules/@types/jquery/index.d.ts(3892,21): error TS1005: ',' expected.
dispatcher_frontend_1  | node_modules/@types/jquery/index.d.ts(3893,21): error TS1005: ',' expected.
dispatcher_frontend_1  | node_modules/@types/jquery/index.d.ts(3972,23): error TS1005: ',' expected.
dispatcher_frontend_1  | node_modules/@types/jquery/index.d.ts(3973,21): error TS1005: ',' expected.
dispatcher_frontend_1  | node_modules/@types/jquery/index.d.ts(3974,21): error TS1005: ',' expected.
dispatcher_frontend_1  | node_modules/@types/jquery/index.d.ts(4003,41): error TS1005: ',' expected.
dispatcher_frontend_1  | node_modules/@types/jquery/index.d.ts(4003,56): error TS1005: ',' expected.
dispatcher_frontend_1  | node_modules/@types/jquery/index.d.ts(4012,23): error TS1005: ',' expected.
dispatcher_frontend_1  | node_modules/@types/jquery/index.d.ts(4024,23): error TS1005: ',' expected.
dispatcher_frontend_1  | node_modules/@types/jquery/index.d.ts(4025,21): error TS1005: ',' expected.
dispatcher_frontend_1  | node_modules/@types/jquery/index.d.ts(4026,21): error TS1005: ',' expected.
dispatcher_frontend_1  | node_modules/@types/jquery/index.d.ts(4038,23): error TS1005: ',' expected.
dispatcher_frontend_1  | node_modules/@types/jquery/index.d.ts(4039,21): error TS1005: ',' expected.
dispatcher_frontend_1  | node_modules/@types/jquery/index.d.ts(4040,21): error TS1005: ',' expected.
dispatcher_frontend_1  | node_modules/@types/jquery/index.d.ts(4151,25): error TS1005: ',' expected.
dispatcher_frontend_1  | node_modules/@types/jquery/index.d.ts(4151,46): error TS1005: ',' expected.
dispatcher_frontend_1  | node_modules/@types/jquery/index.d.ts(4151,48): error TS1139: Type parameter declaration expected.
dispatcher_frontend_1  | node_modules/@types/jquery/index.d.ts(4158,31): error TS1005: ',' expected.
dispatcher_frontend_1  | node_modules/@types/jquery/index.d.ts(4165,20): error TS1005: ',' expected.
dispatcher_frontend_1  | node_modules/@types/jquery/index.d.ts(4172,32): error TS1005: ',' expected.
dispatcher_frontend_1  | node_modules/@types/jquery/index.d.ts(4179,25): error TS1005: ',' expected.
dispatcher_frontend_1  | node_modules/@types/jquery/index.d.ts(4186,26): error TS1005: ',' expected.
dispatcher_frontend_1  | node_modules/@types/jquery/index.d.ts(4193,22): error TS1005: ',' expected.
dispatcher_frontend_1  | node_modules/@types/jquery/index.d.ts(4200,22): error TS1005: ',' expected.
dispatcher_frontend_1  | node_modules/@types/jquery/index.d.ts(4207,38): error TS1005: ',' expected.
dispatcher_frontend_1  | node_modules/@types/jquery/index.d.ts(4214,20): error TS1005: ',' expected.
dispatcher_frontend_1  | node_modules/@types/jquery/index.d.ts(4221,24): error TS1005: ',' expected.
dispatcher_frontend_1  | node_modules/@types/jquery/index.d.ts(4228,26): error TS1005: ',' expected.
dispatcher_frontend_1  | node_modules/@types/jquery/index.d.ts(4235,21): error TS1005: ',' expected.
dispatcher_frontend_1  | node_modules/@types/jquery/index.d.ts(4242,22): error TS1005: ',' expected.
dispatcher_frontend_1  | node_modules/@types/jquery/index.d.ts(4258,9): error TS1005: ',' expected.
dispatcher_frontend_1  | node_modules/@types/jquery/index.d.ts(4266,9): error TS1005: ',' expected.
dispatcher_frontend_1  | node_modules/@types/jquery/index.d.ts(4274,9): error TS1005: ',' expected.
dispatcher_frontend_1  | node_modules/@types/jquery/index.d.ts(4282,9): error TS1005: ',' expected.
dispatcher_frontend_1  | node_modules/@types/jquery/index.d.ts(4290,9): error TS1005: ',' expected.
dispatcher_frontend_1  | node_modules/@types/jquery/index.d.ts(4293,29): error TS1005: ',' expected.
dispatcher_frontend_1  | node_modules/@types/jquery/index.d.ts(4306,44): error TS1005: ',' expected.
dispatcher_frontend_1  | node_modules/@types/jquery/index.d.ts(4306,46): error TS1139: Type parameter declaration expected.
dispatcher_frontend_1  | node_modules/@types/jquery/index.d.ts(4307,9): error TS1136: Property assignment expected.
dispatcher_frontend_1  | node_modules/@types/jquery/index.d.ts(4307,14): error TS1005: ',' expected.
dispatcher_frontend_1  | node_modules/@types/jquery/index.d.ts(4307,37): error TS1005: ',' expected.
dispatcher_frontend_1  | node_modules/@types/jquery/index.d.ts(4307,68): error TS1109: Expression expected.
dispatcher_frontend_1  | node_modules/@types/jquery/index.d.ts(4307,77): error TS1005: ',' expected.
dispatcher_frontend_1  | node_modules/@types/jquery/index.d.ts(4307,85): error TS1005: ';' expected.
dispatcher_frontend_1  | node_modules/@types/jquery/index.d.ts(4307,92): error TS1109: Expression expected.
dispatcher_frontend_1  | node_modules/@types/jquery/index.d.ts(4451,1): error TS1128: Declaration or statement expected.
dispatcher_frontend_1  | 
dispatcher_frontend_1  | npm info lifecycle [email protected]~build: Failed to exec build script
dispatcher_frontend_1  | npm ERR! Linux 4.4.0-78-generic
dispatcher_frontend_1  | npm ERR! argv "/usr/local/bin/node" "/usr/local/bin/npm" "run" "build"
dispatcher_frontend_1  | npm ERR! node v7.10.0
dispatcher_frontend_1  | npm ERR! npm  v4.2.0
dispatcher_frontend_1  | npm ERR! code ELIFECYCLE
dispatcher_frontend_1  | npm ERR! errno 2
dispatcher_frontend_1  | npm ERR! [email protected] build: `tsc -p src/`
dispatcher_frontend_1  | npm ERR! Exit status 2
dispatcher_frontend_1  | npm ERR! 
dispatcher_frontend_1  | npm ERR! Failed at the [email protected] build script 'tsc -p src/'.
dispatcher_frontend_1  | npm ERR! Make sure you have the latest version of node.js and npm installed.
dispatcher_frontend_1  | npm ERR! If you do, this is most likely a problem with the zimfarm package,
dispatcher_frontend_1  | npm ERR! not with npm itself.
dispatcher_frontend_1  | npm ERR! Tell the author that this fails on your system:
dispatcher_frontend_1  | npm ERR!     tsc -p src/
dispatcher_frontend_1  | npm ERR! You can get information on how to open an issue for this project with:
dispatcher_frontend_1  | npm ERR!     npm bugs zimfarm
dispatcher_frontend_1  | npm ERR! Or if that isn't available, you can get their info via:
dispatcher_frontend_1  | npm ERR!     npm owner ls zimfarm
dispatcher_frontend_1  | npm ERR! There is likely additional logging output above.
dispatcher_frontend_1  | 
dispatcher_frontend_1  | npm ERR! Please include the following file with any support request:
dispatcher_frontend_1  | npm ERR!     /root/.npm/_logs/2017-06-18T09_41_51_757Z-debug.log
dispatcher_frontend_1  | 
dispatcher_frontend_1  | npm info lifecycle [email protected]~prestart: Failed to exec prestart script
dispatcher_frontend_1  | npm ERR! Linux 4.4.0-78-generic
dispatcher_frontend_1  | npm ERR! argv "/usr/local/bin/node" "/usr/local/bin/npm" "start"
dispatcher_frontend_1  | npm ERR! node v7.10.0
dispatcher_frontend_1  | npm ERR! npm  v4.2.0
dispatcher_frontend_1  | npm ERR! code ELIFECYCLE
dispatcher_frontend_1  | npm ERR! errno 2
dispatcher_frontend_1  | npm ERR! [email protected] prestart: `npm run build`
dispatcher_frontend_1  | npm ERR! Exit status 2
dispatcher_frontend_1  | npm ERR! 
dispatcher_frontend_1  | npm ERR! Failed at the [email protected] prestart script 'npm run build'.
dispatcher_frontend_1  | npm ERR! Make sure you have the latest version of node.js and npm installed.
dispatcher_frontend_1  | npm ERR! If you do, this is most likely a problem with the zimfarm package,
dispatcher_frontend_1  | npm ERR! not with npm itself.
dispatcher_frontend_1  | npm ERR! Tell the author that this fails on your system:
dispatcher_frontend_1  | npm ERR!     npm run build
dispatcher_frontend_1  | npm ERR! You can get information on how to open an issue for this project with:
dispatcher_frontend_1  | npm ERR!     npm bugs zimfarm
dispatcher_frontend_1  | npm ERR! Or if that isn't available, you can get their info via:
dispatcher_frontend_1  | npm ERR!     npm owner ls zimfarm
dispatcher_frontend_1  | npm ERR! There is likely additional logging output above.
dispatcher_frontend_1  | 
dispatcher_frontend_1  | npm ERR! Please include the following file with any support request:
dispatcher_frontend_1  | npm ERR!     /root/.npm/_logs/2017-06-18T09_41_51_781Z-debug.log
zimfarm_dispatcher_frontend_1 exited with code 2
worker_1               | [2017-06-18 09:41:52,757: ERROR/MainProcess] consumer: Cannot connect to amqp://admin:**@rabbit:5672//: [Errno 111] Connection refused.
worker_1               | Trying again in 4.00 seconds...
worker_1               | 
worker_1               | [2017-06-18 09:41:56,973: INFO/MainProcess] Connected to amqp://admin:**@rabbit:5672//
worker_1               | [2017-06-18 09:41:56,988: INFO/MainProcess] mingle: searching for neighbors
worker_1               | [2017-06-18 09:41:58,059: INFO/MainProcess] mingle: all alone
worker_1               | [2017-06-18 09:41:58,099: INFO/MainProcess] celery@f174068cccc8 ready.

Status API not always able to get the correct status

The task query API does not always return the correct result. I am not totally sure why, but I think it is because the rpc result backend is restricted to one queue per client, whereas in our case we have 4 queues (4 uWSGI processes). If the uWSGI process that handles the status query request is not the one that started the task, Celery cannot determine the task's status and hence returns PENDING.

Create sanity-check Zimfarm image

From @Popolechien on November 4, 2018 9:20

I'm looking at http://library.kiwix.org/granbluefantasy_en_all_all_nopic_2018-10/ (the last release of Granblue fantasy wiki) and it is obvious that a bunch of things are broken, rendering the file unusable and a waste of data/download time for users.
It seems to be a rather recent addition to the library, so can we think of some simple confirm/vetting process (a.k.a. Quality control) before adding new zims?

Copied from original issue: openzim/mwoffliner#422

Python error for create zim request

$ curl -X POST -H "token: eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJpc3MiOiJkaXNwYXRjaGVyLWJhY2tlbmQiLCJleHAiOjE1MTEzNzg5MzEsImlhdCI6MTUxMTM3NzEzMSwianRpIjoiMzNmMWVhYmUtOGNjMC00NDUwLWE1MjYtYmEwMzkwY2M2N2YxIiwidXNlcm5hbWUiOiJhZG1pbiIsInNjb3BlIjp7ImFkbWluIjp0cnVlfX0.rS5r_U0wo7ro4N0_c_rKm6IUoPzzsxwiPX1mlq-Z6wc" --data "[{ "mwUrl": "https://bm.wikipedia.org/", "adminEmail": "[email protected]", "verbose": true }]" "https://farm.openzim.org/api/task/mwoffliner"
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
  "http://www.w3.org/TR/html4/loose.dtd">
<html>
  <head>
    <title>TypeError: 'NoneType' object is not iterable // Werkzeug Debugger</title>
    <link rel="stylesheet" href="?__debugger__=yes&amp;cmd=resource&amp;f=style.css"
        type="text/css">
    <!-- We need to make sure this has a favicon so that the debugger does
         not by accident trigger a request to /favicon.ico which might
         change the application state. -->
    <link rel="shortcut icon"
        href="?__debugger__=yes&amp;cmd=resource&amp;f=console.png">
    <script src="?__debugger__=yes&amp;cmd=resource&amp;f=jquery.js"></script>
    <script src="?__debugger__=yes&amp;cmd=resource&amp;f=debugger.js"></script>
    <script type="text/javascript">
      var TRACEBACK = 139747121042936,
          CONSOLE_MODE = false,
          EVALEX = true,
          EVALEX_TRUSTED = false,
          SECRET = "YqXW2eTSMtesh66WWQq3";
    </script>
  </head>
  <body style="background-color: #fff">
    <div class="debugger">
<h1>builtins.TypeError</h1>
<div class="detail">
  <p class="errormsg">TypeError: 'NoneType' object is not iterable</p>
</div>
<h2 class="traceback">Traceback <em>(most recent call last)</em></h2>
<div class="traceback">
  
  <ul><li><div class="frame" id="frame-139747121041536">
  <h4>File <cite class="filename">"/usr/local/lib/python3.6/site-packages/flask/app.py"</cite>,
      line <em class="line">1997</em>,
      in <code class="function">__call__</code></h4>
  <div class="source"><pre class="line before"><span class="ws">                </span>error = None</pre>
<pre class="line before"><span class="ws">            </span>ctx.auto_pop(error)</pre>
<pre class="line before"><span class="ws"></span> </pre>
<pre class="line before"><span class="ws">    </span>def __call__(self, environ, start_response):</pre>
<pre class="line before"><span class="ws">        </span>&quot;&quot;&quot;Shortcut for :attr:`wsgi_app`.&quot;&quot;&quot;</pre>
<pre class="line current"><span class="ws">        </span>return self.wsgi_app(environ, start_response)</pre>
<pre class="line after"><span class="ws"></span> </pre>
<pre class="line after"><span class="ws">    </span>def __repr__(self):</pre>
<pre class="line after"><span class="ws">        </span>return '&lt;%s %r&gt;' % (</pre>
<pre class="line after"><span class="ws">            </span>self.__class__.__name__,</pre>
<pre class="line after"><span class="ws">            </span>self.name,</pre></div>
</div>

<li><div class="frame" id="frame-139747121043384">
  <h4>File <cite class="filename">"/usr/local/lib/python3.6/site-packages/flask/app.py"</cite>,
      line <em class="line">1985</em>,
      in <code class="function">wsgi_app</code></h4>
  <div class="source"><pre class="line before"><span class="ws">        </span>try:</pre>
<pre class="line before"><span class="ws">            </span>try:</pre>
<pre class="line before"><span class="ws">                </span>response = self.full_dispatch_request()</pre>
<pre class="line before"><span class="ws">            </span>except Exception as e:</pre>
<pre class="line before"><span class="ws">                </span>error = e</pre>
<pre class="line current"><span class="ws">                </span>response = self.handle_exception(e)</pre>
<pre class="line after"><span class="ws">            </span>except:</pre>
<pre class="line after"><span class="ws">                </span>error = sys.exc_info()[1]</pre>
<pre class="line after"><span class="ws">                </span>raise</pre>
<pre class="line after"><span class="ws">            </span>return response(environ, start_response)</pre>
<pre class="line after"><span class="ws">        </span>finally:</pre></div>
</div>

<li><div class="frame" id="frame-139747121041760">
  <h4>File <cite class="filename">"/usr/local/lib/python3.6/site-packages/flask/app.py"</cite>,
      line <em class="line">1540</em>,
      in <code class="function">handle_exception</code></h4>
  <div class="source"><pre class="line before"><span class="ws">            </span># if we want to repropagate the exception, we can attempt to</pre>
<pre class="line before"><span class="ws">            </span># raise it with the whole traceback in case we can do that</pre>
<pre class="line before"><span class="ws">            </span># (the function was actually called from the except part)</pre>
<pre class="line before"><span class="ws">            </span># otherwise, we just raise the error again</pre>
<pre class="line before"><span class="ws">            </span>if exc_value is e:</pre>
<pre class="line current"><span class="ws">                </span>reraise(exc_type, exc_value, tb)</pre>
<pre class="line after"><span class="ws">            </span>else:</pre>
<pre class="line after"><span class="ws">                </span>raise e</pre>
<pre class="line after"><span class="ws"></span> </pre>
<pre class="line after"><span class="ws">        </span>self.log_exception((exc_type, exc_value, tb))</pre>
<pre class="line after"><span class="ws">        </span>if handler is None:</pre></div>
</div>

<li><div class="frame" id="frame-139747121045176">
  <h4>File <cite class="filename">"/usr/local/lib/python3.6/site-packages/flask/_compat.py"</cite>,
      line <em class="line">33</em>,
      in <code class="function">reraise</code></h4>
  <div class="source"><pre class="line before"><span class="ws">    </span>from io import StringIO</pre>
<pre class="line before"><span class="ws"></span> </pre>
<pre class="line before"><span class="ws">    </span>def reraise(tp, value, tb=None):</pre>
<pre class="line before"><span class="ws">        </span>if value.__traceback__ is not tb:</pre>
<pre class="line before"><span class="ws">            </span>raise value.with_traceback(tb)</pre>
<pre class="line current"><span class="ws">        </span>raise value</pre>
<pre class="line after"><span class="ws"></span> </pre>
<pre class="line after"><span class="ws">    </span>implements_to_string = _identity</pre>
<pre class="line after"><span class="ws"></span> </pre>
<pre class="line after"><span class="ws"></span>else:</pre>
<pre class="line after"><span class="ws">    </span>text_type = unicode</pre></div>
</div>

<li><div class="frame" id="frame-139747121042712">
  <h4>File <cite class="filename">"/usr/local/lib/python3.6/site-packages/flask/app.py"</cite>,
      line <em class="line">1982</em>,
      in <code class="function">wsgi_app</code></h4>
  <div class="source"><pre class="line before"><span class="ws">        </span>ctx = self.request_context(environ)</pre>
<pre class="line before"><span class="ws">        </span>ctx.push()</pre>
<pre class="line before"><span class="ws">        </span>error = None</pre>
<pre class="line before"><span class="ws">        </span>try:</pre>
<pre class="line before"><span class="ws">            </span>try:</pre>
<pre class="line current"><span class="ws">                </span>response = self.full_dispatch_request()</pre>
<pre class="line after"><span class="ws">            </span>except Exception as e:</pre>
<pre class="line after"><span class="ws">                </span>error = e</pre>
<pre class="line after"><span class="ws">                </span>response = self.handle_exception(e)</pre>
<pre class="line after"><span class="ws">            </span>except:</pre>
<pre class="line after"><span class="ws">                </span>error = sys.exc_info()[1]</pre></div>
</div>

<li><div class="frame" id="frame-139747121045456">
  <h4>File <cite class="filename">"/usr/local/lib/python3.6/site-packages/flask/app.py"</cite>,
      line <em class="line">1614</em>,
      in <code class="function">full_dispatch_request</code></h4>
  <div class="source"><pre class="line before"><span class="ws">            </span>request_started.send(self)</pre>
<pre class="line before"><span class="ws">            </span>rv = self.preprocess_request()</pre>
<pre class="line before"><span class="ws">            </span>if rv is None:</pre>
<pre class="line before"><span class="ws">                </span>rv = self.dispatch_request()</pre>
<pre class="line before"><span class="ws">        </span>except Exception as e:</pre>
<pre class="line current"><span class="ws">            </span>rv = self.handle_user_exception(e)</pre>
<pre class="line after"><span class="ws">        </span>return self.finalize_request(rv)</pre>
<pre class="line after"><span class="ws"></span> </pre>
<pre class="line after"><span class="ws">    </span>def finalize_request(self, rv, from_error_handler=False):</pre>
<pre class="line after"><span class="ws">        </span>&quot;&quot;&quot;Given the return value from a view function this finalizes</pre>
<pre class="line after"><span class="ws">        </span>the request by converting it into a response and invoking the</pre></div>
</div>

<li><div class="frame" id="frame-139747121043104">
  <h4>File <cite class="filename">"/usr/local/lib/python3.6/site-packages/flask/app.py"</cite>,
      line <em class="line">1517</em>,
      in <code class="function">handle_user_exception</code></h4>
  <div class="source"><pre class="line before"><span class="ws">            </span>return self.handle_http_exception(e)</pre>
<pre class="line before"><span class="ws"></span> </pre>
<pre class="line before"><span class="ws">        </span>handler = self._find_error_handler(e)</pre>
<pre class="line before"><span class="ws"></span> </pre>
<pre class="line before"><span class="ws">        </span>if handler is None:</pre>
<pre class="line current"><span class="ws">            </span>reraise(exc_type, exc_value, tb)</pre>
<pre class="line after"><span class="ws">        </span>return handler(e)</pre>
<pre class="line after"><span class="ws"></span> </pre>
<pre class="line after"><span class="ws">    </span>def handle_exception(self, e):</pre>
<pre class="line after"><span class="ws">        </span>&quot;&quot;&quot;Default exception handling that kicks in when an exception</pre>
<pre class="line after"><span class="ws">        </span>occurs that is not caught.  In debug mode the exception will</pre></div>
</div>

<li><div class="frame" id="frame-139747121044112">
  <h4>File <cite class="filename">"/usr/local/lib/python3.6/site-packages/flask/_compat.py"</cite>,
      line <em class="line">33</em>,
      in <code class="function">reraise</code></h4>
  <div class="source"><pre class="line before"><span class="ws">    </span>from io import StringIO</pre>
<pre class="line before"><span class="ws"></span> </pre>
<pre class="line before"><span class="ws">    </span>def reraise(tp, value, tb=None):</pre>
<pre class="line before"><span class="ws">        </span>if value.__traceback__ is not tb:</pre>
<pre class="line before"><span class="ws">            </span>raise value.with_traceback(tb)</pre>
<pre class="line current"><span class="ws">        </span>raise value</pre>
<pre class="line after"><span class="ws"></span> </pre>
<pre class="line after"><span class="ws">    </span>implements_to_string = _identity</pre>
<pre class="line after"><span class="ws"></span> </pre>
<pre class="line after"><span class="ws"></span>else:</pre>
<pre class="line after"><span class="ws">    </span>text_type = unicode</pre></div>
</div>

<li><div class="frame" id="frame-139747121042824">
  <h4>File <cite class="filename">"/usr/local/lib/python3.6/site-packages/flask/app.py"</cite>,
      line <em class="line">1612</em>,
      in <code class="function">full_dispatch_request</code></h4>
  <div class="source"><pre class="line before"><span class="ws">        </span>self.try_trigger_before_first_request_functions()</pre>
<pre class="line before"><span class="ws">        </span>try:</pre>
<pre class="line before"><span class="ws">            </span>request_started.send(self)</pre>
<pre class="line before"><span class="ws">            </span>rv = self.preprocess_request()</pre>
<pre class="line before"><span class="ws">            </span>if rv is None:</pre>
<pre class="line current"><span class="ws">                </span>rv = self.dispatch_request()</pre>
<pre class="line after"><span class="ws">        </span>except Exception as e:</pre>
<pre class="line after"><span class="ws">            </span>rv = self.handle_user_exception(e)</pre>
<pre class="line after"><span class="ws">        </span>return self.finalize_request(rv)</pre>
<pre class="line after"><span class="ws"></span> </pre>
<pre class="line after"><span class="ws">    </span>def finalize_request(self, rv, from_error_handler=False):</pre></div>
</div>

<li><div class="frame" id="frame-139747121045120">
  <h4>File <cite class="filename">"/usr/local/lib/python3.6/site-packages/flask/app.py"</cite>,
      line <em class="line">1598</em>,
      in <code class="function">dispatch_request</code></h4>
  <div class="source"><pre class="line before"><span class="ws">        </span># request came with the OPTIONS method, reply automatically</pre>
<pre class="line before"><span class="ws">        </span>if getattr(rule, 'provide_automatic_options', False) \</pre>
<pre class="line before"><span class="ws">           </span>and req.method == 'OPTIONS':</pre>
<pre class="line before"><span class="ws">            </span>return self.make_default_options_response()</pre>
<pre class="line before"><span class="ws">        </span># otherwise dispatch to the handler for that endpoint</pre>
<pre class="line current"><span class="ws">        </span>return self.view_functions[rule.endpoint](**req.view_args)</pre>
<pre class="line after"><span class="ws"></span> </pre>
<pre class="line after"><span class="ws">    </span>def full_dispatch_request(self):</pre>
<pre class="line after"><span class="ws">        </span>&quot;&quot;&quot;Dispatches the request and on top of that performs request</pre>
<pre class="line after"><span class="ws">        </span>pre and postprocessing as well as HTTP exception catching and</pre>
<pre class="line after"><span class="ws">        </span>error handling.</pre></div>
</div>

<li><div class="frame" id="frame-139747121043496">
  <h4>File <cite class="filename">"/app/routes/task.py"</cite>,
      line <em class="line">51</em>,
      in <code class="function">enqueue_mwoffliner</code></h4>
  <div class="source"><pre class="line before"><span class="ws">            </span>'options': config,</pre>
<pre class="line before"><span class="ws">            </span>'steps': []</pre>
<pre class="line before"><span class="ws">        </span>})</pre>
<pre class="line before"><span class="ws"></span> </pre>
<pre class="line before"><span class="ws">    </span>task_configs = request.get_json()</pre>
<pre class="line current"><span class="ws">    </span>for task_config in task_configs:</pre>
<pre class="line after"><span class="ws">        </span>check_task(task_config)</pre>
<pre class="line after"><span class="ws">    </span>for task_config in task_configs:</pre>
<pre class="line after"><span class="ws">        </span>enqueue_task(task_config)</pre>
<pre class="line after"><span class="ws">    </span>return Response(status=202)</pre>
<pre class="line after"><span class="ws"></span> </pre></div>
</div>
</ul>
  <blockquote>TypeError: 'NoneType' object is not iterable</blockquote>
</div>

<div class="plain">
  <form action="/?__debugger__=yes&amp;cmd=paste" method="post">
    <p>
      <input type="hidden" name="language" value="pytb">
      This is the Copy/Paste friendly version of the traceback.  <span
      class="pastemessage">You can also paste this traceback into
      a <a href="https://gist.github.com/">gist</a>:
      <input type="submit" value="create paste"></span>
    </p>
    <textarea cols="50" rows="10" name="code" readonly>Traceback (most recent call last):
  File &quot;/usr/local/lib/python3.6/site-packages/flask/app.py&quot;, line 1997, in __call__
    return self.wsgi_app(environ, start_response)
  File &quot;/usr/local/lib/python3.6/site-packages/flask/app.py&quot;, line 1985, in wsgi_app
    response = self.handle_exception(e)
  File &quot;/usr/local/lib/python3.6/site-packages/flask/app.py&quot;, line 1540, in handle_exception
    reraise(exc_type, exc_value, tb)
  File &quot;/usr/local/lib/python3.6/site-packages/flask/_compat.py&quot;, line 33, in reraise
    raise value
  File &quot;/usr/local/lib/python3.6/site-packages/flask/app.py&quot;, line 1982, in wsgi_app
    response = self.full_dispatch_request()
  File &quot;/usr/local/lib/python3.6/site-packages/flask/app.py&quot;, line 1614, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File &quot;/usr/local/lib/python3.6/site-packages/flask/app.py&quot;, line 1517, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File &quot;/usr/local/lib/python3.6/site-packages/flask/_compat.py&quot;, line 33, in reraise
    raise value
  File &quot;/usr/local/lib/python3.6/site-packages/flask/app.py&quot;, line 1612, in full_dispatch_request
    rv = self.dispatch_request()
  File &quot;/usr/local/lib/python3.6/site-packages/flask/app.py&quot;, line 1598, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File &quot;/app/routes/task.py&quot;, line 51, in enqueue_mwoffliner
    for task_config in task_configs:
TypeError: 'NoneType' object is not iterable</textarea>
  </form>
</div>
<div class="explanation">
  The debugger caught an exception in your WSGI application.  You can now
  look at the traceback which led to the error.  <span class="nojavascript">
  If you enable JavaScript you can also use additional features such as code
  execution (if the evalex feature is enabled), automatic pasting of the
  exceptions and much more.</span>
</div>
      <div class="footer">
        Brought to you by <strong class="arthur">DON'T PANIC</strong>, your
        friendly Werkzeug powered traceback interpreter.
      </div>
    </div>

    <div class="pin-prompt">
      <div class="inner">
        <h3>Console Locked</h3>
        <p>
          The console is locked and needs to be unlocked by entering the PIN.
          You can find the PIN printed out on the standard output of your
          shell that runs the server.
        <form>
          <p>PIN:
            <input type=text name=pin size=14>
            <input type=submit name=btn value="Confirm Pin">
        </form>
      </div>
    </div>
  </body>
</html>

<!--

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/flask/app.py", line 1997, in __call__
    return self.wsgi_app(environ, start_response)
  File "/usr/local/lib/python3.6/site-packages/flask/app.py", line 1985, in wsgi_app
    response = self.handle_exception(e)
  File "/usr/local/lib/python3.6/site-packages/flask/app.py", line 1540, in handle_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/local/lib/python3.6/site-packages/flask/_compat.py", line 33, in reraise
    raise value
  File "/usr/local/lib/python3.6/site-packages/flask/app.py", line 1982, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/lib/python3.6/site-packages/flask/app.py", line 1614, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/local/lib/python3.6/site-packages/flask/app.py", line 1517, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/local/lib/python3.6/site-packages/flask/_compat.py", line 33, in reraise
    raise value
  File "/usr/local/lib/python3.6/site-packages/flask/app.py", line 1612, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.6/site-packages/flask/app.py", line 1598, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/app/routes/task.py", line 51, in enqueue_mwoffliner
    for task_config in task_configs:
TypeError: 'NoneType' object is not iterable

-->

Discussion: separate database from dispatcher_backend

Currently, we use a SQLite db connection in dispatcher_backend, but I think sooner or later it needs to be 1) separated and 2) replaced with a NoSQL solution.

Why does it need to be separated?

Every time the dispatcher_backend container is rebooted (for example, because of an update), the SQLite file is removed, which means all task history data is lost. If we move the database to another container, we don't have to stop the database container when the dispatcher is updated.

Implement upload solution

As the workers are thought to not run on the same machine as the dispatcher, all generated ZIM files need to be gathered somewhere. That is why we need to be able to upload the files from a worker machine to a dedicated place.

This upload process should match the following requirements:

  • Able to deal with extremely large files (>50GB)
  • user/pass protection (time limited, ideally one pair per upload)
  • ZIM file gathering server does not have to be on the same box as the dispatcher
  • The server side part of the solution should run as a Docker container
  • Log which user has uploaded what
  • Forbids any other action except uploading (listing/delete/... of remote files forbidden)

delete old (incomplete) .tmp file in the warehouse

If, for some reason, an upload cannot be completed, the .tmp file will stay in the warehouse forever. This should be avoided. I would recommend simply adding a cron job that deletes all .tmp files older than 2 weeks.
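
A minimal sketch of such a cleanup, assuming incomplete uploads can be recognised by a `.tmp` suffix and that the file modification time is a good enough proxy for the upload age. The function name and the warehouse directory are hypothetical and not taken from the Zimfarm code.

```python
import os
import time
from pathlib import Path

# Two weeks, the age limit proposed in this issue.
MAX_AGE_SECONDS = 14 * 24 * 60 * 60


def delete_stale_tmp_files(root, max_age_seconds=MAX_AGE_SECONDS, now=None):
    """Delete *.tmp files under root whose mtime is older than max_age_seconds.

    `root` is a hypothetical warehouse directory. Returns the sorted list of
    paths that were deleted.
    """
    now = time.time() if now is None else now
    deleted = []
    for path in Path(root).rglob("*.tmp"):
        if not path.is_file():
            continue
        # Compare the last modification time with the allowed age.
        if now - path.stat().st_mtime > max_age_seconds:
            path.unlink()
            deleted.append(path)
    return sorted(deleted)
```

A daily cron entry could call this function on the upload directory. Files that are still being written keep a recent modification time, so they would not be touched.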

Crashes by running kelson_worker

$ docker logs -f zimfarm_worker
[2019-02-02 14:27:24,214: INFO] Starting Zimfarm Worker...
[2019-02-02 14:27:24,621: INFO] ENV USERNAME -- kelson
[2019-02-02 14:27:24,623: INFO] ENV DISPATCHER_HOST -- farm.openzim.org
[2019-02-02 14:27:24,624: INFO] ENV RABBIT_PORT -- 5671
[2019-02-02 14:27:24,626: INFO] ENV WAREHOUSE_HOST -- warehouse.farm.openzim.org
[2019-02-02 14:27:24,627: INFO] ENV WAREHOUSE_PORT -- 1522
[2019-02-02 14:27:24,629: INFO] ENV WORKING_DIR -- /data/zimfarm/tmp
[2019-02-02 14:27:24,631: INFO] ENV NODE_NAME -- kelson_worker
[2019-02-02 14:27:24,632: INFO] ENV QUEUES -- small
[2019-02-02 14:27:26,111: ERROR] SFTP auth check failed -- please double check your username and private key.
[2019-02-02 14:46:04,366: INFO] Starting Zimfarm Worker...
[2019-02-02 14:46:04,774: INFO] ENV USERNAME -- kelson
[2019-02-02 14:46:04,775: INFO] ENV DISPATCHER_HOST -- farm.openzim.org
[2019-02-02 14:46:04,776: INFO] ENV RABBIT_PORT -- 5671
[2019-02-02 14:46:04,777: INFO] ENV WAREHOUSE_HOST -- warehouse.farm.openzim.org
[2019-02-02 14:46:04,777: INFO] ENV WAREHOUSE_PORT -- 1522
[2019-02-02 14:46:04,778: INFO] ENV WORKING_DIR -- /data/zimfarm/tmp
[2019-02-02 14:46:04,779: INFO] ENV NODE_NAME -- kelson_worker
[2019-02-02 14:46:04,780: INFO] ENV QUEUES -- small
[2019-02-02 14:46:05,334: INFO] SFTP auth check success.
/usr/local/lib/python3.6/site-packages/celery/platforms.py:796: RuntimeWarning: You're running the worker with superuser privileges: this is
absolutely not recommended!

Please specify a different user using the --uid option.

User information: uid=0 euid=0 gid=0 egid=0

  uid=uid, euid=euid, gid=gid, egid=egid,
[2019-02-02 14:46:06,434: INFO/MainProcess] Connected to amqps://kelson:**@farm.openzim.org:5671/zimfarm
[2019-02-02 14:46:06,743: INFO/MainProcess] mingle: searching for neighbors
[2019-02-02 14:46:08,360: INFO/MainProcess] mingle: all alone
[2019-02-02 14:46:08,994: INFO/MainProcess] kelson@kelson_worker ready.
[2019-02-02 14:46:09,000: INFO/MainProcess] Received task: offliner.mwoffliner[5c5129548c127b00217cb4af]
[2019-02-02 14:46:09,010: INFO/MainProcess] Received task: offliner.mwoffliner[5c5129548c127b00217cb4b2]
[2019-02-02 14:46:09,158: INFO/MainProcess] Received task: offliner.mwoffliner[5c5129558c127b00217cb4b5]
[2019-02-02 14:46:09,169: INFO/MainProcess] Received task: offliner.mwoffliner[5c5129558c127b00217cb4b8]
/usr/local/lib/python3.6/site-packages/celery/platforms.py:796: RuntimeWarning: You're running the worker with superuser privileges: this is
absolutely not recommended!

Please specify a different user using the --uid option.

User information: uid=0 euid=0 gid=0 egid=0

  uid=uid, euid=euid, gid=gid, egid=egid,
[2019-02-02 14:46:25,858: ERROR/ForkPoolWorker-2] offliner.mwoffliner[5c5129548c127b00217cb4af]: task failed
[2019-02-02 14:46:25,861: ERROR/ForkPoolWorker-2] Task offliner.mwoffliner[5c5129548c127b00217cb4af] raised unexpected: APIError(HTTPError('409 Client Error: Conflict for url: http+docker://localhost/v1.35/containers/create?name=zimfarm_redis',),)
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/docker/api/client.py", line 256, in _raise_for_status
    response.raise_for_status()
  File "/usr/local/lib/python3.6/site-packages/requests/models.py", line 940, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 409 Client Error: Conflict for url: http+docker://localhost/v1.35/containers/create?name=zimfarm_redis

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/celery/app/trace.py", line 382, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/celery/app/trace.py", line 641, in __protected_call__
    return self.run(*args, **kwargs)
  File "/usr/src/app/tasks/mwoffliner.py", line 41, in run
    run_redis.execute()
  File "/usr/src/app/operations/run_redis.py", line 31, in execute
    self.docker.containers.run('redis', detach=True, name=self.container_name)
  File "/usr/local/lib/python3.6/site-packages/docker/models/containers.py", line 785, in run
    detach=detach, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/docker/models/containers.py", line 843, in create
    resp = self.client.api.create_container(**create_kwargs)
  File "/usr/local/lib/python3.6/site-packages/docker/api/container.py", line 427, in create_container
    return self.create_container_from_config(config, name)
  File "/usr/local/lib/python3.6/site-packages/docker/api/container.py", line 438, in create_container_from_config
    return self._result(res, True)
  File "/usr/local/lib/python3.6/site-packages/docker/api/client.py", line 262, in _result
    self._raise_for_status(response)
  File "/usr/local/lib/python3.6/site-packages/docker/api/client.py", line 258, in _raise_for_status
    raise create_api_error_from_http_exception(e)
  File "/usr/local/lib/python3.6/site-packages/docker/errors.py", line 31, in create_api_error_from_http_exception
    raise cls(e, response=response, explanation=explanation)
docker.errors.APIError: 409 Client Error: Conflict ("Conflict. The container name "/zimfarm_redis" is already in use by container "e85ce7f7fe3d000f4824879405268e63c846af7d47a325e4655f17659d22e926". You have to remove (or rename) that container to be able to reuse that name.")
[2019-02-02 14:46:25,935: INFO/MainProcess] Received task: offliner.mwoffliner[5c5129558c127b00217cb4bb]
[2019-02-02 14:46:27,667: ERROR/ForkPoolWorker-2] offliner.mwoffliner[5c5129558c127b00217cb4b5]: task failed
[2019-02-02 14:46:27,675: ERROR/ForkPoolWorker-2] Task offliner.mwoffliner[5c5129558c127b00217cb4b5] raised unexpected: APIError(HTTPError('409 Client Error: Conflict for url: http+docker://localhost/v1.35/containers/create?name=zimfarm_redis',),)
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/docker/api/client.py", line 256, in _raise_for_status
    response.raise_for_status()
  File "/usr/local/lib/python3.6/site-packages/requests/models.py", line 940, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 409 Client Error: Conflict for url: http+docker://localhost/v1.35/containers/create?name=zimfarm_redis

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/celery/app/trace.py", line 382, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/celery/app/trace.py", line 641, in __protected_call__
    return self.run(*args, **kwargs)
  File "/usr/src/app/tasks/mwoffliner.py", line 41, in run
    run_redis.execute()
  File "/usr/src/app/operations/run_redis.py", line 31, in execute
    self.docker.containers.run('redis', detach=True, name=self.container_name)
  File "/usr/local/lib/python3.6/site-packages/docker/models/containers.py", line 785, in run
    detach=detach, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/docker/models/containers.py", line 843, in create
    resp = self.client.api.create_container(**create_kwargs)
  File "/usr/local/lib/python3.6/site-packages/docker/api/container.py", line 427, in create_container
    return self.create_container_from_config(config, name)
  File "/usr/local/lib/python3.6/site-packages/docker/api/container.py", line 438, in create_container_from_config
    return self._result(res, True)
  File "/usr/local/lib/python3.6/site-packages/docker/api/client.py", line 262, in _result
    self._raise_for_status(response)
  File "/usr/local/lib/python3.6/site-packages/docker/api/client.py", line 258, in _raise_for_status
    raise create_api_error_from_http_exception(e)
  File "/usr/local/lib/python3.6/site-packages/docker/errors.py", line 31, in create_api_error_from_http_exception
    raise cls(e, response=response, explanation=explanation)
docker.errors.APIError: 409 Client Error: Conflict ("Conflict. The container name "/zimfarm_redis" is already in use by container "e85ce7f7fe3d000f4824879405268e63c846af7d47a325e4655f17659d22e926". You have to remove (or rename) that container to be able to reuse that name.")
[2019-02-02 14:46:27,748: INFO/MainProcess] Received task: offliner.mwoffliner[5c5129558c127b00217cb4be]
/usr/local/lib/python3.6/site-packages/celery/platforms.py:796: RuntimeWarning: You're running the worker with superuser privileges: this is
absolutely not recommended!

Please specify a different user using the --uid option.

User information: uid=0 euid=0 gid=0 egid=0

  uid=uid, euid=euid, gid=gid, egid=egid,

worker approach: generic or specialized

Regarding how workers should be designed, there are two approaches: generic and specialized. We need to choose one to move forward.

Note: the word task used in this issue refers to things like mwoffliner and maintenance.

Generic

The dispatcher sends the name of a script to the worker, and the worker downloads the script and executes it. Alternatively, the dispatcher sends the content of the script directly to the worker. In both situations, the worker trusts that the script it receives is legitimate and executes it.

Pros:

  • convenience: no need to configure new worker for new task type

Cons:

  • security risk: the script / command could be modified during transmission
  • security risk: docker socket of host is exposed to worker

Specialized

Every worker is specialized, i.e., one type of worker can only handle zim file generation, while another type can only handle maintenance tasks in the dispatcher. Only parameters / settings are transferred from the dispatcher to the worker. The worker then does the necessary work. In the case of mwoffliner, that means making sure a redis server is running, running the mwoffliner command with the parameters, and uploading the file.

Pros:

  • much less security risk:
    • truly containerized, no socket sharing
    • worker can only run permitted commands
    • commands run in exec mode, not in shell
  • more detailed progress reporting: upload progress, etc

Cons:

  • inconvenience: have to create new worker type for new task type

worker auth

  • rabbitmq auth
  • dispatcher task status update API needs auth

add user workflow:

  • admin adds user in management panel (with generated initial password)
    • dispatcher adds user in database
    • dispatcher adds user in rabbitmq with correct permissions

worker start up workflow:

  • worker starts up with username and password (as docker cli env)

Add queue support

Different kinds of zim file generation require different amounts of processing power. For large zim files, it might be better to generate them on a more powerful machine. It would be a good idea to add support for queues.

When scheduling a task, the user can choose which queue the task goes to. Based on the file size, we might have the following queues: huge, large, medium, small, tiny, corresponding to 1000000+, 100000+, 10000+, 1000+, 100+ articles in the zim file.

When a worker starts, the user can choose to join one or more queues.
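
As an illustration of the thresholds above, here is a small sketch that maps an article count to a queue name. The function name is hypothetical, and sending counts below 100 to the tiny queue is an assumption, since the proposal does not say where such tasks should go.

```python
# (queue name, minimum article count), ordered from largest to smallest,
# using the thresholds listed in this issue.
QUEUE_THRESHOLDS = [
    ("huge", 1_000_000),
    ("large", 100_000),
    ("medium", 10_000),
    ("small", 1_000),
    ("tiny", 100),
]


def queue_for_article_count(article_count):
    """Return the name of the first queue whose threshold the count reaches."""
    if article_count < 0:
        raise ValueError("article_count must not be negative")
    for name, minimum in QUEUE_THRESHOLDS:
        if article_count >= minimum:
            return name
    # Assumption: anything below the smallest threshold still goes to "tiny".
    return "tiny"
```

A scheduler could use this as a default suggestion while still letting the user pick a different queue.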

Suggest adding a contributions file

A minor suggestion – not urgent and not a complaint – but future contributors may benefit from the addition of a CONTRIBUTING.md file at the top level of the repo. The file could describe the process(es) for how people can contribute to the project. (Examples of contributors' guidelines are plentiful on GitHub; I'm familiar with EDGI's but there are other examples.)

Proposal: every celery task should execute mwoffliner directly, instead of running a shell script

Our current worker implementation is to run a shell script (examples), which seems to execute mwoffliner multiple times and generate multiple zim files. In this way, every celery task could generate multiple zim files.

I think, instead of doing this, it would be better to run mwoffliner once and generate one zim file per celery task. Reasons:

  1. security: if we allow users to enqueue celery tasks that execute any command, we expose workers to shell injection attacks, accidental file deletion, etc. It's better to set the mwoffliner parameters in the dispatcher and assemble the command programmatically on the worker.
  2. distributed system performance: by breaking big tasks into smaller units, more workers can potentially participate at the same time, thereby speeding up the overall process.
  3. management:
    • the stdout & stderr contain messages regarding a single zim file generation
    • every celery task uploads one zim file, so it is easier to figure out the upload progress and ETA
  4. error recovery: if every celery task produces multiple zim files, then in case of error it is
    • hard to figure out which generated zim files have errors and which don't
    • impossible for another worker to pick up the process without unnecessarily regenerating zim files that have no errors
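
To make the security point concrete, here is a rough sketch of how a worker might assemble the command from parameters instead of receiving a script. The function name and the allow-list are hypothetical. `mwUrl` and `adminEmail` are used only as an illustrative subset of mwoffliner options, and a real worker would need to check the full list against the mwoffliner version it runs.

```python
# Illustrative subset of option names the worker is willing to pass on.
ALLOWED_OPTIONS = frozenset({"mwUrl", "adminEmail"})


def build_mwoffliner_command(options, allowed=ALLOWED_OPTIONS):
    """Build an argument list for mwoffliner from a dict of parameters.

    Only allow-listed option names are accepted, and each value is kept in
    a single "--name=value" argument, so it can never become a new flag.
    """
    argv = ["mwoffliner"]
    for name in sorted(options):
        if name not in allowed:
            raise ValueError("option not permitted: {!r}".format(name))
        value = options[name]
        if not isinstance(value, str):
            raise ValueError("value for {!r} must be a string".format(name))
        argv.append("--{}={}".format(name, value))
    return argv
```

The resulting list is meant to be passed to something like `subprocess.run(argv, check=True)` without `shell=True`, so shell metacharacters in a value are not interpreted.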

GET /task should be more versatile

The GET /task API just returns a list of tasks in the order they were added to the database. It should be more sophisticated. We should add features including:

  • limit and offset: fetch the first x tasks, fetch x to x+n tasks
  • sort by: by time finished, time created, etc
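
The two features above could look roughly like this on an in-memory list. This is only a sketch: the field names `created` and `finished`, the default values, and the function name are assumptions, and a real implementation would push sorting and slicing down to the database query.

```python
# Hypothetical field names a client may sort by.
SORTABLE_FIELDS = frozenset({"created", "finished"})


def list_tasks(tasks, limit=20, offset=0, sort_by="created", descending=True):
    """Return one page of tasks, sorted by the requested field.

    Tasks whose sort field is missing or None (for example, unfinished
    tasks when sorting by "finished") are placed at the end.
    """
    if sort_by not in SORTABLE_FIELDS:
        raise ValueError("cannot sort by {!r}".format(sort_by))
    if limit < 0 or offset < 0:
        raise ValueError("limit and offset must not be negative")
    present = [t for t in tasks if t.get(sort_by) is not None]
    missing = [t for t in tasks if t.get(sort_by) is None]
    ordered = sorted(present, key=lambda t: t[sort_by], reverse=descending)
    ordered += missing
    # limit/offset: skip `offset` tasks, then return at most `limit` tasks.
    return ordered[offset:offset + limit]
```

With this shape, a request such as `GET /task?limit=10&offset=20&sort_by=finished` (hypothetical query parameters) would map directly onto the function arguments.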

Add more contextual and introductory descriptions

I'm learning about ZIM and related efforts, and came across this repo, but am having trouble wrapping my head around what ZIM farm is about. I realize that my difficulty is entirely because I don't have sufficient background knowledge in this particular effort (or ZIM in general, for that matter); my point in bringing this up here is that I think I may not be alone, and adding some additional explanations may help other people in the future.

The front README has the following brief explanation:

A farm operated by bots to grow and harvest new zim files. User can submit a new zim file generate task through the website and a registered worker will run the task and upload the file back to the dispatcher.

But to an outsider such as me, there is not enough context to understand this summary. For example, when it says "harvest new zim files", where are those files being harvested from? When it says "upload the file back to the dispatcher", what is the dispatcher? What is its role? (It's not even clear whether it's part of ZIM farm.) And finally, why would one want to harvest ZIM files in the first place?

A section on background or a longer introduction might clear up these questions for me and others.

In any case, thank you for your efforts on ZIM and the associated infrastructure.

All backend APIs should try catch

For the sake of fast development, I did not use try/catch in the response APIs, so they are not protected against invalid input or exceptions like KeyError. To make the APIs resilient, all exceptions should be handled.
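
The traceback above, where enqueue_mwoffliner fails with "TypeError: 'NoneType' object is not iterable", is an example of this: the parsed JSON body was None and the view iterated over it anyway. Below is a minimal sketch of validating the payload before using it. The exception class and function name are hypothetical, and mapping the exception to an HTTP 400 response is left to the view.

```python
class InvalidPayload(ValueError):
    """Raised when a request body is unusable; a view could map it to HTTP 400."""


def parse_task_configs(payload):
    """Check that payload is a list of dicts and return it unchanged.

    `payload` stands for the parsed request body, e.g. the result of
    request.get_json() in a Flask view, which may be None.
    """
    if payload is None:
        raise InvalidPayload("request body must be JSON")
    if not isinstance(payload, list):
        raise InvalidPayload("expected a JSON array of task configs")
    for index, item in enumerate(payload):
        if not isinstance(item, dict):
            raise InvalidPayload("item {} must be a JSON object".format(index))
    return payload
```

A view would call this inside a try/except block and return an error response with the message instead of letting the exception reach the debugger page.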
