Dask is a flexible parallel computing library for analytics. See documentation for more information.
New BSD. See License File.
Docker images for dask
Home Page: https://hub.docker.com/u/daskdev
License: BSD 3-Clause "New" or "Revised" License
The latest Docker Hub image is 2.13 instead of 2.15, which was released on 24-04. I noticed this while using dask-kubernetes and client.get_versions(check=True).
I guess this is tied to: https://github.com/dask/dask-docker/pull/94/files
If that pull request gets merged, does it solve the issue? Thanks
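For anyone else hitting this, a quick way to confirm which Dask version an image actually ships (the tag is illustrative; this mirrors the pip freeze check used elsewhere in these issues):
docker run --rm --entrypoint python daskdev/dask:latest -c "import dask; print(dask.__version__)"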
Hi,
could you add something like the following to prepare.sh?
if [ "$EXTRA_APT_PACKAGES" ]; then
    echo "EXTRA_APT_PACKAGES environment variable found. Installing."
    apt-get update -y
    apt-get install -y $EXTRA_APT_PACKAGES
fi
Thanks
Hi,
I'm trying to use hdfs3 with a distributed notebook to save/read Parquet files. However, after adding hdfs3 fastparquet to EXTRA_CONDA_PACKAGES, running the notebook fails with:
ImportError: Can not find the shared library: libhdfs3.so
But the file exists: /opt/conda/lib/libhdfs3.so
Maybe we should add this directory via a file in /etc/ld.so.conf.d/?
Thanks
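A possible fix along those lines, sketched as commands run inside the image (assuming the conda prefix is /opt/conda and root access):
echo "/opt/conda/lib" > /etc/ld.so.conf.d/conda.conf   # register the conda lib dir with the dynamic linker
ldconfig                                               # rebuild the shared-library cache so libhdfs3.so can be found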
I am trying to bring up the three Docker images with docker-compose, but I get the following errors:
C:\cygwin64\home\fcgr\code\dask-docker>docker-compose up
Starting dask-docker_worker_1 ... done
Starting dask-docker_scheduler_1 ... done
Recreating dask-docker_notebook_1 ... done
Attaching to dask-docker_worker_1, dask-docker_notebook_1, dask-docker_scheduler_1
worker_1 | [dumb-init] /usr/bin/prepare.sh: No such file or directory
notebook_1 | [FATAL tini (6)] exec /usr/bin/prepare.sh failed: No such file or directory
scheduler_1 | [dumb-init] /usr/bin/prepare.sh: No such file or directory
dask-docker_worker_1 exited with code 2
dask-docker_notebook_1 exited with code 127
dask-docker_scheduler_1 exited with code 2
I am using Docker version 18.06.1-ce-win73.
What happened:
After installing Java and dask-sql using pip, whenever I try to run a SQL query from my Python code I get the following error:
...
File "/home/vquery/.local/lib/python3.8/site-packages/dask_sql/context.py", line 378, in sql
rel, select_names, _ = self._get_ral(sql)
File "/home/vquery/.local/lib/python3.8/site-packages/dask_sql/context.py", line 515, in _get_ral
nonOptimizedRelNode = generator.getRelationalAlgebra(validatedSqlNode)
java.lang.java.lang.IllegalStateException: java.lang.IllegalStateException: Unable to instantiate java compiler
...
...
File "JaninoRelMetadataProvider.java", line 426, in org.apache.calcite.rel.metadata.JaninoRelMetadataProvider.compile
File "CompilerFactoryFactory.java", line 61, in org.codehaus.commons.compiler.CompilerFactoryFactory.getDefaultCompilerFactory
java.lang.java.lang.NullPointerException: java.lang.NullPointerException
What you expected to happen:
I should get a dataframe as a result.
Minimal Complete Verifiable Example:
# The cluster/client setup is done first, in another module not the one executing the SQL query
# Also tried other cluster/scheduler types with the same error
from dask.distributed import Client, LocalCluster
cluster = LocalCluster(
    n_workers=4,
    threads_per_worker=1,
    processes=False,
    dashboard_address=':8787',
    asynchronous=False,
    memory_limit='1GB'
)
client = Client(cluster)
# The SQL code is executed in its own module
import dask.dataframe as dd
from dask_sql import Context
c = Context()
df = dd.read_parquet('/vQuery/files/results/US_Accidents_June20.parquet')
c.register_dask_table(df, 'df')
df = c.sql("""select ID, Source from df""") # This line fails with the error reported
Anything else we need to know?:
As mentioned in the code snippet above, due to the way my application is designed, the Dask client/cluster setup is done before the dask-sql Context is created.
Environment:
Install steps
$ sudo apt install default-jre
$ sudo apt install default-jdk
$ java -version
openjdk version "11.0.10" 2021-01-19
OpenJDK Runtime Environment (build 11.0.10+9-Ubuntu-0ubuntu1.20.04)
OpenJDK 64-Bit Server VM (build 11.0.10+9-Ubuntu-0ubuntu1.20.04, mixed mode, sharing)
$ javac -version
javac 11.0.10
$ echo $JAVA_HOME
/usr/lib/jvm/java-11-openjdk-amd64
$ pip install dask-sql
$ pip list | grep dask-sql
dask-sql 0.3.1
Platform: Ubuntu 16.04, Docker version 18.03.0-ce, docker-compose version 1.21.0
Exception when docker-compose up
scheduler_1 | Future exception was never retrieved
scheduler_1 | future: <Future finished exception=KeyError('op',)>
scheduler_1 | Traceback (most recent call last):
scheduler_1 | File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 1113, in run
scheduler_1 | yielded = self.gen.send(value)
scheduler_1 | File "/opt/conda/lib/python3.6/site-packages/distributed/core.py", line 276, in handle_comm
scheduler_1 | op = msg.pop('op')
scheduler_1 | KeyError: 'op'
(the same traceback is repeated two more times)
It would be really useful if the default builds supported ARM; docker buildx makes this a lot easier than it used to be.
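For example, something along these lines could produce multi-arch images (builder setup, tag, and build context are illustrative):
docker buildx create --use
docker buildx build --platform linux/amd64,linux/arm64 -t daskdev/dask:latest --push ./base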
Currently the ENTRYPOINT and CMD entries in the base image look like the following:
ENTRYPOINT ["/usr/local/bin/dumb-init", "--"]
CMD ["bash", "-c", "/usr/bin/prepare.sh && exec dask-scheduler"]
Often we also use this image for the dask-worker process and replace the command with dask-worker. My understanding is that this will stop the prepare script from running properly. Is my understanding correct? If so, is there a clean way to move some of the arguments in the CMD line up to ENTRYPOINT?
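As far as I can tell from the ENTRYPOINT/CMD shown above, replacing the command with a bare dask-worker does skip prepare.sh, since the script is only invoked inside the default CMD. Until it is moved into the entrypoint, one workaround is to keep the same bash -c wrapper when overriding the command (scheduler address illustrative):
docker run daskdev/dask bash -c "/usr/bin/prepare.sh && exec dask-worker tcp://scheduler:8786"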
Due to a version mismatch, a client created in the notebook image cannot connect to a scheduler running in the base image. This is related to #36 and is likely a symptom of the two images having different bases.
Using a different base image or pinning dependencies is a possible remedy.
notebook:
base:
Hello,
I'm wondering: is this software released under a license, like Apache 2.0, MIT, etc.? If so, could you add a LICENSE, like so: https://help.github.com/en/github/creating-cloning-and-archiving-repositories/licensing-a-repository
Dask itself is licensed under New BSD: https://github.com/dask/dask#license
Thank you!
Our new auto-tag GitHub Actions job is failing with the following error message (see this CI build):
Warning: Unexpected input(s) 'GITHUB_TOKEN', valid inputs are ['source_file', 'extraction_regex', 'tag_format', 'tag_message']
Run jaliborc/action-general-autotag@1.0.0
with:
GITHUB_TOKEN: ***
source_file: .github/workflows/build.yml
extraction_regex: \s*"release"\s*:\s*"([\d\.]+)"\s*
tag_format: {version}
(node:1574) Warning: require() of ES modules is not supported.
require() of /home/runner/work/_actions/jaliborc/action-general-autotag/1.0.0/main.js is an ES module file as it is a .js file whose nearest parent package.json contains "type": "module" which defines all .js files in that package scope as ES modules.
Instead rename main.js to end in .cjs, change the requiring code to use import(), or remove "type": "module" from /home/runner/work/_actions/jaliborc/action-general-autotag/1.0.0/package.json.
Error: no match was found for the regex '/\s*"release"\s*:\s*"([\d\.]+)"\s*/'.
Note there is both a warning about GITHUB_TOKEN (which stems from the action being used, see Jaliborc/action-general-autotag#3) and an error about no matching regex being found.
I have an environment.yml file. It installs 13 packages from conda and 23 packages from PyPI. Installing takes time. It'd be nice if this could be accelerated on repeated builds. I think this would be possible by caching the downloads/installs.
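One possible approach, sketched below: reuse the layer cache from a previously built image so the conda/pip install layer is only re-run when environment.yml actually changes (the image name is illustrative):
docker pull myimage:latest || true                            # fetch the previous build, if any, as a cache source
docker build --cache-from myimage:latest -t myimage:latest .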
Since both dask and distributed now support Python 3.8, should we bump our docker images here to use Python 3.8 too?
docker-compose scale worker=10
ERROR: for daskdocker_worker_6 driver failed programming external connectivity on endpoint daskdocker_worker_6 (2e9c2498d0638292dadc59aa550b310103d7b9c82c258f929f568db0c994269a): Bind for 0.0.0.0:8789 failed: port is already allocated
For example, in the base image python comes from the main channel:
$ docker run -it --rm daskdev/dask conda list python
...
+ conda list python
# packages in environment at /opt/conda:
#
# Name Version Build Channel
msgpack-python 0.5.6 py36h6bb024c_1
python 3.6.5 hc3d631a_2
python-blosc 1.5.1 py36h14c3975_2
python-dateutil 2.7.3 py36_0
But in the notebook image, it comes from conda-forge:
docker run -it --rm daskdev/dask-notebook conda list python
...
+ conda list python
# packages in environment at /opt/conda:
#
# Name Version Build Channel
ipython 6.5.0 py36_0 conda-forge
ipython_genutils 0.2.0 py_1 conda-forge
msgpack-python 0.5.6 py36h2d50403_3 conda-forge
python 3.6.5 1 conda-forge
python-blosc 1.4.4 py36_0 conda-forge
python-dateutil 2.7.3 py_0 conda-forge
python-editor 1.0.3 py36_0 conda-forge
python-oauth2 1.0.1 py36_0 conda-forge
I know this feature is considered experimental, but is there any chance of making cachey a dependency?
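In the meantime, it can be pulled in at container start through the existing EXTRA_CONDA_PACKAGES hook in prepare.sh, e.g. (channel choice is an assumption):
docker run -e EXTRA_CONDA_PACKAGES="cachey -c conda-forge" daskdev/dask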
Trying to install the build-essential package for some pip requirements, but it can't get through because of the following error:
EXTRA_APT_PACKAGES environment variable found. Installing.
+ echo 'EXTRA_APT_PACKAGES environment variable found. Installing.'
+ apt update -y
WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
Reading package lists...
E: List directory /var/lib/apt/lists/partial is missing. - Acquire (13: Permission denied)
+ apt install -y build-essential
WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
E: Could not open lock file /var/lib/dpkg/lock-frontend - open (13: Permission denied)
E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), are you root?
docker-compose build base-notebook
Building base-notebook
#1 [internal] load git source github.com/jupyter/docker-stacks.git#master:base-notebook
#1 ERROR: subdir not supported yet
------
> [internal] load git source github.com/jupyter/docker-stacks.git#master:base-notebook:
------
failed to solve with frontend dockerfile.v0: failed to read dockerfile: failed to load cache key: subdir not supported yet
Service 'base-notebook' failed to build : Build failed
We should probably update the numpy and pandas versions that we pin at some point. We should verify that examples.dask.org continues to build and run nicely with these updates.
The command specified for worker in docker-compose.yml fails with a "command not found" error and the worker service crashes. Setting it to a two-item list (i.e. ["dask-worker", "scheduler:8786"]) solves this problem.
Host system: OS X 10.13.6
Docker: 18.06.1-ce-mac73 (26764)
Compose: 1.22.0
What do you propose to do with this, versus the Dockerfile at https://github.com/martindurant/dask-kubernetes/blob/master/Dockerfile ?
Is there a particular reason for having
FROM continuumio/miniconda3:4.7.12
instead of
FROM continuumio/miniconda3:4.8.2
Testing a docker-compose setup with one scheduler, one worker, and one client
docker-compose.yml:
version: "3.1"
services:
scheduler:
image: daskdev/dask
hostname: dask-scheduler
ports:
- "8786:8786"
- "8787:8787"
command: ["dask-scheduler"]
worker:
image: daskdev/dask
hostname: dask-worker
command: ["dask-worker", "tcp://scheduler:8786"]
client:
build: client
environment:
- DASK_SCHEDULER_ADDRESS=scheduler:8786
command: ["python", "script.py"]
client Dockerfile
FROM python:3.8-slim
ENV VIRTUAL_ENV=/opt/env
RUN python3 -m venv $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"
WORKDIR /app
COPY ./requirements.txt /app/requirements.txt
RUN apt-get update \
&& apt-get install gcc -y \
&& apt-get clean
RUN pip install -r /app/requirements.txt \
&& rm -rf /root/.cache/pip
COPY . /app/
client script
import os
from dask.distributed import Client
dask_scheduler = os.getenv("DASK_SCHEDULER_ADDRESS")
cl = Client(dask_scheduler)
print(cl)
repo available here: https://github.com/hcorrada/test-dask
What happened:
Client could not connect to scheduler:
Traceback (most recent call last):
File "/opt/env/lib/python3.8/site-packages/distributed/comm/core.py", line 313, in connect
_raise(error)
File "/opt/env/lib/python3.8/site-packages/distributed/comm/core.py", line 266, in _raise
raise IOError(msg)
OSError: Timed out trying to connect to 'tcp://scheduler:8786' after 10 s: connect() didn't finish in time
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "script.py", line 22, in <module>
cl = Client(dask_scheduler)
File "/opt/env/lib/python3.8/site-packages/distributed/client.py", line 744, in __init__
self.start(timeout=timeout)
File "/opt/env/lib/python3.8/site-packages/distributed/client.py", line 949, in start
sync(self.loop, self._start, **kwargs)
File "/opt/env/lib/python3.8/site-packages/distributed/utils.py", line 339, in sync
raise exc.with_traceback(tb)
File "/opt/env/lib/python3.8/site-packages/distributed/utils.py", line 323, in f
result[0] = yield future
File "/opt/env/lib/python3.8/site-packages/tornado/gen.py", line 735, in run
value = future.result()
File "/opt/env/lib/python3.8/site-packages/distributed/client.py", line 1046, in _start
await self._ensure_connected(timeout=timeout)
File "/opt/env/lib/python3.8/site-packages/distributed/client.py", line 1103, in _ensure_connected
comm = await connect(
File "/opt/env/lib/python3.8/site-packages/distributed/comm/core.py", line 325, in connect
_raise(error)
File "/opt/env/lib/python3.8/site-packages/distributed/comm/core.py", line 266, in _raise
raise IOError(msg)
OSError: Timed out trying to connect to 'tcp://scheduler:8786' after 10 s: Timed out trying to connect to 'tcp://scheduler:8786' after 10 s: connect() didn't finish in time
What you expected to happen:
Client to connect
Minimal Complete Verifiable Example:
above
Environment:
It would be nice to support JupyterLab extensions in dask-notebook.
Something like this in prepare.sh:
if [ "$EXTRA_JL_EXTENSIONS" ]; then
    echo "EXTRA_JL_EXTENSIONS environment variable found. Installing."
    jupyter labextension install $EXTRA_JL_EXTENSIONS
fi
In order for a Docker image to be used with the Daskhub helm chart, it needs dask-gateway and jupyterhub-singleuser to be installed.
Neither of our official images have those packages so I propose we either add them to the notebook image or create a new image specifically for Daskhub.
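As a stopgap, the packages can also be added at container start via the existing EXTRA_CONDA_PACKAGES hook rather than baking a new image (this assumes jupyterhub-singleuser comes from the jupyterhub package and that conda-forge is an acceptable channel):
docker run -e EXTRA_CONDA_PACKAGES="dask-gateway jupyterhub -c conda-forge" daskdev/dask-notebook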
I'm using your fantastic dask notebook image, thanks for creating it!
I would like to pass some extra parameters to the notebook as command line arguments. This doesn't seem to be possible using the standard prepare.sh script, so I need to customize the image with an additional layer to enable this. A possible solution would be an environment variable, say JUPYTERLAB_ARGS, that contains any additional command line args to be passed through, or an empty string otherwise. The line of code in question: link
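Something like the following in prepare.sh is what I have in mind; this is only a sketch, and the exact jupyter lab invocation the image uses may differ:
if [ "$JUPYTERLAB_ARGS" ]; then
    echo "JUPYTERLAB_ARGS environment variable found. Passing to JupyterLab."
fi
exec jupyter lab $JUPYTERLAB_ARGS   # append the extra args to whatever command normally starts the notebook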
@jrbourbeau and I are in the process of moving the default branch for this repo from master to main.
Once the name on GitHub is changed (the first box above is checked, or this issue closed), when you try to git pull you'll get:
Your configuration specifies to merge with the ref 'refs/heads/master'
from the remote, but no such ref was fetched.
First: head to your fork and rename the default branch there
Then:
git branch -m master main
git fetch origin
git branch -u origin/main main
As noted in dask/distributed#3209, the helm/kubernetes documentation for dask leads to an issue if the client computer has lz4 installed.
It may be better to include the lz4 conda package in the daskdev/dask image to avoid this.
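For reference, you can check what the image currently ships (mirroring the conda list checks used elsewhere in these issues):
docker run --rm daskdev/dask conda list lz4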
Running through the example notebooks in daskdev/dask-notebook:1.1.1, I encountered this error on the last cell of examples/03-dataframes-timeseries.ipynb:
df_small.rolling(100).mean().visualize(rankdir='LR')
/opt/conda/lib/python3.7/site-packages/dask/dataframe/utils.py:390: FutureWarning: Creating a DatetimeIndex by passing range endpoints is deprecated. Use `pandas.date_range` instead.
tz=idx.tz, name=idx.name)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/opt/conda/lib/python3.7/site-packages/pandas/core/window.py in _prep_values(self, values, kill_inf)
210 try:
--> 211 values = ensure_float64(values)
212 except (ValueError, TypeError):
pandas/_libs/algos_common_helper.pxi in pandas._libs.algos.ensure_float64()
ValueError: could not convert string to float: 'foo'
During handling of the above exception, another exception occurred:
TypeError Traceback (most recent call last)
<ipython-input-13-0432fe458397> in <module>
----> 1 df_small.rolling(100).mean().visualize(rankdir='LR')
/opt/conda/lib/python3.7/site-packages/dask/dataframe/rolling.py in mean(self)
261 @derived_from(pd_Rolling)
262 def mean(self):
--> 263 return self._call_method('mean')
264
265 @derived_from(pd_Rolling)
/opt/conda/lib/python3.7/site-packages/dask/dataframe/rolling.py in _call_method(self, method_name, *args, **kwargs)
229 rolling_kwargs = self._rolling_kwargs()
230 meta = pandas_rolling_method(self.obj._meta_nonempty, rolling_kwargs,
--> 231 method_name, *args, **kwargs)
232
233 if self._has_single_partition:
/opt/conda/lib/python3.7/site-packages/dask/dataframe/rolling.py in pandas_rolling_method(df, rolling_kwargs, name, *args, **kwargs)
180 def pandas_rolling_method(df, rolling_kwargs, name, *args, **kwargs):
181 rolling = df.rolling(**rolling_kwargs)
--> 182 return getattr(rolling, name)(*args, **kwargs)
183
184
/opt/conda/lib/python3.7/site-packages/pandas/core/window.py in mean(self, *args, **kwargs)
1726 def mean(self, *args, **kwargs):
1727 nv.validate_rolling_func('mean', args, kwargs)
-> 1728 return super(Rolling, self).mean(*args, **kwargs)
1729
1730 @Substitution(name='rolling')
/opt/conda/lib/python3.7/site-packages/pandas/core/window.py in mean(self, *args, **kwargs)
1070 def mean(self, *args, **kwargs):
1071 nv.validate_window_func('mean', args, kwargs)
-> 1072 return self._apply('roll_mean', 'mean', **kwargs)
1073
1074 _shared_docs['median'] = dedent("""
/opt/conda/lib/python3.7/site-packages/pandas/core/window.py in _apply(self, func, name, window, center, check_minp, **kwargs)
839 results = []
840 for b in blocks:
--> 841 values = self._prep_values(b.values)
842
843 if values.size == 0:
/opt/conda/lib/python3.7/site-packages/pandas/core/window.py in _prep_values(self, values, kill_inf)
212 except (ValueError, TypeError):
213 raise TypeError("cannot handle this type -> {0}"
--> 214 "".format(values.dtype))
215
216 if kill_inf:
TypeError: cannot handle this type -> object
It would be nice for this to be kept in line with dask releases automatically.
Perhaps some CI task which raises a PR to update versions?
What happened:
I am not fully sure whether this is a bug or whether it is due to an incorrect setup/installation.
However, I am using the provided docker-compose to test a local dockerized instance of Dask, but I can't execute any job on it.
Currently, I simply tried a few of the provided example notebooks (e.g. number 4), and they did not run correctly. The following error is returned: AttributeError: 'MaterializedLayer' object has no attribute 'pack_annotations'
Here is the stack trace:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-3-a7bc8667f5ea> in <module>
----> 1 x = x.persist()
2 progress(x)
/opt/conda/lib/python3.8/site-packages/dask/base.py in persist(self, **kwargs)
253 dask.base.persist
254 """
--> 255 (result,) = persist(self, traverse=False, **kwargs)
256 return result
257
/opt/conda/lib/python3.8/site-packages/dask/base.py in persist(*args, **kwargs)
754 else:
755 if client.get == schedule:
--> 756 results = client.persist(
757 collections, optimize_graph=optimize_graph, **kwargs
758 )
/opt/conda/lib/python3.8/site-packages/distributed/client.py in persist(self, collections, optimize_graph, workers, allow_other_workers, resources, retries, priority, fifo_timeout, actors, **kwargs)
2942 names = {k for c in collections for k in flatten(c.__dask_keys__())}
2943
-> 2944 futures = self._graph_to_futures(
2945 dsk,
2946 names,
/opt/conda/lib/python3.8/site-packages/distributed/client.py in _graph_to_futures(self, dsk, keys, workers, allow_other_workers, priority, user_priority, resources, retries, fifo_timeout, actors)
2541 dsk = HighLevelGraph.from_collections(id(dsk), dsk, dependencies=())
2542
-> 2543 dsk = highlevelgraph_pack(dsk, self, keyset)
2544
2545 annotations = {}
/opt/conda/lib/python3.8/site-packages/distributed/protocol/highlevelgraph.py in highlevelgraph_pack(hlg, client, client_keys)
113 "__module__": None,
114 "__name__": None,
--> 115 "state": _materialized_layer_pack(
116 layer,
117 hlg.get_all_external_keys(),
/opt/conda/lib/python3.8/site-packages/distributed/protocol/highlevelgraph.py in _materialized_layer_pack(layer, all_keys, known_key_dependencies, client, client_keys)
63 }
64
---> 65 annotations = layer.pack_annotations()
66 all_keys = all_keys.union(dsk)
67 dsk = {stringify(k): stringify(v, exclusive=all_keys) for k, v in dsk.items()}
AttributeError: 'MaterializedLayer' object has no attribute 'pack_annotations'
What you expected to happen:
Computation should start on the dask cluster
Minimal Complete Verifiable Example:
Run docker-compose up, connect to the Jupyter Notebook and execute e.g. notebook 04, or paste this:
from dask.distributed import Client, progress
c = Client()
import dask.array as da
x = da.random.random(size=(10000, 10000), chunks=(1000, 1000))
x = x.persist()
progress(x)
Anything else we need to know?:
Environment:
Printing the distributed client object returns the following:
/opt/conda/lib/python3.8/site-packages/distributed/client.py:1135: VersionMismatchWarning: Mismatched versions found
+---------+---------------+---------------+---------------+
| Package | client | scheduler | workers |
+---------+---------------+---------------+---------------+
| blosc | 1.10.2 | 1.9.2 | 1.9.2 |
| lz4 | 3.1.3 | 3.1.1 | 3.1.1 |
| msgpack | 1.0.2 | 1.0.0 | 1.0.0 |
| python | 3.8.6.final.0 | 3.8.0.final.0 | 3.8.0.final.0 |
+---------+---------------+---------------+---------------+
Notes:
- msgpack: Variation is ok, as long as everything is above 0.6
warnings.warn(version_module.VersionMismatchWarning(msg[0]["warning"]))
Due to changes in the Travis CI billing, the Dask org is migrating CI to GitHub Actions.
This repo contains a .travis.yml file which needs to be replaced with an equivalent .github/workflows/ci.yml file.
See dask/community#107 for more details.
Current pulls are failing with the error "error pulling image configuration: unknown blob".
$ docker pull daskdev/dask:latest
latest: Pulling from daskdev/dask
b8f262c62ec6: Already exists
0a43c0154f16: Already exists
906d7b5da8fb: Already exists
3568180997ed: Pulling fs layer
555c313ecf5a: Pulling fs layer
218fd3c9fea3: Pulling fs layer
error pulling image configuration: unknown blob
Reproduced locally and on CI for dask-kubernetes.
@TomAugspurger I'd like to add your dask-ml notebook to our standard helm install, which uses this docker image by default. Is the notebook that you placed in the pangeo docker image the correct one to copy here as well? I can do this, I just wanted to check with you that you haven't made changes since then.
It would appear that sudo access has not been provided for the default 'jovyan' user, or credentials are missing in the documentation, but my Docker research tells me that this is intentional for security reasons. The bigger issue is how to run the Dask Docker image granting sudo access, which does not currently seem to be possible, as all iterations of docker run seem to fail completely with a 'no environment.yml' error:
>>> docker run --rm -it -p 8888:8888 -e GRANT_SUDO="yes" --user root daskdev/dask
+ '[' '' ']'
+ '[' -e /opt/app/environment.yml ']'
+ echo 'no environment.yml'
no environment.yml
+ '[' '' ']'
+ '[' '' ']'
and I have not been able to find any documentation explaining how to achieve sudo access in the docker-compose.yml file.
Hello, I tried docker-compose up and got:
Get:93 http://archive.ubuntu.com/ubuntu bionic/main amd64 vim-runtime all 2:8.0.1453-1ubuntu1 [5,437 kB]
Get:94 http://archive.ubuntu.com/ubuntu bionic/main amd64 vim amd64 2:8.0.1453-1ubuntu1 [1,152 kB]
debconf: delaying package configuration, since apt-utils is not installed
Fetched 31.8 MB in 22s (1,446 kB/s)
Selecting previously unselected package multiarch-support.
(Reading database ... 5067 files and directories currently installed.)
Preparing to unpack .../multiarch-support_2.27-3ubuntu1_amd64.deb ...
Unpacking multiarch-support (2.27-3ubuntu1) ...
dpkg: error: error creating new backup file '/var/lib/dpkg/status-old':
Invalid cross-device link
E: Sub-process /usr/bin/dpkg returned an error code (2)
ERROR: Service 'notebook' failed to build: The command '/bin/sh -c apt-get update
&& apt-get install -yq --no-install-recommends graphviz git vim
&& apt-get clean && rm -rf /var/lib/apt/lists/*' returned a non-zero code: 100
I use Docker version 18.09.0-ce, build 4d60db472b
Thanks
Title says it all
$ docker pull daskdev/dask:1.1.5
$ docker run --entrypoint python daskdev/dask:1.1.5 -m pip freeze | grep dask
dask==1.1.5
I am not sure what is causing the following error, as I haven't changed anything; it just seems to have started happening after rebuilding my images.
dask-scheduler_1 | distributed.core - ERROR - update_graph() got an unexpected keyword argument 'actors'
dask-scheduler_1 | Traceback (most recent call last):
dask-scheduler_1 | File "/opt/conda/lib/python3.6/site-packages/distributed/core.py", line 321, in handle_comm
dask-scheduler_1 | result = yield result
dask-scheduler_1 | File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
dask-scheduler_1 | value = future.result()
dask-scheduler_1 | File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 1113, in run
dask-scheduler_1 | yielded = self.gen.send(value)
dask-scheduler_1 | File "/opt/conda/lib/python3.6/site-packages/distributed/scheduler.py", line 1923, in add_client
dask-scheduler_1 | yield self.handle_stream(comm=comm, extra={'client': client})
dask-scheduler_1 | File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
dask-scheduler_1 | value = future.result()
dask-scheduler_1 | File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 1113, in run
dask-scheduler_1 | yielded = self.gen.send(value)
dask-scheduler_1 | File "/opt/conda/lib/python3.6/site-packages/distributed/core.py", line 375, in handle_stream
dask-scheduler_1 | handler(**merge(extra, msg))
dask-scheduler_1 | TypeError: update_graph() got an unexpected keyword argument 'actors'
dask-scheduler_1 | distributed.scheduler - INFO - Receive client connection: Client-b183835c-d2e2-11e8-8001-0242ac130007
dask-scheduler_1 | distributed.core - INFO - Starting established connection
Here is my docker-compose file:
version: '3.3'
services:
  dask-scheduler:
    image: daskdev/dask:0.18.1
    command: ["dask-scheduler"]
    volumes:
      - ./mnt:/mnt
    env_file:
      - .env
      - .env.local
    ports:
      - 8786:8786
      - 8787:8787
  worker:
    image: daskdev/dask:0.18.1
    command: ["dask-worker", "dask-scheduler:8786"]
    volumes:
      - ./mnt:/mnt
    env_file:
      - .env
      - .env.local
    depends_on:
      - dask-scheduler
I have tried deleting all images, containers and networks and then rebuilding everything, but when I start up my service I get the error above. I haven't changed my code, and I get access to the client with the following:
def get_client() -> Client:
    client = Client('dask-scheduler:8786')
    while not client.scheduler_info()['workers']:
        print('Workers are asleep')
        time.sleep(1)
    client.restart()
    client.upload_file('/usr/app/src/hash.py')
    return client
I have taken all the code after client out; it seems to make no difference. It seems to throw this error the moment I try to use the client, and all the futures return as cancelled.
Here is example code I am running which seems to trigger my error:
futures = client.map(run_hash(dir), batches)
for future in as_completed(futures):
    if future.status == 'finished':
        files += future.result()
    else:
        print('{} - {}'.format(future.status, future.exception()))
I am using 0.18.1 because the Helm charts currently use Dask version 0.18.1. Any help would be appreciated.
This is the script I'm running:
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
import numpy as np
import pandas as pd

X, y = make_classification(n_samples=1000, random_state=0)
X[:5]
param_grid = {"C": np.logspace(-3, 1, 30),
              "gamma": [0.05, 0.5, 2],
              "kernel": ['rbf', 'poly', 'sigmoid'],
              "shrinking": [True, False]}
grid_search = GridSearchCV(SVC(gamma='auto', random_state=0, probability=True),
                           param_grid=param_grid,
                           return_train_score=False,
                           iid=True,
                           cv=3,
                           n_jobs=-1)
from sklearn.externals import joblib
dask_scheduler = '172.17.0.1'
with joblib.parallel_backend('dask', scheduler_host=dask_scheduler + ":8786", scatter=[X, y]):
    grid_search.fit(X, y)
I get the following error:
distributed.protocol.pickle - INFO - Failed to deserialize b'\x80\x04\x95Q\x03\x00\x00\x00\x00\x00\x00\x8c\x0cjoblib._dask\x94\x8c\x05Batch\x94\x93\x94]\x94\x8c#sklearn.model_selection._validation\x94\x8c\x0e_fit_and_score\x94\x93\x94]\x94(\x8c\x13sklearn.svm.classes\x94\x8c\x03SVC\x94\x93\x94)\x81\x94}\x94(\x8c\x17decision_function_shape\x94\x8c\x03ovr\x94\x8c\x06kernel\x94\x8c\x03rbf\x94\x8c\x06degree\x94K\x03\x8c\x05gamma\x94\x8c\x04auto\x94\x8c\x05coef0\x94G\x00\x00\x00\x00\x00\x00\x00\x00\x8c\x03tol\x94G?PbM\xd2\xf1\xa9\xfc\x8c\x01C\x94G?\xf0\x00\x00\x00\x00\x00\x00\x8c\x02nu\x94G\x00\x00\x00\x00\x00\x00\x00\x00\x8c\x07epsilon\x94G\x00\x00\x00\x00\x00\x00\x00\x00\x8c\tshrinking\x94\x88\x8c\x0bprobability\x94\x88\x8c\ncache_size\x94K\xc8\x8c\x0cclass_weight\x94N\x8c\x07verbose\x94\x89\x8c\x08max_iter\x94J\xff\xff\xff\xff\x8c\x0crandom_state\x94K\x00\x8c\x10_sklearn_version\x94\x8c\x060.21.2\x94ub\x8c\x11distributed.utils\x94\x8c\nitemgetter\x94\x93\x94K\x00\x85\x94R\x94h$K\x01\x85\x94R\x94e}\x94(\x8c\x05train\x94h$K\x02\x85\x94R\x94\x8c\x04test\x94h$K\x03\x85\x94R\x94\x8c\nparameters\x94}\x94(h\x16\x8c\x15numpy.core.multiarray\x94\x8c\x06scalar\x94\x93\x94\x8c\x05numpy\x94\x8c\x05dtype\x94\x93\x94\x8c\x02f8\x94K\x00K\x01\x87\x94R\x94(K\x03\x8c\x01<\x94NNNJ\xff\xff\xff\xffJ\xff\xff\xff\xffK\x00t\x94bC\x08\xfc\xa9\xf1\xd2MbP?\x94\x86\x94R\x94h\x12G?\xa9\x99\x99\x99\x99\x99\x9ah\x0fh\x10h\x19\x88u\x8c\x06scorer\x94}\x94\x8c\x05score\x94\x8c\x16sklearn.metrics.scorer\x94\x8c\x13_passthrough_scorer\x94\x93\x94s\x8c\nfit_params\x94}\x94\x8c\x12return_train_score\x94\x89\x8c\x15return_n_test_samples\x94\x88\x8c\x0creturn_times\x94\x88\x8c\x11return_parameters\x94\x89\x8c\x0berror_score\x94\x8c\x11raise-deprecating\x94h\x1dK\x00u\x87\x94a\x85\x94R\x94.'
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/distributed/protocol/pickle.py", line 61, in loads
return pickle.loads(x)
ModuleNotFoundError: No module named 'joblib'
distributed.worker - WARNING - Could not deserialize task
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/distributed/worker.py", line 1272, in add_task
self.tasks[key] = _deserialize(function, args, kwargs, task)
File "/opt/conda/lib/python3.7/site-packages/distributed/worker.py", line 3060, in _deserialize
function = pickle.loads(function)
File "/opt/conda/lib/python3.7/site-packages/distributed/protocol/pickle.py", line 61, in loads
return pickle.loads(x)
ModuleNotFoundError: No module named 'joblib'
Today I've been debugging something which ended up being a Python 3.7 vs 3.8 issue, where my cluster was using Docker and 3.8 and my local conda environment was 3.7.
I wonder if it would be useful to build multiple images with different Python versions?
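A quick sanity check that would have caught this, comparing the image's interpreter with the local one (tag illustrative):
docker run --rm daskdev/dask:latest python --version   # Python inside the cluster image
python --version                                       # Python in the local conda environment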
Hi,
after deploying a Dask chart on GC using Helm, I've updated the environment with some extra conda packages, specifically xarray and rasterio.
If I try to run my code, I get this error back in the worker logs:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/tornado/ioloop.py", line 743, in _run_callback
    ret = callback()
  File "/opt/conda/lib/python3.7/site-packages/tornado/ioloop.py", line 767, in _discard_future_result
    future.result()
  File "/opt/conda/lib/python3.7/site-packages/tornado/gen.py", line 742, in run
    yielded = self.gen.throw(*exc_info)  # type: ignore
  File "/opt/conda/lib/python3.7/site-packages/distributed/worker.py", line 661, in handle_scheduler
    self.ensure_computing])
  File "/opt/conda/lib/python3.7/site-packages/tornado/gen.py", line 735, in run
    value = future.result()
  File "/opt/conda/lib/python3.7/site-packages/tornado/gen.py", line 742, in run
    yielded = self.gen.throw(*exc_info)  # type: ignore
  File "/opt/conda/lib/python3.7/site-packages/distributed/core.py", line 386, in handle_stream
    msgs = yield comm.read()
  File "/opt/conda/lib/python3.7/site-packages/tornado/gen.py", line 735, in run
    value = future.result()
  File "/opt/conda/lib/python3.7/site-packages/tornado/gen.py", line 742, in run
    yielded = self.gen.throw(*exc_info)  # type: ignore
  File "/opt/conda/lib/python3.7/site-packages/distributed/comm/tcp.py", line 206, in read
    deserializers=deserializers)
  File "/opt/conda/lib/python3.7/site-packages/tornado/gen.py", line 735, in run
    value = future.result()
  File "/opt/conda/lib/python3.7/site-packages/tornado/gen.py", line 209, in wrapper
    yielded = next(result)
  File "/opt/conda/lib/python3.7/site-packages/distributed/comm/utils.py", line 82, in from_frames
    res = _from_frames()
  File "/opt/conda/lib/python3.7/site-packages/distributed/comm/utils.py", line 68, in _from_frames
    deserializers=deserializers)
  File "/opt/conda/lib/python3.7/site-packages/distributed/protocol/core.py", line 132, in loads
    value = _deserialize(head, fs, deserializers=deserializers)
  File "/opt/conda/lib/python3.7/site-packages/distributed/protocol/serialize.py", line 184, in deserialize
    return loads(header, frames)
  File "/opt/conda/lib/python3.7/site-packages/distributed/protocol/serialize.py", line 57, in pickle_loads
    return pickle.loads(b''.join(frames))
  File "/opt/conda/lib/python3.7/site-packages/distributed/protocol/pickle.py", line 59, in loads
    return pickle.loads(x)
  File "/opt/conda/lib/python3.7/site-packages/rasterio/__init__.py", line 22, in <module>
    from rasterio._base import gdal_version
ImportError: libzstd.so.1: cannot open shared object file: No such file or directory
From my understanding, the problem seems to be a missing or corrupted libzstd library, am I right?
Any ideas?
I pulled the latest images of daskdev/dask and daskdev/dask-notebook from Docker Hub. It sounds like there is a version mismatch between them.
Moving forward to investigate my Dask dataframe, it raised an error. I'm new to Dask and not sure this version mismatch is really the root cause, but that's what the error message complained about!
When using this container as part of the dask-distributed helm chart, one faces the issue of inconsistent Python versions.
I think we should stop providing 2.7 by default and instead use Python 3.6 by default for this image. Furthermore, it would be great to find a way to pass an environment.yml that would be used to initialise the images both for Jupyter and the Dask workers (and scheduler), maybe using the ENTRYPOINT feature of the Dockerfile, so that all containers share compatible versions of Python and shared libraries (dask and distributed in particular) and so avoid weird pickling and protocol issues.
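Note that prepare.sh already looks for /opt/app/environment.yml (see the 'no environment.yml' output quoted in other issues here), so one way to share a single environment across containers today is to mount the same file into each of them, roughly:
docker run -v "$(pwd)/environment.yml:/opt/app/environment.yml" daskdev/dask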
I'd like to have some clarifications about best practices for Dockerfiles. I'm referring to this file. Is it better to use an env.yml file instead of listing all the packages to install after RUN conda install --yes \? If yes, does prepare.sh deal with the environment name, or should I add an extra step?
FROM continuumio/miniconda3:4.7.12
RUN conda install --yes \
-c conda-forge \
python=3.7.3 \
python-blosc \
cytoolz \
dask==2.14.0 \
msgpack-python=1.0.0 \
nomkl \
numpy=1.17.5 \
pandas==0.25.3 \
numba=0.49.1 \
pyarrow=0.16.0 \
cython=0.29.21 \
boto3==1.14.39 \
s3fs==0.4.2 \
tini==0.18.0 \
rpy2=2.9.1 \
r-base=3.5.0 \
r-devtools=2.3.0 \
r-stringr=1.4.0 \
r-stringi=1.4.3 \
r-curl=4.3 \
&& conda clean -tipsy \
&& find /opt/conda/ -type f,l -name '*.a' -delete \
&& find /opt/conda/ -type f,l -name '*.pyc' -delete \
&& find /opt/conda/ -type f,l -name '*.js.map' -delete \
&& find /opt/conda/lib/python*/site-packages/bokeh/server/static -type f,l -name '*.js' -not -name '*.min.js' -delete \
&& rm -rf /opt/conda/pkgs
# Install R libraries
COPY packages.R /packages.R
RUN ln -s /bin/tar /bin/gtar
RUN Rscript packages.R
COPY prepare.sh /usr/bin/prepare.sh
RUN mkdir /opt/app
where packages.R is:
options(Ncpus = parallel::detectCores())
library(devtools)
devtools::install_github("robjhyndman/anomalous")
I get an error compiling several R libraries but if I comment out (or add after R installation) the lines
&& conda clean -tipsy \
&& find /opt/conda/ -type f,l -name '*.a' -delete \
&& find /opt/conda/ -type f,l -name '*.pyc' -delete \
&& find /opt/conda/ -type f,l -name '*.js.map' -delete \
&& find /opt/conda/lib/python*/site-packages/bokeh/server/static -type f,l -name '*.js' -not -name '*.min.js' -delete \
&& rm -rf /opt/conda/pkgs
it works fine. I don't quite understand why this is happening, but I think it would be great to better document this or comment the Dockerfile accordingly. I could help with it.
Hello folks,
I want to install pyarrow as an additional conda dependency. The installation works, but when I try to import it the following error appears: ImportError: libdouble-conversion.so.3: cannot open shared object file: No such file or directory
This does not happen with daskdev/dask.
Below are the commands to reproduce the error and the full console output of the error (I'm sorry for installing s3fs also, but it shouldn't make a difference):
Commands:
sudo docker run -it -e EXTRA_CONDA_PACKAGES="pyarrow s3fs -c conda-forge" daskdev/dask-notebook:2.16.0 bash
python
import pyarrow
Console output:
sudo docker run -it -e EXTRA_CONDA_PACKAGES="pyarrow s3fs -c conda-forge" daskdev/dask-notebook:2.16.0 bash
+ '[' '' ']'
+ '[' -e /opt/app/environment.yml ']'
+ echo 'no environment.yml'
no environment.yml
+ '[' 'pyarrow s3fs -c conda-forge' ']'
+ echo 'EXTRA_CONDA_PACKAGES environment variable found. Installing.'
EXTRA_CONDA_PACKAGES environment variable found. Installing.
+ /opt/conda/bin/conda install -y pyarrow s3fs -c conda-forge
Collecting package metadata (current_repodata.json): done
Solving environment: done
==> WARNING: A newer version of conda exists. <==
current version: 4.7.12
latest version: 4.8.3
Please update conda by running
$ conda update -n base conda
## Package Plan ##
environment location: /opt/conda
added / updated specs:
- pyarrow
- s3fs
The following packages will be downloaded:
package | build
---------------------------|-----------------
arrow-cpp-0.13.0 | py37hdbb9910_4 3.5 MB conda-forge
boost-cpp-1.70.0 | h8e57a91_2 21.1 MB conda-forge
boto3-1.13.16 | pyh9f0ad1d_0 69 KB conda-forge
botocore-1.16.16 | pyh9f0ad1d_0 3.8 MB conda-forge
brotli-1.0.7 | he1b5a44_1001 1.0 MB conda-forge
bzip2-1.0.8 | h516909a_2 396 KB conda-forge
docutils-0.15.2 | py37_0 736 KB conda-forge
gflags-2.2.2 | he1b5a44_1002 175 KB conda-forge
glog-0.4.0 | h49b9bf7_3 104 KB conda-forge
jmespath-0.10.0 | pyh9f0ad1d_0 21 KB conda-forge
libevent-2.1.10 | h72c5cf5_0 1.3 MB conda-forge
libprotobuf-3.7.1 | h8b12597_0 4.6 MB conda-forge
lz4-3.0.2 | py37hb076c26_1 43 KB conda-forge
lz4-c-1.8.3 | he1b5a44_1001 187 KB conda-forge
parquet-cpp-1.5.1 | 2 3 KB conda-forge
pyarrow-0.13.0 | py37h8b68381_2 2.2 MB conda-forge
re2-2019.08.01 | he6710b0_0 456 KB defaults
s3fs-0.4.2 | py_0 21 KB conda-forge
s3transfer-0.3.3 | py37hc8dfbb8_1 90 KB conda-forge
snappy-1.1.8 | he1b5a44_1 39 KB conda-forge
thrift-cpp-0.12.0 | hf3afdfd_1004 2.4 MB conda-forge
------------------------------------------------------------
Total: 42.2 MB
The following NEW packages will be INSTALLED:
arrow-cpp conda-forge/linux-64::arrow-cpp-0.13.0-py37hdbb9910_4
boost-cpp conda-forge/linux-64::boost-cpp-1.70.0-h8e57a91_2
boto3 conda-forge/noarch::boto3-1.13.16-pyh9f0ad1d_0
botocore conda-forge/noarch::botocore-1.16.16-pyh9f0ad1d_0
brotli conda-forge/linux-64::brotli-1.0.7-he1b5a44_1001
bzip2 conda-forge/linux-64::bzip2-1.0.8-h516909a_2
docutils conda-forge/linux-64::docutils-0.15.2-py37_0
gflags conda-forge/linux-64::gflags-2.2.2-he1b5a44_1002
glog conda-forge/linux-64::glog-0.4.0-h49b9bf7_3
jmespath conda-forge/noarch::jmespath-0.10.0-pyh9f0ad1d_0
libevent conda-forge/linux-64::libevent-2.1.10-h72c5cf5_0
libprotobuf conda-forge/linux-64::libprotobuf-3.7.1-h8b12597_0
parquet-cpp conda-forge/noarch::parquet-cpp-1.5.1-2
pyarrow conda-forge/linux-64::pyarrow-0.13.0-py37h8b68381_2
re2 pkgs/main/linux-64::re2-2019.08.01-he6710b0_0
s3fs conda-forge/noarch::s3fs-0.4.2-py_0
s3transfer conda-forge/linux-64::s3transfer-0.3.3-py37hc8dfbb8_1
snappy conda-forge/linux-64::snappy-1.1.8-he1b5a44_1
thrift-cpp conda-forge/linux-64::thrift-cpp-0.12.0-hf3afdfd_1004
The following packages will be DOWNGRADED:
lz4 3.0.2-py37h5a7ed16_2 --> 3.0.2-py37hb076c26_1
lz4-c 1.9.2-he1b5a44_1 --> 1.8.3-he1b5a44_1001
Downloading and Extracting Packages
arrow-cpp-0.13.0 | 3.5 MB | ############################################### | 100%
botocore-1.16.16 | 3.8 MB | ############################################### | 100%
s3fs-0.4.2 | 21 KB | ############################################### | 100%
lz4-c-1.8.3 | 187 KB | ############################################### | 100%
boost-cpp-1.70.0 | 21.1 MB | ############################################### | 100%
pyarrow-0.13.0 | 2.2 MB | ############################################### | 100%
s3transfer-0.3.3 | 90 KB | ############################################### | 100%
libevent-2.1.10 | 1.3 MB | ############################################### | 100%
bzip2-1.0.8 | 396 KB | ############################################### | 100%
jmespath-0.10.0 | 21 KB | ############################################### | 100%
glog-0.4.0 | 104 KB | ############################################### | 100%
gflags-2.2.2 | 175 KB | ############################################### | 100%
docutils-0.15.2 | 736 KB | ############################################### | 100%
boto3-1.13.16 | 69 KB | ############################################### | 100%
re2-2019.08.01 | 456 KB | ############################################### | 100%
thrift-cpp-0.12.0 | 2.4 MB | ############################################### | 100%
lz4-3.0.2 | 43 KB | ############################################### | 100%
brotli-1.0.7 | 1.0 MB | ############################################### | 100%
snappy-1.1.8 | 39 KB | ############################################### | 100%
libprotobuf-3.7.1 | 4.6 MB | ############################################### | 100%
parquet-cpp-1.5.1 | 3 KB | ############################################### | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
+ '[' '' ']'
+ exec bash
jovyan@2df202a75cc8:~$ python
Python 3.7.3 (default, Mar 27 2019, 22:11:17)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/conda/lib/python3.7/site-packages/pyarrow/__init__.py", line 47, in <module>
from pyarrow.lib import cpu_count, set_cpu_count
ImportError: libdouble-conversion.so.3: cannot open shared object file: No such file or directory
>>> exit()
It seems it only tags latest/dev; adding a proper version tag like 2.6.0 would be great.
We should probably update dask, and also numpy=1.17.2 and pandas=0.25.1.
I tried doing this locally but ended up running into some jupyterlab conflicts, so I thought I'd raise an issue.
@jacobtomlinson any interest in handling this? Also, any interest in automating this process?
Hello,
could you please add a 2.12.0 tag?
Thank you!
Is there any way to automate this, since I saw other issues about missing tags?