googleclouddataproc / custom-images
Tools for creating Dataproc custom images
Home Page: https://cloud.google.com/dataproc/docs/guides/dataproc-images
License: Apache License 2.0
When creating a custom image:
python generate_custom_image.py \
--image-name conda-test-1 \
--dataproc-version 1.4.10-debian9 \
--customization-script combined.sh \
--zone us-central1-f \
--gcs-bucket gs://test/dataproc-custom-image \
--no-smoke-test
Here is the customization script:
#!/bin/bash
pip install numpy==1.15.4
pip install pandas==0.23.4
pip install ujson==1.35
pip install geopandas==0.4.0
pip install lxml==4.3.0
The following error was thrown:
INFO:shell_image_creator:############################################################
INFO:shell_image_creator:Successfully generated Shell script...
INFO:shell_image_creator:Creating custom image...
Traceback (most recent call last):
File "generate_custom_image.py", line 429, in <module>
run()
File "generate_custom_image.py", line 422, in run
shell_image_creator.create(args)
File "/home/yanzhongsu/cloud-dataproc/custom-images/shell_image_creator.py", line 44, in create
shell_script_executor.run(script)
File "/home/yanzhongsu/cloud-dataproc/custom-images/shell_script_executor.py", line 39, in run
stderr=sys.stderr
File "/opt/conda/default/lib/python3.7/subprocess.py", line 775, in __init__
restore_signals, start_new_session)
File "/opt/conda/default/lib/python3.7/subprocess.py", line 1522, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: '/usr/bin/bash': '/usr/bin/bash'
It seems that #46 broke compatibility with Python 3.
$ python3 generate_custom_image.py --image-name dagang-test-1-4 --dataproc-version 1.4.25-debian9 --customization-script /tmp/swap-artifacts.sh --disk-size=15 --gcs-bucket gs://cloud-dataproc-dev/dagang/custom-images --shutdown-instance-timer-sec 10 --zone us-west1-a --no-smoke-test
INFO:__main__:Parsed args: Namespace(accelerator=None, base_image_uri=None, customization_script='/tmp/swap-artifacts.sh', dataproc_version='1.4.25-debian9', disk_size=15, dry_run=False, extra_sources={}, family='dataproc-custom-image', gcs_bucket='gs://cloud-dataproc-dev/dagang/custom-images', image_name='dagang-test-1-4', machine_type='n1-standard-1', metadata=None, network='', no_external_ip=False, no_smoke_test=True, oauth=None, project_id=None, service_account='default', shutdown_instance_timer_sec=10, storage_location=None, subnetwork='', zone='us-west1-a')
INFO:custom_image_utils.args_inferer:Getting Dataproc base image name...
Reauthentication required.
Please insert and touch your security key
Traceback (most recent call last):
File "generate_custom_image.py", line 95, in <module>
main()
File "generate_custom_image.py", line 86, in main
args = parse_args(sys.argv[1:])
File "generate_custom_image.py", line 57, in parse_args
args_inferer.infer_args(args)
File "/google/src/cloud/dagang/custom-images/google3/experimental/users/dagang/git/dataproc-custom-images/custom_image_utils/args_inferer.py", line 167, in infer_args
_infer_base_image(args)
File "/google/src/cloud/dagang/custom-images/google3/experimental/users/dagang/git/dataproc-custom-images/custom_image_utils/args_inferer.py", line 137, in _infer_base_image
args.dataproc_base_image = _get_dataproc_image_path_by_version(args.dataproc_version)
File "/google/src/cloud/dagang/custom-images/google3/experimental/users/dagang/git/dataproc-custom-images/custom_image_utils/args_inferer.py", line 117, in _get_dataproc_image_path_by_version
and not parsed_image[0].encode('ascii','ignore').endswith("-eap"):
TypeError: endswith first arg must be bytes or a tuple of bytes, not str
In Python 3, parsed_image[0].encode('ascii', 'ignore') is of type bytes instead of str.
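The failure can be reproduced in isolation; the simplest fixes are to compare against a bytes literal, or to drop the .encode() call entirely, since the value is already a str. A minimal sketch (the image name is illustrative):

```python
# In Python 3, str.encode() returns bytes, and bytes.endswith() rejects a
# str argument with exactly this TypeError. Either match types or skip the
# encode() step altogether.
name = "dataproc-1-4-25-debian9"          # illustrative image name
encoded = name.encode("ascii", "ignore")  # bytes in Python 3

try:
    encoded.endswith("-eap")              # str suffix against bytes: TypeError
except TypeError as exc:
    print(type(exc).__name__)             # -> TypeError

# Working alternatives:
assert not encoded.endswith(b"-eap")      # bytes suffix against bytes
assert not name.endswith("-eap")          # or stay in str throughout
```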
Here is the command I used to create the custom image:
python generate_custom_image.py \
--image-name core-spark-conda-test \
--dataproc-version 1.4.10-debian9 \
--disk-size 50 \
--customization-script ../start_actions.sh \
--zone us-central1-f \
--gcs-bucket gs://test-misc/dataproc-custom-image \
--no-smoke-test
The start_actions.sh is here.
It worked in July 2019 after this commit 49f3375. Now, when I run it again on the latest commit, it gives me the following error:
+ echo 'Uploading files to GCS bucket.'
Uploading files to GCS bucket.
+ sources=([run.sh]='startup_script/run.sh' [init_actions.sh]='../init_actions.sh')
/var/folders/fr/hc45cptn3cs_hkf6y1zr7l7w0000gn/T/tmpQy24Bb: line 37: run.sh: syntax error: invalid arithmetic operator (error token is ".sh")
But when I roll back to commit 49f3375, it works.
Does anyone know what causes this error and how to fix it?
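The failing line is a bash associative-array literal, which only bash 4+ understands; bash 3.2 (the /bin/bash shipped with macOS) parses the [run.sh]= subscript as arithmetic, which is the likely source of the "invalid arithmetic operator" error above. A minimal sketch of the construct under bash 4+:

```shell
#!/usr/bin/env bash
# Associative arrays require bash 4+ and `declare -A`; under bash 3.2
# (the macOS default) this literal triggers the arithmetic-operator error.
declare -A sources=([run.sh]='startup_script/run.sh' [init_actions.sh]='../init_actions.sh')
echo "${sources[run.sh]}"
```

Running the generated script with a bash 4+ interpreter (e.g. one installed via Homebrew) would sidestep the parse error.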
The generate_custom_image.py script seems to require specifying an exact version with --dataproc-version. It would be nice if you could specify, for example, 1.5-debian10 to get the latest version of the 1.5-debian10 image, rather than having to specify 1.5.29-debian10.
I'm trying to create a custom image because the installation of Oozie and Hue via Dataproc initialization scripts takes a long time. So, to reduce the cluster provisioning time, I'm trying to create an image with Oozie and Hue installed using the Dataproc initialization scripts.
Since the customization script takes only one script, I created a single script with the Oozie and Hue installation scripts combined. But the problem is that Hue gets stuck during the installation and eventually fails (or hangs forever).
This happens after the node is created, while the script is being executed. I noticed that it happens during the execution of the '/usr/lib/hue/build/env/bin/hue collectstatic --noinput' command (see the log below).
If I add the same script as initialization script for Dataproc, it works fine.
Mar 26 13:15:23 packer-dataproc-1-4-5-install startup-script: INFO startup-script: make: Leaving directory '/usr/lib/hue/apps/rdbms'
Mar 26 13:15:23 packer-dataproc-1-4-5-install startup-script: INFO startup-script: === Saved registry at /var/lib/hue/app.reg
Mar 26 13:15:23 packer-dataproc-1-4-5-install startup-script: INFO startup-script: === Saved /usr/lib/hue/build/env/lib/python2.7/site-packages/hue.pth
Mar 26 13:15:23 packer-dataproc-1-4-5-install startup-script: INFO startup-script: Running '/usr/lib/hue/build/env/bin/hue makemigrations --noinput' with None
Mar 26 13:15:26 packer-dataproc-1-4-5-install startup-script: INFO startup-script: No changes detected
Mar 26 13:15:26 packer-dataproc-1-4-5-install startup-script: INFO startup-script: Running '/usr/lib/hue/build/env/bin/hue migrate --fake-initial' with None
Mar 26 13:15:28 packer-dataproc-1-4-5-install startup-script: INFO startup-script: #033[36;1mOperations to perform:#033[0m
Mar 26 13:15:28 packer-dataproc-1-4-5-install startup-script: INFO startup-script: #033[1m Apply all migrations: #033[0madmin, auth, axes, beeswax, contenttypes, desktop, django_openid_auth, jobsub, oozie, pig, search, sessions, sites, useradmin
Mar 26 13:15:28 packer-dataproc-1-4-5-install startup-script: INFO startup-script: #033[36;1mRunning migrations:#033[0m
Mar 26 13:15:28 packer-dataproc-1-4-5-install startup-script: INFO startup-script: No migrations to apply.
Mar 26 13:15:29 packer-dataproc-1-4-5-install startup-script: INFO startup-script: Running '/usr/lib/hue/build/env/bin/hue collectstatic --noinput' with None
Mar 26 13:15:30 packer-dataproc-1-4-5-install startup-script: INFO startup-script: Found another file with the destination path 'indexer/css/admin.css'. It will be ignored since only the first encountered file is collected. If this is not what you want, make sure every static file has a unique path.
Mar 26 13:15:30 packer-dataproc-1-4-5-install startup-script: INFO startup-script: Found another file with the destination path 'indexer/css/importer.css'. It will be ignored since only the first encountered file is collected. If this is not what you want, make sure every static file has a unique path.
Any idea what would be the cause for this?
Closed GoogleCloudDataproc/cloud-dataproc#77; the feature is tracked here.
The shutdown that happens after the timer expires is too abrupt, with no clear message that it occurred because the shutdown timer limit was reached. Please print a message explaining why the instance is being shut down, and how the user can modify the setting with --shutdown-instance-timer-sec.
I tried running the following command:
python generate_custom_image.py --image-name some-name --dataproc-version 2.0.36-ubuntu18 --zone us-east1-c --disk-size 100 --metadata somekey=somevalue --customization-script /tmp/script.sh --gcs-bucket some-bucket
And got the following error:
INFO:custom_image_utils.args_inferer:Getting Dataproc base image name...
Traceback (most recent call last):
File "generate_custom_image.py", line 95, in <module>
main()
File "generate_custom_image.py", line 86, in main
args = parse_args(sys.argv[1:])
File "generate_custom_image.py", line 57, in parse_args
args_inferer.infer_args(args)
File "/home/jakepromisel/Github/custom-images/custom_image_utils/args_inferer.py", line 225, in infer_args
_infer_base_image(args)
File "/home/jakepromisel/Github/custom-images/custom_image_utils/args_inferer.py", line 191, in _infer_base_image
args.dataproc_version)
File "/home/jakepromisel/Github/custom-images/custom_image_utils/args_inferer.py", line 175, in _get_dataproc_image_path_by_version
"Cannot find dataproc base image with dataproc-version=%s." % version)
RuntimeError: Cannot find dataproc base image with dataproc-version=2.0.36-ubuntu18.
But I believe this version does exist. Looking into this further, I tried recreating the gcloud command that is run to verify the Dataproc version:
gcloud compute images list --project cloud-dataproc --filter "labels.goog-dataproc-version = 2-0-36 AND NOT name ~ -eap$ AND status = READY" --format "csv[no-heading=true](name,labels.goog-dataproc-version)" --sort-by=~creationTimestamp
And I get the following warning:
WARNING: --filter : operator evaluation is changing for consistency across Google APIs. labels.goog-dataproc-version=2-0-36 currently does not match but will match in the near future. Run `gcloud topic filters` for details.
If this repository is no longer maintained and shouldn't be used, can someone point me to how I should go about creating a custom image? From what I can tell, the docs still say to use this script. I even tried on a VM where I installed the recommended gcloud version (181.0.0), which is from 2017, and it still didn't work.
Hello,
I am facing an issue installing Singularity on a GCP custom image where no user directory exists under /home.
Go is properly installed in /usr/local/go, but I have changed the GOPATH env variable like this and extracted the Go source folder to these dirs: /usr/local, /home, or /opt:
echo 'export GOPATH=${HOME}/go' >> ~/.bashrc && \
echo 'export PATH=/usr/local/go/bin:${PATH}:${GOPATH}/bin' >> ~/.bashrc && \
source ~/.bashrc
Whenever I try to run the configuration, with or without sudo, I get:
./mconfig
Configuring for project `singularity' with languages: C, Golang
running pre-basechecks project specific checks ...
running base system checks ...
checking: host C compiler... cc
checking: host C++ compiler... c++
checking: host Go compiler (at least version 1.13)... not found!
mconfig: could not complete configuration
So, any suggestions on where I should copy Go and Singularity when there is no user under the system /home dir? Thanks.
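One likely culprit: when $HOME is unset or points at a missing directory, ~/.bashrc is never sourced, so mconfig cannot find the Go compiler on PATH. A workaround sketch (the /opt/go path is an illustrative choice, not from the original report) is to anchor GOPATH at a fixed system location; persisting the exports in /etc/profile.d instead of ~/.bashrc would make them survive sudo and no-$HOME environments:

```shell
#!/usr/bin/env bash
# Sketch: put GOPATH somewhere that does not depend on /home existing.
# /opt/go is illustrative; persist these lines in /etc/profile.d/go.sh
# rather than ~/.bashrc so they apply without a user home directory.
export GOPATH=/opt/go
export PATH=/usr/local/go/bin:${PATH}:${GOPATH}/bin
echo "$GOPATH"
```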
Encountered the following error: Invalid value for field 'resource.sizeGb': '20'. Requested disk size cannot be smaller than the image size (30 GB)
This was produced using the following command: python xxxxx/custom-images/generate_custom_image.py --image-name xxxx-dataproc-custom-image-v0 --dataproc-version 2.0-debian10 --customization-script custom_image_xxx.sh --zone europe-west4-b --gcs-bucket dataproc-temp-europe-west4-xxxx (some sensitive info is masked).
I wonder why the image is so large. (The .sh file only installs about 10 Python dependencies, and certainly no large ones.)
Hello,
I am experiencing an odd behavior when attempting to build a custom image: more or less frequently, the Daisy controller fails to recognize that an image has successfully reached the end of the custom image script installation and just halts indefinitely (until the 2h timeout is reached and the build fails). To me this looks like a failure to match the success string on the serial port of the VM.
Note that this does not happen every time, but often enough to be very annoying and to prevent building images reliably (I rerun the same build command with the same custom script, and sometimes the outcome is a failure and sometimes it is not).
My invocation looks like the following:
python2.7 ${BUILDER_ROOT}/cloud-dataproc/custom-images/generate_custom_image.py \
--service-account ${SERVICE_ACCOUNT} \
--image-name my-custom-image \
--dataproc-version 1.3.5 \
--customization-script ${CUSTOM_INIT_ACTION} \
--daisy-path ${DAISY_PATH} \
--zone europe-west1-b \
--gcs-bucket ${GCS_BUCKET_IN_MY_PROJECT} \
--project-id ${MY_PROJECT_ID} \
--subnetwork projects/${MY_NET_HOST_PROJECT}/regions/europe-west1/subnetworks/${MY_SHARED_VPC_NAME} \
--disk-size 20 \
--no-smoke-test
A few notes on the invocation above:
I'm happy to share more details on my setup as required to debug further.
Thanks!
Not able to create a Google Cloud custom Dataproc image using macOS 10.14.5, Google Cloud SDK 266.0.0, and gsutil 4.44.
The generate_custom_image.py script generates an error when it tries to run my custom script.
Building synchronization state...
Starting synchronization...
Copying file:///tmp/custom-image-custom_python_1_4_14_debian9-20191009-140050/logs/workflow.log [Content-Type=text/plain]...
- [1/1 files][ 294.0 B/ 294.0 B] 100% Done
Operation completed over 1 objects/294.0 B.
- [[ -f /tmp/custom-image-custom_python_1_4_14_debian9-20191009-140050/image_created ]]
- echo -e 'Workflow failed, check logs at /tmp/custom-image-custom_python_1_4_14_debian9-20191009-140050/logs/ or gs://bucket-mcberma/custom-image-custom_python_1_4_14_debian9-20191009-140050/logs/'
Workflow failed, check logs at /tmp/custom-image-custom_python_1_4_14_debian9-20191009-140050/logs/ or gs://bucket-mcberma/custom-image-custom_python_1_4_14_debian9-20191009-140050/logs/- exit 1
Traceback (most recent call last):
File "generate_custom_image.py", line 95, in <module>
main()
File "generate_custom_image.py", line 88, in main
shell_image_creator.create(args)
File "/Users/mcberma/opensource/dataproc-custom-images-master/custom_image_utils/shell_image_creator.py", line 43, in create
shell_script_executor.run(script)
File "/Users/mcberma/opensource/dataproc-custom-images-master/custom_image_utils/shell_script_executor.py", line 47, in run
raise RuntimeError("Error building custom image.")
RuntimeError: Error building custom image.
I have validated that the custom script does not have any carriage returns or extraneous characters.
Here is how I am executing the Python script:
python generate_custom_image.py --image-name custom_python_1_4_14_debian9 --dataproc-version 1.4.14-debian9 --customization-script custom_image.sh --zone us-central1-a --gcs-bucket gs://zzzzz-zzzzz
I have attached the custom script file and the workflow log.
$ python generate_custom_image.py
Traceback (most recent call last):
File "generate_custom_image.py", line 95, in <module>
main()
File "generate_custom_image.py", line 86, in main
args = parse_args(sys.argv[1:])
File "generate_custom_image.py", line 55, in parse_args
args = args_parser.parse_args(raw_args)
File "/usr/local/google/home/dagang/repo/github/functicons/dataproc-custom-images/custom_image_utils/args_parser.py", line 58, in parse_args
help=constants.version_help_text)
AttributeError: module 'constants' has no attribute 'version_help_text'
It seems there is a conflict between the constants.py under the custom_image_utils folder and the constants module from my Python environment:
$ pip freeze | grep constants
constants==0.6.0
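The shadowing mechanism is easy to demonstrate: whichever constants module appears first on sys.path wins, so a pip-installed constants package can mask the repo-local module. A self-contained sketch with throwaway modules (directory names and values are illustrative):

```python
import os
import sys
import tempfile

# Build two fake 'constants' modules in separate directories.
tmp = tempfile.mkdtemp()
for dirname, body in (("site_pkgs", "origin = 'pip package'"),
                      ("repo", "origin = 'repo module'")):
    path = os.path.join(tmp, dirname)
    os.makedirs(path)
    with open(os.path.join(path, "constants.py"), "w") as fh:
        fh.write(body)

# The directory inserted last at position 0 shadows the other copy.
sys.path.insert(0, os.path.join(tmp, "repo"))
sys.path.insert(0, os.path.join(tmp, "site_pkgs"))

import constants
print(constants.origin)  # -> pip package
```

Importing through the package (from custom_image_utils import constants) avoids the ambiguity, which is presumably why the bare import is fragile here.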
08:49:40 DOCKER: INFO:__main__:Parsed args: Namespace(accelerator=None, base_image_uri=None, customization_script='dataproc_image_setup.sh', dataproc_version='2.0.10-debian10', disk_size=32, dry_run=False, extra_sources={}, family='dataproc-custom-image', gcs_bucket='gs://REDACTED', image_name='redacted-05112021-2-0-10-debian10', machine_type='n1-standard-1', maria_version=None, network='', no_external_ip=True, no_smoke_test=True, oauth=None, project_id=None, service_account='default', shutdown_instance_timer_sec=120, subnetwork='redacted', zone='redacted')
08:49:41 DOCKER: INFO:custom_image_utils.args_inferer:Getting Dataproc base image name...
08:49:45 DOCKER: Traceback (most recent call last):
08:49:45 DOCKER: File "generate_custom_image.py", line 95, in <module>
08:49:45 DOCKER: main()
08:49:45 DOCKER: File "generate_custom_image.py", line 86, in main
08:49:45 DOCKER: args = parse_args(sys.argv[1:])
08:49:45 DOCKER: File "generate_custom_image.py", line 57, in parse_args
08:49:45 DOCKER: args_inferer.infer_args(args)
08:49:45 DOCKER: File "/workspace/samscdp-custom-images/custom_image_utils/args_inferer.py", line 170, in infer_args
08:49:45 DOCKER: _infer_base_image(args)
08:49:45 DOCKER: File "/workspace/samscdp-custom-images/custom_image_utils/args_inferer.py", line 137, in _infer_base_image
08:49:45 DOCKER: args.dataproc_base_image = _get_dataproc_image_path_by_version(args.dataproc_version)
08:49:45 DOCKER: File "/workspace/samscdp-custom-images/custom_image_utils/args_inferer.py", line 117, in _get_dataproc_image_path_by_version
08:49:45 DOCKER: raise RuntimeError(
08:49:45 DOCKER: RuntimeError: ('Cannot find dataproc base image with dataproc-version=%s.', '2.0.10-debian10')
This appears to be an issue with the last few images, which possibly do not have the correct labels.
this one works:
labels:
goog-dataproc-base-image: debian-10-buster-v20210217
goog-dataproc-version: 2-0-6-debian10
but newer ones do not:
labels:
goog-dataproc-base-image: debian-10-buster-v20210316
google-dataproc-version: 2-0-9-debian10
Notice the different label name: google-dataproc-version vs. goog-dataproc-version.
We can see on the release notes page that 2.0.29-debian10 was released 9 days ago (Jan 17th). However, the latest version of generate_custom_image.py fails as follows:
Command:
python generate_custom_image.py \
--image-name custom-image-2-0-29-debian10 \
--dataproc-version 2.0.29-debian10 \
--customization-script ~/custom-script.sh \
--zone us-central1-f \
--gcs-bucket gs://my-test-bucket
Traceback:
Traceback (most recent call last):
File "generate_custom_image.py", line 95, in <module>
main()
File "generate_custom_image.py", line 86, in main
args = parse_args(sys.argv[1:])
File "generate_custom_image.py", line 57, in parse_args
args_inferer.infer_args(args)
File "/dataproc-custom-images/custom_image_utils/args_inferer.py", line 225, in infer_args
_infer_base_image(args)
File "/dataproc-custom-images/custom_image_utils/args_inferer.py", line 191, in _infer_base_image
args.dataproc_version)
File "/dataproc-custom-images/custom_image_utils/args_inferer.py", line 175, in _get_dataproc_image_path_by_version
"Cannot find dataproc base image with dataproc-version=%s." % version)
RuntimeError: Cannot find dataproc base image with dataproc-version=2.0.29-debian10.
As a workaround, a Google Cloud employee suggested dropping the patch version (e.g. 2.0-debian10), but that would unfortunately make environments and workloads non-deterministic.
Could this issue be investigated a bit further, as it seems to be related to GCP's release and tagging process?
We were looking into using --dataproc-version, but I believe the filtering isn't specific enough.
python3 generate_custom_image.py --image-name image-version-check --dataproc-version 2.0-debian10 --customization-script 'script.sh' --zone 'us-central1-a' --gcs-bucket 'gs://bucketname'
INFO:__main__:Parsed args: Namespace(image_name='image-version-check', dataproc_version='2.0-debian10', base_image_uri=None, base_image_family=None, customization_script='script.sh', metadata='bucket=bucketname', zone='us-central1-a', gcs_bucket='gs://bucketname', family='dataproc-custom-image', project_id='project-id', oauth=None, machine_type='n1-standard-1', no_smoke_test=False, network='', subnetwork='', no_external_ip=False, service_account='default', extra_sources={}, disk_size=20, accelerator=None, storage_location=None, shutdown_instance_timer_sec=300, licenses='', dry_run=False)
INFO:custom_image_utils.args_inferer:Getting Dataproc base image name...
projects/cloud-dataproc/global/images/dataproc-2-0-idv-20211011-145037-debian11-tmp 2-0-8-debian10
From what I can understand, --dataproc-version 2.0-debian10 should return dataproc-2-0-deb10-20210917-180200-rc01 as of right now.
Should this be updated, or am I misunderstanding something?
Creating a new custom image with 2.0.0-debian10 requires at least 30 GB.
Please advise.
I need to build a custom image based on Rocky Linux because our customizations depend on RPM packages.
The CLI option of generate_custom_image.py does not accept --dataproc-version 2.0-rocky8, though it might be a correct image name.
$ python generate_custom_image.py --image-name test-image --dataproc-version 2.0-rocky8 --disk-size 30 --zone asia-northeast1-a --gcs-bucket gs://test-misc
usage: generate_custom_image.py [-h] --image-name IMAGE_NAME
[--dataproc-version DATAPROC_VERSION | --base-image-uri BASE_IMAGE_URI | --base-image-family BASE_IMAGE_FAMILY]
--customization-script CUSTOMIZATION_SCRIPT
[--metadata METADATA] --zone ZONE --gcs-bucket
GCS_BUCKET [--family FAMILY]
[--project-id PROJECT_ID] [--oauth OAUTH]
[--machine-type MACHINE_TYPE]
[--no-smoke-test] [--network NETWORK]
[--subnetwork SUBNETWORK] [--no-external-ip]
[--service-account SERVICE_ACCOUNT]
[--extra-sources EXTRA_SOURCES]
[--disk-size DISK_SIZE]
[--accelerator ACCELERATOR]
[--storage-location STORAGE_LOCATION]
[--shutdown-instance-timer-sec SHUTDOWN_INSTANCE_TIMER_SEC]
[--dry-run]
generate_custom_image.py: error: argument --dataproc-version: Invalid version: 2.0-rocky8.
diff --git a/custom_image_utils/args_parser.py b/custom_image_utils/args_parser.py
index 05408e4..6d825b5 100644
--- a/custom_image_utils/args_parser.py
+++ b/custom_image_utils/args_parser.py
@@ -29,7 +29,7 @@ from custom_image_utils import constants
_VERSION_REGEX = re.compile(r"^\d+\.\d+\.\d+(-RC\d+)?(-[a-z]+\d+)?$")
_FULL_IMAGE_URI = re.compile(r"^(https://www\.googleapis\.com/compute/([^/]+)/)?projects/([^/]+)/global/images/([^/]+)$")
_FULL_IMAGE_FAMILY_URI = re.compile(r"^(https://www\.googleapis\.com/compute/([^/]+)/)?projects/([^/]+)/global/images/family/([^/]+)$")
-_LATEST_FROM_MINOR_VERSION = re.compile(r"^(\d+)\.(\d+)-((?:debian|ubuntu|centos)\d+)$")
+_LATEST_FROM_MINOR_VERSION = re.compile(r"^(\d+)\.(\d+)-((?:debian|ubuntu|centos|rocky)\d+)$")
def _version_regex_type(s):
"""Check if version string matches regex."""
shell_script_executor.py assumes /usr/bin/bash as the bash location, but this breaks on the default install on macOS and fails with a deceptive error message that may lead the user to believe the tempfile is the issue. It is easy to soft-link locally, but probably worth fixing.
I want to see the log of the customization script while creating a custom image using generate_custom_image.py. Does anyone know how to do that? Right now, when I try to create a custom image, it throws the following error, and I do not know what went wrong. The customization script seems to work fine when executed in a separate VM.
[Daisy] Errors in one or more workflows:
conda-spark-pip-version: step "wait-for-inst-install" run error: WaitForInstancesSignal FailureMatch found for "conda-spark-pip-version-install-conda-spark-pip-version-vgz0v": "BuildFailed: Dataproc Initialization Actions Failed. Please check your initialization script."
Traceback (most recent call last):
File "dataproc/cloud-dataproc/custom-images/generate_custom_image.py", line 429, in <module>
run()
File "dataproc/cloud-dataproc/custom-images/generate_custom_image.py", line 420, in run
daisy_image_creator.create(args)
File "/Users/Documents/copofi/gitlab/core-spark/dataproc/cloud-dataproc/custom-images/daisy_image_creator.py", line 43, in create
os.path.abspath(args.daisy_path), workflow_script)
File "/Users/Documents/copofi/gitlab/core-spark/dataproc/cloud-dataproc/custom-images/daisy_workflow_executor.py", line 42, in run
raise RuntimeError("Error building custom image.")
RuntimeError: Error building custom image.
The customization script is as follows:
#!/bin/bash
conda install --yes python=3.6.7 numpy=1.15.4
conda install --yes python=3.6.7 pandas=0.23.4
conda install --yes python=3.6.7 ujson=1.35
conda install --yes python=3.6.7 geopandas=0.4.0
conda install --yes python=3.6.7 lxml=4.3.0
When I use generate_custom_image.py to create a custom Dataproc image, it throws this error:
File "dataproc/cloud-dataproc/custom-images/generate_custom_image.py", line 429, in <module>
run()
File "dataproc/cloud-dataproc/custom-images/generate_custom_image.py", line 420, in run
daisy_image_creator.create(args)
File "/Users/temp/Documents/guess/dataproc/cloud-dataproc/custom-images/daisy_image_creator.py", line 43, in create
os.path.abspath(args.daisy_path), workflow_script)
File "/Users/temp/Documents/guess/dataproc/cloud-dataproc/custom-images/daisy_workflow_executor.py", line 42, in run
raise RuntimeError("Error building custom image.")
RuntimeError: Error building custom image.
The customization script is as follows:
#!/bin/bash
pip install numpy==1.15.4
pip install pandas==0.23.4
pip install ujson==1.35
pip install geopandas==0.4.0
pip install lxml==4.3.0
Daisy was downloaded from here: https://storage.googleapis.com/compute-image-tools/release/darwin/daisy
I created a custom image and was surprised to see that it is 3 subminor versions behind:
$ gcloud compute images describe my-ubuntu18-custom | grep -i version
goog-dataproc-version: 1-5-45-ubuntu18
Per https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-release-1.5, I expected to see this:
goog-dataproc-version: 1-5-48-ubuntu18
We are trying to create a new Dataproc image using the following parameters:
python generate_custom_image.py \
--image-name image_name \
--dataproc-version "1.5.57-debian10" \
--customization-script ../customization_script.sh \
--zone us-central1-b \
--gcs-bucket gs://image-logs \
--disk-size 50 \
--machine-type n1-standard-4 \
--metadata "{ \"spark-jar\": \"$SPARK_JAR\", \"postgres-jar\": \"$POSTGRES_JAR\", \"project\": \"$G_PROJECT\", \"instance-id\": \"$G_BIG_TABLE_INSTANCE\", \"postgres-user\": \"$POSTGRES_USER\", \"postgres-password\": \"$POSTGRES_PASSWORD\", \"postgres-database\": \"$POSTGRES_DATABASE\", \"postgres-host\": \"$POSTGRES_HOST\", \"postgres-port\": \"$POSTGRES_PORT\"}"
We are getting an error saying that the dataproc-version could not be found:
Traceback (most recent call last):
File "generate_custom_image.py", line 95, in <module>
main()
File "generate_custom_image.py", line 86, in main
args = parse_args(sys.argv[1:])
File "generate_custom_image.py", line 57, in parse_args
args_inferer.infer_args(args)
File "/dataproc-custom-images/custom_image_utils/args_inferer.py", line 225, in infer_args
_infer_base_image(args)
File "/dataproc-custom-images/custom_image_utils/args_inferer.py", line 191, in _infer_base_image
args.dataproc_version)
File "/dataproc-custom-images/custom_image_utils/args_inferer.py", line 175, in _get_dataproc_image_path_by_version
"Cannot find dataproc base image with dataproc-version=%s." % version)
RuntimeError: Cannot find dataproc base image with dataproc-version=1.5.57-debian10.
We have tried several different versions but each one gives the same error. We have looked at the docs and we think we are defining the version correctly. Any guidance would be appreciated.
Hello,
I tried to override cluster properties following the guidance in the README documentation.
Properties were added to the 'dataproc.custom.properties' file, the file was uploaded to Cloud Storage, and then the commands given below were added to 'customization_script.sh':
gsutil cp gs://<bucket-name>/custom-image/dataproc.custom.properties /etc/google-dataproc
echo cat /etc/google-dataproc/dataproc.custom.properties
I saw the following info in the console log:
+ echo hive.metastore.warehouse.dir=gs://<bucket-name>/hive-warehouse
Unfortunately, when a Dataproc cluster is created with this custom image, the given properties are not added to the cluster properties.
Additionally, I tried the property with the below key, but the result is the same:
hive.hive.metastore.warehouse.dir=gs://<bucket-name>/hive-warehouse
Could you please share detailed documentation about overriding cluster properties?
I have found an issue when trying to create a Dataproc custom image while following the Google Cloud Platform documentation [1].
Error Logs
[Daisy] Errors in one or more workflows:
debian-error: FileIOError: stat /home/$USER/dataproc-custom-image/dataproc-custom-images/custom_image_utils/startup_script/run.sh: no such file or directory
Traceback (most recent call last):
File "generate_custom_image.py", line 110, in <module>
main()
File "generate_custom_image.py", line 101, in main
daisy_image_creator.create(args)
File "/home/borjah/dataproc-custom-image/dataproc-custom-images/custom_image_utils/daisy_image_creator.py", line 43, in create
os.path.abspath(args.daisy_path), workflow_script)
File "/home/$USER/dataproc-custom-image/dataproc-custom-images/custom_image_utils/daisy_workflow_executor.py", line 42, in run
raise RuntimeError("Error building custom image.")
RuntimeError: Error building custom image.
Steps to reproduce error
Executing the commands from the Google Cloud Console, these are the steps I followed to reproduce the error:
git clone https://github.com/GoogleCloudPlatform/dataproc-custom-images.git
cd dataproc-custom-images
wget https://storage.googleapis.com/compute-image-tools/release/linux/daisy
chmod +x daisy
Contents of initialisation.sh (the customization script):
#! /usr/bin/bash
apt-get -y update
apt-get install python-dev -y
apt-get install python-pip -y
pip install numpy
python generate_custom_image.py --image-name custom-image-error --dataproc-version 1.4.7-debian9 --customization-script initialisation.sh --zone europe-west1-c --gcs-bucket gs://mybucket --daisy-path daisy
Workaround
As we can see in the error logs, the create-image script is looking for a script named "run.sh" at the path custom_image_utils/startup_script/run.sh.
There is indeed a script named "run.sh", but not at this path, so the workaround is to make a folder called "startup_script" inside the folder "custom_image_utils" and copy into it the run.sh script located at dataproc-custom-images/startup_script/run.sh.
To solve this error, follow these steps from the dataproc-custom-images base folder:
Create the startup_script folder inside custom_image_utils:
mkdir custom_image_utils/startup_script
Copy run.sh from the startup_script folder into the newly created custom_image_utils/startup_script folder:
cp startup_script/run.sh custom_image_utils/startup_script/
[1] https://cloud.google.com/dataproc/docs/guides/dataproc-images
Currently, smoke tests cannot be selected by users; only an example app from the Spark distribution is used.
It would be very useful to allow users to decide flexibly which tests to run.
For example: in my custom image I install some additional Python package(s), and I want to check that PySpark can successfully use them to process sample data from a table.
This could be implemented to different degrees:
a. Allow users to specify their own jar with tests and the class(es) to run from that jar
b. Allow users to provide a workflow, or a subset of one, with all the steps they want to include in their test suite (e.g., first run a Spark job, then a Hive query, to check that both components are successfully configured)
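At its simplest, option (a) could even be a plain script shipped by the user. A hypothetical sketch that checks the image's extra Python packages are importable (the package names below are stdlib stand-ins; a real test would list the actual extras installed on the image):

```python
import importlib.util

# Stand-ins for image-specific extras (e.g. the PySpark dependencies
# mentioned above); a real user-supplied test would list those instead.
required = ["json", "csv"]
missing = [pkg for pkg in required
           if importlib.util.find_spec(pkg) is None]
print("missing:", missing)  # -> missing: []
```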