googlecloudplatform / llm-pipeline-examples

License: Apache License 2.0

Dockerfile 8.57% Python 50.86% Shell 34.73% JavaScript 1.98% HTML 2.53% HCL 1.33%

llm-pipeline-examples's Introduction

Training Large Language Models on Google Cloud

Training large language models poses several challenges. To start with, it requires a large compute infrastructure: multiple machines with multiple hardware accelerators, such as GPUs and TPUs, are needed to train a single model. Getting the infrastructure ready is just the start. Once training begins, it can take multiple days to converge. That duration, combined with the large number of devices involved, increases the probability of a failure during training. If training fails, we need to bring the infrastructure back up and resume from where we left off.

In addition to these problems, we face the common challenges of production-ready machine learning, such as reliable retraining, data storage, checkpoint storage, model versioning, tracking model quality, and deployment to production.

Google Cloud Platform (GCP) is one of the largest cloud providers and offers compute infrastructure suitable for training large language models. GCP offers A3 VMs, which are powered by Nvidia's latest H100 GPUs.

There are multiple model pre-training frameworks. In these examples, we show how to pretrain a GPT model using Megatron-Deepspeed. We also show how to finetune a T5 model using the Hugging Face transformer library.

Quick Start Guide

Prerequisites

  1. Make sure you have gcloud installed and that you are authenticated, including application default authentication:

    gcloud auth login
    gcloud auth application-default login
  2. Enable Services: In your project, enable the services needed to run the pipeline by issuing the following command:

        export PROJECT_ID=<your project ID>
        gcloud services enable cloudfunctions.googleapis.com compute.googleapis.com iam.googleapis.com cloudresourcemanager.googleapis.com --project=${PROJECT_ID}
  3. Create Workspace: Create the workspace VM, which we will use to process the data, create the cluster, and launch training:

    gcloud compute instances create llm-processor \
        --project=${PROJECT_ID} \
        --zone=us-east4-c \
        --machine-type=e2-standard-4 \
        --metadata=enable-oslogin=true \
        --scopes=https://www.googleapis.com/auth/cloud-platform \
        --create-disk=auto-delete=yes,boot=yes,device-name=llm-processor,image-project=cos-cloud,image-family=cos-stable,mode=rw,size=250,type=projects/${PROJECT_ID}/zones/us-east4-c/diskTypes/pd-balanced
  4. Connect: Connect to the VM using SSH:

    gcloud compute ssh llm-processor --zone=us-east4-c --project=${PROJECT_ID}
  5. Create Storage Bucket: Create a regional bucket in the same project, in the same region where your pipeline will run. us-central1 is recommended.

        export PROJECT_ID=<your project ID>
        export BUCKET_NAME=<your choice of a globally unique bucket ID>
        gcloud storage buckets create gs://$BUCKET_NAME --project=$PROJECT_ID --location=us-central1 --uniform-bucket-level-access

Megatron-Deepspeed GPT Pretraining Instructions

The following instructions show how to train a GPT model using Megatron-Deepspeed on a cluster of 96 H100 GPUs. The instructions are simplified to use only Google Cloud Compute Engine APIs. In addition, they use Google Cloud Storage for storing the data:

  • Task: LLM Pretraining
  • Implementation: Megatron-Deepspeed
  • Distributed Framework: Deepspeed (ZeRO stage 3)
  • Infrastructure: Google Cloud
  • Cluster Management: AI Infra cluster provisioning tool
  • Storage: Google Cloud Storage
  • Dataset: Wikipedia
  • Model Architecture: GPT
  • Model Size: 176B
  1. Download and preprocess the Wikipedia dataset:

        docker run gcr.io/llm-containers/gpt_preprocess:release ./preprocess.sh gs://$BUCKET_NAME

    Warning: This could take hours to finish running.

  2. Pretrain GPT 176B on an A3 VM cluster

        sudo docker run -it gcr.io/llm-containers/batch:release $PROJECT_ID gcr.io/llm-containers/gpt_train:release gs://$BUCKET_NAME 0 0 0 ' {"data_file_name":"wiki_data_text_document", "tensor_parallel":4, "pipeline_parallel":12, "nlayers":70, "hidden":14336, "heads":112, "seq_len":2048, "train_steps":100, "eval_steps":10, "micro_batch":1, "gradient_acc_steps":128 }' '{ "name_prefix": "megatron-gpt", "zone": "us-central1-a", "node_count": 12, "machine_type": "a3-highgpu-8g", "gpu_type": "nvidia-h100-80gb", "gpu_count": 8 }'
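The numbers in the command above can be sanity-checked with a little arithmetic. The sketch below is not part of the repository. It assumes the usual convention that the global batch size is micro batch × gradient accumulation steps × data-parallel replicas, and it uses the common rule of thumb of about 12 × layers × hidden² weights for a GPT-style transformer, ignoring embeddings and biases. How the training script itself interprets these fields is not confirmed here.

```python
# Values copied from the training command above.
model_cfg = {
    "tensor_parallel": 4,
    "pipeline_parallel": 12,
    "nlayers": 70,
    "hidden": 14336,
    "micro_batch": 1,
    "gradient_acc_steps": 128,
}
cluster_cfg = {"node_count": 12, "gpu_count": 8}


def parallel_layout(model, cluster):
    """Derive the data-parallel degree and global batch size (assumed convention)."""
    total_gpus = cluster["node_count"] * cluster["gpu_count"]
    gpus_per_replica = model["tensor_parallel"] * model["pipeline_parallel"]
    if total_gpus % gpus_per_replica:
        raise ValueError(
            "GPU count must be a multiple of tensor_parallel * pipeline_parallel")
    data_parallel = total_gpus // gpus_per_replica
    global_batch = (
        model["micro_batch"] * model["gradient_acc_steps"] * data_parallel)
    return {
        "total_gpus": total_gpus,
        "data_parallel": data_parallel,
        "global_batch": global_batch,
    }


def approx_transformer_params(nlayers, hidden):
    """Rule of thumb: about 12 * layers * hidden^2 weights, without embeddings."""
    return 12 * nlayers * hidden ** 2


layout = parallel_layout(model_cfg, cluster_cfg)
params = approx_transformer_params(model_cfg["nlayers"], model_cfg["hidden"])
print(layout)
print(f"{params / 1e9:.1f}B parameters (approximate, embeddings excluded)")
```

Under these assumptions, the 96 GPUs hold two model replicas of 48 GPUs each, and the estimate lands a little below the 176B label because embeddings are left out.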

Huggingface T5 Finetuning Instructions

In this effort, we provide a fully functioning example of how to use GCP tools together with the Hugging Face Transformers library and DeepSpeed to finetune a large language model (T5 XXL) for a text summarization task. We encapsulate the code in container images that you can easily run on GCP. Here is a summary of the task and tooling we are going to use:

  • Task: Text Summarization
  • Implementation: Hugging Face Transformers library
  • Distributed Framework: Deepspeed (ZeRO stage 3)
  • Infrastructure: Google Cloud
  • Cluster Management: AI Infra cluster provisioning tool
  • Storage: Google Cloud Storage
  • Dataset: CNN Dailymail
  • Model Architecture: T5
  • Model Size: 11B
  1. Copy Data: Copy the data and model checkpoint to the GCS bucket.

    1. If you want to finetune the XXL T5 model (11B parameters), use the following command. The copy can take up to 30 minutes:

          sudo docker run -it gcr.io/llm-containers/train:release python download.py --dataset=cnn_dailymail --subset=3.0.0 --dataset_path=gs://$BUCKET_NAME/dataset --model_checkpoint=google/t5-v1_1-xxl --workspace_path=gs://$BUCKET_NAME/workspace
    2. If you want to quickly test finetuning a small T5 model (60M parameters), you can use the following command:

          sudo docker run -it gcr.io/llm-containers/train:release python download.py --dataset=cnn_dailymail --subset=3.0.0 --dataset_path=gs://$BUCKET_NAME/dataset --model_checkpoint=t5-small --workspace_path=gs://$BUCKET_NAME/workspace
  2. Preprocess Data: Kick off data preprocessing:

    sudo docker run -it gcr.io/llm-containers/train:release python preprocess.py --model_checkpoint google/t5-v1_1-xxl --document_column article --summary_column highlights --dataset_path gs://$BUCKET_NAME/dataset --tokenized_dataset_path gs://$BUCKET_NAME/processed_dataset
  3. Kick off training

    1. To run a T5 XXL on 8 A3 VMs with H100 GPUs

      sudo docker run -it gcr.io/llm-containers/batch:release $PROJECT_ID gcr.io/llm-containers/train:release gs://$BUCKET_NAME/model 0 gs://$BUCKET_NAME/processed_dataset gs://$BUCKET_NAME/workspace '{ "model_checkpoint" : "google/t5-v1_1-xxl",  "batch_size" : 16,  "epochs" : 7}' ' { "name_prefix" : "t5node", "zone" : "us-east4-a",  "node_count" : 8,  "machine_type" : "a3-highgpu-8g", "gpu_type" : "nvidia-h100-80gb", "gpu_count" : 8 }'
    2. To run a T5 small on a single A3 VM with 8 H100 GPUs:

      sudo docker run -it gcr.io/llm-containers/batch:release $PROJECT_ID gcr.io/llm-containers/train:release gs://$BUCKET_NAME/model 0 gs://$BUCKET_NAME/processed_dataset gs://$BUCKET_NAME/workspace '{ "model_checkpoint" : "t5-small",  "batch_size" : 128,  "epochs" : 1}' ' { "name_prefix" : "t5node", "zone" : "us-east4-a",  "node_count" : 1,  "machine_type" : "a3-highgpu-8g", "gpu_type" : "nvidia-h100-80gb", "gpu_count" : 8 }'

    Make sure you have enough quota for the VM and GPU types you select. See the Google Cloud quota documentation to learn more.
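The two JSON arguments in the commands above are easy to break with a stray quote. One way to avoid that is to build them from Python dictionaries and let `shlex` handle the shell quoting. This is a sketch rather than part of the repository: `my-project` and `my-bucket` are placeholder values, and the positional arguments are copied verbatim from the command above without interpreting them.

```python
import json
import shlex


def batch_command(project_id, bucket, train_cfg, cluster_cfg):
    """Return the docker command from the step above as one shell-safe string."""
    args = [
        "sudo", "docker", "run", "-it",
        "gcr.io/llm-containers/batch:release",
        project_id,
        "gcr.io/llm-containers/train:release",
        f"gs://{bucket}/model",
        "0",  # copied as-is from the README command
        f"gs://{bucket}/processed_dataset",
        f"gs://{bucket}/workspace",
        json.dumps(train_cfg),    # training parameters as a JSON string
        json.dumps(cluster_cfg),  # cluster description as a JSON string
    ]
    return shlex.join(args)


train_cfg = {"model_checkpoint": "t5-small", "batch_size": 128, "epochs": 1}
cluster_cfg = {
    "name_prefix": "t5node",
    "zone": "us-east4-a",
    "node_count": 1,
    "machine_type": "a3-highgpu-8g",
    "gpu_type": "nvidia-h100-80gb",
    "gpu_count": 8,
}

# "my-project" and "my-bucket" are placeholders, not real resources.
cmd = batch_command("my-project", "my-bucket", train_cfg, cluster_cfg)
print(cmd)
```

Printing the command first and pasting it into the workspace VM keeps the flow identical to the manual steps.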

Test your pipeline

  1. Run one of the following commands to deploy your model to Vertex AI:

    1. For the T5 small:

      sudo docker run -it gcr.io/llm-containers/deploy:release python deploy.py --project=${PROJECT_ID} --model_display_name=t5 --serving_container_image_uri=gcr.io/llm-containers/predict:release --model_path=gs://${BUCKET_NAME}/model --machine_type=n1-standard-32 --gpu_type=NVIDIA_TESLA_V100 --gpu_count=1 --region=us-central1
    2. For the T5 XXL:

      sudo docker run -it gcr.io/llm-containers/deploy:release python deploy.py --project=${PROJECT_ID} --model_display_name=t5 --serving_container_image_uri=gcr.io/llm-containers/predict:release --model_path=gs://${BUCKET_NAME}/model --machine_type=n1-standard-32 --gpu_type=NVIDIA_TESLA_V100 --gpu_count=1 --region=us-central1

    When it finishes deployment, it will output the following line:

    Endpoint model deployed. Resource name: projects/<project number>/locations/us-central1/endpoints/<endpoint id>
    
  2. Create a JSON file with the following content, or substitute any article of your choice:

    {
    "instances": [
        "Sandwiched between a second-hand bookstore and record shop in Cape Town's charmingly grungy suburb of Observatory is a blackboard reading 'Tapi Tapi -- Handcrafted, authentic African ice cream.' The parlor has become one of Cape Town's most talked about food establishments since opening in October 2020. And in its tiny kitchen, Jeff is creating ice cream flavors like no one else. Handwritten in black marker on the shiny kitchen counter are today's options: Salty kapenta dried fish (blitzed), toffee and scotch bonnet chile Sun-dried blackjack greens and caramel, Malted millet ,Hibiscus, cloves and anise. Using only flavors indigenous to the African continent, Guzha's ice cream has become the tool through which he is reframing the narrative around African food. 'This (is) ice cream for my identity, for other people's sake,' Jeff tells CNN. 'I think the (global) food story doesn't have much space for Africa ... unless we're looking at the generic idea of African food,' he adds. 'I'm not trying to appeal to the global universe -- I'm trying to help Black identities enjoy their culture on a more regular basis.'"
    ]
    }

    Save the file. This example uses the name prediction.json.

  3. Request a prediction using the following command, replacing 432544312 with the endpoint ID from the deployment output:

    curl \
        -X POST \
        -H "Authorization: Bearer $(gcloud auth print-access-token)" \
        -H "Content-Type: application/json" \
        https://us-central1-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/us-central1/endpoints/432544312:predict \
        -d "@prediction.json"
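The same request can be assembled from Python, which makes the URL structure and the request body explicit. This is a sketch using only the standard library. The project ID, endpoint ID and token below are placeholders, and the request is built but not sent.

```python
import json
import urllib.request


def build_predict_request(project_id, endpoint_id, articles, access_token,
                          region="us-central1"):
    """Build (but do not send) the POST request that the curl command issues."""
    url = (
        f"https://{region}-aiplatform.googleapis.com/v1/projects/{project_id}"
        f"/locations/{region}/endpoints/{endpoint_id}:predict"
    )
    # Same shape as prediction.json: a list of article strings under "instances".
    body = json.dumps({"instances": list(articles)}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
    )


# Placeholder values; use your own project, endpoint ID and a real token
# (for example the output of `gcloud auth print-access-token`).
req = build_predict_request(
    "my-project", "1234567890", ["Some article text."], access_token="TOKEN")
print(req.get_method(), req.full_url)
```

Passing the request to `urllib.request.urlopen` would send it. That step needs network access and valid credentials.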

Expected Output

  1. If you used a configuration with the T5 small model (60M parameters), the output would look like:
{
  "predictions": [
    "'Tapi Tapi -- Handcrafted, authentic African ice cream' is a"
  ],
  "deployedModelId": "8744807401842016256",
  "model": "projects/649215667094/locations/us-central1/models/6720808805245911040",
  "modelDisplayName": "t5",
  "modelVersionId": "12"
}
  2. If you used a configuration with the T5 XXL model (11B parameters), the output would look like:
{
  "predictions": [
    "Tapi Tapi is an ice cream parlor in Cape Town, South Africa."
  ],
  "deployedModelId": "8744807401842016256",
  "model": "projects/649215667094/locations/us-central1/models/6720808805245911040",
  "modelDisplayName": "t5",
  "modelVersionId": "12"
}

How it works

Download

This is the data ingestion step. Currently, it uses the Hugging Face datasets library. To learn more about loading datasets using this library, check out the library reference. The data is then downloaded to Google Cloud Storage (GCS), where the next steps in the pipeline expect to find it. This works well for datasets on the order of multiple gigabytes. In future work, we will present how to process larger datasets using Dataflow. The full training scripts can be found here.

We package this as a pipeline component that produces the dataset on GCP. The component takes the dataset and subset as input. These correspond to the ‘path’ and ‘name’ parameters passed directly to datasets.load_dataset. See the datasets library documentation to learn more about loading datasets.

[Figures: Download component, Download parameters]

In the example, we use the CNN Dailymail dataset. The code is packaged in a container available here.
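To make the mapping between the component inputs and `datasets.load_dataset` concrete, here is a small sketch. The function names are hypothetical and not taken from the repository. The download itself needs the `datasets` package and network access, so it is kept in a separate function that is not called here.

```python
def load_dataset_args(dataset, subset=None):
    """Map the component inputs onto datasets.load_dataset keyword arguments."""
    kwargs = {"path": dataset}
    if subset:
        kwargs["name"] = subset
    return kwargs


def download(dataset, subset=None):
    """Fetch the dataset from the Hugging Face Hub (needs network access)."""
    # Imported lazily so the mapping above has no third-party dependencies.
    from datasets import load_dataset
    return load_dataset(**load_dataset_args(dataset, subset))


# The values used in the quick start: --dataset=cnn_dailymail --subset=3.0.0
print(load_dataset_args("cnn_dailymail", "3.0.0"))
```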

Preprocessing

As part of the summarization task, we tokenize our dataset during the preprocessing stage. The full script for tokenization can be found here. When we are finished processing, we upload the tokenized dataset to the output GCS path.

This component allows the use of dataset formats supported by datasets.load_dataset. All the user needs to do is specify which column in the dataset contains the document body and which column contains the document summary.
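As an illustration of what column-driven tokenization can look like, the sketch below builds a batched preprocessing function from the two column names. It is not the repository's script. The length limits are assumed defaults, and a toy whitespace tokenizer stands in for a real one so that the example runs without downloading anything.

```python
def make_preprocess_fn(tokenizer, document_column, summary_column,
                       max_source_length=512, max_target_length=128):
    """Build a batched function that tokenizes documents and their summaries."""
    def preprocess(batch):
        # Documents become the model inputs.
        model_inputs = tokenizer(
            batch[document_column],
            max_length=max_source_length, truncation=True)
        # Summaries become the labels the model is trained to produce.
        labels = tokenizer(
            text_target=batch[summary_column],
            max_length=max_target_length, truncation=True)
        model_inputs["labels"] = labels["input_ids"]
        return model_inputs
    return preprocess


def whitespace_tokenizer(text=None, text_target=None, max_length=None,
                         truncation=False):
    """Toy stand-in for a real tokenizer: one id (the word length) per word."""
    texts = text if text is not None else text_target
    ids = [[len(word) for word in t.split()] for t in texts]
    if truncation and max_length is not None:
        ids = [seq[:max_length] for seq in ids]
    return {"input_ids": ids}


preprocess = make_preprocess_fn(whitespace_tokenizer, "article", "highlights")
batch = {"article": ["a bb ccc"], "highlights": ["dd e"]}
print(preprocess(batch))
```

With the Hugging Face libraries installed, the toy tokenizer would be replaced by `AutoTokenizer.from_pretrained(...)` and the function applied with `dataset.map(preprocess, batched=True)`.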

[Figures: Preprocess component, Preprocess parameters]

Fine Tuning

In this step of the pipeline, we take the tokenized dataset and finetune a base model on it. For large language models, this is typically the step that consumes the most resources and takes a significant amount of time. Depending on how many GPUs are used for training and how many epochs you train for, this can take from hours to days.

[Figures: Fine tuning component, Fine tuning parameters]

The general workflow for finetuning is that we spawn a cluster of GPU VMs on GCP. In the example shown, we use 8 A2 plus machines with 8 A100 GPUs. We preprovision them with DLVM images that include the necessary GPU drivers and the NVIDIA Collective Communications Library (NCCL). We download our training code, which is packaged in a Docker image, and use DeepSpeed to launch and coordinate training across all the VMs. We use fluentd to collect logs from the VMs and upload them to Google Cloud Logging. We save model checkpoints, including the final trained model, to GCS.

[Figure: Training architecture]

Let’s look at the implementation of some of these parts:

Cluster Provisioning

To provision the cluster of VMs, we use a pre-created container that we call the ‘batch container’. The container is available here. It is based on the cluster provisioning tool container available here. Creating VMs using the tool is as simple as setting a few environment variables and calling an entry point. It will automatically create the cluster with an image of choice and run a starting command on all VMs.

When the job completes successfully, we call the entry point again requesting the destruction of the cluster.

Training Container

All the VMs will be provisioned with the DLVM image with NVIDIA drivers preinstalled. The VMs will download the training Docker image and invoke the training start command. The training container is available here, and you can find the finetuning scripts here.

The container image has some pre-installed packages and configuration. This includes:

  • Transformer libraries and deepspeed with compiled ops
  • ssh server that allows deepspeed launcher to launch training inside the container
  • Google Cloud logging agent to collect logs from the container and publish to the cloud.
  • Training scripts

The head node will invoke the head script which:

  • Configures ssh to talk to all servers in the cluster
  • Configures fluentd to collect logs
  • Invokes deepspeed with correct config to run the training script using ZeRO stage 3

In turn, deepspeed would use its default ssh launcher to launch the training scripts on all cluster containers.
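The head script itself is not reproduced here, but two of the artifacts it needs can be sketched: a DeepSpeed hostfile, which lists one `<host> slots=<gpus>` line per node for the ssh launcher, and a minimal config that enables ZeRO stage 3. The host names and batch values below are hypothetical, and the real script may generate these differently.

```python
import json


def make_hostfile(hostnames, gpus_per_node):
    """Render a DeepSpeed hostfile: one '<host> slots=<gpus>' line per node."""
    return "".join(f"{host} slots={gpus_per_node}\n" for host in hostnames)


def zero3_config(micro_batch_per_gpu, gradient_accumulation_steps=1):
    """Minimal DeepSpeed config dictionary that turns on ZeRO stage 3."""
    return {
        "train_micro_batch_size_per_gpu": micro_batch_per_gpu,
        "gradient_accumulation_steps": gradient_accumulation_steps,
        "zero_optimization": {"stage": 3},
    }


# Hypothetical VM names; the provisioning tool chooses the real ones.
hosts = [f"t5node-{i}" for i in range(2)]
print(make_hostfile(hosts, gpus_per_node=8), end="")
print(json.dumps(zero3_config(micro_batch_per_gpu=2), indent=2))
```

Written to disk, these two files are what a launch such as `deepspeed --hostfile=<hostfile> <training script> ...` would consume. The exact invocation used by the head script is not shown in this document.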

After the training container provisions the cluster, it will start downloading the model weights and dataset, and will then kick off training. It will show the following message:

Waiting for training to start...

This could take up to 20 minutes depending on the size of your model and data.

Saving to GCS

The training scripts use the Hugging Face Transformers library to finetune a model for a summarization task on the given dataset. To save training progress as it goes, we create a training callback that uploads each checkpoint to GCS. The callback looks like this:

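import os

# Assumed: absl flags and logging, as used by the repository's pipeline.py.
from absl import flags
from absl import logging
from transformers import (TrainerCallback, TrainerControl, TrainerState,
                          TrainingArguments)

import utils  # helper module from this repository (provides gcs_path)

FLAGS = flags.FLAGS  # expects a gcs_output flag defined by the training script


class GCSSaveCallback(TrainerCallback):
  """A [`TrainerCallback`] that uploads each saved checkpoint to GCS."""

  def on_save(self, args: TrainingArguments, state: TrainerState,
              control: TrainerControl, **kwargs):
    # The Trainer writes each checkpoint to <output_dir>/checkpoint-<step>.
    checkpoint_folder = f'checkpoint-{state.global_step}'
    local_output_dir = os.path.join(args.output_dir, checkpoint_folder)
    if not os.path.exists(local_output_dir):
      logging.error(
          'Checkpoint callback called for a non-existing checkpoint %s',
          local_output_dir,
      )
      return
    # Mirror the local checkpoint path under the GCS output root.
    gcs_root, gcs = utils.gcs_path(FLAGS.gcs_output)
    gcs_output_dir = os.path.join(gcs_root, local_output_dir)
    logging.info('Uploading %s....', gcs_output_dir)
    gcs.put(local_output_dir, gcs_output_dir, recursive=True)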

When the model is completely trained, we save the final copy to GCS.

Evaluation and metrics

When the model finishes training, we run our evaluation to measure model quality. Since this is a summarization task, we collect ROUGE scores for the model. You can learn more about ROUGE scores here. Once we have the final scores, we save them alongside the model in GCS. By saving the metrics with the model, we can determine the model's performance just by looking at the saved model, without having to run evaluation again, as in the example below:

[Figure: Metrics]
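For readers unfamiliar with ROUGE, the sketch below computes ROUGE-1, the unigram-overlap variant, in plain Python. It is a simplified illustration: whitespace tokenization, no stemming, and a single reference. It is not the evaluation code the pipeline uses, which is not shown in this document.

```python
from collections import Counter


def rouge1(candidate, reference):
    """Unigram-overlap precision, recall and F1 between two texts."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: each word counts at most as often as in the other text.
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    if overlap == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge1("the cat sat on the mat", "the cat is on the mat")
print(scores)
```

In this example five of the six words in each sentence overlap, so precision, recall and F1 all equal 5/6.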

Model deployment

Now that our model is ready, we want to provide it as a service for authorized users to call. This can serve as a backend to a frontend web application. We go with a simple and quick solution to demo our model. For this purpose, we use Vertex AI Prediction which provides us with the quickest path to serve our model.

[Figures: Deployment component, Deployment parameters]

We package the model serving code in a simple prediction container which is available here. The container packs a Flask server with a simple python script that uses deepspeed to perform prediction from the model. It implements the necessary REST APIs required by Vertex AI Prediction.
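A rough sketch of such a server is shown below. It is not the code in the prediction container. It assumes the request and response shapes shown earlier ("instances" in, "predictions" out) and reads the routes from the `AIP_HEALTH_ROUTE` and `AIP_PREDICT_ROUTE` environment variables that Vertex AI sets for custom containers. A placeholder function stands in for the model, and Flask is imported lazily so the handler can be exercised without it.

```python
import os


def handle_predict(body, summarize):
    """Turn a Vertex AI request body into the response body it expects back."""
    instances = body.get("instances", [])
    return {"predictions": [summarize(text) for text in instances]}


def create_app(summarize):
    """Wire the handler into Flask using the routes passed by Vertex AI."""
    from flask import Flask, jsonify, request  # only needed when serving

    app = Flask(__name__)
    health_route = os.environ.get("AIP_HEALTH_ROUTE", "/health")
    predict_route = os.environ.get("AIP_PREDICT_ROUTE", "/predict")

    @app.route(health_route, methods=["GET"])
    def health():
        return jsonify({"status": "ok"})

    @app.route(predict_route, methods=["POST"])
    def predict():
        return jsonify(handle_predict(request.get_json(force=True), summarize))

    return app


def first_sentence(text):
    """Placeholder 'model' so the sketch runs without any weights."""
    return text.split(". ")[0].rstrip(".") + "."


response = handle_predict(
    {"instances": ["Tapi Tapi sells ice cream. It opened in 2020."]},
    first_sentence)
print(response)
```

To serve it, one would call `create_app(model_fn).run(host="0.0.0.0", port=int(os.environ.get("AIP_HTTP_PORT", "8080")))`, with `model_fn` replaced by a function that runs the actual model.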

For larger models or production class deployment, we can deploy directly to GKE and use Triton Inference Server. See an example with T5.

Current Pipeline Supported configurations

Data

The pipeline only supports data that is loadable using load_dataset. It works well for datasets smaller than 100 GB. For larger datasets, users will need to write custom processing scripts on Dataflow.

Compute

The pipeline supports the VM types supported by GCP Compute Engine. For a full list, check GCE GPU Platforms. You can create a cluster of any size as long as you have the quota for it.

Models

For the summarization task, we support any sequence-to-sequence model in the Hugging Face model repository. This includes T5, mT5, BART, and Marian models. For more information about sequence-to-sequence models, check here. The model can be of any size as long as the corresponding parameters (batch size, number of nodes, number of GPUs, etc.) let the model fit into the compute cluster.

Deployment Scale

Currently we support SKUs that are available for Vertex AI Prediction which can be found here.

llm-pipeline-examples's People

Contributors

abdallag, chris113113, dumb-programmer, kuo-lin, sdlin


llm-pipeline-examples's Issues

NotImplementedError due to NoneType when running pipeline.py

Hi there!

I'm running the code with the following pyproject.toml

[tool.poetry]
name = "llm-pipeline-examples"
version = "0.1.0"
description = ""
authors = []
license = "Apache 2.0"
readme = "README.md"

[tool.poetry.dependencies]
python = "^3.9"
absl-py = "^1.4.0"
google = "^3.0.0"
kfp = "^1.8.19"
google-cloud-aiplatform = "^1.23.0"

[tool.poetry.group.dev.dependencies]
pyright = "^1.1.301"
pylint = "^2.17.1"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"

Running pipeline.py results in errors.

$ poetry run python pipeline.py --project=$PROJECT_ID --pipeline_root=gs://$BUCKET_NAME/pipeline_runs/ --config=configs/small1vm1gpu.json
/home/anirudh/.cache/pypoetry/virtualenvs/llm-pipeline-examples-uTYd1Pp6-py3.9/lib/python3.9/site-packages/kfp/v2/compiler/compiler.py:1290: FutureWarning: APIs imported from the v1 namespace (e.g. kfp.dsl, kfp.components, etc) will not be supported by the v2 compiler since v2.0.0
  warnings.warn(
Traceback (most recent call last):
  File "/home/anirudh/projects/llm-pipeline-examples/pipeline.py", line 402, in <module>
    app.run(main)
  File "/home/anirudh/.cache/pypoetry/virtualenvs/llm-pipeline-examples-uTYd1Pp6-py3.9/lib/python3.9/site-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/home/anirudh/.cache/pypoetry/virtualenvs/llm-pipeline-examples-uTYd1Pp6-py3.9/lib/python3.9/site-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "/home/anirudh/projects/llm-pipeline-examples/pipeline.py", line 345, in main
    compiler.Compiler().compile(
  File "/home/anirudh/.cache/pypoetry/virtualenvs/llm-pipeline-examples-uTYd1Pp6-py3.9/lib/python3.9/site-packages/kfp/v2/compiler/compiler.py", line 1301, in compile
    pipeline_job = self._create_pipeline_v2(
  File "/home/anirudh/.cache/pypoetry/virtualenvs/llm-pipeline-examples-uTYd1Pp6-py3.9/lib/python3.9/site-packages/kfp/v2/compiler/compiler.py", line 1223, in _create_pipeline_v2
    pipeline_func(*args_list)
  File "/home/anirudh/projects/llm-pipeline-examples/pipeline.py", line 303, in my_pipeline
    deploy_op = deploy(
  File "/home/anirudh/.cache/pypoetry/virtualenvs/llm-pipeline-examples-uTYd1Pp6-py3.9/lib/python3.9/site-packages/kfp/components/_dynamic.py", line 53, in Deploy
    return dict_func(locals())  # noqa: F821 TODO
  File "/home/anirudh/.cache/pypoetry/virtualenvs/llm-pipeline-examples-uTYd1Pp6-py3.9/lib/python3.9/site-packages/kfp/components/_components.py", line 389, in create_task_object_from_component_and_pythonic_arguments
    return _create_task_object_from_component_and_arguments(
  File "/home/anirudh/.cache/pypoetry/virtualenvs/llm-pipeline-examples-uTYd1Pp6-py3.9/lib/python3.9/site-packages/kfp/components/_components.py", line 326, in _create_task_object_from_component_and_arguments
    task = _container_task_constructor(
  File "/home/anirudh/.cache/pypoetry/virtualenvs/llm-pipeline-examples-uTYd1Pp6-py3.9/lib/python3.9/site-packages/kfp/dsl/_component_bridge.py", line 319, in _create_container_op_from_component_and_arguments
    _attach_v2_specs(task, component_spec, original_arguments)
  File "/home/anirudh/.cache/pypoetry/virtualenvs/llm-pipeline-examples-uTYd1Pp6-py3.9/lib/python3.9/site-packages/kfp/dsl/_component_bridge.py", line 600, in _attach_v2_specs
    raise NotImplementedError(
NotImplementedError: Input argument supports only the following types: PipelineParam, str, int, float, bool, dict, and list. Got: "None".

I've tried this with Python 3.8, 3.9, and 3.10 without success. It always fails with the same error. Do you know if my dependencies might be incorrect or could something else be causing this issue?

Stuck at creating H100 instances

I'm trying to run the example with 2 H100x8 nodes to test the DirectGPU-TCPX speed.
There's some credential issues when calling "gsutil cp", so I've created a local docker image(h100launcher in the command below) with credential json.
This is my command to start the provisioning process:

sudo docker run -it docker.io/library/h100launcher:latest $PROJECT_ID gcr.io/llm-containers/gpt_train:release gs://$BUCKET_NAME 0 0 0 ' {"data_file_name":"wiki_data_text_document", "tensor_parallel":4, "pipeline_parallel":12, "nlayers":70, "hidden":14336, "heads":112, "seq_len":2048, "train_steps":100, "eval_steps":10, "micro_batch":1, "gradient_acc_steps":128 }' '{ "name_prefix": "megatron-gpt", "zone": "us-central1-a", "node_count": 2, "machine_type": "a3-highgpu-8g", "gpu_type": "nvidia-h100-80gb", "gpu_count": 8 }'

I can see the network interfaces are created successfully in the firewall page, but it gets stuck creating the instances:

...
module.compute_instance_group_manager[0].google_compute_instance_group_manager.mig: Still creating... [26m0s elapsed]
module.compute_instance_group_manager[0].google_compute_instance_group_manager.mig: Still creating... [26m10s elapsed]
module.compute_instance_group_manager[0].google_compute_instance_group_manager.mig: Still creating... [26m20s elapsed]
module.compute_instance_group_manager[0].google_compute_instance_group_manager.mig: Still creating... [26m30s elapsed]
module.compute_instance_group_manager[0].google_compute_instance_group_manager.mig: Still creating... [26m40s elapsed]
module.compute_instance_group_manager[0].google_compute_instance_group_manager.mig: Still creating... [26m50s elapsed]
module.compute_instance_group_manager[0].google_compute_instance_group_manager.mig: Still creating... [27m0s elapsed]
module.compute_instance_group_manager[0].google_compute_instance_group_manager.mig: Still creating... [27m10s elapsed]
module.compute_instance_group_manager[0].google_compute_instance_group_manager.mig: Still creating... [27m20s elapsed]
module.compute_instance_group_manager[0].google_compute_instance_group_manager.mig: Still creating... [27m30s elapsed]
module.compute_instance_group_manager[0].google_compute_instance_group_manager.mig: Still creating... [27m40s elapsed]
module.compute_instance_group_manager[0].google_compute_instance_group_manager.mig: Still creating... [27m50s elapsed]

I should have enough H100 quotas.
Do you have any suggestions on how to debug this? Thanks!

Ran into an error running the container

Deploying container to Cloud Run service [my-service] in project [test-gke-t5] region [us-central1]
X Deploying new service...
. Creating Revision...
. Routing traffic...
. Setting IAM Policy...
Deployment failed
ERROR: gcloud crashed (ValidationError): Expected type <class 'str'> for field value, found True (type <class 'bool'>)

If you would like to report this issue, please run the following command:
gcloud feedback

To check gcloud for common problems, please run the following command:
gcloud info --run-diagnostics

Security Policy violation Binary Artifacts

This issue was automatically created by Allstar.

Security Policy Violation
Project is out of compliance with Binary Artifacts policy: binaries present in source code

Rule Description
Binary Artifacts are an increased security risk in your repository. Binary artifacts cannot be reviewed, allowing the introduction of possibly obsolete or maliciously subverted executables. For more information see the Security Scorecards Documentation for Binary Artifacts.

Remediation Steps
To remediate, remove the generated executable artifacts from the repository.

Artifacts Found

  • src/__pycache__/utils.cpython-310.pyc

Additional Information
This policy is drawn from Security Scorecards, which is a tool that scores a project's adherence to security best practices. You may wish to run a Scorecards scan directly on this repository for more details.


Allstar has been installed on all Google managed GitHub orgs. Policies are gradually being rolled out and enforced by the GOSST and OSPO teams. Learn more at http://go/allstar

This issue will auto resolve when the policy is in compliance.

Issue created by Allstar. See https://github.com/ossf/allstar/ for more information. For questions specific to the repository, please contact the owner or maintainer.

Support for AutoModelForCausalLM

Can there be an image that supports CausalLM training?
This is important for our use-case and I was wondering if you could suggest some steps to:

  • Create my own container image with modified train and dataset read code (something similar to what you did to create this pipeline)

Some permissions not granted by default when following readme

When running the pipeline from a fresh VM and new GCP Project I see auth issues. Required IAM roles should be called out in the doc.

Running list of needed permissions for the Service Account not present by default:
compute.instanceGroupManagers.get
aiplatform.metadataStores.get
iam.serviceAccountUser

Unable to reuse previously provisioned cluster

Scenario: first run training cluster was provisioned, deepspeed executed but training failed. Cluster remained.

Trying to reuse the cluster for a rerun, encountered following problems:

  1. pipeline.py doesn't seem to be able to find the existing cluster.
  2. error in run_batch.sh script for "gcloud compute instances list" - Required 'compute.instances.list' permission for 'projects/e64a316e7aac5aacfp-tp'

I was able to get the script to recognize my previous cluster after my mods (investigation below), but now the ssh fails, so it is unable to issue the docker commands to restart the training. It seems to have failed at this line:

gcloud compute ssh $(head -n 1 machines.txt | sed 's/\(\S\+\) .*/\1/') --zone=$ZONE --internal-ip --ssh-key-expire-after=1d --strict-host-key-checking=no --command="echo 'ssh is available'"

What am I missing? advice appreciated!

Cluster found! Exporting machine list...
Copying file://machines.txt [Content-Type=text/plain]...
/ [0 files][    0.0 B/   27.0 B]
/ [1 files][   27.0 B/   27.0 B]
Operation completed over 1 objects/27.0 B.                                       
Restarting training on VMs...
CommandException: No URLs matched: gs://MY_BUCKET/pipeline_runs/PROJECT_ID/llm-pipeline-20230924223600/train_-5175810662584025088/model/progress.txt
Copying gs://MY_BUCKET/pipeline_runs/PROJECT_ID/llm-pipeline-20230924223600/train_-5175810662584025088/model/machines.txt...
/ [0 files][    0.0 B/   27.0 B]
/ [1 files][   27.0 B/   27.0 B]
Operation completed over 1 objects/27.0 B.                                       
WARNING: The private SSH key file for gcloud does not exist.
WARNING: The public SSH key file for gcloud does not exist.
WARNING: You do not have an SSH key for gcloud.
WARNING: SSH keygen will be executed to generate a key.
Generating public/private rsa key pair.
Your identification has been saved in /root/.ssh/google_compute_engine
Your public key has been saved in /root/.ssh/google_compute_engine.pub
The key fingerprint is:
SHA256:xxxxxxxxxxx
The key's randomart image is:
+---[RSA 3072]----+
...
+----[SHA256]-----+
ERROR: (gcloud.compute.ssh) [/usr/bin/ssh] exited with return code [255].

Investigations

1) Cluster recognition:

id=str(int(time.time())),

It is the time suffix that causes the mismatch when the code looks for the job. Setting id="" will allow matching.
If there is more than one cluster and only one is intended to be found, we can use the "name_prefix" field to provide a prefix that matches just one cluster. Perhaps we should add this to documentation or add a flag to pipeline.py for reuse.

"name_prefix" : "t5node",

2) run_batch.sh, I found that there are multiple issues:

adding --project=${PROJECT} to the following lines solved the unknown project issue:

In fact, I use the following, with an explicit filter and headers, to make sure the regexp will work:

gcloud compute instances list --project=${PROJECT} --filter='STATUS=RUNNING' --format='csv[no-heading,separator=" "](NAME,ZONE,MACHINE_TYPE,PREEMPTIBLE,INTERNAL_IP,EXTERNAL_IP,STATUS)' | grep ${JOB_ID} | sed 's/\(\S\+\) .* \([0-9\.]\+\)[0-9\.,]* \([0-9\.]\+\)\? RUNNING/\1 \2/' | sort | head -n ${NODE_COUNT} > machines.txt

I also replaced nvidia-persistenced, because it failed in my VM:

export PRE_DOCKER_RUN="nvidia-smi -pm 1;" 
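
The instance-listing pipeline above (grep, sed, sort, head) can also be sketched in Python, which may be easier to debug than the regexp. This is a sketch under assumptions: it assumes the listing is the space-separated, seven-column output requested by the --format flag above, that empty columns show up as consecutive spaces, and that a multi-NIC internal IP is comma-separated. The function name select_machines is hypothetical.

```python
def select_machines(listing: str, job_id: str, node_count: int) -> list[str]:
    """Return up to node_count sorted "NAME INTERNAL_IP" rows for one job.

    Expected columns per line (assumption, taken from the --format flag):
    NAME ZONE MACHINE_TYPE PREEMPTIBLE INTERNAL_IP EXTERNAL_IP STATUS
    """
    rows = []
    for line in listing.splitlines():
        if job_id not in line:
            continue  # same role as `grep ${JOB_ID}`
        # Split on single spaces so that empty columns are preserved.
        fields = line.split(" ")
        if len(fields) != 7 or fields[6] != "RUNNING":
            continue  # skip rows that do not have the expected shape
        name = fields[0]
        internal_ip = fields[4].split(",")[0]  # keep the first IP only
        rows.append(f"{name} {internal_ip}")
    # same role as `sort | head -n ${NODE_COUNT}`
    return sorted(rows)[:node_count]


if __name__ == "__main__":
    sample = (
        "job123-1 us-central1-a a2-highgpu-1g  10.0.0.3 34.1.1.1 RUNNING\n"
        "job123-0 us-central1-a a2-highgpu-1g  10.0.0.2  RUNNING\n"
    )
    print("\n".join(select_machines(sample, "job123", 2)))
```

The returned rows have the same "NAME INTERNAL_IP" shape that the sed expression writes to machines.txt.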

NOTE for devs reading at this point: run_batch.sh is inside the batch container, so changing it means you need to rebuild the image and use your own rather than the supplied gcr.io one:

docker build . -t ${YOUR_IMAGE_TAG} -f docker/batch.Dockerfile
docker push ${YOUR_IMAGE_TAG}

# components/trainer.yaml - make sure it references your image before running pipeline.py

implementation:
  container:
    image: <YOUR_IMAGE_TAG>

Trainer parameters seem wrong.

There seem to be more parameters passed to run_batch.sh than expected.

Screenshot 2023-07-26 at 19 21 21



As a consequence, trainer_config.json is incorrect.

Screenshot 2023-07-26 at 19 16 45

GKE Inference instructions improvements

GKE cluster creation fails

Errors:

Provisioning cluster...
2023-06-14 15:32:09.488 PDT
Fetching cluster endpoint and auth data.
2023-06-14 15:32:09.733 PDT
kubeconfig entry generated for gke-inference-cluster-gke.
2023-06-14 15:32:09.952 PDT
Deploying predict image to cluster
2023-06-14 15:32:13.579 PDT
E0614 22:32:13.577621 789 memcache.go:287] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
2023-06-14 15:32:13.637 PDT
E0614 22:32:13.635635 789 memcache.go:121] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
2023-06-14 15:32:13.874 PDT
deployment.apps/flan-t5-base-deployment configured
2023-06-14 15:32:13.906 PDT
service/flan-t5-base unchanged
2023-06-14 15:32:14.034 PDT
E0614 22:32:14.032452 817 memcache.go:287] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
2023-06-14 15:32:14.043 PDT
E0614 22:32:14.041722 817 memcache.go:121] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
2023-06-14 15:32:14.048 PDT
E0614 22:32:14.046914 817 memcache.go:121] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
2023-06-14 15:32:14.053 PDT
E0614 22:32:14.051580 817 memcache.go:121] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
2023-06-14 15:32:14.097 PDT
error: no matching resources found
2023-06-14 15:32:14.112 PDT
Container called exit(1).

Instructions don't mention the gcloud run deploy interaction

When following the instructions, running gcloud run deploy results in an interactive prompt that asks for the service name, the region, and whether the endpoint should allow unauthenticated access. Can we pass these as parameters so that we are not asked?
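
One possible way to avoid the prompts is to pass every value on the command line. The sketch below only builds the argument list; the service name, image, and region values are placeholders, and the helper name build_deploy_command is hypothetical. It relies on the service name being the positional argument of gcloud run deploy, and on the --image, --region, --allow-unauthenticated / --no-allow-unauthenticated, and --quiet flags.

```python
import shlex


def build_deploy_command(
    service: str, image: str, region: str, allow_unauthenticated: bool
) -> list[str]:
    """Build a gcloud run deploy command that should not need to prompt."""
    auth_flag = (
        "--allow-unauthenticated"
        if allow_unauthenticated
        else "--no-allow-unauthenticated"
    )
    return [
        "gcloud", "run", "deploy", service,
        "--image", image,
        "--region", region,
        auth_flag,
        "--quiet",  # disable any remaining interactive prompts
    ]


if __name__ == "__main__":
    # Placeholder values; replace them with your own before running.
    cmd = build_deploy_command(
        "my-service", "gcr.io/PROJECT_ID/IMAGE", "us-central1", False
    )
    print(shlex.join(cmd))
    # To execute it: subprocess.run(cmd, check=True)
```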
