Failure on A100 32GB (llama) · 14 comments · closed

vincenzoml commented on August 25, 2024

Failure on A100 32GB

Comments (14)

Qubitium commented on August 25, 2024

A 3090 24GB hits the same error with 7B. There should be GPU memory requirements in the README. Please add them.
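(As a rough rule of thumb, the fp16 weights alone need about 2 bytes per parameter: roughly 13 GB for 7B and 26 GB for 13B, before the KV cache and activations.)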

maartenjv commented on August 25, 2024

You can lower the max batch size. See here: #42 (comment)

model_args: ModelArgs = ModelArgs(max_seq_len=1024, max_batch_size=32, **params)
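
For example, lowering it to 2 (the value reported to work on an A100 40GB later in this thread) would look like:

model_args: ModelArgs = ModelArgs(max_seq_len=1024, max_batch_size=2, **params)  # lowered from the default 32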

fabawi commented on August 25, 2024

I was able to run 7B on two 1080 Tis (inference only). Next, I'll try 13B and 33B. It still needs refining, but it works! I forked LLaMA here:

https://github.com/modular-ml/wrapyfi-examples_llama

and have a README with instructions on how to do it:

LLaMA with Wrapyfi

Wrapyfi enables distributing LLaMA (inference only) on multiple GPUs/machines, each with less than 16GB VRAM

Currently it distributes across two cards only, using ZeroMQ. Flexible distribution will be supported soon!

This approach has only been tested on the 7B model for now, using Ubuntu 20.04 with two 1080 Tis. Testing 13B/30B models soon!
UPDATE: Tested on two 3080 Tis as well!

How to?

  1. Replace all instances of <YOUR_IP> and <YOUR CHECKPOINT DIRECTORY> before running the scripts.

  2. Download the LLaMA weights using the official form below and install this wrapyfi-examples_llama inside a conda or virtual env:

git clone https://github.com/modular-ml/wrapyfi-examples_llama.git
cd wrapyfi-examples_llama
pip install -r requirements.txt
pip install -e .
  3. Install Wrapyfi within the same environment:
git clone https://github.com/fabawi/wrapyfi.git
cd wrapyfi
pip install .[pyzmq]
  4. Start the Wrapyfi ZeroMQ broker from within the Wrapyfi repo:
cd wrapyfi/standalone 
python zeromq_proxy_broker.py --comm_type pubsubpoll
  5. Start the first instance of the Wrapyfi-wrapped LLaMA from within this repo and env (order is important, don't start wrapyfi_device_idx=0 before wrapyfi_device_idx=1):
CUDA_VISIBLE_DEVICES="0" OMP_NUM_THREADS=1 torchrun --nproc_per_node 1 example.py --ckpt_dir <YOUR CHECKPOINT DIRECTORY>/checkpoints/7B --tokenizer_path <YOUR CHECKPOINT DIRECTORY>/checkpoints/tokenizer.model --wrapyfi_device_idx 1
  6. Now start the second instance (within this repo and env):
CUDA_VISIBLE_DEVICES="1" OMP_NUM_THREADS=1 torchrun --master_port=29503 --nproc_per_node 1 example.py --ckpt_dir <YOUR CHECKPOINT DIRECTORY>/checkpoints/7B --tokenizer_path <YOUR CHECKPOINT DIRECTORY>/checkpoints/tokenizer.model --wrapyfi_device_idx 0
  7. You will now see the output on both terminals.

  8. EXTRA: To run on different machines, the broker must be running on a specific IP in step 4. Start the ZeroMQ broker by setting the IP and provide the env variables for steps 5 and 6, e.g.:

### (replace 10.0.0.101 with <YOUR_IP>) ###

# step 4 modification 
python zeromq_proxy_broker.py --socket_ip 10.0.0.101 --comm_type pubsubpoll

# step 5 modification
CUDA_VISIBLE_DEVICES="0" OMP_NUM_THREADS=1 WRAPYFI_ZEROMQ_SOCKET_IP='10.0.0.101' torchrun --nproc_per_node 1 example.py --ckpt_dir <YOUR CHECKPOINT DIRECTORY>/checkpoints/7B --tokenizer_path <YOUR CHECKPOINT DIRECTORY>/checkpoints/tokenizer.model --wrapyfi_device_idx 1

# step 6 modification
CUDA_VISIBLE_DEVICES="1" OMP_NUM_THREADS=1 WRAPYFI_ZEROMQ_SOCKET_IP='10.0.0.101' torchrun --master_port=29503 --nproc_per_node 1 example.py --ckpt_dir <YOUR CHECKPOINT DIRECTORY>/checkpoints/7B --tokenizer_path <YOUR CHECKPOINT DIRECTORY>/checkpoints/tokenizer.model --wrapyfi_device_idx 0

zeelsheladiya commented on August 25, 2024

To address the "CUDA out of memory" error, you can implement the following changes in your code:

  1. Reduce Batch Size: Decrease the batch size used during training or inference (for this repo, that is the max_batch_size passed to ModelArgs in example.py).
batch_size = 4  # Reduce the batch size
  2. Memory Management: Set the max_split_size_mb option for the PyTorch CUDA caching allocator (via the PYTORCH_CUDA_ALLOC_CONF environment variable) to reduce fragmentation, and optionally cap how much of the GPU this process may use. Place these lines before any CUDA allocations.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"  # reduce allocator fragmentation

import torch
torch.cuda.set_per_process_memory_fraction(0.5, device=0)  # optional: cap this process at 50% of GPU 0's memory
  3. Model Initialization: Initialize your model within a try-except block to catch any OutOfMemoryError and handle it gracefully.
def load_model(ckpt_dir, tokenizer_path, local_rank, world_size):
    try:
        model = Transformer(model_args)
    except torch.cuda.OutOfMemoryError as e:
        print(f"Error initializing the model: {e}")
        # Handle the error, such as reducing model size or batch size
        return None
    return model

generator = load_model(ckpt_dir, tokenizer_path, local_rank, world_size)
if generator is None:
    sys.exit(1)  # Exit the script if model initialization failed
  4. Gradient Accumulation: For training or fine-tuning, implement gradient accumulation to simulate larger batch sizes and reduce memory consumption (not applicable to the inference-only example.py in this repo).
accumulation_steps = 2  # Accumulate gradients over 2 small batches
for step in range(total_steps):
    for _ in range(accumulation_steps):
        inputs, targets = next(data_iter)        # load a small batch
        loss = loss_fn(model(inputs), targets)   # forward pass
        (loss / accumulation_steps).backward()   # scale so accumulated gradients match one large batch

    # Update weights after gradient accumulation
    optimizer.step()
    optimizer.zero_grad()
  5. Free GPU Memory: Explicitly delete tensors that are no longer needed to free up GPU memory.
del cache_k, cache_v      # after these tensors are no longer needed
torch.cuda.empty_cache()  # release cached blocks so the memory can be reused

Remember to experiment and fine-tune these changes to find the optimal settings that fit within your available GPU memory while maintaining training stability and performance.

jspisak commented on August 25, 2024

@zeelsheladiya - feels like a blog post? :) Let me know if you want to author something..

jspisak commented on August 25, 2024

cc @subramen - closing this one out but we should also consider any knowledge transfer to llama-recipes @HamidShojanazeri

andrewssobral commented on August 25, 2024

Same for me, but in my case I have 2x RTX 2070 (8 GB each), 16 GB in total.
How could we use multiple GPUs?

# |  Model | MP |
# |--------|----|
# | 7B     | 1  |
# | 13B    | 2  |
# | 30B    | 4  |
# | 65B    | 8  |
export TARGET_FOLDER="models"
export model_size="7B"
export MP="1"
(llama_env) andrews@gpuserver:~/llms/llama$ torchrun --nproc_per_node $MP example.py --ckpt_dir $TARGET_FOLDER/$model_size --tokenizer_path $TARGET_FOLDER/tokenizer.model
> initializing model parallel with size 1
> initializing ddp with size 1
> initializing pipeline with size 1
Loading
Traceback (most recent call last):
  File "example.py", line 72, in <module>
    fire.Fire(main)
  File "/home/andrews/llms/llama/llama_env/lib/python3.8/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/andrews/llms/llama/llama_env/lib/python3.8/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/andrews/llms/llama/llama_env/lib/python3.8/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "example.py", line 62, in main
    generator = load(ckpt_dir, tokenizer_path, local_rank, world_size)
  File "example.py", line 48, in load
    model = Transformer(model_args)
  File "/home/andrews/llms/llama/llama/model.py", line 211, in __init__
    self.layers.append(TransformerBlock(layer_id, params))
  File "/home/andrews/llms/llama/llama/model.py", line 184, in __init__
    self.attention = Attention(args)
  File "/home/andrews/llms/llama/llama/model.py", line 104, in __init__
    self.wo = RowParallelLinear(
  File "/home/andrews/llms/llama/llama_env/lib/python3.8/site-packages/fairscale/nn/model_parallel/layers.py", line 349, in __init__
    self.weight = Parameter(torch.Tensor(self.out_features, self.input_size_per_partition))
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 32.00 MiB (GPU 0; 7.79 GiB total capacity; 6.48 GiB already allocated; 27.69 MiB free; 6.48 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2975677) of binary: /home/andrews/llms/llama/llama_env/bin/python3
Traceback (most recent call last):
  File "/home/andrews/llms/llama/llama_env/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/andrews/llms/llama/llama_env/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/andrews/llms/llama/llama_env/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/home/andrews/llms/llama/llama_env/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/andrews/llms/llama/llama_env/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/andrews/llms/llama/llama_env/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
example.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-03-02_11:49:19
  host      : activeeon-gpuserver
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 2975677)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

iodine-pku commented on August 25, 2024

Same for me, but in my case I have 2x RTX 2070 (8 GB each), 16 GB in total. How could we use multiple GPUs?

[quoted command and traceback omitted; identical to andrewssobral's comment above]

I have tried the free-tier Google Colab, which has a Tesla T4 GPU with 15.36 GB VRAM, and the error message is like yours. Maybe you just need more VRAM (the 7B model's checkpoint file alone is about 13 GB).

ahmedlila commented on August 25, 2024

[quoted comment and traceback omitted; identical to iodine-pku's reply above]

Same here

Qubitium commented on August 25, 2024

@vincenzoml Your log shows the 40GB A100 model, not the 32GB model. Can you confirm?

(GPU 0; 39.59 GiB total capacity; 

vincenzoml commented on August 25, 2024

@vincenzoml Your log shows the 40GB A100 model, not the 32GB model. Can you confirm?

(GPU 0; 39.59 GiB total capacity; 

Yes, I confirm. Sorry for the mistake.

vincenzoml commented on August 25, 2024

I confirm that setting max_batch_size=2 in model_args in example.py lets my A100 40GB run the example. Setting it to 1 causes an assertion error. I will later investigate whether the number can be raised and whether it affects runtime.

jacklxc commented on August 25, 2024

@vincenzoml if the batch size is 1, then the number of prompts per forward pass should also be 1. https://github.com/facebookresearch/llama/blob/76066b1b5cf467ce750f51af15cd34de442185e7/example.py#L63
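
A minimal sketch of that constraint, assuming the stock example.py and generation code (the generator asserts len(prompts) <= max_batch_size):

prompts = ["I believe the meaning of life is"]  # exactly one prompt when max_batch_size=1
results = generator.generate(prompts, max_gen_len=256, temperature=0.8, top_p=0.95)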

fabawi commented on August 25, 2024

Same for me, but in my case I have 2x RTX 2070 (8 GB each), 16 GB in total. How could we use multiple GPUs?

[quoted command and traceback omitted; identical to andrewssobral's comment above]

Check out the README at https://github.com/modular-ml/wrapyfi-examples_llama; I have instructions there on how to do it.
