Train with single gpu about ultralytics HOT 3 CLOSED

TonightGo commented on July 24, 2024

Train with single gpu

Comments (3)

glenn-jocher commented on July 24, 2024

@TonightGo hi there,

Thank you for reaching out and providing a detailed description of the issue you're encountering. Let's work through this together.

First, to ensure we can effectively investigate and address the problem, please verify the following:

1. Minimum Reproducible Example: Could you provide a minimal code snippet that reproduces the issue? This helps us isolate the problem more efficiently. You can refer to our Minimum Reproducible Example Guide for details on how to create one.
2. Package Versions: Make sure you are using the latest versions of torch and ultralytics. You can upgrade both with the following command:
```
pip install --upgrade torch ultralytics
```
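
To confirm which versions are actually installed in the active environment after upgrading, a small standard-library check can help. This is a minimal sketch; the helper name `installed_versions` is made up for illustration and is not part of torch or ultralytics.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package name: version string, or None if not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions(["torch", "ultralytics"]).items():
        print(f"{name}: {ver or 'not installed'}")
```

Including this output in a bug report makes it easier to rule out version mismatches.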

Regarding the error you're encountering, it seems related to multiprocessing and data loading. Here are a few steps you can take to troubleshoot and potentially resolve the issue:

Multiprocessing Setup

The error suggests an issue with the multiprocessing setup. Ensure that the multiprocessing code is protected by if __name__ == "__main__": to avoid issues with spawning new processes:

```python
if __name__ == "__main__":
    import sys
    sys.path.insert(0, 'ultralytics')  # prefer the local ultralytics checkout
    import torch
    import cv2
    from ultralytics import YOLO

    # Limit the CPU threads used by PyTorch and OpenCV
    torch.set_num_threads(1)
    cv2.setNumThreads(1)
    NUM_THREADS = 2  # not used elsewhere in this snippet

    # Start worker processes with the "spawn" method
    import torch.multiprocessing as mp
    mp.set_start_method('spawn', force=True)

    model = YOLO("/content/drive/MyDrive/ultralytics/ultralytics/cfg/models/v8/test.yaml", verbose=True)
    model.train(
        data="/content/drive/MyDrive/ultralytics/ultralytics/cfg/datasets/test.yaml",
        epochs=300, project='test', name='debug', device='0', imgsz=640,
        batch=2, exist_ok=True, workers=1, resume=False, optimizer='SGD',
    )
```
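
To see why the guard matters independently of YOLO, here is a minimal standard-library sketch of the same pattern. The function names `square` and `run_in_workers` are hypothetical. It assumes the code is saved to a file and run as a script, because with the "spawn" start method each child process re-imports that file.

```python
import multiprocessing as mp
from concurrent.futures import ProcessPoolExecutor


def square(x: int) -> int:
    """Work done in a child process; defined at module level so children can import it."""
    return x * x


def run_in_workers(values):
    # Request the "spawn" start method explicitly, as in the training script.
    ctx = mp.get_context("spawn")
    with ProcessPoolExecutor(max_workers=2, mp_context=ctx) as pool:
        return list(pool.map(square, values))


if __name__ == "__main__":
    # Each spawned child re-imports this file. The guard keeps the children
    # from re-running the launch code below during that import.
    print(run_in_workers([1, 2, 3]))
```

If the launch code sits at module level instead, the children try to start workers of their own while importing the file, which Python typically reports as a RuntimeError during process bootstrapping.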

DataLoader Workers

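The error message suggests rerunning with num_workers=0 for better error tracing. In Ultralytics the corresponding training argument is workers, so set workers=0 to load data in the main process and check whether the issue is related to multiprocessing:

```python
# Same call as above, but with workers=0 so data loading runs in the main process
model.train(
    data="/content/drive/MyDrive/ultralytics/ultralytics/cfg/datasets/test.yaml",
    epochs=300, project='test', name='debug', device='0', imgsz=640,
    batch=2, exist_ok=True, workers=0, resume=False, optimizer='SGD',
)
```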

Dataset and Cache

The assertion error in get_labels indicates a potential issue with the dataset cache. Try clearing the cache or ensuring that the dataset paths and annotations are correct:

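```python
# Clear the dataset label cache
import os

# Adjust this path to your dataset. Ultralytics typically writes the label
# cache as a .cache file next to the dataset's labels directory.
cache_path = "/content/drive/MyDrive/ultralytics/ultralytics/cfg/datasets/test.cache"
if os.path.exists(cache_path):
    os.remove(cache_path)
```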
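
If you are not sure where the cache file lives, one option is to search the dataset folder for every file with a .cache suffix. The sketch below assumes the label caches use that suffix; the helper name `remove_cache_files` and the example dataset root are hypothetical.

```python
from pathlib import Path


def remove_cache_files(dataset_root):
    """Delete every *.cache file under dataset_root and return the removed paths."""
    root = Path(dataset_root)
    if not root.is_dir():
        return []
    removed = []
    for cache_file in sorted(root.rglob("*.cache")):
        if cache_file.is_file():
            cache_file.unlink()
            removed.append(cache_file)
    return removed


if __name__ == "__main__":
    # Hypothetical dataset location; replace it with your own dataset root.
    for path in remove_cache_files("datasets/test"):
        print(f"removed {path}")
```

The cache is rebuilt on the next training run, so deleting it only costs a rescan of the labels.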

Debugging

If the issue persists, consider running the training with a smaller dataset or fewer epochs to isolate the problem. Additionally, you can enable more verbose logging to get detailed insights into the training process.
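
As a rough sketch of such a reduced run, the helper below rewrites a set of training arguments for a quick smoke test. The name `smoke_test_args` is made up for illustration. It assumes your installed Ultralytics version accepts `epochs`, `workers`, and `fraction` as training arguments, so check the documentation for your version before relying on it.

```python
def smoke_test_args(train_args, fraction=0.1):
    """Return a copy of train_args adjusted for a quick single-process debugging run."""
    debug_args = dict(train_args)
    debug_args.update(
        epochs=1,           # one pass is enough to hit data-loading errors
        workers=0,          # load data in the main process for readable tracebacks
        fraction=fraction,  # train on a subset of the dataset
    )
    return debug_args


if __name__ == "__main__":
    # Hypothetical arguments, shortened from the training call in this thread.
    args = {"data": "test.yaml", "epochs": 300, "workers": 1, "batch": 2}
    print(smoke_test_args(args))
    # With a loaded model: model.train(**smoke_test_args(args))
```

Once the short run completes cleanly, switch back to the original arguments for the full training.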

Please try these steps and let us know if the issue persists. We're here to help!

TonightGo commented on July 24, 2024

I used the above code and it works. Thank you for your timely and effective reply.

glenn-jocher commented on July 24, 2024

Hi @TonightGo,

I'm glad to hear that the provided solution worked for you! 🎉 If you encounter any further issues or have additional questions, feel free to reach out. We're here to help!

Happy training! 🚀
