
dataloaders_dali's Introduction

PyTorch DataLoaders with DALI

PyTorch DataLoaders implemented with nvidia-dali. We have implemented CIFAR-10 and ImageNet dataloaders; more dataloaders will be added in the future.

With 2 Intel(R) Xeon(R) Gold 6154 CPUs, 1 Tesla V100 GPU, and the whole dataset held in a memory disk, DALI dramatically accelerates image preprocessing.

Time to iterate training data (bs=256)   CIFAR-10                 ImageNet
DALI                                     1.4s (2 processors)      625s (8 processors)
torchvision                              280.1s (2 processors)    13400s (8 processors)

In CIFAR-10 training, we can reduce training time from 1 day to 1 hour with our hardware setup.

Requirements

You only need to install the nvidia-dali package, and the version should be >= 0.12; we tested version 0.11 and it did not work.

# for CUDA 9.0
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/cuda/9.0 nvidia-dali
# for CUDA 10.0
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/cuda/10.0 nvidia-dali

More details and documentation can be found here.

Usage

You can use these dataloaders easily, as in the following example:

from base import DALIDataloader
from cifar10 import HybridTrainPipe_CIFAR
pip_train = HybridTrainPipe_CIFAR(batch_size=TRAIN_BS,
                                  num_threads=NUM_WORKERS,
                                  device_id=0, 
                                  data_dir=IMG_DIR, 
                                  crop=CROP_SIZE, 
                                  world_size=1, 
                                  local_rank=0, 
                                  cutout=0)
train_loader = DALIDataloader(pipeline=pip_train,
                              size=CIFAR_IMAGES_NUM_TRAIN, 
                              batch_size=TRAIN_BS, 
                              onehot_label=True)
for i, data in enumerate(train_loader): # Using it just like PyTorch dataloader
    images = data[0].cuda(non_blocking=True)
    labels = data[1].cuda(non_blocking=True)
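
The snippet above assumes a few constants are defined beforehand. For CIFAR-10 they might look like the following (illustrative values, not mandated by the repo):

TRAIN_BS = 256                  # training batch size
NUM_WORKERS = 4                 # number of DALI pipeline threads
IMG_DIR = './data'              # directory containing the CIFAR-10 data
CROP_SIZE = 32                  # CIFAR-10 images are 32x32
CIFAR_IMAGES_NUM_TRAIN = 50000  # number of CIFAR-10 training images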

If you have enough memory to store the dataset, we strongly recommend mounting a memory disk and putting the whole dataset in it to accelerate I/O, like this:

mount -t tmpfs -o size=20g tmpfs /userhome/memory_data

Note that the 20g above is a ceiling: mounting the tmpfs does not immediately occupy 20g of memory; memory is consumed as you copy the dataset into it. Compressed files should not be extracted until after you have copied them into memory; doing it the other way around can be much slower.
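
For example, one way to stage the data (a sketch only; the paths are placeholders) is to copy the compressed archive into the tmpfs first and extract it there with Python's standard library:

# Sketch: copy the archive into the tmpfs, then extract it in place.
# The paths below are placeholders for illustration.
import shutil
import tarfile

shutil.copy('/datasets/cifar-10-python.tar.gz', '/userhome/memory_data/')
with tarfile.open('/userhome/memory_data/cifar-10-python.tar.gz') as tar:
    tar.extractall('/userhome/memory_data/')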

dataloaders_dali's People

Contributors

tanglang96


dataloaders_dali's Issues

How is the imagenet dir organized?

I want to run 'imagenet.py' and it requires specifying a 'path' for 'image_dir'. Can you show me how the files in 'image_dir' should be organized? Thank you very much.
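
For reference, DALI's file reader uses the same convention as torchvision's ImageFolder: the directory it points at contains one subfolder per class, and labels are assigned from the sorted subfolder names. Assuming imagenet.py joins 'train' and 'val' onto image_dir (this layout is illustrative, not taken from the repo), the tree would look like:

image_dir/
    train/
        n01440764/
            n01440764_10026.JPEG
            ...
        n01443537/
            ...
    val/
        n01440764/
            ...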

Normalization output differs between PyTorch and DALI

pytorch: transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
output:tensor([[[[ 0.3309, 0.3481, 0.3652, ..., 0.9817, 0.9817, 0.9817],
[ 0.3481, 0.3652, 0.3823, ..., 0.9988, 0.9988, 0.9988],
[ 0.3652, 0.3823, 0.3823, ..., 1.0159, 1.0159, 1.0159],
...,
[-1.5014, -1.5185, -1.5528, ..., 0.3994, 0.3481, 0.3481],
[-1.5014, -1.5185, -1.5528, ..., 0.3994, 0.3481, 0.3823],
[-1.5014, -1.5185, -1.5528, ..., 0.3994, 0.3994, 0.3823]],

     [[ 0.6429,  0.6954,  0.7129,  ...,  1.4132,  1.4132,  1.4132],
      [ 0.6604,  0.7129,  0.7304,  ...,  1.4307,  1.4307,  1.4307],
      [ 0.6779,  0.7304,  0.7304,  ...,  1.4482,  1.4482,  1.4482],
      ...,
      [-1.0903, -1.0903, -1.0553,  ...,  0.8529,  0.8004,  0.8004],
      [-1.0903, -1.0903, -1.0553,  ...,  0.8529,  0.8004,  0.8354],
      [-1.0903, -1.0903, -1.0553,  ...,  0.8529,  0.8529,  0.8354]],

     [[ 0.8797,  0.9145,  0.9319,  ...,  1.6291,  1.6291,  1.6291],
      [ 0.8971,  0.9319,  0.9494,  ...,  1.6465,  1.6465,  1.6465],
      [ 0.9145,  0.9494,  0.9494,  ...,  1.6640,  1.6640,  1.6640],
      ...,
      [-0.6193, -0.6193, -0.6193,  ...,  1.2457,  1.1934,  1.2282],
      [-0.6193, -0.6193, -0.6193,  ...,  1.2457,  1.1934,  1.2631],
      [-0.6193, -0.6193, -0.6193,  ...,  1.2457,  1.2457,  1.2631]]]])

DALI: ops.CropMirrorNormalize(device="gpu",
                              output_dtype=types.FLOAT,
                              image_type=types.RGB,
                              output_layout=types.NCHW,
                              mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
                              std=[0.229 * 255, 0.224 * 255, 0.225 * 255])
output:tensor([[[[ 0.3138, 0.3309, 0.3481, ..., 0.9646, 0.9646, 0.9646],
[ 0.3309, 0.3481, 0.3652, ..., 0.9817, 0.9817, 0.9817],
[ 0.3481, 0.3652, 0.3652, ..., 0.9988, 0.9988, 0.9988],
...,
[-1.5185, -1.5357, -1.5699, ..., 0.3823, 0.3309, 0.3309],
[-1.5185, -1.5357, -1.5699, ..., 0.3823, 0.3481, 0.3652],
[-1.5185, -1.5357, -1.5699, ..., 0.3823, 0.3823, 0.3652]],

     [[ 0.6254,  0.6779,  0.6954,  ...,  1.3957,  1.3957,  1.3957],
      [ 0.6429,  0.6954,  0.7129,  ...,  1.4132,  1.4132,  1.4132],
      [ 0.6604,  0.7129,  0.7129,  ...,  1.4307,  1.4307,  1.4307],
      ...,
      [-1.1078, -1.0903, -1.0553,  ...,  0.8704,  0.8004,  0.8004],
      [-1.1078, -1.0903, -1.0553,  ...,  0.8704,  0.8179,  0.8354],
      [-1.1078, -1.0903, -1.0553,  ...,  0.8704,  0.8529,  0.8354]],

     [[ 0.8622,  0.8971,  0.9145,  ...,  1.6291,  1.6291,  1.6291],
      [ 0.8797,  0.9145,  0.9319,  ...,  1.6465,  1.6465,  1.6465],
      [ 0.8971,  0.9319,  0.9319,  ...,  1.6640,  1.6640,  1.6640],
      ...,
      [-0.6367, -0.6367, -0.6193,  ...,  1.2631,  1.1934,  1.2108],
      [-0.6367, -0.6367, -0.6193,  ...,  1.2631,  1.2108,  1.2457],
      [-0.6367, -0.6367, -0.6193,  ...,  1.2631,  1.2457,  1.2457]]]],
   device='cuda:3')

RuntimeError

Hello,
When reading data in imagenet.py I immediately get an "illegal memory access" error. What could be the cause? My GPUs are 2 * V100, so it should not be a case of running out of GPU memory, and I did not change anything in the source code except the dataset path.
The error log is below:

root@test-6gwz28fvc:/data1/test# python imagenet.py
DALI "gpu" variant
read 1281167 files from 1000 directories
140020509374208 Exception in thread: CUDA runtime API error cudaErrorIllegalAddress (77):
an illegal memory access was encountered
Traceback (most recent call last):
  File "imagenet.py", line 105, in <module>
    num_threads=4, crop=224, device_id=0, num_gpus=1)
  File "imagenet.py", line 67, in get_imagenet_iter_dali
    dali_iter_train = DALIClassificationIterator(pip_train, size=pip_train.epoch_size("Reader") // world_size)
  File "/usr/local/miniconda3/lib/python3.6/site-packages/nvidia/dali/plugin/pytorch.py", line 338, in __init__
    last_batch_padded = last_batch_padded)
  File "/usr/local/miniconda3/lib/python3.6/site-packages/nvidia/dali/plugin/pytorch.py", line 148, in __init__
    self._first_batch = self.next()
  File "/usr/local/miniconda3/lib/python3.6/site-packages/nvidia/dali/plugin/pytorch.py", line 245, in next
    return self.__next__()
  File "/usr/local/miniconda3/lib/python3.6/site-packages/nvidia/dali/plugin/pytorch.py", line 163, in __next__
    outputs.append(p.share_outputs())
  File "/usr/local/miniconda3/lib/python3.6/site-packages/nvidia/dali/pipeline.py", line 409, in share_outputs
    return self._pipe.ShareOutputs()
RuntimeError: Critical error in pipeline: Error in thread 0: CUDA runtime API error cudaErrorIllegalAddress (77):
an illegal memory access was encountered
Current pipeline object is no longer valid.
terminate called after throwing an instance of 'dali::CUDAError'
what(): CUDA runtime API error cudaErrorIllegalAddress (77):
an illegal memory access was encountered
Aborted (core dumped)

Could you take a look? Thanks.

Compared the two ways but the time is almost the same

read 1281167 files from 1000 directories
read 50000 files from 1000 directories
[DALI] test dataloader length: 196
[DALI] start iterate test dataloader
[DALI] end test dataloader iteration
[DALI] iteration time: 12.413692s [test]
[PyTorch] test dataloader length: 196
[PyTorch] start iterate test dataloader
[PyTorch] end test dataloader iteration
[PyTorch] iteration time: 8.225223s [test]

AttributeError: can't set attribute

At runtime, the call DALIDataloader(pipeline=pip_train, size=IMAGENET_IMAGES_NUM_TRAIN, batch_size=TRAIN_BS, onehot_label=True) raises the error: AttributeError: can't set attribute. Has some restriction been added in DALI's DALIGenericIterator? Any guidance would be appreciated @tanglang96, thanks.
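
One possible cause (a guess, not confirmed for this repo): newer nvidia-dali releases expose size on DALIGenericIterator as a read-only property, so a subclass whose constructor assigns self.size = ... raises exactly this error. A minimal sketch of a workaround, storing the value under a different attribute name (names and defaults here are illustrative, not the repo's actual code; only the constructor change is shown):

# Sketch: avoid shadowing read-only properties of the base iterator
# in newer DALI versions by using private attribute names.
from nvidia.dali.plugin.pytorch import DALIGenericIterator

class DALIDataloader(DALIGenericIterator):
    def __init__(self, pipeline, size, batch_size, output_map=["data", "label"],
                 auto_reset=True, onehot_label=False):
        self._loader_size = size              # instead of `self.size = size`
        self._loader_batch_size = batch_size  # instead of `self.batch_size = batch_size`
        self.onehot_label = onehot_label
        super().__init__(pipelines=pipeline, size=size,
                         auto_reset=auto_reset, output_map=output_map)

    def __len__(self):
        # number of batches per epoch, rounded up
        return (self._loader_size + self._loader_batch_size - 1) // self._loader_batch_size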

no module named nvidia...?

Thanks for your great work. When I run this code,
import nvidia.dali
it raises the error: "no module named 'nvidia'".
How can I fix this?

This data loader cannot be reused?

Here's my situation.
I need to train many different models in a single run of a Python command. If I use the original PyTorch loader, every time I delete an old model the GPU memory is released back to its initial state. However, if I use the script in this repo, after I delete my model and try to build a new one, the GPU runs out of memory.
I have noticed that some GPU memory is released, but it is so little that it seems like nothing is released at all.
Do you have any idea how to solve this problem?

Gains shrink as the number of threads increases

When I ran the cifar10 example with num_workers = 16, torch seems to outperform DALI:

[DALI] train dataloader length: 196
[DALI] start iterate train dataloader
[DALI] end train dataloader iteration
[DALI] test dataloader length: 50
[DALI] start iterate test dataloader
[DALI] end test dataloader iteration
[DALI] iteration time: 2.117897s [train], 0.321967s [test]
Files already downloaded and verified
Files already downloaded and verified
[PyTorch] train dataloader length: 196
[PyTorch] start iterate train dataloader
[PyTorch] end train dataloader iteration
[PyTorch] test dataloader length: 50
[PyTorch] start iterate test dataloader
[PyTorch] end test dataloader iteration
[PyTorch] iteration time: 1.788503s [train], 0.328691s [test]

Multi-GPU setting actually uses only one GPU

Hi, I want to speed up ImageNet loading and modified your code from
train_loader = get_imagenet_iter_dali(type='train', image_dir='/userhome/memory_data/imagenet', batch_size=256, num_threads=4, crop=224, device_id=0, num_gpus=1)
to
train_loader = get_imagenet_iter_dali(type='train', image_dir='/data1/share', batch_size=128, num_threads=4, crop=224, device_id=(0,1,2,3), num_gpus=4)

However, CPU usage is quite high (85%), and only GPU 0 is actually used; GPUs 1, 2 and 3 are not, and GPU 0's utilization is only 5%.
(screenshots of GPU and CPU utilization omitted)
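
As far as I know, a DALI pipeline's device_id must be a single integer, not a tuple; multi-GPU training normally creates one pipeline per GPU, typically one process per GPU under DistributedDataParallel, each with its own device_id and data shard. A rough sketch of the per-process setup, assuming the ImageNet pipeline class takes local_rank/world_size arguments like the CIFAR one in the README and that local_rank and world_size come from the launcher:

# Sketch: one pipeline per process/GPU (e.g. under torch.distributed.launch).
# HybridTrainPipe, local_rank, world_size and IMAGENET_IMAGES_NUM_TRAIN are
# assumed names for illustration; they are not taken verbatim from this repo.
pip_train = HybridTrainPipe(batch_size=128,
                            num_threads=4,
                            device_id=local_rank,    # a single int per process
                            data_dir='/data1/share/train',
                            crop=224,
                            local_rank=local_rank,   # shard index for the reader
                            world_size=world_size)   # total number of shards
train_loader = DALIDataloader(pipeline=pip_train,
                              size=IMAGENET_IMAGES_NUM_TRAIN // world_size,
                              batch_size=128,
                              onehot_label=True)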

Using DALI to train tiny-imagenet does not reduce iteration time

Hi,

I modified the file imagenet.py to train tiny-imagenet. I have organized the directories as specified in issue #4, but the iteration time does not decrease much and is even slower (see the timings below). Where might the problem be?
(timing screenshot omitted)

FYI, I just reorganized the files by adding soft links.

Thank you

Compared the two ways (DALI and the PyTorch dataloader), the training time is almost the same???

@tanglang96
Thanks for your summary. I compared the two ways (DALI and the PyTorch dataloader) and the training time is almost the same. The code is as follows:

1) PyTorch dataloader version:
CROP_SIZE= 32
CIFAR_MEAN = [0.49139968, 0.48215827, 0.44653124]
CIFAR_STD = [0.24703233, 0.24348505, 0.26158768]
CIFAR_IMAGES_NUM_TRAIN = 50000
CIFAR_IMAGES_NUM_TEST = 10000
IMG_DIR = './data'
TRAIN_BS = 128
TEST_BS = 100
NUM_WORKERS = 2
transform_train = transforms.Compose([
    transforms.RandomCrop(CROP_SIZE, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(CIFAR_MEAN, CIFAR_STD),])
train_dst = CIFAR10(root=IMG_DIR, train=True, download=False, transform=transform_train)
trainloader = torch.utils.data.DataLoader(train_dst, batch_size=TRAIN_BS, shuffle=True, pin_memory=True, num_workers=NUM_WORKERS)

for epoch in range(start_epoch, start_epoch+200):
    print('\nEpoch: %d' % epoch)
    net.train()
    train_loss = 0
    correct = 0
    total = 0
    for batch_idx, (inputs, targets) in enumerate(trainloader):
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        outputs = net(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()
        train_loss += loss.item()
        _, predicted = outputs.max(1)
        total += targets.size(0)
        correct += predicted.eq(targets).sum().item()
        progress_bar(batch_idx, len(trainloader), 'Loss: %.3f | Acc: %.3f%% (%d/%d)'
                     % (train_loss/(batch_idx+1), 100.*correct/total, correct, total))

and the corresponding training output:
(training log screenshot omitted)

2) DALI version:

parser = argparse.ArgumentParser(description='Train cifar10 use DALI data process based on the resnet18')
parser.add_argument('--lr', default=0.1, type=float, help='learning rate')
parser.add_argument('--TRAIN_BS', default=128, type=int, help='batch size of data')
parser.add_argument('--TEST_BS', default=100, type=int, help='batch size of data')
parser.add_argument('--NUM_WORKERS', default=2, type=int)
parser.add_argument('--IMG_DIR', default='./data', type=str, help='data path')
parser.add_argument('--CROP_SIZE', default=32, type=int)
parser.add_argument('--CIFAR_IMAGES_NUM_TRAIN', default=50000, type=int)
parser.add_argument('--CIFAR_IMAGES_NUM_TEST', default=10000, type=int)
parser.add_argument('--resume', '-r', action='store_true',
                    help='resume from checkpoint')
args = parser.parse_args()

pip_train = HybridTrainPipe_CIFAR(batch_size=args.TRAIN_BS,
                                  num_threads=args.NUM_WORKERS,
                                  device_id=0,
                                  data_dir=args.IMG_DIR,
                                  crop=args.CROP_SIZE,
                                  world_size=1,
                                  local_rank=0,
                                  cutout=0)
trainloader = DALIDataloader(pipeline=pip_train,
                              size=args.CIFAR_IMAGES_NUM_TRAIN,
                              batch_size=args.TRAIN_BS,
                              onehot_label=True)

for epoch in range(start_epoch, start_epoch+200):
    print('\nEpoch: %d' % epoch)
    net.train()
    train_loss = 0
    correct = 0
    total = 0
    for batch_idx, (inputs, targets) in enumerate(trainloader):
        inputs = inputs.cuda(non_blocking=True)
        targets = targets.cuda(non_blocking=True)
        # inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        outputs = net(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()

        train_loss += loss.item()
        _, predicted = outputs.max(1)
        total += targets.size(0)
        correct += predicted.eq(targets).sum().item()

        progress_bar(batch_idx, len(trainloader), 'Loss: %.3f | Acc: %.3f%% (%d/%d)'
                     % (train_loss / (batch_idx + 1), 100. * correct / total, correct, total))
    trainloader.reset()

and the corresponding training output:
(training log screenshot omitted)

From the two screenshots, we can see that both take almost 18s per epoch, and the DALI run also prints:
WARNING:root:DALI iterator does not support resetting while epoch is not finished. Ignoring...

Reading training data

Hello, I see that the current examples all assume one folder per class. Is there a way to read data similar to PyTorch's custom Datasets, i.e., passing in a txt file that maps imagePath to label and then handling it in __getitem__?
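
If a txt-based list is needed, DALI's ops.FileReader also accepts a file_list argument: a plain-text file where each line is an image path (relative to file_root) followed by its label, which is close to the setup described above. A rough sketch of such a pipeline (not part of this repo, written against the legacy ops API used here):

# Sketch of a pipeline that reads (path, label) pairs from a text file.
# Each line of list_path is expected to be: relative/path/to/img.jpg <label>
import nvidia.dali.ops as ops
import nvidia.dali.types as types
from nvidia.dali.pipeline import Pipeline

class FileListPipe(Pipeline):
    def __init__(self, batch_size, num_threads, device_id, data_root, list_path, crop):
        super().__init__(batch_size, num_threads, device_id)
        self.input = ops.FileReader(file_root=data_root, file_list=list_path,
                                    random_shuffle=True)
        self.decode = ops.ImageDecoder(device="mixed", output_type=types.RGB)
        self.cmnp = ops.CropMirrorNormalize(device="gpu",
                                            output_dtype=types.FLOAT,
                                            output_layout=types.NCHW,
                                            crop=(crop, crop),
                                            mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
                                            std=[0.229 * 255, 0.224 * 255, 0.225 * 255])

    def define_graph(self):
        jpegs, labels = self.input(name="Reader")
        images = self.decode(jpegs)   # decoded on GPU by the mixed decoder
        images = self.cmnp(images)    # crop, normalize, and convert layout
        return images, labels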
