
dataloaders_dali's Introduction

PyTorch DataLoaders with DALI

PyTorch DataLoaders implemented with nvidia-dali. We have implemented CIFAR-10 and ImageNet dataloaders; more dataloaders will be added in the future.

With 2 Intel(R) Xeon(R) Gold 6154 CPUs, 1 Tesla V100 GPU, and the whole dataset held in a memory disk, DALI dramatically accelerates image preprocessing.

Time to iterate training data (bs=256)   CIFAR-10                 ImageNet
DALI                                     1.4s (2 processors)      625s (8 processors)
torchvision                              280.1s (2 processors)    13400s (8 processors)

In CIFAR-10 training, we can reduce training time from 1 day to 1 hour with our hardware setup.

Requirements

You only need to install the nvidia-dali package, and the version should be >= 0.12; we tested version 0.11 and it did not work.

# for CUDA 9.0
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/cuda/9.0 nvidia-dali
# for CUDA 10.0
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/cuda/10.0 nvidia-dali

More details and documentation can be found here.

Usage

You can use these dataloaders easily, as in the following example:

from base import DALIDataloader
from cifar10 import HybridTrainPipe_CIFAR
pip_train = HybridTrainPipe_CIFAR(batch_size=TRAIN_BS,
                                  num_threads=NUM_WORKERS,
                                  device_id=0, 
                                  data_dir=IMG_DIR, 
                                  crop=CROP_SIZE, 
                                  world_size=1, 
                                  local_rank=0, 
                                  cutout=0)
train_loader = DALIDataloader(pipeline=pip_train,
                              size=CIFAR_IMAGES_NUM_TRAIN, 
                              batch_size=TRAIN_BS, 
                              onehot_label=True)
for i, data in enumerate(train_loader): # Using it just like PyTorch dataloader
    images = data[0].cuda(non_blocking=True)
    labels = data[1].cuda(non_blocking=True)
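
The snippet above assumes a few constants are defined beforehand. For CIFAR-10 they might look like the following (illustrative values, not mandated by the repo):

TRAIN_BS = 256                  # training batch size
NUM_WORKERS = 4                 # number of DALI pipeline threads
IMG_DIR = './data'              # directory containing the CIFAR-10 data
CROP_SIZE = 32                  # CIFAR-10 images are 32x32
CIFAR_IMAGES_NUM_TRAIN = 50000  # number of CIFAR-10 training images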

If you have enough memory to store the dataset, we strongly recommend mounting a memory disk and putting the whole dataset in it to accelerate I/O, like this:

mount -t tmpfs -o size=20g tmpfs /userhome/memory_data

Note that the 20g above is a ceiling: mounting the tmpfs does not immediately occupy 20g of memory; memory is consumed as you copy the dataset into it. Compressed files should not be extracted until after you have copied them into memory; doing it the other way around can be much slower.
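
For example, one way to stage the data (a sketch only; the paths are placeholders) is to copy the compressed archive into the tmpfs first and extract it there with Python's standard library:

# Sketch: copy the archive into the tmpfs, then extract it in place.
# The paths below are placeholders for illustration.
import shutil
import tarfile

shutil.copy('/datasets/cifar-10-python.tar.gz', '/userhome/memory_data/')
with tarfile.open('/userhome/memory_data/cifar-10-python.tar.gz') as tar:
    tar.extractall('/userhome/memory_data/')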

dataloaders_dali's People

Contributors

tanglang96


dataloaders_dali's Issues

How is the imagenet dir organized?

I want to run 'imagenet.py' and it requires specifying a 'path' for 'image_dir'. Can you show me how the files in 'image_dir' should be organized? Thank you very much.
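
For reference, DALI's file reader uses the same convention as torchvision's ImageFolder: the directory it points at contains one subfolder per class, and labels are assigned from the sorted subfolder names. Assuming imagenet.py joins 'train' and 'val' onto image_dir (this layout is illustrative, not taken from the repo), the tree would look like:

image_dir/
    train/
        n01440764/
            n01440764_10026.JPEG
            ...
        n01443537/
            ...
    val/
        n01440764/
            ...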

Normalization output differs between PyTorch and DALI

pytorch: transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
output:tensor([[[[ 0.3309, 0.3481, 0.3652, ..., 0.9817, 0.9817, 0.9817],
[ 0.3481, 0.3652, 0.3823, ..., 0.9988, 0.9988, 0.9988],
[ 0.3652, 0.3823, 0.3823, ..., 1.0159, 1.0159, 1.0159],
...,
[-1.5014, -1.5185, -1.5528, ..., 0.3994, 0.3481, 0.3481],
[-1.5014, -1.5185, -1.5528, ..., 0.3994, 0.3481, 0.3823],
[-1.5014, -1.5185, -1.5528, ..., 0.3994, 0.3994, 0.3823]],

     [[ 0.6429,  0.6954,  0.7129,  ...,  1.4132,  1.4132,  1.4132],
      [ 0.6604,  0.7129,  0.7304,  ...,  1.4307,  1.4307,  1.4307],
      [ 0.6779,  0.7304,  0.7304,  ...,  1.4482,  1.4482,  1.4482],
      ...,
      [-1.0903, -1.0903, -1.0553,  ...,  0.8529,  0.8004,  0.8004],
      [-1.0903, -1.0903, -1.0553,  ...,  0.8529,  0.8004,  0.8354],
      [-1.0903, -1.0903, -1.0553,  ...,  0.8529,  0.8529,  0.8354]],

     [[ 0.8797,  0.9145,  0.9319,  ...,  1.6291,  1.6291,  1.6291],
      [ 0.8971,  0.9319,  0.9494,  ...,  1.6465,  1.6465,  1.6465],
      [ 0.9145,  0.9494,  0.9494,  ...,  1.6640,  1.6640,  1.6640],
      ...,
      [-0.6193, -0.6193, -0.6193,  ...,  1.2457,  1.1934,  1.2282],
      [-0.6193, -0.6193, -0.6193,  ...,  1.2457,  1.1934,  1.2631],
      [-0.6193, -0.6193, -0.6193,  ...,  1.2457,  1.2457,  1.2631]]]])

DALI: ops.CropMirrorNormalize(device="gpu",
                              output_dtype=types.FLOAT,
                              image_type=types.RGB,
                              output_layout=types.NCHW,
                              mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
                              std=[0.229 * 255, 0.224 * 255, 0.225 * 255])
output:tensor([[[[ 0.3138, 0.3309, 0.3481, ..., 0.9646, 0.9646, 0.9646],
[ 0.3309, 0.3481, 0.3652, ..., 0.9817, 0.9817, 0.9817],
[ 0.3481, 0.3652, 0.3652, ..., 0.9988, 0.9988, 0.9988],
...,
[-1.5185, -1.5357, -1.5699, ..., 0.3823, 0.3309, 0.3309],
[-1.5185, -1.5357, -1.5699, ..., 0.3823, 0.3481, 0.3652],
[-1.5185, -1.5357, -1.5699, ..., 0.3823, 0.3823, 0.3652]],

     [[ 0.6254,  0.6779,  0.6954,  ...,  1.3957,  1.3957,  1.3957],
      [ 0.6429,  0.6954,  0.7129,  ...,  1.4132,  1.4132,  1.4132],
      [ 0.6604,  0.7129,  0.7129,  ...,  1.4307,  1.4307,  1.4307],
      ...,
      [-1.1078, -1.0903, -1.0553,  ...,  0.8704,  0.8004,  0.8004],
      [-1.1078, -1.0903, -1.0553,  ...,  0.8704,  0.8179,  0.8354],
      [-1.1078, -1.0903, -1.0553,  ...,  0.8704,  0.8529,  0.8354]],

     [[ 0.8622,  0.8971,  0.9145,  ...,  1.6291,  1.6291,  1.6291],
      [ 0.8797,  0.9145,  0.9319,  ...,  1.6465,  1.6465,  1.6465],
      [ 0.8971,  0.9319,  0.9319,  ...,  1.6640,  1.6640,  1.6640],
      ...,
      [-0.6367, -0.6367, -0.6193,  ...,  1.2631,  1.1934,  1.2108],
      [-0.6367, -0.6367, -0.6193,  ...,  1.2631,  1.2108,  1.2457],
      [-0.6367, -0.6367, -0.6193,  ...,  1.2631,  1.2457,  1.2457]]]],
   device='cuda:3')

RuntimeError

Hello,
When reading data in imagenet.py I immediately get an "illegal memory access" error. What could be the cause? My GPUs are 2 * V100, so it should not be a case of running out of GPU memory, and I did not change anything in the source code except the dataset path.
The error log is below:

root@test-6gwz28fvc:/data1/test# python imagenet.py
DALI "gpu" variant
read 1281167 files from 1000 directories
140020509374208 Exception in thread: CUDA runtime API error cudaErrorIllegalAddress (77):
an illegal memory access was encountered
Traceback (most recent call last):
  File "imagenet.py", line 105, in <module>
    num_threads=4, crop=224, device_id=0, num_gpus=1)
  File "imagenet.py", line 67, in get_imagenet_iter_dali
    dali_iter_train = DALIClassificationIterator(pip_train, size=pip_train.epoch_size("Reader") // world_size)
  File "/usr/local/miniconda3/lib/python3.6/site-packages/nvidia/dali/plugin/pytorch.py", line 338, in __init__
    last_batch_padded = last_batch_padded)
  File "/usr/local/miniconda3/lib/python3.6/site-packages/nvidia/dali/plugin/pytorch.py", line 148, in __init__
    self._first_batch = self.next()
  File "/usr/local/miniconda3/lib/python3.6/site-packages/nvidia/dali/plugin/pytorch.py", line 245, in next
    return self.__next__()
  File "/usr/local/miniconda3/lib/python3.6/site-packages/nvidia/dali/plugin/pytorch.py", line 163, in __next__
    outputs.append(p.share_outputs())
  File "/usr/local/miniconda3/lib/python3.6/site-packages/nvidia/dali/pipeline.py", line 409, in share_outputs
    return self._pipe.ShareOutputs()
RuntimeError: Critical error in pipeline: Error in thread 0: CUDA runtime API error cudaErrorIllegalAddress (77):
an illegal memory access was encountered
Current pipeline object is no longer valid.
terminate called after throwing an instance of 'dali::CUDAError'
what(): CUDA runtime API error cudaErrorIllegalAddress (77):
an illegal memory access was encountered
Aborted (core dumped)

Could you take a look? Thanks.

Compared the two ways but the time is almost the same

read 1281167 files from 1000 directories
read 50000 files from 1000 directories
[DALI] test dataloader length: 196
[DALI] start iterate test dataloader
[DALI] end test dataloader iteration
[DALI] iteration time: 12.413692s [test]
[PyTorch] test dataloader length: 196
[PyTorch] start iterate test dataloader
[PyTorch] end test dataloader iteration
[PyTorch] iteration time: 8.225223s [test]

AttributeError: can't set attribute

At runtime, the call DALIDataloader(pipeline=pip_train, size=IMAGENET_IMAGES_NUM_TRAIN, batch_size=TRAIN_BS, onehot_label=True) raises the error: AttributeError: can't set attribute. Has some restriction been added in DALI's DALIGenericIterator? Any guidance would be appreciated @tanglang96, thanks.
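
One possible cause (a guess, not confirmed for this repo): newer nvidia-dali releases expose size on DALIGenericIterator as a read-only property, so a subclass whose constructor assigns self.size = ... raises exactly this error. A minimal sketch of a workaround, storing the value under a different attribute name (names and defaults here are illustrative, not the repo's actual code; only the constructor change is shown):

# Sketch: avoid shadowing read-only properties of the base iterator
# in newer DALI versions by using private attribute names.
from nvidia.dali.plugin.pytorch import DALIGenericIterator

class DALIDataloader(DALIGenericIterator):
    def __init__(self, pipeline, size, batch_size, output_map=["data", "label"],
                 auto_reset=True, onehot_label=False):
        self._loader_size = size              # instead of `self.size = size`
        self._loader_batch_size = batch_size  # instead of `self.batch_size = batch_size`
        self.onehot_label = onehot_label
        super().__init__(pipelines=pipeline, size=size,
                         auto_reset=auto_reset, output_map=output_map)

    def __len__(self):
        # number of batches per epoch, rounded up
        return (self._loader_size + self._loader_batch_size - 1) // self._loader_batch_size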

no module named nvidia...?

Thanks for your great work. When I run this code,
import nvidia.dali
it raises the error: "no module named 'nvidia'".
How can I fix this?

This data loader cannot be reused?

Here's my situation.
I need to train many different models in a single run of a Python command. If I use the original PyTorch loader, every time I delete an old model the GPU memory is released back to its initial state. However, if I use the script in this repo, after I delete my model and try to build a new one, the GPU runs out of memory.
I have noticed that some GPU memory is released, but it is so little that it seems like nothing is released at all.
Do you have any idea how to solve this problem?

Gains shrink as the number of threads increases

When I ran the cifar10 example with num_workers = 16, torch seems to outperform DALI:

[DALI] train dataloader length: 196
[DALI] start iterate train dataloader
[DALI] end train dataloader iteration
[DALI] test dataloader length: 50
[DALI] start iterate test dataloader
[DALI] end test dataloader iteration
[DALI] iteration time: 2.117897s [train], 0.321967s [test]
Files already downloaded and verified
Files already downloaded and verified
[PyTorch] train dataloader length: 196
[PyTorch] start iterate train dataloader
[PyTorch] end train dataloader iteration
[PyTorch] test dataloader length: 50
[PyTorch] start iterate test dataloader
[PyTorch] end test dataloader iteration
[PyTorch] iteration time: 1.788503s [train], 0.328691s [test]

Multi-GPU setting actually uses only one GPU

Hi, I want to speed up ImageNet loading and modified your code from
train_loader = get_imagenet_iter_dali(type='train', image_dir='/userhome/memory_data/imagenet', batch_size=256, num_threads=4, crop=224, device_id=0, num_gpus=1)
to
train_loader = get_imagenet_iter_dali(type='train', image_dir='/data1/share', batch_size=128, num_threads=4, crop=224, device_id=(0,1,2,3), num_gpus=4)

However, CPU usage is quite high (85%), and only GPU 0 is actually used; GPUs 1, 2 and 3 are not, and GPU 0's utilization is only 5%.
(screenshots of GPU and CPU utilization omitted)
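
As far as I know, a DALI pipeline's device_id must be a single integer, not a tuple; multi-GPU training normally creates one pipeline per GPU, typically one process per GPU under DistributedDataParallel, each with its own device_id and data shard. A rough sketch of the per-process setup, assuming the ImageNet pipeline class takes local_rank/world_size arguments like the CIFAR one in the README and that local_rank and world_size come from the launcher:

# Sketch: one pipeline per process/GPU (e.g. under torch.distributed.launch).
# HybridTrainPipe, local_rank, world_size and IMAGENET_IMAGES_NUM_TRAIN are
# assumed names for illustration; they are not taken verbatim from this repo.
pip_train = HybridTrainPipe(batch_size=128,
                            num_threads=4,
                            device_id=local_rank,    # a single int per process
                            data_dir='/data1/share/train',
                            crop=224,
                            local_rank=local_rank,   # shard index for the reader
                            world_size=world_size)   # total number of shards
train_loader = DALIDataloader(pipeline=pip_train,
                              size=IMAGENET_IMAGES_NUM_TRAIN // world_size,
                              batch_size=128,
                              onehot_label=True)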

Using DALI to train tiny-imagenet does not reduce iteration time

Hi,

I modified the file imagenet.py to train tiny-imagenet. I have organized the directories as specified in issue #4, but the iteration time does not decrease much and is even slower (see the timings below). Where might the problem be?
(timing screenshot omitted)

FYI, I just reorganized the files by adding soft links.

Thank you

Compared the two ways (DALI and the PyTorch dataloader), the training time is almost the same???

@tanglang96
Thanks for your summary. I compared the two ways (DALI and the PyTorch dataloader) and the training time is almost the same. The code is as follows:

1) PyTorch dataloader version:
CROP_SIZE= 32
CIFAR_MEAN = [0.49139968, 0.48215827, 0.44653124]
CIFAR_STD = [0.24703233, 0.24348505, 0.26158768]
CIFAR_IMAGES_NUM_TRAIN = 50000
CIFAR_IMAGES_NUM_TEST = 10000
IMG_DIR = './data'
TRAIN_BS = 128
TEST_BS = 100
NUM_WORKERS = 2
transform_train = transforms.Compose([
    transforms.RandomCrop(CROP_SIZE, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(CIFAR_MEAN, CIFAR_STD),])
train_dst = CIFAR10(root=IMG_DIR, train=True, download=False, transform=transform_train)
trainloader = torch.utils.data.DataLoader(train_dst, batch_size=TRAIN_BS, shuffle=True, pin_memory=True, num_workers=NUM_WORKERS)

for epoch in range(start_epoch, start_epoch+200):
    print('\nEpoch: %d' % epoch)
    net.train()
    train_loss = 0
    correct = 0
    total = 0
    for batch_idx, (inputs, targets) in enumerate(trainloader):
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        outputs = net(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()
        train_loss += loss.item()
        _, predicted = outputs.max(1)
        total += targets.size(0)
        correct += predicted.eq(targets).sum().item()
        progress_bar(batch_idx, len(trainloader), 'Loss: %.3f | Acc: %.3f%% (%d/%d)'
                     % (train_loss/(batch_idx+1), 100.*correct/total, correct, total))

and the corresponding training output:
(training log screenshot omitted)

2) DALI version:

parser = argparse.ArgumentParser(description='Train cifar10 use DALI data process based on the resnet18')
parser.add_argument('--lr', default=0.1, type=float, help='learning rate')
parser.add_argument('--TRAIN_BS', default=128, type=int, help='batch size of data')
parser.add_argument('--TEST_BS', default=100, type=int, help='batch size of data')
parser.add_argument('--NUM_WORKERS', default=2, type=int)
parser.add_argument('--IMG_DIR', default='./data', type=str, help='data path')
parser.add_argument('--CROP_SIZE', default=32, type=int)
parser.add_argument('--CIFAR_IMAGES_NUM_TRAIN', default=50000, type=int)
parser.add_argument('--CIFAR_IMAGES_NUM_TEST', default=10000, type=int)
parser.add_argument('--resume', '-r', action='store_true',
                    help='resume from checkpoint')
args = parser.parse_args()

pip_train = HybridTrainPipe_CIFAR(batch_size=args.TRAIN_BS,
                                  num_threads=args.NUM_WORKERS,
                                  device_id=0,
                                  data_dir=args.IMG_DIR,
                                  crop=args.CROP_SIZE,
                                  world_size=1,
                                  local_rank=0,
                                  cutout=0)
trainloader = DALIDataloader(pipeline=pip_train,
                              size=args.CIFAR_IMAGES_NUM_TRAIN,
                              batch_size=args.TRAIN_BS,
                              onehot_label=True)

for epoch in range(start_epoch, start_epoch+200):
    print('\nEpoch: %d' % epoch)
    net.train()
    train_loss = 0
    correct = 0
    total = 0
    for batch_idx, (inputs, targets) in enumerate(trainloader):
        inputs = inputs.cuda(non_blocking=True)
        targets = targets.cuda(non_blocking=True)
        # inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        outputs = net(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()

        train_loss += loss.item()
        _, predicted = outputs.max(1)
        total += targets.size(0)
        correct += predicted.eq(targets).sum().item()

        progress_bar(batch_idx, len(trainloader), 'Loss: %.3f | Acc: %.3f%% (%d/%d)'
                     % (train_loss / (batch_idx + 1), 100. * correct / total, correct, total))
    trainloader.reset()

and the corresponding training output:
(training log screenshot omitted)

From the two screenshots, we can see that both take almost 18s per epoch, and the DALI run also prints:
WARNING:root:DALI iterator does not support resetting while epoch is not finished. Ignoring...

Reading training data

Hello, I see that the current examples all assume one folder per class. Is there a way to read data similar to PyTorch's custom Datasets, i.e., passing in a txt file that maps imagePath to label and then handling it in __getitem__?
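
If a txt-based list is needed, DALI's ops.FileReader also accepts a file_list argument: a plain-text file where each line is an image path (relative to file_root) followed by its label, which is close to the setup described above. A rough sketch of such a pipeline (not part of this repo, written against the legacy ops API used here):

# Sketch of a pipeline that reads (path, label) pairs from a text file.
# Each line of list_path is expected to be: relative/path/to/img.jpg <label>
import nvidia.dali.ops as ops
import nvidia.dali.types as types
from nvidia.dali.pipeline import Pipeline

class FileListPipe(Pipeline):
    def __init__(self, batch_size, num_threads, device_id, data_root, list_path, crop):
        super().__init__(batch_size, num_threads, device_id)
        self.input = ops.FileReader(file_root=data_root, file_list=list_path,
                                    random_shuffle=True)
        self.decode = ops.ImageDecoder(device="mixed", output_type=types.RGB)
        self.cmnp = ops.CropMirrorNormalize(device="gpu",
                                            output_dtype=types.FLOAT,
                                            output_layout=types.NCHW,
                                            crop=(crop, crop),
                                            mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
                                            std=[0.229 * 255, 0.224 * 255, 0.225 * 255])

    def define_graph(self):
        jpegs, labels = self.input(name="Reader")
        images = self.decode(jpegs)   # decoded on GPU by the mixed decoder
        images = self.cmnp(images)    # crop, normalize, and convert layout
        return images, labels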
