nvidia-merlin / hugectr

HugeCTR is a high-efficiency GPU framework designed for Click-Through Rate (CTR) estimation training

License: Apache License 2.0

Languages: CMake 1.34%, C++ 39.83%, Cuda 26.36%, Shell 0.48%, Python 15.76%, Jupyter Notebook 16.19%, Makefile 0.01%, HTML 0.01%, Batchfile 0.01%, C 0.01%
Topics: cpp, deep-learning, gpu-acceleration, recommendation-system, recommender-system

hugectr's Introduction


HugeCTR is a GPU-accelerated recommender framework designed for training and inference of large deep learning models.

Design Goals:

  • Fast: HugeCTR performs outstandingly in recommendation benchmarks including MLPerf.
  • Easy: Whether you are a data scientist or a machine learning practitioner, we've made it easy for anybody to use HugeCTR with plenty of documentation, notebooks, and samples.
  • Domain Specific: HugeCTR provides the essentials so that you can efficiently deploy your recommender models with very large embedding tables.

NOTE: If you have any questions about using HugeCTR, please file an issue or join our Slack channel for more interactive discussions.

Table of Contents

Core Features

HugeCTR supports a variety of features. To learn about our latest enhancements, refer to our release notes.

Getting Started

If you'd like to quickly train a model using the Python interface, do the following:

  1. Start an NGC container with your local host directory (/your/host/dir) mounted by running the following command:

    docker run --gpus=all --rm -it --cap-add SYS_NICE -v /your/host/dir:/your/container/dir -w /your/container/dir -u $(id -u):$(id -g) nvcr.io/nvidia/merlin/merlin-hugectr:24.06
    

    NOTE: The /your/host/dir directory is visible inside the container as the /your/container/dir directory, which is also your starting directory.

    NOTE: HugeCTR uses NCCL to share data between ranks, and NCCL may require shared memory for IPC and pinned (page-locked) system memory resources. It is recommended that you increase these resources by adding the following options to the docker run command:

    --shm-size=1g --ulimit memlock=-1
    
  2. Write a simple Python script to generate a synthetic dataset:

    # dcn_parquet_generate.py
    import hugectr
    from hugectr.tools import DataGeneratorParams, DataGenerator
    data_generator_params = DataGeneratorParams(
      format = hugectr.DataReaderType_t.Parquet,
      label_dim = 1,
      dense_dim = 13,
      num_slot = 26,
      i64_input_key = False,
      source = "./dcn_parquet/file_list.txt",
      eval_source = "./dcn_parquet/file_list_test.txt",
      slot_size_array = [39884, 39043, 17289, 7420, 20263, 3, 7120, 1543, 39884, 39043, 17289, 7420, 
                         20263, 3, 7120, 1543, 63, 63, 39884, 39043, 17289, 7420, 20263, 3, 7120,
                         1543 ],
      dist_type = hugectr.Distribution_t.PowerLaw,
      power_law_type = hugectr.PowerLaw_t.Short)
    data_generator = DataGenerator(data_generator_params)
    data_generator.generate()
    
  3. Generate the Parquet dataset for your DCN model by running the following command:

    python dcn_parquet_generate.py
    

    NOTE: The generated dataset will reside in the folder ./dcn_parquet, which contains training and evaluation data.
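
    As an optional sanity check (purely illustrative, not part of the official workflow), you can print the generated file list and load one of the Parquet files with pandas. The Parquet path below is a placeholder; replace it with one of the files listed in file_list.txt.

    # inspect_dcn_parquet.py -- illustrative sanity check of the generated dataset
    import pandas as pd  # requires pandas with a Parquet engine such as pyarrow

    # Print the list of generated training files.
    with open("./dcn_parquet/file_list.txt") as f:
        print(f.read())

    # Load one generated Parquet file (replace with a path from file_list.txt).
    df = pd.read_parquet("./dcn_parquet/train/gen_0.parquet")
    print(df.shape)
    print(df.head())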

  4. Write a simple Python script for training:

    # dcn_parquet_train.py
    import hugectr
    from mpi4py import MPI
    solver = hugectr.CreateSolver(max_eval_batches = 1280,
                                  batchsize_eval = 1024,
                                  batchsize = 1024,
                                  lr = 0.001,
                                  vvgpu = [[0]],
                                  repeat_dataset = True)
    reader = hugectr.DataReaderParams(data_reader_type = hugectr.DataReaderType_t.Parquet,
                                     source = ["./dcn_parquet/file_list.txt"],
                                     eval_source = "./dcn_parquet/file_list_test.txt",
                                     slot_size_array = [39884, 39043, 17289, 7420, 20263, 3, 7120, 1543, 39884, 39043, 17289, 7420, 
                                                       20263, 3, 7120, 1543, 63, 63, 39884, 39043, 17289, 7420, 20263, 3, 7120, 1543 ])
    optimizer = hugectr.CreateOptimizer(optimizer_type = hugectr.Optimizer_t.Adam,
                                        update_type = hugectr.Update_t.Global)
    model = hugectr.Model(solver, reader, optimizer)
    model.add(hugectr.Input(label_dim = 1, label_name = "label",
                            dense_dim = 13, dense_name = "dense",
                            data_reader_sparse_param_array =
                            [hugectr.DataReaderSparseParam("data1", 1, True, 26)]))
    model.add(hugectr.SparseEmbedding(embedding_type = hugectr.Embedding_t.DistributedSlotSparseEmbeddingHash,
                               workspace_size_per_gpu_in_mb = 75,
                               embedding_vec_size = 16,
                               combiner = "sum",
                               sparse_embedding_name = "sparse_embedding1",
                               bottom_name = "data1",
                               optimizer = optimizer))
    model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.Reshape,
                               bottom_names = ["sparse_embedding1"],
                               top_names = ["reshape1"],
                               leading_dim=416))
    model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.Concat,
                               bottom_names = ["reshape1", "dense"], top_names = ["concat1"]))
    model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.MultiCross,
                               bottom_names = ["concat1"],
                               top_names = ["multicross1"],
                               num_layers=6))
    model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.InnerProduct,
                               bottom_names = ["concat1"],
                               top_names = ["fc1"],
                               num_output=1024))
    model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.ReLU,
                               bottom_names = ["fc1"],
                               top_names = ["relu1"]))
    model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.Dropout,
                               bottom_names = ["relu1"],
                               top_names = ["dropout1"],
                               dropout_rate=0.5))
    model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.Concat,
                               bottom_names = ["dropout1", "multicross1"],
                               top_names = ["concat2"]))
    model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.InnerProduct,
                               bottom_names = ["concat2"],
                               top_names = ["fc2"],
                               num_output=1))
    model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.BinaryCrossEntropyLoss,
                               bottom_names = ["fc2", "label"],
                               top_names = ["loss"]))
    model.compile()
    model.summary()
    model.graph_to_json(graph_config_file = "dcn.json")
    model.fit(max_iter = 5120, display = 200, eval_interval = 1000, snapshot = 5000, snapshot_prefix = "dcn")
    

    NOTE: Ensure that the paths to the synthetic datasets are correct with respect to this Python script. data_reader_type, check_type, label_dim, dense_dim, and data_reader_sparse_param_array should be consistent with the generated dataset.

  5. Train the model by running the following command:

    python dcn_parquet_train.py
    

    NOTE: Because the dataset is randomly generated, the evaluation AUC value is not meaningful. When training is done, files that contain the dumped graph JSON, saved model weights, and optimizer states will be generated.

For more information, refer to the HugeCTR User Guide.

HugeCTR SDK

We support external developers who can't use HugeCTR directly by exporting important HugeCTR components through:

  • Sparse Operation Kit directory | documentation: a Python package that wraps GPU-accelerated operations dedicated to sparse training/inference use cases.
  • GPU Embedding Cache: an embedding cache that resides in GPU memory and is designed for CTR inference workloads.

Support and Feedback

If you encounter any issues or have questions, go to https://github.com/NVIDIA/HugeCTR/issues and submit an issue so that we can provide you with the necessary resolutions and answers. To further advance the HugeCTR Roadmap, we encourage you to share all the details regarding your recommender system pipeline using this survey.

Contributing to HugeCTR

With HugeCTR being an open source project, we welcome contributions from the general public. With your contributions, we can continue to improve HugeCTR's quality and performance. To learn how to contribute, refer to our HugeCTR Contributor Guide.

Additional Resources

Webpages
NVIDIA Merlin
NVIDIA HugeCTR

Publications

Yingcan Wei, Matthias Langer, Fan Yu, Minseok Lee, Jie Liu, Ji Shi and Zehuan Wang, "A GPU-specialized Inference Parameter Server for Large-Scale Deep Recommendation Models," Proceedings of the 16th ACM Conference on Recommender Systems, pp. 408-419, 2022.

Zehuan Wang, Yingcan Wei, Minseok Lee, Matthias Langer, Fan Yu, Jie Liu, Shijie Liu, Daniel G. Abel, Xu Guo, Jianbing Dong, Ji Shi and Kunlun Li, "Merlin HugeCTR: GPU-accelerated Recommender System Training and Inference," Proceedings of the 16th ACM Conference on Recommender Systems, pp. 534-537, 2022.

Talks

Conference / Website Title Date Speaker Language
ACM RecSys 2022 A GPU-specialized Inference Parameter Server for Large-Scale Deep Recommendation Models September 2022 Matthias Langer English
Short Videos Episode 1 Merlin HugeCTR:GPU 加速的推荐系统框架 May 2022 Joey Wang 中文
Short Videos Episode 2 HugeCTR 分级参数服务器如何加速推理 May 2022 Joey Wang 中文
Short Videos Episode 3 使用 HugeCTR SOK 加速 TensorFlow 训练 May 2022 Gems Guo 中文
GTC Spring 2022 Merlin HugeCTR: Distributed Hierarchical Inference Parameter Server Using GPU Embedding Cache March 2022 Matthias Langer, Yingcan Wei, Yu Fan English
APSARA 2021 GPU 推荐系统 Merlin Oct 2021 Joey Wang 中文
GTC Spring 2021 Learn how Tencent Deployed an Advertising System on the Merlin GPU Recommender Framework April 2021 Xiangting Kong, Joey Wang English
GTC Spring 2021 Merlin HugeCTR: Deep Dive Into Performance Optimization April 2021 Minseok Lee English
GTC Spring 2021 Integrate HugeCTR Embedding with TensorFlow April 2021 Jianbing Dong English
GTC China 2020 MERLIN HUGECTR :深入研究性能优化 Oct 2020 Minseok Lee English
GTC China 2020 性能提升 7 倍 + 的高性能 GPU 广告推荐加速系统的落地实现 Oct 2020 Xiangting Kong 中文
GTC China 2020 使用 GPU EMBEDDING CACHE 加速 CTR 推理过程 Oct 2020 Fan Yu 中文
GTC China 2020 将 HUGECTR EMBEDDING 集成于 TENSORFLOW Oct 2020 Jianbing Dong 中文
GTC Spring 2020 HugeCTR: High-Performance Click-Through Rate Estimation Training March 2020 Minseok Lee, Joey Wang English
GTC China 2019 HUGECTR: GPU 加速的推荐系统训练 Oct 2019 Joey Wang 中文

Blogs

Conference / Website Title Date Authors Language
Wechat Blog Merlin HugeCTR 分级参数服务器系列之三:集成到TensorFlow Nov. 2022 Kingsley Liu 中文
NVIDIA Devblog Scaling Recommendation System Inference with Merlin Hierarchical Parameter Server/使用 Merlin 分层参数服务器扩展推荐系统推理 August 2022 Shashank Verma, Wenwen Gao, Yingcan Wei, Matthias Langer, Jerry Shi, Fan Yu, Kingsley Liu, Minseok Lee English/中文
NVIDIA Devblog Merlin HugeCTR Sparse Operation Kit 系列之二 June 2022 Kunlun Li 中文
NVIDIA Devblog Merlin HugeCTR Sparse Operation Kit 系列之一 March 2022 Gems Guo, Jianbing Dong 中文
Wechat Blog Merlin HugeCTR 分级参数服务器系列之二 March 2022 Yingcan Wei, Matthias Langer, Jerry Shi 中文
Wechat Blog Merlin HugeCTR 分级参数服务器系列之一 Jan. 2022 Yingcan Wei, Jerry Shi 中文
NVIDIA Devblog Accelerating Embedding with the HugeCTR TensorFlow Embedding Plugin Sept 2021 Vinh Nguyen, Ann Spencer, Joey Wang and Jianbing Dong English
medium.com Optimizing Meituan’s Machine Learning Platform: An Interview with Jun Huang Sept 2021 Sheng Luo and Benedikt Schifferer English
medium.com Leading Design and Development of the Advertising Recommender System at Tencent: An Interview with Xiangting Kong Sept 2021 Xiangting Kong, Ann Spencer English
NVIDIA Devblog 扩展和加速大型深度学习推荐系统 – HugeCTR 系列第 1 部分 June 2021 Minseok Lee 中文
NVIDIA Devblog 使用 Merlin HugeCTR 的 Python API 训练大型深度学习推荐模型 – HugeCTR 系列第 2 部分 June 2021 Vinh Nguyen 中文
medium.com Training large Deep Learning Recommender Models with Merlin HugeCTR’s Python APIs — HugeCTR Series Part 2 May 2021 Minseok Lee, Joey Wang, Vinh Nguyen and Ashish Sardana English
medium.com Scaling and Accelerating large Deep Learning Recommender Systems — HugeCTR Series Part 1 May 2021 Minseok Lee English
IRS 2020 Merlin: A GPU Accelerated Recommendation Framework Aug 2020 Even Oldridge et al. English
NVIDIA Devblog Introducing NVIDIA Merlin HugeCTR: A Training Framework Dedicated to Recommender Systems July 2020 Minseok Lee and Joey Wang English

hugectr's People

Contributors

aleckohlhoff, alexeedm, amukkara, ashishsardana, bashimao, benfred, bkarsin, chirayug-nvidia, emmaqiaoch, georgeliu95, janekl, jershi425, jianbing-d, kingsleyliu-nv, kunlunl, lgardenhire, mengran-nvidia, miguelusque, mikemckiernan, minseokl, nyrio, oyilmaz-nvidia, raywang96, reoptnvidia, shijieliu, vinhngx, wl1136, xiaoleishi-nv, yingcanw, zehuanw


hugectr's Issues

setting seed can't reproduce the results

What I have done:

  • set seed in config file: "solver": {"seed": 100}

  • set max_iter=2, eval_interval=1

  • set file_list.txt with 1 file and set file_list_with.txt with 1 file

  • set the train and eval reader chunk_size to 1: data_reader.reset(new DataReader(source_data, batch_size, label_dim, dense_dim,
    check_type, data_reader_sparse_param_array,
    gpu_resource_group, 1, use_mixed_precision));

  • run the training process ./huge_ctr --train model.json twice

What happened?

The AverageLoss values (1.200125 and 1.18949) are too far apart.

the first train log:

 [05d17h49m17s][HUGECTR][INFO]: Iter: 1 Time(1 iters): 0.101207s Loss: 1.211278 lr:0.000100

[05d17h49m18s][HUGECTR][INFO]: Evaluation, AUC: 0.501446
[05d17h49m18s][HUGECTR][INFO]: Evaluation, AverageLoss: 1.200125

the second train log:

 
[05d17h51m37s][HUGECTR][INFO]: Iter: 1 Time(1 iters): 0.093456s Loss: 1.200530 lr:0.000100

[05d17h51m37s][HUGECTR][INFO]: Evaluation, AUC: 0.397724

[05d17h51m37s][HUGECTR][INFO]: Evaluation, AverageLoss: 1.18949

Custom models on HugeCTR

Hi HugeCTR experts,

I want to implement a custom model on HugeCTR. So far, I could not find docs that show how to import layers/optimizers to build a custom model. Or is there anything I missed?

I wonder if you have released, or will release, documentation that shows how to build a custom model?

Thanks

when cache_size_ >1 the train loss is zero

We find that when cache_size_ > 1 is set in DataCollector, the training loss is almost zero. In DataCollector.hpp:

template <typename TypeKey>
void DataCollector<TypeKey>::collect() {
  if (counter_ < cache_size_ || cache_size_ == 0) {
    collect_();
  } else {
    collect_blank_(); 
  }
}

counter_ is incremented and, once it exceeds cache_size_, will never be less than cache_size_ again. As a result, collect_() is never run again, the training data stays stale, the model overfits, and the loss drops to almost zero.
The correct code is supposed to be:

template <typename TypeKey>
void DataCollector<TypeKey>::collect() {
  if (counter_ % internal_buffers_.size() < cache_size_ || cache_size_ == 0) {
    collect_();
  } else {
    collect_blank_(); 
  }
}
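
To make the reasoning above concrete, here is a small self-contained simulation in Python (illustrative only, not HugeCTR code; cache_size and num_buffers stand in for cache_size_ and internal_buffers_.size()). It shows that with the original condition fresh data is only collected during the first cache_size iterations, while the modulo-based condition keeps collecting as the internal buffers rotate:

# simulate_collect.py -- illustrative simulation of the two collect() conditions
def simulate(num_iters, cache_size, num_buffers, use_fix):
    actions = []
    for counter in range(num_iters):
        index = counter % num_buffers if use_fix else counter
        if index < cache_size or cache_size == 0:
            actions.append("collect")   # fresh data is copied in
        else:
            actions.append("blank")     # stale data is reused
    return actions

# Original condition: only the first cache_size iterations receive fresh data.
print(simulate(8, cache_size=2, num_buffers=4, use_fix=False))
# ['collect', 'collect', 'blank', 'blank', 'blank', 'blank', 'blank', 'blank']

# Modulo-based fix: fresh data keeps arriving as the buffers rotate.
print(simulate(8, cache_size=2, num_buffers=4, use_fix=True))
# ['collect', 'collect', 'blank', 'blank', 'collect', 'collect', 'blank', 'blank']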

[BUG] Runtime error: an illegal memory access

After processing the Criteo dataset with NVTabular and generating the output Parquet files, I get "Runtime error: an illegal memory access" when I try to train using HugeCTR and the DLRM model.

[06d20h48m42s][HUGECTR][INFO]: Iter: 14000 Time(1000 iters): 51.684892s Loss: 0.131229 lr:24.000000
[HCDEBUG][ERROR] Runtime error: an illegal memory access was encountered /HugeCTR/HugeCTR/src/embeddings/update_params_functor.cu:571 

[HCDEBUG][ERROR] Runtime error: an illegal memory access was encountered /HugeCTR/HugeCTR/src/embeddings/update_params_functor.cu:571 

[HCDEBUG][ERROR] Runtime error: an illegal memory access was encountered /HugeCTR/HugeCTR/src/session.cpp:427 

terminate called after throwing an instance of 'HugeCTR::internal_runtime_error'
  what():  [HCDEBUG][ERROR] Runtime error: an illegal memory access was encountered /HugeCTR/HugeCTR/include/general_buffer2.hpp:37

HugeCTR-2.1_beta/cub/cub/device/dispatch/../../agent/../thread/../util_ptx.cuh(276): error: identifier "__syncwarp" is undefined

HugeCTR-2.1_beta/cub/cub/device/dispatch/../../agent/../thread/../util_ptx.cuh(276): error: identifier "__syncwarp" is undefined
HugeCTR-2.1_beta/cub/cub/device/dispatch/../../agent/../thread/../util_ptx.cuh(287): error: identifier "__any_sync" is undefined
HugeCTR-2.1_beta/cub/cub/device/dispatch/../../agent/../thread/../util_ptx.cuh(300): error: identifier "__all_sync" is undefined
HugeCTR-2.1_beta/cub/cub/device/dispatch/../../agent/../thread/../util_ptx.cuh(313): error: identifier "__ballot_sync" is undefined

4 errors detected in the compilation of "/tmp/tmpxft_0000e907_00000000-6_embedding_creator.cpp1.ii".
HugeCTR/src/CMakeFiles/huge_ctr_static.dir/build.make:101: recipe for target 'HugeCTR/src/CMakeFiles/huge_ctr_static.dir/embedding_creator.cu.o' failed
make[2]: *** [HugeCTR/src/CMakeFiles/huge_ctr_static.dir/embedding_creator.cu.o] Error 1
CMakeFiles/Makefile2:156: recipe for target 'HugeCTR/src/CMakeFiles/huge_ctr_static.dir/all' failed
make[1]: *** [HugeCTR/src/CMakeFiles/huge_ctr_static.dir/all] Error 2
Makefile:129: recipe for target 'all' failed
make: *** [all] Error 2

Error of running './huge_ctr --train ./deepfm_bin.json'

Hi there,
I tried running the HugeCTR Docker example of DeepFM with NVTabular preprocessing, but after running the command in the title, it shows errors and stops when training starts. Is there a bug? Thanks.

System: Ubuntu 18.04.4 LTS
GPU: GeForce RTX 2080 Ti
Driver Version: 440.44
CUDA Version: 10.2

[0.001, init_start, ]
HugeCTR Version: 2.2.1
Config file: ./deepfm_bin.json
[21d09h02m26s][HUGECTR][INFO]: batchsize_eval is not specified using default: 512
[21d09h02m26s][HUGECTR][INFO]: Default evaluation metric is AUC without threshold value
[21d09h02m26s][HUGECTR][INFO]: algorithm_search is not specified using default: 1
[21d09h02m26s][HUGECTR][INFO]: Algorithm search: ON
[21d09h02m26s][HUGECTR][INFO]: cuda_graph is not specified using default: 1
[21d09h02m26s][HUGECTR][INFO]: CUDA Graph: ON
[21d09h02m26s][HUGECTR][INFO]: Initial seed is 3545387129
[21d09h02m28s][HUGECTR][INFO]: Peer-to-peer access cannot be fully enabled.
Device 0: GeForce RTX 2080 Ti
[21d09h02m30s][HUGECTR][INFO]: cache_eval_data is not specified using default: 0
[21d09h02m30s][HUGECTR][INFO]: max_nnz is not specified using default: 30
[21d09h02m30s][HUGECTR][INFO]: num_internal_buffers 1
[21d09h02m30s][HUGECTR][INFO]: num_internal_buffers 1
[HCDEBUG][ERROR] DataHeaderError /hugectr/HugeCTR/include/data_readers/data_reader_worker.hpp:58
[HCDEBUG][ERROR] DataHeaderError /hugectr/HugeCTR/include/data_readers/data_reader_worker.hpp:58
[... the DataHeaderError message above is repeated many times, interleaved across data reader worker threads ...]
[21d09h02m30s][HUGECTR][INFO]: max_vocabulary_size_per_gpu_=1737709
[HCDEBUG][ERROR] Runtime error: failed to read a file /hugectr/HugeCTR/include/data_readers/data_reader_worker.hpp:69
[HCDEBUG][ERROR] Runtime error: failed to read a file /hugectr/HugeCTR/include/data_readers/data_reader_worker.hpp:69
[... the "failed to read a file" message above is also repeated many times ...]

[21d09h02m30s][HUGECTR][INFO]: gpu0 start to init embedding
[21d09h02m30s][HUGECTR][INFO]: gpu0 init embedding done
[21d09h02m30s][HUGECTR][INFO]: warmup_steps is not specified using default: 1
[21d09h02m30s][HUGECTR][INFO]: decay_start is not specified using default: 0
[21d09h02m30s][HUGECTR][INFO]: decay_steps is not specified using default: 1
[21d09h02m30s][HUGECTR][INFO]: decay_power is not specified using default: 2.000000
[21d09h02m30s][HUGECTR][INFO]: end_lr is not specified using default: 0.000000
[3538.92, init_end, ]
[3538.94, run_start, ]
HugeCTR training start:
[3538.95, train_epoch_start, 0, ]

Parser doesn't check if a given layer name is already in use

Description:
Currently, our parser doesn't check whether a specified layer "name" is already being used by a preceding layer.
As a result, erroneous layers like the following can be silently inserted into the network.
Without any safety measure, this kind of config bug can result in a disconnected network whose parameters are not appropriately trained (a minimal duplicate-name check is sketched after the config excerpt below).

      {
        "name": "fc6",
        "type": "InnerProduct",
        "bottom": "relu5",
        "top": "fc6",
         "fc_param": {
          "num_output": 512
                }
      },

      {
        "name": "fc6",
        "type": "InnerProduct",
        "bottom": "relu5",
        "top": "fc6",
         "fc_param": {
          "num_output": 512
        }
      },

      {
        "name": "relu6",
        "type": "ReLU",
        "bottom": "fc6",
        "top": "relu6"
      },
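
A minimal sketch of such a check, written in Python purely for illustration (the actual parser is C++ and this function is hypothetical), applied to a layer list like the one above:

# check_layer_names.py -- hypothetical illustration of a duplicate-name check
import json

def check_unique_layer_names(layers):
    seen = set()
    for layer in layers:
        name = layer["name"]
        if name in seen:
            raise ValueError(f"duplicate layer name: {name}")
        seen.add(name)

layers = json.loads("""
[
  {"name": "fc6",   "type": "InnerProduct", "bottom": "relu5", "top": "fc6"},
  {"name": "fc6",   "type": "InnerProduct", "bottom": "relu5", "top": "fc6"},
  {"name": "relu6", "type": "ReLU",         "bottom": "fc6",   "top": "relu6"}
]
""")
check_unique_layer_names(layers)  # raises ValueError: duplicate layer name: fc6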

Comments:

dump_to_tf throwing memory error

I followed the instructions given in /hugectr/tutorials/dump_to_tf/ReadMe.
But when running

python3 main.py ../../samples/dcn/dcn_bin.json ../../samples/dcn/train/0.data ../../samples/dcn/_dense_9999.model ../../samples/dcn/0_sparse_9999.model

I am getting a memory exception. Please refer to the attached screenshots (dump_to_tf_error, free-memory-in-gb) for the actual error.

Note: I have used NVTabular with binary format to preprocess and train with HugeCTR; hence the config file used in the above command is dcn_bin.json.

Will HugeCTR add more support for TensorFlow model transferring?

Currently, there is only one tutorial about transferring a HugeCTR model to a TensorFlow model: https://github.com/NVIDIA/HugeCTR/tree/master/tutorial/dump_to_tf. The tutorial code is not well architected and seems to be a specific example rather than a common reusable module.
My question is: does the HugeCTR team plan to develop a common Python module with the following behavior:

  • Input: hugectr model config file path, tensorflow output path
  • Output: tensorflow model under tensorflow output path

Runtime error: out of memory /mnt/HugeCTR/HugeCTR/include/general_buffer.hpp:64

Hi, when I was trying to run DLRM with the Terabyte dataset on one GPU, I got a runtime error message like this. My guess is that I ran out of GPU memory. I've also tried decreasing the mini-batch size (batchsize) and batchsize_eval, but I still get this error. Does anyone know how to solve this issue? (A rough memory estimate is sketched after the config below.)

I was running the following command:
./huge_ctr --train ./dlrm_fp16_64k.json

And the solver in my dlrm_fp16_64k.json looks like this:
"solver": {
"lr_policy": "fixed",
"display": 1000,
"max_iter":64013,
"gpu": [0],
"batchsize": 1024,
"batchsize_eval": 131072,
"snapshot": 10000000,
"snapshot_prefix": "./",
"eval_interval": 3200,
"eval_batches": 681,
"mixed_precision": 1024,
"eval_metrics": ["AUC:0.8025"]
}
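
For context, here is a rough back-of-the-envelope estimate (illustrative Python; the vocabulary size below is an assumption and is not taken from the config in this issue) of why the embedding table, rather than the batch size, typically dominates GPU memory for DLRM on the Terabyte dataset:

# embedding_memory_estimate.py -- illustrative arithmetic with assumed numbers
vocab_size = 200_000_000      # assumed total number of embedding rows (Terabyte-scale)
embedding_vec_size = 128      # typical DLRM embedding width
bytes_per_value = 4           # fp32

embedding_gb = vocab_size * embedding_vec_size * bytes_per_value / 2**30
print(f"embedding table alone: ~{embedding_gb:.1f} GB")   # ~95.4 GB

# Reducing batchsize or batchsize_eval shrinks activation buffers but not the
# embedding table itself, which is why lowering the batch size may not avoid
# the out-of-memory error when the table does not fit on a single GPU.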

Parquet datareader illegal memory access when training on 2xDGXA100

Description:
commit:
commit 53a2ff8 (HEAD -> v2.3-integration, origin/v2.3-integration, origin/data-power-law-kingsley)
Merge: e75c290 3c49da0
Author: Joey Wang [email protected]
Date: Sat Oct 31 01:24:10 2020 -0700
Merge branch ‘fea-multinode-auc-dmitry-2.3’ into ‘v2.3-integration’
Multinode AUC
See merge request zehuanw/hugectr!257

dataset: /mnt/dldata/criteo_1TB/albertoa/test_dask/output/ in dlcluster

config: 2xdgxa100.json

error log:hugectr-test-1604632076.log

Reproduce steps:
Currently I am facing this bug when using raplab. I will update how to reproduce when I have access to Selene.
Comments:

Can hugectr add predict command?

Currently, the hugectr main process supports [--train] [--help] [--version]. However, it's a common scenario that, when training is done, we predict on the test data and print the result to the screen (or redirect it to a file).

With the prediction result, we can:

  1. Compare the HugeCTR prediction result with that of the transferred TensorFlow model, to make sure the transfer process is correct.
  2. Use the result to calculate other metrics, such as AUC, precision, etc. (see the sketch at the end of this issue).
  3. Do batch prediction.

Can hugectr add a predict command?

  • huge_ctr --predict model_config.json test_file_list.txt
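
A minimal sketch of point 2 above, assuming a hypothetical whitespace-separated "label probability" predictions file (this output format is an assumption, not something huge_ctr currently produces):

# compute_metrics.py -- illustrative only; predictions.txt is a hypothetical output
from sklearn.metrics import roc_auc_score, precision_score

labels, probs = [], []
with open("predictions.txt") as f:            # hypothetical output of `--predict`
    for line in f:
        label, prob = line.split()
        labels.append(int(float(label)))
        probs.append(float(prob))

print("AUC:      ", roc_auc_score(labels, probs))
print("Precision:", precision_score(labels, [int(p > 0.5) for p in probs]))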

v2.2 build error

I built v2.2 with the command: mkdir build && cd build && cmake -DCMAKE_BUILD_TYPE=Release -DNCCL_A2A=ON -DSM=70 .. && make -j

and got the following error:
[ 3%] Building CUDA object HugeCTR/src/CMakeFiles/huge_ctr_static.dir/layers/batch_norm_layer.cu.o
nvcc fatal : Value 'all-warnings' is not defined for option 'Werror'
HugeCTR/src/CMakeFiles/huge_ctr_static.dir/build.make:134: recipe for target 'HugeCTR/src/CMakeFiles/huge_ctr_static.dir/layers/batch_norm_layer.cu.o' failed
make[2]: *** [HugeCTR/src/CMakeFiles/huge_ctr_static.dir/layers/batch_norm_layer.cu.o] Error 1
CMakeFiles/Makefile2:124: recipe for target 'HugeCTR/src/CMakeFiles/huge_ctr_static.dir/all' failed
make[1]: *** [HugeCTR/src/CMakeFiles/huge_ctr_static.dir/all] Error 2
Makefile:129: recipe for target 'all' failed
make: *** [all] Error 2

and

data/HugeCTR/test/utest/layers/multi_cross_layer_test.cpp:276:25: note: suggested alternative: 'compare_array_approx'
/data/HugeCTR/test/utest/layers/multi_cross_layer_test.cpp:276:7: error: expected primary-expression before '(' token
ASSERT_TRUE(test::compare_array_approx_with_ratio(

Solution:
We fixed the error by deleting some configuration in CMakeLists.txt: we removed "--Werror all-warnings" and the test module.

Does the v2.2 testing use the Dockerfile?

v2.2 GeneralBuffer is empty error

We ran v2.2 on a V100 with CUDA 10.1 and got the following error:
[HCDEBUG][ERROR] Runtime error: GeneralBuffer is empty /tmp/HugeCTR/HugeCTR/include/general_buffer.hpp:136

Our config is:

{
    "solver": {
      "lr_policy": "fixed",
      "display":  100,
      "max_iter":  1000,
      "gpu":  [0],
      "input_key_type":"I64",
      "batchsize":  4096,
      "batchsize_eval":4096,
      "snapshot": 10000000,
      "snapshot_prefix": "./",
      "eval_interval": 100,
      "eval_metrics": ["AUC:0.9","AverageLoss"],
      "eval_batches": 500
    },
    
    "optimizer": {
      "type": "Adam",
      "global_update": true,
      "adam_hparam": {
        "learning_rate": 0.001,
        "alpha": 0.001,
        "beta1": 0.9,
        "beta2": 0.999,
        "epsilon": 0.00000001
      }
    },
  
    "layers": [ 
        {
        "name": "data",
        "type": "Data",
        "source": "./file_list.txt",
        "eval_source": "./file_list_test.txt",
        "check": "Sum",
        "label": {
          "top": "label",
          "label_dim": 1
        },
        "dense": {
          "top": "dense",
          "dense_dim": 0
        },
        "sparse": [
          {
            "top": "data1",
            "type": "DistributedSlot",
            "max_feature_num_per_sample": 100,
            "slot_num": 75
          }        
        ]
      },
  
      {
        "name": "sparse_embedding1",
        "type": "DistributedSlotSparseEmbeddingHash",
        "bottom": "data1",
        "top": "sparse_embedding1",
        "sparse_embedding_hparam": {
          "max_vocabulary_size_per_gpu": 20000000,
          "load_factor": 0.75,
          "embedding_vec_size": 16,
          "combiner": 1
        }
      },
  
      {
        "name": "reshape1",
        "type": "Reshape",
        "bottom": "sparse_embedding1",
        "top": "reshape1",
        "leading_dim": 1200
      },
  
  
      {
        "name": "concat1",
        "type": "Concat",
        "bottom": ["reshape1","dense"],
        "top": "concat1"
      },
  
      {
        "name": "slice1",
        "type": "Slice",
        "bottom": "concat1",
        "ranges": [[0,1200], [0,1200]],
        "top": ["slice11", "slice12"]
      },
  
  
      {
        "name": "multicross1",
        "type": "MultiCross",
        "bottom": "slice11",
        "top": "multicross1",
        "mc_param": {
          "num_layers": 3
        }
      },
  
      {
        "name": "fc1",
        "type": "InnerProduct",
        "bottom": "slice12",
        "top": "fc1",
         "fc_param": {
          "num_output": 256
        }
      },
  
      {
        "name": "relu1",
        "type": "ReLU",
        "bottom": "fc1",
        "top": "relu1" 
      },
        
      {
        "name": "dropout1",
        "type": "Dropout",
        "rate": 0.5,
        "bottom": "relu1",
        "top": "dropout1" 
      },
  
      {
        "name": "fc2",
        "type": "InnerProduct",
        "bottom": "dropout1",
        "top": "fc2",
         "fc_param": {
          "num_output": 128
        }
      },
  
      {
        "name": "relu2",
        "type": "ReLU",
        "bottom": "fc2",
        "top": "relu2"     
      },
  
      {
        "name": "dropout2",
        "type": "Dropout",
        "rate": 0.5,
        "bottom": "relu2",
        "top": "dropout2" 
      },
         {
        "name": "fc3",
        "type": "InnerProduct",
        "bottom": "dropout2",
        "top": "fc3",
         "fc_param": {
          "num_output": 64
        }
      },
  
      {
        "name": "relu3",
        "type": "ReLU",
        "bottom": "fc3",
        "top": "relu3"     
      },
  
      {
        "name": "dropout3",
        "type": "Dropout",
        "rate": 0.5,
        "bottom": "relu3",
        "top": "dropout3" 
      },
      {
        "name": "concat2",
        "type": "Concat",
        "bottom": ["dropout3","multicross1"],
        "top": "concat2"
      },
      
      {
        "name": "fc4",
        "type": "InnerProduct",
        "bottom": "concat2",
        "top": "fc4",
         "fc_param": {
          "num_output": 1
        }
      },
      
      {
        "name": "loss",
        "type": "BinaryCrossEntropyLoss",
        "bottom": ["fc4","label"],
        "top": "loss"
      } 
    ]
  }

Questions on HashTable

Hi, thanks for the nice work. I read the code and have the following questions.

  1. Where is the hash map placed?
    The hash map is responsible for mapping the input to a value. For example, given the input 10 and the bucket size 100, its hash value is
822 = hash('10') % 100

But all I can find is the HashTable that stores the mapping <10, 822>. I want to know the part that generates the 822 (a toy sketch of the two-step lookup follows below).

  2. Does the embedding table support dynamic growth?
    I see the embedding table behind the hash table has a fixed size, see here. So the HashTable is dynamic but the embedding table is fixed?
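
A toy Python sketch (plain Python, not HugeCTR's GPU hash table) of the two-step lookup both questions are about: the hash table grows dynamically and assigns each new key a row index, while the embedding table it points into is allocated with a fixed size up front:

import numpy as np

max_vocabulary_size = 100                # the embedding table size is fixed up front
embedding_vec_size = 16
embedding_table = np.random.rand(max_vocabulary_size, embedding_vec_size)

key_to_row = {}                          # the "hash table": grows dynamically

def lookup(key):
    if key not in key_to_row:            # first time this key is seen
        if len(key_to_row) >= max_vocabulary_size:
            raise RuntimeError("embedding table is full")
        key_to_row[key] = len(key_to_row)   # assign the next free row
    return embedding_table[key_to_row[key]]

vec = lookup(10)                         # key 10 is mapped to some row index < 100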

running hugectr with multi nodes

Is there a complete tutorial on running HugeCTR with multiple nodes?

I have tried this:

Following the examples (https://github.com/NVIDIA/HugeCTR/tree/master/samples/dcn2nodes), what I have done is:
Build a multi-node-capable image:

  • based on the Dockerfile in HugeCTR
  • install hwloc 2.2.0
  • install ucx-1.8.0
  • install openmpi 4.0.3 with UCX support
  • install mpi4py 3.0.3
  • build HugeCTR: cmake -DCMAKE_BUILD_TYPE=Release -DSM=70 -DENABLE_MULTINODES=ON ..

Run HugeCTR on two NVLink-equipped 8×V100 (32 GB) physical machines.

  • Start command is:
    export SSH_PORT="xxx"
    export NP="2"
    export WORK_DIR="/data/dcn_data/"
    export HOSTS="ip1:1,ip2:1"
    export ARGS=" ./bin/huge_ctr --train ./data/dcn-dist.json "
    cd $WORK_DIR
    bash start_dist.sh

start_dist.sh:
set -x

mpirun --bind-to none --allow-run-as-root -np $NP -H ${HOSTS} -x LD_LIBRARY_PATH=${LD_LIBRARY_PATH} -x LIBRARY_PATH=${LIBRARY_PATH} -x PATH=${PATH} -wdir ${PWD} --mca plm_rsh_agent "$PWD/ssh_resolver.sh" --mca btl_tcp_if_include ib0 $ARGS > logs.txt 2>&1 &

ssh_resolver.sh:

#!/bin/bash
HOSTNAME=$1
shift
ARGS=$*

ssh -p "$SSH_PORT" "$HOSTNAME" "$ARGS"

My questions are:

Is my mpirun command correct? Should I specify UCX in mpirun? How does HugeCTR use UCX and hwloc? And how can I use InfiniBand / RDMA to accelerate HugeCTR?

For example, the UCX command looks like:
mpirun -np 2 -mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 ./app

https://github.com/openucx/ucx

Should hugectr add batch normalization offset and scale

When saving a HugeCTR model with a batch normalization layer, we can get gamma and beta, but not offset and scale, which should be estimated from the training data.

And when we convert a HugeCTR model to a TensorFlow model, we need to set offset and scale in tf.nn.batch_normalization( x, mean, variance, offset, scale, variance_epsilon, name=None ).
Can HugeCTR add the offset and scale parameters so that they are saved to the binary model?
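
For reference, a minimal sketch of the TensorFlow side of such a conversion, assuming gamma and beta have already been loaded from the HugeCTR binary model into numpy arrays (the loading itself is not shown) and assuming the running mean/variance are the values this issue asks HugeCTR to additionally save. In tf.nn.batch_normalization, scale plays the role of gamma and offset plays the role of beta:

import numpy as np
import tensorflow as tf

channels = 64
gamma = np.ones(channels, dtype=np.float32)         # loaded from the HugeCTR model (hypothetical)
beta = np.zeros(channels, dtype=np.float32)         # loaded from the HugeCTR model (hypothetical)
moving_mean = np.zeros(channels, dtype=np.float32)  # what this issue asks HugeCTR to save
moving_var = np.ones(channels, dtype=np.float32)    # what this issue asks HugeCTR to save

x = tf.random.normal([32, channels])
y = tf.nn.batch_normalization(
    x,
    mean=moving_mean,
    variance=moving_var,
    offset=beta,      # beta is the "offset"
    scale=gamma,      # gamma is the "scale"
    variance_epsilon=1e-5,
)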

what's the meaning of local_id = feature_ids[k] + slot_offset_[k]?

Hi,
I am reading the v2.2 code. When the hash type is LocalizedSlotSparseEmbeddingOneHot, why is local_id = feature_ids[k] + slot_offset_[k]? What is the meaning of this?
if (params_.size() == 1 && params_[0].type == DataReaderSparse_t::Localized && !slot_offset_.empty()) {
  auto& param = params_[0];
  for (int k = 0; k < param.slot_num; k++) {
    int dev_id = k % csr_chunk->get_num_devices();
    T local_id = feature_ids[k] + slot_offset_[k];
    csr_chunk->get_csr_buffer(param_id, dev_id).push_back_new_row(local_id);
  }
}
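
A small Python sketch of what the addition typically achieves (this is my reading of the intent, not HugeCTR code): slot_offset_ holds the prefix sums of the per-slot cardinalities, so adding it shifts each slot's local feature IDs into one non-overlapping global ID space:

import itertools

slot_size_array = [1461, 558, 335378]                  # per-slot cardinalities (illustrative)
slot_offset = [0] + list(itertools.accumulate(slot_size_array))[:-1]
# -> [0, 1461, 2019]

feature_ids = [7, 7, 7]                                # local ID 7 in each slot
global_ids = [fid + off for fid, off in zip(feature_ids, slot_offset)]
print(global_ids)                                      # [7, 1468, 2026]: no collisions across slots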

[FEA] Make optional the number of files in Norm Dataset File List

Hi!

I am not sure that starting the norm dataset file list with the number of files is the best option.

IMO, that value should not be needed, because it can easily be calculated by the parser. It is also a potential source of future errors if the parser doesn't double-check that the specified number matches the number of files listed in the file list.

Therefore, I would suggest making that value optional (a small parser sketch follows at the end of this issue).

https://github.com/NVIDIA/HugeCTR/blob/master/docs/hugectr_user_guide.md#file-list

Hope it helps!
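
For illustration, a minimal sketch (not HugeCTR code) of a parser that either validates the count header or infers it when the header is omitted; the layout follows the Norm file-list format described in the user guide linked above:

from pathlib import Path

def read_file_list(path):
    # A Norm file list looks like:
    #   2
    #   ./data/file0.data
    #   ./data/file1.data
    lines = [ln.strip() for ln in Path(path).read_text().splitlines() if ln.strip()]
    if lines and lines[0].isdigit():
        declared, files = int(lines[0]), lines[1:]
        assert declared == len(files), "count header does not match the number of files"
    else:
        files = lines                    # header omitted: simply infer the count
    return files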

Can Hugectr supports Sequence Model?

Currently the embedding layer supports mean or sum pooling for variable-length features. In the deep learning world, using LSTM and Attention is common; for example, the DIN model uses an attention layer to aggregate the user behavior sequence.

Can HugeCTR support sequence models, such as LSTM, GRU, Attention, etc.?

question on backward computation

Hi Hugectr experts,

I have a question on the backward computation. Take the localized slot as an example:
I notice that HugeCTR performs an all-to-all after the forward propagation, and in the backward pass it performs an all-to-all again before the backward propagation. Why are there two all-to-all operations between the forward and backward passes?
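
My understanding (an assumption, not an authoritative answer): with localized slots each GPU owns a subset of slots but only a slice of the batch, so the forward all-to-all regroups "my slots, everyone's samples" into "all slots, my samples", and the backward all-to-all is simply the inverse exchange that returns gradients to the GPU owning each slot's embedding rows. A toy numpy sketch of the two exchanges:

import numpy as np

num_gpus, slots_per_gpu, samples_per_gpu, vec = 2, 3, 4, 8

# send[i][j]: block that "GPU" i sends to "GPU" j (i's slots, j's samples)
send = np.random.rand(num_gpus, num_gpus, slots_per_gpu, samples_per_gpu, vec)

recv = send.transpose(1, 0, 2, 3, 4)         # forward all-to-all: each GPU now has all slots for its samples
grad = np.ones_like(recv)                    # gradients are produced in this sample-major layout
grad_back = grad.transpose(1, 0, 2, 3, 4)    # backward all-to-all: the inverse exchange

assert grad_back.shape == send.shape         # gradients are back in the slot-owner layout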

[ QUESTION ] training without eval set

Is there a way to run training without having a validation set? Whenever I leave eval_source empty I get a file_empty kind of error.

On top of that, HugeCTR really needs better error messages; I had to trace the code to see what was happening.

Test failed when I input decimal

Hi HugeCTR experts:
in master/test/utest/layers/fully_connected_layer_test.cpp
107:for (size_t i = 0; i < k * n; ++i) h_weight[i] = (float)(rand() % 100);
108:for (size_t i = 0; i < m * k; ++i) h_in[i] = (float)(rand() % 100);

When I use decimal values instead:
107:for (size_t i = 0; i < k * n; ++i) h_weight[i] = (float)((rand() % 100) * 0.1);
108:for (size_t i = 0; i < m * k; ++i) h_in[i] = (float)((rand() % 100)* 0.1);
the test fails; the max_diff between the CPU and GPU results is > 0.1 (for example 0.3125). Why?
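
One possible explanation (an assumption, not a verified diagnosis): once the inputs are no longer exactly representable integers, the CPU and GPU accumulate the dot products in different orders, so the absolute difference grows with the reduction length even though the relative error stays small; a relative-tolerance comparison (like the compare_array_approx_with_ratio helper mentioned in the build-error issue above) is usually the right check. A small numpy sketch with hypothetical matrix sizes:

import numpy as np

m, k, n = 2048, 1024, 512                     # hypothetical GEMM sizes
a = (np.random.randint(0, 100, size=(m, k)) * 0.1).astype(np.float32)
b = (np.random.randint(0, 100, size=(k, n)) * 0.1).astype(np.float32)

ref = (a.astype(np.float64) @ b.astype(np.float64)).astype(np.float32)  # higher-precision reference
out = a @ b                                   # fp32 result (stand-in for the GPU output)

print("max abs diff:", np.max(np.abs(ref - out)))   # typically grows with the reduction length k
print("within relative tolerance:", np.allclose(ref, out, rtol=1e-3, atol=0))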

[GFN RecSys] RMM memory allocation fail for parquet Datareader

Description:

Overview:

The GFN dataset was pre-processed with NVTabular, which resulted in 8 parquet files for training. I'm using just one parquet file for testing HugeCTR, and I've modified _metadata.json to include only one filename (corresponding to that parquet file).
While training DLRM with the smallest possible embedding_vec_size=1, I'm getting the following error:

terminate called after throwing an instance of 'rmm::bad_alloc'
what(): std::bad_alloc: CNMEM error at: /opt/conda/envs/rapids/include/rmm/mr/device/cnmem_memory_resource.hpp168: CNMEM_STATUS_OUT_OF_MEMORY

The error is discussed in detail here

Minimal reproducing steps:

The dataset is available on NGC Batch (dataset id: 68926) which contains:

  1. parquet file
  2. _metadata.json
  3. _file_list_try.txt

The docker image was built using this script and is available on NGC Batch as nvidian/tme-gfnmerlin/hugectr_rel:1

Attached is the config used for training - dlrm_fp32_256_local.json

An NGC Batch job can be run using:

ngc batch run --name "gfn-hugectr" --preempt RUNONCE --ace nv-us-west-2 --instance dgx1v.32g.8.norm --commandline "bash -c 'source activate rapids && pip install gdown && jupyter notebook --allow-root --ip 0.0.0.0 --no-browser --NotebookApp.token='admin' --NotebookApp.allow_origin='*' --notebook-dir=/'" --result /results --image "nvidian/tme-gfnmerlin/hugectr_rel:1" --org nvidian --team sae --port 8786 --port 8787 --port 8888 --datasetid 68926:/gfn-merlin/data/preprocessed/preprocessed-53-jan-sept-parquet/

Please add your workspace and change your team.

Full error:

Attached is the error trace after running huge_ctr --train dlrm_fp32_256_local.json - error.log
Comments:

Why my AUC is so high in first 1000 iters?

I just want to run a DCN sample training, and I used the following model JSON:

{
  "solver": {
    "lr_policy": "fixed",
    "display": 1000,
    "max_iter": 10000,
    "gpu": [0],
    "batchsize": 512,
    "snapshot": 10000000,
    "snapshot_prefix": "./",
    "eval_interval": 1000,
    "eval_batches": 60,
    "input_key_type": "I64"
  },
  
  "optimizer": {
    "type": "Adam",
    "global_update": true,
    "adam_hparam": {
      "learning_rate": 0.001,
      "beta1": 0.9,
      "beta2": 0.999,
      "epsilon": 0.0000001
    }
  },

  "layers": [ 
      {
      "name": "data",
      "type": "Data",
      "format": "Parquet",
      "slot_size_array": [1461, 558, 335378, 211710, 306, 20, 12136, 634, 4, 51298, 5302, 332600, 3179, 27, 12191, 301211, 11, 4841, 2086, 4, 324273, 17, 16, 79734, 96, 58622],
      "source": "./dcn_data/train/_file_list.txt",
      "eval_source": "./dcn_data/val/_file_list.txt",
      "check": "None",
      "label": {
        "top": "label",
        "label_dim": 1
      },
      "dense": {
        "top": "dense",
        "dense_dim": 13
      },
      "sparse": [
        {
          "top": "data1",
          "type": "DistributedSlot",
          "max_feature_num_per_sample": 30,
          "slot_num": 26
        }        
      ]
    },

    {
      "name": "sparse_embedding1",
      "type": "DistributedSlotSparseEmbeddingHash",
      "bottom": "data1",
      "top": "sparse_embedding1",
      "sparse_embedding_hparam": {
        "max_vocabulary_size_per_gpu": 1737709,
        "embedding_vec_size": 16,
        "combiner": 0
      }
    },

    {
      "name": "reshape1",
      "type": "Reshape",
      "bottom": "sparse_embedding1",
      "top": "reshape1",
      "leading_dim": 416
    },


    {
      "name": "concat1",
      "type": "Concat",
      "bottom": ["reshape1","dense"],
      "top": "concat1"
    },

    {
      "name": "slice1",
      "type": "Slice",
      "bottom": "concat1",
      "ranges": [[0,429], [0,429]],
      "top": ["slice11", "slice12"]
    },


    {
      "name": "multicross1",
      "type": "MultiCross",
      "bottom": "slice11",
      "top": "multicross1",
      "mc_param": {
        "num_layers": 6
      }
    },

    {
      "name": "fc1",
      "type": "InnerProduct",
      "bottom": "slice12",
      "top": "fc1",
       "fc_param": {
        "num_output": 1024
      }
    },

    {
      "name": "relu1",
      "type": "ReLU",
      "bottom": "fc1",
      "top": "relu1" 
    },
      
    {
      "name": "dropout1",
      "type": "Dropout",
      "rate": 0.5,
      "bottom": "relu1",
      "top": "dropout1" 
    },

    {
      "name": "fc2",
      "type": "InnerProduct",
      "bottom": "dropout1",
      "top": "fc2",
       "fc_param": {
        "num_output": 1024
      }
    },

    {
      "name": "relu2",
      "type": "ReLU",
      "bottom": "fc2",
      "top": "relu2"     
    },

    {
      "name": "dropout2",
      "type": "Dropout",
      "rate": 0.5,
      "bottom": "relu2",
      "top": "dropout2" 
    },
    
    {
      "name": "concat2",
      "type": "Concat",
      "bottom": ["dropout2","multicross1"],
      "top": "concat2"
    },
    
    {
      "name": "fc4",
      "type": "InnerProduct",
      "bottom": "concat2",
      "top": "fc4",
       "fc_param": {
        "num_output": 1
      }
    },
    
    {
      "name": "loss",
      "type": "BinaryCrossEntropyLoss",
      "bottom": ["fc4","label"],
      "top": "loss"
    } 
  ]
}

But my metrics report for the first 1000 iterations looks very strange:

[04d15h08m10s][HUGECTR][INFO]: Iter: 1000 Time(1000 iters): 6.113479s Loss: 0.527308 lr:0.001000
[8665.98, eval_start, 0.1, ]
[04d15h08m10s][HUGECTR][INFO]: Evaluation, AUC: 0.692035
[8708.16, eval_accuracy, 0.692035, 0.1, 1000, ]
[04d15h08m10s][HUGECTR][INFO]: Eval Time for 60 iters: 0.042175s
[8708.18, eval_stop, 0.1, ]
[04d15h08m16s][HUGECTR][INFO]: Iter: 2000 Time(1000 iters): 6.171510s Loss: 0.426323 lr:0.001000
[14837.7, eval_start, 0.2, ]
....

Is it normal that my AUC after the first 1000 iterations already hits 0.692035? I also find that the AUC keeps decreasing over many of the following 1000-iteration intervals.

No reaction appeared after the training start

Hi professionals,
We tried the steps in the HugeCTR tutorial, picked DeepFM for a trial, and successfully started the training, but nothing happened after the 'HugeCTR training start' text (we waited for several days).

We tried several network configs, which however only varied max_iter (the network architecture was not changed); same problem.

System: Ubuntu 18.04.4 LTS
GPU: GeForce RTX 2080 Ti
Driver Version: 440.44
CUDA Version: 10.2

Docker for v2.3 release

Description:
add scikit-learn python module (Dmitry)

cudf 0.16 (Chirayu)

  1. @jianbingd will help to check the NGC TF docker for running the Embedding plugin.
  2. @xiaoleis will send an email and ask whether we need to update SWIAPT.

Plan B:
Four docker containers in total:
build.tfplugin.dockerfile + dev.tfplugin.dockerfile
build.dockerfile + dev.dockerfile

  • upload docker to NGC (depends on QA)
Comments:

how data collector send it's CSR buffers to remote node?

When I read the source code, I found that the data collector is supposed to

/**************************************
 * Each node will have one DataCollector.
 * Each iteration, one of the data collectors will
 * send its CSR buffers to the remote node.
 **************************************/

as commented. However, I cannot find the specific code that does this. Can somebody give some explanation? Thanks~

[FEA] Support cudf 0.16

Currently HugeCTR does not support cudf 0.16. It keeps throwing the following error.

/home/rapids/hugectr/HugeCTR/include/data_readers/parquet_data_reader_worker.hpp:29:10: fatal error: cudf/io/functions.hpp: No such file or directory
 #include <cudf/io/functions.hpp>

cudf 0.16 has refactored some code, and functions.hpp no longer exists. The includes in HugeCTR have to be updated.

Fail to build docker image with ENABLE_MULTINODES=ON

Here is the command

docker build --build-arg ENABLE_MULTINODES=ON -t hugectr:devel -f ./tools/dockerfiles/build.Dockerfile .

and got errors below:

In file included from /HugeCTR/HugeCTR/include/gpu_resource.hpp:19:0,
                 from /HugeCTR/HugeCTR/src/gpu_resource.cpp:17:
/HugeCTR/HugeCTR/include/common.hpp:29:10: fatal error: mpi.h: No such file or directory
 #include <mpi.h>
          ^~~~~~~
compilation terminated.

I think the Dockerfile does not meet the requirements for multi-node builds.

hugeCTR train performance question

I watched Zehuan Wang's talk "HugeCTR - 端到端点击率预估训练解决方案介绍" (HugeCTR: an end-to-end CTR estimation training solution).
On the "PERFORMANCE" slide of that deck, the 8-GPU performance is only 17.8 ms per iteration.
17.8 ms per iteration seems too fast; is this an error?

Criteo dataset sample processing issue

I was trying to run HugeCTR on the Criteo Kaggle dataset. When I was converting the original Kaggle dataset to HugeCTR format using the Criteo2hugeCTR_legacy tool, I ran the following command lines:

$ ./criteo2hugectr_legacy 1 ../../tools/criteo_script_legacy/train.out criteo/sparse_embedding file_list.txt
$ ./criteo2hugectr_legacy 1 ../../tools/criteo_script_legacy/test.out criteo_test/sparse_embedding file_list_test.txt

However, I'm not able to get file_list.txt and file_list_test.txt from these scripts. I'm not sure what I did wrong here, since I pretty much followed the online readme from the beginning.

I also did some trials and realized that the problem might be in criteo2hugectr_legacy.cpp, since I wasn't able to reach the EOF of txt_file (line 95).

I'd really appreciate it if you could explain this a bit. Thank you very much!

Build success but failed to run with CUDA 10.1

I want to run HugeCTR on a device with CUDA 10.1.

I changed the docker config in tools/dockerfiles/build.Dockerfile or dev.a100.Dockerfile:
FROM nvidia/cuda:11.0-cudnn8-devel-ubuntu18.04 --> FROM nvidia/cuda:10.1-cudnn7-devel-ubuntu18.04

Everything built OK and I got the HugeCTR binary files.

But then the driver seems to break down and nothing can be run. Running nvidia-smi gives:
Failed to initialize NVML: Driver/library version mismatch

While debugging, I found that the driver broke after libarrow-cuda-dev was installed.
This line: apt update && apt install -y libarrow-dev=0.17.1-1 libarrow-cuda-dev=0.17.1-1
It installs another libnvidia-compute-435, and after libnvidia-compute-435 is installed, the driver no longer works correctly.

Is there any way to solve this?

Documentations for v2.3

Description:

  • ReadMe (PIC Minseok):
  • Notebook: move the release notes here and link to the features in the User Guide. @KingsleyL will add the notebook.
  • User Guide (PIC Minseok): connections between terms and feature introductions, plus Known Issues.
  • Samples
  • Tutorial: @aleliu, multi-node training
  • Questions and Answers

Contributors finish a draft version by 9th Nov.
Reorganization starts from 9th Nov (PIC Lamont).
Comments:

Running DLRM model got Runtime error: cublas_status_not_supported

Following dlrm_fp32_64k.json, we tested DLRM on our own data: label_dim=1, dense_dim=5, slot_num=75, and got an error in the first fully connected layer.

What's wrong with my data or model config? Or is there a bug in HugeCTR?

log

[10d12h27m36s][HUGECTR][INFO]: end_lr is not specified using default: 0.000000
[6421.34, init_end, ]
[6421.35, run_start, ]
HugeCTR training start:
[6421.36, train_epoch_start, 0, ]
[HCDEBUG][ERROR] Runtime error: cublas_status_not_supported /tmp/HugeCTR/HugeCTR/src/layers/fully_connected_layer.cu:143 

[HCDEBUG][ERROR] Runtime error: operation not permitted when stream is capturing /tmp/HugeCTR/HugeCTR/src/session.cpp:451 

[HCDEBUG][ERROR] Runtime error: cublas_status_not_supported /tmp/HugeCTR/HugeCTR/src/layers/fully_connected_layer.cu:143 

Terminated with error

model config

{
  "solver": {
    "lr_policy": "fixed",
    "display": 1,
    "max_iter": 2,
    "gpu": [
        0
    ],
    "batchsize": 32,
    "snapshot": 1,
    "snapshot_prefix": "./tmp/daw",
    "eval_interval": 1,
    "batchsize_eval":32,
    "eval_metrics": [
        "AUC:0.9",
        "AverageLoss"
    ],
    "eval_batches": 1,
    "input_key_type": "I64"
},
"optimizer": {
    "type": "Adam",
    "global_update": false,
    "adam_hparam": {
        "learning_rate": 0.0001,
        "beta1": 0.9,
        "beta2": 0.999,
        "epsilon": 1e-08
    }
},
"layers": [
    {
        "name": "data",
        "type": "Data",
        "source": "./tmp/file_list.txt",
        "eval_source": "./tmp/file_list_test.txt",
        "check": "Sum",
        "label": {
            "top": "label",
            "label_dim": 1
        },
        "dense": {
            "top": "dense",
            "dense_dim": 5
        },
        "sparse": [
            {
                "top": "data1",
                "type": "DistributedSlot",
                "max_feature_num_per_sample": 180,
                "slot_num": 75
            }
        ]
    },
    {
        "name": "sparse_embedding1",
        "type": "DistributedSlotSparseEmbeddingHash",
        "bottom": "data1",
        "top": "sparse_embedding1",
        "sparse_embedding_hparam": {
            "max_vocabulary_size_per_gpu": 24000000,
            "load_factor": 0.75,
            "embedding_vec_size": 16,
            "combiner": 1
        }
    },
  
      {
        "name": "fc1",
        "type": "InnerProduct",
        "bottom": "dense",
        "top": "fc1",
         "fc_param": {
          "num_output": 512
        }
      },
  
   
    {
        "name": "relu1",
        "type": "ReLU",
        "bottom": "fc1",
        "top": "relu1" 
      },
  
      {
        "name": "fc2",
        "type": "InnerProduct",
        "bottom": "relu1",
        "top": "fc2",
         "fc_param": {
          "num_output": 256
        }
      },
  
      {
        "name": "relu2",
        "type": "ReLU",
        "bottom": "fc2",
        "top": "relu2"     
      },
      
      {
        "name": "fc3",
        "type": "InnerProduct",
        "bottom": "relu2",
        "top": "fc3",
         "fc_param": {
          "num_output": 16
        }
      },
  
      {
        "name": "relu3",
        "type": "ReLU",
        "bottom": "fc3",
        "top": "relu3"     
      },
      
      {
        "name": "interaction1",
        "type": "Interaction",
        "bottom": ["relu3", "sparse_embedding1"],
        "top": "interaction1"
      },
  
      {
        "name": "fc4",
        "type": "InnerProduct",
        "bottom": "interaction1",
        "top": "fc4",
         "fc_param": {
          "num_output": 1024
        }
      },
  
      {
        "name": "relu4",
        "type": "ReLU",
        "bottom": "fc4",
        "top": "relu4" 
      },
        
  
      {
        "name": "fc5",
        "type": "InnerProduct",
        "bottom": "relu4",
        "top": "fc5",
         "fc_param": {
          "num_output": 1024
        }
      },
  
      {
        "name": "relu5",
        "type": "ReLU",
        "bottom": "fc5",
        "top": "relu5"     
      },
      
      {
        "name": "fc6",
        "type": "InnerProduct",
        "bottom": "relu5",
        "top": "fc6",
         "fc_param": {
          "num_output": 512
        }
      },
  
      {
        "name": "relu6",
        "type": "ReLU",
        "bottom": "fc6",
        "top": "relu6"     
      },
  
      {
        "name": "fc7",
        "type": "InnerProduct",
        "bottom": "relu6",
        "top": "fc7",
         "fc_param": {
          "num_output": 256
        }
      },
  
      {
        "name": "relu7",
        "type": "ReLU",
        "bottom": "fc7",
        "top": "relu7"     
      },
      
      {
        "name": "fc8",
        "type": "InnerProduct",
        "bottom": "relu7",
        "top": "fc8",
         "fc_param": {
          "num_output": 1
        }
      },
      
      {
        "name": "loss",
        "type": "BinaryCrossEntropyLoss",
        "bottom": ["fc8","label"],
        "top": "loss"
      } 
    ]
  }

DataReader Refactoring TODO list

Description:

  • Python-friendly APIs: how can we make the code simpler and more uniform?
DataReader::set_source() {
  worker_group_.reset(new xxx_data_reader_worker_group);
}
// so that no explicit call of start() is needed
  • Completely eliminate repeat from DataReader when it is ready, e.g., enable set_source for Raw
  • Remove all the default arguments from DataReader
  • Decouple Dataset source type from DataReader type (#169 #138)
  • Make the number of DataReaders configurable (and perhaps automatic configuration for a given system)
  • Support Eval for one epoch instead of specifying n_batches. It may require changes to Metrics as well. (@minseokl will see how TF and PyTorch tackle this issue)
  • Remove duplicate code if possible
Comments:
