alibaba / hybridbackend Goto Github PK

View Code? Open in Web Editor NEW

149.0 15.0 29.0 2.96 MB

A high-performance framework for training wide-and-deep recommender systems on heterogeneous cluster

License: Apache License 2.0

Shell 1.02% Python 46.45% Makefile 0.94% C++ 51.08% C 0.49% Jinja 0.01%

deep-learning recommender-system parquet gpu hybrid-parallelism

hybridbackend's Introduction

HybridBackend

HybridBackend is a high-performance framework for training wide-and-deep recommender systems on heterogeneous cluster.

Features

Memory-efficient loading of categorical data
GPU-efficient orchestration of embedding layers
Communication-efficient training and evaluation at scale
Easy to use with existing AI workflows

Usage

A minimal example:

import tensorflow as tf
import hybridbackend.tensorflow as hb

ds = hb.data.Dataset.from_parquet(filenames)
ds = ds.batch(batch_size)
# ...

with tf.device('/gpu:0'):
  embs = tf.nn.embedding_lookup_sparse(weights, input_ids)
  # ...

Please see documentation for more information.

Install

Method 1: Install from PyPI

pip install {PACKAGE}

`{PACKAGE}`	Dependency	Python	CUDA	GLIBC	Data Opt.	Embedding Opt.	Parallelism Opt.
hybridbackend-tf115-cu121	TensorFlow 1.15	3.8	12.1	>=2.31	✓	✓	✓
hybridbackend-tf115-cu100	TensorFlow 1.15	3.6	10.0	>=2.27	✓	✓	✗
hybridbackend-tf115-cpu	TensorFlow 1.15	3.6	-	>=2.24	✓	✗	✗

Method 2: Build from source

See Building Instructions.

We also provide built docker images for latest DeepRec: registry.cn-shanghai.aliyuncs.com/pai-dlc/hybridbackend:1.0.0-deeprec-py3.6-cu114-ubuntu18.04

License

HybridBackend is licensed under the Apache 2.0 License.

Community

Please see Contributing Guide before your first contribution.
Please register as an adopter if your organization is interested in adoption. We will discuss RoadMap with registered adopters in advance.

Please cite HybridBackend in your publications if it helps:

@inproceedings{zhang2022picasso,
  title={PICASSO: Unleashing the Potential of GPU-centric Training for Wide-and-deep Recommender Systems},
  author={Zhang, Yuanxing and Chen, Langshi and Yang, Siran and Yuan, Man and Yi, Huimin and Zhang, Jie and Wang, Jiamang and Dong, Jianbo and Xu, Yunlong and Song, Yue and others},
  booktitle={2022 IEEE 38th International Conference on Data Engineering (ICDE)},
  year={2022},
  organization={IEEE}
}

Contact Us

If you would like to share your experiences with others, you are welcome to contact us in DingTalk:

hybridbackend's People

Contributors

Stargazers

Watchers

hybridbackend's Issues

How to place the embeddings on gpu?

User Story

The DeepRec latest commit supports placing the embeddings on different devices like cpu , gpu and ssd. The new feature of DeepRec can deal with training OOM . It seems that hb can only support cpu . How can HB support this feature ?

Detailed requirements

API Compatibility

Willing to contribute

Yes

rebatch api produce an Check failed: limit <= dim0_size error

Current behavior

After rebatch(), data iterator get_next() produce an error:

F tensorflow/core/framework/tensor.cc:833] Check failed: limit <= dim0_size (8194 vs. 8193)

Expected behavior

no error

System information

OS Platform and Distribution: Ubuntu 18.04.5 LTS
TensorFlow version: 1.15.0
Python version: 3.6
CUDA/cuDNN version: 10.1
RAM: 94G
GPU model and memory: Tesla T4, 16G

Code to reproduce

Step 1: Generate a parquet file by running following code

import numpy as np
import pandas as pd
import random

data_list = []
for i in range(1, 10000):
    int_feature = random.randint(1, 100)
    # float_feature = random.random()
    array_feature = [random.randint(1, 10) for x in range(0, 4)]
    data_list.append([int_feature, array_feature])

df = pd.DataFrame(data_list, columns=["int_feature", "array_feature"])
df.to_parquet("parquet_sample_file.parquet")

Step 2: Load generated parquet in step 1 by HybridBackend

import tensorflow as tf
import hybridbackend.tensorflow as hb


filenames_ds = tf.data.Dataset.from_tensor_slices(['file1.snappy.parquet', 'file2.snappy.parquet', ... 'fileN.snappy.parquet'])


hb_fields = []
hb_fields.append(hb.data.DataFrame.Field("feature1", tf.int64, ragged_rank=0))
hb_fields.append(hb.data.DataFrame.Field("feature2", tf.float32, ragged_rank=1))
hb_fields.append(hb.data.DataFrame.Field("feature3", tf.int64, ragged_rank=1))

ds = filenames_ds.apply(hb.data.read_parquet(8192, hb_fields, num_parallel_reads=tf.data.experimental.AUTOTUNE))
iterator = ds.apply(hb.data.rebatch(8192, fields=hb_fields))

it = iterator.make_one_shot_iterator()
item = it.get_next()

batch_size_dict = {}
with tf.Session() as sess:
    print("======  start ======")
    total_batch_size = 0
    while True:
        try:
            batch = sess.run(item)
            batch_size = len(batch['mod_series'])
            batch_size_dict[batch_size] = batch_size_dict.get(batch_size, 0) + 1
        except tf.errors.OutOfRangeError:
            break

Running above code in a pyhon3 shell, an error shall be thrown:

F tensorflow/core/framework/tensor.cc:833] Check failed: limit <= dim0_size (8194 vs. 8193)

Willing to contribute

Yes

DLRM model on A100 8cards training

I don't find a DLRM example in HybridBackend repo.
Can you provid a DLRM example ?

Error when running imported/restored model that uses feedable iterator

I got a situation where I trained a model and saved its checkpoint files, then I need to restore the graph from the meta file and feed a new data iterator to keep training, so i find a issue talking about that, then i write some code to demo my situation.

Current behavior

When i use ParquetDataset to feed, i can't restore the meta file, and got the following error:

Traceback (most recent call last):
  File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/framework/importer.py", line 501, in _import_graph_def_internal
    graph._c_graph, serialized, options)  # pylint: disable=protected-access
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot add function '__inference_Dataset_flat_map__create_dataset_10' because a different function with the same name already exists.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "test/io/restore_hb.py", line 223, in <module>
    resume_training(another_train_dataset, another_test_dataset)
  File "test/io/restore_hb.py", line 132, in resume_training
    saver = tf.train.import_meta_graph('checkpoints_hb/fufu.meta')
  File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/training/saver.py", line 1697, in import_meta_graph
    **kwargs)[0]
  File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/training/saver.py", line 1721, in _import_meta_graph_with_return_elements
    **kwargs))
  File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/framework/meta_graph.py", line 809, in import_scoped_meta_graph_with_return_elements
    return_elements=return_elements)
  File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/framework/importer.py", line 405, in import_graph_def
    producer_op_list=producer_op_list)
  File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/framework/importer.py", line 505, in _import_graph_def_internal
    raise ValueError(str(e))
ValueError: Cannot add function '__inference_Dataset_flat_map__create_dataset_10' because a different function with the same name already exists.

I guess that error not belongs to a bug for HybridBackend, because i also try the TFRecordDataset and get a similar error：

Traceback (most recent call last):
  File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/framework/importer.py", line 501, in _import_graph_def_internal
    graph._c_graph, serialized, options)  # pylint: disable=protected-access
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot add function '__inference_Dataset_map__parse_function_55' because a different function with the same name already exists.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "test/io/restore_pb.py", line 225, in <module>
    restore_feed()
  File "test/io/restore_pb.py", line 220, in restore_feed
    resume_training(another_train_dataset, another_test_dataset)
  File "test/io/restore_pb.py", line 155, in resume_training
    saver = tf.train.import_meta_graph('checkpoints_pb/fufu.meta')
  File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/training/saver.py", line 1697, in import_meta_graph
    **kwargs)[0]
  File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/training/saver.py", line 1721, in _import_meta_graph_with_return_elements
    **kwargs))
  File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/framework/meta_graph.py", line 809, in import_scoped_meta_graph_with_return_elements
    return_elements=return_elements)
  File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/framework/importer.py", line 405, in import_graph_def
    producer_op_list=producer_op_list)
  File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/framework/importer.py", line 505, in _import_graph_def_internal
    raise ValueError(str(e))
ValueError: Cannot add function '__inference_Dataset_map__parse_function_55' because a different function with the same name already exists.

But that process works for from_tensor_slices and CsvDataset, i'm just curious and want to know how to restore and feed a new dataset iterator.

Expected behavior

When i use ParquetDataset in traing, i can restore the checkpoint and feed a new ParquetDataset iterator

System information

GPU model and memory: 16G for Tesla T4
OS Platform: Ubuntu 18.04.5 LTS (Bionic Beaver)
Docker version: 20.10.14
GCC/CUDA/cuDNN version: gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04),
Python/conda version: Python 3.6.12
TensorFlow/PyTorch version: tensorflow 1.15.5+deeprec2201

Code to reproduce

training and restore use ParquetDataset to feed that doesn't work

# Tensorflow 1.15
# https://github.com/tensorflow/tensorflow/issues/11679#
#
import tensorflow as tf
import numpy as np
import pandas as pd
import os
import shutil
from hybridbackend.tensorflow.data import DataFrame
from hybridbackend.tensorflow.data import ParquetDataset
from tensorflow.python.data.ops import dataset_ops

new_dtypes = {"test1": np.float32, "test2": np.float32}

train_df = pd.DataFrame(np.random.randint(0, 100, (5, 2)), columns=['test1', 'test2'])
train_df = train_df.astype(new_dtypes)
train_df.to_parquet('train.parquet')

test_df = pd.DataFrame(np.random.randint(0, 100, (2, 2)), columns=['test1', 'test2'])
test_df = test_df.astype(new_dtypes)
test_df.to_parquet('test.parquet')


def make_initializable_iterator(ds):
  if hasattr(dataset_ops, 'make_initializable_iterator'):
    return dataset_ops.make_initializable_iterator(ds)
  return ds.make_initializable_iterator()


def make_one_shot_iterator(ds):
  if hasattr(dataset_ops, 'make_one_shot_iterator'):
    return dataset_ops.make_one_shot_iterator(ds)
  return ds.make_one_shot_iterator()


def train(train_dataset, test_dataset):
  """
    Create graph with an Dataset and Iterator and save the model.

    There is some op that is applied to the data from the iterator.
    """
  iterator_handle = tf.placeholder(tf.string, shape=[])
  tf.add_to_collection('iterator_handle', iterator_handle)

  iterator = tf.data.Iterator.from_string_handle(iterator_handle, dataset_ops.get_legacy_output_types(train_dataset),
                                                 dataset_ops.get_legacy_output_shapes(train_dataset),
                                                 dataset_ops.get_legacy_output_classes(train_dataset))
  train_iter = make_initializable_iterator(train_dataset)
  test_iter = make_initializable_iterator(test_dataset)
  element = iterator.get_next()

  v = tf.get_variable(name='v', initializer=tf.zeros(shape=(1, 2)))

  # to use when saving summaries
  global_step = tf.Variable(0, name='global_step', trainable=False, dtype=tf.int32)
  increament_global_step = tf.assign(global_step, global_step + 1)
  global_step = global_step + 1
  tf.add_to_collection('increament_global_step', increament_global_step)

  some_op = tf.assign(v, v + tf.abs(element['test1']))
  tf.add_to_collection('some_op', tf.reduce_sum(some_op))

  tf.summary.scalar('v_sum', tf.reduce_sum(v))
  tf.summary.scalar('some_op', tf.reduce_mean(some_op))
  merged_summary = tf.summary.merge_all()
  tf.add_to_collection('merged_summary', merged_summary)

  writer = tf.summary.FileWriter('checkpoints_hb', graph=tf.get_default_graph())
  saver = tf.train.Saver()

  with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())

    train_handle = sess.run(train_iter.string_handle())
    test_handle = sess.run(test_iter.string_handle())

    # Run data iterator initialisation
    sess.run(train_iter.initializer)
    sess.run(test_iter.initializer)

    # "Training"
    print("Training")
    while True:
      try:
        [op, summary_values, g_step] = sess.run([some_op, merged_summary, increament_global_step],
                                                feed_dict={iterator_handle: train_handle})
        writer.add_summary(summary_values, global_step=g_step)
        print(op)
      except tf.errors.OutOfRangeError:
        break

    # "Test evaluation"
    print("Testing")
    while True:
      try:
        print(sess.run(some_op, feed_dict={iterator_handle: test_handle}))
      except tf.errors.OutOfRangeError:
        break

    saver.save(sess, 'checkpoints_hb/fufu')

def resume_training(train_dataset, test_dataset):
  """Restore the model from file and pass some new data through it
     for further training """
  with tf.Session() as sess:
    saver = tf.train.import_meta_graph('checkpoints_hb/fufu.meta')
    saver.restore(sess, 'checkpoints_hb/fufu')
    iterator_handle = tf.get_collection('iterator_handle')[0]
    some_op = tf.get_collection('some_op')[0]
    increament_global_step = tf.get_collection('increament_global_step')[0]
    merged_summary = tf.get_collection('merged_summary')[0]

    writer = tf.summary.FileWriter('checkpoints_hb', graph=tf.get_default_graph())

    # Make new iterators and handles
    train_iter = make_initializable_iterator(train_dataset)
    test_iter = make_initializable_iterator(test_dataset)

    train_handle = sess.run(train_iter.string_handle())
    test_handle = sess.run(test_iter.string_handle())

    # Further training the model using new datasets (which may be different from original ones)
    print("Resume training ...")

    train_handle = sess.run(train_iter.string_handle())
    test_handle = sess.run(test_iter.string_handle())

    # Run data iterator initialisation
    sess.run(train_iter.initializer)
    sess.run(test_iter.initializer)

    # "Training"
    print("Training")
    while True:
      try:
        [op, summary_values, g_step] = sess.run([some_op, merged_summary, increament_global_step],
                                                feed_dict={iterator_handle: train_handle})
        writer.add_summary(summary_values, global_step=g_step)
        print(op)
      except tf.errors.OutOfRangeError:
        break

    # "Test evaluation"
    print("Testing")
    while True:
      try:
        print(sess.run(some_op, feed_dict={iterator_handle: test_handle}))
      except tf.errors.OutOfRangeError:
        break

    saver.save(sess, 'checkpoints_hb/fufu')


def train_feed():
  # delete existing saved models and summary files
  if os.path.exists('checkpoints_hb'):
    shutil.rmtree('checkpoints_hb')
  # train_dataset = tf.data.Dataset.from_tensor_slices(
  #     tf.constant(np.random.randint(0, 100, (5, 2)), dtype=tf.float32))
  train_dataset = ParquetDataset('train.parquet',
                                 batch_size=1,
                                 fields=[DataFrame.Field('test1', tf.float32),
                                         DataFrame.Field('test2', tf.float32)])
  test_dataset = ParquetDataset('test.parquet',
                                batch_size=1,
                                fields=[DataFrame.Field('test1', tf.float32),
                                        DataFrame.Field('test2', tf.float32)])
  # test_dataset = tf.data.Dataset.from_tensor_slices(
  # tf.constant(np.random.randint(0, 100, (2, 2)), dtype=tf.float32))

  train(train_dataset, test_dataset)


def restore_feed():
  # Load and fine-tune the saved model using new data
  another_train_dataset = ParquetDataset(
      'train.parquet',
      batch_size=1,
      fields=[DataFrame.Field('test1', tf.float32),
              DataFrame.Field('test2', tf.float32)])
  another_test_dataset = ParquetDataset(
      'test.parquet', batch_size=1, fields=[DataFrame.Field('test1', tf.float32),
                                            DataFrame.Field('test2', tf.float32)])

  resume_training(another_train_dataset, another_test_dataset)


if __name__ == '__main__':
  train_feed()
  restore_feed()

It works neither for TFRecordDataset.

import tensorflow as tf
import numpy as np
import pandas as pd
import os
import shutil
from tensorflow.python.data.ops import dataset_ops


def make_one_shot_iterator(ds):
  if hasattr(dataset_ops, 'make_one_shot_iterator'):
    return dataset_ops.make_one_shot_iterator(ds)
  return ds.make_one_shot_iterator()


def make_initializable_iterator(ds):
  if hasattr(dataset_ops, 'make_initializable_iterator'):
    return dataset_ops.make_initializable_iterator(ds)
  return ds.make_initializable_iterator()


# Define features
feature_description = {
    'test1': tf.io.FixedLenFeature([], dtype=tf.float32),
    'test2': tf.io.FixedLenFeature([], dtype=tf.float32)
}


def _parse_function(example_proto):
  return tf.io.parse_example(example_proto, feature_description)


def write_pb(df, file):
  # Write TFrecord file
  with tf.io.TFRecordWriter(file) as writer:
    for index, row in df.iterrows():
      print(row['test1'], row['test2'])
      # Create the Example
      example = tf.train.Example(features=tf.train.Features(
          feature={
              'test1': tf.train.Feature(float_list=tf.train.FloatList(value=[row['test1']])),
              'test2': tf.train.Feature(float_list=tf.train.FloatList(value=[row['test2']]))
          }))
      writer.write(example.SerializeToString())


new_dtypes = {"test1": np.float32, "test2": np.float32}

train_df = pd.DataFrame(np.random.randint(0, 100, (5, 2)), columns=['test1', 'test2'])
train_df = train_df.astype(new_dtypes)
write_pb(train_df, 'train.tfrecord')

test_df = pd.DataFrame(np.random.randint(0, 100, (2, 2)), columns=['test1', 'test2'])
test_df = test_df.astype(new_dtypes)
write_pb(test_df, 'test.tfrecord')


def train(train_dataset, test_dataset):
  """
  Create graph with an Dataset and Iterator and save the model.

  There is some op that is applied to the data from the iterator.
  """
  iterator_handle = tf.placeholder(tf.string, shape=[])
  tf.add_to_collection('iterator_handle', iterator_handle)

  iterator = tf.data.Iterator.from_string_handle(iterator_handle, dataset_ops.get_legacy_output_types(train_dataset),
                                                 dataset_ops.get_legacy_output_shapes(train_dataset),
                                                 dataset_ops.get_legacy_output_classes(train_dataset))
  train_iter = make_initializable_iterator(train_dataset)
  test_iter = make_initializable_iterator(test_dataset)
  element = iterator.get_next()

  v = tf.get_variable(name='v', initializer=tf.zeros(shape=(1, 2)))

  # to use when saving summaries
  global_step = tf.Variable(0, name='global_step', trainable=False, dtype=tf.int32)
  increament_global_step = tf.assign(global_step, global_step + 1)
  global_step = global_step + 1
  tf.add_to_collection('increament_global_step', increament_global_step)

  some_op = tf.assign(v, v + tf.abs(element['test1']))
  tf.add_to_collection('some_op', tf.reduce_sum(some_op))

  tf.summary.scalar('v_sum', tf.reduce_sum(v))
  tf.summary.scalar('some_op', tf.reduce_mean(some_op))
  merged_summary = tf.summary.merge_all()
  tf.add_to_collection('merged_summary', merged_summary)

  writer = tf.summary.FileWriter('checkpoints_pb', graph=tf.get_default_graph())
  saver = tf.train.Saver()

  with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())

    train_handle = sess.run(train_iter.string_handle())
    test_handle = sess.run(test_iter.string_handle())

    # Run data iterator initialisation
    sess.run(train_iter.initializer)
    sess.run(test_iter.initializer)

    # "Training"
    print("Training")
    while True:
      try:
        [op, summary_values, g_step] = sess.run([some_op, merged_summary, increament_global_step],
                                                feed_dict={iterator_handle: train_handle})
        writer.add_summary(summary_values, global_step=g_step)
        print(op)
      except tf.errors.OutOfRangeError:
        break

    # "Test evaluation"
    print("Testing")
    while True:
      try:
        print(sess.run(some_op, feed_dict={iterator_handle: test_handle}))
      except tf.errors.OutOfRangeError:
        break

    saver.save(sess, 'checkpoints_pb/fufu')


def resume_training(train_dataset, test_dataset):
  """Restore the model from file and pass some new data through it
     for further training """
  with tf.Session() as sess:
    saver = tf.train.import_meta_graph('checkpoints_pb/fufu.meta')
    saver.restore(sess, 'checkpoints_pb/fufu')
    iterator_handle = tf.get_collection('iterator_handle')[0]
    some_op = tf.get_collection('some_op')[0]
    increament_global_step = tf.get_collection('increament_global_step')[0]
    merged_summary = tf.get_collection('merged_summary')[0]

    writer = tf.summary.FileWriter('checkpoints_pb', graph=tf.get_default_graph())

    # Make new iterators and handles
    train_iter = make_initializable_iterator(train_dataset)
    test_iter = make_initializable_iterator(test_dataset)

    train_handle = sess.run(train_iter.string_handle())
    test_handle = sess.run(test_iter.string_handle())

    # Further training the model using new datasets (which may be different from original ones)
    print("Resume training ...")

    train_handle = sess.run(train_iter.string_handle())
    test_handle = sess.run(test_iter.string_handle())

    # Run data iterator initialisation
    sess.run(train_iter.initializer)
    sess.run(test_iter.initializer)

    # "Training"
    print("Training")
    while True:
      try:
        [op, summary_values, g_step] = sess.run([some_op, merged_summary, increament_global_step],
                                                feed_dict={iterator_handle: train_handle})
        writer.add_summary(summary_values, global_step=g_step)
        print(op)
      except tf.errors.OutOfRangeError:
        break

    # "Test evaluation"
    print("Testing")
    while True:
      try:
        print(sess.run(some_op, feed_dict={iterator_handle: test_handle}))
      except tf.errors.OutOfRangeError:
        break

    saver.save(sess, 'checkpoints_pb/fufu')


def train_feed():
  # delete existing saved models and summary files
  if os.path.exists('checkpoints_pb'):
    shutil.rmtree('checkpoints_pb')
  # train_dataset = tf.data.Dataset.from_tensor_slices(
  #     tf.constant(np.random.randint(0, 100, (5, 2)), dtype=tf.float32))
  train_dataset = tf.data.TFRecordDataset(['train.tfrecord']).batch(1).map(_parse_function)
  test_dataset = tf.data.TFRecordDataset(['test.tfrecord']).batch(1).map(_parse_function)

  train(train_dataset, test_dataset)


def restore_feed():
  # Load and fine-tune the saved model using new data
  another_train_dataset = tf.data.TFRecordDataset(['train.tfrecord']).batch(1).map(_parse_function)
  another_test_dataset = tf.data.TFRecordDataset(['test.tfrecord']).batch(1).map(_parse_function)

  resume_training(another_train_dataset, another_test_dataset)


if __name__ == '__main__':
  train_feed()
  restore_feed()

But works for CsvDataset

import tensorflow as tf
import numpy as np
import pandas as pd
import os
import shutil
from tensorflow.python.data.experimental.ops import readers
from tensorflow.python.data.ops import dataset_ops

new_dtypes = {"test1": np.float32, "test2": np.float32}

train_df = pd.DataFrame(np.random.randint(0, 100, (5, 2)), columns=['test1', 'test2'])
train_df = train_df.astype(new_dtypes)
train_df.to_csv('train.csv', index=False)

test_df = pd.DataFrame(np.random.randint(0, 100, (2, 2)), columns=['test1', 'test2'])
test_df = test_df.astype(new_dtypes)
test_df.to_csv('test.csv', index=False)


def make_initializable_iterator(ds):
  if hasattr(dataset_ops, 'make_initializable_iterator'):
    return dataset_ops.make_initializable_iterator(ds)
  return ds.make_initializable_iterator()


def make_one_shot_iterator(ds):
  if hasattr(dataset_ops, 'make_one_shot_iterator'):
    return dataset_ops.make_one_shot_iterator(ds)
  return ds.make_one_shot_iterator()


def train(train_dataset, test_dataset):
  """
    Create graph with an Dataset and Iterator and save the model.

    There is some op that is applied to the data from the iterator.
    """
  iterator_handle = tf.placeholder(tf.string, shape=[])
  tf.add_to_collection('iterator_handle', iterator_handle)

  iterator = tf.data.Iterator.from_string_handle(iterator_handle, dataset_ops.get_legacy_output_types(train_dataset),
                                                 dataset_ops.get_legacy_output_shapes(train_dataset),
                                                 dataset_ops.get_legacy_output_classes(train_dataset))
  train_iter = make_initializable_iterator(train_dataset)
  test_iter = make_initializable_iterator(test_dataset)
  element = iterator.get_next()

  v = tf.get_variable(name='v', initializer=tf.zeros(shape=(1, 2)))

  # to use when saving summaries
  global_step = tf.Variable(0, name='global_step', trainable=False, dtype=tf.int32)
  increament_global_step = tf.assign(global_step, global_step + 1)
  global_step = global_step + 1
  tf.add_to_collection('increament_global_step', increament_global_step)

  some_op = tf.assign(v, v + tf.abs(element))
  tf.add_to_collection('some_op', tf.reduce_sum(some_op))

  tf.summary.scalar('v_sum', tf.reduce_sum(v))
  tf.summary.scalar('some_op', tf.reduce_mean(some_op))
  merged_summary = tf.summary.merge_all()
  tf.add_to_collection('merged_summary', merged_summary)

  writer = tf.summary.FileWriter('checkpoints_csv', graph=tf.get_default_graph())
  saver = tf.train.Saver()

  with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())

    train_handle = sess.run(train_iter.string_handle())
    test_handle = sess.run(test_iter.string_handle())

    # Run data iterator initialisation
    sess.run(train_iter.initializer)
    sess.run(test_iter.initializer)

    # "Training"
    print("Training")
    while True:
      try:
        [op, summary_values, g_step] = sess.run([some_op, merged_summary, increament_global_step],
                                                feed_dict={iterator_handle: train_handle})
        writer.add_summary(summary_values, global_step=g_step)
        print(op)
      except tf.errors.OutOfRangeError:
        break

    # "Test evaluation"
    print("Testing")
    while True:
      try:
        print(sess.run(some_op, feed_dict={iterator_handle: test_handle}))
      except tf.errors.OutOfRangeError:
        break

    saver.save(sess, 'checkpoints_csv/fufu')


def resume_training(train_dataset, test_dataset):
  """Restore the model from file and pass some new data through it
     for further training """
  with tf.Session() as sess:
    saver = tf.train.import_meta_graph('checkpoints_csv/fufu.meta')
    saver.restore(sess, 'checkpoints_csv/fufu')
    iterator_handle = tf.get_collection('iterator_handle')[0]
    some_op = tf.get_collection('some_op')[0]
    increament_global_step = tf.get_collection('increament_global_step')[0]
    merged_summary = tf.get_collection('merged_summary')[0]

    writer = tf.summary.FileWriter('checkpoints_csv', graph=tf.get_default_graph())

    # Make new iterators and handles
    train_iter = make_initializable_iterator(train_dataset)
    test_iter = make_initializable_iterator(test_dataset)

    train_handle = sess.run(train_iter.string_handle())
    test_handle = sess.run(test_iter.string_handle())

    # Further training the model using new datasets (which may be different from original ones)
    print("Resume training ...")

    train_handle = sess.run(train_iter.string_handle())
    test_handle = sess.run(test_iter.string_handle())

    # Run data iterator initialisation
    sess.run(train_iter.initializer)
    sess.run(test_iter.initializer)

    # "Training"
    print("Training")
    while True:
      try:
        [op, summary_values, g_step] = sess.run([some_op, merged_summary, increament_global_step],
                                                feed_dict={iterator_handle: train_handle})
        writer.add_summary(summary_values, global_step=g_step)
        print(op)
      except tf.errors.OutOfRangeError:
        break

    # "Test evaluation"
    print("Testing")
    while True:
      try:
        print(sess.run(some_op, feed_dict={iterator_handle: test_handle}))
      except tf.errors.OutOfRangeError:
        break

    saver.save(sess, 'checkpoints_csv/fufu')


def train_feed():
  # delete existing saved models and summary files
  if os.path.exists('checkpoints_csv'):
    shutil.rmtree('checkpoints_csv')
  # train_dataset = tf.data.Dataset.from_tensor_slices(
  #     tf.constant(np.random.randint(0, 100, (5, 2)), dtype=tf.float32))
  train_dataset = readers.CsvDataset("train.csv", record_defaults=[tf.float32, tf.float32], header=True)
  test_dataset = readers.CsvDataset("test.csv", record_defaults=[tf.float32, tf.float32], header=True)
  # test_dataset = tf.data.Dataset.from_tensor_slices(
  # tf.constant(np.random.randint(0, 100, (2, 2)), dtype=tf.float32))

  train(train_dataset, test_dataset)


def restore_feed():
  # Load and fine-tune the saved model using new data
  another_train_dataset = readers.CsvDataset("train.csv", record_defaults=[tf.float32, tf.float32], header=True)
  another_test_dataset = readers.CsvDataset("test.csv", record_defaults=[tf.float32, tf.float32], header=True)

  resume_training(another_train_dataset, another_test_dataset)


if __name__ == '__main__':
  train_feed()
  restore_feed()

Willing to contribute

Yes

Failed to train with multiple GPUs in single node

User Story

Failed to train with multiple GPUs in single node when we use from tensorflow.contrib.layers.python.layers import feature_column

Detailed requirements

API Compatibility

Willing to contribute

Yes

support keras fit history in estimator's train_and_evaluate

User Story

I want to hold a record of the loss values and metric values during training, like keras History object:
https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/History
https://keras.io/guides/training_with_built_in_methods/

Detailed requirements

I have to decide saving or not models depends on it's metrics(maybe lastest one).

API Compatibility

hb.estimator.train_and_evaluate

Willing to contribute

Yes

Question: When to release the code?

Hi, I saw the code is not public according to the architecture doc here https://hybridbackend.readthedocs.io/en/latest/architecture.html

Do you have a plan to open source it? Or is it just focused on data io?

I'd appreciate it if anyone could help me.

Thanks 🥂 🍻

Support keras.Model convert to hb.keras.Model by one function

User Story

Now a lot of models was build by keras.Model and function API.
If I want to tranfer them to hb.keras.Model, I must rewrite them using subclass API ,and change training pipeline code.
Maybe It could be done by hb.

Detailed requirements

Keras.Model's training pipeline like this :

build Model(function API or subclass)
model.compile
model.fit
model.save

It should add a new function like hb.keras.utils.tranfer_keras_model(keras.Model ,**args), between step1 and step2.
Then I get a hb.keras.Model to do distribute training or enjoy it's abilities.

API Compatibility

Willing to contribute

Yes

Feature Request: Support fixed length list

User Story

As a deep learning scientist, I want to support fixed-length list in ParquetDataset, so that I can retrieve dense Tensors directly from list.

API Compatibility

It should be compatible with existing API

Willing to contribute

Yes

support ARROW_NUM_THREADS in ParquetDataset

User Story

hb.data.ParquetDataset can not used all of pod-cpu.

Detailed requirements

hb.data.ParquetDataset

num_parallel_reads to set file reader nums
**[new]**num_arrow_threads to set column reader thread nums

to accelerate model training

API Compatibility

hb.data.ParquetDataset

Willing to contribute

Yes

Support Mac OSX.

User Story

Now only limited linux version support, so we cannot use hybridbackend in our local Macos systems, like maxos 10.x, 11.x and 12.x.

API Compatibility

tensorflow 1.14 & 1.15
maxos 10+ (Intel/M1)

Metrics in HB multi outputs job don't get correctly

Current behavior

When running multi outputs job with HB, Log doesn't get the metrics of second output

As show above, the log gets the output_1_auc metric but leave out the output_2_mean_squared_error metric and output_2_mean_absoluted_error metric

Expected behavior

I want all the metrics can get correctly

System information

GPU model and memory:
OS Platform:
Docker version:
GCC/CUDA/cuDNN version:
Python/conda version:
TensorFlow/PyTorch version:

Code to reproduce

Willing to contribute

Yes

How to realize gradient truncation function in HB Pkg

Current behavior

Recently, when I trained the model, the loss function got a Nan value. I want to perform gradient truncation. However HB can only use tf.train.XxxOptimizer api, which is used with computes_ gradients, tf.clip_ by_ value(), tf.clip_ by_ Norm() . But now HB uses tf.keras compile, fit mode. The community tf.keras.optimizers Adam () supports the clipvalue/clipnorm parameter .

Expected behavior

I want the HB pkg can support the gradient truncation function.

System information

GPU model and memory:
OS Platform:
Docker version:
GCC/CUDA/cuDNN version:
Python/conda version:
TensorFlow/PyTorch version:

Code to reproduce

Willing to contribute

Yes

hb.data.ParquetDataset will discard some data

Current behavior

When in distributed mode, hb.data.ParquetDataset will discard some data. In the case of a large number of parquet files
and a relatively small file size, even one-third of the data set is discarded.
Because the data on each device divides the total data set equally according to the number of row groups, but the
number of row groups of a parquet file is usually not guaranteed to be divisible by the number of devices, and the
redundant group will be discarded.

Expected behavior

Expected no data is discarded in the dataset or a small part of the data is discarded

System information

GPU model and memory: Two GPU devices： Tesla T4 . Memory: 15109MiB
OS Platform: x86_64 x86_64 x86_64 GNU/Linux
Docker version:
GCC/CUDA/cuDNN version: CUDA 11.4 /cuDnn 8
Python/conda version: python3.6
TensorFlow/PyTorch version: DeepRec deeprec2212, HybridBackend a832b4e

Code to reproduce

datasource = hb.data.ParquetDataset(filenames, batch_size=10000,
                            num_parallel_parser_calls=tf.data.experimental.AUTOTUNE,
                            num_parallel_reads = tf.data.experimental.AUTOTUNE,
                        #    drop_remainder=True,
                            partition_count =hb.context.world_size,
                            partition_index =hb.context.rank,
                            fields=fields,
                            )
datasource = datasource.apply(hb.data.parse()).map(map_func = parquet_map, num_parallel_calls=tf.data.experimental.AUTOTUNE)
datasource = datasource.prefetch(2)
            
iterator = tf.data.make_one_shot_iterator(datasource)

Willing to contribute

Yes

Filter_func in parqeut reader

User Story

It is a common process that map, filter and batch, in row-based storage format, like tfrecord. But in parquet format, transforming to row-based dataset performs very badly and fitlering data after batch will bring the size of batch fluctuating drasticly. So we suppose to add a filter_func in read_parquet interface that helps user to get a clean batch directly.

Detailed requirements

add filter_func in hybridbackend.tensorflow.data.read_parquet(batch_size, fields=None, partition_count=1, partition_index=0, drop_remainder=False, num_parallel_reads=None, num_sequential_reads=1, filter_func=None, map_func=None)

API Compatibility

At least tensorflow 1.14 and 1.15

QR code is invalid

Error when drop_reminder=True using rebatch API

Current behavior

Using rebatch API with drop_reminder=True will make program exit with segmentation fault

Expected behavior

No error

System information

GPU model and memory:
OS Platform: ubuntu 18
Docker version:
GCC/CUDA/cuDNN version:
Python/conda version: python 3.6
TensorFlow/PyTorch version: 1.5.0
HybridBackend version: 0.6.0a0

Code to reproduce

(1) First generate a random parquet file.

import pandas as pd
import random

data_list = []
for i in range(1, 100000):
    int_feature = random.randint(1, 1000)
    array_feature = [random.randint(1, 1000) for x in range(0, 50)]
    data_list.append([int_feature, array_feature, 0.8])

df = pd.DataFrame(data_list, columns=["int_feature", "array_feature", "label"])
df['label'] = pd.to_numeric(df["label"], downcast="float")
df.to_parquet("parquet_sample_file.parquet")

(2) Then read data

import tensorflow as tf
import tensorflow.keras as keras
import hybridbackend.tensorflow as hb

BATCH_SIZE = 1000


def get_parquet_ds():
    filenames_ds = tf.data.Dataset.from_tensor_slices([
        'parquet_sample_file.parquet'
    ]*1)
    hb_fields = []

    def _map(elem):
        features = {
            "int_feature": tf.cast(tf.reshape(elem["int_feature"], [-1, 1]), dtype=tf.float32),
            "array_feature": tf.cast(tf.reshape(elem["array_feature"].values, [-1, 50]),
                                              dtype=tf.float32)
        }
        labels = tf.reshape(elem["label"], [-1, 1])
        return features, labels

    hb_fields.append(hb.data.DataFrame.Field("int_feature", tf.int64, ragged_rank=0))
    hb_fields.append(hb.data.DataFrame.Field("array_feature", tf.int64, ragged_rank=1))
    hb_fields.append(hb.data.DataFrame.Field("label", tf.float32, ragged_rank=0))
    iterator = filenames_ds.apply(
        hb.data.read_parquet(BATCH_SIZE, hb_fields, num_parallel_reads=tf.data.experimental.AUTOTUNE))
    iterator = iterator.apply(hb.data.rebatch(BATCH_SIZE*2, fields=hb_fields, drop_remainder=True)).map(_map)

    return iterator


def train():
    global_init_op = tf.compat.v1.global_variables_initializer()

    ds = get_parquet_ds()
    iterator = ds.make_one_shot_iterator()
    get_data_op = iterator.get_next()

    with tf.compat.v1.Session() as sess:
        a = sess.run([global_init_op])
        i = 1
        while True:
            try:
                sample = sess.run([get_data_op])

                f_category = sample[0][0]["int_feature"]
                f_list = sample[0][0]["array_feature"]
                labels_ = sample[0][1]

                if i % 100 == 0:
                    print(f"step={i}")
                i += 1

            except tf.errors.OutOfRangeError:
                break


if __name__ == '__main__':
    train()

Willing to contribute

Yes

Following the BUILD.md tutorial, something is wrong

Current behavior

When building from source with customized container image for developers, latest quay.io/pypa/manylinux_2_24_x86_64 image is pulled. The cmd line 'RUN /opt/_internal/pipx/venvs/auditwheel/bin/patch.py' needs file '/opt/_internal/pipx/venvs/auditwheel/lib/python3.9/site-packages/auditwheel/policy/manylinux-policy.json' ,but '/opt/_internal/pipx/venvs/auditwheel/lib/python3.10/.../manylinux-policy.json' found. It seems that the error is caused by the recent update of the image quay.io/pypa/manylinux_2_24_x86_64。

Expected behavior

I hope the dockfile can run successfully

System information

GPU model and memory:
OS Platform:
Docker version:
GCC/CUDA/cuDNN version:
Python/conda version:
TensorFlow/PyTorch version:

Code to reproduce

Willing to contribute

Yes

Throughput is lower than TFRecords when there are many strings in Parquets file

System information

OS Platform: Ubuntu 18.04.5 LTS
Docker version: 18.09.5
GCC version: 7.5.0
Python version: 3.6.9
TensorFlow/PyTorch version: tf1.15.5

Willing to contribute

Yes

model.summary didn't show model layers

User Story

When using hb.keras.model, the model.summary() didn't show any layer infos.
like below:

Detailed requirements

It should support summary

API Compatibility

Willing to contribute

Yes

Question: Data Loading Performance with 150G Byte/s

Hi, thanks for open source this project，it's a great job！🥂 🍻

I saw the Data Loading doc here，the ParquetDataset is to solve IO performance issues on cloud.

According to the doc, the speed of reading and decoding ParquetDataset is about 150G Byte/s (3346.10MB/21.67ms) , equals max throughput of 12X 100G bit/s NIC, it's nearly impossible on cloud（hdfs/oss/s3）.

File Format	Size (MB)	Framework	#Threads	Elapsed (ms)
CSV	11062.61	Tensorflow	1	8558.38
Parquet	3346.10	Tensorflow IO	1	103056.71
Parquet	3346.10	HybridBackend	1	397.88
Parquet	3346.10	HybridBackend	20	21.67

Is it convenient to provide details of test environment？
Apart from code of Dataset module, will HybridBackend engine code be released in the future?

Thanks 🥂 🍻

Feature Request: Support reading tabular data on Aliyun OSS.

User Story

As a recommender system engineer, I want to ParquetDataset can be used with files on OSS via S3 protocol, so that I can store click logs on various object storage systems without migration effort.

Detailed requirements

OSS should be accessed via S3 protocol.
Environment variables should be compatible with S3 support on TensorFlow.

API Compatibility

Compatible with existing API.

Willing to contribute

Yes

Row-wise shuffling required

User Story

As a AI scientist, I want to shuffle in row-wise manner during training, so that I can control shuffling in a flexible way.

Detailed requirements

It should be easy to use

API Compatibility

Willing to contribute

Yes

feature_column bucket_size is 6, use 8 gpus, then worker-5 and worker-6 'save/RestoreV2' failed

feature_column bucket_size is 6, use 8 gpus, then worker-5 and worker-6 'save/RestoreV2' failed;
backtrace:
Traceback (most recent call last):
File "neg_feedback_multi.py", line 1252, in
tf.app.run()
File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/home/pai/lib/python3.6/site-packages/absl/app.py", line 308, in run
_run_main(main, args)
File "/home/pai/lib/python3.6/site-packages/absl/app.py", line 254, in _run_main
sys.exit(main(argv))
File "neg_feedback_multi.py", line 1235, in main
model.run()
File "neg_feedback_multi.py", line 1227, in run
classifier.train_and_evaluate(train_spec, eval_spec)
File "/home/pai/lib/python3.6/site-packages/hybridbackend/tensorflow/estimator/estimator.py", line 276, in train_and_evaluate
return executor.run()
File "/home/pai/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 640, in run
getattr(self, task_to_run)()
File "/home/pai/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 650, in run_worker
return self._start_distributed_training()
File "/home/pai/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 796, in _start_distributed_training
saving_listeners=saving_listeners)
File "/home/pai/lib/python3.6/site-packages/hybridbackend/tensorflow/estimator/estimator.py", line 188, in train
saving_listeners=saving_listeners)
File "/home/pai/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 370, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/home/pai/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1161, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/home/pai/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1195, in _train_model_default
saving_listeners)
File "/home/pai/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1490, in _train_with_estimator_spec
log_step_count_steps=log_step_count_steps) as mon_sess:
File "/home/pai/lib/python3.6/site-packages/hybridbackend/tensorflow/training/session.py", line 131, in HybridBackendMonitoredTrainingSession
sess = fn(*args, **kwargs)
File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 678, in MonitoredTrainingSession
stop_grace_period_secs=stop_grace_period_secs)
File "/home/pai/lib/python3.6/site-packages/hybridbackend/tensorflow/training/session.py", line 64, in init
session_creator, hooks, should_recover=True, **kwargs)
File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 827, in init
self._sess = _RecoverableSession(self._coordinated_creator)
File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1309, in init
_WrappedSession.init(self, self._create_session())
File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1314, in _create_session
return self._sess_creator.create_session()
File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 980, in create_session
self.tf_sess = self._session_creator.create_session()
File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 733, in create_session
self._scaffold.finalize()
File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 252, in finalize
self._saver.build()
File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/training/saver.py", line 1059, in build
self._build(self._filename, build_save=True, build_restore=True)
File "/home/pai/lib/python3.6/site-packages/hybridbackend/tensorflow/training/saver.py", line 258, in _build
super()._build(*args, **kwargs)
File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/training/saver.py", line 1137, in _build
build_restore=build_restore)
File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/training/saver.py", line 660, in _build_internal
restore_sequentially, reshape)
File "/home/pai/lib/python3.6/site-packages/hybridbackend/tensorflow/training/saver.py", line 200, in _AddShardedRestoreOps
filename_tensor, per_device, restore_sequentially, reshape)
File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/training/saver.py", line 536, in _AddShardedRestoreOps
name="restore_shard"))
File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/training/saver.py", line 476, in _AddRestoreOps
restore_sequentially)
File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/training/saver.py", line 744, in bulk_restore
return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/ops/gen_io_ops.py", line 2380, in restore_v2
name=name)
File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
op_def=op_def)
File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3360, in create_op
attrs, op_def, compute_device)
File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3429, in _create_op_internal
op_def=op_def)
File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 1773, in init
control_input_ops)
File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 1613, in _create_c_op
raise ValueError(str(e))
ValueError: Expected non-negative start and positive length but got start = 6, length = 0: string = 6,0:0,10 for 'save/RestoreV2' (op: 'RestoreV2') with input shapes: [], [382], [382] and with computed input tensors: input[2] = <144150 23 108114,18018:0,23 144150 23 108114,18018:0,23 195

Using shuffle or rebatch may cause OOM problem

1. Current behavior

Using shuffle or rebatch may cause OOM problem.

1.1 小文件测试记录

total parquet file count: 15780

total parquet file size: 126G

total sample count: 600w

No.	Test Scenarios	RAM Usage
1	no shuffle no rebatch	use 50G ram stable
2	shuffle，buffer_size=2048 rebatch，batch_size=8192	start at 53G ram, and increasingly use more ram rapidly, exceed 94G ram limit within 2 minutes
3	shuffle，buffer_size=8 rebatch，batch_size=8192	increasingly use more ram slowly, exceed 94G ram limit after 2 hours
4	no shuffle rebatch，batch_size=8192	increasingly use more ram slowly, exceed 94G ram limit after 2 hours
5	shuffle，buffer_size=8 no rebatch	use 50G ram stable at first, but increasingly use more ram after 1 hour

1.2 大文件测试记录

total parquet file count: 240

total parquet file size: 126G

单个parquet文件大小：500MB

total sample count: 600w

现象：每个epoch训练结束时，有明显内存回收的过程，但是回收不干净，导致每个epoch后使用的内存峰值越来越多，最终OOM。但是如果1、2个epoch内能训练完，内存不会爆。

1.3 不同训练方式测试对比

	新sdk	旧sdk
是否内存溢出	是	否
环境	T4 单机单卡	T4 单机单卡
训练方式	session.run()	tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
调用链	见下图1	见下图2

2. Expected behavior

During the training process, tensorflow should use stable amount of RAM, not using more and more RAM.

3. System information

OS Platform and Distribution: Ubuntu 18.04.5 LTS
TensorFlow version: 1.15.0
Python version: 3.6
CUDA/cuDNN version: 10.1
RAM: 94G
GPU model and memory: Tesla T4, 16G

4. Code to reproduce

BATCH_SIZE = 8192

parquet_file_list = ['some_parquet_file1.snappy.parquet', 'some_parquet_file2.snappy.parquet', ...]
filenames_ds = tf.data.TFRecordDataset.from_tensor_slices(file_list)
hb_fields = []
hb_fields.append(hb.data.DataFrame.Field('int_field', tf.int64, ragged_rank=0))
hb_fields.append(hb.data.DataFrame.Field('float_field', tf.float32, ragged_rank=0))
hb_fields.append(hb.data.DataFrame.Field('array_field', tf.float32, ragged_rank=1))   # ... and some anthor fields


# 1. no shuffle, no rebatch
ds = filenames_ds.apply(hb.data.read_parquet(BATCH_SIZE, hb_fields, num_parallel_reads=tf.data.experimental.AUTOTUNE))

# 2. big shuffle and rebatch
ds = filenames_ds.apply(hb.data.read_parquet(BATCH_SIZE, hb_fields, num_parallel_reads=tf.data.experimental.AUTOTUNE))
ds.shuffle(2048)
ds.apply(hb.data.rebatch(BATCH_SIZE, fields=fields))

# 3. small shuffle and rebatch
ds = filenames_ds.apply(hb.data.read_parquet(BATCH_SIZE, hb_fields, num_parallel_reads=tf.data.experimental.AUTOTUNE))
ds.shuffle(8)
ds.apply(hb.data.rebatch(BATCH_SIZE, fields=fields))

# 4. no shuffle and rebatch
ds = filenames_ds.apply(hb.data.read_parquet(BATCH_SIZE, hb_fields, num_parallel_reads=tf.data.experimental.AUTOTUNE))
ds.apply(hb.data.rebatch(BATCH_SIZE, fields=fields))

# 5. small shuffle and no rebatch
ds = filenames_ds.apply(hb.data.read_parquet(BATCH_SIZE, hb_fields, num_parallel_reads=tf.data.experimental.AUTOTUNE))
ds.shuffle(8)

Willing to contribute

Yes

error: Variables not initialized: communicator/1/HbNcclCommHandleOp

Current behavior

2022-10-19 12:39:39.948019: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2022-10-19 12:39:39.948020: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
INFO:tensorflow:Parsing ../data//train.csv
INFO:tensorflow:Parsing ../data//train.csv
WARNING:tensorflow:The default value of combiner will change from "mean" to "sqrtn" after 2016/11/01.
WARNING:tensorflow:The default value of combiner will change from "mean" to "sqrtn" after 2016/11/01.
WARNING:tensorflow:The default value of combiner will change from "mean" to "sqrtn" after 2016/11/01.
WARNING:tensorflow:The default value of combiner will change from "mean" to "sqrtn" after 2016/11/01.
WARNING:tensorflow:The default value of combiner will change from "mean" to "sqrtn" after 2016/11/01.
WARNING:tensorflow:The default value of combiner will change from "mean" to "sqrtn" after 2016/11/01.
WARNING:tensorflow:The default value of combiner will change from "mean" to "sqrtn" after 2016/11/01.
WARNING:tensorflow:The default value of combiner will change from "mean" to "sqrtn" after 2016/11/01.
WARNING:tensorflow:The default value of combiner will change from "mean" to "sqrtn" after 2016/11/01.
WARNING:tensorflow:The default value of combiner will change from "mean" to "sqrtn" after 2016/11/01.
WARNING:tensorflow:The default value of combiner will change from "mean" to "sqrtn" after 2016/11/01.
WARNING:tensorflow:The default value of combiner will change from "mean" to "sqrtn" after 2016/11/01.
WARNING:tensorflow:The default value of combiner will change from "mean" to "sqrtn" after 2016/11/01.
WARNING:tensorflow:The default value of combiner will change from "mean" to "sqrtn" after 2016/11/01.
WARNING:tensorflow:The default value of combiner will change from "mean" to "sqrtn" after 2016/11/01.
WARNING:tensorflow:The default value of combiner will change from "mean" to "sqrtn" after 2016/11/01.
WARNING:tensorflow:The default value of combiner will change from "mean" to "sqrtn" after 2016/11/01.
WARNING:tensorflow:The default value of combiner will change from "mean" to "sqrtn" after 2016/11/01.
WARNING:tensorflow:The default value of combiner will change from "mean" to "sqrtn" after 2016/11/01.
WARNING:tensorflow:The default value of combiner will change from "mean" to "sqrtn" after 2016/11/01.
WARNING:tensorflow:The default value of combiner will change from "mean" to "sqrtn" after 2016/11/01.
WARNING:tensorflow:The default value of combiner will change from "mean" to "sqrtn" after 2016/11/01.
WARNING:tensorflow:The default value of combiner will change from "mean" to "sqrtn" after 2016/11/01.
WARNING:tensorflow:The default value of combiner will change from "mean" to "sqrtn" after 2016/11/01.
WARNING:tensorflow:The default value of combiner will change from "mean" to "sqrtn" after 2016/11/01.
WARNING:tensorflow:The default value of combiner will change from "mean" to "sqrtn" after 2016/11/01.
WARNING:tensorflow:The default value of combiner will change from "mean" to "sqrtn" after 2016/11/01.
WARNING:tensorflow:The default value of combiner will change from "mean" to "sqrtn" after 2016/11/01.
WARNING:tensorflow:The default value of combiner will change from "mean" to "sqrtn" after 2016/11/01.
WARNING:tensorflow:The default value of combiner will change from "mean" to "sqrtn" after 2016/11/01.
WARNING:tensorflow:The default value of combiner will change from "mean" to "sqrtn" after 2016/11/01.
WARNING:tensorflow:The default value of combiner will change from "mean" to "sqrtn" after 2016/11/01.
WARNING:tensorflow:The default value of combiner will change from "mean" to "sqrtn" after 2016/11/01.
WARNING:tensorflow:The default value of combiner will change from "mean" to "sqrtn" after 2016/11/01.
WARNING:tensorflow:The default value of combiner will change from "mean" to "sqrtn" after 2016/11/01.
WARNING:tensorflow:The default value of combiner will change from "mean" to "sqrtn" after 2016/11/01.
WARNING:tensorflow:The default value of combiner will change from "mean" to "sqrtn" after 2016/11/01.
WARNING:tensorflow:The default value of combiner will change from "mean" to "sqrtn" after 2016/11/01.
WARNING:tensorflow:The default value of combiner will change from "mean" to "sqrtn" after 2016/11/01.
WARNING:tensorflow:The default value of combiner will change from "mean" to "sqrtn" after 2016/11/01.
WARNING:tensorflow:The default value of combiner will change from "mean" to "sqrtn" after 2016/11/01.
WARNING:tensorflow:The default value of combiner will change from "mean" to "sqrtn" after 2016/11/01.
WARNING:tensorflow:The default value of combiner will change from "mean" to "sqrtn" after 2016/11/01.
WARNING:tensorflow:The default value of combiner will change from "mean" to "sqrtn" after 2016/11/01.
WARNING:tensorflow:The default value of combiner will change from "mean" to "sqrtn" after 2016/11/01.
WARNING:tensorflow:The default value of combiner will change from "mean" to "sqrtn" after 2016/11/01.
WARNING:tensorflow:The default value of combiner will change from "mean" to "sqrtn" after 2016/11/01.
WARNING:tensorflow:The default value of combiner will change from "mean" to "sqrtn" after 2016/11/01.
WARNING:tensorflow:The default value of combiner will change from "mean" to "sqrtn" after 2016/11/01.
WARNING:tensorflow:The default value of combiner will change from "mean" to "sqrtn" after 2016/11/01.
WARNING:tensorflow:The default value of combiner will change from "mean" to "sqrtn" after 2016/11/01.
WARNING:tensorflow:The default value of combiner will change from "mean" to "sqrtn" after 2016/11/01.
INFO:tensorflow:Aggregate 12 dense gradients (33.35MB) and 0 sparse gradients (0.00MB), skip 26 aggregated gradients
INFO:tensorflow:Aggregate 12 dense gradients (33.35MB) and 0 sparse gradients (0.00MB), skip 26 aggregated gradients
2022-10-19 12:39:43.135528: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3792945000 Hz
2022-10-19 12:39:43.136209: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x53d08e0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2022-10-19 12:39:43.136227: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2022-10-19 12:39:43.137613: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2022-10-19 12:39:43.147796: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3792945000 Hz
2022-10-19 12:39:43.148600: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x4190950 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2022-10-19 12:39:43.148638: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2022-10-19 12:39:43.150144: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2022-10-19 12:39:43.263791: I tensorflow/stream_executor/cuda/cuda_driver.cc:404] Cuda add device primary context 0x6a91880
2022-10-19 12:39:43.263977: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1092] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-10-19 12:39:43.264217: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x6a644c0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2022-10-19 12:39:43.264241: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): NVIDIA GeForce RTX 2080 Ti, Compute Capability 7.5
2022-10-19 12:39:43.264400: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1092] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-10-19 12:39:43.264809: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1687] Found device 0 with properties: 
name: NVIDIA GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.635
pciBusID: 0000:08:00.0
2022-10-19 12:39:43.264837: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2022-10-19 12:39:43.267560: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2022-10-19 12:39:43.267587: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2022-10-19 12:39:43.272210: I tensorflow/stream_executor/cuda/cuda_driver.cc:404] Cuda add device primary context 0x51f3c70
2022-10-19 12:39:43.272373: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1092] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-10-19 12:39:43.272643: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x51bf3f0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2022-10-19 12:39:43.272666: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): NVIDIA GeForce RTX 2080 Ti, Compute Capability 7.5
2022-10-19 12:39:43.272784: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1092] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-10-19 12:39:43.272986: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1687] Found device 0 with properties: 
name: NVIDIA GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.635
pciBusID: 0000:07:00.0
2022-10-19 12:39:43.273007: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2022-10-19 12:39:43.275711: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2022-10-19 12:39:43.275737: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2022-10-19 12:39:43.288994: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2022-10-19 12:39:43.289162: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2022-10-19 12:39:43.289498: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.11
2022-10-19 12:39:43.290069: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2022-10-19 12:39:43.290150: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2022-10-19 12:39:43.290231: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1092] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-10-19 12:39:43.290471: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1092] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-10-19 12:39:43.290643: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1815] Adding visible gpu devices: 0
2022-10-19 12:39:43.292542: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1170] Device interconnect StreamExecutor with strength 1 edge matrix:
2022-10-19 12:39:43.292558: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1176]      0 
2022-10-19 12:39:43.292564: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1189] 0:   N 
2022-10-19 12:39:43.292640: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1092] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-10-19 12:39:43.292843: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1092] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-10-19 12:39:43.293058: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1372] Created TensorFlow device (/job:worker/replica:0/task:0/device:GPU:0 with 9793 MB memory) -> physical GPU (device: 0, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:08:00.0, compute capability: 7.5)
2022-10-19 12:39:43.294188: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:258] Initialize GrpcChannelCache for job chief -> {0 -> 127.0.0.1:20001}
2022-10-19 12:39:43.294199: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:258] Initialize GrpcChannelCache for job worker -> {0 -> localhost:20002}
2022-10-19 12:39:43.294986: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:374] Started server with target: grpc://localhost:20002
2022-10-19 12:39:43.297217: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2022-10-19 12:39:43.297453: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2022-10-19 12:39:43.297814: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.11
2022-10-19 12:39:43.298428: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2022-10-19 12:39:43.298520: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2022-10-19 12:39:43.298607: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1092] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-10-19 12:39:43.298845: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1092] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-10-19 12:39:43.299025: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1815] Adding visible gpu devices: 0
2022-10-19 12:39:43.301007: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1170] Device interconnect StreamExecutor with strength 1 edge matrix:
2022-10-19 12:39:43.301024: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1176]      0 
2022-10-19 12:39:43.301031: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1189] 0:   N 
2022-10-19 12:39:43.301114: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1092] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-10-19 12:39:43.301327: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1092] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-10-19 12:39:43.301552: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1372] Created TensorFlow device (/job:chief/replica:0/task:0/device:GPU:0 with 9729 MB memory) -> physical GPU (device: 0, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:07:00.0, compute capability: 7.5)
2022-10-19 12:39:43.302687: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:258] Initialize GrpcChannelCache for job chief -> {0 -> localhost:20001}
2022-10-19 12:39:43.302705: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:258] Initialize GrpcChannelCache for job worker -> {0 -> 127.0.0.1:20002}
2022-10-19 12:39:43.303434: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:374] Started server with target: grpc://localhost:20001
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:run without loading checkpoint
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:run without loading checkpoint
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Done running local_init_op.
Using TensorFlow version 1.15.5
Checking dataset...
Numbers of training dataset is 8000000
The training steps is 100
Traceback (most recent call last):
  File "benchmark_hb.py", line 405, in <module>
    main()
  File "/usr/local/lib/python3.6/dist-packages/hybridbackend/tensorflow/training/function.py", line 144, in wrapped_fn
    return fn(*args, **kwargs)
  File "benchmark_hb.py", line 339, in main
    config=sess_config) as sess:
  File "/usr/local/lib/python3.6/dist-packages/hybridbackend/tensorflow/training/session.py", line 174, in HybridBackendMonitoredTrainingSession
    sess = fn(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 633, in MonitoredTrainingSession
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python3.6/dist-packages/hybridbackend/tensorflow/training/session.py", line 69, in __init__
    session_creator, hooks, should_recover=True, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 775, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1257, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1262, in _create_session
    return self._sess_creator.create_session()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 928, in create_session
    self.tf_sess = self._session_creator.create_session()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 697, in create_session
    init_fn=self._scaffold.init_fn)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/session_manager.py", line 323, in prepare_session
    (_maybe_name(init_op), init_fn, self._local_init_op, msg))
RuntimeError: Init operations did not make model ready.  Init op: group_deps_2, init fn: None, local_init_op: name: "group_deps_1"
op: "NoOp"
input: "^group_deps_1/NoOp"
input: "^group_deps_1/NoOp_1"
device: "/job:chief/task:0/device:GPU:0"
, error: Variables not initialized: communicator/0/HbNcclCommHandleOp
Using TensorFlow version 1.15.5
Checking dataset...
Numbers of training dataset is 8000000
The training steps is 100
Traceback (most recent call last):
  File "benchmark_hb.py", line 405, in <module>
    main()
  File "/usr/local/lib/python3.6/dist-packages/hybridbackend/tensorflow/training/function.py", line 144, in wrapped_fn
    return fn(*args, **kwargs)
  File "benchmark_hb.py", line 339, in main
    config=sess_config) as sess:
  File "/usr/local/lib/python3.6/dist-packages/hybridbackend/tensorflow/training/session.py", line 174, in HybridBackendMonitoredTrainingSession
    sess = fn(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 633, in MonitoredTrainingSession
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python3.6/dist-packages/hybridbackend/tensorflow/training/session.py", line 69, in __init__
    session_creator, hooks, should_recover=True, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 775, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1257, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1262, in _create_session
    return self._sess_creator.create_session()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 928, in create_session
    self.tf_sess = self._session_creator.create_session()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 697, in create_session
    init_fn=self._scaffold.init_fn)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/session_manager.py", line 323, in prepare_session
    (_maybe_name(init_op), init_fn, self._local_init_op, msg))
RuntimeError: Init operations did not make model ready.  Init op: group_deps_2, init fn: None, local_init_op: name: "group_deps_1"
op: "NoOp"
input: "^group_deps_1/NoOp"
input: "^group_deps_1/NoOp_1"
device: "/job:worker/task:0/device:GPU:0"
, error: Variables not initialized: communicator/1/HbNcclCommHandleOp

Expected behavior

code run well

System information

GPU model and memory: 2080Ti
OS Platform:
Docker version:
GCC/CUDA/cuDNN version: cuda 11.4
Python/conda version:
TensorFlow/PyTorch version: DeepRec, commit message: 6bca2cc4e6acaca3766e0425b53bdd

Code to reproduce

Download the train dataset(in csv format) from https://storage.googleapis.com/dataset-uploader/criteo-kaggle/large_version/train.csv
The training script

# Copyright (c) 2022 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

from tensorflow.python.framework import dtypes
import numpy as np
from ast import arg
import time
import argparse
import tensorflow as tf
import os
import sys
import math
import collections
from tensorflow.python.client import timeline
import json

from tensorflow.python.framework import sparse_tensor
from tensorflow.python.feature_column import feature_column_v2 as fc
from tensorflow.python.ops import partitioned_variables
from tensorflow.python.framework import ops
os.environ["TF_GPU_THREAD_MODE"] = "global"
import hybridbackend.tensorflow as hb

# Set to INFO for tracking training, default is WARN. ERROR for least messages
tf.logging.set_verbosity(tf.logging.INFO)
print("Using TensorFlow version %s" % (tf.__version__))

# Definition of some constants
CONTINUOUS_COLUMNS = ['I' + str(i) for i in range(1, 14)]  # 1-13 inclusive
CATEGORICAL_COLUMNS = ['C' + str(i) for i in range(1, 27)]  # 1-26 inclusive
LABEL_COLUMN = ['clicked']
TRAIN_DATA_COLUMNS = LABEL_COLUMN + CONTINUOUS_COLUMNS + CATEGORICAL_COLUMNS
FEATURE_COLUMNS = CONTINUOUS_COLUMNS + CATEGORICAL_COLUMNS
HASH_BUCKET_SIZES = {
    'C1': 2500,
    'C2': 2000,
    'C3': 300000,
    'C4': 250000,
    'C5': 1000,
    'C6': 100,
    'C7': 20000,
    'C8': 4000,
    'C9': 20,
    'C10': 100000,
    'C11': 10000,
    'C12': 250000,
    'C13': 40000,
    'C14': 100,
    'C15': 100,
    'C16': 200000,
    'C17': 50,
    'C18': 10000,
    'C19': 4000,
    'C20': 20,
    'C21': 250000,
    'C22': 100,
    'C23': 100,
    'C24': 250000,
    'C25': 400,
    'C26': 100000
}

EMBEDDING_DIMENSIONS = {
    'C1': 64,
    'C2': 64,
    'C3': 128,
    'C4': 128,
    'C5': 64,
    'C6': 64,
    'C7': 64,
    'C8': 64,
    'C9': 64,
    'C10': 128,
    'C11': 64,
    'C12': 128,
    'C13': 64,
    'C14': 64,
    'C15': 64,
    'C16': 128,
    'C17': 64,
    'C18': 64,
    'C19': 64,
    'C20': 64,
    'C21': 128,
    'C22': 64,
    'C23': 64,
    'C24': 128,
    'C25': 64,
    'C26': 128
}


def transform_numeric(feature):
    r'''Transform numeric features.
    '''
    # Notes: Statistics of Kaggle's Criteo Dataset has been calculated in advance to save time.
    mins_list = [
        0.0, -3.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0
    ]
    range_list = [
        1539.0, 22069.0, 65535.0, 561.0, 2655388.0, 233523.0, 26297.0, 5106.0,
        24376.0, 9.0, 181.0, 1807.0, 6879.0
    ]

    def make_minmaxscaler(min, range):
        def minmaxscaler(col):
            return (col - min) / range

        return minmaxscaler

    numeric_list = []

    for column_name in CONTINUOUS_COLUMNS:
        normalizer_fn = None
        i = CONTINUOUS_COLUMNS.index(column_name)
        normalizer_fn = make_minmaxscaler(mins_list[i], range_list[i])
        numeric = normalizer_fn(feature[column_name])
        numeric_list.append(tf.reshape(numeric, shape=[-1, 1]))
    return numeric_list


def transform_categorical(feature):
    r'''Transform categorical features.
    '''
    deep_features = []
    max_value = np.iinfo(dtypes.int64.as_numpy_dtype).max

    variables = []
    indices = []
    for column_name in CATEGORICAL_COLUMNS:
        ev_opt = tf.EmbeddingVariableOption(
            evict_option=None, filter_option=None)
        device_str = '/gpu'
        with tf.device(device_str), hb.scope(sharding=True):
            embedding_weights = tf.get_embedding_variable(
                f'{column_name}_weight',
                initializer=tf.random_normal_initializer(
                    mean=0.0, stddev=0.05
                ),
                embedding_dim=EMBEDDING_DIMENSIONS[column_name],
                ev_option=ev_opt
            )

        category = tf.strings.to_hash_bucket_fast(
            feature[column_name], max_value)
        sparse_tensor = fc._to_sparse_input_and_drop_ignore_values(category)
        sparse_tensor = tf.sparse.reshape(sparse_tensor, (-1, 1))
        
        deep_features.append(tf.nn.embedding_lookup_sparse(
            embedding_weights, sparse_tensor, None))
        
        variables.append(embedding_weights)
        indices.append(sparse_tensor)
    return deep_features


def stacked_dcn_v2(features, mlp_dims):
    r'''Stacked DCNv2.

    DCNv2: Improved Deep & Cross Network and Practical Lessons for Web-scale
    Learning to Rank Systems.

    See https://arxiv.org/abs/2008.13535 for more information.
    '''
    with tf.name_scope('cross'):
        cross_input = tf.concat(features, axis=-1)
        cross_input_shape = [-1, sum([f.shape[-1] for f in features])]
        cross_input = tf.reshape(cross_input, cross_input_shape)
        cross_input_sq = tf.layers.dense(
            cross_input, cross_input.shape[-1],
            activation=tf.nn.relu,
            kernel_initializer=tf.truncated_normal_initializer(),
            bias_initializer=tf.zeros_initializer())
        cross_output = cross_input * cross_input_sq + cross_input
        cross_output = tf.reshape(cross_output, [-1, cross_input.shape[1]])
        cross_output_dim = (len(features) * (len(features) + 1)) / 2

    with tf.name_scope('mlp'):
        prev_layer = cross_output
        prev_dim = cross_output_dim
        for i, d in enumerate(mlp_dims[:-1]):
            prev_layer = tf.layers.dense(
                prev_layer, d,
                activation=tf.nn.relu,
                kernel_initializer=tf.random_normal_initializer(
                    mean=0.0,
                    stddev=math.sqrt(2.0 / (prev_dim + d))),
                bias_initializer=tf.random_normal_initializer(
                    mean=0.0,
                    stddev=math.sqrt(1.0 / d)),
                name=f'mlp_{i}')
            prev_dim = d
        return tf.layers.dense(
            prev_layer, mlp_dims[-1],
            activation=tf.nn.sigmoid,
            kernel_initializer=tf.random_normal_initializer(
                mean=0.0,
                stddev=math.sqrt(2.0 / (prev_dim + mlp_dims[-1]))),
            bias_initializer=tf.random_normal_initializer(
                mean=0.0,
                stddev=math.sqrt(1.0 / mlp_dims[-1])),
            name=f'mlp_{len(mlp_dims) - 1}')


# generate dataset pipline
def build_model_input(filename, batch_size, num_epochs):
    def parse_csv(value):
        tf.logging.info('Parsing {}'.format(filename))
        cont_defaults = [[0.0] for i in range(1, 14)]
        cate_defaults = [[' '] for i in range(1, 27)]
        label_defaults = [[0]]
        column_headers = TRAIN_DATA_COLUMNS
        record_defaults = label_defaults + cont_defaults + cate_defaults
        columns = tf.io.decode_csv(value, record_defaults=record_defaults)
        all_columns = collections.OrderedDict(zip(column_headers, columns))
        labels = all_columns.pop(LABEL_COLUMN[0])
        features = all_columns
        return features, labels

    '''Work Queue Feature'''
    if args.workqueue:
        from tensorflow.python.ops.work_queue import WorkQueue
        work_queue = WorkQueue([filename])
        # For multiple files：
        # work_queue = WorkQueue([filename, filename1,filename2,filename3])
        files = work_queue.input_dataset()
    else:
        files = filename
    # Extract lines from input files using the Dataset API.
    dataset = tf.data.TextLineDataset(files)
    dataset = dataset.shuffle(buffer_size=20000,
                              seed=args.seed)  # fix seed for reproducing
    dataset = dataset.repeat(num_epochs)
    dataset = dataset.batch(batch_size)
    dataset = dataset.map(parse_csv, num_parallel_calls=28)
    dataset = dataset.prefetch(2)
    return dataset

@hb.function()
def main():
    
    # check dataset and count data set size
    print("Checking dataset...")
    train_file = args.data_location + '/train.csv'
    if (not os.path.exists(train_file)):
        print("Dataset does not exist in the given data_location.")
        sys.exit()
    no_of_training_examples = sum(1 for line in open(train_file))
    print("Numbers of training dataset is {}".format(no_of_training_examples))

    # set batch size, eporch & steps
    batch_size = args.batch_size

    if args.steps == 0:
        no_of_epochs = 1
        train_steps = math.ceil(
            (float(no_of_epochs) * no_of_training_examples) / batch_size)
    else:
        no_of_epochs = math.ceil(
            (float(batch_size) * args.steps) / no_of_training_examples)
        train_steps = args.steps
    print("The training steps is {}".format(train_steps))

    # set fixed random seed
    tf.set_random_seed(args.seed)

    # create data pipline of train & test dataset
    with tf.device('/cpu:0'):
        train_dataset = build_model_input(train_file, batch_size, no_of_epochs)

        iterator = tf.data.Iterator.from_structure(train_dataset.output_types,
                                                train_dataset.output_shapes)
        next_element = iterator.get_next()

    train_init_op = iterator.make_initializer(train_dataset)

    # create feature column
    feature, labels = next_element[0], next_element[1]

    deep_features = transform_categorical(feature)
    wide_features = transform_numeric(feature)
    logits = stacked_dcn_v2(features=deep_features + wide_features,
                            mlp_dims=[1024, 1024, 512, 256, 1]
                            )
    loss = tf.reduce_mean(tf.keras.losses.binary_crossentropy(tf.reshape(labels, (-1, 1)), logits))

    step = tf.train.get_or_create_global_step()
    opt = tf.train.AdagradOptimizer(learning_rate=0.01)
    train_op = opt.minimize(loss, global_step=step)

    # Session config
    sess_config = tf.ConfigProto()

    # # Session hooks
    hooks = []

    # if args.smartstaged and not args.tf:
    #     '''Smart staged Feature'''
    #     next_element = tf.staged(next_element, num_threads=4, capacity=40)
    #     sess_config.graph_options.optimizer_options.do_smart_stage = True
    #     hooks.append(tf.make_prefetch_hook())
    # if args.op_fusion and not args.tf:
    #     '''Auto Graph Fusion'''
    #     sess_config.graph_options.optimizer_options.do_op_fusion = True
    # if args.micro_batch and not args.tf:
    #     '''Auto Mirco Batch'''
    #     sess_config.graph_options.optimizer_options.micro_batch_num = args.micro_batch

    scaffold = tf.train.Scaffold(
        local_init_op=tf.group(
            tf.local_variables_initializer(), train_init_op),
    )

    stop_hook = tf.train.StopAtStepHook(last_step=train_steps)
    log_hook = tf.train.LoggingTensorHook(
        {
            'steps': step,
            'loss': loss,
        }, every_n_iter=1)
    hooks.append(stop_hook)
    hooks.append(log_hook)

    with tf.train.MonitoredTrainingSession(
            master='',
            hooks=hooks,
            scaffold=scaffold,
            config=sess_config) as sess:
        while not sess.should_stop():
            print(sess.run([feature]))
            sess.run([loss, train_op])
    print("Training completed.")


def boolean_string(string):
    low_string = string.lower()
    if low_string not in {'false', 'true'}:
        raise ValueError('Not a valid boolean string')
    return low_string == 'true'


# Get parse
def get_arg_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument('--data_location',
                        help='Full path of train data',
                        required=False,
                        default='./data')
    parser.add_argument('--steps',
                        help='set the number of steps on train dataset',
                        type=int,
                        default=0)
    parser.add_argument('--batch_size',
                        help='Batch size to train. Default is 512',
                        type=int,
                        default=512)
    parser.add_argument('--seed',
                        help='set the random seed for tensorflow',
                        type=int,
                        default=2021)
    parser.add_argument('--workqueue',
                        help='Whether to enable Work Queue. Default to False.',
                        type=boolean_string,
                        default=False)
    return parser


# Some DeepRec's features are enabled by ENV.
# This func is used to set ENV and enable these features.
# A triple quotes comment is used to introduce these features and play an emphasizing role.
def set_env_for_DeepRec():
    '''
    Set some ENV for these DeepRec's features enabled by ENV. 
    More Detail information is shown in https://deeprec.readthedocs.io/zh/latest/index.html.
    START_STATISTIC_STEP & STOP_STATISTIC_STEP: On CPU platform, DeepRec supports memory optimization
        in both stand-alone and distributed trainging. It's default to open, and the 
        default start and stop steps of collection is 1000 and 1100. Reduce the initial 
        cold start time by the following settings.
    MALLOC_CONF: On CPU platform, DeepRec can use memory optimization with the jemalloc library.
        Please preload libjemalloc.so by `LD_PRELOAD=./libjemalloc.so.2 python ...`
    '''
    os.environ['START_STATISTIC_STEP'] = '100'
    os.environ['STOP_STATISTIC_STEP'] = '110'
    os.environ['MALLOC_CONF'] = \
        'background_thread:true,metadata_thp:auto,dirty_decay_ms:20000,muzzy_decay_ms:20000'


if __name__ == '__main__':
    parser = get_arg_parser()
    args = parser.parse_args()

    set_env_for_DeepRec()

    main()

Training command:

python -m hybridbackend.run  python benchmark_hb.py --data_location ../data/ --steps 100

Willing to contribute

Yes

ParquetDataset benchmark add tfrecord data

Summary

the benchmark for ParquetDataset is only compare ParquetDataset vs TextLineDataset. add TFRecordDataset, please.

Installation environment

GPU model and memory:
OS Platform:
Docker version:
GCC/CUDA/cuDNN version:
Python/conda version:
TensorFlow/PyTorch version:

Willing to contribute

Yes

ParquetDataset should be able to skip corrupted data

User Story

As a AI infra engineer, I want to skip corrupted data for reading Parquet files, so that training would not be stopped by accidental upstream errors.

Detailed requirements

It should be turned off by default

API Compatibility

Willing to contribute

Yes

init_from_checkpoint throw Exception when using hb.keras.Model

Current behavior

python code like:

tf.feature_column.embedding_column(
      tf.feature_column.categorical_column_with_identity(
          key=offset_name,
          num_buckets=dict_size,
          default_value=0),
      combiner='sqrtn',
      dimension=dimension,
      initializer=glorot_uniform(),
      ckpt_to_load_from=ckpt_to_load_from,
      tensor_name_in_ckpt=tensor_name_in_ckpt

It throw Exception 'tuple' object has no attribute 'is_compatible_with' when ckpt_to_load_from was setted.
It's OK when ckpt_to_load_from and tensor_name_in_ckpt are None

Willing to contribute

Yes

Sync training with ParquetDataset, Use PS-Worker，The system may block because some worker stop early.

User Story

Sync training with ParquetDataset, Use PS-Worker, then some Worker may stop early，then other Workers may block because PS do not gather enough gradients and can not continue update sync_token continue;

Deeprec hangs in distributed mode.

Current behavior

In distributed mode, deeprec works fine when training on one hour of data, but hangs when training on one day or more. Log：

Nvidia-smi:

cpu:

Expected behavior

Deeprec works fine in distributed mode. Log:

System information

GPU model and memory: Two GPU devices： Tesla T4 . Memory: 15109MiB
OS Platform: x86_64 x86_64 x86_64 GNU/Linux
Docker version: Docker version 20.10.8, build 3967b7d
GCC/CUDA/cuDNN version: CUDA 11.4 /cuDnn 8
Python/conda version: python3.6
TensorFlow/PyTorch version: DeepRec deeprec2302, HybridBackend a832b4e

Code to reproduce

    sess_config = tf.ConfigProto(
        # If the device you specify doesn't exist, allow TF to assign the device automatically
        allow_soft_placement=True,
        log_device_placement=False,  # Whether to print the device assignment log
    )
    sess_config.gpu_options.force_gpu_compatible = True
    sess_config.gpu_options.allow_growth = True

    with tf.train.MonitoredTrainingSession(master="", checkpoint_dir=self.__ckpt_dir, config=sess_config):

Willing to contribute

Yes

hb.keras.Model.load_weight support logging details

User Story

When I'm using hb.keras.Model.load_weight(ckpt_path,scope)
It just tell me skipped op details.
I want to know how many op the scope matched and how many weights loaded successfully. The total num and detail names.

Detailed requirements

It should be support a new param like verbose
verbose=0: only print skipped weights-names
verbose=1: 0+ print num(scope matched weights-names) + num(loaded successfully weights-names)
verbose=2: 0+ print scope matched weights-names+ loaded successfully weights-names

API Compatibility

hb.keras.Model.load_weight

Willing to contribute

Yes

Dataset iterator can't be warpped in the hybridBackend scope

Current behavior

I am using hybridBackend to do data parallelism, I create a dataset and make it an iterator, when I use hybridBackend scope to wrap the whole pipeline, an exception occurred after the iterator step, here is the error log:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/tensor_util.py", line 324, in _AssertCompatible
    fn(values)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/tensor_util.py", line 276, in _check_not_tensor
    _ = [_check_failed(v) for v in nest.flatten(values)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/tensor_util.py", line 277, in <listcomp>
    if isinstance(v, ops.Tensor)]
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/tensor_util.py", line 248, in _check_failed
    raise ValueError(v)
ValueError: Tensor("Iterator_1/Identity:0", shape=(?,), dtype=int64, device=/job:chief/task:0/device:GPU:0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "demo.py", line 332, in <module>
    app.run(runner)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "demo.py", line 213, in runner
    features, labels = datasource.iter.get_next()
  File "/usr/local/lib/python3.6/dist-packages/hybridbackend/tensorflow/data/iterators.py", line 120, in get_next
    DataSyncRewriting.accept(should_stop)
  File "/usr/local/lib/python3.6/dist-packages/hybridbackend/tensorflow/data/iterators.py", line 169, in accept
    should_stop = math_ops.cast(should_stop, dtypes.int32)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/dispatch.py", line 180, in wrapper
    return target(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/math_ops.py", line 702, in cast
    x = ops.convert_to_tensor(x, name="x")
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1184, in convert_to_tensor
    return convert_to_tensor_v2(value, dtype, preferred_dtype, name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1242, in convert_to_tensor_v2
    as_ref=False)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1297, in internal_convert_to_tensor
    ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/constant_op.py", line 286, in _constant_tensor_conversion_function
    return constant(v, dtype=dtype, name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/constant_op.py", line 227, in constant
    allow_broadcast=True)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/constant_op.py", line 265, in _constant_impl
    allow_broadcast=allow_broadcast))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/tensor_util.py", line 449, in make_tensor_proto
    _AssertCompatible(values, dtype)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/tensor_util.py", line 328, in _AssertCompatible
    raise TypeError("List of Tensors when single Tensor expected")
TypeError: List of Tensors when single Tensor expected

Expected behavior

System information

GPU model and memory: Tesla P100
OS Platform: Ubuntu 18.04
Docker version: Docker Engine - Community Version: 20.10.14
GCC/CUDA/cuDNN version:
Python/conda version: Python 3.6.9
TensorFlow/PyTorch version: TensorFlow:DeepRec2208

Code to reproduce

import numpy as np
import pandas as pd

new_dtypes = {"uid": np.int64, "packagename": np.int64, "label_play": np.float64}

train_df = pd.DataFrame(np.random.randint(0, 100, (5, 3)), columns=['uid', 'packagename', 'label_play'])
train_df = train_df.astype(new_dtypes)
train_df.to_parquet('train.parquet')

import tensorflow as tf
import hybridbackend.tensorflow as hb
from hybridbackend.tensorflow.data import ParquetDataset
from tensorflow.python.data.ops import dataset_ops
from tensorflow.python.data.experimental.ops.dataframe import to_sparse



def parquet_map(record):
    for key in record:
        record[key] = tf.reshape(record[key], [-1])
    label = record.pop("label_play")
    return record, label


# Create model
def neural_net(features):
    with tf.device("/CPU:0"):
        var = tf.get_embedding_variable(
            "var_0",
            embedding_dim=3,
            initializer=tf.ones_initializer(tf.float32),
            partitioner=tf.fixed_size_partitioner(num_shards=4),
        )

    emb = tf.nn.embedding_lookup(var, features["uid"])
    fun = tf.multiply(emb, 2.0, name="multiply")
    loss = tf.reduce_sum(fun, name="reduce_sum")
    opt = tf.train.AdagradOptimizer(0.1)

    g_v = opt.compute_gradients(loss)
    train_op = opt.apply_gradients(g_v)
    return train_op, loss


with hb.scope():
    with tf.device("/cpu:0"):
        dataset = tf.data.Dataset.list_files(["train.parquet"])
        dataset = dataset.apply(
            tf.data.experimental.parallel_interleave(
                lambda tmp_file: ParquetDataset(
                    tmp_file,
                    drop_remainder=True,
                    batch_size=2,
                    num_parallel_reads=1,
                    fields=[
                        hb.data.DataFrame.Field("uid", tf.int64, ragged_rank=0),
                        hb.data.DataFrame.Field("packagename", tf.int64, ragged_rank=0),
                        hb.data.DataFrame.Field("label_play", tf.float64, ragged_rank=0),
                    ],
                ).apply(
                    to_sparse()
                ),
                cycle_length=1,
                block_length=1,
            )
        )
        dataset = dataset.batch(2, drop_remainder=True,).map(
            map_func=parquet_map,
            num_parallel_calls=dataset_ops.AUTOTUNE,
        )
    
    iterator = dataset.make_one_shot_iterator()
    # iterator = tf.data.make_one_shot_iterator(dataset)
    features, labels = iterator.get_next()

    train_op, loss = neural_net(features)

    scaffold = tf.train.Scaffold(
        init_op=tf.group(
            tf.global_variables_initializer(),
        ),
    )

    with tf.train.MonitoredTrainingSession(
        master="", scaffold=scaffold) as mon_sess:
        while not mon_sess.should_stop():
            _, ev = mon_sess.run([train_op, loss])
            print(ev)

Willing to contribute

Yes

the EarlyStopping callback not working well on multi worker distribute training job

Current behavior

If there is only one worker ,training with EarlyStopping callback is ok. When multi workers with EarlyStopping callback doing distribute training, all workers will be hanging and waiting for synchronizing.

Expected behavior

I want the EarlyStopping callback works well not only on one worker task but also on multi workers distribute training job.

System information

GPU model and memory:
OS Platform:
Docker version:
GCC/CUDA/cuDNN version:
Python/conda version:
TensorFlow/PyTorch version:

Code to reproduce

....
callbacks_list.append(EarlyStopping(monitor="val_loss",
min_delta=self.ctx.min_delta,
patience=self.ctx.patience,
verbose=verbose,
mode="min",
baseline=None,
restore_best_weights=True)
)

....

keras_model.fit(
x=None,
y=None,
validation_data=valid_ds,
steps_per_epoch=self.ctx.steps_per_epoch,
validation_steps=self.ctx.valid_steps_per_epoch,
epochs=self.ctx.callback_num,
callbacks=callbacks_list,
checkpoint_dir=self.ctx.model_save_path,
keep_checkpoint_max=1,
verbose=0)

Willing to contribute

Yes

What version of snappy should I install for building HB from source?

Summary

I am trying to build HB from source, when i use the make -j8 command from the work dir, i get the following error :

(base) root@recall-gpu-01:/HybridBackend# make -j8
mkdir -p /root/projects/tmp/HybridBackend/arrow/build/
ARROW_INSTALL=/root/projects/tmp/HybridBackend/arrow/dist \
ARROW_BUILD=/root/projects/tmp/HybridBackend/arrow/build \
ARROW_OSX_TARGET= \
USE_CXX11_ABI=0 \
WITH_ARROW_HDFS=ON \
WITH_ARROW_S3=ON \
SIMD_LEVEL=AVX2 \
OS=Linux \
bash arrow/build.sh
-- Building using CMake version: 3.16.3
-- Arrow version: 5.0.0 (full: '5.0.0')
-- Arrow SO version: 500 (full: 500.0.0)
-- clang-tidy not found
-- clang-format not found
-- Could NOT find ClangTools (missing: CLANG_FORMAT_BIN CLANG_TIDY_BIN)
-- infer not found
-- Found cpplint executable at /root/projects/tmp/HybridBackend/arrow/src/cpp/build-support/cpplint.py
-- System processor: x86_64
-- Arrow build warning level: PRODUCTION
Using ld linker
Configured for RELEASE build (set with cmake -DCMAKE_BUILD_TYPE={release,debug,...})
-- Build Type: RELEASE
-- Using CONDA approach to find dependencies
-- Using CONDA_PREFIX for ARROW_PACKAGE_PREFIX: /root/miniconda3
-- Setting (unset) dependency *_ROOT variables: /root/miniconda3
-- ARROW_ABSL_BUILD_VERSION: 0f3bb466b868b523cf1dc9b2aaaed65c77b28862
-- ARROW_AWSSDK_BUILD_VERSION: 1.8.133
-- ARROW_AWS_CHECKSUMS_BUILD_VERSION: v0.1.10
-- ARROW_AWS_C_COMMON_BUILD_VERSION: v0.5.10
-- ARROW_AWS_C_EVENT_STREAM_BUILD_VERSION: v0.1.5
-- ARROW_BOOST_BUILD_VERSION: 1.75.0
-- ARROW_BROTLI_BUILD_VERSION: v1.0.9
-- ARROW_BZIP2_BUILD_VERSION: 1.0.8
-- ARROW_CARES_BUILD_VERSION: 1.17.1
-- ARROW_GBENCHMARK_BUILD_VERSION: v1.5.2
-- ARROW_GFLAGS_BUILD_VERSION: v2.2.2
-- ARROW_GLOG_BUILD_VERSION: v0.4.0
-- ARROW_GRPC_BUILD_VERSION: v1.35.0
-- ARROW_GTEST_BUILD_VERSION: 1.10.0
-- ARROW_JEMALLOC_BUILD_VERSION: 5.2.1
-- ARROW_LZ4_BUILD_VERSION: v1.9.3
-- ARROW_MIMALLOC_BUILD_VERSION: v1.7.2
-- ARROW_ORC_BUILD_VERSION: 1.6.6
-- ARROW_PROTOBUF_BUILD_VERSION: v3.14.0
-- ARROW_RAPIDJSON_BUILD_VERSION: 1a803826f1197b5e30703afe4b9c0e7dd48074f5
-- ARROW_RE2_BUILD_VERSION: 2021-02-02
-- ARROW_SNAPPY_BUILD_VERSION: 1.1.8
-- ARROW_THRIFT_BUILD_VERSION: 0.13.0
-- ARROW_THRIFT_BUILD_MD5_CHECKSUM: 38a27d391a2b03214b444cb13d5664f1
-- ARROW_UTF8PROC_BUILD_VERSION: v2.6.1
-- ARROW_XSIMD_BUILD_VERSION: e9234cd6e6f4428fc260073b2c34ffe86fda1f34
-- ARROW_ZLIB_BUILD_VERSION: 1.2.11
-- ARROW_ZSTD_BUILD_VERSION: v1.5.0
-- Boost include dir: /usr/include
-- Boost libraries: Boost::system;Boost::filesystem
CMake Error at /usr/share/cmake-3.16/Modules/FindPackageHandleStandardArgs.cmake:146 (message):
  Could NOT find Snappy (missing: Snappy_LIB Snappy_INCLUDE_DIR)
Call Stack (most recent call first):
  /usr/share/cmake-3.16/Modules/FindPackageHandleStandardArgs.cmake:393 (_FPHSA_FAILURE_MESSAGE)
  cmake_modules/FindSnappy.cmake:55 (find_package_handle_standard_args)
  cmake_modules/ThirdpartyToolchain.cmake:235 (find_package)
  cmake_modules/ThirdpartyToolchain.cmake:948 (resolve_dependency)
  CMakeLists.txt:515 (include)


-- Configuring incomplete, errors occurred!
See also "/root/projects/tmp/HybridBackend/arrow/build/CMakeFiles/CMakeOutput.log".
See also "/root/projects/tmp/HybridBackend/arrow/build/CMakeFiles/CMakeError.log".
make: *** [arrow/Makefile:8: /root/projects/tmp/HybridBackend/arrow/build/install_manifest.txt] Error 1

I have install libsnappy-dev and can found it from /usr/local/include/snappy.h and /usr/local/lib/libsnappy.a, but that error still exists, so how should I install the correct snappy version for building HB?

I also tryed the docker images from registry.cn-shanghai.aliyuncs.com/pai-dlc/hybridbackend:developer-tensorflow1.15-manylinux_2_27-py3.6-cu114, same error exists.

Installation environment

GPU model and memory:
OS Platform: "20.04.3 LTS (Focal Fossa)"
Docker version:
GCC/CUDA/cuDNN version: 11.4
Python/conda version: Python 3.8.10/ conda 4.10.3
TensorFlow/PyTorch version: 1.15.5+deeprec2201

Willing to contribute

Yes

EmbeddingLookupRewritingForDeepRecEV Add "part0" to op-name twice

Current behavior

One node name should be "XX/embedding_weights/part_0"，but now it is "XX/embedding_weights/part_0/part_0"

Expected behavior

Remove EmbeddingLookupRewritingForDeepRecEV.build_unsharded_weights rename func like

    if name.endswith('/part_0'):
        return fn(name, *args, **kwargs)
    else:
        return fn(f'{name}/part_0', *args, **kwargs)

It should support mutilple partition like "part_1X"

Willing to contribute

Yes

tf.keras.layers.DenseFeatures api as the candidate of hb.feature_column.DenseFeatures can not work with tf.feature_column.shared_embedding_columns

Current behavior

HB version: HybridBackend 0.7.0-e277c15f3843f98901f0795bc9b7d0768056d5a3; tf1.15.5-v1.15.5+nv22.06-0-g55be3962f8; g++ 7.5.0; CUDA 11.4 (70,75,80,86)
new hb package removes the hb.feature_column.DenseFeatures api. However, tf.keras.layers.DenseFeatures api can not deal with
tf.feature_column.shared_embedding_columns.

Expected behavior

I want a new way to run the code successfully without hb.feature_column.DenseFeatures.

System information

GPU model and memory:
OS Platform:
Docker version:
GCC/CUDA/cuDNN version:
Python/conda version:
TensorFlow/PyTorch version:

Code to reproduce

Willing to contribute

Yes

hb.data.ParquetDataset support Case-sensitive-fields

User Story

I want to read data using hb.data.ParquetDataset and feed field param using my column names from other place.
When I use hb.data.ParquetDataset(filenames=[My parquet files],fields = [My custom column names]), It throws f'Field {f} is not found in the parquet file {filename}'

Detailed requirements

Hive stores the schema of a table in all lowercase. It makes input-param (fields) maybe not in parquet files.
Like "AGE" in fields but "age" in parquet schema. In this case , should do auto fields transform like below:

schema_ds = ParquetFile(filenames[0])
schema_names = schema_ds.schema_arrow.names
def _fix_name(name):
      if name not in schema_names and name.lower() in schema_names:
            return (name.lower(),1)
      else:
            return (name,0)
results_fields = [_fix_name(n) for n in fields]
fixed_fields = [n[0] for n in results_fields]
changed_fields = [n[0] for n in results_fields if n[1]==1]
print("changed-fileds:%s" % changed_fields)

API Compatibility

hb.data.ParquetDataset

Feature Request: Supports prefetching data to GPU

User Story

As a recommender system engineer, I want to read large batch of tabular data on GPU efficiently, so that training performance of large deep recommenders can be improved.

Detailed requirements

It should be easy to use with TensorFlow Dataset API

API Compatibility

Only new APIs should be introduced.

Willing to contribute

Yes

merge embedding table

Too many alltoall op to performance degradation

hb.keras.model evaluate error

Current behavior

hb.keras.model evaluate report error like:

Traceback (most recent call last):
File "models/recommend/MultiTowerDnn.py", line 100, in
base.run(MultiTowerDnn)
File "/var/workspace/models/recommend/base.py", line 331, in run
scores = m.evaluate(valid_ds, verbose=0, steps=None)
File "/usr/local/lib/python3.6/dist-packages/hybridbackend/tensorflow/keras/model.py", line 609, in evaluate
super().evaluate(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training.py", line 832, in evaluate
use_multiprocessing=use_multiprocessing)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_arrays.py", line 704, in evaluate
callbacks=callbacks)
File "/usr/local/lib/python3.6/dist-packages/hybridbackend/tensorflow/keras/model.py", line 345, in wrapped_model_iteration
self._save_best_mode):
File "/usr/local/lib/python3.6/dist-packages/hybridbackend/tensorflow/keras/model.py", line 147, in init
if 'acc' in self._monitor or self._monitor.startswith('fmeasure'):
TypeError: argument of type 'NoneType' is not iterable

Willing to contribute

Yes

Multi worker jobs run error under hb.keras api

Current behavior

current new hb: HybridBackend 0.7.0-3b4195c6ea0160ec9ac8090f21742c9bb9b9d5ba; tf1.15.5-v1.15.5+nv22.06-0-g55be3962f8; g++ 7.5.0; CUDA 11.4 (70,75,80,86) .
Some op in Multi worker jobs run error under hb.keras api . One worker job runs correctly.

Expected behavior

I want the multi worker job runs correctly whatever kind of hb api.

System information

GPU model and memory:
OS Platform:
Docker version:
GCC/CUDA/cuDNN version:
Python/conda version:
TensorFlow/PyTorch version:

Code to reproduce

Willing to contribute

Yes

Feature Request: Support resizing batches from tabular data.

User Story

As a data scientist, I want shuffle samples in a smaller batch size and then use normal batch size in next actions, so that model quality might be improved.
As a big data engineer, I want reading small Parquet files in large batch size, so that flexibility of data management can be improved.

Detailed requirements

It should requires minimal data movements.
it should be flexible to use in complex data pipeline.

API Compatibility

Only new APIs should be introduced.

Willing to contribute

Yes

hb.keras.Model's fit() func support dataset multiple labels

User Story

I want to train model using multiple labels data, but fit function throw Exception like
Error when checking model target:expected no data....

When the dataset only contain one label, It's OK.

Detailed requirements

The dataset like this,the labels maybe tuple or dict:

ds = hb.data.ParquetDataset(XXX)
def map_fn(batch):
    labels = tuple([batch[l] for l in labels])
    features = {}
    #pass
    return features, labels
ds = ds.map(map_fn)

The fit() like this:

m.fit(
    x=train_ds,
    validation_data=valid_ds,
    #XXX
    verbose=0)

I wish fit() can support dataset like above.

API Compatibility

hb.keras.Model's fit()

Willing to contribute

Yes

Feature Request: Support hybrid parallelism.

User Story

As a recommender system engineer, I want to training large recommenders with both replicated variables and sharded variables using collective communication, so that bandwidth utilization can be improved in modern deep learning infrastructure esp. with NVLINK support.

Detailed requirements

It should be able to support various implementation of collective communication (NCCL, ACCL).
It should consume as little space in GPU memory as possible.
It should not block other CUDA kernels with latest CUDA SDK.
It should support half precision communication.
It should be able to tune with hybper parameters.

API Compatibility

Compatible with existing API.

Willing to contribute

Yes

How to pad a column to specific size when using hb.data.ParquetDataset ?

User Story

Since the hb.data.ParquetDataset api can deal with variable-length sequence feature, the pad function can only pad to the max value length. If i want to pad to specific size , some transform need to be done.

Detailed requirements

Can the Pad function support padding to a specific value.

API Compatibility

Willing to contribute

Yes

Support Keras load_weights to do model transfer or inherit

User Story

Sometime I want to load embedding/Dense layer's weights from other models. The model may be same or not.
In keras it could be done by save_weights and load_weights.
In hb, I want similar funcs.

Detailed requirements

It should be like this:
It should support 3 key funcs:

by_name: model or layer can load weights from ckpt/h5/pb file by op-name
skip_mismatch: when finding same name but different shape or elements, just warning and no broke
report: print the layer name when it load weights success

API Compatibility

Model or Layer object

Willing to contribute

Yes

hybridbackend 0.6.0a2 version raise ValueError when ParquetDataset wrapped by parallel_interleave ops

Current behavior

When hb.data.ParquetDataset wrapped by tf.data.experimental.parallel_interleave ops, here is a ValueError: Field xxx (dtype=unkown, shape=()) is incomplete, please specify dtype and ragged_rank

Expected behavior

hb.data.ParquetDataset wrapped by tf.data.experimental.parallel_interleave ops works as normally as hybridbackend-0.6.0a1 version .

System information

GPU model and memory:Tesla T4 16G
OS Platform: Ubuntu 18.04.5 LTS
Docker version: 20.10.14
GCC/CUDA/cuDNN version: gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)/CUDA Version: 11.4.2/cuDNN 8
Python/conda version:
TensorFlow/PyTorch version: 1.15.5+deeprec2201
HybridBackend version: '0.6.0a2'

Code to reproduce

import tensorflow as tf
import hybridbackend.tensorflow as hb
from tensorflow.python.data.ops import dataset_ops


def make_initializable_iterator(ds):
  r"""Wrapper of make_initializable_iterator.
  """
  if hasattr(dataset_ops, 'make_initializable_iterator'):
    return dataset_ops.make_initializable_iterator(ds)
  return ds.make_initializable_iterator()


def parquet_map(record):
  label = record.pop('label_play')
  return record, label


# Read from a parquet file.
dataset = tf.data.Dataset.list_files([
    'part-00000-d07256ce-4685-4d6c-a9ab-b507ffef206e-c000.snappy.parquet'
],
                                     seed=1)
dataset = dataset.apply(
    tf.data.experimental.parallel_interleave(
        lambda x: hb.data.ParquetDataset(
            x,
            # drop_remainder=True,
            batch_size=4,
            num_parallel_reads=1,
            fields=[
                hb.data.DataFrame.Field('uid', tf.int64),
                hb.data.DataFrame.Field('packagename', tf.int64, ragged_rank=0),
                hb.data.DataFrame.Field('recent_play_3', tf.int64, ragged_rank=1),
                hb.data.DataFrame.Field('label_play', tf.float64),
            ],
        ),
        cycle_length=1,
        block_length=1,
    ))
ds = dataset.prefetch(4)

iterator = make_initializable_iterator(ds)
sess_config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=False)

with tf.Session(config=sess_config) as sess:
  sess.run(iterator.initializer)
  for i in range(1):
    feature = sess.run(iterator.get_next())
    print(feature)

You can download the toy dataste from here

Willing to contribute

Yes

to_sparse failed for Value with ragged_rank > 1 read from parquet file

Current behavior

when hb read some nested lists with ragged_rank > 1，the read Value cannot be transformed to SparseTensor by function hb.data.to_sparse.

For example:
dense_feature is one of the features read by hb.data.ParquetDataset, and to_sparse does not work for it.

Moreover, if I swap the order of the two nested_row_splits, then it can be to_sparse.

So maybe the order of the nested_row_splits when reading parquet file is incorrect?

Expected behavior

the Value read from parquet file can be transformed to SparseTensor.

System information

GPU model and memory: No
OS Platform: Ubuntu
Docker version: No
GCC/CUDA/cuDNN version: 7.4/No/No
Python/conda version:3.6.13/4.13.0
TensorFlow/PyTorch version:1.14.0

Code to reproduce

import tensorflow as tf
import hybridbackend.tensorflow as hb
dataset = hb.data.ParquetDataset("test2.zstd.parquet", batch_size=1)
dataset = dataset.apply(hb.data.to_sparse())
iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()
sess = tf.Session()
vals = sess.run(next_element)

# One more simple demo:
import tensorflow as tf
import hybridbackend.tensorflow as hb
val = hb.data.dataframe.DataFrame.Value(values = np.array([1,2,3,4,5]), nested_row_splits=(np.array([0,1,3,4,5]), np.array([0,2,4])))
sess = tf.Session()
sess.run(val.to_sparse())

Willing to contribute

Yes

Feature Request: Support reading large batch of tabular data from Parquet files efficiently.

User Story

As a recommender system engineer, I want to read large batch of tabular data from Parquet files efficiently, so that training performance of large deep recommenders can be improved.

Detailed requirements

It should be easy to work with existing Dataset based data pipeline.
It should be optimized for extra large batch size, and utilize features of Parquet format, e.g. column selection, batch reading , and row group filtering.
It should be compatible with vanilla TensorFlow >= 1.14 < 2.0 .

API Compatibility

Only new APIs should be introduced.

Willing to contribute

Yes

alibaba / hybridbackend Goto Github PK

hybridbackend's Introduction

HybridBackend

Features

Usage

Install

Method 1: Install from PyPI

Method 2: Build from source

License

Community

Contact Us

hybridbackend's People

Contributors

Stargazers

Watchers

Forkers

hybridbackend's Issues

User Story

Detailed requirements

API Compatibility

Willing to contribute

Current behavior

Expected behavior

System information

Code to reproduce

Willing to contribute

Current behavior

Expected behavior

System information

Code to reproduce

Willing to contribute

User Story

Detailed requirements

API Compatibility

Willing to contribute

User Story

Detailed requirements

API Compatibility

Willing to contribute

User Story

Detailed requirements

API Compatibility

Willing to contribute

User Story

API Compatibility

Willing to contribute

User Story

Detailed requirements

API Compatibility

Willing to contribute

User Story

API Compatibility

Current behavior

Expected behavior

System information

Code to reproduce

Willing to contribute

Current behavior

Expected behavior

System information

Code to reproduce

Willing to contribute

Current behavior

Expected behavior

System information

Code to reproduce

Willing to contribute

User Story

Detailed requirements

API Compatibility

Current behavior

Expected behavior

System information

Code to reproduce

Willing to contribute

Current behavior

Expected behavior

System information

Code to reproduce

Willing to contribute