changqi1 / deeprec

This project forked from deeprec-ai/deeprec


DeepRec is a recommendation engine based on TensorFlow.

License: Apache License 2.0

Starlark 2.43% Shell 0.49% Batchfile 0.02% Python 33.00% Dockerfile 0.05% CMake 0.14% Makefile 0.07% HTML 3.04% C++ 55.94% Cuda 0.13% Jupyter Notebook 1.89% C 0.58% MLIR 1.32% SWIG 0.11% Cython 0.01% LLVM 0.01% Java 0.57% Objective-C 0.06% Objective-C++ 0.14% Ruby 0.01%

deeprec's People

Contributors

aaroey, alextp, allenlavoie, andrewharp, annarev, asimshankar, benoitsteiner, caisq, ebrevdo, ezhulenev, facaiy, feihugis, gunan, hawkinsp, ilblackdragon, jdduke, jsimsa, liutongxuan, markdaoust, martinwicke, mihaimaruseac, mrry, nouiz, petewarden, rohan100jain, skye, tensorflower-gardener, terrytangyuan, yifeif, yongtang


deeprec's Issues

[Graph][Optimization]split+concat fusion to improve performance

split+concat fusion optimization
Goal
Optimize performance through split+concat fusion

Problem Description
In some recommendation models, there is a potential performance gain from fusing split and concat.

The steps to reproduce the performance issue will be added later.
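
For illustration, a minimal sketch of the back-to-back Split/Concat pattern this fusion would target (shapes are made up; assumes DeepRec's TF1-style graph mode):

import numpy as np
import tensorflow as tf

# A feature tensor is split along the feature axis and immediately concatenated
# back; the adjacent Split/Concat pair is the fusion candidate.
x = tf.placeholder(tf.float32, shape=[None, 1024])
parts = tf.split(x, num_or_size_splits=8, axis=1)
y = tf.concat(parts, axis=1)

with tf.Session() as sess:
    out = sess.run(y, feed_dict={x: np.random.rand(512, 1024).astype(np.float32)})
    print(out.shape)  # (512, 1024)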

Requirement Details

Test

  • Unit test code and a benchmark are needed.
  • Use one model from the model zoo to validate the performance gain. The performance data and analysis results should be documented and reproducible.

Code Style and commit

  • C++ and python: Keep aligned with DeepRec code.

Maintain

  • All future issues and bugs related to this op need to be covered.

Definition of Done

  • Runs successfully in DeepRec and delivers better performance.
  • Integrated into DeepRec successfully, with the code committed following the DeepRec commit standard.

[Bug] embedding-fusion precision analysis.

Hi @Duyi-Wang, the following are some UTs that may help reduce your validation time.

# Python UT: includes a simple model implementation; path: "tensorflow/python/feature_column/feature_column_v2_test.py"
$ bazel test --flaky_test_attempts 1 --test_output=all --nocache_test_results --cxxopt=-D_GLIBCXX_USE_CXX11_ABI=0 //tensorflow/python/feature_column:feature_column_v2_test

# C++ UT: path: "tensorflow/core/kernels/fused_embedding/embedding_lookup_sparse_op_test.cc"
$ bazel test --flaky_test_attempts 1 --test_output=all --nocache_test_results --cxxopt=-D_GLIBCXX_USE_CXX11_ABI=0 //tensorflow/core/kernels:embedding_lookup_sparse_op_test

FYI

[Graph] Remove some ops that cause oneDNN performance drops.

     reco_ops_list_ = gtl::FlatSet<string> {
       "BatchMatMul", "BatchMatMulV2", "BiasAdd", "BiasAddGrad",
       "_FusedMatMul", "_FusedBatchMatMul", "_FusedBatchMatMulV2",
-      "Identity", "LeakyRelu", "LeakyReluGrad", "MatMul",
+      "LeakyRelu", "LeakyReluGrad", "MatMul",
       "Relu", "ReluGrad", "Relu6", "Relu6Grad", "Gelu", "GeluGrad",
-      "Tanh", "TanhGrad", "Reshape"
+      "Tanh", "TanhGrad"
     };

[Python] Undefined symbol: _ZTIN10tensorflow8OpKernelE when building DeepRec for the MLIR Python API.

System information

  • Docker Image: alideeprec/deeprec-build:deeprec-dev-cpu-py38-ubuntu20.04
  • DeepRec version or commit id: 3bc930a
  • Python version: 3.8.10
  • Bazel version (if compiling from source): 0.26.1
  • GCC/Compiler version (if compiling from source): 9.4.0

Describe the problem

I was enabling the MLIR Python API in DeepRec. In BUILD, building MLIR depends on "//tensorflow/core:ops", so I added "//tensorflow/core:ops" to a BUILD file and built it, but I met an error: undefined symbol: _ZTIN10tensorflow8OpKernelE.

Provide the exact sequence of commands / steps that you executed before running into the problem

Add "//tensorflow/core:ops" in tensorflow/python/BUILD's cc_library( name = "_tf_stack" ) (line 4863). Here is the screenshot after adding"//tensorflow/core:ops":
image

After revising tensorflow/python/BUILD, run:
$ ./configure
$ bazel build -c opt --config=opt //tensorflow/tools/pip_package:build_pip_package

Any other info / logs

Include any logs or source code that would be helpful to diagnose the problem.

[Operator][Optimization]Unsorted_segment_sum op optimization

unsorted_segment_sum operator optimization
Goal
Optimize unsorted_segment_sum operator performance

Problem Description
In some recommendation models, for example DLRM, the unsorted_segment_sum operator adds noticeable overhead, so it is important to reduce its cost.

Here is the step to reproduce the performance issue.

  • Collect timeline information with DLRM from the model zoo: "numactl -C 8-15 -l python train.py --steps 100 --timeline 49 --no_eval --interaction_op dot". The timeline is shown below.

(timeline screenshot: Capture-unsortedSegmentSum)
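
As a starting point, a minimal micro-benchmark of the stock op could look like the sketch below (shapes and segment counts are illustrative, not taken from DLRM):

import time
import numpy as np
import tensorflow as tf

# Benchmark the stock unsorted_segment_sum on synthetic data.
data = tf.constant(np.random.rand(1000000, 64).astype(np.float32))
segment_ids = tf.constant(np.random.randint(0, 10000, size=1000000, dtype=np.int32))
summed = tf.math.unsorted_segment_sum(data, segment_ids, num_segments=10000)

with tf.Session() as sess:
    sess.run(summed)                              # warm-up
    start = time.time()
    for _ in range(10):
        sess.run(summed)
    print("avg unsorted_segment_sum time: %.4f s" % ((time.time() - start) / 10))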

Requirement Details

  • Rewrite the operator in C++, using intrinsics where possible, following TensorFlow's custom-op mechanism. Taking advantage of AVX-512 is preferred, and a Python API is needed.
  • Integrate the operator into DeepRec and finish the unit test code.
  • Test case: Unit test code is needed. DLRM can be used to measure the end-to-end performance gain; the higher, the better. The performance data needs to be documented and reproducible.
  • Code Style and Commit: Keep aligned with DeepRec code for C++ and Python.
  • Maintain: All future issues and bugs related to this optimization need to be covered.

Definition of Done

  • Runs successfully in DeepRec and delivers better performance.
  • Integrated into DeepRec successfully, with the code committed following the DeepRec commit standard.

[Modelzoo]Rebuild ESMM to update API and enable DeepRec features

Rebuild ESMM to update API and Enable DeepRec Features
Goal
Rebuild ESMM to update API and enable DeepRec Features.

Requirement Details

  • Rebuild ESMM to update the API according to the template (https://github.com/changqi1/DeepRec/blob/modelzoo-template/modelzoo/template.py).
  • Enable the DeepRec features listed below in the code. The same features have already been enabled in WDL (#37); note that the comments there map to the features below. Add flags to enable/disable each feature in the code.
  • If there is any problem when enabling a feature below, describe in detail how to reproduce it and what the issue is, especially for the known issues we have already submitted to Alibaba.

Features list
Enable the following DeepRec features (docs about the features from Alibaba: https://deeprec.readthedocs.io/zh/latest/index.html):

  • Enabled by default; test AUC/ACC/Gsteps, which need to be close to the results before rebuilding

8) Auto Micro Batch, same as DeepRec-AI#127
9) FusedEmbedding API, embedding fusion
10) Smart Stage, same as DeepRec-AI#122
11) Auto Graph Fusion, DeepRec-AI#144
12) CPU Memory Optimization: START_STATISTIC_STEP, STOP_STATISTIC_STEP, jemalloc
14) AdamAsync Optimizer
15) BF16

  • Disabled by default; a passing test is fine. The same performance as before is not required.

1) Embedding Variable
7) GRPC++ and StarServer
13) Incremental Checkpoint
14) AdagradDecay
2) EmbeddingVariable advanced feature: Embedding Elimination
3) EmbeddingVariable advanced feature: Embedding Filter
4) Dynamic-dimension Embedding Variable
5) Adaptive Embedding
17) WorkQueue

  • Other features: disabled by default; a passing test is fine. The same performance as before is not required. The feature below is not supported by the feature_column API; we are waiting for Alibaba's update.

6) Multi-Hash Variable

Test

  • All of the features need to be enabled in the code by adding flags (WDL is the template; see the flag sketch below).
  • Features 8-15 need to be enabled by default and pass their tests with the same performance as before.
  • The other features need to pass their tests; matching performance is not required. Some of these features have known issues we have submitted; if a test does not pass, describe it clearly.
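
A minimal sketch of the kind of enable/disable flags expected in the training script (flag names below are illustrative; the actual names should follow the WDL reference implementation):

import argparse

def build_arg_parser():
    # Each DeepRec feature gets an explicit on/off switch so tests can run
    # with the feature enabled or disabled.
    parser = argparse.ArgumentParser(description="DeepRec model zoo training")
    parser.add_argument("--bf16", action="store_true",
                        help="train dense layers in bfloat16 (feature 15)")
    parser.add_argument("--emb_fusion", action="store_true",
                        help="use the fused embedding lookup API (feature 9)")
    parser.add_argument("--smart_stage", action="store_true",
                        help="enable Smart Stage (feature 10)")
    parser.add_argument("--ev", action="store_true",
                        help="use Embedding Variable (feature 1)")
    return parser

if __name__ == "__main__":
    args = build_arg_parser().parse_args()
    print(vars(args))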

Other Requirements: Dockerfile and Documents

  • Waiting for Alibaba's requirements

Code Style and commit

  • Python: Keep aligned with DeepRec code.

Maintain

  • All of the issues and bugs related to this model need to be covered in the future.

Definition of Done

  • Runs successfully in DeepRec with the same performance as the code before rebuilding.
  • Integrated into DeepRec successfully, with the code committed following the DeepRec commit standard.

[Framework][Optimization]Enabling RDT to improve performance

Enabling RDT technology in DeepRec
Goal
Manage the LLC cache through a low-level API to improve performance

Problem Description
RDT may help improve performance if cache management can be controlled through a low-level API in DeepRec, especially for weights that could stay in the LLC. The detailed design and requirements still need to be confirmed and will be updated once we reach alignment with the customer.
https://www.intel.com/content/www/us/en/architecture-and-technology/resource-director-technology.html

[Error Log] Unknown op errors due to related ops being removed.

Because some ops have been removed, 'unknown op' errors occur. Just run the WDL model from the model zoo with: python train.py --steps 1 --no_eval --tf
Other info / logs


2022-06-21 10:03:02.483799: E tensorflow/core/framework/op_kernel.cc:1575] OpKernel ('op: "_MklQuantizedConv2DWithBiasReluAndSum" device_type: "CPU" constraint { name: "Tinput" allowed_values { list { type: DT_QINT8 } } } constraint { name: "Tfilter" allowed_values { list { type: DT_QINT8 } } } constraint { name: "out_type" allowed_values { list { type: DT_QINT32 } } } label: "QuantizedMklOp"') for unknown op: _MklQuantizedConv2DWithBiasReluAndSum
2022-06-21 10:03:02.483837: E tensorflow/core/framework/op_kernel.cc:1575] OpKernel ('op: "_MklQuantizedConv2DWithBiasReluAndSum" device_type: "CPU" constraint { name: "Tinput" allowed_values { list { type: DT_QUINT8 } } } constraint { name: "Tfilter" allowed_values { list { type: DT_QINT8 } } } constraint { name: "out_type" allowed_values { list { type: DT_QINT32 } } } label: "QuantizedMklOp"') for unknown op: _MklQuantizedConv2DWithBiasReluAndSum
2022-06-21 10:03:02.485436: E tensorflow/core/framework/op_kernel.cc:1575] OpKernel ('op: "QuantizedConv2DWithBiasReluAndSum" device_type: "CPU" constraint { name: "Tinput" allowed_values { list { type: DT_QINT8 } } } constraint { name: "Tfilter" allowed_values { list { type: DT_QINT8 } } } constraint { name: "out_type" allowed_values { list { type: DT_QINT32 } } }') for unknown op: QuantizedConv2DWithBiasReluAndSum
2022-06-21 10:03:02.485458: E tensorflow/core/framework/op_kernel.cc:1575] OpKernel ('op: "QuantizedConv2DWithBiasReluAndSum" device_type: "CPU" constraint { name: "Tinput" allowed_values { list { type: DT_QUINT8 } } } constraint { name: "Tfilter" allowed_values { list { type: DT_QINT8 } } } constraint { name: "out_type" allowed_values { list { type: DT_QINT32 } } }') for unknown op: QuantizedConv2DWithBiasReluAndSum
2022-06-21 10:03:02.485469: E tensorflow/core/framework/op_kernel.cc:1575] OpKernel ('op: "QuantizedConv2DWithBiasReluAndSumAndRequantize" device_type: "CPU" constraint { name: "Tinput" allowed_values { list { type: DT_QINT8 } } } constraint { name: "Tfilter" allowed_values { list { type: DT_QINT8 } } } constraint { name: "out_type" allowed_values { list { type: DT_QINT8 } } }') for unknown op: QuantizedConv2DWithBiasReluAndSumAndRequantize
2022-06-21 10:03:02.485479: E tensorflow/core/framework/op_kernel.cc:1575] OpKernel ('op: "QuantizedConv2DWithBiasReluAndSumAndRequantize" device_type: "CPU" constraint { name: "Tinput" allowed_values { list { type: DT_QUINT8 } } } constraint { name: "Tfilter" allowed_values { list { type: DT_QINT8 } } } constraint { name: "out_type" allowed_values { list { type: DT_QINT8 } } }') for unknown op: QuantizedConv2DWithBiasReluAndSumAndRequantize
2022-06-21 10:03:02.485488: E tensorflow/core/framework/op_kernel.cc:1575] OpKernel ('op: "QuantizedConv2DWithBiasReluAndSumAndRequantize" device_type: "CPU" constraint { name: "Tinput" allowed_values { list { type: DT_QINT8 } } } constraint { name: "Tfilter" allowed_values { list { type: DT_QINT8 } } } constraint { name: "out_type" allowed_values { list { type: DT_QUINT8 } } }') for unknown op: QuantizedConv2DWithBiasReluAndSumAndRequantize
2022-06-21 10:03:02.485497: E tensorflow/core/framework/op_kernel.cc:1575] OpKernel ('op: "QuantizedConv2DWithBiasReluAndSumAndRequantize" device_type: "CPU" constraint { name: "Tinput" allowed_values { list { type: DT_QUINT8 } } } constraint { name: "Tfilter" allowed_values { list { type: DT_QINT8 } } } constraint { name: "out_type" allowed_values { list { type: DT_QUINT8 } } }') for unknown op: QuantizedConv2DWithBiasReluAndSumAndRequantize
2022-06-21 10:03:02.485510: E tensorflow/core/framework/op_kernel.cc:1575] OpKernel ('op: "_MklQuantizedConv2DWithBiasReluAndSumAndRequantize" device_type: "CPU" constraint { name: "Tinput" allowed_values { list { type: DT_QINT8 } } } constraint { name: "Tfilter" allowed_values { list { type: DT_QINT8 } } } constraint { name: "Tbias" allowed_values { list { type: DT_FLOAT } } } constraint { name: "out_type" allowed_values { list { type: DT_QINT8 } } } label: "QuantizedMklOp"') for unknown op: _MklQuantizedConv2DWithBiasReluAndSumAndRequantize
2022-06-21 10:03:02.485520: E tensorflow/core/framework/op_kernel.cc:1575] OpKernel ('op: "_MklQuantizedConv2DWithBiasReluAndSumAndRequantize" device_type: "CPU" constraint { name: "Tinput" allowed_values { list { type: DT_QUINT8 } } } constraint { name: "Tfilter" allowed_values { list { type: DT_QINT8 } } } constraint { name: "Tbias" allowed_values { list { type: DT_FLOAT } } } constraint { name: "out_type" allowed_values { list { type: DT_QINT8 } } } label: "QuantizedMklOp"') for unknown op: _MklQuantizedConv2DWithBiasReluAndSumAndRequantize
2022-06-21 10:03:02.485530: E tensorflow/core/framework/op_kernel.cc:1575] OpKernel ('op: "_MklQuantizedConv2DWithBiasReluAndSumAndRequantize" device_type: "CPU" constraint { name: "Tinput" allowed_values { list { type: DT_QINT8 } } } constraint { name: "Tfilter" allowed_values { list { type: DT_QINT8 } } } constraint { name: "Tbias" allowed_values { list { type: DT_FLOAT } } } constraint { name: "out_type" allowed_values { list { type: DT_QUINT8 } } } label: "QuantizedMklOp"') for unknown op: _MklQuantizedConv2DWithBiasReluAndSumAndRequantize
2022-06-21 10:03:02.485540: E tensorflow/core/framework/op_kernel.cc:1575] OpKernel ('op: "_MklQuantizedConv2DWithBiasReluAndSumAndRequantize" device_type: "CPU" constraint { name: "Tinput" allowed_values { list { type: DT_QUINT8 } } } constraint { name: "Tfilter" allowed_values { list { type: DT_QINT8 } } } constraint { name: "Tbias" allowed_values { list { type: DT_FLOAT } } } constraint { name: "out_type" allowed_values { list { type: DT_QUINT8 } } } label: "QuantizedMklOp"') for unknown op: _MklQuantizedConv2DWithBiasReluAndSumAndRequantize
2022-06-21 10:03:02.485549: E tensorflow/core/framework/op_kernel.cc:1575] OpKernel ('op: "_MklQuantizedConv2DWithBiasReluAndSumAndRequantize" device_type: "CPU" constraint { name: "Tinput" allowed_values { list { type: DT_QINT8 } } } constraint { name: "Tfilter" allowed_values { list { type: DT_QINT8 } } } constraint { name: "Tbias" allowed_values { list { type: DT_QINT32 } } } constraint { name: "out_type" allowed_values { list { type: DT_QINT8 } } } label: "QuantizedMklOp"') for unknown op: _MklQuantizedConv2DWithBiasReluAndSumAndRequantize
2022-06-21 10:03:02.485557: E tensorflow/core/framework/op_kernel.cc:1575] OpKernel ('op: "_MklQuantizedConv2DWithBiasReluAndSumAndRequantize" device_type: "CPU" constraint { name: "Tinput" allowed_values { list { type: DT_QUINT8 } } } constraint { name: "Tfilter" allowed_values { list { type: DT_QINT8 } } } constraint { name: "Tbias" allowed_values { list { type: DT_QINT32 } } } constraint { name: "out_type" allowed_values { list { type: DT_QINT8 } } } label: "QuantizedMklOp"') for unknown op: _MklQuantizedConv2DWithBiasReluAndSumAndRequantize
2022-06-21 10:03:02.485565: E tensorflow/core/framework/op_kernel.cc:1575] OpKernel ('op: "_MklQuantizedConv2DWithBiasReluAndSumAndRequantize" device_type: "CPU" constraint { name: "Tinput" allowed_values { list { type: DT_QINT8 } } } constraint { name: "Tfilter" allowed_values { list { type: DT_QINT8 } } } constraint { name: "Tbias" allowed_values { list { type: DT_QINT32 } } } constraint { name: "out_type" allowed_values { list { type: DT_QUINT8 } } } label: "QuantizedMklOp"') for unknown op: _MklQuantizedConv2DWithBiasReluAndSumAndRequantize
2022-06-21 10:03:02.485576: E tensorflow/core/framework/op_kernel.cc:1575] OpKernel ('op: "_MklQuantizedConv2DWithBiasReluAndSumAndRequantize" device_type: "CPU" constraint { name: "Tinput" allowed_values { list { type: DT_QUINT8 } } } constraint { name: "Tfilter" allowed_values { list { type: DT_QINT8 } } } constraint { name: "Tbias" allowed_values { list { type: DT_QINT32 } } } constraint { name: "out_type" allowed_values { list { type: DT_QUINT8 } } } label: "QuantizedMklOp"') for unknown op: _MklQuantizedConv2DWithBiasReluAndSumAndRequantize


[Modelzoo]Rebuild MMoE to update API and Enable DeepRec Features

Rebuild MMoE to update API and Enable DeepRec Features
Goal
Rebuild MMoE to update API and enable DeepRec Features.

Requirement Details

  • Rebuild MMoE to update the API according to the template (https://github.com/changqi1/DeepRec/blob/modelzoo-template/modelzoo/template.py).
  • Enable the DeepRec features listed below in the code. The same features have already been enabled in WDL (#37); note that the comments there map to the features below. Add flags to enable/disable each feature in the code.
  • If there is any problem when enabling a feature below, describe in detail how to reproduce it and what the issue is, especially for the known issues we have already submitted to Alibaba.

Features list
Enable the following DeepRec features (docs about the features from Alibaba: https://deeprec.readthedocs.io/zh/latest/index.html):

  • Enabled by default; test AUC/ACC/Gsteps, which need to be close to the results before rebuilding

8) Auto Micro Batch, same as DeepRec-AI#127
9) FusedEmbedding API, embedding fusion
10) Smart Stage, same as DeepRec-AI#122
11) Auto Graph Fusion, DeepRec-AI#144
12) CPU Memory Optimization: START_STATISTIC_STEP, STOP_STATISTIC_STEP, jemalloc
14) AdamAsync Optimizer
15) BF16

  • Disabled by default; a passing test is fine. The same performance as before is not required.

1) Embedding Variable
7) GRPC++ and StarServer
13) Incremental Checkpoint
14) AdagradDecay
2) EmbeddingVariable advanced feature: Embedding Elimination
3) EmbeddingVariable advanced feature: Embedding Filter
4) Dynamic-dimension Embedding Variable
5) Adaptive Embedding
17) WorkQueue

  • Other features: disabled by default; a passing test is fine. The same performance as before is not required. The feature below is not supported by the feature_column API; we are waiting for Alibaba's update.

6) Multi-Hash Variable

Test

  • All of the features need to be enabled in the code by adding flags (WDL is the template).
  • Features 8-15 need to be enabled by default and pass their tests with the same performance as before.
  • The other features need to pass their tests; matching performance is not required. Some of these features have known issues we have submitted; if a test does not pass, describe it clearly.

Other Requirements: Dockerfile and Documents

  • Waiting for Alibaba's requirements

Code Style and commit

  • Python: Keep aligned with DeepRec code.

Maintain

  • All of the issues and bugs related to this model need to be covered in the future.

Definition of Done

  • Runs successfully in DeepRec with the same performance as the code before rebuilding.
  • Integrated into DeepRec successfully, with the code committed following the DeepRec commit standard.

[doc] change the compile option in the README.md

The documented compile commands for enabling oneDNN are currently:
Compile for CPU optimization: oneDNN + Unified Eigen Thread pool

$ bazel build  -c opt --config=opt  --config=mkl_threadpool --define build_with_mkl_dnn_v1_only=true //tensorflow/tools/pip_package:build_pip_package

Compile for CPU optimization and ABI=0

$ bazel build --cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" --host_cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" -c opt --config=opt --config=mkl_threadpool --define build_with_mkl_dnn_v1_only=true //tensorflow/tools/pip_package:build_pip_package

Change them as follows: remove --define build_with_mkl_dnn_v1_only=true and add --copt=-march=skylake-avx512.
Compile for CPU optimization: oneDNN + Unified Eigen Thread pool

$ bazel build  -c opt --config=opt  --config=mkl_threadpool --copt=-march=skylake-avx512 //tensorflow/tools/pip_package:build_pip_package

Compile for CPU optimization and ABI=0

$ bazel build --cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" --host_cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" -c opt --config=opt --config=mkl_threadpool --copt=-march=skylake-avx512 //tensorflow/tools/pip_package:build_pip_package

[Graph][Optimization]Reduce weight packing/unpacking overhead in multi-MatMul scenarios

Reduce weight packing/unpacking overhead across multiple MatMuls
Goal
Optimize performance by reducing packing/unpacking overhead across consecutive MatMul operations

Problem Description
In some models there are multiple consecutive MatMul operations. Each MatMul packs and unpacks its weights to improve cache locality, and every pack/unpack has a cost. It should therefore be possible to pack once before the first MatMul and unpack once after the last MatMul. The picture below illustrates the method.

(illustration: Capture-packing-opt)
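
A small sketch of the consecutive-MatMul pattern this optimization targets (hypothetical shapes, TF1-style graph mode); today each tf.matmul packs and unpacks its weights separately:

import numpy as np
import tensorflow as tf

x  = tf.placeholder(tf.float32, shape=[None, 256])
w1 = tf.Variable(np.random.rand(256, 512).astype(np.float32))
w2 = tf.Variable(np.random.rand(512, 256).astype(np.float32))
w3 = tf.Variable(np.random.rand(256, 128).astype(np.float32))

# Three back-to-back MatMuls: packing could be hoisted before the first one
# and unpacking deferred until after the last one.
y = tf.matmul(tf.matmul(tf.matmul(x, w1), w2), w3)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(y, feed_dict={x: np.random.rand(64, 256).astype(np.float32)}).shape)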

Requirement Details

  • Write a MatMul in C++ and prepare the baseline code.
  • Finish this optimization to achieve a PoC.
  • Supply unit test code to validate the functionality.
  • Integrate it into DeepRec through the Grappler mechanism.
  • Apply the optimization to one model and show the performance data.

Test

  • Use one model from the model zoo to validate the performance gain. The performance data and analysis results should be documented and reproducible.

Code Style and commit

  • C++ and python: Keep aligned with DeepRec code.

Maintain

  • All future issues and bugs related to this op need to be covered.

Definition of Done

  • Runs successfully in DeepRec and delivers better performance.
  • Integrated into DeepRec successfully, with the code committed following the DeepRec commit standard.

[Operator][Optimization] Embedding Operator Optimization

Migrate 6 embedding ops to DeepRec and make sure models from the model zoo benefit from these optimized ops.
sparseEmbedding-base (P0), kernel from https://github.com/intel-sandbox/applications.ai.easyrec.inference-opt/blob/main/ops/embedding_ops/sparseEmbedding-base-fp32-avx512.cc
sparseEmbedding-sparseInput (P1), kernel from https://github.com/intel-sandbox/applications.ai.easyrec.inference-opt/blob/main/ops/embedding_ops/sparseEmbedding-sparseInput-fp32-avx512.cc
sparseEmbedding-select (P1), kernel from https://github.com/intel-sandbox/applications.ai.easyrec.inference-opt/blob/main/ops/embedding_ops/sparseEmbedding-select-fp32-avx512.cc
sparseEmbedding-stringsplit (P1), kernel from https://github.com/intel-sandbox/applications.ai.easyrec.inference-opt/blob/main/ops/embedding_ops/sparseEmbedding-stringsplit-fp32-avx512.cc
sparseEmbedding-bucketized (P2), kernel from https://github.com/intel-sandbox/applications.ai.easyrec.inference-opt/blob/main/ops/embedding_ops/sparseEmbedding-bucketized-fp32-avx512.cc
sparseEmbedding-multiweights (P2), kernel from https://github.com/intel-sandbox/applications.ai.easyrec.inference-opt/blob/main/ops/embedding_ops/multi-sparseEmbedding-base-fp32-avx512.cc
https://github.com/intel-sandbox/applications.ai.easyrec.inference-opt
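
For reference, a minimal baseline using the stock tf.nn.embedding_lookup_sparse that these AVX-512 kernels aim to accelerate (the table size and ids below are illustrative):

import numpy as np
import tensorflow as tf

# Toy embedding table and a 2x2 sparse id batch; combiner="mean" averages the
# looked-up rows per example, which is roughly what the fused kernels reimplement.
table = tf.Variable(np.random.rand(1000, 16).astype(np.float32))
sp_ids = tf.SparseTensor(indices=[[0, 0], [0, 1], [1, 0]],
                         values=np.array([12, 51, 7], dtype=np.int64),
                         dense_shape=[2, 2])
emb = tf.nn.embedding_lookup_sparse(table, sp_ids, sp_weights=None, combiner="mean")

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(emb).shape)  # (2, 16)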

Features Request

  • The 6 operators are ready in DeepRec and can be applied to real models from the model zoo with fp32.
  • Implement the embedding op functions in C++ as operators, applied through the Grappler mechanism: https://www.tensorflow.org/guide/graph_optimization
  • The 6 operators need to be abstracted into a unified embedding operator class

Test

  • At least one test case ready in the model zoo, for example applying the 6 ops to one model from the model zoo. The performance data and analysis results should be documented and reproducible.

Code Style and commit

  • C++ and python: Keep aligned with DeepRec code.

Maintain

  • All future issues and bugs related to these 6 ops need to be covered.

[BUG] Embedding fusion acc/auc issue

Two problems have been identified so far:

  1. Our fused op does not deduplicate the input ids (no uniqueness filtering).
  2. In our fused op, duplicated inputs do not produce consistent outputs.
    In the forward gather output of the fusion, the rows marked in red are the duplicates, but the backward output contains no duplicated rows; the values are not even close, and the differences are large.
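
For context, the unfused path deduplicates the ids before the gather (its UniqueWithCounts output appears in the log further below); a small illustrative sketch using the ids from this log:

import tensorflow as tf

# The unfused embedding_lookup_sparse path first maps ids to unique ids, gathers
# once per unique id, and aggregates gradients per unique id via the index map,
# so duplicated input ids end up with one consistent gradient row.
ids = tf.constant([2816, 903, 6681, 6681, 1309, 1777, 6681, 5311], dtype=tf.int64)
unique_ids, segment_idx = tf.unique(ids)

with tf.Session() as sess:
    print(sess.run(unique_ids))    # [2816  903 6681 1309 1777 5311]
    print(sess.run(segment_idx))   # [0 1 2 2 3 4 2 5]
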
INFO:tensorflow:input_layer/sparse_input_layer/input_layer/C10_embedding/C10_embedding_weights/fused_embedding_lookup_sparse/GatherV2 = 
[[ 1.2214626   0.5241986   0.22209705 -0.56519794  0.12571095  0.26446024
   0.3907412   0.00005793 -0.45320386 -0.69033164  0.3531894  -0.1513126
   0.00713284 -1.1315851   0.12203985  0.23935615]
 [-1.5699558   0.63231647 -0.5136872   0.18575184 -0.12131955 -1.4859123
   0.8938838  -0.33873808 -0.24968442 -0.47764817  0.36187503 -0.14567816
  -0.24810648 -1.3606204  -0.08617076  0.4501951 ]
 [ 1.395356   -0.85390687 -0.7608217   1.0669864   1.191038    0.88764894
   0.9451067   0.29302412  1.2512774   0.6840943  -0.20568915 -0.32980326
  -0.42660442  0.54374695 -0.9136276   0.04837677]
 [ 1.395356   -0.85390687 -0.7608217   1.0669864   1.191038    0.88764894
   0.9451067   0.29302412  1.2512774   0.6840943  -0.20568915 -0.32980326
  -0.42660442  0.54374695 -0.9136276   0.04837677]
 [ 0.6147636   0.33874637 -0.7812209   1.2390836   1.8089103  -1.2311537
  -0.43859923  1.3363832  -0.72441924  1.3167928   1.1064852   0.51790696
  -0.24631402  1.2318567   1.4000374  -0.30377945]
 [-0.3499662  -1.789908    0.48219246  0.2007537   0.7334909  -0.01890297
   0.08424582 -0.9799169  -0.35487846  0.17760478  0.7782412   0.01907562
  -0.5430275  -1.0409418  -0.06544966 -0.31106764]
 [ 1.395356   -0.85390687 -0.7608217   1.0669864   1.191038    0.88764894
   0.9451067   0.29302412  1.2512774   0.6840943  -0.20568915 -0.32980326
  -0.42660442  0.54374695 -0.9136276   0.04837677]
 [-0.5124917   0.45528954  0.7462012   0.20852847  1.4730995   0.8039012
  -0.5750134   0.22652298  1.5296302   0.779812    1.460728    0.8999218
   1.5914694   0.8920278  -1.1893805   1.916351  ]]

Inputs/outputs of the fusion grad:

Output: INFO:tensorflow:head/gradients/input_layer/sparse_input_layer/input_layer/C10_embedding/C10_embedding_weights/fused_embedding_lookup_sparse/FusedEmbeddingSparsePostLookUp_grad/FusedEmbeddingSparsePostLookUpGrad = 
[[ 0.00861822  0.0023349  -0.00688701  0.00269023 -0.00164793  0.00736784
  -0.01025489 -0.00652598  0.00471746  0.00888411 -0.00231681  0.00083448
  -0.00203576 -0.00289572  0.00719752 -0.00490604]
 [ 0.00153809 -0.00080411 -0.00177121  0.00086962 -0.00095507  0.00141255
  -0.00152518 -0.0010505   0.00122721  0.00121513 -0.00102509 -0.00052917
  -0.00006346 -0.00056932  0.00194405 -0.00034838]
 [ 0.00154482 -0.00019465 -0.00210896  0.00204525 -0.00301571  0.0016045
  -0.00139093 -0.00215399 -0.00047724  0.00222365  0.00055881 -0.00044712
  -0.00082448  0.00155544  0.00257766  0.00031559]
 [-0.00136614 -0.00111764  0.00218892 -0.00170602  0.00054201 -0.00347374
   0.00119696  0.00144338 -0.00078496 -0.00169556  0.00112028 -0.00118931
  -0.00130751 -0.00075804 -0.00326457 -0.0000487 ]
 [-0.00209746  0.001618    0.00076764  0.00073953  0.00029006 -0.00244137
   0.00196408  0.00168557  0.00034245 -0.00137542 -0.00048502 -0.00033666
  -0.00041434  0.00007309 -0.00190399 -0.00012118]
 [ 0.00003232 -0.00172924  0.00824459 -0.00492863  0.00488566 -0.01386313
   0.00917176  0.0094776   0.00208267 -0.01060377  0.00307963 -0.00334385
   0.00790155 -0.00217232 -0.00438204  0.00839924]
 [-0.003769    0.00456759  0.00200083 -0.00209491  0.00360792 -0.00388297
   0.00045974  0.00181912 -0.00040632 -0.00027408  0.0035839   0.00000899
   0.00114274  0.00211995 -0.00300257  0.00140855]
 [-0.00272977  0.00265015  0.00149372 -0.00279601  0.00245979 -0.00382371
   0.00186041 -0.00059908  0.00158117 -0.00263364  0.00022644 -0.00097163
   0.00043103  0.00025143 -0.00277124 -0.00126195]], 
Input: head/gradients/input_layer/sparse_input_layer/input_layer/C10_embedding/C10_embedding_weights/Reshape_grad/Reshape = 
[[-0.00861822 -0.0023349   0.00688701 -0.00269023  0.00164793 -0.00736784
   0.01025489  0.00652598 -0.00471746 -0.00888411  0.00231681 -0.00083448
   0.00203576  0.00289572 -0.00719752  0.00490604]
 [-0.00153809  0.00080411  0.00177121 -0.00086962  0.00095507 -0.00141255
   0.00152518  0.0010505  -0.00122721 -0.00121513  0.00102509  0.00052917
   0.00006346  0.00056932 -0.00194405  0.00034838]
 [-0.00154482  0.00019465  0.00210896 -0.00204525  0.00301571 -0.0016045
   0.00139093  0.00215399  0.00047724 -0.00222365 -0.00055881  0.00044712
   0.00082448 -0.00155544 -0.00257766 -0.00031559]
 [ 0.00136614  0.00111764 -0.00218892  0.00170602 -0.00054201  0.00347374
  -0.00119696 -0.00144338  0.00078496  0.00169556 -0.00112028  0.00118931
   0.00130751  0.00075804  0.00326457  0.0000487 ]
 [ 0.00209746 -0.001618   -0.00076764 -0.00073953 -0.00029006  0.00244137
  -0.00196408 -0.00168557 -0.00034245  0.00137542  0.00048502  0.00033666
   0.00041434 -0.00007309  0.00190399  0.00012118]
 [-0.00003232  0.00172924 -0.00824459  0.00492863 -0.00488566  0.01386313
  -0.00917176 -0.0094776  -0.00208267  0.01060377 -0.00307963  0.00334385
  -0.00790155  0.00217232  0.00438204 -0.00839924]
 [ 0.003769   -0.00456759 -0.00200083  0.00209491 -0.00360792  0.00388297
  -0.00045974 -0.00181912  0.00040632  0.00027408 -0.0035839  -0.00000899
  -0.00114274 -0.00211995  0.00300257 -0.00140855]
 [ 0.00272977 -0.00265015 -0.00149372  0.00279601 -0.00245979  0.00382371
  -0.00186041  0.00059908 -0.00158117  0.00263364 -0.00022644  0.00097163
  -0.00043103 -0.00025143  0.00277124  0.00126195]], 
input_layer/sparse_input_layer/input_layer/C10_embedding/C10_embedding_weights/fused_embedding_lookup_sparse/GatherV2 = 
[[ 1.2214626   0.5241986   0.22209705 -0.56519794  0.12571095  0.26446024
   0.3907412   0.00005793 -0.45320386 -0.69033164  0.3531894  -0.1513126
   0.00713284 -1.1315851   0.12203985  0.23935615]
 [-1.5699558   0.63231647 -0.5136872   0.18575184 -0.12131955 -1.4859123
   0.8938838  -0.33873808 -0.24968442 -0.47764817  0.36187503 -0.14567816
  -0.24810648 -1.3606204  -0.08617076  0.4501951 ]
 [ 1.395356   -0.85390687 -0.7608217   1.0669864   1.191038    0.88764894
   0.9451067   0.29302412  1.2512774   0.6840943  -0.20568915 -0.32980326
  -0.42660442  0.54374695 -0.9136276   0.04837677]
 [ 1.395356   -0.85390687 -0.7608217   1.0669864   1.191038    0.88764894
   0.9451067   0.29302412  1.2512774   0.6840943  -0.20568915 -0.32980326
  -0.42660442  0.54374695 -0.9136276   0.04837677]
 [ 0.6147636   0.33874637 -0.7812209   1.2390836   1.8089103  -1.2311537
  -0.43859923  1.3363832  -0.72441924  1.3167928   1.1064852   0.51790696
  -0.24631402  1.2318567   1.4000374  -0.30377945]
 [-0.3499662  -1.789908    0.48219246  0.2007537   0.7334909  -0.01890297
   0.08424582 -0.9799169  -0.35487846  0.17760478  0.7782412   0.01907562
  -0.5430275  -1.0409418  -0.06544966 -0.31106764]
 [ 1.395356   -0.85390687 -0.7608217   1.0669864   1.191038    0.88764894
   0.9451067   0.29302412  1.2512774   0.6840943  -0.20568915 -0.32980326
  -0.42660442  0.54374695 -0.9136276   0.04837677]
 [-0.5124917   0.45528954  0.7462012   0.20852847  1.4730995   0.8039012
  -0.5750134   0.22652298  1.5296302   0.779812    1.460728    0.8999218
   1.5914694   0.8920278  -1.1893805   1.916351  ]], 
input_layer/sparse_input_layer/input_layer/C10_embedding/C10_embedding_weights/fused_embedding_lookup_sparse/FusedEmbeddingSparsePreLookUp:0 = 
[2816  903 6681 6681 1309 1777 6681 5311], 
input_layer/sparse_input_layer/input_layer/C10_embedding/C10_embedding_weights/fused_embedding_lookup_sparse/FusedEmbeddingSparsePostLookUp:1 = [1 1 1 1 1 1 1 1], 
input_layer/sparse_input_layer/input_layer/C10_embedding/C10_embedding_weights/fused_embedding_lookup_sparse/FusedEmbeddingSparsePreLookUp:1 = 
[[0 0]
 [1 0]
 [2 0]
 [3 0]
 [4 0]
 [5 0]
 [6 0]
 [7 0]]

Inputs/outputs of the unfused grad:

Output: INFO:tensorflow:head/gradients/input_layer/sparse_input_layer/input_layer/C10_embedding/C10_embedding_weights/embedding_lookup_sparse_grad/SparseSegmentMeanGrad = 
[[-0.00861822 -0.0023349   0.00688701 -0.00269023  0.00164793 -0.00736784
   0.01025489  0.00652598 -0.00471746 -0.00888411  0.00231681 -0.00083448
   0.00203576  0.00289572 -0.00719752  0.00490604]
 [-0.00153809  0.00080411  0.00177121 -0.00086962  0.00095507 -0.00141255
   0.00152518  0.0010505  -0.00122721 -0.00121513  0.00102509  0.00052917
   0.00006346  0.00056932 -0.00194405  0.00034838]
 [ 0.00359032 -0.00325531 -0.00208078  0.00175567 -0.00113423  0.00575221
  -0.00026577 -0.00110852  0.00166852 -0.00025401 -0.00526298  0.00162744
   0.00098925 -0.00291735  0.00368948 -0.00167544]
 [ 0.00209746 -0.001618   -0.00076764 -0.00073953 -0.00029006  0.00244137
  -0.00196408 -0.00168557 -0.00034245  0.00137542  0.00048502  0.00033666
   0.00041434 -0.00007309  0.00190399  0.00012118]
 [-0.00003232  0.00172924 -0.00824459  0.00492863 -0.00488566  0.01386313
  -0.00917176 -0.0094776  -0.00208267  0.01060377 -0.00307963  0.00334385
  -0.00790155  0.00217232  0.00438204 -0.00839924]
 [ 0.00272977 -0.00265015 -0.00149372  0.00279601 -0.00245979  0.00382371
  -0.00186041  0.00059908 -0.00158117  0.00263364 -0.00022644  0.00097163
  -0.00043103 -0.00025143  0.00277124  0.00126195]], 
Input: head/gradients/input_layer/sparse_input_layer/input_layer/C10_embedding/C10_embedding_weights/embedding_lookup_sparse_grad/strided_slice = 6, 
head/gradients/input_layer/sparse_input_layer/input_layer/C10_embedding/C10_embedding_weights_grad/tuple/control_dependency_1 = 
[[-0.00861822 -0.0023349   0.00688701 -0.00269023  0.00164793 -0.00736784
   0.01025489  0.00652598 -0.00471746 -0.00888411  0.00231681 -0.00083448
   0.00203576  0.00289572 -0.00719752  0.00490604]
 [-0.00153809  0.00080411  0.00177121 -0.00086962  0.00095507 -0.00141255
   0.00152518  0.0010505  -0.00122721 -0.00121513  0.00102509  0.00052917
   0.00006346  0.00056932 -0.00194405  0.00034838]
 [-0.00154482  0.00019465  0.00210896 -0.00204525  0.00301571 -0.0016045
   0.00139093  0.00215399  0.00047724 -0.00222365 -0.00055881  0.00044712
   0.00082448 -0.00155544 -0.00257766 -0.00031559]
 [ 0.00136614  0.00111764 -0.00218892  0.00170602 -0.00054201  0.00347374
  -0.00119696 -0.00144338  0.00078496  0.00169556 -0.00112028  0.00118931
   0.00130751  0.00075804  0.00326457  0.0000487 ]
 [ 0.00209746 -0.001618   -0.00076764 -0.00073953 -0.00029006  0.00244137
  -0.00196408 -0.00168557 -0.00034245  0.00137542  0.00048502  0.00033666
   0.00041434 -0.00007309  0.00190399  0.00012118]
 [-0.00003232  0.00172924 -0.00824459  0.00492863 -0.00488566  0.01386313
  -0.00917176 -0.0094776  -0.00208267  0.01060377 -0.00307963  0.00334385
  -0.00790155  0.00217232  0.00438204 -0.00839924]
 [ 0.003769   -0.00456759 -0.00200083  0.00209491 -0.00360792  0.00388297
  -0.00045974 -0.00181912  0.00040632  0.00027408 -0.0035839  -0.00000899
  -0.00114274 -0.00211995  0.00300257 -0.00140855]
 [ 0.00272977 -0.00265015 -0.00149372  0.00279601 -0.00245979  0.00382371
  -0.00186041  0.00059908 -0.00158117  0.00263364 -0.00022644  0.00097163
  -0.00043103 -0.00025143  0.00277124  0.00126195]], 
input_layer/sparse_input_layer/input_layer/C10_embedding/C10_embedding_weights/embedding_lookup_sparse/UniqueWithCounts:1 = [0 1 2 2 3 4 2 5], 
input_layer/sparse_input_layer/input_layer/C10_embedding/C10_embedding_weights/embedding_lookup_sparse/Cast = [0 1 2 3 4 5 6 7]

Undefined symbol: _ZN10tensorflow8GraphDefC1Ev when building Python MLIR

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 20.04): Ubuntu 20.04
  • DeepRec version or commit id: git clone -b add_mlir_python_support https://github.com/374365283/DeepRec-mlir-python.git
  • Python version: 3.6.12
  • Bazel version (if compiling from source): 0.26.1
  • GCC/Compiler version (if compiling from source): 7.5.0

Describe the problem
I am trying to add support for the MLIR Python API in DeepRec.
I use pybind11 to define the Python API in tensorflow/python/util/tf_stack.cc, which depends on tensorflow/core/compiler/mlir/python/mlir.cc and mlir.h.
Then I add "//tensorflow/compiler/mlir/python:mlir" to the deps list of _tf_stack in tensorflow/python/BUILD.

After compiling, I hit the error: undefined symbol: _ZN10tensorflow8GraphDefC1Ev.

The most likely reason is that at line 79 of tensorflow/core/compiler/mlir/python/mlir.cc, GraphDef depends on graph.pb.h and graph.pb.cc. Even though "protos_all_cc" is already in _tf_stack's deps tree, the linker still cannot find the definition from graph.pb.cc.

Provide the exact sequence of commands / steps that you executed before running into the problem
$ ./configure
$ bazel build -c opt --config=opt //tensorflow/tools/pip_package:build_pip_package
Undefined symbol: _ZN10tensorflow8GraphDefC1Ev

[Modelzoo]Rebuild DBMTL to update API and Enable DeepRec Features

Rebuild DBMTL to update API and Enable DeepRec Features
Goal
Rebuild DBMTL to update API and enable DeepRec Features.

Requirement Details

  • Rebuild DBMTL to update the API according to the template (https://github.com/changqi1/DeepRec/blob/modelzoo-template/modelzoo/template.py).
  • Enable the DeepRec features listed below in the code. The same features have already been enabled in WDL (#37); note that the comments there map to the features below. Add flags to enable/disable each feature in the code.
  • If there is any problem when enabling a feature below, describe in detail how to reproduce it and what the issue is, especially for the known issues we have already submitted to Alibaba.

Features list
Enable the following DeepRec features (docs about the features from Alibaba: https://deeprec.readthedocs.io/zh/latest/index.html):

  • Enabled by default; test AUC/ACC/Gsteps, which need to be close to the results before rebuilding

8) Auto Micro Batch, same as DeepRec-AI#127
9) FusedEmbedding API, embedding fusion
10) Smart Stage, same as DeepRec-AI#122
11) Auto Graph Fusion, DeepRec-AI#144
12) CPU Memory Optimization: START_STATISTIC_STEP, STOP_STATISTIC_STEP, jemalloc
14) AdamAsync Optimizer
15) BF16

  • Disabled by default; a passing test is fine. The same performance as before is not required.

1) Embedding Variable
7) GRPC++ and StarServer
13) Incremental Checkpoint
14) AdagradDecay
2) EmbeddingVariable advanced feature: Embedding Elimination
3) EmbeddingVariable advanced feature: Embedding Filter
4) Dynamic-dimension Embedding Variable
5) Adaptive Embedding
17) WorkQueue

  • Other features: disabled by default; a passing test is fine. The same performance as before is not required. The feature below is not supported by the feature_column API; we are waiting for Alibaba's update.

6) Multi-Hash Variable

Test

  • All of the features need to be enabled in the code by adding flags (WDL is the template).
  • Features 8-15 need to be enabled by default and pass their tests with the same performance as before.
  • The other features need to pass their tests; matching performance is not required. Some of these features have known issues we have submitted; if a test does not pass, describe it clearly.

Other Requirements: Dockerfile and Documents

  • Waiting for Alibaba's requirements

Code Style and commit

  • Python: Keep aligned with DeepRec code.

Maintain

  • All of the issues and bugs related to this model need to be covered in the future.

Definition of Done

  • Runs successfully in DeepRec with the same performance as the code before rebuilding.
  • Integrated into DeepRec successfully, with the code committed following the DeepRec commit standard.

[Graph][Optimization] Concat+cast fusion to improve performance

concat+cast fusion optimization
Goal
Optimize performance through concat+cast fusion

Problem Description
In some recommendation models, for example DLRM, there is a potential performance gain from fusing concat and cast after enabling bf16 in DeepRec.

Here is the step to reproduce the performance issue.

  • Collect timeline information with DLRM from the model zoo: "numactl -C 8-15 -l python train.py --steps 100 --timeline 49 --no_eval --interaction_op dot --bf16". The timeline is shown below.
    (timeline screenshot)
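
The pattern being fused is simply a Concat whose output feeds a Cast; a minimal sketch (illustrative shapes, TF1-style graph mode):

import numpy as np
import tensorflow as tf

a = tf.placeholder(tf.float32, shape=[None, 16])
b = tf.placeholder(tf.float32, shape=[None, 16])
c = tf.concat([a, b], axis=1)        # Concat in fp32
c_bf16 = tf.cast(c, tf.bfloat16)     # immediately cast to bf16 for the dense layers

with tf.Session() as sess:
    feed = {a: np.ones((4, 16), np.float32), b: np.zeros((4, 16), np.float32)}
    print(sess.run(c_bf16, feed_dict=feed).dtype)  # bfloat16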

Requirement Details

  • Fuse the two operators concat and cast into one operator. Both the forward and backward operations need to be covered, and make sure it can be applied to real models, at least DLRM.
  • Follow the Grappler mechanism: https://www.tensorflow.org/guide/graph_optimization
  • Unit test code and benchmark code are needed.

Test

  • Use DLRM to validate the performance gain. The performance data and analysis results should be documented and reproducible.

Code Style and commit

  • C++ and python: Keep aligned with DeepRec code.

Maintain

  • All future issues and bugs related to this op need to be covered.

Definition of Done

  • Runs successfully in DeepRec and delivers better performance.
  • Integrated into DeepRec successfully, with the code committed following the DeepRec commit standard.

[UT] status: Internal: Missing 0-th output from {{node MatMul_1}}

Steps to reproduce

default_opts="
             --cxxopt=-D_GLIBCXX_USE_CXX11_ABI=0 \
             --copt=-O2 \
             --copt=-Wformat \
             --copt=-Wformat-security \
             --copt=-fstack-protector \
             --copt=-fPIC \
             --copt=-fpic \
             --linkopt=-znoexecstack \
             --linkopt=-zrelro \
             --linkopt=-znow \
             --linkopt=-fstack-protector"

mkl_opts="--config=mkl_threadpool \
           --define build_with_mkl_dnn_v1_only=true \
           --copt=-DENABLE_INTEL_MKL_BFLOAT16 \
           --copt=-march=skylake-avx512"

test_opts="--nocache_test_results \
           --test_output=all \
           --verbose_failures \
           --test_verbose_timeout_warnings \
           --flaky_test_attempts 1 \
           --test_timeout 99999999 \
           --test_size_filters=small,medium,large,enormous \
           -c opt \
           --keep_going"

bazel test ${default_opts} ${mkl_opts} ${test_opts} -- //tensorflow/core/grappler/optimizers:mkl_remapper_test


[Modelzoo]Rebuild SimpleMultiTask to update API and Enable DeepRec Features

Rebuild SimpleMultiTask to update API and Enable DeepRec Features
Goal
Rebuild SimpleMultiTask to update API and enable DeepRec Features.

Requirement Details

  • Rebuild SimpleMultiTask to update the API according to the template (https://github.com/changqi1/DeepRec/blob/modelzoo-template/modelzoo/template.py).
  • Enable the DeepRec features listed below in the code. The same features have already been enabled in WDL (#37); note that the comments there map to the features below. Add flags to enable/disable each feature in the code.
  • If there is any problem when enabling a feature below, describe in detail how to reproduce it and what the issue is, especially for the known issues we have already submitted to Alibaba.

Features list
Enable the following DeepRec features (docs about the features from Alibaba: https://deeprec.readthedocs.io/zh/latest/index.html):

  • Enabled by default; test AUC/ACC/Gsteps, which need to be close to the results before rebuilding

8) Auto Micro Batch, same as DeepRec-AI#127
9) FusedEmbedding API, embedding fusion
10) Smart Stage, same as DeepRec-AI#122
11) Auto Graph Fusion, DeepRec-AI#144
12) CPU Memory Optimization: START_STATISTIC_STEP, STOP_STATISTIC_STEP, jemalloc
14) AdamAsync Optimizer
15) BF16

  • Disabled by default; a passing test is fine. The same performance as before is not required.

1) Embedding Variable
7) GRPC++ and StarServer
13) Incremental Checkpoint
14) AdagradDecay
2) EmbeddingVariable advanced feature: Embedding Elimination
3) EmbeddingVariable advanced feature: Embedding Filter
4) Dynamic-dimension Embedding Variable
5) Adaptive Embedding
17) WorkQueue

  • Other features: disabled by default; a passing test is fine. The same performance as before is not required. The feature below is not supported by the feature_column API; we are waiting for Alibaba's update.

6) Multi-Hash Variable

Test

  • All of the features need to be enabled in the code by adding flags (WDL is the template).
  • Features 8-15 need to be enabled by default and pass their tests with the same performance as before.
  • The other features need to pass their tests; matching performance is not required. Some of these features have known issues we have submitted; if a test does not pass, describe it clearly.

Other Requirements: Dockerfile and Documents

  • Waiting for Alibaba's requirements

Code Style and commit

  • Python: Keep aligned with DeepRec code.

Maintain

  • All of the issues and bugs related to this model need to be covered in the future.

Definition of Done

  • Runs successfully in DeepRec with the same performance as the code before rebuilding.
  • Integrated into DeepRec successfully, with the code committed following the DeepRec commit standard.

[UT] //tensorflow/python/kernel_tests/segment_reduction_ops_test does not work

Please make sure that this is a bug. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub. tag:bug_template

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
  • TensorFlow installed from (source or binary):
  • TensorFlow version (use command below):
  • Python version:
  • Bazel version (if compiling from source):
  • GCC/Compiler version (if compiling from source):
  • CUDA/cuDNN version:
  • GPU model and memory:

Here is the problem
https://github.com/changqi1/DeepRec-deprecated/issues/49#issuecomment-1015280420
