changqi1 / deeprec
This project is forked from deeprec-ai/deeprec
DeepRec is a recommendation engine based on TensorFlow.
License: Apache License 2.0
Update the oneDNN build instructions in the documentation, which currently read:
Compile for CPU optimization: oneDNN + Unified Eigen Thread pool
$ bazel build -c opt --config=opt --config=mkl_threadpool --define build_with_mkl_dnn_v1_only=true //tensorflow/tools/pip_package:build_pip_package
Compile for CPU optimization and ABI=0
$ bazel build --cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" --host_cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" -c opt --config=opt --config=mkl_threadpool --define build_with_mkl_dnn_v1_only=true //tensorflow/tools/pip_package:build_pip_package
Change them to the following, removing --define build_with_mkl_dnn_v1_only=true and adding --copt=-march=skylake-avx512:
Compile for CPU optimization: oneDNN + Unified Eigen Thread pool
$ bazel build -c opt --config=opt --config=mkl_threadpool --copt=-march=skylake-avx512 //tensorflow/tools/pip_package:build_pip_package
Compile for CPU optimization and ABI=0
$ bazel build --cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" --host_cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" -c opt --config=opt --config=mkl_threadpool --copt=-march=skylake-avx512 //tensorflow/tools/pip_package:build_pip_package
Reduce weights packing/unpacking overhead in multi matmul
Goal
Optimize performance through reducing packing/unpacking overhead in multi matmul operations
Problem Description
Some models contain several consecutive matmul operations. Each matmul packs and unpacks its operands to improve cache locality, and every pack and unpack has a cost. It may therefore be possible to pack once before the first matmul and unpack once after the last one. The picture below illustrates the method.
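As a rough illustration of the idea described above, the sketch below chains several matmuls in pure Python and counts layout conversions. The row-panel layout, the BLOCK size, and all function names are hypothetical stand-ins for whatever packed format the real kernels use; this is not DeepRec or oneDNN code.

```python
from typing import Dict, List

Matrix = List[List[float]]
Packed = List[Matrix]  # hypothetical packed layout: a list of row panels

BLOCK = 2  # hypothetical panel height
counters: Dict[str, int] = {"pack": 0, "unpack": 0}


def pack(a: Matrix) -> Packed:
    """Split a row-major matrix into panels of BLOCK rows."""
    counters["pack"] += 1
    return [a[i:i + BLOCK] for i in range(0, len(a), BLOCK)]


def unpack(p: Packed) -> Matrix:
    """Concatenate the row panels back into a row-major matrix."""
    counters["unpack"] += 1
    return [row for panel in p for row in panel]


def matmul_packed(p: Packed, w: Matrix) -> Packed:
    """Multiply every row panel by w; the result stays in packed form."""
    cols = list(zip(*w))
    return [[[sum(x * y for x, y in zip(row, col)) for col in cols]
             for row in panel]
            for panel in p]


def chain_naive(a: Matrix, weights: List[Matrix]) -> Matrix:
    """Pack and unpack around every single matmul."""
    for w in weights:
        a = unpack(matmul_packed(pack(a), w))
    return a


def chain_packed_once(a: Matrix, weights: List[Matrix]) -> Matrix:
    """Pack before the first matmul, unpack after the last one."""
    p = pack(a)
    for w in weights:
        p = matmul_packed(p, w)
    return unpack(p)
```

For a chain of N matmuls, chain_naive performs N packs and N unpacks, while chain_packed_once performs one of each and returns the same result. Whether the saving matters in practice depends on how expensive the real reorder is relative to the matmul itself.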
Requirement Details
Test
Code Style and commit
Maintain
Definition of Done
Hi @Duyi-Wang, the following unit tests (UTs) may help reduce your validation time.
# Python UT: includes a simple model implementation; located at "tensorflow/python/feature_column/feature_column_v2_test.py"
$ bazel test --flaky_test_attempts 1 --test_output=all --nocache_test_results --cxxopt=-D_GLIBCXX_USE_CXX11_ABI=0 //tensorflow/python/feature_column:feature_column_v2_test
# C++ UT: path at "tensorflow/core/kernels/fused_embedding/embedding_lookup_sparse_op_test.cc"
$ bazel test --flaky_test_attempts 1 --test_output=all --nocache_test_results --cxxopt=-D_GLIBCXX_USE_CXX11_ABI=0 //tensorflow/core/kernels:embedding_lookup_sparse_op_test
FYI
Replace DeepRec's MklCPUAllocatorFactory TensorPoolAllocator with TensorFlow's MklCPUAllocator when the batch size is large.
https://github.com/alibaba/DeepRec/blob/main/tensorflow/core/common_runtime/threadpool_device.cc
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/common_runtime/threadpool_device.cc
System information
Describe the problem
I am trying to add support for the MLIR Python API in DeepRec.
I used PYBIND11 to define the Python API in tensorflow/python/util/tf_stack.cc, which depends on tensorflow/core/compiler/mlir/python/mlir.cc and mlir.h.
I then added "//tensorflow/compiler/mlir/python:mlir" to the deps list of _tf_stack in tensorflow/python/BUILD.
After compiling, I hit the error: Undefined symbol: _ZN10tensorflow8GraphDefC1Ev.
The most likely reason is that, at line 79 of tensorflow/core/compiler/mlir/python/mlir.cc, GraphDef depends on graph.pb.h and graph.pb.cc. Even though "protos_all_cc" is already in the deps tree of _tf_stack, the definition in graph.pb.cc still cannot be found.
Provide the exact sequence of commands / steps that you executed before running into the problem
$ ./configure
$ bazel build -c opt --config=opt //tensorflow/tools/pip_package:build_pip_package
Undefined symbol: _ZN10tensorflow8GraphDefC1Ev
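To read symbols like the one above, it can help to demangle them. The full tool for this is c++filt from binutils; the sketch below is only a hypothetical, deliberately tiny decoder for the two shapes of Itanium C++ ABI names that appear in these reports (a nested name with an optional constructor marker, and a typeinfo object). It rejects everything else.

```python
def demangle_simple(symbol: str) -> str:
    """Decode a very small subset of Itanium C++ ABI mangled names.

    Handles only: an optional "TI" (typeinfo) prefix, a nested name
    N<len><id>...E, an optional C1/C2 constructor marker, and a trailing
    "v" (empty parameter list). Anything else raises ValueError.
    """
    if not symbol.startswith("_Z"):
        raise ValueError("not an Itanium-mangled name")
    rest = symbol[2:]
    prefix = ""
    if rest.startswith("TI"):  # _ZTI... names a typeinfo object
        prefix, rest = "typeinfo for ", rest[2:]
    if not rest.startswith("N"):
        raise ValueError("only nested names (N...E) are handled")
    rest = rest[1:]
    parts = []
    while rest and rest[0].isdigit():  # <length><identifier> pairs
        n = 0
        while n < len(rest) and rest[n].isdigit():
            n += 1
        length = int(rest[:n])
        parts.append(rest[n:n + length])
        rest = rest[n + length:]
    if parts and rest[:2] in ("C1", "C2"):  # constructor: repeat class name
        parts.append(parts[-1])
        rest = rest[2:]
    if not rest.startswith("E"):
        raise ValueError("unsupported construct: " + rest)
    rest = rest[1:]
    name = "::".join(parts)
    if rest == "v":  # "v" encodes an empty parameter list
        name += "()"
    elif rest:
        raise ValueError("unsupported parameter encoding: " + rest)
    return prefix + name
```

With this helper, _ZN10tensorflow8GraphDefC1Ev decodes to the default constructor tensorflow::GraphDef::GraphDef(), and _ZTIN10tensorflow8OpKernelE (reported in a related build issue on this page) decodes to the typeinfo for tensorflow::OpKernel. A missing constructor definition is consistent with the diagnosis that the generated graph.pb.cc object code is not being linked into the extension, though this decoding alone does not prove it.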
Steps to reproduce
default_opts="
--cxxopt=-D_GLIBCXX_USE_CXX11_ABI=0 \
--copt=-O2 \
--copt=-Wformat \
--copt=-Wformat-security \
--copt=-fstack-protector \
--copt=-fPIC \
--copt=-fpic \
--linkopt=-znoexecstack \
--linkopt=-zrelro \
--linkopt=-znow \
--linkopt=-fstack-protector"
mkl_opts="--config=mkl_threadpool \
--define build_with_mkl_dnn_v1_only=true \
--copt=-DENABLE_INTEL_MKL_BFLOAT16 \
--copt=-march=skylake-avx512"
test_opts="--nocache_test_results \
--test_output=all \
--verbose_failures \
--test_verbose_timeout_warnings \
--flaky_test_attempts 1 \
--test_timeout 99999999 \
--test_size_filters=small,medium,large,enormous \
-c opt \
--keep_going"
bazel test ${default_opts} ${mkl_opts} ${test_opts} -- //tensorflow/core/grappler/optimizers:mkl_remapper_test
split+concat fusion optimization
Goal
Optimize performance through split+concat fusion
Problem Description
In some recommendation models, fusing split and concat offers a potential performance gain.
The steps to reproduce the performance issue will be added later.
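The issue does not say which split/concat pattern is targeted, so the following is only a sketch under an assumed pattern: a tensor is split along one axis and some of the pieces are immediately concatenated again along the same axis. In that case the intermediate pieces need not be materialized, because the output can be filled directly from slices of the input. The function names and the 1-D list representation are illustrative, not DeepRec code.

```python
from typing import List, Sequence


def split(x: Sequence[float], sizes: Sequence[int]) -> List[List[float]]:
    """Split x into consecutive pieces with the given sizes."""
    pieces, start = [], 0
    for size in sizes:
        pieces.append(list(x[start:start + size]))
        start += size
    return pieces


def concat(pieces: Sequence[Sequence[float]]) -> List[float]:
    """Concatenate pieces into one flat list."""
    return [v for piece in pieces for v in piece]


def split_concat_unfused(x, sizes, order):
    """Reference path: materialize every piece, then concatenate a selection."""
    pieces = split(x, sizes)
    return concat([pieces[i] for i in order])


def split_concat_fused(x, sizes, order):
    """Fused path: compute piece offsets once and copy slices straight out."""
    offsets = [0]
    for size in sizes:
        offsets.append(offsets[-1] + size)
    out: List[float] = []
    for i in order:
        out.extend(x[offsets[i]:offsets[i + 1]])
    return out
```

Both paths return the same values; the fused one skips the temporary per-piece buffers. A real graph rewrite would additionally have to check that the split outputs have no other consumers.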
Requirement Details
Test
Code Style and commit
Maintain
Definition of Done
Migrate six embedding ops to DeepRec and make sure models from the model zoo benefit from these optimized ops.
sparseEmbedding-base (P0), kernel from https://github.com/intel-sandbox/applications.ai.easyrec.inference-opt/blob/main/ops/embedding_ops/sparseEmbedding-base-fp32-avx512.cc
sparseEmbedding-sparseInput (P1), kernel from https://github.com/intel-sandbox/applications.ai.easyrec.inference-opt/blob/main/ops/embedding_ops/sparseEmbedding-sparseInput-fp32-avx512.cc
sparseEmbedding-select (P1), kernel from https://github.com/intel-sandbox/applications.ai.easyrec.inference-opt/blob/main/ops/embedding_ops/sparseEmbedding-select-fp32-avx512.cc
sparseEmbedding-stringsplit (P1), kernel from https://github.com/intel-sandbox/applications.ai.easyrec.inference-opt/blob/main/ops/embedding_ops/sparseEmbedding-stringsplit-fp32-avx512.cc
sparseEmbedding-bucketized (P2), kernel from https://github.com/intel-sandbox/applications.ai.easyrec.inference-opt/blob/main/ops/embedding_ops/sparseEmbedding-bucketized-fp32-avx512.cc
sparseEmbedding-multiweights (P2), kernel from https://github.com/intel-sandbox/applications.ai.easyrec.inference-opt/blob/main/ops/embedding_ops/multi-sparseEmbedding-base-fp32-avx512.cc
https://github.com/intel-sandbox/applications.ai.easyrec.inference-opt
Features Request
Test
Code Style and commit
Maintain
#1 regarding the permutations which seem to be working only for 2D Tensors on bf16 (they don’t work for 3D and 4D Tensors).
concat+cast fusion optimization
Goal
Optimize performance through concat+cast fusion
Problem Description
In some recommendation models, for example DLRM, fusing concat and cast offers a potential performance gain once bf16 is enabled in DeepRec.
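To make the fusion idea concrete, here is a minimal sketch that assumes fp32 inputs are concatenated and then cast to bf16. The unfused path writes an fp32 buffer and then reads it again to cast; the fused path converts each value while copying it into the output. The function names are hypothetical, and the conversion simply truncates the float32 bit pattern to its upper 16 bits for brevity (production kernels commonly round to nearest even instead).

```python
import struct
from typing import List, Sequence


def float_to_bf16_bits(x: float) -> int:
    """Return the upper 16 bits of the float32 encoding of x (truncation)."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16


def concat_then_cast(parts: Sequence[Sequence[float]]) -> List[int]:
    """Unfused: build the fp32 concat first, then cast in a second pass."""
    joined = [v for part in parts for v in part]   # pass 1: fp32 buffer
    return [float_to_bf16_bits(v) for v in joined]  # pass 2: cast


def fused_concat_cast(parts: Sequence[Sequence[float]]) -> List[int]:
    """Fused: cast each element while copying it into the output."""
    out: List[int] = []
    for part in parts:
        for v in part:
            out.append(float_to_bf16_bits(v))
    return out
```

The two functions produce identical bit patterns; the fused version avoids the intermediate fp32 buffer and one full pass over memory, which is where the potential gain would come from.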
Here is the step to reproduce the performance issue.
Requirement Details
Test
Code Style and commit
Maintain
Definition of Done
Because some ops have been removed, 'unknown op' errors occur. To reproduce, run the WDL model in modelzoo with python train.py --steps 1 --no_eval --tf
Other info / logs
2022-06-21 10:03:02.483799: E tensorflow/core/framework/op_kernel.cc:1575] OpKernel ('op: "_MklQuantizedConv2DWithBiasReluAndSum" device_type: "CPU" constraint { name: "Tinput" allowed_values { list { type: DT_QINT8 } } } constraint { name: "Tfilter" allowed_values { list { type: DT_QINT8 } } } constraint { name: "out_type" allowed_values { list { type: DT_QINT32 } } } label: "QuantizedMklOp"') for unknown op: _MklQuantizedConv2DWithBiasReluAndSum
2022-06-21 10:03:02.483837: E tensorflow/core/framework/op_kernel.cc:1575] OpKernel ('op: "_MklQuantizedConv2DWithBiasReluAndSum" device_type: "CPU" constraint { name: "Tinput" allowed_values { list { type: DT_QUINT8 } } } constraint { name: "Tfilter" allowed_values { list { type: DT_QINT8 } } } constraint { name: "out_type" allowed_values { list { type: DT_QINT32 } } } label: "QuantizedMklOp"') for unknown op: _MklQuantizedConv2DWithBiasReluAndSum
2022-06-21 10:03:02.485436: E tensorflow/core/framework/op_kernel.cc:1575] OpKernel ('op: "QuantizedConv2DWithBiasReluAndSum" device_type: "CPU" constraint { name: "Tinput" allowed_values { list { type: DT_QINT8 } } } constraint { name: "Tfilter" allowed_values { list { type: DT_QINT8 } } } constraint { name: "out_type" allowed_values { list { type: DT_QINT32 } } }') for unknown op: QuantizedConv2DWithBiasReluAndSum
2022-06-21 10:03:02.485458: E tensorflow/core/framework/op_kernel.cc:1575] OpKernel ('op: "QuantizedConv2DWithBiasReluAndSum" device_type: "CPU" constraint { name: "Tinput" allowed_values { list { type: DT_QUINT8 } } } constraint { name: "Tfilter" allowed_values { list { type: DT_QINT8 } } } constraint { name: "out_type" allowed_values { list { type: DT_QINT32 } } }') for unknown op: QuantizedConv2DWithBiasReluAndSum
2022-06-21 10:03:02.485469: E tensorflow/core/framework/op_kernel.cc:1575] OpKernel ('op: "QuantizedConv2DWithBiasReluAndSumAndRequantize" device_type: "CPU" constraint { name: "Tinput" allowed_values { list { type: DT_QINT8 } } } constraint { name: "Tfilter" allowed_values { list { type: DT_QINT8 } } } constraint { name: "out_type" allowed_values { list { type: DT_QINT8 } } }') for unknown op: QuantizedConv2DWithBiasReluAndSumAndRequantize
2022-06-21 10:03:02.485479: E tensorflow/core/framework/op_kernel.cc:1575] OpKernel ('op: "QuantizedConv2DWithBiasReluAndSumAndRequantize" device_type: "CPU" constraint { name: "Tinput" allowed_values { list { type: DT_QUINT8 } } } constraint { name: "Tfilter" allowed_values { list { type: DT_QINT8 } } } constraint { name: "out_type" allowed_values { list { type: DT_QINT8 } } }') for unknown op: QuantizedConv2DWithBiasReluAndSumAndRequantize
2022-06-21 10:03:02.485488: E tensorflow/core/framework/op_kernel.cc:1575] OpKernel ('op: "QuantizedConv2DWithBiasReluAndSumAndRequantize" device_type: "CPU" constraint { name: "Tinput" allowed_values { list { type: DT_QINT8 } } } constraint { name: "Tfilter" allowed_values { list { type: DT_QINT8 } } } constraint { name: "out_type" allowed_values { list { type: DT_QUINT8 } } }') for unknown op: QuantizedConv2DWithBiasReluAndSumAndRequantize
2022-06-21 10:03:02.485497: E tensorflow/core/framework/op_kernel.cc:1575] OpKernel ('op: "QuantizedConv2DWithBiasReluAndSumAndRequantize" device_type: "CPU" constraint { name: "Tinput" allowed_values { list { type: DT_QUINT8 } } } constraint { name: "Tfilter" allowed_values { list { type: DT_QINT8 } } } constraint { name: "out_type" allowed_values { list { type: DT_QUINT8 } } }') for unknown op: QuantizedConv2DWithBiasReluAndSumAndRequantize
2022-06-21 10:03:02.485510: E tensorflow/core/framework/op_kernel.cc:1575] OpKernel ('op: "_MklQuantizedConv2DWithBiasReluAndSumAndRequantize" device_type: "CPU" constraint { name: "Tinput" allowed_values { list { type: DT_QINT8 } } } constraint { name: "Tfilter" allowed_values { list { type: DT_QINT8 } } } constraint { name: "Tbias" allowed_values { list { type: DT_FLOAT } } } constraint { name: "out_type" allowed_values { list { type: DT_QINT8 } } } label: "QuantizedMklOp"') for unknown op: _MklQuantizedConv2DWithBiasReluAndSumAndRequantize
2022-06-21 10:03:02.485520: E tensorflow/core/framework/op_kernel.cc:1575] OpKernel ('op: "_MklQuantizedConv2DWithBiasReluAndSumAndRequantize" device_type: "CPU" constraint { name: "Tinput" allowed_values { list { type: DT_QUINT8 } } } constraint { name: "Tfilter" allowed_values { list { type: DT_QINT8 } } } constraint { name: "Tbias" allowed_values { list { type: DT_FLOAT } } } constraint { name: "out_type" allowed_values { list { type: DT_QINT8 } } } label: "QuantizedMklOp"') for unknown op: _MklQuantizedConv2DWithBiasReluAndSumAndRequantize
2022-06-21 10:03:02.485530: E tensorflow/core/framework/op_kernel.cc:1575] OpKernel ('op: "_MklQuantizedConv2DWithBiasReluAndSumAndRequantize" device_type: "CPU" constraint { name: "Tinput" allowed_values { list { type: DT_QINT8 } } } constraint { name: "Tfilter" allowed_values { list { type: DT_QINT8 } } } constraint { name: "Tbias" allowed_values { list { type: DT_FLOAT } } } constraint { name: "out_type" allowed_values { list { type: DT_QUINT8 } } } label: "QuantizedMklOp"') for unknown op: _MklQuantizedConv2DWithBiasReluAndSumAndRequantize
2022-06-21 10:03:02.485540: E tensorflow/core/framework/op_kernel.cc:1575] OpKernel ('op: "_MklQuantizedConv2DWithBiasReluAndSumAndRequantize" device_type: "CPU" constraint { name: "Tinput" allowed_values { list { type: DT_QUINT8 } } } constraint { name: "Tfilter" allowed_values { list { type: DT_QINT8 } } } constraint { name: "Tbias" allowed_values { list { type: DT_FLOAT } } } constraint { name: "out_type" allowed_values { list { type: DT_QUINT8 } } } label: "QuantizedMklOp"') for unknown op: _MklQuantizedConv2DWithBiasReluAndSumAndRequantize
2022-06-21 10:03:02.485549: E tensorflow/core/framework/op_kernel.cc:1575] OpKernel ('op: "_MklQuantizedConv2DWithBiasReluAndSumAndRequantize" device_type: "CPU" constraint { name: "Tinput" allowed_values { list { type: DT_QINT8 } } } constraint { name: "Tfilter" allowed_values { list { type: DT_QINT8 } } } constraint { name: "Tbias" allowed_values { list { type: DT_QINT32 } } } constraint { name: "out_type" allowed_values { list { type: DT_QINT8 } } } label: "QuantizedMklOp"') for unknown op: _MklQuantizedConv2DWithBiasReluAndSumAndRequantize
2022-06-21 10:03:02.485557: E tensorflow/core/framework/op_kernel.cc:1575] OpKernel ('op: "_MklQuantizedConv2DWithBiasReluAndSumAndRequantize" device_type: "CPU" constraint { name: "Tinput" allowed_values { list { type: DT_QUINT8 } } } constraint { name: "Tfilter" allowed_values { list { type: DT_QINT8 } } } constraint { name: "Tbias" allowed_values { list { type: DT_QINT32 } } } constraint { name: "out_type" allowed_values { list { type: DT_QINT8 } } } label: "QuantizedMklOp"') for unknown op: _MklQuantizedConv2DWithBiasReluAndSumAndRequantize
2022-06-21 10:03:02.485565: E tensorflow/core/framework/op_kernel.cc:1575] OpKernel ('op: "_MklQuantizedConv2DWithBiasReluAndSumAndRequantize" device_type: "CPU" constraint { name: "Tinput" allowed_values { list { type: DT_QINT8 } } } constraint { name: "Tfilter" allowed_values { list { type: DT_QINT8 } } } constraint { name: "Tbias" allowed_values { list { type: DT_QINT32 } } } constraint { name: "out_type" allowed_values { list { type: DT_QUINT8 } } } label: "QuantizedMklOp"') for unknown op: _MklQuantizedConv2DWithBiasReluAndSumAndRequantize
2022-06-21 10:03:02.485576: E tensorflow/core/framework/op_kernel.cc:1575] OpKernel ('op: "_MklQuantizedConv2DWithBiasReluAndSumAndRequantize" device_type: "CPU" constraint { name: "Tinput" allowed_values { list { type: DT_QUINT8 } } } constraint { name: "Tfilter" allowed_values { list { type: DT_QINT8 } } } constraint { name: "Tbias" allowed_values { list { type: DT_QINT32 } } } constraint { name: "out_type" allowed_values { list { type: DT_QUINT8 } } } label: "QuantizedMklOp"') for unknown op: _MklQuantizedConv2DWithBiasReluAndSumAndRequantize
Rebuild MMoE to update API and Enable DeepRec Features
Goal
Rebuild MMoE to update API and enable DeepRec Features.
Requirement Details
Features list
Enable the following DeepRec features (documentation from Alibaba: https://deeprec.readthedocs.io/zh/latest/index.html):
8) Auto Micro Batch same with DeepRec-AI#127
9) FusedEmbedding API, embedding fusion
10) Smart Stage same with DeepRec-AI#122
11) Auto Graph Fusion DeepRec-AI#144
12) CPU Memory Optimization:START_STATISTIC_STEP, STOP_STATISTIC_STEP, jemalloc
14) AdamAsync Optimizer
15) BF16
1) Embedding Variable
7) GRPC++ and StarServer
13) Incremental Checkpoint
14) AdagradDecay
2) EmbeddingVariable advanced features:Embedding Elimination
3) EmbeddingVariable advanced feature:Embedding Filter
4) Dynamic-dimension Embedding Variable
5) Adaptive Embedding
17) WorkQueue
6) Multi-Hash Variable
Test
Other Requirements: Dockerfile and Documents
Code Style and commit
Maintain
Definition of Done
System information
Describe the problem
I was enabling the MLIR Python API in DeepRec. In BUILD, building MLIR depends on "//tensorflow/core:ops", so I added "//tensorflow/core:ops" to a BUILD file and built it. I then hit an error: undefined symbol: _ZTIN10tensorflow8OpKernelE.
Provide the exact sequence of commands / steps that you executed before running into the problem
Add "//tensorflow/core:ops" to cc_library( name = "_tf_stack" ) in tensorflow/python/BUILD (line 4863). Here is a screenshot after adding "//tensorflow/core:ops":
After revising tensorflow/python/BUILD, run:
$ ./configure
$ bazel build -c opt --config=opt //tensorflow/tools/pip_package:build_pip_package
Here is a screenshot of the error:
Any other info / logs
System information
Here is the problem
https://github.com/changqi1/DeepRec-deprecated/issues/49#issuecomment-1015280420
Enabling RDT technology in DeepRec
Goal
Manage the last-level cache (LLC) through a low-level API to improve performance
Problem Description
Intel RDT (Resource Director Technology) may help improve performance if cache management can be controlled through a low-level API in DeepRec, especially for weights that could stay in the LLC. The detailed design and requirements still need to be confirmed and will be updated once aligned with the customer.
https://www.intel.com/content/www/us/en/architecture-and-technology/resource-director-technology.html
reco_ops_list_ = gtl::FlatSet<string> {
"BatchMatMul", "BatchMatMulV2", "BiasAdd", "BiasAddGrad",
"_FusedMatMul", "_FusedBatchMatMul", "_FusedBatchMatMulV2",
- "Identity", "LeakyRelu", "LeakyReluGrad", "MatMul",
+ "LeakyRelu", "LeakyReluGrad", "MatMul",
"Relu", "ReluGrad", "Relu6", "Relu6Grad", "Gelu", "GeluGrad",
- "Tanh", "TanhGrad", "Reshape"
+ "Tanh", "TanhGrad"
};
Two problems have been identified so far:
INFO:tensorflow:input_layer/sparse_input_layer/input_layer/C10_embedding/C10_embedding_weights/fused_embedding_lookup_sparse/GatherV2 =
[[ 1.2214626 0.5241986 0.22209705 -0.56519794 0.12571095 0.26446024
0.3907412 0.00005793 -0.45320386 -0.69033164 0.3531894 -0.1513126
0.00713284 -1.1315851 0.12203985 0.23935615]
[-1.5699558 0.63231647 -0.5136872 0.18575184 -0.12131955 -1.4859123
0.8938838 -0.33873808 -0.24968442 -0.47764817 0.36187503 -0.14567816
-0.24810648 -1.3606204 -0.08617076 0.4501951 ]
[ 1.395356 -0.85390687 -0.7608217 1.0669864 1.191038 0.88764894
0.9451067 0.29302412 1.2512774 0.6840943 -0.20568915 -0.32980326
-0.42660442 0.54374695 -0.9136276 0.04837677]
[ 1.395356 -0.85390687 -0.7608217 1.0669864 1.191038 0.88764894
0.9451067 0.29302412 1.2512774 0.6840943 -0.20568915 -0.32980326
-0.42660442 0.54374695 -0.9136276 0.04837677]
[ 0.6147636 0.33874637 -0.7812209 1.2390836 1.8089103 -1.2311537
-0.43859923 1.3363832 -0.72441924 1.3167928 1.1064852 0.51790696
-0.24631402 1.2318567 1.4000374 -0.30377945]
[-0.3499662 -1.789908 0.48219246 0.2007537 0.7334909 -0.01890297
0.08424582 -0.9799169 -0.35487846 0.17760478 0.7782412 0.01907562
-0.5430275 -1.0409418 -0.06544966 -0.31106764]
[ 1.395356 -0.85390687 -0.7608217 1.0669864 1.191038 0.88764894
0.9451067 0.29302412 1.2512774 0.6840943 -0.20568915 -0.32980326
-0.42660442 0.54374695 -0.9136276 0.04837677]
[-0.5124917 0.45528954 0.7462012 0.20852847 1.4730995 0.8039012
-0.5750134 0.22652298 1.5296302 0.779812 1.460728 0.8999218
1.5914694 0.8920278 -1.1893805 1.916351 ]]
Fused grad inputs and outputs
Output:
INFO:tensorflow:head/gradients/input_layer/sparse_input_layer/input_layer/C10_embedding/C10_embedding_weights/fused_embedding_lookup_sparse/FusedEmbeddingSparsePostLookUp_grad/FusedEmbeddingSparsePostLookUpGrad =
[[ 0.00861822 0.0023349 -0.00688701 0.00269023 -0.00164793 0.00736784
-0.01025489 -0.00652598 0.00471746 0.00888411 -0.00231681 0.00083448
-0.00203576 -0.00289572 0.00719752 -0.00490604]
[ 0.00153809 -0.00080411 -0.00177121 0.00086962 -0.00095507 0.00141255
-0.00152518 -0.0010505 0.00122721 0.00121513 -0.00102509 -0.00052917
-0.00006346 -0.00056932 0.00194405 -0.00034838]
[ 0.00154482 -0.00019465 -0.00210896 0.00204525 -0.00301571 0.0016045
-0.00139093 -0.00215399 -0.00047724 0.00222365 0.00055881 -0.00044712
-0.00082448 0.00155544 0.00257766 0.00031559]
[-0.00136614 -0.00111764 0.00218892 -0.00170602 0.00054201 -0.00347374
0.00119696 0.00144338 -0.00078496 -0.00169556 0.00112028 -0.00118931
-0.00130751 -0.00075804 -0.00326457 -0.0000487 ]
[-0.00209746 0.001618 0.00076764 0.00073953 0.00029006 -0.00244137
0.00196408 0.00168557 0.00034245 -0.00137542 -0.00048502 -0.00033666
-0.00041434 0.00007309 -0.00190399 -0.00012118]
[ 0.00003232 -0.00172924 0.00824459 -0.00492863 0.00488566 -0.01386313
0.00917176 0.0094776 0.00208267 -0.01060377 0.00307963 -0.00334385
0.00790155 -0.00217232 -0.00438204 0.00839924]
[-0.003769 0.00456759 0.00200083 -0.00209491 0.00360792 -0.00388297
0.00045974 0.00181912 -0.00040632 -0.00027408 0.0035839 0.00000899
0.00114274 0.00211995 -0.00300257 0.00140855]
[-0.00272977 0.00265015 0.00149372 -0.00279601 0.00245979 -0.00382371
0.00186041 -0.00059908 0.00158117 -0.00263364 0.00022644 -0.00097163
0.00043103 0.00025143 -0.00277124 -0.00126195]],
Input:
head/gradients/input_layer/sparse_input_layer/input_layer/C10_embedding/C10_embedding_weights/Reshape_grad/Reshape =
[[-0.00861822 -0.0023349 0.00688701 -0.00269023 0.00164793 -0.00736784
0.01025489 0.00652598 -0.00471746 -0.00888411 0.00231681 -0.00083448
0.00203576 0.00289572 -0.00719752 0.00490604]
[-0.00153809 0.00080411 0.00177121 -0.00086962 0.00095507 -0.00141255
0.00152518 0.0010505 -0.00122721 -0.00121513 0.00102509 0.00052917
0.00006346 0.00056932 -0.00194405 0.00034838]
[-0.00154482 0.00019465 0.00210896 -0.00204525 0.00301571 -0.0016045
0.00139093 0.00215399 0.00047724 -0.00222365 -0.00055881 0.00044712
0.00082448 -0.00155544 -0.00257766 -0.00031559]
[ 0.00136614 0.00111764 -0.00218892 0.00170602 -0.00054201 0.00347374
-0.00119696 -0.00144338 0.00078496 0.00169556 -0.00112028 0.00118931
0.00130751 0.00075804 0.00326457 0.0000487 ]
[ 0.00209746 -0.001618 -0.00076764 -0.00073953 -0.00029006 0.00244137
-0.00196408 -0.00168557 -0.00034245 0.00137542 0.00048502 0.00033666
0.00041434 -0.00007309 0.00190399 0.00012118]
[-0.00003232 0.00172924 -0.00824459 0.00492863 -0.00488566 0.01386313
-0.00917176 -0.0094776 -0.00208267 0.01060377 -0.00307963 0.00334385
-0.00790155 0.00217232 0.00438204 -0.00839924]
[ 0.003769 -0.00456759 -0.00200083 0.00209491 -0.00360792 0.00388297
-0.00045974 -0.00181912 0.00040632 0.00027408 -0.0035839 -0.00000899
-0.00114274 -0.00211995 0.00300257 -0.00140855]
[ 0.00272977 -0.00265015 -0.00149372 0.00279601 -0.00245979 0.00382371
-0.00186041 0.00059908 -0.00158117 0.00263364 -0.00022644 0.00097163
-0.00043103 -0.00025143 0.00277124 0.00126195]],
input_layer/sparse_input_layer/input_layer/C10_embedding/C10_embedding_weights/fused_embedding_lookup_sparse/GatherV2 =
[[ 1.2214626 0.5241986 0.22209705 -0.56519794 0.12571095 0.26446024
0.3907412 0.00005793 -0.45320386 -0.69033164 0.3531894 -0.1513126
0.00713284 -1.1315851 0.12203985 0.23935615]
[-1.5699558 0.63231647 -0.5136872 0.18575184 -0.12131955 -1.4859123
0.8938838 -0.33873808 -0.24968442 -0.47764817 0.36187503 -0.14567816
-0.24810648 -1.3606204 -0.08617076 0.4501951 ]
[ 1.395356 -0.85390687 -0.7608217 1.0669864 1.191038 0.88764894
0.9451067 0.29302412 1.2512774 0.6840943 -0.20568915 -0.32980326
-0.42660442 0.54374695 -0.9136276 0.04837677]
[ 1.395356 -0.85390687 -0.7608217 1.0669864 1.191038 0.88764894
0.9451067 0.29302412 1.2512774 0.6840943 -0.20568915 -0.32980326
-0.42660442 0.54374695 -0.9136276 0.04837677]
[ 0.6147636 0.33874637 -0.7812209 1.2390836 1.8089103 -1.2311537
-0.43859923 1.3363832 -0.72441924 1.3167928 1.1064852 0.51790696
-0.24631402 1.2318567 1.4000374 -0.30377945]
[-0.3499662 -1.789908 0.48219246 0.2007537 0.7334909 -0.01890297
0.08424582 -0.9799169 -0.35487846 0.17760478 0.7782412 0.01907562
-0.5430275 -1.0409418 -0.06544966 -0.31106764]
[ 1.395356 -0.85390687 -0.7608217 1.0669864 1.191038 0.88764894
0.9451067 0.29302412 1.2512774 0.6840943 -0.20568915 -0.32980326
-0.42660442 0.54374695 -0.9136276 0.04837677]
[-0.5124917 0.45528954 0.7462012 0.20852847 1.4730995 0.8039012
-0.5750134 0.22652298 1.5296302 0.779812 1.460728 0.8999218
1.5914694 0.8920278 -1.1893805 1.916351 ]],
input_layer/sparse_input_layer/input_layer/C10_embedding/C10_embedding_weights/fused_embedding_lookup_sparse/FusedEmbeddingSparsePreLookUp:0 =
[2816 903 6681 6681 1309 1777 6681 5311],
input_layer/sparse_input_layer/input_layer/C10_embedding/C10_embedding_weights/fused_embedding_lookup_sparse/FusedEmbeddingSparsePostLookUp:1 = [1 1 1 1 1 1 1 1],
input_layer/sparse_input_layer/input_layer/C10_embedding/C10_embedding_weights/fused_embedding_lookup_sparse/FusedEmbeddingSparsePreLookUp:1 =
[[0 0]
[1 0]
[2 0]
[3 0]
[4 0]
[5 0]
[6 0]
[7 0]]
Unfused grad inputs and outputs:
Output: INFO:tensorflow:head/gradients/input_layer/sparse_input_layer/input_layer/C10_embedding/C10_embedding_weights/embedding_lookup_sparse_grad/SparseSegmentMeanGrad =
[[-0.00861822 -0.0023349 0.00688701 -0.00269023 0.00164793 -0.00736784
0.01025489 0.00652598 -0.00471746 -0.00888411 0.00231681 -0.00083448
0.00203576 0.00289572 -0.00719752 0.00490604]
[-0.00153809 0.00080411 0.00177121 -0.00086962 0.00095507 -0.00141255
0.00152518 0.0010505 -0.00122721 -0.00121513 0.00102509 0.00052917
0.00006346 0.00056932 -0.00194405 0.00034838]
[ 0.00359032 -0.00325531 -0.00208078 0.00175567 -0.00113423 0.00575221
-0.00026577 -0.00110852 0.00166852 -0.00025401 -0.00526298 0.00162744
0.00098925 -0.00291735 0.00368948 -0.00167544]
[ 0.00209746 -0.001618 -0.00076764 -0.00073953 -0.00029006 0.00244137
-0.00196408 -0.00168557 -0.00034245 0.00137542 0.00048502 0.00033666
0.00041434 -0.00007309 0.00190399 0.00012118]
[-0.00003232 0.00172924 -0.00824459 0.00492863 -0.00488566 0.01386313
-0.00917176 -0.0094776 -0.00208267 0.01060377 -0.00307963 0.00334385
-0.00790155 0.00217232 0.00438204 -0.00839924]
[ 0.00272977 -0.00265015 -0.00149372 0.00279601 -0.00245979 0.00382371
-0.00186041 0.00059908 -0.00158117 0.00263364 -0.00022644 0.00097163
-0.00043103 -0.00025143 0.00277124 0.00126195]],
Input:
head/gradients/input_layer/sparse_input_layer/input_layer/C10_embedding/C10_embedding_weights/embedding_lookup_sparse_grad/strided_slice = 6,
head/gradients/input_layer/sparse_input_layer/input_layer/C10_embedding/C10_embedding_weights_grad/tuple/control_dependency_1 =
[[-0.00861822 -0.0023349 0.00688701 -0.00269023 0.00164793 -0.00736784
0.01025489 0.00652598 -0.00471746 -0.00888411 0.00231681 -0.00083448
0.00203576 0.00289572 -0.00719752 0.00490604]
[-0.00153809 0.00080411 0.00177121 -0.00086962 0.00095507 -0.00141255
0.00152518 0.0010505 -0.00122721 -0.00121513 0.00102509 0.00052917
0.00006346 0.00056932 -0.00194405 0.00034838]
[-0.00154482 0.00019465 0.00210896 -0.00204525 0.00301571 -0.0016045
0.00139093 0.00215399 0.00047724 -0.00222365 -0.00055881 0.00044712
0.00082448 -0.00155544 -0.00257766 -0.00031559]
[ 0.00136614 0.00111764 -0.00218892 0.00170602 -0.00054201 0.00347374
-0.00119696 -0.00144338 0.00078496 0.00169556 -0.00112028 0.00118931
0.00130751 0.00075804 0.00326457 0.0000487 ]
[ 0.00209746 -0.001618 -0.00076764 -0.00073953 -0.00029006 0.00244137
-0.00196408 -0.00168557 -0.00034245 0.00137542 0.00048502 0.00033666
0.00041434 -0.00007309 0.00190399 0.00012118]
[-0.00003232 0.00172924 -0.00824459 0.00492863 -0.00488566 0.01386313
-0.00917176 -0.0094776 -0.00208267 0.01060377 -0.00307963 0.00334385
-0.00790155 0.00217232 0.00438204 -0.00839924]
[ 0.003769 -0.00456759 -0.00200083 0.00209491 -0.00360792 0.00388297
-0.00045974 -0.00181912 0.00040632 0.00027408 -0.0035839 -0.00000899
-0.00114274 -0.00211995 0.00300257 -0.00140855]
[ 0.00272977 -0.00265015 -0.00149372 0.00279601 -0.00245979 0.00382371
-0.00186041 0.00059908 -0.00158117 0.00263364 -0.00022644 0.00097163
-0.00043103 -0.00025143 0.00277124 0.00126195]],
input_layer/sparse_input_layer/input_layer/C10_embedding/C10_embedding_weights/embedding_lookup_sparse/UniqueWithCounts:1 = [0 1 2 2 3 4 2 5],
input_layer/sparse_input_layer/input_layer/C10_embedding/C10_embedding_weights/embedding_lookup_sparse/Cast = [0 1 2 3 4 5 6 7]
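The unfused numbers above can be cross-checked with a small reference computation. The sketch below follows my understanding of SparseSegmentMeanGrad: each gradient row is divided by the number of entries in its segment and accumulated into the output row selected by the index. Mapping the dumped tensors to the arguments (control_dependency_1 as grad, UniqueWithCounts:1 as indices, Cast as segment_ids, strided_slice as output_dim0) is an assumption, and the function is plain Python, not the TensorFlow kernel.

```python
from collections import Counter
from typing import List, Sequence


def sparse_segment_mean_grad(grad: Sequence[Sequence[float]],
                             indices: Sequence[int],
                             segment_ids: Sequence[int],
                             output_dim0: int) -> List[List[float]]:
    """output[indices[i]] += grad[segment_ids[i]] / size_of_that_segment."""
    width = len(grad[0])
    counts = Counter(segment_ids)  # number of entries per segment
    out = [[0.0] * width for _ in range(output_dim0)]
    for idx, seg in zip(indices, segment_ids):
        scale = 1.0 / counts[seg]
        for c in range(width):
            out[idx][c] += grad[seg][c] * scale
    return out


# First column of the dumped input gradient, one value per row.
grad_col0 = [[-0.00861822], [-0.00153809], [-0.00154482], [0.00136614],
             [0.00209746], [-0.00003232], [0.003769], [0.00272977]]
result = sparse_segment_mean_grad(
    grad_col0,
    indices=[0, 1, 2, 2, 3, 4, 2, 5],
    segment_ids=[0, 1, 2, 3, 4, 5, 6, 7],
    output_dim0=6,
)
```

With these inputs the three rows that map to index 2 (the repeated id 6681) are accumulated, which reproduces the first-column value 0.00359032 of row 2 in the unfused output. For comparison, the fused dump above has 8 rows, each equal to the corresponding input row with the sign flipped; whether those differences are the two problems mentioned is not stated in the text.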
Rebuild ESMM to update API and Enable DeepRec Features
Goal
Rebuild ESMM to update API and enable DeepRec Features.
Requirement Details
Features list
Enable the following DeepRec features (documentation from Alibaba: https://deeprec.readthedocs.io/zh/latest/index.html):
8) Auto Micro Batch same with DeepRec-AI#127
9) FusedEmbedding API, embedding fusion
10) Smart Stage same with DeepRec-AI#122
11) Auto Graph Fusion DeepRec-AI#144
12) CPU Memory Optimization:START_STATISTIC_STEP, STOP_STATISTIC_STEP, jemalloc
14) AdamAsync Optimizer
15) BF16
1) Embedding Variable
7) GRPC++ and StarServer
13) Incremental Checkpoint
14) AdagradDecay
2) EmbeddingVariable advanced features:Embedding Elimination
3) EmbeddingVariable advanced feature:Embedding Filter
4) Dynamic-dimension Embedding Variable
5) Adaptive Embedding
17) WorkQueue
6) Multi-Hash Variable
Test
Other Requirements: Dockerfile and Documents
Code Style and commit
Maintain
Definition of Done
unsorted_segment_sum operator optimization
Goal
Optimize unsorted_segment_sum operator performance
Problem Description
In some recommendation models, for example DLRM, the unsorted_segment_sum operator adds noticeable overhead, so reducing its cost is important.
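When optimizing this operator, a plain reference implementation is useful for checking that a faster kernel still produces the same result. The sketch below follows the documented semantics of unsorted_segment_sum (each data row is added to the output row named by its segment id, negative ids are dropped, and segments that receive no rows stay zero). It is an illustrative pure-Python version for 2-D data, not the DeepRec kernel.

```python
from typing import List, Sequence


def unsorted_segment_sum(data: Sequence[Sequence[float]],
                         segment_ids: Sequence[int],
                         num_segments: int) -> List[List[float]]:
    """Sum the rows of data into num_segments buckets chosen by segment_ids."""
    width = len(data[0]) if data else 0
    out = [[0.0] * width for _ in range(num_segments)]
    for row, seg in zip(data, segment_ids):
        if seg < 0:
            continue  # negative segment ids are dropped
        for c, value in enumerate(row):
            out[seg][c] += value
    return out
```

Because the segment ids are unsorted, consecutive rows can scatter to arbitrary output rows. That scattered write pattern is one plausible source of the overhead mentioned above, although the issue itself does not give a profile.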
Here is the step to reproduce the performance issue.
Requirement Details
Definition of Done
Rebuild SimpleMultiTask to update API and Enable DeepRec Features
Goal
Rebuild SimpleMultiTask to update API and enable DeepRec Features.
Requirement Details
Features list
Enable the following DeepRec features (documentation from Alibaba: https://deeprec.readthedocs.io/zh/latest/index.html):
8) Auto Micro Batch same with DeepRec-AI#127
9) FusedEmbedding API, embedding fusion
10) Smart Stage same with DeepRec-AI#122
11) Auto Graph Fusion DeepRec-AI#144
12) CPU Memory Optimization:START_STATISTIC_STEP, STOP_STATISTIC_STEP, jemalloc
14) AdamAsync Optimizer
15) BF16
1) Embedding Variable
7) GRPC++ and StarServer
13) Incremental Checkpoint
14) AdagradDecay
2) EmbeddingVariable advanced features:Embedding Elimination
3) EmbeddingVariable advanced feature:Embedding Filter
4) Dynamic-dimension Embedding Variable
5) Adaptive Embedding
17) WorkQueue
6) Multi-Hash Variable
Test
Other Requirements: Dockerfile and Documents
Code Style and commit
Maintain
Definition of Done
Rebuild DBMTL to update API and Enable DeepRec Features
Goal
Rebuild DBMTL to update API and enable DeepRec Features.
Requirement Details
Features list
Enable the following DeepRec features (documentation from Alibaba: https://deeprec.readthedocs.io/zh/latest/index.html):
8) Auto Micro Batch same with DeepRec-AI#127
9) FusedEmbedding API, embedding fusion
10) Smart Stage same with DeepRec-AI#122
11) Auto Graph Fusion DeepRec-AI#144
12) CPU Memory Optimization:START_STATISTIC_STEP, STOP_STATISTIC_STEP, jemalloc
14) AdamAsync Optimizer
15) BF16
1) Embedding Variable
7) GRPC++ and StarServer
13) Incremental Checkpoint
14) AdagradDecay
2) EmbeddingVariable advanced features:Embedding Elimination
3) EmbeddingVariable advanced feature:Embedding Filter
4) Dynamic-dimension Embedding Variable
5) Adaptive Embedding
17) WorkQueue
6) Multi-Hash Variable
Test
Other Requirements: Dockerfile and Documents
Code Style and commit
Maintain
Definition of Done