microsoft / nnfusion

A flexible and efficient deep neural network (DNN) compiler that generates high-performance executable from a DNN model description.

License: MIT License


nnfusion's Introduction

NNFusion is a flexible and efficient DNN compiler that can generate high-performance executables from a DNN model description (e.g., TensorFlow frozen models and the ONNX format). With an efficient compiler at its core, NNFusion aims to:

  • facilitate full-stack model optimization
  • provide framework-free code generation capability
  • support new accelerator devices as inference targets

Who should consider using NNFusion?

  • Developers who want to speed up the execution of their pre-defined or pre-trained DNN models.
  • Developers who want to deploy their pre-trained models as framework-free source code with minimal library dependencies.
  • Researchers who want to quickly try new compiler optimization ideas or customize optimizations for specific models.

Highlight features

  • Provide a full-stack optimization mechanism, including:
    • Data-flow graph optimizations, e.g., CSE, compile-time constant folding, etc.
    • Model-specific kernel selection, kernel co-scheduling, kernel fusion and auto kernel tuner integration.
    • Static memory layout and placement optimizations.
  • Provide ahead-of-time and source-to-source (model-to-code) compilation to reduce runtime overhead and remove library/framework dependencies.
  • Support popular DNN model formats including TensorFlow and ONNX as input models.
  • Support customized optimization in an easier and more efficient way, e.g., by directly replacing hand-crafted kernels in the generated human-readable code.
  • Support commonly used devices like CUDA GPUs, ROCm GPUs and CPU.
  • Support parallel training via SuperScaler

Get Started

Quick Start with Docker Image

For end users, simply use Docker to compile your model and generate a high-performance executable.

NNFusion supports and is well tested on Ubuntu 16.04 and 18.04 with a CUDA GPU.

You need nvidia-docker installed on your device for the following steps.

We will use a simple TensorFlow LSTM inference model as an example. You can download a frozen version from our model zoo:

wget https://nnfusion.blob.core.windows.net/models/tensorflow/frozen_lstm_l8s8h256_bs1.pb

To get started with your own model, please check Supported Models to see whether it is supported, and freeze it according to Freeze Your Model.

  1. Pull the Docker image: docker pull nnfusion/cuda:10.2-cudnn7-devel-ubuntu18.04

  2. Run a Docker container with the given image:

docker run -t --name [YOUR_CONTAINER_NAME] -d nnfusion/cuda:10.2-cudnn7-devel-ubuntu18.04
docker start [YOUR_CONTAINER_NAME]
docker exec -it [YOUR_CONTAINER_NAME] bash
  3. Put your model in the container

On the host, you can use docker cp host_path [YOUR_CONTAINER_NAME]:container_path to copy your model into the container, or use docker run -t -i -v <host_dir>:<container_dir> to map a host directory into the container.

  4. Compile the model

Once the model is prepared, we can compile it in the container and run it to see the performance:

cd /root
nnfusion path/[YOUR_MODEL_FILE]

Note: if you are using an ONNX model, the compile command is nnfusion path/[YOUR_MODEL_FILE] -f onnx

  5. Build and run the compiled model
cd /root/nnfusion_rt/cuda_codegen
cmake . && make -j
./main_test
  6. The output of NNFusion consists of tensors with values and the model iteration times. Using the example model frozen_lstm_l8s8h256_bs1.pb, you will see the model's output and a performance summary:
Result_2261_0:
8.921492e-03 1.182088e-02 8.937406e-03 7.932204e-03 1.574194e-02 3.844390e-03 -1.505094e-02 -1.112035e-02 5.026608e-03 -8.032205e-03  .. (size = 256, ends with 1.357487e-02);
Result_2261_0:
8.921492e-03 1.182088e-02 8.937406e-03 7.932204e-03 1.574194e-02 3.844390e-03 -1.505094e-02 -1.112035e-02 5.026608e-03 -8.032205e-03  .. (size = 256, ends with 1.357487e-02);
...
Iteration time 2.735200 ms
Iteration time 2.741376 ms
Iteration time 2.733440 ms
Iteration time 2.726528 ms
Iteration time 2.731616 ms
Iteration time 2.736544 ms
Iteration time 2.728576 ms
Iteration time 2.733440 ms
Iteration time 2.732992 ms
Iteration time 2.729536 ms
Iteration time 2.726656 ms
Iteration time 2.732512 ms
Iteration time 2.732032 ms
Iteration time 2.730208 ms
Iteration time 2.732960 ms
Summary: [min, max, mean] = [2.724704, 2.968352, 2.921987] ms

For more detailed information on NNFusion usage, please refer to NNFusion Usage.

For TensorFlow users, you can refer to the Kernel Tuner Tutorial to learn how to compile a TensorFlow model and tune each operator in the model to generate end-to-end source code.

For a detailed example about training, please refer to How to use NNFusion Python interface for inference/training.

Build from Source Code

Researchers or contributors who want to do more research on optimizing model compilation can build NNFusion from source code. To build from source code, please read the following documents:

  1. Read the Before Started page to see supported CUDA GPUs and required libraries.
  2. Read the Build Guide for more information on how to build and install NNFusion on your native system or in the Docker container.
  3. After building and installing NNFusion, please refer to the Compile Guide and Tool Usage to learn how to compile or optimize a DNN model.

Speedups on benchmarks

To learn how much performance improvement NNFusion can achieve on some typical DNN models, please refer to the README page on our OSDI'20 artifact branch.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

To contribute, please refer to Contribution Guide to see more details.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Reference

Please cite NNFusion or Rammer in your publications if it helps your research:

@inproceedings{rammer-osdi20,
author = {Lingxiao Ma and Zhiqiang Xie and Zhi Yang and Jilong Xue and Youshan Miao and Wei Cui and Wenxiang Hu and Fan Yang and Lintao Zhang and Lidong Zhou},
title = {Rammer: Enabling Holistic Deep Learning Compiler Optimizations with rTasks},
booktitle = {14th {USENIX} Symposium on Operating Systems Design and Implementation ({OSDI} 20)},
year = {2020},
isbn = {978-1-939133-19-9},
pages = {881--897},
url = {https://www.usenix.org/conference/osdi20/presentation/ma},
publisher = {{USENIX} Association},
month = nov,
}

nnfusion's People

Contributors

alisachen98, cjkkkk, colinyoyo26, ghostplant, guoshzhao, heheda12345, jlxue, jsoref, leiwang1999, lijiansong, lynex, mzmssg, niupple, siahuat0727, tong-shao, wenxcs, xiayuqing0622, xiezhq-hermann, xysmlx, yiyione, yuxiaoguo, zyeric


nnfusion's Issues

[BUG] cmake failure

  1. git clone from master
  2. use docker to build from source code
  3. freeze model in docker
  4. compile model
  5. cmake in cuda_codegen
    The error occurs when executing step 5 for both TensorFlow and PyTorch models (a screenshot of the error is attached in the original issue).

[BUG] Compile Error in source code

πŸ› Bug

A compile error occurs at nnfusion/src/nnfusion/core/operators/generic_op/generic_op_define/Elementwise.cpp:6:27: error: non-local lambda expression cannot have a capture-default
To Reproduce
Steps to reproduce the behavior

  1. mkdir build && cd build && cmake .. && make -j6

[ 52%] Building CXX object src/nnfusion/core/operators/CMakeFiles/nnfusion_operators.dir/generic_op/generic_op_define/Elementwise.cpp.o
/home/liang/Documents/nnfusion/src/nnfusion/core/operators/generic_op/generic_op_define/Elementwise.cpp:6:27: error: non-local lambda expression cannot have a capture-default
6 | auto trans_elementwise = [&](std::shared_ptr<graph::GNode>& curr, const std::string& topi) {
  |                           ^
make[2]: *** [src/nnfusion/core/operators/CMakeFiles/nnfusion_operators.dir/build.make:433: src/nnfusion/core/operators/CMakeFiles/nnfusion_operators.dir/generic_op/generic_op_define/Elementwise.cpp.o] Error 1
make[2]: *** Waiting for unfinished jobs....
make[1]: *** [CMakeFiles/Makefile2:2058: src/nnfusion/core/operators/CMakeFiles/nnfusion_operators.dir/all] Error 2
make: *** [Makefile:149: all] Error 2
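
For context, C++ does not allow a capture-default on a lambda declared at namespace scope, since there are no enclosing local variables to capture; GCC 9 enforces this strictly. Below is a minimal sketch of the offending pattern and one possible fix, using a simplified stand-in for nnfusion's graph::GNode:

#include <memory>
#include <string>

struct GNode {};  // simplified stand-in for nnfusion's graph::GNode

// Rejected by GCC 9 at namespace scope:
//   auto trans_elementwise = [&](std::shared_ptr<GNode>& curr, const std::string& topi) { ... };

// Possible fix: drop the capture-default; a namespace-scope lambda has no
// enclosing local variables to capture anyway.
auto trans_elementwise = [](std::shared_ptr<GNode>& curr, const std::string& topi) {
    (void)curr;
    (void)topi;
    return true;
};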

Additional context

Protoc version 3.6.1
Cmake version 3.18.3
g++ version 9.3.0
Ubuntu 20.04 x86_64

CMake has passed

(base) liang@:~/Documents/nnfusion/build$ cmake ..

-- MSRAsia NNFusion Team(@nnfusion)
-- https://github.com/microsoft/nnfusion
-- 
-- Installation directory: /usr/local
-- thirdparty enabled
-- tools enabled
-- nnfusion enabled
-- unit tests enabled
-- Configuring done
-- Generating done
-- Build files have been written to: /home/liang/Documents/nnfusion

[BUG] Unexpected exit when codegen

πŸ› Bug

To Reproduce
Steps to reproduce the behavior:

  1. Inside the container provided;

Expected behavior

Codegen should complete without exiting unexpectedly.

Additional context

  • Special container to reproduce the bug is provided.

[Story] [Oct] Code Refactor Plan Proposal

The purpose of the code refactor is to improve our code quality and usability. The approach should be considered from two sides: the user's perspective and our developers' perspective.

From the user's perspective, the main goal is to let users use NNF as a real tool: compile a model and understand the procedure easily:

  1. Building stages
    Our building scripts currently support installing dependencies and building in a native environment or inside a container, but we haven't considered much about what users actually encounter in real scenarios.
    Take #48 for example: the user didn't read our docs and didn't know we support Ubuntu 16/18, not 20. The system version should therefore be checked by the building scripts.
    Work items:

  2. Testing stages
    Our testing utilities are tricky and hard to use: users must have NVIDIA/CUDA hardware and configure it through a specific config file, and the unit tests don't check for hardware, which results in failing tests.
    We should make testing easier and produce test reports that users can easily understand.
    Besides, we should check coverage for each PR and report whether the code change is covered by tests; this will help us improve code quality.

  3. Validation stages
    NNF is currently more of a tech-validation-stage project: we configure the environment by hand and users can do validation. We need to make the scripts and the NNFusion CLI easier for users to compile a model.
    One more problem: NNF needs a frozen model, but freezing a model is not a standard procedure for users, so NNF may take a faulty frozen model as input. Should we provide a standard script to freeze models?
    Work item:

  • User interface for Inference and Training

From our developers' perspective:

  1. License problem:
    The Apache-2.0 license is somewhat strict and makes the code hard to modify, so we moved the code we didn't rewrite into the thirdparty folder. We need to rewrite that code and bring it back into our source tree; otherwise readers may get confused about where the code actually lives. This code is mainly related to the operator set, some core data types, and the importer frontends for TF and ONNX. We may discuss these in later sections;

  2. Operator set:
    The operator set we use originates from nGraph and has been amended with some ops of the "OperatorV2" type. The main goal is to migrate the whole operator set to "OperatorV2", or to a new class that is no longer hard-coded and can be added/removed/changed easily.
    Operators should also support serialization.

  3. Kernels:
    We currently have hard-coded kernels, Antares kernels (Antares IR), and kernel DB kernels. The features exist, but we don't provide a good mechanism for picking kernels among them, and the kernels' interfaces are not uniform.
    So in this part, we need to:
    First, design a general interface for all kernel providers, which would let us support more providers such as TVM.
    Second, design kernel selection policies for kernel providers.
    The new interface will give our optimization passes more flexibility to pick/change kernels.

  4. Code generator:
    This might be the hardest part of the refactor plan: the code generator is complex and integrates many features that interact with each other.
    The main goal is to make codegen much simpler, so it can easily be extended to support a "new" device with far less code change.

  5. Profiler
    Our profiler has some flaws. For example, it does not guarantee that input data is valid, which can cause errors when profiling some kernels (e.g., OneHot). Also, the profiler and codegen are independent in the current design, yet they share many functions. We may use codegen to do profiling.

  6. Training
    We have added basic training features such as autodiff and backward ops, but end users cannot easily use them or integrate them with their own projects. This problem is not limited to training, but training is an important factor to consider. For a better training experience we have two items: first, a clean Python interface hiding NNFusion internals and implementation details; then, based on that interface, figure out the scope and add the missing training features.

[ENHANCEMENT] Active block check in -fblockfusion_level=2

πŸš€ Feature
Check GridDim in -fblockfusion_level=2 to satisfy the active block limitation in CUDA.

Motivation
BlockFusion with -fblockfusion_level=2 uses inter-block synchronization primitives. Improper number of BEs (vEUs) may lead to deadlock due to the active block limitation in CUDA.

Pitch
We can use nvcc to check the GridDim after blockfusion codegen and adaptively change the number of BEs (vEUs) to satisfy the active block limitation in CUDA.
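
Alternatively, a runtime version of such a check could use the CUDA occupancy API. Below is a minimal sketch; the kernel, BlockDim, and GridDim are hypothetical placeholders, not NNFusion's actual codegen output:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void fused_kernel() {}  // placeholder for a BlockFusion-generated kernel

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int block_size = 256;   // placeholder BlockDim
    int grid_dim = 1024;    // placeholder GridDim chosen by blockfusion codegen

    // How many blocks of this kernel can be resident on one SM at once.
    int max_blocks_per_sm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &max_blocks_per_sm, fused_kernel, block_size, /*dynamicSMemSize=*/0);

    int active_block_limit = max_blocks_per_sm * prop.multiProcessorCount;
    if (grid_dim > active_block_limit) {
        // Inter-block synchronization would deadlock here: shrink the number
        // of BEs (vEUs), or fall back to -fblockfusion_level=1.
        printf("GridDim %d exceeds active block limit %d\n", grid_dim, active_block_limit);
    }
    return 0;
}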

Alternatives
Fall back to -fblockfusion_level=1 when the GridDim exceeds the active block limitation. The overhead of inter-block synchronization grows as the number of blocks increases.

NNFusion Backlog

This is the backlog of NNFusion, used to track issues that are not yet planned but are considered future candidate items.

Our current or upcoming release is tracked in #194.

Our release procedures are listed in #195.

NNFusion users are highly encouraged to comment on and suggest priorities, preferences, and needs for the work items. Please feel free to share your ideas with us or contribute to NNFusion.

Backlog (no rank)

Module | Module Owner

Mechanism

  • Custom op support
  • support training | @mzmssg
    • Support learning rate scheduling
    • Support external optimizer
    • Freeze some layers for fine tuning
    • Support gradient stop
  • support low-precision & mixed-precision | @Niupple
    • Fp16 specific kernels
    • wait until Antares is integrated into NNFusion, after which nothing specific should be needed
    • manually generate & inject FP16 specific kernels with Antares into Kernel DB.
    • Modify the data type in generated IR
  • auto kernel tuner integration | @jlxue
    • Kernel DB
    • Add kernelEmitters (CPU/CUDA/ROCm) to parse and emit Antares kernels
    • Modify kernel selection pass accordingly
  • offline inference(PAI) | @wenxcs
    • Reduce padding(bytedance/effective_transformer)
    • Docker image for BERT offline inference
    • Batch bucket inference
    • Offline inference wrapper
  • parallel training support (via SuperScaler) | @lynex
    • v0.2 new datatype support

Refactor/Improvement

  • detect unsupported model & ops | @mzmssg
    • Others(unsupported op attr etc.)
  • support block-fusion as default | @xysmlx
    • Define and implement the interfaces between BlockFusion and kernel tuner
    • End-to-end test with kernel tuner enabled
    • automatic active block check #50
    • Tune-efficient policy in kernel tuner
    • Refactor BlockCudaEmitter
    • Advanced scheduling policy with kernel tuner
  • support reduce-fusion | @xiayuqing0622
    • add reduce fusion pass
    • optimize scheduler
    • test performance
    • port the code to GitHub
  • sub-graph substitution | @wenxcs
    • Graph match feature, by FSM:
    • Replacing current Pattern Match;
    • Graph Re-writer Tool;
    • Antares Fusion
  • code refactor | @wenxcs
    • Move Operator define to opdefine_v2
    • Robust validation pipeline

Frontend & Backend support

  • support CPU | @guoshzhao
    • Update azure mirror download urls for thirdparty package
  • support HLSL | Pending
  • model support(training models & more inference models) | Pending

Common Tools

  • python interface | @mzmssg
    • Share const across multi nnf_rt
    • Install by pip

Documentation

[BUG] Link error in debug mode

πŸ› Bug

When building in debug mode, the link stage fails.

To Reproduce
Steps to reproduce the behavior:

  1. cmake .. -DDEBUG_ENABLE=TRUE
  2. make -j

Error log:
../../nnfusion/engine/pass/graph/blockfusion/libnnfusion_engine_pass_graph_blockfusion.a(blockfusion_optimizer.cpp.o): In function `BlockFusionWavefrontOptimizer::SplitGroup(std::shared_ptr<std::vector<std::shared_ptr<BlockFusionWavefrontOptimizer::FusionGroup>>>)':
/home/jxue/repo/nnfusion-jlxue/src/nnfusion/engine/pass/graph/blockfusion/blockfusion_optimizer.cpp:296: undefined reference to `BlockFusionWavefrontOptimizer::MAX_GROUP'
... In function `BlockFusionWavefrontOptimizer::GroupProfiler(std::shared_ptr<BlockFusionWavefrontOptimizer::FusionGroup>)':
/home/jxue/repo/nnfusion-jlxue/src/nnfusion/engine/pass/graph/blockfusion/blockfusion_optimizer.cpp:362: undefined reference to `BlockFusionWavefrontOptimizer::DEFAULT_BE'
... In function `BlockFusionWavefrontOptimizer::FuseGroupOnGraph(std::shared_ptr<BlockFusionWavefrontOptimizer::FusionGroup>)':
/home/jxue/repo/nnfusion-jlxue/src/nnfusion/engine/pass/graph/blockfusion/blockfusion_optimizer.cpp:432: undefined reference to `BlockFusionWavefrontOptimizer::DEFAULT_BE'
collect2: error: ld returned 1 exit status
make[2]: *** [src/tools/nnfusion/nnfusion] Error 1
make[1]: *** [src/tools/nnfusion/CMakeFiles/nnfusion.dir/all] Error 2
make[1]: *** Waiting for unfinished jobs....
(the same undefined references repeat while linking test/unit-test)
collect2: error: ld returned 1 exit status
make[2]: *** [test/unit-test] Error 1
make[1]: *** [test/CMakeFiles/unit-test.dir/all] Error 2
make: *** [all] Error 2

Expected behavior

build success.

Additional context

no.

[ENHANCEMENT] CUDA-Graph integration

πŸš€ Feature

CUDA Graphs were introduced in CUDA 10.1 to reduce kernel launch overhead. They match NNFusion's current design, so they could easily be integrated into cuda_codegen to improve performance.

Pitch

Add a stream parameter to kernel_entry and capture the kernel_entry call to initialize the CUDA graph.

Note that capture cannot use the default stream, and there must be no host-blocking API calls (e.g., cudaDeviceSynchronize) during stream capture.
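
A minimal sketch of this capture flow, assuming a generated kernel_entry that takes an explicit stream (the signature is illustrative, not NNFusion's actual one):

#include <cuda_runtime.h>

void kernel_entry(void** args, cudaStream_t stream);  // hypothetical generated entry point

void run_with_cuda_graph(void** args, int iterations) {
    // Capture requires a non-default stream; the default stream cannot be captured.
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaGraph_t graph;
    cudaGraphExec_t graph_exec;

    // Record every launch issued on `stream` into a graph. No host-blocking
    // calls (e.g., cudaDeviceSynchronize) are allowed during capture.
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    kernel_entry(args, stream);
    cudaStreamEndCapture(stream, &graph);

    cudaGraphInstantiate(&graph_exec, graph, nullptr, nullptr, 0);

    // Replay the whole model with a single launch per iteration,
    // amortizing per-kernel launch overhead.
    for (int i = 0; i < iterations; ++i) {
        cudaGraphLaunch(graph_exec, stream);
        cudaStreamSynchronize(stream);
    }

    cudaGraphExecDestroy(graph_exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
}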

Additional context

https://developer.nvidia.com/blog/cuda-graphs/
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-graphs

[BUG] static "/usr/local/cuda" path string in cuda_lib cmake codegen

πŸ› Bug

Although the CMake file in the nnfusion_rt folder can find the CUDA path, the linking procedure still uses the hard-coded "/usr/local/cuda" path to link CUDA libraries. This results in a linking error when users set custom CUDA paths.

To Reproduce
Steps to reproduce the behavior:

  1. set custom CUDA path (e.g., /usr/local/cuda-10.2)
  2. remove "/usr/local/cuda" or redirect "/usr/local/cuda" to another CUDA path (e.g., /usr/local/cuda -> /usr/local/cuda-9.0)
  3. compile model and build nnfusion_rt

Expected behavior

cmake .
-- The C compiler identification is GNU 5.4.0
-- The CXX compiler identification is GNU 5.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
-- Found CUDA: /usr/local/cuda-10.2 (found version "10.2")
-- Configuring done
-- Generating done
make -j
Scanning dependencies of target nnfusion_naive_rt
[ 96%] Linking CXX static library libnnfusion_naive_rt.a
[ 96%] Built target nnfusion_naive_rt
Scanning dependencies of target main_test
[ 98%] Building CXX object CMakeFiles/main_test.dir/main_test.cpp.o
[100%] Linking CXX executable main_test
/usr/bin/ld: cannot find -lcudnn
collect2: error: ld returned 1 exit status
CMakeFiles/main_test.dir/build.make:110: recipe for target 'main_test' failed
make[2]: *** [main_test] Error 1
CMakeFiles/Makefile2:96: recipe for target 'CMakeFiles/main_test.dir/all' failed
make[1]: *** [CMakeFiles/main_test.dir/all] Error 2
Makefile:102: recipe for target 'all' failed
make: *** [all] Error 2

Additional context

After setting soft link "/usr/local/cuda -> /usr/local/cuda-10.2", it works well.

[BUG] kernel fusion pass NullPointer bug

πŸ› Bug

Running the kernel fusion pass on some models may report a Check failed: '((cuda_kernel) != nullptr)' error at src/nnfusion/core/kernels/cuda_gpu/kernels/elementwise_fused.cpp:166.

To Reproduce
Steps to reproduce the behavior:

  1. The bug requires a specific model that has a broadcast node with multiple output edges, each connecting to an element-wise node.
  2. run nnfusion xx.pb --format tensorflow -fdefault_device CUDA -fblockfusion_level=0 -fkernel_fusion_level=3

Error logs:
[ERROR] 2020-10-12T03:16:32z src/nnfusion/util/errors.hpp 169 Check failed: '((cuda_kernel) != nullptr)' at /home/jxue/repo/nnfusion-jlxue/src/nnfusion/core/kernels/cuda_gpu/kernels/elementwise_fused.cpp:166: kernel type:
terminate called after throwing an instance of 'nnfusion::errors::NullPointer'
  what():  Check failed: '((cuda_kernel) != nullptr)' at /home/jxue/repo/nnfusion-jlxue/src/nnfusion/core/kernels/cuda_gpu/kernels/elementwise_fused.cpp:166: kernel type:
Aborted (core dumped)

Expected behavior

compile success.

Additional context

no.

Error handling for GPU kernel launch/execution

πŸš€ Feature
Add error handling to better discover kernel errors caused by kernel launch and kernel execution failures.

Motivation
I was checking the correctness of a model with the CPU backend and planned to compare the results with the CUDA backend, but I found that the CUDA backend's results were wrong because one kernel with an invalid configuration failed to launch. The CUDA program still executed normally and didn't report any information.

Pitch

  1. Be notified in advance when kernels fail (see the sketch below).
  2. If possible, know which kernels fail. If that would introduce more overhead, the first item alone is acceptable.
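
One plausible way to cover both items is to check cudaGetLastError() right after each generated launch (cheap, catches invalid launch configurations) and optionally synchronize to surface asynchronous execution errors. The macro below is a sketch, not an existing NNFusion facility:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical check that codegen could emit after every kernel launch.
#define NNF_CHECK_LAUNCH(name)                                        \
    do {                                                              \
        cudaError_t err = cudaGetLastError(); /* launch errors */     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "kernel %s failed: %s\n",                 \
                    name, cudaGetErrorString(err));                   \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Usage after a generated launch:
//   some_kernel<<<grid, block>>>(...);
//   NNF_CHECK_LAUNCH("some_kernel");
// To also catch asynchronous execution errors (item 2, higher overhead):
//   cudaDeviceSynchronize();
//   NNF_CHECK_LAUNCH("some_kernel (execution)");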


[BUG] Check failed: 'm_ref_count > 0' and bad free when set -fblockfusion_level=2

πŸ› Bug

Check failed: 'm_ref_count > 0' and bad free when set -fblockfusion_level=2

lstm-tf-slope.const_folded.pb

[ERROR] 2020-10-13T11:25:27z src/nnfusion/util/errors.hpp 169   Check failed: 'm_ref_count > 0' at /home/lingm/projects/rammer_artifact/thirdparty/ngraph/src/nnfusion/common/descriptor/tensor.hpp:87:
(no explanation given)
terminate called after throwing an instance of 'nnfusion::errors::CheckError'
  what():  Check failed: 'm_ref_count > 0' at /home/lingm/projects/rammer_artifact/thirdparty/ngraph/src/nnfusion/common/descriptor/tensor.hpp:87:
(no explanation given)
Aborted (core dumped)

frozen_lstm_l2_s2_h256.const_folded.pb

[ERROR] 2020-10-14T04:42:56z src/nnfusion/util/errors.hpp 169   Check failed: 'found' at /home/lingm/projects/rammer_artifact/src/nnfusion/engine/memory_allocator.cpp:241:
bad free
terminate called after throwing an instance of 'nnfusion::errors::CheckError'
  what():  Check failed: 'found' at /home/lingm/projects/rammer_artifact/src/nnfusion/engine/memory_allocator.cpp:241:
bad free
Aborted (core dumped)

To Reproduce
Steps to reproduce the behavior:

  1. enable -fblockfusion_level=2 during model compilation


Remove "using namespace" in headers

NNFusion has some using namespace directives in headers for convenience, e.g. using namespace nnfusion;, which pollutes the global namespace without limit and is strongly discouraged. A bad case: the log levels (INFO/ERROR) become global and conflict with other third-party libraries.
We should fix this in several steps (a small illustration follows the list):

  • Refine code guideline, disallow using namespace in headers
  • Inspect automatically by tools like clang-format
  • Remove such snippet in common headers
  • Remove such snippet in dedicated headers
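
A small illustration of the pattern to avoid and its replacement; the types are simplified stand-ins, not NNFusion's real logging types:

// bad_header.hpp (pattern to avoid): the directive leaks into every includer.
//   using namespace nnfusion;
//   void emit(LogLevel level);  // INFO/ERROR now collide with other libraries

// good_header.hpp: qualify names explicitly; includers stay unpolluted.
namespace nnfusion {
    enum class LogLevel { INFO, ERROR };  // simplified stand-in
}

void emit(nnfusion::LogLevel level);

// Inside a .cpp file, a narrow alias is still fine:
//   namespace nnf = nnfusion;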

[BUG] missing link

πŸ› Bug

(screenshot attached in the original issue)

Expected behavior

The nnfusion home page needs a link to the getting-started guide.
Additional context

[ENHANCEMENT] NNF Python interface design

πŸš€ Feature

Provide a Python runner to improve NNF usability.

Motivation

Currently, NNFusion is still not easy for third parties to use; the reasons include:

  • The model must be explicitly frozen
  • Codegen steps and flags are complex
  • The model lives in C++ source code, with no standard format for integration

A possible solution is providing a Python interface hiding these details.

Goal

Provide a Python wrapper for NNF to improve its usability for PyTorch users. Use PyTorch models/tensors as the standard interface, so users only need to replace forward execution with NNF.

Non-Goal

  1. Supporting full PyTorch features such as sparse tensors (should raise an unsupported exception)
  2. Models that cannot be converted to ONNX
  3. Other frameworks such as TF

Class Definition

  • Executor
    Executor is a simple Python binding over nnf_rt. It accepts a folder containing the nnf_rt and provides a call() function to execute nnf_rt and write output through raw pointers.
  • Session
    Session is the pipeline that generates an Executor. It accepts a PyTorch model and an input_desc (a list of input shape/type/device), then does codegen, compiles, and loads the compiled files into an Executor. It also wraps the internal Executor behind a PyTorch tensor interface.
  • Runner
    Runner is the outermost interface. It maintains a cache map from input_desc to Sessions. Every time PyTorch tensors are fed, Runner checks the input tensor shape/type/device, then forwards the tensors to the corresponding Session (cache hit) or constructs a new Session (cache miss).

Workflow

  1. Initialize Runner with a specific model, nnf_flags, and a workdir (to store nnf_rt)
  2. Feed input PyTorch tensors to Runner
  3. Runner checks the input shape/type/device; if a corresponding Session is already cached, go to step 5
  4. Runner generates a new Session and stores it under the related input_desc key
    1. Session codegens and compiles the model
    2. Session loads the compiled files into an Executor
    3. Session binds weight and output tensors
  5. Runner picks the Session from the cache
  6. Runner forwards the input tensors to the Session
    1. Session unwraps tensors to raw pointers and forwards them to the Executor
    2. Session returns the result tensors

Usage

  • PyTorch example
...
model = MLP()
model.load_state_dict(torch.load('/path/to/checkpoint'))
data_loader = get_data_loader(batch_size=batch_size)

for batch in data_loader:
    out = model(batch)
    ...

torch.save(model.state_dict(), '/path/to/checkpoint')
  • Its NNF version
## load model and data loader by PyTorch
model = MLP()
model.load_state_dict(torch.load('/path/to/checkpoint'))
data_loader = get_data_loader(batch_size=batch_size)

## init NNF Runner
nnf_flags = {
    'codegen_debug': True,
    'kernel_fusion_level': 2
}
runner = NNFRunner(model, **nnf_flags)

## replace execution by NNF
for batch in data_loader:
    out = runner(batch)
    ...

## save model by PyTorch
torch.save(model.state_dict(), '/path/to/checkpoint')

Work Item
We need more discussion to break down the items; roughly they include:

  • Support building nnf_rt as a dynamic library
  • Codegen Python binding? [low priority]
  • Share constants across different Sessions
  • Runner implementation

[Story] Optimization on Bert

Feasibility:
The NNFusion project needs some flagship models to prove its usability; we chose BERT as one of them.

Target:

  1. Improve NNFusion's inference performance on Transformer/BERT;
  2. Provide a friendly interface for LM tasks;
  3. Provide a Docker solution/container for easy inference/deployment;

Work items:

  • Investigate current BERT acceleration projects;

Validation:

[BUG] thirdparty/ngraph not in linter path

πŸ› Bug

Our code-style checker ignores the thirdparty/ folder, but the majority of our code currently lives in thirdparty/ngraph; we should add it to the whitelist.


NNFusion v0.1 Endgame Plan

release manager @wenxcs
cut branch date: 10/22
test period : 10.22-10.26
target release date: @wenxcs

New Feature

Feature | Feature Owner | Test Owner(s) | Test case | Status

Documents

Installation(native & docker)

Workload 10+ tf models & 2 ONNX models

Framework

Hardware & Runtime

Artifact

  • Artifact | @xysmlx
  • Provide a good performance model for advanced users | @jlxue

Other Work Items

  • Unsupported Operators
  • Release Note

[ENHANCEMENT] optimizations for some complex operations

πŸš€ Feature

NNFusion has kernel implementations for some complex operations (e.g., GELU, LayerNorm). However, some frontends implement these operations as a series of simple operators, so NNFusion currently cannot recognize these patterns and call the relevant optimized kernel implementations (see the GELU example after the list):

  • GELU
  • Non-fused BatchNorm
  • LayerNorm
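
For a concrete example of why pattern recognition is needed here: frontends usually emit GELU through its standard tanh approximation, so the graph contains a chain of mul/add/pow/tanh nodes rather than a single GELU node the backend could match directly:

\mathrm{GELU}(x) = x\,\Phi(x) \approx 0.5\,x \left(1 + \tanh\!\left(\sqrt{2/\pi}\,\bigl(x + 0.044715\,x^{3}\bigr)\right)\right)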

Pitch

These policies can be implemented in the pattern substitution pass.


Possible identifier conflict in kernel DB

The current kernel DB identifier has no delimiter between parameters, which can cause identifier conflicts between different kernel configurations.

For example, the shapes [1, 256, 16, 16] and [12, 56, 16, 16] both produce the identifier 12561616.
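
A sketch of the obvious fix: insert a delimiter between the serialized parameters when building the identifier (the function name here is hypothetical, not NNFusion's actual code):

#include <string>
#include <vector>

// Hypothetical identifier builder: a delimiter keeps shapes unambiguous.
std::string shape_identifier(const std::vector<int>& shape) {
    std::string id;
    for (int d : shape) {
        id += std::to_string(d);
        id += '_';
    }
    return id;  // [1,256,16,16] -> "1_256_16_16_", [12,56,16,16] -> "12_56_16_16_"
}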
