oneflow-inc / oneflow

OneFlow is a deep learning framework designed to be user-friendly, scalable and efficient.

Home Page: http://www.oneflow.org

License: Apache License 2.0

CMake 0.56% C++ 56.71% Cuda 11.12% C 2.06% Python 28.62% Shell 0.13% Dockerfile 0.04% MLIR 0.74% NASL 0.03%
deep-learning machine-learning deep-neural-networks ml distributed neural-network cuda

oneflow's Introduction

OneFlow

OneFlow is a deep learning framework designed to be user-friendly, scalable and efficient.

Latest News

Publication

System Requirements

General

  • Linux
  • Python 3.7, 3.8, 3.9, 3.10, 3.11

CUDA

  • CUDA arch 60 or above

  • CUDA Toolkit version 10.0 or above

  • Nvidia driver version 440.33 or above

    OneFlow will work on a minimum supported driver, and any driver beyond. For more information, please refer to CUDA compatibility documentation.
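The requirements above can be checked programmatically before installing. The following is a minimal sketch using only the Python standard library. The helper names `parse_version` and `check_requirements` are hypothetical and not part of OneFlow, and the driver version must be supplied by the caller (for example, read from `nvidia-smi` output), since this sketch does not query the driver itself.

```python
import platform
import sys

# Values copied from the requirement lists above.
SUPPORTED_PYTHON = {(3, 7), (3, 8), (3, 9), (3, 10), (3, 11)}
MIN_DRIVER = (440, 33)


def parse_version(text):
    """Turn a dotted version string such as '450.80.02' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def check_requirements(system=None, python_version=None, driver_version=None):
    """Return a list of problems; an empty list means every check passed."""
    system = platform.system() if system is None else system
    if python_version is None:
        python_version = sys.version_info[:2]
    python_version = tuple(python_version)

    problems = []
    if system != "Linux":
        problems.append(f"unsupported OS: {system}")
    if python_version not in SUPPORTED_PYTHON:
        problems.append("unsupported Python: " + ".".join(map(str, python_version)))
    # The driver check only runs when the caller supplies a version string.
    if driver_version is not None and parse_version(driver_version) < MIN_DRIVER:
        problems.append(f"Nvidia driver {driver_version} is older than 440.33")
    return problems


print(check_requirements("Linux", (3, 9), "450.80.02"))
print(check_requirements("Windows", (3, 6), "430.50"))
```

Calling `check_requirements()` with no arguments inspects the current interpreter and operating system and skips the driver check.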

Install

Preinstalled Docker image

docker pull oneflowinc/oneflow:nightly-cuda11.7

Pip Install

  • (Highly recommended) Upgrade pip

    python3 -m pip install --upgrade pip #--user
    
  • To install the latest stable release of OneFlow with CUDA support:

    python3 -m pip install oneflow
  • To install the nightly release of OneFlow with CPU-only support:

    python3 -m pip install --pre oneflow -f https://oneflow-staging.oss-cn-beijing.aliyuncs.com/branch/master/cpu
  • To install the nightly release of OneFlow with CUDA support:

    python3 -m pip install --pre oneflow -f https://oneflow-staging.oss-cn-beijing.aliyuncs.com/branch/master/cu118

    If you are in China, you can run the following to have pip download packages from a domestic PyPI mirror:

    python3 -m pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
    

    For more information, please refer to the PyPI mirror usage guide (pypi 镜像使用帮助).

Install from Source

Clone Source Code
  • Option 1: Clone source code from GitHub

    git clone https://github.com/Oneflow-Inc/oneflow.git
  • Option 2: Download from Aliyun (only available in China)

    curl https://oneflow-public.oss-cn-beijing.aliyuncs.com/oneflow-src.zip -o oneflow-src.zip
    unzip oneflow-src.zip
Build OneFlow
  • Install dependencies

    apt install -y libopenblas-dev nasm g++ gcc python3-pip cmake autoconf libtool
    

    These dependencies are preinstalled in the official conda environment and Docker image. You can use the official conda environment here, or pull the Docker image with:

    docker pull oneflowinc/manylinux2014_x86_64_cuda11.2
  • In the root directory of OneFlow source code, run:

    mkdir build
    cd build
    
  • Configure the project inside the build directory:

    • If you are in China

      Configure for CPU-only like this:

      cmake .. -C ../cmake/caches/cn/cpu.cmake
      

      Configure for CUDA like this:

      cmake .. -C ../cmake/caches/cn/cuda.cmake -DCMAKE_CUDA_ARCHITECTURES=80 -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda -DCUDNN_ROOT_DIR=/usr/local/cudnn
      
    • If you are not in China

      Configure for CPU-only like this:

      cmake .. -C ../cmake/caches/international/cpu.cmake
      

      Configure for CUDA like this:

      cmake .. -C ../cmake/caches/international/cuda.cmake -DCMAKE_CUDA_ARCHITECTURES=80 -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda -DCUDNN_ROOT_DIR=/usr/local/cudnn
      

      Here the CMAKE_CUDA_ARCHITECTURES variable (passed with -D) specifies the CUDA architecture, and the CUDA_TOOLKIT_ROOT_DIR and CUDNN_ROOT_DIR variables specify the root paths of the CUDA Toolkit and cuDNN.

  • Build the project. Inside the build directory, run:

    make -j$(nproc)
    
  • Add oneflow to your PYTHONPATH. Inside the build directory, run:

    source source.sh
    

    Please note that this change is not permanent.

  • Simple validation

    python3 -m oneflow --doctor
    

Troubleshooting

Please refer to troubleshooting for common issues you might encounter when compiling and running OneFlow.

Getting Started

Documentation

Model Zoo and Benchmark

Communication

The Team

OneFlow was originally developed by OneFlow Inc and Zhejiang Lab.

License

Apache License 2.0

oneflow's People

Contributors

bbuf, chengtbf, clackhan, daquexian, dounm, duduscript, flowingsun007, guo-ran, hjchen2, jackalcooper, junior-talk, ldpe2g, leaves-zwx, liufengwei0103, liujuncheng, lixiang007666, lixinqi, mard1no, marigoold, mosout, ouyangyu, scxfjiang, shawnxuan, simonjjj, strint, willzhang4a58, wind5, wyg1997, yuanms2, zyeric

oneflow's Issues

Error: dev_job_set python OFRecord Module

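I have recently been using the dev_job_set frontend and it is very hard to use. To keep others from running into the same pitfalls, please fix this promptly.

What I would like to know at this point:

  1. Which modules of the dev_job_set frontend have been tested?
  2. Which known problems in the dev_job_set frontend are still unresolved?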

ofrecord

vm/stream_desc.msg.h causes a compilation failure with g++ 7.3.0

[ 98%] Built target of_include_copy
In file included from /oneflow_src/build-conda/3rd-install/glog/include/glog/logging.h:44:0,
                 from /oneflow_src/oneflow/core/common/flat_msg.h:6,
                 from /oneflow_src/oneflow/core/vm/stream_desc.msg.h:6,
                 from /oneflow_src/oneflow/core/vm/vm_desc.msg.h:4,
                 from /oneflow_src/oneflow/core/vm/vm_desc_test.cpp:2:
/usr/local/envs/1f-dev/x86_64-conda_cos6-linux-gnu/include/c++/7.3.0/sstream:300:7: error: 'struct std::__cxx11::basic_stringbuf<_CharT, _Traits, _Alloc>::__xfer_bufptrs' redeclared with different access
       struct __xfer_bufptrs
       ^~~~~~

more debug or error information

OneFlow should provide more debug or error information, such as the op name, when it core dumps.

  1. Unsubscribed regst
    An unsubscribed regst crashes OneFlow without any information, which makes it hard for the user to locate the op.
  2. Model shape doesn't match
    For example, BERT pretraining and fine-tuning may use different shapes for the model of one or more ops. It would be better to report the op name alongside the shape information.
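To illustrate the second request, here is a minimal sketch of a shape check whose error message carries the op name. It is plain Python and not OneFlow code: the `ModelShapeMismatch` class, the `check_model_shape` helper, and the op name `bert-cls-dense` are all hypothetical.

```python
class ModelShapeMismatch(ValueError):
    """Hypothetical error type used only in this sketch; not part of OneFlow."""


def check_model_shape(op_name, expected_shape, loaded_shape):
    """Raise an error that names the op when the two shapes differ."""
    expected = tuple(expected_shape)
    loaded = tuple(loaded_shape)
    if expected != loaded:
        raise ModelShapeMismatch(
            f"op '{op_name}': model shape mismatch, "
            f"expected {expected} but the snapshot has {loaded}"
        )


message = ""
try:
    # Hypothetical op name and shapes, chosen only to trigger the error.
    check_model_shape("bert-cls-dense", (768, 2), (768, 3))
except ModelShapeMismatch as err:
    message = str(err)
print(message)
```

With the op name in the message, a user can tell which variable differs between the pretrained and fine-tuned models without a debugger.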

ERROR: Skipping '//tensorflow/compiler/jit/xla_lib:libxla_core.so': error loading package 'tensorflow/compiler/jit/xla_lib':

I am building XRT by following https://github.com/Oneflow-Inc/oneflow/blob/master/oneflow/xrt/README.md:

    git submodule update --init --recursive && \
    cd build && \
    cmake .. -DWITH_XLA=ON -DWITH_TENSORRT=ON -DTHIRD_PARTY=ON -DCMAKE_BUILD_TYPE=Release && \
    make -j$(nproc) && \
    cmake .. -DWITH_XLA=ON  -DWITH_TENSORRT=ON -DTHIRD_PARTY=OFF -DCMAKE_BUILD_TYPE=Release && \
    make -j$(nproc) && \
    make pip_install

and the following error occurs:

ERROR: Skipping '//tensorflow/compiler/jit/xla_lib:libxla_core.so': error loading package 'tensorflow/compiler/jit/xla_lib': Label '//tensorflow/core:platform/default/build_config.bzl' crosses boundary of subpackage 'tensorflow/core/platform' (perhaps you meant to put the colon here: '//tensorflow/core/platform:default/build_config.bzl'?)
WARNING: Target pattern parsing failed.
ERROR: error loading package 'tensorflow/compiler/jit/xla_lib': Label '//tensorflow/core:platform/default/build_config.bzl' crosses boundary of subpackage 'tensorflow/core/platform' (perhaps you meant to put the colon here: '//tensorflow/core/platform:default/build_config.bzl'?)

How can this be fixed?

refactor user op cpp api

When registering a user_op or a user_op grad, the exposed API names often contain "wrapper", which tends to confuse users.
Remove "wrapper" from the C++ user_op API names.

0-D tensor

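OneFlow currently represents a scalar as a tensor of shape [1,], which is still a 1-D tensor. A scalar represented as a 0-D tensor and one represented as a 1-D tensor have different mathematical meanings in some ops. Here is an example using TensorFlow's gather:

import numpy as np
import tensorflow as tf

a = np.arange(24).reshape((2,3,4))
b1 = np.array([2], np.int32)
b2 = 2
# axis=1 is required for the shapes below; index 2 is out of range on axis 0
c1 = tf.gather(a, b1, axis=1)    # c1.shape == [2, 1, 4]
c2 = tf.gather(a, b2, axis=1)    # c2.shape == [2, 4]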

allow random init for missing models in model_load_snapshot_path

BERT fine-tuning has more ops with models than pretraining does.
Currently, we need to run fine-tuning once without model_load_snapshot_path to get randomly initialized models, and then copy them into the pretrained model path.
It would be convenient to allow random initialization for missing models.
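The requested behaviour can be sketched as a per-variable decision: load a model when the snapshot contains it, otherwise fall back to random initialization. The sketch below is plain Python, not OneFlow code. The helper `plan_model_init` is hypothetical, and it assumes each model is stored under a path named after its variable inside the snapshot directory, which may not match the real snapshot layout.

```python
import os
import tempfile


def plan_model_init(variable_names, snapshot_dir):
    """Map each variable to 'load' if the snapshot has it, else 'random_init'."""
    plan = {}
    for name in variable_names:
        # Assumption: one entry per variable, named after the variable.
        path = os.path.join(snapshot_dir, name)
        plan[name] = "load" if os.path.exists(path) else "random_init"
    return plan


# Simulate a pretrained snapshot that lacks the fine-tuning classifier weight.
with tempfile.TemporaryDirectory() as snapshot_dir:
    os.mkdir(os.path.join(snapshot_dir, "encoder-weight"))
    plan = plan_model_init(["encoder-weight", "classifier-weight"], snapshot_dir)

print(plan)
```

This removes the extra fine-tuning run that the current workflow needs just to produce the missing models.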

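GetTmpSizeForReduceSum returns an abnormal value

Description
While running a network on machine 15, I accidentally placed the model on a GPU with insufficient memory (GPU 0). The GPU status at the time was:
[image]
Error message
[image]
Locating the problem
The error is thrown at
operator/sigmoid_cross_entropy_loss_op.cpp:54
const int64_t sum_buf_size = GetTmpSizeForReduceSum(pred_blob_desc->data_type(), data_dim)
One problem identified is that, in this situation, the SwitchSum() function at kernel/kernel_util.cu:342 computes tmp_storage_size incorrectly.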

protobuf build error with gcc-7.5

System Info
Ubuntu 18.04.4 LTS
gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
CUDA/cuDNN Version: 10.1 / 7.6

Build Log
/home/xfjiang/repos/oneflow/third_party/protobuf/include/google/protobuf/map.h:882:29: error: cannot call member function ‘bool google::protobuf::Map<Key, T>::InnerMap::TableEntryIsNonEmptyList(google::protobuf::Map<Key, T>::size_type) const [with Key = std::__cxx11::basic_string<char>; T = oneflow::Feature; google::protobuf::Map<Key, T>::size_type = long unsigned int]’ without object
if (m_->TableEntryIsNonEmptyList(bucket_index_)) {
~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~
/home/xfjiang/repos/oneflow/third_party/protobuf/include/google/protobuf/map.h:897:24: error: cannot call member function ‘bool google::protobuf::Map<Key, T>::InnerMap::TableEntryIsList(google::protobuf::Map<Key, T>::size_type) const [with Key = std::__cxx11::basic_string<char>; T = oneflow::Feature; google::protobuf::Map<Key, T>::size_type = long unsigned int]’ without object
return m_->TableEntryIsList(bucket_index_);
~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~
CMake Error at of_ccobj_generated_where_kernel_util.cu.o.cmake:280 (message):
Error generating file
/home/xfjiang/repos/oneflow/build/CMakeFiles/of_ccobj.dir/oneflow/customized/kernels/./of_ccobj_generated_where_kernel_util.cu.o

CMakeFiles/of_ccobj.dir/build.make:8043: recipe for target 'CMakeFiles/of_ccobj.dir/oneflow/customized/kernels/of_ccobj_generated_where_kernel_util.cu.o' failed
make[2]: *** [CMakeFiles/of_ccobj.dir/oneflow/customized/kernels/of_ccobj_generated_where_kernel_util.cu.o] Error 1
CMakeFiles/Makefile2:416: recipe for target 'CMakeFiles/of_ccobj.dir/all' failed
make[1]: *** [CMakeFiles/of_ccobj.dir/all] Error 2
Makefile:114: recipe for target 'all' failed
make: *** [all] Error 2

Solution

  • downgrade gcc-7.5 to gcc-6.5 (also works with gcc-4.8.5)
  • or patch third-party protobuf

Reference: TensorFlow Issue 26155

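oneflow vsh debug: quick reference for common commands

vsh basics

Command   arg0              Other arguments                             Notes
v         any type                                                      Returns arg0; equivalent to bash echo
print     any type                                                      Calls Python's print
->        assignment                                                    Example: v "foobar" | -> $var
foreach   Python container  arg1: handler function                      Iterates over the data. Example: list 1 2 3 | foreach print
map       Python container  arg1: handler function
filter    Python container  arg1: filter
flat-map  Python container  arg1: handler function
reduce    Python container  arg1: binary function; arg2: initial value
group-by  Python container  arg1: classifier function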

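Python data structure wrappers

Command            arg0                             Other arguments                         Notes
list               element                          element                                 Builds a Python list. Example: (list 1 2 3 4)
tuple              element                          element                                 Builds a Python tuple. Example: (tuple 1 2 3 4)
dict                                                                                        Builds a Python dict. Example: (dict)
.                  container                        arg1: field name or index               Gets a container element or attribute. Example: list 1 2 3 | .0
.                  container                        arg1: field name or index; arg2: value  Sets a container element or attribute. Example: dict | .a "foo" | .b "bar"
tuple-to-list      matching container (same below)                                          Conversion function
list-to-tuple
pair-list-to-dict
dict-to-pair-list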

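gdb command wrappers

Command            arg0                    Other arguments                          Notes
p                  variable or expression                                           Equivalent to gdb's p command
this/p-str-format  object pointer          format string, where %s stands for this  Calls a method or accesses a member through the pointer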

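C++ containers

Command                    arg0                                               Other arguments  Notes
std-vector-get             gdb object of the matching container (same below)  arg1: index      Gets an element of the container
std-list-get
std-queue-get
std-deque-get
std-unordered-set-get
std-unordered-map-get
std-vector-to-list         gdb object of the matching container (same below)                   Converts the container to a Python list whose elements are the gdb objects of the container's elements
std-list-to-list
std-queue-to-list
std-deque-to-list
std-unordered-set-to-list
std-unordered-map-to-list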

C++ pointers

Command             arg0                                             Other arguments  Notes
std-unique-ptr-get  gdb object of the matching pointer (same below)
std-shared-ptr-get
ptr-get
get-address                                                                           Gets the object's address

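C++ types

Command           arg0        Other arguments  Notes
gdb/type          gdb object                   Gets the C++ type
gdb/dynamic-type  gdb object                   Gets the actual subclass type
gdb/dynamic-cast  gdb object                   Casts from the base class to the actual subclass type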

oneflow protobuf

Command                       arg0                   Other arguments  Notes
gdb/of-pb-msg-from-cpp-to-py  proto message pointer                   Converts a C++ protobuf message object into a Python proto object

oneflow kernel blob

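Command                        arg0                          Other arguments  Notes
gdb/of-bn-in-op-to-blob-ptr    bn_in_op string                                Must be run inside a kernel function. Given a bn_in_op, returns the corresponding Blob object pointer
gdb/of-blob-dump-by-blob-ptr   gdb object of a blob pointer                   Must be run inside a kernel function. Returns the blob data as a numpy.ndarray object
gdb/of-blob-dump-by-blob-name  bn_in_op string                                Must be run inside a kernel function. Returns the blob data as a numpy.ndarray object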

A better error message is needed when an oneflow op API is called outside an oneflow function

import tensorflow as tf
import oneflow as flow
import numpy as np

# tf.enable_eager_execution()
# assert tf.executing_eagerly()


def test_matmul(a_shape, b_shape, transpose_a=False, transpose_b=False):
    a = np.random.random_sample(a_shape).astype(np.float32)
    b = np.random.random_sample(b_shape).astype(np.float32)
    b = flow.get_variable(name = 'v1', shape=b_shape, split_axis=1) 
    bias = flow.get_variable(name = 'v1', shape=(b_shape[1],), split_axis=1) 

    # OneFlow
    flow.config.gpu_device_num(1)
    @flow.function
    def MatmulTestJob(a=flow.input_blob_def(a_shape), b=flow.input_blob_def(b_shape)):
        flow.config.piece_size(1).default_data_type(flow.float)
        return flow.bias_add(flow.matmul(a, b, transpose_a, transpose_b), bias)
        # return flow.matmul(a, b, transpose_a, transpose_b)

    MatmulTestJob(a, b).get


# run one example each time
if __name__ == "__main__":

    test_matmul(a_shape=(10, 10, 64, 32), b_shape=(10, 10, 32, 128))

Traceback (most recent call last):
  File "/home/caishenghang/oneflow/oneflow/python/test/matmul_demo.py", line 38, in <module>
    test_matmul(a_shape=(10, 10, 64, 32), b_shape=(10, 10, 32, 128))
  File "/home/caishenghang/oneflow/oneflow/python/test/matmul_demo.py", line 12, in test_matmul
    b = flow.get_variable(name = 'v1', shape=b_shape, split_axis=1)
  File "/home/caishenghang/oneflow/build/python_scripts/oneflow/python/ops/get_variable.py", line 53, in get_variable
    compile_context.CurJobAddOp(op_conf)
  File "/home/caishenghang/oneflow/build/python_scripts/oneflow/python/framework/compile_context.py", line 24, in CurJobAddOp
    def CurJobAddOp(op_conf): return _CurJobAddNonInputOp(op_conf)
  File "/home/caishenghang/oneflow/build/python_scripts/oneflow/python/framework/compile_context.py", line 33, in _CurJobAddNonInputOp
    op_conf.device_type = placement_context.CurPlacementGroupGetDeviceType(op_conf)
  File "/home/caishenghang/oneflow/build/python_scripts/oneflow/python/framework/placement_context.py", line 16, in CurPlacementGroupGetDeviceType
    assert len(placement_scope_stack) > 0
AssertionError
Segmentation fault (core dumped)

reshape

Implementing squeeze and expand_dims in the frontend with reshape requires the shape of the input tensor, but currently only its name is available.

No CMAKE_ASM_NASM_COMPILER could be found

-- Performing Test HAVE_VERSION_SCRIPT - Success
-- Linker supports GNU-style version scripts
[ 47%] Built target zlib
-- The ASM_NASM compiler identification is unknown
-- Didn't find assembler
CMake Error at simd/CMakeLists.txt:41 (enable_language):
No CMAKE_ASM_NASM_COMPILER could be found.

Tell CMake where to find the compiler by setting either the environment
variable "ASM_NASM" or the CMake cache entry CMAKE_ASM_NASM_COMPILER to the
full path to the compiler, or to the compiler name if it is in the PATH.

-- Configuring incomplete, errors occurred!
See also "/root/oneflow/build/libjpeg-turbo/src/libjpeg-turbo/CMakeFiles/CMakeOutput.log".
See also "/root/oneflow/build/libjpeg-turbo/src/libjpeg-turbo/CMakeFiles/CMakeError.log".
CMakeFiles/libjpeg-turbo.dir/build.make:108: recipe for target 'libjpeg-turbo/src/libjpeg-turbo-stamp/libjpeg-turbo-configure' failed
make[2]: *** [libjpeg-turbo/src/libjpeg-turbo-stamp/libjpeg-turbo-configure] Error 1
CMakeFiles/Makefile2:1993: recipe for target 'CMakeFiles/libjpeg-turbo.dir/all' failed
make[1]: *** [CMakeFiles/libjpeg-turbo.dir/all] Error 2

pure virtual method called

2020-05-07T16:07:15.6852956Z 
2020-05-07T16:07:15.6853601Z OK
2020-05-07T16:07:15.7829262Z pure virtual method called
2020-05-07T16:07:15.7830222Z terminate called without an active exception
2020-05-07T16:07:19.8485966Z ci/test/1node_model_test.sh: line 8:    13 Aborted                 (core dumped) python3 models/1node_test.py
2020-05-07T16:07:21.8466843Z ##[error]Process completed with exit code 134.
2020-05-07T16:07:21.8577932Z Post job cleanup.
2020-05-07T16:07:21.9965622Z [command]/home/caishenghang/oss/git-2.25.0-rc0/install/bin/git version

GDB

(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007f5901ea4801 in __GI_abort () at abort.c:79
#2  0x00007f5883c03957 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3  0x00007f5883c09ab6 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007f5883c09af1 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x00007f5883c0a8bf in __cxa_pure_virtual () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007f585d9c3b8b in grpc::ClientContext::ClientContext() () from /src/build/python_scripts/oneflow/_oneflow_internal.so
#7  0x00007f585ceaf84d in oneflow::CtrlClient::__lambda5::operator() (__closure=0x7f54ec006ee0) at ../oneflow/core/control/ctrl_client.cpp:168
#8  0x00007f5883c349e0 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#9  0x00007f5901c4c6db in start_thread (arg=0x7f54f77fe700) at pthread_create.c:463
#10 0x00007f5901f8588f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb) f 7
#7  0x00007f585ceaf84d in oneflow::CtrlClient::__lambda5::operator() (__closure=0x7f54ec006ee0) at ../oneflow/core/control/ctrl_client.cpp:168
168             grpc::ClientContext client_ctx;
(gdb) l
163           {
164             std::unique_lock<std::mutex> lck(need_heartbeat_thread_stop_mtx_);
165             if (need_heartbeat_thread_stop_) { break; }
166           }
167           for (size_t i = 0; i < stubs_.size(); ++i) {
168             grpc::ClientContext client_ctx;
169             request.set_addr(Global<EnvDesc>::Get()->machine(i).addr());
170             GRPC_CHECK(stubs_[i]->CallMethod<CtrlMethod::kLoadServer>(&client_ctx, request, &response))
171                 << "Machine " << i << " lost";
172           }
(gdb) 

undefined reference to gfortran

The build fails after merging develop:

/home/zjhushengjian/anaconda3/envs/oneflow/bin/ld: warning: libgfortran.so.3, needed by /usr/lib64/libopenblas.so, not found (try using -rpath or -rpath-link)
/home/zjhushengjian/anaconda3/envs/oneflow/bin/ld: /usr/lib64/libopenblas.so: undefined reference to `_gfortran_compare_string@GFORTRAN_1.0'
/home/zjhushengjian/anaconda3/envs/oneflow/bin/ld: /usr/lib64/libopenblas.so: undefined reference to `_gfortran_etime@GFORTRAN_1.0'
/home/zjhushengjian/anaconda3/envs/oneflow/bin/ld: /usr/lib64/libopenblas.so: undefined reference to `_gfortran_concat_string@GFORTRAN_1.0'
/home/zjhushengjian/anaconda3/envs/oneflow/bin/ld: /usr/lib64/libopenblas.so: undefined reference to `_gfortran_pow_i4_i4@GFORTRAN_1.0'
collect2: error: ld returned 1 exit status
make[2]: *** [bin/oneflow_testexe] Error 1
make[1]: *** [CMakeFiles/oneflow_testexe.dir/all] Error 2
make[1]: *** Waiting for unfinished jobs....
/home/zjhushengjian/anaconda3/envs/oneflow/bin/ld: warning: libgfortran.so.3, needed by /usr/lib64/libopenblas.so, not found (try using -rpath or -rpath-link)
/home/zjhushengjian/anaconda3/envs/oneflow/bin/ld: /usr/lib64/libopenblas.so: undefined reference to `_gfortran_compare_string@GFORTRAN_1.0'
/home/zjhushengjian/anaconda3/envs/oneflow/bin/ld: /usr/lib64/libopenblas.so: undefined reference to `_gfortran_etime@GFORTRAN_1.0'
/home/zjhushengjian/anaconda3/envs/oneflow/bin/ld: /usr/lib64/libopenblas.so: undefined reference to `_gfortran_concat_string@GFORTRAN_1.0'
/home/zjhushengjian/anaconda3/envs/oneflow/bin/ld: /usr/lib64/libopenblas.so: undefined reference to `_gfortran_pow_i4_i4@GFORTRAN_1.0'
collect2: error: ld returned 1 exit status
make[2]: *** [bin/oneflow_worker] Error 1
make[1]: *** [CMakeFiles/oneflow_worker.dir/all] Error 2
make: *** [all] Error 2

protobuf buffer limits

The default limit is 64M.
When we put a lot of data into an OFRecord, it hits a bug without any warning.
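One way to surface the problem early is to check the serialized size before writing a record. The sketch below is plain Python and does not use the protobuf or OneFlow APIs. The helper `check_record_size` is hypothetical, and reading "64M" as 64 * 1024 * 1024 bytes is an assumption.

```python
# Assumption: the "64M" default quoted above means 64 MiB.
DEFAULT_LIMIT_BYTES = 64 * 1024 * 1024


def check_record_size(serialized, limit=DEFAULT_LIMIT_BYTES):
    """Return the payload size, or raise if it exceeds the limit."""
    size = len(serialized)
    if size > limit:
        raise ValueError(
            f"serialized record is {size} bytes, over the {limit}-byte limit"
        )
    return size


# A small payload passes and its size is returned.
print(check_record_size(b"\x00" * 1024))
```

In practice `serialized` would be the bytes produced by serializing the record, so an oversized record fails with an explicit message instead of failing silently.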

print op core dumps when printing ip or conv output

Issue description

Testing the print op on the output of the dataloader op works.
It core dumps when printing the output of the ip and conv ops.

Steps to reproduce the issue

  1. add or modify print op config in net.prototxt
...
op {
  name: "print"
  print_conf {
    lbn: "label/out"
    lbn: "ip10/out"
    print_path: "./log/"
  }
}
...
  2. Run oneflow.
    The same applies to conv.

Expected result?

  1. label and ip10 folder and out subfolder are created in file system, contents are stored in these folders.
  2. oneflow quit without error.

Actual result?

  1. label, ip10 folder and out subfolder were created in file system, but no contents.
  2. oneflow quit with error

Additional details / screenshot

Both the ip and conv cases core dump at clone_kernel.cpp:34.
No further analysis has been done.

TODO: fix numerical instability in sparse_softmax_cross_entropy_with_logits

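sparse_softmax_cross_entropy_with_logits contains a log softmax. OneFlow computes it directly as SafeLog(softmax(x)), so in some cases softmax(x) falls below SafeLog's threshold of 1e-20 and the result is just the constant SafeLog(1e-20) = -46.0517. Other frameworks such as TensorFlow and PyTorch, when computing log_softmax, move the numerator of softmax(x) outside the log, which gives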

tmp = x - reduce_max(x, axis);
log_softmax(x, axis) = tmp - log(reduce_sum(exp(tmp), axis))

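(See the scipy code for reference: https://github.com/scipy/scipy/blob/v1.5.0/scipy/special/_logsumexp.py#L217)
Along axis, tmp then always contains a 0 and every other value is at most 0, so the input to log, reduce_sum(exp(tmp), axis), is always at least 1 and at most x.shape[axis]. That is a very safe range, and log always returns a correct result.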
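The difference between the two formulations can be reproduced without any framework. The following is a minimal pure-Python sketch for a single row of logits. The function names are hypothetical, and the naive version only imitates the clamping behaviour described above (a floor of 1e-20 before the log); it is not OneFlow's actual kernel. The naive version also subtracts the maximum inside the softmax, which is an assumption made here to avoid overflow in `exp`.

```python
import math

SAFE_LOG_FLOOR = 1e-20  # threshold quoted in the issue text


def naive_log_softmax(xs):
    """log(softmax(x)) with the softmax value clamped from below before the log."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [math.log(max(e / total, SAFE_LOG_FLOOR)) for e in exps]


def stable_log_softmax(xs):
    """tmp - log(sum(exp(tmp))) with tmp = x - max(x), as in the formula above."""
    m = max(xs)
    tmp = [x - m for x in xs]
    log_sum = math.log(sum(math.exp(t) for t in tmp))
    return [t - log_sum for t in tmp]


logits = [-30.0, 30.0]
label = 0
naive_loss = -naive_log_softmax(logits)[label]
stable_loss = -stable_log_softmax(logits)[label]
print(f"naive:  {naive_loss:.4f}")
print(f"stable: {stable_loss:.4f}")
```

For these logits the naive loss saturates at about 46.0517 while the stable loss is 60, which matches the outputs reported in this issue.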

Complete code that reproduces the problem:

import numpy as np
import oneflow as flow
import torch
import torch.nn.functional as F
import tensorflow as tf

func_config = flow.FunctionConfig()
func_config.default_data_type(flow.float)

@flow.global_function(func_config)
def FlowJob(labels=flow.FixedTensorDef((1,), dtype=flow.int64), logits=flow.FixedTensorDef((1, 2))):                                                              
  return flow.nn.sparse_softmax_cross_entropy_with_logits(labels, logits)
 
labels=np.array([0])
logits=np.array([[-30, 30]]).astype(np.float32)
tf_res = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits)
pytorch_res = F.cross_entropy(torch.tensor(logits), torch.tensor(labels))
flow_res = FlowJob(labels, logits).get().ndarray()
print(f'tf: {tf_res}')
print(f'pytorch: {pytorch_res}')
print(f'flow: {flow_res}') 

Output:

tf: [60.]
pytorch: 60.0
flow: [46.0517]

🤣 I don't have time to fix this right now, so I'm recording it as a TODO.

Flaw about learning rate drop at warm-up and decay boundary

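The learning rate currently drops discontinuously at the boundary between warm-up and decay, and this needs to be fixed.
Beyond that, we need an elegant way to handle learning rate configuration in all cases, for example by computing the learning rate for each step in Python and passing it to OneFlow.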
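A per-step schedule computed in Python could look like the sketch below. It assumes linear warm-up followed by linear decay, which is only one possible choice, and every name and default value in it is hypothetical. The point it illustrates is that the decay phase starts from the value the warm-up phase ends at, so there is no drop at the boundary.

```python
def learning_rate(step, base_lr=1e-3, warmup_steps=100, total_steps=1000, end_lr=0.0):
    """Linear warm-up to base_lr, then linear decay to end_lr, continuous at the boundary."""
    if step < warmup_steps:
        # Ramp from 0 toward base_lr; the value at step == warmup_steps is base_lr.
        return base_lr * step / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min((step - warmup_steps) / decay_steps, 1.0)
    return base_lr + (end_lr - base_lr) * progress


for step in (0, 99, 100, 101, 550, 1000):
    print(step, learning_rate(step))
```

Each training step would then pass `learning_rate(step)` to the framework, which keeps all schedule logic in one place on the Python side.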

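TODO: ABI compatibility for dynamically loaded ops

The ABI is the interface that a binary exposes. Binaries compiled by gcc 4.x and gcc 5 have different ABIs, so binaries built with gcc 5 and with older gcc versions cannot be linked against each other. This does not affect OneFlow yet, because we all build from source and never mix two compilers.

It gets more complicated once we publish prebuilt OneFlow whl packages. The prebuilt packages will probably be compiled with gcc 4. If a user with a newer compiler installs our published whl package, then compiles and loads a dynamically loaded op themselves, an ABI incompatibility error will occur.

The fix: in an OneFlow package built with the old ABI, the get_compile_flags() method (which returns the compiler options a user needs when building a dynamically loaded op) should add -D_GLIBCXX_USE_CXX11_ABI=0 to force the old ABI.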
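The proposed behaviour can be sketched as follows. This is not the real get_compile_flags() implementation: the function `sketch_compile_flags`, its parameters, and the include path are hypothetical, and only the -D_GLIBCXX_USE_CXX11_ABI macro comes from the text above.

```python
def sketch_compile_flags(include_dir, built_with_cxx11_abi):
    """Return compiler flags that make a user-built op match the package's ABI."""
    abi_value = 1 if built_with_cxx11_abi else 0
    return [
        f"-I{include_dir}",
        # Forces libstdc++ to use the same ABI the prebuilt package was compiled with.
        f"-D_GLIBCXX_USE_CXX11_ABI={abi_value}",
    ]


# A package built with the old ABI tells users to compile with the macro set to 0.
flags = sketch_compile_flags("/path/to/oneflow/include", built_with_cxx11_abi=False)
print(" ".join(flags))
```

A user would append these flags to the compiler command line when building a dynamically loaded op.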

Failed to initialize NVML: Driver/library version mismatch

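Possible causes

  1. The driver was updated without cleanup, with incomplete cleanup, or by a forced overwrite install.
    Fix: try rebooting, then reinstall the driver.
  2. nouveau is not on the blacklist. While the driver is being cleaned up, the system automatically searches for an available device controller and nouveau takes over again, so the driver is installed correctly but not loaded correctly.
    Fix: echo "blacklist nouveau" > /etc/modprobe.d/blacklist-nouveau.conf, then reboot.
  3. The CUDA version inside nvidia docker is incompatible with the host driver. See the nvidia-docker wiki on GitHub for the compatibility list.
    Fix: switch to a CUDA image that matches the host driver, or reinstall a host driver that matches the CUDA image.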

default_data_type does not work

code

import oneflow as flow
import numpy as np

func_config = flow.FunctionConfig()
func_config.default_data_type(flow.int32)
#func_config.default_data_type(flow.float32)

def test_naive(test_case):
    @flow.function(func_config)
    #def ModJob(a=flow.FixedTensorDef((5, 2), dtype=flow.int32), b=flow.FixedTensorDef((5, 2), dtype=flow.int32)):
    def ModJob(a=flow.FixedTensorDef((5, 2)), b=flow.FixedTensorDef((5, 2))):
        return a % b

    x = (np.random.rand(5, 2)*1000).astype(np.int32)
    y = (np.random.rand(5, 2)*1000).astype(np.int32)
    z = None
    z = ModJob(x, y).get().ndarray()
    test_case.assertTrue(np.array_equal(z, x % y))

run

python 1node_test.py test_mod_int.py

error

Traceback (most recent call last):
  File "/home/xiexuan/sandbox/oneflow/build/python_scripts/oneflow/python/framework/job_instance.py", line 83, in PushBlob
    try: self.push_cb_(ofblob.OfBlob(of_blob_ptr))
  File "/home/xiexuan/sandbox/oneflow/build/python_scripts/oneflow/python/framework/input_blob_def.py", line 211, in <lambda>
    return lambda ofblob: ofblob.CopyFromNdarray(copied)
  File "/home/xiexuan/sandbox/oneflow/build/python_scripts/oneflow/python/framework/ofblob.py", line 49, in CopyFromNdarray
    return self._CopyFromNdarrayLists([[src_ndarray]])
  File "/home/xiexuan/sandbox/oneflow/build/python_scripts/oneflow/python/framework/ofblob.py", line 101, in _CopyFromNdarrayLists
    self._CopyFromNdarrayListAndIsNewSliceStartMask(flat_ndarray_list, is_new_slice_start_mask)
  File "/home/xiexuan/sandbox/oneflow/build/python_scripts/oneflow/python/framework/ofblob.py", line 114, in _CopyFromNdarrayListAndIsNewSliceStartMask
    copy_method(self.of_blob_ptr_, tensor)
TypeError: Array of type 'float' required.  Array of type 'int' given
...

Conflicting declaration of a same-named function with gcc 10

/home/sty/cccc/oneflow/build/grpc/src/grpc/src/core/lib/support/log_linux.c:58:13: error: conflicting types for ‘gettid’
58 | static long gettid(void) { return syscall(__NR_gettid); }
| ^~~~~~
In file included from /usr/include/unistd.h:1170,
from /home/xxx/cccc/oneflow/build/grpc/src/grpc/src/core/lib/support/log_linux.c:56:
/usr/include/x86_64-linux-gnu/bits/unistd_ext.h:34:16: note: previous declaration of ‘gettid’ was here
34 | extern __pid_t gettid (void) __THROW;
| ^~~~~~

gcc version:
Using built-in specs.
COLLECT_GCC=g++
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/10/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none:amdgcn-amdhsa:hsa
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 10-20200411-0ubuntu1' --with-bugurl=file:///usr/share/doc/gcc-10/README.Bugs --enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++,m2 --prefix=/usr --with-gcc-major-version-only --program-suffix=-10 --program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-plugin --enable-default-pie --with-system-zlib --enable-libphobos-checking=release --with-target-system-zlib=auto --enable-objc-gc=auto --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-offload-targets=nvptx-none,amdgcn-amdhsa,hsa --without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 10.0.1 20200411 (experimental) [master revision bb87d5cc77d:75961caccb7:f883c46b4877f637e0fa5025b4d6b5c9040ec566] (Ubuntu 10-20200411-0ubuntu1)

Distribution: Ubuntu 20.04

AttributeError: 'NoneType' object has no attribute 'net'

code snippet

import oneflow as of

train_config = of.ConfigProtoBuilder()
train_config.gpu_device_num(config.device_num)
train_config.grpc_use_no_signal()
train_config.model_load_snapshot_path(config.pretrain_model_path)
train_config.model_save_snapshots_path(config.model_save_path)

of.init(train_config)
dnn = of.deprecated.get_cur_job_dlnet_builder()

error information

I0822 10:46:11.610416551   37494 ev_epoll_linux.c:82]        Use of signals is disabled. Epoll engine will not be used

  File "/home/dongjiaxu/Work/oneflow/build/python_scripts/oneflow/python/deprecated/dl_net.py", line 28, in get_cur_job_dlnet_builder
    _cur_job2dl_net_builder[id(compile_ctx.cur_job)] = DLNet(compile_ctx.cur_job.net)

AttributeError: 'NoneType' object has no attribute 'net'
