DirectML is a high-performance, hardware-accelerated DirectX 12 library for machine learning. DirectML provides GPU acceleration for common machine learning tasks across a broad range of supported hardware and drivers, including all DirectX 12-capable GPUs from vendors such as AMD, Intel, NVIDIA, and Qualcomm.

License: MIT License

Python 46.65% Shell 0.33% C++ 42.97% CMake 3.21% C 6.05% HLSL 0.02% PowerShell 0.77%

directml's Introduction

DirectML

When used standalone, the DirectML API is a low-level DirectX 12 library suitable for high-performance, low-latency applications such as frameworks, games, and other real-time applications. Its seamless interoperability with Direct3D 12, low overhead, and conformance across hardware make DirectML ideal for accelerating machine learning when both high performance and reliable, predictable results across hardware are critical.

More information about DirectML can be found in Introduction to DirectML.

Visit the DirectX Landing Page for more resources for DirectX developers.

Getting Started with DirectML

DirectML is distributed as a system component of Windows 10 and is available in Windows 10, version 1903 (10.0; Build 18362), and newer.

Starting with DirectML version 1.4.0, DirectML is also available as a standalone redistributable package (see Microsoft.AI.DirectML), which is useful for applications that wish to use a fixed version of DirectML, or when running on older versions of Windows 10.
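
The build requirement above can be checked programmatically. The following is a minimal sketch rather than anything shipped with DirectML: the helper names `has_inbox_directml` and `current_windows_build` are hypothetical, and the only fact encoded is the build threshold (18362) stated above.

```python
import sys

# Build number of Windows 10, version 1903, the first release with in-box DirectML.
MIN_INBOX_BUILD = 18362


def has_inbox_directml(build: int) -> bool:
    """Return True if a Windows 10 build number is new enough to include DirectML."""
    return build >= MIN_INBOX_BUILD


def current_windows_build():
    """Return the running Windows build number, or None on other platforms."""
    if sys.platform != "win32":
        return None
    return sys.getwindowsversion().build


if __name__ == "__main__":
    build = current_windows_build()
    if build is None:
        print("Not running on Windows; the in-box check does not apply.")
    elif has_inbox_directml(build):
        print(f"Build {build}: DirectML is available as a system component.")
    else:
        print(f"Build {build}: consider the redistributable package instead.")
```

An application that needs a fixed DirectML version would skip this check and ship the redistributable package regardless of the OS build.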

Hardware requirements

DirectML requires a DirectX 12 capable device. Almost all commercially-available graphics cards released in the last several years support DirectX 12. Examples of compatible hardware include:

  • AMD GCN 1st Gen (Radeon HD 7000 series) and above
  • Intel Haswell (4th-gen core) HD Integrated Graphics and above
  • NVIDIA Kepler (GTX 600 series) and above
  • Qualcomm Adreno 600 and above

For application developers

DirectML exposes a native C++ DirectX 12 API. The header and library (DirectML.h/DirectML.lib) are available as part of the redistributable NuGet package, and are also included in the Windows 10 SDK version 10.0.18362 or newer.

For users, data scientists, and researchers

DirectML is built-in as a backend to several frameworks such as Windows ML, ONNX Runtime, and TensorFlow.

See the following sections for more information:

DirectML Samples

DirectML C++ sample code is available under Samples.

  • HelloDirectML: A minimal "hello world" application that executes a single DirectML operator.
  • DirectMLSuperResolution: A sample that uses DirectML to execute a basic super-resolution model to upscale video from 540p to 1080p in real time.
  • yolov4: YOLOv4 is an object detection model capable of recognizing up to 80 different classes of objects in an image. This sample contains a complete end-to-end implementation of the model using DirectML, and is able to run in real time on a user-provided video stream.

DirectML Python sample code is available under Python/samples. The samples require PyDirectML, an open source Python projection library for DirectML, which can be built and installed to a Python executing environment from Python/src. Refer to the Python/README.md file for more details.

DxDispatch Tool

DxDispatch is a simple command-line executable for launching DirectX 12 compute programs (including DirectML operators) without writing all the C++ boilerplate.

Windows ML on DirectML

Windows ML (WinML) is a high-performance, reliable API for deploying hardware-accelerated ML inferences on Windows devices. DirectML provides the GPU backend for Windows ML.

DirectML acceleration can be enabled in Windows ML using the LearningModelDevice with any one of the DirectX DeviceKinds.

For more information, see Get Started with Windows ML.

ONNX Runtime on DirectML

ONNX Runtime is a cross-platform inferencing and training accelerator compatible with many popular ML/DNN frameworks, including PyTorch, TensorFlow/Keras, scikit-learn, and more.

DirectML is available as an optional execution provider for ONNX Runtime that provides hardware acceleration when running on Windows 10.

For more information about getting started, see Using the DirectML execution provider.
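
As a rough illustration of what choosing the execution provider can look like from Python, the sketch below orders the providers so that DirectML is tried first and the CPU provider remains as a fallback. The helper names `order_providers` and `create_session` and the `model_path` argument are hypothetical; the provider names are the ones ONNX Runtime reports, and `create_session` assumes an ONNX Runtime build that includes the DirectML provider.

```python
def order_providers(available):
    """Put the DirectML provider first, keeping the CPU provider as a fallback if present."""
    preferred = ["DmlExecutionProvider", "CPUExecutionProvider"]
    return [name for name in preferred if name in available]


def create_session(model_path):
    """Create an InferenceSession that prefers DirectML (requires onnxruntime)."""
    import onnxruntime as ort  # imported lazily so order_providers works without it

    providers = order_providers(ort.get_available_providers())
    return ort.InferenceSession(model_path, providers=providers)


if __name__ == "__main__":
    # Stand-in list so the sketch runs without onnxruntime installed.
    print(order_providers(["CPUExecutionProvider", "DmlExecutionProvider"]))
```

Passing the provider list at session creation keeps the choice explicit, rather than relying on whatever default order the installed package uses.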

PyTorch with DirectML

PyTorch with DirectML enables training and inference of complex machine learning models on a wide range of DirectX 12-compatible hardware. This is done through torch-directml, a plugin for PyTorch.

PyTorch with DirectML is supported on both the latest versions of Windows and the Windows Subsystem for Linux, and is available for download as a PyPI package. For more information about getting started with torch-directml, see our Windows or WSL 2 guidance on Microsoft Learn.
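
A minimal sketch of selecting the DirectML device from Python follows. It assumes the `torch-directml` package is installed and falls back to the CPU otherwise; `select_device` is a hypothetical helper name, not part of the plugin.

```python
def select_device():
    """Return a DirectML device if torch-directml is installed, else the string "cpu"."""
    try:
        import torch_directml
    except ImportError:
        return "cpu"
    return torch_directml.device()


if __name__ == "__main__":
    device = select_device()
    print(f"Selected device: {device}")
```

The returned value can be passed wherever PyTorch accepts a device, for example `torch.ones(2, device=select_device())`.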

TensorFlow with DirectML

TensorFlow is a popular open source platform for machine learning and is a leading framework for training of machine learning models.

DirectML acceleration for TensorFlow 1.15 is currently available for Public Preview. TensorFlow on DirectML enables training and inference of complex machine learning models on a wide range of DirectX 12-compatible hardware.

TensorFlow on DirectML is supported on both the latest versions of Windows 10 and the Windows Subsystem for Linux, and is available for download as a PyPI package. For more information about getting started, see GPU accelerated ML training (docs.microsoft.com).

Feedback

We look forward to hearing from you!

External Links

Documentation

DirectML programming guide
DirectML API reference

More information

Introducing DirectML (Game Developers Conference '19)
Accelerating GPU Inferencing with DirectML and DirectX 12 (SIGGRAPH '18)
Windows AI: hardware-accelerated ML on Windows devices (Microsoft Build '20)
Gaming with Windows ML (DirectX Developer Blog)
DirectML at GDC 2019 (DirectX Developer Blog)
DirectX ❤ Linux (DirectX Developer Blog)

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

directml's People

Contributors

bbernhar, chrilamsft, dependabot[bot], fdwr, gbionescu, huningxin, inisis, jamather, jamiemagee, jeffbloo, jstoecker, kevincog-msft, linnealovespie, lpbourret, maggie1059, microsoft-github-policy-service[bot], microsoftopensource, mingmingtasd, msftgits, nvvlad, patricevignola, python3kgae, raoanag, smk2007, sumitsays, tbqh, tinyboxvk, walbourn, wchao1115, zhangxiang1993

directml's Issues

mnist.py sample fails with debug layer enabled

The error log is

D3D12 ERROR: ID3D12GraphicsCommandList::ExecuteMetaCommand: Supplied parameters size [40] doesn't match enumerated size[32]. [ EXECUTION ERROR #1174: META_COMMAND_PARAMETER_SIZE_MISMATCH]

I enabled the DirectML debugging layer.

I ran this sample on a machine with Windows 10 20H2 build 19042.685 and Intel HD630 GPU with driver version 27.20.100.9030 (11/27/2020).

BTW, I can work around it by supplying ExecutionFlags.DISABLE_META_COMMANDS.

Extremely bad performance (25x slower) compared to CPU-only mode (26 ms vs 670 ms)

This is a follow-up to microsoft/onnxruntime#5617.
Using the DML provider results in 25x worse performance (670 ms vs 26 ms on CPU), most likely caused by the MemcpyToHost operation.

Run:

import numpy as np
import onnxruntime
from timeit import default_timer as timer

time_dict = {}
def get_min_max(name, elapsed):
    min_tm = time_dict[name][0]
    max_tm = time_dict[name][1]
    if min_tm > elapsed:
        min_tm = elapsed
    if max_tm < elapsed:
        max_tm = elapsed
    return min_tm, max_tm

class Benchmark_Block(object):
    def __init__(self, name='code-block'):
        self.name = name
        if name not in time_dict.keys():
            time_dict[name] = (999, -999)

    def __enter__(self):
        self.start = timer()

    def __exit__(self, exc_type, exc_value, exc_traceback):
        end = timer()
        elapsed = end - self.start 
        (min_tm, max_tm) = get_min_max(self.name, elapsed)
        time_dict[self.name] = (min_tm, max_tm)
        print(f'{self.name} took {elapsed*1000:.3f} ms [min/max: {min_tm*1000:.1f}/{max_tm*1000:.1f}] ms')

model_path = 'r18_q_onnx.onnx'
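img_input = np.random.randn(1, 3, 112, 112)
# The original referenced an undefined name `res`; convert the random input to float32 instead.
img_input = np.asarray(img_input, dtype=np.float32)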

sess = onnxruntime.InferenceSession(model_path)
ort_inputs = {sess.get_inputs()[0].name: img_input}
ort_outs = sess.run(None, ort_inputs)

providers = onnxruntime.get_available_providers()
print(f'available providers : {providers}')
print(f'current device: {onnxruntime.get_device()}')

sess.set_providers(['CPUExecutionProvider'])
with Benchmark_Block('CPU_ONNX') as blk:
    ort_outs = sess.run(None, ort_inputs)

sess.set_providers(['DmlExecutionProvider'])
with Benchmark_Block('DML_ONNX') as blk:
    ort_outs = sess.run(None, ort_inputs)

ONNX model: https://gofile.io/d/GdkIeR

Is there a workaround to avoid this at the moment?
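
The script above times a single `sess.run` call right after switching providers, so each figure may include one-time initialization cost. As a hedged sketch (not a diagnosis of this issue), the measurement can be made more robust by discarding a few warm-up runs and reporting statistics over several repeats. The helper `benchmark` and its parameters are hypothetical; with ONNX Runtime it could be called as `benchmark(lambda: sess.run(None, ort_inputs))`.

```python
import statistics
from timeit import default_timer as timer


def benchmark(fn, warmup=3, repeats=10):
    """Time fn() after warm-up calls; return (min, median, max) in milliseconds."""
    for _ in range(warmup):
        fn()  # results discarded: first calls may include one-time setup cost
    samples = []
    for _ in range(repeats):
        start = timer()
        fn()
        samples.append((timer() - start) * 1000.0)
    return min(samples), statistics.median(samples), max(samples)


if __name__ == "__main__":
    # Stand-in workload so the sketch runs without a model file.
    lo, med, hi = benchmark(lambda: sum(range(10_000)))
    print(f"min/median/max: {lo:.3f}/{med:.3f}/{hi:.3f} ms")
```

If the gap between providers persists after warm-up, the steady-state numbers give a firmer basis for attributing it to a specific operation such as the host copy.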

Detection on custom yolov3-tiny doesn't work well

Using Google Colab, I trained my own custom yolov3-tiny model with an input size of 928x928.

I converted my custom yolov3-tiny model using convert.py:
py convert.py -weights ./data/yolov3-tiny.weights -output ./checkpoints/yolov3-tiny.tf -num_classes 4 --tiny
and ran inference with detect_video.py (after also changing the class names in coco.names):
py detect_video.py -classes ./data/coco.names -weights ./checkpoints/yolov3-tiny.tf --tiny -size 928 -video ./data/clip.mkv -num_classes 4
It does work, but only with an extremely high-resolution input (928x928), and even then the detections aren't stable. When I reduce the input size a bit (to 608x608), it doesn't detect anything.

In comparison, I ran my model with the OpenCV darknet port using only the CPU. Even when reducing
the input image to 160x160, the detections worked perfectly,
just slower, obviously, since it does not use my AMD RX 480 GPU.

Performance is somewhat better than running only on the CPU,
but the accuracy is so low that I can't depend on this method.
My guess is that I'm doing something wrong in the conversion,
because the model's accuracy drops drastically.

Can you point out what I am doing wrong?
Or is the loss of accuracy the cost of converting YOLO models to TF models?

DirectML Linux support

Are there plans for support on Linux?
It would be quite useful for AMD users since ROCm supports only a limited set of cards currently and Intel GPU users only have PlaidML as an option.

Cannot build the yolov4 sample

The only errors left are these "'&' requires l-value" errors, and I can't work out what the issue is. I am building from the latest commit on the master branch (6fed953).

Severity Code Description Project File Line Suppression State
Error C2102 '&' requires l-value yolov4 E:\MLStuff\DirectML-master\Samples\yolov4\WeightData.cpp 45
Error C2102 '&' requires l-value yolov4 E:\MLStuff\DirectML-master\Samples\yolov4\WeightData.cpp 47
Error C2102 '&' requires l-value yolov4 E:\MLStuff\DirectML-master\Samples\yolov4\WeightData.cpp 55
Error C2102 '&' requires l-value yolov4 E:\MLStuff\DirectML-master\Samples\yolov4\WeightData.cpp 57
Error C2102 '&' requires l-value yolov4 E:\MLStuff\DirectML-master\Samples\yolov4\WeightData.cpp 92
Error C2102 '&' requires l-value yolov4 E:\MLStuff\DirectML-master\Samples\yolov4\yolov4.cpp 461
Error C2102 '&' requires l-value yolov4 E:\MLStuff\DirectML-master\Samples\yolov4\yolov4.cpp 647
Error C2102 '&' requires l-value yolov4 E:\MLStuff\DirectML-master\Samples\yolov4\yolov4.cpp 909
Error C2102 '&' requires l-value yolov4 E:\MLStuff\DirectML-master\Samples\yolov4\yolov4.cpp 911
Error C2102 '&' requires l-value yolov4 E:\MLStuff\DirectML-master\Samples\yolov4\yolov4.cpp 940
Error C2102 '&' requires l-value yolov4 E:\MLStuff\DirectML-master\Samples\yolov4\yolov4.cpp 942
Error C2102 '&' requires l-value yolov4 E:\MLStuff\DirectML-master\Samples\yolov4\yolov4ResourceBuilder.cpp 466
Error C2102 '&' requires l-value yolov4 E:\MLStuff\DirectML-master\Samples\yolov4\yolov4ResourceBuilder.cpp 468
Error C2102 '&' requires l-value yolov4 E:\MLStuff\DirectML-master\Samples\yolov4\yolov4ResourceBuilder.cpp 488
Error C2102 '&' requires l-value yolov4 E:\MLStuff\DirectML-master\Samples\yolov4\yolov4ResourceBuilder.cpp 490
Error C2102 '&' requires l-value yolov4 E:\MLStuff\DirectML-master\Samples\yolov4\yolov4ResourceBuilder.cpp 495
Error C2102 '&' requires l-value yolov4 E:\MLStuff\DirectML-master\Samples\yolov4\yolov4ResourceBuilder.cpp 497
Error C2102 '&' requires l-value yolov4 E:\MLStuff\DirectML-master\Samples\yolov4\yolov4ResourceBuilder.cpp 505
Error C2102 '&' requires l-value yolov4 E:\MLStuff\DirectML-master\Samples\yolov4\yolov4ResourceBuilder.cpp 507
Error C2102 '&' requires l-value yolov4 E:\MLStuff\DirectML-master\Samples\yolov4\yolov4ResourceBuilder.cpp 512
Error C2102 '&' requires l-value yolov4 E:\MLStuff\DirectML-master\Samples\yolov4\yolov4ResourceBuilder.cpp 514
Error C2102 '&' requires l-value yolov4 E:\MLStuff\DirectML-master\Samples\yolov4\yolov4ResourceBuilder.cpp 522
Error C2102 '&' requires l-value yolov4 E:\MLStuff\DirectML-master\Samples\yolov4\yolov4ResourceBuilder.cpp 524
Error C2102 '&' requires l-value yolov4 E:\MLStuff\DirectML-master\Samples\yolov4\yolov4ResourceBuilder.cpp 529
Error C2102 '&' requires l-value yolov4 E:\MLStuff\DirectML-master\Samples\yolov4\yolov4ResourceBuilder.cpp 531
Error C2102 '&' requires l-value yolov4 E:\MLStuff\DirectML-master\Samples\yolov4\yolov4ResourceBuilder.cpp 566
Error C2102 '&' requires l-value yolov4 E:\MLStuff\DirectML-master\Samples\yolov4\yolov4ResourceBuilder.cpp 582
Error C2102 '&' requires l-value yolov4 E:\MLStuff\DirectML-master\Samples\yolov4\yolov4ResourceBuilder.cpp 599
Error C2102 '&' requires l-value yolov4 E:\MLStuff\DirectML-master\Samples\yolov4\yolov4ResourceBuilder.cpp 640
Error C2102 '&' requires l-value yolov4 E:\MLStuff\DirectML-master\Samples\yolov4\yolov4ResourceBuilder.cpp 649
Error C2102 '&' requires l-value yolov4 E:\MLStuff\DirectML-master\Samples\yolov4\yolov4ResourceBuilder.cpp 656
Error C2102 '&' requires l-value yolov4 E:\MLStuff\DirectML-master\Samples\yolov4\yolov4ResourceBuilder.cpp 658
Error C2102 '&' requires l-value yolov4 E:\MLStuff\DirectML-master\Samples\yolov4\yolov4ResourceBuilder.cpp 664

DirectML v1.4 ARM64 version?

Would it be possible to have an ARM64 version of DirectML? I downloaded the NuGet package but it only includes x86/x64 versions.

My goal is to execute a PyTorch exported ONNX model (yolov5) using the GPU. On x64 it runs fine with DirectML v1.4 but not with the older one included in Windows. That's why I would need to use the newer NuGet version and redistribute it for my ARM64 application (Hololens 2).

Thanks.

Operator GRU does not accept input buffer at index 2

Here is the error I got when binding input buffers for the GRU operator:

D3D12 ERROR: : the dispatchable object expects nothing to be bound at index 2, but a binding of type DML_BINDING_TYPE_BUFFER was provided. Use binding type NONE to bind nothing to this slot. [ UNKNOWN ERROR #1: STRING_FROM_APPLICATION]

According to the GRU description below, it needs at least three inputs: data input, weight, and recurrent. The BindInputs function seems to accept tensors at index 0 (data input) and index 1 (weight), but not at index 2 (recurrent). Am I misunderstanding something? Is there any sample code for GRU usage?

https://docs.microsoft.com/en-us/windows/win32/api/directml/ns-directml-dml_gru_operator_desc

Error "cannot open source file 'DirectML.h'" happened on Version 1903 and OS build 18343.1.

Steps to reproduce this issue:

  1. Log in as a Windows Insider,
  2. Download Windows 10 Insider Preview Client x64 en-us 18343 ISO from https://www.microsoft.com/en-us/software-download/windowsinsiderpreviewadvanced,
  3. Transfer the ISO file to installation media,
  4. Boot PC from the installation media,
  5. Install the latest VS with the latest Windows SDK (10.0.17763.0),
  6. Build the HelloDirectML sample project.

Please help look at this issue, thanks.

Inferencing yolov3-tiny doesn't detect anything

When I convert a yolov3 model and run inference with it, everything works great.
But then I downloaded https://pjreddie.com/media/files/yolov3-tiny.weights
and converted it using this command:
py convert.py -weights ./data/yolov3-tiny.weights -output ./checkpoints/yolov3-tiny.tf --tiny
It says the weights were saved and everything seems fine, but then I tried running inference using the default command:
py detect_video.py -weights ./checkpoints/yolov3-tiny.tf -video 0 --tiny
and it doesn't detect anything.
I even deleted everything and used the tiny setup.py to figure out whether something I did
went wrong,
but the same thing happened: no detections.
Have I done something wrong?

Can the values of DimensionCount in DML_BUFFER_TENSOR_DESC be less than 4?

According to the DML_BUFFER_TENSOR_DESC doc, the valid values are either 4 or 5.

In DirectML, all buffer tensors must have a DimensionCount of either 4 or 5.

However, according to my test, user code can set DimensionCount to less than 4. For example, the following PyDirectML code, which adds two tensors of shape [2, 2], works just fine.

import pydirectml as dml
import numpy as np

device = dml.Device()
builder = dml.GraphBuilder(device)
data_type = dml.TensorDataType.FLOAT32
flags = dml.TensorFlags.OWNED_BY_DML
input_bindings = []
a = dml.input_tensor(builder, 0, dml.TensorDesc(data_type, [2, 2]))
input_bindings.append(dml.Binding(a, np.ones([2, 2], dtype=np.float32)))
b = dml.input_tensor(builder, 1, dml.TensorDesc(data_type, flags, [2, 2]))
input_bindings.append(dml.Binding(b, np.ones([2, 2], dtype=np.float32)))
c = dml.add(a, b)
op = builder.build(dml.ExecutionFlags.NONE, [c])
output_data = device.compute(op, input_bindings, [c])
output_tensor = np.array(output_data[0], np.float32)
print(output_tensor)

The output is

[[2. 2.]
 [2. 2.]]

Internally, PyDirectML and DirectMLX.h set the DimensionCount to 2. Other ranks, such as 1-D or 3-D, also work.

Actually this is a very nice feature and would simplify the user code. I just want to know whether this feature is officially supported by DirectML. If it is, the doc probably needs to be updated.
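
For code that needs to follow the documented 4-or-5 rule regardless of what the runtime accepts, one common approach is to pad lower-rank shapes with leading dimensions of size 1, which leaves the element count unchanged. A minimal sketch follows; `pad_shape` is a hypothetical helper, not part of PyDirectML or DirectMLX.

```python
from math import prod


def pad_shape(shape, rank=4):
    """Left-pad shape with 1s up to `rank` dimensions; the element count is unchanged."""
    if len(shape) > rank:
        raise ValueError(f"shape {shape} already has more than {rank} dimensions")
    return [1] * (rank - len(shape)) + list(shape)


if __name__ == "__main__":
    padded = pad_shape([2, 2])
    assert prod(padded) == prod([2, 2])
    print(padded)  # [1, 1, 2, 2]
```

With this helper, the [2, 2] tensors in the example above would be described as [1, 1, 2, 2], and the result reshaped back to [2, 2] after the computation.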

ImportError: libd3d12.so: cannot open shared object file: No such file or directory

I've been following these install docs but for no clear reason I get the import error in the title.

My GPU is a Radeon VII and I've installed the linked AMD WSL Preview drivers, so that shouldn't be an issue. My WSL environment is set up properly as well:

PS C:\Users\cass> wsl cat /proc/version
Linux version 4.19.104-microsoft-standard (oe-user@oe-host) (gcc version 8.2.0 (GCC)) #1 SMP Wed Feb 19 06:37:35 UTC 2020
PS C:\Users\cass> wsl --list --verbose
  NAME            STATE           VERSION
* Ubuntu-20.04    Running         2

Full stack trace:

(base) cass@deskfox:~$ python3.7
Python 3.7.7 (default, May  7 2020, 21:25:33)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow.compat.v1 as tf
Traceback (most recent call last):
  File "/home/cass/miniconda3/lib/python3.7/site-packages/tensorflow_core/python/pywrap_tensorflow.py", line 58, in <module>
    from tensorflow.python.pywrap_tensorflow_internal import *
  File "/home/cass/miniconda3/lib/python3.7/site-packages/tensorflow_core/python/pywrap_tensorflow_internal.py", line 28, in <module>
    _pywrap_tensorflow_internal = swig_import_helper()
  File "/home/cass/miniconda3/lib/python3.7/site-packages/tensorflow_core/python/pywrap_tensorflow_internal.py", line 24, in swig_import_helper
    _mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
  File "/home/cass/miniconda3/lib/python3.7/imp.py", line 242, in load_module
    return load_dynamic(name, filename, file)
  File "/home/cass/miniconda3/lib/python3.7/imp.py", line 342, in load_dynamic
    return _load(spec)
ImportError: libd3d12.so: cannot open shared object file: No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/cass/miniconda3/lib/python3.7/site-packages/tensorflow/__init__.py", line 110, in <module>
    from tensorflow_core import *
  File "/home/cass/miniconda3/lib/python3.7/site-packages/tensorflow_core/__init__.py", line 28, in <module>
    from tensorflow.python import pywrap_tensorflow  # pylint: disable=unused-import
  File "/home/cass/miniconda3/lib/python3.7/site-packages/tensorflow/__init__.py", line 58, in __getattr__
    module = self._load()
  File "/home/cass/miniconda3/lib/python3.7/site-packages/tensorflow/__init__.py", line 52, in _load
    module = _importlib.import_module(self.__name__)
  File "/home/cass/miniconda3/lib/python3.7/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "/home/cass/miniconda3/lib/python3.7/site-packages/tensorflow_core/python/__init__.py", line 55, in <module>
    from tensorflow.python import pywrap_tensorflow
  File "/home/cass/miniconda3/lib/python3.7/site-packages/tensorflow_core/python/pywrap_tensorflow.py", line 74, in <module>
    raise ImportError(msg)
ImportError: Traceback (most recent call last):
  File "/home/cass/miniconda3/lib/python3.7/site-packages/tensorflow_core/python/pywrap_tensorflow.py", line 58, in <module>
    from tensorflow.python.pywrap_tensorflow_internal import *
  File "/home/cass/miniconda3/lib/python3.7/site-packages/tensorflow_core/python/pywrap_tensorflow_internal.py", line 28, in <module>
    _pywrap_tensorflow_internal = swig_import_helper()
  File "/home/cass/miniconda3/lib/python3.7/site-packages/tensorflow_core/python/pywrap_tensorflow_internal.py", line 24, in swig_import_helper
    _mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
  File "/home/cass/miniconda3/lib/python3.7/imp.py", line 242, in load_module
    return load_dynamic(name, filename, file)
  File "/home/cass/miniconda3/lib/python3.7/imp.py", line 342, in load_dynamic
    return _load(spec)
ImportError: libd3d12.so: cannot open shared object file: No such file or directory


Failed to load the native TensorFlow runtime.

See https://www.tensorflow.org/install/errors

for some common reasons and solutions.  Include the entire stack trace
above this error message when asking for help.
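
One way to narrow down this kind of error is to check whether the GPU user-mode libraries are visible inside the distribution. The sketch below assumes the locations commonly used by WSL 2 GPU support (`/usr/lib/wsl/lib` for the libraries and `/dev/dxg` for the device node). These paths are assumptions, not taken from the report above, and may differ across WSL builds; `check_paths` is a hypothetical helper.

```python
import os

# Assumed locations; adjust if your WSL build places them elsewhere.
WSL_GPU_PATHS = (
    "/dev/dxg",
    "/usr/lib/wsl/lib/libd3d12.so",
    "/usr/lib/wsl/lib/libdxcore.so",
)


def check_paths(paths=WSL_GPU_PATHS):
    """Map each path to True if it exists on this system, else False."""
    return {path: os.path.exists(path) for path in paths}


if __name__ == "__main__":
    for path, present in check_paths().items():
        print(f"{'found  ' if present else 'missing'}  {path}")
```

If the files are missing, the problem most likely lies in the Windows build or driver setup rather than in the Python environment.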

[installation] Could not find a version that satisfies the requirement tensorflow-directml (from versions: none)

Hi,

After following the steps described in https://docs.microsoft.com/en-us/windows/win32/direct3d12/gpu-tensorflow-wsl till pip install tensorflow-directml,

the error appeared as

ERROR: Could not find a version that satisfies the requirement tensorflow-directml (from versions: none)
ERROR: No matching distribution found for tensorflow-directml

BTW, I am using Python 3.8,

and I ran python list tensorflow*, which output:

Package    Version
---------- -------------------
certifi    2020.6.20
pip        20.1.1
setuptools 49.2.0.post20200714
wheel      0.34.2
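
A "from versions: none" message from pip usually means no published wheel matches the running interpreter or platform. The sketch below checks the interpreter version against the set of Python versions for which tensorflow-directml wheels are assumed to exist (3.5 to 3.7). That set is an assumption stated here for illustration and should be confirmed against the package's PyPI page; `wheel_available` is a hypothetical helper.

```python
import sys

# Assumption: wheels exist only for these CPython minor versions.
SUPPORTED = {(3, 5), (3, 6), (3, 7)}


def wheel_available(version=None):
    """Return True if (major, minor) is in the assumed supported set."""
    major, minor = (version or sys.version_info)[:2]
    return (major, minor) in SUPPORTED


if __name__ == "__main__":
    major, minor = sys.version_info[:2]
    status = "is" if wheel_available() else "is not"
    print(f"Python {major}.{minor} {status} in the assumed supported set.")
```

Under that assumption, creating an environment with an older interpreter (for example Python 3.7) before running pip install would be the first thing to try.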

Vega 8: DirectML device enumeration: found 0 compatible adapters.

I followed the same steps as in https://docs.microsoft.com/zh-cn/windows/win32/direct3d12/gpu-tensorflow-wsl, and I have a Vega 8 GPU.
But it reported:

Python 3.6.12 |Anaconda, Inc.| (default, Sep  8 2020, 23:10:56)
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow.compat.v1 as tf
>>> tf.enable_eager_execution(tf.ConfigProto(log_device_placement=True))
>>> print(tf.add([1.0, 2.0], [3.0, 4.0]))
2020-11-02 21:19:05.139698: I tensorflow/core/common_runtime/dml/dml_device_factory.cc:132] DirectML device enumeration: found 0 compatible adapters.
2020-11-02 21:19:05.141140: I tensorflow/core/common_runtime/eager/execute.cc:571] Executing op Add in device /job:localhost/replica:0/task:0/device:CPU:0
tf.Tensor([4. 6.], shape=(2,), dtype=float32)

Thank you in advance!
If there is any other information I should submit, please let me know!

A numpy/cupy project on top of Directml

Hi, I'd like to know whether the team behind DirectML plans to release a NumPy/CuPy-like library that uses DirectX 12 for computing. This would open up a lot of possibilities: many numerical libraries use NumPy for computing and could benefit from an option to accelerate such operations. I can see pandas using this, and many researchers implementing tools on top of it.
I would also like information on how to use the low-level math ops from DirectML.

WSL support for RTX TITAN

Hi,

Testing DirectML with a recent WSL 2 (Insider build from 2 days ago), I am getting an ImportError: libd3d12.so: cannot open shared object file: No such file or directory.

I suppose this comes from a missing NVIDIA driver update (GeForce and Quadro drivers are available online, but there is no driver for the RTX TITAN in my case). -> https://developer.nvidia.com/cuda/wsl/download

Can you confirm?

Regards

"found 0 compatible adapters" after following setup instructions

I did my best to follow the instructions at https://docs.microsoft.com/en-us/windows/win32/direct3d12/gpu-tensorflow-wsl. I already had a WSL 1 installation of Ubuntu, so in order, I:

  1. Installed the NVIDIA preview drivers
  2. Set my WSL to default to WSL 2
  3. Reset my Ubuntu installation
  4. Installed the nvidia-cuda-toolkit package

I then successfully ran the deviceQuery sample and got this output:

./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce RTX 2070 SUPER"
  CUDA Driver Version / Runtime Version          11.1 / 10.1
  CUDA Capability Major/Minor version number:    7.5
  Total amount of global memory:                 8192 MBytes (8589934592 bytes)
  (40) Multiprocessors, ( 64) CUDA Cores/MP:     2560 CUDA Cores
  GPU Max Clock rate:                            1815 MHz (1.81 GHz)
  Memory Clock rate:                             7001 Mhz
  Memory Bus Width:                              256-bit
  L2 Cache Size:                                 4194304 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        65536 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1024
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 6 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.1, CUDA Runtime Version = 10.1, NumDevs = 1
Result = PASS

However, now if I try to verify that TF/DirectML is picking up my GPU, I get:

Python 3.6.10 |Anaconda, Inc.| (default, May  8 2020, 02:54:21) 
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow.compat.v1 as tf
>>> tf.enable_eager_execution9tf.ConfigProto(log_device_placement=True))
  File "<stdin>", line 1
    tf.enable_eager_execution9tf.ConfigProto(log_device_placement=True))
                                                                       ^
SyntaxError: invalid syntax
>>> tf.enable_eager_execution(tf.ConfigProto(log_device_placement=True))
>>> print(tf.add([1.0, 2.0], [3.0, 4.0])
... )
2020-06-28 02:54:53.570206: I tensorflow/core/common_runtime/dml/dml_device_factory.cc:45] DirectML device enumeration: found 0 compatible adapters.
2020-06-28 02:54:53.572666: I tensorflow/core/common_runtime/eager/execute.cc:571] Executing op Add in device /job:localhost/replica:0/task:0/device:CPU:0
tf.Tensor([4. 6.], shape=(2,), dtype=float32)

I don't know if I should have done things in a different order, given that I already had WSL up and running. But if I can do things differently, I'd love to know. So far, it looks like only the most preliminary of guides are up.

HelloDirectML crash on DELL XPS 13 9380 Windows 10.0.18363

Hello,

We are seeing a DirectML.dll fault on some laptops when running the HelloDirectML example from GitHub. Other DirectML-based applications experience the same issue.
Below is an event error log showing the DirectML.dll fault (Event Viewer->Windows Logs->Applications):

Faulting application name: HelloDirectML.exe, version: 0.0.0.0, time stamp: 0x5f7cefd4
Faulting module name: DirectML.dll, version: 10.0.18362.997, time stamp: 0x7fc3de11
Exception code: 0xc0000409
Fault offset: 0x00000000000a28d1
Faulting process id: 0x4238
Faulting application start time: 0x01d69c315d6e5eb0
Faulting application path: C:\Users\dasguk\Downloads\HelloDirectML.exe
Faulting module path: C:\WINDOWS\SYSTEM32\DirectML.dll
Report Id: 95947e12-fe63-4740-b6eb-a273e71e8974
Faulting package full name:
Faulting package-relative application ID:

Some system information if this helps:

Laptop: Dell XPS 13 9380
OS: Microsoft Windows 10 Enterprise 10.0.18363
CPU: Intel(R) Core(TM) i7-8665U CPU @ 1.90GHz
GPU: Intel(R) UHD Graphics 620 DriverVersion: 27.20.100.8476

We checked the system requirements, and the laptop meets them. Could you please advise what the possible cause is?

Thanks.

DirectML vs CrossFire & GPU Workload

I have a computer with 2x AMD 570 and a CrossFire motherboard.
For maximum performance, should I:

  1. activate or deactivate CrossFire?
  2. set GPU Workload to Graphics or Compute? (AMD driver settings)

OOM at 48GB GPU mem on GPT-2 inference due to memory leak in directML.

First of all, thank you for all your work; it's very exciting to see the Windows/AMD ML gap being closed!
The issue is:

git clone https://github.com/openai/gpt-2.git
cd gpt-2
python -m pip install -r requirements.txt
python3 download_model.py 1558M
python src/interactive_conditional_samples.py 1558M

Given any text for inference, it repeatedly consumes all 48 GB of GPU memory (AMD Radeon VII 16 GB + 32 GB shared memory) and fails with:

  (0) Resource exhausted: OOM when allocating tensor with shape[1,48,2,25,455,64] and type float on /job:localhost/replica:0/task:0/device:DML:0 by allocator DmlAllocator
         [[node sample_sequence/while/concat (defined at F:\DSML\Soft\Anaconda\envs\directml\lib\site-packages\tensorflow_core\python\framework\ops.py:1762) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

Full log and environment are in the comments below. This is probably related to the resource-release issue seen with:

from ai_benchmark import AIBenchmark
benchmark = AIBenchmark(use_CPU=None, verbose_level=1)
results = benchmark.run()

which also fails with OOM during execution. Tensor memory is not released after runs, even after sess.close().
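For scale, the size of the single allocation named in the OOM message can be worked out from its shape. The sketch below assumes 4 bytes per element (TensorFlow's "float" is float32); the helper name `tensor_bytes` is made up for illustration.

```python
import math

def tensor_bytes(shape, bytes_per_element=4):
    """Bytes needed for a dense tensor; 4 bytes per element assumes float32."""
    return math.prod(shape) * bytes_per_element

shape = [1, 48, 2, 25, 455, 64]  # shape copied from the OOM message above
size = tensor_bytes(shape)
print(f"{size} bytes = {size / 2**20:.1f} MiB")  # 279552000 bytes = 266.6 MiB
```

A request of roughly 267 MiB is small next to 48 GB, which is consistent with memory accumulating across steps rather than one oversized tensor, although the arithmetic alone does not prove a leak.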

Dot product operator

Hi!

I have a question. Is there a reason why a dot product does not exist among the DirectML operators? I want to do calculations like numpy.dot(). I'm sorry if I have misunderstood something.

Thanks.
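As a point of reference for the question, numpy.dot on two 1-D vectors is an element-wise multiply followed by a sum, which is the same value as a 1xN by Nx1 matrix product. The pure-Python sketch below only demonstrates that equivalence; the function names are illustrative, and mapping it onto a matrix-multiply style operator such as DML_OPERATOR_GEMM is an assumption, not something stated in this thread.

```python
def dot(a, b):
    """Dot product of two equal-length vectors: multiply element-wise, then sum."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))

def matmul(A, B):
    """Naive matrix product of A (m x k) and B (k x n), both lists of rows."""
    if len(A[0]) != len(B):
        raise ValueError("inner dimensions must match")
    return [[sum(A[i][p] * B[p][j] for p in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

a = [1.0, 2.0, 3.0]
b = [4.0, 5.0, 6.0]

as_dot = dot(a, b)                              # multiply + reduce
as_matmul = matmul([a], [[v] for v in b])[0][0]  # (1 x 3) times (3 x 1)
print(as_dot, as_matmul)
```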

Intel HD 4600 graphics cannot run tensorflow-directml programs, but HD 530 works fine.

On the HD 4600, the console shows:
WARNING:tensorflow:
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:

WARNING:tensorflow:From D:\work\EkNiCuMine\train\tc7.py:50: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

WARNING:tensorflow:From D:\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow_core\contrib\layers\python\layers\layers.py:1057: Layer.apply (from tensorflow.python.keras.engine.base_layer) is deprecated and will be removed in a future version.
Instructions for updating:
Please use layer.__call__ method instead.
WARNING:tensorflow:From D:\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow_core\contrib\layers\python\layers\layers.py:1066: batch_normalization (from tensorflow.python.layers.normalization) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.BatchNormalization instead. In particular, tf.control_dependencies(tf.GraphKeys.UPDATE_OPS) should not be used (consult the tf.keras.layers.batch_normalization documentation).
WARNING:tensorflow:From D:\work\EkNiCuMine\train\tc7.py:218: The name tf.losses.softmax_cross_entropy is deprecated. Please use tf.compat.v1.losses.softmax_cross_entropy instead.

WARNING:tensorflow:From D:\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow_core\python\ops\losses\losses_impl.py:121: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
WARNING:tensorflow:From D:\work\EkNiCuMine\train\tc7.py:219: The name tf.losses.get_total_loss is deprecated. Please use tf.compat.v1.losses.get_total_loss instead.

WARNING:tensorflow:From D:\work\EkNiCuMine\train\tc7.py:220: The name tf.train.AdamOptimizer is deprecated. Please use tf.compat.v1.train.AdamOptimizer instead.

WARNING:tensorflow:From D:\work\EkNiCuMine\train\tc7.py:223: The name tf.get_collection is deprecated. Please use tf.compat.v1.get_collection instead.

WARNING:tensorflow:From D:\work\EkNiCuMine\train\tc7.py:223: The name tf.GraphKeys is deprecated. Please use tf.compat.v1.GraphKeys instead.

WARNING:tensorflow:From D:\work\EkNiCuMine\train\tc7.py:234: The name tf.truncated_normal is deprecated. Please use tf.random.truncated_normal instead.

WARNING:tensorflow:From D:\work\EkNiCuMine\train\tc7.py:238: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.
Instructions for updating:
Please use rate instead of keep_prob. Rate should be set to rate = 1 - keep_prob.
WARNING:tensorflow:From D:\work\EkNiCuMine\train\tc7.py:248: The name tf.nn.xw_plus_b is deprecated. Please use tf.compat.v1.nn.xw_plus_b instead.

WARNING:tensorflow:From D:\work\EkNiCuMine\train\tc7.py:254: The name tf.losses.mean_squared_error is deprecated. Please use tf.compat.v1.losses.mean_squared_error instead.

WARNING:tensorflow:From D:\work\EkNiCuMine\train\tc7.py:258: The name tf.trainable_variables is deprecated. Please use tf.compat.v1.trainable_variables instead.

WARNING:tensorflow:From D:\work\EkNiCuMine\train\tc7.py:267: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

2020-07-15 21:43:32.488885: I tensorflow/core/common_runtime/dml/dml_device_factory.cc:45] DirectML device enumeration: found 1 compatible adapters.
2020-07-15 21:43:32.501242: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2020-07-15 21:43:32.503347: I tensorflow/core/common_runtime/dml/dml_device_factory.cc:32] DirectML: creating device on adapter 0 (Intel(R) HD Graphics 4600)
2020-07-15 21:43:32.621172: I tensorflow/stream_executor/platform/default/dso_loader.cc:60] Successfully opened dynamic library DirectMLba106a7c621ea741d2159d8708ee581c11918380.dll
WARNING:tensorflow:From D:\work\EkNiCuMine\train\tc7.py:269: The name tf.global_variables_initializer is deprecated. Please use tf.compat.v1.global_variables_initializer instead.

WARNING:tensorflow:From D:\work\EkNiCuMine\train\tc7.py:271: The name tf.train.Saver is deprecated. Please use tf.compat.v1.train.Saver instead.

2020-07-15 21:44:13.976556: F tensorflow/core/common_runtime/dml/dml_command_recorder.cc:372] Check failed: (((HRESULT)((dml_device_->GetDeviceRemovedReason()))) >= 0) == true (0 vs. 1)

On the HD 530, the console shows:
WARNING:tensorflow:
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:

WARNING:tensorflow:From D:\work\EkNiCuMine\train\tc7.py:50: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

WARNING:tensorflow:From D:\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow_core\contrib\layers\python\layers\layers.py:1057: Layer.apply (from tensorflow.python.keras.engine.base_layer) is deprecated and will be removed in a future version.
Instructions for updating:
Please use layer.__call__ method instead.
WARNING:tensorflow:From D:\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow_core\contrib\layers\python\layers\layers.py:1066: batch_normalization (from tensorflow.python.layers.normalization) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.BatchNormalization instead. In particular, tf.control_dependencies(tf.GraphKeys.UPDATE_OPS) should not be used (consult the tf.keras.layers.batch_normalization documentation).
WARNING:tensorflow:From D:\work\EkNiCuMine\train\tc7.py:218: The name tf.losses.softmax_cross_entropy is deprecated. Please use tf.compat.v1.losses.softmax_cross_entropy instead.

WARNING:tensorflow:From D:\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow_core\python\ops\losses\losses_impl.py:121: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
WARNING:tensorflow:From D:\work\EkNiCuMine\train\tc7.py:219: The name tf.losses.get_total_loss is deprecated. Please use tf.compat.v1.losses.get_total_loss instead.

WARNING:tensorflow:From D:\work\EkNiCuMine\train\tc7.py:220: The name tf.train.AdamOptimizer is deprecated. Please use tf.compat.v1.train.AdamOptimizer instead.

WARNING:tensorflow:From D:\work\EkNiCuMine\train\tc7.py:223: The name tf.get_collection is deprecated. Please use tf.compat.v1.get_collection instead.

WARNING:tensorflow:From D:\work\EkNiCuMine\train\tc7.py:223: The name tf.GraphKeys is deprecated. Please use tf.compat.v1.GraphKeys instead.

WARNING:tensorflow:From D:\work\EkNiCuMine\train\tc7.py:234: The name tf.truncated_normal is deprecated. Please use tf.random.truncated_normal instead.

WARNING:tensorflow:From D:\work\EkNiCuMine\train\tc7.py:238: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.
Instructions for updating:
Please use rate instead of keep_prob. Rate should be set to rate = 1 - keep_prob.
WARNING:tensorflow:From D:\work\EkNiCuMine\train\tc7.py:248: The name tf.nn.xw_plus_b is deprecated. Please use tf.compat.v1.nn.xw_plus_b instead.

WARNING:tensorflow:From D:\work\EkNiCuMine\train\tc7.py:254: The name tf.losses.mean_squared_error is deprecated. Please use tf.compat.v1.losses.mean_squared_error instead.

WARNING:tensorflow:From D:\work\EkNiCuMine\train\tc7.py:258: The name tf.trainable_variables is deprecated. Please use tf.compat.v1.trainable_variables instead.

WARNING:tensorflow:From D:\work\EkNiCuMine\train\tc7.py:267: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

2020-07-15 21:48:28.272086: I tensorflow/core/common_runtime/dml/dml_device_factory.cc:45] DirectML device enumeration: found 2 compatible adapters.
2020-07-15 21:48:28.274903: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2020-07-15 21:48:28.275889: I tensorflow/core/common_runtime/dml/dml_device_factory.cc:32] DirectML: creating device on adapter 0 (Intel(R) HD Graphics 530)
2020-07-15 21:48:28.346440: I tensorflow/stream_executor/platform/default/dso_loader.cc:60] Successfully opened dynamic library DirectMLba106a7c621ea741d2159d8708ee581c11918380.dll
2020-07-15 21:48:28.366793: I tensorflow/core/common_runtime/dml/dml_device_factory.cc:32] DirectML: creating device on adapter 1 (Intel(R) HD Graphics 530)
WARNING:tensorflow:From D:\work\EkNiCuMine\train\tc7.py:269: The name tf.global_variables_initializer is deprecated. Please use tf.compat.v1.global_variables_initializer instead.

WARNING:tensorflow:From D:\work\EkNiCuMine\train\tc7.py:271: The name tf.train.Saver is deprecated. Please use tf.compat.v1.train.Saver instead.

2020-07-15 21:49:10.806959 number iterations: 0
cost is: 2.8335354
accuracy is: 0.0

How to compile operator with the API "CompileGraph()"

Hi,
I ran into a strange problem when implementing a simple model that has only two convolution layers. The result is correct when the model has two outputs (exporting the results of both convolutions), but wrong when it has only one output (exporting the result of the last convolution).

   `ComPtr<IDMLOperator> conv1;
    DX::ThrowIfFailed(m_dmlDevice->CreateOperator(
        &opDesc,
        IID_PPV_ARGS(&conv1)));
    
    ComPtr<IDMLOperator> conv2;
    DX::ThrowIfFailed(m_dmlDevice->CreateOperator(
        &opDesc,
        IID_PPV_ARGS(&conv2)));

    DML_OPERATOR_GRAPH_NODE_DESC nodeDesc = { conv1.Get() };
    DML_OPERATOR_GRAPH_NODE_DESC nodeDesc2 = { conv2.Get() };

    std::vector<DML_GRAPH_NODE_DESC> nodes;
    nodes.push_back({ DML_GRAPH_NODE_TYPE_OPERATOR, &nodeDesc });
    nodes.push_back({ DML_GRAPH_NODE_TYPE_OPERATOR, &nodeDesc2 });
    
    DML_INPUT_GRAPH_EDGE_DESC inputEdgeDesc = {};
    inputEdgeDesc.GraphInputIndex = 0;
    inputEdgeDesc.ToNodeIndex = 0;
    inputEdgeDesc.ToNodeInputIndex = 0;

    DML_INPUT_GRAPH_EDGE_DESC filterEdgeDesc = {};
    filterEdgeDesc.GraphInputIndex = 1;
    filterEdgeDesc.ToNodeIndex = 0;
    filterEdgeDesc.ToNodeInputIndex = 1;

    DML_INPUT_GRAPH_EDGE_DESC biasEdgeDesc = {};
    biasEdgeDesc.GraphInputIndex = 2;
    biasEdgeDesc.ToNodeIndex = 0;
    biasEdgeDesc.ToNodeInputIndex = 2;

    DML_INPUT_GRAPH_EDGE_DESC filterEdgeDesc2 = {};
    filterEdgeDesc2.GraphInputIndex = 3;
    filterEdgeDesc2.ToNodeIndex = 1;
    filterEdgeDesc2.ToNodeInputIndex = 1;

    DML_INPUT_GRAPH_EDGE_DESC biasEdgeDesc2 = {};
    biasEdgeDesc2.GraphInputIndex = 4;
    biasEdgeDesc2.ToNodeIndex = 1;
    biasEdgeDesc2.ToNodeInputIndex = 2;

    std::vector<DML_GRAPH_EDGE_DESC> inputEdges;
    inputEdges.push_back({ DML_GRAPH_EDGE_TYPE_INPUT,&inputEdgeDesc });
    inputEdges.push_back({ DML_GRAPH_EDGE_TYPE_INPUT,&filterEdgeDesc });
    inputEdges.push_back({ DML_GRAPH_EDGE_TYPE_INPUT,&biasEdgeDesc });
    inputEdges.push_back({ DML_GRAPH_EDGE_TYPE_INPUT,&filterEdgeDesc2 });
    inputEdges.push_back({ DML_GRAPH_EDGE_TYPE_INPUT,&biasEdgeDesc2 });

    /*DML_OUTPUT_GRAPH_EDGE_DESC outputEdgeDesc = {};
    outputEdgeDesc.GraphOutputIndex = 0;
    outputEdgeDesc.FromNodeIndex = 0;
    outputEdgeDesc.FromNodeOutputIndex = 0;*/

    DML_OUTPUT_GRAPH_EDGE_DESC outputEdgeDesc2 = {};
    outputEdgeDesc2.GraphOutputIndex = 0;
    outputEdgeDesc2.FromNodeIndex = 1;
    outputEdgeDesc2.FromNodeOutputIndex = 0;

    std::vector<DML_GRAPH_EDGE_DESC> outputEdges;
    //outputEdges.push_back({ DML_GRAPH_EDGE_TYPE_OUTPUT,&outputEdgeDesc });
    outputEdges.push_back({ DML_GRAPH_EDGE_TYPE_OUTPUT,&outputEdgeDesc2 });

    DML_INTERMEDIATE_GRAPH_EDGE_DESC interEdgeDesc = {};
    interEdgeDesc.FromNodeIndex = 0;
    interEdgeDesc.FromNodeOutputIndex = 0;
    interEdgeDesc.ToNodeIndex = 1;
    interEdgeDesc.ToNodeInputIndex = 0;

    std::vector<DML_GRAPH_EDGE_DESC> interEdges;
    interEdges.push_back({ DML_GRAPH_EDGE_TYPE_INTERMEDIATE,&interEdgeDesc });

    DML_GRAPH_DESC graphDesc = {};
    graphDesc.InputCount = 5;
    graphDesc.OutputCount = 1;
    graphDesc.NodeCount = nodes.size();
    graphDesc.Nodes = nodes.data();
    graphDesc.InputEdgeCount = inputEdges.size();
    graphDesc.InputEdges = inputEdges.data();
    graphDesc.OutputEdgeCount = outputEdges.size();
    graphDesc.OutputEdges = outputEdges.data();
    graphDesc.IntermediateEdgeCount = interEdges.size();
    graphDesc.IntermediateEdges = interEdges.data();

    ComPtr<IDMLCompiledOperator> compiledOp;
    DX::ThrowIfFailed(m_dmlDevice->CompileGraph(
        &graphDesc,
        DML_EXECUTION_FLAG_NONE,
        IID_PPV_ARGS(compiledOp.ReleaseAndGetAddressOf())));`
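The edge bookkeeping in the graph description above can be sanity-checked outside DirectML. The sketch below is a hypothetical pure-Python model, not the DirectML API: `check_graph` and its tuple layout are invented for illustration, every node input is treated as required, and each convolution is assumed to take three inputs (input, filter, bias) as in the posted code.

```python
def check_graph(input_count, output_count, node_input_counts,
                input_edges, intermediate_edges, output_edges):
    """Return a list of index problems in a graph description (empty if none).

    input_edges:        (graph_input_index, to_node, to_node_input)
    intermediate_edges: (from_node, from_output, to_node, to_node_input)
    output_edges:       (graph_output_index, from_node, from_output)
    """
    problems = []
    for graph_input, _node, _slot in input_edges:
        if not 0 <= graph_input < input_count:
            problems.append(f"graph input {graph_input} out of range")

    # Count how many edges feed each (node, input slot) pair.
    targets = [(node, slot) for _gi, node, slot in input_edges]
    targets += [(node, slot) for _fn, _fo, node, slot in intermediate_edges]
    fed = {}
    for node, slot in targets:
        if not 0 <= node < len(node_input_counts):
            problems.append(f"edge targets unknown node {node}")
        fed[(node, slot)] = fed.get((node, slot), 0) + 1

    # This model treats every node input as required (fed exactly once).
    for node, count in enumerate(node_input_counts):
        for slot in range(count):
            n = fed.get((node, slot), 0)
            if n != 1:
                problems.append(f"node {node} input {slot} fed by {n} edges")

    fed_outputs = sorted(go for go, _node, _out in output_edges)
    if fed_outputs != list(range(output_count)):
        problems.append(f"graph outputs fed: {fed_outputs}")
    return problems

# Indices copied from the single-output variant posted above.
input_edges = [(0, 0, 0), (1, 0, 1), (2, 0, 2), (3, 1, 1), (4, 1, 2)]
intermediate_edges = [(0, 0, 1, 0)]
output_edges = [(0, 1, 0)]
problems = check_graph(5, 1, [3, 3], input_edges, intermediate_edges, output_edges)
print(problems)
```

Under these assumptions the posted indices are self-consistent, so a check like this only rules out wiring mistakes; it does not explain the wrong result.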

WSL2: D3D12CreateDevice Check Failed

Windows: Build 20150.rs_prerelease.200612-1734
WSL2 kernel: 4.19.121-microsoft-WSL2-standard
Intel HD 630, Driver: 28.20.100.8322
Radeon RX Vega M GH, Driver: 20.20.01.05
GeForce RTX 2070, via Thunderbolt 3, Driver: 455.38
$ uname -a
Linux hades 4.19.121-microsoft-WSL2-standard #1 SMP Thu May 14 20:25:24 UTC 2020 x86_64 GNU/Linux
$ lspci
a087:00:00.0 3D controller: Microsoft Corporation Device 008e
ad33:00:00.0 3D controller: Microsoft Corporation Device 008e
addd:00:00.0 SCSI storage controller: Red Hat, Inc. Virtio filesystem (rev 01)
b7a1:00:00.0 3D controller: Microsoft Corporation Device 008e
b8b2:00:00.0 SCSI storage controller: Red Hat, Inc. Virtio filesystem (rev 01)
$ ls /usr/lib/wsl/lib
libcuda.so  libcuda.so.1  libcuda.so.1.1  libd3d12.so  libdirectml.so  libdxcore.so
$ conda --version
conda 4.8.3
$ python --version
Python 3.7.7 
$ conda list | grep directml
# packages in environment at /home/username/.conda/envs/directml:
tensorflow-directml       1.15.3.dev200615          pypi_0    pypi

The code I was trying to run is the basic example from Tensorflow tutorials:

#!/usr/bin/env python

import tensorflow as tf

mnist = tf.keras.datasets.mnist

(x_train, y_train),(x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28)),
  tf.keras.layers.Dense(512, activation=tf.nn.relu),
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(x_train, y_train, epochs=5)
model.evaluate(x_test, y_test)

DirectML was able to create devices on the Intel/AMD adapters, but failed for the NVIDIA one. The complete output:

$ ./test_direct_ml.py
WARNING:tensorflow:From /home/username/.conda/envs/directml/lib/python3.7/site-packages/tensorflow_core/python/ops/resource_variable_ops.py:1630: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
Train on 60000 samples
2020-06-19 04:04:30.146727: I tensorflow/core/common_runtime/dml/dml_device_factory.cc:45] DirectML device enumeration: found 3 compatible adapters.
2020-06-19 04:04:30.147651: I tensorflow/core/common_runtime/dml/dml_device_factory.cc:32] DirectML: creating device on adapter 0 (Radeon RX Vega M GH Graphics
2020-06-19 04:04:31.902541: I tensorflow/stream_executor/platform/default/dso_loader.cc:60] Successfully opened dynamic library libdirectml.so.ba106a7c621ea741d2159d8708ee581c11918380
2020-06-19 04:04:31.930277: I tensorflow/core/common_runtime/dml/dml_device_factory.cc:32] DirectML: creating device on adapter 1 (Intel(R) HD Graphics 630
2020-06-19 04:04:35.040210: I tensorflow/core/common_runtime/dml/dml_device_factory.cc:32] DirectML: creating device on adapter 2 (NVIDIA GeForce RTX 2070
2020-06-19 04:04:35.050639: F tensorflow/core/common_runtime/dml/dml_device.cc:26] Check failed: (((HRESULT)((D3D12CreateDevice(adapter.Impl()->Get(), feature_level, __uuidof(**(&d3d_device_)), IID_PPV_ARGS_Helper(&d3d_device_))))) >= 0) == true (0 vs. 1)
Aborted

NOTE: If I disconnect the NVIDIA adapter from my machine, the above code runs well on the other two devices with the expected output, 3D usage in Task Manager and everything.
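When comparing reports like this one, the adapter list can be pulled out of the TensorFlow log mechanically. The sketch below relies only on the two message formats visible in the log above; the helper name `summarize_adapters` and the shortened sample lines are illustrative.

```python
import re

ENUM_RE = re.compile(r"DirectML device enumeration: found (\d+) compatible adapters")
CREATE_RE = re.compile(r"DirectML: creating device on adapter (\d+) \((.*)$", re.MULTILINE)

def summarize_adapters(log_text):
    """Return (adapters_found, {adapter_index: adapter_name}) from a log dump."""
    found = ENUM_RE.search(log_text)
    adapters = {}
    for m in CREATE_RE.finditer(log_text):
        name = m.group(2).strip()
        # Some logs close the parenthesis after the name and some do not.
        if name.endswith(")") and name.count(")") > name.count("("):
            name = name[:-1]
        adapters[int(m.group(1))] = name
    return (int(found.group(1)) if found else None), adapters

# Lines shortened from the log above (timestamps and file names dropped).
sample = """\
DirectML device enumeration: found 3 compatible adapters.
DirectML: creating device on adapter 0 (Radeon RX Vega M GH Graphics
DirectML: creating device on adapter 1 (Intel(R) HD Graphics 630
DirectML: creating device on adapter 2 (NVIDIA GeForce RTX 2070
"""
count, adapters = summarize_adapters(sample)
print(count, adapters)
```

The last adapter listed before the fatal check failure is the one whose device creation failed, which here is adapter 2.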

80004002 No such interface supported

Hi,
I ran the HelloDirectML demo on a machine. The result is correct, but I saw the error "80004002 No such interface supported" in the Visual Studio output window.
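For context, 80004002 is the HRESULT E_NOINTERFACE, which COM returns when an object does not implement a requested interface. An HRESULT is a signed 32-bit value, and the "succeeded" test is simply whether it is non-negative; the same `>= 0` comparison appears in the TensorFlow "Check failed" lines in other reports on this page. A minimal sketch with illustrative helper names:

```python
def to_signed_hresult(value):
    """Interpret a 32-bit HRESULT (often printed as unsigned hex) as a signed int."""
    value &= 0xFFFFFFFF
    return value - 0x1_0000_0000 if value & 0x8000_0000 else value

def succeeded(hresult):
    """True when the HRESULT is non-negative, i.e. the severity bit is clear."""
    return to_signed_hresult(hresult) >= 0

E_NOINTERFACE = 0x80004002
print(to_signed_hresult(E_NOINTERFACE))  # -2147467262
print(hex(E_NOINTERFACE & 0xFFFF))       # 0x4002, the low 16-bit code
print(succeeded(E_NOINTERFACE), succeeded(0))  # False True
```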

Is there any plan for Pytorch-DML as well?

Hi, thanks a lot once again for this.
I wonder if you have any plans for a PyTorch port of DML as well, given that Microsoft has been a maintainer of PyTorch on Windows since version 1.6.
Any feedback on this is greatly appreciated.

GetDeviceRemovedReason failed on ICNet Training

I was running ai-benchmark on my machine to ensure everything was set up properly.
CPU: Ryzen 5 3600
GPU: RX 5700XT
OS: Windows 10 (Version 10.0.18363 Build 18363)

I installed directml from pip (version 1.15.3.dev200911)
I installed ai-benchmark from pip (version 0.1.2)

My code is below:

from ai_benchmark import AIBenchmark

benchmark_gpu = AIBenchmark(use_CPU=False, verbose_level=1)
result_gpu = benchmark_gpu.run()

The test is properly utilizing my GPU, without a lot of overhead

But the ICNet training always fails with the below error output:

14.1 - inference | batch=5, size=1024x1536: 536 ± 7 ms
2020-09-28 10:29:04.811275: F tensorflow/core/common_runtime/dml/dml_command_recorder.cc:366] Check failed: (((HRESULT)((dml_device_->GetDeviceRemovedReason()))) >= 0) == true (0 vs. 1)

yolov3 training loss blows up

Hi, I wanted to use this repo for benchmarking, so I ran the code without any modifications on two platforms. While the squeezenet training worked correctly, the yolov3 training loss blew up to nan. The two platforms I tested on:

  1. AMD Vega FE, Windows 10, WSL2, Ubuntu 18.04
  2. Nvidia 1080 ti, RHL 7.9

On both, I get this behavior:

(directml) ilyak@DESKTOP-UUENB3S:/mnt/d/Documents/DirectML/TensorFlow/yolov3$ time python train_voc.py --epochs 10 --batch_size 16
2020-10-25 10:54:50.815566: I tensorflow/stream_executor/platform/default/dso_loader.cc:60] Successfully opened dynamic library libdirectml.so.70b4b8b341c8bda5dc82ecd28a29d918c28282b0
2020-10-25 10:54:50.998874: I tensorflow/core/common_runtime/dml/dml_device_factory.cc:132] DirectML device enumeration: found 1 compatible adapters.
WARNING:tensorflow:From /home/ilyak/miniconda3/envs/directml/lib/python3.6/site-packages/tensorflow_core/python/ops/resource_variable_ops.py:1630: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
W1025 10:54:51.059149 139831479416640 deprecation.py:506] From /home/ilyak/miniconda3/envs/directml/lib/python3.6/site-packages/tensorflow_core/python/ops/resource_variable_ops.py:1630: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
2020-10-25 10:54:57.951699: I tensorflow/core/common_runtime/dml/dml_device_factory.cc:114] DirectML: creating device on adapter 0 (Radeon Vega Frontier Edition
WARNING:tensorflow:From /mnt/d/Documents/DirectML/TensorFlow/yolov3/yolov3_tf2/dataset.py:29: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
W1025 10:55:01.693185 139831479416640 deprecation.py:323] From /mnt/d/Documents/DirectML/TensorFlow/yolov3/yolov3_tf2/dataset.py:29: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
WARNING:tensorflow:From train.py:109: The name tf.keras.backend.get_session is deprecated. Please use tf.compat.v1.keras.backend.get_session instead.

W1025 10:55:03.373506 139831479416640 module_wrapper.py:139] From train.py:109: The name tf.keras.backend.get_session is deprecated. Please use tf.compat.v1.keras.backend.get_session instead.

Train on 357 steps, validate on 363 steps
Epoch 1/10
2020-10-25 10:55:39.355775: I tensorflow/core/profiler/lib/profiler_session.cc:205] Profiler session started.
 25/357 [=>............................] - ETA: 30:05 - loss: 197015067602.1656 - yolo_output_0_loss: 195338272768.0000 - yolo_output_1_loss: 1676120704.0000 - yolo_output_2_loss: 664655.75
 26/357 [=>............................] - ETA: 29:54 - loss: 189910974853.1592 - yolo_output_0_loss: 188294381568.0000 - yolo_output_1_loss: 1615940224.0000 - yolo_output_2_loss: 640135.56
 27/357 [=>............................] - ETA: 29:43 - loss: 183518095127.9311 - yolo_output_0_loss: 181960425472.0000 - yolo_output_1_loss: 1557037824.0000 - yolo_output_2_loss: 633761.50
 28/357 [=>............................] - ETA: 29:33 - loss: 177186779954.5049 - yolo_output_0_loss: 175667314688.0000 - yolo_output_1_loss: 1518691072.0000 - yolo_output_2_loss: 776710.62
 29/357 [=>............................] - ETA: 29:23 - loss: 171112061513.0393 - yolo_output_0_loss: 169614901248.0000 - yolo_output_1_loss: 1496414848.0000 - yolo_output_2_loss: 755390.75
 30/357 [=>............................] - ETA: 29:15 - loss: 166301570681.8046 - yolo_output_0_loss: 164835262464.0000 - yolo_output_1_loss: 1465576192.0000 - yolo_output_2_loss: 730211.31
 31/357 [=>............................] - ETA: 29:06 - loss: 174625272487.4238 - yolo_output_0_loss: 165716246528.0000 - yolo_output_1_loss: 8885001216.0000 - yolo_output_2_loss: 24020536.
 32/357 [=>............................] - ETA: 28:58 - loss: 5969125737652498.0000 - yolo_output_0_loss: 4505247821070336.0000 - yolo_output_1_loss: 1462073573769216.0000 - yolo_output_2_l
 33/357 [=>............................] - ETA: 28:49 - loss: 5788243849893409.0000 - yolo_output_0_loss: 4368725843116032.0000 - yolo_output_1_loss: 1417768570191872.0000 - yolo_output_2_l
 34/357 [=>............................] - ETA: 28:41 - loss: 5618001846072096.0000 - yolo_output_0_loss: 4240234380263424.0000 - yolo_output_1_loss: 1376069538021376.0000 - yolo_output_2_l
 35/357 [=>............................] - ETA: 28:32 - loss: 5457490411422473.0000 - yolo_output_0_loss: 4119087311486976.0000 - yolo_output_1_loss: 1336753541611520.0000 - yolo_output_2_l
 36/357 [==>...........................] - ETA: 28:25 - loss: 5305918492813840.0000 - yolo_output_0_loss: 4004686294155264.0000 - yolo_output_1_loss: 1299628381175808.0000 - yolo_output_2_l
 37/357 [==>...........................] - ETA: 28:18 - loss: 5162574162094760.0000 - yolo_output_0_loss: 3896501436678144.0000 - yolo_output_1_loss: 1264512057475072.0000 - yolo_output_2_l
 38/357 [==>...........................] - ETA: 28:12 - loss: 5026844055823483.0000 - yolo_output_0_loss: 3794086867763200.0000 - yolo_output_1_loss: 1231237603655680.0000 - yolo_output_2_l
 39/357 [==>...........................] - ETA: 28:05 - loss: 4897959446858478.0000 - yolo_output_0_loss: 3696807469121536.0000 - yolo_output_1_loss: 1199671338860544.0000 - yolo_output_2_l
 40/357 [==>...........................] - ETA: 27:58 - loss: 4775525557550146.0000 - yolo_output_0_loss: 3604391047200768.0000 - yolo_output_1_loss: 1169690856521728.0000 - yolo_output_2_l
 41/357 [==>...........................] - ETA: 27:51 - loss: 4659273293776710.0000 - yolo_output_0_loss: 3516685935968256.0000 - yolo_output_1_loss: 1141178716127232.0000 - yolo_output_2_l
 42/357 [==>...........................] - ETA: 27:45 - loss: 4979880065214234.0000 - yolo_output_0_loss: 3623235753082880.0000 - yolo_output_1_loss: 1355100299722752.0000 - yolo_output_2_l
 43/357 [==>...........................] - ETA: 27:38 - loss: 4864215937318020.0000 - yolo_output_0_loss: 3539118281719808.0000 - yolo_output_1_loss: 1323589332631552.0000 - yolo_output_2_l
 44/357 [==>...........................] - ETA: 27:31 - loss: 6526500006779312.0000 - yolo_output_0_loss: 5220556907479040.0000 - yolo_output_1_loss: 1304335665332224.0000 - yolo_output_2_l
 45/357 [==>...........................] - ETA: 27:24 - loss: 6391131528590372.0000 - yolo_output_0_loss: 5113532228042752.0000 - yolo_output_1_loss: 1276009684926464.0000 - yolo_output_2_l
 46/357 [==>...........................] - ETA: 27:16 - loss: 102949305698452064.0000 - yolo_output_0_loss: 14371433018818560.0000 - yolo_output_1_loss: 86353083445018624.0000 - yolo_output
 47/357 [==>...........................] - ETA: 27:09 - loss: 100946136722534208.0000 - yolo_output_0_loss: 14250655317229568.0000 - yolo_output_1_loss: 84518015718129664.0000 - yolo_output
 48/357 [===>..........................] - ETA: 27:02 - loss: nan - yolo_output_0_loss: nan - yolo_output_1_loss: nan - yolo_output_2_loss: nan
 57/357 [===>..........................] - ETA: 25:59 - loss: nan - yolo_output_0_loss: nan - yolo_output_1_loss: nan - yolo_output_2_loss: nan
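The step at which the loss stops being finite can be picked out of a pasted progress log mechanically. The sketch below assumes only the Keras progress-bar format shown above; the helper name `first_non_finite_step` is illustrative, and the sample text is shortened from the log.

```python
import math
import re

# Matches "<step>/<total> [==>....] - ETA: <time> - loss: <value>".
STEP_RE = re.compile(r"(\d+)/(\d+) \[[=>.]+\] - ETA: \S+ - loss: (\S+)")

def first_non_finite_step(log_text):
    """Return the first step whose loss is nan or inf, or None if all are finite."""
    for match in STEP_RE.finditer(log_text):
        step = int(match.group(1))
        loss = float(match.group(3))  # float() accepts "nan" and "inf"
        if not math.isfinite(loss):
            return step
    return None

sample = (
    "47/357 [==>...........................] - ETA: 27:09 - loss: 100946136722534208.0000 "
    "48/357 [===>..........................] - ETA: 27:02 - loss: nan"
)
print(first_non_finite_step(sample))  # 48
```

Because `finditer` scans the whole string, this works whether the progress updates sit on separate lines or are fused onto one line.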

C++ DirectML.dll causes crash in debug x64 mode when using NuGet package Microsoft.AI.MachineLearning 1.5.2

Hello,

I'm experiencing a runtime crash with the C++ DirectML API in Debug x64 mode after upgrading my NuGet package Microsoft.AI.MachineLearning from version 1.4.0 to 1.5.2.
There is no error in Release x64 mode.

The reason why I'm using this package is because the included DirectML.dll improves DirectML performance greatly.
There seems to be an issue when creating a DirectMLOperator.
The operator type is DML_OPERATOR_JOIN.

Can you please help me identify the issue?
Also how can I find the latest DirectML.dll file without downloading the package?

DirectML dll error

HelloDirectML crashed

Hi,
When I change this line from

    dml::Expression output = dml::Identity(input);

to something like this

    dml::Expression output = dml::Identity(input)+dml::Identity(input);

and compile with USE_DMLX, the executable crashes with a device removed error. Why does this happen? Many thanks!

Could not load dynamic library 'libcuda.so.1'

Followed the instructions here

~ » cat /proc/version                                                                                                                                                             1 ↵ jlam@MAKERPC
Linux version 4.4.0-20150-Microsoft ([email protected]) (gcc version 5.4.0 (GCC) ) #1000-Microsoft Thu Jun 12 17:34:00 PST 2020

I'm running build 20150, but am getting this error:

Python 3.6.10 |Anaconda, Inc.| (default, May  8 2020, 02:54:21)
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow.compat.v1 as tf
>>>
>>> tf.enable_eager_execution(tf.ConfigProto(log_device_placement=True))
>>>
>>> print(tf.add([1.0, 2.0], [3.0, 4.0]))
2020-06-17 16:36:05.469811: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2020-06-17 16:36:05.469926: E tensorflow/stream_executor/cuda/cuda_driver.cc:313] failed call to cuInit: UNKNOWN ERROR (303)
2020-06-17 16:36:05.470029: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (MAKERPC): /proc/driver/nvidia/version does not exist
2020-06-17 16:36:05.470532: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2020-06-17 16:36:05.483133: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 3400000000 Hz
2020-06-17 16:36:05.487879: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7fffe52ac420 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-06-17 16:36:05.488038: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
tf.Tensor([4. 6.], shape=(2,), dtype=float32)
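One thing worth checking in a report like this is which WSL version the session runs under, since GPU compute is exposed only under WSL 2. The sketch below is a heuristic, not an official detection method: it assumes WSL 1 kernels report a release ending in "-Microsoft" (as in the /proc/version output above), while WSL 2 kernels contain lowercase "microsoft" (as in the uname output in other reports on this page). The function name `wsl_flavor` is illustrative.

```python
def wsl_flavor(kernel_release):
    """Guess the WSL version from a kernel release string (heuristic only)."""
    if kernel_release.endswith("-Microsoft"):
        return "WSL1"   # e.g. 4.4.0-20150-Microsoft
    if "microsoft" in kernel_release:
        return "WSL2"   # e.g. 4.19.121-microsoft-WSL2-standard
    return "not WSL"

# Release strings taken from reports on this page.
for release in ("4.4.0-20150-Microsoft",
                "4.19.121-microsoft-WSL2-standard",
                "5.4.72-microsoft-standard-WSL2"):
    print(release, "->", wsl_flavor(release))
```

On a live system the release string could come from `platform.release()` or `uname -r`. If the heuristic holds, the kernel in this report points to WSL 1, which would need to be converted to WSL 2 before any GPU library can be expected to load.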

Trained Weights Source

Hey!

I'm pretty interested in this project, and I'm curious what format the weights of your trained model are in. I have a couple of ONNX models that I want to use with this, and I'd like to know how to load them.

Thanks!

inconsistent behavior of optional input tensor binding

dml::Graph graph(...);
auto input  = dml::InputTensor(graph,0,...);
auto filter = dml::InputTensor(graph,1,...);
auto output = dml::Convolution(input,filter);
auto tmp = graph.Compile(DML_EXECUTION_FLAG_ALLOW_HALF_PRECISION_COMPUTATION,{output});
// BindInputs : there must be 3 bindings
// 0 DML_BINDING_TYPE_BUFFER
// 1 DML_BINDING_TYPE_BUFFER
// 2 DML_BINDING_TYPE_NONE

dml::Graph graph(...);
auto input  = dml::InputTensor(graph,0,...);
auto filter = dml::InputTensor(graph,1,...);
auto inter = dml::Convolution(input,filter);
auto output = dml::Identity(inter);
auto tmp = graph.Compile(DML_EXECUTION_FLAG_ALLOW_HALF_PRECISION_COMPUTATION,{output});
// BindInputs : there can only be 2 bindings
// 0 DML_BINDING_TYPE_BUFFER
// 1 DML_BINDING_TYPE_BUFFER

tensorflow module has a conflict on WSL2

Following your introductory test, importing the module fails on this system:
Linux PROJEKT 5.4.72-microsoft-standard-WSL2 #1 SMP Wed Oct 28 23:40:43 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Python 3.6.12

>>> import tensorflow.compat.v1 as tf
Traceback (most recent call last):
  File "/home/support/miniconda3/envs/directml/lib/python3.6/site-packages/tensorflow_core/python/pywrap_tensorflow.py", line 58, in <module>
    from tensorflow.python.pywrap_tensorflow_internal import *
  File "/home/support/miniconda3/envs/directml/lib/python3.6/site-packages/tensorflow_core/python/pywrap_tensorflow_internal.py", line 28, in <module>
    _pywrap_tensorflow_internal = swig_import_helper()
  File "/home/support/miniconda3/envs/directml/lib/python3.6/site-packages/tensorflow_core/python/pywrap_tensorflow_internal.py", line 24, in swig_import_helper
    _mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
  File "/home/support/miniconda3/envs/directml/lib/python3.6/imp.py", line 243, in load_module
    return load_dynamic(name, filename, file)
  File "/home/support/miniconda3/envs/directml/lib/python3.6/imp.py", line 343, in load_dynamic
    return _load(spec)
ImportError: libd3d12.so: cannot open shared object file: No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/support/miniconda3/envs/directml/lib/python3.6/site-packages/tensorflow/__init__.py", line 102, in <module>
    from tensorflow_core import *
  File "/home/support/miniconda3/envs/directml/lib/python3.6/site-packages/tensorflow_core/__init__.py", line 28, in <module>
    from tensorflow.python import pywrap_tensorflow  # pylint: disable=unused-import
  File "/home/support/miniconda3/envs/directml/lib/python3.6/site-packages/tensorflow/__init__.py", line 50, in __getattr__
    module = self._load()
  File "/home/support/miniconda3/envs/directml/lib/python3.6/site-packages/tensorflow/__init__.py", line 44, in _load
    module = _importlib.import_module(self.__name__)
  File "/home/support/miniconda3/envs/directml/lib/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "/home/support/miniconda3/envs/directml/lib/python3.6/site-packages/tensorflow_core/python/__init__.py", line 55, in <module>
    from tensorflow.python import pywrap_tensorflow
  File "/home/support/miniconda3/envs/directml/lib/python3.6/site-packages/tensorflow_core/python/pywrap_tensorflow.py", line 74, in <module>
    raise ImportError(msg)
ImportError: Traceback (most recent call last):
  File "/home/support/miniconda3/envs/directml/lib/python3.6/site-packages/tensorflow_core/python/pywrap_tensorflow.py", line 58, in <module>
    from tensorflow.python.pywrap_tensorflow_internal import *
  File "/home/support/miniconda3/envs/directml/lib/python3.6/site-packages/tensorflow_core/python/pywrap_tensorflow_internal.py", line 28, in <module>
    _pywrap_tensorflow_internal = swig_import_helper()
  File "/home/support/miniconda3/envs/directml/lib/python3.6/site-packages/tensorflow_core/python/pywrap_tensorflow_internal.py", line 24, in swig_import_helper
    _mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
  File "/home/support/miniconda3/envs/directml/lib/python3.6/imp.py", line 243, in load_module
    return load_dynamic(name, filename, file)
  File "/home/support/miniconda3/envs/directml/lib/python3.6/imp.py", line 343, in load_dynamic
    return _load(spec)
ImportError: libd3d12.so: cannot open shared object file: No such file or directory

Failed to load the native TensorFlow runtime.

Package              Version
-------------------- -------------------
absl-py              0.11.0
astor                0.8.1
cached-property      1.5.2
certifi              2020.12.5
gast                 0.2.2
google-pasta         0.2.0
grpcio               1.35.0
h5py                 3.1.0
importlib-metadata   3.4.0
Keras-Applications   1.0.8
Keras-Preprocessing  1.1.2
Markdown             3.3.3
numpy                1.18.5
opt-einsum           3.3.0
pip                  20.3.3
protobuf             3.14.0
setuptools           52.0.0.post20210125
six                  1.15.0
tensorboard          1.15.0
tensorflow-directml  1.15.4.dev201216
tensorflow-estimator 1.15.1
termcolor            1.1.0
typing-extensions    3.7.4.3
Werkzeug             1.0.1
wheel                0.36.2
wrapt                1.12.1
zipp                 3.4.0
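
The ImportError above says the dynamic loader could not find libd3d12.so. Here is a minimal diagnostic sketch that can narrow this down. It assumes that a GPU-enabled WSL2 install mounts the library under /usr/lib/wsl/lib and exposes the GPU device as /dev/dxg; both are assumptions about the host setup, not something this report confirms. The helper names find_shared_library and diagnose are hypothetical.

```python
import os

# Assumed default locations on a GPU-enabled WSL2 install; adjust as needed.
DEFAULT_LIB_DIRS = ["/usr/lib/wsl/lib"]
DXG_DEVICE = "/dev/dxg"


def find_shared_library(name, search_dirs):
    """Return the first existing path to `name` inside `search_dirs`, else None."""
    for directory in search_dirs:
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate):
            return candidate
    return None


def diagnose(name="libd3d12.so", search_dirs=None):
    """Collect the facts that usually explain a missing-library ImportError."""
    if search_dirs is None:
        # LD_LIBRARY_PATH entries are checked first, then the assumed default.
        env_dirs = [d for d in os.environ.get("LD_LIBRARY_PATH", "").split(":") if d]
        search_dirs = env_dirs + DEFAULT_LIB_DIRS
    return {
        "library_path": find_shared_library(name, search_dirs),
        "dxg_device_present": os.path.exists(DXG_DEVICE),
        "searched": search_dirs,
    }


if __name__ == "__main__":
    for key, value in diagnose().items():
        print(f"{key}: {value}")
```

If library_path comes back as None, the library is not visible in any of the searched directories, which would match the error in the traceback.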

Can DirectML help run a CUDA based AI App on AMD GPU?

Hello, I came across DirectML while trying to set up the following app by facebookresearch on a local Windows 10 machine. I don't have an NVIDIA card but an AMD Vega 64, so I have not been able to run it so far.

I read in the DirectML documentation that it can run cross-platform code, so I'm wondering whether it is possible to run the app below on a Windows 10 PC (perhaps through the Ubuntu subsystem).

https://github.com/facebookresearch/pifuhd

Memory leak occurred when a for loop was added to the "HelloDirectML" demo

Hi,
When I add a for loop to the HelloDirectML demo to run the operator continuously, memory usage keeps growing the whole time, which looks like a memory leak.

main.zip

for (int i = 0; i != 1000; ++i) {
        commandList->SetDescriptorHeaps(ARRAYSIZE(d3D12DescriptorHeaps), d3D12DescriptorHeaps);

        dmlCommandRecorder->RecordDispatch(commandList.get(), dmlCompiledOperator.get(), dmlBindingTable.get());

        CloseExecuteResetWait(d3D12Device, commandQueue, commandAllocator, commandList);

        std::this_thread::sleep_for(std::chrono::milliseconds(30));
}

How to add custom operator

This issue is merely a question.

Based on my previous experience deploying models with TensorRT, there are always some operators in the wild that the framework does not support. And to speed up a model, you usually want to write custom kernels. TensorRT solves this by allowing developers to write a plugin for missing ops or to fuse some ops manually. How does DirectML solve this? Via a DirectCompute kernel call? I can't find any resource (docs or tutorial) about it.

DirectMLX.h compile error with clang

This can be reproduced with clang 11.0.0, downloaded from LLVM.org.

The compile error is

.\Libraries\DirectMLX.h(176,72): error: arithmetic on a pointer to an incomplete type 'const dml::Expression'

The clang invocation command line and output are:

> "c:\Program Files\LLVM\bin\clang-cl.exe" -I.\Python\dependencies\microsoft.ai.directml.1.4.0\include /std:c++17 -I.\Libraries\ main.cpp
In file included from main.cpp:1:
In file included from ./precomp.h:33:
.\Libraries\DirectMLX.h(176,72): error: arithmetic on a pointer to an incomplete type 'const dml::Expression'
                : m_begin(dml::detail::data(container)), m_end(m_begin + dml::detail::size(container)) {}
                                                               ~~~~~~~ ^
.\Libraries\DirectMLX.h(627,68): note: in instantiation of function template specialization 'dml::detail::span<const
      dml::Expression>::span<dml::detail::span<const dml::Expression> &>' requested here
            detail::GraphDesc graph = m_graphBuilder->GetGraphDesc(outputs);
                                                                   ^
.\Libraries\DirectMLX.h(249,11): note: forward declaration of 'dml::Expression'
    class Expression;
          ^
1 error generated.

There is also a warning when "-Wunused-variable /DNDEBUG" is turned on. The warning is

.\Libraries\DirectMLX.h(479,28): warning: unused variable 'dimensionCount' [-Wunused-variable]

The clang invocation command line and output are:

> "c:\Program Files\LLVM\bin\clang-cl.exe" -I.\Python\dependencies\microsoft.ai.directml.1.4.0\include /std:c++17 -I.\Libraries\ -Wunused-variable /DNDEBUG main.cpp
In file included from main.cpp:1:
In file included from ./precomp.h:33:
.\Libraries\DirectMLX.h(479,28): warning: unused variable 'dimensionCount' [-Wunused-variable]
            const uint32_t dimensionCount = static_cast<uint32_t>(tensorSizes.size());
                           ^
.\Libraries\DirectMLX.h(176,72): error: arithmetic on a pointer to an incomplete type 'const dml::Expression'
                : m_begin(dml::detail::data(container)), m_end(m_begin + dml::detail::size(container)) {}
                                                               ~~~~~~~ ^
.\Libraries\DirectMLX.h(627,68): note: in instantiation of function template specialization 'dml::detail::span<const
      dml::Expression>::span<dml::detail::span<const dml::Expression> &>' requested here
            detail::GraphDesc graph = m_graphBuilder->GetGraphDesc(outputs);
                                                                   ^
.\Libraries\DirectMLX.h(249,11): note: forward declaration of 'dml::Expression'
    class Expression;
          ^
1 warning and 1 error generated.

Embedding Layer throws error when using with a DML device

Hi, I'd like to report an issue I'm having while building a model in a DML scope. If I build my model inside a CPU scope and train it inside a DML scope, it works just fine, but building the model inside a DML scope gives me this specific error. I'm posting this because someone might run into a similar issue in the future. Here's the error log.

WARNING:tensorflow:From C:\Program Files\Python37\lib\site-packages\tensorflow_core\python\ops\math_grad.py:1394: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
1/Unknown - 13s 13s/step
Traceback (most recent call last):
File "lstm_seq2seq_attention.py", line 104, in <module>
model.fit(dataset, epochs=epochs, callbacks=[ckpt])
File "C:\Program Files\Python37\lib\site-packages\tensorflow_core\python\keras\engine\training.py", line 727, in fit
use_multiprocessing=use_multiprocessing)
File "C:\Program Files\Python37\lib\site-packages\tensorflow_core\python\keras\engine\training_v2.py", line 331, in fit
total_epochs=epochs)
File "C:\Program Files\Python37\lib\site-packages\tensorflow_core\python\keras\engine\training_v2.py", line 123, in run_one_epoch
batch_outs = execution_function(iterator)
File "C:\Program Files\Python37\lib\site-packages\tensorflow_core\python\keras\engine\training_v2_utils.py", line 86, in execution_function
distributed_function(input_fn))
File "C:\Program Files\Python37\lib\site-packages\tensorflow_core\python\eager\def_function.py", line 461, in call
return self.stateless_fn(*args, **kwds)
File "C:\Program Files\Python37\lib\site-packages\tensorflow_core\python\eager\function.py", line 1822, in call
return graph_function.filtered_call(args, kwargs) # pylint: disable=protected-access
File "C:\Program Files\Python37\lib\site-packages\tensorflow_core\python\eager\function.py", line 1141, in filtered_call
self.captured_inputs)
File "C:\Program Files\Python37\lib\site-packages\tensorflow_core\python\eager\function.py", line 1224, in call_flat
ctx, args, cancellation_manager=cancellation_manager)
File "C:\Program Files\Python37\lib\site-packages\tensorflow_core\python\eager\function.py", line 511, in call
ctx=ctx)
File "C:\Program Files\Python37\lib\site-packages\tensorflow_core\python\eager\execute.py", line 67, in quick_execute
six.raise_from(core.status_to_exception(e.code, message), None)
File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation model/embedding/embedding_lookup/Read/ReadVariableOp: Could not satisfy explicit device specification '/job:localhost/replica:0/task:0/device:DML:0' because no supported kernel for DML devices is available.
Colocation Debug Info:
Colocation group had the following types and supported devices:
Root Member(assigned_device_name_index_=2 requested_device_name_='/job:localhost/replica:0/task:0/device:DML:0' assigned_device_name_='/job:localhost/replica:0/task:0/device:DML:0' resource_device_name_='/job:localhost/replica:0/task:0/device:DML:0' supported_device_types_=[CPU] possible_devices_=[]
StridedSlice: DML CPU
VariableShape: DML CPU
Unique: CPU
Shape: DML CPU
_Arg: DML CPU
ResourceGather: DML CPU
ReadVariableOp: DML CPU
Identity: DML CPU
Const: DML CPU
UnsortedSegmentSum: CPU
Mul: DML CPU
AssignVariableOp: DML CPU
ResourceScatterAdd: CPU
Sqrt: DML CPU
AddV2: DML CPU
RealDiv: DML CPU
AssignSubVariableOp: DML CPU
NoOp: DML CPU

Colocation members, user-requested devices, and framework assigned devices, if any:
model_embedding_embedding_lookup_read_readvariableop_resource (_Arg) framework assigned device=/job:localhost/replica:0/task:0/device:DML:0
adam_adam_update_readvariableop_resource (_Arg) framework assigned device=/job:localhost/replica:0/task:0/device:DML:0
adam_adam_update_readvariableop_2_resource (_Arg) framework assigned device=/job:localhost/replica:0/task:0/device:DML:0
model/embedding/embedding_lookup/Read/ReadVariableOp (ReadVariableOp) /job:localhost/replica:0/task:0/device:DML:0
model/embedding/embedding_lookup (ResourceGather) /job:localhost/replica:0/task:0/device:DML:0
model/embedding/embedding_lookup/Identity_1 (Identity) /job:localhost/replica:0/task:0/device:DML:0
VariableShape_1 (VariableShape)
Adam/Adam/update/Read/ReadVariableOp (ReadVariableOp) /job:localhost/replica:0/task:0/device:DML:0
Adam/Adam/update/Unique (Unique) /job:localhost/replica:0/task:0/device:DML:0
Adam/Adam/update/Shape (Shape) /job:localhost/replica:0/task:0/device:DML:0
Adam/Adam/update/strided_slice/stack (Const) /job:localhost/replica:0/task:0/device:DML:0
Adam/Adam/update/strided_slice/stack_1 (Const) /job:localhost/replica:0/task:0/device:DML:0
Adam/Adam/update/strided_slice/stack_2 (Const) /job:localhost/replica:0/task:0/device:DML:0
Adam/Adam/update/strided_slice (StridedSlice) /job:localhost/replica:0/task:0/device:DML:0
Adam/Adam/update/UnsortedSegmentSum (UnsortedSegmentSum) /job:localhost/replica:0/task:0/device:DML:0
Adam/Adam/update/mul (Mul) /job:localhost/replica:0/task:0/device:DML:0
Adam/Adam/update/ReadVariableOp (ReadVariableOp) /job:localhost/replica:0/task:0/device:DML:0
Adam/Adam/update/mul_1 (Mul) /job:localhost/replica:0/task:0/device:DML:0
Adam/Adam/update/AssignVariableOp (AssignVariableOp) /job:localhost/replica:0/task:0/device:DML:0
Adam/Adam/update/ResourceScatterAdd (ResourceScatterAdd) /job:localhost/replica:0/task:0/device:DML:0
Adam/Adam/update/ReadVariableOp_1 (ReadVariableOp) /job:localhost/replica:0/task:0/device:DML:0
Adam/Adam/update/mul_2 (Mul) /job:localhost/replica:0/task:0/device:DML:0
Adam/Adam/update/mul_3 (Mul) /job:localhost/replica:0/task:0/device:DML:0
Adam/Adam/update/ReadVariableOp_2 (ReadVariableOp) /job:localhost/replica:0/task:0/device:DML:0
Adam/Adam/update/mul_4 (Mul) /job:localhost/replica:0/task:0/device:DML:0
Adam/Adam/update/AssignVariableOp_1 (AssignVariableOp) /job:localhost/replica:0/task:0/device:DML:0
Adam/Adam/update/ResourceScatterAdd_1 (ResourceScatterAdd) /job:localhost/replica:0/task:0/device:DML:0
Adam/Adam/update/ReadVariableOp_3 (ReadVariableOp) /job:localhost/replica:0/task:0/device:DML:0
Adam/Adam/update/Sqrt (Sqrt) /job:localhost/replica:0/task:0/device:DML:0
Adam/Adam/update/mul_5 (Mul) /job:localhost/replica:0/task:0/device:DML:0
Adam/Adam/update/add (AddV2) /job:localhost/replica:0/task:0/device:DML:0
Adam/Adam/update/truediv (RealDiv) /job:localhost/replica:0/task:0/device:DML:0
Adam/Adam/update/AssignSubVariableOp (AssignSubVariableOp) /job:localhost/replica:0/task:0/device:DML:0
Adam/Adam/update/group_deps (NoOp) /job:localhost/replica:0/task:0/device:DML:0

Op: ReadVariableOp
Node attrs: dtype=DT_FLOAT
Registered kernels:
device='CPU'
device='DML'

     [[{{node model/embedding/embedding_lookup/Read/ReadVariableOp}}]] [Op:__inference_distributed_function_31502]
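
The colocation group in the log lists, for each op type, the devices that have a registered kernel. A small sketch for picking out the op types that are the likely cause of the failed DML placement: it parses lines of the form "Op: DEVICES" and returns the ones that lack a given device. The function name ops_without_device is hypothetical, and the excerpt is copied from the log above.

```python
def ops_without_device(colocation_lines, device="DML"):
    """Return op types whose supported-device list lacks `device`."""
    missing = []
    for line in colocation_lines:
        op, sep, devices = line.partition(":")
        if not sep:
            continue  # not an "Op: DEVICES" line
        if device not in devices.split():
            missing.append(op.strip())
    return missing


# Excerpt copied from the colocation group in the log above.
COLOCATION_EXCERPT = """\
StridedSlice: DML CPU
Unique: CPU
ResourceGather: DML CPU
ReadVariableOp: DML CPU
UnsortedSegmentSum: CPU
ResourceScatterAdd: CPU
AssignSubVariableOp: DML CPU
""".splitlines()

print(ops_without_device(COLOCATION_EXCERPT))
# ['Unique', 'UnsortedSegmentSum', 'ResourceScatterAdd']
```

These three op types are listed with CPU only, which is consistent with the group reporting supported_device_types_=[CPU] while the variable is explicitly pinned to DML:0.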

convolution and gemm of PyDirectML throw exception without optional bias

According to the interface definition, the bias parameter of gemm and convolution is optional. However, if this parameter is omitted, the device.compute method throws an exception.

The exception is

RuntimeError: Unknown exception

The following sample code reproduces the gemm issue. The commented-out bias code path works fine.

import pydirectml as dml
import numpy as np

device = dml.Device()
builder = dml.GraphBuilder(device)
data_type = dml.TensorDataType.FLOAT32
flags = dml.TensorFlags.OWNED_BY_DML
input_bindings = []
a = dml.input_tensor(builder, 0, dml.TensorDesc(data_type, [1, 1, 3, 4]))
input_bindings.append(dml.Binding(a, np.ones([1, 1, 3, 4], dtype=np.float32)))
b = dml.input_tensor(builder, 1, dml.TensorDesc(data_type, [1, 1, 4, 3]))
input_bindings.append(dml.Binding(b, np.ones([1, 1, 4, 3], dtype=np.float32)))
# bias = dml.input_tensor(builder, 2, dml.TensorDesc(data_type, flags, [1, 1, 3, 3]))
# input_bindings.append(dml.Binding(bias, np.zeros([3, 3], dtype=np.float32)))
# c = dml.gemm(a, b, bias)
c = dml.gemm(a, b)
op = builder.build(dml.ExecutionFlags.NONE, [c])
output_data = device.compute(op, input_bindings, [c])
output_tensor = np.array(output_data[0], np.float32)
print(output_tensor)

The following sample code reproduces the convolution issue. The commented-out bias code path works fine.

import pydirectml as dml
import numpy as np

device = dml.Device()
builder = dml.GraphBuilder(device)
data_type = dml.TensorDataType.FLOAT32
flags = dml.TensorFlags.OWNED_BY_DML
input_bindings = []
input = dml.input_tensor(builder, 0, dml.TensorDesc(data_type, [1, 1, 5, 5]))
input_bindings.append(dml.Binding(input, np.ones([1, 1, 5, 5], dtype=np.float32)))
weight = dml.input_tensor(builder, 1, dml.TensorDesc(data_type, flags, [1, 1, 3, 3]))
input_bindings.append(dml.Binding(weight, np.ones([1, 1, 3, 3], dtype=np.float32)))
# bias = dml.input_tensor(builder, 2, dml.TensorDesc(data_type, flags, [1, 1, 1, 1]))
# input_bindings.append(dml.Binding(bias, np.zeros([1, 1, 1, 1], dtype=np.float32)))
# conv = dml.convolution(input, weight, bias)
conv = dml.convolution(input, weight)
op = builder.build(dml.ExecutionFlags.NONE, [conv])
output_data = device.compute(op, input_bindings, [conv])
output_tensor = np.array(output_data[0], np.float32)
print(output_tensor)

The expected behavior is that no exception is thrown and the correct output is computed.
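
For reference, the values a correct run should produce can be worked out without DirectML. The sketch below is a plain-Python stand-in, not the pydirectml implementation: it assumes gemm without bias is an ordinary matrix product, and that convolution with default arguments uses stride 1 and no padding. The helper names are hypothetical.

```python
def matmul(a, b):
    """Plain matrix product of two nested-list matrices."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]


def conv2d_valid(image, kernel):
    """2-D cross-correlation with stride 1 and no padding (assumed defaults)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[y + i][x + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for x in range(out_w)]
            for y in range(out_h)]


def ones(rows, cols):
    """Matrix of the given shape filled with 1.0."""
    return [[1.0] * cols for _ in range(rows)]


# Same shapes as the two repro scripts, with the leading [1, 1] dims dropped.
gemm_expected = matmul(ones(3, 4), ones(4, 3))        # 3x3, every entry 4.0
conv_expected = conv2d_valid(ones(5, 5), ones(3, 3))  # 3x3, every entry 9.0
print(gemm_expected)
print(conv_expected)
```

Under those assumptions, the gemm repro should print a 3x3 block of 4.0 and the convolution repro a 3x3 block of 9.0.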

[pydirectml] fail to run a graph with single reshape (reinterpret) node

The following code reproduces this issue:

import pydirectml as dml
import numpy as np

device = dml.Device()
builder = dml.GraphBuilder(device)
input_bindings = []
input = dml.input_tensor(builder, 0, dml.TensorDesc(dml.TensorDataType.FLOAT32, [16, 4, 4, 10]))
input_bindings.append(dml.Binding(input, np.ones([16, 4, 4, 10], dtype=np.float32)))
reshape = dml.reinterpret(input, dml.TensorDataType.FLOAT32, [1, 1, 256, 10], [2560, 2560, 10, 1])
# works with the following line
# reshape = dml.activation_identity(reshape)
op = builder.build(dml.ExecutionFlags.NONE, [reshape])
output_data = device.compute(op, input_bindings, [reshape])
output_tensor = np.array(output_data[0], np.float32)
print(output_tensor)

The error log is

Traceback (most recent call last):
  File "reshape.py", line 12, in <module>
    op = builder.build(dml.ExecutionFlags.NONE, [reshape])
RuntimeError: E_INVALIDARG

As noted in the code comment, appending an activation_identity node makes it work. The issue reproduces both before and after PR #69.
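
As a sanity check on the repro itself, the strides passed to reinterpret can be recomputed by hand. The sketch below is plain Python, independent of pydirectml, and the helper names are hypothetical. It derives row-major (packed) strides for both shapes and confirms that the element counts match, which suggests the E_INVALIDARG is not caused by a bad shape or stride argument.

```python
def packed_strides(sizes):
    """Row-major (packed) element strides for a tensor of the given sizes."""
    strides = [1] * len(sizes)
    for i in range(len(sizes) - 2, -1, -1):
        strides[i] = strides[i + 1] * sizes[i + 1]
    return strides


def element_count(sizes):
    """Total number of elements in a tensor of the given sizes."""
    count = 1
    for s in sizes:
        count *= s
    return count


old_sizes = [16, 4, 4, 10]
new_sizes = [1, 1, 256, 10]

# Reinterpreting only makes sense if both views cover the same elements.
assert element_count(old_sizes) == element_count(new_sizes)
print(packed_strides(new_sizes))  # [2560, 2560, 10, 1]
print(packed_strides(old_sizes))  # [160, 40, 10, 1]
```

The strides computed for the new shape equal the ones used in the repro, so the reinterpret call describes a valid packed view of the same 2560 elements.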
