microsoft / directml

DirectML is a high-performance, hardware-accelerated DirectX 12 library for machine learning. DirectML provides GPU acceleration for common machine learning tasks across a broad range of supported hardware and drivers, including all DirectX 12-capable GPUs from vendors such as AMD, Intel, NVIDIA, and Qualcomm.

License: MIT License


directml's Introduction

DirectML

When used standalone, the DirectML API is a low-level DirectX 12 library suitable for high-performance, low-latency applications such as frameworks, games, and other real-time applications. Its seamless interoperability with Direct3D 12, low overhead, and conformance across hardware make DirectML ideal for accelerating machine learning when both high performance and reliable, predictable results across hardware are critical.

More information about DirectML can be found in Introduction to DirectML.

Getting Started with DirectML
DirectML Samples
DxDispatch Tool
Windows ML on DirectML
ONNX Runtime on DirectML
PyTorch with DirectML
TensorFlow with DirectML
Feedback
External Links
- Documentation
- More information
Contributing

Visit the DirectX Landing Page for more resources for DirectX developers.

Getting Started with DirectML

DirectML is distributed as a system component of Windows 10 and is available in Windows 10, version 1903 (10.0; Build 18362) and newer.

Starting with DirectML version 1.4.0, DirectML is also available as a standalone redistributable package (see Microsoft.AI.DirectML), which is useful for applications that wish to use a fixed version of DirectML, or when running on older versions of Windows 10.

Hardware requirements

DirectML requires a DirectX 12 capable device. Almost all commercially-available graphics cards released in the last several years support DirectX 12. Examples of compatible hardware include:

AMD GCN 1st Gen (Radeon HD 7000 series) and above
Intel Haswell (4th-gen core) HD Integrated Graphics and above
NVIDIA Kepler (GTX 600 series) and above
Qualcomm Adreno 600 and above

For application developers

DirectML exposes a native C++ DirectX 12 API. The header and library (DirectML.h/DirectML.lib) are available as part of the redistributable NuGet package, and are also included in the Windows 10 SDK version 10.0.18362 or newer.

The Windows 10 SDK can be downloaded from the Windows Dev Center
Microsoft.AI.DirectML on the NuGet Gallery
DirectML programming guide
DirectML API reference

For users, data scientists, and researchers

DirectML is built in as a backend to several frameworks, such as Windows ML, ONNX Runtime, and TensorFlow.

See the following sections for more information:

Windows ML on DirectML
ONNX Runtime on DirectML
TensorFlow with DirectML
PyTorch with DirectML

DirectML Samples

DirectML C++ sample code is available under Samples.

HelloDirectML: A minimal "hello world" application that executes a single DirectML operator.
DirectMLSuperResolution: A sample that uses DirectML to execute a basic super-resolution model to upscale video from 540p to 1080p in real time.
yolov4: YOLOv4 is an object detection model capable of recognizing up to 80 different classes of objects in an image. This sample contains a complete end-to-end implementation of the model using DirectML, and is able to run in real time on a user-provided video stream.

DirectML Python sample code is available under Python/samples. The samples require PyDirectML, an open source Python projection library for DirectML, which can be built and installed to a Python executing environment from Python/src. Refer to the Python/README.md file for more details.

MobileNet: Adapted from the ONNX MobileNet model. MobileNet classifies an image into 1000 different classes. It is highly efficient in speed and size, ideal for mobile applications.
MNIST: Adapted from the ONNX MNIST model. MNIST predicts handwritten digits using a convolutional neural network.
SqueezeNet: Based on the ONNX SqueezeNet model. SqueezeNet performs image classification trained on the ImageNet dataset. It is highly efficient and provides results with good accuracy.
FNS-Candy: Adapted from the Windows ML Style Transfer model sample, FNS-Candy re-applies specific artistic styles on regular images.
Super Resolution: Adapted from the ONNX Super Resolution model, Super-Res upscales and sharpens the input images to refine the details and improve image quality.

DxDispatch Tool

DxDispatch is a simple command-line executable for launching DirectX 12 compute programs (including DirectML operators) without writing all the C++ boilerplate.

Windows ML on DirectML

Windows ML (WinML) is a high-performance, reliable API for deploying hardware-accelerated ML inferences on Windows devices. DirectML provides the GPU backend for Windows ML.

DirectML acceleration can be enabled in Windows ML using the LearningModelDevice with any one of the DirectX DeviceKinds.

For more information, see Get Started with Windows ML.

Windows Machine Learning Overview (docs.microsoft.com)
Windows Machine Learning GitHub
WinMLRunner, a tool for executing ONNX models using WinML with DirectML

ONNX Runtime on DirectML

ONNX Runtime is a cross-platform inferencing and training accelerator compatible with many popular ML/DNN frameworks, including PyTorch, TensorFlow/Keras, scikit-learn, and more.

DirectML is available as an optional execution provider for ONNX Runtime that provides hardware acceleration when running on Windows 10.

For more information about getting started, see Using the DirectML execution provider.
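
As a rough illustration of what selecting the execution provider can look like from Python, the sketch below prefers DirectML when the installed ONNX Runtime build exposes it and otherwise falls back to the CPU. The helper names `choose_providers` and `create_session` are hypothetical, and the sketch assumes a DirectML-enabled build of ONNX Runtime (for example the `onnxruntime-directml` PyPI package) is installed; the provider names are the ones used in the issue code further down this page.

```python
def choose_providers(available):
    """Return providers in preference order: DirectML first, CPU as fallback."""
    preferred = ["DmlExecutionProvider", "CPUExecutionProvider"]
    return [name for name in preferred if name in available]


def create_session(model_path):
    """Create an inference session, using DirectML if this build offers it."""
    # Imported lazily so choose_providers works without ONNX Runtime installed.
    import onnxruntime as ort

    providers = choose_providers(ort.get_available_providers())
    return ort.InferenceSession(model_path, providers=providers)
```

Calling `get_providers()` on the returned session shows which providers it actually ended up with.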

PyTorch with DirectML

PyTorch with DirectML enables training and inference of complex machine learning models on a wide range of DirectX 12-compatible hardware. This is done through torch-directml, a plugin for PyTorch.

PyTorch with DirectML is supported on both the latest versions of Windows and the Windows Subsystem for Linux, and is available for download as a PyPI package. For more information about getting started with torch-directml, see our Windows or WSL 2 guidance on Microsoft Learn.
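
A minimal sketch of what using the plugin can look like is shown below. It assumes the `torch-directml` package is installed and exposes `torch_directml.device()` for obtaining the default DirectML device; the wrapper function `add_on_directml` is a hypothetical name used only for this illustration.

```python
def add_on_directml():
    """Add two small tensors on the default DirectML device.

    Returns the result as a Python list, or None if torch-directml
    is not installed in this environment.
    """
    try:
        import torch
        import torch_directml
    except ImportError:
        return None

    dml = torch_directml.device()          # default DirectML adapter
    a = torch.tensor([1.0, 2.0]).to(dml)   # copy the inputs to the device
    b = torch.tensor([3.0, 4.0]).to(dml)
    return (a + b).cpu().tolist()          # bring the result back to the host


result = add_on_directml()
print("torch-directml not installed" if result is None else result)
```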

TensorFlow with DirectML

TensorFlow is a popular open source platform for machine learning and is a leading framework for training of machine learning models.

DirectML acceleration for TensorFlow 1.15 is currently available for Public Preview. TensorFlow on DirectML enables training and inference of complex machine learning models on a wide range of DirectX 12-compatible hardware.

TensorFlow on DirectML is supported on both the latest versions of Windows 10 and the Windows Subsystem for Linux, and is available for download as a PyPI package. For more information about getting started, see GPU accelerated ML training (docs.microsoft.com)

Feedback

We look forward to hearing from you!

For TensorFlow with DirectML issues, bugs, and feedback; or for general DirectML issues and feedback, please file an issue or contact us directly at [email protected].
For PyTorch with DirectML issues, bugs, and feedback; or for general DirectML issues and feedback, please file an issue or contact us directly at [email protected].
For Windows ML issues, please file a GitHub issue at microsoft/Windows-Machine-Learning or contact us directly at [email protected].
For ONNX Runtime issues, please file an issue at microsoft/onnxruntime.

External Links

Documentation

DirectML programming guide
DirectML API reference

More information

Introducing DirectML (Game Developers Conference '19)
Accelerating GPU Inferencing with DirectML and DirectX 12 (SIGGRAPH '18)
Windows AI: hardware-accelerated ML on Windows devices (Microsoft Build '20)
Gaming with Windows ML (DirectX Developer Blog)
DirectML at GDC 2019 (DirectX Developer Blog)
DirectX ❤ Linux (DirectX Developer Blog)

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

directml's People

Contributors

Stargazers

Watchers

Forkers

bailehang stephensmitchell-forks fujunwei pmbrown1055 bhouse-microsoft kevincog-msft hrjanardhan bbernhar davionzhang darrenscrews huningxin lpbourret oneengineer roman380 jaydenchou john-h-k ganeshkumartk crahrig ipadawan fredyfx phymucs rheehot anarinsk chomolungma haldcs ai-hub-deep-learning-fundamental bhaskers-blu-org2 biguncle sbcyr wintersnow212 kanbuncha taffywrinkle claudiusgonzo furyhawk 975150313 hixio-mh c0ns0le mousumih gcourtney27 uzbekdev1 dx-tools snapbuy umangyadav i-zby lihbgame ghimanshu98 samratkishore neophack ncnnnnn devbox10 ayazkadam global-localhost global19 global19-atlassian-net liyuming1978 dxml mfkiwl doytsujin mingmingtasd jacquesvanrhynmsft qpc-database toruserajee riyasoni1 xeddmc yiwang akshayytondak shahjaidev cdalag standardgalactic marmikreal yellowsimulator kbillore ryanlai2 xwyangjshb dhinagaran-s stjordanis andreabrantes atelis smk2007 canaxx jofigue jfeil aeioaeu shubhammittal98 icodein deeplearningnrs pavlvstc evolution99 mmdonohue afiqmuzaffar ricklentz luodiw zeeroocooll mf3129 ultimopl norm1988 alphayama gokultonpe loidohm nichm0617

directml's Issues

Will DirectML inside WSL be supported on Windows on Arm?

It would be great to be able to develop on a Surface Pro X!

mnist.py sample fails with debug layer enabled

The error log is

D3D12 ERROR: ID3D12GraphicsCommandList::ExecuteMetaCommand: Supplied parameters size [40] doesn't match enumerated size[32]. [ EXECUTION ERROR #1174: META_COMMAND_PARAMETER_SIZE_MISMATCH]

I enabled the DirectML debugging layer.

I ran this sample on a machine with Windows 10 20H2 build 19042.685 and Intel HD630 GPU with driver version 27.20.100.9030 (11/27/2020).

BTW, I can work around it by supplying ExecutionFlags.DISABLE_META_COMMANDS.

Extremely bad performance (25x) compared to CPU-only mode (26 ms vs 670 ms)

This is a follow-up to microsoft/onnxruntime#5617.
Using the DML provider results in 25x worse performance (670 ms vs 26 ms on the CPU), most likely caused by the MemcpyToHost operation.

Run:

import numpy as np
import onnxruntime
from timeit import default_timer as timer

time_dict = {}
def get_min_max(name, elapsed):
    min_tm = time_dict[name][0]
    max_tm = time_dict[name][1]
    if min_tm > elapsed:
        min_tm = elapsed
    if max_tm < elapsed:
        max_tm = elapsed
    return min_tm, max_tm

class Benchmark_Block(object):
    def __init__(self, name='code-block'):
        self.name = name
        if name not in time_dict.keys():
            time_dict[name] = (999, -999)

    def __enter__(self):
        self.start = timer()

    def __exit__(self, exc_type, exc_value, exc_traceback):
        end = timer()
        elapsed = end - self.start 
        (min_tm, max_tm) = get_min_max(self.name, elapsed)
        time_dict[self.name] = (min_tm, max_tm)
        print(f'{self.name} took {elapsed*1000:.3f} ms [min/max: {min_tm*1000:.1f}/{max_tm*1000:.1f}] ms')

model_path = 'r18_q_onnx.onnx'
img_input = np.random.randn(1, 3, 112, 112)
img_input = np.asarray(img_input, dtype=np.float32)  # cast float64 -> float32

sess = onnxruntime.InferenceSession(model_path)
ort_inputs = {sess.get_inputs()[0].name: img_input}
ort_outs = sess.run(None, ort_inputs)

providers = onnxruntime.get_available_providers()
print(f'available providers : {providers}')
print(f'current device: {onnxruntime.get_device()}')

sess.set_providers(['CPUExecutionProvider'])
with Benchmark_Block('CPU_ONNX') as blk:
    ort_outs = sess.run(None, ort_inputs)

sess.set_providers(['DmlExecutionProvider'])
with Benchmark_Block('DML_ONNX') as blk:
    ort_outs = sess.run(None, ort_inputs)

Onnx model : https://gofile.io/d/GdkIeR

Is there a workaround to avoid this at the moment?
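
One thing a single timed call cannot separate is one-time setup cost from steady-state latency. A minimal sketch of a benchmark helper with warm-up iterations is shown below; the `benchmark` function and its `warmup` and `runs` parameters are hypothetical names, not part of ONNX Runtime, and whether warm-up explains any of the gap reported here is not established.

```python
from timeit import default_timer as timer


def benchmark(fn, warmup=3, runs=20):
    """Time fn() over several runs after discarding warm-up calls.

    Returns (min, mean, max) in seconds over the timed runs.
    """
    if runs < 1:
        raise ValueError("runs must be at least 1")

    for _ in range(warmup):      # untimed calls absorb one-time setup
        fn()

    timings = []
    for _ in range(runs):        # timed steady-state calls
        start = timer()
        fn()
        timings.append(timer() - start)

    return min(timings), sum(timings) / len(timings), max(timings)
```

With the session from the script above, this would be called as `benchmark(lambda: sess.run(None, ort_inputs))` once per provider.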

Detection on custom yolov3-tiny doesn't work well

I trained my own custom yolov3-tiny model on Google Colab with an input resolution of 928x928.

I converted the custom yolov3-tiny model using convert.py:
py convert.py -weights ./data/yolov3-tiny.weights -output ./checkpoints/yolov3-tiny.tf -num_classes 4 --tiny
I then ran inference with detect_video.py (after also changing the class names in coco.names):
py detect_video.py -classes ./data/coco.names -weights ./checkpoints/yolov3-tiny.tf --tiny -size 928 -video ./data/clip.mkv -num_classes 4
It does work, but even with a very high-resolution input (928x928) the detections aren't stable, and when I reduce the input size a bit (to 608x608) it doesn't detect anything.

For comparison, I ran inference on the same model with the OpenCV darknet port, using only the CPU. Even with the input image reduced to 160x160, the detections worked perfectly.
It was just slower, obviously, because it wasn't using my AMD RX 480 GPU.

Performance seems a bit better than when running only on the CPU.
But the accuracy is so low that I can't depend on this method.
My guess is that I'm doing something wrong in the conversion,
because the model's accuracy drops drastically.

Can you point out what I am doing wrong?
Or is the loss of accuracy the cost of converting YOLO models to TF models?

DirectML Linux support

Are there plans for support on Linux?
It would be quite useful for AMD users since ROCm supports only a limited set of cards currently and Intel GPU users only have PlaidML as an option.

Cannot build the yolov4 sample

The only errors left are the "'&' requires l-value" errors, and I can't figure out what the issue is. I'm building from the latest commit on the master branch (6fed953).

Severity Code Description Project File Line Suppression State
Error C2102 '&' requires l-value yolov4 E:\MLStuff\DirectML-master\Samples\yolov4\WeightData.cpp 45
Error C2102 '&' requires l-value yolov4 E:\MLStuff\DirectML-master\Samples\yolov4\WeightData.cpp 47
Error C2102 '&' requires l-value yolov4 E:\MLStuff\DirectML-master\Samples\yolov4\WeightData.cpp 55
Error C2102 '&' requires l-value yolov4 E:\MLStuff\DirectML-master\Samples\yolov4\WeightData.cpp 57
Error C2102 '&' requires l-value yolov4 E:\MLStuff\DirectML-master\Samples\yolov4\WeightData.cpp 92
Error C2102 '&' requires l-value yolov4 E:\MLStuff\DirectML-master\Samples\yolov4\yolov4.cpp 461
Error C2102 '&' requires l-value yolov4 E:\MLStuff\DirectML-master\Samples\yolov4\yolov4.cpp 647
Error C2102 '&' requires l-value yolov4 E:\MLStuff\DirectML-master\Samples\yolov4\yolov4.cpp 909
Error C2102 '&' requires l-value yolov4 E:\MLStuff\DirectML-master\Samples\yolov4\yolov4.cpp 911
Error C2102 '&' requires l-value yolov4 E:\MLStuff\DirectML-master\Samples\yolov4\yolov4.cpp 940
Error C2102 '&' requires l-value yolov4 E:\MLStuff\DirectML-master\Samples\yolov4\yolov4.cpp 942
Error C2102 '&' requires l-value yolov4 E:\MLStuff\DirectML-master\Samples\yolov4\yolov4ResourceBuilder.cpp 466
Error C2102 '&' requires l-value yolov4 E:\MLStuff\DirectML-master\Samples\yolov4\yolov4ResourceBuilder.cpp 468
Error C2102 '&' requires l-value yolov4 E:\MLStuff\DirectML-master\Samples\yolov4\yolov4ResourceBuilder.cpp 488
Error C2102 '&' requires l-value yolov4 E:\MLStuff\DirectML-master\Samples\yolov4\yolov4ResourceBuilder.cpp 490
Error C2102 '&' requires l-value yolov4 E:\MLStuff\DirectML-master\Samples\yolov4\yolov4ResourceBuilder.cpp 495
Error C2102 '&' requires l-value yolov4 E:\MLStuff\DirectML-master\Samples\yolov4\yolov4ResourceBuilder.cpp 497
Error C2102 '&' requires l-value yolov4 E:\MLStuff\DirectML-master\Samples\yolov4\yolov4ResourceBuilder.cpp 505
Error C2102 '&' requires l-value yolov4 E:\MLStuff\DirectML-master\Samples\yolov4\yolov4ResourceBuilder.cpp 507
Error C2102 '&' requires l-value yolov4 E:\MLStuff\DirectML-master\Samples\yolov4\yolov4ResourceBuilder.cpp 512
Error C2102 '&' requires l-value yolov4 E:\MLStuff\DirectML-master\Samples\yolov4\yolov4ResourceBuilder.cpp 514
Error C2102 '&' requires l-value yolov4 E:\MLStuff\DirectML-master\Samples\yolov4\yolov4ResourceBuilder.cpp 522
Error C2102 '&' requires l-value yolov4 E:\MLStuff\DirectML-master\Samples\yolov4\yolov4ResourceBuilder.cpp 524
Error C2102 '&' requires l-value yolov4 E:\MLStuff\DirectML-master\Samples\yolov4\yolov4ResourceBuilder.cpp 529
Error C2102 '&' requires l-value yolov4 E:\MLStuff\DirectML-master\Samples\yolov4\yolov4ResourceBuilder.cpp 531
Error C2102 '&' requires l-value yolov4 E:\MLStuff\DirectML-master\Samples\yolov4\yolov4ResourceBuilder.cpp 566
Error C2102 '&' requires l-value yolov4 E:\MLStuff\DirectML-master\Samples\yolov4\yolov4ResourceBuilder.cpp 582
Error C2102 '&' requires l-value yolov4 E:\MLStuff\DirectML-master\Samples\yolov4\yolov4ResourceBuilder.cpp 599
Error C2102 '&' requires l-value yolov4 E:\MLStuff\DirectML-master\Samples\yolov4\yolov4ResourceBuilder.cpp 640
Error C2102 '&' requires l-value yolov4 E:\MLStuff\DirectML-master\Samples\yolov4\yolov4ResourceBuilder.cpp 649
Error C2102 '&' requires l-value yolov4 E:\MLStuff\DirectML-master\Samples\yolov4\yolov4ResourceBuilder.cpp 656
Error C2102 '&' requires l-value yolov4 E:\MLStuff\DirectML-master\Samples\yolov4\yolov4ResourceBuilder.cpp 658
Error C2102 '&' requires l-value yolov4 E:\MLStuff\DirectML-master\Samples\yolov4\yolov4ResourceBuilder.cpp 664

DirectML v1.4 ARM64 version?

Would it be possible to have an ARM64 version of DirectML? I downloaded the NuGet package but it only includes x86/x64 versions.

My goal is to execute a PyTorch-exported ONNX model (yolov5) using the GPU. On x64 it runs fine with DirectML v1.4 but not with the older version included in Windows. That's why I need to use the newer NuGet version and redistribute it with my ARM64 application (HoloLens 2).

Thanks.

Operator GRU does not accept input buffer at index 2

Here is the error I got when binding input buffers for the GRU operator:

D3D12 ERROR: : the dispatchable object expects nothing to be bound at index 2, but a binding of type DML_BINDING_TYPE_BUFFER was provided. Use binding type NONE to bind nothing to this slot. [ UNKNOWN ERROR #1: STRING_FROM_APPLICATION]

According to the GRU description below, it needs at least three inputs: the data input, the weight, and the recurrence. The BindInputs function seems to accept a tensor at index 0 (data input) and index 1 (weight), but not at index 2 (recurrence). Am I misunderstanding something? Is there any sample code for GRU usage?

https://docs.microsoft.com/en-us/windows/win32/api/directml/ns-directml-dml_gru_operator_desc

Error "cannot open source file 'DirectML.h'" happened on Version 1903 and OS build 18343.1.

Steps to reproduce this issue:

Log in as a Windows Insider,
Download Windows 10 Insider Preview Client x64 en-us 18343 ISO from https://www.microsoft.com/en-us/software-download/windowsinsiderpreviewadvanced,
Transfer the ISO file to installation media,
Boot PC from the installation media,
Install the latest VS with the latest Windows SDK (10.0.17763.0),
Build this HelloDirectML sample project.

Please help look into this issue, thanks.

Inferencing yolov3-tiny doesn't detect anything

When I convert a yolov3 model and run inference with it, everything works great.
But then I downloaded https://pjreddie.com/media/files/yolov3-tiny.weights
and converted it using this command:
py convert.py -weights ./data/yolov3-tiny.weights -output ./checkpoints/yolov3-tiny.tf --tiny
It says the weights were saved and everything seems fine, but then I tried running inference using the default command:
py detect_video.py -weights ./checkpoints/yolov3-tiny.tf -video 0 --tiny
And it doesn't detect anything.
I even deleted everything and used the tiny setup.py to figure out whether something I did
went wrong.
But the same thing happened: no detections.
Have I done something wrong?

Can the values of DimensionCount in DML_BUFFER_TENSOR_DESC be less than 4?

According to the DML_BUFFER_TENSOR_DESC doc, the valid values are either 4 or 5.

In DirectML, all buffer tensors must have a DimensionCount of either 4 or 5.

However, in my testing, user code can set DimensionCount to less than 4. For example, the following PyDirectML code, which adds two tensors of shape [2, 2], works just fine.

import pydirectml as dml
import numpy as np

device = dml.Device()
builder = dml.GraphBuilder(device)
data_type = dml.TensorDataType.FLOAT32
flags = dml.TensorFlags.OWNED_BY_DML
input_bindings = []
a = dml.input_tensor(builder, 0, dml.TensorDesc(data_type, [2, 2]))
input_bindings.append(dml.Binding(a, np.ones([2, 2], dtype=np.float32)))
b = dml.input_tensor(builder, 1, dml.TensorDesc(data_type, flags, [2, 2]))
input_bindings.append(dml.Binding(b, np.ones([2, 2], dtype=np.float32)))
c = dml.add(a, b)
op = builder.build(dml.ExecutionFlags.NONE, [c])
output_data = device.compute(op, input_bindings, [c])
output_tensor = np.array(output_data[0], np.float32)
print(output_tensor)

The output is

[[2. 2.]
 [2. 2.]]

Internally, PyDirectML and DirectMLX.h set DimensionCount to 2. Other ranks also work, such as 1-D or 3-D.

Actually this is a very nice feature and would simplify the user code. I just want to know whether this feature is officially supported by DirectML. If it is, the doc probably needs to be updated.
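
For code that wants to stay within the documented requirement regardless of the answer, one option is to pad lower-rank shapes up to four dimensions before building the tensor description. The sketch below assumes the padding convention of right-aligning the sizes and filling the leading dimensions with 1; `pad_to_min_rank` is a hypothetical helper, not a PyDirectML function.

```python
def pad_to_min_rank(sizes, min_rank=4):
    """Left-pad a shape with 1s until it has at least min_rank dimensions.

    Shapes that already have min_rank or more dimensions are returned
    unchanged. Padding with 1s keeps the total element count the same.
    """
    sizes = list(sizes)
    if not sizes:
        raise ValueError("shape must have at least one dimension")
    missing = max(0, min_rank - len(sizes))
    return [1] * missing + sizes


print(pad_to_min_rank([2, 2]))  # prints [1, 1, 2, 2]
```

The padded list could then be passed wherever the example above passes `[2, 2]` to `dml.TensorDesc`.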

ImportError: libd3d12.so: cannot open shared object file: No such file or directory

I've been following these install docs but for no clear reason I get the import error in the title.

My GPU is a Radeon VII and I've installed the linked AMD WSL Preview drivers, so that shouldn't be an issue. My WSL environment is set up properly as well:

PS C:\Users\cass> wsl cat /proc/version
Linux version 4.19.104-microsoft-standard (oe-user@oe-host) (gcc version 8.2.0 (GCC)) #1 SMP Wed Feb 19 06:37:35 UTC 2020
PS C:\Users\cass> wsl --list --verbose
  NAME            STATE           VERSION
* Ubuntu-20.04    Running         2

Full stack trace

(base) cass@deskfox:~$ python3.7
Python 3.7.7 (default, May  7 2020, 21:25:33)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow.compat.v1 as tf
Traceback (most recent call last):
  File "/home/cass/miniconda3/lib/python3.7/site-packages/tensorflow_core/python/pywrap_tensorflow.py", line 58, in <module>
    from tensorflow.python.pywrap_tensorflow_internal import *
  File "/home/cass/miniconda3/lib/python3.7/site-packages/tensorflow_core/python/pywrap_tensorflow_internal.py", line 28, in <module>
    _pywrap_tensorflow_internal = swig_import_helper()
  File "/home/cass/miniconda3/lib/python3.7/site-packages/tensorflow_core/python/pywrap_tensorflow_internal.py", line 24, in swig_import_helper
    _mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
  File "/home/cass/miniconda3/lib/python3.7/imp.py", line 242, in load_module
    return load_dynamic(name, filename, file)
  File "/home/cass/miniconda3/lib/python3.7/imp.py", line 342, in load_dynamic
    return _load(spec)
ImportError: libd3d12.so: cannot open shared object file: No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/cass/miniconda3/lib/python3.7/site-packages/tensorflow/__init__.py", line 110, in <module>
    from tensorflow_core import *
  File "/home/cass/miniconda3/lib/python3.7/site-packages/tensorflow_core/__init__.py", line 28, in <module>
    from tensorflow.python import pywrap_tensorflow  # pylint: disable=unused-import
  File "/home/cass/miniconda3/lib/python3.7/site-packages/tensorflow/__init__.py", line 58, in __getattr__
    module = self._load()
  File "/home/cass/miniconda3/lib/python3.7/site-packages/tensorflow/__init__.py", line 52, in _load
    module = _importlib.import_module(self.__name__)
  File "/home/cass/miniconda3/lib/python3.7/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "/home/cass/miniconda3/lib/python3.7/site-packages/tensorflow_core/python/__init__.py", line 55, in <module>
    from tensorflow.python import pywrap_tensorflow
  File "/home/cass/miniconda3/lib/python3.7/site-packages/tensorflow_core/python/pywrap_tensorflow.py", line 74, in <module>
    raise ImportError(msg)
ImportError: Traceback (most recent call last):
  File "/home/cass/miniconda3/lib/python3.7/site-packages/tensorflow_core/python/pywrap_tensorflow.py", line 58, in <module>
    from tensorflow.python.pywrap_tensorflow_internal import *
  File "/home/cass/miniconda3/lib/python3.7/site-packages/tensorflow_core/python/pywrap_tensorflow_internal.py", line 28, in <module>
    _pywrap_tensorflow_internal = swig_import_helper()
  File "/home/cass/miniconda3/lib/python3.7/site-packages/tensorflow_core/python/pywrap_tensorflow_internal.py", line 24, in swig_import_helper
    _mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
  File "/home/cass/miniconda3/lib/python3.7/imp.py", line 242, in load_module
    return load_dynamic(name, filename, file)
  File "/home/cass/miniconda3/lib/python3.7/imp.py", line 342, in load_dynamic
    return _load(spec)
ImportError: libd3d12.so: cannot open shared object file: No such file or directory


Failed to load the native TensorFlow runtime.

See https://www.tensorflow.org/install/errors

for some common reasons and solutions.  Include the entire stack trace
above this error message when asking for help.
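
A quick way to narrow down this kind of import error is to check whether the GPU pieces that WSL 2 is expected to provide are visible inside the distribution. The sketch below assumes the usual locations on GPU-enabled WSL 2 builds, `/dev/dxg` for the paravirtualized device and `/usr/lib/wsl/lib` for the D3D12 libraries; these paths are assumptions about the environment, and `check_wsl_gpu` is a hypothetical helper.

```python
import os


def check_wsl_gpu(lib_dir="/usr/lib/wsl/lib", device="/dev/dxg"):
    """Report which GPU-related files are visible from inside WSL."""
    return {
        "dxg device": os.path.exists(device),
        "libd3d12.so": os.path.isfile(os.path.join(lib_dir, "libd3d12.so")),
        "libdxcore.so": os.path.isfile(os.path.join(lib_dir, "libdxcore.so")),
    }


for name, present in check_wsl_gpu().items():
    print(f"{name}: {'found' if present else 'MISSING'}")
```

If the device or the libraries are missing, that would point at the Windows build or the GPU driver rather than at the Python packages.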

[installation] Could not find a version that satisfies the requirement tensorflow-directml (from versions: none)

Hi,

After following the steps described in https://docs.microsoft.com/en-us/windows/win32/direct3d12/gpu-tensorflow-wsl till pip install tensorflow-directml,

the error appeared as

ERROR: Could not find a version that satisfies the requirement tensorflow-directml (from versions: none)
ERROR: No matching distribution found for tensorflow-directml

BTW, I am using python 3.8

I also ran python list tensorflow*, which output:

Package Version

certifi 2020.6.20
pip 20.1.1
setuptools 49.2.0.post20200714
wheel 0.34.2
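
A "from versions: none" result from pip usually means that no published wheel matches the running interpreter or platform. The sketch below assumes that the TensorFlow 1.15-based tensorflow-directml package only shipped wheels for 64-bit Python 3.5 to 3.7, which would exclude the Python 3.8 used here; that version range is an assumption to verify against the package page, and `python_is_supported` is a hypothetical helper.

```python
import sys

# Assumed set of (major, minor) versions with tensorflow-directml wheels.
SUPPORTED_VERSIONS = {(3, 5), (3, 6), (3, 7)}


def python_is_supported(version=None):
    """Return True if the given (or running) Python is in the assumed range."""
    version = sys.version_info if version is None else version
    return (version[0], version[1]) in SUPPORTED_VERSIONS


if not python_is_supported():
    print("This Python version is outside the assumed supported range; "
          "try an environment with Python 3.5 to 3.7.")
```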

Vega 8: DirectML device enumeration: found 0 compatible adapters.

I followed the same steps as https://docs.microsoft.com/zh-cn/windows/win32/direct3d12/gpu-tensorflow-wsl, and my GPU is a Vega 8.
But it reported:

Python 3.6.12 |Anaconda, Inc.| (default, Sep  8 2020, 23:10:56)
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow.compat.v1 as tf
>>> tf.enable_eager_execution(tf.ConfigProto(log_device_placement=True))
>>> print(tf.add([1.0, 2.0], [3.0, 4.0]))
2020-11-02 21:19:05.139698: I tensorflow/core/common_runtime/dml/dml_device_factory.cc:132] DirectML device enumeration: found 0 compatible adapters.
2020-11-02 21:19:05.141140: I tensorflow/core/common_runtime/eager/execute.cc:571] Executing op Add in device /job:localhost/replica:0/task:0/device:CPU:0
tf.Tensor([4. 6.], shape=(2,), dtype=float32)

Thank you in advance!
If there is any other information I should submit, please let me know!

A numpy/cupy project on top of DirectML

Hi, I'd like to know whether the team behind DirectML plans to release a numpy/cupy-like library that uses DirectX 12 for computing. This would open up a lot of possibilities, since many numerical libraries use numpy for computing and would gain an option to accelerate those operations. I can see pandas using this, and many researchers implementing tools on top of it.
I would also like information on how to use the low-level math ops from DirectML.

WSL support for RTX TITAN

Hi,

While testing DirectML with a recent WSL 2 (Insider build from 2 days ago), I am getting: ImportError: libd3d12.so: cannot open shared object file: No such file or directory

I suppose this comes from a missing update to NVIDIA's drivers (GeForce and Quadro drivers are available online, but there is no driver for the RTX TITAN in my case). -> https://developer.nvidia.com/cuda/wsl/download

Can you confirm?

Regards

"found 0 compatible adapters" after following setup instructions

I did my best to follow the instructions at https://docs.microsoft.com/en-us/windows/win32/direct3d12/gpu-tensorflow-wsl. I already had a WSL 1 installation of Ubuntu, so in order, I:

Installed the NVIDIA preview drivers
Set my WSL to default to WSL 2
Reset my Ubuntu installation
Installed the nvidia-cuda-toolkit package

Successfully ran the deviceQuery sample to get this output:

./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce RTX 2070 SUPER"
  CUDA Driver Version / Runtime Version          11.1 / 10.1
  CUDA Capability Major/Minor version number:    7.5
  Total amount of global memory:                 8192 MBytes (8589934592 bytes)
  (40) Multiprocessors, ( 64) CUDA Cores/MP:     2560 CUDA Cores
  GPU Max Clock rate:                            1815 MHz (1.81 GHz)
  Memory Clock rate:                             7001 Mhz
  Memory Bus Width:                              256-bit
  L2 Cache Size:                                 4194304 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        65536 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1024
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 6 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.1, CUDA Runtime Version = 10.1, NumDevs = 1
Result = PASS

However, now if I try to verify that TF/DirectML is picking up my GPU, I get:

Python 3.6.10 |Anaconda, Inc.| (default, May  8 2020, 02:54:21) 
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow.compat.v1 as tf
>>> tf.enable_eager_execution9tf.ConfigProto(log_device_placement=True))
  File "<stdin>", line 1
    tf.enable_eager_execution9tf.ConfigProto(log_device_placement=True))
                                                                       ^
SyntaxError: invalid syntax
>>> tf.enable_eager_execution(tf.ConfigProto(log_device_placement=True))
>>> print(tf.add([1.0, 2.0], [3.0, 4.0])
... )
2020-06-28 02:54:53.570206: I tensorflow/core/common_runtime/dml/dml_device_factory.cc:45] DirectML device enumeration: found 0 compatible adapters.
2020-06-28 02:54:53.572666: I tensorflow/core/common_runtime/eager/execute.cc:571] Executing op Add in device /job:localhost/replica:0/task:0/device:CPU:0
tf.Tensor([4. 6.], shape=(2,), dtype=float32)

I don't know if I should have done things in a different order, given that I already had WSL up and running. But if I can do things differently, I'd love to know. So far, it looks like only the most preliminary of guides are up.
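For anyone scripting this check, the adapter count can be pulled out of the TensorFlow log line shown above. This is a minimal sketch that assumes the message text stays exactly as it appears in this log; the helper name `count_dml_adapters` is made up for illustration and is not part of tensorflow-directml.

```python
import re

# Matches the enumeration message printed by dml_device_factory.cc in the log above.
ADAPTER_LINE = re.compile(
    r"DirectML device enumeration: found (\d+) compatible adapters"
)


def count_dml_adapters(log_text):
    """Return the adapter count reported in the log, or None if the line is absent."""
    match = ADAPTER_LINE.search(log_text)
    return int(match.group(1)) if match else None


# The exact line from the session above.
sample = (
    "2020-06-28 02:54:53.570206: I tensorflow/core/common_runtime/dml/"
    "dml_device_factory.cc:45] DirectML device enumeration: "
    "found 0 compatible adapters."
)
adapters = count_dml_adapters(sample)
if adapters == 0:
    print("DirectML found no compatible adapters; ops will fall back to the CPU.")
```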

HelloDirectML crash on DELL XPS 13 9380 Windows 10.0.18363

Hello,

We are seeing a DirectML.dll fault on some laptops when running the HelloDirectML sample from GitHub. Other DirectML-based applications hit the same issue.
Below is the error event log showing the DirectML.dll fault (Event Viewer -> Windows Logs -> Application):

Faulting application name: HelloDirectML.exe, version: 0.0.0.0, time stamp: 0x5f7cefd4
Faulting module name: DirectML.dll, version: 10.0.18362.997, time stamp: 0x7fc3de11
Exception code: 0xc0000409
Fault offset: 0x00000000000a28d1
Faulting process id: 0x4238
Faulting application start time: 0x01d69c315d6e5eb0
Faulting application path: C:\Users\dasguk\Downloads\HelloDirectML.exe
Faulting module path: C:\WINDOWS\SYSTEM32\DirectML.dll
Report Id: 95947e12-fe63-4740-b6eb-a273e71e8974
Faulting package full name:
Faulting package-relative application ID:

Some system information if this helps:

Laptop: Dell XPS 13 9380
OS: Microsoft Windows 10 Enterprise 10.0.18363
CPU: Intel(R) Core(TM) i7-8665U CPU @ 1.90GHz
GPU: Intel(R) UHD Graphics 620 DriverVersion: 27.20.100.8476

We checked the system requirements and the laptop meets them. Could you please advise what the possible cause is?

Thanks.

DirectML vs CrossFire & GPU Workload

I have a computer with 2x AMD 570 GPUs and a CrossFire motherboard.
For maximum performance, should I:

- activate or deactivate CrossFire?
- set GPU Workload to Graphics or Compute (AMD driver settings)?

OOM at 48GB GPU mem on GPT-2 inference due to memory leak in DirectML.

First of all, thank you for all your work. It's very exciting to see the Windows/AMD ML gap being closed!
The issue is:

git clone https://github.com/openai/gpt-2.git
cd gpt-2
python -m pip install -r requirements.txt
python3 download_model.py 1558M
python src/interactive_conditional_samples.py 1558M

Given any text for inference, it repeatedly consumes all 48GB of GPU memory (AMD Radeon VII 16GB + 32GB shared memory) and fails with:

  (0) Resource exhausted: OOM when allocating tensor with shape[1,48,2,25,455,64] and type float on /job:localhost/replica:0/task:0/device:DML:0 by allocator DmlAllocator
         [[node sample_sequence/while/concat (defined at F:\DSML\Soft\Anaconda\envs\directml\lib\site-packages\tensorflow_core\python\framework\ops.py:1762) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

Full log and environment are in the comments below. This is probably related to the resource-release issue seen with

from ai_benchmark import AIBenchmark
benchmark = AIBenchmark(use_CPU=None, verbose_level=1)
results = benchmark.run()

which also fails with OOM during execution. Memory for tensors is not released after runs, or even after sess.close().
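As a sanity check on the OOM message above, the size of the tensor that failed to allocate can be worked out from its shape. This is a rough sketch that assumes TensorFlow's `float` means 32-bit float (4 bytes per element) and ignores any allocator padding.

```python
import math

# Shape reported in the "Resource exhausted" message above.
shape = [1, 48, 2, 25, 455, 64]
BYTES_PER_FLOAT32 = 4  # assumption: "type float" is 32-bit

elements = math.prod(shape)
tensor_bytes = elements * BYTES_PER_FLOAT32
tensor_mib = tensor_bytes / 2**20

print(f"{elements} elements, {tensor_bytes} bytes, about {tensor_mib:.1f} MiB")
```

The failing tensor is on the order of a few hundred MiB, far below 48GB, which is consistent with (though it does not prove) memory being exhausted by allocations that accumulate across steps rather than by one oversized tensor.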

Dot product operator

Hi!

I have a question. Is there any reason why a dot product does not exist among the DirectML operators? I want to do calculations like numpy.dot(). I'm sorry if I have misunderstood something.

Thanks.
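For reference, a dot product can be expressed with operations that DirectML does provide: an element-wise multiply followed by a sum reduction, or a matrix multiply (DML_GEMM_OPERATOR_DESC) of a 1xN row by an Nx1 column. The sketch below shows only the math in plain Python; it does not call the DirectML API, and the function names are made up for illustration.

```python
def dot_as_multiply_reduce(a, b):
    """Dot product as element-wise multiply followed by a sum reduction."""
    products = [x * y for x, y in zip(a, b)]
    return sum(products)


def matmul(a, b):
    """Naive matrix multiply of two lists-of-rows (the operation GEMM performs)."""
    inner = len(b)
    cols = len(b[0])
    return [
        [sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for row in a
    ]


def dot_as_gemm(a, b):
    """Dot product as a (1 x N) by (N x 1) matrix multiply."""
    row = [list(a)]              # shape 1 x N
    column = [[y] for y in b]    # shape N x 1
    return matmul(row, column)[0][0]


a = [1.0, 2.0, 3.0]
b = [4.0, 5.0, 6.0]
via_reduce = dot_as_multiply_reduce(a, b)
via_gemm = dot_as_gemm(a, b)
```

For 2-D inputs, numpy.dot() is itself a matrix multiply, so the GEMM form covers that case directly.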

Propose to make the device wrapper of PyDirectML for C++ sample usage

The device wrapper in the Python samples is quite a good helper for DirectML device and resource management. I propose making it usable from the C++ samples as well, e.g. HelloDirectML. That would simplify the C++ samples and improve code reuse.

Any thoughts?

Does DirectMLX support transposing a tensor?

Transpose is used to permute the axes of a tensor, e.g. the Transpose operator of ONNX.

According to the DirectML documentation, this operator can probably be implemented with DML_ELEMENT_WISE_IDENTITY_OPERATOR_DESC.

However, it is unclear to me how to transpose a tensor with DirectMLX.
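The idea behind the identity approach can be sketched independently of DirectMLX. DirectML buffer tensors are described by sizes and strides, so an identity copy that reads its input through permuted strides and writes a packed output produces the transposed tensor. The plain-Python emulation below only illustrates that indexing; the helper names are hypothetical and none of them are DirectMLX calls.

```python
from itertools import product


def packed_strides(sizes):
    """Row-major (packed) strides, in elements, for the given sizes."""
    strides = [1] * len(sizes)
    for i in range(len(sizes) - 2, -1, -1):
        strides[i] = strides[i + 1] * sizes[i + 1]
    return strides


def transpose_view(sizes, perm):
    """Sizes and strides that make the same buffer read as if its axes were permuted."""
    strides = packed_strides(sizes)
    return [sizes[p] for p in perm], [strides[p] for p in perm]


def identity_copy(buffer, sizes, strides):
    """Emulate an element-wise identity: read through strides, write packed output."""
    out = []
    for index in product(*(range(n) for n in sizes)):
        offset = sum(i * s for i, s in zip(index, strides))
        out.append(buffer[offset])
    return out


data = [1, 2, 3, 4, 5, 6]  # a 2x3 tensor: [[1, 2, 3], [4, 5, 6]]
t_sizes, t_strides = transpose_view([2, 3], perm=[1, 0])
transposed = identity_copy(data, t_sizes, t_strides)  # 3x2, stored packed
```

Here the transposed view has sizes [3, 2] and strides [1, 3] over the original buffer, and the copy materializes it in packed order.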

Intel HD 4600 graphics cannot run tensorflow-directml programs, but HD 530 works fine.

On the HD 4600, the console shows:
WARNING:tensorflow:
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:

https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
https://github.com/tensorflow/addons
https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

WARNING:tensorflow:From D:\work\EkNiCuMine\train\tc7.py:50: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

WARNING:tensorflow:From D:\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow_core\contrib\layers\python\layers\layers.py:1057: Layer.apply (from tensorflow.python.keras.engine.base_layer) is deprecated and will be removed in a future version.
Instructions for updating:
Please use layer.__call__ method instead.
WARNING:tensorflow:From D:\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow_core\contrib\layers\python\layers\layers.py:1066: batch_normalization (from tensorflow.python.layers.normalization) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.BatchNormalization instead. In particular, tf.control_dependencies(tf.GraphKeys.UPDATE_OPS) should not be used (consult the tf.keras.layers.batch_normalization documentation).
WARNING:tensorflow:From D:\work\EkNiCuMine\train\tc7.py:218: The name tf.losses.softmax_cross_entropy is deprecated. Please use tf.compat.v1.losses.softmax_cross_entropy instead.

WARNING:tensorflow:From D:\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow_core\python\ops\losses\losses_impl.py:121: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
WARNING:tensorflow:From D:\work\EkNiCuMine\train\tc7.py:219: The name tf.losses.get_total_loss is deprecated. Please use tf.compat.v1.losses.get_total_loss instead.

WARNING:tensorflow:From D:\work\EkNiCuMine\train\tc7.py:220: The name tf.train.AdamOptimizer is deprecated. Please use tf.compat.v1.train.AdamOptimizer instead.

WARNING:tensorflow:From D:\work\EkNiCuMine\train\tc7.py:223: The name tf.get_collection is deprecated. Please use tf.compat.v1.get_collection instead.

WARNING:tensorflow:From D:\work\EkNiCuMine\train\tc7.py:223: The name tf.GraphKeys is deprecated. Please use tf.compat.v1.GraphKeys instead.

WARNING:tensorflow:From D:\work\EkNiCuMine\train\tc7.py:234: The name tf.truncated_normal is deprecated. Please use tf.random.truncated_normal instead.

WARNING:tensorflow:From D:\work\EkNiCuMine\train\tc7.py:238: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.
Instructions for updating:
Please use rate instead of keep_prob. Rate should be set to rate = 1 - keep_prob.
WARNING:tensorflow:From D:\work\EkNiCuMine\train\tc7.py:248: The name tf.nn.xw_plus_b is deprecated. Please use tf.compat.v1.nn.xw_plus_b instead.

WARNING:tensorflow:From D:\work\EkNiCuMine\train\tc7.py:254: The name tf.losses.mean_squared_error is deprecated. Please use tf.compat.v1.losses.mean_squared_error instead.

WARNING:tensorflow:From D:\work\EkNiCuMine\train\tc7.py:258: The name tf.trainable_variables is deprecated. Please use tf.compat.v1.trainable_variables instead.

WARNING:tensorflow:From D:\work\EkNiCuMine\train\tc7.py:267: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

2020-07-15 21:43:32.488885: I tensorflow/core/common_runtime/dml/dml_device_factory.cc:45] DirectML device enumeration: found 1 compatible adapters.
2020-07-15 21:43:32.501242: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2020-07-15 21:43:32.503347: I tensorflow/core/common_runtime/dml/dml_device_factory.cc:32] DirectML: creating device on adapter 0 (Intel(R) HD Graphics 4600)
2020-07-15 21:43:32.621172: I tensorflow/stream_executor/platform/default/dso_loader.cc:60] Successfully opened dynamic library DirectMLba106a7c621ea741d2159d8708ee581c11918380.dll
WARNING:tensorflow:From D:\work\EkNiCuMine\train\tc7.py:269: The name tf.global_variables_initializer is deprecated. Please use tf.compat.v1.global_variables_initializer instead.

WARNING:tensorflow:From D:\work\EkNiCuMine\train\tc7.py:271: The name tf.train.Saver is deprecated. Please use tf.compat.v1.train.Saver instead.

2020-07-15 21:44:13.976556: F tensorflow/core/common_runtime/dml/dml_command_recorder.cc:372] Check failed: (((HRESULT)((dml_device_->GetDeviceRemovedReason()))) >= 0) == true (0 vs. 1)
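The fatal line is easy to miss among the deprecation warnings. As a small aid, the sketch below filters a captured log for TensorFlow C++ log lines with severity `F`. It assumes the `<timestamp>: <severity letter> <file>:<line>] <message>` layout seen in this log; `fatal_lines` is a made-up helper name.

```python
import re

# Layout assumed from the log above: "<date> <time>: <I|W|E|F> <file>:<line>] <message>"
LOG_LINE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} [\d:.]+): ([IWEF]) (\S+?):(\d+)\] (.*)$"
)


def fatal_lines(log_text):
    """Return (file, line, message) for every log line with severity F (fatal)."""
    found = []
    for line in log_text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match and match.group(2) == "F":
            found.append((match.group(3), int(match.group(4)), match.group(5)))
    return found


# Two lines copied from the HD 4600 log above.
log = "\n".join([
    "2020-07-15 21:43:32.488885: I tensorflow/core/common_runtime/dml/"
    "dml_device_factory.cc:45] DirectML device enumeration: found 1 compatible adapters.",
    "2020-07-15 21:44:13.976556: F tensorflow/core/common_runtime/dml/"
    "dml_command_recorder.cc:372] Check failed: (((HRESULT)((dml_device_->"
    "GetDeviceRemovedReason()))) >= 0) == true (0 vs. 1)",
])
fatals = fatal_lines(log)
```

In this log the only fatal entry is the failed check on `GetDeviceRemovedReason()`, which indicates that the Direct3D 12 device was removed while commands were being recorded.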

On the HD 530, the console shows:
WARNING:tensorflow:
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:

https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
https://github.com/tensorflow/addons
https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.