openxla / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.

Home Page: http://iree.dev/

License: Apache License 2.0

C++ 38.36% C 20.59% MLIR 26.16% CMake 4.64% Python 5.68% Shell 0.90% Starlark 2.39% Batchfile 0.01% PowerShell 0.04% NASL 0.03% Java 0.11% Dockerfile 0.08% Assembly 0.02% GLSL 0.01% HTML 0.15% JavaScript 0.13% Metal 0.02% Objective-C 0.69% WGSL 0.01% Pawn 0.01%
mlir vulkan tensorflow spirv cuda jax pytorch

iree's Introduction

XLA

XLA (Accelerated Linear Algebra) is an open-source machine learning (ML) compiler for GPUs, CPUs, and ML accelerators.

The XLA compiler takes models from popular ML frameworks such as PyTorch, TensorFlow, and JAX, and optimizes them for high-performance execution across different hardware platforms including GPUs, CPUs, and ML accelerators.

Get started

If you want to use XLA to compile your ML project, refer to the corresponding documentation for your ML framework.

If you're not contributing code to the XLA compiler, you don't need to clone and build this repo. Everything here is intended for XLA contributors who want to develop the compiler and XLA integrators who want to debug or add support for ML frontends and hardware backends.

Contribute

If you'd like to contribute to XLA, review How to Contribute and then see the developer guide.

Contacts

  • For questions, contact the maintainers - maintainers at openxla.org

Resources

Code of Conduct

While under TensorFlow governance, all community spaces for SIG OpenXLA are subject to the TensorFlow Code of Conduct.

iree's People

Contributors

antiagainst, asaadaldien, benvanik, bjacob, d0k, dcaballe, gmngeoffrey, hanhanw, hcindyl, iree-copybara-bot, iree-github-actions-bot, jpienaar, kooljblack, kuhar, maheshravishankar, marbre, mariecwhite, matthias-springer, natashaknk, nicolasvasilache, not-jenni, okkwon, phoenix-meadowlark, pzread, qedawkins, rsuderman, scotttodd, silvasean, stellaraccident, thomasraoux

iree's Issues

Vectorize HLO dialect dispatch regions

To enable better SPIR-V and SIMD codegen we should vectorize dispatch regions. There's a vectorization transform in MLIR; however, it's specific to the affine dialect and not easily usable for our purposes. Something much simpler would do, especially considering our starting point of variable-width vector HLO ops.

Input:

  %c = xla_hlo.mul %a, %b : tensor<1024xf32>

Output:

  { iree.workload_divisor = {4, 4, 4} }
  %c = xla_hlo.mul %a, %b : tensor<4xf32>

We could do this as part of lowering to SPIR-V or SIMD directly; however, both would benefit, and having a common path through the index map would be nice if we need to adjust it to take the divisors into account.

Implement fiber scheduling in the runtime for invocations

All invocations currently run synchronously. To allow for custom imported functions that may perform sync/async work, overlapped scheduling of invocations, and sequencer-level cellular batching, we'll want to plug a simple user-mode fiber scheduler into the Instance such that invocations across contexts can be executed as they become available.

We can either allow users to donate their calling thread to perform fiber scheduling or spawn a dedicated scheduler thread. Until deferred calls are implemented there should not be much functional change.
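
As a rough illustration of the thread-donation option, a minimal user-mode scheduler might look like the sketch below; the names and the cooperative step-function shape are hypothetical, not IREE's actual runtime API.

  // Sketch only: a user-mode scheduler interleaving suspended invocations.
  // All names here are illustrative, not IREE's actual runtime API.
  #include <deque>
  #include <functional>
  #include <mutex>

  class FiberScheduler {
   public:
    // Invocations enqueue resumable work units instead of blocking a thread.
    void Enqueue(std::function<bool()> step) {
      std::lock_guard<std::mutex> lock(mu_);
      ready_.push_back(std::move(step));
    }

    // Option 1: the caller donates its thread and pumps until idle.
    // Option 2 (not shown): a dedicated scheduler thread runs this same loop.
    void RunUntilIdle() {
      for (;;) {
        std::function<bool()> step;
        {
          std::lock_guard<std::mutex> lock(mu_);
          if (ready_.empty()) return;
          step = std::move(ready_.front());
          ready_.pop_front();
        }
        // A step returning false means "not finished"; requeue it so other
        // invocations (possibly from other contexts) get a chance to run.
        if (!step()) Enqueue(std::move(step));
      }
    }

   private:
    std::mutex mu_;
    std::deque<std::function<bool()>> ready_;
  };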

Automatically build a docker image on .gitmodules change with all submodules

Right now we have to get llvm9 and all of the submodules before we can run any action on the CI. It'd be cool to set up a push action that, if .gitmodules is changed, rebuilds a base Ubuntu Docker image with our deps and the bigger submodules (llvm, tensorflow) already checked out. This way, incremental non-submodule changes should be much faster.

Add #line macros to generated sequencer code

#line will let us embed the original source locations into the generated files, meaning that we'll get stack traces into the original frontend language. We could have an option for the generator to switch what locations get used (source location, line in HLO MLIR dump, etc).
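
A minimal example of the idea (the model.py path and line numbers are purely illustrative): a generated C++ sequencer file would interleave #line directives so diagnostics and debugger stacks point at the frontend source instead of the generated file.

  #include <cstdio>

  // Illustrative generated sequencer function: the #line directives make the
  // compiler attribute debug locations and diagnostics to the frontend file.
  void run_generated_sequence() {
  #line 42 "model.py"
    std::printf("dispatch: matmul\n");    // reported as model.py:42
  #line 43 "model.py"
    std::printf("dispatch: bias_add\n");  // reported as model.py:43
  }

  int main() { run_generated_sequence(); }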

Refactor sequencer IR to match v2 design

The current sequencer IR was a stopgap to get the end-to-end flow working. The real sequencer needs the concepts of command buffers, synchronization, and more flexible buffer handling. Half of this work is defining the IR in MLIR and adapting the existing conversion/translation code to work with it, while the other half is refactoring the sequencer VM to do the proper dispatching. Once this is done we'll have a good foundation for performing a variety of optimizations in the compiler and beginning code generation work.

Synthesize indirect dispatch workload calculations for data-dependent shapes

When we aren't able to fully parameterize shapes based on input shapes (such as when a shape is sliced out of an arbitrary tensor), we'll need to synthesize a dispatch that computes the workload for indirect dispatch. We'll want to add bounds checking, compatibility checks, and some way of communicating data-loss errors (such as when the computed workload is not possible to perform), though we could possibly work around this by enqueuing multiple indirect dispatches and using as many as we need to satisfy the request.

CMake build issues on Linux.

Hello everyone, I was trying to build IREE with CMake on Linux and here are some minor issues:

  1. The SRCS path to file_io and file_mapping libs
    https://github.com/google/iree/blob/master/iree/base/CMakeLists.txt#L64
    Should the path be "internal/file_io_posix.cc" instead of "file_io_posix.cc", and so on?
  2. executable_cache needs iree::base::wait_handle which is commented out in https://github.com/google/iree/blob/master/iree/base/CMakeLists.txt#L349
  3. TensorFlow path, should it be changed from
    build_tools/third_party/tensorflow
    to
    third_party/tensorflow?

By the way, as far as I understood you have your own internal build system; do you have any plans to support a stable build under Linux?
Thanks.

Implement a CPU execution thread pool

This can be used for scheduling the SIMD JIT or codegen.

The high-level design is an iree/task/ task management system that is specialized for our workload (out-of-order dynamic DAGs with late binding of parameters like dispatch counts). A custom HAL command buffer will produce the DAG fragments consisting of tasks and then submit them to task queues that fan out to a thread pool. By default cpuinfo will be used to create a thread pool that attempts to map one thread to each L2 cache, though this could be overridden (a rough sketch of that sizing policy follows the component list below).

Major components:

  • Synchronization primitives to allow inter-thread communications (#3294)
  • Cross-platform thread library (#3294)
  • cpuinfo dependency for querying cache sizes and instruction sets available (#3292)
  • Pluggable task executor and basic thread pool implementation for executing queues of tasks
  • HAL command buffer/queue integration using the task executor
  • Plumbing of workgroup xyz/count/etc into dispatched executables
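
A rough sketch of that default sizing policy, assuming the cpuinfo dependency from #3292; the helper name and fallback behavior are assumptions rather than the eventual iree/task/ API.

  // Sketch: default to one worker thread per L2 cache, falling back to the
  // logical processor count; an explicit flag/option would override this.
  #include <cpuinfo.h>

  #include <cstdint>
  #include <thread>

  uint32_t ChooseDefaultWorkerCount() {
    if (!cpuinfo_initialize()) {
      // cpuinfo unavailable; fall back to what the standard library reports.
      return std::thread::hardware_concurrency();
    }
    uint32_t l2_count = cpuinfo_get_l2_caches_count();
    return l2_count > 0 ? l2_count : cpuinfo_get_processors_count();
  }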

Add RenderDoc API integration in the Vulkan backend

RenderDoc provides a rich API with which we can trigger captures and demarcate "frames" (probably just top-level invocations). We can attach comments to frames that show in the UI, which would make it easier to map back to the original invocations. RenderDoc can also capture stack traces on calls into the Vulkan API entry points, which would be useful when using the sequencer codegen (otherwise we'd just see VM stacks).
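
Roughly, the in-application capture flow might look like the sketch below, assuming the renderdoc_app.h header from the RenderDoc SDK and a Linux-style dlopen lookup; the IREE hook points are placeholders.

  // Sketch: bracket one top-level invocation with a RenderDoc capture.
  // Assumes renderdoc_app.h from the RenderDoc SDK; error handling elided.
  #include <dlfcn.h>

  #include "renderdoc_app.h"

  static RENDERDOC_API_1_1_2* rdoc_api = nullptr;

  void MaybeLoadRenderDoc() {
    // RTLD_NOLOAD: only succeeds if RenderDoc already injected its library
    // (i.e. the process was launched under RenderDoc).
    if (void* mod = dlopen("librenderdoc.so", RTLD_NOW | RTLD_NOLOAD)) {
      auto get_api = (pRENDERDOC_GetAPI)dlsym(mod, "RENDERDOC_GetAPI");
      if (get_api) get_api(eRENDERDOC_API_Version_1_1_2, (void**)&rdoc_api);
    }
  }

  void RunInvocationWithCapture() {
    if (rdoc_api) rdoc_api->StartFrameCapture(nullptr, nullptr);
    // ... record and submit the Vulkan command buffers for this invocation ...
    if (rdoc_api) rdoc_api->EndFrameCapture(nullptr, nullptr);
    // Newer API revisions also support attaching comments to the capture
    // file, which is how we'd map a capture back to its originating invocation.
  }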

iree/base/wait_handle.cc relies on unix specific headers/features

Does not build on Windows.

ERROR: C:/src/ireepub/iree/iree/base/BUILD:348:1: Couldn't build file iree/base/_objs/wait_handle_test/wait_handle_test.obj: C++ compilation of rule '//iree/base:wait_handle_test' failed (Exit 1)
iree/base/wait_handle_test.cc(17,10): fatal error: 'unistd.h' file not found
#include <unistd.h>
^~~~~~~~~~
1 error generated.
ERROR: C:/src/ireepub/iree/iree/base/BUILD:330:1: Couldn't build file iree/base/_objs/wait_handle/wait_handle.obj: C++ compilation of rule '//iree/base:wait_handle' failed (Exit 1)
iree/base/wait_handle.cc(19,10): fatal error: 'poll.h' file not found
#include <poll.h>
^~~~~~~~
1 error generated.

Implement runtime shape table dereferencing

We'll need to serialize parametric shape tables built at compile time (#36) and dereference them at runtime. For values that are fully known during sequencer execution, we can evaluate inline and effectively have static shapes for all ops using the given shapes.

Add profiling support to the HAL API for use in benchmarks

We'll want a way to expose various metrics from the HAL implementations in a way that avoids excessive sequencer work. This could be accomplished by a begin/end profiling API and a resulting profile that contains cumulative, sampled, or averaged counter values per-backend. On Vulkan this may mean some vendor-specific performance counters in addition to timestamps inserted into command buffers.
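
One possible shape for such an API is sketched below; the type and method names are hypothetical, not a committed HAL interface.

  // Illustrative begin/end profiling interface for a HAL device. Backends
  // (e.g. Vulkan) would insert timestamps or vendor counters into command
  // buffers and report aggregated values when profiling ends.
  #include <cstdint>
  #include <map>
  #include <string>

  struct ProfileCounters {
    // Cumulative, sampled, or averaged counter values keyed by backend-
    // specific names, e.g. "vulkan.dispatch_time_ns".
    std::map<std::string, int64_t> values;
  };

  class ProfilingDevice {
   public:
    virtual ~ProfilingDevice() = default;
    // Benchmarks bracket the region of interest so no per-op sequencer work
    // is required while profiling is active.
    virtual void BeginProfiling() = 0;
    virtual ProfileCounters EndProfiling() = 0;
  };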

Refresh Python API

There are the beginnings of Python bindings for the compiler and VM in the upstream repo. We need to integrate them into the OSS repo and work out dependencies.

As a follow-on, the Python API should align with the upcoming C API.

vkCreateDevice: pCreateInfo->pQueueCreateInfos[0].queueCount (=2) is not less than or equal to available queue count

Full message:

vkCreateDevice: pCreateInfo->pQueueCreateInfos[0].queueCount (=2) is not less than or equal to 
available queue count for this pCreateInfo->pQueueCreateInfos[0].queueFamilyIndex} (=0) obtained 
previously from vkGetPhysicalDeviceQueueFamilyProperties (i.e. is not less than or equal to 1). The 
Vulkan spec states: queueCount must be less than or equal to the queueCount member of the 
VkQueueFamilyProperties structure, as returned by vkGetPhysicalDeviceQueueFamilyProperties in 
the pQueueFamilyProperties[queueFamilyIndex]
(https://www.khronos.org/registry/vulkan/specs/1.1-extensions/html/vkspec.html#VUID-VkDeviceQueueCreateInfo-queueCount-00382)

Platform info:

  • Windows 10
  • AMD Radeon Vega GPU
  • Manual patch to select the second GPU device

Command line:
bazel run --config=windows iree/tools/iree-run-mlir -- $(pwd)/iree/samples/hal/simple_compute_test.mlir --input_values="4xf32=1.0 2.0 3.0 4.0\n4xf32=2.0 4.0 6.0 8.0" --iree_logtostderr --target_backends=vulkan --print_mlir=false 2>&1 | tee ~/vklog.txt

Identify function call trees worth deferring for cellular batching

Possibly just op- and shape-based to start, and only around the core expensive dispatches.

Input:

  iree_hl_seq.dispatch[...](...) {
    // giant matmul
  }

Deferred:

  iree_hl_seq.deferred_call @outlined_dispatch(...);
  ...
  func @outlined_dispatch(...) {
    iree_hl_seq.dispatch[...](...) {
      // giant matmul
    }
  }

Prototype dynamic shape inference lowering to HLO/IREE

This may be done for us as part of MLIR's tf2xla effort; however, we need to ensure that what is produced lines up with what we need for symbolic shape calculation. Specifically, we want to be able to know which calculations are part of shape inference vs. general arithmetic, and which values are shapes.

Add target configuration to compiler for specifying assumed/required features

Right now we are hardcoding which backends we target and assuming the same parameters for all of them. To allow benchmarking of changes that may require specific target capabilities (Vulkan/SPIR-V extensions, device limits, etc.), we should be able to specify these as flags to the compiler and produce a variety of executables. At runtime we should then match against those to select the executable best suited to the given runtime configuration, and allow overrides to make it easier to compare.

Add Vulkan support to the native debugger app

This will let us avoid the additional dependency on GL and simplify the cross-platform porting of the native app. Since we are using SDL this shouldn't be too bad as I believe it is a toggle. Note that we'll want to keep GL support working for the web.

Loosen static shapes on deferred calls to allow cellular batching

After identifying good targets for cellular batching (#41) we'll need to ensure we can actually batch. Though coalescing is possible even if batching is not, and often still provides throughput benefits, the real wins come from increasing the arithmetic density of the GEMVs. We should be able to detect which shape dimensions we can make partial for a given deferred call body and do so.

TensorFlow colab integration

We should add plumbing to launch a Colab kernel that pulls TensorFlow and the IREE Python API together. This will make it easy to interactively exercise the system, author models, etc.

The dependencies are fairly intricate to get right, which should be the main work that needs to be done. Once the build support is in place, we should publish a Docker image so anyone can experiment easily.

Add deferred calls to the sequencer IR and runtime

A deferred call allows for explicit compile-time indication of which parts of the sequencer execution graph are optimal for coalescing and possible batching. To start we can use heuristics to identify candidates (large conv/matmul/etc) while in the future we can add cost analysis and profile-guided annotation. The runtime can trigger fiber yielding and manage the policy used to flush pending deferred calls.

Dynamic shapes will be required to effectively perform batching, however coalescing should be possible even with fully static shapes. Ideally we would be able to loosen static shaping of call trees to allow batching even when the input HLO is fully shaped by either inserting dynamic dimensions or making outer dimensions dynamic when it would cause no observable changes.

Add simple SIMD JIT abstraction

A visitor-pattern shim that takes care of parsing the bytecode, assigning registers, and calling emission stubs for the generic SIMD dialect ops would let us more easily plug in machine code generators for NEON/AVX/etc. It does not have to be particularly sophisticated given our tiny op coverage (no need for full xbyak-like functionality, for example).
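
A minimal sketch of what such a shim's emitter interface could look like; the op set and names are illustrative, chosen to match the tiny coverage described above rather than any existing IREE interface.

  // Sketch: the shared walker parses bytecode and assigns registers, then
  // calls these per-ISA emission stubs; NEON/AVX/etc. implement only the stubs.
  #include <cstdint>
  #include <vector>

  struct Reg { int index; };  // register chosen by the shared walker

  class SimdEmitter {
   public:
    virtual ~SimdEmitter() = default;
    virtual void EmitLoad(Reg dst, uint32_t buffer_ordinal) = 0;
    virtual void EmitStore(uint32_t buffer_ordinal, Reg src) = 0;
    virtual void EmitAdd(Reg dst, Reg lhs, Reg rhs) = 0;
    virtual void EmitMul(Reg dst, Reg lhs, Reg rhs) = 0;
  };

  // Generic driver: decodes the bytecode once and fans out to the emitter,
  // so adding a new ISA means implementing only the handful of stubs above.
  void JitDispatchRegion(const std::vector<uint8_t>& bytecode, SimdEmitter& emitter);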

Bazel build support

Need to migrate BUILD files from upstream and make compatible with OSS release.

This partially works now, with iree/base and iree/hal/vulkan building. Will iterate on the rest.

Building with bazel 1.0

There is an incompatibility with bazel 1.0:
ERROR: C:/users/stella/_bazel_stella/452ktct5/external/glslang/BUILD.bazel:57:15: in nocopts attribute of cc_library rule @glslang//:glslang: This attribute was removed. See bazelbuild/bazel#8706 for details.

Lower VM dialect to LLVM IR

For scenarios where dynamic module loading is not required and entire modules can be compiled into applications, we can lower the VM IR to LLVM IR within MLIR's transformation pipeline. Instead of embedding vm.call ops that are dispatched at runtime to things like the HAL, we can lower to llvm::CallInst calls through runtime-resolved function pointers. This still enables all of the flexibility of heterogeneous/runtime-determined devices, pluggable diagnostics, and backend composition without any need for FlatBuffers or the VM bytecode interpreter.

The VM was designed to make such a lowering easy, and the C-style, struct-based function pointer registration for runtime modules was designed to make emitting code that uses it fairly robust even when linked in dynamically, such as when embedded in shared objects.
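
For illustration, that struct-based registration might look roughly like the following; the layout and names are a hypothetical sketch, not IREE's actual VM module ABI.

  // Hypothetical illustration of C-style, struct-of-function-pointers module
  // registration. Generated LLVM IR calls through these slots, so devices,
  // diagnostics, and backend composition stay pluggable without the bytecode VM.
  #include <cstdint>

  extern "C" {

  typedef int module_status_t;  // 0 == ok (illustrative)

  typedef struct native_module_t {
    const char* name;
    // Resolved at load/link time; generated code calls through these slots.
    module_status_t (*create_state)(void* instance, void** out_state);
    void (*destroy_state)(void* state);
    module_status_t (*call)(void* state, uint32_t ordinal,
                            const void* args, void* results);
  } native_module_t;

  // A HAL backend (or any runtime module) fills in the struct and registers
  // it; a statically linked "runtimeless" build can devirtualize these calls.
  module_status_t register_native_module(const native_module_t* module);

  }  // extern "C"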

An extension of this is what we've been calling 'runtimeless mode', where the IREE VM linkage code is statically linked into the binary alongside the generated module LLVM IR. If only a single HAL backend is linked in then (with some build-fu) we should be able to get call devirtualization to reduce code size to precisely the functionality used by the module.

Early versions can be tested by tacking on additional vm-to-llvm conversions on the primary pipeline, and for integration a new iree/compiler/Translation target can be added (or an option can be added here: https://github.com/google/iree/blob/master/iree/compiler/Translation/IREEVM.cpp#L93)

Refactor sequencer translation to allow for multiple serialization formats

Right now the sequencer translation always lowers to bytecode. We should be able to emit C code or LLVM IR just as easily from the low-level sequencer dialect. Whether these targets are represented as new dialects (such as LLVM IR) or directly as a serialization mechanism (bytecode, C, etc) varies, so we should probably have both. This may mainly consist of splitting serialization from the translation (as the SPIR-V dialect is split).

Build symbolic shape calculation propagation infrastructure

A core idea of the dynamic shape system in IREE is that we can build a deterministic shape calculation table for each function that is based entirely on the input argument shapes. All shapes used within the function can then be looked up in that table by the sequencer and possibly evaluated during recording; allocations, copies, etc. can all use those references so that they can be performed parametrically. This avoids the need for fully dynamic dispatch and allows us to still plan allocations and aliasing at compile time.

This work would be to derive the table at compile-time, reference the table in various sequencer IR ops that may require it (workloads, allocations, etc), and propagate it through function calls where possible.
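
As a concrete illustration of the table idea (the representation and names below are assumptions, not the actual compiler data structures), each entry could be a small expression over the function's input argument dimensions that the sequencer evaluates while recording:

  // Illustrative-only: a per-function shape table whose entries are simple
  // expressions over input argument dimensions, evaluated at record time.
  #include <cstdint>
  #include <vector>

  struct DimExpr {
    enum Kind { kConstant, kArgumentDim, kProduct } kind;
    int64_t constant = 0;           // kConstant
    int arg_index = 0;              // kArgumentDim: which function argument
    int dim_index = 0;              // kArgumentDim: which dim of that argument
    std::vector<DimExpr> operands;  // kProduct
  };

  // Sequencer ops (allocations, workloads, copies, ...) reference entries by
  // index instead of carrying materialized shapes at runtime.
  using ShapeTable = std::vector<std::vector<DimExpr>>;

  int64_t Evaluate(const DimExpr& e,
                   const std::vector<std::vector<int64_t>>& arg_shapes) {
    switch (e.kind) {
      case DimExpr::kConstant: return e.constant;
      case DimExpr::kArgumentDim: return arg_shapes[e.arg_index][e.dim_index];
      case DimExpr::kProduct: {
        int64_t product = 1;
        for (const DimExpr& operand : e.operands)
          product *= Evaluate(operand, arg_shapes);
        return product;
      }
    }
    return 0;
  }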

Prepare arguments/results for deferred calls

Something we may want to consider is how the deferred calls will place their allocations: ideally we would write inputs into a ringbuffer that is then consumed by the larger batched calls. We may be able to do this by using module-level globals or other state that is aware of the nature of the deferred calls and the current policy (defining the number of pending calls/etc).

Add SPIR-V matmul variants/specialization constants for various workloads and targets

Enabling the use of extensions (such as cooperative matrix) and special codepaths for various runtime-detectable capabilities would be a good way of evaluating the flexibility of the compiler. Most of this work is predicated on support for such features in MLIR's SPIR-V dialect (if we want to go that route) and the design of the integration, however we could do some proof of concept work for benchmarking independently.

Expose the runtime API via C

A C API will allow us to easily interop with a variety of languages (C#, Python, Rust, Swift, etc.) as well as maintain a documented and tested API surface that has less chance of breaking as internal refactoring is performed.

Debugger UI reworking for new runtime API

Now that the new runtime API is set up, we can update the UI to better match the context/invocation flow. The current debugger assumes that modules exist only within a single context and does not show invocations well.

Plumb an allocator interface through the C++ runtime API

The C API allows for an iree_allocator_t to be passed to all functions that may allocate (directly or indirectly). It'd be nice to carry that through the C++ API such that it can be used for all heap allocations. We can also pass the allocator down through the HAL to let Vulkan use it.

I don't think it's important to get the VM using an allocator - the codegen sequencer will be the choice for low memory consumption/code size.

As part of this it would be worth switching to fixed arrays (not absl::FixedArray, but an actual fixed array), intrusive lists, etc to avoid unneeded allocations of containers.
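
A minimal sketch of threading a caller-provided allocator through the C++ API; the struct below mirrors the spirit of the C API's iree_allocator_t, but the exact fields and names here are illustrative, not the real definition.

  // Sketch: a caller-provided allocator that the C++ API would accept and the
  // HAL could adapt (e.g. to VkAllocationCallbacks for Vulkan).
  #include <cstddef>
  #include <cstdlib>

  struct Allocator {
    void* self = nullptr;
    void* (*alloc)(void* self, size_t size) = nullptr;
    void (*free)(void* self, void* ptr) = nullptr;
  };

  inline Allocator DefaultAllocator() {
    Allocator a;
    a.alloc = [](void*, size_t size) { return std::malloc(size); };
    a.free = [](void*, void* ptr) { std::free(ptr); };
    return a;
  }

  // Every heap-allocating C++ entry point would take one, e.g. (hypothetical):
  //   static StatusOr<ContextPtr> Context::Create(const Allocator& allocator);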

Support Builds on Windows

We ultimately want to support Windows builds via cmake and bazel. This issue will track the progress.

With a bit of tweaking, the Bazel build mostly works for iree/tools/run_module (the simplest end-to-end example).

Generate bytecode reader/writer and disassembler from tblgen

We should be able to generate these - at least the shims - from tblgen. This will allow us to drop the bytecode reader/writer and custom op serializers and dramatically speed things up. In fact, we should be able to generate the entire bytecode dispatch table and all pointer math.

As part of this it may be worth changing the bytecode (which was designed to be easy to parse using the current approach) to be leaner and easier to codegen instead. For example, we may turn a cond branch into only having a true case and args followed by an unconditional branch for the false case, which would allow us to avoid a more complicated set of serialization logic.

We may want a sequencer interface that is used by both the sequencer codegen and bytecode dispatch. This way the generated code becomes more like a visitor dispatcher instead of anything needing real logic.

Add a WebSockets relay for debug server

This could either be linked directly into the binary (allowing direct connection) or run as a separate app on hosting devices. Ideally either. The goal is to be able to run the debugger within a browser and connect to local processes or processes over adb relays. If that's not possible (due to security restrictions?) we'll probably want to kill the web debugger and focus on a native debugger instead.

Add bytecode representation of std/SIMD dialects for JITing

For the initial SIMD JIT we can reuse the bytecode infrastructure (maybe?), though it may not be worth it. Since we know it will be much simpler and is designed for JITing instead of interpreting, I think we could cut a lot of corners, though reusing the printing/parsing infrastructure would be nice. We can evaluate whether it's worth doing register assignment at compile time or during JITing, as that would determine whether we are using virtual registers or real physical registers (and what limits we may need to place on those).
