iree-org / iree
A retargetable MLIR-based machine learning compiler and runtime toolkit.
Home Page: http://iree.dev/
License: Apache License 2.0
A visitor-pattern shim that took care of parsing the bytecode, assigning registers, and calling emission stubs for the generic SIMD dialect ops would let us more easily plug in machine code generators for NEON/AVX/etc. Does not have to be particularly sophisticated given our tiny op coverage (no need for full xbyak-like functionality, for example).
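As a rough illustration of the shim described above, here is a minimal sketch in C++. All names (the op enum, the register model, the emitter interface) are hypothetical, not IREE's actual types; the point is that the walk/decode logic lives in one place and per-ISA backends only implement emission stubs:

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Illustrative decoded form of a generic SIMD dialect op.
enum class SimdOp : uint8_t { kAdd, kMul };

struct SimdInstr {
  SimdOp op;
  int dst, lhs, rhs;  // register numbers assigned during parsing
};

// Backends (NEON, AVX, ...) subclass this and implement the stubs.
class SimdEmitter {
 public:
  virtual ~SimdEmitter() = default;
  virtual void EmitAdd(int dst, int lhs, int rhs) = 0;
  virtual void EmitMul(int dst, int lhs, int rhs) = 0;
};

// The shim: walks the instruction stream and dispatches to emission stubs.
void DispatchSimd(const std::vector<SimdInstr>& instrs, SimdEmitter* emitter) {
  for (const SimdInstr& instr : instrs) {
    switch (instr.op) {
      case SimdOp::kAdd: emitter->EmitAdd(instr.dst, instr.lhs, instr.rhs); break;
      case SimdOp::kMul: emitter->EmitMul(instr.dst, instr.lhs, instr.rhs); break;
    }
  }
}

// Example backend that records mnemonics instead of emitting machine code.
class TraceEmitter : public SimdEmitter {
 public:
  void EmitAdd(int dst, int, int) override { trace_.push_back("add->v" + std::to_string(dst)); }
  void EmitMul(int dst, int, int) override { trace_.push_back("mul->v" + std::to_string(dst)); }
  std::vector<std::string> trace_;
};
```

Given the tiny op coverage, a switch like this is likely sufficient; a real NEON/AVX backend would only need to replace the trace strings with instruction encoding.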
When updating the abseil or llvm submodules, I often get this error:
error: Server does not allow request for unadvertised object b4ee523ffc962e25ef96f3bb67847521624d607a
(or similar)
I think this has to do with the shallow submodule fetch.
I was able to make progress by running:
git submodule update --depth 1000
RenderDoc provides a rich API by which we can trigger capture and demarcate "frames" (probably just top-level invocations). We can add comments to frames that show in the UI which would make it easier to map back to original invocations. RenderDoc can also capture stack traces on calling into the Vulkan API entry points which would be useful when using the sequencer codegen (otherwise we'd just see VM stacks).
This could either be linked directly into the binary (allowing direct connection) or run as a separate app on hosting devices. Ideally either. The goal is to be able to run the debugger within a browser and connect to local processes or processes over adb relays. If that's not possible (due to security restrictions?) we'll probably want to kill the web debugger and focus on a native debugger instead.
This will let us avoid the additional dependency on GL and simplify the cross-platform porting of the native app. Since we are using SDL this shouldn't be too bad as I believe it is a toggle. Note that we'll want to keep GL support working for the web.
This may be done for us as part of MLIR's tf2xla effort however we need to ensure what is produced lines up with what we need to perform symbolic shape calculation. Specifically, we want to be able to know which calculations are part of shape inference vs. general arithmetic and which values are shapes.
For the initial SIMD JIT we can reuse the bytecode infrastructure (maybe?) - though it may not be worth it. Since we know the format will be much simpler and is designed for JITing instead of interpreting, I think we could cut a lot of corners, though reusing the printing/parsing infrastructure would be nice. We can evaluate whether it's worth doing register assignment at compile-time or during JITing, as that determines whether we use virtual registers or real physical registers (and what limits we may need to place on those).
In the OSS build, the backing absl::ParseCommandLine returns a vector of positional arguments, but callers are expecting that the passed argv is modified (consistent with Google upstream). This needs to be normalized.
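A small shim could bridge the two behaviors. The sketch below is hypothetical (not IREE's actual code): it takes the positional-argument vector that absl::ParseCommandLine returns and rewrites argc/argv in place the way callers written against the Google upstream parser expect:

```cpp
#include <cassert>
#include <cstring>
#include <vector>

// Rewrites argv so that only the positional arguments remain at the front,
// updating *argc to the new count and nulling out the stale tail.
// In main() this would be called as:
//   std::vector<char*> positional = absl::ParseCommandLine(argc, argv);
//   NormalizeArgv(positional, &argc, argv);
void NormalizeArgv(const std::vector<char*>& positional, int* argc, char** argv) {
  int original_argc = *argc;
  int new_argc = static_cast<int>(positional.size());
  for (int i = 0; i < new_argc; ++i) {
    argv[i] = positional[i];  // compact positional args to the front
  }
  for (int i = new_argc; i < original_argc; ++i) {
    argv[i] = nullptr;  // clear entries consumed by flag parsing
  }
  *argc = new_argc;
}
```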
Hello everyone, I was trying to build IREE with CMake on Linux and here are some minor issues:
- file_io and file_mapping libs
- executable_cache needs iree::base::wait_handle, which is commented out in https://github.com/google/iree/blob/master/iree/base/CMakeLists.txt#L349
- build_tools/third_party/tensorflow vs. third_party/tensorlow ?
By the way, as far as I understood you have your own internal build system; do you have any plans to support a stable build under Linux?
Thanks.
We can use SwiftShader as a software Vulkan ICD. At first, this will let us run tests and other CI on machines with no hardware Vulkan driver.
#line directives will let us embed the original source locations into the generated files, meaning that we'll get stack traces into the original frontend language. We could have an option for the generator to switch which locations get used (source location, line in the HLO MLIR dump, etc).
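The mechanism itself is standard C/C++. In this minimal demonstration, "model.mlir" is a made-up frontend file name; after the directive, compiler diagnostics and __FILE__/__LINE__ report the remapped location rather than the generated file's own:

```cpp
#include <cassert>
#include <cstring>

// A generated file can remap everything that follows to the original
// frontend source with a single #line directive:
#line 42 "model.mlir"
static int reported_line() { return __LINE__; }   // this line is now line 42
static const char* reported_file() { return __FILE__; }  // "model.mlir"
```

A stack trace through code compiled this way then points at the frontend source, which is exactly the behavior the generator option would toggle.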
We ultimately want to support Windows builds via cmake and bazel. This issue will track the progress.
With a bit of tweaking, the bazel build mostly works for:
iree/tools/run_module (the simplest e2e example)
Enabling the use of extensions (such as cooperative matrix) and special codepaths for various runtime-detectable capabilities would be a good way of evaluating the flexibility of the compiler. Most of this work is predicated on support for such features in MLIR's SPIR-V dialect (if we want to go that route) and the design of the integration, however we could do some proof of concept work for benchmarking independently.
We'll need to serialize parametric shape tables built at compile-time (#36) and dereference them at runtime. For values that are fully-known during sequencer execution we can evaluate inline and effectively have static shapes for all ops using the given shapes.
Bazel and CMake?
Now that the new runtime API is setup we can update the UI to better match the context/invocation flow. The current debugger assumes that modules exist only within a single context and does not show invocations well.
These exist upstream and bad things happen on merge when misaligned.
The C API allows an iree_allocator_t to be passed to all functions that may allocate (directly or indirectly). It'd be nice to carry that through the C++ API such that it can be used for all heap allocations. We can also pass the allocator down through the HAL to let Vulkan use it.
I don't think it's important to get the VM using an allocator - the codegen sequencer will be the choice for low memory consumption/code size.
As part of this it would be worth switching to fixed arrays (not absl::FixedArray, but an actual fixed array), intrusive lists, etc to avoid unneeded allocations of containers.
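A minimal sketch of the pattern, assuming a simplified allocator struct (the real iree_allocator_t differs): every heap allocation routes through user-supplied function pointers, so the same allocator can be threaded from the C API through C++ objects and down into the HAL:

```cpp
#include <cassert>
#include <cstdlib>

// Simplified stand-in for a C-style allocator; not the actual iree_allocator_t.
struct Allocator {
  void* self;  // user state passed back to the callbacks
  void* (*alloc)(void* self, size_t size);
  void (*free)(void* self, void* ptr);
};

// Example allocator state that tracks live allocations.
struct CountingState {
  int live = 0;
};

static void* CountingAlloc(void* self, size_t size) {
  static_cast<CountingState*>(self)->live++;
  return std::malloc(size);
}
static void CountingFree(void* self, void* ptr) {
  static_cast<CountingState*>(self)->live--;
  std::free(ptr);
}

// A C++ object that takes the allocator instead of using new/delete directly.
class Buffer {
 public:
  Buffer(Allocator allocator, size_t size)
      : allocator_(allocator), data_(allocator.alloc(allocator.self, size)) {}
  ~Buffer() { allocator_.free(allocator_.self, data_); }
  void* data() const { return data_; }

 private:
  Allocator allocator_;
  void* data_;
};
```

The same shape maps naturally onto Vulkan's VkAllocationCallbacks, which is what would let the HAL forward the caller's allocator to the driver.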
A C API will allow us to easily interop in a variety of languages (C#, python, rust, swift, etc) as well as maintain a documented and tested API surface that has less chance of breaking as internal refactoring is performed.
A core idea of the dynamic shape system in IREE is that we can build a deterministic shape calculation table for each function that is based entirely on the input argument shapes. All shapes used within the function can then be looked up in that table by the sequencer (and possibly evaluated during recording), and allocations, copies, etc. can all use those references so that they can be performed parametrically. This avoids the need for fully dynamic dispatch and allows us to still plan allocations and aliasing at compile-time.
This work would be to derive the table at compile-time, reference the table in various sequencer IR ops that may require it (workloads, allocations, etc), and propagate it through function calls where possible.
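To make the idea concrete, here is a hypothetical sketch (not IREE's actual data structures) of such a table: each entry's dimensions are either constants or references to an input argument's dimension, so every shape in the function is a pure function of the input shapes and can be dereferenced by the sequencer at runtime:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One dimension of a table entry: a constant extent, or a reference to
// dimension `dim_index` of input argument `input_index`.
struct Dim {
  bool is_constant;
  int64_t value;    // constant extent, if is_constant
  int input_index;  // otherwise: which input argument...
  int dim_index;    // ...and which of its dimensions
};

using Shape = std::vector<int64_t>;

struct ShapeTable {
  std::vector<std::vector<Dim>> entries;  // serialized at compile-time

  // Dereference entry `i` given the concrete input shapes for this call.
  Shape Resolve(int i, const std::vector<Shape>& input_shapes) const {
    Shape result;
    for (const Dim& dim : entries[i]) {
      result.push_back(dim.is_constant
                           ? dim.value
                           : input_shapes[dim.input_index][dim.dim_index]);
    }
    return result;
  }
};
```

Sequencer IR ops (workloads, allocations, etc.) would then carry an entry index into this table rather than materialized shapes.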
To enable better SPIR-V and SIMD codegen we should vectorize dispatch regions. There's a vectorization transform in MLIR, however it's specific to the affine dialect and not easily usable for our purposes; something much simpler would do, especially considering our starting point with variable-width vector HLO ops.
Input:
%c = xla_hlo.mul %a, %b : tensor<1024xf32>
Output:
{ iree.workload_divisor = {4, 4, 4} }
%c = xla_hlo.mul %a, %b : tensor<4xf32>
We could do this as part of lowering to SPIR-V or SIMD directly however both would benefit and having a common path through the index map would be nice if we need to adjust that to take the divisors into account.
When we aren't able to fully parameterize shapes based on input shapes (such as when a shape is sliced out of an arbitrary tensor) we'll need to synthesize a dispatch that computes the workload for indirect dispatch. We'll want to add bounds checking and compatibility checks and some way of communicating data loss errors (such as when the computed workload is not possible to perform), though we could possibly work around this by enqueuing multiple indirect dispatches and using as many as we need to satisfy the request.
Starting from the WebAssembly SIMD proposal define an MLIR dialect with the appropriate types (v128) and ops.
We can also add frame tracking for application-invoked flushes, timeline span thingies (whatever I called them) for async invocations, etc. Seeing if we can marshal Vulkan timestamps in would also be good.
ERROR: C:/src/ireepub/iree/iree/base/BUILD:348:1: Couldn't build file iree/base/_objs/wait_handle_test/wait_handle_test.obj: C++ compilation of rule '//iree/base:wait_handle_test' failed (Exit 1)
iree/base/wait_handle_test.cc(17,10): fatal error: 'unistd.h' file not found
#include <unistd.h>
^~~~~~~~~~
1 error generated.
ERROR: C:/src/ireepub/iree/iree/base/BUILD:330:1: Couldn't build file iree/base/_objs/wait_handle/wait_handle.obj: C++ compilation of rule '//iree/base:wait_handle' failed (Exit 1)
iree/base/wait_handle.cc(19,10): fatal error: 'poll.h' file not found
#include <poll.h>
^~~~~~~~
1 error generated.
We should be able to generate these - at least the shims - from tblgen. This will allow us to drop the bytecode reader/writer and custom op serializers and dramatically speed things up. In fact, we should be able to generate the entire bytecode dispatch table and all pointer math.
As part of this it may be worth changing the bytecode (which was designed to be easy to parse using the current approach) to be leaner and easier to codegen instead. For example, we may turn a cond branch into only having a true case and args, followed by an unconditional branch for the false case, which would allow us to avoid a more complicated set of serialization logic.
We may want a sequencer interface that is used by both the sequencer codegen and bytecode dispatch. This way the generated code becomes more like a visitor dispatcher instead of anything needing real logic.
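To illustrate the shape of what tblgen could emit, here is a hand-written sketch (opcodes and encoding are made up): one shim per opcode with a uniform signature, a table indexed directly by the opcode byte, and inline pointer math over the operand bytes, so the interpreter loop contains no real logic of its own:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct VmState {
  std::vector<int32_t> regs;
};

// Uniform shim signature: decode operands at `pc`, execute, return new pc.
using OpFn = size_t (*)(VmState&, const uint8_t*, size_t);

enum : uint8_t { kHalt = 0, kAddI32 = 1, kMulI32 = 2 };

// encoding for both ops: [opcode, dst, lhs, rhs] -- one byte each
static size_t OpAddI32(VmState& vm, const uint8_t* bc, size_t pc) {
  vm.regs[bc[pc + 1]] = vm.regs[bc[pc + 2]] + vm.regs[bc[pc + 3]];
  return pc + 4;
}
static size_t OpMulI32(VmState& vm, const uint8_t* bc, size_t pc) {
  vm.regs[bc[pc + 1]] = vm.regs[bc[pc + 2]] * vm.regs[bc[pc + 3]];
  return pc + 4;
}

// The table tblgen would generate from the op definitions.
static const OpFn kDispatchTable[] = {nullptr, OpAddI32, OpMulI32};

static void Run(VmState& vm, const std::vector<uint8_t>& bytecode) {
  size_t pc = 0;
  while (bytecode[pc] != kHalt) {
    pc = kDispatchTable[bytecode[pc]](vm, bytecode.data(), pc);
  }
}
```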
Possibly just op+shape-based to start and only around the core expensive dispatches.
Input:
iree_hl_seq.dispatch[...](...) {
// giant matmul
}
Deferred:
iree_hl_seq.deferred_call @outlined_dispatch(...);
...
func @outlined_dispatch(...) {
iree_hl_seq.dispatch[...](...) {
// giant matmul
}
}
After identifying good targets for cellular batching (#41) we'll need to ensure we can actually batch. Though coalescing is possible even when batching is not, and often still provides throughput benefits, the real wins come from increasing the arithmetic density of the GEMVs. We should be able to detect which shape dimensions we can make partial for a given deferred call body and do so.
The current sequencer IR was a stopgap to get the end-to-end flow working. The real sequencer needs the concepts of command buffers, synchronization, and more flexible buffer handling. Half of this work is defining the IR in MLIR and adapting the existing conversion/translation code to work with it while the other half is refactoring the sequencer VM to do the proper dispatching. Once this is done we'll have a good foundation for performing a variety of optimizations in the compiler and beginning code generation work.
This can be used for scheduling the SIMD JIT or codegen.
High-level design is of an iree/task/ task management system that is specialized for our workload (out-of-order dynamic DAGs with late-binding of parameters like dispatch counts). A custom HAL command buffer will produce DAG fragments consisting of tasks and then submit them to task queues that fan out to a thread pool. cpuinfo will be used by default to create a thread pool that attempts to map one thread to each L2 cache, though this could be overridden.
Major components:
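As a rough single-threaded illustration of the core mechanism (not iree/task/ itself, all names hypothetical): tasks carry a pending-dependency count, become ready when it hits zero, and are pulled from a ready queue, which in the real system would fan out to the thread pool:

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <vector>

struct Task {
  std::function<void()> work;
  int pending = 0;                 // unresolved dependencies
  std::vector<size_t> successors;  // tasks to notify on completion
};

// Executes the DAG out-of-order as dependencies resolve.
void RunDag(std::vector<Task>& tasks) {
  std::deque<size_t> ready;
  for (size_t i = 0; i < tasks.size(); ++i) {
    if (tasks[i].pending == 0) ready.push_back(i);  // roots start ready
  }
  while (!ready.empty()) {
    size_t id = ready.front();
    ready.pop_front();
    tasks[id].work();
    for (size_t succ : tasks[id].successors) {
      if (--tasks[succ].pending == 0) ready.push_back(succ);  // now ready
    }
  }
}
```

Late-binding of parameters like dispatch counts would amount to a task mutating its successors' state before decrementing their pending counts.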
Need to migrate BUILD files from upstream and make compatible with OSS release.
This partially works now, with iree/base and iree/hal/vulkan building. Will iterate on the rest.
Right now we are hardcoding which backends we target and assuming the same parameters for all of them. To allow benchmarking of changes that may require specific target capabilities (Vulkan/SPIR-V extensions, device limits, etc) we should be able to specify these as flags to the compiler and produce a variety of executables. At runtime we should then match against those to select the best suited for the given runtime configuration and allow overrides to make it easier to compare.
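The runtime matching step could look like the following sketch (names and the ranking scheme are illustrative, not an existing IREE API): each compiled variant carries the capabilities it requires, and the runtime picks the best-ranked variant the device actually supports:

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <string>
#include <vector>

struct ExecutableVariant {
  std::string name;
  int preference;                           // higher = more specialized/faster
  std::set<std::string> required_features;  // e.g. Vulkan/SPIR-V extensions
};

// Returns the highest-preference variant whose requirements are a subset of
// the device's features, or nullptr if nothing matches.
const ExecutableVariant* SelectVariant(
    const std::vector<ExecutableVariant>& variants,
    const std::set<std::string>& device_features) {
  const ExecutableVariant* best = nullptr;
  for (const auto& v : variants) {
    bool supported = std::includes(
        device_features.begin(), device_features.end(),
        v.required_features.begin(), v.required_features.end());
    if (supported && (!best || v.preference > best->preference)) best = &v;
  }
  return best;
}
```

An override flag for benchmarking would simply filter `variants` to a named entry before selection.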
We should add plumbing to launch a Colab kernel that pulls TensorFlow and the IREE Python API together. This will make it easy to interactively exercise the system, author models, etc.
The dependencies are fairly intricate to get right, which should be the main work that needs to be done. Once the build support is in place, we should publish a Docker image so anyone can experiment easily.
Full message:
vkCreateDevice: pCreateInfo->pQueueCreateInfos[0].queueCount (=2) is not less than or equal to
available queue count for this pCreateInfo->pQueueCreateInfos[0].queueFamilyIndex} (=0) obtained
previously from vkGetPhysicalDeviceQueueFamilyProperties (i.e. is not less than or equal to 1). The
Vulkan spec states: queueCount must be less than or equal to the queueCount member of the
VkQueueFamilyProperties structure, as returned by vkGetPhysicalDeviceQueueFamilyProperties in
the pQueueFamilyProperties[queueFamilyIndex]
(https://www.khronos.org/registry/vulkan/specs/1.1-extensions/html/vkspec.html#VUID-VkDeviceQueueCreateInfo-queueCount-00382)
Platform info:
Command line:
bazel run --config=windows iree/tools/iree-run-mlir -- $(pwd)/iree/samples/hal/simple_compute_test.mlir --input_values="4xf32=1.0 2.0 3.0 4.0\n4xf32=2.0 4.0 6.0 8.0" --iree_logtostderr --target_backends=vulkan --print_mlir=false 2>&1 | tee ~/vklog.txt
All invocations currently run synchronously. To allow for custom imported functions that may perform sync/async work, overlapped scheduling of invocations, and sequencer-level cellular batching we'll want to plug a simple user-mode fiber scheduler into the Instance such that invocations across contexts can be executed as available.
We can either allow users to donate their calling thread to perform fiber scheduling or spawn a dedicated scheduler thread. Until deferred calls are implemented there should not be much functional change.
Something we may want to consider is how the deferred calls will place their allocations: ideally we would write inputs into a ringbuffer that is then consumed by the larger batched calls. We may be able to do this by using module-level globals or other state that is aware of the nature of the deferred calls and the current policy (defining the number of pending calls/etc).
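A minimal fixed-capacity ringbuffer of the kind described, single-threaded and purely illustrative: producers append pending-call inputs, a full buffer signals that the batched consumer should flush, and the consumer drains in FIFO order:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

template <typename T>
class RingBuffer {
 public:
  explicit RingBuffer(size_t capacity) : storage_(capacity) {}

  // Returns false when full; the policy layer would flush pending calls then.
  bool Push(const T& value) {
    if (size_ == storage_.size()) return false;
    storage_[(head_ + size_) % storage_.size()] = value;
    ++size_;
    return true;
  }

  // FIFO drain by the batched consumer; returns false when empty.
  bool Pop(T* out) {
    if (size_ == 0) return false;
    *out = storage_[head_];
    head_ = (head_ + 1) % storage_.size();
    --size_;
    return true;
  }

  size_t size() const { return size_; }

 private:
  std::vector<T> storage_;
  size_t head_ = 0, size_ = 0;
};
```

The capacity (and the flush-on-full behavior) is exactly where the policy knob of "number of pending calls" would live.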
We can reuse the same index map and affine expression codegen for the SIMD lowering. Moving it somewhere common and ensuring we don't have any SPIR-V specific stuff will allow us to wire up std/SIMD ops for lowering.
Related to #265 (which will carry the stack traces), this is tracking the work required to generate source maps, embed them in modules, and perform lookups at runtime.
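The runtime lookup side could be as simple as this sketch (the layout is hypothetical, not IREE's format): the embedded map is a list of (bytecode offset, frontend location) entries sorted by offset, and a query binary-searches for the last entry at or before the faulting offset:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

struct SourceMapEntry {
  uint32_t bytecode_offset;
  std::string file;
  uint32_t line;
};

// Returns the entry covering `offset`, or nullptr if the offset precedes
// the first mapped op. `entries` must be sorted by bytecode_offset.
const SourceMapEntry* Lookup(const std::vector<SourceMapEntry>& entries,
                             uint32_t offset) {
  auto it = std::upper_bound(
      entries.begin(), entries.end(), offset,
      [](uint32_t off, const SourceMapEntry& e) { return off < e.bytecode_offset; });
  if (it == entries.begin()) return nullptr;
  return &*(it - 1);
}
```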
Right now we have to get llvm9 and all of the submodules before we can run any action on the CI. It'd be cool to setup a push action that, if .gitmodules is changed, rebuilds a base ubuntu docker image with our deps and the bigger submodules (llvm, tensorflow) already checked out. This way incremental non-submodule changes should be much faster.
There is the beginning of Python bindings for the compiler and vm in the upstream repo. Need to integrate into the OSS repo and work out dependencies.
As a follow-on, the Python API should align with the upcoming C API.
We'll want a way to expose various metrics from the HAL implementations in a way that avoids excessive sequencer work. This could be accomplished by a begin/end profiling API and a resulting profile that contains cumulative, sampled, or averaged counter values per-backend. On Vulkan this may mean some vendor-specific performance counters in addition to timestamps inserted into command buffers.
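The begin/end surface could be as thin as this sketch (counter names and the API are made up): backends report counter deltas while a session is active, and the session hands back cumulative values, keeping all aggregation out of the sequencer's hot path:

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

class ProfilingSession {
 public:
  void Begin() {
    active_ = true;
    counters_.clear();
  }

  // Backends report counter deltas (timestamps, cache misses, ...) here;
  // reports outside of a Begin()/End() window are dropped.
  void Report(const std::string& counter, uint64_t delta) {
    if (active_) counters_[counter] += delta;
  }

  // Ends the session and returns the cumulative per-counter values.
  std::map<std::string, uint64_t> End() {
    active_ = false;
    return counters_;
  }

 private:
  bool active_ = false;
  std::map<std::string, uint64_t> counters_;
};
```

Sampled or averaged modes would only change the aggregation in Report(); the Vulkan backend would feed this from timestamps and vendor counters read back from command buffers.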
There is an incompatibility with bazel 1.0:
ERROR: C:/users/stella/_bazel_stella/452ktct5/external/glslang/BUILD.bazel:57:15: in nocopts attribute of cc_library rule @glslang//:glslang: This attribute was removed. See bazelbuild/bazel#8706 for details.
Right now the logic to select a Vulkan device for the test runners is too simplistic (it just takes the first). We should probably have two flags: a device name to match and an index. Also, it would be nice to have a tool to enumerate what the driver sees.
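The two proposed flags could combine like this sketch (the flag semantics are the suggestion from this issue, not an existing tool): filter the enumerated device names by substring, then take the index'th match:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Returns the position of the `index`'th device whose name contains
// `name_filter`, or -1 if there is no such device. An empty filter
// matches every device, preserving today's take-the-first behavior
// for (filter="", index=0).
int SelectDevice(const std::vector<std::string>& device_names,
                 const std::string& name_filter, int index) {
  int match = 0;
  for (size_t i = 0; i < device_names.size(); ++i) {
    if (device_names[i].find(name_filter) == std::string::npos) continue;
    if (match == index) return static_cast<int>(i);
    ++match;
  }
  return -1;
}
```

The enumeration tool would then just print the same device-name list this function selects from.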
A deferred call allows for explicit compile-time indication of which parts of the sequencer execution graph are optimal for coalescing and possible batching. To start we can use heuristics to identify candidates (large conv/matmul/etc) while in the future we can add cost analysis and profile-guided annotation. The runtime can trigger fiber yielding and manage the policy used to flush pending deferred calls.
Dynamic shapes will be required to effectively perform batching, however coalescing should be possible even with fully static shapes. Ideally we would be able to loosen static shaping of call trees to allow batching even when the input HLO is fully shaped by either inserting dynamic dimensions or making outer dimensions dynamic when it would cause no observable changes.
Right now the sequencer translation always lowers to bytecode. We should be able to emit C code or LLVM IR just as easily from the low-level sequencer dialect. Whether these targets are represented as new dialects (such as LLVM IR) or directly as a serialization mechanism (bytecode, C, etc) varies, so we should probably have both. This may mainly consist of splitting serialization from the translation (as the SPIR-V dialect is split).
For scenarios where dynamic module loading is not required and entire modules can be compiled into applications we can lower the VM IR to LLVM IR within MLIR's transformation pipeline. Instead of embedding vm.call ops that are dispatched at runtime to things like the HAL we can instead lower to llvm::CallInst to runtime-resolved function pointers. This still enables all of the flexibility of heterogeneous/runtime-determined devices, pluggable diagnostics, and backend composition without any need for flatbuffers or the VM bytecode interpreter.
The VM was designed to make such a lowering easy and the C-style struct-based function pointer registration for runtime modules was designed to make emitting code that used it fairly robust even when linked in dynamically such as when embedded in shared objects.
An extension of this is what we've been calling 'runtimeless mode', where the IREE VM linkage code is statically linked into the binary alongside the generated module LLVM IR. If only a single HAL backend is linked in then (with some build-fu) we should be able to get call devirtualization to reduce code size to precisely the functionality used by the module.
Early versions can be tested by tacking on additional vm-to-llvm conversions on the primary pipeline, and for integration a new iree/compiler/Translation target can be added (or an option can be added here: https://github.com/google/iree/blob/master/iree/compiler/Translation/IREEVM.cpp#L93).
ERROR: C:/src/ireepub/iree/iree/hal/BUILD:68:1: Couldn't build file iree/hal/_objs/buffer_test/buffer_mapping_test.obj: C++ compilation of rule '//iree/hal:buffer_test' failed (Exit 1)
iree/hal/buffer_mapping_test.cc(87,24): error: expected expression
ASSERT_OK_AND_ASSIGN(auto mapping,
^