tum-ei-eda / mlonmcu
Tool for the deployment and analysis of TinyML applications on TFLM and MicroTVM backends
License: Apache License 2.0
Currently we generate the static TFLM code by running the TFLite Micro Compiler (pre-interpreter, TFLMC) on the host. This can result in a larger estimated arena size when the results are used on a 32-bit architecture, which is typically fine. It would still be desirable to get more accurate estimations by running the "preinterpretation" on the actual target (e.g. a simulator).
The priority of this is pretty low as we prefer to use TVM rather than TFLM anyway.
I recently added a corstone300 target based on the ARM Cortex-M55 FVP (see #3).
It was expected that RISC-V and ARM targets are not comparable in terms of cycle counts as they model different architectures. However, as all of them should be ISSs with a constant CPI of 1, I would not have expected the following:
The estimated cycle counts for corstone300 on the same target software tend to be 5-10 times smaller than the RISC-V ones, even without features such as cmsisnn.
I can not tell if this is just the simulators being implemented very differently or if there is another issue, e.g. with the cycle count reading (Cycle Count register overflowing at 32 bit) or different compiler optimization flags.
This issue should document the following:
Currently we use several different ways to access the number of cycles/instructions for executing a model:
• spike: Use RISC-V performance counters at runtime to measure elapsed cycles during main(). This introduces a rather large ROM overhead as we have to link printf etc. even in Release mode. (RAM/cycle overhead should be negligible.)
• ovpsim: Parse stdout for metrics printed AFTER simulation. However, as the target software is the same as used by spike, the same overheads are expected. (This could be changed easily.)
• etiss_pulpino: Parse stdout for metrics printed AFTER simulation. (Alternative: use the JSON file which can be generated by the VP.) This approach does not rely on printf and thus leads to much smaller program sizes. Once performance counters are implemented here as well, we could use them instead to be consistent.
• corstone300: Similar to spike/ovpsim.
• esp32/esp32c3: Using the ESP timer and printing elapsed cycles via UART. Similar overheads for printf/string handling plus additional drivers for UART, ...
There is probably no good solution to this: even if every used simulator provided a way to access those metrics without using printf, that would still not be applicable to real hardware targets.
Maybe we should agree on a consistent way to get the cycles/instructions, e.g. using printf etc. in every program, to make the targets more comparable. What do you think @rafzi @fabianpedd?
Should we only measure main() or also the time spent in the bootloader/startup code? Currently the approach differs from target to target.

We use the Artifact class to manage all the intermediate results as well as the final report of a run.
We could re-use this concept on the session level for the session report (a DataFrame which combines the reports of every run in a session) and visualization results (if available).
Some backends allow target-specific optimizations. Especially TVM supports the following target-specific flags to enable specific schedules or features:
• device (e.g. arm_cpu, cuda, ...)
• mcpu
• model
• mattr (e.g. +neon)
• keys
We should find a way to update these automatically given the used target (e.g. to enable usage of SIMD intrinsics which are only available on a given set of devices), as sketched below. However this transformation should be optional and thus should be enabled explicitly. By default, generic (portable) implementations should be used by the backend.
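For illustration, a minimal sketch of how such a mapping could look; the target names and flag values below are assumptions and not verified settings:

```python
# Hypothetical sketch: derive TVM target options from the MLonMCU target name.
# The mapping values are illustrative only and would need to be verified per target.
TVM_TARGET_OPTS = {
    "corstone300": {"device": "arm_cpu", "mcpu": "cortex-m55"},
    "etiss_pulpino": {},  # generic/portable implementations by default
}

def get_tvm_target_str(target_name, enable_target_opts=False):
    """Build a TVM target string like 'c -device=arm_cpu -mcpu=cortex-m55'."""
    opts = TVM_TARGET_OPTS.get(target_name, {}) if enable_target_opts else {}
    flags = " ".join(f"-{key}={value}" for key, value in opts.items() if value)
    return ("c " + flags).strip()
```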
Open questions:
• Should this be implemented as a feature (e.g. overwriteopts)? Or should it just be a per-target config option (e.g. spike.overwrite_backend_options=1)?

Two of the four CI/CD build jobs are currently failing:
As a long-term goal it would be great to run TVM models directly using TVM's default runtime (without the CRT).
This would involve adding a new backend (e.g. tvmllvm) to the project using tvmc compile without the microTVM-specific flags.
In addition we need to decide if we want to build the host software ourselves or just let tvmc run do the work (losing MLIF-specific (profiling) features).
Currently we have sort of a naming conflict in the project:
A session may be composed of multiple runs while each run has a set of stages:
At some point we should rename the RUN stage to something less confusing, e.g. execute or evaluate.
The current focus of the mlonmcu target software is to benchmark inference performance and validate model outputs for predefined input data.
As support for real hardware targets (e.g. esp32, esp32c3) was recently introduced, it would be great to also use the on-board peripherals (i.e. microphone/MEMS sensors) in the deployed target software (if feasible).
The following points have to be considered:
• How to switch between benchmark/demo mode? Command line flag? Optional run stage?
• How to provide the demo code? mlif_overwrite code? Multiple files? Compiler macros? (#if defined(ESPIDF_PLATFORM) && defined(RUN_DEMO) ...)
• How to handle endless execution (--num=inf)? Should mlonmcu stop after flashing or also monitor the hardware?

Currently reports are exported as CSV only.
In the future we should at least support Excel file formats as well.
At a later point in time it would be great if we could also export figures/visualizations in addition to the report.
We already support a large number of backends, targets and features in MLonMCU. This leads to an enormous number of different configurations. For the CI we should define on which combinations of targets/models/backends/features/configs benchmarks should be performed. We can also define another (minimal) set of combinations for testing purposes to detect regressions or bugs which have not been found during benchmarks.
Regarding frameworks I currently use the following naming, which is suboptimal:
• tflite
• tvm
The obvious alternative would be to use tflm and microtvm, however there are a few caveats:
• tflm sounds very similar to the backend names tflmc and tflmi. Alternative: tflite-micro?
• microtvmaot etc. just does not sound right... tvm might generalize better.
Regarding the backend names there is also some inconsistency:
• utvmrt references the old uTVM Graph Runtime, which is now known as the Graph Executor. Also, the used runtime is called CRT, which is not directly related to microTVM.

An MLonMCU "run" typically consists of the following stages:
• LOAD: Process the model in the frontend to produce e.g. a .tflite file
• TUNE: Generate tuning records (optional)
• BUILD: Run the chosen backend to generate (wrapper) code for the model
• COMPILE: Compile the target software using the generated code
• RUN: Run the resulting ELF on the defined target platform
• POSTPROCESS: (unimplemented)
Currently the intermediate artifacts are dumped into a single directory $MLONMCU_HOME/temp/sessions/$SID/runs/$RID/ (with SID being the session ID and RID the run ID).
I would like to add a config option artifacts_per_stage=1/0, which can be defined on the command line or in the environment.yml, to instead use subdirectories for every stage, i.e. $MLONMCU_HOME/temp/sessions/$SID/runs/$RID/{load,build,compile,run,postprocess}. A small sketch of how this could be resolved follows.
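As an illustration, a small sketch (assuming a hypothetical get_artifact_dir helper, not the actual implementation) of how the per-stage directory could be resolved:

```python
from pathlib import Path

def get_artifact_dir(run_dir, stage, artifacts_per_stage=False):
    """Return the directory where artifacts of the given stage should be written.

    With artifacts_per_stage enabled, a per-stage subdirectory such as
    .../runs/$RID/build is used instead of the flat run directory.
    """
    run_dir = Path(run_dir)
    if artifacts_per_stage:
        return run_dir / stage.lower()  # e.g. load, build, compile, run, postprocess
    return run_dir
```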
For calculating the static ROM/RAM usage on non-ETISS targets, I have created mlonmcu/target/elf.py, heavily inspired by the ETISS get_metrics.py.
However, as we are now also targeting ARM and x86, we will very likely miss some important ELF sections.
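For reference, a simplified sketch (not the actual elf.py) of how allocated sections could be classified with pyelftools; the ROM/RAM heuristic is intentionally rough and would indeed miss target-specific sections:

```python
from elftools.elf.elffile import ELFFile
from elftools.elf.constants import SH_FLAGS

def estimate_rom_ram(elf_path):
    """Roughly classify allocated ELF sections into ROM and RAM usage."""
    rom, ram = 0, 0
    with open(elf_path, "rb") as f:
        for section in ELFFile(f).iter_sections():
            flags = section["sh_flags"]
            if not flags & SH_FLAGS.SHF_ALLOC:
                continue  # debug info etc. does not occupy target memory
            size = section["sh_size"]
            if section["sh_type"] == "SHT_NOBITS":
                ram += size                 # .bss and friends: RAM only
            elif flags & SH_FLAGS.SHF_WRITE:
                ram += size                 # initialized data lives in RAM ...
                rom += size                 # ... and its initializer in ROM
            else:
                rom += size                 # code and read-only data
    return rom, ram
```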
The --progress option is available for the mlonmcu flow and mlonmcu setup commands.
I modified their format string to get rid of the estimated remaining time, which is not useful for our approach. However it would be great if we could add an indicator of how long the underlying process has been running.
As TVM can process .onnx (pretrained) models, it would be great if we could also use them:
• mlonmcu-models
• OnnxFrontend
• TVMBackend's ...
The frontends API in mlonmcu is currently very limited as there only exists the tflite frontend. It is likely that we will run into some minor issues when adding more complex frontends in the future.
Currently we can not support Python 3.6 because we use features which are only available from version 3.7+.
Let's try to document those and decide if we can provide some workarounds to support Python 3.6 systems.

We currently still use our own TensorFlow v2.4 fork even though the project would theoretically support the latest version of the standalone tflite-micro. The only thing missing is patches to the pre-interpreter (tflmc backend) codebase.
Another long-term goal for the project would be the ability to send input data for the model over a serial interface to the target device (real hardware or simulator) and receive the resulting tensor data after inference. This would allow validating the results on the host instead of on the hardware itself, i.e. to test a larger number of samples (which would not fit in ROM) or to evaluate the on-device accuracy (optional, as the quantized accuracy should be the same as determined during model conversion).
The existing MLIF inout data feature should stay the default.
Some considerations:
• How to enable it, e.g. --feature stream_data?

Since the frontends/backends should now be aware of the input tensor names (as well as types and sizes) in a model, we can probably shift from using per-input/output .bin files for test data to a single pair (in/out) of .npz numpy pickled files.
This would allow validating model outputs using the tvm platform (tvmc run ...) and introducing optional atol (absolute tolerance) and rtol (relative tolerance) values to compensate deviations from the expected (golden) output values due to the used framework/target. We could provide a switch to decide if the outputs have to match exactly or if some tolerance can be accepted (see the sketch below).
In addition we could add a mode (for classification models) where we only supply the expected output label (or its index) as the golden reference. This would only detect if a keyword is mis-classified, i.e. due to some implementation bug...
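A minimal sketch of such a check, assuming the golden and actual outputs are stored as .npz files with matching tensor names (the helper names are hypothetical):

```python
import numpy as np

def validate_outputs(expected_npz, actual_npz, atol=1e-5, rtol=1e-4, exact=False):
    """Compare actual model outputs against golden outputs stored in .npz files."""
    expected = np.load(expected_npz)
    actual = np.load(actual_npz)
    for name in expected.files:
        if exact:
            if not np.array_equal(expected[name], actual[name]):
                return False
        elif not np.allclose(expected[name], actual[name], atol=atol, rtol=rtol):
            return False
    return True

def validate_label(expected_label, actual_output):
    """Classification-only mode: check the predicted label index instead of raw values."""
    return int(np.argmax(actual_output)) == int(expected_label)
```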
As a long-term goal we could also consider using the validate feature for estimating the on-device accuracy of a model (which would ideally match the one of the quantized TFLite model). To do this effectively (e.g. with a lot of samples) we have to stream the inputs/outputs to/from the target (device/simulator) and validate them on the host. Otherwise we would need to link an unreasonable amount of ROM data into the ELF.
However we still need to decide on the following:
• Should we keep the .bin files as a fallback if a .npz file does not yet exist (eventually by renaming the current implementation to validatelegacy or by introducing a config option, e.g. validate.legacy=1)?
• Can we choose default values for atol and rtol which hold for every model? Probably not, because we have to consider different datatypes and output ranges...
• Do we keep the ins/outs directories and parse their names, or could we provide a map for this?

Currently we need to use our own fork or patch the TVM codebase to get some of our models running:
diff --git a/src/target/source/codegen_c.cc b/src/target/source/codegen_c.cc
index a31111153..bf43aabd3 100644
--- a/src/target/source/codegen_c.cc
+++ b/src/target/source/codegen_c.cc
@@ -675,6 +675,10 @@ void CodeGenC::VisitExpr_(const CallNode* op, std::ostream& os) { // NOLINT(*)
const StringImmNode* str = op->args[0].as<StringImmNode>();
ICHECK(str != nullptr);
os << "__tvm_param__" << str->value;
+ } else if (ptr_op->name == "tir.round") {
+ os << "(";
+ this->PrintExpr(op->args[0], os);
+ os << " + 0.5f)";
} else {
LOG(FATAL) << "Unresolved call " << op->op;
}
I would like to use the upstream TVM repository in the default environment and use our fork in a new environment called tumeda.
@rafzi Is there a reason why this small patch did not yet make it upstream?
A run of mine was killed with SIGKILL and caused a hang. After four SIGINTs (^C below) I got back to the command prompt, but no report was generated for the runs before.
ERROR - The process returned an non-zero exit code -9! (CMD: `/home/user1/ml_on_mcu/venv/bin/python -m tvm.driver.tvmc compile /home/user1/mlenv/temp/sessions/96/runs/4/nasnet.tflite --target c -f mlf --executor aot --runtime crt --pass-config tir.disable_vectorize=True --pass-config relay.moiopt.enable=True --pass-config relay.moiopt.noftp=False --pass-config relay.moiopt.onlyftp=False --pass-config relay.moiopt.norecurse=True --opt-level 3 --input-shapes input_1:[1,224,224,3] --model-format tflite --runtime-crt-system-lib 0 --target-c-constants-byte-alignment 4 --target-c-workspace-byte-alignment 4 --target-c-executor aot --target-c-unpacked-api 0 --target-c-interface-api packed --output /tmp/tmpa7fw4300/default.tar`)
Traceback (most recent call last):
File "/home/user1/mlonmcu/mlonmcu/session/run.py", line 538, in process
func()
File "/home/user1/mlonmcu/mlonmcu/session/run.py", line 433, in build
self.backend.generate_code()
File "/home/user1/mlonmcu/mlonmcu/flow/tvm/backend/tvmaot.py", line 119, in generate_code
out = self.invoke_tvmc_compile(out_path, dump=dump, verbose=verbose)
File "/home/user1/mlonmcu/mlonmcu/flow/tvm/backend/backend.py", line 228, in invoke_tvmc_compile
return self.invoke_tvmc("compile", *args, verbose=verbose)
File "/home/user1/mlonmcu/mlonmcu/flow/tvm/backend/backend.py", line 220, in invoke_tvmc
return utils.python(*pre, command, *args, live=verbose, env=env)
File "/home/user1/mlonmcu/mlonmcu/setup/utils.py", line 171, in python
return exec_getout(sys.executable, *args, **kwargs)
File "/home/user1/mlonmcu/mlonmcu/setup/utils.py", line 154, in exec_getout
assert exit_code == 0, "The process returned an non-zero exit code {}! (CMD: `{}`)".format(
AssertionError: The process returned an non-zero exit code -9! (CMD: `/home/user1/ml_on_mcu/venv/bin/python -m tvm.driver.tvmc compile /home/user1/mlenv/temp/sessions/96/runs/4/nasnet.tflite --target c -f mlf --executor aot --runtime crt --pass-config tir.disable_vectorize=True --pass-config relay.moiopt.enable=True --pass-config relay.moiopt.noftp=False --pass-config relay.moiopt.onlyftp=False --pass-config relay.moiopt.norecurse=True --opt-level 3 --input-shapes input_1:[1,224,224,3] --model-format tflite --runtime-crt-system-lib 0 --target-c-constants-byte-alignment 4 --target-c-workspace-byte-alignment 4 --target-c-executor aot --target-c-unpacked-api 0 --target-c-interface-api packed --output /tmp/tmpa7fw4300/default.tar`)
ERROR - [session-96] [run-4] Run failed at stage 'BUILD', aborting...
############### HANG HERE ###################################
^C^CTraceback (most recent call last):
File "/home/user1/mlonmcu/mlonmcu/session/session.py", line 258, in process_runs
_join_workers(workers)
File "/home/user1/mlonmcu/mlonmcu/session/session.py", line 198, in _join_workers
results.append(w.result())
File "/usr/lib/python3.9/concurrent/futures/_base.py", line 435, in result
self._condition.wait(timeout)
File "/usr/lib/python3.9/threading.py", line 312, in wait
waiter.acquire()
KeyboardInterrupt
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/user1/mlonmcu/mlonmcu/cli/main.py", line 116, in <module>
sys.exit(main(args=sys.argv[1:])) # pragma: no cover
File "/home/user1/mlonmcu/mlonmcu/cli/main.py", line 107, in main
args.func(args)
File "/home/user1/mlonmcu/mlonmcu/cli/flow.py", line 64, in handle
args.flow_func(args)
File "/home/user1/mlonmcu/mlonmcu/cli/compile.py", line 108, in handle
kickoff_runs(args, RunStage.COMPILE, context)
File "/home/user1/mlonmcu/mlonmcu/cli/common.py", line 191, in kickoff_runs
success = session.process_runs(
File "/home/user1/mlonmcu/mlonmcu/session/session.py", line 290, in process_runs
_join_workers(workers)
File "/usr/lib/python3.9/concurrent/futures/_base.py", line 628, in __exit__
self.shutdown(wait=True)
File "/usr/lib/python3.9/concurrent/futures/thread.py", line 229, in shutdown
t.join()
File "/usr/lib/python3.9/threading.py", line 1033, in join
self._wait_for_tstate_lock()
File "/usr/lib/python3.9/threading.py", line 1049, in _wait_for_tstate_lock
elif lock.acquire(block, timeout):
KeyboardInterrupt
^CException ignored in: <module 'threading' from '/usr/lib/python3.9/threading.py'>
Traceback (most recent call last):
File "/usr/lib/python3.9/threading.py", line 1415, in _shutdown
atexit_call()
File "/usr/lib/python3.9/concurrent/futures/thread.py", line 31, in _python_exit
t.join()
File "/usr/lib/python3.9/threading.py", line 1033, in join
self._wait_for_tstate_lock()
File "/usr/lib/python3.9/threading.py", line 1049, in _wait_for_tstate_lock
elif lock.acquire(block, timeout):
KeyboardInterrupt:
^C⏎
The Corstone300 simulator (ARM Cortex-M55 FVP) has a quite annoying exit condition: EXITTHESIM is printed to the terminal by the target software. When a program executes successfully, this is done at the end of the main function; however, if execution stops earlier with a call to exit(1) (i.e. from TVMPlatformAbort()) we currently can not catch this.
We need to find a workaround for this. Here are some ideas:
• Overwrite the exit() function in the MLIF lib
• Check if atexit() would work

It would be very helpful for debugging if there was a possibility to continue a run/session from the last completed stage. This way we could debug/tweak code for later stages much more easily without waiting for all prior stages to complete.
For now, I would restrict the feature to always choose the most recent session.
As a long-term goal it would be great to support operating systems other than Ubuntu/Debian. As there will definitely be dependencies which are not available on every OS, we will likely have to disable certain features/targets/... on some hosts.
• --docker command line flag

I am a bit unhappy with the current approach of how the backend implementations are handled in MLonMCU.
Some examples:
• The TFLMCBackend is only a wrapper around the previously installed tflmc dependency (invoked in a subprocess). Multiple versions of the tool can easily be used by pointing the tflmc.exe property to the according path.
• The tflmi wrapper-generation utils on the other side are written in Python and just called from within the TFLMIBackend. As the wrappers are generated without any external dependency, this works out great as it is.
• The tvm backends use the Python lib of a specific TVM installation. The required version of TVM stored in the property tvm.src_dir might change during the MLonMCU flow, which can not be handled properly by Python.
For the last problem there exist multiple solutions:
• Use multiprocessing.Process() to invoke the TVM backends in a new process which can use a different PYTHONPATH. The two main issues of this approach are: ... (tflmc internally calls tflite_micro_compiler)
• Use subprocess.Popen instead -> allows to decouple the backend stdout from the rest of MLonMCU
• Move the TVM utils out of mlonmcu/flow/tvm as they are useful without mlonmcu -> make them an external dependency instead

A detailed overview of the existing configuration mechanics in MLonMCU can be found here: [Will be added later]
Every relevant entity (frontend, framework, backend, target) has a name, a self.config dictionary, and the class variables DEFAULTS and REQUIRED. The entity's name is used as a prefix on the global config layer for mapping configs to specific instances and overwriting the defined defaults (tflmi.arena_size -> tflmi.config["arena_size"]). The REQUIRED variable contains a list of all config keys which need to be defined explicitly beforehand. The mapping of that configuration as well as validating the keys is done in the constructors right now.
The mentioned scheme has some disadvantages we should get rid of, e.g.:
• a lot of redundant code because every entity handles its configuration separately
• no possibility to update the configuration after the constructor without potentially breaking something
• no easy way to quickly export all configuration into a single dict
• class variables are used to define default values, which is not a good practice (the reason for this was to get the required config keys of a class without instantiating it first)
My proposal for a configuration refactoring is as follows (a rough sketch follows below):
• Implement a Config class which acts as a "smart" replacement for the self.config dicts, also handling defaults, required keys and the mapping of prefixed keys.
• Optional: Add subclasses, i.e. BackendConfig, TargetConfig, ... (only if necessary)
• Optional: Add an abstract Configurable base class to all relevant classes (required?)
• On the run level a single config should be shared between all relevant objects (no copying! If a frontend sets config["tflmi.arena_size"] = 1024, it will instantly be available in the tflmi backend.)
• This also allows an easy export of all config to a report, as we do not need to collect every single config manually since everything is stored in a common place. In addition it would be possible to only export config values which do not match their default value.
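A rough sketch of what the proposed Config class could look like; names and behavior are proposals, not the implemented API:

```python
class Config:
    """Hypothetical sketch of the proposed "smart" config wrapper."""

    def __init__(self, shared, prefix, defaults=None, required=None):
        self._shared = shared            # one dict shared across all entities of a run
        self._prefix = prefix            # e.g. "tflmi"
        self._defaults = defaults or {}  # class-level DEFAULTS
        self._required = required or []  # class-level REQUIRED
        missing = [key for key in self._required if f"{prefix}.{key}" not in shared]
        if missing:
            raise ValueError(f"Missing required config keys for {prefix}: {missing}")

    def __getitem__(self, key):
        # Prefixed keys in the shared config overwrite the defaults.
        return self._shared.get(f"{self._prefix}.{key}", self._defaults.get(key))

    def __setitem__(self, key, value):
        # Writes go to the shared dict, so other entities see them instantly.
        self._shared[f"{self._prefix}.{key}"] = value

# Usage sketch: frontend and backend share the same dict.
shared = {"tflmi.arena_size": 1024}
backend_cfg = Config(shared, "tflmi", defaults={"arena_size": 2048, "ops": None})
assert backend_cfg["arena_size"] == 1024
```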
A platform API was recently added to MLonMCU and its idea can be described as follows:
• Platform: the common base class
• CompilePlatform: a platform which is able to build target software with given codegen results
• TargetPlatform: a platform with the ability to flash/monitor specific (hardware) targets
• A concrete platform (e.g. espidf or platformio) inherits from one or both of the base classes depending on the implemented features
• Each platform provides Target instances for supported target names.
This issue proposes to add another type of platform: BuildPlatform.
This would be a platform which wraps around a backend and should therefore be able to run code generation.
A realistic example of how this might be used would be a microtvm platform, as TVM provides a Project API with templates to support a large number of target devices. The full flow from building a model over compiling to running the model (using an RPC server) can be handled using the tvmc micro tool.
I am actually not sure if this would be a good idea at some point in time, however it always makes sense to think about ways to generalize existing APIs. For this reason this should mainly document the concept, which might get picked up at some point. A rough class-hierarchy sketch is given below.
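For documentation purposes, a minimal sketch of the class hierarchy including the proposed BuildPlatform; the method names are illustrative assumptions, not the existing API:

```python
class Platform:
    """Base class for all platforms (names below are illustrative)."""
    name = None

class CompilePlatform(Platform):
    def compile(self, codegen_dir, target):
        raise NotImplementedError  # build target software from codegen results

class TargetPlatform(Platform):
    def flash(self, elf, target):
        raise NotImplementedError
    def monitor(self, target):
        raise NotImplementedError

# Proposed addition: a platform wrapping a backend, i.e. able to run code generation.
class BuildPlatform(Platform):
    def generate_code(self, model, backend):
        raise NotImplementedError

# A hypothetical microtvm platform could implement all three roles via 'tvmc micro'.
class MicroTvmPlatform(BuildPlatform, CompilePlatform, TargetPlatform):
    name = "microtvm"
```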
This issue should list all the features we want to support until the release of the project.
First I would like to shortly explain the feature types which are denoted after each entry on the list:
• Setup: affects the mlonmcu setup command.
• Frontend: affects the LOAD stage of a run (i.e. packed).
• Backend (i.e. unpacked_api): affects the BUILD stage of a run.
• Target (i.e. etissdbg): affects the COMPILE and RUN stage.
• Compile (i.e. debug): affects the COMPILE stage.

For private organization repositories, GitHub Pages (which we need to release our Sphinx documentation) is a paid feature.
As soon as the project is open-sourced, we can enable this feature.
The new version of the library is going to be released soon, so we also want to use it in MLonMCU.
The integration process can be summarized as follows:
• mlonmcu setup ...

There are at least 2 mlonmcu subcommands for the command line I would like to support in the future:
• mlonmcu export
• mlonmcu cleanup
An optional one would be mlonmcu activate, which would (automatically) look for an MLonMCU environment and export the MLONMCU_HOME environment variable inside the current shell (inspired by conda activate).
Currently some backends use their own input buffer which needs to be filled beforehand, while others just take a pointer to the constant input data in ROM.
The former has the advantage that the input buffer itself can be considered during memory planning. The latter uses ROM instead of RAM, which may or may not be desirable.
We have at least one model with a fairly large input size (~40 kB). This leads to the observation that e.g. tvmaot needs much more RAM than tflmi for this model.
We could instead decide to use the same approach for every supported backend, i.e. copy the constant input data via memcpy to RAM first.
Currently the demo in the GitHub Actions fails because it needs access to the mlonmcu-models submodule. As this is currently private, the GitHub runner can not clone it. Normally you would generate a PAT (Personal Access Token) to resolve such problems, but this is not available for organization accounts.
I am considering just making mlonmcu-models and mlonmcu-sw public to fix the CI.
In the current implementation every defined run is processed independently. This has the main advantage that parallel processing can be applied very easily. However, in certain situations this approach results in a lot of redundancy in terms of processed workloads.
Imagine the following example:
mlonmcu run aww vww -b tflmc -b tvmaot -t etiss_pulpino -t host_x86
This would currently result in the following workloads:
• Stage LOAD: 8 times
• Stage BUILD: 8 times
• Stage COMPILE: 8 times
• Stage RUN: 8 times
However it would be more efficient with the following scheme:
• Stage LOAD: 2 times
• Stage BUILD: 4 times
• Stage COMPILE: 8 times
• Stage RUN: 8 times
The question is how we integrate this approach into the flow. One option would be to add the possibility to specify a "parent" of a run explicitly. However there are a few caveats which we have to discuss (a rough deduplication sketch follows below):
• What should happen with the run IDs? (Use nested numbers, e.g. 4_0, for children?)
• Would this force us to process runs on a per-stage (--config runs_per_stage=1) basis to handle the dependencies easily? (i.e. COMPILE jobs can not start unless every BUILD job has finished, which results in potentially high sync times)
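To illustrate the potential savings, a small sketch that groups runs by the configuration keys relevant up to each stage; the stage-to-key mapping is an assumption for illustration:

```python
from itertools import product

# Hypothetical mapping of which parts of a run configuration matter per stage.
STAGE_KEYS = {
    "LOAD": ("model",),
    "BUILD": ("model", "backend"),
    "COMPILE": ("model", "backend", "target"),
    "RUN": ("model", "backend", "target"),
}

def count_workloads(models, backends, targets):
    runs = [dict(model=m, backend=b, target=t)
            for m, b, t in product(models, backends, targets)]
    counts = {}
    for stage, keys in STAGE_KEYS.items():
        unique = {tuple(run[k] for k in keys) for run in runs}
        counts[stage] = len(unique)
    return counts

# For the example 'aww vww -b tflmc -b tvmaot -t etiss_pulpino -t host_x86':
# {'LOAD': 2, 'BUILD': 4, 'COMPILE': 8, 'RUN': 8}
print(count_workloads(["aww", "vww"], ["tflmc", "tvmaot"], ["etiss_pulpino", "host_x86"]))
```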
There is a flaw in the current implementation of the cmsisnn feature:
This feature requires the config (or cache key) cmsisnn.lib, which is then passed to the tflm.optimized_kernel_libs configuration variable.
While this works out when only using CMSIS-NN on a single target (e.g. the ARM corstone300 FVP), we get into trouble if we also want to use other architectures, as each of them requires building a specific static library for CMSIS-NN.
To achieve this, mlonmcu setup builds multiple static libraries, leading to the following dependency cache (deps/cache.ini):
[x86]
cmsisnn.lib = /tmp/mlonmcu_env_test/deps/install/cmsisnn_x86/libcmsis-nn.a
[dbg,x86]
cmsisnn.lib = /tmp/mlonmcu_env_test/deps/install/cmsisnn_x86_dbg/libcmsis-nn.a
[riscv]
cmsisnn.lib = /tmp/mlonmcu_env_test/deps/install/cmsisnn_riscv/libcmsis-nn.a
[dbg,riscv]
cmsisnn.lib = /tmp/mlonmcu_env_test/deps/install/cmsisnn_riscv_dbg/libcmsis-nn.a
[arm]
cmsisnn.lib = /tmp/mlonmcu_env_test/deps/install/cmsisnn_arm/libcmsis-nn.a
[arm,dbg]
cmsisnn.lib = /tmp/mlonmcu_env_test/deps/install/cmsisnn_arm_dbg/libcmsis-nn.a
The required flags for the architecture can be obtained via the target.get_arch() method.
However, as all features are initialized at the beginning, before any targets exist, the cache variable cmsisnn.lib can not be resolved because it is missing the flags x86/arm/riscv and dbg from the compile stage... -> the actual issue.
A workaround for this issue is explicitly passing the value of cmsisnn.lib via the command line: --config cmsisnn.lib=/tmp/mlonmcu_env/deps/install/cmsisnn_x86_dbg/libmuriscv-nn.a
To tackle the issue we have to resolve one/both of these problems:
Another option which might be feasible is the following:
Instead of the actual value of the cmsisnn.lib config/cache variable, we just pass some sort of reference to tflm.optimized_kernel_libs, which is then resolved when tflm.optimized_kernel_libs is accessed, e.g. in TfLiteFramework.get_cmake_args() (see the sketch below). We should keep this in mind when tackling #15 as this might help to get rid of A. and B.
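A minimal sketch of this reference idea; all class and method names are illustrative rather than the actual MLonMCU API:

```python
# Hypothetical late-resolved reference instead of a concrete library path.
class Cache:
    def __init__(self, entries):
        self.entries = entries  # {(key, frozenset(flags)): value}

    def lookup(self, key, flags=frozenset()):
        return self.entries[(key, frozenset(flags))]

class CacheRef:
    """Placeholder stored in the config until target/arch flags are known."""
    def __init__(self, key):
        self.key = key

    def resolve(self, cache, flags=()):
        return cache.lookup(self.key, flags)

cache = Cache({("cmsisnn.lib", frozenset({"arm", "dbg"})): "/deps/install/cmsisnn_arm_dbg/libcmsis-nn.a"})
config = {"tflm.optimized_kernel_libs": [CacheRef("cmsisnn.lib")]}

# Later, e.g. in TfLiteFramework.get_cmake_args(), the reference is resolved
# with the flags that are only known at COMPILE time:
libs = [ref.resolve(cache, flags=("arm", "dbg")) for ref in config["tflm.optimized_kernel_libs"]]
print(libs)  # ['/deps/install/cmsisnn_arm_dbg/libcmsis-nn.a']
```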
The issue would also apply to muriscvnn, as it would be great if we could use the scalar version on other platforms as well for comparisons.
Currently we support two targets:
• etiss_pulpino (might be renamed to etissvp or edaduino)
• host_x86
In the future we might support further architectures/simulators/devices, e.g.:
Postprocesses are intended to run after a session to do e.g. one of the following things on the resulting dataframe (a minimal example follows below):
• --num X
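As an illustration, a tiny pandas-based postprocess sketch; the column names and the filtering behavior are assumptions, not the actual report schema:

```python
import pandas as pd

def filter_report(df, sort_by="Total Cycles", num=None):
    """Sort the session report and optionally keep only the first `num` rows."""
    out = df.sort_values(sort_by)
    if num is not None:
        out = out.head(num)
    return out

report = pd.DataFrame({"Model": ["aww", "vww"], "Total Cycles": [2_000_000, 9_000_000]})
print(filter_report(report, num=1))
```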
Currently at least one MLPerf Tiny model fails to build/invoke using the tvmrt or tvmcg backend. The reason for this is that the default value (10) of the constant TVM_CRT_MAX_ARGS is exceeded. We need to add a backend config to make this user-configurable. Also it would be useful to maintain our own default crt_config.h. However I am not sure where it should be stored so that it can be accessed from every environment. (Copy/clone/download it during initialization of the environments?)
Currently error handling in MLonMCU is done in two different ways:
• Raising RuntimeErrors with a message
• Assertions with optional messages
I had the idea to add custom error types such as RunError, SessionError, BackendError, which should help to make some typical errors easier to understand (a possible hierarchy is sketched below). However I do not know what the best approach is, as we also should not overdo it.
Another related point is whether we should omit stack traces for user-facing interfaces (i.e. if a model was not found inside an environment it would be enough to just print an error message).
We should also evaluate if logging.error(msg) might help us to clean up error messages.
Most of the README badges are currently broken. Some will start working as soon as the repo goes public; some might need small fixes or should just be removed.
The new muriscvnn has a relatively strict requirement on the used CMake version (3.22 or so), which leads to errors on most OSes.
Currently this results in an error during mlonmcu setup which is quite hard to read, so we should catch this kind of problem earlier.
Two ideas:
• Make sure a recent CMake is used when utils.cmake() is called instead of the system one.
• Add a _validate_muriscvnn() function which throws a more readable error if the version of CMake installed on the system is too low (see the sketch below).
@rafzi @fabianpedd what do you think?
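A rough sketch of the second idea; only the _validate_muriscvnn name and the 3.22 requirement come from above, everything else is an assumption:

```python
import re
import shutil
import subprocess

MIN_CMAKE = (3, 22)

def _validate_muriscvnn():
    """Fail early with a readable message if the system CMake is too old."""
    cmake = shutil.which("cmake")
    if cmake is None:
        raise RuntimeError("muriscvnn requires CMake, but no 'cmake' executable was found.")
    out = subprocess.run([cmake, "--version"], capture_output=True, text=True).stdout
    match = re.search(r"cmake version (\d+)\.(\d+)", out)
    version = tuple(int(x) for x in match.groups()) if match else (0, 0)
    if version < MIN_CMAKE:
        raise RuntimeError(
            f"muriscvnn requires CMake >= {MIN_CMAKE[0]}.{MIN_CMAKE[1]}, "
            f"but only {version[0]}.{version[1]} was found. Please update CMake."
        )
```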
In the original version of ml_on_mcu we managed the build of the target software as follows:
• Separate build directories such as build, build_dbg, build_muriscvnn_dbg, ...
This has the following advantages:
However there are some drawbacks:
In the new MLonMCU the approach is currently as follows:
• temp/sessions/0/runs/0/mlif_build
While this resolves all issues listed above, it leads to a few problems as well:
How to overcome these limitations?
My proposal:
• Split the target software into 1. a part which only depends on the used "flags" (dbg, muRiscvNN, etc.) and 2. the model-specific part, and only build 2. in every invocation of the flow, which should be much faster.
Will this work out?
• Should we pre-build the flag-dependent libraries in the deps directory during mlonmcu setup for every possible combination of "flags" or create them on-demand in $MLONMCU_HOME/temp/mlif/?

Obvious step after the project is released: make sure it can be installed via pip install mlonmcu.
This would be an alternative to the existing target host_x86 provided by the MLIF platform. The ESP-IDF target is still experimental, so we can consider supporting it as soon as it becomes stable.
With an increasing number of components (frameworks, backends, frontends, platforms, targets, features, ...) in MLonMCU, the list of packages in the requirements.txt file keeps growing.
Especially since a dependency on the tensorflow package was added, installing the Python dependencies for the first time (i.e. in CI) often takes an unreasonable amount of time while also consuming more than one GB of disk space.
We might be at a point now where the "user" should decide which features they want to use and which not. We already have this information in an environment's environment.yml and would just need a script which maintains a mapping of the required packages for each component and writes all those which are needed to a requirements.txt file inside the environment directory (a rough sketch follows below).
TVM does a similar thing which we could probably use as inspiration:
https://github.com/apache/tvm/blob/main/python/gen_requirements.py
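A rough sketch of such a generator script; the component names and package lists are placeholder assumptions, not the actual mapping:

```python
COMPONENT_REQUIREMENTS = {
    "core": ["pyyaml", "pandas"],
    "tflm": ["tensorflow"],
    "tvm": ["numpy", "typing_extensions"],
}

def write_requirements(enabled_components, out_path="requirements.txt"):
    """Collect the packages needed for the enabled components of an environment."""
    packages = sorted({pkg for comp in enabled_components
                       for pkg in COMPONENT_REQUIREMENTS.get(comp, [])})
    with open(out_path, "w") as f:
        f.write("\n".join(packages) + "\n")
    return packages

# Example: an environment that only enables the TVM flow would skip tensorflow.
print(write_requirements(["core", "tvm"]))
```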
I came up with a new structure for the environment.yml files (see the first comment) and would like to discuss it in this context.
If an MLonMCU benchmarking session is interrupted right in the middle, no report will be produced.
Ideally we should catch the CTRL-C signal to stop the running jobs manually and update the report using the data which was already available from the previous stage. Finally, append a new column to the report which indicates which rows are incomplete due to being canceled by the user, and make sure to return a non-zero exit code to the shell (a rough sketch follows below).
Optional: Check if CTRL+C is hit multiple times and exit directly. (Maybe this works right out of the box?)
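A rough sketch of how the interrupt handling around the run loop could look; the per-run worker and the report columns are assumptions for illustration only:

```python
import sys
import pandas as pd

def process_run(run):
    # Stand-in for the real per-run worker.
    return {"Run": run, "Total Cycles": 123456, "Incomplete": False}

def process_runs(runs):
    rows, interrupted = [], False
    try:
        for run in runs:
            rows.append(process_run(run))
    except KeyboardInterrupt:
        interrupted = True  # user hit CTRL-C: stop scheduling further jobs
    report = pd.DataFrame(rows)
    # Append a row for every run that was cancelled before it produced results.
    for run in runs[len(rows):]:
        report = pd.concat([report, pd.DataFrame([{"Run": run, "Incomplete": True}])],
                           ignore_index=True)
    report.to_csv("report.csv", index=False)
    sys.exit(1 if interrupted else 0)

process_runs(["run-0", "run-1"])
```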
There are several situations where it would make sense to time out a Python function after some defined period:
While some components offer ways to manage timeouts by themselves (i.e. corstone300), it would still be great to have a consistent API for such things (a possible sketch follows below).
Actually, target-related timeouts are already part of the MLonMCU codebase but currently raise NotImplementedError.
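One possible design for such an API, sketched with multiprocessing; this is not the existing MLonMCU code:

```python
import multiprocessing
import time

def _call_and_store(queue, func, args):
    queue.put(func(*args))

def run_with_timeout(func, args=(), timeout=60):
    """Run func(*args) in a child process and terminate it after `timeout` seconds."""
    queue = multiprocessing.Queue()
    proc = multiprocessing.Process(target=_call_and_store, args=(queue, func, args))
    proc.start()
    proc.join(timeout)
    if proc.is_alive():
        proc.terminate()
        proc.join()
        raise TimeoutError(f"{getattr(func, '__name__', 'function')} timed out after {timeout}s")
    return queue.get()

if __name__ == "__main__":
    print(run_with_timeout(time.sleep, args=(1,), timeout=5))  # -> None
```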
The Unified Static Memory Planning (USMP) is now fully integrated into TVM.
I will create a feature for it, which should be very basic as it only needs to set a few backend options!