
Inference time using Interpreter API on Android inconsistent and 10–50 times slower than same tflite model on iOS (about tensorflow, OPEN, 11 comments)

jakubdolejs commented on May 3, 2024

Comments (11)

sawantkumar commented on May 3, 2024

Hi @jakubdolejs ,

When it comes to speed and performance, NCNN is generally considered faster than TFLite in many scenarios, so your results are somewhat expected. However, I will replicate the issue on my available Pixel phone using TFLite and get back to you. I don't think there is a need to file a separate issue yet.

sawantkumar commented on May 3, 2024

Hi @jakubdolejs

There could be a number of reasons for the performance gap between the Pixel 4a and the iPhone 12. When you use the Core ML delegate on the iPhone, it runs on the Neural Engine (NPU), which is much faster than the GPU on the Pixel 4a. Could you also benchmark your model on the Pixel 4a using the TensorFlow Lite profiler? It will give you detailed information about model execution, such as how many partitions are created and how many layers fall back to the CPU when the GPU delegate is used. Also, the Pixel 4a's GPU is not optimised for fp32 calculations, only for fp16 operations, so that could be the culprit behind the poor GPU performance with fp32. Could you share the TensorFlow Lite profiler results once you have benchmarked your TFLite model on the Pixel?
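
If you want to check whether fp16 arithmetic helps the fp32 model on the GPU delegate, a rough sketch of allowing reduced precision is below. This is only illustrative: GpuDelegate.Options and setPrecisionLossAllowed come from the standard org.tensorflow.lite.gpu package, modelFile is assumed to be a File pointing at your .tflite model, and whether the option changes the generated kernels depends on your TFLite version.

import org.tensorflow.lite.InterpreterApi
import org.tensorflow.lite.gpu.GpuDelegate

// Sketch: allow the GPU delegate to use fp16 arithmetic for an fp32 model,
// which mobile GPUs such as the Pixel 4a's handle much faster than fp32.
val gpuOptions = GpuDelegate.Options().setPrecisionLossAllowed(true)
val interpreterOptions = InterpreterApi.Options().addDelegate(GpuDelegate(gpuOptions))
val interpreter = InterpreterApi.create(modelFile, interpreterOptions) // modelFile: your .tflite File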

jakubdolejs commented on May 3, 2024

Thank you @sawantkumar. I'll try the profiler and upload the results here.

jakubdolejs commented on May 3, 2024

Hello @sawantkumar,

I ran the benchmark tool with different options on the float32 and float16 models. Please see the attached results. The file names ending with gpu are from runs that had the --use_gpu flag set to true. The ones ending with nnapi had the --use_nnapi flag set to true. The commands used to invoke the tests are included in the txt files.

Please let me know if you see anything unexpected in the results.

fp16_gpu.txt
fp16_nnapi.txt
fp16.txt
fp32_gpu.txt
fp32_nnapi.txt
fp32.txt

sawantkumar commented on May 3, 2024

Hello @jakubdolejs,

I've reviewed the log files, and everything appears as expected, except for the discrepancies noted in the files fp32_gpu.txt and fp16_gpu.txt. While the average GPU latency numbers from the TFLite profiler are almost identical for the fp16 and fp32 models, the logs from your Android code indicate a clear difference between the fp32 and fp16 GPU numbers. To enable a more accurate comparison, could you also profile your models on an iPhone 12 using the TFLite Profiler for iOS?

Regarding the inconsistency in inference numbers during the first few runs on the Pixel 4a, could you integrate a few warm-up loops in your Android code before benchmarking and let me know the results? Please feel free to reach out if you encounter any difficulties during this process.

jakubdolejs commented on May 3, 2024

Thank you @sawantkumar. I'll try the iOS app and report back. I really appreciate you helping me through this.

jakubdolejs commented on May 3, 2024

Hi @sawantkumar,

Here are the benchmarks from iOS (iPhone 12 mini). It looks like the benchmark app runs the inference on the UI thread; for all the models I get this warning in the log: "This method should not be called on the main thread as it may lead to UI unresponsiveness." I redacted those messages from the log output below for brevity.

FP16 Model on CPU

INFO: Log parameter values verbosely: [0]
INFO: Min num runs: [50]
INFO: Inter-run delay (seconds): [-1]
INFO: Num threads: [4]
INFO: Benchmark name: [arc_psd_001_fp16_benchmark]
INFO: Min warmup runs: [1]
INFO: Graph: [/private/var/containers/Bundle/Application/1D088489-D83E-4A30-B2A8-26180514520A/TFLiteBenchmark.app/ARC_PSD-001_1.1.122_bst_yl80201_float16.tflite]
INFO: Input layers: [in0]
INFO: Input shapes: [1,640,640,3]
INFO: #threads used for CPU inference: [4]
INFO: Loaded model /private/var/containers/Bundle/Application/1D088489-D83E-4A30-B2A8-26180514520A/TFLiteBenchmark.app/ARC_PSD-001_1.1.122_bst_yl80201_float16.tflite
Initialized TensorFlow Lite runtime.
INFO: Initialized TensorFlow Lite runtime.
WARN: Tensor # 0 is named inputs_0 but flags call it in0
Created TensorFlow Lite XNNPACK delegate for CPU.
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
INFO: The input model file size (MB): 22.4319
INFO: Initialized session in 150.857ms.
INFO: Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
INFO: count=3 first=206791 curr=185620 min=185245 max=206791 avg=192552 std=10069

INFO: Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.

INFO: count=50 first=189002 curr=222573 min=189002 max=222573 avg=208663 std=9484

INFO: Inference timings in us: Init: 150857, First inference: 206791, Warmup (avg): 192552, Inference (avg): 208663
INFO: Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
INFO: Memory footprint delta from the start of the tool (MB): init=87.2656 overall=159.392

FP16 model with CoreML delegate:

INFO: Log parameter values verbosely: [0]
INFO: Min num runs: [50]
INFO: Inter-run delay (seconds): [-1]
INFO: Num threads: [4]
INFO: Benchmark name: [arc_psd_001_fp16_benchmark]
INFO: Min warmup runs: [1]
INFO: Graph: [/private/var/containers/Bundle/Application/50D22DE8-EDBA-4763-8298-7187B6D7FD12/TFLiteBenchmark.app/ARC_PSD-001_1.1.122_bst_yl80201_float16.tflite]
INFO: Input layers: [in0]
INFO: Input shapes: [1,640,640,3]
INFO: Use CoreML: [1]
INFO: #threads used for CPU inference: [4]
INFO: Loaded model /private/var/containers/Bundle/Application/50D22DE8-EDBA-4763-8298-7187B6D7FD12/TFLiteBenchmark.app/ARC_PSD-001_1.1.122_bst_yl80201_float16.tflite
Initialized TensorFlow Lite runtime.
INFO: Initialized TensorFlow Lite runtime.
WARN: Tensor # 0 is named inputs_0 but flags call it in0
coreml_version must be 2 or 3. Setting to 3.
INFO: COREML delegate created.
CoreML delegate: 215 nodes delegated out of 384 nodes, with 15 partitions.
INFO: CoreML delegate: 215 nodes delegated out of 384 nodes, with 15 partitions.
INFO: Explicitly applied COREML delegate, and the model graph will be partially executed by the delegate w/ 13 delegate kernels.
Created TensorFlow Lite XNNPACK delegate for CPU.
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
INFO: The input model file size (MB): 22.4319
INFO: Initialized session in 2540.03ms.
INFO: Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.

Note that when running the FP16 model with the CoreML delegate I got an EXC_BAD_ACCESS error here:

TfLiteBenchmarkTfLiteModelRunWithArgs(benchmark, argc, argv.data());

That's why the log is truncated.

FP32 model on CPU:

INFO: Log parameter values verbosely: [0]
INFO: Min num runs: [50]
INFO: Inter-run delay (seconds): [-1]
INFO: Num threads: [4]
INFO: Benchmark name: [arc_psd_001_fp32_benchmark]
INFO: Min warmup runs: [1]
INFO: Graph: [/private/var/containers/Bundle/Application/6731B3C4-1818-4B08-977B-9D7C0C8DBD81/TFLiteBenchmark.app/ARC_PSD-001_1.1.122_bst_yl80201_float32.tflite]
INFO: Input layers: [in0]
INFO: Input shapes: [1,640,640,3]
INFO: #threads used for CPU inference: [4]
INFO: Loaded model /private/var/containers/Bundle/Application/6731B3C4-1818-4B08-977B-9D7C0C8DBD81/TFLiteBenchmark.app/ARC_PSD-001_1.1.122_bst_yl80201_float32.tflite
Initialized TensorFlow Lite runtime.
INFO: Initialized TensorFlow Lite runtime.
WARN: Tensor # 0 is named inputs_0 but flags call it in0
Created TensorFlow Lite XNNPACK delegate for CPU.
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
INFO: The input model file size (MB): 44.7677
INFO: Initialized session in 195.673ms.
INFO: Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
INFO: count=3 first=200505 curr=187763 min=185514 max=200505 avg=191261 std=6600

INFO: Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.

INFO: count=50 first=190545 curr=222754 min=190545 max=223493 avg=210610 std=9386

INFO: Inference timings in us: Init: 195673, First inference: 200505, Warmup (avg): 191261, Inference (avg): 210610
INFO: Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
INFO: Memory footprint delta from the start of the tool (MB): init=44.5 overall=116.626

FP32 model with CoreML delegate:

INFO: Log parameter values verbosely: [0]
INFO: Min num runs: [50]
INFO: Inter-run delay (seconds): [-1]
INFO: Num threads: [4]
INFO: Benchmark name: [arc_psd_001_fp32_benchmark]
INFO: Min warmup runs: [1]
INFO: Graph: [/private/var/containers/Bundle/Application/8336E5F7-0CCF-4333-9BD9-4CC385A1B930/TFLiteBenchmark.app/ARC_PSD-001_1.1.122_bst_yl80201_float32.tflite]
INFO: Input layers: [in0]
INFO: Input shapes: [1,640,640,3]
INFO: Use CoreML: [1]
INFO: #threads used for CPU inference: [4]
INFO: Loaded model /private/var/containers/Bundle/Application/8336E5F7-0CCF-4333-9BD9-4CC385A1B930/TFLiteBenchmark.app/ARC_PSD-001_1.1.122_bst_yl80201_float32.tflite
Initialized TensorFlow Lite runtime.
INFO: Initialized TensorFlow Lite runtime.
WARN: Tensor # 0 is named inputs_0 but flags call it in0
coreml_version must be 2 or 3. Setting to 3.
INFO: COREML delegate created.
CoreML delegate: 215 nodes delegated out of 253 nodes, with 15 partitions.
INFO: CoreML delegate: 215 nodes delegated out of 253 nodes, with 15 partitions.
INFO: Explicitly applied COREML delegate, and the model graph will be partially executed by the delegate w/ 13 delegate kernels.
Created TensorFlow Lite XNNPACK delegate for CPU.
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
INFO: The input model file size (MB): 44.7677
INFO: Initialized session in 3285.41ms.
INFO: Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
INFO: count=12 first=66600 curr=41255 min=40902 max=66600 avg=43699.3 std=6953

INFO: Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.

INFO: count=50 first=40886 curr=41157 min=40386 max=41937 avg=41173.8 std=325

INFO: Inference timings in us: Init: 3285407, First inference: 66600, Warmup (avg): 43699.3, Inference (avg): 41173.8
INFO: Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
INFO: Memory footprint delta from the start of the tool (MB): init=230.595 overall=303.407

jakubdolejs commented on May 3, 2024

Hi @sawantkumar,

I ran a test on the Pixel 4a with the different model and delegate combinations. I ran 50 iterations, but this time I included a warmup of 10 inference runs. The first few runs are still very slow. Is this to be expected?

How do you recommend handling the warmup in production? The app I'm building will need to run inference on a few images at a time, but it shouldn't take 3 seconds per image.

Here is the test function that produced the results in this CSV file:

@Test
fun testInferenceSpeed() {
    val context = InstrumentationRegistry.getInstrumentation().context
    val assetManager = context.assets
    // Input serialized as a float array in JSON
    val jsonFile = "face_on_iPad_001.jpg-flat.json"
    assetManager.open(jsonFile).use { inputStream ->
        val json = inputStream.bufferedReader().use { it.readText() }
        val floatArray = Json.decodeFromString<FloatArray>(json)
        // Models – float32 and float16
        val models = mapOf(Pair("fp32", "ARC_PSD-001_1.1.122_bst_yl80201_float32.tflite"), Pair("fp16","ARC_PSD-001_1.1.122_bst_yl80201_float16.tflite"))
        val options = arrayOf("gpu", "nnapi", "cpu", "xnnpack")
        val table = mutableMapOf<String,Array<Long>>()
        val runCount = 50
        val warmupRunCount = 10
        for (model in models.entries) {
            assetManager.open(model.value).use { modelInputStream ->
                // Copy the model from assets to the cache directory
                val modelFile = File(context.cacheDir, model.value)
                modelFile.outputStream().use { outputStream ->
                    modelInputStream.copyTo(outputStream)
                }
                for (option in options) {
                    val interpreterOptions = InterpreterApi.Options()
                    val compatibilityList = CompatibilityList()
                    when (option) {
                        "gpu" -> {
                            compatibilityList.use {
                                if (it.isDelegateSupportedOnThisDevice) {
                                    interpreterOptions.addDelegate(
                                        GpuDelegate(
                                            it.bestOptionsForThisDevice
                                        )
                                    )
                                }
                            }
                        }
                        "nnapi" -> {
                            if (android.os.Build.VERSION.SDK_INT >= android.os.Build.VERSION_CODES.P) {
                                interpreterOptions.addDelegate(NnApiDelegate())
                                interpreterOptions.useNNAPI = true
                            }
                        }
                        "cpu" -> {
                            interpreterOptions.numThreads =
                                Runtime.getRuntime().availableProcessors()
                            interpreterOptions.useXNNPACK = false
                        }

                        "xnnpack" -> {
                            interpreterOptions.numThreads =
                                Runtime.getRuntime().availableProcessors()
                            interpreterOptions.useXNNPACK = true
                        }
                        else -> throw IllegalArgumentException("Unknown option: $option")
                    }
                    InterpreterApi.create(modelFile, interpreterOptions)
                        .use { interpreterApi ->
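                            // Warm-up runs: not timed, intended to trigger delegate initialization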
                            for (i in 0 until warmupRunCount) {
                                interpreterApi.allocateTensors()
                                val input = FloatBuffer.wrap(floatArray)
                                val output =
                                    FloatBuffer.allocate(5 * 8400).also { it.rewind() }
                                interpreterApi.run(input, output)
                            }
                            val times = mutableListOf<Long>()
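                            // Timed runs: measure each inference call with measureTimeMillis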
                            for (i in 0 until runCount) {
                                interpreterApi.allocateTensors()
                                val input = FloatBuffer.wrap(floatArray)
                                val output =
                                    FloatBuffer.allocate(5 * 8400).also { it.rewind() }
                                val time = measureTimeMillis {
                                    interpreterApi.run(input, output)
                                }
                                times.add(time)
                            }
                            table.getOrPut("${model.key}-${option}") { times.toTypedArray() }
                        }
                }
            }
        }
        var csv = table.keys.map { "\"$it\"" }.joinToString(",")
        val rowCount = table.values.map { it.size }.min()
        for (i in 0 until rowCount) {
            csv += "\n"
            csv += table.keys.map { table[it]!![i].toString() }.joinToString(",")
        }
        File(context.cacheDir, "inference_speed.csv").outputStream().use { fileOutputStream ->
            OutputStreamWriter(fileOutputStream).use { outputStreamWriter ->
                outputStreamWriter.write(csv)
            }
        }
    }
}

sawantkumar commented on May 3, 2024

Hi @jakubdolejs,

Apologies for the delay; I wasn't available over the weekend. After analyzing the iOS numbers, it's evident that the Core ML delegate on the iPhone 12 Mini outperforms the GPU delegate on the Pixel 4a by approximately 7x for fp32 models. Additionally, the iPhone 12 Mini's CPU executes models roughly 2x faster than the Pixel 4a's CPU. These results clearly indicate that the iPhone 12 Mini offers faster model execution both on CPU and GPU compared to the Pixel 4a.

However, if you're aiming to maximize performance on your Pixel device, consider utilizing its DSP. Please check that third-party access to the DSP is permitted on the Pixel phone, then optimize performance using Qualcomm's SNPE (Snapdragon Neural Processing Engine).

sawantkumar commented on May 3, 2024

Also, regarding handling GPU warm-up runs in production: in my experience the first few inference runs with the Android TFLite GPU delegate can be slower because of initialization overhead. When inference runs for the first few times, TensorFlow Lite needs to initialize various components, such as loading the model, allocating memory, and setting up the GPU context, and this initialization can make the first few inferences slower. To handle this in production you can perform the GPU warm-up runs during the app's startup, for example an inference loop of 50 or 100 iterations on dummy data, as sketched below. Please let me know if you have any further issues or questions.
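
A minimal sketch of such a startup warm-up, assuming you keep one long-lived interpreter and reuse it for every image; the buffer sizes below simply mirror the [1,640,640,3] input and 5x8400 output from your test code:

import org.tensorflow.lite.InterpreterApi
import java.nio.FloatBuffer

// Sketch: run a number of throwaway inferences on dummy data at app startup so the
// delegate's one-time initialization cost is paid before the first real image arrives.
fun warmUp(interpreter: InterpreterApi, iterations: Int = 50) {
    val dummyInput = FloatBuffer.allocate(1 * 640 * 640 * 3) // matches the model's input shape
    val dummyOutput = FloatBuffer.allocate(5 * 8400)         // matches the model's output size
    repeat(iterations) {
        dummyInput.rewind()
        dummyOutput.rewind()
        interpreter.run(dummyInput, dummyOutput)
    }
}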

jakubdolejs commented on May 3, 2024

Hello @sawantkumar,

I've done some more testing and profiling. I built an Android app that lets me change between the FP16 and FP32 models and toggle the different options. Here are my findings:

  • The initial slowdown only happens on some devices (e.g., Pixel 4a, Elo touch). On other devices (Pixel 6, Galaxy Tab S6 Lite) the inference runs at a consistent speed from start to finish with any given options.
  • The initial slowdown happens regardless of which delegates are used.
  • I tried the same model converted to NCNN and it runs at a consistent speed on every device. The speed is comparable to the TfLite model after the "warmup".
  • The slower devices I mentioned, like the Pixel 4a, run the first 50 or so iterations at about 3500 ms per inference, after which the speed improves to about 400–500 ms per inference. Even this is not consistent: sometimes the inference keeps running slowly, at over 3000 ms, even after hundreds of iterations. For comparison, the faster devices run inference at roughly 300 ms from the get-go.
  • On iOS the story is slightly different: it takes about 2 seconds to load the model, but afterwards the inference runs consistently at about 70 ms (with the CoreML delegate).

From using NCNN I can see that even the underpowered devices don't require a warmup to run at acceptable speeds. I believe there may be a bug in TfLite. It shouldn't take 3 minutes to "warm up".

Would you like me to file a separate issue with a bug report or can you escalate this one?
