nehbit commented on May 27, 2024

I can confirm this is also the case for two AMD GPUs I've tested on different machines. Both were much, much slower than running the same thing on the CPU.

I spent some time this evening properly setting up an Anaconda environment with this in it, and fed it a nontrivial task. I can confirm that it gives me about a 3x speed boost with my GPU at about 60% utilisation. I suspect a larger task would get closer to 100% utilisation, which would give us the expected ~5x speedup over the CPU. So in my set-up at least, it is now working correctly. Just make sure you add these two lines at the beginning of the file you're running:

from tensorflow.python.compiler.mlcompute import mlcompute
mlcompute.set_mlc_device(device_name = 'gpu')

And ignore the fact that TF still tells you that there is no GPU present even after these lines.
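
For switching between CPU and GPU runs when benchmarking, the two lines above can be wrapped in a small helper. This is only a sketch: `select_mlc_device` is a hypothetical name, and the accepted values ('cpu', 'gpu', 'any') are what I recall the fork's README listing, so treat them as an assumption. On a TensorFlow build without the mlcompute module it simply reports failure instead of crashing.

```python
def select_mlc_device(device_name: str = "gpu") -> bool:
    """Try to route ML Compute to `device_name`; return True on success.

    The mlcompute module is only shipped with Apple's tensorflow_macos fork,
    so on any other build (or without TensorFlow) this returns False.
    """
    # Assumed set of valid names; adjust if your build documents others.
    if device_name not in ("cpu", "gpu", "any"):
        raise ValueError(f"unexpected device name: {device_name!r}")
    try:
        from tensorflow.python.compiler.mlcompute import mlcompute
    except ImportError:
        return False
    mlcompute.set_mlc_device(device_name=device_name)
    return True


if __name__ == "__main__":
    print("ML Compute device set:", select_mlc_device("gpu"))
```

Calling it once at the top of the script, before any model is built, keeps the CPU and GPU runs otherwise identical.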

Here's the GPU vs CPU comparison I got on a simple MNIST image classification task from the official TF2 models library:

CPU:

Epoch 1/10
58/58 [==============================] - 44s 731ms/step - loss: 2.2478 - sparse_categorical_accuracy: 0.2123 - val_loss: 1.6934 - val_sparse_categorical_accuracy: 0.7380
Epoch 2/10
58/58 [==============================] - 41s 712ms/step - loss: 1.3346 - sparse_categorical_accuracy: 0.6389 - val_loss: 0.5675 - val_sparse_categorical_accuracy: 0.8169
Epoch 3/10
58/58 [==============================] - 42s 722ms/step - loss: 0.6407 - sparse_categorical_accuracy: 0.7925 - val_loss: 0.3464 - val_sparse_categorical_accuracy: 0.9036
Epoch 4/10
58/58 [==============================] - 42s 719ms/step - loss: 0.4668 - sparse_categorical_accuracy: 0.8519 - val_loss: 0.3279 - val_sparse_categorical_accuracy: 0.8989
Epoch 5/10
58/58 [==============================] - 42s 728ms/step - loss: 0.4090 - sparse_categorical_accuracy: 0.8706 - val_loss: 0.2688 - val_sparse_categorical_accuracy: 0.9206
Epoch 6/10
58/58 [==============================] - 43s 739ms/step - loss: 0.3439 - sparse_categorical_accuracy: 0.8930 - val_loss: 0.2169 - val_sparse_categorical_accuracy: 0.9355
Epoch 7/10
58/58 [==============================] - 42s 727ms/step - loss: 0.3048 - sparse_categorical_accuracy: 0.9069 - val_loss: 0.1968 - val_sparse_categorical_accuracy: 0.9423
Epoch 8/10
58/58 [==============================] - 43s 744ms/step - loss: 0.2650 - sparse_categorical_accuracy: 0.9180 - val_loss: 0.2029 - val_sparse_categorical_accuracy: 0.9393
Epoch 9/10
58/58 [==============================] - 42s 733ms/step - loss: 0.2947 - sparse_categorical_accuracy: 0.9077 - val_loss: 0.1733 - val_sparse_categorical_accuracy: 0.9486
Epoch 10/10
58/58 [==============================] - 43s 746ms/step - loss: 0.2352 - sparse_categorical_accuracy: 0.9256 - val_loss: 0.1637 - val_sparse_categorical_accuracy: 0.9484

GPU: (AMD Radeon RX Vega 64, 8GB)

Epoch 1/10
58/58 [==============================] - 21s 278ms/step - loss: 2.0568 - sparse_categorical_accuracy: 0.2967 - val_loss: 0.5769 - val_sparse_categorical_accuracy: 0.8364
Epoch 2/10
58/58 [==============================] - 15s 258ms/step - loss: 0.5700 - sparse_categorical_accuracy: 0.8216 - val_loss: 0.2908 - val_sparse_categorical_accuracy: 0.9163
Epoch 3/10
58/58 [==============================] - 15s 254ms/step - loss: 0.3343 - sparse_categorical_accuracy: 0.9014 - val_loss: 0.2121 - val_sparse_categorical_accuracy: 0.9417
Epoch 4/10
58/58 [==============================] - 15s 255ms/step - loss: 0.2452 - sparse_categorical_accuracy: 0.9273 - val_loss: 0.1726 - val_sparse_categorical_accuracy: 0.9486
Epoch 5/10
58/58 [==============================] - 15s 254ms/step - loss: 0.2032 - sparse_categorical_accuracy: 0.9394 - val_loss: 0.1475 - val_sparse_categorical_accuracy: 0.9572
Epoch 6/10
58/58 [==============================] - 15s 253ms/step - loss: 0.1784 - sparse_categorical_accuracy: 0.9468 - val_loss: 0.1266 - val_sparse_categorical_accuracy: 0.9625
Epoch 7/10
58/58 [==============================] - 15s 255ms/step - loss: 0.1600 - sparse_categorical_accuracy: 0.9515 - val_loss: 0.1157 - val_sparse_categorical_accuracy: 0.9659
Epoch 8/10
58/58 [==============================] - 15s 253ms/step - loss: 0.1431 - sparse_categorical_accuracy: 0.9573 - val_loss: 0.1016 - val_sparse_categorical_accuracy: 0.9679
Epoch 9/10
58/58 [==============================] - 15s 254ms/step - loss: 0.1322 - sparse_categorical_accuracy: 0.9603 - val_loss: 0.0919 - val_sparse_categorical_accuracy: 0.9715
Epoch 10/10
58/58 [==============================] - 15s 258ms/step - loss: 0.1181 - sparse_categorical_accuracy: 0.9650 - val_loss: 0.0827 - val_sparse_categorical_accuracy: 0.9750
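
To put a number on the speedup, the per-step times from the two runs above can be averaged. A minimal sketch: the regular expression and the helper name `mean_step_ms` are my own, and the values are copied by hand from the ten epochs of each log.

```python
import re

# Matches the per-step time in a Keras progress line, e.g. "731ms/step".
STEP_TIME = re.compile(r"(\d+)ms/step")


def mean_step_ms(log_text: str) -> float:
    """Average all ms/step values found in a block of Keras progress output."""
    values = [int(m) for m in STEP_TIME.findall(log_text)]
    if not values:
        raise ValueError("no 'ms/step' entries found")
    return sum(values) / len(values)


# Per-step times copied from the CPU and GPU logs above.
cpu_log = " ".join(f"{t}ms/step" for t in (731, 712, 722, 719, 728, 739, 727, 744, 733, 746))
gpu_log = " ".join(f"{t}ms/step" for t in (278, 258, 254, 255, 254, 253, 255, 253, 254, 258))

speedup = mean_step_ms(cpu_log) / mean_step_ms(gpu_log)
print(f"CPU {mean_step_ms(cpu_log):.1f} ms/step, "
      f"GPU {mean_step_ms(gpu_log):.1f} ms/step, speedup {speedup:.2f}x")
```

That works out to roughly 2.8x, in line with the "about 3x" figure.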

As an aside, I find it funny that Apple managed to do this for AMD graphics cards before AMD did. Right now, the only way to use an AMD card in a real ML (in TF) workflow is to stick it into a Mac, since AMD's own effort, ROCm, is still fairly unfinished. Interesting times!

Major kudos to whichever engineering team inside Apple managed to pull this off.

nehbit commented on May 27, 2024

@bryanlimy It wasn't anything special, but here it goes:

  • Install Anaconda normally
  • Create a Python 3.8 environment in it
  • Enter the Anaconda env you just created
  • Grab the zipped copy of tf-macos here: (I got this link from the shell script in this repo)
  • Extract it and go to the x86_64 folder; there will be 7-8 .whl files
  • Their names will be in the format: tensorflow_macos-0.1a0-cp38-cp38-macosx_11_0_x86_64.whl
  • If you try to install these with pip install [filename], it will fail, saying that they aren't for your OS. This was a head-scratcher. You need to replace 11_0 in the filenames with 10_16. Example: tensorflow_macos-0.1a0-cp38-cp38-macosx_10_16_x86_64.whl (funnily enough, it seems we've collectively decided to call Big Sur 10.16, not 11.0)
  • Install all of them with pip. If you get dependency warnings about non-matching versions, fix them by downloading the right versions; pip will tell you which versions satisfy the requirement.
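
The renaming step can be scripted instead of done by hand. A rough sketch, assuming the wheels sit in a local folder; the function names and the folder name in the usage note are placeholders of mine, not something from the repo's script.

```python
from pathlib import Path


def retag_wheel_name(filename: str,
                     old_tag: str = "macosx_11_0",
                     new_tag: str = "macosx_10_16") -> str:
    """Return the wheel filename with the platform tag swapped."""
    return filename.replace(old_tag, new_tag)


def retag_wheels(folder: str) -> list:
    """Rename every .whl in `folder` that carries the old platform tag.

    Returns the list of new filenames so they can be passed to pip.
    """
    renamed = []
    for wheel in sorted(Path(folder).glob("*.whl")):
        new_name = retag_wheel_name(wheel.name)
        if new_name != wheel.name:
            wheel.rename(wheel.with_name(new_name))
            renamed.append(new_name)
    return renamed
```

Something like retag_wheels("x86_64") from the extracted directory should then leave files that pip accepts.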

That's pretty much it. I did this after spending a good hour trying to get scikit-learn installed in the virtualenv created by the script in this repo; scikit-learn is needed to run anything nontrivial.

sevenold commented on May 27, 2024

After increasing the model parameters, the GPU is faster than the CPU.

tranbach commented on May 27, 2024

Same for me using the latest MacBook Pro 16. I trained a couple of epochs of VGG19: the GPU version takes 49 seconds, the CPU version takes 7 seconds, TensorFlow 2.3.1 takes 6 seconds, and PlaidML takes 2 seconds. I thought AMD GPUs were supported through Metal, as in PlaidML ...

I get the following messages from tensorflow:

2020-11-18 20:29:50.600229: I tensorflow/core/platform/] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2020-11-18 20:29:52.421434: I tensorflow/compiler/mlir/] None of the MLIR optimization passes are enabled (registered 2)

dkgaraujo commented on May 27, 2024

Hi! For R users, I created some benchmark code to compare tf-mac on CPU or GPU, as well as GPU-accelerated plaidml. You can find the code here:

lostmsu commented on May 27, 2024

To everyone in this thread: you need to understand that the MNIST dataset and the model used in the example are too small to benefit from a GPU, because the cost of moving data between the CPU and the GPU outweighs any potential performance gain.

To see a real difference (if any) you can:

  1. Increase number of layers.
  2. Increase layer widths (e.g. for dense layers - number of units, for convolutions - number of channels).
  3. Increase batch size to 64 (typical for many training tasks).

Try replacing the last cell with

model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
  tf.keras.layers.Dense(10, activation='softmax')
])

P.S. I don't have a compatible Mac myself, but interested in the results from various machines.


With these settings I get about 10ms/step on my Titan V and ~87% GPU load according to nvidia-smi.
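
The transfer-overhead argument can be made concrete with a toy cost model. This is purely illustrative: the 10x raw speedup and the 5 ms fixed per-step overhead are made-up numbers, not measurements from any machine in this thread.

```python
def step_time_ms(compute_ms: float, overhead_ms: float = 0.0) -> float:
    """Time for one training step: fixed overhead plus compute."""
    return overhead_ms + compute_ms


def gpu_wins(cpu_compute_ms: float, gpu_speedup: float, overhead_ms: float) -> bool:
    """True if the GPU step (compute / speedup + overhead) beats the CPU step."""
    gpu_ms = step_time_ms(cpu_compute_ms / gpu_speedup, overhead_ms)
    return gpu_ms < step_time_ms(cpu_compute_ms)


# Hypothetical numbers: 10x raw GPU speedup, 5 ms fixed overhead per step.
for cpu_ms in (2.0, 10.0, 700.0):
    verdict = "GPU faster" if gpu_wins(cpu_ms, 10.0, 5.0) else "CPU faster"
    print(f"CPU step {cpu_ms:6.1f} ms -> {verdict}")
```

Under these assumptions a 2 ms CPU step loses on the GPU, while anything above roughly 5.6 ms per step starts to win, which is why more layers, wider layers, and larger batches shift the balance.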

nehbit commented on May 27, 2024

I can confirm this is also the case for two AMD GPUs I've tested on different machines. Both were much, much slower than running the same thing on the CPU.

tzm41 commented on May 27, 2024

Same here: testing MNIST with the sample CNN on my MBP 16 2019 with an AMD Radeon Pro 5500M, it seems to get stuck between batches.

sevenold commented on May 27, 2024

Same, testing MNIST with the sample CNN on my MBP 16 2019 with an AMD Radeon Pro 5300M:
WARNING:tensorflow:Eager mode on GPU is extremely slow. Consider to use CPU instead

JL1829 commented on May 27, 2024

Same, I get the two warnings below:

  1. WARNING:tensorflow:Eager mode on GPU is extremely slow. Consider to use CPU instead
  2. 2020-11-19 14:42:26.766401: I tensorflow/compiler/mlir/] None of the MLIR optimization passes are enabled (registered 2)

And I can see in Activity Monitor that no GPU was utilised.

qixiang109 commented on May 27, 2024

Your warning says that "Eager mode on GPU is extremely slow"; try tf.compat.v1.disable_eager_execution()

JL1829 commented on May 27, 2024

Tried it, same result.

sevenold commented on May 27, 2024

MBP 16 2019


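#!/usr/bin/env python
# coding: utf-8
import tensorflow.compat.v2 as tf
import tensorflow_datasets as tfds

from tensorflow.python.framework.ops import disable_eager_execution
from tensorflow.python.compiler.mlcompute import mlcompute

# Assumed: these two calls follow from the imports above and from the
# commented-out lines in the CPU run.
disable_eager_execution()
mlcompute.set_mlc_device(device_name='gpu')

# Assumed: every argument except `split` follows the standard TFDS MNIST example.
(ds_train, ds_test), ds_info = tfds.load(
    'mnist',
    split=['train', 'test'],
    as_supervised=True,
    with_info=True,
)

def normalize_img(image, label):
  """Normalizes images: `uint8` -> `float32`."""
  return tf.cast(image, tf.float32) / 255., label

# Assumed: map() with normalize_img and AUTOTUNE prefetching on both splits.
ds_train = ds_train.map(normalize_img)
ds_train = ds_train.cache()
ds_train = ds_train.shuffle(ds_info.splits['train'].num_examples)
ds_train = ds_train.batch(128)
ds_train = ds_train.prefetch(tf.data.experimental.AUTOTUNE)

ds_test = ds_test.map(normalize_img)
ds_test = ds_test.batch(128)
ds_test = ds_test.cache()
ds_test = ds_test.prefetch(tf.data.experimental.AUTOTUNE)

model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
  tf.keras.layers.Dense(10, activation='softmax')
])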

GPU: (AMD Radeon Pro 5300M 4 GB)




CPU: (2.6 GHz 6-core Intel Core i7)


# from tensorflow.python.compiler.mlcompute import mlcompute
# mlcompute.set_mlc_device(device_name='gpu')





nehbit commented on May 27, 2024

Interesting, I ran your code on both GPU and CPU, my results are similar: in your task, the CPU seems faster. That said, if I had to make a completely uneducated guess, I'd say your task is small enough that moving data from CPU to GPU for processing is taking the lion's share of time spent, and the GPU is spending most of the time waiting. The GPU power consumption I get on my task is around 300W, in yours it barely goes above idle at 50W.

This is the model I used on my test, but I suspect even this is too fast per epoch. Might want to give it a shot:

This is what I got with your code:

GPU:

Epoch 1/10
469/469 [==============================] - 7s 11ms/step - loss: 0.6087 - accuracy: 0.8330 - val_loss: 0.1977 - val_accuracy: 0.9423
Epoch 2/10
469/469 [==============================] - 4s 9ms/step - loss: 0.1781 - accuracy: 0.9501 - val_loss: 0.1355 - val_accuracy: 0.9593
Epoch 3/10
469/469 [==============================] - 4s 9ms/step - loss: 0.1220 - accuracy: 0.9647 - val_loss: 0.1126 - val_accuracy: 0.9680
Epoch 4/10
469/469 [==============================] - 4s 9ms/step - loss: 0.0943 - accuracy: 0.9739 - val_loss: 0.0924 - val_accuracy: 0.9724
Epoch 5/10
469/469 [==============================] - 4s 9ms/step - loss: 0.0779 - accuracy: 0.9771 - val_loss: 0.0833 - val_accuracy: 0.9744
Epoch 6/10
469/469 [==============================] - 4s 9ms/step - loss: 0.0603 - accuracy: 0.9835 - val_loss: 0.0794 - val_accuracy: 0.9756
Epoch 7/10
469/469 [==============================] - 4s 9ms/step - loss: 0.0495 - accuracy: 0.9859 - val_loss: 0.0743 - val_accuracy: 0.9771
Epoch 8/10
469/469 [==============================] - 4s 9ms/step - loss: 0.0424 - accuracy: 0.9883 - val_loss: 0.0687 - val_accuracy: 0.9790
Epoch 9/10
469/469 [==============================] - 4s 9ms/step - loss: 0.0359 - accuracy: 0.9899 - val_loss: 0.0713 - val_accuracy: 0.9779
Epoch 10/10
469/469 [==============================] - 4s 9ms/step - loss: 0.0289 - accuracy: 0.9924 - val_loss: 0.0707 - val_accuracy: 0.9777

CPU:

Epoch 1/10
469/469 [==============================] - 4s 4ms/step - loss: 0.6033 - accuracy: 0.8359 - val_loss: 0.1923 - val_accuracy: 0.9457
Epoch 2/10
469/469 [==============================] - 1s 2ms/step - loss: 0.1792 - accuracy: 0.9499 - val_loss: 0.1379 - val_accuracy: 0.9605
Epoch 3/10
469/469 [==============================] - 1s 2ms/step - loss: 0.1238 - accuracy: 0.9645 - val_loss: 0.1093 - val_accuracy: 0.9671
Epoch 4/10
469/469 [==============================] - 1s 2ms/step - loss: 0.0929 - accuracy: 0.9737 - val_loss: 0.0967 - val_accuracy: 0.9707
Epoch 5/10
469/469 [==============================] - 1s 2ms/step - loss: 0.0748 - accuracy: 0.9778 - val_loss: 0.0845 - val_accuracy: 0.9738
Epoch 6/10
469/469 [==============================] - 1s 2ms/step - loss: 0.0608 - accuracy: 0.9825 - val_loss: 0.0764 - val_accuracy: 0.9769
Epoch 7/10
469/469 [==============================] - 1s 2ms/step - loss: 0.0528 - accuracy: 0.9853 - val_loss: 0.0768 - val_accuracy: 0.9764
Epoch 8/10
469/469 [==============================] - 1s 2ms/step - loss: 0.0428 - accuracy: 0.9886 - val_loss: 0.0849 - val_accuracy: 0.9722
Epoch 9/10
469/469 [==============================] - 1s 2ms/step - loss: 0.0350 - accuracy: 0.9905 - val_loss: 0.0752 - val_accuracy: 0.9771
Epoch 10/10
469/469 [==============================] - 1s 2ms/step - loss: 0.0297 - accuracy: 0.9923 - val_loss: 0.0730 - val_accuracy: 0.9787

bryanlimy commented on May 27, 2024

@sevenold I don't think we have to disable eager execution. Also, 2ms/step on CPU might suggest that the task is too small and the overhead of transferring data between GPU and CPU is larger than the speedup you would get from a GPU. Maybe try setting up a larger network?

EDIT: I think this model is way too small to see any benefits from the GPU

model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
  tf.keras.layers.Dense(10, activation='softmax')
])

@nehbit do you mind sharing your steps to install tensorflow_macos with conda instead of virtualenv?

chandc commented on May 27, 2024

I have been able to reproduce the MNIST CNN runtime results on an MBP 16-inch, 2019, with 32GB RAM, an AMD Radeon Pro 5500M 8GB, Big Sur version 11.0.1, and Python 3.8.6.

CPU = 132 secs
GPU = 105 secs
Colab = 23 secs

Outputs are provided below:

Train: X=(60000, 28, 28), y=(60000,)
Test: X=(10000, 28, 28), y=(10000,)
Model: "sequential"
Layer (type)                 Output Shape              Param #
conv2d (Conv2D)              (None, 26, 26, 32)        320
max_pooling2d (MaxPooling2D) (None, 13, 13, 32)        0
conv2d_1 (Conv2D)            (None, 12, 12, 32)        4128
max_pooling2d_1 (MaxPooling2 (None, 6, 6, 32)          0
flatten (Flatten)            (None, 1152)              0
dense (Dense)                (None, 500)               576500
dense_1 (Dense)              (None, 10)                5010
Total params: 585,958
Trainable params: 585,958
Non-trainable params: 0
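
As a sanity check on how small this model is, the parameter counts in the summary can be reproduced by hand. The kernel sizes below (3x3 for the first convolution, 2x2 for the second) are inferred from the output shapes, assuming 'valid' padding and stride 1; they are not stated in the post.

```python
def conv2d_params(kernel_h: int, kernel_w: int, in_channels: int, filters: int) -> int:
    """Weights plus one bias per filter for a Conv2D layer."""
    return (kernel_h * kernel_w * in_channels + 1) * filters


def dense_params(in_units: int, out_units: int) -> int:
    """Weights plus one bias per output unit for a Dense layer."""
    return (in_units + 1) * out_units


layer_params = {
    "conv2d": conv2d_params(3, 3, 1, 32),     # 28x28 -> 26x26 implies a 3x3 kernel
    "conv2d_1": conv2d_params(2, 2, 32, 32),  # 13x13 -> 12x12 implies a 2x2 kernel
    "dense": dense_params(6 * 6 * 32, 500),   # flattened 6x6x32 = 1152 inputs
    "dense_1": dense_params(500, 10),
}
total_params = sum(layer_params.values())

for name, count in layer_params.items():
    print(f"{name:10s} {count:>8,d}")
print(f"{'total':10s} {total_params:>8,d}")
```

The per-layer numbers match the summary (320, 4128, 576500, 5010), and almost all of the 585,958 parameters sit in the first dense layer.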

CPU
Epoch 1/10
469/469 [==============================] - 13s 27ms/step - loss: 0.3570 - accuracy: 0.8916
Epoch 2/10
469/469 [==============================] - 13s 28ms/step - loss: 0.0474 - accuracy: 0.9847
Epoch 3/10
469/469 [==============================] - 13s 28ms/step - loss: 0.0279 - accuracy: 0.9914
Epoch 4/10
469/469 [==============================] - 13s 28ms/step - loss: 0.0203 - accuracy: 0.9940
Epoch 5/10
469/469 [==============================] - 13s 28ms/step - loss: 0.0142 - accuracy: 0.9954
Epoch 6/10
469/469 [==============================] - 13s 28ms/step - loss: 0.0111 - accuracy: 0.9964
Epoch 7/10
469/469 [==============================] - 13s 28ms/step - loss: 0.0099 - accuracy: 0.9967
Epoch 8/10
469/469 [==============================] - 13s 28ms/step - loss: 0.0073 - accuracy: 0.9977
Epoch 9/10
469/469 [==============================] - 13s 28ms/step - loss: 0.0077 - accuracy: 0.9976
Epoch 10/10
469/469 [==============================] - 13s 28ms/step - loss: 0.0054 - accuracy: 0.9983
Time: 132.01292276382446
Accuracy: 0.989

GPU with eager execution disabled
Epoch 1/10
60000/60000 [==============================] - 10s 172us/sample - loss: 0.1522 - accuracy: 0.9535
Epoch 2/10
60000/60000 [==============================] - 10s 171us/sample - loss: 0.0465 - accuracy: 0.9854
Epoch 3/10
60000/60000 [==============================] - 10s 169us/sample - loss: 0.0311 - accuracy: 0.9906
Epoch 4/10
60000/60000 [==============================] - 10s 169us/sample - loss: 0.0213 - accuracy: 0.9931
Epoch 5/10
60000/60000 [==============================] - 10s 170us/sample - loss: 0.0157 - accuracy: 0.9950
Epoch 6/10
60000/60000 [==============================] - 10s 174us/sample - loss: 0.0117 - accuracy: 0.9963
Epoch 7/10
60000/60000 [==============================] - 11s 184us/sample - loss: 0.0093 - accuracy: 0.9969
Epoch 8/10
60000/60000 [==============================] - 11s 179us/sample - loss: 0.0087 - accuracy: 0.9972
Epoch 9/10
60000/60000 [==============================] - 11s 176us/sample - loss: 0.0076 - accuracy: 0.9974
Epoch 10/10
60000/60000 [==============================] - 11s 176us/sample - loss: 0.0059 - accuracy: 0.9981
Time: 104.61961388587952
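
Note that the CPU log reports ms/step while the GPU log (eager execution disabled) reports us/sample, so the two are not directly comparable as printed. A small conversion sketch, assuming a batch size of 128; that value is inferred from the 469 steps over 60,000 samples and is not stated in the post.

```python
import math


def steps_per_epoch(num_samples: int, batch_size: int) -> int:
    """Number of batches needed to cover the dataset once."""
    return math.ceil(num_samples / batch_size)


def ms_per_step_to_us_per_sample(ms_per_step: float, batch_size: int) -> float:
    """Convert a per-batch time in milliseconds to a per-sample time in microseconds."""
    return ms_per_step * 1000.0 / batch_size


BATCH_SIZE = 128  # assumption: ceil(60000 / 128) == 469, matching the CPU log
cpu_us_per_sample = ms_per_step_to_us_per_sample(28, BATCH_SIZE)
gpu_us_per_sample = 172.0  # taken directly from the first GPU epoch

print(steps_per_epoch(60000, BATCH_SIZE), "steps per epoch")
print(f"CPU {cpu_us_per_sample:.2f} us/sample vs GPU {gpu_us_per_sample:.2f} us/sample")
```

On that basis the CPU works out to about 219 us/sample against 172 us/sample on the GPU, which is consistent with the 132 s vs 105 s totals.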

Colab
Epoch 1/10
469/469 [==============================] - 2s 4ms/step - loss: 0.1515 - accuracy: 0.9537
Epoch 2/10
469/469 [==============================] - 2s 3ms/step - loss: 0.0464 - accuracy: 0.9855
Epoch 3/10
469/469 [==============================] - 2s 3ms/step - loss: 0.0307 - accuracy: 0.9901
Epoch 4/10
469/469 [==============================] - 2s 3ms/step - loss: 0.0215 - accuracy: 0.9929
Epoch 5/10
469/469 [==============================] - 2s 3ms/step - loss: 0.0155 - accuracy: 0.9950
Epoch 6/10
469/469 [==============================] - 2s 3ms/step - loss: 0.0127 - accuracy: 0.9961
Epoch 7/10
469/469 [==============================] - 2s 3ms/step - loss: 0.0096 - accuracy: 0.9967
Epoch 8/10
469/469 [==============================] - 2s 3ms/step - loss: 0.0071 - accuracy: 0.9977
Epoch 9/10
469/469 [==============================] - 2s 3ms/step - loss: 0.0079 - accuracy: 0.9973
Epoch 10/10
469/469 [==============================] - 2s 3ms/step - loss: 0.0062 - accuracy: 0.9977
Time: 23.20409369468689
Accuracy: 0.990
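
For anyone who wants to produce comparable "Time:" figures on their own machine, here is one way to take a wall-clock measurement around a training call. It is a generic sketch, not the code used for the numbers above; `wall_clock` is a name I made up, and the toy workload stands in for model.fit(...).

```python
import time
from contextlib import contextmanager


@contextmanager
def wall_clock(label: str, results: dict):
    """Record the elapsed wall-clock seconds of the with-block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}
with wall_clock("toy workload", timings):
    # Stand-in for model.fit(...); replace with the real training call.
    sum(i * i for i in range(100_000))

for label, seconds in timings.items():
    print(f"Time ({label}): {seconds:.3f} s")
```

Using the same wrapper for the CPU and GPU runs keeps data loading and compilation inside the measured interval in both cases.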

qixiang109 commented on May 27, 2024

@nehbit Can you try a larger model such as ResNet50? And by the way, which AMD card do you use?

Akshaysehgal2005 commented on May 27, 2024

@lostmsu Sadly, I still couldn't replicate the results you mention above. I used around 53M parameters and still see a drastic difference between CPU and GPU speeds (the CPU being much faster). I do see GPU utilization in Activity Monitor when the device is set to GPU, though. Maybe it's because of the 'compatible Macs' you mention? Is there a list of such devices? From what I saw, the requirements only mention Big Sur 11.

anna-tikhonova commented on May 27, 2024

To provide an update:
VGG19: We've identified the issue and the fix will be in the next update.
MNIST: We will investigate and report back.

giordan12 commented on May 27, 2024

Related to this issue, has anyone else attempted to use a I am using one and my training times using the GPU are much slower than the CPU. I'm registering 797ms using the CPU and 386s using the GPU.

dkgaraujo commented on May 27, 2024

Hi, @giordan12: the comparatively slow GPU performance you are seeing is very likely not related to the specific source of the data ( in this case), but to the fact that either your dataset or your batch_size is too small to get good performance out of the GPU. Remember that GPUs are massively parallel calculation machines, so as a rough rule, to take advantage of GPU acceleration you want to send them batches as large as they can handle.

For more discussion and results on this, please see #25 (comment) and associated thread.

jkleckner commented on May 27, 2024

@giordan12 It would be interesting to view the gpu usage as done in that other thread.

anna-tikhonova commented on May 27, 2024

@tranbach Could you tell us what batch size you were using to train VGG19?

tranbach commented on May 27, 2024

@anna-tikhonova Batch size of 64.

quangtvdevnet commented on May 27, 2024

This add-on runs very slowly on my Mac (2019 Pro 16-inch, AMD Radeon Pro 5500M 4GB). The Mac hangs and has to be restarted...

I am running a ResNet network.

