First tests using this fork, running model training against Cifar10 dataset for benchm

Model training on cpu (Intel) throws seg fault about tensorflow_macos HOT 11 OPEN

apple commented on May 27, 2024

Model training on cpu (Intel) throws seg fault

from tensorflow_macos.

Comments (11)

tux-o-matic commented on May 27, 2024 1

Thanks @atw1020, indeed reducing the batch size in my benchmark allows epochs to complete on the CPU.
It's an interesting behaviour.
I don't expect to be able to use large batch sizes on a laptop with an integrated GPU, where so much memory is shared.
Still, it's surprising that TF with CoreML is so limited on the CPU, yet the GPU with the same memory can handle larger batch sizes.
For reference, the original benchmark used 32 as batch size, which worked only on the GPU. Taking it down to 16 works on the CPU (20 is too high, it crashes again).
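
The threshold described above (16 works, 20 crashes) could be narrowed down automatically. The following is a minimal sketch, not code from this thread. It assumes crashes are monotonic in batch size (every size above some limit fails), and it runs each trial in a separate Python process, since a segfault would otherwise kill the search itself. The names `trial_succeeds` and `largest_working_batch_size`, and the convention that the training script reads the batch size from its first command-line argument, are hypothetical.

```python
import subprocess
import sys


def trial_succeeds(script, batch_size, timeout=None):
    """Run `script <batch_size>` in a child process; True if it exits cleanly.

    A segfault in the child shows up as a non-zero return code (negative
    on POSIX) instead of taking down this process.
    Assumption: the training script reads the batch size from sys.argv[1].
    """
    try:
        result = subprocess.run(
            [sys.executable, script, str(batch_size)],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


def largest_working_batch_size(works, low, high):
    """Bisect for the largest batch size in [low, high] where works(n) is True.

    Assumes works() is monotonic: True up to some limit, False above it.
    Returns None if even `low` fails.
    """
    if not works(low):
        return None
    if works(high):
        return high
    # Invariant: works(low) is True and works(high) is False.
    while high - low > 1:
        mid = (low + high) // 2
        if works(mid):
            low = mid
        else:
            high = mid
    return low
```

With a training script adapted to accept the batch size as an argument, a call such as `largest_working_batch_size(lambda n: trial_succeeds("cifar10_cnn.py", n), 1, 32)` would run roughly six short trainings rather than one per candidate size. Limiting each trial to a single epoch keeps the search quick.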

anna-tikhonova commented on May 27, 2024

@tux-o-matic Thank you for reporting this issue. Could you, please, point us to or attach an example you are running? This way, we can reproduce this issue locally and investigate.

tux-o-matic commented on May 27, 2024

Hi @anna-tikhonova, I use this Python code.
It just needs the TF fork and NumPy:

python cifar10_cnn.py

In my case, on a MacBook Air with Intel chips, the backend seems to choose the CPU by default and then throws the error.
However, if I specify

import tensorflow
from tensorflow.python.compiler.mlcompute import mlcompute

# Pin ML Compute to the GPU instead of the default device
mlcompute.set_mlc_device(device_name='gpu')
tensorflow.config.run_functions_eagerly(False)

Then the model gets trained, and I can see in Activity Monitor that Python threads are offloading work to the GPU. But on this integrated Intel GPU the performance is worse than on the CPU, and even PlaidML as a backend for TF could do better on the GPU.
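
For scripts that should also run on a stock TensorFlow install, the device selection above could be wrapped in a guard. This is only a sketch: the import path and the `set_mlc_device(device_name=...)` call are taken from the snippet above, while the helper name `select_mlc_device` is hypothetical.

```python
def select_mlc_device(device_name="gpu"):
    """Try to pin ML Compute to `device_name`; return True on success.

    Returns False when the mlcompute module cannot be imported, e.g. on
    a TensorFlow install that is not the tensorflow_macos fork.
    """
    try:
        # Import path as used in the comment above; it exists only in the fork.
        from tensorflow.python.compiler.mlcompute import mlcompute
    except ImportError:
        return False
    mlcompute.set_mlc_device(device_name=device_name)
    return True


if __name__ == "__main__":
    if select_mlc_device("gpu"):
        print("ML Compute device set to gpu")
    else:
        print("mlcompute not available; using default device placement")
```

Calling the helper before building the model keeps the rest of the benchmark unchanged on either install.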

hughack commented on May 27, 2024

If you need another example, running this code (from #35) also defaults to CPU and segfaults.

import tensorflow as tf

from tensorflow.keras import datasets, layers, models
import matplotlib.pyplot as plt

(train_images, train_labels), (test_images, test_labels) = datasets.cifar10.load_data()

# Normalize pixel values to be between 0 and 1
train_images, test_images = train_images / 255.0, test_images / 255.0

model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.Flatten())
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(10))

model.summary()

model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

history = model.fit(train_images, train_labels, epochs=10, 
                    validation_data=(test_images, test_labels))

Machine specs:
macOS 11.0.1 on MacBook Pro, 15-inch, 2019.
2.3 GHz 8-Core Intel Core i9
16 GB 2400 MHz DDR4
Radeon Pro 560X 4 GB

pooyadavoodi commented on May 27, 2024

@tux-o-matic @hughack I apologize for the late reply. I just tried both of the scripts you provided, and I'm not able to reproduce the issue. It's possible that it was resolved in a macOS update. Could you please try again on an updated macOS and let me know if you can still reproduce this?

tux-o-matic commented on May 27, 2024

Hi @pooyadavoodi.
On an up-to-date Big Sur, with Python 3.8.7 and the latest release of this project, I still hit the same error.

pooyadavoodi commented on May 27, 2024

I managed to reproduce the segfault from @hughack's script using v0.1alpha0, and that issue is resolved in the latest release v0.1alpha2.

@tux-o-matic Could you share the Big Sur version you are using? Also, are you using the Python that comes with the OS? If not, how did you install it?

tux-o-matic commented on May 27, 2024

I'm testing on Big Sur 11.0.1. Python 3.8.7 comes from MacPorts. Earlier tests were on an older point release of Python 3.8, also from MacPorts.
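
Details like these can be gathered in one go and pasted into a report. A minimal sketch using only the standard library follows; the function name `environment_report` and the choice of fields are illustrative, not something this project prescribes.

```python
import platform
import sys


def environment_report():
    """Collect interpreter and OS details that help when triaging a crash."""
    return {
        "python_version": platform.python_version(),
        # The executable path hints at how Python was installed.
        "python_executable": sys.executable,
        "python_prefix": sys.prefix,
        "os": platform.system(),
        "os_release": platform.release(),
        # Empty string on non-macOS systems.
        "macos_version": platform.mac_ver()[0],
        # For example x86_64 on Intel Macs.
        "machine": platform.machine(),
    }


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

A MacPorts Python typically lives under `/opt/local`, so the executable path usually answers the "how did you install it" question directly.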

atw1020 commented on May 27, 2024

This appears to be the same as #127.

I posted over there that this issue seems to be tied to batch size: the segmentation fault occurs with sufficiently large batches. "Sufficiently large" appears to depend on the neural network itself. However, every neural network I have tried so far segfaults when the batch size exceeds a certain amount. It's probably possible to work around or replicate this issue by decreasing or increasing your batch size, respectively.

I am still experiencing this on the February alpha build, and I am using a Conda environment described on this page (some of the pip commands need to be updated to match the new file names). I hope this helps you replicate the issue.

Also, using @tux-o-matic's workaround I was able to get my network to stop segfaulting, but it caused a memory leak instead (?!?). It appeared to run faster on GPU than it did on CPU (until I ran out of memory, that is).
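
One way to put numbers on a suspected leak like this is to record the process's peak memory after each epoch. The sketch below uses only the standard library (Unix-like systems, including macOS); the helper names `peak_rss_mib` and `log_peak_memory` are hypothetical. It only sees memory attributed to the Python process, so GPU-side allocations may not show up.

```python
import resource
import sys


def peak_rss_mib():
    """Return this process's peak resident set size in MiB.

    ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    """
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def log_peak_memory(step_fn, steps):
    """Call step_fn(i) for each step and record peak RSS after each call."""
    readings = []
    for i in range(steps):
        step_fn(i)
        readings.append(peak_rss_mib())
    return readings
```

Here `step_fn` could wrap a single-epoch call to `model.fit`. Readings that keep climbing epoch after epoch would support the leak hypothesis, while readings that level off after the first epoch would point elsewhere.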

atw1020 commented on May 27, 2024

I'm seeing non-segfault issues on 0.1-alpha3: I'm still getting errors that are solved by using a smaller batch size. I'm going to keep investigating and hopefully get some new code to reproduce the issue I'm seeing.

atw1020 commented on May 27, 2024

I've been trying to replicate this issue on 0.1-alpha3 and haven't been able to, so I'm becoming pretty confident that it was fixed in that release. There seem to be other bugs related to batch size, but this one has been addressed. Please post an update if you are still experiencing this issue.
