Comments (11)

tux-o-matic commented on May 27, 2024

Thanks @atw1020, indeed reducing the batch size in my benchmark allows epochs to complete on the CPU.
It's an interesting behaviour.
I don't expect to be able to use large batch sizes on a laptop with an integrated GPU, but when so much is shared, it's surprising that TF with CoreML is so limited on the CPU while the GPU with the same memory can handle larger batch sizes.
For reference, the original benchmark used 32 as the batch size, which worked only on the GPU. Taking it down to 16 works on the CPU (20 is too high and crashes again).
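
As a rough illustration of what the smaller batch size costs, here is a minimal sketch that counts optimizer steps per epoch for the batch sizes mentioned above. It assumes the 50,000-image CIFAR-10 training split (the benchmark's file name suggests CIFAR-10, but that is an assumption), and `steps_per_epoch` is a hypothetical helper, not part of TensorFlow.

```python
import math

# Assumption: the standard CIFAR-10 training split of 50,000 images.
CIFAR10_TRAIN_IMAGES = 50_000


def steps_per_epoch(num_samples, batch_size):
    """Number of batches needed to see every sample once (last batch may be partial)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return math.ceil(num_samples / batch_size)


if __name__ == "__main__":
    # Batch sizes reported in this thread: 32 (GPU only), 20 (crashes on CPU), 16 (works on CPU).
    for batch_size in (32, 20, 16):
        steps = steps_per_epoch(CIFAR10_TRAIN_IMAGES, batch_size)
        print(f"batch_size={batch_size}: {steps} steps per epoch")
```

Halving the batch size from 32 to 16 roughly doubles the number of steps per epoch, so the workaround trades stability for more, smaller updates.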

from tensorflow_macos.

anna-tikhonova commented on May 27, 2024

@tux-o-matic Thank you for reporting this issue. Could you please point us to, or attach, the example you are running? That way we can reproduce the issue locally and investigate.


tux-o-matic commented on May 27, 2024

Hi @anna-tikhonova, I use this Python code.
Just needs the TF fork and NumPy:

python cifar10_cnn.py

In my case, on a MacBook Air with Intel chips, the backend seems to choose the CPU by default and then throws the error.
However, if I specify

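import tensorflow
from tensorflow.python.compiler.mlcompute import mlcompute
# Select the GPU as the ML Compute device instead of the default choice
mlcompute.set_mlc_device(device_name='gpu')
# Make sure tf.function code runs in graph mode rather than eagerly
tensorflow.config.run_functions_eagerly(False)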

Then the model gets trained, and I can see in Activity Monitor that Python threads are offloading work to the GPU. But on this integrated Intel GPU the performance is worse than on the CPU, and even PlaidML as a backend for TF could do better on the GPU.
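
If the same script also has to run on a stock TensorFlow install, the device selection above could be guarded so it is skipped when the fork's `mlcompute` module is missing. This is only a sketch: `select_mlc_device` is a hypothetical helper name, and the only calls it relies on are the import and `set_mlc_device(device_name='gpu')` shown above.

```python
def select_mlc_device(device_name="gpu"):
    """Try to select an ML Compute device; return True if the call was made.

    The mlcompute module only ships with the tensorflow_macos fork, so on a
    stock TensorFlow install (or with no TensorFlow at all) the import fails
    and this function returns False without changing anything.
    """
    try:
        from tensorflow.python.compiler.mlcompute import mlcompute
    except ImportError:
        return False
    mlcompute.set_mlc_device(device_name=device_name)
    return True


if __name__ == "__main__":
    if select_mlc_device("gpu"):
        print("ML Compute device set to 'gpu'")
    else:
        print("mlcompute not available; using TensorFlow's default device placement")
```

Call it once at the top of the training script, before building the model.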


hughack commented on May 27, 2024

If you need another example, running this code (from #35) also defaults to the CPU and segfaults.

import tensorflow as tf

from tensorflow.keras import datasets, layers, models
import matplotlib.pyplot as plt

(train_images, train_labels), (test_images, test_labels) = datasets.cifar10.load_data()

# Normalize pixel values to be between 0 and 1
train_images, test_images = train_images / 255.0, test_images / 255.0

model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.Flatten())
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(10))

model.summary()

model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

history = model.fit(train_images, train_labels, epochs=10, 
                    validation_data=(test_images, test_labels))

Machine specs:
MacOS 11.0.1 on MacBook Pro, 15 inch, 2019.
2.3 GHz 8-Core Intel Core i9
16 GB 2400 MHz DDR4
Radeon Pro 560X 4 GB


pooyadavoodi commented on May 27, 2024

@tux-o-matic @hughack I apologize for the late reply. I just tried both of the scripts you provided, and I'm not able to reproduce the issue. It's possible that it was resolved in a macOS update. Could you please try again on an updated macOS and let me know if you can still reproduce this?


tux-o-matic commented on May 27, 2024

Hi @pooyadavoodi.
On an up-to-date Big Sur, with Python 3.8.7 and the latest release of this project, I still hit the same error.


pooyadavoodi commented on May 27, 2024

I managed to reproduce the segfault from @hughack's script using v0.1alpha0, and that issue is resolved in the latest release v0.1alpha2.

@tux-o-matic Could you share the Big Sur version you are using? Also, are you using the Python that comes with the OS? If not, how did you install it?


tux-o-matic commented on May 27, 2024

I'm testing on Big Sur 11.0.1. Python 3.8.7 comes from MacPorts. Earlier tests were on an older point release of Python 3.8, also from MacPorts.


atw1020 commented on May 27, 2024

This appears to be the same as #127.

I posted over there that this issue seems to be tied to batch size: the segmentation fault occurs with sufficiently large batches. "Sufficiently large" appears to depend on the neural network itself. However, every neural network I have tried so far segfaults once the batch size exceeds a certain amount. It's probably possible to work around or replicate this issue by decreasing or increasing your batch size, respectively.

I am still experiencing this on the February alpha build, and I am using a Conda environment described on this page (some of the pip commands need to be updated to match the new file names). I hope this helps you replicate the issue.

Also, using @tux-o-matic's workaround I was able to get my network to stop segfaulting, but it caused a memory leak instead (?!?). It appeared to run faster on the GPU than it did on the CPU (until I ran out of memory, that is).
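
Since the crash threshold depends on the network, one way to locate it is to bisect on batch size, running each trial in a child process so that a segfault does not take down the search itself. The sketch below is only an illustration under two assumptions: the training script (a hypothetical `train.py`) takes the batch size as its last command-line argument, and once a batch size crashes, every larger one crashes too.

```python
import subprocess
import sys


def trial_succeeds(command, batch_size, timeout=None):
    """Run one training trial in a child process; True if it exits cleanly.

    A segfault kills the child, not this script. On POSIX the return code is
    then negative (the signal number), so any non-zero code counts as failure.
    """
    result = subprocess.run(
        [*command, str(batch_size)],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        timeout=timeout,
    )
    return result.returncode == 0


def largest_working_batch_size(command, low, high):
    """Binary-search the largest batch size in [low, high] that succeeds.

    Assumes failures are monotonic in batch size. Returns None if even
    `low` fails.
    """
    if not trial_succeeds(command, low):
        return None
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always makes progress
        if trial_succeeds(command, mid):
            low = mid
        else:
            high = mid - 1
    return low


if __name__ == "__main__":
    # Hypothetical usage: train.py reads the batch size from sys.argv[1].
    print(largest_working_batch_size([sys.executable, "train.py"], 1, 32))
```

Each trial is a full training run, so in practice it helps to have the script train for a single short epoch when probing.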


atw1020 commented on May 27, 2024

I'm seeing non-segfault issues on 0.1-alpha3, but I'm still getting errors that are solved by using a smaller batch size. I'm going to keep investigating and hopefully get some new code to reproduce the issue I'm seeing.


atw1020 commented on May 27, 2024

I've been trying to replicate this issue on 0.1-alpha3 and haven't been able to, so I'm becoming pretty confident that it was fixed in that release. There seem to be other bugs related to batch size, but this one has been addressed. Please post an update if you are still experiencing this issue.

