Comments (14)
I found a version combination that runs successfully on Colab:
tensorflow 2.12.0
tensorflow_federated 0.61.0
Python 3.10.12
If anyone encounters a similar problem, try the versions above.
from federated.
Some further updates:
I was able to figure out why the C++ worker was failing on Colab by locating the binary in the pip package and invoking it directly via the subprocess module:
print(tff.__path__)
>>> ['/usr/local/lib/python3.10/dist-packages/tensorflow_federated']
# Using the path above, see if we can find the path to the binary
!ls /usr/local/lib/python3.10/dist-packages/tensorflow_federated/data/
>>> worker_binary
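The same lookup can be scripted. A small sketch (names are illustrative): `importlib.util.find_spec` locates the installed package directory without importing the heavy package itself.

```python
import importlib.util
import os

# Locate an installed package's directory without importing it
# (importing tensorflow_federated itself is slow).
def worker_binary_path(pkg="tensorflow_federated"):
    spec = importlib.util.find_spec(pkg)
    if spec is None or not spec.submodule_search_locations:
        return None  # package not installed
    return os.path.join(spec.submodule_search_locations[0], "data", "worker_binary")

print(worker_binary_path())
```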
Now let's try starting the binary directly and capture the return code:
import subprocess
import portpicker
port = portpicker.pick_unused_port()
binary = '/usr/local/lib/python3.10/dist-packages/tensorflow_federated/data/worker_binary'
subprocess.check_output(args=[binary, f'--port={port}'], stderr=subprocess.STDOUT)
>>> CalledProcessError: Command '['/usr/local/lib/python3.10/dist-packages/tensorflow_federated/data/worker_binary', '--port=46383']' died with <Signals.SIGILL: 4>.
Bingo! The binary is failing from an Illegal Instruction (SIGILL: 4). So the binary is using some instruction set that isn't supported by the Colab CPU runtime machine. Unfortunately, this doesn't tell us which instruction is problematic…
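The decoding above follows from how `subprocess` reports death by signal: a negative return code whose absolute value is the signal number. A self-contained sketch (the child deliberately kills itself with SIGILL; it is not the TFF worker binary):

```python
import signal
import subprocess
import sys

# Child process that raises SIGILL on itself, mimicking a binary that
# executes an unsupported instruction.
child = [sys.executable, "-c", "import os, signal; os.kill(os.getpid(), signal.SIGILL)"]
proc = subprocess.run(child)

# A negative return code means the process died from a signal.
if proc.returncode < 0:
    sig = signal.Signals(-proc.returncode)
    msg = f"died with {sig.name} ({sig.value})"
    print(msg)
```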
Let's take a look at what the Colab machine's processor is:
!cat /proc/cpuinfo
>>> processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 79
model name : Intel(R) Xeon(R) CPU @ 2.20GHz
stepping : 0
…
Google Search tells me that Xeon chips from family 6, model 79 are Broadwell microarchitecture (https://en.wikipedia.org/wiki/Broadwell_(microarchitecture)).
Hypothesis: the binary includes newer AVX instructions (possibly AVX-512), which aren't supported until Skylake (the architecture after Broadwell). AVX-512 is sometimes used to speed up ML frameworks, as its wider registers increase SIMD parallelism. This is similar to the type of issue seen in tensorflow/tensorflow#18275
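Another way to check from the running machine itself: on Linux, the kernel lists the CPU's supported extensions in the "flags" line of /proc/cpuinfo. A minimal sketch:

```python
# Parse the "flags" line of /proc/cpuinfo to see which vector extensions
# the current CPU advertises (Linux-only).
def cpu_flags(path="/proc/cpuinfo"):
    try:
        with open(path) as f:
            for line in f:
                if line.startswith("flags"):
                    return set(line.split(":", 1)[1].split())
    except OSError:
        pass  # not Linux, or /proc unavailable
    return set()

flags = cpu_flags()
print("avx2:", "avx2" in flags)
print("avx512f:", "avx512f" in flags)
```

On a Broadwell Colab runtime this should report avx2 present but avx512f absent.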
To confirm, we'll download the 0.75.0 and 0.76.0 pip packages and inspect the worker_binary:
pip download tensorflow_federated==${VERSION} -d /tmp/tff_${VERSION}
unzip /tmp/tff_${VERSION}/tensorflow_federated-${VERSION}-py3-none-manylinux_2_31_x86_64.whl -d /tmp/tff_${VERSION}
objdump --no-show-raw-insn -M x86-64 -d /tmp/tff_${VERSION}/tensorflow_federated/data/worker_binary | awk '{if ($2 !~ ":" && $2 != "data32" && $2 != "file" && $2 != "of" && length($2) > 0) {print $2}}' | sort -u > /tmp/tff_${VERSION}/instructions.txt
Now we diff the two instructions.txt files and see what's different. The section that stood out to me:
$ diff 0.76.0_instructions.txt 0.75.0_instructions.txt
…
< kmovb
< kmovd
< kmovw
…
> vmaskmovpd
> vmaskmovps
> vpmaskmovd
> vpmaskmovq
kmov* is an AVX-512 instruction (specifically AVX-512F, see https://en.wikipedia.org/wiki/AVX-512#New_opmask_instructions) and only appears in the 0.76.0 pip package binary, whereas the 0.75.0 package is using AVX2 instructions (https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#New_instructions, and https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#New_instructions_2).
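The mnemonic lists from the diff can be classified mechanically. A small sketch, using the opmask-move mnemonics from Wikipedia's AVX-512 table (kmov* implies AVX-512, while the vmaskmov*/vpmaskmov* family is plain AVX/AVX2):

```python
# kmovb/kmovw/kmovd/kmovq move data into AVX-512 opmask registers; seeing any
# of them in a disassembly implies the binary requires AVX-512.
AVX512_OPMASK = {"kmovb", "kmovw", "kmovd", "kmovq"}

def uses_avx512_opmask(mnemonics):
    """True if any AVX-512 opmask move appears in the mnemonic set."""
    return bool(AVX512_OPMASK & set(mnemonics))

print(uses_avx512_opmask({"kmovb", "kmovd", "kmovw"}))    # 0.76.0-style diff: True
print(uses_avx512_opmask({"vmaskmovpd", "vpmaskmovd"}))   # 0.75.0-style diff: False
```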
Okay, we've confirmed that the binaries differ in their use of AVX2 vs AVX-512 instructions, and that the Colab CPU runtime doesn't support AVX-512.
As a final test, let's see if one of the other Colab runtimes has a newer CPU and works. Going to Runtime > Change runtime type and choosing TPUv2 then shows:
!cat /proc/cpuinfo
>>> processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 85
model name : Intel(R) Xeon(R) CPU @ 2.00GHz
stepping : 3
…
Family 6, model 85 appears to be the Cascade Lake microarchitecture, which does include AVX-512 instruction support. Lo and behold, executing on this Colab runtime does not hang.
This seems like a smoking gun for the issue.
Why are we getting different instruction sets in the built binaries?
Likely this is from the build --copt=-march=native
configuration here https://github.com/tensorflow/federated/blob/d4865b22711385f6dbd357b6d8b0e1e978e8986d/.bazelrc#L37. This instructs the compiler to optimize for the architecture of the machine building the binary. Recently our pool of build machines has grown to include some with newer architectures which are incompatible with Colab CPU runtimes.
I'll look into configuring our build systems to ensure that the pip package is built for Haswell and newer CPUs, which should enable it to run on default Colab CPU runtimes.
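A hedged sketch of what such a build change might look like (hypothetical; the actual fix may differ): replace `-march=native` in the .bazelrc with an explicit baseline, so the instruction set no longer depends on which build machine produced the wheel.

```
# Hypothetical .bazelrc change: target a fixed baseline instead of the build host.
# haswell enables AVX2 but not AVX-512, so the binary runs on Broadwell Colab CPUs.
build --copt=-march=haswell
```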
from federated.
Hi @makabaka2. Can you provide the following information requested in the bug template:
Environment (please complete the following information):
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
- Python package versions (e.g., TensorFlow Federated, TensorFlow):
- Python version:
- Bazel version (if building from source):
- CUDA/cuDNN version:
- What TensorFlow Federated execution stack are you using?
Note: You can collect the Python package information by running pip3 freeze from the command line, and most of the other information can be collected using TensorFlow's environment capture script.
from federated.
The code is executed on Colab, and my execution environment information is as follows:
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 22.04.3 LTS
Python package versions (e.g., TensorFlow Federated, TensorFlow): tensorflow-2.14.1, tensorflow_federated-0.74.0
Python version: 3.10.12
Bazel version (if building from source):
CUDA/cuDNN version: CUDA-12.2, cuDNN-8.9.6
What TensorFlow Federated execution stack are you using? 0.74.0
from federated.
Ah, the colab bit helped me repro. You're right, things seem to be hanging indefinitely on colab right now. Looking into this now.
from federated.
OMG that was such a pain, looking forward to you guys sorting this out soon
from federated.
I am also stuck on initialize() while using colab. It seems to be waiting indefinitely at wait() from threading.py
python 3.10.12
tensorflow 2.14.1
tensorflow_federated 0.74.0
I will try changing the versions to what the author suggested.
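For anyone debugging the hang itself, Python's standard faulthandler module can show where each thread is blocked (e.g. the wait() inside threading.py). A generic sketch, not TFF-specific:

```python
import faulthandler
import tempfile

# Dump every thread's stack to see where each one is blocked. Writing to
# sys.stderr also works; a temp file is used here so the output can be read back.
with tempfile.TemporaryFile(mode="w+") as f:
    faulthandler.dump_traceback(file=f)
    f.seek(0)
    trace = f.read()

print("Current thread" in trace)
```

faulthandler.dump_traceback_later(60) can also be armed before the suspect call to dump stacks automatically if it takes longer than 60 seconds.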
from federated.
When using Colab, I am also stuck at initialize(). It seems to wait indefinitely at wait() in threading.py (python 3.10.12, tensorflow 2.14.1, tensorflow_federated 0.74.0).
I will try changing to the suggested versions.
Remember to restart the Colab runtime after installing tensorflow and tensorflow_federated.
from federated.
Thanks @makabaka2
from federated.
I've narrowed it down to TFF v0.69.0. Looking at the change list, there's nothing that jumps out at me. It could be related to organizational changes in executor stacks, but this seems unlikely.
Regardless, I'd recommend using TFF v0.68.0 (or earlier) for now.
from federated.
Also, if anyone sees this in non-Colab environments, please let me know. So far I have only been able to repro it on Colab.
from federated.
In limited testing I've found that:
- 0.74.0 hangs
- 0.75.0 appears to succeed
- 0.76.0 hangs again
from federated.
Incredible sleuthing!
from federated.
Following up: the fix in #4637 landed in the most recent release, 0.77.0 (https://pypi.org/project/tensorflow-federated/0.77.0/). Please give it a try.
from federated.