ttconnect

A script that combines the power of tmux with TPU VMs. It currently handles TPU-v3, TPU-v4, and TPU-v5. The main idea is to use tmux to execute identical commands on multiple VMs.

Installation

# Download ttconnect
wget https://raw.githubusercontent.com/peregilk/ttconnect/main/ttconnect

# Make the program executable
chmod a+x ttconnect

# Optionally copy it to a place in your path (like /usr/local/bin/)
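
The optional last step can be sketched as follows. The helper name `install_script` and the `~/.local/bin` default are assumptions made for illustration and are not part of ttconnect; pass `/usr/local/bin` as the second argument (usually with sudo) for a system-wide install.

```shell
# Hypothetical helper: copy a script into a directory on your PATH and
# make it executable there. Defaults to ~/.local/bin (an assumption).
install_script() (
  src="$1"
  dest_dir="${2:-$HOME/.local/bin}"
  mkdir -p "$dest_dir" || exit 1
  cp "$src" "$dest_dir/" || exit 1
  chmod a+x "$dest_dir/$(basename "$src")"
)

# Usage: install_script ./ttconnect
```

If the chosen directory is on your PATH, you can then run `ttconnect` without the leading `./`.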

Use and Tips

The script is made for handling TPU pods. It will automatically open a tmux window with a tile for each of the TPU VMs, allowing them to be controlled both in parallel and individually.

# Open a connection to an existing TPU VM or set of TPU VMs.
# If no zone is provided, it defaults to us-central2-b.
./ttconnect TPU-name [zone]

This command opens connections to all the workers in a tmux window with split panes. A typical workspace for a v4-32 looks like this:

ttconnect screenshot
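
The one-pane-per-worker idea can be approximated with plain tmux and gcloud commands. The sketch below is not ttconnect's actual implementation: the function name `print_pane_commands`, the `tt-` session prefix, and the explicit worker count are assumptions made for illustration. It only prints the commands, so you can inspect them before running anything.

```shell
# Hypothetical sketch (not ttconnect's code): print the tmux commands that
# would open one pane per TPU worker. Nothing is executed here.
print_pane_commands() (
  tpu="$1"
  workers="$2"                   # number of worker VMs in the pod
  zone="${3:-us-central2-b}"     # same default zone as described above
  session="tt-$tpu"              # session name is an assumption
  ssh_cmd="gcloud alpha compute tpus tpu-vm ssh $tpu --zone=$zone"

  # Worker 0 gets the first pane of a new, detached session.
  printf 'tmux new-session -d -s %s "%s --worker=0"\n' "$session" "$ssh_cmd"

  i=1
  while [ "$i" -lt "$workers" ]; do
    # One more pane per remaining worker; re-tile after each split so
    # tmux does not run out of room for the next pane.
    printf 'tmux split-window -t %s "%s --worker=%s"\n' "$session" "$ssh_cmd" "$i"
    printf 'tmux select-layout -t %s tiled\n' "$session"
    i=$((i + 1))
  done

  # Mirror keystrokes to every pane, then attach to the session.
  printf 'tmux set-window-option -t %s synchronize-panes on\n' "$session"
  printf 'tmux attach-session -t %s\n' "$session"
)

# Usage: print_pane_commands my-tpu 4          # review the commands
#        print_pane_commands my-tpu 4 | sh     # then run them
```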

Layout

Depending on how many panes are open, it might be beneficial to change the layout mode. You can cycle through the five different layout modes with this command:

C-b <space>

Synchronize off

By default the panes are synchronized: whatever you type in one pane is typed in all of them. If you want to make a change on only one of the TPUs, you can turn this behaviour off with:

C-b: setw synchronize-panes off
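
The same switch can also be flipped from a second local terminal, which is handy in scripts. This is a minimal sketch: the helper name `sync_panes` and the `TMUX_BIN` override are assumptions, and you need to look up the window target yourself (for example with `tmux list-windows -a`), since the session name that ttconnect uses is not assumed here.

```shell
# Hypothetical helper: switch pane synchronization on or off from a second
# local terminal. Each pane is an SSH session on a TPU VM, so a tmux
# command typed inside a pane would run on the remote machine instead.
# $1 is a tmux window target such as "mysession:0"; $2 is "on" or "off".
# TMUX_BIN is an assumed override (TMUX_BIN=echo gives a dry run).
sync_panes() {
  case "$2" in
    on|off)
      "${TMUX_BIN:-tmux}" set-window-option -t "$1" synchronize-panes "$2"
      ;;
    *)
      echo "usage: sync_panes WINDOW on|off" >&2
      return 2
      ;;
  esac
}

# Usage: sync_panes mysession:0 off
```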

Target Specific Panes

One of the TPUs might die for some reason, and it might not be the one in focus. There are a few tricks I like to use to target specific panes. First, you can always move to another pane with C-b <arrow>. In many cases, however, that pane is too small to work in. If you have multiple VMs running, the first step is to switch to the main-horizontal layout (see above). After that, use the following command to see the id of each pane:

C-b q

Once you know the id of the target pane, use the command below, replacing N with that id:

C-b: swap-pane -t N
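
The layout switch and the swap can be combined into one step run from a second local terminal. This is a sketch under stated assumptions: the helper name `focus_pane` and the `TMUX_BIN` override are hypothetical, the window target has to be looked up by you, and it assumes tmux's default `pane-base-index` of 0, so that pane 0 is the large pane in the main-horizontal layout.

```shell
# Hypothetical helper: move one worker's pane into the large slot.
# $1 is a tmux window target such as "mysession:0",
# $2 is the pane id shown by C-b q.
# Assumes the default pane-base-index of 0.
# TMUX_BIN is an assumed override (TMUX_BIN=echo gives a dry run).
focus_pane() (
  window="$1"
  pane="$2"
  # Switch to the layout with one large pane, then swap the chosen
  # pane with pane 0 so it occupies the large slot.
  "${TMUX_BIN:-tmux}" select-layout -t "$window" main-horizontal &&
    "${TMUX_BIN:-tmux}" swap-pane -s "$window.$pane" -t "$window.0"
)

# Usage: focus_pane mysession:0 3
```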

Kill window

You can detach from the session with:

C-b d

However, if you really want to zap the entire window, you will have to do:

C-b: kill-window

You can then use ttconnect to connect to the same pod again with a fresh login.

Killing Stuck Scripts

In rare cases, scripts crash or get stuck. If you don't want to recreate the TPUs/VMs, this command is really useful:

gcloud alpha compute tpus tpu-vm ssh MyName --project=MyProject-11111 --zone=MyZone --worker=all --command="sudo pkill -9 python"

In some very rare cases, I have seen stuck programs remain that prevent the training scripts from restarting. This is my last trick:

gcloud alpha compute tpus tpu-vm ssh MyName --project=MyProject-11111 --zone=MyZone --worker=all --command="ps ax | grep python | grep -v grep | awk '{print \$1}' | xargs -r sudo kill -9"
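
Since both commands share the same gcloud prefix, it can help to wrap it once and check what is running before killing anything. The wrapper below is a minimal sketch: the name `tpu_all` and the `GCLOUD_BIN` override are assumptions, and the `pgrep -a` inspection step is a suggestion that assumes the usual Linux `pgrep` is available on the VMs.

```shell
# Hypothetical wrapper: run one command on every worker of a TPU pod.
# $1 = TPU name, $2 = project, $3 = zone, remaining args = remote command.
# GCLOUD_BIN is an assumed override (GCLOUD_BIN=echo gives a dry run).
tpu_all() (
  name="$1"
  project="$2"
  zone="$3"
  shift 3
  "${GCLOUD_BIN:-gcloud}" alpha compute tpus tpu-vm ssh "$name" \
    --project="$project" --zone="$zone" --worker=all --command="$*"
)

# Usage:
#   tpu_all MyName MyProject-11111 MyZone pgrep -a python       # inspect first
#   tpu_all MyName MyProject-11111 MyZone sudo pkill -9 python  # then kill
```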

For more advanced use, please refer to the tmux documentation.

Switch Sessions

This is really just a tmux tip, but it seems that a lot of tmux users are simply not aware of its most useful feature. List all sessions:

C-b s

Feedback

Feel free to modify the script and to add features. If you come up with improvements, I will be glad to add them to the script. Please send any comments to [email protected].
