Session 6

Without 2 nodes

TensorBoard logs for multi-GPU training without 2 nodes: https://tensorboard.dev/experiment/Y965wCV6SX6yswvqE8cJnw/

Checkpoint: s3://test-bucket-emlo-1/s6/without_2_nodes_epoch_007.ckpt
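
To fetch a checkpoint such as the one above from Python, something like the following sketch could work. It is not taken from the repo: the helper names split_s3_uri and download_checkpoint are hypothetical, and the download step assumes boto3 is installed and that your AWS credentials can read the bucket.

```python
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split 's3://bucket/key' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def download_checkpoint(uri: str, dest: str) -> None:
    """Download one S3 object to a local file (hypothetical helper)."""
    # Imported lazily so split_s3_uri stays usable without boto3 installed.
    import boto3

    bucket, key = split_s3_uri(uri)
    boto3.client("s3").download_file(bucket, key, dest)


CKPT_URI = "s3://test-bucket-emlo-1/s6/without_2_nodes_epoch_007.ckpt"
bucket, key = split_s3_uri(CKPT_URI)
print(bucket, key)
```

Calling download_checkpoint(CKPT_URI, "without_2_nodes_epoch_007.ckpt") would then save the file in the current directory, provided the bucket is accessible to you.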

max_batch_size = 20000

With 2 nodes

TensorBoard logs for GPU training with 2 nodes: https://tensorboard.dev/experiment/YXzLURTzSGy7EjmpfGj8nA/

Checkpoint: s3://test-bucket-emlo-1/s6/with_2_nodes_epoch_005.ckpt

max_batch_size = 20000

The same max batch size (20000) could be used with 2 nodes.
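
If max_batch_size is the per-process batch size, as is usual with PyTorch DDP (an assumption; the README does not say), then keeping it fixed while adding a node increases the global batch size per optimizer step. The sketch below shows the arithmetic only. GPUS_PER_NODE is a hypothetical value, since the GPU count per node is not stated here.

```python
def global_batch_size(per_process: int, gpus_per_node: int, num_nodes: int) -> int:
    """Samples consumed per optimizer step across all DDP processes."""
    return per_process * gpus_per_node * num_nodes


PER_PROCESS = 20000  # max_batch_size reported for both runs
GPUS_PER_NODE = 2    # hypothetical: not stated in this README

one_node = global_batch_size(PER_PROCESS, GPUS_PER_NODE, num_nodes=1)
two_nodes = global_batch_size(PER_PROCESS, GPUS_PER_NODE, num_nodes=2)
print(one_node, two_nodes)
```

Under that assumption the 2-node run sees twice as many samples per step, which may be worth keeping in mind when comparing the two TensorBoard runs.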

Session 4

Docker image URL

shivam13juna/tsai_emlo4

For running the Docker image

docker run -p 8080:8080 shivam13juna/tsai_emlo4

For building the Docker image

cd dockerize/

docker image build -t torch_script .

Link to GitHub repo

https://github.com/shivam13juna/emlo-assn2.git

Session 2

Building Image

make build

which triggers this command: `docker build -t session2 .`

For Training

Set the timm model name in configs/model/cifar.yaml (the default is resnet18).
The following docker command triggers training:

docker run --mount type=bind,source=`pwd`,target=/src/ session2 python3 src/train.py experiment=cifar

To try out a different metric, specify it in the callbacks config, configs/callbacks/model_checkpoint.yaml.

For Prediction (Optional)

Copy the location of the best checkpoint from training into predict.yaml.
Paste the location of the image you want to run prediction on into predict.yaml; the output is the predicted class.

docker run --mount type=bind,source=`pwd`,target=/src/ session2 python3 src/predict_v1.py

For Eval

Copy the location of the best checkpoint from training into eval.yaml.

docker run --shm-size 25G --mount type=bind,source=`pwd`,target=/src/ session2 python3 src/eval.py

--shm-size increases the shared memory available to the container; without it, the run went out of memory (OOM).

For COG inference

In src/predict.py, specify the timm model that matches the one the CIFAR model was trained with, and the path where the checkpoints are saved. The corresponding state dicts are loaded into the model, which is then used for inference.
The output is the predicted class.

cog predict -i image=@tmp/dog.jpg
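
The state-dict loading described above might look roughly like the sketch below. It is not the code in src/predict.py. It assumes the checkpoint is a PyTorch Lightning .ckpt (weights stored under the "state_dict" key) and that the timm model was an attribute named net on the Lightning module, so its keys start with "net."; that prefix, the num_classes value, and the helper names are hypothetical.

```python
def strip_prefix(state_dict: dict, prefix: str) -> dict:
    """Keep only keys that start with `prefix` and drop the prefix from them."""
    return {
        key[len(prefix):]: value
        for key, value in state_dict.items()
        if key.startswith(prefix)
    }


def load_timm_model(ckpt_path: str, model_name: str = "resnet18",
                    num_classes: int = 10, prefix: str = "net."):
    """Build a timm model and load weights from a Lightning checkpoint."""
    # Imported lazily so strip_prefix can be used without torch or timm.
    import timm
    import torch

    model = timm.create_model(model_name, pretrained=False, num_classes=num_classes)
    # Newer PyTorch releases may need weights_only=False here if the
    # checkpoint carries non-tensor objects (e.g. hyperparameters).
    ckpt = torch.load(ckpt_path, map_location="cpu")
    model.load_state_dict(strip_prefix(ckpt["state_dict"], prefix))
    return model.eval()


# Small demonstration of the key renaming on a plain dict.
demo = {"net.fc.weight": 1, "net.fc.bias": 2, "criterion.weight": 3}
print(strip_prefix(demo, "net."))
```

The model name passed to load_timm_model has to match the one used for training; otherwise load_state_dict raises an error about missing or unexpected keys.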
