
graphcore / examples-utils

Utils and common code for Graphcore's example applications

License: Other

Makefile 0.07% Python 98.17% C++ 0.44% Shell 0.93% Jupyter Notebook 0.40%

examples-utils's People

Contributors

alexgraygc, blazejba, evawgraphcore, halestormai, hiteshk-gc, hmellor, joshlk, kundamwiza, marcins-gc, michaln-gc, payoto, rahult-graphcore


examples-utils's Issues

Formalise convergence testing code (platform assessment) and dockerify it

Currently the code under platform assessment has three issues:

  • It is unofficial and poorly organised
  • It is not fully integrated into the examples-utils benchmarking submodule
  • It uses a .sh file to set up environments etc.

This task covers organising the code and integrating it neatly into examples-utils benchmarking (or splitting it off into something else entirely), and converting it to use Docker containers.

Examples utils benchmarking: Feedback round 2

The previous round of testing and feedback from SysOps and PSE was very useful and led to the improvement and promotion of examples-utils benchmarking.

Round 2 will be performed with the AI-Engineering cloud SDK team.

Benchmarks: checkpoint uploading

The current implementation of finding checkpoints for upload to wandb/S3 assumes:

  1. --checkpoint-output-dir contains subdirectories. However, it may be the case that the output directory itself holds the checkpoint files.
  2. Each subdirectory contains only checkpoint-related files. However, some applications store other files alongside their checkpoints.
  3. The most recently updated file in any subdirectory corresponds to a checkpoint. This is tied to point 2: an application may store files related to checkpoints, yet the most recently updated file may not be the checkpoint itself — it could be a metadata file, for example.

We need to identify all the expected scenarios across all apps for checkpoint outputs:

  • Do they provide a subdirectory for each checkpoint?
  • Do they store metadata files?
  • Does each checkpoint correspond to a single file? If so, what are the allowed extensions?

Could it be easier to control the format of the output directories in each application instead?
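A minimal sketch of a checkpoint finder that avoids the three assumptions above: it searches the output directory itself as well as any subdirectories, and selects files by an extension whitelist instead of taking the most recently modified file. The function name and extension set are illustrative assumptions, not the actual examples-utils implementation:

```python
from pathlib import Path

# Hypothetical set of checkpoint file extensions; the real list would come
# from surveying checkpoint outputs across the example applications.
CHECKPOINT_EXTENSIONS = {".ckpt", ".pt", ".onnx", ".npz"}


def find_checkpoints(output_dir: str) -> list[Path]:
    """Collect candidate checkpoint files under ``output_dir``.

    Handles both layouts: checkpoints stored in per-checkpoint
    subdirectories, and checkpoints placed directly in the output
    directory itself. Metadata and other auxiliary files are excluded
    by extension rather than by modification time.
    """
    root = Path(output_dir)
    candidates = (p for p in root.rglob("*") if p.is_file())
    return sorted(p for p in candidates if p.suffix in CHECKPOINT_EXTENSIONS)
```

Controlling the output-directory format in each application (the question above) would let this whitelist shrink to a single known layout.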

QOL improvements to `platform_assesment` command

I've extracted the functionality into: requirements_utils.py. I can move it further into its own folder if you think that's valuable but figured I'd get some feedback on this first. There are a few things that I want to improve before merging:

  • Logging and clarity of the log messages:
    • Add header to the setup
    • Remove the "Benchmark elapsed time" for requirement installation
    • Log the name of the file environment_setup.log
    • Capture the output of pip freeze after installing each requirements file
  • Add some docs to the new function and module
  • Improve --help for platform assessment

I don't plan to merge this with the platform_assessment script yet, but I can open a follow-up issue for that.

Originally posted by @payoto in #46 (comment)
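The "capture the output of pip freeze" bullet above could be done with a small helper like the following sketch. The function name is hypothetical, and the environment_setup.log destination is taken from the issue; appending after each requirements file records the exact package set at every setup step:

```python
import subprocess
import sys


def log_pip_freeze(log_path: str) -> None:
    """Append the current environment's `pip freeze` output to a log file.

    Intended to be called after each requirements file is installed, so
    the setup log (e.g. environment_setup.log) captures the package set
    at every step.
    """
    result = subprocess.run(
        [sys.executable, "-m", "pip", "freeze"],
        capture_output=True,
        text=True,
        check=True,
    )
    with open(log_path, "a") as f:
        f.write("--- pip freeze ---\n")
        f.write(result.stdout)
```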

Examine possibility of cppimport upgrade

@joshlk said that cppimport introduced a fix for the issue arising in parallel compilation in version 22.07.17.

Is it possible to upgrade the cppimport dependency to 22.07.17 and remove the custom workarounds?

Some CI tests were still failing this week, though not every time (e.g. the BERT attention test: https://jenkins.sourcevertex.net/job/public_examples/job/public_examples_ci_ubuntu_18_04_hw_pod_mk2/316/testReport/junit/(root)/(empty)/tests_integration_layer_test_attention/). We could possibly solve this by upgrading the popxl-addons dependency to use cppimport 22.07.17. However, if examples-utils depends on an earlier version, there is a dependency conflict.
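The upgrade decision above could be gated on the installed version, so any custom workaround is only applied when cppimport predates the upstream fix. This is an illustrative sketch (the function names are assumptions, and cppimport's calendar-style versions compare cleanly as integer tuples):

```python
# 22.07.17 is cited in the issue as the first cppimport release containing
# its own parallel-compilation fix.

def parse_version(version: str) -> tuple[int, ...]:
    """Parse a calendar-style version like '22.07.17' into a sortable tuple."""
    return tuple(int(part) for part in version.split("."))


def needs_custom_build_lock(installed: str, fixed_in: str = "22.07.17") -> bool:
    """Return True if the installed cppimport predates the upstream fix,
    i.e. the custom parallel-compilation workaround is still required."""
    return parse_version(installed) < parse_version(fixed_in)
```

In practice `installed` would be `cppimport.__version__`, checked once at import time.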

Examples utils self-identified upgrades

There are already some upgrades that we can make, even before feedback round 2:

  • Better spacing and clarity of results in the terminal
  • Reduce terminal logging to the minimum; the output in output.log can remain complete
  • A more obvious progress/running indicator in the terminal
  • Ability to define a common poprun arguments option in a yaml file that is applied to all poprun commands in that file
  • The start of unit testing... (seek inspiration)
  • Make passing a benchmarks file optional where it is not needed
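The common-poprun-arguments idea above could work roughly as follows. This sketch models the parsed yaml as a dict; the `poprun_common_options` key name and the benchmark spec shape are assumptions, not the actual examples-utils schema:

```python
def apply_common_poprun_args(benchmarks: dict) -> dict:
    """Splice a file-level common options string into every poprun command.

    ``benchmarks`` is assumed to be a parsed benchmarks yaml: a mapping of
    benchmark name -> spec dict with a ``cmd`` string, plus an optional
    top-level ``poprun_common_options`` string (hypothetical key).
    """
    common = benchmarks.pop("poprun_common_options", "")
    if not common:
        return benchmarks
    for spec in benchmarks.values():
        cmd = spec.get("cmd", "")
        if cmd.startswith("poprun"):
            # Insert the shared options right after the poprun executable.
            spec["cmd"] = cmd.replace("poprun", f"poprun {common}", 1)
    return benchmarks
```

Non-poprun commands are left untouched, so a single yaml file can mix poprun and plain-python benchmarks.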
