mlfoundations / open_lm
A repository for research on medium sized language models.
License: MIT License
Would be nice to have a unit test for grad accum to make sure gradients are close to those computed without it.
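A minimal sketch of what such a test could check (the model, loss_fn, and batch here are illustrative placeholders, not open_lm's actual training loop):

import copy
import torch

def test_grad_accum_matches_full_batch(model, batch, loss_fn, accum_steps=4):
    # Gradients from one full-batch backward should be close to the gradients
    # accumulated over accum_steps equal-sized micro-batches.
    model_full = copy.deepcopy(model)
    model_accum = copy.deepcopy(model)

    loss_fn(model_full(batch)).backward()

    for micro_batch in batch.chunk(accum_steps):
        # Scale each micro-batch loss so the accumulated gradient matches the full-batch mean.
        (loss_fn(model_accum(micro_batch)) / accum_steps).backward()

    for (name, p_full), (_, p_accum) in zip(
        model_full.named_parameters(), model_accum.named_parameters()
    ):
        assert torch.allclose(p_full.grad, p_accum.grad, atol=1e-5), f"gradient mismatch at {name}"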
Currently, we load args.resume potentially up to 3 times. This can be pretty slow for big models, and we should avoid re-loading it in these spots:
Lines 110 to 156 in 97d0a4a
Currently there is an IPython notebook to check whether llama2 weight conversion to the open_lm format is correct. It would be great to move this to a more formal pytest unit test in the tests/ directory.
Changing precision from fp32 to amp_bf16 leads to pytest tests/test_grad_accum.py failing:
FAILED tests/test_grad_accum.py::test_grad_acc - AssertionError: Failed gradient checks at: ['tok_embeddings.weight', 'layers.0.attention.in_proj.weight', 'layers.0...
FAILED tests/test_grad_accum.py::test_grad_acc_fsdp - torch.multiprocessing.spawn.ProcessRaisedException:
ERROR:root:Number of shards requested for a single epoch is more than the number of shards available. This means that the amount of data requested to train on is more than the dataloader can serve. This can either happen because there are not enough data to begin with, or data being skipped due to rounding errors. To alleviate the latter, consider making more uniform shards, and using less workers/GPUs. This will allow for better use of the dataset.
2024-01-03,15:18:27 | ERROR | Number of shards requested for a single epoch is more than the number of shards available. This means that the amount of data requested to train on is more than the dataloader can serve. This can either happen because there are not enough data to begin with, or data being skipped due to rounding errors. To alleviate the latter, consider making more uniform shards, and using less workers/GPUs. This will allow for better use of the dataset.
Traceback (most recent call last):
File "/miniconda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/miniconda/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/mnt/task_runtime/open_lm/open_lm/main.py", line 841, in
main(sys.argv[1:])
File "/mnt/task_runtime/open_lm/open_lm/main.py", line 717, in main
) = get_string_for_epoch(
File "/mnt/task_runtime/open_lm/open_lm/file_utils.py", line 293, in get_string_for_epoch
return _single_epoch_string(num_samples, starting_points, paths, weights, num_workers_per_gpu, world_size)
File "/mnt/task_runtime/open_lm/open_lm/file_utils.py", line 424, in _single_epoch_string
raise e
File "/mnt/task_runtime/open_lm/open_lm/file_utils.py", line 405, in _single_epoch_string
shard_name = manifests[i][next_shard_per_source[i]]["shard"]
IndexError: list index out of range
I have tried decreasing the number of workers.
Tests currently fail locally because we download credentials in the workflow rather than in the tests. We should move this into the tests.
In get_wds_dataset, it loops over all datasets and creates a shared_epoch for each dataset, but the function get_wds_dataset returns a DataInfo object for only one shared_epoch.
Thus when we call data["train"].set_epoch(epoch) in train_one_epoch, it only updates the epoch number for one of the datasets. All other datasets are stuck in epoch=0 and will end up sampling the same data over and over.
By default, FSDP will reduce gradients on every backward() call, which is slow in multi node settings. We should use fsdp.no_sync() to only reduce gradients on the last backward call.
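A minimal sketch of the pattern (compute_loss, micro_batches, and accum_steps are illustrative names; model is the FSDP-wrapped module):

import contextlib

for step, micro_batch in enumerate(micro_batches):
    is_last = (step + 1) % accum_steps == 0
    # Skip the gradient all-reduce on every backward except the last one in the accumulation window.
    ctx = contextlib.nullcontext() if is_last else model.no_sync()
    with ctx:
        loss = compute_loss(model, micro_batch) / accum_steps
        loss.backward()
    if is_last:
        optimizer.step()
        optimizer.zero_grad()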
Right now, we set up arguments in tests where we don't need to. As a result, we end up needing to change tests every time we add a parameter:
Some places:
open_lm/open_lm/tests/test_accumulation.py
Lines 42 to 59 in b5f9beb
open_lm/tests/test_generate_kv_cache_time.py
Lines 21 to 37 in b5f9beb
Lines 13 to 79 in b5f9beb
open_lm/tests/test_generate_load_kv_cache_equal.py
Lines 29 to 46 in b5f9beb
We should instead just call parse_args, or at the very least, only have these args in one part of the tests.
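A sketch of the parse_args approach (assuming parse_args accepts a list of CLI-style arguments, as argparse-based parsers usually do; the exact import path may differ):

from open_lm.params import parse_args

def make_test_args(**overrides):
    # Start from the real defaults so tests don't break every time a new flag is added.
    args = parse_args([])
    for key, value in overrides.items():
        setattr(args, key, value)
    return args

# In a test: args = make_test_args(batch_size=2, accum_freq=4)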
Hello!
Thank you for the great work.
I was wondering if you planned to release the intermediate checkpoints for all pretrained models as in Pythia (https://arxiv.org/pdf/2304.01373.pdf)?
Looking into the logs for runs with the --accurate-total-tokens option, the number of tokens seen reported at the end of training is smaller than the desired one, by about 250-300M.
Without the option, dataloading without replacement works properly. As such, runs that don't exhaust the available data are fine.
Tagging @sagadre @achalddave @Vaishaal. I'll look into what causes this.
For medium-to-large models, if the user doesn't have enough disk space (or, more commonly, has accidentally specified a path on a volume with not enough disk space), we train for a full "epoch," and crash while saving the checkpoint. It would be nice to either:
Option 1: Save a dummy checkpoint at the very start, before training. If this succeeds, assume that future checkpoints will work if --delete-previous-checkpoint is specified. As an addition, we could check if there is num_checkpoints * size(initial checkpoint) disk space remaining if --delete-previous-checkpoint is not specified, but this is not necessary.
Option 2: Estimate the size of the checkpoint (based on number of parameters) and check if we have enough disk space based on number of checkpoints requested.
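A rough sketch of Option 2 (the 16 bytes-per-parameter figure is an assumption standing in for weights plus optimizer state, not a measured value):

import shutil

def check_checkpoint_disk_space(model, checkpoint_dir, num_checkpoints, bytes_per_param=16):
    # Rough upper bound on the size of one checkpoint.
    est_checkpoint_bytes = sum(p.numel() for p in model.parameters()) * bytes_per_param
    free_bytes = shutil.disk_usage(checkpoint_dir).free
    needed_bytes = est_checkpoint_bytes * num_checkpoints
    if free_bytes < needed_bytes:
        raise RuntimeError(
            f"Need ~{needed_bytes / 1e9:.1f} GB for {num_checkpoints} checkpoints, "
            f"but only {free_bytes / 1e9:.1f} GB free at {checkpoint_dir}."
        )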
In #125, we had to switch our gradient accumulation tests from SGD to AdamW to make gradient accumulation tests pass. It's unclear why this is the case; anecdotally, when training models with AdamW, training curves look similar with and without gradient accumulation. This could be a numerical issue, or some specific issue with AdamW that makes gradient accumulation behave differently.
Proposed by @achalddave, opening as a different issue to keep separate.
Currently, parameter error checking and throwing is done on-the-fly in main.py. This means we may do some heavyweight initialization (e.g., of the model) only to throw if a user passed incompatible flags. Having an error checking function, called right after argparse, will alleviate this and also clean up the code.
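A sketch of the idea (the specific checks here are hypothetical placeholders; the real ones would be collected from main.py):

def check_args(args):
    # Fail fast on incompatible flags, before any model or data initialization happens.
    errors = []
    if args.dataset_manifest is not None and args.dataset_resampled:
        errors.append("--dataset-manifest and --dataset-resampled cannot be used together.")
    if args.epochs < 1:
        errors.append("--epochs must be at least 1.")
    if errors:
        raise ValueError("Invalid arguments:\n" + "\n".join(errors))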
It would be nice to simplify the checkpoint loading. Right now it is a bit confusing, with checkpoint_path, args.checkpoint_path, join(.., "checkpoints"), remote-sync, etc. all referenced in various places in main.py.
Some items that need to be addressed in data.py:
--dataset-resampled and --dataset-manifest should be the only possible options.
--accurate-total-tokens should be the default.
Interested in:
A) changing layer_id + 1 to args.num_layers.
B) removing the line std = std / math.sqrt(2 * (layer_id + 1))
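For reference, a small sketch of the current scaling and the two proposed alternatives (illustrative function, not the actual init code):

import math

def residual_init_std(base_std, layer_id, num_layers):
    current = base_std / math.sqrt(2 * (layer_id + 1))  # current per-layer scaling
    option_a = base_std / math.sqrt(2 * num_layers)     # A) scale by total depth instead
    option_b = base_std                                  # B) remove the scaling entirely
    return current, option_a, option_b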
I haven't yet investigated the cause of this, but I'd appreciate insights and help.
When running a 160m model (hidden_dim: 768, n_layers: 12, n_heads: 12, seq_len: 256) I can fit batch_size: 48 in memory on a 3090, and it saturates the GPUs nicely.
Then running an 11m model (hidden_dim: 96, n_layers: 12, n_heads: 12, seq_len: 256) I can still only fit batch_size: 48 in memory on a 3090.
Things to note:
Would you expect to be able to increase the batch_size for smaller models? If so, there may be a problem somewhere...
I've used other LLM codebases, but I'm not yet familiar with this one, so any insights on where to look for problems would be appreciated!
We should use SHARDED_STATE_DICT when loading/saving checkpoints to avoid loading the entire model in CPU memory, similar to https://github.com/pytorch/pytorch/blob/main/torch/distributed/checkpoint/examples/fsdp_checkpoint_example.py.
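A sketch following the linked example (these torch.distributed.checkpoint APIs are taken from that example; newer PyTorch versions also offer torch.distributed.checkpoint.save):

import torch.distributed.checkpoint as dist_cp
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, StateDictType

def save_sharded_checkpoint(model, checkpoint_dir):
    # Each rank writes only its own shard, so no rank materializes the full model in CPU memory.
    with FSDP.state_dict_type(model, StateDictType.SHARDED_STATE_DICT):
        state_dict = {"model": model.state_dict()}
        dist_cp.save_state_dict(
            state_dict=state_dict,
            storage_writer=dist_cp.FileSystemWriter(checkpoint_dir),
        )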
Some ideas:
s3 cp, perhaps using fsspec or cloudpathlib
Benchmark on-the-fly tokenization and get it to be as fast as training on pre-tokenized data
Prereq: Merge or close all open PRs to avoid major conflicts
Would be great to benchmark tokens/sec of OpenLM, comparing to other libraries like Mosaic, Metaseq, etc.
Stating why one should choose this framework instead of others (GPT-NeoX, DeepSpeed/Megatron, accelerate+HF, etc.) would make the decision easier for new users. (The timing vs. ease-of-use comparison you started mentioning on Twitter might be a good thing to write there first.)
Currently, the CI only checks for formatting/linting in the openlm module dir. We should also check in tests/, and reformat all of tests/ with black.
Would be nice to have something like this for open_lm: https://github.com/facebookresearch/metaseq/tree/main/projects/OPT/chronicles
Currently, when both options are specified, only the local storage is cleared of old checkpoints.
Often we may have special control tokens that need to be handled when creating the inputs and targets. To allow maximum flexibility, users should be able to provide their own sample_chunk functions or similar.
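A hypothetical example of what a user-provided function might look like (the name, signature, and control-token id are illustrative; the actual hook in data.py may differ, and this assumes the loss uses PyTorch's default ignore_index of -100):

def my_sample_chunk(chunk, args):
    # chunk: a list of seq_len + 1 token ids.
    CONTROL_TOKEN_ID = 50257  # placeholder id for a special control token
    inputs = chunk[: args.seq_len]
    targets = [
        -100 if tok == CONTROL_TOKEN_ID else tok  # mask loss on the control token
        for tok in chunk[1 : args.seq_len + 1]
    ]
    return inputs, targets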
Shouldn't be too hard (probably can just copy from open_clip) and enables somewhat easier fine-tuning. Nice to have
Right now, we disable distributed functionality if we are using a single gpu: https://github.com/mlfoundations/open_lm/blob/main/open_lm/distributed.py#L20-L25. But if WORLD_SIZE is provided, we should behave as if we are in a distributed environment, which in turn will allow us to run tests that verify the distributed code paths without requiring multiple gpus.
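A sketch of the proposed behavior (not the exact code in distributed.py):

import os

def is_using_distributed():
    # Respect an explicit WORLD_SIZE, even if it is 1, so single-GPU runs can still
    # exercise the distributed code paths (e.g. in tests).
    if "WORLD_SIZE" in os.environ:
        return int(os.environ["WORLD_SIZE"]) >= 1
    if "SLURM_NTASKS" in os.environ:
        return int(os.environ["SLURM_NTASKS"]) > 1
    return False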
I'm trying OpenLM on Ubuntu 20.04 under WSL. I've hit an issue running the unit tests where the argument "moe_freq" is never set before it is used in train.py, which results in a Python error. As a workaround I added hasattr() to line 176 in train.py:
if hasattr(args, "moe_freq") and args.moe_freq > 0:
open_lm/open_lm$ pytest tests/
FAILED tests/test_accumulation.py::TestGradientAccumulation::test_accumulation - AttributeError: 'Namespace' object has no attribute 'moe_freq'
I can't seem to import from attention.py
pip install git+https://github.com/mlfoundations/open_lm.git
Stuff like these work without any issues:
from open_lm.data import get_data
from open_lm.main import main
It fails when I try to import from open_lm.attention
from open_lm.attention import ATTN_ACTIVATIONS, ATTN_SEQ_SCALARS
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ImportError: cannot import name 'ATTN_ACTIVATIONS' from 'open_lm.attention' (unknown location)
Some potentially helpful links:
Note the following helpful link we were sent: https://github.com/NVIDIA/apex/blob/bc4be41c6fdb889db84b9f61f35440f82a057948/apex/normalization/fused_layer_norm.py#L192
Hi OpenLM team! Is there interest in making OpenLM models loadable using just HF?
I see some OpenLM models up on HF, but they are not readily loadable using HF. The proposed changes would involve adding an OpenLM class on HF, similar to how other models are hosted on HF (e.g. Mistral).
For comparison, both #54 and #20 allow saved OpenLM models to be loaded using HF functions, but under the hood this still calls OpenLM functions and requires the OpenLM library to be installed locally. What I'm thinking is basically porting OpenLM's model.py into the transformers library itself, so that OpenLM-trained models can be shared and loaded more easily. I can work on this if you think it's a good idea.
Line 129 in 619a8b3
It seems to me that the rotary position embedding is being applied on the head dimension (dim -2) of the vectors q, k instead of the sequence dimension (dim 1).
I think the head and sequence dimensions should be swapped before calling the positional embedding
(see https://github.com/facebookresearch/xformers/blob/748c159096d4f9fcfe3eaf22801e5aed4777210b/xformers/components/positional_embedding/rotary.py#L85).
What I'm proposing is simply to re-write RotaryWithCast as follows:
class RotaryWithCast(RotaryEmbedding):
    def forward(self, q, k, v):
        # Move the sequence dimension to -2 before applying the rotary embedding,
        # then restore the original layout.
        q, k = super().forward(q.permute(0, 2, 1, 3), k.permute(0, 2, 1, 3))
        q = q.permute(0, 2, 1, 3)
        k = k.permute(0, 2, 1, 3)
        return q.to(v.dtype), k.to(v.dtype), v
@sagadre has done a big grid search of HPs; let's update the names (i.e., potato_neox -> open_lm_410m) and add JSONs with the optimal HPs.
Line 70 in 9ca7042
fsspec is somehow really slow at loading large files in my experience, and right now we have every process reading from s3. This is quite slow at large model sizes; it would be nice to speed this up, probably via subprocess.run("aws s3 cp ...") in local_rank=0 and then loading locally from each worker.
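A sketch of the idea (hypothetical helper; assumes torch.distributed is initialized and that all ranks on a node share the same local filesystem):

import subprocess
import torch.distributed as dist

def fetch_checkpoint_locally(s3_uri, local_path, local_rank):
    # Only one process per node pulls from S3; everyone else waits, then loads from local disk.
    if local_rank == 0:
        subprocess.run(["aws", "s3", "cp", s3_uri, local_path], check=True)
    dist.barrier()
    return local_path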
We should add a test that:
The test should test single process, DDP, and FSDP.
Right now our tests take a while because they pip install repeatedly.