mlfoundations / open_lm

A repository for research on medium-sized language models.

License: MIT License

Python 97.11% Shell 0.66% Jupyter Notebook 1.66% Makefile 0.24% Dockerfile 0.32%

open_lm's People

Contributors

achalddave, afang-story, georgiossmyrnis, iejmac, igorvasiljevic-tri, jeffreywpli, jfisher52, jmercat, kernelmachine, mayeechen, mitchellnw, nielsrogge, pythonnut, reinhardh, revbucket, ruixin31, rulinshao, sagadre, saurabhgarg1996, sedrick-keh-tri, vaishaal, yuhui-zh15

open_lm's Issues

Unit test for grad accum

It would be nice to have a unit test for gradient accumulation to make sure the gradients are close to those computed without it.
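
A minimal sketch of what such a test could check, using a toy model as a stand-in for an open_lm model (the toy model, seed, and tolerance below are placeholders, not the actual test fixtures):

import copy
import torch
import torch.nn as nn

def test_grad_accum_matches_full_batch():
    # Toy stand-in for the real model; the actual test would build an open_lm model.
    torch.manual_seed(0)
    model_a = nn.Sequential(nn.Embedding(100, 32), nn.Flatten(), nn.Linear(32 * 16, 100))
    model_b = copy.deepcopy(model_a)

    tokens = torch.randint(0, 100, (8, 16))  # (batch, seq_len)
    targets = torch.randint(0, 100, (8,))
    loss_fn = nn.CrossEntropyLoss()

    # Full-batch gradients.
    loss_fn(model_a(tokens), targets).backward()

    # Accumulated gradients over two half-batches; each mean-reduced loss is halved
    # so their sum matches the full-batch mean.
    for t, y in zip(tokens.chunk(2), targets.chunk(2)):
        (loss_fn(model_b(t), y) / 2).backward()

    for (name, p_a), (_, p_b) in zip(model_a.named_parameters(), model_b.named_parameters()):
        assert torch.allclose(p_a.grad, p_b.grad, atol=1e-6), f"Gradient mismatch at {name}"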

Minimize how often we load args.resume

Currently, we load args.resume potentially up to 3 times. This can be pretty slow for big models, and we should avoid re-loading it in these spots:

open_lm/open_lm/main.py

Lines 110 to 156 in 97d0a4a

def load_model(args, model):
    checkpoint = pt_load(args.resume, map_location="cpu")
    if "epoch" in checkpoint:
        # resuming a train checkpoint w/ epoch and optimizer state
        start_epoch = checkpoint["epoch"]
        sd = checkpoint["state_dict"]
        if next(iter(sd.items()))[0].startswith("module"):
            sd = {k[len("module.") :]: v for k, v in sd.items()}
        model.load_state_dict(sd)
        logging.info(f"=> resuming checkpoint '{args.resume}' (epoch {start_epoch})")
    else:
        # loading a bare (model only) checkpoint for fine-tune or evaluation
        model.load_state_dict(checkpoint)
        logging.info(f"=> loaded checkpoint '{args.resume}' (epoch {start_epoch})")
    return start_epoch


def load_optimizer(args, model, optimizer, scaler):
    potential_checkpoint = args.resume.replace("epoch_", "optimizer_")
    if check_exists(potential_checkpoint):
        checkpoint = pt_load(potential_checkpoint, map_location="cpu")
    else:
        checkpoint = pt_load(args.resume, map_location="cpu")
    if "optimizer" in checkpoint:
        if optimizer is not None:
            osd = checkpoint["optimizer"]
            if args.fsdp:
                osd = FSDP.optim_state_dict_to_load(
                    model=model, optim=optimizer, optim_state_dict=osd
                )
            optimizer.load_state_dict(osd)
            logging.info(f"=> resuming optimizer")
        if scaler is not None and "scaler" in checkpoint:
            scaler.load_state_dict(checkpoint["scaler"])
    else:
        logging.info(f"=> WARNING: not resuming optimizer.")


def load_data_chunks(args):
    checkpoint = pt_load(args.resume, map_location="cpu")
    if "next_chunk" in checkpoint and "samples_seen" in checkpoint:
        return checkpoint["next_chunk"], checkpoint["samples_seen"]
    else:
        logging.info(
            f"=> WARNING: tried to resume a checkpoint without data chunk info. Assuming next_chunk = 0."
        )
        return 0, 0
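
One possible shape for the fix, sketched below: load args.resume once in the caller and hand the in-memory dict to each helper. The checkpoint= parameters do not exist yet and would have to be added to the three functions above; load_optimizer would still need its separate load when an optimizer_* file exists.

def resume_from_checkpoint(args, model, optimizer, scaler):
    # Load args.resume exactly once and reuse the dict in all three helpers.
    checkpoint = pt_load(args.resume, map_location="cpu")
    start_epoch = load_model(args, model, checkpoint=checkpoint)
    load_optimizer(args, model, optimizer, scaler, checkpoint=checkpoint)
    next_chunk, samples_seen = load_data_chunks(args, checkpoint=checkpoint)
    return start_epoch, next_chunk, samples_seen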

llama2 unit tests

Currently there is an IPython notebook that checks whether llama2 weight conversion to the open_lm format is correct. It would be great to move this to a more formal pytest unit test in the tests/ directory.

grad accum tests failing on gpu w/ amp_bf16 precision

Changing the precision from fp32 to amp_bf16 causes pytest tests/test_grad_accum.py to fail:

FAILED tests/test_grad_accum.py::test_grad_acc - AssertionError: Failed gradient checks at: ['tok_embeddings.weight', 'layers.0.attention.in_proj.weight', 'layers.0...
FAILED tests/test_grad_accum.py::test_grad_acc_fsdp - torch.multiprocessing.spawn.ProcessRaisedException: 

"Number of shards requested for a single epoch is more than the number of shards available" in the middle of a training run

ERROR:root:Number of shards requested for a single epoch is more than the number of shards available. This means that the amount of data requested to train on is more than the dataloader can serve. This can either happen because there are not enough data to begin with, or data being skipped due to rounding errors. To alleviate the latter, consider making more uniform shards, and using less workers/GPUs. This will allow for better use of the dataset.

2024-01-03,15:18:27 | ERROR | Number of shards requested for a single epoch is more than the number of shards available. This means that the amount of data requested to train on is more than the dataloader can serve. This can either happen because there are not enough data to begin with, or data being skipped due to rounding errors. To alleviate the latter, consider making more uniform shards, and using less workers/GPUs. This will allow for better use of the dataset.

Traceback (most recent call last):
  File "/miniconda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/miniconda/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/mnt/task_runtime/open_lm/open_lm/main.py", line 841, in <module>
    main(sys.argv[1:])
  File "/mnt/task_runtime/open_lm/open_lm/main.py", line 717, in main
    ) = get_string_for_epoch(
  File "/mnt/task_runtime/open_lm/open_lm/file_utils.py", line 293, in get_string_for_epoch
    return _single_epoch_string(num_samples, starting_points, paths, weights, num_workers_per_gpu, world_size)
  File "/mnt/task_runtime/open_lm/open_lm/file_utils.py", line 424, in _single_epoch_string
    raise e
  File "/mnt/task_runtime/open_lm/open_lm/file_utils.py", line 405, in _single_epoch_string
    shard_name = manifests[i][next_shard_per_source[i]]["shard"]
IndexError: list index out of range

I have tried decreasing the number of workers.

Dataloading Epoch Update Bug

In get_wds_dataset, the code loops over all datasets and creates a shared_epoch for each one, but the function returns a DataInfo object holding only one of these shared_epoch objects.

Thus, when we call data["train"].set_epoch(epoch) in train_one_epoch, it only updates the epoch number for one of the datasets. All other datasets are stuck at epoch=0 and will end up sampling the same data over and over.
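
One possible direction for a fix, sketched below: collect every per-dataset shared_epoch and fan set_epoch out to all of them. The MultiSharedEpoch wrapper is illustrative, not existing code:

class MultiSharedEpoch:
    """Forwards set_epoch to the shared_epoch of every underlying dataset."""

    def __init__(self, shared_epochs):
        self.shared_epochs = shared_epochs

    def set_epoch(self, epoch):
        for shared_epoch in self.shared_epochs:
            shared_epoch.set_epoch(epoch)

get_wds_dataset would then return a DataInfo built around a MultiSharedEpoch covering all sources, so data["train"].set_epoch(epoch) advances every dataset.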

Deduplicate argparse namespace creation for tests

Right now, we set up arguments in tests where we don't need to. As a result, we end up needing to change tests every time we add a parameter:

Some places:

  1. args = {
         "device": "cpu",
         "precision": "fp16",
         "accum_freq": 1,
         "seq_len": 9,
         "vocab_size": 10,
         "batch_size": 16,
         "log_logit_mean": False,
         "grad_clip_norm": 1.0,
         "skip_scheduler": True,
         "rank": 0,
         "local_rank": 0,
         "world_size": 1,
         "wandb": False,
         "log_every_n_steps": 1,
         "target_mask_left": None,
         "target_mask_individual": None,
     }

  2. args = argparse.Namespace(
         **{
             # Generation params:
             "model": "open_lm_160m",
             "input_text": "random",
             "max_gen_len": max_gen_len,
             "context_len": context_len,
             "temperature": 0.0,
             "top_p": 1.0,
             "use_cache": False,
             # Model params that might not be in config:
             "model_norm": "gain_only_layer_norm",
             "qk_norm": False,
             "positional_embedding_type": "rotary",
             "ffn_type": "swiglu",
         }
     )

  3. open_lm/tests/shared.py

     Lines 13 to 79 in b5f9beb

     class MockTrainArgs:
         def __init__(self, model, **kwargs):
             data_path = download_val_data("shard_00000000.tar", "./tests/assets/")

             self.model = model  # part of model config
             self.model_norm = "gain_only_layer_norm"
             self.qk_norm = False
             self.train_data = [
                 data_path,
             ]
             self.log_logit_mean = False
             self.device = "cpu"
             self.precision = "float32"
             self.wd = 0.033
             self.lr = 3e-3
             self.beta1 = 0.9
             self.beta2 = 0.95
             self.eps = 1e-8
             self.warmup = 2
             self.skip_scheduler = False
             self.accum_freq = 1
             self.batch_size = 8
             self.grad_clip_norm = 1.0
             self.rank = 0
             self.local_rank = 0
             self.log_every_n_steps = 1e8
             self.save_logs = False
             self.logs = None
             self.name = "test_model_name"
             self.dataset_type = "webdataset"
             self.data_key = "json"
             self.ffn_type = "swiglu"
             self.train_num_samples = 250000
             self.train_data_mix_weights = None
             self.train_data_upsampling_factors = None
             self.disable_buffer = False
             self.seed = 1
             self.vocab_size = 50432
             self.seq_len = 300
             self.epochs = 1
             self.save_frequency = 1
             self.checkpoint_path = "./tests/assets/checkpoints/"
             self.resume = None
             self.distributed = False
             self.delete_previous_checkpoint = False
             self.workers = 1
             self.world_size = 1
             self.val_data = None
             self.lr_cooldown_end = 3e-5
             self.force_min_lr = 0.0
             self.scaler = None
             self.accum_freq = 1
             self.device = "cuda:0" if torch.cuda.is_available() else "cpu"
             self.wandb = False
             self.fsdp = False
             self.fsdp_amp = False
             self.positional_embedding_type = "rotary"
             self.dist_backend = "nccl"
             self.dist_url = "env://"
             self.dataset_manifest = None
             self.target_mask_left = None
             self.target_mask_individual = None
             self.ignore_parse_errors = False

             for k, v in kwargs.items():
                 if hasattr(self, k):
                     setattr(self, k, v)

  4. args = argparse.Namespace(
         **{
             # Generation params:
             "model": "open_lm_1b_old",
             "input_text": "random",
             "max_gen_len": None,
             "context_len": None,
             "temperature": 0.0,
             "top_p": 1.0,
             "use_cache": False,
             "checkpoint": "checkpoints/open_lm_1b_old.pt",
             # Model params that might not be in config:
             "model_norm": "default_layer_norm",
             "qk_norm": False,
             "positional_embedding_type": "head_rotary",
             "ffn_type": "swiglu",
         }
     )

We should instead just call parse_args, or at the very least, only have these args in one part of the tests.
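
A sketch of the parse_args route, assuming the parser accepts an explicit argument list (the helper name and import path are illustrative):

from open_lm.params import parse_args  # assumed location of the argparse setup

def make_test_args(**overrides):
    """Build an args namespace from the real parser, then apply test-specific overrides."""
    args = parse_args([])  # pick up argparse defaults for every flag, including newly added ones
    for key, value in overrides.items():
        setattr(args, key, value)
    return args

# Usage in a test:
# args = make_test_args(model="open_lm_160m", batch_size=16, device="cpu")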

Wrong token count when using --accurate-total-tokens.

Looking into the logs for runs with the --accurate-total-tokens option, the number of tokens seen that is reported at the end of training is smaller than the desired one by about 250-300M.

Without the option, dataloading without replacement works properly. As such, runs that don't exhaust the available data are fine.

Tagging @sagadre @achalddave @Vaishaal. I'll look into what causes this.

Error early if we don't have enough disk space

For medium-to-large models, if the user doesn't have enough disk space (or, more commonly, has accidentally specified a path on a volume with not enough disk space), we train for a full "epoch," and crash while saving the checkpoint. It would be nice to either:

Option 1: Save a dummy checkpoint at the very start, before training. If this succeeds, assume that future checkpoints will work if --delete-previous-checkpoint is specified. As an addition, we could check if there is num_checkpoints * size(initial checkpoint) disk space remaining if --delete-previous-checkpoint is not specified, but this is not necessary.

Option 2: Estimate the size of the checkpoint (based on number of parameters) and check if we have enough disk space based on number of checkpoints requested.
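
A rough sketch of Option 2; the 4x multiplier (fp32 weights plus AdamW exp_avg / exp_avg_sq plus slack) and the function name are assumptions:

import shutil

def check_disk_space(model, checkpoint_path, num_checkpoints, bytes_per_param=4, state_multiplier=4):
    """Fail fast if the checkpoint volume likely cannot hold the requested checkpoints.

    checkpoint_path must already exist. state_multiplier=4 roughly accounts for weights
    plus AdamW optimizer state; adjust for the actual precision and optimizer used.
    """
    num_params = sum(p.numel() for p in model.parameters())
    estimated_ckpt_bytes = num_params * bytes_per_param * state_multiplier
    required = estimated_ckpt_bytes * num_checkpoints
    free_bytes = shutil.disk_usage(checkpoint_path).free
    if free_bytes < required:
        raise RuntimeError(
            f"Estimated checkpoint storage {required / 1e9:.1f} GB exceeds "
            f"free space {free_bytes / 1e9:.1f} GB at {checkpoint_path}."
        )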

Figure out why AdamW + gradient accumulation leads to different results for test case

In #125, we had to switch our gradient accumulation tests from SGD to AdamW to make gradient accumulation tests pass. It's unclear why this is the case; anecdotally, when training models with AdamW, training curves look similar with and without gradient accumulation. This could be a numerical issue, or some specific issue with AdamW that makes gradient accumulation behave differently.

Factor out parameter error checking

Currently, parameter error checking and throwing is done on the fly in main.py. This means we may do some heavyweight initialization (e.g., of the model) only to throw if a user passed incompatible flags. An error-checking function, called right after argparse, would alleviate this and also clean up the code.
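
A minimal sketch of what that function could look like; the specific checks are illustrative examples, not an exhaustive or verified list:

def check_args(args):
    """Validate parsed arguments before any heavy initialization (model, data, distributed setup)."""
    if args.accum_freq < 1:
        raise ValueError("--accum-freq must be a positive integer.")
    if args.fsdp and not args.distributed:
        raise ValueError("--fsdp requires distributed training.")
    if args.dataset_manifest is not None and args.dataset_resampled:
        raise ValueError("--dataset-manifest and --dataset-resampled cannot be used together.")

main() would call check_args(args) right after parse_args and before building the model.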

Factorize helper function for all model loading

It would be nice to simplify the checkpoint loading. Right now it is a bit confusing, with checkpoint_path, args.checkpoint_path, join(.., "checkpoints"), remote-sync, etc. all referenced in various places in main.py.

Improve dataloading.

Some items that need to be addressed:

  • Clean up the code in data.py.
  • Make --dataset-resampled and --dataset-manifest the only possible options.
  • Make --accurate-total-tokens the default.

Ablate on initialization

Interested in:

A) changing layer_id + 1 to args.num_layers.
B) removing the line std = std / math.sqrt(2 * (layer_id + 1))
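
For reference, a sketch of the current depth-scaled init std and the two variants under discussion (base_std and the surrounding init code are paraphrased, not copied from the model code):

import math

def output_proj_std(base_std, layer_id, num_layers, variant="current"):
    """Init std for the output projection under the three settings discussed above."""
    if variant == "current":   # std = std / math.sqrt(2 * (layer_id + 1))
        return base_std / math.sqrt(2 * (layer_id + 1))
    if variant == "A":         # replace layer_id + 1 with args.num_layers
        return base_std / math.sqrt(2 * num_layers)
    if variant == "B":         # drop the depth scaling entirely
        return base_std
    raise ValueError(f"unknown variant {variant!r}")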

Weird memory usage for 11m vs 160m: similar batch size fits in memory...

I haven't yet investigated the cause of this, but I'd appreciate insights and help.

When running a 160m model (hidden_dim: 768, n_layers: 12, n_heads: 12, seq_len: 256), I can fit batch_size: 48 in memory on a 3090, and it saturates the GPUs nicely.

Then, running an 11m model (hidden_dim: 96, n_layers: 12, n_heads: 12, seq_len: 256), I can still only fit batch_size: 48 in memory on a 3090.

Things to note:

  • The vocab size is the same, the code is exactly the same for both runs.
  • I've disabled gradient checkpointing and FSDP given the model sizes.
  • I'm just using two GPUs for testing, memory runs out on the first process.

Would you expect to be able to increase the batch_size for smaller models? If so, there may be a problem somewhere...

I've used other LLM codebases, but I'm not yet familiar with this one, so any insights on where to look for problems would be appreciated!

Revamp make_2048.py script

Some ideas:

  • Add flags so that people can chunk to different chunk sizes; based on the latest changes, people should chunk to the largest context length they expect to train on.
  • Look into alternatives to a system call to aws s3 cp, perhaps using fsspec or cloudpathlib.
  • Write directly to S3/cloud rather than caching locally and then pushing to the cloud (see the sketch after this list).
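
For the last item, a minimal sketch of streaming a finished shard straight to its destination with fsspec (the bucket path is a placeholder; s3:// support requires s3fs to be installed):

import fsspec

def write_shard(shard_bytes, shard_name, output_dir="s3://my-bucket/shards"):
    """Write a tarred shard directly to a local path or to s3://, with no local cache + aws s3 cp step."""
    with fsspec.open(f"{output_dir}/{shard_name}", "wb") as f:
        f.write(shard_bytes)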

Documentation: competing frameworks

Stating why one should choose this framework over others (GPT-NeoX, DeepSpeed/Megatron, accelerate + HF, etc.) would make it easier to decide whether to use this framework rather than another. (The timing vs. ease-of-use comparison you started mentioning on Twitter might be a good thing to write there first.)

Undefined argument "moe_freq" when running unit tests on WSL/Ubuntu 20.04

I'm trying OpenLM on Ubuntu 20.04 under WSL. I've hit an issue running the unit tests where the argument "moe_freq" is never set before it is used in train.py, which results in a Python error. As a workaround, I added hasattr() to line 176 in train.py:

if hasattr(args, "moe_freq") and args.moe_freq > 0:

Steps to Reproduce

open_lm/open_lm$ pytest tests/

FAILED tests/test_accumulation.py::TestGradientAccumulation::test_accumulation - AttributeError: 'Namespace' object has no attribute 'moe_freq'

Import from attention.py error

I can't seem to import from attention.py

pip install git+https://github.com/mlfoundations/open_lm.git

Imports like these work without any issues:
from open_lm.data import get_data
from open_lm.main import main

It fails when I try to import from open_lm.attention
from open_lm.attention import ATTN_ACTIVATIONS, ATTN_SEQ_SCALARS

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: cannot import name 'ATTN_ACTIVATIONS' from 'open_lm.attention' (unknown location)

HF Integration

Hi OpenLM team! Is there interest in making OpenLM models loadable using just HF?

I see some OpenLM models up on HF, but they are not readily loadable using HF. The proposed changes would involve adding an OpenLM class on HF, similar to how other models are hosted on HF (e.g. Mistral).

For comparison, both #54 and #20 allow saved OpenLM models to be loaded using HF functions, but under the hood it still calls OpenLM functions and requires the OpenLM library downloaded locally. What I'm thinking is basically porting OpenLM's model.py into the transformers library itself, so that OpenLM trained models can be shared and loaded more easily. I can work on this if you think it's a good idea.

@mitchellnw @sagadre @achalddave

Problem in position embedding

queries, keys, vals = self.pos_embed(queries, keys, vals)

It seems to me that the rotary position embedding is being applied on the head dimension (dim -2) of the vectors q, k instead of the sequence dimension (dim 1).
I think the head and sequence dimensions should be swapped before calling the position embedding.
(see https://github.com/facebookresearch/xformers/blob/748c159096d4f9fcfe3eaf22801e5aed4777210b/xformers/components/positional_embedding/rotary.py#L85)

What I'm proposing is simply to rewrite RotaryWithCast as follows:

class RotaryWithCast(RotaryEmbedding):
    def forward(self, q, k, v):
        q, k = super().forward(q.permute(0, 2, 1, 3), k.permute(0, 2, 1, 3))
        q = q.permute(0, 2, 1, 3)
        k = k.permute(0, 2, 1, 3)
        return q.to(v.dtype), k.to(v.dtype), v

Speed up loading remote checkpoints

def pt_load(file_path, map_location=None):

fsspec is somehow really slow at loading large files in my experience, and right now we have every process reading from s3. This is quite slow at large model sizes; it would be nice to speed this up, probably via subprocess.run("aws s3 cp ...") on local_rank=0 and then loading the local copy from each worker.
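
A sketch of that approach, assuming torch.distributed is already initialized and the AWS CLI is available; the cache directory is a placeholder:

import os
import subprocess

import torch
import torch.distributed as dist

def download_then_load(remote_path, local_rank, cache_dir="/tmp/open_lm_ckpt"):
    """Download the checkpoint once per node on local_rank 0, then load it from local disk on every rank."""
    os.makedirs(cache_dir, exist_ok=True)
    local_path = os.path.join(cache_dir, os.path.basename(remote_path))
    if local_rank == 0 and not os.path.exists(local_path):
        subprocess.run(["aws", "s3", "cp", remote_path, local_path], check=True)
    if dist.is_initialized():
        dist.barrier()  # make every rank wait until the download has finished
    return torch.load(local_path, map_location="cpu")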

Add test for checkpoint loading after save

We should add a test that:

  1. trains a (small) model for a couple steps
  2. saves it to disk
  3. calls main() again with a path to the checkpoint on disk
  4. trains a few steps

The test should cover single-process, DDP, and FSDP training.
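
A single-process sketch of the test; the CLI flag names, model name, and checkpoint path below are assumptions that would need to match parse_args and the actual checkpoint layout:

from open_lm.main import main

def test_resume_from_saved_checkpoint(tmp_path):
    common_args = [
        "--model", "open_lm_11m",
        "--train-data", "./tests/assets/shard_00000000.tar",
        "--train-num-samples", "256",
        "--batch-size", "8",
        "--epochs", "1",
        "--logs", str(tmp_path),
        "--name", "resume_test",
    ]
    # Steps 1-2: train a small model for a few steps and save a checkpoint to disk.
    main(common_args)
    # Steps 3-4: call main() again with the saved checkpoint and train a few more steps.
    checkpoint = tmp_path / "resume_test" / "checkpoints" / "epoch_1.pt"
    main(common_args + ["--resume", str(checkpoint)])

The DDP and FSDP variants would wrap the same flow with torch.multiprocessing.spawn, as the existing FSDP tests do.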
