mlfoundations / open_lm
A repository for research on medium sized language models.
License: MIT License
Would be nice to have a unit test for grad accum to make sure gradients are close to those computed without it.
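A minimal sketch of what such a test could check (the model, loss_fn, and batch here are illustrative placeholders, not open_lm's actual training loop):

import copy
import torch

def test_grad_accum_matches_full_batch(model, batch, loss_fn, accum_steps=4):
    # Gradients from one full-batch backward should be close to the gradients
    # accumulated over accum_steps equal-sized micro-batches.
    model_full = copy.deepcopy(model)
    model_accum = copy.deepcopy(model)

    loss_fn(model_full(batch)).backward()

    for micro_batch in batch.chunk(accum_steps):
        # Scale each micro-batch loss so the accumulated gradient matches the full-batch mean.
        (loss_fn(model_accum(micro_batch)) / accum_steps).backward()

    for (name, p_full), (_, p_accum) in zip(
        model_full.named_parameters(), model_accum.named_parameters()
    ):
        assert torch.allclose(p_full.grad, p_accum.grad, atol=1e-5), f"gradient mismatch at {name}"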
Currently, we load args.resume potentially up to 3 times. This can be pretty slow for big models, and we should avoid re-loading it in these spots:
Lines 110 to 156 in 97d0a4a
Currently there is an IPython notebook to check whether llama2 weight conversion to the open_lm format is correct. It would be great to move this to a more formal pytest unit test in the tests/ directory.
Changing precision from fp32 to amp_bf16 leads to pytest tests/test_grad_accum.py failing:
FAILED tests/test_grad_accum.py::test_grad_acc - AssertionError: Failed gradient checks at: ['tok_embeddings.weight', 'layers.0.attention.in_proj.weight', 'layers.0...
FAILED tests/test_grad_accum.py::test_grad_acc_fsdp - torch.multiprocessing.spawn.ProcessRaisedException:
ERROR:root:Number of shards requested for a single epoch is more than the number of shards available. This means that the amount of data requested to train on is more than the dataloader can serve. This can either happen because there are not enough data to begin with, or data being skipped due to rounding errors. To alleviate the latter, consider making more uniform shards, and using less workers/GPUs. This will allow for better use of the dataset.
2024-01-03,15:18:27 | ERROR | Number of shards requested for a single epoch is more than the number of shards available. This means that the amount of data requested to train on is more than the dataloader can serve. This can either happen because there are not enough data to begin with, or data being skipped due to rounding errors. To alleviate the latter, consider making more uniform shards, and using less workers/GPUs. This will allow for better use of the dataset.
Traceback (most recent call last):
File "/miniconda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/miniconda/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/mnt/task_runtime/open_lm/open_lm/main.py", line 841, in
main(sys.argv[1:])
File "/mnt/task_runtime/open_lm/open_lm/main.py", line 717, in main
) = get_string_for_epoch(
File "/mnt/task_runtime/open_lm/open_lm/file_utils.py", line 293, in get_string_for_epoch
return _single_epoch_string(num_samples, starting_points, paths, weights, num_workers_per_gpu, world_size)
File "/mnt/task_runtime/open_lm/open_lm/file_utils.py", line 424, in _single_epoch_string
raise e
File "/mnt/task_runtime/open_lm/open_lm/file_utils.py", line 405, in _single_epoch_string
shard_name = manifests[i][next_shard_per_source[i]]["shard"]
IndexError: list index out of range
I have tried decreasing the number of workers.
Tests currently fail locally because we download credentials in the workflow rather than in the tests. We should move this into the tests.
In get_wds_dataset, it loops over all datasets and creates a shared_epoch for each dataset, but the function get_wds_dataset returns a DataInfo object for only one shared_epoch.
Thus when we call data["train"].set_epoch(epoch) in train_one_epoch, it only updates the epoch number for one of the datasets. All other datasets are stuck in epoch=0 and will end up sampling the same data over and over.
By default, FSDP will reduce gradients on every backward() call, which is slow in multi node settings. We should use fsdp.no_sync() to only reduce gradients on the last backward call.
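A minimal sketch of the pattern (compute_loss, micro_batches, and accum_steps are illustrative names; model is the FSDP-wrapped module):

import contextlib

for step, micro_batch in enumerate(micro_batches):
    is_last = (step + 1) % accum_steps == 0
    # Skip the gradient all-reduce on every backward except the last one in the accumulation window.
    ctx = contextlib.nullcontext() if is_last else model.no_sync()
    with ctx:
        loss = compute_loss(model, micro_batch) / accum_steps
        loss.backward()
    if is_last:
        optimizer.step()
        optimizer.zero_grad()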
Right now, we set up arguments in tests where we don't need to. As a result, we end up needing to change tests every time we add a parameter:
Some places:
open_lm/open_lm/tests/test_accumulation.py
Lines 42 to 59 in b5f9beb
open_lm/tests/test_generate_kv_cache_time.py
Lines 21 to 37 in b5f9beb
Lines 13 to 79 in b5f9beb
open_lm/tests/test_generate_load_kv_cache_equal.py
Lines 29 to 46 in b5f9beb
We should instead just call parse_args, or at the very least, only have these args in one part of the tests.
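A sketch of the parse_args approach (assuming parse_args accepts a list of CLI-style arguments, as argparse-based parsers usually do; the exact import path may differ):

from open_lm.params import parse_args

def make_test_args(**overrides):
    # Start from the real defaults so tests don't break every time a new flag is added.
    args = parse_args([])
    for key, value in overrides.items():
        setattr(args, key, value)
    return args

# In a test: args = make_test_args(batch_size=2, accum_freq=4)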
Hello!
Thank you for the great work.
I was wondering if you planned to release the intermediate checkpoints for all pretrained models as in Pythia (https://arxiv.org/pdf/2304.01373.pdf)?
Looking into the logs for runs with the --accurate-total-tokens option, the number of tokens seen reported at the end of training is smaller than the desired one, by about 250-300M.
Without the option, dataloading without replacement works properly. As such, runs that don't exhaust the available data are fine.
Tagging @sagadre @achalddave @Vaishaal. I'll look into what causes this.
For medium-to-large models, if the user doesn't have enough disk space (or, more commonly, has accidentally specified a path on a volume with not enough disk space), we train for a full "epoch," and crash while saving the checkpoint. It would be nice to either:
Option 1: Save a dummy checkpoint at the very start, before training. If this succeeds, assume that future checkpoints will work if --delete-previous-checkpoint is specified. As an addition, we could check if there is num_checkpoints * size(initial checkpoint) disk space remaining if --delete-previous-checkpoint is not specified, but this is not necessary.
Option 2: Estimate the size of the checkpoint (based on number of parameters) and check if we have enough disk space based on number of checkpoints requested.
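A rough sketch of Option 2 (the 16 bytes-per-parameter figure is an assumption standing in for weights plus optimizer state, not a measured value):

import shutil

def check_checkpoint_disk_space(model, checkpoint_dir, num_checkpoints, bytes_per_param=16):
    # Rough upper bound on the size of one checkpoint.
    est_checkpoint_bytes = sum(p.numel() for p in model.parameters()) * bytes_per_param
    free_bytes = shutil.disk_usage(checkpoint_dir).free
    needed_bytes = est_checkpoint_bytes * num_checkpoints
    if free_bytes < needed_bytes:
        raise RuntimeError(
            f"Need ~{needed_bytes / 1e9:.1f} GB for {num_checkpoints} checkpoints, "
            f"but only {free_bytes / 1e9:.1f} GB free at {checkpoint_dir}."
        )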
In #125, we had to switch our gradient accumulation tests from SGD to AdamW to make gradient accumulation tests pass. It's unclear why this is the case; anecdotally, when training models with AdamW, training curves look similar with and without gradient accumulation. This could be a numerical issue, or some specific issue with AdamW that makes gradient accumulation behave differently.
Proposed by @achalddave, opening as a different issue to keep separate.
Currently, parameter error checking and throwing is done on-the-fly in main.py. This means we may do some heavyweight initialization (e.g., of the model) only to throw if a user passed incompatible flags. Having an error checking function, called right after argparse, will alleviate this and also clean up the code.
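A sketch of the idea (the specific checks here are hypothetical placeholders; the real ones would be collected from main.py):

def check_args(args):
    # Fail fast on incompatible flags, before any model or data initialization happens.
    errors = []
    if args.dataset_manifest is not None and args.dataset_resampled:
        errors.append("--dataset-manifest and --dataset-resampled cannot be used together.")
    if args.epochs < 1:
        errors.append("--epochs must be at least 1.")
    if errors:
        raise ValueError("Invalid arguments:\n" + "\n".join(errors))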
It would be nice to simplify the checkpoint loading. Right now it is a bit confusing, with checkpoint_path, args.checkpoint_path, join(.., "checkpoints"), remote-sync, etc. all referenced in various places in main.py.
Some items that need to be addressed in data.py:
--dataset-resampled and --dataset-manifest should be the only possible options.
--accurate-total-tokens should be the default.
Interested in:
A) changing layer_id + 1 to args.num_layers.
B) removing the line std = std / math.sqrt(2 * (layer_id + 1))
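For reference, a small sketch of the current scaling and the two proposed alternatives (illustrative function, not the actual init code):

import math

def residual_init_std(base_std, layer_id, num_layers):
    current = base_std / math.sqrt(2 * (layer_id + 1))  # current per-layer scaling
    option_a = base_std / math.sqrt(2 * num_layers)     # A) scale by total depth instead
    option_b = base_std                                  # B) remove the scaling entirely
    return current, option_a, option_b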
I haven't yet investigated the cause of this, but I'd appreciate insights and help.
When running a 160m model (hidden_dim: 768, n_layers: 12, n_heads: 12, seq_len: 256) I can fit batch_size: 48 in memory on a 3090, and it saturates the GPUs nicely.
Then running an 11m model (hidden_dim: 96, n_layers: 12, n_heads: 12, seq_len: 256) I can still only fit batch_size: 48 in memory on a 3090.
Things to note:
Would you expect to be able to increase the batch_size for smaller models? If so, there may be a problem somewhere...
I've used other LLM codebases, but I'm not yet familiar with this one, so any insights on where to look for problems would be appreciated!
We should use SHARDED_STATE_DICT when loading/saving checkpoints to avoid loading the entire model in CPU memory, similar to https://github.com/pytorch/pytorch/blob/main/torch/distributed/checkpoint/examples/fsdp_checkpoint_example.py.
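A sketch following the linked example (these torch.distributed.checkpoint APIs are taken from that example; newer PyTorch versions also offer torch.distributed.checkpoint.save):

import torch.distributed.checkpoint as dist_cp
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, StateDictType

def save_sharded_checkpoint(model, checkpoint_dir):
    # Each rank writes only its own shard, so no rank materializes the full model in CPU memory.
    with FSDP.state_dict_type(model, StateDictType.SHARDED_STATE_DICT):
        state_dict = {"model": model.state_dict()}
        dist_cp.save_state_dict(
            state_dict=state_dict,
            storage_writer=dist_cp.FileSystemWriter(checkpoint_dir),
        )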
Some ideas:
s3 cp, perhaps using fsspec or cloudpathlib
Benchmark on-the-fly tokenization and get it to be as fast as training on pre-tokenized data
Prereq: Merge or close all open PRs to avoid major conflicts
Would be great to benchmark tokens/sec of OpenLM, comparing to other libraries like Mosaic, Metaseq, etc.
Stating why one should choose this framework instead of others (GPT-NeoX, DeepSpeed/Megatron, accelerate+HF, etc.) would make the decision easier for new users. (The timing vs. ease-of-use comparison you started mentioning on Twitter might be a good thing to write there first.)
Currently, the CI only checks for formatting/linting in the openlm module dir. We should also check in tests/, and reformat all of tests/ with black.
Would be nice to have something like this for open_lm: https://github.com/facebookresearch/metaseq/tree/main/projects/OPT/chronicles
Currently, when both options are specified, only the local storage is cleared of old checkpoints.
Often we may have special control tokens that need to be handled when creating the inputs and targets. To allow maximum flexibility, users should be able to provide their own sample_chunk functions or similar.
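A hypothetical example of what a user-provided function might look like (the name, signature, and control-token id are illustrative; the actual hook in data.py may differ, and this assumes the loss uses PyTorch's default ignore_index of -100):

def my_sample_chunk(chunk, args):
    # chunk: a list of seq_len + 1 token ids.
    CONTROL_TOKEN_ID = 50257  # placeholder id for a special control token
    inputs = chunk[: args.seq_len]
    targets = [
        -100 if tok == CONTROL_TOKEN_ID else tok  # mask loss on the control token
        for tok in chunk[1 : args.seq_len + 1]
    ]
    return inputs, targets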
Shouldn't be too hard (probably can just copy from open_clip) and enables somewhat easier fine-tuning. Nice to have
Right now, we disable distributed functionality if we are using a single gpu: https://github.com/mlfoundations/open_lm/blob/main/open_lm/distributed.py#L20-L25. But if WORLD_SIZE is provided, we should behave as if we are in a distributed environment, which in turn will allow us to run tests that verify the distributed code paths without requiring multiple gpus.
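A sketch of the proposed behavior (not the exact code in distributed.py):

import os

def is_using_distributed():
    # Respect an explicit WORLD_SIZE, even if it is 1, so single-GPU runs can still
    # exercise the distributed code paths (e.g. in tests).
    if "WORLD_SIZE" in os.environ:
        return int(os.environ["WORLD_SIZE"]) >= 1
    if "SLURM_NTASKS" in os.environ:
        return int(os.environ["SLURM_NTASKS"]) > 1
    return False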
I'm trying OpenLM on Ubuntu 20.04 under WSL. I've hit an issue running the unit tests where the argument "moe_freq" is never set before it is used in train.py, which results in a Python error. As a workaround I added hasattr() to line 176 in train.py:
if hasattr(args, "moe_freq") and args.moe_freq > 0:
open_lm/open_lm$ pytest tests/
FAILED tests/test_accumulation.py::TestGradientAccumulation::test_accumulation - AttributeError: 'Namespace' object has no attribute 'moe_freq'
I can't seem to import from attention.py
pip install git+https://github.com/mlfoundations/open_lm.git
Stuff like these work without any issues:
from open_lm.data import get_data
from open_lm.main import main
It fails when I try to import from open_lm.attention
from open_lm.attention import ATTN_ACTIVATIONS, ATTN_SEQ_SCALARS
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ImportError: cannot import name 'ATTN_ACTIVATIONS' from 'open_lm.attention' (unknown location)
Some potentially helpful links:
Note the following helpful link we were sent: https://github.com/NVIDIA/apex/blob/bc4be41c6fdb889db84b9f61f35440f82a057948/apex/normalization/fused_layer_norm.py#L192
Hi OpenLM team! Is there interest in making OpenLM models loadable using just HF?
I see some OpenLM models up on HF, but they are not readily loadable using HF. The proposed changes would involve adding an OpenLM class on HF, similar to how other models are hosted on HF (e.g. Mistral).
For comparison, both #54 and #20 allow saved OpenLM models to be loaded using HF functions, but under the hood this still calls OpenLM functions and requires the OpenLM library to be installed locally. What I'm thinking is basically porting OpenLM's model.py into the transformers library itself, so that OpenLM-trained models can be shared and loaded more easily. I can work on this if you think it's a good idea.
Line 129 in 619a8b3
It seems to me that the rotary position embedding is being applied on the head dimension (dim -2) of the vectors q, k instead of the sequence dimension (dim 1).
I think the head and sequence dimensions should be swapped before calling the positional embedding
(see https://github.com/facebookresearch/xformers/blob/748c159096d4f9fcfe3eaf22801e5aed4777210b/xformers/components/positional_embedding/rotary.py#L85).
What I'm proposing is simply to re-write RotaryWithCast as follows:
class RotaryWithCast(RotaryEmbedding):
    def forward(self, q, k, v):
        # Move the sequence dimension to -2 before applying the rotary embedding,
        # then restore the original layout.
        q, k = super().forward(q.permute(0, 2, 1, 3), k.permute(0, 2, 1, 3))
        q = q.permute(0, 2, 1, 3)
        k = k.permute(0, 2, 1, 3)
        return q.to(v.dtype), k.to(v.dtype), v
@sagadre has done a big grid search of HPs; let's update the names (i.e., potato_neox -> open_lm_410m) and add JSONs with the optimal HPs.
Line 70 in 9ca7042
fsspec is somehow really slow at loading large files in my experience, and right now we have every process reading from s3. This is quite slow at large model sizes; it would be nice to speed this up, probably via subprocess.run("aws s3 cp ...") in local_rank=0 and then loading locally from each worker.
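A sketch of the idea (hypothetical helper; assumes torch.distributed is initialized and that all ranks on a node share the same local filesystem):

import subprocess
import torch.distributed as dist

def fetch_checkpoint_locally(s3_uri, local_path, local_rank):
    # Only one process per node pulls from S3; everyone else waits, then loads from local disk.
    if local_rank == 0:
        subprocess.run(["aws", "s3", "cp", s3_uri, local_path], check=True)
    dist.barrier()
    return local_path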
We should add a test that:
The test should test single process, DDP, and FSDP.
Right now our tests take a while because they pip install repeatedly.