About max token length · state-spaces/mamba · OPEN

state-spaces commented on August 27, 2024

About max token length

Comments (28)

tridao commented on August 27, 2024

Yes, we do exactly the same thing (which is now standard in several libraries): tokenize all documents, append an "eos" token to the end of each document, concatenate all of them, then split into chunks of size 2048.
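
For reference, here's a minimal sketch of that packing procedure (the function name, toy token ids, and eos id below are illustrative, not the actual preprocessing code used for the released models):

```python
from itertools import chain

def pack_documents(tokenized_docs, eos_id, chunk_len=2048):
    """Append eos to each tokenized document, concatenate everything into one
    stream, then split the stream into fixed-length training chunks."""
    stream = list(chain.from_iterable(doc + [eos_id] for doc in tokenized_docs))
    n_chunks = len(stream) // chunk_len  # drop the trailing partial chunk
    return [stream[i * chunk_len:(i + 1) * chunk_len] for i in range(n_chunks)]

# Toy usage with made-up token ids, eos_id=0, and chunk_len=4 for readability
chunks = pack_documents([[5, 6, 7], [8, 9], [10, 11, 12, 13]], eos_id=0, chunk_len=4)
print(chunks)  # [[5, 6, 7, 0], [8, 9, 0, 10], [11, 12, 13, 0]]
```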

tridao commented on August 27, 2024

That extrapolation was for a simple synthetic task (induction head). For language modeling it remains to be seen.

tridao commented on August 27, 2024

You can also finetune Mamba on long documents.
Regarding "context extrapolation" without fine-tuning, the short answer is ... I don't know. The architecture is new different from Transformer, and there are still lots of interesting research questions.

EricLina commented on August 27, 2024

How should I understand Table 2 in Mamba's paper, which shows great extrapolation ability? 🤔
As your paper shows, Mamba can train at seqlen = 10^3 and test at seqlen = 10^6 with good performance. 🤔

tridao commented on August 27, 2024

There's no restriction, e.g. you can just pass in a sequence of length 8k to finetune.

tridao commented on August 27, 2024

The paper describes the hyperparameters we used.
When increasing the sequence length we decrease the batch size (i.e. keeping the total number of tokens in the batch the same) and keep the other hparams the same. I'm not sure that's optimal but it's what I've been using.
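
As a quick illustration of keeping tokens-per-batch constant (the specific seqlen and batch-size numbers here are made up, not the actual training config):

```python
# Keep total tokens per batch fixed when increasing the sequence length.
base_seqlen, base_batch = 2048, 256
tokens_per_batch = base_seqlen * base_batch   # 524288 tokens per batch

new_seqlen = 8192
new_batch = tokens_per_batch // new_seqlen    # 64
print(new_batch)
```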

tridao commented on August 27, 2024

As the code says, it constructs nn.Conv1d with padding=3 (if the conv has width 4), does the convolution, then removes the last 3 elements.
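
A minimal sketch of that trick with a plain nn.Conv1d (the shapes and depthwise grouping below are illustrative rather than the exact configuration in mamba_simple.py):

```python
import torch
import torch.nn as nn

batch, channels, seqlen, width = 2, 8, 16, 4
x = torch.randn(batch, channels, seqlen)

conv = nn.Conv1d(
    in_channels=channels,
    out_channels=channels,
    kernel_size=width,
    groups=channels,      # depthwise conv
    padding=width - 1,    # pad so no output position depends on future inputs
)

y = conv(x)               # shape (batch, channels, seqlen + width - 1)
y = y[..., :seqlen]       # drop the last width - 1 elements -> causal output
print(y.shape)            # torch.Size([2, 8, 16])
```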

tridao commented on August 27, 2024

It was trained with seqlen=2k for an apples-to-apples comparison with Pythia; it seems to extrapolate to around 3k context length, but after that the quality is much worse.

RevolGMPHL commented on August 27, 2024

If I train on a training set with longer sequences, will it improve the max token length? Does it have anything to do with the size of the model?

tridao commented on August 27, 2024

Yes training on longer context (e.g. 4k or 8k) should help improve max token length. I think this is a general property of most sequence models (e.g. Transformers should be similar).

ftgreat commented on August 27, 2024

Language models based on the Transformer architecture can extrapolate beyond the training context by adjusting the position encoding, which may also require fine-tuning on longer documents. There are also techniques that mitigate the degradation of performance during context extrapolation by filtering the kv cache.

I would like to understand the model structure and design of Mamba's S6 module, and whether there are similar techniques suitable for context extrapolation. Thank you.

ftgreat commented on August 27, 2024

Thanks very much.

I am currently not familiar with the inner details of the Mamba ssm module. May I ask if there are any parameters whose shapes are tied to a preset context length?

sentialx commented on August 27, 2024

@tridao Does Mamba support passing state between multiple forward passes (or blocks of tokens) during training?

tridao commented on August 27, 2024

No that's not supported right now.

ftgreat commented on August 27, 2024

@tridao one more question about dataset processing when pretraining the mamba-2.8b models.

As the GPT-3 paper says, "During training we always train on sequences of the full n_ctx = 2048 token context window, packing multiple documents into a single sequence when documents are shorter than 2048, in order to increase computational efficiency."

Did the released Mamba models use the same packing trick for the datasets? Thanks.

ftgreat commented on August 27, 2024

@tridao one more question please.

How should the number of layers and the model dim be set for roughly 7B Mamba models, and are there design rules for these settings when scaling up the model size?

Thanks.

tridao commented on August 27, 2024

We just follow GPT-3, e.g. for 7B you can use 64 layers (2 Mamba layers have the same number of params as MLP + attn) and d_model = 4096.
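
For reference, a back-of-the-envelope check of that sizing (the 6·d_model² per-block estimate and the vocab size below are rough assumptions that ignore the conv, x/dt projections, and norm parameters):

```python
# Rough parameter count for a ~7B Mamba config: 64 layers, d_model = 4096.
# One Mamba block with expand=2 costs roughly 6 * d_model**2 params
# (in_proj ~ 4*d_model**2, out_proj ~ 2*d_model**2; smaller terms ignored).
d_model, n_layer, vocab_size = 4096, 64, 50280  # vocab size is an assumption

per_block = 6 * d_model ** 2
total = n_layer * per_block + vocab_size * d_model  # + token embedding
print(f"~{total / 1e9:.1f}B params")  # ~6.6B, i.e. in the 7B range
```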

ftgreat commented on August 27, 2024

> We just follow GPT-3, e.g. for 7B you can use 64 layers (2 Mamba layers have the same number of params as MLP + attn) and d_model = 4096.

Thanks.

ftgreat commented on August 27, 2024

@tridao could you release a mamba-1.4B intermediate checkpoint trained on around 100B tokens?

I have trained mamba-1.4B from scratch on a zh-en corpus. If a checkpoint at around 100B tokens is provided, I will check its metrics to validate my process.

Thanks.

tridao commented on August 27, 2024

Unfortunately we only have the fully trained weights.

ftgreat commented on August 27, 2024

> Unfortunately we only have the fully trained weights.

Thanks for your reply.

ftgreat commented on August 27, 2024

@tridao When scaling up the max length for language modeling pretraining from scratch, could you please give us some advice on how to set hyperparameters like lr, warmup, global batch size, etc.?

Thank you.

sentialx commented on August 27, 2024

@tridao what would it take to support passing state between forward passes? I can see it's possible to do this via inference_params, so where's the catch?

tridao commented on August 27, 2024

> @tridao what would it take to support passing state between forward passes? I can see it's possible to do this via inference_params, so where's the catch?

inference_params supports moving the state forward by 1 step (i.e. recurrence). If you want to pass states in along with inputs of length more than 1, you'd need to change the parallel scan (in selective_scan) to deal with that.
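
To illustrate what carrying state across chunks means, here is a toy reference recurrence for a diagonal linear SSM (h_t = a_t * h_{t-1} + b_t); this is a naive stand-in for intuition only, not the actual selective_scan kernel or its tensor layout:

```python
import torch

def naive_scan(a, b, h0):
    """Sequentially apply h_t = a_t * h_{t-1} + b_t.
    a, b: (L, N); h0: (N,). Returns all states (L, N) and the final state."""
    hs, h = [], h0
    for t in range(a.shape[0]):
        h = a[t] * h + b[t]
        hs.append(h)
    return torch.stack(hs), h

L, N = 8, 4
a, b, h0 = torch.rand(L, N), torch.randn(L, N), torch.zeros(N)

full, _ = naive_scan(a, b, h0)
first, h_mid = naive_scan(a[:4], b[:4], h0)
second, _ = naive_scan(a[4:], b[4:], h_mid)  # chunk 2 starts from chunk 1's final state
assert torch.allclose(full, torch.cat([first, second]))
```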

ftgreat commented on August 27, 2024

Mamba can be used as a drop-in replacement module in some frameworks.

Megatron-LM is designed only for Transformer blocks. How can we integrate Mamba into it? Could you give some advice? Thanks.

Sorry to bother you both. @tridao @albertfgu

tridao commented on August 27, 2024

You'd want to replace ParallelTransformerLayer in Megatron-LM with a Mamba layer. It should work if you don't use TensorParallel / Pipeline Parallel in Megatron-LM.
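
As a rough sketch of what such a drop-in layer could look like outside of Megatron-LM's parallelism machinery (the MambaLayer class and its pre-norm residual wiring are illustrative; a real ParallelTransformerLayer replacement would also have to absorb its extra arguments such as attention masks):

```python
import torch
import torch.nn as nn
from mamba_ssm import Mamba  # pip install mamba-ssm

class MambaLayer(nn.Module):
    """Pre-norm residual Mamba block with a (batch, seqlen, d_model) -> same-shape
    interface, so it can stand in for a Transformer layer in a plain layer stack."""
    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.mixer = Mamba(d_model=d_model, d_state=16, d_conv=4, expand=2)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return hidden_states + self.mixer(self.norm(hidden_states))

# Usage (the fused kernels expect CUDA tensors)
layer = MambaLayer(d_model=512).cuda()
x = torch.randn(2, 1024, 512, device="cuda")
y = layer(x)  # (2, 1024, 512)
```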

ftgreat commented on August 27, 2024

> You'd want to replace ParallelTransformerLayer in Megatron-LM with a Mamba layer. It should work if you don't use TensorParallel / Pipeline Parallel in Megatron-LM.

Thanks a lot.
Without TensorParallel / Pipeline Parallel, there is no need to use Megatron-LM for model size scaling.

ftgreat commented on August 27, 2024

@tridao If causal_conv1d_fn is not available, how does the normal conv1d behave causally? Thanks

https://github.com/state-spaces/mamba/blob/main/mamba_ssm/modules/mamba_simple.py#L168
