About max token length · state-spaces/mamba · OPEN

state-spaces commented on August 27, 2024

About max token length

Comments (28)

tridao commented on August 27, 2024

Yes, we do exactly the same thing (which is now standard in several libraries): tokenize all documents, append an "eos" token to the end of each document, concatenate all of them, then split into chunks of size 2048.
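
For reference, here's a minimal sketch of that packing procedure (the function name, toy token ids, and eos id below are illustrative, not the actual preprocessing code used for the released models):

```python
from itertools import chain

def pack_documents(tokenized_docs, eos_id, chunk_len=2048):
    """Append eos to each tokenized document, concatenate everything into one
    stream, then split the stream into fixed-length training chunks."""
    stream = list(chain.from_iterable(doc + [eos_id] for doc in tokenized_docs))
    n_chunks = len(stream) // chunk_len  # drop the trailing partial chunk
    return [stream[i * chunk_len:(i + 1) * chunk_len] for i in range(n_chunks)]

# Toy usage with made-up token ids, eos_id=0, and chunk_len=4 for readability
chunks = pack_documents([[5, 6, 7], [8, 9], [10, 11, 12, 13]], eos_id=0, chunk_len=4)
print(chunks)  # [[5, 6, 7, 0], [8, 9, 0, 10], [11, 12, 13, 0]]
```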

tridao commented on August 27, 2024

That extrapolation was for a simple synthetic task (induction head). For language modeling it remains to be seen.

tridao commented on August 27, 2024

You can also finetune Mamba on long documents.
Regarding "context extrapolation" without fine-tuning, the short answer is ... I don't know. The architecture is new different from Transformer, and there are still lots of interesting research questions.

EricLina commented on August 27, 2024

How should I understand Table 2 in Mamba's paper, which shows great extrapolation ability? 🤔
As your paper shows, Mamba can train at seqlen = 10^3 and test at seqlen = 10^6 with good performance. 🤔

tridao commented on August 27, 2024

There's no restriction, e.g. you can just pass in a sequence of length 8k to finetune.

tridao commented on August 27, 2024

The paper describes the hyperparameters we used.
When increasing the sequence length we decrease the batch size (i.e. keeping the total number of tokens in the batch the same) and keep the other hparams the same. I'm not sure that's optimal but it's what I've been using.
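
As a quick illustration of keeping tokens-per-batch constant (the specific seqlen and batch-size numbers here are made up, not the actual training config):

```python
# Keep total tokens per batch fixed when increasing the sequence length.
base_seqlen, base_batch = 2048, 256
tokens_per_batch = base_seqlen * base_batch   # 524288 tokens per batch

new_seqlen = 8192
new_batch = tokens_per_batch // new_seqlen    # 64
print(new_batch)
```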

tridao commented on August 27, 2024

As the code says, it constructs nn.Conv1d with padding=3 (if the conv has width 4), does the convolution, then removes the last 3 elements.
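
A minimal sketch of that trick with a plain nn.Conv1d (the shapes and depthwise grouping below are illustrative rather than the exact configuration in mamba_simple.py):

```python
import torch
import torch.nn as nn

batch, channels, seqlen, width = 2, 8, 16, 4
x = torch.randn(batch, channels, seqlen)

conv = nn.Conv1d(
    in_channels=channels,
    out_channels=channels,
    kernel_size=width,
    groups=channels,      # depthwise conv
    padding=width - 1,    # pad so no output position depends on future inputs
)

y = conv(x)               # shape (batch, channels, seqlen + width - 1)
y = y[..., :seqlen]       # drop the last width - 1 elements -> causal output
print(y.shape)            # torch.Size([2, 8, 16])
```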

tridao commented on August 27, 2024

It was trained with seqlen=2k for an apples-to-apples comparison with Pythia; it seems to extrapolate to around 3k context length, but after that the quality is much worse.

RevolGMPHL commented on August 27, 2024

If I train on a training set with longer sequences, will it improve the max token length? Does it have anything to do with the size of the model?

tridao commented on August 27, 2024

Yes training on longer context (e.g. 4k or 8k) should help improve max token length. I think this is a general property of most sequence models (e.g. Transformers should be similar).

ftgreat commented on August 27, 2024

Language models based on the Transformer architecture can extrapolate beyond the training context by adjusting the position encoding, which may also require fine-tuning on longer documents. There are also techniques that mitigate the degradation of performance during context extrapolation by filtering the kv cache.

I would like to understand the model structure and design of Mamba's S6 module, and whether there are similar techniques suitable for context extrapolation. Thank you.

ftgreat commented on August 27, 2024

Thanks very much.

I am currently not familiar with the inner details of the Mamba ssm module. May I ask if there are any parameters whose shapes are tied to a preset context length?

sentialx commented on August 27, 2024

@tridao Does Mamba support passing state between multiple forward passes (or blocks of tokens) during training?

tridao commented on August 27, 2024

No that's not supported right now.

ftgreat commented on August 27, 2024

@tridao one more question about dataset processing when pretraining the mamba-2.8b models.

As the GPT-3 paper says, "During training we always train on sequences of the full n_ctx = 2048 token context window, packing multiple documents into a single sequence when documents are shorter than 2048, in order to increase computational efficiency."

Did the released Mamba models use the same packing trick for the datasets? Thanks.

ftgreat commented on August 27, 2024

@tridao one more question please.

How should the number of layers and the model dim be set for roughly 7B Mamba models, and are there design rules for these settings when scaling up the model size?

Thanks.

tridao commented on August 27, 2024

We just follow GPT-3, e.g. for 7B you can use 64 layers (2 Mamba layers have the same number of params as MLP + attn) and d_model = 4096.
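
For reference, a back-of-the-envelope check of that sizing (the 6·d_model² per-block estimate and the vocab size below are rough assumptions that ignore the conv, x/dt projections, and norm parameters):

```python
# Rough parameter count for a ~7B Mamba config: 64 layers, d_model = 4096.
# One Mamba block with expand=2 costs roughly 6 * d_model**2 params
# (in_proj ~ 4*d_model**2, out_proj ~ 2*d_model**2; smaller terms ignored).
d_model, n_layer, vocab_size = 4096, 64, 50280  # vocab size is an assumption

per_block = 6 * d_model ** 2
total = n_layer * per_block + vocab_size * d_model  # + token embedding
print(f"~{total / 1e9:.1f}B params")  # ~6.6B, i.e. in the 7B range
```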

ftgreat commented on August 27, 2024

> We just follow GPT-3, e.g. for 7B you can use 64 layers (2 Mamba layers have the same number of params as MLP + attn) and d_model = 4096.

Thanks.

ftgreat commented on August 27, 2024

@tridao could you release a mamba-1.4B intermediate checkpoint trained on around 100B tokens?

I have trained mamba-1.4B from scratch on a zh-en corpus. If a checkpoint at around 100B tokens is provided, I will check its metrics to validate my process.

Thanks.

tridao commented on August 27, 2024

Unfortunately we only have the fully trained weights.

ftgreat commented on August 27, 2024

> Unfortunately we only have the fully trained weights.

Thanks for your reply.

ftgreat commented on August 27, 2024

@tridao When scaling up the max length for language modeling pretraining from scratch, could you please give us some advice on how to set hyperparameters like lr, warmup, global batch size, etc.?

Thank you.

sentialx commented on August 27, 2024

@tridao what would it take to support passing state between forward passes? I can see it's possible to do this via inference_params, so where's the catch?

tridao commented on August 27, 2024

> @tridao what would it take to support passing state between forward passes? I can see it's possible to do this via inference_params, so where's the catch?

inference_params supports moving the state forward by 1 step (i.e. recurrence). If you want to pass states in along with inputs of length more than 1, you'd need to change the parallel scan (in selective_scan) to deal with that.
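
To illustrate what carrying state across chunks means, here is a toy reference recurrence for a diagonal linear SSM (h_t = a_t * h_{t-1} + b_t); this is a naive stand-in for intuition only, not the actual selective_scan kernel or its tensor layout:

```python
import torch

def naive_scan(a, b, h0):
    """Sequentially apply h_t = a_t * h_{t-1} + b_t.
    a, b: (L, N); h0: (N,). Returns all states (L, N) and the final state."""
    hs, h = [], h0
    for t in range(a.shape[0]):
        h = a[t] * h + b[t]
        hs.append(h)
    return torch.stack(hs), h

L, N = 8, 4
a, b, h0 = torch.rand(L, N), torch.randn(L, N), torch.zeros(N)

full, _ = naive_scan(a, b, h0)
first, h_mid = naive_scan(a[:4], b[:4], h0)
second, _ = naive_scan(a[4:], b[4:], h_mid)  # chunk 2 starts from chunk 1's final state
assert torch.allclose(full, torch.cat([first, second]))
```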

ftgreat commented on August 27, 2024

Mamba can be used as a drop-in replacement module in some frameworks.

Megatron-LM is designed only for Transformer blocks. How can we integrate Mamba into it? Could you give some advice? Thanks.

Sorry to bother you both. @tridao @albertfgu

tridao commented on August 27, 2024

You'd want to replace ParallelTransformerLayer in Megatron-LM with a Mamba layer. It should work if you don't use TensorParallel / Pipeline Parallel in Megatron-LM.
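
As a rough sketch of what such a drop-in layer could look like outside of Megatron-LM's parallelism machinery (the MambaLayer class and its pre-norm residual wiring are illustrative; a real ParallelTransformerLayer replacement would also have to absorb its extra arguments such as attention masks):

```python
import torch
import torch.nn as nn
from mamba_ssm import Mamba  # pip install mamba-ssm

class MambaLayer(nn.Module):
    """Pre-norm residual Mamba block with a (batch, seqlen, d_model) -> same-shape
    interface, so it can stand in for a Transformer layer in a plain layer stack."""
    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.mixer = Mamba(d_model=d_model, d_state=16, d_conv=4, expand=2)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return hidden_states + self.mixer(self.norm(hidden_states))

# Usage (the fused kernels expect CUDA tensors)
layer = MambaLayer(d_model=512).cuda()
x = torch.randn(2, 1024, 512, device="cuda")
y = layer(x)  # (2, 1024, 512)
```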

ftgreat commented on August 27, 2024

> You'd want to replace ParallelTransformerLayer in Megatron-LM with a Mamba layer. It should work if you don't use TensorParallel / Pipeline Parallel in Megatron-LM.

Thanks a lot.
Without TensorParallel / Pipeline Parallel, there is no need to use Megatron-LM for model size scaling.

ftgreat commented on August 27, 2024

@tridao If causal_conv1d_fn is not available, how does the normal conv1d behave causally? Thanks

https://github.com/state-spaces/mamba/blob/main/mamba_ssm/modules/mamba_simple.py#L168
