
Comments (4)

hrbigelow commented on July 24, 2024

Yes, this is a common confusion. But, it has nothing to do with causal_conv1d itself.

The Mamba model, like transformers, is an auto-regressive architecture. This means that if you take $y^l_t$ to be the output of layer $l$ at timestep $t$, then $y^l_t$ depends on $y^{l-1}_{\le t}$. Note that it is $\le t$, and NOT $< t$. The shift-by-one happens only at the very end of the entire network, where $y^L_t$ produces logits for the $(t+1)$'th predicted token.
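To make the point concrete, here is a minimal sketch (names and the toy "layer" are illustrative, not Mamba's actual code): each stacked layer may use inputs up to and including position $t$, and the shift-by-one lives only in how logits are paired with targets.

```python
import numpy as np

tokens = np.array([5, 2, 7, 1])  # toy token ids, t = 0..3

# Stand-in for one causal layer: the output at t aggregates inputs <= t
# (note <= t, not < t) -- here a running mean over positions 0..t.
def causal_layer(x):
    return np.cumsum(x) / np.arange(1, len(x) + 1)

# Stacking layers preserves the <= t dependency.
h = causal_layer(causal_layer(tokens.astype(float)))

# The shift-by-one appears only in the training objective:
# the representation at position t is scored against the token at t + 1.
inputs_repr = h[:-1]       # h[t] for t = 0..T-2
targets     = tokens[1:]   # token at position t + 1
```

Since `h[0]` only ever sees `tokens[0]`, no future information leaks even though each layer's output at $t$ includes the input at $t$.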

Within this setting, causal_conv1d is just a component within each of the layers. The notion that it is "causal" means simply that it does not look forward. For example, with d_conv=5, the input would be left-padded with 4 elements, and we'd have:

pppp123456789
^^^^^
 ^^^^^
  ^^^^^
   ^^^^^
    ^^^^^
     ^^^^^

So, the first output of the causal_conv1d layer would be computed from the input pppp1, which includes token 1. But all of this is ultimately used to generate logits for predicting the token at position 2.
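The padding diagram above can be sketched directly in code. This is an illustrative numpy version of a causal 1-D convolution, not the API of the `causal_conv1d` CUDA kernel; the kernel weights here are a hypothetical choice that makes the causality easy to verify.

```python
import numpy as np

def causal_conv1d(x, w):
    """Causal 1-D convolution: output at t depends only on inputs <= t.

    Left-pads the sequence with d_conv - 1 zeros (the 'pppp' in the
    diagram above), so the first window is [0, 0, 0, 0, x[0]].
    """
    d_conv = len(w)
    x_padded = np.concatenate([np.zeros(d_conv - 1), x])
    return np.array([x_padded[t:t + d_conv] @ w for t in range(len(x))])

x = np.arange(1.0, 10.0)        # the tokens 1..9 from the diagram
w = np.zeros(5)
w[-1] = 1.0                     # d_conv=5 kernel that picks out the newest input
y = causal_conv1d(x, w)
# With this kernel, y[t] == x[t]: position t sees up to and including x[t],
# and never anything to its right.
```

With a general kernel `w`, `y[t]` is a weighted sum of `x[t-4..t]` (zeros where the window overlaps the padding), which is exactly the sliding windows drawn above.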

from mamba.

sudhakarsingh27 commented on July 24, 2024

This helps me connect the dots, thanks!
(I was trying to connect this causal_conv1d action with the attention mechanism in the "Attention Expressivity" section, D.2, of the H3 paper.)


sudhakarsingh27 commented on July 24, 2024

I think it also has to do with causal_conv1d, because the attention mechanism mixes information across the sequence dimension, which is what lets attention accomplish tasks like associative recall and induction heads.
The authors of H3 try to emulate that by "shifting" tokens (with a "shift" SSM), while the Mamba authors seem to have approximated even that with just a causal convolution.
Without this, wouldn't SSMs be just "fancy gated RNNs or LSTMs"?


PheelaV commented on July 24, 2024

Hi @albertfgu

Could you please confirm the above or elaborate? The Mamba paper makes exactly one reference to the "shift SSM" from H3; I might have missed something, but besides Figure 3 and the mention of H3's local convolution in the "SSM Architectures" subsection, there are no further comments on it. Pattern-matching implementation similarities and researching the papers has led me here. I am looking for the intuition behind the conv1d operation in each of the Mamba blocks. In H3, the shift SSM is argued to help approximate attention's multiplicative interactions, and together with the diagonal SSM it brings in attention-like memorization, akin to a soft lookup, to facilitate the associative-recall ability. Thanks.

