Comments (4)
Yes, this is a common confusion, but it has nothing to do with causal_conv1d itself.
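The Mamba model, like transformers, is an auto-regressive architecture. This means that if you take the output at position t, it depends on the inputs at positions <=t, and NOT <t. The shift-by-one happens only at the very end of the entire network, where the output at position t is used to predict the token at position t+1.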
Within this setting, causal_conv1d is just a component within each of the layers. Calling it "causal" simply means that it does not look forward. For example, with d_conv=5, the input would be left-padded with 4 positions (shown as p), and we'd have:
pppp123456789
^^^^^
 ^^^^^
  ^^^^^
   ^^^^^
    ^^^^^
     ^^^^^
So, the first output of the causal_conv1d layer would be computed from the input window pppp1, which includes token 1. But all of this will ultimately be used to generate logits for predicting the token at position 2.
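The padding picture above can be sketched in a few lines of plain Python. This is only an illustration, not the causal_conv1d CUDA kernel: the helper names `causal_windows` and `causal_conv1d_ref` are hypothetical, the all-ones weights are arbitrary, and zero padding is assumed for the numeric version.

```python
def causal_windows(tokens, d_conv, pad="p"):
    """Return the input window that each output position sees."""
    # Left-pad with d_conv - 1 symbols so output t only sees inputs <= t.
    padded = [pad] * (d_conv - 1) + list(tokens)
    return ["".join(padded[t:t + d_conv]) for t in range(len(tokens))]


def causal_conv1d_ref(x, weights):
    """Single-channel causal convolution with zero left-padding (reference sketch)."""
    k = len(weights)
    padded = [0.0] * (k - 1) + list(x)
    return [
        sum(w * v for w, v in zip(weights, padded[t:t + k]))
        for t in range(len(x))
    ]


tokens = list("123456789")
windows = causal_windows(tokens, d_conv=5)
print(windows[0])  # pppp1: the first output already includes token 1

# Causality check: changing a later input leaves earlier outputs untouched.
x = [float(i) for i in range(1, 10)]
weights = [1.0] * 5  # arbitrary illustrative weights
y = causal_conv1d_ref(x, weights)
x_changed = list(x)
x_changed[5] = 100.0
y_changed = causal_conv1d_ref(x_changed, weights)
print(y[:5] == y_changed[:5])  # outputs before the changed position agree

# The shift-by-one lives in the training targets, not in the convolution:
# the output at index i is compared against the token at index i + 1.
targets = tokens[1:]
print(windows[0], "->", targets[0])
```

The point of the last lines is that the window `pppp1` is paired with target `2`: the layer itself sees positions <=t, and the offset only appears where outputs are matched to next tokens.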
from mamba.
This helps me connect the dots, thanks!
(I was trying to connect this causal_conv1d action with the attention mechanism in section D.2, "Attention Expressivity", of the H3 paper.)
I think it also has to do with causal_conv1d, because the attention mechanism mixes information across the sequence dimension, which is what lets attention accomplish tasks like "associative_recall" and "induction_heads".
Authors of H3 try to emulate that with "shifting" tokens (with a "shift" SSM) while Mamba authors seem to have approximated even that to just a causal convolution.
Without this, wouldn't SSMs be just "fancy Gated RNNs or LSTMs"?
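A rough way to see the connection suggested above: if one assumes the shift SSM has the form described in the H3 paper (A is a shift matrix and B = e_1, so the state holds the last m inputs), then its output is a causal convolution whose kernel is given by C. The sketch below checks this numerically on a single channel. The function names `shift_ssm` and `causal_conv` and the example values are hypothetical, and this is not the Mamba or H3 implementation.

```python
def shift_ssm(u, C):
    """Run x_t = A x_{t-1} + B u_t, y_t = C x_t with A = shift, B = e_1."""
    m = len(C)
    state = [0.0] * m
    ys = []
    for u_t in u:
        # A shifts the old state down by one slot; B = e_1 writes u_t on top.
        state = [u_t] + state[:-1]
        ys.append(sum(c * s for c, s in zip(C, state)))
    return ys


def causal_conv(u, kernel):
    """Causal convolution: y_t = sum_j kernel[j] * u[t - j], zero-padded."""
    m = len(kernel)
    padded = [0.0] * (m - 1) + list(u)
    return [
        sum(kernel[j] * padded[t + m - 1 - j] for j in range(m))
        for t in range(len(u))
    ]


u = [1.0, 2.0, 3.0, 4.0]
C = [0.5, 0.25, 2.0]  # arbitrary illustrative values
print(shift_ssm(u, C) == causal_conv(u, C))
```

Under these assumptions the two computations agree, which is consistent with reading a short causal convolution as a compact stand-in for the shift SSM. Whether that was the Mamba authors' motivation is not something this sketch can confirm.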
Hi @albertfgu
Could you please confirm the above or elaborate? The Mamba paper makes exactly one reference to the "shift-ssm" from H3. I might have missed something, but besides Figure 3 and the mention of H3's local convolution in the subsection "SSM Architectures", there are no further comments. Pattern-matching the implementation similarities and reading the papers has led me here. I am looking for the intuition behind the conv1d operation in each of the Mamba blocks. In H3, the shift-ssm is argued to help approximate the multiplicative interactions of attention. Together with the diag-ssm, which brings in attention-like memorization akin to a soft lookup, it is meant to facilitate associative search. Thanks.