
Comments (7)

pfeatherstone commented on July 17, 2024

I'll give it a go, thanks.


pfeatherstone commented on July 17, 2024

Funny how some publications can just be a case of: add a conv there and see what happens.


pfeatherstone commented on July 17, 2024

@lucidrains Good news: using attn_qk_norm seems to have solved my problem. Now all my attention "scores"/"dots" are O(1), except for masked elements, which are -3.4028e38 (the float32 minimum) as expected. So the softmaxed attention map now looks much more sensible.

It might be worth mentioning in the README that attn_qk_norm has this nice property. You already mention that it can help with overflow, but it seems it can also help with underflow, or whatever this is.
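
For anyone landing here, a minimal sketch of why the magnitudes come out this way (plain PyTorch, not the actual x-transformers internals; the fixed `scale` stands in for the learned temperature the library uses): l2-normalizing q and k makes every dot product a cosine similarity in [-1, 1], so scores stay O(1) regardless of head dimension, while masked positions get filled with `torch.finfo(dtype).min`, which is exactly the -3.4028e38 above.

```python
import torch
import torch.nn.functional as F

def qk_norm_attention(q, k, v, mask=None, scale=10.0):
    # l2-normalize queries and keys: each dot product becomes a cosine
    # similarity in [-1, 1], so scores are O(1) regardless of head dim
    q = F.normalize(q, dim=-1)
    k = F.normalize(k, dim=-1)
    scores = (q @ k.transpose(-2, -1)) * scale
    if mask is not None:
        # masked positions get float32's lowest value, -3.4028e38,
        # which softmax then maps to ~0
        scores = scores.masked_fill(~mask, torch.finfo(scores.dtype).min)
    return scores.softmax(dim=-1) @ v
```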


pfeatherstone commented on July 17, 2024

Unfortunately, talking_heads isn't compatible with flash attention, and I can't afford not to use flash attention. I also had a look at sparse_topk, thinking that would help too, but again, it's not compatible with flash attention. Makes sense.
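
For reference, the configuration under discussion looks roughly like this (kwarg names as I read them from the x-transformers README; worth double-checking against your installed version):

```python
import torch
from x_transformers import TransformerWrapper, Decoder

model = TransformerWrapper(
    num_tokens = 20000,
    max_seq_len = 1024,
    attn_layers = Decoder(
        dim = 512,
        depth = 6,
        heads = 8,
        attn_flash = True,    # fused flash attention kernels
        attn_qk_norm = True,  # l2-normalize q and k before the dot product
        # attn_talking_heads = True and attn_sparse_topk = 8 both need the
        # materialized attention matrix, so they would force the non-flash path
    ),
)

tokens = torch.randint(0, 20000, (1, 1024))
logits = model(tokens)  # (1, 1024, 20000)
```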


lucidrains commented on July 17, 2024

@pfeatherstone nice! yea i'm bullish on cosine sim attention. Tero Karras recently used it in his new u-net with great results


pfeatherstone commented on July 17, 2024

Makes you wonder what percentage of a model is just some kind of normalization. Probably quite high. That seems like a flaw. Someone needs to invent a new neural network architecture where normalization makes up less than 1% of your layers.


pfeatherstone commented on July 17, 2024

> @pfeatherstone nice! yea i'm bullish on cosine sim attention. Tero Karras recently used it in his new u-net with great results

What's the state of https://github.com/lucidrains/flash-cosine-sim-attention? I like the idea of fusing flash attention with l2-normalized q/k.

Also, did you consider using https://github.com/NVIDIA/cutlass for the CUDA backend? I think Tri Dao used that library for Flash Attention 2, and it allowed him to write much more concise and ultimately better code (according to a podcast interview).
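
As a stopgap, here's a sketch of what I mean by the fusion (my assumption, not that repo's API): since the l2 norm is just a pointwise transform before the matmul, you can normalize q and k outside the kernel and let torch.nn.functional.scaled_dot_product_attention dispatch to its fused flash backend. Note the explicit scale kwarg needs PyTorch 2.1 or later.

```python
import torch
import torch.nn.functional as F

def cosine_sim_flash_attn(q, k, v, scale=10.0, causal=True):
    # normalize outside the kernel; the fused kernel then computes
    # softmax(scale * (q_hat @ k_hat^T)) @ v on the flash path
    q = F.normalize(q, dim=-1)
    k = F.normalize(k, dim=-1)
    return F.scaled_dot_product_attention(
        q, k, v, is_causal=causal, scale=scale,  # scale kwarg: PyTorch >= 2.1
    )

q = k = v = torch.randn(1, 8, 1024, 64)  # (batch, heads, seq_len, head_dim)
out = cosine_sim_flash_attn(q, k, v)     # (1, 8, 1024, 64)
```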

