I am a bit of a noob when it comes to transformers. If I want to encode a batch of <co

Memory Efficiency w.r.t Sequence Length about x-transformers HOT 5 OPEN

Memory Efficiency w.r.t Sequence Length

from x-transformers.

Comments (5)

lucidrains commented on June 29, 2024 1

@adamoyoung Nope, no difference! You could construct your batches strategically to minimize padding tokens and maximize efficiency, but most practitioners never do so.
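
To make the cost of padding concrete, here is a minimal sketch in plain Python (not part of x-transformers) that estimates what fraction of a padded batch is padding. The function name `padding_fraction` and the example lengths are hypothetical, and it assumes every sequence is padded to the longest one in its batch.

```python
def padding_fraction(lengths):
    """Fraction of token slots that are padding when a batch is
    padded to the length of its longest sequence."""
    if not lengths:
        return 0.0
    max_len = max(lengths)
    total_slots = max_len * len(lengths)  # slots allocated after padding
    real_tokens = sum(lengths)            # slots holding actual tokens
    return 1.0 - real_tokens / total_slots


# Hypothetical sequence lengths: a mixed batch vs. a batch of similar lengths.
print(padding_fraction([10, 500, 20, 480]))  # roughly half the slots are padding
print(padding_fraction([480, 500]))          # only a small fraction is padding
```

A batch that mixes very short and very long sequences spends much of its memory on padding, which is why grouping sequences of similar length helps.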

adamoyoung commented on June 29, 2024 1

Thanks! Do you know if other implementations tend to do this as well? In pytorch_geometric they allow for graph batching where the memory usage scales with the number of nodes/edges actually in the batch, not the maximum number of nodes/edges that are allowed in a single graph (which is analogous to the sequence length). They do this by implementing the attention with scatter/gather operations instead of masked matrix multiplications. I'm wondering if this would be a good idea for transformers, and if you know of anyone who has tried this.
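
As a rough illustration of the difference described above, the following sketch counts attention-score entries per head under two layouts. It assumes dense masked attention allocates a full `max_len x max_len` score matrix for every sequence in the batch, while a packed layout (in the spirit of PyG-style batching) only needs scores within each sequence. Both function names are hypothetical, and the counts ignore heads, feature dimensions, and any kernel-level optimizations.

```python
def padded_score_entries(lengths):
    """Score entries per head when every sequence is padded to the
    batch maximum and attention uses a dense (masked) matrix."""
    max_len = max(lengths, default=0)
    return len(lengths) * max_len * max_len


def packed_score_entries(lengths):
    """Score entries per head if each sequence attends only within
    itself, with no padding (block-diagonal layout)."""
    return sum(n * n for n in lengths)


lengths = [10, 500, 20, 480]  # hypothetical sequence lengths
print(padded_score_entries(lengths))  # scales with the maximum length
print(packed_score_entries(lengths))  # scales with the actual lengths
```

The gap between the two numbers grows with the spread of lengths inside a batch; for batches of equal-length sequences the two counts coincide.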

lucidrains commented on June 29, 2024 1

@adamoyoung Yeah, the transformers community went in a very different direction from the graph neural net community and the approach taken in PyG. We typically don't do it the scatter/gather way, though I have met researchers who were interested in writing CUDA kernels to skip attention on the padding. I think batching by similar lengths is a good middle ground that I've seen used by others (one such implementation I came across: https://github.com/jonathanking/sidechainnet/blob/4d4f57204c162ab938b8762dfacffb1d992774d0/sidechainnet/dataloaders/SimilarLengthBatchSampler.py#L9 )
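
For readers who want a feel for the idea, here is a minimal sketch of batching by similar lengths in plain Python. It is not the sidechainnet implementation linked above and not part of x-transformers; the function name `similar_length_batches` and its parameters are hypothetical. It simply sorts example indices by length, cuts them into consecutive batches, and optionally shuffles the order of the batches.

```python
import random


def similar_length_batches(lengths, batch_size, shuffle=True, seed=None):
    """Group example indices so each batch holds sequences of similar length.

    lengths:    list of sequence lengths, one per example
    batch_size: maximum number of examples per batch
    Returns a list of batches, each a list of example indices.
    """
    # Sort indices by sequence length so neighbours have similar lengths.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    # Cut the sorted order into consecutive chunks of at most batch_size.
    batches = [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
    if shuffle:
        # Shuffle batch order (not batch contents) to keep some randomness.
        random.Random(seed).shuffle(batches)
    return batches


lengths = [10, 500, 20, 480]  # hypothetical sequence lengths
print(similar_length_batches(lengths, batch_size=2, shuffle=False))
# -> [[0, 2], [3, 1]]
```

A list of index batches like this can be passed to a PyTorch `DataLoader` through its `batch_sampler` argument. One trade-off of this simple version is that batch composition is fixed across epochs; only the batch order changes.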

adamoyoung commented on June 29, 2024 1

Thanks, that's a good solution! Will check it out.

adamoyoung commented on June 29, 2024

My guess is there is no difference, based on how the masks are used in the Attention class.
