gupta-abhay / pytorch-vit
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Home Page: https://arxiv.org/abs/2010.11929
License: MIT License
How do you patch the image? Any clues about the preprocessing and training steps?
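For what it's worth, the patching in the paper is just a non-overlapping split of the image into P x P blocks, each then flattened into a vector. A minimal sketch using `torch.Tensor.unfold` (illustrative only, not the repo's exact preprocessing):

```python
import torch

# Split an N,C,H,W batch into non-overlapping P x P patches and flatten each
# patch into a vector of length P*P*C (illustrative sizes).
P = 16
x = torch.randn(2, 3, 224, 224)                      # N, C, H, W

patches = (x.unfold(2, P, P)                         # N, C, H//P, W, P
             .unfold(3, P, P)                        # N, C, H//P, W//P, P, P
             .contiguous())
patches = patches.permute(0, 2, 3, 4, 5, 1)          # channels last per pixel
patches = patches.reshape(2, -1, P * P * 3)          # N, num_patches, P*P*C

assert patches.shape == (2, (224 // P) ** 2, P * P * 3)  # (2, 196, 768)
```

Each row of the result is one flattened patch, ready to be fed to the linear projection.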
In train.py the code includes "from vit.utils import ( adjust_learning_rate)", but there is no adjust_learning_rate in utils.py.
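As a workaround, a conventional step-decay helper with this name could look like the sketch below. This is an assumption about what the missing function did, not the original implementation:

```python
import torch

def adjust_learning_rate(optimizer, epoch, base_lr, decay_epochs=30, gamma=0.1):
    """Stand-in for the missing helper (an assumption, not the original):
    decay the learning rate by `gamma` every `decay_epochs` epochs."""
    lr = base_lr * (gamma ** (epoch // decay_epochs))
    for group in optimizer.param_groups:
        group["lr"] = lr
    return lr

opt = torch.optim.SGD([torch.nn.Parameter(torch.zeros(1))], lr=0.1)
adjust_learning_rate(opt, epoch=30, base_lr=0.1)   # LR becomes 0.1 * 0.1 = 0.01
```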
https://github.com/gupta-abhay/ViT/blob/fcc17638d0f4d661af19128871345b01a800631c/vit/models/ViT.py#L99
I guess the self.flatten_dim in this line should be replaced with embedding_dim.
This work is very interesting and fascinating. I have a question: how did you decide on the embedding size?
Looking forward to the release of the pretrained models.
Hello Gupta!
Being new to vision tasks, can you share just a small snippet showing how we can use pytorch-vit in downstream vision tasks like image retrieval etc.? Thanks
I get the part where the image is split into P (say 16x16) smaller image patches, and then you have to flatten each 3D patch and pass it into a Linear layer to get what they call the Linear Projection. Can you please explain how the two types of embeddings work? I looked at your code too, and it looked like a maze to me. If you could just explain it in layman's terms, I'll look at the code again and understand.
Thanks
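In layman's terms, the two embeddings are: (1) the patch embedding, a Linear layer that maps each flattened P*P*C patch to a D-dimensional token, and (2) the position embedding, a learned vector added to each token so the model knows where each patch came from. A minimal sketch with illustrative sizes (not the repo's exact code):

```python
import torch
import torch.nn as nn

num_patches, flatten_dim, D = 196, 768, 512
patches = torch.randn(2, num_patches, flatten_dim)   # flattened patches

patch_embed = nn.Linear(flatten_dim, D)                        # embedding 1
pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, D))   # embedding 2
cls_token = nn.Parameter(torch.zeros(1, 1, D))                 # learnable [CLS]

tokens = patch_embed(patches)                                  # (2, 196, 512)
tokens = torch.cat([cls_token.expand(2, -1, -1), tokens], 1)   # (2, 197, 512)
tokens = tokens + pos_embed                                    # add positions

assert tokens.shape == (2, num_patches + 1, D)
```

The position embedding has one extra slot because the class token is prepended before positions are added.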
Hi, will you release any pre-trained models?
Thank you
hello,
Thanks a lot for this very interesting work!
When you unroll the tensor, you use unfold and flatten like this:
x = (x.unfold(2, self.patch_dim, self.patch_dim).
unfold(3, self.patch_dim, self.patch_dim).contiguous())
x = x.view(x.size(0), -1, self.flatten_dim)
But if x has shape N,C,H,W, unrolling ends up with N,C,H//P,W//P,P,P, so the final flatten ends up mixing data from different channels. It means your "words" come from different blocks in space. It does not really matter when training your model at one specific size, but I think it will have a hard time transferring to a different size...
Instead you could do something like this (taking b, c from x.shape):
self.flatten_dim_in = (patch_dim ** 2) * in_channels
...
b, c, h, w = x.shape
x = (x.unfold(2, self.patch_dim, self.patch_dim)
      .unfold(3, self.patch_dim, self.patch_dim)
      .contiguous())
x = x.view(b, c, -1, self.patch_dim ** 2)
x = x.permute(0, 2, 3, 1).contiguous()
x = x.view(b, -1, self.flatten_dim_in)
Just to make sure the data at the end is really what you expect: all the RGB pixels of one patch together, not a mix of patches.
Now I haven't tried your code yet, so perhaps you use a different layout than N,C,H,W for images?
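A quick sanity check of the permute-based unroll on a tiny tensor, comparing the first "word" against a directly sliced top-left patch (a sketch, assuming the N,C,H,W layout discussed above):

```python
import torch

# Tiny example: 1 image, 3 channels, 4x4 pixels, 2x2 patches.
patch_dim, b, c, h, w = 2, 1, 3, 4, 4
x = torch.arange(b * c * h * w, dtype=torch.float32).view(b, c, h, w)

# Permute-based unroll: N,C,H,W -> N, num_patches, patch_dim*patch_dim*C
u = (x.unfold(2, patch_dim, patch_dim)
      .unfold(3, patch_dim, patch_dim)
      .contiguous())
u = u.view(b, c, -1, patch_dim ** 2)
u = u.permute(0, 2, 3, 1).contiguous()
u = u.view(b, -1, patch_dim ** 2 * c)

# Reference: the top-left 2x2 block across all channels, pixels row-major,
# channels interleaved last -- exactly one patch, no mixing.
ref = (x[:, :, :patch_dim, :patch_dim]
       .reshape(b, c, -1).permute(0, 2, 1).reshape(b, -1))
assert torch.equal(u[:, 0], ref)
```

Since the assertion holds, each row really is one spatial patch with all its channels together.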
I don't understand line 74 of ViT.py:
x = self.to_cls_token(x[:, 0])
If the first dimension of x is batch, then index 0 along the second dimension should be a patch, since x has shape [batch, patch, feature]. Does it mean only the first patch is used? Could anybody help me with this? Thanks a lot.
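Index 0 is not an image patch: following BERT, ViT prepends a learnable class token to the patch sequence, and x[:, 0] selects that token (which has attended to all patches) for classification. A minimal sketch of the mechanism (illustrative names and sizes, not the repo's exact code):

```python
import torch
import torch.nn as nn

batch, num_patches, dim = 2, 4, 8
patch_tokens = torch.randn(batch, num_patches, dim)

cls_token = nn.Parameter(torch.zeros(1, 1, dim))   # learnable [CLS] token
cls = cls_token.expand(batch, -1, -1)              # one copy per sample
x = torch.cat([cls, patch_tokens], dim=1)          # shape: (2, 5, 8)

# ... the transformer encoder would run here; through self-attention,
# token 0 aggregates information from every patch ...

cls_out = x[:, 0]                                  # shape: (2, 8)
assert cls_out.shape == (batch, dim)
```

So no patch is discarded; they all feed into token 0 through attention.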
''model = models.__dict__[args.model](num_classes=args.num_classes)'' raises "'module' object is not callable".
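This error usually means `args.model` matched a submodule inside the models package (e.g. a file name) rather than a class or factory function, so `__dict__` lookup returned a module object. A self-contained sketch of the failure mode and a guarded lookup (illustrative names, not the repo's code):

```python
import types

models = types.ModuleType("models")          # stand-in for the repo's package
models.vit = types.ModuleType("models.vit")  # submodule, e.g. models/vit.py
models.ViT = lambda num_classes: f"ViT({num_classes})"  # the actual factory

def build_model(name, **kwargs):
    """Guarded lookup (illustrative): reject modules and unknown names."""
    obj = getattr(models, name, None)
    if obj is None or isinstance(obj, types.ModuleType) or not callable(obj):
        raise ValueError(f"{name!r} is not a callable model constructor")
    return obj(**kwargs)

build_model("ViT", num_classes=10)   # works: returns a model
# build_model("vit", num_classes=10)  # would raise ValueError, not TypeError
```

Checking the exact key casing of `args.model` against the package's exported names is usually the quickest fix.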