Comments (5)
All vocabularies are `defaultdict` objects, so index zero is reserved for unknown or unseen inputs. Fields with `sequential=True` (the default, but it should be turned off for your label field, because label data isn't a sequence) also have a reserved index for padding. But since your inputs are already numeric, you can get away with setting `use_vocab=False`, in which case I believe you can get what you're looking for by setting `postprocessing=Pipeline(int)`. In that case there won't be any extra indices, just zero and one.
(And yes, this is all badly underdocumented, and there's pretty much no way you could have figured this out from the docstrings...)
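The reserved-index behavior described above can be illustrated without torchtext at all. This is a minimal sketch (not torchtext's actual vocabulary class) of a `defaultdict`-backed string-to-index map, assuming the convention stated in the comment: index 0 for unknown tokens, index 1 for padding on sequential fields:

```python
from collections import defaultdict

# Hypothetical vocabulary: unseen tokens fall back to index 0 (<unk>);
# sequential fields additionally reserve index 1 for <pad>.
stoi = defaultdict(lambda: 0)   # defaultdict makes lookup of unseen keys return 0
stoi['<unk>'] = 0
stoi['<pad>'] = 1
for i, tok in enumerate(['the', 'cat', 'sat'], start=2):
    stoi[tok] = i

print(stoi['cat'])    # seen token: its own index, 3
print(stoi['zebra'])  # unseen token: the <unk> index, 0
```

With `use_vocab=False` no such table is built, which is why the numeric labels pass through untouched (apart from the `Pipeline(int)` string-to-int conversion).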
So if I understand correctly, I am now using:
text_field = data.Field()
label_field = data.Field(sequential=False, use_vocab=False, postprocessing=data.Pipeline(int))
To keep things simple, I am loading each dataset one at a time:
train = data.TabularDataset(path='train.tsv', format='tsv', fields=[('text', text_field), ('lbl', label_field)])
Several points I note:
- If I do `len(train)`, I get N+1, where N is the number of rows.
- If I do `len(label_field.vocab)`, I get 16, while my data only has two classes, 0 and 1. This number was 31 when I was loading both the train and val datasets with one `TabularDataset.splits` call.
To reproduce this issue, I am attaching a sample of 100 rows from my dataset. Notice that when we load this set with a TabularDataset, we get `len(label_field.vocab)` = 4.
I think the problem here is that torchtext isn't ignoring the header on the TSV file; there are a couple of options to fix this. We could add the ability to ignore a header line, or you could exclude the header from the data as in the following snippet. (Also, if you set `use_vocab` to False on the `label_field`, you shouldn't call `label_field.build_vocab`.)
text_field = data.Field()
label_field = data.Field(sequential=False, use_vocab=False, postprocessing=data.Pipeline(int))
train = data.TabularDataset(
    path='train.tsv', format='tsv', fields=[('text', text_field), ('lbl', label_field)],
    filter_pred=lambda ex: ex.lbl in ['0', '1'])

>>> vars(train[0])
{'text': ['"An', 'American', 'in', 'Paris', ...], 'lbl': '1'}

train_iter = data.BucketIterator(train, batch_size=32, device=0, sort_key=lambda ex: len(ex.text))

>>> vars(next(iter(train_iter)))
{'batch_size': 32, 'dataset': <torchtext.data.TabularDataset object at 0x7fcf71bbfac8>,
 'train': True, 'text': Variable containing:
 2545   438  1946  ...   328   200  1089
  266     2     2  ...    23    96  3174
    3  5406  1399  ...    40    10     4
       ...         ⋱         ...
    1     1     1  ...     1     1     5
    1     1     1  ...     1     1     2
    1     1     1  ...     1     1  3979
[torch.cuda.LongTensor of size 500x32 (GPU 0)]
, 'lbl': Variable containing:
 1
 1
 1
 0
 ...
 1
[torch.cuda.LongTensor of size 32 (GPU 0)]
}
If you want to use PyTorch's variable-length RNN capability, you should add `return_lengths=True` to the `text_field`.
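The lengths that a `return_lengths`-style option reports are just the number of non-padding tokens per example, which is exactly what `torch.nn.utils.rnn.pack_padded_sequence` needs. A minimal sketch of that computation, assuming pad index 1 (as in the tensor dump above) and toy token ids, with examples as rows rather than the time-major 500x32 layout shown:

```python
# Hypothetical padded batch: three examples, padded with index 1 to length 6.
PAD_IDX = 1
batch = [
    [2545, 266, 3, 1, 1, 1],      # real length 3
    [438, 2, 5406, 1, 1, 1],      # real length 3
    [1089, 3174, 4, 5, 2, 3979],  # real length 6, no padding
]

# Count non-pad tokens per example; these lengths would be handed to
# pack_padded_sequence so the RNN skips the padding positions.
lengths = [sum(1 for tok in seq if tok != PAD_IDX) for seq in batch]
print(lengths)  # [3, 3, 6]
```

(In a real vocabulary, id 1 is reserved for `<pad>` and can't occur mid-sequence, so counting non-pad positions is safe.)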
Won't the `<unk>` entry in the label vocab result in extra parameters and memory consumption?
> We could add the ability to ignore a header line
We did this in #146, and we added the ability to avoid creating an '<unk>' token in #107.
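The header-skipping fix is easy to demonstrate without torchtext: if the TSV header row is fed to the fields, its column names get counted as data (inflating the label vocab and adding one to `len(train)`, as observed above). Dropping the first row, which is what a skip-header option does, avoids both. A minimal stdlib sketch with a made-up two-row TSV:

```python
import csv
import io

# Hypothetical TSV with a header row, mirroring the train.tsv layout above.
tsv = "text\tlbl\nAn American in Paris\t1\nA bad film\t0\n"

reader = csv.reader(io.StringIO(tsv), delimiter='\t')
next(reader)  # skip the header row so 'text'/'lbl' never reach the fields

labels = [row[1] for row in reader]
print(labels)  # ['1', '0'] -- only real labels, no header token
```

Without the `next(reader)` call, `'lbl'` would show up among the labels, which is how a two-class dataset ends up with a vocab larger than two.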