Comments (5)
All vocabularies are `defaultdict` objects, so index zero is reserved for unknown or unseen inputs. Fields with `sequential=True` (the default, but it should be turned off for your label field, because label data isn't a sequence) also have a reserved index for padding. But since your inputs are already numeric, you can get away with setting `use_vocab=False`, in which case I believe you can get what you're looking for by setting `postprocessing=Pipeline(int)`. In that case there won't be any extra indices, just zero and one.
(And yes, this is all badly underdocumented, and there's pretty much no way you could have figured this out from the docstrings...)
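The reserved-index behavior described above can be illustrated without torchtext at all. This is a minimal sketch (not torchtext's actual vocabulary class) of a `defaultdict`-backed string-to-index map, assuming the convention stated in the comment: index 0 for unknown tokens, index 1 for padding on sequential fields:

```python
from collections import defaultdict

# Hypothetical vocabulary: unseen tokens fall back to index 0 (<unk>);
# sequential fields additionally reserve index 1 for <pad>.
stoi = defaultdict(lambda: 0)   # defaultdict makes lookup of unseen keys return 0
stoi['<unk>'] = 0
stoi['<pad>'] = 1
for i, tok in enumerate(['the', 'cat', 'sat'], start=2):
    stoi[tok] = i

print(stoi['cat'])    # seen token: its own index, 3
print(stoi['zebra'])  # unseen token: the <unk> index, 0
```

With `use_vocab=False` no such table is built, which is why the numeric labels pass through untouched (apart from the `Pipeline(int)` string-to-int conversion).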
So if I understand correctly, I am now using:
text_field = data.Field()
label_field = data.Field(sequential=False, use_vocab=False, postprocessing=data.Pipeline(int))
To keep things simple, I am loading each dataset one at a time:
train = data.TabularDataset(path='train.tsv', format='tsv', fields=[('text', text_field), ('lbl', label_field)])
Several points I note:
- If I do `len(train)`, I get N+1, where N is the number of rows.
- If I do `len(label_field.vocab)`, I get 16, while my data only has two classes, 0 and 1. This number was 31 when I was loading both the train and val datasets with one `TabularDataset.splits` call.
To reproduce this issue, I am attaching a sample of 100 rows from my dataset. Notice that when we load this set with a TabularDataset, we get `len(label_field.vocab)` = 4.
I think the problem here is that torchtext isn't ignoring the header on the TSV file; there are a couple of options to fix this. We could add the ability to ignore a header line, or you could exclude the header from the data as in the following snippet. (Also, if you set `use_vocab` to False on the `label_field`, you shouldn't call `label_field.build_vocab`.)
text_field = data.Field()
label_field = data.Field(sequential=False, use_vocab=False, postprocessing=data.Pipeline(int))
train = data.TabularDataset(
    path='train.tsv', format='tsv', fields=[('text', text_field), ('lbl', label_field)],
    filter_pred=lambda ex: ex.lbl in ['0', '1'])

>>> vars(train[0])
{'text': ['"An', 'American', 'in', 'Paris', ...], 'lbl': '1'}

train_iter = data.BucketIterator(train, batch_size=32, device=0, sort_key=lambda ex: len(ex.text))

>>> vars(next(iter(train_iter)))
{'batch_size': 32, 'dataset': <torchtext.data.TabularDataset object at 0x7fcf71bbfac8>,
 'train': True, 'text': Variable containing:
 2545   438  1946  ...   328   200  1089
  266     2     2  ...    23    96  3174
    3  5406  1399  ...    40    10     4
       ...         ⋱         ...
    1     1     1  ...     1     1     5
    1     1     1  ...     1     1     2
    1     1     1  ...     1     1  3979
[torch.cuda.LongTensor of size 500x32 (GPU 0)]
, 'lbl': Variable containing:
 1
 1
 1
 0
 ...
 1
[torch.cuda.LongTensor of size 32 (GPU 0)]
}
If you want to use PyTorch's variable-length RNN capability, you should add `return_lengths=True` to the `text_field`.
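The lengths that a `return_lengths`-style option reports are just the number of non-padding tokens per example, which is exactly what `torch.nn.utils.rnn.pack_padded_sequence` needs. A minimal sketch of that computation, assuming pad index 1 (as in the tensor dump above) and toy token ids, with examples as rows rather than the time-major 500x32 layout shown:

```python
# Hypothetical padded batch: three examples, padded with index 1 to length 6.
PAD_IDX = 1
batch = [
    [2545, 266, 3, 1, 1, 1],      # real length 3
    [438, 2, 5406, 1, 1, 1],      # real length 3
    [1089, 3174, 4, 5, 2, 3979],  # real length 6, no padding
]

# Count non-pad tokens per example; these lengths would be handed to
# pack_padded_sequence so the RNN skips the padding positions.
lengths = [sum(1 for tok in seq if tok != PAD_IDX) for seq in batch]
print(lengths)  # [3, 3, 6]
```

(In a real vocabulary, id 1 is reserved for `<pad>` and can't occur mid-sequence, so counting non-pad positions is safe.)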
Won't the `<unk>` entry in the label vocab result in extra parameters and memory consumption?
> We could add the ability to ignore a header line
We did this in #146, and we added the ability to avoid creating an '<unk>' token in #107.
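The header-skipping fix is easy to demonstrate without torchtext: if the TSV header row is fed to the fields, its column names get counted as data (inflating the label vocab and adding one to `len(train)`, as observed above). Dropping the first row, which is what a skip-header option does, avoids both. A minimal stdlib sketch with a made-up two-row TSV:

```python
import csv
import io

# Hypothetical TSV with a header row, mirroring the train.tsv layout above.
tsv = "text\tlbl\nAn American in Paris\t1\nA bad film\t0\n"

reader = csv.reader(io.StringIO(tsv), delimiter='\t')
next(reader)  # skip the header row so 'text'/'lbl' never reach the fields

labels = [row[1] for row in reader]
print(labels)  # ['1', '0'] -- only real labels, no header token
```

Without the `next(reader)` call, `'lbl'` would show up among the labels, which is how a two-class dataset ends up with a vocab larger than two.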