This program reads and loads the large Congressional Tweets dataset: each tweet's text together with its corresponding party label.
- DatasetReader: reads data from files and creates Instances
- DataLoader: groups several Instances of data into Batches for training
- The `_read` method takes the path of the file to read as input and yields Instances one at a time.
- `text_to_instance`, called inside `_read`, builds an Instance from the text data that `_read` obtains.
- `TweetReader` can set `max_instances` to specify the maximum number of instances to retrieve from the data.
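The reader pattern above can be sketched in plain Python (a hypothetical mock, not the project's actual code and not AllenNLP's real `DatasetReader` base class; the tab-separated `text<TAB>party` file format is also an assumption for illustration):

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, Optional


@dataclass
class Instance:
    # Stand-in for allennlp.data.Instance: maps field names to values.
    fields: Dict[str, object] = field(default_factory=dict)


class TweetReader:
    """Sketch of the DatasetReader pattern described above."""

    def __init__(self, max_instances: Optional[int] = None):
        # max_instances caps how many Instances _read will yield.
        self.max_instances = max_instances

    def text_to_instance(self, text: str, party: Optional[str] = None) -> Instance:
        # Build one Instance from a tweet's text and (optionally) its party label.
        fields: Dict[str, object] = {"text": text.split()}
        if party is not None:
            fields["label"] = party
        return Instance(fields)

    def _read(self, file_path: str) -> Iterator[Instance]:
        # Lazily yield Instances; assumes tab-separated "text<TAB>party" lines.
        count = 0
        with open(file_path) as f:
            for line in f:
                if self.max_instances is not None and count >= self.max_instances:
                    break
                text, party = line.rstrip("\n").split("\t")
                yield self.text_to_instance(text, party)
                count += 1
```

Because `_read` is a generator, instances are produced lazily, which matters for a large dataset that may not fit in memory at once.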
After constructing the DatasetReader, we can use AllenNLP's built-in functions to construct a DataLoader. However, before we group the data into batches, we need to convert the sequence of Tokens recorded in each Instance's TextField into the corresponding idx.
- We need to iterate through all Instances to build the vocabulary.
- Using the vocabulary, encode the contents of each Field of each Instance.
  - For `TextField`: we need to convert each Token into its corresponding idx using an indexer
  - For `LabelField`: we need to map the labels to ids with the vocabulary
  - For `SequenceLabelField`: we need to map each label in the sequence to an id
- To run the program: `python dataloader.py`
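Once instances are indexed, the DataLoader's remaining job is to group them into batches and pad each batch to a uniform length. A minimal sketch of that step (illustrative only; the field names and `pad_id` default are assumptions, not the project's actual code):

```python
from typing import Iterator, List, Sequence


def batch_instances(instances: Sequence[dict], batch_size: int) -> Iterator[List[dict]]:
    # Group indexed instances into fixed-size batches (the last may be smaller).
    for start in range(0, len(instances), batch_size):
        yield list(instances[start:start + batch_size])


def pad_batch(batch: List[dict], pad_id: int = 0) -> dict:
    # Pad every instance's token ids to the longest sequence in the batch,
    # producing rectangular lists ready for tensorization.
    max_len = max(len(inst["text"]) for inst in batch)
    return {
        "text": [inst["text"] + [pad_id] * (max_len - len(inst["text"])) for inst in batch],
        "label": [inst["label"] for inst in batch],
    }
```

Padding per batch (rather than to a global maximum) keeps short-tweet batches small, which is why batching happens after indexing.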