nielsrogge / transformers-tutorials Goto Github PK

View Code? Open in Web Editor NEW

7.5K 119.0 1.2K 218.29 MB

This repository contains demos I made with the Transformers library by HuggingFace.

License: MIT License

Jupyter Notebook 100.00%

transformers pytorch bert vision-transformer layoutlm gpt-2

transformers-tutorials's Introduction

Transformers-Tutorials

Hi there!

This repository contains demos I made with the Transformers library by 🤗 HuggingFace. Currently, all of them are implemented in PyTorch.

NOTE: if you are not familiar with HuggingFace and/or Transformers, I highly recommend to check out our free course, which introduces you to several Transformer architectures (such as BERT, GPT-2, T5, BART, etc.), as well as an overview of the HuggingFace libraries, including Transformers, Tokenizers, Datasets, Accelerate and the hub.

For an overview of the ecosystem of HuggingFace for computer vision (June 2022), refer to this notebook with corresponding video.

Currently, it contains the following demos:

Audio Spectrogram Transformer (paper):
- performing inference with ASTForAudioClassification to classify audio.
BERT (paper):
- fine-tuning BertForTokenClassification on a named entity recognition (NER) dataset.
- fine-tuning BertForSequenceClassification for multi-label text classification.
BEiT (paper):
- understanding BeitForMaskedImageModeling
CANINE (paper):
- fine-tuning CanineForSequenceClassification on IMDb
CLIPSeg (paper):
- performing zero-shot image segmentation with CLIPSeg
Conditional DETR (paper):
- performing inference with ConditionalDetrForObjectDetection
- fine-tuning ConditionalDetrForObjectDetection on a custom dataset (balloon)
ConvNeXT (paper):
- fine-tuning (and performing inference with) ConvNextForImageClassification
DINO (paper):
- visualize self-attention of Vision Transformers trained using the DINO method
DETR (paper):
- performing inference with DetrForObjectDetection
- fine-tuning DetrForObjectDetection on a custom object detection dataset
- evaluating DetrForObjectDetection on the COCO detection 2017 validation set
- performing inference with DetrForSegmentation
- fine-tuning DetrForSegmentation on COCO panoptic 2017
DPT (paper):
- performing inference with DPT for monocular depth estimation
- performing inference with DPT for semantic segmentation
Deformable DETR (paper):
- performing inference with DeformableDetrForObjectDetection
DiT (paper):
- performing inference with DiT for document image classification
Donut (paper):
- performing inference with Donut for document image classification
- fine-tuning Donut for document image classification
- performing inference with Donut for document visual question answering (DocVQA)
- performing inference with Donut for document parsing
- fine-tuning Donut for document parsing with PyTorch Lightning
GIT (paper):
- performing inference with GIT for image/video captioning and image/video question-answering
- fine-tuning GIT on a custom image captioning dataset
GLPN (paper):
- performing inference with GLPNForDepthEstimation to illustrate monocular depth estimation
GPT-J-6B (repository):
- performing inference with GPTJForCausalLM to illustrate few-shot learning and code generation
GroupViT (repository):
- performing inference with GroupViTModel to illustrate zero-shot semantic segmentation
ImageGPT (blog post):
- (un)conditional image generation with ImageGPTForCausalLM
- linear probing with ImageGPT
LUKE (paper):
- fine-tuning LukeForEntityPairClassification on a custom relation extraction dataset using PyTorch Lightning
LayoutLM (paper):
- fine-tuning LayoutLMForTokenClassification on the FUNSD dataset
- fine-tuning LayoutLMForSequenceClassification on the RVL-CDIP dataset
- adding image embeddings to LayoutLM during fine-tuning on the FUNSD dataset
LayoutLMv2 (paper):
- fine-tuning LayoutLMv2ForSequenceClassification on RVL-CDIP
- fine-tuning LayoutLMv2ForTokenClassification on FUNSD
- fine-tuning LayoutLMv2ForTokenClassification on FUNSD using the 🤗 Trainer
- performing inference with LayoutLMv2ForTokenClassification on FUNSD
- true inference with LayoutLMv2ForTokenClassification (when no labels are available) + Gradio demo
- fine-tuning LayoutLMv2ForTokenClassification on CORD
- fine-tuning LayoutLMv2ForQuestionAnswering on DOCVQA
LayoutLMv3 (paper):
- fine-tuning LayoutLMv3ForTokenClassification on the FUNSD dataset
LayoutXLM (paper):
- fine-tuning LayoutXLM on the XFUND benchmark for token classification
- fine-tuning LayoutXLM on the XFUND benchmark for relation extraction
MarkupLM (paper):
- inference with MarkupLM to perform question answering on web pages
- fine-tuning MarkupLMForTokenClassification on a toy dataset for NER on web pages
Mask2Former (paper):
- performing inference with Mask2Former for universal image segmentation:
MaskFormer (paper):
- performing inference with MaskFormer (both semantic and panoptic segmentation):
- fine-tuning MaskFormer on a custom dataset for semantic segmentation
OneFormer (paper):
- performing inference with OneFormer for universal image segmentation:
Perceiver IO (paper):
- showcasing masked language modeling and image classification with the Perceiver
- fine-tuning the Perceiver for image classification
- fine-tuning the Perceiver for text classification
- predicting optical flow between a pair of images with PerceiverForOpticalFlow
- auto-encoding a video (images, audio, labels) with PerceiverForMultimodalAutoencoding
SAM (paper):
- performing inference with MedSAM
- fine-tuning SamModel on a custom dataset
SegFormer (paper):
- performing inference with SegformerForSemanticSegmentation
- fine-tuning SegformerForSemanticSegmentation on custom data using native PyTorch
T5 (paper):
- fine-tuning T5ForConditionalGeneration on a Dutch summarization dataset on TPU using HuggingFace Accelerate
- fine-tuning T5ForConditionalGeneration (CodeT5) for Ruby code summarization using PyTorch Lightning
TAPAS (paper):
- fine-tuning TapasForQuestionAnswering on the Microsoft Sequential Question Answering (SQA) dataset
- evaluating TapasForSequenceClassification on the Table Fact Checking (TabFact) dataset
Table Transformer (paper):
- using the Table Transformer for table detection and table structure recognition
TrOCR (paper):
- performing inference with TrOCR to illustrate optical character recognition with Transformers, as well as making a Gradio demo
- fine-tuning TrOCR on the IAM dataset using the Seq2SeqTrainer
- fine-tuning TrOCR on the IAM dataset using native PyTorch
- evaluating TrOCR on the IAM test set
UPerNet (paper):
- performing inference with UperNetForSemanticSegmentation
VideoMAE (paper):
- performing inference with VideoMAEForVideoClassification
ViLT (paper):
- fine-tuning ViLT for visual question answering (VQA)
- performing inference with ViLT to illustrate visual question answering (VQA)
- masked language modeling (MLM) with a pre-trained ViLT model
- performing inference with ViLT for image-text retrieval
- performing inference with ViLT to illustrate natural language for visual reasoning (NLVR)
ViTMAE (paper):
- reconstructing pixel values with ViTMAEForPreTraining
Vision Transformer (paper):
- performing inference with ViTForImageClassification
- fine-tuning ViTForImageClassification on CIFAR-10 using PyTorch Lightning
- fine-tuning ViTForImageClassification on CIFAR-10 using the 🤗 Trainer
X-CLIP (paper):
- performing zero-shot video classification with X-CLIP
- zero-shot classifying a YouTube video with X-CLIP
YOLOS (paper):
- fine-tuning YolosForObjectDetection on a custom dataset
- inference with YolosForObjectDetection

... more to come! 🤗

If you have any questions regarding these demos, feel free to open an issue on this repository.

Btw, I was also the main contributor to add the following algorithms to the library:

TAbular PArSing (TAPAS) by Google AI
Vision Transformer (ViT) by Google AI
DINO by Facebook AI
Data-efficient Image Transformers (DeiT) by Facebook AI
LUKE by Studio Ousia
DEtection TRansformers (DETR) by Facebook AI
CANINE by Google AI
BEiT by Microsoft Research
LayoutLMv2 (and LayoutXLM) by Microsoft Research
TrOCR by Microsoft Research
SegFormer by NVIDIA
ImageGPT by OpenAI
Perceiver by Deepmind
MAE by Facebook AI
ViLT by NAVER AI Lab
ConvNeXT by Facebook AI
DiT By Microsoft Research
GLPN by KAIST
DPT by Intel Labs
YOLOS by School of EIC, Huazhong University of Science & Technology
TAPEX by Microsoft Research
LayoutLMv3 by Microsoft Research
VideoMAE by Multimedia Computing Group, Nanjing University
X-CLIP by Microsoft Research
MarkupLM by Microsoft Research

All of them were an incredible learning experience. I can recommend anyone to contribute an AI algorithm to the library!

Data preprocessing

Regarding preparing your data for a PyTorch model, there are a few options:

a native PyTorch dataset + dataloader. This is the standard way to prepare data for a PyTorch model, namely by subclassing torch.utils.data.Dataset, and then creating a corresponding DataLoader (which is a Python generator that allows to loop over the items of a dataset). When subclassing the Dataset class, one needs to implement 3 methods: __init__, __len__ (which returns the number of examples of the dataset) and __getitem__ (which returns an example of the dataset, given an integer index). Here's an example of creating a basic text classification dataset (assuming one has a CSV that contains 2 columns, namely "text" and "label"):

from torch.utils.data import Dataset

class CustomTrainDataset(Dataset):
    def __init__(self, df, tokenizer):
        self.df = df
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        # get item
        item = df.iloc[idx]
        text = item['text']
        label = item['label']
        # encode text
        encoding = self.tokenizer(text, padding="max_length", max_length=128, truncation=True, return_tensors="pt")
        # remove batch dimension which the tokenizer automatically adds
        encoding = {k:v.squeeze() for k,v in encoding.items()}
        # add label
        encoding["label"] = torch.tensor(label)
        
        return encoding

Instantiating the dataset then happens as follows:

from transformers import BertTokenizer
import pandas as pd

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
df = pd.read_csv("path_to_your_csv")

train_dataset = CustomTrainDataset(df=df, tokenizer=tokenizer)

Accessing the first example of the dataset can then be done as follows:

encoding = train_dataset[0]

In practice, one creates a corresponding DataLoader, that allows to get batches from the dataset:

from torch.utils.data import DataLoader

train_dataloader = DataLoader(train_dataset, batch_size=4, shuffle=True)

I often check whether the data is created correctly by fetching the first batch from the data loader, and then printing out the shapes of the tensors, decoding the input_ids back to text, etc.

batch = next(iter(train_dataloader))
for k,v in batch.items():
    print(k, v.shape)
# decode the input_ids of the first example of the batch
print(tokenizer.decode(batch['input_ids'][0].tolist())

HuggingFace Datasets. Datasets is a library by HuggingFace that allows to easily load and process data in a very fast and memory-efficient way. It is backed by Apache Arrow, and has cool features such as memory-mapping, which allow you to only load data into RAM when it is required. It only has deep interoperability with the HuggingFace hub, allowing to easily load well-known datasets as well as share your own with the community.

Loading a custom dataset as a Dataset object can be done as follows (you can install datasets using pip install datasets):

from datasets import load_dataset

dataset = load_dataset('csv', data_files={'train': ['my_train_file_1.csv', 'my_train_file_2.csv'] 'test': 'my_test_file.csv'})

Here I'm loading local CSV files, but there are other formats supported (including JSON, Parquet, txt) as well as loading data from a local Pandas dataframe or dictionary for instance. You can check out the docs for all details.

Training frameworks

Regarding fine-tuning Transformer models (or more generally, PyTorch models), there are a few options:

using native PyTorch. This is the most basic way to train a model, and requires the user to manually write the training loop. The advantage is that this is very easy to debug. The disadvantage is that one needs to implement training him/herself, such as setting the model in the appropriate mode (model.train()/model.eval()), handle device placement (model.to(device)), etc. A typical training loop in PyTorch looks as follows (inspired by this great PyTorch intro tutorial):

import torch
from transformers import BertForSequenceClassification

# Instantiate pre-trained BERT model with randomly initialized classification head
model = BertForSequenceClassification.from_pretrained("bert-base-uncased")

# I almost always use a learning rate of 5e-5 when fine-tuning Transformer based models
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)

# put model on GPU, if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

for epoch in range(epochs):
    model.train()
    train_loss = 0.0
    for batch in train_dataloader:
        # put batch on device
        batch = {k:v.to(device) for k,v in batch.items()}
        
        # forward pass
        outputs = model(**batch)
        loss = outputs.loss
        
        train_loss += loss.item()
        
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    print("Loss after epoch {epoch}:", train_loss/len(train_dataloader))
    
    model.eval()
    val_loss = 0.0
    with torch.no_grad():
        for batch in eval_dataloader:
            # put batch on device
            batch = {k:v.to(device) for k,v in batch.items()}
            
            # forward pass
            outputs = model(**batch)
            loss = outputs.logits
            
            val_loss += loss.item()
                  
    print("Validation loss after epoch {epoch}:", val_loss/len(eval_dataloader))

PyTorch Lightning (PL). PyTorch Lightning is a framework that automates the training loop written above, by abstracting it away in a Trainer object. Users don't need to write the training loop themselves anymore, instead they can just do trainer = Trainer() and then trainer.fit(model). The advantage is that you can start training models very quickly (hence the name lightning), as all training-related code is handled by the Trainer object. The disadvantage is that it may be more difficult to debug your model, as the training and evaluation is now abstracted away.
HuggingFace Trainer. The HuggingFace Trainer API can be seen as a framework similar to PyTorch Lightning in the sense that it also abstracts the training away using a Trainer object. However, contrary to PyTorch Lightning, it is not meant not be a general framework. Rather, it is made especially for fine-tuning Transformer-based models available in the HuggingFace Transformers library. The Trainer also has an extension called Seq2SeqTrainer for encoder-decoder models, such as BART, T5 and the EncoderDecoderModel classes. Note that all PyTorch example scripts of the Transformers library make use of the Trainer.
HuggingFace Accelerate: Accelerate is a new project, that is made for people who still want to write their own training loop (as shown above), but would like to make it work automatically irregardless of the hardware (i.e. multiple GPUs, TPU pods, mixed precision, etc.).

transformers-tutorials's People

Contributors

Stargazers

Watchers

Forkers

chloe2 abhibisht89 hatranvin aaronjoseph pinkaban dinhngoclan ljferrer askintution vanessatelles sibtainrazajamali s4sarath imperialite fratenuta stjordanis matias-mw sonalilala rahulkhairnarr linhduongtuan anminhhung ajaypayattuparambil tristanoprofetto jesperkers custom-org vinayasathyanarayana cmeninwa andreaparker muralits98 vision-lang trannhatquy daominhkhanh20 cxqntnt salah856 china-challengehub chuanjiangcui qianghaozhang cogdof trisitc kforcodeai melcutz lkafle lisaterumi plaban1981 bigdatamatta psssnikhil hadryan phunc20 kd303 techthiyanes cenchaojun jnishi antilibrary5 amirstudy hsaurabh0919 brahimmade dliofindia pkurainbow kapitsa2811 pavan-b scape1989 naveenjr caltunay yazidoudou18 raniem-ar awerepo abdullahmohammadkhan goswamig ashishpatel26 mansurul11 geogubd rafallewanczyk admariner garymihalik1 tsharma89 ritog zhesun821 riedlma siatwangmin gabriellavoura suresh-nakkeran aydinmyilmaz nehc zjh-819 atinangrish hzy312 barana91 lw3259111 felixdittrich92 jame76 chuonglqspkt jhxcugbcs jwijffels ustchope piegu enriquecatala shubham-ai celsollopes bluekiji77 danielschulz maxpark jinlmsft

transformers-tutorials's Issues

UnicodeDecodeError

Hi,
when I run the code I have below issue, can you please guide?
Tnx a lot!

Epoch: 0
Loss after 0 steps: 2.1485347747802734
File "...\lib\encodings\cp1250.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x98 in position 53976: character maps to

Can you create a demo for LayoutLMv2 as well?

Recreating DocVQA results for LayoutLMv2

Related issue on the unilm repo.

I'm trying to recreate the results reported in the LayoutLMv2 paper, Table 6, row 7. Following this example, I've fine-tuned the base model with DocVQA training set for 20 epochs. The resulting model is under-performing compared to what's reported in the paper (roughly 40% of answers default to [CLS]). I'm wondering whether:

anyone has been able to reproduce the results
the number of epochs (20) was based on original work by authors or was for demo purposes only

can we pass a minimum threshold score, below which the trocr will return an empty string?

word level predictions are more than the actual number of words

Hi @NielsRogge,

Thank you for the amazing contribution that you made.
I noticed in the Fine tuning notebook that the number of words detected with pytesseract (164 in the notebook) are less than the number of word level predictions (229 in the notebook) even after eliminating the special tokens. It is supposed that they are equal. How to make it ?

Thanks in advance

Word Grouping for Entities

The LayoutLM model is able to capture the entity class at word level. How do we group words based on entity?

Custom Dataset for LayoutLMv2

Hi Niels. First of all, congratulations on your incredible tutorials! They are really impressive.

I'm very interested in making Fine-tuning to LayoutLMv2, similar to what you did on Fine-tuning LayoutLMv2ForTokenClassification on CORD.ipynb with the receipt's information. But I can't wrap my head around the data pre-processing.
I want to use my own dataset for fine-tuning. I have hundreds of images ready to be annotated, but I don't know how to do it. I've seen in the other tutorials that all the datasets used in fine-tuning have the annotations in JSON format. I only know how to do them in XML PASCAL VOC format. Is there a specific way to create a custom dataset to be used by LayoutLMv2?

I would really appreciate some guidance. Again, congratulations on your tutorials and progress.

UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 53977: character maps to <undefined>

Great notebook!

I am getting this error:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 53977: character maps to

when running the running on this section:

for epoch in range(num_train_epochs):
  print("Epoch:", epoch)
  for batch in tqdm(train_dataloader):

Any idea?

How to get LayoutLMv2 output as key-value pairs?

Model I am using is LayoutLMv2:

(Link of the demo for reference: https://huggingface.co/spaces/nielsr/LayoutLMv2-FUNSD )

I do get 'questions' & 'answers' as separate colored boxes in output image. But is there a way to get it as a python dictionary (key-value pairs), as in questions become keys & answers become its corresponding values?

Why does TAPAS perform worse than reported?

Hi, nice tutorials!

Thank you for adding TAPAS to huggingface/transformers. It is really helpful.

However, according to your Evaluating_TAPAS_on_the_Tabfact_test_set.ipynb, the performance of tapas-base-finetuned-tabfact on test set is 77.1 while it is reported as 78.5 in the paper. What attributes to the performance drop?

Thank you!

How are the image embeddings being set to LayoutLM model?

Thankyou for the notebook on using LayoutLM model. I am not able to figure out the part where we are sending image embeddings through Faster-RCNN. This code only relies on text is it?

LayoutXLM for Token Classification on FUNSD

Hello Niels, first thanks a lot for all of your awesome tutorials,
I'm trying to apply LayoutLM v2 Token classification tutorial on LayoutXLM, and I'm facing few issues.
I'm trying to have a processer for LayoutXLM, so converting this line

from transformers import LayoutLMv2Processor
processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased", revision="no_ocr")

to those, but none worked.

processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutxlm-base", revision="no_ocr")

feature_extractor = LayoutLMv2FeatureExtractor(apply_ocr=False)
tokenizer = LayoutLMv2TokenizerFast.from_pretrained("microsoft/layoutxlm-base")
processor = LayoutLMv2Processor(feature_extractor, tokenizer)

So, Can you please help me figure out what to change to have it work?
Many thanks in advance!.

LayoutLMV2: Splitting the document (>512 tokens) into multiples

Hello,

What is the best way to divide the document into multiple parts for LayoutLMv2 without losing any information from the document?

What is the best way to devide larger document (tokens > 512) in the custom Dataset class in which __getitem__ function supposed to return a single entry from the dataset, and each entry/example could be of only upto 512 token? I am using the custom dataset class for getting and a DataLoader to feed the train and test datasets to the pre-trained model for fine-tuning and testing. In the response of past issue, I found that we can create multiple training examples for a given document. In that scenario is it possible to still use the LayoutLMv2Processor along with the custom dataset to get the document split with multiple subparts? The link of the tutorial mentioned in that response does not answer this specific question. Below is the portion of the custom dataset class that I am using that truncates and pads if needed and does not split the document into multiple parts.

processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased", revision="no_ocr")

class CustomImageDataset(Dataset):
    def __init__(self, instance_list_path, image_dir, ocr_dir, processor=None, max_length=512):
        self.image_ids = []
        with open(instance_list_path) as ip_file:
            for line in ip_file:
                self.image_ids.append(line.strip())
        self.image_dir = image_dir
        self.ocr_dir = ocr_dir
        self.processor = processor
        self.max_length = max_length

    def __len__(self):
        return len(self.image_ids)

    def __getitem__(self, idx):
        # first, take an image
        curr_id = self.image_ids[idx]
        if os.path.exists(f'{self.image_dir}/{curr_id}.png'):
            image = Image.open(f'{self.image_dir}/{curr_id}.png').convert("RGB")
        else:
            image = Image.open(f'{self.image_dir}/{curr_id}.jpg').convert("RGB")
        image = image.resize((224, 224))

        # get word-level annotations (sorted top to bottom and left to right appearance, basic flow of a document)
        words, boxes, word_labels = self.get_word_box_labels(idx)

        assert len(words) == len(boxes) == len(word_labels)
        
        # use processor to prepare everything
        encoded_inputs = self.processor(image, words, boxes=boxes, word_labels=word_labels, 
                                        padding="max_length", truncation=True, 
                                        return_tensors="pt")
        
        # remove batch dimension
        for k,v in encoded_inputs.items():
          encoded_inputs[k] = v.squeeze()

        assert encoded_inputs.input_ids.shape == torch.Size([512])
        assert encoded_inputs.attention_mask.shape == torch.Size([512])
        assert encoded_inputs.token_type_ids.shape == torch.Size([512])
        assert encoded_inputs.bbox.shape == torch.Size([512, 4])
        assert encoded_inputs.image.shape == torch.Size([3, 224, 224])
        assert encoded_inputs.labels.shape == torch.Size([512]) 
      
        return encoded_inputs

For the processor input, if I manually split words, boxes, and word_labels into small slices and use the exact same entire image as the image input for smaller slices of words, boxes, and word_labels, will that be fine? I will keep some stride/window to give context information in the continuous blocks. As an example below, if I want to split a document/image containing 200 OCR words input into 2 portions with the first portion containing the tokens corresponding to 150 words with 25 words of overlap between two portions, does that look correct?

image = Image.open(f'{self.image_dir}/{curr_id}.png').convert("RGB")

# sorted top to bottom and left to right appearance, basic flow of a document
words, boxes, word_labels = self.get_word_box_labels(idx)
words_1 = words[:150]
words_2 = words[125:]
boxes_1 = boxes[:150]
boxes_2 = boxes[125:]
word_labels_1 = word_labels[:150]
word_labels_2 = word_labels[125:]

# portion-1 with the entire image as the 1st parameter
encoded_inputs_1 = self.processor(image, words_1, boxes=boxes_1, word_labels=word_labels_1, 
                                        padding="max_length", truncation=True, 
                                        return_tensors="pt")

# portion-2 with the entire image as the 1st parameter
encoded_inputs_2 = self.processor(image, words_2, boxes=boxes_2, word_labels=word_labels_2, 
                                        padding="max_length", truncation=True, 
                                        return_tensors="pt")

Moreover, is there any way to use LayoutLMv2Processor for automatically splitting a document into multiple examples without losing any information? The reference example link that you mentioned in your response uses AutoTokenizer and has return_overflowing_tokens and stride parameters. But, I could not find such parameters for LayoutLMv2Processor. Sometimes, even a small number of words end up with >=512 tokens, and splitting the input based on words can sometimes cause deleting the end of the document using LayoutLMv2Processor with the padding="max_length" and truncation=True parameters set to it. So, I would like to see if I can still use the LayoutLMv2Processor or the output of it to split the document into multiple parts without losing any information.

Unable to find data - val_v1.0.json file

Hello, can you please help me locate the "val_v1.0.json" file? I am unable to find the val folder in the repo as well.
Any help will be appreciated!

with open('/content/drive/MyDrive/LayoutLMv2/Tutorial notebooks/DocVQA/val/val_v1.0.json') as f:

trainer.test() returns empty dict on Vision Transformer Notebook.

I tried running this Notebook on Google Colab with a T4 GPU.

The training stopped at Epoch 3: 55%, I guess due to the Early Stopping mechanism.

The execution is successful but the Validation sanity check is at 0%.

When I execute-

trainer.test()

I get a successful execution with no error, but I get an empty dictionary as a result. Why is this happening?

I previously got an error saying that the CPU was bottlenecking dataloading, and I updated the dataloading process to use 4 workers instead of the default 1. The problem did go away. But I don't see any problems as such, but it does not work either.

What should I do to get it to work?

Colab: https://colab.research.google.com/drive/18UJ3dVG27xaRTI1BdYFK_vgawpDm8EAv?usp=sharing

Fine-tuning LayoutLMv2 on multi-gpus

Based on providing notebook at here, I reproduce LayoutLMv2 at my repo. Everything is okay when I run on a single GPU with a batch size equals to 2, but I have trouble when fine-tuning. The fine-tuning on a large dataset like DocVQA is too long when using a single GPU. So I have followed the Pytorch instruction at here to train it on multi-GPUS but it generates errors.

Running command:

CUDA_VISIBLE_DEVICES=1,2 python train.py --train_config default_config --work_dir runs/train/experiment1/

Output:
RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 1 does not equal 0 (while checking arguments for cudnn_convolution).

I have found this issue in some forums, but actually, it doesn't work for me.

Can you help me how to fine-tuning on multi-GPUs?

Link prediction[LayoutLM]

Hi,
Thank you for such amazing notebooks and sorry if I am missing something basic. Are you doing link prediction in the layoutlm notebooks? If not, any ideas on how it can be done and does the original implementation do that, or is it only token classification?

colab link order may be wrong

link
Fine_tuning_LayoutLMForSequenceClassification_on_RVL_CDIP.ipynb
to
fine-tuning LayoutLMForTokenClassification on the FUNSD dataset

whereas link
Fine_tuning_LayoutLMForTokenClassification_on_FUNSD.ipynb
to
fine-tuning LayoutLMForSequenceClassification on the RVL-CDIP dataset

LayoutXLM for groups of tokens classification

Hello again Niels!

I'm trying to use LayoutXLM (I'm working with german documents) to identify a special kind of query in invoice. So my objective is to classify sequences of words instead of just tokens.

For example, the queries can come in the form of tables. In that case, instead of classifying all the tokens on the table, I want to classify each row, like this:

I'm familiar with table extraction and detection algorithms, but the queries often come in extremely different formats, not just tables. After extensive research, I believe that LayoutLM family of algorithms can tackle my issue.

Is there any way in which you can guide me? I already found a way to make proper annotations with the tools you mentioned in my previous issue.

about the memeory

when I fine-tune the layoutlm v2，I created the feature using all train dataset，but the memory is not suffient，how can I solve the problem?

there is insufficient memory for the java runtime envirionment to continue.
...

Inference on custom data

How do I modify the FUNSDDataset class for inference on custom data, where I do not have annotations?

Tokenizer

I am training layoutlmv2 token VQA model consist words more than 512. but somehow only 512 words are only feeding to the model. so, other words are not training. any way by which I can train the whole document?

Where does image embeddings come from in Layoutlm funsd task?

Hello @NielsRogge . Thanks a lot for the wonderful tutorial of Fine_tuning_LayoutlmfortokenClassification_on_Funsd.
I have been able to successfully run it on my system. However, I have a basic practical doubt. In the paper of Layoutlm, the authors point out that Layoutlm uses small image clips for each bounding box for training purposes too.
Page 2 of the paper

meanwhile the image embedding can capture some appearance features such as font directions, types, and colors.

But these small crops of each bounding box was not made in any of the code you did. Is that task is done while doing forward pass in the below code you uploaded?

outputs = model(input_ids=input_ids, bbox=bbox, attention_mask=attention_mask, token_type_ids=token_type_ids,
                      labels=labels)

Basically, I cannot find any part of the code that calculated image embedding from bounding boxes while the paper claims it does. Can you help me find the part of code that does that as I think open-source code which they have uploaded might not have that code. Thanks a lot, man. Looking forward to your response.

Output size is 561x768 instead of 512x768

Hello,

I am trying to build my own classifier head over LayoutLMv2Model using HuggingFace Transformers. But the size of the output of LayoutLMv2Model is 561x768 instead of 512x768. Is this an error or am I missing something? Here is the link to my notebook - https://colab.research.google.com/drive/1FwO7E-CrXTYTERajuIgmAV8VP4J5UfvR?usp=sharing

Thank you in advance, your responses and notebooks have really helped me!

Can the data split further divided into simple_test, complex_test, small_test

Sure. Here is the raw train, validation, and test data.

Download the data and run the following script, it is expected to get 79.1% accuracy.

import os
from typing import List
import torch
import pandas as pd
from transformers import TapasTokenizer, TapasForSequenceClassification
from datasets import load_dataset, load_metric, Features, Sequence, ClassLabel, Value, Array2D

def prepare_official_data_loader():
    tokenizer = TapasTokenizer.from_pretrained('google/tapas-base-finetuned-tabfact')
    features = Features({
        'attention_mask': Sequence(Value(dtype='int64')),
        'input_ids': Sequence(feature=Value(dtype='int64')),
        'label': ClassLabel(names=['refuted', 'entailed']),
        'statement': Value(dtype='string'),
        'table_caption': Value(dtype='string'),
        'table_id': Value(dtype='string'),
        'token_type_ids': Array2D(dtype="int64", shape=(512, 7))
    })
    test_set = load_dataset('json', data_files={'test': 'test.jsonl'}, split='test')

    def _format_pd_table(table_text: List) -> pd.DataFrame:
        df = pd.DataFrame(columns=table_text[0], data=table_text[1:])
        df = df.astype(str)
        return df

    test = test_set.map(
        lambda e: tokenizer(table=_format_pd_table(e['table_text']), queries=e['statement'],
                            truncation=True,
                            padding='max_length'),
        features=features,
        remove_columns=['table_text'],
    )
    # map to PyTorch tensors and only keep columns we need
    test.set_format(type='torch', columns=['input_ids', 'attention_mask', 'token_type_ids', 'label'])
    # create PyTorch dataloader
    test_dataloader = torch.utils.data.DataLoader(test, batch_size=4)

    return test_dataloader

def evaluate():
    accuracy = load_metric("accuracy")
    test_dataloader = prepare_official_data_loader()
    batch = next(iter(test_dataloader))
    assert batch["input_ids"].shape == (4, 512)
    assert batch["attention_mask"].shape == (4, 512)
    assert batch["token_type_ids"].shape == (4, 512, 7)

    # Evaluate
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    model = TapasForSequenceClassification.from_pretrained('google/tapas-base-finetuned-tabfact')
    model.to(device)

    number_processed = 0
    total = len(test_dataloader) * batch["input_ids"].shape[0]  # number of batches * batch_size
    for batch in test_dataloader:
        # get the inputs
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        token_type_ids = batch["token_type_ids"].to(device)
        labels = batch["label"].to(device)

        # forward pass
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids,
                        labels=labels)
        model_predictions = outputs.logits.argmax(-1)

        # add metric
        accuracy.add_batch(predictions=model_predictions, references=labels)

        number_processed += batch["input_ids"].shape[0]
        print(f"Processed {number_processed} / {total} examples")

    final_score = accuracy.compute()
    print(final_score)

if __name__ == '__main__':
    evaluate()

Originally posted by @JasperGuo in #2 (comment)

Scores for LayoutLMv2 not matching with paper result

Hello,

Thank you for your notebooks, they are really helpful to get started.
I am trying to reproduce the results of the LayoutLMv2 research paper using LayoutLMv2-base and they are 2% less than the reported ones using your notebook. In the paper, they have reported precision: 0.8029, recall: 0.8539, f1-score: 0.8276 but maximum scores that I am able to get are precision: 0.7907, recall: 0.8248, f1': 0.8074. What changes can I make to your notebook to get accuracy closer to the ones mentioned in the paper?

Layoutlmv2 - Document Classification predicting same class always

Hi Niels Rogge,

Thanks for all the awesome tutorials.

I have been working on fine-tuning LayoutlmV2 model for document classification on my own data. I am facing an issue that the model is predicting same class for all examples, even for training examples. The model's training accuracy was above 90% after certain epochs , but it is always predicting the same class. I have changed the learning rate also, but no improvement. I have trained 400 examples for 3 classes, equally balanced.

Another strange thing is , if I trained with less examples (5 to 15 examples for each class), it is predicting different classes, but if i train on more examples it is predicting same class.

Can you please help me regarding this ? Do I need to change any configuration before fine-tuning the model ?

Thanks in advance

Jerome

Image tokenization consumes lot of memory (ViT)

Hey, thanks so much for adding ViT support to transformers.

I was trying to finetune ViT on cifar10(full dataset) with your notebook but it consumes lot of disc memory, so instead i tried tokenization during training. It worked but again takes time to train as tokenization is slow.

Can we somehow speedup the tokenization process ? need your suggestions :) thanks !

pyarrow.lib.ArrowInvalid: Can only convert 1-dimensional array values

Hello Niels and thanks for your great tutorials...

I'm trying to run LayoutLMv2ForTokenClassification on FUNSD with no success... in this line

train_dataset = datasets['train'].map(preprocess_data, batched=True, remove_columns=datasets['train'].column_names, features=features)

I got this error:

File "/Workspace/python/hf2/lib64/python3.8/site-packages/datasets/arrow_writer.py", line 108, in arrow_array
storage = pa.array(self.data, type.storage_dtype)
File "pyarrow/array.pxi", line 306, in pyarrow.lib.array
File "pyarrow/array.pxi", line 39, in pyarrow.lib._sequence_to_array
File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Can only convert 1-dimensional array values

I tried in colab and my local machine... same result... in colab with pip installed versions.. in my machine with git versions... Can you offer some guidance ? thanks in advance.

Find the correct start and end position when extract feature

Follow your work, I reproduce the code to train layoutlmv2 on the docvqa dataset. But I have a problem with encoding datasets. Especially, the implementation can't find the extract start and end position of the answer in tokens extracted from the image.

def subfinder(words_list, answer_list):
    matches = []
    start_indices = []
    end_indices = []
    for idx, i in enumerate(range(len(words_list))):
        if words_list[i] == answer_list[0] and words_list[i:i + len(answer_list)] == answer_list:
            matches.append(answer_list)
            start_indices.append(idx)
            end_indices.append(idx + len(answer_list) - 1)
    if len(matches) != 0:
        return matches[0], start_indices[0], end_indices[0]
    else:
        return None, 0, 0


def read_ocr_annotation(file_path, shape):
    words_img = []
    boxes_img = []
    width, height = shape
    with open(file_path, 'r') as f:
        data = json.load(f)   # data = {"status": [], "recognitionResults": []}
        try:
            recognitionResults = data['recognitionResults']
            # Loop through each recognition line
            for reg_result in recognitionResults:
                lines = reg_result['lines']
                for line in lines:
                    for word_info in line['words']:
                        word_info['boundingBox'] = (word_info['boundingBox'])
                        x_min = np.min(word_info['boundingBox'][0:-1:2])
                        y_min = np.min(word_info['boundingBox'][1:-1:2])
                        x_max = np.max(word_info['boundingBox'][0:-1:2])
                        y_max = np.max(word_info['boundingBox'][1:-1:2])
                        words_img.append(word_info['text'])
                        boxes_img.append(normalize_bbox(bbox=[x_min, y_min, x_max, y_max], 
                            width=reg_result['width'], height=reg_result['height']))
        except:
            if not 'WORD' in data.keys():
                print("! Ignore ", file_path)
                return [], []
                
            for word in data['WORD']:
                text = word['Text']
                bbox = word['Geometry']['BoundingBox']
                bbox = [bbox['Left']*width, bbox['Top']*height, 
                        (bbox['Left'] + bbox['Width'])*width, 
                        (bbox['Top'] + bbox['Height'])*height]
                nl_bbox = normalize_bbox(bbox=bbox, width=width, height=height)
                words_img.append(text)
                boxes_img.append(nl_bbox)
    
    return (words_img, boxes_img)


def encode_dataset(examples, max_length=512):

    images         = [Image.open(image_file).convert("RGB") for image_file in examples['image']]
    org_shapes     = [img.size[0:2] for img in images]

    words          = []
    bbox           = []
    for i in range(len(images)):
        words_img, boxes_img = read_ocr_annotation(file_path=examples['ocr_output_file'][i], shape=org_shapes[i])
        words.append(words_img)
        bbox.append(boxes_img)

    questions  = examples['question']
    encoding   = processor(images, questions, words, bbox, max_length=max_length, padding="max_length", truncation=True)

    # next, add start_positions and end_positions
    start_positions = [0]*BATCH_SIZE
    end_positions   = [0]*BATCH_SIZE

    answers = examples['answers']
    
    # for every example in the batch:
    for idx in range(len(answers)):
        cls_index = encoding.input_ids[idx].index(processor.tokenizer.cls_token_id)

        words_example = [word.lower() for word in words[idx]]

        for answer in answers[idx]:
            match, word_idx_start, word_idx_end = subfinder(words_example, answer.lower().split())
            if match != None:
                break
    
        if match != None:
            sequence_ids = encoding.sequence_ids(idx)
            
            # Start token index of the current span in the text.
            token_start_index = 0
            while sequence_ids[token_start_index] != 1:
                token_start_index += 1

            # End token index of the current span in the text.
            token_end_index = len(encoding.input_ids[idx]) - 1
            while sequence_ids[token_end_index] != 1:
                token_end_index -= 1

            word_ids = encoding.word_ids(idx)[token_start_index:token_end_index+1]
            for id in word_ids:
                if id == word_idx_start:
                    start_positions[idx] = token_start_index
                    break
                else:
                    token_start_index += 1

            for id in word_ids[::-1]:
                if id == word_idx_end:
                    end_positions[idx] = token_end_index
                    break
                else:
                    token_end_index -= 1
        else:
            start_positions[idx] = cls_index
            end_positions[idx] = cls_index


    encoding['start_positions'] = start_positions
    encoding['end_positions']   = end_positions
    encoding['question_id']     = examples['questionId']

    return encoding

Could you help with any idea for getting correct annotation as much as possible?

ArrowInvalid error in Vision Transformer Notebook

Hi,

I am trying to execute the notebook and I kept encountering the error of the following

Any pointers on solving this issue?

https://colab.research.google.com/drive/1fKZWN1Q6GcvRFBbrcdnCMLALj-OUEfg2?usp=sharing

Add more features to LayoutLMv2 Token classification

Hello @NielsRogge,
How can we add an extra feature to LayoutLMv2 in token classification
Currently, we have ('words', 'boxes', 'ner_tags'), if we want to add another label beside the NER Tag and pass it to the model during the training and testing phases.
Thanks in advance!

How to extract the words

is there a possibility extract more than 512 words

License?

What is the license of this repository?

[LayoutLMv2] From start_logits and end_logits to answer

In working with your source code. I have implemented an inference on the entire subset (train or val or test) at here.

for idx, batch in enumerate(dataloader):
    input_ids = batch["input_ids"].to(device)
    attention_mask = batch["attention_mask"].to(device)
    token_type_ids = batch["token_type_ids"].to(device)
    bbox = batch["bbox"].to(device)
    image = batch["image"].to(device)
    start_positions = batch["start_positions"].to(device)
    end_positions = batch["end_positions"].to(device)

    # forward + backward + optimize
    outputs = model(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids,
                    bbox=bbox, image=image, start_positions=start_positions, end_positions=end_positions)
    start_position = torch.argmax(outputs.start_logits).cpu().numpy()
    end_position   = torch.argmax(outputs.end_logits).cpu().numpy()
    encoding = tokenizer(dataset_with_ocr['question'], dataset_with_ocr['words'], dataset_with_ocr['boxes'], 
                     max_length=512, padding="max_length", truncation=True)
    answer_pred = tokenizer.decode(encoded_dataset['input_ids'][0][start_position: end_position+1])
    print(answer_pred)

Can you confirm for me that whether my coding is correct or not ? What happened if end_position < start_position ? Do you have any idea to handle this case?
Thanks for the help!

LayoutLMv2 and LayoutXLM

Thank you so much for these great tutorials, they are really helpful
I was wondering if you have any plans to make a new tutorial for LayoutLMv2 or LayoutXLM in the near future?
Thank you again!

Embeddings for specific words

Hello,

Thank you for all the help on earlier issues.
I am trying to get layout embeddings for specific words in the document instead of the whole document. I was able to do it using LayoutLMv1 Tokenizer and LayoutLMv1 Model. But now I need to do the same using LayoutLMv2 Tokenizer and LayoutLMv2 Model. Code wise I am doing the following,

model =  LayoutLMv2Model.from_pretrained('microsoft/layoutlmv2-base-uncased', output_hidden_states = True,)
tokenizer = LayoutLMv2Tokenizer.from_pretrained('microsoft/layoutlmv2-base-uncased')

def text_preparation(text, tokenizer):
    marked_text = "[CLS] " + text + " [SEP]"
    tokenized_text = tokenizer.tokenize(marked_text)
    indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
    segments_ids = [1]*len(indexed_tokens)

    # Convert inputs to PyTorch tensors
    tokens_tensor = torch.tensor([indexed_tokens])
    segments_tensors = torch.tensor([segments_ids])

    return tokenized_text, tokens_tensor, segments_tensors

def get_layout_embeddings(tokens_tensor, segments_tensors, model,image):
    with torch.no_grad():
        outputs = model(input_ids=tokens_tensor,image=image) # ERROR ON THIS LINE
        hidden_states = outputs[2][1:]

    token_embeddings = hidden_states[-1]
    token_embeddings = torch.squeeze(token_embeddings, dim=0)
    list_token_embeddings = [token_embed.tolist() for token_embed in token_embeddings]
    return list_token_embeddings

def get_embedding(texts,image):
    target_word_embeddings = []
    for text in texts:
        tokenized_text, tokens_tensor, segments_tensors = bert_text_preparation(text, tokenizer)
        list_token_embeddings = get_bert_embeddings(tokens_tensor, segments_tensors, model,image)
        temp = []
        for word in tokenized_text:
            word_index = tokenized_text.index(word)
            word_embedding = list_token_embeddings[word_index]
            if(temp!=[]):
                temp = list(map(float.__add__, temp, word_embedding))
            else:
                temp = word_embedding
        target_word_embeddings.append(temp)
    return target_word_embeddings

image = Image.open(path_to_image)
words = ["Hello","World"]
get_embedding(words,image)

But I am getting the following error - 'NoneType' object has no attribute 'dtype' on line number 14. I am sending words and image to the model because all other parameters are marked as optional in the documentation.

I would really appreciate any help on this.

[Example] TrOCR / VisionEncoderDecoder

Would be nice if you can add some examples for fine-tuning for example with any pretrained bert as decoder !? :)
Do we have also a chance to export these after training to ONNX ?
However, I think even if then only with a own greedy search implementation

I have started to play a bit with it but struggle currently with the greedy search i shared my questions and notebook also in huggingface forum:
Question

Thanks a lot :)

`GLIBC_2.29' not found

And after I installed the latest version of the transformers (v4.10.0), I couldn't import LayoutLMv2Processor.
the error is shown as follow:

ImportError: /lib64/libm.so.6: version `GLIBC_2.29' not found (required by xxxxx/python3.7/site-packages/tokenizers/tokenizers.cpython-37m-x86_64-linux-gnu.so)

how to fixed this problem? Thank you.

save and load fine tuned Vision Transformer

trainer.save_model(path)

model = ViTForImageClassification.from_pretrained(path)

The path is correct but an error appears because is not the config.json file

Could you help me?
Thanks

How to train own dataset with LayoutLMv2?

I have some trouble training with own dataset for LayoutLMv2. Previously, I am possible to train my own dataset using LayoutLMv1 but I would like to try on the version 2 but realize there is some slight difference with the code. My dataset format now is currently the same as FUNSD dataset

amazing work!

Hi, just a quick note to say thanks for these amazing notebooks! They are super useful! Just a question though: where can I find a tensorflow version of them (or similar)?

Thanks!

getting error in get_ocr_words_and_boxes() method

I am using the below notebook to replicate finetuning results on DocVQA. I am getting errors while converting data into features.
https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/LayoutLMv2/DocVQA/Fine_tuning_LayoutLMv2ForQuestionAnswering_on_DocVQA.ipynb#scrollTo=DIRvzDlA9QXp
Getting error in this line:
dataset_with_ocr = dataset.map(get_ocr_words_and_boxes, batched=True, batch_size=2)

Error:
ValueError Traceback (most recent call last)
/tmp/ipykernel_37458/2876220197.py in
----> 1 dataset_with_ocr = dataset.map(get_ocr_words_and_boxes, batched=True, batch_size=2)

~/py38/lib/python3.8/site-packages/datasets/arrow_dataset.py in map(self, function, with_indices, input_columns, batched, batch_size, drop_last_batch, remove_columns, keep_in_memory, load_from_cache_file, cache_file_name, writer_batch_size, features, disable_nullable, fn_kwargs, num_proc, suffix_template, new_fingerprint, desc)
2034
2035 if num_proc is None or num_proc == 1:
-> 2036 return self._map_single(
2037 function=function,
2038 with_indices=with_indices,

~/py38/lib/python3.8/site-packages/datasets/arrow_dataset.py in wrapper(*args, **kwargs)
501 self: "Dataset" = kwargs.pop("self")
502 # apply actual function
--> 503 out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
504 datasets: List["Dataset"] = list(out.values()) if isinstance(out, dict) else [out]
505 for dataset in datasets:

~/py38/lib/python3.8/site-packages/datasets/arrow_dataset.py in wrapper(*args, **kwargs)
468 }
469 # apply actual function
--> 470 out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
471 datasets: List["Dataset"] = list(out.values()) if isinstance(out, dict) else [out]
472 # re-apply format to the output

~/py38/lib/python3.8/site-packages/datasets/fingerprint.py in wrapper(*args, **kwargs)
404 # Call actual function
405
--> 406 out = func(self, *args, **kwargs)
407
408 # Update fingerprint of in-place transforms + update in-place history of transforms

~/py38/lib/python3.8/site-packages/datasets/arrow_dataset.py in _map_single(self, function, with_indices, input_columns, batched, batch_size, drop_last_batch, remove_columns, keep_in_memory, load_from_cache_file, cache_file_name, writer_batch_size, features, disable_nullable, fn_kwargs, new_fingerprint, rank, offset, disable_tqdm, desc, cache_only)
2419 writer.write_table(batch)
2420 else:
-> 2421 writer.write_batch(batch)
2422 if update_data and writer is not None:
2423 writer.finalize() # close_stream=bool(buf_writer is None)) # We only close if we are writing in a file

~/py38/lib/python3.8/site-packages/datasets/arrow_writer.py in write_batch(self, batch_examples, writer_batch_size)
411 typed_sequence = OptimizedTypedSequence(batch_examples[col], type=col_type, try_type=col_try_type, col=col)
412 typed_sequence_examples[col] = typed_sequence
--> 413 pa_table = pa.Table.from_pydict(typed_sequence_examples)
414 self.write_table(pa_table, writer_batch_size)
415

~/py38/lib/python3.8/site-packages/pyarrow/table.pxi in pyarrow.lib.Table.from_pydict()

~/py38/lib/python3.8/site-packages/pyarrow/array.pxi in pyarrow.lib.asarray()

~/py38/lib/python3.8/site-packages/pyarrow/array.pxi in pyarrow.lib.array()

~/py38/lib/python3.8/site-packages/pyarrow/array.pxi in pyarrow.lib._handle_arrow_array_protocol()

~/py38/lib/python3.8/site-packages/datasets/arrow_writer.py in arrow_array(self, type)
114 else:
115 out = pa.array(cast_to_python_objects(self.data, only_1d_for_numpy=True), type=type)
--> 116 if trying_type and out[0].as_py() != self.data[0]:
117 raise TypeError(
118 "Specified try_type alters data. Please check that the type/feature that you provided match the type/features of the data."

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

Any help is really appreciated @NielsRogge

Set tesseract language

Hi Niels!

First of all, thank you for this amazing report with all these jupyter notebooks.

I would like to know if it is possible to select the tesseract language via LayoutLMv2FeatureExtractor() and if not , how could I do it.

Arrow Invalid error while running this training script for FUNSD training script using huggingface trainer

Really great work buddy.

I am stuck at this particular error and I am hoping you could help me out with the resolving of this issue.

The notebook I am facing error is "Fine-tuning LayoutLMv2ForTokenClassification on FUNSD using HuggingFace Trainer.ipynb"

The error I am facing is:
ArrowInvalid: Can only convert 1-dimensional array values
`---------------------------------------------------------------------------
ArrowInvalid Traceback (most recent call last)
in ()
28
29 train_dataset = datasets['train'].map(preprocess_data, batched=True, remove_columns=datasets['train'].column_names,
---> 30 features=features)
31 test_dataset = datasets['test'].map(preprocess_data, batched=True, remove_columns=datasets['test'].column_names,
32 features=features)

13 frames
/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py in map(self, function, with_indices, input_columns, batched, batch_size, drop_last_batch, remove_columns, keep_in_memory, load_from_cache_file, cache_file_name, writer_batch_size, features, disable_nullable, fn_kwargs, num_proc, suffix_template, new_fingerprint, desc)
1701 new_fingerprint=new_fingerprint,
1702 disable_tqdm=disable_tqdm,
-> 1703 desc=desc,
1704 )
1705 else:

/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py in wrapper(*args, **kwargs)
183 }
184 # apply actual function
--> 185 out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
186 datasets: List["Dataset"] = list(out.values()) if isinstance(out, dict) else [out]
187 # re-apply format to the output

/usr/local/lib/python3.7/dist-packages/datasets/fingerprint.py in wrapper(*args, **kwargs)
396 # Call actual function
397
--> 398 out = func(self, *args, **kwargs)
399
400 # Update fingerprint of in-place transforms + update in-place history of transforms

/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py in _map_single(self, function, with_indices, input_columns, batched, batch_size, drop_last_batch, remove_columns, keep_in_memory, load_from_cache_file, cache_file_name, writer_batch_size, features, disable_nullable, fn_kwargs, new_fingerprint, rank, offset, disable_tqdm, desc, cache_only)
2063 writer.write_table(batch)
2064 else:
-> 2065 writer.write_batch(batch)
2066 if update_data and writer is not None:
2067 writer.finalize() # close_stream=bool(buf_writer is None)) # We only close if we are writing in a file

/usr/local/lib/python3.7/dist-packages/datasets/arrow_writer.py in write_batch(self, batch_examples, writer_batch_size)
409 typed_sequence = OptimizedTypedSequence(batch_examples[col], type=col_type, try_type=col_try_type, col=col)
410 typed_sequence_examples[col] = typed_sequence
--> 411 pa_table = pa.Table.from_pydict(typed_sequence_examples)
412 self.write_table(pa_table, writer_batch_size)
413

/usr/local/lib/python3.7/dist-packages/pyarrow/table.pxi in pyarrow.lib.Table.from_pydict()

/usr/local/lib/python3.7/dist-packages/pyarrow/array.pxi in pyarrow.lib.asarray()

/usr/local/lib/python3.7/dist-packages/pyarrow/array.pxi in pyarrow.lib.array()

/usr/local/lib/python3.7/dist-packages/pyarrow/array.pxi in pyarrow.lib._handle_arrow_array_protocol()

/usr/local/lib/python3.7/dist-packages/datasets/arrow_writer.py in arrow_array(self, type)
106 storage = numpy_to_pyarrow_listarray(self.data, type=type.value_type)
107 else:
--> 108 storage = pa.array(self.data, type.storage_dtype)
109 out = pa.ExtensionArray.from_storage(type, storage)
110 elif isinstance(self.data, np.ndarray):

/usr/local/lib/python3.7/dist-packages/pyarrow/array.pxi in pyarrow.lib.array()

/usr/local/lib/python3.7/dist-packages/pyarrow/array.pxi in pyarrow.lib._sequence_to_array()

/usr/local/lib/python3.7/dist-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()

/usr/local/lib/python3.7/dist-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: Can only convert 1-dimensional array values`

Please help me out. Thanks

Vocab size for microsoft/layoutxlm-base

Hello there,

First of all thank you so much for the work you are doing, it's being really helpful for me to get my hands dirty with state-of-the-art models.

Some weeks ago I fine-tunned a layoutxlm-base using this notebook as reference and it worked, even got nice results with it.

Today I tried to run another training but unfortunetaly something went wrong, after a couple of hours I noticed that the tokenizer's size and model's vocab_size is 250002 but vocab's length is 250007.

So as a work around I came with this:
model.layoutlmv2.embeddings.word_embeddings = torch.nn.Embedding(250007, 768, padding_idx=1)

It seems to be working..
Furthermore I will save tokenizer and model files to ensure that will be always the same.

But my question is if I change this layer in the previous model I will get the same results? Or it is needed to re-train?

Once again, thank you so much!

ArrowInvalid: Can only convert 1-dimensional array values

I run the notebook Fine-tuning LayoutLMv2ForTokenClassification on FUNSD.ipynb and got the error as follow:

ArrowInvalid Traceback (most recent call last)

in ()
27
28 train_dataset = datasets['train'].map(preprocess_data, batched=True, remove_columns=datasets['train'].column_names,
---> 29 features=features)
30 test_dataset = datasets['test'].map(preprocess_data, batched=True, remove_columns=datasets['test'].column_names,
31 features=features)

13 frames

/usr/local/lib/python3.7/dist-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: Can only convert 1-dimensional array values

could anyone help me to solve the problem?

Image.open(image) chokes memory when preparing custom dataset

I am training the model on my custom dataset. The processor expects Image.open('image').convert('RGB') for all images. This is memory-consuming. I am creating a custom dataset on 800 train and 200 test samples.

Below is from the documentation .

processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased", revision="no_ocr")

image = Image.open("name_of_your_document - can be a png file, pdf, etc.").convert("RGB")
words = ["hello", "world"]
boxes = [[1, 2, 3, 4], [5, 6, 7, 8]] # make sure to normalize your bounding boxes
encoding = processor(image, words, boxes=boxes, return_tensors="pt")

I am not using inbuilt OCR. Why is an actual image needed here (Layoutlmv1 needed width and height only)? Is there an alternative to do this without Image.open()? Or else how to avoid huge memory consumption for the current process?

more than 512 tokens

how to handle the input image have more than 512 tokens or words in an image in Layout LM2 tokenizer example ?

the visual representation is stopping certain level while showing output ?

Fine-tuning model trained on DocVQA dataset

Hi @NielsRogge.
Could you provide the pre-trained model that is fined-tuning on the DocVQA dataset?
I want to use that model to compare with my trained model.

nielsrogge / transformers-tutorials Goto Github PK

transformers-tutorials's Introduction

Transformers-Tutorials

Data preprocessing

Training frameworks

transformers-tutorials's People

Contributors

Stargazers

Watchers

Forkers

transformers-tutorials's Issues

Recommend Projects

Recommend Topics

Recommend Org

Jobs