pbelcak / ultrafastbert

The repository for the code of the UltraFastBERT paper

License: MIT License

Python 73.13% Shell 22.02% C++ 3.50% Cuda 1.13% C 0.22%

ultrafastbert's Introduction

UltraFastBERT

The repository for the paper "Exponentially Faster Language Modelling"

https://arxiv.org/abs/2311.10770

Organisation

  1. The training folder contains a clone of the crammedBERT repository from the beginning of October 2023. A few new configurations and small modifications have been made to enable the use of FFFs. A masking implementation (i.e. an implementation of FFFs that offers no speed advantage over FFs but simulates their selective engagement of neurons by masking; see the sketch after this list) is provided for training and downstream finetuning.
  2. The benchmark_cpu folder contains C++ code using Intel MKL 2023.2.0 to implement accelerated CPU versions of FFF inference as well as baseline DMM implementations of the traditional FF layers.
  3. The benchmark_pytorch folder contains the C++ code for the "Native fused" and "PyTorch BMM" implementations of both FF and FFF inference.
  4. The benchmark_cuda folder contains the C++/CUDA kernel code for the "Naive CUDA" implementations of FF and FFF.
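
A minimal sketch of the masking view mentioned in item 1 (this is not the repository's code; the parameter names and shapes are assumptions): every node score is computed as in a dense FF layer of width 2**(depth+1) - 1, and a hard 0/1 mask then keeps only the nodes on the root-to-leaf path selected by the signs of those scores.

import torch
import torch.nn.functional as F

def fff_masked_forward(x, w1, w2, depth):
    # x: (batch, d_in); w1: (d_in, n_nodes); w2: (n_nodes, d_out)
    # with n_nodes = 2**(depth + 1) - 1 and the root counted as depth 0
    logits = x @ w1                                        # every node's score, like a wide FF
    rows = torch.arange(x.shape[0])
    mask = torch.zeros_like(logits)
    current = torch.zeros(x.shape[0], dtype=torch.long)    # start at the root (node 0)
    for _ in range(depth + 1):
        mask[rows, current] = 1.0                          # keep the visited node
        go_right = (logits[rows, current] > 0).long()
        current = 2 * current + 1 + go_right               # heap-style child indexing
    return (F.gelu(logits) * mask) @ w2                    # only path nodes contribute

This form is as expensive as an ordinary FF layer of the same total width, which is why it is used for training and downstream finetuning rather than for fast inference.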

Reproducing the results from weights

The configuration and weights for UltraFastBERT-1x11-long can be found on HuggingFace:

https://huggingface.co/pbelcak/UltraFastBERT-1x11-long

These files have been produced and uploaded using training/load_local_model.py with impl.push_to_huggingface_hub=True.

UltraFastBERT-1x11-long, as a model, is an instance of our small extension of the crammedBERT setup. You can simply enter the training directory and follow the steps given in the crammedBERT README for using the HuggingFace AutoTokenizer and AutoModelForMaskedLM, with the difference that you want UltraFastBERT-1x11-long, and not crammedBERT.

Quickstart

  1. Create a new Python/conda environment, or simply use one that does not have any previous version of the original cramming project installed. If you accidentally use the original cramming repository code instead of the one provided in the /training folder of this project, transformers will warn you that there are some extra weights (the FFF weights) and that some weights are missing (the FF weights expected by the original crammedBERT).
  2. cd ./training
  3. pip install .
  4. Create minimal_example.py
  5. Paste the code below
import cramming  # needed so that transformers can resolve the crammedBERT/FFF architecture
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("pbelcak/UltraFastBERT-1x11-long")
model = AutoModelForMaskedLM.from_pretrained("pbelcak/UltraFastBERT-1x11-long")

text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
  6. Run python minimal_example.py.

Reproducing the results from scratch

  1. To reproduce our training and finetuning results, simply head straight to the training folder and follow the instructions of the README there.
  2. To reproduce our CPU speed benchmarking results, head to benchmark_cpu. If you're on Windows, the easiest way to compile and run the code might be to use Visual Studio 2022 Community with the Intel oneAPI extension. The other option is to use the Intel compilers directly (more information on the Intel oneAPI "Getting started" websites).
  3. The benchmark_pytorch results can be reproduced by running python main.py in the folder. The outcomes of these runs are automatically written to a SQLite results.db file for ease of inspection (see the sketch after this list).
  4. benchmark_cuda requires the CUDA Toolkit. Once it is installed, running python setup.py install in the extension folder will compile the CUDA code for you and prepare a module that can be imported.
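
A quick, hedged way to inspect the results.db mentioned in item 3 from Python (the table names and schema are not documented here, so this sketch just enumerates whatever tables exist and prints a few rows from each):

import sqlite3

con = sqlite3.connect("results.db")   # produced by running benchmark_pytorch/main.py
tables = [row[0] for row in con.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")]
for table in tables:
    print(f"-- {table}")
    for row in con.execute(f"SELECT * FROM {table} LIMIT 5"):
        print("  ", row)
con.close()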

ultrafastbert's People

Contributors

adamritter, eltociear, konstide, pbelcak, tolgacangoz, why-in-shanghaitech

ultrafastbert's Issues

How can I use your code?

Dear author,
Hi, I just read your paper and it's awesome. I'm wondering which of all these versions is best suited for use in our transformer model. Of course, I want to use the GPU for training. Can I just use fff_cuda to handle the whole training process like a normal feed-forward layer? I'd also like to know whether you have tried using fff_cuda on Windows. Thank you!

Failure during evaluation after training

Following the training README I am using:

python pretrain.py name=amp_b8192_cb_o4_final arch=crammed-bert train=bert-o4  data=pile-readymade

Followed by:

python eval.py eval=GLUE_sane name=amp_b8192_cb_o4_final eval.checkpoint=latest impl.microbatch_size=16 impl.shuffle_in_dataloader=True impl.compile_torch=False

This fails:
[screenshot of the evaluation error omitted]

FFF-BERT seems to run slower than a vanilla BERT model

Boyan and I performance-tested the FFF-BERT (on HuggingFace) against a vanilla BERT of similar size, and found that it runs roughly 15% slower on my M2 Mac.

https://gist.github.com/p-i-/355668983aaeee3f282977cdfb93017c

This seems surprising, as the benchmarks do indeed demonstrate a ~50x speedup for a single feed-forward layer:

#!/bin/bash

echo "🔸 Batch size 100"
echo "naive FF (batch matmult)"
python main.py  --model ff_bmm  --input-width 8000  --hidden-width 4000  --output-width 8000  --depth 8  --batch-size 100  --n-iters 10  --device cpu

echo "FFF (batch matmult)"
python main.py  --model fff_bmm  --input-width 8000  --hidden-width 4000  --output-width 8000  --depth 8  --batch-size 100  --n-iters 10  --device cpu


echo "🔸 Batch size 10"
echo "naive FF (batch matmult)"
python main.py  --model ff_bmm  --input-width 8000  --hidden-width 4000  --output-width 8000  --depth 8  --batch-size 10  --n-iters 10  --device cpu

echo "FFF (batch matmult)"
python main.py  --model fff_bmm  --input-width 8000  --hidden-width 4000  --output-width 8000  --depth 8  --batch-size 10  --n-iters 10  --device cpu


echo "🔸 Batch size 1"

echo "naive FF (batch matmult)"
python main.py  --model ff_bmm  --input-width 8000  --hidden-width 4000  --output-width 8000  --depth 8  --batch-size 1  --n-iters 10  --device cpu

echo "FFF (batch matmult)"
python main.py  --model fff_bmm  --input-width 8000  --hidden-width 4000  --output-width 8000  --depth 8  --batch-size 1  --n-iters 10  --device cpu
> . run.sh 
🔸 Batch size 100
naive FF (batch matmult)
eager: 1.3852830000000003
compile: 1.366022000000001
(eval) compiled: 1.3960490000000003 ± 0.03737091447636828
~~~~~~~~~~
FFF (batch matmult)
eager: 0.05451000000000006
compile: 0.018572000000000255
(eval) compiled: 0.01893820000000006 ± 0.0015569136006856079
~~~~~~~~~~
🔸 Batch size 10
naive FF (batch matmult)
eager: 0.141181
compile: 0.1446900000000002
(eval) compiled: 0.1389437 ± 0.0026585709714055
~~~~~~~~~~
FFF (batch matmult)
eager: 0.005520000000000191
compile: 0.001954999999999707
(eval) compiled: 0.002634200000000009 ± 0.0015433838667031883
~~~~~~~~~~
🔸 Batch size 1
naive FF (batch matmult)
eager: 0.01369599999999993
compile: 0.01478299999999999
(eval) compiled: 0.014860099999999932 ± 0.0014923330358871411
~~~~~~~~~~
FFF (batch matmult)
eager: 0.0005589999999999762
compile: 0.0005690000000000417
(eval) compiled: 0.0003634999999999167 ± 7.71248987033366e-05
~~~~~~~~~~

Speedups for batch sizes 100, 10, 1:

In [1]: 1.3471607 / 0.019425799999999917, 0.14026139999999993 / 0.0023557000000000716, 0.014105299999999899 / 0.00033940000000001194
Out[1]: (69.34904611393127, 59.54128284586138, 41.5595167943412)

UltraFastBERT FFF Layer performs worse than MLP

I've been trying to reproduce your positive results for the FFF layer structure. To simplify the comparison I've been using CIFAR-10 as a proxy problem.

Over the past week I put together a training framework for CIFAR-10 with a baseline transformer model (vit_tiny with mlp_dim=256). I've then introduced a number of variants of the transformer model using your UltraFastBERT implementation of FFF, some tweaks to it, and a community version of FFF written from scratch. The results are here:

[results plot omitted]

Code for this experiment is here: https://github.com/catid/cifar10deepspeed

So far we have yet to see the FFF layer improve upon a small mlp_dim=16 FFN network. The conditional computation does not seem to improve the network's ability to generalize to the validation set.

I currently suspect that the UltraFastBERT result can be improved by replacing the FFF layers with an MLP layer with mlp_dim=16, which is obviously much smaller and easier to train/evaluate than an FFF layer.

What is the optimal FFF implementation in the codebase?

Am I correct that benchmark_cuda/fff_cuda on the main is currently the best performing implementation of FFF?

Edit: After looking at the code I'm guessing maybe that isn't the best full implementation, since it seems that fff_backward is unimplemented?

std::vector<torch::Tensor> fff_backward(
		torch::Tensor inputs
) {
	CHECK_INPUT(inputs);

	return { };
}

How to decode output

Hello,
This project seems very interesting. I have a question / suggestion for potential improvement:
The tokenizer has no decode() method (would it make sense to add one?). Could you explain how to get the output back into natural language?

output = model(**encoded_input)
text_output = ???

Thank you in advance for your reply!

Best wishes,
Eric
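
One possible way to turn the masked-LM output back into text, sketched here as a hypothetical continuation of the Quickstart snippet above. It assumes the model returns a standard Hugging Face MaskedLMOutput with a logits field and that the tokenizer defines a mask token and exposes decode() (if not, convert_ids_to_tokens() is an alternative); neither assumption is guaranteed by the repository.

import torch
import cramming  # as in the Quickstart, so transformers can resolve the architecture
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("pbelcak/UltraFastBERT-1x11-long")
model = AutoModelForMaskedLM.from_pretrained("pbelcak/UltraFastBERT-1x11-long")

text = f"Paris is the {tokenizer.mask_token} of France."  # assumes a mask token is defined
encoded_input = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**encoded_input).logits       # assumed shape: (1, seq_len, vocab_size)

predicted_ids = logits.argmax(dim=-1)[0]         # most likely token at every position
text_output = tokenizer.decode(predicted_ids, skip_special_tokens=True)
print(text_output)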

Left branches getting ignored in PyTorch implementation

Bojan and I dug through this work (convo in Yannic's Discord -> DailyPapers channel).

UltraFastBERT/benchmark_pytorch/fff/fff_bmm.py:

    # y = torch.einsum('b i j , b i -> b j', selected_w2s, F.gelu(all_logits))
    y = torch.einsum('b i j , b i -> b j', selected_w2s, all_scores)
    return y

I've removed the .gelu from this line. (Bojan switched to einsum also to improve clarity, but that's not relevant here).

If you're using .gelu then you're discarding information from all negative-scoring nodes, as you're multiplying their y-vector contribution by 0.

Here's a gist with an MNIST example: https://gist.github.com/p-i-/784ea313d21856c286b823f27bf79d90

If you put the .gelu back the accuracy will deteriorate.

Notice:

import torch
import torch.nn as nn  # FFF below is the layer class defined in the linked gist

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = FFF(nIn=28*28, nOut=500)
        self.fc2 = FFF(nIn=500, nOut=10)
        # self.fc1 = FFF(nIn=28*28, nOut=10)

    def forward(self, x):
        x = x.view(-1, 28*28)
        # ReLU between the two FFF layers supplies the nonlinearity
        y_hat = self.fc2(torch.relu(self.fc1(x)))
        # y_hat = self.fc1(x)
        return y_hat

So I'm introducing a nonlinearity in between two FFF layers.

I think there may be a cleaner way to conceptualize the result you have arrived at.

Each node has a node.x, which points in some direction in INPUT space and a node.y which points in some direction in OUTPUT space.

Each node[p][q].x represents the normal vector to a region-splitting hyperplane, and for input x, node[p][q].score = dot(x, node[p][q].x) projects the input onto this normal vector.

If it's positive, it's "sunny-side" of the hyper-plane and we branch up and right, else it's "darkside" and we branch up and left.

Either way we'll reach a new node with a fresh region-splitting hyperplane.

And so, once we've traversed the (depth D) tree (and I'm going to follow the authors in considering a solitary root-node as a tree of depth-0 (D=0)) we have split the input space into 2**D regions, of which we are inside one.

And this to me is the beautiful part.

If we consider the "winning" node sequence node_{1..D}, then node_k.x form a basis for a D-dimensional subspace within our INPUT space. e_1 ... e_D.

And node_k.y form a basis for a D-dimensional subspace within our OUTPUT space. f_1 ... f_D.

And our input x can be written as lambda_1 e_1 + ... + lambda_D e_D + remainderTerm, where lambda_i is just node_i.score

And we're projecting this to lambda_1 f_1 + ... + lambda_D f_D

So the FFF layer is figuring out a "most-useful--D-dimensional-transform" and applying it. It's lerping from a basis over INPUT space to a basis over OUTPUT space.

And this basis-pair depends on where our input x is located in INPUT space. There's 2**D possible basis-pairs.

And the backprop will move the bases around to optimize performance. So it will learn a "most-useful" mapping. It reminds me of LoRA in this way.
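
A tiny numeric illustration of this reading (purely illustrative; e and f stand in for the winning path's node.x and node.y vectors, and the numbers are random):

import torch

D = 3                      # tree depth, i.e. D nodes on the winning path in this notation
d_in, d_out = 8, 5
e = torch.randn(D, d_in)   # node_k.x: directions in INPUT space
f = torch.randn(D, d_out)  # node_k.y: directions in OUTPUT space
x = torch.randn(d_in)

scores = e @ x             # lambda_k = dot(x, e_k)
y = scores @ f             # lambda_1 f_1 + ... + lambda_D f_D
print(y.shape)             # torch.Size([5])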

There's room for exploring quite a few ideas from here. I've emailed one of the authors.

Not able to replicate the results

Hi,
we are trying to replicate the results presented in Table 1 of the paper. We leave everything as is and pretrain the model with the following command, which trains for 1 day on a single Nvidia A100:

python pretrain.py name=amp_b8192_cb_o4_final arch=crammed-bert-fff train=bert-o4 data=pile-readymade

Then we evaluate it with:

python eval.py eval=GLUE_sane name=amp_b8192_cb_o4_final eval.checkpoint=latest impl.microbatch_size=16 impl.shuffle_in_dataloader=True impl.compile_torch=False

We get an average GLUE score of 74%.

We also tried to train a transformer with the FFF layer on the TinyStories dataset. We take the original model and just replace the FF layers with FFF layers. We achieve the same perplexity as a model with the MLPs completely removed, which is worse than the perplexity achieved by the original model with FF layers.

Two FFF classes,

In the fastfeedforward project, the class FFF is defined in fastfeedforward/fastfeedforward/fff.py.
In the UltraFastBERT project, the class FFF is defined in UltraFastBERT/training/cramming/architectures/fff.py.

While both inherit from torch.nn.Module, they have two different implementations of the forward() function.
I assume the UltraFastBERT forward function is an implementation of Algorithm 1 in the paper "Exponentially Faster Language Modeling", but I don't understand the notation in Algorithm 1, so I cannot verify that the Python code corresponds to it.

From what I can understand, the forward method in UltraFastBERT would be eval_forward() if UltraFastBERT were used in the same context.

Please unify the architecture, or rename one class to FFFAlpha and the other to FFFBeta. It sounds dumb, but I'd appreciate it if you could concisely explain when each forward() should be used.
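
For contrast with the masked, training-style forward sketched in the Organisation section above, here is an illustrative hard-routing forward of the kind one would only use at inference time (this is not either repository's code; names and shapes are assumptions). Per sample, it touches only the depth+1 nodes on the chosen root-to-leaf path:

import torch
import torch.nn.functional as F

def fff_hard_forward(x, w1, w2, depth):
    # x: (batch, d_in); w1: (d_in, n_nodes); w2: (n_nodes, d_out)
    batch = x.shape[0]
    y = x.new_zeros(batch, w2.shape[1])
    current = torch.zeros(batch, dtype=torch.long)      # root node for every sample
    for _ in range(depth + 1):
        score = (x * w1[:, current].T).sum(dim=1)       # dot(x_b, w1[:, node_b])
        y += F.gelu(score).unsqueeze(1) * w2[current]   # accumulate this node's output
        current = 2 * current + 1 + (score > 0).long()  # heap-style child indexing
    return y

In these sketches the two variants produce the same forward output; the masked form simply computes every node so that it runs on standard dense kernels during training.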

LICENSE.MD renaming

When you try to install cramming by copying the training folder and then running pip install ., it results in:

Defaulting to user installation because normal site-packages is not writeable
Processing /home/s371513/ernie/berts/training
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... error
  error: subprocess-exited-with-error

  × Preparing metadata (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [7 lines of output]
      running dist_info
      creating /tmp/pip-modern-metadata-butidlxr/cramming.egg-info
      writing manifest file '/tmp/pip-modern-metadata-butidlxr/cramming.egg-info/SOURCES.txt'
      warning: no previously-included files matching '*.pyc' found anywhere in distribution
      warning: no previously-included files matching '__pycache__' found anywhere in distribution
      writing manifest file '/tmp/pip-modern-metadata-butidlxr/cramming.egg-info/SOURCES.txt'
      error: [Errno 2] No such file or directory: 'LICENSE.md'
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

As a proposed fix, just rename LICENSE.MD to LICENSE.md, since the include/exclude processing done via MANIFEST.in is case-sensitive.
