Kansformers: Transformers using KANs

Kolmogorov-Arnold Networks (KANs) are a proposed alternative architecture to MLPs. Despite having similar capabilities (and being representable with MLPs), KANs are claimed to offer faster training, less forgetting, and greater interpretability.

With some modifications, KAN layers can be swapped in for PyTorch nn.Linear layers. We start from efficient KAN and change its shape processing for better compatibility.

Because the two are interchangeable, we can take a Transformer architecture and replace its nn.Linear layers with KAN layers. We use minGPT as a basis and swap the layers. We then train on a sample corpus and evaluate.
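
As a rough illustration, the swap can be done by walking the module tree and replacing every nn.Linear with a KAN layer. The sketch below assumes a KANLinear class that takes (in_features, out_features) and accepts the same input shapes as nn.Linear, as in the modified efficient KAN; the import path is a placeholder.

import torch.nn as nn
from efficient_kan import KANLinear  # placeholder import; adjust to your KAN implementation

def swap_linear_for_kan(module: nn.Module) -> None:
    """Recursively replace every nn.Linear in `module` with a KANLinear of the same shape."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            # Note: bias handling and weight initialization are ignored in this sketch.
            setattr(module, name, KANLinear(child.in_features, child.out_features))
        else:
            swap_linear_for_kan(child)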

Running the model

train.ipynb demonstrates a sample run of the model; further details can be found in the minGPT repository.
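
For reference, a minimal training run in the style of minGPT's example projects might look like the sketch below. The dataset here is a random-token placeholder, not the corpus used in train.ipynb, and the config values are illustrative.

import torch
from torch.utils.data import Dataset
from mingpt.model import GPT
from mingpt.trainer import Trainer

class RandomTokens(Dataset):
    """Placeholder dataset of random token ids shaped like a language-modeling corpus."""
    def __init__(self, vocab_size=50257, block_size=128, n=1000):
        self.data = torch.randint(vocab_size, (n, block_size + 1))
    def __len__(self):
        return len(self.data)
    def __getitem__(self, i):
        chunk = self.data[i]
        return chunk[:-1], chunk[1:]   # (input ids, shifted targets)

model_config = GPT.get_default_config()
model_config.model_type = 'gpt-micro'
model_config.vocab_size = 50257        # GPT-2 BPE vocabulary
model_config.block_size = 128
model = GPT(model_config)

train_config = Trainer.get_default_config()
train_config.max_iters = 1000
trainer = Trainer(train_config, model, RandomTokens())
trainer.run()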

Checkpoints

Weights for several checkpoints can be found here: link

The model-5-5-2024.pt checkpoint uses: n_layer=2, n_head=2, n_embd=128, C.model.block_size = 128

The model-5-7-2024-1.pt checkpoint uses: model_type = 'gpt-micro', C.model.block_size = 128

The model-5-7-2024-2.pt checkpoint uses: model_type = 'gpt-micro', C.model.block_size = 256

  • This model was trained on OpenWebText
  • Training used ~37 GB of VRAM
  • The notebook at this commit is wrong; the one actually used did not get saved
    • The weights themselves are fine

The model-5-7-2024-3.pt checkpoint uses: model_type = 'gpt-nano', C.model.block_size = 128

The model-5-7-2024-4.pt checkpoint uses: model_type = 'gpt-nano', C.model.block_size = 256

The model-5-7-2024-5.pt checkpoint uses: model_type = 'gpt-mini', C.model.block_size = 128
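
A checkpoint can be restored by rebuilding the model with the same config values listed above and loading the state dict. The sketch below uses model-5-7-2024-2.pt and assumes the GPT-2 BPE vocabulary size.

import torch
from mingpt.model import GPT

model_config = GPT.get_default_config()
model_config.model_type = 'gpt-micro'
model_config.vocab_size = 50257            # assumed GPT-2 tokenizer vocabulary
model_config.block_size = 256
model = GPT(model_config)
model.load_state_dict(torch.load('model-5-7-2024-2.pt', map_location='cpu'))
model.eval()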

Observations

Training RAM

  • Doubling the block size (context window) doubled the amount of RAM required to train

Block/Model Size vs. Loss/Generations

  • Both the 128 and 256 block sizes for gpt-micro end training at similar losses in the low 6.xxxx range, although the 256 block size does slightly better, reaching the high 5.xxxx range
    • Could indicate that model size makes more of a difference than block size
    • Outputs were significantly better for the 256 block size
  • Training a smaller model, gpt-nano, with block size 128 gives a loss in the mid 6.xxxx range and very poor outputs
    • Training it with block size 256 gives the same loss and output quality
    • Could indicate that block size and model size must scale up together to see a noticeable difference
  • With temperature below 1, generated text fills with special characters; with temperature above 1, it becomes gibberish
    • Could be an issue with model size, or temperature may not be having the desired effect
  • For gpt-mini with block size 128, minimal testing indicates that a temperature of about 0.75 produces the best outputs (a temperature-sweep sketch follows this list)
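
The temperature observations above can be reproduced with minGPT's generate() on a restored model (see the loading sketch in the Checkpoints section); the prompt and temperature grid here are arbitrary.

import torch
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained('gpt2')
prompt_ids = tokenizer.encode("Yesterday", return_tensors='pt')

model.eval()
with torch.no_grad():
    for temperature in (0.5, 0.75, 1.0, 1.25):
        out = model.generate(prompt_ids, max_new_tokens=50,
                             temperature=temperature, do_sample=True)
        print(f"temperature={temperature}: {tokenizer.decode(out[0])}")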

Notes

  • I trained the model on a single L4 GPU with high RAM in Google Colab
    • Due to this compute constraint, I had to train a tiny version of the model
  • Checkpoints model-5-7-2024-2.pt, model-5-7-2024-4.pt, and model-5-7-2024-5.pt were trained on a single high-RAM A100 in Google Colab
  • Efficient KAN is used as it is currently the strongest implementation of KAN: benchmarks
    • I had initially planned to use a C/C++ implementation of KAN to improve speed, but benchmarks show the current implementation is acceptable
    • I am not sure if there is any benchmark of model memory footprint (as opposed to forward/backward-pass memory) across the implementations, but I assume efficient KAN will still be the best
  • Early stopping was used due to its proven effectiveness in LLMs: paper
  • Randomization matters: the dataset uses sampling and no seed is set (a seed-setting sketch follows this list)
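
If reproducible runs are wanted, a fixed seed can be set before sampling the dataset and training. This is only a sketch of standard PyTorch seeding; the released checkpoints were not trained this way.

import random
import numpy as np
import torch

def set_seed(seed: int = 3407) -> None:
    """Seed the Python, NumPy, and PyTorch RNGs (CPU and all GPUs)."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)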

Future Work

  • Increase compute
    • Train the model at larger scales to look for better performance
      • Preferably GPT-2 scale: model_type = 'gpt2', C.model.block_size = 1024
    • Evaluate larger models on benchmarks
    • Observe scaling laws, training times, loss patterns, emergent capabilities
  • Look at other transformer models (BERT, BART, DistilBERT, RoBERTa)
    • With the modified efficient KAN, it should be easy to swap out nn.Linear layers for KAN layers
      • Might be able to find a systematic way to do this
  • Maybe control sampling/randomness with a fixed seed for reproducibility

Issues

Is there a way to reduce the number of parameters?

After replacing the linear layers in the Transformer with single-layer KANs and the feed-forward network with a two-layer KAN, the number of parameters increased by roughly ten times. Are there any methods to reduce the number of parameters?

Why is the inference so slow?

I used the code below, but I don't know if I'm using my GPU well.

import torch
from transformers import GPT2TokenizerFast
from mingpt.model import GPT  # model class from the kansformers/minGPT code

# C.model (the model config) and model_path are assumed to be defined earlier
device = torch.device('cpu')

model = GPT(C.model)
model.load_state_dict(torch.load(model_path))
model.to(device)

tokenizer = GPT2TokenizerFast.from_pretrained('gpt2')
model.eval()
input_phrase = "Yesterday"
input_ids = tokenizer.encode(input_phrase, return_tensors='pt')

with torch.no_grad():
    output = model.generate(input_ids, max_new_tokens=10, temperature=0.75, do_sample=True)

generated_text = tokenizer.decode(output[0])
print(f"Generated text: {generated_text}")

https://colab.research.google.com/drive/1I5n0SDrggPA8AnpucHuiI3jAEAQ4_KBh#scrollTo=XXwwVMgdW9-2
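
A possible explanation: device is set to torch.device('cpu') in the snippet above, so the model never runs on the GPU. A sketch of GPU placement, assuming the model and tokenizer from that snippet:

import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
model.eval()

input_ids = tokenizer.encode("Yesterday", return_tensors='pt').to(device)
with torch.no_grad():
    output = model.generate(input_ids, max_new_tokens=10, temperature=0.75, do_sample=True)
print(tokenizer.decode(output[0]))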
