
amber-train's Introduction

Amber: the first model of LLM360


🤗 [Amber Download] • 🤗 [AmberChat Download] • 📈 [Analysis and Results] • 📗 Pretraining Dataset

About LLM360

LLM360 is an initiative for comprehensive and fully open-sourced LLMs, where all training details, model checkpoints, intermediate results, and additional analyses are made available to the community. Our goal is to advance the field by inviting the community to deepen the understanding of LLMs together. As the first step of the LLM360 project, we release all intermediate model checkpoints, our fully prepared pre-training dataset, all source code and configurations, and full training details. We are committed to continually pushing the boundaries of LLMs through this open-source effort.

Get access now at LLM360 site

Model Description

Amber is the first model in the LLM360 family. It is a 7B-parameter English language model based on the LLaMA architecture.

Loading Amber

from transformers import LlamaTokenizer, LlamaForCausalLM

# Load a specific intermediate checkpoint via the Hugging Face "revision" argument.
tokenizer = LlamaTokenizer.from_pretrained("LLM360/Amber", revision="ckpt_356")
model = LlamaForCausalLM.from_pretrained("LLM360/Amber", revision="ckpt_356")

# Tokenize a prompt and generate a continuation.
input_text = "translate English to German: How old are you?"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))
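
If the default generation settings are too restrictive, generation parameters can be passed explicitly. The values below are illustrative, not settings published by the LLM360 team:

# Optional: sample a longer continuation. Parameter values are illustrative only.
outputs = model.generate(
    input_ids,
    max_new_tokens=64,   # upper bound on newly generated tokens
    do_sample=True,      # nucleus sampling instead of greedy decoding
    top_p=0.9,
    temperature=0.7,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))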

Amber Training Details

DataMix

| Subset        | Tokens (billion) |
|---------------|------------------|
| Arxiv         | 30.00            |
| Book          | 28.86            |
| C4            | 197.67           |
| Refined-Web   | 665.01           |
| StarCoder     | 291.92           |
| StackExchange | 21.75            |
| Wikipedia     | 23.90            |
| Total         | 1259.13          |
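
For reference, the share of each subset in the mix follows directly from these token counts. A quick sketch (the numbers are simply copied from the table above):

# Token counts (in billions of tokens) from the DataMix table above.
data_mix = {
    "Arxiv": 30.00,
    "Book": 28.86,
    "C4": 197.67,
    "Refined-Web": 665.01,
    "StarCoder": 291.92,
    "StackExchange": 21.75,
    "Wikipedia": 23.90,
}

total = sum(data_mix.values())                    # roughly 1.26T tokens in total
for subset, tokens in data_mix.items():
    print(f"{subset:>13}: {tokens / total:6.2%}")  # share of the training mix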

Hyperparameters

| Hyperparameter            | Value |
|---------------------------|-------|
| Total Parameters          | 6.7B  |
| Hidden Size               | 4096  |
| Intermediate Size (MLPs)  | 11008 |
| Number of Attention Heads | 32    |
| Number of Hidden Layers   | 32    |
| RMSNorm ε                 | 1e-6  |
| Max Seq Length            | 2048  |
| Vocab Size                | 32000 |
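
For orientation, these values map onto a Hugging Face LlamaConfig as sketched below. This is not needed to load the released checkpoints (they ship their own config.json); it only spells out the architecture:

from transformers import LlamaConfig

# Approximate Amber architecture expressed as a LlamaConfig
# (a sketch for orientation; LLM360/Amber ships its own config.json).
config = LlamaConfig(
    hidden_size=4096,
    intermediate_size=11008,
    num_attention_heads=32,
    num_hidden_layers=32,
    rms_norm_eps=1e-6,
    max_position_embeddings=2048,
    vocab_size=32000,
)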

Training Loss

(Figure: training loss curve.)

Evaluation

(Figures: benchmark results on ARC, HellaSwag, MMLU, and TruthfulQA.)

Citation

BibTeX:

@misc{liu2023llm360,
      title={LLM360: Towards Fully Transparent Open-Source LLMs}, 
      author={Zhengzhong Liu and Aurick Qiao and Willie Neiswanger and Hongyi Wang and Bowen Tan and Tianhua Tao and Junbo Li and Yuqi Wang and Suqi Sun and Omkar Pangarkar and Richard Fan and Yi Gu and Victor Miller and Yonghao Zhuang and Guowei He and Haonan Li and Fajri Koto and Liping Tang and Nikhil Ranjan and Zhiqiang Shen and Xuguang Ren and Roberto Iriondo and Cun Mu and Zhiting Hu and Mark Schulze and Preslav Nakov and Tim Baldwin and Eric P. Xing},
      year={2023},
      eprint={2312.06550},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}


amber-train's Issues

Aligning logits with labels through two shifts?

During data preparation in main.py:

def collate_fn(examples, device):
    token_ids = torch.tensor(
        [example['token_ids'] for example in examples], device=device)
    return {'input_ids': token_ids[:, :-1], 'labels': token_ids[:, 1:]}

def train_chunk(.......):
    ..........
    batch = collate_fn(
        examples=examples[i:i+per_device_batch_size], device=fabric.device)
    input_ids, labels = batch['input_ids'], batch['labels']

During the loss computation in modeling_llama.py:

class LlamaForCausalLM(LlamaPreTrainedModel):
....................
        if labels is not None:
            # Shift so that tokens < n predict n
            shift_logits = logits[..., :-1, :].contiguous()
            shift_labels = labels[..., 1:].contiguous()

Why is the alignment between predictions and targets already shifted once when the data samples are prepared as model input, and then shifted again when the loss is computed inside the model?
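
A toy sketch of the indexing in question (a minimal illustration with made-up token ids, not code from this repository):

import torch

# Toy example: 8 token ids for a single sequence.
token_ids = torch.arange(8).unsqueeze(0)   # [[0, 1, 2, 3, 4, 5, 6, 7]]

# Shift 1: collate_fn
input_ids = token_ids[:, :-1]              # [[0, 1, 2, 3, 4, 5, 6]]
labels    = token_ids[:, 1:]               # [[1, 2, 3, 4, 5, 6, 7]]

# Shift 2: loss computation (logits[t] is the prediction made after input_ids[t])
# shift_logits = logits[..., :-1, :]  -> predictions at positions 0..5
shift_labels = labels[..., 1:]             # [[2, 3, 4, 5, 6, 7]]

# After both shifts, the prediction at position t is compared with token t+2,
# i.e. two steps ahead instead of one.
print(shift_labels[0].tolist())            # [2, 3, 4, 5, 6, 7]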

Please share requirements of the project

Thanks for this great work and open sourcing it.

I'm trying to execute the training code, but I failed because the requirements are not listed. Could you share the requirements for the project, i.e. the Python package versions, CUDA version, etc.?

Thanks,

Question about pretraining.

I attached my training loss below. The data we are using follows the LLM360 paper, except that we use less StarCoder data.
For each training epoch our data contains 30B Arxiv, 57B Book, 197.67B C4, 665.01B Refined-Web, 150B StarCoder, 21.75B StackExchange, and 23.90B Wikipedia tokens.
The hyperparameters we are using are the same as LLM360 reported, except that max_seq_len is 4096 instead of 2048 and the tokenizer is a GPT tokenizer.
We are running the experiment with an open-source repo on H100 nodes with a global batch size of 2048.
Currently our model only achieves around 10.5 PPL on the Falcon dataset, which is much worse than the LLM360 Amber model (around 8 PPL) and LLaMA-2 (around 8 PPL).
Just wondering what could be the possible reasons our model performs so much worse?
(Training loss curve attached as an image.)
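
For reference, perplexity here is the exponential of the mean next-token cross-entropy. A minimal sketch of how it can be computed with a Hugging Face causal LM (the evaluation texts, device, and batching are placeholders):

import math
import torch

# Minimal perplexity sketch (illustrative only; eval data and batching are placeholders).
@torch.no_grad()
def perplexity(model, tokenizer, texts, max_length=2048, device="cuda"):
    model.eval()
    total_loss, total_tokens = 0.0, 0
    for text in texts:
        enc = tokenizer(text, return_tensors="pt",
                        truncation=True, max_length=max_length).to(device)
        # With labels == input_ids, the model returns the mean next-token
        # cross-entropy over the sequence (it shifts internally).
        out = model(input_ids=enc.input_ids, labels=enc.input_ids)
        n = enc.input_ids.numel() - 1          # number of predicted tokens
        total_loss += out.loss.item() * n
        total_tokens += n
    return math.exp(total_loss / total_tokens)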
