salesforce / codegen

CodeGen is a family of open-source models for program synthesis. Trained on TPU-v4. Competitive with OpenAI Codex.

License: Apache License 2.0

Python 100.00%
programsynthesis generativemodel codex languagemodel llm tpu-acceleration

codegen's Introduction

CodeGen

Official release for the CodeGen1 and CodeGen2 models (350M, 1B, 3B, 7B, 16B) for Program Synthesis by Salesforce AI Research.

News

July 2023

CodeGen2.5 released, outperforming 16B-parameter models with only 7B parameters.

May 2023

CodeGen2.0 released, with strong infill sampling capability.

March 2022

CodeGen1.0 released, on par with OpenAI Codex at the time.

Publications

CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis
Erik Nijkamp*, Bo Pang*, Hiroaki Hayashi*, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong
ICLR, 2023

CodeGen2: Lessons for Training LLMs on Programming and Natural Languages
Erik Nijkamp*, Hiroaki Hayashi*, Caiming Xiong, Silvio Savarese, and Yingbo Zhou
ICLR, 2023

Usage

The models are available on the Hugging Face Hub.

CodeGen1.0

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-2B-mono")
model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen-2B-mono")
inputs = tokenizer("# this function prints hello world", return_tensors="pt")
sample = model.generate(**inputs, max_length=128)
print(tokenizer.decode(sample[0], truncate_before_pattern=[r"\n\n^#", "^'''", "\n\n\n"]))

CodeGen2.0

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen2-7B")
model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen2-7B", trust_remote_code=True, revision="main")
inputs = tokenizer("# this function prints hello world", return_tensors="pt")
sample = model.generate(**inputs, max_length=128)
print(tokenizer.decode(sample[0], truncate_before_pattern=[r"\n\n^#", "^'''", "\n\n\n"]))

CodeGen2.5

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen25-7b-mono", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen25-7b-mono")
inputs = tokenizer("# this function prints hello world", return_tensors="pt")
sample = model.generate(**inputs, max_length=128)
print(tokenizer.decode(sample[0]))

Training

The Jaxformer library for data pre-processing, training and fine-tuning the CodeGen models can be found here:

https://github.com/salesforce/jaxformer

Citation

If you find our code or paper useful, please cite the paper:

@article{nijkamp2022codegen,
  title={CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis},
  author={Nijkamp, Erik and Pang, Bo and Hayashi, Hiroaki and Tu, Lifu and Wang, Huan and Zhou, Yingbo and Savarese, Silvio and Xiong, Caiming},
  journal={ICLR},
  year={2023}
}

@article{nijkamp2023codegen2,
  title={CodeGen2: Lessons for Training LLMs on Programming and Natural Languages},
  author={Nijkamp, Erik and Hayashi, Hiroaki and Xiong, Caiming and Savarese, Silvio and Zhou, Yingbo},
  journal={ICLR},
  year={2023}
}

codegen's People

Contributors

bpucla, dependabot[bot], eltociear, enijkamp, jimjag, jsoref, lowinli, rooa, vlomshakov, wilcoxst

codegen's Issues

Inference on TPUs

Is it possible to set up CodeGen for inference on TPUs?

According to Hugging Face, PyTorch/XLA generation is not yet supported: huggingface/transformers#12322

Is there another way to use this model on TPUs for inference?

Thanks,
Arya

Will MTPB dataset be open sourced?

Hi there, thank you so much for providing the model and weight of such a nice project!

I wonder if you are planning to open source the MTPB dataset mentioned in the paper and the corresponding generation method? Thank you :)

add web demo/model to Huggingface

Hi, would you be interested in adding CodeGen to Hugging Face? The Hub offers free hosting, and it would make your work more accessible and visible to the rest of the ML community. There is already a Salesforce organization on Huggingface: https://huggingface.co/Salesforce to add models/datasets/spaces(web demos) to.

Example from other organizations:
Keras: https://huggingface.co/keras-io
Microsoft: https://huggingface.co/microsoft
Facebook: https://huggingface.co/facebook

Example spaces with repos:
github: https://github.com/salesforce/BLIP
Spaces: https://huggingface.co/spaces/salesforce/BLIP

github: https://github.com/facebookresearch/omnivore
Spaces: https://huggingface.co/spaces/akhaliq/omnivore

and here are guides for adding spaces/models/datasets to your org

How to add a Space: https://huggingface.co/blog/gradio-spaces
how to add models: https://huggingface.co/docs/hub/adding-a-model
uploading a dataset: https://huggingface.co/docs/datasets/upload_dataset.html

Please let us know if you would be interested and if you have any questions, we can also help with the technical implementation.

Model-level Parallelism

Hi, thanks for releasing these models! It's great to see more open source LLMs, especially for source code. I wanted to sample from the 16B parameter model, but unfortunately the weights do not fit in the memory of a single 48GB GPU. Could you comment on whether the weights can be distributed across several GPUs at inference time? I imagine that would be valuable for facilitating the use of the largest models, given that few GPUs offer more than 48GB of memory. I believe the code below does something like this on the training side; it would be great if you could offer some implementation pointers for making this available for sampling.

def parallelize(self, device_map=None):
    # Check validity of device_map
    self.device_map = (
        get_device_map(len(self.h), range(torch.cuda.device_count())) if device_map is None else device_map
    )
    assert_device_map(self.device_map, len(self.h))
    self.model_parallel = True
    self.first_device = "cpu" if "cpu" in self.device_map.keys() else "cuda:" + str(min(self.device_map.keys()))
    self.last_device = "cuda:" + str(max(self.device_map.keys()))
    self.wte = self.wte.to(self.first_device)
    # Load onto devices
    for k, v in self.device_map.items():
        for block in v:
            cuda_device = "cuda:" + str(k)
            self.h[block] = self.h[block].to(cuda_device)
    # ln_f to last
    self.ln_f = self.ln_f.to(self.last_device)
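
The quoted method relies on a device map that assigns each device index a list of block indices. As a rough illustration only, such a map could be built by splitting the blocks into contiguous, roughly equal chunks. The helper name `even_device_map` is hypothetical and is not taken from the CodeGen repository.

```python
from math import ceil


def even_device_map(n_layers, devices):
    """Assign contiguous chunks of layer indices to each device.

    Hypothetical helper for illustration: earlier devices get full chunks,
    later devices may receive fewer (or no) layers.
    """
    devices = list(devices)
    if not devices:
        raise ValueError("at least one device is required")
    chunk = ceil(n_layers / len(devices))
    layers = list(range(n_layers))
    return {
        device: layers[i * chunk:(i + 1) * chunk]
        for i, device in enumerate(devices)
    }


if __name__ == "__main__":
    # Example: spread 6 blocks over 2 devices.
    print(even_device_map(6, range(2)))
```

A map of this shape could then be passed as `device_map` to a method like the one quoted above, assuming the model class exposes one.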

How to evaluate on Human-eval

I would like to evaluate the Codegen model on human-eval dataset. But I don't know how to generate 200 samples for each problem to calculate pass@k. Can you provide any documentation in this regard?
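
For reference, pass@k is commonly computed with the unbiased estimator 1 - C(n-c, k) / C(n, k), where n is the number of samples generated per problem and c is the number that pass the tests. The sketch below assumes that estimator; the function names are illustrative and not part of this repository.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed, k: budget being evaluated.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k):
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With 200 samples per problem, `results` would hold one `(200, c)` pair per HumanEval problem, and `mean_pass_at_k(results, 1)`, `mean_pass_at_k(results, 10)` and `mean_pass_at_k(results, 100)` give the three usual scores.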

An attempt to write a fine-tuning script for CodeGen

Hi,
I am interested in writing a fine-tuning script for this model.

Can anyone tell me in what format I should provide the target output to the model during training?

The data has the form
{"code": "def hello(name): return f'Hello {name}'", "nl": "This function takes a name as input and returns a message saying hello to that person, in the format 'Hello name'."}

We can tokenize the input (either code or text) by creating the tokenizer and passing the text into it.

But what format should I use for training, and how do I compute the loss between the original output and the predicted output?

Thanks
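
One common convention for causal language models, sketched here under assumptions rather than taken from the CodeGen training code, is to concatenate the prompt tokens and the target tokens into one sequence and use that same sequence as the labels. Prompt positions are usually set to -100, the default ignore index of PyTorch's cross-entropy loss, so that only the target tokens contribute to the loss. `build_example` is a hypothetical helper.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def build_example(prompt_ids, target_ids, eos_token_id, mask_prompt=True):
    """Build one training example for a causal LM.

    input_ids = prompt + target + EOS; labels mirror input_ids, with the
    prompt positions optionally masked so they do not contribute to the loss.
    """
    input_ids = list(prompt_ids) + list(target_ids) + [eos_token_id]
    labels = list(input_ids)
    if mask_prompt:
        labels[:len(prompt_ids)] = [IGNORE_INDEX] * len(prompt_ids)
    return {"input_ids": input_ids, "labels": labels}


if __name__ == "__main__":
    # Token ids here are made up for illustration.
    print(build_example([11, 12], [21, 22, 23], eos_token_id=50256))
```

Here the natural-language description would be tokenized into `prompt_ids` and the code into `target_ids`. Hugging Face causal LM heads generally shift the labels by one position internally, so the labels stay aligned with `input_ids`.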

Are pre-trained models also licensed under BSD?

Hello,

I'd like to know if the BSD-3 license also applies to the pre-trained models. In particular can the pre-trained models be re-used commercially?

In any case thank you for a great paper and sharing the code and models!

Generated outputs are often ended with '#'

Hi,

When we use the sampling code jaxformer/hf/sample.py, we notice that a lot of generated outputs ended with '#'. Is this the expected behavior? Could you help us figure out why?

For example, in the official Colab, the output of the last cell before truncation is

sampling
====================================================================================================

    print("Hello World")

hello_world()

#
====================================================================================================

which is also the case.

How to get embedding for javascript and python code snippet?

I have a couple of questions:

a) How can I use CodeGen to extract embeddings for JavaScript and Python code?
b) Can I feed incomplete JavaScript and Python code snippets to extract embeddings, or do the snippets need to be complete?
c) Has anyone used CodeGen to perform code-to-code search?

How to run Codegen locally?

Can I try out Codegen locally for code generation as an alternative to CoPilot?

Is there a way to try the model locally and query the model with prompts?

How to predict using the model?

I am running the code on Kaggle, but once the cells finish running, nothing happens. Am I not supposed to be able to predict PL by giving NL? How do I go about it?

inconsistent `eos_token_id` and `pad_token_id` in model & tokenizer config

Hi,

Based on the paper, CodeGen follows the GPT-2 tokenizer and training scheme, i.e. bos_token, eos_token, and pad_token are all "<|endoftext|>". However, it seems the HF model config includes incorrect bos_token_id and pad_token_id values (eos_token_id was fixed by #32).

way to reproduce the issue & expected behavior

from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-350M-mono")
model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen-350M-mono")

assert model.config.eos_token_id == tokenizer.eos_token_id # pass (50256 == 50256)
assert model.config.bos_token_id == tokenizer.bos_token_id # failed (1 != 50256)
assert model.config.pad_token_id == tokenizer.pad_token_id # failed (None != 50256)

Problems With HumanEval

Thanks for the work. CodeGen has achieved relatively good results on HumanEval, but I would like to know how the model was fine-tuned. Since HumanEval only has 164 problems, can you provide the parameters used for fine-tuning?

ValueError: Tokenizer class CodeGenTokenizer does not exist or is not currently imported.

https://huggingface.co/Salesforce/codegen-2B-mono
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-2B-mono")
model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen-2B-mono")

text = "def hello_world():"
input_ids = tokenizer(text, return_tensors="pt").input_ids

generated_ids = model.generate(input_ids, max_length=128)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))

ValueError: Tokenizer class CodeGenTokenizer does not exist or is not currently imported.

Reproduce HumanEval Results

Hi, I was trying to reproduce the HumanEval results of CodeGen-16B-Mono. My pass@1 results were significantly worse than that in the paper.

Here are my current results.
temperature 0.2: {'pass@1': 0.15926829268292686, 'pass@10': 0.45424462172394614, 'pass@100': 0.7596262109932597}
temperature 0.6: {'pass@1': 0.1631707317073171, 'pass@10': 0.45846712024390285, 'pass@100': 0.7349539270978501}
temperature 0.8: {'pass@1': 0.16115853658536589, 'pass@10': 0.4522397258878905, 'pass@100': 0.7548574227789276}

I used the checkpoint from https://huggingface.co/Salesforce/codegen-16B-mono and generated 200 completions for each HumanEval problem. The evaluation was run on https://github.com/openai/human-eval.

Here is a code snippet of how I prompted the model.

inputs = tokenizer(problem["prompt"], return_tensors="pt")
canonical_solution = tokenizer(problem["canonical_solution"]).input_ids
input_ids_len = inputs.input_ids.shape[1]
output = model.generate(
    **inputs,
    do_sample=True,
    temp=problem["temp"],
    top_p=0.95,
    max_length=input_ids_len + max(128, len(canonical_solution) + 64),
    pad_token_id=tokenizer.eos_token_id,
)

Could you please give me some guidance to reproduce the paper results?

Thank you!
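
One detail worth checking in the snippet above: in Hugging Face Transformers the sampling temperature is passed to `generate` as `temperature`. Depending on the library version, an unrecognized keyword such as `temp` may be rejected or may have no effect on sampling. The near-identical scores across the three temperatures would be consistent with the temperature not taking effect, although that is only a guess. The sketch below is an assumption-laden illustration, not the authors' evaluation script. `sample_completions` is a hypothetical helper that takes an already loaded model and tokenizer.

```python
def sample_completions(model, tokenizer, prompt, temperature, num_samples,
                       top_p=0.95, max_new_tokens=128):
    """Draw several sampled completions for one prompt.

    `model` and `tokenizer` are expected to behave like Hugging Face
    causal LM and tokenizer objects (assumption).
    """
    inputs = tokenizer(prompt, return_tensors="pt")
    prompt_len = inputs["input_ids"].shape[1]
    outputs = model.generate(
        **inputs,
        do_sample=True,
        temperature=temperature,  # note: `temperature`, not `temp`
        top_p=top_p,
        max_new_tokens=max_new_tokens,
        num_return_sequences=num_samples,
        pad_token_id=tokenizer.eos_token_id,
    )
    # Strip the prompt tokens so only the generated continuation is decoded.
    return [
        tokenizer.decode(sequence[prompt_len:], skip_special_tokens=True)
        for sequence in outputs
    ]
```

For 200 completions per problem, the call can be repeated in smaller batches (for example `num_samples=10`, twenty times) to stay within GPU memory.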

Why tokenizer.pad_token == args.pad (i.e., 50256)??

Hi,

For my project, I'm trying to fine-tune CodeGen models on my dataset and evaluate the resulting fine-tuned model on the HumanEval benchmark dataset. I have a few questions that I would appreciate if you could address.

  1. First, why in the sampling code, at line 234, we have tokenizer.pad_token == args.pad, which is 50256. Shouldn't we set the pad_token to eos_token, not 50256 (which is the eos_token_id)? I'm confused by this. At line 240, you set the parameter pad_token_id=args.pad. So in your sampling code, both pad_token and pad_token_id are set to 50256. Can you please elaborate on this? That would be super helpful.

  2. As a baseline, I need to replicate your single-turn HumanEval benchmark results, but unfortunately, I'm getting surprisingly lower results compared to what is reported in the paper. And, I'm 99% positive that I'm probably missing a point. To produce Table 1 results in the paper, did you use the exact same sampling procedure as sample.py?

Thanks a lot for your time.

setting `max-length` to larger value doesn't affect the output length

Hi, I'm trying out the sampling code. I set max-length to a larger value because I'd like it to output a longer sequence, but I still got a very short one, as follows.

~/CodeGen$ python3 -m jaxformer.hf.sample --model codegen-350M-mono --context "def hello_world():" --max-length 1024
loading parameters
loading parameters took 14.13s
loading tokenizer
loading tokenizer took 7.02s
sampling
====================================================================================================

    print("Hello World")

hello_world()

#
====================================================================================================
def hello_world():
    print("Hello World")

hello_world()


====================================================================================================
sampling took 0.47s
done.

Is this expected behavior? Thanks!

Mismatch in attention weights for causal masked tokens vs attention masked tokens

Attention scores for tokens masked out via attention_mask get a value of -1e4, as per https://github.com/salesforce/CodeGen/blob/main/jaxformer/hf/codegen/modeling_codegen.py#L439, whereas attention scores masked out via causal_mask get a value of -1*1e9. As a result, the pre-softmax attention scores differ between causally masked tokens and padded tokens. This causes inference outputs for individual sequences to differ from outputs for the same sequences with padding.

Is there a way to prevent reloading parameters?

Hello! I have been using CodeGen to generate lately, but I found that most of the time is spent loading parameters.
I tried to separate out the create_model part to avoid reloading, but a CUDA out-of-memory error occurred.
Is there a way to avoid reloading the parameters?
Thanks!

How did you train the large-sized models without out-of-memory?

I would like to fine-tune the 2B model, but I run out of memory even with the batch size set to 1 (on a single GPU with 24 GB of memory).

What devices did you use to pre-train the 2B and 16B models? How did you address the memory issue? Did you parallelize the model by layers across different GPUs? Thank you.

Nan

Proper way to prompt for code generation

Hello, thanks for your work.

Is there a proper way to prompt for code generation?

For example, to generate code to answer: "Create a function called num_in_str() to check whether a string contains a number."

Currently, when I pass that into the context (for the 350M and 2B models), the output is only # and it stops there.

Thank you!

Problems with torch

When trying to install CodeGen locally I run "pip3 install -r requirements.txt" and get the following error:
Looking in links: https://download.pytorch.org/whl/torch_stable.html
ERROR: Could not find a version that satisfies the requirement torch==1.9.0+cu111 (from versions: 1.11.0, 1.11.0+cpu, 1.11.0+cu113, 1.11.0+cu115, 1.12.0, 1.12.0+cpu, 1.12.0+cu113, 1.12.0+cu116, 1.12.1, 1.12.1+cpu, 1.12.1+cu113, 1.12.1+cu116, 1.13.0, 1.13.0+cpu, 1.13.0+cu116, 1.13.0+cu117)
ERROR: No matching distribution found for torch==1.9.0+cu111

I have Windows 10 and an Intel(R) UHD Graphics GPU, and I ran "pip3 install --upgrade pip setuptools" successfully. Any ideas what I could do?

BigQuery dataset

Hi, first of all, great work!

Is there any chance you could provide more details on the BigQuery dataset / subset? Perhaps a list of the repositories used?
It would be great to have in order to avoid data leakage in experiments.

Cheers

Confusion about inconsistent `eos_token_id`

Hi,

Thanks for releasing this model! It seems that the eos_token_id is inconsistent between the model config and the pretrained tokenizer. The model config gives 2 as eos_token_id, yet the tokenizer's eos_token_id is 50256. I'm wondering which one is correct.

The provided sample.py example also uses 50256 as pad_token_id. pad_token_id should be the same as eos_token_id for GPT, right? So should the correct eos_token_id be 50256?

If we didn't specify eos_token_id as input argument to .generate function in sample.py (but we specify pad_token_id to 50256), it will automatically get eos_token_id from self.config.eos_token_id which is 2 -- is this an issue?

Code to reproduce:

from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-350M-mono")
model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen-350M-mono")

print(model.config.eos_token_id) # 2
print(tokenizer.eos_token_id) # 50256

Documentation output

Thank you so much for sharing your work including weights and making it easy to run!

What does the text output of the model mean? I'm getting the completion of my prompt in the first box and then another completion in the second box. Is the second box just the truncated completion of the first box including the prompt?

Example:

(venv) ➜  CodeGen git:(main) ✗ python3 -m jaxformer.hf.sample --model codegen-2B-mono --context "def print_in_quotes(text):" --device cpu
loading parameters
loading parameters took 16.33s
loading tokenizer
loading tokenizer took 3.43s
sampling
====================================================================================================

    print('"{}"'.format(text))


def print_in_quotes_with_quotes(text):
    print("\"{}\"".format(text))


def print_in_quotes_with_quotes_and_spaces(text):
    print("\"{}\"".format(text))


def print_in_quotes_with_quotes_and_spaces_and_new_lines(text):
    print("\"{}\"".format(text))


def print_in_quotes_
====================================================================================================
def print_in_quotes(text):
    print('"{}"'.format(text))
====================================================================================================
sampling took 78.96s
done.
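
The log above looks consistent with the first box being the raw completion and the second box being the prompt followed by a truncated completion. As a simplified sketch of how such truncation could work, the function below cuts the completion at the earliest match of any stop pattern. It reuses the patterns from the README usage example. `truncate_before` is a hypothetical name, and the real sampling script may apply additional rules.

```python
import re


def truncate_before(completion, patterns):
    """Cut the completion at the earliest match of any stop pattern."""
    cut = len(completion)
    for pattern in patterns:
        match = re.search(pattern, completion, flags=re.MULTILINE)
        if match:
            cut = min(cut, match.start())
    return completion[:cut]


# Stop patterns copied from the README usage example.
STOP_PATTERNS = [r"\n\n^#", "^'''", "\n\n\n"]

if __name__ == "__main__":
    raw = '\n    print("Hello World")\n\nhello_world()\n\n#'
    print(repr(truncate_before(raw, STOP_PATTERNS)))
```

Under this reading, a trailing `#` in the raw output is simply the point where a stop pattern matched, and it is dropped from the truncated version.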

How do I train this on custom data?

I am completely new to AI, I would like to know how I can train it to recognize a new language.
I have no idea what to do. I can find no docs online about this. I have attempted to train it using the Trainer from transformers, but I keep coming up with errors. Can I have a code example for this?

MTPB benchmark

Hello,
I saw that you released the MTPB benchmark recently, do you plan to release also the script to evaluate CodeGen in multi-turn or single-turn using the MTPB dataset?

Thank you in advance.

BigPython Availability

Hello! I hope everything is going well with you.

First, I would like to thank you for publishing the open-source pre-trained models, and for the quality of the paper; they are amazing!

Are there any plans to release the BigPython dataset or publish a script to create it? I have looked around and could not find any reference to such a dataset.

Thank you and best regards,
Gustavo.

warning information

CodeGen is a powerful model.

When I use the model as the following code:

from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-350M-mono")
model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen-350M-mono")

text = "def hello_world():"
input_ids = tokenizer(text, return_tensors="pt").input_ids

generated_ids = model.generate(input_ids, max_length=128)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))

However, I get the following warnings:

The attention mask and the pad token id were not set.  As a consequence, you may observe unexpected behavior.  Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.

Do you know how I can fix it? Plus, what happens if I don't fix it?

Thank you very much!
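
The warning text itself names the two things it wants: the input's `attention_mask` and an explicit `pad_token_id`. It also reports that the library falls back to `eos_token_id` as the pad token when none is set. A minimal sketch of supplying both explicitly follows. `build_generate_kwargs` is a hypothetical helper, and `encoded` is assumed to be the mapping returned by calling the tokenizer on the prompt.

```python
def build_generate_kwargs(encoded, eos_token_id, max_length=128):
    """Collect keyword arguments for generate() with mask and pad id set."""
    return {
        "input_ids": encoded["input_ids"],
        "attention_mask": encoded["attention_mask"],
        # Reuse EOS as the pad token, mirroring the fallback in the warning.
        "pad_token_id": eos_token_id,
        "max_length": max_length,
    }


# Intended usage with an already loaded tokenizer and model (assumption):
#   encoded = tokenizer(text, return_tensors="pt")
#   generated_ids = model.generate(
#       **build_generate_kwargs(encoded, tokenizer.eos_token_id)
#   )
```

Passing the whole tokenizer output with `model.generate(**encoded, pad_token_id=tokenizer.eos_token_id, max_length=128)` should be equivalent and shorter.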

codegen-350M-multi and codegen-350M-mono model files mistakenly shared the same hash

# codegen-350M-nl,multi,mono
# wget -P checkpoints https://storage.googleapis.com/sfr-codegen-research/checkpoints/codegen-350M-multi.tar.gz && tar -xvf checkpoints/codegen-350M-multi.tar.gz -C checkpoints/
# wget -P checkpoints https://storage.googleapis.com/sfr-codegen-research/checkpoints/codegen-350M-mono.tar.gz && tar -xvf checkpoints/codegen-350M-mono.tar.gz -C checkpoints/

Hello!
I found that the model files downloaded from two different links above are the same:

md5sum codegen-350M-mono/*
d81cbe1111f246ca7f48850d0ed627fb  codegen-350M-mono/pytorch_model.bin

md5sum codegen-350M-multi/*
d81cbe1111f246ca7f48850d0ed627fb  codegen-350M-multi/pytorch_model.bin
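
The same comparison can be reproduced without `md5sum`, for example on a system where that tool is unavailable. The sketch below hashes each file in chunks with Python's standard `hashlib`. The function names are illustrative.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """True when both files hash to the same MD5 digest."""
    return md5_of_file(path_a) == md5_of_file(path_b)
```

Calling `same_content("codegen-350M-mono/pytorch_model.bin", "codegen-350M-multi/pytorch_model.bin")` would then confirm whether the two downloads are byte-identical.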

How to train on custom data — continued

I am completely new to AI, I would like to know how I can train it to recognize a new language.
I have no idea what to do. I can find no docs online about this. I have attempted to train it using the Trainer from transformers, but I keep coming up with errors. Can I have a code example for this?
I have a dict of expected inputs to expected outputs. Should the dict be input:input+output or input:output? I would expect it to be the former.
I have 0 GPUs and I have no idea how to use Jaxformer.

I would like some code example of training or something. I am trying to teach it a new language and teach it other new things. Any examples? Anything I need to know?

Please help me.

MTPB benchmark question 1 is duplicate of 26

It seems question 1 in the benchmark is actually a duplicate of 26.

In addition, the given description and name for question 1: "name": "Sandwich string", "description": "Append a string in the middle of another string." do not describe the task of question 1.

If this version of question 1 was used for paper evaluations, maybe it should not be altered now, but should the description be updated to reflect the actual task of question 1?
