salesforce / codegen
CodeGen is a family of open-source models for program synthesis. Trained on TPU-v4. Competitive with OpenAI Codex.
License: Apache License 2.0
Is it possible to use it in Visual Studio Code?
Hi,
For my project, I'm trying to fine-tune CodeGen models on my dataset and evaluate the resulting fine-tuned model on the HumanEval benchmark dataset. I have a few questions that I would appreciate if you could address.
First, why does the sampling code, at line 234, have tokenizer.pad_token == args.pad, which is 50256? Shouldn't we set the pad_token to eos_token, not 50256 (which is the eos_token_id)? I'm confused by this. At line 240, you set the parameter pad_token_id=args.pad. So in your sampling code, both pad_token and pad_token_id are set to 50256. Can you please elaborate on this? That would be super helpful.
As a baseline, I need to replicate your single-turn HumanEval benchmark results, but unfortunately I'm getting surprisingly lower results than those reported in the paper. I'm 99% sure that I'm missing something. To produce the Table 1 results in the paper, did you use the exact same sampling procedure as sample.py?
Thanks a lot for your time.
I am trying to test this model on my Mac M1, but it looks like NVIDIA hardware is required. Is that correct, or am I missing something?
Thanks for the work. CodeGen has achieved relatively good results on HumanEval, but I would like to know how the model is fine-tuned. Because HumanEval only has 164 problems, can you provide the parameters used for fine-tuning?
I am running the code in Kaggle, but once the blocks finish running, nothing happens. Am I not supposed to predict PL by giving NL? How do I go about it?
When trying to install CodeGen locally I run "pip3 install -r requirements.txt" and get the following error:
Looking in links: https://download.pytorch.org/whl/torch_stable.html
ERROR: Could not find a version that satisfies the requirement torch==1.9.0+cu111 (from versions: 1.11.0, 1.11.0+cpu, 1.11.0+cu113, 1.11.0+cu115, 1.12.0, 1.12.0+cpu, 1.12.0+cu113, 1.12.0+cu116, 1.12.1, 1.12.1+cpu, 1.12.1+cu113, 1.12.1+cu116, 1.13.0, 1.13.0+cpu, 1.13.0+cu116, 1.13.0+cu117)
ERROR: No matching distribution found for torch==1.9.0+cu111
I have Windows 10 and an Intel(R) UHD Graphics GPU, and I ran "pip3 install --upgrade pip setuptools" successfully. Any ideas what I could do?
I am completely new to AI, I would like to know how I can train it to recognize a new language.
I have no idea what to do. I can find no docs online about this. I have attempted to train it using the Trainer from transformers, but I keep coming up with errors. Can I have a code example for this?
Hi,
When we use the sampling code jaxformer/hf/sample.py, we notice that a lot of generated outputs end with '#'. Is this the expected behavior? Could you help us figure out why?
For example, in the official Colab, the output of the last cell before truncation is
sampling
====================================================================================================
print("Hello World")
hello_world()
#
====================================================================================================
which shows the same behavior.
I would like to evaluate the Codegen model on human-eval dataset. But I don't know how to generate 200 samples for each problem to calculate pass@k. Can you provide any documentation in this regard?
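For the pass@k part, here is a minimal sketch of the unbiased estimator described in the Codex paper, which to my knowledge is also what the openai/human-eval harness computes: with n samples per problem, of which c pass the unit tests, pass@k = 1 - C(n-c, k) / C(n, k). The function name and the per-problem counts below are made up for illustration; generating the n samples themselves still requires running the model.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: total samples generated for the problem
    c: number of those samples that passed the tests
    k: the k in pass@k (must satisfy k <= n)
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset contains no passing sample.
    all_fail = comb(n - c, k) / comb(n, k)
    return 1.0 - all_fail


# Hypothetical (n, c) counts for three problems, 200 samples each.
results = [(200, 20), (200, 0), (200, 200)]
for k in (1, 10, 100):
    score = sum(pass_at_k(n, c, k) for n, c in results) / len(results)
    print(f"pass@{k}: {score:.4f}")
```

The benchmark score is the mean of the per-problem estimates, so with 200 samples per problem the same set of generations can be reused for k = 1, 10, and 100.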
Can I try out Codegen locally for code generation as an alternative to CoPilot?
Is there a way to try the model locally and query the model with prompts?
Hello! I hope everything is going well with you.
First, I would like to thank you for publishing open-source pre-trained models and for the quality of the paper; they are amazing!
Are there any plans to release or publish a script to create the BigPython dataset? I have looked around and could not find any reference to such a dataset.
Thank you and best regards,
Gustavo.
Hi,
Based on the paper, CodeGen is based on the GPT-2 tokenizer and training scheme, i.e. bos_token, eos_token, and pad_token are "<|endoftext|>". However, it seems the HF model config includes the incorrect bos_token_id and pad_token_id (eos_token_id is fixed by #32).
way to reproduce the issue & expected behavior
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-350M-mono")
model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen-350M-mono")
assert model.config.eos_token_id == tokenizer.eos_token_id # pass (50256 == 50256)
assert model.config.bos_token_id == tokenizer.bos_token_id # failed (1 != 50256)
assert model.config.pad_token_id == tokenizer.pad_token_id # failed (None != 50256)
Hi,
Thanks for releasing this model! It seems that the eos_token_id is inconsistent between the model config and the pretrained tokenizer. The model config gives 2 as eos_token_id, yet eos_token_id for the tokenizer is 50256. I'm wondering which one is correct.
The provided sample.py example also uses 50256 as pad_token_id. pad_token_id should be the same as eos_token_id for GPT, right? So should the correct eos_token_id be 50256?
If we don't specify eos_token_id as an input argument to the .generate function in sample.py (but we do set pad_token_id to 50256), it will automatically get eos_token_id from self.config.eos_token_id, which is 2. Is this an issue?
Code to reproduce:
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-350M-mono")
model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen-350M-mono")
print(model.config.eos_token_id) # 2
print(tokenizer.eos_token_id) # 50256
Hello,
I saw that you released the MTPB benchmark recently. Do you also plan to release the script to evaluate CodeGen in multi-turn or single-turn mode using the MTPB dataset?
Thank you in advance.
I would like to fine tune the Codegen model.
What H/W would you need to fine tune a Codegen model?
What are the GPU requirements?
# codegen-350M-nl,multi,mono
# wget -P checkpoints https://storage.googleapis.com/sfr-codegen-research/checkpoints/codegen-350M-multi.tar.gz && tar -xvf checkpoints/codegen-350M-multi.tar.gz -C checkpoints/
# wget -P checkpoints https://storage.googleapis.com/sfr-codegen-research/checkpoints/codegen-350M-mono.tar.gz && tar -xvf checkpoints/codegen-350M-mono.tar.gz -C checkpoints/
Hello!
I found that the model files downloaded from two different links above are the same:
md5sum codegen-350M-mono/*
d81cbe1111f246ca7f48850d0ed627fb codegen-350M-mono/pytorch_model.bin
md5sum codegen-350M-multi/*
d81cbe1111f246ca7f48850d0ed627fb codegen-350M-multi/pytorch_model.bin
Hi, I'm trying out the sampling code. I set max-length to a larger value because I'd like it to output a longer sequence, but I still got a very short one, as follows.
~/CodeGen$ python3 -m jaxformer.hf.sample --model codegen-350M-mono --context "def hello_world():" --max-length 1024
loading parameters
loading parameters took 14.13s
loading tokenizer
loading tokenizer took 7.02s
sampling
====================================================================================================
print("Hello World")
hello_world()
#
====================================================================================================
def hello_world():
print("Hello World")
hello_world()
====================================================================================================
sampling took 0.47s
done.
Is this expected behavior? Thanks!
I have a couple of questions:
a) How can I use CodeGen to extract embeddings for JavaScript and Python code?
b) Can I feed incomplete JavaScript and Python code snippets to extract embeddings, or do the snippets need to be complete?
c) Has anyone used CodeGen to perform code-to-code search?
I would like to fine tune the Codegen model. Can you provide any documentation in this regard?
https://huggingface.co/Salesforce/codegen-2B-mono
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-2B-mono")
model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen-2B-mono")
text = "def hello_world():"
input_ids = tokenizer(text, return_tensors="pt").input_ids
generated_ids = model.generate(input_ids, max_length=128)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
ValueError: Tokenizer class CodeGenTokenizer does not exist or is not currently imported.
how to fine-tune codegen with LoRA?
How does CodeGen compare with the following models?
As titled. I feel that I cannot load it with int8 on a single GPU (24G)
When will the detailed training code be released?
If the EOS token ID is changed from 2 to 50256, will accuracy on the eval dataset also be affected? If so, what about the accuracy on the HumanEval dataset reported in the paper?
Does it support infilling-style?
Hi, first of all, great work!
Is there any chance you could provide more details on the BigQuery dataset / subset? Perhaps a list of the repositories used?
It would be great to have in order to avoid data leakage in experiments.
Cheers
Hi there, thank you so much for providing the model and weight of such a nice project!
I wonder if you are planning to open source the MTPB dataset mentioned in the paper and the corresponding generation method? Thank you :)
How to run in no-GPU system?
It gives me the error "No CUDA GPUs are available" because I don't have an NVIDIA graphics card.
https://hjlabs.in/
Hi, thanks for releasing these models! It's great to see more open-source LLMs, especially for source code. I wanted to sample from the 16B parameter model, but unfortunately the weights do not fit in the memory of a single 48GB GPU. Could you comment on whether the weights can be distributed across several GPUs at inference time? I imagine that would be valuable for facilitating the use of the largest models, given that few GPUs offer more than 48GB of memory. I believe the code below does something like this on the training side; it would be great if you could offer some implementation pointers for making this available for sampling.
CodeGen/jaxformer/hf/codegen/modeling_codegen.py
Lines 332 to 348 in ffd5a9b
Hello, thanks for your work.
Is there a proper way to prompt for code generation?
For example, to generate code to answer: "Create a function called num_in_str() to check whether a string contains a number."
Currently, when I pass that into the context (for the 350M and 2B models), the output is only # and it stops there.
Thank you!
Hello! I modified a colab notebook. It can be used to interact with CodeGen!
Here is the link: https://colab.research.google.com/drive/1fQI8OgzMAR0bquCrvhlAtXSw6iMFbVgI
Original notebook: https://colab.research.google.com/drive/1Bi2TnSUp2vNiSUhamsNuC4HqkZ2J4WwZ?usp=sharing
Hi,
I am interested in writing a fine-tuning script for this model.
Can anyone tell me in what format I should provide the output to the model during training?
The data is of the form
{code: "def hello(name): return f'Hello {name}'", nl: "This function takes a name as input and returns a message saying hello to the person in the format hello name."}
We can tokenize the input (either code or text) by creating the tokenizer and passing the text into it.
But what format should I use for training, and how do I compute the loss between the original output and the predicted output?
Thanks
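One common way to set this up for a causal language model, sketched here under assumptions rather than taken from the CodeGen training code: concatenate the prompt tokens and the target tokens into one sequence, and use a copy of that sequence as the labels, with the prompt positions set to -100 so the loss only covers the target. -100 is the default ignore index of PyTorch's CrossEntropyLoss, and Hugging Face causal LM heads shift the labels internally. The function name and the toy token IDs below are hypothetical.

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the loss


def build_example(prompt_ids, target_ids, eos_id):
    """Build one training example from already-tokenized prompt and target.

    Returns input_ids (prompt + target + EOS) and labels of the same length,
    where prompt positions are masked out so only the target is learned.
    """
    input_ids = list(prompt_ids) + list(target_ids) + [eos_id]
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(target_ids) + [eos_id]
    return {"input_ids": input_ids, "labels": labels}


# Toy token IDs standing in for a tokenized NL description and its code.
example = build_example(prompt_ids=[11, 12, 13], target_ids=[21, 22], eos_id=50256)
print(example["input_ids"])
print(example["labels"])
```

If the model should also learn to reproduce the prompt, the labels can simply equal input_ids; masking the prompt is a design choice, not a requirement.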
huggingface: https://github.com/huggingface/transformers/blob/main/src/transformers/models/codegen/modeling_codegen.py#L101
Your request: c1c49ab
You have already updated your model, but the Hugging Face code has not been updated. Please update it, and also update your checkpoint if you can.
Hi,
I noticed in the config file (https://huggingface.co/Salesforce/codegen-350M-mono/blob/main/config.json) that:
"attn_pdrop": 0.0
"embd_pdrop": 0.0
"resid_pdrop": 0.0
Is codegen pretrained with dropout 0? @enijkamp
Hi, I was trying to reproduce the HumanEval results of CodeGen-16B-Mono. My pass@1 results were significantly worse than that in the paper.
Here are my current results.
temperature 0.2: {'pass@1': 0.15926829268292686, 'pass@10': 0.45424462172394614, 'pass@100': 0.7596262109932597}
temperature 0.6: {'pass@1': 0.1631707317073171, 'pass@10': 0.45846712024390285, 'pass@100': 0.7349539270978501}
temperature 0.8: {'pass@1': 0.16115853658536589, 'pass@10': 0.4522397258878905, 'pass@100': 0.7548574227789276}
I used the checkpoint from https://huggingface.co/Salesforce/codegen-16B-mono and generated 200 completions for each HumanEval problem. The evaluation was run on https://github.com/openai/human-eval.
Here is a code snippet of how I prompted the model.
inputs = tokenizer(problem["prompt"], return_tensors="pt")
canonical_solution = tokenizer(problem["canonical_solution"]).input_ids
input_ids_len = inputs.input_ids.shape[1]
output = model.generate(
**inputs,
do_sample=True,
temperature=problem["temp"],  # generate() expects `temperature`; `temp` is not a recognized argument
top_p=0.95,
max_length=input_ids_len + max(128, len(canonical_solution) + 64),
pad_token_id=tokenizer.eos_token_id,
)
Could you please give me some guidance to reproduce the paper results?
Thank you!
I would like to fine-tune the 2B model, but I run out of memory even with the batch size set to 1 (on a single GPU with 24G memory).
I wonder what devices you used to pre-train the 2B and 16B models? How did you address the memory issue? Did you parallelize the model by layers across different GPUs? Thank you.
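On the out-of-memory question, some back-of-the-envelope arithmetic may help. The sketch below assumes nominal parameter counts (2e9 and 16e9, which may differ from the real checkpoints) and a common rule of thumb of roughly 16 bytes per parameter for full fp32 fine-tuning with Adam (4 for weights, 4 for gradients, 8 for the two optimizer moments). Activations and framework overhead are ignored, so real usage would be higher. The function name is hypothetical.

```python
GIB = 2**30


def memory_gib(num_params: float, bytes_per_param: int) -> float:
    """Memory needed to hold num_params values at bytes_per_param each, in GiB."""
    return num_params * bytes_per_param / GIB


# Nominal parameter counts; assumptions, not measured from the checkpoints.
MODELS = {"2B": 2e9, "16B": 16e9}

# Bytes per parameter under different assumed setups.
SETUPS = {
    "inference, fp32 weights": 4,
    "inference, fp16 weights": 2,
    "inference, int8 weights": 1,
    "full fine-tuning, fp32 + Adam": 16,
}

for name, params in MODELS.items():
    for setup, nbytes in SETUPS.items():
        print(f"{name:>3} | {setup:<30} | {memory_gib(params, nbytes):6.1f} GiB")
```

Under these assumptions, full fine-tuning of a 2B model needs about 30 GiB before any activations, which already exceeds a 24 GB card, and fp32 weights alone for a 16B model come to roughly 60 GiB.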
CodeGen is a powerful model.
When I use the model with the following code:
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-350M-mono")
model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen-350M-mono")
text = "def hello_world():"
input_ids = tokenizer(text, return_tensors="pt").input_ids
generated_ids = model.generate(input_ids, max_length=128)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
However, I get the following warnings:
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Do you know how I can fix it? Plus, what happens if I don't fix it?
Thank you very much!
Attention scores for tokens that are masked out via attention_mask get a value of -1e4, as per https://github.com/salesforce/CodeGen/blob/main/jaxformer/hf/codegen/modeling_codegen.py#L439, whereas attention scores masked out via causal_mask get a value of -1e9. As a result, the pre-softmax attention scores differ between causally masked tokens and padded tokens. This causes the inference outputs for a sequence run on its own to differ from the outputs for the same sequence when it is padded.
Thank you so much for sharing your work including weights and making it easy to run!
What does the text output of the model mean? I'm getting the completion of my prompt in the first box and then another completion in the second box. Is the second box just the truncated completion of the first box including the prompt?
Example:
(venv) ➜ CodeGen git:(main) ✗ python3 -m jaxformer.hf.sample --model codegen-2B-mono --context "def print_in_quotes(text):" --device cpu
loading parameters
loading parameters took 16.33s
loading tokenizer
loading tokenizer took 3.43s
sampling
====================================================================================================
print('"{}"'.format(text))
def print_in_quotes_with_quotes(text):
print("\"{}\"".format(text))
def print_in_quotes_with_quotes_and_spaces(text):
print("\"{}\"".format(text))
def print_in_quotes_with_quotes_and_spaces_and_new_lines(text):
print("\"{}\"".format(text))
def print_in_quotes_
====================================================================================================
def print_in_quotes(text):
print('"{}"'.format(text))
====================================================================================================
sampling took 78.96s
done.
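Regarding the two boxes: the output above is consistent with the second box being the prompt plus the first completion, cut at a stop sequence. A minimal sketch of that kind of truncation follows; the stop patterns and the function name are assumptions chosen for illustration, not necessarily the rules sample.py applies.

```python
import re

# Assumed stop patterns: a new top-level def, a top-level comment,
# a top-level docstring opener, or three consecutive newlines.
STOP_PATTERNS = [
    re.compile(pattern, re.MULTILINE)
    for pattern in (r"^def ", r"^#", r"^'''", r'^"""', r"\n\n\n")
]


def truncate(completion: str) -> str:
    """Cut a raw completion at the earliest stop pattern, then strip trailing whitespace."""
    cut = len(completion)
    for pattern in STOP_PATTERNS:
        match = pattern.search(completion)
        if match is not None:
            cut = min(cut, match.start())
    return completion[:cut].rstrip()


prompt = "def print_in_quotes(text):"
raw = (
    "\n    print('\"{}\"'.format(text))\n"
    "def print_in_quotes_with_quotes(text):\n"
    '    print("\\"{}\\"".format(text))\n'
)
print(prompt + truncate(raw))
```

Indented lines never match the patterns anchored at the start of a line, so the function body is kept and everything from the next top-level definition or comment onward is dropped. The same rule would explain raw outputs that end in '#': the comment marker is generated and then removed by truncation.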
Hi, would you be interested in adding CodeGen to Hugging Face? The Hub offers free hosting, and it would make your work more accessible and visible to the rest of the ML community. There is already a Salesforce organization on Huggingface: https://huggingface.co/Salesforce to add models/datasets/spaces(web demos) to.
Example from other organizations:
Keras: https://huggingface.co/keras-io
Microsoft: https://huggingface.co/microsoft
Facebook: https://huggingface.co/facebook
Example spaces with repos:
github: https://github.com/salesforce/BLIP
Spaces: https://huggingface.co/spaces/salesforce/BLIP
github: https://github.com/facebookresearch/omnivore
Spaces: https://huggingface.co/spaces/akhaliq/omnivore
and here are guides for adding spaces/models/datasets to your org
How to add a Space: https://huggingface.co/blog/gradio-spaces
how to add models: https://huggingface.co/docs/hub/adding-a-model
uploading a dataset: https://huggingface.co/docs/datasets/upload_dataset.html
Please let us know if you would be interested and if you have any questions, we can also help with the technical implementation.
How to apply prompt tuning to Codegen?
Hello,
I'd like to know if the BSD-3 license also applies to the pre-trained models. In particular can the pre-trained models be re-used commercially?
In any case thank you for a great paper and sharing the code and models!
Is it possible to set up CodeGen for inference on TPUs?
According to Hugging Face, PyTorch-XLA generation is not yet supported: huggingface/transformers#12322
Is there another way to use this model on TPUs for inference?
Thanks,
Arya
It seems question 1 in the benchmark is actually a duplicate of 26.
In addition, the given description and name for question 1: "name": "Sandwich string", "description": "Append a string in the middle of another string."
do not describe the task of question 1.
If this version of question 1 was used for paper evaluations, maybe it should not be altered now, but should the description be updated to reflect the actual task of question 1?
I am completely new to AI, I would like to know how I can train it to recognize a new language.
I have no idea what to do. I can find no docs online about this. I have attempted to train it using the Trainer from transformers, but I keep coming up with errors. Can I have a code example for this?
I have a dict of expected inputs to expected outputs. Should the dict be input:input+output or input:output? I would expect it to be the former.
I have 0 GPUs and I have no idea how to use Jaxformer.
I would like some code example of training or something. I am trying to teach it a new language and teach it other new things. Any examples? Anything I need to know?
Please help me.
Hello! I have been using CodeGen for generation lately, but I found that most of the time is spent loading parameters.
I tried to separate out the create_model part to prevent it from reloading, but a CUDA out-of-memory error occurred.
Is there a way to avoid reloading the parameters?
Thanks!