Comments (9)
Yes, this is expected.
I believe, in your case, you would want to set min_length, not max_length, in model.generate(min_length=1024, ...) here:
https://github.com/salesforce/CodeGen/blob/main/jaxformer/hf/sample.py#L120
This will pull in a LogitsProcessor that manipulates the logit for the eos token so it cannot be sampled before the minimum length is reached.
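For illustration, a minimal sketch of that call, reusing the 350M-mono checkpoint from the snippets later in this thread (min_length=1024 is just the value suggested above; max_length must be at least that large):

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-350M-mono")
model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen-350M-mono")
inputs = tokenizer("def hello_world():", return_tensors="pt")
# min_length adds a MinLengthLogitsProcessor, which masks the eos token
# (logit set to -inf) until the sequence reaches 1024 tokens.
sample = model.generate(**inputs, do_sample=True, min_length=1024, max_length=2048)
print(tokenizer.decode(sample[0]))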
Hope it helps.
Thanks for the reply! When I use the checkpoint from Hugging Face as in the example code, I get much longer output, like the following result in Colab:
In[1]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-350M-mono")
model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen-350M-mono").to(0)  # move the model to the same GPU as the inputs
inputs = tokenizer("def hello_world():", return_tensors="pt").to(0)
sample = model.generate(**inputs, do_sample=True, max_length=512)
print(tokenizer.decode(sample[0]))
Out[1]:
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
def hello_world():
    global name
    name = "Samira"
    print("Hello")
hello_world()
print(name)
# In python, a global statement is at end of file
def show_info_me_name():
    print("my name is, " + name)
def set_name():
    global name
    name = "Samira"
    show_info_me_name()
set_name()
I tried various decoding settings (temperature sampling, greedy decoding) but still can't seem to match this. Do the two methods have different implementations of the generate function?
@xu3kev This is probably because of the optional truncate_before_pattern argument implemented for CodeGenTokenizer; our sampling code truncates based on a pattern by default. Please try setting the patterns as shown here.
Thanks for the suggestion! I removed all of the patterns and the output is still the same. I can't find the place where they are passed to the tokenizer.
This repo's default temperature and top_p are 0.2 and 0.95, respectively, but transformers' generate() sets both to 1.0 by default. When you sample from our repo, please add the arguments --p 1.0 --t 1.0.
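To compare in the other direction, a minimal sketch that pins transformers' generate() to this repo's defaults instead (same 350M-mono setup as the snippets above; 0.2 and 0.95 are the repo defaults mentioned here):

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-350M-mono")
model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen-350M-mono")
inputs = tokenizer("def hello_world():", return_tensors="pt")
# Match this repo's sampling defaults rather than transformers'
# defaults of temperature=1.0 and top_p=1.0.
sample = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.2,
    top_p=0.95,
    max_length=512,
)
print(tokenizer.decode(sample[0]))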
Where to add patterns in the tokenizer (transformers v4.21.3):
In [5]: import re
In [6]: patterns = [
   ...:     '^#',
   ...:     re.escape('<|endoftext|>'),
   ...:     "^'''",
   ...:     '^"""',
   ...:     '\n\n\n'
   ...: ]
In [7]: import torch
...: from transformers import AutoTokenizer, AutoModelForCausalLM
...: tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-350M-mono")
...: model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen-350M-mono")
...: inputs = tokenizer("def hello_world():", return_tensors="pt")
...: sample = model.generate(**inputs, do_sample=True, max_length=512)
...: print(tokenizer.decode(sample[0], truncate_before_pattern=patterns))
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
def hello_world():
    print ("Hello!")
if __name__=="__main__":
    hello_world()
Thanks! I tried again to compare the two methods by setting do_sample to False to do greedy decoding. The output from the following Hugging Face approach is pretty long:
In[1]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-350M-mono")
model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen-350M-mono").to(0)  # move the model to the same GPU as the inputs
inputs = tokenizer("# this function prints hello world", return_tensors="pt").to(0)
sample = model.generate(**inputs, do_sample=False, max_length=512)
print(tokenizer.decode(sample[0]))
Out[1]:
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
# this function prints hello world
# this function takes a string as an argument
# it prints the string in the following format:
# Hello, World!
# this function takes a string as an argument
# it prints the string in the following format:
# Hello, World!
# this function takes a string as an argument
# it prints the string in the following format:
# Hello, World!
# this function takes a string as an argument
# it prints the string in the following format:
# Hello, World!
# this function takes a string as an argument
# it prints
On the other hand, the sampling code in this repo produced a nearly empty result.
The modified sampling code:
with torch.no_grad():
    input_ids = input_ids.to(device)
    tokens = model.generate(
        input_ids,
        do_sample=False,
        num_return_sequences=num_return_sequences,
        max_length=input_ids_len + max_length_sample,
        pad_token_id=pad_token_id,
        use_cache=True,
    )
    text = tokenizer.batch_decode(tokens[:, input_ids_len:, ...])
Output:
~/CodeGen$ python3 -m jaxformer.hf.sample --model codegen-350M-mono --context "# this function prints hello world" --max-length 1024
loading parameters
loading parameters took 13.88s
loading tokenizer
loading tokenizer took 6.99s
sampling
====================================================================================================
#
====================================================================================================
It seems very strange to me. Could you help me understand what might be the issue?
Thanks for the observation. Let us investigate and get back to you.
Thanks for the reply! If possible, could you reopen the issue so it would be easier to track? Or would it be better if I open another issue about not being able to match the two greedy decoding methods?
@xu3kev Check the eos token id: the Hugging Face checkpoint and the model from this repo use different token ids.
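A quick way to check this on the Hugging Face side (a minimal sketch; compare the printed ids against whatever eos/pad ids jaxformer/hf/sample.py passes to generate()):

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-350M-mono")
model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen-350M-mono")
# Id the Hugging Face tokenizer treats as end-of-text.
print("tokenizer eos:", tokenizer.eos_token, tokenizer.eos_token_id)
# Ids baked into the model config; generate() falls back to these.
print("config eos/pad:", model.config.eos_token_id, model.config.pad_token_id)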