pkoukk / tiktoken-go
go version of tiktoken
License: MIT License
Hi, I'm computing token counts with the method below, and the result differs a lot from OpenAI's official token calculator. Model: gpt-3.5. Have you run into this? Here is the demo code I used:
func NumTokensFromMessages(messages []openai.ChatCompletionMessage, model string) (num_tokens int) {
	tkm, err := tiktoken.EncodingForModel(model)
	if err != nil {
		err = fmt.Errorf("EncodingForModel: %v", err)
		fmt.Println(err)
		return
	}
	var tokens_per_message int
	var tokens_per_name int
	if model == "gpt-3.5-turbo-0301" || model == "gpt-3.5-turbo" {
		tokens_per_message = 4
		tokens_per_name = -1
	} else if model == "gpt-4-0314" || model == "gpt-4" {
		tokens_per_message = 3
		tokens_per_name = 1
	} else {
		fmt.Println("Warning: model not found. Using cl100k_base encoding.")
		tokens_per_message = 3
		tokens_per_name = 1
	}
	for _, message := range messages {
		num_tokens += tokens_per_message
		num_tokens += len(tkm.Encode(message.Content, nil, nil))
		num_tokens += len(tkm.Encode(message.Role, nil, nil))
		num_tokens += len(tkm.Encode(message.Name, nil, nil))
		if message.Name != "" {
			num_tokens += tokens_per_name
		}
	}
	num_tokens += 3
	return num_tokens
}
I'm not seeing where the gpt2 model is handled, similar to how it is done here: https://github.com/openai/tiktoken/blob/main/tiktoken_ext/openai_public.py#L10
A simple test like:
+func TestGpt2Encoding(t *testing.T) {
+	if _, err := EncodingForModel("gpt2"); err != nil {
+		t.Error(err)
+	}
+}
will fail like so:
--- FAIL: TestGpt2Encoding (0.00s)
tiktoken_test.go:36: Unknown encoding: gpt2
FAIL
exit status 1
FAIL github.com/pkoukk/tiktoken-go 0.194s
tkm, err := tiktoken.EncodingForModel("gpt-3.5-turbo-16k-0613")
I get an error even after loading the file from https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken

tkm, err := tiktoken.EncodingForModel(Model)
if err != nil {
	fmt.Println(fmt.Errorf("EncodingForModel: %v", err))
	return
}
When I run with model gpt-3.5-turbo-0301, I get an "encoding not found" error. Looking at the source, MODEL_TO_ENCODING does not contain this model.
// func (t *Tiktoken) Encode(text string, allowedSpecial []string, disallowedSpecial []string) []int {
var allowedSpecialSet map[string]any
if len(allowedSpecial) == 0 {
	allowedSpecialSet = map[string]any{}
} else if len(disallowedSpecial) == 1 && disallowedSpecial[0] == "all" {
	allowedSpecialSet = t.specialTokensSet
} else {
	allowedSpecialSet = map[string]any{}
	for _, v := range allowedSpecial {
		allowedSpecialSet[v] = nil
	}
}
@pkoukk I haven't gone through the full code logic yet, but the use of disallowedSpecial here looks odd. Shouldn't it be allowedSpecial?
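The suspicion looks right: the branch that grants the full special-token set tests disallowedSpecial, while the surrounding logic suggests allowedSpecial == ["all"] should be what triggers it. A stdlib-only sketch of the corrected selection (the function name and map types here are illustrative, not the package's actual API):

```go
package main

import "fmt"

// buildAllowedSet mirrors the branch above, but keys the "all" check on
// allowedSpecial rather than disallowedSpecial (assumed to be the intent).
func buildAllowedSet(allowedSpecial []string, specialTokensSet map[string]any) map[string]any {
	if len(allowedSpecial) == 1 && allowedSpecial[0] == "all" {
		return specialTokensSet // allow every registered special token
	}
	set := map[string]any{}
	for _, v := range allowedSpecial {
		set[v] = nil
	}
	return set
}

func main() {
	special := map[string]any{"<|endoftext|>": nil, "<|fim_prefix|>": nil}
	fmt.Println(len(buildAllowedSet([]string{"all"}, special)))           // 2
	fmt.Println(len(buildAllowedSet([]string{"<|endoftext|>"}, special))) // 1
	fmt.Println(len(buildAllowedSet(nil, special)))                      // 0
}
```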
The number of tokens deviates a lot compared to https://platform.openai.com/tokenizer.
package main

import (
	"fmt"

	"github.com/pkoukk/tiktoken-go"
)

func main() {
	text := "这是一个测试"
	tke, _ := tiktoken.GetEncoding("cl100k_base")
	token := tke.Encode(text, nil, nil)
	fmt.Println(len(token)) // Result: 4
}
The OpenAI Tokenizer generates 10 for the same text.
As of 6/27, token counting has changed slightly, so I updated the ChatMessage example.
// The link below may not render on Chrome (error: Unable to render code block);
// if so, use Firefox.
// OpenAI Cookbook: https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb
func NumTokensFromMessages(messages []openai.ChatCompletionMessage, model string) (numTokens int) {
	tkm, err := tiktoken.EncodingForModel(model)
	if err != nil {
		err = fmt.Errorf("encoding for model: %v", err)
		log.Println(err)
		return
	}
	var tokensPerMessage, tokensPerName int
	if model == "gpt-3.5-turbo-0613" ||
		model == "gpt-3.5-turbo-16k-0613" ||
		model == "gpt-4-0314" ||
		model == "gpt-4-32k-0314" ||
		model == "gpt-4-0613" ||
		model == "gpt-4-32k-0613" {
		tokensPerMessage = 3
		tokensPerName = 1 // per the cookbook; -1 applies only to gpt-3.5-turbo-0301
	} else if model == "gpt-3.5-turbo-0301" {
		tokensPerMessage = 4 // every message follows <|start|>{role/name}\n{content}<|end|>\n
		tokensPerName = -1   // if there's a name, the role is omitted
	} else if model == "gpt-3.5-turbo" {
		log.Println("warning: gpt-3.5-turbo may update over time. Returning num tokens assuming gpt-3.5-turbo-0613.")
		return NumTokensFromMessages(messages, "gpt-3.5-turbo-0613")
	} else if model == "gpt-4" {
		log.Println("warning: gpt-4 may update over time. Returning num tokens assuming gpt-4-0613.")
		return NumTokensFromMessages(messages, "gpt-4-0613")
	} else {
		err := errors.New("warning: model not found. Using cl100k_base encoding")
		log.Println(err)
		return
	}
	for _, message := range messages {
		numTokens += tokensPerMessage
		numTokens += len(tkm.Encode(message.Content, nil, nil))
		numTokens += len(tkm.Encode(message.Role, nil, nil))
		numTokens += len(tkm.Encode(message.Name, nil, nil))
		if message.Name != "" {
			numTokens += tokensPerName
		}
	}
	numTokens += 3 // every reply is primed with <|start|>assistant<|message|>
	return numTokens
}
Howdy,
There's potential concurrent access to the ENCODING_MAP in the getEncoding function here:
func getEncoding(encodingName string) (*Encoding, error) {
	encoding, ok := ENCODING_MAP[encodingName]
	if !ok {
		initEncoding, err := initEncoding(encodingName)
		if err != nil {
			return nil, err
		}
		encoding = initEncoding
		ENCODING_MAP[encodingName] = encoding
	}
	return encoding, nil
}
There may be other issues in the package that make it unsafe to run from multiple goroutines, which isn't expected since we're picking up unique instances via tiktoken.EncodingForModel(model). You might want to move this (and other) globals into a struct.
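A minimal sketch of serializing access to the cache with a sync.Mutex; Encoding and initEncoding here are stand-ins for the package's internals, not its actual API:

```go
package main

import (
	"fmt"
	"sync"
)

// Encoding stands in for the package's encoding type.
type Encoding struct{ Name string }

var (
	encodingMu  sync.Mutex
	encodingMap = map[string]*Encoding{}
)

// initEncoding stands in for the package's expensive constructor.
func initEncoding(name string) (*Encoding, error) {
	return &Encoding{Name: name}, nil
}

// getEncoding mirrors the snippet above, with map reads and writes
// serialized so concurrent callers cannot race on the cache.
func getEncoding(name string) (*Encoding, error) {
	encodingMu.Lock()
	defer encodingMu.Unlock()
	if enc, ok := encodingMap[name]; ok {
		return enc, nil
	}
	enc, err := initEncoding(name)
	if err != nil {
		return nil, err
	}
	encodingMap[name] = enc
	return enc, nil
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if _, err := getEncoding("cl100k_base"); err != nil {
				panic(err)
			}
		}()
	}
	wg.Wait()
	fmt.Println(len(encodingMap)) // 1
}
```

A sync.RWMutex or sync.Map would reduce contention on the read path, but the plain mutex keeps the change closest to the existing code.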
Hi, pkoukk!
Nice port! However, I noticed that there is no explicit license note in the repository, which means the code cannot be used by others and is considered proprietary.
If you would like to publish the code as a Free and Open Source Software, I recommend choosing a license from https://www.gnu.org/licenses/license-list.html.
Thank you for publishing this code and I hope you find my suggestion helpful.
My prompt:
golang, Please use Markdown syntax to reply
model: gpt-3.5-turbo
encoding: cl100k_base
The computed result is 9 tokens, but the actual count is 17, and the API returns this error:
This model's maximum context length is 4097 tokens. However, you requested 4104 tokens (17 in the messages, 4087 in the completion). Please reduce the length of the messages or completion.
My code is as follows:
func (g *GPT) getTikTokenByEncoding(prompt string) (int, error) {
	encoding := g.getAvailableEncodingModel(Model)
	g.App.LogInfo("encoding: ", encoding)
	tkm, err := tiktoken.GetEncoding(encoding)
	if err != nil {
		return 0, err
	}
	token := tkm.Encode(prompt, nil, nil)
	return len(token), nil
}
How can I fix this?
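The snippet only counts tokens in the raw prompt text, but the API also bills the ChatML framing around each message plus the tokens that prime the reply. Using the cookbook's figures for gpt-3.5-turbo-0301 (4 per message, 3 for the primed reply) and assuming the role "user" encodes to a single token, the arithmetic closes the 9-vs-17 gap:

```go
package main

import "fmt"

func main() {
	contentTokens := 9    // what tkm.Encode(prompt, nil, nil) counted
	tokensPerMessage := 4 // ChatML framing per message (gpt-3.5-turbo-0301)
	roleTokens := 1       // assumption: "user" is one cl100k_base token
	replyPriming := 3     // every reply is primed with <|start|>assistant<|message|>
	fmt.Println(contentTokens + tokensPerMessage + roleTokens + replyPriming) // 17
}
```

In short: count with a NumTokensFromMessages-style helper rather than encoding the bare prompt string.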
My code is as follows:
package main

import (
	"fmt"
	"log"
	"strings"

	"github.com/pkoukk/tiktoken-go"
	"github.com/sashabaranov/go-openai"
)

func main() {
	ins := []openai.ChatCompletionMessage{
		{
			Role:    "user",
			Content: "Hello!",
		},
		{
			Role:    "assistant",
			Content: "Hello! How can I assist you today?",
		},
	}
	fmt.Println(NumTokensFromMessages(ins, "gpt-3.5-turbo-0613"))
}
func NumTokensFromMessages(messages []openai.ChatCompletionMessage, model string) (numTokens int) {
	tkm, err := tiktoken.EncodingForModel(model)
	if err != nil {
		err = fmt.Errorf("encoding for model: %v", err)
		log.Println(err)
		return
	}
	var tokensPerMessage, tokensPerName int
	switch model {
	case "gpt-3.5-turbo-0613",
		"gpt-3.5-turbo-16k-0613",
		"gpt-4-0314",
		"gpt-4-32k-0314",
		"gpt-4-0613",
		"gpt-4-32k-0613":
		tokensPerMessage = 3
		tokensPerName = 1
	case "gpt-3.5-turbo-0301":
		tokensPerMessage = 4 // every message follows <|start|>{role/name}\n{content}<|end|>\n
		tokensPerName = -1   // if there's a name, the role is omitted
	default:
		if strings.Contains(model, "gpt-3.5-turbo") {
			log.Println("warning: gpt-3.5-turbo may update over time. Returning num tokens assuming gpt-3.5-turbo-0613.")
			return NumTokensFromMessages(messages, "gpt-3.5-turbo-0613")
		} else if strings.Contains(model, "gpt-4") {
			log.Println("warning: gpt-4 may update over time. Returning num tokens assuming gpt-4-0613.")
			return NumTokensFromMessages(messages, "gpt-4-0613")
		} else {
			err = fmt.Errorf("num_tokens_from_messages() is not implemented for model %s. See https://github.com/openai/openai-python/blob/main/chatml.md for information on how messages are converted to tokens.", model)
			log.Println(err)
			return
		}
	}
	for _, message := range messages {
		numTokens += tokensPerMessage
		numTokens += len(tkm.Encode(message.Content, nil, nil))
		numTokens += len(tkm.Encode(message.Role, nil, nil))
		numTokens += len(tkm.Encode(message.Name, nil, nil))
		if message.Name != "" {
			numTokens += tokensPerName
		}
	}
	numTokens += 3
	return numTokens
}
My output:
go run main.go
22
How should I modify this to get the correct token count?
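For those two messages under gpt-3.5-turbo-0613, 22 appears to be the expected answer rather than an error. A breakdown of where it comes from (the per-string token counts are assumptions about cl100k_base behavior, not taken from the issue):

```go
package main

import "fmt"

func main() {
	perMessage := 3 // tokensPerMessage for gpt-3.5-turbo-0613
	// Assumed cl100k_base counts: role "user" = 1, "Hello!" = 2,
	// role "assistant" = 1, "Hello! How can I assist you today?" = 9.
	msg1 := perMessage + 1 + 2
	msg2 := perMessage + 1 + 9
	fmt.Println(msg1 + msg2 + 3) // 22, matching the program's output
}
```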
Curious how this compares.
I have a basic use-case token counter, but because GetEncoding is so expensive, my server with 50M of memory immediately gets an OOM error with only about 7 goroutines calling it at the same time. It would be great if this were optimized, or if the tkm could be shared and reused.
func countTokens(messages []openai.ChatCompletionMessage) int {
	tkm, err := tiktoken.GetEncoding(tiktoken.MODEL_CL100K_BASE)
	if err != nil {
		panic(err)
	}
	tokensPerMessage := 3
	var tokenCount int
	for _, message := range messages {
		tokenCount += tokensPerMessage
		tokenCount += len(tkm.Encode(message.Content, nil, nil))
		tokenCount += len(tkm.Encode(message.Role, nil, nil))
	}
	tokenCount += tokensPerMessage // every reply is primed with <|start|>assistant<|message|>
	return tokenCount
}
It seems GetEncoding is not cheap to call, since it has to compile regexes. Creating a Tiktoken instance before every token calculation is inefficient. If a Tiktoken instance could be shared by multiple goroutines, we would only need to create it once, which is more efficient.
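Until the library caches encodings itself, one workaround is to build the instance once at package level with sync.Once and hand the same pointer to every goroutine (whether Tiktoken.Encode is itself safe for concurrent use is an assumption worth verifying against the source). A stdlib-only sketch of the pattern, with expensiveInit standing in for tiktoken.GetEncoding:

```go
package main

import (
	"fmt"
	"sync"
)

// Tokenizer stands in for *tiktoken.Tiktoken.
type Tokenizer struct{ name string }

var initCount int // counts how many times the expensive constructor ran

// expensiveInit stands in for tiktoken.GetEncoding, which compiles regexes.
func expensiveInit() (*Tokenizer, error) {
	initCount++ // safe: only ever executed inside once.Do
	return &Tokenizer{name: "cl100k_base"}, nil
}

var (
	once  sync.Once
	tk    *Tokenizer
	tkErr error
)

// sharedTokenizer initializes the tokenizer exactly once, no matter how
// many goroutines race to call it, and returns the shared instance.
func sharedTokenizer() (*Tokenizer, error) {
	once.Do(func() { tk, tkErr = expensiveInit() })
	return tk, tkErr
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if _, err := sharedTokenizer(); err != nil {
				panic(err)
			}
		}()
	}
	wg.Wait()
	fmt.Println(initCount) // 1
}
```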
Hi,
this is a very useful tool!
Could you support a token counter, or do you have any suggestions for how I can do this?
ref: https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb
chapter: 6. Counting tokens for chat API calls
Thanks for the project. Great so far.
Is there support for counting tokens when using function calls?
https://platform.openai.com/docs/guides/gpt/function-calling
Thank you for your efforts. I found that in the NewCoreBPE function, the result of the following code does not seem to be used anywhere:
sortedTokenBytes := make([][]byte, 0, len(encoder))
for k := range encoder {
	sortedTokenBytes = append(sortedTokenBytes, []byte(k))
}
sort.Slice(sortedTokenBytes, func(i, j int) bool {
	return bytes.Compare(sortedTokenBytes[i], sortedTokenBytes[j]) < 0
})
return &CoreBPE{
	......
	sortedTokenBytes: sortedTokenBytes,
}, nil
but this sorting operation seems to be quite expensive. Is there some consideration behind it?
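If the sorted slice really is unused on the hot path, one option is to keep the constructor cheap and defer the sort to first use with sync.Once. A sketch under that assumption; coreBPE here is a stand-in holding only the field under discussion, not the package's actual struct:

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
	"sync"
)

// coreBPE is a stand-in with only the field under discussion.
type coreBPE struct {
	encoder map[string]int

	sortOnce         sync.Once
	sortedTokenBytes [][]byte
}

// SortedTokenBytes builds and sorts the slice on first call only, so
// construction stays cheap when the sorted view is never needed.
func (c *coreBPE) SortedTokenBytes() [][]byte {
	c.sortOnce.Do(func() {
		s := make([][]byte, 0, len(c.encoder))
		for k := range c.encoder {
			s = append(s, []byte(k))
		}
		sort.Slice(s, func(i, j int) bool {
			return bytes.Compare(s[i], s[j]) < 0
		})
		c.sortedTokenBytes = s
	})
	return c.sortedTokenBytes
}

func main() {
	c := &coreBPE{encoder: map[string]int{"b": 1, "a": 0, "c": 2}}
	for _, tok := range c.SortedTokenBytes() {
		fmt.Println(string(tok)) // prints a, b, c in order
	}
}
```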