Tokenizers for encoding text into token ids, for use with embedding models and LLMs.
This project is just getting started. Currently only the OpenAI Tiktoken BPE encoding is supported. It would be great to get some more implemented!
cl-tokenizers is not in Quicklisp yet. Clone it into Quicklisp's local-projects directory (the path below assumes Quicklisp is installed in your home directory):
> cd ~/quicklisp/local-projects/
> git clone https://github.com/jolby/cl-tokenizers.git
Then load the system from the REPL, create an encoder, and encode and decode some text:
CL-USER> (ql:quickload :tokenizers)
> (:TOKENIZERS)
CL-USER> (defparameter *cl100k-encoder* (tokenizers:get-encoder :tiktoken "cl100k_base"))
> *CL100K-ENCODER*
CL-USER> (tokenizers:encode *cl100k-encoder* "hello world")
> #(15339 1917)
CL-USER> (tokenizers:decode *cl100k-encoder* #(15339 1917))
> "hello world"
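A common use for an encoder is counting tokens before sending text to a model, and a quick sanity check is that decoding an encoding gives back the original string. The sketch below builds both on the three calls shown above (get-encoder, encode, decode). The helper names count-tokens and round-trip-p are hypothetical and not part of the library, and the code assumes encode returns a sequence of token ids that decode accepts unchanged, as the transcript suggests. Evaluate it at the REPL or with load.

```lisp
;; Load the system first so the TOKENIZERS package exists when the
;; forms below are read.
(ql:quickload :tokenizers)

(defparameter *cl100k-encoder*
  (tokenizers:get-encoder :tiktoken "cl100k_base"))

;; Hypothetical helper, not provided by cl-tokenizers.
(defun count-tokens (encoder text)
  "Return the number of token ids ENCODER produces for TEXT."
  (length (tokenizers:encode encoder text)))

;; Hypothetical helper, not provided by cl-tokenizers.
(defun round-trip-p (encoder text)
  "Return true if decoding the encoding of TEXT yields TEXT again."
  (string= text
           (tokenizers:decode encoder
                              (tokenizers:encode encoder text))))
```

Given the output shown above, (count-tokens *cl100k-encoder* "hello world") should return 2, and (round-trip-p *cl100k-encoder* "hello world") should return true.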