Comments (5)
#4688 might solve your issue, can you try that out?
Thank you for the prompt response. After looking into it, the problem we are facing appears distinct from the one addressed in #4688. That issue concerns an extra repetition of the Beginning of Sequence (BOS) token, which is specific to API interactions; our scenario involves offline inference and does not show the same BOS repetition pattern.
Therefore, the workaround suggested in #4688 may not directly apply to the problem we are encountering with the prompt_logprobs output and the proliferation of unexpected special tokens.
I appreciate your effort to provide a potential fix, and I will continue to explore alternative solutions to resolve the unexpected token issue we are experiencing. Should there be any further insights or suggestions you can offer, they would be most welcome.
Thank you once again for your assistance with this matter.
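For contrast, the #4688 symptom (a duplicated BOS at the start of the prompt) is easy to check for in the prompt token ids, and our offline prompts do not show it. A minimal sketch, assuming a placeholder BOS id (the real id depends on the tokenizer):

```python
BOS_TOKEN_ID = 1  # placeholder; the actual id depends on the model's tokenizer

def has_duplicated_bos(prompt_token_ids):
    """True if the prompt begins with two consecutive BOS tokens (the #4688 pattern)."""
    return (len(prompt_token_ids) >= 2
            and prompt_token_ids[0] == prompt_token_ids[1] == BOS_TOKEN_ID)

print(has_duplicated_bos([1, 1, 450, 278]))  # True  (the #4688 pattern)
print(has_duplicated_bos([1, 450, 278]))     # False (what we observe offline)
```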
@DarkLight1337 This sounds related to #4577 -- something between 0.4.0.post1 and 0.4.1 changed the way tokenization works. For whatever reason I am getting back a sequence of tokens like <, <|, <|im_, etc. instead of the whole <|im_start|> at once.
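To illustrate the symptom, here is a toy reproduction of how incremental decoding surfaces partial pieces of a multi-character special token when it is split into sub-pieces rather than treated atomically. The pieces below are illustrative, not the actual tokenizer vocabulary:

```python
# Illustrative sub-pieces a BPE tokenizer might split "<|im_start|>" into
# if the special token is not kept as a single atomic unit.
pieces = ["<", "|", "im_", "start", "|>"]

def stream_decode(pieces):
    """Yield the accumulated text after each piece, mimicking streamed output."""
    text = ""
    for p in pieces:
        text += p
        yield text

prefixes = list(stream_decode(pieces))
print(prefixes)  # ['<', '<|', '<|im_', '<|im_start', '<|im_start|>']
```

Each intermediate prefix matches the fragments reported above; only the final step yields the full token.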
I'm currently investigating a similar issue in #4200. It seems that there is something wrong with the detokenizing logic where new_decoded_token_text gets pre-padded with extra whitespace characters. @Yard1 @njhill do you have any idea about this?
Edit: It seems that my particular issue is related to the chat template. @DreamGenX's issue would be more relevant to this case.
(See vllm/vllm/transformers_utils/detokenizer.py, line 115 at commit 52f8107.)
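A sketch of how the delta text is typically computed in incremental detokenization, and how an inconsistent prefix can pre-pad the delta with whitespace. These are toy strings illustrating the failure mode, not the actual detokenizer.py logic:

```python
def delta_text(full_text, prefix_text):
    """New decoded text = full decoded text minus the previously decoded prefix."""
    return full_text[len(prefix_text):]

# Normal case: the prefix decodes consistently, so the delta is clean.
print(repr(delta_text("Hello world", "Hello")))   # ' world'

# Failure mode: the full decode contains characters (here an extra space) that
# the earlier prefix decode did not, so the slice starts too early and the
# delta comes back pre-padded with whitespace.
print(repr(delta_text("Hello  world", "Hello")))  # '  world'
```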
Hello @DarkLight1337, @DreamGenX, and everyone involved in this discussion,
Thank you for your ongoing investigation into the tokenization and detokenization logic within the vLLM project. I understand that there may be related issues, such as #4200 and #4577, which are being looked into.
However, for the issue at hand (#4772), I would like to clarify that our primary concern is not with the detokenization process or the formatting of the output. We do not use the detokenized text, so any irregularities in detokenization do not affect our use case.
Our focus is on the correctness of the prompt_logprobs output. We are seeing a significant number of unexpected special tokens in the log probability dictionaries, which causes errors in our downstream processing; these tokens are not expected and interfere with the intended functionality of the model.
To reiterate, we need help ensuring that the prompt_logprobs output is accurate and free of these unexpected special tokens for both the Llama3 and Llama2-13b-chat-hf models.
If there are any updates, insights, or suggestions on how to address this specific issue with the log probabilities, we would greatly appreciate the guidance.
Thank you for your attention to this matter, and we look forward to a resolution.
Best regards,
@leejamesss
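For concreteness, a small check of the kind we run downstream that flags the unexpected special tokens. The ids and the prompt_logprobs shape are hypothetical placeholders (in practice the special ids would come from the model's tokenizer, and each position in vLLM's prompt_logprobs is either None or a dict mapping token id to logprob):

```python
# Hypothetical special token ids; in practice taken from the model's tokenizer
# (its BOS/EOS and chat-control tokens).
SPECIAL_TOKEN_IDS = {1, 2, 32000, 32001}

def unexpected_specials(prompt_logprobs):
    """Return the set of special token ids that appear as logprob keys."""
    seen = set()
    for position in prompt_logprobs:
        if position is None:  # the first prompt position has no logprobs
            continue
        seen.update(t for t in position if t in SPECIAL_TOKEN_IDS)
    return seen

# Toy example: positions 1 and 2 contain unexpected special ids 1 and 32001.
example = [None, {1: -0.01, 450: -3.2}, {32001: -0.2}, {278: -0.7}]
print(sorted(unexpected_specials(example)))  # [1, 32001]
```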