Comments (5)
I feel this is not unexpected behavior, even with the temperature set to 0. The tricky bit here is numerical stability: some of the CUDA algorithms may be non-deterministic, but even besides this, candle and PyTorch don't apply exactly the same ops, e.g. we accumulate with f32 in the softmax whereas PyTorch may well do something slightly different.
Overall, as the generated text seems legit, I would think it's fine, but I would not expect the generation or the generated logits to line up perfectly.
from candle.
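Not from the thread itself, but a minimal sketch of how accumulation precision alone can shift softmax outputs. The vocabulary size and logit values below are made up for illustration; the `acc_dtype` parameter stands in for whatever accumulator an implementation happens to use:

```python
import numpy as np

# Hypothetical logits for a large vocabulary; values are made up.
rng = np.random.default_rng(0)
logits = rng.normal(0.0, 4.0, size=32_000).astype(np.float32)

def softmax(x, acc_dtype):
    # Subtract the max for numerical stability, then accumulate the
    # normalizer in the given dtype, as an implementation might.
    shifted = (x - x.max()).astype(acc_dtype)
    exps = np.exp(shifted)
    return exps / exps.sum(dtype=acc_dtype)

p32 = softmax(logits, np.float32)
p64 = softmax(logits, np.float64)

# The two results agree only up to rounding error, not bit-for-bit.
print(np.abs(p32 - p64).max())
```

The difference is tiny per token, but it is exactly the kind of discrepancy that keeps logits from lining up exactly across frameworks.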
That makes sense! I suspected as much. My concern comes from consistently lower performance in my internal benchmarks (averaged across ~17 datasets), where candle scores 1% to 2% lower than the reference Python implementation on all tested models. However, I suppose there's no easy fix for that.
That's interesting, what is the benchmark, MMLU or something else? For MMLU, 1 or 2% seems within noise, but it's a bit annoying if it's consistently worse; it might be good to measure perplexity if that's not already what you're doing.
Overall, numerical differences can lead to lower performance, as PyTorch will be consistent between training and inference while we wouldn't be, but it's hard to say by how much. Any number you can put on this would be greatly appreciated (it may well be a bug on the candle side).
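One way to put a number on it, as suggested above, is to compare perplexity between the two implementations on the same text. A minimal sketch of perplexity computed from raw logits; the logits and target ids here are toy values, not real model output:

```python
import numpy as np

def perplexity(logits, targets):
    """Perplexity from raw logits, one row per sequence position."""
    # Log-softmax with the usual max-subtraction for stability.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Average negative log-likelihood of the observed tokens.
    nll = -log_probs[np.arange(len(targets)), targets].mean()
    return float(np.exp(nll))

# Toy example: 4 positions over a 3-token vocabulary.
logits = np.array([[2.0, 0.1, 0.1],
                   [0.1, 2.0, 0.1],
                   [0.1, 0.1, 2.0],
                   [2.0, 0.1, 0.1]])
targets = np.array([0, 1, 2, 0])
print(perplexity(logits, targets))
```

Running both implementations' logits through the same function like this isolates the numerical gap from any sampling or benchmark-harness differences.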
This is actually an interesting topic, thanks for sharing it @hugoabonizio. Even though numerical imprecision is naturally present across different implementations, I would expect these differences to be minimal and therefore have no impact on the actual token generation (the probabilities might differ slightly in precision, but the sampled token should be the same, assuming one fixes the random seed for sampling). Any thoughts on this @LaurentMazare @hugoabonizio?
@LaurentMazare Unfortunately, this result is based on an internal benchmark suite, and not all of the datasets are public. However, I'll try to run the same kind of evaluation on public datasets to make it reproducible.
@jorgeantonio21 I wouldn't expect sampled outputs to be identical, because many factors affecting the sampling process differ between the implementations. However, with greedy sampling I was expecting the results to match, since the output probabilities should (hopefully) be the same.
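A tiny illustration of why even greedy decoding can diverge: when the top two logits are nearly tied, a perturbation on the order of float32 rounding is enough to flip the argmax. The logit values below are made up:

```python
import numpy as np

# Two near-tied logits, as can happen at any decoding step.
logits_a = np.array([5.000001, 5.000000, 1.0], dtype=np.float32)
# The "same" logits after a slightly different op ordering.
logits_b = logits_a + np.array([-3e-6, 0.0, 0.0], dtype=np.float32)

# Greedy sampling picks the argmax; the tiny perturbation flips it.
print(int(np.argmax(logits_a)), int(np.argmax(logits_b)))
```

Once a single token flips, the whole continuation diverges, because every later step conditions on a different prefix, which is consistent with generations that look plausible but don't line up token-for-token.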
Related Issues (20)
- Improving the versatility of Tensor::slice_assign
- Automatically upcasting GGUF values
- Improve extracting values from `gguf_file::Value`
- [question] difference with tvm-unity / mlc-llm
- Linear layer with same weights, biases, and inputs gives different output than Pytorch
- Dynamic linking feature breaks pyo3 wrappers
- unsupported op_type STFT for op
- ONNX: MaxPool with pads != 0
- Qwen2: can not run with the latest Qwen2 models
- How to Implement New Operators Using CUDA Host Functions Along with Thrust and CUB Libraries
- Implement unfold function
- Implement torch.scatter
- Provide a simple Stable Diffusion 3 (SD3) inference example
- Status of Apple silicon M3 GPU support?
- Quantized-t5 models on Cuda
- WASM library examples require HTTPS
- How to select which GPU to use
- Metal memory leak multiplying matrices
- Is it advisable to avoid variable shadowing when using Candle?
- candle-flash-attn infinite compile time