Comments (4)
Specific optimizations for smaller models (~100M parameters).
- Improve sampling efficiency.
- We may need to merge more models.
This should not be prioritized: CacheFlow's core technique (memory saving) is not helpful for small models at all, though they may still benefit from iteration-level scheduling.
from vllm.
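The iteration-level scheduling mentioned above can be sketched roughly as follows. This is a minimal illustration, not vLLM's actual scheduler: the `Request` class, the `step` placeholder, and the `max_batch_size` parameter are all hypothetical names chosen for this example. The point is that the batch is re-formed at every decoding iteration, so finished requests leave immediately and waiting requests join without waiting for the whole batch to drain.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt: str
    max_new_tokens: int
    tokens: list = field(default_factory=list)

def step(batch):
    # Placeholder for one model forward pass: each running request
    # receives exactly one new token per iteration.
    for req in batch:
        req.tokens.append("<tok>")

def schedule(requests, max_batch_size=8):
    """Iteration-level scheduling: rebuild the running batch on every
    decoding step instead of once per batch of requests."""
    waiting = deque(requests)
    running, finished = [], []
    while waiting or running:
        # Admit waiting requests up to the batch-size limit.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())
        step(running)  # one decoding iteration for the whole batch
        # Retire requests that hit their token budget this iteration;
        # the freed slots are refilled at the top of the next loop.
        finished += [r for r in running if len(r.tokens) >= r.max_new_tokens]
        running = [r for r in running if len(r.tokens) < r.max_new_tokens]
    return finished

done = schedule([Request("short", 2), Request("long", 5)])
```

With request-level scheduling, the short request would be held in the batch until the long one finished; here it exits after its second iteration. For small models the per-step compute is cheap, so this scheduling win matters more than KV-cache memory savings.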
After the C++ version, we might need to rerun all the experiments with the new implementation.
@zhuohan123 can this work be considered complete?
If you're interested: I've been building a C++-native deep learning framework for the past few years that I hope to open-source soon. It aims for optimal performance; here is the framework training AlexNet, where most of the kernels are cuBLASLt and cuDNN:
I'd certainly like for it to be part of vLLM. Is this something there'd be interest in? If so, I can make sure I support (or add support for) all of the pieces you need and get it connected. I can provide access to the private repo on request.
Related Issues (20)
- [Bug]: When enforce_eager is True or False, the paged_attention version used is inconsistent
- VRAM usage when loading the model
- [Bug]: Segmentation fault encountered while running model
- [New Model]: fastspeech2_conformer (just needs a new attention mechanism: RelPositionMultiHeadedAttention)
- [New Model]: Blip2 support required
- [Bug]: Why the logits differ between 0.4.1 and 0.4.2
- [Bug]: SqueezeLLM with sparse could not work
- [Bug]: ValueError when using LoRA with CohereForCausalLM model
- [RFC]: Support specifying quant_config details in the LLM or Server entrypoints
- [Usage]: vLLM AutoAWQ with 4 GPUs doesn't utilize GPU
- [Usage]: How to batch requests to chat models with OpenAI server?
- [Usage]: prompt_logprompt from endpoint
- Regression in support of customized "role" in OpenAI-compatible API (v0.4.2)
- [Bug]: CUDA error when running mistral-7b + LoRA with tensor_para=8
- [Performance]: Why the avg. throughput generation is low?
- [Feature]: Support W4A8KV4 quantization (QServe/QoQ)
- [Feature]: Could paged_attention_v1 support parameter 'attn_bias'
- [Feature]: CI: Test on NVLink-enabled machine
- [Feature]: Host CPU Docker image on Docker Hub
- [Bug]: Unexpected special tokens in prompt_logprobs output for Llama3 prompt