
Comments (5)

martindevans commented on September 26, 2024

Batched inference is not user-friendly

That's mostly because it's not designed to be 😆

The BatchedExecutor is the "minimum viable product" to expose low level primitives in a safe way to C# - the main idea is that there should never be a point to using the lower level APIs, because BatchedExecutor exposes everything in a safer way without any speed cost. I think that's mostly done: the current API does not contain any pointers, doesn't expose any operations that can lead to memory leaks, and lifts the fairly primitive llama.cpp API into a higher level object-oriented API.

My intention with the BatchedExecutor has always been that most end users don't use it directly; instead it acts as the foundation that all of the higher level APIs can be built on. For example, something like the current executors could be written so that they wrap a single Conversation object, and multiple different executors could all use the same batch, which transparently speeds everything up.
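To make that concrete, here's a minimal sketch of driving the BatchedExecutor directly (method names are from the current API as I remember it, so exact signatures may differ):

```csharp
using LLama;
using LLama.Batched;
using LLama.Common;

// Load one model and share it between every conversation in the batch.
var parameters = new ModelParams("path/to/model.gguf");
using var model = LLamaWeights.LoadFromFile(parameters);
using var executor = new BatchedExecutor(model, parameters);

// Each Conversation is an independent sequence within the shared batch.
using var conversation = executor.Create();
conversation.Prompt(executor.Context.Tokenize("The quick brown fox"));

// One Infer() call evaluates the pending work of all conversations at once.
await executor.Infer();
```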

I haven't been pushing for anyone to use it until recently because I've only just reached feature parity with the addition of loading/saving individual conversations in #681!

Mid level APIs

Going from this diagram I would say BatchedExecutor can currently provide:

  • LLM Engine: it runs the LLM, so I guess it does this 😁
  • Sequence: A Conversation is a sequence.
  • KV Cache Manager: Individual conversations can be forked (sharing cache), rewound (dropping cache items), shifted (freeing up some cache space), and there is an API for arbitrary KV manipulations for people who know what they're doing (see the sketch below).
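A rough sketch of those cache operations on a Conversation (exact names may differ slightly):

```csharp
// Fork: the branch shares all KV cache computed so far, so the common
// prefix is never evaluated twice.
using var branch = conversation.Fork();

// Rewind: drop the last N tokens from this conversation's cache,
// e.g. to retry generation from an earlier point.
conversation.Rewind(8);
```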

Thoughts on the other parts of that diagram:

Sampling

There is the entire sampling pipeline API I developed (see here) which I think serves this. A sampling pipeline can be put together by implementing ISamplingPipeline and calling the various sampling methods. This gives direct access to the logits (so you could implement an entirely custom sampler if you wanted) but is also easy to use by just chaining some methods together if you want to (e.g. here's the default pipeline, which does a lot of things but is still fairly understandable).
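For example, rather than implementing ISamplingPipeline from scratch, you can just configure the stock pipeline (property names approximate):

```csharp
using LLama.Sampling;

// Configure the stock pipeline instead of implementing ISamplingPipeline.
var pipeline = new DefaultSamplingPipeline
{
    Temperature = 0.7f,
    TopK = 40,
    TopP = 0.9f,
    RepeatPenalty = 1.1f,
};
```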

Scheduler

This is a tricky one that I haven't done any work on. I assume you mean something to schedule when inference is run, to maximise the work done in a single batch while minimising latency? That's probably the hardest part of batched inference: you need to bring together all the work into a batch before calling Infer, and that definitely needs some kind of higher level system to help schedule it.
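Purely as an illustration of the shape such a thing might take (Request and NaiveScheduler don't exist anywhere, this is just one possible design):

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;
using LLama.Batched;

// Hypothetical request type, for illustration only.
public record Request(string Prompt);

public class NaiveScheduler
{
    private readonly BatchedExecutor _executor;
    private readonly ConcurrentQueue<Request> _pending = new();
    private readonly List<(Conversation Conversation, Request Request)> _active = new();
    private readonly int _maxConcurrency;

    public NaiveScheduler(BatchedExecutor executor, int maxConcurrency = 8)
    {
        _executor = executor;
        _maxConcurrency = maxConcurrency;
    }

    public void Enqueue(Request request) => _pending.Enqueue(request);

    // Call this in a loop: it admits queued requests between inference
    // steps so the batch stays full (continuous batching).
    public async Task StepAsync()
    {
        while (_active.Count < _maxConcurrency && _pending.TryDequeue(out var request))
        {
            var conversation = _executor.Create();
            conversation.Prompt(_executor.Context.Tokenize(request.Prompt));
            _active.Add((conversation, request));
        }

        if (_active.Count > 0)
        {
            // A single Infer() advances every active conversation at once.
            await _executor.Infer();
            // Sampling, streaming results and retiring finished
            // conversations are omitted for brevity.
        }
    }
}
```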

Stopping Criteria

Not something I've worked on much at all, since it comes after inference and sampling which have been my main focus. Definitely something we need though!
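Something as simple as this hypothetical interface might be enough to start with (nothing like it exists yet):

```csharp
using System;
using LLama.Native;

// Hypothetical abstraction, sketched for discussion only.
public interface IStoppingCriteria
{
    // Return true once generation should stop for this sequence, e.g. an
    // EOS token was emitted, a stop string matched, or a token budget hit.
    bool ShouldStop(ReadOnlySpan<LLamaToken> tokensSoFar, string textSoFar);
}
```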

Other Things

I think some other things I would add to the "mid level" API list would be:

Templating. We need the low level implementation of templating - taking some text and transforming it into alternative text according to the template.

We probably also need the higher level implementation (something like ChatSession/ChatHistory) which represents the history in an object oriented way and can be manipulated in ways that make sense at the lower level (e.g. rewind, fork and shift can all be done at the high level and map down into low level KV manipulations).
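For the low level half, the transformation is essentially string rewriting. For example, a ChatML-style template might look like this (purely illustrative; real templates vary per model and should ideally come from the model metadata):

```csharp
using System.Collections.Generic;
using System.Text;

// Illustrative only: render a chat history into ChatML-style text.
static string RenderChatML(IEnumerable<(string Role, string Content)> history)
{
    var sb = new StringBuilder();
    foreach (var (role, content) in history)
    {
        sb.Append("<|im_start|>").Append(role).Append('\n');
        sb.Append(content).Append("<|im_end|>\n");
    }
    sb.Append("<|im_start|>assistant\n"); // cue the model to respond
    return sb.ToString();
}
```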

Embeddings. There seem to be a lot of changes coming in how llama.cpp handles embeddings - generative models, embedding models, pooling techniques etc. Our current LLamaEmbedder is very primitive; at the very least it could be made into something that uses a batch to generate lots of embeddings at once, much faster than it currently does.
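As a purely hypothetical sketch of where that could go (this is not the current LLamaEmbedder API):

```csharp
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical batched embedder shape, not the current LLamaEmbedder API.
public interface IBatchedEmbedder
{
    // Embed many inputs in a single batch, amortising per-call overhead.
    Task<IReadOnlyList<float[]>> EmbedAsync(
        IReadOnlyList<string> inputs,
        CancellationToken cancellationToken = default);
}
```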

High Level APIs

I think these would probably be better off split into separate packages? Our current high level APIs have become a bit of a mess over time as the low level has shifted underneath them; splitting into separate packages somewhat prevents that becoming an issue in the future.

That would leave LLamaSharp providing the core things that everyone needs (low and mid level) and then separate special purpose packages providing other specific use cases, e.g. individual NuGet packages for:

  • Chat
  • OpenAI Style API
  • Semantic Kernel
  • Kernel Memory
  • RAG
  • Web backend


AsakusaRinne commented on September 26, 2024

Going from this diagram I would say BatchedExecutor can currently provide:
LLM Engine: it runs the LLM, so I guess it does this 😁
Sequence: A Conversation is a sequence.
KV Cache Manager: Individual conversations can be forked (sharing cache), rewound (dropping cache items), shifted (freeing up some cache space) and there is an API for arbitrary KV manipulations for people who know what they're doing.

Yes, in my prototype I referred to the implementation of LLamaBatch. It's lucky there was some existing code I could borrow from!

Scheduler: I assume you're meaning something to schedule when inference is run to maximise the work done in a single batch but minimise the latency?

Yes, and it's also responsible for continuous batching. I think that's important for building LLM servers, because requests may arrive at any time.

I think some other things I would add to the "mid level" API list would be...

I could try to figure out how to improve the embedding APIs as this proposal moves forward. However, I currently have no idea about the templating. To reduce duplicated work and refactoring, I think we'd better keep the prototype in Experimental until we have taken all the potentially major features into account (if this proposal is approved). 😄

I think these would probably be better off split into separate packages?

In my opinion, I would like to keep the text-completion and chat-completion classes in the main package and put the others in separate packages, such as the server engine, OpenAI-style APIs and RAG. As you can see in #683, LLM (text-completion) is only a very thin wrapper around LLMEngine. :)


martindevans commented on September 26, 2024

(Just to note I haven't looked at #683 yet. I wasn't suggesting things that should be added to that specific PR, just the general direction of the project overall for the next 12 months!)


SignalRT commented on September 26, 2024

@AsakusaRinne The overall idea seems good to me. But I have the following observations:

  1. Designing the API based on the highest-level APIs seems like the right idea to me. I think that if we don't propose the higher level APIs before prototyping the solution, we will end up forcing the mid-level APIs to bend to whatever the high level APIs need.
  2. Any use of the library beyond local/desktop usage will require scaling the solution out via a Web API and multiple instances of the LLM to serve requests. That means that a client for the Web API should be a first-class citizen of the library.
  3. I think that we need to provide a template system. One of the hardest things for anyone starting out with LLMs is the time it takes to understand the right way to prompt a model before they can use it properly.

I will begin to provide feedback on the prototype.


AsakusaRinne commented on September 26, 2024

Any use of the library beyond local/desktop usage will require scaling the solution out via a Web API and multiple instances of the LLM to serve requests. That means that a client for the Web API should be a first-class citizen of the library.

Agreed. Currently we ask users to build the web API from the mid-level APIs themselves, and it's difficult for them to apply batched inference. The server engine in this proposal provides a class to deal with parallel inference and, as you said, multiple LLM instances, making it easy for users to build a high-performance web API.

I think that we need to provide a template system. One of the hardest things for anyone starting out with LLMs is the time it takes to understand the right way to prompt a model before they can use it properly.

That's a good idea. However, it seems that #670 and this proposal will consume all my free time, so I'm afraid I won't be available for it in the next 3 months. If you find it would help to modify some parts of this proposal for it, I'll be more than happy to help and discuss it with you. :)

