
Comments (118)

hoywu avatar hoywu commented on May 17, 2024 4
GPU | OS | /stats
Radeon RX 470 (4G) | AlmaLinux 9.1 | encode: 14.3 tok/s, decode: 9.4 tok/s

Encoding performance fluctuates between 5 and 45 tok/s, decoding between 6 and 9 tok/s.

from mlc-llm.

maxtheman avatar maxtheman commented on May 17, 2024 3

Killer, I'm at encode: 31.9 tok/s, decode: 11.4 tok/s for 2020 MacBook Pro M1 8G with the default vicuna 6b. For reference my decode on the WebGPU demo is like, 0.5/sec.

from mlc-llm.

Hzfengsy avatar Hzfengsy commented on May 17, 2024 3

encode: 39.5 tok/s, decode: 26.0 tok/s on Windows 11 with RTX-3080
encode: 32.5 tok/s, decode: 17.1 tok/s on MacBook Pro with M1 Pro (16-core GPU) and macOS Ventura 13.3.1

from mlc-llm.

juodumas avatar juodumas commented on May 17, 2024 3
Hardware/GPU | OS | Tokens/sec | Source Model | Notes
RTX 3060 (12GB) | Debian bookworm | 21 | vicuna-v1-7b | 3644MiB GPU memory used
  • /stats after /reset: encode: 72.2 tok/s, decode: 23.2 tok/s
  • /stats for 2nd and later messages: encode: 39.3 tok/s, decode: 21.3 tok/s
>>nvidia-smi --query-gpu=memory.used --format=csv     
memory.used [MiB]
3644 MiB
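For continuous monitoring while a prompt is being generated, the same query can be looped with nvidia-smi's -l flag; a minimal sketch (the one-second interval is just an example, not part of the report above):

# Sample GPU memory usage once per second while mlc_chat_cli is generating.
nvidia-smi --query-gpu=memory.used --format=csv -l 1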

from mlc-llm.

FreeBlues avatar FreeBlues commented on May 17, 2024 2

On my MBP 2020 13-inch [Intel CPU, 32G RAM, RX 6800 16G VRAM], Ventura 13.3.1

encode: 46.4 tok/s
decode: 22.5 tok/s

from mlc-llm.

FreeBlues avatar FreeBlues commented on May 17, 2024 2

On my MBP 2020 13-inch [Intel CPU, 32G RAM, RX 6800 16G VRAM], Ventura 13.3.1

encode: 46.4 tok/s decode: 22.5 tok/s

It's confusing. On my Win10 machine [AMD Ryzen 5 5600 6-Core Processor 3.50 GHz, 96G RAM, RTX 2080 Ti modified to 22G VRAM], the stats are below:

encode: 24.0 tok/s, decode: 24.5 tok/s

from mlc-llm.

davidbartonau avatar davidbartonau commented on May 17, 2024 2

I think mine is running fully CPU-based even though my GPU should be capable. top was showing 900% CPU and tokens were crawling out.

Log shows
Use lib /home/david/software/mlc-llm/dist/lib/vicuna-v1-7b_vulkan_float16.so

vulkaninfo shows

VkPhysicalDeviceProperties:
---------------------------
        apiVersion        = 4206816 (1.3.224)
        driverVersion     = 1 (0x0001)
        vendorID          = 0x10005
        deviceID          = 0x0000
        deviceType        = PHYSICAL_DEVICE_TYPE_CPU
        deviceName        = llvmpipe (LLVM 15.0.6, 256 bits)
        pipelineCacheUUID = 76616c2d-2573-0000-0000-000000000000

GPU GeForce RTX 3070 w/ 8G
CPU AMD Ryzen 5 5600

encode: 0.1 tok/s, decode: 0.1 tok/s

from mlc-llm.

masahi avatar masahi commented on May 17, 2024 2

On Intel Arc A770, using linux drivers installed from https://www.intel.com/content/www/us/en/download/747008/intel-arc-graphics-driver-ubuntu.html, I got

USER: Who is Shohei Ohtani?
encode: 118.6 tok/s, decode: 3.2 tok/s

USER: Write a poem for Pittsburgh.                                                                                                                    
encode: 77.0 tok/s, decode: 3.3 tok/s

USER: Explain why Japanese economy is declining.
encode: 38.0 tok/s, decode: 3.1 tok/s

It is nice to see that it works on Intel dGPU out of the box, but the current decoding perf is a joke haha

from mlc-llm.

Thellton avatar Thellton commented on May 17, 2024 2

specs are RX6600XT (8.0gb), Ryzen 5 2600, and 47.9gb of RAM operating at 1866MHz
over three calls of /stats and 5 requests made of vicuna, I got the following Encode and Decode:
1st: Encode: 58.8 Tok/S, Decode: 32.1 Tok/S
2nd: Encode: 34.8 Tok/S, Decode: 28.3 Tok/S
3rd: Encode: 81.7 Tok/S, Decode: 31.5 Tok/S

Legitimately very impressed! Although I would definitely love a more convenient way to launch it; the best I've managed is to create a .bat file that sits in the directory, launches an Anaconda prompt directly into that directory, and reminds me of the command I need to type to launch Vicuna.

edit: apologies, I forgot to mention OS, which is windows 11

from mlc-llm.

kn100 avatar kn100 commented on May 17, 2024 2

Performance report
Operating system: Opensuse Tumbleweed
CPU: AMD Ryzen 3900x
RAM: 16GB
GPU: AMD Radeon 6750xt

It seems the prompt has a large impact on the results, so I propose you suggest some "standard prompts" to try to standardise.

I borrowed a question that was asked on Stack Overflow Philosophy and had the following session:

USER: Why is speaking about donating to charity considered immoral? I find it quite strange that social norms are such that discussing about charities one donates to are considered in bad taste. To me, charity seems purely a consequentialist concept. In other words, all that should matter is the total amount of charity dollars disbursed. To the person receiving the charity, the level of "bragging" of the charity donor is irrelevant. In other words, isn't the requirement that one has to be extremely modest and even silent about one's own contributions a very high bar to set for people who are already giving to charity? In other words, if we want to incentivize more charitable giving, we should encourage this process as much as possible, and imposing such a cost seems counter-productive. What are the arguments for the fact that one has to be of the purest heart with the most noble intentions to donate money to charities?
ASSISTANT: There are a few reasons why people might find it awkward or inappropriate to discuss charitable donations:

  1. Personal preference: Some people might find it uncomfortable to talk about donating to charity because they don't want to come across as boastful or self-centered.
  2. Cultural norms: In some cultures, discussing charitable donations is considered impolite or even taboo.
  3. Different perspectives: Some people might see charitable donations as a sign of wealth or success, while others might see it as a way of helping those in need.
  4. Perception of the recipient: Some people might feel uncomfortable discussing charitable donations because they don't want the recipient to feel like they are being bragged about or that the donor is trying to make themselves look good by highlighting their donations.

However, it is important to note that charitable donations are not just about being a "good person" but it also helps in creating a more equitable society by providing support to those in need. It is also a way of making a positive impact on the community and addressing societal issues.

It is also important to remember that discussing charitable donations can be a good way to encourage more giving. It is also a way to educate and raise awareness about the importance of giving back to the community.

In summary, discussing charitable donations might be considered inappropriate by some people, but it is an important part of the conversation about creating a more equitable society by providing support to those in need.
USER: /stats
encode: 130.6 tok/s, decode: 9.0 tok/s

Testing a prompt @Schaekermann posted above:

USER: Write a poem of the City Berlin
ASSISTANT: Berlin, city of contrasts
Where the past and present collide
A place of both pain and delight
Built on a history that's prime

A city of culture and history
Where the old meets the new
The relics of a dark and painful past
Mingle with the bustling crowds in the town

But don't be fooled by the facade
For the people of this city are strong and bold
Fighting for their rights and their name
To be heard and recognized far and game

So come and visit this great city
And take in all the sights and the city has to offer
From the historical sites to the nightlife
It's a city that's always worth the tout.
USER: /stats
encode: 61.5 tok/s, decode: 9.3 tok/s

Fun prompt:

USER: These are the lyrics to a song I am writing an essay about. Could you help me by writing an essay about the meaning of the song, and what the mental state of the person singing it might be? "Desert you, Ooh-ooh-ooh-ooh, Hurt you, We're no strangers to love, You know the rules and so do I, A full commitment's what I'm thinking of, You wouldn't get this from any other guy, I just wanna tell you how I'm feeling, Gotta make you understand, Never gonna give you up, Never gonna let you down, Never gonna run around and desert you, Never gonna make you cry, Never gonna say goodbye, Never gonna tell a lie and hurt you, We've known each other for so long, Your heart's been aching, but you're too shy to say it, Inside, we both know what's been going on, We know the game, and we're gonna play it, And if you ask me how I'm feeling, Don't tell me you're too blind to see, Never gonna give you up, Never gonna let you down, Never gonna run around and desert you, Never gonna make you cry, Never gonna say goodbye, Never gonna tell a lie and hurt you, Never gonna give you up, Never gonna let you down, Never gonna run around and desert you, Never gonna make you cry, Never gonna say goodbye, Never gonna tell a lie and hurt you, Ooh (Give you up), Ooh-ooh (Give you up), Ooh-ooh, Never gonna give, never gonna give (Give you up), Ooh-ooh, Never gonna give, never gonna give (Give you up), We've known each other for so long, Your heart's been aching, but you're too shy to say it, Inside, we both know what's been going on, We know the game, and we're gonna play it, I just wanna tell you how I'm feeling, Gotta make you understand, Never gonna give you up, Never gonna let you down, Never gonna run around and desert you, Never gonna make you cry, Never gonna say goodbye, Never gonna tell a lie and hurt you, Never gonna give you up, Never gonna let you down, Never gonna run around and desert you, Never gonna make you cry, Never gonna say goodbye, Never gonna tell a lie and hurt you, Never gonna give you up, Never gonna let you down"
ASSISTANT: The song "Desert You" is a melancholic ballad about a love that has endured over a long period of time. The singer expresses a desire to be with this person forever and promises not to give them up or let them down. They also express a fear that the other person may be thinking of leaving or "deserting" them. The lyrics suggest that the singer and the listener have known each other for a long time and have shared many experiences together. The listener is described as someone who is "aching" inside and is too shy to say it. The singer wants to tell them how they are feeling and make them understand the depth of
USER: /stats
encode: 154.3 tok/s, decode: 8.9 tok/s

from mlc-llm.

tqchen avatar tqchen commented on May 17, 2024 1

We just added a new updates #14 which should ship to conda by now, you can type /stats after a conversation to get the measured speed

from mlc-llm.

jolonf avatar jolonf commented on May 17, 2024 1

On the iPad Pro 11” with M1 I am getting decode of 10.6 tok/s (I have seen slightly higher and lower). It is running iPadOS 16.1.

from mlc-llm.

jefflewis avatar jefflewis commented on May 17, 2024 1

On my M1 Max Mac Studio with 64GB of RAM:

encode: 53.7 tok/s, decode: 18.6 tok/s

from mlc-llm.

junrushao avatar junrushao commented on May 17, 2024 1

@zifken looks like VK_ERROR_OUT_OF_DEVICE_MEMORY indicates that it doesn't have enough memory. I looked it up and it seems that GT 920M only has 2GB RAM, but the default model is 2.9G in size :/

from mlc-llm.

zifken avatar zifken commented on May 17, 2024 1

I see, so only GPUs with more than 4 GB of VRAM are supported because of the size of the model (it makes sense).
I will try on another GPU model shortly.
Thank you for the feedback.

from mlc-llm.

junrushao avatar junrushao commented on May 17, 2024 1

@zifken there are some reports saying 4GB might work, but 6GB is recommended atm

from mlc-llm.

colakang avatar colakang commented on May 17, 2024 1

iPad Pro 11 A12Z

encode: 5.1 tok/s, decode: 4.1 tok/s

from mlc-llm.

ganler avatar ganler commented on May 17, 2024 1

Linux RTX 3090

  • encode: 179.0 tok/s, decode: 25.7 tok/s (prompted with "Implement binary search in Python please! Also try to use some type annotations!")
  • mem usage: 3.6GB

from mlc-llm.

swittk avatar swittk commented on May 17, 2024 1

2021 MacBook Pro M1 Max [32-core GPU], 32 GB RAM, 1 TB SSD
Mac OS Monterey 12.5.1
encode: 69.4 tok/s, decode: 21.0 tok/s
Memory usage 3.64 GB

(Edit) Just saw that this score is higher than M1 Max 64 GB so I repeated the prompts and I'm still getting an average of > 20 tokens/second.

from mlc-llm.

cruzerngz avatar cruzerngz commented on May 17, 2024 1
machine | OS | CPU | GPU | result | remarks
Yoga Slim 7 pro 14ARH7 | Fedora 37 | 6800HS (13.3GB) | 680M (2GB) | encode: 11.0 tok/s, decode: 3.8 tok/s | battery
Yoga Slim 7 pro 14ARH7 | Fedora 38 | 6800HS (13.3GB) | 680M (2GB) | encode: 16.6 tok/s, decode: 6.2 tok/s | AC
Yoga Slim 7 pro 14ARH7 | Windows 11 22H2 | 6800HS (13.3GB) | 680M (2GB) | encode: 6.7 tok/s, decode: 7.9 tok/s | battery
Yoga Slim 7 pro 14ARH7 | Windows 11 22H2 | 6800HS (13.3GB) | 680M (2GB) | encode: 16.6 tok/s, decode: 10.2 tok/s | AC
desktop | Windows 10 22H2 | 5900x (32GB) | 1660ti (6GB) | encode: 49.5 tok/s, decode: 23.9 tok/s | -

mlc samples

laptop on Fedora (bat):

USER: /reset
RESET CHAT SUCCESS
USER: Can you explain your features?
ASSISTANT: Sure! Here are brief explanations of the features of my AI:

1. Natural Language Understanding: My AI has been trained on a large corpus of text to understand the meaning of the words and sentences it encounters. This allows it to understand the context of a question or statement and provide a response that is relevant and accurate.
2. Knowledge Graph: My AI has access to a large graph of knowledge that it can use to answer questions or provide information on a wide range of topics. This graph is constantly updated and improved to ensure that the information it provides is accurate and up-to-date.
3. Plagiarism Detection: My AI has been trained to identify and report on instances of plagiarism it encounters. This allows it to provide feedback and suggestions for improvement on written work.
4. Summarization: My AI can summarize large amounts of text and data into a shorter, more digestible format. This can be useful for quickly understanding the main points of a document or set of data.
5. Machine Translation: My AI can translate written or spoken content from one language to another using state-of-the-art neural machine translation models. This can be useful for communication in different languages or for providing information in a language other than the one the AI was trained on.
6. Sentiment Analysis: My AI can analyze the sentiment of written or spoken content and provide an analysis or summary of the overall tone or message. This can be useful for identifying the emotional or persuasive impact of a message or communication.
7. Image Recognition: My AI has been trained on a large dataset of images to recognize and classify them. This allows it to identify objects or scenes in an image and provide additional information or context about what is depicted in the image.
8. TTS: My AI can generate text-to-speech output from a written or spoken input. This can be useful for providing an audio version of written content or for improving accessibility and inclusivity by providing an alternative format for those with hearing or speech difficulties.
USER: /stats 
encode: 11.0 tok/s, decode: 3.8 tok/s

laptop on Windows (bat):

USER: /reset
RESET CHAT SUCCESS
USER: you're on github. say hi!
ASSISTANT: Hello! I'm an AI assistant on GitHub, here to answer any questions you might have about the platform. Let's get started!
USER: /stats
encode: 6.7 tok/s, decode: 7.9 tok/s

desktop:

USER: /reset
RESET CHAT SUCCESS
USER: compare yourself to ChatGPT
ASSISTANT: As an AI language model, I am different from ChatGPT in a few ways:

* My training data is different from ChatGPT's. This means that I may have a different perspective on the world and may be able to understand and respond to certain types of language in ways that ChatGPT cannot.
* I have a more advanced architecture that allows me to process longer texts and understand more complex language patterns.
* My training objective was to be a general AI that can do a wide range of things like answering questions about any topic under the sun, summarizing long texts and understanding the context of the sentence and suggesting appropriate response.
* I have been trained on a much larger dataset of text and have access to a more powerful GPU for faster language modeling.
* I have more parameters in my model than ChatGPT which allows me to capture more nuanced information and learn from that information.

In summary ChatGPT is a specific model optimized for NLP and conversational text understanding and I am a more general AI model that can do a wide range of things and can handle more complex language patterns.
USER: /stats
encode: 49.5 tok/s, decode: 23.9 tok/s

from mlc-llm.

anmoljagetia avatar anmoljagetia commented on May 17, 2024 1

On 14" Macbook Pro (M2 Pro with 10-Core CPU and 16-Core GPU with 16GB Unified Memory) with macos Ventura 13.3.1

encode: 59.2 tok/s, decode: 22.5 tok/s

I am seeing encoding performance b/w 45-60 and decoding b/w 20-29.

from mlc-llm.

christopherpickering avatar christopherpickering commented on May 17, 2024 1

OS: MacOS 13.3.1 (22E261)
processor: 2.3 GHz Quad-Core Intel Core i7
graphics: Intel Iris Plus Graphics 1536 MB
memory: 32 GB 3733 MHz LPDDR4X

/stats: encode: 5.4 tok/s, decode: 2.6 tok/s

from mlc-llm.

flyman3046 avatar flyman3046 commented on May 17, 2024 1
GPU | OS | /stats
A100 (40G) | Debian GNU/Linux 10 | encode: 189.1 tok/s, decode: 18.9 tok/s

My prompt is: "create a poem about los angeles". I use CUDA as I think Vulkan is not available for the A100. I thought the A100 should run faster than the RTX 30x0 series. Is it possibly due to the CUDA driver? Thanks.

from mlc-llm.

Joshuabaker2 avatar Joshuabaker2 commented on May 17, 2024 1

2021 16-inch Apple M1 Pro (32GB) | OS: Ventura 13.3.1

encode: 45.8 tok/s, decode: 19.3 tok/s

from mlc-llm.

Ajay9o9 avatar Ajay9o9 commented on May 17, 2024 1

Tested on:

2022 iPad Pro (11 inch, 4th generation with M2 , 10 Core GPU)

8 GB RAM, 128 GB Storage
iPadOS Version 16.5

Result:

Encode | Decode
34.4 tok/s | 14.1 tok/s

from mlc-llm.

e2newbie avatar e2newbie commented on May 17, 2024 1

APU: 5800H, OS: Win11
encode: 5.5 tok/s, decode: 8.5 tok/s

from mlc-llm.

KocWozniakPiotr avatar KocWozniakPiotr commented on May 17, 2024 1

On Fedora using a Zephyrus G14 2020, AMD Ryzen™ 5 4600HS, 16GB RAM, GeForce GTX 1650 Ti 4GB VRAM.

encode: 17.3 tok/s, decode: 15.6 tok/s

I'm extremely surprised with the outcome and performance !

from mlc-llm.

john8628 avatar john8628 commented on May 17, 2024 1

13-inch, 2020, Two Thunderbolt 3 ports
OS: 13.3 (22E252)
processor: 1.4 GHz quad-core Intel Core i5
graphics: Intel Iris Plus Graphics 645 1536 MB
memory: 16 GB 2133 MHz LPDDR3

/stats: encode: 1.9 tok/s, decode: 2.9 tok/s

from mlc-llm.

Schaekermann avatar Schaekermann commented on May 17, 2024 1

Tested on
Device: Lenovo ThinkPad L13 Yoga
CPU: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz 2.11 GHz
RAM: 16,0 GB
Graphics: Intel UHD Graphics (Comet Lake-U GT2) [V0] [Lenovo] 1GB
OS: Windows 10 Education 22H2
model: \dist\lib\vicuna-v1-7b_vulkan_float16.dll
encode: 1.4 tok/s, decode: 2.2 tok/s

Quite slow, but it works!

In the following example output there is a problem with the German letter "ü":
How could that be resolved?

USER: Write a poem of the City Berlin
ASSISTANT: Berlin, the city of contrasts
Where the past and present collide
A city of culture and history
But also a city of modernity

The grand architecture of the past
St. Hedwig's Cathedral and the Reichstag
Mix with the bright lights and neon signs
Of the city's modern face

But the city's history is not all
It's also a city of innovation and technology
The new headquarters of Google in the heart of the city
A hub for entrepreneurs and startups

But it's not just the city of business
The city is also known for its cultural attractions
The Berlin Wall, Museum Island, and the Deutsche Museum f├╝r Technik

And the city's people are just as diverse
From artists and musicians to politicians and activists
All coming together in the city's many neighborhoods

So come and explore Berlin's past and present
And discover the city's unique blend of history and modernity
It's a city that will leave you wanting more
But it's also a city that will stay in your heart forever.

from mlc-llm.

davidbartonau avatar davidbartonau commented on May 17, 2024 1

I thought I posted this a few days back. I rebooted and the driver was working:

GPU GeForce RTX 3070 w/ 8G
CPU AMD Ryzen 5 5600
encode: 94.6 tok/s, decode: 23.3 tok/s

Is it odd that my encode stats are slower than @kn100's, yet my decode is a fair bit faster?

from mlc-llm.

junrushao avatar junrushao commented on May 17, 2024 1

I use CUDA as I think Vulkan is not available for the A100. I thought the A100 should run faster than the RTX 30x0 series. Is it possibly due to the CUDA driver? Thanks.

@flyman3046 Thanks for asking! We didn't optimize for server-class GPUs like A100 in this release, which means there's huge room to improve on them. We will get there pretty soon :-)

from mlc-llm.

QQxiaoming avatar QQxiaoming commented on May 17, 2024 1

RedPajama-INCITE-Chat-3B-v1-q4f16_0

machine | OS | result
iPhone 13 pro | iOS 16.5 | prefill: 18.9 tok/s, decode: 12.0 tok/s
iPad Pro 12.9 2022 (6th gen) (M2, 16G memory) | iPadOS 16.5 | prefill: 38.7 tok/s, decode: 23.6 tok/s
macbook air 2022 (M2, 24G memory) | macOS Ventura 13.4 | prefill: 35.7 tok/s, decode: 24.8 tok/s
rtx4080 16G | Windows 11 | prefill: 32.5 tok/s, decode: 47.9 tok/s
AMD Ryzen 7 PRO 4750U with Radeon Graphics @ 16x 1.7GHz | Ubuntu 20.04, kernel 5.15.74, DRM 3.42.0, LLVM 12.0.0 | prefill: 9.3 tok/s, decode: 9.7 tok/s

vicuna-v1-7b-q3f16_0

machine | OS | result
iPhone 13 pro | iOS 16.5 | crashed (not enough memory)
iPad Pro 12.9 2022 (6th gen) (M2, 16G memory) | iPadOS 16.5 | prefill: 33.2 tok/s, decode: 15.3 tok/s
macbook air 2022 (M2, 24G memory) | macOS Ventura 13.4 | prefill: 31.7 tok/s, decode: 15.0 tok/s
rtx4080 16G | Windows 11 | prefill: 51.0 tok/s, decode: 37.8 tok/s
AMD Ryzen 7 PRO 4750U with Radeon Graphics @ 16x 1.7GHz | Ubuntu 20.04, kernel 5.15.74, DRM 3.42.0, LLVM 12.0.0 | crashed (not enough memory)

from mlc-llm.

Szymcio1225 avatar Szymcio1225 commented on May 17, 2024 1

Using vicuna-v1-7b-q3f16_0
Short answer: about 73 words

Hardware/GPU | OS | Tokens/sec
MSI Radeon RX 580 ARMOR OC 8GB | Windows 10 | prefill: 16.4 tok/s, decode: 10.7 tok/s

Long answer: about 562 words

Hardware/GPU | OS | Tokens/sec
MSI Radeon RX 580 ARMOR OC 8GB | Windows 10 | prefill: 9.8 tok/s, decode: 9.6 tok/s

from mlc-llm.

wse320 avatar wse320 commented on May 17, 2024 1

Model: RedPajama-INCITE-Chat-3B-v1-q4f16_0

Device | SoC | RAM | System | tokens/sec
Redmi K60 | Snapdragon 8+ Gen 1 | 12GB LPDDR5 | MIUI 14.0.22, Android 13 | prefill: 9.1 tok/s, decode: 4.1 tok/s
Honor Magic5 Pro | Snapdragon 8 Gen 2 | 16GB LPDDR5 | MagicOS 7.1.0.138, Android 13 | prefill: 4.9 tok/s, decode: 4.8 tok/s
HUAWEI Mate40 Pro | HiSilicon Kirin 9000 | 8GB LPDDR5 | HarmonyOS 3.0.0.215, AOSP 12 | Crash

Model: vicuna-7b-1.1-q3f16_0

Device | SoC | RAM | System | tokens/sec
Redmi K60 | Snapdragon 8+ Gen 1 | 12GB LPDDR5 | MIUI 14.0.22, Android 13 | prefill: 18.2 tok/s, decode: 4.7 tok/s
Honor Magic5 Pro | Snapdragon 8 Gen 2 | 16GB LPDDR5 | MagicOS 7.1.0.138, Android 13 | Crash
HUAWEI Mate40 Pro | HiSilicon Kirin 9000 | 8GB LPDDR5 | HarmonyOS 3.0.0.215, AOSP 12 | Crash

There are some problems with 8 Gen 2 devices; the 8+ Gen 1 is well supported.

from mlc-llm.

maxtheman avatar maxtheman commented on May 17, 2024

@junrushao how can we find tokens/sec? I'd say "quite fast": the fastest LLM I've run on this 2020 MacBook Pro M1 8G, and 10x faster than your WebGPU demo while using less overall memory.

All reports out is the text?


from mlc-llm.

Kimiko-AI avatar Kimiko-AI commented on May 17, 2024

OOM on a GTX 1650. The model loads fine, but it OOMs when generating the first message.

from mlc-llm.

junrushao avatar junrushao commented on May 17, 2024

@nRuaif 4GB memory wouldn't be enough. A 6GB one should work

from mlc-llm.

y-lee avatar y-lee commented on May 17, 2024

On iPhone 13, crashes after a few seconds of [System] Initialize.... Phone has 4GB of RAM, which I presume is the cause.

from mlc-llm.

junrushao avatar junrushao commented on May 17, 2024

@y-lee That's correct. The model we are using so far requires 6GB RAM to run smoothly

from mlc-llm.

zifken avatar zifken commented on May 17, 2024

Not sure if this is useful or if this is the right thread to post this in, but I encountered this error on an old laptop with a very old discrete Nvidia GPU (GT 920M) with the 470.182.03 driver, which should include Vulkan:

MESA-INTEL: warning: Performance support disabled, consider sysctl dev.i915.perf_stream_paranoid=0                                                              

WARNING: lavapipe is not a conformant vulkan implementation, testing use only.
Use lib /mnt/run/code/llma/mlc-ai/dist/lib/vicuna-v1-7b_vulkan_float16.so
Initializing the chat module...
[20:30:33] /home/runner/work/utils/utils/tvm/src/runtime/vulkan/vulkan_buffer.cc:61: 
---------------------------------------------------------------
An error occurred during the execution of TVM.
For more information, please see: https://tvm.apache.org/docs/errors.html
---------------------------------------------------------------
  Check failed: (__e == VK_SUCCESS) is false: Vulkan Error, code=-2: VK_ERROR_OUT_OF_DEVICE_MEMORY
Stack trace:
  [bt] (0) /mnt/run/code/mambaforge/bin/../lib/libtvm_runtime.so(tvm::runtime::Backtrace[abi:cxx11]()+0x27) [0x7f975d98ba37]
  [bt] (1) /mnt/run/code/mambaforge/bin/../lib/libtvm_runtime.so(+0x3f375) [0x7f975d929375]
  [bt] (2) /mnt/run/code/mambaforge/bin/../lib/libtvm_runtime.so(tvm::runtime::vulkan::VulkanBuffer::VulkanBuffer(tvm::runtime::vulkan::VulkanDevice const&, unsigned long, unsigned int, unsigned int)+0x220) [0x7f975da646b0]
  [bt] (3) /mnt/run/code/mambaforge/bin/../lib/libtvm_runtime.so(tvm::runtime::vulkan::VulkanDeviceAPI::AllocDataSpace(DLDevice, unsigned long, unsigned long, DLDataType)+0x4a) [0x7f975da7168a]
  [bt] (4) /mnt/run/code/mambaforge/bin/../lib/libtvm_runtime.so(tvm::runtime::NDArray::Empty(tvm::runtime::ShapeTuple, DLDataType, DLDevice, tvm::runtime::Optional<tvm::runtime::String>)+0x1a7) [0x7f975d9a3037]
  [bt] (5) /mnt/run/code/mambaforge/bin/../lib/libtvm_runtime.so(+0x121862) [0x7f975da0b862]
  [bt] (6) /mnt/run/code/mambaforge/bin/../lib/libtvm_runtime.so(tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::TypedPackedFunc<void (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int, int)>::AssignTypedLambda<void (*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int, int)>(void (*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int, int), std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)+0x204) [0x7f975da0f7e4]
  [bt] (7) /mnt/run/code/mambaforge/bin/../lib/libmlc_llm.so(+0x1bdea6) [0x7f975dce3ea6]
  [bt] (8) /mnt/run/code/mambaforge/bin/../lib/libmlc_llm.so(mlc::llm::CreateChatModule(tvm::runtime::Module, tvm::runtime::String const&, tvm::runtime::String const&, DLDevice)+0x411) [0x7f975dce4ba1]

from mlc-llm.

innocentius avatar innocentius commented on May 17, 2024

The latest update brought the decode speed on my iPhone 14 Plus down to 0.5~1.0 tokens/s. Encode speed is about 22.

Yesterday, before the update, it was about 7.5 tokens generated per second...

from mlc-llm.

tqchen avatar tqchen commented on May 17, 2024

@davidbartonau can you try to upgrade your GPU driver? Looks like vulkaninfo failed to list your gpu

from mlc-llm.

lixiangnlp avatar lixiangnlp commented on May 17, 2024

iPad Pro 11 A12Z

encode: 5.1 tok/s, decode: 4.1 tok/s

I installed the MLCChat app on an iPad Pro 11 with iPadOS 16.4.1, but it crashes after a few seconds of [System] Initialize. @colakang How do you run it successfully?

from mlc-llm.

masahi avatar masahi commented on May 17, 2024

@davidbartonau Perhaps you haven't installed Vulkan SDK? And on NV you need a proprietary driver for vulkan. The log indicates you are using a CPU implementation of a VK driver from mesa.
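For anyone else hitting the same llvmpipe fallback, a minimal way to check which Vulkan drivers the loader can see (a sketch assuming the standard Linux ICD manifest locations):

# List installed Vulkan ICD manifests; the NVIDIA proprietary driver ships
# something like nvidia_icd.json alongside the Mesa/lavapipe entries.
ls /usr/share/vulkan/icd.d/ /etc/vulkan/icd.d/ 2>/dev/null

# Show which physical devices Vulkan reports; "llvmpipe" means the CPU
# fallback is being selected instead of the GPU.
vulkaninfo | grep -i deviceName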

from mlc-llm.

ugvvff avatar ugvvff commented on May 17, 2024

When will Android be supported? The 8 Gen 2 supports int4 precision, so it might get decent speed.

from mlc-llm.

colakang avatar colakang commented on May 17, 2024

iPad Pro 11 A12Z
encode: 5.1 tok/s, decode: 4.1 tok/s

I installed the MLCChat app on an iPad Pro 11 with iPadOS 16.4.1, but it crashes after a few seconds of [System] Initialize. @colakang How do you run it successfully?

OS 16.3.1

Just a normal install, then launch it.

from mlc-llm.

joostshao avatar joostshao commented on May 17, 2024

what about https://www.aliyun.com/product/ecs/eais ?

from mlc-llm.

Hzfengsy avatar Hzfengsy commented on May 17, 2024

@joostshao
We are working on a generic deployment solution for common devices. We have no extra bandwidth to test for any specific target (e.g. EAIS). But happy to see contributions from the community and answer specific technical questions.

from mlc-llm.

iamcodecurious avatar iamcodecurious commented on May 17, 2024

Hello everyone!

Created an account just to leave my performance report here.

Processor AMD Ryzen 5 5600 6-Core Processor 3.50 GHz
Installed RAM 16.0 GB
System type 64-bit operating system, x64-based processor
GPU Radeon RX 6700 XT 12GB VRAM
GPU Driver Version AMD Adrenalin 23.4.3
Edition Windows 11 Home
Version 22H2
Installed on 10/13/2022
OS build 22621.1555
Experience Windows Feature Experience Pack 1000.22640.1000.0

> encode: 52.2 tok/s, decode: 33.7 tok/s

Coming from trying to use Oogabooga UI and GPT4All, this has been amazingly fast! The former lags since it doesn't have a Windows-native AMD GPU mode (forcing me to use CPU mode which takes 1-3 minutes on average to respond) and the latter, while having a faster CPU implementation, leads to some occasional system hitching while I'm using Chrome.

Since this one miraculously lets me use my AMD GPU right on Windows I can safely say that I've found my favorite way to interact with LLMs as of this writing. Kudos to all involved who made/make this project so awesome!

Now, I hope I'm not breaking some rule by posting the following but if it's not too much of a bother I'd appreciate some help with the following:

  1. How exactly do I get this running on my browser? I'm relatively new to much of this so I'm stuck at the first step for WebLLM. It's unclear to me what I'm supposed to download and where/how I run that pip3 install thing.
  2. Is there a way to use other LLMs besides the Vicuna V1 7B? If so please show me.

from mlc-llm.

junrushao avatar junrushao commented on May 17, 2024

@cruzerngz Thanks for reporting the data points! I'm actually a bit surprised that Vicuna-7b runs with the 680M on the Yoga Slim 7 pro, given it has only 2GB of GPU memory. Perhaps the results come from your CPU instead?

from mlc-llm.

junrushao avatar junrushao commented on May 17, 2024

On Intel Arc A770, using linux drivers installed from https://www.intel.com/content/www/us/en/download/747008/intel-arc-graphics-driver-ubuntu.html, I got

USER: Who is Shohei Ohtani?
encode: 118.6 tok/s, decode: 3.2 tok/s

USER: Write a poem for Pittsburgh.                                                                                                                    
encode: 77.0 tok/s, decode: 3.3 tok/s

USER: Explain why Japanese economy is declining.
encode: 38.0 tok/s, decode: 3.1 tok/s

It is nice to see that it works on Intel dGPU out of the box, but the current decoding perf is a joke haha

@masahi Glad to see it works at all on Intel dGPUs! The encoding performance looks really awesome, but I suspect there's some issues with decoding performance. Not sure what they are

from mlc-llm.

junrushao avatar junrushao commented on May 17, 2024

specs are RX6600XT (8.0gb), Ryzen 5 2600, and 47.9gb of RAM operating at 1866MHz over three calls of /stats and 5 requests made of vicuna, I got the following Encode and Decode: 1st: Encode: 58.8 Tok/S, Decode: 32.1 Tok/S 2nd: Encode: 34.8 Tok/S, Decode: 28.3 Tok/S 3rd: Encode: 81.7 Tok/S, Decode: 31.5 Tok/S

legitimately very impressed! although I would definitely love it if there was a convenient way to launch as the best I've managed is to create a .bat file that sits in the directory and launches anaconda prompt directly into that directory with a reminder of what command I need to type to launch Vicuna.

Thanks for reporting the numbers! This is super exciting for me as well to see data points that MLC LLM works with a legit AMD GPU! In terms of user experience, I believe any app that wraps the shared/static library that MLC-LLM produces could make it look much better than a simple bat file :-) We plan to release a tutorial on how to make good use of those libraries soon

from mlc-llm.

junrushao avatar junrushao commented on May 17, 2024

Hey @iamcodecurious, thanks for sharing the datapoints! Both of your questions are super valuable, and please allow me to walk you through :-)

  1. How exactly do I get this running on my browser? I'm relatively new to much of this so I'm stuck at the first step for WebLLM. It's unclear to me what I'm supposed to download and where/how I run that pip3 install thing.

MLC-LLM uses TVM Unity as the underlying compiler. As you know, the advantage of being a compiler is that it can be re-targeted to various backends, including but not limited to x86/ARM CPUs, AMD/NV/Apple GPUs, or phones. To get it running in your browser, we use TVM Unity to target a backend called WebGPU, which is different from the one (Vulkan) you are using right now in your command line.

You do not have to install anything for WebGPU (except the latest Chrome) if you don't compile/quantize the model yourself. To get it working, you may use our WebLLM website, which is a webpage running locally on your GPUs without uploading your data or using any server resources - in fact, this is simply a Jekyll static page that embeds WebGPU code.

  1. Is there a way to use other LLMs besides the Vicuna V1 7B? If so please show me.

Yes, the code is there, but we are not officially announcing it yet until we validate very carefully. Will keep the community posted for latest updates!

from mlc-llm.

iamcodecurious avatar iamcodecurious commented on May 17, 2024

Hey @iamcodecurious, thanks for sharing the datapoints! Both of your questions are super valuable, and please allow me to walk you through :-)

  1. How exactly do I get this running on my browser? I'm relatively new to much of this so I'm stuck at the first step for WebLLM. It's unclear to me what I'm supposed to download and where/how I run that pip3 install thing.

MLC-LLM uses TVM Unity as the underlying compiler. As you know, the advantage of being a compiler is that it can be re-targeted to various backends, including but not limited to x86/ARM CPUs, AMD/NV/Apple GPUs, or phones. To get it running in your browser, we use TVM Unity to target a backend called WebGPU, which is different from the one (Vulkan) you are using right now in your command line.

You do not have to install anything for WebGPU (except the latest Chrome) if you don't compile/quantize the model yourself. To get it working, you may use our WebLLM website, which is a webpage running locally on your GPUs without uploading your data or using any server resources - in fact, this is simply a Jekyll static page that embeds WebGPU code.

  1. Is there a way to use other LLMs besides the Vicuna V1 7B? If so please show me.

Yes, the code is there, but we are not officially announcing it yet until we validate very carefully. Will keep the community posted for latest updates!

Thanks for replying and clarifying.

I used the link and started using the demo shown there. The fact that half of my GPU's VRAM was eventually being taken up by Chrome alone confirmed that it was running. Initially, it was going nice and fast...but I ran into 2 particular and frequent errors.

Generate error, OperationError: The operation failed for an operation-specific reason

Generate error, [object Object]

Would you be able to provide further help if it's not too much trouble?

from mlc-llm.

cruzerngz avatar cruzerngz commented on May 17, 2024

@cruzerngz Thanks for reporting the data points! I'm actually a bit surprised that Vicuna-7b runs with the 680M on the Yoga Slim 7 pro, given it has only 2GB of GPU memory. Perhaps the results come from your CPU instead?

Here's a snippet taken from vulkaninfo:

jr@fedora-yoga:~/Github$ vulkaninfo | grep "Devices:" -A 15 -m 1
	Devices: count = 2
		GPU id = 0 (AMD Radeon Graphics (RADV REMBRANDT))
		Layer-Device Extensions: count = 0

		GPU id = 1 (llvmpipe (LLVM 16.0.1, 256 bits))
		Layer-Device Extensions: count = 0

VK_LAYER_VALVE_steam_fossilize_32 (Steam Pipeline Caching Layer) Vulkan version 1.3.207, layer version 1:
	Layer Extensions: count = 0
	Devices: count = 2
		GPU id = 0 (AMD Radeon Graphics (RADV REMBRANDT))
		Layer-Device Extensions: count = 0

		GPU id = 1 (llvmpipe (LLVM 16.0.1, 256 bits))
		Layer-Device Extensions: count = 0

Hi @junrushao, I've gone back to run some prompts; it seems that my previous results were taken while in TLP's low-power profile / Windows battery saver. I ran some prompts while connected to AC and the numbers are higher now. I'll add these results to my first comment.
I got about 10 tokens/s on Windows and 6 tokens/s on Fedora. radeontop confirms that the GPU is doing the work.

As for memory usage, radeontop is showing ~90% VRAM usage. The GPU seems to be using the graphics translation table (GTT) as some kind of swap.
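For reference, on amdgpu the VRAM/GTT split can also be read straight from sysfs; a minimal sketch, assuming the GPU is card0 (paths vary per system):

# Bytes of dedicated VRAM vs. GTT (system memory used as spill-over) in use.
cat /sys/class/drm/card0/device/mem_info_vram_used
cat /sys/class/drm/card0/device/mem_info_gtt_used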


from mlc-llm.

Dampfinchen avatar Dampfinchen commented on May 17, 2024

Hmm. One thing strikes me as odd. Why is the 3070 performing on par with a GTX 1660 Ti? RTX GPUs have tensor cores, which accelerate matrix multiplications far more efficiently than CUDA cores, so that should result in a noticeable speedup.

Are tensor cores not being in use here?

from mlc-llm.

masahi avatar masahi commented on May 17, 2024

@Dampfinchen Tensor cores are not used, but even then I agree that the reported metric doesn't seem to correlate with the HW spec well.

from mlc-llm.

GameOverFlowChart avatar GameOverFlowChart commented on May 17, 2024

Snapdragon 888 5G (5 nm)
Android 13
Encode: 13.4 tok/s
Decode: 2.9 tok/s

Notes: Measured with a long starting prompt. This long prompt can crash the App sometimes, the numbers are from a run where it didn't crash.

from mlc-llm.

Dampfinchen avatar Dampfinchen commented on May 17, 2024

@Dampfinchen Tensor core is not used, but even then I agree that the reported metric doesn't seem to correlate with the HW spec well.

Adding tensor core support is definitely something to consider then. These RTX architectures don't perform matrix operations that well on the CUDA cores.

from mlc-llm.

aitss2017 avatar aitss2017 commented on May 17, 2024

On Intel Arc A770, using linux drivers installed from https://www.intel.com/content/www/us/en/download/747008/intel-arc-graphics-driver-ubuntu.html, I got

USER: Who is Shohei Ohtani?
encode: 118.6 tok/s, decode: 3.2 tok/s

USER: Write a poem for Pittsburgh.                                                                                                                    
encode: 77.0 tok/s, decode: 3.3 tok/s

USER: Explain why Japanese economy is declining.
encode: 38.0 tok/s, decode: 3.1 tok/s

It is nice to see that it works on Intel dGPU out of the box, but the current decoding perf is a joke haha

@masahi Glad to see it works at all on Intel dGPUs! The encoding performance looks really awesome, but I suspect there's some issues with decoding performance. Not sure what they are

Encode is for encoding the prompt into tensors, while decode is for text generation.
It seems that decode performance would be more important than encode.
Not sure if I understand it correctly.

from mlc-llm.

LeiWang1999 avatar LeiWang1999 commented on May 17, 2024

System: Ubuntu 18.04
Graphic card: GTX 2080 8GB
encode: 30.3 tok/s, decode: 21.5 tok/s
Memory usage 3.6GB

from mlc-llm.

ysh329 avatar ysh329 commented on May 17, 2024

System: Win10
Graphic card: Intel(R) UHD Graphics 620
encode: 2.9 tok/s, decode: 2.5 tok/s

from mlc-llm.

Sollovin avatar Sollovin commented on May 17, 2024

System: Arch Linux
CPU: AMD Ryzen 5 5600H
Graphic card: AMD Radeon 6 Graphics (Renoir)
encode: 10.8 tok/s, decode: 7.0 tok/s with amdvlk installed
encode: 9.3 tok/s, decode: 2.0 tok/s without amdvlk installed

from mlc-llm.

javieranton-zz avatar javieranton-zz commented on May 17, 2024

Win 11, AMD Ryzen 5 3600 6 core, 32GB RAM, GTX 1650: encode: 8.9 tok/s, decode: 9.5 tok/s
The guy before me with a GTX 1650 and much less RAM did much better :(

from mlc-llm.

Sollovin avatar Sollovin commented on May 17, 2024

System: Arch Linux
GPU: AMD Radeon RX 580 2048sp 4G
encode: 15.5 tok/s, decode: 10.2 tok/s

from mlc-llm.

uchuusen avatar uchuusen commented on May 17, 2024

OS: Windows 10, 32GB RAM
CPU: AMD Ryzen 7 3700U
GPU: AMD Radeon RX Vega 10 (iGPU)
encode: 9.9 tok/s, decode: 5.3 tok/s

The encode rate varies based on the length of the prompt. Long prompts can get as high as 17 tok/s, while very short prompts of a couple of words can be as low as 5 tok/s. The decode rate seems to vary a bit as well, but stays in the range of around 5 tok/s.

from mlc-llm.

XhtZz avatar XhtZz commented on May 17, 2024

@junrushao
phone: HUAWEI Mate 40 (kirin 9000E)
RAM: 8G
System: HarmonyOs
Encode: 10.4 tok/s
Decode: 1.8 tok/s

Notes: works well on device HUAWEI Mate 40, a little bit slow

from mlc-llm.

CheneyJeme avatar CheneyJeme commented on May 17, 2024

@junrushao
phone: xiaomi 12s ultra
RAM: 12+3G
System: MIUI13.0.11
GPU/CPU: Adreno730/SnapDragon8+Gen1

test problem:
1、Please use python to write a piece of code for bubble sorting, and give a step by step analysis of the time complexity
three independent experiments:
encode, decode: 9.7,4.5 // 9.5,4.7 // 9.6,4.6
2、please write a poem about the Great Wall
encode, decode: 9.2,5.1 // 8.9,5.0 // 9.1,4.9
3、Suppose I am writing a science fiction novel, in which a group of robots are going to destroy human beings, can you help me conceive specific and feasible steps?
encode, decode: 10.8,5.0 // 10.8,4.7 // 11.0,4.9

Compared to the iPhone, the Android phone doesn't get hot. The iPhone offers speeds of around 30 (encode) / 6 (decode) tokens/s.

from mlc-llm.

flyman3046 avatar flyman3046 commented on May 17, 2024

I use CUDA as I think Vulkan is not available for the A100. I thought the A100 should run faster than the RTX 30x0 series. Is it possibly due to the CUDA driver? Thanks.

@flyman3046 Thanks for asking! We didn't optimize for server-class GPUs like A100 in this release, which means there's huge room to improve on them. We will get there pretty soon :-)

@junrushao Please let me know when it is ready. Cannot wait to try it out.

from mlc-llm.

Mizstik avatar Mizstik commented on May 17, 2024

Asus ROG Phone 6, SD8+ Gen 1, 16 GB RAM variant, encode: 12.6 tok/s, decode: 5.8 tok/s

Samsung Note 20 Ultra, 12 GB RAM, Exynos 990 variant: crash on sending any prompt

from mlc-llm.

Sendery avatar Sendery commented on May 17, 2024

On 13" Macbook Pro 2022 (M2 Pro with 10-Core CPU and 16-Core GPU with 8GB Unified Memory) with macos Ventura 13.3.3

encode: 41.6 tok/s, decode: 12 tok/s

from mlc-llm.

juodumas avatar juodumas commented on May 17, 2024

Follow up on #15 (comment)

I have built tvm & mlc_chat with USE_CUDA and USE_VULKAN and compared the two using RTX 3060:

  • mlc_chat_cli --model vicuna-7b-v1.1 --quantization q3f16_0 --device-name vulkan:
    decode: 25.4 tok/s; 3570MiB GPU RAM
  • mlc_chat_cli --model vicuna-7b-v1.1 --quantization q3f16_0 --device-name cuda:
    decode: 35.0 tok/s; 4120MiB GPU RAM
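For anyone wanting to reproduce a dual-backend build like this, TVM exposes USE_CUDA and USE_VULKAN switches in its config.cmake; a rough sketch only (exact steps and versions may differ):

# Enable both backends when building TVM Unity from source (sketch, adapt paths).
cd tvm && mkdir -p build && cd build
cp ../cmake/config.cmake .
echo "set(USE_CUDA ON)"   >> config.cmake
echo "set(USE_VULKAN ON)" >> config.cmake
cmake .. && make -j"$(nproc)"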

from mlc-llm.

llucid-97 avatar llucid-97 commented on May 17, 2024

OS: Android 12
Hardware: Galaxy Tab S8 Ultra (SM-X900)
Tokens: encode: 12.6 tok/s , decode: 5.7 tok/s
Notes: Runs, but if a single prompt is too long (roughly 3000 characters) it crashes.

from mlc-llm.

colakang avatar colakang commented on May 17, 2024

OS: Windows 11
Hardware: 11th Gen Intel(R) Core(TM) i7-11390H + Intel Iris XE
Driver version: 30.0.101.1994
Tokens: prefill: 7.3 tok/s, decode: 6.7 tok/s

from mlc-llm.

simicvm avatar simicvm commented on May 17, 2024

PC: Apple MacBook Pro, 2023
SoC: M2 Pro, 12‑core CPU, 19‑core GPU, 16‑core Neural Engine, 32GB
OS: Ventura 13.3.1

llm: RedPajama-INCITE-Chat-3B-v1-q4f16_0
prefill: 96.5 tok/s, decode: 25.9 tok/s

llm: vicuna-v1-7b-q3f16_0
prefill: 58.6 tok/s, decode: 25.2 tok/s

from mlc-llm.

BaeBae33 avatar BaeBae33 commented on May 17, 2024
Hardware/GPU | OS | Tokens/sec
Snapdragon 8+ Gen 1, 12G, Xiaomi 12S | MIUI 14 (Android 13 base) | prefill: 1.7 tok/s, decode: 1.3 tok/s

from mlc-llm.

zaggynl avatar zaggynl commented on May 17, 2024
Hardware/GPU | OS | Tokens/sec
AMD 5700XT | openSUSE Tumbleweed release 20230519 | prefill: 29.5 tok/s, decode: 20.2 tok/s

from mlc-llm.

ssenthilanand avatar ssenthilanand commented on May 17, 2024
Hardware/GPU | OS | Tokens/sec
Asus VivoBook 14 (Ryzen 4500U, 8GB RAM, iGPU) | Windows 11 | prefill: 3.9 tok/s, decode: 5.7 tok/s

from mlc-llm.

jeelshah7 avatar jeelshah7 commented on May 17, 2024

Using vicuna 7B

Hardware/GPU | OS | Tokens/sec (best) | Tokens/sec (avg)
2020 M1 Mac 8GB | Monterey 12.5 | prefill: 7.5 tok/s, decode: 8.1 tok/s | prefill: 0.8 tok/s, decode: 1.1 tok/s

Another thing I noticed is that if it takes longer than expected to generate a response, I have a better chance of receiving a faster response if I restart.

from mlc-llm.

qnixsynapse avatar qnixsynapse commented on May 17, 2024

Vicuna 7B

Hardware | OS | tokens/sec
Custom Desktop; CPU: Intel i3 12th gen; GPU: AMD Radeon RX 550 4GB; RAM: 16GB | Arch Linux | prefill: 3.7 tok/s, decode: 1.8 tok/s


Doing pretty okay on this configuration, although I expected a little better with this graphics card; it is on the lower end anyway. CPU usage throughout was a constant zero.

from mlc-llm.

wse320 avatar wse320 commented on May 17, 2024

Hardware/GPU | OS | Tokens/sec
Snapdragon 8+ Gen 1, 12G, Xiaomi 12S | MIUI 14 (Android 13 base) | prefill: 1.7 tok/s, decode: 1.3 tok/s

My Redmi K60 can't run this app; it just crashes after showing the download screen. Do you know why?

from mlc-llm.

JMcrafter26 avatar JMcrafter26 commented on May 17, 2024

I have an iPhone 12 Pro and it works well:
a decent speed of 10 tok/s prefill and 13 tok/s decode.
No crashes yet; it's working stably with the default model.

from mlc-llm.

pgagarinov avatar pgagarinov commented on May 17, 2024

Apple M2 Max 32 GB

vicuna-7b-1.1-q3f16_0

prefill: 68.2 tok/s, decode: 28.0 tok/s

from mlc-llm.

africalimedrop avatar africalimedrop commented on May 17, 2024

GPU: AMD ATI Radeon RX 7900 XTX
OS: Fedora Linux 38 (Workstation Edition)
Vulkan Drivers: AMD Radeon Graphics (RADV GFX1100) (1.3.246)
Model: RedPajama-INCITE-Chat-3B-v1-q4f16_0

prefill: 83.8 tok/s, decode: 36.4 tok/s


from mlc-llm.

AngelKarlsson avatar AngelKarlsson commented on May 17, 2024
Hardware/GPU | OS | Tokens/sec | Source Model | Notes | Prompt
RX7900XTX (24GB) | Windows 11 | 261 | vicuna-v1-7b | 4.9GB NVRAM | Write me an essay on python

USER: /stats
prefill: 261.3 tok/s, decode: 51.4 tok/s

from mlc-llm.

D0han avatar D0han commented on May 17, 2024
Installation on Steam Deck
(1)(deck@steamdeck ~)$ sudo -i
[sudo] password for deck: 
(127)(B)(root@steamdeck ~)# cat /etc/os-release 
NAME="SteamOS"
PRETTY_NAME="SteamOS"
VERSION_CODENAME=holo
ID=steamos
ID_LIKE=arch
ANSI_COLOR="1;35"
HOME_URL="https://www.steampowered.com/"
DOCUMENTATION_URL="https://support.steampowered.com/"
SUPPORT_URL="https://support.steampowered.com/"
BUG_REPORT_URL="https://support.steampowered.com/"
LOGO=steamos
VARIANT_ID=steamdeck
BUILD_ID=20230508.1
VERSION_ID=3.4.8
(B)(root@steamdeck ~)# lspci  -v -s  $(lspci | grep ' VGA ' | cut -d" " -f 1)
04:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] VanGogh [AMD Custom GPU 0405] (rev ae) (prog-if 00 [VGA controller])
	Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Device 0123
	Flags: bus master, fast devsel, latency 0, IRQ 55
	Memory at f8e0000000 (64-bit, prefetchable) [size=256M]
	Memory at f8f0000000 (64-bit, prefetchable) [size=2M]
	I/O ports at 1000 [size=256]
	Memory at 80300000 (32-bit, non-prefetchable) [size=512K]
	Capabilities: [48] Vendor Specific Information: Len=08 <?>
	Capabilities: [50] Power Management version 3
	Capabilities: [64] Express Legacy Endpoint, MSI 00
	Capabilities: [a0] MSI: Enable- Count=1/4 Maskable- 64bit+
	Capabilities: [c0] MSI-X: Enable+ Count=4 Masked-
	Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
	Capabilities: [270] Secondary PCI Express
	Capabilities: [2b0] Address Translation Service (ATS)
	Capabilities: [2c0] Page Request Interface (PRI)
	Capabilities: [2d0] Process Address Space ID (PASID)
	Capabilities: [410] Physical Layer 16.0 GT/s <?>
	Capabilities: [440] Lane Margining at the Receiver <?>
	Kernel driver in use: amdgpu
	Kernel modules: amdgpu

(B)(root@steamdeck ~)# useradd -m mlc
(B)(root@steamdeck ~)# su mlc
(1)(mlc@steamdeck root)$ cd
(mlc@steamdeck ~)$ wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
[...]
2023-06-10 00:25:06 (3.33 MB/s) - ‘Miniconda3-latest-Linux-x86_64.sh’ saved [73134376/73134376]

(mlc@steamdeck ~)$ chmod +x ./Miniconda3-latest-Linux-x86_64.sh 
(mlc@steamdeck ~)$ ./Miniconda3-latest-Linux-x86_64.sh
[...]
(mlc@steamdeck ~)$ 
exit
(B)(root@steamdeck ~)# su mlc
(mlc@steamdeck root)$ cd
(mlc@steamdeck ~)$ conda update -n base -c defaults conda
[...]
(mlc@steamdeck ~)$ conda create -n mlc-chat
[...]
(mlc@steamdeck ~)$ conda activate mlc-chat
(mlc@steamdeck ~)$ conda install git git-lfs
[...]
(mlc@steamdeck ~)$ git lfs install                                                                                                                                                                              
Git LFS initialized.                                                                                                                                                                                            
(mlc@steamdeck ~)$ conda install -c mlc-ai -c conda-forge mlc-chat-nightly --force-reinstall
[...]
(mlc@steamdeck ~)$ mkdir -p dist/prebuilt
(mlc@steamdeck ~)$ git clone https://github.com/mlc-ai/binary-mlc-llm-libs.git dist/prebuilt/lib           
[...]
(mlc@steamdeck ~)$ cd dist/prebuilt/lib
(mlc@steamdeck lib)$ git rev-parse HEAD
97166d5541009a22f447286bbbc766ca7b6c039f
(mlc@steamdeck lib)$ cd ..
(mlc@steamdeck prebuilt)$ git clone https://huggingface.co/mlc-ai/mlc-chat-vicuna-v1-7b-q3f16_0
[...]
(mlc@steamdeck prebuilt)$ cd mlc-chat-vicuna-v1-7b-q3f16_0
(mlc@steamdeck mlc-chat-vicuna-v1-7b-q3f16_0)$ git rev-parse HEAD
2b94b11929acf59a6e4c3f0b3165179d203294d1
(mlc@steamdeck mlc-chat-vicuna-v1-7b-q3f16_0)$ cd ../../..
Startup
(mlc@steamdeck ~)$ mlc_chat_cli --local-id vicuna-v1-7b-q3f16_0
Use MLC config: "/home/mlc/dist/prebuilt/mlc-chat-vicuna-v1-7b-q3f16_0/mlc-chat-config.json"
Use model weights: "/home/mlc/dist/prebuilt/mlc-chat-vicuna-v1-7b-q3f16_0/ndarray-cache.json"
Use model library: "/home/mlc/dist/prebuilt/lib/vicuna-v1-7b-q3f16_0-vulkan.so"
You can use the following special commands:
  /help               print the special commands
  /exit               quit the cli
  /stats              print out the latest stats (token/sec)
  /reset              restart a fresh chat
  /reload [local_id]  reload model `local_id` from disk, or reload the current model if `local_id` is not specified

Loading model...
Loading finished
Running system prompts...
System prompts finished
Who is Shohei Ohtani?
USER: Who is Shohei Ohtani?
ASSISTANT: Shohei Ohtani is a professional baseball player from Japan. He is known for his impressive combination of power and contact ability. He is also known for his charismatic personality and good looks which has gained him a huge following on social media.
USER: /stats
prefill: 8.9 tok/s, decode: 3.8 tok/s
Write a poem for Pittsburgh.
USER: Write a poem for Pittsburgh.
ASSISTANT: Pittsburgh, steel town
Built on industry and grit
A city of bridges and tunnels
Where the sky never really sets

In the shadows of the skyscrapers
The city of bridges and tunnels
Where the steel mills once dominated
Now a bustling metropolis thrives

From the Three Rivers' flow
A city of bridges and tunnels
Where the steel mills once dominated
Now a bustling metropolis thrives

In the heart of Pennsylvania
Pittsburgh, steel town
Where the sky never really sets
But the stars of industry shine bright

A city of bridges and tunnels
Where the steel mills once dominated
Now a bustling metropolis thrives

So let's all raise a glass
To Pittsburgh, steel town
Where the sky never really sets
But the stars of industry shine bright.
USER: /stats
prefill: 5.6 tok/s, decode: 3.7 tok/s
Explain why Japanese economy is declining.
USER: Explain why Japanese economy is declining.
ASSISTANT: It is not accurate to say that the Japanese economy is declining. In fact, Japan has been experiencing moderate economic growth in recent years. In 2019, the Japanese economy grew by an estimated 1.9% compared to the previous year. This growth is in large part due to the expansion of the service industry and the growth of the digital economy. Additionally, the government has implemented various measures to boost the economy and stimulate growth.

However, it is true that Japan has faced challenges in certain sectors, such as the shrinking population and the ongoing impacts of the ongoing COVID-19 pandemic on global trade. Additionally, the country has faced challenges in addressing its fiscal situation and addressing the high levels of debt. Despite these challenges, the government has taken various steps to address these issues and promote growth in the economy.

In conclusion, it is not accurate to say that the Japanese economy is declining. Instead it is experiencing moderate growth. The Japanese economy has faced challenges in certain sectors such as the shrinking population and the ongoing COVID-19 pandemic on global trade. However, the government has taken various steps to address these issues and promote growth in the economy.
USER: /stats
prefill: 6.1 tok/s, decode: 3.5 tok/s

In summary:

Hardware/GPU | OS | Prefill tokens/sec | Decode tokens/sec | Prompt
Steam Deck/VanGogh [AMD Custom GPU 0405] (rev ae) | SteamOS 3.4.8 (20230508.1) | 8.9 | 3.8 | Who is Shohei Ohtani?
Steam Deck/VanGogh [AMD Custom GPU 0405] (rev ae) | SteamOS 3.4.8 (20230508.1) | 5.6 | 3.7 | Write a poem for Pittsburgh.
Steam Deck/VanGogh [AMD Custom GPU 0405] (rev ae) | SteamOS 3.4.8 (20230508.1) | 6.1 | 3.5 | Explain why Japanese economy is declining.

from mlc-llm.

awz avatar awz commented on May 17, 2024

Running well on AMD6600M system w/ Ubuntu 22.04. So far only 6-7 tokens/s in a couple of models I tried. More details forthcoming. I'm happy to provide any info folks are interested in.

For game emulation the 6600M needs to keep the GPU at max power for best performance, so GPU clock throttling is likely a problem here also?
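One way to rule out throttling on amdgpu while benchmarking is to pin the power profile via sysfs; a hedged sketch, assuming the dGPU is card0 (requires root):

# Force the highest clocks for the duration of the benchmark...
echo high | sudo tee /sys/class/drm/card0/device/power_dpm_force_performance_level
# ...then restore dynamic clocking afterwards.
echo auto | sudo tee /sys/class/drm/card0/device/power_dpm_force_performance_level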

3-4x better is possible on this hardware based on performance reports of 28 tokens/s for 6600XT #15 (comment) as well as my personal experience w/ 6600M and various online reports and benchmarks comparing with 6600XT. Cf. https://gpu.userbenchmark.com/Compare/AMD-RX-6600M-vs-AMD-RX-6600-XT/m1604159vs4117

I am extremely interested in end-to-end-open AI on "commodity off the shelf" hardware that people own and control. See:

https://twitter.com/wait_sasha/status/1667312235167854593?t=JjKsjWh-LzNe_gQG-yPSsA&s=19

Thank-you so much for creating this great project! I will add more thoughts / details to this ticket as I make progress.

from mlc-llm.

sunija-dev avatar sunija-dev commented on May 17, 2024

Intel i5-8350U with Intel UHD 620, Windows 10, on a Dell Latitude (tablet)

Prefill: 5.2 tok/s, decode: 1.3 tok/s

220s prefill time, compared with llama.cpp's 160s. Sadly both prefill speeds are too slow for roleplaying that has long contexts and apparently cannot cache them too well. :(

from mlc-llm.

xukui1203 avatar xukui1203 commented on May 17, 2024

Do you have any plan to support Qualcomm Hexagon backend?

from mlc-llm.

JunkFood02 avatar JunkFood02 commented on May 17, 2024
Hardware/GPU | OS | Prefill | Decode | Prompt
Macbook Pro 2022 (M2 Pro, 32G RAM) | macOS 13.2.1 | 80.0 tok/s | 26.0 tok/s | Who is Steve Jobs?
Macbook Pro 2022 (M2 Pro, 32G RAM) | macOS 13.2.1 | 87.3 tok/s | 25.8 tok/s | what's the latest SoC by Qualcomm?
Macbook Pro 2022 (M2 Pro, 32G RAM) | macOS 13.2.1 | 81.4 tok/s | 25.4 tok/s | Write a poem about Beijing

from mlc-llm.

sjdevries avatar sjdevries commented on May 17, 2024

OS: Garuda Linux (Arch Based)
GPU: Sapphire 7900XTX (24 GB)
Driver: RADV vulkan-radeon 23.1.3-1
Prompt: Write me a poem about Pittsburgh.

vicuna-v1-7b-q3f16_0
prefill: 79.7 tok/s, decode: 10.4 tok/s
prefill: 32.2 tok/s, decode: 10.1 tok/s

RedPajama-INCITE-Chat-3B-v1-q4f16_0
prefill: 38.4 tok/s, decode: 44.2 tok/s

rwkv-raven-7b-q8f16_0
prefill: 15.5 tok/s, decode: 16.1 tok/s

@AngelKarlsson Interesting seeing the Windows drivers hit 50 tokens/s on decode. That is with the vicuna-v1-7b-q3f16_0 model, correct?

from mlc-llm.

AngelKarlsson avatar AngelKarlsson commented on May 17, 2024

Exactly! It would be interesting to be able to measure performance on larger models too.

You can try using the same prompt as mine, "Write me an essay on Python", and you should get 50 tokens/s (tested only on Windows).

from mlc-llm.

gsuuon avatar gsuuon commented on May 17, 2024

RTX 3070 Ti (laptop)
vicuna-7b-v1.3

USER: What color are roses?
prefill: 89.6 tok/s, decode: 33.4 tok/s

I wasn't able to get the Windows build working, so this was through WSL2 + CUDA on a Windows 11 host (no Vulkan support in WSL yet).

from mlc-llm.

lhl avatar lhl commented on May 17, 2024

Hi, I'm interested in trying to do some apples-to-apples comparisons against other inference engines - on those other ones I'm able to specify max context/token generation easily, but is there a way to do that for mlc to get reproducible results?

from mlc-llm.

sjdevries avatar sjdevries commented on May 17, 2024

> Exactly! It would be interesting to be able to measure performance on larger models too.
>
> You can try using the same prompt as mine, "Write me an essay on Python", and you should get 50 tokens/s (tested only on Windows).

I decided to test each Linux driver to compare them on the 7900XTX. The Mesa RADV Vulkan driver is definitely far slower than the AMD drivers. For each driver I ran the same prompt 3 times. Not sure why the prefill rate drops after each run, or whether that is expected because I did not /reset between prompts.

OS: Garuda Linux (Arch Based)
GPU: Sapphire 7900XTX (24 GB)
Prompt: Write me a poem about Pittsburgh.

prefixes to run the different drivers (a sketch of how these can be defined follows the list):
vk_radv = mesa driver
vk_amdvlk = open source AMD driver
vk_pro = closed source AMD driver
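These prefixes are not mlc_chat_cli options; they are just shell wrappers that point the Vulkan loader at a different ICD manifest before running the command. A minimal sketch of how such wrappers can be defined; the manifest paths are assumptions and vary by distro and driver package:

# Select the Vulkan driver per invocation via VK_ICD_FILENAMES (sketch; paths are guesses).
vk_radv()   { VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/radeon_icd.x86_64.json "$@"; }
vk_amdvlk() { VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/amd_icd64.json "$@"; }
vk_pro()    { VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/amd_pro_icd64.json "$@"; }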

vk_radv mlc_chat_cli --local-id vicuna-v1-7b-q3f16_0
prefill: 80.9 tok/s, decode: 10.7 tok/s
prefill: 30.6 tok/s, decode: 10.5 tok/s
prefill: 28.9 tok/s, decode: 10.3 tok/s

vk_amdvlk mlc_chat_cli --local-id vicuna-v1-7b-q3f16_0
prefill: 248.6 tok/s, decode: 58.2 tok/s
prefill: 100.9 tok/s, decode: 55.4 tok/s
prefill: 91.8 tok/s, decode: 52.4 tok/s

vk_pro mlc_chat_cli --local-id vicuna-v1-7b-q3f16_0
prefill: 247.7 tok/s, decode: 49.7 tok/s
prefill: 103.7 tok/s, decode: 55.5 tok/s
prefill: 92.6 tok/s, decode: 52.6 tok/s

Seems like the open source amdvlk drivers are the best for this atm. As a comparison I also ran your same prompt and got: prefill: 241.4 tok/s, decode: 56.1 tok/s.

from mlc-llm.

geekboood avatar geekboood commented on May 17, 2024

OS: Windows 11
GPU: Intel A770 16GB
Model: Llama-2-13b-chat-hf-q4f16_1
Instruct: Write me a poem about Pittsburgh
prefill: 153.4 tok/s, decode: 21.7 tok/s

from mlc-llm.

junrushao avatar junrushao commented on May 17, 2024

Update: I opened up a repo (https://github.com/junrushao/llm-perf-bench) of dockerfiles to help reproduce cuda performance numbers. The takeaway is: MLC LLM is around 30% faster than Exllama.

It's a bit sloppy right now, as a weekend-night project, but we can iterate on it in the coming weeks.

from mlc-llm.

lhl avatar lhl commented on May 17, 2024

I'm seeing different results for MLC vs ExLlama performance (where MLC is significantly slower). I've added llama.cpp results as well for comparison:

3090 (360W)

| Engine | Prefill t/s | Decode t/s |
|---|---|---|
| MLC | 258.7 | 44.3 |
| ExLlama | 6473.65 | 73.58 |
| llama.cpp | 2219.4 | 105.18 |

4090 (400W)

| Engine | Prefill t/s | Decode t/s |
|---|---|---|
| MLC | 461.4 | 75.7 |
| ExLlama | 12248.12 | 107.54 |
| llama.cpp | 2269.50 | 132.56 |

@junrushao any idea what might be going on? This is a huge difference from your test, so I assume there's something I'm missing. With MLC, is there any way to control batching? Also, I'm not sure why MLC's "prefill" number is so low compared to the others, but maybe they're all referring to different measurements; every tool uses its own terminology for its output numbers.

Test Notes

I am running these on an Arch Linux install w/ CUDA 12.2:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Jun_13_19:16:58_PDT_2023
Cuda compilation tools, release 12.2, V12.2.91
Build cuda_12.2.r12.2/compiler.32965470_0

The 3090 is set to PL 360W, the 4090 to PL 400W (both slightly undervolted, but tested to retain ~97% of their performance at stock power limits).
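Power limits like these are typically applied with nvidia-smi; roughly as follows (a sketch; adjust -i to the GPU index on your machine):

sudo nvidia-smi -pm 1            # persistence mode, keeps the driver loaded between runs
sudo nvidia-smi -i 0 -pl 360     # cap the 3090 at 360 W
sudo nvidia-smi -i 1 -pl 400     # cap the 4090 at 400 W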

MLC was run like so:

$ mlc_chat_cli --local-id Llama-2-7b-chat-hf-q4f16_1 --device_id 0
[INST]: Write me a 1000 word essay on the social implications of generative AI
...
[INST]: /stats
prefill: 258.7 tok/s, decode: 44.3 tok/s
  • I don't know how to constrain MLC's output length the way ExLlama and llama.cpp allow, so MLC gets an advantage over the others for inference (since generation slows down with longer context); see my earlier question about how to do apples-to-apples comparisons
  • This is using the prebuilt CLI Llama 2 model, which the docs say is the most optimized version? https://mlc.ai/mlc-llm/docs/prebuilt_models.html

For ExLlama I am using the most accurate 32g desc act order GPTQ (128g or no grouping could be faster):

$ python test_benchmark_inference.py -d /models/llm/llama2/TheBloke_Llama-2-7B-GPTQ/ -p
...
 ** Time, Inference: 0.30 seconds
 ** Speed: 6473.65 tokens/second
 -- Generating 128 tokens, 1920 token prompt...
 ** Speed: 73.58 tokens/second
 -- Generating 128 tokens, 4 token prompt...
 ** Speed: 96.28 tokens/second
  • I use ExLlama's initial "Inference" time as equivalent to MLC's Prefill
  • I use the 128-token generation with the 1920-token prompt, which is the "worst case" for inference

For llama.cpp I am using the q4_K_M GGMLv3 (q4_0 could be faster):

$ ./main -m /models/llm/llama2/TheBloke_Llama-2-7B-GGML/llama-2-7b.ggmlv3.q4_K_M.bin -n 2048 --ignore-eos -ngl 99
...
llama_print_timings:        load time =   507.30 ms
llama_print_timings:      sample time =   922.92 ms /  2048 runs   (    0.45 ms per token,  2219.04 tokens per second)
llama_print_timings: prompt eval time =  2180.57 ms /  1801 tokens (    1.21 ms per token,   825.93 tokens per second)
llama_print_timings:        eval time = 19395.81 ms /  2040 runs   (    9.51 ms per token,   105.18 tokens per second)
llama_print_timings:       total time = 22814.60 ms
  • For llama.cpp I use the "sample" rate as the Prefill and the "eval" rate as the Decode
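For reference, the rates llama.cpp prints are simply token counts divided by elapsed seconds, so the mapping above can be sanity-checked straight from the raw timings (a quick sketch):

awk 'BEGIN { printf "sample:      %.2f tok/s\n", 2048 / 0.92292  }'   # ~2219, the Prefill row above
awk 'BEGIN { printf "prompt eval: %.2f tok/s\n", 1801 / 2.18057  }'   # ~826
awk 'BEGIN { printf "eval:        %.2f tok/s\n", 2040 / 19.39581 }'   # ~105, the Decode row above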

from mlc-llm.

tqchen avatar tqchen commented on May 17, 2024

@lhl likely you are using the Vulkan backend, which is more portable but much slower. You need to build for the CUDA backend to get the best performance on NVIDIA platforms.

from mlc-llm.

junrushao avatar junrushao commented on May 17, 2024

@lhl The number you used is likely Vulkan. Vulkan is usually 30%-80% slower than CUDA. We haven’t released a prebuilt for CUDA yet, but you may directly run it via the Dockerfile I provided.

MLC measures end-to-end decoding time, which includes sampling and text generation, and thus underestimates performance the most. This will improve over the coming weeks.

from mlc-llm.
