Comments (11)
I am curious whether anyone has managed to run this on a laptop outside of the Ultras.
from mlx.
Yes, I am using the provided Mistral example. It's not a typo; it takes around 80 seconds to generate one token.
from mlx.
Yes, that's an oversight: the Mistral example runs in fp16, but the Llama example runs in fp32 by default, since that's what the weights are saved in.
You can see an example of casting the weights in the Mistral file. We should add the same for Llama (and probably just save the weights as fp16 in the first place, as it doesn't seem to make a difference).
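For illustration, the cast mentioned above is just a NumPy-style `astype` applied to every weight array. This sketch uses NumPy as a stand-in for `mlx.core` (whose arrays expose the same `astype` API); the weight name and shape are made up:

```python
import numpy as np  # stand-in for mlx.core; mx arrays also support .astype()

# hypothetical fp32 checkpoint: parameter name -> array
weights = {"tok_embeddings.weight": np.ones((8, 4), dtype=np.float32)}

# cast every tensor to half precision before running inference
weights_fp16 = {name: w.astype(np.float16) for name, w in weights.items()}

assert all(w.dtype == np.float16 for w in weights_fp16.values())
```

With MLX the same dict comprehension works on loaded `mx.array` weights, halving the memory the checkpoint occupies.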
from mlx.
@rovo79 Correct. As per #18, the ANE API is closed source and not publicly accessible. I believe the only way to touch the ANE today is via CoreML.
from mlx.
So, has anyone managed to run 7B inference using MLX on 16GB of RAM? Or do you need an Ultra to make any use of MLX?
from mlx.
FP16/BF16 are both supported dtypes here.
The ops are lazy and will only execute the compute as needed, but if the default_device indicates the GPU, it should be using the Metal kernels.
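As a toy model of the lazy semantics described above (not the MLX implementation): each op records its computation as a thunk, and the compute runs only when the value is actually needed.

```python
class LazyOp:
    """Toy lazy op: records the computation, runs it only on eval()."""
    def __init__(self, fn):
        self.fn = fn
        self._result = None
        self.ran = False

    def eval(self):
        # compute on demand, caching the result for later reuse
        if not self.ran:
            self._result = self.fn()
            self.ran = True
        return self._result

a = LazyOp(lambda: 2 + 3)
assert not a.ran       # building the graph triggers no compute
assert a.eval() == 5   # compute happens only when the value is needed
assert a.ran
```

In MLX this is what `mx.eval(...)` does: graph construction is free, and the Metal kernels are dispatched only at evaluation time.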
from mlx.
I am running the Llama/Mistral inference examples on my M1 Pro with 16GB of memory and getting around 80 sec/token.
Are you using the 7B Llama and 7B Mistral models? Is that a typo? Do you mean 80 ms/token or 80 sec/token?
from mlx.
GPU usage seems low, right?
from mlx.
@tcapelle Can you please check your memory pressure when running the model? With 16GB of memory, you may be running out of wired memory, since the example uses FP16 (the weights alone total nearly 14.6GB) and inference takes a bit more than that.
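The 14.6GB figure follows from simple arithmetic; the parameter count below is an approximation for a "7B"-class model, not an exact number from the thread:

```python
params = 7.3e9         # rough parameter count of a "7B"-class model (assumption)
bytes_per_param = 2    # FP16 stores each weight in 2 bytes
total_gb = params * bytes_per_param / 1e9
print(f"{total_gb:.1f} GB")  # -> 14.6 GB
```

On a 16GB machine that leaves very little headroom for the KV cache, activations, and the OS, which is why memory pressure is the first thing to check.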
from mlx.
@tcapelle I tried the LLaMA example on my M1 Pro 32GB. It's indeed slow, and I think that's mostly due to the weights being FP32. I haven't checked the Mistral example yet, but this performance is expected if it is also FP32. Transformer inference is typically memory-bound, and using FP32 is a bottleneck.
Did you make additional modifications to run the example in FP16, or did I miss something?
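A back-of-envelope memory-bound estimate helps put the numbers above in perspective. The bandwidth figure is an assumption for an M1 Pro; the model size is an assumed 7B:

```python
params = 7.0e9           # assumed 7B-parameter model
bandwidth_gbs = 200.0    # assumed M1 Pro unified-memory bandwidth, GB/s
for bytes_per_param, name in [(4, "fp32"), (2, "fp16")]:
    weights_gb = params * bytes_per_param / 1e9
    # memory-bound decode: every generated token streams all weights once
    secs = weights_gb / bandwidth_gbs
    print(f"{name}: ~{weights_gb:.0f} GB of weights, ~{secs:.2f} s/token lower bound")
```

Even FP32 should be well under a second per token if the weights fit in RAM, so an observed 80 sec/token is more consistent with the weights spilling out of memory and being paged than with FP32 compute alone.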
from mlx.
Running inference with MLX won't touch the ANE in any way, right?
from mlx.
Related Issues (20)
- [BUG] mlx_lm issue with Phi-3 fine tuned model: adding and repeating weird tokens
- [FEATURE] in keras LayerNorm by default is apply to last dimension only HOT 9
- [BUG] in-place updating of array slice unexpectedly fails due to broadcasting problem HOT 2
- [BUG] Matmul gives wrong output for large sizes HOT 4
- [BUG] broadcast of scalar array in last dimension fails after #1035
- [BUG] Unable to install mlx on MacbookPro M3Pro with MacOS 14.4.1 HOT 1
- [FEATURE] how to return mlx intermediate layer output similarly to Keras HOT 2
- [BUG] cannot replicate a keras model into mlx when I reuse keras pretrained weights
- [BUG] EOS terminator for mlx_lm generate function HOT 1
- [BUG] libc++abi crash when using recurrent layer and transformer HOT 2
- [Feature] arctan2 HOT 3
- [BUG] arithmetic operations with numpy arrays are not commutative HOT 3
- [Feature] KANs HOT 1
- No module named 'mlx.core'; 'mlx' is not a package HOT 2
- 0.12.2 release was not completed HOT 8
- [FEATURE REQUEST] mx.grad doesn't alias argnums and argnames HOT 5
- [BUG] `np.ndarray` of bfloat16 using ml_dtypes is being interpreted as complex64
- [BUG] mlx crashes with msg - uncaught exception of type std::invalid_argument: [Scatter::eval_gpu] Does not support int64 HOT 4
- Is dlpack supported? HOT 9
- [BUG] matmul yields different results when using concat HOT 1