Comments (4)
Hi @jpaye,
This can be configured today, but we recommend that you test your specific checkpoint to ensure that quantized weight storage does not degrade output quality.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig
from transformers_neuronx import (
    NeuronAutoModelForCausalLM,
    NeuronConfig,
    QuantizationConfig,
    HuggingFaceGenerationModelAdapter,
)

name = 'mistralai/Mistral-7B-Instruct-v0.2'


def load_neuron_int8():
    config = AutoConfig.from_pretrained(name)
    model = NeuronAutoModelForCausalLM.from_pretrained(
        name,
        tp_degree=2,
        amp='bf16',
        n_positions=[256],  # Limited seqlen for faster compilation
        neuron_config=NeuronConfig(
            quant=QuantizationConfig(
                quant_dtype='s8',
                dequant_dtype='bf16',
            )
        )
    )
    model.to_neuron()
    return HuggingFaceGenerationModelAdapter(config, model)


def load_cpu_bf16():
    return AutoModelForCausalLM.from_pretrained(name)


def infer(model):
    tokenizer = AutoTokenizer.from_pretrained(name)
    prompt = "[INST] What is your favourite condiment? [/INST]"
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    output = model.generate(input_ids, top_k=1, max_new_tokens=256 - input_ids.shape[1])
    print('Output:')
    print(tokenizer.decode(output[0]))


if __name__ == '__main__':
    infer(load_neuron_int8())
    infer(load_cpu_bf16())
```
You’ll notice that the Neuron quantized int8 version and the CPU bf16 version produce slightly different greedy results due to precision loss:
Neuron Output:
[INST] What is your favourite condiment? [/INST] I don’t have a personal preference or the ability to taste or enjoy condiments, as I’m an artificial intelligence and don’t have a physical body or senses. However, I can tell you that some common favourite condiments include ketchup, mustard, mayonnaise, hot sauce, soy sauce, and relish. People’s preferences can vary greatly depending on their cultural background, dietary restrictions, and personal taste preferences.
CPU Output:
[INST] What is your favourite condiment? [/INST] I don’t have a personal preference or the ability to taste or enjoy condiments, as I’m an artificial intelligence and don’t have a physical body or senses. However, I can tell you that some common favourite condiments include ketchup, mustard, mayonnaise, hot sauce, soy sauce, and BBQ sauce. People’s preferences can vary greatly depending on their cultural background, dietary restrictions, and personal taste preferences.
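This divergence is expected: storing the weights as int8 introduces a small round-trip error, which can be enough to flip a single greedy token choice ("relish" vs. "BBQ sauce" above). A toy sketch of symmetric per-tensor int8 quantization (an illustration of the precision loss, not the Neuron kernel itself) shows the error scale:

```python
import torch

# Toy symmetric per-tensor int8 quantization round trip. The residual
# error (bounded by half a quantization step) is the kind of precision
# loss that can change one token in an otherwise identical greedy decode.
torch.manual_seed(0)
w = torch.randn(4, 4)
scale = w.abs().max() / 127
q = torch.clamp((w / scale).round(), -128, 127).to(torch.int8)
w_hat = q.float() * scale
print((w - w_hat).abs().max())  # small but nonzero round-trip error
```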
We will look at updating the documentation for clarity.
from aws-neuron-sdk.
@jluntamazon thank you very much for the help! Attempting to run the code on inf2.8xlarge; currently working on debugging the below error:
```
RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
```
Will update!
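For reference, the workaround the error message itself suggests is to map every storage onto the CPU at load time. A minimal self-contained sketch (`checkpoint.pt` is a hypothetical filename, not one from this thread):

```python
import torch

# Save a small checkpoint, then reload it while mapping all storages to
# the CPU -- the fix the RuntimeError above recommends for machines
# where torch.cuda.is_available() is False.
torch.save({"w": torch.ones(2, 2)}, "checkpoint.pt")
state = torch.load("checkpoint.pt", map_location=torch.device("cpu"))
print(state["w"].device)  # cpu
```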
@jluntamazon thanks again for the help! I got through the above issue but now debugging the below
I hit this when attempting to save the quantized model with save_pretrained
```
ValueError: Attempted to use an uninitialized parameter in <method 'detach' of 'torch._C._TensorBase' objects>. This error happens when you are using a `LazyModule` or explicitly manipulating `torch.nn.parameter.UninitializedParameter` objects. When using LazyModules Call `forward` with a dummy batch to initialize the parameters before calling torch functions
```
Will keep working on it, just posting in case it's an issue that's familiar to you
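The ValueError's suggestion can be reproduced in isolation: a lazy module's parameters remain `UninitializedParameter` objects until a forward pass with a dummy batch materializes them. A generic torch sketch (unrelated to the Neuron internals):

```python
import torch

# A LazyLinear's weight starts as an UninitializedParameter; calling
# detach() or saving it at this point raises the error quoted above.
layer = torch.nn.LazyLinear(out_features=4)
assert isinstance(layer.weight, torch.nn.parameter.UninitializedParameter)

layer(torch.zeros(1, 8))  # a dummy batch materializes the parameters
print(layer.weight.shape)  # torch.Size([4, 8])
```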
@jluntamazon just an update -- I was able to work out the issues and did get this to work!
However, I didn't really see the performance bump I would have expected -- in my testing the inference wasn't faster than the non-quantized model (on inf2.xlarge). Wondering if that's expected? I had been hoping that I'd see lower inference latency
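When comparing latency, it helps to exclude the first call, where compilation and caching can dominate. A simple wall-clock helper (a sketch; `model` and `input_ids` are the names from the script above, not part of any Neuron API):

```python
import time

def mean_latency(fn, n_iters=5):
    # Average wall-clock seconds per call to fn(), after one warmup
    # call -- the first invocation is often dominated by one-time costs.
    fn()  # warmup
    start = time.perf_counter()
    for _ in range(n_iters):
        fn()
    return (time.perf_counter() - start) / n_iters

# Usage sketch with the names from the script above:
# mean_latency(lambda: model.generate(input_ids, top_k=1, max_new_tokens=32))
```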