pratyushasharma / laser
The Truth Is In There: Improving Reasoning in Language Models with Layer-Selective Rank Reduction
Home Page: https://pratyushasharma.github.io/laser/
License: MIT License
Hi,
Great work on this! Is Mistral supported? Right now I only see GPT-J and Llama 2.
Thank you!
Very cool work. I am curious how to reproduce the Figure 5 results reported in the paper.
Thank you for sharing your great work. While reproducing your project, I encountered an issue that I hope you can provide some help with. I executed the following commands:
python3 intervention_gptj_fever.py --lname fc_in --rate 9.9 --lnum 26
python3 intervention_gptj_fever.py --lname fc_in --rate 9.0 --lnum 26
python3 intervention_gptj_fever.py --lname dont
However, I noticed that LASER did not yield the expected performance improvement. I've attached the log files for your reference. Could you kindly look into them and help me identify any potential problems in my setup?
GPTJ-log-26-fc_in-9.0.txt
GPTJ-log-24-dont-1.txt
GPTJ-log-26-fc_in-9.9.txt
Hello~ @pratyushasharma. Thanks for your effort and for the code. I have been reproducing the Llama2-7B + TruthfulQA results with your code so that I can use your work as a baseline for further research, but the accuracies I get are almost identical, around 56.52, especially for the base model. I don't know what is wrong, and I am still unsure what causes the large accuracy increase on Llama2-7B + TruthfulQA (around 5.7% in your results). I would appreciate it if you could help me check this result.
Thanks for providing the code for this promising research. I'm looking forward to seeing how far this idea can be pushed. It would be especially cool if a heuristic could be found that applies this technique to multiple layers, chosen in a way that works across different models.
When I investigated the results a bit more closely and ran some of the benchmarks locally, I came across some potential issues. Specifically, I took a look at the BigBench-Epistemic Reasoning benchmark, but I suspect that others could also be affected. First of all, I noticed that the accuracy of the models without intervention was below 50% (Tab. 1). For a binary classification task, this is strange. When debugging the results, I found that for Roberta and GPT-J (I haven't tested Llama), the models always predicted the same label, and since that label occurs in 37% of the samples in the dataset, that is also their accuracy. As Llama has 63% accuracy with intervention, I suspect that it simply always predicts the other label.
Digging a bit deeper, I found the logits for the label tokens to be extremely small. This typically happens when the model is somehow "derailed" and wants to predict neither of the tokens. Sometimes, this simply comes down to tokenization: often, the models try to predict " True" and " False" (with leading whitespace) because this is how they tokenize the text. Other times, they want to go in a completely different direction. I would recommend logging the absolute probabilities of the label tokens and double-checking cases where they are too low. Often, this can be fixed by slight adjustments to prompts or labels.
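A minimal way to do this check (a sketch only; `label_probs` is my own helper name, not something in the repo):

```python
import torch
import torch.nn.functional as F

def label_probs(next_token_logits, label_token_ids):
    """Return the absolute probability of each candidate label token
    under the full next-token softmax.

    If every label's probability is tiny, the model is likely
    "derailed" and predicting something other than the intended labels.
    """
    probs = F.softmax(next_token_logits, dim=-1)
    return {tok: probs[tok].item() for tok in label_token_ids}
```

In a Hugging Face setup, `next_token_logits` would come from something like `model(**tokenizer(prompt, return_tensors="pt")).logits[0, -1]`, and the label ids from `tokenizer.encode(" True", add_special_tokens=False)` — note the leading whitespace in the label.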
Also, there is a typo in this prompt: "entails" => "entail".
I hope this is helpful.
Thanks for the awesome work. I would like to try it out. What is the ETA on the code?
Hi,
Thank you so much for making this project! I see that there's a CLI argument for dataset_file; do you know what I should point it to for the CounterFact setting?
Thank you!
Hi,
Thanks for releasing this code. Does this codebase decrease the size of the model (i.e., file size and required VRAM)?
Thank you!
Hello! Thanks for your idea and code; I am applying the code to my model. I have two questions:
Thank you!
The GitHub README says 'rate' measures how much rank to retain, but the code `results = torch.svd_lowrank(weight, q=desired_rank, niter=niter)` confuses me a bit. `desired_rank` is the rank to retain, and `desired_rank = max_rank * k`, so `k` is the fraction to retain; then `k = (10 - rate) * 0.1`, so `rate` should mean how much rank to *remove*? Is there something wrong with my understanding?
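To make my reading concrete, the mapping would look like this (a sketch of my interpretation of the code; the `max(..., 1)` floor is my own guard, not necessarily in the repo):

```python
def desired_rank(max_rank, rate):
    """Map the CLI `rate` argument to the rank retained by
    torch.svd_lowrank.

    k = (10 - rate) * 0.1 is the fraction of rank *kept*, so `rate`
    itself measures how much rank is *removed*: rate=9.9 keeps 1% of
    the maximum rank, rate=0.0 keeps all of it.
    """
    k = (10 - rate) * 0.1
    return max(int(max_rank * k), 1)
```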
Thanks for publishing this excellent work. If I understand correctly, you run the LASER intervention separately for each evaluation task.
Would it be possible to make one LASER model that is generic to all tasks? My goal is to compress LLAMA-v2-7B to be smaller, for executing faster on mobile devices.
Also, is it correct that you apply LASER to only one layer of the model? I was wondering, did you try applying it to most of the layers?
This is the issue that contains the list of all features for the upcoming refactoring:
A unified abstract class that does all the common work: creating command-line arguments, instantiating an LLM, and running the experiment. We may end up with only one file per LLM (or per LLM type) plus this abstract class. We may not be able to get it down to a single file, since certain LLMs like Roberta, which are really masked language models, have a different procedure for computing accuracy and log-loss from the tokens.
Replace the use of rate with ρ, which is what the paper uses.
Add a feature to support memory reduction by storing the separate U, S, V matrices rather than multiplying them back together and losing the memory advantage.
Add more LLMs: specifically Mistral, other Llama2 versions, and the Phi models.
Release LLMs with the optimally chosen reductions from Table 3 of the paper https://arxiv.org/pdf/2312.13558.pdf.
If you have more requests, please paste them below. Do note that the first version of refactoring may not be able to do all of the above, but we'll do our best. We welcome PRs.
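The memory-reduction item above could be sketched as follows (my own illustration, not the repo's code: keeping the low-rank factors stores `rank * (out + in)` parameters instead of `out * in`):

```python
import torch

def factor_linear(weight, rank):
    """Factor a dense weight W (out x in) into low-rank pieces A and B
    with W ~= A @ B, instead of multiplying U, S, V back into a dense
    matrix as the current code does.

    Forward pass with the factors: y = x @ B.t() @ A.t()
    """
    U, S, V = torch.svd_lowrank(weight, q=rank)
    A = U * S        # (out, rank): columns of U scaled by singular values
    B = V.t()        # (rank, in)
    return A, B
```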
Hello,
I'm trying to reproduce the metrics in Table 1 for LLAMA-2. I did so for GPT-J, and the results are consistent; however, for LLAMA-2, for some reason, the results do not match. Any idea why this is happening?
For LLAMA-2, Fever, I get:
Logs:
python intervention_llama2_fever.py --lname dont --rate 8.0 --lnum 30 --home_dir out_data/fever --model_path meta-llama/Llama-2-7b-chat-hf
Main: Msg: Final Performance: Dataset size 13086 0-1 Correctness is 54.98242396454226 percentage, Mean F1 score is None, Mean Log Prob is -1.1887680674259296, top-1 accuracy is 54.82958887360538, top-10 accuracy is 99.99235824545316, top-5 accuracy is 99.92358245453156.
python intervention_llama2_fever.py --lname fc_in --rate 8.0 --lnum 30 --home_dir out_data/fever --model_path meta-llama/Llama-2-7b-chat-hf
Main: Msg: Final Performance: Dataset size 13086 0-1 Correctness is 54.13418920984258 percentage, Mean F1 score is None, Mean Log Prob is -1.2900288283587429, top-1 accuracy is 54.09598043710836, top-10 accuracy is 100.0, top-5 accuracy is 99.91594069998472.
Specs:
Python 3.8
Torch 1.12.1+cu116
Hi! I really enjoyed the paper.
I've implemented a version that uses Marchenko-Pastur to speed up the rank search, instead of searching over a grid.
If it's of interest to you, we could join efforts.
I would be glad if you could take a look at https://github.com/cognitivecomputations/laserRMT
Congratulations on your work.
Best,
Fernando
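For readers unfamiliar with the idea, here is a minimal sketch of Marchenko-Pastur-style rank selection (the underlying heuristic only, not the laserRMT implementation; the noise estimate is a crude assumption of mine): keep only singular values above the largest value expected from a pure-noise matrix of the same shape.

```python
import numpy as np

def mp_rank(weight, noise_sigma=None):
    """Pick a retained rank by thresholding singular values at
    sigma * (sqrt(m) + sqrt(n)), the approximate largest singular
    value of an m x n matrix of iid noise with std sigma.

    If noise_sigma is not given, it is crudely estimated from the
    median singular value (assuming most of the spectrum is noise).
    """
    m, n = weight.shape
    s = np.linalg.svd(weight, compute_uv=False)
    if noise_sigma is None:
        noise_sigma = np.median(s) / np.sqrt(max(m, n))
    threshold = noise_sigma * (np.sqrt(m) + np.sqrt(n))
    return int((s > threshold).sum())
```

This replaces the grid search over `rate` with a single SVD per weight matrix: everything below the noise edge is discarded in one shot.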
Which command should I run to obtain the accuracy of the base model?
I have a few questions about Sections 5.1 and 5.2.
Reference: Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and Editing Factual Associations in GPT. Advances in Neural Information Processing Systems, 36, 2022.
Hi,
Great work on this codebase! Would you mind adding a license, e.g., MIT/Apache/ISC?
Thank you!
Do you publish the rank-reduced models anywhere?
Could you consider implementing LASER for three-dimensional tensors? For example, applying the method to Conv3d weights.