pratyushasharma / laser
The Truth Is In There: Improving Reasoning in Language Models with Layer-Selective Rank Reduction
Home Page: https://pratyushasharma.github.io/laser/
License: MIT License
Hi,
Great work on this! Is Mistral supported? Right now I only see GPT-J and Llama 2.
Thank you!
Very cool work. I am curious how to reproduce the Figure 5 results reported in the paper.
Thank you for sharing your great work. While reproducing your project, I encountered an issue that I hope you can provide some help with. I executed the following commands:
python3 intervention_gptj_fever.py --lname fc_in --rate 9.9 --lnum 26
python3 intervention_gptj_fever.py --lname fc_in --rate 9.0 --lnum 26
python3 intervention_gptj_fever.py --lname dont
However, I noticed that LASER did not yield the expected performance improvement. I've attached the log files for your reference. Could you kindly look into them and help me identify any potential problems in my setup?
GPTJ-log-26-fc_in-9.0.txt
GPTJ-log-24-dont-1.txt
GPTJ-log-26-fc_in-9.9.txt
Hello~ @pratyushasharma. Thanks for your effort and for the code. I have been reproducing the Llama2-7B + TruthfulQA results with your code so that I can use your work as a baseline for further research, but the accuracies I get are almost identical, around 56.52, especially for the base model. I don't know what is wrong, and I am still unsure what causes the large accuracy increase on Llama2-7B + TruthfulQA (around 5.7% in your results). I would appreciate it if you could help me check this result.
Thanks for providing the code for this promising research. I'm looking forward to seeing how far this idea can be pushed. It would be especially cool if a heuristic could be found that applies this technique to multiple layers, chosen in a way that works across different models.
When I investigated the results a bit more closely and ran some of the benchmarks locally, I came across some potential issues. Specifically, I took a look at the BigBench-Epistemic Reasoning benchmark, but I suspect that others could also be affected. First of all, I noticed that the accuracy of the models without intervention was below 50% (Tab. 1). For a binary classification task, this is strange. When debugging the results, I found that for Roberta and GPT-J (I haven't tested Llama), the models always predicted the same label, and since that label occurs in 37% of the samples in the dataset, that is also their accuracy. As Llama has 63% accuracy with intervention, I suspect that it simply always predicts the other label.
Digging a bit deeper, I found the logits for the label tokens to be extremely small. This typically happens when the model is somehow "derailed" and wants to predict neither of the tokens. Sometimes, this simply comes down to tokenization: often, the models try to predict " True" and " False" (with leading whitespace) because this is how they tokenize the text. Other times, they want to go in a completely different direction. I would recommend logging the absolute probabilities of the label tokens and double-checking cases where they are too low. Often, this can be fixed by slight adjustments to prompts or labels.
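A minimal way to do this check (a sketch only; `label_probs` is my own helper name, not something in the repo):

```python
import torch
import torch.nn.functional as F

def label_probs(next_token_logits, label_token_ids):
    """Return the absolute probability of each candidate label token
    under the full next-token softmax.

    If every label's probability is tiny, the model is likely
    "derailed" and predicting something other than the intended labels.
    """
    probs = F.softmax(next_token_logits, dim=-1)
    return {tok: probs[tok].item() for tok in label_token_ids}
```

In a Hugging Face setup, `next_token_logits` would come from something like `model(**tokenizer(prompt, return_tensors="pt")).logits[0, -1]`, and the label ids from `tokenizer.encode(" True", add_special_tokens=False)` — note the leading whitespace in the label.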
Also, there is a typo in this prompt: "entails" => "entail".
I hope this is helpful.
Thanks for the awesome work. I would like to try it out. What is the ETA on the code?
Hi,
Thank you so much for making this project! I see that there's a CLI argument for dataset_file; do you know what I should point it to for the CounterFact setting?
Thank you!
Hi,
Thanks for releasing this code. Does this codebase decrease the size of the model (i.e., file size and required VRAM)?
Thank you!
Hello! Thanks for your idea and code; I am applying the code to my model. I have two questions:
Thank you!
The GitHub README says 'rate' measures how much rank to retain, but the code `results = torch.svd_lowrank(weight, q=desired_rank, niter=niter)` confuses me a bit. `desired_rank` is the rank to retain, and `desired_rank = max_rank * k`, so `k` is the fraction to retain; then `k = (10 - rate) * 0.1`, so `rate` should mean how much rank to *remove*? Is there something wrong with my understanding?
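To make my reading concrete, the mapping would look like this (a sketch of my interpretation of the code; the `max(..., 1)` floor is my own guard, not necessarily in the repo):

```python
def desired_rank(max_rank, rate):
    """Map the CLI `rate` argument to the rank retained by
    torch.svd_lowrank.

    k = (10 - rate) * 0.1 is the fraction of rank *kept*, so `rate`
    itself measures how much rank is *removed*: rate=9.9 keeps 1% of
    the maximum rank, rate=0.0 keeps all of it.
    """
    k = (10 - rate) * 0.1
    return max(int(max_rank * k), 1)
```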
Thanks for publishing this excellent work. If I understand correctly, you run the LASER intervention separately for each evaluation task.
Would it be possible to make one LASER model that is generic to all tasks? My goal is to compress LLAMA-v2-7B to be smaller, for executing faster on mobile devices.
Also, is it correct that you apply LASER to only one layer of the model? I was wondering, did you try applying it to most of the layers?
This is the issue that contains the list of all features for the upcoming refactoring:
A unified abstract class that does all the common work: creating command-line arguments, instantiating an LLM, and running the experiment. We may end up with only one file per LLM (or per LLM type) plus this abstract class. We may not be able to get it down to a single file, since certain LLMs like Roberta, which are really masked language models, have a different procedure for computing accuracy and log-loss from the tokens.
Replace the use of rate with ρ, which is what the paper uses.
Add a feature to support memory reduction by storing the separate U, S, V matrices rather than multiplying them back together and losing the memory advantage.
Add more LLMs: specifically Mistral, other Llama2 versions, and the Phi models.
Release LLMs with the optimally chosen reductions from Table 3 of the paper https://arxiv.org/pdf/2312.13558.pdf.
If you have more requests, please paste them below. Do note that the first version of refactoring may not be able to do all of the above, but we'll do our best. We welcome PRs.
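The memory-reduction item above could be sketched as follows (my own illustration, not the repo's code: keeping the low-rank factors stores `rank * (out + in)` parameters instead of `out * in`):

```python
import torch

def factor_linear(weight, rank):
    """Factor a dense weight W (out x in) into low-rank pieces A and B
    with W ~= A @ B, instead of multiplying U, S, V back into a dense
    matrix as the current code does.

    Forward pass with the factors: y = x @ B.t() @ A.t()
    """
    U, S, V = torch.svd_lowrank(weight, q=rank)
    A = U * S        # (out, rank): columns of U scaled by singular values
    B = V.t()        # (rank, in)
    return A, B
```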
Hello,
I'm trying to reproduce the metrics in Table 1 for LLAMA-2. I did so for GPT-J, and the results are consistent; however, for LLAMA-2, for some reason, the results do not match. Any idea why this is happening?
For LLAMA-2, Fever, I get:
Logs:
python intervention_llama2_fever.py --lname dont --rate 8.0 --lnum 30 --home_dir out_data/fever --model_path meta-llama/Llama-2-7b-chat-hf
Main: Msg: Final Performance: Dataset size 13086 0-1 Correctness is 54.98242396454226 percentage, Mean F1 score is None, Mean Log Prob is -1.1887680674259296, top-1 accuracy is 54.82958887360538, top-10 accuracy is 99.99235824545316, top-5 accuracy is 99.92358245453156.
python intervention_llama2_fever.py --lname fc_in --rate 8.0 --lnum 30 --home_dir out_data/fever --model_path meta-llama/Llama-2-7b-chat-hf
Main: Msg: Final Performance: Dataset size 13086 0-1 Correctness is 54.13418920984258 percentage, Mean F1 score is None, Mean Log Prob is -1.2900288283587429, top-1 accuracy is 54.09598043710836, top-10 accuracy is 100.0, top-5 accuracy is 99.91594069998472.
Specs:
Python 3.8
Torch 1.12.1+cu116
Hi! I really enjoyed the paper.
I've implemented a version that uses Marchenko-Pastur to speed up the rank search, instead of searching over a grid.
If it's of interest to you, we could join efforts.
I would be glad if you could take a look at https://github.com/cognitivecomputations/laserRMT
Congratulations on your work.
Best,
Fernando
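For readers unfamiliar with the idea, here is a minimal sketch of Marchenko-Pastur-style rank selection (the underlying heuristic only, not the laserRMT implementation; the noise estimate is a crude assumption of mine): keep only singular values above the largest value expected from a pure-noise matrix of the same shape.

```python
import numpy as np

def mp_rank(weight, noise_sigma=None):
    """Pick a retained rank by thresholding singular values at
    sigma * (sqrt(m) + sqrt(n)), the approximate largest singular
    value of an m x n matrix of iid noise with std sigma.

    If noise_sigma is not given, it is crudely estimated from the
    median singular value (assuming most of the spectrum is noise).
    """
    m, n = weight.shape
    s = np.linalg.svd(weight, compute_uv=False)
    if noise_sigma is None:
        noise_sigma = np.median(s) / np.sqrt(max(m, n))
    threshold = noise_sigma * (np.sqrt(m) + np.sqrt(n))
    return int((s > threshold).sum())
```

This replaces the grid search over `rate` with a single SVD per weight matrix: everything below the noise edge is discarded in one shot.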
Which command should I run to obtain the accuracy of the base model?
I have a few questions about Sections 5.1 and 5.2.
Reference: Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and Editing Factual Associations in GPT. Advances in Neural Information Processing Systems, 36, 2022.
Hi,
Great work on this codebase! Would you mind adding a license, e.g., MIT/Apache/ISC?
Thank you!
Do you publish the rank-reduced models anywhere?
Could you consider implementing LASER for three-dimensional tensors? For example, applying the method to Conv3d weights.