brandon-lockaby / algebraic_value_editing

This project forked from montemac/activation_additions


Experiments testing the algebraic value-editing conjecture (AVEC) on GPT-2 models

License: MIT License


Algebraic value editing in pretrained language models

Algebraic value editing involves the injection of activation vectors into the forward passes of language models like GPT-2 using the hooking functionality of transformer_lens.

Installation

After cloning the repository, run pip install -e . to install the algebraic_value_editing package.

There are currently a few example scripts in the scripts/ directory. For example, basic_functionality.py generates modified completions (as described below).

Methodology

How the vectors are generated

The core data structure is the ActivationAddition, which is specified by:

  • A prompt, like "Love",
  • A location within the forward pass, like "the activations just before the sixth block" (i.e. blocks.6.hook_resid_pre), and
  • A coefficient, like 2.5.
love_rp = ActivationAddition(prompt="Love", coeff=2.5, act_name="blocks.6.hook_resid_pre")

The ActivationAddition specifies:

Run a forward pass on the prompt, record the activations at the given location in the forward pass, and then rescale those activations by the given coefficient.

Then, when future forward passes reach blocks.6.hook_resid_pre, a hook function adds e.g. 2.5 times the "Love" activations to the usual activations at that location.
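The record, rescale, and add steps can be sketched with plain Python lists (the activation values and the tiny three-dimensional residual stream are illustrative, not real model activations):

```python
coeff = 2.5

# Hypothetical activations recorded at blocks.6.hook_resid_pre when
# running the model on "Love": one vector per token position, with a
# tiny d_model of 3 to keep the example readable.
love_acts = [[1.0, 0.0, 2.0],   # '<|endoftext|>' position
             [0.5, -1.0, 1.0]]  # 'Love' position

# Rescale the recorded activations by the coefficient.
steering = [[coeff * x for x in vec] for vec in love_acts]

def steering_hook(resid):
    """Add the rescaled activations to the leading positions of a
    residual stream, as the transformer_lens hook would."""
    out = [vec[:] for vec in resid]
    for i, add in enumerate(steering):
        out[i] = [a + b for a, b in zip(out[i], add)]
    return out

# Residual stream for a longer prompt at the same hook point.
resid = [[0.0, 0.0, 0.0] for _ in range(4)]
steered = steering_hook(resid)
```

Only the positions that overlap with the recorded activations are modified; the rest of the residual stream passes through unchanged.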

For example, if we run gpt2-small on the prompt "I went to the store because", the residual streams line up as follows:

prompt_tokens =  ['<|endoftext|>', 'I', ' went', ' to', ' the', ' store', ' because']
love_rp_tokens = ['<|endoftext|>', 'Love']

To add the love ActivationAddition to the forward pass, we run the usual forward pass on the prompt until transformer block 6. At that point, the hook modifies only the first two residual streams: the '<|endoftext|>' stream and the 'I'/'Love' stream. It adds the rescaled "Love" activations to the prompt's activations at those two positions.
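The alignment above can be checked with a few lines of Python: only the positions of the prompt that overlap with the "Love" tokens are affected.

```python
prompt_tokens = ['<|endoftext|>', 'I', ' went', ' to', ' the', ' store', ' because']
love_tokens = ['<|endoftext|>', 'Love']

# A position is modified iff it overlaps with the addition's tokens.
modified = [i < len(love_tokens) for i in range(len(prompt_tokens))]
affected = [tok for tok, m in zip(prompt_tokens, modified) if m]
```

Here `affected` contains only '<|endoftext|>' and 'I'; the remaining five positions are left untouched by the hook.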

X-vectors are a special kind of ActivationAddition

A special case of this is the "X-vector." A "Love" minus "Hate" vector is generated by

love_rp, hate_rp = get_x_vector(prompt1="Love", prompt2="Hate", 
                                coeff=5, act_name=6)

This returns a tuple of two ActivationAdditions:

love_rp = ActivationAddition(prompt="Love", coeff=5, act_name="blocks.6.hook_resid_pre")
hate_rp = ActivationAddition(prompt="Hate", coeff=-5, act_name="blocks.6.hook_resid_pre")
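The relationship between get_x_vector and the pair it returns can be sketched as follows (a simplified stand-in for the package's real classes, kept only to show the sign flip on the coefficient):

```python
from dataclasses import dataclass

@dataclass
class ActivationAddition:
    """Simplified stand-in for the package's ActivationAddition."""
    prompt: str
    coeff: float
    act_name: str

def get_x_vector(prompt1, prompt2, coeff, act_name):
    """Return a pair of additions with opposite-signed coefficients,
    so prompt2's activations are subtracted from prompt1's."""
    name = f"blocks.{act_name}.hook_resid_pre"
    return (ActivationAddition(prompt1, coeff, name),
            ActivationAddition(prompt2, -coeff, name))

love_rp, hate_rp = get_x_vector("Love", "Hate", coeff=5, act_name=6)
```

Passing an integer act_name expands to the corresponding hook-point name, and the second prompt's coefficient is negated.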

(This is mechanistically similar to our cheese- and top-right-vectors, originally computed for deep convolutional maze-solving policy networks.)

Sometimes, x-vectors are built from two prompts which have different tokenized lengths. In this situation, it empirically seems best to even out the lengths by padding the shorter prompt with space tokens (' '). This is done by calling:

get_x_vector(prompt1="I talk about weddings constantly", 
             prompt2="I do not talk about weddings constantly", 
             coeff=4, act_name=20, 
             pad_method="tokens_right", model=gpt2_small,
             custom_pad_id=gpt2_small.to_single_token(' '))
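The token-level effect of pad_method="tokens_right" can be sketched like this (the token ids are illustrative; in practice they come from the model's tokenizer, and 220 is GPT-2's id for the space token ' '):

```python
def pad_tokens_right(tokens1, tokens2, pad_id):
    """Right-pad the shorter token-id list with pad_id so both prompts
    occupy the same number of residual-stream positions."""
    n = max(len(tokens1), len(tokens2))
    return (tokens1 + [pad_id] * (n - len(tokens1)),
            tokens2 + [pad_id] * (n - len(tokens2)))

# Illustrative token ids for a 6-token and an 8-token prompt.
short = [50256, 101, 102, 103, 104, 105]
long_ = [50256, 101, 106, 107, 102, 103, 104, 105]
padded_short, padded_long = pad_tokens_right(short, long_, pad_id=220)
```

After padding, the two prompts' activations line up position-by-position, so the subtraction in the x-vector is well defined at every residual stream.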

Using ActivationAdditions to generate modified completions

Given an actual prompt which is fed into the model normally (model.generate(prompt="Hi!")) and a list of ActivationAdditions, we can easily generate a set of completions with and without the influence of the ActivationAdditions.

print_n_comparisons(
    prompt="I hate you because",
    model=gpt2_xl,
    tokens_to_generate=100,
    activation_additions=[love_rp, hate_rp],
    num_comparisons=15,
    seed=42,
    temperature=1, freq_penalty=1, top_p=.3
)

This produces a side-by-side table of completions (rendered as an image in the original README), where the prompt is bolded and the completions are not.

An even starker example is produced by

praise_rp, hurt_rp = get_x_vector(prompt1="Intent to praise", 
                                   prompt2="Intent to hurt", 
                                   coeff=15, act_name=6,
                                   pad_method="tokens_right", model=gpt2_xl,
                                   custom_pad_id=gpt2_xl.to_single_token(' '))
print_n_comparisons(
    prompt="I want to kill you because",
    model=gpt2_xl,
    tokens_to_generate=50,
    activation_additions=[praise_rp, hurt_rp],
    num_comparisons=15,
    seed=0,
    temperature=1, freq_penalty=1, top_p=.3
)
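The structure of this comparison can be sketched with a stand-in generate function (stub_generate is hypothetical; the real code samples from the model with and without the steering hooks attached):

```python
def stub_generate(prompt, steered):
    """Hypothetical stand-in for sampling from the model; the real
    code calls model.generate with or without the hooks installed."""
    suffix = " [steered completion]" if steered else " [normal completion]"
    return prompt + suffix

def n_comparisons(prompt, num_comparisons):
    """Collect paired completions with and without the additions,
    mirroring the structure of print_n_comparisons."""
    return [(stub_generate(prompt, steered=False),
             stub_generate(prompt, steered=True))
            for _ in range(num_comparisons)]

pairs = n_comparisons("I want to kill you because", num_comparisons=3)
```

Each pair shares the same prompt and sampling settings, so any difference between the two completions is attributable to the activation additions.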

For more examples, consult our Google Colab.
