The transformer-visualization from poppingtonic

Plan (New to Mech Interp? Try this after the other notebooks)

Possible updates I'm considering as I plan my return to this project.

Validate Induction Heads for more LMs

Eigenvalues for detecting Induction Heads
NMF
More experiments with Activation Patching
[RLHF Reward Models]
- Linear Probes
- Episodic Linear Probes
Review Paper
- Does Mech-Interp Scale? https://arxiv.org/abs/2307.09458
- ROME (rank-one model editing) https://rome.baulab.info

CIDs for LLMs

When you chain parallel and sequential calls to large language models (as frameworks like LangChain do), you implicitly create a causal graph that can be analyzed visually if you have the right tracing tools (https://github.com/oughtinc/ice). This notebook describes different agents using an explicit formalism based on causal influence diagrams (CIDs), which we can treat as a notation for the data flow, components and steps involved when a user makes a request. We use the example diagrams to explain and fix risk scenarios, showing how much easier it is to debug agent architectures when you can reason visually about the data flow, and we ask questions about intent alignment for AGI in the context of such agents.
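
A minimal sketch of the idea above, assuming a CID can be approximated as a directed graph whose nodes are tagged as chance, decision, or utility nodes. The `CID` class, the node names, and the example agent are all hypothetical and are not taken from the notebook or the paper. The point is only that once the data flow is explicit, a question like "what can untrusted retrieved content influence?" becomes a simple graph query.

```python
from collections import deque

# Node kinds, following the usual CID convention (labels are illustrative).
CHANCE, DECISION, UTILITY = "chance", "decision", "utility"


class CID:
    """A toy causal influence diagram: typed nodes plus directed edges."""

    def __init__(self):
        self.kind = {}       # node name -> node kind
        self.children = {}   # node name -> list of downstream node names

    def add_node(self, name, kind):
        self.kind[name] = kind
        self.children.setdefault(name, [])

    def add_edge(self, src, dst):
        # Refuse edges that mention nodes we have not declared.
        if src not in self.kind or dst not in self.kind:
            raise KeyError(f"unknown node in edge {src!r} -> {dst!r}")
        self.children[src].append(dst)

    def reachable_from(self, start):
        """Return every node downstream of `start` (breadth-first search)."""
        seen = set()
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for child in self.children[node]:
                if child not in seen:
                    seen.add(child)
                    queue.append(child)
        return seen


# Hypothetical agent: a planner LLM reads the user request and a retrieved
# web page, then decides on a tool call that affects user satisfaction.
cid = CID()
cid.add_node("user_request", CHANCE)
cid.add_node("retrieved_page", CHANCE)      # untrusted content
cid.add_node("planner_llm", DECISION)
cid.add_node("tool_call", DECISION)
cid.add_node("user_satisfaction", UTILITY)

cid.add_edge("user_request", "planner_llm")
cid.add_edge("retrieved_page", "planner_llm")
cid.add_edge("planner_llm", "tool_call")
cid.add_edge("tool_call", "user_satisfaction")
cid.add_edge("user_request", "user_satisfaction")

# Everything the untrusted page can influence, directly or indirectly.
influenced = cid.reachable_from("retrieved_page")
print(sorted(influenced))
```

In this toy diagram the untrusted page reaches `tool_call`, which is the kind of path a prompt-injection review would want to flag. A real analysis would work from the traced calls rather than a hand-built graph.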

Examples and Theory in Colab to Get Started:
YouTube (starts at 31:40): https://youtu.be/XauqlTQm-o4?t=1901
Paper: https://docs.google.com/document/d/160Yw_iuvztB6CTT9Osj5wC0sOrEKjfaGkkeeYuwQf4Y/edit#heading=h.kwtox8r6b7n6

Colab: https://colab.research.google.com/drive/1roLQgXhEtI83Q5vX1q24Q9iDgu5LFFWA#scrollTo=5b8fbbeb-e90b-4990-ac3e-d484205b78aa

TODO (Intermediate to Advanced):

Answer Set Programming for Automated Verification of Intent Consistency https://github.com/poppingtonic/transformer-visualization/tree/main/formal-constraints
A note on Security Mindset: discuss fuzzing, imagine potential adversarial and prompt-injection attacks, and explore designs that mitigate them.
Mechanistic Interpretability: Info-Weighted Attention mechanisms, Info-weighted Averaging (https://youtu.be/etFCaFvt2Ks)
[viz] Animate the temporal dependence of events in the CID, given timestamps for when each sub-agent process starts (should add this to the tracing code)
[Alignment Theory] Study links to Garrabrant's Temporal Inference with Finite Factored Sets: https://arxiv.org/abs/2109.11513
CIDs for LLMs In The Wild: The Bounty Hunt. Which kinds of CIDs (from the paper, or any you might come up with) do you think exist in the wild, i.e. implemented and in production? I have not done any kind of study yet, but it would be cool to find the CIDs for ChatGPT, ChatGPT Plugins, ReAct or AutoGPT and reason about them from first principles.
Quality Diversity from Human (or AI) Feedback on CID Variants (or human-provided edits and feedback on whether a variant would be good) with safety properties we can analyze. Similar to LMX, but applied to the multi-agent decomposition of the program trace.
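
For the [viz] item above, one possible starting point is to turn timestamped trace events into cumulative animation frames. This is only a sketch: the `TraceEvent` record and its field names are hypothetical, and it assumes the tracing code can emit a start time for each sub-agent process, which the item itself notes is not there yet.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceEvent:
    node: str          # CID node (sub-agent process) that started
    started_at: float  # start time in seconds; the unit is an assumption


def animation_frames(events):
    """Return one frame per event, each listing the nodes started so far."""
    ordered = sorted(events, key=lambda e: e.started_at)
    frames = []
    active = []
    for event in ordered:
        active = active + [event.node]  # new list, so frames do not alias
        frames.append(active)
    return frames


# Hypothetical trace, deliberately out of order.
events = [
    TraceEvent("tool_call", 2.5),
    TraceEvent("user_request", 0.0),
    TraceEvent("planner_llm", 1.0),
]
frames = animation_frames(events)
for frame in frames:
    print(frame)
```

Each frame could then be rendered by highlighting the listed nodes in the diagram.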
