
poppingtonic / transformer-visualization


Mechanistic interpretability tutorials, results, and a research log, written as I learn from publicly available research and from experimentation.

Jupyter Notebook 99.84% Python 0.16%
interpretable-ai transformers gradio-interface visualization interpretability-jam

transformer-visualization's People

Contributors

poppingtonic


transformer-visualization's Issues

CIDs for LLMs

When you chain parallel and sequential calls to large language models (for example with LangChain), you implicitly create a causal graph that can be analyzed visually if you have the right tracing tools (https://github.com/oughtinc/ice). This notebook describes different agents using an explicit formalism based on causal influence diagrams (CIDs), which we treat as a notation for the data flow, components and steps involved when a user makes a request. We use the example diagrams to explain and fix risk scenarios, showing how easy it is to debug agent architectures when you can reason visually about the data flow, and we ask questions about intent alignment for AGI in the context of such agents.
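The idea of a chain of calls forming a causal graph can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the node names, the node kinds assigned to them, and the helper functions are hypothetical and do not come from the notebook or from ICE. The only library call used is `graphlib.TopologicalSorter` from the standard library (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical node names for a toy agent; none of these come from the notebook.
# Node kinds follow the usual causal influence diagram vocabulary.
KINDS = {
    "user_request": "chance",
    "retrieved_docs": "chance",
    "planner_llm": "decision",
    "summarizer_llm": "decision",
    "final_answer": "decision",
    "user_satisfaction": "utility",
}

# Maps each node to the set of nodes whose output it reads (its parents).
# planner_llm and retrieved_docs are parallel branches; final_answer joins them.
PARENTS = {
    "user_request": set(),
    "retrieved_docs": {"user_request"},
    "planner_llm": {"user_request"},
    "summarizer_llm": {"retrieved_docs"},
    "final_answer": {"planner_llm", "summarizer_llm"},
    "user_satisfaction": {"final_answer"},
}


def ancestors(node, parents):
    """Return every node that can influence `node` through the data flow."""
    seen = set()
    stack = list(parents[node])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def execution_order(parents):
    """Return one valid order in which the calls could run (parents first)."""
    return list(TopologicalSorter(parents).static_order())


if __name__ == "__main__":
    print(execution_order(PARENTS))
    print(sorted(ancestors("final_answer", PARENTS)))
```

Asking "what can influence this node?" is then a graph query (`ancestors`) rather than a matter of reading through prompt templates, which is the kind of reasoning the diagrams are meant to support.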

Examples and theory to get started:
Colab: https://colab.research.google.com/drive/1roLQgXhEtI83Q5vX1q24Q9iDgu5LFFWA#scrollTo=5b8fbbeb-e90b-4990-ac3e-d484205b78aa
YouTube (starts at 31:40): https://youtu.be/XauqlTQm-o4?t=1901
Paper: https://docs.google.com/document/d/160Yw_iuvztB6CTT9Osj5wC0sOrEKjfaGkkeeYuwQf4Y/edit#heading=h.kwtox8r6b7n6

TODO (Intermediate to Advanced):

  • Answer Set Programming for Automated Verification of Intent Consistency https://github.com/poppingtonic/transformer-visualization/tree/main/formal-constraints
  • A note on Security Mindset: discuss fuzzing, imagine potential adversarial and prompt-injection attacks, and explore designs that mitigate them.
  • Mechanistic Interpretability: Info-Weighted Attention mechanisms, Info-weighted Averaging (https://youtu.be/etFCaFvt2Ks)
  • [viz] Animate the temporal dependence of events in the CID when we have timestamps for the start of each sub-agent process. This should be added to the tracing code.
  • [Alignment Theory] Study links to Garrabrant's Temporal Inference with Finite Factored Sets: https://arxiv.org/abs/2109.11513
  • CIDs for LLMs In The Wild: The Bounty Hunt. Which kinds of CIDs (from the paper, or any you come up with) do you think exist in the wild, i.e. implemented and in production? I have not done any kind of study yet, but it would be cool to find the CID of ChatGPT, ChatGPT Plugins, ReAct or AutoGPT and reason about it from first principles.
  • Quality Diversity from Human (or AI) Feedback on CID variants (or on human-provided edits, with feedback on whether each would be good), with safety properties we can analyze. Similar to LMX, but applied to the multi-agent decomposition of the program trace.
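For the [viz] item above, one possible starting point is to sort nodes by their recorded start time and to check that no node started before a node it depends on. The sketch below assumes a trace reduced to a plain dictionary of start times; the node names, timestamps, and function names are all hypothetical and invented for illustration, not taken from the existing tracing code.

```python
# Hypothetical trace: node name -> start time in seconds since the request began.
# These values are invented for illustration, not taken from any real trace.
START_TIMES = {
    "user_request": 0.0,
    "planner_llm": 0.2,
    "retrieved_docs": 0.3,
    "summarizer_llm": 1.1,
    "final_answer": 2.4,
}

# Directed edges (parent, child) of the toy diagram.
EDGES = [
    ("user_request", "planner_llm"),
    ("user_request", "retrieved_docs"),
    ("retrieved_docs", "summarizer_llm"),
    ("planner_llm", "final_answer"),
    ("summarizer_llm", "final_answer"),
]


def animation_frames(start_times):
    """Return node names sorted by start time, one animation frame per node."""
    return sorted(start_times, key=start_times.get)


def order_violations(edges, start_times):
    """Return edges whose child started before its parent."""
    return [(p, c) for p, c in edges if start_times[c] < start_times[p]]


if __name__ == "__main__":
    for frame, name in enumerate(animation_frames(START_TIMES)):
        print(frame, name, START_TIMES[name])
    print("violations:", order_violations(EDGES, START_TIMES))
```

A non-empty result from `order_violations` would suggest either a tracing bug or an edge in the diagram that does not match how the agent actually ran.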
