Topic: mechanistic-interpretability

Something interesting about mechanistic-interpretability

👇 Here are 35 public repositories matching this topic...

alejoacelas / arena_2.0_exhibit

Solutions to ML assignments from the Alignment Research Engineering Accelerator (ARENA) in-person program

User: alejoacelas

cuda mechanistic-interpretability nlp pytorch rl torch-lightning transformers

alejoacelas / bayesian-transformers

Interpretability on 1-layer Transformer models that converge on the Bayesian-optimal solution for statistical tasks

User: alejoacelas

bayesian-inference mechanistic-interpretability transformers

alejoacelas / interp-benchmarks

Reverse-engineered Transformer models as a benchmark for interpretability methods

User: alejoacelas

benchmark causal-analysis mechanistic-interpretability pytorch

alejoacelas / mech-interp-challenges

Starting Kit for the CodaBench competition on Transformer Interpretability

User: alejoacelas

competitive-programming mechanistic-interpretability transformer

alejoacelas / organizer-mech-interp-challenges

Organizer's repository for the Transformer Interpretability CodaBench competition

User: alejoacelas

competitive-programming mechanistic-interpretability transformer

apartresearch / deepdecipher

🦠 DeepDecipher: An open source API to MLP neurons

Organization: apartresearch

Home Page: https://apartresearch.com

academic api interpretability interpretability-jam interpretability-methods machine-learning mechanistic-interpretability research website

apartresearch / interpretability-starter

🧠 Starter templates for doing interpretability research

Organization: apartresearch

Home Page: https://alignmentjam.com/jam/interpretability

alignment-jam interpretability interpretability-jam mechanistic-interpretability

aryamanarora / causalgym

CausalGym: Benchmarking causal interpretability methods on linguistic tasks

User: aryamanarora

Home Page: https://arxiv.org/abs/2402.12560

benchmark interpretability mechanistic-interpretability syntaxgym causality

batsresearch / cross-lingual-detox

Code for "Preference Tuning For Toxicity Mitigation Generalizes Across Languages"

Organization: batsresearch

Home Page: https://arxiv.org/abs/2406.16235

ai-safety mechanistic-interpretability multilingual-nlp nlp cross-lingual-transfer generalization

cx0 / mech-interpretability

Exploring length generalization in the context of the indirect object identification (IOI) task for mechanistic interpretability.

User: cx0

ioi mechanistic-interpretability indirect-object-identification

daspartho / pronoun-prediction

Identifying the Circuit behind Pronoun Prediction in GPT-2 Small

User: daspartho

gpt-2 interpretability mechanistic-interpretability

deanhazineh / emergent-world-representations-othello

A mechanistic interpretability study investigating a sequential model trained to play the board game Othello

User: deanhazineh

gpt-2 intervention mechanistic-interpretability othello-ai

epfl-dlab / llm-latent-language

Repo accompanying our paper "Do Llamas Work in English? On the Latent Language of Multilingual Transformers".

Organization: epfl-dlab

llama2 llm mechanistic-interpretability multilingual-nlp

evan-lloyd / graphpatch

graphpatch is a library for activation patching on PyTorch neural network models.

User: evan-lloyd

interpretability large-language-models mechanistic-interpretability pytorch

francescortu / comp-mech

Competition of Mechanisms: Tracing How Language Models Handle Facts and Counterfactuals

User: francescortu

Home Page: https://arxiv.org/abs/2402.11655

interpretability llm mechanistic-interpretability

jbloomaus / decisiontransformerinterpretability

Interpreting how transformers simulate agents performing RL tasks

User: jbloomaus

Home Page: https://jbloomaus-decisiontransformerinterpretability-app-4edcnc.streamlit.app/

mechanistic-interpretability reinforcement-learning

koayon / atp_star

PyTorch and NNsight implementation of AtP* (Kramár et al., 2024, DeepMind)

User: koayon

large-language-models machine-learning mechanistic-interpretability

lejoon / cup-transformer

A project that simulates a game of shuffling cups with a ball hidden underneath one of them. It also trains a Transformer-based deep learning model to predict the final position of the ball after a series of swaps.

User: lejoon

deep-learning mechanistic-interpretability transformers

lkopf / cosy

CoSy: Evaluating Textual Explanations

User: lkopf

global-explainability machine-learning mechanistic-interpretability xai-evaluation

matthiasdellago / visualising-attention

Visualising (self-)attention as a vector field: exploring and building intuition. Based on anvaka.github.io/fieldplay.

User: matthiasdellago

attention attention-mechanism machine-learning mechanistic-interpretability transformer vector-field visualization

microsoft / automated-explanations

Explain a black-box module in natural language.

Organization: microsoft

Home Page: https://arxiv.org/abs/2305.09863

artificial-intelligence explanation gpt gpt4 interpretability language-model large-language-models machine-learning mechanistic-interpretability neuroscience

nix07 / binding-circuit-discovery

This repository contains the code used for the experiments in the paper "Discovering Variable Binding Circuitry with Desiderata".

User: nix07

Home Page: https://dcm.baulab.info/

mechanistic-interpretability science-of-deep-learning

nix07 / finetuning

This repository contains the code used for the experiments in the paper "Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking".

User: nix07

Home Page: https://finetuning.baulab.info

entity-tracking finetuning science-of-deep-learning mechanistic-interpretability

openmoss / language-model-saes

Code for the OpenMOSS Mechanistic Interpretability Team's Sparse Autoencoder (SAE) research.

Organization: openmoss

interpretability mechanistic-interpretability sparse-autoencoders sparse-dictionary

pauljblazek / deepdistilling

Mechanistically interpretable neurosymbolic AI (Nature Comput Sci 2024): losslessly compressing NNs to computer code and discovering new algorithms which generalize out-of-distribution and outperform human-designed algorithms

User: pauljblazek

Home Page: https://rdcu.be/dy2Go

explainable-ai program-synthesis mechanistic-interpretability inductive-logic-programming model-distillation distilling neurosymbolic domain-adaptation out-of-distribution-generalization interpretable knowledge-distillation

ruizheliuoa / awesome-interpretability-in-large-language-models

This repository collects all relevant resources about interpretability in LLMs

User: ruizheliuoa

dictionary-learning interpretability-and-explainability mechanistic-interpretability sparse-autoencoder

sagarss24 / mtb_manuscript_data

Physiological modeling into the metaverse of Mycobacterium tuberculosis beta CA inhibition mechanism

User: sagarss24

drug-design machine-learning mechanism-of-action mechanistic-interpretability systems-biology tuberculosis

stanfordnlp / pyvene

Stanford NLP Python Library for Understanding and Improving PyTorch Models via Interventions

Organization: stanfordnlp

Home Page: http://pyvene.ai

interpretability mechanistic-interpretability intervention activation-intervention activation-patching

steering-vectors / steering-vectors

Steering vectors for transformer language models in PyTorch / Hugging Face

Organization: steering-vectors

Home Page: https://steering-vectors.github.io/steering-vectors/

ai gpt huggingface mechanistic-interpretability nlp pytorch representation-engineering

taufeeque9 / codebook-features

Sparse and discrete interpretability tool for neural networks

User: taufeeque9

Home Page: https://huggingface.co/spaces/taufeeque/codebook-features

codebook features interpretability language-model mechanistic-interpretability transformers

wesg52 / sparse-probing-paper

Full code for the sparse probing paper.

User: wesg52

Home Page: https://arxiv.org/abs/2305.01610

ai-alignment ai-safety interpretability mechanistic-interpretability

wesg52 / universal-neurons

Universal Neurons in GPT2 Language Models

User: wesg52

ai-safety interpretability llm mechanistic-interpretability

yash-srivastava19 / arrakis

Arrakis is a library to conduct, track and visualize mechanistic interpretability experiments.

User: yash-srivastava19

Home Page: https://arrakis-mi.readthedocs.io/en/latest/README.html

anthropic explainable-ai garcon mechanistic-interpretability transformer transformerlens

zhaoyi-li21 / creme

Implementation for the paper "Understanding and Patching Compositional Reasoning in LLMs" @ ACL2024-Findings, Bangkok, Thailand.

User: zhaoyi-li21

Home Page: https://arxiv.org/abs/2402.14328

large-language-models mechanistic-interpretability multi-hop-reasoning compositional-reasoning factual-reasoning

zroe1 / toy-models-of-superposition

A replication of "Toy Models of Superposition," a groundbreaking machine learning research paper published by authors affiliated with Anthropic and Harvard in 2022.

User: zroe1

machine-learning mechanistic-interpretability python3 pytorch toy-models

