
telephonegame's Introduction

Probing BERT's priors with serial reproduction chains (Findings of ACL)

Repository organization

  • model contains code to run chains sampling sentences from BERT (a minimal sketch of one sampling step follows this list).

  • corpus contains code for extracting comparable sentences from Wikipedia and BookCorpus.

  • analysis contains code for analyzing behavioral experiment data, performing corpus comparison, and producing figures.
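
As referenced in the first bullet, the chains are produced by iteratively masking and resampling words from BERT, i.e. Gibbs sampling over sentences. Below is a minimal sketch of one such step using the HuggingFace transformers API; the function, the single-site update, and the seed sentence are illustrative, not the repository's actual code:

```python
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def gibbs_step(token_ids, temperature=1.0):
    """Mask one random position and resample it from BERT's conditional."""
    ids = token_ids.clone()
    pos = torch.randint(1, ids.shape[1] - 1, (1,)).item()  # skip [CLS]/[SEP]
    ids[0, pos] = tokenizer.mask_token_id
    with torch.no_grad():
        logits = model(ids).logits[0, pos] / temperature
    ids[0, pos] = torch.multinomial(torch.softmax(logits, dim=-1), 1).item()
    return ids

# run a short chain from an arbitrary seed sentence
ids = tokenizer("the cat sat on the mat", return_tensors="pt")["input_ids"]
for _ in range(100):
    ids = gibbs_step(ids)
print(tokenizer.decode(ids[0], skip_special_tokens=True))
```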

telephonegame's People

Contributors

hawkrobe, taka-yamakoshi

Forkers

michahu

telephonegame's Issues

CoNLL paper

General writing

  • Abstract
  • Move details about tokenization and handling Wikipedia down to the corpus comparison section
  • Write corpus methods
  • Write behavioral methods
  • Discussion
  • Appendix A: MH notes

Analyses/results

  • Section 3.1: Checking convergence of chains (e.g. re-making the squeeze plot and showing that the chains begin to overlap in probability)
  • Section 3.2: Auto-correlation analyses (e.g. Figure 1)
  • Section 3.3: Some check on whether samples are 'memorized' duplicates of corpus sentences
  • Section 4.1: Lexical analyses. Re-make the Zipf plot, take the Spearman rank correlation between the lexical frequency distributions of the BERT samples vs. the corpus, and make a table of "biggest deviations" (which words are most over- and under-represented in the BERT samples, via PMI; we can also do this for bigrams/trigrams). A minimal sketch follows this list.
  • Section 4.2: Syntactic analyses. Re-make our POS & DEP comparison figures and identify sources of the largest deviations (e.g. which POS are most over- and under-represented).
  • Section 5.1: Report statistics for human behavioral ratings
  • Section 5.2: Predict human ratings from features (e.g. report linear regression)
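
A minimal sketch of the Section 4.1 lexical comparison, assuming word counts from the BERT samples and from the corpus are already collected into two Counter objects (the function name and the log-frequency-ratio measure of over-/under-representation are illustrative):

```python
from collections import Counter
from math import log

from scipy.stats import spearmanr

def lexical_comparison(bert_counts: Counter, corpus_counts: Counter, top_k=20):
    """Rank-correlate the two frequency distributions and list the words most
    over- and under-represented in the BERT samples (a PMI-style log-ratio)."""
    shared = sorted(set(bert_counts) & set(corpus_counts))
    bert_total = sum(bert_counts.values())
    corpus_total = sum(corpus_counts.values())

    # Spearman rank correlation over the shared vocabulary
    rho, _ = spearmanr([bert_counts[w] for w in shared],
                       [corpus_counts[w] for w in shared])

    # log-ratio of relative frequencies; positive = over-represented in BERT
    deviation = {w: log((bert_counts[w] / bert_total) /
                        (corpus_counts[w] / corpus_total))
                 for w in shared}
    over = sorted(deviation, key=deviation.get, reverse=True)[:top_k]
    under = sorted(deviation, key=deviation.get)[:top_k]
    return rho, over, under
```

The same function applies unchanged to bigram/trigram counts, since Counter keys can be tuples.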

Data collection

  • Run an additional behavioral sample with n-gram generations and a more varied set of sentences during burn-in...
    • take the Google 5-gram counts and sample the first word from the unigram probability, the second word from the bigram probability, etc. (a sketch follows this list)
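
A rough sketch of that progressive n-gram sampler; the ngram_probs lookup structure is hypothetical and the actual Google n-gram access is elided:

```python
import random

def sample_ngram_sentence(length, ngram_probs):
    """Sample w1 from unigram probabilities, w2 from P(w2 | w1),
    w3 from P(w3 | w1, w2), and so on up to 5-gram context.
    `ngram_probs` maps a (possibly empty) tuple of preceding words
    to a dict of next-word probabilities."""
    words = []
    for i in range(length):
        context = tuple(words[max(0, i - 4):i])  # at most 4 words of history
        dist = ngram_probs[context]
        choices, probs = zip(*dist.items())
        words.append(random.choices(choices, weights=probs, k=1)[0])
    return " ".join(words)
```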

Figures

  • Taka: "Figure 1": auto-correlation curves at different lags (a sketch of the lag autocorrelation follows this list)
    • Rerun very long chains (e.g. 100k steps).
    • With and without 'fixed point detection'
    • With and without 'multiple sentence detection' (where we run sentencizer on every step to prevent getting into a multi-sentence state)
  • Robert: "Front figure": graphical depiction of our technique and why it's interesting (frivolous tsne plot to see chain bouncing around space of sentences)
  • Robert: "Alignment": show relationship between sentence lengths measured in sub-word and word...

Stretch goals:

  • Section 4.3: Some syntactic complexity measures (e.g. dependency distances) or grammatical error measures.

BERT TODOs

Convergence analysis

  • switch Stephen's 'high probability' sentences out for even higher-probability sentences (e.g. from the end of the low-temperature chains), so we can see the 'squeeze'.
  • plot the probability of the sampled word under BERT (vs. the whole-sentence probability), since this is closer to the 'objective' that Gibbs sampling is using (a sketch of the whole-sentence score follows this list)
  • make the same plots for temperature = 1.
  • make the same plots for the Metropolis-Hastings algorithm
  • one final plot: check MH with temp = 1 for more time steps (t ~ 500)
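
One reasonable reading of "whole sentence probability" here is BERT's pseudo-log-likelihood: the sum over positions of log P(w_i | w_-i), with each position masked in turn. A sketch of that score (illustrative names, not the repository's implementation):

```python
import torch

def pseudo_log_likelihood(sentence, model, tokenizer):
    """Sum of log P(w_i | w_-i), masking each position in turn."""
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"]
    total = 0.0
    for pos in range(1, ids.shape[1] - 1):  # skip [CLS] and [SEP]
        true_id = ids[0, pos].item()
        masked = ids.clone()
        masked[0, pos] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked).logits[0, pos]
        total += torch.log_softmax(logits, dim=-1)[true_id].item()
    return total
```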

Comparing distributional statistics to corpus

  • lexical frequency in chains vs. corpus? (Zipf's law?)
  • split Wikipedia in half (A & B) and plot each to see how close we ought to expect BERT to be
  • examine the set difference of the vocabularies (e.g. words in Wikipedia but not generated by BERT)
  • examine the frequency ranking of these words (e.g. are they mostly coming from the tails?)
  • controls: generate sentences from another model (e.g. an LSTM or GPT-2); is the BERT distribution closer to or further from the data?
  • examine words by change in rank (e.g. which words moved 'up' the most in rank, and 'down' the most in rank)
  • part of speech distributions
  • statistics of syntactic structure in chains vs. corpus?
    • something based on dependency parses? (e.g. dependency lengths?)
    • run a constituency parse (e.g. phrase-level tag frequencies?)
    • extract the part of speech for each word, e.g. (NOUN VERB PREP DET NOUN...), to coarse-grain all sentences to just the POS sequence, and compare n-gram statistics on the POS sequences (a sketch follows this list)
  • how those statistics change in the BERT chains (e.g. do sentences become syntactically simpler as they go through the chains?)
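
The POS coarse-graining mentioned above could be sketched with spaCy roughly as follows (model name, n, and counting scheme are illustrative; assumes en_core_web_sm is installed):

```python
from collections import Counter

import spacy

nlp = spacy.load("en_core_web_sm")

def pos_ngram_counts(sentences, n=3):
    """Map each sentence to its POS-tag sequence and count POS n-grams."""
    counts = Counter()
    for doc in nlp.pipe(sentences):
        tags = [tok.pos_ for tok in doc]  # e.g. ['DET', 'NOUN', 'VERB', ...]
        counts.update(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))
    return counts
```

The same counts can be computed for the BERT chains and for the corpus, then compared with the deviation machinery sketched for the lexical analyses.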

Longer-term evaluations

  • control: compare to another non-Wikipedia corpus, e.g. web corpus
  • examine probabilities under a more traditional language model (e.g. GPT-2)
    • note: we could define an iterated learning dynamics with a sequential model by picking a site in the sentence, chopping off the rest of the sentence, and re-generating the whole sentence from there.
    • or, alternatively, picking a site, re-sampling that word given the previous words, and then scoring the probability of the whole sentence with that word swapped in (which is a little weirder)
  • show the relationship between the Gibbs sampling distribution and the MH distribution (a sketch of the MH step follows this list)
    • can we prove a theorem that MH, using the BERT way of factoring sentence probability as P(s) = P(w_1 | w_-1)*...*P(w_n | w_-n), converges to the same thing as Gibbs using P(w_i | w_-i)?
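
For reference, a sketch of the single-site Metropolis-Hastings step being compared against Gibbs: propose a new word at one position from BERT's masked conditional, then accept or reject against a whole-sentence score (target_logprob, e.g. a pseudo-log-likelihood adapted to take token ids). The acceptance rule shown is the standard MH ratio for this kind of single-site proposal; it is an illustration, not the repository's exact implementation:

```python
import math
import random

import torch

def mh_step(ids, model, tokenizer, target_logprob):
    """One MH step: propose a new word at a random site from BERT's masked
    conditional, then accept/reject against `target_logprob(ids)`, a score
    for the whole sentence (e.g. a pseudo-log-likelihood over token ids)."""
    pos = random.randrange(1, ids.shape[1] - 1)  # skip [CLS]/[SEP]
    old_id = ids[0, pos].item()

    masked = ids.clone()
    masked[0, pos] = tokenizer.mask_token_id
    with torch.no_grad():
        log_q = torch.log_softmax(model(masked).logits[0, pos], dim=-1)
    new_id = torch.multinomial(log_q.exp(), 1).item()

    proposal = ids.clone()
    proposal[0, pos] = new_id

    # acceptance ratio: p(x') q(old | context) / (p(x) q(new | context))
    log_accept = (target_logprob(proposal) - target_logprob(ids)
                  + log_q[old_id].item() - log_q[new_id].item())
    return proposal if random.random() < math.exp(min(0.0, log_accept)) else ids
```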

Additional thoughts

  • What should we do with punctuation?
    • option (1): exclude punctuation entirely? (i.e. a conditional prior given the assumption of no punctuation?)
    • option (2): ignore the fact that sentences sometimes get weird when lots of slots are taken up by punctuation.
