
telephonegame's Introduction

Probing BERT's priors with serial reproduction chains (Findings of ACL)

Repository organization

  • model contains code to run chains sampling sentences from BERT (a minimal sketch of one sampling step follows this list).

  • corpus contains code for extracting comparable sentences from Wikipedia and BookCorpus.

  • analysis contains code for analyzing behavioral experiment data, performing corpus comparison, and producing figures.
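
As referenced in the first bullet, the chains are produced by iteratively masking and resampling words from BERT, i.e. Gibbs sampling over sentences. Below is a minimal sketch of one such step using the HuggingFace transformers API; the function, the single-site update, and the seed sentence are illustrative, not the repository's actual code:

```python
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def gibbs_step(token_ids, temperature=1.0):
    """Mask one random position and resample it from BERT's conditional."""
    ids = token_ids.clone()
    pos = torch.randint(1, ids.shape[1] - 1, (1,)).item()  # skip [CLS]/[SEP]
    ids[0, pos] = tokenizer.mask_token_id
    with torch.no_grad():
        logits = model(ids).logits[0, pos] / temperature
    ids[0, pos] = torch.multinomial(torch.softmax(logits, dim=-1), 1).item()
    return ids

# run a short chain from an arbitrary seed sentence
ids = tokenizer("the cat sat on the mat", return_tensors="pt")["input_ids"]
for _ in range(100):
    ids = gibbs_step(ids)
print(tokenizer.decode(ids[0], skip_special_tokens=True))
```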

telephonegame's People

Contributors

hawkrobe, taka-yamakoshi

Forkers

michahu

telephonegame's Issues

CoNLL paper

General writing

  • Abstract
  • Move details about tokenization and handling Wikipedia down to the corpus comparison section
  • Write corpus methods
  • Write behavioral methods
  • Discussion
  • Appendix A: MH notes

Analyses/results

  • Section 3.1: Checking convergence of chains (e.g. re-making the squeeze plot and showing that the chains begin to overlap in probability)
  • Section 3.2: Auto-correlation analyses (e.g. Figure 1)
  • Section 3.3: Some check on whether samples are 'memorized' duplicates of corpus sentences
  • Section 4.1: Lexical analyses. Re-make the Zipf plot, take the Spearman rank correlation between the lexical frequency distributions of the BERT samples vs. the corpus, and make a table of "biggest deviations" (which words are most over- and under-represented in the BERT samples, via PMI; we can also do this for bigrams/trigrams). A minimal sketch follows this list.
  • Section 4.2: Syntactic analyses. Re-make our POS & DEP comparison figures and identify sources of the largest deviations (e.g. which POS are most over- and under-represented).
  • Section 5.1: Report statistics for human behavioral ratings
  • Section 5.2: Predict human ratings from features (e.g. report linear regression)
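
A minimal sketch of the Section 4.1 lexical comparison, assuming word counts from the BERT samples and from the corpus are already collected into two Counter objects (the function name and the log-frequency-ratio measure of over-/under-representation are illustrative):

```python
from collections import Counter
from math import log

from scipy.stats import spearmanr

def lexical_comparison(bert_counts: Counter, corpus_counts: Counter, top_k=20):
    """Rank-correlate the two frequency distributions and list the words most
    over- and under-represented in the BERT samples (a PMI-style log-ratio)."""
    shared = sorted(set(bert_counts) & set(corpus_counts))
    bert_total = sum(bert_counts.values())
    corpus_total = sum(corpus_counts.values())

    # Spearman rank correlation over the shared vocabulary
    rho, _ = spearmanr([bert_counts[w] for w in shared],
                       [corpus_counts[w] for w in shared])

    # log-ratio of relative frequencies; positive = over-represented in BERT
    deviation = {w: log((bert_counts[w] / bert_total) /
                        (corpus_counts[w] / corpus_total))
                 for w in shared}
    over = sorted(deviation, key=deviation.get, reverse=True)[:top_k]
    under = sorted(deviation, key=deviation.get)[:top_k]
    return rho, over, under
```

The same function applies unchanged to bigram/trigram counts, since Counter keys can be tuples.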

Data collection

  • Run an additional behavioral sample with n-gram generations and a more varied set of sentences during burn-in...
    • take the Google 5-gram counts and sample the first word from the unigram probability, the second word from the bigram probability, etc. (a sketch follows this list)
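
A rough sketch of that progressive n-gram sampler; the ngram_probs lookup structure is hypothetical and the actual Google n-gram access is elided:

```python
import random

def sample_ngram_sentence(length, ngram_probs):
    """Sample w1 from unigram probabilities, w2 from P(w2 | w1),
    w3 from P(w3 | w1, w2), and so on up to 5-gram context.
    `ngram_probs` maps a (possibly empty) tuple of preceding words
    to a dict of next-word probabilities."""
    words = []
    for i in range(length):
        context = tuple(words[max(0, i - 4):i])  # at most 4 words of history
        dist = ngram_probs[context]
        choices, probs = zip(*dist.items())
        words.append(random.choices(choices, weights=probs, k=1)[0])
    return " ".join(words)
```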

Figures

  • Taka: "Figure 1": auto-correlation curves at different lags (a sketch of the lag autocorrelation follows this list)
    • Rerun very long chains (e.g. 100k steps).
    • With and without 'fixed point detection'
    • With and without 'multiple sentence detection' (where we run sentencizer on every step to prevent getting into a multi-sentence state)
  • Robert: "Front figure": graphical depiction of our technique and why it's interesting (frivolous tsne plot to see chain bouncing around space of sentences)
  • Robert: "Alignment": show relationship between sentence lengths measured in sub-word and word...

Stretch goals:

  • Section 4.3: Some syntactic complexity measures (e.g. dependency distances) or grammatical error measures.

BERT TODOs

Convergence analysis

  • switch Stephen's 'high probability' sentences out for even higher-probability sentences (e.g. from the end of the low-temperature chains), so we can see the 'squeeze'.
  • plot the probability of the sampled word under BERT (vs. the whole-sentence probability), since this is closer to the 'objective' that Gibbs sampling is using (a sketch of the whole-sentence score follows this list)
  • make the same plots for temperature = 1.
  • make the same plots for the Metropolis-Hastings algorithm
  • one final plot: check MH with temp = 1 for more time steps (t ~ 500)
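
One reasonable reading of "whole sentence probability" here is BERT's pseudo-log-likelihood: the sum over positions of log P(w_i | w_-i), with each position masked in turn. A sketch of that score (illustrative names, not the repository's implementation):

```python
import torch

def pseudo_log_likelihood(sentence, model, tokenizer):
    """Sum of log P(w_i | w_-i), masking each position in turn."""
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"]
    total = 0.0
    for pos in range(1, ids.shape[1] - 1):  # skip [CLS] and [SEP]
        true_id = ids[0, pos].item()
        masked = ids.clone()
        masked[0, pos] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked).logits[0, pos]
        total += torch.log_softmax(logits, dim=-1)[true_id].item()
    return total
```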

Comparing distributional statistics to corpus

  • lexical frequency in chains vs. corpus? (Zipf's law?)
  • split Wikipedia in half (A & B) and plot each to see how close we ought to expect BERT to be
  • examine the set difference of the vocabularies (e.g. words in Wikipedia but not generated by BERT)
  • examine the frequency ranking of these words (e.g. are they mostly coming from the tails?)
  • controls: generate sentences from another model (e.g. an LSTM or GPT-2); is the BERT distribution closer to or further from the data?
  • examine words by change in rank (e.g. which words moved 'up' the most in rank, and 'down' the most in rank)
  • part of speech distributions
  • statistics of syntactic structure in chains vs. corpus?
    • something based on dependency parses? (e.g. dependency lengths?)
    • run a constituency parse (e.g. phrase-level tag frequencies?)
    • extract the part of speech for each word, e.g. (NOUN VERB PREP DET NOUN...), to coarse-grain all sentences to just the POS sequence, and compare n-gram statistics on the POS sequences (a sketch follows this list)
  • how those statistics change in the BERT chains (e.g. do sentences become syntactically simpler as they go through the chains?)
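
The POS coarse-graining mentioned above could be sketched with spaCy roughly as follows (model name, n, and counting scheme are illustrative; assumes en_core_web_sm is installed):

```python
from collections import Counter

import spacy

nlp = spacy.load("en_core_web_sm")

def pos_ngram_counts(sentences, n=3):
    """Map each sentence to its POS-tag sequence and count POS n-grams."""
    counts = Counter()
    for doc in nlp.pipe(sentences):
        tags = [tok.pos_ for tok in doc]  # e.g. ['DET', 'NOUN', 'VERB', ...]
        counts.update(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))
    return counts
```

The same counts can be computed for the BERT chains and for the corpus, then compared with the deviation machinery sketched for the lexical analyses.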

Longer-term evaluations

  • control: compare to another non-Wikipedia corpus, e.g. web corpus
  • examine probabilities under a more traditional language model (e.g. GPT-2)
    • note: we could define an iterated learning dynamics with a sequential model by picking a site in the sentence, chopping off the rest of the sentence, and re-generating the whole sentence from there.
    • or, alternatively, picking a site, re-sampling that word given the previous words, and then scoring the probability of the whole sentence with that word swapped in (which is a little weirder)
  • show the relationship between the Gibbs sampling distribution and the MH distribution (a sketch of the MH step follows this list)
    • can we prove a theorem that MH, using the BERT way of factoring sentence probability as P(s) = P(w_1 | w_-1)*...*P(w_n | w_-n), converges to the same thing as Gibbs using P(w_i | w_-i)?
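
For reference, a sketch of the single-site Metropolis-Hastings step being compared against Gibbs: propose a new word at one position from BERT's masked conditional, then accept or reject against a whole-sentence score (target_logprob, e.g. a pseudo-log-likelihood adapted to take token ids). The acceptance rule shown is the standard MH ratio for this kind of single-site proposal; it is an illustration, not the repository's exact implementation:

```python
import math
import random

import torch

def mh_step(ids, model, tokenizer, target_logprob):
    """One MH step: propose a new word at a random site from BERT's masked
    conditional, then accept/reject against `target_logprob(ids)`, a score
    for the whole sentence (e.g. a pseudo-log-likelihood over token ids)."""
    pos = random.randrange(1, ids.shape[1] - 1)  # skip [CLS]/[SEP]
    old_id = ids[0, pos].item()

    masked = ids.clone()
    masked[0, pos] = tokenizer.mask_token_id
    with torch.no_grad():
        log_q = torch.log_softmax(model(masked).logits[0, pos], dim=-1)
    new_id = torch.multinomial(log_q.exp(), 1).item()

    proposal = ids.clone()
    proposal[0, pos] = new_id

    # acceptance ratio: p(x') q(old | context) / (p(x) q(new | context))
    log_accept = (target_logprob(proposal) - target_logprob(ids)
                  + log_q[old_id].item() - log_q[new_id].item())
    return proposal if random.random() < math.exp(min(0.0, log_accept)) else ids
```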

Additional thoughts

  • What should we do with punctuation?
    • option (1): exclude punctuation entirely? (i.e. a conditional prior given the assumption of no punctuation?)
    • option (2): ignore the fact that sentences sometimes get weird when lots of slots are taken up by punctuation.
