josepablocam / changestructor
Demo repo
License: MIT License
Our question proposal is based on a ranking loss: we want the dialogue for a given change to be more similar to that change than to randomly sampled historical changes.
Can we show this empirically?
If we look at high-quality versus low-quality repositories, is it true that commit messages have this property? Can we compare these two repository quality groups? @citostyle
What are good question templates for commit logs? Can we mine them from existing high-quality repositories? If not systematically, then at least manually/qualitatively?
We currently have very simple templates (https://github.com/josepablocam/changestructor/blob/master/chg/annotator/template_annotator.py#L129)
so that we don't have to run `git add`
(we just look at the changes), make no commit, and write nothing to the DB. This is exclusively for ease of development.
I've implemented some basic embedding and question ranking functionality.
https://github.com/josepablocam/changestructor/blob/master/chg/embed/basic.py
uses Microsoft's CodeBERT model to embed code changes (`BasicEmbedder.embed_code`), stand-alone natural language questions (`BasicEmbedder.embed_nl`), and dialogue, i.e. a sequence of questions and answers (`BasicEmbedder.embed_dialogue`).
CodeBERT has a maximum input length (in terms of tokens), so to handle arbitrary length inputs we chunk up the input into maximum size parts, embed each of them separately, and then average those embeddings.
To produce an embedding for a chunk, we take the hidden state after each token in the sequence is consumed and average those hidden states dimension-wise.
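The chunking-and-averaging scheme above can be sketched as follows. This is a minimal illustration, not the repo's actual code: `embed_chunk`, `mean_pool`, and `embed_long_input` are hypothetical names, and `embed_chunk` stands in for a single CodeBERT forward pass.

```python
import numpy as np

def mean_pool(hidden_states):
    """Average per-token hidden states (seq_len, dim) -> a single (dim,) vector."""
    return np.mean(hidden_states, axis=0)

def embed_long_input(token_ids, embed_chunk, max_len=512):
    """Handle inputs longer than the model's token limit: split into
    max_len-sized chunks, embed each chunk separately, then average
    the per-chunk embeddings into one vector."""
    chunks = [token_ids[i:i + max_len] for i in range(0, len(token_ids), max_len)]
    chunk_vecs = [embed_chunk(chunk) for chunk in chunks]  # each is a (dim,) vector
    return np.mean(chunk_vecs, axis=0)
```

Averaging chunk embeddings loses ordering information across chunks, but it keeps the output dimension fixed regardless of input length.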
https://github.com/josepablocam/changestructor/blob/master/chg/ranker/model_based_ranking.py
Given a dialogue, code changes, and a set of randomly sampled code changes, we compute the following ranking loss
(https://github.com/josepablocam/changestructor/blob/master/chg/ranker/model_based_ranking.py#L73)
code_vec <- embed_code(code_changes)
diag_vec <- embed_dialogue(dialogue)
# negative code in that this dialogue is *not* associated with these changes
neg_code_vecs <- [embed_code(neg_code) for neg_code in random_sample_code_changes]
# compute for each neg_code_vec
max(0, delta - sim(code_vec, diag_vec) + sim(neg_code_vec, diag_vec))
# and then average those individual losses to get a mean loss
We will refer to this as our ranking loss. The intuition is that we want the dialogue associated with our current code change to be more similar to that change than to a set of randomly sampled negative changes.
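A minimal sketch of this hinge-style ranking loss, assuming `sim` is cosine similarity (the pseudocode above does not pin down the similarity function, so that choice is an assumption here):

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def ranking_loss(code_vec, diag_vec, neg_code_vecs, delta=0.1):
    """Penalize each negative change that is not at least `delta` less
    similar to the dialogue than the true change, then average."""
    pos_sim = cosine_sim(code_vec, diag_vec)
    losses = [max(0.0, delta - pos_sim + cosine_sim(neg, diag_vec))
              for neg in neg_code_vecs]
    return float(np.mean(losses))
```

The loss is zero exactly when every negative change is at least `delta` less similar to the dialogue than the true change.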
We implement a simple model to predict this loss, given the code change, the dialogue history for that code change up to that point, and a set of randomly sampled code changes.
https://github.com/josepablocam/changestructor/blob/master/chg/ranker/model_based_ranking.py#L21
This model uses random forests and predicts the ranking loss as a function of the feature vector x defined below
code_vec <- embed_code(code_changes)
hist_vec <- embed_dialogue(dialogue_up_to_now)
neg_code_vecs <- [embed_code(neg_code) for neg_code in random_sample_code_changes]
https://github.com/josepablocam/changestructor/blob/master/chg/ranker/model_based_ranking.py#L118
context_vec <- concat(code_vec, hist_vec, neg_code_vecs)
question_vec <- embed_nl(proposed_question)
x <- concat(context_vec, question_vec)
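The feature construction above amounts to a pair of concatenations; a sketch, with `build_feature_vector` as a hypothetical helper name:

```python
import numpy as np

def build_feature_vector(code_vec, hist_vec, neg_code_vecs, question_vec):
    """Concatenate the context (code change, dialogue history so far, and
    negative samples) with the embedded candidate question to form the
    regression input x."""
    context_vec = np.concatenate([code_vec, hist_vec] + list(neg_code_vecs))
    return np.concatenate([context_vec, question_vec])
```

Note this fixes the number of negative samples, since the regressor expects a fixed-length input.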
To choose a question, we pick the question in our candidate collection that maximizes the expected improvement over the current ranking loss.
https://github.com/josepablocam/changestructor/blob/master/chg/ranker/model_based_ranking.py#L31
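The selection step can be sketched as an argmax over predicted improvement. `choose_question`, `predict_loss`, and `build_x` are hypothetical names standing in for the fitted regressor and the feature construction, not the repo's actual API:

```python
import numpy as np

def choose_question(candidates, current_loss, predict_loss, build_x):
    """Pick the candidate question whose predicted post-answer ranking loss
    gives the largest expected improvement over the current loss."""
    improvements = [current_loss - predict_loss(build_x(q)) for q in candidates]
    return candidates[int(np.argmax(improvements))]
```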
The model accumulates the user's dialogue and uses it to refit the regression model (https://github.com/josepablocam/changestructor/blob/master/chg/ranker/model_based_ranking.py#L194). We also fit an initial version offline using the existing git history (https://github.com/josepablocam/changestructor/blob/master/chg/ranker/model_based_ranking.py#L213)
This problem feels connected to both RL and Bayesian optimization. For the former, we have a stateful sequential process where the questions and answers affect an eventual final ranking loss. For the latter, we have an "expensive function" to evaluate (i.e., the user's time spent writing an answer to each question). I would love to see if there is a better setup than what I've proposed in this implementation, but I don't know enough about either area to see it myself, so I'm very open to reformulations/reimplementations.
Currently this can only be run using Docker, though it should work normally. I think AFS may be part of the issue.
To produce questions, we append "entities" to the templates we currently have. These entities are, e.g., variable names and function names drawn from relevant contexts in the code change or extracted from the user's prior dialogue answers.
We should extend/refine these.
We also need to filter out entities that don't make sense (e.g. simple pronouns).
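A minimal sketch of such a filter, assuming a simple stopword-style blocklist (the pronoun set and the `filter_entities` name are illustrative, not the repo's implementation):

```python
# Hypothetical blocklist: pronouns and similar tokens that make poor entities.
PRONOUNS = {"it", "this", "that", "they", "he", "she", "we", "you", "i"}

def filter_entities(entities):
    """Keep only entities that look meaningful: not pronouns and not
    single characters."""
    return [e for e in entities if e.lower() not in PRONOUNS and len(e) > 1]
```

A fuller version might also apply part-of-speech tagging or check that the entity actually appears in the diff.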
We proposed a user study:
Take a large/serious SE project, and a given code change in that project.
Group A1: annotates the change using changestructor's dialogue system
Group A2: annotates the change using changestructor but with randomized proposed question order
Separately we automatically generate a commit message using an existing generation system
First eval: qualitative experience of A1 vs A2, qualitative characterization of answers (e.g. length, time, etc.)
Group B1: review automated message vs changestructor dialogue, answer understandability/quality questions
Group B2: review automated message vs changestructor random order, answer understandability/quality questions
Group B3: changestructor vs changestructor random order, answer understandability/quality questions
There are a ton of questions to answer and properly structure here, but this is the gist of what we thought about. We also probably need to refine/limit scope and be realistic about time expectations for participants, etc. @citostyle