
changestructor's People

Contributors

josepablocam


changestructor's Issues

Empirically validate ranking loss argument

Our question proposal is based on a ranking loss, where we want to make the dialogue for a given change more similar to that change than to randomly sampled historical changes.

Can we show this empirically?

If we look at high-quality versus low-quality repositories, is it true that commit messages have this property? Can we compare the two repository quality groups? @citostyle
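
A sketch of what this empirical check could look like (all helper names here are hypothetical, and the embeddings would come from something like BasicEmbedder): for each commit, test whether its message is closer to its own diff than to randomly sampled diffs, then compare the hit rate across the two repository quality groups.

import random
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def ranking_hit_rate(msg_vecs, diff_vecs, n_negatives=10, seed=0):
    # msg_vecs[i] and diff_vecs[i] come from the same commit
    rng = random.Random(seed)
    hits, total = 0, 0
    for i, (m, d) in enumerate(zip(msg_vecs, diff_vecs)):
        pos_sim = cosine(m, d)
        for _ in range(n_negatives):
            j = rng.randrange(len(diff_vecs))
            if j == i:
                continue
            hits += pos_sim > cosine(m, diff_vecs[j])
            total += 1
    return hits / max(total, 1)

# compare, e.g., ranking_hit_rate for a high-quality repo vs a low-quality one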

add --dev flag

So that we don't have to run git add (just look at the changes), don't commit, and don't write to the DB. Exclusively for ease of development.
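
A minimal sketch of how such a flag could be wired up (names hypothetical; this is not the actual CLI code):

import argparse

def main():
    parser = argparse.ArgumentParser(prog="chg")
    parser.add_argument(
        "--dev",
        action="store_true",
        help="inspect changes only: no git add, no commit, no DB writes",
    )
    args = parser.parse_args()

    # ... collect the working diff and run the dialogue as usual ...

    if args.dev:
        print("--dev: skipping git add/commit and DB writes")
        return
    # ... otherwise stage, commit, and persist the dialogue to the DB ...

if __name__ == "__main__":
    main()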

Review embedding/question ranking

I've implemented some basic embedding and question ranking functionality.

Embedding

https://github.com/josepablocam/changestructor/blob/master/chg/embed/basic.py
uses Microsoft's CodeBERT model to embed code changes (BasicEmbedder.embed_code), stand-alone natural language questions (BasicEmbedder.embed_nl), and dialogues, i.e. sequences of questions and answers (BasicEmbedder.embed_dialogue).

CodeBERT has a maximum input length (in terms of tokens), so to handle arbitrary-length inputs we split the input into maximum-size chunks, embed each of them separately, and then average those embeddings.

To produce an embedding for a chunk, we take the hidden state after each token in a sequence is consumed and we average those hidden states dimension-wise.
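
A minimal sketch of this chunk-and-average scheme, assuming the HuggingFace transformers package (an illustration, not the exact code in chg/embed/basic.py):

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")
MAX_LEN = 512  # CodeBERT's maximum input length in tokens

def embed_text(text):
    ids = tokenizer.encode(text, add_special_tokens=False)
    # split the input into maximum-size chunks
    chunks = [ids[i:i + MAX_LEN] for i in range(0, len(ids), MAX_LEN)]
    chunk_vecs = []
    with torch.no_grad():
        for chunk in chunks:
            out = model(torch.tensor([chunk]))
            # mean-pool the hidden states over the token positions
            chunk_vecs.append(out.last_hidden_state[0].mean(dim=0))
    # average the per-chunk embeddings into a single vector
    return torch.stack(chunk_vecs).mean(dim=0)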

Question Ranking

https://github.com/josepablocam/changestructor/blob/master/chg/ranker/model_based_ranking.py

Given a dialogue, code changes, and a set of randomly sampled code changes, we compute the following ranking loss
(https://github.com/josepablocam/changestructor/blob/master/chg/ranker/model_based_ranking.py#L73)

code_vec <- embed_code(code_changes)
diag_vec <- embed_dialogue(dialogue)
# negative code in that this dialogue is *not* associated with these changes
neg_code_vecs <- [embed_code(neg_code) for neg_code in random_sample_code_changes]
# compute for each neg_code_vec
max(0, delta - sim(code_vec, diag_vec) + sim(neg_code_vec, diag_vec))
# and then average those individual losses to get a mean loss

We will refer to this as our ranking loss. The intuition is that we want the dialogue associated with the current code change to be more similar to that change than to a set of randomly sampled negative changes.
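
A runnable version of the pseudocode above (a sketch; here sim is cosine similarity and delta is the margin):

import torch
import torch.nn.functional as F

def ranking_loss(code_vec, diag_vec, neg_code_vecs, delta=0.1):
    pos_sim = F.cosine_similarity(code_vec, diag_vec, dim=0)
    # one hinge term per negative change, then the mean over negatives
    losses = [
        torch.clamp(delta - pos_sim + F.cosine_similarity(neg, diag_vec, dim=0), min=0)
        for neg in neg_code_vecs
    ]
    return torch.stack(losses).mean()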

We implement a simple model to predict this loss, given the code change, the dialogue history for that code change up to that point, and a set of randomly sampled code changes.

https://github.com/josepablocam/changestructor/blob/master/chg/ranker/model_based_ranking.py#L21

This model uses random forests to predict the ranking loss as a function of the feature vector x defined below:

code_vec <- embed_code(code_changes)
hist_vec <- embed_dialogue(dialogue_up_to_now)
neg_code_vecs <- [embed_code(neg_code) for neg_code in random_sample_code_changes]

https://github.com/josepablocam/changestructor/blob/master/chg/ranker/model_based_ranking.py#L118
context_vec <- concat(code_vec, hist_vec, neg_code_vecs)

question_vec <- embed_nl(proposed_question)
x <- concat(context_vec, question_vec)

To choose a question, we pick the question in our candidate collection that maximizes the expected improvement over the current ranking loss.
https://github.com/josepablocam/changestructor/blob/master/chg/ranker/model_based_ranking.py#L31

The model accumulates the user's dialogue and uses it to refit the regression model (https://github.com/josepablocam/changestructor/blob/master/chg/ranker/model_based_ranking.py#L194). We also fit an initial version offline using the existing git history (https://github.com/josepablocam/changestructor/blob/master/chg/ranker/model_based_ranking.py#L213).
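
Putting the pieces together, a sketch of this model-based ranker using scikit-learn's RandomForestRegressor (names and shapes are illustrative, not the exact code in model_based_ranking.py):

import numpy as np
from sklearn.ensemble import RandomForestRegressor

class QuestionRanker:
    def __init__(self):
        self.model = RandomForestRegressor(n_estimators=100)
        self.xs, self.ys = [], []

    def featurize(self, context_vec, question_vec):
        # x = concat(context_vec, question_vec), as above
        return np.concatenate([context_vec, question_vec])

    def choose_question(self, context_vec, question_vecs, current_loss):
        # predict the ranking loss for each candidate question and pick the one
        # with the largest expected improvement over the current loss
        # (assumes the model was first fit offline on the git history)
        xs = np.stack([self.featurize(context_vec, q) for q in question_vecs])
        return int(np.argmax(current_loss - self.model.predict(xs)))

    def update(self, context_vec, question_vec, observed_loss):
        # accumulate the user's dialogue and refit the regression model
        self.xs.append(self.featurize(context_vec, question_vec))
        self.ys.append(observed_loss)
        self.model.fit(np.stack(self.xs), np.array(self.ys))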

Random musings

It feels like this problem has connections to both RL and Bayesian optimization. For the former, we have a stateful sequential process, where the questions and answers affect an eventual final ranking loss. For the latter, we have an "expensive function" to evaluate (i.e. the user's time writing an answer to a question). I would love to see if there is a better setup than what I've proposed in this implementation, but I don't know enough about either area to see it myself, so I'm very open to reformulations/reimplementations.

run on csail machine

Currently this can only be run using Docker, but it should work natively as well. I think AFS may be part of the issue.

Populating templates with relevant code change/dialogue entities

To produce questions, we append "entities" to the templates we currently have. These entities are, e.g., variable names and function names in relevant contexts in the code change, or terms extracted from the user's prior dialogue answers.

We should extend/refine these.

We also need to filter out entities that don't make sense (e.g. simple pronouns).

@citostyle
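
A sketch of such a filter, using a simple regex over identifier-like tokens rather than full POS tagging (all names hypothetical): extract candidates from a diff hunk or answer text and drop ones that don't make sense as question slots, e.g. pronouns and other stopwords.

import re

PRONOUNS_AND_STOPWORDS = {
    "i", "you", "he", "she", "it", "we", "they", "this", "that",
    "the", "a", "an", "and", "or", "not", "if", "in", "of", "to",
}

def extract_entities(text):
    candidates = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text)
    return sorted({
        c for c in candidates
        if c.lower() not in PRONOUNS_AND_STOPWORDS and len(c) > 1
    })

# extract_entities("it updates the cache via update_cache")
# -> ["cache", "update_cache", "updates", "via"]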

Structuring ideas for paper (and proper implementation)

  • Templatized question generation
  • Templates are populated with:
    • NLP: answer analysis (simple POS/entity extraction)
    • Program analysis:
      - Intraprocedural: inspecting changes in code to identify variables/functions and enclosing scopes
      - Interprocedural (TODO): extract static CFG and perform graph differencing, identify names that change (https://arxiv.org/abs/2103.00587)
  • Dynamic question ranking:
    - Online question ranking based on expected improvement in similarity/dissimilarity of NL description of code changes
  • Infrastructure implemented:
    • Annotation
    • Search (and describe ideas behind search)

Evaluating changestructor

Proposed a user study:
Take a large/serious SE project, and a given code change in that project.
Group A1: annotates the change using changestructor's dialogue system
Group A2: annotates the change using changestructor but with randomized proposed question order
Separately, we automatically generate a commit message using an existing generation system.

First eval: qualitative experience of A1 vs A2, qualitative characterization of answers (e.g. length, time, etc.)

Group B1: review automated message vs changestructor dialogue, answer understandability/quality questions

Group B2: review automated message vs changestructor random order, answer understandability/quality questions

Group B3: changestructor vs changestructor random order, answer understandability/quality questions

There are a ton of questions to answer and structure properly here, but this is the gist of what we've thought about. We also probably need to refine/limit the scope and be realistic about time expectations for participants, etc. @citostyle
