Feature request: UAST diffing (sdk, open issue)

r0mainK commented on August 16, 2024
Feature request: UAST diffing

from sdk.

Comments (6)

dennwc commented on August 16, 2024

> Is this a dumb idea with no value, and we should keep parsing each file separately?

In terms of drivers, yes, we will parse files separately anyway. The parsers that we use have no support for incremental processing, so for us each parsing request will be separate. The diff will be computed on the resulting UASTs one way or another.

But from an API standpoint, it may be useful to have a parse request that accepts multiple versions of a file, parses them, and returns a merged UAST for all versions. This will require some changes on our end (e.g. diffing may produce a DAG instead of a tree).

> Is this a dumb idea which would result in huge UASTs or longer processing times than parsing each file separately?

I don't think so. Sending multiple files will allow us to use the compression that is built into the UAST binary format, so the resulting tree from parsing multiple versions will be smaller than multiple separate trees. Even without the diff, the compression may lead to a minor performance improvement.

The diff itself will take time, of course, no way around it.

> If that is the case, have you already thought of alternative ways of getting the same kind of result, and could you detail them?

I think the idea is valid, and I mentioned a few reasons why it may be a good one from an efficiency standpoint.

> If that is not the case, how hard/time-consuming would it be to implement, and would you be ready to do it at some point, even with restrictions on the number of versions, file size, etc.?

We actually have a diff implementation already. But we haven't tested its output beyond checking that applying the diff produces the correct result. And we also haven't decided what the API might look like.

So your question may start this discussion. Specifically, you mentioned a good alternative to the file diff: instead of producing a diff itself, annotate nodes with versions in some way. We may decide to use this approach, or go back to returning diffs with node IDs. We now have many more resources, so we can start evaluating those approaches, depending on the priority of this feature.
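To make the "annotate nodes with versions" option a bit more concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `Node` class, its `versions` field, and the helpers `select_version` and `changed_nodes` are hypothetical and are not part of the Babelfish SDK or of the existing diff implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical merged-UAST node, tagged with the versions it belongs to."""
    kind: str
    versions: frozenset
    children: list = field(default_factory=list)


def select_version(node, version):
    """Return a copy of the tree restricted to one version, or None if absent."""
    if version not in node.versions:
        return None
    kept = [select_version(child, version) for child in node.children]
    return Node(node.kind, frozenset({version}), [c for c in kept if c is not None])


def changed_nodes(node, all_versions):
    """Yield, in pre-order, the nodes that are not present in every version."""
    if node.versions != all_versions:
        yield node
    for child in node.children:
        yield from changed_nodes(child, all_versions)


both = frozenset({"v1", "v2"})
merged = Node("File", both, [
    Node("Import", both),
    Node("OldFunction", frozenset({"v1"})),
    Node("NewFunction", frozenset({"v2"})),
])

v1_tree = select_version(merged, "v1")
print([child.kind for child in v1_tree.children])
print([node.kind for node in changed_nodes(merged, both)])
```

With this shape, a "diff" is just a filter over version tags, and a single version is recovered by pruning. It does not address nodes that move or are shared between parents, which is where the DAG question comes in.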

creachadair commented on August 16, 2024

This is definitely a good idea, and one that has come up before; see for example https://github.com/src-d/feature-idea/issues/2 and https://github.com/src-d/devrel/issues/115. Tree diff is a tricky algorithm, but manageable. The more interesting questions, in my view, are how it fits into the API and how the results are surfaced to the client. So as @dennwc says, this is a conversation we need to have if we want tree diff to happen.

dennwc commented on August 16, 2024

@r0mainK If I understand the idea correctly, this won't work for the generic case.

Imagine a language construct that spans multiple lines. Adding, removing, or changing lines nearby may break the syntax of that construct. So instead of getting a nice merged UAST with both versions, we will get a syntax error from the driver.

Systems like TreeSitter will handle it much better because they support incremental parsing. In the case of Babelfish, we usually run parsers that are part of a compiler, so they are not suitable for incremental parsing at the per-line level. Instead, they usually parse the file as a whole.

Having said that, the diff algorithm can definitely take positions into account. The implementation that we have right now doesn't, however.

r0mainK commented on August 16, 2024

Okay, I see, thanks for the swift answer :) I was not aware of the different issues already opened on the subject, or of the fact that there already was an implementation. What I don't understand is when you say:

> In terms of drivers, yes, we will parse files separately anyway.

Why is that the case? Is it impossible from a theoretical point of view, or is it simply due to the way Babelfish is constructed? I might be understanding ASTs wrong, as I'm not the most knowledgeable in that regard, so I'll expand a bit on what I had in mind, with a super simplified way this could be done.

Given two versions of a file, instead of parsing both files, couldn't one apply a diffing algorithm at the line level to create a concatenated file, keep an index tracking which line appears in which version, parse the resulting file with Babelfish as is, and then use the index to deduce the diff nodes as well as each version's UAST (granted, with some work, as not all nodes have positional information)?

What I mean:

v1:     line a > line b1 > line c
v2:     line a > line b2 > line c
concat: line a > line b1 > line b2 > line c
index:  {a: [v1, v2]; b1: [v1]; b2: [v2]; c: [v1, v2]}
uast:   nodes(a: v1, v2), nodes(b1: v1), nodes(b2: v2), nodes(c: v1, v2)
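As a rough illustration of the concat/index step, here is a minimal Python sketch built on the standard library's `difflib.SequenceMatcher`. The function names `merge_versions` and `extract_version` are made up for this example, and the index is keyed by position in the merged file rather than by line content, since the same line text can occur more than once. It only covers two versions and does not touch parsing at all.

```python
import difflib


def merge_versions(v1_lines, v2_lines):
    """Build a concatenated file plus a per-line index of versions."""
    merged, index = [], []
    matcher = difflib.SequenceMatcher(a=v1_lines, b=v2_lines, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Lines shared by both versions appear once.
            for line in v1_lines[i1:i2]:
                merged.append(line)
                index.append({"v1", "v2"})
        else:
            # replace / delete / insert: v1-only lines first, then v2-only lines.
            for line in v1_lines[i1:i2]:
                merged.append(line)
                index.append({"v1"})
            for line in v2_lines[j1:j2]:
                merged.append(line)
                index.append({"v2"})
    return merged, index


def extract_version(merged, index, version):
    """Recover one version's lines from the merged file and its index."""
    return [line for line, versions in zip(merged, index) if version in versions]


v1 = ["line a", "line b1", "line c"]
v2 = ["line a", "line b2", "line c"]
merged, index = merge_versions(v1, v2)
for line, versions in zip(merged, index):
    print(line, sorted(versions))
```

Mapping this index onto UAST nodes would then rely on each node's start and end lines, which is exactly where nodes without positional information need extra work.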

Unless I'm mistaken, this could already be done entirely on the client side for simple diffing (in order to get nodes which have positional information, like identifiers). But wouldn't it be possible to apply it to the entire UAST in the same fashion, or are there blockers I don't see?

What I was thinking of was essentially to apply a diffing algorithm before sending the data to the drivers, then, on receiving the result, augment the UAST by annotating each node with the index, send the augmented UAST to the user, and provide through the API both the diffing (which indeed would probably not always result in a UAST) and version selection.

r0mainK commented on August 16, 2024

@dennwc yes, I kind of had that in mind; this example was more to convey the idea than to describe exactly how it would work. To avoid syntax errors, I was thinking of a simplistic per-language parsing, which would use knowledge of the language to compute the level of abstraction at which each line is situated:

class foo:                      # 0
    def __call__(self):         # 1
        return "bar"            # 2
    def bar(self, l):           # 1
        l = [                   # 2
            e for e in l        # 3
        ]                       # 2
        return l                # 2
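For Python, the levels annotated above could be approximated with a few lines of code. This is only a sketch under loud assumptions: the helper `line_levels` is hypothetical, it treats every `#` as the start of a comment, and it ignores brackets inside string literals, so it is not a real lexer. It counts one level per enclosing indented block plus one per bracket still open at the start of the line.

```python
OPENERS = "([{"
CLOSERS = ")]}"


def line_levels(source):
    """Return one nesting level per line (None for blank or comment-only lines)."""
    levels = []
    indents = [0]    # indentation widths of the enclosing blocks
    depth = 0        # brackets left open by previous lines
    stmt_level = 0   # block level of the statement currently being read
    for raw in source.splitlines():
        code = raw.split("#", 1)[0].rstrip()   # naive comment stripping
        stripped = code.lstrip()
        if not stripped:
            levels.append(None)
            continue
        if depth == 0:
            # Indentation is only meaningful outside of open brackets.
            width = len(code) - len(stripped)
            while width < indents[-1]:
                indents.pop()
            if width > indents[-1]:
                indents.append(width)
            stmt_level = len(indents) - 1
        # A line starting with closing brackets sits at the level of the opener.
        leading_closers = len(stripped) - len(stripped.lstrip(CLOSERS))
        levels.append(stmt_level + depth - leading_closers)
        depth += sum(ch in OPENERS for ch in code)
        depth -= sum(ch in CLOSERS for ch in code)
    return levels


example = '''class foo:
    def __call__(self):
        return "bar"
    def bar(self, l):
        l = [
            e for e in l
        ]
        return l
'''
print(line_levels(example))
```

Each language would need its own version of this function, which is the per-driver wrapper cost in question.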

Then use that to merge versions. This would not require extensive parsing, and in particular:

  1. It could be paired with a compressed way of passing versions, i.e. line diffs output by blame for all versions but the first (something like a list of line_index, line content, bool_add, bool_removed), instead of passing all versions at once.
  2. Because of the point above, it may result in a perf increase, especially when handling 100s or 1000s of versions: each version is only parsed in this simple way, a merged file is created, and that file is parsed once "for real" with the driver, without syntax errors.
  3. It could be done quasi-simultaneously with the actual parsing.
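The compressed way of passing versions in point 1 could look roughly like the Python sketch below. The `LineChange` record mirrors the (line_index, line content, bool_add, bool_removed) tuple, but its exact semantics are an assumption made for this example: removals use indices into the previous version, and additions use indices into the new version. `apply_changes` is a hypothetical helper, not an existing API.

```python
from typing import NamedTuple


class LineChange(NamedTuple):
    line_index: int   # removal: index in the previous version; addition: index in the new one
    content: str
    added: bool
    removed: bool


def apply_changes(lines, changes):
    """Rebuild the next version from the previous one and its line diff."""
    result = list(lines)
    # Delete from the end so that earlier indices stay valid.
    removals = sorted((c for c in changes if c.removed),
                      key=lambda c: c.line_index, reverse=True)
    for change in removals:
        del result[change.line_index]
    # Insert in ascending order so each line lands at its final index.
    additions = sorted((c for c in changes if c.added),
                       key=lambda c: c.line_index)
    for change in additions:
        result.insert(change.line_index, change.content)
    return result


v1 = ["line a", "line b1", "line c"]
diff_v1_v2 = [
    LineChange(1, "line b1", added=False, removed=True),
    LineChange(1, "line b2", added=True, removed=False),
]
print(apply_changes(v1, diff_v1_v2))
```

Sending the first version in full followed by one such list per later version keeps the payload proportional to the amount of change rather than to the number of versions.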

However, as you pointed out, since Babelfish does not incorporate incremental parsing by default, this would have to be done completely separately from the actual parsing instead of simultaneously. That would clearly add overhead, and would entail adding some kind of per-language wrapper on each driver to create a concatenated file, then the UAST, and afterwards to augment it with version info.

Anyway thanks for the input, just tried adding my grain of salt although I did expect there would be some issues with this approach u.u"

However, I do think an API receiving versions this way, and returning a single augmented UAST, would be a pretty cool design:

  • it would leverage functionality that is often available upstream, such as git blame
  • it would make object handling and manipulation easier after parsing
  • it would enable dynamic diffing between a large number of versions without later parsing, in cases where you don't yet know which diff is of interest, i.e. if you just want to dump the augmented UAST nodes in a DB
  • it would be much less data to handle for greedy clients that want the UAST of each version as well as the diffs between each pair of versions.

bzz commented on August 16, 2024

> The more interesting questions, in my view, are how it fits into the API and how the results are surfaced to the client.

@creachadair I did not dig deep into this, but this seems to be the de facto standard tree-diff API that people in academia use nowadays: https://github.com/GumTreeDiff/gumtree/wiki/GumTree-API#getting-the-mappings-between-two-trees. Also, https://www.monperrus.net/martin/tree-differencing lists some examples and algorithms.
