Hey guys ! So for starters just wanted to point out I am not sure if

Feature request: UAST diffing

r0mainK commented on August 16, 2024

Feature request: UAST diffing

from sdk.

Comments (6)

dennwc commented on August 16, 2024

> Is this a dumb idea with no value, and should we keep parsing each file separately?

In terms of drivers, yes, we will parse files separately anyway. The parsers that we use have no support for incremental processing, so for us each parsing request will be separate. The diff will be executed on the resulting UASTs one way or another.

But from an API standpoint, it may be useful to have a parse request that accepts multiple versions of the file, parses them, and returns a merged UAST for all versions. This will require some changes on our end (e.g. diffing may produce a DAG instead of a tree).

> Is this a dumb idea which would result in huge UASTs or longer processing times than parsing each file separately?

I don't think so. Sending multiple files will allow us to use the compression that is built into the UAST binary format, so the resulting tree from parsing multiple versions will be smaller than multiple separate trees. Even without the diff, the compression may lead to a minor performance improvement.

The diff itself will take time, of course, no way around it.

> If that is the case, have you already thought of alternative ways of getting the same kind of result, and could you detail them?

I think the idea is valid and I mentioned a few reasons why it may be a good idea from the efficiency standpoint.

> If that is not the case, how hard/time-consuming would it be to implement, and would you be ready to do it at some point, even with restrictions on the number of versions, file size, etc.?

We actually have a diff implementation already. But we haven't tested its output beyond checking that applying the diff produces the correct result. And we also haven't decided what the API might look like.

So your question may start this discussion. Specifically, you mentioned a good alternative to the file diff: instead of producing a diff itself, annotate nodes with versions in some way. We may decide to use this approach, or go back to returning diffs with node IDs. We now have many more resources, so we can start evaluating those approaches, depending on the priority of this feature.

creachadair commented on August 16, 2024

This is definitely a good idea, and one that has come up before, see for example https://github.com/src-d/feature-idea/issues/2, https://github.com/src-d/devrel/issues/115. Tree diff is a tricky algorithm, but manageable. The more interesting questions, in my view, are how it fits into the API and how the results are surfaced to the client. So as @dennwc says, this is a conversation we need to have if we want tree diff to happen.

dennwc commented on August 16, 2024

@r0mainK If I understand the idea correctly, this won't work for the generic case.

Imagine a language construct that spans multiple lines. Adding, removing, or changing lines nearby may break the syntax of this construct. So instead of getting a nice merged UAST with both versions, we will get a syntax error from the driver.

Systems like Tree-sitter will handle it much better because they support incremental parsing. In the case of Babelfish, we usually run parsers that are part of the compiler, so they are not suitable for incremental parsing at the per-line level. Instead, they usually parse the file as a whole.

Having said that, the diff algorithm can definitely take positions into account. The implementation that we have right now doesn't, however.
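
The failure mode described in this comment can be reproduced with a small sketch. It uses Python's built-in `ast` module as a stand-in for a driver's parser (an assumption for illustration, not how Babelfish drivers are invoked), and the `parses` helper and sample sources are hypothetical:

```python
import ast

V1 = "def f(a):\n    return a\n"
V2 = "def f(a, b):\n    return a\n"

# Naive line-level merge: keep both variants of the changed line.
MERGED = "def f(a):\ndef f(a, b):\n    return a\n"


def parses(source: str) -> bool:
    """Return True if Python's own parser accepts the source."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:  # IndentationError is a subclass of SyntaxError
        return False


print(parses(V1), parses(V2), parses(MERGED))  # True True False
```

Each version parses on its own, but the merged file does not, because the first `def` line is left without a body.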

r0mainK commented on August 16, 2024

Okay I see, thanks for the swift answer :) I was not aware of the different issues already opened on the subject, or of the fact that there already was an implementation. What I don't understand is when you say:

> In terms of drivers, yes, we will parse files separately anyway.

Why is that the case? Is it because it is impossible from a theoretical point of view, or simply because of the way Babelfish is constructed? I might be understanding ASTs wrong, as I'm not the most knowledgeable in that regard, so I'll expand a bit on what I had in mind, with a super simplified way this could be done.

Given two versions of a file, instead of parsing both files, wouldn't one be able to apply a diffing algorithm at the line level to create a concatenated file, keep an index tracking which line appears in which version, parse the resulting file with Babelfish as is, and then use the index to deduce the diff nodes, as well as each version's UAST (granted, with some work, as not all nodes have positional information)?

What I mean:

v1: line a > line b1 > line c
v2: line a > line b2 > line c
concat: line a > line b1 > line b2 > line c
index = {a: [v1, v2]; b1: [v1]; b2: [v2]; c: [v1, v2]}
uast: nodes(a: v1, v2), nodes(b1: v1), nodes(b2: v2), nodes(c: v1, v2)

Unless I'm mistaken, this could already be done entirely on the client side for simple diffing (in order to get the nodes which have positional information, like identifiers). But wouldn't it be possible to apply it to the entire UAST in the same fashion, or are there blockers I don't see?

What I was thinking was essentially to apply a diffing algorithm before sending the data to the drivers, then, when receiving the result, augment the UAST by annotating each node with the index, send the augmented UAST to the user, and provide through the API the diffing (which indeed would probably not always result in a UAST) as well as version selection.
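
The client-side part of this idea (concatenating two versions and keeping a per-line version index) could look roughly like the sketch below. It relies on Python's standard `difflib.SequenceMatcher`; the function names `merge_versions` and `select_version` and the `"v1"`/`"v2"` labels are made up for illustration, and nothing here touches Babelfish itself:

```python
import difflib


def merge_versions(v1_lines, v2_lines):
    """Interleave two versions line by line, tagging each line with its versions."""
    merged = []  # list of (line, set of version labels)
    matcher = difflib.SequenceMatcher(a=v1_lines, b=v2_lines, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            merged += [(line, {"v1", "v2"}) for line in v1_lines[i1:i2]]
        else:
            # "replace", "delete" and "insert": keep v1's lines, then v2's.
            merged += [(line, {"v1"}) for line in v1_lines[i1:i2]]
            merged += [(line, {"v2"}) for line in v2_lines[j1:j2]]
    return merged


def select_version(merged, version):
    """Recover the lines of a single version from the merged file."""
    return [line for line, versions in merged if version in versions]


v1 = ["line a", "line b1", "line c"]
v2 = ["line a", "line b2", "line c"]
merged = merge_versions(v1, v2)
for line, versions in merged:
    print(line, sorted(versions))
```

The merged list plays the role of both the concatenated file and the index: the lines give the text to parse, and the version sets say which version each line (and, later, each positioned node) belongs to.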

r0mainK commented on August 16, 2024

@dennwc yes, I kind of had that in mind; this example was more to convey the idea than to describe exactly how it would work. To avoid syntax errors, I was thinking of a simplistic parsing per language, which would use knowledge of the language to compute the level of abstraction (nesting depth) at which each line is situated:

class foo:                   # 0
    def __call__(self):      # 1
        return "bar"         # 2
    def bar(self, l):        # 1
        l = [                # 2
            e for e in l     # 3
        ]                    # 2
        return l             # 2

Then use that to merge versions. This would not require extensive parsing, and in particular:

- It could be paired with a compressed way of passing versions, i.e. the line diffs output by blame for all versions but the first (something like a list of line_index, line content, bool_add, bool_removed), instead of passing all versions at once.
- Due to the point above, it may result in a performance increase, especially when handling hundreds or thousands of versions: parse in this simple way, create a merged file, then parse it once "for real" with the driver, without syntax errors.
- It could be done quasi-simultaneously with the actual parsing.
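
As a rough sketch of the compressed representation mentioned in the first point, the records below follow the (line_index, line content, bool_add, bool_removed) shape. The exact semantics are an assumption made for this example: a removal's index refers to the previous version, an addition's index to the new version. `LineChange`, `apply_changes`, and `rebuild_versions` are hypothetical names:

```python
from typing import List, NamedTuple


class LineChange(NamedTuple):
    index: int     # position in the old version (removal) or new version (addition)
    content: str
    added: bool
    removed: bool


def apply_changes(lines: List[str], changes: List[LineChange]) -> List[str]:
    """Build the next version from the previous one and its line changes."""
    removed = {c.index for c in changes if c.removed}
    result = [line for i, line in enumerate(lines) if i not in removed]
    # Insert additions in ascending order so each index is already valid.
    for change in sorted((c for c in changes if c.added), key=lambda c: c.index):
        result.insert(change.index, change.content)
    return result


def rebuild_versions(first: List[str], diffs: List[List[LineChange]]) -> List[List[str]]:
    """Reconstruct every version from the first one plus successive diffs."""
    versions = [first]
    for changes in diffs:
        versions.append(apply_changes(versions[-1], changes))
    return versions


first = ["line a", "line b1", "line c"]
diffs = [
    [LineChange(1, "line b1", False, True), LineChange(1, "line b2", True, False)],
    [LineChange(3, "line d", True, False)],
]
versions = rebuild_versions(first, diffs)
print(versions[-1])
```

Only the first version is sent in full; every later version costs just its changed lines.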

However, as you pointed out, since Babelfish does not incorporate incremental parsing by default, this would have to be done completely separately from the actual parsing, instead of simultaneously. That would clearly add overhead, and would entail adding some kind of per-language wrapper on each driver to create the concatenated file, then the UAST, and afterwards to augment it with version info.
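
For Python specifically, the per-line level from the `class foo` example could be approximated from indentation and bracket depth alone. This is only a sketch under simplifying assumptions (space indentation, no brackets or `#` inside string literals, no multi-line strings), and `nesting_levels` is a hypothetical helper, not part of any driver:

```python
def nesting_levels(source):
    """Return one nesting level per line (None for blank lines)."""
    levels = []
    indents = [0]       # indentation widths of the currently open blocks
    open_brackets = 0   # brackets opened on earlier lines and not yet closed
    block_level = 0     # block level of the current logical line
    for line in source.splitlines():
        stripped = line.strip()
        if not stripped:
            levels.append(None)
            continue
        if open_brackets == 0:
            # A new logical line: indentation decides the block level.
            width = len(line) - len(line.lstrip())
            while width < indents[-1]:
                indents.pop()
            if width > indents[-1]:
                indents.append(width)
            block_level = len(indents) - 1
            level = block_level
        else:
            # Continuation line inside brackets: indentation is not significant.
            closes_first = stripped[0] in ")]}"
            level = block_level + open_brackets - (1 if closes_first else 0)
        levels.append(level)
        code = stripped.split("#", 1)[0]  # crude: ignores '#' inside strings
        open_brackets += sum(code.count(c) for c in "([{")
        open_brackets -= sum(code.count(c) for c in ")]}")
    return levels


SAMPLE = "\n".join([
    "class foo:",
    "    def __call__(self):",
    '        return "bar"',
    "    def bar(self, l):",
    "        l = [",
    "            e for e in l",
    "        ]",
    "        return l",
])
print(nesting_levels(SAMPLE))  # [0, 1, 2, 1, 2, 3, 2, 2]
```

A merge step could then refuse to interleave lines whose levels show they sit inside the same open construct, which is where the syntax errors come from.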

Anyway thanks for the input, just tried adding my grain of salt although I did expect there would be some issues with this approach u.u"

However I do think an API receiving versions this way, and returning a single augmented UAST, would be a pretty cool design:

- it would leverage functionality that is often included upstream, such as git blame
- it would make object handling and manipulation easier after parsing
- it would enable dynamic diffing between a large number of versions without later parsing, in cases where you don't yet know which diff is of interest, i.e. if you just want to dump the augmented UAST nodes in a DB
- it would be much less data to handle for greedy clients wishing to use the UAST of each version, as well as the diffs between each pair of versions
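
To make the "single augmented UAST" design more concrete, here is a minimal sketch of version-annotated nodes with version selection and on-demand diffing. The `Node` class and both functions are hypothetical and far simpler than real UAST nodes; they only illustrate the shape of such an API:

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List


@dataclass
class Node:
    kind: str
    versions: FrozenSet[str]          # versions in which this node exists
    children: List["Node"] = field(default_factory=list)


def select_version(node, version):
    """Project the merged tree onto one version (None if the node is absent)."""
    if version not in node.versions:
        return None
    kept = [select_version(child, version) for child in node.children]
    return Node(node.kind, frozenset({version}), [c for c in kept if c is not None])


def changed_nodes(node, old, new):
    """Yield (status, kind) for nodes present in exactly one of the two versions."""
    in_old, in_new = old in node.versions, new in node.versions
    if in_old and not in_new:
        yield ("removed", node.kind)
    elif in_new and not in_old:
        yield ("added", node.kind)
    for child in node.children:
        yield from changed_nodes(child, old, new)


both = frozenset({"v1", "v2"})
tree = Node("File", both, [
    Node("a", both),
    Node("b1", frozenset({"v1"})),
    Node("b2", frozenset({"v2"})),
    Node("c", both),
])
print([c.kind for c in select_version(tree, "v1").children])  # ['a', 'b1', 'c']
print(list(changed_nodes(tree, "v1", "v2")))  # [('removed', 'b1'), ('added', 'b2')]
```

Any pair of versions can be compared later by filtering on the annotations, without parsing again.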

bzz commented on August 16, 2024

> The more interesting questions, in my view, are how it fits into the API and how the results are surfaced to the client.

@creachadair I did not dig deep into this, but this seems to be the de facto standard API for tree diff that people in academia use nowadays: https://github.com/GumTreeDiff/gumtree/wiki/GumTree-API#getting-the-mappings-between-two-trees The page https://www.monperrus.net/martin/tree-differencing also lists some examples and algorithms.
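
To illustrate what "mappings between two trees" means as an API result, here is a deliberately simplified sketch that pairs only structurally identical subtrees. It is neither GumTree's algorithm nor its API; trees are plain `(label, children)` tuples and all function names are hypothetical:

```python
from collections import defaultdict


def subtree_key(tree):
    """Structural key: the label plus the keys of all children, recursively."""
    label, children = tree
    return (label, tuple(subtree_key(child) for child in children))


def iter_subtrees(tree):
    """Yield the tree and all of its descendants in pre-order."""
    yield tree
    for child in tree[1]:
        yield from iter_subtrees(child)


def map_identical_subtrees(src, dst):
    """Pair each subtree of src with one unused identical subtree of dst."""
    candidates = defaultdict(list)
    for sub in iter_subtrees(dst):
        candidates[subtree_key(sub)].append(sub)
    mappings = []
    for sub in iter_subtrees(src):
        matches = candidates.get(subtree_key(sub))
        if matches:
            mappings.append((sub, matches.pop(0)))
    return mappings


src = ("File", [("a", []), ("b1", []), ("c", [])])
dst = ("File", [("a", []), ("b2", []), ("c", [])])
mappings = map_identical_subtrees(src, dst)
print([s[0] for s, _ in mappings])  # ['a', 'c']
```

Nodes left unmapped (`b1`, `b2`, and here also the changed `File` root) are what a client would read as removed, added, or modified; a real tree-differencing tool would also try to pair such modified nodes.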
