bluekeyes / go-gitdiff

Go library for parsing and applying patches created by Git

License: MIT License

Go 100.00%

go-gitdiff

A Go library for parsing and applying patches generated by git diff, git show, and git format-patch. It can also parse and apply unified diffs generated by the standard diff tool.

It supports standard line-oriented text patches and Git binary patches, and aims to parse anything accepted by the git apply command.

patch, err := os.Open("changes.patch")
if err != nil {
    log.Fatal(err)
}

// files is a slice of *gitdiff.File describing the files changed in the patch
// preamble is a string of the content of the patch before the first file
files, preamble, err := gitdiff.Parse(patch)
if err != nil {
    log.Fatal(err)
}

code, err := os.Open("code.go")
if err != nil {
    log.Fatal(err)
}

// apply the changes in the patch to a source file
var output bytes.Buffer
if err := gitdiff.Apply(&output, code, files[0]); err != nil {
    log.Fatal(err)
}

Development Status

The parsing API and types are complete, and I expect them to remain stable. Version 0.7.0 introduced a new apply API that may change more in the future to support non-strict patch application.

Parsing and strict application are well-covered by unit tests and the library is used in a production application that parses and applies thousands of patches every day. However, the space of all possible patches is large, so there are likely undiscovered bugs.

The parsing code has also had a modest amount of fuzz testing.

Why another git/unified diff parser?

Several packages with similar functionality exist, so why did I write another?

No other packages I found support binary diffs, as generated with the --binary flag. This is the main reason for writing a new package, as the format is pretty different from line-oriented diffs and is unique to Git.
Most other packages only parse patches, so you need additional code to apply them (and where applying is supported, it only works for text files).
This package aims to accept anything that git apply accepts, and closely follows the logic in apply.c.
It seemed like a fun project and a way to learn more about Git.

Differences From Git

Certain types of invalid input that are accepted by git apply generate errors. These include:
- Numbers immediately followed by non-numeric characters
- Trailing characters on a line after valid or expected content
- Malformed file header lines (lines that start with diff --git)
Errors for invalid input are generally more verbose and specific than those from git apply.
The translation from C to Go may have introduced inconsistencies in the way Unicode file names are handled; these are bugs, so please report any issues of this type.
When reading headers, there is no validation that OIDs present on an index line are shorter than or equal to the maximum hash length, as this requires knowing if the repository used SHA1 or SHA256 hashes.
When reading "traditional" patches (those not produced by git), prefixes are not stripped from file names; git apply attempts to remove prefixes that match the current repository directory/prefix.
Patches can only be applied in "strict" mode, where the line numbers and context of each fragment must exactly match the source file; git apply implements a search algorithm that tries different lines and amounts of context, with further options to normalize or ignore whitespace changes.
When parsing mail-formatted patch headers, leading and trailing whitespace is always removed from Subject lines. There is no exact equivalent to git mailinfo -k.

go-gitdiff's Issues

Implement the equivalent of git/mailinfo.c:cleanup_subject() (i.e., remove [PATCH] from title)

Patches generated with git format-patch will have [PATCH] in the subject line; at the moment, this doesn't seem to be removed by ParsePatchHeader().

It might be a good idea to just implement the equivalent of cleanup_subject() from https://github.com/git/git/blob/master/mailinfo.c, which seems to remove the following things at the beginning of the patch title:

Re: and variations
Whitespace (' ', '\t', and ':')
Anything in between brackets.

I might take a stab at implementing this if I have time.
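
As a rough illustration of the list above, here is a minimal sketch that repeatedly strips those three kinds of prefix. The name cleanSubject is hypothetical and this is not a port of Git's cleanup_subject(); it only approximates the behavior described in this issue.

```go
package main

import (
	"fmt"
	"strings"
)

// cleanSubject is a hypothetical helper (not part of go-gitdiff). It strips
// leading "Re:" variants, spaces, tabs, colons, and bracketed groups until
// none of them remain at the start of the subject.
func cleanSubject(s string) string {
	for {
		before := s
		s = strings.TrimLeft(s, " \t:")
		if len(s) >= 3 && strings.EqualFold(s[:3], "re:") {
			s = s[3:]
		}
		if strings.HasPrefix(s, "[") {
			// Only strip a bracket group that is actually closed.
			if end := strings.IndexByte(s, ']'); end >= 0 {
				s = s[end+1:]
			}
		}
		if s == before {
			return strings.TrimSpace(s)
		}
	}
}

func main() {
	fmt.Println(cleanSubject("Re: [PATCH v2 1/3] Fix parser"))
}
```

An unclosed bracket is left alone, so the loop always terminates.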

'ParsePatchDate' is broken :/

time.Parse isn't guaranteed to return an error: when elements are missing from the value, it zeros them instead. https://play.golang.org/p/4kbScfG56Ic

Elements omitted from the value are assumed to be zero or, when zero is impossible, one, so parsing "3:04pm" returns the time corresponding to Jan 1, year 0, 15:04:00 UTC (note that because the year is 0, this time is before the zero Time). Years must be in the range 0000..9999. The day of the week is checked for syntax but it is otherwise ignored.

https://golang.org/pkg/time/#example_Parse

In addition to checking that err is nil, it might be worth checking whether the parsed year is implausibly small (0 or 1), seeing that Git was released in 2005 :D I know this would make it unusable for time travelers, but 🤷

Personally, I would just use something like https://github.com/kierdavis/dateparser (example: https://play.golang.org/p/-yRXt4qPAZo).
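
The year check suggested above could look roughly like this. It is a sketch using only the standard library; parseDateChecked and the 2005 cutoff are assumptions taken from this issue, not library behavior, and a real patch could legitimately carry an older date from imported history.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errImplausibleDate is a hypothetical sentinel error for this sketch.
var errImplausibleDate = errors.New("parsed date predates Git (2005)")

// parseDateChecked wraps time.Parse and rejects results whose year is earlier
// than 2005, which catches layouts that silently leave the year at zero.
func parseDateChecked(layout, value string) (time.Time, error) {
	parsed, err := time.Parse(layout, value)
	if err != nil {
		return time.Time{}, err
	}
	if parsed.Year() < 2005 {
		return time.Time{}, errImplausibleDate
	}
	return parsed, nil
}

func main() {
	// The layout has no year, so time.Parse succeeds but the year is 0.
	_, err := parseDateChecked("3:04pm", "3:04pm")
	fmt.Println(err)

	d, err := parseDateChecked(time.RFC1123Z, "Mon, 29 Apr 2024 17:31:28 +0000")
	fmt.Println(d.Year(), err)
}
```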

Add options to ParsePatchHeader

By default, for email formatted patches, ParsePatchHeader will remove all content in square brackets and place it in a separate field. This matches the default behavior of git, but git also includes flags to disable cleaning completely or to only remove content in square brackets that contains the word PATCH.

It should be a backwards-compatible change to have ParsePatchHeader accept option functions that can disable the default behavior or switch to only removing [PATCH] content.
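
For reference, the option-function pattern described here might be shaped like the sketch below. Every name in it (cleanMode, HeaderOption, WithCleanMode, parseSubject) is hypothetical and chosen for illustration; it also simply drops bracketed content instead of storing it in a separate field.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// cleanMode and the option names below are hypothetical, for illustration only.
type cleanMode int

const (
	cleanAll       cleanMode = iota // remove every leading [...] group (default)
	cleanPatchOnly                  // remove only leading groups containing "PATCH"
	cleanNone                       // leave the subject untouched
)

type headerOptions struct{ mode cleanMode }

// HeaderOption adjusts the parser configuration.
type HeaderOption func(*headerOptions)

// WithCleanMode selects how bracketed prefixes are handled.
func WithCleanMode(m cleanMode) HeaderOption {
	return func(o *headerOptions) { o.mode = m }
}

var leadingBracket = regexp.MustCompile(`^\s*\[([^\]]*)\]\s*`)

// parseSubject accepts zero or more options, so existing callers that pass
// only the subject keep the default behavior.
func parseSubject(subject string, opts ...HeaderOption) string {
	cfg := headerOptions{mode: cleanAll}
	for _, opt := range opts {
		opt(&cfg)
	}
	if cfg.mode == cleanNone {
		return subject
	}
	for {
		m := leadingBracket.FindStringSubmatch(subject)
		if m == nil || (cfg.mode == cleanPatchOnly && !strings.Contains(m[1], "PATCH")) {
			return subject
		}
		subject = subject[len(m[0]):]
	}
}

func main() {
	s := "[PATCH 1/2] [net] fix timeout"
	fmt.Println(parseSubject(s))
	fmt.Println(parseSubject(s, WithCleanMode(cleanPatchOnly)))
	fmt.Println(parseSubject(s, WithCleanMode(cleanNone)))
}
```

Because the options are variadic, adding them does not break callers of the one-argument form, which is what makes the change backwards-compatible.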

Reconsider source inputs for apply functions

After coming back to it to add some features, I'm not happy with the LineReaderAt interface and to some extent the use of io.ReaderAt. This is mostly for text patches, since io.ReaderAt is actually an ideal interface for the needs of binary patches.

Things I don't like:

It's hard to know if you are at the end of the input or not. You have to read a minimal amount of data at what you think is the end offset and see if you get more data or an io.EOF.
It's hard to know how large the input is. As above, you have to read at what you think the length is and see if you get more data or an io.EOF.
The implementation of LineReaderAt wrapping an io.ReaderAt feels complicated, but maybe this is inevitable when you need to build a line index dynamically
It's hard to control the memory usage when reading lines because you can set a number of lines, but have no control over the size of each line.

Any solution needs to solve the following constraints:

Support random access to lines. Strict apply could work without this, but it's required for fuzzy apply, where you slowly backtrack through the file to find a match.
Is a standard library type or can be created from a standard library type, the more widely implemented the better.
Allows end users some control over performance and memory usage for special cases.

Things I've considered:

io.ReaderAt and LineReaderAt: this works well for binary applies (it's the minimal method needed to implement them), but has the problems outlined above for text applies.
io.ReadSeeker: this enables the same features as io.ReaderAt (and is implemented by the same standard library types) but the position tracking and Read function make some things (like copying) easier. Since I don't plan to support concurrent use of the same source, I'm not sure if there's a major difference between using Read and Seek versus using ReadAt.
[]byte: this is simple and supports random access, but doesn't allow much flexibility. The whole source must be in memory and the apply functions will compute the line index as needed even if there was a more efficient way to get it. On the other hand, it reduces the need for internal buffers, so the number of allocations is probably lower. For what it's worth, git takes this approach and reads the full source file into memory for applies.

In my usage so far, everything is already in memory for other reasons, so the []byte might be the simplest. Or maybe io.ReaderAt is the correct interface and I just need a better abstraction on top of it for line operations.
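
To make the []byte option concrete, here is a minimal sketch of a line index built in one pass over an in-memory source. The lineIndex type is hypothetical and not part of the library; it only shows that random access to lines and the total line count both become cheap once the whole source is in memory.

```go
package main

import (
	"bytes"
	"fmt"
)

// lineIndex is a hypothetical helper, not a go-gitdiff type. It records the
// byte offset at which each line of src starts.
type lineIndex struct {
	src    []byte
	starts []int
}

func newLineIndex(src []byte) *lineIndex {
	idx := &lineIndex{src: src}
	for off := 0; off < len(src); {
		idx.starts = append(idx.starts, off)
		nl := bytes.IndexByte(src[off:], '\n')
		if nl < 0 {
			break // the final line has no trailing newline
		}
		off += nl + 1
	}
	return idx
}

// Len reports the number of lines, so the end of input is always known.
func (x *lineIndex) Len() int { return len(x.starts) }

// Line returns line i (zero-based), including its trailing newline if present.
// It does not copy; the result aliases src.
func (x *lineIndex) Line(i int) []byte {
	end := len(x.src)
	if i+1 < len(x.starts) {
		end = x.starts[i+1]
	}
	return x.src[x.starts[i]:end]
}

func main() {
	idx := newLineIndex([]byte("a\nbb\nccc"))
	fmt.Println(idx.Len())
	fmt.Printf("%q\n", idx.Line(1))
}
```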

Support non-strict patch application

Currently, an Applier can only apply patches in "strict" mode, where line numbers and context lines must match exactly. Git supports a more flexible model when applying patches that allow them to work in more situations, such as cherry-picking changes to different branches:

Look for matches on different lines near the lines in the patch
Look for matches with fewer lines of context
Look for matches ignoring certain types of whitespace changes

I think copying Git's whitespace normalization could get complicated, but it would be nice to at least support exact matches on different lines or matches with reduced context.
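
The first of those strategies, exact matches on nearby lines, can be sketched as below. This is an illustrative search over in-memory lines with hypothetical names (findFragment, matchesAt); it is not Git's algorithm and does not reduce context or normalize whitespace.

```go
package main

import "fmt"

// findFragment is a hypothetical sketch, not go-gitdiff's or Git's algorithm.
// It looks for want in src, starting at the expected line and then trying
// lines progressively farther away, up to maxOffset lines in each direction.
// It returns the zero-based line of the match, or -1 if there is none.
func findFragment(src, want []string, expected, maxOffset int) int {
	for d := 0; d <= maxOffset; d++ {
		for _, start := range []int{expected - d, expected + d} {
			if matchesAt(src, want, start) {
				return start
			}
		}
	}
	return -1
}

// matchesAt reports whether want appears in src beginning at line start.
func matchesAt(src, want []string, start int) bool {
	if start < 0 || start+len(want) > len(src) {
		return false
	}
	for i, line := range want {
		if src[start+i] != line {
			return false
		}
	}
	return true
}

func main() {
	src := []string{"a", "b", "c", "d", "e"}
	// The fragment claims to start at line 1 but actually sits at line 3.
	fmt.Println(findFragment(src, []string{"d", "e"}, 1, 3))
}
```

Strict mode corresponds to maxOffset 0: only the expected line is tried.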

Add validation for parsed files

If a patch is malformed or a File is created directly, various fields may disagree. Add a validate function that checks for these types of issues so clients (e.g. appliers) can rely on the content of the fields.

Some of the issues to check:

IsRename is true/false but OldName and NewName are equal/not equal
IsDelete or IsNew is true but there is more than one fragment
IsDelete is true but the single fragment has context or addition lines or NewPosition and NewLines are not 0
IsNew is true but the single fragment has context or deletion lines or OldPosition and OldLines are not 0
IsBinary is true but TextFragments is not empty
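
A validate function covering most of this checklist might look like the following sketch. The file and fragment types are pared-down, hypothetical stand-ins that model only the fields named above (plus IsCopy); checks on individual context, addition, and deletion lines are omitted.

```go
package main

import (
	"errors"
	"fmt"
)

// fragment and file are hypothetical stand-ins for the library's types;
// only the fields needed by the checks are modeled.
type fragment struct {
	OldPosition, OldLines int64
	NewPosition, NewLines int64
}

type file struct {
	OldName, NewName string
	IsNew, IsDelete  bool
	IsCopy, IsRename bool
	IsBinary         bool
	TextFragments    []fragment
}

// validate returns the first inconsistency it finds, or nil.
func validate(f file) error {
	// Assumption: new, deleted, and copied files legitimately have differing
	// names without being renames, so the rename check is skipped for them.
	if !f.IsNew && !f.IsDelete && !f.IsCopy && f.IsRename == (f.OldName == f.NewName) {
		return errors.New("IsRename disagrees with OldName and NewName")
	}
	if (f.IsNew || f.IsDelete) && len(f.TextFragments) > 1 {
		return errors.New("new or deleted file has more than one fragment")
	}
	for _, frag := range f.TextFragments {
		if f.IsDelete && (frag.NewPosition != 0 || frag.NewLines != 0) {
			return errors.New("deleted file has a fragment with new content")
		}
		if f.IsNew && (frag.OldPosition != 0 || frag.OldLines != 0) {
			return errors.New("new file has a fragment with old content")
		}
	}
	if f.IsBinary && len(f.TextFragments) > 0 {
		return errors.New("binary file has text fragments")
	}
	return nil
}

func main() {
	bad := file{
		NewName:       "a.txt",
		IsNew:         true,
		TextFragments: []fragment{{OldPosition: 1, OldLines: 2, NewPosition: 1, NewLines: 3}},
	}
	fmt.Println(validate(bad))
}
```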

Flush() goes into an endless loop

Hi.

Thanks for this library. It saved me lots of time. However, I'm currently facing a problem.
The following code runs indefinitely and eats RAM:

package main

import (
	"bytes"

	"github.com/bluekeyes/go-gitdiff/gitdiff"
)

const (
	diff = `
diff --git a/app/controllers/seances_controller.rb b/app/controllers/seances_controller.rb
index 743d0ad..4f4d4e8 100644
--- a/app/controllers/seances_controller.rb
+++ b/app/controllers/seances_controller.rb
@@ -5,8 +5,6 @@ class SeancesController < ApplicationController
     if authorization_result.code != 200
       return render_by_status_code(code: authorization_result.code, data: authorization_result.data)
     end
-
-    render_by_status_code(code: 200, data: json)
   end

   def create
`
)

var (
	body = `class SeancesController < ApplicationController
  def index
    set_auth_operation_id_header

    if authorization_result.code != 200
      return render_by_status_code(code: authorization_result.code, data: authorization_result.data)
    end

    render_by_status_code(code: 200, data: json)
  end

  def create
    seance = Seance.new(
      movie_id: params[:movie_id],
      price: params[:price],
      datetime: params[:datetime]
    )

    if seance.save!
      render json: {
        data: {
          id: seance.id,
          type: 'seances',
          attributes: { datetime: seance.datetime, price: seance.price },
          seats: Seat.pluck(:id).map do |seat_id|
            { id: seat_id, vacant: true }
          end
        }
      }
    end
  rescue ActiveRecord::RecordInvalid => e
    render_invalid_record(message: e.message)
  end

  def destroy
    ActiveRecord::Base.transaction do
      Booking.where(seance: params[:id]).destroy_all
      Seance.find(params[:id]).destroy
    end

    render json: { data: [{ id: params[:id], type: 'seances' }] }
  end

  def json
    seats_ids = Seat.pluck(:id)

    Seance.includes(:bookings).where(movie: params[:movie_id]).order('datetime').limit(params[:max_results] || 50).map do |seance|
      booking_seats_ids = seance.bookings.pluck(:seat_id)

      {
        id: seance.id,
        price: seance.price,
        datetime: seance.datetime,
        seats: seats_ids.map do |seat_id|
          { id: seat_id, vacant: !(booking_seats_ids.include?(seat_id)) }
        end
      }
    end
  end
end
`
)

func main() {
	files, _, err := gitdiff.Parse(bytes.NewBufferString(diff))
	if err != nil {
		panic(err)
	}

	for _, file := range files {
		writer := bytes.NewBuffer(nil)
		reader := bytes.NewReader([]byte(body))
		appl := gitdiff.NewApplier(reader)
		if err := appl.ApplyFile(writer, file); err != nil {
			panic(err)
		}
	}
}

I traced it to somewhere in the Flush() operation. The internals keep copying the same lines over and over.

I would appreciate your help in debugging this.
Ivan

gitdiff: line 63: git file header: invalid mode line: invalid syntax

Looks like your library is a bit out-of-date? Line 63 in the sample I was testing with looks like this:

new file mode 100644

Split apply logic by fragment type

The single Applier type does some messy internal state tracking to avoid mixing ApplyFile, ApplyTextFragment, and ApplyBinaryFragment. I think the following would be better:

Create a TextApplier (in apply_text.go) and BinaryApplier (in apply_binary.go). Each of these has methods to apply single fragments (and multiple fragments, in the case of TextApplier.)
Remove the Reset method and rename Flush to Close to better indicate that apply types are single-use
Remove the Applier type and the ApplyFile method
Move the logic for ApplyFile to the global Apply function. This is the convenience function to select an applier based on the file type and execute it.

This should reduce confusion and provide an obvious place for the eventual text-only options for fuzzy apply.

git patch with empty emails causing parse errors

I have a git patch that triggers this line (gitdiff/patch_header.go, line 114 at commit 13e8639):

return PatchIdentity{}, fmt.Errorf("invalid identity string: %s", s)

commit 44b179bf547c84cb588480558de71df1e9243aaf
Author: bot-deploy Github Action <>
Date:   Tue Mar 5 17:07:58 2024 +0000

    Export updated bot artifact

diff --git a/bot_exports/ba3e9571-b1d9-45cb-be06-a7b4a2e279e7.blob b/bot_exports/ba3e9571-b1d9-45cb-be06-a7b4a2e279e7.blob
index 4ea75f9..f92448c 100644
Binary files a/bot_exports/ba3e9571-b1d9-45cb-be06-a7b4a2e279e7.blob and b/bot_exports/ba3e9571-b1d9-45cb-be06-a7b4a2e279e7.blob differ

Author: bot-deploy Github Action <> is causing parsing to fail. Can this be gracefully handled, like returning an empty email?
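
One way to handle this gracefully is to accept an empty address between the angle brackets, as in the sketch below. parseIdentity is a hypothetical, lenient parser written for illustration; it is not the library's implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// parseIdentity is a hypothetical, lenient parser for "Name <email>" strings.
// It accepts an empty email ("Name <>") and only fails when the angle
// brackets are missing or both parts are empty.
func parseIdentity(s string) (name, email string, err error) {
	open := strings.LastIndexByte(s, '<')
	end := strings.LastIndexByte(s, '>')
	if open < 0 || end < open {
		return "", "", fmt.Errorf("invalid identity string: %s", s)
	}
	name = strings.TrimSpace(s[:open])
	email = strings.TrimSpace(s[open+1 : end])
	if name == "" && email == "" {
		return "", "", fmt.Errorf("invalid identity string: %s", s)
	}
	return name, email, nil
}

func main() {
	name, email, err := parseIdentity("bot-deploy Github Action <>")
	fmt.Printf("%q %q %v\n", name, email, err)
}
```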

Discrepancy between Gitlab diff API and go-gitdiff parsed output

Hi,

I'm using go-gitdiff to parse git diff patch files from Gitlab and derive some metrics. I used to do the same using the diff response of the Gitlab diff API.

I see some discrepancy between the results.
If I move and rename a file with minimal changes, the go-gitdiff parsed output shows the file at the old path as deleted -

(*gitdiff.File)(0xc00013e480)({
 OldName: (string) (len=59) "adapters/phasedAdapters/executeAdapters/execute_adapters.go",
 NewName: (string) "",
 IsNew: (bool) false,
 IsDelete: (bool) true,
 IsCopy: (bool) false,
 IsRename: (bool) false,
 OldMode: (os.FileMode) -rw-r--r--,
 NewMode: (os.FileMode) ----------,
 OldOIDPrefix: (string) (len=7) "f08cca6",
 NewOIDPrefix: (string) (len=7) "0000000",
 Score: (int) 0

and shows the file at new path as new file -

(*gitdiff.File)(0xc00013e2d0)({
 OldName: (string) "",
 NewName: (string) (len=33) "adapters/phase/execute/execute.go",
 IsNew: (bool) true,
 IsDelete: (bool) false,
 IsCopy: (bool) false,
 IsRename: (bool) false,
 OldMode: (os.FileMode) ----------,
 NewMode: (os.FileMode) -rw-r--r--,
 OldOIDPrefix: (string) (len=7) "0000000",
 NewOIDPrefix: (string) (len=7) "a1d4924",
 Score: (int) 0

whereas in Gitlab MR diff view I get move and rename as -
adapters/phasedAdapters/executeAdapters/execute_adapters.go → adapters/phase/execute/execute.go

In Gitlab diff API, the response for the file I get it as -

 {
        "old_path": "adapters/phasedAdapters/executeAdapters/execute_adapters.go",
        "new_path": "adapters/phase/execute/execute.go",
        "a_mode": "100644",
        "b_mode": "100644",
        "new_file": false,
        "renamed_file": true,
        "deleted_file": false
}

Gitlab determines that the file was just renamed and moved to a different path, whereas go-gitdiff treats it as a new file and considers the file at the old path deleted. Is this expected?

Implement full decoding support for patch subjects

Looking through decode_header in the Git source, it looks like there are several possible encodings. Currently, we only support quoted-printable UTF-8 and ignore anything else (implemented in #25.)

To support arbitrary encodings, I think we need to:

Identify the encoding from the text between the first =? and the next ?
Use the q? or b? to determine if the content until the next =? is encoded as quoted-printable or base64
Decode the content
Use ianaindex.MIME to look up the encoding
Use the resulting encoding to convert the bytes to UTF-8
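
For comparison, the standard library's mime.WordDecoder already performs the first three steps and exposes a CharsetReader hook where a charset lookup (for example via ianaindex.MIME) could be plugged in. The sketch below uses only the standard library; decodeSubject is a hypothetical name, and the hook here just rejects charsets the decoder does not handle itself.

```go
package main

import (
	"fmt"
	"io"
	"mime"
)

// decodeSubject is a hypothetical helper. mime.WordDecoder handles the
// "=?charset?q?...?=" and "=?charset?b?...?=" forms and natively supports
// utf-8, iso-8859-1, and us-ascii.
func decodeSubject(raw string) (string, error) {
	dec := mime.WordDecoder{
		// CharsetReader is consulted only for other charsets. A fuller
		// implementation could look up the charset and return a
		// transforming reader instead of an error.
		CharsetReader: func(charset string, input io.Reader) (io.Reader, error) {
			return nil, fmt.Errorf("unsupported charset %q", charset)
		},
	}
	return dec.DecodeHeader(raw)
}

func main() {
	for _, raw := range []string{
		"=?UTF-8?q?=E2=9C=85=20(deps)?=", // quoted-printable, UTF-8
		"=?ISO-8859-1?b?Y2Fm6Q==?=",      // base64, ISO-8859-1
	} {
		s, err := decodeSubject(raw)
		fmt.Printf("%q %v\n", s, err)
	}
}
```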

Set NewMode when mode does not change

In a patch that includes a mode, but does not change it, only the OldMode field is set. Copy this value into the NewMode field as well for convenience.

Add top-level Apply() function

To mirror Parse, the library should export an Apply(io.Writer, io.ReaderAt, *File) error function as a convenience wrapper for using Applier with default settings. This will be strict application for now, but would probably change to fuzzy application if that's ever implemented.

Support empty patches

Hi!

We are using this lib over at https://pr.pico.sh and so far haven't had any issues. So thank you for your hard work!

We are trying to support cover letters in our patch request workflow, but it looks like since the patch itself is empty, go-gitdiff skips over it and returns empty header data. Ideally this lib would still process the empty patch even if there aren't any diffs in it.

I'm curious what you think about adding support for empty patches?

isSourceLine ignores the line "--------------------------------------"

Since the source contains the line "--------------------------------------", the isSourceLine function just ignores it. I wonder if the condition should be changed as below:

if l := len(line); l == 0 || (l >= 4 && (line[:4] == "--- " || line[:4] == "+++ ")) {
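
The same check can be written with strings.HasPrefix, which needs no length bookkeeping before slicing. This is a sketch that assumes isSourceLine returns false for empty lines and for file header lines; the library's actual function may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// isSourceLine is a sketch of the proposed condition. It assumes the function
// should return false for empty lines and for "--- " / "+++ " file header
// lines, and true for everything else, including lines made only of dashes.
func isSourceLine(line string) bool {
	if line == "" {
		return false
	}
	return !strings.HasPrefix(line, "--- ") && !strings.HasPrefix(line, "+++ ")
}

func main() {
	fmt.Println(isSourceLine("--------------------------------------"))
	fmt.Println(isSourceLine("--- a/file.go"))
}
```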

Separate out "appendix" material?

Similar to #16: git format-patch often adds stuff after a --- line; and many people add things there too. So you might have a message that looks like this:

Subject: [PATCH] Implement foo bar

Blah blah blah

S-o-b: <[email protected]>
---
CC: [email protected]
CC: [email protected]

 xen/common/domain.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

git am always ends up removing anything after the ---, because it actually interprets --- as the beginning of the patch.

Would you be open to having ParsePatchHeader separate out this extra information into a separate field? Maybe BodyAppendix or something like that?

If so I can write something up & send a PR.
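
As a starting point for discussion, the split itself is small. The sketch below cuts the body at the first line that consists only of "---"; splitAppendix is a hypothetical name and the real header parser would need to decide where this fits.

```go
package main

import (
	"fmt"
	"strings"
)

// splitAppendix is a hypothetical helper. It splits a message body at the
// first line that is exactly "---" (ignoring trailing whitespace) and returns
// the text before and after that line. Without such a line, the appendix is
// empty.
func splitAppendix(body string) (message, appendix string) {
	lines := strings.Split(body, "\n")
	for i, line := range lines {
		if strings.TrimRight(line, " \t\r") == "---" {
			return strings.Join(lines[:i], "\n"), strings.Join(lines[i+1:], "\n")
		}
	}
	return body, ""
}

func main() {
	msg, app := splitAppendix("Blah blah blah\n\nS-o-b: someone\n---\nCC: reviewer\n")
	fmt.Printf("%q\n%q\n", msg, app)
}
```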

`ParsePatchHeader` failing on dependabot emails

From a94db29e472831db7a75ba52e99ab717c17886eb Mon Sep 17 00:00:00 2001
From: "dependabot[bot]" <59619111+dependabot[bot]@users.noreply.github.com>
Date: Mon, 29 Apr 2024 17:31:28 +0000
Subject: [PATCH] =?UTF-8?q?=E2=9C=85=20(deps):=20Bump=20schemas=20fr?=
 =?UTF-8?q?om=202.11.20240425191412=20to=202.11.20240429164216=20(#10143)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 Gemfile.lock | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Gemfile.lock b/Gemfile.lock
index 5b1718812c..5e83f3d1fa 100644
--- a/Gemfile.lock
+++ b/Gemfile.lock
@@ -109,7 +109,7 @@ GEM
       faraday
       flipper
       jwt
-    schemas (2.11.20240425191412)
+    schemas (2.11.20240429164216)
       google-protobuf (~> 3.21)
       googleapis-common-protos-types
       twirp (>= 1.7)

For the above patch, ParsePatchHeader is failing with mail: missing @ in addr-spec. I completely understand that the email 59619111+dependabot[bot]@users.noreply.github.com is invalid, and that the parsing logic belongs to the net/mail package. I just wanted to post it here anyway to hear your thoughts on it. Please close it as invalid if it's not worth your time. Thank you!
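
One possible approach is to try net/mail first and fall back to a plain split on the angle brackets when it rejects the address. The sketch below shows that idea; parseFrom is a hypothetical name, and the fallback performs no validation of the address at all.

```go
package main

import (
	"fmt"
	"net/mail"
	"strings"
)

// parseFrom is a hypothetical helper. It prefers net/mail, and if that fails
// it takes whatever sits between the last '<' and '>' as the email and the
// text before it (minus surrounding quotes) as the name.
func parseFrom(s string) (name, email string, err error) {
	if addr, perr := mail.ParseAddress(s); perr == nil {
		return addr.Name, addr.Address, nil
	}
	open := strings.LastIndexByte(s, '<')
	end := strings.LastIndexByte(s, '>')
	if open < 0 || end < open {
		return "", "", fmt.Errorf("invalid identity string: %q", s)
	}
	name = strings.Trim(strings.TrimSpace(s[:open]), `"`)
	return name, strings.TrimSpace(s[open+1 : end]), nil
}

func main() {
	name, email, err := parseFrom(`"dependabot[bot]" <59619111+dependabot[bot]@users.noreply.github.com>`)
	fmt.Println(name, email, err)
}
```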

File name parsing fails for file names with spaces

I recently encountered a patch containing this file deletion (paths sanitized):

diff --git a/path/to/file/File with Spaces.pdf b/path/to/file/File with Spaces.pdf
deleted file mode 100644
index 6e02dcd4fabc172009aca3a6f78763246c59b8fe..0000000000000000000000000000000000000000

I think I assumed these would be quoted, but Git does not seem to consider spaces special characters when generating patches. This leads to a git file header: missing filename information error.

Check behavior against git_header_name in apply.c to see how Git handles this.
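
A rough sketch of one way to handle unquoted names: when a file is not renamed, both halves of the header carry the same name, so every space can be tried as the split point until the two sides agree after dropping their first path component. This is an approximation for illustration, with hypothetical function names, and not a port of git_header_name.

```go
package main

import (
	"fmt"
	"strings"
)

// parseGitHeaderName is a hypothetical sketch for unquoted, non-renamed
// files. It tries each space in the header as the separator between the old
// and new names and accepts the first split where both names match once the
// leading "a/" or "b/" style component is removed.
func parseGitHeaderName(header string) (string, bool) {
	const prefix = "diff --git "
	if !strings.HasPrefix(header, prefix) {
		return "", false
	}
	rest := strings.TrimPrefix(header, prefix)
	for i := 0; i < len(rest); i++ {
		if rest[i] != ' ' {
			continue
		}
		oldName := stripFirstComponent(rest[:i])
		newName := stripFirstComponent(rest[i+1:])
		if oldName != "" && oldName == newName {
			return oldName, true
		}
	}
	return "", false
}

// stripFirstComponent removes everything up to and including the first '/'.
// It returns "" when the path has no '/'.
func stripFirstComponent(p string) string {
	if i := strings.IndexByte(p, '/'); i >= 0 {
		return p[i+1:]
	}
	return ""
}

func main() {
	name, ok := parseGitHeaderName("diff --git a/path/to/file/File with Spaces.pdf b/path/to/file/File with Spaces.pdf")
	fmt.Printf("%q %v\n", name, ok)
}
```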