Comments (19)

kolbitsch-lastline avatar kolbitsch-lastline commented on July 19, 2024

ok, was able to reproduce it:

goroutine 53 [running]:
reflect.Value.Int(...)
	/usr/lib/go-1.14/src/reflect/value.go:976
github.com/Shopify/ghostferry.RowData.GetUint64(0xc00067efc0, 0x9, 0x9, 0x0, 0x0, 0x0, 0x1)
	/work/go/home/src/github.com/Shopify/ghostferry/dml_events.go:32 +0x2f2
github.com/Shopify/ghostferry.paginationKeyFromEventData(0xc0003298c0, 0xc00067efc0, 0x9, 0x9, 0x0, 0x0, 0x0)
	/work/go/home/src/github.com/Shopify/ghostferry/dml_events.go:600 +0xdf
github.com/Shopify/ghostferry.(*BinlogUpdateEvent).PaginationKey(0xc0002fc900, 0xc000349590, 0xc0000a40c0, 0x10)
	/work/go/home/src/github.com/Shopify/ghostferry/dml_events.go:199 +0x4d
github.com/Shopify/ghostferry.(*InlineVerifier).binlogEventListener(0xc0003fcf00, 0xc000109ea0, 0xc00001b6e8, 0xc000109ea0)
	/work/go/home/src/github.com/Shopify/ghostferry/inline_verifier.go:587 +0x31f
github.com/Shopify/ghostferry.(*BinlogStreamer).emitEvent(0xc00032e500, 0xc000349590, 0x1d, 0xc00001b908)
	/work/go/home/src/github.com/Shopify/ghostferry/binlog_streamer.go:349 +0x319
github.com/Shopify/ghostferry.(*BinlogStreamer).Run(0xc00032e500)
	/work/go/home/src/github.com/Shopify/ghostferry/binlog_streamer.go:205 +0x836
github.com/Shopify/ghostferry.(*Ferry).Run.func7(0xc0000a4050, 0xc00028a000)
	/work/go/home/src/github.com/Shopify/ghostferry/ferry.go:618 +0x5b
created by github.com/Shopify/ghostferry.(*Ferry).Run
	/work/go/home/src/github.com/Shopify/ghostferry/ferry.go:615 +0x2e0

Note that the line numbers may differ from the master branch, as I have other changes compiled in.

from ghostferry.

shuhaowu avatar shuhaowu commented on July 19, 2024

We've seen the same errors without being able to reproduce them as well. Any way to write a test for this?

kolbitsch-lastline avatar kolbitsch-lastline commented on July 19, 2024

> We've seen the same errors without being able to reproduce it as well.

glad to hear it's not just me :-) I was already afraid it's something with my version of mysql/mariadb

> Any way to write a test for this?

yep, working on it... I can reproduce it in my test setup (using real data and DBs, not something I can just put into a unit-test). But it's on my TODO for today to make a unit-test + integration test (types_test.rb)

kolbitsch-lastline avatar kolbitsch-lastline commented on July 19, 2024

FYI: I'm unable to reproduce the issue using an integration test. Not sure if we still want to commit it. Unit-tests were obviously added

shuhaowu avatar shuhaowu commented on July 19, 2024

Do you know the actual uint64 value that's returned? Maybe this value overflows the standard int64? I would like to try to have an explanation of what's happening before blindly merging, as otherwise this code becomes a bit too magical.

kolbitsch-lastline avatar kolbitsch-lastline commented on July 19, 2024

> Do you know the actual uint64 value that's returned? Maybe this value overflows the standard int64?

Yes, I do; see my first comment. I tried with this value and it was still an int. My theory is that it has to do with other data found in a batch: maybe once mysql finds a value for which it decides that an int is sufficient, it sticks to that type.

I tried that in my integration test, but could not trigger it

> I would like to try to have an explanation of what's happening before blindly merging, as otherwise this code becomes a bit too magical.

well, that's why we have a unit-test :-)

shuhaowu avatar shuhaowu commented on July 19, 2024

@danieloliveira079 I recall that you guys saw this issue. Did you ever figure out what triggered this?

I still feel somewhat uncomfortable merging some code to resolve a crash that we don't understand. What if there's something tricky happening that needs special handling to not result in data corruption?

kolbitsch-lastline avatar kolbitsch-lastline commented on July 19, 2024

makes perfect sense, but please note that I was able to reproduce it and test my change (and it fixed the issue). I was merely unable to scale it down into an integration test and still see the same behavior.

shuhaowu avatar shuhaowu commented on July 19, 2024

I know this can be fixed. I've seen similar patches internally. I'm inclined to merge it as is, but there's always some fear in doing so without understanding the problem, because Ghostferry has the ability to silently corrupt data in a way that would be very costly to fix.

kolbitsch-lastline avatar kolbitsch-lastline commented on July 19, 2024

I know exactly where you're coming from, makes perfect sense to me :-). Although: right now we know it's corrupting data, so not sure how much worse it gets ;-)

shuhaowu avatar shuhaowu commented on July 19, 2024

Are you referring to the corruption in the binary column or something to do with this issue?

kolbitsch-lastline avatar kolbitsch-lastline commented on July 19, 2024

sorry, I wrote "corrupt", but meant "crash". I have too many tickets open, I think 😄

kolbitsch-lastline avatar kolbitsch-lastline commented on July 19, 2024

unfortunately just ran into another stacktrace, even with my patch:

panic: reflect: call of reflect.Value.Int on uint32 Value

goroutine 11 [running]:
reflect.Value.Int(...)
	/usr/lib/go-1.14/src/reflect/value.go:976
github.com/Shopify/ghostferry.RowData.GetUint64(0xc0027f6990, 0x3, 0x3, 0x0, 0x0, 0x0, 0x1)
	/work/go/home/src/github.com/Shopify/ghostferry/dml_events.go:32 +0x31d
github.com/Shopify/ghostferry.paginationKeyFromEventData(0xc0000d9940, 0xc0027f6990, 0x3, 0x3, 0x0, 0x0, 0x0)
	/work/go/home/src/github.com/Shopify/ghostferry/dml_events.go:600 +0xdf
github.com/Shopify/ghostferry.(*BinlogDeleteEvent).PaginationKey(0xc0029e9860, 0xc0027f6c30, 0xc0004952b0, 0x10)
	/work/go/home/src/github.com/Shopify/ghostferry/dml_events.go:245 +0x4c
github.com/Shopify/ghostferry.(*InlineVerifier).binlogEventListener(0xc00043b950, 0xc002e98370, 0xc000019828, 0xc002e98370)
	/work/go/home/src/github.com/Shopify/ghostferry/inline_verifier.go:602 +0x338
github.com/Shopify/ghostferry.(*BinlogStreamer).emitEvent(0xc000246500, 0xc0027f6c30, 0x1d, 0xc000019a48)
	/work/go/home/src/github.com/Shopify/ghostferry/binlog_streamer.go:347 +0x319
github.com/Shopify/ghostferry.(*BinlogStreamer).Run(0xc000246500)
	/work/go/home/src/github.com/Shopify/ghostferry/binlog_streamer.go:209 +0x569
github.com/Shopify/ghostferry.(*Ferry).Run.func7(0xc00003c200, 0xc00022e000)
	/work/go/home/src/github.com/Shopify/ghostferry/ferry.go:619 +0x5b
created by github.com/Shopify/ghostferry.(*Ferry).Run
	/work/go/home/src/github.com/Shopify/ghostferry/ferry.go:616 +0x2e0

it seems that we're getting not only uint64 but any of the unsigned integer types, and we need to handle all of them
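For reference, the panic above can be reproduced in isolation: reflect.Value.Int only accepts signed integer kinds and panics for every unsigned one. The sketch below uses a hypothetical `IntPanics` helper (not part of ghostferry) to show that behavior.

```go
package main

import (
	"fmt"
	"reflect"
)

// IntPanics reports whether reflect.Value.Int panics for the given value.
// (Hypothetical helper, for illustration only.)
func IntPanics(v interface{}) (panicked bool) {
	defer func() {
		if r := recover(); r != nil {
			panicked = true
		}
	}()
	_ = reflect.ValueOf(v).Int()
	return false
}

func main() {
	fmt.Println(IntPanics(int64(7)))  // false: signed kinds are accepted
	fmt.Println(IntPanics(uint32(7))) // true: same panic as in the trace above
	fmt.Println(IntPanics(uint64(7))) // true
}
```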

shuhaowu avatar shuhaowu commented on July 19, 2024

Is this a good time to check the siddontang library to see what's happening?

danieloliveira079 avatar danieloliveira079 commented on July 19, 2024

Hi there, below you can find the instructions that one can use to reproduce the error.

@bakhti has provided those steps and I am only the communicator :)

Basically all the instructions can be found on https://shopify.github.io/ghostferry/master/tutorialcopydb.html with the exception of one configuration used by Ghostferry named VerifierType.

Currently, the instructions suggest that it can be ChecksumTable. However, we could only reproduce the issue by changing that to InlineVerifier.

In summary, the panic occurs while the table is being synced and an INSERT or UPDATE statement is triggered on the source database. I believe the same would occur with a table that has completed the sync. This needs more testing with a larger dataset that would take longer to finish the sync.

I am not entirely sure what is causing the issue, but the way we got it fixed (experiment branch) is demonstrated below:

func (r RowData) GetUint64(colIdx int) (res uint64, err error) {
	if valueByteSlice, ok := r[colIdx].([]byte); ok {
		// The value arrived as a decimal string in a byte slice.
		valueString := string(valueByteSlice)
		res, err = strconv.ParseUint(valueString, 10, 64)
		if err != nil {
			return 0, err
		}
	} else if v, ok := r[colIdx].(uint64); ok {
		// Already unsigned, so it cannot be negative.
		res = v
	} else if v, ok := r[colIdx].(uint32); ok {
		res = uint64(v)
	} else {
		// Everything else is assumed to be a signed integer kind;
		// reflect.Value.Int panics for any other type.
		signedInt := reflect.ValueOf(r[colIdx]).Int()
		if signedInt < 0 {
			return 0, fmt.Errorf("expected position %d in row to contain an unsigned number", colIdx)
		}
		res = uint64(signedInt)
	}
	return
}

The first time we got this error we were running Copydb on a larger database (2TB) and it was constantly crashing for different tables and at different moments in time. With the code above the problem never happened again, but I am not sure about the side effects. It is worth saying that we are not sure whether it would corrupt data or not. We are not using that fix for production DBs yet.

Let me know if you need further details.

kolbitsch-lastline avatar kolbitsch-lastline commented on July 19, 2024

thanks for the details. I'm now going into the replication module to see if they do something funky.

FYI, I'm quite positive that we not only saw this due to verification, but also due to "mere" binlog writing.

Also thanks for your patch, I've been testing with a very similar version of your code - an extension of the code currently under review:

// The mysql driver does not always give you a uint64 from Scan, instead you
// can get an int64 for values that fit in int64 or a byte slice decimal string
// with the uint64 value in it.
func (r RowData) GetUint64(colIdx int) (res uint64, err error) {
	rowValue := r[colIdx]
	switch v := rowValue.(type) {
	case uint64:
		res = v
	case uint32:
		res = uint64(v)
	case uint16:
		res = uint64(v)
	case uint8:
		res = uint64(v)
	case uint:
		res = uint64(v)
	case []byte:
		valueString := string(v)
		res, err = strconv.ParseUint(valueString, 10, 64)
	case string:
		res, err = strconv.ParseUint(v, 10, 64)
	default:
		signedInt := reflect.ValueOf(rowValue).Int()
		if signedInt < 0 {
			err = fmt.Errorf("expected position %d in row to contain an unsigned number", colIdx)
		} else {
			res = uint64(signedInt)
		}
	}
	return
}

once we get to the bottom of this, I'll send an updated review (if applicable)
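As a possible hardening of the default branch above, the sketch below checks the reflect kind first and returns an error instead of panicking on unexpected types. `ToUint64` is a hypothetical standalone function written for illustration; it is not the code under review and not ghostferry's API.

```go
package main

import (
	"fmt"
	"reflect"
	"strconv"
)

// ToUint64 converts a scanned value to uint64, returning an error (rather
// than panicking) for negative or non-integer inputs.
// (Hypothetical standalone helper, not ghostferry's actual code.)
func ToUint64(value interface{}) (uint64, error) {
	// Decimal strings are parsed directly.
	switch v := value.(type) {
	case []byte:
		return strconv.ParseUint(string(v), 10, 64)
	case string:
		return strconv.ParseUint(v, 10, 64)
	}

	// Integer kinds are handled via reflection, checking the kind up front
	// so that neither Int() nor Uint() can panic.
	rv := reflect.ValueOf(value)
	switch rv.Kind() {
	case reflect.Uint, reflect.Uint8, reflect.Uint16, reflect.Uint32, reflect.Uint64:
		return rv.Uint(), nil
	case reflect.Int, reflect.Int8, reflect.Int16, reflect.Int32, reflect.Int64:
		if rv.Int() < 0 {
			return 0, fmt.Errorf("expected an unsigned number, got %d", rv.Int())
		}
		return uint64(rv.Int()), nil
	default:
		return 0, fmt.Errorf("cannot convert %T to uint64", value)
	}
}

func main() {
	inputs := []interface{}{uint32(5), int64(5), []byte("18446744073709551615"), int64(-1), 1.5, nil}
	for _, in := range inputs {
		res, err := ToUint64(in)
		fmt.Println(res, err)
	}
}
```

The last three inputs produce errors rather than a crash, which may be preferable in a long-running copy, though whether an error or a panic is the right failure mode here is a design decision for the maintainers.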

kolbitsch-lastline avatar kolbitsch-lastline commented on July 19, 2024

> Is this a good time to check the siddontang library to see what's happening?

I went through the code. The only "funky" thing I can find is in the decodeDecimal method:

https://github.com/Shopify/ghostferry/blob/master/vendor/github.com/siddontang/go-mysql/replication/row_event.go#L576

which I believe is not causing the problem we are discussing here.

I think that we always get a signed integer from the library, which is good (it at least seems consistent). IMHO the library should use the table schema to cast the value from the signed value into the correct type, but maybe there are reasons it's not so today.
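To illustrate what such a schema-aware cast could look like, here is a minimal sketch. It assumes the decoder hands back int8/int16/int32/int64 values whose width matches the column, and that the caller already knows from the table schema that the column is UNSIGNED. `ReinterpretUnsigned` is a hypothetical name; nothing like it is confirmed to exist in the library.

```go
package main

import "fmt"

// ReinterpretUnsigned maps a signed value decoded from an UNSIGNED column to
// the unsigned value with the same bit pattern. (Hypothetical helper; it
// assumes the decoder returns int8/int16/int32/int64 matching the column width.)
func ReinterpretUnsigned(value interface{}) (uint64, error) {
	switch v := value.(type) {
	case int8:
		return uint64(uint8(v)), nil
	case int16:
		return uint64(uint16(v)), nil
	case int32:
		return uint64(uint32(v)), nil
	case int64:
		return uint64(v), nil
	default:
		return 0, fmt.Errorf("unsupported type %T", value)
	}
}

func main() {
	// -1 as int64 has the same bits as the largest BIGINT UNSIGNED value.
	fmt.Println(ReinterpretUnsigned(int64(-1))) // 18446744073709551615 <nil>
	fmt.Println(ReinterpretUnsigned(int8(-1)))  // 255 <nil>
}
```

Going through the same-width unsigned type first matters: converting int8(-1) straight to uint64 would sign-extend and give the 64-bit maximum instead of 255.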

But that means I better understand why I could not reproduce it in an integration test: it's not the replication module. We use the GetUint64 helper in other cases than just data coming from replication, such as verification. I could swear I had the inline-verifier disabled when I saw the issues, but I was testing many different configurations, so I may simply be wrong (or maybe the conversion happens in a path before we check if the verifier is on).

What's next? We should search what other paths lead to the helper and see where we introduce unsigned values. Maybe it's still best to handle unsigned types as proposed above, so we cover all paths leading there, but at least we have a bit of a more precise idea.
I have spent quite some time on this already. I'm not sure I can dive much deeper at this point.

And, of course, please take this with a grain of salt, who knows if I overlooked something :-)

shuhaowu avatar shuhaowu commented on July 19, 2024

So finally I was able to reproduce this issue. Part of the problem is some carelessness on my part: I assumed that the column type we run ghostferry against internally is BIGINT UNSIGNED, but it turns out we are only using BIGINT. As a result, I thought there's something more complex going on here.

I'm able to reproduce this problem in the integration testing environment, and the commit message is attached here verbatim:

The issue appears to be when the PK column is `BIGINT UNSIGNED`. For
some reason I was under the impression that that column type is what we
used internally. It turns out we are mostly using ghostferry with
dataset where the PK is always `BIGINT` (signed).

If you look through the test, we also don't really use `UNSIGNED`
anywhere except in `TestCopyDataWithLargePaginationKeyValues`. However,
this test does not attempt to insert into the source while copy is
happening. This therefore does not trigger the bug, as it is only
revealed in DML events.

I'll try to cherry-pick one of the above patches and hopefully merge it ASAP.

Thanks all and sorry for all the confusion.

shuhaowu avatar shuhaowu commented on July 19, 2024

Also, I can only reproduce this with the InlineVerifier on, as the InlineVerifier calls ev.PaginationKey, which eventually calls the conversion code (GetUint64).
