
Comments (7)

kolbitsch-lastline commented on July 19, 2024

FYI: I have a proof-of-concept patch that I'd like to open up for review. I understand that it may need a few unit tests, but before I spend time writing those, I'd like to get some general feedback on the idea.

from ghostferry.

shuhaowu commented on July 19, 2024

We designed Ghostferry to work with container technologies such as Docker. Dumping to a file is somewhat problematic inside a container, which is why we dump to stdout for now. Given this, I don't think the core library should make the decision to dump to a file. That said, if you want to make copydb dump its state to a file, that's acceptable.

Are you saying that some of the error messages are dumped to stdout? We can also make these go to stderr, like all log messages, for consistency.


kolbitsch-lastline commented on July 19, 2024

> We designed Ghostferry to work with container technologies such as Docker. Dumping to a file is somewhat problematic inside a container, which is why we dump to stdout for now. Given this, I don't think the core library should make the decision to dump to a file.

Yeah, same here, except that we run everything via Kubernetes, which makes capturing data on stdout (and passing it to a second invocation) even more difficult.

This is where mounted files/directories come in handy, while having to deal with stdout + CLI args becomes tedious.

However, you have a point. I think this first patch will mostly be used in testing (and it has already turned out very handy for me).
I have a second change in development where the state tracking is moved into the target DB, meaning that a client can resume without any state on the client side, which I think is the real solution to our pain.
Still, I found this change useful and very limited in scope, so I thought I'd post it upstream.

> Are you saying that some of the error messages are dumped to stdout? We can also make these go to stderr, like all log messages, for consistency.

No, sorry if the description was misleading. It's not that we see non-log output on stdout; it's that we see no output on stdout at all, because the tool never started fully. Thus, if this output is propagated to the next run, the empty output causes issues.


shuhaowu commented on July 19, 2024

Storing the state tracking in the target DB (or some other MySQL instance) is certainly an interesting solution. In fact, this is what we do: we have code that spawns Ghostferry, reads its stdout, and then stores the state into MySQL.

There's one problem with this, however: the schema dump can be very big (~10MB) depending on how many tables you have. We work around this by trimming the schema dump for now. However, if we need it in the future for schema changes, writing it to the target may be very heavy, especially if you run a lot of these moves at once.

One thing I can recommend for your setup, especially if you're running in Kubernetes, is to either have an external tool spawn Ghostferry or write a custom ghostferry application (ghostferry-my-app) based on copydb. Such an application can directly access the StateTracker and therefore record the state dump to any location available to you.


kolbitsch-lastline commented on July 19, 2024

> Storing the state tracking in the target DB (or some other MySQL instance) is certainly an interesting solution. In fact, this is what we do: we have code that spawns Ghostferry, reads its stdout, and then stores the state into MySQL.

Heh, then I guess I re-implemented what you guys do. I liked this approach better than files/stdout simply because I can embed the write to the target DB in a transaction, meaning that a DML event (batch) is applied together with the updated pointer, or neither is updated.

Do you have the state write embedded in the transaction, or do you just dump the state to independent (external) storage? I was really worried about updating data but losing the resume-position update, which is why I opted for the transactional approach (which, by the way, does not work for ALTER TABLE statements, because MySQL cannot alter and update in one transaction).

Since you say you do that too: do you have code you could share?
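For illustration, here is a minimal in-memory sketch of that transactional idea (all names are hypothetical; this is not ghostferry code). The point is only the atomicity: the DML batch and the resume pointer become visible together, or not at all.

```python
# Toy model of committing a DML batch together with the resume-position
# update in one transaction (hypothetical names, not ghostferry code).
# If anything fails before commit, neither the rows nor the pointer change.

class TargetDB:
    def __init__(self):
        self.rows = {}            # committed row data
        self.resume_position = 0  # committed binlog pointer

    def apply_batch_transactionally(self, batch, new_position, crash=False):
        # Stage changes; nothing is visible until commit.
        staged_rows = dict(self.rows)
        for key, value in batch:
            staged_rows[key] = value
        if crash:
            # Simulated crash mid-transaction: rollback, nothing applied.
            return False
        # Commit: rows and pointer become visible atomically.
        self.rows = staged_rows
        self.resume_position = new_position
        return True

db = TargetDB()
db.apply_batch_transactionally([(1, "a"), (2, "b")], new_position=100)
# A crashed transaction leaves both the rows and the pointer untouched.
db.apply_batch_transactionally([(3, "c")], new_position=200, crash=True)
print(db.rows, db.resume_position)  # → {1: 'a', 2: 'b'} 100
```

In a real MySQL implementation the "commit" step would be a single transaction containing the batch's DML plus an UPDATE on a small state table.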

> There's one problem with this, however: the schema dump can be very big (~10MB) depending on how many tables you have. We work around this by trimming the schema dump for now. However, if we need it in the future for schema changes, writing it to the target may be very heavy, especially if you run a lot of these moves at once.

In my case, I only store minimal data, such as the binlog/resume position as well as the pagination data. I found the rest not to be useful: since my port supports schema changes, I need to be able to re-import schemas anyway, so loading the table schema from the target DB works just fine.

Do you see a reason to store the whole state? To me it seems perfectly fine to load it again at resume time.

> One thing I can recommend for your setup, especially if you're running in Kubernetes, is to either have an external tool spawn Ghostferry or write a custom ghostferry application (ghostferry-my-app) based on copydb. Such an application can directly access the StateTracker and therefore record the state dump to any location available to you.

Yep, I call it ghostferry-replicatedb :-) It does essentially all of this. However, it's a bit more embedded in the ghostferry core because, as written above, I want transactional safety.


shuhaowu commented on July 19, 2024

> Do you have the state write embedded in the transaction, or do you just dump the state to independent (external) storage?

External storage.

> I was really worried about updating data but losing the resume-position update, which is why I opted for the transactional approach

This is not necessary. As it is, Ghostferry can update some rows, fail to update the last streamed position, crash, resume from an outdated position, and still maintain its data-consistency guarantees. This is because any replayed DML statements will be no-ops on the target, which already has the data (due to using INSERT IGNORE, UPDATE ... WHERE row = full row image, and DELETE ... WHERE row = full row image). Based on how the code is written, there should never be a case where the resume position is written before the data is written to the target, which is what would cause data loss.

IIRC, this process was modeled with TLA+ and was demonstrated to be safe. Do you have any specific concerns that made you want transactional safety? We found it to be unnecessary.
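A toy sketch of that idempotency argument (hypothetical names, not ghostferry code): because every statement is conditioned on the full row image (or uses INSERT IGNORE), replaying the whole event stream from an outdated position converges to the same target state.

```python
# Toy model of why resuming from an outdated binlog position is safe
# (hypothetical names, not ghostferry code). Each event is applied with
# INSERT IGNORE / full-row-image WHERE semantics, so replaying the
# event stream a second time converges to the same final state.

def apply_event(target, event):
    kind, key, old_row, new_row = event
    if kind == "insert":
        # INSERT IGNORE: no-op if the row already exists.
        target.setdefault(key, new_row)
    elif kind == "update":
        # UPDATE ... WHERE row = full old row image.
        if target.get(key) == old_row:
            target[key] = new_row
    elif kind == "delete":
        # DELETE ... WHERE row = full row image.
        if target.get(key) == old_row:
            del target[key]

events = [
    ("insert", 1, None, "v1"),
    ("update", 1, "v1", "v2"),
    ("insert", 2, None, "x"),
    ("delete", 2, "x", None),
]

target = {}
for e in events:
    apply_event(target, e)
once = dict(target)

# Crash, then resume from an outdated position: replay everything again.
for e in events:
    apply_event(target, e)

print(target == once)  # → True
```

Note that this only holds when the full suffix of the stream is replayed in order; individual events are conditioned on the row images, so already-applied changes cannot be applied twice.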


kolbitsch-lastline commented on July 19, 2024

Yeah, I read that in the docs already. Unfortunately, it does not hold true in my scenario, where I support DDL: any change to the schema (or any TRUNCATE TABLE) must force a "synchronization point", because we cannot resume from an earlier point.

Imagine a table column is removed and we revert the binlog position to an earlier INSERT statement: ghostferry would simply crash.

However, as mentioned above, transactional safety is not given for ALTER TABLE anyway (I cannot create a transaction that does an upsert and an ALTER at once in MySQL). So in my code I have to take some unfortunate races into consideration and hope that a one-statement transaction for the DDL and the subsequent state update either fail before the ALTER or make it past the update.
I'm mostly concerned about our Kubernetes cluster killing ghostferry, or the pod/container being cut off from the network and unable to communicate its updated state.
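A small sketch of that failure mode (hypothetical names, not ghostferry code): once a column has been dropped on the target, replaying an older INSERT whose row image still contains that column can no longer be applied, which is why a DDL event has to act as a synchronization point for the resume position.

```python
# Toy model of why binlog replay across a schema change is unsafe
# (hypothetical names, not ghostferry code). Events carry full row
# images keyed by column name; an event recorded before a DROP COLUMN
# no longer matches the current schema when replayed.

schema = ["id", "name", "email"]
old_insert = {"id": 1, "name": "a", "email": "a@example.com"}

def replay_insert(schema, row_image):
    extra = set(row_image) - set(schema)
    if extra:
        raise ValueError(f"row image has unknown columns: {sorted(extra)}")
    return {col: row_image.get(col) for col in schema}

# Works while the schema still matches the row image.
replay_insert(schema, old_insert)

# DDL: the 'email' column is dropped on the target.
schema = ["id", "name"]

# Resuming from before the DDL and replaying the old insert now fails,
# so the resume position must never point to before a schema change.
try:
    replay_insert(schema, old_insert)
    crashed = False
except ValueError:
    crashed = True
print(crashed)  # → True
```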

