Comments (7)
FYI: I have a proof-of-concept patch that I'd like to open up for review. I understand that it may need a few unit tests, but before I spend time writing those, I'd like some general feedback on the idea.
from ghostferry.
We designed Ghostferry to work with container technologies such as Docker. Dumping to a file is somewhat problematic, as the container filesystem is ephemeral. This is why we dump to stdout for now. Given this, I don't think the core library should make the decision to dump to a file. That said, if you want to make copydb dump its state to a file, that's acceptable.
Are you saying that some of the error messages are dumped to stdout? We can also make these go to stderr, like all log messages, to be consistent.
We designed Ghostferry to work with container technologies such as Docker. Dumping to a file is somewhat problematic, as the container filesystem is ephemeral. This is why we dump to stdout for now. Given this, I don't think the core library should make the decision to dump to a file.
yeah, same here, except that we run everything via Kubernetes. This makes capturing data on stdout (and passing it to a second invocation) even more difficult.
This is where mounted files/directories come in handy; having to deal with stdout + CLI args becomes tedious.
However, you have a point: I think this first patch will mostly be used in testing (and it has already turned out very handy for me).
I have a second change in development where the state tracking is moved into the target DB, meaning that a client can resume without any state on the client; I think that is really the solution to our pain.
Still, I found this change useful and very limited in scope, so I thought I'd post it upstream.
Are you saying that some of the error messages are dumped to stdout? We can also make these go to stderr, like all log messages, to be consistent.
no, sorry if the description was misleading. It's not that we see non-log output on stdout; it's that we see no output on stdout at all, because the tool never started fully. If that empty output is then propagated to the next run, it causes issues.
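The mounted-file approach discussed above avoids that empty-output failure mode if the write is atomic. Here is a minimal sketch (file path and state payload are hypothetical, not Ghostferry's actual format): dump to a temp file and rename, so a crash mid-write never leaves a truncated state file for the next run.

```python
import json
import os
import tempfile

def write_state_dump(state: dict, path: str) -> None:
    # Write atomically: dump to a temp file in the same directory, then
    # rename over the target, so a crash mid-write never leaves a
    # truncated or empty state file for the next invocation.
    d = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=d)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # atomic on POSIX filesystems

def read_state_dump(path: str):
    # A missing file simply means "fresh start", not an error.
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)
```

With a Kubernetes-mounted volume, the next pod invocation can call `read_state_dump` and either resume or start fresh, with no stdout plumbing between runs.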
Storing the state tracking in the target DB (or some other MySQL instance) is certainly an interesting solution. In fact, this is what we do: we have code that spawns Ghostferry, reads its stdout, and then stores the state in MySQL.
There's one problem with this however: the schema dump could be very big (~10MB) depending on how many tables you have. We work around this by trimming the schema dump for now. However, if we need it in the future for schema changes, writing it to the target may be very heavy, especially if you run a lot of these moves at once.
One thing I can recommend for your setup, especially if you're running in Kubernetes, is to either have an external tool spawn Ghostferry, or write a custom ghostferry application (ghostferry-my-app) that's based on copydb, which would then be able to directly access the StateTracker and therefore record the state dump to any location available to you.
Storing the state tracking in the target DB (or some other MySQL instance) is certainly an interesting solution. In fact, this is what we do: we have code that spawns Ghostferry, reads its stdout, and then stores the state in MySQL.
heh, then I guess I re-implemented what you do. I liked this approach better than files/stdout, simply because I can embed the write to the target DB in a transaction, meaning that a DML event (batch) is applied together with the updated pointer, or neither is applied.
Do you have the store embedded in the transaction, or do you just dump state to independent (external) storage? I was really worried about updating data but losing the resume position update, which is why I opted for the transactional approach (which, by the way, does not work for ALTER TABLE statements, because you cannot run an ALTER and an update in one transaction in MySQL).
Since you say you do that too: do you have code you could share?
There's one problem with this however: the schema dump could be very big (~10MB) depending on how many tables you have. We work around this by trimming the schema dump for now. However, if we need it in the future for schema changes, writing it to the target may be very heavy, especially if you run a lot of these moves at once.
In my case, I only store minimal data, such as the binlog/resume position as well as the pagination data. I found the rest not to be useful: since my port supports schema changes, I need to be able to re-import schemas anyway, so loading the table schema from the target DB works just fine.
Do you see a reason to store the whole state? To me it seems perfectly fine to load it again at resume time.
One thing I can recommend for your setup, especially if you're running in Kubernetes, is to either have an external tool spawn Ghostferry or write a custom ghostferry application (ghostferry-my-app) that's based on copydb, which would then be able to directly access the StateTracker and therefore record the state dump to any locations available to you.
yep, I call it ghostferry-replicatedb :-) It does essentially all of this. However, it's a bit more embedded in the ghostferry core because, as written above, I want transactional safety.
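The transactional pointer update described above can be sketched in a few lines. This is an illustrative stand-in (sqlite3 instead of MySQL; table and column names are hypothetical, not Ghostferry's schema) showing the core property: the copied rows and the updated resume position commit together, or not at all.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target_rows (id INTEGER PRIMARY KEY, v TEXT)")
conn.execute("CREATE TABLE resume_state (k TEXT PRIMARY KEY, binlog_pos INTEGER)")
conn.execute("INSERT INTO resume_state VALUES ('pos', 0)")
conn.commit()

def apply_batch(rows, new_pos):
    try:
        with conn:  # one transaction: rows + pointer commit, or neither does
            conn.executemany("INSERT INTO target_rows VALUES (?, ?)", rows)
            conn.execute(
                "UPDATE resume_state SET binlog_pos = ? WHERE k = 'pos'",
                (new_pos,),
            )
    except sqlite3.Error:
        pass  # rolled back; the pointer still matches the applied data

apply_batch([(1, "a"), (2, "b")], 100)
# A failing batch (duplicate PK here, a crash in practice) also rolls back
# the pointer update, so the state never runs ahead of the data:
apply_batch([(2, "dup")], 200)

pos = conn.execute("SELECT binlog_pos FROM resume_state").fetchone()[0]
print(pos)  # 100: the failed batch did not advance the pointer
```

The same shape works against MySQL with InnoDB, as long as the state table lives in the target instance so both writes share one transaction.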
Do you have the store embedded in the transaction, or do you just dump state to independent (external) storage?
External storage.
I was really worried about updating data but losing the resume position update, which is why I opted for the transactional approach
This is not necessary. As it is, Ghostferry can update some rows, fail to update the last streamed position, crash, resume from an outdated position, and still maintain its data consistency guarantees. This is because any replayed DML statements will be no-ops on the target, which already has the data (due to using INSERT IGNORE, UPDATE .. WHERE row = full row image, and DELETE .. WHERE row = full row image). Based on how the code is written, there should never be a case where the resume position is written before the data is written to the target, which is what would cause data loss.
iirc this process was modeled with TLA+ and was shown to be safe. Do you have any specific concerns that made you want transactional safety? We found it to be unnecessary.
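The replay-safety argument above can be demonstrated concretely. A minimal sketch, with sqlite3's INSERT OR IGNORE and full-row-image WHERE clauses standing in for the MySQL statements named above (the event format here is made up for illustration): replaying events from an outdated resume position leaves the target unchanged.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, v TEXT)")

def replay(events):
    # Each write is idempotent, so re-applying events from an outdated
    # resume position after a crash is harmless.
    for kind, row in events:
        if kind == "insert":
            # Target already has the row -> the insert is ignored.
            conn.execute("INSERT OR IGNORE INTO t VALUES (?, ?)", row)
        elif kind == "delete":
            # Matched against the full row image: only deletes the row
            # if it still looks exactly as the binlog event said.
            conn.execute("DELETE FROM t WHERE id = ? AND v = ?", row)

events = [("insert", (1, "a")), ("insert", (2, "b"))]
replay(events)  # first pass
replay(events)  # simulated crash + resume from an outdated position
rows = conn.execute("SELECT id, v FROM t ORDER BY id").fetchall()
print(rows)  # [(1, 'a'), (2, 'b')] -- the replay changed nothing
```

This is why losing the very latest position update costs only some re-work, not correctness, for plain DML.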
yeah, I read that in the docs already. Unfortunately it does not hold true in my scenario, where I support DDL: any change to the schema (or any TRUNCATE TABLE) must force a "synchronization point", as we cannot resume from an earlier point.
Imagine a table column is removed, and we revert the binlog position to an earlier INSERT statement: ghostferry would simply crash.
However, as mentioned above, transactional safety is not given for ALTER TABLE anyway (I cannot create a transaction that does an upsert and an ALTER at once in MySQL). So in my code I have to take some unfortunate races into consideration and hope that a one-statement transaction for the DDL and the subsequent state update either fails before the ALTER or makes it past the update.
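One way to narrow that DDL race is to make the DDL step itself idempotent, so that a crash between the ALTER and the pointer update is safe to replay on resume. A minimal sketch, with sqlite3 as a stand-in for MySQL and all table/column names hypothetical (in MySQL, DDL commits implicitly, which is exactly why the two steps below cannot share one transaction):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE resume_state (k TEXT PRIMARY KEY, pos INTEGER)")
conn.execute("INSERT INTO resume_state VALUES ('pos', 0)")

def column_exists(table, col):
    # Introspect the live schema (MySQL would use information_schema).
    return any(r[1] == col for r in conn.execute(f"PRAGMA table_info({table})"))

def apply_ddl_event(pos):
    if not column_exists("t", "extra"):  # idempotence guard
        conn.execute("ALTER TABLE t ADD COLUMN extra TEXT")
    # The pointer update is necessarily a separate step. If we crash
    # before reaching it, the guard above makes re-running the DDL
    # event on resume harmless.
    conn.execute("UPDATE resume_state SET pos = ? WHERE k = 'pos'", (pos,))
    conn.commit()

apply_ddl_event(50)
apply_ddl_event(50)  # simulated resume: replaying the event changes nothing
done = column_exists("t", "extra")
pos = conn.execute("SELECT pos FROM resume_state").fetchone()[0]
print(done, pos)  # True 50
```

The guard turns "crashed between ALTER and pointer update" from a corruption case into a harmless replay, which is the best one can do without transactional DDL.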
I'm mostly concerned with our Kubernetes cluster killing ghostferry, or severing it from the network, leaving the pod/container unable to communicate its updated state.