tonicai / condenser
Condenser is a database subsetting tool
Home Page: https://www.tonic.ai
License: MIT License
I noticed this just now while testing this tool against a client's database. They have several tables that each contain 2 columns that are referenced in foreign key relationships. Consider the following tables:
CREATE TABLE IF NOT EXISTS public."user"
(
id integer NOT NULL,
other_key uuid NOT NULL,
CONSTRAINT user_pkey PRIMARY KEY (id),
CONSTRAINT user_unique_key UNIQUE (other_key)
)
CREATE TABLE IF NOT EXISTS public.some_table
(
id integer NOT NULL,
user_id integer,
CONSTRAINT some_entry_pkey PRIMARY KEY (id),
CONSTRAINT "some_entry-user-pk" FOREIGN KEY (user_id)
REFERENCES public."user" (id) MATCH SIMPLE
ON UPDATE NO ACTION
ON DELETE NO ACTION
)
CREATE TABLE IF NOT EXISTS public.another_table
(
id integer NOT NULL,
user_uuid uuid,
CONSTRAINT another_table_pkey PRIMARY KEY (id),
CONSTRAINT another_table_user_uuid_fkey FOREIGN KEY (user_uuid)
REFERENCES public."user" (other_key) MATCH SIMPLE
ON UPDATE NO ACTION
ON DELETE NO ACTION
)
subset_downstream generates the following query:
SELECT "user_id" FROM "public"."some_table" WHERE ("user_id") NOT IN (SELECT "other_key" FROM "public"."user")
Which then throws the following error:
psycopg2.errors.UndefinedFunction: operator does not exist: integer = uuid
LINE 1: ..._id" FROM "public"."some_table" WHERE ("user_id") NOT IN (SE...
^
HINT: No operator matches the given name and argument types. You might need to add explicit type casts.
It looks like the issue stems from the fact that pk_columns is only set once, instead of being set for each object in referencing_tables. This might be a naive solution, but simply adding pk_columns = r['target_columns'] before the query is built results in the data being properly copied to the destination DB.
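To illustrate the proposed fix, here is a hedged sketch of the per-relationship query building described above. The function name and the exact shape of the relationship dicts are assumptions for illustration (condenser's real subset_downstream internals may differ); the point is only that the referenced columns are looked up inside the loop, per relationship, so the fk/target column types stay aligned.

```python
def build_downstream_queries(referencing_tables):
    """Build one NOT IN query per foreign-key relationship.

    Each entry in referencing_tables is assumed to look like:
      {'fk_table': ..., 'fk_columns': [...],
       'target_table': ..., 'target_columns': [...]}
    """
    queries = []
    for r in referencing_tables:
        # Previously pk_columns was computed once before the loop; taking it
        # from each relationship keeps column types aligned (id vs. uuid).
        pk_columns = r['target_columns']
        fk_cols = ', '.join(f'"{c}"' for c in r['fk_columns'])
        pk_cols = ', '.join(f'"{c}"' for c in pk_columns)
        queries.append(
            f'SELECT {fk_cols} FROM {r["fk_table"]} '
            f'WHERE ({fk_cols}) NOT IN (SELECT {pk_cols} FROM {r["target_table"]})'
        )
    return queries
```

With the two referencing tables from the schema above, the second query would now select other_key (a uuid) rather than id, avoiding the integer = uuid error.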
Hi, this is probably more of a feature request, but I've just found this project and it is exactly what I needed to help with the Neotoma Paleoecology Database, to export subsets of the data to our partners on a periodic basis.
Because we are a distributed organization, and because our initial_targets['where'] statement varies by partner, I had started thinking about ways to modify the config.json programmatically, and to bundle this as a (forked) script that would live in our GitHub repos.
From a design perspective I was wondering if there was a specific reason this wasn't written as a more formal Python package. Is this a potential contribution you would welcome?
Hi, is there a way to disable some table columns from config.json during the copy_rows step?
MySQL has a generated columns feature. See docs here.
We can't insert or update values for generated columns, since the value of a generated column is computed from an expression included in the column definition.
We need to ignore those columns during the copy_rows step; they will be filled in by the database automatically.
🤔 The tool could detect those columns automatically and exclude them from the select/insert process, if MySQL exposes metadata identifying generated columns in information_schema.columns.
💡 But manually listing those columns in config.json should be enough for most cases. It would also give more flexibility for future problematic columns.
Note: I've read about a possible workaround in issue #17 and evaluated this comment for my case, but it's not a solution for generated columns.
These columns also don't accept null values. An example error message from MySQL:
... The value specified for generated column 'anniversary_date' in table 'customer' is not allowed.
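On the detection idea above: in MySQL 5.7+, the EXTRA field of information_schema.columns contains 'VIRTUAL GENERATED' or 'STORED GENERATED' for generated columns. The sketch below is illustrative only (condenser has no such hook today); the query string and helper names are my own.

```python
# Columns could be fetched once per table before copy_rows builds its
# SELECT/INSERT column lists. The parametrized query targets MySQL's
# information_schema; EXTRA marks generated columns in MySQL 5.7+.
GENERATED_COLUMNS_QUERY = """
    SELECT column_name, extra
    FROM information_schema.columns
    WHERE table_schema = %s AND table_name = %s
"""

def non_generated_columns(rows):
    """Given (column_name, extra) rows, return only the insertable columns."""
    return [name for name, extra in rows if 'GENERATED' not in extra.upper()]
```

The rows would come from cursor.fetchall() after executing the query against the source connection; generated columns are then simply omitted from both the SELECT and the INSERT column lists.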
Hi,
To write a blog post for my company, I wanted to run Condenser on Ubuntu.
I downloaded Python 3 (updated), python3-mysql.connector, python-psycopg2 and python-toposort, then I copied the repo onto my machine, after installing PostgreSQL 14 and loading a database that would serve as a test dataset.
Of course, I made sure that my environment variables included the variables for the PostgreSQL libraries.
I created my own configuration file, named config.json, here is what it contains:
{
"initial_targets": [
{
"table": "mgd.gxd_htsample_rnaseq",
"percent": 10
}
],
"db_type": "postgres",
"source_db_connection_info": {
"user_name": "postgres",
"host": "localhost",
"db_name": "test",
"port": 5432
},
"destination_db_connection_info": {
"user_name": "postgres",
"host": "localhost",
"db_name": "test2",
"port": 5432
},
"keep_disconnected_tables": false,
"excluded_tables": [ ],
"passthrough_tables": [ ],
"dependency_breaks": [ ],
"fk_augmentation": [ ],
"upstream_filters": [ ]
}
Thinking I had everything configured correctly, I therefore launched the python script direct_subset.py.
And I ended up with the following error:
Traceback (most recent call last):
File "direct_subset.py", line 2, in <module>
import config_reader, result_tabulator
File "/var/lib/postgresql/condenser/config_reader.py", line 8
print('WARNING: Attempted to initialize configuration twice.', file=sys.stderr)
^
SyntaxError: invalid syntax
Trying to understand where this error comes from, I deleted the two files config.json.example and config.json.example_all, but I still get the same error.
Since I've never used Python, I'm unable to debug it myself.
Could someone tell me what's wrong?
Thanks !
Great tool, nice work. Is there any chance that this would ever have basic anonymization/masking implemented, like in https://github.com/ankane/pgsync ?
Attempting to run this tool using the latest commit (sha 9ee9b45 from 20 hours ago), the fast_subset module appears to be missing from the repo. Checking out the v2 tag, which is just before this change was made, allows the tool to function as expected.
Hi there 👋
Thanks for making this tool open source, it seems to work really well!
However: say I have an entity table, with a createdBy column referencing account. I would like to grab a subset of entity, so I put the following in my config:
"initial_targets": [
{
"table": "public.entity",
"percent": 5
}
],
However, I don't want to dump accounts, since they contain PII, so I add:
"excluded_tables": [ "public.account" ]
Now the inserts into entity fail, because entity.createdBy references an account which does not exist in my dump.
I am fine with createdBy being set to null on all entities, but I'm not sure how, or whether, that's possible with this tool.
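One possible shape for this, sketched from the dependency_breaks format that appears elsewhere in these issues: break the FK so the subsetter stops following it into account, exclude the table, and null the column afterwards. Note that the "post_subset_sql" key name is an assumption on my part (the console output mentions "post subset SQL calls", but I have not verified the exact config key), so treat this fragment as illustrative only.

```json
{
  "dependency_breaks": [
    {"fk_table": "public.entity", "target_table": "public.account"}
  ],
  "excluded_tables": ["public.account"],
  "post_subset_sql": [
    "UPDATE public.entity SET \"createdBy\" = NULL"
  ]
}
```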
Hey Tonic devs,
I've been using your awesome tool for a couple of months now, and it's been so great to use! Something I noticed though is that after a subset has been generated, I need to run some SQL to reset DB sequences to their max value in the resulting table before I can generate a backup via pg_dump. Otherwise, the sequences are all reset to 1. Is this intentional on your part for some reason?
Hi,
My source DB has a table called employer; here is the schema:
CREATE TABLE "employer" (
"id" BIGSERIAL PRIMARY KEY,
"organizationId" BIGINT REFERENCES "organization" ("id") ON DELETE CASCADE NOT NULL,
"networkId" BIGINT REFERENCES "network" ("id") ON DELETE SET NULL UNIQUE,
"name" TEXT NOT NULL,
"externalId" TEXT,
"address" JSONB DEFAULT('{}') NOT NULL,
"contact" JSONB DEFAULT('{}') NOT NULL,
"canonicalEmployerId" BIGINT REFERENCES "employer" ("id") ON DELETE SET NULL,
"created" TIMESTAMP WITH TIME ZONE DEFAULT(NOW()) NOT NULL,
"modified" TIMESTAMP WITH TIME ZONE,
UNIQUE ("organizationId", "externalId")
);
CREATE INDEX "employer_networkId_idx" ON "employer" ("networkId");
CREATE INDEX "employer_name_idx" ON "employer" ("name");
CREATE INDEX "employer_sortByVerifiedThenName_idx" ON "employer" ((COALESCE(("employer"."networkId" < 0)::INT, 1) || '_' || "employer"."name"));
CREATE INDEX "employer_name_trgm_idx" ON "employer" USING gin("name" gin_trgm_ops);
CREATE INDEX "employer_address_idx" ON "employer" USING gin("address");
COMMENT ON COLUMN "employer"."contact" IS 'Employer work email, phone number, or any contact-related data';
COMMENT ON COLUMN "employer"."externalId" IS 'ID used for external application to identify a BLN employer record';
COMMENT ON COLUMN "employer"."canonicalEmployerId" IS 'Canonical employer ID of the employer';
COMMENT ON CONSTRAINT "employer_organizationId_externalId_key" ON "employer" IS 'Combination of "organizaionId" and "externalId" has to be unique';
I'm getting an error like below
File "main.py", line 30, in func_base
database.validate_database_create()
File "/xxx/xxx/xxx/condenser/database_creator.py", line 82, in validate_database_create
raise Exception(f'Creating tables failed. See {self.create_error_path} for details')
Exception: Creating tables failed. See /xxx/xxx/xxx/condenser/SQL/create_error.txt for details
Then I checked SQL/create_error.txt and it has:
constraint "employer_organizationId_externalId_key" for table "employer" does not exist
I have checked the generated dump_create.sql file and I cannot see UNIQUE ("organizationId", "externalId"). Here is the snippet from dump_create.sql:
CREATE TABLE public.employer (
id bigint NOT NULL,
"externalId" text,
address jsonb DEFAULT '{}'::jsonb NOT NULL,
created timestamp with time zone DEFAULT now() NOT NULL,
modified timestamp with time zone,
"networkId" bigint,
contact jsonb DEFAULT '{}'::jsonb NOT NULL,
"organizationId" bigint NOT NULL,
name text NOT NULL,
"canonicalEmployerId" bigint
);
COMMENT ON COLUMN public.employer."externalId" IS 'ID used for external application to identify a BLN employer record';
COMMENT ON COLUMN public.employer.contact IS 'Employer work email, phone number, or any contact-related data';
COMMENT ON COLUMN public.employer."canonicalEmployerId" IS 'Canonical employer ID of the employer';
CREATE SEQUENCE public.employer_id_seq
START WITH 1
INCREMENT BY 1
NO MINVALUE
NO MAXVALUE
CACHE 1;
ALTER SEQUENCE public.employer_id_seq OWNED BY public.employer.id;
I tried to fix it by changing config.json, but never with success. What kind of setting am I missing?
Thanks,
Gayan
Seeing
ValueError: Circular dependency, public.orders depends on itself!
in cases where tables (e.g. orders) have rows with parent references of the same type.
Just trying this out for the first time. I have a table that has multiple foreign keys, on id and uuid, to the same target table.
The first id-related queries seem to work, and then it fails while trying to select a subset based on the wrong key type. See the second, failed query.
Is there a way to break this foreign key? I am thinking about just dropping the uuid foreign key before running the tool. Messing with dependency_breaks and fk_augmentation hasn't seemed to work at all.
Thanks for the help.
SELECT "market_id" FROM "calc"."settings" WHERE ("market_id") NOT IN (SELECT "id" FROM "public"."market")
Query completed in 0.0010530948638916016s
Beginning query @ 2021-01-14 21:14:09.559303:
SELECT ty.typname
FROM pg_attribute att
JOIN pg_class cl ON cl.oid = att.attrelid
JOIN pg_type ty ON ty.oid = att.atttypid
JOIN pg_namespace ns ON ns.oid = cl.relnamespace
WHERE cl.relname = 'tonic_subset_59dd3165-8a06-4460-9b84-554f34af116a' AND att.attnum > 0 AND
NOT att.attisdropped
ORDER BY att.attnum;
Query completed in 0.002404928207397461s
Beginning query @ 2021-01-14 21:14:09.561779:
SELECT "market_uuid" FROM "calc"."settings" WHERE ("market_uuid") NOT IN (SELECT "id" FROM "public"."market")
Traceback (most recent call last):
File "direct_subset.py", line 43, in <module>
subsetter.run_middle_out()
File "/Users/dustinsmith/Development/work/condenser/condenser/subset.py", line 84, in run_middle_out
self.subset_downstream(t, relationships)
File "/Users/dustinsmith/Development/work/condenser/condenser/subset.py", line 177, in subset_downstream
self.__db_helper.copy_rows(self.__destination_conn, self.__destination_conn, q, temp_table)
File "/Users/dustinsmith/Development/work/condenser/condenser/psql_database_helper.py", line 36, in copy_rows
cursor.execute(query)
File "/Users/dustinsmith/Development/work/condenser/condenser/db_connect.py", line 58, in execute
retval = self.inner_cursor.execute(query)
psycopg2.errors.UndefinedFunction: operator does not exist: uuid = bigint
LINE 1: ...calc"."settings" WHERE ("market_uuid") NOT IN (SE...
^
HINT: No operator matches the given name and argument types. You might need to add explicit type casts.
Editing the source DB for a SQL Server connection is supposed to update the destination DB, but currently isn't.
Maybe I'm just misunderstanding what condenser is able to do :)
I have this schema - of course much more complex in reality, but this shows the problem I'm facing:
When dumping meals, I expect to get all the food items in the meal, and their i18ns. MealI18n is dumped correctly, but FoodI18n is not.
Create tables
CREATE TABLE public.food (
id uuid NOT NULL PRIMARY KEY
);
CREATE TABLE public.food_i18n (
id uuid NOT NULL PRIMARY KEY REFERENCES food(id)
);
CREATE TABLE public.meal (
id uuid NOT NULL PRIMARY KEY
);
CREATE TABLE public.meal_i18n (
id uuid NOT NULL PRIMARY KEY REFERENCES meal(id)
);
CREATE TABLE public.meal_food (
meal_id uuid NOT NULL REFERENCES meal (id),
food_id uuid NOT NULL REFERENCES food (id),
PRIMARY KEY (meal_id, food_id)
);
INSERT INTO public.food (id) VALUES ('c00d544c-2b15-11eb-adc1-0242ac120002'), ('2aa03896-780b-42de-a297-09e897a55c09');
INSERT INTO public.food_i18n (id) VALUES ('c00d544c-2b15-11eb-adc1-0242ac120002'), ('2aa03896-780b-42de-a297-09e897a55c09');
INSERT INTO public.meal (id) VALUES ('eb0831ee-16c2-42d3-a1d0-b51b623b1e8e'), ('b0aa8b6d-a5cc-421c-9eac-88d88d9dc8dc');
INSERT INTO public.meal_i18n (id) VALUES ('eb0831ee-16c2-42d3-a1d0-b51b623b1e8e'), ('b0aa8b6d-a5cc-421c-9eac-88d88d9dc8dc');
INSERT INTO public.meal_food (meal_id, food_id) VALUES ('eb0831ee-16c2-42d3-a1d0-b51b623b1e8e','c00d544c-2b15-11eb-adc1-0242ac120002');
Config
{
"initial_targets": [
{
"table": "public.meal",
"percent": 200
}
],
"db_type": "postgres",
"keep_disconnected_tables": false,
"excluded_tables": [],
"passthrough_tables": [],
"dependency_breaks": [],
"upstream_filters": [],
"fk_augmentation": [ ]
}
Console output
Beginning subsetting with these direct targets: ['public.meal']
Processing 1 of 1: {'table': 'public.meal', 'percent': 200}
Direct target tables completed in 0.019591331481933594s
Beginning greedy upstream subsetting with these tables: ['public.food_i18n', 'public.meal_food', 'public.meal_i18n']
Processing 1 of 3: public.food_i18n
Processing 2 of 3: public.meal_food
Processing 3 of 3: public.meal_i18n
Greedy subsettings completed in 0.026256084442138672s
Beginning pass-through tables: []
Pass-through completed in 1.9073486328125e-06s
Beginning downstream subsetting with these tables: ['public.meal_i18n', 'public.meal_food', 'public.food_i18n', 'public.food', 'public.meal']
Processing 1 of 5: public.meal_i18n
Processing 2 of 5: public.meal_food
Processing 3 of 5: public.food_i18n
Processing 4 of 5: public.food
Processing 5 of 5: public.meal
Downstream subsetting completed in 0.020848989486694336s
Beginning post subset SQL calls
Completed post subset SQL calls in 5.054473876953125e-05s
public.meal, 0, 2, 0
public.meal_food, 0, 1, 0
public.food, 0, 1, 0
public.food_i18n, 0, 0, 0
public.meal_i18n, 0, 2, 0
I thought maybe the problem was that a primary key references another primary key, so I added food_id to FoodI18n, but that still has the same problem.
Hi!
You've got an undocumented option perserve_fk_opportunistically, which is quite important, and you've got a typo in the code: it should be preserve_fk_opportunistically instead of perserve_fk_opportunistically.
Please update the documentation to explain the impact and usage of that option.
From example-config.json
"excluded_tables": [
{"schema": "public","tables": [ "table_of_keys"] }
],
This does not work, however. The code expects:
"excluded_tables": ["schema.table1", "schema.table2", ...]
The example-config.json should be updated to be correct. I suspect the example passthrough_tables option is similarly out of date.
Hi,
I have a database with multiple foreign key constraints between tables, and a table that has self-dependencies. I was able to fix the self-dependency issue by listing it as an excluded table. So when I run the condenser, the pg_dump works, but once it goes to insert records into a table at the destination, inserts fail for records with foreign keys to the ignored table. I also tried listing these relationships in dependency_breaks, but the problem still happens.
Does this tool adjust the foreign key constraints before inserting? Or does it assume foreign key constraint warnings are turned off prior to running for tables that are supposed to be ignored?
Does the order of two tables in dependency breaks tuple matter? I.e. {"fk_table": "table_1", "target_table": "table_2"} versus {"fk_table": "table_2", "target_table": "table_1"}? I tried both and also listed them together and neither of these solved the issue when writing/inserting records to related tables at the destination.
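On the question of whether FK enforcement needs to be relaxed beforehand: condenser's docs don't describe such a step, but as a hedged workaround sketch (not a condenser feature), the standard statements for each engine could be run on the destination connection before the inserts. Both forms below are standard in their respective databases; the wrapper function is my own.

```python
def fk_disable_statements(db_type):
    """Statements that relax FK enforcement for the current session."""
    if db_type == 'postgres':
        # Per-session; skips FK triggers. Requires sufficient privileges.
        return ["SET session_replication_role = 'replica'"]
    if db_type == 'mysql':
        return ['SET FOREIGN_KEY_CHECKS = 0']
    raise ValueError(f'unsupported db_type: {db_type}')
```

The corresponding re-enable ('origin' / FOREIGN_KEY_CHECKS = 1) would run after the load; note this leaves any dangling references in place rather than fixing them.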
Also, this repo is not public. How can I push updates I made to include additional authentication parameters that I am sure would be useful to others?
Attempting to run direct_subset on two local MySQL databases, with a user that has all privileges on both source and target DBs, I get this:
File "/home/asaf/.local/lib/python3.9/site-packages/mysql/connector/connection_cext.py", line 517, in cmd_query
self._cmysql.query(query,
_mysql_connector.MySQLInterfaceError: Access denied for user 'bybe'@'localhost' to database 'tonic_subset_temp_db_398dhjr23'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/asaf/apps/condenser-master/direct_subset.py", line 42, in <module>
subsetter.prep_temp_dbs()
File "/home/asaf/apps/condenser-master/subset.py", line 98, in prep_temp_dbs
self.__db_helper.prep_temp_dbs(self.__source_conn, self.__destination_conn)
File "/home/asaf/apps/condenser-master/mysql_database_helper.py", line 10, in prep_temp_dbs
run_query('DROP DATABASE IF EXISTS ' + temp_db, source_conn)
File "/home/asaf/apps/condenser-master/mysql_database_helper.py", line 144, in run_query
cur.execute(query)
File "/home/asaf/apps/condenser-master/db_connect.py", line 58, in execute
retval = self.inner_cursor.execute(query)
File "/home/asaf/.local/lib/python3.9/site-packages/mysql/connector/cursor_cext.py", line 270, in execute
result = self._cnx.cmd_query(stmt, raw=self._raw,
File "/home/asaf/.local/lib/python3.9/site-packages/mysql/connector/connection_cext.py", line 522, in cmd_query
raise errors.get_mysql_exception(exc.errno, msg=exc.msg,
mysql.connector.errors.ProgrammingError: 1044 (42000): Access denied for user 'bybe'@'localhost' to database 'tonic_subset_temp_db_398dhjr23'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/asaf/.local/lib/python3.9/site-packages/mysql/connector/connection_cext.py", line 517, in cmd_query
self._cmysql.query(query,
_mysql_connector.MySQLInterfaceError: Access denied for user 'bybe'@'localhost' to database 'tonic_subset_temp_db_398dhjr23'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/asaf/apps/condenser-master/direct_subset.py", line 57, in <module>
subsetter.unprep_temp_dbs()
File "/home/asaf/apps/condenser-master/subset.py", line 101, in unprep_temp_dbs
self.__db_helper.unprep_temp_dbs(self.__source_conn, self.__destination_conn)
File "/home/asaf/apps/condenser-master/mysql_database_helper.py", line 16, in unprep_temp_dbs
run_query('DROP DATABASE IF EXISTS ' + temp_db, source_conn)
File "/home/asaf/apps/condenser-master/mysql_database_helper.py", line 144, in run_query
cur.execute(query)
File "/home/asaf/apps/condenser-master/db_connect.py", line 58, in execute
retval = self.inner_cursor.execute(query)
File "/home/asaf/.local/lib/python3.9/site-packages/mysql/connector/cursor_cext.py", line 270, in execute
result = self._cnx.cmd_query(stmt, raw=self._raw,
File "/home/asaf/.local/lib/python3.9/site-packages/mysql/connector/connection_cext.py", line 522, in cmd_query
raise errors.get_mysql_exception(exc.errno, msg=exc.msg,
mysql.connector.errors.ProgrammingError: 1044 (42000): Access denied for user 'bybe'@'localhost' to database 'tonic_subset_temp_db_398dhjr23'
What might be the reason?
A standard pattern is for an identity column to be defined like this:
CREATE TABLE IF NOT EXISTS entity
(
id INTEGER NOT NULL GENERATED ALWAYS AS IDENTITY,
-- other columns omitted
)
However, condenser does not appear to handle such columns gracefully. It should be possible to force-insert a value into such a column using OVERRIDING SYSTEM VALUE, as described here. This is indicated in the error message as well.
Traceback (most recent call last):
File "direct_subset.py", line 43, in <module>
subsetter.run_middle_out()
File "C:\Source\GitHub\TonicAI\condenser\subset.py", line 54, in run_middle_out
self.__subset_direct(target, relationships)
File "C:\Source\GitHub\TonicAI\condenser\subset.py", line 117, in __subset_direct
self.__db_helper.copy_rows(self.__source_conn, self.__destination_conn, q, mysql_db_name_hack(t, self.__destination_conn))
File "C:\Source\GitHub\TonicAI\condenser\psql_database_helper.py", line 50, in copy_rows
execute_values(destination_cursor, insert_query, rows, template)
File "C:\Python36\lib\site-packages\psycopg2\extras.py", line 1299, in execute_values
cur.execute(b''.join(parts))
psycopg2.errors.GeneratedAlways: cannot insert into column "firm_id"
DETAIL: Column "id" is an identity column defined as GENERATED ALWAYS.
HINT: Use OVERRIDING SYSTEM VALUE to override.
This issue touches on GENERATED columns, but is not quite the same, as it only asks for the ability to exclude such columns from the subsetting operation. Identity columns are non-null and are key to the foreign key relationships that the tool is meant to preserve, so excluding them would not solve the problem.
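The clause the hint asks for is standard PostgreSQL (10+) syntax. A minimal sketch of how the insert statement could carry it follows; the function name and signature are illustrative, not condenser's actual copy_rows code.

```python
def build_insert(table, columns, override_identity=False):
    """Build an INSERT suitable for psycopg2's execute_values (%s carries
    the VALUES list). With override_identity=True, OVERRIDING SYSTEM VALUE
    lets explicit values land in GENERATED ALWAYS identity columns."""
    cols = ', '.join(f'"{c}"' for c in columns)
    override = ' OVERRIDING SYSTEM VALUE' if override_identity else ''
    return f'INSERT INTO {table} ({cols}){override} VALUES %s'
```

A destination-side check of pg_attribute.attidentity = 'a' could decide per table whether the clause is needed, so plain tables keep the ordinary INSERT.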
I tried out python iterative_subset.py, since python direct_subset.py wasn't working with my setup, and I'm getting the following error:
Traceback (most recent call last):
File "iterative_subset.py", line 11, in <module>
all_tables = list_all_tables(source_dbc.get_db_connection())
NameError: name 'source_dbc' is not defined
I'm wondering if it needs this (and surrounding) setup code which exists in the direct_subset script?
Line 12 in 30e12bd
First, thank you for this project!
Perhaps I am missing something, but it seems redundant to require <db_name>.<table_name> in the config schema. We have to indicate the db_name in the source_db_connection_info clause, so it seems that db_name should be assumed from that point on. Or is there a way to copy from multiple source databases somehow?
Greetings,
I was reading the following article on subsetting:
https://www.tonic.ai/blog/condenser-a-database-subsetting-tool
I don't exactly understand what the drawbacks are of dropping a cycle from a database. Of course one loses data when doing so, but is the same amount of data lost irrespective of where you cut the cycle? How could one measure that? What are some of the criteria that affect it?
Hi! I'm a bit confused about what the config property desired_result is supposed to represent and how I should use it. I see that example.config.json shows:
"desired_result": {
"table": "target_table",
"schema": "public",
"percent": 1
},
But any more clarification in the readme would be great, to explain which table is meant to be specified, etc. I might guess that percent is the degree to which we want to subset the db?
Thanks!
Search tags: config.json configuration desired end result schema
Hi folks!
This looks like a wonderful tool! I'm currently using Jailer to create DB subsets. That works fine, but you have to create the configuration using a GUI and it's overall too overcomplicated for my needs. Condenser looks like it hits the sweet spot between utility and simplicity.
One thing that is missing for my particular use case: I want to extract the subset of the DB and in a separate step load the data into a new DB (multiple DBs actually, such as development, staging etc.) Condenser is built in such a way that it immediately ingests the subset into the target DB.
How hard would it be to have an option to create a SQL file as the destination?
If you take PRs and can point me to the right place in the code where changes would need to be made, I can maybe help out.
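Until such an option exists, one hedged workaround is to subset into a scratch destination database and then dump it to a SQL file (the approach other issues here use with pg_dump). The helper below just assembles the command for subprocess.run; connection values are placeholders.

```python
def pg_dump_command(db_name, out_file, host='localhost', user='postgres'):
    """Assemble a plain-format pg_dump invocation writing to out_file.

    --no-owner keeps the dump loadable under a different role; all flags
    used here are standard pg_dump options.
    """
    return ['pg_dump', '-h', host, '-U', user,
            '--no-owner', '-f', out_file, db_name]
```

Example: subprocess.run(pg_dump_command('subset_db', 'subset.sql'), check=True) after direct_subset.py finishes; the resulting file can then be loaded into development, staging, etc.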
I've created a public gist with my config file and a link to a dump of the database I'm applying condenser against. The issue I'm running into is that condenser appears to hang (over 24hrs with no new text to screen on verbose mode), but I'm not sure how to debug the issue, or know whether or not anything is actually happening.
I'm running condenser as part of a broader workflow through a bash script:
#!/bin/bash
#
# A bash script that uses `condenser` to export a database subset to a database
# to a `localhost` database, and then dump the file and compress it into a tar
# file.
#
# Simon Goring - May 12, 2021
#
# First we check to see if the condenser files actually exist.
if [[ ! -f db_connect.py ]]
then
echo "Condenser does not exist in the current directory."
pip install toposort
pip install psycopg2-binary
pip install mysql-connector-python
git clone --depth=1 git@github.com:TonicAI/condenser.git .
rm -rf !$/.git
fi
# Clone the repo
#
# Remove the .git directory
#rm -rf !$/.git
export PGPASSWORD='DATABASE PASSWORD'
psql -h localhost -U postgres -c "CREATE DATABASE export;"
echo "SELECT 'DROP SCHEMA '||nspname||' CASCADE; CREATE SCHEMA '||nspname||';' FROM pg_catalog.pg_namespace WHERE NOT nspname ~ '.*_.*'" | \
psql -h localhost -d export -U postgres -t | \
psql -h localhost -d export -U postgres
python3 direct_subset.py -v
echo "SELECT 'DROP SCHEMA '||nspname||' CASCADE;' FROM pg_catalog.pg_namespace WHERE nspname =ANY('{"ap","da","doi","ecg","emb","gen","ti","ts","tmp"}')" | \
psql -h localhost -d export -U postgres -t | \
psql -h localhost -d export -U postgres
now=`date +"%Y-%m-%d"`
mkdir -p dumps
mkdir -p archives
pg_dump -Fc -O -h localhost -U postgres -v -d export > ./dumps/$1_dump_${now}.sql
tar -cvf ./archives/$1_dump_${now}.tar -C ./dumps $1_dump_${now}.sql
# -----------------------------------
# | Clean up files and databases |
# -----------------------------------
psql -h localhost -U postgres -c "DROP DATABASE export;"
rm ./dumps/$1_dump_${now}.sql
rmdir ./dumps
That's more an FYI about how we're trying to use it, though. The key element is that we're just calling condenser with python3 direct_subset.py -v, and the config file is linked above in the gist.
The goal of this issue is to note that there seems to be a point at which condenser is hanging, and to figure out a way to debug it so I can fix it.