tonicai / condenser
Condenser is a database subsetting tool
Home Page: https://www.tonic.ai
License: MIT License
I noticed this just now while testing this tool against a client's database. They have several tables that each contain 2 columns that are referenced in foreign key relationships. Consider the following tables:
CREATE TABLE IF NOT EXISTS public."user"
(
id integer NOT NULL,
other_key uuid NOT NULL,
CONSTRAINT user_pkey PRIMARY KEY (id),
CONSTRAINT user_unique_key UNIQUE (other_key)
)
CREATE TABLE IF NOT EXISTS public.some_table
(
id integer NOT NULL,
user_id integer,
CONSTRAINT some_entry_pkey PRIMARY KEY (id),
CONSTRAINT "some_entry-user-pk" FOREIGN KEY (user_id)
REFERENCES public."user" (id) MATCH SIMPLE
ON UPDATE NO ACTION
ON DELETE NO ACTION
)
CREATE TABLE IF NOT EXISTS public.another_table
(
id integer NOT NULL,
user_uuid uuid,
CONSTRAINT another_table_pkey PRIMARY KEY (id),
CONSTRAINT another_table_user_uuid_fkey FOREIGN KEY (user_uuid)
REFERENCES public."user" (other_key) MATCH SIMPLE
ON UPDATE NO ACTION
ON DELETE NO ACTION
)
subset_downstream generates the following query:
SELECT "user_id" FROM "public"."some_table" WHERE ("user_id") NOT IN (SELECT "other_key" FROM "public"."user")
Which then throws the following error:
psycopg2.errors.UndefinedFunction: operator does not exist: integer = uuid
LINE 1: ..._id" FROM "public"."some_table" WHERE ("user_id") NOT IN (SE...
^
HINT: No operator matches the given name and argument types. You might need to add explicit type casts.
It looks like the issue stems from the fact that pk_columns is only set once, instead of being set for each object in referencing_tables. This might be a naive solution, but simply adding pk_columns = r['target_columns'] before the query is built results in the data being properly copied to the destination DB.
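To illustrate the proposed fix, here is a hedged sketch of the per-relationship query building described above. The function name and the exact shape of the relationship dicts are assumptions for illustration (condenser's real subset_downstream internals may differ); the point is only that the referenced columns are looked up inside the loop, per relationship, so the fk/target column types stay aligned.

```python
def build_downstream_queries(referencing_tables):
    """Build one NOT IN query per foreign-key relationship.

    Each entry in referencing_tables is assumed to look like:
      {'fk_table': ..., 'fk_columns': [...],
       'target_table': ..., 'target_columns': [...]}
    """
    queries = []
    for r in referencing_tables:
        # Previously pk_columns was computed once before the loop; taking it
        # from each relationship keeps column types aligned (id vs. uuid).
        pk_columns = r['target_columns']
        fk_cols = ', '.join(f'"{c}"' for c in r['fk_columns'])
        pk_cols = ', '.join(f'"{c}"' for c in pk_columns)
        queries.append(
            f'SELECT {fk_cols} FROM {r["fk_table"]} '
            f'WHERE ({fk_cols}) NOT IN (SELECT {pk_cols} FROM {r["target_table"]})'
        )
    return queries
```

With the two referencing tables from the schema above, the second query would now select other_key (a uuid) rather than id, avoiding the integer = uuid error.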
Hi, this is probably more of a feature request, but I've just found this project and it is exactly what I needed to help with the Neotoma Paleoecology Database, to export subsets of the data to our partners on a periodic basis.
Because we are a distributed organization, and because our initial_targets['where'] statement varies by partner, I had started thinking about ways to modify the config.json programmatically, and to bundle this as a (forked) script that would live in our GitHub repos.
From a design perspective I was wondering if there was a specific reason this wasn't written as a more formal Python package. Is this a potential contribution you would welcome?
Hi, is there a way to disable some table columns from config.json during the copy_rows step?
MySQL has a generated columns feature. See docs here.
We can't insert or update values for generated columns, since the value of a generated column is computed from an expression included in the column definition.
We need to ignore those columns during the copy_rows step; they will be filled in by the database automatically.
🤔 The tool could detect those columns automatically and exclude them from the select/insert process, if MySQL exposes metadata identifying generated columns in information_schema.columns.
💡 But manually listing those columns in config.json should be enough for most cases. It would also give more flexibility for future problematic columns.
Note: I've read about a possible workaround in issue #17 and evaluated this comment for my case, but it's not a solution for generated columns.
These columns also don't accept null values. An example error message from MySQL:
... The value specified for generated column 'anniversary_date' in table 'customer' is not allowed.
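On the detection idea above: in MySQL 5.7+, the EXTRA field of information_schema.columns contains 'VIRTUAL GENERATED' or 'STORED GENERATED' for generated columns. The sketch below is illustrative only (condenser has no such hook today); the query string and helper names are my own.

```python
# Columns could be fetched once per table before copy_rows builds its
# SELECT/INSERT column lists. The parametrized query targets MySQL's
# information_schema; EXTRA marks generated columns in MySQL 5.7+.
GENERATED_COLUMNS_QUERY = """
    SELECT column_name, extra
    FROM information_schema.columns
    WHERE table_schema = %s AND table_name = %s
"""

def non_generated_columns(rows):
    """Given (column_name, extra) rows, return only the insertable columns."""
    return [name for name, extra in rows if 'GENERATED' not in extra.upper()]
```

The rows would come from cursor.fetchall() after executing the query against the source connection; generated columns are then simply omitted from both the SELECT and the INSERT column lists.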
Hi,
To write a blog post for my company, I wanted to run Condenser on Ubuntu.
I downloaded Python 3 (updated), python3-mysql.connector, python-psycopg2 and python-toposort, then I copied the repo onto my machine, after installing PostgreSQL 14 and loading a database that would serve as a test dataset.
Of course, I made sure that my environment variables included the variables for the PostgreSQL libraries.
I created my own configuration file, named config.json, here is what it contains:
{
"initial_targets": [
{
"table": "mgd.gxd_htsample_rnaseq",
"percent": 10
}
],
"db_type": "postgres",
"source_db_connection_info": {
"user_name": "postgres",
"host": "localhost",
"db_name": "test",
"port": 5432
},
"destination_db_connection_info": {
"user_name": "postgres",
"host": "localhost",
"db_name": "test2",
"port": 5432
},
"keep_disconnected_tables": false,
"excluded_tables": [ ],
"passthrough_tables": [ ],
"dependency_breaks": [ ],
"fk_augmentation": [ ],
"upstream_filters": [ ]
}
Thinking I had everything configured correctly, I therefore launched the python script direct_subset.py.
And I ended up with the following error:
Traceback (most recent call last):
File "direct_subset.py", line 2, in <module>
import config_reader, result_tabulator
File "/var/lib/postgresql/condenser/config_reader.py", line 8
print('WARNING: Attempted to initialize configuration twice.', file=sys.stderr)
^
SyntaxError: invalid syntax
Trying to understand where this error comes from, I deleted the two files config.json.example and config.json.example_all, but I still get the same error.
Since I've never used Python, I'm unable to debug it myself.
Could someone tell me what's wrong?
Thanks !
Great tool, nice work. Is there any chance that this would ever have basic anonymization/masking implemented, like in https://github.com/ankane/pgsync ?
Attempting to run this tool using the latest commit (sha 9ee9b45 from 20 hours ago), the fast_subset module appears to be missing from the repo. Checking out the v2 tag, which is just before this change was made, allows the tool to function as expected.
Hi there 👋
Thanks for making this tool open source, it seems to work really well!
However: say I have an entity table, with a createdBy column referencing account. I would like to grab a subset of entity, so I put the following in my config:
"initial_targets": [
{
"table": "public.entity",
"percent": 5
}
],
However, I don't want to dump accounts, since they contain PII, so I add:
"excluded_tables": [ "public.account" ]
Now the inserts into entity fail, because entity.createdBy references an account which does not exist in my dump.
I am fine with createdBy being set to null on all entities, but I'm not sure how, or whether, that's possible with this tool.
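One possible shape for this, sketched from the dependency_breaks format that appears elsewhere in these issues: break the FK so the subsetter stops following it into account, exclude the table, and null the column afterwards. Note that the "post_subset_sql" key name is an assumption on my part (the console output mentions "post subset SQL calls", but I have not verified the exact config key), so treat this fragment as illustrative only.

```json
{
  "dependency_breaks": [
    {"fk_table": "public.entity", "target_table": "public.account"}
  ],
  "excluded_tables": ["public.account"],
  "post_subset_sql": [
    "UPDATE public.entity SET \"createdBy\" = NULL"
  ]
}
```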
Hey Tonic devs,
I've been using your awesome tool for a couple of months now, and it's been so great to use! Something I noticed though is that after a subset has been generated, I need to run some SQL to reset DB sequences to their max value in the resulting table before I can generate a backup via pg_dump. Otherwise, the sequences are all reset to 1. Is this intentional on your part for some reason?
Hi,
My source DB has a table called employer; here is the schema:
CREATE TABLE "employer" (
"id" BIGSERIAL PRIMARY KEY,
"organizationId" BIGINT REFERENCES "organization" ("id") ON DELETE CASCADE NOT NULL,
"networkId" BIGINT REFERENCES "network" ("id") ON DELETE SET NULL UNIQUE,
"name" TEXT NOT NULL,
"externalId" TEXT,
"address" JSONB DEFAULT('{}') NOT NULL,
"contact" JSONB DEFAULT('{}') NOT NULL,
"canonicalEmployerId" BIGINT REFERENCES "employer" ("id") ON DELETE SET NULL,
"created" TIMESTAMP WITH TIME ZONE DEFAULT(NOW()) NOT NULL,
"modified" TIMESTAMP WITH TIME ZONE,
UNIQUE ("organizationId", "externalId")
);
CREATE INDEX "employer_networkId_idx" ON "employer" ("networkId");
CREATE INDEX "employer_name_idx" ON "employer" ("name");
CREATE INDEX "employer_sortByVerifiedThenName_idx" ON "employer" ((COALESCE(("employer"."networkId" < 0)::INT, 1) || '_' || "employer"."name"));
CREATE INDEX "employer_name_trgm_idx" ON "employer" USING gin("name" gin_trgm_ops);
CREATE INDEX "employer_address_idx" ON "employer" USING gin("address");
COMMENT ON COLUMN "employer"."contact" IS 'Employer work email, phone number, or any contact-related data';
COMMENT ON COLUMN "employer"."externalId" IS 'ID used for external application to identify a BLN employer record';
COMMENT ON COLUMN "employer"."canonicalEmployerId" IS 'Canonical employer ID of the employer';
COMMENT ON CONSTRAINT "employer_organizationId_externalId_key" ON "employer" IS 'Combination of "organizaionId" and "externalId" has to be unique';
I'm getting an error like below
File "main.py", line 30, in func_base
database.validate_database_create()
File "/xxx/xxx/xxx/condenser/database_creator.py", line 82, in validate_database_create
raise Exception(f'Creating tables failed. See {self.create_error_path} for details')
Exception: Creating tables failed. See /xxx/xxx/xxx/condenser/SQL/create_error.txt for details
Then I checked SQL/create_error.txt and it has:
constraint "employer_organizationId_externalId_key" for table "employer" does not exist
I have checked the generated dump_create.sql file and I cannot see UNIQUE ("organizationId", "externalId"). Here is the snippet from dump_create.sql:
CREATE TABLE public.employer (
id bigint NOT NULL,
"externalId" text,
address jsonb DEFAULT '{}'::jsonb NOT NULL,
created timestamp with time zone DEFAULT now() NOT NULL,
modified timestamp with time zone,
"networkId" bigint,
contact jsonb DEFAULT '{}'::jsonb NOT NULL,
"organizationId" bigint NOT NULL,
name text NOT NULL,
"canonicalEmployerId" bigint
);
COMMENT ON COLUMN public.employer."externalId" IS 'ID used for external application to identify a BLN employer record';
COMMENT ON COLUMN public.employer.contact IS 'Employer work email, phone number, or any contact-related data';
COMMENT ON COLUMN public.employer."canonicalEmployerId" IS 'Canonical employer ID of the employer';
CREATE SEQUENCE public.employer_id_seq
START WITH 1
INCREMENT BY 1
NO MINVALUE
NO MAXVALUE
CACHE 1;
ALTER SEQUENCE public.employer_id_seq OWNED BY public.employer.id;
I tried to fix it by changing config.json, but never with success. What kind of setting am I missing?
Thanks,
Gayan
Seeing
ValueError: Circular dependency, public.orders depends on itself!
in cases where tables (e.g. orders) have rows with parent references of the same type.
Just trying this out for the first time. I have a table that has multiple foreign keys, on id and uuid, to the same target table.
The first id-related queries seem to work, and then it fails while trying to select a subset based on the wrong key type. See the second, failed query.
Is there a way to break this foreign key? I am thinking about just dropping the uuid foreign key before running the tool. Messing with dependency_breaks and fk_augmentation hasn't seemed to work at all.
Thanks for the help.
SELECT "market_id" FROM "calc"."settings" WHERE ("market_id") NOT IN (SELECT "id" FROM "public"."market")
Query completed in 0.0010530948638916016s
Beginning query @ 2021-01-14 21:14:09.559303:
SELECT ty.typname
FROM pg_attribute att
JOIN pg_class cl ON cl.oid = att.attrelid
JOIN pg_type ty ON ty.oid = att.atttypid
JOIN pg_namespace ns ON ns.oid = cl.relnamespace
WHERE cl.relname = 'tonic_subset_59dd3165-8a06-4460-9b84-554f34af116a' AND att.attnum > 0 AND
NOT att.attisdropped
ORDER BY att.attnum;
Query completed in 0.002404928207397461s
Beginning query @ 2021-01-14 21:14:09.561779:
SELECT "market_uuid" FROM "calc"."settings" WHERE ("market_uuid") NOT IN (SELECT "id" FROM "public"."market")
Traceback (most recent call last):
File "direct_subset.py", line 43, in <module>
subsetter.run_middle_out()
File "/Users/dustinsmith/Development/work/condenser/condenser/subset.py", line 84, in run_middle_out
self.subset_downstream(t, relationships)
File "/Users/dustinsmith/Development/work/condenser/condenser/subset.py", line 177, in subset_downstream
self.__db_helper.copy_rows(self.__destination_conn, self.__destination_conn, q, temp_table)
File "/Users/dustinsmith/Development/work/condenser/condenser/psql_database_helper.py", line 36, in copy_rows
cursor.execute(query)
File "/Users/dustinsmith/Development/work/condenser/condenser/db_connect.py", line 58, in execute
retval = self.inner_cursor.execute(query)
psycopg2.errors.UndefinedFunction: operator does not exist: uuid = bigint
LINE 1: ...calc"."settings" WHERE ("market_uuid") NOT IN (SE...
^
HINT: No operator matches the given name and argument types. You might need to add explicit type casts.
Editing the source DB for a SQL Server connection is supposed to update the destination DB, but currently isn't.
Maybe I'm just misunderstanding what condenser is able to do :)
I have this schema - of course much more complex in reality, but this shows the problem I'm facing:
When dumping meals, I expect to get all the food items in the meal, and their i18ns. MealI18n is dumped correctly, but FoodI18n is not.
Create tables
CREATE TABLE public.food (
id uuid NOT NULL PRIMARY KEY
);
CREATE TABLE public.food_i18n (
id uuid NOT NULL PRIMARY KEY REFERENCES food(id)
);
CREATE TABLE public.meal (
id uuid NOT NULL PRIMARY KEY
);
CREATE TABLE public.meal_i18n (
id uuid NOT NULL PRIMARY KEY REFERENCES meal(id)
);
CREATE TABLE public.meal_food (
meal_id uuid NOT NULL REFERENCES meal (id),
food_id uuid NOT NULL REFERENCES food (id),
PRIMARY KEY (meal_id, food_id)
);
INSERT INTO public.food (id) VALUES ('c00d544c-2b15-11eb-adc1-0242ac120002'), ('2aa03896-780b-42de-a297-09e897a55c09');
INSERT INTO public.food_i18n (id) VALUES ('c00d544c-2b15-11eb-adc1-0242ac120002'), ('2aa03896-780b-42de-a297-09e897a55c09');
INSERT INTO public.meal (id) VALUES ('eb0831ee-16c2-42d3-a1d0-b51b623b1e8e'), ('b0aa8b6d-a5cc-421c-9eac-88d88d9dc8dc');
INSERT INTO public.meal_i18n (id) VALUES ('eb0831ee-16c2-42d3-a1d0-b51b623b1e8e'), ('b0aa8b6d-a5cc-421c-9eac-88d88d9dc8dc');
INSERT INTO public.meal_food (meal_id, food_id) VALUES ('eb0831ee-16c2-42d3-a1d0-b51b623b1e8e','c00d544c-2b15-11eb-adc1-0242ac120002');
Config
{
"initial_targets": [
{
"table": "public.meal",
"percent": 200
}
],
"db_type": "postgres",
"keep_disconnected_tables": false,
"excluded_tables": [],
"passthrough_tables": [],
"dependency_breaks": [],
"upstream_filters": [],
"fk_augmentation": [ ]
}
Console output
Beginning subsetting with these direct targets: ['public.meal']
Processing 1 of 1: {'table': 'public.meal', 'percent': 200}
Direct target tables completed in 0.019591331481933594s
Beginning greedy upstream subsetting with these tables: ['public.food_i18n', 'public.meal_food', 'public.meal_i18n']
Processing 1 of 3: public.food_i18n
Processing 2 of 3: public.meal_food
Processing 3 of 3: public.meal_i18n
Greedy subsettings completed in 0.026256084442138672s
Beginning pass-through tables: []
Pass-through completed in 1.9073486328125e-06s
Beginning downstream subsetting with these tables: ['public.meal_i18n', 'public.meal_food', 'public.food_i18n', 'public.food', 'public.meal']
Processing 1 of 5: public.meal_i18n
Processing 2 of 5: public.meal_food
Processing 3 of 5: public.food_i18n
Processing 4 of 5: public.food
Processing 5 of 5: public.meal
Downstream subsetting completed in 0.020848989486694336s
Beginning post subset SQL calls
Completed post subset SQL calls in 5.054473876953125e-05s
public.meal, 0, 2, 0
public.meal_food, 0, 1, 0
public.food, 0, 1, 0
public.food_i18n, 0, 0, 0
public.meal_i18n, 0, 2, 0
I thought maybe the problem was that a primary key references another primary key, so I added food_id to FoodI18n, but that still has the same problem.
Hi!
You've got an undocumented option perserve_fk_opportunistically, which is quite important, and you've got a typo in the code: it should be preserve_fk_opportunistically instead of perserve_fk_opportunistically.
Please update the documentation to explain the impact and usage of that option.
From example-config.json
"excluded_tables": [
{"schema": "public","tables": [ "table_of_keys"] }
],
This does not work, however. The code expects:
"excluded_tables": ["schema.table1", "schema.table2", ...]
The example-config.json should be updated to be correct. I suspect the example passthrough_tables option is similarly out of date.
Hi,
I have a database with multiple foreign key constraints between tables, and a table that has self-dependencies. I was able to fix the self-dependency issue by listing it as an excluded table. So when I run the condenser, the pg_dump works, but once it goes to insert records into a table at the destination, inserts fail for records with foreign keys to the ignored table. I also tried listing these relationships in dependency_breaks, but the problem still happens.
Does this tool adjust the foreign key constraints before inserting? Or does it assume foreign key constraint warnings are turned off prior to running for tables that are supposed to be ignored?
Does the order of two tables in dependency breaks tuple matter? I.e. {"fk_table": "table_1", "target_table": "table_2"} versus {"fk_table": "table_2", "target_table": "table_1"}? I tried both and also listed them together and neither of these solved the issue when writing/inserting records to related tables at the destination.
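On the question of whether FK enforcement needs to be relaxed beforehand: condenser's docs don't describe such a step, but as a hedged workaround sketch (not a condenser feature), the standard statements for each engine could be run on the destination connection before the inserts. Both forms below are standard in their respective databases; the wrapper function is my own.

```python
def fk_disable_statements(db_type):
    """Statements that relax FK enforcement for the current session."""
    if db_type == 'postgres':
        # Per-session; skips FK triggers. Requires sufficient privileges.
        return ["SET session_replication_role = 'replica'"]
    if db_type == 'mysql':
        return ['SET FOREIGN_KEY_CHECKS = 0']
    raise ValueError(f'unsupported db_type: {db_type}')
```

The corresponding re-enable ('origin' / FOREIGN_KEY_CHECKS = 1) would run after the load; note this leaves any dangling references in place rather than fixing them.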
Also, this repo is not public. How can I push updates I made to include additional authentication parameters that I am sure would be useful to others?
Attempting to run direct_subset on two local MySQL databases, with a user that has all privileges on both source and target DBs, I get this:
File "/home/asaf/.local/lib/python3.9/site-packages/mysql/connector/connection_cext.py", line 517, in cmd_query
self._cmysql.query(query,
_mysql_connector.MySQLInterfaceError: Access denied for user 'bybe'@'localhost' to database 'tonic_subset_temp_db_398dhjr23'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/asaf/apps/condenser-master/direct_subset.py", line 42, in <module>
subsetter.prep_temp_dbs()
File "/home/asaf/apps/condenser-master/subset.py", line 98, in prep_temp_dbs
self.__db_helper.prep_temp_dbs(self.__source_conn, self.__destination_conn)
File "/home/asaf/apps/condenser-master/mysql_database_helper.py", line 10, in prep_temp_dbs
run_query('DROP DATABASE IF EXISTS ' + temp_db, source_conn)
File "/home/asaf/apps/condenser-master/mysql_database_helper.py", line 144, in run_query
cur.execute(query)
File "/home/asaf/apps/condenser-master/db_connect.py", line 58, in execute
retval = self.inner_cursor.execute(query)
File "/home/asaf/.local/lib/python3.9/site-packages/mysql/connector/cursor_cext.py", line 270, in execute
result = self._cnx.cmd_query(stmt, raw=self._raw,
File "/home/asaf/.local/lib/python3.9/site-packages/mysql/connector/connection_cext.py", line 522, in cmd_query
raise errors.get_mysql_exception(exc.errno, msg=exc.msg,
mysql.connector.errors.ProgrammingError: 1044 (42000): Access denied for user 'bybe'@'localhost' to database 'tonic_subset_temp_db_398dhjr23'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/asaf/.local/lib/python3.9/site-packages/mysql/connector/connection_cext.py", line 517, in cmd_query
self._cmysql.query(query,
_mysql_connector.MySQLInterfaceError: Access denied for user 'bybe'@'localhost' to database 'tonic_subset_temp_db_398dhjr23'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/asaf/apps/condenser-master/direct_subset.py", line 57, in <module>
subsetter.unprep_temp_dbs()
File "/home/asaf/apps/condenser-master/subset.py", line 101, in unprep_temp_dbs
self.__db_helper.unprep_temp_dbs(self.__source_conn, self.__destination_conn)
File "/home/asaf/apps/condenser-master/mysql_database_helper.py", line 16, in unprep_temp_dbs
run_query('DROP DATABASE IF EXISTS ' + temp_db, source_conn)
File "/home/asaf/apps/condenser-master/mysql_database_helper.py", line 144, in run_query
cur.execute(query)
File "/home/asaf/apps/condenser-master/db_connect.py", line 58, in execute
retval = self.inner_cursor.execute(query)
File "/home/asaf/.local/lib/python3.9/site-packages/mysql/connector/cursor_cext.py", line 270, in execute
result = self._cnx.cmd_query(stmt, raw=self._raw,
File "/home/asaf/.local/lib/python3.9/site-packages/mysql/connector/connection_cext.py", line 522, in cmd_query
raise errors.get_mysql_exception(exc.errno, msg=exc.msg,
mysql.connector.errors.ProgrammingError: 1044 (42000): Access denied for user 'bybe'@'localhost' to database 'tonic_subset_temp_db_398dhjr23'
What might be the reason?
A standard pattern is for an identity column to be defined like this:
CREATE TABLE IF NOT EXISTS entity
(
id INTEGER NOT NULL GENERATED ALWAYS AS IDENTITY,
-- other columns omitted
)
However, condenser does not appear to handle such columns gracefully. It should be possible to force-insert a value into such a column using OVERRIDING SYSTEM VALUE, as described here. This is indicated in the error message as well.
Traceback (most recent call last):
File "direct_subset.py", line 43, in <module>
subsetter.run_middle_out()
File "C:\Source\GitHub\TonicAI\condenser\subset.py", line 54, in run_middle_out
self.__subset_direct(target, relationships)
File "C:\Source\GitHub\TonicAI\condenser\subset.py", line 117, in __subset_direct
self.__db_helper.copy_rows(self.__source_conn, self.__destination_conn, q, mysql_db_name_hack(t, self.__destination_conn))
File "C:\Source\GitHub\TonicAI\condenser\psql_database_helper.py", line 50, in copy_rows
execute_values(destination_cursor, insert_query, rows, template)
File "C:\Python36\lib\site-packages\psycopg2\extras.py", line 1299, in execute_values
cur.execute(b''.join(parts))
psycopg2.errors.GeneratedAlways: cannot insert into column "firm_id"
DETAIL: Column "id" is an identity column defined as GENERATED ALWAYS.
HINT: Use OVERRIDING SYSTEM VALUE to override.
This issue touches on GENERATED columns, but is not quite the same, as it only asks for the ability to exclude such columns from the subsetting operation. Identity columns are non-null and are key to the foreign key relationships that the tool is meant to preserve, so excluding them would not solve the problem.
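The clause the hint asks for is standard PostgreSQL (10+) syntax. A minimal sketch of how the insert statement could carry it follows; the function name and signature are illustrative, not condenser's actual copy_rows code.

```python
def build_insert(table, columns, override_identity=False):
    """Build an INSERT suitable for psycopg2's execute_values (%s carries
    the VALUES list). With override_identity=True, OVERRIDING SYSTEM VALUE
    lets explicit values land in GENERATED ALWAYS identity columns."""
    cols = ', '.join(f'"{c}"' for c in columns)
    override = ' OVERRIDING SYSTEM VALUE' if override_identity else ''
    return f'INSERT INTO {table} ({cols}){override} VALUES %s'
```

A destination-side check of pg_attribute.attidentity = 'a' could decide per table whether the clause is needed, so plain tables keep the ordinary INSERT.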
I tried out python iterative_subset.py, since python direct_subset.py wasn't working with my setup, and I'm getting the following error:
Traceback (most recent call last):
File "iterative_subset.py", line 11, in <module>
all_tables = list_all_tables(source_dbc.get_db_connection())
NameError: name 'source_dbc' is not defined
I'm wondering if it needs this (and surrounding) setup code which exists in the direct_subset script?
Line 12 in 30e12bd
First, thank you for this project!
Perhaps I am missing something, but it seems redundant to require <db_name>.<table_name> in the config schema. We have to indicate the db_name in the source_db_connection_info clause, so it seems that db_name should be assumed from that point on. Or is there a way to copy from multiple source databases somehow?
Greetings,
I was reading the following article on subsetting:
https://www.tonic.ai/blog/condenser-a-database-subsetting-tool
I don't exactly understand what the drawbacks are of dropping a cycle from a database. Of course one loses data when doing so, but is the same amount of data lost irrespective of where you cut the cycle? How could one measure that? What are some of the criteria that affect it?
Hi! I'm a bit confused about what the config property desired_result is supposed to represent and how I should use it. I see that example.config.json shows:
"desired_result": {
"table": "target_table",
"schema": "public",
"percent": 1
},
But any more clarification in the readme would be great, to explain which table is meant to be specified, etc. I might guess that percent is the degree to which we want to subset the db?
Thanks!
Search tags: config.json configuration desired end result schema
Hi folks!
This looks like a wonderful tool! I'm currently using Jailer to create DB subsets. That works fine, but you have to create the configuration using a GUI and it's overall too overcomplicated for my needs. Condenser looks like it hits the sweet spot between utility and simplicity.
One thing that is missing for my particular use case: I want to extract the subset of the DB and in a separate step load the data into a new DB (multiple DBs actually, such as development, staging etc.) Condenser is built in such a way that it immediately ingests the subset into the target DB.
How hard would it be to have an option to create a SQL file as the destination?
If you take PRs and can point me to the right place in the code where changes would need to be made, I can maybe help out.
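Until such an option exists, one hedged workaround is to subset into a scratch destination database and then dump it to a SQL file (the approach other issues here use with pg_dump). The helper below just assembles the command for subprocess.run; connection values are placeholders.

```python
def pg_dump_command(db_name, out_file, host='localhost', user='postgres'):
    """Assemble a plain-format pg_dump invocation writing to out_file.

    --no-owner keeps the dump loadable under a different role; all flags
    used here are standard pg_dump options.
    """
    return ['pg_dump', '-h', host, '-U', user,
            '--no-owner', '-f', out_file, db_name]
```

Example: subprocess.run(pg_dump_command('subset_db', 'subset.sql'), check=True) after direct_subset.py finishes; the resulting file can then be loaded into development, staging, etc.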
I've created a public gist with my config file and a link to a dump of the database I'm applying condenser against. The issue I'm running into is that condenser appears to hang (over 24hrs with no new text to screen on verbose mode), but I'm not sure how to debug the issue, or know whether or not anything is actually happening.
I'm running condenser as part of a broader workflow through a bash script:
#!/bin/bash
#
# A bash script that uses `condenser` to export a database subset to a database
# to a `localhost` database, and then dump the file and compress it into a tar
# file.
#
# Simon Goring - May 12, 2021
#
# First we check to see if the condenser files actually exist.
if [[ ! -f db_connect.py ]]
then
echo "Condenser does not exist in the current directory."
pip install toposort
pip install psycopg2-binary
pip install mysql-connector-python
git clone --depth=1 git@github.com:TonicAI/condenser.git .
rm -rf !$/.git
fi
# Clone the repo
#
# Remove the .git directory
#rm -rf !$/.git
export PGPASSWORD='DATABASE PASSWORD'
psql -h localhost -U postgres -c "CREATE DATABASE export;"
echo "SELECT 'DROP SCHEMA '||nspname||' CASCADE; CREATE SCHEMA '||nspname||';' FROM pg_catalog.pg_namespace WHERE NOT nspname ~ '.*_.*'" | \
psql -h localhost -d export -U postgres -t | \
psql -h localhost -d export -U postgres
python3 direct_subset.py -v
echo "SELECT 'DROP SCHEMA '||nspname||' CASCADE;' FROM pg_catalog.pg_namespace WHERE nspname =ANY('{"ap","da","doi","ecg","emb","gen","ti","ts","tmp"}')" | \
psql -h localhost -d export -U postgres -t | \
psql -h localhost -d export -U postgres
now=`date +"%Y-%m-%d"`
mkdir -p dumps
mkdir -p archives
pg_dump -Fc -O -h localhost -U postgres -v -d export > ./dumps/$1_dump_${now}.sql
tar -cvf ./archives/$1_dump_${now}.tar -C ./dumps $1_dump_${now}.sql
# -----------------------------------
# | Clean up files and databases |
# -----------------------------------
psql -h localhost -U postgres -c "DROP DATABASE export;"
rm ./dumps/$1_dump_${now}.sql
rmdir ./dumps
That's more an FYI about how we're trying to use it, though. The key element is that we're just calling condenser with python3 direct_subset.py -v, and the config file is linked above in the gist.
The goal of this issue is to note that there seems to be a point at which condenser is hanging, and to figure out a way to debug it so I can fix it.