shuttle-hq / synth

The Declarative Data Generator

Home Page: https://www.getsynth.com/

License: Apache License 2.0

Nix 1.62% Rust 87.41% Shell 3.18% TypeScript 7.19% CSS 0.17% JavaScript 0.44%
rust test-data-generator synthetic-data json data-generation realistic-data postgres hacktoberfest

synth's Introduction

The Declarative Data Generator




Synth is a tool for generating realistic data using a declarative data model. Synth is database agnostic and can scale to millions of rows of data.

Why Synth

Synth answers a simple question: there are so many ways to consume data, so why are there no frameworks for generating it?

Synth provides a robust, declarative framework for specifying constraint-based data generation, solving the following problems developers regularly face:

  1. You're creating an app from scratch and have no way to populate your fresh schema with correct, realistic data.
  2. You're doing integration testing / QA against production data, even though you know it's bad practice and you really shouldn't be.
  3. You want to see how your system will scale if your database suddenly has 10x the amount of data.

Synth solves exactly these problems with a flexible declarative data model which you can version control in git, peer review, and automate.

Key Features

The key features of Synth are:

  • Data as Code: Data generation is described using a declarative configuration language allowing you to specify your entire data model as code.

  • Import from Existing Sources: Synth can import data from existing sources and automatically create data models. Synth currently has Alpha support for Postgres, MySQL and MongoDB!

  • Data Inference: While ingesting data, Synth automatically works out the relations, distributions and types of the dataset.

  • Database Agnostic: Synth supports semi-structured data and is database agnostic - playing nicely with SQL and NoSQL databases.

  • Semantic Data Types: Synth uses the fake-rs crate to enable the generation of semantically rich data with support for types like names, addresses, credit card numbers, etc.

Status

  • Alpha: We are testing synth with a closed set of users
  • Public Alpha: Anyone can install synth. But go easy on us, there are a few kinks
  • Public Beta: Stable enough for most non-enterprise use-cases
  • Public: Production-ready

We are currently in Public Alpha. Watch "releases" of this repo to get notified of major updates.

Installation & Getting Started

On Linux and macOS you can get started with the one-liner:

# Optional, set install path
$ export SYNTH_INSTALL_PATH=~/bin
$ curl -sSL https://getsynth.com/install | sh

For more installation options, check out the docs.
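If you already have a Rust toolchain, installing from source may also work (this assumes the synth crate is published on crates.io and builds with a recent stable toolchain):

$ cargo install synth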

Examples

Building a data model from scratch

To start generating data without having a source to import from, you need to add Synth schema files to a namespace directory. To get started, we'll create a namespace directory for our data model and call it my_app:

$ mkdir my_app

Next let's create a users collection using Synth's configuration language, and put it into my_app/users.json:

{
    "type": "array",
    "length": {
        "type": "number",
        "constant": 1
    },
    "content": {
        "type": "object",
        "id": {
            "type": "number",
            "id": {}
        },
        "email": {
            "type": "string",
            "faker": {
                "generator": "safe_email"
            }
        },
        "joined_on": {
            "type": "date_time",
            "format": "%Y-%m-%d",
            "subtype": "naive_date",
            "begin": "2010-01-01",
            "end": "2020-01-01"
        }
    }
}

Finally, generate data using the synth generate command:

$ synth generate my_app/ --size 2 | jq
{
  "users": [
    {
      "email": "[email protected]",
      "id": 1,
      "joined_on": "2014-12-14"
    },
    {
      "email": "[email protected]",
      "id": 2,
      "joined_on": "2013-04-06"
    }
  ]
}

Building a data model from an external database

If you have an existing database, Synth can automatically generate a data model by inspecting the database.

You can use the synth import command to automatically generate Synth schema files from your Postgres, MySQL or MongoDB database:

$ synth import tpch --from postgres://user:pass@localhost:5432/tpch
Building customer collection...
Building primary keys...
Building foreign keys...
Ingesting data for table customer...  10 rows done.

Finally, generate data into another instance of Postgres:

$ synth generate tpch --to postgres://user:pass@localhost:5433/tpch

Why Rust

We decided to build Synth from the ground up in Rust. We love Rust, and given the scale of data we wanted synth to generate, it made sense as a first choice. The combination of memory safety, performance, expressiveness and a great community made it a no-brainer and we've never looked back!

Get in touch

If you would like to learn more, or you would like support for your use-case, feel free to open an issue on GitHub.

If your query is more sensitive, you can email [email protected] and we'll happily chat about your use case.

About Us

The Synth project is backed by OpenQuery. We are a Y Combinator backed startup based in London, England. We are passionate about data privacy, developer productivity, and building great tools for software engineers.

Contributing

First of all, we sincerely appreciate all contributions to Synth, large or small, so thank you.

See the contributing section for details.

License

Synth is source-available and licensed under the Apache 2.0 License.

Contributors ✨

Thanks goes to these wonderful people (emoji key):

Christos Hadjiaslanis: 📝 💼 💻 🖋 🎨 📖 🔍 🤔 🚇 🚧 📦 👀 🛡️ ⚠️ 📢
Nodar Daneliya: 📝 💼 🖋 🎨 📖 🔍 🤔
llogiq: 💼 💻 🖋 🤔 🚇 🚧 🧑‍🏫 👀 🛡️ ⚠️
Dmitri Shkurski: 💻
Damien Broka: 📝 💼 💻 🖋 🎨 📖 🔍 🤔 🚇 🚧 👀 ⚠️
fretz12: 🤔 💻 📖 ⚠️
Tyler Bailey: 💻 📖
Júnior Bassani: 🐛 💻
Daniel Hofstetter: 🐛 💻
Dr Alexander Mikhalev: 🚧 📖 👀 🤔 💻
s e: 💻 👀 🚇 📦 🐛 🤔 ⚠️ 📖 🚧

This project follows the all-contributors specification. Contributions of any kind welcome!

synth's People

Contributors

alexmikhalev, allcontributors[bot], baile320, bmoxb, brokad, cakebaker, chesedo, christos-h, csnweb, eltociear, fretz12, hbina, iamwacko, jeyrathnam, jrschumacher, juniorbassani, llogiq, luiz787, mathew-horner, mhorbul, moore-ryan, oluwamuyiwa, pickfire, robert-monk, sassela, shkurskid, siddiqueahmad, stygmates, vishalsodani, vlushn


synth's Issues

Allow import/export of arbitrary postgres array types

Describe the bug

Currently, due to the way we interface with Postgres via sqlx, their Encode trait does not allow encoding types containing slices (or things that deref to slices) of other custom types that implement Encode. I am not sure if we can solve this without changing sqlx.

To Reproduce
Steps to reproduce the behavior:

  1. Schema (if applicable)
{
    "type": "array",
    "content": {
        "type": "object",
        "intmatrix": {
            "type": "array",
            "length": {
                "type": "number",
                "range": {
                    "low": 1,
                    "high": 8,
                    "step": 1
                },
                "subtype": "u64"
            },
            "content": {
                "type": "array",
                "length": {
                    "type": "number",
                    "constant": 8
                },
                "content": {
                    "type": "number",
                    "subtype": "i32",
                    "range": {
                        "low": -32768,
                        "high": 32767,
                        "step": 1
                    }
                }
            }
        }
    },
    "length": {
        "type": "number",
        "range": {
          "low": 1,
          "high": 10,
          "step": 1
        },
        "subtype": "u64"
    }
}
  2. See error
wrong element type

Expected behavior
The array should be generated and inserted into the database

Environment (please complete the following information):

  • OS: any
  • Version: 0.5.5

Additional context
The problem stems from the implementation of the Encode trait, which appears not to recurse into array elements. Even if we tried changing our code to construct the "correct" types before insertion, we'd run into the requirement of constructing a value of an arbitrary type. To see why, consider that arrays can nest to any number of levels, so we'd need at least to be able to construct a Vec<i32>, a Vec<Vec<i32>>, a Vec<Vec<Vec<i32>>>, etc. ad infinitum (and likewise for any other type). Given that sqlx does not implement Encode for Box<dyn Encode>, we cannot even use trait objects to overcome the issue.

Support `type` as field name on schema

Required Functionality
Currently the schema generator uses key: value pairs, where the key is the field / column name and the value contains the schema description. A field literally named type therefore conflicts with the "type" key of the schema definition object.

Proposed Solution

Use case
Example conflicted schema:

{
  "type": "array",
  "length": {
    "type": "number",
    "range": {
      "high": 1,
      "low": 0,
      "step": 1
    },
    "subtype": "u64"
  },
  "content": {
    "type": "object",
    "ip_address": {
      "optional": true,
      "type": "string",
      "pattern": "[a-zA-Z0-9]{0, 45}"
    },
    "user_agent": {
      "optional": true,
      "type": "string",
      "pattern": "[a-zA-Z0-9]{0, 1000}"
    },
    "hash": {
      "optional": true,
      "type": "string",
      "pattern": "[a-zA-Z0-9]{0, 64}"
    },
    "user_id": {
      "optional": true,
      "type": "string",
      "pattern": "[a-zA-Z0-9]{0, 50}"
    },
    "type": { <-- Duplicate Object key
      "optional": true,
      "type": "string",
      "pattern": "[a-zA-Z0-9]{0, 5}"
    },
    "created_at": {
      "type": "string",
      "date_time": {
        "format": "%Y-%m-%d %H:%M:%S%Z",
        "subtype": "date_time",
        "begin": null,
        "end": null
      }
    },
    "id": {
      "type": "number",
      "id": {},
      "subtype": "u64"
    }
  }
}

Feature: Allow more extensive string manipulation in `format`

Currently, the string formatter can:

  • concatenate strings
  • reference component strings from other fields via same_as
  • truncate component string lengths via truncate

This is a good start, but not enough for a robust string manipulation mechanism. It would be great to have support for string slicing (e.g. take the first 5 characters, or the characters between the 7th and 10th) and regex extraction (take the first capture group). For that, we also need to think about what the format string language should look like.

Examples:

  • string slicing: {name[2..]} or {mac_addr[8..12]}
  • regex extraction: {ip4cidr@^[0-9]\.[0-9]\.[0-9]\.[0-9]$@{2}} to extract the third octet

It doesn't look like dynfmt currently supports more advanced formats, so something might need to change there.

Feature: CSV import/export

Required Functionality
An import and export of CSV files.

Proposed Solution
The CSV format has some surprising complexity. Nonetheless, an import could at least use headers (if any) of a CSV file and try to match certain things about the values (e.g. are they all digits?). The export should have an option to declare the delimiter and perhaps quoting / escaping.

Use case

  • Creating a schema (that can later be refined) from a set of CSV files
  • Generating a set of CSV files from a schema.

Date/Times should not be subtypes of string

Currently date_time is a subtype of string. This is for historical reasons: we used to only support generating JSON, which represents date-times as nothing other than strings.

Nowadays we generate into databases as well, which have first-class support for more primitive types (including date-times) than JSON does. So internally we made the date_time nodes generate chrono::DateTime values and handle formatting separately. But date_time is still a subtype of string, which is not the best UX.

This is simply to move the content::string::DateTime node to be a content::Content variant in its own right, spelled "type": "date_time".
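For illustration, here is the same field in both spellings: the current string-subtype form (as used by the created_at example further down this page) and the proposed first-class form (as already used by the quickstart's joined_on field):

{
    "type": "string",
    "date_time": {
        "format": "%Y-%m-%d",
        "subtype": "naive_date",
        "begin": "2010-01-01",
        "end": "2020-01-01"
    }
}

would become

{
    "type": "date_time",
    "format": "%Y-%m-%d",
    "subtype": "naive_date",
    "begin": "2010-01-01",
    "end": "2020-01-01"
}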

Add foreign keys to test schemas

What is wrong
The end-to-end tests in our testing harness only test with schemas that do not include primary/foreign key relations.

As a result, we don't E2E test importing into the same_as generator and we are not able to detect breaking changes to the code that orders insertion into databases with PK/FK constraints.

How to fix it
Add constraints to the test schemas in the testing harness and modify the relevant imported synth schemas (e.g. hospital_master). This will probably also require changing some of the generated data we test against (e.g. hospital_data_generated_master.json).

Lossless Sampling

Required Functionality

Currently the XExportStrategy and Sampler::sample functions work with vectors of JSON values.

This is handy, but it loses information.

In fact, for any data sink whose types are a superset of the JSON data model (both Postgres and Mongo qualify), you will lose information, since most types get serialized to a string (for example timestamps).

This can be a problem: at insertion time you don't know what type to use in the client library.

Proposed Solution

  1. The export strategy is doing too much and is too tightly coupled to the sampler.
  2. The sampler returns JSON values, which is not ideal. We want the sampler to return a vector of the Value type in core::graph.
  3. The type mapping has to be redone. We currently have JSON -> PG types and JSON -> Mongo. This needs to be re-implemented as core::graph::Value -> X.
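As a minimal sketch of point 2, a typed value tree along these lines would keep database-native types intact (the variant set here is illustrative, not the actual core::graph::Value definition):

// Illustrative only: a value tree that keeps sink-native types
// (e.g. timestamps) instead of collapsing everything to JSON strings.
enum Value {
    Null,
    Bool(bool),
    Number(f64),
    String(String),
    DateTime(chrono::DateTime<chrono::Utc>), // a real timestamp, not a string
    Array(Vec<Value>),
    Object(std::collections::BTreeMap<String, Value>),
}

A sink-specific encoder can then map each variant to the client library's native parameter type instead of re-parsing strings at insertion time.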

Specify collections on import

Required Functionality
For certain use cases, it is important to specify a set of collections when importing (i.e. importing a specific set of tables from a database). Synth already supports importing a single collection with --collection, but this doesn't cover all use cases.

Proposed Solution
There are various options here, but one would be a comma-separated list of table names (since a comma is an illegal character in a collection name anyway): --collections collection1,collection2,....

Future work could also involve supporting regular expressions. For example --collection-pattern "subset_.*".
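Hypothetical usage of both proposed flags (neither exists yet):

$ synth import my_db --collections users,transactions --from postgres://user:pass@localhost:5432/my_db
$ synth import my_db --collection-pattern "subset_.*" --from postgres://user:pass@localhost:5432/my_db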

Importing with Postgres enum types

Thanks for your work on synth!

Required Functionality
Currently, trying to import from a database that uses enum types, e.g.

CREATE TYPE MyType AS ENUM ('Left', 'Right');
CREATE TABLE my_table (my_type MyType);

results in

Error: We haven't implemented a converter for mytype

Use case
I have a database where some Rust enums are represented as enum types and it would be nice to be able to use synth's import feature with it.

dyld: Library not loaded: /Users/runner/hostedtoolcache/Python/3.9.4/x64/lib/libpython3.9.dylib

Describe the bug
When I try to run synth binary on macOS I get the following error:

dyld: Library not loaded: /Users/runner/hostedtoolcache/Python/3.9.4/x64/lib/libpython3.9.dylib
  Referenced from: /usr/local/bin/synth
  Reason: image not found
zsh: abort      synth

To Reproduce
Steps to reproduce the behavior:

  1. Run on macOS
synth
  2. See error
dyld: Library not loaded: /Users/runner/hostedtoolcache/Python/3.9.4/x64/lib/libpython3.9.dylib
  Referenced from: /usr/local/bin/synth
  Reason: image not found
zsh: abort      synth

Expected behavior
No error.


Environment (please complete the following information):

  • OS: macOS
  • Version: Latest


Synth import extension for creating DB-schema-level collections and filtering out non-table objects

Required Functionality

Proposing to add a schema option to the synth import command for building a data model from your Postgres database schema. Currently it goes to the public schema, which works, but there needs to be a way to specify the schema. Also, when it finds non-table objects in the schema, it fails; I had to remove those objects temporarily for the import to work.

$ synth import tpch --from postgres://user:pass@localhost:5432/tpch
Building customer collection...
Building primary keys...
Building foreign keys...
Ingesting data for table customer... 10 rows done.

Proposed Solution
Provide a switch for selecting the schema of an existing database. Functionality to select individual tables as well would be super.
Besides, if, based on integrity constraints, it could automatically add dependencies between collections, that would save a lot of development time.

Use case
I have many schemas in an existing database and would like to select one. The import continues to fail when it encounters non-table objects. Also, adding integrity constraints to each collection by hand is a very laborious task.

Feature: Scheduler / Topological sorting namespaces

Required Functionality
synth currently requires namespaces to be given in dependency order when there are dependencies between them (for example via foreign key relations in a database). A topological sort would solve this.

Proposed Solution
The most used algorithm is Kahn's, which the toposort-scc crate already implements. An implementation could either be derived from it (in accordance with the Apache-2.0 license) or depend on the crate; a sketch follows after this section.

Use case
With the topological sort and a function telling us whether a namespace depends on another, we may even parallelize generation across multiple namespaces. This would also incidentally allow us to emit a good error message if the dependency graph between namespaces is cyclic.
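A compact sketch of Kahn's algorithm over a namespace dependency graph (names and types here are illustrative, not synth internals):

use std::collections::{HashMap, VecDeque};

/// Orders namespaces so that each one comes after everything it depends on.
/// `deps[ns]` lists the namespaces `ns` depends on. Returns None on a cycle.
fn topo_order(deps: &HashMap<String, Vec<String>>) -> Option<Vec<String>> {
    let mut indegree: HashMap<&str, usize> = HashMap::new();
    let mut dependents: HashMap<&str, Vec<&str>> = HashMap::new();
    for (node, ds) in deps {
        *indegree.entry(node.as_str()).or_insert(0) += ds.len();
        for d in ds {
            indegree.entry(d.as_str()).or_insert(0);
            dependents.entry(d.as_str()).or_default().push(node.as_str());
        }
    }
    // Seed the queue with namespaces that have no dependencies.
    let mut queue: VecDeque<&str> = indegree
        .iter()
        .filter(|&(_, &n)| n == 0)
        .map(|(&k, _)| k)
        .collect();
    let mut order = Vec::new();
    while let Some(n) = queue.pop_front() {
        order.push(n.to_string());
        for &m in dependents.get(n).map(Vec::as_slice).unwrap_or(&[]) {
            let e = indegree.get_mut(m).unwrap();
            *e -= 1;
            if *e == 0 {
                queue.push_back(m);
            }
        }
    }
    // Any namespace left with a nonzero in-degree sits on a cycle.
    if order.len() == indegree.len() { Some(order) } else { None }
}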

Allow unbounded `number`s

Required Functionality
Currently, number types can only be range, categorical (for integers) or constant (and some can be id). There is no easy way to specify generating arbitrary numbers.

Proposed Solution
Consider adding an unbounded variant that will generate any valid number of the given type (a hypothetical spelling is sketched below); alternatively, make serde use the range default if no argument at all is given.

Use case
Simplify the default, and allow generating arbitrary numbers without much hassle.
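A hypothetical spelling of the proposed variant (this field does not exist today):

{
    "type": "number",
    "subtype": "i64",
    "unbounded": {}
}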

Obfuscation of production data instead of randomly created data

Required Functionality
Sometimes it is hard to match production data quality using just the same schema plus random data. Some records follow strict business rules, and random data produced by generator rules is simply not enough.

Moreover, in these cases all you want to anonymize is sensitive data like usernames, passwords and personal info.

Proposed Solution
It would be easier to have a tool that uses clever algorithms to build the model, automatically detecting most of these sensitive fields, sets the rules to anonymize / obfuscate them, and lets you edit those definitions or add new ones.

So the basics are the same: you define a model first (mostly created automatically), and then you run the model to create the fake data, but using production data as input as well.

Also, the ability Synth already has to set the "size" of the output is key, because a production DB can contain gigabytes of data and processing all of it is not possible; the tool needs to be clever enough to anonymize a subset of it without querying all the records in each table.

Use case
E.g. when working with event-driven information, sometimes a "report" triggers the creation of many "task" records, and each task is not just related to the report: the tasks also correlate with each other. There are tasks of type "A", "B" and "C" that are always created when a report of type "X" is created (and they are very different from the tasks created for reports of type "Y"), and each of these tasks has special fields depending on its type. Randomizing this information in a way that makes sense for the consuming app is almost impossible: the app will crash expecting the information to follow not just the schema but also business rules that cannot be expressed with data-definition rules.

Feature: Conditional sampling

Required Functionality
It would be great to be able to specify a subset of the schema using conditions when calling generate.

  • Ability to define one or multiple conditions:
    • Single value: Sets a field value to a fixed value.
    • Range: Defines a range that is a subset of the one defined in the schema.
  • It would be great to be able to subset the following field types: bool, number: range, number: id, string: categorical (nice to have: one_of variants)

Example: When using the bank example, I would like to be able to specify that I only want samples for transactions with currency USD and GBP.

Proposed Solution
I could imagine two solutions:

  • Adding a CLI parameter and creating a subsetting syntax for each supported field type (e.g. synth generate bank_db --condition transactions.currency=USD,GBP)
  • Providing a templating/patching approach (similar to kustomize), where you basically validate and merge two JSON schema definitions. The base definition gets partially overridden with the newly defined subset definition.

Use case

  • Sampling certain subsets of a complex set of possible value combinations (We use this as a base for value inference using machine learning).
  • Should also match this use case from SDV: sdv-dev/SDV#316

Feature: Allow fields to be omitted from output, hiding them

It is sometimes useful to generate a more complex structure and reference parts of it within the same namespace. For example, a database table might have day/month/year as 3 separate columns; for the sake of consistency, a single datetime has to be generated (to deal with cases like different months having different numbers of days, etc.).

One approach to implement this is by declaring a field as hidden, omitting it from the output entirely.

Hypothetical example:

{
  "type": "object",
  "_date": {
    "type": "string",
    "hidden": true,
    "date_time": {
      "format": "%Y-%m-%d",
      "subtype": "naive_date",
      "begin": "1930-01-01",
      "end": "2010-01-01"
    }
  },
  "day": {
    "type": "number",
    "format": {
      "format": "{date@day}",
      "arguments": {
        "date": "@namespace.content.date"
      }
    }
  }
}

Feature Request: Custom Faker Generators

Required Functionality
A custom Faker generator to generate customized data from custom sources/methods.

Proposed Solution
Allow custom or 3rd-party faker generators to expand on fake data generation.
Perhaps through a separate schema-like file, or fully-defined custom code that is then imported by the program (e.g. custom providers in mimesis).

Use case
When generating data, especially relational data, it can be very useful to have the fake data use real-world values.

For example: when generating a table of medical utilization, you could use a custom provider to generate "fake" procedure codes by importing/selecting real values from a CMS Procedure Code list (raw codes/src).

Feature: Allow use of numbers in string.format

Required Functionality
Allowing the use of numbers with the string.format generator would greatly expand the use cases of the formatter.

Proposed Solution
Allow numerical values, converting them to strings, when used within the contents of string.format.

Currently this throws a BadRequest:

> .\synth.exe generate examples --collection string_format_int
Error: At namespace "examples"

Caused by:
    BadRequest: invalid type: expected 'String', found 'Number'

Use case
Such a feature would allow mixing in numerical generators (e.g. id) to create customized ID fields, street addresses, and more.

Example

{
  "type": "array",
  "length": 5,
  "content": {
    "type": "object",
    "str_int": {
      "type": "string",
      "format": {
        "format": "{id}_suffix",
        "arguments": {
          "id": {
            "type": "number",
            "id": {
              "start_at": 100000
            }
          }
        }
      }
    }
  }
}

Running the `bank_db` examples fails

Describe the bug
Running the bank_db example fails

To Reproduce
Steps to reproduce the behavior:
In examples/bank, run cargo run --bin synth -- generate bank_db and get the error:

Error: At namespace "bank_db"

Caused by:
    0: while compiling the namespace
    1: at `users.content.username.1`
    2: Generator 'user_name' does not exist , did you mean 'username'?

Integrate `rustfmt` with Synth

Required Functionality
Currently, the coding style is left to the taste of whichever programmer is contributing to Synth, which causes the codebase to have an inconsistent coding style.

Proposed Solution
Delegate coding style management to rustfmt. This will include the following steps:

  1. Run cargo fmt at the root of the repository to format the entire codebase.
  2. Update the CI pipeline to verify coding style on commits and PRs (see the sketch below).
  3. Update the contributing guide to instruct contributors to run cargo fmt before submitting a contribution. This will include pointing to resources on how to set up rustfmt integration with several IDEs.

Use case
Having rustfmt be the single coding-style dictator will avoid fragmenting the codebase with multiple coding styles.
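For step 2, the usual check is to make CI fail when the tree is not rustfmt-clean (exact pipeline wiring omitted):

$ cargo fmt --all -- --check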

Clippy is unhappy

Required Functionality
Running cargo clippy at the root of the project yields various errors (turned into warnings once #36 gets closed).

Proposed Solution
Apply the fixes suggested by Clippy. Most of them look pretty straightforward to solve and shouldn't cause any issues. However, I'm not sure all of them would be accepted. For example, one of the suggestions is to rename ParseState::EOF, at core/src/schema/mod.rs, to ParseState::Eof. Clippy also suggests changing the signatures of some functions/methods to conform with conventions. I suggest we either fix these last issues or explicitly allow them with #[allow] attributes (sketched below).

(Optional) To prevent these issues from reappearing, we should also consider running Clippy in the CI pipeline, although contributors should be made aware of that in order to prevent their commits from failing.

Use case
Following Clippy's suggestions would improve readability and, in some cases, slightly improve performance.
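For the ParseState::EOF example, an explicit opt-out could look like this (assuming the lint clippy reports is upper_case_acronyms):

// Keep the established name rather than renaming to ParseState::Eof.
#[allow(clippy::upper_case_acronyms)]
pub enum ParseState {
    EOF,
    // ...
}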

Feature: Doc template generator

Required Functionality
We feel that synth namespaces are not currently well documented. Some markdown (or a similar format) describing the meaning of columns, etc. would really go a long way toward making the definitions more approachable.

Proposed Solution
To reduce the work of setting up such documentation and induce some consistency, a synth doc subcommand could parse the schema and emit a README.md in the namespace that can then be extended into usable documentation.

The template should have the following sections:

  • A short description (obviously left as an exercise to the author)
  • The collections of the namespace with
    • field names, types and descriptions (the latter also to be added by the author)
    • dependencies to other collections (e.g. foreign keys, subcollections)
  • A minimal pretty-printed (perhaps abridged if too long) generated output of the namespace.

Use case
The documentation would help with maintaining the schemas, especially if the original author is no longer around. Making it easy to document means we'll hopefully see better documented schemas.

Error: We haven't implemented a converter for citext

Describe the bug
I am attempting to generate data from a Postgres database that uses citext and receive the following error:

Error: We haven't implemented a converter for citext

I'm hoping the implementation should be pretty straightforward, as citext could probably just be treated as a string.

To Reproduce
synth import ztoDeoAuWw --from postgres://adm:mypass@db:5432/ztoDeoAuWw

Steps to reproduce the behavior:

  1. See error
Error: We haven't implemented a converter for citext

Expected behavior
Citext columns would be treated as any other varchar column

Environment (please complete the following information):

  • OS: Linux
  • Version: 0.5.3

synth init fails creating config file on windows 10

Describe the bug
"synth init" fails to create config.toml file on windows 10.

To Reproduce
Steps to reproduce the behavior:

  1. Run synth init in any folder on Windows 10.
  2. See the error message:

Error: Failed to create config file at: \?\C:\Users\jaira\Documents\Projects\synth.github\workflows\scripts.synth/config.toml during initialization

Caused by:
    The filename, directory name, or volume label syntax is incorrect. (os error 123)

Expected behavior
The config.toml file should be created under the .synth directory.


Environment (please complete the following information):

  • OS: Windows 10 x64
  • Version: 0.5.3

Additional context
Researching this, the issue is possibly due to the "\\" at the start of the file path.

Custom panic handler

Required Functionality
Whenever synth panics, the user should get the option to send an automated bug report, with a further option to give their contact info so we can get back to them. This will help our users be more proactive when bugs occur and help us get more and better bug reports.

Proposed Solution
A custom panic handler can gather some information (synth version, operating system, etc.), format it, and send it to an endpoint we need to set up.

Use case
Improve our reaction to bugs; make synth better, faster.
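A minimal sketch using the standard library's panic hook; the report format is a placeholder, and the actual endpoint, consent prompt, and contact-info flow are omitted:

use std::panic;

fn install_reporting_hook() {
    let default_hook = panic::take_hook();
    panic::set_hook(Box::new(move |info| {
        // Gather basic context for the report.
        let report = format!(
            "synth version: {}\nos: {}\npanic: {}",
            env!("CARGO_PKG_VERSION"),
            std::env::consts::OS,
            info
        );
        eprintln!("--- bug report draft ---\n{report}");
        // A real handler would POST `report` to a collection endpoint here.
        default_hook(info); // keep the normal panic output
    }));
}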

Support for mysql/mariadb

Required Functionality
It would be really useful for my work to be able to populate MySQL/MariaDB databases directly.

Something like:

synth generate tpch --to mysql://user:pass@localhost:3306/mydbname

Proposed Solution

Use case
PostgreSQL and MongoDB are widely used, but so are MySQL/MariaDB.

Implement a converter for timestamptz

Describe the bug
Running synth import tpch ... against a database using a timestamp with time zone column produces the following:

thread 'main' panicked at 'not implemented: We haven't implemented a converter for timestamptz', synth/src/cli/postgres.rs:507:18
stack backtrace:
   0: _rust_begin_unwind
   1: core::panicking::panic_fmt
   2: synth::cli::postgres::<impl core::convert::From<synth::cli::postgres::ColumnInfo> for synth_core::schema::content::object::FieldContent>::from
   3: <synth::cli::postgres::Collection as core::convert::From<alloc::vec::Vec<synth::cli::postgres::ColumnInfo>>>::from
   4: <synth::cli::postgres::PostgresImportStrategy as synth::cli::import::ImportStrategy>::import
   5: synth::cli::Cli::import
   6: synth::cli::with_telemetry
   7: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
   8: std::thread::local::LocalKey<T>::with
   9: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
  10: async_io::driver::block_on
  11: std::thread::local::LocalKey<T>::with
  12: async_std::task::builder::Builder::blocking
  13: synth::main

+1 for this being implemented.

In the meantime, is there a workaround where I can force Synth to treat all timestamptz as timestamp?

Schema Values List not same

Hi,
I created a schema with 11 fields of different types.
The PG table consists of a serial field plus 11 fields with the same names and types as the schema.

Now when I try to generate to Postgres:
synth generate --collection test_data --to postgres://admin:1122@localhost:5432/data ~/db_test/data/

I get the following error:
Caused by:
0: db error: ERROR: VALUES lists must all be the same length
1: ERROR: VALUES lists must all be the same length

When I pipe the generated values to a plain file, I noticed that some rows have 10 values and some even have 9, so not all the generated rows have 11 values. Why does this happen, and how do I solve it?

  1. Schema (if applicable)
	"type":"array",
	"length":{
		"type":"number",
		"subtype":"u64",
		"constant":10000
	},
	"content":{
		"type":"object",
		
		"special_id":{
			"optional":false,
			"type":"number",
			"subtype":"u64",
			"range":{
				"low":1,
				"high":20000,
				"step":2
			}
		},
		"first_name":{"optional":false,"type":"string","faker":{"generator":"first_name"}},
		"surname":{"optional":false,"type":"string","faker":{"generator":"last_name"}},
		"nickname":{"optional":false,"type":"string","faker":{"generator":"sentence"}},
		"ages":{
			"optional":false,
			"type":"number",
			"subtype":"u64",
			"range":{
				"low":18,
				"high":45,
				"step":1
			}			
		},
		"department":{
			"optional":false,
			"type":"string",
			"pattern": "(Accounting|Managment|IT||Sales)"
		},
		"job":{
			"optional":false,
			"type":"string",
			"faker":{
				"generator":"job"
			}
		},
		"ident":{
			"optional": true,
			"type":"string",
			"uuid":{}
		},
		"email_address":{
			"optional":false,
			"type":"string",
			"faker":{
				"generator":"company_email"
			}
		},
		"company_web":{
			"optional":true,
			"type":"string",
			"faker":{
				"generator":"safe_domain_name"
			}
		},
		"active":{
			"type":"bool",
			"frequency":0.5
		}
	}	
}

Support postgres json and jsonb types

Required Functionality

Importing from a Postgres DB fails with:

Error: We haven't implemented a converter for jsonb

Proposed Solution

Support postgres json and jsonb types for datasource

Use case

Generate the schema? I haven't got import working yet.

Probabilistically Distribute data in arrays

Required Functionality

I need data that "looks" a certain way. For my specific use case, a list of 100ish company names with a normal or poisson distribution.

Proposed Solution

Apologies if there's already a way to do this; I couldn't find one in the docs.

Add a "distribution" block to arrays with the parameters? It'd also be nice if there was some way to calculate an appropriate distribution from prod data.

Use case

I'm writing a set of scripts to chart my bank transactions. I'd like to blog about it, but I don't feel comfortable posting the businesses I frequent, so I'd like to generate replacements with synth.

Double quotes instead of single quotes for JSON

Synth uses single quotes when serializing objects. This leads to errors when I try to insert these objects into a PostgreSQL DB, because it expects double quotes around property names in objects.
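For reference, only the double-quoted form is valid JSON, and it is the only form PostgreSQL's json/jsonb parser accepts:

{'id': 1}    rejected: single quotes are not valid JSON
{"id": 1}    accepted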

panic: `Option::unwrap()` on a `None` value at core/src/graph/mod.rs:572:66

Describe the bug
Panic on calling synth generate

To Reproduce

  1. Schema
    users.json
{
    "type": "array",
    "length": {
        "type": "number",
        "subtype": "u64",
        "constant": 1
    },
    "content": {
        "type": "object",
        "id": {
            "type": "number",
            "subtype": "u64",
            "id": {}
        }
    }
}

posts.json

{
    "type": "array",
    "length": {
        "type": "number",
        "subtype": "u64",
        "constant": 1
    },
    "content": {
        "type": "object",
        "id": {
            "type": "number",
            "subtype": "u64",
            "id": {}
        },
        "user_id": {
            "type": "same_as",
            "ref": "users.content.id"
        }
    }
}

images.json

{
    "type": "array",
    "length": {
        "type": "number",
        "subtype": "u64",
        "constant": 1
    },
    "content": {
        "type": "object",
        "id": {
            "type": "number",
            "subtype": "u64",
            "id": {}
        },
        "post_id": {
            "type": "same_as",
            "ref": "posts.content.id"
        }
    }
}

tags.json

{
    "type": "array",
    "length": {
        "type": "number",
        "subtype": "u64",
        "constant": 1
    },
    "content": {
        "type": "object",
        "image": {
            "type": "same_as",
            "ref": "images.content.id"
        }
    }
}
  2. See error
$ synth generate ns --collection users
thread 'main' panicked at 'called `Option::unwrap()` on a `None` value', synth/core/src/graph/mod.rs:572:66

Expected behavior
Generated data or an error message explaining what's wrong with the schema.

Environment (please complete the following information):

  • OS: Ubuntu
  • Version: 21.04

Additional context
I was able to reproduce the bug both on the latest release and on master. The schema listed was the most minimal example I was able to find.

"import" doesn't generate the specified collection if the source is a json file

Describe the bug

The import command doesn't generate the specified collection if the source is a json file. Instead, an error message is shown.

To Reproduce
Steps to reproduce the behavior:

Run the following command, with example.json containing the json content below:

$ synth import bank_db --from ~/example.json --collection users
  1. Schema (if applicable)

The json content is a stripped-down version of the bank_db example.

{
  "users": [
    {
      "id": 1,
      "created_at_date": "2009-03-19",
      "created_at_time": "01:00:02",
      "credit_card": "346215176014733",
      "currency": "GIP",
      "email": "[email protected]",
      "is_active": false,
      "last_login_at": "2020-06-22T05:46:41+0000",
      "num_logins": 19,
      "password_hash": "eebed079e19dcf5b936e8ca5a648bee38e30bec02129790eabdd7084919d7972",
      "username": "[email protected]"
    }
  ],
  "transactions": [
    {
      "id": 1,
      "amount": 5001.7,
      "currency": "GIP",
      "timestamp": "2020-05-13T20:48:01+0000",
      "user_id": 1
    }
  ]
}
  2. See error
Error: Was expecting a collection, instead got `[here comes the entire content of the json file mentioned above]`

Expected behavior
The file bank_db/users.json should be generated.

Environment (please complete the following information):
Linux

Feature: synth init <directory-name>

Required Functionality
Currently synth init will initialise the current directory. This means that to get started you need to:

$ mkdir workspace && cd workspace && synth init

It would be much cleaner to optionally pass a directory:

$ synth init workspace

Synth should then build this directory and initialise the workspace.

Proposed Solution
The required changes should be local to the CliArgs structopt enum and the init function in synth/src/cli/mod.rs.

UX: Make synth init obsolete

Required Functionality
Currently, users must initialize a "workspace" before using synth to generate data. The rationale was that synth would be able to write files for both import and possibly generation (e.g. with #33), and having a directory to itself would reduce the risk of accidentally overwriting files. However, both requiring a call to synth init and the generated directories, even when empty, make it harder to set up tests and worsen the user experience.

Proposed Solution
Recognizing that we threw the baby out with the bathwater, we should not require user interaction unless there's a problem. At the very least we could remove the synth init command and the subsequent check for the .synth directory, and instead prompt the user before overwriting a file. Even better, we could record the path and creation timestamp whenever synth creates a new file, and skip the prompt when a file on that list gets overwritten. We should, however, keep the list of generated files somewhat short; storing up to 1000 files should be enough for everybody (famous last words). When the list gets full, we could check for removed files, otherwise remove the oldest entries.

Use case
Improve user experience, and simplify testing and reproducibility (because tests no longer need to clean up .synth folders).

Remove the deny(warnings) annotation

Describe the bug
Not a bug, but the deny(warnings) annotation is considered an anti-pattern.

The deny(warnings) annotation should be removed from the synth, synth-gen and synth-core crates.

There should instead be an explicit warnings check in CI, with something like RUSTFLAGS="-D warnings" cargo build.

Add support for ingesting/synthesizing custom binary data file

Required Functionality

While binary data comes in many shapes and forms, the particular format I'm after is unencoded/uncompressed binary data with different fields packed next to each other. Additionally, the file begins with a header and is concluded by a footer. In the middle is the payload, where entries are repeated many times.

Here is a pictorial of such a format:

Header
Entry 1
Entry 2
...
Entry N
Footer

Each entry is of fixed size and can have multiple fields of different data types occupying different numbers of bytes. Example:

timestamp (8 bytes) my_u32 (4 bytes) my_bool (1 byte) my_string (24 bytes)

Proposed Solution

The user will be required to supply additional schema info telling synth how to parse the fields. A possible format might look something like this:

  "binary_schema": {
    "entry_size_bytes": 37,
    "is_little_endian": true,
    "payload_start_offset_bytes": 4096,
    "payload_end_offset_bytes": 1024, -> this will be bytes from the end of the file
    "fields": [
      {
        "name": "timestamp",
        "type": "u64",
        "byte_start": 0,
        "byte_end": 7
      },
      {
        "name": "my_u32",
        "type": "u32",
        "byte_start": 8,
        "byte_end": 11
      },
      {
        "name": "my_bool",
        "type": "bool",
        "byte_start": 12,
        "byte_end": 12
      },
      {
        "name": "my_string",
        "type": "string",
        "byte_start": 13,
        "byte_end": 36
      }
    ]
  }

Such a binary schema could also be used to define extensions in the future, like encodings, variable-length data, etc.

Synth should be able to take such a schema and a data file, infer from them, and output a variant of the fields. A nice-to-have would be to take the original data file's header and footer and copy them into the generated file as-is.
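As a sketch, decoding one fixed-size entry of the layout above could look like this in Rust (field names, offsets and endianness taken from the example; everything else assumed):

/// Decode one 37-byte little-endian entry laid out as in the example above.
fn decode_entry(buf: &[u8; 37]) -> (u64, u32, bool, String) {
    let timestamp = u64::from_le_bytes(buf[0..8].try_into().unwrap());
    let my_u32 = u32::from_le_bytes(buf[8..12].try_into().unwrap());
    let my_bool = buf[12] != 0;
    // Fixed-width string field; trim trailing NUL padding.
    let my_string = String::from_utf8_lossy(&buf[13..37])
        .trim_end_matches('\0')
        .to_string();
    (timestamp, my_u32, my_bool, my_string)
}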

Use case
The use case pertains to protocol data files used in the storage industry; NVMe is one example. Other storage and networking protocols typically follow such a format to some degree as well.

Unknown `same_as` references panic

Describe the bug
When a reference is at top level (i.e. it is a field value of the top-level object of a collection) and does not exist, an explicit panic occurs.

To Reproduce
Steps to reproduce the behavior:

  1. Minimal reproducing schema
{
    "type": "array",
    "length": {
        "type": "number",
        "constant": 1
    },
    "content": {
        "type": "object",
        "message": {
            "type": "same_as",
            "ref": "i_dont_exist.content.description"
        }
    }
}
  2. See error
thread 'main' panicked at 'field should be there', core/src/schema/namespace.rs:309:56

Expected behavior
Not panicking

Additional context
Introduced by 0e0abdb.

Acknowledge contributors

This issue should contain the necessary info to introduce the all-contributors bot to our repo.

Feature: composite primary keys

Composite primary keys (a primary key spanning multiple columns) are common in production database tables, and the tool needs to support them too. The proposal is to change object generation so that the primary key columns are declared at the top level, something like:

{
  "type": "object",
  "primary_keys": [ "id", "email" ],
  "id": { 
    "type": "number", 
    "id": {} 
  },
  "email": { 
    "type": "string", 
    "faker": { 
      "generator": "safe_email" 
    } 
  },
  "something_else": 42
}

Note this also applies to foreign keys - they can be composite too.

--random and --seed not working when using postgres generator

When using postgres as the target, the parameters --random and --seed are ignored.

Expected behavior

When running generation against a postgres target with the --random flag twice, generators like uuid should produce different data. Instead, the same data is generated on each run.

Additional context

I looked at the source code. In synth/src/cli/postgres.rs the function Sampler::sample is used, which always uses seed 0. I guess this should be changed to Sampler::sample_seeded?

weight in one_of generator is not affecting probability of variant

Describe the bug
Applying a weight in the one_of generator does not seem to affect how often one variant is chosen over the other.

To Reproduce

Taking a sample from the docs:

  1. Schema (if applicable)
{
  "type": "array",
  "length": 10000,
  "content": {
    "type": "one_of",
    "variants": [
      {
        "weight": 9.5,
        "type": "string",
        "faker": {
          "generator": "address"
        }
      },
      {
        "weight": 0.5,
        "type": "object",
        "postcode": {
          "type": "string",
          "faker": {
            "generator": "post_code"
          }
        },
        "number": {
          "type": "number",
          "subtype": "u64",
          "range": {
            "low": 1,
            "high": 200,
            "step": 2
          }
        }
      }
    ]
  }
}
  2. Results in an aggregated spread of
op1: 4939
op2: 5061

Expected behavior
The weight should affect each variant's probability in the generated data.

Environment (please complete the following information):

  • OS: Win10 Pro v21H1
  • Version: v0.5.4

Use parameterized Statements for Postgres export

Required Functionality
Currently, the postgres export statements are built as strings. See postgres.rs:91 and following.

Proposed Solution
Make it use a prepared statement instead. This requires implementing ToSql on our Value implementation and creating a function that builds the SQL statement from a given list of columns for the namespace.

Use case
Not only will this make the export faster by removing the Value → String → Value round trip, it will also close off avenues for errors caused by failing to escape data (if someone manages to generate Bobby Tables).
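With sqlx (which synth already uses to talk to Postgres), the parameterized version would look roughly like this; the table, columns and surrounding async context are illustrative:

// Inside an async fn, with `pool: sqlx::PgPool` already connected.
// Values travel as bound parameters, so nothing needs escaping and
// Bobby Tables data cannot break out of the statement.
sqlx::query("INSERT INTO users (id, email) VALUES ($1, $2)")
    .bind(user_id)
    .bind(email)
    .execute(&pool)
    .await?;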

XML export / import

Required Functionality
A new export strategy and import to and from XML files.

Proposed Solution
Export to a simple XML file with the namespaces and fields as entities containing the values as text. XSLT can then be used to mangle that into the desired format, as a first 80% solution.

Use case
Some people still work with XML-based tools. An export would make it easier for them to work with synth.

Support SQLite

Required Functionality
I use SQLite in some projects and would like to be able to use synth in those projects as well.

Something like this is what I am looking for:

synth import gen --from sqlite://real.db3
synth generate gen --to sqlite://generated.db3

Proposed Solution
I think this can be done by leveraging the existing SQLite support in sqlx.
I did a very rough implementation here: rasviitanen@938976b to show that it requires minimal effort.

Obviously the implementation needs cleaning up (I copied the MySQL implementation and tweaked it until import/generate worked for the DB I have), but it gives a rough idea of how this could be implemented.

If you are interested, I would be happy to spend some more time on this and make a more refined implementation. But I understand if you don't want to add extra complexity by introducing another DB at this stage.

Use case
I have a bunch of very complex SQLite files with real user data. I want to be able to generate files similar to these using synth.

Add tests for examples (e.g. bank_db)

Required Functionality
Currently we don't have automated tests for our examples (e.g. bank_db).

Proposed Solution
We should have a CI step that at least runs generate on a per-example basis and fails on a non-zero exit code, as sketched below.

Use case
As we get more examples, it's going to become harder and harder to maintain them.
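A minimal version of that CI step could be a shell loop over the example namespaces (paths assumed):

# Fail the build if any example namespace no longer generates.
for ns in examples/*/; do
    synth generate "$ns" --size 10 > /dev/null || exit 1
done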

Unique field that is a text

Hello,

Synth's number type "id" is unique (it can be considered a PK) and is used as an FK in other collections via the same_as type.
"id" does the same as the serial type in PG.

Now, in my PG table I have a field that is a PK but is not a number (i.e. a text or varchar type). In a synth schema, how do I declare this field as unique (a PK in this case)?
Synth's "string" type doesn't offer such an option, and "faker" doesn't guarantee uniqueness across all generated data.

I guess the "string" type could have:

{
    "type": "string",
    "condition": "unique",
    "pattern": "any pattern required"
}

or even:

{
    "type": "one_of",
    "variants": [
        {
            "condition": "unique",
            "type": "string",
            "pattern": "any pattern required"
        }
    ]
}

(no "weight" needed here)

The point here is to impose a "unique" condition on the generated data so that it meets the PK's unique-values requirement.

Just an idea :)
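Until something like a unique condition exists, one workaround already visible in the schema above is the uuid generator, whose values are unique for practical purposes (collision-resistant, though not enforced by any constraint):

{
    "type": "string",
    "uuid": {}
}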

Specify output size per collection

Required Functionality
The ability to specify the number of rows generated for each collection.

Proposed Solution
Currently the specified rows parameter is divided among collections automatically by an internal algorithm.
Being able to specify the number of rows for each collection, either within the collection itself or via some collection-level parameter file, would be nice.

Use case
We have certain tables that need millions of rows and others that need only a handful, to simulate an actual database.
