
PostgreSQL database anonymization tool

Home Page: https://greenmask.io

License: Apache License 2.0

Topics: dump, golang, masking, obfuscation, obfuscator, postgresql, restore, s3, security, security-tools

greenmask's Introduction

Greenmask - dump obfuscation tool

Preface

Greenmask is a powerful open-source utility designed for logical database backup dumping, obfuscation, and restoration, with extensive functionality for anonymization and data masking. It is written in pure Go, includes ported PostgreSQL libraries, and is therefore platform-independent. The tool is stateless and does not require any changes to your database schema, and it is designed to be highly customizable and backward-compatible with the existing PostgreSQL utilities.

Features

  • Cross-platform - Can be easily built and executed on any platform, thanks to its Go-based architecture, which eliminates platform dependencies.
  • Database type safe - Ensures data integrity by validating data and utilizing the database driver for encoding and decoding operations. This approach guarantees the preservation of data formats.
  • Transformation validation and easy maintenance - During obfuscation development, Greenmask provides validation warnings and a transformation diff feature, allowing you to monitor and maintain transformations effectively throughout the software lifecycle.
  • Partitioned tables transformation inheritance - Define transformation configurations once and apply them to all partitions within partitioned tables, simplifying the obfuscation process.
  • Stateless - Greenmask operates as a logical dump and does not impact your existing database schema.
  • Backward compatible - It fully supports the same features and protocols as existing vanilla PostgreSQL utilities. Dumps created by Greenmask can be successfully restored using the pg_restore utility.
  • Extensible - Users have the flexibility to implement domain-based transformations in any programming language or use predefined templates.
  • Declarative - Greenmask allows you to define configurations in a structured, easily parsed, and recognizable format (see the minimal sketch after this list).
  • Integrable - Integrate Greenmask seamlessly into your CI/CD system for automated database obfuscation and restoration.
  • Parallel execution - Take advantage of parallel dumping and restoration, significantly reducing the time required to deliver results.
  • Variety of storages - Greenmask offers multiple storage options for local and remote data, including directories and S3-compatible storage solutions.
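
As a taste of the declarative format, here is a minimal configuration sketch assembled from the examples in the issues further down this page; the values are placeholders, and the exact key layout may differ between Greenmask versions:

common:
  tmp_dir: /tmp

s3:
  bucket: BUCKET_NAME
  region: us-east-1

dump:
  pg_dump_options:
    host: DB_HOST
    dbname: DB_NAME
  transformation:
    - schema: "public"
      name: "users"
      transformers:
        - name: "Hash"
          params:
            column: "email"

restore:
  pg_restore_options:
    jobs: 10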

Use Cases

Greenmask is ideal for various scenarios, including:

  • Backup and Restoration. Use Greenmask for your daily routines involving logical backup dumping and restoration. It seamlessly handles tasks like table restoration after truncation. Its functionality closely mirrors that of pg_dump and pg_restore, making it a straightforward replacement.
  • Anonymization, Transformation, and Data Masking. Employ Greenmask for anonymizing, transforming, and masking backups, especially when setting up a staging environment or for analytical purposes. It simplifies the deployment of a pre-production environment with consistently anonymized data, facilitating faster time-to-market in the development lifecycle.

Our purpose

The Greenmask utility plays a central role in the Greenmask ecosystem. Our goal is to develop a comprehensive, UI-based solution for managing obfuscation procedures. We recognize the challenges of maintaining obfuscation consistency throughout the software lifecycle. Greenmask is dedicated to providing valuable tools and features that ensure the obfuscation process remains fresh, predictable, and transparent.

General Information

The most reliable way to perform logical backup dumping and restoration is to leverage the core PostgreSQL utilities, pg_dump and pg_restore, and Greenmask has been purposefully designed to align with them. Greenmask handles data dumping itself and delegates schema dumping and restoration to pg_dump and pg_restore, maintaining seamless integration with PostgreSQL's standard tools.

Backup Process

The process of backing up PostgreSQL databases is divided into three distinct sections:

  • Pre-data - This section encompasses the raw schema of tables, excluding primary keys (PK) and foreign keys (FK).
  • Data - The data section contains the actual table data in COPY format, including information about sequence current values and Large Objects data.
  • Post-data - In this section, you'll find the definitions of indexes, triggers, rules, and constraints (such as PK and FK).

Greenmask focuses exclusively on the data section during runtime. It delegates the handling of the pre-data and post-data sections to the core PostgreSQL utilities, pg_dump and pg_restore.

Greenmask employs the directory format of pg_dump and pg_restore. This format is particularly suitable for parallel execution and partial restoration, and it includes clear metadata files that aid in determining the backup and restoration steps. Greenmask has been optimized to work seamlessly with remote storage systems and obfuscation procedures.

When performing data dumping, Greenmask utilizes the COPY command in TEXT format, maintaining reliability and compatibility with the vanilla PostgreSQL utilities.

Additionally, Greenmask supports parallel execution, significantly reducing the time required for the dumping process.
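
In practice, a dump is produced with a single command; the invocation below is taken verbatim from one of the issue reports further down this page:

greenmask dump --config config.yaml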

Storage Options

The core PostgreSQL utilities, pg_dump and pg_restore, traditionally operate with files in a directory format, offering no alternative methods. To meet modern backup requirements and provide flexible approaches, Greenmask introduces the concept of Storages.

  • s3 - This option supports any S3-like storage system, including AWS S3, making it versatile and adaptable to various cloud-based storage solutions.
  • directory - This is the standard choice, representing the ordinary filesystem directory for local storage.

Note: If you have suggestions for additional storage options that would be valuable to implement, please feel free to share your ideas. Greenmask aims to accommodate a wide range of storage preferences to suit diverse backup needs.
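
For reference, the S3 storage configuration from one of the issue reports below looks like this (credentials are placeholders):

s3:
  bucket: BUCKET_NAME
  region: us-east-1
  access_key_id: ACCESS_KEY_ID
  secret_access_key: SECRET_ACCESS_KEY

Directory storage is configured through the storage.directory.path setting mentioned in the issues below; note that the key layout has varied between versions, and a unified storage.prefix option is under discussion.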

Restoration Process

In the restoration process, Greenmask combines the capabilities of different tools:

  • Schema Restoration - Greenmask utilizes pg_restore to restore the database schema. This ensures that the schema is accurately reconstructed.
  • Data Restoration - For data restoration, Greenmask independently applies the data using the COPY protocol. This allows Greenmask to handle the data efficiently, especially when working with various storage solutions. Greenmask is aware of the restoration metadata, which enables it to download only the necessary data. This feature is particularly useful for partial restoration scenarios, such as restoring a single table from a complete backup.

Greenmask also supports parallel restoration, which can significantly reduce the time required to complete the restoration process. This parallel execution enhances the efficiency of restoring large datasets.
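
A typical restore invocation, as seen in the issue reports below, restores the most recent dump (latest); parallelism and database creation are driven by pg_restore options in the config:

greenmask restore --config config.yml latest

restore:
  pg_restore_options:
    create: true
    jobs: 10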

Data Obfuscation and Validation

Greenmask works with COPY lines, collects schema metadata using a Go PostgreSQL driver, and uses that driver for encoding and decoding values. The validate command offers a way to assess the impact on both the schema (validation warnings) and the data (transformation diffs), letting you verify that transformations produce the desired outcome during obfuscation development.
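
For example, one of the issue reports below runs a full validation pass like this, printing schema warnings plus a before/after diff for a single transformed row per table:

greenmask --config greenmask.yml validate --warnings --data --diff --schema --format=text --table-format=vertical --transformed-only --rows-limit=1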

Customization

If your table schema relies on functional dependencies between columns, you can address this challenge using the TemplateRecord transformer. This transformer enables you to define transformation logic for entire tables, offering type-safe operations when assigning new values.
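
A sketch of what this can look like, following the transformation config shape used throughout the issues below; the parameter names and template syntax here are illustrative assumptions, not the documented API:

    - schema: "public"
      name: "users"
      transformers:
        - name: "TemplateRecord"
          params:
            # hypothetical parameters; consult the documentation for the real API
            columns: ["first_name", "email"]
            # keep email consistent with the transformed first_name
            template: '{{ .SetValue "email" (print (.GetValue "first_name") "@example.com") }}'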

Greenmask provides a framework for creating your custom transformers, which can be reused efficiently. These transformers can be seamlessly integrated without requiring recompilation, thanks to the PIPE (stdin/stdout) interaction.

Furthermore, Greenmask's architecture is designed to be highly extensible, making it possible to introduce other interaction protocols, such as HTTP or Socket, for conducting obfuscation procedures.

PostgreSQL Version Compatibility

Greenmask is compatible with PostgreSQL versions 11 and higher.

References

  • The Demo database provided by PostgresPro is used for integration testing.
  • The adventureworks database created by morenoh149/postgresDBSamples is used in the Docker Compose playground.

greenmask's People

Contributors: gracingpro, joao-zanutto, tarbaev-vl, wwoytenko

greenmask's Issues

Add db metadata to storage path

As discussed in #56, we'll be adding the Database name in the storage path to logically separate dumps without the need to change storage configuration when pointing Greenmask to different databases.

This will impact the commands below that will need to be adapted:

  • Dump
  • Restore
  • Validate
  • Show dump
  • List dump

Concerns:

  • What if the user has two different database hosts in the cloud with the same database name? (i.e. two RDS instances, one for dev and another for production, but both have a greenmask database)
    • should the dbhost also be used in the path, or will users need to address that themselves by adjusting a path/prefix value?
  • What if the config is defined like dbname: "host=localhost port=50022 user=foobar dbname=foobar" ?
    • should we disallow the dbname config to be declared like that or just parse the value?
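
For illustration, the proposal would produce a storage layout along these lines (the template below is a hypothetical sketch, not the implemented scheme):

s3://BUCKET_NAME/{{ prefix }}/{{ dbname }}/{{ dumpId }}/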

panic: runtime error: slice bounds out of range [:18] with capacity 17

I am getting the following error when I use the RandomString transformer:

panic: runtime error: slice bounds out of range [:18] with capacity 17

goroutine 614 [running]:
github.com/greenmaskio/greenmask/internal/db/postgres/pgcopy.(*Row).GetColumn(0x1400007bbc8?, 0x101125824?)
	/Users/jsutherland/greenmask/internal/db/postgres/pgcopy/row.go:108 +0x114
github.com/greenmaskio/greenmask/pkg/toolkit.(*Record).GetRawColumnValueByIdx(...)
	/Users/jsutherland/greenmask/pkg/toolkit/record.go:192
github.com/greenmaskio/greenmask/internal/db/postgres/transformers.(*FakeTransformer).Transform(0x140000ba2d0, {0x1023ad6a8?, 0x1?}, 0x14000164ed0)
	/Users/jsutherland/greenmask/internal/db/postgres/transformers/random_faker.go:316 +0x38
github.com/greenmaskio/greenmask/internal/db/postgres/dumpers.(*TransformationPipeline).TransformSync(0x14000880420, {0x101d38cb8, 0x14000046320}, 0x14000164e40?)
	/Users/jsutherland/greenmask/internal/db/postgres/dumpers/transformation_pipeline.go:127 +0x88
github.com/greenmaskio/greenmask/internal/db/postgres/dumpers.(*TransformationPipeline).Dump(0x14000880420, {0x101d38cb8, 0x14000046320}, {0x14000c357f1?, 0x14000164ed0?, 0x140008121b0?})
	/Users/jsutherland/greenmask/internal/db/postgres/dumpers/transformation_pipeline.go:153 +0xf8
github.com/greenmaskio/greenmask/internal/db/postgres/dumpers.(*TableDumper).process(0x1400000e078, {0x101d38cb8, 0x14000046320}, {0x101d3dcc8?, 0x1400000e168?}, {0x1497022b8?, 0x1400000e1c8?}, {0x101d39070, 0x14000880420})
	/Users/jsutherland/greenmask/internal/db/postgres/dumpers/table.go:153 +0x308
github.com/greenmaskio/greenmask/internal/db/postgres/dumpers.(*TableDumper).Execute.func2()
	/Users/jsutherland/greenmask/internal/db/postgres/dumpers/table.go:93 +0x280
golang.org/x/sync/errgroup.(*Group).Go.func1()
	/Users/jsutherland/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:75 +0x58
created by golang.org/x/sync/errgroup.(*Group).Go in goroutine 462
	/Users/jsutherland/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:72 +0x98

My configuration looks like this:

    - schema: "public"
      name: "authentications"
      transformers:
        - name: "RandomString"
          params:
            column: "uid"
            min_length: 7
            max_length: 60

When I comment out that section, the script runs fine. When I include it back, the script fails.

I am using Postgres 12.10-alpine running in Docker on macOS 14.2.1

Can you help me resolve this?

Thank you.

locale_provider not recognized during restore with create database true

When I attempt to restore into a new Postgres instance, with the option to create a new database enabled, I get an error that locale_provider is not recognized. I have searched online for more information on this, but I haven't found anything relevant.

Would you have any pointers on what I need to do here? I could create the required database manually first, but it would be nice not to have to do that.

Postgres 13.5

restore:
  pg_restore_options:
    create: true
    jobs: 10

The log output:

2024-05-01T01:34:41Z INF restoring dump dumpId=1714514083137
2024-05-01T01:34:41Z INF stderr forwarding Executable=/usr/bin/pg_restore Stderr="pg_restore: error: could not execute query: ERROR:  option \"locale_provider\" not recognized"
2024-05-01T01:34:41Z INF stderr forwarding Executable=/usr/bin/pg_restore Stderr="LINE 1: ...plrds WITH TEMPLATE = template0 ENCODING = 'UTF8' LOCALE_PRO..."
2024-05-01T01:34:41Z INF stderr forwarding Executable=/usr/bin/pg_restore Stderr="                                                             ^"
2024-05-01T01:34:41Z INF stderr forwarding Executable=/usr/bin/pg_restore Stderr="Command was: CREATE DATABASE kissvtsplrds WITH TEMPLATE = template0 ENCODING = 'UTF8' LOCALE_PROVIDER = libc LOCALE = 'en_US.UTF-8';"
2024-05-01T01:34:41Z INF stderr forwarding Executable=/usr/bin/pg_restore Stderr=
2024-05-01T01:34:41Z INF stderr forwarding Executable=/usr/bin/pg_restore Stderr=
2024-05-01T01:34:41Z INF stderr forwarding Executable=/usr/bin/pg_restore Stderr="pg_restore: error: could not execute query: ERROR:  database \"kissvtsplrds\" does not exist"
2024-05-01T01:34:41Z INF stderr forwarding Executable=/usr/bin/pg_restore Stderr="Command was: ALTER DATABASE kissvtsplrds OWNER TO postgres;"
2024-05-01T01:34:41Z INF stderr forwarding Executable=/usr/bin/pg_restore Stderr=
2024-05-01T01:34:41Z INF stderr forwarding Executable=/usr/bin/pg_restore Stderr="pg_restore: error: reconnection failed: connection to server at \"kis-dev-spl.cluster-c4nuvgjpjzrh.ap-southeast-2.rds.amazonaws.com\" (10.250.14.248), port 5432 failed: FATAL:  database \"kissvtsplrds\" does not exist"

Add prefix to storage config

As discussed in #56, the storage.prefix config should be added to work with both storage types directory and s3, meaning that s3.prefix will be deprecated.

Concerns:

  • Will the storage.directory.path config remain the same, or should it be removed to make way for storage.prefix as well?
    • if so, which will be the final storage path if the user configuration defines both?
      • {{ storage.prefix }} / {{ storage.directory.path }} or
      • {{ storage.directory.path }} / {{ storage.prefix }}

Panic using RandomString when specifying symbols

I have a table with a varchar(255) column for which I want to generate a random ID while dumping (this column only has NULL values in the original database). Here's the config I'm trying to use:

    - schema: public
      name: surgery_patient
      transformers:
        - name: RandomString
          params:
            column: permanent_identification_number
            symbols: 0123456789
            min_length: 20
            max_length: 20
            keep_null: false

When running dump or validate, greenmask fails with the following error:

greenmask --config greenmask.yml validate   --warnings   --data   --diff   --schema   --format=text   --table-format=vertical   --transformed-only   --rows-limit=1

panic: runtime error: index out of range [9] with length 9

goroutine 185 [running]:
github.com/greenmaskio/greenmask/internal/db/postgres/transformers/utils.RandomString(0xc000538b48?, 0x42523c?, 0x14, {0xc000708450, 0x9, 0x7f2cbbfc2ca8?}, {0xc000038140, 0x14, 0x14})
	/home/runner/work/greenmask/greenmask/internal/db/postgres/transformers/utils/transformation_funcs.go:142 +0x114
github.com/greenmaskio/greenmask/internal/db/postgres/transformers.(*RandomStringTransformer).Transform(0xc000144e70, {0x1e0?, 0xf44720?}, 0xc0004c9b60)
	/home/runner/work/greenmask/greenmask/internal/db/postgres/transformers/random_string.go:148 +0xf3
github.com/greenmaskio/greenmask/internal/db/postgres/dumpers.(*TransformationPipeline).TransformSync(0xc000814c00, {0x13afef8, 0xc0007240a0}, 0x100ffffffff?)
	/home/runner/work/greenmask/greenmask/internal/db/postgres/dumpers/transformation_pipeline.go:127 +0xa2
github.com/greenmaskio/greenmask/internal/db/postgres/dumpers.(*TransformationPipeline).Dump(0xc000814c00, {0x13afef8, 0xc0007240a0}, {0xc00073a035?, 0xc000538dc0?, 0x13aec90?})
	/home/runner/work/greenmask/greenmask/internal/db/postgres/dumpers/transformation_pipeline.go:153 +0x119
github.com/greenmaskio/greenmask/internal/db/postgres/dumpers.(*ValidationPipeline).Dump(0xc00007a068, {0x13afef8, 0xc0007240a0}, {0xc00073a035, 0xcb, 0xcb})
	/home/runner/work/greenmask/greenmask/internal/db/postgres/dumpers/validation_pipeline.go:33 +0x1c6
github.com/greenmaskio/greenmask/internal/db/postgres/dumpers.(*TableDumper).process(0xc0001244c8, {0x13afef8, 0xc0007240a0}, {0x13b52a8?, 0xc000010a98?}, {0x7f2cbbdf98a0?, 0xc0006960f0?}, {0x13b0278, 0xc00007a068})
	/home/runner/work/greenmask/greenmask/internal/db/postgres/dumpers/table.go:151 +0x3ad
github.com/greenmaskio/greenmask/internal/db/postgres/dumpers.(*TableDumper).Execute.func2()
	/home/runner/work/greenmask/greenmask/internal/db/postgres/dumpers/table.go:91 +0x305
golang.org/x/sync/errgroup.(*Group).Go.func1()
	/home/runner/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:78 +0x56
created by golang.org/x/sync/errgroup.(*Group).Go in goroutine 152
	/home/runner/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:75 +0x96

If I remove the symbols param, it works as expected.

I'm using greenmask 0.1.9 on Fedora 39 (the linux-amd64 build).

greenmask restore fails for generated columns

I'm running greenmask restore --config config.yml latest and it fails when trying to restore a table with a generated column. Even when testing without applying transformations to any of the columns of the table that contains the generated column, I get the following error message:
FTL fatal error="data stage restoration error: at least one worker exited with error: unable to perform restoration task (worker 4 restoring table \"public\".\"asdfasdf\"): error from postgres connection msg = column \"state\" is a generated column code=42P10"

According to the Postgres documentation, "A generated column cannot be written to directly. In INSERT or UPDATE commands, a value cannot be specified for a generated column, but the keyword DEFAULT may be specified." So I have also tried to apply the following transformation:

dump:
  transformation:
    - schema: 'public'
      name: 'asdfasdf'
      transformers:
        - name: 'Replace'
          params:
            column: 'state'
            value: DEFAULT

but I get the same error message. I only get the restore to work if I exclude the table with the generated column from the dump. I don't know if this is a bug or if greenmask simply doesn't support this type of column. I hope you can give me an answer.

My specs:

  • Greenmask 0.1.10
  • PostgreSQL 14.11

Json transform with value_template does not work

Hi - First of all, this looks like an awesome tool! Especially the ability to transform nested JSON objects.

However, I'm encountering an issue when trying to use a value_template with the set operation.

Here is the relevant part of my config:

    - schema: "public"
      name: "fitness_package_temp"
      transformers:
        - name: "Json"
          params:
            column: "profile_data"
            operations:
              - operation: "set"
                path: "weigdddht"
                error_not_exists: true
                value_template: \"test\"

No matter what value template I put in, the column is set to null. I also tried setting error_not_exists: true and using a key that doesn't exist, but no error is raised.

greenmask restore fails without error messages or exiting

I am running greenmask restore --config config.yml, and it is failing after an hour or so of restoring a 70 GB database. It stops running, but it does not exit, and does not display any error messages. I know it has stopped working because htop no longer shows the process.

Here are my specs:

  • greenmask 0.1.6
  • t2.micro EC2 instance
  • Amazon Linux 2023
  • 1 CPU, 1 GB RAM
  • 60 GB storage
  • RDS Aurora Postgres DB
  • DB size is 70 GB (14.5 GB compressed)

How do I get greenmask to finish?

Feature: add JSON parsing to dump.transformation attribute

Currently, the only way to pass a configuration to dump.transformation is through YAML, making it imperative to use a config file to configure a transformation.
Adding a JSON parser to this attribute would allow users to configure Greenmask entirely from environment variables, without needing to mount any volume or file.

This is especially useful when running Greenmask from a container: many cloud providers offer container platforms with environment variable and secret management integrated at no additional cost, whereas preparing and mounting a volume requires additional configuration and planning, along with other infrastructure considerations.
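
To illustrate, the transformation config from the Hash issue further down, expressed as a JSON environment-variable value (the variable name is a hypothetical placeholder):

GREENMASK_DUMP_TRANSFORMATION='[{"schema": "core", "name": "users", "transformers": [{"name": "Hash", "params": {"column": "email"}}]}]'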

Feature: conditional transform

A conditional transform states a SQL condition used to decide whether or not to transform a row. In Datanymizer, a WHERE clause is given as a string, and this API seems to work well. In the example below, groups is a table.

  - name: groups
    query:
      transform_condition: "id NOT IN (select group_id FROM employee_groups)"

Datanymizer implemented this by adding NOT to the given query. I fixed an issue where adding NOT also required proper NULL-checking behavior: datanymizer/datanymizer@24e2521

Dict transformer doesn't match values

In a database, I'd like to transform values of a column using the Dict transformer. The original database has this:

[screenshot: original values of the provider.name column]

The transformer used is configured like this

  transformation:

    - schema: public
      name: provider
      transformers:
        - name: Dict
          params:
            column: name
            values:
              Clinique Louis Pasteur Nancy: "Établissement 1"
              Clinique Ambroise Paré Thionville: "Établissement 2"
              Polyclinique La Ligne bleue: "Établissement 3"
              Clinique Jeanne d'Arc: "Établissement 4"
            # fail_not_matched: false

Yet, when validating or dumping, greenmask fails with:

2024-04-25T16:06:46+02:00 WRN error flushing gzip buffer error="io: read/write on closed pipe"
2024-04-25T16:06:46+02:00 WRN error closing TableDumper writer error="error closing gzip writer: io: read/write on closed pipe"
2024-04-25T16:06:46+02:00 WRN error flushing gzip buffer error="io: read/write on closed pipe"
2024-04-25T16:06:46+02:00 WRN error closing TableDumper writer error="error closing gzip writer: io: read/write on closed pipe"
2024-04-25T16:06:46+02:00 WRN error flushing gzip buffer error="io: read/write on closed pipe"
2024-04-25T16:06:46+02:00 WRN error closing TableDumper writer error="error closing gzip writer: io: read/write on closed pipe"
2024-04-25T16:06:46+02:00 FTL cannot make a backup error="data stage dumping error: at least one worker exited with error: error processing table dump: dump error: dump error on table public.provider at line 1: dump error on table public.provider at line 1: unable to match value for \"Polyclinique La Ligne bleue\""

I tried quoting the keys (single and double quotes) in the greenmask config, with no difference. I even tried with simple keys (without spaces or special chars), with the same result.

Env var values not being loaded without a config file definition

@wwoytenko I encountered a problem where environment variables are not being loaded if their config isn't defined in a config file. It seems that this is a known Viper issue: spf13/viper#584

I'll come up with a PR to solve this issue, changing the common.tmp_dir default definition.

We can also utilize this PR to set a default behavior for the storage config. I see these possible scenarios:

  • Default to directory storage at ~/dumps -- this would allow Docker users to map a volume to /home/greenmask/dumps and use it without needing to specify the storage.directory.path configuration
  • Require storage configuration and error out in case none is provided

S3 upload error: region missing

Hello guys, I'm trying to run the project locally using the latest Docker image provided on Docker Hub, but I'm getting an error message saying that the region can't be found in my configuration. Here is what my config.yaml file looks like:

common:
  tmp_dir: /home/temp

log:
  level: debug

s3:
  bucket: BUCKET_NAME
  region: us-east-1
  access_key_id: ACCESS_KEY_ID
  secret_access_key: SECRET_ACCESS_KEY

dump:
  pg_dump_options:
    host: DB_HOST
    dbname: DB_NAME

restore:
  pg_restore_options:
    host: DB_HOST
    dbname: DB_NAME

Here are the log messages I'm getting (DB host omitted):

root@173f2c271d32:/home# greenmask dump --config config.yaml
2024-03-27T15:17:24Z DBG ../var/lib/greenmask/internal/db/postgres/cmd/dump.go:145 > performing snapshot export pid=390
2024-03-27T15:17:26Z DBG ../var/lib/greenmask/internal/db/postgres/pgdump/pgdump.go:44 > pg_dump: pg_dump --file /home/temp/1711552642497597375 --format d --schema-only --snapshot 00000005-00069849-1 --dbname postgres --host DB_HOST --username postgres
 pid=390
2024-03-27T15:17:34Z DBG ../var/lib/greenmask/internal/db/postgres/cmd/dump.go:197 > reading schema section pid=390
2024-03-27T15:17:34Z DBG ../var/lib/greenmask/internal/db/postgres/cmd/dump.go:226 > planned 1 workers pid=390
2024-03-27T15:17:36Z DBG ../var/lib/greenmask/internal/db/postgres/cmd/dump.go:547 > exited normally WorkerId=1 pid=390
2024-03-27T15:17:36Z DBG ../var/lib/greenmask/internal/db/postgres/cmd/dump.go:331 > all the data have been dumped pid=390
2024-03-27T15:17:36Z DBG ../var/lib/greenmask/internal/db/postgres/cmd/dump.go:336 > merging toc entries pid=390
2024-03-27T15:17:36Z DBG ../var/lib/greenmask/internal/db/postgres/cmd/dump.go:342 > writing built toc file into storage pid=390
2024-03-27T15:17:36Z DBG ../var/lib/greenmask/internal/storages/s3/logger.go:33 > s3 storage logging 0="DEBUG: Validate Request s3/PutObject failed, not retrying, error MissingRegion: could not find region configuration" pid=390
2024-03-27T15:17:36Z DBG ../var/lib/greenmask/internal/storages/s3/logger.go:33 > s3 storage logging 0="DEBUG: Build Request s3/PutObject failed, not retrying, error MissingRegion: could not find region configuration" pid=390
2024-03-27T15:17:36Z DBG ../var/lib/greenmask/internal/storages/s3/logger.go:33 > s3 storage logging 0="DEBUG: Sign Request s3/PutObject failed, not retrying, error MissingRegion: could not find region configuration" pid=390
2024-03-27T15:17:36Z FTL ../var/lib/greenmask/cmd/greenmask/cmd/dump/dump.go:58 > cannot make a backup error="mergeAndWriteToc stage dumping error: s3 object uploading error: MissingRegion: could not find region configuration" pid=390

I've even tried exporting an AWS_REGION environment variable before executing greenmask, but had no luck. Looking forward to hearing from you guys; this project is amazing!

Let me know if there is anything I can help with

Hash transformer is too slow

I'm currently using RandomUuid for most of the columns, but I was asked to hash the original values instead, so that the same input always produces the same masked value.

I've replaced RandomUuid with Hash, and what used to take less than a minute to dump/transform the data now takes 30 minutes.

This is what the transformation config looks like (a similar block is repeated for 6 tables):

    - schema: core
      name: users
      transformers:
        - name: Hash
          params:
            column: email
        - name: Hash
          params:
            column: first_name
        - name: Hash
          params:
            column: last_name

Restore runs out of memory

I'm trying to restore a dump from our production database, but the restore command ends up being killed because it runs out of memory.

It happens at the same table each time. The machine has 8 GB of memory, even though the table is only 2 GB according to metadata.json. The table has some large text columns (10k chars), so I'm not sure if that plays into it.

My guess is that greenmask is loading the entire dump for the table into memory while restoring, but my go-fu is not strong enough to figure out if that's what's actually happening :/

Restore fails at post-data stage

About half of the time when we run greenmask restore, the post-data stage fails with the following error:

FTL ../home/runner/work/greenmask/greenmask/cmd/greenmask/cmd/restore/restore.go:68 > fatal error="post-data stage restoration error: cannot start transaction: write failed: write tcp 192.168.1.212:56750->192.168.3.203:5432: write: connection reset by peer" pid=354151

My guess is that since the same connection is being reused in restore.Run https://github.com/GreenmaskIO/greenmask/blob/c21cc3b99fbfd61d842007658337d466c65d6bca/internal/db/postgres/cmd/restore.go#L480C19-L480C22, and since the data restoring stage takes several hours, the connection is timed out by the server.

I can try my hand at creating a PR that opens a separate connection for each stage, if you want?
