
PostgreSQL database anonymization tool

Home Page: https://greenmask.io

License: Apache License 2.0

Topics: dump, golang, masking, obfuscation, obfuscator, postgresql, restore, s3, security, security-tools

greenmask's Introduction

Greenmask - dump obfuscation tool

Preface

Greenmask is a powerful open-source utility designed for logical database backup dumping, obfuscation, and restoration, with extensive functionality for anonymization and data masking. It is written in pure Go, includes ported PostgreSQL libraries, and is therefore platform-independent. The tool is stateless and does not require any changes to your database schema, and it is designed to be highly customizable and backward-compatible with the existing PostgreSQL utilities.

Features

  • Cross-platform - Can be easily built and executed on any platform, thanks to its Go-based architecture, which eliminates platform dependencies.
  • Database type safe - Ensures data integrity by validating data and utilizing the database driver for encoding and decoding operations. This approach guarantees the preservation of data formats.
  • Transformation validation and easy maintenance - During obfuscation development, Greenmask provides validation warnings and a transformation diff feature, allowing you to monitor and maintain transformations effectively throughout the software lifecycle.
  • Partitioned tables transformation inheritance - Define transformation configurations once and apply them to all partitions within partitioned tables, simplifying the obfuscation process.
  • Stateless - Greenmask operates as a logical dump and does not impact your existing database schema.
  • Backward compatible - It fully supports the same features and protocols as existing vanilla PostgreSQL utilities. Dumps created by Greenmask can be successfully restored using the pg_restore utility.
  • Extensible - Users have the flexibility to implement domain-based transformations in any programming language or use predefined templates.
  • Declarative - Greenmask allows you to define configurations in a structured, easily parsed, and recognizable format (see the minimal sketch after this list).
  • Integrable - Integrate Greenmask seamlessly into your CI/CD system for automated database obfuscation and restoration.
  • Parallel execution - Take advantage of parallel dumping and restoration, significantly reducing the time required to deliver results.
  • Variety of storages - Greenmask offers multiple storage options for local and remote data, including directories and S3-compatible storage solutions.
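
As a taste of the declarative format, here is a minimal configuration sketch assembled from the examples in the issues further down this page; the values are placeholders, and the exact key layout may differ between Greenmask versions:

common:
  tmp_dir: /tmp

s3:
  bucket: BUCKET_NAME
  region: us-east-1

dump:
  pg_dump_options:
    host: DB_HOST
    dbname: DB_NAME
  transformation:
    - schema: "public"
      name: "users"
      transformers:
        - name: "Hash"
          params:
            column: "email"

restore:
  pg_restore_options:
    jobs: 10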

Use Cases

Greenmask is ideal for various scenarios, including:

  • Backup and Restoration. Use Greenmask for your daily routines involving logical backup dumping and restoration. It seamlessly handles tasks like table restoration after truncation. Its functionality closely mirrors that of pg_dump and pg_restore, making it a straightforward replacement.
  • Anonymization, Transformation, and Data Masking. Employ Greenmask for anonymizing, transforming, and masking backups, especially when setting up a staging environment or for analytical purposes. It simplifies the deployment of a pre-production environment with consistently anonymized data, facilitating faster time-to-market in the development lifecycle.

Our purpose

The Greenmask utility plays a central role in the Greenmask ecosystem. Our goal is to develop a comprehensive, UI-based solution for managing obfuscation procedures. We recognize the challenges of maintaining obfuscation consistency throughout the software lifecycle. Greenmask is dedicated to providing valuable tools and features that ensure the obfuscation process remains fresh, predictable, and transparent.

General Information

The most reliable way to perform logical backup dumping and restoration is to leverage the core PostgreSQL utilities, pg_dump and pg_restore, and Greenmask has been purposefully designed to align with them. Greenmask handles data dumping itself and delegates schema dumping and restoration to pg_dump and pg_restore, maintaining seamless integration with PostgreSQL's standard tools.

Backup Process

The process of backing up PostgreSQL databases is divided into three distinct sections:

  • Pre-data - This section encompasses the raw schema of tables, excluding primary keys (PK) and foreign keys (FK).
  • Data - The data section contains the actual table data in COPY format, including information about sequence current values and Large Objects data.
  • Post-data - In this section, you'll find the definitions of indexes, triggers, rules, and constraints (such as PK and FK).

Greenmask focuses exclusively on the data section during runtime. It delegates the handling of the pre-data and post-data sections to the core PostgreSQL utilities, pg_dump and pg_restore.

Greenmask employs the directory format of pg_dump and pg_restore. This format is particularly suitable for parallel execution and partial restoration, and it includes clear metadata files that aid in determining the backup and restoration steps. Greenmask has been optimized to work seamlessly with remote storage systems and obfuscation procedures.

When performing data dumping, Greenmask utilizes the COPY command in TEXT format, maintaining reliability and compatibility with the vanilla PostgreSQL utilities.

Additionally, Greenmask supports parallel execution, significantly reducing the time required for the dumping process.
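
In practice, a dump is produced with a single command; the invocation below is taken verbatim from one of the issue reports further down this page:

greenmask dump --config config.yaml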

Storage Options

The core PostgreSQL utilities, pg_dump and pg_restore, traditionally operate with files in a directory format, offering no alternative methods. To meet modern backup requirements and provide flexible approaches, Greenmask introduces the concept of Storages.

  • s3 - This option supports any S3-like storage system, including AWS S3, making it versatile and adaptable to various cloud-based storage solutions.
  • directory - This is the standard choice, representing the ordinary filesystem directory for local storage.

Note: If you have suggestions for additional storage options that would be valuable to implement, please feel free to share your ideas. Greenmask aims to accommodate a wide range of storage preferences to suit diverse backup needs.
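
For reference, the S3 storage configuration from one of the issue reports below looks like this (credentials are placeholders):

s3:
  bucket: BUCKET_NAME
  region: us-east-1
  access_key_id: ACCESS_KEY_ID
  secret_access_key: SECRET_ACCESS_KEY

Directory storage is configured through the storage.directory.path setting mentioned in the issues below; note that the key layout has varied between versions, and a unified storage.prefix option is under discussion.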

Restoration Process

In the restoration process, Greenmask combines the capabilities of different tools:

  • Schema Restoration - Greenmask utilizes pg_restore to restore the database schema. This ensures that the schema is accurately reconstructed.
  • Data Restoration - For data restoration, Greenmask independently applies the data using the COPY protocol. This allows Greenmask to handle the data efficiently, especially when working with various storage solutions. Greenmask is aware of the restoration metadata, which enables it to download only the necessary data. This feature is particularly useful for partial restoration scenarios, such as restoring a single table from a complete backup.

Greenmask also supports parallel restoration, which can significantly reduce the time required to complete the restoration process. This parallel execution enhances the efficiency of restoring large datasets.
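
A typical restore invocation, as seen in the issue reports below, restores the most recent dump (latest); parallelism and database creation are driven by pg_restore options in the config:

greenmask restore --config config.yml latest

restore:
  pg_restore_options:
    create: true
    jobs: 10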

Data Obfuscation and Validation

Greenmask works with COPY lines, collects schema metadata using a Go PostgreSQL driver, and uses that driver for encoding and decoding values. The validate command offers a way to assess the impact on both the schema (validation warnings) and the data (transformation diffs), letting you verify that transformations produce the desired outcome during obfuscation development.
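
For example, one of the issue reports below runs a full validation pass like this, printing schema warnings plus a before/after diff for a single transformed row per table:

greenmask --config greenmask.yml validate --warnings --data --diff --schema --format=text --table-format=vertical --transformed-only --rows-limit=1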

Customization

If your table schema relies on functional dependencies between columns, you can address this challenge using the TemplateRecord transformer. This transformer enables you to define transformation logic for entire tables, offering type-safe operations when assigning new values.
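
A sketch of what this can look like, following the transformation config shape used throughout the issues below; the parameter names and template syntax here are illustrative assumptions, not the documented API:

    - schema: "public"
      name: "users"
      transformers:
        - name: "TemplateRecord"
          params:
            # hypothetical parameters; consult the documentation for the real API
            columns: ["first_name", "email"]
            # keep email consistent with the transformed first_name
            template: '{{ .SetValue "email" (print (.GetValue "first_name") "@example.com") }}'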

Greenmask provides a framework for creating your custom transformers, which can be reused efficiently. These transformers can be seamlessly integrated without requiring recompilation, thanks to the PIPE (stdin/stdout) interaction.

Furthermore, Greenmask's architecture is designed to be highly extensible, making it possible to introduce other interaction protocols, such as HTTP or Socket, for conducting obfuscation procedures.

PostgreSQL Version Compatibility

Greenmask is compatible with PostgreSQL versions 11 and higher.

References

  • The Demo database provided by PostgresPro is used for integration testing.
  • The adventureworks database created by morenoh149/postgresDBSamples is used in the Docker Compose playground.

greenmask's People

Contributors: gracingpro, joao-zanutto, tarbaev-vl, wwoytenko

greenmask's Issues

Add db metadata to storage path

As discussed in #56, we'll be adding the Database name in the storage path to logically separate dumps without the need to change storage configuration when pointing Greenmask to different databases.

This will impact the commands below that will need to be adapted:

  • Dump
  • Restore
  • Validate
  • Show dump
  • List dump

Concerns:

  • What if the user has two different database hosts in the cloud with the same database name? (i.e. two RDS instances, one for dev and another for production, but both have a greenmask database)
    • should the dbhost also be used in the path, or will users need to address that themselves by adjusting a path/prefix value?
  • What if the config is defined like dbname: "host=localhost port=50022 user=foobar dbname=foobar" ?
    • should we disallow the dbname config to be declared like that or just parse the value?
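
For illustration, the proposal would produce a storage layout along these lines (the template below is a hypothetical sketch, not the implemented scheme):

s3://BUCKET_NAME/{{ prefix }}/{{ dbname }}/{{ dumpId }}/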

panic: runtime error: slice bounds out of range [:18] with capacity 17

I am getting the following error when I use the RandomString transformer:

panic: runtime error: slice bounds out of range [:18] with capacity 17

goroutine 614 [running]:
github.com/greenmaskio/greenmask/internal/db/postgres/pgcopy.(*Row).GetColumn(0x1400007bbc8?, 0x101125824?)
	/Users/jsutherland/greenmask/internal/db/postgres/pgcopy/row.go:108 +0x114
github.com/greenmaskio/greenmask/pkg/toolkit.(*Record).GetRawColumnValueByIdx(...)
	/Users/jsutherland/greenmask/pkg/toolkit/record.go:192
github.com/greenmaskio/greenmask/internal/db/postgres/transformers.(*FakeTransformer).Transform(0x140000ba2d0, {0x1023ad6a8?, 0x1?}, 0x14000164ed0)
	/Users/jsutherland/greenmask/internal/db/postgres/transformers/random_faker.go:316 +0x38
github.com/greenmaskio/greenmask/internal/db/postgres/dumpers.(*TransformationPipeline).TransformSync(0x14000880420, {0x101d38cb8, 0x14000046320}, 0x14000164e40?)
	/Users/jsutherland/greenmask/internal/db/postgres/dumpers/transformation_pipeline.go:127 +0x88
github.com/greenmaskio/greenmask/internal/db/postgres/dumpers.(*TransformationPipeline).Dump(0x14000880420, {0x101d38cb8, 0x14000046320}, {0x14000c357f1?, 0x14000164ed0?, 0x140008121b0?})
	/Users/jsutherland/greenmask/internal/db/postgres/dumpers/transformation_pipeline.go:153 +0xf8
github.com/greenmaskio/greenmask/internal/db/postgres/dumpers.(*TableDumper).process(0x1400000e078, {0x101d38cb8, 0x14000046320}, {0x101d3dcc8?, 0x1400000e168?}, {0x1497022b8?, 0x1400000e1c8?}, {0x101d39070, 0x14000880420})
	/Users/jsutherland/greenmask/internal/db/postgres/dumpers/table.go:153 +0x308
github.com/greenmaskio/greenmask/internal/db/postgres/dumpers.(*TableDumper).Execute.func2()
	/Users/jsutherland/greenmask/internal/db/postgres/dumpers/table.go:93 +0x280
golang.org/x/sync/errgroup.(*Group).Go.func1()
	/Users/jsutherland/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:75 +0x58
created by golang.org/x/sync/errgroup.(*Group).Go in goroutine 462
	/Users/jsutherland/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:72 +0x98

My configuration looks like this:

    - schema: "public"
      name: "authentications"
      transformers:
        - name: "RandomString"
          params:
            column: "uid"
            min_length: 7
            max_length: 60

When I comment out that section, the script runs fine. When I include it back, the script fails.

I am using Postgres 12.10-alpine running in Docker on macOS 14.2.1

Can you help me resolve this?

Thank you.

locale_provider not recognized during restore with create database true

When I attempt to restore into a new Postgres instance, with the option to create a new database enabled, I get an error that locale_provider is not recognized. I have searched online for more information on this, but I haven't found anything relevant.

Would you have any pointers on what I need to do here? I could create the required database manually first, but it would be nice not to have to do that.

Postgres 13.5

restore:
  pg_restore_options:
    create: true
    jobs: 10

The log output:

2024-05-01T01:34:41Z INF restoring dump dumpId=1714514083137
2024-05-01T01:34:41Z INF stderr forwarding Executable=/usr/bin/pg_restore Stderr="pg_restore: error: could not execute query: ERROR:  option \"locale_provider\" not recognized"
2024-05-01T01:34:41Z INF stderr forwarding Executable=/usr/bin/pg_restore Stderr="LINE 1: ...plrds WITH TEMPLATE = template0 ENCODING = 'UTF8' LOCALE_PRO..."
2024-05-01T01:34:41Z INF stderr forwarding Executable=/usr/bin/pg_restore Stderr="                                                             ^"
2024-05-01T01:34:41Z INF stderr forwarding Executable=/usr/bin/pg_restore Stderr="Command was: CREATE DATABASE kissvtsplrds WITH TEMPLATE = template0 ENCODING = 'UTF8' LOCALE_PROVIDER = libc LOCALE = 'en_US.UTF-8';"
2024-05-01T01:34:41Z INF stderr forwarding Executable=/usr/bin/pg_restore Stderr=
2024-05-01T01:34:41Z INF stderr forwarding Executable=/usr/bin/pg_restore Stderr=
2024-05-01T01:34:41Z INF stderr forwarding Executable=/usr/bin/pg_restore Stderr="pg_restore: error: could not execute query: ERROR:  database \"kissvtsplrds\" does not exist"
2024-05-01T01:34:41Z INF stderr forwarding Executable=/usr/bin/pg_restore Stderr="Command was: ALTER DATABASE kissvtsplrds OWNER TO postgres;"
2024-05-01T01:34:41Z INF stderr forwarding Executable=/usr/bin/pg_restore Stderr=
2024-05-01T01:34:41Z INF stderr forwarding Executable=/usr/bin/pg_restore Stderr="pg_restore: error: reconnection failed: connection to server at \"kis-dev-spl.cluster-c4nuvgjpjzrh.ap-southeast-2.rds.amazonaws.com\" (10.250.14.248), port 5432 failed: FATAL:  database \"kissvtsplrds\" does not exist"

Add prefix to storage config

As discussed in #56, the storage.prefix config should be added to work with both storage types directory and s3, meaning that s3.prefix will be deprecated.

Concerns:

  • Will the storage.directory.path config remain the same, or should it be removed to make way for storage.prefix as well?
    • if so, which will be the final storage path if the user configuration defines both?
      • {{ storage.prefix }} / {{ storage.directory.path }} or
      • {{ storage.directory.path }} / {{ storage.prefix }}

Panic using RandomString when specifying symbols

I have a table with a varchar(255) column for which I want to generate a random ID while dumping (this column only has NULL values in the original database). Here's the config I'm trying to use:

    - schema: public
      name: surgery_patient
      transformers:
        - name: RandomString
          params:
            column: permanent_identification_number
            symbols: 0123456789
            min_length: 20
            max_length: 20
            keep_null: false

When running dump or validate, greenmask fails with the following error:

greenmask --config greenmask.yml validate   --warnings   --data   --diff   --schema   --format=text   --table-format=vertical   --transformed-only   --rows-limit=1

panic: runtime error: index out of range [9] with length 9

goroutine 185 [running]:
github.com/greenmaskio/greenmask/internal/db/postgres/transformers/utils.RandomString(0xc000538b48?, 0x42523c?, 0x14, {0xc000708450, 0x9, 0x7f2cbbfc2ca8?}, {0xc000038140, 0x14, 0x14})
	/home/runner/work/greenmask/greenmask/internal/db/postgres/transformers/utils/transformation_funcs.go:142 +0x114
github.com/greenmaskio/greenmask/internal/db/postgres/transformers.(*RandomStringTransformer).Transform(0xc000144e70, {0x1e0?, 0xf44720?}, 0xc0004c9b60)
	/home/runner/work/greenmask/greenmask/internal/db/postgres/transformers/random_string.go:148 +0xf3
github.com/greenmaskio/greenmask/internal/db/postgres/dumpers.(*TransformationPipeline).TransformSync(0xc000814c00, {0x13afef8, 0xc0007240a0}, 0x100ffffffff?)
	/home/runner/work/greenmask/greenmask/internal/db/postgres/dumpers/transformation_pipeline.go:127 +0xa2
github.com/greenmaskio/greenmask/internal/db/postgres/dumpers.(*TransformationPipeline).Dump(0xc000814c00, {0x13afef8, 0xc0007240a0}, {0xc00073a035?, 0xc000538dc0?, 0x13aec90?})
	/home/runner/work/greenmask/greenmask/internal/db/postgres/dumpers/transformation_pipeline.go:153 +0x119
github.com/greenmaskio/greenmask/internal/db/postgres/dumpers.(*ValidationPipeline).Dump(0xc00007a068, {0x13afef8, 0xc0007240a0}, {0xc00073a035, 0xcb, 0xcb})
	/home/runner/work/greenmask/greenmask/internal/db/postgres/dumpers/validation_pipeline.go:33 +0x1c6
github.com/greenmaskio/greenmask/internal/db/postgres/dumpers.(*TableDumper).process(0xc0001244c8, {0x13afef8, 0xc0007240a0}, {0x13b52a8?, 0xc000010a98?}, {0x7f2cbbdf98a0?, 0xc0006960f0?}, {0x13b0278, 0xc00007a068})
	/home/runner/work/greenmask/greenmask/internal/db/postgres/dumpers/table.go:151 +0x3ad
github.com/greenmaskio/greenmask/internal/db/postgres/dumpers.(*TableDumper).Execute.func2()
	/home/runner/work/greenmask/greenmask/internal/db/postgres/dumpers/table.go:91 +0x305
golang.org/x/sync/errgroup.(*Group).Go.func1()
	/home/runner/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:78 +0x56
created by golang.org/x/sync/errgroup.(*Group).Go in goroutine 152
	/home/runner/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:75 +0x96

If I remove the symbols param, it works as expected.

I'm using greenmask 0.1.9 on Fedora 39 (the linux-amd64 build).

greenmask restore fails for generated columns

I'm running greenmask restore --config config.yml latest and it fails when trying to restore a table with a generated column. Even when testing without applying transformations to any of the columns of the table that contains the generated column, I get the following error message:
FTL fatal error="data stage restoration error: at least one worker exited with error: unable to perform restoration task (worker 4 restoring table \"public\".\"asdfasdf\"): error from postgres connection msg = column \"state\" is a generated column code=42P10"

According to the Postgres documentation, "A generated column cannot be written to directly. In INSERT or UPDATE commands, a value cannot be specified for a generated column, but the keyword DEFAULT may be specified." So I have also tried to apply the following transformation:

dump:
  transformation:
    - schema: 'public'
      name: 'asdfasdf'
      transformers:
        - name: 'Replace'
          params:
            column: 'state'
            value: DEFAULT

but I get the same error message. I only get the restore to work if I exclude the table with the generated column from the dump. I don't know if this is a bug or if greenmask simply doesn't support this type of column. I hope you can give me an answer.

My specs:

  • Greenmask 0.1.10
  • PostgreSQL 14.11

Json transform with value_template does not work

Hi - First of all, this looks like an awesome tool! Especially the ability to transform nested JSON objects.

However, I'm encountering an issue when trying to use a value_template with the set operation.

Here is the relevant part of my config:

    - schema: "public"
      name: "fitness_package_temp"
      transformers:
        - name: "Json"
          params:
            column: "profile_data"
            operations:
              - operation: "set"
                path: "weigdddht"
                error_not_exists: true
                value_template: \"test\"

No matter what value template I put in, the column is set to null. I also tried setting error_not_exists: true and using a key that doesn't exist, but no error is raised.

greenmask restore fails without error messages or exiting

I am running greenmask restore --config config.yml, and it is failing after an hour or so of restoring a 70 GB database. It stops running, but it does not exit, and does not display any error messages. I know it has stopped working because htop no longer shows the process.

Here are my specs:

  • greenmask 0.1.6
  • t2.micro EC2 instance
  • Amazon Linux 2023
  • 1 CPU, 1 GB RAM
  • 60 GB storage
  • RDS Aurora Postgres DB
  • DB size is 70 GB (14.5 GB compressed)

How do I get greenmask to finish?

Feature: add JSON parsing to dump.transformation attribute

Currently, the only way to pass a configuration to dump.transformation is through YAML, making it imperative to use a config file to configure a transformation.
Adding a JSON parser to this attribute would allow users to configure Greenmask entirely from environment variables, without needing to mount any volume or file.

This is especially useful when running Greenmask from a container: many cloud providers offer container platforms with environment variable and secret management integrated at no additional cost, whereas preparing and mounting a volume requires additional configuration and planning, along with other infrastructure considerations.
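
To illustrate, the transformation config from the Hash issue further down, expressed as a JSON environment-variable value (the variable name is a hypothetical placeholder):

GREENMASK_DUMP_TRANSFORMATION='[{"schema": "core", "name": "users", "transformers": [{"name": "Hash", "params": {"column": "email"}}]}]'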

Feature: conditional transform

A conditional transform states a SQL condition used to decide whether or not to transform a row. In Datanymizer, a WHERE clause is given as a string, and this API seems to work well. In the example below, groups is a table.

  - name: groups
    query:
      transform_condition: "id NOT IN (select group_id FROM employee_groups)"

Datanymizer implemented this by adding NOT to the given query. I fixed an issue where adding NOT also required proper NULL-checking behavior: datanymizer/datanymizer@24e2521

Dict transformer doesn't match values

In a database, I'd like to transform values of a column using the Dict transformer. The original database has this:

[screenshot: original values of the provider.name column]

The transformer used is configured like this

  transformation:

    - schema: public
      name: provider
      transformers:
        - name: Dict
          params:
            column: name
            values:
              Clinique Louis Pasteur Nancy: "Établissement 1"
              Clinique Ambroise Paré Thionville: "Établissement 2"
              Polyclinique La Ligne bleue: "Établissement 3"
              Clinique Jeanne d'Arc: "Établissement 4"
            # fail_not_matched: false

Yet, when validating or dumping, greenmask fails with:

2024-04-25T16:06:46+02:00 WRN error flushing gzip buffer error="io: read/write on closed pipe"
2024-04-25T16:06:46+02:00 WRN error closing TableDumper writer error="error closing gzip writer: io: read/write on closed pipe"
2024-04-25T16:06:46+02:00 WRN error flushing gzip buffer error="io: read/write on closed pipe"
2024-04-25T16:06:46+02:00 WRN error closing TableDumper writer error="error closing gzip writer: io: read/write on closed pipe"
2024-04-25T16:06:46+02:00 WRN error flushing gzip buffer error="io: read/write on closed pipe"
2024-04-25T16:06:46+02:00 WRN error closing TableDumper writer error="error closing gzip writer: io: read/write on closed pipe"
2024-04-25T16:06:46+02:00 FTL cannot make a backup error="data stage dumping error: at least one worker exited with error: error processing table dump: dump error: dump error on table public.provider at line 1: dump error on table public.provider at line 1: unable to match value for \"Polyclinique La Ligne bleue\""

I tried quoting the keys (single and double quotes) in the greenmask config, with no difference. I even tried with simple keys (without spaces or special chars), with the same result.

Env var values not being loaded without a config file definition

@wwoytenko I encountered a problem where environment variables are not being loaded if their config isn't defined in a config file. It seems that this is a known Viper issue: spf13/viper#584

I'll come up with a PR to solve this issue, changing the common.tmp_dir default definition.

We can also utilize this PR to set a default behavior for the storage config. I see these possible scenarios:

  • Default to directory storage at ~/dumps -- this would allow Docker users to map a volume to /home/greenmask/dumps and use it without needing to specify the storage.directory.path configuration
  • Require storage configuration and error out in case none is provided

S3 upload error: region missing

Hello guys, I'm trying to run the project locally using the latest Docker image provided on Docker Hub, but I'm getting an error message saying that the region can't be found in my configuration. Here is what my config.yaml file looks like:

common:
  tmp_dir: /home/temp

log:
  level: debug

s3:
  bucket: BUCKET_NAME
  region: us-east-1
  access_key_id: ACCESS_KEY_ID
  secret_access_key: SECRET_ACCESS_KEY

dump:
  pg_dump_options:
    host: DB_HOST
    dbname: DB_NAME

restore:
  pg_restore_options:
    host: DB_HOST
    dbname: DB_NAME

Here are the log messages I'm getting (DB host omitted):

root@173f2c271d32:/home# greenmask dump --config config.yaml
2024-03-27T15:17:24Z DBG ../var/lib/greenmask/internal/db/postgres/cmd/dump.go:145 > performing snapshot export pid=390
2024-03-27T15:17:26Z DBG ../var/lib/greenmask/internal/db/postgres/pgdump/pgdump.go:44 > pg_dump: pg_dump --file /home/temp/1711552642497597375 --format d --schema-only --snapshot 00000005-00069849-1 --dbname postgres --host DB_HOST --username postgres
 pid=390
2024-03-27T15:17:34Z DBG ../var/lib/greenmask/internal/db/postgres/cmd/dump.go:197 > reading schema section pid=390
2024-03-27T15:17:34Z DBG ../var/lib/greenmask/internal/db/postgres/cmd/dump.go:226 > planned 1 workers pid=390
2024-03-27T15:17:36Z DBG ../var/lib/greenmask/internal/db/postgres/cmd/dump.go:547 > exited normally WorkerId=1 pid=390
2024-03-27T15:17:36Z DBG ../var/lib/greenmask/internal/db/postgres/cmd/dump.go:331 > all the data have been dumped pid=390
2024-03-27T15:17:36Z DBG ../var/lib/greenmask/internal/db/postgres/cmd/dump.go:336 > merging toc entries pid=390
2024-03-27T15:17:36Z DBG ../var/lib/greenmask/internal/db/postgres/cmd/dump.go:342 > writing built toc file into storage pid=390
2024-03-27T15:17:36Z DBG ../var/lib/greenmask/internal/storages/s3/logger.go:33 > s3 storage logging 0="DEBUG: Validate Request s3/PutObject failed, not retrying, error MissingRegion: could not find region configuration" pid=390
2024-03-27T15:17:36Z DBG ../var/lib/greenmask/internal/storages/s3/logger.go:33 > s3 storage logging 0="DEBUG: Build Request s3/PutObject failed, not retrying, error MissingRegion: could not find region configuration" pid=390
2024-03-27T15:17:36Z DBG ../var/lib/greenmask/internal/storages/s3/logger.go:33 > s3 storage logging 0="DEBUG: Sign Request s3/PutObject failed, not retrying, error MissingRegion: could not find region configuration" pid=390
2024-03-27T15:17:36Z FTL ../var/lib/greenmask/cmd/greenmask/cmd/dump/dump.go:58 > cannot make a backup error="mergeAndWriteToc stage dumping error: s3 object uploading error: MissingRegion: could not find region configuration" pid=390

I've even tried exporting an AWS_REGION environment variable before executing greenmask, but had no luck. Looking forward to hearing from you guys; this project is amazing!

Let me know if there is anything I can help with

Hash transformer is too slow

I'm currently using RandomUuid for most of the columns, but I was asked to hash the original values instead, so that the same input always produces the same masked value.

I've replaced RandomUuid with Hash, and what used to take less than a minute to dump/transform the data now takes 30 minutes.

This is what the transformation config looks like (a similar block is repeated for 6 tables):

    - schema: core
      name: users
      transformers:
        - name: Hash
          params:
            column: email
        - name: Hash
          params:
            column: first_name
        - name: Hash
          params:
            column: last_name

Restore runs out of memory

I'm trying to restore a dump from our production database, but the restore command ends up being killed because it runs out of memory.

It happens at the same table each time. The machine has 8 GB of memory, even though the table is only 2 GB according to metadata.json. The table has some large text columns (10k chars), so I'm not sure if that plays into it.

My guess is that greenmask is loading the entire dump for the table into memory while restoring, but my go-fu is not strong enough to figure out if that's what's actually happening :/

Restore fails at post-data stage

About half of the time when we run greenmask restore, the post-data stage fails with the following error:

FTL ../home/runner/work/greenmask/greenmask/cmd/greenmask/cmd/restore/restore.go:68 > fatal error="post-data stage restoration error: cannot start transaction: write failed: write tcp 192.168.1.212:56750->192.168.3.203:5432: write: connection reset by peer" pid=354151

My guess is that since the same connection is being reused in restore.Run https://github.com/GreenmaskIO/greenmask/blob/c21cc3b99fbfd61d842007658337d466c65d6bca/internal/db/postgres/cmd/restore.go#L480C19-L480C22, and since the data restoring stage takes several hours, the connection is timed out by the server.

I can try my hand at creating a PR that opens a separate connection for each stage, if you want?
