awslabs / data-solutions-framework-on-aws

An open-source framework that simplifies implementation of data solutions.

Home Page: https://awslabs.github.io/data-solutions-framework-on-aws/

License: Apache License 2.0


data-solutions-framework-on-aws's Introduction

Data Solutions Framework on AWS

Data Solutions Framework (DSF) on AWS is a framework for implementation and delivery of data solutions with built-in AWS best practices. DSF is an abstraction atop AWS services based on AWS Cloud Development Kit (CDK) L3 constructs, packaged as a library.

You can leverage DSF to implement your data platform in weeks rather than in months.

  • DSF is available in TypeScript and Python.
  • Use the framework to build your data solutions instead of building cloud infrastructure from scratch.
  • Compose data solutions using integrated building blocks via Infrastructure as Code (IaC).
  • Benefit from smart defaults and built-in AWS best practices.
  • Customize or extend according to your requirements.

Get started by exploring the framework and the available examples. Learn more from the documentation.
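As a taste of the building-block approach, here is a minimal sketch of a CDK stack composing two DSF constructs; the import path and construct IDs are assumptions, so check the documentation for the exact package name.

from aws_cdk import App, Stack
from constructs import Construct

# Assumed import path; see the DSF documentation for the exact package name.
import cdklabs.aws_data_solutions_framework as dsf


class DataLakeStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Bronze/silver/gold buckets with smart defaults (encryption, access logs, ...).
        storage = dsf.storage.DataLakeStorage(self, "MyDataLakeStorage")

        # Glue databases and crawlers layered on top of the storage.
        dsf.governance.DataLakeCatalog(
            self,
            "MyDataLakeCatalog",
            data_lake_storage=storage,
        )


app = App()
DataLakeStack(app, "DataLakeStack")
app.synth()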

Security

See CONTRIBUTING for more information.

License

This library is licensed under the Apache-2.0 License. See the LICENSE file.

Feedback

We'd love to hear from you! Please create GitHub issues for additional features or solutions you'd like to see.

data-solutions-framework-on-aws's People

Contributors

alexvt-amz, amazon-auto, armaseg, cmclel7, dacort, dependabot[bot], dzeno, jeromevdl, jmgtan, lmouhib, scottschreckengaust, shalaka-k, vgkowski


data-solutions-framework-on-aws's Issues

Amazon MSK topics governance in Amazon DataZone

Today, DataZone natively supports batch datasets from S3 and Redshift, but it also offers an extensive API for adding custom data types like streaming datasets. The objective is to provide solutions for:

  1. Cataloging Kafka topics and making them part of the marketplace features
  2. Integrating data quality metrics (nice to have)
  3. Granting consumers access to topics directly from DataZone

PySparkApplicationPackage artifact bucket not deleting properly

The PySparkApplicationPackage construct creates an artifact bucket for storing the Spark entrypoint and dependencies archive, and that bucket stores its own access logs. We can't use the CDK capability to auto-delete the objects and the bucket when destroying the stack, because the custom resource responsible for deleting objects generates access logs... which are new objects in the bucket.

We need to:

  1. Change the PySparkApplicationPackage construct to either disable access logs or log to another bucket (see the sketch after this list)
  2. Update the cookbook with this best practice
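A minimal sketch of the second option using plain CDK (not the DSF API), where the artifact bucket ships its access logs to a separate log bucket so auto-delete no longer generates new objects in the bucket being emptied; bucket IDs are hypothetical:

from aws_cdk import RemovalPolicy
from aws_cdk import aws_s3 as s3

# Dedicated bucket that only receives access logs.
log_bucket = s3.Bucket(
    self, "ArtifactAccessLogs",
    removal_policy=RemovalPolicy.DESTROY,
    auto_delete_objects=True,
)

# Artifact bucket for the Spark entrypoint and dependencies archive.
artifact_bucket = s3.Bucket(
    self, "PySparkArtifacts",
    server_access_logs_bucket=log_bucket,  # logs no longer land in the bucket itself
    removal_policy=RemovalPolicy.DESTROY,
    auto_delete_objects=True,
)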

Add support for Kafka 3.7 in MSK

Currently, the MSK Provisioned construct changes the ZooKeeper security groups during deployment. Because Kafka 3.7 on MSK uses KRaft and not ZooKeeper, we need to change this behavior so the construct no longer looks for ZooKeeper security groups and modifies them.

Package AWS SDK within custom resources

Today, the custom resources provided by DSF don't always package the AWS SDK as part of the Lambda function, resulting in potential side effects when the SDK embedded in the Lambda runtime changes.
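One possible direction, sketched with the standard CDK NodejsFunction bundling options (the entry path is hypothetical): clearing external_modules bundles the SDK into the function asset instead of resolving it from the runtime.

from aws_cdk import aws_lambda_nodejs as nodejs

handler = nodejs.NodejsFunction(
    self, "CustomResourceHandler",
    entry="lambda/handler.ts",  # hypothetical path to the custom resource handler
    bundling=nodejs.BundlingOptions(
        # An empty list means the AWS SDK is bundled with the code instead of
        # being taken from the Lambda runtime at execution time.
        external_modules=[],
    ),
)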

Provide an Athena based construct for consuming data lake data

Implement a new construct for consuming the data lake via SQL, based on Athena SQL (a sketch of the underlying resources follows the list):

  • Create a workgroup with proper configuration
  • Scope down permissions on a data lake (storage and catalog)
  • Create an Athena results bucket with configurable results retention
  • Allow for a VPC endpoint
  • Grant a principal access to the workgroup via a method
  • Publish CloudWatch query metrics
  • Set workgroup limits
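A hedged sketch of the underlying resources with plain CDK (not the proposed DSF API); the workgroup name, retention period and scan limit are hypothetical:

from aws_cdk import Duration, RemovalPolicy
from aws_cdk import aws_athena as athena, aws_s3 as s3

# Results bucket with configurable retention.
results_bucket = s3.Bucket(
    self, "AthenaResults",
    lifecycle_rules=[s3.LifecycleRule(expiration=Duration.days(30))],
    removal_policy=RemovalPolicy.DESTROY,
)

workgroup = athena.CfnWorkGroup(
    self, "ConsumptionWorkGroup",
    name="data-lake-consumption",  # hypothetical name
    work_group_configuration=athena.CfnWorkGroup.WorkGroupConfigurationProperty(
        enforce_work_group_configuration=True,
        bytes_scanned_cutoff_per_query=10 * 1024 ** 3,  # workgroup limit: 10 GB per query
        publish_cloud_watch_metrics_enabled=True,       # CloudWatch query metrics
        result_configuration=athena.CfnWorkGroup.ResultConfigurationProperty(
            output_location=f"s3://{results_bucket.bucket_name}/results/",
        ),
    ),
)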

Provide an Amazon OpenSearch construct with configuration helpers

Currently, provisioning Amazon OpenSearch requires provisioning the cluster and then configuring it through the OpenSearch API. There are some common needs, like:

  • Configuring fine-grained access control
  • Integrating with SAML or using internal database users
  • Uploading OpenSearch objects, including index templates and dashboards

We should provide a construct with methods exposed to interact with the OpenSearch cluster (via a custom resource) and simplify the overall experience.
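For instance, a hedged sketch of a domain with fine-grained access control backed by an internal database user, using the standard CDK OpenSearch construct (domain ID, version and user name are hypothetical):

from aws_cdk import aws_opensearchservice as opensearch

domain = opensearch.Domain(
    self, "SearchDomain",
    version=opensearch.EngineVersion.OPENSEARCH_2_5,
    # Fine-grained access control requires node-to-node encryption,
    # encryption at rest and HTTPS enforcement.
    node_to_node_encryption=True,
    encryption_at_rest=opensearch.EncryptionAtRestOptions(enabled=True),
    enforce_https=True,
    fine_grained_access_control=opensearch.AdvancedSecurityOptions(
        master_user_name="os-admin",  # hypothetical internal database user
    ),
)

The DSF construct could wrap something like this and expose methods that call the OpenSearch API (index templates, dashboards, role mappings) through a custom resource.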

Add support for MSK Serverless and Provisioned

Provide a construct that deploys MSK Serverless and allows performing operations like creating a topic or adding partitions. The construct should also offer methods to grant access to produce and consume data on a topic.

Provide data helper constructs to support multiple use cases

Implement one or more constructs to support the following use cases:

DataLakeStorage: independent bucket policy access to L2 constructs

Hi team

Looking to alter the bucket policies of the buckets. Currently, the default means that nobody can perform any s3:* operation on the buckets.

We are looking for a way to apply specific policies to specific buckets, for example:

dsf.storage.DataLakeStorage(self, "MyDataLakeStorage", bronze_bucket_policy=policy, gold_bucket_policy=policy)
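Until such parameters exist, one possible workaround is to append statements to the exposed buckets' resource policies with the standard CDK API; a hedged sketch (the principal, actions and account ID are hypothetical):

from aws_cdk import aws_iam as iam

storage = dsf.storage.DataLakeStorage(self, "MyDataLakeStorage")

# Append a statement to the bronze bucket's resource policy.
storage.bronze_bucket.add_to_resource_policy(
    iam.PolicyStatement(
        principals=[iam.AccountPrincipal("111122223333")],  # hypothetical account
        actions=["s3:GetObject"],
        resources=[storage.bronze_bucket.arn_for_objects("*")],
    )
)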

Relationship of this project to aws-ddk?

How does this project relate to aws-ddk? If there is alignment in both projects' goals, wouldn't it be better to merge efforts into a single project that eventually becomes better than both?

Data Lake Storage documentation on accessing buckets

Hi team

I'm hoping we can add some documentation examples around accessing the bronze/silver/gold buckets dynamically with CDK, for example in Python:

storage = dsf.storage.DataLakeStorage(self, "MyDataLakeStorage")
bronze_bucket_arn = storage.bronze_bucket.bucket_arn

In addition, if there is a way to launch the bronze/silver/gold buckets in account A but the access logs bucket in an audit account B, that would be great.
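For example, a hedged sketch of referencing the layer buckets dynamically and granting access with the standard CDK bucket API (the gold_bucket attribute name is assumed by analogy with bronze_bucket and silver_bucket, and my_consumer_function is hypothetical):

storage = dsf.storage.DataLakeStorage(self, "MyDataLakeStorage")

# Reference the layer buckets dynamically elsewhere in the stack.
layer_arns = {
    "bronze": storage.bronze_bucket.bucket_arn,
    "silver": storage.silver_bucket.bucket_arn,
    "gold": storage.gold_bucket.bucket_arn,  # attribute name assumed by analogy
}

# Standard CDK grants work on the exposed buckets, e.g. for a consumer Lambda.
storage.gold_bucket.grant_read(my_consumer_function)  # my_consumer_function is hypothetical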

Add an option to DataLakeCatalog/DataCatalogDatabase for crawlers to use manually defined tables

DataLakeCatalog/DataCatalogDatabase should have the option of manually setting the tables for the crawler as parameters. There are several use cases that require a manually created catalog table (see the sketch below):

  • You want to choose the catalog table name manually and not rely on the catalog table naming algorithm
  • You want to reuse the table later in the stack (e.g., in a Lambda to query the table)

See: https://docs.aws.amazon.com/glue/latest/dg/tables-described.html#update-manual-tables
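A hedged sketch of what such a manually defined table could look like with the Glue L1 resource, which the construct could accept as a parameter (database, location and columns are hypothetical):

from aws_cdk import Aws
from aws_cdk import aws_glue as glue

manual_table = glue.CfnTable(
    self, "TripDataTable",
    catalog_id=Aws.ACCOUNT_ID,
    database_name="silver",  # hypothetical database
    table_input=glue.CfnTable.TableInputProperty(
        name="trip_data",  # explicit table name instead of the crawler naming algorithm
        storage_descriptor=glue.CfnTable.StorageDescriptorProperty(
            location="s3://my-silver-bucket/trip-data/",  # hypothetical location
            columns=[
                glue.CfnTable.ColumnProperty(name="vendor_id", type="string"),
                glue.CfnTable.ColumnProperty(name="trip_distance", type="double"),
            ],
            input_format="org.apache.hadoop.mapred.TextInputFormat",
            output_format="org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            serde_info=glue.CfnTable.SerdeInfoProperty(
                serialization_library="org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
            ),
        ),
    ),
)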

Add support for KRaft in MSK

AWS launched support for KRaft on new clusters on Amazon MSK starting from version 3.7. https://aws.amazon.com/blogs/big-data/introducing-support-for-apache-kafka-on-raft-mode-kraft-with-amazon-msk-clusters/

Currently, the MSK Provisioned construct depends on ZooKeeper, so deploying a new cluster with KRaft fails. For example:

version = dsf.streaming.KafkaVersion.of('3.7.x.kraft')

msk_cluster = dsf.streaming.MskProvisioned(self, "cluster",
            vpc=vpc,
            cluster_name="my-cluster-3-6-0",
            kafka_version=version,
            subnets=ec2.SubnetSelection(subnet_type=ec2.SubnetType.PRIVATE_WITH_NAT),
            security_groups=[ec2.SecurityGroup.from_security_group_id(
                self,
                id='sg',
                security_group_id="sg-1234"
            )],
            client_authentication=dsf.streaming.ClientAuthentication.sasl(
                iam=True
            )
        )

Returns the following error:

Received response status [FAILED] from custom resource. Message returned: Error: Cannot read properties of undefined (reading 'split') TypeError: Cannot read properties of undefined (reading 'split'), at Runtime.onEventHandler [as handler] (file:///var/task/index.mjs:23:76), at process.processTicksAndRejections (node:internal/process/task_queues:95:5) at P (/var/task/index.js:1:1756) at process.processTicksAndRejections (node:internal/process/task_queues:95:5) at async Runtime.handler (/var/task/__entrypoint__.js:1:932)

In the future, the Apache Kafka community plans to remove the ZooKeeper mode entirely.

Bug: DataLakeStorage crawler not using the proper table level in S3

The crawler created as part of the DataLakeCatalog is not creating tables at the right prefix level in S3. In the following example, the crawler is configured with table level 2 instead of 1:

storage = dsf.storage.DataLakeStorage(self, 'Storage',
    removal_policy=RemovalPolicy.DESTROY,
)

catalog = dsf.governance.DataLakeCatalog(
    self, 
    'Catalog',
    data_lake_storage=storage,
    removal_policy=RemovalPolicy.DESTROY,
)

dsf.utils.S3DataCopy(
    self,
    "SourceDataCopy",
    source_bucket=Bucket.from_bucket_name(self, 'SourceBucket', 'nyc-tlc'),
    source_bucket_prefix="trip data/",
    source_bucket_region="us-east-1",
    target_bucket= storage.silver_bucket,
    target_bucket_prefix="trip-data/",
)

Provide a Redshift based construct to consume data lake data

Implement a construct based on Redshift to ingest and query data from the data lake (a sketch of the underlying Serverless resources follows the list):

  • Based on Redshift Serverless
  • Federated authentication
  • A grant method to allow Redshift to query the data lake (via a federated IAM identity)
  • Set up auto-load from a data lake table
  • Provide a method to provision resources within Redshift with a Create/Update/Delete lifecycle (based on the custom resource lifecycle)
  • Configure audit logs with CloudWatch Logs
  • Configure backup/restore
  • Data share management
  • Managed Grafana for operational monitoring (bonus)
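A hedged sketch of the underlying Serverless resources with CDK L1 constructs (not the proposed DSF API); names, capacity and the data lake role are hypothetical:

from aws_cdk import aws_iam as iam
from aws_cdk import aws_redshiftserverless as redshiftserverless

# Hypothetical role Redshift assumes to query the data lake (federated IAM identity).
data_lake_role = iam.Role(
    self, "RedshiftDataLakeRole",
    assumed_by=iam.ServicePrincipal("redshift.amazonaws.com"),
)

namespace = redshiftserverless.CfnNamespace(
    self, "AnalyticsNamespace",
    namespace_name="analytics",  # hypothetical
    db_name="datalake",
    iam_roles=[data_lake_role.role_arn],
    log_exports=["userlog", "connectionlog", "useractivitylog"],  # audit logging
)

workgroup = redshiftserverless.CfnWorkgroup(
    self, "AnalyticsWorkgroup",
    workgroup_name="analytics-wg",  # hypothetical
    namespace_name=namespace.namespace_name,
    base_capacity=8,  # RPUs
)
workgroup.add_dependency(namespace)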

Feat: provide constructs to simplify GenAI pattern implementation

The most common patterns for GenAI applications are RAG and LLM fine-tuning/training. DSF can bring some GenAI constructs to accelerate the implementation of these patterns. In detail, we have identified three constructs that could help:

  • A RAG pipeline to ingest data into vector databases and provide semantic context to GenAI applications
  • A Data API pipeline to expose data to GenAI applications and provide situational context
  • A data preparation pipeline to prepare data for model training or fine-tuning

Provide a Spark Streaming job construct on EMR Serverless

Currently, implementing a Spark Streaming job on EMR Serverless requires additional tooling to implement streaming best practices. We can provide a construct similar to SparkEmrServerlessJob, but for streaming. The main features it should support:

  • Checkpointing the Spark state on resilient storage (see the sketch after this list)
  • Graceful updates of the Spark Streaming application: when deploying a new version of the Spark code, the construct should gracefully shut down the current Spark Streaming job and then start the new one from the same checkpoint
  • Automatic retries of the Spark Streaming job when a failure is detected; the retry mechanism should have a maximum number of retries, exponential backoff, and alerting
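On the checkpointing point, a minimal PySpark Structured Streaming sketch (the source, sink and bucket names are hypothetical) showing where the resilient checkpoint location comes in:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dsf-streaming-sketch").getOrCreate()

# Hypothetical streaming source; the point is the checkpoint location on S3.
events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

query = (
    events.writeStream
    .format("parquet")
    .option("path", "s3://my-output-bucket/events/")                    # hypothetical sink
    .option("checkpointLocation", "s3://my-checkpoint-bucket/events/")  # resilient checkpoint
    .trigger(processingTime="1 minute")
    .start()
)
query.awaitTermination()

Restarting the job with the same checkpoint location is what lets a new deployment resume from where the previous one stopped.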

Add support for DBT

This issue is to gather the needs and possible ways to support DBT, as well as possible features to implement.
