awslabs / data-solutions-framework-on-aws

An open-source framework that simplifies implementation of data solutions.

Home Page: https://awslabs.github.io/data-solutions-framework-on-aws/

License: Apache License 2.0


data-solutions-framework-on-aws's Introduction

Data Solutions Framework on AWS

Data Solutions Framework (DSF) on AWS is a framework for implementation and delivery of data solutions with built-in AWS best practices. DSF is an abstraction atop AWS services based on AWS Cloud Development Kit (CDK) L3 constructs, packaged as a library.

You can leverage DSF to implement your data platform in weeks rather than in months.

  • DSF is available in TypeScript and Python.
  • Use the framework to build your data solutions instead of building cloud infrastructure from scratch.
  • Compose data solutions using integrated building blocks via Infrastructure as Code (IaC).
  • Benefit from smart defaults and built-in AWS best practices.
  • Customize or extend according to your requirements.

Get started by exploring the framework and the available examples. Learn more from the documentation.
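As a taste of the building-block approach, here is a minimal sketch of a CDK stack composing two DSF constructs; the import path and construct IDs are assumptions, so check the documentation for the exact package name.

from aws_cdk import App, Stack
from constructs import Construct

# Assumed import path; see the DSF documentation for the exact package name.
import cdklabs.aws_data_solutions_framework as dsf


class DataLakeStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Bronze/silver/gold buckets with smart defaults (encryption, access logs, ...).
        storage = dsf.storage.DataLakeStorage(self, "MyDataLakeStorage")

        # Glue databases and crawlers layered on top of the storage.
        dsf.governance.DataLakeCatalog(
            self,
            "MyDataLakeCatalog",
            data_lake_storage=storage,
        )


app = App()
DataLakeStack(app, "DataLakeStack")
app.synth()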

Security

See CONTRIBUTING for more information.

License

This library is licensed under the Apache-2.0 License. See the LICENSE file.

Feedback

We'd love to hear from you! Please create GitHub issues for additional features or solutions you'd like to see.

data-solutions-framework-on-aws's People

Contributors

alexvt-amz, amazon-auto, armaseg, cmclel7, dacort, dependabot[bot], dzeno, jeromevdl, jmgtan, lmouhib, scottschreckengaust, shalaka-k, vgkowski


data-solutions-framework-on-aws's Issues

Amazon MSK topics governance in Amazon DataZone

Today, DataZone natively supports batch datasets from S3 and Redshift, but it also offers an extensive API for adding custom data types like streaming datasets. The objective is to provide solutions for:

  1. Cataloging Kafka topics and making them part of the marketplace features
  2. Integrating data quality metrics (nice to have)
  3. Granting consumers access to topics directly from DataZone

PySparkApplicationPackage artifact bucket not deleting properly

The PySparkApplicationPackage construct creates an artifact bucket for storing the Spark entrypoint and dependencies archive, and that bucket stores its own access logs. We can't use the CDK capability to auto-delete the objects and the bucket when destroying the stack, because the custom resource responsible for deleting objects generates access logs... which are new objects in the bucket.

We need to:

  1. Change the PySparkApplicationPackage construct to either disable access logs or log to another bucket (see the sketch after this list)
  2. Update the cookbook with this best practice
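A minimal sketch of the second option using plain CDK (not the DSF API), where the artifact bucket ships its access logs to a separate log bucket so auto-delete no longer generates new objects in the bucket being emptied; bucket IDs are hypothetical:

from aws_cdk import RemovalPolicy
from aws_cdk import aws_s3 as s3

# Dedicated bucket that only receives access logs.
log_bucket = s3.Bucket(
    self, "ArtifactAccessLogs",
    removal_policy=RemovalPolicy.DESTROY,
    auto_delete_objects=True,
)

# Artifact bucket for the Spark entrypoint and dependencies archive.
artifact_bucket = s3.Bucket(
    self, "PySparkArtifacts",
    server_access_logs_bucket=log_bucket,  # logs no longer land in the bucket itself
    removal_policy=RemovalPolicy.DESTROY,
    auto_delete_objects=True,
)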

Add support for Kafka 3.7 in MSK

Currently, the MSK Provisioned construct changes the ZooKeeper security groups during deployment. Because Kafka 3.7 on MSK uses KRaft and not ZooKeeper, we need to change this behavior so the construct no longer looks for ZooKeeper security groups and modifies them.

Package AWS SDK within custom resources

Today, the custom resources provided by DSF don't always package the AWS SDK as part of the Lambda function, resulting in potential side effects when the SDK embedded in the Lambda runtime changes.
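One possible direction, sketched with the standard CDK NodejsFunction bundling options (the entry path is hypothetical): clearing external_modules bundles the SDK into the function asset instead of resolving it from the runtime.

from aws_cdk import aws_lambda_nodejs as nodejs

handler = nodejs.NodejsFunction(
    self, "CustomResourceHandler",
    entry="lambda/handler.ts",  # hypothetical path to the custom resource handler
    bundling=nodejs.BundlingOptions(
        # An empty list means the AWS SDK is bundled with the code instead of
        # being taken from the Lambda runtime at execution time.
        external_modules=[],
    ),
)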

Provide an Athena based construct for consuming data lake data

Implement a new construct for consuming the data lake via SQL, based on Athena SQL (a sketch of the underlying resources follows the list):

  • Create a workgroup with proper configuration
  • Scope down permissions on a data lake (storage and catalog)
  • Create an Athena results bucket with configurable results retention
  • Allow for a VPC endpoint
  • Grant a principal access to the workgroup via a method
  • Publish CloudWatch query metrics
  • Set workgroup limits
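A hedged sketch of the underlying resources with plain CDK (not the proposed DSF API); the workgroup name, retention period and scan limit are hypothetical:

from aws_cdk import Duration, RemovalPolicy
from aws_cdk import aws_athena as athena, aws_s3 as s3

# Results bucket with configurable retention.
results_bucket = s3.Bucket(
    self, "AthenaResults",
    lifecycle_rules=[s3.LifecycleRule(expiration=Duration.days(30))],
    removal_policy=RemovalPolicy.DESTROY,
)

workgroup = athena.CfnWorkGroup(
    self, "ConsumptionWorkGroup",
    name="data-lake-consumption",  # hypothetical name
    work_group_configuration=athena.CfnWorkGroup.WorkGroupConfigurationProperty(
        enforce_work_group_configuration=True,
        bytes_scanned_cutoff_per_query=10 * 1024 ** 3,  # workgroup limit: 10 GB per query
        publish_cloud_watch_metrics_enabled=True,       # CloudWatch query metrics
        result_configuration=athena.CfnWorkGroup.ResultConfigurationProperty(
            output_location=f"s3://{results_bucket.bucket_name}/results/",
        ),
    ),
)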

Provide an Amazon OpenSearch construct with configuration helpers

Currently, provisioning Amazon OpenSearch requires provisioning the cluster and then configuring it through the OpenSearch API. There are some common needs, like:

  • Configuring fine-grained access control
  • Integrating with SAML or using internal database users
  • Uploading OpenSearch objects, including index templates and dashboards

We should provide a construct with methods exposed to interact with the OpenSearch cluster (via a custom resource) and simplify the overall experience.
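For instance, a hedged sketch of a domain with fine-grained access control backed by an internal database user, using the standard CDK OpenSearch construct (domain ID, version and user name are hypothetical):

from aws_cdk import aws_opensearchservice as opensearch

domain = opensearch.Domain(
    self, "SearchDomain",
    version=opensearch.EngineVersion.OPENSEARCH_2_5,
    # Fine-grained access control requires node-to-node encryption,
    # encryption at rest and HTTPS enforcement.
    node_to_node_encryption=True,
    encryption_at_rest=opensearch.EncryptionAtRestOptions(enabled=True),
    enforce_https=True,
    fine_grained_access_control=opensearch.AdvancedSecurityOptions(
        master_user_name="os-admin",  # hypothetical internal database user
    ),
)

The DSF construct could wrap something like this and expose methods that call the OpenSearch API (index templates, dashboards, role mappings) through a custom resource.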

Add support for MSK Serverless and Provisioned

Provide a construct that deploys MSK Serverless and allows performing operations like creating a topic or adding partitions. The construct should also offer methods to grant access to produce and consume data on a topic.

Provide data helper constructs to support multiple use cases

Implement one or more constructs to support the following use cases:

DataLakeStorage: independent bucket policy access to L2 constructs

Hi team

Looking to alter the bucket policies of the buckets. Currently, the default means that nobody can perform any s3:* operation on the buckets.

We are looking for a way to apply specific policies to specific buckets, for example:

dsf.storage.DataLakeStorage(self, "MyDataLakeStorage", bronze_bucket_policy=policy, gold_bucket_policy=policy)
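Until such parameters exist, one possible workaround is to append statements to the exposed buckets' resource policies with the standard CDK API; a hedged sketch (the principal, actions and account ID are hypothetical):

from aws_cdk import aws_iam as iam

storage = dsf.storage.DataLakeStorage(self, "MyDataLakeStorage")

# Append a statement to the bronze bucket's resource policy.
storage.bronze_bucket.add_to_resource_policy(
    iam.PolicyStatement(
        principals=[iam.AccountPrincipal("111122223333")],  # hypothetical account
        actions=["s3:GetObject"],
        resources=[storage.bronze_bucket.arn_for_objects("*")],
    )
)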

Relationship of this project to aws-ddk?

How does this project relate to aws-ddk? If there is alignment in both projects' goals, wouldn't it be better to merge efforts into a single project that eventually becomes better than both?

Data Lake Storage documentation on accessing buckets

Hi team

I'm hoping we can add some documentation examples around accessing the bronze/silver/gold buckets dynamically with CDK, for example in Python:

storage = dsf.storage.DataLakeStorage(self, "MyDataLakeStorage")
bronze_bucket_arn = storage.bronze_bucket.bucket_arn

In addition, if there is a way to launch the bronze/silver/gold buckets in account A but the access logs bucket in an audit account B, that would be great.
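For example, a hedged sketch of referencing the layer buckets dynamically and granting access with the standard CDK bucket API (the gold_bucket attribute name is assumed by analogy with bronze_bucket and silver_bucket, and my_consumer_function is hypothetical):

storage = dsf.storage.DataLakeStorage(self, "MyDataLakeStorage")

# Reference the layer buckets dynamically elsewhere in the stack.
layer_arns = {
    "bronze": storage.bronze_bucket.bucket_arn,
    "silver": storage.silver_bucket.bucket_arn,
    "gold": storage.gold_bucket.bucket_arn,  # attribute name assumed by analogy
}

# Standard CDK grants work on the exposed buckets, e.g. for a consumer Lambda.
storage.gold_bucket.grant_read(my_consumer_function)  # my_consumer_function is hypothetical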

Add an option to DataLakeCatalog/DataCatalogDatabase for crawlers to use manually defined tables

DataLakeCatalog/DataCatalogDatabase should have the option of manually setting the tables for the crawler as parameters. There are several use cases that require a manually created catalog table (see the sketch below):

  • You want to choose the catalog table name manually and not rely on the catalog table naming algorithm
  • You want to reuse the table later in the stack (e.g., in a Lambda to query the table)

See: https://docs.aws.amazon.com/glue/latest/dg/tables-described.html#update-manual-tables
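A hedged sketch of what such a manually defined table could look like with the Glue L1 resource, which the construct could accept as a parameter (database, location and columns are hypothetical):

from aws_cdk import Aws
from aws_cdk import aws_glue as glue

manual_table = glue.CfnTable(
    self, "TripDataTable",
    catalog_id=Aws.ACCOUNT_ID,
    database_name="silver",  # hypothetical database
    table_input=glue.CfnTable.TableInputProperty(
        name="trip_data",  # explicit table name instead of the crawler naming algorithm
        storage_descriptor=glue.CfnTable.StorageDescriptorProperty(
            location="s3://my-silver-bucket/trip-data/",  # hypothetical location
            columns=[
                glue.CfnTable.ColumnProperty(name="vendor_id", type="string"),
                glue.CfnTable.ColumnProperty(name="trip_distance", type="double"),
            ],
            input_format="org.apache.hadoop.mapred.TextInputFormat",
            output_format="org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            serde_info=glue.CfnTable.SerdeInfoProperty(
                serialization_library="org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
            ),
        ),
    ),
)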

Add support for KRaft in MSK

AWS launched support for KRaft on new clusters on Amazon MSK starting from version 3.7. https://aws.amazon.com/blogs/big-data/introducing-support-for-apache-kafka-on-raft-mode-kraft-with-amazon-msk-clusters/

Currently, the MSK Provisioned construct depends on ZooKeeper, so deploying a new cluster with KRaft fails. For example:

version = dsf.streaming.KafkaVersion.of('3.7.x.kraft')

msk_cluster = dsf.streaming.MskProvisioned(self, "cluster",
            vpc=vpc,
            cluster_name="my-cluster-3-6-0",
            kafka_version=version,
            subnets=ec2.SubnetSelection(subnet_type=ec2.SubnetType.PRIVATE_WITH_NAT),
            security_groups=[ec2.SecurityGroup.from_security_group_id(
                self,
                id='sg',
                security_group_id="sg-1234"
            )],
            client_authentication=dsf.streaming.ClientAuthentication.sasl(
                iam=True
            )
        )

Returns the following error:

Received response status [FAILED] from custom resource. Message returned: Error: Cannot read properties of undefined (reading 'split') TypeError: Cannot read properties of undefined (reading 'split'), at Runtime.onEventHandler [as handler] (file:///var/task/index.mjs:23:76), at process.processTicksAndRejections (node:internal/process/task_queues:95:5) at P (/var/task/index.js:1:1756) at process.processTicksAndRejections (node:internal/process/task_queues:95:5) at async Runtime.handler (/var/task/__entrypoint__.js:1:932)

In the future, the Apache Kafka community plans to remove the ZooKeeper mode entirely.

Bug: DataLakeStorage crawler not using the proper table level in S3

The crawler created as part of the DataLakeCatalog is not creating tables at the right prefix level in S3. In the following example, the crawler is configured with table level 2 instead of 1:

storage = dsf.storage.DataLakeStorage(self, 'Storage',
    removal_policy=RemovalPolicy.DESTROY,
)

catalog = dsf.governance.DataLakeCatalog(
    self, 
    'Catalog',
    data_lake_storage=storage,
    removal_policy=RemovalPolicy.DESTROY,
)

dsf.utils.S3DataCopy(
    self,
    "SourceDataCopy",
    source_bucket=Bucket.from_bucket_name(self, 'SourceBucket', 'nyc-tlc'),
    source_bucket_prefix="trip data/",
    source_bucket_region="us-east-1",
    target_bucket= storage.silver_bucket,
    target_bucket_prefix="trip-data/",
)

Provide a Redshift based construct to consume data lake data

Implement a construct based on Redshift to ingest and query data from the data lake (a sketch of the underlying Serverless resources follows the list):

  • Based on Redshift Serverless
  • Federated authentication
  • A grant method to allow Redshift to query the data lake (via a federated IAM identity)
  • Set up auto-load from a data lake table
  • Provide a method to provision resources within Redshift with a Create/Update/Delete lifecycle (based on the custom resource lifecycle)
  • Configure audit logs with CloudWatch Logs
  • Configure backup/restore
  • Data share management
  • Managed Grafana for operational monitoring (bonus)
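A hedged sketch of the underlying Serverless resources with CDK L1 constructs (not the proposed DSF API); names, capacity and the data lake role are hypothetical:

from aws_cdk import aws_iam as iam
from aws_cdk import aws_redshiftserverless as redshiftserverless

# Hypothetical role Redshift assumes to query the data lake (federated IAM identity).
data_lake_role = iam.Role(
    self, "RedshiftDataLakeRole",
    assumed_by=iam.ServicePrincipal("redshift.amazonaws.com"),
)

namespace = redshiftserverless.CfnNamespace(
    self, "AnalyticsNamespace",
    namespace_name="analytics",  # hypothetical
    db_name="datalake",
    iam_roles=[data_lake_role.role_arn],
    log_exports=["userlog", "connectionlog", "useractivitylog"],  # audit logging
)

workgroup = redshiftserverless.CfnWorkgroup(
    self, "AnalyticsWorkgroup",
    workgroup_name="analytics-wg",  # hypothetical
    namespace_name=namespace.namespace_name,
    base_capacity=8,  # RPUs
)
workgroup.add_dependency(namespace)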

Feat: provide constructs to simplify GenAI pattern implementation

The most common patterns for GenAI applications are RAG and LLM fine-tuning/training. DSF can bring some GenAI constructs to accelerate the implementation of these patterns. In detail, we have identified three constructs that could help:

  • A RAG pipeline to ingest data into vector databases and provide semantic context to GenAI applications
  • A Data API pipeline to expose data to GenAI applications and provide situational context
  • A data preparation pipeline to prepare data for model training or fine-tuning

Provide a Spark Streaming job construct on EMR Serverless

Currently, implementing a Spark Streaming job on EMR Serverless requires additional tooling to implement streaming best practices. We can provide a construct similar to SparkEmrServerlessJob, but for streaming. The main features it should support:

  • Checkpointing the Spark state on resilient storage (see the sketch after this list)
  • Graceful updates of the Spark Streaming application: when deploying a new version of the Spark code, the construct should gracefully shut down the current Spark Streaming job and then start the new one from the same checkpoint
  • Automatic retries of the Spark Streaming job when a failure is detected; the retry mechanism should have a maximum number of retries, exponential backoff, and alerting
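On the checkpointing point, a minimal PySpark Structured Streaming sketch (the source, sink and bucket names are hypothetical) showing where the resilient checkpoint location comes in:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dsf-streaming-sketch").getOrCreate()

# Hypothetical streaming source; the point is the checkpoint location on S3.
events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

query = (
    events.writeStream
    .format("parquet")
    .option("path", "s3://my-output-bucket/events/")                    # hypothetical sink
    .option("checkpointLocation", "s3://my-checkpoint-bucket/events/")  # resilient checkpoint
    .trigger(processingTime="1 minute")
    .start()
)
query.awaitTermination()

Restarting the job with the same checkpoint location is what lets a new deployment resume from where the previous one stopped.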

Add support for DBT

This issue is to gather the needs and possible ways to support DBT, as well as possible features to implement.
