aws-samples / aws-analytics-reference-architecture

Home Page: https://aws.amazon.com/blogs/opensource/adding-cdk-constructs-to-the-aws-analytics-reference-architecture/

License: Other

Python 24.81% Scala 7.02% Shell 0.02% PLpgSQL 1.82% Java 14.46% Jinja 0.01% JavaScript 1.07% TypeScript 50.75% Batchfile 0.03%

aws-analytics-reference-architecture's Introduction

AWS Analytics Reference Architecture

Note this project is deprecated in favor of the AWS Data Solutions Framework. AWS DSF provides not only examples but also components that can be directly reused by AWS partners and customers. Popular constructs from this project are being migrated step by step into AWS DSF.

The AWS Analytics Reference Architecture is a set of analytics solutions put together as end-to-end examples. It gathers AWS best practices for designing, implementing, and operating analytics platforms through different purpose-built patterns that handle common requirements and solve customers' challenges.

This project is composed of:

  • Reusable core components exposed in an AWS CDK (Cloud Development Kit) library currently available in TypeScript and Python. This library contains AWS CDK constructs that can be used to quickly provision analytics solutions in demos, prototypes, proofs of concept, and end-to-end reference architectures.
  • Reference architectures consuming the reusable components to demonstrate end-to-end examples in a business context. Currently, the AWS native reference architecture is available.

This repository contains the codebase and getting started instructions for both.

Contributing

Please refer to the contributing guidelines and contributing FAQ for details.

License Summary

The documentation is made available under the Creative Commons Attribution-ShareAlike 4.0 International License. See the LICENSE file.

The sample code within this documentation is made available under the MIT-0 license. See the LICENSE-SAMPLECODE file.

aws-analytics-reference-architecture's People

Contributors

alexvt-amz, anady208, asenousy, bouhajer, dairiley, dependabot[bot], dzeno, flochaz, heitorlessa, ijemmy, jeromevdl, lmouhib, lxg, rubenhgaws, sercankaraoglu, vgkowski, yadavzjy


aws-analytics-reference-architecture's Issues

DataLakeStorage construct doesn't check minimum days for transitioning rules

The DataLakeStorage construct provides a default configuration for transitioning objects to different Amazon S3 storage classes. This configuration can be customized, but there is no check that the service constraints are respected.

Constraints are listed here

Errors are currently raised at deployment time but should be raised at synthesis time to avoid an AWS CloudFormation deploy and rollback.
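As an illustration, a synth-time check could look like the following sketch. The function name and the minimum-day table are assumptions for illustration (the values reflect the documented 30-day minimums around the infrequent access storage classes), not the construct's actual API.

import { Duration } from 'aws-cdk-lib';
import { StorageClass } from 'aws-cdk-lib/aws-s3';

// Hypothetical minimum-days table derived from the S3 lifecycle constraints.
const MIN_TRANSITION_DAYS: Record<string, number> = {
  [StorageClass.INFREQUENT_ACCESS.value]: 30,
  [StorageClass.ONE_ZONE_INFREQUENT_ACCESS.value]: 30,
};

// Throw during synthesis instead of letting CloudFormation fail and roll back.
export function checkTransitionDelay(storageClass: StorageClass, after: Duration): void {
  const min = MIN_TRANSITION_DAYS[storageClass.value];
  if (min !== undefined && after.toDays() < min) {
    throw new Error(`Transition to ${storageClass.value} requires at least ${min} days, got ${after.toDays()}`);
  }
}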

Data generation for stream targets

The BatchReplayer only generates data in batch or micro-batch and writes to batch-type targets. We need to allow writing to streams, including Amazon Kinesis Data Streams and Amazon MSK.

If the BatchReplayer cannot be used, we should implement a new, long-running construct.

Upgrade EKS version to 1.23

The current EKS version will soon be deprecated following the EKS release cycle. The EMR on EKS construct should be upgraded accordingly.

Remove the last AWS Glue Crawler in the batch pipeline

The last AWS Glue crawler in the batch pipeline was implemented to register the table with the parquet classification format instead of the glueparquet format. According to the documentation, the glueparquet format was not usable as a source, but it is actually compatible with parquet sources.

We need to remove the Crawler from the pipeline:

And update the documentation accordingly:
https://aws-samples.github.io/aws-analytics-reference-architecture/solutions/data-preparation/#keeping-the-data-catalog-up-to-date

Proper implementation for singleton resources

Currently, singleton resources (obtained via getOrCreate methods) are created in the current stack (parent or nested). This method only checks for existence in the nested stack or the parent stack, but not in the other nested stacks. For example, if you create a DataLakeStorage in one nested stack and an EmrEksCluster in another, both will try to create an s3-access-logs bucket and the deployment will fail because the bucket name is not unique.

Multiple solutions are available (a sketch of the first option follows the list):

  • Create singleton resources in the top level stack and search for it in top level stack with getOrCreate
  • Create singleton resources in a dedicated nested stack for common resources and search in it with getOrCreate
  • Create singleton resources in the nested stack of the first call and search across all the nested stacks with getOrCreate

In all cases, customizing resource policies across (nested) stacks is required for the S3CrossAccount construct.
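A minimal sketch of the first option, assuming a hypothetical getOrCreateAccessLogsBucket helper (names are illustrative, not the library's actual API):

import { Stack } from 'aws-cdk-lib';
import { Bucket } from 'aws-cdk-lib/aws-s3';
import { Construct } from 'constructs';

// Anchor the singleton on the top-level stack so every nested stack resolves the same bucket.
export function getOrCreateAccessLogsBucket(scope: Construct): Bucket {
  // Walk up from the current stack to the top-level (non-nested) stack.
  let stack = Stack.of(scope);
  while (stack.nestedStackParent) {
    stack = stack.nestedStackParent;
  }
  const id = 'S3AccessLogsBucket';
  return (stack.node.tryFindChild(id) as Bucket) ?? new Bucket(stack, id);
}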

Data generator should provide BYOD feature

Currently, the BatchReplayer consumes a PreparedDataset to generate data. We can provide a new construct that prepares the data for replay during provisioning of the CDK application.

This construct can take a source dataset as an input parameter and run a synchronous AWS Glue job to modify the dataset and make it consumable by the BatchReplayer.

Prerequisites for the BatchReplayer are listed in the PreparedDataset construct documentation.

A PreparedDataset has the following properties:

Add some quality checks to prevent the preparation from failing.

Ensure the PySpark script is packaged into the core library.

Implement a Dwh construct to abstract Redshift and helpers

Currently, there is no Dwh abstraction in the Analytics Reference Architecture, and you need to provision the Redshift L2 construct with lots of additional resources to get a complete and usable setup.

We can implement a new construct called Dwh which bundles all the required resources with sensible but customizable defaults. It can be composed of:

  • Redshift cluster with a default parameter group (multiple priority queues, concurrency scaling, ...)
  • KMS key for data encryption
  • Bucket for logs (should be an AraBucket singleton 'redshift-access-logs')
  • Default WLM configuration
  • IAM role for Spectrum
  • Schema pointing to Glue catalog
  • Default users for ETL, data engineer, data analyst, data scientist

Points to be investigated:

  • L2 alpha versus L1 Cfn based construct
  • Need for SSL enforcement
  • IAM authentication vs secret manager username/password

LfS3Location construct not compatible with cross account bucket

The LfS3Location construct requires a Bucket object and uses grantReadWrite to authorize the Lake Formation role to access data. This requires the bucket to be managed in the same CDK stack, and so it does not work in a cross-account setup.

We need to refactor the interface to take an S3Location and an IKey (the KMS key used for the bucket encryption) and then create the IAM permissions manually (without using grantReadWrite).
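A minimal sketch of what the refactored permissions could look like; the props interface and helper name are hypothetical, and the actions shown are only the obvious S3/KMS ones:

import { PolicyStatement, Role, ServicePrincipal } from 'aws-cdk-lib/aws-iam';
import { IKey } from 'aws-cdk-lib/aws-kms';
import { Construct } from 'constructs';

export interface LfS3LocationProps {
  readonly bucketName: string;   // can live in another account
  readonly objectKey: string;
  readonly kmsKey: IKey;         // key used for the bucket encryption
}

// Build the Lake Formation role with explicit statements instead of grantReadWrite,
// so the bucket does not need to be managed in the same CDK stack.
export function buildLfLocationRole(scope: Construct, id: string, props: LfS3LocationProps): Role {
  const role = new Role(scope, id, {
    assumedBy: new ServicePrincipal('lakeformation.amazonaws.com'),
  });
  role.addToPolicy(new PolicyStatement({
    actions: ['s3:GetObject', 's3:PutObject', 's3:DeleteObject', 's3:ListBucket'],
    resources: [
      `arn:aws:s3:::${props.bucketName}`,
      `arn:aws:s3:::${props.bucketName}/${props.objectKey}/*`,
    ],
  }));
  role.addToPolicy(new PolicyStatement({
    actions: ['kms:Decrypt', 'kms:Encrypt', 'kms:GenerateDataKey*'],
    resources: [props.kmsKey.keyArn],
  }));
  return role;
}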

BatchReplayer should provide more database targets

Currently, the BatchReplayer generates data in Amazon S3. We can provide more options including:

  • Amazon RDS MySQL
  • Amazon RDS Postgres
  • Amazon Aurora
  • Amazon Redshift
  • Amazon DynamoDB

The BatchReplayer relies on AWS Data Wrangler, so it should be easy to provide more targets by modifying the AWS Lambda function responsible for writing the data:
https://github.com/aws-samples/aws-analytics-reference-architecture/blob/main/core/src/data-generator/resources/lambdas/write-in-batch/write-in-batch.py

We need to change the construct interface (see the sketch after this list):

  • sinkBucket needs to be optional
  • sinkObjectKey should not be defined if sinkBucket is undefined
  • Add database targets with new optional parameters (if possible, use CDK interfaces rather than ARN strings)
  • Add authentication method for the target database
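A hypothetical props sketch of these changes (property names are illustrative, not the actual construct interface):

import { ITable } from 'aws-cdk-lib/aws-dynamodb';
import { IDatabaseCluster } from 'aws-cdk-lib/aws-rds';
import { IBucket } from 'aws-cdk-lib/aws-s3';
import { ISecret } from 'aws-cdk-lib/aws-secretsmanager';

export interface BatchReplayerTargetProps {
  readonly sinkBucket?: IBucket;            // now optional
  readonly sinkObjectKey?: string;          // only meaningful when sinkBucket is set
  readonly auroraTarget?: IDatabaseCluster; // CDK interface rather than an ARN string
  readonly dynamoDbTarget?: ITable;
  readonly databaseCredentials?: ISecret;   // authentication for the database targets
}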

BatchReplayer should generate part of the dataset during deployment

The BatchReplayer currently replays the dataset from scratch. Sometimes we just need data to be in the target and don't want to wait for each batch/micro-batch to generate new data.
We should add a parameter to the construct to write a percentage of the dataset during the provisioning step, so that part of the data is already in the target when the CDK application is deployed.

Upgrade to node 16

Node 14 is deprecated; we need to upgrade our GitHub Actions accordingly.

Multiple managed endpoints failing

The first managed endpoint is created using the "tooling" EKS managed node group, which is amd64-based. However, additional managed endpoints are provisioned using the "shared-0" EKS managed node group, which is arm64-based, causing JEG to fail provisioning.

Warning Failed 7m25s (x2 over 7m41s) kubelet Failed to pull image "755674844232.dkr.ecr.us-east-1.amazonaws.com/notebook-jeg/emr-6.7.0:latest": rpc error: code = Unknown desc = no matching manifest for linux/arm64/v8 in the manifest list entries
Normal BackOff 2m36s (x22 over 7m40s) kubelet Back-off pulling image "755674844232.dkr.ecr.us-east-1.amazonaws.com/notebook-jeg/emr-6.7.0:latest"

Opensearch L3 construct

Currently, we don't have an easy way to set up an OpenSearch domain with fine-grained access control, as it requires running configuration requests against the cluster API after the cluster is created. The common approach is to use a custom resource to make the API requests on the cluster endpoint, like in the AWS native refarch streaming module.

I propose building an L3 construct that provides methods for common cluster configuration with fine-grained access control, including:

  • create cluster with IAM master role
  • update the domain configuration and enable internal database users
  • create internal users mapped with opensearch roles
  • create IAM role/user mappings with opensearch roles
  • create opensearch roles
  • create index mappings
  • create a rolling index strategy

The design would be similar to the EmrEksCluster, which provides two methods for adding virtual clusters and managed endpoints:

public addEmrVirtualCluster(scope: Construct, options: EmrVirtualClusterOptions): CfnVirtualCluster {

public addManagedEndpoint(scope: Construct, id: string, options: EmrManagedEndpointOptions) {

Methods need to rely on custom resources and follow the CloudFormation resource lifecycle (create, update, delete). The Lambda behind the custom resource uses the master role to perform these tasks.
The custom resource should rely on the PreBundledFunction and PreBundledLayer to be sure everything is packaged in the construct and available at synth/deploy time.
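A sketch of what one of these methods could look like, assuming a custom resource provider already wired to a PreBundledFunction; the class and option names are illustrative:

import { CustomResource } from 'aws-cdk-lib';
import { Construct } from 'constructs';

export interface OpensearchRoleMappingOptions {
  readonly roleName: string;    // OpenSearch role to map
  readonly iamRoleArn: string;  // IAM principal mapped onto it
}

// Hypothetical construct shape; the provider token would come from a custom
// resource provider backed by a PreBundledFunction using the master role.
export class OpensearchCluster extends Construct {
  constructor(scope: Construct, id: string, private readonly providerToken: string) {
    super(scope, id);
  }

  // Each call becomes a custom resource following the CloudFormation lifecycle
  // (create/update/delete handled by the Lambda behind providerToken).
  public addRoleMapping(id: string, options: OpensearchRoleMappingOptions): CustomResource {
    return new CustomResource(this, id, {
      serviceToken: this.providerToken,
      properties: { RoleName: options.roleName, IamRoleArn: options.iamRoleArn },
    });
  }
}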

Improve costs for the data generators

Currently the data generator reads source data from the AWS Analytics Reference Architecture public bucket in eu-west-1. This design generates unnecessary costs with data transfer out when the data generator is deployed in a different region.

We need to implement 2 features to solve this:

  • Use requester pays to ensure the consumer of the AWS Analytics Reference Architecture pays for its usage. For this, the data generators need to be updated to support requester pays:

    1. The Athena-based data generator: a new Athena workgroup needs to be created and configured for requester pays
    2. The Lambda-based batch replayer using AWS Data Wrangler: it supports the requester pays feature in the Lambdas here, here and here
  • Implement Cross-Region Replication from the source bucket in eu-west-1 to the major regions and adapt the data generator to pick the source in the same region it is deployed in

feat (cdk): Move to cdk v2

Current ARA constructs are delivered as L3 constructs based on CDK v1, which is on the path to deprecation.

We need to migrate to v2.

Create a CloudFormation template to deploy CDK applications in limited environments

CDK applications are not always supported by transient environments. We need to provide a CloudFormation template that will synthesize and deploy CDK applications.

The bootstrap stack can use:

  • CodeBuild to:
    • Install deps
    • Build the CDK app
    • Bootstrap the account in the region
    • Deploy the CDK app
    • Respond to CloudFormation custom resource with deployment status
  • A custom resource to:
    • Trigger the CodeBuild and pass the response URL for CloudFormation custom resource

Example here

Migrate the AWS native reference architecture on the core components

The current version of the AWS native reference architecture uses L1 and L2 constructs from AWS CDK. The objective is to migrate part of the batch module of the reference architecture to L3 constructs from the core components, including:

  • DataLakeStorage
  • DataLakeCatalog
  • DataGenerator

This would simplify the maintenance of the reference architecture as it will automatically benefit from new L3 Constructs versions and features.

Introduce DevOps with dev/staging/prod accounts into the architecture

There's already a notion of CI/CD in the AWS native reference architecture stack deployment via CICD. What's missing is an explanation of how to promote code between environments, e.g. from dev to staging to prod, akin to the ML platform reference architecture and existing reusable MLOps templates in SageMaker.

In particular, the ML platform architecture states the following about the data management account, which I believe should also be a concern for the analytics reference architecture:

Data management account — While data management is outside of this document's scope, it is recommended to have a separate data management AWS account that can feed data to the various machine learning workload or business unit accounts and is accessible from those accounts. Similar to the Shared Services account, data management also should have multiple environments for the development and testing of data services.

So, the ask is to introduce these multiple accounts into the architecture.

It would also be great to have some sample code, e.g. simple Java code for a Spark ETL job, that a developer builds into a JAR file in the dev account with CodeBuild, deploys to the test/staging account with CodePipeline, and approves/triggers for deployment into the prod account.

Use projen to generate Github workflows

Currently, Projen is not used to generate the YAML configuration files for GitHub workflows because the Projen config was not able to write to a folder other than the Projen root. Statically defined YAML for workflows adds extra maintenance and is error-prone if something changes in a new Projen version.

We should change to dynamic YAML files generated by Projen but written in the repo root folder rather than the Projen root folder.

AWS native refarch cannot be deployed in AWS accounts with Lake Formation enabled

Deploying the AWS native refarch in an account with Lake Formation enabled fails because the CloudFormation execution role is not granted permissions to create Glue resources in Lake Formation. In this setup, IAM permissions are no longer used by Glue.

The workaround is to grant Lake Formation permissions to the IAM role used by CDK. By default, the IAM role used by CDK is common to all CDK applications deployed in an AWS account and is created when bootstrapping the account with cdk bootstrap. This role can be found in the default CDKToolkit stack in the CloudFormation console (cdk-xxxxxxx-cfn-exec-role-<ACCOUNT_ID>-). We should document this workaround in the getting started guide.

The long term solution is to use a custom bootstrap with:

  • A custom qualifier to scope the custom bootstrap to AWS Analytics Reference Architecture. To ensure the qualifier is passed to all the stacks, we should probably create a new Stack type (AraStack)
  • A custom bootstrap CloudFormation template granting Lake Formation permissions to the CDK execution role via an AWS::LakeFormation::PrincipalPermissions resource (see the sketch below)
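A minimal sketch of what that grant could look like when expressed in CDK; the helper name is hypothetical, and the execution role ARN follows the default bootstrap naming convention, which depends on the qualifier actually used:

import { Stack } from 'aws-cdk-lib';
import { CfnPrincipalPermissions } from 'aws-cdk-lib/aws-lakeformation';
import { Construct } from 'constructs';

// Grant the CDK execution role database creation rights in Lake Formation.
export function grantCdkExecRole(scope: Construct, qualifier: string): CfnPrincipalPermissions {
  const stack = Stack.of(scope);
  // Assumed role name, following the cdk bootstrap convention shown above.
  const execRoleArn = `arn:aws:iam::${stack.account}:role/cdk-${qualifier}-cfn-exec-role-${stack.account}-${stack.region}`;
  return new CfnPrincipalPermissions(stack, 'CdkExecLfGrant', {
    principal: { dataLakePrincipalIdentifier: execRoleArn },
    resource: { catalog: {} },
    permissions: ['CREATE_DATABASE'],
    permissionsWithGrantOption: [],
  });
}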

feat: Scope the execution role produced by `EmrEksCluster.createExecutionRole`

Currently, the createExecutionRole method creates a role with a trust relationship to the EKS cluster OIDC provider. The method should provide the ability to further scope down the role by adding a condition on the namespace to which the virtual cluster belongs. See the example below.

"Condition": { "StringLike": { "<OIDC_PROVIDER>:sub": "system:serviceaccount:<NAMESPACE>:*" } }

Deidentify Data

Create the architecture and components for de-identifying data.

Typo in construct id

rule.addTarget(new targets.EventBus(
      EventBus.fromEventBusArn(
        this,
        '${id}DomainEventBus',
        dataDomainBusArn
      )),
    );

Typo in the construct ID: single quotes are used instead of backticks for string interpolation, so `${id}` is never resolved.
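For reference, the corrected call would use a template literal (same identifiers as in the snippet above):

rule.addTarget(new targets.EventBus(
      EventBus.fromEventBusArn(
        this,
        `${id}DomainEventBus`,   // backticks so ${id} is actually interpolated
        dataDomainBusArn
      )),
    );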

Output the Execution role arn

The current method that adds the execution role does not output the ARN of the role it creates. It would be good to output the role ARN along with its name as a <key, value> pair.
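A minimal sketch of how the role could be surfaced through stack outputs; the output IDs and helper name are illustrative, not the construct's actual behavior:

import { CfnOutput } from 'aws-cdk-lib';
import { IRole } from 'aws-cdk-lib/aws-iam';
import { Construct } from 'constructs';

// Emit the created role as a <name, ARN> pair via CloudFormation outputs.
export function outputExecutionRole(scope: Construct, role: IRole): void {
  new CfnOutput(scope, 'ExecutionRoleName', { value: role.roleName });
  new CfnOutput(scope, 'ExecutionRoleArn', { value: role.roleArn });
}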

Reporting a vulnerability

Hello!

I hope you are doing well!

We are a security research team. Our tool automatically detected a vulnerability in this repository. We want to disclose it responsibly. GitHub has a feature called Private vulnerability reporting, which enables security researchers to privately disclose a vulnerability. Unfortunately, it is not enabled for this repository.

Can you enable it, so that we can report it?

Thanks in advance!

PS: you can read about how to enable private vulnerability reporting here: https://docs.github.com/en/code-security/security-advisories/repository-security-advisories/configuring-private-vulnerability-reporting-for-a-repository

BatchReplayer sink_object_key Property returns "None" if not initialised

Although there is a default value for sink_object_key, if I don't specify it, the property returns the string "None".

data_generator = ara.BatchReplayer(
  scope=self,
  id="customer-data",
  dataset=ara.PreparedDataset.RETAIL_1_GB_CUSTOMER,
  sink_bucket=storage.raw_bucket,
)

crawler = glue.CfnCrawler(
    self,
    id='ara-crawler',
    name='ara-crawler', 
    role=glue_role.iam_role.role_arn,
    database_name='raw',
    targets={
        's3Targets': [{"path": f"s3://{storage.raw_bucket.bucket_name}/{data_generator.sink_object_key}/"}],
    }
)

This crawler target will be rendered as:

s3://raw-<account_id>-<region_id>/None/

Instead, IMO it should return the default key.

Missing jar file for the FlywayRunner during unit and e2e tests

When running tests, Jest doesn't find the JAR file for the Flyway runner Lambda. The JAR file is only generated in the build phase via a custom Projen task using Gradle.

How to reproduce the issue:

  • Run jest --group=integ/redshift/flyway-runner

  • Jest generates this error: Cannot find asset at /.../aws-analytics-reference-architecture/core/src/db-schema-manager/resources/flyway-lambda/flyway-all.jar

Add support for EMR Serverless

EMR Serverless is GA now. It would be good to have support for EMR Serverless in addition to the EMR on EKS cluster construct.

Change NotebookPlatform from a nested stack to a construct

Currently, NotebookPlatform extends NestedStack to ensure we don't reach the 500-resource limit in one CloudFormation stack. Because of recent optimizations on the constructs, we can now extend Construct and let the consumer decide whether to deploy in a nested stack. Additionally, it would solve the documentation issue where the NestedStack exportValue function breaks the entire library documentation.

Add observability feature

Provide a construct that bootstraps managed Prometheus and Grafana to be used by other ARA constructs.

Setting the disk size for the nodes

It is not possible to use the diskSize property to set the disk size for the nodes because addEksNodeGroup creates a LaunchTemplate (and does not expose it for changes), and you can't have both diskSize and a LaunchTemplate defined.
When trying to add diskSize: 100, you get the following error at synth time:
Error: diskSize must be specified within the launch template
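Since the error says the disk size must live on the launch template, a possible direction is for the construct to declare it there. A hedged sketch follows; the helper name, gp3 volume type, and /dev/xvda device name are assumptions, not the construct's actual behavior:

import { CfnLaunchTemplate } from 'aws-cdk-lib/aws-ec2';
import { Construct } from 'constructs';

// Declare the node disk size on the launch template itself, since diskSize
// cannot be combined with a launch template on the node group.
export function nodeLaunchTemplate(scope: Construct, id: string, diskSizeGb: number): CfnLaunchTemplate {
  return new CfnLaunchTemplate(scope, id, {
    launchTemplateData: {
      blockDeviceMappings: [{
        deviceName: '/dev/xvda',                              // assumed root device for the EKS AMI
        ebs: { volumeSize: diskSizeGb, volumeType: 'gp3' },   // assumed volume type
      }],
    },
  });
}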

Glue crawler skips tables for some prepared TPCDS datasets

The Glue crawler skips tables with the error "Multiple tables are found under location" for some prepared datasets, including:

  • aws-analytics-reference-architecture/datasets/retail/1GB/web-sale
  • aws-analytics-reference-architecture/datasets/retail/1GB/store-sale

Partition names seem well structured, and files are of the same format. I tried keeping only a single file for testing and removing the rest, but the problem persists.

This is not the case for the /customer dataset.

Minimize costs of BatchReplayer with PreparedDatasets in multiple regions

Currently, PreparedDatasets are consumed from a single region (eu-west-1), which generates cross-region data transfer costs when provisioning a BatchReplayer in a different region.

We need to make the S3 location of the PreparedDataset dynamic so it can point to a local S3 location if one exists. Datasets can be replicated with Amazon S3 Cross-Region Replication.
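A hedged sketch of such a region-aware lookup; the replicated region list and the regional bucket naming convention are assumptions, and the stack region must be concrete (not a token) for this to work:

import { Stack } from 'aws-cdk-lib';
import { Construct } from 'constructs';

// Hypothetical list of regions where the dataset bucket has been replicated.
const REPLICATED_REGIONS = ['us-east-1', 'us-west-2', 'ap-southeast-1'];

// Resolve the dataset bucket per region, falling back to the original eu-west-1 bucket.
export function datasetBucketName(scope: Construct): string {
  const region = Stack.of(scope).region;
  return REPLICATED_REGIONS.includes(region)
    ? `aws-analytics-reference-architecture-${region}`
    : 'aws-analytics-reference-architecture';
}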

Cross account id for `LakeFormationS3Location`

In the data mesh CentralGovernance construct, the domain account ID is not passed to LakeFormationS3Location, hence the account ID from the central account is used for the KMS key policy.

new LakeFormationS3Location(this, `${id}LFLocation`, {
      s3Location: {
        bucketName: domainBucket,
        objectKey: domainPrefix,
      },
      kmsKeyId: domainKey,
    });

Fix: add an accountId parameter for the registered data domain.

DataGenerator fails in some regions

The DataGenerator relies on the AWS Data Wrangler Lambda layer, and the layer ARN is not consistent across regions.

Reproduce: deploy a DataGenerator in eu-west-2 and you will get this error:
Resource handler returned message: "User: arn:aws:sts::1111111111111:assumed-role/Admin/myRole is not authorized to perform: lambda:GetLayerVersion on resource: arn:aws:lambda:eu-west-2:336392948345:layer:AWSDataWrangler-Python38:6 because no resource-based policy allows the lambda:GetLayerVersion action

Workaround:

NotebookPlatform construct potentially building resources ID with tokens

The addUser() method in the NotebookPlatform construct uses the identity_name parameter in some resource IDs. If the username is a token that is resolved at deploy time, CDK fails.

Here is a typical example that is failing:

    notebook_user = iam.User(self, 'NotebookUser', user_name='my-user')

    # Notebook to user association
    exec_roles = notebook_platform.add_user([ara.NotebookUserOptions(
        identity_name=notebook_user.user_name,
        notebook_managed_endpoints=[ ara.NotebookManagedEndpointOptions(
            emr_on_eks_version= ara.EmrVersion.V6_9,
            execution_policy= exec_policy,
            managed_endpoint_name="test"
        )])
    ])

Workaround: replace the identity_name value with the actual name you provide to the User construct, and add a CDK node dependency between them:

    notebook_user = iam.User(self, 'NotebookUser', user_name='my-user')

    # Notebook to user association
    exec_roles = notebook_platform.add_user([ara.NotebookUserOptions(
        identity_name='my-user',
        notebook_managed_endpoints=[ ara.NotebookManagedEndpointOptions(
            emr_on_eks_version= ara.EmrVersion.V6_9,
            execution_policy= exec_policy,
            managed_endpoint_name="test"
        )])
    ])

    # Get the Role created by the notebook platform
    exec_role=exec_roles[0]

    exec_role.node.add_dependency(notebook_user)
