aws-samples / aws-analytics-reference-architecture
The last AWS Glue crawler in the batch pipeline was implemented to register the table with the parquet classification format instead of the glueparquet format. According to the documentation, the glueparquet format was not usable as a source, but it is compatible with the parquet source.
We need to remove the Crawler from the pipeline:
And update the documentation accordingly:
https://aws-samples.github.io/aws-analytics-reference-architecture/solutions/data-preparation/#keeping-the-data-catalog-up-to-date
The first managed endpoint is created using the "tooling" EKS managed node group, which is amd64-based. However, additional managed endpoints are provisioned using the "shared-0" EKS managed node group, which is arm64-based, causing JEG to fail provisioning.
Warning  Failed   7m25s (x2 over 7m41s)   kubelet  Failed to pull image "755674844232.dkr.ecr.us-east-1.amazonaws.com/notebook-jeg/emr-6.7.0:latest": rpc error: code = Unknown desc = no matching manifest for linux/arm64/v8 in the manifest list entries
Normal   BackOff  2m36s (x22 over 7m40s)  kubelet  Back-off pulling image "755674844232.dkr.ecr.us-east-1.amazonaws.com/notebook-jeg/emr-6.7.0:latest"
Currently, PreparedDatasets are consumed from a single region (eu-west-1), which generates cross-region data transfer costs when provisioning a BatchReplayer in a different region.
We need to make the S3 location of the PreparedDataset dynamic so it can point to a local S3 location if one exists. Datasets can be replicated with Amazon S3 Cross-Region Replication.
The current SynchronousAthenaQuery construct has a workgroup hard-coded in it. The construct should allow using other workgroups while providing a default one if the user does not specify any.
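A minimal sketch of the proposed interface change, assuming a keyword parameter with a hard-coded fallback (the "primary" default and the property names are assumptions, not the construct's actual API):

from constructs import Construct

class SynchronousAthenaQuery(Construct):
    """Sketch: the workgroup becomes optional with a sensible default."""
    def __init__(self, scope: Construct, id: str, *, statement: str,
                 workgroup: str = "primary") -> None:  # assumed default name
        super().__init__(scope, id)
        self.statement = statement
        # Used when submitting the Athena query instead of the hard-coded value.
        self.workgroup = workgroup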
Provide a construct/method to automate the build of custom images to use in jobs or managed endpoint in EMR on EKS.
There's already a notion of CI/CD in AWS native reference architecture stack deployment via CICD. What's missing is that it doesn't explain how to promote code between environments, e.g. from dev to staging to prod, akin to the ML platform reference architecture and existing reusable MLOps templates in SageMaker.
In particular, ML platform architecture states the following about data management account that I believe should be a concern for the analytics reference architecture:
Data management account — While data management is outside of this document's scope, it is recommended to have a separate data management AWS account that can feed data to the various machine learning workload or business unit accounts and is accessible from those accounts. Similar to the Shared Services account, data management also should have multiple environments for the development and testing of data services.
So, the ask to introduce these multiple accounts into the architecture.
It would be also great to have some sample code, e. g. a simple Java code for Spark ETL, that a developer will build into a Jar file in the dev account with CodeBuild, deploy it to test/staging account with CodePipeline and approve/trigger the ETL to be deployed into the prod account.
CDK applications are not always supported by transient environments. We need to provide a CloudFormation template that will synthetize and deploy CDK applications.
The bootstrap stack can use:
Example here
Currently, the createExecutionRole method creates a role with a trust relationship with the EKS cluster OIDC provider. The method should provide the ability to further scope down the role by adding a condition on the namespace (for example) to which the virtual cluster belongs. See the example below.
"Condition": { "StringLike": { "<OIDC_PROVIDER>:sub": "system:serviceaccount:<NAMESPACE>:*" } }
Currently, we don't have any easy way to set up an OpenSearch domain with fine-grained access control, because it requires running some configuration requests on the cluster API after the cluster is created. The common approach is to use a custom resource to do the API request on the cluster endpoint, like in the AWS native refarch streaming module.
I propose to build an L3 construct that provides methods to do common cluster configuration with fine-grained access control, including:
The design would be similar to the EmrEksCluster construct, which provides 2 methods for adding virtual clusters and managed endpoints.
Methods need to rely on custom resources and follow the CloudFormation resource lifecycle (create, update, delete). The Lambda custom resource uses the master role to perform the other tasks.
The custom resource should rely on the PreBundledFunction and PreBundledLayer constructs to be sure everything is packaged in the construct and available at synth/deploy time. A sketch of such a handler is shown below.
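A minimal sketch of the custom resource handler, assuming the fine-grained access control configuration goes through the OpenSearch security REST API (the endpoint, role names, and payload are illustrative, and SigV4 signing with the master role is omitted for brevity):

import json
import urllib3

http = urllib3.PoolManager()

def on_event(event, context):
    # CloudFormation custom resource lifecycle: Create / Update / Delete.
    props = event["ResourceProperties"]
    if event["RequestType"] in ("Create", "Update"):
        # Example call: map an IAM role to an OpenSearch security role.
        http.request(
            "PUT",
            f"https://{props['DomainEndpoint']}/_plugins/_security/api/rolesmapping/all_access",
            body=json.dumps({"backend_roles": [props["RoleArn"]]}),
            headers={"Content-Type": "application/json"},
        )
    return {"PhysicalResourceId": f"fgac-{props['DomainEndpoint']}"}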
Currently, the BatchReplayer generates data in Amazon S3. We can provide more options, including:
The BatchReplayer relies on AWS Data Wrangler, so it should be easy to provide more targets by modifying the AWS Lambda function responsible for writing the data:
https://github.com/aws-samples/aws-analytics-reference-architecture/blob/main/core/src/data-generator/resources/lambdas/write-in-batch/write-in-batch.py
We need to change the construct interface:
- sinkBucket needs to be optional
- sinkObjectKey should not be defined if sinkBucket is undefined

Glue crawler skips tables with "Multiple tables are found under location"
for some prepared datasets including:
aws-analytics-reference-architecture/datasets/retail/1GB/web-sale
aws-analytics-reference-architecture/datasets/retail/1GB/store-sale
Partition names seem well structured, and the files are all of the same format. I tried keeping only a single file for testing and removing the rest, but the problem persists.
This is not the case for the /customer dataset.
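One mitigation worth trying (an assumption, not verified against these datasets) is forcing the crawler to combine compatible schemas into a single table through its Configuration property:

import json
from aws_cdk import aws_glue as glue

# Illustrative crawler with TableGroupingPolicy set so files with compatible
# schemas under the same prefix are grouped into a single table.
crawler = glue.CfnCrawler(
    self, "ara-crawler",  # hypothetical scope and id
    role=glue_role_arn,
    database_name="raw",
    targets={"s3Targets": [{"path": "s3://<BUCKET>/datasets/retail/1GB/web-sale/"}]},
    configuration=json.dumps({
        "Version": 1.0,
        "Grouping": {"TableGroupingPolicy": "CombineCompatibleSchemas"},
    }),
)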
Deploying the AWS native refarch in an account with Lake Formation enabled fails because the CloudFormation execution role is not granted permissions to create Glue resources in Lake Formation. In this setup, IAM permissions are not used anymore by Glue.
The workaround is to grant Lake Formation permissions to the IAM role used by CDK. By default, the IAM role used by CDK is common to all CDK applications deployed in an AWS account and is created when bootstrapping the account with cdk bootstrap. This role can be found in the default CDKToolkit stack in the CloudFormation console (cdk-xxxxxxx-cfn-exec-role-<ACCOUNT_ID>-). We should document this workaround in the getting started guide.
The long-term solution is to use a custom bootstrap with:
Currently, singleton resources (created with getOrCreate methods) are created in the current stack (parent or nested). This method only checks for existence in the nested stack or the parent stack, but not in the other nested stacks. For example, if you create a DataLakeStorage in one nested stack and an EmrEksCluster in another, both will try to create an s3-access-logs bucket and the deployment will fail because the bucket name is not unique. A sketch of a cross-nested-stack lookup is shown after this list.
Multiple solutions are available:
- getOrCreate
- getOrCreate
- getOrCreate
In all cases, customizing resource policies across (nested) stacks is required for the S3CrossAccount construct.
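A minimal sketch of one possible direction (an assumption, not the library's implementation): resolve the singleton against the top-level stack so that all nested stacks share the same instance:

from aws_cdk import Stack, aws_s3 as s3
from constructs import Construct

def get_or_create_access_logs_bucket(scope: Construct) -> s3.Bucket:
    # Climb out of nested stacks to the top-level stack.
    stack = Stack.of(scope)
    while stack.nested_stack_parent is not None:
        stack = stack.nested_stack_parent
    # Reuse the bucket if any (nested) stack already created it.
    existing = stack.node.try_find_child("S3AccessLogsBucket")
    if existing is not None:
        return existing
    return s3.Bucket(stack, "S3AccessLogsBucket")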
The addUser() method in the NotebookPlatform construct uses the identity_name parameter in some resource IDs. If the username is a token that is resolved at deploy time, CDK fails.
Here is a typical example that fails:
notebook_user = iam.User(self, 'NotebookUser', user_name='my-user')

# Notebook to user association
exec_roles = notebook_platform.add_user([ara.NotebookUserOptions(
    identity_name=notebook_user.user_name,
    notebook_managed_endpoints=[ara.NotebookManagedEndpointOptions(
        emr_on_eks_version=ara.EmrVersion.V6_9,
        execution_policy=exec_policy,
        managed_endpoint_name="test",
    )],
)])
Workaround: replace the identity_name value with the actual name you provide to the User construct and add a CDK node dependency between them:
notebook_user = iam.User(self, 'NotebookUser', user_name='my-user')

# Notebook to user association
exec_roles = notebook_platform.add_user([ara.NotebookUserOptions(
    identity_name='my-user',
    notebook_managed_endpoints=[ara.NotebookManagedEndpointOptions(
        emr_on_eks_version=ara.EmrVersion.V6_9,
        execution_policy=exec_policy,
        managed_endpoint_name="test",
    )],
)])

# Get the Role created by the notebook platform
exec_role = exec_roles[0]
exec_role.node.add_dependency(notebook_user)
rule.addTarget(new targets.EventBus(
    EventBus.fromEventBusArn(
        this,
        '${id}DomainEventBus',
        dataDomainBusArn
    )),
);
Typo in the construct id: single quotes ('') are used instead of backticks (`) for the template-literal interpolation, so the construct id is the literal string ${id}DomainEventBus instead of the interpolated value.
The current method to add the execution role does not output the ARN of the role it creates. It would be good to output the role ARN along with its name as a <key, value> pair.
Currently, the only PreparedDataset available is a retail dataset derived from TPC-DS. We need to provide datasets for:
Add the ability to submit an EMR on EKS job template with a method.
The DataLakeStorage initializer has props as required. However, all of the props parameters are optional.
Proposed change: make props in the DataLakeStorage initializer optional.
The LfS3Location construct requires a Bucket object and uses grantReadWrite to authorize the Lake Formation role to access data. This requires the bucket to be managed in the same CDK stack, so it does not work in a cross-account setup.
We need to refactor the interface to take an S3Location and an IKey (the KMS key used for the bucket encryption) and then create the IAM permissions manually (not using grantReadWrite), as sketched below.
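A minimal sketch of the manual grants, assuming the role receives an S3 location and the key ARN rather than a Bucket object (the function name and parameters are illustrative):

from aws_cdk import aws_iam as iam

def grant_lf_role_access(lf_role: iam.IRole, bucket_name: str,
                         object_key: str, key_arn: str) -> None:
    # Build the S3 permissions from the location, so the bucket can live in
    # another account or stack.
    lf_role.add_to_principal_policy(iam.PolicyStatement(
        actions=["s3:GetObject", "s3:PutObject", "s3:DeleteObject", "s3:ListBucket"],
        resources=[
            f"arn:aws:s3:::{bucket_name}",
            f"arn:aws:s3:::{bucket_name}/{object_key}/*",
        ],
    ))
    # Permissions on the KMS key used for the bucket encryption (passed as an ARN).
    lf_role.add_to_principal_policy(iam.PolicyStatement(
        actions=["kms:Decrypt", "kms:Encrypt", "kms:GenerateDataKey*", "kms:DescribeKey"],
        resources=[key_arn],
    ))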
The DataLakeStorage construct provides a default configuration for transitioning objects to different Amazon S3 storage classes, and you can customize it, but there is no check that the service constraints are respected.
Constraints are listed here
Errors are raised at deployment time but should be raised at synth time to avoid an AWS CloudFormation deploy and rollback. A validation sketch follows.
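A minimal synth-time validation sketch, assuming the documented S3 constraint that objects must remain at least 30 days in a storage class before the next transition (the parameter names are illustrative):

def validate_lifecycle_transitions(infrequent_access_days: int, archive_days: int) -> None:
    # Raise at synth time instead of letting CloudFormation fail and roll back.
    if infrequent_access_days < 30:
        raise ValueError("Transition to Standard-IA requires at least 30 days in Standard")
    if archive_days < infrequent_access_days + 30:
        raise ValueError("Objects must stay at least 30 days in Standard-IA before archiving")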
Create the architecture and components for de-identifying data.
The EKS version is soon to be deprecated following the EKS release cycle. The EMR on EKS construct should be upgraded.
Current ARA constructs are delivered as L3 constructs based on CDK v1, which is on the path to deprecation.
We need to migrate to v2.
The current AWS native ref arch is using CDK 1.134.0 and AWS Analytics Reference Architecture 1.11.0.
We need to upgrade CDK and ARA to v2.
Depends on #337
Without the need for a UI, can ARA provide all the building blocks to build a data mesh in a headless fashion?
The current managed endpoint custom resource fails to deploy due to a type mismatch: expecting dict but receiving string.
When running tests, Jest doesn't find the JAR file from the Flyway runner Lambda. The JAR file is only generated in the build phase via a custom Projen task using Gradle.
How to reproduce the issue:
Run jest --group=integ/redshift/flyway-runner
Jest generates this error: Cannot find asset at /.../aws-analytics-reference-architecture/core/src/db-schema-manager/resources/flyway-lambda/flyway-all.jar
Amazon Kinesis Data Analytics is used to process the stream of data and to ingest it into Amazon Elasticsearch Service, but the service is not present in the diagram.
The BatchReplayer only generates data in batch or micro-batch and writes to batch-type targets. We need to allow writing to streams, including Amazon Kinesis Data Streams and Amazon MSK; a sketch of a possible stream write path is shown below.
If the BatchReplayer cannot be used, we should implement a new, long-running construct.
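A minimal sketch of a stream target for the replayer Lambda, assuming each batch of records is pushed to Kinesis Data Streams with boto3 (the function name and record shape are illustrative):

import json
import boto3

kinesis = boto3.client("kinesis")

def write_batch_to_stream(records: list[dict], stream_name: str) -> None:
    # Push one micro-batch of replayed records to the stream.
    kinesis.put_records(
        StreamName=stream_name,
        Records=[
            {"Data": json.dumps(r).encode("utf-8"), "PartitionKey": str(i)}
            for i, r in enumerate(records)
        ],
    )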
The DataGenerator relies on the AWS Data Wrangler Lambda layer, and its ARN is not consistent across regions.
To reproduce: deploy a DataGenerator in eu-west-2 and you will get this error:
Resource handler returned message: "User: arn:aws:sts::1111111111111:assumed-role/Admin/myRole is not authorized to perform: lambda:GetLayerVersion on resource: arn:aws:lambda:eu-west-2:336392948345:layer:AWSDataWrangler-Python38:6 because no resource-based policy allows the lambda:GetLayerVersion action
Workaround:
- AWSDataWrangler-Python38:6 DataWrangler layer version is available
- AWSDataWrangler-Python39:1 DataWrangler layer version is available
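A possible shape for the workaround (the layer versions come from the list above, but the per-region pairing is an illustrative assumption):

# Region-aware resolution of the AWS Data Wrangler layer ARN. Account
# 336392948345 is the public Data Wrangler layer account from the error above.
DATA_WRANGLER_LAYER = {
    "eu-west-1": "AWSDataWrangler-Python38:6",
    "eu-west-2": "AWSDataWrangler-Python39:1",  # assumed pairing for illustration
}

def data_wrangler_layer_arn(region: str) -> str:
    return f"arn:aws:lambda:{region}:336392948345:layer:{DATA_WRANGLER_LAYER[region]}"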
Hello!
I hope you are doing well!
We are a security research team. Our tool automatically detected a vulnerability in this repository, and we want to disclose it responsibly. GitHub has a feature called private vulnerability reporting, which enables security researchers to privately disclose a vulnerability. Unfortunately, it is not enabled for this repository.
Can you enable it, so that we can report it?
Thanks in advance!
PS: you can read about how to enable private vulnerability reporting here: https://docs.github.com/en/code-security/security-advisories/repository-security-advisories/configuring-private-vulnerability-reporting-for-a-repository
Currently, the BatchReplayer consumes a PreparedDataset to generate data. We can provide a new construct to prepare the data for replay during the provisioning of the CDK application.
This construct can take a source dataset as an input parameter and run a synchronous AWS Glue job to modify the dataset and make it consumable by the BatchReplayer.
Prerequisites for the BatchReplayer are listed in the PreparedDataset construct documentation.
Add some quality checks to prevent the preparation from failing.
Ensure the PySpark script is packaged into the core library.
The BatchReplayer currently replays the dataset from scratch. Sometimes we just need data to be in the target and we don't want to wait for each batch/micro-batch to generate new data.
We should add a parameter to the construct to write a percentage of the dataset during the provisioning step, so that part of the data is already in the target when the CDK application is provisioned. See the sketch below.
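An illustrative usage sketch; the initial_load_percentage parameter is hypothetical and only shows the proposed shape of the API:

data_generator = ara.BatchReplayer(
    scope=self,
    id="customer-data",
    dataset=ara.PreparedDataset.RETAIL_1_GB_CUSTOMER,
    sink_bucket=storage.raw_bucket,
    initial_load_percentage=20,  # hypothetical: seed 20% of the dataset at deploy time
)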
Currently, NotebookPlatform extends NestedStack to ensure we don't reach the 500-resource limit in one CloudFormation stack. Because of recent optimizations on the constructs, we can now extend Construct and let the consumer decide whether to deploy in a nested stack or not. Additionally, it would solve the documentation issue where the NestedStack exportValue function breaks the entire lib documentation.
Hello,
I don't see information about the Kinesis Data Analytics application "KDA-application" (Streaming Analytics section) in the architecture diagram at https://aws-samples.github.io/aws-analytics-reference-architecture/high-level-design/architecture.
Provide a construct that bootstraps managed Prometheus and Grafana to be used by other ARA constructs.
Although there is a default value for sink_object_key, if I don't specify it, it returns the string "None".
data_generator = ara.BatchReplayer(
    scope=self,
    id="customer-data",
    dataset=ara.PreparedDataset.RETAIL_1_GB_CUSTOMER,
    sink_bucket=storage.raw_bucket,
)

crawler = glue.CfnCrawler(
    self,
    id='ara-crawler',
    name='ara-crawler',
    role=glue_role.iam_role.role_arn,
    database_name='raw',
    targets={
        's3Targets': [{"path": f"s3://{storage.raw_bucket.bucket_name}/{data_generator.sink_object_key}/"}],
    },
)
This crawler target will be rendered as:
s3://raw-<account_id>-<region_id>/None/
Instead, IMO it should return the default key.
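A sketch of the likely shape of the fix, assuming the bug comes from string-formatting an undefined value (the default-key expression is an assumption):

# Fall back to a real default key instead of rendering None into the path.
sink_object_key = (
    props.sink_object_key
    if props.sink_object_key is not None
    else self.dataset.table_name  # hypothetical default derived from the dataset
)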
Node 14 is deprecated; we need to upgrade our GitHub Actions accordingly.
The current version of the AWS native reference architecture is using L1 and L2 constructs from AWS CDK. The objective is to migrate part of the batch module of the reference architecture to L3 constructs from the core components, including:
This would simplify the maintenance of the reference architecture, as it will automatically benefit from new L3 construct versions and features.
Currently, Projen is not used to generate the YAML configuration files for GitHub workflows because the Projen config was not able to write in a different folder than the Projen root. Statically defined YAML for workflows adds extra maintenance and is error-prone if something changes in a new Projen version.
We should change to dynamic YAML files generated by Projen but written in the repo root folder and not in the Projen root folder.
It is not possible to use the diskSize property to set the disk size for the nodes, because addEksNodeGroup creates a LaunchTemplate (and does not expose it for changes) and you can't have both diskSize and a LaunchTemplate defined.
When trying to add diskSize: 100, you get the following error at synth time:
Error: diskSize must be specified within the launch template
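A possible workaround sketch, setting the volume size on a launch template directly (the construct id and device name are assumptions, and wiring it into addEksNodeGroup depends on the construct exposing the template):

from aws_cdk import aws_ec2 as ec2

# Since diskSize and a launch template are mutually exclusive, set the root
# volume size on the launch template itself.
launch_template = ec2.CfnLaunchTemplate(
    self, "NodeGroupLaunchTemplate",  # hypothetical id
    launch_template_data=ec2.CfnLaunchTemplate.LaunchTemplateDataProperty(
        block_device_mappings=[
            ec2.CfnLaunchTemplate.BlockDeviceMappingProperty(
                device_name="/dev/xvda",
                ebs=ec2.CfnLaunchTemplate.EbsProperty(volume_size=100),
            )
        ],
    ),
)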
We can implement a new CDK application demonstrating an end-to-end example of a data mesh on AWS
Use this util to assume a role and execute e2e testing as part of the build flow.
Currently, the EMR on EKS construct provides the ability to define the version of EKS but not of the cluster autoscaler. If Kubernetes version 1.22 is used, the deployment fails because the autoscaler version used is not correct.
In the data mesh CentralGovernance construct, the domain account id is not passed to LakeFormationS3Location; hence, the account id from the central account is used for the KMS policy.
new LakeFormationS3Location(this, `${id}LFLocation`, {
    s3Location: {
        bucketName: domainBucket,
        objectKey: domainPrefix,
    },
    kmsKeyId: domainKey,
});
Fix: add an accountId parameter for the registered data domain.
EMR Serverless is GA now. It would be good to have support for EMR Serverless in addition to the EMR on EKS cluster construct.
Currently, the data generator reads source data from the AWS Analytics Reference Architecture public bucket in eu-west-1. This design generates unnecessary data-transfer-out costs when the data generator is deployed in a different region.
We need to implement 2 features to solve this:
- Use requester pays to ensure the consumer of the AWS Analytics Reference Architecture pays for its usage. For this, the data generator needs to be updated to support requester pays, as sketched below.
- Implement Cross-Region Replication from the source bucket in eu-west-1 to the major regions and adapt the data generator to pick the source in the same region it's deployed in.
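A minimal sketch of the requester-pays change, assuming the data generator reads the source with boto3 (the object key is hypothetical): reads from a requester-pays bucket must set RequestPayer, otherwise S3 returns an access error.

import boto3

s3 = boto3.client("s3")
obj = s3.get_object(
    Bucket="aws-analytics-reference-architecture",
    Key="datasets/retail/1GB/store-sale/part-00000.parquet",  # hypothetical key
    RequestPayer="requester",  # the caller pays for the data transfer
)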
Currently, there is no DWH abstraction in the Analytics Reference Architecture, and you need to provision the Redshift L2 construct with lots of additional resources to get a complete and usable setup.
We can implement a new construct called Dwh which bundles all the required resources with defaults but is still customizable. It can be composed of:
- AraBucket (singleton 'redshift-access-logs')
Points to be investigated:
Error: There is already a Construct with name 'DomainSecret' in CentralGovernance
This is due to const domainSecret = Secret.fromSecretCompleteArn(this, 'DomainSecret', domainSecretArn); having the generic id DomainSecret.
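A minimal sketch of one possible fix, shown in CDK Python for consistency with the other examples here: derive the construct id from the caller's id so each domain gets a unique child id (the exact id scheme is an assumption):

from aws_cdk.aws_secretsmanager import Secret

# Unique per-domain construct id instead of the hard-coded 'DomainSecret'.
domain_secret = Secret.from_secret_complete_arn(
    self, f"{id}DomainSecret", domain_secret_arn
)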