aws-samples / aws-analytics-reference-architecture
The last AWS Glue crawler in the batch pipeline was implemented to register the table with the parquet classification format instead of the glueparquet format. According to the documentation, the glueparquet format was not usable as a source, but it is compatible with the parquet source.
We need to remove the Crawler from the pipeline:
And update the documentation accordingly:
https://aws-samples.github.io/aws-analytics-reference-architecture/solutions/data-preparation/#keeping-the-data-catalog-up-to-date
The first managed endpoint is created using the "tooling" EKS managed node group, which is amd64-based. However, additional managed endpoints are provisioned using the "shared-0" EKS managed node group, which is arm64-based, causing JEG to fail provisioning.
Warning  Failed   7m25s (x2 over 7m41s)   kubelet  Failed to pull image "755674844232.dkr.ecr.us-east-1.amazonaws.com/notebook-jeg/emr-6.7.0:latest": rpc error: code = Unknown desc = no matching manifest for linux/arm64/v8 in the manifest list entries
Normal   BackOff  2m36s (x22 over 7m40s)  kubelet  Back-off pulling image "755674844232.dkr.ecr.us-east-1.amazonaws.com/notebook-jeg/emr-6.7.0:latest"
Currently, PreparedDatasets are consumed from a single region (eu-west-1), which generates cross-region data transfer costs when provisioning a BatchReplayer in a different region.
We need to make the S3 location of the PreparedDataset dynamic so it can point to a local S3 location if one exists. Datasets can be replicated with Amazon S3 Cross-Region Replication.
The current SynchronousAthenaQuery construct has a workgroup hard-coded in it. The construct should allow using other workgroups while providing a default one if the user does not specify any.
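A minimal sketch of the proposed interface change, assuming a keyword parameter with a hard-coded fallback (the "primary" default and the property names are assumptions, not the construct's actual API):

from constructs import Construct

class SynchronousAthenaQuery(Construct):
    """Sketch: the workgroup becomes optional with a sensible default."""
    def __init__(self, scope: Construct, id: str, *, statement: str,
                 workgroup: str = "primary") -> None:  # assumed default name
        super().__init__(scope, id)
        self.statement = statement
        # Used when submitting the Athena query instead of the hard-coded value.
        self.workgroup = workgroup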
Provide a construct/method to automate the build of custom images to use in jobs or managed endpoint in EMR on EKS.
There's already a notion of CI/CD in AWS native reference architecture stack deployment via CICD. What's missing is that it doesn't explain how to promote code between environments, e.g. from dev to staging to prod, akin to the ML platform reference architecture and existing reusable MLOps templates in SageMaker.
In particular, ML platform architecture states the following about data management account that I believe should be a concern for the analytics reference architecture:
Data management account — While data management is outside of this document's scope, it is recommended to have a separate data management AWS account that can feed data to the various machine learning workload or business unit accounts and is accessible from those accounts. Similar to the Shared Services account, data management also should have multiple environments for the development and testing of data services.
So, the ask to introduce these multiple accounts into the architecture.
It would be also great to have some sample code, e. g. a simple Java code for Spark ETL, that a developer will build into a Jar file in the dev account with CodeBuild, deploy it to test/staging account with CodePipeline and approve/trigger the ETL to be deployed into the prod account.
CDK applications are not always supported by transient environments. We need to provide a CloudFormation template that will synthetize and deploy CDK applications.
The bootstrap stack can use:
Example here
Currently, the createExecutionRole method creates a role with a trust relationship with the EKS cluster OIDC provider. The method should provide the ability to further scope down the role by adding a condition on the namespace (for example) to which the virtual cluster belongs. See the example below.
"Condition": { "StringLike": { "<OIDC_PROVIDER>:sub": "system:serviceaccount:<NAMESPACE>:*" } }
Currently, we don't have any easy way to set up an OpenSearch domain with fine-grained access control, because it requires running some configuration requests on the cluster API after the cluster is created. The common approach is to use a custom resource to do the API request on the cluster endpoint, like in the AWS native refarch streaming module.
I propose to build an L3 construct that provides methods to do common cluster configuration with fine-grained access control, including:
The design would be similar to the EmrEksCluster construct, which provides 2 methods for adding virtual clusters and managed endpoints.
Methods need to rely on custom resources and follow the CloudFormation resource lifecycle (create, update, delete). The Lambda custom resource uses the master role to perform the other tasks.
The custom resource should rely on the PreBundledFunction and PreBundledLayer constructs to be sure everything is packaged in the construct and available at synth/deploy time. A sketch of such a handler is shown below.
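A minimal sketch of the custom resource handler, assuming the fine-grained access control configuration goes through the OpenSearch security REST API (the endpoint, role names, and payload are illustrative, and SigV4 signing with the master role is omitted for brevity):

import json
import urllib3

http = urllib3.PoolManager()

def on_event(event, context):
    # CloudFormation custom resource lifecycle: Create / Update / Delete.
    props = event["ResourceProperties"]
    if event["RequestType"] in ("Create", "Update"):
        # Example call: map an IAM role to an OpenSearch security role.
        http.request(
            "PUT",
            f"https://{props['DomainEndpoint']}/_plugins/_security/api/rolesmapping/all_access",
            body=json.dumps({"backend_roles": [props["RoleArn"]]}),
            headers={"Content-Type": "application/json"},
        )
    return {"PhysicalResourceId": f"fgac-{props['DomainEndpoint']}"}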
Currently, the BatchReplayer generates data in Amazon S3. We can provide more options, including:
The BatchReplayer relies on AWS Data Wrangler, so it should be easy to provide more targets by modifying the AWS Lambda function responsible for writing the data:
https://github.com/aws-samples/aws-analytics-reference-architecture/blob/main/core/src/data-generator/resources/lambdas/write-in-batch/write-in-batch.py
We need to change the construct interface:
- sinkBucket needs to be optional
- sinkObjectKey should not be defined if sinkBucket is undefined

Glue crawler skips tables with "Multiple tables are found under location"
for some prepared datasets including:
aws-analytics-reference-architecture/datasets/retail/1GB/web-sale
aws-analytics-reference-architecture/datasets/retail/1GB/store-sale
Partition names seem well structured, and the files are all of the same format. I tried keeping only a single file for testing and removing the rest, but the problem persists.
This is not the case for the /customer dataset.
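One mitigation worth trying (an assumption, not verified against these datasets) is forcing the crawler to combine compatible schemas into a single table through its Configuration property:

import json
from aws_cdk import aws_glue as glue

# Illustrative crawler with TableGroupingPolicy set so files with compatible
# schemas under the same prefix are grouped into a single table.
crawler = glue.CfnCrawler(
    self, "ara-crawler",  # hypothetical scope and id
    role=glue_role_arn,
    database_name="raw",
    targets={"s3Targets": [{"path": "s3://<BUCKET>/datasets/retail/1GB/web-sale/"}]},
    configuration=json.dumps({
        "Version": 1.0,
        "Grouping": {"TableGroupingPolicy": "CombineCompatibleSchemas"},
    }),
)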
Deploying the AWS native refarch in an account with Lake Formation enabled fails because the CloudFormation execution role is not granted permissions to create Glue resources in Lake Formation. In this setup, IAM permissions are not used anymore by Glue.
The workaround is to grant Lake Formation permissions to the IAM role used by CDK. By default, the IAM role used by CDK is common to all CDK applications deployed in an AWS account and is created when bootstrapping the account with cdk bootstrap. This role can be found in the default CDKToolkit stack in the CloudFormation console (cdk-xxxxxxx-cfn-exec-role-<ACCOUNT_ID>-). We should document this workaround in the getting started guide.
The long-term solution is to use a custom bootstrap with:
Currently, singleton resources (created with getOrCreate methods) are created in the current stack (parent or nested). This method only checks for existence in the nested stack or the parent stack, but not in the other nested stacks. For example, if you create a DataLakeStorage in one nested stack and an EmrEksCluster in another, both will try to create an s3-access-logs bucket and the deployment will fail because the bucket name is not unique. A sketch of a cross-nested-stack lookup is shown after this list.
Multiple solutions are available:
- getOrCreate
- getOrCreate
- getOrCreate
In all cases, customizing resource policies across (nested) stacks is required for the S3CrossAccount construct.
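A minimal sketch of one possible direction (an assumption, not the library's implementation): resolve the singleton against the top-level stack so that all nested stacks share the same instance:

from aws_cdk import Stack, aws_s3 as s3
from constructs import Construct

def get_or_create_access_logs_bucket(scope: Construct) -> s3.Bucket:
    # Climb out of nested stacks to the top-level stack.
    stack = Stack.of(scope)
    while stack.nested_stack_parent is not None:
        stack = stack.nested_stack_parent
    # Reuse the bucket if any (nested) stack already created it.
    existing = stack.node.try_find_child("S3AccessLogsBucket")
    if existing is not None:
        return existing
    return s3.Bucket(stack, "S3AccessLogsBucket")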
The addUser() method in the NotebookPlatform construct uses the identity_name parameter in some resource IDs. If the username is a token that is resolved at deploy time, CDK fails.
Here is a typical example that fails:
notebook_user = iam.User(self, 'NotebookUser', user_name='my-user')

# Notebook to user association
exec_roles = notebook_platform.add_user([ara.NotebookUserOptions(
    identity_name=notebook_user.user_name,
    notebook_managed_endpoints=[ara.NotebookManagedEndpointOptions(
        emr_on_eks_version=ara.EmrVersion.V6_9,
        execution_policy=exec_policy,
        managed_endpoint_name="test",
    )],
)])
Workaround: replace the identity_name value with the actual name you provide to the User construct and add a CDK node dependency between them:
notebook_user = iam.User(self, 'NotebookUser', user_name='my-user')

# Notebook to user association
exec_roles = notebook_platform.add_user([ara.NotebookUserOptions(
    identity_name='my-user',
    notebook_managed_endpoints=[ara.NotebookManagedEndpointOptions(
        emr_on_eks_version=ara.EmrVersion.V6_9,
        execution_policy=exec_policy,
        managed_endpoint_name="test",
    )],
)])

# Get the Role created by the notebook platform
exec_role = exec_roles[0]
exec_role.node.add_dependency(notebook_user)
rule.addTarget(new targets.EventBus(
    EventBus.fromEventBusArn(
        this,
        '${id}DomainEventBus',
        dataDomainBusArn
    )),
);
Typo in the construct id: single quotes ('') are used instead of backticks (`) for the template-literal interpolation, so the construct id is the literal string ${id}DomainEventBus instead of the interpolated value.
The current method to add the execution role does not output the ARN of the role it creates. It would be good to output the role ARN along with its name as a <key, value> pair.
Currently, the only PreparedDataset available is a retail dataset derived from TPC-DS. We need to provide datasets for:
Add the ability to submit an EMR on EKS job template with a method.
The DataLakeStorage initializer has props as required. However, all of the props parameters are optional.
Proposed change: make props in the DataLakeStorage initializer optional.
The LfS3Location construct requires a Bucket object and uses grantReadWrite to authorize the Lake Formation role to access data. This requires the bucket to be managed in the same CDK stack, so it does not work in a cross-account setup.
We need to refactor the interface to take an S3Location and an IKey (the KMS key used for the bucket encryption) and then create the IAM permissions manually (not using grantReadWrite), as sketched below.
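A minimal sketch of the manual grants, assuming the role receives an S3 location and the key ARN rather than a Bucket object (the function name and parameters are illustrative):

from aws_cdk import aws_iam as iam

def grant_lf_role_access(lf_role: iam.IRole, bucket_name: str,
                         object_key: str, key_arn: str) -> None:
    # Build the S3 permissions from the location, so the bucket can live in
    # another account or stack.
    lf_role.add_to_principal_policy(iam.PolicyStatement(
        actions=["s3:GetObject", "s3:PutObject", "s3:DeleteObject", "s3:ListBucket"],
        resources=[
            f"arn:aws:s3:::{bucket_name}",
            f"arn:aws:s3:::{bucket_name}/{object_key}/*",
        ],
    ))
    # Permissions on the KMS key used for the bucket encryption (passed as an ARN).
    lf_role.add_to_principal_policy(iam.PolicyStatement(
        actions=["kms:Decrypt", "kms:Encrypt", "kms:GenerateDataKey*", "kms:DescribeKey"],
        resources=[key_arn],
    ))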
The DataLakeStorage construct provides a default configuration for transitioning objects to different Amazon S3 storage classes, and you can customize it, but there is no check that the service constraints are respected.
Constraints are listed here
Errors are raised at deployment time but should be raised at synth time to avoid an AWS CloudFormation deploy and rollback. A validation sketch follows.
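A minimal synth-time validation sketch, assuming the documented S3 constraint that objects must remain at least 30 days in a storage class before the next transition (the parameter names are illustrative):

def validate_lifecycle_transitions(infrequent_access_days: int, archive_days: int) -> None:
    # Raise at synth time instead of letting CloudFormation fail and roll back.
    if infrequent_access_days < 30:
        raise ValueError("Transition to Standard-IA requires at least 30 days in Standard")
    if archive_days < infrequent_access_days + 30:
        raise ValueError("Objects must stay at least 30 days in Standard-IA before archiving")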
Create the architecture and components for de-identifying data.
The EKS version is soon to be deprecated following the EKS release cycle. The EMR on EKS construct should be upgraded.
Current ARA constructs are delivered as L3 constructs based on CDK v1, which is on the path to deprecation.
We need to migrate to v2.
The current AWS native ref arch is using CDK 1.134.0 and AWS Analytics Reference Architecture 1.11.0.
We need to upgrade CDK and ARA to v2.
Depends on #337
Without the need for a UI, can ARA provide all the building blocks to build a data mesh in a headless fashion?
The current managed endpoint custom resource fails to deploy due to a type mismatch: expecting dict but receiving string.
When running tests, Jest doesn't find the JAR file from the Flyway runner Lambda. The JAR file is only generated in the build phase via a custom Projen task using Gradle.
How to reproduce the issue:
Run jest --group=integ/redshift/flyway-runner
Jest generates this error: Cannot find asset at /.../aws-analytics-reference-architecture/core/src/db-schema-manager/resources/flyway-lambda/flyway-all.jar
Amazon Kinesis Data Analytics is used to process the stream of data and to ingest it into Amazon Elasticsearch Service, but the service is not present in the diagram.
The BatchReplayer only generates data in batch or micro-batch and writes to batch-type targets. We need to allow writing to streams, including Amazon Kinesis Data Streams and Amazon MSK; a sketch of a possible stream write path is shown below.
If the BatchReplayer cannot be used, we should implement a new, long-running construct.
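A minimal sketch of a stream target for the replayer Lambda, assuming each batch of records is pushed to Kinesis Data Streams with boto3 (the function name and record shape are illustrative):

import json
import boto3

kinesis = boto3.client("kinesis")

def write_batch_to_stream(records: list[dict], stream_name: str) -> None:
    # Push one micro-batch of replayed records to the stream.
    kinesis.put_records(
        StreamName=stream_name,
        Records=[
            {"Data": json.dumps(r).encode("utf-8"), "PartitionKey": str(i)}
            for i, r in enumerate(records)
        ],
    )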
The DataGenerator relies on the AWS Data Wrangler Lambda layer, and its ARN is not consistent across regions.
To reproduce: deploy a DataGenerator in eu-west-2 and you will get this error:
Resource handler returned message: "User: arn:aws:sts::1111111111111:assumed-role/Admin/myRole is not authorized to perform: lambda:GetLayerVersion on resource: arn:aws:lambda:eu-west-2:336392948345:layer:AWSDataWrangler-Python38:6 because no resource-based policy allows the lambda:GetLayerVersion action
Workaround:
- AWSDataWrangler-Python38:6 DataWrangler layer version is available
- AWSDataWrangler-Python39:1 DataWrangler layer version is available
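A possible shape for the workaround (the layer versions come from the list above, but the per-region pairing is an illustrative assumption):

# Region-aware resolution of the AWS Data Wrangler layer ARN. Account
# 336392948345 is the public Data Wrangler layer account from the error above.
DATA_WRANGLER_LAYER = {
    "eu-west-1": "AWSDataWrangler-Python38:6",
    "eu-west-2": "AWSDataWrangler-Python39:1",  # assumed pairing for illustration
}

def data_wrangler_layer_arn(region: str) -> str:
    return f"arn:aws:lambda:{region}:336392948345:layer:{DATA_WRANGLER_LAYER[region]}"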
Hello!
I hope you are doing well!
We are a security research team. Our tool automatically detected a vulnerability in this repository, and we want to disclose it responsibly. GitHub has a feature called private vulnerability reporting, which enables security researchers to privately disclose a vulnerability. Unfortunately, it is not enabled for this repository.
Can you enable it, so that we can report it?
Thanks in advance!
PS: you can read about how to enable private vulnerability reporting here: https://docs.github.com/en/code-security/security-advisories/repository-security-advisories/configuring-private-vulnerability-reporting-for-a-repository
Currently, the BatchReplayer consumes a PreparedDataset to generate data. We can provide a new construct to prepare the data for replay during the provisioning of the CDK application.
This construct can take a source dataset as an input parameter and run a synchronous AWS Glue job to modify the dataset and make it consumable by the BatchReplayer.
Prerequisites for the BatchReplayer are listed in the PreparedDataset construct documentation.
Add some quality checks to prevent the preparation from failing.
Ensure the PySpark script is packaged into the core library.
The BatchReplayer currently replays the dataset from scratch. Sometimes we just need data to be in the target and we don't want to wait for each batch/micro-batch to generate new data.
We should add a parameter to the construct to write a percentage of the dataset during the provisioning step, so that part of the data is already in the target when the CDK application is provisioned. See the sketch below.
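An illustrative usage sketch; the initial_load_percentage parameter is hypothetical and only shows the proposed shape of the API:

data_generator = ara.BatchReplayer(
    scope=self,
    id="customer-data",
    dataset=ara.PreparedDataset.RETAIL_1_GB_CUSTOMER,
    sink_bucket=storage.raw_bucket,
    initial_load_percentage=20,  # hypothetical: seed 20% of the dataset at deploy time
)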
Currently, NotebookPlatform extends NestedStack to ensure we don't reach the 500-resource limit in one CloudFormation stack. Because of recent optimizations on the constructs, we can now extend Construct and let the consumer decide whether to deploy in a nested stack or not. Additionally, it would solve the documentation issue where the NestedStack exportValue function breaks the entire lib documentation.
Hello,
I don't see information about the Kinesis Data Analytics application "KDA-application" (Streaming Analytics section) in the architecture diagram at https://aws-samples.github.io/aws-analytics-reference-architecture/high-level-design/architecture.
Provide a construct that bootstraps managed Prometheus and Grafana to be used by other ARA constructs.
Although there is a default value for sink_object_key, if I don't specify it, it returns the string "None".
data_generator = ara.BatchReplayer(
    scope=self,
    id="customer-data",
    dataset=ara.PreparedDataset.RETAIL_1_GB_CUSTOMER,
    sink_bucket=storage.raw_bucket,
)

crawler = glue.CfnCrawler(
    self,
    id='ara-crawler',
    name='ara-crawler',
    role=glue_role.iam_role.role_arn,
    database_name='raw',
    targets={
        's3Targets': [{"path": f"s3://{storage.raw_bucket.bucket_name}/{data_generator.sink_object_key}/"}],
    },
)
This crawler target will be rendered as:
s3://raw-<account_id>-<region_id>/None/
Instead, IMO it should return the default key.
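A sketch of the likely shape of the fix, assuming the bug comes from string-formatting an undefined value (the default-key expression is an assumption):

# Fall back to a real default key instead of rendering None into the path.
sink_object_key = (
    props.sink_object_key
    if props.sink_object_key is not None
    else self.dataset.table_name  # hypothetical default derived from the dataset
)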
Node 14 is deprecated; we need to upgrade our GitHub Actions accordingly.
The current version of the AWS native reference architecture is using L1 and L2 constructs from AWS CDK. The objective is to migrate part of the batch module of the reference architecture to L3 constructs from the core components, including:
This would simplify the maintenance of the reference architecture, as it will automatically benefit from new L3 construct versions and features.
Currently, Projen is not used to generate the YAML configuration files for GitHub workflows because the Projen config was not able to write in a different folder than the Projen root. Statically defined YAML for workflows adds extra maintenance and is error-prone if something changes in a new Projen version.
We should change to dynamic YAML files generated by Projen but written in the repo root folder and not in the Projen root folder.
It is not possible to use the diskSize property to set the disk size for the nodes, because addEksNodeGroup creates a LaunchTemplate (and does not expose it for changes) and you can't have both diskSize and a LaunchTemplate defined.
When trying to add diskSize: 100, you get the following error at synth time:
Error: diskSize must be specified within the launch template
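A possible workaround sketch, setting the volume size on a launch template directly (the construct id and device name are assumptions, and wiring it into addEksNodeGroup depends on the construct exposing the template):

from aws_cdk import aws_ec2 as ec2

# Since diskSize and a launch template are mutually exclusive, set the root
# volume size on the launch template itself.
launch_template = ec2.CfnLaunchTemplate(
    self, "NodeGroupLaunchTemplate",  # hypothetical id
    launch_template_data=ec2.CfnLaunchTemplate.LaunchTemplateDataProperty(
        block_device_mappings=[
            ec2.CfnLaunchTemplate.BlockDeviceMappingProperty(
                device_name="/dev/xvda",
                ebs=ec2.CfnLaunchTemplate.EbsProperty(volume_size=100),
            )
        ],
    ),
)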
We can implement a new CDK application demonstrating an end-to-end example of a data mesh on AWS
Use this util to assume a role and execute e2e testing as part of the build flow.
Currently, the EMR on EKS construct provides the ability to define the version of EKS but not of the cluster autoscaler. If Kubernetes version 1.22 is used, the deployment fails because the autoscaler version used is not correct.
In the data mesh CentralGovernance construct, the domain account id is not passed to LakeFormationS3Location; hence, the account id from the central account is used for the KMS policy.
new LakeFormationS3Location(this, `${id}LFLocation`, {
    s3Location: {
        bucketName: domainBucket,
        objectKey: domainPrefix,
    },
    kmsKeyId: domainKey,
});
Fix: add an accountId parameter for the registered data domain.
EMR Serverless is GA now. It would be good to have support for EMR Serverless in addition to the EMR on EKS cluster construct.
Currently, the data generator reads source data from the AWS Analytics Reference Architecture public bucket in eu-west-1. This design generates unnecessary data-transfer-out costs when the data generator is deployed in a different region.
We need to implement 2 features to solve this:
- Use requester pays to ensure the consumer of the AWS Analytics Reference Architecture pays for its usage. For this, the data generator needs to be updated to support requester pays, as sketched below.
- Implement Cross-Region Replication from the source bucket in eu-west-1 to the major regions and adapt the data generator to pick the source in the same region it's deployed in.
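A minimal sketch of the requester-pays change, assuming the data generator reads the source with boto3 (the object key is hypothetical): reads from a requester-pays bucket must set RequestPayer, otherwise S3 returns an access error.

import boto3

s3 = boto3.client("s3")
obj = s3.get_object(
    Bucket="aws-analytics-reference-architecture",
    Key="datasets/retail/1GB/store-sale/part-00000.parquet",  # hypothetical key
    RequestPayer="requester",  # the caller pays for the data transfer
)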
Currently, there is no DWH abstraction in the Analytics Reference Architecture, and you need to provision the Redshift L2 construct with lots of additional resources to get a complete and usable setup.
We can implement a new construct called Dwh which bundles all the required resources with defaults but is still customizable. It can be composed of:
- AraBucket (singleton 'redshift-access-logs')
Points to be investigated:
Error: There is already a Construct with name 'DomainSecret' in CentralGovernance
This is due to const domainSecret = Secret.fromSecretCompleteArn(this, 'DomainSecret', domainSecretArn); having the generic id DomainSecret.
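A minimal sketch of one possible fix, shown in CDK Python for consistency with the other examples here: derive the construct id from the caller's id so each domain gets a unique child id (the exact id scheme is an assumption):

from aws_cdk.aws_secretsmanager import Secret

# Unique per-domain construct id instead of the hard-coded 'DomainSecret'.
domain_secret = Secret.from_secret_complete_arn(
    self, f"{id}DomainSecret", domain_secret_arn
)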