terraform-aws-elasticsearch-loader-kinesis-ec2

A Terraform module which deploys a Snowplow Elasticsearch Loader application on AWS running on top of EC2. If you want to use a custom AMI for this deployment you will need to ensure it is based on top of Amazon Linux 2.

Telemetry

This module by default collects and forwards telemetry information to Snowplow to understand how our applications are being used. No identifying information about your sub-account or account fingerprints is ever forwarded to us - it is very simple information about what modules and applications are deployed and active.

If you wish to subscribe to our mailing list for updates to these modules or security advisories please set the user_provided_id variable to include a valid email address which we can reach you at.

How do I disable it?

To disable telemetry simply set variable telemetry_enabled = false.
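
As a minimal sketch of the opt-out, the variable is set on the module block alongside your other inputs. The module label es_loader_enriched is only an illustration, and the required inputs are elided here for brevity:

```hcl
module "es_loader_enriched" {
  source = "snowplow-devops/elasticsearch-loader-kinesis-ec2/aws"

  # ... required inputs (name, vpc_id, subnet_ids, etc.) go here ...

  # Opt out of telemetry for this deployment
  telemetry_enabled = false
}
```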

What are you collecting?

For details on what information is collected please see this module: https://github.com/snowplow-devops/terraform-snowplow-telemetry

Usage

Cluster Authentication

There are two different ways to authenticate with the Elasticsearch Cluster - it's important that you configure the loader correctly so that it can connect to the destination cluster with appropriately scoped access.

Note: If neither of these is defined it is assumed that no authentication is required.

1. AWS Elasticsearch Service + IAM and/or RBAC

The offering from AWS supports RBAC and IAM-based access controls. As long as you have configured the aws_es_domain_name variable the loader will start signing all outbound requests.

You can then manage access either via a straightforward IAM policy at the cluster level, which allows actions coming from the IAM role associated with the EC2 servers to insert data, or take it a step further and set up fine-grained-access-control with RBAC. The latter allows you to limit the loader to specific indices and patterns.

Documentation here: https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/fgac.html

2. Basic-auth

This one is pretty straightforward - all you need to do is configure the es_cluster_username and es_cluster_password variables and then all HTTP requests sent from the loader will carry these credentials in a basic-auth header.
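
A sketch of what a basic-auth setup might look like is shown below. The root-module variables es_username and es_password are hypothetical names introduced here so that credentials stay out of source control, and marking the password as sensitive is an assumption about how you want Terraform to treat it in plan output, not something this module requires:

```hcl
# Hypothetical root-module variables (not part of this module)
variable "es_username" {
  type = string
}

variable "es_password" {
  type      = string
  sensitive = true
}

module "es_loader_enriched" {
  source = "snowplow-devops/elasticsearch-loader-kinesis-ec2/aws"

  # ... other required inputs (name, vpc_id, subnet_ids, etc.) ...

  es_cluster_username = var.es_username
  es_cluster_password = var.es_password

  # Left at its default; request signing is described above as being
  # tied to this variable having a value
  aws_es_domain_name = ""
}
```

The values can then be supplied at plan time, for example through TF_VAR_es_username and TF_VAR_es_password environment variables.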

How to manually configure the indices for data to load into?

Before loading data it is generally recommended to configure the index that you want to load into - this allows you to define the structure and expected types of the fields that are going to be loaded and avoids Elasticsearch interpreting a field incorrectly.

The default mappings for good (aka "enriched") and bad data can be found in the templates/mappings directory.

To then create an index it is as simple as issuing a single cURL command:

curl -XPUT \
  'https://${cluster_endpoint}/${index_name}?pretty' \
  -H 'Content-Type: application/json' \
  -d '${mapping_json}'

This index name is what you would then configure for the loader in the es_cluster_index variable.

Do I need to set `es_cluster_document_type`?

The document type is a now-deprecated field in an index mapping. If your index was created with a document type (v6.x or earlier) you should set this variable; if you have created newer v7.x-compatible indices you should leave it unset. By default we set this to an empty string.
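
For example, a loader writing to a v7.x compatible index might leave the document type at its default. This fragment is a sketch that shows only the two index-related inputs, and the index name is a placeholder:

```hcl
module "es_loader_enriched" {
  source = "snowplow-devops/elasticsearch-loader-kinesis-ec2/aws"

  # ... other required inputs ...

  es_cluster_index = "snowplow-enriched-index"

  # Empty string is the module default; shown explicitly for clarity
  es_cluster_document_type = ""
}
```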

How to rotate and manage manual indices?

NOTE: The loaders will infer mappings automatically - so this step is not required if you just want to quickly get started!

If you want to manage your index in the way detailed above you will need to rotate these indices as well. Generally speaking indices are time-limited to allow you to expire data cleanly and to avoid having enormous indices to query across.

Curator is the general go-to tool for dealing with this problem, and Amazon has a fully worked example of how to set this up here: https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/curator.html

Using alias pointers to avoid updating the loader?

To avoid needing to update the loader configuration every time you rotate the indices it's recommended to put an index alias in front of these indices and to use the alias name in the es_cluster_index variable.

Essentially the flow that needs to be followed is:

1. Create new index
2. Add this new index to the alias
3. Remove the old index from the alias

The loader should then begin loading into the new index automatically.
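
On the Terraform side, the only change this implies is pointing es_cluster_index at the alias rather than at a concrete index. A sketch, assuming a hypothetical alias called snowplow-enriched that sits in front of date-stamped indices you rotate outside of Terraform:

```hcl
module "es_loader_enriched" {
  source = "snowplow-devops/elasticsearch-loader-kinesis-ec2/aws"

  # ... other required inputs ...

  # Hypothetical alias name; the indices behind it (for example
  # snowplow-enriched-2024-01) can be swapped without touching this module
  es_cluster_index = "snowplow-enriched"
}
```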

Setting up the loader

This example shows the configuration for a fine-grained-access-control enabled Elasticsearch Service leveraging RBAC on the cluster to allow loading from EC2 nodes which have assumed the IAM role created by the loader module.

Note: Setting up the cluster "role" and linking it to the IAM role "user" are documented in the links above.

Loading good data

locals {
  es_cluster_endpoint         = "vpc-xxx.eu-west-1.es.amazonaws.com"
  es_cluster_port             = 443
  es_cluster_http_ssl_enabled = true

  # Set if you want to use basicauth to authenticate
  #
  # Note: If using RBAC with AWS ES this should not be set as the authentication is done via the IAM role attached
  #       to the loader instances instead
  es_cluster_username = ""
  es_cluster_password = ""

  # Set if you want to use AWS Request Signing to authenticate
  #
  # Note: This requires configuring either an IAM cluster policy and/or fine-grained-access-control with the IAM role
  #       created by the ES Loader modules
  aws_es_domain_name = "test-cluster"

  # Set only if you are using a different region for your ES Cluster - by default will use the same region as the loader
  aws_es_region = ""
}

module "enriched_stream" {
  source  = "snowplow-devops/kinesis-stream/aws"
  version = "0.1.1"

  name = "enriched-stream"
}

module "bad_1_stream" {
  source  = "snowplow-devops/kinesis-stream/aws"
  version = "0.1.1"

  name = "bad-1-stream"
}

module "es_loader_enriched" {
  source  = "snowplow-devops/elasticsearch-loader-kinesis-ec2/aws"

  accept_limited_use_license = true

  name             = "es-loader-enriched-server"
  vpc_id           = var.vpc_id
  subnet_ids       = var.subnet_ids
  ssh_key_name     = "your-key-name"
  ssh_ip_allowlist = ["0.0.0.0/0"]

  in_stream_type  = "ENRICHED_EVENTS"
  in_stream_name  = module.enriched_stream.name
  bad_stream_name = module.bad_1_stream.name

  es_cluster_endpoint         = local.es_cluster_endpoint
  es_cluster_port             = local.es_cluster_port
  es_cluster_http_ssl_enabled = local.es_cluster_http_ssl_enabled

  es_cluster_index         = "snowplow-enriched-index"
  es_cluster_document_type = "good"

  es_cluster_username = local.es_cluster_username
  es_cluster_password = local.es_cluster_password
  aws_es_domain_name  = local.aws_es_domain_name
  aws_es_region       = local.aws_es_region
}

Loading bad data

module "bad_1_stream" {
  source  = "snowplow-devops/kinesis-stream/aws"
  version = "0.1.1"

  name = "bad-1-stream"
}

module "bad_2_stream" {
  source  = "snowplow-devops/kinesis-stream/aws"
  version = "0.1.1"

  name = "bad-2-stream"
}

module "es_loader_bad" {
  source  = "snowplow-devops/elasticsearch-loader-kinesis-ec2/aws"

  accept_limited_use_license = true

  name             = "es-loader-bad-server"
  vpc_id           = var.vpc_id
  subnet_ids       = var.subnet_ids
  ssh_key_name     = "your-key-name"
  ssh_ip_allowlist = ["0.0.0.0/0"]

  in_stream_type  = "BAD_ROWS"
  in_stream_name  = module.bad_1_stream.name
  bad_stream_name = module.bad_2_stream.name

  es_cluster_endpoint         = local.es_cluster_endpoint
  es_cluster_port             = local.es_cluster_port
  es_cluster_http_ssl_enabled = local.es_cluster_http_ssl_enabled

  es_cluster_index         = "snowplow-bad-index"
  es_cluster_document_type = "bad"

  es_cluster_username = local.es_cluster_username
  es_cluster_password = local.es_cluster_password
  aws_es_domain_name  = local.aws_es_domain_name
  aws_es_region       = local.aws_es_region
}

Requirements

Name	Version
terraform	>= 1.0.0
aws	>= 3.75.0

Providers

Name	Version
aws	>= 3.75.0

Modules

Name	Source	Version
instance_type_metrics	snowplow-devops/ec2-instance-type-metrics/aws	0.1.2
kcl_autoscaling	snowplow-devops/dynamodb-autoscaling/aws	0.2.0
service	snowplow-devops/service-ec2/aws	0.2.1
telemetry	snowplow-devops/telemetry/snowplow	0.5.0

Resources

Name	Type
aws_cloudwatch_log_group.log_group	resource
aws_dynamodb_table.kcl	resource
aws_iam_instance_profile.instance_profile	resource
aws_iam_policy.iam_policy	resource
aws_iam_role.iam_role	resource
aws_iam_role_policy_attachment.policy_attachment	resource
aws_security_group.sg	resource
aws_security_group_rule.egress_tcp_443	resource
aws_security_group_rule.egress_tcp_80	resource
aws_security_group_rule.egress_tcp_es_cluster	resource
aws_security_group_rule.egress_udp_123	resource
aws_security_group_rule.ingress_tcp_22	resource
aws_caller_identity.current	data source
aws_region.current	data source

Inputs

Name	Description	Type	Default	Required
bad_stream_name	The name of the bad kinesis stream that the Elasticsearch Loader will insert bad data into	`string`	n/a	yes
es_cluster_endpoint	The endpoint of the cluster to load data into	`string`	n/a	yes
es_cluster_index	The name of the Elasticsearch Index to load into	`string`	n/a	yes
es_cluster_port	The port number of the cluster to load data into	`number`	n/a	yes
in_stream_name	The name of the input kinesis stream that the Elasticsearch Loader will pull data from	`string`	n/a	yes
in_stream_type	The type of data that will be consumed by the application (ENRICHED_EVENTS, BAD_ROWS or JSON)	`string`	n/a	yes
name	A name which will be pre-pended to the resources created	`string`	n/a	yes
ssh_key_name	The name of the SSH key-pair to attach to all EC2 nodes deployed	`string`	n/a	yes
subnet_ids	The list of subnets to deploy the Elasticsearch Loader across	`list(string)`	n/a	yes
vpc_id	The VPC to deploy the Elasticsearch Loader within	`string`	n/a	yes
accept_limited_use_license	Acceptance of the SLULA terms (https://docs.snowplow.io/limited-use-license-1.0/)	`bool`	`false`	no
amazon_linux_2_ami_id	The AMI ID to use which must be based off of Amazon Linux 2; by default the latest community version is used	`string`	`""`	no
app_version	App version to use. This variable facilitates dev flow, the modules may not work with anything other than the default value.	`string`	`"2.1.2"`	no
associate_public_ip_address	Whether to assign a public ip address to this instance	`bool`	`true`	no
aws_es_domain_name	The domain name of the Amazon Elasticsearch Service that signed requests will be made against	`string`	`""`	no
aws_es_region	If signing is enabled this is the region where the destination cluster is located; if unset defaults to the region of the loader deployment	`string`	`""`	no
buffer_byte_limit	The amount of bytes to buffer events before pushing them to Elasticsearch	`number`	`1000000`	no
buffer_record_limit	The number of events to buffer before pushing them to Elasticsearch	`number`	`500`	no
buffer_time_limit_ms	The amount of time to buffer events before pushing them to Elasticsearch	`number`	`500`	no
chunk_byte_limit	The maximum amount of bytes to send to Elasticsearch in one request	`number`	`1000000`	no
chunk_record_limit	The maximum number of events to send to Elasticsearch in one request	`number`	`500`	no
cloudwatch_logs_enabled	Whether application logs should be reported to CloudWatch	`bool`	`true`	no
cloudwatch_logs_retention_days	The length of time in days to retain logs for	`number`	`7`	no
enable_auto_scaling	Whether to enable auto-scaling policies for the service	`bool`	`true`	no
es_cluster_document_type	The document type of the data being loaded - this is the type defined in your index mapping (Note: generally 'good' or 'bad')	`string`	`""`	no
es_cluster_http_ssl_enabled	Whether to enforce SSL for HTTP connections to the cluster	`bool`	`true`	no
es_cluster_password	A basicauth password to use when authenticating	`string`	`""`	no
es_cluster_shard_date_field	The timestamp field to leverage when sharding data into the cluster (Note: defaults to derived_tstamp)	`string`	`""`	no
es_cluster_shard_date_format	A date format pattern for sharding inbound data into the cluster	`string`	`""`	no
es_cluster_username	A basicauth username to use when authenticating	`string`	`""`	no
iam_permissions_boundary	The permissions boundary ARN to set on IAM roles created	`string`	`""`	no
initial_position	Where to start processing the input Kinesis Stream from (TRIM_HORIZON or LATEST)	`string`	`"TRIM_HORIZON"`	no
instance_type	The instance type to use	`string`	`"t3a.micro"`	no
java_opts	Custom JAVA Options	`string`	`"-XX:InitialRAMPercentage=75 -XX:MaxRAMPercentage=75"`	no
kcl_read_max_capacity	The maximum READ capacity for the KCL DynamoDB table	`number`	`10`	no
kcl_read_min_capacity	The minimum READ capacity for the KCL DynamoDB table	`number`	`1`	no
kcl_write_max_capacity	The maximum WRITE capacity for the KCL DynamoDB table	`number`	`10`	no
kcl_write_min_capacity	The minimum WRITE capacity for the KCL DynamoDB table	`number`	`1`	no
max_size	The maximum number of servers in this server-group	`number`	`2`	no
min_size	The minimum number of servers in this server-group	`number`	`1`	no
scale_down_cooldown_sec	Time (in seconds) until another scale-down action can occur	`number`	`600`	no
scale_down_cpu_threshold_percentage	The average CPU percentage that we must be below to scale-down	`number`	`20`	no
scale_down_eval_minutes	The number of consecutive minutes that we must be below the threshold to scale-down	`number`	`60`	no
scale_up_cooldown_sec	Time (in seconds) until another scale-up action can occur	`number`	`180`	no
scale_up_cpu_threshold_percentage	The average CPU percentage that must be exceeded to scale-up	`number`	`60`	no
scale_up_eval_minutes	The number of consecutive minutes that the threshold must be breached to scale-up	`number`	`5`	no
ssh_ip_allowlist	The list of CIDR ranges to allow SSH traffic from	`list(any)`	`["0.0.0.0/0"]`	no
tags	The tags to append to this resource	`map(string)`	`{}`	no
telemetry_enabled	Whether or not to send telemetry information back to Snowplow Analytics Ltd	`bool`	`true`	no
user_provided_id	An optional unique identifier to identify the telemetry events emitted by this stack	`string`	`""`	no

Outputs

Name	Description
asg_id	ID of the ASG
asg_name	Name of the ASG
sg_id	ID of the security group attached to the Elasticsearch Loader servers

Copyright and license

Licensed under the Snowplow Limited Use License Agreement. (If you are uncertain how it applies to your use case, check our answers to frequently asked questions.)
