
philips-labs / terraform-aws-github-runner


Terraform module for scalable GitHub action runners on AWS

Home Page: https://philips-labs.github.io/terraform-aws-github-runner/

License: MIT License

HCL 49.45% Shell 3.59% TypeScript 44.16% Dockerfile 0.09% PowerShell 2.71%
github github-actions terraform actions-runner serverless lambda aws scalable cicd self-hosted action-runner hacktoberfest

terraform-aws-github-runner's Introduction

Terraform module for self-hosted, scalable GitHub Actions runners on AWS.


📄 Extensive documentation is available via our GitHub Pages Docs site.

📢 We maintain the project as a truly open-source project, on a best-effort basis. We welcome contributions from the community. Feel free to help us by answering issues, reviewing PRs, or maintaining and improving the project.

📢 v5 replaces Amazon Linux 2 with Amazon Linux 2023 as the default OS. Check the PR for more details and other changes.

📢 For contributions to older versions you can make a PR to the related branch, e.g. v4. We have no release process in place for older versions.

This Terraform module creates the required infrastructure needed to host GitHub Actions self-hosted, auto-scaling runners on AWS spot instances. It provides the required logic to handle the life cycle for scaling up and down using a set of AWS Lambda functions. Runners are scaled down to zero to avoid costs when no workflows are active.

Runners overview

Features

  • Scaling: Scale up and down based on GitHub events
  • Sustainability: Scale down to zero when no jobs are running
  • Security: Runners are created on-demand and terminated after use (ephemeral runners)
  • Cost optimization: Runners are created on spot instances
  • Tailored software, hardware and network configuration: Bring your own AMI, define the instance types and subnets to use.
  • OS support: Linux (x64/arm64) and Windows
  • Multi-Runner: Create multiple runner configurations with a single deployment
  • GitHub cloud and GitHub Enterprise Server (GHES) support.
  • Org and repo level runners. Enterprise level runners are not supported (yet).

Getting started

Check out the detailed instructions in the Getting Started section of the docs. On a high level, the following steps are required to get started:

  • Set up your AWS account
  • Create and configure a GitHub App
  • Download or build the required lambdas
  • Deploy the module using Terraform
  • Install the GitHub App to your organization or repositories and add your repositories to the runner group(s).

Check out the provided Terraform examples in the examples directory for different scenarios.
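
The deployment step listed above could look roughly like the following. This is a minimal sketch under stated assumptions, not the project's reference configuration: the registry source is the one that appears in issue reports for this module, while the version constraint, region, VPC and subnet IDs, zip file paths, and the three variable names are placeholders chosen for illustration. Only inputs listed in the Inputs table further down are used.

```hcl
# Hypothetical variables holding the GitHub App credentials.
variable "github_app_key_base64" {
  type      = string
  sensitive = true
}

variable "github_app_id" {
  type = string
}

variable "github_app_webhook_secret" {
  type      = string
  sensitive = true
}

module "runners" {
  source  = "philips-labs/github-runner/aws"
  version = "~> 5.0" # placeholder, pin to the release you have tested

  # Required inputs (placeholder values).
  aws_region = "eu-west-1"
  vpc_id     = "vpc-0123456789abcdef0"
  subnet_ids = ["subnet-0123456789abcdef0", "subnet-0fedcba9876543210"]

  github_app = {
    key_base64     = var.github_app_key_base64
    id             = var.github_app_id
    webhook_secret = var.github_app_webhook_secret
  }

  # Lambda zips downloaded or built in the previous step (paths are assumptions).
  webhook_lambda_zip                = "lambdas-download/webhook.zip"
  runner_binaries_syncer_lambda_zip = "lambdas-download/runner-binaries-syncer.zip"
  runners_lambda_zip                = "lambdas-download/runners.zip"
}
```

The examples directory of the project remains the authoritative starting point; this block only illustrates which inputs are mandatory.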

Configuration

Please check the configuration section of the docs for major configuration options. See the Terraform module documentation for all available options.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contributing

We welcome contributions; please check out the contribution guide. Be aware that we use pre-commit hooks to update the docs.

Philips Forest

This module is part of the Philips Forest.

                                                     ___                   _
                                                    / __\__  _ __ ___  ___| |_
                                                   / _\/ _ \| '__/ _ \/ __| __|
                                                  / / | (_) | | |  __/\__ \ |_
                                                  \/   \___/|_|  \___||___/\__|

                                                                 Infrastructure

Talk to the forestkeepers in the runners-channel on Slack.


Terraform root module documentation

Requirements

Name Version
terraform >= 1.3.0
aws ~> 5.27
random ~> 3.0

Providers

Name Version
aws 5.31.0
random 3.6.0

Modules

Name Source Version
ami_housekeeper ./modules/ami-housekeeper n/a
instance_termination_watcher ./modules/termination-watcher n/a
runner_binaries ./modules/runner-binaries-syncer n/a
runners ./modules/runners n/a
ssm ./modules/ssm n/a
webhook ./modules/webhook n/a

Resources

Name Type
aws_sqs_queue.queued_builds resource
aws_sqs_queue.queued_builds_dlq resource
aws_sqs_queue.webhook_events_workflow_job_queue resource
aws_sqs_queue_policy.build_queue_dlq_policy resource
aws_sqs_queue_policy.build_queue_policy resource
aws_sqs_queue_policy.webhook_events_workflow_job_queue_policy resource
random_string.random resource
aws_iam_policy_document.deny_unsecure_transport data source

Inputs

Name Description Type Default Required
ami_filter Map of lists used to create the AMI filter for the action runner AMI. map(list(string))
{
"state": [
"available"
]
}
no
ami_housekeeper_cleanup_config Configuration for AMI cleanup.

amiFilters - Filters to use when searching for AMIs to cleanup. Default filter for images owned by the account and that are available.
dryRun - If true, no AMIs will be deregistered. Default false.
launchTemplateNames - Launch template names to use when searching for AMIs to cleanup. Default no launch templates.
maxItems - The maximum number of AMIs that will be queried for cleanup. Default no maximum.
minimumDaysOld - Minimum number of days old an AMI must be to be considered for cleanup. Default 30.
ssmParameterNames - SSM parameter names to use when searching for AMIs to cleanup. This parameter should be set when using SSM to configure the AMI to use. Default no SSM parameters.
object({
amiFilters = optional(list(object({
Name = string
Values = list(string)
})),
[{
Name : "state",
Values : ["available"],
},
{
Name : "image-type",
Values : ["machine"],
}]
)
dryRun = optional(bool, false)
launchTemplateNames = optional(list(string))
maxItems = optional(number)
minimumDaysOld = optional(number, 30)
ssmParameterNames = optional(list(string))
})
{} no
ami_housekeeper_lambda_s3_key S3 key for the AMI housekeeper lambda function. Required if using S3 bucket to specify lambdas. string null no
ami_housekeeper_lambda_s3_object_version S3 object version for the AMI housekeeper lambda function. Useful if S3 versioning is enabled on source bucket. string null no
ami_housekeeper_lambda_schedule_expression Scheduler expression for the AMI housekeeper lambda. string "rate(1 day)" no
ami_housekeeper_lambda_timeout Time out of the lambda in seconds. number 300 no
ami_housekeeper_lambda_zip File location of the lambda zip file. string null no
ami_id_ssm_parameter_name Externally managed SSM parameter (of data type aws:ec2:image) that contains the AMI ID to launch runner instances from. Overrides ami_filter string null no
ami_kms_key_arn Optional CMK Key ARN to be used to launch an instance from a shared encrypted AMI string null no
ami_owners The list of owners used to select the AMI of action runner instances. list(string)
[
"amazon"
]
no
associate_public_ipv4_address Associate public IPv4 with the runner. Only tested with IPv4 bool false no
aws_partition (optional) Partition in the ARN namespace to use if not 'aws'. string "aws" no
aws_region AWS region. string n/a yes
block_device_mappings The EC2 instance block device configuration. Takes the following keys: device_name, delete_on_termination, volume_type, volume_size, encrypted, iops, throughput, kms_key_id, snapshot_id.
list(object({
delete_on_termination = optional(bool, true)
device_name = optional(string, "/dev/xvda")
encrypted = optional(bool, true)
iops = optional(number)
kms_key_id = optional(string)
snapshot_id = optional(string)
throughput = optional(number)
volume_size = number
volume_type = optional(string, "gp3")
}))
[
{
"volume_size": 30
}
]
no
cloudwatch_config (optional) Replaces the module's default cloudwatch log config. See https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-Agent-Configuration-File-Details.html for details. string null no
create_service_linked_role_spot (optional) create the service linked role for spot instances that is required by the scale-up lambda. bool false no
delay_webhook_event The number of seconds the event accepted by the webhook is invisible on the queue before the scale up lambda will receive the event. number 30 no
disable_runner_autoupdate Disable the auto update of the github runner agent. Be aware there is a grace period of 30 days, see also the GitHub article bool false no
enable_ami_housekeeper Option to enable the lambda that cleans up old AMIs. bool false no
enable_cloudwatch_agent Enables the cloudwatch agent on the ec2 runner instances. The runner uses a default config that can be overridden via cloudwatch_config. bool true no
enable_ephemeral_runners Enable ephemeral runners, runners will only be used once. bool false no
enable_event_rule_binaries_syncer DEPRECATED: Replaced by state_event_rule_binaries_syncer. bool null no
enable_fifo_build_queue Enable a FIFO queue to keep the order of events received by the webhook. Recommended for repo level runners. bool false no
enable_jit_config Overwrite the default behavior for JIT configuration. By default JIT configuration is enabled for ephemeral runners and disabled for non-ephemeral runners. For GHES, first check whether the JIT config API is available. If you are upgrading from 3.x to 4.x, you can set enable_jit_config to false to avoid a breaking change when using your own AMI. bool null no
enable_job_queued_check Only scale if the job event received by the scale up lambda is in the queued state. By default enabled for non ephemeral runners and disabled for ephemeral. Set this variable to overwrite the default behavior. bool null no
enable_managed_runner_security_group Enables creation of the default managed security group. Unmanaged security groups can be specified via runner_additional_security_group_ids. bool true no
enable_organization_runners Register runners to organization, instead of repo level bool false no
enable_runner_binaries_syncer Option to disable the lambda to sync GitHub runner distribution, useful when using a pre-build AMI. bool true no
enable_runner_detailed_monitoring Should detailed monitoring be enabled for the runner. Set this to true if you want to use detailed monitoring. See https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-cloudwatch-new.html for details. bool false no
enable_runner_on_demand_failover_for_errors Enable on-demand failover. For example to fall back to on demand when no spot capacity is available the variable can be set to InsufficientInstanceCapacity. When not defined the default behavior is to retry later. list(string) [] no
enable_runner_workflow_job_labels_check_all If set to true all labels in the workflow job must match the GitHub labels (os, architecture and self-hosted). When false if any label matches it will trigger the webhook. bool true no
enable_ssm_on_runners Enable to allow access to the runner instances for debugging purposes via SSM. Note that this adds additional permissions to the runner instances. bool false no
enable_user_data_debug_logging_runner Option to enable debug logging for user-data, this logs all secrets as well. bool false no
enable_userdata Should the userdata script be enabled for the runner. Set this to false if you are using your own prebuilt AMI. bool true no
enable_workflow_job_events_queue Enabling this experimental feature will create a secondary SQS queue to which a copy of the workflow_job event will be delivered. bool false no
ghes_ssl_verify GitHub Enterprise SSL verification. Set to 'false' when custom certificate (chains) is used for GitHub Enterprise Server (insecure). bool true no
ghes_url GitHub Enterprise Server URL. Example: https://github.internal.co - DO NOT SET IF USING PUBLIC GITHUB string null no
github_app GitHub app parameters, see your github app. Ensure the key is the base64-encoded .pem file (the output of base64 app.private-key.pem, not the content of private-key.pem).
object({
key_base64 = string
id = string
webhook_secret = string
})
n/a yes
idle_config List of time periods, defined as a cron expression, to keep a minimum amount of runners active instead of scaling down to 0. By defining this list you can ensure that in time periods that match the cron expression within 5 seconds a runner is kept idle.
list(object({
cron = string
timeZone = string
idleCount = number
evictionStrategy = optional(string, "oldest_first")
}))
[] no
instance_allocation_strategy The allocation strategy for spot instances. AWS recommends using price-capacity-optimized however the AWS default is lowest-price. string "lowest-price" no
instance_max_spot_price Max price for spot instances per hour. This variable will be passed to the create fleet call as the max spot price for the fleet. string null no
instance_profile_path The path that will be added to the instance_profile, if not set the environment name will be used. string null no
instance_target_capacity_type Default lifecycle used for runner instances, can be either spot or on-demand. string "spot" no
instance_termination_watcher Configuration for the instance termination watcher. This feature is in beta; changes will not trigger a major release as long as it is in beta.

enable: Enable or disable the spot termination watcher.
enable_metric: Enable or disable the metrics for the spot termination watcher.
memory_size: Memory size limit in MB of the lambda.
s3_key: S3 key for syncer lambda function. Required if using S3 bucket to specify lambdas.
s3_object_version: S3 object version for syncer lambda function. Useful if S3 versioning is enabled on source bucket.
timeout: Time out of the lambda in seconds.
zip: File location of the lambda zip file.
object({
enable = optional(bool, false)
enable_metric = optional(object({
spot_warning = optional(bool, false)
}))
memory_size = optional(number, null)
s3_key = optional(string, null)
s3_object_version = optional(string, null)
timeout = optional(number, null)
zip = optional(string, null)
})
{} no
instance_types List of instance types for the action runner. Defaults are based on runner_os (al2023 for linux and Windows Server Core for win). list(string)
[
"m5.large",
"c5.large"
]
no
job_queue_retention_in_seconds The number of seconds the job is held in the queue before it is purged. number 86400 no
key_name Key pair name string null no
kms_key_arn Optional CMK Key ARN to be used for Parameter Store. This key must be in the current account. string null no
lambda_architecture AWS Lambda architecture. Lambda functions using Graviton processors ('arm64') tend to have better price/performance than 'x86_64' functions. string "arm64" no
lambda_principals (Optional) add extra principals to the role created for execution of the lambda, e.g. for local testing.
list(object({
type = string
identifiers = list(string)
}))
[] no
lambda_runtime AWS Lambda runtime. string "nodejs20.x" no
lambda_s3_bucket S3 bucket from which to specify lambda functions. This is an alternative to providing local files directly. string null no
lambda_security_group_ids List of security group IDs associated with the Lambda function. list(string) [] no
lambda_subnet_ids List of subnets in which the action runners will be launched; the subnets need to be subnets in the vpc_id. list(string) [] no
lambda_tracing_mode DEPRECATED: Replaced by tracing_config. string null no
log_level Logging level for lambda logging. Valid values are 'silly', 'trace', 'debug', 'info', 'warn', 'error', 'fatal'. string "info" no
logging_kms_key_id Specifies the kms key id to encrypt the logs with. string null no
logging_retention_in_days Specifies the number of days you want to retain log events for the lambda log group. Possible values are: 0, 1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180, 365, 400, 545, 731, 1827, and 3653. number 180 no
metrics_namespace The namespace for the metrics created by the module. Metrics will only be created if explicitly enabled. string "GitHub Runners" no
minimum_running_time_in_minutes The minimum time an EC2 action runner should run before being terminated, if not busy. number null no
pool_config The configuration for updating the pool. The pool_size to adjust to by the events triggered by the schedule_expression. For example you can configure a cron expression for weekdays to adjust the pool to 10 and another expression for the weekend to adjust the pool to 1.
list(object({
schedule_expression = string
size = number
}))
[] no
pool_lambda_memory_size Memory size limit for scale-up lambda. number 512 no
pool_lambda_reserved_concurrent_executions Amount of reserved concurrent executions for the scale-up lambda function. A value of 0 disables lambda from being triggered and -1 removes any concurrency limitations. number 1 no
pool_lambda_timeout Time out for the pool lambda in seconds. number 60 no
pool_runner_owner The pool will deploy runners to the GitHub org ID, set this value to the org to which you want the runners deployed. Repo level is not supported. string null no
prefix The prefix used for naming resources string "github-actions" no
queue_encryption Configure how data on queues managed by the module is encrypted at rest. Options are encrypted via SSE, not encrypted, and encrypted via KMS. By default, encryption via SSE is enabled. For more details, see the Terraform aws_sqs_queue resource: https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/sqs_queue.
object({
kms_data_key_reuse_period_seconds = number
kms_master_key_id = string
sqs_managed_sse_enabled = bool
})
{
"kms_data_key_reuse_period_seconds": null,
"kms_master_key_id": null,
"sqs_managed_sse_enabled": true
}
no
redrive_build_queue Set options to attach (optional) a dead letter queue to the build queue, the queue between the webhook and the scale up lambda. You have the following options. 1. Disable by setting enabled to false. 2. Enable by setting enabled to true, maxReceiveCount to a number of max retries.
object({
enabled = bool
maxReceiveCount = number
})
{
"enabled": false,
"maxReceiveCount": null
}
no
repository_white_list List of github repository full names (owner/repo_name) that will be allowed to use the github app. Leave empty for no filtering. list(string) [] no
role_path The path that will be added to role path for created roles, if not set the environment name will be used. string null no
role_permissions_boundary Permissions boundary that will be added to the created roles. string null no
runner_additional_security_group_ids (optional) List of additional security groups IDs to apply to the runner. list(string) [] no
runner_architecture The platform architecture of the runner instance_type. string "x64" no
runner_as_root Run the action runner under the root user. Variable runner_run_as will be ignored. bool false no
runner_binaries_s3_logging_bucket Bucket for action runner distribution bucket access logging. string null no
runner_binaries_s3_logging_bucket_prefix Bucket prefix for action runner distribution bucket access logging. string null no
runner_binaries_s3_sse_configuration Map containing server-side encryption configuration for runner-binaries S3 bucket. any
{
"rule": {
"apply_server_side_encryption_by_default": {
"sse_algorithm": "AES256"
}
}
}
no
runner_binaries_s3_versioning Status of S3 versioning for runner-binaries S3 bucket. Once set to Enabled the change cannot be reverted via Terraform! string "Disabled" no
runner_binaries_syncer_lambda_memory_size Memory size limit in MB for binary syncer lambda. number 256 no
runner_binaries_syncer_lambda_timeout Time out of the binaries sync lambda in seconds. number 300 no
runner_binaries_syncer_lambda_zip File location of the binaries sync lambda zip file. string null no
runner_boot_time_in_minutes The minimum time for an EC2 runner to boot and register as a runner. number 5 no
runner_credit_specification The credit option for CPU usage of a T instance. Can be unset, "standard" or "unlimited". string null no
runner_ec2_tags Map of tags that will be added to the launch template instance tag specifications. map(string) {} no
runner_egress_rules List of egress rules for the GitHub runner instances.
list(object({
cidr_blocks = list(string)
ipv6_cidr_blocks = list(string)
prefix_list_ids = list(string)
from_port = number
protocol = string
security_groups = list(string)
self = bool
to_port = number
description = string
}))
[
{
"cidr_blocks": [
"0.0.0.0/0"
],
"description": null,
"from_port": 0,
"ipv6_cidr_blocks": [
"::/0"
],
"prefix_list_ids": null,
"protocol": "-1",
"security_groups": null,
"self": null,
"to_port": 0
}
]
no
runner_extra_labels Extra (custom) labels for the runners (GitHub). Labels checks on the webhook can be enforced by setting enable_runner_workflow_job_labels_check_all. GitHub read-only labels should not be provided. list(string) [] no
runner_group_name Name of the runner group. string "Default" no
runner_iam_role_managed_policy_arns Attach AWS or customer-managed IAM policies (by ARN) to the runner IAM role list(string) [] no
runner_log_files (optional) Replaces the module default cloudwatch log config. See https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-Agent-Configuration-File-Details.html for details.
list(object({
log_group_name = string
prefix_log_group = bool
file_path = string
log_stream_name = string
}))
null no
runner_metadata_options Metadata options for the ec2 runner instances. By default, the module uses metadata tags for bootstrapping the runner, only disable instance_metadata_tags when using custom scripts for starting the runner. map(any)
{
"http_endpoint": "enabled",
"http_put_response_hop_limit": 1,
"http_tokens": "required",
"instance_metadata_tags": "enabled"
}
no
runner_name_prefix The prefix used for the GitHub runner name. The prefix will be used in the default start script to prefix the instance name when registering the runner in GitHub. The value is available via an EC2 tag 'ghr:runner_name_prefix'. string "" no
runner_os The EC2 Operating System type to use for action runner instances (linux,windows). string "linux" no
runner_run_as Run the GitHub actions agent as user. string "ec2-user" no
runners_lambda_s3_key S3 key for runners lambda function. Required if using S3 bucket to specify lambdas. string null no
runners_lambda_s3_object_version S3 object version for runners lambda function. Useful if S3 versioning is enabled on source bucket. string null no
runners_lambda_zip File location of the lambda zip file for scaling runners. string null no
runners_maximum_count The maximum number of runners that will be created. number 3 no
runners_scale_down_lambda_memory_size Memory size limit in MB for scale-down lambda. number 512 no
runners_scale_down_lambda_timeout Time out for the scale down lambda in seconds. number 60 no
runners_scale_up_Lambda_memory_size Memory size limit in MB for scale-up lambda. number null no
runners_scale_up_lambda_memory_size Memory size limit in MB for scale-up lambda. number 512 no
runners_scale_up_lambda_timeout Time out for the scale up lambda in seconds. number 30 no
runners_ssm_housekeeper Configuration for the SSM housekeeper lambda. This lambda deletes token / JIT config from SSM.

schedule_expression: is used to configure the schedule for the lambda.
enabled: enable or disable the lambda trigger via the EventBridge.
lambda_memory_size: lambda memory size limit.
lambda_timeout: timeout for the lambda in seconds.
config: configuration for the lambda function. Token path will be read by default from the module.
object({
schedule_expression = optional(string, "rate(1 day)")
enabled = optional(bool, true)
lambda_memory_size = optional(number, 512)
lambda_timeout = optional(number, 60)
config = object({
tokenPath = optional(string)
minimumDaysOld = optional(number, 1)
dryRun = optional(bool, false)
})
})
{
"config": {}
}
no
scale_down_schedule_expression Scheduler expression to check every x for scale down. string "cron(*/5 * * * ? *)" no
scale_up_reserved_concurrent_executions Amount of reserved concurrent executions for the scale-up lambda function. A value of 0 disables lambda from being triggered and -1 removes any concurrency limitations. number 1 no
ssm_paths The root path used in SSM to store configuration and secrets.
object({
root = optional(string, "github-action-runners")
app = optional(string, "app")
runners = optional(string, "runners")
webhook = optional(string, "webhook")
use_prefix = optional(bool, true)
})
{} no
state_event_rule_binaries_syncer Option to disable EventBridge Lambda trigger for the binary syncer, useful to stop automatic updates of binary distribution string "ENABLED" no
subnet_ids List of subnets in which the action runner instances will be launched. The subnets need to exist in the configured VPC (vpc_id), and must reside in different availability zones (see #2904) list(string) n/a yes
syncer_lambda_s3_key S3 key for syncer lambda function. Required if using an S3 bucket to specify lambdas. string null no
syncer_lambda_s3_object_version S3 object version for syncer lambda function. Useful if S3 versioning is enabled on source bucket. string null no
tags Map of tags that will be added to created resources. By default resources will be tagged with name and environment. map(string) {} no
tracing_config Configuration for lambda tracing.
object({
mode = optional(string, null)
capture_http_requests = optional(bool, false)
capture_error = optional(bool, false)
})
{} no
userdata_content Alternative user-data content, replacing the templated one. By providing your own user_data you have to take care of installing all required software, including the action runner and registering the runner. Be aware that configuration parameters in SSM as well as tags are treated as internals. Changes will not trigger a breaking release. string null no
userdata_post_install Script to be run after the GitHub Actions runner is installed on the EC2 instances string "" no
userdata_pre_install Script to be run before the GitHub Actions runner is installed on the EC2 instances string "" no
userdata_template Alternative user-data template file path, replacing the default template. By providing your own user_data you have to take care of installing all required software, including the action runner. Variables userdata_pre/post_install are ignored. string null no
vpc_id The VPC for security groups of the action runners. string n/a yes
webhook_lambda_apigateway_access_log_settings Access log settings for webhook API gateway.
object({
destination_arn = string
format = string
})
null no
webhook_lambda_memory_size Memory size limit in MB for webhook lambda. number 256 no
webhook_lambda_s3_key S3 key for webhook lambda function. Required if using S3 bucket to specify lambdas. string null no
webhook_lambda_s3_object_version S3 object version for webhook lambda function. Useful if S3 versioning is enabled on source bucket. string null no
webhook_lambda_timeout Time out of the webhook lambda in seconds. number 10 no
webhook_lambda_zip File location of the webhook lambda zip file. string null no
workflow_job_queue_configuration Configuration options for workflow job queue which is only applicable if the flag enable_workflow_job_events_queue is set to true.
object({
delay_seconds = number
visibility_timeout_seconds = number
message_retention_seconds = number
})
{
"delay_seconds": null,
"message_retention_seconds": null,
"visibility_timeout_seconds": null
}
no
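
To make a few of the inputs above more concrete, the fragment below sketches how scaling-related options might be combined inside the module block. It is an illustration only: the organization name, instance types, counts, and schedules are invented values, the required inputs are left out, and the schedule strings assume the AWS EventBridge cron syntax. The pool sizes follow the weekday and weekend example given in the pool_config description.

```hcl
module "runners" {
  source = "philips-labs/github-runner/aws"

  # Required inputs (aws_region, vpc_id, subnet_ids, github_app) are
  # omitted here for brevity; the fragment is not deployable without them.

  # Organization-level, single-use runners.
  enable_organization_runners = true
  enable_ephemeral_runners    = true
  runners_maximum_count       = 10

  # Spot settings; AWS recommends price-capacity-optimized per the table above.
  instance_types               = ["m5.large", "c5.large"]
  instance_target_capacity_type = "spot"
  instance_allocation_strategy = "price-capacity-optimized"

  # Fall back to on-demand when spot capacity is unavailable.
  enable_runner_on_demand_failover_for_errors = ["InsufficientInstanceCapacity"]

  # Pool of pre-started runners, adjusted on a schedule (values are assumptions).
  pool_runner_owner = "my-org" # hypothetical organization
  pool_config = [
    {
      schedule_expression = "cron(0 8 ? * MON-FRI *)"
      size                = 10
    },
    {
      schedule_expression = "cron(0 8 ? * SAT-SUN *)"
      size                = 1
    }
  ]

  # Attach a dead letter queue to the build queue with up to 3 retries.
  redrive_build_queue = {
    enabled         = true
    maxReceiveCount = 3
  }
}
```

Both attributes of redrive_build_queue are set because the type shown in the table declares neither as optional.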

Outputs

Name Description
binaries_syncer n/a
instance_termination_watcher n/a
queues SQS queues.
runners n/a
ssm_parameters n/a
webhook n/a
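
Module outputs can be re-exported from the root configuration, for example to find the webhook URL needed when configuring the GitHub App. The sketch below assumes the module is instantiated as `module.runners` and that the `webhook` output is an object with an `endpoint` attribute; the table above does not document the shape of that output, so treat the attribute name as an assumption to verify against the module source.

```hcl
# Assumes a module block named "runners" exists in the same configuration.
output "webhook_endpoint" {
  description = "URL to enter as the webhook URL of the GitHub App."
  # The endpoint attribute is an assumption; it is not listed in the table above.
  value = module.runners.webhook.endpoint
}

output "runner_queues" {
  description = "SQS queues created by the module."
  value       = module.runners.queues
}
```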

terraform-aws-github-runner's People

Contributors

aadrijnberg, alexjurkiewicz, bdruth, bendavies, dependabot[bot], dylanmtaylor, forest-releaser[bot], gertjanmaas, github-actions[bot], guptanavdeep1983, henrynguyen5, jeroenknoops, jonico, jpalomaki, julada, kmaehashi, kring, kuvaldini, marcofranssen, marekaf, mcaulifn, npalm, patrickmennen, scottguymer, sdarwin, semantic-release-bot, taharah, toast-gear, ulich, wzyboy


terraform-aws-github-runner's Issues

Security Model?

Problem to solve

As a project owner I want to limit production runner access to protected branches

Intended users

Repo owners setting up deployment rules

Further details

In GitLab you can tie certain runners to protected branches. This enables us to use runners with production credentials and access levels, separate from the pool of runners available for every other branch.

It provides a security model in which accidental or intentional changes to production are limited to merged code.

Proposal

No proposal, this is a question.

Documentation

Availability & Testing

What does success look like, and how can we measure that?

Other links/references

I asked a similar question in the GitHub Actions community forum:
https://github.community/t5/GitHub-Actions/Limit-self-managed-runners-to-protected-branches/m-p/55943#M9692

What happens if an external user installs another organization's GitHub app?

In the readme the following is stated:

Go to GitHub and create a new app. Beware you can create apps for your organization or for a user. For now we handle only the organization level app.

But the option to create an organization level app also forces the app to be public, so it is installable by anyone.

So if I create an organization level app for running this module, what's stopping someone else from discovering my github app installation url and using my self-hosted runners?
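
One input in the current module documentation appears relevant here: repository_white_list, described as the list of repository full names allowed to use the GitHub app. The fragment below is a sketch of how it might be set; the repository names are invented, the required inputs are omitted, and whether this fully closes the scenario raised in this issue is not confirmed by the text above.

```hcl
module "runners" {
  source = "philips-labs/github-runner/aws"

  # Required inputs (aws_region, vpc_id, subnet_ids, github_app) omitted.

  # Only events from these repositories are accepted (hypothetical names).
  # Leaving the list empty disables filtering.
  repository_white_list = [
    "my-org/service-a",
    "my-org/service-b",
  ]
}
```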

Generate terraform docs

Avoid manual updates of the Terraform docs (inputs / outputs) in the readme. Options:

  • pre commit hook
  • via ci

Workflows immediately fail and jobs are never created.

Summary

I followed all the instructions, successfully deployed the services to AWS, and integrated them with GitHub. When I start a new job run, GitHub contacts the webhook and the webhook successfully sends the message to the scale-up lambda. However, because no runners existed when the job was enqueued, the job fails immediately. As a result, the scale-up lambda finds 0 queued jobs when it queries the repository workflows and does not create any runner.

Steps to reproduce

I simply followed the instructions. Tried with 0.1.0 and 0.2.0

Possible fixes

Ideally we would be able to configure a minimum number of running runners, guaranteeing that a runner is always immediately available.
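
The idle_config input listed in the current module documentation looks like a fit for this request, since it keeps a minimum number of runners active during periods that match a cron expression. The fragment below is a sketch only: the time zone and counts are made-up values, the required inputs are omitted, and the six-field cron string with a leading seconds field is an assumption about the expected syntax that should be checked against the project docs. It was not available in the 0.1.0 and 0.2.0 releases mentioned in this issue as far as the text here shows.

```hcl
module "runners" {
  source = "philips-labs/github-runner/aws"

  # Required inputs (aws_region, vpc_id, subnet_ids, github_app) omitted.

  # Keep two runners idle on weekdays between 09:00 and 17:59 local time.
  idle_config = [
    {
      cron      = "* * 9-17 * * 1-5" # assumed format: sec min hour day month weekday
      timeZone  = "Europe/Amsterdam"
      idleCount = 2
    }
  ]
}
```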

Distribution Lambda occasionally fails after creation

Summary

The distribution syncer lambda sometimes does not work after a terraform apply; the cause is unclear. Removing the lambda and running an apply again solves the issue.

Steps to reproduce

Not reproducible; it happens occasionally.

runners-scale-up fails with 'AuthFailure.ServiceLinkedRoleCreationNotPermitted'

Summary

When following the readme, using the example configuration, and adjusting the GitHub app permissions as per #100 (comment), the scale-up lambda fails to create the EC2 instance due to ServiceLinkedRoleCreationNotPermitted.

Steps to reproduce

  • Do step 1 of Github app setup
  • Checkout terraform-aws-github-runner repo, cd into example folder
  • Download lambda zips
  • Create terraform.tfvars file with Github App credentials
  • run terraform init && terraform apply
  • Trigger a build on Github

What is the current bug behavior?

Github app sends webhook, webhook lambda forwards it, scaleup-lambda throws error:

...
ERROR	AuthFailure.ServiceLinkedRoleCreationNotPermitted: The provided credentials do not have permission to create the service-linked role for EC2 Spot Instances.
    at Request.extractError (/var/task/index.js:41424:35)
    at Request.callListeners (/var/task/index.js:47771:20)
    at Request.emit (/var/task/index.js:47743:10)
    at Request.emit (/var/task/index.js:18467:14)
    at Request.transition (/var/task/index.js:17801:10)
    at AcceptorStateMachine.runTo (/var/task/index.js:26145:12)
    at /var/task/index.js:26157:10
    at Request.<anonymous> (/var/task/index.js:17817:9)
    at Request.<anonymous> (/var/task/index.js:18469:12)
    at Request.callListeners (/var/task/index.js:47781:18) {
  code: 'AuthFailure.ServiceLinkedRoleCreationNotPermitted',
  time: 2020-07-30T15:03:24.631Z,
  requestId: 'c7bab39e-b75c-4e7d-bc29-6622b3d4ddb1',
  statusCode: 403,
  retryable: false,
  retryDelay: 68.19342592727871
}

What is the expected correct behavior?

Scale up lambda should create EC2 instance

Possible fixes

I'm sure this is an IAM permissions issue. I am rather new to both AWS and Terraform and am not sure in which of them this needs to be solved or how to go about it.
Would be great to get some pointers.

PEM routines:get_name:no start line

Summary

Error in scale lambda invocation having to do with the private key decoding.

Steps to reproduce

The configuration:

module "runners" {
  source  = "philips-labs/github-runner/aws"
  version = "~> 0.2"

  ...snip...

  github_app = {
    key_base64     = var.github_app_key_base64
    id             = var.github_app_id
    client_id      = var.github_app_client_id
    client_secret  = var.github_app_client_secret
    webhook_secret = random_password.random.result
  }

  webhook_lambda_zip                = "lambdas-download/webhook.zip"
  runner_binaries_syncer_lambda_zip = "lambdas-download/runner-binaries-syncer.zip"
  runners_lambda_zip                = "lambdas-download/runners.zip"
  enable_organization_runners       = true
  runner_extra_labels               = "default"
}

The github_app_key_base64, which I suspect is the problem, is set as follows (PKCS#1 RSAPrivateKey):

github_app_key_base64    = <<-EOT
-----BEGIN RSA PRIVATE KEY-----
<base64 encoded>
-----END RSA PRIVATE KEY-----
EOT

What is the current bug behavior?

scale lambda fails.

What is the expected correct behavior?

scale lambda succeeds.

Relevant logs and/or screenshots

ERROR	Error: error:0909006C:PEM routines:get_name:no start line
    at Sign.sign (internal/crypto/sig.js:105:29)
    at Object.sign (/var/task/index.js:12802:45)
    at Object.jwsSign [as sign] (/var/task/index.js:9637:24)
    at Object.module.exports.6343.module.exports [as sign] (/var/task/index.js:36570:16)
    at getToken (/var/task/index.js:1861:23)
    at Object.githubAppJwt (/var/task/index.js:1882:23)
    at getAppAuthentication (/var/task/index.js:1509:57)
    at getInstallationAuthentication (/var/task/index.js:1630:35)
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (internal/process/task_queues.js:97:5) {
  library: 'PEM routines',
  function: 'get_name',
  reason: 'no start line',
  code: 'ERR_OSSL_PEM_NO_START_LINE'
}

Kudos for the really nice work on this and for sharing with the community! :)

Runner instances are incorrectly detected as orphaned and terminated

Summary

When there are over 30 runners in a repo/organization the scale down lambda thinks new runners are orphans and will terminate them even while running a build.

Steps to reproduce

  1. Add 30 runners to your repository or organisation. These can be offline.
  2. Trigger a new workflow run to generate a new instance via this project
  3. Wait the configured time (minimum_running_time_in_minutes option or 5 minutes by default)
  4. CloudWatch logs for the scale-down function show that the newly created instance is an orphan

What is the current bug behavior?

Runners get terminated while they should not be deleted.

What is the expected correct behavior?

These runners should not be terminated in this scenario.
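
The orphan check described in this report can be sketched roughly as follows. This is a minimal illustration, not the module's actual code: the `Ec2Instance` and `RegisteredRunner` shapes and the `findOrphans` function are hypothetical, and it assumes (as the log line below suggests) that a runner is registered under its EC2 instance ID. It shows why an incomplete list of registered runners makes a healthy instance look like an orphan.

```typescript
// Hypothetical shapes for illustration; the module's real types differ.
interface Ec2Instance {
  instanceId: string;
  launchTime: Date;
}

interface RegisteredRunner {
  name: string; // assumed to equal the EC2 instance ID
}

// An instance counts as an orphan only when it has outlived the minimum
// running time AND no registered runner carries its instance ID as name.
function findOrphans(
  instances: Ec2Instance[],
  registered: RegisteredRunner[],
  minimumRunningTimeInMinutes: number,
  now: Date = new Date(),
): Ec2Instance[] {
  const minimumAgeMs = minimumRunningTimeInMinutes * 60 * 1000;
  return instances.filter((instance) => {
    const ageMs = now.getTime() - instance.launchTime.getTime();
    const isRegistered = registered.some((r) => r.name === instance.instanceId);
    return ageMs >= minimumAgeMs && !isRegistered;
  });
}
```

If `registered` holds only the first page of results, a runner that sits on a later page fails the `isRegistered` check and is returned as an orphan once the minimum running time has passed.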

Relevant logs and/or screenshots

2020-08-26T09:30:06.050Z 7d6440bb-27ba-4d17-ba26-3edc285b88c1 INFO Runner 'i-0d90f0e61ef64b847' is orphan, and will be removed.
2020-08-26T09:30:06.272Z 7d6440bb-27ba-4d17-ba26-3edc285b88c1 DEBUG Runner terminated.i-0d90f0e61ef64b847

Possible fixes

In modules/runners/lambdas/runners/src/scale-runners/scale-down.ts in the scaleDown function the following code is used to retrieve registered runners.

 const registered = enableOrgLevel
      ? await githubAppClient.actions.listSelfHostedRunnersForOrg({
          org: repo.repoOwner,
        })
      : await githubAppClient.actions.listSelfHostedRunnersForRepo({
          owner: repo.repoOwner,
          repo: repo.repoName,
        });

This API is paginated and by default returns the first 30 runners. The page size can be upped to 100 runners, but to be sure we should get all the runners.
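
A minimal sketch of collecting every page is shown here. The `fetchPage` callback is a hypothetical stand-in for a call such as listSelfHostedRunnersForOrg with `per_page` and `page` parameters, and the `RunnerSummary` and `RunnerPage` shapes are trimmed-down assumptions about the response (a `total_count` plus a `runners` array).

```typescript
// Trimmed-down response shapes, assumed for illustration.
interface RunnerSummary {
  id: number;
  name: string;
}

interface RunnerPage {
  total_count: number;
  runners: RunnerSummary[];
}

// Hypothetical stand-in for the paginated GitHub API call.
type FetchPage = (page: number, perPage: number) => Promise<RunnerPage>;

// Requests pages until every reported runner has been collected.
async function listAllRunners(fetchPage: FetchPage, perPage = 100): Promise<RunnerSummary[]> {
  const all: RunnerSummary[] = [];
  for (let page = 1; ; page++) {
    const { total_count, runners } = await fetchPage(page, perPage);
    all.push(...runners);
    // Stop on an empty page or once the reported total has been reached.
    if (runners.length === 0 || all.length >= total_count) {
      return all;
    }
  }
}
```

Octokit also ships a `paginate` helper that may serve the same purpose; the loop above only makes the mechanics explicit.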

If there are more than 100 runners registered, scale-down fails

scale-down.ts:113 and scale-down.ts:116 use actions.listSelfHostedRunnersForOrg and actions.listSelfHostedRunnersForRepo directly.
The returned object contains an array data.runners, containing up to 100 registered runners according to the API documentation.

If there are more than 100 runners registered, the additional runners are not considered for scaling down.

Question: What happens if a spot instance is terminated by AWS?

Hi, first of all, congratulations for this great project.

We have deployed github-runner successfully and it's running very well so far.

One question, please. As you know, spot instances can be terminated by AWS. If a GitHub Runner EC2 instance is suddenly stopped by AWS (I mean, in the middle of a pipeline), what happens to the GitHub pipeline? Does it fail? Is there any retry/re-schedule mechanism to re-execute the build?

Thank you very much in advance.

Improve runner deletion by using `busy` flag

When building this solution, the GitHub API couldn't tell whether a runner was busy, so we resorted to trying to delete each runner via the API. If that returned a 500 Internal Server Error, we knew the runner was busy.

Just played with the API again and saw the busy flag was added for runners. See https://docs.github.com/en/rest/reference/actions#list-self-hosted-runners-for-a-repository

Instead of trying to delete a runner, we should use this flag to reduce the number of API calls to GitHub.
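
As a rough sketch of that idea, the runners returned by the list endpoint could be filtered on `busy` before any delete call is made. The `RunnerState` interface is a simplified assumption about the response fields, and `removalCandidates` is a hypothetical helper, not existing module code.

```typescript
// Simplified view of a runner entry from the list endpoint (assumed fields).
interface RunnerState {
  id: number;
  name: string;
  status: string; // for example "online" or "offline"
  busy: boolean;
}

// Only runners that report busy === false are candidates for removal,
// so no delete request is spent on a runner that is executing a job.
function removalCandidates(runners: RunnerState[]): RunnerState[] {
  return runners.filter((runner) => !runner.busy);
}
```

This replaces one delete attempt per runner with a single list call followed by deletes for idle runners only.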

Resource not accessible by integration? What perms am I missing?

I'm sure I am just missing something documented somewhere. My integration has been made under my user account whilst I test this. Does the integration need any more permissions than:

  • read/write on admin
  • read on checks
  • read/write on actions
    at /var/task/index.js:15325:23
    at processTicksAndRejections (internal/process/task_queues.js:97:5) {
  status: 403,
  headers: {
    'access-control-allow-origin': '*',
    'access-control-expose-headers': 'ETag, Link, Location, Retry-After, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval, X-GitHub-Media-Type, Deprecation, Sunset',
    connection: 'close',
    'content-encoding': 'gzip',
    'content-security-policy': "default-src 'none'",
    'content-type': 'application/json; charset=utf-8',
    date: 'Mon, 13 Jul 2020 14:29:12 GMT',
    'referrer-policy': 'origin-when-cross-origin, strict-origin-when-cross-origin',
    server: 'GitHub.com',
    status: '403 Forbidden',
    'strict-transport-security': 'max-age=31536000; includeSubdomains; preload',
    'transfer-encoding': 'chunked',
    vary: 'Accept-Encoding, Accept, X-Requested-With',
    'x-content-type-options': 'nosniff',
    'x-frame-options': 'deny',
    'x-github-media-type': 'github.v3; format=json',
    'x-github-request-id': 'B73E:A1E4:B1A84D:D6FF77:5F0C6FB8',
    'x-ratelimit-limit': '5000',
    'x-ratelimit-remaining': '4884',
    'x-ratelimit-reset': '1594651917',
    'x-xss-protection': '1; mode=block'
  },
  request: {
    method: 'GET',
    url: 'https://api.github.com/repos/callum-tait-pbx/test_repository/actions/runs?status=queued',
    headers: {
      accept: 'application/vnd.github.v3+json',
      'user-agent': 'octokit-rest.js/17.11.2 octokit-core.js/2.5.4 Node.js/12.16.3 (Linux 4.14; x64)',
      authorization: 'token [REDACTED]'
    },
    request: { hook: [Function: bound bound register] }
  },
  documentation_url: 'https://developer.github.com/v3/actions/workflow_runs/#list-repository-workflow-runs'

Note I haven't attached a dummy runner to my repository yet, I was assuming I would run into a problem at some point that pointed to that and deal with it then.

Question: Required Envs for Various Lambda Functions

Hello!

This looks excellent whilst we wait for GitHub to provide a supported solution. I work in a CloudFormation shop, however, so I will be converting the Terraform. What isn't clear to me is which environment variables I need to provide to the various Lambda functions, as I will be deploying via CloudFormation and the Serverless Framework. Could you clarify what each Lambda function requires in order to work?

Scale up lambda failed

Hi. I get an error in the scale-up lambda after setting up your module.
CloudWatch logs below:

ERROR	Invoke Error 	
{
    "errorType": "Error",
    "errorMessage": "Failed handling SQS event",
    "stack": [
        "Error: Failed handling SQS event",
        "    at _homogeneousError (/var/runtime/CallbackContext.js:12:12)",
        "    at postError (/var/runtime/CallbackContext.js:29:54)",
        "    at callback (/var/runtime/CallbackContext.js:41:7)",
        "    at /var/runtime/CallbackContext.js:104:16",
        "    at /var/task/index.js:16834:16",
        "    at Generator.throw (<anonymous>)",
        "    at rejected (/var/task/index.js:16816:65)",
        "    at processTicksAndRejections (internal/process/task_queues.js:97:5)"
    ]
}
    at /var/task/index.js:15124:23
    at processTicksAndRejections (internal/process/task_queues.js:97:5) {
  status: 403,
  headers: {
    'access-control-allow-origin': '*',
    'access-control-expose-headers': 'ETag, Link, Location, Retry-After, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Used, X-RateLimit-Reset, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval, X-GitHub-Media-Type, Deprecation, Sunset',
    connection: 'close',
    'content-encoding': 'gzip',
    'content-security-policy': "default-src 'none'",
    'content-type': 'application/json; charset=utf-8',
    date: 'Tue, 17 Nov 2020 17:51:47 GMT',
    'referrer-policy': 'origin-when-cross-origin, strict-origin-when-cross-origin',
    server: 'GitHub.com',
    status: '403 Forbidden',
    'strict-transport-security': 'max-age=31536000; includeSubdomains; preload',
    'transfer-encoding': 'chunked',
    vary: 'Accept-Encoding, Accept, X-Requested-With',
    'x-content-type-options': 'nosniff',
    'x-frame-options': 'deny',
    'x-github-media-type': 'github.v3; format=json',
    'x-github-request-id': '93DE:E7C5:957F272:AC944E7:5FB40DB3',
    'x-ratelimit-limit': '5600',
    'x-ratelimit-remaining': '5598',
    'x-ratelimit-reset': '1605639047',
    'x-ratelimit-used': '2',
    'x-xss-protection': '1; mode=block'
  },
  request: {
    method: 'GET',
    url: 'https://api.github.com/repos/RaketaApp/packer-base-ami/actions/runs?status=queued',
    headers: {
      accept: 'application/vnd.github.v3+json',
      'user-agent': 'octokit-rest.js/18.0.6 octokit-core.js/3.1.1 Node.js/12.18.4 (linux; x64)',
      authorization: 'token [REDACTED]'
    },
    request: { hook: [Function: bound bound register] }
  },
  documentation_url: 'https://docs.github.com/rest/reference/actions#list-workflow-runs-for-a-repository'
}

scale-up: Resource not accessible by integration

scale-up lambda function fails with error, RequestError [HttpError]: Resource not accessible by integration.

I also struggled to find the expected format for the github_app key_base64 variable. I kept getting errors like error:0909006C:PEM routines:get_name:no start line; a multi-line string (starting LS0t) that was the base64 encoding of the PEM file seemed to work.
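
The format that worked here can be illustrated with a short sketch. It assumes the lambda base64-decodes key_base64 and expects the result to be a complete PEM file; `encodeKey` and `looksLikeEncodedPem` are hypothetical helpers, and the PEM body is a placeholder rather than a real key.

```typescript
import { Buffer } from "node:buffer";

// Placeholder PEM text; the body is not a real key.
const pem = [
  "-----BEGIN RSA PRIVATE KEY-----",
  "<base64 encoded>",
  "-----END RSA PRIVATE KEY-----",
  "",
].join("\n");

// Encode the whole PEM file, header and footer lines included.
function encodeKey(pemText: string): string {
  return Buffer.from(pemText, "utf8").toString("base64");
}

// Decode the value and check for the PEM start line the signer needs.
function looksLikeEncodedPem(keyBase64: string): boolean {
  const decoded = Buffer.from(keyBase64, "base64").toString("utf8");
  return decoded.trim().indexOf("-----BEGIN") === 0;
}

const keyBase64 = encodeKey(pem);
console.log(keyBase64.slice(0, 4)); // "LS0t", the base64 encoding of "---"
console.log(looksLikeEncodedPem(keyBase64)); // true
```

In Terraform, the built-in `filebase64` function produces the same encoding directly from a file path, which avoids pasting the key into a variable by hand.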

I tried the suggestion in #203 of granting Read access to the installed app in the "Actions" repository permissions without success.

The error message shows that the URL being accessed is https://api.github.com/repos/<my organisation>/<my repo>/actions/runs?status=queued with authorization header, authorization: 'token [REDACTED]' and the request is rejected by GitHub.com with '403 Forbidden'.

Please advise.

Question: Runner based on label

Hello Again!

I've got a question: how do you support spinning up different images depending on the label? My dream solution is that runners are spun up as required and are available at the organisation level. The runner that is spun up is based on the label provided, so if it's a node-12 label, for instance, then a Node 12 instance is spun up from a Node 12 launch template. How does the setup support multiple labels?

Cheers

Queued workflow not picked up by AWS runner

Hi,

Whenever I trigger a workflow run while there are no running EC2 runner instances, the following happens:

  1. Lambda webhook gets the check_run event and queues it to SQS
  2. SQS triggers the scale-up Lambda
  3. Lambda scale-up starts an EC2 instance
  4. EC2 instance properly registers as a self-hosted runner (visible in the GH repository "Actions" settings page)
  5. The workflow run isn't picked up by the runner and stays in the queue forever, until I cancel it manually

If I trigger another workflow run while the EC2 runner is started, it gets properly picked up and executed.

Any idea what the problem is here?

Thanks!

Jobs getting dropped

Summary

I'm not sure if this is one problem that is with the runners, or two problems where one of them is GitHub. Or maybe it's all GitHub's fault, and it's not communicating properly with the webhooks. IDK.

In the past week, I have regularly been seeing jobs getting cancelled or just not happening. The first thing I'm seeing, it seems like there might be some sort of timing issue between the close order being given to the spot request and the request picking up another job. My jobs are getting "The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled" when no one has done anything.

The other thing I'm seeing is jobs that just never run and yet the workflow fails.

Steps to reproduce

I don't know. It happens all the time with my pipeline with all of my workflows and jobs. All of my jobs are running bash scripts which in turn run docker containers for everything. I do have a few differences from the default settings. I have instance type set to m5.4xlarge, and I have a post_install script that provides ecr access:

mkdir /home/ec2-user/.docker
touch /home/ec2-user/.docker/config.json
echo "{" >> /home/ec2-user/.docker/config.json
echo '	"credsStore": "ecr-login"' >> /home/ec2-user/.docker/config.json
echo "}" >> /home/ec2-user/.docker/config.json
amazon-linux-extras enable docker
yum install -y amazon-ecr-credential-helper

I just thought to try updating the lambda zips, since I'm based straight on the github repo and haven't done that since the last time I ran a terraform init. So I'll give that a shot.

Request Limit Exceeded?

Seeing runners fail to delete. The underlying AWS instances get purged with "orphaned runner deleted" log messages, but for some reason we are getting rate limited somewhere (I think in AWS) and then the Github runners never get removed.

If we wait long enough, we have seen as many as 800 offline runners...

Here are some relevant lambda logs from the scale down lambda:

{
    "errorType": "Runtime.UnhandledPromiseRejection",
    "errorMessage": "RequestLimitExceeded: Request limit exceeded.",
    "reason": {
        "errorType": "RequestLimitExceeded",
        "errorMessage": "Request limit exceeded.",
        "code": "RequestLimitExceeded",
        "message": "Request limit exceeded.",
        "time": "2020-10-19T12:00:12.277Z",
        "requestId": "eb9ecb84-5f5a-4317-974b-10371c2df8f7",
        "statusCode": 503,
        "retryable": true,
        "stack": [
            "RequestLimitExceeded: Request limit exceeded.",
            "    at Request.extractError (/var/task/index.js:40075:35)",
            "    at Request.callListeners (/var/task/index.js:46386:20)",
            "    at Request.emit (/var/task/index.js:46358:10)",
            "    at Request.emit (/var/task/index.js:17843:14)",
            "    at Request.transition (/var/task/index.js:17177:10)",
            "    at AcceptorStateMachine.runTo (/var/task/index.js:25384:12)",
            "    at /var/task/index.js:25396:10",
            "    at Request.<anonymous> (/var/task/index.js:17193:9)",
            "    at Request.<anonymous> (/var/task/index.js:17845:12)",
            "    at Request.callListeners (/var/task/index.js:46396:18)"
        ]
    },
    "promise": {},
    "stack": [
        "Runtime.UnhandledPromiseRejection: RequestLimitExceeded: Request limit exceeded.",
        "    at process.<anonymous> (/var/runtime/index.js:35:15)",
        "    at process.emit (events.js:315:20)",
        "    at process.EventEmitter.emit (domain.js:483:12)",
        "    at processPromiseRejections (internal/process/promises.js:209:33)",
        "    at processTicksAndRejections (internal/process/task_queues.js:98:32)"
    ]
}


[ERROR] [1603108812400] LAMBDA_RUNTIME Failed to post handler success response. Http response code: 403.

Any ideas what could be going on here? Thanks in advance for your help!
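
Since the logged error is marked `retryable: true`, one possible mitigation is to retry the throttled call with exponential backoff. The sketch below is an assumption-laden illustration, not the module's behaviour: `backoffDelayMs` and `withRetry` are hypothetical helpers, and the only thing they rely on is that the rejected error object carries a `retryable` flag as in the log above.

```typescript
// Exponential backoff with a cap: base * 2^attempt, never above maxDelayMs.
function backoffDelayMs(attempt: number, baseDelayMs = 200, maxDelayMs = 10000): number {
  return Math.min(maxDelayMs, baseDelayMs * Math.pow(2, attempt));
}

// Default sleep; injectable so callers (and tests) can replace it.
function defaultSleep(ms: number): Promise<void> {
  return new Promise<void>((resolve) => {
    setTimeout(resolve, ms);
  });
}

// Retries an operation while it fails with an error flagged as retryable.
async function withRetry<T>(
  operation: () => Promise<T>,
  maxAttempts = 5,
  sleep: (ms: number) => Promise<void> = defaultSleep,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await operation();
    } catch (error) {
      const retryable =
        typeof error === "object" &&
        error !== null &&
        (error as { retryable?: boolean }).retryable === true;
      // Give up on non-retryable errors or when attempts are exhausted.
      if (!retryable || attempt + 1 >= maxAttempts) {
        throw error;
      }
      await sleep(backoffDelayMs(attempt));
    }
  }
}
```

Wrapping each EC2 call in `withRetry` would also stop a single throttled request from surfacing as an unhandled promise rejection, because the final failure is rethrown to a caller that can catch it.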

EC2 instance type

How can you pass the instance type you want to launch? I saw that the default instance type is m5.large, but there is no explanation of how to change it.

Github App permission issue

Summary

I have followed the README, created the GitHub App and set up the Terraform modules; however, I can't get the runners created. Please see the error below. I guess it's something to do with App permissions, but I have tried them all and have been at this for a while with no luck. I'm not sure what I'm missing!

Steps to reproduce

Run the example module and try to create a runner

What is the current bug behavior?

Does not create a runner

What is the expected correct behavior?

Should create a runner

Relevant logs and/or screenshots

2020-09-08T12:53:03.620Z	61a6e96f-ddd4-5bd3-ac59-bebe5d0eb4b7	ERROR	RequestError [HttpError]: Not Found
    at /var/task/index.js:14863:23
    at processTicksAndRejections (internal/process/task_queues.js:97:5) {
  status: 404,
  headers: {
    'access-control-allow-origin': '*',
    'access-control-expose-headers': 'ETag, Link, Location, Retry-After, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Used, X-RateLimit-Reset, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval, X-GitHub-Media-Type, Deprecation, Sunset',
    connection: 'close',
    'content-encoding': 'gzip',
    'content-security-policy': "default-src 'none'",
    'content-type': 'application/json; charset=utf-8',
    date: 'Tue, 08 Sep 2020 12:53:03 GMT',
    'referrer-policy': 'origin-when-cross-origin, strict-origin-when-cross-origin',
    server: 'GitHub.com',
    status: '404 Not Found',
    'strict-transport-security': 'max-age=31536000; includeSubdomains; preload',
    'transfer-encoding': 'chunked',
    vary: 'Accept-Encoding, Accept, X-Requested-With',
    'x-content-type-options': 'nosniff',
    'x-frame-options': 'deny',
    'x-github-media-type': 'github.v3; format=json',
    'x-github-request-id': 'DFE0:5DBC:88B16E0:A4C558A:5F577EAF',
    'x-ratelimit-limit': '5000',
    'x-ratelimit-remaining': '4986',
    'x-ratelimit-reset': '1599572823',
    'x-ratelimit-used': '14',
    'x-xss-protection': '1; mode=block'
  },
  request: {
    method: 'POST',
    url: 'https://api.github.com/orgs/theabrar/actions/runners/registration-token',
    headers: {
      accept: 'application/vnd.github.v3+json',
      'user-agent': 'octokit-rest.js/18.0.3 octokit-core.js/3.1.1 Node.js/12.18.2 (linux; x64)',
      authorization: 'token [REDACTED]',
      'content-length': 0
    },
    request: { hook: [Function: bound bound register] }
  },
  documentation_url: 'https://docs.github.com/rest/reference/actions#create-a-registration-token-for-an-organization'
}

Possible fixes

I think this issue can be resolved by automating the GitHub app creation, possibly using probot.

Add support for ARM64 runners using AWS Graviton/Graviton2 instance-types.

Problem to solve

The current solution is unable to launch instances compatible to use with GitHub's ARM64 self-hosted runner.

Intended users

Developers/Teams building for ARM64 (e.g. Raspberry Pi)

Further details

Benefit: extends support to GitHub Actions pipelines that use ARM64

Proposal

PR in progress with changes to runner_architecture auto-detected from instance type, support for downloading arm64 actions-runner from GitHub, and a patch to account for lack of pre-installed ICU support in .NET Core, required by the arm64 actions-runner.
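
Detecting the architecture from the instance type could look roughly like the sketch below. This is not the PR's implementation: `detectArchitecture` is a hypothetical helper, and the naming rule it applies (the `a1` family, or a family whose generation digit is followed by `g`, such as m6g or t4g) is a heuristic that is not guaranteed to cover every instance family.

```typescript
type RunnerArchitecture = "x64" | "arm64";

// Heuristic, not exhaustive: treats "a1" and families whose generation
// digit is followed by "g" (for example m6g, c6gd, t4g) as Graviton-based.
function detectArchitecture(instanceType: string): RunnerArchitecture {
  const family = instanceType.split(".")[0].toLowerCase();
  const isGraviton = family === "a1" || /^[a-z]+\d+g[a-z]*$/.test(family);
  return isGraviton ? "arm64" : "x64";
}

console.log(detectArchitecture("m5.large")); // "x64"
console.log(detectArchitecture("m6g.large")); // "arm64"
```

The result can then select which actions-runner distribution to download. An explicit override remains useful for instance families that the naming heuristic misclassifies.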

Documentation

Will document how to enable arm64 support as well as some gotchas I ran into (some Graviton instances aren't available in all AZs)

Availability & Testing

Not much? Might need a test case for the change to syncer lambda.

What does success look like, and how can we measure that?

Setting a Graviton/Graviton2 instance type in example/default/main.tf and (optionally) specifying subnet AZs in example/default/vpc.tf results in a successful stack that can launch functional ARM64 self-hosted runners.

Other links/references

n/a

Runners not executing jobs, just idle and shut down

Summary

Github Actions checks are not executed, instances boot up and then shut down without executing the job.

Steps to reproduce

I just did the normal v2 setup

What is the current bug behavior?

The workers boot up, but are idle, until they are shut down again.

What is the expected correct behavior?

The workers pick up the jobs and execute them in a reasonable time

Not sure if this belongs here, but do you have any idea what could be the reason? The workers are definitely online, and it just started randomly. The only thing I did was delete workers and unregister some that no longer existed and were somehow not unregistered, or that had been running without stopping for multiple days.

I have an offline macOS worker registered so that tests do not fail; my CI runs on Linux. Does this pose a problem?

Some additional info needs to be added to readme

I managed to fix the issues I encountered regarding Resource not accessible by integration and Not Found:

  • In the app permissions, also need to do Repository permissions > Actions > Read-only (regardless if you're an organization or not)

Also need to add into the terraform file (via #104):

resource "aws_iam_service_linked_role" "spot" {
  aws_service_name = "spot.amazonaws.com"
}

(this is for the 0.5.0 version)

Support for Windows runners

I've had a poke at the module and I am presuming this currently only supports Linux-based runners? Any plans to add Windows runner support?

Ephemeral Runners?

Problem to solve

As a developer interacting with a public repository, I want to be able to have ephemeral instances so that I can safely use self-hosted runners in a public repo.

Intended users

Any public repository user where github actions are used, and the default github hosted runners do not provide sufficient resources.

Proposal

  • Have a warm pool of idling runners waiting for a job from github (polling sqs queue or something)
  • When an idle runner gets a job, it executes that job and is deleted when finished (the lifetime of the runner is the same as the GHA job it is executing)

What does success look like, and how can we measure that?

  • Jobs are quickly executed since runners are pre-provisioned
  • Security concerns over persistence of data across jobs are addressed, since the lifetime of a runner is tied to a single GitHub job.

dev-usw2-scale-up failure: "Failed handling SQS event" "PEM routines:get_name:no start line at Sign.sign"

Summary

dev-usw2-scale-up Execution result: failed

Steps to reproduce

Trigger via commit in configured application with requisite github app per the docs/README in this repo and https://040code.github.io/2020/05/25/scaling-selfhosted-action-runners

What is the current bug behavior?

ERROR Error: error:0909006C:PEM routines:get_name:no start line at Sign.sign ERROR Invoke Error
(see full error trace/out below)

What is the expected correct behavior?

The commit to configured repo causes lambda function execution and requisite scaling up of or deployment of AWS EC2 spot instance.

Relevant logs and/or screenshots

The most recent failure/error upon commit to configured github repo with github app configured to watch
CloudWatch: CloudWatch Logs: Log groups: /aws/lambda/dev-usw2-scale-up
available in github gist here:

gist-file-aws-lambda-dev-usw2-scale-up-error

Possible fixes

At first glance, it appears this might be related to a cert/key error?

Who can address the issue

Requesting validation and suggestions on resolution

Other links/references

Thank you

scale-down lambda fails with: SyntaxError: Unexpected token u in JSON at position 0

Summary

Hello, thanks for the great project. Everything is working fine, except scale-down lambda. It fails with the SyntaxError: Unexpected token u in JSON at position 0 errors.

Steps to reproduce

here is my lambda download code:

module "lambdas" {
  source  = "philips-labs/github-runner/aws//modules/download-lambda"
  version = "0.4.0"

  lambdas = [
    {
      name = "webhook"
      tag  = "v0.4.0"
    },
    {
      name = "runners"
      tag  = "v0.4.0"
    },
    {
      name = "runner-binaries-syncer"
      tag  = "v0.4.0"
    }
  ]
}

as for idle config, I'm using defaults.

What is the current bug behavior?

here is the logs from the cloudwatch:


2020-08-19T15:03:24.035Z	22fd0c68-5fde-46e1-963e-422a6ae3aa00	ERROR	Unhandled Promise Rejection 
{
    "errorType": "Runtime.UnhandledPromiseRejection",
    "errorMessage": "SyntaxError: Unexpected token u in JSON at position 0",
    "reason": {
        "errorType": "SyntaxError",
        "errorMessage": "Unexpected token u in JSON at position 0",
        "stack": [
            "SyntaxError: Unexpected token u in JSON at position 0",
            "    at JSON.parse (<anonymous>)",
            "    at Object.<anonymous> (/var/task/index.js:8456:39)",
            "    at Generator.next (<anonymous>)",
            "    at /var/task/index.js:8385:71",
            "    at new Promise (<anonymous>)",
            "    at module.exports.471.__awaiter (/var/task/index.js:8381:12)",
            "    at Object.scaleDown (/var/task/index.js:8455:12)",
            "    at /var/task/index.js:16564:22",
            "    at Generator.next (<anonymous>)",
            "    at /var/task/index.js:16543:71"
        ]
    },
    "promise": {},
    "stack": [
        "Runtime.UnhandledPromiseRejection: SyntaxError: Unexpected token u in JSON at position 0",
        "    at process.<anonymous> (/var/runtime/index.js:35:15)",
        "    at process.emit (events.js:315:20)",
        "    at process.EventEmitter.emit (domain.js:482:12)",
        "    at processPromiseRejections (internal/process/promises.js:209:33)",
        "    at processTicksAndRejections (internal/process/task_queues.js:98:32)"
    ]
}
2020-08-19T17:03:24.075+02:00 [ERROR] [1597849404074] LAMBDA_RUNTIME Failed to post handler success response. Http response code: 403.

What is the expected correct behavior?

Scale down lambda should work as expected and terminate idle instances after timeout.
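
"Unexpected token u in JSON at position 0" is the message `JSON.parse` produces on this Node.js generation when its input is `undefined`, because the value is coerced to the string "undefined". That is consistent with, though it does not prove, a configuration value not reaching the scale-down lambda. A defensive parse might look like the sketch below, where `parseJsonSetting` and the `IdleSetting` shape are hypothetical names used only for illustration.

```typescript
// Hypothetical shape of one idle-config entry, for illustration only.
interface IdleSetting {
  idleCount: number;
}

// Returns the fallback when the raw value is missing or blank instead of
// letting JSON.parse receive undefined.
function parseJsonSetting<T>(raw: string | undefined, fallback: T): T {
  if (raw === undefined || raw.trim() === "") {
    return fallback;
  }
  return JSON.parse(raw) as T;
}

const missing = parseJsonSetting<IdleSetting[]>(undefined, []);
const present = parseJsonSetting<IdleSetting[]>('[{"idleCount": 2}]', []);
console.log(missing.length); // 0
console.log(present[0].idleCount); // 2
```

Malformed JSON still throws, which is usually preferable to silently ignoring a broken setting.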

===
thanks for any help!
