chef-boneyard / aws_native_chef_server

Cloudformation templates for building a scalable cloud-native Chef Server on AWS

License: Apache License 2.0

Language: Shell 100.00%
Topics: chef-server

aws_native_chef_server's People

Contributors: danielcbright, gsreynolds, jeffj254, jeremymv2, nrgetik, rwc, sam1el, tensibai


aws_native_chef_server's Issues

Are the build scripts for the pre-baked AMIs available?

Our security policies only allow the deployment of AMIs that we've built ourselves. Are the build scripts for the AMIs referenced in main.yaml available anywhere? We'd like to be able to leverage this repo, but the AMI list is a problem.

Add parameter to set ELB security policy

Currently the templates do not specify an ELB security policy, which means they get the AWS default of ELBSecurityPolicy-2016-08. That policy allows TLS 1.0, 1.1, and 1.2. Our corporate standards require us to limit connections to TLS 1.2 only.

It would be helpful to add another parameter to the templates so that a security policy other than the default can be specified, if desired.
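As an interim workaround, the listener could be updated out-of-band with the AWS CLI; a sketch, where the listener ARN variable is a placeholder and the policy name is one of AWS's predefined TLS 1.2-only policies:

```shell
# Restrict an existing HTTPS listener to a TLS 1.2-only predefined policy.
# $CHEF_ALB_LISTENER_ARN is a placeholder for the listener created by the stack.
aws elbv2 modify-listener \
  --listener-arn "$CHEF_ALB_LISTENER_ARN" \
  --ssl-policy ELBSecurityPolicy-TLS-1-2-2017-01
```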

Current version of Chef Infra Server in images has LDAP SSL bug

After testing with a customer this week, we found that the images are using Chef Server Core 12.19.31, which does not allow LDAP over SSL; unfortunately that was a requirement for the customer. Downgrading to 12.18.14 fixed the issue for now.

Set the Automate JVM heap size

Right now we're using the default JVM heap size, which is way too small. We should automatically set this based on the instance size.
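A minimal sketch of the sizing logic; the 50%-of-RAM heuristic with a cap below 32 GB is common JVM/Elasticsearch guidance, not something the templates do today, and the function name is hypothetical:

```shell
# Compute a JVM heap size from total system memory: half of RAM,
# capped at ~26 GB to stay safely below the 32 GB compressed-oops limit.
heap_for_mem_kb() {
  local mem_kb=$1
  local heap_mb=$(( mem_kb / 1024 / 2 ))
  if [ "$heap_mb" -gt 26624 ]; then
    heap_mb=26624
  fi
  echo "${heap_mb}m"
}

# On the instance the input would come from /proc/meminfo:
# heap_for_mem_kb "$(awk '/MemTotal/ {print $2}' /proc/meminfo)"
```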

Update Monitoring to Report RDS State Changes to SNS

Background:

  • Customer RDS state change occurred that broke replication between primary instance and read-only replica
  • Primary RDS began storing large amounts of transaction/binary/WAL logs locally as replication wasn't occurring
  • Customer upped storage size to accommodate for further accumulating logs, further breaking replication and exacerbating issue

Request:

  • Update AWS Native Chef Server Cloudwatch monitoring to report RDS state changes + replica health to proper SNS topic. State changes (failovers primarily) are a known cause of replication issues between RDS instances

This will give the user of an AWS Native Chef Server the proper warning that something has gone sideways and corrective action may need to be taken.

RDS Parameters to investigate

OldestReplicationSlotLag
ReplicaLag

RDS Events to investigate

RDS-EVENT-0006 - The DB instance is restarting and will be unavailable until the restart is complete.
RDS-EVENT-0004 - The DB instance has shut down.
RDS-EVENT-0034 - Amazon RDS is not attempting a requested failover because a failover recently occurred on the DB instance.
RDS-EVENT-0013 - A Multi-AZ failover that resulted in the promotion of a standby instance has started.
RDS-EVENT-0015 - A Multi-AZ failover that resulted in the promotion of a standby instance is complete. It may take several minutes for the DNS to transfer to the new primary DB instance.
RDS-EVENT-0065 - The instance has recovered from a partial failover.
RDS-EVENT-0049 - A Multi-AZ failover has completed.
RDS-EVENT-0045 - An error has occurred in the read replication process. For more information, see the event message. For information on troubleshooting Read Replica errors, see Troubleshooting a MySQL or MariaDB Read Replica Problem.
RDS-EVENT-0057 - Replication on the Read Replica was terminated.
RDS-EVENT-0062 - Replication on the Read Replica was manually stopped
RDS-EVENT-0063 - Replication on the Read Replica was reset.
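One way this could be wired up, sketched with the AWS CLI; the subscription name and the two variables are placeholders, and the same thing could be expressed as an AWS::RDS::EventSubscription resource in the templates:

```shell
# Subscribe the stack's alerting SNS topic to the RDS event categories that
# cover failovers, failures, replica health, and recovery for the DB instance.
aws rds create-event-subscription \
  --subscription-name chef-rds-state-changes \
  --sns-topic-arn "$ALERT_SNS_TOPIC_ARN" \
  --source-type db-instance \
  --source-ids "$RDS_INSTANCE_ID" \
  --event-categories failover failure "read replica" recovery
```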

Using _status page for Chef Server health check takes Chef Server offline if Automate is not available

Currently the templates use the /_status endpoint for the ELB health check against the Chef Server instances. The /_status endpoint returns an HTTP 500 if the Automate DataCollectorURL is not reachable from the Chef Server instance, even if the rest of Chef Server is totally healthy. This causes the ELB to consider the node unhealthy and take it out of rotation.

This means that if Automate is offline or unreachable for any reason, all of Chef Server will be taken offline, and the non-bootstrap autoscaling group will continually tear down and rebuild Chef Server instances.

The endpoint used for the Chef Server health check should be one that only reports an error code if Chef Server itself is unhealthy. I'm not sure if there are any other quick-loading endpoints like /_status on Chef Server, but as a preliminary fix it would probably work to use the /login endpoint.

Thoughts?
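For reference, the preliminary fix could be applied to a running stack like this; the target group ARN variable is a placeholder:

```shell
# Point the ELB health check at /login instead of /_status so an Automate
# outage doesn't cause healthy Chef Server frontends to be cycled.
aws elbv2 modify-target-group \
  --target-group-arn "$CHEF_TARGET_GROUP_ARN" \
  --health-check-path /login \
  --matcher HttpCode=200
```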

Add the ability to execute a second script for customizations

I imagine that besides main.sh and everything that happens there, customers may also wish to implement additional customizations (security, monitoring tools, log collectors, etc.), and we should make that easy without requiring customers to fork.

The simplest way to accomplish this would be a CustomizationScript parameter that defaults to an empty string, with logic that only executes the script if the parameter is non-empty. Otherwise we can provide a blank customizations.sh that does nothing (besides run /bin/true).
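A sketch of that userdata logic, assuming the rendered parameter value arrives as the first argument; the function name is hypothetical:

```shell
# Run the customer-supplied customization script only when the parameter is
# non-empty and the script is executable; otherwise fall back to a no-op,
# matching the blank customizations.sh idea above.
run_customizations() {
  local script="$1"
  if [ -n "$script" ] && [ -x "$script" ]; then
    "$script"
  else
    /bin/true
  fi
}
```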

Use of aws-signing-proxy

If we don't want to use aws-signing-proxy, what changes do we need to make to the template? Basically I am more interested in understanding the role of aws-signing-proxy. Can anyone share their thoughts, please?

WaitCondition timed out.

The WaitCondition resource seems to just time out ("WaitCondition timed out. Received 0 conditions when expecting 1"). Whenever I try to kick off a build, I keep getting tripped up by this resource. Any suggestions on troubleshooting the issue?

AWS::ElasticLoadBalancingV2::LoadBalancer Name

The Name value isn't a required field, so I recommend removing it so that the LB can be replaced when it needs updating. With a name set, you get:

UPDATE_FAILED | AWS::ElasticLoadBalancingV2::LoadBalancer | ChefALB | CloudFormation cannot update a stack when a custom-named resource requires replacing. Rename Chef-lb and update the stack again.

Idea: Add a travis pipeline to publish tagged templates to S3, add a "Launch" button

I thought this was a super neat idea from: https://www.weave.works/docs/tutorials/kubernetes/launch-aws-cloudformation/

<p><a href="https://console.aws.amazon.com/cloudformation/home#/stacks/new?templateURL=https:%2F%2Fs3.amazonaws.com%2Fweaveworks-cfn-public%2Fkubernetes-ami%2Fcloudformation.json&amp;stackName=WeaveCloudKubernetesGettingStarted"><img src="/assets/misc/launch-stack.svg" alt="Launch Stack" /></a></p>

How would we do this:

  1. Create a versioned s3 bucket as $s3-bucket
  2. Publish all master builds to $s3-bucket/current/
  3. Publish all tagged builds to $s3-bucket/tagname/ and also $s3-bucket/stable/
  4. Add a Launch Stack button to the README

ref: https://docs.travis-ci.com/user/deployment/s3/

cc: @itmustbejj @nsdavidson

Bookshelf SignatureDoesNotMatch Errors With Multiple Backends

When more than one Chef Server is running behind an ELB, Bookshelf returns SignatureDoesNotMatch errors when clients download cookbooks. This happens in "sql" mode. The fix is to ensure

bookshelf['access_key_id']
bookshelf['secret_access_key']

are the same on every backend server during bootstrap. If these settings are not set, a random string is generated, so each backend expects a different keypair.

chef-server-ctl set-secret bookshelf access_key_id
and
chef-server-ctl set-secret bookshelf secret_access_key

need to be run at bootstrap time. The values could be stored in AWS SSM Parameter Store for easy access.
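A sketch of what that bootstrap step might look like, assuming the keys are pre-seeded in SSM Parameter Store under hypothetical names; note that on some Chef Server versions set-secret prompts for the value instead of accepting it as an argument:

```shell
# Fetch the shared Bookshelf keypair from SSM and pin it on this backend
# before the first reconfigure, so every backend signs with the same keys.
access_key=$(aws ssm get-parameter --name /chef/bookshelf/access_key_id \
  --with-decryption --query Parameter.Value --output text)
secret_key=$(aws ssm get-parameter --name /chef/bookshelf/secret_access_key \
  --with-decryption --query Parameter.Value --output text)
chef-server-ctl set-secret bookshelf access_key_id "$access_key"
chef-server-ctl set-secret bookshelf secret_access_key "$secret_key"
```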

Implement Chef Package Repo Mirror

This serves multiple purposes

  • Protect packages.chef.io from undue load

  • Provide a local packages.chef.io package repo mirror for each AWS Native install for speed

  • Provide a general package repo for regular client bootstraps and other product installations

https://github.com/chef/mixlib-install seems like a good candidate for building such a repo: it can handle all package types (including Windows) and can generate usable install.sh/install.ps1 scripts.

The build might go something like

  1. Get all the packages, potentially filtering on a specific OS/architecture
  2. Build repos and mirroring for the selected OS types
  3. Bootstrap some things locally!

Error while reporting to Data Collector - Automate ingest token setup for IAMv2

Currently the ingest token is created using the legacy "data collector token ported from A1" method. With the currently released A2, this results in a token that lacks ingest permissions.

Chef Clients report this:

WARN: Error while reporting run start to Data Collector. URL: https://CHEF-FQDN/organizations/test/data-collector Exception: 403 -- 403 "Forbidden"

The token can be manually added to the Ingest Policy via the Automate UI or API.

We should update to IAMv2 methods for creating the token.
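A rough sketch of the IAMv2 flow; the endpoint paths, the ingest-access policy id, and the variables are assumptions based on the A2 IAM v2 API of the time and should be verified against the release in use:

```shell
# Create a token, then add it as a member of the default ingest policy so
# Chef Server data-collector requests are authorized.
curl -sk -H "api-token: $A2_ADMIN_TOKEN" \
  -d '{"id": "chef-server-ingest", "name": "chef-server-ingest"}' \
  "https://${AUTOMATE_FQDN}/apis/iam/v2/tokens"

curl -sk -H "api-token: $A2_ADMIN_TOKEN" \
  -d '{"members": ["token:chef-server-ingest"]}' \
  "https://${AUTOMATE_FQDN}/apis/iam/v2/policies/ingest-access/members:add"
```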

S3 bucket URLs problem

I am having problems with deploying with the s3 URL as specified in the templates today.

If I am reading https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingBucket.html#access-bucket-intro correctly, that style of URL is valid only if the bucket is in N. Virginia, or if the region was created before a certain date. In addition, it appears that some of this behavior will also change later this year. It looks like the bucket URL should be <bucketname>.s3.<region>.amazonaws.com.

(assuming that the region would end up being a parameter in the various stacks)

LB access logs are disabled

I didn't see a way to enable this using the supplied template, but being able to enable access logging on the load balancers would be handy.

awslogs agent fails to install

When creating the stack, the Automate instance fails relatively quickly. It looks like something in the Python ecosystem has changed, causing the instance to fail setup.


Version 1.4.6 Install Starting
Collecting virtualenv
  Downloading https://files.pythonhosted.org/packages/c1/61/7506ddd79ef6f09beeefb81c4c55bf395a8ad96b33ff1c6b06e40f8aa101/virtualenv-20.0.7-py2.py3-none-any.whl (8.0MB)
Requirement already satisfied (use --upgrade to upgrade): six<2,>=1.9.0 in /usr/lib/python2.7/site-packages (from virtualenv)
Collecting appdirs<2,>=1.4.3 (from virtualenv)
  Downloading https://files.pythonhosted.org/packages/56/eb/810e700ed1349edde4cbdc1b2a21e28cdf115f9faf263f6bbf8447c1abf3/appdirs-1.4.3-py2.py3-none-any.whl
Collecting importlib-metadata<2,>=0.12; python_version < "3.8" (from virtualenv)
  Downloading https://files.pythonhosted.org/packages/8b/03/a00d504808808912751e64ccf414be53c29cad620e3de2421135fcae3025/importlib_metadata-1.5.0-py2.py3-none-any.whl
Collecting distlib<1,>=0.3.0 (from virtualenv)
  Downloading https://files.pythonhosted.org/packages/7d/29/694a3a4d7c0e1aef76092e9167fbe372e0f7da055f5dcf4e1313ec21d96a/distlib-0.3.0.zip (571kB)
Collecting filelock<4,>=3.0.0 (from virtualenv)
  Downloading https://files.pythonhosted.org/packages/14/ec/6ee2168387ce0154632f856d5cc5592328e9cf93127c5c9aeca92c8c16cb/filelock-3.0.12.tar.gz
Collecting importlib-resources<2,>=1.0; python_version < "3.7" (from virtualenv)
  Downloading https://files.pythonhosted.org/packages/45/51/5baae3dde223ff6b64aecaf4c191d2a2679f60abf1270b337823af668bf5/importlib_resources-1.2.0-py2.py3-none-any.whl
Collecting contextlib2<1,>=0.6.0; python_version < "3.3" (from virtualenv)
  Downloading https://files.pythonhosted.org/packages/85/60/370352f7ef6aa96c52fb001831622f50f923c1d575427d021b8ab3311236/contextlib2-0.6.0.post1-py2.py3-none-any.whl
Collecting pathlib2<3,>=2.3.3; python_version < "3.4" and sys_platform != "win32" (from virtualenv)
  Downloading https://files.pythonhosted.org/packages/e9/45/9c82d3666af4ef9f221cbb954e1d77ddbb513faf552aea6df5f37f1a4859/pathlib2-2.3.5-py2.py3-none-any.whl
Collecting configparser>=3.5; python_version < "3" (from importlib-metadata<2,>=0.12; python_version < "3.8"->virtualenv)
  Downloading https://files.pythonhosted.org/packages/7a/2a/95ed0501cf5d8709490b1d3a3f9b5cf340da6c433f896bbe9ce08dbe6785/configparser-4.0.2-py2.py3-none-any.whl
Collecting zipp>=0.5 (from importlib-metadata<2,>=0.12; python_version < "3.8"->virtualenv)
  Downloading https://files.pythonhosted.org/packages/ce/8c/2c5f7dc1b418f659d36c04dec9446612fc7b45c8095cc7369dd772513055/zipp-3.1.0.tar.gz
  Running setup.py (path:/tmp/pip-build-JHtZ_h/zipp/setup.py) egg_info for package zipp produced metadata for project name unknown. Fix your #egg=zipp fragments.
Collecting typing; python_version < "3.5" (from importlib-resources<2,>=1.0; python_version < "3.7"->virtualenv)
  Downloading https://files.pythonhosted.org/packages/22/30/64ca29543375759dc589ade14a6cd36382abf2bec17d67de8481bc9814d7/typing-3.7.4.1-py2-none-any.whl
Collecting singledispatch; python_version < "3.4" (from importlib-resources<2,>=1.0; python_version < "3.7"->virtualenv)
  Downloading https://files.pythonhosted.org/packages/c5/10/369f50bcd4621b263927b0a1519987a04383d4a98fb10438042ad410cf88/singledispatch-3.4.0.3-py2.py3-none-any.whl
Collecting scandir; python_version < "3.5" (from pathlib2<3,>=2.3.3; python_version < "3.4" and sys_platform != "win32"->virtualenv)
  Downloading https://files.pythonhosted.org/packages/df/f5/9c052db7bd54d0cbf1bc0bb6554362bba1012d03e5888950a4f5c5dadc4e/scandir-1.10.0.tar.gz
Installing collected packages: appdirs, configparser, unknown, contextlib2, scandir, pathlib2, importlib-metadata, distlib, filelock, typing, singledispatch, importlib-resources, virtualenv
  Running setup.py install for unknown: started
    Running setup.py install for unknown: finished with status 'done'
  Running setup.py install for scandir: started
    Running setup.py install for scandir: finished with status 'done'
  Running setup.py install for distlib: started
    Running setup.py install for distlib: finished with status 'done'
  Running setup.py install for filelock: started
    Running setup.py install for filelock: finished with status 'done'
Successfully installed appdirs-1.4.3 configparser-4.0.2 contextlib2-0.6.0.post1 distlib-0.3.0 filelock-3.0.12 importlib-metadata-1.5.0 importlib-resources-1.2.0 pathlib2-2.3.5 scandir-1.10.0 singledispatc
h-3.4.0.3 typing-3.7.4.1 unknown-0.0.0 virtualenv-20.0.7
You are using pip version 8.1.2, however version 20.0.2 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
/usr/bin/virtualenv
Traceback (most recent call last):
  File "/usr/bin/virtualenv", line 7, in <module>
    from virtualenv.__main__ import run_with_catch
  File "/usr/lib/python2.7/site-packages/virtualenv/__init__.py", line 3, in <module>
    from .run import cli_run
  File "/usr/lib/python2.7/site-packages/virtualenv/run/__init__.py", line 12, in <module>
    from .plugin.activators import ActivationSelector
  File "/usr/lib/python2.7/site-packages/virtualenv/run/plugin/activators.py", line 6, in <module>
    from .base import ComponentBuilder
  File "/usr/lib/python2.7/site-packages/virtualenv/run/plugin/base.py", line 9, in <module>
    from importlib_metadata import entry_points
  File "/usr/lib/python2.7/site-packages/importlib_metadata/__init__.py", line 9, in <module>
    import zipp
ImportError: No module named zipp

Interestingly enough, if I update pip with `pip install --upgrade pip` then I can run the command successfully on the instance (after it has failed).
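A hedged sketch of the two obvious userdata fixes, assuming the Amazon Linux / Python 2.7 environment shown in the log above: virtualenv 20.x publishes dependency metadata that the stock pip 8.1.2 cannot resolve correctly, so either upgrade pip first or stay off the 20.x line entirely.

```shell
# Option 1: upgrade pip before anything installs virtualenv.
pip install --upgrade pip

# Option 2: keep the old pip but pin virtualenv to the last Python 2-era line.
# pip install 'virtualenv<20'
```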

Review and remove unnecessary parameters

Boy does this template have a lot of parameters! We could make it less confusing for new users by eliminating the least necessary ones.

In my own personal order:

  1. SSHSecurityGroup - this allows you to specify an SG that can SSH to the frontends. often the default SG for ChefServerSubnets would cover that.
  2. DBSubnetGroupArn - we identified one rare case a while ago where a customer couldn't create DB subnet groups. pretty sure that no sane AWS setup actually operates that way, does it?
  3. LoadBalancerSubnets - how many people actually need to put LBs on different subnets than their FEs?

looking for feedback from @itmustbejj @lcc2207 @griffint61

Invalid instance profiles allowed in the template

I have tried to build a POC Chef stack and for that reason selected:

  {
    "ParameterKey":   "InstanceType",
    "ParameterValue": "t2.large"
  }

After the stack failed I checked the reason: it failed at the WaitCondition.
Debugging further, I found that the Auto Scaling Group could not launch an instance for the following reason:

Launching a new EC2 instance. Status Reason: EBS-optimized instances are not supported for your requested configuration. Please check the documentation for supported configurations. Launching EC2 instance failed.

I believe you should either forbid t2 instances or adjust ServerLaunchConfig so it doesn't set EbsOptimized when the instance type does not support it.
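The supported configurations can be queried up front; a sketch with the AWS CLI (the templates themselves would more likely use an equivalent CloudFormation mapping or condition):

```shell
# Returns "unsupported" for t2 types and "default" or "supported" for most
# current-generation types; EbsOptimized should only be set for the latter.
aws ec2 describe-instance-types \
  --instance-types t2.large \
  --query 'InstanceTypes[0].EbsInfo.EbsOptimizedSupport' \
  --output text
```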

Allow selection of specific subnets for ELBs and Database

Currently all resources are deployed to the same set of VPC subnets specified in the ServerSubnets parameter. This doesn't work for us since our corporate VPC standards require all devices allowing traffic from the internet to be in a DMZ subnet and for RDS databases to be placed in a specific DB subnet. Ideally we would build this stack with the load balancers publicly exposed in DMZ, EC2 instances in a protected private subnet, and the Postgres database in the DB subnet.

Adding the ability to specify different subnets for the load balancers, EC2 instances, and database would make the templates compatible with our network segmentation standards.

Allow additional security groups to be specified on ELBs

We use a third party WAF in front of our Chef clusters and use security groups to restrict direct access to the ELBs to only the connecting IPs from the WAF provider. This list of IPs is too large for the single security group that the templates currently allow. It would be helpful if the templates allowed additional security groups to be specified for the ELBs.

Cosmetic updates to answer potential user questions

  • re-work the parameter groups that are currently commented out at https://github.com/chef-customers/aws_native_chef_server/blob/3b1e938ab1faf8ab269858601b432dc2062f9667/main.yaml#L188-L232
  • *DataVolumeSize parameters should specify the unit (GB) in comments
  • Prereqs Doc needs to be updated for mention of multiple SSLCertificateARN fields. it would also be helpful to mention a wildcard certificate here
  • Prereqs Doc needs to be updated to mention Automate and Supermarket DnsRecordNames
  • it is unclear at first if *DnsRecordName should be just hostname or FQDN
  • AutomationBucket parameter should specify that it likely doesn't need to change (unless you want to pull from your own copy)
  • determine and document a good value for Rollback Trigger

Switch to virtual-style s3 bucket pathing

Currently the templates use classic S3 pathing where the bucket is found at https://s3.amazonaws.com/[bucket name]. For locations that have URL whitelists on internet access this can be problematic since it would require whitelisting all of S3, which generally isn't allowed or advised.

Switching to the newer style virtual pathing would make the templates whitelist-friendly (and apparently also match Amazon's preferences: https://aws.amazon.com/blogs/aws/amazon-s3-path-deprecation-plan-the-rest-of-the-story/).

This would mean changing the bucket URLs to https://[bucket name].s3.amazonaws.com.
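The rewrite is mechanical; a minimal sketch (function name is hypothetical):

```shell
# Rewrite a path-style S3 URL to the virtual-hosted style, e.g.
# https://s3.amazonaws.com/my-bucket/key -> https://my-bucket.s3.amazonaws.com/key
to_virtual_hosted() {
  echo "$1" | sed -E 's|^https://s3\.amazonaws\.com/([^/]+)/|https://\1.s3.amazonaws.com/|'
}
```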

Elasticsearch, Cloudformation and ZoneAwareness, the struggle is real

AWS recently added 3-AZ zone awareness to Elasticsearch.

This is awesome because previously ZoneAwarenessEnabled forced you into some bad decisions:

  1. Run with this value false and you only get good availability at the ES software level, because that requires 3 instances, but they are all in a single zone
    • This is what we do today, and if AWS loses an AZ then you'll have a very bad time
  2. Run with this value true and you're forced to specify an even number of ES instances
    • either 2 (which gives you poor availability)
    • or 4 which is super expensive and overkill for us

Amazon provides a good document on the availability trade-offs: https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/es-managedomains.html#es-managedomains-multiaz - particularly this table:

[Screenshot: the Multi-AZ availability trade-offs table from the AWS documentation]

Unfortunately, Cloudformation doesn't yet support 3-AZ awareness, so we are forced to wait until it does. Keep an eye on this page.

Bootstrap server looping after stack creation

https://github.com/chef-customers/aws_native_chef_server/blob/bce23346b8fee3461c4872d0fa90992307bd0a7f/chef_server_ha.yaml#L451

The command here will fail once the stack has been created, triggering the trap defined a bit above.

If someone terminates the bootstrap instance by mistake, or it goes unresponsive, you lose the bootstrap/pushjob instance, as it is reported unhealthy because of this command failure.

I've worked around it with a || echo "Ok, nevermind" to avoid that.
A better solution would be to test the stack state before trying to signal it.
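A sketch of that better solution; the stack name, resource logical id, and region variables are placeholders for values the real userdata already has:

```shell
# Only signal the WaitCondition while the stack is still being created, so a
# bootstrap instance replaced later doesn't fail (and get cycled) on cfn-signal.
stack_status=$(aws cloudformation describe-stacks \
  --stack-name "$STACK_NAME" \
  --query 'Stacks[0].StackStatus' --output text)

if [ "$stack_status" = "CREATE_IN_PROGRESS" ]; then
  /opt/aws/bin/cfn-signal -e 0 \
    --stack "$STACK_NAME" \
    --resource BootstrapWaitCondition \
    --region "$AWS_REGION"
fi
```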

Wrong FQDN in `api_fqdn`

The api_fqdn in the chef-server.rb gets set to the FQDN of the load balancer instead of the FQDN being based off the DNS entry.

The result is a mostly working Chef server, but there are a couple of issues:

  • Confirmation emails for user self-registrations contain a clickable link with the wrong FQDN, resulting in SSL problems in the browser (the cert doesn't match the requested URL)
  • Chef starter kits contain the wrong FQDN in the knife.rb, which also results in SSL problems.
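A sketch of the fix in the bootstrap script, assuming the DNS record name is already available from a stack parameter; the variable name is a placeholder:

```shell
# Set api_fqdn from the stack's DNS record name rather than the generated
# load balancer FQDN, then reconfigure to regenerate dependent config
# (starter kits, email links, etc.).
echo "api_fqdn '${CHEF_DNS_RECORD_NAME}'" >> /etc/opscode/chef-server.rb
chef-server-ctl reconfigure
```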

Attribute: PublicIp was not found for resource: instance-ID

Ran into the following issue while running the latest 4.0.0 stack, with the parameter AssociatePublicIpAddress set to false and LoadBalancerScheme set to internal:

Embedded stack chef-AutomateStack was not successfully created: 
Attribute: PublicIp was not found for resource: i-XXXXXXX

I had the same issue with the parameters set to true and internet-facing.
Note: the subnets we use don't allow public IP addresses.

Current method of installing pip is not compatible with VPCs where internet access is filtered through Squid

Our corporate VPCs use Squid gateways to provide filtered internet with a URL whitelist to our VPC private subnets. While I have been able to whitelist most endpoints required by the templates, I have run into a problem with the pip install process.

Currently pip is installed during instance startup by the awslogs-agent-setup.py script. This script uses easy_install from the Python setuptools package to install pip. Unfortunately, easy_install does not format some of its HTTP requests correctly and essentially sends the requests directly to an IP instead of to a specific URL.

This causes our Squid gateways to block the request, since there is no URL to check against the whitelist. easy_install then fails to install pip, and stack creation as a whole fails because the instance never finishes cfn-init successfully.

I have been able to work around this by installing pip through yum before the awslogs-agent-setup.py script executes: adding a yum install -y python-pip line before the awslogs script runs means pip is already present and the easy_install step is skipped. This is a bit of a hacky workaround, though, and not something I feel should be contributed back.

Is this something I should just take as a required customization for our version of the templates or is there another method that could be used to install pip that is friendly to our URL whitelisting?
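For completeness, the workaround described above as it might appear in the userdata; the package name assumes an Amazon Linux / Python 2 environment:

```shell
# Install pip from the distro repo so awslogs-agent-setup.py finds it on PATH
# and never falls back to easy_install (whose malformed requests Squid blocks).
yum install -y python-pip
```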
