aws / amazon-ecs-agent


Amazon Elastic Container Service Agent

Home Page: http://aws.amazon.com/ecs/

License: Apache License 2.0

Go 97.43% Makefile 0.42% Shell 1.29% C 0.02% PowerShell 0.47% Dockerfile 0.30% Python 0.02% Roff 0.05% Smarty 0.01%

go amazon-ecs-agent amazon-ec2 docker-container amazon-linux-ami


amazon-ecs-agent's Issues

Retry EC2 metadata reads

I've observed reads from the EC2 metadata service failing at various times. The Ruby SDK retries with backoff when reading credentials from it; its retry counts and timings would be a reasonable starting point, and its behavior suggests that retrying is the right thing to do here too.
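The retry-with-backoff idea can be sketched roughly as below. This is a minimal illustration in Python, not the agent's code and not the Ruby SDK's logic: the function name `retry_with_backoff`, the attempt count, and the delay values are all placeholders chosen for the example.

```python
import random
import time


def retry_with_backoff(fetch, attempts=5, base_delay=0.1, max_delay=5.0,
                       sleep=time.sleep):
    """Call fetch() until it succeeds or the attempts run out.

    Delays grow exponentially (base_delay, 2x, 4x, ...) up to max_delay,
    with random jitter so many instances do not retry in lockstep.
    The numbers are illustrative assumptions, not values from any SDK.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch()
        except OSError as err:  # retry network-level failures only
            last_error = err
            if attempt == attempts - 1:
                break
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(delay * random.uniform(0.5, 1.0))
    raise last_error
```

Here `fetch` would be whatever callable performs the metadata read; passing `sleep` in makes the timing easy to replace in tests.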

"Credential should be scoped to a valid region, not 'eu-west-1'. "

Hi,

I'm running the agent in a Docker container (the official amazon/amazon-ecs-agent image from Docker Hub) on a CoreOS EC2 instance; the systemd unit file is below. The agent does not reliably register the instance with ECS. I can't tell exactly when it happens, but on roughly one start in three the agent keeps logging the following lines and is not shown as connected in the ECS console.

2015-05-18T11:34:33Z [ERROR] Unable to discover poll endpoint module="acs handler" err="Credential should be scoped to a valid region, not 'eu-west-1'. "
2015-05-18T11:34:33Z [INFO] Error from acs; backing off module="acs handler" err="Credential should be scoped to a valid region, not 'eu-west-1'. "
2015-05-18T11:36:47Z [ERROR] Unable to discover poll endpoint module="acs handler" err="Credential should be scoped to a valid region, not 'eu-west-1'. "
2015-05-18T11:36:47Z [INFO] Error from acs; backing off module="acs handler" err="Credential should be scoped to a valid region, not 'eu-west-1'. "
2015-05-18T11:39:09Z [ERROR] Unable to discover poll endpoint module="acs handler" err="Credential should be scoped to a valid region, not 'eu-west-1'. "
2015-05-18T11:39:09Z [INFO] Error from acs; backing off module="acs handler" err="Credential should be scoped to a valid region, not 'eu-west-1'. "

The Systemd unit file I'm using is:

[Unit]
Description=The AWS ECS agent
After=docker.service
Requires=docker.service
Type=service
[Service]
TimeoutStartSec=0
TimeoutStopSec=0
Restart=on-failure
SyslogIdentifierg=ecs-agent
ExecStartPre=-/bin/mkdir -p /var/log/ecs /var/ecs-data
ExecStartPre=-/usr/bin/docker stop ecs-agent
ExecStartPre=-/usr/bin/docker pull amazon/amazon-ecs-agent
ExecStartPre=-/usr/bin/docker rm ecs-agent
ExecStart=/usr/bin/docker run --name ecs-agent -v /var/run/docker.sock:/var/run/docker.sock -v /var/log/ecs:/log -v /var/ecs-data:/data -p 127.0.0.1:51678:51678 --env-file /etc/ecs/ecs.config -e ECS_LOGFILE=/log/ecs-agent.log amazon/amazon-ecs-agent

/etc/ecs/ecs.config:

ECS_CLUSTER=<name of my existing cluster>
ECS_DATADIR=/data/
ECS_CHECKPOINT=true
AWS_DEFAULT_REGION=eu-west-1

Logging driver options

I'd like the ability to set the syslog tag, facility, etc... for the new docker logging options:

https://github.com/docker/docker/blob/609e7b0a55d4082fce40eabae3a06ca57c188ba5/docs/reference/run.md#logging-drivers---log-driver

mechanism to limit the amount of memory used for containers

The agent uses the ReadMemInfo function in the github.com/docker/docker/pkg/system package to determine available memory:

amazon-ecs-agent/agent/api/api_client.go

Line 121 in d71b002

memInfo, err := system.ReadMemInfo()

It seems that this is sent back to the ECS service, which allocates tasks to the instance consuming this amount of memory. This doesn't take into account anything else running on the instance (e.g. the agent, we run containers for shipping logs, etc). We would therefore like an option to reserve some of the memory for other things running on the system.

This is therefore a feature request that an option is added to allow an amount of memory to be reserved - perhaps an environment variable called ECS_RESERVE_MEMORY_MB. Please can you let me know what you think of this feature - whether it's worth sending a pull request that implements it?

Thanks

Tom
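The arithmetic behind this request is simple enough to sketch. The example below assumes the variable name proposed above, ECS_RESERVE_MEMORY_MB, which is a suggestion from this issue rather than an option the agent is known to support; the function name and the clamping at zero are also assumptions.

```python
import os


def registerable_memory_mb(total_mb, env=None):
    """Return the memory (in MB) to advertise after subtracting a reservation.

    ECS_RESERVE_MEMORY_MB is the name proposed in this issue, used here
    as a hypothetical setting.
    """
    env = os.environ if env is None else env
    reserved = int(env.get("ECS_RESERVE_MEMORY_MB", "0"))
    if reserved < 0:
        raise ValueError("reservation must not be negative")
    # Never advertise a negative amount, even if the reservation is too large.
    return max(0, total_mb - reserved)
```

With a total of 3768 MB and a 512 MB reservation, the instance would register 3256 MB for tasks.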

ECS Agent disconnected problem

We're using ECS for force12.io, our demo of microscaling. We're seeing intermittent problems where one of our container instances stops responding for between 30 and 60 seconds. During this time the agent connected flag in the ECS web console is false. We also see the two errors below in the agent logs.

2015-06-08T15:06:09Z [ERROR] Error getting message from acs module="acs client" err="read tcp 54.239.20.16:443: use of closed network connection"
2015-06-08T15:06:09Z [INFO] Error from acs; backing off module="acs handler" err="read tcp 54.239.20.16:443: use of closed network connection"

This error is occurring 6 or 7 times an hour on each container instance. Here is when it occurred between 15:00 and 16:00 UTC today.

2015-06-08T15:06:09Z [ERROR] Error getting message from acs module="acs client" err="read tcp 54.239.20.16:443: use of closed network connection"
2015-06-08T15:06:09Z [INFO] Error from acs; backing off module="acs handler" err="read tcp 54.239.20.16:443: use of closed network connection"
2015-06-08T15:14:49Z [ERROR] Error getting message from acs module="acs client" err="read tcp 54.239.18.202:443: use of closed network connection"
2015-06-08T15:14:49Z [INFO] Error from acs; backing off module="acs handler" err="read tcp 54.239.18.202:443: use of closed network connection"
2015-06-08T15:23:31Z [ERROR] Error getting message from acs module="acs client" err="read tcp 54.239.19.128:443: use of closed network connection"
2015-06-08T15:23:31Z [INFO] Error from acs; backing off module="acs handler" err="read tcp 54.239.19.128:443: use of closed network connection"
2015-06-08T15:32:08Z [ERROR] Error getting message from acs module="acs client" err="read tcp 54.239.20.16:443: use of closed network connection"
2015-06-08T15:32:08Z [INFO] Error from acs; backing off module="acs handler" err="read tcp 54.239.20.16:443: use of closed network connection"
2015-06-08T15:39:20Z [ERROR] Error getting message from acs module="acs client" err="read tcp 54.239.19.128:443: use of closed network connection"
2015-06-08T15:39:20Z [INFO] Error from acs; backing off module="acs handler" err="read tcp 54.239.19.128:443: use of closed network connection"
2015-06-08T15:48:08Z [ERROR] Error getting message from acs module="acs client" err="read tcp 54.239.19.128:443: use of closed network connection"
2015-06-08T15:48:08Z [INFO] Error from acs; backing off module="acs handler" err="read tcp 54.239.19.128:443: use of closed network connection"
2015-06-08T15:55:56Z [ERROR] Error getting message from acs module="acs client" err="read tcp 54.239.20.16:443: use of closed network connection"
2015-06-08T15:55:56Z [INFO] Error from acs; backing off module="acs handler" err="read tcp 54.239.20.16:443: use of closed network connection"

I've uploaded the full agent log to S3 for this hour.
Our instances are running amzn-ami-2015.03.b-amazon-ecs-optimized - ami-d0b9acb8

Please let me know if you need any further information or additional logs.

Thanks

Ross

Readme: Add hints about datadir/checkpointing

Hi,

I've noticed that you can't run the agent reliably/resiliently with the described docker run statement as it is missing a configuration for a persistent datadir. As far as I can see configuring this is necessary if you want a container instance to survive ECS Agent restarts as every new instance of the agent will look for an existing configuration and create a new container instance id if it can't be found. Please correct me if this is wrong.
I would suggest adding a section to the readme (I could create a PR) that explains the usage with checkpointing and the datadir. Also, I'd like to clarify the 'default values' of ECS_DATADIR (it states to be /data/, however checkpointing is only enabled when you explicitly pass ECS_DATADIR).

Cheers,
Paul

Licensing

https://github.com/aws/amazon-ecs-agent/blob/master/agent/agent.go#L1-L3

It seems a little unusual to see 'All Rights Reserved' as well as the Apache 2.0 boilerplate in the header of all the files which were released. I'm curious what effect it has upon consumers of the code, and anything contributed back to the project via Pull Request or similar.

This ambiguity has come up with other projects recently, see -

GoogleWebComponents/google-sheets#12

IANAL. Any clarification which you can supply would be much appreciated!

Support for docker-compose like YAML file for task definitions

The current JSON style task definition format is making the definitions unnecessarily lengthy and difficult to maintain. Can we get support for docker-compose like task definitions?

Can't Get Hostname/IP Where Task Is Running

I have tried most of the resources from this page: http://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/ECS.html#listContainerInstances-property

And I can get the automatically generated port for my application running on docker, but I can't get the IP/AWS host on where it is running (the individual ec2 instance in the cluster)

I just get this information: { bindIP: '0.0.0.0', containerPort: 4444, hostPort: 49166 }
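One way to get from a task to its host is to chain three API calls: describe the task to get its container instance ARN, describe that container instance to get the EC2 instance ID, then describe the EC2 instance. The sketch below uses Python and assumes `ecs` and `ec2` behave like boto3 clients; the issue uses the JavaScript SDK, which exposes the same operations under camelCase names. The function name and the returned dictionary shape are made up for the example.

```python
def task_host_addresses(ecs, ec2, cluster, task_arn):
    """Resolve the EC2 instance that runs a given ECS task.

    ecs and ec2 are expected to behave like boto3.client("ecs") and
    boto3.client("ec2"); they are passed in so fakes can be used in tests.
    """
    task = ecs.describe_tasks(cluster=cluster, tasks=[task_arn])["tasks"][0]
    container_instance_arn = task["containerInstanceArn"]

    container_instance = ecs.describe_container_instances(
        cluster=cluster, containerInstances=[container_instance_arn]
    )["containerInstances"][0]
    instance_id = container_instance["ec2InstanceId"]

    reservations = ec2.describe_instances(InstanceIds=[instance_id])["Reservations"]
    instance = reservations[0]["Instances"][0]
    return {
        "instance_id": instance_id,
        "private_ip": instance.get("PrivateIpAddress"),
        "public_dns": instance.get("PublicDnsName"),
    }
```

Combined with the hostPort from the port binding, the returned address gives the endpoint where the container is reachable.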

Functionality to remove one or many container instance(s) that run tasks without service interruption

Hi,
Currently there does not seem to be a way to remove a container instance that runs tasks from a cluster without downtime, other than a number of manual steps.

It would be really helpful if one could mark a container instance as to be deregistered and ecs would take care of reorganizing the tasks first to other nodes before taking the container instance out of the cluster.

Please let me know if I missed something and there is already a way.

Manually you'd have to do:

start task that is running on node x on another node
wait for that task to be up and registered to the load balancer
remove instance from load balancer and drain connections
stop task on node x
deregister container instance from cluster or terminate the instance directly

Sometimes you just want to replace all nodes in your cluster because the ecs/docker versions changed or some other change that you want to roll out without service interruption.

Cheers

Proposal: Add the task id as a label on the container.

Right now, I have a Hekad plugin that takes docker logs in via the Docker Log Input. What I'd ultimately like to do, is take the container id, query the docker daemon for the container config, and extract some labels for use as the syslog programname and pid fields, so log lines end up looking something like:

<timestamp> <host> <app> web.<task uuid> - msg

The first step to make this possible, would be to have the ecs agent add an com.amazonaws.ecs.task label when creating the container. The end goal would hopefully be to have ECS support docker labels natively within task definitions so that I can add labels like app=foo, process_type=web.

Thoughts?
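The log format described above could be produced from container labels along these lines. This is only a sketch: `com.amazonaws.ecs.task` is the label name proposed in this issue, `app` and `process_type` are the example labels mentioned above, and the fallback values are assumptions.

```python
def format_log_line(timestamp, host, labels, message):
    """Build '<timestamp> <host> <app> <process_type>.<task id> - msg'.

    labels is the label mapping read from the container's config; the
    label names follow the proposal in this issue and are hypothetical.
    """
    app = labels.get("app", "unknown")
    process_type = labels.get("process_type", "web")
    task_id = labels.get("com.amazonaws.ecs.task", "-")
    return f"{timestamp} {host} {app} {process_type}.{task_id} - {message}"
```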

Won't get image from dockerhub (when using a private repo)

Authentication seems to happen ok (no error in the logs)

But when it goes to get the image, it can't find it.

time="2015-04-17T08:30:54Z" level="info" msg="+job pull(xx/xx, latest)"
time="2015-04-17T08:30:54Z" level="info" msg="+job resolve_repository(xx/xx)"
time="2015-04-17T08:30:54Z" level="info" msg="-job resolve_repository(xx/xx) = OK (0)"
Error: image xx/xx:latest not found
time="2015-04-17T08:30:55Z" level="info" msg="-job pull(xx/xx, latest) = ERR (1)"

If I set the repo to public, it gets the image fine and the docker is deployed.

Swap options

The default of no swap seems a bit heavy-handed. For some programs it's really hard to pinpoint the usage you'll need, and often they use 10% of that limit 90% of the time until a spike. Maybe an option to disable the limit altogether would be nice. Not that we want swap at all, haha, but spiky behaviour is tricky. Let me know what you think!

Allow arbitrary `docker run` command flags to be specified

For example, I've been attempting to container-ize an HDFS cluster. The way nodes are resolved means I need to specify --net=host for cluster discovery to work (there are other ways to get it working too but all require some other flag combination, such as --cap-add, which similarly aren't available).

What are the reasons for not being able to specify arbitrary flags? Like some kind of 'advanced' option that is only available when editing the JSON directly perhaps? Without it ECS cannot be used to manage such nodes, which is a shame.

Can't authenticate with Docker Hub

I am getting this error in the ecs agent log:

t=2015-04-17T06:11:53+0000 lvl=eror msg="Unrecognized AuthType type" module="docker auth" type="dockercfg\r"

Here is my ecs.config

ECS_ENGINE_AUTH_TYPE=dockercfg
ECS_ENGINE_AUTH_DATA={"https://index.docker.io/v1/":{"auth":"xxxx","email":"[email protected]"}}

Running this AMI: amzn-ami-2015.03.a-amazon-ecs-optimized (ami-ecd5e884)
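The logged type is "dockercfg\r", with a trailing carriage return, which suggests (this is an inference from the log line, not something confirmed in the issue) that ecs.config was saved with Windows-style CRLF line endings. A small Python sketch of a KEY=VALUE parser that tolerates such files; `parse_env_file` is a hypothetical helper, not part of the agent:

```python
def parse_env_file(text):
    """Parse KEY=VALUE lines, tolerating CRLF line endings.

    Blank lines and lines starting with '#' are skipped; only the first
    '=' separates key from value, so JSON values survive intact.
    """
    values = {}
    for line in text.splitlines():  # splitlines() treats \r\n as one break
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue
        values[key.strip()] = value.strip()
    return values
```

Converting the file to Unix line endings before the agent reads it would have the same effect.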

Rolling deploys

We've been playing around with the service and are somewhat confused about how it works. Basically, what we've tried (after a suggestion from an aws dev) is this:

Create launch configuration
Create autoscaling group
Create ecs cluster
Create service and task, set number of tasks to 100 (i.e. a number we don't think we'll ever reach)

What we're deploying is a webapp. So basically we're trying a 1 to 1 mapping - one task per host. This way we can quite easily scale up the autoscaling group and tasks will be placed on each new ec2 instance and added to ELB. So far it's kind of reasonable.

Now, the problem is when we need to deploy a new version - there are no resources in the ecs cluster and no deploy takes place.
One possibility here might be:

Find number of ec2 instances running in the autoscaling group, set ecs tasks to that number - 1
Deploy new task revision
Wait/poll or something until all tasks are updated to new revision
Set number of tasks to 100 again

That seemed kind of convoluted and there may be gotchas. Wouldn't this be solved if one could set the number of tasks sort of like you set min/max/desired for an autoscaling group? That way ecs could possibly do this more seamlessly by trying to keep tasks at max but at deploy time scale down to min tasks... or if it just wasn't so hellbent on always prioritizing number of tasks over deploy.

Container output should (optionally) be captured and sent to CloudWatch or S3

It would be great if this could be done by the agent so that containers don't need to implement their own logging.

Add support in task definitions for more dockerConfig options

We'd like to be able to set more options when running a task definition. Specifically, on the command line our command is docker run -it <image> <cmd>, and it would be awesome to set the same options when calling in to the remote API. It looks like the dockerGo client already supports these options, but the ecs-agent does not allow them to be configured.

Tune SIGKILL timeout

I'm assuming ECS just uses the regular 10s default before a SIGKILL, but this is far too low for a number of our programs that work with queues. Any thoughts on making this tunable?
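For reference, the stop sequence under discussion is: send SIGTERM, wait for a grace period, then send SIGKILL. The Python sketch below shows that pattern for an ordinary child process with a configurable grace period. It illustrates the behaviour being requested and is not how the agent or Docker implements it; `stop_gracefully` is a made-up name.

```python
import subprocess


def stop_gracefully(proc, grace_seconds):
    """Send SIGTERM, wait up to grace_seconds, then fall back to SIGKILL.

    proc is a subprocess.Popen object. Returns "terminated" if the process
    exited within the grace period, "killed" otherwise.
    """
    proc.terminate()  # SIGTERM on POSIX
    try:
        proc.wait(timeout=grace_seconds)
        return "terminated"
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX
        proc.wait()
        return "killed"
```

A queue worker that needs a minute to finish its current job would want `grace_seconds` to be at least that long.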

Calling run-task immediately after stop-task causes Docker link error

I created a deployment script which

Pushes a new Docker image to a repository
Calls "aws ecs stop-task" and immediately
Calls "aws ecs run-task" to start the same task again

I believe the ECS Agent gets confused when it tries to resolve the link dependencies for the "run-task" operation, because some of the containers that were just stopped are still alive. When it tries to create a new container, I see this error in "docker inspect":

"State": {
    "Error": "Cannot link to a non running container: /ecs-xxx-19-mongodb-f6eea4f8a4c0e2b17600 AS /ecs-xxx-19-xxx-a4a59ffef4d1b48bb701/mongodb",
    "ExitCode": 128,

My task definition has 4 containers: A MongoDB server and some applications linked to it.
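If the cause is indeed a race between the stop and the new run, a deployment script could poll until the old task reports STOPPED before calling run-task. The Python sketch below assumes `ecs` behaves like a boto3 ECS client; the function name, timeout, and polling interval are arbitrary choices for the example.

```python
import time


def wait_until_stopped(ecs, cluster, task_arn, timeout=120.0, interval=2.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Poll describe_tasks until the task's lastStatus is STOPPED.

    Returns True once the task is stopped (or no longer listed),
    False if the timeout expires first.
    """
    deadline = clock() + timeout
    while True:
        tasks = ecs.describe_tasks(cluster=cluster, tasks=[task_arn])["tasks"]
        # Assumption: a task that is no longer returned is treated as gone.
        if not tasks or tasks[0]["lastStatus"] == "STOPPED":
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

The script would then call stop-task, wait for this function to return True, and only then call run-task.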

Agent seems to stop dispatching after thousands of jobs

Sorry, this is a bit anecdotal, but I've seen this a few times. After a couple of thousand short-lived jobs, the agent just seems to stop - no log output, no job dispatching, etc.

If I try and force stop and restart the agent, I get:

[ec2-user@ip-10-1-3-220 ~]$ sudo docker logs 1e2bf6f75398
t=2015-03-16T10:18:22+0000 lvl=crit msg="Error loading previously saved state" module=main err="Could not unmarshal state; incomplete save. There was no task for docker id 19408a2fba970833ac535a0b41cb22fe718b827fa8c23f6cf05e95048fd9ab6f"

Although the agent does actually come back up and start processing again.

Log details

Are there any logs that could help debug tasks?

I found the agent logs here /var/logs/ecs/
But there isn't anything useful there to help me debug the command I've run below.

[
  {
    "environment": [
      {"name":"SLACK_URL", "value":"URL"}
    ],
    "name": "taskName",
    "image": "imageId",
    "cpu": 512,
    "memory": 1884,
    "command": ["curl","-XPOST", "$SLACK_URL", "-d'{\"text\":\"Something\"}'"],
    "essential": true
  }
]
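One thing worth checking in a definition like this: a command given as a JSON array is run directly, without a shell, so "$SLACK_URL" is passed to curl as a literal string instead of being expanded. Whether that is the actual failure here is an assumption. The Python sketch below reproduces the difference with ordinary processes; both helper names are made up for the example.

```python
import subprocess


def run_exec_form(argv, env):
    """Run argv directly, with no shell in between (like a JSON command array)."""
    result = subprocess.run(argv, env=env, capture_output=True, text=True)
    return result.stdout.strip()


def run_via_shell(command, env):
    """Run the command through /bin/sh -c, so $VARIABLES are expanded."""
    result = subprocess.run(["/bin/sh", "-c", command], env=env,
                            capture_output=True, text=True)
    return result.stdout.strip()
```

In the first form the child receives the five characters of the variable name preceded by a dollar sign; in the second the shell substitutes the value. Wrapping the task command as ["sh", "-c", "..."] is one way to get the second behaviour, provided the image contains a shell.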

TCS connection issue

Hi,

I'm not sure what exactly TCS is, but the connection to it seems to be causing trouble for our ECS agents. An excerpt from our agents' logs:

2015-06-08T08:59:12Z [INFO] Creating poll dialer module="ws client" host="ecs-t-2.eu-west-1.amazonaws.com"
2015-06-08T08:59:12Z [WARN] Error creating a websocket client module="ws client" err="websocket: bad handshake"
2015-06-08T08:59:12Z [ERROR] Error connecting to TCS: {"AccessDeniedException":"Forbidden"}, websocket: bad handshake module="tcs handler"
2015-06-08T08:59:12Z [INFO] Error from tcs; backing off module="tcs handler" err="{"AccessDeniedException":"Forbidden"}, websocket: bad handshake"
2015-06-08T09:00:19Z [INFO] Creating poll dialer module="ws client" host="ecs-t-2.eu-west-1.amazonaws.com"
2015-06-08T09:00:19Z [WARN] Error creating a websocket client module="ws client" err="websocket: bad handshake"
2015-06-08T09:00:19Z [ERROR] Error connecting to TCS: {"AccessDeniedException":"Forbidden"}, websocket: bad handshake module="tcs handler"
2015-06-08T09:00:19Z [INFO] Error from tcs; backing off module="tcs handler" err="{"AccessDeniedException":"Forbidden"}, websocket: bad handshake"

Apparently it lacks permissions to access TCS, however I don't really know how I can change this. Can you help me with this one?

We are using an ECS agent built from the dev branch.

Thanks,
Paul

Agent should report dispatch failures... somewhere

I've had two cases where containers fail to start:

machine is out of disk space
repository is insecure, so docker fails to pull

In both cases, the task fails, but without tailing the logs on the machine, I wasn't aware as to why. Perhaps the Agent could send errors to CloudWatch? Or would the recommended approach be to bake my own AMI with cloud watch agent set up for those log files?

"Error from acs"

Hey! Every ECS agent in the last day-ish (and one I just launched) seems to be failing with the following. Let me know if there's anything I can try or if you need more info.

Agent image: 153c1961a9ce

Client version: 1.6.2
Client API version: 1.18
Go version (client): go1.3.3
Git commit (client): 7c8fca2/1.6.2
OS/Arch (client): linux/amd64
Server version: 1.6.2
Server API version: 1.18
Go version (server): go1.3.3
Git commit (server): 7c8fca2/1.6.2
OS/Arch (server): linux/amd64

2015-06-10T15:36:23Z [INFO] Creating poll dialer module="acs client" host="ecs-a-2.us-west-2.amazonaws.com"
2015-06-10T15:36:23Z [ERROR] Error getting message from acs module="acs client" err="websocket: close 1011 Server Error"
2015-06-10T15:36:23Z [INFO] Error from acs; backing off module="acs handler" err="websocket: close 1011 Server Error"
2015-06-10T15:36:24Z [INFO] Creating poll dialer module="acs client" host="ecs-a-2.us-west-2.amazonaws.com"
2015-06-10T15:36:33Z [INFO] Saving state! module="statemanager"
2015-06-10T15:52:25Z [ERROR] Error getting message from acs module="acs client" err="read tcp 54.240.249.14:443: use of closed network connection"
2015-06-10T15:52:25Z [INFO] Error from acs; backing off module="acs handler" err="read tcp 54.240.249.14:443: use of closed network connection"
2015-06-10T15:52:32Z [ERROR] Unable to discover poll endpoint module="acs handler" err="Post https://ecs.us-west-2.amazonaws.com/: net/http: request canceled while waiting for connection"
2015-06-10T15:52:32Z [INFO] Error from acs; backing off module="acs handler" err="Post https://ecs.us-west-2.amazonaws.com/: net/http: request canceled while waiting for connection"
2015-06-10T15:52:41Z [ERROR] Unable to discover poll endpoint module="acs handler" err="Post https://ecs.us-west-2.amazonaws.com/: net/http: request canceled while waiting for connection"
2015-06-10T15:52:41Z [INFO] Error from acs; backing off module="acs handler" err="Post https://ecs.us-west-2.amazonaws.com/: net/http: request canceled while waiting for connection"
2015-06-10T15:52:56Z [ERROR] Unable to discover poll endpoint module="acs handler" err="Post https://ecs.us-west-2.amazonaws.com/: net/http: request canceled while waiting for connection"
2015-06-10T15:52:56Z [INFO] Error from acs; backing off module="acs handler" err="Post https://ecs.us-west-2.amazonaws.com/: net/http: request canceled while waiting for connection"

Troubleshooting `docker run`

Hi, I'm trying to find out why $(whoami) fails when the ecs-agent runs my containers. When I ssh into the instance and launch a container via the CLI it works fine, but I can't find a way to docker run with the exact same parameters that the ecs-agent uses, probably because it goes through fsouza/go-dockerclient. How can I debug this? Is there an endpoint in the ECS Container Agent Introspection API that I can use? Or what would be the CLI equivalent of the API call?

docker containers should be destroyed when stopped

Hi,

No cleanup is done when a container stops or fails: after a couple of minutes, I have up to dozens of failed containers.
Is there a way to reproduce the docker run --rm option?

Thanks

log file name

When launching the agent, the current date is appended to the end of the log file name. Is there a way to prevent this?
Docker version 1.5
OS : Centos 6
Agent : Latest

"unable to detect version control system" for mockgen

I'm attempting to build the ECS agent on an EC2 instance running Ubuntu 14.04, and the build fails when it attempts to pull the mockgen dependency, with the error

package code.google.com/p/gomock/mockgen: unable to detect version control system for code.google.com/ path

(I get the same error when trying to pull the dependency locally with Go).

http://code.google.com/p/gomock/mockgen redirects to GitHub now, so possibly the dependency location needs updating in this repo?

[email protected]# make
docker build -t "amazon/amazon-ecs-agent-cert-source:make" misc/certs/
Sending build context to Docker daemon 3.072 kB
Sending build context to Docker daemon
Step 0 : FROM debian:latest
 ---> bf84c1d84a8f
Step 1 : RUN apt-get update &&      apt-get install -y ca-certificates &&     rm -rf /var/lib/apt/lists/*
 ---> Using cache
 ---> 91350f438685
Successfully built 91350f438685
docker run "amazon/amazon-ecs-agent-cert-source:make" cat /etc/ssl/certs/ca-certificates.crt > misc/certs/ca-certificates.crt
Sending build context to Docker daemon 5.024 MB
Sending build context to Docker daemon
Step 0 : FROM golang:1.4
 ---> 1a22368b487a
Step 1 : MAINTAINER Amazon Web Services, Inc.
 ---> Using cache
 ---> 95525eda1a2d
Step 2 : RUN mkdir /out
 ---> Using cache
 ---> 9c4bf2b29597
Step 3 : VOLUME ['/out']
 ---> Using cache
 ---> 0d5232a84ee9
Step 4 : RUN mkdir -p /go/src/github.com/aws/
 ---> Using cache
 ---> 8dfd53fcff05
Step 5 : COPY /scripts/build /scripts/build
 ---> Using cache
 ---> ebd1386ec32c
Step 6 : WORKDIR /go/src/github.com/aws/amazon-ecs-agent
 ---> Using cache
 ---> d2af5fbe8acf
Step 7 : ENTRYPOINT /scripts/build
 ---> Using cache
 ---> cd7444209b3e
Successfully built cd7444209b3e
go get github.com/tools/godep
go get golang.org/x/tools/cmd/cover
go get code.google.com/p/gomock/mockgen
package code.google.com/p/gomock/mockgen: unable to detect version control system for code.google.com/ path
Makefile:95: recipe for target 'get-deps' failed
make: *** [get-deps] Error 1
cat agent/gogenerate/inflections.csv >> agent/Godeps/_workspace/src/github.com/awslabs/aws-sdk-go/internal/model/api/inflections.csv
acs/update_handler/os/mock/filesystem.go:1:1: expected 'package', found 'EOF'
api/mocks/api_mocks.go:1:1: expected 'package', found 'EOF'
ec2/mocks/ec2_mocks.go:1:1: expected 'package', found 'EOF'
engine/dockerclient/mocks/dockerclient_mocks.go:1:1: expected 'package', found 'EOF'
engine/mocks/engine_mocks.go:1:1: expected 'package', found 'EOF'
httpclient/mock/httpclient.go:1:1: expected 'package', found 'EOF'
stats/mock/engine.go:1:1: expected 'package', found 'EOF'
stats/resolver/mock/resolver.go:1:1: expected 'package', found 'EOF'
wsclient/mock/client.go:1:1: expected 'package', found 'EOF'
/go/src/github.com/aws/amazon-ecs-agent/scripts/generate/mockgen.sh: line 43: mockgen: command not found
/go/src/github.com/aws/amazon-ecs-agent/scripts/generate/mockgen.sh: line 43: goimports: command not found
acs/update_handler/os/filesystem.go:18: running "mockgen.sh": exit status 127
/go/src/github.com/aws/amazon-ecs-agent/scripts/generate/mockgen.sh: line 43: mockgen: command not found
/go/src/github.com/aws/amazon-ecs-agent/scripts/generate/mockgen.sh: line 43: goimports: command not found
api/generate_mocks.go:16: running "mockgen.sh": exit status 127
/go/src/github.com/aws/amazon-ecs-agent/scripts/generate/mockgen.sh: line 43: mockgen: command not found
/go/src/github.com/aws/amazon-ecs-agent/scripts/generate/mockgen.sh: line 43: goimports: command not found
ec2/generate_mocks.go:16: running "mockgen.sh": exit status 127
/go/src/github.com/aws/amazon-ecs-agent/scripts/generate/mockgen.sh: line 43: mockgen: command not found
/go/src/github.com/aws/amazon-ecs-agent/scripts/generate/mockgen.sh: line 43: goimports: command not found
engine/generate_mocks.go:16: running "mockgen.sh": exit status 127
/go/src/github.com/aws/amazon-ecs-agent/scripts/generate/mockgen.sh: line 43: mockgen: command not found
/go/src/github.com/aws/amazon-ecs-agent/scripts/generate/mockgen.sh: line 43: goimports: command not found
engine/dockerclient/generate_mocks.go:16: running "mockgen.sh": exit status 127
/go/src/github.com/aws/amazon-ecs-agent/scripts/generate/mockgen.sh: line 43: mockgen: command not found
/go/src/github.com/aws/amazon-ecs-agent/scripts/generate/mockgen.sh: line 43: goimports: command not found
httpclient/httpclient.go:33: running "mockgen.sh": exit status 127
/go/src/github.com/aws/amazon-ecs-agent/scripts/generate/mockgen.sh: line 43: mockgen: command not found
/go/src/github.com/aws/amazon-ecs-agent/scripts/generate/mockgen.sh: line 43: goimports: command not found
stats/engine.go:16: running "mockgen.sh": exit status 127
/go/src/github.com/aws/amazon-ecs-agent/scripts/generate/mockgen.sh: line 43: mockgen: command not found
/go/src/github.com/aws/amazon-ecs-agent/scripts/generate/mockgen.sh: line 43: goimports: command not found
stats/resolver/resolver.go:18: running "mockgen.sh": exit status 127
/go/src/github.com/aws/amazon-ecs-agent/scripts/generate/mockgen.sh: line 43: mockgen: command not found
/go/src/github.com/aws/amazon-ecs-agent/scripts/generate/mockgen.sh: line 43: goimports: command not found
wsclient/client.go:92: running "mockgen.sh": exit status 127
godep: go exit status 1
Makefile:59: recipe for target 'gogenerate' failed
make: *** [gogenerate] Error 1
./
./tmp/
8f2da306715d3021818b12b7058b56c6aab77aabc4cea0efddf6bbb69889eb01
Sending build context to Docker daemon 5.024 MB
Sending build context to Docker daemon
Step 0 : FROM amazon/amazon-ecs-scratch:make
 ---> 8f2da306715d
Step 1 : COPY out/amazon-ecs-agent /agent
INFO[0000] out/amazon-ecs-agent: no such file or directory
make: *** [docker] Error 1

Privileged mode support

It seems this isn't supported today. I would really like this for running things like VPNs that may need to modify iptables etc.

Is this on the roadmap?

Container Removal: Make more flexible than a fixed timeout

Currently the Agent removes containers that are part of stopped tasks after the task has been stopped for 3 hours.

This time was chosen as a reasonable duration to retain them for debugging purposes, but is not always the right duration. If a service thrashes, for example, you end up with many stopped containers in a short duration which can still lead to running out of disk / excessive Docker mounts.

This enhancement was requested in this forum post and also alluded to in issue #69.

Possible solutions (taken in part from the above linked discussions):

Only keep N stopped containers per task-family, removing any extra stopped tasks ahead of the 3 hour timeout.
Only keep N stopped containers total.
Monitor free disk space and begin removing the oldest tasks as soon as it becomes low.
Configurable removal timeout (per task? global for agent?)
Other (suggestions welcome).

I find the first two options fairly attractive, in part for their relative simplicity. However, they have the issue that if there is a specific task family that happens to run very large containers, that N might be too large even then. With N=1 it still could make sense though.

Monitoring disk space is a more tricky business, but also could be a good option.
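The first option above (keep only N stopped containers per task family) could be sketched as follows in Python. This is an illustration of the selection policy only, not agent code; the tuple layout and the function name are assumptions.

```python
from collections import defaultdict


def containers_to_remove(stopped, keep_per_family):
    """Pick stopped containers to delete, keeping the newest N per family.

    stopped: iterable of (container_id, task_family, stopped_at) tuples,
    where stopped_at is any sortable timestamp.
    Returns the container ids to remove, sorted for stable output.
    """
    by_family = defaultdict(list)
    for container_id, family, stopped_at in stopped:
        by_family[family].append((stopped_at, container_id))

    doomed = []
    for entries in by_family.values():
        entries.sort(reverse=True)  # newest first
        doomed.extend(cid for _, cid in entries[keep_per_family:])
    return sorted(doomed)
```

The "N stopped containers total" variant is the same idea with a single bucket instead of one per family.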

Docker ENV Private Registry parse error...

Howdy,

Found that if you don't wrap ECS_ENGINE_AUTH_DATA in double quotes on the docker run, you will have a bad time.

Good

--env=ECS_ENGINE_AUTH_DATA="{"https://index.docker.io/v1/":{"username":"my_name","password":"my_password","email":"[email protected]"}}"

Bad

--env=ECS_ENGINE_AUTH_DATA={"https://index.docker.io/v1/":{"username":"my_name","password":"my_password","email":"[email protected]"}}

Without double quotes, ECS_ENGINE_AUTH_DATA will be on two lines, and thus not work. Example:

Docker inspect output

Bad

"Env": [
    "ECS_LOGFILE=/log/ecs-agent.log",
    "ECS_LOGLEVEL=info",
    "ECS_DATADIR=/data",
    "ECS_CLUSTER=my-cluster",
    "ECS_ENGINE_AUTH_TYPE=dockercfg",
    "ECS_ENGINE_AUTH_DATA={https://index.docker.io/v1/:auth:REMOVED}",
    "ECS_ENGINE_AUTH_DATA={https://index.docker.io/v1/:email:REMOVED}",
    "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
]

Good

"Env": [
    "ECS_LOGFILE=/log/ecs-agent.log",
    "ECS_LOGLEVEL=info",
    "ECS_DATADIR=/data",
    "ECS_CLUSTER=my-cluster",
    "ECS_ENGINE_AUTH_TYPE=dockercfg",
    "ECS_ENGINE_AUTH_DATA={https://index.docker.io/v1/:{auth:REMOVED,email:REMOVED}}",
    "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
]

The error you face when ECS Agent cannot read your ECS_ENGINE_AUTH_DATA looks like:

2015-06-24T17:22:17Z [ERROR] Unable to decode provided docker credentials module="ecs credentials" type="dockercfg"

TL;DR

The agent should be able to understand both the single-line ECS_ENGINE_AUTH_DATA and the two-line ECS_ENGINE_AUTH_DATA in the docker ENV.

Thanks,
Jason
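
To illustrate why a split value like the ones in the Bad output cannot be decoded, here is a minimal sketch of a dockercfg-style decoder. It is an assumption for illustration only, not the agent's actual decoder; the function name and error messages are hypothetical.

```python
import json


def decode_dockercfg(raw):
    """Decode a dockercfg-style JSON string into {registry: credentials}.

    Raises ValueError when the value is not a JSON object, for example
    when the shell has split the value or stripped its quotes.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as err:
        raise ValueError(f"unable to decode docker credentials: {err}") from err
    if not isinstance(data, dict):
        raise ValueError("docker credentials must be a JSON object")
    return data
```

In a POSIX shell, wrapping the whole JSON value in single quotes keeps the inner double quotes and braces intact, which is one way to pass a value that a strict JSON decoder like this accepts.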

Docker server pulls irrelevant images when the version of the Docker image is not set into the task definition

Hi,

On Amazon Linux using Docker 1.6.2, it seems like Docker pulls random images when no version of the Docker image is set in the task definition.
For instance:

If the task definition references the image "registry", Docker will pull unusable images until the devicemapper directory fills the whole filesystem
If I try to pull the image "registry" directly, it works
If I reference "registry:latest" instead of "registry", it works.

Any idea what's happening ?

Thanks
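
The cause is not confirmed in this report. Since "registry:latest" works where "registry" does not, one possible workaround is to default the tag before a pull is requested. The sketch below is only that; the function name and the choice of "latest" as default are assumptions.

```python
def with_default_tag(image, default_tag="latest"):
    """Append a default tag to an image name that has neither tag nor digest."""
    if "@" in image:
        # Digest reference such as name@sha256:...; leave untouched.
        return image
    # Only the last path component can carry a tag; a colon earlier in the
    # name belongs to a registry host:port such as localhost:5000.
    last_component = image.rsplit("/", 1)[-1]
    if ":" in last_component:
        return image
    return f"{image}:{default_tag}"
```

With this, "registry" becomes "registry:latest", while names that already carry a tag or digest pass through unchanged.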

Ports are assumed to be TCP

I am looking into ECS, but after looking at the source code it seems ports are assumed to be /tcp. In Docker you can specify a port as /udp, for example -p 8301:8301/udp in the port mapping. Some of my containers require UDP exchanges.
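
For illustration, a port mapping that carries an optional protocol could be parsed along these lines. This is a minimal sketch, not the agent's code; the function name is hypothetical, and it only handles the HOST:CONTAINER[/PROTOCOL] form with tcp and udp.

```python
def parse_port_mapping(spec, default_protocol="tcp"):
    """Parse 'HOST:CONTAINER[/PROTOCOL]' into (host_port, container_port, protocol)."""
    ports, _, protocol = spec.partition("/")
    # Fall back to TCP only when no protocol suffix was given.
    protocol = (protocol or default_protocol).lower()
    if protocol not in ("tcp", "udp"):
        raise ValueError(f"unsupported protocol: {protocol!r}")
    host, sep, container = ports.partition(":")
    if not sep:
        raise ValueError(f"expected HOST:CONTAINER, got {ports!r}")
    return int(host), int(container), protocol
```

Keeping tcp as the default preserves the current behaviour for mappings that do not name a protocol.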

Specifying "host" to running ecs-launched containers

When manually starting a container ("docker run") you can pass --hostname=myhostname (or -h myhostname), which sets the hostname of the running container.

For various licensing reasons a piece of software we have requires the hostname to be "bi-01" yet when I deploy via ecs, I cannot provide a hostname in the configuration.

Am I missing something?

Image Removal

Howdy,

Currently the ECS agent does not remove old, unused Docker images. Over time, this could cause the instance to run out of disk space.

'entryPoint' is not correctly passed through to docker

Entrypoints are not set for docker when they're set in task definitions.

Container is unable to communicate with EC2 Metadata service

I'm using a standard ubuntu-based AMI with docker on top instead of the ECS-optimized one. I'm trying to run the ecs-agent (v1.2.1) as stated in the documentation:

docker run --rm --name ecs-agent -v /var/run/docker.sock:/var/run/docker.sock \
-v /var/log/ecs:/log -v /var/lib/ecs/data:/data -p 127.0.0.1:51678:51678  \
-e ECS_CLUSTER="my-cluster" -e ECS_LOGFILE=/log/ecs-agent.log  \
-e ECS_LOGLEVEL=debug -e ECS_DATADIR=/data amazon/amazon-ecs-agent:latest

But it seems like the ecs-agent is not able to reach the EC2 metadata endpoint in the instance

2015-06-22T15:15:13Z [INFO] Starting Agent: Amazon ECS Agent - v1.2.1 (5da1555)
2015-06-22T15:15:13Z [INFO] Loading configuration
2015-06-22T15:15:14Z [CRITICAL] Unable to communicate with EC2 Metadata service to infer region: Get http://169.254.169.254/2014-02-25/dynamic/instance-identity/document: dial tcp 169.254.169.254:80: i/o timeout module="config"
2015-06-22T15:15:14Z [CRITICAL] Configuration key not set module="config" key="AWSRegion"
2015-06-22T15:15:14Z [CRITICAL] Error loading config: Missing required fields: AWSRegion

So I tried adding --net=host to the command above, and that did the trick for the container:

2015-06-22T15:17:31Z [INFO] Starting Agent: Amazon ECS Agent - v1.2.1 (5da1555)
2015-06-22T15:17:31Z [INFO] Loading configuration
2015-06-22T15:17:31Z [DEBUG] Loaded config: Cluster: %!v(MISSING), Region: %!v(MISSING), DataDir: %!v(MISSING), Checkpoint: %!v(MISSING), AuthType: %!v(MISSING), UpdatesEnabled: %!v(MISSING), DisableMetrics: %!v(MISSING), ReservedMem: %!v(MISSING)
2015-06-22T15:17:31Z [INFO] Checkpointing is enabled. Attempting to load state
2015-06-22T15:17:31Z [INFO] Loading state! module="statemanager"
2015-06-22T15:17:31Z [DEBUG] Loaded state! module="statemanager" state="&{Data:map[EC2InstanceID:0xc2080b0100 ACSSeqNum:0xc2080b0130 TaskEngine:0xc2080b00a0 ContainerInstanceArn:0xc2080b00c0 Cluster:0xc2080b00e0] Version:3}"
2015-06-22T15:17:31Z [INFO] Restored cluster 'my-cluster'
2015-06-22T15:17:31Z [INFO] Restored from checkpoint file. I am running as 'arn:*******' in cluster 'my-cluster'
2015-06-22T15:17:31Z [INFO] Saving state! module="statemanager"
2015-06-22T15:17:31Z [INFO] Beginning Polling for updates
2015-06-22T15:17:31Z [DEBUG] Connecting to ACS endpoint https://ecs-a-1.us-east-1.amazonaws.com/ module="acs handler"
2015-06-22T15:17:31Z [DEBUG] Updates disabled; no handlers added module="updater"
2015-06-22T15:17:31Z [INFO] Creating poll dialer module="ws client" host="ecs-a-1.us-east-1.amazonaws.com"
2015-06-22T15:17:31Z [DEBUG] Starting websocket poll loop module="acs client"

However, I'm not sure whether this is a real solution. Giving a container full access to its host's network stack is rather radical, so this is unlikely to be the ideal setup. I haven't seen any reference to this in the documentation, so most likely something else is going on and I'm missing it. Is something else required for the container to be able to reach the EC2 metadata endpoint?

I'd be glad if you could lend me a hand with this. If you need more context, just let me know.

Thank you 🍷

Allow a task definition to reference auth/credentials

As suggested by @asans (#4 (comment) on issue 4, option 2)

Right now Docker auth can be configured at the container-instance level and affects all tasks launched on that instance.

It would make sense for auth information to be referenced along with the image it pertains to as part of a task definition. For security and ease of update, it would also make sense if this information could be given as an S3 resource reference.

Server eventually runs out of disk space after running a LOT of containers

Although the tasks finish, the containers seem to be kept around. Could the agent clear these periodically, or only keep the last n containers?

Accessing private docker images

I've been looking through the code for a way for ECS to access private images, but I can't find any documentation about this (the AWS documentation doesn't cover it either).

Based on the description on the ECS product page, it indicates that this can be done. So where can I set the docker authentication for the ECS agent? Or I assume that this needs to be set inside .dockercfg file and somehow placed into the AMI image upon instance launching?

Or should ecs-agent actually support this?

Slower starting and stopping of tasks with v1.1 agent

We're having performance problems with v1.1 of the ECS agent. We're using ECS for force12.io which is a demo of container autoscaling / prioritization. The demo starts and stops tasks based on a random metric that changes every 5 seconds.

Our live site is using the v1.0 agent and usually keeps up with the metric. Our staging site is running the v1.1 agent and is noticeably slower and doesn't keep up. Otherwise the 2 environments are identical.

The delay occurs after tasks have been stopped and a new task is started. It seems to be a delay in the agent receiving the task from the scheduler rather than the agent taking a long time.

I can reproduce the problem on a container instance with the 1.1 agent by

starting 4 tasks
stopping 3 tasks
starting another task

For the final task there is a delay of 20-25 seconds before the POST /v1.17/images/create message appears in the agent logs. Doing the same test with a 1.0 agent the message appears in 2 seconds.

We're running CoreOS stable (ami-ea657582) with this cloud-config data.

Re-register deregistered container instance

Since register-container-instance is "private", is there another way to re-register a container instance? Primarily for maintenance purposes. We'll probably just end up working around this by making our instances easier to toss away and spin up new ones, but there may still be valid use-cases where you'd just like to momentarily de-register.

The docs are also not clear on how running containers are shut down when a container instance is deregistered. Is this graceful? Are they still properly deregistered from ELBs and SIGTERM'd, or are they SIGKILL'd?

thanks!

Data volume at mount point ignores mounted EBS volume

I have a task definition which mounts a directory from the host into the container for writing, like /mnt/host -> /mnt/container.

I have an EBS volume mounted on the host at this path, for example mount -t ext4 /dev/xvdf /mnt/host.

No matter what I do, writes that originate inside the container are ending up writing into the host's "mount point", ie /mnt/host underneath the mounted EBS volume. So writes are invisible on the host machine until I unmount the EBS volume, at which point they are revealed in the underlying mount point.

Can anyone help me with this? I am using the latest ECS AMI (ami-ae6559c6) and Docker image built very simply that starts with FROM ubuntu:14.04.

Docker Bench Security Recommendations

Hi,

We ran the docker bench security container against our ECS cluster, and it is recommending a few changes for the ecs-agent container. We are using the ECS optimized AMI for our environment. Here are a few items that were flagged:

ecs-agent is running as root within its container
no memory usage limitations exist on the ecs-agent container
no cpu prioritization is set for the ecs-agent container

Can the above items be addressed in future releases of the ecs-agent?

Thanks.

Container failure reasons should be more clear and include more types of failures

When a container unexpectedly exits, it should be more clearly communicated what went wrong.

This issue is to point out specific cases where it could be improved.

OOMKilled - If a container is killed due to memory constraints, this should be communicated.
"no such image" should instead give the error from pulling (e.g. registry auth).
Anything docker 1.5 puts in the .State.Error field (executable file not found in $PATH, etc) should be bubbled up.

Additional suggestions are welcome.
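
As a sketch of the ordering suggested above, a reason could be picked from the container's .State as reported by docker inspect. The OOMKilled, Error, and ExitCode fields are part of that output; the function name, the pull_error parameter, and the reason strings are hypothetical.

```python
def failure_reason(state, pull_error=None):
    """Pick the most specific failure reason available for a stopped container.

    `state` is the `.State` object from `docker inspect`, as a dict.
    `pull_error` is a hypothetical value holding the error from a failed pull.
    """
    if pull_error:
        # Prefer the original pull error over a later "no such image".
        return f"pull failed: {pull_error}"
    if state.get("OOMKilled"):
        return "out of memory: container was OOM-killed"
    if state.get("Error"):
        return f"docker error: {state['Error']}"
    return f"exited with code {state.get('ExitCode', 'unknown')}"
```

Checking the pull error first matters because a failed pull otherwise surfaces only as the less helpful "no such image" when the container is created.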

Usage of an http/https proxy

Hi,

The host machines do not access the internet directly, but through an http/https proxy into my infrastructure: I'd like to be able to use it!
I tried to write :
HTTP_PROXY=http://proxyaddress:proxyport
HTTPS_PROXY=http://proxyaddress:proxyport
NO_PROXY=169.254.169.254
into /etc/ecs/ecs.config (as I believe it should work ?), but had no success so far.

Is this a feature you intend to implement?

Thanks,

Arthur

Permission denied pulling images from quay.io.

I'm trying to troubleshoot some issues with the agent, but I can't get a shell in the container. I can get a shell in other containers.

% docker run -it --entrypoint "/bin/bash" ubuntu

Works fine. This doesn't:

% docker run -it --entrypoint "/bin/bash" amazon/amazon-ecs-agent
exec: "/bin/bash": stat /bin/bash: no such file or directory
FATA[0000] Error response from daemon: Cannot start container 87b432eab5ae7439fd1a0e4275b51b5e36fb40a7d230a8ebc7885683fd202c20: [8] System error: exec: "/bin/bash": stat /bin/bash: no such file or directory

Docker info:

% docker info
Containers: 3
Images: 46
Storage Driver: aufs
 Root Dir: /mnt/sda1/var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 52
 Dirperm1 Supported: true
Execution Driver: native-0.2
Kernel Version: 4.0.3-boot2docker
Operating System: Boot2Docker 1.6.2 (TCL 5.4); master : 4534e65 - Wed May 13 21:24:28 UTC 2015
CPUs: 8
Total Memory: 1.957 GiB
Name: boot2docker
ID: QD2S:VSPG:KS3T:L72V:UXPI:U3SN:M4C7:3BJT:5SN2:LI3P:W2RF:AMGL
Debug mode (server): true
Debug mode (client): false
Fds: 12
Goroutines: 17
System Time: Thu May 21 15:18:58 UTC 2015
EventsListeners: 0
Init SHA1: 7f9c6798b022e64f04d2aff8c75cbf38a2779493
Init Path: /usr/local/bin/docker
Docker Root Dir: /mnt/sda1/var/lib/docker
Username: optimality
Registry: [https://index.docker.io/v1/]

data-volume containers can result in tasks stuck in PENDING

The agent does not properly move forwards a task with dead data volumes in all cases. A data-volume should be able to be used when the associated container is stopped.

Support for sha256 style tags

Docker supports immutable tags, but these are not currently supported by ECS. Task definitions and the agent should properly support these tags, as they are useful for many use cases (e.g. truly immutable task definitions / deployments) and could allow better caching.

Example: busybox@sha256:5b2fff9306bd9380ee85f96adaf67097f5b9e95d37da995099875b9e95913063
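
For illustration, a reference of this form could be told apart from a tag-style one with a small helper like the following. This is a minimal sketch with hypothetical names; it only recognises sha256 digests of 64 lowercase hex characters, as in the example above.

```python
import re

# A sha256 digest is "sha256:" followed by 64 lowercase hex characters.
_DIGEST_RE = re.compile(r"^sha256:[0-9a-f]{64}$")


def split_image_reference(ref):
    """Split 'name@sha256:<hex>' into (name, digest).

    For tag-style references without '@', the digest is None and the
    reference is returned unchanged as the name.
    """
    name, sep, digest = ref.partition("@")
    if not sep:
        return ref, None
    if not name or not _DIGEST_RE.match(digest):
        raise ValueError(f"invalid digest reference: {ref!r}")
    return name, digest
```

A caller could then pull by digest when one is present and fall back to the tag otherwise.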


Are there any logs that could help debug tasks?
