contiv / ansible
ansible scripts for contiv cluster
License: Other
The base role installs several development packages such as git, gcc, and perl that should not be present in a production environment unless the service itself needs them to install or run.
The following packages in the base role can most likely move to a dev role:
- ntp
- vim
- curl
- git
- mercurial
- gcc
- perl
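A minimal sketch of the proposed split, reusing the package list above (the dev role does not exist yet; the task layout and conditionals are illustrative):

```yaml
# roles/base/tasks/main.yml -- runtime essentials only
- name: install base packages (redhat)
  yum: name={{ item }} state=present
  with_items:
    - ntp
    - curl
  when: ansible_os_family == "RedHat"

# roles/dev/tasks/main.yml -- development tooling, kept out of production
- name: install dev packages (redhat)
  yum: name={{ item }} state=present
  with_items:
    - vim
    - git
    - mercurial
    - gcc
    - perl
  when: ansible_os_family == "RedHat"
```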
Right now we have a mix of two ways of storing role variables: the vars directory and the defaults directory. It would be nice to make this consistent. Moving everything to defaults seems reasonable, since those variables have the lowest precedence relative to inventory variables and variables passed on the command line, allowing code that depends on them to override them easily.
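As an illustration (the variable name here is hypothetical), a value in defaults/ can be overridden from the inventory or the command line without touching the role:

```yaml
# roles/etcd/defaults/main.yml -- lowest precedence
etcd_client_port: 2379

# override at run time, without editing the role:
#   ansible-playbook site.yml -e "etcd_client_port=4001"
```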
When scheduler_provider = "ucp-swarm", the following timeout is seen on many runs:
TASK [ucp : wait for ucp files to be created, which ensures the service has started] ***
failed: [cluster-node1] => (item=ucp-fingerprint) => {"elapsed": 300, "failed": true, "item": "ucp-fingerprint", "msg": "Timeout when waiting for file /tmp/ucp-fingerprint"}
failed: [cluster-node1] => (item=ucp-instance-id) => {"elapsed": 300, "failed": true, "item": "ucp-instance-id", "msg": "Timeout when waiting for file /tmp/ucp-instance-id"}
PLAY RECAP *********************************************************************
cluster-node1 : ok=132 changed=22 unreachable=0 failed=1
For an HA cluster, we need to set the --replica flag on UCP master nodes and the --replication flag on swarm master nodes.
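A rough sketch of what those tasks could look like (the flags match the UCP and swarm CLIs of that era; the variable and group names are assumptions):

```yaml
# join additional UCP controllers as replicas on non-first masters
- name: join ucp controller as replica
  command: docker run --rm docker/ucp join --replica --url {{ ucp_master_url }}
  when: inventory_hostname != groups['service-master'][0]

# swarm managers need --replication (plus --advertise) for HA
- name: start swarm manager in replication mode
  command: >
    docker run -d swarm manage --replication
    --advertise {{ node_addr }}:4000 etcd://{{ etcd_peers_list }}
```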
This tracks the comment in #97 (comment) to make the UCP logic behave identically to the swarm logic done in #97.
Workaround: install bzip2 manually in the worker nodes
TASK: [contiv_network | install netmaster and netplugin] **********************
failed: [node2] => {"changed": true, "cmd": "tar vxjf /tmp/contivnet.tar.bz2", "delta": "0:00:00.007268", "end": "2016-01-07 08:08:07.476339", "rc": 2, "start": "2016-01-07 08:08:07.469071", "warnings": ["Consider using unarchive module rather than running tar"]}
stderr: tar (child): bzip2: Cannot exec: No such file or directory
tar (child): Error is not recoverable: exiting now
tar: Child returned status 2
tar: Error is not recoverable: exiting now
FATAL: all hosts have already failed -- aborting
PLAY RECAP ********************************************************************
to retry, use: --limit @/root/site.retry
node2 : ok=43 changed=24 unreachable=0 failed=1
Currently when more than one node is provisioned with swarm running as master, swarm doesn't work as expected. We need a mechanism to run it as master on only one node.
Currently, we are seeing this issue when netplugin-node hostgroup is used.
We have encountered many issues when re-running the playbook on a node that has an older version installed or whose environment variables have changed. We need a mechanism to detect when the services running on a node are of an older version or configuration, and to reinstall or restart them as necessary.
After Docker and etcd installation, we got the following error:
TASK [base : stop etcd] ********************************************************
fatal: [node1]: FAILED! => {"changed": false, "failed": true, "msg": "systemd could not find the requested service 'etcd'"}
And we found that there is no service file in /etc/systemd/ or /lib/systemd.
Here is a proposed patch.
main.txt
apt-get update errors out on some hosts. The workaround is to clean up /var/lib/apt/lists.
We need to incorporate this into the ansible scripts so the cleanup is performed when the error occurs.
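A sketch of how the cleanup could be wired into the play (the modules are standard; the failure detection here is deliberately simplistic):

```yaml
- name: update apt cache
  apt: update_cache=yes
  register: apt_update_result
  ignore_errors: yes

# workaround from above: drop stale lists and retry once
- name: clean stale apt lists and retry
  shell: rm -rf /var/lib/apt/lists/* && apt-get update
  when: apt_update_result|failed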
There is a bug with using overlayfs in docker, so we need to revert to using devicemapper for now until this is resolved:
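One way to sketch this, assuming the storage driver is templated into the daemon invocation (the variable name is illustrative):

```yaml
# roles/docker/defaults/main.yml
docker_storage_driver: devicemapper

# referenced from the docker systemd unit template, e.g.:
#   ExecStart=/usr/bin/docker daemon -s {{ docker_storage_driver }} ...
```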
Where possible, we should organize the tasks as proposed in #23 (comment).
This tracks an issue recently noticed in volplugin where multiple iptables entries were installed for the mDNS ports. The rule is installed as part of serf's service setup here: https://github.com/contiv/ansible/blob/master/roles/serf/templates/serf.j2#L17
Chain INPUT (policy ACCEPT)
target prot opt source destination
ACCEPT udp -- 0.0.0.0/0 0.0.0.0/0 udp spt:5353
ACCEPT udp -- 0.0.0.0/0 0.0.0.0/0 udp dpt:5353
ACCEPT udp -- 0.0.0.0/0 0.0.0.0/0 udp spt:5353
ACCEPT udp -- 0.0.0.0/0 0.0.0.0/0 udp dpt:5353
... (the same spt:5353/dpt:5353 rule pair repeated many more times)
The docker role should install a specific version of docker that we test our services against.
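A hedged sketch of pinning the version (the version value and package naming are illustrative, not the actual tested version):

```yaml
# roles/docker/defaults/main.yml
docker_version: "1.10.3"   # hypothetical pinned version

# roles/docker/tasks/main.yml
- name: install pinned docker (redhat)
  yum: name=docker-engine-{{ docker_version }} state=present
  when: ansible_os_family == "RedHat"

- name: install pinned docker (debian)
  apt: name=docker-engine={{ docker_version }}-0~{{ ansible_distribution_release }} state=present
  when: ansible_os_family == "Debian"
```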
We need to install appropriate rules to allow etcd traffic, something along the lines of https://github.com/kubernetes/contrib/blob/master/ansible/roles/etcd/tasks/main.yml#L39-L43
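Something like the following could mirror the linked tasks (the ports are etcd's defaults; a real version should first check whether the rule already exists to stay idempotent):

```yaml
- name: open etcd client and peer ports
  command: /sbin/iptables -I INPUT -p tcp --dport {{ item }} -j ACCEPT -m comment --comment "etcd"
  with_items:
    - 2379   # client traffic
    - 2380   # peer traffic
```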
We need to install and validate the CS version of Docker Engine as described here:
https://docs.docker.com/docker-trusted-registry/install/install-csengine/
The "check docker service state" task should set ignore_errors to true; otherwise ansible can incorrectly fail when the docker service is not running.
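That is, a sketch like:

```yaml
- name: check docker service state
  command: systemctl status docker
  register: docker_service_state
  ignore_errors: true
```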
I am seeing this issue when I run the net_demo_installer script. The cfg.yml contains the following:
CONNECTION_INFO:
172.29.205.249:
**control: p1p2**
data: eth2
Ansible output:
TASK: [docker | copy systemd units for docker tcp socket settings] ************
ok: [node1]
TASK: [docker | start docker tcp socket service] ******************************
failed: [node1] => {"changed": true, "cmd": "sudo systemctl stop docker && sudo systemctl start docker-tcp.socket && sudo systemctl start docker", "delta": "0:00:00.164433", "end": "2016-02-03 03:51:51.044325", "rc": 1, "start": "2016-02-03 03:51:50.879892", "warnings": []}
stderr: Warning: docker.service changed on disk. Run 'systemctl daemon-reload' to reload units.
Warning: docker.service changed on disk. Run 'systemctl daemon-reload' to reload units.
Job for docker.service failed. See "systemctl status docker.service" and "journalctl -xe" for details.
FATAL: all hosts have already failed -- aborting
journalctl -xe output:
-- Unit serf.service has begun starting up.
Feb 03 04:07:41 contiv146 serf.sh[13880]: setting up iptables for mdns
Feb 03 04:07:41 contiv146 serf.sh[13880]: starting serf
Feb 03 04:07:43 contiv146 serf.sh[13880]: eth1 is not assigned a valid addr: bin boot dev etc home initrd.img lib lib64 lost+found media mnt opt proc root run sbin srv sys tmp usr var vmlinuz
Feb 03 04:07:43 contiv146 systemd[1]: serf.service: main process exited, code=exited, status=1/FAILURE
Feb 03 04:07:43 contiv146 systemd[1]: Unit serf.service entered failed state.
Feb 03 04:07:43 contiv146 systemd[1]: serf.service failed.
Feb 03 04:07:49 contiv146 /usr/sbin/irqbalance[1321]: irq 56 affinity_hint subset empty
Feb 03 04:07:53 contiv146 systemd[1]: serf.service holdoff time over, scheduling restart.
Feb 03 04:07:53 contiv146 systemd[1]: Started Serf.
-- Subject: Unit serf.service has finished start-up
-- Defined-By: systemd
When pulling the aci-gw container, we should check for a particular version to pull.
The roles and groups defined so far include the swarm scheduler. What would be the best way to contribute a k8s scheduler to this repo? A few options I can think of:
My preference would be the first one (different host-groups), because there may be more dependent top-level roles that some of the schedulers (e.g. mesos) might require.
This issue tracks setting the vtep-ip flag for netplugin when setting up the contiv_network role.
This issue tracks identification and development of cleanup tasks that need to be executed when a node is decommissioned. A few tasks that I have in my mind are (not in any specific order of execution):
This issue is seen for tasks where we download binary releases and install them on the server nodes.
Example scenario:
Netplugin/netmaster binaries are downloaded from released version and installed on server-nodes
If we were to re-run the ansible playbook after running cleanup.yml playbook and then site.yml with a new release version, the new version is not downloaded and installed
Workaround:
Remove /tmp/contivnet.tar.bz2
Remove the associated binaries from /usr/bin/{netplugin,netmaster,netctl,contivk8s}
Rerun ansible-playbook site.yml
This issue would be seen for any such binaries/released version.
After rebooting a cluster of Contiv VMs, the netmaster and etcd services wouldn't come back up.
The reboot occurred after increasing each VM's core count from 1 to 2 and its memory from 1G to 8G.
I attempted to restart contiv with the startPlugin.py script, and afterwards attempted to reinstall the whole package with net_demo_installer. After running the script, the etcd and netmaster services both reported as failing. Manually starting the etcd service on both the primary and secondary node failed.
Citing the code segment. This code is from the /usr/bin/etcd.sh shell script that the ansible installer calls on restart:
if [ ! -f /var/tmp/etcd.existing ]; then
touch /var/tmp/etcd.existing
export ETCD_INITIAL_CLUSTER_STATE=new
export ETCD_INITIAL_CLUSTER="node1=http://10.88.38.75:2380,node1=http://10.88.38.75:7001,node2=http://10.88.38.73:2380,node2=http://10.88.38.73:7001"
else
# XXX: There seems an issue using etcdctl with ETCD_INITIAL_ADVERTISE_PEER_URLS so passing
# ETCD_LISTEN_PEER_URLS for now
out=`etcdctl --peers="10.88.38.73:2379,10.88.38.73:4001"
member add node1 "$ETCD_LISTEN_PEER_U
Docker role fails with the following message:
TASK: [docker | start docker tcp socket service] ******************************
failed: [node1] => {"failed": true}
msg: Job for docker-tcp.socket failed. See "systemctl status docker-tcp.socket" and "journalctl -xe" for details.
contivuser@server:~$ sudo systemctl status -ln100 docker-tcp.socket
● docker-tcp.socket - Docker Socket for the API
Loaded: loaded (/etc/systemd/system/docker-tcp.socket; disabled; vendor preset: enabled)
Active: inactive (dead)
Listen: [::]:2385 (Stream)
Jan 11 14:05:42 server systemd[1]: Socket service docker.service already active, refusing.
Jan 11 14:05:42 server systemd[1]: Failed to listen on Docker Socket for the API.
Jan 11 14:15:05 server systemd[1]: Socket service docker.service already active, refusing.
Jan 11 14:15:05 server systemd[1]: Failed to listen on Docker Socket for the API.
Seems to be the same issue as mentioned here: https://github.com/coreos/coreos-vagrant/issues/172
We could try to incorporate the solutions that have been suggested in the mentioned thread.
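The main suggestion in that thread is to reload units and stop docker.service before starting the socket, roughly:

```yaml
# sketch only: ordering matters because an active docker.service
# holds the socket and makes docker-tcp.socket refuse to start
- name: restart docker via tcp socket
  shell: systemctl daemon-reload && systemctl stop docker && systemctl start docker-tcp.socket && systemctl start docker
```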
With golang no longer being installed by the base role, the contiv_network role fails on the "install contivctl" task:
TASK: [contiv_network | install contivctl] ************************************
failed: [node1] => {"changed": true, "cmd": ". /etc/profile.d/00golang.sh && go get github.com/contiv/contivctl", "delta": "0:00:00.001944", "end": "2016-01-11 17:51:11.875943", "rc": 2, "start": "2016-01-11 17:51:11.873999", "warnings": []}
stderr: /bin/sh: 1: .: Can't open /etc/profile.d/00golang.sh
FATAL: all hosts have already failed -- aborting
We could move contivctl to an appropriate role, or install the necessary prerequisites for this to succeed.
The current ansible installs services in /usr/bin. This bug tracks changing their location to the more recommended /usr/local/bin and /usr/local/sbin locations as appropriate.
Right now the extraction task checks whether the binary exists, but it should instead check whether a new tarball has been downloaded.
We should possibly add a few checks that can help fail early. Some of the checks are:
Anything else??
The current ansible uses monitor_interface as the variable naming the linux interface used for all control traffic (e.g. etcd and ceph-mon traffic). It would be better to use a name like control_interface instead.
We have moved all the variable definitions to defaults, but the cleanup playbook still picks up the etcd vars from the vars directory, which causes it to fail.
Running aci_demo_installer fails if docker was previously installed on centos7, either before running aci_demo_installer or when running aci_demo_installer a second time.
I am running the installer as root on a CentOS7 bare metal server. I got the error on the first worker node, but after rerunning aci_demo_installer I got it on the master node as well.
The workaround is to uninstall docker from all nodes (yum erase docker-engine -y) before running aci_demo_installer.
Here is the error message:
TASK: [base | remove older docker and etcd] ***********************************
changed: [node2] => (item={'src': '/usr/bin/docker'})
ok: [node2] => (item={'src': '/usr/bin/etcd'})
ok: [node2] => (item={'src': '/etc/systemd/system/docker.service.d/http-proxy.conf'})
failed: [node2] => (item={'src': '/var/lib/docker'}) => {"failed": true, "item": {"src": "/var/lib/docker"}}
msg: rmtree failed: [Errno 16] Device or resource busy: '/var/lib/docker/devicemapper/mnt/84a9ed0763cdd5650bb142cfea56910f1b6c7a7b36ff0b58abfcc7531d00e60d'
FATAL: all hosts have already failed -- aborting
PLAY RECAP ********************************************************************
to retry, use: --limit @/root/site.retry
node2 : ok=13 changed=11 unreachable=0 failed=1
At the beginning of Ansible playbook run on master node:
TASK [base : ensure custom facts directory exists] *****************************
[DEPRECATION WARNING]: Using bare variables for environment is deprecated. Update your playbooks so that the environment value uses the full variable syntax ('{{foo}}'). This feature will be removed in a future release. Deprecation warnings can be disabled by setting
deprecation_warnings=False in ansible.cfg.
fatal: [node1]: FAILED! => {"failed": true, "msg": "ERROR! environment must be a dictionary, received env (<class 'ansible.parsing.yaml.objects.AnsibleUnicode'>)"}
PLAY RECAP *********************************************************************
node1 : ok=1 changed=0 unreachable=0 failed=1
To correct this, change every "environment: env" in ./ansible/site.yml to "environment: '{{ env }}'", per ansible/ansible#11912.
Here is a proposed patch.
site.txt
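For reference, the change amounts to (the play header is illustrative):

```yaml
# before -- bare variable, rejected by newer ansible:
- hosts: all
  environment: env

# after -- full variable syntax, per ansible/ansible#11912:
- hosts: all
  environment: "{{ env }}"
```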
After doing vagrant up on the test vagrant box, a subsequent vagrant provision fails with the following error:
TASK: [base | upgrade system (redhat)] ****************************************
skipping: [host1]
failed: [host0] => {"changed": true, "failed": true, "rc": 1, "results": ["Loaded plugins: fastestmirror, priorities\nLoading mirror speeds from cached hostfile\n * base: mirror.web-ster.co
m\n * epel: mirror.csclub.uwaterloo.ca\n * extras: linux.mirrors.es.net\n * updates: repos.lax.quadranet.com\nResolving Dependencies\n--> Running transaction check\n---> Package python-babe
l.noarch 0:0.9.6-8.el7 will be updated\n---> Package python-babel.noarch 0:1.3-6.el7 will be an update\n--> Processing Dependency: pytz for package: python-babel-1.3-6.el7.noarch\n---> Pack
age python-requests.noarch 0:2.6.0-1.el7_1 will be updated\n---> Package python-requests.noarch 0:2.7.0-1.el7 will be an update\n---> Package python-urllib3.noarch 0:1.10.2-2.el7_1 will be
updated\n---> Package python-urllib3.noarch 0:1.10.4-1.20150503gita91975b.el7 will be an update\n--> Running transaction check\n---> Package pytz.noarch 0:2012d-5.el7 will be installed\n-->
Finished Dependency Resolution\n\nDependencies Resolved\n\n================================================================================\n Package Arch Version
Repository Size\n================================================================================\nUpdating:\n python-babel noarch 1.3-6.el7 ope
nstack-kilo 2.4 M\n python-requests noarch 2.7.0-1.el7 openstack-kilo 95 k\n python-urllib3 noarch 1.10.4-1.20150503gita91975b.el7 openstack-kilo 113 k\nInstalli
ng for dependencies:\n pytz noarch 2012d-5.el7 base 38 k\n\nTransaction Summary\n===============================================================
=================\nInstall ( 1 Dependent package)\nUpgrade 3 Packages\n\nTotal download size: 2.7 M\nDownloading packages:\nDelta RPMs disabled because /usr/bin/applydeltarpm n
ot installed.\n--------------------------------------------------------------------------------\nTotal 661 kB/s | 2.7 MB 00:04 \nRunning tr
ansaction check\nRunning transaction test\nTransaction test succeeded\nRunning transaction\n Installing : pytz-2012d-5.el7.noarch 1/7 \n Updating :
python-urllib3-1.10.4-1.20150503gita91975b.el7.noarch 2/7 \n Updating : python-requests-2.7.0-1.el7.noarch 3/7 \nerror: unpacking of archive failed on fi
le /usr/lib/python2.7/site-packages/requests/packages/chardet: cpio: rename\n Updating : python-babel-1.3-6.el7.noarch 4/7 \nerror: python-requests-2.7.0-1
.el7.noarch: install failed\n Cleanup : python-urllib3-1.10.2-2.el7_1.noarch 5/7 \nerror: python-requests-2.6.0-1.el7_1.noarch: erase skipped\n Cleanup : pyt
hon-babel-0.9.6-8.el7.noarch 6/7 \n Verifying : python-babel-1.3-6.el7.noarch 1/7 \n Verifying : python-urllib3-1.10.4-1.2015
0503gita91975b.el7.noarch 2/7 \n Verifying : pytz-2012d-5.el7.noarch 3/7 \n Verifying : python-babel-0.9.6-8.el7.noarch
4/7 \n Verifying : python-requests-2.6.0-1.el7_1.noarch 5/7 \n Verifying : python-urllib3-1.10.2-2.el7_1.noarch 6/7 \n Verifying
: python-requests-2.7.0-1.el7.noarch 7/7 \n\nDependency Installed:\n pytz.noarch 0:2012d-5.el7 \n\nUpdated:\
n python-babel.noarch 0:1.3-6.el7 \n python-urllib3.noarch 0:1.10.4-1.20150503gita91975b.el7 \n\nFailed:\n python-requ
ests.noarch 0:2.6.0-1.el7_1 python-requests.noarch 0:2.7.0-1.el7 \n\nComplete!\n"]}
msg: Error unpacking rpm package python-requests-2.7.0-1.el7.noarch
python-requests-2.6.0-1.el7_1.noarch was supposed to be removed but is not!
Running the devtest host-group fails during the ansible install task with the error: Error unpacking rpm package python-crypto-2.6.1-1.el7.centos.x86_64.
When I try yum install ansible on a vm created from the packer box, I don't see this error.
+++++++
build-virtualbox:
build-virtualbox: TASK: [base | install ansible (redhat)] ***************************************
build-virtualbox: failed: [127.0.0.1] => {"changed": true, "rc": 1, "results": ["Loaded plugins: fastestmirror\nLoading mirror speeds from cached hostfile\n * base: mirror.beyondhosting.net\n * epel: fedora-epel.mirror.iweb.com\n * extras: centos.mb
ni.med.umich.edu\n * updates: bay.uchicago.edu\nResolving Dependencies\n--> Running transaction check\n---> Package ansible.noarch 0:1.9.4-1.el7 will be installed\n--> Processing Dependency: sshpass for package: ansible-1.9.4-1.el7.noarch\n--> Processin
g Dependency: python-paramiko for package: ansible-1.9.4-1.el7.noarch\n--> Processing Dependency: python-keyczar for package: ansible-1.9.4-1.el7.noarch\n--> Processing Dependency: python-jinja2 for package: ansible-1.9.4-1.el7.noarch\n--> Processing De
pendency: python-httplib2 for package: ansible-1.9.4-1.el7.noarch\n--> Processing Dependency: PyYAML for package: ansible-1.9.4-1.el7.noarch\n--> Running transaction check\n---> Package PyYAML.x86_64 0:3.10-11.el7 will be installed\n--> Processing Depen
dency: libyaml-0.so.2()(64bit) for package: PyYAML-3.10-11.el7.x86_64\n---> Package python-httplib2.noarch 0:0.7.7-3.el7 will be installed\n---> Package python-jinja2.noarch 0:2.7.2-2.el7 will be installed\n--> Processing Dependency: python-babel >= 0.8
for package: python-jinja2-2.7.2-2.el7.noarch\n--> Processing Dependency: python-markupsafe for package: python-jinja2-2.7.2-2.el7.noarch\n---> Package python-keyczar.noarch 0:0.71c-2.el7 will be installed\n--> Processing Dependency: python-pyasn1 for
package: python-keyczar-0.71c-2.el7.noarch\n--> Processing Dependency: python-crypto for package: python-keyczar-0.71c-2.el7.noarch\n---> Package python-paramiko.noarch 0:1.15.1-1.el7 will be installed\n--> Processing Dependency: python-ecdsa for packag
e: python-paramiko-1.15.1-1.el7.noarch\n---> Package sshpass.x86_64 0:1.05-5.el7 will be installed\n--> Running transaction check\n---> Package libyaml.x86_64 0:0.1.4-11.el7_0 will be installed\n---> Package python-babel.noarch 0:0.9.6-8.el7 will be ins
talled\n---> Package python-crypto.x86_64 0:2.6.1-1.el7.centos will be installed\n---> Package python-ecdsa.noarch 0:0.11-3.el7.centos will be installed\n---> Package python-markupsafe.x86_64 0:0.11-10.el7 will be installed\n---> Package python-pyasn1.n
oarch 0:0.1.6-2.el7 will be installed\n--> Finished Dependency Resolution\n\nDependencies Resolved\n\n================================================================================\n Package Arch Version Reposit
ory Size\n================================================================================\nInstalling:\n ansible noarch 1.9.4-1.el7 epel 1.7 M\nInstalling for dependencies:\n PyYAML x86_64
3.10-11.el7 base 153 k\n libyaml x86_64 0.1.4-11.el7_0 base 55 k\n python-babel noarch 0.9.6-8.el7 base 1.4 M\n python-crypto x86_64 2.6.1-1.e
l7.centos extras 470 k\n python-ecdsa noarch 0.11-3.el7.centos extras 69 k\n python-httplib2 noarch 0.7.7-3.el7 epel 70 k\n python-jinja2 noarch 2.7.2-2.el7
base 515 k\n python-keyczar noarch 0.71c-2.el7 epel 218 k\n python-markupsafe x86_64 0.11-10.el7 base 25 k\n python-paramiko noarch 1.15.1-1.el7 epe
l 999 k\n python-pyasn1 noarch 0.1.6-2.el7 base 91 k\n sshpass x86_64 1.05-5.el7 epel 21 k\n\nTransaction Summary\n====================================================
============================\nInstall 1 Package (+12 Dependent packages)\n\nTotal download size: 5.7 M\nInstalled size: 25 M\nDownloading packages:\n--------------------------------------------------------------------------------\nTotal
754 kB/s | 5.7 MB 00:07 \nRunning transaction check\nRunning transaction test\nTransaction test succeeded\nRunning transaction\n Installing : python-crypto-2.6.1-1.el7.centos.x86_64 1/13 \nerror: u
npacking of archive failed on file /usr/lib64/python2.7/site-packages/pycrypto-2.6.1-py2.7.egg-info: cpio: rename\n Installing : python-ecdsa-0.11-3.el7.centos.noarch 2/13 \nerror: python-crypto-2.6.1-1.el7.centos.x86_64: install
failed\n Installing : python-paramiko-1.15.1-1.el7.noarch 3/13 \n Installing : sshpass-1.05-5.el7.x86_64 4/13 \n Installing : python-babel-0.9.6-8.el7.noarch 5/13
n Installing : python-pyasn1-0.1.6-2.el7.noarch 6/13 \n Installing : python-keyczar-0.71c-2.el7.noarch 7/13 \n Installing : python-httplib2-0.7.7-3.el7.noarch 8/13 \n Inst
alling : python-markupsafe-0.11-10.el7.x86_64 9/13 \n Installing : python-jinja2-2.7.2-2.el7.noarch 10/13 \n Installing : libyaml-0.1.4-11.el7_0.x86_64 11/13 \n Installing
: PyYAML-3.10-11.el7.x86_64 12/13 \n Installing : ansible-1.9.4-1.el7.noarch 13/13 \n Verifying : python-keyczar-0.71c-2.el7.noarch 1/13 \n Verifying : libya
ml-0.1.4-11.el7_0.x86_64 2/13 \n Verifying : python-jinja2-2.7.2-2.el7.noarch 3/13 \n Verifying : python-markupsafe-0.11-10.el7.x86_64 4/13 \n Verifying : python-httpl
ib2-0.7.7-3.el7.noarch 5/13 \n Verifying : python-pyasn1-0.1.6-2.el7.noarch 6/13 \n Verifying : PyYAML-3.10-11.el7.x86_64 7/13 \n Verifying : ansible-1.9.4-1.el7
.noarch 8/13 \n Verifying : python-babel-0.9.6-8.el7.noarch 9/13 \n Verifying : sshpass-1.05-5.el7.x86_64 10/13 \n Verifying : python-ecdsa-0.11-3.el7.ce
ntos.noarch 11/13 \n Verifying : python-paramiko-1.15.1-1.el7.noarch 12/13 \n Verifying : python-crypto-2.6.1-1.el7.centos.x86_64 13/13 \n\nInstalled:\n ansible.noarch 0:1.9.4-1.el7
\n\nDependency Installed:\n PyYAML.x86_64 0:3.10-11.el7 libyaml.x86_64 0:0.1.4-11.el7_0 \n python-babel.noarch 0:0.9.6-8.el7 python-ecdsa.noarch 0:0.11-3.el7.centos\n python-httplib2.
noarch 0:0.7.7-3.el7 python-jinja2.noarch 0:2.7.2-2.el7 \n python-keyczar.noarch 0:0.71c-2.el7 python-markupsafe.x86_64 0:0.11-10.el7 \n python-paramiko.noarch 0:1.15.1-1.el7 python-pyasn1.noarch 0:0.1.6-2.el7 \n sshpass.x86_64 0:1.05-5.el
7 \n\nFailed:\n python-crypto.x86_64 0:2.6.1-1.el7.centos \n\nComplete!\n"]}
build-virtualbox: msg: Error unpacking rpm package python-crypto-2.6.1-1.el7.centos.x86_64
build-virtualbox:
Our ansible had been tested on Ubuntu 15.04 for a while, until we started seeing more failures and support issues due to it not being an LTS release, so we disabled it from sanities in #94.
This bug tracks bringing back Ubuntu testing with 16.04 once it is more readily supported by the repos we use (ansible, docker, etc.).
steps:
In contiv/cluster repo,
make demo
vagrant provision
TASK [docker : start docker tcp socket service] ********************************
fatal: [cluster-node1]: FAILED! => {"changed": true, "cmd": "sudo systemctl daemon-reload && sudo systemctl stop docker && sudo systemctl start docker-tcp.socket && sudo systemctl start docker", "delta": "0:01:32.122085", "end": "2016-02-18 19:24:42.840007", "failed": true, "rc": 1, "start": "2016-02-18 19:23:10.717922", "stderr": "Warning: Stopping docker.service, but it can still be activated by:\n docker-tcp.socket\nJob for docker.service failed because the control process exited with error code. See \"systemctl status docker.service\" and \"journalctl -xe\" for details.", "stdout": "", "stdout_lines": [], "warnings": ["Consider using 'become', 'become_method', and 'become_user' rather than running sudo"]}
logs:
[vagrant@cluster-node1 ~]$ sudo systemctl status docker.service -ln 1000
● docker.service - Docker Application Container Engine
Loaded: loaded (/usr/lib/systemd/system/docker.service; disabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/docker.service.d
└─env.conf
Active: failed (Result: exit-code) since Thu 2016-02-18 19:24:42 UTC; 1min 55s ago
Docs: https://docs.docker.com
Process: 11387 ExecStart=/usr/bin/docker daemon -s overlay -H fd:// --cluster-store=etcd://localhost:2379 (code=exited, status=2)
Main PID: 11387 (code=exited, status=2)
Feb 18 19:23:12 cluster-node1 systemd[1]: Starting Docker Application Container Engine...
Feb 18 19:23:12 cluster-node1 docker[11387]: time="2016-02-18T19:23:12.733787480Z" level=info msg="Graph migration to content-addressability took 0.00 seconds"
Feb 18 19:23:12 cluster-node1 docker[11387]: time="2016-02-18T19:23:12.747448331Z" level=info msg="Firewalld running: false"
Feb 18 19:24:25 cluster-node1 docker[11387]: time="2016-02-18T19:24:25.105628230Z" level=info msg="Default bridge (docker0) is assigned with an IP address 172.17.0.0/16. Daemon option --bip can be used to set a preferred IP address"
Feb 18 19:24:25 cluster-node1 docker[11387]: time="2016-02-18T19:24:25.233232244Z" level=info msg="Loading containers: start."
Feb 18 19:24:42 cluster-node1 systemd[1]: docker.service start operation timed out. Terminating.
Feb 18 19:24:42 cluster-node1 docker[11387]: ..........
Feb 18 19:24:42 cluster-node1 systemd[1]: docker.service: main process exited, code=exited, status=2/INVALIDARGUMENT
Feb 18 19:24:42 cluster-node1 systemd[1]: Failed to start Docker Application Container Engine.
Feb 18 19:24:42 cluster-node1 systemd[1]: Unit docker.service entered failed state.
Feb 18 19:24:42 cluster-node1 systemd[1]: docker.service failed.
Right now the cluster-manager is not passed a config flag as part of its systemd unit setup, which makes it hard to change its configuration and restart it. This issue tracks an ansible change to set up a configuration file for cluster-manager that the user can tweak if needed.
The configuration discussed here includes things like the ansible playbook location, user credentials, etc.
Background:
At some point, when we decided to use ansible for provisioning all our projects, we started off with the ansible playbooks used by our packer builds. At that point the base role was added to download and install multiple binaries; it covered pretty much everything the packer builds were installing, like baked-in go, docker, etcd, ovs and so on.
Fast forward to requirements:
Over time, as we have gained more experience adding ansible plays for our own services, the organization of the base role has drifted from the general organization of other plays like the docker role; the same is true for etcd and ovs. Resolving the above will hopefully give us more clarity on how we can use our playbooks in customer/demo environments and development environments.
Proposal:
To achieve the above two requirements, I propose adding a dev role that takes care of pre-installing the requirements needed for the development environment while reusing the tasks from the underlying service roles. This shall address 2. above.
Also see the associated patch of something in the works in my fork: https://github.com/mapuri/ansible/tree/devrole
This shall give a more concrete idea of what I have in mind. I will create a formal PR once I have it more polished and tested. Feel free to take a look and suggest other cleanups as well that you might have had in mind.
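A sketch of how the dev role could reuse the service roles via role dependencies (the role names are assumed from this repo's layout):

```yaml
# roles/dev/meta/main.yml -- pull in the service roles first
dependencies:
  - { role: docker }
  - { role: etcd }

# roles/dev/tasks/main.yml -- dev-only additions layered on top
- name: install dev packages
  yum: name={{ item }} state=present
  with_items:
    - git
    - gcc
```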
cc @jainvipin @erikh
Setup: a non-vagrant setup on 2 hosts using the net_demo_installer script. The etcd config is out of sync between hosts when I use control interface eth1 on host1 and control interface eth2 on host2, and etcd is unable to communicate. To reproduce the issue, use the config below:
CONNECTION_INFO:
:
control: eth1
data: eth2
:
control: eth2
data: eth1
When scheduler_provider = "ucp-swarm", the following error is seen
TASK: [ucp | create a local fetch directory if it doesn't exist] **************
failed: [cluster-node1 -> 127.0.0.1] => {"failed": true, "parsed": false}
[sudo via ansible, key=pinebvdaownqyquvfpxaengckhfixffn] password:
FATAL: all hosts have already failed -- aborting
As part of the vagrant role we should also add the vagrant user to the docker group. This avoids needing sudo for docker commands on vagrant machines.
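In ansible terms this is a one-liner, a minimal sketch:

```yaml
# vagrant role: let the vagrant user talk to the docker daemon without sudo
- name: add vagrant user to docker group
  user: name=vagrant groups=docker append=yes
```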
this affects host's connectivity otherwise
It works only the first time ansible is run and errors out every subsequent time.
To reproduce: check out the ansible repo, then run vagrant up; vagrant provision
TASK: [docker | start docker tcp socket service] ******************************
failed: [cluster-node1] => {"changed": true, "cmd": "sudo systemctl stop docker && sudo systemctl start docker-tcp.socket && sudo systemctl start docker", "delta": "0:00:00.436279", "end": "2016-02-17 09:32:39.085159", "rc": 1, "start": "2016-02-17 09:32:38.648880", "warnings": []}
stderr: Warning: docker.service changed on disk. Run 'systemctl daemon-reload' to reload units.
Warning: Stopping docker.service, but it can still be activated by:
docker-tcp.socket
Warning: docker.service changed on disk. Run 'systemctl daemon-reload' to reload units.
Job for docker.service failed because the control process exited with error code. See "systemctl status docker.service" and "journalctl -xe" for details.
FATAL: all hosts have already failed -- aborting
This can help improve the playbook by identifying longer-running tasks.
We can use something like this: https://github.com/jlafon/ansible-profile
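Per the jlafon/ansible-profile README, enabling it is a matter of dropping the callback plugin next to the playbook and pointing ansible at it (paths illustrative):

```ini
# ansible.cfg
[defaults]
callback_plugins = ./callback_plugins   ; contains profile_tasks.py from ansible-profile
```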
In the older installation logic for netplugin we used to extract files directly to /usr/bin, but we have now moved to using links instead as part of #52.
This causes a failure if ansible is run on a host configured using the old ansible.
Logs for services launched via systemd units are getting rolled over, and important information might be lost. Some options to alleviate this: increase the log space for all services, or provide an option to increase the log space per service as required.
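For the journald side, one hedged option (the size value is arbitrary):

```yaml
# raise journald's disk budget so service logs are retained longer
- name: increase journald log space
  ini_file: dest=/etc/systemd/journald.conf section=Journal option=SystemMaxUse value=2G

- name: restart journald to pick up the new limit
  command: systemctl restart systemd-journald
```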
moved from contiv/build#22 as ansible repo is the right place for this.
+++++++++
Right now, the way the ceph configuration is generated (as shown in the snippet from ansible/roles/ceph-common/templates/ceph.conf.j2 below) creates a dependency that all mon and osd hosts be configured together in a single playbook run.
In a real cluster, we would need to allow incremental provisioning of new mons and osds. This issue tracks that requirement.
{% for host in groups[mon_group_name] %}
{% if hostvars[host]['ansible_hostname'] is defined %}
[mon.{{ hostvars[host]['ansible_hostname'] }}]
host = {{ hostvars[host]['ansible_hostname'] }}
mon addr = {{ hostvars[host]['ansible_' + monitor_interface]['ipv4']['address'] }}
{% endif %}
{% endfor %}
While running net_demo_installer I saw that the docker version was not updated on the bare metal server.
The docker version on the bare metal was:
ladmin@contiv146:~/src/github.com/contiv/demo/net$ docker version
Client:
Version: 1.9.0-dev
API version: 1.21
Go version: go1.4.2
Git commit: 02ae137
Built: Fri Sep 25 17:37:00 UTC 2015
OS/Arch: linux/amd64
Experimental: true
However, manually updating docker using the docker-provided script did upgrade it to 1.9.1.