contiv / ansible
ansible scripts for contiv cluster
License: Other
The base role installs several development packages such as git, gcc, and perl that should not be present in a production environment unless the service itself needs them to install or run.
The following packages in the base role can most likely move to a dev role:
- ntp
- vim
- curl
- git
- mercurial
- gcc
- perl
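A minimal sketch of the proposed split, reusing the package list above (the dev role does not exist yet; the task layout and conditionals are illustrative):

```yaml
# roles/base/tasks/main.yml -- runtime essentials only
- name: install base packages (redhat)
  yum: name={{ item }} state=present
  with_items:
    - ntp
    - curl
  when: ansible_os_family == "RedHat"

# roles/dev/tasks/main.yml -- development tooling, kept out of production
- name: install dev packages (redhat)
  yum: name={{ item }} state=present
  with_items:
    - vim
    - git
    - mercurial
    - gcc
    - perl
  when: ansible_os_family == "RedHat"
```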
Right now we have a mix of two ways of storing role variables: the vars directory and the defaults directory. It would be nice to make this consistent. Moving everything to defaults seems reasonable, since those variables have the lowest precedence relative to inventory variables and variables passed on the command line, allowing code that depends on them to override them easily.
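As an illustration (the variable name here is hypothetical), a value in defaults/ can be overridden from the inventory or the command line without touching the role:

```yaml
# roles/etcd/defaults/main.yml -- lowest precedence
etcd_client_port: 2379

# override at run time, without editing the role:
#   ansible-playbook site.yml -e "etcd_client_port=4001"
```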
When scheduler_provider = "ucp-swarm", the following timeout is seen on many runs:
TASK [ucp : wait for ucp files to be created, which ensures the service has started] ***
failed: [cluster-node1] => (item=ucp-fingerprint) => {"elapsed": 300, "failed": true, "item": "ucp-fingerprint", "msg": "Timeout when waiting for file /tmp/ucp-fingerprint"}
failed: [cluster-node1] => (item=ucp-instance-id) => {"elapsed": 300, "failed": true, "item": "ucp-instance-id", "msg": "Timeout when waiting for file /tmp/ucp-instance-id"}
PLAY RECAP *********************************************************************
cluster-node1 : ok=132 changed=22 unreachable=0 failed=1
For an HA cluster, we need to set the --replica flag on UCP master nodes and the --replication flag on swarm master nodes.
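A rough sketch of what those tasks could look like (the flags match the UCP and swarm CLIs of that era; the variable and group names are assumptions):

```yaml
# join additional UCP controllers as replicas on non-first masters
- name: join ucp controller as replica
  command: docker run --rm docker/ucp join --replica --url {{ ucp_master_url }}
  when: inventory_hostname != groups['service-master'][0]

# swarm managers need --replication (plus --advertise) for HA
- name: start swarm manager in replication mode
  command: >
    docker run -d swarm manage --replication
    --advertise {{ node_addr }}:4000 etcd://{{ etcd_peers_list }}
```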
This tracks the comment in #97 (comment) to make the UCP logic behave identically to the swarm logic done in #97.
Workaround: install bzip2 manually in the worker nodes
TASK: [contiv_network | install netmaster and netplugin] **********************
failed: [node2] => {"changed": true, "cmd": "tar vxjf /tmp/contivnet.tar.bz2", "delta": "0:00:00.007268", "end": "2016-01-07 08:08:07.476339", "rc": 2, "start": "2016-01-07 08:08:07.469071", "warnings": ["Consider using unarchive module rather than running tar"]}
stderr: tar (child): bzip2: Cannot exec: No such file or directory
tar (child): Error is not recoverable: exiting now
tar: Child returned status 2
tar: Error is not recoverable: exiting now
FATAL: all hosts have already failed -- aborting
PLAY RECAP ********************************************************************
to retry, use: --limit @/root/site.retry
node2 : ok=43 changed=24 unreachable=0 failed=1
Currently when more than one node is provisioned with swarm running as master, swarm doesn't work as expected. We need a mechanism to run it as master on only one node.
Currently, we are seeing this issue when netplugin-node hostgroup is used.
We have encountered many issues when re-running the playbook on a node that has an older version installed or whose environment variables have changed. We need a mechanism to detect when the services running on a node are of an older version or configuration, and to reinstall or restart them as necessary.
After Docker and etcd installation, we got the following error:
TASK [base : stop etcd] ********************************************************
fatal: [node1]: FAILED! => {"changed": false, "failed": true, "msg": "systemd could not find the requested service 'etcd'"}
And we found that there is no service file in /etc/systemd/ or /lib/systemd.
Here is a proposed patch.
main.txt
apt-get update errors out on some hosts. The workaround is to clean up /var/lib/apt/lists.
We need to incorporate this into the ansible scripts so the cleanup is performed when the error occurs.
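A sketch of how the cleanup could be wired into the play (the modules are standard; the failure detection here is deliberately simplistic):

```yaml
- name: update apt cache
  apt: update_cache=yes
  register: apt_update_result
  ignore_errors: yes

# workaround from above: drop stale lists and retry once
- name: clean stale apt lists and retry
  shell: rm -rf /var/lib/apt/lists/* && apt-get update
  when: apt_update_result|failed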
There is a bug with using overlayfs in docker, so we need to revert to using devicemapper for now until this is resolved:
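One way to sketch this, assuming the storage driver is templated into the daemon invocation (the variable name is illustrative):

```yaml
# roles/docker/defaults/main.yml
docker_storage_driver: devicemapper

# referenced from the docker systemd unit template, e.g.:
#   ExecStart=/usr/bin/docker daemon -s {{ docker_storage_driver }} ...
```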
Where possible, we should organize the tasks as proposed in #23 (comment).
This tracks an issue recently noticed in volplugin where multiple iptables entries were installed for the mDNS ports. The rule is installed as part of serf's service setup here: https://github.com/contiv/ansible/blob/master/roles/serf/templates/serf.j2#L17
Chain INPUT (policy ACCEPT)
target prot opt source destination
ACCEPT udp -- 0.0.0.0/0 0.0.0.0/0 udp spt:5353
ACCEPT udp -- 0.0.0.0/0 0.0.0.0/0 udp dpt:5353
ACCEPT udp -- 0.0.0.0/0 0.0.0.0/0 udp spt:5353
ACCEPT udp -- 0.0.0.0/0 0.0.0.0/0 udp dpt:5353
... (the same spt:5353/dpt:5353 rule pair repeated many more times)
The docker role should install a specific version of docker that we test our services against.
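A hedged sketch of pinning the version (the version value and package naming are illustrative, not the actual tested version):

```yaml
# roles/docker/defaults/main.yml
docker_version: "1.10.3"   # hypothetical pinned version

# roles/docker/tasks/main.yml
- name: install pinned docker (redhat)
  yum: name=docker-engine-{{ docker_version }} state=present
  when: ansible_os_family == "RedHat"

- name: install pinned docker (debian)
  apt: name=docker-engine={{ docker_version }}-0~{{ ansible_distribution_release }} state=present
  when: ansible_os_family == "Debian"
```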
We need to install appropriate rules to allow etcd traffic, something along the lines of https://github.com/kubernetes/contrib/blob/master/ansible/roles/etcd/tasks/main.yml#L39-L43
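Something like the following could mirror the linked tasks (the ports are etcd's defaults; a real version should first check whether the rule already exists to stay idempotent):

```yaml
- name: open etcd client and peer ports
  command: /sbin/iptables -I INPUT -p tcp --dport {{ item }} -j ACCEPT -m comment --comment "etcd"
  with_items:
    - 2379   # client traffic
    - 2380   # peer traffic
```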
We need to install and validate the CS version of Docker Engine as described here:
https://docs.docker.com/docker-trusted-registry/install/install-csengine/
The "check docker service state" task should set ignore_errors to true; otherwise ansible can incorrectly fail when the docker service is not running.
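That is, a sketch like:

```yaml
- name: check docker service state
  command: systemctl status docker
  register: docker_service_state
  ignore_errors: true
```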
I am seeing this issue when I run the net_demo_installer script. The cfg.yml contains the following:
CONNECTION_INFO:
172.29.205.249:
**control: p1p2**
data: eth2
Ansible output:
TASK: [docker | copy systemd units for docker tcp socket settings] ************
ok: [node1]
TASK: [docker | start docker tcp socket service] ******************************
failed: [node1] => {"changed": true, "cmd": "sudo systemctl stop docker && sudo systemctl start docker-tcp.socket && sudo systemctl start docker", "delta": "0:00:00.164433", "end": "2016-02-03 03:51:51.044325", "rc": 1, "start": "2016-02-03 03:51:50.879892", "warnings": []}
stderr: Warning: docker.service changed on disk. Run 'systemctl daemon-reload' to reload units.
Warning: docker.service changed on disk. Run 'systemctl daemon-reload' to reload units.
Job for docker.service failed. See "systemctl status docker.service" and "journalctl -xe" for details.
FATAL: all hosts have already failed -- aborting
journalctl -xe output:
-- Unit serf.service has begun starting up.
Feb 03 04:07:41 contiv146 serf.sh[13880]: setting up iptables for mdns
Feb 03 04:07:41 contiv146 serf.sh[13880]: starting serf
Feb 03 04:07:43 contiv146 serf.sh[13880]: eth1 is not assigned a valid addr: bin boot dev etc home initrd.img lib lib64 lost+found media mnt opt proc root run sbin srv sys tmp usr var vmlinuz
Feb 03 04:07:43 contiv146 systemd[1]: serf.service: main process exited, code=exited, status=1/FAILURE
Feb 03 04:07:43 contiv146 systemd[1]: Unit serf.service entered failed state.
Feb 03 04:07:43 contiv146 systemd[1]: serf.service failed.
Feb 03 04:07:49 contiv146 /usr/sbin/irqbalance[1321]: irq 56 affinity_hint subset empty
Feb 03 04:07:53 contiv146 systemd[1]: serf.service holdoff time over, scheduling restart.
Feb 03 04:07:53 contiv146 systemd[1]: Started Serf.
-- Subject: Unit serf.service has finished start-up
-- Defined-By: systemd
When pulling the aci-gw container, we should check for a particular version to pull.
The roles and groups defined so far include the swarm scheduler. What would be the best way to contribute a k8s scheduler to this repo? A few options I can think of:
My preference would be the first one (different host-groups), because there may be more dependent top-level roles that some of the schedulers (e.g. mesos) might require.
This issue tracks setting the vtep-ip flag for netplugin when setting up the contiv_network role.
This issue tracks identification and development of cleanup tasks that need to be executed when a node is decommissioned. A few tasks that I have in my mind are (not in any specific order of execution):
This issue is seen for tasks where we download binary releases and install them on the server nodes.
Example scenario:
Netplugin/netmaster binaries are downloaded from released version and installed on server-nodes
If we were to re-run the ansible playbook after running cleanup.yml playbook and then site.yml with a new release version, the new version is not downloaded and installed
Workaround:
Remove /tmp/contivnet.tar.bz2
Remove the associated binaries from /usr/bin/{netplugin,netmaster,netctl,contivk8s}
Rerun ansible-playbook site.yml
This issue would be seen for any such binaries/released version.
After rebooting a cluster of Contiv VMs, the netmaster and etcd services wouldn't come back up.
The reboot occurred after increasing each VM's core count from 1 to 2 and its memory from 1G to 8G.
I attempted to restart contiv with the startPlugin.py script, and afterwards attempted to reinstall the whole package with net_demo_installer. After running the script, the etcd and netmaster services both reported as failing. Manually starting the etcd service on both the primary and secondary node failed.
Citing the code segment. This code is from the /usr/bin/etcd.sh shell script that the ansible installer calls on restart:
if [ ! -f /var/tmp/etcd.existing ]; then
touch /var/tmp/etcd.existing
export ETCD_INITIAL_CLUSTER_STATE=new
export ETCD_INITIAL_CLUSTER="node1=http://10.88.38.75:2380,node1=http://10.88.38.75:7001,node2=http://10.88.38.73:2380,node2=http://10.88.38.73:7001"
else
# XXX: There seems an issue using etcdctl with ETCD_INITIAL_ADVERTISE_PEER_URLS so passing
# ETCD_LISTEN_PEER_URLS for now
out=`etcdctl --peers="10.88.38.73:2379,10.88.38.73:4001"
member add node1 "$ETCD_LISTEN_PEER_U
Docker role fails with the following message:
TASK: [docker | start docker tcp socket service] ******************************
failed: [node1] => {"failed": true}
msg: Job for docker-tcp.socket failed. See "systemctl status docker-tcp.socket" and "journalctl -xe" for details.
contivuser@server:~$ sudo systemctl status -ln100 docker-tcp.socket
● docker-tcp.socket - Docker Socket for the API
Loaded: loaded (/etc/systemd/system/docker-tcp.socket; disabled; vendor preset: enabled)
Active: inactive (dead)
Listen: [::]:2385 (Stream)
Jan 11 14:05:42 server systemd[1]: Socket service docker.service already active, refusing.
Jan 11 14:05:42 server systemd[1]: Failed to listen on Docker Socket for the API.
Jan 11 14:15:05 server systemd[1]: Socket service docker.service already active, refusing.
Jan 11 14:15:05 server systemd[1]: Failed to listen on Docker Socket for the API.
Seems to be the same issue as mentioned here: https://github.com/coreos/coreos-vagrant/issues/172
We could try to incorporate the solutions that have been suggested in the mentioned thread.
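The main suggestion in that thread is to reload units and stop docker.service before starting the socket, roughly:

```yaml
# sketch only: ordering matters because an active docker.service
# holds the socket and makes docker-tcp.socket refuse to start
- name: restart docker via tcp socket
  shell: systemctl daemon-reload && systemctl stop docker && systemctl start docker-tcp.socket && systemctl start docker
```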
With golang no longer being installed by the base role, the contiv_network role fails on the "install contivctl" task:
TASK: [contiv_network | install contivctl] ************************************
failed: [node1] => {"changed": true, "cmd": ". /etc/profile.d/00golang.sh && go get github.com/contiv/contivctl", "delta": "0:00:00.001944", "end": "2016-01-11 17:51:11.875943", "rc": 2, "start": "2016-01-11 17:51:11.873999", "warnings": []}
stderr: /bin/sh: 1: .: Can't open /etc/profile.d/00golang.sh
FATAL: all hosts have already failed -- aborting
We could move contivctl to an appropriate role, or install the necessary prerequisites for this to succeed.
The current ansible installs services in /usr/bin. This bug tracks changing their location to the more recommended /usr/local/bin and /usr/local/sbin locations as appropriate.
Right now the extraction task checks whether the binary exists, but it should instead check whether a new tarball has been downloaded.
We should possibly add a few checks that can help fail early. Some of the checks are:
Anything else??
The current ansible uses monitor_interface as the variable naming the linux interface used for all control traffic (e.g. etcd and ceph-mon traffic). It would be better to use a name like control_interface instead.
We have moved all the variable definitions to defaults, but the cleanup playbook still picks up the etcd vars from the vars directory, which causes it to fail.
Running aci_demo_installer fails if docker was previously installed on centos7, either before running aci_demo_installer or when running aci_demo_installer a second time.
I am running the installer as root on a CentOS7 bare metal server. I got the error on the first worker node, but after rerunning aci_demo_installer I got it on the master node as well.
The workaround is to uninstall docker from all nodes (yum erase docker-engine -y) before running aci_demo_installer.
Here is the error message:
TASK: [base | remove older docker and etcd] ***********************************
changed: [node2] => (item={'src': '/usr/bin/docker'})
ok: [node2] => (item={'src': '/usr/bin/etcd'})
ok: [node2] => (item={'src': '/etc/systemd/system/docker.service.d/http-proxy.conf'})
failed: [node2] => (item={'src': '/var/lib/docker'}) => {"failed": true, "item": {"src": "/var/lib/docker"}}
msg: rmtree failed: [Errno 16] Device or resource busy: '/var/lib/docker/devicemapper/mnt/84a9ed0763cdd5650bb142cfea56910f1b6c7a7b36ff0b58abfcc7531d00e60d'
FATAL: all hosts have already failed -- aborting
PLAY RECAP ********************************************************************
to retry, use: --limit @/root/site.retry
node2 : ok=13 changed=11 unreachable=0 failed=1
At the beginning of Ansible playbook run on master node:
TASK [base : ensure custom facts directory exists] *****************************
[DEPRECATION WARNING]: Using bare variables for environment is deprecated. Update your playbooks so that the environment value uses the full variable syntax ('{{foo}}'). This feature will be removed in a future release. Deprecation warnings can be disabled by setting
deprecation_warnings=False in ansible.cfg.
fatal: [node1]: FAILED! => {"failed": true, "msg": "ERROR! environment must be a dictionary, received env (<class 'ansible.parsing.yaml.objects.AnsibleUnicode'>)"}
PLAY RECAP *********************************************************************
node1 : ok=1 changed=0 unreachable=0 failed=1
To correct this, change every "environment: env" in ./ansible/site.yml to "environment: '{{ env }}'", per ansible/ansible#11912.
Here is a proposed patch.
site.txt
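For reference, the change amounts to (the play header is illustrative):

```yaml
# before -- bare variable, rejected by newer ansible:
- hosts: all
  environment: env

# after -- full variable syntax, per ansible/ansible#11912:
- hosts: all
  environment: "{{ env }}"
```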
After doing vagrant up on the test vagrant box, a subsequent vagrant provision fails with the following error:
TASK: [base | upgrade system (redhat)] ****************************************
skipping: [host1]
failed: [host0] => {"changed": true, "failed": true, "rc": 1, "results": ["Loaded plugins: fastestmirror, priorities\nLoading mirror speeds from cached hostfile\n * base: mirror.web-ster.co
m\n * epel: mirror.csclub.uwaterloo.ca\n * extras: linux.mirrors.es.net\n * updates: repos.lax.quadranet.com\nResolving Dependencies\n--> Running transaction check\n---> Package python-babe
l.noarch 0:0.9.6-8.el7 will be updated\n---> Package python-babel.noarch 0:1.3-6.el7 will be an update\n--> Processing Dependency: pytz for package: python-babel-1.3-6.el7.noarch\n---> Pack
age python-requests.noarch 0:2.6.0-1.el7_1 will be updated\n---> Package python-requests.noarch 0:2.7.0-1.el7 will be an update\n---> Package python-urllib3.noarch 0:1.10.2-2.el7_1 will be
updated\n---> Package python-urllib3.noarch 0:1.10.4-1.20150503gita91975b.el7 will be an update\n--> Running transaction check\n---> Package pytz.noarch 0:2012d-5.el7 will be installed\n-->
Finished Dependency Resolution\n\nDependencies Resolved\n\n================================================================================\n Package Arch Version
Repository Size\n================================================================================\nUpdating:\n python-babel noarch 1.3-6.el7 ope
nstack-kilo 2.4 M\n python-requests noarch 2.7.0-1.el7 openstack-kilo 95 k\n python-urllib3 noarch 1.10.4-1.20150503gita91975b.el7 openstack-kilo 113 k\nInstalli
ng for dependencies:\n pytz noarch 2012d-5.el7 base 38 k\n\nTransaction Summary\n===============================================================
=================\nInstall ( 1 Dependent package)\nUpgrade 3 Packages\n\nTotal download size: 2.7 M\nDownloading packages:\nDelta RPMs disabled because /usr/bin/applydeltarpm n
ot installed.\n--------------------------------------------------------------------------------\nTotal 661 kB/s | 2.7 MB 00:04 \nRunning tr
ansaction check\nRunning transaction test\nTransaction test succeeded\nRunning transaction\n Installing : pytz-2012d-5.el7.noarch 1/7 \n Updating :
python-urllib3-1.10.4-1.20150503gita91975b.el7.noarch 2/7 \n Updating : python-requests-2.7.0-1.el7.noarch 3/7 \nerror: unpacking of archive failed on fi
le /usr/lib/python2.7/site-packages/requests/packages/chardet: cpio: rename\n Updating : python-babel-1.3-6.el7.noarch 4/7 \nerror: python-requests-2.7.0-1
.el7.noarch: install failed\n Cleanup : python-urllib3-1.10.2-2.el7_1.noarch 5/7 \nerror: python-requests-2.6.0-1.el7_1.noarch: erase skipped\n Cleanup : pyt
hon-babel-0.9.6-8.el7.noarch 6/7 \n Verifying : python-babel-1.3-6.el7.noarch 1/7 \n Verifying : python-urllib3-1.10.4-1.2015
0503gita91975b.el7.noarch 2/7 \n Verifying : pytz-2012d-5.el7.noarch 3/7 \n Verifying : python-babel-0.9.6-8.el7.noarch
4/7 \n Verifying : python-requests-2.6.0-1.el7_1.noarch 5/7 \n Verifying : python-urllib3-1.10.2-2.el7_1.noarch 6/7 \n Verifying
: python-requests-2.7.0-1.el7.noarch 7/7 \n\nDependency Installed:\n pytz.noarch 0:2012d-5.el7 \n\nUpdated:\
n python-babel.noarch 0:1.3-6.el7 \n python-urllib3.noarch 0:1.10.4-1.20150503gita91975b.el7 \n\nFailed:\n python-requ
ests.noarch 0:2.6.0-1.el7_1 python-requests.noarch 0:2.7.0-1.el7 \n\nComplete!\n"]}
msg: Error unpacking rpm package python-requests-2.7.0-1.el7.noarch
python-requests-2.6.0-1.el7_1.noarch was supposed to be removed but is not!
Running the devtest host-group fails during the ansible install task with the error: Error unpacking rpm package python-crypto-2.6.1-1.el7.centos.x86_64.
When I try yum install ansible on a vm created from the packer box, I don't see this error.
+++++++
build-virtualbox:
build-virtualbox: TASK: [base | install ansible (redhat)] ***************************************
build-virtualbox: failed: [127.0.0.1] => {"changed": true, "rc": 1, "results": ["Loaded plugins: fastestmirror\nLoading mirror speeds from cached hostfile\n * base: mirror.beyondhosting.net\n * epel: fedora-epel.mirror.iweb.com\n * extras: centos.mb
ni.med.umich.edu\n * updates: bay.uchicago.edu\nResolving Dependencies\n--> Running transaction check\n---> Package ansible.noarch 0:1.9.4-1.el7 will be installed\n--> Processing Dependency: sshpass for package: ansible-1.9.4-1.el7.noarch\n--> Processin
g Dependency: python-paramiko for package: ansible-1.9.4-1.el7.noarch\n--> Processing Dependency: python-keyczar for package: ansible-1.9.4-1.el7.noarch\n--> Processing Dependency: python-jinja2 for package: ansible-1.9.4-1.el7.noarch\n--> Processing De
pendency: python-httplib2 for package: ansible-1.9.4-1.el7.noarch\n--> Processing Dependency: PyYAML for package: ansible-1.9.4-1.el7.noarch\n--> Running transaction check\n---> Package PyYAML.x86_64 0:3.10-11.el7 will be installed\n--> Processing Depen
dency: libyaml-0.so.2()(64bit) for package: PyYAML-3.10-11.el7.x86_64\n---> Package python-httplib2.noarch 0:0.7.7-3.el7 will be installed\n---> Package python-jinja2.noarch 0:2.7.2-2.el7 will be installed\n--> Processing Dependency: python-babel >= 0.8
for package: python-jinja2-2.7.2-2.el7.noarch\n--> Processing Dependency: python-markupsafe for package: python-jinja2-2.7.2-2.el7.noarch\n---> Package python-keyczar.noarch 0:0.71c-2.el7 will be installed\n--> Processing Dependency: python-pyasn1 for
package: python-keyczar-0.71c-2.el7.noarch\n--> Processing Dependency: python-crypto for package: python-keyczar-0.71c-2.el7.noarch\n---> Package python-paramiko.noarch 0:1.15.1-1.el7 will be installed\n--> Processing Dependency: python-ecdsa for packag
e: python-paramiko-1.15.1-1.el7.noarch\n---> Package sshpass.x86_64 0:1.05-5.el7 will be installed\n--> Running transaction check\n---> Package libyaml.x86_64 0:0.1.4-11.el7_0 will be installed\n---> Package python-babel.noarch 0:0.9.6-8.el7 will be ins
talled\n---> Package python-crypto.x86_64 0:2.6.1-1.el7.centos will be installed\n---> Package python-ecdsa.noarch 0:0.11-3.el7.centos will be installed\n---> Package python-markupsafe.x86_64 0:0.11-10.el7 will be installed\n---> Package python-pyasn1.n
oarch 0:0.1.6-2.el7 will be installed\n--> Finished Dependency Resolution\n\nDependencies Resolved\n\n================================================================================\n Package Arch Version Reposit
ory Size\n================================================================================\nInstalling:\n ansible noarch 1.9.4-1.el7 epel 1.7 M\nInstalling for dependencies:\n PyYAML x86_64
3.10-11.el7 base 153 k\n libyaml x86_64 0.1.4-11.el7_0 base 55 k\n python-babel noarch 0.9.6-8.el7 base 1.4 M\n python-crypto x86_64 2.6.1-1.e
l7.centos extras 470 k\n python-ecdsa noarch 0.11-3.el7.centos extras 69 k\n python-httplib2 noarch 0.7.7-3.el7 epel 70 k\n python-jinja2 noarch 2.7.2-2.el7
base 515 k\n python-keyczar noarch 0.71c-2.el7 epel 218 k\n python-markupsafe x86_64 0.11-10.el7 base 25 k\n python-paramiko noarch 1.15.1-1.el7 epe
l 999 k\n python-pyasn1 noarch 0.1.6-2.el7 base 91 k\n sshpass x86_64 1.05-5.el7 epel 21 k\n\nTransaction Summary\n====================================================
============================\nInstall 1 Package (+12 Dependent packages)\n\nTotal download size: 5.7 M\nInstalled size: 25 M\nDownloading packages:\n--------------------------------------------------------------------------------\nTotal
754 kB/s | 5.7 MB 00:07 \nRunning transaction check\nRunning transaction test\nTransaction test succeeded\nRunning transaction\n Installing : python-crypto-2.6.1-1.el7.centos.x86_64 1/13 \nerror: u
npacking of archive failed on file /usr/lib64/python2.7/site-packages/pycrypto-2.6.1-py2.7.egg-info: cpio: rename\n Installing : python-ecdsa-0.11-3.el7.centos.noarch 2/13 \nerror: python-crypto-2.6.1-1.el7.centos.x86_64: install
failed\n Installing : python-paramiko-1.15.1-1.el7.noarch 3/13 \n Installing : sshpass-1.05-5.el7.x86_64 4/13 \n Installing : python-babel-0.9.6-8.el7.noarch 5/13
n Installing : python-pyasn1-0.1.6-2.el7.noarch 6/13 \n Installing : python-keyczar-0.71c-2.el7.noarch 7/13 \n Installing : python-httplib2-0.7.7-3.el7.noarch 8/13 \n Inst
alling : python-markupsafe-0.11-10.el7.x86_64 9/13 \n Installing : python-jinja2-2.7.2-2.el7.noarch 10/13 \n Installing : libyaml-0.1.4-11.el7_0.x86_64 11/13 \n Installing
: PyYAML-3.10-11.el7.x86_64 12/13 \n Installing : ansible-1.9.4-1.el7.noarch 13/13 \n Verifying : python-keyczar-0.71c-2.el7.noarch 1/13 \n Verifying : libya
ml-0.1.4-11.el7_0.x86_64 2/13 \n Verifying : python-jinja2-2.7.2-2.el7.noarch 3/13 \n Verifying : python-markupsafe-0.11-10.el7.x86_64 4/13 \n Verifying : python-httpl
ib2-0.7.7-3.el7.noarch 5/13 \n Verifying : python-pyasn1-0.1.6-2.el7.noarch 6/13 \n Verifying : PyYAML-3.10-11.el7.x86_64 7/13 \n Verifying : ansible-1.9.4-1.el7
.noarch 8/13 \n Verifying : python-babel-0.9.6-8.el7.noarch 9/13 \n Verifying : sshpass-1.05-5.el7.x86_64 10/13 \n Verifying : python-ecdsa-0.11-3.el7.ce
ntos.noarch 11/13 \n Verifying : python-paramiko-1.15.1-1.el7.noarch 12/13 \n Verifying : python-crypto-2.6.1-1.el7.centos.x86_64 13/13 \n\nInstalled:\n ansible.noarch 0:1.9.4-1.el7
\n\nDependency Installed:\n PyYAML.x86_64 0:3.10-11.el7 libyaml.x86_64 0:0.1.4-11.el7_0 \n python-babel.noarch 0:0.9.6-8.el7 python-ecdsa.noarch 0:0.11-3.el7.centos\n python-httplib2.
noarch 0:0.7.7-3.el7 python-jinja2.noarch 0:2.7.2-2.el7 \n python-keyczar.noarch 0:0.71c-2.el7 python-markupsafe.x86_64 0:0.11-10.el7 \n python-paramiko.noarch 0:1.15.1-1.el7 python-pyasn1.noarch 0:0.1.6-2.el7 \n sshpass.x86_64 0:1.05-5.el
7 \n\nFailed:\n python-crypto.x86_64 0:2.6.1-1.el7.centos \n\nComplete!\n"]}
build-virtualbox: msg: Error unpacking rpm package python-crypto-2.6.1-1.el7.centos.x86_64
build-virtualbox:
Our ansible had been tested on Ubuntu 15.04 for a while, until we started seeing more failures and support issues due to it not being an LTS release, so we disabled it from sanities in #94.
This bug tracks bringing back Ubuntu testing with 16.04 once it is more readily supported by the repos we use (ansible, docker, etc.).
steps:
In contiv/cluster repo,
make demo
vagrant provision
TASK [docker : start docker tcp socket service] ********************************
fatal: [cluster-node1]: FAILED! => {"changed": true, "cmd": "sudo systemctl daemon-reload && sudo systemctl stop docker && sudo systemctl start docker-tcp.socket && sudo systemctl start docker", "delta": "0:01:32.122085", "end": "2016-02-18 19:24:42.840007", "failed": true, "rc": 1, "start": "2016-02-18 19:23:10.717922", "stderr": "Warning: Stopping docker.service, but it can still be activated by:\n docker-tcp.socket\nJob for docker.service failed because the control process exited with error code. See \"systemctl status docker.service\" and \"journalctl -xe\" for details.", "stdout": "", "stdout_lines": [], "warnings": ["Consider using 'become', 'become_method', and 'become_user' rather than running sudo"]}
logs:
[vagrant@cluster-node1 ~]$ sudo systemctl status docker.service -ln 1000
● docker.service - Docker Application Container Engine
Loaded: loaded (/usr/lib/systemd/system/docker.service; disabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/docker.service.d
└─env.conf
Active: failed (Result: exit-code) since Thu 2016-02-18 19:24:42 UTC; 1min 55s ago
Docs: https://docs.docker.com
Process: 11387 ExecStart=/usr/bin/docker daemon -s overlay -H fd:// --cluster-store=etcd://localhost:2379 (code=exited, status=2)
Main PID: 11387 (code=exited, status=2)
Feb 18 19:23:12 cluster-node1 systemd[1]: Starting Docker Application Container Engine...
Feb 18 19:23:12 cluster-node1 docker[11387]: time="2016-02-18T19:23:12.733787480Z" level=info msg="Graph migration to content-addressability took 0.00 seconds"
Feb 18 19:23:12 cluster-node1 docker[11387]: time="2016-02-18T19:23:12.747448331Z" level=info msg="Firewalld running: false"
Feb 18 19:24:25 cluster-node1 docker[11387]: time="2016-02-18T19:24:25.105628230Z" level=info msg="Default bridge (docker0) is assigned with an IP address 172.17.0.0/16. Daemon option --bip can be used to set a preferred IP address"
Feb 18 19:24:25 cluster-node1 docker[11387]: time="2016-02-18T19:24:25.233232244Z" level=info msg="Loading containers: start."
Feb 18 19:24:42 cluster-node1 systemd[1]: docker.service start operation timed out. Terminating.
Feb 18 19:24:42 cluster-node1 docker[11387]: ..........
Feb 18 19:24:42 cluster-node1 systemd[1]: docker.service: main process exited, code=exited, status=2/INVALIDARGUMENT
Feb 18 19:24:42 cluster-node1 systemd[1]: Failed to start Docker Application Container Engine.
Feb 18 19:24:42 cluster-node1 systemd[1]: Unit docker.service entered failed state.
Feb 18 19:24:42 cluster-node1 systemd[1]: docker.service failed.
Right now the cluster-manager is not passed a config flag as part of its systemd unit setup, which makes it hard to change its configuration and restart it. This issue tracks an ansible change to set up a configuration file for cluster-manager that the user can tweak if needed.
The configuration discussed here includes things like the ansible playbook location, user credentials, etc.
Background:
At some point, when we decided to use ansible for provisioning all our projects, we started off with the ansible playbooks used by our packer builds. At that point the base role was added to download and install multiple binaries; it covered pretty much everything the packer builds were installing, like baked-in go, docker, etcd, ovs and so on.
Fast forward to requirements:
Over time, as we have gained more experience adding ansible plays for our own services, the organization of the base role has drifted from the general organization of other plays like the docker role; the same is true for etcd and ovs. Resolving the above will hopefully give us more clarity on how we can use our playbooks in customer/demo environments and development environments.
Proposal:
To achieve the above two requirements, I propose adding a dev role that takes care of pre-installing the requirements needed for the development environment while reusing the tasks from the underlying service roles. This shall address 2. above.
Also see the associated patch of something in the works in my fork: https://github.com/mapuri/ansible/tree/devrole
This shall give a more concrete idea of what I have in mind. I will create a formal PR once I have it more polished and tested. Feel free to take a look and suggest other cleanups as well that you might have had in mind.
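A sketch of how the dev role could reuse the service roles via role dependencies (the role names are assumed from this repo's layout):

```yaml
# roles/dev/meta/main.yml -- pull in the service roles first
dependencies:
  - { role: docker }
  - { role: etcd }

# roles/dev/tasks/main.yml -- dev-only additions layered on top
- name: install dev packages
  yum: name={{ item }} state=present
  with_items:
    - git
    - gcc
```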
cc @jainvipin @erikh
Setup: a non-vagrant setup on 2 hosts using the net_demo_installer script. The etcd config is out of sync between hosts when I use control interface eth1 on host1 and control interface eth2 on host2, and etcd is unable to communicate. To reproduce the issue, use the config below:
CONNECTION_INFO:
:
control: eth1
data: eth2
:
control: eth2
data: eth1
When scheduler_provider = "ucp-swarm", the following error is seen
TASK: [ucp | create a local fetch directory if it doesn't exist] **************
failed: [cluster-node1 -> 127.0.0.1] => {"failed": true, "parsed": false}
[sudo via ansible, key=pinebvdaownqyquvfpxaengckhfixffn] password:
FATAL: all hosts have already failed -- aborting
As part of the vagrant role we should also add the vagrant user to the docker group. This avoids needing sudo for docker commands on vagrant machines.
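In ansible terms this is a one-liner, a minimal sketch:

```yaml
# vagrant role: let the vagrant user talk to the docker daemon without sudo
- name: add vagrant user to docker group
  user: name=vagrant groups=docker append=yes
```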
this affects host's connectivity otherwise
It works only the first time ansible is run and errors out every subsequent time.
To reproduce: check out the ansible repo, then run vagrant up; vagrant provision
TASK: [docker | start docker tcp socket service] ******************************
failed: [cluster-node1] => {"changed": true, "cmd": "sudo systemctl stop docker && sudo systemctl start docker-tcp.socket && sudo systemctl start docker", "delta": "0:00:00.436279", "end": "2016-02-17 09:32:39.085159", "rc": 1, "start": "2016-02-17 09:32:38.648880", "warnings": []}
stderr: Warning: docker.service changed on disk. Run 'systemctl daemon-reload' to reload units.
Warning: Stopping docker.service, but it can still be activated by:
docker-tcp.socket
Warning: docker.service changed on disk. Run 'systemctl daemon-reload' to reload units.
Job for docker.service failed because the control process exited with error code. See "systemctl status docker.service" and "journalctl -xe" for details.
FATAL: all hosts have already failed -- aborting
This can help improve the playbook by identifying longer-running tasks.
We can use something like this: https://github.com/jlafon/ansible-profile
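Per the jlafon/ansible-profile README, enabling it is a matter of dropping the callback plugin next to the playbook and pointing ansible at it (paths illustrative):

```ini
# ansible.cfg
[defaults]
callback_plugins = ./callback_plugins   ; contains profile_tasks.py from ansible-profile
```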
In the older installation logic for netplugin we used to extract files directly to /usr/bin, but we have now moved to using links instead as part of #52.
This causes a failure if ansible is run on a host configured using the old ansible.
Logs for services launched via systemd units are getting rolled over, and important information might be lost. Some options to alleviate this: increase the log space for all services, or provide an option to increase the log space per service as required.
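For the journald side, one hedged option (the size value is arbitrary):

```yaml
# raise journald's disk budget so service logs are retained longer
- name: increase journald log space
  ini_file: dest=/etc/systemd/journald.conf section=Journal option=SystemMaxUse value=2G

- name: restart journald to pick up the new limit
  command: systemctl restart systemd-journald
```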
moved from contiv/build#22 as ansible repo is the right place for this.
+++++++++
Right now, the way the ceph configuration is generated (as shown in the snippet from ansible/roles/ceph-common/templates/ceph.conf.j2 below) creates a dependency that all mon and osd hosts be configured together in a single playbook run.
In a real cluster, we would need to allow incremental provisioning of new mons and osds. This issue tracks that requirement.
{% for host in groups[mon_group_name] %}
{% if hostvars[host]['ansible_hostname'] is defined %}
[mon.{{ hostvars[host]['ansible_hostname'] }}]
host = {{ hostvars[host]['ansible_hostname'] }}
mon addr = {{ hostvars[host]['ansible_' + monitor_interface]['ipv4']['address'] }}
{% endif %}
{% endfor %}
While running net_demo_installer I saw that the docker version was not updated on the bare metal server.
The docker version on the bare metal was:
ladmin@contiv146:~/src/github.com/contiv/demo/net$ docker version
Client:
Version: 1.9.0-dev
API version: 1.21
Go version: go1.4.2
Git commit: 02ae137
Built: Fri Sep 25 17:37:00 UTC 2015
OS/Arch: linux/amd64
Experimental: true
However, manually updating docker using the docker-provided script did upgrade it to 1.9.1.