princeton-cdh / cdh-ansible Goto Github PK

View Code? Open in Web Editor NEW

1.0 1.0 3.0 45.14 MB

CDH Ansible playbook repository

License: Apache License 2.0

Python 31.47% Shell 9.53% Jinja 55.83% JavaScript 2.17% HTML 1.00%

ansible ansible-playbook devops

cdh-ansible's People

Contributors

Stargazers

Watchers

Forkers

networkarchetype blms

cdh-ansible's Issues

As a developer, I want to automatically test ansible roles and their dependencies so that I can make changes with confidence.

Describe the solution you'd like

pushing a commit to a role causes automated tests to run that confirm that the role works as expected
pushing a commit to a role causes tests to run for any roles related to the role that was changed
pushing a commit doesn't run tests for roles that are unrelated to the changed one

Additional context
Molecule requires current versions of ansible (>=2.8) so we will need to update. Currently, ansible is pinned <=2.7 to deal with a setuptools issue; maybe we can check if this bug still exists. Note that ansible's most recent stable is 2.10, but it was split into two modules (ansible and ansible-base) after 2.9 so that will require some changes as well. Current plan is to wait until we have molecule tests in place before upgrading ansible to 2.10.

Checklist

update ansible to 2.8
add molecule and any required dependencies
write an example role test
~~adapt PUL strategy for determining which roles to test~~
write github actions workflow for running tests

template files should end with .j2

applies to local settings templates in django role; check for others

Use shallow clone to check out project repo

Is your feature request related to a problem? Please describe.
The build_project_repo role currently does a full git checkout of the project's repository each time. We don't need the full git history on the target machine; only the most recent state is necessary.

Describe the solution you'd like
A shallow clone will allow us to get the files from github without their git history. It should be sufficient to pass depth:1 to the git module in ansible.

The relevant lines are here:

cdh-ansible/roles/build_project_repo/tasks/main.yml

Lines 29 to 37 in 80891c6

 - name: Clone project repository and set to the correct version 

 become: true 

 become_user: "{{ deploy_user }}" 

 git: 

 repo: "{{ repo_url }}/" 

 dest: "{{ clone_root }}/{{ repo }}" 

 version: "{{ gitref }}" 

 # register repo_info for group_vars 

 register: repo_info

write new playbook for S&co PUL production deploy

clean up apache2 stop handler and any other apache2 roles in ansible

As a developer, I want a record of significant changes to our deploy process so that I can refer to it when creating and updating playbooks.

We need a system to document major decisions/changes to the way we use ansible. One option is to use the ADR (architectural decision record) spec, like PUL does in princeton-ansible.

Discussion:

Would you add a note about dropping these in a changelog? (Or do we even have a changelog? Maybe we don't, because no releases...) Would it make sense to start?

Originally posted by @rlskoeser in #49 (comment)

write molecule tests for solr role

Permissions are repeatedly reset during project repo clone

Describe the bug
When deploying apps on PUL infrastructure, the beginning of the build_project_repo role takes significantly longer than other tasks, and reports changed: each time it is run. The slow part is this step, where we recursively set permissions on the deploy directory to prevent permissions errors when cloning:

cdh-ansible/roles/build_project_repo/tasks/main.yml

Lines 18 to 27 in 9a3839c

 - name: Ensure deploy user has access to install root 

 become: true 

 file: 

 dest: "{{ install_root }}" 

 owner: "{{ deploy_user }}" 

 group: "{{ deploy_user }}" 

 recurse: yes 

 # only set owner when deploy user is defined 

 # (i.e. on PUL vms, where deploy user is different than remote user) 

 when: deploy_user is defined and ansible_distribution != 'Springdale'

this is a large recursive operation, so it could potentially take a long time, but it appears to be running (and changing permissions) every time, which doesn't seem right.

Expected behavior
The permissions should be set correctly once, and then ansible should detect that there's no need to make a change and report skipped: , making the step quicker.

update geniza playbooks to add npm/webpack and use new django role

migrate shxco from apache2 to nginx

dev notes

update shxco qa playbook to switch it from apache2 to nginx/passenger and test in qa
once qa is working, make the same change to production playbook
deploy and confirm everything is working in production

apache2 clean up can be done in cdh-ansible once this task is done:

#155

supervisor doesn't get restarted when simulating risk playbook runs

there's a restart handler in the supervisor role, but it doesn't seem like it's being triggered

Advice from @acozine about options for investigation:

I can think of 3 things to check right away. 1. The handler has a condition on it of when: supervisor_started - maybe add a debug task to check that? If the service hasn't been started, the handler won't run. 2. Most of the tasks that call it do something like "make sure X is present / exists", so if the file already exists, Ansible doesn't change anything and the handler does not get called. Try manually removing one of those files and see if the handler runs then. 3. IIRC handler names must be globally unique. Grep through your roles to see if there's another handler somewhere else with the name restart supervisor.

Refer to https://docs.ansible.com/ansible/latest/playbook_guide/playbooks_handlers.html

Other possibilities:

... if you have other tasks/roles later that are failing, it could be that Ansible has the handler loaded up to run but something else fails before it gets around to actually running it.
you could try adding meta: flush_handlers (as a task) and see if that fixes your issue

Configure custom nginx error page for QA/test/vpn-only sites

Everyone occasionally forgets to log into access test/vpn sites, and it would be helpful to have a different visual reminder to distinguish those sites from library sites.

remind to login to vpn if not already on campus / on vpn
add cdh logo somewhere
should return a 401 not authorized if access is denied due to IP address

Instructions from Francis:

at the bottom of every config generally is

server {
    listen 443 ssl;
    server_name cdh-geniza.princeton.edu;
    ...
    include /etc/nginx/conf.d/templates/prod-maintenance.conf;
}

and the contents of prod-maintenance

error_page 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 421 422 423 424 425 426 428 429 431 451 500 501 502 503 504 505 506 507 508 510 511 = @maintenance;
location @maintenance {
        root /var/local/www/default;
        try_files $uri /index.html =502;
    }

And the html file is the error page displayed.

pre- and post-deploy tasks fail when staging var is undefined

checks at the end of create_deployment and close_deployment are failing when staging is not defined:
https://github.com/Princeton-CDH/CDH_ansible/blob/8b971961a366bf9a24a54bd01417e734e3305665/roles/create_deployment/tasks/main.yml#L39

ansible docs on when statement

see slack link below for more

update solr config for Filter Cache Evictions

flagged by Max on PUL slack for cdh_ppa but may be relevant to other applications using Solr

I'm not sure where your Solr configurations live, but a bunch of ours live here - https://github.com/pulibrary/pul_solr/tree/main/solr_configs. The Catalog filterCache setting lives here - https://github.com/pulibrary/pul_solr/blob/05d87860a7837c62135dedc949ed608644666cd3/solr_configs/catalog-production-v2/conf/solrconfig.xml#LL1[…]C38
There's more documentation here - https://solr.apache.org/guide/8_3/query-settings-in-solrconfig.html
My very basic understanding is that you want to dial in your cache settings so that you don't get a ton of evictions, and you get a fairly high hit ratio, without over-taxing your machine's memory. Right now, what's in Datadog as cdh_ppa has very very high filter cache evictions, and very low filter cache hit ratio, so you might want to give a higher value to your filterCache in your conf/solrconfig.xml

fix invalid group characters in cdh-web group name (deprecation warning)

To reproduce
Steps to reproduce the behavior:

Run any ansible playbook
See the deprecation warning immediately in the output

[DEPRECATION WARNING]: The TRANSFORM_INVALID_GROUP_CHARS settings is set to allow bad characters in group names by default, this will change, but still be user configurable on deprecation. This feature will be removed in version 2.10. Deprecation warnings can be disabled by setting
deprecation_warnings=False in ansible.cfg.
[WARNING]: Invalid characters were found in group names but not replaced, use -vvvv to see details

Additional context
Output from running with -vvvv option:

Not replacing invalid character(s) "{'-'}" in group name (cdh-web_qa)
Not replacing invalid character(s) "{'-'}" in group name (cdh-web_prod)
Not replacing invalid character(s) "{'-'}" in group name (cdh-web_prod)
Not replacing invalid character(s) "{'-'}" in group name (cdh-web_staging)
Not replacing invalid character(s) "{'-'}" in group name (cdh-web_staging)
Not replacing invalid character(s) "{'-'}" in group name (cdh-web)
Not replacing invalid character(s) "{'-'}" in group name (cdh-web)

rename to 'devops'

once we start storing stuff like github action templates in here we might want to rename the repo to be more all-encompassing.

Fix CAS errors

We occasionally still get errors with CAS versions. Update CAS settings to stop this.

Change github deployment role to pre/post deployment step

see for example https://github.com/pulibrary/princeton_ansible/blob/540e2136fdec4becd1336dd1c68ffbf41296de95/playbooks/approvals_production.yml#L12-L21

unfortunately it looks like there's no modules published on ansible galaxy to handle this (we might consider publishing one, actually, if we get it working well). they do have ones for creating/updating github releases, but not deployments.

some digging reveals that pre_tasks and post_tasks are not special at all - they just run arbitrary tasks at the specified time, and will not be run if the playbooks fail. i think the way to approach this is to:

consolidate our create/complete/fail deployment tasks into a single github role (along with cloning)
use include to pull the tasks from this role into each playbook as pre_tasks and post_tasks
make sure the tasks include the special always tag so they run regardless of playbook failure
also add a new tag github_deploy (or similar) to the tasks so that we can easily turn off github deploys when running the playbook with --skip-tag=github_deploy or similar. see this article for more info on this approach.

As a deployer, I would like to have the deploy user automatically created when running playbooks, so that required users are managed by the playbook.

Is your feature request related to a problem? Please describe.
Our typical deploy user deploy is currently being used by pulibrary. If we would like to migrate to PUL infrastructure, we need to begin using the deploy user conan. Ideally, this process would be automated.

Describe the solution you'd like
We should create an ansible role that automatically creates the deploy user conan for all our machines.

python role should install from requirements.lock automatically if present

we're creating lockfiles for most of our apps, so we should use them to install predictable versions of dependencies in production.

configure nfs mount to test-prosody / prosody qa playbook

add the ansible step to mount the nfs share
update media_root: '/srv/www/media/' setting as needed
make sure both conan and www-data group can read/write
make sure media can be served out by nginx

As a developer, I want to manage nodejs-related dependencies and tasks through a single role so that I can simplify existing deploy playbooks.

Is your feature request related to a problem? Please describe.
Compare #63; this situation is similar: functionality related to javascript/nodejs is spread across multiple roles.

build_npm
build_semantic
run_webpack

Describe the solution you'd like
Implementation could be fairly similar to the django and python roles, with tasks to:

install an arbitrary version of nodejs
if present, install nodejs dependencies from a package.json or package-lock.json file
run arbitrary npm scripts, such as those that compile static files

Additional context
This issue could resolve existing issue #9 related to ansible's npm module; perhaps we now have a newer version or can find a smarter way to use the module.

As a developer, I want to manage Django tasks through a single role so that I can simplify existing deploy playbooks.

Is your feature request related to a problem? Please describe.
Currently, functionality related to deploying Django apps is spread across several roles:

configure_logging
configure_media
django_collectstatic
django_compressor
django_migrate
install_local_settings

Describe the solution you'd like
Ideally, these could all be tasks that are part of a single Django-specific role, which could have shared defaults and dependencies on other roles. New Django-related tasks (for example, Princeton-CDH/geniza#117) can also become part of this role.

Additional context

Ansible django_manage plugin, from community.general
existing ansible-role-django on github

python_app_version doesn't get set in partial playbook runs

For python apps, we get the python_app_version from the current checkout of the code and use it to determine the path for the deployed version. When we run playbooks without running the full sequence, e.g. using the --start-at-task option, this variable doesn't get set, and then any task that needs to know the path to the deployed version of the project fails.

The task is currently part of the build_project_repo role: https://github.com/Princeton-CDH/cdh-ansible/blob/main/roles/build_project_repo/tasks/main.yml#L43-L54

On a new deploy, it can't be run until the new version of the code has been checked out; but on an existing deploy it could be run anytime.

As a developer, I want Solr configuration updates and schema changes to be made automatically when I deploy new software, so I can trust that expected changes will always be made.

Create a new Solr role that works similarly to the Solr update logic PUL uses in capistrano, and write molecule tests for the new role.

create new role solr
Create conan user on lib-solr-staging1
add a task to copy solr configset files to the appropriate server
add a task to use zookeeper to update the configset
add a task to create solr collection if it doesn't already exist (?)
add a task to reload the collection if not newly created
figure out what new configurations are needed and if there are common defaults we can use (e.g., configset location?)
add the new role to the geniza qa playbook (can use for testing)
#89

update readme instructions to match current state

automatically clean up old deploys

Is your feature request related to a problem? Please describe.
Currently, the deploy scripts leave all past deploys sitting in the directory.

Describe the solution you'd like
Old deploys should automatically be cleaned up when the deploy finishes. Maybe keep the last 3 based on date (most recent)? Should always preserve current + previous.

See for one possible solution https://www.future500.nl/articles/2014/07/thoughts-on-deploying-with-ansible/

Describe alternatives you've considered
Could be cleaned up manually or by a cron job, but seems better to make it part of the deploy.

passenger is serving out git config file

looks like the config error is a misuse/misunderstanding of the root configuration:

https://www.phusionpassenger.com/library/deploy/nginx/deploy/python/#deploying-an-app-to-a-virtual-host-s-root

# Tell Nginx and Passenger where your app's 'public' directory is
root /path-to-app/public;

We're configuring it to use the app root, but we should point it at static root instead to avoid this behavior.

database migration for qa

database dump

stop webserver connections
```
sudo systemctl stop nginx
```

drop database

pg_dump --format custom --clean --no-owner --no-privileges cdh_geniza > 2023_10_25_cdh_geniza.dump

transfer database (as pulsys)

sudo mv /var/lib/postgresql/2023_10_25_cdh_geniza.dump .
scp 2023_10_25_cdh_geniza.dump lib-postgres-prod1:~

run a db_setup playbook

import database from new host

sudo mv 2023_10_25_cdh_geniza.dump /var/lib/postgresql
pg_restore -h 127.0.0.1 -U cdh_geniza --dbname cdh_geniza --no-owner --no-privileges 2023_10_25_cdh_geniza.dump

restart webserver connection
```
sudo systemctl start nginx
```

work with PUL to upgrade VMs from Bionic to Jammy Jellyfish

details from @kayiwa :

Ubuntu bionic VM goes out of support in June
The process has been: Set up new Operating System with Jammy Jellyfish.
Try to deploy. See what breaks, fix with a PR that accommodates the existence of both Jammy and Bionic

We'll want to test the upgrade in staging first. Francis says he'll do the dependencies work on our PRs. Once everything is working in staging we can schedule an upgrade for production.

projects needing upgrades:

subtasks related to this upgrade

#147
#148

Python version not correctly updating to 3.9

We're getting this error:

pulsys@cdh-test-prosody1:~$ sudo tail /var/log/apache2/ppa_error.log
[Tue Nov 29 18:00:55.640359 2022] [wsgi:error] [pid 4599:tid 140169124407040] [remote 128.112.203.144:36628] Traceback (most recent call last):
[Tue Nov 29 18:00:55.640404 2022] [wsgi:error] [pid 4599:tid 140169124407040] [remote 128.112.203.144:36628]   File "/var/www/ppa/ppa/wsgi.py", line 12, in <module>
[Tue Nov 29 18:00:55.640434 2022] [wsgi:error] [pid 4599:tid 140169124407040] [remote 128.112.203.144:36628]     from django.core.wsgi import get_wsgi_application
[Tue Nov 29 18:00:55.640466 2022] [wsgi:error] [pid 4599:tid 140169124407040] [remote 128.112.203.144:36628] ModuleNotFoundError: No module named 'django'
[Tue Nov 29 18:00:56.046126 2022] [wsgi:error] [pid 4599:tid 140169132799744] [remote 128.112.203.145:38478] mod_wsgi (pid=4599): Target WSGI script '/var/www/ppa/ppa/wsgi.py' cannot be loaded as Python module.
[Tue Nov 29 18:00:56.046239 2022] [wsgi:error] [pid 4599:tid 140169132799744] [remote 128.112.203.145:38478] mod_wsgi (pid=4599): Exception occurred processing WSGI script '/var/www/ppa/ppa/wsgi.py'.
[Tue Nov 29 18:00:56.046376 2022] [wsgi:error] [pid 4599:tid 140169132799744] [remote 128.112.203.145:38478] Traceback (most recent call last):
[Tue Nov 29 18:00:56.046424 2022] [wsgi:error] [pid 4599:tid 140169132799744] [remote 128.112.203.145:38478]   File "/var/www/ppa/ppa/wsgi.py", line 12, in <module>
[Tue Nov 29 18:00:56.046434 2022] [wsgi:error] [pid 4599:tid 140169132799744] [remote 128.112.203.145:38478]     from django.core.wsgi import get_wsgi_application
[Tue Nov 29 18:00:56.046465 2022] [wsgi:error] [pid 4599:tid 140169132799744] [remote 128.112.203.145:38478] ModuleNotFoundError: No module named 'django'

It's possible that the python version is harder to switch when we're on apache vs nginx. we have to have the correct version of mod wsgi installed. @rlskoeser will investigate.

PR: #129

add semanticui build task

PPA builds will continue to look weird until we do this, since they aren't building semantic ui now by default. The task just needs to run npm run build:semantic, in the case of PPA.

per @bwhicks we might want to integrate this into a general 'build' task, or something that aggregates the relevant npm... commands.

allow running django-compressor as part of django role

this task wouldn't run in the main sequence for the role, but it should be runnable on demand for apps that use it (only relevant one is cdhweb).

add setup tags to tasks/roles so we can run deploy playbooks more quickly

one-time setup / provisioning tasks should all be tagged with setup
update ansible config to skip setup tasks by default
document convention and how to run in readme

write new playbook for S&co qa deploy to PUL vm with postgres & solr

Make postgresql role not back up database if it's just been created

Should theoretically be as simple as registering the output of the create_db task to check if it was indeed created, and then conditionally running the backup_db task.

node upgrade doesn't work with snap (should refresh but doesn't)

ansible snap builtin is supposed to run refresh instead of install when a package is already installed, but it failed to do the upgrade

Recommendation from @kayiwa is to use PUL's nodejs role that builds it from source based on a specified version; see the nodejs download to see the syntax needed for version numbers.

remove old S&co playbooks

after migration to PUL infrastructure is complete

allow specifying and installing arbitrary python versions

Is your feature request related to a problem? Please describe.
Stuck on python 3.6 because of ubuntu's package manager.

Describe the solution you'd like
build_dependencies role should enable access to an external apt repo (deadsnakes maybe) and then use it to install a version of python specified by python_version. this should become the default python.

Production deploys fail on context check

Error when executing a ansible-playbook pemm.yml, need to instead execute ansible-playbook pemm.yml -e '{"deploy_contexts": []}'

All the messy logs:

TASK [create_deployment : Create a deployment] *********************************
fatal: [cdh-pemm1.princeton.edu]: FAILED! => changed=false 
  access_control_allow_origin: '*'
  access_control_expose_headers: ETag, Link, Location, Retry-After, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval, X-GitHub-Media-Type, Deprecation, Sunset
  connection: close
  content: '{"message":"Conflict: Commit status checks failed for master.","errors":[{"contexts":[{"context":"Travis CI - Branch","state":"success"}],"resource":"Deployment","field":"required_contexts","code":"invalid"}],"documentation_url":"https://developer.github.com/v3/repos/deployments/#create-a-deployment"}'
  content_length: '302'
  content_security_policy: default-src 'none'
  content_type: application/json; charset=utf-8
  date: Fri, 05 Jun 2020 14:50:23 GMT
  json:
    documentation_url: https://developer.github.com/v3/repos/deployments/#create-a-deployment
    errors:
    - code: invalid
      contexts:
      - context: Travis CI - Branch
        state: success
      field: required_contexts
      resource: Deployment
    message: 'Conflict: Commit status checks failed for master.'
  msg: 'Status code was 409 and not [201]: HTTP Error 409: Conflict'
  redirected: false
  referrer_policy: origin-when-cross-origin, strict-origin-when-cross-origin
  server: GitHub.com
  status: 409
  strict_transport_security: max-age=31536000; includeSubdomains; preload
  url: https://api.github.com/repos/Princeton-CDH/pemm-scripts/deployments
  vary: Accept-Encoding, Accept, X-Requested-With
  x_accepted_oauth_scopes: ''
  x_content_type_options: nosniff
  x_frame_options: deny
  x_github_media_type: github.v3; format=json
  x_github_request_id: BC8E:560E:668A2:ACFF7:5EDA5BAF
  x_oauth_scopes: repo_deployment
  x_ratelimit_limit: '5000'
  x_ratelimit_remaining: '4991'
  x_ratelimit_reset: '1591368768'
  x_xss_protection: 1; mode=block
	to retry, use: --limit @/Users/kmcelwee/cdh/CDH_ansible/pemm.retry

PLAY RECAP *********************************************************************
cdh-pemm1.princeton.edu    : ok=1    changed=0    unreachable=0    failed=1

npm module doesn't register error on webpack build

The npm module for Ansible does not note build failures and bubble them up to Ansible. This may need to be replaced with a call to the shell module.

Change default production deploy branch from master -> main

setup passenger role and configure cdhweb & geniza qa playbooks to use it instead of apache

playbook / role to synchronize database and media data from production to qa or dev environment

Is your feature request related to a problem? Please describe.
When we want to test against recent production data, we have to manually download sql dumps and media files, copy them, load them into the database or file system, set permissions, update django/wagtail site in the database. Because it's error prone and because we tend to be lazy, we don't always do always bother with refreshing data when we probably should.

Describe the solution you'd like
A script or playbook I can run that does all of this for me. For QA at least, but for local dev would be super.

general steps needed:

backup production db
sync db backup to staging postgres and restore
backup production site media (user uploaded content)
sync media backup to staging and restore
run django migrations (maybe? or could be out of scope) — needed when staging code differs from production
update django site in db to match environment
sync solr?

Replace get_ver.py script with one-line python command

I think we should be able to replace the get_ver.py script with a one-line python command that can be run directly from the ansible script. I think what we want to generate should look roughly like this:

python -c 'import ppa; print(ppa.__version__)'

It will need configuration for the python package name to import and the working directory.

update python version for PPA deploys

Please try updating ansible to this python version and running the QA deploy to confirm that everything works there. The python version is specified in group_vars/prosody/vars.yml — changing that should be all that's needed, since the ansible playbook installs and configures the version we have set.

Originally posted by @rlskoeser in Princeton-CDH/ppa-django#495 (comment)

Add tests for postgresql role

we need to use the approach francis recommended of installing postgres local to the target container, rather than testing it between two containers.

Finish configuring projects for new rotating logs

The following work needs to happen to make rotating logs happen on all projects:

Configure log rotation on cdh and derridas-margins
Place log files in updated location.
Add logging roles to cdh and derridas-margins playbooks

revisit pre-commit hook

Is your feature request related to a problem? Please describe.
current pre-commit hook script is custom and will be hard to manage if we add more encrypted files with different names

Describe the solution you'd like
suggest implementing with pre-commit, either using repository local hooks or with an existing pre-commit script like this one: https://github.com/IamTheFij/ansible-pre-commit

Then the files that need to be checked would be configured in the pre-commit yaml instead being hard-coded in the script.

change ansible vault password

since potentially leaked in lastpass breach

	- name: Clone project repository and set to the correct version
	become: true
	become_user: "{{ deploy_user }}"
	git:
	repo: "{{ repo_url }}/"
	dest: "{{ clone_root }}/{{ repo }}"
	version: "{{ gitref }}"
	# register repo_info for group_vars
	register: repo_info

	- name: Ensure deploy user has access to install root
	become: true
	file:
	dest: "{{ install_root }}"
	owner: "{{ deploy_user }}"
	group: "{{ deploy_user }}"
	recurse: yes
	# only set owner when deploy user is defined
	# (i.e. on PUL vms, where deploy user is different than remote user)
	when: deploy_user is defined and ansible_distribution != 'Springdale'