princeton-cdh / cdh-ansible
CDH Ansible playbook repository

License: Apache License 2.0

Python 31.47% Shell 9.53% Jinja 55.83% JavaScript 2.17% HTML 1.00%
ansible ansible-playbook devops

cdh-ansible's People

Contributors

acozine, blms, kayiwa, kmcelwee, meg-codes, quadrismegistus, rlskoeser, thatbudakguy

cdh-ansible's Issues

As a developer, I want to automatically test ansible roles and their dependencies so that I can make changes with confidence.

Describe the solution you'd like

  • pushing a commit to a role causes automated tests to run that confirm that the role works as expected
  • pushing a commit to a role causes tests to run for any roles related to the role that was changed
  • pushing a commit doesn't run tests for roles that are unrelated to the changed one

Additional context
Molecule requires a current version of ansible (>=2.8), so we will need to update. Currently, ansible is pinned <=2.7 to work around a setuptools issue; we should check whether that bug still exists. Note that ansible's most recent stable release is 2.10, but the package was split in two (ansible and ansible-base) after 2.9, so upgrading will require some changes as well. The current plan is to wait until we have molecule tests in place before upgrading ansible to 2.10.

Checklist

  • update ansible to 2.8
  • add molecule and any required dependencies
  • write an example role test
  • adapt PUL strategy for determining which roles to test
  • write github actions workflow for running tests

Use shallow clone to check out project repo

Is your feature request related to a problem? Please describe.
The build_project_repo role currently does a full git checkout of the project's repository each time. We don't need the full git history on the target machine; only the most recent state is necessary.

Describe the solution you'd like
A shallow clone will allow us to get the files from github without their git history. It should be sufficient to pass depth: 1 to the git module in ansible.

The relevant lines are here:

- name: Clone project repository and set to the correct version
  become: true
  become_user: "{{ deploy_user }}"
  git:
    repo: "{{ repo_url }}/"
    dest: "{{ clone_root }}/{{ repo }}"
    version: "{{ gitref }}"
  # register repo_info for group_vars
  register: repo_info
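With the shallow-clone option added, the task would look roughly like this (a sketch; the git module's depth parameter limits history to the given number of commits):

```yaml
- name: Clone project repository and set to the correct version
  become: true
  become_user: "{{ deploy_user }}"
  git:
    repo: "{{ repo_url }}/"
    dest: "{{ clone_root }}/{{ repo }}"
    version: "{{ gitref }}"
    depth: 1  # shallow clone: fetch only the most recent commit, no history
  # register repo_info for group_vars
  register: repo_info
```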

As a developer, I want a record of significant changes to our deploy process so that I can refer to it when creating and updating playbooks.

We need a system to document major decisions/changes to the way we use ansible. One option is to use the ADR (architectural decision record) spec, like PUL does in princeton-ansible.

Discussion:

Would you add a note about dropping these in a changelog? (Or do we even have a changelog? Maybe we don't, because no releases...) Would it make sense to start?

Originally posted by @rlskoeser in #49 (comment)

Permissions are repeatedly reset during project repo clone

Describe the bug
When deploying apps on PUL infrastructure, the beginning of the build_project_repo role takes significantly longer than other tasks, and reports changed: each time it is run. The slow part is this step, where we recursively set permissions on the deploy directory to prevent permissions errors when cloning:

- name: Ensure deploy user has access to install root
  become: true
  file:
    dest: "{{ install_root }}"
    owner: "{{ deploy_user }}"
    group: "{{ deploy_user }}"
    recurse: yes
  # only set owner when deploy user is defined
  # (i.e. on PUL vms, where deploy user is different than remote user)
  when: deploy_user is defined and ansible_distribution != 'Springdale'

This is a large recursive operation, so it could legitimately take a while, but it appears to be running (and changing permissions) every time, which doesn't seem right.

Expected behavior
The permissions should be set correctly once, after which ansible should detect that no change is needed and report ok:, making the step quick.
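One possible mitigation (a sketch, not tested against this playbook): check the top-level owner first, and only run the expensive recursive task when ownership is actually wrong:

```yaml
# Check who owns the install root before deciding to recurse
- name: Check install root ownership
  stat:
    path: "{{ install_root }}"
  register: install_root_stat

- name: Ensure deploy user has access to install root
  become: true
  file:
    dest: "{{ install_root }}"
    owner: "{{ deploy_user }}"
    group: "{{ deploy_user }}"
    recurse: yes
  # only recurse when the top-level owner is wrong; this skips the
  # slow recursive chown on the common case where nothing changed
  when:
    - deploy_user is defined
    - ansible_distribution != 'Springdale'
    - not install_root_stat.stat.exists or install_root_stat.stat.pw_name != deploy_user
```

This assumes files under the install root share the top-level directory's ownership; if individual files can drift, a periodic full pass would still be needed.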

migrate shxco from apache2 to nginx

dev notes

  • update shxco qa playbook to switch it from apache2 to nginx/passenger and test in qa
  • once qa is working, make the same change to production playbook
  • deploy and confirm everything is working in production

Apache2 cleanup can be done in cdh-ansible once this task is done.

supervisor doesn't get restarted when simulating risk playbook runs

There's a restart handler in the supervisor role, but it doesn't seem to be getting triggered.

Advice from @acozine about options for investigation:

I can think of 3 things to check right away. 1. The handler has a condition on it of when: supervisor_started - maybe add a debug task to check that? If the service hasn't been started, the handler won't run. 2. Most of the tasks that call it do something like "make sure X is present / exists", so if the file already exists, Ansible doesn't change anything and the handler does not get called. Try manually removing one of those files and see if the handler runs then. 3. IIRC handler names must be globally unique. Grep through your roles to see if there's another handler somewhere else with the name restart supervisor.

Refer to https://docs.ansible.com/ansible/latest/playbook_guide/playbooks_handlers.html

Other possibilities:

... if you have other tasks/roles later that are failing, it could be that Ansible has the handler loaded up to run but something else fails before it gets around to actually running it.
you could try adding meta: flush_handlers (as a task) and see if that fixes your issue
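The flush_handlers suggestion would look like this as a task (a sketch; place it after the tasks that notify the handler):

```yaml
# Force any notified handlers to run immediately, instead of waiting
# until the end of the play, where a later failure could prevent
# them from ever running
- meta: flush_handlers
```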

Configure custom nginx error page for QA/test/vpn-only sites

Everyone occasionally forgets to log into the VPN before accessing test/vpn-only sites, and it would be helpful to have a distinct visual reminder to distinguish those sites from library sites.

  • remind the user to log into the VPN if not already on campus / on the VPN
  • add cdh logo somewhere
  • should return a 401 Unauthorized if access is denied due to IP address

Instructions from Francis:

at the bottom of every config generally is

server {
    listen 443 ssl;
    server_name cdh-geniza.princeton.edu;
    ...
    include /etc/nginx/conf.d/templates/prod-maintenance.conf;
}

and the contents of prod-maintenance

error_page 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 421 422 423 424 425 426 428 429 431 451 500 501 502 503 504 505 506 507 508 510 511 = @maintenance;
location @maintenance {
    root /var/local/www/default;
    try_files $uri /index.html =502;
}

And the html file is the error page displayed.
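Following the same pattern, an analogous include for QA/vpn-only sites might look like this (a sketch; the file path, docroot, and page name are all hypothetical):

```nginx
# /etc/nginx/conf.d/templates/qa-vpn.conf (hypothetical path)
# IP-based denials normally produce 403; remap to a custom reminder page
error_page 401 403 = @vpn_reminder;

location @vpn_reminder {
    root /var/local/www/qa-default;        # hypothetical docroot
    # serve a page reminding users to connect to the VPN;
    # =401 makes the response status 401 Unauthorized as requested
    try_files /vpn-reminder.html =401;
}
```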

update solr config for Filter Cache Evictions

flagged by Max on PUL slack for cdh_ppa but may be relevant to other applications using Solr

I'm not sure where your Solr configurations live, but a bunch of ours live here - https://github.com/pulibrary/pul_solr/tree/main/solr_configs. The Catalog filterCache setting lives here - https://github.com/pulibrary/pul_solr/blob/05d87860a7837c62135dedc949ed608644666cd3/solr_configs/catalog-production-v2/conf/solrconfig.xml#LL1[…]C38
There's more documentation here - https://solr.apache.org/guide/8_3/query-settings-in-solrconfig.html
My very basic understanding is that you want to dial in your cache settings so that you don't get a ton of evictions, and you get a fairly high hit ratio, without over-taxing your machine's memory. Right now, what's in Datadog as cdh_ppa has very very high filter cache evictions, and very low filter cache hit ratio, so you might want to give a higher value to your filterCache in your conf/solrconfig.xml

fix invalid group characters in cdh-web group name (deprecation warning)

To reproduce
Steps to reproduce the behavior:

  1. Run any ansible playbook
  2. See the deprecation warning immediately in the output

[DEPRECATION WARNING]: The TRANSFORM_INVALID_GROUP_CHARS settings is set to allow bad characters in group names by default, this will change, but still be user configurable on deprecation. This feature will be removed in version 2.10. Deprecation warnings can be disabled by setting
deprecation_warnings=False in ansible.cfg.
[WARNING]: Invalid characters were found in group names but not replaced, use -vvvv to see details

Additional context
Output from running with -vvvv option:

Not replacing invalid character(s) "{'-'}" in group name (cdh-web_qa)
Not replacing invalid character(s) "{'-'}" in group name (cdh-web_prod)
Not replacing invalid character(s) "{'-'}" in group name (cdh-web_prod)
Not replacing invalid character(s) "{'-'}" in group name (cdh-web_staging)
Not replacing invalid character(s) "{'-'}" in group name (cdh-web_staging)
Not replacing invalid character(s) "{'-'}" in group name (cdh-web)
Not replacing invalid character(s) "{'-'}" in group name (cdh-web)

rename to 'devops'

once we start storing stuff like github action templates in here we might want to rename the repo to be more all-encompassing.

Fix CAS errors

We occasionally still get errors with CAS versions. Update CAS settings to stop this.

Change github deployment role to pre/post deployment step

see for example https://github.com/pulibrary/princeton_ansible/blob/540e2136fdec4becd1336dd1c68ffbf41296de95/playbooks/approvals_production.yml#L12-L21

Unfortunately, it looks like there are no modules published on ansible galaxy to handle this (we might consider publishing one, actually, if we get it working well). There are modules for creating/updating github releases, but not deployments.

Some digging reveals that pre_tasks and post_tasks are not special at all: they just run arbitrary tasks at the specified time, and will not run if the playbook fails. I think the way to approach this is to:

  • consolidate our create/complete/fail deployment tasks into a single github role (along with cloning)
  • use include to pull the tasks from this role into each playbook as pre_tasks and post_tasks
  • make sure the tasks include the special always tag so they still run when the play is limited to specific tags
  • also add a new tag github_deploy (or similar) to the tasks so that we can easily turn off github deploys by running the playbook with --skip-tags github_deploy or similar. see this article for more info on this approach.
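The steps above might be wired into a playbook roughly like this (a sketch; the role name, tasks_from file names, and the github_deploy tag are all suggestions, not existing code):

```yaml
# in each playbook (sketch)
pre_tasks:
  - name: Create github deployment
    include_role:
      name: github               # hypothetical consolidated role
      tasks_from: create_deployment
    tags:
      - always                   # run even when the play is limited by tags
      - github_deploy            # allow skipping with --skip-tags github_deploy

post_tasks:
  - name: Mark github deployment complete
    include_role:
      name: github
      tasks_from: complete_deployment
    tags:
      - always
      - github_deploy
```

Note that post_tasks still won't run on failure; reporting a failed deployment would need separate handling (e.g. a block/rescue around the main tasks).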

As a deployer, I would like to have the deploy user automatically created when running playbooks, so that required users are managed by the playbook.

Is your feature request related to a problem? Please describe.
Our typical deploy user, deploy, is already in use by pulibrary. If we want to migrate to PUL infrastructure, we need to begin using the deploy user conan. Ideally, this process would be automated.

Describe the solution you'd like
We should create an ansible role that automatically creates the deploy user conan for all our machines.
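A minimal sketch of such a role's tasks (the role layout and the deploy_user default are assumptions):

```yaml
# roles/create_deploy_user/tasks/main.yml (hypothetical role name)
- name: Create deploy user
  become: true
  user:
    name: "{{ deploy_user | default('conan') }}"
    shell: /bin/bash
    create_home: true
    state: present
```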

As a developer, I want to manage nodejs-related dependencies and tasks through a single role so that I can simplify existing deploy playbooks.

Is your feature request related to a problem? Please describe.
Compare #63; this situation is similar: functionality related to javascript/nodejs is spread across multiple roles.

  • build_npm
  • build_semantic
  • run_webpack

Describe the solution you'd like
Implementation could be fairly similar to the django and python roles, with tasks to:

  • install an arbitrary version of nodejs
  • if present, install nodejs dependencies from a package.json or package-lock.json file
  • run arbitrary npm scripts, such as those that compile static files
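The tasks above might be sketched like this (variable names such as app_root and npm_build_scripts are suggestions, not existing configuration):

```yaml
# sketch of tasks for a consolidated nodejs role
- name: Check for package.json
  stat:
    path: "{{ app_root }}/package.json"
  register: package_json

- name: Install nodejs dependencies
  npm:
    path: "{{ app_root }}"
  when: package_json.stat.exists

- name: Run npm build scripts (e.g. compile static files)
  command: "npm run {{ item }}"
  args:
    chdir: "{{ app_root }}"
  loop: "{{ npm_build_scripts | default([]) }}"
```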

Additional context
This issue could resolve existing issue #9 related to ansible's npm module; perhaps a newer version of the module is available now, or we can find a smarter way to use it.

As a developer, I want to manage Django tasks through a single role so that I can simplify existing deploy playbooks.

Is your feature request related to a problem? Please describe.
Currently, functionality related to deploying Django apps is spread across several roles:

  • configure_logging
  • configure_media
  • django_collectstatic
  • django_compressor
  • django_migrate
  • install_local_settings

Describe the solution you'd like
Ideally, these could all be tasks that are part of a single Django-specific role, which could have shared defaults and dependencies on other roles. New Django-related tasks (for example, Princeton-CDH/geniza#117) can also become part of this role.

Additional context

python_app_version doesn't get set in partial playbook runs

For python apps, we get the python_app_version from the current checkout of the code and use it to determine the path for the deployed version. When we run playbooks without running the full sequence, e.g. using the --start-at-task option, this variable doesn't get set, and then any task that needs to know the path to the deployed version of the project fails.

The task is currently part of the build_project_repo role: https://github.com/Princeton-CDH/cdh-ansible/blob/main/roles/build_project_repo/tasks/main.yml#L43-L54

On a new deploy, it can't be run until the new version of the code has been checked out; but on an existing deploy it could be run anytime.
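One way to make the lookup available regardless of where the run starts (a sketch; the split-out task file is hypothetical): guard it so any role that needs the deploy path can pull it in on demand:

```yaml
# sketch: let any task that needs the deploy path trigger the lookup
- name: Determine python_app_version if not already set
  include_tasks: get_app_version.yml   # hypothetical split-out task file
  when: python_app_version is not defined
```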

As a developer, I want Solr configuration updates and schema changes to be made automatically when I deploy new software, so I can trust that expected changes will always be made.

Create a new Solr role that works similarly to the Solr update logic PUL uses in capistrano, and write molecule tests for the new role.

  • create new role solr
  • Create conan user on lib-solr-staging1
  • add a task to copy solr configset files to the appropriate server
  • add a task to use zookeeper to update the configset
  • add a task to create solr collection if it doesn't already exist (?)
  • add a task to reload the collection if not newly created
  • figure out what new configurations are needed and if there are common defaults we can use (e.g., configset location?)
  • add the new role to the geniza qa playbook (can use for testing)
  • #89

automatically clean up old deploys

Is your feature request related to a problem? Please describe.
Currently, the deploy scripts leave all past deploys sitting in the directory.

Describe the solution you'd like
Old deploys should automatically be cleaned up when the deploy finishes. Maybe keep the 3 most recent by date? We should always preserve the current and previous deploys.

See for one possible solution https://www.future500.nl/articles/2014/07/thoughts-on-deploying-with-ansible/

Describe alternatives you've considered
Could be cleaned up manually or by a cron job, but seems better to make it part of the deploy.
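A sketch of the cleanup as ansible tasks (the deploys directory path and the keep count of 3 are assumptions; this also assumes the current and previous deploys are among the newest by mtime):

```yaml
- name: List past deploys
  find:
    paths: "{{ install_root }}/{{ repo }}"   # hypothetical deploys directory
    file_type: directory
  register: deploys

- name: Remove all but the 3 most recent deploys
  file:
    path: "{{ item.path }}"
    state: absent
  # sort newest first, then drop everything after the first 3
  loop: "{{ (deploys.files | sort(attribute='mtime', reverse=true))[3:] }}"
```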

database migration for qa

database dump

  • stop webserver connections
    sudo systemctl stop nginx
  • dump database
    pg_dump --format custom --clean --no-owner --no-privileges cdh_geniza > 2023_10_25_cdh_geniza.dump
  • transfer database (as pulsys)
    sudo mv /var/lib/postgresql/2023_10_25_cdh_geniza.dump .
    scp 2023_10_25_cdh_geniza.dump lib-postgres-prod1:~
  • run a db_setup playbook
  • import database on the new host
    sudo mv 2023_10_25_cdh_geniza.dump /var/lib/postgresql
    pg_restore -h 127.0.0.1 -U cdh_geniza --dbname cdh_geniza --no-owner --no-privileges 2023_10_25_cdh_geniza.dump
  • restart webserver connection
    sudo systemctl start nginx

work with PUL to upgrade VMs from Bionic to Jammy Jellyfish

details from @kayiwa :

Ubuntu bionic VM goes out of support in June
The process has been: Set up new Operating System with Jammy Jellyfish.
Try to deploy. See what breaks, fix with a PR that accommodates the existence of both Jammy and Bionic

We'll want to test the upgrade in staging first. Francis says he'll do the dependencies work on our PRs. Once everything is working in staging we can schedule an upgrade for production.

projects needing upgrades:

  • cdhweb
  • shakespeareandco
  • prosody
  • geniza
  • derridas margins archive

subtasks related to this upgrade

Python version not correctly updating to 3.9

We're getting this error:

pulsys@cdh-test-prosody1:~$ sudo tail /var/log/apache2/ppa_error.log
[Tue Nov 29 18:00:55.640359 2022] [wsgi:error] [pid 4599:tid 140169124407040] [remote 128.112.203.144:36628] Traceback (most recent call last):
[Tue Nov 29 18:00:55.640404 2022] [wsgi:error] [pid 4599:tid 140169124407040] [remote 128.112.203.144:36628]   File "/var/www/ppa/ppa/wsgi.py", line 12, in <module>
[Tue Nov 29 18:00:55.640434 2022] [wsgi:error] [pid 4599:tid 140169124407040] [remote 128.112.203.144:36628]     from django.core.wsgi import get_wsgi_application
[Tue Nov 29 18:00:55.640466 2022] [wsgi:error] [pid 4599:tid 140169124407040] [remote 128.112.203.144:36628] ModuleNotFoundError: No module named 'django'
[Tue Nov 29 18:00:56.046126 2022] [wsgi:error] [pid 4599:tid 140169132799744] [remote 128.112.203.145:38478] mod_wsgi (pid=4599): Target WSGI script '/var/www/ppa/ppa/wsgi.py' cannot be loaded as Python module.
[Tue Nov 29 18:00:56.046239 2022] [wsgi:error] [pid 4599:tid 140169132799744] [remote 128.112.203.145:38478] mod_wsgi (pid=4599): Exception occurred processing WSGI script '/var/www/ppa/ppa/wsgi.py'.
[Tue Nov 29 18:00:56.046376 2022] [wsgi:error] [pid 4599:tid 140169132799744] [remote 128.112.203.145:38478] Traceback (most recent call last):
[Tue Nov 29 18:00:56.046424 2022] [wsgi:error] [pid 4599:tid 140169132799744] [remote 128.112.203.145:38478]   File "/var/www/ppa/ppa/wsgi.py", line 12, in <module>
[Tue Nov 29 18:00:56.046434 2022] [wsgi:error] [pid 4599:tid 140169132799744] [remote 128.112.203.145:38478]     from django.core.wsgi import get_wsgi_application
[Tue Nov 29 18:00:56.046465 2022] [wsgi:error] [pid 4599:tid 140169132799744] [remote 128.112.203.145:38478] ModuleNotFoundError: No module named 'django'

It's possible that the python version is harder to switch when we're on apache vs nginx; we have to have the correct version of mod_wsgi installed. @rlskoeser will investigate.

PR: #129

add semanticui build task

PPA builds will continue to look weird until we do this, since they no longer build semantic ui by default. The task just needs to run npm run build:semantic, in the case of PPA.

per @bwhicks we might want to integrate this into a general 'build' task, or something that aggregates the relevant npm... commands.

allow specifying and installing arbitrary python versions

Is your feature request related to a problem? Please describe.
We're stuck on python 3.6 because of ubuntu's package manager.

Describe the solution you'd like
The build_dependencies role should enable an external apt repo (deadsnakes, maybe) and then use it to install the version of python specified by python_version. That version should become the default python.
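A sketch of the tasks, assuming Ubuntu and the deadsnakes PPA (the package list is an assumption):

```yaml
- name: Enable deadsnakes apt repository
  become: true
  apt_repository:
    repo: ppa:deadsnakes/ppa

- name: Install requested python version
  become: true
  apt:
    name:
      - "python{{ python_version }}"
      - "python{{ python_version }}-venv"
    update_cache: yes
```

Making the new interpreter the system default python3 (e.g. via the alternatives module) can break apt tooling that depends on the distro python, so it may be safer to reference the versioned binary explicitly when creating virtualenvs.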

Production deploys fail on context check

Error when executing ansible-playbook pemm.yml; need to instead execute ansible-playbook pemm.yml -e '{"deploy_contexts": []}'

All the messy logs:

TASK [create_deployment : Create a deployment] *********************************
fatal: [cdh-pemm1.princeton.edu]: FAILED! => changed=false 
  access_control_allow_origin: '*'
  access_control_expose_headers: ETag, Link, Location, Retry-After, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval, X-GitHub-Media-Type, Deprecation, Sunset
  connection: close
  content: '{"message":"Conflict: Commit status checks failed for master.","errors":[{"contexts":[{"context":"Travis CI - Branch","state":"success"}],"resource":"Deployment","field":"required_contexts","code":"invalid"}],"documentation_url":"https://developer.github.com/v3/repos/deployments/#create-a-deployment"}'
  content_length: '302'
  content_security_policy: default-src 'none'
  content_type: application/json; charset=utf-8
  date: Fri, 05 Jun 2020 14:50:23 GMT
  json:
    documentation_url: https://developer.github.com/v3/repos/deployments/#create-a-deployment
    errors:
    - code: invalid
      contexts:
      - context: Travis CI - Branch
        state: success
      field: required_contexts
      resource: Deployment
    message: 'Conflict: Commit status checks failed for master.'
  msg: 'Status code was 409 and not [201]: HTTP Error 409: Conflict'
  redirected: false
  referrer_policy: origin-when-cross-origin, strict-origin-when-cross-origin
  server: GitHub.com
  status: 409
  strict_transport_security: max-age=31536000; includeSubdomains; preload
  url: https://api.github.com/repos/Princeton-CDH/pemm-scripts/deployments
  vary: Accept-Encoding, Accept, X-Requested-With
  x_accepted_oauth_scopes: ''
  x_content_type_options: nosniff
  x_frame_options: deny
  x_github_media_type: github.v3; format=json
  x_github_request_id: BC8E:560E:668A2:ACFF7:5EDA5BAF
  x_oauth_scopes: repo_deployment
  x_ratelimit_limit: '5000'
  x_ratelimit_remaining: '4991'
  x_ratelimit_reset: '1591368768'
  x_xss_protection: 1; mode=block
	to retry, use: --limit @/Users/kmcelwee/cdh/CDH_ansible/pemm.retry

PLAY RECAP *********************************************************************
cdh-pemm1.princeton.edu    : ok=1    changed=0    unreachable=0    failed=1 

playbook / role to synchronize database and media data from production to qa or dev environment

Is your feature request related to a problem? Please describe.
When we want to test against recent production data, we have to manually download sql dumps and media files, copy them, load them into the database or file system, set permissions, and update the django/wagtail site in the database. Because it's error-prone and because we tend to be lazy, we don't always bother to refresh the data when we probably should.

Describe the solution you'd like
A script or playbook I can run that does all of this for me. For QA at least; support for local dev would be super.

general steps needed:

  • backup production db
  • sync db backup to staging postgres and restore
  • backup production site media (user uploaded content)
  • sync media backup to staging and restore
  • run django migrations (maybe? or could be out of scope) — needed when staging code differs from production
  • update django site in db to match environment
  • sync solr?

Replace get_ver.py script with one-line python command

I think we should be able to replace the get_ver.py script with a one-line python command run directly from ansible. What we want to generate should look roughly like this:

python -c 'import ppa; print(ppa.__version__)'

It will need configuration for the python package name to import and the working directory.
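As an ansible task, that might look like this (a sketch; the python_package variable and paths are assumptions):

```yaml
- name: Get app version from the checked-out code
  command: "python -c 'import {{ python_package }}; print({{ python_package }}.__version__)'"
  args:
    chdir: "{{ clone_root }}/{{ repo }}"   # run from the project checkout
  register: app_version_result
  changed_when: false                      # read-only; never reports a change

- name: Set python_app_version
  set_fact:
    python_app_version: "{{ app_version_result.stdout | trim }}"
```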

Add tests for postgresql role

We need to use the approach Francis recommended: install postgres locally on the target container, rather than testing it between two containers.

Finish configuring projects for new rotating logs

The following work needs to happen to make rotating logs happen on all projects:

  • Configure log rotation on cdh and derridas-margins
  • Place log files in updated location.
  • Add logging roles to cdh and derridas-margins playbooks

revisit pre-commit hook

Is your feature request related to a problem? Please describe.
The current pre-commit hook script is custom, and will be hard to maintain if we add more encrypted files with different names.

Describe the solution you'd like
Suggest implementing with pre-commit, either using repository-local hooks or an existing pre-commit hook like this one: https://github.com/IamTheFij/ansible-pre-commit

Then the files that need to be checked would be configured in the pre-commit yaml instead of being hard-coded in the script.
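A sketch of what a repository-local hook might look like (the hook id, name, and the vault-file pattern are all assumptions; ansible-vault files start with a $ANSIBLE_VAULT header line):

```yaml
# .pre-commit-config.yaml (sketch)
repos:
  - repo: local
    hooks:
      - id: check-vault-encrypted          # hypothetical hook id
        name: Ensure ansible-vault files are encrypted
        # fail if any matched file does not start with the vault header
        entry: sh -c 'for f in "$@"; do head -1 "$f" | grep -q "^\$ANSIBLE_VAULT" || { echo "$f is not encrypted"; exit 1; }; done' --
        language: system
        files: 'vault\.yml$'               # hypothetical pattern; adjust to the repo layout
```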
