
System administration tools

Home Page: https://ooni.org


sysadmin's Introduction

OONI sysadmin

In here live all the tools and scripts related to administering the infrastructure that is part of OONI.

Getting started

It is recommended that you use a Python virtualenv to install ansible and dnspython.
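
For example, a minimal sketch (the virtualenv path is arbitrary):

python3 -m venv ~/venvs/ooni-sysadmin
. ~/venvs/ooni-sysadmin/bin/activate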

Once you have activated the virtualenv, you can install the correct version of ansible and the other required packages with:

pip install ansible==2.9.16
pip install requests
pip install requests[socks]

Ansible roles

All OONI team members are required to run the same ansible version to minimise compatibility issues. Every playbook should start by importing ansible-version.yml:

---
- import_playbook: ansible-version.yml

If you need some Python packages only for an ansible module to work and don't need them in the system-wide pip installation, then you should put these modules in a separate virtual environment and set the proper ansible_python_interpreter for the play. See the docker_py role and grep for /root/venv for examples.
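
For example (illustrative only; the repo wires this up per play, but the same variable can also be set ad hoc, and the playbook name below is made up):

ansible-playbook some-play.yml -e ansible_python_interpreter=/root/venv/bin/python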

If you need to store secrets in the repository, store them as vaults, using the ansible/vault script as a wrapper for ansible-vault. Store encrypted variables with a vault_ prefix to make the world a more greppable place, and link the location of each variable using the same name without the prefix in the corresponding vars.yml. scripts/ansible-syntax-check checks the links between vaults and plaintext files during the Travis build. The ansible/play wrapper for ansible-playbook will execute a playbook with the proper vault secret and inventory.
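
For illustration only (the variable name is made up), the convention pairs an encrypted value with a plaintext reference, which keeps secrets easy to grep for:

# in the encrypted vault file:  vault_smtp_password: <the actual secret>
# in the plaintext vars.yml:    smtp_password: "{{ vault_smtp_password }}"
git grep -n vault_smtp_password ansible/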

In order to access secrets stored inside the vault, you will need a copy of the vault password encrypted with your PGP key. This file should be stored at ~/.ssh/ooni-sysadmin.vaultpw.gpg.
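
To sanity-check that you can decrypt it (a hypothetical one-liner; the file is presumably consumed by the ansible/play wrapper):

gpg --decrypt ~/.ssh/ooni-sysadmin.vaultpw.gpg > /dev/null && echo "vault password OK"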

SSH Config

You should configure your ~/.ssh/config with the following:

Ciphers aes128-ctr,aes256-ctr,aes128-cbc,aes256-cbc
IdentitiesOnly yes
ServerAliveInterval 120
UserKnownHostsFile ~/.ssh/known_hosts ~/PATH/TO//ooni/sysadmin/ext/known_hosts

Replace ~/PATH/TO//ooni/sysadmin/ext/known_hosts with the path to where you have cloned the ooni/sysadmin repo. This will ensure you use the host key fingerprints from this repo instead of relying only on TOFU.

You probably also want to add:

host *.ooni.io
  user USERNAME
  identityfile ~/.ssh/id_rsa

host *.ooni.nu
  user USERNAME
  identityfile ~/.ssh/id_rsa

You should replace USERNAME with your username from adm_login.

On macOS you may want to also add:

host *
    UseKeychain yes

This allows the Keychain to be used to store key passphrases.

M-Lab deployment

M-Lab deployment process.

Upgrading OONI infrastructure

ooni-backend pitfalls

  • Ensure that the HS private keys of the bouncer and collector are in the right path (collector/private_key, bouncer/private_key).
  • Set the bouncer address in bouncer.yaml to the correct HS address.
  • ooni-backend will not generate missing directories and will fail to start.

Running a short ooni-probe test will verify that the backend has been successfully upgraded. An example test:

ooniprobe --collector httpo://CollectorAddress.onion blocking/http_requests \
--url http://ooni.io/

New host HOWTO

  • come up with a name for $name.ooni.tld using DNS name policy
  • create a VM to allocate IP address
  • create A record for the domain name in namecheap web UI (API is hell)
  • fetch external inventory with ./play ext-inventory.yml, it'll create a git commit
  • add $name.ooni.tld to location tags section of inventory file, git-commit it
  • write firewall rules to templates/iptables.filter.part/$name.ooni.tld if needed, git-commit it
  • bootstrap VM with ./play dom0-bootstrap.yml -l $name.ooni.tld
  • update Prometheus with ./play deploy-prometheus.yml -t prometheus-conf
  • check inventory sanity with ./play inventory-check.yml (everything should be ok, no changes, no failures), update inventory-check.yml with new checksum, git-commit it
  • git push those commits

DNS name policy

Public HTTP services are named ${service}.ooni.org or ${service}.ooni.io (legacy). Public means that it's part of some external system we can't control: published APP or MK versions, web URLs and so on. Public names should not be used as an inventory_hostname to ease migration.

VMs should have FQDN like ${location}-${name}-${number}.ooni.org. VMs can provide one or more public-facing services that can change over time. The name should be as descriptive as possible e.g. the type of services or the most important service being run.

Various legacy names should be cleaned up while re-deploying VMs with a newer base OS version.

Rename host HOWTO

First, try hard to avoid renaming hosts. It's a pain:

  • inventory_hostname is stamped in Prometheus SSL certificates
  • inventory_hostname is stamped as FQDN inside of firewall rules
  • inventory_hostname is stamped as filename for firewall rules
  • hostname is stamped in /etc/hosts on the host
  • hostname is stamped as kernel.hostname on the host
  • some applications use hostname as configuration value, e.g. MongoDB

But sometimes re-deploying is also not an option, due to GH platform limitations. So...

  • use New host HOWTO as a checklist to keep in mind
  • on-rename tag can save some time while running dom0-bootstrap: ./play dom0-bootstrap.yml -t on-rename -l $newname.ooni.tld
  • grep ooni/sysadmin for inventory_hostname, $oldname, $newname (NB: not just oldname.ooni.tld, the short name may be used somewhere as well); see the sketch below
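
A sketch of that grep (the names are placeholders):

git grep -nE "inventory_hostname|$oldname|$newname"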

PostgreSQL replica bootstrap

pg_basebackup is nice, but it does not support network traffic compression out of the box and has no obvious way to resume an interrupted backup. rsync solves that issue, but it needs either WAL archiving (to external storage) to be configured or wal_keep_segments to be non-zero, because otherwise WAL logs are rotated ASAP (min_wal_size and max_wal_size do not set the amount of WAL available to the reader; these options set the amount of disk space allocated to the writer!). Also, a replication slot may reserve WAL on creation, but beware: by default it postpones WAL reservation until the replica connects.

pg_start_backup() + rsync -az --exclude pg_replslot --exclude postmaster.pid --exclude postmaster.opts is the way to go. And, obviously, don't exclude pg_wal (aka pg_xlog) if neither WAL archiving nor a replication slot is set up.
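
A hedged sketch of that sequence, run on the primary (host name, data directory and PostgreSQL version are placeholders):

sudo -u postgres psql -c "SELECT pg_start_backup('replica_bootstrap', true);"
rsync -az --exclude pg_replslot --exclude postmaster.pid --exclude postmaster.opts \
    /var/lib/postgresql/9.6/main/ replica.example.org:/var/lib/postgresql/9.6/main/
sudo -u postgres psql -c "SELECT pg_stop_backup();"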

And don't forget to revoke authorized_keys if SSH was used for rsync!

You may also want to bootstrap the replica from MetaDB snapshot in S3 and switch it to streaming replication afterwards.

Updating firewall rules

If you need to update the firewall rules, because you added a new host to the have_fw group or changed the hostname of a host, edit the file templates/iptables.filter.part/$HOSTNAME and then run:

./play dom0-bootstrap.yml -t fw

sysadmin's People

Contributors

aagbsn, anadahz, bassosimone, darkk, david415, decfox, federicoceratto, hellais, joelanders, kwadronaut, majakomel, nullhypothesis, sarathms, superq


sysadmin's Issues

error while evaluating conditional: set_ooniprobe_pip == true

When running the install ooniprobe ansible playbook, an error occurs when evaluating the set_ooniprobe_pip conditional in the main ooniprobe task.

TASK: [ooniprobe | Add required Debian apt repository] ************************ 
fatal: [xx.xx.xx.xx] => error while evaluating conditional: set_ooniprobe_pip == true

FATAL: all hosts have already failed -- aborting

To fix this error, the current line can be changed to check whether set_ooniprobe_pip is defined before checking whether it is equal to true.

when: set_ooniprobe_pip is defined and set_ooniprobe_pip == 'true'

38.107.216.10 down for 1 hour

Detection: alert from prometheus

Timeline UTC:
01 Aug 22:19: [FIRING] ooni http return json headers
01 Aug 23:21: [RESOLVED] (no actions were taken)

What is still unclear:

  • what impact does it have? does the client automatically try a different test helper?
  • why are we using that test helper, hosted somewhere on Google's network?
  • what is the cause? was it maintenance? was there an announcement for it?

Move build, test and staging to HKG

At least the following VMs are used in a non-interactive way, so I see no reason for them to be in AMS:

  • debuilder.infra.ooni.io
  • ubuilder.infra.ooni.io
  • staging.measurements.ooni.io
  • ooni-probe-demo / demo.probe.ooni.io (do we need both?)
  • ooni-fdroid-build (BTW, does it need 6G ram?)

The following test VMs raise the question: are they part of some daily test procedure, or just single-shot on-demand VMs that were never deleted?

  • oonibackend-stress-test
  • apk-build-test
  • test-qemu
  • ooni-backend-testing
  • ooniprobe-test

Monitoring of infrastructure with alerts

We currently have some basic monitoring setup here: https://munin.ooni.io/.

However it's currently not sending us alerts when something bad happens, which I guess is an improvement, but still not perfect.

What is needed in order to have us receive alerts when something unexpected happens on the infrastructure? I believe the issue lies in the fact that our @oo accounts don't receive the emails unless they are sent from a legit SMTP server.

What about setting up a gmail account for notifications and using that?

Alternatively we could go for #53, but I suspect that is going to be more laborious; also, if the email server goes down, who is going to alert us of that?

Consolidate all inventories so that anyone of us can deploy any piece of the infrastructure

@darkk suggested that it's possible to use ansible vaults to store secrets encrypted and commit them directly to this tree.

I think we agree on the fact that there are fairly few secrets inside of the actual inventories (I believe we don't consider hostnames, IPs and ports to be secret).

It would be good to start doing this for at least one of the services so that we have an idea of how it works for one service and @darkk and I can start learning how to deploy them.

I believe this would also encourage me to use ooni-sysadmin more and do fewer deployment tasks manually.

munin slack notification broken since setup

Impact: munin alerts were non-functional from May 17 until Aug 11

Detection: unexpected alert flood from ooni-munin to slack, noticed by @hellais

Timeline UTC:
17 May 03:30: notify_slack_munin deployed at munin.ooni.io
11 Aug 17:22: darkk deploys dom0-defaults to all:!no_passwd from #136 without filters
11 Aug 17:25: alert flood starts (as there are a couple of boxes in a warning state and munin alerts on all issues every tick)
11 Aug 18:22: hellais disabled an integration in #ooni-bots channel: ooni-munin
12 Aug 07:00: incident published

What went well:

  • it was quite easy to silence munin alerting :-)

What went wrong:

  • notify_slack_munin required curl and had been broken since the initial setup
  • darkk did not notice the alert flood as it's not relayed to IRC, and went AFK half an hour later
  • the innocent-looking apt-get install curl changed the behavior of a running system

What could be done to prevent relapse and decrease impact:

  • general rule: avoid any changes to live systems if you're going AFK soon :)
  • another one: test alerting (e.g. lowering thresholds) while deploying it

slackin: down for two days

Impact: slackin was unavailable for users to join community meeting

Detection: alert from @agrabeli_

Timeline UTC:
29 May 20:40 commit d4de11b with typo
16 Jun 14:19 previous commit becomes commit 2c6f623 after merging #89 and rewriting history
24 Jun 21:01 @darkk reverts master and opens PR #114 because of syntax error in vaulted yaml
25 Jun 20:10 nginx fails to start with [emerg] 122#122: invalid port in upstream "slackin:3000'" in /etc/nginx/conf.d/slack.openobservatory.org.ssl.conf:18 in https-portal container at slack.openobservatory.org
27 Jun 12:20 @agrabeli_ alerts @darkk on slackin being down
27 Jun 12:37 slackin is back up after 47895ae

What went well:

  • @agrabeli_ checked that slackin works a couple of days before community meeting
  • it was easy to fix slack.openobservatory.org as deployment recipe was already there

What went wrong:

  • rewritten history for production infrastructure updates
  • broken master for a week
  • @darkk assumed that the root cause was not the typo but yet another docker DNS bug

What could be done to prevent relapse and decrease impact:

  • alerting of TLS certificate freshness (reason for #89)
  • alerting for slackin and slack-irc services
  • local pre-commit git-hook that checks basic syntax of ansible playbooks

tighten sync-user permissions with rrsync

sync-user can execute arbitrary commands at the moment.
IMHO, it makes sense to limit it with command="$HOME/bin/rrsync /data/fqdn/renamed",no-agent-forwarding,no-port-forwarding,no-pty,no-user-rc,no-X11-forwarding ssh-rsa... in authorized_keys to reduce its privileges.

explorer was half-broken for 16 hours

Impact: some users accessing explorer web-interface could not load /scripts/vendor.js, so web UI was broken for them

Detection: alert from user

Timeline UTC:
11 Aug 15:22...15:37: @darkk rolls out node_exporter with nginx-prometheus to all:!no_passwd
11 Aug 15:37 first [crit] 13064#0: *1471 open() "/var/lib/nginx/proxy/7/00/0000000007" failed (13: Permission denied) while reading upstream ... record in /var/log/nginx/error.log, NB: this error shows NO anomalies in access.log
12 Aug 02:18: dcf1 reports that explorer is not rendering, followed in 15 minutes by Suddenly started working again
12 Aug 07:30: @hellais confirms: Chrome gives a ERR_INCOMPLETE_CHUNKED_ENCODING error when loading https://explorer.ooni.torproject.org/scripts/vendor.js
12 Aug 07:46: @darkk applies sudo chown -R www-data /var/lib/nginx/ to explorer
12 Aug 07:46: last [crit] 13066#0: *20326 open() "/var/lib/nginx/proxy/3/34/0000000343" failed (13: Permission denied) log line
12 Aug 08:08: @darkk applies chown to other possibly affected hosts (all:!no_nodeexp having two nginx master processes running). These services seem to face NO real issues as there are no corresponding records in nginx/error.log. These hosts are:

  • slack.openobservatory.org
  • munin.ooni.io
  • explorer.ooni.io
  • get.ooni.io
  • measurements.ooni.io
  • notify.proteus.test.ooni.io
  • prometheus.infra.ooni.io
  • measurements-beta.ooni.io
  • ooni-zoo.infra.ooni.io

13 Aug 14:20: incident published

What went well:

  • @hellais re-checked the user's report even after the service came up again

What went wrong:

  • darkk adjusted the separate nginx installation to avoid conflicts with the global logrotate, but forgot about other global nginx objects like temporary directories

What is still unclear:

  • should we serve static files via nginx/cache instead of fetching them from application?
  • will some sort of exception tracking catch that?
  • is Domain Reliability AKA Network Error Logging already adopted by browsers? It seems not (WICG, W3C), but I may be missing something.

What could be done to prevent relapse and decrease impact:

  • create separate temporary directories for nginx-prometheus
  • start nginx-prometheus as unprivileged user instead of default start-as-root mode
  • set up alerting based on error.log, e.g. using the google/mtail exporter or fstab/grok_exporter or something based on exporting the log via the syslog interface

debuilder.infra.ooni.io compromised a month ago

Impact: RCE on a build host unnoticed for a month

Detection: manual observation of anomalies on a broken box

Timeline UTC:
02 Jul 23:41: start time of payload running under jenkins uid
09 Jul 07:43: first No space left on device in the series Jul-09...Jul-23
04 Aug 18:17: dom0-bootstrap.yml failed during apt update: 100% disk space used, 93% inodes used (logrotate is broken since January)
04 Aug 18:45: incident published
04 Aug 19:20: debuilder.infra.ooni.io turned off

What went wrong:

What could be done to prevent relapse and decrease impact:

  • network-level authentication (SSH, VPN, nginx with basic auth) on internal services
  • logrotate monitoring
  • alert on high usage of CPU (miners) or NET (DDoS bots)
  • alert on SMTP traffic

Improve automation of ansible role for ooni-backend

Ansible role should better automate ooni-backend deployment.

  • Set up a daily job to generate the bouncer.yaml from data pulled from mlabns.
  • Create a bouncer-base.yaml with the appropriate HS hostnames generated during the setup.
  • Include a configuration option to deploy the bouncer with already given HS hostnames and keys.

Deploy production HTTPS bouncer, collector and web_connectivity test helper

This ticket is about updating the canonical bouncer, collector and web_connectivity test helper to support HTTPS.

I believe the bouncer YAML configuration file should look something like this:

collector:
  httpo://ihiderha53f36lsd.onion:
    collector-alternate:
    - {address: 'https://a.collector.ooni.io', type: 'https'}
    test-helper: {dns: '213.138.109.232:57004', ssl: 'https://213.138.109.232', tcp-echo: '213.138.109.232', traceroute: '213.138.109.232', web-connectivity: 'httpo://7jne2rpg5lsaqs6b.onion'}
    test-helper-alternate:
        web-connectivity:
        - {address: 'https://web-connectivity.ooni.io', type: 'https'}

Let me know if you need some A records to be setup for this.

Make the canonical collector and test helpers listen on port 443

It seems like it's fairly common for some networks (such as mobile ones) to block access to anything that is not traffic to common ports.

Currently the canonical collector binds on port 4441 and the canonical web connectivity test helper binds on 4442.

It would be great if we could set up another machine just to host the collector and web connectivity test helper, with a dedicated IP for each of those services.

The added benefit of this would be the ability to divert clients to another working collector from the bouncer (which would then reside on its own machine) in case we encounter difficulties with it.

Update scripts for installing ooni-backend with SSL support

The TLS endpoint branch adds a new configuration option called collector_endpoints and bouncer_endpoints.

We should update the ansible scripts so that they take these configuration changes into account and set them accordingly if we wish to set up an SSL collector and bouncer.

Improve accounts on infrastructure

@anadahz has been talking about improving how accounts are managed and created on the various VMs that run pieces of ooni infrastructure.

I don't have a clear understanding of what the plans are for that; the thing I do remember from these discussions is that having us all log in as root on the machines is bad and we should have per-user accounts.

@anadahz what are your ideas on this topic?

Auto renewal of SSL certificates is broken on the testing collector

I see that there is a cronjob in place to make SSL certs for the testing collector renew automatically, but it doesn't seem to work.

In particular running:

certbot  certonly --standalone --standalone-supported-challenges http-01 --noninteractive --text --agree-tos --email [email protected] --domains test.ooni.io,bouncer.test.ooni.io,b.collector.test.ooni.io,a.web-connectivity.th.test.ooni.io --pre-hook "docker stop ooni-backend-testing-bouncer" --post-hook "cp /etc/letsencrypt/live/test.ooni.io/privkey.pem /etc/letsencrypt/live/test.ooni.io/fullchain.pem  /data/testing-bouncer/tls   && docker start ooni-backend-testing-bouncer"  --pre-hook "docker stop ooni-backend-testing-bouncer" --post-hook "cp /etc/letsencrypt/live/test.ooni.io/privkey.pem /etc/letsencrypt/live/test.ooni.io/fullchain.pem  /data/testing-bouncer/tls   && docker start ooni-backend-testing-bouncer"

Produces:

-------------------------------------------------------------------------------
The program docker-proxy (process ID 16420) is already listening on TCP port 80.
This will prevent us from binding to that port. Please stop the docker-proxy
program temporarily and then try again.
-------------------------------------------------------------------------------
At least one of the (possibly) required ports is already taken.

This has led to the current testing instances (the ones used by measurement-kit) being unavailable for 48 hours due to this certificate issue.

I am going to proceed now with manually renewing them.

Cleanup duplicate SSH keys

The GH template had an issue where SSH host keys were the same for all VMs.
We still have some VMs with the same RSA-2048 and Ed25519 keys:
ssh-keyscan -t rsa,ed25519 $hosts | sort -k 3

  • b.collector.ooni.io
  • b.web-connectivity.th.ooni.io
  • demo.probe.ooni.io
  • measurements.ooni.io
  • munin.ooni.io
  • ooni-measurements-db.infra.ooni.io
  • staging.measurements.ooni.io

IMHO these keys should be re-generated.
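
One hedged way to do that, assuming Debian-based hosts (the new fingerprints would then need to land in ext/known_hosts):

rm /etc/ssh/ssh_host_* && dpkg-reconfigure openssh-server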

Cleanup orchestration deployment

Cleanup after #112:

  • apt_key using fingerprint instead of url
  • apt_repository using jessie instead of stable (for yarn)
  • cleanup proteus-db SQL scripts (they were one-shot)
  • different postgresql users for different micro-services
  • CA for postgres sslmode=require
  • usage of SHA256SUMS to avoid re-downloading the binary to deploy configuration change, depends on ooni/orchestra#22
  • cleanup variables with same name defined in different roles (e.g. proteus_user)
  • cleanup handlers with same name defined in different roles (e.g. restart proteus)
  • cleanup variables like proteus_***_testing splitting them into separate group
  • determine the destiny of notification_backend: proteus, it's disabled, but is it dead code?
  • expose gorush metrics

Update node-irc under slack-irc-bridge

The docker container in use seems to have an old node-irc module with a bug disrespecting messageSplit that was fixed in martynsmith/node-irc#385.

It leads to messages being truncated in the middle of a word and losing ~20 characters when relaying large messages from slack to IRC. That's quite annoying :(

Also, Slack bot messages likely do not propagate to IRC anymore; see the hellais fork for the feature.

  • messageSplit
  • Slack bot -> IRC (for monitoring) (2018-08-30: we don't rely on IRC logs of #ooni-bots anymore)

Airflow pipeline in da docker (part 2)

See #105

  • put ooni metadb under the role
  • link secrets to access oometadb from shovel through airflow variables somehow
  • common jumphost for ext-inventory.yml playbook
  • upstream updated /entrypoint.sh
  • clean dead hosts from inventory and DNS
  • apply dom0-bootstrap.yml to remaining hosts

b.web-connectivity.th down for 7.6 hours

Impact: TBD (AFAIK clients should use another test helper if one of them is down, but that may not be the case).

Detection: repeated email alert

Timeline UTC:
00:00 four periodic jobs start: certbot@cron, certbot@systemd, munin plugins apt update, /data/b.web-connectivity.th.ooni.io/update-bouncer.py
00:00 2017-07-27 00:00:03,678:DEBUG:certbot.storage:Should renew, less than 30 days before certificate expiry 2017-08-25 23:01:00 UTC.
00:00 2017-07-27 00:00:03,678:INFO:certbot.hooks:Running pre-hook command: docker stop ooni-backend-b.web-connectivity.th.ooni.io
00:00 2017-07-27T00:00:03+0000 [-] Received SIGTERM, shutting down.
00:00 Jul 27 00:00:13 b dockerd[366]: time="2017-07-27T00:00:13.717964674Z" level=info msg="Container 6a4379167c880b295f7383d6eab8fc7b9e422ac1b0e6df0ab5cfefa2524fd512 failed to exit within 10 seconds of signal 15 - using the force"
00:00 2017-07-27 00:00:32,230:INFO:certbot.hooks:Running post-hook command: cp ... && docker start ooni-backend-b.web-connectivity.th.ooni.io
00:00 2017-07-27T00:00:34.710428510Z Another twistd server is running, PID 1
00:01 [FIRING] Instance https://b.web-connectivity.th.ooni.io/status down
00:34 2017-07-27 00:34:22,763:INFO:certbot.renewal:Cert not yet due for renewal
07:00 darkk@ wakes up
07:31 darkk@ logs into b.web-connectivity.th.ooni.io
07:33 2017-07-27T07:33:31.007042572Z Pidfile /oonib.pid contains non-numeric value after an attempt to truncate --size 0 /var/lib/docker/aufs/diff/096b1a00f4529b788ee6f062929dc54540b9b06171c52a8957da8bb88c1ec094/oonib.pid
07:34 2017-07-27T07:34:00.767235934Z Removing stale pidfile /oonib.pid after echo 42 >/var/lib/docker/aufs/diff/096b1a00f4529b788ee6f062929dc54540b9b06171c52a8957da8bb88c1ec094/oonib.pid
07:36 [RESOLVED] Instance https://b.web-connectivity.th.ooni.io/status down
09:50 incident published

What went well:

  • repeated alerting made it clear that the problem is ongoing
  • although I had no darkk user at the host, I had my key at root's authorized_keys
  • docker aufs makes it easy to do amendments to container FS while it is stopped, device-mapper based backend is worse in that case

What went wrong:

  • that was a relapse of the pid=1-in-pidfile problem
  • the View in AlertManager link in email and slack notifications is not clickable
  • the two certbot launchers /etc/systemd/system/timers.target.wants/certbot.timer and /etc/cron.d/letsencrypt_renew_certs-... are confusing; I did not notice one of them at first and spent some time looking for the trigger of the SIGTERM
  • ooni-backend-b.web-connectivity.th.ooni.io container stores significant amount of logs inside, these logs take 7G and overall container size limit is 15G, historical logs are useful for sure, but: 1. renaming of 7400 files on rotation may be slow and twisted is single-threaded, 2. they should not overflow container disk quota
  • twisted logs do not contain milliseconds, which are useful to distinguish two nearby events from a single event producing several log lines. docker logs stores nanoseconds, but oonib does not log to stdout (docker)

What is still unclear and should be discussed:

  • impact. Do applications really fall back to another test helper?
  • do we need more insistent notification system that can wake people up?

What could be done to prevent relapse and decrease impact:

  • move letsencrypt updates to team "business hours", people sleep at UTC midnight
  • avoid stale pid files somehow: cleanup twisted PID file on container start || grab flock on pid || randomize daemon pid so it's not pid=1
  • fix AlertManager link
  • fix two certbot launchers
  • limit number of /var/log/ooni/oonibackend.log.* files in the container
  • make /var/log/ooni/oonibackend.log.* files renamed once (e.g. oonibackend.log.${unix_timestamp})
  • milli- or microseconds in twisted logs || logging to stdout (to docker)
  • ensure that same storage driver is used across all containers #96

measurements: SSL certificate expired for 17 hours

Impact: https://measurements.ooni.torproject.org/ was giving SSL certificate errors to users trying to access it due to an expired certificate.

Timeline UTC:
4 Jul 13:39: [FIRING] Instance https://measurements.ooni.torproject.org/api/v1/files down
4 Jul 16:34: On IRC @hellais: @darkk are you doing something with measurements API?
4 Jul 16:36: On IRC @darkk no
5 Jul 07:05: @hellais restarts https-portal docker service and SSL certificate is renewed
5 Jul 07:19: [RESOLVED] Instance https://measurements.ooni.torproject.org/api/v1/files down

What went well:

  • Alerting was timely and relevant.
  • Seems, alerting checks certificate validity time (?)

What went wrong:

  • Alerting to IRC was unavailable due to regression #119
  • It was not clear who was responsible for bringing the service back up
  • The root cause of the issue was not identified

What could be done to prevent relapse and decrease impact:

  • https-service should be debugged to ensure that it does not miss renewal in the future (this has happened before with other services)

Collector has runtime exception on missing policy.yml file

This can be a bit confusing for administrators, as it looks like things are working until some HTTP request gives a 500 error.

I'd recommend we load the policy.yml at startup and complain / bail / write out defaults at that time.

Default https web-connectivity helper and collector are not on port 443

So, I am running an ooniprobe web deck using TheTorProject/ooni-probe#723. I have configured it to use https for the bouncer, helpers and collector, and I have seen that the helper in use is:

{
    "type": "https",
    "address": "https://a.web-connectivity.th.ooni.io:4442"
}

I believe this has to do with the way in which the bouncer is configured.

To gather more info, I also ran the following:

$ ooniprobe -v web_connectivity -u http://www.kernel.org/
No test deck detected
Checking if backend is present
Checking if dns-discovery is present
Tor is not running. Skipping IP lookup via Tor.
Looking up your IP address via ubuntu
Found your IP via a GeoIP service
Running task update-inputs
Did not run update-inputs
Looking up collector and test helpers with https://bouncer.ooni.io
Querying backend https://bouncer.ooni.io/bouncer/net-tests with {'net-tests': [{'test-helpers': ['web-connectivity'], 'version': '0.1.0', 'name': 'web_connectivity', 'input-hashes': []}]}
Querying backend https://a.collector.ooni.io:4441/invalidpath with None
Got this backend error message {u'error': 404}
Querying backend https://a.web-connectivity.th.ooni.io:4442/status with None
Setting collector and test helpers for web_connectivity
Using collector <ooni.backend_client.CollectorClient object at 0x10f98fb90>
Starting f7893b3b346aaa9f (user-run)
Creating report with OONIB Reporter. Please be patient.
This may take up to 1-2 minutes...
Querying backend https://a.collector.ooni.io:4441/report with {'data_format_version': '0.2.0', 'software_name': 'ooniprobe', 'test_version': '0.1.0', 'software_version': '2.2.0.rc0', 'test_name': 'web_connectivity', 'test_start_time': '2017-01-25 22:22:36', 'format': 'json', 'input_hashes': [], 'probe_asn': 'AS3269', 'probe_cc': 'IT'}
Created report with id 20170125T222108Z_AS3269_bG48VXWBaGjbuqNeLCuBLpD8ty2AE9zltMpQWXNRENBGZHk7EC
Starting this task <generator object generateMeasurements at 0x10e171410>
Finished test setup
Running <<class 'ooni.nettest.NetTestCaseWithLocalOptions'> inputs=[None]> test_web_connectivity

Starting test for http://www.kernel.org/
* doing DNS query for www.kernel.org
Checking all tasks for completion 0 == 1
A Lookup successful
([<RR name=www.kernel.org type=CNAME class=IN ttl=600s auth=False>, <RR name=pub.all.kernel.org type=A class=IN ttl=600s auth=False>, <RR name=pub.all.kernel.org type=A class=IN ttl=600s auth=False>, <RR name=pub.all.kernel.org type=A class=IN ttl=600s auth=False>], [], [])
Adding [Query('www.kernel.org', 1, 1)] to report)
* connecting to pub.all.kernel.org:80
* connecting to 198.145.20.140:80
* connecting to 199.204.44.194:80
* connecting to 149.20.4.69:80
* doing HTTP(s) request http://www.kernel.org/
Performing request http://www.kernel.org/ GET {'Accept-Language': ['en-US;q=0.8,en;q=0.5'], 'Accept': ['text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'], 'User-Agent': ['Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36']}
Got response <twisted.web._newclient.Response object at 0x10f98f090>
Processing response body
Adding {'url': 'http://www.kernel.org/', 'headers': {'Accept-Language': ['en-US;q=0.8,en;q=0.5'], 'Accept': ['text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'], 'User-Agent': ['Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36']}, 'body': None, 'method': 'GET', 'tor': {'is_tor': False, 'exit_name': None, 'exit_ip': None}} to report
Adding {'url': 'http://www.kernel.org/', 'headers': {'Accept-Language': ['en-US;q=0.8,en;q=0.5'], 'Accept': ['text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'], 'User-Agent': ['Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36']}, 'body': None, 'method': 'GET', 'tor': {'is_tor': False, 'exit_name': None, 'exit_ip': None}} to report
* performing control request with backend
Querying backend https://a.web-connectivity.th.ooni.io:4442/ with {'tcp_connect': ['pub.all.kernel.org:80', '198.145.20.140:80', '199.204.44.194:80', '149.20.4.69:80'], 'http_request': 'http://www.kernel.org/'}
Got this backend error message {u'error': u'Invalid tcp_connect URL'}
[!] Failed to perform control lookup: unknown_failure 

Result for http://www.kernel.org/
---------------------------------
* Could not determine status of blocking due to failing control request
* Is accessible

Successfully completed measurement: <ooni.tasks.Measurement object at 0x10f9ac250>
Starting this task <ooni.tasks.ReportEntry object at 0x10f98f610>
Successfully performed report <ooni.tasks.ReportEntry object at 0x10f98f610>
None
Starting this task <ooni.tasks.ReportEntry object at 0x10fccc990>
Updating report with id 20170125T222108Z_AS3269_bG48VXWBaGjbuqNeLCuBLpD8ty2AE9zltMpQWXNRENBGZHk7EC
Querying backend https://a.collector.ooni.io:4441/report/20170125T222108Z_AS3269_bG48VXWBaGjbuqNeLCuBLpD8ty2AE9zltMpQWXNRENBGZHk7EC with {'content': "... snip ...", 'format': 'json'}
Successfully performed measurement <ooni.tasks.Measurement object at 0x10f9ac250>
None
Status
------
1 completed 0 remaining
0.0% (ETA: 0s)
Checking all tasks for completion 1 == 1
Summary for web_connectivity
----------------------------

Accessible URLS
---------------
* http://www.kernel.org/
Report ID: 20170125T222108Z_AS3269_bG48VXWBaGjbuqNeLCuBLpD8ty2AE9zltMpQWXNRENBGZHk7EC
Closing report with id 20170125T222108Z_AS3269_bG48VXWBaGjbuqNeLCuBLpD8ty2AE9zltMpQWXNRENBGZHk7EC
Querying backend https://a.collector.ooni.io:4441/report/20170125T222108Z_AS3269_bG48VXWBaGjbuqNeLCuBLpD8ty2AE9zltMpQWXNRENBGZHk7EC/close with None
Successfully performed report <ooni.tasks.ReportEntry object at 0x10fccc990>
None
Deleting log file
Finished f7893b3b346aaa9f (user-run)

In addition to the issue, there is also another issue that I will soon report to ooni-probe.

explorer: down for 10 minutes

Impact: People accessing explorer during that time got a nasty 504 error.

Timeline UTC:
13 Jun 13:23 [FIRING] Instance https://explorer.ooni.torproject.org/api/reports/countByCountry down
13 Jun 13:26 @hellais reverts broken commit: ooni/explorer-legacy@e231ae9
13 Jun 13:28 On Slack: @hellais I am updating ooni-explorer and I merged a PR from @anadahz that contained a syntax error.
13 Jun 13:28 [RESOLVED] Instance https://explorer.ooni.torproject.org/api/reports/countByCountry down

What went well:

  • alerting was prompt and relevant!

What went wrong:

  • @hellais deployed untested code on a production system

What could be done to prevent relapse and decrease impact:

  • have CI tests for ooni-explorer that would have caught the syntax error
  • When explorer is down we could show a more comforting error message instead of the stock nginx 5xx error page
  • Do some local smoke testing before merging pull requests
  • Do some local smoke testing before opening pull requests

Cleanup grav CMS deployment

After #126:

  • fatal: various configuration & data changes are dropped on re-deployment of the playbook -- check if users was the only affected entity
  • grav is system of record, so it should have proper backups with freshness monitoring
  • /srv/grav should be mostly root-owned
  • cover everything with SSO cookie and expose public part explicitly
  • download source files with checksum
  • it's unclear if the php5-fpm process should have the www-data gid to listen on the socket.
  • monitor /ping page
  • check if our php.ini is php.ini-production or php.ini-development :)

Change A records for stage and test ooni-backends

Stage and test ooni-backends have been migrated to a new server with a different
set of IPs. Please change/set the following DNS A records:

Staging

stage.ooni.io A 37.218.240.139
bouncer.stage.ooni.io A 37.218.240.139
b.collector.stage.ooni.io A 37.218.240.139
a.web-connectivity.th.stage.ooni.io A 37.218.240.139

Testing

test.ooni.io A 37.218.240.140
bouncer.test.ooni.io A 37.218.240.140
b.collector.test.ooni.io A 37.218.240.140
a.web-connectivity.th.test.ooni.io A 37.218.240.140

group_vars not found relative to YAML file.

The installation instructions for both ooni-backend and ooniprobe imply (through the path to install-ooniprobe.yml/install-oonibackend.yml) that they should be run from the root of the ooni-sysadmin repo. If run from the root of the repo, the following error will occur.

TASK: [common | setup tor apt repo] ******************************************* 
fatal: [xx.xx.xx.xx] => One or more undefined variables: 'tor_distribution_release' is undefined

FATAL: all hosts have already failed -- aborting

This error occurs because ansible looks for the group_vars folder relative to its YAML file. The error can be fixed by moving the group_vars folder into the ansible folder, or by placing a symlink to it in the ansible folder.
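
A sketch of the symlink workaround (assuming the playbooks live in ansible/ and group_vars/ sits at the repo root):

ln -s ../group_vars ansible/group_vars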

Cleanup piwik analytics deployment

After #127:

  • use explicit piwik version during deploy
  • use tar instead of zip during deploy
  • re-enable letsencrypt role
  • cleanup variables with same name defined in different roles (e.g. mysql_host, server_name)
  • it's unclear if the php5-fpm process should have the www-data gid to listen on the socket.
  • define "default" set of protocols / cyphers / DH params and so on for all nginx instances
  • /srv/piwik/releases/*/piwik should be mostly root-owned
  • monitor /ping page
  • fatal: expires max on file that is changed on update makes it cached in browsers forever, re-deploy piwik on different domain to fix that

Cleanup prometheus deployment

After the MVP in #108 there is some technical debt:

  • grafana & prometheus web-interfaces on non-standard ports
  • duplication between nginx re-configuration
  • prometheus_basic_auth_password is reused to protect both agents and web-ui
  • node_exporter role is not moved to dom0 template
  • passlib is obviously NOT a part of docker_py
  • separate nginx instance as an authenticating frontend for exporters as main instance may be DoSed by ~2k connections
  • separate CA for client and server to ease configuration and verification
  • enable gzip
  • cleanup useless node_exporter modules
  • https://grafana.ooni.io and https://prometheus.ooni.io web endpoints without :98765 ports and proper backlinks from notifications

I'm creating a separate ticket for this in order to merge that branch, as it is becoming unreviewable quite quickly :)

slack-irc: support notificatons from other slack bots

#89 part 3 (AKA #114 part 2)

There is also an option in old slack-irc-config.json "muteBots": ["TOKENISH-STRING"].

hellais> ah right, yeah that option is used by my own fork of slack-irc that adds support for IRC->Slack bot notifications
hellais> ekmartin/slack-irc@master...hellais:feature/bot-notifications
hellais> if we don't use my fork, slack bot notifications will not propagate to IRC
hellais🐙> the reason for the muteBots option is that otherwise it will create a feedback loop between the bot then notifying of the bot
darkk> yep, I understand. Am I right that it mutes itself?
hellais> correct

See also ekmartin/slack-irc#50

error while evaluating conditional: set_supervisord == 'true'

Similar to issue #21, when running the install ooni-backend ansible playbook an error occurs when evaluating the set_supervisord conditional in the main ooni-backend task.

TASK: [ooni-backend | Template oonibackend supervisord config] **************** 
fatal: [xx.xx.xx.xx] => error while evaluating conditional: set_supervisord == 'true'

FATAL: all hosts have already failed -- aborting

To fix this error, the current checks can be changed to see if set_supervisord is defined before checking whether it is equal to true.

- include: supervisord.yml
  when: set_supervisord is defined and set_supervisord == 'true'
- include: fetch-HS-info.yml
  when: set_supervisord is defined and set_supervisord == 'true'

kernel version without moddep at ssdams

I tried to enforce firewall rules at ssdams.infra.ooni.io and it turned out that the box has no corresponding kernel modules:

darkk@ssdams:~$ sudo iptables -L -vn
modprobe: ERROR: ../libkmod/libkmod.c:557 kmod_search_moddep() could not open moddep file '/lib/modules/4.4.54/modules.dep.bin'
iptables v1.4.21: can't initialize iptables table `filter': Table does not exist (do you need to insmod?)
Perhaps iptables or your kernel needs to be upgraded.

darkk@ssdams:~$ sudo iptables-save -t filter
modprobe: ERROR: ../libkmod/libkmod.c:557 kmod_search_moddep() could not open moddep file '/lib/modules/4.4.54/modules.dep.bin'
iptables-save v1.4.21: Cannot initialize: iptables who? (do you need to insmod?)

darkk@ssdams:~$ uname -a
Linux ssdams.infra.ooni.io 4.4.54 #1 SMP Wed Mar 15 15:28:10 UTC 2017 x86_64 GNU/Linux

Writing it down as I have no time to fix it today.

Cleanup dead code from this repository

This repository contains a bunch of dead code that is no longer needed/used/working.

I was thinking that this was sort of implied by what would be done in #48, but I realise now that maybe the scope of that ticket is too narrow, so I will create a ticket for this.

Things that we should evaluate for whether they should still stay are:

  • ooni-sysadmin-ng/
  • settings/
  • ansible/server_migration/
  • tasks.py
  • invoke.yaml.example

Potentially many more.

It would be good to go through this whole repo and prune out all the stuff that is obsolete and no longer needed.

pipeline: chameleon disk full

Impact: possible reports loss(?)

Detection: luck and curiosity

Timeline UTC:
13 Jun 11:34: @anadahz brings news to #ooni-internal that chameleon.infra.ooni.io has 86 Gb free at /
15 Jun 14:30: @hellais tells @darkk (during the call) that chameleon was cleaned up, actually only ~200Gb were cleaned up according to munin chart
03 Jul 14:30: @darkk logs into chameleon and notices that there are only 155 Mb free at /
02 Jul 02:27: 2017-07-02 04:27:26...No space left on device in ooni-pipeline/ooni-pipeline.log
03 Jul 16:00: @darkk cleaned up reports-raw at chameleon for 2017-{04,05}-* 2017-06-{01..20} checking data against the set of canned files at datacollector, it cleans up ~1010Gb; cleanup of sanitised files 2017-05-* 2017-06-{01..20} (checked against s3) cleans ~650Gb more

What could be done to prevent relapse and decrease impact:

  • alerting on disk space across all nodes
  • alerting on pipeline failures (happening on 2017-07-02)
  • automatic cleanup of chameleon

What else could be done?

It's unclear if any data was actually lost; it seems to me that's not the case, as rsync's --remove-source-files should remove the source only on success, and since that happens during the transfer it does not produce duplicate report files across different buckets either.
@hellais do you have any ideas how to double-check that no files were lost?
ooni-pipeline-cron.log is a bad clue, as the $TEMP_DIR/fail file can be created (there are free inodes) but any attempt to write to the file should fail (no disk space):

$ find tmp.MJMZZMKod6 tmp.FCkAAllTHY tmp.8MLYjpci8m -ls
19144742    4 drwx------   2 soli     soli         4096 Jul  1 04:00 tmp.MJMZZMKod6
19144743    4 drwx------   2 soli     soli         4096 Jul  2 06:30 tmp.FCkAAllTHY
19139487    0 -rw-rw-r--   1 soli     soli            0 Jul  2 06:32 tmp.FCkAAllTHY/fail
18882602    4 drwx------   2 soli     soli         4096 Jul  3 04:08 tmp.8MLYjpci8m
18878873    0 -rw-rw-r--   1 soli     soli            0 Jul  3 06:22 tmp.8MLYjpci8m/fail

bouncer: 35min down after maintenance reboot

Impact: bouncer was down for 35 minutes. Does it cause measurement loss for any ooniprobe versions? What sort of issues?

Timeline UTC:
13 Jun 10:53 [FIRING] Instance https://bouncer.ooni.io/bouncer down
13 Jun 11:02 @darkk pings @anadahz with basic diagnostics
13 Jun 11:20 @anadahz brings news from kargig that VM reboot was caused by maintenance
13 Jun 11:23 @hellais (?) starts dockerd at bouncer
13 Jun 11:28 [RESOLVED] Instance https://bouncer.ooni.io/bouncer down

What went well:

  • alerting was prompt and relevant!

What went wrong:

  • @darkk waited for ~10 minutes until it was obvious that the alert message was not a network flap
  • four different people were bringing the service back

What could be done to prevent relapse and decrease impact:

  • the bouncer was not reboot-proof
  • there was no email notification to team@ regarding the pending maintenance, so it was impossible to switch over
  • bouncer.ooni.io is a SPOF and has no automatic failover
  • the bouncer deployment recipe was "too meta" for @darkk to understand during the incident

Improve health checks for bouncer

Currently the health checks we use to ensure that the bouncer is working properly are very basic and could be improved.

As highlighted in #108

The check for the bouncer being up I ended up reverting to a much simpler one, because I couldn't figure out how to get the regexps to work

This ticket is about figuring out how to get prometheus regexps to work to validate that the bouncer response is in fact correct.
