
mozilla-services / tokenserver

The Mozilla Token Server

Home Page: http://docs.services.mozilla.com/token/index.html

License: Mozilla Public License 2.0

Makefile 1.88% Python 96.84% Mako 0.16% Dockerfile 0.29% Shell 0.83%

tokenserver's Issues

Log more metrics about background script activity

For our background processing scripts like process_account_events, it would be nice for ops if we:

  • log ignored, success and errors to datadog as a timeseries graph
  • log errors as events into datadog's event stream
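A minimal sketch of what the script could record, using an in-memory stand-in for the datadog/statsd client (the `MetricsRecorder` class, the outcome names, and the `process` loop are illustrative assumptions, not the real script's API):

```python
# Sketch only: a real implementation would flush these counters to
# datadog as a timeseries and send errors to its event stream.
class MetricsRecorder(object):
    """Stand-in for a statsd/datadog client: counts outcomes per run."""
    def __init__(self):
        self.counts = {"ignored": 0, "success": 0, "error": 0}
        self.events = []

    def incr(self, outcome):
        self.counts[outcome] += 1

    def event(self, title, text):
        # With a real client this would go to datadog's event stream.
        self.events.append((title, text))

def process(records, metrics):
    for rec in records:
        try:
            if rec.get("event") not in ("delete", "reset"):
                metrics.incr("ignored")
                continue
            if rec.get("uid") is None:
                raise ValueError("malformed record")
            metrics.incr("success")
        except ValueError as exc:
            metrics.incr("error")
            metrics.event("process_account_events error", str(exc))

metrics = MetricsRecorder()
process([{"event": "delete", "uid": 1},
         {"event": "ping"},
         {"event": "reset", "uid": None}], metrics)
print(metrics.counts)  # {'ignored': 1, 'success': 1, 'error': 1}
```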

Error in scripts/process_account_events.py

Here is the log dump. The result is that the SQS queue didn't get cleared and it grew until monitoring kicked in. It looks like:

  1. data is not properly checked / sanitized
  2. it crashes on line 100
  3. exception bubbles up, process dies
  4. systemd won't restart it anymore because it has died too many times and too quickly
Feb 26 19:25:15 docker[31898]: Processing account reset for u'<REDACTED>'
Feb 26 19:25:15 docker[31898]: Error while processing account deletion events
Feb 26 19:25:15 docker[31898]: Traceback (most recent call last):
Feb 26 19:25:15 docker[31898]: File "/app/tokenserver/scripts/process_account_events.py", line 100, in process_account_events
Feb 26 19:25:15 docker[31898]: backend.update_user(SERVICE, user, generation - 1)
Feb 26 19:25:15 docker[31898]: File "tokenserver/assignment/sqlnode/sql.py", line 321, in update_user
Feb 26 19:25:15 docker[31898]: 'email': user['email'],
Feb 26 19:25:15 docker[31898]: TypeError: 'NoneType' object has no attribute '__getitem__'
Feb 26 19:25:15 docker[31898]: Traceback (most recent call last):
Feb 26 19:25:15 docker[31898]: File "/usr/local/lib/python2.7/runpy.py", line 162, in _run_module_as_main
Feb 26 19:25:15 docker[31898]: "__main__", fname, loader, pkg_name)
Feb 26 19:25:15 docker[31898]: File "/usr/local/lib/python2.7/runpy.py", line 72, in _run_code
Feb 26 19:25:15 docker[31898]: exec code in run_globals
Feb 26 19:25:15 docker[31898]: File "/app/tokenserver/scripts/process_account_events.py", line 145, in <module>
Feb 26 19:25:15 docker[31898]: tokenserver.scripts.run_script(main)
Feb 26 19:25:15 docker[31898]: File "tokenserver/scripts/__init__.py", line 19, in run_script
Feb 26 19:25:15 docker[31898]: exitcode = main()
Feb 26 19:25:15 docker[31898]: File "/app/tokenserver/scripts/process_account_events.py", line 140, in main
Feb 26 19:25:15 docker[31898]: opts.aws_region, opts.queue_wait_time)
Feb 26 19:25:15 docker[31898]: File "/app/tokenserver/scripts/process_account_events.py", line 100, in process_account_events
Feb 26 19:25:15 docker[31898]: backend.update_user(SERVICE, user, generation - 1)
Feb 26 19:25:15 docker[31898]: File "tokenserver/assignment/sqlnode/sql.py", line 321, in update_user
Feb 26 19:25:15 docker[31898]: 'email': user['email'],
Feb 26 19:25:15 docker[31898]: TypeError: 'NoneType' object has no attribute '__getitem__'
Feb 26 19:25:16 systemd[1]: docker-tokenserver-account-events.service: main process exited, code=exited, status=1/FAILURE
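A hedged sketch of the fix the traceback implies: guard against the user lookup returning None before calling update_user, so a malformed or stale event is logged and skipped instead of killing the process. `FakeBackend` and the handler's signature are assumptions for illustration; only the `update_user(..., generation - 1)` call follows the traceback.

```python
# Sketch only: the real script's structure differs; update_user at
# sql.py:321 crashes when user is None, so we guard before the call.
class FakeBackend(object):
    def __init__(self, users):
        self.users = users
        self.updated = []

    def get_user(self, service, email):
        return self.users.get(email)

    def update_user(self, service, user, generation):
        self.updated.append((user["email"], generation))

def handle_account_reset(backend, service, email, generation, log):
    user = backend.get_user(service, email)
    if user is None:
        # Previously this fell through to update_user(None, ...) and the
        # TypeError killed the whole process; treat it as a no-op instead.
        log.append("reset for unknown user: %s" % email)
        return False
    backend.update_user(service, user, generation - 1)
    return True

log = []
backend = FakeBackend({"a@example.com": {"email": "a@example.com"}})
assert handle_account_reset(backend, "sync-1.5", "a@example.com", 10, log)
assert not handle_account_reset(backend, "sync-1.5", "gone@example.com", 10, log)
```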

Distinguish 401 permission error from assertion rejected due to timestamps in some way

In testing, I am seeing lots of issues with token server timestamps. Specifically, my devices seem to be a second or two ahead of the token server. I anticipate we'll see lots of issues in production.

As a first step, I suggest we distinguish "real" permission denied from "trivial" assertion rejected due to timestamp errors. Then clients can selectively retry after adjusting timestamps. Clients can use the HTTP Date: header for local timestamp adjustments until we find that resolution insufficient.

@rfk is the code change within scope?
@ckarlof this could impact the desktop client, which needs to handle timestamp skew itself.
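The Date:-header adjustment suggested above could look like this on the client side (a sketch; how the client feeds the adjusted time back into assertion generation is left out, and the sample timestamps are illustrative):

```python
# Estimate clock skew from the server's HTTP Date: response header.
import calendar
from email.utils import parsedate

def clock_skew(date_header, local_now):
    """Seconds the server's clock is ahead of the local clock."""
    server_now = calendar.timegm(parsedate(date_header))
    return server_now - local_now

# Example: server says 19:03:12 GMT while our clock reads 2s earlier.
skew = clock_skew("Fri, 21 Feb 2014 19:03:12 GMT", local_now=1393009390)
adjusted_now = 1393009390 + skew  # use this when generating assertions
```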

Update our load test to better match Production traffic

From a discussion today in IRC:

14:15 < mostlygeek> jbonacci: fyi: in prod i had to scale up the TS to 6xm3.medium
14:16 < mostlygeek> the database load has been pretty low but the CPU load has been pretty high...
we should update our load tests to match prod traffic :) 
14:17 < jbonacci> mostlygeek we might already have a bug or two on that for rfkelly|away to look at
14:17 < mostlygeek> jbonacci: cool!

Related bugs in Bugzilla:
https://bugzilla.mozilla.org/show_bug.cgi?id=997344
https://bugzilla.mozilla.org/show_bug.cgi?id=1022721

Documentation: mention large_client_header_buffers for nginx on Gentoo

Hi, I ran into an issue making the token server work behind nginx because Gentoo's default config file blocks headers larger than 2k.
The error was the following:

2014/07/23 19:35:25 [info] 3970#0: *163 client sent too long header line: "Authorization: BrowserID eyJhbGciOiJSUzI1NiJ9.......

After discussing it with ckarlof on IRC, he told me that the limit should be increased to 8k, which is nginx's default.
The instructions should mention this issue and suggest replacing the default large_client_header_buffers with nginx's, i.e. large_client_header_buffers 4 8k;
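For reference, a sketch of what the documented suggestion would look like in nginx.conf (the directive is valid in the http or server context; the surrounding block here is illustrative):

```nginx
http {
    # Gentoo ships a 2k cap on header lines; BrowserID assertions in the
    # Authorization header are routinely larger. Restore nginx's default:
    large_client_header_buffers 4 8k;
}
```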

Get gevent working again

@rfk to fill in the blanks.
We ran into issues with load testing TS Stage: a significant number of 503s, with little info/data to go on for debugging.

The LocalVerifier fails on home-made assertions; we need one that accepts them for load testing

Found this with @tarekziade while performing load test on a local install of tokenserver to the qa2 VM.

The entire discussion about this went on in #identity on 2012-04-13

But after making some changes to verifiers.py:
diff verifiers.py verifiers.py.BAK
7,10d6
< from browserid.tests.support import patched_key_fetching
< patched = patched_key_fetching()
< patched.enter()
<

Location on qa2: /opt/tokenserver/tokenserver

I am still seeing failures on my system, running the very simple load test as follows:
Terminal 1: ./bin/paster serve etc/tokenserver-dev.ini
Terminal 2:
cd /opt/tokenserver/tokenserver/loadtest
make build
make test

../bin/fl-run-test loadtest.py

F.

FAIL: test_bad_assertions (loadtest.NodeAssignmentTest)

Traceback (most recent call last):
File "/opt/tokenserver/tokenserver/lib/python2.6/site-packages/funkload-1.16.1-py2.6.egg/funkload/FunkLoadTestCase.py", line 946, in call
testMethod()
File "/opt/tokenserver/tokenserver/loadtest/loadtest.py", line 52, in test_bad_assertions
self._do_token_exchange(wrong_issuer, 401)
File "/opt/tokenserver/tokenserver/loadtest/loadtest.py", line 24, in _do_token_exchange
res = self.get(self.root + self.token_exchange, ok_codes=[status])
File "/opt/tokenserver/tokenserver/lib/python2.6/site-packages/funkload-1.16.1-py2.6.egg/funkload/FunkLoadTestCase.py", line 391, in get
method="get", load_auto_links=load_auto_links)
File "/opt/tokenserver/tokenserver/lib/python2.6/site-packages/funkload-1.16.1-py2.6.egg/funkload/FunkLoadTestCase.py", line 299, in _browse
response = self._connect(url, params, ok_codes, method, description)
File "/opt/tokenserver/tokenserver/lib/python2.6/site-packages/funkload-1.16.1-py2.6.egg/funkload/FunkLoadTestCase.py", line 216, in _connect
raise self.failureException, str(value.response)
AssertionError: /1.0/aitc/1.0
HTTP Response 200: OK


Ran 2 tests in 1.638s

FAILED (failures=1)
make: *** [test] Error 1

Investigate/Document the 503s on TokenServer Stage

Running TokenServer only load tests or Combined load tests, we continue to see some amount of 503s in Stage. This is with default settings for the load test, various configurations of TS Stage, with 1 to 3 instances of various sizes.

We find here /media/ephemeral0/logs/nginx/access.log
54.245.44.231 2014-05-13T23:58:39+00:00 "GET /1.0/sync/1.5 HTTP/1.1" 503 1922 320 "python-requests/2.2.1 CPython/2.7.3 Linux/3.5.0-23-generic" 0.007

And here
"name": "token.assertion.connection_error"
"name": "token.assertion.verify_failure"

Something to research and document going forward...

Refactor shared secret design to use a master password + HKDF for storage nodes

Building upon our discussions of how the tokenserver and the storage nodes handle pre-shared secrets we will make the following changes:

  • tokenserver will know a master password. It will use this master password to derive (HMAC-SHA256) passwords used by the storage nodes
  • the storage nodes will have the derived secrets baked into them
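A sketch of the derivation step, assuming the node's hostname is the HMAC message (the real scheme may mix in more context, e.g. a version string or service name):

```python
# Per-node secret derived from a single master password via HMAC-SHA256.
import hashlib
import hmac

def derive_node_secret(master_secret, node_hostname):
    """derived = HMAC-SHA256(key=master_secret, msg=node_hostname)."""
    mac = hmac.new(master_secret, node_hostname.encode("ascii"),
                   hashlib.sha256)
    return mac.hexdigest()

# Each storage node gets its own secret baked in; the tokenserver can
# re-derive any of them from the master password alone.
s1 = derive_node_secret(b"master-password", "db1.sync.example.com")
s2 = derive_node_secret(b"master-password", "db2.sync.example.com")
assert s1 != s2          # distinct secret per node
assert len(s1) == 64     # hex-encoded SHA256
```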

Issues running unit tests on rhel6 and Mac

I keep getting some sort of "hang" condition running "make test" after building tokenserver on qa2 (and other locations).
So, after talking to both of you about upgrading to 0.8.2 to get what is in TS Stage, I did the following:

$ git clone git://github.com/mozilla-services/tokenserver
$ cd tokenserver
$ make build CHANNEL=prod TOKENSERVER=rpm-0.8.2
$ make test

This is what I see:
(note the ^C in the output - that is where I tried to break the apparent "hang" condition)

bin/nosetests --with-xunit tokenserver
...........F....^CE.......

...etc...

Ran 24 tests in 206.026s

FAILED (errors=1, failures=1)
make: *** [test] Error 1

On qa2, I tried each of these steps:
$ make build
vs.
$ make build CHANNEL=dev
vs.
$ make build CHANNEL=prod TOKENSERVER=rpm-0.8.4
vs.
$ make build CHANNEL=prod

In all cases, I get a "hang" condition on "make test".
Once the "hang" condition is remedied with a Ctrl-C, I get several errors.
The pastebin is here: http://jbonacci.pastebin.mozilla.org/1701461

Here is more info:
http://jbonacci.pastebin.mozilla.org/1705814

make test failing with errors on verifiers.py

Looks like one of the most recent commits may have broken something here:

make test
bin/flake8 tokenserver
tokenserver/verifiers.py:33:5: E301 expected 1 blank line, found 0
tokenserver/verifiers.py:40:19: E225 missing whitespace around operator
tokenserver/verifiers.py:42:23: E712 comparison to False should be 'if cond is False:' or 'if not cond:'
tokenserver/verifiers.py:45:19: E225 missing whitespace around operator
tokenserver/verifiers.py:51:80: E501 line too long (80 > 79 characters)
tokenserver/verifiers.py:53:12: E127 continuation line over-indented for visual indent
tokenserver/verifiers.py:56:80: E501 line too long (91 > 79 characters)
tokenserver/verifiers.py:61:1: E303 too many blank lines (3)
tokenserver/verifiers.py:66:1: E302 expected 2 blank lines, found 3
make: *** [test] Error 1

Doesn't work with latest cornice

In my dev deployment of tokenserver, I have to downgrade to cornice=0.11 to make things work. Unfortunately I don't recall the specific error I was seeing; this is a note for me to reproduce and debug the issue.

Localhost unit tests failing on test_purging_of_old_user_records

Here is what I see on all platforms I tested:

make test
bin/flake8 --exclude=messages.py,test_remote_verifier.py tokenserver
bin/nosetests tokenserver/tests
..................E......S......
======================================================================
ERROR: test_purging_of_old_user_records (tokenserver.tests.test_purge_old_records.TestPurgeOldRecordsScript)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/James/tokenserver/tokenserver/tests/test_purge_old_records.py", line 81, in test_purging_of_old_user_records
    user_records = list(self.backend.get_user_records(service, email))
AttributeError: 'SQLNodeAssignment' object has no attribute 'get_user_records'
-------------------- >> begin captured logging << --------------------
circus: INFO: Arbiter exiting
circus: DEBUG: stopping the broker watcher
circus: DEBUG: gracefully stopping processes [broker] for 30.0s
circus: DEBUG: broker: kill process 62437
circus: DEBUG: sending signal 15 to 62437
circus: DEBUG: stopping the workers watcher
circus: DEBUG: gracefully stopping processes [workers] for 30.0s
circus: DEBUG: workers: kill process 62438
circus: DEBUG: sending signal 15 to 62438
circus: DEBUG: manage_watchers is conflicting with another command
circus: INFO: broker stopped
circus: INFO: workers stopped
--------------------- >> end captured logging << ---------------------

----------------------------------------------------------------------
Ran 32 tests in 12.770s

FAILED (SKIP=1, errors=1)
make: *** [test] Error 1

Tokenserver could be more graceful in reporting bogus assertions

From irc:
Although, hmm the loadtest passes, but I see the tokenserver with an
error 'missing p in data'. I seem to recall some mention of that at some
point.
jrgm: this is "normal" because the server gets a wrong assertion
and we want to make sure it does not work
tarek: ah, duh (me).
jrgm: but plz send that to alexis with me in cc. the TS should be more graecful here
cc me too
;-)
/cc @tarekziade, @jbonacci

ERROR [powerhose][worker 8] missing p in data - [u'e', u'algorithm', u'n']
File "/home/jrgm/tokenserver/deps/https:/github.com/mozilla-services/powerhose/powerhose/worker.py", line 60, in _handle_recv_back
res = self.target(Job.load_from_string(msg[0]))
File "/home/jrgm/tokenserver/tokenserver/crypto/pyworker.py", line 201, in call
res = getattr(self, function_id)(**data)
File "/home/jrgm/tokenserver/tokenserver/crypto/pyworker.py", line 214, in check_signature
cert = jwt.load_key(algorithm, data)
File "/home/jrgm/tokenserver/lib/python2.6/site-packages/browserid/jwt.py", line 76, in load_key
return key_class(key_data)
File "/home/jrgm/tokenserver/lib/python2.6/site-packages/browserid/jwt.py", line 160, in init
_check_keys(data, ('p', 'q', 'g', 'y'))
File "/home/jrgm/tokenserver/lib/python2.6/site-packages/browserid/jwt.py", line 242, in _check_keys
raise ValueError(msg)
File "/home/jrgm/tokenserver/deps/https:/github.com/mozilla-services/powerhose/powerhose/client.py", line 54, in execute
raise ExecutionError(res[len('ERROR:'):])
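A hedged sketch of what "more graceful" could mean here: catch the ValueError from the key check and report an ordinary verification failure (leading to a 401) instead of letting it escape as a worker error. `_check_keys` mirrors the browserid.jwt helper in the traceback; `verify_assertion` is illustrative, not the real API.

```python
# Sketch: a malformed public key is just a bogus assertion, not a crash.
def _check_keys(data, required):
    for key in required:
        if key not in data:
            raise ValueError("missing %s in data - %r" % (key, sorted(data)))

def verify_assertion(key_data):
    try:
        _check_keys(key_data, ("p", "q", "g", "y"))
    except ValueError:
        # Report a normal verification failure rather than erroring out.
        return {"status": "failure", "reason": "invalid-credentials"}
    return {"status": "okay"}

# The exact key shape from the log: RSA fields where DSA was expected.
res = verify_assertion({"e": "...", "algorithm": "RS", "n": "..."})
assert res["status"] == "failure"
```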

Crypto workers stop working after ~24h

Copied from https://bugzilla.mozilla.org/show_bug.cgi?id=757520

For an unknown reason, the crypto workers of the tokenserver stop working after some time running (roughly 24 hours).

The particular piece of code hanging out is:

  • retrieving browserid keys from the hosts, via the idproxy
  • checking signatures regarding the fetched public keys.

The problem is either there or in the layers on top of that (circus or powerhose), which ensure that the main python server is able to communicate with the workers.

Create a quick verification test for Stage deploys

Something similar to what we have for server-syncstorage: ./syncstorage/tests/functional/test_storage.py

and for fxa-auth-server: npm run test-remote

If I remember correctly the current unit tests (make test) are only designed to run locally.
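A remote check could be as small as validating the shape of the token-exchange response. This is a sketch; the field names follow the documented /1.0/sync/1.5 response, but the helper and sample values are assumptions:

```python
# Smoke-test helper: verify a token-exchange JSON body has the fields
# clients rely on, suitable for running against a Stage deploy.
def check_token_response(resp):
    required = ("id", "key", "uid", "api_endpoint", "duration")
    missing = [f for f in required if f not in resp]
    assert not missing, "missing fields: %s" % missing
    assert resp["api_endpoint"].startswith("https://")
    return True

# Example body (values are made up for illustration).
sample = {"id": "tok", "key": "secret", "uid": 42,
          "api_endpoint": "https://sync-1.stage.mozaws.net/1.5/42",
          "duration": 3600}
assert check_token_response(sample)
```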

Latest basic TokenServer load tests are returning 401s

I am using the very latest setup of TS and Sync and Verifier (all now in US East).
I am just running the basic test against TS Stage:
$ make test SERVER_URL=https://token.stage.mozaws.net

I see 401s in the nginx access logs.
[21/Feb/2014:14:03:12 -0500] "GET /1.0/sync/1.5 HTTP/1.1" 401 110...
[21/Feb/2014:14:03:37 -0500] "GET /1.0/sync/1.5 HTTP/1.1" 401 96...

I see errors in the tokenserver token.log file:
"name": "token.assertion.verify_failure"
"token.assertion.audience_mismatch_error"
(always paired)

These are consistent across all 3 instances of TS in Stage.

clean make of tokenserver doesn't install zope.interface so it is found

Steps to reproduce:

  1. do a clean git clone of tokenserver and type make
  2. after a successful make, do "bin/python -c 'import zope.interface'"

I get
Traceback (most recent call last):
File "", line 1, in
ImportError: No module named interface

zope.interface is located here (not sure why it does not get picked up):
./lib/python2.6/site-packages/zope.interface-3.8.0-py2.6-linux-x86_64.egg/zope/interface

Without it available, two of the tests of 'make test' fail.

figure out wrong-key confusion when FxA account is reset

We've been having an email discussion about how to prevent a "slap fight" between two devices on the same account, when the FxA account has been reset (changing the encryption key), but the devices disagree about whether the old key or the new key is correct. The worst case here is that the two devices keep deleting each other's data, because they get an HMAC error when they see something encrypted by the other device, so they wipe the server and re-upload with their own key. This could keep happening until the old-key device's token finally expires, and it cannot get a new one without re-logging in (at which point it will get the new key).

We haven't yet decided how to address this, but we're converging on a couple of possible solutions. Most involve the fxa-auth-server including a "generation number" in its certificates, so the tokenserver can distinguish between the "new" device and the old ones (mozilla/fxa-auth-server#486). Some also involve the tokenserver getting a hash of the encryption key, and mapping different (uid, keyhash) pairs to different sync-id values (so old and new devices get different sets of ciphertext).
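The (uid, keyhash) idea above can be sketched as follows; the function name, the key material argument, and the id format are illustrative, not a decided design:

```python
# Map (uid, hash-of-encryption-key) to a distinct storage identifier, so
# devices holding the old key and the new key stop sharing ciphertext.
import hashlib

def storage_id(uid, kB):
    keyhash = hashlib.sha256(kB).hexdigest()[:16]
    return "%s-%s" % (uid, keyhash)

old = storage_id(42, b"old-key")
new = storage_id(42, b"new-key")
assert old != new  # old-key devices can no longer wipe new-key data
```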

Opening this ticket so we'll have something to point at for the discussion.
CC @ckarlof @rfk

fd leaks

The token server is leaking fds on stage2.

This is probably in powerhose, either in the client Pool, or in the workers restarting.

Will write a test that counts the number of fds before and after each request to find out where the problem happens

/cc @fetep @ametaireau
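Such a test could be sketched like this (Linux-only: it counts entries in /proc/self/fd, and opening a file stands in for "one request"; a portable version would need a different probe):

```python
# Count open fds before and after an operation to spot a leak.
import os

def open_fd_count():
    return len(os.listdir("/proc/self/fd"))

before = open_fd_count()
f = open(os.devnull)              # stands in for one request/connection
leaked = open_fd_count() - before
f.close()                         # a leak-free path returns to baseline
assert leaked >= 1
assert open_fd_count() == before
```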

Failure attempting to build TS docs

Was attempting the following:
08:45 < tarek> its in docs/ https://github.com/mozilla-services/tokenserver/tree/master/docs
08:45 < tarek> if you want to build it:
08:46 < tarek> cd docs; SPHINXBUILD=../bin/sphinx-build make html
08:46 < tarek> then you have it in docs/build/html

Was not able to do this from a local install of tokenserver (git clone)

SPHINXBUILD=../bin/sphinx-build make html
../bin/sphinx-build -b html -d build/doctrees source build/html
Running Sphinx v1.1.2
loading pickled environment... done
building [html]: targets for 8 source files that are out of date
updating environment: 0 added, 0 changed, 0 removed
looking for now-outdated files... none found
preparing documents... done
IOError: [Errno 2] No such file or directory: u'../../lib/python2.6/site-packages/docutils-0.8.1-py2.6.egg/docutils/writers/html4css1/html4css1.css'
Unable to open source file for reading ('../../lib/python2.6/site-packages/docutils-0.8.1-py2.6.egg/docutils/writers/html4css1/html4css1.css'). Exiting.
make: *** [html] Error 1

From the "docs" directory:
ls ../../lib/python2.6/site-packages/docutils-0.8.1-py2.6.egg/docutils/writers/html4css1/html4css1.css
ls: cannot access ../../lib/python2.6/site-packages/docutils-0.8.1-py2.6.egg/docutils/writers/html4css1/html4css1.css: No such file or directory

Instead I try this:
ls ../lib/python2.6/site-packages/docutils-0.8.1-py2.6.egg/docutils/writers/html4css1/html4css1.css
../lib/python2.6/site-packages/docutils-0.8.1-py2.6.egg/docutils/writers/html4css1/html4css1.css

That works, so it's looking in the wrong place for docutils?

Getting "make build" errors on Mac

Well this is new and frustrating (since this was working even on OS X 10.9/Xcode 5.1.1).

git clone tokenserver
cd tokenserver
make build
make test
this all works

cd loadtest
make build

Now I see this:
make build
ARCHFLAGS=-Wno-error=unused-command-line-argument-hard-error-in-future ../bin/pip install pexpect
/bin/sh: ../bin/pip: No such file or directory
make: *** [build] Error 127

And, indeed, there is no "bin" directory in tokenserver

tests/{test_service,test_crypto_pyworker}.py require unittest2 on python 2.6

I've just been patching around this issue for a while, and should have put it
in as an issue. Anyways test_service.py and test_crypto_pyworker both use new
features from unittest in 2.7 (assertIn and context manager for assertRaises
respectively).

Since the production boxes run python 2.6.6, can we fix these tests to run there?
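The standard compatibility shim would be a conditional import at the top of each affected test module: stdlib unittest on 2.7+, the unittest2 backport on 2.6. A sketch (the test case here is illustrative):

```python
import io
import sys

# Prefer stdlib unittest; fall back to the unittest2 backport on 2.6,
# where assertIn and assertRaises-as-context-manager are missing.
if sys.version_info >= (2, 7):
    import unittest
else:  # pragma: no cover - python 2.6 only
    import unittest2 as unittest

class TestShim(unittest.TestCase):
    def test_assert_in(self):
        # assertIn is one of the 2.7-only features the tests rely on.
        self.assertIn("token", "tokenserver")

suite = unittest.TestLoader().loadTestsFromTestCase(TestShim)
result = unittest.TextTestRunner(stream=io.StringIO()).run(suite)
assert result.wasSuccessful()
```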
