
mozilla-services / tokenserver

The Mozilla Token Server

Home Page: http://docs.services.mozilla.com/token/index.html

License: Mozilla Public License 2.0

Makefile 1.88% Python 96.84% Mako 0.16% Dockerfile 0.29% Shell 0.83%

tokenserver's Issues

Log more metrics about background script activity

For our background processing scripts like process_account_events, it would be nice for ops if we:

  • log ignored, success and errors to datadog as a timeseries graph
  • log errors as events into datadog's event stream
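A minimal sketch of what the script could record, using an in-memory stand-in for the datadog/statsd client (the `MetricsRecorder` class, the outcome names, and the `process` loop are illustrative assumptions, not the real script's API):

```python
# Sketch only: a real implementation would flush these counters to
# datadog as a timeseries and send errors to its event stream.
class MetricsRecorder(object):
    """Stand-in for a statsd/datadog client: counts outcomes per run."""
    def __init__(self):
        self.counts = {"ignored": 0, "success": 0, "error": 0}
        self.events = []

    def incr(self, outcome):
        self.counts[outcome] += 1

    def event(self, title, text):
        # With a real client this would go to datadog's event stream.
        self.events.append((title, text))

def process(records, metrics):
    for rec in records:
        try:
            if rec.get("event") not in ("delete", "reset"):
                metrics.incr("ignored")
                continue
            if rec.get("uid") is None:
                raise ValueError("malformed record")
            metrics.incr("success")
        except ValueError as exc:
            metrics.incr("error")
            metrics.event("process_account_events error", str(exc))

metrics = MetricsRecorder()
process([{"event": "delete", "uid": 1},
         {"event": "ping"},
         {"event": "reset", "uid": None}], metrics)
print(metrics.counts)  # {'ignored': 1, 'success': 1, 'error': 1}
```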

Error in scripts/process_account_events.py

Here is the log dump. The result is that the SQS queue didn't get cleared and it grew until monitoring kicked in. It looks like:

  1. data is not properly checked / sanitized
  2. it crashes on line 100
  3. exception bubbles up, process dies
  4. systemd won't restart it anymore because it has died too many times and too quickly
Feb 26 19:25:15 docker[31898]: Processing account reset for u'<REDACTED>'
Feb 26 19:25:15 docker[31898]: Error while processing account deletion events
Feb 26 19:25:15 docker[31898]: Traceback (most recent call last):
Feb 26 19:25:15 docker[31898]: File "/app/tokenserver/scripts/process_account_events.py", line 100, in process_account_events
Feb 26 19:25:15 docker[31898]: backend.update_user(SERVICE, user, generation - 1)
Feb 26 19:25:15 docker[31898]: File "tokenserver/assignment/sqlnode/sql.py", line 321, in update_user
Feb 26 19:25:15 docker[31898]: 'email': user['email'],
Feb 26 19:25:15 docker[31898]: TypeError: 'NoneType' object has no attribute '__getitem__'
Feb 26 19:25:15 docker[31898]: Traceback (most recent call last):
Feb 26 19:25:15 docker[31898]: File "/usr/local/lib/python2.7/runpy.py", line 162, in _run_module_as_main
Feb 26 19:25:15 docker[31898]: "__main__", fname, loader, pkg_name)
Feb 26 19:25:15 docker[31898]: File "/usr/local/lib/python2.7/runpy.py", line 72, in _run_code
Feb 26 19:25:15 docker[31898]: exec code in run_globals
Feb 26 19:25:15 docker[31898]: File "/app/tokenserver/scripts/process_account_events.py", line 145, in <module>
Feb 26 19:25:15 docker[31898]: tokenserver.scripts.run_script(main)
Feb 26 19:25:15 docker[31898]: File "tokenserver/scripts/__init__.py", line 19, in run_script
Feb 26 19:25:15 docker[31898]: exitcode = main()
Feb 26 19:25:15 docker[31898]: File "/app/tokenserver/scripts/process_account_events.py", line 140, in main
Feb 26 19:25:15 docker[31898]: opts.aws_region, opts.queue_wait_time)
Feb 26 19:25:15 docker[31898]: File "/app/tokenserver/scripts/process_account_events.py", line 100, in process_account_events
Feb 26 19:25:15 docker[31898]: backend.update_user(SERVICE, user, generation - 1)
Feb 26 19:25:15 docker[31898]: File "tokenserver/assignment/sqlnode/sql.py", line 321, in update_user
Feb 26 19:25:15 docker[31898]: 'email': user['email'],
Feb 26 19:25:15 docker[31898]: TypeError: 'NoneType' object has no attribute '__getitem__'
Feb 26 19:25:16 systemd[1]: docker-tokenserver-account-events.service: main process exited, code=exited, status=1/FAILURE
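A hedged sketch of the fix the traceback implies: guard against the user lookup returning None before calling update_user, so a malformed or stale event is logged and skipped instead of killing the process. `FakeBackend` and the handler's signature are assumptions for illustration; only the `update_user(..., generation - 1)` call follows the traceback.

```python
# Sketch only: the real script's structure differs; update_user at
# sql.py:321 crashes when user is None, so we guard before the call.
class FakeBackend(object):
    def __init__(self, users):
        self.users = users
        self.updated = []

    def get_user(self, service, email):
        return self.users.get(email)

    def update_user(self, service, user, generation):
        self.updated.append((user["email"], generation))

def handle_account_reset(backend, service, email, generation, log):
    user = backend.get_user(service, email)
    if user is None:
        # Previously this fell through to update_user(None, ...) and the
        # TypeError killed the whole process; treat it as a no-op instead.
        log.append("reset for unknown user: %s" % email)
        return False
    backend.update_user(service, user, generation - 1)
    return True

log = []
backend = FakeBackend({"a@example.com": {"email": "a@example.com"}})
assert handle_account_reset(backend, "sync-1.5", "a@example.com", 10, log)
assert not handle_account_reset(backend, "sync-1.5", "gone@example.com", 10, log)
```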

Distinguish 401 permission error from assertion rejected due to timestamps in some way

In testing, I am seeing lots of issues with token server timestamps. Specifically, my devices seem to be a second or two ahead of the token server. I anticipate we'll see lots of issues in production.

As a first step, I suggest we distinguish "real" permission denied from "trivial" assertion rejected due to timestamp errors. Then clients can selectively retry after adjusting timestamps. Clients can use the HTTP Date: header for local timestamp adjustments until we find that resolution insufficient.

@rfk is the code change within scope?
@ckarlof this could impact the desktop client, which needs to handle timestamp skew itself.
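The Date:-header adjustment suggested above could look like this on the client side (a sketch; how the client feeds the adjusted time back into assertion generation is left out, and the sample timestamps are illustrative):

```python
# Estimate clock skew from the server's HTTP Date: response header.
import calendar
from email.utils import parsedate

def clock_skew(date_header, local_now):
    """Seconds the server's clock is ahead of the local clock."""
    server_now = calendar.timegm(parsedate(date_header))
    return server_now - local_now

# Example: server says 19:03:12 GMT while our clock reads 2s earlier.
skew = clock_skew("Fri, 21 Feb 2014 19:03:12 GMT", local_now=1393009390)
adjusted_now = 1393009390 + skew  # use this when generating assertions
```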

Update our load test to better match Production traffic

From a discussion today in IRC:

14:15 < mostlygeek> jbonacci: fyi: in prod i had to scale up the TS to 6xm3.medium
14:16 < mostlygeek> the database load has been pretty low but the CPU load has been pretty high...
we should update our load tests to match prod traffic :) 
14:17 < jbonacci> mostlygeek we might already have a bug or two on that for rfkelly|away to look at
14:17 < mostlygeek> jbonacci: cool!

Related bugs in Bugzilla:
https://bugzilla.mozilla.org/show_bug.cgi?id=997344
https://bugzilla.mozilla.org/show_bug.cgi?id=1022721

Documentation: mention large_client_header_buffers for nginx on Gentoo

Hi, I ran into an issue making the token server work behind nginx because Gentoo's default config file blocks headers larger than 2k.
The error was the following:

2014/07/23 19:35:25 [info] 3970#0: *163 client sent too long header line: "Authorization: BrowserID eyJhbGciOiJSUzI1NiJ9.......

After discussing it with ckarlof on IRC, he told me that the limit should be increased to 8k, which is nginx's default.
The instructions should mention this issue and suggest replacing the default large_client_header_buffers with nginx's, i.e. large_client_header_buffers 4 8k;
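For reference, a sketch of what the documented suggestion would look like in nginx.conf (the directive is valid in the http or server context; the surrounding block here is illustrative):

```nginx
http {
    # Gentoo ships a 2k cap on header lines; BrowserID assertions in the
    # Authorization header are routinely larger. Restore nginx's default:
    large_client_header_buffers 4 8k;
}
```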

Get gevent working again

@rfk to fill in the blanks.
We ran into issues with load testing TS Stage: a significant number of 503s, with little info/data to go on for debugging.

The LocalVerifier fails on home-made assertions; we need one that accepts them for load testing

Found this with @tarekziade while performing load test on a local install of tokenserver to the qa2 VM.

The entire discussion about this went on in #identity on 2012-04-13

But after making some changes to verifiers.py:
diff verifiers.py verifiers.py.BAK
7,10d6
< from browserid.tests.support import patched_key_fetching
< patched = patched_key_fetching()
< patched.enter()
<

Location on qa2: /opt/tokenserver/tokenserver

I am still seeing failures on my system, running the very simple load test as follows:
Terminal 1: ./bin/paster serve etc/tokenserver-dev.ini
Terminal 2:
cd /opt/tokenserver/tokenserver/loadtest
make build
make test

../bin/fl-run-test loadtest.py

F.

FAIL: test_bad_assertions (loadtest.NodeAssignmentTest)

Traceback (most recent call last):
File "/opt/tokenserver/tokenserver/lib/python2.6/site-packages/funkload-1.16.1-py2.6.egg/funkload/FunkLoadTestCase.py", line 946, in call
testMethod()
File "/opt/tokenserver/tokenserver/loadtest/loadtest.py", line 52, in test_bad_assertions
self._do_token_exchange(wrong_issuer, 401)
File "/opt/tokenserver/tokenserver/loadtest/loadtest.py", line 24, in _do_token_exchange
res = self.get(self.root + self.token_exchange, ok_codes=[status])
File "/opt/tokenserver/tokenserver/lib/python2.6/site-packages/funkload-1.16.1-py2.6.egg/funkload/FunkLoadTestCase.py", line 391, in get
method="get", load_auto_links=load_auto_links)
File "/opt/tokenserver/tokenserver/lib/python2.6/site-packages/funkload-1.16.1-py2.6.egg/funkload/FunkLoadTestCase.py", line 299, in _browse
response = self._connect(url, params, ok_codes, method, description)
File "/opt/tokenserver/tokenserver/lib/python2.6/site-packages/funkload-1.16.1-py2.6.egg/funkload/FunkLoadTestCase.py", line 216, in _connect
raise self.failureException, str(value.response)
AssertionError: /1.0/aitc/1.0
HTTP Response 200: OK


Ran 2 tests in 1.638s

FAILED (failures=1)
make: *** [test] Error 1

Investigate/Document the 503s on TokenServer Stage

Running TokenServer only load tests or Combined load tests, we continue to see some amount of 503s in Stage. This is with default settings for the load test, various configurations of TS Stage, with 1 to 3 instances of various sizes.

We find here /media/ephemeral0/logs/nginx/access.log
54.245.44.231 2014-05-13T23:58:39+00:00 "GET /1.0/sync/1.5 HTTP/1.1" 503 1922 320 "python-requests/2.2.1 CPython/2.7.3 Linux/3.5.0-23-generic" 0.007

And here
"name": "token.assertion.connection_error"
"name": "token.assertion.verify_failure"

Something to research and document going forward...

Refactor shared secret design to use a master password + HKDF for storage nodes

Building upon our discussions of how the tokenserver and the storage nodes handle pre-shared secrets we will make the following changes:

  • tokenserver will know a master password. It will use this master password to derive (HMAC-SHA256) passwords used by the storage nodes
  • the storage nodes will have the derived secrets baked into them
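A sketch of the derivation step, assuming the node's hostname is the HMAC message (the real scheme may mix in more context, e.g. a version string or service name):

```python
# Per-node secret derived from a single master password via HMAC-SHA256.
import hashlib
import hmac

def derive_node_secret(master_secret, node_hostname):
    """derived = HMAC-SHA256(key=master_secret, msg=node_hostname)."""
    mac = hmac.new(master_secret, node_hostname.encode("ascii"),
                   hashlib.sha256)
    return mac.hexdigest()

# Each storage node gets its own secret baked in; the tokenserver can
# re-derive any of them from the master password alone.
s1 = derive_node_secret(b"master-password", "db1.sync.example.com")
s2 = derive_node_secret(b"master-password", "db2.sync.example.com")
assert s1 != s2          # distinct secret per node
assert len(s1) == 64     # hex-encoded SHA256
```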

Issues running unit tests on rhel6 and Mac

I keep getting some sort of "hang" condition running "make test" after building tokenserver on qa2 (and other locations).
So, after talking to both of you about upgrading to 0.8.2 to get what is in TS Stage, I did the following:

$ git clone git://github.com/mozilla-services/tokenserver
$ cd tokenserver
$ make build CHANNEL=prod TOKENSERVER=rpm-0.8.2
$ make test

This is what I see:
(note the ^C in the output - that is where I tried to break the apparent "hang" condition)

bin/nosetests --with-xunit tokenserver
...........F....^CE.......

...etc...

Ran 24 tests in 206.026s

FAILED (errors=1, failures=1)
make: *** [test] Error 1

On qa2, I tried each of these steps:
$ make build
vs.
$ make build CHANNEL=dev
vs.
$ make build CHANNEL=prod TOKENSERVER=rpm-0.8.4
vs.
$ make build CHANNEL=prod

In all cases, I get a "hang" condition on "make test".
Once the "hang" condition is remedied with a Ctrl-C, I get several errors.
The pastebin is here: http://jbonacci.pastebin.mozilla.org/1701461

Here is more info:
http://jbonacci.pastebin.mozilla.org/1705814

make test failing with errors on verifiers.py

Looks like one of the most recent commits may have broken something here:

make test
bin/flake8 tokenserver
tokenserver/verifiers.py:33:5: E301 expected 1 blank line, found 0
tokenserver/verifiers.py:40:19: E225 missing whitespace around operator
tokenserver/verifiers.py:42:23: E712 comparison to False should be 'if cond is False:' or 'if not cond:'
tokenserver/verifiers.py:45:19: E225 missing whitespace around operator
tokenserver/verifiers.py:51:80: E501 line too long (80 > 79 characters)
tokenserver/verifiers.py:53:12: E127 continuation line over-indented for visual indent
tokenserver/verifiers.py:56:80: E501 line too long (91 > 79 characters)
tokenserver/verifiers.py:61:1: E303 too many blank lines (3)
tokenserver/verifiers.py:66:1: E302 expected 2 blank lines, found 3
make: *** [test] Error 1

Doesn't work with latest cornice

In my dev deployment of tokenserver, I have to downgrade to cornice=0.11 to make things work. Unfortunately I don't recall the specific error I was seeing; this is a note for me to reproduce and debug the issue.

Localhost unit tests failing on test_purging_of_old_user_records

Here is what I see on all platforms I tested:

make test
bin/flake8 --exclude=messages.py,test_remote_verifier.py tokenserver
bin/nosetests tokenserver/tests
..................E......S......
======================================================================
ERROR: test_purging_of_old_user_records (tokenserver.tests.test_purge_old_records.TestPurgeOldRecordsScript)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/James/tokenserver/tokenserver/tests/test_purge_old_records.py", line 81, in test_purging_of_old_user_records
    user_records = list(self.backend.get_user_records(service, email))
AttributeError: 'SQLNodeAssignment' object has no attribute 'get_user_records'
-------------------- >> begin captured logging << --------------------
circus: INFO: Arbiter exiting
circus: DEBUG: stopping the broker watcher
circus: DEBUG: gracefully stopping processes [broker] for 30.0s
circus: DEBUG: broker: kill process 62437
circus: DEBUG: sending signal 15 to 62437
circus: DEBUG: stopping the workers watcher
circus: DEBUG: gracefully stopping processes [workers] for 30.0s
circus: DEBUG: workers: kill process 62438
circus: DEBUG: sending signal 15 to 62438
circus: DEBUG: manage_watchers is conflicting with another command
circus: INFO: broker stopped
circus: INFO: workers stopped
--------------------- >> end captured logging << ---------------------

----------------------------------------------------------------------
Ran 32 tests in 12.770s

FAILED (SKIP=1, errors=1)
make: *** [test] Error 1

Tokenserver could be more graceful in reporting bogus assertions

From irc:
Although, hmm the loadtest passes, but I see the tokenserver with an
error 'missing p in data'. I seem to recall some mention of that at some
point.
jrgm: this is "normal" because the server gets a wrong assertion
and we want to make sure it does not work
tarek: ah, duh (me).
jrgm: but plz send that to alexis with me in cc. the TS should be more graecful here
cc me too
;-)
/cc @tarekziade, @jbonacci

ERROR [powerhose][worker 8] missing p in data - [u'e', u'algorithm', u'n']
File "/home/jrgm/tokenserver/deps/https:/github.com/mozilla-services/powerhose/powerhose/worker.py", line 60, in _handle_recv_back
res = self.target(Job.load_from_string(msg[0]))
File "/home/jrgm/tokenserver/tokenserver/crypto/pyworker.py", line 201, in call
res = getattr(self, function_id)(**data)
File "/home/jrgm/tokenserver/tokenserver/crypto/pyworker.py", line 214, in check_signature
cert = jwt.load_key(algorithm, data)
File "/home/jrgm/tokenserver/lib/python2.6/site-packages/browserid/jwt.py", line 76, in load_key
return key_class(key_data)
File "/home/jrgm/tokenserver/lib/python2.6/site-packages/browserid/jwt.py", line 160, in init
_check_keys(data, ('p', 'q', 'g', 'y'))
File "/home/jrgm/tokenserver/lib/python2.6/site-packages/browserid/jwt.py", line 242, in _check_keys
raise ValueError(msg)
File "/home/jrgm/tokenserver/deps/https:/github.com/mozilla-services/powerhose/powerhose/client.py", line 54, in execute
raise ExecutionError(res[len('ERROR:'):])
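A hedged sketch of what "more graceful" could mean here: catch the ValueError from the key check and report an ordinary verification failure (leading to a 401) instead of letting it escape as a worker error. `_check_keys` mirrors the browserid.jwt helper in the traceback; `verify_assertion` is illustrative, not the real API.

```python
# Sketch: a malformed public key is just a bogus assertion, not a crash.
def _check_keys(data, required):
    for key in required:
        if key not in data:
            raise ValueError("missing %s in data - %r" % (key, sorted(data)))

def verify_assertion(key_data):
    try:
        _check_keys(key_data, ("p", "q", "g", "y"))
    except ValueError:
        # Report a normal verification failure rather than erroring out.
        return {"status": "failure", "reason": "invalid-credentials"}
    return {"status": "okay"}

# The exact key shape from the log: RSA fields where DSA was expected.
res = verify_assertion({"e": "...", "algorithm": "RS", "n": "..."})
assert res["status"] == "failure"
```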

Crypto workers stop working after ~24h

Copied from https://bugzilla.mozilla.org/show_bug.cgi?id=757520

For an unknown reason, the crypto workers of the tokenserver stop working after some time running (roughly 24 hours).

The particular piece of code hanging out is:

  • retrieving browserid keys from the hosts, via the idproxy
  • checking signatures regarding the fetched public keys.

The problem is either there or in the layers on top of that (circus or powerhose), which ensure that the main python server is able to communicate with the workers.

Create a quick verification test for Stage deploys

Something similar to what we have for server-syncstorage: ./syncstorage/tests/functional/test_storage.py

and for fxa-auth-server: npm run test-remote

If I remember correctly the current unit tests (make test) are only designed to run locally.
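A remote check could be as small as validating the shape of the token-exchange response. This is a sketch; the field names follow the documented /1.0/sync/1.5 response, but the helper and sample values are assumptions:

```python
# Smoke-test helper: verify a token-exchange JSON body has the fields
# clients rely on, suitable for running against a Stage deploy.
def check_token_response(resp):
    required = ("id", "key", "uid", "api_endpoint", "duration")
    missing = [f for f in required if f not in resp]
    assert not missing, "missing fields: %s" % missing
    assert resp["api_endpoint"].startswith("https://")
    return True

# Example body (values are made up for illustration).
sample = {"id": "tok", "key": "secret", "uid": 42,
          "api_endpoint": "https://sync-1.stage.mozaws.net/1.5/42",
          "duration": 3600}
assert check_token_response(sample)
```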

Latest basic TokenServer load tests are returning 401s

I am using the very latest setup of TS and Sync and Verifier (all now in US East).
I am just running the basic test against TS Stage:
$ make test SERVER_URL=https://token.stage.mozaws.net

I see 401s in the nginx access logs.
[21/Feb/2014:14:03:12 -0500] "GET /1.0/sync/1.5 HTTP/1.1" 401 110...
[21/Feb/2014:14:03:37 -0500] "GET /1.0/sync/1.5 HTTP/1.1" 401 96...

I see errors in the tokenserver token.log file:
"name": "token.assertion.verify_failure"
"token.assertion.audience_mismatch_error"
(always paired)

These are consistent across all 3 instances of TS in Stage.

clean make of tokenserver doesn't install zope.interface so it is found

Steps to reproduce:

  1. do a clean git clone of tokenserver and type make
  2. after a successful make, do "bin/python -c 'import zope.interface'"

I get
Traceback (most recent call last):
File "", line 1, in
ImportError: No module named interface

zope.interface is located here (not sure why it does not get picked up):
./lib/python2.6/site-packages/zope.interface-3.8.0-py2.6-linux-x86_64.egg/zope/interface

Without it available, two of the tests of 'make test' fail.

figure out wrong-key confusion when FxA account is reset

We've been having an email discussion about how to prevent a "slap fight" between two devices on the same account, when the FxA account has been reset (changing the encryption key), but the devices disagree about whether the old key or the new key is correct. The worst case here is that the two devices keep deleting each other's data, because they get an HMAC error when they see something encrypted by the other device, so they wipe the server and re-upload with their own key. This could keep happening until the old-key device's token finally expires, and it cannot get a new one without re-logging in (at which point it will get the new key).

We haven't yet decided how to address this, but we're converging on a couple of possible solutions. Most involve the fxa-auth-server including a "generation number" in its certificates, so the tokenserver can distinguish between the "new" device and the old ones (mozilla/fxa-auth-server#486). Some also involve the tokenserver getting a hash of the encryption key, and mapping different (uid, keyhash) pairs to different sync-id values (so old and new devices get different sets of ciphertext).
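The (uid, keyhash) idea above can be sketched as follows; the function name, the key material argument, and the id format are illustrative, not a decided design:

```python
# Map (uid, hash-of-encryption-key) to a distinct storage identifier, so
# devices holding the old key and the new key stop sharing ciphertext.
import hashlib

def storage_id(uid, kB):
    keyhash = hashlib.sha256(kB).hexdigest()[:16]
    return "%s-%s" % (uid, keyhash)

old = storage_id(42, b"old-key")
new = storage_id(42, b"new-key")
assert old != new  # old-key devices can no longer wipe new-key data
```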

Opening this ticket so we'll have something to point at for the discussion.
CC @ckarlof @rfk

fd leaks

The token server is leaking fds on stage2.

This is probably in powerhose, either in the client Pool, or in the workers restarting.

Will write a test that counts the number of fds before and after each request to find out where the problem happens

/cc @fetep @ametaireau
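Such a test could be sketched like this (Linux-only: it counts entries in /proc/self/fd, and opening a file stands in for "one request"; a portable version would need a different probe):

```python
# Count open fds before and after an operation to spot a leak.
import os

def open_fd_count():
    return len(os.listdir("/proc/self/fd"))

before = open_fd_count()
f = open(os.devnull)              # stands in for one request/connection
leaked = open_fd_count() - before
f.close()                         # a leak-free path returns to baseline
assert leaked >= 1
assert open_fd_count() == before
```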

Failure attempting to build TS docs

Was attempting the following:
08:45 < tarek> its in docs/ https://github.com/mozilla-services/tokenserver/tree/master/docs
08:45 < tarek> if you want to build it:
08:46 < tarek> cd docs; SPHINXBUILD=../bin/sphinx-build make html
08:46 < tarek> then you have it in docs/build/html

Was not able to do this from a local install of tokenserver (git clone)

SPHINXBUILD=../bin/sphinx-build make html
../bin/sphinx-build -b html -d build/doctrees source build/html
Running Sphinx v1.1.2
loading pickled environment... done
building [html]: targets for 8 source files that are out of date
updating environment: 0 added, 0 changed, 0 removed
looking for now-outdated files... none found
preparing documents... done
IOError: [Errno 2] No such file or directory: u'../../lib/python2.6/site-packages/docutils-0.8.1-py2.6.egg/docutils/writers/html4css1/html4css1.css'
Unable to open source file for reading ('../../lib/python2.6/site-packages/docutils-0.8.1-py2.6.egg/docutils/writers/html4css1/html4css1.css'). Exiting.
make: *** [html] Error 1

From the "docs" directory:
ls ../../lib/python2.6/site-packages/docutils-0.8.1-py2.6.egg/docutils/writers/html4css1/html4css1.css
ls: cannot access ../../lib/python2.6/site-packages/docutils-0.8.1-py2.6.egg/docutils/writers/html4css1/html4css1.css: No such file or directory

Instead I try this:
ls ../lib/python2.6/site-packages/docutils-0.8.1-py2.6.egg/docutils/writers/html4css1/html4css1.css
../lib/python2.6/site-packages/docutils-0.8.1-py2.6.egg/docutils/writers/html4css1/html4css1.css

That works, so it's looking in the wrong place for docutils?

Getting "make build" errors on Mac

Well this is new and frustrating (since this was working even on OS X 10.9/Xcode 5.1.1).

git clone tokenserver
cd tokenserver
make build
make test
this all works

cd loadtest
make build

Now I see this:
make build
ARCHFLAGS=-Wno-error=unused-command-line-argument-hard-error-in-future ../bin/pip install pexpect
/bin/sh: ../bin/pip: No such file or directory
make: *** [build] Error 127

And, indeed, there is no "bin" directory in tokenserver

tests/{test_service,test_crypto_pyworker}.py require unittest2 on python 2.6

I've just been patching around this issue for a while, and should have put it
in as an issue. Anyways test_service.py and test_crypto_pyworker both use new
features from unittest in 2.7 (assertIn and context manager for assertRaises
respectively).

Since the production boxes run python 2.6.6, can we fix these tests to run there?
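The standard compatibility shim would be a conditional import at the top of each affected test module: stdlib unittest on 2.7+, the unittest2 backport on 2.6. A sketch (the test case here is illustrative):

```python
import io
import sys

# Prefer stdlib unittest; fall back to the unittest2 backport on 2.6,
# where assertIn and assertRaises-as-context-manager are missing.
if sys.version_info >= (2, 7):
    import unittest
else:  # pragma: no cover - python 2.6 only
    import unittest2 as unittest

class TestShim(unittest.TestCase):
    def test_assert_in(self):
        # assertIn is one of the 2.7-only features the tests rely on.
        self.assertIn("token", "tokenserver")

suite = unittest.TestLoader().loadTestsFromTestCase(TestShim)
result = unittest.TextTestRunner(stream=io.StringIO()).run(suite)
assert result.wasSuccessful()
```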
