
trexa-service's Introduction

Trexa Service

This project aims to provide a weekly Trexa 100k list, built from the Tranco and Alexa lists. In theory the list could be larger, so file an issue if a larger list would be useful to you.

Running locally

  1. clone the repo
  2. run cd trexa-service
  3. run python3 -m venv env
  4. run source env/bin/activate
  5. run pip3 install -r requirements.txt

Then continue to the sections below to run the tests or start the service.

Running the tests

FLASK_ENV=development pytest

Starting the service

From the project root:

Production mode: FLASK_APP=trexa flask run --host=0.0.0.0

or

Development mode: FLASK_ENV=development FLASK_APP=trexa flask run

Environment variables

The following environment variables can be defined to override defaults:

ZIP_DOWNLOADS_DEST
CSV_DOWNLOADS_DEST
FINAL_LIST_DEST
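
As a rough illustration of how such overrides usually work, the sketch below reads each variable from the environment and falls back to a default. Only the `trexa/static/lists` value for FINAL_LIST_DEST appears in this project's own error output; the other two defaults and the `load_destinations` helper are assumptions made for the example, not the project's actual configuration code.

```python
import os

# Fallbacks used when a variable is not set. Only the FINAL_LIST_DEST value
# appears in the project's output; the other two paths are hypothetical.
DEFAULTS = {
    "ZIP_DOWNLOADS_DEST": "downloads/zip",  # hypothetical default
    "CSV_DOWNLOADS_DEST": "downloads/csv",  # hypothetical default
    "FINAL_LIST_DEST": "trexa/static/lists",
}


def load_destinations(environ=os.environ):
    """Return the three destination paths, preferring environment overrides."""
    return {name: environ.get(name, default) for name, default in DEFAULTS.items()}
```

Passing a plain dict instead of `os.environ` makes the override behaviour easy to check without touching the real environment.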

HTTP Endpoints

This app exposes the following HTTP endpoints:

  • /lists: see all lists available for download
  • /lists/trexa-2020-05-21.csv: download a full, single list (150,000+ sites)
  • /api/lists/2020-05-21?count=N: download a single list, trimmed to N sites
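
The endpoint shapes above can be turned into URLs with a small helper. This is only a sketch: `list_url` is a hypothetical name, and the base URL in the usage note assumes the service is running locally on Flask's default port.

```python
from urllib.parse import urlencode, urljoin


def list_url(base_url, date, count=None):
    """Return the URL of the list for `date` (YYYY-MM-DD).

    Without `count`, point at the full CSV under /lists; with `count`,
    point at the API endpoint that trims the list to that many sites.
    """
    if count is None:
        return urljoin(base_url, f"/lists/trexa-{date}.csv")
    query = urlencode({"count": count})
    return urljoin(base_url, f"/api/lists/{date}?{query}")
```

For example, `list_url("http://localhost:5000", "2020-05-21", count=500)` builds the trimmed-list URL; the result can be passed to any HTTP client.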

License

This Source Code Form is subject to the terms of the Mozilla Public License, v. 2.0. If a copy of the MPL was not distributed with this file, You can obtain one at http://mozilla.org/MPL/2.0/

Code of Conduct

This project and repository are governed by Mozilla's code of conduct and etiquette guidelines. For more details, please see the Code of Conduct file.

trexa-service's People

Contributors

birdsarah

Forkers

birdsarah karlcow

trexa-service's Issues

first n lines of a file with StopIteration

with open(csv_file, 'r') as csv:
    head = [next(csv) for x in range(number)]

This is a fast, elegant solution.
One warning: if number is larger than the number of lines in the file, next() raises StopIteration and head is never assigned.

You could use islice instead; it does not fail when the number is larger than the file.
It has a reputation for being slower, but on a very small file I didn't find a difference.

python -m timeit 'with open("blah.txt", "r") as f:' '    head = [next(f) for x in range(3)]'
5000 loops, best of 5: 89.2 usec per loop

python -m timeit 'from itertools import islice;' 'with open("blah.txt", "r") as f:' '    head = list(islice(f,3))'
5000 loops, best of 5: 90.5 usec per loop
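
A minimal sketch of the islice variant as a reusable function; the name `head` and its signature are chosen here for illustration and are not taken from the project.

```python
from itertools import islice


def head(path, number):
    """Return up to `number` lines from the start of the file at `path`.

    islice simply stops when the file runs out, so asking for more lines
    than the file contains returns the whole file instead of raising.
    """
    with open(path, "r") as f:
        return list(islice(f, number))
```

Unlike the next()-based comprehension, this returns a shorter list rather than raising StopIteration on a short file.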

Add License headers for all files

Related to #5

There's a general license declaration in the repo.
I don't know if there is a policy from Mozilla that the license should be in each individual file.

Tests failing - AttributeError: 'FlaskClient' object has no attribute 'config'

Related to #5

Clean repo, fresh install.
Tests are failing following the README.

(env) ~/code/trexa-service % FLASK_ENV=development pytest
================================================================= test session starts =================================================================
platform darwin -- Python 3.7.6, pytest-5.3.5, py-1.8.1, pluggy-0.13.1
rootdir: /Users/karl/code/trexa-service
plugins: hypothesis-5.5.4, arraydiff-0.3, remotedata-0.3.2, openfiles-0.4.0, doctestplus-0.5.0, astropy-header-0.1.2
collected 9 items                                                                                                                                     

tests/test_api_endspoints.py ...                                                                                                                [ 33%]
tests/test_api_helpers.py ...                                                                                                                   [ 66%]
tests/test_application_endpoints.py ..F                                                                                                         [100%]

====================================================================== FAILURES =======================================================================
_________________________________________________________________ test_list_download __________________________________________________________________

client = <FlaskClient <Flask 'trexa'>>

    def test_list_download(client):
        """This should only pass for local development or tests."""
>       print(client.config)
E       AttributeError: 'FlaskClient' object has no attribute 'config'

tests/test_application_endpoints.py:42: AttributeError
============================================================= 1 failed, 8 passed in 0.67s =============================================================
% python -V
Python 3.7.6

commit 1105042293e5424dfe97bb9e7f8ce1e96261e9db (HEAD -> master, origin/master, origin/HEAD)
Author: Mike Taylor <[email protected]>
Date:   Thu May 28 15:06:49 2020 -0500

    No issue - change the shape of the API to be driven by date, rather than file name

Tools directory for things unrelated to the app.

Related to #5

While reading get_tranco, I noticed a couple of inconsistencies.

def get_tranco():
    """Download the latest Tranco list."""
    print('Downloading the Tranco List...')
    headers = {'user-agent': 'mozilla-trexa-service'}
    today = strftime('%Y-%m-%d')
    todays_csv = f'{CSV_DOWNLOADS_DEST}/tranco-100k-{today}.csv'
    # headers must be passed by keyword: the second positional argument
    # of requests.get is params, not headers.
    r = requests.get(TRANCO_100K_URI, headers=headers)
    # We expect a 30X redirect to the list URL that looks like:
    # https://tranco-list.eu/list/VK9N/100000
    list_url = r.url
    # the download URL looks like:
    # https://tranco-list.eu/download/VK9N/100000
    list_url = list_url.replace('/list/', '/download/')
    dl = requests.get(list_url, headers=headers, stream=True)
    # Tranco doesn't seem to set a dynamic content-length for these lists
    # ¯\_(ツ)_/¯. Maybe they will one day.
    total_size = int(dl.headers.get('content-length', 0))
    t = tqdm(total=total_size, unit='iB', unit_scale=True)
    with open(todays_csv, 'wb') as fd:
        for chunk in dl.iter_content(chunk_size=1024):
            if chunk:
                t.update(len(chunk))
                fd.write(chunk)
    # return the name of the file, so it can be passed to build_trexa
    return todays_csv

TRANCO_100K_URI is hardcoded in the module, while CSV_DOWNLOADS_DEST comes from the config.
That seems inconsistent.

Why not pass the config object into build_list?

alexa = get_alexa()
tranco = get_tranco()
build_trexa(alexa, tranco)
clean_up()

build_list is not connected to the app at all, but for it to work, helpers has to import the full app just to get access to the config object.

from trexa import app

Importing the whole app is probably not very useful here. It would be better to import only the config object for the tools.

Suggestion:
Create a tools directory that separates the tools from the app and its helpers.
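
One way this separation could look, as a sketch only: the tools receive a small config object instead of importing the Flask app. `ToolsConfig` and `tranco_csv_path` are hypothetical names, and the URI in the usage note is a placeholder, not the real Tranco address.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolsConfig:
    """Settings the list-building tools need, independent of the Flask app."""

    tranco_100k_uri: str      # would replace the hardcoded TRANCO_100K_URI
    csv_downloads_dest: str   # same meaning as CSV_DOWNLOADS_DEST


def tranco_csv_path(config, today):
    """Build the path get_tranco writes to, using only the passed-in config."""
    return f"{config.csv_downloads_dest}/tranco-100k-{today}.csv"
```

With something like `config = ToolsConfig("https://example.invalid/tranco", "downloads/csv")`, build_list could pass the same object to every step, and neither the tools nor the helpers would need `from trexa import app`.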

500 error on lists route failing on prod

related to #5

Just following the instructions here in the README

(env) ~/code/trexa-service % FLASK_APP=trexa flask run --host=0.0.0.0
 * Serving Flask app "trexa"
 * Environment: production
   WARNING: This is a development server. Do not use it in a production deployment.
   Use a production WSGI server instead.
 * Debug mode: off
 * Running on http://0.0.0.0:5000/ (Press CTRL+C to quit)
127.0.0.1 - - [08/Jun/2020 10:01:27] "GET / HTTP/1.1" 200 -
127.0.0.1 - - [08/Jun/2020 10:01:27] "GET /favicon.ico HTTP/1.1" 404 -
[2020-06-08 10:02:01,263] ERROR in app: Exception on /lists [GET]
Traceback (most recent call last):
  File "/Users/karl/opt/anaconda3/lib/python3.7/site-packages/flask/app.py", line 2446, in wsgi_app
    response = self.full_dispatch_request()
  File "/Users/karl/opt/anaconda3/lib/python3.7/site-packages/flask/app.py", line 1951, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/Users/karl/opt/anaconda3/lib/python3.7/site-packages/flask/app.py", line 1820, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/Users/karl/opt/anaconda3/lib/python3.7/site-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/Users/karl/opt/anaconda3/lib/python3.7/site-packages/flask/app.py", line 1949, in full_dispatch_request
    rv = self.dispatch_request()
  File "/Users/karl/opt/anaconda3/lib/python3.7/site-packages/flask/app.py", line 1935, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/Users/karl/code/trexa-service/trexa/__init__.py", line 37, in show_downloads
    files = os.listdir(app.config['FINAL_LIST_DEST'])
FileNotFoundError: [Errno 2] No such file or directory: 'trexa/static/lists'
127.0.0.1 - - [08/Jun/2020 10:02:01] "GET /lists HTTP/1.1" 500 -
127.0.0.1 - - [08/Jun/2020 10:04:02] "GET /lists/trexa-2020-05-21.csv HTTP/1.1" 404 -
127.0.0.1 - - [08/Jun/2020 10:04:24] "GET /api/lists/2020-05-21?count=N HTTP/1.1" 404 -

So there's a 500 error on the lists route.
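
The traceback shows os.listdir raising FileNotFoundError because the directory has not been created yet. One possible guard, sketched here with a hypothetical `list_files` helper rather than the project's actual fix, is to treat a missing directory as an empty list:

```python
import os


def list_files(directory):
    """Return the sorted file names in `directory`, or [] if it is missing."""
    try:
        return sorted(os.listdir(directory))
    except FileNotFoundError:
        # No lists have been generated yet, so there is nothing to show.
        return []
```

An alternative would be to create the directory at startup with `os.makedirs(directory, exist_ok=True)`; which is preferable depends on whether the service should be allowed to write there.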

suggestions for api endpoint

This is part of issue #5 review requested by @miketaylr

return abort(403)

return abort(404)

I don't think the code needs return abort(404). Calling abort(404) on its own is enough: abort raises an exception, so it never returns a value.

return Response(trim_csv(csv_path, count=count), mimetype='text/csv')

Is using mimetype (no charset) instead of content_type (with charset) intentional?
