alephdata / memorious

Lightweight web scraping toolkit for documents and structured data.

Home Page: https://docs.alephdata.org/developers/memorious

License: MIT License

Languages: Python 98.72%, Makefile 0.51%, Shell 0.23%, Dockerfile 0.55%
Topics: crawling, scraping, scraping-framework

memorious's Introduction

Memorious

The solitary and lucid spectator of a multiform, instantaneous and almost intolerably precise world.

-- Funes the Memorious, Jorge Luis Borges

memorious is a lightweight web scraping toolkit. It supports scrapers that collect structured or unstructured data. This includes the following use cases:

  • Make crawlers modular and simple tasks reusable
  • Provide utility functions for common tasks such as data storage and HTTP session management
  • Integrate crawlers with the Aleph and FollowTheMoney ecosystem
  • Get out of your way as much as possible

Design

When writing a scraper, you often need to paginate through an index page, then download an HTML page for each result, and finally parse that page and insert or update a record in a database.

memorious handles this by managing a set of crawlers, each of which can be composed of multiple stages. Each stage is implemented using a Python function, which can be re-used across different crawlers.
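
For illustration, a single stage might look something like the sketch below. The context/data signature and the context.http / context.emit helpers follow the patterns used in the example crawlers shipped with this repository; the selector and field names are made up for the example.

# Sketch of a reusable stage: fetch the page from the incoming task and
# emit one follow-up task per link found on it.
def parse_index(context, data):
    url = data.get("url")
    result = context.http.get(url)           # managed HTTP session with caching
    for link in result.html.findall(".//a"):
        context.emit(data={"url": link.get("href")})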

The basic steps of writing a Memorious crawler:

  1. Make a YAML crawler configuration file
  2. Add different stages
  3. Write code for stage operations (optional)
  4. Test, rinse, repeat

Documentation

The documentation for Memorious is available at alephdata.github.io/memorious. Feel free to edit the source files in the docs folder and send pull requests for improvements.

To build the documentation, run make html inside the docs folder.

You'll find the resulting HTML files in /docs/_build/html.

memorious's People

Contributors

brrttwrks, catileptic, danohu, dependabot-preview[bot], dependabot[bot], edwardbetts, moreymat, patcon, pudo, rhiaro, rosencrantz, simonwoerpel, smmbllsm, stchris, sunu, tillprochaska, todorus, uhhhuh

memorious's Issues

Make memorious run under Python 3

This will be made difficult by the current lack of test coverage. It may be worth having a few basic tests in place in order to know the degree to which using Python 3 will impede the execution of the tool. Porting itself, I imagine, will mainly be about fixing a bunch of imports.

Data validation stage

Make a re-usable stage that validates the incoming data against a specified schema. This could be done using JSON schema, or a light version of that.
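
A minimal sketch of what such a stage could look like, using the jsonschema library; the params/emit/log usage mirrors other memorious operations, and the stage itself is hypothetical:

# Hypothetical validation stage: check incoming data against a JSON Schema
# supplied in the stage params, emit on success, log a warning otherwise.
from jsonschema import ValidationError, validate

def validate_data(context, data):
    schema = context.params.get("schema", {})
    try:
        validate(instance=data, schema=schema)
    except ValidationError as exc:
        context.log.warning("Validation failed: %s", exc.message)
        return
    context.emit(data=data)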

Make tags with expiration

Allow crawlers to check for a tag which expires after a given interval, e.g. 90-120 days. Crawlers thus become incremental but also repeat checks after a certain interval.
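
A rough sketch of how a crawler stage could use such a tag; set_tag is what the built-in fetch operation already uses, while check_tag and per-tag expiry are assumptions about the proposed feature:

# Hypothetical incremental stage: skip items whose tag has not yet expired.
def emit_if_stale(context, data):
    tag = "seen:%s" % data.get("url")
    if context.check_tag(tag):             # assumed helper: is the tag still valid?
        context.log.info("Skipping %s, tag not yet expired", data.get("url"))
        return
    context.emit(data=data)
    context.set_tag(tag, "1")              # expiry would come from the interval above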

Load new crawlers without a restart

If new crawlers are added to the CONFIG_PATH, the celery workers can't see them without being turned off and on again.

Something about dynamic code reloading.

Add user authentication and scraper namespacing to Memorious

We want other people to run their own scrapers on our platform. Ideally, they will have the permissions to see, run and update their own crawlers. We will have 3 tiers of permissions for users:

  • public: You can see the public crawlers, but can't run/cancel them
  • authorized: You can see, run and cancel your own crawlers
  • admin: God mode

Scraper Namespacing: Scrapers should be part of a scraper group or namespace. We can figure out which namespace or group a user belongs to from request headers provided by Keycloak. The headers are put into the request by Keycloak Gatekeeper.
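
As a rough sketch, the UI could derive the namespace from such a header; the X-Auth-Groups name is an assumption about what the proxy forwards, not a confirmed header:

# Hypothetical helper for the Flask UI: read the group/namespace from the
# headers injected by the authenticating proxy.
from flask import request

def current_namespace():
    groups = request.headers.get("X-Auth-Groups", "")   # header name is an assumption
    return groups.split(",")[0].strip() or None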

Running a scraper in the example fails with an error when calling context.set_tag(tag, None)

I ran docker-compose up and was able to view the UI. However, if I run any of the three crawlers I get an error like this:

worker_1  | INFO:book_scraper.init:[book_scraper->init(seed)]: b7feff0ca24511e9af6f0242ac130003
worker_1  | INFO:book_scraper.fetch:[book_scraper->fetch(fetch)]: b7feff0ca24511e9af6f0242ac130003
worker_1  | INFO:book_scraper.fetch:Using cached HTTP response: http://books.toscrape.com/
worker_1  | INFO:book_scraper.fetch:Fetched [200]: 'http://books.toscrape.com/'
worker_1  | ERROR:book_scraper.fetch:Invalid input of type: 'NoneType'. Convert to a byte, string or number first.
worker_1  | Traceback (most recent call last):
worker_1  |   File "/memorious/memorious/logic/context.py", line 75, in execute
worker_1  |     return self.stage.method(self, data)
worker_1  |   File "/memorious/memorious/operations/fetch.py", line 32, in fetch
worker_1  |     context.set_tag(tag, None)
worker_1  |   File "/memorious/memorious/logic/context.py", line 105, in set_tag
worker_1  |     return conn.set(key, data, ex=self.crawler.expire)
worker_1  |   File "/usr/lib/python3.7/site-packages/redis/client.py", line 1451, in set
worker_1  |     return self.execute_command('SET', *pieces)
worker_1  |   File "/usr/lib/python3.7/site-packages/redis/client.py", line 774, in execute_command
worker_1  |     connection.send_command(*args)
worker_1  |   File "/usr/lib/python3.7/site-packages/redis/connection.py", line 620, in send_command
worker_1  |     self.send_packed_command(self.pack_command(*args))
worker_1  |   File "/usr/lib/python3.7/site-packages/redis/connection.py", line 663, in pack_command
worker_1  |     for arg in imap(self.encoder.encode, args):
worker_1  |   File "/usr/lib/python3.7/site-packages/redis/connection.py", line 125, in encode
worker_1  |     "byte, string or number first." % typename)
worker_1  | redis.exceptions.DataError: Invalid input of type: 'NoneType'. Convert to a byte, string or number first.

This happens both from the UI and when I run the command:

docker-compose exec worker memorious run book_scraper

The quote scraper shows the above error as well as this:

worker_1  | INFO:memorious.logic.crawler:Running aggregator for quote_scraper
worker_1  | ERROR:memorious.task_runner:Task failed to execute:
worker_1  | Traceback (most recent call last):
worker_1  |   File "/memorious/memorious/task_runner.py", line 59, in process
worker_1  |     cls.execute(*item)
worker_1  |   File "/memorious/memorious/task_runner.py", line 49, in execute
worker_1  |     context.crawler.aggregate(context)
worker_1  |   File "/memorious/memorious/logic/crawler.py", line 80, in aggregate
worker_1  |     context, self.aggregator_config.get("params", {})
worker_1  |   File "/crawlers/src/example/quotes.py", line 71, in export
worker_1  |     table = context.datastore[context.params.get("table")]
worker_1  |   File "/usr/lib/python3.7/site-packages/werkzeug/local.py", line 378, in <lambda>
worker_1  |     __getitem__ = lambda x, i: x._get_current_object()[i]
worker_1  |   File "/usr/lib/python3.7/site-packages/dataset/database.py", line 222, in __getitem__
worker_1  |     return self.get_table(table_name)
worker_1  |   File "/usr/lib/python3.7/site-packages/dataset/database.py", line 218, in get_table
worker_1  |     return self.create_table(table_name, primary_id, primary_type)
worker_1  |   File "/usr/lib/python3.7/site-packages/dataset/database.py", line 181, in create_table
worker_1  |     table_name = normalize_table_name(table_name)
worker_1  |   File "/usr/lib/python3.7/site-packages/dataset/util.py", line 79, in normalize_table_name
worker_1  |     raise ValueError("Invalid table name: %r" % name)
worker_1  | ValueError: Invalid table name: None

Crawler 'sample' mode

I would like to be able to run crawlers with a flag that tells them to only download a subset of the data before finishing.

It should be a CLI flag rather than something in the YAML config, I think, so it can be used on the fly for testing or demo purposes; and it needs to make sure the whole pipeline is run beginning to end for at least one 'thing'.

What this means is going to be different for different crawlers.

For the simplest crawlers it would probably be something that hijacks the seed stage, or however URLs are generated, and cuts short the list of what gets passed on to the next stage. For recursive crawlers, or ones where the downloads of the things we actually want happen later in the pipeline (because search results have to be parsed, files have to be fetched, etc.), it's going to be more complicated.

It might be impossible to have a sample mode that works uniformly for crawlers that aren't explicitly configured for it. So maybe the best option is to make the presence of a --sample flag easily available to crawlers' Python functions (e.g. via the context) so crawlers can customise the most appropriate stage to respond with a subset of results.
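
A sketch of what that could look like for a simple seed-style stage, assuming the flag were exposed through context.params (the "sample" key is the proposal here, not existing API):

# Hypothetical seed stage honouring a --sample flag passed via the context.
def seed(context, data):
    urls = context.params.get("urls", [])
    if context.params.get("sample"):
        urls = urls[:1]                    # push only one item through the pipeline
    for url in urls:
        context.emit(data={"url": url})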

Add data validation helpers as part of context logic

Needed methods:

  • .isNotEmpty(p)
  • .isNumeric(p)
  • .isInteger(p)
  • .matchDate(p)
  • .matchRegexp(p) / p matches a regexp
  • .hasLength(p) / p is of a specific length
  • .mustContain(p, q) / p contains a specific character q

Be able to issue warnings/exceptions depending on whether the variable is optional or required.
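
A sketch of a few of these helpers as plain functions (names are Pythonised; hooking them onto the context and the warn-vs-raise behaviour is left open):

# Possible implementations of some of the requested validators.
import re
from datetime import datetime

def is_numeric(value):
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False

def match_date(value, fmt="%Y-%m-%d"):
    try:
        datetime.strptime(value, fmt)
        return True
    except (TypeError, ValueError):
        return False

def match_regexp(value, pattern):
    return value is not None and re.search(pattern, value) is not None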

OCR helper function

  • Based on tesseract and imagemagick
  • Allow for single-line text recognition
  • De-noise, increase contrast.
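
A possible shape for the helper, sketched with Pillow and pytesseract (the package choice and preprocessing steps are assumptions; tesseract's --psm 7 mode covers the single-line case):

# Hypothetical OCR helper: grayscale, de-noise, boost contrast, then run
# tesseract in single-line mode.
import pytesseract
from PIL import Image, ImageEnhance, ImageFilter, ImageOps

def read_single_line(path):
    img = ImageOps.grayscale(Image.open(path))
    img = img.filter(ImageFilter.MedianFilter(size=3))    # de-noise
    img = ImageEnhance.Contrast(img).enhance(2.0)          # increase contrast
    return pytesseract.image_to_string(img, config="--psm 7").strip()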

Show a warning if in multi-threaded mode and the data storage is SQLite

ERROR:tj_procurement.store_record:(sqlite3.ProgrammingError) SQLite objects created in a thread can only be used in that same thread. The object was created in thread id 139692792108800 and this is thread id 139692800501504 (Background on this error at: http://sqlalche.me/e/f405)
Traceback (most recent call last):
  File "/home/memorious/env/lib/python3.4/site-packages/sqlalchemy/engine/base.py", line 1127, in _execute_context
    context = constructor(dialect, self, conn, *args)
  File "/home/memorious/env/lib/python3.4/site-packages/sqlalchemy/engine/default.py", line 637, in _init_compiled
    self.cursor = self.create_cursor()
  File "/home/memorious/env/lib/python3.4/site-packages/sqlalchemy/engine/default.py", line 952, in create_cursor
    return self._dbapi_connection.cursor()
  File "/home/memorious/env/lib/python3.4/site-packages/sqlalchemy/pool.py", line 977, in cursor
    return self.connection.cursor(*args, **kwargs)
sqlite3.ProgrammingError: SQLite objects created in a thread can only be used in that same thread. The object was created in thread id 139692792108800 and this is thread id 139692800501504

Fixed with `export MEMORIOUS_DEBUG=true`
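
The warning itself could be a small startup check along these lines; how memorious exposes the engine and thread count is left as an assumption:

# Sketch of the requested warning, given a SQLAlchemy engine and the
# configured number of worker threads.
import logging

log = logging.getLogger(__name__)

def warn_if_sqlite(engine, threads):
    if engine.dialect.name == "sqlite" and threads > 1:
        log.warning("Datastore is SQLite but %d threads are configured; "
                    "SQLite connections cannot be shared across threads.", threads)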

Make crawler discovery and configuration easier

More of a discussion ticket.

  • Can we have multiple crawler search directories? Do we want to split YAML directories and code directories?
  • How can it be made easier to reference the code modules? Should it be possible to place the YAML config inside the code?

Let other users add and run their own crawlers on our platform

One potential solution for this is to have a webhook that listens for changes on a fixed set of git repositories and pull crawlers from them into our repository of crawlers.

Note to self:

Possible ways to do it:

  • Webhook triggered
  • Pull github repo
  • Push them into the central repo of crawlers

Option A

  • Figure out what changed (how?)
  • Flush the remaining ops for the crawlers that changed from the queue
  • Reload the crawlers into the manager

Option B

  • Let the user explicitly tell us to reload the crawler
  • Flush the ops
  • Reload the crawler
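
A very rough sketch of the webhook end of this; the repository whitelist, paths and the reload step are all placeholders:

# Hypothetical webhook endpoint: pull a whitelisted crawler repo on push.
import subprocess
from flask import Flask, abort, request

app = Flask(__name__)
ALLOWED_REPOS = {"example-crawlers": "/crawlers/src/example-crawlers"}

@app.route("/hooks/crawlers", methods=["POST"])
def pull_crawlers():
    payload = request.get_json(silent=True) or {}
    path = ALLOWED_REPOS.get(payload.get("repository", {}).get("name"))
    if path is None:
        abort(404)
    subprocess.run(["git", "-C", path, "pull", "--ff-only"], check=True)
    # flushing queued ops and reloading the crawler manager would happen here
    return "ok"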

Frequent database deadlock errors

We have a rate limit in place for db operations. But under load, the db still locks up sometimes, throwing errors like:

(psycopg2.errors.DeadlockDetected) deadlock detected DETAIL: Process 974074 waits for ShareLock on transaction 2274650; blocked by process 974077. Process 974077 waits for ShareLock on transaction 2274652; blocked by process 974074.

These errors drown out other less frequent errors on the events screen.
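
Until the contention itself is fixed, one mitigation is a bounded retry around the offending operations; psycopg2's DeadlockDetected surfaces through SQLAlchemy as an OperationalError, so a sketch could look like this (where exactly it would hook into memorious is open):

# Retry a callable a few times when the database reports a deadlock.
import time
from sqlalchemy.exc import OperationalError

def retry_on_deadlock(fn, attempts=3, delay=0.5):
    for attempt in range(attempts):
        try:
            return fn()
        except OperationalError:
            if attempt == attempts - 1:
                raise
            time.sleep(delay * (attempt + 1))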

Reference documents from structured data scrapes

As a user, I want to be able to scrape a source which gives me both structured and unstructured data. For example, while scraping a procurement portal, I might want to download contract metadata, but also a contract document as a PDF file. While both things are possible in memorious, there is currently no way to make them show up in Aleph such that the structured data record (e.g. a mapped Contract) refers to the ingested Document by its ID.

To solve this, we need some mechanism for importing both the structured and unstructured content into the same collection in such a way that structured entities can refer to the documents by their ID.

Handle recursion error when run without queue

When running with the celery backend, the tool will emit tasks from one operation and put them on a queue, rather than executing them directly. When run without that queue active, however, this mode of spawning subsidiary tasks leads to recursion errors when Python limits the stack depth of the process.

This probably means we should handle the non-queued execution of crawlers differently, e.g. by using a Python Queue to put tasks into a local pool in order to have them executed.

ERROR:<CENSORED>.parse:maximum recursion depth exceeded while calling a Python object
Traceback (most recent call last):
  File "/Users/fl/Code/occrp/memorious/memorious/logic/context.py", line 75, in execute
    res = self.stage.method(self, data)
  File "/Users/fl/Code/occrp/data.occrp.org/crawlers/src/<CENSORED>.py", line 48, in parse
    context.emit(data={'url': next_url})
  File "/Users/fl/Code/occrp/memorious/memorious/logic/context.py", line 53, in emit
    handle.apply_async((state, stage, data), countdown=delay)
  File "/Users/fl/.virtualenvs/funes/lib/python2.7/site-packages/celery/app/task.py", line 523, in apply_async
    link=link, link_error=link_error, **options)
  File "/Users/fl/.virtualenvs/funes/lib/python2.7/site-packages/celery/app/task.py", line 741, in apply
    ret = tracer(task_id, args, kwargs, request)
  File "/Users/fl/.virtualenvs/funes/lib/python2.7/site-packages/celery/app/trace.py", line 388, in trace_task
    I, R, state, retval = on_error(task_request, exc, uuid)
  File "/Users/fl/.virtualenvs/funes/lib/python2.7/site-packages/celery/app/trace.py", line 374, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/Users/fl/Code/occrp/memorious/memorious/logic/context.py", line 217, in handle
    context.execute(data)
  File "/Users/fl/Code/occrp/memorious/memorious/logic/context.py", line 71, in execute
    self.operation_id = op.id
  File "/Users/fl/.virtualenvs/funes/lib/python2.7/site-packages/sqlalchemy/orm/attributes.py", line 237, in __get__
    return self.impl.get(instance_state(instance), dict_)
  File "/Users/fl/.virtualenvs/funes/lib/python2.7/site-packages/sqlalchemy/orm/attributes.py", line 579, in get
    value = state._load_expired(state, passive)
  File "/Users/fl/.virtualenvs/funes/lib/python2.7/site-packages/sqlalchemy/orm/state.py", line 592, in _load_expired
    self.manager.deferred_scalar_loader(self, toload)
  File "/Users/fl/.virtualenvs/funes/lib/python2.7/site-packages/sqlalchemy/orm/loading.py", line 713, in load_scalar_attributes
    only_load_props=attribute_names)
  File "/Users/fl/.virtualenvs/funes/lib/python2.7/site-packages/sqlalchemy/orm/loading.py", line 223, in load_on_ident
    return q.one()
  File "/Users/fl/.virtualenvs/funes/lib/python2.7/site-packages/sqlalchemy/orm/query.py", line 2814, in one
    ret = self.one_or_none()
  File "/Users/fl/.virtualenvs/funes/lib/python2.7/site-packages/sqlalchemy/orm/query.py", line 2784, in one_or_none
    ret = list(self)
  File "/Users/fl/.virtualenvs/funes/lib/python2.7/site-packages/sqlalchemy/orm/query.py", line 2855, in __iter__
    return self._execute_and_instances(context)
  File "/Users/fl/.virtualenvs/funes/lib/python2.7/site-packages/sqlalchemy/orm/query.py", line 2878, in _execute_and_instances
    result = conn.execute(querycontext.statement, self._params)
  File "/Users/fl/.virtualenvs/funes/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 945, in execute
    return meth(self, multiparams, params)
  File "/Users/fl/.virtualenvs/funes/lib/python2.7/site-packages/sqlalchemy/sql/elements.py", line 263, in _execute_on_connection
    return connection._execute_clauseelement(self, multiparams, params)
  File "/Users/fl/.virtualenvs/funes/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1046, in _execute_clauseelement
    if not self.schema_for_object.is_default else None)
  File "<string>", line 1, in <lambda>
  File "/Users/fl/.virtualenvs/funes/lib/python2.7/site-packages/sqlalchemy/sql/elements.py", line 436, in compile
    return self._compiler(dialect, bind=bind, **kw)
  File "/Users/fl/.virtualenvs/funes/lib/python2.7/site-packages/sqlalchemy/sql/elements.py", line 442, in _compiler
    return dialect.statement_compiler(dialect, self, **kw)
  File "/Users/fl/.virtualenvs/funes/lib/python2.7/site-packages/sqlalchemy/sql/compiler.py", line 435, in __init__
    Compiled.__init__(self, dialect, statement, **kwargs)
  File "/Users/fl/.virtualenvs/funes/lib/python2.7/site-packages/sqlalchemy/sql/compiler.py", line 216, in __init__
    self.string = self.process(self.statement, **compile_kwargs)
  File "/Users/fl/.virtualenvs/funes/lib/python2.7/site-packages/sqlalchemy/sql/compiler.py", line 242, in process
    return obj._compiler_dispatch(self, **kwargs)
  File "/Users/fl/.virtualenvs/funes/lib/python2.7/site-packages/sqlalchemy/sql/visitors.py", line 81, in _compiler_dispatch
    return meth(self, **kw)
  File "/Users/fl/.virtualenvs/funes/lib/python2.7/site-packages/sqlalchemy/sql/compiler.py", line 1747, in visit_select
    text, select, inner_columns, froms, byfrom, kwargs)
  File "/Users/fl/.virtualenvs/funes/lib/python2.7/site-packages/sqlalchemy/sql/compiler.py", line 1831, in _compose_select_body
    t = select._whereclause._compiler_dispatch(self, **kwargs)
  File "/Users/fl/.virtualenvs/funes/lib/python2.7/site-packages/sqlalchemy/sql/visitors.py", line 93, in _compiler_dispatch
    return meth(self, **kw)
  File "/Users/fl/.virtualenvs/funes/lib/python2.7/site-packages/sqlalchemy/sql/compiler.py", line 1034, in visit_binary
    return self._generate_generic_binary(binary, opstring, **kw)
  File "/Users/fl/.virtualenvs/funes/lib/python2.7/site-packages/sqlalchemy/sql/compiler.py", line 1059, in _generate_generic_binary
    self, eager_grouping=eager_grouping, **kw)
  File "/Users/fl/.virtualenvs/funes/lib/python2.7/site-packages/sqlalchemy/sql/visitors.py", line 93, in _compiler_dispatch
    return meth(self, **kw)
  File "/Users/fl/.virtualenvs/funes/lib/python2.7/site-packages/sqlalchemy/sql/compiler.py", line 1192, in visit_bindparam
    name = self._truncate_bindparam(bindparam)
  File "/Users/fl/.virtualenvs/funes/lib/python2.7/site-packages/sqlalchemy/sql/compiler.py", line 1248, in _truncate_bindparam
    bind_name = self._truncated_identifier("bindparam", bind_name)
  File "/Users/fl/.virtualenvs/funes/lib/python2.7/site-packages/sqlalchemy/sql/compiler.py", line 1259, in _truncated_identifier
    anonname = name.apply_map(self.anon_map)
  File "/Users/fl/.virtualenvs/funes/lib/python2.7/site-packages/sqlalchemy/sql/elements.py", line 4071, in apply_map
    return self % map_
RuntimeError: maximum recursion depth exceeded while calling a Python object
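
A minimal sketch of the local-pool idea described above, assuming stages were written (or wrapped) to yield follow-up tasks instead of calling the next stage directly; the real task runner would need to map this onto memorious' stage and context objects:

# Drain tasks from a flat queue instead of recursing through handle()/execute(),
# so the stack depth stays constant no matter how deep the crawl goes.
from queue import Queue

def run_locally(first_stage, first_data):
    tasks = Queue()
    tasks.put((first_stage, first_data))
    while not tasks.empty():
        stage, data = tasks.get()
        for next_stage, next_data in stage(data):   # stages yield (stage, data) pairs
            tasks.put((next_stage, next_data))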

Memorious admin user interface

Needed:

  • See all crawlers and how many runs and operations they have.
  • Run a particular crawler right now
  • Flush a crawler's state
  • See an overview of the operations of each crawler
  • See individual failures per operation

More coherent helpers for search results

Processing a listing of paged results should be built in.

  • Paging by following 'next' links
  • Paging by calculating a sequence from the number of results or the 'last' link
  • Paging by incrementing a page number until an empty result
  • Extracting links and titles from the list of results

Started in helpers/init
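
For the 'next link' variant listed above, a reusable stage could look roughly like this; the XPaths come from stage params, and the context.http / emit / recurse usage follows the existing operations (treat the details as a sketch):

# Hypothetical paging stage: emit every result link, then recurse onto the
# next page while a rel="next" link exists.
def paginate(context, data):
    result = context.http.get(data["url"])
    for href in result.html.xpath(context.params.get("result_xpath", ".//a/@href")):
        context.emit(data={"url": href})
    next_urls = result.html.xpath(".//a[@rel='next']/@href")
    if next_urls:
        context.recurse(data={"url": next_urls[0]})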

`memorious run` command never finishes

I'm trying to automate the creation of new CSV files in opensanctions. When I run memorious run us_ofac, the last thing it does is store the data with a bunch of lines like this:

INFO:us_ofac.store:[us_ofac->store(balkhash_put)]: 2a24a44ca3e211e9a2fc0242ac160005
INFO:us_ofac.store:[us_ofac->store(balkhash_put)]: 2a24a44ca3e211e9a2fc0242ac160005
INFO:us_ofac.store:[us_ofac->store(balkhash_put)]: 2a24a44ca3e211e9a2fc0242ac160005
INFO:us_ofac.store:[us_ofac->store(balkhash_put)]: 2a24a44ca3e211e9a2fc0242ac160005
INFO:us_ofac.store:[us_ofac->store(balkhash_put)]: 2a24a44ca3e211e9a2fc0242ac160005
INFO:us_ofac.store:[us_ofac->store(balkhash_put)]: 2a24a44ca3e211e9a2fc0242ac160005
INFO:us_ofac.store:[us_ofac->store(balkhash_put)]: 2a24a44ca3e211e9a2fc0242ac160005
INFO:us_ofac.store:[us_ofac->store(balkhash_put)]: 2a24a44ca3e211e9a2fc0242ac160005

Then the command stays stuck open indefinitely. I'd like to write a shell script which performs other steps after this, but I don't see a good way to stop the command once it's finished and continue with the rest.

I don't want to use the memorious schedule or leave anything running all day. I want to briefly bring up the crawler once a day to scrape and produce files and then shut it down. It's also a problem in local development when testing changes to a crawler.

I've opened this here because it happens with all the opensanctions crawlers, so I'm assuming it's a core memorious problem. Does the command ever finish for other crawlers?

Build app-level rate limiting

This should be per-stage, perhaps by just putting items back on the queue if a stage is about to exceed its rate limit.
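
A sketch of the check, using a per-second counter in Redis; the conn handle and the re-queueing call are placeholders for whatever memorious exposes internally:

# Return True if the stage still has budget in the current second.
def acquire_rate_limit(conn, stage_name, per_second):
    key = "rate:%s" % stage_name
    count = conn.incr(key)
    if count == 1:
        conn.expire(key, 1)
    return count <= per_second

# A caller could then re-queue instead of executing, e.g.:
#   if not acquire_rate_limit(conn, stage.name, limit):
#       context.emit(stage=stage.name, data=data, delay=1)   # put it back on the queue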
