alephdata / memorious

Lightweight web scraping toolkit for documents and structured data.

Home Page: https://docs.alephdata.org/developers/memorious

License: MIT License

Languages: Python 98.72%, Makefile 0.51%, Shell 0.23%, Dockerfile 0.55%
Topics: crawling, scraping, scraping-framework

memorious's Introduction

Memorious

The solitary and lucid spectator of a multiform, instantaneous and almost intolerably precise world.

-- Funes the Memorious, Jorge Luis Borges

memorious is a lightweight web scraping toolkit. It supports scrapers that collect structured or unstructured data. This includes the following use cases:

  • Make crawlers modular and simple tasks reusable
  • Provide utility functions for common tasks such as data storage and HTTP session management
  • Integrate crawlers with the Aleph and FollowTheMoney ecosystem
  • Get out of your way as much as possible

Design

When writing a scraper, you often need to paginate through an index page, then download an HTML page for each result, and finally parse that page and insert or update a record in a database.

memorious handles this by managing a set of crawlers, each of which can be composed of multiple stages. Each stage is implemented using a Python function, which can be re-used across different crawlers.
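
For illustration, a single stage might look something like the sketch below. The context/data signature and the context.http / context.emit helpers follow the patterns used in the example crawlers shipped with this repository; the selector and field names are made up for the example.

# Sketch of a reusable stage: fetch the page from the incoming task and
# emit one follow-up task per link found on it.
def parse_index(context, data):
    url = data.get("url")
    result = context.http.get(url)           # managed HTTP session with caching
    for link in result.html.findall(".//a"):
        context.emit(data={"url": link.get("href")})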

The basic steps of writing a Memorious crawler:

  1. Make a YAML crawler configuration file
  2. Add different stages
  3. Write code for stage operations (optional)
  4. Test, rinse, repeat

Documentation

The documentation for Memorious is available at alephdata.github.io/memorious. Feel free to edit the source files in the docs folder and send pull requests for improvements.

To build the documentation, run make html inside the docs folder.

You'll find the resulting HTML files in /docs/_build/html.

memorious's People

Contributors

brrttwrks, catileptic, danohu, dependabot-preview[bot], dependabot[bot], edwardbetts, moreymat, patcon, pudo, rhiaro, rosencrantz, simonwoerpel, smmbllsm, stchris, sunu, tillprochaska, todorus, uhhhuh

memorious's Issues

Make memorious run under Python 3

This will be made difficult by the current lack of test coverage. It may be worth having a few basic tests in place in order to know the degree to which using Python 3 will impede the execution of the tool. Porting itself, I imagine, will mainly be about fixing a bunch of imports.

Data validation stage

Make a re-usable stage that validates the incoming data against a specified schema. This could be done using JSON schema, or a light version of that.
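
A minimal sketch of what such a stage could look like, using the jsonschema library; the params/emit/log usage mirrors other memorious operations, and the stage itself is hypothetical:

# Hypothetical validation stage: check incoming data against a JSON Schema
# supplied in the stage params, emit on success, log a warning otherwise.
from jsonschema import ValidationError, validate

def validate_data(context, data):
    schema = context.params.get("schema", {})
    try:
        validate(instance=data, schema=schema)
    except ValidationError as exc:
        context.log.warning("Validation failed: %s", exc.message)
        return
    context.emit(data=data)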

Make tags with expiration

Allow crawlers to check for a tag which expires after a given interval, e.g. 90-120 days. Crawlers thus become incremental but also repeat checks after a certain interval.
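
A rough sketch of how a crawler stage could use such a tag; set_tag is what the built-in fetch operation already uses, while check_tag and per-tag expiry are assumptions about the proposed feature:

# Hypothetical incremental stage: skip items whose tag has not yet expired.
def emit_if_stale(context, data):
    tag = "seen:%s" % data.get("url")
    if context.check_tag(tag):             # assumed helper: is the tag still valid?
        context.log.info("Skipping %s, tag not yet expired", data.get("url"))
        return
    context.emit(data=data)
    context.set_tag(tag, "1")              # expiry would come from the interval above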

Load new crawlers without a restart

If new crawlers are added to the CONFIG_PATH, the celery workers can't see them without being turned off and on again.

Something about dynamic code reloading.

Add user authentication and scraper namespacing to Memorious

We want other people to run their own scrapers on our platform. Ideally, they will have the permissions to see, run and update their own crawlers. We will have 3 tiers of permissions for users:

  • public: You can see the public crawlers, but can't run/cancel them
  • authorized: You can see, run and cancel your own crawlers
  • admin: God mode

Scraper Namespacing: Scrapers should be part of a scraper group or namespace. We can figure out which namespace or group a user belongs to from request headers provided by Keycloak. The headers are put into the request by Keycloak Gatekeeper.
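
As a rough sketch, the UI could derive the namespace from such a header; the X-Auth-Groups name is an assumption about what the proxy forwards, not a confirmed header:

# Hypothetical helper for the Flask UI: read the group/namespace from the
# headers injected by the authenticating proxy.
from flask import request

def current_namespace():
    groups = request.headers.get("X-Auth-Groups", "")   # header name is an assumption
    return groups.split(",")[0].strip() or None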

Running a scraper in the example fails with an error when calling context.set_tag(tag, None)

I ran docker-compose up and was able to view the UI. However, if I run any of the three crawlers I get an error like this:

worker_1  | INFO:book_scraper.init:[book_scraper->init(seed)]: b7feff0ca24511e9af6f0242ac130003
worker_1  | INFO:book_scraper.fetch:[book_scraper->fetch(fetch)]: b7feff0ca24511e9af6f0242ac130003
worker_1  | INFO:book_scraper.fetch:Using cached HTTP response: http://books.toscrape.com/
worker_1  | INFO:book_scraper.fetch:Fetched [200]: 'http://books.toscrape.com/'
worker_1  | ERROR:book_scraper.fetch:Invalid input of type: 'NoneType'. Convert to a byte, string or number first.
worker_1  | Traceback (most recent call last):
worker_1  |   File "/memorious/memorious/logic/context.py", line 75, in execute
worker_1  |     return self.stage.method(self, data)
worker_1  |   File "/memorious/memorious/operations/fetch.py", line 32, in fetch
worker_1  |     context.set_tag(tag, None)
worker_1  |   File "/memorious/memorious/logic/context.py", line 105, in set_tag
worker_1  |     return conn.set(key, data, ex=self.crawler.expire)
worker_1  |   File "/usr/lib/python3.7/site-packages/redis/client.py", line 1451, in set
worker_1  |     return self.execute_command('SET', *pieces)
worker_1  |   File "/usr/lib/python3.7/site-packages/redis/client.py", line 774, in execute_command
worker_1  |     connection.send_command(*args)
worker_1  |   File "/usr/lib/python3.7/site-packages/redis/connection.py", line 620, in send_command
worker_1  |     self.send_packed_command(self.pack_command(*args))
worker_1  |   File "/usr/lib/python3.7/site-packages/redis/connection.py", line 663, in pack_command
worker_1  |     for arg in imap(self.encoder.encode, args):
worker_1  |   File "/usr/lib/python3.7/site-packages/redis/connection.py", line 125, in encode
worker_1  |     "byte, string or number first." % typename)
worker_1  | redis.exceptions.DataError: Invalid input of type: 'NoneType'. Convert to a byte, string or number first.

This happens both from the UI and when I run the command:

docker-compose exec worker memorious run book_scraper

The quote scraper shows the above error as well as this:

worker_1  | INFO:memorious.logic.crawler:Running aggregator for quote_scraper
worker_1  | ERROR:memorious.task_runner:Task failed to execute:
worker_1  | Traceback (most recent call last):
worker_1  |   File "/memorious/memorious/task_runner.py", line 59, in process
worker_1  |     cls.execute(*item)
worker_1  |   File "/memorious/memorious/task_runner.py", line 49, in execute
worker_1  |     context.crawler.aggregate(context)
worker_1  |   File "/memorious/memorious/logic/crawler.py", line 80, in aggregate
worker_1  |     context, self.aggregator_config.get("params", {})
worker_1  |   File "/crawlers/src/example/quotes.py", line 71, in export
worker_1  |     table = context.datastore[context.params.get("table")]
worker_1  |   File "/usr/lib/python3.7/site-packages/werkzeug/local.py", line 378, in <lambda>
worker_1  |     __getitem__ = lambda x, i: x._get_current_object()[i]
worker_1  |   File "/usr/lib/python3.7/site-packages/dataset/database.py", line 222, in __getitem__
worker_1  |     return self.get_table(table_name)
worker_1  |   File "/usr/lib/python3.7/site-packages/dataset/database.py", line 218, in get_table
worker_1  |     return self.create_table(table_name, primary_id, primary_type)
worker_1  |   File "/usr/lib/python3.7/site-packages/dataset/database.py", line 181, in create_table
worker_1  |     table_name = normalize_table_name(table_name)
worker_1  |   File "/usr/lib/python3.7/site-packages/dataset/util.py", line 79, in normalize_table_name
worker_1  |     raise ValueError("Invalid table name: %r" % name)
worker_1  | ValueError: Invalid table name: None

Crawler 'sample' mode

I would like to be able to run crawlers with a flag that tells them to only download a subset of the data before finishing.

It should be a CLI flag rather than something in the YAML config, I think, so it can be used on the fly for testing or demo purposes; and it needs to make sure the whole pipeline is run beginning to end for at least one 'thing'.

What this means is going to be different for different crawlers.

For the simplest crawlers it would probably be something that hijacks the seed stage, or however URLs are generated, and cuts short the list of what gets passed on to the next stage. For recursive crawlers, or ones where the downloads of the things we actually want happen later in the pipeline (because search results have to be parsed, files have to be fetched, etc.), it's going to be more complicated.

It might be impossible to have a sample mode that works uniformly for crawlers that aren't explicitly configured for it. So maybe the best option is to make the presence of a --sample flag easily available to crawlers' Python functions (e.g. via the context) so crawlers can customise the most appropriate stage to respond with a subset of results.
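
A sketch of what that could look like for a simple seed-style stage, assuming the flag were exposed through context.params (the "sample" key is the proposal here, not existing API):

# Hypothetical seed stage honouring a --sample flag passed via the context.
def seed(context, data):
    urls = context.params.get("urls", [])
    if context.params.get("sample"):
        urls = urls[:1]                    # push only one item through the pipeline
    for url in urls:
        context.emit(data={"url": url})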

Add data validation helpers as part of context logic

Needed methods:

  • .isNotEmpty(p)
  • .isNumeric(p)
  • .isInteger(p)
  • .matchDate(p)
  • .matchRegexp(p) / p matches a regexp
  • .hasLength(p) / p is of a specific length
  • .mustContain(p, q) / p contains a specific character q

Be able to issue warnings/exceptions depending on whether the variable is optional or required.
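
A sketch of a few of these helpers as plain functions (names are Pythonised; hooking them onto the context and the warn-vs-raise behaviour is left open):

# Possible implementations of some of the requested validators.
import re
from datetime import datetime

def is_numeric(value):
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False

def match_date(value, fmt="%Y-%m-%d"):
    try:
        datetime.strptime(value, fmt)
        return True
    except (TypeError, ValueError):
        return False

def match_regexp(value, pattern):
    return value is not None and re.search(pattern, value) is not None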

OCR helper function

  • Based on tesseract and imagemagick
  • Allow for single-line text recognition
  • De-noise, increase contrast.
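
A possible shape for the helper, sketched with Pillow and pytesseract (the package choice and preprocessing steps are assumptions; tesseract's --psm 7 mode covers the single-line case):

# Hypothetical OCR helper: grayscale, de-noise, boost contrast, then run
# tesseract in single-line mode.
import pytesseract
from PIL import Image, ImageEnhance, ImageFilter, ImageOps

def read_single_line(path):
    img = ImageOps.grayscale(Image.open(path))
    img = img.filter(ImageFilter.MedianFilter(size=3))    # de-noise
    img = ImageEnhance.Contrast(img).enhance(2.0)          # increase contrast
    return pytesseract.image_to_string(img, config="--psm 7").strip()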

Show a warning if in multi-threaded mode and the data storage is SQLite

ERROR:tj_procurement.store_record:(sqlite3.ProgrammingError) SQLite objects created in a thread can only be used in that same thread. The object was created in thread id 139692792108800 and this is thread id 139692800501504 (Background on this error at: http://sqlalche.me/e/f405)
Traceback (most recent call last):
  File "/home/memorious/env/lib/python3.4/site-packages/sqlalchemy/engine/base.py", line 1127, in _execute_context
    context = constructor(dialect, self, conn, *args)
  File "/home/memorious/env/lib/python3.4/site-packages/sqlalchemy/engine/default.py", line 637, in _init_compiled
    self.cursor = self.create_cursor()
  File "/home/memorious/env/lib/python3.4/site-packages/sqlalchemy/engine/default.py", line 952, in create_cursor
    return self._dbapi_connection.cursor()
  File "/home/memorious/env/lib/python3.4/site-packages/sqlalchemy/pool.py", line 977, in cursor
    return self.connection.cursor(*args, **kwargs)
sqlite3.ProgrammingError: SQLite objects created in a thread can only be used in that same thread. The object was created in thread id 139692792108800 and this is thread id 139692800501504

Fixed with `export MEMORIOUS_DEBUG=true`
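
The warning itself could be a small startup check along these lines; how memorious exposes the engine and thread count is left as an assumption:

# Sketch of the requested warning, given a SQLAlchemy engine and the
# configured number of worker threads.
import logging

log = logging.getLogger(__name__)

def warn_if_sqlite(engine, threads):
    if engine.dialect.name == "sqlite" and threads > 1:
        log.warning("Datastore is SQLite but %d threads are configured; "
                    "SQLite connections cannot be shared across threads.", threads)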

Make crawler discovery and configuration easier

More of a discussion ticket.

  • Can we have multiple crawler search directories? Do we want to split YAML directories and code directories?
  • How can it be made easier to reference the code modules? Should it be possible to place the YAML config inside the code?

Let other users add and run their own crawlers on our platform

One potential solution for this is to have a webhook that listens for changes on a fixed set of git repositories and pull crawlers from them into our repository of crawlers.

Note to self:

Possible ways to do it:

  • Webhook triggered
  • Pull github repo
  • Push them into the central repo of crawlers

Option A

  • Figure out what changed (how?)
  • Flush the remaining ops for the crawlers that changed from the queue
  • Reload the crawlers into the manager

Option B

  • Let the user explicitly tell us to reload the crawler
  • Flush the ops
  • Reload the crawler
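
A very rough sketch of the webhook end of this; the repository whitelist, paths and the reload step are all placeholders:

# Hypothetical webhook endpoint: pull a whitelisted crawler repo on push.
import subprocess
from flask import Flask, abort, request

app = Flask(__name__)
ALLOWED_REPOS = {"example-crawlers": "/crawlers/src/example-crawlers"}

@app.route("/hooks/crawlers", methods=["POST"])
def pull_crawlers():
    payload = request.get_json(silent=True) or {}
    path = ALLOWED_REPOS.get(payload.get("repository", {}).get("name"))
    if path is None:
        abort(404)
    subprocess.run(["git", "-C", path, "pull", "--ff-only"], check=True)
    # flushing queued ops and reloading the crawler manager would happen here
    return "ok"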

Frequent database deadlock errors

We have a rate limit in place for db operations. But under load, the db still locks up sometimes, throwing errors like:

(psycopg2.errors.DeadlockDetected) deadlock detected DETAIL: Process 974074 waits for ShareLock on transaction 2274650; blocked by process 974077. Process 974077 waits for ShareLock on transaction 2274652; blocked by process 974074.

These errors drown out other less frequent errors on the events screen.
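
Until the contention itself is fixed, one mitigation is a bounded retry around the offending operations; psycopg2's DeadlockDetected surfaces through SQLAlchemy as an OperationalError, so a sketch could look like this (where exactly it would hook into memorious is open):

# Retry a callable a few times when the database reports a deadlock.
import time
from sqlalchemy.exc import OperationalError

def retry_on_deadlock(fn, attempts=3, delay=0.5):
    for attempt in range(attempts):
        try:
            return fn()
        except OperationalError:
            if attempt == attempts - 1:
                raise
            time.sleep(delay * (attempt + 1))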

Reference documents from structured data scrapes

As a user, I want to be able to scrape a source which gives me both structured and unstructured data. For example, while scraping a procurement portal, I might want to download contract metadata, but also a contract document as a PDF file. While both things are possible in memorious, there is currently no way to make them show up in Aleph such that the structured data record (e.g. a mapped Contract) refers to the ingested Document by its ID.

To solve this, we need some mechanism for importing both the structured and unstructured content into the same collection in such a way that structured entities can refer to the documents by their ID.

Handle recursion error when run without queue

When running with the celery backend, the tool will emit tasks from one operation and put them on a queue, rather than executing them directly. When run without that queue active, however, this mode of spawning subsidiary tasks leads to recursion errors when Python limits the stack depth of the process.

This probably means we should handle the non-queued execution of crawlers differently, e.g. by using a Python Queue to put tasks into a local pool in order to have them executed.

ERROR:<CENSORED>.parse:maximum recursion depth exceeded while calling a Python object
Traceback (most recent call last):
  File "/Users/fl/Code/occrp/memorious/memorious/logic/context.py", line 75, in execute
    res = self.stage.method(self, data)
  File "/Users/fl/Code/occrp/data.occrp.org/crawlers/src/<CENSORED>.py", line 48, in parse
    context.emit(data={'url': next_url})
  File "/Users/fl/Code/occrp/memorious/memorious/logic/context.py", line 53, in emit
    handle.apply_async((state, stage, data), countdown=delay)
  File "/Users/fl/.virtualenvs/funes/lib/python2.7/site-packages/celery/app/task.py", line 523, in apply_async
    link=link, link_error=link_error, **options)
  File "/Users/fl/.virtualenvs/funes/lib/python2.7/site-packages/celery/app/task.py", line 741, in apply
    ret = tracer(task_id, args, kwargs, request)
  File "/Users/fl/.virtualenvs/funes/lib/python2.7/site-packages/celery/app/trace.py", line 388, in trace_task
    I, R, state, retval = on_error(task_request, exc, uuid)
  File "/Users/fl/.virtualenvs/funes/lib/python2.7/site-packages/celery/app/trace.py", line 374, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/Users/fl/Code/occrp/memorious/memorious/logic/context.py", line 217, in handle
    context.execute(data)
  File "/Users/fl/Code/occrp/memorious/memorious/logic/context.py", line 71, in execute
    self.operation_id = op.id
  File "/Users/fl/.virtualenvs/funes/lib/python2.7/site-packages/sqlalchemy/orm/attributes.py", line 237, in __get__
    return self.impl.get(instance_state(instance), dict_)
  File "/Users/fl/.virtualenvs/funes/lib/python2.7/site-packages/sqlalchemy/orm/attributes.py", line 579, in get
    value = state._load_expired(state, passive)
  File "/Users/fl/.virtualenvs/funes/lib/python2.7/site-packages/sqlalchemy/orm/state.py", line 592, in _load_expired
    self.manager.deferred_scalar_loader(self, toload)
  File "/Users/fl/.virtualenvs/funes/lib/python2.7/site-packages/sqlalchemy/orm/loading.py", line 713, in load_scalar_attributes
    only_load_props=attribute_names)
  File "/Users/fl/.virtualenvs/funes/lib/python2.7/site-packages/sqlalchemy/orm/loading.py", line 223, in load_on_ident
    return q.one()
  File "/Users/fl/.virtualenvs/funes/lib/python2.7/site-packages/sqlalchemy/orm/query.py", line 2814, in one
    ret = self.one_or_none()
  File "/Users/fl/.virtualenvs/funes/lib/python2.7/site-packages/sqlalchemy/orm/query.py", line 2784, in one_or_none
    ret = list(self)
  File "/Users/fl/.virtualenvs/funes/lib/python2.7/site-packages/sqlalchemy/orm/query.py", line 2855, in __iter__
    return self._execute_and_instances(context)
  File "/Users/fl/.virtualenvs/funes/lib/python2.7/site-packages/sqlalchemy/orm/query.py", line 2878, in _execute_and_instances
    result = conn.execute(querycontext.statement, self._params)
  File "/Users/fl/.virtualenvs/funes/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 945, in execute
    return meth(self, multiparams, params)
  File "/Users/fl/.virtualenvs/funes/lib/python2.7/site-packages/sqlalchemy/sql/elements.py", line 263, in _execute_on_connection
    return connection._execute_clauseelement(self, multiparams, params)
  File "/Users/fl/.virtualenvs/funes/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1046, in _execute_clauseelement
    if not self.schema_for_object.is_default else None)
  File "<string>", line 1, in <lambda>
  File "/Users/fl/.virtualenvs/funes/lib/python2.7/site-packages/sqlalchemy/sql/elements.py", line 436, in compile
    return self._compiler(dialect, bind=bind, **kw)
  File "/Users/fl/.virtualenvs/funes/lib/python2.7/site-packages/sqlalchemy/sql/elements.py", line 442, in _compiler
    return dialect.statement_compiler(dialect, self, **kw)
  File "/Users/fl/.virtualenvs/funes/lib/python2.7/site-packages/sqlalchemy/sql/compiler.py", line 435, in __init__
    Compiled.__init__(self, dialect, statement, **kwargs)
  File "/Users/fl/.virtualenvs/funes/lib/python2.7/site-packages/sqlalchemy/sql/compiler.py", line 216, in __init__
    self.string = self.process(self.statement, **compile_kwargs)
  File "/Users/fl/.virtualenvs/funes/lib/python2.7/site-packages/sqlalchemy/sql/compiler.py", line 242, in process
    return obj._compiler_dispatch(self, **kwargs)
  File "/Users/fl/.virtualenvs/funes/lib/python2.7/site-packages/sqlalchemy/sql/visitors.py", line 81, in _compiler_dispatch
    return meth(self, **kw)
  File "/Users/fl/.virtualenvs/funes/lib/python2.7/site-packages/sqlalchemy/sql/compiler.py", line 1747, in visit_select
    text, select, inner_columns, froms, byfrom, kwargs)
  File "/Users/fl/.virtualenvs/funes/lib/python2.7/site-packages/sqlalchemy/sql/compiler.py", line 1831, in _compose_select_body
    t = select._whereclause._compiler_dispatch(self, **kwargs)
  File "/Users/fl/.virtualenvs/funes/lib/python2.7/site-packages/sqlalchemy/sql/visitors.py", line 93, in _compiler_dispatch
    return meth(self, **kw)
  File "/Users/fl/.virtualenvs/funes/lib/python2.7/site-packages/sqlalchemy/sql/compiler.py", line 1034, in visit_binary
    return self._generate_generic_binary(binary, opstring, **kw)
  File "/Users/fl/.virtualenvs/funes/lib/python2.7/site-packages/sqlalchemy/sql/compiler.py", line 1059, in _generate_generic_binary
    self, eager_grouping=eager_grouping, **kw)
  File "/Users/fl/.virtualenvs/funes/lib/python2.7/site-packages/sqlalchemy/sql/visitors.py", line 93, in _compiler_dispatch
    return meth(self, **kw)
  File "/Users/fl/.virtualenvs/funes/lib/python2.7/site-packages/sqlalchemy/sql/compiler.py", line 1192, in visit_bindparam
    name = self._truncate_bindparam(bindparam)
  File "/Users/fl/.virtualenvs/funes/lib/python2.7/site-packages/sqlalchemy/sql/compiler.py", line 1248, in _truncate_bindparam
    bind_name = self._truncated_identifier("bindparam", bind_name)
  File "/Users/fl/.virtualenvs/funes/lib/python2.7/site-packages/sqlalchemy/sql/compiler.py", line 1259, in _truncated_identifier
    anonname = name.apply_map(self.anon_map)
  File "/Users/fl/.virtualenvs/funes/lib/python2.7/site-packages/sqlalchemy/sql/elements.py", line 4071, in apply_map
    return self % map_
RuntimeError: maximum recursion depth exceeded while calling a Python object
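
A minimal sketch of the local-pool idea described above, assuming stages were written (or wrapped) to yield follow-up tasks instead of calling the next stage directly; the real task runner would need to map this onto memorious' stage and context objects:

# Drain tasks from a flat queue instead of recursing through handle()/execute(),
# so the stack depth stays constant no matter how deep the crawl goes.
from queue import Queue

def run_locally(first_stage, first_data):
    tasks = Queue()
    tasks.put((first_stage, first_data))
    while not tasks.empty():
        stage, data = tasks.get()
        for next_stage, next_data in stage(data):   # stages yield (stage, data) pairs
            tasks.put((next_stage, next_data))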

Memorious admin user interface

Needed:

  • See all crawlers and how many runs and operations they have.
  • Run a particular crawler right now
  • Flush a crawler's state
  • See an overview of the operations of each crawler
  • See individual failures per operation

More coherent helpers for search results

Processing a listing of paged results should be built in.

  • Paging by following 'next' links
  • Paging by calculating a sequence from the number of results or the 'last' link
  • Paging by incrementing a page number until an empty result
  • Extracting links and titles from the list of results

Started in helpers/init
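
For the 'next link' variant listed above, a reusable stage could look roughly like this; the XPaths come from stage params, and the context.http / emit / recurse usage follows the existing operations (treat the details as a sketch):

# Hypothetical paging stage: emit every result link, then recurse onto the
# next page while a rel="next" link exists.
def paginate(context, data):
    result = context.http.get(data["url"])
    for href in result.html.xpath(context.params.get("result_xpath", ".//a/@href")):
        context.emit(data={"url": href})
    next_urls = result.html.xpath(".//a[@rel='next']/@href")
    if next_urls:
        context.recurse(data={"url": next_urls[0]})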

`memorious run` command never finishes

I'm trying to automate the creation of new CSV files in opensanctions. When I run memorious run us_ofac, the last thing it does is store the data with a bunch of lines like this:

INFO:us_ofac.store:[us_ofac->store(balkhash_put)]: 2a24a44ca3e211e9a2fc0242ac160005
INFO:us_ofac.store:[us_ofac->store(balkhash_put)]: 2a24a44ca3e211e9a2fc0242ac160005
INFO:us_ofac.store:[us_ofac->store(balkhash_put)]: 2a24a44ca3e211e9a2fc0242ac160005
INFO:us_ofac.store:[us_ofac->store(balkhash_put)]: 2a24a44ca3e211e9a2fc0242ac160005
INFO:us_ofac.store:[us_ofac->store(balkhash_put)]: 2a24a44ca3e211e9a2fc0242ac160005
INFO:us_ofac.store:[us_ofac->store(balkhash_put)]: 2a24a44ca3e211e9a2fc0242ac160005
INFO:us_ofac.store:[us_ofac->store(balkhash_put)]: 2a24a44ca3e211e9a2fc0242ac160005
INFO:us_ofac.store:[us_ofac->store(balkhash_put)]: 2a24a44ca3e211e9a2fc0242ac160005

Then the command stays stuck open indefinitely. I'd like to write a shell script which performs other steps after this, but I don't see a good way to stop the command once it's finished and continue with the rest.

I don't want to use the memorious schedule or leave anything running all day. I want to briefly bring up the crawler once a day to scrape and produce files and then shut it down. It's also a problem in local development when testing changes to a crawler.

I've opened this here because it happens with all the opensanctions crawlers, so I'm assuming it's a core memorious problem. Does the command ever finish for other crawlers?

Build app-level rate limiting

This should be per-stage, perhaps by just putting items back on the queue if a stage is about to exceed its rate limit.
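
A sketch of the check, using a per-second counter in Redis; the conn handle and the re-queueing call are placeholders for whatever memorious exposes internally:

# Return True if the stage still has budget in the current second.
def acquire_rate_limit(conn, stage_name, per_second):
    key = "rate:%s" % stage_name
    count = conn.incr(key)
    if count == 1:
        conn.expire(key, 1)
    return count <= per_second

# A caller could then re-queue instead of executing, e.g.:
#   if not acquire_rate_limit(conn, stage.name, limit):
#       context.emit(stage=stage.name, data=data, delay=1)   # put it back on the queue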
