
ckanext-extractor


A CKAN extension for automatically extracting text and metadata from datasets.

ckanext-extractor automatically extracts text and metadata from your resources and adds them to the search index so that they can be used to find your data.

Requirements

ckanext-extractor has been developed and tested with CKAN 2.6 and later. Other versions may work but have not been tested.

Since ckanext-extractor relies on the background job system introduced in CKAN 2.7, users of earlier CKAN versions also need to install ckanext-rq.

Installation

Note: The following steps assume a standard CKAN source installation.

Install Python Package

Activate your CKAN virtualenv:

. /usr/lib/ckan/default/bin/activate

Install the latest development version of ckanext-extractor and its dependencies:

cd /usr/lib/ckan/default
pip install -e git+https://github.com/stadt-karlsruhe/ckanext-extractor#egg=ckanext-extractor
pip install -r src/ckanext-extractor/requirements.txt

On a production system you'll probably want to pin a specific release of ckanext-extractor instead:

pip install -e git+https://github.com/stadt-karlsruhe/[email protected]#egg=ckanext-extractor

Configure CKAN

Open your CKAN configuration file (e.g. /etc/ckan/default/production.ini) and add extractor to the list of plugins:

ckan.plugins = ... extractor

Initialize the database:

paster --plugin=ckanext-extractor init -c /etc/ckan/default/production.ini

Start Background Worker

ckanext-extractor uses background jobs to perform extraction asynchronously so that it does not block the web server. You therefore need to make sure that a CKAN background worker is running:

paster --plugin=ckan jobs worker --config=/etc/ckan/default/production.ini

See the CKAN documentation for more information on background jobs and for tips on how to run workers in production environments.

Configure Solr

The actual extraction is performed by CKAN's Apache Solr server. However, the necessary Solr plugins are deactivated by default. To enable them, find your main Solr configuration file (usually /etc/solr/conf/solrconfig.xml) and add or uncomment the following lines:

<lib dir="../../dist/" regex="apache-solr-cell-\d.*\.jar" />
<lib dir="../../contrib/extraction/lib" regex=".*\.jar" />

Note: The Solr packages on Ubuntu are broken and do not contain the necessary files. You can simply download an official release of the same version, unpack it to a suitable location (without installing it) and adjust the dir arguments in the configuration lines above accordingly. For example, if you have unpacked the files to /var/lib/apache-solr, then you would need to put the following lines into solrconfig.xml:

<lib dir="/var/lib/apache-solr/dist/" regex="apache-solr-cell-\d.*\.jar" />
<lib dir="/var/lib/apache-solr/contrib/extraction/lib" regex=".*\.jar" />

Once the text and metadata have been extracted they need to be added to the Solr index, which requires appropriate Solr fields. To set them up add the following lines to your Solr schema configuration (usually /etc/solr/conf/schema.xml):

<!-- Directly before the line that says "</fields>" -->
<dynamicField name="ckanext-extractor_*" type="text" indexed="true" stored="false"/>

<!-- Directly before the line that says "</schema>" -->
<copyField source="ckanext-extractor_*" dest="text"/>

Make sure to restart Solr after you have applied the changes. For example, if you're using Jetty as the application server for Solr:

sudo service jetty restart

Restart CKAN

Finally, restart your CKAN server:

sudo service apache2 restart

Test your Installation

The installation is now complete. To verify that everything is working open the URL /api/3/action/extractor_list, e.g. via

wget -qO - http://localhost/api/3/action/extractor_list

The output should look like this (in particular, success should be true):

{"help": "http://localhost/api/3/action/help_show?name=extractor_list", "success": true, "result": []}
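If you prefer to check this programmatically, the response envelope can be validated with a few lines of Python, using the sample response shown above:

```python
import json

# Sample extractor_list response from above; on a fresh install "result" is empty.
raw = ('{"help": "http://localhost/api/3/action/help_show?name=extractor_list",'
       ' "success": true, "result": []}')
response = json.loads(raw)
assert response["success"] is True
print(response["result"])  # the IDs of all resources with extracted metadata
```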

You're Done!

Your installation of ckanext-extractor is now complete, and new or updated resources will have their metadata automatically extracted and indexed. You may want to adapt the configuration to your needs; see below for details. Once that is done, you may also want to extract metadata from your existing resources:

. /usr/lib/ckan/default/bin/activate
paster --plugin=ckanext-extractor extract all -c /etc/ckan/default/production.ini

This and other paster administration commands are explained below in more detail.

Configuration

ckanext-extractor can be configured via the usual CKAN configuration file (e.g. /etc/ckan/default/production.ini). You must restart your CKAN server after updating the configuration.

Formats for Extraction

While Solr can extract text and metadata from many file formats, not all of them may be of interest to you. You can therefore configure the formats for which extraction is performed via the ckanext.extractor.indexed_formats option. It takes a space-separated list of formats, where the format is the one specified in a resource's CKAN metadata (not the file extension or MIME type):

ckanext.extractor.indexed_formats = pdf txt

Formats are case-insensitive. You can use wildcards (* and ?) to match multiple formats. To extract data from all formats simply set

ckanext.extractor.indexed_formats = *

By default, extraction is only enabled for the PDF format:

ckanext.extractor.indexed_formats = pdf
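The matching semantics described above (case-insensitive, with * and ? wildcards) can be illustrated with Python's fnmatch module. This is a sketch of the behaviour, not the extension's actual code:

```python
from fnmatch import fnmatch

def format_is_indexed(resource_format, indexed_formats):
    """Illustration of case-insensitive wildcard matching of resource
    formats against the indexed_formats setting (not ckanext-extractor's
    actual implementation)."""
    fmt = resource_format.lower()
    return any(fnmatch(fmt, pattern.lower())
               for pattern in indexed_formats.split())

print(format_is_indexed("PDF", "pdf txt"))  # True: matching is case-insensitive
print(format_is_indexed("xlsx", "xls*"))    # True: wildcard match
print(format_is_indexed("csv", "pdf txt"))  # False: not in the list
```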

Fields for Indexing

Once text and metadata have been extracted they can be added to the search index. Again, Solr supports more metadata fields than one usually needs. You can therefore configure which fields are indexed via the ckanext.extractor.indexed_fields option. It takes a space-separated list of field names:

ckanext.extractor.indexed_fields = fulltext author

The full text of a document is available via the fulltext field. Field names are case-insensitive. You can use wildcards (* and ?) to match multiple field names. To index all fields simply set

ckanext.extractor.indexed_fields = *

By default, only the full text of a document is indexed:

ckanext.extractor.indexed_fields = fulltext

Note: ckanext-extractor normalizes the field names reported by Solr by replacing underscores (_) with hyphens (-). In addition, multiple values for the same field in the same document are collapsed into a single value.
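A minimal sketch of this normalization, assuming collapsed values are simply joined with spaces (the exact joining strategy is an assumption, not taken from the extension's code):

```python
def normalize_field_name(solr_field_name):
    """Sketch of the normalization described above: underscores in
    Solr-reported field names become hyphens."""
    return solr_field_name.replace("_", "-")

def collapse_values(values):
    """Sketch: multiple values for one field in one document are collapsed
    into a single value. Joining with spaces is an assumption here."""
    if isinstance(values, list):
        return " ".join(str(v) for v in values)
    return values

print(normalize_field_name("stream_content_type"))       # stream-content-type
print(collapse_values(["DefaultParser", "PDFParser"]))   # DefaultParser PDFParser
```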

Paster Commands

In general, ckanext-extractor works automatically: whenever a new resource is created or an existing resource changes, its metadata is extracted and indexed. However, for administration purposes, metadata can also be managed from the command line using the paster tool.

Note: You have to activate your virtualenv before you can use these commands:

. /usr/lib/ckan/default/bin/activate

The general form for a paster command is

paster --plugin=ckanext-extractor COMMAND ARGUMENTS --config=/etc/ckan/default/production.ini

Replace COMMAND and ARGUMENTS as described below. For example:

paster --plugin=ckanext-extractor extract all --config=/etc/ckan/default/production.ini
  • delete (all | ID [ID [...]]): Delete metadata. You can specify one or more resource IDs or a single all argument (in which case all metadata is deleted).
  • extract [--force] (all | ID [ID [...]]): Extract metadata. You can specify one or more resource IDs or a single all argument (in which case metadata is extracted from all resources with appropriate formats). The optional --force flag forces extraction even if the resource is unchanged or if another extraction job has already been scheduled for that resource.

    Note that this command only schedules the necessary extraction background tasks. A background jobs worker has to be running for the extraction to actually happen.

  • init: Initialize the database tables for ckanext-extractor. You only need to use this once (during the installation).
  • list: List the IDs of all resources for which metadata has been extracted.
  • show (all | ID [ID [...]]): Show extracted metadata. You can specify one or more resource IDs or a single all argument (in which case all metadata is shown).

API

Metadata can be managed via the standard CKAN API. Unless noted otherwise, all actions are only available to authenticated users via POST requests.

extractor_delete

Delete metadata.

Only available to administrators.

Parameters:

id

ID of the resource for which metadata should be deleted.

extractor_extract

Extract metadata.

This function schedules a background task for extracting metadata from a resource.

Only available to administrators.

Parameters:

id

ID of the resource for which metadata should be extracted.

force

Optional boolean flag to force extraction even if the resource is unchanged, or if an extraction task has already been scheduled for that resource.

Returns a dict with the following entries:

status

A string describing the state of the metadata. This can be one of the following:

new

if no metadata for the resource existed before

update

if metadata existed but is going to be updated

unchanged

if metadata existed but won't get updated (for example because the resource's URL did not change since the last extraction)

inprogress

if a background extraction task for this resource is already in progress

ignored

if the resource format is configured to be ignored

Note that if force is true then an extraction job will be scheduled regardless of the reported status, unless the status is ignored.

task_id

The ID of the background task. If status is new or update then this is the ID of a newly created task. If status is inprogress then it's the ID of the existing task. Otherwise it is null.

If force is true then this is the ID of the new extraction task.
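As a sketch of how a client might call this action, the following builds an authenticated POST request using only the standard library; the base URL, resource ID, and API key are placeholders:

```python
import json
from urllib import request

def build_extract_request(base_url, resource_id, api_key, force=False):
    """Prepare an authenticated POST to the extractor_extract action.
    The Authorization header carries the CKAN API key."""
    payload = json.dumps({"id": resource_id, "force": force}).encode("utf-8")
    return request.Request(
        base_url.rstrip("/") + "/api/3/action/extractor_extract",
        data=payload,
        headers={"Content-Type": "application/json", "Authorization": api_key},
    )

# Example (not executed here): send the request and inspect the result.
# req = build_extract_request("http://localhost", "RESOURCE-ID", "YOUR-API-KEY")
# result = json.load(request.urlopen(req))["result"]
# print(result["status"], result["task_id"])
```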

extractor_list

List resources with metadata.

Returns a list with the IDs of all resources for which metadata has been extracted.

Available to all (even anonymous) users via GET and POST.

extractor_show

Show the metadata for a resource.

Parameters:

id

ID of the resource for which metadata should be shown.

Returns a dict with the resource's metadata and information about the last extraction.

Available to all (even anonymous) users via GET and POST.

Postprocessing Extraction Results

The ckanext.extractor.interfaces.IExtractorPostprocessor interface can be used to hook into the extraction process. It allows you to postprocess extraction results and to automatically trigger actions that use the extraction results for other purposes.

The interface offers 3 hooks:

  • extractor_after_extract(resource_dict, extracted) is called right after the extraction before the extracted metadata extracted is filtered and stored. You can modify extracted (in-place) and the changes will end up in the database.
  • extractor_after_save(resource_dict, metadata_dict) is called after the metadata has been filtered and stored in the database but before it is indexed. metadata_dict is a dict-representation of a ckanext.extractor.model.ResourceMetadata instance and contains both the extracted metadata and information about the extraction process (meta-metadata, so to speak).
  • extractor_after_index(resource_dict, metadata_dict) is called at the very end of the extraction process, after the metadata has been extracted, filtered, stored and indexed.
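A minimal sketch of a postprocessor that normalizes whitespace in the extracted full text. In a real extension this class would subclass ckan.plugins.SingletonPlugin and declare plugins.implements(IExtractorPostprocessor); those CKAN imports are omitted here so the sketch stays self-contained:

```python
class WhitespacePostprocessorPlugin:
    """Sketch of an IExtractorPostprocessor implementation (hypothetical
    plugin; the CKAN plugin boilerplate is omitted)."""

    def extractor_after_extract(self, resource_dict, extracted):
        # Modify `extracted` in place; the changes end up in the database.
        if "fulltext" in extracted:
            extracted["fulltext"] = " ".join(extracted["fulltext"].split())

    def extractor_after_save(self, resource_dict, metadata_dict):
        # Called after filtering and storing, but before indexing.
        pass

    def extractor_after_index(self, resource_dict, metadata_dict):
        # Called at the very end of the extraction process.
        pass
```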

Adjusting the download request

The ckanext.extractor.interfaces.IExtractorRequest interface can be used to modify the HTTP request made for downloading a resource file for extraction. A typical use case would be to add custom authentication headers required by the remote server which are normally provided by the user's browser.

The interface offers 1 hook:

  • extractor_before_request(request) is called before a request is sent to download a resource file for extraction. The request parameter is a PreparedRequest object from the requests library.
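A minimal sketch of such a plugin. The header name and token are hypothetical, the CKAN plugin boilerplate (subclassing ckan.plugins.SingletonPlugin and implementing the interface) is omitted, and returning the request from the hook is an assumption here:

```python
class AuthHeaderRequestPlugin:
    """Sketch of an IExtractorRequest implementation that injects a custom
    authentication header (hypothetical header name and token)."""

    def extractor_before_request(self, request):
        # `request` is a requests.PreparedRequest; its headers mapping can
        # be modified before the resource file is downloaded.
        request.headers["X-Custom-Auth"] = "secret-token"  # placeholder
        return request
```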

Development

. /usr/lib/ckan/default/bin/activate
git clone https://github.com/stadt-karlsruhe/ckanext-extractor.git
cd ckanext-extractor
python setup.py develop
pip install -r dev-requirements.txt

Running the Tests

To run the tests, activate your CKAN virtualenv and do:

./runtests.sh

Any additional arguments are passed on to nosetests.

Change Log

See the file CHANGELOG.md.

License

Copyright (C) 2016-2018 Stadt Karlsruhe (www.karlsruhe.de)

Distributed under the GNU Affero General Public License. See the file LICENSE for details.

ckanext-extractor's People

Contributors

torfsen, wardi

ckanext-extractor's Issues

Handling of HTTP errors

Hello and thanks for your work putting this extension together! I have CKAN running on a Windows machine, but the search index does not contain the full text of the assets stored in CKAN. Is Redis required for the plugin to work? The Celery error I get is:

[2017-04-05 11:09:18,331: INFO/MainProcess] Received task: extractor.extract[d5b7a2fb-31a5-4395-adf6-0e58c076a300]
[2017-04-05 11:09:31,660: ERROR/MainProcess] Task extractor.extract[d5b7a2fb-31a5-4395-adf6-0e58c076a300] raised unexpected: HTTPError('404 Client Error: Not Found',)
Traceback (most recent call last):
File "C:\Users\sbarn_000\Source\Repos\ckan\ckanenv32-2.7\lib\site-packages\celery\app\trace.py", line 240, in trace_task
R = retval = fun(*args, **kwargs)
File "C:\Users\sbarn_000\Source\Repos\ckan\ckanenv32-2.7\lib\site-packages\celery\app\trace.py", line 438, in protected_call
return self.run(*args, **kwargs)
File "c:\users\sbarn_000\source\repos\ckan\ckanenv32-2.7\src\ckanext-extractor\ckanext\extractor\tasks.py", line 63, in extract
extracted = download_and_extract(res_dict['url'])
File "c:\users\sbarn_000\source\repos\ckan\ckanenv32-2.7\src\ckanext-extractor\ckanext\extractor\lib.py", line 38, in download_and_extract
r.raise_for_status()
File "C:\Users\sbarn_000\Source\Repos\ckan\ckanenv32-2.7\lib\site-packages\requests\models.py", line 851, in raise_for_status
raise HTTPError(http_error_msg, response=self)
HTTPError: 404 Client Error: Not Found

Thank you for your help.

extractor conflicts with a before_index method

I have a before_index method in an extension that takes a multivalued field and helps Solr sort it for tags, so that instead of appearing as a list the tags appear individually.

Basically, business_area looks like ["tag1", "tag2", "tag3" ]

If there are no tags it's just an empty list "[]".

    def before_index(self, data_dict):
        print(data_dict)
        #print(json.loads(data_dict.get('business_area', '[]')))
        if data_dict.get('business_area'):
            data_dict['business_area'] = json.loads(data_dict.get('business_area', '[]'))
        return data_dict

The problem is that when extractor picks the resource up, I get the stack trace below on every push. It keeps saying
TypeError: expected string or buffer

If I remove that before_index method altogether, it works fine and I don't get this error.

Why does extractor keep calling this method in a separate extension, and is there any way to stop it from raising this error? The method works fine except when extractor is involved.

Traceback (most recent call last):
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/rq/worker.py", line 588, in perform_job
    rv = job.perform()
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/rq/job.py", line 498, in perform
    self._result = self.func(*self.args, **self.kwargs)
  File "/usr/lib/ckan/default/src/ckanext-extractor/ckanext/extractor/tasks.py", line 202, in extract
    index_for('package').update_dict(pkg_dict)
  File "/usr/lib/ckan/default/src/ckan/ckan/lib/search/index.py", line 101, in update_dict
    self.index_package(pkg_dict, defer_commit)
  File "/usr/lib/ckan/default/src/ckan/ckan/lib/search/index.py", line 278, in index_package
    pkg_dict = item.before_index(pkg_dict)
  File "/usr/lib/ckan/default/src/ckanext-datasettheme/ckanext/datasettheme/plugin.py", line 99, in before_index
    data_dict['business_area'] = json.loads(data_dict.get('business_area', '[]'))
  File "/usr/lib/python2.7/json/__init__.py", line 339, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python2.7/json/decoder.py", line 364, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
TypeError: expected string or buffer

ckan_worker interfering with other plugins

When ckan-worker runs to pull the metadata out using ckanext-extractor, I get the error below all the time.

datasetthumbnail is a separate plugin that has nothing to do with this and is properly installed. If I disable the thumbnail plugin in production.ini then it works and I don't get this error in ckan-worker.

Why is the extractor worker doing this? I've seen it happen with other plugins too, but never figured out a solution other than disabling the plugin it interferes with; it's always a random plugin that it conflicts with. Everything is properly installed.

2018-10-14 13:57:43,667 INFO  [ckan.lib.jobs] Worker rq:worker:localhost.13874 has finished job fba8e776-22fe-4de9-99e1-77660fdada72 from queue "default"
2018-10-14 13:57:43,669 INFO  [rq.worker] 
2018-10-14 13:57:43,669 INFO  [rq.worker] *** Listening on ckan:default:default...
2018-10-14 13:57:48,763 INFO  [rq.worker] ckan:default:default: ckanext.extractor.tasks.extract('/etc/ckan/default/production.ini', {u'cache_last_updated': None, u'cache_url': None, u'mimetype_inner': None, u'hash': u'', u'description': u'', u'format': u'CSV', u'url': u'http://127.0.0.1/dataset/086935f5-0bd0-4171-83a6-91076e4fdfb1/resource/3f07b9d5-be73-40f1-a6b6-108f2069c332/download/5000-cc-records.csv', u'created': '2018-10-14T20:57:42.620174', u'state': u'active', u'package_id': u'086935f5-0bd0-4171-83a6-91076e4fdfb1', u'last_modified': '2018-10-14T20:57:42.592766', u'mimetype': u'text/csv', u'url_type': u'upload', u'position': 5, u'revision_id': u'2d9f79b7-ab4c-431e-878b-51bcd364d98b', u'size': 460482L, u'datastore_active': True, u'id': u'3f07b9d5-be73-40f1-a6b6-108f2069c332', u'resource_type': None, u'name': u'5000 CC Records.csv'}) (a0d3da6b-a7f6-4199-a6b4-76bdb81b12d8)
2018-10-14 13:57:48,763 INFO  [ckan.lib.jobs] Worker rq:worker:localhost.13874 starts job a0d3da6b-a7f6-4199-a6b4-76bdb81b12d8 from queue "default"
2018-10-14 13:57:49,128 ERROR [ckan.lib.jobs] Job a0d3da6b-a7f6-4199-a6b4-76bdb81b12d8 on worker rq:worker:localhost.13874 raised an exception: datasetthumbnail
Traceback (most recent call last):
  File "/usr/lib/ckan/default/lib/python2.7/site-packages/rq/worker.py", line 588, in perform_job
    rv = job.perform()
  File "/usr/lib/ckan/default/lib/python2.7/site-packages/rq/job.py", line 498, in perform
    self._result = self.func(*self.args, **self.kwargs)
  File "/usr/lib/ckan/default/src/ckanext-extracter/ckanext/extractor/tasks.py", line 67, in extract
    load_config(ini_path)
  File "/usr/lib/ckan/default/src/ckanext-extracter/ckanext/extractor/config.py", line 71, in load_config
    load_environment(conf.global_conf, conf.local_conf)
  File "/usr/lib/ckan/default/src/ckan/ckan/config/environment.py", line 99, in load_environment
    p.load_all()
  File "/usr/lib/ckan/default/src/ckan/ckan/plugins/core.py", line 139, in load_all
    load(*plugins)
  File "/usr/lib/ckan/default/src/ckan/ckan/plugins/core.py", line 153, in load
    service = _get_service(plugin)
  File "/usr/lib/ckan/default/src/ckan/ckan/plugins/core.py", line 256, in _get_service
    raise PluginNotFoundException(plugin_name)
PluginNotFoundException: datasetthumbnail
2018-10-14 13:57:49,129 ERROR [rq.worker] PluginNotFoundException: datasetthumbnail
Traceback (most recent call last):
  File "/usr/lib/ckan/default/lib/python2.7/site-packages/rq/worker.py", line 588, in perform_job
    rv = job.perform()
  File "/usr/lib/ckan/default/lib/python2.7/site-packages/rq/job.py", line 498, in perform
    self._result = self.func(*self.args, **self.kwargs)
  File "/usr/lib/ckan/default/src/ckanext-extracter/ckanext/extractor/tasks.py", line 67, in extract
    load_config(ini_path)
  File "/usr/lib/ckan/default/src/ckanext-extracter/ckanext/extractor/config.py", line 71, in load_config
    load_environment(conf.global_conf, conf.local_conf)
  File "/usr/lib/ckan/default/src/ckan/ckan/config/environment.py", line 99, in load_environment
    p.load_all()
  File "/usr/lib/ckan/default/src/ckan/ckan/plugins/core.py", line 139, in load_all
    load(*plugins)
  File "/usr/lib/ckan/default/src/ckan/ckan/plugins/core.py", line 153, in load
    service = _get_service(plugin)
  File "/usr/lib/ckan/default/src/ckan/ckan/plugins/core.py", line 256, in _get_service
    raise PluginNotFoundException(plugin_name)
PluginNotFoundException: datasetthumbnail
Traceback (most recent call last):
  File "/usr/lib/ckan/default/lib/python2.7/site-packages/rq/worker.py", line 588, in perform_job
    rv = job.perform()
  File "/usr/lib/ckan/default/lib/python2.7/site-packages/rq/job.py", line 498, in perform
    self._result = self.func(*self.args, **self.kwargs)
  File "/usr/lib/ckan/default/src/ckanext-extracter/ckanext/extractor/tasks.py", line 67, in extract
    load_config(ini_path)
  File "/usr/lib/ckan/default/src/ckanext-extracter/ckanext/extractor/config.py", line 71, in load_config
    load_environment(conf.global_conf, conf.local_conf)
  File "/usr/lib/ckan/default/src/ckan/ckan/config/environment.py", line 99, in load_environment
    p.load_all()
  File "/usr/lib/ckan/default/src/ckan/ckan/plugins/core.py", line 139, in load_all
    load(*plugins)
  File "/usr/lib/ckan/default/src/ckan/ckan/plugins/core.py", line 153, in load
    service = _get_service(plugin)
  File "/usr/lib/ckan/default/src/ckan/ckan/plugins/core.py", line 256, in _get_service
    raise PluginNotFoundException(plugin_name)
PluginNotFoundException: datasetthumbnail
2018-10-14 13:57:49,129 WARNI [rq.worker] Moving job to u'failed' queue
2018-10-14 13:57:49,135 INFO  [ckan.lib.jobs] Worker rq:worker:localhost.13874 has finished job a0d3da6b-a7f6-4199-a6b4-76bdb81b12d8 from queue "default"
2018-10-14 13:57:49,136 INFO  [rq.worker] 
2018-10-14 13:57:49,137 INFO  [rq.worker] *** Listening on ckan:default:default...

How do we handle resources uploaded via datastore_create?

How do we handle resources that are uploaded via the API to datastore_create? I just get the following error saying it can't find the URL:


2018-09-07 20:15:41,106 INFO  [rq.worker] ckan:default:default: ckanext.extractor.tasks.extract('/etc/ckan/default/production.ini', {u'cache_last_updated': None, u'package_id': u'045626ce-96c5-4eb5-a248-d3b1e5a9eb2f', u'datastore_active': True, u'id': u'ce1b08ed-51e2-4ec8-9eb9-1bb0bec237e9', u'size': None, u'restricted': u'{"allowed_users": "", "level": "public"}', u'state': u'active', u'hash': u'', u'description': u'', u'format': u'data dictionary', u'mimetype_inner': None, u'url_type': None, u'mimetype': None, u'cache_url': None, u'name': u'rees', u'created': '2018-09-07T20:14:21.774267', u'url': u'', u'last_modified': None, u'position': 7, u'revision_id': u'11d499ed-19f0-491c-a2bb-482b74c3cdca', u'tag_string_resource': u'', u'resource_type': u''}) (a92631fc-7197-416e-bd0e-1df1b7d5e421)
2018-09-07 20:15:41,109 INFO  [ckan.lib.jobs] Worker rq:worker:MECALDDMPCKN01.19289 starts job a92631fc-7197-416e-bd0e-1df1b7d5e421 from queue "default"
2018-09-07 20:15:44,209 DEBUG [ckanext.extractor.model] Resource metadata table already defined
2018-09-07 20:15:44,209 DEBUG [ckanext.extractor.model] Resource metadatum table already defined
2018-09-07 20:15:47,618 DEBUG [ckanext.extractor.model] Resource metadata table already defined
2018-09-07 20:15:47,618 DEBUG [ckanext.extractor.model] Resource metadatum table already defined
2018-09-07 20:15:49,236 WARNI [ckanext.extractor.tasks] Failed to download resource data from "": Invalid URL '': No schema supplied. Perhaps you meant http://?
2018-09-07 20:15:49,306 DEBUG [ckanext.extractor.logic.action] extractor_show 53fc14dd-3ffb-4407-88ea-a66feeef87e0
2018-09-07 20:15:49,314 DEBUG [ckanext.extractor.logic.action] extractor_show 990d609a-1231-4ed8-8d95-a2e53311cf6d
2018-09-07 20:15:49,330 DEBUG [ckanext.extractor.logic.action] extractor_show 9de52d49-2e53-45fb-b5be-c287b10d3cb3
2018-09-07 20:15:49,339 DEBUG [ckanext.extractor.logic.action] extractor_show 849c6583-f795-4de6-89b8-b5130fb1e3e9
2018-09-07 20:15:49,346 DEBUG [ckanext.extractor.logic.action] extractor_show 3e67f293-0cbb-417e-949c-7ceb7116829a
2018-09-07 20:15:49,354 DEBUG [ckanext.extractor.logic.action] extractor_show 45b35abb-4f05-4fed-85a5-8b198e58e578
2018-09-07 20:15:49,361 DEBUG [ckanext.extractor.logic.action] extractor_show 207db850-c5e2-4508-887b-4107d8a7684a
2018-09-07 20:15:49,368 DEBUG [ckanext.extractor.logic.action] extractor_show ce1b08ed-51e2-4ec8-9eb9-1bb0bec237e9

NotAuthorized Error

I am seeing a NotAuthorized Error when the celery Daemon receives a task.

Timeout when extracting a large dataset

I have a big dataset with 800,000 records. When I do the extraction it fails with the following error:

[2017-10-10 14:43:08,491: ERROR/MainProcess] Task extractor.extract[401c7ccc-7a3c-455e-a5f4-f23b804ae43d] raised unexpected: SearchIndexError('Solr returned an error: (u"Connection to server 'http://solr_server/solr/ckan/update/?commit=true' timed out: HTTPConnectionPool(host='#########', port=8983): Read timed out. (read timeout=60)",)',)
Traceback (most recent call last):
File "/usr/lib/ckan/default/lib/python2.7/site-packages/celery/app/trace.py", line 240, in trace_task
R = retval = fun(*args, **kwargs)
File "/usr/lib/ckan/default/lib/python2.7/site-packages/celery/app/trace.py", line 438, in protected_call
return self.run(*args, **kwargs)
File "/usr/lib/ckan/default/src/ckanext-extractor/ckanext/extractor/tasks.py", line 94, in extract
index_for('package').update_dict(pkg_dict)
File "/usr/lib/ckan/default/src/ckan/ckan/lib/search/index.py", line 101, in update_dict
self.index_package(pkg_dict, defer_commit)
File "/usr/lib/ckan/default/src/ckan/ckan/lib/search/index.py", line 295, in index_package
raise SearchIndexError(msg)
SearchIndexError: Solr returned an error: (u"Connection to server 'http://XXXXXXXXXXXXXXXXX/solr/ckan/update/?commit=true' timed out: HTTPConnectionPool(host='xxxxxxxxxx', port=8983): Read timed out. (read timeout=60)",)

Has anyone else had the same issue? Can anyone please let me know how to fix it?
Thanks in advance!

Error for PDF Resources

When I attempt to create a resource with a PDF file I receive the following error:
raised unexpected: SearchIndexError("Solr returned an error: (u'Solr responded with an error (HTTP 400): [Reason: ERROR: [doc=6b8e5b3b06fb3097149ebc2caffa7ffa] multiple values encountered for non multiValued field ckanext-extractor_b7e8d049-9b51-4c98-8d09-33b4879a45d7_x-parsed-by: [org.apache.tika.parser.DefaultParser, org.apache.tika.parser.pdf.PDFParser]]',)",)

Any thoughts?

After the extraction the index is not updated

I'm not sure if I understand the handling correctly. As far as I can tell, thanks to celeryd, all new uploads (e.g. of a PDF file) are automatically extracted. But the result is not yet present in the search index, so to actually make use of the extracted full text for search I have to rebuild the index.

Is this correct? Or should the index eventually be updated?

Highlighting (snippets)

Have you seen any need for showing search match snippets?

We'd like to show search match relevance. Snippets from the usual indexed fields as well as the full text field would probably be very useful for us. Just thought I'd raise it here, although it should probably be done as a separate plugin.

One way of implementing it might just be to configure solr to store the fulltext field and enable highlighting. And then have the package search API include the highlighting results in the response somehow.

I'm keen to hear your thoughts.

Auto tagging

Consider integrating with http://api.reegle.info/ to automatically create tags for concepts, places, and other entities.

Some more context - actually prototyped integration with Semantic Mediawiki and it worked surprisingly well even for non-cleantech related content. It was still able to recognize generic concepts, places, and people. And the best part is the auto-tagging API is free.

Vocabulary Tags removed from dataset when worker extracts text

I have a few custom ckan tag vocabularies for my datasets. It looks like when the worker extracts the text, the vocabulary tags are removed from the dataset.

I haven't looked into the worker code yet and I'm still on [email protected]

Basically the only thing I have using celery (yes, still on celery despite you upgrading it to work with redis on my request, sorry) is this.

When I create a dataset, I assign a couple of vocabulary tags to it.

When I add a PDF resource to it and programmatically request the package immediately afterwards, they're still set correctly.

A few seconds later they're not set any more.

If I stop the celery worker, the tags will stay in place until I start the worker again.

Any idea why this might be? I'll dive into the worker code ASAP but it's taken me a day or so to track this down to this plugin so it might not be tomorrow.

As always, I'm such a huge fan of this and appreciate it very much. Just posting here so long in case you know very quickly what it is. I'll update when I know more.

I think this has been hidden in the past because I used a script that would update (and fix) the package each time I add a resource, and I generally add XLS resources after adding PDF resources to the same datasets, and I have extractor configured to only extract PDF resources.

Installation on Ubuntu 14.04

I have a problem installing this extension on Ubuntu 14.04.

You wrote that Solr packages are broken for Ubuntu and that people should download the correct jar files. I downloaded them then added the appropriate lines to solrconfig.xml. However, I still have issues with Solr extracting metadata.

Here's the error I get

Traceback (most recent call last):
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/celery/app/trace.py", line 240, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/celery/app/trace.py", line 438, in __protected_call__
    return self.run(*args, **kwargs)
  File "/usr/lib/ckan/default/src/ckanext-extractor/ckanext/extractor/tasks.py", line 62, in extract
    extracted = download_and_extract(res_dict['url'])
  File "/usr/lib/ckan/default/src/ckanext-extractor/ckanext/extractor/lib.py", line 43, in download_and_extract
    data = pysolr.Solr(config['solr_url']).extract(f, extractFormat='text')
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/pysolr.py", line 979, in extract
    files={'file': (file_obj.name, file_obj)})
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/pysolr.py", line 394, in _send_request
    raise SolrError(error_message % (resp.status_code, solr_message))
SolrError: Solr responded with an error (HTTP 500): [Reason: None]

Any advice you can give would be appreciated.
