
sickle's Introduction

Sickle: OAI-PMH for Humans


Sickle is a lightweight OAI-PMH client library written in Python. It has been designed for conveniently retrieving data from OAI interfaces the Pythonic way:

>>> from sickle import Sickle
>>> sickle = Sickle('http://elis.da.ulcc.ac.uk/cgi/oai2')
>>> records = sickle.ListRecords(metadataPrefix='oai_dc')
>>> records.next()
<Record oai:eprints.rclis.org:4088>

Features

  • Easy harvesting of OAI-compliant interfaces
  • Support for all six OAI verbs
  • Convenient object representations of OAI items (records, headers, sets, ...)
  • Automatic de-serialization of Dublin Core-encoded metadata payloads to Python dictionaries
  • Option for ignoring deleted items

Installation

pip install sickle

Dependencies:

Documentation

Documentation is available at Read the Docs

Development

sickle's People

Contributors

gaubert, gugek, lnielsen, mloesch, mrmiguez, sourcefilter


sickle's Issues

Record Range Pull Inconsistency

I am using Python's Sickle library to harvest metadata records from 'http://export.arxiv.org/oai2', with a condition to obtain records published between 2020-01-01 to 2020-01-10 only.

Below is my code block.

from sickle import Sickle
sickle = Sickle('http://export.arxiv.org/oai2')
records = sickle.ListRecords(**{'metadataPrefix': 'oai_dc', 'from': '2020-01-01', 'until': '2020-01-10', 'ignore_deleted':'True'})
for i in records:
    metadata = i.get_metadata()
    title = metadata.get('title')[0]
    print(metadata)
    break

Yet, it is giving an output of a record published to Arxiv on 2007-05-14. This is a bit confusing. Can you please help?
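
One likely explanation, offered as an assumption rather than a confirmed diagnosis: in OAI-PMH, from and until select on the datestamp in each record's header, which reflects when the record was created, modified, or deleted in the repository, not when the item was originally published. A record first published in 2007 can therefore fall inside a 2020 window if its metadata was touched then. The sketch below uses only the standard library and a made-up sample record (not real arXiv data; the identifier and dates are invented) to show the two dates side by side.

```python
import xml.etree.ElementTree as ET

OAI_NS = '{http://www.openarchives.org/OAI/2.0/}'
DC_NS = '{http://purl.org/dc/elements/1.1/}'

# Hypothetical record, invented for illustration only.
SAMPLE_RECORD = """
<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <header>
    <identifier>oai:example.org:0001</identifier>
    <datestamp>2020-01-05</datestamp>
  </header>
  <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>An example title</dc:title>
      <dc:date>2007-05-14</dc:date>
    </oai_dc:dc>
  </metadata>
</record>
"""


def record_dates(record_xml):
    """Return (header datestamp, list of dc:date values) for one record."""
    root = ET.fromstring(record_xml)
    datestamp = root.find(OAI_NS + 'header/' + OAI_NS + 'datestamp').text
    dc_dates = [el.text for el in root.iter(DC_NS + 'date')]
    return datestamp, dc_dates


datestamp, dc_dates = record_dates(SAMPLE_RECORD)
print('header datestamp (what from/until select on):', datestamp)
print('dc:date values (descriptive metadata):', dc_dates)
```

With Sickle, comparing record.header.datestamp against the metadata's date field for the unexpected record would show whether this is what is happening.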

Project status

Hi @mloesch,

thank you for this great library. I would like to know:

  • What is the status of this project?
  • Is it still actively maintained?
  • Do you accept PRs?
  • Are you looking for contributors and/or maintainers?

Thank you in advance for your response.

Use response.content instead of response.text.encode("utf-8")?

I believe this supersedes issue 20. The solution is the same but I've dug around a bit to explain why.

Reproducing the bug

from sickle import Sickle

record = (Sickle('https://archive-it.org/oai')
          .GetRecord(identifier='http://archive-it.org/collections/2323',
                     metadataPrefix='oai_dc'))

print(record.metadata, '\n\n', record.metadata['description'][0][124:137])
{'title': ['Jasmine Revolution - Tunisia 2011'], 'subject': ['spontaneousEvents', 'blogsAndSocialMedia', 'government-National'], 'description': ['This collection consists of websites documenting the revolution in Tunisia in 2011. Our partners at Library of Congress and Bibliothèque Nationale de France have contributed websites for this collection, and the sites are primarily in French and Arabic with some in English.'], 'identifier': ['http://archive-it.org/collections/2323']} 

 Bibliothèque

The problem

  • requests tries to be clever and detect the encoding
  • but it doesn't look at the explicit xml "encoding" property! (cf. requests docs)
  • thus, response.text is an incorrectly-decoded version of response.content

import requests

response = requests.get('https://archive-it.org/oai?verb=GetRecord&metadataPrefix=oai_dc&identifier=http://archive-it.org/collections/2323')
print(response.content[:38], response.encoding, sep='\n')
b'<?xml version="1.0" encoding="UTF-8"?>'
ISO-8859-1

More info: psf/requests#1604

The solution

  • pass response.content (the raw response bytestring) to lxml instead of re-encoding response.text
  • presumably, lxml is aware of and uses the encoding given in the XML declaration

from lxml import etree

tree = etree.XML(response.content)
(tree.getchildren()[2].getchildren()[0].getchildren()[1]
 .getchildren()[0].getchildren()[4].text[124:136])
'Bibliothèque'
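
The effect can be reproduced offline. The sketch below is an illustration with the standard library's xml.etree.ElementTree standing in for lxml (an assumption made so the example runs without third-party packages); the tiny document is made up. It decodes UTF-8 bytes as ISO-8859-1 and re-encodes them, which is what text.encode("utf-8") amounts to when the charset was guessed wrongly.

```python
import xml.etree.ElementTree as ET

# Bytes as a UTF-8 server would send them (made-up document).
raw = '<?xml version="1.0" encoding="UTF-8"?><d>Bibliothèque</d>'.encode('utf-8')

# Decode with the wrong charset, then re-encode as UTF-8:
# each non-ASCII character is now encoded twice.
mis_decoded = raw.decode('iso-8859-1')
re_encoded = mis_decoded.encode('utf-8')

good = ET.fromstring(raw).text         # parser honours the XML declaration
bad = ET.fromstring(re_encoded).text   # declaration is true, content is mangled

print(repr(good), len(good))
print(repr(bad), len(bad))
```

The mangled string is one character longer than the correct one, which matches the different slice ends ([124:137] versus [124:136]) used in the two snippets above.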

AttributeError: 'NoneType' object has no attribute 'find'

I am unsure what the problem is, but I keep getting the following error when trying to harvest a collection from Qatar Digital Library. I have to harvest through a whitelisted server, so unfortunately no one else will be able to test this, but I'm hoping someone has a better instinct about why I'm getting this error and, more importantly, how to avoid it. The last time I harvested these records there were more than 32k, but I keep getting this error on record number 18,108. I would like to skip this record (and any other record with a similar problem) and harvest the rest of them, but the script always stops on this record. Here is the complete error message:

Traceback (most recent call last):
  File "qnl-harvest.py", line 26, in <module>
    for count, record in enumerate(records, start=1):
  File "/opt/app/harvester/.local/lib/python3.4/site-packages/sickle/iterator.py", line 52, in __next__
    return self.next()
  File "/opt/app/harvester/.local/lib/python3.4/site-packages/sickle/iterator.py", line 151, in next
    self._next_response()
  File "/opt/app/harvester/.local/lib/python3.4/site-packages/sickle/iterator.py", line 138, in _next_response
    super(OAIItemIterator, self)._next_response()
  File "/opt/app/harvester/.local/lib/python3.4/site-packages/sickle/iterator.py", line 85, in _next_response
    error = self.oai_response.xml.find(
AttributeError: 'NoneType' object has no attribute 'find'

Here is my script:

import errno, os
from sickle import Sickle
from sickle.iterator import OAIResponseIterator

# where to write data to (relative to the dlme-harvest repo folder)
base_output_folder = 'output'

sickle = Sickle('https://api.qdl.qa/oaipmh')
print("Sickle instance created.") # status update

records = sickle.ListRecords(metadataPrefix='mods', ignore_deleted=True)
print("Records created.") # status update

directory = "output/qnl/data/"
os.makedirs(os.path.dirname(directory), exist_ok=True)

for count, record in enumerate(records, start=1):
    try:
        print("Record number " + str(count))
        out_file = 'output/qnl/data/qnl-{}.xml'.format(count)
        directory_name = os.path.dirname(out_file)
        with open(out_file, 'w') as f:
            f.write(record.raw)
    except Exception as err:
        print(err)
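
The traceback points at the for line itself, so the exception is raised while the iterator advances, outside the try block in the loop body. A minimal sketch of one way to catch it is to call next() inside the try. FlakyIterator and harvest_skipping_errors are hypothetical names used for illustration and are not part of Sickle. Whether Sickle's iterator can usefully continue after such a failure is not established here (the failing call fetches a whole page, so retrying may simply repeat the error), which is why the sketch caps the number of tolerated errors.

```python
class FlakyIterator:
    """Hypothetical stand-in for a harvesting iterator; not part of Sickle."""

    def __init__(self, items, bad_index):
        self._items = list(items)
        self._bad_index = bad_index
        self._pos = 0

    def __iter__(self):
        return self

    def __next__(self):
        if self._pos >= len(self._items):
            raise StopIteration
        pos = self._pos
        self._pos += 1
        if pos == self._bad_index:
            raise AttributeError("'NoneType' object has no attribute 'find'")
        return self._items[pos]


def harvest_skipping_errors(iterator, max_errors=10):
    """Advance the iterator by hand so errors raised by next() can be caught."""
    harvested, errors = [], []
    while len(errors) < max_errors:
        try:
            item = next(iterator)
        except StopIteration:
            break
        except Exception as err:  # error raised while advancing the iterator
            errors.append(err)
            continue
        harvested.append(item)
    return harvested, errors


records, errors = harvest_skipping_errors(FlakyIterator(['a', 'b', 'c'], bad_index=1))
print(records, len(errors))
```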

Resumption Token with until

I'm not sure if this is a problem with a particular OAI endpoint I am working with, or with Sickle (although I'm leaning towards the former). I'm trying to selectively harvest an endpoint using an until timestamp:

import logging
from sickle import Sickle

logging.basicConfig(level=logging.DEBUG)

sickle = Sickle("https://api.qdl.qa/api/oaipmh")
records = sickle.ListRecords(
    metadataPrefix='mods_no_ocr',
    until="2019-10-15T19:00:00Z"
)

for rec in records:
    print(rec.header.xml.find('{http://www.openarchives.org/OAI/2.0/}datestamp').text)

When I run this I see:

DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): api.qdl.qa:443
DEBUG:urllib3.connectionpool:https://api.qdl.qa:443 "GET /api/oaipmh?metadataPrefix=mods_no_ocr&until=2019-10-15T19%3A00%3A00Z&verb=ListRecords HTTP/1.1" 200 None
2019-10-15T16:43:48.818Z
2019-10-15T16:43:48.818Z
2019-10-15T16:45:27.094Z
2019-10-15T16:45:27.094Z
2019-10-15T16:46:40.424Z
2019-10-15T16:46:40.424Z
2019-10-15T16:52:13.539Z
2019-10-15T16:52:13.539Z
2019-10-15T17:08:29.977Z
2019-10-15T17:08:29.977Z
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): api.qdl.qa:443
DEBUG:urllib3.connectionpool:https://api.qdl.qa:443 "GET /api/oaipmh?resumptionToken=10mods_no_ocr&verb=ListRecords HTTP/1.1" 200 None
2019-10-15T16:52:13.539Z
2019-10-15T16:52:13.539Z
2019-10-15T17:08:29.977Z
2019-10-15T17:08:29.977Z
2019-10-15T18:46:15.172Z
2019-10-15T18:46:15.172Z
2020-05-24T04:16:31.944Z
2020-05-24T04:16:31.944Z
2019-10-15T18:52:19.668Z
2019-10-15T18:52:19.668Z
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): api.qdl.qa:443
DEBUG:urllib3.connectionpool:https://api.qdl.qa:443 "GET /api/oaipmh?resumptionToken=20mods_no_ocr&verb=ListRecords HTTP/1.1" 200 None
2019-10-15T18:52:34.072Z
2019-10-15T18:52:34.072Z
2019-10-15T18:52:46.162Z
2019-10-15T18:52:46.162Z
2019-10-15T18:53:08.176Z
2019-10-15T18:53:08.176Z
2019-10-15T18:53:28.807Z
2019-10-15T18:53:28.807Z
2019-11-07T12:37:23.928Z
2019-11-07T12:37:23.928Z
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): api.qdl.qa:443
DEBUG:urllib3.connectionpool:https://api.qdl.qa:443 "GET /api/oaipmh?resumptionToken=30mods_no_ocr&verb=ListRecords HTTP/1.1" 200 None
2019-10-15T18:55:50.294Z
2019-10-15T18:55:50.294Z
2020-05-27T08:48:04.998Z
2020-05-27T08:48:04.998Z
2019-11-13T10:41:20.582Z
2019-11-13T10:41:20.582Z
2019-11-14T10:39:25.215Z
2019-11-14T10:39:25.215Z
2020-08-28T13:04:28.351Z
2020-08-28T13:04:28.351Z
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): api.qdl.qa:443
DEBUG:urllib3.connectionpool:https://api.qdl.qa:443 "GET /api/oaipmh?resumptionToken=40mods_no_ocr&verb=ListRecords HTTP/1.1" 200 None
2019-11-13T16:59:41.238Z
2019-11-13T16:59:41.238Z
2019-11-13T14:19:06.199Z
2019-11-13T14:19:06.199Z
2020-08-28T13:21:22.953Z
2020-08-28T13:21:22.953Z
2020-09-28T15:30:41.700Z
2020-09-28T15:30:41.700Z
2020-09-28T15:48:37.471Z
2020-09-28T15:48:37.471Z
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): api.qdl.qa:443
DEBUG:urllib3.connectionpool:https://api.qdl.qa:443 "GET /api/oaipmh?resumptionToken=50mods_no_ocr&verb=ListRecords HTTP/1.1" 200 None
2022-02-14T20:44:17.077Z
2022-02-14T20:44:17.077Z
2022-02-14T20:44:24.159Z
2022-02-14T20:44:24.159Z
2022-02-14T20:44:17.142Z
2022-02-14T20:44:17.142Z
2022-02-14T20:44:52.422Z
2022-02-14T20:44:52.422Z
2022-02-14T20:44:59.224Z
2022-02-14T20:44:59.224Z
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): api.qdl.qa:443
DEBUG:urllib3.connectionpool:https://api.qdl.qa:443 "GET /api/oaipmh?resumptionToken=60mods_no_ocr&verb=ListRecords HTTP/1.1" 200 None
2022-03-28T11:08:22.770Z
2022-03-28T11:08:22.770Z
2022-02-14T20:45:42.845Z
2022-02-14T20:45:42.845Z
2022-02-14T20:46:43.260Z
2022-02-14T20:46:43.260Z
2022-02-14T20:46:54.444Z
2022-02-14T20:46:54.444Z
2022-05-25T05:29:34.465Z
2022-05-25T05:29:34.465Z
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): api.qdl.qa:443

The timestamps for the records clearly show that the server isn't respecting the until value as it uses the resumptionToken. But I noticed that if I manually craft a URL that includes until with the resumptionToken that it seems to work properly, since it returns the next 10 records in the set of 52?

https://api.qdl.qa/api/oaipmh?resumptionToken=10mods_no_ocr&verb=ListRecords&until=2019-10-15T19%3A00%3A00Z

My understanding from the specification is that calls to ListRecords with the resumptionToken shouldn't include until because resumptionToken is exclusive? So it appears that Sickle is behaving properly and the server is broken?
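
If the server cannot be fixed, one possible workaround is to filter on the client side. The sketch below is an assumption about how that could look, not something Sickle provides: parse_utc and on_or_before are hypothetical helpers that compare each header datestamp against the until value, accepting timestamps with or without fractional seconds as seen in the log above.

```python
from datetime import datetime, timezone


def parse_utc(stamp):
    """Parse a 'Z'-suffixed UTC timestamp, with or without fractional seconds."""
    for fmt in ('%Y-%m-%dT%H:%M:%S.%fZ', '%Y-%m-%dT%H:%M:%SZ'):
        try:
            return datetime.strptime(stamp, fmt).replace(tzinfo=timezone.utc)
        except ValueError:
            continue
    raise ValueError('unrecognised datestamp: %r' % stamp)


def on_or_before(datestamp, until):
    """True if the record datestamp is not later than the until bound."""
    return parse_utc(datestamp) <= parse_utc(until)


UNTIL = '2019-10-15T19:00:00Z'
stamps = [
    '2019-10-15T16:43:48.818Z',
    '2020-05-24T04:16:31.944Z',
    '2019-10-15T18:52:19.668Z',
]
kept = [s for s in stamps if on_or_before(s, UNTIL)]
print(kept)
```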

Any help confirming this conclusion would be greatly appreciated.

PS. Thank you for a rock solid and extensible OAI-PMH library!

Connection broken: IncompleteRead

Hello,
when trying to harvest the data from the server http://staroai.theses.fr/OAIHandler with the tef metadataPrefix I end up getting a message saying :

Connection broken: IncompleteRead

It's not easy to replicate: on my last two retries it happened when fetching page 299 and page 155. It seems to be some kind of timeout issue, but I am not sure of the best way to handle it.

ListRecords not picking up on resumption token

This is most likely a user error but I've been through the docs, issues, etc. and can't figure this out. I am expecting this to use the resumption token to continue to retrieve the next set of records but it doesn't. Any pointers would be appreciated. Here is my code:

import errno
import os
from lxml import etree
from sickle import Sickle
from sickle.iterator import OAIResponseIterator

def to_str(bytes_or_str):
    '''Takes bytes or string and returns string'''
    if isinstance(bytes_or_str, bytes):
        value = bytes_or_str.decode('utf-8')
    else:
        value = bytes_or_str
    return value  # Instance of str

sickle = Sickle('http://cdm21044.contentdm.oclc.org/oai/oai.php', iterator=OAIResponseIterator)

sets = ['Kitapvehat', 'ResimKlksyn', 'emirgan', 'abidindino']

for item in sets:
    records = sickle.ListRecords(metadataPrefix='oai_dc', set=item)
    file_name = '{}/data/{}.xml'.format(item, item)
    if not os.path.exists(os.path.dirname(file_name)):
        try:
            os.makedirs(os.path.dirname(file_name))
        except OSError as exc: # Guard against race condition
            if exc.errno != errno.EEXIST:
                raise
    
    with open(file_name, 'w') as f:
        f.write(to_str(records.next().raw.encode('utf8')))
    
    f.close()
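
A possible reading of the script above: records.next() is called once per set, so only the first response is written. As far as the iterator's documented behaviour goes, it is the act of iterating over records that triggers the follow-up requests with the resumption token. The sketch below shows that pattern with a hypothetical helper, save_pages, and stand-in page objects that only have a raw attribute, so it runs without a network connection. Each page goes to its own file because concatenating several complete OAI-PMH responses would not give one well-formed XML document.

```python
import os
import tempfile
from types import SimpleNamespace


def save_pages(responses, directory, stem):
    """Write every response from an iterable to its own numbered XML file."""
    os.makedirs(directory, exist_ok=True)
    paths = []
    for page_number, response in enumerate(responses, start=1):
        path = os.path.join(directory, '{}-{:04d}.xml'.format(stem, page_number))
        with open(path, 'w', encoding='utf-8') as f:
            f.write(response.raw)
        paths.append(path)
    return paths


# Stand-ins for the responses an OAIResponseIterator would yield.
fake_pages = [SimpleNamespace(raw='<page>1</page>'),
              SimpleNamespace(raw='<page>2</page>')]
out_dir = tempfile.mkdtemp()
written = save_pages(iter(fake_pages), out_dir, 'emirgan')
print(len(written), 'files written')
```

In the original script this would be called as save_pages(records, '{}/data'.format(item), item) in place of the single write.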

Iteration with next() is very slow

Iteration with next() gets very slow when the OAIItemIterator is "empty" but StopIteration has not been raised yet. It takes several minutes.

Example:
from sickle import Sickle

oai_end = 'http://ws.pangaea.de/oai/provider'
sickle = Sickle(oai_end)
records = sickle.ListRecords(**{'metadataPrefix':'oai_dc', 'set': 'query~cHJvamVjdDpsYWJlbDpEQU0gQU5EIGV2ZW50Om1ldGhvZDpGZXJyeUJveA', 'ignore_deleted':'True'})
entry = next(records)
# records contains only one entry for the time being. This may change in future
next(records)

SSL issues where there shouldn't be any

Hi,

When trying to use Sickle to hit https://rdmtest1.computecanada.ca/oai/request, which has a valid cert according to Chrome, I get this SSL error:

Traceback (most recent call last):
  File "globus_harvester.py", line 302, in <module>
    oai_harvest_with_thumbnails(repository)
  File "globus_harvester.py", line 264, in oai_harvest_with_thumbnails
    records = sickle.ListRecords(metadataPrefix='oai_dc', ignore_deleted=True)
  File "/usr/local/lib/python3.5/dist-packages/sickle/app.py", line 129, in ListRecords
    return self.iterator(self, params, ignore_deleted=ignore_deleted)
  File "/usr/local/lib/python3.5/dist-packages/sickle/iterator.py", line 135, in __init__
    super(OAIItemIterator, self).__init__(sickle, params, ignore_deleted)
  File "/usr/local/lib/python3.5/dist-packages/sickle/iterator.py", line 46, in __init__
    self._next_response()
  File "/usr/local/lib/python3.5/dist-packages/sickle/iterator.py", line 138, in _next_response
    super(OAIItemIterator, self)._next_response()
  File "/usr/local/lib/python3.5/dist-packages/sickle/iterator.py", line 84, in _next_response
    self.oai_response = self.sickle.harvest(**params)
  File "/usr/local/lib/python3.5/dist-packages/sickle/app.py", line 102, in harvest
    auth=self.auth)
  File "/usr/local/lib/python3.5/dist-packages/requests/api.py", line 71, in get
    return request('get', url, params=params, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/requests/api.py", line 57, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/requests/sessions.py", line 475, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.5/dist-packages/requests/sessions.py", line 585, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/requests/adapters.py", line 477, in send
    raise SSLError(e, request=request)
requests.exceptions.SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:645)

This is after installing certifi which normally works around Python SSL issues, and updating requests. Can duplicate in 2.7.11 and 3.5.1.

Is there an easy way of just passing verify=False to sickle, assuming I can't get Python to accept this cert?

Retry on timeouts and connection errors

I'm harvesting from a server which frequently times out (requests.exceptions.Timeout). Then, the request is not retried even though I set max_retries, since the retry functionality only covers the case where you actually get a response from the server.

I would like to extend the retry functionality to also include timeouts, but rather than increasing the complexity of the _request method further, I think it's worth considering switching to the tested and tried Retry urllib3 class. For some background on the class, see https://kevin.burke.dev/kevin/urllib3-retries/

Retry also handles the Retry-After header, so it shouldn't be that different from the current behaviour. The main difference is that it uses a backoff factor instead of a fixed sleep time:

sleep_time = backoff_factor * (2**retry_number)

Since OAI-PMH servers can be quite slow, we could set the default backoff factor to something like 2, to make the sleep time increase quickly. It is capped at BACKOFF_MAX=120 seconds by default:

>>> for x in range(2,10):
...     print('Retry %s : sleep time %.1f seconds' % (x, min(120, 2 * (2**x))))
Retry 2 : sleep time 8.0 seconds
Retry 3 : sleep time 16.0 seconds
Retry 4 : sleep time 32.0 seconds
Retry 5 : sleep time 64.0 seconds
Retry 6 : sleep time 120.0 seconds
Retry 7 : sleep time 120.0 seconds
Retry 8 : sleep time 120.0 seconds
Retry 9 : sleep time 120.0 seconds
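
The same schedule as a small reusable function, using exactly the formula quoted in this issue. backoff_schedule is a hypothetical helper for illustration; the exponent a given urllib3 release actually uses may differ from this formula, so treat the numbers as a sketch of the proposal, not of urllib3 internals.

```python
def backoff_schedule(backoff_factor, retry_numbers, cap=120):
    """Sleep times per the formula quoted above, capped at `cap` seconds."""
    return [min(cap, backoff_factor * (2 ** n)) for n in retry_numbers]


schedule = backoff_schedule(2, range(2, 10))
for n, seconds in zip(range(2, 10), schedule):
    print('Retry %s : sleep time %.1f seconds' % (n, seconds))
```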

Breaking change: This means that the default_retry_after argument would no longer be supported.

Let me know what you think, and whether there is a chance a PR for this would be accepted.

Python3 sickle?

Hi, should sickle work in Python 3? I converted one of my old Python scripts to Python 3 via 2to3, but I am now getting some errors with sickle.

When the script contains just

from sickle import Sickle

I get this error with (Anaconda) Python 3.6.8:

Traceback (most recent call last):
  File "minikoe.py", line 3, in <module>
    from sickle import Sickle
  File "c:\softat\anaconda3\lib\site-packages\sickle\__init__.py", line 13, in <module>
    from .app import Sickle
  File "c:\softat\anaconda3\lib\site-packages\sickle\app.py", line 15, in <module>
    from sickle.iterator import BaseOAIIterator, OAIItemIterator
  File "c:\softat\anaconda3\lib\site-packages\sickle\iterator.py", line 12, in <module>
    from sickle.models import ResumptionToken
  File "c:\softat\anaconda3\lib\site-packages\sickle\models.py", line 11, in <module>
    from lxml import etree
ImportError: cannot import name 'etree'  

I'll check if my lxml is too old or what ...

ListRecords ignores addresses with non-OAI content

from sickle import Sickle

while True:
  try:
    sickle = Sickle('https://furkankalkan.com')
    records = sickle.ListRecords(metadataPrefix='oai_dc', ignore_deleted=True)
    records.next()
  except StopIteration:
    break
  except Exception as e:
    raise e

Content of my site is HTML, but it seems sickle doesn't throw an error when encountering non-OAI-PMH content
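
Until the library rejects such responses itself, a caller could check the body before trusting it. The sketch below is an illustration using only the standard library; looks_like_oai_pmh is a hypothetical helper, and the two sample bodies are made up. It treats a body as OAI-PMH only if it parses as XML and its root element is OAI-PMH in the OAI 2.0 namespace.

```python
import xml.etree.ElementTree as ET

OAI_ROOT_TAG = '{http://www.openarchives.org/OAI/2.0/}OAI-PMH'


def looks_like_oai_pmh(body):
    """True if body parses as XML and its root is an OAI-PMH 2.0 element."""
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        return False  # not well-formed XML at all
    return root.tag == OAI_ROOT_TAG


html = '<html><head><title>My site</title></head><body><p>Hello</p></body></html>'
oai = ('<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">'
       '<responseDate>2020-04-08T13:25:58Z</responseDate></OAI-PMH>')

print(looks_like_oai_pmh(html), looks_like_oai_pmh(oai))
```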

Why is the token invalid after iteration?

Hello,
On Python 3.5.2 + Sickle 0.5, if we display the token after iterating over a ListRecords result, we get an AttributeError. Do you have an explanation for that?

from sickle import Sickle

cm_sickle = Sickle('http://oai.openedition.org' )
records_original = cm_sickle.ListRecords(set='journals:cm', metadataPrefix='mets')
print(records_original.resumption_token.token) 
# Prints correctly the token

for record in records_original:
    print(record.raw)

print(records_original.resumption_token.token) 
# Error AttributeError: 'NoneType' object has no attribute 'token'
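
The error message says that resumption_token itself is None at that point. A plausible reading, stated as an assumption, is that once the last page has been consumed there is no further token to hold. A defensive accessor could look like the sketch below; current_token is a hypothetical helper and the two objects are stand-ins for an iterator mid-harvest and after exhaustion.

```python
from types import SimpleNamespace


def current_token(iterator):
    """Return the token string, or None when no resumption token is set."""
    resumption_token = getattr(iterator, 'resumption_token', None)
    if resumption_token is None:
        return None
    return resumption_token.token


# Stand-ins: one iterator mid-harvest, one after the last page.
mid_harvest = SimpleNamespace(resumption_token=SimpleNamespace(token='abc123'))
exhausted = SimpleNamespace(resumption_token=None)

print(current_token(mid_harvest), current_token(exhausted))
```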

Thanks,

AttributeError when harvesting OAI records without a metadata child

Our Islandora repository publishes collection records alongside item records. The collection records have a <header> child but no <metadata> child, which raises an AttributeError when Sickle harvests them.

Example collection record: http://fsu.digital.flvc.org/oai2?verb=GetRecord&identifier=oai:fsu.digital.flvc.org:fsu_avc50&metadataPrefix=mods

<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
  <responseDate>2020-04-08T13:25:58Z</responseDate>
  <request>http://fsu.digital.flvc.org/oai2</request>
  <GetRecord>
    <record>
      <header>
        <identifier>oai:fsu.digital.flvc.org:fsu_avc50</identifier>
        <datestamp>2019-02-27T19:54:49Z</datestamp>
        <setSpec>fsu_stucamplifemain</setSpec>
      </header>
    </record>
  </GetRecord>
</OAI-PMH>

Python example:

from sickle import Sickle

h = Sickle("https://fsu.digital.flvc.org/oai2")

# item record works
rec1 = h.GetRecord(identifier="oai:fsu.digital.flvc.org:fsu_666", metadataPrefix='mods')
# collection record fails
rec2 = h.GetRecord(identifier="oai:fsu.digital.flvc.org:fsu_avc50", metadataPrefix='mods')

A try/except block in sickle.models.Record fixes the issue.

    def __init__(self, record_element, strip_ns=True):
            # ...snipped...
            try:
                self.metadata = xml_to_dict(
                    self.xml.find(
                        './/' + self._oai_namespace + 'metadata'
                    ).getchildren()[0], strip_ns=self._strip_ns)
            except AttributeError:
                self.metadata = None
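
An alternative to catching the AttributeError is to test for the missing element explicitly. The sketch below illustrates the idea with the standard library's ElementTree instead of lxml (an assumption, to keep it dependency-free); metadata_payload is a hypothetical helper, the header-only record is taken from the example above, and the second record is a made-up minimal MODS record.

```python
import xml.etree.ElementTree as ET

OAI_NS = '{http://www.openarchives.org/OAI/2.0/}'

HEADER_ONLY = """<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <header>
    <identifier>oai:fsu.digital.flvc.org:fsu_avc50</identifier>
    <datestamp>2019-02-27T19:54:49Z</datestamp>
    <setSpec>fsu_stucamplifemain</setSpec>
  </header>
</record>"""

# Made-up record with a metadata payload, for comparison.
WITH_METADATA = """<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <header><identifier>oai:example.org:1</identifier></header>
  <metadata><mods xmlns="http://www.loc.gov/mods/v3"/></metadata>
</record>"""


def metadata_payload(record_xml):
    """Return the first child of <metadata>, or None if there is none."""
    record = ET.fromstring(record_xml)
    metadata = record.find('.//' + OAI_NS + 'metadata')
    if metadata is None or len(metadata) == 0:
        return None
    return metadata[0]


print(metadata_payload(HEADER_ONLY))
print(metadata_payload(WITH_METADATA).tag)
```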

Sickle not retrieving all records from repository

I have been working on retrieving all records from the OAI-PMH repositories of various research institutions using the Sickle library in Python. I have written code that performs consecutive harvesting: it iterates over the records of the various repositories and saves each record as an XML file as well as into a SQL database. Below is an excerpt of the code that specifies the consecutive harvesting of the OAI repository of a smaller research institution.

However, for some reason I am unable to retrieve all the records in the repositories. In the example below for one institution, I am only able to retrieve around 2,900 records from the repository even though the completeListSize was 4,041 the last time I checked. If I use the from parameter and perform a series of selective harvests by date in a loop, I am able to retrieve some additional records, but not all of them.

The OAI interface appears to be sending back an empty resumptionToken, indicating that all records have been retrieved, and therefore no errors are raised. I suspect the issue might be that some of the records in the OAI repository are somehow empty or incomplete, and that the program therefore believes that all records in the repository have been retrieved. A similar but not identical issue with resumptionTokens was raised in #25, but in that case Sickle raised an error.

I am unsure if it’s possible to solve the issue by adding an additional parameter that skips a record that is empty or issues a repeat request or something along those lines?


from sickle import Sickle
import re
import uuid
import pyodbc
import xml.dom.minidom
import xml.sax

api_list = [ \
"https://pure.itu.dk/ws/oai", \
]

date="2020-08.01"
last_retrieval="1950.01.01"


for api in api_list:
    institution = ""
    institution = inst_institution(api)
    record_total=0
    sickle = Sickle(api) 

    harvest_id = uuid.uuid4() # generating a random ID for the record. 

    recs = sickle.ListRecords(**{'metadataPrefix': 'ddf-mxd', 'from': last_retrieval, 'until': date})
    headers = sickle.ListIdentifiers(**{'metadataPrefix': 'ddf-mxd', 'from': last_retrieval, 'until': date})
    for header in headers:
        record_total = record_total + 1
        try:    
            r=recs.next()

        except IndexError:
            record_fail_total = record_fail_total + 1
            failed_record_function(harvest_id, Sidste_indhentning, dagsdato, api, institution, record_fail_total, day_of_harvest) # Failed records being saved to SQL table ”records_failed” 

            
        rec_id = re.search('rec_id=' + chr(34) + '(.+?)' + chr(34) + ' rec_created=', str(r)).group(1)
        print (str(record_total) + " - " + str(rec_id) + " - " + str(institution)) #save a XML-file for each record
        Fil_placering = r"C:\Users\sigur\OneDrive\Skrivebord\Data\\itu\\" + str(rec_id) + ".xml"
        with open(r"C:\Users\sigur\OneDrive\Skrivebord\Data\\itu\\" + str(rec_id) + ".xml", "w", encoding="UTF-8") as text_file:
            print(str(r), file=text_file)
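
One detail that may be worth checking independently of the resumptionToken question: the date and last_retrieval strings in the script contain dots, whereas OAI-PMH specifies from and until as YYYY-MM-DD or YYYY-MM-DDThh:mm:ssZ. Whether this contributes to the missing records is not established. The sketch below is a hypothetical shape check, is_oai_utc_datetime, that verifies only the format and not calendar validity.

```python
import re

# Shape of the two granularities OAI-PMH allows for from/until.
_OAI_DATE = re.compile(r'\d{4}-\d{2}-\d{2}(T\d{2}:\d{2}:\d{2}Z)?')


def is_oai_utc_datetime(value):
    """True if value has the form YYYY-MM-DD or YYYY-MM-DDThh:mm:ssZ."""
    return _OAI_DATE.fullmatch(value) is not None


for candidate in ('2020-08.01', '1950.01.01', '2020-08-01', '2019-10-15T19:00:00Z'):
    print(candidate, is_oai_utc_datetime(candidate))
```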

`.encode` in `__repr__` incorrect

I think the encode method calls in the __repr__ methods in sickle/models.py are incorrect. FWIW, I get exceptions with those:

>>> list(sickle.ListSets())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.6/dist-packages/sickle/models.py", line 168, in __repr__
    return u'<Set %s>'.encode('utf8') % self.setName
TypeError: %b requires a bytes-like object, or an object that implements __bytes__, not 'str'

If I delete them, everything looks fine.
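
The failure can be reproduced without Sickle. In the sketch below, BrokenSet and FixedSet are hypothetical stand-ins that copy only the __repr__ line from the traceback: on Python 3, formatting a bytes template with a str argument raises the TypeError shown above, and dropping the encode call avoids it.

```python
class BrokenSet:
    """Stand-in mirroring the __repr__ from the traceback."""

    def __init__(self, setName):
        self.setName = setName

    def __repr__(self):
        # bytes template % str argument fails on Python 3
        return u'<Set %s>'.encode('utf8') % self.setName


class FixedSet:
    """Same class with the encode call removed."""

    def __init__(self, setName):
        self.setName = setName

    def __repr__(self):
        return '<Set %s>' % self.setName


try:
    repr(BrokenSet('Theses'))
    broken_error = None
except TypeError as err:
    broken_error = err

print(type(broken_error).__name__)
print(repr(FixedSet('Theses')))
```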

Python >3.6 yields deprecation warning for regex

Hello, and thanks for this great project.

I'm running Sickle on Python 3.8.5, but this seems to apply from Python 3.6 upwards, where unrecognized escape sequences such as \{ in non-raw string literals started to raise a DeprecationWarning.

return re.search('(\{.*\})', element.tag).group(1)
yields the following warning (running via pytest for this specific instance):

venv/lib/python3.8/site-packages/sickle/utils.py:20
  /home/user/src/venv/lib/python3.8/site-packages/sickle/utils.py:20: DeprecationWarning: invalid escape sequence \{
    return re.search('(\{.*\})', element.tag).group(1)

-- Docs: https://docs.pytest.org/en/stable/warnings.html

Seems to be something that's promising an easy fix, and I'll see if I can find the time to submit a PR for this.
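
A sketch of the likely fix is to make the pattern a raw string, so the backslashes reach the regex engine unchanged and no string-literal escape is involved. namespace_of_tag is a hypothetical name used here for illustration; it takes the tag string directly instead of an element.

```python
import re


def namespace_of_tag(tag):
    """Return the '{namespace}' prefix of a qualified tag name."""
    # Raw string: \{ and \} are regex escapes, not string escapes.
    return re.search(r'(\{.*\})', tag).group(1)


print(namespace_of_tag('{http://www.openarchives.org/OAI/2.0/}record'))
```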

Empty resumption token

Error
When I request the last entries of CiteSeer (for now it is http://citeseerx.ist.psu.edu/oai2?verb=ListRecords&metadataPrefix=oai_dc&from=2015-12-07), I get the error:

sickle.oaiexceptions.BadArgument: metadataPrefix is required

Reason for this error (maybe)
Since I receive this error after I read all entries, I think the problem could be the following:

CiteSeer returns an empty <resumptionToken/> element, which is correct according to the OAI-PMH documentation:

the response containing the incomplete list that completes the list must include an empty resumptionToken element;

But https://github.com/mloesch/sickle/blob/master/sickle/iterator.py#L61 only checks whether the element exists, not whether it also contains text.
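
A minimal sketch of the stricter check, using the standard library's ElementTree and made-up response fragments. next_resumption_token is a hypothetical helper, not Sickle's implementation: it treats a missing element and an empty element the same way, as "no more pages".

```python
import xml.etree.ElementTree as ET

OAI_NS = '{http://www.openarchives.org/OAI/2.0/}'


def next_resumption_token(response_xml):
    """Return the token text, or None if the element is missing or empty."""
    root = ET.fromstring(response_xml)
    element = root.find('.//' + OAI_NS + 'resumptionToken')
    if element is None:
        return None
    token = (element.text or '').strip()
    return token or None


WRAP = '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListRecords>%s</ListRecords></OAI-PMH>'
with_token = WRAP % '<resumptionToken>abc123</resumptionToken>'
empty_token = WRAP % '<resumptionToken/>'
no_token = WRAP % ''

print(next_resumption_token(with_token),
      next_resumption_token(empty_token),
      next_resumption_token(no_token))
```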

Sickle throws an exception if resumption token was repeated

I have a problem harvesting an OAI repository. After the last valid record set is downloaded, the repository (DSpace-based) sends the same resumption token again. This makes Sickle send another request with no metadataPrefix, which causes an error on the repository side. Any idea how to fix this?

from sickle import Sickle
from sickle.iterator import OAIResponseIterator

sickle = Sickle('https://www.ssoar.info/OAIHandler/request', iterator=OAIResponseIterator)

for record_set in sickle.ListRecords(metadataPrefix='oai_genios', ignore_deleted=True):
    print(record_set)
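
One way a harvesting loop could guard against this, sketched with a fake server so it runs offline: stop as soon as a token that has already been used comes back. walk_pages and fake_fetch are hypothetical names; fetch_page stands for whatever function performs the request and returns the page together with the next token.

```python
def walk_pages(fetch_page, first_token=None):
    """Collect pages until the token is empty or one is seen a second time."""
    seen = set() if first_token is None else {first_token}
    token = first_token
    pages = []
    while True:
        page, next_token = fetch_page(token)
        pages.append(page)
        if not next_token or next_token in seen:
            break  # finished, or the server repeated a token
        seen.add(next_token)
        token = next_token
    return pages


# Fake server: the last page repeats the token that requested it.
FAKE_SERVER = {None: ('page 1', 't1'), 't1': ('page 2', 't2'), 't2': ('page 3', 't2')}


def fake_fetch(token):
    return FAKE_SERVER[token]


pages = walk_pages(fake_fetch)
print(pages)
```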

Recovering from BadResumptionToken?

Hey, thanks for Sickle! It's a great tool and it's saved me a lot of time in a current project.

I've been encountering issues, though, with a repository that keeps timing out mid-harvest, or going offline. I'm not sure, but I had to make small changes to catch an exception from Requests when this happened.

Catching that exception reveals a new problem; I get sickle.oaiexceptions.BadResumptionToken: The value of the resumptionToken argument is invalid or expired mid-harvest, and the harvest aborts.

I don't have much deep-knowledge on OAI-PMH, so I don't know if I can assume some things about the protocol. For example, I don't know if it's safe to do something like:

  1. Count previously harvested items
  2. Repeat harvest under same configuration, with offset equal to (1)

What is the idiomatic way to resolve this issue, if any?
Thanks!

Python 3 Compatibility

Hello!

I would like to use Sickle in a Python 3 environment. Is there any interest from somebody else in porting it? Has this been considered before?

Many thanks.

Sickle output

I easily installed and ran Sickle, but the data I get from the OAI-PMH endpoint is in a strange format which is neither a Python dictionary nor JSON, and I can't seem to parse it properly. I don't understand how I can set the output format (XML? JSON?), and I didn't find anything in the documentation.

Sickle retrieving partial data from collection

Hi there,

I have been attempting to extract metadata from this library with this code:

URL = 'https://jscholarship.library.jhu.edu/oai/request?set=col_1774.2_34121'
sickle = Sickle(URL)
records = sickle.ListRecords(metadataPrefix='oai_dc')

And it can return 6 out of 11 entries from this collection. I cannot figure out why the other 5 entries are missing. Any advice would be appreciated!

A question about Sickle

Dear Mr. Mathias Loesch,
I'm a researcher at the Agriculture Information Institute, Chinese Academy of Agriculture Sciences.
I have used your sickle package to harvest some data from an OAI server. The server requires a username and password, so I passed auth=("xxxxx","xxxx") when initializing Sickle.
Here's my code:
sickle=Sickle('http://xxx.xxx.xx/xxx/oaihandler',auth=('xxx','xx'))
records = sickle.ListRecords(metadataPrefix='oai_dc')

But when I run the program, I get this error:
sickle.oaiexceptions.OAIError : username is required parameter for this verb.
I searched the internet for a long time but could not find an answer, so I finally decided to send you this email.
Sorry for bothering you with this.
Thank you very much.
My mail address is: [email protected].

Best Regards
Cui

str/bytes from requests response interaction with lxml

Is there a reason that the utf-8 (str/unicode) text is being used in the XML property/method in response.py?

I'm finding lxml is having problems with unicode strings (bytes in py3) that include external entities or non-ascii/latin characters.

Here is an example record: there are right single quotes \u2019 embedded in there.

<record>
	<header>
		<identifier>oai:scholarship.law.duke.edu:dlj-3910</identifier>
		<datestamp>2017-10-04T15:03:25Z</datestamp>
		<setSpec>publication:journals</setSpec>
		<setSpec>publication:dlj</setSpec>
	</header>
	<metadata>
		<oai_dc:dc
			xmlns:oai_dc="https://www.openarchives.org/OAI/2.0/oai_dc/"
			xmlns:dc="http://purl.org/dc/elements/1.1/"
			xmlns:xsi="https://www.w3.org/2001/XMLSchema-instance"
			xsi:schemaLocation="https://www.openarchives.org/OAI/2.0/oai_dc/ https://www.openarchives.org/OAI/2.0/oai_dc.xsd">
			<dc:title>Rule 24 Notwithstanding: Why Article III Should Not Limit Intervention of Right</dc:title>
			<dc:creator>Ferguson, Zachary N.</dc:creator>
			<dc:description>The Supreme Court recently decided in Town of Chester v. Laroe Estates, Inc. that intervenors of right under Federal Rule of Civil Procedure 24(a)(2) must demonstrate independent Article III standing when they pursue relief different from that requested by an original plaintiff. This decision resolved, in part, a decades-long controversy among the Courts of Appeals over the proper relationship between Rule 24 intervention and Article III standing that the Court first acknowledged in Diamond v. Charles. But the Court’s narrow decision in Town of Chester hardly disposed of the controversy, and Courts of Appeals are still free to require standing of defendant-intervenors and, it stands to reason, plaintiff-intervenors even if they do not pursue different relief. With this debate yet unresolved, this Note takes a less conventional approach. In addition to arguing that the Supreme Court’s precedents implicitly resolved this question before Town of Chester, this Note argues that the nature of judicial decisions raises two concerns that a liberal application of Rule 24(a)(2) would mitigate. First, this Note argues that stare decisis limits the right of litigants to be heard on the merits of their claims and defenses in a way that undermines the principles of due process. Second, this Note argues that the process of judicial decisionmaking is fraught with potential epistemic problems that can produce suboptimal legal rules. After considering these two concerns, this Note argues that Rule 24(a)(2) is a better and more practical way to mitigate these problems than are Rule 24(a)(2)’s alternatives.</dc:description>
			<dc:date>2017-10-04T07:00:00Z</dc:date>
			<dc:type>text</dc:type>
			<dc:format>application/pdf</dc:format>
			<dc:identifier >http://scholarship.law.duke.edu/dlj/vol67/iss1/4</dc:identifier>
			<dc:identifier>http://scholarship.law.duke.edu/cgi/viewcontent.cgi?article=3910&#38;amp;context=dlj</dc:identifier>
			<dc:source>Duke Law Journal</dc:source>
			<dc:publisher>Duke University School of Law</dc:publisher>
			<dc:subject>Law</dc:subject>

		</oai_dc:dc>
	</metadata>
</record>

lxml applies an optimization in Python 2 where it sometimes returns a unicode object and other times a byte string (str); in Python 3 it always returns unicode. But for whatever reason, when you take the text from the requests response, encode it back to str/bytes, and then parse it with lxml.etree, some Unicode characters (and maybe external entities) are processed incorrectly.

In sickle/sickle/response.py:

    @property
    def xml(self):
        """The server's response as parsed XML."""
        return etree.XML(self.http_response.text.encode("utf8"),
                         parser=XMLParser)

I think your tests pass because they read from a file object that provides bytes directly, and they may not cover anything outside the ASCII range.

I think the simple fix, in sickle/sickle/response.py, is to parse the raw response content directly rather than re-encoding the text, which requests has already decoded and which then has to be encoded again before lxml.etree can handle it:

    @property
    def xml(self):
        """The server's response as parsed XML."""
        return etree.XML(self.http_response.content,
                         parser=XMLParser)
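To illustrate why parsing the raw content can behave differently from re-encoding the text, here is a minimal sketch using only the standard library. It rests on an assumption, not a confirmed diagnosis: that the HTTP client decoded the UTF-8 body as ISO-8859-1 (which can happen when the Content-Type header names no charset). The sample element and variable names are hypothetical, and xml.etree.ElementTree stands in for lxml.etree.

```python
import xml.etree.ElementTree as ET

# Raw bytes as a server might send them: UTF-8 encoded XML containing
# a non-ASCII character (a right single quotation mark, U+2019).
raw = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<title>Court\u2019s decision</title>'
).encode("utf-8")

# Assumption: the client guessed ISO-8859-1 instead of UTF-8 when
# decoding the body into text.
text = raw.decode("iso-8859-1")

# Re-encoding the mis-decoded text as UTF-8 bakes the wrong guess in.
from_text = ET.fromstring(text.encode("utf8")).text

# Parsing the original bytes lets the parser follow the XML declaration.
from_bytes = ET.fromstring(raw).text

print(repr(from_text))
print(repr(from_bytes))
```

Under that assumption, only the bytes-based parse preserves the quotation mark; the text-based parse yields mojibake in its place.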

Project stability and versioning?

It looks like this project is semantically versioned, but still on major release zero. I see encouraging recent activity, but just wanted to check if the versioning means the project is really in an alpha state.

Also wondering if you all might have a comment regarding a similar issue I posted on pyoai, which seems like it's no longer getting as much attention from the devs as sickle: infrae/pyoai#43

Is sickle aiming for a major release anytime soon? I'd love to have some guarantees regarding stability, and am always a bit reluctant about adding dependencies if they are 0.x.x. Thoughts?

None of this is meant as a criticism by the way, I really appreciate the work you guys have done, I just want to find out why there's not been a major release yet...

Thanks!

InsecurePlatformWarning for SSL connections

When using Sickle with an https:// (SSL) URL, I get the following warning:

[...]/python2.7/site-packages/requests/packages/urllib3/util/ssl_.py:90: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning.
  InsecurePlatformWarning

Any thoughts on how to fix this in Sickle?
