ozekik / lightrdf

A fast and lightweight Python RDF parser which wraps bindings to Rust's Rio using PyO3

License: Apache License 2.0

Python 56.23% Rust 43.77%

semantic-web python rdf rust pyo3 linked-data owl ntriples turtle

lightrdf's Introduction

LightRDF

A fast and lightweight Python RDF parser which wraps bindings to Rust's Rio using PyO3.

Features
Install
Basic Usage
- Load file objects / texts
Benchmark (WIP)
Alternatives
Todo
License

Features

Supports N-Triples, Turtle, and RDF/XML
Handles large-size RDF documents
Simple interfaces for parsing and searching triples
Supports regex in triple patterns

Install

pip install lightrdf

Basic Usage

Iterate over all triples

With Parser:

import lightrdf

parser = lightrdf.Parser()

for triple in parser.parse("./go.owl", base_iri=None):
    print(triple)

With RDFDocument:

import lightrdf

doc = lightrdf.RDFDocument("./go.owl")

# `None` matches any term
for triple in doc.search_triples(None, None, None):
    print(triple)

Search triples with a triple pattern

import lightrdf

doc = lightrdf.RDFDocument("./go.owl")

for triple in doc.search_triples("http://purl.obolibrary.org/obo/GO_0005840", None, None):
    print(triple)

# Output:
# ('<http://purl.obolibrary.org/obo/GO_0005840>', '<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>', '<http://www.w3.org/2002/07/owl#Class>')
# ('<http://purl.obolibrary.org/obo/GO_0005840>', '<http://www.w3.org/2000/01/rdf-schema#subClassOf>', '<http://purl.obolibrary.org/obo/GO_0043232>')
# ...
# ('<http://purl.obolibrary.org/obo/GO_0005840>', '<http://www.geneontology.org/formats/oboInOwl#inSubset>', '<http://purl.obolibrary.org/obo/go#goslim_yeast>')
# ('<http://purl.obolibrary.org/obo/GO_0005840>', '<http://www.w3.org/2000/01/rdf-schema#label>', '"ribosome"^^<http://www.w3.org/2001/XMLSchema#string>')

Search triples with a triple pattern (Regex)

import lightrdf
from lightrdf import Regex

doc = lightrdf.RDFDocument("./go.owl")

for triple in doc.search_triples(Regex(r"^<http://purl.obolibrary.org/obo/.*>$"), None, Regex(r".*amino[\w]+?transferase")):
    print(triple)

# Output:
# ('<http://purl.obolibrary.org/obo/GO_0003961>', '<http://www.w3.org/2000/01/rdf-schema#label>', '"O-acetylhomoserine aminocarboxypropyltransferase activity"^^<http://www.w3.org/2001/XMLSchema#string>')
# ('<http://purl.obolibrary.org/obo/GO_0004047>', '<http://www.geneontology.org/formats/oboInOwl#hasExactSynonym>', '"S-aminomethyldihydrolipoylprotein:(6S)-tetrahydrofolate aminomethyltransferase (ammonia-forming) activity"^^<http://www.w3.org/2001/XMLSchema#string>')
# ...
# ('<http://purl.obolibrary.org/obo/GO_0050447>', '<http://www.w3.org/2000/01/rdf-schema#label>', '"zeatin 9-aminocarboxyethyltransferase activity"^^<http://www.w3.org/2001/XMLSchema#string>')
# ('<http://purl.obolibrary.org/obo/GO_0050514>', '<http://www.geneontology.org/formats/oboInOwl#hasExactSynonym>', '"spermidine:putrescine 4-aminobutyltransferase (propane-1,3-diamine-forming)"^^<http://www.w3.org/2001/XMLSchema#string>')

Load file objects / texts

Load file objects with Parser:

import lightrdf

parser = lightrdf.Parser()

with open("./go.owl", "rb") as f:
    for triple in parser.parse(f, format="owl", base_iri=None):
        print(triple)

Load file objects with RDFDocument:

import lightrdf

with open("./go.owl", "rb") as f:
    doc = lightrdf.RDFDocument(f, parser=lightrdf.xml.PatternParser)

    for triple in doc.search_triples("http://purl.obolibrary.org/obo/GO_0005840", None, None):
        print(triple)

Load texts:

import io
import lightrdf

data = """<http://one.example/subject1> <http://one.example/predicate1> <http://one.example/object1> .
_:subject1 <http://an.example/predicate1> "object1" .
_:subject2 <http://an.example/predicate2> "object2" ."""

doc = lightrdf.RDFDocument(io.BytesIO(data.encode()), parser=lightrdf.turtle.PatternParser)

for triple in doc.search_triples("http://one.example/subject1", None, None):
    print(triple)

Benchmark (WIP)

On MacBook Air (13-inch, 2017), 1.8 GHz Intel Core i5, 8 GB 1600 MHz DDR3

https://gist.github.com/ozekik/b2ae3be0fcaa59670d4dd4759cdffbed

$ wget -q http://purl.obolibrary.org/obo/go.owl
$ gtime python3 count_triples_rdflib_graph.py ./go.owl  # RDFLib 4.2.2
1436427
235.29user 2.30system 3:59.56elapsed 99%CPU (0avgtext+0avgdata 1055816maxresident)k
0inputs+0outputs (283major+347896minor)pagefaults 0swaps
$ gtime python3 count_triples_lightrdf_rdfdocument.py ./go.owl  # LightRDF 0.1.1
1436427
7.90user 0.22system 0:08.27elapsed 98%CPU (0avgtext+0avgdata 163760maxresident)k
0inputs+0outputs (106major+41389minor)pagefaults 0swaps
$ gtime python3 count_triples_lightrdf_parser.py ./go.owl  # LightRDF 0.1.1
1436427
8.00user 0.24system 0:08.47elapsed 97%CPU (0avgtext+0avgdata 163748maxresident)k
0inputs+0outputs (106major+41388minor)pagefaults 0swaps

https://gist.github.com/ozekik/636a8fb521401070e02e010ce591fa92

$ wget -q http://downloads.dbpedia.org/2016-10/dbpedia_2016-10.nt
$ gtime python3 count_triples_rdflib_ntparser.py dbpedia_2016-10.nt  # RDFLib 4.2.2
31050
1.63user 0.23system 0:02.47elapsed 75%CPU (0avgtext+0avgdata 26568maxresident)k
0inputs+0outputs (1140major+6118minor)pagefaults 0swaps
$ gtime python3 count_triples_lightrdf_ntparser.py dbpedia_2016-10.nt  # LightRDF 0.1.1
31050
0.21user 0.04system 0:00.36elapsed 71%CPU (0avgtext+0avgdata 7628maxresident)k
0inputs+0outputs (534major+1925minor)pagefaults 0swaps
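
To make the go.owl figures above easier to compare, here is a small sketch that converts the `elapsed` fields to seconds and computes the ratios. The helper name `elapsed_seconds` is hypothetical, and the numbers are copied from the RDFLib and LightRDF RDFDocument runs above, not measured again.

```python
def elapsed_seconds(field):
    """Convert a GNU time elapsed field such as '3:59.56elapsed' to seconds."""
    seconds = 0.0
    for part in field.replace("elapsed", "").split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


# (elapsed field, maxresident in KB) copied from the go.owl runs above
rdflib_run = ("3:59.56elapsed", 1055816)
lightrdf_run = ("0:08.27elapsed", 163760)

speedup = elapsed_seconds(rdflib_run[0]) / elapsed_seconds(lightrdf_run[0])
memory_ratio = rdflib_run[1] / lightrdf_run[1]

print(f"wall-clock speedup: {speedup:.1f}x, peak memory ratio: {memory_ratio:.1f}x")
```

On this single machine and file, that works out to roughly 29 times less wall-clock time and roughly 6.4 times less peak resident memory for LightRDF. Since the benchmark is marked WIP, the ratios are best read as indicative only.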

Alternatives

RDFLib – (Pros) pure-Python, mature, feature-rich / (Cons) takes some time to load triples
pyHDT – (Pros) extremely fast and efficient / (Cons) requires pre-conversion into HDT

Todo

License

Rio and PyO3 are licensed under the Apache-2.0 license.

Copyright 2020 Kentaro Ozeki

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

lightrdf's Issues

Incorrect parsing

Hi @ozekik!

I found a bug when parsing. I used the generations.rdf file, but a similar bug appeared in many other files. For some reason the library recognizes this tag

<ns2:versionInfo rdf:datatype="http://www.w3.org/2001/XMLSchema#string">An example ontology created by Matthew Horridge</ns2:versionInfo>

as this string

'"An example ontology created by Matthew Horridge"^^<http://www.w3.org/2001/XMLSchema#string>'

in the last item of the triple (triple[-1]).

When using the rdflib library, I was not getting a similar problem.
Thanks.

expose trig support

looks like Rio can handle it, there's just no module for it in lightrdf

Add namespace support

It would be convenient to pass in a map of prefix-namespace expansions and have the search option return CURIEs where contraction is possible

While trivial to do in a python wrapper, it would be presumably faster to do at the rust level
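
The Python-side wrapper mentioned above could look roughly like the sketch below. `contract_term` and `PREFIXES` are hypothetical names and not part of LightRDF. The sketch assumes IRI terms arrive wrapped in angle brackets, as in the README output, and leaves every other term unchanged.

```python
# Hypothetical prefix map; any prefix-to-namespace dict would do.
PREFIXES = {
    "obo": "http://purl.obolibrary.org/obo/",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
}


def contract_term(term, prefix_map):
    """Return a CURIE for an '<IRI>' term when a namespace matches, else the term itself."""
    if not (term.startswith("<") and term.endswith(">")):
        return term  # literals and blank nodes are returned unchanged
    iri = term[1:-1]
    best = None
    for prefix, namespace in prefix_map.items():
        # Prefer the longest matching namespace so nested namespaces resolve sensibly.
        if iri.startswith(namespace) and (best is None or len(namespace) > len(best[1])):
            best = (prefix, namespace)
    if best is None:
        return term
    return f"{best[0]}:{iri[len(best[1]):]}"


triple = (
    "<http://purl.obolibrary.org/obo/GO_0005840>",
    "<http://www.w3.org/2000/01/rdf-schema#label>",
    '"ribosome"^^<http://www.w3.org/2001/XMLSchema#string>',
)
print(tuple(contract_term(t, PREFIXES) for t in triple))
```

In practice the function would be mapped over each tuple yielded by search_triples. Doing the same contraction at the Rust level, as the issue suggests, would avoid this per-term Python overhead.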

Rio libraries need updating to fix a very weird bug

When using LightRDF in the Ontology Development Kit, we have come across a very strange bug where LightRDF would fail to parse RDF/XML files that seem completely valid.

Here is a file that LightRDF fails to parse: https://github.com/INCATools/ontology-development-kit/files/10042121/tdm-bad.txt

(Sorry for the size of the file, but I was unable to reduce the error case to a minimal demonstrating example.)

Trying to parse that file with LightRDF as follows:

import sys
from lightrdf import Parser
parser = Parser()
try:
    for triple in parser.parse("tdm-bad.xml"):
        pass
except Exception as e:
    print(e)
    sys.exit(1)

yields the following error: Unexpected EOF during reading Comment.

I have no idea where the bug exactly is. However, rebuilding LightRDF after updating the Rio dependencies (rio_api, rio_turtle, and rio_xml) in Cargo.toml to their latest version (0.8.3) seems enough to fix it.

Providing a Linux-arm64 wheel

LightRDF is available as a wheel on PyPI for many combinations of systems and architectures (Windows/MacOS/Linux…, i686/x86_64/arm64…). Thanks for that!

However, one particular combination that is missing is Linux/arm64. Any chance it could be added?

Add support for parsing objects into literals vs URIs vs blank nodes

Currently the user has to parse the object to be able to do a lot of operations on it

This is relatively straightforward, I think:

^riog\d+$ is a blank node
Literals:
- ^"(.*)"^^<(\S+)>$ type
- ^"(.*)"@\w+$ language
- ^"(.*)"$ untyped
Otherwise a URI

But it might be nice to centralize this, or do it in rust for speed. To avoid the overhead of an OO interface how about a parallel search_statements with arguments s, p, o_uri, o_literal_value, o_datatype, o_lang?
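
As an illustration of the Python-side approach, the patterns listed above can be sketched as a single classifier. `classify_term` is a hypothetical helper, not a LightRDF function. Two details are assumptions added here and not taken from LightRDF itself: the optional `_:` prefix on blank nodes and the hyphen allowed in language tags.

```python
import re

# Patterns adapted from the list above; the optional "_:" prefix and the
# hyphen in language tags are assumptions, not confirmed LightRDF behavior.
BLANK = re.compile(r"^(?:_:)?riog\d+$")
TYPED = re.compile(r'^"(.*)"\^\^<(\S+)>$', re.DOTALL)
LANG = re.compile(r'^"(.*)"@([\w-]+)$', re.DOTALL)
PLAIN = re.compile(r'^"(.*)"$', re.DOTALL)


def classify_term(term):
    """Return (kind, value, datatype, lang) for a term string."""
    if BLANK.match(term):
        return ("bnode", term, None, None)
    match = TYPED.match(term)
    if match:
        return ("literal", match.group(1), match.group(2), None)
    match = LANG.match(term)
    if match:
        return ("literal", match.group(1), None, match.group(2))
    match = PLAIN.match(term)
    if match:
        return ("literal", match.group(1), None, None)
    # Otherwise treat the term as a URI, dropping surrounding angle brackets if present.
    if term.startswith("<") and term.endswith(">"):
        term = term[1:-1]
    return ("uri", term, None, None)


print(classify_term('"ribosome"^^<http://www.w3.org/2001/XMLSchema#string>'))
print(classify_term("<http://purl.obolibrary.org/obo/GO_0005840>"))
```

The order of the checks matters: the typed and language-tagged patterns are tried before the untyped one because they are more specific.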

This is my use case:

https://github.com/INCATools/rdf-sql-bulkloader

For now I am doing this in python

lightrdf.Error: error while parsing IRI '': No scheme found in an absolute IRI

Hi @ozekik!

Thank you for the awesome library! 👏

Unfortunately, while using your library, I got the error 🐛 mentioned in the title. 😞
But using rdflib I was not getting a similar error. 🤔

Environment

OS: Ubuntu 20.04
Python: 3.8.5
LightRDF: 0.2.1

Steps to reproduce.

Download pathways archive.

wget -q https://ftp.uniprot.org/pub/databases/uniprot/current_release/rdf/pathways.rdf.xz

Decompress it using the xz-utils package.

sudo apt install xz-utils
unxz pathways.rdf.xz

Run count_triples_lightrdf_parser.py.

python3 count_triples_lightrdf_parser.py pathways.rdf

Error log.

Traceback (most recent call last):
  File "count_triples_lightrdf_parser.py", line 8, in <module>
    for triple in parser.parse(sys.argv[1]):
lightrdf.Error: error while parsing IRI '': No scheme found in an absolute IRI

Please tell me where I am wrong. Thank you 🙏

lightrdf.Error: error while parsing IRI 'https://memory-alpha.fandom.com/wiki/USS_Voyager#Crew?oldid=2476206#Command_crew': Invalid IRI code point '#' on line 3135044 at position 92

First, I have to say nice job on the library. I really love the speed and simplicity of lightrdf, and it worked very well with ChEMBL_27, but I ran into an issue when I tried to read the Wikidata TTL file.

Details

wikidata's file latest-all.ttl

lightrdf.Error: error while parsing IRI 'https://memory-alpha.fandom.com/wiki/USS_Voyager#Crew?oldid=2476206#Command_crew': Invalid IRI code point '#' on line 3135044 at position 92

--- nearby lines from file, including problematic line, if my sed is correct

sed -n '3135042,3135046p;3135047q' latest-all.ttl

ref:8f38c16e1f141b68f172b65f48b0982234890e56 a wikibase:Reference ;
	pr:P854 <https://memory-alpha.fandom.com/wiki/USS_Voyager#Crew?oldid=2476206#Command_crew> ;
	pr:P813 "2020-03-12T00:00:00Z"^^xsd:dateTime ;
	prv:P813 v:ae909ef12942e232eea24326bdd78c8e ;

wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.gz
gunzip and parse the file

Note: these are big files. Another strength of lightrdf is its ability to process large files without requiring the entire data set to fit in RAM.


parse the file 'revisions_lang=en_uris.ttl'
For each .ttl in this dataset, I'm simply dumping the triples to a file like this:

parser = lightrdf.Parser()
triples = parser.parse(str(input_file), format="ttl", base_iri=None)
for (s, p, o) in triples:
    f_triples.write(f"{s}\t{p}\t{o}\n")


Other notes:
- If there is a way to use try/except to move past the offending line, I would like to know how that is done.
- Environment: Ubuntu 20.04, conda env, pip install lightrdf
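
One partial answer to the try/except question, as a sketch only: the loop below keeps every triple written before the first parse error and reports the error without crashing. It does not skip the offending line, because nothing in this thread confirms that the parser can resume after an error. `dump_triples` and `fake_triples` are hypothetical names, and the generic `Exception` catch stands in for the `lightrdf.Error` shown in the message above.

```python
import io


def dump_triples(triples, out):
    """Write triples as TSV until the iterator ends or raises; return (count, error)."""
    count = 0
    iterator = iter(triples)
    while True:
        try:
            s, p, o = next(iterator)
        except StopIteration:
            return count, None  # input exhausted normally
        except Exception as exc:  # with LightRDF this could be narrowed to lightrdf.Error
            return count, exc  # stop at the first parse error, keeping earlier output
        out.write(f"{s}\t{p}\t{o}\n")
        count += 1


def fake_triples():
    """Stand-in for parser.parse(...): yields one triple, then fails."""
    yield ("<http://one.example/s>", "<http://one.example/p>", '"o"')
    raise ValueError("Invalid IRI code point '#'")


buffer = io.StringIO()
written, error = dump_triples(fake_triples(), buffer)
print(written, error)
```

With a real file, `triples` would be the iterator returned by parser.parse and `out` the open output file.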

Serialize RDF

I was looking for a replacement for RDFLib that just parses, does some BGP searching, and writes the new triples back.
It seems that LightRDF can handle parsing and BGP searching, but not serializing the RDF triples again to a file.

Are there any plans for this?

Parse from String

Hello,

I am interested in using your library for fast parsing from Turtle to N-Triples.
However, as the current API only supports parsing from a file, I was wondering if it would be possible to extend the library to also parse string objects?

Thanks
Lars