ozekik / lightrdf

A fast and lightweight Python RDF parser which wraps bindings to Rust's Rio using PyO3

License: Apache License 2.0

Python 56.23% Rust 43.77%
semantic-web python rdf rust pyo3 linked-data owl ntriples turtle

lightrdf's Introduction

LightRDF

A fast and lightweight Python RDF parser which wraps bindings to Rust's Rio using PyO3.

Features

  • Supports N-Triples, Turtle, and RDF/XML
  • Handles large RDF documents
  • Simple interfaces for parsing and searching triples
  • Supports regex in triple patterns

Install

pip install lightrdf

Basic Usage

Iterate over all triples

With Parser:

import lightrdf

parser = lightrdf.Parser()

for triple in parser.parse("./go.owl", base_iri=None):
    print(triple)

With RDFDocument:

import lightrdf

doc = lightrdf.RDFDocument("./go.owl")

# `None` matches any term
for triple in doc.search_triples(None, None, None):
    print(triple)

Search triples with a triple pattern

import lightrdf

doc = lightrdf.RDFDocument("./go.owl")

for triple in doc.search_triples("http://purl.obolibrary.org/obo/GO_0005840", None, None):
    print(triple)

# Output:
# ('<http://purl.obolibrary.org/obo/GO_0005840>', '<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>', '<http://www.w3.org/2002/07/owl#Class>')
# ('<http://purl.obolibrary.org/obo/GO_0005840>', '<http://www.w3.org/2000/01/rdf-schema#subClassOf>', '<http://purl.obolibrary.org/obo/GO_0043232>')
# ...
# ('<http://purl.obolibrary.org/obo/GO_0005840>', '<http://www.geneontology.org/formats/oboInOwl#inSubset>', '<http://purl.obolibrary.org/obo/go#goslim_yeast>')
# ('<http://purl.obolibrary.org/obo/GO_0005840>', '<http://www.w3.org/2000/01/rdf-schema#label>', '"ribosome"^^<http://www.w3.org/2001/XMLSchema#string>')

Search triples with a triple pattern (Regex)

import lightrdf
from lightrdf import Regex

doc = lightrdf.RDFDocument("./go.owl")

for triple in doc.search_triples(Regex("^<http://purl.obolibrary.org/obo/.*>$"), None, Regex(r".*amino[\w]+?transferase")):
    print(triple)

# Output:
# ('<http://purl.obolibrary.org/obo/GO_0003961>', '<http://www.w3.org/2000/01/rdf-schema#label>', '"O-acetylhomoserine aminocarboxypropyltransferase activity"^^<http://www.w3.org/2001/XMLSchema#string>')
# ('<http://purl.obolibrary.org/obo/GO_0004047>', '<http://www.geneontology.org/formats/oboInOwl#hasExactSynonym>', '"S-aminomethyldihydrolipoylprotein:(6S)-tetrahydrofolate aminomethyltransferase (ammonia-forming) activity"^^<http://www.w3.org/2001/XMLSchema#string>')
# ...
# ('<http://purl.obolibrary.org/obo/GO_0050447>', '<http://www.w3.org/2000/01/rdf-schema#label>', '"zeatin 9-aminocarboxyethyltransferase activity"^^<http://www.w3.org/2001/XMLSchema#string>')
# ('<http://purl.obolibrary.org/obo/GO_0050514>', '<http://www.geneontology.org/formats/oboInOwl#hasExactSynonym>', '"spermidine:putrescine 4-aminobutyltransferase (propane-1,3-diamine-forming)"^^<http://www.w3.org/2001/XMLSchema#string>')

Load file objects / texts

Load file objects with Parser:

import lightrdf

parser = lightrdf.Parser()

with open("./go.owl", "rb") as f:
    for triple in parser.parse(f, format="owl", base_iri=None):
        print(triple)

Load file objects with RDFDocument:

import lightrdf

with open("./go.owl", "rb") as f:
    doc = lightrdf.RDFDocument(f, parser=lightrdf.xml.PatternParser)

    for triple in doc.search_triples("http://purl.obolibrary.org/obo/GO_0005840", None, None):
        print(triple)

Load texts:

import io
import lightrdf

data = """<http://one.example/subject1> <http://one.example/predicate1> <http://one.example/object1> .
_:subject1 <http://an.example/predicate1> "object1" .
_:subject2 <http://an.example/predicate2> "object2" ."""

doc = lightrdf.RDFDocument(io.BytesIO(data.encode()), parser=lightrdf.turtle.PatternParser)

for triple in doc.search_triples("http://one.example/subject1", None, None):
    print(triple)

Benchmark (WIP)

On MacBook Air (13-inch, 2017), 1.8 GHz Intel Core i5, 8 GB 1600 MHz DDR3

https://gist.github.com/ozekik/b2ae3be0fcaa59670d4dd4759cdffbed

$ wget -q http://purl.obolibrary.org/obo/go.owl
$ gtime python3 count_triples_rdflib_graph.py ./go.owl  # RDFLib 4.2.2
1436427
235.29user 2.30system 3:59.56elapsed 99%CPU (0avgtext+0avgdata 1055816maxresident)k
0inputs+0outputs (283major+347896minor)pagefaults 0swaps
$ gtime python3 count_triples_lightrdf_rdfdocument.py ./go.owl  # LightRDF 0.1.1
1436427
7.90user 0.22system 0:08.27elapsed 98%CPU (0avgtext+0avgdata 163760maxresident)k
0inputs+0outputs (106major+41389minor)pagefaults 0swaps
$ gtime python3 count_triples_lightrdf_parser.py ./go.owl  # LightRDF 0.1.1
1436427
8.00user 0.24system 0:08.47elapsed 97%CPU (0avgtext+0avgdata 163748maxresident)k
0inputs+0outputs (106major+41388minor)pagefaults 0swaps

https://gist.github.com/ozekik/636a8fb521401070e02e010ce591fa92

$ wget -q http://downloads.dbpedia.org/2016-10/dbpedia_2016-10.nt
$ gtime python3 count_triples_rdflib_ntparser.py dbpedia_2016-10.nt  # RDFLib 4.2.2
31050
1.63user 0.23system 0:02.47elapsed 75%CPU (0avgtext+0avgdata 26568maxresident)k
0inputs+0outputs (1140major+6118minor)pagefaults 0swaps
$ gtime python3 count_triples_lightrdf_ntparser.py dbpedia_2016-10.nt  # LightRDF 0.1.1
31050
0.21user 0.04system 0:00.36elapsed 71%CPU (0avgtext+0avgdata 7628maxresident)k
0inputs+0outputs (534major+1925minor)pagefaults 0swaps

Alternatives

  • RDFLib – (Pros) pure-Python, mature, feature-rich / (Cons) takes some time to load triples
  • pyHDT – (Pros) extremely fast and efficient / (Cons) requires pre-conversion into HDT

Todo

  • Push to PyPI
  • Adopt CI
  • Handle Base IRI
  • Add basic tests
  • Switch to maturin-action from cibuildwheel
  • Support NQuads and TriG
  • Add docs
  • Add tests for w3c/rdf-tests
  • Resume on error
  • Allow opening fp
  • Support and test on RDF-star

License

Rio and PyO3 are licensed under the Apache-2.0 license.

Copyright 2020 Kentaro Ozeki

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

lightrdf's People

Contributors

ozekik

Forkers

eggplants sciumo

lightrdf's Issues

Incorrect parsing

Hi @ozekik!

I found a parsing bug. I noticed it with the generations.rdf file, but a similar bug appeared in many other files. For some reason the library renders this tag

<ns2:versionInfo rdf:datatype="http://www.w3.org/2001/XMLSchema#string">An example ontology created by Matthew Horridge</ns2:versionInfo>

as this string

'"An example ontology created by Matthew Horridge"^^<http://www.w3.org/2001/XMLSchema#string>'

in the last item of the triple (triple[-1]).

I did not get a similar problem when using the rdflib library.
Thanks.
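
The string reported above has the same shape as the literals in the README output, where a literal keeps its quotes and its `^^<datatype>` suffix. A minimal sketch for recovering only the text of such a term follows. `lexical_form` is a hypothetical helper, not part of LightRDF, and it assumes the term is formatted as in the README examples; it does not unescape backslash sequences inside the literal.

```python
def lexical_form(term):
    """Return the text between the quotes of a literal term.

    Any ^^<datatype> or @lang suffix is dropped. Terms that do not start
    with a double quote (IRIs, blank nodes) are returned unchanged.
    """
    if not term.startswith('"'):
        return term
    # The closing quote is the last double quote in the term, because
    # datatype IRIs and language tags cannot contain one.
    end = term.rfind('"')
    return term[1:end] if end > 0 else term
```

Applied to `triple[-1]` from the report, this would give the plain sentence without the quotes and datatype.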

Add namespace support

It would be convenient to pass in a map of prefix-namespace expansions and have the search option return CURIEs where contraction is possible.

While this is trivial to do in a Python wrapper, it would presumably be faster at the Rust level.
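
A minimal sketch of the Python-wrapper approach, assuming terms arrive as the strings shown in the README output (IRIs wrapped in `<...>`). The names `PREFIXES`, `contract`, and `contract_triples` are hypothetical and not part of LightRDF, and the sketch does not check that the resulting local part is a valid CURIE suffix.

```python
# Hypothetical prefix map; not part of LightRDF.
PREFIXES = {
    "obo": "http://purl.obolibrary.org/obo/",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "owl": "http://www.w3.org/2002/07/owl#",
}


def contract(term, prefixes=PREFIXES):
    """Return a CURIE for an IRI term if a namespace matches, else the term unchanged."""
    if not (term.startswith("<") and term.endswith(">")):
        return term  # literals and blank nodes are left as-is
    iri = term[1:-1]
    # Prefer the longest matching namespace so nested namespaces contract correctly.
    best = None
    for prefix, namespace in prefixes.items():
        if iri.startswith(namespace) and (best is None or len(namespace) > len(best[1])):
            best = (prefix, namespace)
    if best is None:
        return term
    return f"{best[0]}:{iri[len(best[1]):]}"


def contract_triples(triples, prefixes=PREFIXES):
    """Lazily contract every term of each (s, p, o) triple."""
    for s, p, o in triples:
        yield contract(s, prefixes), contract(p, prefixes), contract(o, prefixes)
```

Because `contract_triples` is a generator, it could wrap the iterator returned by a search without loading all triples into memory.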

Rio libraries need updating to fix a very weird bug

When using LightRDF in the Ontology Development Kit, we have come across a very strange bug where LightRDF would fail to parse RDF/XML files that seem completely valid.

Here is a file that LightRDF fails to parse: https://github.com/INCATools/ontology-development-kit/files/10042121/tdm-bad.txt

(Sorry for the size of the file, but I was unable to reduce the error case to a minimal demonstrating example.)

Trying to parse that file with LightRDF as follows:

import sys
from lightrdf import Parser
parser = Parser()
try:
    for triple in parser.parse("tdm-bad.xml"):
        pass
except Exception as e:
    print(e)
    sys.exit(1)

yields the following error: Unexpected EOF during reading Comment.

I have no idea where the bug exactly is. However, rebuilding LightRDF after updating the Rio dependencies (rio_api, rio_turtle, and rio_xml) in Cargo.toml to their latest version (0.8.3) seems enough to fix it.

Providing a Linux-arm64 wheel

LightRDF is available as a wheel on PyPI for many combinations of systems and architectures (Windows/macOS/Linux…, i686/x86_64/arm64…). Thanks for that!

However, one particular combination that is missing is Linux/arm64. Any chance it could be added?

Add support for parsing objects into literals vs URIs vs blank nodes

Currently the user has to parse the object string before doing many kinds of operations on it.

This is relatively straightforward, I think:

  1. ^riog\d+$ is a blank node
  2. Literals:
    • ^"(.*)"\^\^<(\S+)>$ typed
    • ^"(.*)"@\w+$ language
    • ^"(.*)"$ untyped
  3. Otherwise a URI

But it might be nice to centralize this, or do it in Rust for speed. To avoid the overhead of an OO interface, how about a parallel search_statements with arguments s, p, o_uri, o_literal_value, o_datatype, o_lang?
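
The classification rules listed above can be sketched in Python as follows. This is only an illustration under stated assumptions: `Term` and `parse_term` are hypothetical names, the `riog<digits>` blank-node form is taken from this issue rather than from LightRDF documentation, the language-tag pattern is widened to allow hyphens (as in `en-US`), and escape sequences inside literals are not decoded.

```python
import re
from typing import NamedTuple, Optional


class Term(NamedTuple):
    kind: str                        # "blank", "literal", or "iri"
    value: str
    datatype: Optional[str] = None
    lang: Optional[str] = None


# Blank-node label form as described in this issue (an assumption).
BLANK_RE = re.compile(r"^riog\d+$")
# The literal "^^" must be escaped, since "^" is a regex anchor.
TYPED_RE = re.compile(r'^"(.*)"\^\^<(\S+)>$', re.DOTALL)
LANG_RE = re.compile(r'^"(.*)"@([A-Za-z0-9-]+)$', re.DOTALL)
PLAIN_RE = re.compile(r'^"(.*)"$', re.DOTALL)


def parse_term(term: str) -> Term:
    """Classify a term string as a blank node, a literal, or an IRI."""
    if BLANK_RE.match(term):
        return Term("blank", term)
    m = TYPED_RE.match(term)
    if m:
        return Term("literal", m.group(1), datatype=m.group(2))
    m = LANG_RE.match(term)
    if m:
        return Term("literal", m.group(1), lang=m.group(2))
    m = PLAIN_RE.match(term)
    if m:
        return Term("literal", m.group(1))
    # Otherwise an IRI; drop the angle brackets if present.
    if term.startswith("<") and term.endswith(">"):
        return Term("iri", term[1:-1])
    return Term("iri", term)
```

Returning a plain tuple subclass keeps the per-term overhead low while still giving named access to `value`, `datatype`, and `lang`.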

This is my use case:

https://github.com/INCATools/rdf-sql-bulkloader

For now I am doing this in Python.

lightrdf.Error: error while parsing IRI '': No scheme found in an absolute IRI

Hi @ozekik!

Thank you for the awesome library! 👏

Unfortunately, while using your library, I got the error 🐛 mentioned in the title. 😞
But using rdflib I was not getting a similar error. 🤔

Environment

  • OS: Ubuntu 20.04
  • Python: 3.8.5
  • LightRDF: 0.2.1

Steps to reproduce.

  1. Download the pathways archive.
wget -q https://ftp.uniprot.org/pub/databases/uniprot/current_release/rdf/pathways.rdf.xz
  2. Decompress it using the xz-utils package.
sudo apt install xz-utils
unxz pathways.rdf.xz
  3. Run count_triples_lightrdf_parser.py.
python3 count_triples_lightrdf_parser.py pathways.rdf
  4. Error log.
Traceback (most recent call last):
  File "count_triples_lightrdf_parser.py", line 8, in <module>
    for triple in parser.parse(sys.argv[1]):
lightrdf.Error: error while parsing IRI '': No scheme found in an absolute IRI

Please tell me where I am wrong. Thank you 🙏

lightrdf.Error: error while parsing IRI 'https://memory-alpha.fandom.com/wiki/USS_Voyager#Crew?oldid=2476206#Command_crew': Invalid IRI code point '#' on line 3135044 at position 92

First, nice job on the library. I really love the speed and simplicity of lightrdf, and it worked very well with ChEMBL_27, but I ran into an issue when I tried to read the Wikidata ttl.

Details

wikidata's file latest-all.ttl

lightrdf.Error: error while parsing IRI 'https://memory-alpha.fandom.com/wiki/USS_Voyager#Crew?oldid=2476206#Command_crew': Invalid IRI code point '#' on line 3135044 at position 92

Nearby lines from the file, including the problematic line (if my sed command is correct):

sed -n '3135042,3135046p;3135047q' latest-all.ttl

ref:8f38c16e1f141b68f172b65f48b0982234890e56 a wikibase:Reference ;
	pr:P854 <https://memory-alpha.fandom.com/wiki/USS_Voyager#Crew?oldid=2476206#Command_crew> ;
	pr:P813 "2020-03-12T00:00:00Z"^^xsd:dateTime ;
	prv:P813 v:ae909ef12942e232eea24326bdd78c8e ;

wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.gz
Then gunzip and parse the file.

Note: these are big files. Another strength of lightrdf is its ability to process large files without requiring the entire data set to fit in RAM.
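
The reported IRI contains two `#` characters, which IRI syntax does not allow, since a fragment cannot itself contain `#`. A small sketch for locating such IRIs in a large Turtle file before parsing is shown below. `find_double_hash_iris` is a hypothetical helper, not part of lightrdf, and the token pattern is a rough heuristic: string literals that happen to contain angle brackets may produce false positives.

```python
import re

# Contents of <...> tokens without whitespace or nested angle brackets.
IRI_TOKEN = re.compile(r"<([^<>\s]*)>")


def find_double_hash_iris(lines):
    """Yield (line_number, iri) for every IRI token containing more than one '#'."""
    for number, line in enumerate(lines, start=1):
        for iri in IRI_TOKEN.findall(line):
            if iri.count("#") > 1:
                yield number, iri
```

Passing an open text file as `lines` streams it line by line, so the whole dump never has to fit in RAM.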


Parse the file 'revisions_lang=en_uris.ttl'.
For each .ttl file in this dataset, I'm simply dumping the triples to a file like this:

import lightrdf

# input_file and f_triples (an open text file) are defined elsewhere
parser = lightrdf.Parser()
triples = parser.parse(str(input_file), format="ttl", base_iri=None)
for (s, p, o) in triples:
    f_triples.write(f"{s}\t{p}\t{o}\n")

Other notes:
- If there is a way to use try/except to move past the offending line, I would like to know how that is done.
- Ubuntu 20.04, conda env, installed with pip install lightrdf
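
A hedged sketch of the try/except idea: it cannot skip the bad line, because whether the parser can continue after raising is not documented here (the README Todo still lists "Resume on error"). What it can do is keep every triple read before the failure and report the exception instead of crashing. `drain` is a hypothetical helper, not part of lightrdf.

```python
def drain(triples, handle):
    """Feed each triple to `handle` until the iterator is exhausted or raises.

    Returns (count, error): the number of triples handled and the exception
    that stopped iteration, or None if the input ended normally.
    """
    count = 0
    iterator = iter(triples)
    while True:
        try:
            triple = next(iterator)
        except StopIteration:
            return count, None
        except Exception as exc:  # e.g. the lightrdf.Error shown in this issue
            return count, exc
        # Errors raised by `handle` itself are deliberately not swallowed.
        handle(triple)
        count += 1
```

With the dumping loop above, `handle` could be a small function that writes the tab-separated line, and the returned error message gives the line number to inspect.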


Serialize RDF

I was looking for a replacement for RDFLib that only needs to parse, do some BGP searching, and write the new triples back.
It seems that LightRDF can handle parsing and BGP searching, but not serializing the RDF triples again to a file.

Are there any plans for this?
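
Until something like this exists in the library, a rough workaround could write the triples out as N-Triples by hand. The sketch below assumes the term strings are already in N-Triples form, as the README output suggests (IRIs in `<...>`, literals quoted with their datatype), and that any other bare label is a blank node needing a `_:` prefix. Both are assumptions, and `nt_term` and `write_ntriples` are hypothetical helpers, not part of LightRDF. Literal escaping is passed through unchecked.

```python
def nt_term(term):
    """Return the term as an N-Triples token."""
    if term.startswith(("<", '"', "_:")):
        return term
    return "_:" + term  # assumption: a bare label is a blank node


def write_ntriples(triples, out):
    """Write one statement per line to the text stream `out`; return the count."""
    count = 0
    for s, p, o in triples:
        out.write(f"{nt_term(s)} {nt_term(p)} {nt_term(o)} .\n")
        count += 1
    return count
```

Since the function only iterates and writes, it works with a search result or a parser iterator without holding the triples in memory.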

Parse from String

Hello,

I am interested in using your library for fast parsing from Turtle to N-Triples.
However, as the current API only supports parsing from a file, I was wondering if it would be possible to extend the library to also parse string objects?

Thanks
Lars
