delb-xml / snakesist

A Python database interface for eXist-db

Home Page: https://snakesist.readthedocs.io

License: MIT License

Python 99.26% Dockerfile 0.74%
exist-db xml python database-adapter database-interface

snakesist's Introduction

snakesist

snakesist is a Python database interface for eXist-db. It supports basic CRUD operations and uses delb for representing the yielded resources.

pip install snakesist

snakesist allows you to access individual documents from the database using a delb.Document object, either by simply passing a URL:

>>> from delb import Document

>>> manifest = Document("existdb://admin:@localhost:8080/exist/db/manifestos/dada_manifest.xml")
>>> [header.full_text for header in manifest.xpath("//head")]
["Hugo Ball", "Das erste dadaistische Manifest"]

or by passing a relative path to the document along with a database client, which you can subsequently reuse:

>>> from snakesist import ExistClient

>>> my_local_db = ExistClient(host="localhost", port=8080, user="admin", password="", root_collection="/db/manifestos")
>>> dada_manifest = Document("dada_manifest.xml", existdb_client=my_local_db)
>>> [header.full_text for header in dada_manifest.xpath("//head")]
["Hugo Ball", "Das erste dadaistische Manifest"]
>>> communist_manifest = Document("communist_manifest.xml", existdb_client=my_local_db)
>>> communist_manifest.xpath("//head").first.full_text
"Manifest der Kommunistischen Partei"

The client can be used not only for accessing individual documents, but also for querying data across multiple documents:

>>> all_headers = my_local_db.xpath("//*:head")
>>> [header.node.full_text for header in all_headers]
["Hugo Ball", "Das erste dadaistische Manifest", "Manifest der Kommunistischen Partei", "I. Bourgeois und Proletarier.", "II. Proletarier und Kommunisten", "III. Sozialistische und kommunistische Literatur", "IV. Stellung der Kommunisten zu den verschiedenen oppositionellen Parteien"]

You can of course also modify and store documents back into the database or create new ones and store them.

Your eXist instance

snakesist leverages the eXist RESTful API for database queries. This means that the eXist-db backend must allow database queries via POST requests on the RESTful API. eXist allows this by default, so if you haven't configured your instance otherwise, don't worry about it.

We aim to directly support the most recent release from each major branch. Yet, there's no guarantee that releases older than two years will be kept as a target for tests. Please refer to the values of jobs/tests/matrix/exist-version in the CI's configuration file for what's currently considered.

snakesist's People

Contributors

03b8, dependabot[bot], funkyfuture, jkatzwinkel

Forkers

jkatzwinkel

snakesist's Issues

Add CITATION.cff

now, with one step closer to domination, the next strategic milestone will be to add a CITATION.cff to be clearly identified as serious ding. once delb-xml/delb-py#34 is established.

Make client's connection properties immutable?

afaict there'd be no use-case for changing the connection properties of a client once it has been initialized. hence i propose to make them read-only. also, and that was the impulse, the base_url property would then need to be assembled only once.
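
A minimal sketch of what read-only connection properties with a base URL that is assembled once could look like. The class name `ConnectionProps`, its fields, and the URL layout are assumptions made for illustration and are not taken from snakesist's actual code.

```python
from dataclasses import FrozenInstanceError, dataclass
from functools import cached_property


@dataclass(frozen=True)
class ConnectionProps:
    """Hypothetical stand-in for a client's connection properties."""

    host: str = "localhost"
    port: int = 8080
    user: str = "admin"
    password: str = ""
    prefix: str = "exist"

    @cached_property
    def base_url(self) -> str:
        # Computed on first access only; cached_property stores the result
        # in the instance's __dict__, which a frozen dataclass still permits.
        return (
            f"http://{self.user}:{self.password}"
            f"@{self.host}:{self.port}/{self.prefix}"
        )


props = ConnectionProps(port=8443)
print(props.base_url)

try:
    props.port = 8080  # frozen dataclasses reject attribute assignment
except FrozenInstanceError as error:
    print(f"rejected: {error!r}")
```

Since the fields cannot change after initialization, the cached URL can never become stale, which is the point of combining both ideas.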

on the matter of the transport protocol

it crossed my mind today that one could also talk to an eXist instance via http. and the database fixture does that.

i see two options in this regard:

  1. support only https
  2. extend the designated url schema with the transport protocols to existdb+http:// and existdb+https://.

usually i tend to 1), but i can think of some use-cases that'd make 2) preferable.

i just realized that the exist client is http only so far.
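
For illustration, option 2 could be parsed roughly as follows with the standard library. The function name `split_existdb_url` and the choice to treat a bare existdb:// URL as https are assumptions, not decisions made in this thread.

```python
from urllib.parse import urlsplit

# Assumption: a URL without an explicit transport falls back to https.
DEFAULT_TRANSPORT = "https"


def split_existdb_url(url: str) -> dict:
    """Split an existdb[+transport]:// URL into its connection parts."""
    parts = urlsplit(url)
    # "existdb+http" -> ("existdb", "+", "http"); "existdb" -> ("existdb", "", "")
    scheme, _, transport = parts.scheme.partition("+")
    if scheme != "existdb":
        raise ValueError(f"not an existdb URL: {url!r}")
    transport = transport or DEFAULT_TRANSPORT
    if transport not in ("http", "https"):
        raise ValueError(f"unsupported transport: {transport!r}")
    return {
        "transport": transport,
        "user": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port,
        "path": parts.path,
    }


print(split_existdb_url("existdb+http://admin:@localhost:8080/exist/db/manifestos/dada_manifest.xml"))
```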

Sample xar file? Or: What the foo is going on?

i want to achieve the following:

  • use one of the available pytest fixtures that control services w/ Docker-Compose to provide an eXist-db instance for tests
  • (at this point) only sample collections should be added to the bare exist-db image or deployed container
    • best case simply by just mounting folders (repr. collections) into a container

i have my troubles figuring out how that exist-storage works under the hood. the container image seems to shadow most binaries that one could use for inspection (shell, ls, …). but the image configuration wouldn't indicate that, PATH is properly populated. also, searching for added files (via web gui) in the host's filesystem doesn't yield any result. so, i have doubts that a simple solution is possible.

the image's readme only mentions some autodeployment w/ xar-files (which is a format for applications, isn't it?). if that would be feasible to get a collection w/ sample documents into the storage, could you provide me one?

meanwhile i'll study possible pytest fixtures.

Prepare minor release

Since the recent changes will break compatibility, I propose that the next release be a minor one (0.2.0). Here are some points which should be addressed before releasing 0.2.0:

  • Review documentation, including the Readme
  • Check if eXist 5 compatibility is still a problem (and update the Readme accordingly if that's not the case)
    I ran the tests with different eXist container images and it's as expected: it looks like the current release (5.2.0) is still incompatible while the upcoming release of eXist (5.3.0) will no longer have the issue.
  • Upgrade to Python 3.8 (postponed)
  • Include some user friendly how-to pages to complement the API docs (optional)
  • Use tox (optional; not a great idea atm)
  • Add a black check to the CI config (optional)
  • Add a test coverage check to the CI config and maybe a coverage badge to the Readme (optional)
  • possibly remove type: ignore from delb imports when its 0.2 release is available

The list is open for new proposals and change suggestions.

Cover newer eXist-db versions in tests?

ciao @JKatzwinkel, are you still interested in adding what the summary says? together w/ #30 that'd make a nice release.

i'd find it useful to have a policy that lays out which eXist-db should be supported, but i'm completely unaware what their maintenance policy actually is.

Move away from poetry

...because of the reasons laid out here.

Not sure yet what we'll replace it with, but I'll take inspiration from delb in that regard.

Performance bottleneck

I detected a performance bottleneck that appeared with the restructuring and refactoring done between 0.1.0-b1 and 0.1.0. I ran an example query through the profiler, and 0.1.0 (6.566 s) is about 13 times slower than 0.1.0-b1 (0.505 s). Here's an interesting section of the profiling report:

0.1.0-b1

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.001    0.001 nodes.py:1050(TagNode)
     2207    0.004    0.000    0.016    0.000 nodes.py:1113(__init__)
     1752    0.003    0.000    0.014    0.000 nodes.py:1149(__getitem__)
      584    0.001    0.000    0.048    0.000 nodes.py:1170(__len__)
      584    0.001    0.000    0.047    0.000 nodes.py:1171()
     4373    0.007    0.000    0.056    0.000 nodes.py:1241(child_nodes)
     6432    0.002    0.000    0.008    0.000 nodes.py:1257()
        1    0.000    0.000    0.007    0.007 nodes.py:1293(css_select)
        1    0.001    0.001    0.007    0.007 nodes.py:1490(xpath)
      585    0.000    0.000    0.000    0.000 nodes.py:1569()
        1    0.000    0.000    0.006    0.006 nodes.py:1577()
        1    0.000    0.000    0.000    0.000 nodes.py:1583(TextNode)
        1    0.000    0.000    0.000    0.000 nodes.py:161(_TagDefinition)
     4491    0.006    0.000    0.008    0.000 nodes.py:1610(__init__)
        1    0.000    0.000    0.008    0.008 nodes.py:17()
     2284    0.002    0.000    0.003    0.000 nodes.py:1898(_exists)
     1583    0.004    0.000    0.022    0.000 nodes.py:1942(next_node)
     2132    0.001    0.000    0.002    0.000 nodes.py:1970()
     1076    0.001    0.000    0.008    0.000 nodes.py:1997(__next_candidate_of_tail)
     5396    0.002    0.000    0.011    0.000 nodes.py:279(_is_tag_or_text_node)
        2    0.000    0.000    0.000    0.000 nodes.py:286(altered_default_filters)
        1    0.000    0.000    0.002    0.002 nodes.py:324(NodeBase)
     6775    0.001    0.000    0.001    0.000 nodes.py:327(__init__)
       44    0.000    0.000    0.000    0.000 nodes.py:62()
     2284    0.005    0.000    0.023    0.000 nodes.py:71(_get_or_create_element_wrapper)
        1    0.000    0.000    0.001    0.001 nodes.py:739(_ChildLessNode)
        1    0.000    0.000    0.000    0.000 nodes.py:785(_ElementWrappingNode)
     2284    0.004    0.000    0.009    0.000 nodes.py:786(__init__)
1116/1078    0.002    0.000    0.009    0.000 nodes.py:901(next_node)
     2228    0.001    0.000    0.005    0.000 nodes.py:916()
        2    0.000    0.000    0.000    0.000 nodes.py:921(parent)
        1    0.000    0.000    0.000    0.000 nodes.py:930(previous_node)
        1    0.000    0.000    0.000    0.000 nodes.py:968(CommentNode)
        1    0.000    0.000    0.000    0.000 nodes.py:998(ProcessingInstructionNode)

0.1.0

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.001    0.001 nodes.py:1050(TagNode)
     8838    0.018    0.000    0.072    0.000 nodes.py:1113(__init__)
     1752    0.005    0.000    0.006    0.000 nodes.py:1149(__getitem__)
     1168    0.005    0.000    4.150    0.004 nodes.py:1170(__len__)
     1168    0.142    0.000    4.145    0.004 nodes.py:1171()
788898/707319    0.807    0.000    4.383    0.000 nodes.py:1241(child_nodes)
   705567    0.168    0.000    0.169    0.000 nodes.py:1257()
        1    0.000    0.000    0.007    0.007 nodes.py:1293(css_select)
     1168    0.003    0.000    4.168    0.004 nodes.py:1309(document)
      584    0.001    0.000    0.022    0.000 nodes.py:1317(first_child)
        1    0.001    0.001    0.007    0.007 nodes.py:1490(xpath)
     1168    0.118    0.000    5.512    0.005 nodes.py:150(_prune_wrapper_cache)
      585    0.000    0.000    0.000    0.000 nodes.py:1569()
     1168    0.981    0.001    1.181    0.001 nodes.py:157()
        1    0.000    0.000    0.006    0.006 nodes.py:1577()
        1    0.000    0.000    0.000    0.000 nodes.py:1583(TextNode)
        1    0.000    0.000    0.000    0.000 nodes.py:161(_TagDefinition)
    17926    0.029    0.000    0.036    0.000 nodes.py:1610(__init__)
        1    0.000    0.000    0.008    0.008 nodes.py:17()
   700621    0.532    0.000    0.650    0.000 nodes.py:1898(_exists)
    14368    0.033    0.000    0.166    0.000 nodes.py:1942(next_node)
     7745    0.003    0.000    0.003    0.000 nodes.py:1970()
     7792    0.009    0.000    0.050    0.000 nodes.py:1997(__next_candidate_of_tail)
     1168    0.001    0.000    0.002    0.000 nodes.py:2129(is_root_node)
      584    0.000    0.000    0.001    0.000 nodes.py:279(_is_tag_or_text_node)
     2338    0.001    0.000    0.001    0.000 nodes.py:286(altered_default_filters)
        1    0.000    0.000    0.001    0.001 nodes.py:324(NodeBase)
    27014    0.006    0.000    0.006    0.000 nodes.py:327(__init__)
     2336    0.004    0.000    4.161    0.002 nodes.py:386(ancestors)
     2336    0.001    0.000    0.003    0.000 nodes.py:395()
       44    0.000    0.000    0.000    0.000 nodes.py:62()
   694704    0.383    0.000    0.714    0.000 nodes.py:71(_get_or_create_element_wrapper)
        1    0.000    0.000    0.001    0.001 nodes.py:739(_ChildLessNode)
        1    0.000    0.000    0.000    0.000 nodes.py:785(_ElementWrappingNode)
     9088    0.017    0.000    0.040    0.000 nodes.py:786(__init__)
      584    0.015    0.000    5.907    0.010 nodes.py:859(detach)
   690032    1.251    0.000    2.917    0.000 nodes.py:901(next_node)
   688861    0.187    0.000    0.187    0.000 nodes.py:916()
     5258    0.006    0.000    0.010    0.000 nodes.py:921(parent)
        1    0.000    0.000    0.000    0.000 nodes.py:930(previous_node)
        1    0.000    0.000    0.000    0.000 nodes.py:968(CommentNode)
        1    0.000    0.000    0.000    0.000 nodes.py:998(ProcessingInstructionNode)
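
Reports in this format can be produced with the standard library's cProfile and pstats modules. The following is a sketch only: `profile_call` and `busy` are hypothetical names, and the original measurements may have been taken differently.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, restriction=None):
    """Run func under cProfile and return the textual report."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        func(*args)
    finally:
        profiler.disable()
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # "stdname" sorts by the filename:lineno(function) column.
    stats.sort_stats("stdname")
    if restriction is None:
        stats.print_stats()
    else:
        # The restriction is a regular expression matched against the
        # filename:lineno(function) column, e.g. "nodes.py".
        stats.print_stats(restriction)
    return stream.getvalue()


def busy(n):
    """Placeholder workload standing in for the example query."""
    return sum(i * i for i in range(n))


print(profile_call(busy, 10_000, restriction="busy"))
```

Running the same workload against both package versions and comparing the ncalls and cumtime columns is what exposes call-count explosions such as the one in child_nodes above.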

Restrict ExistClient to low-level operations, move the rest to Resource

as @funkyfuture said:

i think it may be clearer if the specific logics (like assembling the queries) for each resource type are moved to the Resource subclasses, leaving only the lower level operations that communicate with the database in the client. if that makes sense and is implemented, it might also be clearer to have a staticmethod Resource.from_query_result that yields either a Node… or DocumentResource.

Avoid making database queries via query params in `GET` requests

Currently snakesist makes database queries by passing them in a query param of a GET request. This is quite an odd way to pass data to the db and also limits the length of the query that we pass to the database. This means that querying something containing some lengthy text will likely fail because of this. I will try to confirm this hypothesis in a test first and put it in a PR draft.

The solution to this issue is to use POST requests and pass the query in the request body as documented here. I'm not sure how well this works, so it might turn out to be a bad idea, but I think it's worth a try.
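
As a rough sketch of the proposed approach, the request body for a POST query can be assembled with the standard library. The `query` and `text` elements and the `start` and `max` attributes follow the eXist REST documentation as far as I recall it, so treat them as assumptions to verify. `build_query_payload` is a hypothetical helper, and actually sending the request is left out.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote

EXIST_NS = "http://exist.sourceforge.net/NS/exist"
# Serialize the eXist namespace as the default namespace (no prefix).
ET.register_namespace("", EXIST_NS)


def build_query_payload(xquery: str, start: int = 1, max_items: int = 20) -> bytes:
    """Wrap an XQuery expression in a <query> document for a POST body."""
    query = ET.Element(
        f"{{{EXIST_NS}}}query",
        {"start": str(start), "max": str(max_items)},
    )
    text = ET.SubElement(query, f"{{{EXIST_NS}}}text")
    # ElementTree escapes characters like & and < on serialization.
    text.text = xquery
    return ET.tostring(query, encoding="utf-8")


xquery = '//*:head[contains(., "Manifest")]'
# Percent-encoding for a GET query parameter makes the expression longer,
# while a POST body is not subject to URL length limits.
print(len(xquery), len(quote(xquery)))
print(build_query_payload(xquery).decode("utf-8"))
```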

Proposal: Distinguish documents and nodes

if i understood correctly, exist can return both whole documents and nodes from such. as delb offers distinct classes for these, i propose to reflect these circumstances by introducing a DocumentResource and a NodeResource that are based on Resource and just differ by their content's type.

i guess i figured correctly that this is distinguishable in exist's results by the presence of a node id attribute. atm i'm just unclear about ExistClient.retrieve_resources: this always yields nodes as the method's input data is an xpath expression, doesn't it?
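
A sketch of how the proposed split might look, assuming that a query result exposes its attributes as a mapping. The key names "path" and "nodeId" are hypothetical placeholders, not the attribute names eXist actually returns, and the field layout is likewise invented for illustration.

```python
from dataclasses import dataclass
from typing import Any, Mapping

NODE_ID_KEY = "nodeId"  # hypothetical attribute name


@dataclass
class Resource:
    document_path: str
    content: Any

    @staticmethod
    def from_query_result(attributes: Mapping[str, str], content: Any) -> "Resource":
        """Pick the resource type based on the presence of a node id."""
        path = attributes["path"]
        node_id = attributes.get(NODE_ID_KEY)
        if node_id:
            return NodeResource(path, content, node_id)
        return DocumentResource(path, content)


@dataclass
class DocumentResource(Resource):
    """A resource whose content is a whole document."""


@dataclass
class NodeResource(Resource):
    """A resource whose content is a single node within a document."""

    node_id: str


print(Resource.from_query_result({"path": "/db/a.xml"}, "<root/>"))
print(Resource.from_query_result({"path": "/db/a.xml", "nodeId": "1.2"}, "<head/>"))
```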

How shall missing ids be represented in Resource?

after i fixed delb to be considered by static type checkers, i found some errors i committed before. in the course of correcting i came to wonder how missing / empty values shall be represented in Resource, as None or as empty string? and what is to be expected in the response of the api, are these attributes missing or empty strings? as i have no experience with exist's rest api, i'm certainly not the one to reason about this.
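
One possible convention, sketched here purely for discussion, is to fold both cases (attribute missing, attribute present but empty) into None so that callers and type checkers only have to deal with Optional[str]. The helper name `read_id` and the key names are hypothetical.

```python
from typing import Mapping, Optional


def read_id(attributes: Mapping[str, str], key: str) -> Optional[str]:
    """Return the id stored under key, or None if it is missing or empty."""
    value = attributes.get(key)
    # Both a missing key (None) and an empty string are falsy.
    return value if value else None


print(read_id({"nodeId": "1.2"}, "nodeId"))
print(read_id({"nodeId": ""}, "nodeId"))
print(read_id({}, "nodeId"))
```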

Proposal to rename the 'master' branch to 'main'

please note that i am posting the following text / issue description to various projects that i'm (considering myself to be) significantly involved with. in fact it is about a general issue that is not specific to this project. but too often we just focus on the nitty-gritty details of design and implementations whilst operating within, supplying for, and depending upon much broader and complex technological, social, economic and ecological relationships.

the torture that resulted in the death of George Floyd in this year's May intensified antiracist movements and debates about colonial heritage that hasn't been overcome (or even compensated for) yet. it also initiated discussions about terminology used in technological contexts, its etymology, and its link to the aforementioned ideologies and practices of discrimination. though circumstances aren't homogeneous across societies where technological terminology is used, one must acknowledge that the context in which this terminology is evolving is American English, which reflects and manifests specific inequalities based on 'race' in the United States of America. thus, the connotations that are inherent to that language cannot be ignored elsewhere.

i'd have every understanding for anyone who would hesitate to contribute to this project because of language used that reproduces bullshit discrimination. i therefore propose to rename this project's git branch, from master to main. i'd have some imo more interesting, better-fitting alternatives to propose, but main is pragmatic because of its adoption in the Linux kernel VCS (and probable further adoptions that will reflect this), as well as the stable use of auto-completion in a shell.

please refer to this proposal for an RFC to establish an inclusive language within the "tech community", this discussion on the git-related etymology of the master term and this meanwhile accepted patch and related debate that prompted the change in the Linux kernel VCS. as the web is the web, you'll easily find more resources on the topic, possibly in your preferred language.

due to a lack of time on my side, i foresee this change taking place over this year's autumn in repositories where i'm authorized to do so. please consider that as a timeframe for feedback. i'm open to critical arguments on why we should refrain from that change, but trolls will be blocked right away where i have the privilege to do so. for GitHub hosted repositories there's this relevant piece of information.

Invalid badge in README.rst

after replacing the CI with GitHub Actions, the badge https://travis-ci.org/delb-xml/snakesist became completely obsolete. @JKatzwinkel do you know a replacement off the top of your head?
