xml4h's Introduction

xml4h: XML for Humans in Python

xml4h is an MIT licensed library for Python to make it easier to work with XML.

This library exists because Python is awesome, XML is everywhere, and combining the two should be a pleasure but often is not. With xml4h, it can be easy.

As of version 1.0, xml4h supports Python 2.7 and 3.5+.

Features

xml4h is a simplification layer over existing Python XML processing libraries such as lxml, ElementTree and the minidom. It provides:

  • a rich pythonic API to traverse and manipulate the XML DOM.
  • a document builder to simply and safely construct complex documents with minimal code.
  • a writer that serialises XML documents with the structure and format that you expect, unlike the machine- but not human-friendly output you tend to get from other libraries.

The xml4h abstraction layer also offers some other benefits, beyond a nice API and tool set:

  • A common interface to different underlying XML libraries, so code written against xml4h need not be rewritten if you switch implementations.
  • You can easily move between xml4h and the underlying implementation: parse your document using the fastest implementation, manipulate the DOM with human-friendly code using xml4h, then get back to the underlying implementation if you need to.

Installation

Install xml4h with pip:

$ pip install xml4h

Or install the tarball manually with:

$ python setup.py install

Introduction

With xml4h you can easily parse XML files and access their data.

Let's start with an example XML document:

$ cat tests/data/monty_python_films.xml
<MontyPythonFilms source="http://en.wikipedia.org/wiki/Monty_Python">
    <Film year="1971">
        <Title>And Now for Something Completely Different</Title>
        <Description>
            A collection of sketches from the first and second TV series of
            Monty Python's Flying Circus purposely re-enacted and shot for film.
        </Description>
    </Film>
    <Film year="1974">
        <Title>Monty Python and the Holy Grail</Title>
        <Description>
            King Arthur and his knights embark on a low-budget search for
            the Holy Grail, encountering humorous obstacles along the way.
            Some of these turned into standalone sketches.
        </Description>
    </Film>
    <Film year="1979">
        <Title>Monty Python's Life of Brian</Title>
        <Description>
            Brian is born on the first Christmas, in the stable next to
            Jesus'. He spends his life being mistaken for a messiah.
        </Description>
    </Film>
    <... more Film elements here ...>
</MontyPythonFilms>

With xml4h you can parse the XML file and use "magical" element and attribute lookups to read data:

>>> import xml4h
>>> doc = xml4h.parse('tests/data/monty_python_films.xml')

>>> for film in doc.MontyPythonFilms.Film[:3]:
...     print(film['year'] + ' : ' + film.Title.text)
1971 : And Now for Something Completely Different
1974 : Monty Python and the Holy Grail
1979 : Monty Python's Life of Brian

You can also use more explicit (non-magical) methods to traverse the DOM:

>>> for film in doc.child('MontyPythonFilms').children('Film')[:3]:
...     print(film.attributes['year'] + ' : ' + film.children.first.text)
1971 : And Now for Something Completely Different
1974 : Monty Python and the Holy Grail
1979 : Monty Python's Life of Brian
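
To illustrate how "magical" lookups of this kind can work in principle, here is a minimal sketch built on the standard library's ElementTree. The MagicNode class is hypothetical and is not xml4h's implementation; it only shows that Python's __getattr__ and __getitem__ hooks are enough to turn film.Title and film['year'] into child-element and attribute lookups.

```python
import xml.etree.ElementTree as ET

XML = """
<MontyPythonFilms source="http://en.wikipedia.org/wiki/Monty_Python">
    <Film year="1971"><Title>And Now for Something Completely Different</Title></Film>
    <Film year="1974"><Title>Monty Python and the Holy Grail</Title></Film>
</MontyPythonFilms>
"""


class MagicNode:
    """Hypothetical wrapper; not part of xml4h or the standard library."""

    def __init__(self, element):
        self._element = element

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails. Private names are
        # rejected early so a missing _element can never cause recursion.
        if name.startswith('_'):
            raise AttributeError(name)
        matches = [MagicNode(child) for child in self._element if child.tag == name]
        if not matches:
            raise AttributeError(name)
        # One match behaves like a node, several behave like a list.
        return matches[0] if len(matches) == 1 else matches

    def __getitem__(self, attr_name):
        # node['year'] reads an XML attribute.
        return self._element.attrib[attr_name]

    @property
    def text(self):
        return self._element.text


root = MagicNode(ET.fromstring(XML))
lines = [film['year'] + ' : ' + film.Title.text for film in root.Film]
print("\n".join(lines))
```

The explicit child() and children() style shown above avoids the ambiguity this sketch has when a tag name collides with a Python attribute name.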

The xml4h builder makes programmatic document creation simple, with a method-chaining feature that allows for expressive but sparse code that mirrors the document itself. Here is the code to build part of the above XML document:

>>> b = (xml4h.build('MontyPythonFilms')
...     .attributes({'source': 'http://en.wikipedia.org/wiki/Monty_Python'})
...     .element('Film')
...         .attributes({'year': 1971})
...         .element('Title')
...             .text('And Now for Something Completely Different')
...             .up()
...         .elem('Description').t(
...             "A collection of sketches from the first and second TV"
...             " series of Monty Python's Flying Circus purposely"
...             " re-enacted and shot for film."
...             ).up()
...         .up()
...     )

>>> # A builder object can be re-used, and has short method aliases
>>> b = (b.e('Film')
...     .attrs(year=1974)
...     .e('Title').t('Monty Python and the Holy Grail').up()
...     .e('Description').t(
...         "King Arthur and his knights embark on a low-budget search"
...         " for the Holy Grail, encountering humorous obstacles along"
...         " the way. Some of these turned into standalone sketches."
...         ).up()
...     .up()
... )
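
The method-chaining idea itself is simple to sketch. The following is a minimal illustration using the standard library's ElementTree, not xml4h's builder: the ChainBuilder class and its tostring() method are hypothetical, and only the element/attributes/text/up method names are borrowed from the example above. Each method returns a builder, and up() returns the parent's builder, which is what lets the code's indentation mirror the document.

```python
import xml.etree.ElementTree as ET


class ChainBuilder:
    """Hypothetical method-chaining builder; not xml4h's implementation."""

    def __init__(self, element, parent=None):
        self._element = element
        self._parent = parent

    def element(self, name):
        # Create a child and return a builder positioned on that child.
        child = ET.SubElement(self._element, name)
        return ChainBuilder(child, parent=self)

    def attributes(self, mapping):
        # XML attribute values are strings, so coerce e.g. 1974 to "1974".
        for key, value in mapping.items():
            self._element.set(key, str(value))
        return self

    def text(self, value):
        self._element.text = value
        return self

    def up(self):
        # Step back to the parent builder; the root stays where it is.
        return self._parent if self._parent is not None else self

    def tostring(self):
        return ET.tostring(self._element, encoding="unicode")


b = (ChainBuilder(ET.Element("MontyPythonFilms"))
     .element("Film")
         .attributes({"year": 1974})
         .element("Title")
             .text("Monty Python and the Holy Grail")
             .up()
         .up()
     )
xml_text = b.tostring()
print(xml_text)
```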

Pretty-print your XML document with xml4h's writer implementation, which has methods to write content to a stream or to return it as text, with flexible formatting options:

>>> print(b.xml_doc(indent=4, newline=True)) # doctest: +ELLIPSIS
<?xml version="1.0" encoding="utf-8"?>
<MontyPythonFilms source="http://en.wikipedia.org/wiki/Monty_Python">
    <Film year="1971">
        <Title>And Now for Something Completely Different</Title>
        <Description>A collection of sketches from ...</Description>
    </Film>
    <Film year="1974">
        <Title>Monty Python and the Holy Grail</Title>
        <Description>King Arthur and his knights embark ...</Description>
    </Film>
</MontyPythonFilms>
<BLANKLINE>

Why use xml4h?

Python has three popular libraries for working with XML, none of which are particularly easy to use:

  • xml.dom.minidom is a light-weight, moderately-featured implementation of the W3C DOM that is included in the standard library. Unfortunately the W3C DOM API is verbose, clumsy, and not very pythonic, and the minidom does not support XPath expressions.
  • xml.etree.ElementTree is a fast hierarchical data container that is included in the standard library and can be used to represent XML, mostly. The API is fairly pythonic and supports some basic XPath features, but it lacks some DOM traversal niceties you might expect (e.g. to get an element's parent), and when using it you often feel like you're working with something subtly different from XML, because you are.
  • lxml is a fast, full-featured XML library with an API based on ElementTree but extended. It is your best choice for doing serious work with XML in Python but it is not included in the standard library, it can be difficult to install, and it gives you the same it's-XML-but-not-quite feeling as its ElementTree forebear.
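
As a rough illustration of the difference in feel between the two standard-library options, the sketch below reads the same data with each. The inline XML is a cut-down stand-in for the films document, and lxml is left out so that the example runs without extra installs.

```python
import xml.dom.minidom
import xml.etree.ElementTree as ET

XML = (
    '<MontyPythonFilms>'
    '<Film year="1971"><Title>And Now for Something Completely Different</Title></Film>'
    '<Film year="1974"><Title>Monty Python and the Holy Grail</Title></Film>'
    '</MontyPythonFilms>'
)

# minidom: W3C DOM calls; an element's text lives in a separate child Text node.
dom = xml.dom.minidom.parseString(XML)
dom_result = []
for film in dom.documentElement.getElementsByTagName('Film'):
    title = film.getElementsByTagName('Title')[0]
    dom_result.append((film.getAttribute('year'), title.firstChild.data))

# ElementTree: shorter, but elements hold no reference to their parent.
root = ET.fromstring(XML)
et_result = [(film.get('year'), film.findtext('Title')) for film in root.findall('Film')]

print(dom_result == et_result)
```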

Given these three options it can be difficult to choose which library to use, especially if you're new to XML processing in Python and haven't already used (struggled with) any of them.

In the past your best bet would have been to go with lxml for the most flexibility, even though it might be overkill, because at least then you wouldn't have to rewrite your code if you later found you needed XPath support or powerful DOM traversal methods.

This is where xml4h comes in. It provides an abstraction layer over the existing XML libraries, taking advantage of their power while offering an improved API and tool set.

Development Status: beta

Currently xml4h includes adapter implementations for three of the main XML processing Python libraries.

If lxml is available (highly recommended), xml4h will use it; otherwise it will fall back to (c)ElementTree, and then to minidom.
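
A preference order like this can be sketched with importlib. This is not how xml4h selects its adapter internally; the first_available function and the CANDIDATES tuple are hypothetical, and the module names are simply the usual import names of the three libraries.

```python
import importlib.util

# Preference order described above: lxml, then ElementTree, then minidom.
CANDIDATES = ("lxml", "xml.etree.ElementTree", "xml.dom.minidom")


def first_available(candidates=CANDIDATES):
    """Return the first importable module name, or None if none is found."""
    for name in candidates:
        try:
            if importlib.util.find_spec(name) is not None:
                return name
        except ModuleNotFoundError:
            # Raised when a parent package of a dotted name is missing.
            continue
    return None


chosen = first_available()
print(chosen)
```

On a machine without lxml installed this picks the ElementTree module, since both standard-library options are normally present.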

History

1.0

  • Add support for Python 3 (3.5+).
  • Drop support for Python versions before 2.7.
  • Fix node namespace prefix values for lxml adapter.
  • Improve builder's up() method to accept and distinguish between a count of parents to step up, or the name of a target ancestor node.
  • Add xml() and xml_doc() methods to document builder to more easily get string content from it, without resorting to the write methods.
  • The write() and write_doc() methods no longer send output to sys.stdout by default. The user must explicitly provide a target writer object, and hopefully be more mindful of the need to set up encoding correctly when providing a text stream object.
  • Handling of redundant Element namespace prefixes is now more consistent: we always strip the prefix when the element has an xmlns attribute defining the same namespace URI.

0.2.0

  • Add adapter for the (c)ElementTree library versions included as standard with Python 2.7+.
  • Improved "magical" node traversal to work with lowercase tag names without always needing a trailing underscore. See also improved docs.
  • Fixes for: potential errors ASCII-encoding nodes as strings; default XPath namespace from document node; lookup precedence of xmlns attributes.

0.1.0

  • Initial alpha release with support for lxml and minidom libraries.

xml4h's Issues

Add (c)ElementTree Adapter

Add adapter to use the standard library's (c)ElementTree XML library.

This will lead to the following benefits without requiring installation of the extra lxml library:

  • faster XML processing (if user has C underpinnings installed)
  • make XPath available to users in almost all cases

Contact

Hi team,

I've contacted you via email, please let me know if you've received my email.

Thanks

Richer NodeList interactions to add/remove nodes directly (?)

Think about improving the NodeList implementation for children, entities, notations, etc. to allow simple and intuitive interactions with lists of nodes, such as adding or removing nodes directly via the NodeList and having the corresponding changes made to the underlying DOM.

Fix underscore auto-prefix alias to work on Document node

The underscore auto-alias xml4h provides to make it easier to find default-namespace elements in a document breaks if used on the top-level Document node, because it doesn't know how to find the default namespace from there.

XPath on Document node:

>>> d.xpath('//_:coverage')
Traceback (most recent call last):
. . .
lxml.etree.XPathEvalError: Undefined namespace prefix

Same XPath on root Element node:

>>> d.root.xpath('//_:coverage')
[<xml4h.nodes.Element: "coverage">, <xml4h.nodes.Element: "coverage">]
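
The underlying rule is that a prefix in an XPath expression only resolves if something binds it to a namespace URI. A minimal sketch of that rule using the standard library's ElementTree rather than xml4h follows; the document and the urn:example:coverage URI are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with a default namespace; the URI is invented.
XML = (
    '<report xmlns="urn:example:coverage">'
    '<coverage>first</coverage>'
    '<section><coverage>second</coverage></section>'
    '</report>'
)
root = ET.fromstring(XML)

# Unprefixed names in a path do not pick up the document's default namespace.
unqualified = root.findall('.//coverage')

# Binding a prefix of our choosing (here "_") to the namespace URI makes the
# lookup explicit.
ns = {'_': 'urn:example:coverage'}
qualified = root.findall('.//_:coverage', ns)
texts = [el.text for el in qualified]
print(len(unqualified), texts)
```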

Add SAX parser helpers

Provide tools or helpers of some kind to make it very easy to perform SAX parsing.

Need to figure out exactly what this feature's goals should be, and crib examples from existing libraries.

write_doc only substitutes attributes partially

Hello,

I have two sets of XMLs that describe the same content (articles in an academic journal) with differing schema versions and different content that I was hoping to merge with the help of your xml4h library. I tried starting out by replacing the attributes of the root tag:

<doi_batch xmlns="http://www.crossref.org/schema/4.3.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="4.3.0" xsi:schemaLocation="http://www.crossref.org/schema/4.3.0 http://www.crossref.org/schema/deposit/crossref4.3.0.xsd">

should be replaced with:

<doi_batch xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.crossref.org/schema/4.4.2 https://www.crossref.org/schemas/crossref4.4.2.xsd" xmlns="http://www.crossref.org/schema/4.4.2" xmlns:jats="http://www.ncbi.nlm.nih.gov/JATS1" xmlns:fr="http://www.crossref.org/fundref.xsd" xmlns:ai="http://www.crossref.org/AccessIndicators.xsd" version="4.4.2">

Somehow, when applying the script below, the attributes all update, except the xmlns link, which keeps referring to: "http://www.crossref.org/schema/4.3.0" instead of "http://www.crossref.org/schema/4.4.2".

Could I ask for your help in assessing whether we're approaching this the wrong way or it's part of the way the library functions?

Kind regards, and thanks for the help,
Erwin

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Thu Apr 23 15:41:16 2020

@authors: erwin, Philo
"""

import xml4h

attrsWanted = {
    'xmlns:xsi': "http://www.w3.org/2001/XMLSchema-instance",
    'xsi:schemaLocation': "http://www.crossref.org/schema/4.4.2 https://www.crossref.org/schemas/crossref4.4.2.xsd",
    'xmlns': "http://www.crossref.org/schema/4.4.2",
    'xmlns:jats': "http://www.ncbi.nlm.nih.gov/JATS1",
    'xmlns:fr': "http://www.crossref.org/fundref.xsd",
    'xmlns:ai': "http://www.crossref.org/AccessIndicators.xsd",
    'version': "4.4.2"
}

# Parse an XML document
doc = xml4h.parse('crossref-view-issue04_test4.xml')

# Find an Element node of interest
doi_batch = doc.doi_batch

print("\n🌱  before change:")
for item in doi_batch.attributes.items():
    print(item)

print("\n💧  wanted change:")
for item in attrsWanted.items():
    print(item)

doi_batch.attributes = attrsWanted
print("\n🌳  after change:")
for item in doi_batch.attributes.items():
    print(item)


# Write to a file
with open('crossref-view-issue04_test4-updated.xml', 'wb') as f:
    doi_batch.write_doc(f)
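
One possible factor, offered as an assumption rather than a confirmed diagnosis of xml4h: in ElementTree-style libraries the default namespace is not stored as an ordinary attribute but is folded into each element's name, so replacing the attribute set does not move elements to a new namespace. The sketch below demonstrates this with the standard library's ElementTree only; the cut-down document and its head child are placeholders, not the real Crossref file.

```python
import xml.etree.ElementTree as ET

OLD_NS = "http://www.crossref.org/schema/4.3.0"
NEW_NS = "http://www.crossref.org/schema/4.4.2"

# Cut-down stand-in for the real file; <head> is only a placeholder child.
XML = '<doi_batch xmlns="%s" version="4.3.0"><head/></doi_batch>' % OLD_NS
root = ET.fromstring(XML)

# The parser folds xmlns into each tag name instead of keeping an attribute.
tag_before = root.tag
attrs_before = dict(root.attrib)

# Ordinary attributes update as expected...
root.set("version", "4.4.2")

# ...but moving to a new namespace means renaming every namespaced element.
for element in root.iter():
    if element.tag.startswith("{%s}" % OLD_NS):
        element.tag = "{%s}%s" % (NEW_NS, element.tag.split("}", 1)[1])

ET.register_namespace("", NEW_NS)  # serialise NEW_NS as the default namespace
output = ET.tostring(root, encoding="unicode")
print(output)
```

If xml4h's adapters behave similarly, the elements themselves (including all descendants) would need to be re-created or renamed in the new namespace, rather than only the root's attributes being replaced.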
