
hochfrequenz / kohlrahbi


An Anwendungshandbücher (AHB) scraper that extracts tables from docx files

Home Page: https://pypi.org/project/kohlrahbi/

License: GNU General Public License v3.0

Python 100.00%
ahb anwendungshandbuch bdew energiewirtschaft

kohlrahbi's Introduction

KohlrAHBi


Kohlrahbi generates machine-readable files from AHB documents. Kohlrahbi's sister is MIG_mose.

Rationale

German utilities exchange data using EDIFACT; this is called market communication (mako). The Forum Datenformate of the BDEW publishes the technical regulations of the EDIFACT-based market communication on edi-energy.de. These rules are not stable: they change twice a year (in theory) or a few times per year (in reality).

Specific rules that are binding for every German utility are more or less formalised in so-called "Anwendungshandbücher" (AHB). These AHBs are basically long tables that describe:

As a utility, if I want to exchange data about business process XYZ with a market partner, then I have to provide the following information: [...]

In total the regulations from these Anwendungshandbücher span several thousand pages. And by pages, we really mean pages. EDIFACT communication is basically the API between German utilities for most of their B2B processes. However, the technical specifications of this API are

  • prose
  • on DIN A4 pages.

The Anwendungshandbücher are the epitome of digitization with some good intentions.

Although the AHBs are publicly available as PDF or Word files on edi-energy.de, they are hardly accessible in a technical sense:

  • You cannot automatically extract information from the AHBs.
  • You cannot run automatic comparisons between different versions.
  • You cannot automatically test your own API against the set of rules described in the AHBs (as prose).
  • You cannot view or visualize the information from the AHBs in any way that is more intuitive or practical than the raw tables from the AHB files.
  • ...and many more...

The root cause of all this inaccessibility is a technical one: information that is theoretically structured is published in an unstructured format (PDF or Word), which is not suited for technical specifications in IT.

KohlrAHBi helps you break those chains and access the AHBs the way you'd expect from technical specs: easily and automatically instead of through hours of mindless manual work.

KohlrAHBi takes the .docx files published by edi-energy.de as an input and returns truly machine-readable data in a variety of formats (JSON, CSV...) as a result.

Hence, KohlrAHBi is the key for unlocking any automation potential that is reliant on information hidden in the Anwendungshandbücher.

We're all hoping for the day of true digitization on which this repository will become obsolete.

Installation

Kohlrahbi is a Python-based tool. Therefore, you have to make sure that Python is installed on your machine.

We recommend using virtual environments to keep your system clean.

Create a new virtual environment with

python -m venv .venv

How you activate the virtual environment depends on your operating system.

Windows

.venv\Scripts\activate

MacOS/Linux

source .venv/bin/activate

Finally, install the package with

pip install kohlrahbi

Usage

Kohlrahbi is a command line tool. You can use it in three different ways:

  1. Extract AHB tables for all prüfidentifikatoren or a specific prüfidentifikator of a provided format version.
  2. Extract all conditions for each format of a provided format version.
  3. Extract the change history of a provided format version.

You can run the following command to get an overview of all available commands and options.

kohlrahbi --help

Note

For the following steps we assume that you cloned our edi_energy_mirror to a neighbouring directory. The edi_energy_mirror contains the .docx files of the AHBs. The folder structure should look like this:

.
├── edi_energy_mirror
└── kohlrahbi

Extract AHB table

To extract all AHB tables for each pruefi of a specific format version, you can run the following command.

kohlrahbi ahb --edi-energy-mirror-path ../edi_energy_mirror/ --output-path ./output/ --file-type csv --format-version FV2310

To extract the AHB tables for a specific pruefi of a specific format version, you can run the following command.

kohlrahbi ahb -eemp ../edi_energy_mirror/ --output-path ./output/ --file-type csv --pruefis 13002 --format-version FV2310

You can also provide multiple pruefis.

kohlrahbi ahb -eemp ../edi_energy_mirror/ --output-path ./output/ --file-type csv --pruefis 13002 --pruefis 13003 --pruefis 13005 --format-version FV2310

And you can also provide multiple file types.

kohlrahbi ahb -eemp ../edi_energy_mirror/ --output-path ./output/ --file-type csv --file-type xlsx --file-type flatahb --pruefis 13002 --format-version FV2310

Extract all conditions

To extract all conditions for each format of a specific format version, you can run the following command.

kohlrahbi conditions -eemp ../edi_energy_mirror/ --output-path ./output/ --format-version FV2310

This will provide you with:

  • all conditions
  • all packages

found in all AHBs (including the condition texts from package tables) within the specified folder with the .docx files. The output will be saved for each EDIFACT format separately as conditions.json and packages.json in the specified output path. Please note that the information regarding the conditions collected here may be more comprehensive than the information collected for the AHBs above. This is because conditions uses a different routine than ahb.

Extract change history

kohlrahbi changehistory -eemp ../edi_energy_mirror/ --output-path ./output/ --format-version FV2310

.docx Data Sources

kohlrahbi internally relies on a specific naming schema for the .docx files, in which the file name holds information about the EDIFACT format and the validity period of the AHBs contained in the file. The easiest way to comply with this naming schema is to clone our edi_energy_mirror repository to your local machine.
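
To illustrate what such a naming schema carries, here is a minimal sketch that splits a file name into its leading format token and its two trailing date stamps. The pattern is inferred only from the example file names quoted on this page, not from kohlrahbi's code. `parse_ahb_file_name` and `AhbFileNameParts` are hypothetical names, and which of the two dates marks the start or the end of the validity period is deliberately left open.

```python
import re
from typing import NamedTuple, Optional

# Assumed pattern (inferred from example file names, not from kohlrahbi itself):
# <FORMAT>AHB<free text>_<8 digits>_<8 digits>.docx
_FILE_NAME_PATTERN = re.compile(
    r"^(?P<edifact_format>[A-Z]+?)AHB.*_(?P<first_date>\d{8})_(?P<second_date>\d{8})\.docx$"
)


class AhbFileNameParts(NamedTuple):
    """Hypothetical container for the pieces encoded in a file name."""

    edifact_format: str
    first_date: str  # first trailing YYYYMMDD stamp, meaning not interpreted here
    second_date: str  # second trailing YYYYMMDD stamp, meaning not interpreted here


def parse_ahb_file_name(file_name: str) -> Optional[AhbFileNameParts]:
    """Return the parts of a file name, or None if it does not match the assumed pattern."""
    match = _FILE_NAME_PATTERN.match(file_name)
    if match is None:
        return None
    return AhbFileNameParts(
        match["edifact_format"], match["first_date"], match["second_date"]
    )
```

Files that do not follow the assumed pattern yield `None`, so a caller can skip them instead of failing.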

Results

There is a kohlrahbi-based CI pipeline from the edi_energy_mirror mentioned above to the repository machine-readable_anwendungshandbuecher, where you can find scraped AHBs as JSON, CSV or Excel files.

Workflow

flowchart TB
    S[Start] --> RD[Read docx]
    RD --> RPT[Read all paragraphs <br> and tables]
    RPT --> I[Start iterating]
    I --> NI[Read next item]
    %% check for text paragraph %%
    NI --> CTP{Text Paragraph?}
    CTP -- Yes --> NI
    CTP -- No --> CCST{Is item just<br>Chapter or Section Title?}
    CCST -- Yes --> CTAenderunghistorie{Is Chapter Title<br>'Änderungshistorie'?}
    CTAenderunghistorie -- Yes --> EXPORT[Export Extract]
    CCST -- No --> CT{Is item a table<br>with prüfis?}
    CT -- Yes --> Extract[Create Extract]
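
The loop in the flowchart can be sketched in plain Python as follows. This is not kohlrahbi's implementation: `Paragraph`, `Table` and `collect_pruefi_tables` are hypothetical stand-ins for the docx objects, and the branches the flowchart leaves open (other headings, tables without prüfis) are assumed to continue with the next item.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Union


@dataclass
class Paragraph:
    """Hypothetical stand-in for a docx paragraph."""

    text: str
    is_heading: bool = False  # True for chapter or section titles


@dataclass
class Table:
    """Hypothetical stand-in for a docx table."""

    pruefis: List[str] = field(default_factory=list)


Item = Union[Paragraph, Table]


def collect_pruefi_tables(items: Iterable[Item]) -> List[Table]:
    """Walk the document items in order and collect all tables with prüfis."""
    extract: List[Table] = []
    for item in items:
        if isinstance(item, Paragraph):
            if not item.is_heading:
                continue  # plain text paragraph: read the next item
            if item.text.strip() == "Änderungshistorie":
                break  # change history reached: stop and export the extract
            continue  # any other heading: assumed to be skipped
        if item.pruefis:
            extract.append(item)  # table with prüfis: add it to the extract
    return extract
```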

AHB page number per Format

The following table shows the number of pages of the AHBs for each format in the format version FV2310.

| Format | Number of pages | Hint |
|---|---|---|
| UTILMD Strom | 1064 | |
| UTILMD Gas | 345 | |
| REQOTE | 264 | together with QUOTES, ORDERS, ORDRSP, ORDCHG |
| QUOTES | 264 | together with REQOTE, ORDERS, ORDRSP, ORDCHG |
| ORDRSP | 264 | together with REQOTE, QUOTES, ORDERS, ORDCHG |
| ORDERS | 264 | together with REQOTE, QUOTES, ORDRSP, ORDCHG |
| ORDCHG | 264 | together with REQOTE, QUOTES, ORDERS, ORDRSP |
| MSCONS | 164 | |
| UTILMD MaBis | 133 | |
| REMADV | 91 | together with INVOIC |
| INVOIC | 91 | together with REMADV |
| IFTSTA | 82 | |
| CONTRL | 72 | together with APERAK, contains no Prüfis |
| APERAK | 72 | together with CONTRL, contains no Prüfis |
| PARTIN | 69 | |
| UTILTS | 34 | |
| ORDRSP | 30 | together with ORDERS |
| ORDERS | 30 | together with ORDRSP |
| PRICAT | 25 | |
| COMDIS | 10 | good test for tables which are above change history |

Development

Setup

To set up the development environment, you have to install the dev dependencies.

tox -e dev

Run all tests and linters

To run the tests, you can use tox.

tox

See our Python Template Repository for detailed explanations.

Contribute

You are very welcome to contribute to this repository by opening a pull request against the main branch.

Related Tools and Context

This repository is part of the Hochfrequenz Libraries and Tools for a truly digitized market communication.

kohlrahbi's People

Contributors

deltadaniel, dependabot[bot], hf-kklein, hf-krechan, hf-sheese, lord-haffi, mj0nez

kohlrahbi's Issues

Add a further flag to choose the format version

With PR #209 we can now read the concrete file name for each Prüfi.
However, the file name also contains the version number of the document, which correlates with the format version (e.g. FV2310).

So the Python script collect_pruefis.py should be able to create different format versions of the output all_known_pruefis.toml. I would suggest the following folder setup:

|- all_known_pruefis
    |- FV2310_all_known_pruefis
    |- FV2404_all_known_pruefis
    |- FV2410_all_known_pruefis
    |- FV2504_all_known_pruefis
    |- ...

If you then run kohlrahbi, you should provide a flag like --format-version / -fv and a format version, e.g. FV2310.
The full command to read the AHB for the Prüfi 13007 in the format version FV2310 could look like this:

python /src/kohlrahbi/__init__.py --format-version FV2310 -p 13007 --file-type flatahb --file-type csv --file-type xlsx --input_path /Users/kevin/workspaces/hochfrequenz/edi_energy_mirror/edi_energy_de/future --output_path /Users/kevin/workspaces/hochfrequenz/kohlrahbi/output

There should be a default format version. But I am not sure if it should be the current or future format version.
@hf-kklein do you have any feelings about this question?

Add column for discriminator

At the moment you cannot distinguish between a Freitextfeld (free text field) and a Qualifier.
So we should add an extra column for this purpose.

Get the latest docx version for a given format and format version

If there are several docx files for a given format and format version, kohlrAHBi should always pick the most recent file.

For example:

In FV2310 for UTILMD, kohlrAHBi picks

"UTILMDAHBStrom-informatorischeLesefassung1.1KonsolidierteLesefassungmitFehlerkorrekturenStand29.06.2023_20230928_20231001.docx"

instead of

"UTILMDAHBStrom-informatorischeLesefassung1.1KonsolidierteLesefassungmitFehlerkorrekturenStand12.12.2023_20240402_20231212.docx"
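
One possible way to pick the most recent file is to compare the "Stand" (as of) date embedded in the file names. The sketch below assumes that this date is the right recency criterion, which the issue does not state explicitly. `stand_date` and `pick_most_recent` are hypothetical helper names, not part of kohlrahbi.

```python
import re
from datetime import date
from typing import Iterable, Optional

# Matches e.g. "Stand12.12.2023" and captures day, month and year.
_STAND_PATTERN = re.compile(r"Stand(\d{2})\.(\d{2})\.(\d{4})")


def stand_date(file_name: str) -> Optional[date]:
    """Extract the 'Stand' date from a file name, or None if there is none."""
    match = _STAND_PATTERN.search(file_name)
    if match is None:
        return None
    day, month, year = (int(part) for part in match.groups())
    return date(year, month, day)


def pick_most_recent(file_names: Iterable[str]) -> Optional[str]:
    """Return the file name with the latest 'Stand' date; undated names are ignored."""
    dated = [(stand_date(name), name) for name in file_names]
    dated = [(day, name) for day, name in dated if day is not None]
    if not dated:
        return None
    return max(dated)[1]
```

Applied to the two file names above, this sketch selects the file with "Stand12.12.2023".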

Idea: Save the Page Number

Tobias from Lynqtech would like to also store the page numbers that the information comes from.
This should make it easier to validate the parsed result.

Use classes to improve program structure

Try to find a new structure that is based on classes.
This should reduce the number of arguments that currently have to be passed.
One class could represent a row of a docx table.

Export for Bedingungen only

Add the option to export all Bedingungen (conditions) for each format.
The spreadsheet should contain two columns: Bedingung-Key and Bedingung-Text.
All formats (e.g. UTILMD, ORDERS, APERAK etc.) should be in one Excel file, but on different sheets.

Memorize which File contains which Pruefi Tables

As of today we're re-reading the same docx.Documents over and over again to find out if they contain a specific pruefi. Ideally we'd reduce the re-reading and memorize which pruefi is (not) contained in which file. Even the information that a file does not contain a pruefi could significantly speed up the overall scraping and I think is easier to integrate in the existing code base.
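
A minimal sketch of such a memory is shown below. It remembers both where a pruefi was found and which files are known not to contain it. `PruefiFileIndex` and the injected `contains_pruefi` callable are hypothetical; in practice the callable would wrap whatever routine opens and scans a docx file.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, Optional, Set, Tuple


class PruefiFileIndex:
    """Hypothetical in-memory index of which file does (not) contain which pruefi."""

    def __init__(self, contains_pruefi: Callable[[Path, str], bool]) -> None:
        # contains_pruefi is the expensive check that actually reads the file.
        self._contains_pruefi = contains_pruefi
        self._hits: Dict[str, Path] = {}
        self._misses: Set[Tuple[Path, str]] = set()

    def find(self, pruefi: str, candidates: Iterable[Path]) -> Optional[Path]:
        """Return the file containing the pruefi, reading each file at most once per pruefi."""
        if pruefi in self._hits:
            return self._hits[pruefi]
        for path in candidates:
            if (path, pruefi) in self._misses:
                continue  # already known not to contain this pruefi
            if self._contains_pruefi(path, pruefi):
                self._hits[pruefi] = path
                return path
            self._misses.add((path, pruefi))
        return None
```

Keeping the negative results is what avoids re-reading files for pruefis that are never found.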

Log more info on success/failure

I'm using kohlrahbi in a CI tool:

Run kohlrahbi --input_path edi_energy_mirror --output_path ./machine-readable_anwendungshandbuecher/FV2210/UTILMD/ --pruefis 11039
kohlrahbi --input_path edi_energy_mirror --output_path ./machine-readable_anwendungshandbuecher/FV2210/UTILMD/ --pruefis 11039
shell: /usr/bin/bash -e {0}
env:
pythonLocation: /opt/hostedtoolcache/Python/3.11.2/x64
PKG_CONFIG_PATH: /opt/hostedtoolcache/Python/3.11.2/x64/lib/pkgconfig
Python_ROOT_DIR: /opt/hostedtoolcache/Python/3.11.2/x64
Python2_ROOT_DIR: /opt/hostedtoolcache/Python/3.11.2/x64
Python3_ROOT_DIR: /opt/hostedtoolcache/Python/3.11.2/x64
LD_LIBRARY_PATH: /opt/hostedtoolcache/Python/3.11.2/x64/lib
INFO [kohlrahbi] start looking for pruefi '11039'

The logs state that kohlrahbi started looking for a pruefi, but nothing more. Was it found? Was any result written anywhere?

Read Filename where to expect the Pruefi from Excel Overview

It would actually be possible if you read in this file:

https://github.com/Hochfrequenz/edi_energy_mirror/blob/master/edi_energy_de/current/Anwendungs%C3%BCbersichtderPr%C3%BCfidentifikatoren-informatorischeLesefassung2.0Au%C3%9FerordentlicheVer%C3%B6ffentlichung_20230331_20221201.xlsx

But that would mean extra effort, which I would not put into this PR. Performance is not our main priority (and never will be at this point).

Originally posted by @hf-kklein in #53 (comment)

Write tests for all the functions that use classes from `docx`

I analyzed which parts of the code are not covered by tests yet.

Module statements missing excluded coverage
C:\github\AHBExtractor\src\ahbextractor\__init__.py 2 0 0 100%
C:\github\AHBExtractor\src\ahbextractor\helper\__init__.py 0 0 0 100%
C:\github\AHBExtractor\src\ahbextractor\helper\check_row_type.py 49 3 0 94%
C:\github\AHBExtractor\src\ahbextractor\helper\elixir.py 36 18 0 50%
C:\github\AHBExtractor\src\ahbextractor\helper\export_functions.py 64 41 0 36%
C:\github\AHBExtractor\src\ahbextractor\helper\write_functions.py 129 19 0 85%
C:\github\AHBExtractor\unittests\__init__.py 0 0 0 100%
C:\github\AHBExtractor\unittests\test_check_row_type.py 17 0 0 100%
C:\github\AHBExtractor\unittests\test_export_functions.py 8 0 0 100%
C:\github\AHBExtractor\unittests\test_write_functions.py 151 2 0 99%
Total 1404 513 0 63%

The biggest gaps in the test coverage are where docx-instances are used as arguments (e.g. Tables, Cells, Paragraphs...)

Order of CLI flags matters

Current situation:
For example, with the following commands:

  • kohlrahbi conditions --assume-yes -eemp some/path -o output/path --format-version FV2310
    sets assume_yes = True and generates the output directory if it does not exist.
  • kohlrahbi conditions -eemp some/path -o output/path --assume-yes --format-version FV2310
    sets assume_yes = False when the existence of the output directory is checked

Expected/favored behavior:
The order of cli flags should not matter.
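
One general way to make flag order irrelevant is to parse all flags first and only then act on them. The sketch below shows this with the standard library's argparse. It is an illustration of the expected behavior, not kohlrahbi's actual CLI code, and `parse_args` and `should_create_output_dir` are hypothetical names.

```python
import argparse
from pathlib import Path
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse all flags up front; nothing is checked while parsing is still in progress."""
    parser = argparse.ArgumentParser(prog="conditions-sketch")
    parser.add_argument("--assume-yes", action="store_true")
    parser.add_argument("-o", "--output-path", type=Path, required=True)
    parser.add_argument("--format-version", required=True)
    return parser.parse_args(argv)


def should_create_output_dir(args: argparse.Namespace) -> bool:
    """Decide about the output directory only after every flag has been parsed."""
    return args.assume_yes and not args.output_path.exists()
```

Because the directory check runs after parsing has finished, `--assume-yes` has the same effect whether it comes before or after the output path.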

Fix initialisation of `elixir` in `read_functions.py`

mypy and the linter rightly complain about the unknown elixir variable in some if branches.
The main workflow works, but it is not clean code!

Error message

ahbextractor/helper/read_functions.py:177: error: Cannot determine type of "elixir" [has-type]

Ensure local FVYYMM_pruefi_docx_filename_map.toml is reasonably up-to-date

When KohlrAHBi is used to parse docx documents for a specific format version (FV), it initially scans the matching directory in the local edi_energy_mirror repository to create a mapping from pruefi to docx file. These mappings are kept in a cache folder, which speeds up later parsing runs. However, if some files change (updated repository or alternating use of real and test repositories), the mappings need to be updated in order to provide reliable results. Therefore, we could either

  • provide a --delete-cache flag or
  • check for the last date the specific filemapping has been created and update it after a set time (e.g. two weeks)
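
Both options can be combined in one small check, sketched below. The file's modification time is used here as a stand-in for the creation date of the mapping. `cache_needs_refresh` and its parameters are hypothetical and not part of kohlrahbi.

```python
import time
from pathlib import Path
from typing import Optional

TWO_WEEKS_IN_SECONDS = 14 * 24 * 60 * 60


def cache_needs_refresh(
    cache_file: Path,
    delete_cache: bool = False,
    max_age_seconds: float = TWO_WEEKS_IN_SECONDS,
    now: Optional[float] = None,
) -> bool:
    """Return True if the cached mapping should be rebuilt.

    A rebuild is needed if the user asked for it (e.g. via a --delete-cache flag),
    if the cache file does not exist yet, or if it is older than max_age_seconds.
    """
    if delete_cache or not cache_file.exists():
        return True
    current_time = time.time() if now is None else now
    return current_time - cache_file.stat().st_mtime > max_age_seconds
```

The `now` parameter exists only to make the age check easy to test without waiting.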

Error in Github Action: There was an error opening the file 'edi_energy_mirror/edi_energy_de/FV2404/MSCONSAHB-informatorischeLesefassung3.1cKonsolidierteLesefassungmitFehlerkorrekturenStand12.12.2023_20240331_20231212.docx'

Run kohlrahbi --input-path edi_energy_mirror/edi_energy_de/FV2404 --output-path ./machine-readable_anwendungshandbuecher/FV2404 --file-type flatahb --file-type csv --file-type xlsx
☝️ No pruefis were given. I will parse all known pruefis.
INFO [kohlrahbi] start looking for pruefi '13002'
ERROR [kohlrahbi] There was an error opening the file 'edi_energy_mirror/edi_energy_de/FV2404/MSCONSAHB-informatorischeLesefassung3.1cKonsolidierteLesefassungmitFehlerkorrekturenStand12.12.2023_20240331_20231212.docx'
Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.11.8/x64/lib/python3.11/site-packages/kohlrahbi/__init__.py", line 285, in get_or_cache_document
    doc = docx.Document(ahb_file_path)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.11.8/x64/lib/python3.11/site-packages/docx/api.py", line 23, in Document
    document_part = Package.open(docx).main_document_part
                    ^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.11.8/x64/lib/python3.11/site-packages/docx/opc/package.py", line 116, in open
    pkg_reader = PackageReader.from_file(pkg_file)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.11.8/x64/lib/python3.11/site-packages/docx/opc/pkgreader.py", line 22, in from_file
    phys_reader = PhysPkgReader(pkg_file)
                  ^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.11.8/x64/lib/python3.11/site-packages/docx/opc/phys_pkg.py", line 76, in __init__
    self._zipf = ZipFile(pkg_file, "r")
                 ^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.11.8/x64/lib/python3.11/zipfile.py", line 1286, in __init__
    self.fp = io.open(file, filemode)
              ^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: 'edi_energy_mirror/edi_energy_de/FV2404/MSCONSAHB-informatorischeLesefassung3.1cKonsolidierteLesefassungmitFehlerkorrekturenStand12.12.2023_20240331_20231212.docx'
ERROR [kohlrahbi] Error processing pruefi '13002':

Pin CI Dependencies

There should be a dev_requirements folder like in our template repository.
The tox.ini should refer to the pinned dependencies.
