hochfrequenz / kohlrahbi

An Anwendungshandbücher (AHB) scraper that extracts tables from docx files

Home Page: https://pypi.org/project/kohlrahbi/

License: GNU General Public License v3.0

Python 100.00%
ahb anwendungshandbuch bdew energiewirtschaft

kohlrahbi's Issues

Remove Irrelevant Lines In Flatahb Output

In the AHB documents there are many lines that are not relevant for every Prüfi.

Take this snippet as an example

[image: AHB table snippet]

All of these lines matter only for 55002, not for 55001 or 55003.

At the moment we still export these irrelevant lines for 55001 and 55003 too.

We can filter them out by looking at the Bedingungsausdruck column.
If this field is empty, I think we can remove the whole line from the export.

This helps to reduce the file sizes and improves the user experience for the users of the AHB Tabellen frontend :)
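A minimal sketch of the proposed filter, assuming the Bedingungsausdruck column corresponds to the `ahb_expression` field seen in the flatahb JSON and that the lines are available as plain dictionaries. The function name `drop_irrelevant_lines` is hypothetical and not part of kohlrahbi.

```python
from typing import Any


def drop_irrelevant_lines(lines: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Keep only lines whose Bedingungsausdruck (ahb_expression) is non-empty."""
    # Treat a missing key, None, "" and whitespace-only strings as "empty".
    return [line for line in lines if (line.get("ahb_expression") or "").strip()]


if __name__ == "__main__":
    example = [
        {"index": 1, "ahb_expression": "Muss"},
        {"index": 2, "ahb_expression": ""},
        {"index": 3, "ahb_expression": None},
    ]
    print(drop_irrelevant_lines(example))
```

Whether every line with an empty expression is really safe to drop (for example, pure heading rows) would still need to be checked against real AHB data.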

Add column for discriminator

At the moment you cannot distinguish between a Freitextfeld and a Qualifier.
So we add an extra column for this purpose.

Ensure local FVYYMM_pruefi_docx_filename_map.toml is reasonably up-to-date

When KohlrAHBi is used to parse docx documents for a specific format version (FV), it initially scans the matching directory in the local edi_energy_mirror repository to create a pruefi-to-docx file mapping. These mappings are kept in a cache folder, which speeds up later parsing runs. However, if some files change (an updated repository, or alternating use of real and test repositories), the mappings need to be updated in order to provide reliable results. Therefore, we could either

  • provide a --delete-cache flag or
  • check for the last date the specific filemapping has been created and update it after a set time (e.g. two weeks)
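The second option could look roughly like the sketch below. It assumes the file's modification time is a good enough proxy for when the mapping was created; the helper name `is_cache_stale` and the two-week default are illustrative, not existing kohlrahbi code.

```python
import time
from datetime import timedelta
from pathlib import Path
from typing import Optional

MAX_CACHE_AGE = timedelta(weeks=2)  # the "set time" from the proposal above


def is_cache_stale(
    cache_file: Path,
    max_age: timedelta = MAX_CACHE_AGE,
    now: Optional[float] = None,
) -> bool:
    """Return True if the cached mapping is missing or older than max_age."""
    if not cache_file.exists():
        return True
    # `now` can be injected for testing; by default use the current time.
    current = time.time() if now is None else now
    age_in_seconds = current - cache_file.stat().st_mtime
    return age_in_seconds > max_age.total_seconds()
```

A caller would rebuild the mapping whenever `is_cache_stale(...)` returns True. A `--delete-cache` flag could simply delete the file, which leads to the same code path.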

Idea: Save the Page Number

Tobias from Lynqtech would like to also store the page numbers that the information comes from.
This should make it easier to validate the parsed result.

Use classes to improve program structure

Try to find a new structure that is based on classes.
This should reduce the number of arguments that have to be passed around at the moment.
One class could represent a row of a docx table.

Pin CI Dependencies

There should be a dev_requirements folder like in our template repository.
The tox.ini should refer to the pinned dependencies.

Remove `section_name` From Flatahb If `segment_code` Is Empty

Today I learned: the name on the left side of the AHB documents is not the name of the segment group; it is the name of the segment.

So it makes no sense to add this segment name to lines where we only have the segment group.

So the output, for example for 55001, should be

    {
      "ahb_expression": "Kann",
      "data_element": null,
      "guid": "fba40a12-c494-4105-847a-46f7b5f01ef3",
      "index": 25,
      "name": "",
      "section_name": null,  <-- I changed this line
      "segment_code": null,  <-- because the segment_code is null
      "segment_group_key": "SG3",
      "value_pool_entry": null
    },
    {
      "ahb_expression": "Muss",
      "data_element": null,
      "guid": "cca45667-760e-4d1d-a123-a30c7acda5e7",
      "index": 26,
      "name": "",
      "section_name": "Ansprechpartner",
      "segment_code": "CTA",
      "segment_group_key": "SG3",
      "value_pool_entry": null
    }

You can find the current output here: https://github.com/Hochfrequenz/machine-readable_anwendungshandbuecher/blob/51c7d93e77fafea60c1acb3a9b2ca5fe26c51206/FV2310/UTILMD/flatahb/55001.json#L245-L266
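As a rough sketch, the rule could be applied per line as shown below. It assumes flatahb lines are handled as dictionaries with the field names from the JSON above; the function name is hypothetical.

```python
from typing import Any


def clear_section_name_without_segment(line: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of the line with section_name set to None if there is no segment_code."""
    if not line.get("segment_code"):
        # Only the segment group is known here, so the segment name does not apply.
        return {**line, "section_name": None}
    return line


if __name__ == "__main__":
    group_line = {"section_name": "Ansprechpartner", "segment_code": None, "segment_group_key": "SG3"}
    segment_line = {"section_name": "Ansprechpartner", "segment_code": "CTA", "segment_group_key": "SG3"}
    print(clear_section_name_without_segment(group_line))
    print(clear_section_name_without_segment(segment_line))
```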

Order of CLI flags matters

Current situation:
For example, with the following commands:

  • kohlrahbi conditions --assume-yes -eemp some/path -o output/path --format-version FV2310
    sets assume_yes = True and generates the output directory if it does not exist.
  • kohlrahbi conditions -eemp some/path -o output/path --assume-yes --format-version FV2310
    sets assume_yes = False when the existence of the output directory is checked

Expected/favored behavior:
The order of CLI flags should not matter.
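One general way to get order-independent behaviour is to parse all flags first and only afterwards run checks that depend on them, such as creating the output directory. The sketch below illustrates that principle with the standard library's argparse. It is not kohlrahbi's actual CLI code, and the `dest` names are made up for the example.

```python
import argparse
from typing import Sequence


def build_parser() -> argparse.ArgumentParser:
    # Flag spellings are taken from the commands above; everything else is illustrative.
    parser = argparse.ArgumentParser(prog="kohlrahbi-sketch")
    parser.add_argument("-eemp", dest="eemp", required=True)
    parser.add_argument("-o", dest="output_path", required=True)
    parser.add_argument("--assume-yes", action="store_true")
    parser.add_argument("--format-version", required=True)
    return parser


def parse(argv: Sequence[str]) -> argparse.Namespace:
    """Parse every flag before any of them is acted upon."""
    return build_parser().parse_args(argv)


if __name__ == "__main__":
    early = parse(["--assume-yes", "-eemp", "some/path", "-o", "output/path", "--format-version", "FV2310"])
    late = parse(["-eemp", "some/path", "-o", "output/path", "--assume-yes", "--format-version", "FV2310"])
    # Both namespaces carry the same values, so a later directory check sees assume_yes either way.
    print(early.assume_yes, late.assume_yes)
```

Any check of the output directory would then run on the fully parsed namespace rather than inside a per-option callback.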

Fix initialisation of `elixir` in `read_functions.py`

mypy and the linter rightly complain about the elixir variable, which is not defined in some if branches.
The main workflow works, but the code is not clean!

Error message

ahbextractor/helper/read_functions.py:177: error: Cannot determine type of "elixir" [has-type]

Write tests for all the functions that use classes from `docx`

I analyzed which parts of the code are not covered by tests yet.

File                                                                  statements  missing  excluded  coverage
C:\github\AHBExtractor\src\ahbextractor\__init__.py                            2        0         0      100%
C:\github\AHBExtractor\src\ahbextractor\helper\__init__.py                     0        0         0      100%
C:\github\AHBExtractor\src\ahbextractor\helper\check_row_type.py              49        3         0       94%
C:\github\AHBExtractor\src\ahbextractor\helper\elixir.py                      36       18         0       50%
C:\github\AHBExtractor\src\ahbextractor\helper\export_functions.py            64       41         0       36%
C:\github\AHBExtractor\src\ahbextractor\helper\write_functions.py            129       19         0       85%
C:\github\AHBExtractor\unittests\__init__.py                                   0        0         0      100%
C:\github\AHBExtractor\unittests\test_check_row_type.py                       17        0         0      100%
C:\github\AHBExtractor\unittests\test_export_functions.py                      8        0         0      100%
C:\github\AHBExtractor\unittests\test_write_functions.py                     151        2         0       99%
Total                                                                       1404      513         0       63%

The biggest gaps in the test coverage are where docx-instances are used as arguments (e.g. Tables, Cells, Paragraphs...)

Error in Github Action: There was an error opening the file 'edi_energy_mirror/edi_energy_de/FV2404/MSCONSAHB-informatorischeLesefassung3.1cKonsolidierteLesefassungmitFehlerkorrekturenStand12.12.2023_20240331_20231212.docx'

Run kohlrahbi --input-path edi_energy_mirror/edi_energy_de/FV2404 --output-path ./machine-readable_anwendungshandbuecher/FV2404 --file-type flatahb --file-type csv --file-type xlsx
☝️ No pruefis were given. I will parse all known pruefis.
INFO [kohlrahbi] start looking for pruefi '13002'
ERROR [kohlrahbi] There was an error opening the file 'edi_energy_mirror/edi_energy_de/FV2404/MSCONSAHB-informatorischeLesefassung3.1cKonsolidierteLesefassungmitFehlerkorrekturenStand12.12.2023_20240331_20231212.docx'
Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.11.8/x64/lib/python3.11/site-packages/kohlrahbi/__init__.py", line 285, in get_or_cache_document
    doc = docx.Document(ahb_file_path)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.11.8/x64/lib/python3.11/site-packages/docx/api.py", line 23, in Document
    document_part = Package.open(docx).main_document_part
                    ^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.11.8/x64/lib/python3.11/site-packages/docx/opc/package.py", line 116, in open
    pkg_reader = PackageReader.from_file(pkg_file)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.11.8/x64/lib/python3.11/site-packages/docx/opc/pkgreader.py", line 22, in from_file
    phys_reader = PhysPkgReader(pkg_file)
                  ^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.11.8/x64/lib/python3.11/site-packages/docx/opc/phys_pkg.py", line 76, in __init__
    self._zipf = ZipFile(pkg_file, "r")
                 ^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.11.8/x64/lib/python3.11/zipfile.py", line 1286, in __init__
    self.fp = io.open(file, filemode)
              ^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: 'edi_energy_mirror/edi_energy_de/FV2404/MSCONSAHB-informatorischeLesefassung3.1cKonsolidierteLesefassungmitFehlerkorrekturenStand12.12.2023_20240331_20231212.docx'
ERROR [kohlrahbi] Error processing pruefi '13002':

Log more info on success/failure

I'm using kohlrahbi in a CI tool:

Run kohlrahbi --input_path edi_energy_mirror --output_path ./machine-readable_anwendungshandbuecher/FV2210/UTILMD/ --pruefis 11039
kohlrahbi --input_path edi_energy_mirror --output_path ./machine-readable_anwendungshandbuecher/FV2210/UTILMD/ --pruefis 11039
shell: /usr/bin/bash -e {0}
env:
pythonLocation: /opt/hostedtoolcache/Python/3.11.2/x64
PKG_CONFIG_PATH: /opt/hostedtoolcache/Python/3.11.2/x64/lib/pkgconfig
Python_ROOT_DIR: /opt/hostedtoolcache/Python/3.11.2/x64
Python2_ROOT_DIR: /opt/hostedtoolcache/Python/3.11.2/x64
Python3_ROOT_DIR: /opt/hostedtoolcache/Python/3.11.2/x64
LD_LIBRARY_PATH: /opt/hostedtoolcache/Python/3.11.2/x64/lib
INFO [kohlrahbi] start looking for pruefi '11039'

The logs state that kohlrahbi started looking for a pruefi, but nothing more. Was it found? Was any result written anywhere?

Memorize which File contains which Pruefi Tables

As of today we re-read the same docx.Documents over and over again to find out whether they contain a specific pruefi. Ideally we'd reduce the re-reading and remember which pruefi is (not) contained in which file. Even just the information that a file does not contain a pruefi could significantly speed up the overall scraping, and I think it is easier to integrate into the existing code base.
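A minimal sketch of such a lookup table, assuming the expensive "does this file contain this pruefi" check can be wrapped in a function. The class `PruefiFileIndex` and the injected checker are hypothetical names; the real check would open and scan the docx.

```python
from pathlib import Path
from typing import Callable


class PruefiFileIndex:
    """Remembers, per (file, pruefi) pair, whether the file contains the pruefi."""

    def __init__(self, contains_pruefi: Callable[[Path, str], bool]) -> None:
        self._contains_pruefi = contains_pruefi  # the expensive check
        self._known: dict[tuple[Path, str], bool] = {}

    def contains(self, path: Path, pruefi: str) -> bool:
        key = (path, pruefi)
        if key not in self._known:
            # Negative results are stored too, so a miss is never re-scanned.
            self._known[key] = self._contains_pruefi(path, pruefi)
        return self._known[key]


if __name__ == "__main__":
    scans: list[tuple[Path, str]] = []

    def fake_scan(path: Path, pruefi: str) -> bool:
        scans.append((path, pruefi))  # stand-in for opening the docx
        return pruefi == "11039"

    index = PruefiFileIndex(fake_scan)
    index.contains(Path("UTILMD.docx"), "13002")
    index.contains(Path("UTILMD.docx"), "13002")
    print(len(scans))  # the file was scanned only once for this pruefi
```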

Remove double entries on segment group level

In the kohlrahbi flatahb data we have duplicate entries.

For example

    {
      "ahb_expression": "Kann",
      "data_element": null,
      "guid": "17d56a20-a551-41c1-b2ca-c806ab8982c9",
      "index": 24,
      "name": null,
      "section_name": "Ansprechpartner",
      "segment_code": null,
      "segment_group_key": "SG3",
      "value_pool_entry": null
    },
    {
      "ahb_expression": "Kann",
      "data_element": null,
      "guid": "fba40a12-c494-4105-847a-46f7b5f01ef3",
      "index": 25,
      "name": "",
      "section_name": "Ansprechpartner",
      "segment_code": null,
      "segment_group_key": "SG3",
      "value_pool_entry": null
    },

https://github.com/Hochfrequenz/machine-readable_anwendungshandbuecher/blob/51c7d93e77fafea60c1acb3a9b2ca5fe26c51206/FV2310/UTILMD/flatahb/55001.json#L234-L255

These duplicate lines should be removed.
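One possible approach, sketched under a few assumptions: two lines count as duplicates if they are adjacent and agree on every field except `guid` and `index`, and an empty string is treated the same as null (the two example entries differ only in `name` being null versus ""). The function names are hypothetical.

```python
from typing import Any

IGNORED_FIELDS = {"guid", "index"}  # always differ between lines, so not part of the comparison


def _comparable(line: dict[str, Any]) -> dict[str, Any]:
    # Normalise "" to None so that name=null and name="" compare equal.
    return {key: (value or None) for key, value in line.items() if key not in IGNORED_FIELDS}


def remove_consecutive_duplicates(lines: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Drop a line if it repeats the line kept directly before it."""
    result: list[dict[str, Any]] = []
    for line in lines:
        if result and _comparable(result[-1]) == _comparable(line):
            continue
        result.append(line)
    return result
```

Restricting the comparison to neighbouring lines avoids removing entries that legitimately recur in different parts of an AHB.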

Get the latest docx version for a given format and format version

If there are several docx files for a given format and format version, kohlrAHBi should always pick the most recent file.

For example:

In FV2310 for UTILMD, kohlrAHBi picks

"UTILMDAHBStrom-informatorischeLesefassung1.1KonsolidierteLesefassungmitFehlerkorrekturenStand29.06.2023_20230928_20231001.docx"

instead of

"UTILMDAHBStrom-informatorischeLesefassung1.1KonsolidierteLesefassungmitFehlerkorrekturenStand12.12.2023_20240402_20231212.docx"
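A sketch of how the choice could be made from the file names alone. It assumes that the two trailing `_YYYYMMDD` tokens are dates and that the last one is the better indicator of recency; the issue does not state what the two dates mean, so the sort key is a guess that happens to match the example above. The function name `pick_latest` is hypothetical.

```python
import re
from typing import Iterable

# Matches the "_YYYYMMDD_YYYYMMDD.docx" suffix seen in the file names above.
_DATE_SUFFIX = re.compile(r"_(\d{8})_(\d{8})\.docx$")


def pick_latest(filenames: Iterable[str]) -> str:
    """Pick the file whose trailing date tokens are the most recent."""

    def sort_key(name: str) -> tuple[str, str]:
        match = _DATE_SUFFIX.search(name)
        if match is None:
            return ("", "")  # files without a date suffix sort first
        first, last = match.groups()
        # YYYYMMDD strings sort chronologically when compared as text.
        return (last, first)

    return max(filenames, key=sort_key)


if __name__ == "__main__":
    candidates = [
        "UTILMDAHBStrom-informatorischeLesefassung1.1KonsolidierteLesefassungmitFehlerkorrekturenStand29.06.2023_20230928_20231001.docx",
        "UTILMDAHBStrom-informatorischeLesefassung1.1KonsolidierteLesefassungmitFehlerkorrekturenStand12.12.2023_20240402_20231212.docx",
    ]
    print(pick_latest(candidates))
```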

Add Condition Into Flatahb Output

For the AHB Tabellen (alias AHBesser) frontend we would like to add the condition texts.

At the moment this information is missing in the flatahb json files.

For example, in 55001:

    {
      "ahb_expression": "X [931][494]",
      "data_element": "2380",
      "guid": "a554bcf2-1a61-4b65-8a21-1f5a7ed86941",
      "index": 15,
      "name": "Datum oder Uhrzeit oderZeitspanne, Wert",
      "section_name": "Nachrichtendatum",
      "segment_code": "DTM",
      "segment_group_key": null,
      "value_pool_entry": null
    },

Source: https://github.com/Hochfrequenz/machine-readable_anwendungshandbuecher/blob/51c7d93e77fafea60c1acb3a9b2ca5fe26c51206/FV2310/UTILMD/flatahb/55001.json#L136-L145

This information about what is behind the [931] is already parsed and available in the csv output: https://github.com/Hochfrequenz/machine-readable_anwendungshandbuecher/blob/51c7d93e77fafea60c1acb3a9b2ca5fe26c51206/FV2310/UTILMD/csv/55001.csv#L14-L15

12,Nachrichtendatum,,DTM,2380,,,"Datum oder Uhrzeit oderZeitspanne, Wert",X [931][494],"[494] Das hier genannte Datum muss der Zeitpunkt sein, zu dem das Dokument erstellt wurde, oder ein Zeitpunkt, der davor liegt  
[931] Format: ZZZ = +00"

This should be part of the json file too.

Example output

The conditions should be in a list.

{
  "ahb_expression": "X [931][494]",
  "conditions": [
    "[494] Das hier genannte Datum muss der Zeitpunkt sein, zu dem das Dokument erstellt wurde, oder ein Zeitpunkt, der davor liegt",
    "[931] Format: ZZZ = +00"
  ],
  "data_element": "2380",
  "guid": "a554bcf2-1a61-4b65-8a21-1f5a7ed86941",
  "index": 15,
  "name": "Datum oder Uhrzeit oderZeitspanne, Wert",
  "section_name": "Nachrichtendatum",
  "segment_code": "DTM",
  "segment_group_key": null,
  "value_pool_entry": null
},
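A sketch of how the condition text from the csv output could be turned into such a list. It assumes each condition starts on its own line with its key in square brackets, as in the csv excerpt above. The helper names are hypothetical.

```python
import re
from typing import Any


def split_conditions(text: str) -> list[str]:
    """Split a block like "[494] ...\\n[931] ..." into one string per condition."""
    # Split only at line breaks that are followed by a new "[number] " key.
    parts = re.split(r"\n(?=\[\d+\] )", text.strip())
    # Collapse stray whitespace (e.g. trailing spaces before the line break).
    return [" ".join(part.split()) for part in parts if part.strip()]


def with_conditions(line: dict[str, Any], condition_text: str) -> dict[str, Any]:
    """Return a copy of the flatahb line with a "conditions" list added."""
    return {**line, "conditions": split_conditions(condition_text)}


if __name__ == "__main__":
    text = (
        "[494] Das hier genannte Datum muss der Zeitpunkt sein, zu dem das Dokument "
        "erstellt wurde, oder ein Zeitpunkt, der davor liegt  \n[931] Format: ZZZ = +00"
    )
    print(with_conditions({"ahb_expression": "X [931][494]"}, text))
```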

Export for Bedingungen only

Add the option to export all Bedingungen for each Format.
The spreadsheet should contain two columns: Bedingung-Key and Bedingung-Text.
All Formate (e.g. UTILMD, ORDERS, APERAK etc.) should be in one Excel file, but on different sheets.

Add a further flag to choose the format version

With PR #209 we can now read the concrete file name for each Prüfi.
But the file name also contains the version number of the document, which correlates with the format version (e.g. FV2310).

So the Python script collect_pruefis.py should be able to create the output all_known_pruefis.toml for different format versions. I would suggest the following folder setup

|- all_known_pruefis
    |- FV2310_all_known_pruefis
    |- FV2404_all_known_pruefis
    |- FV2410_all_known_pruefis
    |- FV2504_all_known_pruefis
    |- ...

If you then run kohlrahbi, you should provide a flag like --format-version / -fv and a format version, e.g. FV2310.
The total command to read the AHB for the Prüfi 13007 in the format version FV2310 can look like this:

python /src/kohlrahbi/__init__.py --format-version FV2310 -p 13007 --file-type flatahb --file-type csv --file-type xlsx --input_path /Users/kevin/workspaces/hochfrequenz/edi_energy_mirror/edi_energy_de/future --output_path /Users/kevin/workspaces/hochfrequenz/kohlrahbi/output

There should be a default format version. But I am not sure if it should be the current or future format version.
@hf-kklein do you have any feelings about this question?

Read Filename where to expect the Pruefi from Excel Overview

          It would actually be possible if you read in this file:

https://github.com/Hochfrequenz/edi_energy_mirror/blob/master/edi_energy_de/current/Anwendungs%C3%BCbersichtderPr%C3%BCfidentifikatoren-informatorischeLesefassung2.0Au%C3%9FerordentlicheVer%C3%B6ffentlichung_20230331_20221201.xlsx

But that would mean extra effort that I would not put into this PR. Performance is not our main priority (and never will be in this place).

Originally posted by @hf-kklein in #53 (comment)
