hochfrequenz / kohlrahbi
An Anwendungshandbücher (AHB) scraper that extracts tables from docx files
Home Page: https://pypi.org/project/kohlrahbi/
License: GNU General Public License v3.0
Write a unit test which checks if we can correctly parse multi-line column headers.
https://github.com/Hochfrequenz/AHBExtractor/pull/28/files#r978491526
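A test along those lines could look like the following sketch. `merge_multiline_header` is a hypothetical helper standing in for whatever function the parser actually uses to join header cells that span several docx lines; the real API may differ.

```python
def merge_multiline_header(header_lines: list[list[str]]) -> list[str]:
    """Join column header cells that are split across several docx lines."""
    return [
        " ".join(part for part in column if part).strip()
        for column in zip(*header_lines)
    ]


def test_multiline_column_headers():
    # two physical rows that together form one logical header row
    header_lines = [
        ["EDIFACT Struktur", "Beschreibung", "55001"],
        ["", "", "Prüfidentifikator"],
    ]
    assert merge_multiline_header(header_lines) == [
        "EDIFACT Struktur",
        "Beschreibung",
        "55001 Prüfidentifikator",
    ]
```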
In the AHB documents there are many lines which are not relevant for every Prüfi.
Take this snippet as an example
All the lines are just important for 55002, but not for 55001 or 55003.
At the moment we still export these unimportant lines for 55001 and 55003 too.
We can filter these lines out by looking at the Bedingungsausdruck column. If this field is empty, I think we can remove the whole line from the export.
This helps to reduce the file sizes and improves the user experience for the users of the AHB Tabellen frontend :)
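A minimal sketch of the proposed filter, assuming the parsed rows are available as dictionaries with a `Bedingungsausdruck` key (kohlrahbi's actual in-memory representation may differ, e.g. a DataFrame):

```python
def drop_irrelevant_lines(rows: list[dict]) -> list[dict]:
    """Keep only lines whose Bedingungsausdruck column is non-empty."""
    return [row for row in rows if row.get("Bedingungsausdruck", "").strip()]


rows = [
    {"segment": "CTA", "Bedingungsausdruck": "Muss"},
    {"segment": "COM", "Bedingungsausdruck": ""},  # not relevant for this pruefi
]
assert len(drop_irrelevant_lines(rows)) == 1
```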
At the moment you cannot distinguish between a Freitextfeld and a Qualifier.
So we add an extra column for this purpose.
With a minimal working example
When KohlrAHBi is used to parse docx documents for a specific format version (FV), it initially scans the compatible directory in the local edi_energy_mirror repository to create a pruefi-to-docx file mapping. These files are kept in a cache folder, which speeds up later parsing runs. However, if some files change (an updated repository, or alternating use of the real and test repositories), those mappings need to be updated in order to provide reliable results. Therefore, we could either
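One possible invalidation strategy (a sketch, not kohlrahbi's actual cache code): fingerprint the docx files the mapping was built from, and rebuild the pruefi-to-docx mapping whenever that fingerprint changes.

```python
import hashlib
import json
from pathlib import Path


def directory_fingerprint(docx_dir: Path) -> str:
    """Hash the names, sizes and mtimes of all docx files in a directory."""
    entries = sorted(
        (p.name, p.stat().st_size, p.stat().st_mtime_ns)
        for p in docx_dir.glob("*.docx")
    )
    return hashlib.sha256(json.dumps(entries).encode()).hexdigest()

# On startup: if the stored fingerprint differs from the current one,
# the cached pruefi-to-docx mapping is stale and must be rebuilt.
```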
Like many CLI tools, kohlrahbi should tell the user the installed version number after running the command
kohlrahbi --version
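A sketch of how the version could be looked up at runtime via package metadata; with click, one would typically wire this to a `@click.version_option()` decorator on the main command instead.

```python
from importlib import metadata


def get_version(package: str = "kohlrahbi") -> str:
    """Return the installed version of the package, or 'unknown'."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return "unknown"
```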
We'd save one intermediate step if the AHBExtractor already used the MAUS data model, namely a FlatAnwendungshandbuch. This would save us from writing and reading CSV, and we could also spot data errors on the AhbExtractor side already (instead of only when importing the CSV in MAUS).
> tox -re dev
ERROR: pyproject.toml file found.
To use a PEP 517 build-backend you are required to configure tox to use an isolated_build:
https://tox.readthedocs.io/en/latest/example/package.html
tox --version: tox-3.27.1
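The error message itself names the fix for tox 3.x with a PEP 517 build backend: enable the documented `isolated_build` setting.

```ini
# tox.ini
[tox]
isolated_build = True
```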
We should also check the tests with pylint and mypy.
To have the same stack as in MIG_mose or other Python projects, we should use pydantic instead of attrs classes.
Tobias from Lynqtech would like to also store the page numbers indicating where the information comes from.
This should make it easier to validate the parsed result.
Try to find a new structure which is based on classes.
This should reduce the number of arguments that have to be passed at the moment.
One class could represent a row of a docx table.
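A sketch of what such a class could look like (the names are assumptions, not kohlrahbi's actual API): bundling the cells of one AHB table row so helper functions take a single object instead of many arguments.

```python
from dataclasses import dataclass


@dataclass
class AhbTableRow:
    """One row of an AHB docx table."""

    edifact_struktur_cell: str
    middle_cell: str
    bedingung_cell: str

    def is_empty(self) -> bool:
        return not any(
            (self.edifact_struktur_cell, self.middle_cell, self.bedingung_cell)
        )


row = AhbTableRow("SG3 CTA", "Ansprechpartner", "Muss")
assert not row.is_empty()
```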
There should be a dev_requirements folder like in our template repository.
The tox.ini should refer to the pinned dependencies.
Could you also add it to the `pyproject.toml`?
Line 33
Originally posted by @hf-krechan in #99 (review)
Today I learned: The name in the AHB documents on the left side is not the name of the segment group, it is the name of the segment.
So it doesn't make sense to add this segment name to the lines where we only have the segment group.
So the output, for example for 55001, should be:
{
    "ahb_expression": "Kann",
    "data_element": null,
    "guid": "fba40a12-c494-4105-847a-46f7b5f01ef3",
    "index": 25,
    "name": "",
    "section_name": null, <-- I changed this line
    "segment_code": null, <-- because the segment_code is null
    "segment_group_key": "SG3",
    "value_pool_entry": null
},
{
    "ahb_expression": "Muss",
    "data_element": null,
    "guid": "cca45667-760e-4d1d-a123-a30c7acda5e7",
    "index": 26,
    "name": "",
    "section_name": "Ansprechpartner",
    "segment_code": "CTA",
    "segment_group_key": "SG3",
    "value_pool_entry": null
}
You can find the current output here: https://github.com/Hochfrequenz/machine-readable_anwendungshandbuecher/blob/51c7d93e77fafea60c1acb3a9b2ca5fe26c51206/FV2310/UTILMD/flatahb/55001.json#L245-L266
Current situation:
Using, for example, the following command:
kohlrahbi conditions --assume-yes -eemp some/path -o output/path --format-version FV2310
leads to assume_yes = True and generates the output directory if it does not exist.
kohlrahbi conditions -eemp some/path -o output/path --assume-yes --format-version FV2310
leads to assume_yes = False when the existence of the output directory is checked.
Expected/favored behavior:
The order of CLI flags should not matter.
e.g. "00540" and "00187" in this example
for the MIG part see Hochfrequenz/migmose#58
By using append you could avoid using a row index variable.
mypy and the linter correctly complain about the unknown elixir variable in some if cases.
The main workflow works but it is not clean code!
Error message
ahbextractor/helper/read_functions.py:177: error: Cannot determine type of "elixir" [has-type]
Judging by the name of the method `get_row_type` alone, I would not have expected it to somehow modify my edifact_structur_cell. I would have thought it was pure.
Originally posted by @hf-kklein in #53 (comment)
I analyzed which parts of the code are not covered by tests yet.
C:\github\AHBExtractor\src\ahbextractor\__init__.py | 2 | 0 | 0 | 100% |
C:\github\AHBExtractor\src\ahbextractor\helper\__init__.py | 0 | 0 | 0 | 100% |
C:\github\AHBExtractor\src\ahbextractor\helper\check_row_type.py | 49 | 3 | 0 | 94% |
C:\github\AHBExtractor\src\ahbextractor\helper\elixir.py | 36 | 18 | 0 | 50% |
C:\github\AHBExtractor\src\ahbextractor\helper\export_functions.py | 64 | 41 | 0 | 36% |
C:\github\AHBExtractor\src\ahbextractor\helper\write_functions.py | 129 | 19 | 0 | 85% |
C:\github\AHBExtractor\unittests\__init__.py | 0 | 0 | 0 | 100% |
C:\github\AHBExtractor\unittests\test_check_row_type.py | 17 | 0 | 0 | 100% |
C:\github\AHBExtractor\unittests\test_export_functions.py | 8 | 0 | 0 | 100% |
C:\github\AHBExtractor\unittests\test_write_functions.py | 151 | 2 | 0 | 99% |
Total | 1404 | 513 | 0 | 63% |
The biggest gaps in the test coverage are where docx-instances are used as arguments (e.g. Tables, Cells, Paragraphs...)
All MSCONS AHB tables can't be scraped.
Run kohlrahbi --input-path edi_energy_mirror/edi_energy_de/FV2404 --output-path ./machine-readable_anwendungshandbuecher/FV2404 --file-type flatahb --file-type csv --file-type xlsx
☝️ No pruefis were given. I will parse all known pruefis.
INFO [kohlrahbi] start looking for pruefi '13002'
ERROR [kohlrahbi] There was an error opening the file 'edi_energy_mirror/edi_energy_de/FV2404/MSCONSAHB-informatorischeLesefassung3.1cKonsolidierteLesefassungmitFehlerkorrekturenStand12.12.2023_20240331_20231212.docx'
Traceback (most recent call last):
File "/opt/hostedtoolcache/Python/3.11.8/x64/lib/python3.11/site-packages/kohlrahbi/__init__.py", line 285, in get_or_cache_document
doc = docx.Document(ahb_file_path)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/hostedtoolcache/Python/3.11.8/x64/lib/python3.11/site-packages/docx/api.py", line 23, in Document
document_part = Package.open(docx).main_document_part
^^^^^^^^^^^^^^^^^^
File "/opt/hostedtoolcache/Python/3.11.8/x64/lib/python3.11/site-packages/docx/opc/package.py", line 116, in open
pkg_reader = PackageReader.from_file(pkg_file)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/hostedtoolcache/Python/3.11.8/x64/lib/python3.11/site-packages/docx/opc/pkgreader.py", line 22, in from_file
phys_reader = PhysPkgReader(pkg_file)
^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/hostedtoolcache/Python/3.11.8/x64/lib/python3.11/site-packages/docx/opc/phys_pkg.py", line 76, in __init__
self._zipf = ZipFile(pkg_file, "r")
^^^^^^^^^^^^^^^^^^^^^^
File "/opt/hostedtoolcache/Python/3.11.8/x64/lib/python3.11/zipfile.py", line 1286, in __init__
self.fp = io.open(file, filemode)
^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: 'edi_energy_mirror/edi_energy_de/FV2404/MSCONSAHB-informatorischeLesefassung3.1cKonsolidierteLesefassungmitFehlerkorrekturenStand12.12.2023_20240331_20231212.docx'
ERROR [kohlrahbi] Error processing pruefi '13002':
This is exactly (1:1) the code that also appears in the export function beautify_bedingungen. I assume you can remove it in one of the two places or call the function instead.
Originally posted by @hf-aschloegl in #5 (comment)
I'm using kohlrahbi in a CI tool:
Run kohlrahbi --input_path edi_energy_mirror --output_path ./machine-readable_anwendungshandbuecher/FV2210/UTILMD/ --pruefis 11039
kohlrahbi --input_path edi_energy_mirror --output_path ./machine-readable_anwendungshandbuecher/FV2210/UTILMD/ --pruefis 11039
shell: /usr/bin/bash -e {0}
env:
pythonLocation: /opt/hostedtoolcache/Python/3.11.2/x64
PKG_CONFIG_PATH: /opt/hostedtoolcache/Python/3.11.2/x64/lib/pkgconfig
Python_ROOT_DIR: /opt/hostedtoolcache/Python/3.11.2/x64
Python2_ROOT_DIR: /opt/hostedtoolcache/Python/3.11.2/x64
Python3_ROOT_DIR: /opt/hostedtoolcache/Python/3.11.2/x64
LD_LIBRARY_PATH: /opt/hostedtoolcache/Python/3.11.2/x64/lib
INFO [kohlrahbi] start looking for pruefi '11039'
The logs state that kohlrahbi started looking for a pruefi, but nothing more. Was it found? Was any result written anywhere?
As of today we're re-reading the same docx.Documents over and over again to find out if they contain a specific pruefi. Ideally we'd reduce the re-reading and memorize which pruefi is (not) contained in which file. Even the information that a file does not contain a pruefi could significantly speed up the overall scraping, and I think it is easier to integrate into the existing code base.
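A sketch of the proposed memoization (the helper names are assumptions): keep a per-file record of which pruefis were found, so each docx is opened at most once, regardless of how many pruefis are queried against it.

```python
from pathlib import Path

# file -> set of all pruefis it contains (filled lazily, one scan per file)
pruefi_cache: dict[Path, set[str]] = {}


def file_contains_pruefi(path: Path, pruefi: str, scan) -> bool:
    """`scan` is the expensive function that reads the docx once and
    returns the set of all pruefis it contains."""
    if path not in pruefi_cache:
        pruefi_cache[path] = scan(path)
    return pruefi in pruefi_cache[path]
```

Note that the cache also answers negative queries ("file X does not contain pruefi Y") without re-opening the file, which is exactly the cheap win described above.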
In the kohlrahbi flatahb data there are duplicate entries.
For example:
{
    "ahb_expression": "Kann",
    "data_element": null,
    "guid": "17d56a20-a551-41c1-b2ca-c806ab8982c9",
    "index": 24,
    "name": null,
    "section_name": "Ansprechpartner",
    "segment_code": null,
    "segment_group_key": "SG3",
    "value_pool_entry": null
},
{
    "ahb_expression": "Kann",
    "data_element": null,
    "guid": "fba40a12-c494-4105-847a-46f7b5f01ef3",
    "index": 25,
    "name": "",
    "section_name": "Ansprechpartner",
    "segment_code": null,
    "segment_group_key": "SG3",
    "value_pool_entry": null
},
These duplicate lines should be removed.
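A sketch of how such duplicates could be dropped. The assumption (taken from the example above) is that two lines count as duplicates when everything except `guid`, `index`, and the name matches; that heuristic may need refinement against real data.

```python
def remove_duplicate_lines(lines: list[dict]) -> list[dict]:
    """Drop entries that only differ in guid, index, or name."""
    seen: set[tuple] = set()
    result = []
    for line in lines:
        key = tuple(
            (k, v)
            for k, v in sorted(line.items())
            if k not in ("guid", "index", "name")
        )
        if key not in seen:
            seen.add(key)
            result.append(line)
    return result
```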
If there are several docx files for a given format and format version, kohlrAHBi should always pick the most recent file.
For example:
In FV2310 for UTILMD, kohlrAHBi picks
"UTILMDAHBStrom-informatorischeLesefassung1.1KonsolidierteLesefassungmitFehlerkorrekturenStand29.06.2023_20230928_20231001.docx"
instead of
"UTILMDAHBStrom-informatorischeLesefassung1.1KonsolidierteLesefassungmitFehlerkorrekturenStand12.12.2023_20240402_20231212.docx"
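A sketch of one way to pick the newest file, assuming the trailing `_YYYYMMDD_YYYYMMDD` part of the file name encodes the relevant dates (this naming convention is inferred from the example above; the shortened file names below are illustrative).

```python
import re


def pick_most_recent(file_names: list[str]) -> str:
    """Pick the file whose trailing _YYYYMMDD_YYYYMMDD dates are latest."""
    def date_key(name: str) -> tuple[str, str]:
        match = re.search(r"_(\d{8})_(\d{8})\.docx$", name)
        return (match.group(1), match.group(2)) if match else ("", "")

    return max(file_names, key=date_key)


files = [
    "UTILMDAHBStrom-Stand29.06.2023_20230928_20231001.docx",
    "UTILMDAHBStrom-Stand12.12.2023_20240402_20231212.docx",
]
assert pick_most_recent(files) == files[1]
```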
For the AHB Tabellen (alias AHBesser) frontend we would like to add the condition texts.
At the moment this information is missing in the flatahb JSON files.
For example, in the 55001:
{
    "ahb_expression": "X [931][494]",
    "data_element": "2380",
    "guid": "a554bcf2-1a61-4b65-8a21-1f5a7ed86941",
    "index": 15,
    "name": "Datum oder Uhrzeit oderZeitspanne, Wert",
    "section_name": "Nachrichtendatum",
    "segment_code": "DTM",
    "segment_group_key": null,
    "value_pool_entry": null
},
This information about what is behind the [931] is already parsed and available in the CSV output: https://github.com/Hochfrequenz/machine-readable_anwendungshandbuecher/blob/51c7d93e77fafea60c1acb3a9b2ca5fe26c51206/FV2310/UTILMD/csv/55001.csv#L14-L15
12,Nachrichtendatum,,DTM,2380,,,"Datum oder Uhrzeit oderZeitspanne, Wert",X [931][494],"[494] Das hier genannte Datum muss der Zeitpunkt sein, zu dem das Dokument erstellt wurde, oder ein Zeitpunkt, der davor liegt
[931] Format: ZZZ = +00"
This should be part of the json file too.
The conditions should be in a list.
{
    "ahb_expression": "X [931][494]",
    "conditions": [
        "[494] Das hier genannte Datum muss der Zeitpunkt sein, zu dem das Dokument erstellt wurde, oder ein Zeitpunkt, der davor liegt",
        "[931] Format: ZZZ = +00"
    ],
    "data_element": "2380",
    "guid": "a554bcf2-1a61-4b65-8a21-1f5a7ed86941",
    "index": 15,
    "name": "Datum oder Uhrzeit oderZeitspanne, Wert",
    "section_name": "Nachrichtendatum",
    "segment_code": "DTM",
    "segment_group_key": null,
    "value_pool_entry": null
},
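A sketch of how the multi-line condition cell from the CSV could be split into the proposed list, assuming every condition text starts with its `[key]` marker:

```python
import re


def split_conditions(condition_text: str) -> list[str]:
    """Split a cell like '[494] ...\n[931] ...' into one string per condition."""
    parts = re.split(r"(?=\[\d+\])", condition_text.strip())
    return [part.strip() for part in parts if part.strip()]


text = (
    "[494] Das hier genannte Datum muss der Zeitpunkt sein, zu dem das "
    "Dokument erstellt wurde, oder ein Zeitpunkt, der davor liegt\n"
    "[931] Format: ZZZ = +00"
)
assert split_conditions(text)[1] == "[931] Format: ZZZ = +00"
```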
Add the option to export all Bedingungen for each Format.
The spreadsheet should contain two columns: Bedingung-Key and Bedingung-Text.
All Formate (e.g. UTILMD, ORDERS, APERAK etc.) should be in one Excel file, but on different sheets.
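A sketch of the proposed export (all helper names are assumptions). The grouping is plain Python; the actual xlsx writing could be done with `pandas.ExcelWriter`, one sheet per Format.

```python
def group_conditions_by_format(
    conditions: list[tuple[str, str, str]],  # (format, key, text)
) -> dict[str, list[tuple[str, str]]]:
    """Group (Bedingung-Key, Bedingung-Text) rows by EDIFACT format."""
    sheets: dict[str, list[tuple[str, str]]] = {}
    for edifact_format, key, text in conditions:
        sheets.setdefault(edifact_format, []).append((key, text))
    return sheets


def write_conditions_xlsx(sheets: dict, path: str) -> None:
    import pandas as pd  # local import: only needed for the actual export

    with pd.ExcelWriter(path) as writer:
        for edifact_format, rows in sheets.items():
            pd.DataFrame(
                rows, columns=["Bedingung-Key", "Bedingung-Text"]
            ).to_excel(writer, sheet_name=edifact_format, index=False)
```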
kohlrahbi/src/kohlrahbi/read_functions.py
Line 107 in ab7b671
and 14 lines later:
kohlrahbi/src/kohlrahbi/read_functions.py
Line 121 in ab7b671
Is one of them redundant?
It would be nice to have some colors in the terminal outputs.
This can easily be achieved by just inserting some escape strings into the output, see How to print colored text to the terminal?
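A minimal sketch using raw ANSI escape codes; libraries like colorama or click's `click.style` offer the same with better Windows support.

```python
GREEN = "\033[92m"
RED = "\033[91m"
RESET = "\033[0m"


def colored(text: str, color: str) -> str:
    """Wrap text in an ANSI color code and reset afterwards."""
    return f"{color}{text}{RESET}"


print(colored("start looking for pruefi '11039'", GREEN))
```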
Tried to use AHB in a CI setup:
⚠️ The output directory does not exist.
Aborted!
Should I try to create the directory at 'output'? [Y/n]:
Error: Process completed with exit code 1.
https://github.com/Hochfrequenz/edi_energy_mirror/actions/runs/4235701733/jobs/7359650410
There should be a way to run through with "always yes" mode.
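A sketch of the requested behavior (the function name is an assumption): in "always yes" mode the output directory is created without prompting, so the tool can run non-interactively in CI.

```python
from pathlib import Path


def ensure_output_dir(path: Path, assume_yes: bool) -> None:
    """Create the output directory, prompting only in interactive mode."""
    if path.exists():
        return
    if not assume_yes:
        answer = input(f"Should I try to create the directory at '{path}'? [Y/n]: ")
        if answer.strip().lower() == "n":
            raise SystemExit(1)
    path.mkdir(parents=True)
```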
Move this function into a common module to avoid code duplication and circular imports
Originally posted by @hf-krechan in #123 (comment)
With the PR #209 we can now read the concrete file name for each Prüfi.
But the file name also contains the version number of the document, which correlates with the format version (e.g. FV2310).
So the Python script collect_pruefis.py should be able to create different format versions of the output all_known_pruefis.toml. I would suggest the following folder setup:
|- all_known_pruefis
|- FV2310_all_known_pruefis
|- FV2404_all_known_pruefis
|- FV2410_all_known_pruefis
|- FV2504_all_known_pruefis
|- ...
If you then run kohlrahbi, you should provide a flag like --format-version / -fv and a format version, e.g. FV2310.
The total command to read the AHB for the Prüfi 13007 in the format version FV2310 can look like this:
python /src/kohlrahbi/__init__.py --format-version FV2310 -p 13007 --file-type flatahb --file-type csv --file-type xlsx --input_path /Users/kevin/workspaces/hochfrequenz/edi_energy_mirror/edi_energy_de/future --output_path /Users/kevin/workspaces/hochfrequenz/kohlrahbi/output
There should be a default format version. But I am not sure if it should be the current or future format version.
@hf-kklein do you have any feelings about this question?
It would actually be possible if you read in this file:
But that would mean extra effort that I wouldn't put into this PR. Performance is not our main priority (and never will be at this point).
Originally posted by @hf-kklein in #53 (comment)