hochfrequenz / kohlrahbi
An Anwendungshandbücher (AHB) scraper that extracts tables from docx files
Home Page: https://pypi.org/project/kohlrahbi/
License: GNU General Public License v3.0
Write a unit test which checks if we can correctly parse multi-line column headers.
https://github.com/Hochfrequenz/AHBExtractor/pull/28/files#r978491526
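A test along those lines could look like the following sketch. `merge_multiline_header` is a hypothetical helper standing in for whatever function the parser actually uses to join header cells that span several docx lines; the real API may differ.

```python
def merge_multiline_header(header_lines: list[list[str]]) -> list[str]:
    """Join column header cells that are split across several docx lines."""
    return [
        " ".join(part for part in column if part).strip()
        for column in zip(*header_lines)
    ]


def test_multiline_column_headers():
    # two physical rows that together form one logical header row
    header_lines = [
        ["EDIFACT Struktur", "Beschreibung", "55001"],
        ["", "", "Prüfidentifikator"],
    ]
    assert merge_multiline_header(header_lines) == [
        "EDIFACT Struktur",
        "Beschreibung",
        "55001 Prüfidentifikator",
    ]
```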
In the AHB documents there are many lines which are not relevant for every Prüfi.
Take this snippet as an example
All the lines are just important for 55002, but not for 55001 or 55003.
At the moment we still export these unimportant lines for 55001 and 55003 too.
We can filter these lines out by looking at the Bedingungsausdruck column. If this field is empty, I think we can remove the whole line from the export.
This helps to reduce the file sizes and improves the user experience for the users of the AHB Tabellen frontend :)
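A minimal sketch of the proposed filter, assuming the parsed rows are available as dictionaries with a `Bedingungsausdruck` key (kohlrahbi's actual in-memory representation may differ, e.g. a DataFrame):

```python
def drop_irrelevant_lines(rows: list[dict]) -> list[dict]:
    """Keep only lines whose Bedingungsausdruck column is non-empty."""
    return [row for row in rows if row.get("Bedingungsausdruck", "").strip()]


rows = [
    {"segment": "CTA", "Bedingungsausdruck": "Muss"},
    {"segment": "COM", "Bedingungsausdruck": ""},  # not relevant for this pruefi
]
assert len(drop_irrelevant_lines(rows)) == 1
```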
At the moment you cannot distinguish between a Freitextfeld and a Qualifier.
So we add an extra column for this purpose.
With a minimal working example
When KohlrAHBi is used to parse docx documents for a specific format version (FV), it initially scans the compatible directory in the local edi_energy_mirror repository to create a pruefi-to-docx file mapping. These files are kept in a cache folder, which speeds up later parsing runs. However, if some files change (an updated repository, or alternating use of the real and test repositories), those mappings need to be updated in order to provide reliable results. Therefore, we could either
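One possible invalidation strategy (a sketch, not kohlrahbi's actual cache code): fingerprint the docx files the mapping was built from, and rebuild the pruefi-to-docx mapping whenever that fingerprint changes.

```python
import hashlib
import json
from pathlib import Path


def directory_fingerprint(docx_dir: Path) -> str:
    """Hash the names, sizes and mtimes of all docx files in a directory."""
    entries = sorted(
        (p.name, p.stat().st_size, p.stat().st_mtime_ns)
        for p in docx_dir.glob("*.docx")
    )
    return hashlib.sha256(json.dumps(entries).encode()).hexdigest()

# On startup: if the stored fingerprint differs from the current one,
# the cached pruefi-to-docx mapping is stale and must be rebuilt.
```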
Like many CLI tools, kohlrahbi should tell the user the installed version number after running the command
kohlrahbi --version
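A sketch of how the version could be looked up at runtime via package metadata; with click, one would typically wire this to a `@click.version_option()` decorator on the main command instead.

```python
from importlib import metadata


def get_version(package: str = "kohlrahbi") -> str:
    """Return the installed version of the package, or 'unknown'."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return "unknown"
```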
We'd save one intermediate step if the AHBExtractor already used the MAUS data model, namely a FlatAnwendungshandbuch. This would save us from writing and reading CSV, and we could also spot data errors on the AhbExtractor side already (instead of only when importing the CSV in MAUS).
> tox -re dev
ERROR: pyproject.toml file found.
To use a PEP 517 build-backend you are required to configure tox to use an isolated_build:
https://tox.readthedocs.io/en/latest/example/package.html
tox --version: tox-3.27.1
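The error message itself names the fix for tox 3.x with a PEP 517 build backend: enable the documented `isolated_build` setting.

```ini
# tox.ini
[tox]
isolated_build = True
```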
We should also check the tests with pylint and mypy.
To have the same stack as in MIG_mose or other Python projects, we should use pydantic instead of attrs classes.
Tobias from Lynqtech would like to also store the page numbers indicating where the information comes from.
This should make it easier to validate the parsed result.
Try to find a new structure which is based on classes.
This should reduce the number of arguments that have to be passed at the moment.
One class could represent a row of a docx table.
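A sketch of what such a class could look like (the names are assumptions, not kohlrahbi's actual API): bundling the cells of one AHB table row so helper functions take a single object instead of many arguments.

```python
from dataclasses import dataclass


@dataclass
class AhbTableRow:
    """One row of an AHB docx table."""

    edifact_struktur_cell: str
    middle_cell: str
    bedingung_cell: str

    def is_empty(self) -> bool:
        return not any(
            (self.edifact_struktur_cell, self.middle_cell, self.bedingung_cell)
        )


row = AhbTableRow("SG3 CTA", "Ansprechpartner", "Muss")
assert not row.is_empty()
```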
There should be a dev_requirements folder like in our template repository.
The tox.ini should refer to the pinned dependencies.
Could you also add it to the `pyproject.toml`?
Line 33
Originally posted by @hf-krechan in #99 (review)
Today I learned: The name in the AHB documents on the left side is not the name of the segment group, it is the name of the segment.
So it doesn't make sense to add this segment name to the lines where we only have the segment group.
So the output, for example for 55001, should be:
{
    "ahb_expression": "Kann",
    "data_element": null,
    "guid": "fba40a12-c494-4105-847a-46f7b5f01ef3",
    "index": 25,
    "name": "",
    "section_name": null, <-- I changed this line
    "segment_code": null, <-- because the segment_code is null
    "segment_group_key": "SG3",
    "value_pool_entry": null
},
{
    "ahb_expression": "Muss",
    "data_element": null,
    "guid": "cca45667-760e-4d1d-a123-a30c7acda5e7",
    "index": 26,
    "name": "",
    "section_name": "Ansprechpartner",
    "segment_code": "CTA",
    "segment_group_key": "SG3",
    "value_pool_entry": null
}
You can find the current output here: https://github.com/Hochfrequenz/machine-readable_anwendungshandbuecher/blob/51c7d93e77fafea60c1acb3a9b2ca5fe26c51206/FV2310/UTILMD/flatahb/55001.json#L245-L266
Current situation:
Using, for example, the following command:
kohlrahbi conditions --assume-yes -eemp some/path -o output/path --format-version FV2310
leads to assume_yes = True and generates the output directory if it does not exist.
kohlrahbi conditions -eemp some/path -o output/path --assume-yes --format-version FV2310
leads to assume_yes = False when the existence of the output directory is checked.
Expected/favored behavior:
The order of CLI flags should not matter.
e.g. "00540" and "00187" in this example
for the MIG part see Hochfrequenz/migmose#58
By using append you could avoid using a row index variable.
mypy and the linter correctly complain about the unknown elixir variable in some if cases.
The main workflow works but it is not clean code!
Error message
ahbextractor/helper/read_functions.py:177: error: Cannot determine type of "elixir" [has-type]
Judging by the name of the method `get_row_type` alone, I would not have expected it to somehow modify my edifact_structur_cell. I would have thought it was pure.
Originally posted by @hf-kklein in #53 (comment)
I analyzed which parts of the code are not covered by tests yet.
C:\github\AHBExtractor\src\ahbextractor\__init__.py | 2 | 0 | 0 | 100% |
C:\github\AHBExtractor\src\ahbextractor\helper\__init__.py | 0 | 0 | 0 | 100% |
C:\github\AHBExtractor\src\ahbextractor\helper\check_row_type.py | 49 | 3 | 0 | 94% |
C:\github\AHBExtractor\src\ahbextractor\helper\elixir.py | 36 | 18 | 0 | 50% |
C:\github\AHBExtractor\src\ahbextractor\helper\export_functions.py | 64 | 41 | 0 | 36% |
C:\github\AHBExtractor\src\ahbextractor\helper\write_functions.py | 129 | 19 | 0 | 85% |
C:\github\AHBExtractor\unittests\__init__.py | 0 | 0 | 0 | 100% |
C:\github\AHBExtractor\unittests\test_check_row_type.py | 17 | 0 | 0 | 100% |
C:\github\AHBExtractor\unittests\test_export_functions.py | 8 | 0 | 0 | 100% |
C:\github\AHBExtractor\unittests\test_write_functions.py | 151 | 2 | 0 | 99% |
Total | 1404 | 513 | 0 | 63% |
The biggest gaps in the test coverage are where docx-instances are used as arguments (e.g. Tables, Cells, Paragraphs...)
All MSCONS AHB tables can't be scraped.
Run kohlrahbi --input-path edi_energy_mirror/edi_energy_de/FV2404 --output-path ./machine-readable_anwendungshandbuecher/FV2404 --file-type flatahb --file-type csv --file-type xlsx
☝️ No pruefis were given. I will parse all known pruefis.
INFO [kohlrahbi] start looking for pruefi '13002'
ERROR [kohlrahbi] There was an error opening the file 'edi_energy_mirror/edi_energy_de/FV2404/MSCONSAHB-informatorischeLesefassung3.1cKonsolidierteLesefassungmitFehlerkorrekturenStand12.12.2023_20240331_20231212.docx'
Traceback (most recent call last):
File "/opt/hostedtoolcache/Python/3.11.8/x64/lib/python3.11/site-packages/kohlrahbi/__init__.py", line 285, in get_or_cache_document
doc = docx.Document(ahb_file_path)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/hostedtoolcache/Python/3.11.8/x64/lib/python3.11/site-packages/docx/api.py", line 23, in Document
document_part = Package.open(docx).main_document_part
^^^^^^^^^^^^^^^^^^
File "/opt/hostedtoolcache/Python/3.11.8/x64/lib/python3.11/site-packages/docx/opc/package.py", line 116, in open
pkg_reader = PackageReader.from_file(pkg_file)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/hostedtoolcache/Python/3.11.8/x64/lib/python3.11/site-packages/docx/opc/pkgreader.py", line 22, in from_file
phys_reader = PhysPkgReader(pkg_file)
^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/hostedtoolcache/Python/3.11.8/x64/lib/python3.11/site-packages/docx/opc/phys_pkg.py", line 76, in __init__
self._zipf = ZipFile(pkg_file, "r")
^^^^^^^^^^^^^^^^^^^^^^
File "/opt/hostedtoolcache/Python/3.11.8/x64/lib/python3.11/zipfile.py", line 1286, in __init__
self.fp = io.open(file, filemode)
^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: 'edi_energy_mirror/edi_energy_de/FV2404/MSCONSAHB-informatorischeLesefassung3.1cKonsolidierteLesefassungmitFehlerkorrekturenStand12.12.2023_20240331_20231212.docx'
ERROR [kohlrahbi] Error processing pruefi '13002':
This is exactly (1:1) the code that also appears in the export function beautify_bedingungen. I assume you can remove it in one of the two places or call the function instead.
Originally posted by @hf-aschloegl in #5 (comment)
I'm using kohlrahbi in a CI tool:
Run kohlrahbi --input_path edi_energy_mirror --output_path ./machine-readable_anwendungshandbuecher/FV2210/UTILMD/ --pruefis 11039
kohlrahbi --input_path edi_energy_mirror --output_path ./machine-readable_anwendungshandbuecher/FV2210/UTILMD/ --pruefis 11039
shell: /usr/bin/bash -e {0}
env:
pythonLocation: /opt/hostedtoolcache/Python/3.11.2/x64
PKG_CONFIG_PATH: /opt/hostedtoolcache/Python/3.11.2/x64/lib/pkgconfig
Python_ROOT_DIR: /opt/hostedtoolcache/Python/3.11.2/x64
Python2_ROOT_DIR: /opt/hostedtoolcache/Python/3.11.2/x64
Python3_ROOT_DIR: /opt/hostedtoolcache/Python/3.11.2/x64
LD_LIBRARY_PATH: /opt/hostedtoolcache/Python/3.11.2/x64/lib
INFO [kohlrahbi] start looking for pruefi '11039'
The logs state that kohlrahbi started looking for a pruefi, but nothing more. Was it found? Was any result written anywhere?
As of today we're re-reading the same docx.Documents over and over again to find out if they contain a specific pruefi. Ideally we'd reduce the re-reading and memorize which pruefi is (not) contained in which file. Even the information that a file does not contain a pruefi could significantly speed up the overall scraping, and I think it is easier to integrate into the existing code base.
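A sketch of the proposed memoization (the helper names are assumptions): keep a per-file record of which pruefis were found, so each docx is opened at most once, regardless of how many pruefis are queried against it.

```python
from pathlib import Path

# file -> set of all pruefis it contains (filled lazily, one scan per file)
pruefi_cache: dict[Path, set[str]] = {}


def file_contains_pruefi(path: Path, pruefi: str, scan) -> bool:
    """`scan` is the expensive function that reads the docx once and
    returns the set of all pruefis it contains."""
    if path not in pruefi_cache:
        pruefi_cache[path] = scan(path)
    return pruefi in pruefi_cache[path]
```

Note that the cache also answers negative queries ("file X does not contain pruefi Y") without re-opening the file, which is exactly the cheap win described above.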
In the kohlrahbi flatahb data there are duplicate entries.
For example:
{
    "ahb_expression": "Kann",
    "data_element": null,
    "guid": "17d56a20-a551-41c1-b2ca-c806ab8982c9",
    "index": 24,
    "name": null,
    "section_name": "Ansprechpartner",
    "segment_code": null,
    "segment_group_key": "SG3",
    "value_pool_entry": null
},
{
    "ahb_expression": "Kann",
    "data_element": null,
    "guid": "fba40a12-c494-4105-847a-46f7b5f01ef3",
    "index": 25,
    "name": "",
    "section_name": "Ansprechpartner",
    "segment_code": null,
    "segment_group_key": "SG3",
    "value_pool_entry": null
},
These duplicate lines should be removed.
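A sketch of how such duplicates could be dropped. The assumption (taken from the example above) is that two lines count as duplicates when everything except `guid`, `index`, and the name matches; that heuristic may need refinement against real data.

```python
def remove_duplicate_lines(lines: list[dict]) -> list[dict]:
    """Drop entries that only differ in guid, index, or name."""
    seen: set[tuple] = set()
    result = []
    for line in lines:
        key = tuple(
            (k, v)
            for k, v in sorted(line.items())
            if k not in ("guid", "index", "name")
        )
        if key not in seen:
            seen.add(key)
            result.append(line)
    return result
```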
If there are several docx files for a given format and format version, kohlrAHBi should always pick the most recent file.
For example:
In FV2310 for UTILMD, kohlrAHBi picks
"UTILMDAHBStrom-informatorischeLesefassung1.1KonsolidierteLesefassungmitFehlerkorrekturenStand29.06.2023_20230928_20231001.docx"
instead of
"UTILMDAHBStrom-informatorischeLesefassung1.1KonsolidierteLesefassungmitFehlerkorrekturenStand12.12.2023_20240402_20231212.docx"
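A sketch of one way to pick the newest file, assuming the trailing `_YYYYMMDD_YYYYMMDD` part of the file name encodes the relevant dates (this naming convention is inferred from the example above; the shortened file names below are illustrative).

```python
import re


def pick_most_recent(file_names: list[str]) -> str:
    """Pick the file whose trailing _YYYYMMDD_YYYYMMDD dates are latest."""
    def date_key(name: str) -> tuple[str, str]:
        match = re.search(r"_(\d{8})_(\d{8})\.docx$", name)
        return (match.group(1), match.group(2)) if match else ("", "")

    return max(file_names, key=date_key)


files = [
    "UTILMDAHBStrom-Stand29.06.2023_20230928_20231001.docx",
    "UTILMDAHBStrom-Stand12.12.2023_20240402_20231212.docx",
]
assert pick_most_recent(files) == files[1]
```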
For the AHB Tabellen (alias AHBesser) frontend we would like to add the condition texts.
At the moment this information is missing in the flatahb JSON files.
For example, in the 55001:
{
    "ahb_expression": "X [931][494]",
    "data_element": "2380",
    "guid": "a554bcf2-1a61-4b65-8a21-1f5a7ed86941",
    "index": 15,
    "name": "Datum oder Uhrzeit oderZeitspanne, Wert",
    "section_name": "Nachrichtendatum",
    "segment_code": "DTM",
    "segment_group_key": null,
    "value_pool_entry": null
},
This information about what is behind the [931] is already parsed and available in the CSV output: https://github.com/Hochfrequenz/machine-readable_anwendungshandbuecher/blob/51c7d93e77fafea60c1acb3a9b2ca5fe26c51206/FV2310/UTILMD/csv/55001.csv#L14-L15
12,Nachrichtendatum,,DTM,2380,,,"Datum oder Uhrzeit oderZeitspanne, Wert",X [931][494],"[494] Das hier genannte Datum muss der Zeitpunkt sein, zu dem das Dokument erstellt wurde, oder ein Zeitpunkt, der davor liegt
[931] Format: ZZZ = +00"
This should be part of the json file too.
The conditions should be in a list.
{
    "ahb_expression": "X [931][494]",
    "conditions": [
        "[494] Das hier genannte Datum muss der Zeitpunkt sein, zu dem das Dokument erstellt wurde, oder ein Zeitpunkt, der davor liegt",
        "[931] Format: ZZZ = +00"
    ],
    "data_element": "2380",
    "guid": "a554bcf2-1a61-4b65-8a21-1f5a7ed86941",
    "index": 15,
    "name": "Datum oder Uhrzeit oderZeitspanne, Wert",
    "section_name": "Nachrichtendatum",
    "segment_code": "DTM",
    "segment_group_key": null,
    "value_pool_entry": null
},
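A sketch of how the multi-line condition cell from the CSV could be split into the proposed list, assuming every condition text starts with its `[key]` marker:

```python
import re


def split_conditions(condition_text: str) -> list[str]:
    """Split a cell like '[494] ...\n[931] ...' into one string per condition."""
    parts = re.split(r"(?=\[\d+\])", condition_text.strip())
    return [part.strip() for part in parts if part.strip()]


text = (
    "[494] Das hier genannte Datum muss der Zeitpunkt sein, zu dem das "
    "Dokument erstellt wurde, oder ein Zeitpunkt, der davor liegt\n"
    "[931] Format: ZZZ = +00"
)
assert split_conditions(text)[1] == "[931] Format: ZZZ = +00"
```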
Add the option to export all Bedingungen for each Format.
The spreadsheet should contain two columns: Bedingung-Key and Bedingung-Text.
All Formate (e.g. UTILMD, ORDERS, APERAK etc.) should be in one Excel file, but on different sheets.
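A sketch of the proposed export (all helper names are assumptions). The grouping is plain Python; the actual xlsx writing could be done with `pandas.ExcelWriter`, one sheet per Format.

```python
def group_conditions_by_format(
    conditions: list[tuple[str, str, str]],  # (format, key, text)
) -> dict[str, list[tuple[str, str]]]:
    """Group (Bedingung-Key, Bedingung-Text) rows by EDIFACT format."""
    sheets: dict[str, list[tuple[str, str]]] = {}
    for edifact_format, key, text in conditions:
        sheets.setdefault(edifact_format, []).append((key, text))
    return sheets


def write_conditions_xlsx(sheets: dict, path: str) -> None:
    import pandas as pd  # local import: only needed for the actual export

    with pd.ExcelWriter(path) as writer:
        for edifact_format, rows in sheets.items():
            pd.DataFrame(
                rows, columns=["Bedingung-Key", "Bedingung-Text"]
            ).to_excel(writer, sheet_name=edifact_format, index=False)
```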
kohlrahbi/src/kohlrahbi/read_functions.py
Line 107 in ab7b671
and 14 lines later:
kohlrahbi/src/kohlrahbi/read_functions.py
Line 121 in ab7b671
Is one of them redundant?
It would be nice to have some colors in the terminal outputs.
This can easily be achieved by just inserting some escape strings into the output, see How to print colored text to the terminal?
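A minimal sketch using raw ANSI escape codes; libraries like colorama or click's `click.style` offer the same with better Windows support.

```python
GREEN = "\033[92m"
RED = "\033[91m"
RESET = "\033[0m"


def colored(text: str, color: str) -> str:
    """Wrap text in an ANSI color code and reset afterwards."""
    return f"{color}{text}{RESET}"


print(colored("start looking for pruefi '11039'", GREEN))
```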
Tried to use AHB in a CI setup:
⚠️ The output directory does not exist.
Aborted!
Should I try to create the directory at 'output'? [Y/n]:
Error: Process completed with exit code 1.
https://github.com/Hochfrequenz/edi_energy_mirror/actions/runs/4235701733/jobs/7359650410
There should be a way to run through with "always yes" mode.
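A sketch of the requested behavior (the function name is an assumption): in "always yes" mode the output directory is created without prompting, so the tool can run non-interactively in CI.

```python
from pathlib import Path


def ensure_output_dir(path: Path, assume_yes: bool) -> None:
    """Create the output directory, prompting only in interactive mode."""
    if path.exists():
        return
    if not assume_yes:
        answer = input(f"Should I try to create the directory at '{path}'? [Y/n]: ")
        if answer.strip().lower() == "n":
            raise SystemExit(1)
    path.mkdir(parents=True)
```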
Move this function into a common module to avoid code duplication and circular imports
Originally posted by @hf-krechan in #123 (comment)
With the PR #209 we can now read the concrete file name for each Prüfi.
But the file name also contains the version number of the document, which correlates with the format version (e.g. FV2310).
So the Python script collect_pruefis.py should be able to create different format versions of the output all_known_pruefis.toml. I would suggest the following folder setup:
|- all_known_pruefis
|- FV2310_all_known_pruefis
|- FV2404_all_known_pruefis
|- FV2410_all_known_pruefis
|- FV2504_all_known_pruefis
|- ...
If you then run kohlrahbi, you should provide a flag like --format-version / -fv and a format version, e.g. FV2310.
The total command to read the AHB for the Prüfi 13007 in the format version FV2310 can look like this:
python /src/kohlrahbi/__init__.py --format-version FV2310 -p 13007 --file-type flatahb --file-type csv --file-type xlsx --input_path /Users/kevin/workspaces/hochfrequenz/edi_energy_mirror/edi_energy_de/future --output_path /Users/kevin/workspaces/hochfrequenz/kohlrahbi/output
There should be a default format version. But I am not sure if it should be the current or future format version.
@hf-kklein do you have any feelings about this question?
It would actually be possible if you read in this file:
But that would mean extra effort that I wouldn't put into this PR. Performance is not our main priority (and never will be at this point).
Originally posted by @hf-kklein in #53 (comment)