
4training / pywikitools


Python tools for mediawiki with the Translate plugin (some based on pywikibot)

License: Other

Makefile 0.47% Python 99.53%
python pywikibot libreoffice-api libreoffice

pywikitools's Introduction


Python Mediawiki Tools

Python tools for mediawiki with the Translate plugin (some based on pywikibot). This is used for https://www.4training.net to remove some bottlenecks of the project, providing different automation and reports (TODO: document the outcomes of these scripts). Hopefully others can benefit from some of the scripts as well!

  • Free software: GNU General Public License v3
The tools use the mediawiki API. URL and all documentation: https://www.4training.net/mediawiki/api.php
Read-only scripts make direct use of the API calls. Bots writing to the system use the pywikibot framework: https://www.mediawiki.org/wiki/Manual:Pywikibot

Setup:

Note: pywikitools base path refers to the directory where you can find README.rst, CONTRIBUTING.rst and requirements.txt.

  1. Install required libraries: pip install -r requirements.txt

    • Follow these steps if you are using a virtual environment on a Linux machine:

      • To install virtualenv: sudo python3 -m pip install virtualenv
      • To create a new virtual environment: virtualenv --system-site-packages new_venv_name. It is important to include the flag --system-site-packages, otherwise the virtual environment will not be able to import the uno package into your working environment.
      • To activate the virtual environment: source new_venv_name/bin/activate
      • Change into pywikitools base path and run pip install -r requirements.txt.
  2. Install LibreOffice UNO (Python bridge): sudo apt-get install python3-uno (on Linux)

    • This is not necessary for all scripts, only for our LibreOffice module and scripts using it (translateodt.py)
    • Running the complete test suite requires it, though
  3. Set up configuration in config.ini:

    • cp config.example.ini config.ini
    • Change the base path in config.ini to the directory where you cloned the pywikitools base folder, for example: base = /YOUR_HOME_PATH/pywikitools/
    • Configure all other necessary options like user names and site (connect to 4training.net / test.4training.net)
  4. You're ready to go! Look at the different scripts and how to invoke them and try them out! To get to know everything and to understand what is going on, set the logging level to INFO (default is WARN) by adding -l info.

Run scripts

python3 path/to/script args

If you're not yet logged in, pywikibot will ask you for the password for the user you defined in config.ini. After successful login, the login cookie is stored in pywikibot-[UserName].lwp so you don't have to log in every time again.

Testing and ensuring good code quality

From your base pywikitools path, run python3 -m unittest discover -s pywikitools/test to run the test suite. Also run flake8 . to check for linting issues.

With GitHub Actions, these two commands also run on every push or pull request in the repository. The goal is to cover all important code parts with good tests. Some of the tests make real API calls, which is why running them can take half a minute.

We use codecov to calculate the coverage ratio. You can see it in the codecov badge on the repository page or check out the details on codecov.io

File overview: Configuration and main entry scripts

autotranslate.py
Create a first translation draft by using machine translation with DeepL or Google Translate. Introduction for users: https://www.youtube.com/watch?v=czsqgA6Ua7s
config.example.ini
Example for all configuration settings
config.ini
Not in repository, needs to be created by you. Configure here for each script: Which system should we connect to? (www.4training.net / test.4training.net) Which user name does it use?
correct_bot.py
Automatically correct simple mistakes in texts of different languages
resources_bot.py
Automatically scan through all available translations, gather information on each language and do many useful things with this information, like filling out the “Available training resources in...” for each language and exporting the worksheets into HTML
translateodt.py
Processes an English ODT file and replaces its content with the translation into another language. Introduction for users: https://www.youtube.com/watch?v=g9lZbLaXma0
pywikitools/fortraininglib.py
Our central library with important functions and API calls

More tools:

downloadalltranslations.py
Download all translated worksheets of a given worksheet
dropboxupload.py
Upload files into dropbox
mediawiki2drupal.py
Export content from mediawiki into Drupal

License

Jesus says in Matthew 10:8, “Freely you have received; freely give.”

We follow His example and believe His principles are well expressed in the developer world through free and open-source software. That's why we want you to have the four freedoms to freely use, study, share and improve this software. We only require you to release any derived work under the same conditions (you're not allowed to take this code, build upon it and make the result proprietary):

GNU General Public License (Version 3)

Contributing and coding conventions

By contributing you release your contributed code under the licensing terms explained above. Thank you!

For more details see CONTRIBUTING.rst

Communication

Please subscribe to the repository to get informed on changes. We use github issues for specific tasks, wishes, bugs etc. Please don’t hesitate to open a new one! Assign yourself on the issues that you plan to work on.

Credits

This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.

pywikitools's People

Contributors

ammarsayya, chenjieen, doublethep, hmuiier, holybiber, janetzki, josuakugler, lemonade-thinking


pywikitools's Issues

Implement test function for checking against revisions on the server

Based on basic function call (see #19 ): Have an older revision of one translation unit as the input and check whether the output of our corrector is identical to the newer revision of it.
Takes three arguments:

  • name of translation unit
  • old revision ID
  • new revision ID

This is similar to checking
https://www.4training.net/mediawiki/index.php?title=Translations:How_to_Continue_After_a_Prayer_Time/1/ar&type=revision&diff=62258&oldid=62195

correctbot: report for user

After a run, correctbot should send an email to the user (see generateodtbot.py for an example) with a report on the corrections, roughly saying:

I corrected 8 translation units. The following changes were made:
- fixed capitalization: 3x
- fixed misplaced spaces: 2x
- corrected dash: 2x
- fixed general typos: 1x
- fixed file name: 1x
- language specific fixes: 2x (could be shown also in more detail)

Please review all my changes here:
https://www.4training.net/mediawiki/index.php?title=How_to_Continue_After_a_Prayer_Time/ar&curid=20946&diff=62278&oldid=62238
In case I made a mistake, please contact an administrator to revert the changes.

Create German corrector rules

Implement rules together with meaningful tests:

  • no plain quotation marks like "
  • no English quotation marks like ”
  • German quotation marks consist of two parts: „ at the beginning and “ at the end
  • throw error in case of unresolvable situations (e.g. found three quotation marks in one paragraph)
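The rules above could be sketched roughly as follows. This is a deliberately naive illustration, not the project's GermanCorrector: the function name correct_quotes and the choice of ValueError are assumptions, and the sketch simply alternates opening and closing marks without looking at surrounding characters.

```python
import re

# Characters treated as "some kind of quotation mark":
# plain ", English opening/closing quotes and the German opening quote.
_ANY_QUOTE = re.compile('["\u201c\u201d\u201e]')

GERMAN_OPEN = "\u201e"   # German opening quotation mark
GERMAN_CLOSE = "\u201c"  # German closing quotation mark


def correct_quotes(text: str) -> str:
    """Replace quotation marks pairwise with German opening/closing marks.

    Raises ValueError for an odd number of quotation marks, because the
    pairing cannot be resolved automatically in that case.
    """
    parts = _ANY_QUOTE.split(text)
    quote_count = len(parts) - 1
    if quote_count % 2 != 0:
        raise ValueError(f"Found {quote_count} quotation marks, cannot pair them")
    result = [parts[0]]
    for index, part in enumerate(parts[1:]):
        # Even positions open a quotation, odd positions close it.
        result.append(GERMAN_OPEN if index % 2 == 0 else GERMAN_CLOSE)
        result.append(part)
    return "".join(result)
```

A real implementation would probably work paragraph by paragraph and log a warning instead of raising, depending on how the bot reports problems.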

Remove is_incomplete()

The class TranslationProgress has a function is_incomplete() that isn't really used (except in a log warning). Unless there are good reasons to make more use of it, remove it completely.

Restructure project to resolve import issues

The Python import system has caused me a lot of headaches... currently it's necessary to correctly set PYTHONPATH in order for all imports to work properly. We need a way to structure our package so that this won't be necessary but Python scripts can just be run directly.

Relative versus absolute imports: which to use?

Relative imports

  • Problem: they don't work relative to the current file / directory (yes, that is not intuitive) but to package / module names! But package/module names are dependent on how a script is called and from which directory
  • It would be necessary to change script invocation to the form python3 -m pywikitools.resourcesbot and not directly call python3 script.py. Main functionality of the resourcesbot e.g. could go into pywikitools/resourcesbot/__main__.py
  • Disadvantages:
    • Calling scripts via the pywikibot pwb.py wrapper script isn't possible anymore. That wrapper is helpful for simulating all write activities with pwb.py -simulate script.py
    • Less intuitive: Needs to be documented well so that people know how to start something

Absolute imports

  • from pywikitools.resourcesbot.changes import ChangeType requires that python knows where to find the package pywikitools. That's not automatically clear and currently needs to be defined from outside! (either adding the root path of this repository to PYTHONPATH or write it to a .pth file in the right directory)
  • Disadvantages:
    • This needs to be explained in the documentation and adds another tedious step to get something running.
    • If I happen to make another copy of the repository and work on it, none of its code will be called. Instead the code of the other copy always runs, which is a nasty and not at all obvious mistake
  • Advantage: Generally this seems to be the normal approach when making this package distributable via pypi and for installation

Proposed solution

  • Use absolute imports generally
  • Reduce the entry-point scripts to a bare minimum of argument handling and then calling the modules
  • Put these scripts into the root directory of the repository. That way absolute imports will directly work and all scripts can be called simply via python3 script.py
  • Messing around with PYTHONPATH or .pth files wouldn't be necessary anymore and always the local code would be executed in development

Resourcesbot: Print reports

For each language: how many worksheets are translated? How many of them have PDF files (and which)?
Where did we find incomplete translations?
And combining all these into an overall report

Resourcesbot: Refactor summary generation

  • use new LanguageInfo/WorksheetInfo functions now
  • new postprocessing classes:
    LanguageReport(LanguagePostProcessor)
    GlobalReport(GlobalPostProcessor)
  • migrate all functionality there ( total_summary(), log_languagereport(), create_summary())

CorrectBot: Trigger emptying of job queue

After saving all corrections of individual translation units back to the mediawiki system, trigger https://www.mediawiki.org/wiki/Manual:RunJobs.php to directly empty the job queue. This ensures that the resulting translated page is also updated directly by FuzzyBot putting together the changes on individual translation units.
Wait some seconds and check with the API function of #55 if job queue is really empty. Print a warning if it isn't.

Implement test framework basics

Have base class CorrectTest(unittest.TestCase) that all language-specific tests will be based on.

Needs to take language code of the language we're testing in constructor (to avoid redundancy of having to define that in every line again).

Have essential function to check whether given input produces expected output (one-line function call)
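One possible shape for such a base class is sketched below. Two points are assumptions rather than part of the issue: the language code is set as a class attribute instead of a constructor argument (unittest instantiates test cases itself, which makes custom constructors awkward), and CORRECTORS is a hypothetical stand-in for however the real corrector is looked up.

```python
import unittest
from typing import Callable, Dict

# Hypothetical registry standing in for the real corrector lookup.
CORRECTORS: Dict[str, Callable[[str], str]] = {
    "de": lambda text: text.replace("  ", " "),
}


class CorrectTest(unittest.TestCase):
    """Base class: subclasses only set language_code and call compare()."""

    language_code = ""

    def compare(self, original: str, expected: str) -> None:
        # One-line check: does the corrector turn original into expected?
        corrector = CORRECTORS[self.language_code]
        self.assertEqual(corrector(original), expected)


class TestGermanCorrector(CorrectTest):
    language_code = "de"

    def test_double_space(self) -> None:
        self.compare("Hallo  Welt", "Hallo Welt")
```

With this layout each language-specific test states the language once and every check stays a single call.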

Implement function native_to_standard_numeral

Several languages use their own numerals instead of the "standard" numerals 0123456789. This is a challenge for translateodt.py as the version number is written in the native language and the script can't know if it is identical to the English version.
Implement a function
native_to_standard_numeral(languagecode: str, native: str) -> str
-> Replace the numerals, e.g. for Hindi १ -> 1 . Other characters (especially the dot .) remain unchanged.
For testing:
Hindi: native_to_standard_numeral('hi', '१.२') == '1.2'
Kannada: native_to_standard_numeral('kn', '೧.೨') == '1.2'

Could go into fortraininglib.py or become a separate script.

We currently need the following languages:
Hindi = 'hi': https://en.wikipedia.org/wiki/Hindustani_numerals
Kannada = 'kn': https://en.wikibooks.org/wiki/Kannada/Numbers
Tamil = 'ta': https://en.wikipedia.org/wiki/Tamil_numerals
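A minimal sketch of the requested function, assuming that each script's ten digits occupy consecutive Unicode code points starting at its zero digit (true for Devanagari, Kannada and Tamil). Returning the input unchanged for unknown language codes is an assumption; the issue does not say what should happen then.

```python
# Zero digit of each supported script; digits 1-9 follow it directly.
_NATIVE_ZERO = {
    "hi": "\u0966",  # Devanagari digit zero
    "kn": "\u0ce6",  # Kannada digit zero
    "ta": "\u0be6",  # Tamil digit zero
}


def native_to_standard_numeral(languagecode: str, native: str) -> str:
    """Replace native digits with 0-9; all other characters stay unchanged."""
    zero = _NATIVE_ZERO.get(languagecode)
    if zero is None:
        # Assumption: languages without native numerals pass through as-is.
        return native
    table = {ord(zero) + digit: str(digit) for digit in range(10)}
    return native.translate(table)
```

Supporting another language would then only require adding its zero digit to the dictionary.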

ResourcesBot: Improve --rewrite flag

Currently we only have a --rewrite-all flag. It would be good to introduce better control to select only individual plugins to do a rewrite:

  • --rewrite all has the same functionality as --rewrite-all
  • --rewrite list: Rewrite available resources lists (WriteList)
  • --rewrite report: Rewrite language report (WriteReport)
  • --rewrite html: Rewrite the exported HTML (ExportHTML)
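The option described above could be declared with argparse roughly like this. The helper names build_parser and modules_to_rewrite are hypothetical, and the mapping to module names simply follows the list above.

```python
import argparse
from typing import Optional, Set

# Mapping from command line value to the post-processing module it triggers.
REWRITE_TARGETS = {
    "list": "WriteList",
    "report": "WriteReport",
    "html": "ExportHTML",
}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="resources_bot.py")
    parser.add_argument(
        "--rewrite",
        choices=["all", *REWRITE_TARGETS],
        default=None,
        help="force a rewrite for all modules or for one specific module",
    )
    return parser


def modules_to_rewrite(choice: Optional[str]) -> Set[str]:
    """Translate the --rewrite value into the set of modules to rewrite."""
    if choice is None:
        return set()
    if choice == "all":
        return set(REWRITE_TARGETS.values())
    return {REWRITE_TARGETS[choice]}
```

Using choices lets argparse reject unknown values with a helpful error message before any bot code runs.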

ResourcesBot: Write report to mediawiki (language summary page)

We introduce new internal language summary pages for translators / translation coordinators, following the example of https://www.4training.net/4training:TestLanguage
ResourcesBot should update these pages automatically if there are any changes.

Implement as a GlobalPostProcessor as we need to read from the English LanguageInfo object as well (to be able to compare versions).
The WriteLists module has a similar functionality: create the mediawiki string, find its position in the whole page, replace if there are changes.

Coloring details:

  • Version is either green (same as English) or red
  • If version is red, no other cell can be green. It can be either orange (something is already done) or red (nothing done yet)
  • If progress is unfinished (by the definition of TranslationProgress.is_unfinished()), progress should be red. If it's incomplete, progress should be orange. If it's 100% progress can be green (if version is green as well)
  • PDF and ODT fields are red if those files are missing. If they exist, their color is green or orange, depending on the color of version.
  • The first two columns are red if all the rest is red, green if all the rest is green and orange for all other cases

Depends on #47
#36 is probably not necessary anymore afterwards

translateodt: Add command line option --ignore translation unit

Add command line option so that the specified translation unit won't get processed (and we don't have warnings that are confusing for translators).

Example: https://www.4training.net/Bible_Reading_Hints_(Seven_Stories_full_of_Hope)
This gives a number of warnings because there are translation units that are irrelevant for us and should be ignored.

It must be possible to have this option multiple times, like
translateodt.py --ignore "Template:BibleReadingHints/2" --ignore "Template:BibleReadingHints/3" ...
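A repeatable option like this maps naturally onto argparse's append action. In the sketch below, the positional arguments worksheet and languagecode are assumptions about how the script is invoked; only --ignore comes from the issue.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="translateodt.py")
    # Assumed positional arguments, shown only to make the example complete.
    parser.add_argument("worksheet")
    parser.add_argument("languagecode")
    parser.add_argument(
        "--ignore",
        action="append",   # every occurrence adds one entry to the list
        default=[],
        metavar="UNIT",
        help="translation unit to skip (can be given multiple times)",
    )
    return parser
```

After parsing, args.ignore is a plain list of translation unit names, empty when the option was not used.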

Clear dropbox for a specified language

Feature wish: Add functionality to delete all generated files for a given language (takes languagecode as parameter) and make this available for users

ResourcesBot: Write translated worksheet titles into interface messages

The user interface of 4training.net is translated into a few languages. Part of this are the worksheet titles shown in the sidebar, which are stored in mediawiki system messages like https://www.4training.net/MediaWiki:Sidebar-hearingfromgod
Translations of this are stored also with the Translate plugin, e.g. the German translation of the title is found at https://www.4training.net/MediaWiki:Sidebar-hearingfromgod/de
There are technical reasons why it's not possible to directly take everything from https://www.4training.net/Hearing_from_God - so we need to copy the translations of worksheet titles into these system messages (was done manually until now)

The task is now to automate this step, e.g. copy from https://www.4training.net/Translations:Hearing_from_God/Page_display_title/de to https://www.4training.net/MediaWiki:Sidebar-hearingfromgod/de

Implementation should be a new module: class WriteSidebarMessages(LanguagePostProcessor)

resourcesbot: Mark updated language information page for translation

When a language information page is updated, it afterwards needs to be marked for translation so that the translated versions of it get updated as well. Currently that still needs to be done manually - it would be great if resourcesbot can do that automatically.

The first call is https://www.4training.net/mediawiki/index.php?title=Special:PageTranslation&target=Spanish&do=mark (GET)
and then a POST call follows.

Bad example where this hasn't been done in a long while:
https://www.4training.net/Spanish/de vs. https://www.4training.net/Spanish

Bug in ResourcesBot: Existing files for Georgian worksheet throw error

The Georgian translation of "Four Kinds of Disciples" seems to have correct PDF and ODT files: https://www.4training.net/Four_Kinds_of_Disciples/ka
However, ResourcesBot does not list the worksheet in https://www.4training.net/Georgian and https://www.4training.net/4training:Georgian

The Log shows the following warning from ResourcesBot._add_file_type():
pywikitools.resourcesbot WARNING: Exception thrown for odt file: Query on [[en:File:ოთხი სახის მოწაფეები.odt]] returned data on 'File:Ოთხი სახის მოწაფეები.odt'

pywikitools.resourcesbot WARNING: Exception thrown for pdf file: Query on [[en:File:ოთხი სახის მოწაფეები.pdf]] returned data on 'File:Ოთხი სახის მოწაფეები.pdf'

TODO: Investigate why pywikibot can't access the FilePage for these two worksheets correctly.

resourcesbot: introduce class LanguageInfo

Currently the ugly data structure global_result holds for each language a dictionary of worksheet names of a dictionary of properties. Make this nice and OOP by introducing a class LanguageInfo, holding the information on all the worksheets in objects of class WorksheetInfo.

Create French corrector rules

Implement rules together with tests:

  • "example" is English, "exemple" is French
  • quotation marks: replace all other quotation marks with « Foo » (with non-breaking whitespaces \u00a0 before/after the guillemets!)

resourcesbot: detect when PDF is linked to in more than one worksheet

When two worksheets link to the very same PDF, most likely a mistake happened. Detect the case and print a WARNING about it.
Probably take the first occurrence as the legitimate one and ignore any further uses.

Easiest way to implement that is probably a dictionary with PDF name -> worksheet/language-code. Every PDF found is added into this dict and in case the key already exists, issue a warning and ignore it.

Currently the issue can be found in https://www.4training.net/4training:Ku.json
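The dictionary idea could look roughly like this. The function name find_pdf_owners and the tuple layout of the input are assumptions made for the sketch; the real bot would take this information from its own data structures.

```python
import logging
from typing import Dict, Iterable, Tuple

logger = logging.getLogger("pywikitools.resourcesbot")


def find_pdf_owners(
    entries: Iterable[Tuple[str, str, str]]
) -> Dict[str, Tuple[str, str]]:
    """Map each PDF name to the first (worksheet, language code) using it.

    entries yields (worksheet, language_code, pdf_name) tuples. Any later
    use of an already known PDF is logged as a warning and ignored.
    """
    owners: Dict[str, Tuple[str, str]] = {}
    for worksheet, language_code, pdf_name in entries:
        if pdf_name in owners:
            logger.warning(
                "PDF %s is already used by %s, ignoring it for %s/%s",
                pdf_name, owners[pdf_name], worksheet, language_code,
            )
            continue
        owners[pdf_name] = (worksheet, language_code)
    return owners
```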

resourcesbot: Introduce class WorksheetInfo

Currently the last level of the ugly global_result data structure contains a dictionary with the following fields:
'title', 'pdf-timestamp', 'pdf', 'odt-timestamp', 'odt'
Wrap this into a class WorksheetInfo and use this instead of a dictionary.
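As a starting point, the five dictionary fields could be wrapped in a dataclass as sketched here. Treating the file fields as optional and the helper methods from_dict and has_pdf are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class WorksheetInfo:
    """Holds what was previously stored in the innermost dictionary."""

    title: str
    pdf: Optional[str] = None
    pdf_timestamp: Optional[str] = None
    odt: Optional[str] = None
    odt_timestamp: Optional[str] = None

    @classmethod
    def from_dict(cls, data: Dict[str, str]) -> "WorksheetInfo":
        # Keys follow the existing dictionary layout (with hyphens).
        return cls(
            title=data["title"],
            pdf=data.get("pdf"),
            pdf_timestamp=data.get("pdf-timestamp"),
            odt=data.get("odt"),
            odt_timestamp=data.get("odt-timestamp"),
        )

    def has_pdf(self) -> bool:
        return self.pdf is not None
```

A from_dict constructor like this would allow migrating callers step by step while the old dictionaries still exist.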

ResourcesBot: Save also unfinished worksheets

Currently unfinished translations (some part of the worksheet is translated but more than 4 translation units are missing) are ignored completely.
Change this behavior so that information on these worksheets is also included in the JSON files (but WriteLists and ExportHTML should still ignore them)

translateodt: Handle headline translation also when there are variants

Problem: For all worksheets that have different variants (mainly God's Story), the headline won't be translated. Error message for example for https://www.4training.net/God%27s_Story_(first_and_last_sacrifice)/ro:
WARNING: Not found:
God's Story (first and last sacrifice)
Translation:
Povestea lui Dumnezeu (primul şi ultimul sacrificu)

Handle article headline in such a way that "God's Story" gets translated by "Povestea lui Dumnezeu".

resourcesbot: define post-processing interface

Design a simple interface to enable different post-processing features for resourcesbot.
First we need an overview of all changes since last run (e.g. German: added new worksheet "Family and our Relationship with God"; Arabic: added PDF for worksheet "Family and our Relationship with God")
-> out of that I can call extensions (should be in separate files), for example:

  • re-create zip file with all German worksheets (calls zipgenerator with argument 'de')
  • send update email on notification newsletters for German and Arabic

Refactor language corrector classes

  • Have file for each language with the naming convention correctors/de.py for German etc.
  • Automatically load the necessary class
  • let corrector class inherit from base classes which contain rules applying to several languages
    Example:
    class GermanCorrector (UniversalCorrector, ...):
    class ArabicCorrector (UniversalCorrector, RTLCorrector):
  • let corrector call all existing methods of our language-specific class one after another
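The inheritance and dispatch idea could be sketched as follows. The method prefix correct_, the function run_corrector and the two example rules are assumptions for illustration; in particular the comma rule in RTLCorrector is invented to show how a shared base class contributes a rule.

```python
import re


class UniversalCorrector:
    """Rules that apply to every language."""

    def correct_multiple_spaces(self, text: str) -> str:
        return re.sub(r" {2,}", " ", text)


class RTLCorrector:
    """Rules shared by right-to-left languages (hypothetical example rule)."""

    def correct_comma(self, text: str) -> str:
        return text.replace(",", "\u060c")  # Arabic comma


class ArabicCorrector(UniversalCorrector, RTLCorrector):
    """Inherits all rules; language-specific rules would be added here."""


def run_corrector(corrector: object, text: str) -> str:
    """Call every correct_* method of the corrector, in alphabetical order."""
    for name in sorted(dir(corrector)):
        if name.startswith("correct_"):
            text = getattr(corrector, name)(text)
    return text
```

Because dir() also lists inherited methods, rules from all base classes are picked up without registering them anywhere.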

translateodt: enable per-worksheet configuration in the mediawiki system

Enable a configuration file with options for the handling of a specific worksheet, for example for the --ignore option (issue #3)

Suggestion: Save the configuration in
https://www.4training.net/4training:Bible_Reading_Hints_(Seven_Stories_full_of_Hope).config
-> Script tries to see if such a file exists and if yes, uses the relevant options (could be simple text, one command per line)

Idea #1: Using the configparser syntax/lib may simplify processing these options even more

Add method for retrieving specific revision of a page

Either extend fortraininglib.get_page_source() or write a new function.
This is necessary for #30

This is the API call for retrieving revision id 671 of translation unit Translations:Prayer/1/de
https://www.4training.net/mediawiki/api.php?action=query&prop=revisions&rvstartid=671&rvlimit=1&rvprop=content&format=json&titles=Translations:Prayer/1/de

Bonus: This gives a warning (as well as the current implementation of get_page_source()) because of a missing rvslots property. Research this and probably add rvslots=* into our call
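For illustration, the query above can be assembled with the standard library as shown here. The function name build_revision_query is hypothetical, and adding rvslots=* follows the suggestion in the bonus note, so it should be verified against the MediaWiki API documentation before relying on it.

```python
from urllib.parse import urlencode

API_URL = "https://www.4training.net/mediawiki/api.php"


def build_revision_query(title: str, revision_id: int) -> str:
    """Return the API URL that requests one specific revision of a page."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvstartid": revision_id,
        "rvlimit": 1,
        "rvprop": "content",
        "rvslots": "*",  # assumption: meant to avoid the missing-rvslots warning
        "format": "json",
        "titles": title,
    }
    return f"{API_URL}?{urlencode(params)}"
```

The sketch only builds the URL; sending the request and extracting the content from the JSON response would stay in fortraininglib.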

translateodt: introduce RTL/LTR marks for RTL languages

For right-to-left languages we need correct RTL/LTR marks when there is both RTL text and English (LTR) text together. This is the case in the document properties. Currently some parts get displayed in a wrong way. Introducing these marks should correct that:
https://en.wikipedia.org/wiki/Right-to-left_mark

RTL titles with parentheses are also a challenge and need to end with an RTL mark. Otherwise the text is treated as mixed RTL and LTR, and the closing parenthesis is shown in a confusing place.
Check that for headlines and file names like https://www.4training.net/Bible_Reading_Hints_(Seven_Stories_full_of_Hope)/ar
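A very small sketch of the idea for titles, assuming the fix is to append the Unicode right-to-left mark (U+200F) whenever a title ends with a parenthesis. The function name terminate_rtl_title is hypothetical, and whether this is sufficient for document properties and file names still needs to be checked in LibreOffice.

```python
RIGHT_TO_LEFT_MARK = "\u200f"


def terminate_rtl_title(title: str) -> str:
    """Append an RTL mark to titles ending with a parenthesis.

    Intended only for titles in right-to-left languages; other titles
    are returned unchanged.
    """
    if title.endswith((")", "(")):
        return title + RIGHT_TO_LEFT_MARK
    return title
```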

CorrectBot: Check links in UniversalCorrector

Somewhat similar to TranslationUnit.remove_links(): issue a warning when finding a link like [[Pardonner pas à pas|Pardonner pas à pas]] (example taken from Overcoming Fear and Anger in French).

Rusty error handling instead of passing error messages as string

@return on error: returns string with error message

I think it is worth at least having a look at the following error handling approach inspired by Rust, instead of returning errors as strings: https://rednafi.github.io/reflections/go-rusty-with-exception-handling-in-python.html

Using the result approach, it is always clear what kind of object is returned in case the calculation fails. The return value carries the calculated result in case of success and an error message in case of an error. This gives the caller the freedom to decide whether the error is worth printing.
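To make the idea concrete, here is a minimal sketch of such a result type in plain Python. The names Ok, Err and parse_revision_id are illustrative assumptions; the linked article and existing libraries may use a different design.

```python
from dataclasses import dataclass
from typing import Generic, TypeVar, Union

T = TypeVar("T")


@dataclass(frozen=True)
class Ok(Generic[T]):
    """Successful outcome carrying the calculated value."""

    value: T


@dataclass(frozen=True)
class Err:
    """Failed outcome carrying an error message."""

    message: str


def parse_revision_id(text: str) -> Union[Ok[int], Err]:
    """Example function: return Ok(int) or Err instead of an error string."""
    if text.isascii() and text.isdigit():
        return Ok(int(text))
    return Err(f"Not a valid revision ID: {text!r}")
```

A caller checks the type with isinstance(result, Ok) and then reads result.value, or decides whether result.message should be logged.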

Better type hinting

It would be great to have mypy --strict to run without complaining.

Afterwards we can add mypy checks to our CI/CD to enforce well-typed code in our repository.

German correcting rules for quotes don't work in some cases

Fix GermanCorrector.correct_quotes(): It does not handle cases correctly when there are special characters around quotes. See TestGermanCorrector.test_correct_quotes_todo() and make sure everything there can be uncommented, no warnings are given and all tests pass.
