
Check texts against reference text specifications

License: GNU Affero General Public License v3.0



texst – Text Tests


Package texst checks text files against reference text specifications. The simplest reference text is the verbatim text with each line prefixed with a 'reference text' line tag, e.g. "> ". Such a specification matches only exactly the verbatim text. For more complex matching, one can add other line types to the reference text specification.

Line types are recognised by the rune in the first column of each line in the reference text specification. Different line types serve different purposes.

Most often one needs to mark parts of a reference line that do not have to match the checked “subject” text exactly. texst does not embed markers into the reference text line itself, because that would require sophisticated escaping to keep arbitrary reference text possible. Instead, each reference text line may be followed by argument lines that modify how the reference text is matched against the checked text. Argument lines start with ' ' (U+0020). Some types of argument lines mark segments of the reference text that need not match the subject text exactly:

> This is some reference text content
 .        xxxx

The above example says that the four runes above the non-space part of the argument line, i.e. "some", are not compared to the checked text. The '.' identifies the specific type of argument line (see Types of argument lines). So the text

This is blue reference text content

would perfectly match the reference text example. Argument lines can be stacked and are applied in order to their reference text line up to the next non-argument line.

> This is some reference text content
 .        xxxx
 .                       yyyy

would be the same as

> This is some reference text content
 .        xxxx           yyyy
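
The masking and stacking rules above can be sketched in a few lines of Go. This is only an illustration, not texst's implementation: the names mergeMasks and maskedEqual are hypothetical, and the sketch assumes that a masked segment has the same length in reference and subject, that the reference line carries a two-column tag such as "> ", and that argument-line columns align with the tagged reference line.

```go
package main

import "fmt"

// mergeMasks overlays stacked argument lines into one column mask.
// Assumption: column 0 is ' ', column 1 is the argument type rune, and any
// non-space rune from column 2 onwards marks a masked column.
func mergeMasks(argLines ...string) []bool {
	var mask []bool
	for _, line := range argLines {
		runes := []rune(line)
		for col := 2; col < len(runes); col++ {
			if runes[col] == ' ' {
				continue
			}
			for len(mask) <= col {
				mask = append(mask, false)
			}
			mask[col] = true
		}
	}
	return mask
}

// maskedEqual reports whether subject matches refLine (tag included) when
// all masked columns are ignored. Hypothetical helper for illustration.
func maskedEqual(refLine, subject string, argLines ...string) bool {
	const tagWidth = 2 // width of the "> " line tag
	mask := mergeMasks(argLines...)
	ref, sub := []rune(refLine), []rune(subject)
	if len(ref) < tagWidth || len(ref)-tagWidth != len(sub) {
		return false
	}
	for col := tagWidth; col < len(ref); col++ {
		if col < len(mask) && mask[col] {
			continue // masked column: not compared
		}
		if ref[col] != sub[col-tagWidth] {
			return false
		}
	}
	return true
}

func main() {
	ref := "> This is some reference text content"
	subject := "This is blue reference text content"
	fmt.Println(maskedEqual(ref, subject, " .        xxxx"))
	// Two stacked argument lines behave like the single combined one.
	fmt.Println(maskedEqual(ref, subject, " .        xxxx", " .                       yyyy"))
}
```

Merging the stacked lines into one mask before comparing is what makes the two notations shown above equivalent in this sketch.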

For some files, e.g. log files, it would be rather tedious if one had to mark each timestamp in the reference text line:

Jun 27 21:58:11.112 INFO  [thread1] create `localization dir:test1/test.xCuf/l10n`
Jun 27 21:58:11.113 INFO  [thread2] load state from `file:test1/test.xCuf/bcplus.json`
…

To solve this, one can set a global mask line after the preamble and between reference text specifications. For our example one would write:

*.ttt tt tt tt tt ttt
> Jun 27 21:58:11.112 INFO  [thread1] create `localization dir:test1/test.xCuf/l10n`
> Jun 27 21:58:11.113 INFO  [thread2] load state from `file:test1/test.xCuf/bcplus.json`
> Jun 27 18:58:11.125 DEBUG [thread1] clearing maps
> …

Looking closely, you will notice that the log lines come from different threads, so one cannot rely on the order of lines in the reference text specification. The lines from any one thread, however, must still appear in exactly the same order as given in the reference.

We declare two “interleaving groups” '1' and '2' in the preamble and mark the reference text lines to be in the specific group:

%%12
*.ttt tt tt tt tt ttt
>1Jun 27 21:58:11.112 INFO  [thread1] create `localization dir:test1/test.xCuf/l10n`
>2Jun 27 21:58:11.113 INFO  [thread2] load state from `file:test1/test.xCuf/bcplus.json`
>1Jun 27 18:58:11.125 DEBUG [thread1] clearing maps
> …

Now, both subjects

Jun 27 21:58:11.112 INFO  [thread1] create `localization dir:test1/test.xCuf/l10n`
Jun 27 21:58:11.113 INFO  [thread2] load state from `file:test1/test.xCuf/bcplus.json`
Jun 27 18:58:11.125 DEBUG [thread1] clearing maps
…

and

Jun 27 21:58:11.112 INFO  [thread1] create `localization dir:test1/test.xCuf/l10n`
Jun 27 18:58:11.125 DEBUG [thread1] clearing maps
Jun 27 21:58:11.113 INFO  [thread2] load state from `file:test1/test.xCuf/bcplus.json`
…

match the reference. For more details, see the reference documentation.
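
To make the interleaving rule concrete, here is a small Go sketch. It is not how texst is implemented: refLine, equalMasked and matchInterleaved are hypothetical names, the mask is given relative to the untagged text, and masked segments are assumed to have equal length in reference and subject. Each group keeps its reference lines in a queue, and a subject line must match the head of some group's queue.

```go
package main

import "fmt"

// refLine is one reference text line assigned to an interleaving group.
type refLine struct {
	group rune
	text  string
}

// equalMasked compares two lines rune by rune and skips every column
// where the global mask holds a non-space rune.
func equalMasked(ref, sub, mask string) bool {
	r, s, m := []rune(ref), []rune(sub), []rune(mask)
	if len(r) != len(s) {
		return false
	}
	for i := range r {
		if i < len(m) && m[i] != ' ' {
			continue
		}
		if r[i] != s[i] {
			return false
		}
	}
	return true
}

// matchInterleaved checks that each subject line matches the next pending
// reference line of some group and that no reference line is left over.
func matchInterleaved(refs []refLine, subjects []string, mask string) bool {
	queues := map[rune][]string{}
	var order []rune // groups in order of first appearance
	for _, r := range refs {
		if _, seen := queues[r.group]; !seen {
			order = append(order, r.group)
		}
		queues[r.group] = append(queues[r.group], r.text)
	}
	for _, sub := range subjects {
		matched := false
		for _, g := range order {
			if q := queues[g]; len(q) > 0 && equalMasked(q[0], sub, mask) {
				queues[g] = q[1:] // consume the head of this group
				matched = true
				break
			}
		}
		if !matched {
			return false
		}
	}
	for _, q := range queues {
		if len(q) > 0 {
			return false
		}
	}
	return true
}

func main() {
	a := "Jun 27 21:58:11.112 INFO  [thread1] create `localization dir:test1/test.xCuf/l10n`"
	b := "Jun 27 21:58:11.113 INFO  [thread2] load state from `file:test1/test.xCuf/bcplus.json`"
	c := "Jun 27 18:58:11.125 DEBUG [thread1] clearing maps"
	refs := []refLine{{'1', a}, {'2', b}, {'1', c}}
	mask := "ttt tt tt tt tt ttt"
	fmt.Println(matchInterleaved(refs, []string{a, b, c}, mask))
	fmt.Println(matchInterleaved(refs, []string{a, c, b}, mask))
}
```

The sketch picks the first group whose head matches. If a subject line could match the heads of two groups at once, this greedy choice may be wrong, so a complete matcher would need to handle that ambiguity.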

texst's People

Contributors

fractalqb, mperlick


texst's Issues

Configurable check for newline format

Currently the newline chars "\n" or "\r\n" are discarded while reading from input. I.e. one cannot check if these are of expected format. There should be options to have explicit checks for this.
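
One possible direction, sketched in Go under stated assumptions: read lines with bufio.Reader.ReadString, which keeps the terminator, and compare it with the expected format. The function names firstBadNewline and firstBadNewlineIn are hypothetical and not part of texst; a final line without a terminator is accepted here.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// firstBadNewline returns the 1-based number of the first line whose
// terminator differs from want ("\n" or "\r\n"), or 0 if every terminated
// line conforms.
func firstBadNewline(r io.Reader, want string) (int, error) {
	br := bufio.NewReader(r)
	for n := 1; ; n++ {
		// ReadString keeps the delimiter, so the terminator stays visible.
		line, err := br.ReadString('\n')
		if strings.HasSuffix(line, "\n") {
			got := "\n"
			if strings.HasSuffix(line, "\r\n") {
				got = "\r\n"
			}
			if got != want {
				return n, nil
			}
		}
		if err == io.EOF {
			return 0, nil
		}
		if err != nil {
			return 0, err
		}
	}
}

// firstBadNewlineIn is a convenience wrapper for in-memory text; reading
// from a strings.Reader yields no error other than io.EOF.
func firstBadNewlineIn(text, want string) int {
	n, _ := firstBadNewline(strings.NewReader(text), want)
	return n
}

func main() {
	n, err := firstBadNewline(strings.NewReader("a\nb\r\nc\n"), "\n")
	fmt.Println(n, err)
}
```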

Add option to reuse reference specifications

My original implementation (Java, around 2005) slurped the complete reference into main memory and then used it to compare subject texts. (This is why this project started with v0.3.0, not v0.1.0; I consider my first Go port from ~2016 to be v0.2.0.) The “load the complete reference” strategy

  1. made it efficient to apply the same reference to many subjects (IMHO a rather rare use case)
  2. can blow up – or at least impose heavy demands on – main memory (The primary reason for v0.3.0; it can stream reference texts, but has to parse them on each run)

Now that the algorithm works, I see a good chance that both strategies, streaming and complete loading, can nicely coexist in the texst package.

More argument line types to apply additional checks to masked segments of subject

Currently the segments of a subject line that correspond to masked segments of the reference text receive hardly any further checks. The only ones are the length constraints of the '=', '*' and '+' argument lines.
The next steps should be:

  • Use the rune that marks a reference segment as the name of that segment
  • Add an argument line type that defines a regular expression for a segment name
  • Check that all regular expressions defined for a reference line match the respective segments of the subject line; if not => mismatch
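
The three steps above could look roughly like the following Go sketch. It illustrates the idea only and does not describe an existing texst feature: segmentsByName and checkSegments are hypothetical names, the mask is given relative to the untagged subject line, and each marker rune is assumed to name one segment of equal length in reference and subject.

```go
package main

import (
	"fmt"
	"regexp"
)

// segmentsByName collects, for every marker rune in mask, the subject
// runes found in the columns that carry this marker.
func segmentsByName(mask, subject string) map[rune]string {
	segs := map[rune]string{}
	sub := []rune(subject)
	for col, name := range []rune(mask) {
		if name == ' ' || col >= len(sub) {
			continue
		}
		segs[name] += string(sub[col])
	}
	return segs
}

// checkSegments reports whether every named segment of subject matches
// the regular expression registered under its name. A name without any
// marked column counts as a mismatch.
func checkSegments(mask, subject string, patterns map[rune]string) (bool, error) {
	segs := segmentsByName(mask, subject)
	for name, pattern := range patterns {
		re, err := regexp.Compile(pattern)
		if err != nil {
			return false, err
		}
		seg, ok := segs[name]
		if !ok || !re.MatchString(seg) {
			return false, nil
		}
	}
	return true, nil
}

func main() {
	mask := "        xxxx"
	patterns := map[rune]string{'x': `^[a-z]+$`}
	fmt.Println(checkSegments(mask, "This is blue reference text content", patterns))
	fmt.Println(checkSegments(mask, "This is 1234 reference text content", patterns))
}
```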

Make texsting more versatile

By shifting some concerns to the texst package it would be easy to let users of texsting have more control over the way reference files are selected. The default strategy will remain.

Marking segments in reference text with tabs

Without tabs, marking segments with argument lines is (IMHO) visually pleasant and helpful. But it's hard to apply to reference text that contains tabs. Find a good solution… 😁

Same options for global masks as for line specific masks

Currently one can set only one global segment line, i.e. global segments can be of only one type: exact, optional or variable. It would be more flexible and intuitive if global segment lines worked “the same” as line-specific segment lines, only globally.
