Comments (11)

laurentprudhon commented on August 22, 2024

PROBLEM 1 :
Some COBOL programs on the mainframe contain non-printable EBCDIC characters in alphanumericLiterals.
Such alphanumericLiterals are used in VALUE clauses of the DATA DIVISION to initialize tables of bytes. At runtime they are not interpreted as character strings, but as tables of numeric elements.
If the text of these programs is converted to another character set, the character representation of the alphanumericLiteral may be preserved, but the numeric values in the table change, and the program no longer works.
There is NO SOLUTION to this problem, because we cannot know which parts of the program text were encoded as EBCDIC strings with this goal in mind.
We cannot know when we must preserve the textual representation and when we must preserve the numeric representation.
This pattern must be strictly forbidden in all our programs : if a field is initialized as text, it must be interpreted as text at runtime.
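
To make the failure concrete, here is a minimal Python sketch (Python is used only for illustration, it is not the language of our compiler). It assumes the mainframe source uses EBCDIC code page 037, available in Python as the `cp037` codec, and that the converted file is stored as Latin-1. Both choices, and the sample bytes, are assumptions made for the example.

```python
# Bytes of a hypothetical alphanumeric literal as stored on the mainframe.
# Assumption: the program reads them at runtime as a table of numeric values.
ebcdic_literal = bytes([0xC1, 0xC2, 0xC3])

# Converting the program text preserves the characters...
text = ebcdic_literal.decode("cp037")

# ...but the bytes behind those characters differ in the target character set.
converted_literal = text.encode("latin-1")

table_before = list(ebcdic_literal)     # numeric values seen on the mainframe
table_after = list(converted_literal)   # numeric values seen after conversion

print(text, table_before, table_after)
```

The text stays `ABC`, but the table goes from 193, 194, 195 to 65, 66, 67, which is exactly the change of numeric values described above.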

from typecobol.

laurentprudhon commented on August 22, 2024

PROBLEM 2 :
In the context described in PROBLEM 1, we can find an EBCDIC character which maps to a Unicode endOfLine character (\r or \n) in the middle of an alphanumericLiteral.
When the class TextDocument reads the Stream of Unicode chars from an ASCII source file (with explicit line endings), it cannot know whether this endOfLine char really signals an end of line or is a char inside a literal.
Other languages avoid this problem by forbidding endOfLine chars in literals and defining a special char sequence, for example "\r\n", to represent the forbidden characters for the compiler.
As there are no such escape sequences in Cobol, there is NO perfect SOLUTION to this problem either.
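
A small Python sketch of the ambiguity, for illustration only. It assumes EBCDIC code page 037 (Python codec `cp037`) and two made-up 80-byte source records. The first record holds, inside a literal, the EBCDIC byte that converts to \n.

```python
# Assumption: the mainframe source uses EBCDIC code page 037 (Python codec "cp037").
CODEC = "cp037"
SPACE = " ".encode(CODEC)
LF_BYTE = "\n".encode(CODEC)  # the single EBCDIC byte that converts to U+000A

# Two hypothetical 80-byte source records; the first literal holds LF_BYTE as data.
first = '       01 T PIC X(3) VALUE "A'.encode(CODEC) + LF_BYTE + 'C".'.encode(CODEC)
second = "           MOVE T TO U.".encode(CODEC)
records = [first.ljust(80, SPACE), second.ljust(80, SPACE)]

# Conversion to a Unicode text file with explicit line endings.
unicode_text = "\n".join(record.decode(CODEC) for record in records)

# A reader that splits on line-ending characters now sees three lines, not two:
# the literal of the first record has been cut in the middle.
lines = unicode_text.split("\n")
print(len(records), len(lines))
```

Once the records are flattened into a stream of chars, nothing distinguishes the \n that was data from the \n that was added as a line ending.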

laurentprudhon commented on August 22, 2024

FIX ?
Here is the best thing we could do to reduce the probability of PROBLEM 2 occurring :

  • drop the support for files with Unix/Linux-style single-character line endings
  • Windows-style two-character line endings \r\n become mandatory
    Then, if only one endOfLine char is found in an alphanumericLiteral, we will know it is not the end of the line. But the problem remains if the alphanumericLiteral contains the char sequence \r\n.
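
A minimal Python sketch of this rule and of its remaining weakness, for illustration only. The function name `split_crlf_only` and the sample strings are hypothetical, they are not taken from the TextDocument code.

```python
from typing import List


def split_crlf_only(text: str) -> List[str]:
    # Only the two-character sequence CR LF ends a line;
    # a lone CR or LF is kept as ordinary data inside the line.
    return text.split("\r\n")


# A lone LF inside a literal no longer breaks the line.
lucky = 'VALUE "A\nB".\r\nMOVE X TO Y.'
lucky_lines = split_crlf_only(lucky)

# But a literal that happens to contain CR followed by LF is still cut in two.
unlucky = 'VALUE "A\r\nB".\r\nMOVE X TO Y.'
unlucky_lines = split_crlf_only(unlucky)

print(len(lucky_lines), len(unlucky_lines))
```

The first text is read as two lines, as intended, while the second is read as three.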

I implemented and tested this fix with success on our sample files.
But do you think it is worth dropping the support for unix/linux style source files ?
I am waiting for your answers before committing this fix ...

wiztigers commented on August 22, 2024

Dropping support of UNIX-style files without correcting the bug would indeed be disappointing.

As I understand the problem, TextDocument takes a Stream of chars. This stream is built by other objects, which take various source file formats as their input. Couldn't we :

  • either give TextDocument another (non-character based) way to discriminate between lines ? Like a custom-created Stream object, or an array of line Streams, ...
  • or modify the File > Stream conversion objects to return a special way to differentiate lines (for example EOF, two EOF in a row being the end of the source file) independent from the input format, this way of discrimination being of course known by TextDocument ?

smedilol commented on August 22, 2024

RDZ has the same problem and can't interpret such a file correctly.

But ...
Cobol files located inside partitioned data sets contain a fixed number of characters per line.
For our organization it's 80 chars.

So instead of looking for line endings chars, I think it's better to parse 80 chars and consider this as a whole line.

Of course this should be one behavior of the parser and must be configurable (use fixed line length or use line endings char).
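
A rough Python sketch of what such a configurable behavior could look like, under stated assumptions: the function name `split_source_lines`, the `fixed_length` parameter and the `cp037` code page are all hypothetical choices for the example, and one EBCDIC byte is assumed to be one char.

```python
import re
from typing import List


def split_source_lines(data: bytes, fixed_length: int = 0,
                       encoding: str = "cp037") -> List[str]:
    # fixed_length > 0: every line is exactly fixed_length bytes long,
    # whatever those bytes contain.
    if fixed_length > 0:
        return [data[i:i + fixed_length].decode(encoding)
                for i in range(0, len(data), fixed_length)]
    # fixed_length == 0: fall back to line-ending characters.
    return re.split(r"\r\n|\r|\n", data.decode(encoding))


space = " ".encode("cp037")
record_1 = 'VALUE "A\nC".'.encode("cp037").ljust(80, space)
record_2 = "MOVE T TO U.".encode("cp037").ljust(80, space)

lines = split_source_lines(record_1 + record_2, fixed_length=80)
print(len(lines), [len(line) for line in lines])
```

With a fixed length of 80, the char that converts to \n stays inside the first line instead of splitting it.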

wiztigers commented on August 22, 2024

Yup, but this 80 chars limit makes no sense in free format, and the sexay thing is TextDocument doesn't currently know about file formats.
Wouldn't including this notion in TextDocument break the SOLID principles ?

smedilol commented on August 22, 2024

Maybe one solution is to have 2 implementations of ITextDocument:

  • the current one with line ending chars
  • a new one for fixed line length
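
One possible shape for these two implementations, sketched in Python for illustration. The class names are hypothetical stand-ins for ITextDocument and its implementations, not the real types, and the text is assumed to be already converted to Unicode.

```python
import re
from abc import ABC, abstractmethod
from typing import List


class TextDocumentBase(ABC):
    """Hypothetical stand-in for ITextDocument: callers only ask for lines."""

    def __init__(self, text: str) -> None:
        self.text = text

    @abstractmethod
    def lines(self) -> List[str]:
        ...


class LineEndingTextDocument(TextDocumentBase):
    def lines(self) -> List[str]:
        # Current behavior: \r\n, \r or \n ends a line.
        return re.split(r"\r\n|\r|\n", self.text)


class FixedLengthTextDocument(TextDocumentBase):
    def __init__(self, text: str, line_length: int = 80) -> None:
        super().__init__(text)
        self.line_length = line_length

    def lines(self) -> List[str]:
        # New behavior: a line is line_length chars, whatever they are.
        n = self.line_length
        return [self.text[i:i + n] for i in range(0, len(self.text), n)]


fixed_text = 'VALUE "A\nB".'.ljust(80) + "MOVE X TO Y.".ljust(80)
fixed_lines = FixedLengthTextDocument(fixed_text).lines()
free_lines = LineEndingTextDocument("a\r\nb\nc\rd").lines()
print(len(fixed_lines), free_lines)
```

The later phases would depend only on `lines()`, so the choice between the two behaviors stays in one place.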

laurentprudhon commented on August 22, 2024

Our friend Regis is right here : the idea was to restrict the knowledge of the text storage format (encoding and line endings) to the File namespace, i.e. for now the CobolFile class. The CobolFile implementation normalizes the input as a Stream of Unicode chars with \r, \n, or \r\n line endings. The later phases of the compiler, notably the Text namespace / TextDocument class, don't need to worry about the storage format anymore.
The consequence of this choice is that we need a line ending character (or character sequence), and that we cannot allow such a character (or character sequence) in character literals in our parser, while the original Cobol specification has no such restriction.
But in fact the architecture we chose for our compiler to read files from disk does not matter : this limitation will always be present if we want to allow free format Cobol programs in our in-memory visual text editor. All the text editors from Eclipse or other IDEs will internally detect a line ending - and display it on screen - if they find such line ending characters in the string representing one of our program lines.
After thinking a bit more about that, I devised a different fix for this issue, which I committed this morning in the same issue-59 branch :
We recognized above that we will anyway be unable to support Unicode line ending chars in alphanumericLiterals in interactive editing scenarios.
And we know that the EBCDIC alphanumeric literals containing non-printable characters will be broken anyway by the Unicode conversion, because the developer relied explicitly on the numeric codes representing them in the original EBCDIC character set.
So I propose the following solution :

  • restore support for single \r and \n characters as line endings in TextDocument (revert to the previous version of the file)
  • update CobolFile class : when reading a fixed length line, if we encounter a line ending character after Unicode conversion of an original EBCDIC character, replace it on the fly with a question mark '?' char
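
The second point can be sketched as follows in Python, for illustration only. This is not the CobolFile code: the function name `read_fixed_length_lines`, the `cp037` code page and the 80-byte line length are assumptions made for the example.

```python
from typing import List

# Assumption: only these two chars are treated as line endings downstream.
LINE_ENDING_CHARS = {"\r", "\n"}


def read_fixed_length_lines(data: bytes, line_length: int = 80,
                            encoding: str = "cp037") -> List[str]:
    lines = []
    for start in range(0, len(data), line_length):
        chars = data[start:start + line_length].decode(encoding)
        # Replace on the fly any converted char that would look like a line ending.
        lines.append("".join("?" if c in LINE_ENDING_CHARS else c for c in chars))
    return lines


space = " ".encode("cp037")
record_1 = 'VALUE "A\nB\rC".'.encode("cp037").ljust(80, space)
record_2 = "MOVE T TO U.".encode("cp037").ljust(80, space)

cleaned = read_fixed_length_lines(record_1 + record_2)

# The lines can now be joined with a real line ending and split again safely.
round_trip = "\n".join(cleaned).split("\n")
print(cleaned[0][:14], len(round_trip))
```

The literal no longer carries its original values, which is why the restrictions have to be documented.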

Clearly document two restrictions of our compiler :

  • because of the internal conversion of the program text to Unicode characters in .Net or Java, we do not support alphanumeric literals containing non printable EBCDIC characters
  • because of the feature allowing free text format and variable line length, we do not support alphanumeric literals containing line ending characters

NB : when we say we do not support these two cases, it will only have an impact if we generate Cobol from a TypeCobol program and then compile it with the IBM compiler. For Cobol code analysis in memory, it has no impact.
In the two cases above, the solution is to modify the original EBCDIC program text before using our tool :

  • initialize numeric tables directly with numbers instead of their corresponding chars
  • set line ending chars individually inside alphanumeric literals, for example with reference modification

laurentprudhon commented on August 22, 2024

NB : this new fix won't resolve the problems found on our sample files in ASCII format, because it corrects the EBCDIC to Unicode conversion process, which in this case has already been executed.
The solution for our test suite is simply to manually replace the offending line ending characters in the source file with question mark characters, to mimic the new behavior of the CobolFile class.

laurentprudhon commented on August 22, 2024

Sorry, I cannot push the new commit today, because it appears that I can't reach the GitHub server while using the VPN -> I will push it on Monday

wiztigers commented on August 22, 2024

@prudholu largely solved the problem, and I added the identified restrictions to the appropriate wiki page.
