Seen in a production source file : some lines contain newline characters (although nor

Newline in fixed length COBOL source files (typecobol, CLOSED)

wiztigers commented on August 22, 2024

Newline in fixed length COBOL source files

from typecobol.

Comments (11)

laurentprudhon commented on August 22, 2024

PROBLEM 1 :
Some COBOL programs on the mainframe contain non-printable EBCDIC characters in alphanumericLiterals.
Such alphanumericLiterals are used in VALUE clauses of the DATA DIVISION to initialize tables of bytes. They are not interpreted as character strings at runtime, but as tables of numeric elements.
If the text of these programs is converted to another character set, the character representation of the alphanumericLiteral may be preserved, but the numeric values in the table change, and the program no longer works.
There is NO SOLUTION to this problem, because we cannot know which parts of the program text have been encoded as EBCDIC strings with this goal in mind.
We cannot know when we must preserve the textual representation, and when we must preserve the numeric representation.
This pattern must be strictly forbidden in all our programs: if a field is initialized as text, it must be interpreted as text at runtime.
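
To make PROBLEM 1 concrete, here is a minimal Python sketch (not TypeCobol code). It assumes the mainframe source uses EBCDIC code page 037, available in Python as the "cp037" codec, and the literal "ABC" is an invented example. It shows that the same visible text yields different numeric values once the character set changes.

```python
# Assumption: the mainframe source uses EBCDIC code page 037 (Python codec "cp037").
literal = "ABC"  # hypothetical alphanumeric literal used to initialize a byte table

# Numeric values the program sees on the mainframe.
ebcdic_bytes = list(literal.encode("cp037"))
# Numeric values it sees after the text has been converted to ASCII.
ascii_bytes = list(literal.encode("ascii"))

print(ebcdic_bytes)  # [193, 194, 195]
print(ascii_bytes)   # [65, 66, 67]
```

The text "ABC" survives the conversion, but a table initialized from it no longer holds the values the developer relied on.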

laurentprudhon commented on August 22, 2024

PROBLEM 2 :
In the context described in PROBLEM 1, we can find an EBCDIC character which maps to a Unicode end-of-line character (\r or \n) in the middle of an alphanumericLiteral.
When the class TextDocument reads the Stream of Unicode chars from an ASCII source file (with explicit line endings), it cannot know if this end-of-line char really signals an end of line, or if it is a char inside a literal.
Other languages avoid this problem by forbidding end-of-line chars in literals, and defining special escape sequences, for example "\r" and "\n", to represent the forbidden characters for the compiler.
As there are no such escape sequences in Cobol, there is NO perfect SOLUTION to this problem either.
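
A small Python sketch of PROBLEM 2, under stated assumptions: code page 037 ("cp037") stands in for the mainframe encoding, and the 80-byte record content is invented. It is not the TextDocument code; it only shows how one fixed-length source line turns into two lines once the converted text is split on line ending chars.

```python
# Assumption: EBCDIC code page 037 (Python codec "cp037"); the record is invented.
EBCDIC = "cp037"

# The EBCDIC byte that converts to the Unicode line feed character.
lf_byte = "\n".encode(EBCDIC)

# One 80-byte fixed-length record whose literal contains that byte.
record = (
    "       01 T PIC X(3) VALUE 'A".encode(EBCDIC)
    + lf_byte
    + "B'.".encode(EBCDIC)
).ljust(80, " ".encode(EBCDIC))

# Converted file: Unicode text with an explicit line ending appended.
text = record.decode(EBCDIC) + "\r\n"
lines = text.splitlines()

print(len(record))  # 80
print(len(lines))   # 2: the single source line was split inside the literal
```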

laurentprudhon commented on August 22, 2024

FIX ?
Here is the best thing we could do to reduce the probability of PROBLEM 2 occurring:

drop the support for files with Unix/Linux-style single character line endings
Windows-style two characters end of lines \r\n become mandatory
Then if only one endOfLine char is found in an alphanumericLiteral, we will know it is not the end of the line. But the problem remains if the alphanumericLiteral contains the sequence of chars \r\n.
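
The proposed rule can be sketched as follows in Python. This is an illustration only, not the committed fix; the function name split_crlf_only and the sample strings are hypothetical.

```python
def split_crlf_only(text: str) -> list[str]:
    # Only the two-character sequence CR LF ends a line;
    # a lone CR or LF stays inside its line.
    lines = text.split("\r\n")
    # A trailing terminator leaves one empty final element; drop it.
    if lines and lines[-1] == "":
        lines.pop()
    return lines


# A lone LF inside a literal is kept inside its line.
lone = "VALUE 'A\nB'.\r\nMOVE X TO Y.\r\n"
print(len(split_crlf_only(lone)))  # 2

# A literal containing CR LF is still split in the wrong place.
pair = "VALUE 'A\r\nB'.\r\n"
print(len(split_crlf_only(pair)))  # 2, although there is only one source line
```

The second call shows the remaining weakness: a literal that happens to contain both chars in sequence is still cut in two.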

I implemented this fix and tested it successfully on our sample files.
But do you think it is worth dropping the support for Unix/Linux-style source files?
I'll wait for your answers before committing this fix ...

wiztigers commented on August 22, 2024

Dropping support of UNIX-style files without correcting the bug would indeed be disappointing.

As I understand the problem, TextDocument takes a Stream of chars. This stream is built by other objects, which take various source file formats as their input. Couldn't we :

either give TextDocument another (non-character based) way to discriminate between lines ? Like a custom-created Stream object, or an array of line Streams, ...
or modify the File > Stream conversion objects to return a special way to differentiate lines (for example EOF, two EOF in a row being the end of the source file) independent from the input format, this way of discrimination being of course known by TextDocument ?

smedilol commented on August 22, 2024

RDZ has the same problem and can't correctly interpret such a file.

But ...
Cobol files located inside partitioned data sets contain a fixed number of characters per line.
For our organization it's 80 chars.

So instead of looking for line ending chars, I think it's better to read 80 chars and consider them as a whole line.

Of course this should be one behavior of the parser and must be configurable (use a fixed line length or use line ending chars).
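
A rough Python sketch of the fixed-length reading idea, assuming the raw EBCDIC bytes are still available and use code page 037 ("cp037"). The names read_fixed_length_lines and make_record are hypothetical, and the record length is a parameter so the behavior stays configurable.

```python
def read_fixed_length_lines(
    data: bytes, record_length: int = 80, encoding: str = "cp037"
) -> list[str]:
    # Cut the byte stream every record_length bytes, ignoring line ending chars.
    if len(data) % record_length != 0:
        raise ValueError("data is not a whole number of records")
    return [
        data[start:start + record_length].decode(encoding)
        for start in range(0, len(data), record_length)
    ]


def make_record(text: str, encoding: str = "cp037") -> bytes:
    # Helper for the example: pad a source line to 80 bytes with EBCDIC spaces.
    return text.encode(encoding).ljust(80, " ".encode(encoding))


data = make_record("       01 T PIC X(3) VALUE 'A\nB'.") + make_record(
    "       01 U PIC X."
)
lines = read_fixed_length_lines(data)

print(len(lines))          # 2
print("\n" in lines[0])    # True: the char stays inside the first line
```

Note that the decoded line still contains the line ending char, so any later phase that splits on it would run into the same problem again.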

wiztigers commented on August 22, 2024

Yup, but this 80 chars limit makes no sense in free format, and the nice thing is that TextDocument doesn't currently know about file formats.
Wouldn't including this notion in TextDocument break the SOLID principles?

smedilol commented on August 22, 2024

Maybe one solution is to have 2 implementations of ITextDocument:

the current one with line ending chars
a new one for fixed line length

laurentprudhon commented on August 22, 2024

Our friend Regis is right here: the idea was to restrict the knowledge of the text storage format (encoding and line endings) to the File namespace, i.e. for now the CobolFile class. The CobolFile implementation normalizes the input as a Stream of Unicode chars with \r, \n, or \r\n line endings. The later phases of the compiler, notably the Text namespace / TextDocument class, don't need to worry about the storage format anymore.
The consequence of this choice is that we need a line ending character (or character sequence), and that we cannot allow such a character (or character sequence) in character literals in our parser, while the original Cobol specification has no such restriction.
But in fact the architecture we chose for our compiler to read files from disk does not matter: this limitation will always be present if we want to allow free format Cobol programs in our visual text editor in memory. All the text editors from Eclipse or other IDEs will internally detect a line ending, and display it on screen, if they find such line ending characters in the string representing one of our program lines.
After thinking a bit more about that, I devised a different fix for this issue, which I committed this morning in the same issue-59 branch:
We recognized above that we will be unable to support Unicode line ending chars in alphanumericLiterals in interactive editing scenarios anyway.
And we know that the EBCDIC alphanumeric literals containing non-printable characters will be broken anyway by the Unicode conversion, because the developer relied explicitly on the numeric code representing them in the original EBCDIC character set.
So I propose the following solution :

restore support for single \r and \n characters as line endings in TextDocument (revert to the previous version of the file)
update CobolFile class : when reading a fixed length line, if we encounter a line ending character after Unicode conversion of an original EBCDIC character, replace it on the fly with a question mark '?' char
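
The second step could look roughly like this in Python. It is a sketch of the idea only, not the CobolFile implementation: the function name decode_fixed_length_line is hypothetical, code page 037 ("cp037") is an assumed encoding, and treating exactly \r and \n as the line ending chars is an assumption taken from the discussion above.

```python
# Assumption: only CR and LF count as line ending chars.
LINE_ENDING_CHARS = "\r\n"


def decode_fixed_length_line(record: bytes, encoding: str = "cp037") -> str:
    # Convert one fixed-length EBCDIC record to Unicode, then replace any
    # line ending char produced by the conversion with a question mark.
    text = record.decode(encoding)
    return "".join("?" if ch in LINE_ENDING_CHARS else ch for ch in text)


# Invented record: a literal containing the EBCDIC byte that converts to LF.
record = "VALUE 'A\nB'.".encode("cp037")
line = decode_fixed_length_line(record)

print(line)  # VALUE 'A?B'.
```

The line keeps its original length, so column positions are preserved, but the original byte value is lost, which is why the restrictions have to be documented.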

Document clearly two restrictions of our compiler :

because of the internal conversion of the program text to Unicode characters in .Net or Java, we do not support alphanumeric literals containing non printable EBCDIC characters
because of the feature allowing free text format and variable line length, we do not support alphanumeric literals containing line ending characters

NB: when we say we do not support these two cases, it will only have an impact if we generate Cobol from a TypeCobol program and then compile it with the IBM compiler. For Cobol code analysis in memory, it has no impact.
In the two cases above, the solution is to modify the original EBCDIC program text before using our tool:

initialize numeric tables directly with numbers instead of their corresponding chars
set line ending chars individually inside alphanumeric literals, for example with reference modification

laurentprudhon commented on August 22, 2024

NB: this new fix won't resolve the problems found on our sample files in ASCII format, because it corrects the EBCDIC to Unicode conversion process, which has already been executed in this case.
The solution for our test suite is simply to manually replace the offending line ending characters in the source file with question mark characters, to mimic the new behavior of the CobolFile class.

laurentprudhon commented on August 22, 2024

Sorry, I cannot push the new commit today, because it appears that I can't reach the GitHub server while using the VPN. I will push it Monday.

wiztigers commented on August 22, 2024

@prudholu largely solved the problem, and I added the identified restrictions to the appropriate wiki page.
