I can't get Orson PDF to display non-English letters contained in the text when not re

Unable to render non-English letters supported by the font (orsonpdf, OPEN)

jfree commented on June 14, 2024

Unable to render non-English letters supported by the font


Comments (4)

mhschmieder commented on June 14, 2024

I am hoping to do some work on encoding issues tonight, so I will keep this in mind as it may be related. Much of the code is hard-wired for 7-bit US-ASCII for some reason. That seems an unnecessary restriction: PDF supports UTF-8 and UTF-16, and UTF-8 (my preference) covers the same Unicode range as UTF-16 while producing byte-for-byte identical files when the content is pure 7-bit US-ASCII. I will check whether this affects font selection and extended font mappings, and whether the encoding is applied before glyph conversion and vectorizing of the text. I am hoping to find a way to lift these character set limitations in the library while still leaving the downstream client in control of the encoding.


mhschmieder commented on June 14, 2024

I am now setting the Rendering Hint to get vectored text, because my clients want as exact a match to the on-screen GUI look as possible (and because most modern applications recognize common fonts and can back-convert to selectable text, as long as the original font was a common one). Even so, I looked again at the OrsonPDF source code to see whether it would be safe to switch the encoding to UTF-8 in the two "toBytes()" functions (one in PDFUtils, the other in PDFDocument).

After looking at where those functions are called, it seems perfectly safe to make this change. I don't feel it should even require additional functions that take the charset as an argument, with the older functions calling them using US-ASCII as the charset value for backward compatibility. As I stated above, if the content is all US-ASCII anyway, the resulting file will be 100% identical when the String is converted to a byte array using UTF-8 encoding.
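
The byte-identity claim is easy to check in isolation. The following is a minimal sketch using only the standard Java library; the class and method names are made up for illustration and are not part of OrsonPDF.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of OrsonPDF.
final class AsciiUtf8Equivalence {

    /** Returns true if s encodes to the same bytes in US-ASCII and UTF-8. */
    static boolean sameBytes(String s) {
        byte[] ascii = s.getBytes(StandardCharsets.US_ASCII);
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(ascii, utf8);
    }

    public static void main(String[] args) {
        // Pure 7-bit content: both encodings give the same bytes.
        System.out.println(sameBytes("BT /F1 12 Tf (Hello) Tj ET")); // true
        // Content with a degree sign (U+00B0): the encodings diverge.
        System.out.println(sameBytes("25 \u00B0C"));                 // false
    }
}
```

So output that is already pure US-ASCII would not change; only strings with characters outside that range would produce different bytes.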

On the other hand, the Dictionary class in OrsonPDF also uses PDFUtils.toBytes() to convert the String containing the PDF text describing the Dictionary. So it may be that US-ASCII is needed for that particular encoding, as the Dictionary entries go first in the PDF output and likely are only valid if US-ASCII-limited.

As this issue got no comments after two years, I'm not sure whether I should just go ahead and make the changes in a pull request that includes this explanation. If no comments are made here soon, I will probably do that. After all, the development team can always reject the change and say why.

Of course I will discover immediately whether this change causes a font-mapping issue, before I even commit any code, as every PDF I output has at least the degree symbol in it, which isn't in the US-ASCII character set. I may have to try several available font mappings to verify, though, as I thought Helvetica supported some of the more common non-US characters such as the degree and copyright symbols.
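
To make the degree-symbol case concrete, here is a small sketch of what standard Java produces for that one character under different charsets. ISO-8859-1 is included only for comparison and is my own addition, not something proposed in this thread; the class name is hypothetical. Whether a given PDF font encoding then maps those bytes to the right glyph is a separate question that this snippet does not answer.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of OrsonPDF.
final class DegreeSymbolBytes {

    /** Formats the encoded bytes of s as space-separated uppercase hex. */
    static String hex(String s, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String degree = "\u00B0"; // the degree sign
        // US-ASCII cannot represent it, so getBytes() substitutes '?'.
        System.out.println(hex(degree, StandardCharsets.US_ASCII));   // 3F
        // A single byte in ISO-8859-1.
        System.out.println(hex(degree, StandardCharsets.ISO_8859_1)); // B0
        // Two bytes in UTF-8.
        System.out.println(hex(degree, StandardCharsets.UTF_8));      // C2 B0
    }
}
```

With US-ASCII the character is silently replaced rather than rejected, which would be consistent with non-English letters going missing without any error.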


jfree commented on June 14, 2024

Font support is definitely an area where OrsonPDF has limitations. I'll be happy to look at pull requests that extend what's possible with the API.


mhschmieder commented on June 14, 2024

I just spent about two hours trying to understand every code path that might depend on the functions that force US-ASCII encoding. I'm not quite ready to risk the switch to UTF-8. The alternative is to make the charset an option in enhanced versions of the byte-array-conversion functions that would only affect text content. I should have time tomorrow to give that a trial run, as a safe approach that doesn't affect the internals of the library, and then see whether the built-in PDF fonts choke on non-US-ASCII characters, such as the degree symbol and the copyright sign, in the core text content of the PDF document.

If indeed it ends up being an issue of font support for UTF-8, then I'll come up with some creative plan for setting the document to map to fonts that support expanded character sets. But if it comes to that, then Issue #6 would have to be addressed first, and as I mentioned there, I am hoping to have time to work on that one later this week.
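
For reference, the opt-in approach described above (leaving the US-ASCII conversion alone and adding a charset-aware variant for text content) could look roughly like this. It is a sketch under assumptions: the class name is hypothetical, and the real "toBytes()" functions in PDFUtils and PDFDocument may be written differently.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not the actual OrsonPDF code.
final class ByteConversionSketch {

    /** Backward-compatible behaviour: always US-ASCII (e.g. for PDF structure). */
    static byte[] toBytes(String s) {
        return toBytes(s, StandardCharsets.US_ASCII);
    }

    /** Opt-in variant: the caller chooses the charset for text content. */
    static byte[] toBytes(String s, Charset charset) {
        return s.getBytes(charset);
    }

    public static void main(String[] args) {
        String text = "\u00A9 2024, 25 \u00B0C"; // copyright and degree signs
        // Existing callers keep the old behaviour (unmappable chars become '?').
        System.out.println(toBytes(text).length);
        // Callers that handle text content can ask for UTF-8 explicitly.
        System.out.println(toBytes(text, StandardCharsets.UTF_8).length);
    }
}
```

Keeping the one-argument form unchanged means the Dictionary output and other structural parts stay on US-ASCII, while only call sites that opt in see different bytes.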

