
Comments (4)

mhschmieder commented on June 14, 2024

I am hoping to do some work on encoding issues tonight, so I will keep this in mind as it may be related. Much of the code is hard-wired for 7-bit US-ASCII for some reason, which is an unnecessary restriction: PDF supports UTF-8 and UTF-16, and UTF-8 (my preference) covers the same Unicode range as UTF-16 while producing byte-for-byte identical files when the content is limited to 7-bit US-ASCII. I will check whether this affects font selection and extended font mappings, and whether the encoding is applied before glyph conversion and vectorizing of the text. I am hoping to find a way to remove these character set limitations in the library while still leaving the downstream client in control of the encoding.

from orsonpdf.

mhschmieder commented on June 14, 2024

I am now setting the Rendering Hint to get vectorized text, simply because my clients want as exact a match to the on-screen GUI look as possible (and because most modern applications recognize common fonts and can convert back to selectable text as long as the original font was a common one). Even so, I looked again at the OrsonPDF source code to see whether it would be safe to switch the encoding to UTF-8 in the two "toBytes()" functions (one in PDFUtils, the other in PDFDocument).

After looking at where those functions are called, it seems perfectly safe to make this change. I don't think it even requires additional functions that take the charset as an argument, with the older functions calling them using US-ASCII as the charset value for backward compatibility. As I stated above, if the content is all US-ASCII anyway, the resulting file will be 100% identical when the String is converted to a byte array using UTF-8 encoding.
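To make the byte-identity claim concrete, here is a minimal Java sketch. The class name `AsciiVsUtf8` and its helper are hypothetical and not part of OrsonPDF; the only library call used is the standard `String.getBytes(Charset)`. It compares the two encodings for a pure 7-bit string and for a string containing the degree symbol.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class AsciiVsUtf8 {

    /** Returns true if the string encodes to the same bytes in US-ASCII and UTF-8. */
    public static boolean sameBytes(String s) {
        byte[] ascii = s.getBytes(StandardCharsets.US_ASCII);
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(ascii, utf8);
    }

    public static void main(String[] args) {
        // Pure 7-bit content: both encodings produce identical bytes.
        System.out.println(sameBytes("BT /F1 12 Tf (Hello) Tj ET"));

        // The degree symbol (U+00B0) is outside US-ASCII, so the results differ:
        // US-ASCII substitutes a replacement byte, UTF-8 emits a two-byte sequence.
        System.out.println(sameBytes("25\u00B0C"));
    }
}
```

The first call prints `true` and the second prints `false`, so the change would only alter output for content that already falls outside US-ASCII. Whether a PDF viewer then renders those extra bytes correctly depends on the font encoding, which this sketch does not address.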

On the other hand, the Dictionary class in OrsonPDF also uses PDFUtils.toBytes() to convert the String containing the PDF text that describes the Dictionary. So US-ASCII may be needed for that particular encoding, as the Dictionary entries go first in the PDF output and are likely only valid if limited to US-ASCII.

As this issue got no comments after two years, I'm not sure whether I should just go ahead and make the changes in a pull request that includes this explanation. If no comments are made here soon, I will probably do that. After all, the development team can always reject the change and say why.

Of course I will discover immediately if this change causes a font-mapping issue, before I even commit any code, as every PDF that I output has at least the degree symbol in it, which isn't in the US-ASCII character set. I may have to try several of the available font mappings to verify though, as I thought Helvetica supported some of the more common non-US characters like degrees, copyright, etc.


jfree commented on June 14, 2024

Font support is definitely an area where OrsonPDF has limitations. I'll be happy to look at pull requests that extend what's possible with the API.


mhschmieder commented on June 14, 2024

I just spent about two hours trying to understand every code path that might depend on the functions that force US-ASCII encoding. I'm not quite ready to risk the change to UTF-8, as opposed to making it an option in enhanced versions of the byte-array-conversion functions that would only affect text content. I should have time tomorrow to give that a trial run as a safe approach that doesn't affect the internals of the library, and then see whether the built-in PDF fonts choke on non-US-ASCII characters, such as the degree symbol and the copyright sign, in the core text content of the PDF document.
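The "enhanced versions" approach could look roughly like the sketch below. All names here (`BytesSketch` and the two-argument `toBytes`) are hypothetical illustrations and not OrsonPDF's actual API. The assumption is that the existing one-argument signature keeps US-ASCII for structural output such as dictionaries, while a new overload lets the client choose the charset for text content only.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class BytesSketch {

    /** Legacy path (assumed): structural PDF syntax stays 7-bit US-ASCII. */
    public static byte[] toBytes(String s) {
        return toBytes(s, StandardCharsets.US_ASCII);
    }

    /** Hypothetical overload: the caller picks the charset for text content. */
    public static byte[] toBytes(String s, Charset charset) {
        return s.getBytes(charset);
    }

    public static void main(String[] args) {
        // "90", degree symbol, space, copyright sign: five characters in total.
        String text = "90\u00B0 \u00A9";

        // Legacy path: each non-ASCII character becomes one replacement byte.
        System.out.println(toBytes(text).length);

        // Client-selected UTF-8: each non-ASCII character here takes two bytes.
        System.out.println(toBytes(text, StandardCharsets.UTF_8).length);
    }
}
```

Existing callers would see no change in behavior, since the one-argument form still delegates with US-ASCII. Only call sites that opt in to the overload would emit multi-byte sequences, which keeps the trial limited to text content.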

If indeed it ends up being an issue of font support for UTF-8, then I'll come up with some creative plan for setting the document to map to fonts that support expanded character sets. But if it comes to that, then Issue #6 would have to be addressed first, and as I mentioned there, I am hoping to have time to work on that one later this week.

