
Comments (3)

dginev avatar dginev commented on June 2, 2024

Hi Jonathan,

Thank you for looking into ar5iv and starting this discussion.

Some general notes first:

  1. Naturally, we've experienced how overwhelming it can be to deal with a large corpus of TeX, so we created a build system to systematically aggregate latexml's log messages at the very start of the arXMLiv project, back in ~2007. Anyone can query that system at:
    https://corpora.mathweb.org/corpus/arxmliv/tex%5Fto%5Fhtml

  2. For downloading the logs of 10,000 examples, assuming you have a list of arXiv ids, this can be done with a simple loop that fetches https://ar5iv.labs.arxiv.org/log/<arxivid>. The ar5iv site is actively crawled by many interested parties at the moment (most recently I was told that SearchOnMath deployed a search index over ar5iv). So you are very welcome to grab any log entries you may need.

  3. We are always open to monitoring new classes of problems that latexml conversion remains "silent" on today. If there is a pattern that can be automatically recognized during conversion, we can add and emit more messages (in 4 severities: Info/Warning/Error/Fatal), to track them at scale.

  4. I actually find it practical to crowdsource the collection of concrete problems in ar5iv. In recent years, what used to be considered a "large" number for development has come to be seen as "medium" or even "small". A thousand issues may once have been seen as large, but is now commonplace. Consider the Rust language's issue tracker, where we see ~9000 open issues, ~40,000 closed issues and ~60,000 closed pull requests. I think a modern project backed by a large community should aspire to such numbers - and I am hoping the issues in this ar5iv repository will eventually approach this kind of magnitude.
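
To make note 2 concrete, here is a minimal sketch of the "simple loop", assuming you already have a list of arXiv ids. The function names, the output directory, the file naming, the user agent string and the one-second pause are all illustrative assumptions, not anything ar5iv prescribes. The response body is saved exactly as the server returns it.

```python
import time
import urllib.request
from pathlib import Path

# Log endpoint as given in the comment above: /log/<arxivid>
LOG_BASE = "https://ar5iv.labs.arxiv.org/log/"


def log_url(arxiv_id):
    """Build the ar5iv log URL for one arXiv id."""
    return LOG_BASE + arxiv_id.strip()


def _http_get(url):
    """Fetch a URL and return the raw response body as bytes."""
    # The user agent string is a made-up placeholder for this sketch.
    req = urllib.request.Request(url, headers={"User-Agent": "ar5iv-log-fetch-sketch"})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.read()


def fetch_logs(arxiv_ids, out_dir="logs", delay=1.0, fetch=None):
    """Download one log per id into out_dir, skipping files already present.

    `fetch` can be swapped out (e.g. for a stub in tests); it defaults to
    a plain HTTP GET. `delay` is a politeness pause between requests.
    """
    fetch = fetch or _http_get
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    for arxiv_id in arxiv_ids:
        arxiv_id = arxiv_id.strip()
        if not arxiv_id:
            continue  # ignore blank lines in the id list
        # Old-style ids such as astro-ph/0001053 contain a slash.
        target = out / (arxiv_id.replace("/", "_") + ".log")
        if not target.exists():
            target.write_bytes(fetch(log_url(arxiv_id)))
            time.sleep(delay)
        saved.append(target)
    return saved
```

Calling `fetch_logs(open("ids.txt"))` on a file with one id per line (a hypothetical file name) would then fill `logs/` and can be re-run after an interruption, since existing files are skipped.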


On the formula example:

For math parsing warnings, these are best used specifically to improve latexml's MathGrammar module, but often do not translate into visual degradation in the rendered HTML. latexml has a fallback parsing mode (similar to an extent to the way MathJax and KaTeX deal with math) which takes over when the grammatical parse fails.

Certainly, math parsing still has many inaccuracies and pending upgrades, and concrete reports of encountering them are always appreciated. Some are sufficiently difficult that resolving them in full requires swapping the entire grammar engine for one capable of dealing with ambiguity, which is one branch of ongoing work in latexml.

The aggregated report for math parsing failures (reported via the ALLCAPS grammatical categories of the concrete latexml grammar) is here:
https://corpora.mathweb.org/corpus/arxmliv/tex%5Fto%5Fhtml/warning/not%5Fparsed?all=false

As just one example, we've noticed that the most common parsing failure has to do with unbalanced parens, often in frivolous TeX uses such as $(1+$ z $)^3$ found in astro-ph/0001053. There are other reasons an OPEN may fail to find its CLOSE, but the report tells us latexml runs into this failure in 14% of arXiv articles.
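
To illustrate why that example trips a grammar, here is a toy sketch that splits a TeX string into its `$...$` fragments and reports the ones whose round parentheses do not balance on their own. This is not how latexml's MathGrammar works; the regex and function names are made up for illustration, and the sketch ignores `\(`, `\left(`, display math and escaped dollar signs.

```python
import re

# Naive matcher for inline math spans: $ ... $ with no nested dollar signs.
MATH_SPAN = re.compile(r"\$([^$]*)\$")


def paren_balance(fragment):
    """Return (unclosed_opens, unmatched_closes) for round parentheses."""
    opens = 0
    stray_closes = 0
    for ch in fragment:
        if ch == "(":
            opens += 1
        elif ch == ")":
            if opens:
                opens -= 1
            else:
                stray_closes += 1
    return opens, stray_closes


def unbalanced_fragments(tex):
    """List the math fragments whose parentheses do not balance in isolation."""
    return [frag for frag in MATH_SPAN.findall(tex) if paren_balance(frag) != (0, 0)]
```

On the string from astro-ph/0001053, both fragments `(1+` and `)^3` are flagged, while the single-formula spelling `$(1+z)^3$` yields no flagged fragment.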

For cases like this where math mode gets interrupted, as with your example $\sim 4 \times$ $10^{6}$..., the parse warning is also useful feedback to authors using latexml, who may want to edit their formulas to parse grammatically. latexml already covers a few choice cases that would extract an isolated construct out of math mode and deposit it in a textual element. We have been discussing - but haven't yet settled on - a few choice cases where adjacent math elements may be merge-able, so as to improve the parsing success rates.


And lastly - triaging issues. We have been somewhat disciplined in adding support for the "next most needed" package in arXiv with the limited time we have available for that work. And likewise - in fixing the "next most common" Error and Fatal issues.

The aggregate reports are indeed very informative in that regard. But as the issues in this repository often point out, there is a difference between broad coverage over arXiv and pixel-perfect individual articles. Sometimes we will manage to preserve the content (and thus have no log messages emitted by latexml), but be inaccurate with the exact sizing and styling of the emitted HTML elements - which leads to clunky rendering. In those cases especially, having the arXiv community report the problems back is extremely helpful, as they may remain invisible to us otherwise.

Hope some of that is helpful - I think we are very much in alignment on how to generally approach solving the "arXiv to HTML problem".


dginev avatar dginev commented on June 2, 2024

Oh and a fun note at the end: if you want to draw 10,000 ar5iv articles at random, consider using the "feeling lucky" feature. One can fetch the URL

https://ar5iv.labs.arxiv.org/feeling_lucky

and follow the redirect - each visit should lead to a different article. Then, by swapping /html/ for /log/ in the resulting URL, you can obtain the log info.
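
A minimal sketch of that trick, assuming the redirect lands on a URL whose path starts with /html/ as described above. The function names are hypothetical, and the code raises an error rather than guessing if the landing URL has a different shape.

```python
import urllib.request
from urllib.parse import urlsplit, urlunsplit

LUCKY_URL = "https://ar5iv.labs.arxiv.org/feeling_lucky"


def html_to_log_url(article_url):
    """Swap the leading /html/ path segment of an article URL for /log/."""
    parts = urlsplit(article_url)
    prefix = "/html/"
    if not parts.path.startswith(prefix):
        raise ValueError("expected an /html/ article URL, got: " + article_url)
    return urlunsplit(parts._replace(path="/log/" + parts.path[len(prefix):]))


def _follow_redirect(url):
    """Return the final URL after urllib has followed any redirects."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.geturl()


def random_log_url(resolve=None):
    """Draw one random article via feeling_lucky and return its log URL.

    `resolve` maps a URL to its post-redirect URL; it defaults to a real
    HTTP request and can be replaced by a stub for offline testing.
    """
    resolve = resolve or _follow_redirect
    return html_to_log_url(resolve(LUCKY_URL))
```

Repeated draws may return the same article more than once, so for a sample of 10,000 it is worth collecting the URLs in a set and pausing between requests.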


dginev avatar dginev commented on June 2, 2024

Thanks again for the conversation and detailed testing of ar5iv!

I will close here, but you are always welcome to open more issues for specific articles, feature requests, or general quality-of-life upgrades for latexml.
