
Understand many conversion errors and warnings (ar5iv issue, closed)

jfine2358 commented on June 2, 2024

Understand many conversion errors and warnings

from ar5iv.

Comments (3)

dginev commented on June 2, 2024

Hi Jonathan,

Thank you for looking into ar5iv and starting this discussion.

Some general notes first:

Naturally, we've experienced how overwhelming it can be to deal with a large corpus of TeX, so we created a build system to systematically aggregate latexml's log messages already at the original start of the arXMLiv project, back in ~2007. Anyone can query that system at:
https://corpora.mathweb.org/corpus/arxmliv/tex%5Fto%5Fhtml
For downloading the logs of 10,000 examples, assuming you have a list of arXiv ids, this can be done with a simple loop that fetches https://ar5iv.labs.arxiv.org/log/<arxivid>. The ar5iv site is actively crawled by many interested parties at the moment (most recently I was told that SearchOnMath deployed a search index over ar5iv). So you are very welcome to grab any log entries you may need.
We are always open to monitoring new classes of problems that latexml conversion remains "silent" on today. If there is a pattern that can be automatically recognized during conversion, we can add and emit more messages (in 4 severities: Info/Warning/Error/Fatal), to track them at scale.
I actually find it practical to crowdsource the collection of concrete problems in ar5iv. In recent years, what used to be considered a "large" number for development has shrunk to "medium" or even "small". A thousand issues may once have been seen as large, but is now commonplace. Consider the Rust language issues, where we see ~9000 open issues, ~40,000 closed issues and ~60,000 closed pull requests. I think a modern project backed by a large community should aspire to such numbers - and I am hoping the issues in this ar5iv repository will eventually approach this kind of magnitude.
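The "simple loop" for downloading logs mentioned in the notes above could look roughly like the following. This is a minimal sketch, not an official client: only the `https://ar5iv.labs.arxiv.org/log/<arxivid>` pattern comes from this comment, while the function names, the output directory, and the one-second delay are assumptions made for illustration.

```python
import time
import urllib.error
import urllib.request
from pathlib import Path

# URL pattern quoted in the comment above.
LOG_BASE = "https://ar5iv.labs.arxiv.org/log/"


def log_url(arxiv_id: str) -> str:
    """Build the ar5iv log URL for one arXiv id, e.g. 'astro-ph/0001053'."""
    return LOG_BASE + arxiv_id.strip()


def local_name(arxiv_id: str) -> str:
    """Old-style ids contain a slash; replace it to get a flat file name."""
    return arxiv_id.strip().replace("/", "_") + ".log"


def fetch_logs(arxiv_ids, out_dir="logs", delay_seconds=1.0):
    """Download one log per id into out_dir (hypothetical helper)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for arxiv_id in arxiv_ids:
        target = out / local_name(arxiv_id)
        if target.exists():  # already fetched, so the loop can be resumed
            continue
        try:
            with urllib.request.urlopen(log_url(arxiv_id), timeout=30) as response:
                target.write_bytes(response.read())
        except urllib.error.URLError as err:  # also covers HTTP errors
            print(f"skipped {arxiv_id}: {err}")
        time.sleep(delay_seconds)  # assumed politeness delay, not an ar5iv rule
```

Calling `fetch_logs(["astro-ph/0001053"])` would then write `logs/astro-ph_0001053.log`, provided the request succeeds.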

On the formula example:

Math parsing warnings are best used specifically to improve latexml's MathGrammar module, and often do not translate into visual degradation in the rendered HTML. latexml has a fallback parsing mode (similar, to an extent, to the way MathJax and KaTeX deal with math) which takes over when the grammatical parse fails.

Certainly there are many remaining inaccuracies in math parsing, and many upgrades still to be made, so concrete reports of encountering them are always appreciated. Some are sufficiently difficult that resolving them in full requires swapping the entire grammar engine for one capable of dealing with ambiguity, which is one branch of ongoing work in latexml.

The aggregated report for math parsing failures (reported via the ALLCAPS grammatical categories of the concrete latexml grammar) is here:
https://corpora.mathweb.org/corpus/arxmliv/tex%5Fto%5Fhtml/warning/not%5Fparsed?all=false

As just one example, we've noticed that the most common parsing failure has to do with unbalanced parentheses, often in frivolous TeX uses such as $(1+$ z $)^3$ found in astro-ph/0001053. There are other reasons an OPEN may not find its CLOSE, but the report tells us that latexml fails to match them in 14% of arXiv articles.

For cases like this where math mode gets interrupted, as with your example $\sim 4 \times$ $10^{6}$..., the parse warning is also useful feedback to authors using latexml, who may want to edit their formulas to parse grammatically. latexml already covers a few choice cases where it extracts an isolated construct out of math mode and deposits it in a textual element. We have been discussing - but haven't yet settled on - a few choice cases where adjacent math elements may be merge-able, so as to improve the parsing success rates.
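To make the unbalanced-parentheses problem concrete, here is a toy balance check. It is not latexml's MathGrammar, and `is_balanced` is a hypothetical helper. It only shows why each fragment of $(1+$ z $)^3$ fails in isolation, while the merged formula does not.

```python
def is_balanced(fragment: str) -> bool:
    """Return True if every '(' in the fragment has a matching ')'."""
    depth = 0
    for ch in fragment:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a CLOSE arrived before any OPEN
                return False
    return depth == 0


# The two math fragments from $(1+$ z $)^3$, with the text "z" in between.
fragments = ["(1+", ")^3"]
print([is_balanced(f) for f in fragments])  # [False, False]

# Merging the adjacent pieces into one formula restores the balance.
print(is_balanced("(1+" + "z" + ")^3"))  # True
```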

And lastly - triaging issues. We have been somewhat disciplined in adding support for the "next most needed" package in arXiv with the limited time we have available for that work. And likewise - in fixing the "next most common" Error and Fatal issues.

The aggregate reports are indeed very informative in that regard. But as the issues in this repository often point out, there is a difference between broad coverage over arXiv and pixel-perfect individual articles. Sometimes we will manage to preserve the content (and thus have no log messages emitted by latexml), but be inaccurate with the exact sizing and styling of the emitted HTML elements - which leads to clunky rendering. In those cases especially, having the arXiv community report the problems back is extremely helpful, as they may remain invisible to us otherwise.

Hope some of that is helpful - I think we are very much in alignment on how to generally approach solving the "arXiv to HTML problem".


dginev commented on June 2, 2024

Oh and a fun note at the end: if you want to draw 10,000 ar5iv articles at random, consider using the "feeling lucky" feature. One can fetch the URL

https://ar5iv.labs.arxiv.org/feeling_lucky

and follow the redirect - each visit should lead to a different article. Then swapping /html/ for /log/ you can obtain the log info.
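A sketch of that recipe follows, assuming the redirect lands on a URL containing `/html/` as described above. The helper names are hypothetical, and the exact shape of the redirect target is not verified here.

```python
import urllib.request

# URL quoted in the comment above.
LUCKY_URL = "https://ar5iv.labs.arxiv.org/feeling_lucky"


def html_to_log_url(article_url: str) -> str:
    """Swap the first /html/ path segment for /log/."""
    if "/html/" not in article_url:
        raise ValueError(f"not an ar5iv article URL: {article_url}")
    return article_url.replace("/html/", "/log/", 1)


def random_log_url() -> str:
    """Follow the feeling_lucky redirect and return the matching log URL."""
    # urlopen follows redirects; response.url is the final, post-redirect URL.
    with urllib.request.urlopen(LUCKY_URL, timeout=30) as response:
        return html_to_log_url(response.url)
```

Calling `random_log_url()` repeatedly, with a pause between requests, would yield a random sample of log URLs, although duplicates are possible and would need to be filtered out.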


dginev commented on June 2, 2024

Thanks again for the conversation and detailed testing of ar5iv!

I will close here, but you are always welcome to open more issues for specific articles, feature requests, or general quality-of-life upgrades for latexml.

