Comments (3)
Hi Jonathan,
Thank you for looking into ar5iv and starting this discussion.
Some general notes first:
- Naturally, we've experienced how overwhelming it can be to deal with a large corpus of TeX, so we created a build system to systematically aggregate latexml's log messages right at the start of the arXMLiv project, back in ~2007. Anyone can query that system at:
  https://corpora.mathweb.org/corpus/arxmliv/tex%5Fto%5Fhtml
- To download the logs of 10,000 examples, assuming you have a list of arXiv ids, a simple loop that fetches https://ar5iv.labs.arxiv.org/log/<arxivid> is enough. The ar5iv site is actively crawled by many interested parties at the moment (most recently I was told that SearchOnMath deployed a search index over ar5iv). So you are very welcome to grab any log entries you may need.
- We are always open to monitoring new classes of problems that latexml conversion remains "silent" on today. If there is a pattern that can be automatically recognized during conversion, we can add and emit more messages (in 4 severities: Info/Warning/Error/Fatal), to track them at scale.
- I actually find it practical to crowdsource the collection of concrete problems in ar5iv. In recent years, what used to be considered a "large" number in software development has come to be seen as "medium" or even "small". A thousand issues may once have been seen as large, but is now commonplace. Consider the Rust language issues, where we see ~9000 open issues, ~40,000 closed issues and ~60,000 closed pull requests. I think a modern project backed by a large community should aspire to such numbers - and I am hoping the issues in this ar5iv repository will eventually approach this kind of magnitude.
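The "simple loop" for downloading logs could look roughly like the sketch below. It assumes only the URL pattern quoted above; the helper names (`log_url`, `http_get`, `fetch_logs`), the one-second pause and the plain-text decoding are illustrative choices, not part of ar5iv. Whether old-style ids such as astro-ph/0001053 resolve under the same pattern is also an assumption.

```python
import time
import urllib.request
from typing import Callable, Dict, Iterable

# URL pattern quoted in the comment above; everything else is hypothetical.
LOG_BASE = "https://ar5iv.labs.arxiv.org/log/"


def log_url(arxiv_id: str) -> str:
    """Build the log URL for one arXiv id."""
    return LOG_BASE + arxiv_id.strip()


def http_get(url: str) -> str:
    """Fetch a URL and return its body as text."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def fetch_logs(
    arxiv_ids: Iterable[str],
    fetch: Callable[[str], str] = http_get,
    delay_seconds: float = 1.0,
) -> Dict[str, str]:
    """Fetch the log for each id, pausing between requests to stay polite."""
    logs: Dict[str, str] = {}
    for arxiv_id in arxiv_ids:
        logs[arxiv_id] = fetch(log_url(arxiv_id))
        time.sleep(delay_seconds)
    return logs
```

Calling `fetch_logs(["2210.07194", "0906.4999"])` would return a dictionary from id to log text. The `fetch` parameter is there so the loop can be exercised without touching the network.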
On the formula example:
Math parsing warnings are best used specifically to improve latexml's MathGrammar module, and often do not translate into visual degradation in the rendered HTML. latexml has a fallback parsing mode (similar, to an extent, to the way MathJax and KaTeX deal with math) which takes over when the grammatical parse fails.
Certainly there are many remaining upgrades to make and inaccuracies to fix in math parsing, and concrete reports of encountering them are always appreciated. Some are sufficiently difficult that resolving them in full requires swapping the entire grammar engine for one capable of dealing with ambiguity, which is one branch of ongoing work in latexml.
The aggregated report for math parsing failures (reported via the ALLCAPS grammatical categories of the concrete latexml grammar) is here:
https://corpora.mathweb.org/corpus/arxmliv/tex%5Fto%5Fhtml/warning/not%5Fparsed?all=false
As just one example, we've noticed that the most common parsing failure has to do with unbalanced parens, often in frivolous TeX uses such as $(1+$ z $)^3$ found in astro-ph/0001053. There are other reasons for an OPEN not finding its CLOSE, but the report tells us latexml fails to do so in 14% of arXiv articles.
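To see why a formula like this cannot parse, it helps to look at each math span on its own. The sketch below is not latexml's grammar. It is a minimal standalone check that splits a TeX snippet on single `$` delimiters and tests whether the round parentheses inside each span balance. The names `MATH_SPAN`, `math_spans` and `is_balanced` are hypothetical, and escaped `\$` and `$$` display math are deliberately not handled.

```python
import re
from typing import List

# Inline math delimited by single dollar signs (simplifying assumption).
MATH_SPAN = re.compile(r"\$([^$]*)\$")


def math_spans(tex: str) -> List[str]:
    """Return the body of every $...$ span, in order of appearance."""
    return MATH_SPAN.findall(tex)


def is_balanced(formula: str) -> bool:
    """True if every '(' in the formula has a matching later ')'."""
    depth = 0
    for ch in formula:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a close with no open before it
                return False
    return depth == 0


if __name__ == "__main__":
    for span in math_spans(r"$(1+$ z $)^3$"):
        print(repr(span), is_balanced(span))
```

For the snippet from astro-ph/0001053 this yields two spans, `(1+` and `)^3`, neither of which balances by itself, whereas the single span in `$(1+z)^3$` does.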
For cases like this where math mode gets interrupted, as with your example $\sim 4 \times$ $10^{6}$..., the parse warning is also useful feedback to authors using latexml, who may want to edit their formulas to parse grammatically. latexml already covers a few choice cases that would extract an isolated construct out of math mode and deposit it in a textual element. We have been discussing - but haven't yet settled on - a few choice cases where adjacent math elements may be merge-able, so as to improve the parsing success rates.
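As a purely illustrative sketch of what "merge-able" adjacent math could mean, the function below joins neighbouring `$...$` spans when only whitespace separates them. This is a hypothetical heuristic operating on raw TeX strings, not something latexml does or has settled on. The name `merge_adjacent_math` is invented here, and escaped `\$` and `$$` display math are not handled.

```python
import re
from typing import List, Tuple

# Inline math delimited by single dollar signs (simplifying assumption).
MATH_SPAN = re.compile(r"\$([^$]*)\$")


def merge_adjacent_math(tex: str) -> str:
    """Merge consecutive $...$ spans separated only by whitespace."""
    parts: List[Tuple[str, str]] = []  # ("math" | "text", content)
    last_end = 0
    for match in MATH_SPAN.finditer(tex):
        gap = tex[last_end:match.start()]
        body = match.group(1)
        if parts and parts[-1][0] == "math" and gap.isspace():
            # Previous piece is math and only whitespace lies between: join.
            parts[-1] = ("math", parts[-1][1] + " " + body)
        else:
            if gap:
                parts.append(("text", gap))
            parts.append(("math", body))
        last_end = match.end()
    tail = tex[last_end:]
    if tail:
        parts.append(("text", tail))
    return "".join("$" + c + "$" if kind == "math" else c for kind, c in parts)
```

Applied to the example from this thread, `$\sim 4 \times$ $10^{6}$` becomes the single span `$\sim 4 \times 10^{6}$`. The unbalanced `$(1+$ z $)^3$` is left untouched because the text `z` sits between the two spans.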
And lastly - triaging issues. We have been somewhat disciplined in adding support for the "next most needed" package in arXiv with the limited time we have available for that work. And likewise, in fixing the "next most common" Error and Fatal issues.
The aggregate reports are indeed very informative in that regard. But as the issues in this repository often point out, there is a difference between broad coverage over arXiv and pixel-perfect individual articles. Sometimes we will manage to preserve the content (and thus have no log messages emitted by latexml), but be inaccurate with the exact sizing and styling of the emitted HTML elements - which leads to clunky rendering. In those cases especially, having the arXiv community report the problems back is extremely helpful, as they may remain invisible to us otherwise.
Hope some of that is helpful - I think we are very much in alignment on how to generally approach solving the "arXiv to HTML problem".
from ar5iv.
Oh and a fun note at the end: if you want to draw 10,000 ar5iv articles at random, consider using the "feeling lucky" feature. One can fetch the URL
https://ar5iv.labs.arxiv.org/feeling_lucky
and follow the redirect - each visit should lead to a different article. Then, by swapping /html/ for /log/ in the resulting URL, you can obtain the log info.
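A minimal sketch of that trick follows, assuming the redirect lands on an article URL containing `/html/` as described. The function names are hypothetical, and the redirect target has not been verified here.

```python
import urllib.request

# URL quoted in the comment above.
LUCKY_URL = "https://ar5iv.labs.arxiv.org/feeling_lucky"


def html_to_log_url(article_url: str) -> str:
    """Swap the first /html/ path segment for /log/."""
    if "/html/" not in article_url:
        raise ValueError("expected an article URL containing /html/: " + article_url)
    return article_url.replace("/html/", "/log/", 1)


def random_log_url() -> str:
    """Follow the feeling_lucky redirect and return the matching log URL."""
    # urlopen follows redirects by default; geturl() reports the final URL.
    with urllib.request.urlopen(LUCKY_URL, timeout=30) as response:
        return html_to_log_url(response.geturl())
```

Calling `random_log_url()` repeatedly, with a pause between requests, would give a random sample of log URLs to download.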
Thanks again for the conversation and detailed testing of ar5iv!
I will close here, but you are always welcome to open more issues for specific articles, feature requests, or general quality-of-life upgrades for latexml.