google-research / arxiv-latex-cleaner

arXiv LaTeX Cleaner: Easily clean the LaTeX code of your paper to submit to arXiv

License: Apache License 2.0

Python 95.70% TeX 4.30%
arxiv latex

arxiv-latex-cleaner's Introduction

Google Research

This repository contains code released by Google Research.

All datasets in this repository are released under the CC BY 4.0 International license, which can be found here: https://creativecommons.org/licenses/by/4.0/legalcode. All source files in this repository are released under the Apache 2.0 license, the text of which can be found in the LICENSE file.


Because the repo is large, we recommend you download only the subdirectory of interest:

SUBDIR=foo
svn export https://github.com/google-research/google-research/trunk/$SUBDIR

If you'd like to submit a pull request, you'll need to clone the repository; we recommend making a shallow clone (without history).

git clone git@github.com:google-research/google-research.git --depth=1

Disclaimer: This is not an official Google product.

Updated in 2023.

arxiv-latex-cleaner's People

Contributors

aditya95sriram, akwick, andryandrew, andylin-hao, bryant1410, domoritz, dylduhamel, eric-heiden, giulioromualdi, hayesall, jaywonchung, jessicajzhang03, jonasschult, jponttuset, ldes89150, merajhashemi, miweiss, mmore500, mrnabati, nzw0301, phguo, philgzl, pingchuanma, sdnr, sebymiano, so-cool, therealsupermario, tryone144, vaufreyd, wazizian

arxiv-latex-cleaner's Issues

Unneeded files are copied.

When some input command is commented out (e.g., % \input{some/file}) the referenced file is copied to the output directory anyhow.
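A minimal sketch of one way to avoid this, assuming references are collected with a regular expression: strip comments first, then search. The name `referenced_inputs` and both patterns are hypothetical, not taken from the tool, and the comment pattern is simplified (it does not handle `\\%`).

```python
import re

# Hypothetical patterns for illustration; the tool's own may differ.
_COMMENT = re.compile(r'(?<!\\)%.*')
_INPUT = re.compile(r'\\(?:input|include)\{([^}]+)\}')


def referenced_inputs(tex_source):
  r"""Returns files named by \input/\include, ignoring commented-out text."""
  uncommented = '\n'.join(
      _COMMENT.sub('', line) for line in tex_source.splitlines())
  return _INPUT.findall(uncommented)
```

With this ordering, a file named only inside a comment is never reported as referenced, so it would not be copied.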

Fix references with special characters

Files with names with special characters such as (, ) or are not processed by arxiv, hence corresponding references will break.

Feature request: automatically replace file references and referenced file names with 'clean' ones.

Support `biber` for references

When using Biber it may be required to add the following line at the top of the TEX file:

% !BIB program = biber

Wouldn't it make sense to whitelist or include in RE such cases?
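One possible shape for such a whitelist, as a sketch only: leave a line untouched when it is a "magic comment" of the form `% !TEX ...` or `% !BIB ...`. The function name and both patterns are assumptions made for illustration.

```python
import re

# Assumed definition of a magic comment: '%', optional spaces, '!', then TEX or BIB.
_MAGIC_COMMENT = re.compile(r'^\s*%\s*!\s*(?:TEX|BIB)\b', re.IGNORECASE)
# Simplified comment pattern: an unescaped '%' up to the end of the line.
_COMMENT = re.compile(r'(?<!\\)%.*')


def strip_comment_keep_magic(line):
  """Strips the comment from a line unless the line is a magic comment."""
  if _MAGIC_COMMENT.match(line):
    return line
  return _COMMENT.sub('%', line)
```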

Not working on Windows

Dear authors,

Thank you very much for the arxiv cleaner (I was starting to write my own when I discovered by chance that you already had).

Your current version does not work on Windows because Python uses '\' as the path separator (the value of os.sep), and this breaks the regular expressions. I made a version that also works on Windows (replacing os.path.join to make it behave as it does on Linux). Should I share the code and open a pull request?

Best regards. Doms.
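A sketch of one alternative to replacing os.path.join, assuming the goal is only to compare file paths against the forward-slash paths written in .tex sources: normalize every path to POSIX form before it reaches a regular expression. `to_posix` is a hypothetical helper, not part of the tool.

```python
from pathlib import PureWindowsPath


def to_posix(path_str):
  """Returns path_str with backslash separators turned into forward slashes.

  PureWindowsPath accepts both '\\' and '/' as separators, so this is safe to
  call on paths that are already in POSIX form. Caveat: on Linux or macOS a
  backslash can be a legitimate filename character, which this would rewrite.
  """
  return PureWindowsPath(path_str).as_posix()
```

For example, `to_posix(r'images\plots\fig1.png')` gives `images/plots/fig1.png`, which matches what `\includegraphics{images/plots/fig1.png}` contains.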

commented \end{document} is processed

Hi,

Thanks for putting this out there, it's a pretty neat and useful tool!

It seems that you rely on parsing \end{document} to truncate the .tex file? If this \end{document} is commented, this still truncates the whole file afterwards. Maybe you should run this stripping after all comments have been removed so as to not delete content inadvertently?

Adrien
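A minimal sketch of the suggested ordering, assuming the file is held as a list of lines: each line is checked with its comment removed, so a commented-out \end{document} no longer truncates the file. The function name is hypothetical and the comment pattern is simplified.

```python
import re

# Simplified comment pattern: an unescaped '%' up to the end of the line.
_COMMENT = re.compile(r'(?<!\\)%.*')


def truncate_after_end_document(lines):
  r"""Keeps lines up to the first \end{document} that is not commented out."""
  for i, line in enumerate(lines):
    if r'\end{document}' in _COMMENT.sub('', line):
      return lines[:i + 1]
  return lines
```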

PyPI deployment

Hi, is there any particular reason why this package is not deployed to PyPI?

Should ignore everything after \end{document}

If an image file is referenced after \end{document}, it should not be included. A suggested fix to _read_tex_file_content() is below.

def _strip_tex_contents(lines, end_str):
  """Returns lines up to and including the first one that contains end_str."""
  for i, line in enumerate(lines):
    if end_str in line:
      return lines[:i + 1]  # Drop every line past '\\end{document}'.
  return lines

def _read_tex_file_content(filename):
  with open(filename, 'r', encoding='utf-8') as fp:
    lines = fp.readlines()
  return _strip_tex_contents(lines, '\\end{document}')

arXiv complains but acts on it and processing works

When submitting on arXiv, I got the following messages:

*** File .DS_Store has been removed ***

Removed hidden file .DS_Store

REMOVING main.pdf due to name conflict

*** File main.pdf has been removed ***

Everything worked out fine once arXiv had made that correction itself. I was wondering whether it is within the scope of this package to anticipate arXiv's corrections (i.e., whether there is any edge case where anticipating them is actually useful).

[Unexpected Behavior] Comments Not Removed on Lines with "auto-ignore"

While fixing issue #91, I discovered that comments in lines containing the word auto-ignore are not being removed as expected. This behavior is not documented in either the README or the help message, which may lead to unexpected outcomes for users.
I suppose there is a specific reason for this behavior as it is also being tested 🤔

Example

Input:

Foo auto-ignore Bar ... % Top Secret Comment

Output:

Foo auto-ignore Bar ... % Top Secret Comment

Expected Output:

Foo auto-ignore Bar ... %

URL cut off when it contains `%` symbol

Hi, when a URL contains a % symbol, everything from the % onward is treated as a comment and cut off. For example

\url{https://www.example.com/hello%20world}

becomes

\url{https://www.example.com/hello
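A sketch of a left-to-right scan that avoids this, under the assumption that comments are stripped line by line and that a bare "%" is left in place of the comment. The function name is hypothetical. The scan skips a whole `\url{...}` and any backslash-escaped character, so only a "%" that really starts a comment ends the line.

```python
import re

_URL = re.compile(r'\\url\{[^}]*\}')


def strip_comment(line):
  r"""Strips a trailing comment from one line (given without its newline).

  A '%' inside \url{...} or written as \% is kept.
  """
  i = 0
  while i < len(line):
    url = _URL.match(line, i)
    if url:
      i = url.end()            # Skip the whole \url{...} argument.
    elif line[i] == '\\':
      i += 2                   # Skip an escaped character such as \% or \\.
    elif line[i] == '%':
      return line[:i] + '%'    # Real comment: keep only the '%'.
    else:
      i += 1
  return line
```

Because a `\url{...}` that appears after the comment character is never reached, a comment such as `% (\url{topsecret.com})` is removed as a whole.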

TypeError: 'encoding' is an invalid keyword argument for this function

I ran the script and it gave me this error.

I guess it could be a compatibility issue?

The full error log is:

Traceback (most recent call last):
  File "/usr/local/bin/arxiv_latex_cleaner", line 7, in <module>
    from arxiv_latex_cleaner.__main__ import __main__
  File "/Library/Python/2.7/site-packages/arxiv_latex_cleaner/__main__.py", line 91, in <module>
    run_arxiv_cleaner(ARGS)
  File "/Library/Python/2.7/site-packages/arxiv_latex_cleaner/arxiv_latex_cleaner.py", line 385, in run_arxiv_cleaner
    splits['tex_in_root'] + splits['tex_not_in_root'], parameters)
  File "/Library/Python/2.7/site-packages/arxiv_latex_cleaner/arxiv_latex_cleaner.py", line 166, in _read_all_tex_contents
    os.path.join(parameters['input_folder'], fn))
  File "/Library/Python/2.7/site-packages/arxiv_latex_cleaner/arxiv_latex_cleaner.py", line 158, in _read_file_content
    with open(filename, 'r', encoding='utf-8') as fp:
TypeError: 'encoding' is an invalid keyword argument for this function

--commands_to_delete hangs forever

Here's a minimal example. If my source file includes:

\todo1{
\begin{figure}
\caption{\todo2{\emph{problem}}}
\end{figure}
}

When running:

python3.8 -m arxiv_latex_cleaner --commands_to_delete todo1 todo2 --verbose sources

It seems to hang forever on the above file. In my attempts, removing any of the todo1, todo2, figure, or emph seems to make the problem go away...

Commands inside commands

The option to delete user-defined commands (e.g. \todo{}), won't work if there is another command inside (e.g. \todo{\textit{}}).
This is because I'm detecting the commands as \todo{"anything but braces"}. One would need to detect the closing brace from the command and delete until there.
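The brace matching described above can be sketched as follows. This is an illustration, not the tool's implementation: `delete_command` is a hypothetical name, and the sketch assumes the command is written as `\name{` with no optional argument or whitespace before the brace.

```python
def delete_command(text, name):
  r"""Deletes every '\name{...}' from text, honoring nested braces."""
  needle = '\\' + name + '{'
  out = []
  i = 0
  while True:
    start = text.find(needle, i)
    if start == -1:
      out.append(text[i:])
      return ''.join(out)
    out.append(text[i:start])
    depth = 1
    j = start + len(needle)
    while j < len(text) and depth:
      if text[j] == '\\':
        j += 2  # Skip escaped characters such as \{ and \}.
        continue
      if text[j] == '{':
        depth += 1
      elif text[j] == '}':
        depth -= 1
      j += 1
    i = j  # Resume after the matching closing brace.
```

The scan only moves forward, so it finishes on any input, including one with unbalanced braces.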

Testing fails

I ran this command
python -m unittest arxiv_latex_cleaner.tests.arxiv_latex_cleaner_test
and four unit tests failed.

Also, when I run arxiv_latex_cleaner to make my LaTeX files compatible with arXiv, it fails at the line out_folder = os.path.abspath(input_folder).removesuffix('.zip') + '_arXiv' with
AttributeError: 'str' object has no attribute 'removesuffix'
which is the same error the unit tests fail with.
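For context, str.removesuffix was added in Python 3.9, so this error is what an older interpreter would raise. A small fallback that behaves the same way on earlier versions could look like this (`remove_suffix` is a hypothetical helper, not part of the tool):

```python
def remove_suffix(text, suffix):
  """Mirrors str.removesuffix (Python 3.9+) on older interpreters."""
  if suffix and text.endswith(suffix):
    return text[:-len(suffix)]
  return text
```

The failing line could then be written as `remove_suffix(os.path.abspath(input_folder), '.zip') + '_arXiv'`.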

Is it possible to update `absl-py>=0.12`?

Old absl-py versions do not work with python 3.10 because they incorrectly do a string comparison on the python version, see this issue. This was fixed with v0.12. Is there a particular reason to require absl-py~=0.6.1?

UnicodeDecodeError in Windows

Hello,

Thanks very much for this useful tool!

Recently I met a small issue when running the tool:

C:\>arxiv_latex_cleaner ./paper
Traceback (most recent call last):
  File "c:\programdata\anaconda3\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "c:\programdata\anaconda3\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\ProgramData\Anaconda3\Scripts\arxiv_latex_cleaner.exe\__main__.py", line 4, in <module>
  File "c:\programdata\anaconda3\lib\site-packages\arxiv_latex_cleaner\__main__.py", line 87, in <module>
    run_arxiv_cleaner(ARGS)
  File "c:\programdata\anaconda3\lib\site-packages\arxiv_latex_cleaner\arxiv_latex_cleaner.py", line 330, in run_arxiv_cleaner
    splits['tex_in_root'] + splits['tex_not_in_root'], parameters)
  File "c:\programdata\anaconda3\lib\site-packages\arxiv_latex_cleaner\arxiv_latex_cleaner.py", line 138, in _read_all_tex_contents
    os.path.join(parameters['input_folder'], fn))
  File "c:\programdata\anaconda3\lib\site-packages\arxiv_latex_cleaner\arxiv_latex_cleaner.py", line 131, in _read_file_content
    return fp.readlines()
UnicodeDecodeError: 'gbk' codec can't decode byte 0x99 in position 5722: illegal multibyte sequence

The reason could be that the default encoding on this Windows system is 'gbk'. Thus changing with open(filename, 'r') as fp to with open(filename, 'r', encoding='utf-8') as fp may solve the issue.

Script slow with many files

Hi,

using the cleaner on a folder with many files (~1000), the script appeared to hang. However, it actually was processing, just very slow.

I was able to track down the slowness to this line:

if item not in _keep_pattern(haystack, patterns_to_remove)

When passing the list of files to _remove_pattern from _list_all_files, _keep_pattern is called per file, but with the full list of files to the haystack argument. Thus, _remove_pattern has quadratic complexity when it should have linear complexity.

Changing the line in question to

      if item not in _keep_pattern([item], patterns_to_remove)

fixed the problem.

However, even with quadratic complexity, this operation should not be so slow with just 1000 files; I suspect the regex operation regex.findall(rem, item) in _keep_pattern to be an additional cause for slowness, because it has to compile the search pattern on each invocation (a slow operation in regex parsing). It might be worthwhile to compile the pattern into a regex object only once, and change _keep_pattern, _remove_pattern to directly accept regex objects, instead of string patterns.
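A sketch of the linear-time variant with patterns compiled once. It assumes an item is dropped when any pattern matches somewhere in it; `remove_pattern` here is an illustrative stand-in, not the tool's `_remove_pattern`.

```python
import re


def remove_pattern(haystack, patterns_to_remove):
  """Returns the items of haystack that match none of the patterns."""
  compiled = [re.compile(p) for p in patterns_to_remove]  # Compile once.
  return [
      item for item in haystack
      if not any(c.search(item) for c in compiled)
  ]
```

Each item is tested once against each pattern, so the cost grows with the number of files times the number of patterns rather than with the square of the number of files.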

[BUG] Inline comment containing a url is not removed

Hi 👋
When there is an inline comment followed by a \url{} command, the inline comment is not removed as expected. This issue seems to have been introduced while addressing #82.

Example

Input:

Hello! % (\url{topsecret.com})

Output:

Hello! %
\url{topsecret.com})

Expected Output:

Hello! %

Feature request: more universal and robust comments deletion?

The following example covers several scenarios that matter when deleting comments from .tex files:

\begin{document}
hello world
% comment after %
20\% just a percent 
new line \\%still a comment
\begin{comment}
comment even without percent % haha
\end{comment}
\begin{verbatim}
% not a comment though with %
\end{verbatim}
\end{document}
after document

In current version of this script:

  1. The verbatim package is not supported: inside a verbatim environment no character starts a comment, not even %.

  2. A comment directly attached to a line break \\ is not deleted: the script reads \\% as \ \% where \% is a literal percent sign instead of the beginning of a comment.

  3. It may also make sense to delete everything below \end{document}, since that text never affects the compiled document.

Items 1 and 3 are more like feature requests, while I think item 2 should be fixed.

Eps files

Hi,

Thanks for open-sourcing this! I have a feature request related to .eps files. My understanding is that common TeX installations convert each existing myfile.eps into a file myfile-eps-converted-to.pdf, and this file is then looked up when the PDF is compiled.

When I try to use the arxiv-latex-cleaner on a LaTeX project with .eps files, the *-eps-converted-to.pdf are generated in the old project folder, but not the new cleaned-up folder. I am also not sure whether these files are compressed.

The minimal fix I am looking for is to automatically move the *-eps-converted-to.pdf files from the old to the new folder. I might eventually fix this myself, but just letting you know about this problem. Thanks!

comment symbol "%" remove

Suggestion: comments should be replaced by "%" rather than by blank text.

"%" can be important for holding a position (for example, a "%" at the end of a line suppresses the space the line break would otherwise introduce). Deleting it may cause errors.

Images referenced from parent directory not copied over

Project structures where tex files are nested in a folder result in images not being copied.

images/
  - image_0.png
  - image_1.png
  - image_2.png
sections/
  - section_0.tex
  - section_1.tex  
  - section_2.tex
main.tex

For example, if section_0.tex references image ../images/image_0.png, then image_0.png is not copied.

suggestion: eps to pdf

arXiv recommends using unix command:

for i in *ps; do ps2pdf -DEPSCrop $i; done;

to replace all the eps figures with PDF to significantly reduce size.

It could be great to integrate this function as well. Just a suggestion.

Regards,

Mike

Support for comment package?

Thanks for this great project. It would be nice to see support for the comment package.

\begin{comment}
This can also be treated as comments and properly removed.
\end{comment}

Deletes referenced .tex source files

It seems the tool has a few ways of recognizing source files that are referenced, but it doesn't recognize all of them. In particular, the import package is a modern solution for including files whose syntax looks like \import{sections/}{section1-1.tex}: https://www.overleaf.com/learn/latex/Management_in_a_large_project#Using_the_import_package

Such files are deleted by this cleaner. It would be nice if the tool finds a way to recognize these files. If not, it seems that at least there should be an option to give a source file whitelist (similar to the image whitelist) of filepaths that will be cleaned and not deleted.
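A sketch of how such references could be recognized, assuming the two-argument form shown above, where the first argument is a directory ending in "/" and the second is the file. The pattern and the name `imported_files` are illustrative and cover only `\import` and `\subimport`.

```python
import re

# Matches \import{dir/}{file} and \subimport{dir/}{file}, optionally starred.
_IMPORT = re.compile(r'\\(?:sub)?import\*?\{([^}]*)\}\{([^}]*)\}')


def imported_files(tex_source):
  """Returns the 'dir/file' paths named by import-package commands."""
  return [directory + name for directory, name in _IMPORT.findall(tex_source)]
```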

Commands_to_delete ignores commands with options

It seems that commands with preceding options are not filtered, e.g.:

arxiv_latex_cleaner "${dir}" --commands_to_delete "todo"

This will not drop the following command:

\todo[inline]{valve pls fix}

Did I use the wrong command?

PS: Thank you for this great tool.
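For illustration, a pattern that also accepts an optional `[...]` argument before the braces might look like the following. This is a sketch with a hypothetical helper name, and like the simple "anything but braces" approach it does not handle nested braces.

```python
import re


def command_pattern(name):
  r"""Builds a regex for '\name{...}' with an optional '[...]' argument."""
  return re.compile(
      r'\\' + re.escape(name) + r'(?:\[[^\]]*\])?\{[^{}]*\}')
```

Applied as `command_pattern('todo').sub('', text)`, it removes both `\todo{...}` and `\todo[inline]{...}`.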

New encoding error as a result of fixing previous one.

I used pip to install and when trying to use the cleaner ran into a new encoding error.
Googling around suggested that it's because some versions of python don't support the added encoding utf8 change.
It was fixed by adding "from io import open" to the top of the arxiv-latex-cleaner.py file.

How to delete extra blank lines?

This is a very useful tool. But after the comments are removed, many consecutive blank lines are left behind. How should I delete them?
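One way to do this as a post-processing step on the cleaned files, sketched with a hypothetical helper: collapse any run of two or more blank (or whitespace-only) lines into a single blank line, so paragraph breaks are preserved.

```python
import re


def collapse_blank_lines(text):
  """Reduces every run of blank lines to exactly one blank line."""
  return re.sub(r'\n(?:[ \t]*\n){2,}', '\n\n', text)
```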

Removing "comments" in code blocks

Hi, thank you for your wonderful work!

When running arxiv-latex-cleaner on projects that use minted or lstlisting code blocks, a % symbol within a code block causes the cleaner to delete all characters from the position of the % onward.

It would be great if the comment cleaning is turned off on such code blocks, since % here is not intended as a LaTeX comment.
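A sketch of how comment stripping could be switched off inside such blocks, assuming line-by-line processing. The environment list is illustrative and incomplete, the function name is hypothetical, and the comment pattern is simplified (it does not handle `\\%`).

```python
import re

# Environments whose bodies are taken literally (illustrative list).
_VERBATIM_ENVS = ('verbatim', 'lstlisting', 'minted')
_BEGIN = re.compile(r'\\begin\{(%s)\}' % '|'.join(_VERBATIM_ENVS))
_COMMENT = re.compile(r'(?<!\\)%.*')


def strip_comments(lines):
  """Strips comments from lines, leaving verbatim-like environments intact."""
  out = []
  inside = None  # Name of the verbatim-like environment we are in, if any.
  for line in lines:
    if inside is None:
      begin = _BEGIN.search(_COMMENT.sub('', line))
      if begin is None:
        out.append(_COMMENT.sub('%', line))
        continue
      inside = begin.group(1)
    out.append(line)  # Lines of the environment are copied unchanged.
    if ('\\end{%s}' % inside) in line:
      inside = None
  return out
```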

Custom command \figref is treated as \fi

I have a custom command \figref:

\newcommand\figref[1]{Figure~\ref{fig:#1}}

I have an instance of that command in an \iffalse ... \fi block, and the \figref command is misinterpreted as \fi:

So, arxiv-latex-cleaner changes this:

\iffalse
... \figref{myfigure}
\fi

to the following:

gref{myfigure}
\fi
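The misread can be avoided by requiring that `\fi` is not followed by another letter, since TeX control words extend over all consecutive letters. A sketch with an illustrative pattern (not the tool's own, and not handling nested conditionals):

```python
import re

# \iffalse ... \fi, where neither control word may continue with a letter.
_IFFALSE_BLOCK = re.compile(
    r'\\iffalse(?![a-zA-Z]).*?\\fi(?![a-zA-Z])', re.DOTALL)


def remove_iffalse(text):
  r"""Removes \iffalse ... \fi blocks without mistaking \figref for \fi."""
  return _IFFALSE_BLOCK.sub('', text)
```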

Image reference check wrongly admits prefix of referenced image

Let's say I have two images called include_image_yes.png and include_image.png, and I only reference include_image_yes.png while include_image.png is not referenced anywhere. The tool thinks of include_image.png as being referenced just because it appears as a prefix of include_image_yes.png (extensions are ignored for this check, which is unavoidable). This happens even when the images are not in root and inside some directory like images.

I've tracked it down to the _keep_only_referenced function (link) and so it's possible that this affects more than just images. A possible fix could be to use a slightly more elaborate regex check which searches for include_image\b, with a word boundary metacharacter to prevent such false positives.
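A sketch of such a check, using lookarounds instead of `\b` so that names ending in a non-word character are handled as well. `is_referenced` is a hypothetical helper; it assumes file stems consist of word characters, hyphens and path separators.

```python
import re


def is_referenced(stem, tex_source):
  """True if stem occurs in tex_source as a whole name, not as a fragment."""
  pattern = r'(?<![\w-])' + re.escape(stem) + r'(?![\w-])'
  return re.search(pattern, tex_source) is not None
```

With `\includegraphics{images/include_image_yes.png}` as the only reference, `images/include_image_yes` is reported as referenced and `images/include_image` is not.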

Replace tikz pictures included using \input{tikzFile.tikz} with externalized pdfs

Hello,

I noticed that tikz pictures will not be replaced with their respective externalized pdf if I include them using, e.g.

\tikzsetnextfilename{tikzFile}
\input{tikzFile.tikz}

Since this is my usual workflow, is there a way to prevent me from having to replace all those input commands with the tikz pictures they refer to?

Thank you!

Referenced .sty and .bst Files In Subdirectory Are Not Included

I have the following directory structure:

main.tex
style/example_pkg.sty
style/example_bib.bst

In my root latex file I reference these files:

\usepackage[pagenumbers]{style/example_pkg}
\bibliographystyle{style/example_bib}

The problem is that after running the latex cleaner these are no longer present in the output directory.

arxiv_latex_cleaner . --verbose

The problematic output directory structure is the following, completely missing the files in /style/*

main.tex

Note that if we instead do not place these files in the subdirectory /style/*, and they are simply contained in the root directory /*, then everything works as expected.

main.tex
example_pkg.sty
example_bib.bst

Note that we also need to update the references in the commands.

\usepackage[pagenumbers]{example_pkg}
\bibliographystyle{example_bib}

The output is now as expected.

Nested \iffalse \fi block comments.

I used \iffalse ... \fi to block comment in my latex document, and used this modification of the _remove_environment command:

import re

def _remove_iffalse(text):
  """Removes '\\iffalse *\\fi' from 'text'.

  This has problems with nested \\iffalse \\fi statements.
  """
  return re.sub(r'\\iffalse[\s\S]*?\\fi', '', text)

However, this runs incorrectly on:

\iffalse
A
\iffalse
B
\fi
C
\fi

Which in latex outputs nothing, but with the _remove_iffalse code above outputs:

C
\fi

(I had one such nested comment in my document, because of commenting out a subsection of a section that was later commented out in its entirety.)

A similar problem does not exist for \begin{comment} \end{comment}, because

\begin{comment}
A
\begin{comment}
B
\end{comment}
C
\end{comment}

Does not compile in Latex.
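A depth-counting sketch that handles the nested case. It is an illustration under stated assumptions: only `\iffalse` and `\fi` are tracked, so other conditionals or an `\else` inside the block are not supported, and `remove_nested_iffalse` is a hypothetical name.

```python
import re

_TOKEN = re.compile(r'\\(iffalse|fi)(?![a-zA-Z])')


def remove_nested_iffalse(text):
  r"""Removes \iffalse ... \fi blocks, including nested ones."""
  out, depth, pos = [], 0, 0
  for m in _TOKEN.finditer(text):
    if m.group(1) == 'iffalse':
      if depth == 0:
        out.append(text[pos:m.start()])  # Keep text before the outer block.
        pos = m.start()
      depth += 1
    elif depth > 0:  # A \fi closing an open \iffalse.
      depth -= 1
      if depth == 0:
        pos = m.end()
  out.append(text[pos:])  # An unterminated block is left untouched.
  return ''.join(out)
```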

An option to flatten the file structure.

Although this was made to make TeX repos compatible with arXiv submission, I think an option to flatten the subdirectories would also make this tool usable for submissions to other platforms.

Description of the feature:
\includegraphics{activation/Activations.pdf}
these file include lines can be shortened to
\includegraphics{activation_Activations.pdf}

and the activation/Activations.pdf copied to same directory as the tex file.
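The rewrite described above can be sketched as follows, assuming only `\includegraphics` references need flattening. `flatten_includes` is a hypothetical helper; it returns the rewritten source together with a mapping from old to new paths that a caller could use to copy the files. Name collisions (for example `a_b/c.pdf` and `a/b_c.pdf`) are not handled.

```python
import re

_INCLUDE = re.compile(r'(\\includegraphics(?:\[[^\]]*\])?\{)([^}]+)(\})')


def flatten_includes(tex_source):
  """Replaces '/' with '_' in \\includegraphics paths; returns (source, renames)."""
  renames = {}

  def _rewrite(match):
    flat = match.group(2).replace('/', '_')
    renames[match.group(2)] = flat
    return match.group(1) + flat + match.group(3)

  return _INCLUDE.sub(_rewrite, tex_source), renames
```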

Blank lines introduced by the `commands_to_delete` option

If a command to be deleted is on a single line in the latex source file, asking arxiv-latex-cleaner to remove the command introduces a spurious blank line / line break / new paragraph.

Steps to reproduce:

  1. Create a latex document with the following content in test/test.tex:
\documentclass{article}
\usepackage{blindtext}
\usepackage{marginnote}

\begin{document}

\blindtext
\marginnote{Test}
\blindtext

\end{document}
  2. Compile test/test.tex (pdf1)
  3. Run the document through arxiv_latex_cleaner test --commands_to_delete \marginnote
  4. Compile test_arXiv/test.tex (pdf2)
  5. Compare the results.
  • pdf1 shows a single paragraph with a margin note
  • pdf2 shows 2 paragraphs and no margin note.

Expected result:

  • pdf2 should show a single paragraph and no margin note

Additional context

I am using "[email protected]" from a recent pip install

The documentation for commands_to_delete is a bit confusing. Reading through the help message, I understood that the input folder argument should be placed at the end. I thus first tried calling
arxiv_latex_cleaner --commands_to_delete \marginnote test
but got an error message back
[email protected]: error: the following arguments are required: input_folder

For the record, in my current workflow I use the following piece of code in my latex source which works for me but is not very scalable:

% Comment out the next line if you do not want to show notes
\def\shownotes{} % set to true
\ifdefined\shownotes
\usepackage{marginnote}
\else
\newcommand{\marginnote}[1]{\ignorespaces}
\fi

Credit for the \ignorespaces part: https://tex.stackexchange.com/a/201818
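A sketch of one way the tool could avoid the spurious paragraph break: when the deleted command is the only thing on its line, remove the whole line including its newline. The helper name is hypothetical, and the pattern handles neither nested braces nor optional arguments.

```python
import re


def delete_command_and_empty_line(text, name):
  """Deletes '\\name{...}'; a line holding only the command is dropped whole."""
  command = r'\\' + re.escape(name) + r'\{[^{}]*\}'
  pattern = re.compile(
      r'^[ \t]*' + command + r'[ \t]*\n|' + command, re.MULTILINE)
  return pattern.sub('', text)
```

On the test document above, the two `\blindtext` lines end up adjacent, so they stay in one paragraph.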
