google-research / arxiv-latex-cleaner

arXiv LaTeX Cleaner: Easily clean the LaTeX code of your paper to submit to arXiv

License: Apache License 2.0

Python 95.70% TeX 4.30%

arxiv latex

arxiv-latex-cleaner's Introduction

Google Research

This repository contains code released by Google Research.

All datasets in this repository are released under the CC BY 4.0 International license, which can be found here: https://creativecommons.org/licenses/by/4.0/legalcode. All source files in this repository are released under the Apache 2.0 license, the text of which can be found in the LICENSE file.

Because the repo is large, we recommend you download only the subdirectory of interest:

SUBDIR=foo
svn export https://github.com/google-research/google-research/trunk/$SUBDIR

If you'd like to submit a pull request, you'll need to clone the repository; we recommend making a shallow clone (without history).

git clone git@github.com:google-research/google-research.git --depth=1

Disclaimer: This is not an official Google product.

Updated in 2023.

arxiv-latex-cleaner's Issues

bib file gets removed

The .bib files are removed after cleaning on Linux.

When checking whether an image is referenced, \graphicspath is not taken into account

LaTeX has a feature that sets the root of all images with \graphicspath; all image includes are then relative to this root, not to the main tex file's directory.

However, this tool does not know about \graphicspath and therefore thinks that none of the images are referenced and deletes them all.
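One way to take \graphicspath into account is sketched below. This is only an illustration, not the tool's actual code: the helper names find_graphics_paths and candidate_image_paths are hypothetical, and the sketch assumes a single \graphicspath{{dir1/}{dir2/}} declaration with plain directory names.

```python
import posixpath
import re


def find_graphics_paths(tex):
    """Returns the directories listed in the first graphicspath declaration."""
    match = re.search(r'\\graphicspath\s*\{((?:\s*\{[^{}]*\}\s*)+)\}', tex)
    if not match:
        return []
    return re.findall(r'\{([^{}]*)\}', match.group(1))


def candidate_image_paths(reference, graphics_paths):
    """Lists the locations, relative to the main tex file, to check for an image."""
    roots = [''] + list(graphics_paths)
    return [posixpath.normpath(posixpath.join(root, reference)) for root in roots]
```

An image would then count as referenced if any of its candidate paths matches a file on disk.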

.bib files getting deleted

Would be nice to add .bib to the set of restricted file types which aren't deleted.

Unneeded files are copied.

When some input command is commented out (e.g., % \input{some/file}) the referenced file is copied to the output directory anyhow.

Fix references with special characters

Files with names with special characters such as (, ) or are not processed by arxiv, hence corresponding references will break.

Feature request: automatically replace file references and referenced file names with 'clean' ones.

Propose to add an option for not resizing images and figures

Support `biber` for references

When using Biber it may be required to add the following line at the top of the TEX file:

% !BIB program = biber

Wouldn't it make sense to whitelist or include in RE such cases?

Not working on Windows

Dear authors,

Thank you very much for the arxiv cleaner (I was starting to write my own when I discovered by chance that you already had).

Your current version does not work on Windows, because Python uses '\' as the path separator (the value of os.sep) and this breaks the regular expressions. I wrote a version that also works on Windows (replacing os.path.join so that it behaves as on Linux). Should I share the code and open a pull request?

Best regards. Doms.

Always copy the anc folder from the source folder

If there is an anc folder in the source directory, it is intended as ancillary material (per arxiv's directory structure convention) and needs to be copied into the target folder.

commented \end{document} is processed

Hi,

Thanks for putting this out there, it's a pretty neat and useful tool!

It seems that you rely on parsing \end{document} to truncate the .tex file? If this \end{document} is commented, this still truncates the whole file afterwards. Maybe you should run this stripping after all comments have been removed so as to not delete content inadvertently?

Adrien
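A minimal sketch of the idea, assuming comments start at the first unescaped % on a line (it does not handle \\% or verbatim blocks). The function name strip_after_end_document is hypothetical and is not part of the tool:

```python
import re


def strip_after_end_document(lines, end_str='\\end{document}'):
    """Truncates 'lines' after the first uncommented occurrence of 'end_str'."""
    for i, line in enumerate(lines):
        # Ignore everything from the first unescaped '%' when searching.
        uncommented = re.sub(r'(?<!\\)%.*', '', line)
        if end_str in uncommented:
            return lines[:i + 1]
    return lines
```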

PyPI deployment

Hi, is there any particular reason why this package is not deployed to PyPI?

Should ignore everything after \end{document}

If an image file is referenced after \end{document}, it should not be included. A suggested fix to _read_tex_file_content() is below.

def _strip_tex_contents(lines, end_str):
  """Returns 'lines' truncated after the first line containing 'end_str'."""
  for i, line in enumerate(lines):
    if end_str in line:
      return lines[:i + 1]  # remove all lines past '\\end{document}'
  return lines

def _read_tex_file_content(filename):
  with open(filename, 'r', encoding='utf-8') as fp:
    lines = fp.readlines()
  return _strip_tex_contents(lines, '\\end{document}')

Trim \n at the end of Tex file triggers arXiv error

This tool always trims the \n at the end of the TeX file. arXiv then complains with a truncated-file error, so I have to manually add an empty line to each TeX file.

arXiv complains but acts on it and processing works

When submitting on arXiv, I got the following messages:

*** File .DS_Store has been removed ***

Removed hidden file .DS_Store

REMOVING main.pdf due to name conflict

*** File main.pdf has been removed ***

Everything worked out fine once arXiv had made those corrections itself. I was wondering whether it is within the scope of this package to anticipate arXiv's corrections (i.e. whether there is any edge case where anticipating them is actually useful).

[Unexpected Behavior] Comments Not Removed on Lines with "auto-ignore"

While fixing issue #91, I discovered that comments in lines containing the word auto-ignore are not being removed as expected. This behavior is not documented in either the README or the help message, which may lead to unexpected outcomes for users.
I suppose there is a specific reason for this behavior as it is also being tested 🤔

Example

Input:

Foo auto-ignore Bar ... % Top Secret Comment

Output:

Foo auto-ignore Bar ... % Top Secret Comment

Expected Output:

Foo auto-ignore Bar ... %

URL cut off when it contains `%` symbol

Hi, when a URL contains a % symbol, everything from the % onward is treated as a comment and cut off. For example

\url{https://www.example.com/hello%20world}

becomes

\url{https://www.example.com/hello
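One possible approach, shown only as a sketch: match \url{...} and comments with a single alternation, so that a % inside a URL is consumed as part of the URL before it can be read as a comment. The name remove_comment_keep_urls is hypothetical, and the sketch assumes URLs contain no nested braces and that a removed comment is replaced by a bare %.

```python
import re

# Alternative 1 captures a whole \url{...}; alternative 2 captures a comment.
_URL_OR_COMMENT = re.compile(r'(\\url\{[^{}]*\})|((?<!\\)%.*)')


def remove_comment_keep_urls(line):
    """Replaces a trailing comment with '%' but keeps '%' inside \\url{...}."""
    return _URL_OR_COMMENT.sub(
        lambda m: m.group(1) if m.group(1) else '%', line)
```

Because the regex engine scans left to right, a \url that appears inside a comment is swallowed by the comment alternative and removed with it.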

TypeError: 'encoding' is an invalid keyword argument for this function

I ran the script and it gave me this error.

I guess it could be a compatibility issue?

The full error log is:

Traceback (most recent call last):
  File "/usr/local/bin/arxiv_latex_cleaner", line 7, in <module>
    from arxiv_latex_cleaner.__main__ import __main__
  File "/Library/Python/2.7/site-packages/arxiv_latex_cleaner/__main__.py", line 91, in <module>
    run_arxiv_cleaner(ARGS)
  File "/Library/Python/2.7/site-packages/arxiv_latex_cleaner/arxiv_latex_cleaner.py", line 385, in run_arxiv_cleaner
    splits['tex_in_root'] + splits['tex_not_in_root'], parameters)
  File "/Library/Python/2.7/site-packages/arxiv_latex_cleaner/arxiv_latex_cleaner.py", line 166, in _read_all_tex_contents
    os.path.join(parameters['input_folder'], fn))
  File "/Library/Python/2.7/site-packages/arxiv_latex_cleaner/arxiv_latex_cleaner.py", line 158, in _read_file_content
    with open(filename, 'r', encoding='utf-8') as fp:
TypeError: 'encoding' is an invalid keyword argument for this function

[feature request] remove unused figures

I wonder if it is possible to remove the unused figures in the folder during cleaning.

--commands_to_delete hangs forever

Here's a minimal example. If my source file includes:

\todo1{
\begin{figure}
\caption{\todo2{\emph{problem}}}
\end{figure}
}

When running:

python3.8 -m arxiv_latex_cleaner --commands_to_delete todo1 todo2 --verbose sources

It seems to hang forever on the above file. In my attempts, removing any of the todo1, todo2, figure, or emph seems to make the problem go away...

Commands inside commands

The option to delete user-defined commands (e.g. \todo{}), won't work if there is another command inside (e.g. \todo{\textit{}}).
This is because I'm detecting the commands as \todo{"anything but braces"}. One would need to detect the closing brace from the command and delete until there.
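The brace matching described above could look roughly like the following. This is a sketch under assumptions, not the tool's implementation: delete_command is a hypothetical name, and commands with optional [...] arguments are not handled.

```python
def delete_command(text, name):
    r"""Deletes every '\name{...}' from 'text', honoring nested braces."""
    needle = '\\' + name + '{'
    out = []
    pos = 0
    while True:
        start = text.find(needle, pos)
        if start == -1:
            out.append(text[pos:])
            return ''.join(out)
        out.append(text[pos:start])
        depth = 1
        i = start + len(needle)
        # Walk forward until the brace opened by the command is closed.
        while i < len(text) and depth > 0:
            if text[i] == '\\':
                i += 2  # Skip the character after a backslash, e.g. '\{'.
                continue
            if text[i] == '{':
                depth += 1
            elif text[i] == '}':
                depth -= 1
            i += 1
        pos = i
```

Each pass moves pos strictly forward, so the loop terminates even on unbalanced input.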

Testing fails

I ran this command
python -m unittest arxiv_latex_cleaner.tests.arxiv_latex_cleaner_test
and four unit tests failed.

Also, when I run arxiv_latex_cleaner to make my LaTeX files compatible with arXiv, it fails at the line out_folder = os.path.abspath(input_folder).removesuffix('.zip') + '_arXiv' with
AttributeError: 'str' object has no attribute 'removesuffix', which is the same error reported when running the unittest command.
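str.removesuffix only exists in Python 3.9 and later, so this error points at an older interpreter. A small fallback that behaves the same way is sketched here; the helper name remove_suffix is hypothetical:

```python
def remove_suffix(text, suffix):
    """Same result as str.removesuffix, for interpreters older than 3.9."""
    if suffix and text.endswith(suffix):
        return text[:-len(suffix)]
    return text
```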

suggestion: change files created/modified time for better privacy protection.

It would be nice to be able to change the creation time or the modification time of the output files for better privacy protection.
For example, we could add an optional argument such as --modify_time 1/1/2000
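As a rough sketch of what such an option might do, os.utime can overwrite the access and modification times of every output file. The function name set_timestamps is hypothetical, and the creation time cannot be set portably this way.

```python
import os


def set_timestamps(folder, timestamp):
    """Sets access and modification times of all files under 'folder'.

    'timestamp' is in seconds since the epoch, e.g. 946684800 for
    2000-01-01 00:00:00 UTC.
    """
    for root, _, files in os.walk(folder):
        for name in files:
            os.utime(os.path.join(root, name), (timestamp, timestamp))
```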

Is it possible to update `absl-py>=0.12`?

Old absl-py versions do not work with python 3.10 because they incorrectly do a string comparison on the python version, see this issue. This was fixed with v0.12. Is there a particular reason to require absl-py~=0.6.1?

UnicodeDecodeError in Windows

Hello,

Thanks very much for this useful tool!

Recently I met a small issue when running the tool:

C:\>arxiv_latex_cleaner ./paper
Traceback (most recent call last):
  File "c:\programdata\anaconda3\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "c:\programdata\anaconda3\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\ProgramData\Anaconda3\Scripts\arxiv_latex_cleaner.exe\__main__.py", line 4, in <module>
  File "c:\programdata\anaconda3\lib\site-packages\arxiv_latex_cleaner\__main__.py", line 87, in <module>
    run_arxiv_cleaner(ARGS)
  File "c:\programdata\anaconda3\lib\site-packages\arxiv_latex_cleaner\arxiv_latex_cleaner.py", line 330, in run_arxiv_cleaner
    splits['tex_in_root'] + splits['tex_not_in_root'], parameters)
  File "c:\programdata\anaconda3\lib\site-packages\arxiv_latex_cleaner\arxiv_latex_cleaner.py", line 138, in _read_all_tex_contents
    os.path.join(parameters['input_folder'], fn))
  File "c:\programdata\anaconda3\lib\site-packages\arxiv_latex_cleaner\arxiv_latex_cleaner.py", line 131, in _read_file_content
    return fp.readlines()
UnicodeDecodeError: 'gbk' codec can't decode byte 0x99 in position 5722: illegal multibyte sequence

The reason could be that the default encoding on my Windows system is 'gbk'. Changing with open(filename, 'r') as fp to with open(filename, 'r', encoding='utf-8') as fp may solve the issue.

Size-reduce files with ImageOptim or similar

(Thanks for this project)

It could be beneficial to, after rescaling images, also push them through ImageOptim (if on a Mac) or similar "recompressors"

(The nice thing about ImageOptim is, that it tries several different compressors and keeps the best output)

Script slow with many files

Hi,

using the cleaner on a folder with many files (~1000), the script appeared to hang. However, it actually was processing, just very slow.

I was able to track down the slowness to this line:

arxiv-latex-cleaner/arxiv_latex_cleaner/arxiv_latex_cleaner.py

Line 76 in ea9d6db

if item not in _keep_pattern(haystack, patterns_to_remove)

When passing the list of files to _remove_pattern from _list_all_files, _keep_pattern is called per file, but with the full list of files to the haystack argument. Thus, _remove_pattern has quadratic complexity when it should have linear complexity.

Changing the line in question to

      if item not in _keep_pattern([item], patterns_to_remove)

fixed the problem.

However, even with quadratic complexity, this operation should not be so slow with just 1000 files; I suspect the regex operation regex.findall(rem, item) in _keep_pattern to be an additional cause for slowness, because it has to compile the search pattern on each invocation (a slow operation in regex parsing). It might be worthwhile to compile the pattern into a regex object only once, and change _keep_pattern, _remove_pattern to directly accept regex objects, instead of string patterns.
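A sketch of the suggested change, with each pattern compiled once and each file tested on its own, so the work grows linearly with the number of files. The function below is an illustrative stand-in written for this note, not the tool's actual _remove_pattern:

```python
import re


def remove_pattern(haystack, patterns_to_remove):
    """Returns the items of 'haystack' that match none of the patterns."""
    compiled = [re.compile(pattern) for pattern in patterns_to_remove]
    return [
        item for item in haystack
        if not any(regex.search(item) for regex in compiled)
    ]
```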

[BUG] Inline comment containing a url is not removed

Hi 👋
When there is an inline comment followed by a \url{} command, the inline comment is not removed as expected. This issue seems to have been introduced while addressing #82.

Example

Input:

Hello! % (\url{topsecret.com})

Output:

Hello! %
\url{topsecret.com})

Expected Output:

Hello! %

Feature request: more universal and robust comments deletion?

This example is helpful considering various scenarios when deleting comments in tex files. And I show the example below:

\begin{document}
hello world
% comment after %
20\% just a percent 
new line \\%still a comment
\begin{comment}
comment even without percent % haha
\end{comment}
\begin{verbatim}
% not a comment though with %
\end{verbatim}
\end{document}
after document

In the current version of this script:

1. There is no support for the verbatim package: every character inside a verbatim environment is literal text, even after a %.
2. A comment attached directly to the line break \\ is not deleted: \\% is read as \ followed by \%, where \% is a literal percent sign rather than the start of a comment.
3. It may also make sense to delete everything below \end{document}, since it never affects the compiled document.

Items 1 and 3 are more like feature requests, while I think item 2 should be fixed.
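For the \\% case, one possible rule is that a % starts a comment only when it is preceded by an even number of backslashes. A sketch under that assumption follows; remove_comment is a hypothetical name, and verbatim environments are still not handled.

```python
import re

# Group 1 holds any run of complete '\\' pairs directly before the '%'.
_COMMENT = re.compile(r'(?<!\\)((?:\\\\)*)%.*')


def remove_comment(line):
    """Replaces a comment with a bare '%', keeping escaped '\\%' intact."""
    return _COMMENT.sub(r'\1%', line)
```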

Eps files

Hi,

Thanks for open-sourcing this! I have a feature request related to .eps files. My understanding is that common TeX installations convert each existing myfile.eps into a file myfile-eps-converted-to.pdf, and this file is then looked up when the PDF is compiled.

When I try to use the arxiv-latex-cleaner on a LaTeX project with .eps files, the *-eps-converted-to.pdf are generated in the old project folder, but not the new cleaned-up folder. I am also not sure whether these files are compressed.

The minimal fix I am looking for is to automatically move the *-eps-converted-to.pdf files from the old to the new folder. I might eventually fix this myself, but just letting you know about this problem. Thanks!

comment symbol "%" remove

I suggest that comments be replaced by "%" rather than by a blank.

The "%" can be important for holding a position in the source. Deleting it may cause errors.

Images referenced from parent directory not copied over

Project structures where tex files are nested in a folder result in images not being copied.

images/
  - image_0.png
  - image_1.png
  - image_2.png
sections/
  - section_0.tex
  - section_1.tex  
  - section_2.tex
main.tex

For example, if section_0.tex references image ../images/image_0.png, then image_0.png is not copied.
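A sketch of how such a reference could be normalized, assuming (as in the example above) that the path is written relative to the tex file that contains it. The name resolve_reference is hypothetical.

```python
import posixpath


def resolve_reference(tex_file, reference):
    """Resolves 'reference' relative to the folder that holds 'tex_file'."""
    folder = posixpath.dirname(tex_file)
    return posixpath.normpath(posixpath.join(folder, reference))
```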

suggestion: eps to pdf

arXiv recommends using the Unix command:

for i in *ps; do ps2pdf -DEPSCrop $i; done;

to replace all the eps figures with PDF to significantly reduce size.

It could be great to integrate this function as well. Just a suggestion.

Regards,

Mike

Support for comment package?

Thanks for this great project. It would be nice to see support for the comment package.

\begin{comment}
This can also be treated as comments and properly removed.
\end{comment}

Deletes referenced .tex source files

It seems the tool has a few ways of recognizing source files that are referenced, but it doesn't recognize all of them. In particular, the import package is a modern solution for including files whose syntax looks like \import{sections/}{section1-1.tex}: https://www.overleaf.com/learn/latex/Management_in_a_large_project#Using_the_import_package

Such files are deleted by this cleaner. It would be nice if the tool finds a way to recognize these files. If not, it seems that at least there should be an option to give a source file whitelist (similar to the image whitelist) of filepaths that will be cleaned and not deleted.

Commands_to_delete ignores commands with options

It seems that commands with preceding options are not filtered, e.g.:

arxiv_latex_cleaner "${dir}" --commands_to_delete "todo"

This will not drop the following command:

\todo[inline]{valve pls fix}

Did I use the wrong command?

PS: Thank you for this great tool.
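A pattern that also accepts one optional [...] argument could look like the sketch below. It is only an illustration: delete_simple_command is a hypothetical name, and the pattern assumes the braces contain no nested braces.

```python
import re


def delete_simple_command(text, name):
    r"""Deletes '\name{...}' and '\name[...]{...}' when the body has no braces."""
    pattern = r'\\' + re.escape(name) + r'(?:\[[^\]]*\])?\{[^{}]*\}'
    return re.sub(pattern, '', text)
```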

The dot in regex patterns matches any character, while it should match just '.'

Patterns of files to delete contain a dot (e.g. ^.idea). The dot matches any character when used in a regex, not the character . as expected.

This causes some files, such as rideaustin.tex, to be ignored because they match the pattern ^.idea.

This was found thanks to the comment of @nikhgarg in issue #30, and their source files.
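The difference is easy to reproduce in isolation. In the snippet below the names UNESCAPED, ESCAPED and is_removed are illustrative only:

```python
import re

UNESCAPED = re.compile(r'^.idea')  # '.' matches any single character.
ESCAPED = re.compile(r'^\.idea')   # '\.' matches only a literal dot.


def is_removed(filename, pattern):
    """True if 'filename' matches the deletion pattern."""
    return pattern.search(filename) is not None
```

With the unescaped pattern, rideaustin.tex matches because 'r' stands in for the dot; with the escaped pattern only names that really start with .idea match.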

New encoding error as a result of fixing previous one.

I used pip to install and when trying to use the cleaner ran into a new encoding error.
Googling around suggested that it's because some versions of python don't support the added encoding utf8 change.
It was fixed by adding "from io import open" to the top of the arxiv-latex-cleaner.py file.

How to delete extra blank lines?

This is a very useful tool. But after the comments are removed, many consecutive blank lines are left behind. How should I delete them?

Can the package fuse different latex files into one ?

I was looking for a tool that can clean several LaTeX files and merge them into one big tex file while removing unnecessary content.
Is this what happens here?

Removing "comments" in code blocks

Hi, thank you for your wonderful work!

When running arxiv-latex-cleaner on projects that involve minted or lstlisting code blocks, a % symbol within a code block causes the cleaner to delete all the characters from the position of the % onward.

It would be great if the comment cleaning is turned off on such code blocks, since % here is not intended as a LaTeX comment.

Custom command \figref is treated as \fi

I have a custom command \figref:

\newcommand\figref[1]{Figure~\ref{fig:#1}}

I have an instance of that command in an \iffalse ... \fi block, and the \figref command is misinterpreted as \fi:

So, arxiv-latex-cleaner changes this:

\iffalse
... \figref{myfigure}
\fi

to the following:

gref{myfigure}
\fi
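One way to avoid this, sketched here under the assumption that blocks are not nested, is to require that \iffalse and \fi are not followed by another letter, so that \figref is not mistaken for \fi. The name remove_iffalse_block is hypothetical.

```python
import re

# The lookaheads reject longer command names such as \figref or \final.
_IFFALSE = re.compile(
    r'\\iffalse(?![a-zA-Z]).*?\\fi(?![a-zA-Z])', re.DOTALL)


def remove_iffalse_block(text):
    """Removes non-nested '\\iffalse ... \\fi' blocks from 'text'."""
    return _IFFALSE.sub('', text)
```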

does not copy PDF images that contain "()" in file name

I have figures like /manuscript/(a).pdf. They are not copied when I run arxiv-latex-cleaner by arxiv_latex_cleaner manuscript --keep_bib --commands_to_delete cmt todo sout --verbose.

Image reference check wrongly admits prefix of referenced image

Let's say I have two images called include_image_yes.png and include_image.png, and I only reference include_image_yes.png while include_image.png is not referenced anywhere. The tool thinks of include_image.png as being referenced just because it appears as a prefix of include_image_yes.png (extensions are ignored for this check, which is unavoidable). This happens even when the images are not in root and inside some directory like images.

I've tracked it down to the _keep_only_referenced function (link) and so it's possible that this affects more than just images. A possible fix could be to use a slightly more elaborate regex check which searches for include_image\b, with a word boundary metacharacter to prevent such false positives.
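A sketch of such a check is given below. Instead of \b it uses explicit lookarounds that also treat '-' as part of a file name, which is a choice made for this example; is_referenced is a hypothetical name and not the tool's _keep_only_referenced.

```python
import re


def is_referenced(image_stem, tex):
    """True if 'image_stem' occurs in 'tex' and is not part of a longer name."""
    pattern = r'(?<![\w-])' + re.escape(image_stem) + r'(?![\w-])'
    return re.search(pattern, tex) is not None
```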

Replace tikz pictures included using \input{tikzFile.tikz} with externalized pdfs

Hello,

I noticed that tikz pictures will not be replaced with their respective externalized pdf if I include them using, e.g.

\tikzsetnextfilename{tikzFile}
\input{tikzFile.tikz}

Since this is my usual workflow, is there a way to prevent me from having to replace all those input commands with the tikz pictures they refer to?

Thank you!

Referenced .sty and .bst Files In Subdirectory Are Not Included

I have the following directory structure:

main.tex
style/example_pkg.sty
style/example_bib.bst

In my root latex file I reference these files:

\usepackage[pagenumbers]{style/example_pkg}
\bibliographystyle{style/example_bib}

The problem is that after running the latex cleaner these are no longer present in the output directory.

arxiv_latex_cleaner . --verbose

The problematic output directory structure is the following, completely missing the files in /style/*

main.tex

Note that if we instead do not place these files in the subdirectory /style/*, and they are simply contained in the root directory /*, then everything works as expected.

main.tex
example_pkg.sty
example_bib.bst

Note that we also need to update the references in the commands.

\usepackage[pagenumbers]{example_pkg}
\bibliographystyle{example_bib}

The output is now as expected.

Nested \iffalse \fi block comments.

I used \iffalse ... \fi to block comment in my latex document, and used this modification of the _remove_environment command:

import re

def _remove_iffalse(text):
  """Removes '\\iffalse *\\fi' from 'text'.

  This has problems with nested \\iffalse \\fi statements.
  """
  # Lazy match: stops at the first '\fi' after each '\iffalse'.
  return re.sub(r'\\iffalse[\s\S]*?\\fi', '', text)

However, this runs incorrectly on:

\iffalse
A
\iffalse
B
\fi
C
\fi

Which in latex outputs nothing, but with the _remove_iffalse code above outputs:

C
\fi

(I had one such nested comment in my document, because of commenting out a subsection of a section that was later commented out in its entirety.)

A similar problem does not exist for \begin{comment} \end{comment}, because

\begin{comment}
A
\begin{comment}
B
\end{comment}
C
\end{comment}

does not compile in LaTeX.
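Nested blocks can be handled by counting depth instead of using a lazy regex. The following is a sketch only: remove_nested_iffalse is a hypothetical name, and it tracks only \iffalse as an opener, so other conditionals (\ifdefined, \if 0, ...) inside a removed block would confuse it.

```python
import re

_TOKEN = re.compile(r'\\(iffalse|fi)(?![a-zA-Z])')


def remove_nested_iffalse(text):
    """Removes '\\iffalse ... \\fi' blocks, including nested ones."""
    out = []
    depth = 0
    pos = 0
    for match in _TOKEN.finditer(text):
        if depth == 0:
            out.append(text[pos:match.start()])  # Text outside any block.
        if match.group(1) == 'iffalse':
            depth += 1
        elif depth > 0:
            depth -= 1
        else:
            out.append(match.group(0))  # A '\fi' closing some other '\if'.
        pos = match.end()
    if depth == 0:
        out.append(text[pos:])
    return ''.join(out)
```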

An option to flatten the file structure.

Although this was made to make text repos compatible with arXiv submission, I think an option to flatten the subdirectories would also make this tool usable for submission to other platforms.

Description of the feature:
\includegraphics{activation/Activations.pdf}
these file include lines can be shortened to
\includegraphics{activation_Activations.pdf}

and the activation/Activations.pdf copied to same directory as the tex file.
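The rewriting half of this feature could be sketched as follows. The name flatten_includes is hypothetical, only \includegraphics is covered, and the files themselves would still need to be copied under their new names.

```python
import re

_INCLUDE = re.compile(r'(\\includegraphics(?:\[[^\]]*\])?\{)([^{}]*)\}')


def flatten_includes(tex):
    """Rewrites image paths so that sub-directories become name prefixes."""
    return _INCLUDE.sub(
        lambda m: m.group(1) + m.group(2).replace('/', '_') + '}', tex)
```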

[Feature] Remove `\if 0 ... \fi` style comments as well

It is common to see the following kind of comments

\if 0

\fi

It would be helpful if they can be removed as well.

Blank lines introduced by the `commands_to_delete` option

If a command to be deleted is on a single line in the latex source file, asking arxiv-latex-cleaner to remove the command introduces a spurious blank line / line break / new paragraph.

Steps to reproduce:

Create a latex document with the following content in test/test.tex:

\documentclass{article}
\usepackage{blindtext}
\usepackage{marginnote}

\begin{document}

\blindtext
\marginnote{Test}
\blindtext

\end{document}

Compile test/test.tex (pdf1)
Run the document through arxiv_latex_cleaner test --commands_to_delete \marginnote
Compile test_arXiv/test.tex (pdf2)
Compare the results.

pdf1 shows a single paragraph with a margin note
pdf2 shows 2 paragraphs and no margin note.

Expected result:

pdf2 should show a single paragraph and no margin note

Additional context

I am using "[email protected]" from a recent pip install

The documentation for commands_to_delete is a bit confusing. Reading through the help message, I understood that the input folder argument should be placed at the end. I thus first tried calling
arxiv_latex_cleaner --commands_to_delete \marginnote test
but got an error message back
[email protected]: error: the following arguments are required: input_folder

For the record, in my current workflow I use the following piece of code in my latex source which works for me but is not very scalable:

% Comment out the next line if you do not want to show notes
\def\shownotes{} % set to true
\ifdefined\shownotes
\usepackage{marginnote}
\else
\newcommand{\marginnote}[1]{\ignorespaces}
\fi

Credit for the \ignorespaces part: https://tex.stackexchange.com/a/201818
