tektonika_copyedit

docx parsing for Tektonika

dependencies:

  • python 3.n (preferably 3.8+)
  • numpy
  • biblib

A conda environment is a nice way to set this up.

You will also need to have pandoc (pandoc.org/) installed for the initial conversion of the .docx file.
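
A quick way to confirm the setup before starting is to check that the modules and command-line tools can be found. This is only a sketch and not part of the repository: the function name missing_dependencies is hypothetical, and it assumes the Python import names match the package names listed above (numpy, biblib).

```python
import importlib.util
import shutil


def missing_dependencies(modules=("numpy", "biblib"),
                         tools=("pandoc", "pdflatex", "bibtex"),
                         find_module=importlib.util.find_spec,
                         find_tool=shutil.which):
    """Return the names of modules and executables that cannot be found.

    The two lookup functions are parameters so the check can be exercised
    without touching the real environment.
    """
    # find_spec returns None when a top-level module is not importable
    missing = [m for m in modules if find_module(m) is None]
    # shutil.which returns None when an executable is not on PATH
    missing += [t for t in tools if find_tool(t) is None]
    return missing
```

Calling missing_dependencies() with no arguments inside the conda environment returns an empty list when everything is in place.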

general steps for the conversion process

  • convert .docx article file to latex

    • pandoc file.docx -f docx -t latex --wrap=none -s -o file_pandoc.tex
  • copy-paste bibliography from docx into anystyle.io (or run the anystyle gem locally if you're into that) and output as bibtex, save as a .bib file

    • [anystyle.io -> file_anystyle.bib]
  • fix anystyle bibtex file year fields and keys, make a new .bib file

    • (set input filenames manually in the script)
    • fix_bibtex.py -> file_init.bib
  • manually correct any non-ascii keys in bib file, if there are any

    • (these will be printed to stdout so we know they need to be fixed, usually for non-ascii characters)
    • (feels like there should be a way around this but I don't know it)
  • parse the pandoc output tex file to a better tex format

    • (set the input filenames manually in the script)
    • parse_pandoc_file.py -> file_init.tex
  • run bibtex and pdflatex, look at the output and figure out what needs fixing

    • pdflatex file_init.tex -> file_init.pdf
    • bibtex file_init.aux
    • pdflatex file_init.tex
    • pdflatex file_init.tex
    • (running at least twice gives inline references a chance to sort themselves out)
  • manually link figure files at the right sizes, adjust placement of automated \includegraphics as needed

    • the pandoc command above does not extract image files from the Word document (pandoc's --extract-media option can do that), so they will need to be uploaded separately
  • manually adjust for extra bits of inline citations (in red), inline citations for multiple papers by the same authors (hopefully in red), and year-only citations (in red)

  • add extra hyphenation rules for words latex doesn't know if columns are overfull

  • manually add authors, affiliations, short title, and other header metadata with default placeholders

  • look at junk file and manually reformat/place tables in text where they belong (because I do not understand longtable)
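
For the non-ascii key step above, one possible way around the manual correction is to fold accented characters to their ASCII base letters. The sketch below is an assumption about how that could be done, not what fix_bibtex.py does; the names ascii_fold and non_ascii_keys are hypothetical, and str.isascii needs python 3.7+.

```python
import unicodedata


def ascii_fold(key):
    """Approximate a BibTeX key in plain ASCII.

    NFKD normalization splits accented letters into a base letter plus
    combining marks; encoding to ASCII with errors="ignore" then drops
    the marks. Letters with no decomposition (for example "ø" or "ł")
    are dropped entirely, so those keys still need a manual look.
    """
    decomposed = unicodedata.normalize("NFKD", key)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def non_ascii_keys(keys):
    """Map each key containing non-ASCII characters to its folded form."""
    return {k: ascii_fold(k) for k in keys if not k.isascii()}
```

Because some letters are dropped rather than replaced, the folded keys are best treated as suggestions to review, and any key that changes also has to change wherever it is cited in the tex file.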

TODO:

  • make sure catch for supplemental figures/tables works for in-text references
  • figure out parsing for author metadata
  • figure out longtable/table parsing?
  • parse extra bits of citations, like 'e.g.,' wherever possible
  • more user-friendly startup (i.e., input filenames, rather than editing scripts)
    • related: complete workflow that runs all scripts in sequence automatically
    • and maybe make this all install as a package?
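
As a starting point for the user-friendly startup item, the input filename could come from the command line, with the intermediate filenames derived from it. This is a minimal sketch, assuming the file.docx -> file_pandoc.tex / file_anystyle.bib / file_init.bib / file_init.tex naming used in the steps above; derive_filenames, parse_args and main are hypothetical names, not functions in the existing scripts.

```python
import argparse
from pathlib import Path


def derive_filenames(docx_path):
    """Build the per-stage filenames from the name of the .docx file."""
    stem = Path(docx_path).stem  # "drafts/file.docx" -> "file"
    return {
        "pandoc_tex": f"{stem}_pandoc.tex",      # pandoc output
        "anystyle_bib": f"{stem}_anystyle.bib",  # anystyle output
        "init_bib": f"{stem}_init.bib",          # fix_bibtex.py output
        "init_tex": f"{stem}_init.tex",          # parse_pandoc_file.py output
    }


def parse_args(argv=None):
    """Read the article path from the command line (or from argv)."""
    parser = argparse.ArgumentParser(
        description="Derive per-stage filenames for one article.")
    parser.add_argument("docx", help="path to the article, e.g. file.docx")
    return parser.parse_args(argv)


def main(argv=None):
    """Print the filename each stage of the conversion would use."""
    args = parse_args(argv)
    for stage, name in derive_filenames(args.docx).items():
        print(f"{stage}: {name}")
```

Calling main() under the usual `if __name__ == "__main__":` guard would turn this into a small script, and the same dictionary could later be passed to each stage of an automated workflow.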
