pszemraj / confectionary

a tool to quickly create sweet PDF files from text files :cupcake:

License: Apache License 2.0

Python 100.00%
Keywords: paragraph-segmentation, paragraph-splitting, pdf, pdf-generation, reader, text

confectionary's People

Contributors

jonathanlehner, pszemraj

confectionary's Issues

UnicodeEncodeError: 'latin-1' codec

Files with non-standard characters in their names cause the latin-1 codec used by the package to raise a UnicodeEncodeError.

Context

  • need to be able to handle files with special characters in the name

Examples of file names from the affected directory, several with special characters:

 'SUMM OCR Pałczyński et al. - 2022 - Study of the Few-S .txt',
 'SUMM OCR Refinetti, Goldt - 2022 - The dynamics of repr .txt',
 'SUMM OCR Sercan, Arık, Pfister - 2019 - TabNet Attenti .txt',
 'SUMM OCR Serhal et al. - 2022 - Overview on prediction, .txt',
 'SUMM OCR Somani et al. - 2021 - Deep learning and the e .txt',
 'SUMM OCR Somepalli et al. - 2021 - SAINT Improved Neura .txt',
 'SUMM OCR Śmigiel, Pałczyński, Ledziński - 2021 - ECG .txt',

Process

  1. The user passes the path to a directory of text files whose names contain special characters.
  2. The files are loaded.
  3. When confectionary tries to write a chapter name, it errors out.

Expected result

  • filenames are cleaned to remove special chars before writing to chapter name.
  • original file names are left intact

Current result

code fails to run

(pdf) C:\Users\peter\code-dev-22\misc-repos\text2pdf>python confectionary\text2pdf.py -i "G:\My Drive\ETHZ-2022-S\ml-healthcare\ml4hc-p1-papers\LED_large_5e_[batch=2048]_[nbeams=20]_[max_l=512]\NSC + SBD" -kw "ML4HC Paper Summaries - Project 1 - LED-L-5e"
18 files found matching extension .txt

# entries is 18, < title thresh 39
will use one page for TOC

Building Chapters in PDF file:   0%|                                                            | 0/18 [00:00<?, ?it/s]Traceback (most recent call last):
  File "C:\Users\peter\code-dev-22\misc-repos\text2pdf\confectionary\text2pdf.py", line 354, in <module>
    _finished_pdf_loc = dir_to_pdf(
  File "C:\Users\peter\code-dev-22\misc-repos\text2pdf\confectionary\text2pdf.py", line 255, in dir_to_pdf
    pdf.print_chapter(filepath=str(textfile.resolve()), num=i, title=out_name)
  File "C:\Users\peter\miniconda3\envs\pdf\lib\site-packages\confectionary\pdf.py", line 300, in print_chapter
    self.chapter_title(num, title)
  File "C:\Users\peter\miniconda3\envs\pdf\lib\site-packages\confectionary\pdf.py", line 207, in chapter_title
    self.start_section(total_title)
  File "C:\Users\peter\miniconda3\envs\pdf\lib\site-packages\fpdf\fpdf.py", line 221, in wrapper
    return fn(self, *args, **kwargs)
  File "C:\Users\peter\miniconda3\envs\pdf\lib\site-packages\fpdf\fpdf.py", line 4040, in start_section
    self.multi_cell(w=self.epw, h=self.font_size, txt=name, ln=1)
  File "C:\Users\peter\miniconda3\envs\pdf\lib\site-packages\fpdf\fpdf.py", line 221, in wrapper
    return fn(self, *args, **kwargs)
  File "C:\Users\peter\miniconda3\envs\pdf\lib\site-packages\fpdf\fpdf.py", line 2375, in multi_cell
    txt = self.normalize_text(txt)
  File "C:\Users\peter\miniconda3\envs\pdf\lib\site-packages\fpdf\fpdf.py", line 2945, in normalize_text
    return txt.encode(self.core_fonts_encoding).decode("latin-1")
UnicodeEncodeError: 'latin-1' codec can't encode character '\u0131' in position 33: ordinal not in range(256)

Possible Fix

  • use clean() from the clean-text package
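
A minimal sketch of what the cleaning step might look like. It uses only the standard library as a stand-in, not clean() from clean-text, and the function name latin1_safe_title and the small mapping table are hypothetical, not part of confectionary. The idea is to transliterate only the chapter title string so that fpdf's latin-1 encoding step can no longer fail.

```python
import unicodedata

# Letters such as "ł" and the dotless "ı" have no Unicode decomposition,
# so NFKD alone cannot reduce them. This table is illustrative, not exhaustive.
_NO_DECOMPOSITION = str.maketrans({"ł": "l", "Ł": "L", "ı": "i"})


def latin1_safe_title(name: str) -> str:
    """Return a copy of `name` that is guaranteed to encode as latin-1."""
    name = name.translate(_NO_DECOMPOSITION)
    # Split accented letters into base letter + combining mark ...
    decomposed = unicodedata.normalize("NFKD", name)
    # ... then drop the combining marks, keeping the base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Anything still outside latin-1 is replaced with "?" instead of raising.
    return stripped.encode("latin-1", errors="replace").decode("latin-1")


if __name__ == "__main__":
    print(latin1_safe_title("SUMM OCR Sercan, Arık, Pfister - 2019 - TabNet Attenti"))
    print(latin1_safe_title("SUMM OCR Śmigiel, Pałczyński, Ledziński - 2021 - ECG"))
```

Because only the title string is transformed, the files on disk keep their original names. Note that this sketch also strips accents from letters that latin-1 could represent (for example "é" becomes "e"), which may or may not be desirable.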

Add the ability to build PDF without paragraph segmentation

Add the ability to build PDF without paragraph segmentation.

(Aside: add a progress bar for downloading word2vec models.)

Context

  • low-commitment way to try out the repo
  • faster

Expected result

A switch or argument lets the user skip paragraph segmentation.

Current result

Paragraph segmentation is always performed; there is currently no way to skip it.
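
A rough sketch of how such a switch could be wired up with argparse. Everything here is an assumption for illustration: only the short options -i and -kw appear in the command shown in the other issue, so the long option names, the --no-split flag, and the prepare_text helper are hypothetical and not confectionary's actual interface. The segmenter is passed in as a plain callable so the sketch does not depend on the real segmentation code.

```python
import argparse
from typing import Callable


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Build a PDF from a directory of text files."
    )
    # -i and -kw mirror the short options seen in the traceback's command line;
    # the long names are assumed.
    parser.add_argument("-i", "--input-dir", required=True,
                        help="directory containing the text files")
    parser.add_argument("-kw", "--keywords", default=None,
                        help="keywords for the generated PDF")
    # Hypothetical flag: off by default, so existing behaviour is unchanged.
    parser.add_argument("--no-split", action="store_true",
                        help="skip paragraph segmentation")
    return parser


def prepare_text(raw: str, segmenter: Callable[[str], str],
                 skip_segmentation: bool) -> str:
    """Return `raw` untouched when skipping, otherwise run the segmenter."""
    if skip_segmentation:
        return raw
    return segmenter(raw)


if __name__ == "__main__":
    args = build_parser().parse_args(["-i", "some_dir", "--no-split"])
    # str.upper stands in for a real paragraph segmenter here.
    print(prepare_text("some text", str.upper, skip_segmentation=args.no_split))
```

Skipping segmentation this way would also avoid loading the word2vec model at all, provided the model is only loaded inside the segmenter.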
