Comments (7)

jsvine commented on August 22, 2024

Great question / conundrum!

Not too concerned about the value-alteration. This can be handled easily by — as you allude to — making a .copy() of chars. And I think snap_to_y_grid handles most cases like these.

But how should something like this handle edge-cases, e.g., a word gently-sloping-down? For the sake of a concrete-if-lame example:

letter doctop
C 20
O 21
M 22
P 23
U 24
T 25
E 26
R 27

You'd have to set a pretty aggressive snap_to_grid value to clump all those letters together. But the letter-to-letter variation in doctop is actually pretty small — just one pixel at a time. I wonder:

  • Is there a clever way to algorithmically identify strings of near-to-each-other characters? (And one that isn't too computationally expensive?)
  • Should we even be worrying about edge cases like these?

Eager for your thoughts, and thanks again for raising the issue!

from pdfplumber.

jsvine commented on August 22, 2024

Alternative attempt at solving this, using doctop-clustering:

Essentially, it looks at all doctops, sorts them, then clusters any doctops within y_tolerance of the next-smallest doctop. Then, it treats all doctops in the same cluster as being on the same line. Would seem to handle your situation — does it? — and my fake example above. What do you think?
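A minimal sketch of the clustering described above, not the code that ended up in the library. The names cluster_values and assign_lines, and the dict-based char records, are assumptions made for illustration; only doctop and y_tolerance come from the discussion.

```python
def cluster_values(values, tolerance):
    """Sort the distinct values and chain each one to the previous value
    when the gap between them is at most `tolerance`."""
    clusters = []
    for value in sorted(set(values)):
        if clusters and value - clusters[-1][-1] <= tolerance:
            clusters[-1].append(value)   # close enough: same cluster
        else:
            clusters.append([value])     # gap too large: start a new cluster
    return clusters


def assign_lines(chars, y_tolerance=1):
    """Return a line index for each char, based on its doctop cluster."""
    clusters = cluster_values([c["doctop"] for c in chars], y_tolerance)
    line_of = {v: i for i, cluster in enumerate(clusters) for v in cluster}
    return [line_of[c["doctop"]] for c in chars]


# The gently sloping COMPUTER example, plus one char on a clearly separate line.
chars = [{"text": t, "doctop": d} for t, d in zip("COMPUTER", range(20, 28))]
chars.append({"text": "X", "doctop": 40})
line_ids = assign_lines(chars, y_tolerance=1)
```

Because each doctop is compared only with the next-smallest one, the eight letters of COMPUTER chain into a single line even though C and R are 7 units apart. The flip side is that a long enough chain of small steps could also merge two lines that are genuinely distinct.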

And do you have any PDF (or CSV char-output) of previously-failing examples?


jsfenfen commented on August 22, 2024

I think this would work--lemme dig around for the one that was failing. This is actually a problem for me in just coalescing words for merging with some fonts--all cap fonts sometimes have larger first letters which also makes trouble. I may try to adapt this to that.

The question of sloping letters is also a really good one--and pretty gnarly to deal with once the delta y approaches line height. I've mucked around with that a little, and I think you need a different approach than clustering. For text that's just slanted I did an experiment with linear regression, and choosing words that had a shared y-intercept that sorta worked. But the other problem one gets (much more with scanned text) is of non-linear curve (think of putting a book on a copy machine). But that's an even worse problem...
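For illustration only, here is one way the regression idea could look. This is a guess at the general shape, not the actual experiment. The functions fit_line and group_by_intercept, and the representation of a word as a list of (x, doctop) points, are all assumptions.

```python
def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        # All points share one x: no slope can be estimated, treat as flat.
        return 0.0, mean_y
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def group_by_intercept(words, tolerance):
    """Group word names whose fitted y-intercepts lie within `tolerance`
    of the previous (next-smallest) intercept."""
    fits = sorted((fit_line(points)[1], name) for name, points in words.items())
    groups, last = [], None
    for intercept, name in fits:
        if groups and intercept - last <= tolerance:
            groups[-1].append(name)
        else:
            groups.append([name])
        last = intercept
    return groups


# Two slanted lines with the same slope (0.5) and intercepts 100 and 120.
words = {
    "hello": [(x, 100 + 0.5 * x) for x in (0, 10, 20, 30, 40)],
    "world": [(x, 100 + 0.5 * x) for x in (60, 70, 80, 90, 100)],
    "below": [(x, 120 + 0.5 * x) for x in (0, 10, 20, 30, 40)],
}
groups = group_by_intercept(words, tolerance=2)
```

Here "hello" and "world" end up together because their fitted lines hit the y-axis at the same place, even though the doctops of "world" overlap those of "below". This only holds up while the slant is roughly linear; the curved, book-on-a-copier case would need something else.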


jsvine commented on August 22, 2024

> all cap fonts sometimes have larger first letters which also makes trouble

Interesting! Hadn't encountered that before, but makes total sense.

> For text that's just slanted I did an experiment with linear regression, and choosing words that had a shared y-intercept that sorta worked.

Clever! I'm a bit worried about the computational expense of that, at least for the main collate_chars function. Maybe worth breaking out more-advanced chars-in-a-line detection into its own method?

> But the other problem one gets (much more with scanned text) is of non-linear curve (think of putting a book on a copy machine). But that's an even worse problem...

Oy, yeah. I'm thinking handling super-crazy text layouts is outside the scope of this library. But definitely open to being convinced otherwise!


jsfenfen commented on August 22, 2024

Yeah, wrt angle detection that's definitely something you wouldn't want there. If you assume a constant angle (not a great assumption) you can do some detection stuff and then apply your discovered angle elsewhere.
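A rough sketch of the constant-angle idea, under the (admittedly shaky) assumption that one slope holds for the whole page. The helpers estimate_slope and deskew are hypothetical, and the char records with x0 and doctop keys are assumed for illustration.

```python
from statistics import median


def estimate_slope(sample):
    """Median rise-over-run between consecutive (x, doctop) points of a
    sample that is known to sit on a single line of text."""
    points = sorted(sample)
    return median(
        (y2 - y1) / (x2 - x1)
        for (x1, y1), (x2, y2) in zip(points, points[1:])
        if x2 != x1
    )


def deskew(chars, slope):
    """Return copies of the chars with the slant removed from doctop,
    leaving the original dicts untouched."""
    return [dict(c, doctop=c["doctop"] - slope * c["x0"]) for c in chars]


# Discover the slope on one line, then apply it to chars elsewhere.
slope = estimate_slope([(0, 100.0), (10, 105.0), (20, 110.0)])
chars = [
    {"text": "A", "x0": 0, "doctop": 200.0},
    {"text": "B", "x0": 10, "doctop": 205.0},
    {"text": "C", "x0": 20, "doctop": 210.0},
]
flat = deskew(chars, slope)
```

After the correction, the three chars share one doctop, so ordinary tolerance-based grouping could take over from there. Using the median rather than the mean is a design choice to blunt the effect of a single odd character, such as an oversized first letter.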


jsvine commented on August 22, 2024

The new strategy for handling doctop-variations is now in master and v0.2.0 — pushed and published this morning.


jsfenfen commented on August 22, 2024

Awesome!

