
Comments (11)

nipunsadvilkar commented on August 16, 2024

Yes, I really need to come up with some assertion logic to map each sentence back to the original text. This is the main reason I've been working on the https://github.com/nipunsadvilkar/pySBD/tree/sentence-char-span branch: even if pysbd fails to find a proper sentence boundary, tok.is_sent_start would remain False and we would still get the original text at the end.
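One way such assertion logic might look is sketched below. This is not the code on the sentence-char-span branch; `map_to_spans` is a hypothetical helper, and the sentence list is hard-coded rather than produced by calling pysbd. It scans the original text left to right and asserts that every sentence occurs verbatim.

```python
from typing import List, Tuple


def map_to_spans(text: str, sentences: List[str]) -> List[Tuple[int, int]]:
    """Locate each sentence in `text`, scanning left to right.

    Hypothetical helper: returns (start, end) character offsets such that
    text[start:end] == sentence for every sentence.
    """
    spans = []
    cursor = 0
    for sent in sentences:
        start = text.find(sent, cursor)
        # The assertion: each sentence must appear verbatim in the original.
        assert start != -1, f"sentence not found in original text: {sent!r}"
        end = start + len(sent)
        spans.append((start, end))
        cursor = end  # continue searching after the sentence just matched
    return spans


text = "This text talks about Fig. ??. It is a figure."
sentences = ["This text talks about Fig.", "??.", "It is a figure."]
spans = map_to_spans(text, sentences)
```

With offsets in hand, a caller can slice the original text instead of trusting the segmenter's output strings, so nothing is lost even when a boundary is wrong.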

from pysbd.

nipunsadvilkar commented on August 16, 2024

@danielkingai2 : Fixed the above bug and released the char-span functionality today. I didn't release it yesterday since I wanted to add tests and update the docs.


nipunsadvilkar commented on August 16, 2024

@danielkingai2 : Is the given example, "Fig. ??", an actual sentence from some text? An abbreviation followed by ?? seems unlikely. I tried the following:

In [2]: seg.segment('Fig. 20??')
Out[2]: ['Fig.', '20??']

In [3]: seg.segment('Fig. 20 ??')
Out[3]: ['Fig.', '20 ??']

In [4]: seg.segment('Fig. ??')
Out[4]: ['Fig.']

I would need to add the "fig" abbreviation to keep the above tokens intact. I will look into it.


dakinggg commented on August 16, 2024

It can result from text that is parsed from a PDF. The problem is that the output becomes hard to use if the output text doesn't match the input text (even if the sentence splitting is wrong, it is better to retain the original text). It looks like it only happens when the ?? is at the end of the input sequence and comes right after a sentence split made by pysbd (see the examples below). Would it be easier to add a case to handle a sequence ending with ?? rather than special-casing the abbreviation?

>>> segmenter.segment("This text talks about Fig. ??. It is a figure.")
['This text talks about Fig.', '??.', 'It is a figure.']
>>> segmenter.segment("This text talks about Fig. ??.")
['This text talks about Fig.', '??.']
>>> segmenter.segment("This text talks about Fig. ?? .")
['This text talks about Fig.', '?? .']
>>> segmenter.segment("This text talks about Fig. ?? which is a figure.")
['This text talks about Fig.', '?? which is a figure.']
>>> segmenter.segment("This text talks about Fig ??")
['This text talks about Fig ??']
>>> segmenter.segment("This text talks about Fig. ??")
['This text talks about Fig.']
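A quick way to flag the truncation in the last example is to compare the non-whitespace characters of the input and the output. The sketch below is only an illustration: `dropped_tail` is a hypothetical helper, the segment lists are copied from the outputs above rather than produced by calling pysbd, and it only detects text lost at the end of the input, which is the case reported here.

```python
from typing import List


def dropped_tail(text: str, sentences: List[str]) -> str:
    """Return the non-whitespace characters missing from the end of the output.

    Hypothetical helper: whitespace is ignored on both sides, so differences
    in spacing between segments do not count as lost text.
    """
    original = "".join(text.split())
    joined = "".join("".join(s.split()) for s in sentences)
    if not original.startswith(joined):
        # The output differs somewhere other than a truncated tail.
        raise ValueError("segments do not form a prefix of the input text")
    return original[len(joined):]


# Segment lists copied from the examples above.
lost = dropped_tail("This text talks about Fig. ??", ["This text talks about Fig."])
kept = dropped_tail("This text talks about Fig ??", ["This text talks about Fig ??"])
```

Here `lost` is `'??'` and `kept` is the empty string, which matches the observation that only the trailing `??` after a split goes missing.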


nipunsadvilkar commented on August 16, 2024

Yes, I agree.

Can you please try adding the following line after https://github.com/nipunsadvilkar/pySBD/blob/master/pysbd/processor.py#L182

txt = re.sub(r'☇$', '??', txt)
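To see what that line does in isolation, here is a small standalone sketch. It assumes, judging from the suggested line, that ☇ is a placeholder standing in for "??" during processing; the surrounding code in processor.py is not reproduced, and `restore_trailing_placeholder` is a hypothetical name used only for this example.

```python
import re


def restore_trailing_placeholder(txt: str) -> str:
    """Replace a placeholder at the very end of txt with '??'.

    Assumption: the symbol is a stand-in for '??'. The `$` anchor limits
    the substitution to a placeholder that ends the string.
    """
    return re.sub(r'☇$', '??', txt)


at_end = restore_trailing_placeholder("This text talks about Fig. ☇")
in_middle = restore_trailing_placeholder("Is that so☇ It is a figure.")
```

Only the first call changes its input: `at_end` ends in `??`, while `in_middle` is returned untouched because the placeholder is not at the end of the string.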


dakinggg commented on August 16, 2024

It seems better with that; at least it doesn't truncate the text.

>>> segmenter.segment("This text talks about Fig. ??")
['This text talks about Fig.', '??']
>>> segmenter.segment("This text talks about Fig ??")
['This text talks about Fig ?', '?']


dakinggg commented on August 16, 2024

In the meantime, would you mind merging that fix you came up with?


nipunsadvilkar commented on August 16, 2024

Yes, sure!

And sorry for the multiple bugs. I ported it from https://github.com/diasks2/pragmatic_segmenter to Python, and my main criterion was getting the golden rules tests to pass.

Some of the issues (better to call them edge cases) that you opened earlier are not accounted for in pragmatic_segmenter, though I have tried to fix them in pysbd. Thanks for making pysbd more robust :)


dakinggg commented on August 16, 2024

No worries, I totally understand that you've ported the Ruby gem! I definitely appreciate your responsiveness. Working with PDF-parsed text causes all kinds of edge cases.


nipunsadvilkar commented on August 16, 2024

Yeah, I concur. I work with a lot of OCR text myself, so I know the pain of unformatted text.


dakinggg commented on August 16, 2024

Nice!

