Question marks at the end swallowed about pysbd HOT 11 CLOSED

nipunsadvilkar commented on August 16, 2024

Question marks at the end swallowed

from pysbd.

Comments (11)

nipunsadvilkar commented on August 16, 2024

Yes, I really need to come up with some assertion logic to map the respective sentences back to the original text. This is the main reason I've been working on the https://github.com/nipunsadvilkar/pySBD/tree/sentence-char-span branch: even if pysbd fails to find a proper sentence, tok.is_sent_start would remain False and we would still get the original text at the end.

nipunsadvilkar commented on August 16, 2024

@danielkingai2: Fixed the above bug and released the char-span functionality today. I didn't release it yesterday since I wanted to add tests and update the docs.

nipunsadvilkar commented on August 16, 2024

@danielkingai2: Is the given example, "Fig. ??", an actual sentence from some text? An abbreviation followed by ?? seems unlikely. I tried the following:

In [2]: seg.segment('Fig. 20??')
Out[2]: ['Fig.', '20??']

In [3]: seg.segment('Fig. 20 ??')
Out[3]: ['Fig.', '20 ??']

In [4]: seg.segment('Fig. ??')
Out[4]: ['Fig.']

I would need to add "fig" as an abbreviation to keep the above tokens intact. Will look into it.

dakinggg commented on August 16, 2024

It can result from text that is parsed from a PDF. The problem is that the output becomes hard to use if it doesn't match the input text (even if the sentence splitting is wrong, it is better to retain the original text). It looks like it only happens when the ?? is at the end of the input sequence and comes right after a sentence split by pysbd (see examples below). Would it be easier to add a case to handle a sequence ending with ?? rather than special-casing the abbreviation?

>>> segmenter.segment("This text talks about Fig. ??. It is a figure.")
['This text talks about Fig.', '??.', 'It is a figure.']
>>> segmenter.segment("This text talks about Fig. ??.")
['This text talks about Fig.', '??.']
>>> segmenter.segment("This text talks about Fig. ?? .")
['This text talks about Fig.', '?? .']
>>> segmenter.segment("This text talks about Fig. ?? which is a figure.")
['This text talks about Fig.', '?? which is a figure.']
>>> segmenter.segment("This text talks about Fig ??")
['This text talks about Fig ??']
>>> segmenter.segment("This text talks about Fig. ??")
['This text talks about Fig.']
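
The concern above (output that no longer matches the input) can be checked mechanically. The following is only a minimal sketch and is not part of pysbd: `find_dropped_text` is a hypothetical helper that walks the segments in order and reports any non-whitespace text from the original that no segment covers. The segment lists are copied from the examples in this thread rather than produced by calling pysbd.

```python
def find_dropped_text(original, segments):
    """Return the non-whitespace pieces of `original` that no segment covers.

    Hypothetical helper for illustration; assumes segments appear in the
    same order as in the original text.
    """
    dropped = []
    cursor = 0
    for seg in segments:
        core = seg.strip()
        if not core:
            continue
        start = original.find(core, cursor)
        if start == -1:
            raise ValueError(f"segment not found in original text: {seg!r}")
        # Anything skipped between the previous segment and this one was lost.
        gap = original[cursor:start].strip()
        if gap:
            dropped.append(gap)
        cursor = start + len(core)
    # Text left over after the last segment was lost as well.
    tail = original[cursor:].strip()
    if tail:
        dropped.append(tail)
    return dropped


text = "This text talks about Fig. ??"
print(find_dropped_text(text, ["This text talks about Fig."]))        # ['??']
print(find_dropped_text(text, ["This text talks about Fig.", "??"]))  # []
```

An empty result means every non-whitespace character of the input survived segmentation, even if the split points themselves are wrong.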

nipunsadvilkar commented on August 16, 2024

Yes, I agree.

Can you please try adding the following line after https://github.com/nipunsadvilkar/pySBD/blob/master/pysbd/processor.py#L182

txt = re.sub(r'☇$', '??', txt)
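
For readers following along, here is a small standalone sketch of what that line does in isolation. It assumes, as the suggested fix implies, that ☇ is a placeholder standing in for ?? during processing; the function name `restore_trailing_double_question` is hypothetical and does not exist in pysbd.

```python
import re


def restore_trailing_double_question(txt):
    """Turn a placeholder at the very end of the text back into '??'.

    Assumption: '☇' stands in for '??' while the text is being processed.
    The `$` anchor limits the substitution to the end of the string.
    """
    return re.sub(r'☇$', '??', txt)


print(restore_trailing_double_question('This text talks about Fig. ☇'))
# This text talks about Fig. ??
print(restore_trailing_double_question('Fig. ☇ is a figure.'))
# Fig. ☇ is a figure.  (unchanged: the placeholder is not at the end)
```

Because of the `$` anchor, only a trailing placeholder is restored, which matches the case reported here where ?? sits at the end of the input.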

dakinggg commented on August 16, 2024

Seems better with that; at least it doesn't truncate the text:

>>> segmenter.segment("This text talks about Fig. ??")
['This text talks about Fig.', '??']
>>> segmenter.segment("This text talks about Fig ??")
['This text talks about Fig ?', '?']

dakinggg commented on August 16, 2024

In the meantime, would you mind merging that fix you came up with?

nipunsadvilkar commented on August 16, 2024

Yes, sure!

And sorry for the multiple bugs. I ported it from https://github.com/diasks2/pragmatic_segmenter to Python, and my main criterion was getting the golden rules tests to pass.

Some of the issues (better to call them edge cases) that you opened earlier are not accounted for in pragmatic_segmenter, though I have tried to fix them in pysbd. Thanks for making pysbd more robust :)

dakinggg commented on August 16, 2024

No worries, I totally understand that you've ported the Ruby gem! Definitely appreciate your responsiveness. Working with PDF-parsed text causes all kinds of edge cases.

nipunsadvilkar commented on August 16, 2024

Yeah, I concur. I work with a lot of OCR text myself, so I know the pain of unformatted text.

dakinggg commented on August 16, 2024

Nice!
