Comments (11)
Yes, I really need to come up with some assertion logic to map the respective sentences back to the original text. This is the main reason I've been working on the https://github.com/nipunsadvilkar/pySBD/tree/sentence-char-span branch: even if pysbd fails to find a proper sentence, tok.is_sent_start would remain False and we would still get the original text at the end.
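A minimal sketch of what such assertion logic could look like, assuming the segmenter returns plain strings in document order. The helper name `map_sentences_to_spans` is hypothetical and is not part of pysbd; it only uses `str.find` to locate each sentence and fails loudly when characters were dropped or altered.

```python
def map_sentences_to_spans(text, sentences):
    """Locate each sentence in `text`, in order, and return (start, end) spans.

    Hypothetical helper, not a pysbd API. Raises ValueError if a sentence
    cannot be found, or if non-whitespace text is left uncovered.
    """
    spans = []
    cursor = 0
    for sent in sentences:
        start = text.find(sent, cursor)
        if start == -1:
            raise ValueError(f"sentence not found in original text: {sent!r}")
        # Only whitespace may be skipped between consecutive sentences.
        if text[cursor:start].strip():
            raise ValueError(f"text skipped before sentence: {text[cursor:start]!r}")
        cursor = start + len(sent)
        spans.append((start, cursor))
    # Anything left over after the last sentence must be whitespace.
    if text[cursor:].strip():
        raise ValueError(f"text dropped at end: {text[cursor:]!r}")
    return spans


# Both sentences are found, so the spans cover the whole input.
print(map_sentences_to_spans("Fig. 20??", ["Fig.", "20??"]))
```

With a truncated result such as `['Fig.']` for the input `'Fig. ??'`, the same call raises `ValueError`, which is the kind of failure the assertion is meant to surface.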
from pysbd.
@danielkingai2: Fixed the above bug and released the char-span functionality today. I didn't release it yesterday since I wanted to add tests and update the docs.
@danielkingai2: Is the given example, "Fig. ??", an actual sentence from some text? An abbreviation followed by ?? seems unlikely. I tried the following:
In [2]: seg.segment('Fig. 20??')
Out[2]: ['Fig.', '20??']
In [3]: seg.segment('Fig. 20 ??')
Out[3]: ['Fig.', '20 ??']
In [4]: seg.segment('Fig. ??')
Out[4]: ['Fig.']
I would need to add the fig abbreviation to keep the above tokens intact. Will look into it.
It can result from text that is parsed from a PDF. The problem is that the output becomes hard to use if it doesn't match the input text (even if the sentence splitting is wrong, it is better to retain the original text). It looks like this only happens when the ?? is at the end of the input sequence and comes after a sentence split by pysbd (see examples below). Would it be easier to add a case that handles a sequence ending with ?? rather than special-casing the abbreviation?
>>> segmenter.segment("This text talks about Fig. ??. It is a figure.")
['This text talks about Fig.', '??.', 'It is a figure.']
>>> segmenter.segment("This text talks about Fig. ??.")
['This text talks about Fig.', '??.']
>>> segmenter.segment("This text talks about Fig. ?? .")
['This text talks about Fig.', '?? .']
>>> segmenter.segment("This text talks about Fig. ?? which is a figure.")
['This text talks about Fig.', '?? which is a figure.']
>>> segmenter.segment("This text talks about Fig ??")
['This text talks about Fig ??']
>>> segmenter.segment("This text talks about Fig. ??")
['This text talks about Fig.']
Yes, I agree.
Can you please try adding the following line after https://github.com/nipunsadvilkar/pySBD/blob/master/pysbd/processor.py#L182:
txt = re.sub(r'☇$', '??', txt)
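To see what that line does in isolation, here is a small sketch using only the standard `re` module. It assumes, as the suggested line implies, that `☇` is an internal placeholder standing in for `??`; the function name `restore_trailing_double_question` is made up for illustration and does not exist in pysbd.

```python
import re


def restore_trailing_double_question(txt):
    # Assumption: '☇' is the placeholder substituted for '??' during processing.
    # The '$' anchor limits the replacement to a placeholder at the end of the text.
    return re.sub(r'☇$', '??', txt)


# A trailing placeholder is turned back into '??'.
print(restore_trailing_double_question('This text talks about Fig. ☇'))

# A placeholder in the middle of the text is left alone by this rule.
print(restore_trailing_double_question('Fig. ☇ which is a figure.'))
```

Because the pattern is anchored, this only addresses the end-of-input case described above; placeholders elsewhere would still need to be restored by whatever step already handles them.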
Seems better with that; at least it doesn't truncate the text:
>>> segmenter.segment("This text talks about Fig. ??")
['This text talks about Fig.', '??']
>>> segmenter.segment("This text talks about Fig ??")
['This text talks about Fig ?', '?']
In the meantime, would you mind merging that fix you came up with?
Yes, sure!
And sorry for the multiple bugs; I ported it from https://github.com/diasks2/pragmatic_segmenter to Python, and my main criterion was getting the golden rules tests to pass.
Some of the issues (better to call them edge cases) that you created earlier are not accounted for in pragmatic_segmenter, though I have tried to fix them in pysbd. Thanks for making pysbd more robust :)
No worries, I totally understand that you've ported the Ruby gem! Definitely appreciate your responsiveness. Working with PDF-parsed text causes all kinds of edge cases.
Yeah, I concur. I work with a lot of OCR text myself, so I know the pain of unformatted text.
Nice!