nipunsadvilkar / pysbd

License: MIT License

pysbd's Introduction

PySBD logo

pySBD: Python Sentence Boundary Disambiguation (SBD)


pySBD - Python Sentence Boundary Disambiguation (SBD) - is a rule-based sentence boundary detection module that works out-of-the-box.

This project is a direct port of the Ruby gem Pragmatic Segmenter, which provides rule-based sentence boundary detection.


Highlights

'PySBD: Pragmatic Sentence Boundary Disambiguation', a short research paper, was accepted into the 2nd Workshop for Natural Language Processing Open Source Software (NLP-OSS) at EMNLP 2020.

Research Paper:

https://arxiv.org/abs/2010.09657

Recorded Talk:

pysbd_talk

Poster:

pysbd_poster

Install

Python

pip install pysbd

Usage

  • Currently pySBD supports 22 languages.
import pysbd
text = "My name is Jonas E. Smith. Please turn to p. 55."
seg = pysbd.Segmenter(language="en", clean=False)
print(seg.segment(text))
# ['My name is Jonas E. Smith.', 'Please turn to p. 55.']
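
If you need the character offsets of each sentence (for aligning segments back to the original text, as several issues below do), char_span=True makes segment() return TextSpan objects instead of plain strings. A minimal sketch:

import pysbd
text = "My name is Jonas E. Smith. Please turn to p. 55."
seg = pysbd.Segmenter(language="en", clean=False, char_span=True)
print(seg.segment(text))
# a list of TextSpan objects, each carrying the sentence text plus its
# start/end character offsets into the input

pySBD can also be used as a spaCy pipeline component:
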
import spacy
from pysbd.utils import PySBDFactory

nlp = spacy.blank('en')

# explicitly adding the component to the pipeline
# (recommended - makes it easier to tell what's going on)
nlp.add_pipe(PySBDFactory(nlp))

# or you can use it implicitly with keyword
# pysbd = nlp.create_pipe('pysbd')
# nlp.add_pipe(pysbd)

doc = nlp('My name is Jonas E. Smith. Please turn to p. 55.')
print(list(doc.sents))
# [My name is Jonas E. Smith., Please turn to p. 55.]

Contributing

If you want to contribute new features or language support, or have found a text that pySBD segments incorrectly, please head to CONTRIBUTING.md to learn more, and follow these steps.

  1. Fork it ( https://github.com/nipunsadvilkar/pySBD/fork )
  2. Create your feature branch (git checkout -b my-new-feature)
  3. Commit your changes (git commit -am 'Add some feature')
  4. Push to the branch (git push origin my-new-feature)
  5. Create a new Pull Request

Citation

If you use the pysbd package in your projects or research, please cite PySBD: Pragmatic Sentence Boundary Disambiguation.

@inproceedings{sadvilkar-neumann-2020-pysbd,
    title = "{P}y{SBD}: Pragmatic Sentence Boundary Disambiguation",
    author = "Sadvilkar, Nipun  and
      Neumann, Mark",
    booktitle = "Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.nlposs-1.15",
    pages = "110--114",
    abstract = "We present a rule-based sentence boundary disambiguation Python package that works out-of-the-box for 22 languages. We aim to provide a realistic segmenter which can provide logical sentences even when the format and domain of the input text is unknown. In our work, we adapt the Golden Rules Set (a language specific set of sentence boundary exemplars) originally implemented as a ruby gem pragmatic segmenter which we ported to Python with additional improvements and functionality. PySBD passes 97.92{\%} of the Golden Rule Set examplars for English, an improvement of 25{\%} over the next best open source Python tool.",
}

Credit

This project wouldn't be possible without the great work done by the Pragmatic Segmenter team.

pysbd's People

Contributors

matthen, misotrnka, nipunsadvilkar, spate141


pysbd's Issues

XXXX et al. [2004] error

Describe the bug
Incorrect segmentation.

To Reproduce

import pysbd
text = "Yan et al. [2004] analysed SSH variations in northwest Europe and suggested that SSH changes are related to changes in heat content and heat fluxes."
seg = pysbd.Segmenter(language="en", clean=False)
print(seg.segment(text))

This is a whole sentence and should not be segmented.

Incorrect text span start and end returned

Found another example that screws up the start and end indices for the spans. I wasn't really able to reproduce it without the full text, though there is probably a smaller version of the text that still triggers it. It also has something to do with the <|CITE|> tokens.

In[41]: text = 'Trust in journalism is not associated with frequency of media use (except in the case of television as mentioned above), indicating that trust is not an important predictor of media use, though it might have an important impact on information processing. This counterintuitive fi nding can be explained by taking into account the fact that audiences do not watch informative content merely to inform themselves; they have other motivations that might override credibility concerns. For example, they might follow media primarily for entertainment purposes and consequently put less emphasis on the quality of the received information.As <|CITE|> have claimed, audiences tend to approach and process information differently depending on the channel; they approach television primarily for entertainment and newspapers primarily for information. This has implications for trust as well since audiences in an entertainment processing mode will be less attentive to credibility cues, such as news errors, than those in an information processing mode (Ibid.). <|CITE|> research confi rms this claim -he found that audiences tend to approach newspaper reading more actively than television viewing and that credibility assessments differ regarding whether audience members approach news actively or passively. These fi ndings can help explain why we found a weak positive correlation between television news exposure and trust in journalism. It could be that audiences turn to television not because they expect the best quality information but rather the opposite -namely, that they approach television news less critically, focus less attention on credibility concerns and, therefore, develop a higher degree of trust in journalism. The fact that those respondents who follow the commercial television channel POP TV and the tabloid Slovenske Novice exhibit a higher trust in journalistic objectivity compared to those respondents who do not follow these media is also in line with this interpretation. The topic of Janez Janša and exposure to media that are favourable to him and his SDS party is negatively connected to trust in journalism. This phenomenon can be partly explained by the elaboration likelihood model <|CITE|> , according to which highly involved individuals tend to process new information in a way that maintains and confi rms their original opinion by 1) taking information consistent with their views (information that falls within a narrow range of acceptance) as simply veridical and embracing it, and 2) judging counter-attitudinal information to be the product of biased, misguided or ill-informed sources and rejecting it <|CITE|> <|CITE|> . Highly partisan audiences will, therefore, tend to react to dissonant information by lowering the trustworthiness assessment of the source of such information.'

In[42]: import pysbd
In[43]: segmenter = pysbd.Segmenter(char_span=True)
In[44]: char_spans = segmenter.segment(text)
In[45]: char_spans[-1]
Out[45]: TextSpan(sent=<correct sentence text>, start=143, end=302)
In[46]: char_spans[-2]
Out[46]: TextSpan(sent=<correct sentence text>, start=0, end=142)
In[47]: char_spans[-3]
Out[47]: TextSpan(sent=<correct sentence text>, start=0, end=152)
In[48]: char_spans[-4]
Out[48]: TextSpan(sent=<correct sentence text>, start=2139, end=2368)

Catastrophic backtracking in HTMLTagRule

Describe the bug
Segmenter will hang if it encounters unfinished HTML (an unescaped HTML attribute with an unfinished HTML tag).
The reason is the regex that can be found in rules.py, class HTML:
HTMLTagRule = Rule(r"<\/?\w+((\s+\w+(\s*=\s*(?:\".*?\"|'.*?'|[\^'\">\s]+))?)+\s*|\s*)\/?>", '')

To Reproduce
Steps to reproduce the behavior:

sentencer = pysbd.Segmenter(language="en", clean=True)
txt = '<iframe width="100%" height="166" scrolling="no" frameborder="no" allow="autoplay" src="url Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum:<iframe width="100%" height="166" scrolling="no" frameborder="no" allow="autoplay" src="url Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum:<iframe width="100%" height="166" scrolling="no" frameborder="no" allow="autoplay" src="url Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum:<iframe width="100%" height="166" scrolling="no" frameborder="no" allow="autoplay" src="url Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum'
sentences = sentencer.segment(txt)

You can test it here: https://regex101.com/r/dGyZqj/1/

Expected behavior
The segmenter/regex should not hang if it encounters unfinished HTML.

Possible solution
We could replace the regex with a simplified one:
HTMLTagRule = Rule(r"(<([^>]+)>)", '')
which does the same job (removes tags and keeps the text) without the nested quantifiers that almost always lead to catastrophic backtracking.
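
A quick sketch of the proposed rule in isolation (illustrative only, outside pysbd's Rule machinery):

import re

# the simplified pattern: strip anything between '<' and '>'
simple_html_tag = re.compile(r"(<([^>]+)>)")

text = '<iframe width="100%" height="166">Hello <b>world</b></iframe>'
print(simple_html_tag.sub('', text))  # -> Hello world

# unlike the original rule, there are no nested quantifiers, so matching
# stays linear in the input size even on unterminated tags
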

Infinite loop?

The code below seems to hang forever:

segmenter = pysbd.Segmenter(language="en", clean=False)
text = "..[111 111 111 111 111 111 111 111 111 111]"
segmenter.segment(text)

Interrupting I get the traceback:

Traceback (most recent call last):
  File "check.py", line 5, in <module>
    segmenter.segment(text)
  File ".../python3.7/site-packages/pysbd/segmenter.py", line 87, in segment
    postprocessed_sents = self.processor(text).process()
  File ".../python3.7/site-packages/pysbd/processor.py", line 37, in process
    self.replace_periods_before_numeric_references()
  File ".../python3.7/site-packages/pysbd/processor.py", line 141, in replace_periods_before_numeric_references
    r"∯\2\r\7", self.text)
  File ".../python3.7/re.py", line 192, in sub
    return _compile(pattern, flags).sub(repl, string, count)
KeyboardInterrupt

This is pysbd version 0.3.3, Python 3.7.7.

Could it be entering an infinite loop?

(I found this bug by applying pysbd to Wikipedia; on the article https://en.wikipedia.org/wiki/Clojure it tripped up on "...[484 216 622 139 651 592 379 228 242 355]".)

Text surrounded by quotes seems to not get segmented?

>>> s = '"I just love doing this sort of thing. It\'s great."'
>>> seg.segment(s)
['"I just love doing this sort of thing. It\'s great."']
>>> s = '"I phoned the \'Rape Helpline\' yesterday. They suggested I buy a balaclava."'
>>> seg.segment(s)
['"I phoned the \'Rape Helpline\' yesterday. They suggested I buy a balaclava."']

destructive behaviour in edge-cases

As of v0.3.3, pySBD shows destructive behavior in some edge cases, even when the option clean is set to False.
When dealing with OCR text, pySBD removes the whitespace after multiple periods.

To reproduce

import pysbd

splitter = pysbd.Segmenter(language="fr", clean=False)

text = "Maissen se chargea du reste .. Logiquement,"
print(splitter.segment(text))

text = "Maissen se chargea du reste ... Logiquement,"
print(splitter.segment(text))

text = "Maissen se chargea du reste .... Logiquement,"
print(splitter.segment(text))

Actual output
Please note the missing whitespace after the final period in the examples with ".." and "....".

['Maissen se chargea du reste .', '.', 'Logiquement,']
['Maissen se chargea du reste ... ', 'Logiquement,']
['Maissen se chargea du reste .', '...', 'Logiquement,']

Expected output

['Maissen se chargea du reste .', '. ', 'Logiquement,']
['Maissen se chargea du reste ... ', 'Logiquement,']
['Maissen se chargea du reste .', '... ', 'Logiquement,']

In general, pySBD works well. Many thanks @nipunsadvilkar. I can also look into this as soon as I find some time and open a pull request.

"unbalanced parenthesis" error

I'm getting this error every once in a while, for example, for text:

'Remuneration Report. 9.8.4(5) Directors’ (the Company) See paragraph headed “Capital structure” in this report. 9.8.4(8) Non-pro-rata allotments of equity for cash (major subsidiaries) N/A 9.8.4(10) Contracts of significance involving a Director N/A 9.8.4(11) Contracts of significance involving a controlling shareholder N/A 9.8.4(12) Waivers of dividends N/A 9.8.4(13) Waivers of future dividends N/A 9.8.4(14) Agreement with a controlling shareholder (LR 9.2.2.AR(2)(a)) See Corporate'

I get: unbalanced parenthesis at position 14

This is how I instantiate and run:
segmenter = pysbd.Segmenter(language="en", clean=True, doc_type="pdf")
sentences = segmenter.segment(txt)

where txt is the string above.
Thanks!

crashing on input

pysbd crashes on this input

>>> segmenter.segment("Proof. First let v ∈ V be incident to at least three leaves and suppose there is a minimum power dominating set S of G that does not contain v. If S excludes two or more of the leaves of G incident to v, then those leaves cannot be dominated or forced at any step. Thus, S excludes at most one leaf incident to v, which means S contains at least two leaves ℓ 1 and ℓ 2 incident to v. Then, (S\{ℓ 1 , ℓ 2 }) ∪ {v} is a smaller power dominating set than S, which is a contradiction. Now consider the case in which v ∈ V is incident to exactly two leaves, ℓ 1 and ℓ 2 , and suppose there is a minimum power dominating set S of G such that {v, ℓ 1 , ℓ 2 } ∩ S = ∅. Then neither ℓ 1 nor ℓ 2 can be dominated or forced at any step, contradicting the assumption that S is a power dominating set. If S is a power dominating set that contains ℓ 1 or ℓ 2 , say ℓ 1 , then (S\{ℓ 1 }) ∪ {v} is also a power dominating set and has the same cardinality. Applying this to every vertex incident to exactly two leaves produces the minimum power dominating set required by (3). Definition 3.4. Given a graph G = (V, E) and a set X ⊆ V , define ℓ r (G, X) as the graph obtained by attaching r leaves to each vertex in X. If X = {v 1 , . . . , v k }, we denote the r leaves attached to vertex v i as ℓ")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/daniel/miniconda3/envs/scispacy_retrain/lib/python3.6/site-packages/pysbd/segmenter.py", line 21, in segment
    segments = processor.process()
  File "/home/daniel/miniconda3/envs/scispacy_retrain/lib/python3.6/site-packages/pysbd/processor.py", line 31, in process
    self.text = AbbreviationReplacer(self.text).replace()
  File "/home/daniel/miniconda3/envs/scispacy_retrain/lib/python3.6/site-packages/pysbd/abbreviation_replacer.py", line 46, in replace
    self.text = self.search_for_abbreviations_in_string()
  File "/home/daniel/miniconda3/envs/scispacy_retrain/lib/python3.6/site-packages/pysbd/abbreviation_replacer.py", line 75, in search_for_abbreviations_in_string
    self.text = self.scan_for_replacements(self.text, match, ind, char_array)
  File "/home/daniel/miniconda3/envs/scispacy_retrain/lib/python3.6/site-packages/pysbd/abbreviation_replacer.py", line 79, in scan_for_replacements
    char = char_array[ind] if char_array else ''
IndexError: list index out of range
>>> 

Incorrect text span start and end returned

Looks like something weird is happening in this case; note that the indices of the second text span are incorrect:

>>> seg = pysbd.Segmenter(language='en', clean=False, char_span=True)
>>> seg.segment("1) The first item. 2) The second item.")                                                                                
[TextSpan(sent='1) The first item.', start=0, end=18), TextSpan(sent='2) The second item.', start=0, end=19)] 

Handle irregularities between pySBD & pySBD + spaCy sentence output

pySBD's spaCy pipeline component uses a token-based approach: it sets is_sent_start to True or False depending on the Spans obtained from pySBD character offsets. We create Span objects using the doc.char_span method over a slice doc.text[start:end], which is a sentence span whose first Token object needs its is_sent_start attribute set to True. However, if the character indices don't map to a valid span, doc.char_span returns None. Hence we get irregularities between pySBD and pySBD + spaCy sentence output.

The inability to get a Span object from pySBD character offsets could be tackled by deconstructing the Doc object, the way the PKSHATechnology-Research/camphr authors have written get_doc_char_span, which uses destruct_token.
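
For reference, a minimal sketch of the token-based approach described above (spaCy 2.x API; the loop mirrors the component code visible in the tracebacks elsewhere on this page):

import spacy
import pysbd

nlp = spacy.blank("en")
seg = pysbd.Segmenter(language="en", clean=False, char_span=True)

def pysbd_sentencizer(doc):
    # collect the character offset of each sentence start from pySBD
    sent_starts = {span.start for span in seg.segment(doc.text_with_ws)}
    # a token opens a sentence iff its character index is one of those offsets
    for token in doc:
        token.is_sent_start = token.idx in sent_starts
    return doc

nlp.add_pipe(pysbd_sentencizer, first=True)
doc = nlp("My name is Jonas E. Smith. Please turn to p. 55.")
print(list(doc.sents))
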

Exception when clean=True in search_for_connected_sentences

Describe the bug
Segmenter will raise "exception: bad escape (end of pattern) at position" when it is initialized with clean=True and it encounters a sentence like "etc.Png,Jpg,.\" (a word/token that contains a backslash).

The exception is raised in the search_for_connected_sentences method of the Cleaner class (module cleaner.py), at the line:

txt = re.sub(re.escape(word), new_word, txt)

To Reproduce
Steps to reproduce the behavior:

# This is a simplified example; the original text contained names, so I changed it to image formats.
# A word that is an abbreviation with a dot, followed by an upper-case letter and a backslash:
sentencer = pysbd.Segmenter(language="en", clean=True)
txt = "etc.Png,Jpg,.\\"
sentences = sentencer.segment(txt)

Expected behavior
The output should be the same as the input, but it should not throw an exception.
A workaround to see the output is to escape the backslash.

sentencer = pysbd.Segmenter(language="en", clean=True)
txt = "etc.Png,Jpg,.\\\\"
sentences = sentencer.segment(txt)

Expected output:

['etc.', 'Png,Jpg,.', '\\']

Possible solution
Replace txt = re.sub(re.escape(word), new_word, txt)
with txt = txt.replace(word, new_word).
This avoids all the pitfalls of regular expressions (like escaping) and is generally faster.
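
A minimal sketch of why the str.replace variant is safer here (standalone, outside pysbd):

import re

word = "etc.Png,Jpg,.\\"       # token ending in a backslash, as in the repro
new_word = "etc. Png,Jpg,.\\"  # stand-in for whatever the cleaner rewrites it to
txt = "prefix etc.Png,Jpg,.\\ suffix"

# re.escape protects the *pattern*, but re.sub still parses the *replacement*
# for backreferences, so the trailing backslash raises
# re.error: bad escape (end of pattern)
try:
    txt = re.sub(re.escape(word), new_word, txt)
except re.error as e:
    print("re.sub failed:", e)

# plain string replacement has no escaping semantics at all
print(txt.replace(word, new_word))
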

Additional context
We parse lots of small text files (in Slovak), without special treatment, to form a huge sentence-segmented corpus. The example above was specially crafted just to reproduce the behavior with the English parser. I know the backslash combination is rare in English, but it does occur in Slovak articles when you process vast amounts of text.

Loses text when breaking into sentences

I did some very simple testing and found that the sentence parser drops entire parts of the document when identifying sentences. The data is public, so I can share it along with a simple test driver.
If you look at the output, you can see that about half of the document is not returned in the sentences iterable in spaCy.

from pysbd.utils import PySBDFactory
import spacy

# adjacent string literals only concatenate inside parentheses, so the
# multi-line document is wrapped in one parenthesized expression
doc1 = ('EX-10.46 3 f18180exv10w46.htm EXHIBIT 10.46 exv10w46\n\n'
        'EXHIBIT 10.46\n\n'
        'Guarantee Agreement\n\n'
        'December 16, 2005\n\n'
        'SanDisk Corporation\n\n'
        'IBJ Leasing Co., Ltd.\n'
        'Sumisho Lease Co., Ltd.\n\n'
        'Toshiba Finance Corporation\n\n'
        'Guarantee Agreement\n\n'
        'SanDisk Corporation (the “Guarantor”) and IBJ Leasing Co., Ltd., Sumisho Lease Co., Ltd., and Toshiba Finance Corporation as SD Lessor thereunder (in such capacity, collectively, the “SD Lessors”) hereby enter into this guarantee agreement (the “Agreement”) with respect to the Master Lease Agreement dated December 16, 2005 and individual agreements thereunder (collectively, the “Lease Agreement”) by and between the SD Lessors and IBJ Leasing Co., Ltd. and Sumisho Lease Co., Ltd. as Toshiba Lessor thereunder (in such capacity, collectively, the “Toshiba Lessors”) and Flash Partners Yugen Kaisha (the “Lessee”).\n'
        'Unless as otherwise specified in this Agreement, the words defined in the Lease Agreement shall have the same meaning in this Agreement.\n\n'
        'Article 1. (Guarantee)\n\n'
        'The Guarantor shall guarantee the performance, from time to time, of the obligations subject to the guarantee below (the “Guaranteed Obligation”) to the SD Lessors, jointly and severally (rentai-hosho) with the Lessee (the “Guarantee”).\n\n'
        '(Guaranteed Obligation)\n\n'
        'Guaranteed Obligation shall mean payment obligations of lease (lease-ryo), stipulated loss payment (kitei-songaikin), purchase option exercise price (konyu-sentakuken-koshikagaku), terminal return adjustment amount (henkanji-choseikin), break funding cost, late charges (chien-songaikin), and any and all payment obligations of other amounts concerning SD Tranche I and SD Tranche II in individual transactions pursuant to the Lease Agreement; provided that the Guarantor and the SD Lessors may consult in the event of any doubt concerning “other amounts” as mentioned above.\n\n'
        'In any event, the Guarantor shall not pay any obligation concerning Toshiba Tranche 1 and Toshiba Tranche 2.\n\n'
        'Article 2. (Period of Request for the Performance of Guarantee Obligation)\n\n'
        'In the event the SD Lessors request the performance of the Guarantee to the Guarantor, the SD Lessors shall make a written demand to the Guarantor requesting the performance of the Guaranteed Obligation which the Lessee fails to duly and punctually perform. The SD Lessors may, upon each failure of due and punctual performance of the Guaranteed Obligation, make a request pursuant to this Article; provided that the delay in making such request will not exempt the Guarantor from the obligations under the Guarantee.\n\n'
        'Article 3. (Performance of Guaranteed Obligation)\n\n'
        '3.1 The Guarantor shall, in the event the Lessee fails to perform all or any part of its obligations under the Guaranteed Obligation within 10 business days from each due date, perform the Guarantee in favor of the SD Lessors within 20 business days from the receipt of the written demand from the SD Lessors.')

nlp = spacy.blank('en')
nlp.add_pipe(PySBDFactory(nlp))

doc = nlp(doc1)

for num, sent in enumerate(doc.sents):
    print(f'{num} : {sent}')

Last sentence disappearing from input

The last sentence of this input seems to be removed in the output:

>>> segmenter.segment("As an example of a different special-purpose mechanism, we have introduced a methodology for letting donors make their donations to charities conditional on donations by other donors (who, in turn, can make their donations conditional) [70]. We have used this mechanism to collect money for Indian Ocean Tsunami and Hurricane Katrina victims. We have also introduced a more general framework for negotiation when one agent's actions have a direct effect (externality) on the other agents' utilities [69]. Both the charities and externalities methodologies require the solution of NP-hard optimization problems in general, but there are some natural tractable cases as well as effective MIP formulations. Recently, Ghosh and Mahdian [86] at Yahoo! Research extended our charities work, and based on this a web-based system for charitable donations was built at Yahoo!")
['As an example of a different special-purpose mechanism, we have introduced a methodology for letting donors make their donations to charities conditional on donations by other donors (who, in turn, can make their donations conditional) [70].', 'We have used this mechanism to collect money for Indian Ocean Tsunami and Hurricane Katrina victims.', "We have also introduced a more general framework for negotiation when one agent's actions have a direct effect (externality) on the other agents' utilities [69].", 'Both the charities and externalities methodologies require the solution of NP-hard optimization problems in general, but there are some natural tractable cases as well as effective MIP formulations.']

Cleaning the text before segmentation

First, thank you for this great tool!

Describe the bug
Can't pass clean=True to clean the text before segmentation when using the PySBDFactory class.
As indicated, char_span should be False when clean=True, but this:

PySBDFactory(nlp, language='es', clean=True, char_span=False)

is not working.

To Reproduce
pysbd==0.3.0rc0
python 3.7

import spacy
from pysbd.utils import PySBDFactory

nlp = spacy.blank('es')
nlp.add_pipe(PySBDFactory(nlp, language='es', clean=True, char_span=False))

text4="""
1- mi primera oración
ii- mi segunda oración
yo. mi tercera oración
"""
doc = nlp(text4)

This is the complete traceback:

AttributeError                            Traceback (most recent call last)
<ipython-input-216-0b0fdebddef6> in <module>
----> 1 doc = nlp(text4)

~\anaconda3\envs....\lib\site-packages\spacy\language.py in __call__(self, text, disable, component_cfg)
    437             if not hasattr(proc, "__call__"):
    438                 raise ValueError(Errors.E003.format(component=type(proc), name=name))
--> 439             doc = proc(doc, **component_cfg.get(name, {}))
    440             if doc is None:
    441                 raise ValueError(Errors.E005.format(name=name))

~\anaconda3\envs\......\lib\site-packages\pysbd\utils.py in __call__(self, doc)
     78         sents_char_spans = self.seg.segment(doc.text_with_ws)
     79         start_token_ids = [sent.start for sent in sents_char_spans]
---> 80         for token in doc:
     81             token.is_sent_start = (True if token.idx
     82                                    in start_token_ids else False)

~\anaconda3\envs\.....\lib\site-packages\pysbd\utils.py in <listcomp>(.0)
     78         sents_char_spans = self.seg.segment(doc.text_with_ws)
     79         start_token_ids = [sent.start for sent in sents_char_spans]
---> 80         for token in doc:
     81             token.is_sent_start = (True if token.idx
     82                                    in start_token_ids else False)

AttributeError: 'str' object has no attribute 'start'

Additional context

I have two other questions here:

1- In the class Processor(object): is there any reason why nlp is not a class attribute?

nlp = spacy.blank('en') (line 10) is not in the class, especially given that we have a lang attribute in there:
self.lang = lang

2- Could you please explain why, in numbered lists like:

text4="""1- mi primera oración
ii- mi segunda oración
i. mi tercera oración
"""
the last item is not kept as a single sentence like the first two?

Result:

---- sentence :  1- mi primera oración
---- sentence :  ii- mi segunda oración
---- sentence :  i.
---- sentence :  mi tercera oración

Thank you.

Different segmentation with spaCy and when using pySBD directly

Firstly, thank you for this project. I was lucky to find it, and it is really useful.

I seem to have found a case where segmentation behaves differently when run within the spaCy pipeline versus when using pySBD directly. I stumbled on it with my own text, where a sentence following a quoted sentence was being lumped together with it. I looked through the Golden Rules and found this wasn't expected, and then noticed that even with the text from one of your tests it acts differently in spaCy.

To reproduce, run these two bits of code:

from pysbd.utils import PySBDFactory
nlp = spacy.blank('en')
nlp.add_pipe(PySBDFactory(nlp))
doc = nlp("She turned to him, \"This is great.\" She held the book out to show him.")
for sent in doc.sents:
    print(str(sent).strip() + '\n')

She turned to him, "This is great." She held the book out to show him.

import pysbd
text = "She turned to him, \"This is great.\" She held the book out to show him."
seg = pysbd.Segmenter(language="en", clean=False)
#print(seg.segment(text))
for sent in seg.segment(text):
    print(str(sent).strip() + '\n')

She turned to him, "This is great."

She held the book out to show him.

The second way produces the desired output (based on the rules, at least).

Question marks at the end swallowed

Looks like the example with just question marks is good now:

>>> segmenter.segment("??")
['??']

but the example with double question marks as a token at the end of a sentence still loses the question marks:

>>> segmenter.segment("T stands for the vector transposition. As shown in Fig. ??")
['T stands for the vector transposition.', 'As shown in Fig.']

looks like this is the minimal repro:

>>> segmenter.segment("Fig. ??")
['Fig.']

Regexp issues

I'm getting errors because the regexp engine interprets parentheses: "unterminated subpattern" and "unbalanced parenthesis".

I'm analysing very large amounts of text, so I am not sure how these were triggered.

re.error: unbalanced parenthesis at position 10

I am getting the following error trying to use the Polish model to segment Croatian:

segmenter = pysbd.Segmenter(language="pl")
text = """10, 7, itd. Isto tako postoji podjela u odnosu na rotacijsku brzinu ploče u minuti: 33 1/3, 45, 78, kapaciteta (Long play - dugo sviranje), reprodukcijske kvalitete, te po broju audio kanala ("Mono", "Stereo", "Quardophonic", itd)."""
segmenter.segment(text)


Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "python3.7/site-packages/pysbd/segmenter.py", line 87, in segment
    postprocessed_sents = self.processor(text).process()
  File "python3.7/site-packages/pysbd/processor.py", line 34, in process
    self.replace_abbreviations()
  File "python3.7/site-packages/pysbd/processor.py", line 180, in replace_abbreviations
    self.text = self.abbreviations_replacer().replace()
  File "python3.7/site-packages/pysbd/abbreviation_replacer.py", line 37, in replace
    abbr_handled_text += self.search_for_abbreviations_in_string(line)
  File "python3.7/site-packages/pysbd/abbreviation_replacer.py", line 93, in search_for_abbreviations_in_string
    text, match, ind, char_array
  File "python3.7/site-packages/pysbd/abbreviation_replacer.py", line 111, in scan_for_replacements
    txt = self.replace_period_of_abbr(txt, am)
  File "python3.7/site-packages/pysbd/abbreviation_replacer.py", line 71, in replace_period_of_abbr
    txt,
  File "python3.7/re.py", line 192, in sub
    return _compile(pattern, flags).sub(repl, string, count)
  File "python3.7/re.py", line 286, in _compile
    p = sre_compile.compile(pattern, flags)
  File "python3.7/sre_compile.py", line 764, in compile
    p = sre_parse.parse(p, flags)
  File "python3.7/sre_parse.py", line 938, in parse
    raise source.error("unbalanced parenthesis")
re.error: unbalanced parenthesis at position 10

Version: 0.3.3

Starting period edge case

Thanks for fixing all the bugs so fast! I've got another edge case for you. Sometimes, when the input only contains space-separated periods, the first one is removed:

>>> segmenter.segment('.')
[]
>>> segmenter.segment('..')
['..']
>>> segmenter.segment('...')
['...']
>>> segmenter.segment('. .')
['. .']
>>> segmenter.segment('. . .')
['. .']
>>> 

Typo in example code on spacy universe

Describe the bug
Showing the following error in Google Colab: No module named 'pysbd.util'

ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-19-50881d57442a> in <module>()
      1 import spacy
----> 2 from pysbd.util import PySBDFactory

ModuleNotFoundError: No module named 'pysbd.util'

To Reproduce
Open a notebook in Google Colab

!pip install spacy
!pip install git+https://github.com/nipunsadvilkar/pySBD     # Installation succeeds

import spacy
from pysbd.util import PySBDFactory        # This line is generating the error

Expected behavior
No error - the import should be from pysbd.utils (with an "s"), which is what the spaCy universe example should show.

Shouldn't colons cause a sentence split?

They currently don't:

>>> s = 'Tomorrow I will do the greatest thing ever: Become a god.'
>>> seg.segment(s)
['Tomorrow I will do the greatest thing ever: Become a god.']
>>> s = 'The best player of the city: Zob Ahan F.C. and Sepahan F.C..'
>>> seg.segment(s)
['The best player of the city: Zob Ahan F.C. and Sepahan F.C..']

Error when processing this paragraph

We start by discussing the systematic discrepancy between results on comparable TI single crystals obtained by means of ARPES and transport experiments. The radial scatter plot of Fig. 1 compares the binding energy of the Dirac point obtained either by ARPES experiments (red circles) or by Shubnikov de Haas (SdH) oscillations in magneto-transport (blue circles). The value of E F -E D (i.e. the Dirac point binding energy) increases radially: the border of the inner circle corresponds to zero binding energy of the Dirac point (i.e. E D =E F ) and each tick denotes an increase of 100 meV. Each data point in the figure corresponds to a different experimental study in the literature, showing the work of many groups, including our own, and results are shown for five different TI compounds. A general conclusion can be readily made. ARPES shows a systematically higher binding energy for the Dirac point than magneto-transport experiments. We note that several ARPES studies [7, 8, 20-24, 26, 28, 29, 32, 39, 43, 45] have observed energy shifts to higher binding energies because of surface band bending on intentional and unintentional (= 'aging') surface decoration. In order to maintain a fair comparison with magneto-transport, the filled red circles in Fig. 1 correspond to surfaces that have been neither decorated nor aged in UHV. Such data points have been acquired in a time frame between a few minutes and 2 hours after cleavage. Empty markers show the value of E D -by means of ARPES-on exposure to air (empty squares) or on increasing exposure to the residual UHV gases (empty circles). Such surface decoration might be an even more important issue in magneto-transport experiments, as such experiments do not take place in a UHV environment and generally do not involve in-situ cleavage of the single crystalline sample. However, the magneto-transport data seems relatively insensitive to surface decoration as the binding energies of the Dirac point are smaller than even the most pristine surfaces studied by ARPES. Fig. 1 makes it clear that surface decoration alone cannot be the key to the observed differences between ARPES and QO experiments, and thus the conclusion drawn earlier -that the E D values obtained by SdH oscillations cannot be systematically reproduced by ARPES even in the most pristine surfaces -is still valid. In the following, we will explain where the difference in the experimentally determined E D comes from between the two techniques, and we will discuss whether we can approach the SdH values by means of ARPES. Fig. 2 shows the first experimental evidence that the surface band bending of 3D TIs is modified substantially on exposure to EUV illumination of a duration of a single second, compared to the typical timescale of ARPES data collection for an I(E, k) image of tens of seconds or even several minutes. In order to highlight that the development of the band bending is indeed dominated by EUV exposure, and not by simple surface decoration with residual UHV gases, as has generally been believed [7, 8, [20] [21] [22] [23] [24] 43] , we have constructed the following experimental protocol. Firstly, we have intentionally exposed all cleavage surfaces to residual UHV gases for 3 hours at low temperature before the first measurement. Secondly, we have limited the duration of each measurement (and hence the EUV exposure) to a minimum of 1-2 seconds using a photon flux of 3.2 × 10 21 photons/(s m 2 ). 
The optimization of the sample position with respect to the electron energy analyzer and the photon beam, and the adjustment of the emission angles -such that the detector image cuts through the center of the Brillouin zone-were carried out on a part of the cleave one or more millimeters away from the point where the data of Figs. 2 and 3 were recorded. This means that the E D values for the locations measured for Figs. 2 and 3 represent those for regions with carefully controlled EUV exposure [62] .

Please help to fix it. Thanks.

Performance improvement?

I am not certain of this, but I suspect there might be room for performance improvement by using re.compile to precompile all of the needed regexes. Otherwise they have to be recompiled regularly (once the re module's cache of 100 patterns has been exceeded).
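
As a sketch of the idea (the pattern below is illustrative only, not one of pysbd's actual rules):

import re

# compiled once at import time; passing a pattern string to re.sub on every
# call instead relies on the re module's internal cache, which can be evicted
PERIOD_BEFORE_NUMERIC_REF = re.compile(r"\.(?=\s*\[\d+\])")

def mark_periods_before_refs(text, marker="∯"):
    # swap the period for a placeholder, as the processor's rules do
    return PERIOD_BEFORE_NUMERIC_REF.sub(marker, text)
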

🐛 doc_type='pdf' no longer works

Describe the bug
After the latest update, pdf mode no longer works. New lines always seem to get recognized as sentence boundaries.
To Reproduce
Steps to reproduce the behavior:
Input text - "This is a sentence\ncut off in the middle because pdf."
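
A minimal repro sketch (the instantiation mirrors the doc_type="pdf" usage shown earlier on this page):

import pysbd

seg = pysbd.Segmenter(language="en", clean=True, doc_type="pdf")
print(seg.segment("This is a sentence\ncut off in the middle because pdf."))
# expected: the mid-sentence newline is handled, yielding a single sentence
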

Expected behavior
Expected output - the input kept as a single sentence: "This is a sentence\ncut off in the middle because pdf."

How to modify segmentation rules by hand?

I have the following piece of text which I feed to pysbd.Segmenter:

'Trying to get back to Com. & Adm. through the most direct path in the dark.'

The correct way of handling this text is to keep it as a single sentence, although the segment() method returns:

['Trying to get back to Com.',
 '& Adm.',
 'through the most direct path in the dark.']

How do I tell the segmenter to avoid splitting a sentence at specific abbreviations, in this case "Com." and "Adm."? The poster in the README file states that the rules are "easy to modify", so how do I do that?
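
One possible direction is to extend the abbreviation list of the language module before segmenting. A hypothetical sketch (check pysbd/lang/standard.py in your installed version for the actual layout; whether this alone prevents the split depends on which rules fire for this input):

import pysbd
from pysbd.lang.standard import Standard

# hypothetical: register lowercase abbreviation forms so "Com." and "Adm."
# are treated as abbreviations rather than sentence boundaries
for abbr in ("com", "adm"):
    if abbr not in Standard.Abbreviation.ABBREVIATIONS:
        Standard.Abbreviation.ABBREVIATIONS.append(abbr)

seg = pysbd.Segmenter(language="en", clean=False)
print(seg.segment("Trying to get back to Com. & Adm. through the most direct path in the dark."))
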

periods replacing characters

A few characters and whitespaces have been replaced with ".", e.g. "time when" -> "ti.e.when":

>>> segmenter.segment("Thus, we first compute EMC 3 's response time-i.e., the duration from the initial of a call (from/to a participant in the target region) to the time when the decision of task assignment is made; and then, based on the computed response time, we estimate EMC 3 maximum throughput [28]-i.e., the maximum number of mobile users allowed in the MCS system. EMC 3 algorithm is implemented with the Java SE platform and is running on a Java HotSpot(TM) 64-Bit Server VM; and the implementation details are given in Appendix, available in the online supplemental material.")
["Thus, we first compute EMC 3 's response ti.e.i.e., the duration from the initial of a call (from/to a participant in the target region) to the ti.e.when the decision of task assignment is made; and then, based on the computed response ti.e. we estimate EMC 3 maximum throughput [28]-i.e., the maximum number of mobi.e.users allowed in the MCS system.", 'EMC 3 algorithm is implemented with the Java SE platform and is running on a Java HotSpot(TM) 64-Bit Server VM; and the implementation details are gi.e. in Appendix, available in the onli.e.supplemental material.']

character getting swallowed

I found a case where a character is swallowed from the output (the 9 in 1998)

>>> segmenter.segment("Random walk models (Skellam, 1951;Turchin, 1998) received a lot of attention and were then extended to several more mathematically and statistically sophisticated approaches to interpret movement data such as State-Space Models (SSM) (Jonsen et al., 2003(Jonsen et al., , 2005 and Brownian Bridge Movement Model (BBMM) (Horne et al., 2007). Nevertheless, these models require heavy computational resources (Patterson et al., 2008) and unrealistic structural a priori hypotheses about movement, such as homogeneous movement behavior. A fundamental property of animal movements is behavioral heterogeneity (Gurarie et al., 2009) and these models poorly performed in highlighting behavioral changes in animal movements through space and time (Kranstauber et al., 2012).")
['Random walk models (Skellam, 1951;Turchin, 199) received a lot of attention and were then extended to several more mathematically and statistically sophisticated approaches to interpret movement data such as State-Space Models (SSM) (Jonsen et al., 2003(Jonsen et al., , 2005 and Brownian Bridge Movement Model (BBMM) (Horne et al., 2007).', 'Nevertheless, these models require heavy computational resources (Patterson et al., 2008) and unrealistic structural a priori hypotheses about movement, such as homogeneous movement behavior.', 'A fundamental property of animal movements is behavioral heterogeneity (Gurarie et al., 2009) and these models poorly performed in highlighting behavioral changes in animal movements through space and time (Kranstauber et al., 2012).']
>>> 

Question marks swallowed from input

Got a couple of examples of question marks being removed from the input text:

>>> segmenter.segment("T stands for the vector transposition. As shown in Fig. ??")
['T stands for the vector transposition.', 'As shown in Fig.']
>>> segmenter.segment("??")
[]

Long number stalls process.

t = 'Rok bud.2027777983834843834843042003200220012000199919981997199619951994199319921991199019891988198042003200220012000199919981997199619951994199319921991199019891988198'
segmenter.segment(t)

Stalls. Apparently replace_periods_before_numeric_references takes forever.

IndexError: list index out of range

I get this error for some texts.

Here is an example of where it fails:

>>> import pysbd
>>> text = "This new form of generalized PDF in (9) is generic and suitable for all the fading models presented in Table I withbranches MRC reception. In section III, (9) will be used in the derivations of the unified ABER and ACC expression."
>>> segmenter = pysbd.Segmenter(language="en", clean=False)
>>> sents = segmenter.segment(text)

pysbd just hangs 🐛

Describe the bug
The process hangs.

To Reproduce
Steps to reproduce the behavior:
Input text - "f.302205302116302416302500302513915bd"

flat = "f.302205302116302416302500302513915bd"
print(flat)
segmenter = pysbd.Segmenter(language="en", clean=True, char_span=False)
for z in segmenter.segment(flat):
    print(z)


Expected behavior
Return f.302205302116302416302500302513915

Example:
['f.302205302116302416302500302513915bd']

