A sentence splitting (sentence boundary disambiguation) library for Go. It is rule-based and works out-of-the-box.

Home Page: https://gosbd.pages.dev/

License: MIT License


GoSBD: Sentence Splitting (Sentence Boundary Disambiguation) Library for Go

GoSBD is a library for segmenting text into sentences, designed to make it easier to build Retrieval Augmented Generation (RAG) systems in Go. It is inspired by pySBD and pragmatic_segmenter, and works out-of-the-box with a rule-based approach.

Playground

Try out GoSBD in our online playground.

Features

  • Sentence Splitting: Efficiently breaks down a block of text into individual sentences.
  • Lightweight and Easy Integration: Designed to be lightweight and easy to integrate into existing Go projects.
  • High Accuracy: Offers high accuracy in sentence segmentation. For more details, see pySBD.
  • Fast Sentence Splitting: GoSBD aims to provide high-performance sentence splitting by leveraging Go's efficiency.
  • Non-Destructive Splitting: Segments text into sentences without altering the original content.
  • Language-Specific Configuration: Adaptable to handle punctuation rules specific to different languages.
  • Text Cleaning: Equipped with features to manage and clean noisy text, including:
    • Handling irregular newline characters and spacing
    • Processing Tables of Contents
    • Recognizing and managing URLs and HTML tags
    • Dealing with sentences that are delimited without any space

Note: The Text Cleaning feature is not yet implemented. Contributions are very welcome.

Installation

To install gosbd, you can use go get:

go get github.com/gosbd/gosbd

Usage

Here's a basic example of how to use gosbd:

package main

import (
    "fmt"
    "github.com/gosbd/gosbd"
)

// This example segments a text string into individual sentences.
func main() {
    segmenter := gosbd.NewSegmenter("en")
    text := "This is a sentence. And this is another one."
    sentences := segmenter.Segment(text)
    for _, sentence := range sentences {
        fmt.Println(sentence)
    }
}

Roadmap

  • Add Online Playground.
  • Add chunking feature with overlapping option.
  • Set up Codecov for monitoring test coverage.
  • Implement text cleaner.
  • Add support for more languages.
  • Add benchmark test.
  • Set up GitHub Actions for testing.
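
The chunking item above is not implemented yet. As a rough illustration of what sentence-level chunking with overlap could look like in a RAG pipeline, here is a minimal sketch. The function name chunkSentences and its parameters are hypothetical and not part of GoSBD; it only assumes the sentences arrive as a []string, such as the output of Segment.

```go
package main

import (
	"fmt"
	"strings"
)

// chunkSentences is a hypothetical helper (not part of GoSBD). It groups
// sentences into chunks of at most size sentences, where each chunk repeats
// the last overlap sentences of the previous chunk.
func chunkSentences(sentences []string, size, overlap int) ([]string, error) {
	if size <= 0 {
		return nil, fmt.Errorf("size must be positive, got %d", size)
	}
	if overlap < 0 || overlap >= size {
		return nil, fmt.Errorf("overlap must be in [0, size), got %d", overlap)
	}
	step := size - overlap // how far the window advances each time
	var chunks []string
	for start := 0; start < len(sentences); start += step {
		end := start + size
		if end > len(sentences) {
			end = len(sentences)
		}
		chunks = append(chunks, strings.Join(sentences[start:end], " "))
		if end == len(sentences) {
			break // the last window already reached the final sentence
		}
	}
	return chunks, nil
}

func main() {
	sentences := []string{"One.", "Two.", "Three.", "Four.", "Five."}
	chunks, err := chunkSentences(sentences, 3, 1)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for i, c := range chunks {
		fmt.Printf("chunk %d: %s\n", i, c)
	}
}
```

With a window of 3 and an overlap of 1, the five sample sentences produce two chunks that share the sentence "Three.". Counting in sentences rather than characters keeps every chunk aligned to sentence boundaries; a token or character budget would need a different stopping rule.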

Language Support Roadmap

The following table outlines our current language support. We're actively seeking contributions to expand this list. If you're interested in contributing, consider helping us add support for a language, whether it's listed below or not. Your expertise in a language not listed here could be a valuable addition to our project.

Language    ISO Code    Supported
Amharic     am          Planned
Arabic      ar          Planned
Armenian    hy          Planned
Bulgarian   bg          Planned
Burmese     my          Planned
Chinese     zh          Yes
Danish      da          Planned
German      de          Planned
Dutch       nl          Planned
English     en          Yes
French      fr          Planned
Greek       el          Planned
Hindi       hi          Planned
Italian     it          Planned
Japanese    ja          Yes
Kazakh      kk          Planned
Marathi     mr          Planned
Persian     fa          Planned
Polish      pl          Planned
Russian     ru          Yes
Slovak      sk          Planned
Spanish     es          Planned
Urdu        ur          Planned

We welcome contributions that help us add support for these languages. Please feel free to submit a Pull Request with your contributions.
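
This document does not say how NewSegmenter behaves when it is given a language code that is only "Planned". One cautious option is to check the code on the caller's side before building a segmenter. The sketch below is an assumption-laden example: the helper checkLanguage and the supported map are hypothetical, and the map simply mirrors the "Yes" rows of the table above, so it must be updated by hand as support grows.

```go
package main

import (
	"fmt"
	"strings"
)

// supported mirrors the rows marked "Yes" in the language table.
// It is a hand-maintained assumption, not something exported by GoSBD.
var supported = map[string]string{
	"en": "English",
	"ja": "Japanese",
	"ru": "Russian",
	"zh": "Chinese",
}

// checkLanguage is a hypothetical guard that reports an error for any
// ISO code not listed in supported. Codes are normalized before lookup.
func checkLanguage(code string) error {
	code = strings.ToLower(strings.TrimSpace(code))
	if _, ok := supported[code]; !ok {
		return fmt.Errorf("language %q is not marked as supported", code)
	}
	return nil
}

func main() {
	for _, code := range []string{"en", "fr"} {
		if err := checkLanguage(code); err != nil {
			fmt.Println("skip:", err)
			continue
		}
		fmt.Println("ok:", code)
	}
}
```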

Motivation

Sentence splitting is a crucial step in the preprocessing pipeline of Natural Language Processing (NLP) tasks, especially for building Retrieval Augmented Generation (RAG) systems. RAG systems rely on accurately segmented sentences to retrieve relevant information and generate coherent responses.

While libraries like pragmatic_segmenter and pySBD are known for their high accuracy and efficiency in sentence splitting, there are no equivalent libraries available in Go. This poses a challenge for developers building RAG systems in Go, as they need to rely on external libraries or implement their own sentence splitting logic.

GoSBD aims to bridge this gap by providing a reliable and efficient sentence splitting solution in Go. By offering a native Go library for sentence splitting, GoSBD simplifies the process of building RAG systems and other NLP applications entirely within the Go ecosystem. This not only streamlines the development workflow but also enables faster execution times by leveraging Go's performance characteristics.

Acknowledgement

This library builds upon the excellent foundations laid by pySBD and pragmatic_segmenter.

Contributing

Contributions are greatly appreciated and crucial for this project! Here are a few ways you can contribute:

  • Add new tests and rules: Improve the accuracy of sentence segmentation by adding new tests and rules.
  • Add support for a new language: Help expand the reach of this library by adding support for new languages.
  • Port features: Help improve this library by porting features that are supported in pySBD and pragmatic_segmenter.

Please feel free to submit a Pull Request with your contributions.

License

This project is licensed under the MIT License.

gosbd's Issues

panic: runtime error: slice bounds out of range

[transformer] 2024/04/30 18:02:22 msg: Ampere
[transformer] 2024/04/30 18:02:22 msg: Anarchism
[transformer] 2024/04/30 18:02:23 msg: Algorithm
[transformer] 2024/04/30 18:02:23 msg: Annual plant
[transformer] 2024/04/30 18:02:23 msg: Anthophyta
[transformer] 2024/04/30 18:02:23 msg: Albedo
[transformer] 2024/04/30 18:02:23 msg: A
[transformer] 2024/04/30 18:02:23 msg: Mouthwash
[transformer] 2024/04/30 18:02:23 msg: Alabama
[transformer] 2024/04/30 18:02:24 msg: Alexander the Great
[transformer] 2024/04/30 18:02:24 msg: Alfred Korzybski
[transformer] panic: runtime error: slice bounds out of range [:448] with length 447
[transformer]
[transformer] goroutine 22 [running]:
[transformer] github.com/gosbd/gosbd/internal/processor.(*Processor).sentenceBoundaryPunctuation(0xc000140460, {0xc001e7c700?, 0x1c0?})
[transformer] 	/go/pkg/mod/github.com/gosbd/[email protected]/internal/processor/processor.go:180 +0x2a5
[transformer] github.com/gosbd/gosbd/internal/processor.(*Processor).processText(0xc000140460, {0xc0022c6000, 0x1bd})
[transformer] 	/go/pkg/mod/github.com/gosbd/[email protected]/internal/processor/processor.go:148 +0x23c
[transformer] github.com/gosbd/gosbd/internal/processor.(*Processor).checkForPunctuation(0xc000140460, {0xc0022c6000, 0x1bd})
[transformer] 	/go/pkg/mod/github.com/gosbd/[email protected]/internal/processor/processor.go:130 +0x91
[transformer] github.com/gosbd/gosbd/internal/processor.(*Processor).splitIntoSegments(0xc000140460, {0xc0022a0800?, 0x275f?})
[transformer] 	/go/pkg/mod/github.com/gosbd/[email protected]/internal/processor/processor.go:57 +0x2ff
[transformer] github.com/gosbd/gosbd/internal/processor.(*Processor).Process(0xc000140460, {0xc0012c7000?, 0xc000092aa0?})
[transformer] 	/go/pkg/mod/github.com/gosbd/[email protected]/internal/processor/processor.go:43 +0x2a5
[transformer] github.com/gosbd/gosbd/internal/segmenter.(*Segmenter).Segment(0xc000119170, {0xc0012c7000?, 0x280d?})
[transformer] 	/go/pkg/mod/github.com/gosbd/[email protected]/internal/segmenter/segmenter.go:37 +0x5d
[transformer] main.main.func1({0x872b38, 0xc001f80100})
[transformer] 	/app/cmd/transformer/main.go:62 +0x96
[transformer] github.com/nats-io/nats.go/jetstream.(*pullConsumer).Consume.func1(0xc000144d20)
[transformer] 	/go/pkg/mod/github.com/nats-io/[email protected]/jetstream/pull.go:245 +0x2be
[transformer] github.com/nats-io/nats%2ego.(*Conn).waitForMsgs(0xc00021a708, 0xc0001560e0)
[transformer] 	/go/pkg/mod/github.com/nats-io/[email protected]/nats.go:3106 +0x412
[transformer] created by github.com/nats-io/nats%2ego.(*Conn).subscribeLocked in goroutine 1
[transformer] 	/go/pkg/mod/github.com/nats-io/[email protected]/nats.go:4320 +0x3a8

sentences := segmenter.Segment(schema.Content)

There seems to be a problem with the module. It fails on the rendered wikitext (plaintext) from https://en.wikipedia.org/wiki/Alfred_Korzybski.
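
Until the root cause is fixed, a caller can keep one bad document from crashing a long-running consumer by converting the panic into an error. The sketch below is a generic workaround and not a fix for the bug in processor.go. The wrapper safeSegment is hypothetical, and the faulty function is only a stand-in that triggers the same class of runtime error; in real code you would pass segmenter.Segment instead.

```go
package main

import (
	"fmt"
)

// safeSegment is a hypothetical wrapper. It calls segment(text) and turns
// any panic raised inside it into an ordinary error.
func safeSegment(segment func(string) []string, text string) (sentences []string, err error) {
	defer func() {
		if r := recover(); r != nil {
			sentences = nil
			err = fmt.Errorf("segmenter panicked: %v", r)
		}
	}()
	return segment(text), nil
}

func main() {
	// Stand-in for a segmenter that slices one byte past the end of its input.
	faulty := func(text string) []string {
		end := len(text) + 1
		return []string{text[:end]}
	}

	sentences, err := safeSegment(faulty, "Alfred Korzybski")
	if err != nil {
		fmt.Println("skipping document:", err)
		return
	}
	fmt.Println(sentences)
}
```

recover only catches panics raised in the same goroutine as the deferred call. That matches the stack trace above, where Segment is called synchronously from the message handler.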

This sentence is lost

original := "Candidates tied to Tehreek-e-Insaf (PTI), the party of Imran Khan, won the most seats in Pakistan’s general election, despite a de facto ban on their campaign. Mr Khan is in prison on multiple charges, which he says are politically motivated. The Pakistan Muslim League-Nawaz (PML-N), which was widely expected to win, came second. PML-N is the party of Nawaz Sharif, Mr Khan’s arch-rival. It will form a coalition government with the Pakistan Peoples Party, which came third. Mr Khan’s supporters said the election had been rigged, which the PML-N denied. The head of the army claimed the poll had been “free and unhindered”."
// The sentence that is missing from the output.
sent := "The Pakistan Muslim League-Nawaz (PML-N), which was widely expected to win, came second."
segmenter := gosbd.NewSegmenter("en")
sentences := segmenter.Segment(original)
for _, sentence := range sentences {
	fmt.Println(sentence)
}
fmt.Println("expected but missing:", sent)

The result:
[]string len: 6, cap: 8, ["Candidates tied to Tehreek-e-Insaf (PTI), the party of Imran Khan, won the most seats in Pakistan’s general election, despite a de facto ban on their campaign.","Mr Khan is in prison on multiple charges, which he says are politically motivated.","PML-N is the party of Nawaz Sharif, Mr Khan’s arch-rival.","It will form a coalition government with the Pakistan Peoples Party, which came third.","Mr Khan’s supporters said the election had been rigged, which the PML-N denied.","The head of the army claimed the poll had been “free and unhindered”."]
The sentence held in sent, "The Pakistan Muslim League-Nawaz (PML-N), which was widely expected to win, came second.", is missing from the result.
Please fix it. Thanks.

Possible cause:

func (b BetweenPunctuation) punctuationBetweenParens(text string) string {
	return betweenParensRegex.ReplaceAllStringFunc(
		text, b.punctuationReplacer.ReplaceFunc(processor.PunctuationMatchTypeNone))
}
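
A dropped sentence like the one reported here can be detected mechanically, which may help when writing a regression test. The sketch below is a hypothetical diagnostic and not part of GoSBD. It assumes each returned sentence appears verbatim and in order in the original text, and reports any non-whitespace text that falls between or after the matched sentences.

```go
package main

import (
	"fmt"
	"strings"
)

// findGaps is a hypothetical diagnostic. It walks through original, locating
// each sentence in order, and returns every non-whitespace stretch of text
// that no sentence covers. Sentences that cannot be found verbatim are
// skipped, so their text shows up as part of a gap.
func findGaps(original string, sentences []string) []string {
	var gaps []string
	cursor := 0
	for _, s := range sentences {
		if s == "" {
			continue
		}
		idx := strings.Index(original[cursor:], s)
		if idx < 0 {
			continue // not found after the cursor; leave the cursor in place
		}
		if gap := strings.TrimSpace(original[cursor : cursor+idx]); gap != "" {
			gaps = append(gaps, gap)
		}
		cursor += idx + len(s)
	}
	if rest := strings.TrimSpace(original[cursor:]); rest != "" {
		gaps = append(gaps, rest)
	}
	return gaps
}

func main() {
	original := "First one. Second (S-2), which was expected, came next. Third one."
	// Simulated segmenter output with the middle sentence missing.
	sentences := []string{"First one.", "Third one."}
	for _, gap := range findGaps(original, sentences) {
		fmt.Printf("lost text: %q\n", gap)
	}
}
```

In a test, asserting that findGaps returns an empty slice checks that the segmenter accounted for all non-whitespace input.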

Improve overall code readability

The code for v0.1 is essentially a direct translation of pySBD into Go. We need to improve maintainability by making the entire codebase more idiomatic Go.

Can't `go get` the package...

> go get github.com/yohamta/gosbd
go: github.com/yohamta/gosbd@upgrade (v0.1.0) requires github.com/yohamta/gosbd@v0.1.0: parsing go.mod:
	module declares its path as: github.com/gosbd/gosbd
	        but was required as: github.com/yohamta/gosbd
