liulalemx / felig-toolkit

A toolset for Amharic language pre-processing. Includes an Amharic stemmer, transliterator, stopword remover, lexical analyzer, corpus indexer, and term weighter.

Home Page: https://felig-toolkit-web.vercel.app/

License: MIT License

Felig Toolkit

A toolset for Amharic Language pre-processing 🔧

Felig Toolkit Web

Now with TypeScript support!


What is felig-toolkit?

It is a toolset for Amharic Language pre-processing. It includes an Amharic Stemmer, Amharic Transliterator, Amharic Stopword remover, Amharic Lexical analyzer, Amharic Corpus indexer and Term weighter.

Amharic Lexical Analyzer

Breaks an Amharic corpus into tokens by removing whitespace, expanding abbreviations (አ.አ -> አዲስ አበባ), removing numbers, splitting hyphenated words, and removing punctuation (፡ ። ! ? ...).
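
The steps listed above can be sketched as a small pipeline. This is a minimal illustration, not the package's implementation: the abbreviation table holds only the one entry mentioned in the text, the function name `lexAnalyzeSketch` is hypothetical, and it returns an array of tokens rather than the string that `lexAnalyze` is documented to return.

```javascript
// Hypothetical abbreviation table; only the entry from the text is included.
const ABBREVIATIONS = { "አ.አ": "አዲስ አበባ" };

// Ethiopic punctuation (U+1361 to U+1368) plus common ASCII punctuation.
const PUNCTUATION = /[\u1361-\u1368!?.,;:"'()\[\]{}]/g;

// ASCII digits and Ethiopic digits/numbers (U+1369 to U+137C).
const NUMBERS = /[0-9\u1369-\u137C]/g;

function lexAnalyzeSketch(corpus) {
  let text = corpus;
  // 1. Expand abbreviations first, before their dots are stripped as punctuation.
  for (const [abbr, full] of Object.entries(ABBREVIATIONS)) {
    text = text.split(abbr).join(full);
  }
  // 2. Split hyphenated words, 3. drop numbers, 4. drop punctuation.
  text = text.replace(/-/g, " ").replace(NUMBERS, " ").replace(PUNCTUATION, " ");
  // 5. Split on whitespace and discard empty pieces.
  return text.split(/\s+/).filter((token) => token.length > 0);
}
```

Expanding abbreviations before stripping punctuation matters here, because the abbreviation itself contains a dot.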

Amharic Stopword remover

Removes commonly occurring words that contribute nothing to the semantics of the corpus. E.g.: እና ፡ ስለዚህ ፡ በመሆኑም...
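
A minimal sketch of stopword filtering, assuming whitespace-separated input. The list below contains only the three example words from the text; the package ships its own list, which is not reproduced here, and `removeStopwordsSketch` is a hypothetical name.

```javascript
// Illustrative stopword list: just the three examples given in the text.
const STOPWORDS = new Set(["እና", "ስለዚህ", "በመሆኑም"]);

function removeStopwordsSketch(corpus) {
  return corpus
    .split(/\s+/)
    // Keep every non-empty word that is not in the stopword set.
    .filter((word) => word.length > 0 && !STOPWORDS.has(word))
    .join(" ");
}
```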

Amharic Transliterator

Converts Unicode Amharic characters to ASCII. Example: ልጆች -> ልጅኦች -> ljoc. This tool implements two Amharic transliteration lookup tables:

  • SERA (System for Ethiopic Representation in ASCII) - Maps letters that sound the same to separate ASCII forms. E.g.: (ሀ፣ሐ፡ኀ)፣(ሰ፡ሠ)፡(ጸ፡ፀ)፡(ዐ፡አ). In practice, however, these letters are used interchangeably, so using SERA would greatly decrease recall. NOT RECOMMENDED!

  • Felig - Normalizes the redundant symbols into a common symbol. RECOMMENDED!
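
The two stages in the example (syllable to consonant plus vowel letter, then to ASCII) and the normalization of redundant symbols can be sketched as below. Several things here are assumptions rather than the package's actual tables: which letter of each homophone group is treated as canonical, the tiny ASCII table (only `l`, `j`, `o`, `c` come from the example), and all function names. The decomposition relies on the Unicode layout of the Ethiopic block, where each plain consonant series starts on a multiple of 8, and it ignores the labiovelar series.

```javascript
// [start of redundant series, start of assumed canonical series]; 7 vowel orders each.
const HOMOPHONE_SERIES = [
  [0x1210, 0x1200], // ሐ -> ሀ
  [0x1280, 0x1200], // ኀ -> ሀ
  [0x1220, 0x1230], // ሠ -> ሰ
  [0x1340, 0x1338], // ፀ -> ጸ
  [0x12d0, 0x12a0], // ዐ -> አ
];

function normalizeHomophones(text) {
  return Array.from(text, (ch) => {
    const code = ch.codePointAt(0);
    for (const [from, to] of HOMOPHONE_SERIES) {
      if (code >= from && code < from + 7) {
        return String.fromCodePoint(to + (code - from));
      }
    }
    return ch;
  }).join("");
}

// Vowel letters for orders 0 to 6; order 5 is the bare consonant, so it adds nothing.
const VOWEL_LETTERS = ["አ", "ኡ", "ኢ", "ኣ", "ኤ", "", "ኦ"];

// Rewrite each syllable as sixth-order consonant + vowel letter, e.g. ጆ -> ጅኦ.
function decompose(text) {
  return Array.from(text, (ch) => {
    const code = ch.codePointAt(0);
    const order = code & 7;
    const isSyllable = code >= 0x1200 && code <= 0x1357 && order !== 7;
    const isVowelLetter = code >= 0x12a0 && code <= 0x12a7;
    if (!isSyllable || isVowelLetter) return ch;
    return String.fromCodePoint((code & ~7) + 5) + VOWEL_LETTERS[order];
  }).join("");
}

// Tiny ASCII table covering only the letters of the example word.
const ASCII_TABLE = { "ል": "l", "ጅ": "j", "ች": "c", "ኦ": "o" };

function transliterateSketch(word) {
  const decomposed = decompose(normalizeHomophones(word));
  // Characters missing from the table are passed through unchanged.
  return Array.from(decomposed, (ch) => ASCII_TABLE[ch] || ch).join("");
}
```

With these assumptions, `transliterateSketch("ልጆች")` follows the same path as the example: ልጆች, then ልጅኦች, then `ljoc`.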

Amharic Stemmer LIVE DEMO

Reduces the morphological variants (e.g. inflectional or derivational) of an Amharic word to a common stem: it takes an Amharic word and returns the stem through affix removal with longest match.

Example: ልጆች -> ልጅኦች -> ljoc -> lj -> ልጅ
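
The `ljoc -> lj` step of the example, affix removal with longest match, can be sketched on the transliterated form as follows. The suffix list is hypothetical: only `oc` comes from the example, while `ocu` and `u` are added to show why longer suffixes are tried first. The minimum stem length is also an assumption, and prefix handling is omitted.

```javascript
// Hypothetical suffix list in transliterated form; only "oc" is from the example.
const SUFFIXES = ["u", "oc", "ocu"];

function stripLongestSuffix(word, suffixes, minStemLength = 2) {
  // Try longer suffixes first so that "ocu" wins over "u".
  const byLength = [...suffixes].sort((a, b) => b.length - a.length);
  for (const suffix of byLength) {
    const stemLength = word.length - suffix.length;
    if (word.endsWith(suffix) && stemLength >= minStemLength) {
      return word.slice(0, stemLength);
    }
  }
  // No suffix matched (or the stem would be too short): return the word as-is.
  return word;
}
```

Without the longest-match ordering, `ljocu` would lose only the final `u` and yield `ljoc` instead of `lj`.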

Amharic Corpus Indexer

Produces an index file for the stemmed words in a corpus and relates them with the files they are found in. It also stores their frequencies per file.
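
A minimal in-memory sketch of such an index, mapping each stemmed term to the files it occurs in and its frequency per file. The layout `{ term: { file: count } }` and the name `buildIndexSketch` are assumptions; the format of the package's actual .json index file is not described here.

```javascript
// docs: { fileName: [stemmed terms] }  ->  { term: { fileName: count } }
function buildIndexSketch(docs) {
  // Null-prototype objects avoid clashes with built-in keys such as "constructor".
  const index = Object.create(null);
  for (const [file, terms] of Object.entries(docs)) {
    for (const term of terms) {
      if (!index[term]) index[term] = Object.create(null);
      index[term][file] = (index[term][file] || 0) + 1;
    }
  }
  return index;
}
```

The result can be written to disk with `JSON.stringify(index)` if a file is needed.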

Term Weighter

Calculates the weight of each word in the index file as the product of its length-normalized term frequency and its inverse document frequency (tf*idf).
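
One common reading of this weighting is weight(t, d) = (count of t in d / number of terms in d) * log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing t. The sketch below follows that reading. The base-2 logarithm, the input layout, and the name `weighTermsSketch` are assumptions, not details confirmed by the package.

```javascript
// index: { term: { file: count } }, docLengths: { file: total number of terms }
function weighTermsSketch(index, docLengths) {
  const totalDocs = Object.keys(docLengths).length;
  const weights = Object.create(null);
  for (const term of Object.keys(index)) {
    const postings = index[term];
    const docFreq = Object.keys(postings).length;
    // Terms found in every document get idf = 0 and therefore weight 0.
    const idf = Math.log2(totalDocs / docFreq);
    weights[term] = Object.create(null);
    for (const file of Object.keys(postings)) {
      const tf = postings[file] / docLengths[file]; // length-normalized frequency
      weights[term][file] = tf * idf;
    }
  }
  return weights;
}
```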

Installation

Felig Toolkit is available as a package on NPM for use in a Node application:

# NPM
npm install felig-toolkit
# YARN
yarn add felig-toolkit
# PNPM
pnpm install felig-toolkit

Example

Note: this package uses ES modules.

import felig_toolkit from 'felig-toolkit'

What's Included

  • felig_transliterate(word,lang): takes a single word and its language (am/en) and returns the Felig-transliterated string.

  • sera_transliterate(word,lang): takes a single word and its language (am/en) and returns the SERA-transliterated string.

  • rmvStopwrd(corpus): takes Amharic corpus text (a sentence, a paragraph, or multiple paragraphs) and removes stop words.

  • lexAnalyze(corpus): takes Amharic corpus text and returns a string of tokens.

  • stem(word): takes an Amharic word string and returns the stem as a string (async)

  • indexer(filesArray, outputIndexFilePath, type): takes an array of files and produces an index (.json) file. (type= "doc" | "query")

  • weigh_terms(indexFilePath, outputWeightedTermsPath, typeOfIndex): takes an index file and produces a file (.json) with weighted terms. (typeOfIndex= "doc" | "query")

How to use in Web apps

Felig Toolkit does not work in the browser (it requires a Node.js environment).

Use felig-toolkit on your server.

Example: If you are using Next.js, you can call felig-toolkit in a Next.js server route handler (/api/felig/route.ts) and pass the results to the client.
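
A rough sketch of such a route handler, using the standard `Request` and `Response` objects (available in Node.js 18 and later). The stemming function is passed in as a parameter so the sketch does not depend on how the package exposes its functions; `makeStemHandler` and the JSON shape `{ word, stem }` are hypothetical.

```javascript
// `stem` is injected; it is expected to be async, as documented for stem(word).
function makeStemHandler(stem) {
  return async function POST(request) {
    const jsonHeaders = { "Content-Type": "application/json" };
    let body;
    try {
      body = await request.json();
    } catch {
      body = null; // invalid or missing JSON body
    }
    if (!body || typeof body.word !== "string" || body.word.length === 0) {
      return new Response(JSON.stringify({ error: "word is required" }), {
        status: 400,
        headers: jsonHeaders,
      });
    }
    const result = await stem(body.word);
    return new Response(JSON.stringify({ word: body.word, stem: result }), {
      status: 200,
      headers: jsonHeaders,
    });
  };
}

// In /api/felig/route.ts this might be wired up as follows, assuming the
// functions are properties of the default export (not confirmed here):
//   import felig_toolkit from 'felig-toolkit'
//   export const POST = makeStemHandler(felig_toolkit.stem)
```

The browser then sends the word to `/api/felig` and reads the stem from the JSON response, so the package itself only ever runs on the server.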

Contributions

felig-toolkit is open to contributions, but it is recommended that you first create an issue or reply in a comment to let others know what you are working on.

How to run locally

Prerequisites

  1. Clone the repository
  2. Run npm install
  3. Run node index.js in the root directory

Attribution

To prepare the following tools, these academic papers were used
