This project forked from lemonskyjwt/plpstgrssearch

Howto for adding Polish language to PostgreSQL

License: MIT License

Shell 100.00%

plpstgrssearch's Introduction

NAME

pg_hunspell - add a language to PostgreSQL's Full Text Search (FTS) with Hunspell

SYNOPSIS

pg_hunspell_install pl PL polish

DESCRIPTION

This utility does all the legwork. To add a language dictionary to your PostgreSQL instance, you'll need three dictionary files:

  • .dict
  • .affix
  • .stop

The script downloads them and produces the SQL that sets them up as a dictionary and an FTS configuration in the database.

Obtaining dictionary files

There are two sources:

  • On Debian or Ubuntu, the script uses apt to install the files, or dpkg if they're already installed.
  • If you're not on Debian or Ubuntu, it fetches them from the upstream projects on GitHub. The dictionary and affix files come from LibreOffice Dictionaries, and the stop words come from the stopwords-iso project.

Once this is done, the files are processed and installed into the tsearch_data directory of your PostgreSQL installation. You can find that directory's parent with pg_config --sharedir. If you're on Debian or Ubuntu, you'll be prompted to do this automatically. If you wish to install for another version of PostgreSQL, you'll have to copy the files listed above to the appropriate tsearch_data location.
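For the manual case, the copy step can be sketched as below. This is not the script's own code: install_dict_files is a hypothetical helper, and the base name polish and the ./build directory are assumptions. It relies only on the fact that PostgreSQL looks for NAME.dict, NAME.affix and NAME.stop under SHAREDIR/tsearch_data, and it expects the files to be already converted to UTF-8 and renamed.

```shell
#!/bin/sh
# Hypothetical helper (not part of pg_hunspell_install): copy prepared
# dictionary files into the tsearch_data directory of one PostgreSQL install.
#   $1  directory holding NAME.dict, NAME.affix and NAME.stop
#   $2  PostgreSQL share directory, e.g. the output of `pg_config --sharedir`
#   $3  dictionary base name (assumed here to be "polish")
install_dict_files() {
    src_dir=$1
    share_dir=$2
    name=$3
    dest="$share_dir/tsearch_data"

    if [ ! -d "$dest" ]; then
        echo "tsearch_data not found under $share_dir" >&2
        return 1
    fi

    # Check that all three files exist before copying anything.
    for ext in dict affix stop; do
        if [ ! -f "$src_dir/$name.$ext" ]; then
            echo "missing $src_dir/$name.$ext" >&2
            return 1
        fi
    done

    for ext in dict affix stop; do
        cp "$src_dir/$name.$ext" "$dest/$name.$ext" || return 1
    done
}

# Example call (commented out; ./build is an assumed location):
# install_dict_files ./build "$(pg_config --sharedir)" polish
```

To target a different PostgreSQL version, pass the share directory reported by that version's own pg_config.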

Creating search dict and config in PostgreSQL

After you run the pg_hunspell_install script, it prints the SQL to CREATE the DICTIONARY and CONFIGURATION, as well as the catalog annotations. Simply start psql and run these commands.
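The exact SQL the script prints is not reproduced here. As a rough sketch of what such statements can look like, the hypothetical function below emits a dictionary built on PostgreSQL's ispell template and a configuration that uses it. The names polish_dict and polish match the test output further down; copying from pg_catalog.simple and the fallback to the simple dictionary are assumptions, not necessarily what the script does.

```shell
#!/bin/sh
# Hypothetical sketch: print SQL that registers an ispell-template dictionary
# and a text search configuration for the given base name.
print_fts_sql() {
    name=$1

    # Accept only plain lowercase identifiers so the name is safe inside SQL.
    case $name in
        ''|*[!a-z0-9_]*)
            echo "invalid dictionary name: $name" >&2
            return 1
            ;;
    esac

    cat <<EOF
CREATE TEXT SEARCH DICTIONARY ${name}_dict (
    TEMPLATE = ispell,
    DictFile = ${name},
    AffFile = ${name},
    StopWords = ${name}
);

CREATE TEXT SEARCH CONFIGURATION ${name} (COPY = pg_catalog.simple);

ALTER TEXT SEARCH CONFIGURATION ${name}
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
                      word, hword, hword_part
    WITH ${name}_dict, simple;
EOF
}

# Example call (commented out): print the statements for "polish".
# print_fts_sql polish
```

DictFile, AffFile and StopWords take base names only; PostgreSQL appends .dict, .affix and .stop and looks in tsearch_data.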

Testing

Postgres comes with the ts_debug function, which is useful for testing text search configs. Let's test a random phrase:

SELECT token, dictionary, lexemes
FROM ts_debug(
  'polish',
  'Szybkie brązowe lisy przeskoczyły ponad starym szarym Burkiem który spał.'
)
WHERE alias <> 'blank';

Here's the output:

    token     | dictionary  |      lexemes        
--------------+-------------+---------------------
 Szybkie      | polish_dict | {szybki}
 brązowe      | polish_dict | {brązowy}
 lisy         | polish_dict | {lisa,lis}
 przeskoczyły | polish_dict | {przeskoczyć}
 ponad        | polish_dict | {}
 starym       | polish_dict | {starym,stary,stara}
 szarym       | polish_dict | {szary}
 Burkiem      | polish_dict | {burkiem,burek}
 który        | polish_dict | {}
 spał         | polish_dict | {spała,spać}

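This shows that the dict we've just added, polish_dict, was used and that valid lexemes were resolved for the words in the phrase. The empty lists for ponad and który mean those tokens were recognized as stop words and dropped. As a result, a search for szybki lis would match szybkie lisy.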
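To check the match claim directly, a query along these lines can be pasted into psql. The wrapper function is hypothetical and only prints the statement; the query uses the standard to_tsvector, plainto_tsquery and @@ operator, and assumes the configuration is named polish as above.

```shell
#!/bin/sh
# Hypothetical sketch: print a query that checks whether the base forms
# "szybki lis" match the inflected phrase under the polish configuration.
print_match_check() {
    cat <<'EOF'
SELECT to_tsvector('polish', 'Szybkie brązowe lisy')
       @@ plainto_tsquery('polish', 'szybki lis') AS matches;
EOF
}

# Example call (commented out; "mydb" is an assumed database name):
# print_match_check | psql -d mydb
```

If the dictionary resolves lexemes as in the table above, the matches column should come back true.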

GOTCHA: You may need to additionally unaccent your strings if you want to search without diacritics; otherwise your search may return surprising results.

If you want searches for, say, iphone to match apple phone, you should research the thesaurus features.
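As a starting point for that research, here is a minimal sketch using PostgreSQL's built-in thesaurus template. Everything named here is an assumption: the helper setup_thesaurus, the file name polish_thesaurus.ths, the choice of simple as the subdictionary, and the mapping order. It is untested against a real Polish setup.

```shell
#!/bin/sh
# Hypothetical sketch: write a one-entry thesaurus file into tsearch_data and
# print the SQL that registers it ahead of polish_dict for ASCII words.
#   $1  PostgreSQL share directory, e.g. the output of `pg_config --sharedir`
setup_thesaurus() {
    share_dir=$1
    dest="$share_dir/tsearch_data"

    if [ ! -d "$dest" ]; then
        echo "tsearch_data not found under $share_dir" >&2
        return 1
    fi

    # Thesaurus file format: sample phrase, a colon, then the replacement.
    printf '%s\n' 'iphone : apple phone' > "$dest/polish_thesaurus.ths" || return 1

    cat <<'EOF'
CREATE TEXT SEARCH DICTIONARY polish_thesaurus (
    TEMPLATE = thesaurus,
    DictFile = polish_thesaurus,
    Dictionary = pg_catalog.simple
);

ALTER TEXT SEARCH CONFIGURATION polish
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart
    WITH polish_thesaurus, polish_dict, simple;
EOF
}

# Example call (commented out):
# setup_thesaurus "$(pg_config --sharedir)"
```

Text indexed before the thesaurus was added or changed needs to be reindexed for the new entries to take effect.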

SEE ALSO

Something I didn't know when I started this: Debian provides pg_updatedicts. Its update mechanism currently does some things this script doesn't, so in some ways it is better. Given enough time, I'll borrow those ideas. It doesn't handle stop words, and I'm going to get a thesaurus up too. After that, I'll submit this to mainline PostgreSQL, and also to Debian to see if they're interested in upgrading.

SOURCE CODE

Find the code on the GitHub Repository. PRs welcome!

AUTHORS

  • Evan Carroll, 14 Nov 2017
  • Rafał Pitoń, 09 Aug 2016
