
eklem / stopword-sami


Sami stopword lists for natural language processing. Example uses include search engines, machine learning and chatbots.

License: MIT License

JavaScript 100.00%
nlp stopwords northern-sami lule-sami southern-sami

stopword-sami's Introduction

stopword-sami


What

WIP! Project to generate stopword lists for the Sami languages, starting with Northern Sami, Lule Sami and South Sami.

Grant from the Sami Parliament

The Sami Parliament is financially supporting the project. Hooray! This will make it possible to finish the project.

The Sami Parliament: Sámediggi (Northern Sami), Sámedigge (Lule Sami), Saemiedigkie (South Sami)

Other Sami languages

These are not planned as of now, but could be if we find text sources and someone to help us verify the lists.

  • Kildin Sami
  • Skolt Sami / Eastern Sami
  • Inari Sami
  • Pite Sami
  • Ume Sami

When the quality of the stopword lists is good enough, they will be added to the stopword module. Northern Sami will most likely be the first to reach good enough quality, followed by Lule Sami and South Sami.

Why stopword lists for Sami languages?

For example, to be able to create good search engines or do machine learning based on content written in the different Sami languages.

Install

If you can avoid crawling and just use the content from this repo, that's good. That means less unnecessary traffic on nrk.no. The content is here and will be updated every month, or more often if you need it, and published to npm.

npm install stopword-sami

To crawl and calculate

To get more content, you first have to get more IDs. Run the crawlIds command first, then the crawlContent command, and then the calcStopwords command.

npm run crawlIds && npm run crawlContent && npm run calcStopwords

Work ahead

  • Generating lists of IDs to crawl: Using nrk-sapmi-crawler to build lists of documents to crawl. These documents will later be crawled, and their text content will be the basis for ongoing stopword training. The more content, the better the lists.

  • Crawl content (work in progress): When the lists cover enough content and nrk-sapmi-crawler can also crawl documents, crawl the actual documents.

  • Start training stopword lists: Run stopword-trainer on the crawled text. From this we'll ask for help to manually verify the lists and to come up with words to add to a red-list for each Sami language. The stopword lists are black-lists: words that you don't want. Every now and then, a word you do want sneaks into a stopword list. Adding it to a red-list makes sure it won't end up in the finished stopword list.

  • Application for funding the last part of the project.

  • Find people who know Lule Sami and South Sami to verify lists. North Sami is already covered.

  • Verifying lists and generating redlists: Need help to generate redlists so the lists can be cleaned and cut off.

  • Decide cutoff. How many words to keep in each list.

  • Add lists that have beta quality to stopword module.

  • Update daq-proc and daq-proc demo to showcase new stopword lists.

  • Lightning talk at NDC Oslo

  • Blog posts to market the lists

Help needed

We need help to verify the generated lists and to help me understand the different traits of the different Sami languages when that time comes.

Also, to generate/train stopword lists, we need text sources. For Northern Sami we will get what we need, but for Lule Sami and South Sami it's a little thin. Maybe we just have to wait for NRK to create more content. For the rest of the languages, we have no source so far. If you know of a data set or a source to generate a data set from, please give us a hint!

Applications: Markdown to Word/PDF conversion

So far, Pandoc has worked well:

pandoc application-draft-02.md -f markdown -s -o application-draft.docx


stopword-sami's Issues

I need text sources

If I get ahold of more text or a site I can crawl for any language other than North, Lule and South Sami, I'll create a stopword list for that too.

And it doesn't matter if the language is not spoken in Norway.

.npmignore

Skip:

datasets/
project_documents/
redlists/
scripts/

Adjust application - more about usage after project ends

Explain that people can:

  1. Build on the project (dependency)
  2. Fork it if they want it to take a different direction.
  3. Contribute to the project through issues, discussions and PRs.

And 1 & 2 can be done in commercial and/or closed source software with other licenses.

Set custom user-agent

Name it stopword-sami. nrk-sapmi-crawler can be used by others to achieve different things.
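
A minimal sketch of how a custom user-agent could be attached to requests, assuming the requests go through the standard fetch API (available in Node.js 18+). The helper name buildRequestOptions, the "name/version (+url)" format and the placeholder version are illustrative assumptions. How nrk-sapmi-crawler actually accepts a user-agent is not shown here.

```javascript
// Hypothetical helper: builds fetch options carrying a custom User-Agent.
// The name and the "name/version (+url)" format are assumptions for illustration.
function buildRequestOptions (name, version, infoUrl) {
  return {
    headers: {
      'User-Agent': `${name}/${version} (+${infoUrl})`
    }
  }
}

const options = buildRequestOptions(
  'stopword-sami',
  '0.0.0', // placeholder version, not the real package version
  'https://github.com/eklem/stopword-sami'
)

// Usage with the standard fetch API (not executed here):
// const response = await fetch(someUrl, options)
console.log(options.headers['User-Agent'])
```

Keeping the user-agent as a parameter rather than hard-coding it fits the point above: other users of the crawler can identify themselves with their own name.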

stopword-trainer counting a bit off?

The word kulturhistorisk is found three times in the document corpus: once in one document and twice in another.

But in the calculation file it's only listed as found once over the whole corpus:
https://github.com/eklem/stopword-sami/blob/trunk/stopwords/stopword-sma-calculation.json#L11127-L11133

This is possibly an upstream error in stopword-trainer. Quick fix would maybe be to delete the calculation file and do it from scratch?
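
One way to see which kind of count a number could be is to compute both the total occurrences and the number of documents containing the word on a tiny corpus. The sketch below is independent of stopword-trainer and does not reproduce its internals. The function name countWords and the two documents are made-up placeholders that mirror the case above (once in one document, twice in another).

```javascript
// Count, for every word, the total occurrences across the corpus
// and the number of documents that contain it at least once.
function countWords (docs) {
  const counts = new Map()
  for (const doc of docs) {
    const tokens = doc.toLowerCase().split(/\s+/).filter(Boolean)
    const seenInDoc = new Set()
    for (const token of tokens) {
      const entry = counts.get(token) || { total: 0, docs: 0 }
      entry.total += 1
      if (!seenInDoc.has(token)) {
        entry.docs += 1
        seenInDoc.add(token)
      }
      counts.set(token, entry)
    }
  }
  return counts
}

// Placeholder documents, not real corpus text.
const corpus = [
  'alpha kulturhistorisk beta',
  'kulturhistorisk gamma kulturhistorisk'
]

const result = countWords(corpus).get('kulturhistorisk')
console.log(result) // { total: 3, docs: 2 }
```

Comparing both numbers with the value in the calculation file can help narrow down where a discrepancy comes from before starting over from scratch.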

People to talk to for all questions

  • How much overlap is there between the different Sami languages? What do they consist of?
  • Are the differences big enough for each language to have its own stopword list?
  • What are the three Sami languages at https://www.nrk.no/sapmi?
  • Sources of text to generate stopwords list from?
  • Other people I should talk to?

Skip images in data-sets

Too much noise, because it's hard to identify the correct photo, or even whether a photo is connected at all.

Dependencies

  • stopword-trainer
  • nrk-sapmi-crawler

Crawl and re-train stopword list

setup for red-lists

  • Separate files for each red-list, as part of the datasets folder
  • Read each red-list and feed it to the swt.getStopwords function
  • Slice the list at e.g. 500 words to begin with. Then you'll know whether it's worth pushing further with more red-listed words to "get to" stopwords later in the list/array.

The final list should probably/hopefully contain 200 - 300 words.
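
The red-list and cutoff steps could look roughly like the sketch below. It is written as a plain function and does not call swt.getStopwords, since that function's signature is not shown here. The name applyRedlist and the sample words are hypothetical placeholders.

```javascript
// Remove red-listed words from a frequency-sorted candidate list,
// then keep only the first `cutoff` entries.
function applyRedlist (candidates, redlist, cutoff) {
  const blocked = new Set(redlist.map((word) => word.toLowerCase()))
  return candidates
    .filter((word) => !blocked.has(word.toLowerCase()))
    .slice(0, cutoff)
}

// Placeholder data: candidates sorted so the most likely stopwords come first.
const candidates = ['w1', 'w2', 'keepme', 'w3', 'w4', 'w5']
const redlist = ['keepme'] // a wanted word that sneaked into the list

const stopwords = applyRedlist(candidates, redlist, 4)
console.log(stopwords) // [ 'w1', 'w2', 'w3', 'w4' ]
```

Filtering before slicing means the slot freed by a red-listed word is taken by the next candidate, which is how adding red-listed words lets you "get to" stopwords further down the list.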

NDC Oslo lightning talk: Sami stopwords - How far have we gotten and why does it matter?

What I've sent in:

[...]

  • What are stopwords
  • What does the work with stopword-sami consist of
  • NRK as a text source and how to improve the stopword lists over time. Manual work, redlists and content crawling.
  • Solutions a stopword list can help you create: search engines, chatbots, plagiarism detection, sentiment analysis and other machine learning solutions.
  • Demo North Sami stopword list to show what simple linguistic understanding can do
    https://eklem.github.io/daq-proc/demo/document-processing/ (At least a beta version of North Sami stopword list will be added by August).
  • Show how far the lists for Lule Sami and South Sami have gotten and what's left to do.
    https://github.com/eklem/stopword-sami
    https://github.com/eklem/nrk-sapmi-crawler

[...]

Show what you get without stopwords:

With stopwords:

  • Smaller indexes (faster search and less storage need)
    • amount of words in general
    • typeahead (ngrams)
  • automatically generated keywords
  • Less noise in search result, more relevant hits.
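
A rough illustration of the index-size effect, assuming a simple inverted index that maps each token to the documents it occurs in. The documents and the stopword list are English placeholders, not Sami data, and buildIndex is a hypothetical helper.

```javascript
// Build a minimal inverted index: token -> set of document ids.
// Tokens found in `stopwords` are skipped and never reach the index.
function buildIndex (docs, stopwords = []) {
  const skip = new Set(stopwords)
  const index = new Map()
  docs.forEach((text, id) => {
    for (const token of text.toLowerCase().split(/\s+/).filter(Boolean)) {
      if (skip.has(token)) continue
      if (!index.has(token)) index.set(token, new Set())
      index.get(token).add(id)
    }
  })
  return index
}

// Placeholder documents and stopwords (English, for readability only).
const docs = [
  'the reindeer and the herder',
  'the herder is on the mountain',
  'a reindeer is on the mountain'
]
const stopwords = ['the', 'and', 'is', 'on', 'a']

const full = buildIndex(docs)
const reduced = buildIndex(docs, stopwords)
console.log(full.size, reduced.size) // 8 3
```

The same filtering applies before generating ngrams for typeahead, so the saving there grows with the number of ngrams produced per token.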

Tell more about:

  • Why a corpus of short texts is difficult to generate stopwords from (show the analysis of Norwegian based on Wikipedia.org/no)
  • Countering this with redlists
  • For search, the quality at the bottom doesn't need to be great as long as the volume has good quality, because that's what will be shown.

Illustrations to draw:

  • A lot of words/tokens
  • Quite some of those words being stopwords
  • Stopwords don't have a precise definition. It depends. A frequency sorted long array is then good. Cut off to fit your needs.
  • Smaller indexes, less noise at the bottom of a search result
  • Keyword generation/calculation doesn't need to be perfect, since it's volume that bubbles up.
  • Sami flag

Do calculation again

Do the calculation from scratch. Check if the errors in stopwordiness go away. It seems "O" and "D" should not be stopwords for Northern Sami.

Northern Sámi

Here's the list of all pages.

The Wikipedia site has over 7700 articles, so it could be enough to generate a good list of stopwords. It seems a lot of the articles are very short, so we'll see how it goes.

Could also be a nice start for a Northern Sami search engine?

set up project - apply for money

Figure out what's needed

  • Creating crawler
  • Setting up recurring crawls and processing
  • Better stopword-trainer
  • Separate stopword-trainer CLI
  • understanding of the different Sami languages
  • Project plan
  • Application for financing
  • Someone that can verify lists and identify words to add to red-list.

Bigger picture

  • First working version of NowSearch.xyz
  • The result of the work will be open sourced under the MIT License.

Quality

Northern Sami will reach a good level first. It has around 4000 articles a year. Southern Sami has 350 articles from mid 2017 until late 2021, and Lule Sami has 500 articles from mid 2017.

Add logos

Add the Sami Parliament logo to the repository

Verify validity of stopword lists

I need someone with a good understanding of Northern Sami and the other Sami languages to check the results generated by stopword-trainer.
