
highscore's Introduction

Hey, I am Dominik 🧑‍💻


I live in Germany 🏫, started working as a Full Stack Developer in 2003 👴 and I am currently working as a Senior Data Engineer at YAZIO 👨‍💻.

  • 🔭 I'm currently working on grooveguessr
  • 💻 mainly working with Kotlin, Spring Boot and Python in my day job now
  • 🦀 love Rust
  • I have a keen interest in aviation 🛫, football 🏈, and 3D printing


highscore's People

Contributors

bobjflong, carlosramireziii, domnikl, mhfs, rgo, tim92


highscore's Issues

detect the language of a text

I think language detection could be solved easily with a list of the words each language uses most, e.g. English: "for with that this", German: "der die das ..." and so on. It could be very hard for short texts, though.
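A minimal sketch of the word-list idea, in plain Ruby. The `COMMON_WORDS` table and the `guess_language` method are hypothetical names made up for illustration and are not part of highscore; real lists would need to be much longer.

```ruby
# Hypothetical, deliberately tiny lists of frequent words per language.
COMMON_WORDS = {
  english: %w[the for with that this and is of],
  german:  %w[der die das und ist mit nicht ein]
}.freeze

# Returns the language whose common words occur most often in the text,
# or nil when none of the listed words occurs (typical for very short texts).
def guess_language(text)
  tokens = text.downcase.scan(/\p{Word}+/)
  scores = COMMON_WORDS.transform_values do |words|
    tokens.count { |token| words.include?(token) }
  end
  language, hits = scores.max_by { |_, count| count }
  hits.zero? ? nil : language
end

puts guess_language('This is the list of words that goes with the text').inspect
puts guess_language('Das ist der Hund und die Katze').inspect
puts guess_language('Rust').inspect
```

The last call illustrates the short-text problem: a single word that is in no list gives no evidence either way, so the sketch returns nil instead of guessing.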

add a whitelist

The whitelist should only be considered if one was given. The blacklist should not be used when a whitelist was given (using both makes no sense).
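The precedence described here could look roughly like the following. `allowed_word?` is a hypothetical helper written for this sketch, not highscore's actual implementation, and "given" is assumed to mean "not nil".

```ruby
# Hypothetical helper: decides whether a word may become a keyword.
# A whitelist, when given, is the only list consulted; the blacklist
# is used only when no whitelist was passed.
def allowed_word?(word, whitelist: nil, blacklist: nil)
  return whitelist.include?(word) if whitelist
  return !blacklist.include?(word) if blacklist

  true # no list given: every word is allowed
end

puts allowed_word?('ruby', whitelist: %w[ruby], blacklist: %w[ruby]) # whitelist wins
puts allowed_word?('the', blacklist: %w[the and])
puts allowed_word?('anything')
```

Under this assumption an empty whitelist still counts as given and rejects every word, which may or may not be the behaviour wanted.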

Stemming not working?

Can't seem to get stemming to work. Any ideas?

2.0.0-p648 :065 > x = Highscore::Content.new('fishes dog fishing boat', Highscore::Whitelist.load('fish'))
#<Highscore::Content:0x007fb2beea6e60 @content="fishes dog fishing boat", @blacklist=nil, @whitelist=#<Highscore::Whitelist:0x007fb2beea7158 @words=["fishing"], @bloom_filter=#<BloomFilter::Native:0x007fb2beea7040 @opts={:size=>40, :hashes=>28, :seed=>1466442694, :bucket=>4, :raise=>true}, @bf=#<CBloomFilter:0x007fb2beea6f00 @hash_value={}>>>, @language_wordlists={}, @emphasis={:multiplier=>1.0, :upper_case=>3.0, :long_words=>2.0, :short_words_threshold=>2, :bonus_multiplier=>3.0, :bonus_list=>nil, :long_words_threshold=>15, :vowels=>0, :consonants=>0, :ignore_short_words=>true, :ignore_case=>false, :ignore=>nil, :word_pattern=>/\p{Word}+/, :stemming=>false}>

2.0.0-p648 :065 > x.configure { set :stemming, true }

2.0.0-p648 :065 > x.keywords
#<Highscore::Keywords:0x007fb2bee451b0 @keywords={}>

Stemming?

Any thoughts on stemming the keywords (or making it an option)?

Looks like gems like fast-stemming would make this pretty easy to do.

Use a bonuslist and a blacklist?

Is it possible to use both a bonuslist and a blacklist? Essentially, I want to be able to exclude words while giving others a boost.

random seed error

With Rails 4, Ruby 2 and the development environment, highscore 1.2.0 throws a "random seed error". It works in about ten percent of my tries, and it works entirely well in production. In my development environment I have to change the highscore version to 1.1.0; then no error occurs.

ability to preseed with previously used keywords

Just like black- and whitelists, there should be a method to preseed the process with a list of keywords that have been used in the past. Those should then get ranked higher than anything else.
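One way to read "ranked higher than anything else" is a two-level sort: preseeded words first, then everything else, each group ordered by weight. The sketch below assumes the weights are already available as a plain Hash; `rank_with_preseed` is a hypothetical name, not an existing highscore method.

```ruby
# Hypothetical ranking helper: words found in the preseed list always come
# before all other words; within each group, higher weights come first.
def rank_with_preseed(weights, preseed)
  weights
    .sort_by { |word, weight| [preseed.include?(word) ? 0 : 1, -weight] }
    .map(&:first)
end

weights = { 'ruby' => 5.0, 'gem' => 2.0, 'keyword' => 1.0 }
puts rank_with_preseed(weights, ['keyword']).inspect
```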

encapsulate Keywords in their own class

Keywords are currently handled as Hashes and Arrays (ranked keywords are Arrays because Hashes are not sortable). They should be moved into their own classes that handle sorting and the like.
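A rough sketch of what such classes might look like. `Keyword` and `KeywordList` are illustrative top-level names chosen for this example; they are not the classes that ship with the gem.

```ruby
# Hypothetical value object for a single keyword and its weight.
Keyword = Struct.new(:text, :weight)

# Hypothetical collection that hides the Hash storage and owns the sorting.
class KeywordList
  include Enumerable

  def initialize
    @keywords = {}
  end

  # Adds or overwrites a keyword; returns self so calls can be chained.
  def add(text, weight)
    @keywords[text] = Keyword.new(text, weight)
    self
  end

  def each(&block)
    @keywords.each_value(&block)
  end

  # Keywords ordered by descending weight.
  def rank
    sort_by { |keyword| -keyword.weight }
  end

  # The n highest-ranked keywords.
  def top(n)
    rank.first(n)
  end
end

list = KeywordList.new.add('ruby', 3.0).add('gem', 1.0).add('score', 2.0)
puts list.top(2).map(&:text).inspect
```

Including Enumerable means callers get `map`, `select`, `count` and friends without ever touching the underlying Hash.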

rank blacklisted words

Just like the actual list of ranked keywords, there should be methods to get a (ranked) list of (explicitly) ignored keywords.

allow for multiple blacklist/whitelist adapters to be used

Current implementation only works with files, but should also work with

  • relational databases
  • simple key-value stores like Redis
  • Memcache, etc.

The best possible implementation would be an adapter infrastructure that reads blacklisted items lazily (?).
Language shouldn't be ignored either!
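A possible shape for such an adapter infrastructure, sketched in plain Ruby. All class names (`ArrayAdapter`, `FileAdapter`, `LazyWordlist`) and the `each_word` contract are assumptions made for this example. A Redis or database adapter would only need to implement `each_word` as well. Per-language handling is left out here.

```ruby
require 'set'

# Hypothetical adapter contract: respond to #each_word, yielding one word
# at a time (or returning an Enumerator when called without a block).
class ArrayAdapter
  def initialize(words)
    @words = words
  end

  def each_word
    return enum_for(:each_word) unless block_given?

    @words.each { |word| yield word }
  end
end

class FileAdapter
  def initialize(path)
    @path = path
  end

  def each_word
    return enum_for(:each_word) unless block_given?

    File.foreach(@path) { |line| line.split.each { |word| yield word } }
  end
end

# Wordlist that does not touch its adapter until the first lookup.
class LazyWordlist
  def initialize(adapter)
    @adapter = adapter
    @words = nil
  end

  def include?(word)
    words.include?(word.downcase)
  end

  def loaded?
    !@words.nil?
  end

  private

  def words
    @words ||= Set.new(@adapter.each_word.map(&:downcase))
  end
end

blacklist = LazyWordlist.new(ArrayAdapter.new(%w[The And Or]))
puts blacklist.loaded?          # nothing has been read yet
puts blacklist.include?('and')  # first lookup triggers the load
puts blacklist.loaded?
```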

allow multiple blacklists (per language)

  • refactor API: add an add_wordlist method to Highscore::Content for adding a blacklist; parameters: Blacklist object and language
  • add CLI options for every possible language

Bloomfilter bucket gets filled up depending on size of list and raises exception

Hi there,

Using the words here as a blacklist, I get the following error when trying to initialize it.

Steps to reproduce

require 'yaml'
require 'highscore'

words = YAML.load_file 'stopwords.yml'
Highscore::Blacklist.load(words)
RuntimeError: bucket got filled up
    ./bloomfilter-rb-2.1.1/lib/bloomfilter/native.rb:24:in `insert'
    ./bloomfilter-rb-2.1.1/lib/bloomfilter/native.rb:24:in `insert'
    ./highscore-1.2.0/lib/highscore/wordlist.rb:115:in `block in init_bloom_filter'
    ./highscore-1.2.0/lib/highscore/wordlist.rb:47:in `block in each'
    ./highscore-1.2.0/lib/highscore/wordlist.rb:47:in `each'
    ./highscore-1.2.0/lib/highscore/wordlist.rb:47:in `each'
    ./highscore-1.2.0/lib/highscore/wordlist.rb:115:in `init_bloom_filter'
    ./highscore-1.2.0/lib/highscore/wordlist.rb:41:in `initialize'

Here are the params used for the bloomfilter creation when that list is supplied.

{ :size => 22800, :bucket => 4, :raise => true, :hashes => 28 }
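For comparison, the textbook sizing formulas for a Bloom filter with n items and a target false-positive rate p are m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions. The sketch below only evaluates those formulas; `bloom_parameters` is a hypothetical helper, and whether highscore derives its `:size` and `:hashes` this way is not stated in the report. The error itself presumably comes from the counting filter's 4-bit buckets (`:bucket => 4`), which can count only up to 15, so many hashes per word on a comparatively small filter make an overflow more likely.

```ruby
# Hypothetical helper: textbook Bloom filter sizing for item_count items
# and a desired false-positive rate between 0 and 1.
def bloom_parameters(item_count, false_positive_rate)
  ln2 = Math.log(2)
  bits = (-item_count * Math.log(false_positive_rate) / (ln2**2)).ceil
  hashes = [(bits.to_f / item_count * ln2).round, 1].max
  { size: bits, hashes: hashes }
end

puts bloom_parameters(1_000, 0.01).inspect
```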

lambda method for blacklisting

In addition to the normal blacklist functionality, there should be a method that accepts a block or lambda, which evaluates whether to ignore the word or not.

Use case: all words that contain more than 2 digits should be ignored:

c = Highscore::Content.new(url)
c.configure do
  set :ignore, lambda { |w| w.gsub(/[^0-9]/, '').length > 2 }
end
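To see what the predicate above does on its own, it can be tried outside of highscore. The word list here is made up for illustration.

```ruby
# Same predicate as above: strip everything that is not a digit,
# then ignore the word if more than two digits remain.
ignore = lambda { |w| w.gsub(/[^0-9]/, '').length > 2 }

words = %w[ruby 2013 v1 a1b2c3 r2d2]
kept = words.reject(&ignore)
puts kept.inspect
```

Note that the digits do not have to be adjacent: "a1b2c3" is ignored because it contains three digits in total, while "r2d2" is kept with two.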
