reach's Introduction

Hi there 👋

I'm Stéphan Tulkens! I'm a computational linguistics/AI person. I am currently working as a machine learning engineer at Ecosia.

I got my PhD at CLiPS at the University of Antwerp under the watchful eyes of Walter Daelemans (Computational Linguistics) and Dominiek Sandra (Psycholinguistics). The topic of my PhD was the way people process orthography during reading. You can find a copy here. Before that I studied computational linguistics (MA), philosophy (BA) and software engineering (BA).

My goal is always to make things as fast and small as possible. I like it when simple models work well, and I love it when simple models get close in accuracy to big models. I do not believe absolute accuracy is a metric to be chased, and I think we should always be mindful of what a model computes or learns from the data.

I’m currently working on 🏃‍♂️:

  • reach: a library for loading and working with word embeddings.
  • piecelearn: a library that trains a subword tokenizer and embeddings on the same corpus, giving you open vocabulary embeddings.
  • unitoken: a library for easy pre-tokenization.
  • hashing_split: a library for hash-based data splits (stable splits!)

Other stuff I made (most of it from my PhD):

  • wordkit: a library for working with orthography.
  • old20: calculate the orthographic Levenshtein distance 20 (OLD20) metric.
  • metameric: fast interactive activation networks in NumPy.
  • humumls: load the UMLS database into a MongoDB instance. Fast!
  • dutchembeddings: word embeddings for Dutch (back when this was a cool thing to do).

My research interests 🤖:

  • Tokenizers, specifically subword tokenizers.
  • Embeddings, specifically static embeddings (so old-fashioned! 💀), and how to combine these in meaningful ways.
  • String similarity, and how to compute it without using dynamic programming.

reach's People

Contributors

cmry, dependabot-preview[bot], dependabot[bot], stephantul

reach's Issues

Missing dependency: TQDM

This basic usage yields an error:

from reach import Reach

r = Reach.load("path/to/vector")

I think this package needs to either add tqdm as a dependency or make it truly optional by guarding the import.
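A minimal sketch of the second option. It assumes nothing about how reach currently imports tqdm; `read_lines` is a hypothetical stand-in for a loader loop, not a function from the library:

```python
# Sketch: treat tqdm as an optional dependency.
try:
    from tqdm import tqdm
except ImportError:
    def tqdm(iterable, *args, **kwargs):
        """Fallback used when tqdm is missing: return the iterable unchanged."""
        return iterable


def read_lines(lines):
    """Hypothetical loader loop: strip each line, with a progress bar if tqdm is installed."""
    return [line.strip() for line in tqdm(lines)]
```

With this guard the loop behaves the same whether or not tqdm is installed; only the progress bar disappears.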

Fix bug: adding <UNK> resets dtype

Loading a .vec file with a given dtype (e.g., float32) using an OOV UNK token silently resets the dtype to the system default (usually float64).
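The effect can be reproduced with plain NumPy. This is a sketch of one plausible mechanism, an assumption rather than the library's actual code: stacking a default-dtype row onto a float32 matrix upcasts the result, while creating the extra row with the matrix's own dtype preserves it.

```python
import numpy as np

# Stand-in for vectors loaded with an explicit dtype.
vectors = np.ones((3, 4), dtype=np.float32)

# An UNK row created without a dtype uses NumPy's default float type (float64).
unk_default = np.zeros((1, vectors.shape[1]))
upcast = np.vstack([vectors, unk_default])

# Creating the UNK row with the matrix's dtype keeps the result in float32.
unk_fixed = np.zeros((1, vectors.shape[1]), dtype=vectors.dtype)
preserved = np.vstack([vectors, unk_fixed])

print(upcast.dtype, preserved.dtype)
```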

Small fix

OOV

What is meant by "[The] approaches know how to deal with OOV words"?
Is the deal to delete them?

Bump version

Nothing deprecated, so let's bump a minor version

Order of indices

I could look it up myself, but since I am lazy:
If I load the vectors passing wordlist, is it guaranteed that r.vectors entries will be in the same order?
Basically, what I need is the embedding matrix and a vector of words (strings) in correspondence with the words in the matrix (without having to sort again).
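Whatever the loader guarantees, the correspondence can be made explicit. A minimal sketch, assuming the word-to-row mapping is available as a plain dict; the names `items` and `vectors` are hypothetical here, not taken from reach:

```python
# Hypothetical stand-ins: `items` maps each word to its row index in `vectors`.
items = {"cat": 2, "dog": 0, "fish": 1}
vectors = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

# Sort the words by row index so that words[i] labels vectors[i].
words = sorted(items, key=items.get)

# Looking up a single word still goes through the mapping.
row_for_cat = vectors[items["cat"]]
```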

Also, the package in pip seems to be outdated :-p

Move from string to hashable

We currently type everything as if all items are strings, but actually they can be any hashable. The typing and docs should be updated to reflect this.
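A sketch of what the looser typing could look like. `build_index` is a hypothetical helper used only for illustration, not an existing function in the package:

```python
from typing import Dict, Hashable, Iterable


def build_index(items: Iterable[Hashable]) -> Dict[Hashable, int]:
    """Map each item to its position; any hashable works, not only str."""
    index: Dict[Hashable, int] = {}
    for position, item in enumerate(items):
        if item in index:
            raise ValueError(f"duplicate item: {item!r}")
        index[item] = position
    return index


# Strings, integers and tuples can all serve as items.
mixed = build_index(["cat", 42, ("new", "york")])
```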

Problem loading in IPython on Windows 10 and Ubuntu 20.04

Hi Stephan, thanks for making this package!

I've been having trouble importing it in Jupyter notebooks on Windows 10 and Ubuntu 20.04: from reach import Reach gives an import error (unknown location). import reach works, but calling reach.Reach.load() does not. Yet the package is visible in site-packages on both machines, under the correct name:

  • I git cloned this package into my Anaconda site-packages on both Windows 10 and Ubuntu.
  • I checked the Python versions in IPython and PowerShell; they are the same on both Windows 10 and Ubuntu.
  • I also checked the PATH variable on both, with the same result on both Windows 10 and Ubuntu.
  • I restarted the PCs, but it made no difference.

I don't think this issue is related to the package versions on my machines (which probably also don't match the package requirements), because the Python interpreter doesn't know where to find Reach.

Do you know of any way I could fix this locally?
Thanks anyways!

Best,
Lisa

edit: I can't import it via the python interpreter in the command line either.
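One possible explanation (an assumption, not confirmed in this thread): cloning the repository directly into site-packages can put the real package one level deeper, so import reach picks up the outer folder as an empty namespace package and Reach is not found inside it. A small diagnostic sketch using only the standard library; `describe_module` is a hypothetical helper:

```python
import importlib.util


def describe_module(name):
    """Report where Python would import `name` from, without importing it."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return {"found": False, "origin": None, "namespace": False, "locations": []}
    locations = list(spec.submodule_search_locations or [])
    # A namespace package has search locations but no __init__.py origin.
    is_namespace = spec.origin is None and bool(locations)
    return {
        "found": True,
        "origin": spec.origin,
        "namespace": is_namespace,
        "locations": locations,
    }


print(describe_module("reach"))
```

If the output reports a namespace package whose location is the cloned repository folder, installing the package with pip instead of cloning it into site-packages would be the usual remedy.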

Switch to ruff

Switch from flake8 to ruff, update some pre-commit hooks, remove setup.cfg where possible
