
dvfeinblum / lexicount


A (soon-to-be) NLP tool for seeing how obnoxious a writer you are.

License: MIT License

Python 80.92% Dockerfile 0.57% PLpgSQL 16.71% Shell 1.80%
apache-airflow nlp nltk psql sqitch

lexicount's Introduction

Hello There ✌️

I'm Dave (they/them) and this is my swanky GitHub profile. I guess that's actually redundant.

About Me

  • I used to be a theoretical chemist
    • Now I'm not (though if you wanna see what I used to do click here)
  • I like data, especially when it's big
  • I think security is neat and important

Non-Code Crap I Do

  • Play hockey
  • Make espresso
  • Think about plants
  • Look at plants
  • Read books about plants
  • Go hiking

lexicount's People

Contributors

dvfeinblum

lexicount's Issues

Switch from Redis to Postgres?

Still waffling back and forth a bit on this. It'd be nice to use Postgres instead because, at some point, I may want to be able to do more than just count stuff. Plus, Redis isn't exactly amenable to doing the sort of counting I may want to do.

Read up on Word2Vec

Vectorizing sentences is probably how we'll start digging into this stuff, and Word2Vec is a really easy way of doing that.

Rework the Processing into a DAG

Is your feature request related to a problem? Please describe.
Async is cool, but multiprocessing is better. Airflow is dope, so let's implement it.

Describe the solution you'd like
Gut all of the async stuff (should be as easy as reverting to pre-async code), install Airflow, and create some tasks.

Describe alternatives you've considered
I originally figured I'd rather improve my luiginess than use this as an opportunity to learn Airflow, but I changed my mind.

Additional context
One of the reasons I changed my mind is that Airflow has more features than Luigi. Also, we're porting our pipeline over to Airflow at work, so this is like, studying or something.

What's the deal with spaCy?

spaCy is a Python package chock full of nifty NLP stuff. We might be able to replace NLTK with it, and we should be able to start some genuine data sciency stuff with it.

Adding Autocommit Made this Really Slow

Describe the bug
In Pull #17, I fixed some issues with the parser that caused data to be lost when cursors were closed early. Unfortunately, I did this by enabling autocommit on the connection session, and that has caused a significant performance decline.

To Reproduce
Steps to reproduce the behavior:

  1. Run the parser.

Expected behavior
Old runs used to take around a minute. That time has doubled now.
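
For anyone trying to picture why autocommit hurts, here is a minimal sketch. It is not the parser's code: it uses the standard library's `sqlite3` as a stand-in for Postgres, and the table `word_counts` and every function name are made up for illustration. The only point is the contrast between one commit per row and one commit per batch.

```python
import sqlite3


def make_conn():
    # In-memory database with a hypothetical table, standing in for Postgres.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE word_counts (word TEXT)")
    return conn


def insert_commit_per_row(conn, words):
    # Mimics an autocommit session: every insert is its own transaction.
    for word in words:
        conn.execute("INSERT INTO word_counts (word) VALUES (?)", (word,))
        conn.commit()


def insert_single_transaction(conn, words):
    # One transaction for the whole batch; the connection context manager
    # commits on success and rolls back if an exception is raised.
    with conn:
        conn.executemany(
            "INSERT INTO word_counts (word) VALUES (?)",
            [(word,) for word in words],
        )


def row_count(conn):
    return conn.execute("SELECT COUNT(*) FROM word_counts").fetchone()[0]
```

Both functions store the same rows; the second just pays the commit cost once. Whether batching like this would also avoid the lost-data problem that Pull #17 fixed is an open question that depends on where the parser closes its cursors.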

Fix Up Blogpost Sanitizing

Right now, the constellation of replaces and strips doesn't actually work correctly. There are also hyphenated words that get messed up by the translator I'm currently using.
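
One possible direction, sketched under assumptions: instead of chaining replaces, strips, and a translation table, pull words out with a single regular expression that treats an internal hyphen or apostrophe as part of the word. The names `WORD_RE` and `tokenize` are hypothetical, and the pattern only covers ASCII letters, so it is a starting point and not a drop-in fix.

```python
import re

# A word is a run of letters, optionally joined by single internal
# hyphens or apostrophes (e.g. "well-known", "writer's").
WORD_RE = re.compile(r"[A-Za-z]+(?:[-'][A-Za-z]+)*")


def tokenize(text):
    # Return lowercase words; punctuation and stray dashes are dropped.
    return [word.lower() for word in WORD_RE.findall(text)]
```

Because the hyphen only counts when it sits between letters, a dash used as punctuation is discarded while "well-known" survives as one token.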

Create a Runmode that Splits Sentences Instead of Words

Is your feature request related to a problem? Please describe.
The meta-purpose of this project is to learn some NLP. Word2Vec is a really nice low-bar-of-entry way of doing that, and vectors for sentences would be a nice place to start.

Describe the solution you'd like
Currently, the blog parser sanitizes posts by removing punctuation and then NLTKing the words in the post. We should do something similar but, instead of splitting on spaces, we should split on periods.

Describe alternatives you've considered
N/A

Additional context
N/A
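
A rough sketch of what the sentence runmode could look like, assuming plain-text posts. The function and pattern names are invented for illustration. It splits on sentence-ending punctuation followed by whitespace (slightly broader than periods alone) and then strips punctuation within each sentence.

```python
import re

# Split after ".", "!" or "?" when followed by whitespace.
SENTENCE_END_RE = re.compile(r"(?<=[.!?])\s+")
# Same idea as word mode: letters, with internal hyphens/apostrophes kept.
WORD_RE = re.compile(r"[A-Za-z]+(?:[-'][A-Za-z]+)*")


def split_sentences(post):
    # Return a list of sentences, each a list of lowercase words.
    sentences = []
    for raw in SENTENCE_END_RE.split(post.strip()):
        words = WORD_RE.findall(raw.lower())
        if words:  # skip fragments that contained only punctuation
            sentences.append(words)
    return sentences
```

A naive split like this will trip over abbreviations such as "e.g." or "Dr."; NLTK's `sent_tokenize` handles those cases better and may be the more sensible choice since NLTK is already in use here.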
