If we can significantly boost the training speed, it doesn't matter that much. scikit-learn's MultinomialNB uses too much memory while training: it exhausts all the memory on my MacBook Pro, so I had to run the training in several partial passes. Maybe we can improve this by cherry-picking the words used for training to reduce the number of features.
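To make the "cherry-pick words" idea concrete, here is a minimal sketch of pruning the vocabulary before training. It is not what detox_bot does today. The function names (`tokenize`, `build_vocabulary`), the tokenizer regex, and the `max_features` / `min_doc_count` thresholds are all assumptions for illustration and would need tuning against the real chat data.

```python
import re
from collections import Counter

# Hypothetical tokenizer: lowercase runs of letters, digits, and apostrophes.
TOKEN_RE = re.compile(r"[a-z0-9']+")


def tokenize(text):
    """Split a chat message into lowercase word tokens."""
    return TOKEN_RE.findall(text.lower())


def build_vocabulary(messages, max_features=5000, min_doc_count=2,
                     stopwords=frozenset()):
    """Return {word: column_index} for the most frequent non-stopword words.

    Words seen in fewer than `min_doc_count` messages are dropped, and only
    the `max_features` most common of the rest are kept, so the feature
    matrix has at most `max_features` columns.
    """
    doc_counts = Counter()
    for message in messages:
        # Count each word once per message (document frequency).
        doc_counts.update(set(tokenize(message)) - stopwords)

    kept = [(word, count) for word, count in doc_counts.items()
            if count >= min_doc_count]
    # Most frequent first; ties broken alphabetically so the result is stable.
    kept.sort(key=lambda pair: (-pair[1], pair[0]))

    top_words = sorted(word for word, _ in kept[:max_features])
    return {word: index for index, word in enumerate(top_words)}


if __name__ == "__main__":
    sample = ["gg wp", "you are trash", "trash team gg", "nice play"]
    print(build_vocabulary(sample, max_features=10, min_doc_count=2,
                           stopwords=frozenset({"you", "are"})))
```

The resulting dict could be passed as the `vocabulary=` argument of scikit-learn's `CountVectorizer`, which accepts a mapping from terms to column indices, so the matrix fed to `MultinomialNB` stays narrow. Whether this actually cuts memory use enough, and how much accuracy it costs, would have to be measured.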
The current TwitchBot is console-only: you give it a Twitch channel name and run it. It would be great to turn it into a web app or service that can easily be deployed against multiple Twitch channels, grab their chat, and show which messages are toxic on a web page instead of in the console.
Research whether, and how, the current detox_bot can run in a Kubernetes cluster so that it can handle more incoming requests than a single instance can. ***Conrad is researching this***
Need your opinions and a list of the things you've done so we can add those items to the progress report.
So far, what I can remember is:
Implemented a working prototype (by copying the existing detox_bot ...)
Investigated IaaS/PaaS options for running the service on k8s (need more detail from @harley3, if you have documentation of the investigation).
Made progress: the prototype app now runs on a Google Cloud Platform k8s cluster (which means it's Dockerized as well).
Investigated whether Apache Spark MLlib or TensorFlow is good enough, and scalable enough, to replace the current scikit-learn implementation (need more detail from @kevinmackie and @noya on this).
Depending on which ML library we go with (Spark MLlib or TensorFlow?), the stopwords need to be regenerated and/or migrated, and more stopwords should be added.