m-tari / arxiv_interface

Viresa: an AI-powered virtual assistant for scientists

Jupyter Notebook 93.62% Python 6.37% Shell 0.01%

ai arxiv classification deep-learning eda machine-learning natural-language-processing nlp recommender-system summarization transformers

arxiv_interface's Introduction

👋 Hi, I’m @m-tari
👀 I’m interested in computational science, machine learning, and data science.

arxiv_interface's Issues

Code Review - Round 2

This is great work so far! Especially the explanations you've added in the notebooks are very helpful. The semantic search in the app seems to be working very well, and I think we can soon finalize other functionalities of the app too (tagging and title generation).

Here are a few comments:

You can remove import os and from . import config_set since you don't seem to be using them.

arxiv_interface/src/semantic_search.py

Lines 4 to 8 in c8e086a

import os, io
import s3fs
import torch
# custom libraries
from . import config_set

It looks like config_set is not used in the app script either. So it can be removed.

arxiv_interface/streamlit_app.py

Line 3 in c8e086a

from src import config_set, semantic_search

It seems like the paper title is always set to 'title' and you're not using the actual title for the semantic search.

arxiv_interface/streamlit_app.py

Lines 15 to 17 in c8e086a

def suggest_articles(title, input_abstract):
    st.session_state.articles = semantic_search.search_papers('title', input_abstract)
    return st.session_state.articles
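
One way to address the comment above is to forward the title argument itself rather than the string literal. This is only a sketch under assumptions: search_papers below is a hypothetical stand-in for semantic_search.search_papers (its real behaviour is not shown in this review), a plain dict stands in for st.session_state so the snippet runs without Streamlit, and the title string is an invented example.

```python
# Hypothetical stand-in for semantic_search.search_papers; the real function
# lives in src/semantic_search.py and its internals are not shown here.
def search_papers(title, abstract):
    return [{"query_title": title, "query_abstract": abstract}]


# Plain dict used in place of st.session_state so the sketch runs standalone.
session_state = {}


def suggest_articles(title, input_abstract):
    # Pass the variable `title`, not the string literal 'title'.
    session_state["articles"] = search_papers(title, input_abstract)
    return session_state["articles"]


# Invented example inputs, for illustration only.
articles = suggest_articles(
    "Example title",
    "We derive a new fully implicit formulation for the ...",
)
print(articles[0]["query_title"])
```

In the app itself the only change would be replacing 'title' with title in the existing call.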

Since you already have the notebooks for tagging the paper (I guess this one's not 100% complete?) and generating titles from the abstract, I think the next step would be to add relevant scripts in src/ for the two functions below to use and return actual results.

arxiv_interface/streamlit_app.py

Lines 6 to 12 in c8e086a

def get_category(txt):
    st.session_state.category = 'Computer Science'
    return st.session_state.category

def suggest_title(txt):
    st.session_state.title = "A thought-provoking title"
    return st.session_state.title

I think you need to increase the max_chars limit. Most abstracts are longer than 850 characters, and in that case the interface doesn't let you paste the abstract into the text box.

arxiv_interface/streamlit_app.py

Lines 51 to 55 in c8e086a

input_abstract = st.text_area('Abstract to analyze:',
                              height=400,
                              max_chars=850,
                              value="We derive a new fully implicit formulation for the ..."
                              )

We still need to fix the issue with kfold here. We can discuss this further once we meet.

arxiv_interface/src/train.py

Line 46 in c8e086a

for train_index, test_index in kf.split(X_train, y_train):

Code Review - Round 1

Overview

Great work so far! You've made a lot of progress in less than one week!! I like the structure of your repository and the clean and easy-to-follow code you've written.

Feedback

Make sure you add notes and explanations on any decisions and assumptions that you make in this project. For example, explain why you decided to use a Naive Bayes classifier and whether the model's performance matches your expectations, or why you are using the f1 score as your metric. Taking note of these discussions shows your theoretical knowledge and makes it very easy to gather these notes later and turn them into an article (if we wanted to write about this project and publish it).
You can break down your eda.ipynb notebook into multiple notebooks, since it currently also contains data preparation and training work. Also, it'd be good to add section headers and explanations for each section in your notebooks, which makes your work very easy to follow. Use markdown cells to explain what's happening in each section, comment on what the results suggest and whether the outcome matches your expectations, and describe what your next steps would be based on those results.
Can you save .py versions of your notebooks as well and push both the .ipynb file and its corresponding .py file? This will help me reference specific sections of the code when reviewing it, and it generally helps us compare different versions of your notebooks.
Make sure you use a virtual environment and create a requirements.txt file for your project. This way anyone can easily clone your repository, recreate your environment and run your code.
You can push your processed data files to a data/ folder if the sizes are around a few MB.

Code Review

I think INPUT_FILE_PATH should be renamed to INPUT_FILENAME since it's the full filename and not just the path.

arxiv_interface/src/config.py

Line 9 in 1d8f5b7

INPUT_FILE_PATH = os.path.join(SRC_PATH, input_dir, input_file)

I don't quite understand the way you're using kfold here. It seems like you're finding the train/test split where the model has the highest score on the test set. You're essentially finding the split where the test set is the easiest for the model, not the best model. Kfold is usually used with cross-validation to find the best model or the best set of hyperparameters by splitting the training data into different folds, i.e. various training and validation folds. Once you find the best set of hyperparameters, you'd then train your model on the entire training data using those hyperparameters. Another use case of kfold is to average your score across the folds to get a better estimate of your model's performance than a single split gives.

arxiv_interface/src/train.py

Lines 47 to 76 in 1d8f5b7

for train_index, test_index in kf.split(X_train, y_train):
    X_train_folds = X_train.iloc[train_index]
    y_train_folds = y_train.iloc[train_index, :]
    X_test_fold = X_train.iloc[test_index]
    y_test_fold = y_train.iloc[test_index, :]

    # transform training and validation data
    X_train_folds_trans = tfidf.fit_transform(X_train_folds)
    X_test_fold_trans = tfidf.transform(X_test_fold)
    # print(tfidf.get_feature_names())

    # initialize model
    clf = model_dispatcher.models[model]

    # fit the model on training data
    clf.fit(X_train_folds_trans, y_train_folds)

    # make predictions on test data
    preds = clf.predict(X_test_fold_trans)

    # calculate metrics
    print(classification_report(y_test_fold, preds))
    score = f1_score(y_test_fold, preds, average='macro')
    print("f1_score:", score)

    if score>best_score:
        best_score = score
        best_clf = clf
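
As a rough illustration of the usage described above (average the score across folds, then refit on all of the training data), here is a minimal sketch with scikit-learn. It rests on assumptions that are not taken from train.py: the toy corpus and labels are invented, the labels are single-label strings (the real y_train appears to be a DataFrame), and MultinomialNB stands in for whatever model_dispatcher.models[model] returns.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy single-label data, invented for illustration only.
X_train = [
    "quantum particle collider physics",
    "particle decay quantum field",
    "quantum field theory particle",
    "collider physics particle beam",
    "quantum physics field decay",
    "particle beam collider quantum",
    "neural network training gradient",
    "gradient descent neural model",
    "network model training data",
    "neural model gradient data",
    "training data network neural",
    "model gradient network training",
]
y_train = ["physics"] * 6 + ["cs"] * 6

# Keeping TF-IDF inside the pipeline means it is re-fit on the training
# folds only in each split, so nothing leaks from the validation fold.
pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())

# Score every fold and average, instead of keeping the best-scoring fold.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, X_train, y_train, cv=cv, scoring="f1_macro")
print("macro F1 per fold:", scores)
print("mean macro F1:", scores.mean())

# Once the model and hyperparameters are chosen, fit once on all training data.
final_model = pipeline.fit(X_train, y_train)
```

The mean score is the number to compare between candidate models or hyperparameter settings. The held-out test set would only be used once, at the very end.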

This assumes that the script is always executed from the webapp/ folder, and it will fail if that is not the case.

arxiv_interface/webapp/app.py

Line 7 in 1d8f5b7

model_bin = open('../models/n_bayes_score_0.32.bin', 'rb')

You can have a utils.py script with a function that gives you the project root folder, and always reference all paths relative to that root. This would be similar to what you have in config.py, except that os.getcwd() there also depends on where the script is launched from.

arxiv_interface/src/config.py

Line 8 in 1d8f5b7

SRC_PATH = os.path.dirname(os.getcwd())
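
A possible shape for such a helper is sketched below. The function name get_project_root, its levels_up parameter, and the /tmp/arxiv_interface location are all assumptions made for illustration. The idea is to resolve paths from the location of the script file rather than from the current working directory.

```python
from pathlib import Path


def get_project_root(script_path, levels_up=1):
    """Return the directory `levels_up` levels above the folder containing script_path."""
    return Path(script_path).resolve().parents[levels_up]


# In webapp/app.py one would call get_project_root(__file__); a literal
# path is used here so the sketch runs from any location.
root = get_project_root("/tmp/arxiv_interface/webapp/app.py")

# Build the model path from the root instead of using '../models/...'.
model_path = root / "models" / "n_bayes_score_0.32.bin"
print(model_path)
```

Because the result depends only on where the file lives, the app can be started from any directory.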
