m-tari / arxiv_interface

Viresa: an AI-powered virtual assistant for scientists

Jupyter Notebook 93.62% Python 6.37% Shell 0.01%
ai arxiv classification deep-learning eda machine-learning natural-language-processing nlp recommender-system summarization transformers

arxiv_interface's Introduction

  • 👋 Hi, I’m @m-tari
  • 👀 I’m interested in computational science, machine learning, and data science.

arxiv_interface's People

Contributors

m-tari

Forkers

r3ihan3h

arxiv_interface's Issues

Code Review - Round 2

This is great work so far! Especially the explanations you've added in the notebooks are very helpful. The semantic search in the app seems to be working very well, and I think we can soon finalize other functionalities of the app too (tagging and title generation).

Here are a few comments:

  1. You can remove import os and from . import config_set since you don't seem to be using them.
    import os, io
    import s3fs
    import torch
    # custom libraries
    from . import config_set
  2. It looks like config_set is not used in the app script either. So it can be removed.
    from src import config_set, semantic_search
  3. It seems like the paper title is always set to 'title' and you're not using the actual title for the semantic search.
    def suggest_articles(title, input_abstract):
        st.session_state.articles = semantic_search.search_papers('title', input_abstract)
        return st.session_state.articles
  4. Since you already have the notebooks for tagging the paper (I guess this one's not 100% complete?) and generating titles from the abstract, I think the next step would be to add relevant scripts in src/ for the two functions below to use and return actual results.
    def get_category(txt):
        st.session_state.category = 'Computer Science'
        return st.session_state.category

    def suggest_title(txt):
        st.session_state.title = "A thought-provoking title"
        return st.session_state.title
  5. I think you need to increase the max_chars limit because most abstracts have more than 850 characters and in that case, the interface doesn't allow you to paste the abstract in the text box.
    input_abstract = st.text_area('Abstract to analyze:',
                                  height=400,
                                  max_chars=850,
                                  value="We derive a new fully implicit formulation for the ..."
                                  )
  6. We still need to fix the issue with kfold here. We can discuss this further once we meet.
    for train_index, test_index in kf.split(X_train, y_train):
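As a rough illustration of the k-fold point, the sketch below averages the score over all folds and then refits on the full training data, instead of keeping the model from the best-scoring fold. It is dependency-free on purpose and is not the project's code: `kfold_indices`, `cross_validate`, `fit_majority` and `accuracy` are hypothetical names, and the majority-class "model" is only a toy stand-in for the TF-IDF plus classifier step.

```python
from statistics import mean


def kfold_indices(n_samples, n_splits):
    """Return (train_idx, val_idx) pairs for contiguous, unshuffled folds."""
    if not 2 <= n_splits <= n_samples:
        raise ValueError("n_splits must be between 2 and n_samples")
    # Spread the remainder over the first folds so sizes differ by at most 1.
    sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
             for i in range(n_splits)]
    splits, start = [], 0
    for size in sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(start)) + list(range(start + size, n_samples))
        splits.append((train_idx, val_idx))
        start += size
    return splits


def cross_validate(fit, score, X, y, n_splits=5):
    """Fit a fresh model on each training fold and score it on the held-out fold."""
    scores = []
    for train_idx, val_idx in kfold_indices(len(X), n_splits):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        scores.append(score(model,
                            [X[i] for i in val_idx],
                            [y[i] for i in val_idx]))
    return scores


# Toy stand-ins (assumptions): a majority-class "model" and plain accuracy.
def fit_majority(X, y):
    return max(set(y), key=y.count)


def accuracy(model, X, y):
    return sum(label == model for label in y) / len(y)


X = list(range(10))
y = [0, 0, 1, 0, 0, 1, 0, 0, 1, 0]

fold_scores = cross_validate(fit_majority, accuracy, X, y, n_splits=5)
estimate = mean(fold_scores)        # the number to report or compare across models
final_model = fit_majority(X, y)    # refit on all training data once a model is chosen
```

The same loop can be repeated per candidate model or hyperparameter setting, picking the one with the best mean score. In scikit-learn, `cross_val_score` combined with a `Pipeline` (so the vectorizer is refit inside each fold) covers this pattern; shuffling or stratifying the folds is often worth considering too.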

Code Review - Round 1

Overview

Great work so far! You've made a lot of progress in less than one week!! I like the structure of your repository and the clean and easy-to-follow code you've written.

Feedback

  1. Make sure you add notes and explanations on any decisions and assumptions that you make in this project. For example, explain why you decided to use a Naive Bayes classifier and whether the model's performance matches your expectations, or why you are using the F1 score as your metric. Taking note of these discussions shows your theoretical knowledge and makes it very easy to gather these notes later and turn them into an article (if we wanted to write about this project and publish it).
  2. You can break down your eda.ipynb notebook into multiple notebooks. There is currently data preparation and training work in this notebook as well. Also, it'd be good to add section headers and explanations for each section in your notebooks. This makes it very easy to follow your work. Use markdown cells to explain what's happening in each section, comment on what the results suggest and whether the outcome matches your expectations, and describe what your next steps would be based on those results.
  3. Can you save .py versions of your notebooks as well, and push both the .ipynb file and its corresponding .py file? This will help me reference specific sections of the code when reviewing it, and it generally makes it easier to compare different versions of your notebooks.
  4. Make sure you use a virtual environment and create a requirements.txt file for your project. This way anyone can easily clone your repository, recreate your environment and run your code.
  5. You can push your processed data files to a data/ folder if the sizes are around a few MB.
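For item 3, `jupyter nbconvert --to script` is the usual way to export a notebook. Purely to show what such an export involves, here is a minimal standard-library sketch that reads the notebook JSON and keeps code cells as code and markdown cells as comments. The function names `notebook_to_script` and `export_notebook` are hypothetical, and the sketch assumes the nbformat 4 layout (a top-level `cells` list whose entries have `cell_type` and `source`).

```python
import json
from pathlib import Path


def notebook_to_script(notebook: dict) -> str:
    """Turn a parsed .ipynb (nbformat 4 layout assumed) into script text."""
    chunks = []
    for cell in notebook.get("cells", []):
        source = cell.get("source", "")
        # "source" may be stored as one string or as a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        if cell.get("cell_type") == "code":
            chunks.append(source.rstrip())
        elif cell.get("cell_type") == "markdown":
            # Keep explanations next to the code as comments.
            chunks.append("\n".join(("# " + line).rstrip()
                                    for line in source.splitlines()))
    return "\n\n".join(chunks) + "\n"


def export_notebook(ipynb_path) -> Path:
    """Write a .py file next to the notebook and return its path."""
    ipynb_path = Path(ipynb_path)
    notebook = json.loads(ipynb_path.read_text(encoding="utf-8"))
    script_path = ipynb_path.with_suffix(".py")
    script_path.write_text(notebook_to_script(notebook), encoding="utf-8")
    return script_path
```

Calling `export_notebook("eda.ipynb")` would write `eda.py` alongside the notebook. Cells containing IPython magics (for example `%matplotlib inline`) are copied verbatim by this sketch and would not run as plain Python, which is one reason to prefer the nbconvert export in practice.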

Code Review

  1. I think INPUT_FILE_PATH should be renamed to INPUT_FILENAME since it's the full filename and not just the path.
    INPUT_FILE_PATH = os.path.join(SRC_PATH, input_dir, input_file)
  2. I don't quite understand the way you're using kfold here. It seems like you're finding the train/test split where the model has the highest score on the test set. You're essentially finding the split where the test set is the easiest for the model, not the best model. Kfold is usually used with cross-validation to find the best model or the best set of hyperparameters by splitting the training data into different folds, i.e., various training and validation folds. Once you find the best set of hyperparameters, you'd then train your model on the entire training data using the hyperparameters you've found through cross-validation. Another use case of kfold is to average your test score across the folds to get a better estimate of your accuracy.
    for train_index, test_index in kf.split(X_train, y_train):
        X_train_folds = X_train.iloc[train_index]
        y_train_folds = y_train.iloc[train_index, :]
        X_test_fold = X_train.iloc[test_index]
        y_test_fold = y_train.iloc[test_index, :]
        # transform training and validation data
        X_train_folds_trans = tfidf.fit_transform(X_train_folds)
        X_test_fold_trans = tfidf.transform(X_test_fold)
        # print(tfidf.get_feature_names())
        # initialize model
        clf = model_dispatcher.models[model]
        # fit the model on training data
        clf.fit(X_train_folds_trans, y_train_folds)
        # make predictions on test data
        preds = clf.predict(X_test_fold_trans)
        # calculate metrics
        print(classification_report(y_test_fold, preds))
        score = f1_score(y_test_fold, preds, average='macro')
        print("f1_score:", score)
        if score>best_score:
            best_score = score
            best_clf = clf
  3. This is assuming that the script is always executed from the webapp/ folder and it will fail if this is not the case.
    model_bin = open('../models/n_bayes_score_0.32.bin', 'rb')

    You can have a utils.py script with a function that gives you the root folder and always reference all paths with respect to your project root folder. This would be similar to what you have in config.py
    SRC_PATH = os.path.dirname(os.getcwd())
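One possible shape for such a utils.py helper is sketched below. It locates the project root by walking up from the file's own location until it finds a marker file, so the result does not depend on the current working directory. The names `find_project_root` and `project_path`, and the choice of `requirements.txt` as the marker, are assumptions for illustration and not part of the repository.

```python
from pathlib import Path
from typing import Optional, Union


def find_project_root(start: Optional[Union[str, Path]] = None,
                      marker: str = "requirements.txt") -> Path:
    """Walk upwards from `start` (default: this file) until `marker` is found."""
    current = Path(start if start is not None else __file__).resolve()
    if current.is_file():
        current = current.parent
    for candidate in (current, *current.parents):
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"No '{marker}' found in or above {current}")


def project_path(*parts: str,
                 start: Optional[Union[str, Path]] = None) -> Path:
    """Build a path relative to the project root rather than the working directory."""
    return find_project_root(start).joinpath(*parts)
```

With this in place, the model could be opened with something like `open(project_path('models', 'n_bayes_score_0.32.bin'), 'rb')`, whichever folder the script is launched from.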
