demetrius-mp / sesg

SeSG (Search String Generator) python package repository.

Home Page: https://sesg.readthedocs.io/en/latest/

License: GNU General Public License v3.0

Python 100.00%
research-tool search-string systematic-literature-reviews topic-modeling python

sesg's Introduction

Hi 👋, I'm Demetrius

Pursuing an MSc in Computer Science

  • 🔭 I'm currently working on my graduation

  • 🌱 I'm currently learning text mining and topic extraction

  • 👯 I'm looking to collaborate on open source development

  • 📫 How to reach me [email protected]

  • ⚡ Fun fact I like listening to city pop while programming

Connect with me:

demetrius-mp demetriusmp demetrius-panovitch 12002242 demetrius-mp

Languages and Tools:

bootstrap, css3, flask, git, html5, java, javascript, linux, mysql, postgresql, python, sqlite

sesg's Issues

fix: inconsistent citation graph architecture

in sesg.snowballing.fuzzy_backward_snowballing the graph is returned as an adjacency list; however, in sesg.graph we provide a function named edges_to_adjacency_list.

we should provide a way to turn a directed adjacency list into an undirected one (actually, we just need to put it back where it was, see this commit).
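A minimal sketch of that conversion, assuming the adjacency list is a dict mapping a node id to a list of neighbor ids. The function name `to_undirected` is hypothetical and is not taken from sesg.graph:

```python
def to_undirected(adjacency: dict[int, list[int]]) -> dict[int, list[int]]:
    """Return a copy of the graph where every edge a -> b also exists as b -> a."""
    # Start from a set-based copy so duplicate edges are collapsed.
    undirected: dict[int, set[int]] = {
        node: set(neighbors) for node, neighbors in adjacency.items()
    }

    # Add the reverse of every directed edge, creating missing nodes on the way.
    for node, neighbors in adjacency.items():
        for neighbor in neighbors:
            undirected.setdefault(neighbor, set()).add(node)

    # Sort the neighbors so the result is deterministic.
    return {node: sorted(neighbors) for node, neighbors in undirected.items()}
```

Nodes that only appear as targets in the directed version become keys in the undirected one.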

docs: ideas

"Experiment workflow" page, describing how the experiment is executed.

"Proposed usage" page, describing how a researcher could use the package.

feat: mypy

check the viability of using mypy, and add a task to run mypy against the codebase

feat: remove bert OOV words

remove out-of-vocabulary words found by bert when enriching a string

this is done by removing every token that starts with "##"
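A minimal sketch of the filter, assuming the enrichment step yields a plain list of token strings. The function name `remove_subword_tokens` is hypothetical:

```python
def remove_subword_tokens(tokens: list[str]) -> list[str]:
    """Drop every token that starts with the "##" prefix."""
    return [token for token in tokens if not token.startswith("##")]
```

For example, `remove_subword_tokens(["soft", "##ware", "testing"])` keeps only `"soft"` and `"testing"`.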

fix: weird `KeyError` on ScopusClient

another weird issue (like #42).

seems like scopus sometimes returns a JSON response without the search-results key. this happens non-deterministically, which means that if you simply retry the request, you might get a correct JSON response.

feat: add caching for similar words

not sure if it would make it noticeably faster, but worth a shot

i'm thinking of using a class SearchStringBuilder that receives the enrichment text and stores a cache dict, in which the key is a word and the value is the previously computed list of similar words.
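A rough sketch of what such a class could look like. Only the class name and the idea of a cache dict come from the note above; the `find_similar_words` callable and the `similar_words` method are assumptions standing in for whatever computes similar words today:

```python
from typing import Callable


class SearchStringBuilder:
    """Caches similar words so each word is only computed once."""

    def __init__(
        self,
        enrichment_text: str,
        # Hypothetical hook: (word, enrichment_text) -> similar words.
        find_similar_words: Callable[[str, str], list[str]],
    ) -> None:
        self.enrichment_text = enrichment_text
        self._find_similar_words = find_similar_words
        self._cache: dict[str, list[str]] = {}

    def similar_words(self, word: str) -> list[str]:
        # Compute on the first request, then serve from the cache.
        if word not in self._cache:
            self._cache[word] = self._find_similar_words(word, self.enrichment_text)
        return self._cache[word]
```

Whether this is noticeably faster depends on how often the same word shows up across topics, so it is worth measuring before committing to it.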

fix: weird `JSONDecodeError` on `ScopusClient`

seems like sometimes the Scopus API does not return a valid JSON response for a string. However, if we simply redo the request, the API responds correctly.

i see no other option but to add a retry-on-exception mechanism to avoid this problem (tenacity seems like the perfect solution).
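A minimal sketch of the retry idea using only the standard library instead of tenacity, so it is not the final design. The `fetch_text` callable is a hypothetical stand-in for the real request, and the function name and parameters are assumptions. It also treats a missing search-results key as retryable, since that failure seems to behave the same way:

```python
import json
import time
from typing import Any, Callable


def fetch_search_results(
    fetch_text: Callable[[], str],  # hypothetical: performs the request, returns the body
    max_attempts: int = 3,
    wait_seconds: float = 0.0,
) -> Any:
    """Retry the request until the body is valid JSON with a "search-results" key."""
    last_error: Exception | None = None

    for attempt in range(1, max_attempts + 1):
        try:
            payload = json.loads(fetch_text())
            return payload["search-results"]
        except (json.JSONDecodeError, KeyError) as error:
            last_error = error
            # Wait before the next attempt, but not after the last one.
            if attempt < max_attempts:
                time.sleep(wait_seconds)

    raise RuntimeError(f"no valid response after {max_attempts} attempts") from last_error
```

With tenacity, the same behavior would presumably be expressed as a decorator on the request method rather than an explicit loop.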

refactor: word enrichment

maybe create a different function to concatenate the enriched words

main goal is to make testing easier

docs: deploy

figure out how to deploy, and where to deploy.

feat: enhance scopus search

add a recipe on how to use sesg.scopus.create_clients_with_disjoint_api_keys to execute faster searches

the idea is to use either multiprocessing or multithreading and initialize each client at the same time

this would hugely increase performance; theoretically, it could be up to n_clients times faster
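A sketch of the multithreading variant, under loud assumptions: each client is modeled as a plain callable that takes a search string and returns its results, which is not the real ScopusClient interface, and `search_in_parallel` is a hypothetical name. Each client gets its own thread and its own share of the search strings, so no client is used by two threads at once:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Sequence


def search_in_parallel(
    clients: Sequence[Callable[[str], Any]],  # hypothetical client interface
    search_strings: Sequence[str],
) -> list[Any]:
    """Run the searches with one thread per client, keeping the input order."""
    if not clients:
        raise ValueError("at least one client is required")

    def run_batch(client_index: int) -> list[tuple[int, Any]]:
        client = clients[client_index]
        # Client k handles strings k, k + n_clients, k + 2 * n_clients, ...
        indexes = range(client_index, len(search_strings), len(clients))
        return [(i, client(search_strings[i])) for i in indexes]

    results: list[Any] = [None] * len(search_strings)
    with ThreadPoolExecutor(max_workers=len(clients)) as pool:
        for batch in pool.map(run_batch, range(len(clients))):
            for index, result in batch:
                results[index] = result
    return results
```

Threads should be enough here because the work is dominated by waiting on the network; multiprocessing would mostly matter if parsing the responses became the bottleneck.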

refactor: decouple search string formulation from similar word generation

this would allow for higher modularity

right now, we pass a list of topics (where a topic is a list of words) to a function that returns a formulated search string with the similar words already injected

this approach makes it hard to try out other similar word generation strategies

i thought about providing a new function:

def generate_similar_words(
    topics: list[list[str]],
) -> dict[str, list[str]]:
    ...

which would be used as follows:

topics = extract_topics_with_bertopic(
    docs,
    kmeans_n_clusters=2,
    umap_n_neighbors=5,
)

similar_words = generate_similar_words(topics)

search_string = generate_search_string(
    topics,
    similar_words,
    n_words_per_topic=5,
    n_similar_words_per_word=1,
)

notice that the generate_similar_words function might receive additional parameters, according to the strategy used.
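One possible shape for this, as a sketch only: the strategy is passed in as a callable, so swapping strategies does not touch the formulation code. The `find_similar` parameter is an assumption and is not part of the signature proposed above:

```python
from typing import Callable


def generate_similar_words(
    topics: list[list[str]],
    find_similar: Callable[[str], list[str]],  # hypothetical strategy hook
) -> dict[str, list[str]]:
    """Map every distinct word in the topics to its list of similar words."""
    similar_words: dict[str, list[str]] = {}
    for topic in topics:
        for word in topic:
            # A word shared by several topics is only computed once.
            if word not in similar_words:
                similar_words[word] = find_similar(word)
    return similar_words
```

Strategy-specific parameters can then be bound before the call, for example with `functools.partial`.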

feat: logging

add some sort of logging to show the time taken by functions and other information
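A minimal sketch of the timing part, using a decorator built on the standard logging module. The logger name "sesg" and the decorator name `log_duration` are assumptions:

```python
import functools
import logging
import time
from typing import Any, Callable

logger = logging.getLogger("sesg")  # hypothetical logger name


def log_duration(func: Callable[..., Any]) -> Callable[..., Any]:
    """Log how long each call to the wrapped function takes."""

    @functools.wraps(func)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Runs even if the function raises, so failures are timed too.
            elapsed = time.perf_counter() - start
            logger.info("%s took %.3f s", func.__name__, elapsed)

    return wrapper
```

Nothing is printed unless the application configures a handler and sets the level to INFO or lower, which keeps the package quiet by default.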
