leofuchs / sesg


SeSG (Search String Generator): An approach that uses text mining to build search strings for secondary studies.

License: MIT License

Python 100.00%
text-mining bert bag-of-words levenshtein-distance lda scopus-api scopus python

SeSG (Search String Generator)

Implementation and experimentation of SeSG, a search string generator that uses text mining techniques to build a search string from a supplied Quasi-Gold Standard.

Note: This is a research algorithm, susceptible to errors and imperfections.

Repository Structure

The directory structure is shown below. In summary, there is a folder with the results of the experiment (complete-results), a folder with the output of an execution (exits), a folder with the input files for an execution (files-qgs), and the source files that make up SeSG.

├── SeSG
│   ├── complete-results
│   │   ├── azeem-review
│   │   │   └── ...
│   │   ├── hosseini-review
│   │   │   └── ...
│   │   └── vasconcellos-review
│   │       └── ...
│   ├── exits
│   │   ├── snowballing-images
│   │   ├── manual-exit.csv
│   │   ├── result.csv
│   │   └── sentences.txt
│   ├── files-qgs
│   │   ├── review-azeem
│   │   │   ├── gs-pdf
│   │   │   ├── gs-txt
│   │   │   ├── qgs-txt
│   │   │   ├── GS.csv
│   │   │   └── QGS.csv
│   │   ├── review-hosseini
│   │   │   └── ...
│   │   └── review-vasconcellos
│   │       └── ...
│   ├── .gitignore
│   ├── LICENSE
│   ├── README.md
│   ├── SeSG.py
│   ├── object-azeem.py
│   ├── object-hosseini.py
│   ├── object-vasconcellos.py
│   └── requirements.txt

SeSG Process

Figure 1 shows an example of how the SeSG process works. The process begins by running LDA on a bag-of-words built from the selected QGS. BERT is then used to find similar terms, which enrich the search string. Finally, the terms found in the previous steps are grouped together and the search string is formulated.
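The bag-of-words step can be illustrated with a minimal, standard-library-only sketch. It is not the code in SeSG.py: the function name `bag_of_words`, the tokenizer and the sample texts are hypothetical, and `min_df` is assumed to be a fraction of documents. The requirements list scikit-learn, whose `CountVectorizer` and `LatentDirichletAllocation` would be a usual choice for this step, but whether the repository uses them is an assumption.

```python
import re
from collections import Counter


def bag_of_words(documents, min_df=0.1):
    """Return (vocabulary, rows) for a list of raw text documents.

    Assumption: min_df is a fraction, so a term is kept only if it
    occurs in at least min_df * len(documents) documents.
    """
    # Naive tokenizer (hypothetical): lowercase alphabetic runs only.
    tokenized = [re.findall(r"[a-z]+", doc.lower()) for doc in documents]

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    threshold = min_df * len(documents)
    vocabulary = sorted(t for t, df in doc_freq.items() if df >= threshold)

    # One row of term counts per document, aligned with `vocabulary`.
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[t] for t in vocabulary])
    return vocabulary, rows


# Hypothetical stand-ins for the QGS texts.
qgs_texts = [
    "code smell detection with machine learning",
    "machine learning for defect prediction",
    "defect prediction across projects",
]
vocabulary, rows = bag_of_words(qgs_texts, min_df=0.5)
print(vocabulary)
for row in rows:
    print(row)
```

A count matrix of this shape (one row per document, one column per term) is the kind of input a topic model such as LDA typically consumes.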

Figure 1. An example of the Topics Extraction and Enrichment and Generation of Search String sub-processes of the SeSG process, showing the necessary input parameters and how the search string is developed.
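The final grouping step might look like the sketch below. The Boolean layout is an assumption, since the README does not state it: a word and its similar words are joined with OR, the words of one topic with AND, and the topics with OR. The function `build_search_string` and the sample terms are hypothetical.

```python
def build_search_string(topics, similar_terms):
    """Combine topic words and their similar words into one Boolean string.

    Assumed layout (not confirmed by the README):
      - a word and its similar words are alternatives -> joined with OR
      - words of the same topic must co-occur         -> joined with AND
      - topics are alternatives                       -> joined with OR
    """
    topic_clauses = []
    for words in topics:
        word_groups = []
        for word in words:
            alternatives = [word] + similar_terms.get(word, [])
            quoted = " OR ".join('"{}"'.format(term) for term in alternatives)
            word_groups.append("(" + quoted + ")")
        topic_clauses.append("(" + " AND ".join(word_groups) + ")")
    return " OR ".join(topic_clauses)


# Hypothetical topic words (e.g. from LDA) and similar words (e.g. from BERT).
topics = [["smell", "detection"], ["defect"]]
similar_terms = {"smell": ["smells"]}

search_string = build_search_string(topics, similar_terms)
print(search_string)
```

Keeping the similar words in a separate mapping makes it easy to generate the string with 0 similar words, as in some of the test configurations.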


To execute SeSG, run one of the `.py` experiment files at the root of the directory.

Quasi-Experiment Running

Three .py files run the SeSG experiment, each preconfigured for one particular experimental object.

1. Azeem et al.

The file object-azeem.py runs the experiment for the study by Azeem et al. [1]. For this, the following parameters must be set in the code:

author = 'azeem'
pub_year_one = 2018
pub_year_two = 1999
qgs_size = 5
gs_size = 15

2. Hosseini et al.

The file object-hosseini.py runs the experiment for the study by Hosseini et al. [2]. For this, the following parameters must be set in the code:

author = 'hosseini'
pub_year_one = 2016
pub_year_two = 0
qgs_size = 15
gs_size = 46

3. Vasconcellos et al.

The file object-vasconcellos.py runs the experiment for the study by Vasconcellos et al. [3]. For this, the following parameters must be set in the code:

author = 'vasconcellos'
pub_year_one = 2015
pub_year_two = 0
qgs_size = 10
gs_size = 30
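The three parameter sets above could be collected in one place, as in the sketch below. The values are copied from the listings, while the `ObjectConfig` type, the `OBJECTS` table and the `describe` helper are hypothetical and not part of the repository.

```python
from typing import NamedTuple


class ObjectConfig(NamedTuple):
    """Parameters of one experimental object (field names as in the scripts)."""
    author: str
    pub_year_one: int
    pub_year_two: int
    qgs_size: int
    gs_size: int


# Values copied from the three parameter listings above.
OBJECTS = {
    "azeem": ObjectConfig("azeem", 2018, 1999, 5, 15),
    "hosseini": ObjectConfig("hosseini", 2016, 0, 15, 46),
    "vasconcellos": ObjectConfig("vasconcellos", 2015, 0, 10, 30),
}


def describe(config):
    """Return a one-line, human-readable summary of a configuration."""
    return "{}: QGS of {} studies, GS of {} studies".format(
        config.author, config.qgs_size, config.gs_size
    )


for name in sorted(OBJECTS):
    print(describe(OBJECTS[name]))
```

A single table like this makes it harder for the three scripts to drift apart when a parameter is changed.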

Quasi-Experiment Results

Running a .py script to completion produces several outputs. The script prints the generated search strings and their respective results to the screen, and also writes a spreadsheet named author-result.csv that compiles this information.

Note: The results in the /exits/ folder come from one arbitrary example run of the SeSG-azeem.py script.

In addition, the folder /exits/snowballing-images/ contains the graphs that represent the snowballing for each of the search strings presented. Each file name encodes the test configuration. For example, graph-with-0.1-3-7-0.ps denotes a graph produced with the following configuration: 0.1 min-df, 3 topics, 7 words and 0 similar words.
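The naming scheme in the example can be decoded with a small parser. This is a sketch that assumes every graph file follows the pattern of the single example given, graph-with-&lt;min-df&gt;-&lt;topics&gt;-&lt;words&gt;-&lt;similar words&gt;.ps; the `GraphConfig` type and `parse_graph_name` function are hypothetical.

```python
import re
from typing import NamedTuple


class GraphConfig(NamedTuple):
    min_df: float
    topics: int
    words: int
    similar_words: int


# Assumed pattern, derived only from the example file name in the text.
_PATTERN = re.compile(r"^graph-with-([0-9.]+)-(\d+)-(\d+)-(\d+)\.ps$")


def parse_graph_name(filename):
    """Decode the test configuration encoded in a graph file name."""
    match = _PATTERN.match(filename)
    if match is None:
        raise ValueError("unexpected graph file name: {!r}".format(filename))
    min_df, topics, words, similar_words = match.groups()
    return GraphConfig(float(min_df), int(topics), int(words), int(similar_words))


config = parse_graph_name("graph-with-0.1-3-7-0.ps")
print(config)
```

Raising on an unexpected name, rather than guessing, keeps stray files in the folder from being misread as results.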

The output graph represents the connections between the articles in the GS, showing which were found by searching the databases (bold nodes), which were found through snowballing rounds (filled nodes), and which were not found after applying the hybrid approach (dashed nodes).
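The node styling described above maps directly onto Graphviz node styles (`bold`, `filled`, `dashed`). The sketch below emits DOT text with plain string handling. It is an illustration only: the status labels, the article identifiers, the edge direction and the `to_dot` function are all hypothetical, and the repository itself may build its graphs differently.

```python
# Graphviz node style for each way an article was (or was not) retrieved.
STYLE_BY_STATUS = {
    "search": "bold",         # found by the search string
    "snowballing": "filled",  # found in a snowballing round
    "not_found": "dashed",    # not found by the hybrid approach
}


def to_dot(status_by_article, citations):
    """Return DOT source for a graph of GS articles and their connections."""
    lines = ["digraph gs {"]
    for article, status in sorted(status_by_article.items()):
        lines.append('    "{}" [style={}];'.format(article, STYLE_BY_STATUS[status]))
    for source, target in citations:
        # Edge direction is an assumption: `source` is connected to `target`.
        lines.append('    "{}" -> "{}";'.format(source, target))
    lines.append("}")
    return "\n".join(lines)


# Hypothetical articles and one hypothetical connection.
status_by_article = {"A1": "search", "A2": "snowballing", "A3": "not_found"}
citations = [("A2", "A1")]

dot_source = to_dot(status_by_article, citations)
print(dot_source)
```

Text like this can be rendered to PostScript with the Graphviz command-line tool, for example `dot -Tps`.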

Figure 2. Graph representing the connection between the GS in the Vasconcellos et al. [3] object.

Requirements

  • python 3.6.9
  • fuzzywuzzy 0.18.0
  • graphviz 0.14
  • nltk 3.5
  • numpy 1.19.0
  • pandas 1.0.5
  • pyscopus 0.9.0
  • python-Levenshtein 0.12.0
  • scikit-learn 0.23.1
  • scipy 1.5.1
  • torch 1.5.1
  • transformers 3.0.2

References

[1] Azeem, M. I., Palomba, F., Shi, L., & Wang, Q. (2019). Machine learning techniques for code smell detection: A systematic literature review and meta-analysis. Information and Software Technology, 108, 115-138.

[2] Hosseini, S., Turhan, B., & Gunarathna, D. (2017). A systematic literature review and meta-analysis on cross project defect prediction. IEEE Transactions on Software Engineering, 45(2), 111-147.

[3] Vasconcellos, F. J., Landre, G. B., Cunha, J. A. O., Oliveira, J. L., Ferreira, R. A., & Vincenzi, A. M. (2017). Approaches to strategic alignment of software process improvement: A systematic literature review. Journal of Systems and Software, 123, 45-63.
