
tpayet / github-issues-scraper


Get issues and pull requests from a GitHub repository into MeiliSearch

License: MIT License

Languages: Shell 55.88%, Ruby 41.72%, Dockerfile 2.41%

github-issues-scraper's Introduction

Hi there 👋

github-issues-scraper's People

Contributors: curquiza, dependabot-preview[bot], fharper, tpayet


github-issues-scraper's Issues

Change master branch to main

Let's be allies and make this change that means a lot.

Here is a blog post that explains in a bit more detail why it's important and how to do it easily. It will be a bit more complicated with automation, but we should still do it!

Improve the search bar experience

The way issues and PRs are scraped determines the quality of the search. Here are the points I noticed that could be improved:

  • Use url and anchor instead of just the url field, so that the same issue/PR is not repeated in the search bar results. I asked for this change in the main PR because, from my POV, it's essential to fix that before any integration.

  • hierarchy_radio_lvlX is never set. In the documentation main issue I pointed to a link that explains in detail how to imitate docs-scraper, and it includes an explanation of hierarchy_radio_lvlX. This field is also important for the search. It should be fixed ASAP 🙂

  • If I'm not wrong, every single document has its content field filled (never set to null). It's not a big deal, but we don't want the content to be displayed every time. It depends on the search.
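For the first point, a minimal sketch of splitting a link into the url and anchor fields. The helper name `split_url_and_anchor` and the `issuecomment-1` fragment are illustrative assumptions, not taken from the scraper's code; only Ruby's standard `URI` library is used.

```ruby
require 'uri'

# Hypothetical helper: separates the fragment of a link from the rest of it,
# so that every document of one issue/PR shares the same "url" value.
def split_url_and_anchor(link)
  uri = URI.parse(link)
  anchor = uri.fragment   # nil when the link has no "#..." part
  uri.fragment = nil      # strip the fragment from the URL itself
  { 'url' => uri.to_s, 'anchor' => anchor }
end
```

With every document of an issue carrying the same url, the search side can group results on that field while anchor still points to the exact comment.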

In the current main docs, if I type "add documents", the search bar returns only a title because a title matches the query:

[Screenshot, 2020-06-18 10:27:18]

Rest assured, the Add Documents page does contain "add documents" in its content (not only in its titles), but the search bar does not need to display it.

Again with the current main documentation, if I type "our solution", there is no title or subtitle matching, so the search bar returns the content:

[Screenshot, 2020-06-18 11:03:43]

But with the current search bar for GitHub issues, we always return content. It's not necessary and it "spoils" the results:

[Screenshot, 2020-06-18 11:06:48]

Here, if I type "new token", I want the search bar to return only the issue "Tracking issue: New tokenizer" without any content.
If I type "i agree", no title matches, so I do want this result:

[Screenshot, 2020-06-18 10:58:02]

How to fix that? When scraping a new issue, another document should be added with the same information but:

  • with content set to null
  • (with anchor set to null according to the first point)
  • (and with hierarchy_radio_lvlX according to the 2nd point)

Nothing has to be removed. Only one additional document is required.
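The additional document described above could be derived from an already scraped one roughly as follows. This is only a sketch: the helper name `title_only_document` is hypothetical, and it assumes the docs-scraper convention that hierarchy_radio_lvlX is filled only at the deepest hierarchy level that has a value (to be checked against the docs-scraper explanation mentioned earlier).

```ruby
# Hypothetical helper (not from the scraper): builds the extra title-only
# document from an already scraped document hash, without modifying it.
def title_only_document(doc)
  # Deepest hierarchy level that holds a value (lvl2, the issue title, in the sample).
  level = 5.downto(0).find { |i| doc["hierarchy_lvl#{i}"] }

  # Same information, but with content and anchor set to null (nil in Ruby).
  extra = doc.merge('content' => nil, 'anchor' => nil)

  # The copy must not reuse the original ID; the caller assigns a new unique one.
  extra.delete('objectID')

  # Assumption: only the deepest level is set in hierarchy_radio_lvlX.
  (0..5).each { |i| extra["hierarchy_radio_lvl#{i}"] = nil }
  extra["hierarchy_radio_lvl#{level}"] = doc["hierarchy_lvl#{level}"] if level
  extra
end
```

The original document is left in place and the returned hash is indexed alongside it, which matches "only one additional document is required".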

  • I see a document with content set to "" (not null). There may be others.

{
    "objectID": 822216370506972087,
    "hierarchy_radio_lvl0": null,
    "hierarchy_radio_lvl1": null,
    "hierarchy_radio_lvl2": null,
    "hierarchy_radio_lvl3": null,
    "hierarchy_radio_lvl4": null,
    "hierarchy_radio_lvl5": null,
    "hierarchy_lvl0": "🌍 GitHub",
    "hierarchy_lvl1": "Issue",
    "hierarchy_lvl2": "High-availability with a distributed consensus algorithm",
    "content": "",
    "anchor": null,
    "url": "https://github.com/meilisearch/MeiliSearch/issues/528"
}

We should investigate this and either set the field to null or fill it with the right content. I noticed that the issue does not have any description. In this case, only a document with content set to null should be added (linked with the 3rd point).
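One way to guard against empty strings is to normalize the content field before indexing. A minimal sketch, assuming documents are plain Ruby hashes with string keys; the helper name `normalize_content` is hypothetical.

```ruby
# Hypothetical helper: turns an empty or whitespace-only content into nil,
# so the indexed document carries null rather than "".
def normalize_content(doc)
  content = doc['content']
  blank = content.nil? || content.strip.empty?
  doc.merge('content' => blank ? nil : content)
end
```

An issue without a description then yields a document whose content is null, as requested.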

NB

The 2nd and 3rd points are linked to improve the user experience and should be done in the same PR.

Specify the index uid

For testing purposes, I would like to be able to specify the index UID through an environment variable 🙂
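A possible shape for this, as a sketch only: the variable name `MEILISEARCH_INDEX_UID` and the fallback value are assumptions made for illustration, not names used by the scraper.

```ruby
# Hypothetical fallback used when the environment variable is not set.
DEFAULT_INDEX_UID = 'github-issues'

# Reads the index UID from the environment (or from any hash passed in,
# which keeps the function easy to test).
def index_uid(env = ENV)
  uid = env.fetch('MEILISEARCH_INDEX_UID', '').strip
  uid.empty? ? DEFAULT_INDEX_UID : uid
end
```

A test run could then target a separate index by exporting the variable before launching the scraper, leaving the default index untouched.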
