acbilson / chaos-index

A search engine indexer for the chaos suite

License: GNU General Public License v3.0

Dockerfile 5.31% Just 7.44% Shell 1.50% Python 85.75%

chaos-index's Issues

Iterative Scraping

Page scraping today iterates over a single list of content per site. Blogs that spread their content across multiple pages require each page to be added as a separate site. For some websites that could be 10 pages or more, because there are only five posts per page.

A new site column should be added that specifies the element used to navigate to the next page. This column, perhaps next_page_query, ought to be checked after each scraping iteration; if a matching element exists, its link should be used as the new site URL and the whole scraping process repeated.

For example, Victoria Drake's blog has a link with aria-label="Next" which could be used to navigate to the next page.

<li class="page-item"><a href="/blog/page/3/" class="page-link" aria-label="Next"><span aria-hidden="true">older</span></a> /
<a href="/blog/page/10/" class="page-link" aria-label="Last"><span aria-hidden="true">oldest</span></a></li>
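
The loop described above could look roughly like the following. This is only a sketch using the standard library rather than the project's BeautifulSoup code; the names NextLinkFinder, find_next_url, scrape_all_pages, fetch, scrape_page and max_pages are all hypothetical, and matching on aria-label is just one possible form a next_page_query could take.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NextLinkFinder(HTMLParser):
    """Records the href of the first <a> whose aria-label matches (case-insensitive)."""

    def __init__(self, label):
        super().__init__()
        self.label = label.lower()
        self.href = None

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        aria_label = (attr_map.get("aria-label") or "").lower()
        if self.href is None and tag == "a" and aria_label == self.label:
            self.href = attr_map.get("href")


def find_next_url(html, base_url, label):
    """Return the absolute URL of the next page, or None if there is no such link."""
    finder = NextLinkFinder(label)
    finder.feed(html)
    return urljoin(base_url, finder.href) if finder.href else None


def scrape_all_pages(start_url, fetch, scrape_page, next_label="Next", max_pages=50):
    """Scrape start_url, then keep following the next-page link until it runs out.

    fetch(url) returns HTML and scrape_page(html) returns a list of items;
    both are supplied by the caller (hypothetical hooks into the existing scraper).
    """
    results, url, seen = [], start_url, set()
    # The seen set and max_pages guard against pagination loops.
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        html = fetch(url)
        results.extend(scrape_page(html))
        url = find_next_url(html, url, next_label)
    return results
```

With this shape a site only needs its first page listed; the remaining pages are discovered as long as each one exposes a matching link.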

A Second Scraper

My initial scraper handles a decent number of sites, but I have noticed some that don't structure their HTML in a way it can handle effectively.

Instead of trying to handle edge cases in a single scraper, a little abstraction would allow me to specify a scraper version for different pages. I can continue to run the original version, but add new ones as the need arises.

Pros: Fewer issues with backwards compatibility. Cleaner code.
Cons: Code duplication, and more code to test.
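
One way the abstraction might be sketched is a small registry keyed by version. Everything here is an assumption for illustration: the SCRAPERS dict, the register_scraper decorator, the scraper_version site field, and the two stand-in scraper bodies, which do not reflect the real scraping logic.

```python
SCRAPERS = {}


def register_scraper(version):
    """Decorator that files a scraper function under a version number."""
    def decorator(func):
        SCRAPERS[version] = func
        return func
    return decorator


@register_scraper(1)
def scrape_v1(html):
    # Stand-in for the original scraper: one item per non-empty line.
    return [line.strip() for line in html.splitlines() if line.strip()]


@register_scraper(2)
def scrape_v2(html):
    # Stand-in for a second scraper aimed at differently structured pages.
    return [part.strip() for part in html.split("|") if part.strip()]


def scrape_site(site, html):
    """Dispatch to the scraper named by the site's (hypothetical) scraper_version field."""
    version = site.get("scraper_version", 1)  # existing sites keep using version 1
    if version not in SCRAPERS:
        raise ValueError(f"unknown scraper version: {version}")
    return SCRAPERS[version](html)
```

Defaulting to version 1 means existing site rows would not need to change when a new scraper is added.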

Dump indexed content into FTS5 table

Search bogs down badly because the entire index is passed to the client. But I think SQLite can actually manage the search entirely on its end via FTS5 tables. While building a secure endpoint to query it will take a little research, returning only the matches should speed up the client considerably.
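
A rough sketch of what the FTS5 side could look like with Python's built-in sqlite3 module. It assumes the SQLite build in use has FTS5 compiled in (common, but not guaranteed), and the table name page_index, its columns, and the build_index and search helpers are hypothetical.

```python
import sqlite3


def build_index(conn, pages):
    """Create the FTS5 table if needed and load (url, title, body) rows into it."""
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS page_index "
        "USING fts5(url UNINDEXED, title, body)"
    )
    conn.executemany(
        "INSERT INTO page_index (url, title, body) VALUES (?, ?, ?)", pages
    )
    conn.commit()


def search(conn, query, limit=10):
    """Return (url, title) pairs matching the query, best matches first."""
    # Wrap the input in double quotes so it is treated as a plain phrase
    # instead of being parsed as FTS5 query syntax.
    phrase = '"' + query.replace('"', '""') + '"'
    rows = conn.execute(
        "SELECT url, title FROM page_index "
        "WHERE page_index MATCH ? ORDER BY rank LIMIT ?",
        (phrase, limit),
    )
    return rows.fetchall()
```

Binding the query as a parameter and quoting it as a phrase keeps user input out of the SQL text, which is one piece of the endpoint security question; only the matching rows then need to travel to the client.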

Better Element Query

My use of BeautifulSoup today only allows for element tag and class selection, but that is sometimes not enough to identify the correct element. The code to convert a query into bs4 logic needs revision and abstraction, and additional parameters need to be taken into account. Here are the most obvious:

  1. The use of an id
  2. The use of a non-class attribute (aria-label or role for example)
  3. Specifying the nth match found. Sometimes nothing in the markup distinguishes the second section from the first
  4. Greater specificity on the final element. Right now it just looks for 'a' and 'p' tags, but more precise targeting here could particularly help on less hierarchical sites
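
To make the list above concrete, here is a sketch of a query parser covering ids, classes, other attributes and an nth index. The query syntax (tag#id.class[attr=value]:n) is invented for illustration and is not the project's current format; ElementQuery and parse_query are hypothetical names, and translating the parsed result into bs4 calls is left out.

```python
import re
from dataclasses import dataclass, field

# Hypothetical syntax: tag, optional #id, any number of .class parts,
# any number of [attr=value] parts, and an optional :n (1-based) suffix.
QUERY_RE = re.compile(
    r"^(?P<tag>[a-zA-Z][\w-]*)"
    r"(?:#(?P<id>[\w-]+))?"
    r"(?P<classes>(?:\.[\w-]+)*)"
    r"(?P<attrs>(?:\[[\w-]+=[^\]]*\])*)"
    r"(?::(?P<nth>\d+))?$"
)
ATTR_RE = re.compile(r"\[([\w-]+)=([^\]]*)\]")


@dataclass
class ElementQuery:
    tag: str
    id: str = None
    classes: tuple = ()
    attrs: dict = field(default_factory=dict)
    nth: int = 1  # which match to use, counting from 1


def parse_query(query):
    """Parse a query string into an ElementQuery, or raise ValueError."""
    match = QUERY_RE.match(query.strip())
    if match is None:
        raise ValueError(f"unsupported element query: {query!r}")
    classes = tuple(c for c in match["classes"].split(".") if c)
    # Attribute values may be quoted; strip surrounding quotes if present.
    attrs = {name: value.strip("\"'") for name, value in ATTR_RE.findall(match["attrs"])}
    nth = int(match["nth"]) if match["nth"] else 1
    if nth < 1:
        raise ValueError("nth is 1-based and must be at least 1")
    return ElementQuery(match["tag"], match["id"], classes, attrs, nth)
```

For instance, parse_query('a[aria-label="Next"]') would describe the pagination link by attribute, while 'section.post:2' would ask for the second matching section.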
