acbilson / chaos-index

A search engine indexer for the chaos suite

License: GNU General Public License v3.0

Dockerfile 5.31% Just 7.44% Shell 1.50% Python 85.75%

chaos-index's Issues

Iterative Scraping

Page scraping today requires a single list of content that's iterated over. Blogs with multiple pages of content require each page to be added as a separate site. For some websites that means 10 or more entries, because each page shows only five posts.

A new site column should be added that specifies the element linking to the next page. This column, perhaps next_page_query, ought to be checked after each scraping iteration; if it matches an element, that element's link should be used as the new site and the whole scraping process repeated.

For example, Victoria Drake's blog has a link with aria-label="Next" which could be used to navigate to the next page.

<li class="page-item"><a href="/blog/page/3/" class="page-link" aria-label="Next"><span aria-hidden="true">older</span></a> /
<a href="/blog/page/10/" class="page-link" aria-label="Last"><span aria-hidden="true">oldest</span></a></li>
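
A minimal sketch of the loop described above, assuming the next-page element is an anchor identified by its aria-label. It uses the standard library's html.parser rather than BeautifulSoup so that it is self-contained; find_next_page, scrape_all_pages, fetch and scrape_page are hypothetical names, not functions from chaos-index.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NextLinkFinder(HTMLParser):
    """Record the href of the first <a> whose aria-label matches `label`."""

    def __init__(self, label):
        super().__init__()
        self.label = label.lower()
        self.href = None

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if (
            self.href is None
            and tag == "a"
            and (attr_map.get("aria-label") or "").lower() == self.label
        ):
            self.href = attr_map.get("href")


def find_next_page(html, base_url, label="next"):
    """Return the absolute URL of the next page, or None if there is no link."""
    finder = NextLinkFinder(label)
    finder.feed(html)
    return urljoin(base_url, finder.href) if finder.href else None


def scrape_all_pages(start_url, fetch, scrape_page, max_pages=50):
    """Scrape start_url, then keep following the next-page link.

    `fetch(url)` returns HTML and `scrape_page(html)` returns a list of
    results; both are supplied by the caller (assumed interfaces).
    """
    results, seen, url = [], set(), start_url
    # Stop on a missing link, a repeated URL, or the page cap.
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        html = fetch(url)
        results.extend(scrape_page(html))
        url = find_next_page(html, url)
    return results
```

The seen set and max_pages cap are there so a site whose "next" link loops back on itself cannot keep the scraper running forever.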

A Second Scraper

My initial scraper handles a decent number of sites, but I have noticed some whose HTML isn't structured in a way it can handle effectively.

Instead of trying to handle edge cases in a single scraper, a little abstraction would allow me to specify a scraper version for each site. I can continue to run the original version and add new ones as the need arises.

Pros: Fewer issues with backwards compatibility. Cleaner code.
Cons: Code duplication, and more code to test.
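
One way the version selection could look, as a rough sketch: scrapers register themselves under a version number, and each site row names the version it wants. The scraper_version field, the SCRAPERS registry and the regex-based scraper bodies are all assumptions for illustration; the real scrapers would use their own parsing logic.

```python
import re

# Registry mapping a version number to a scraper function (hypothetical).
SCRAPERS = {}


def scraper(version):
    """Decorator that registers a scraper function under `version`."""
    def register(func):
        SCRAPERS[version] = func
        return func
    return register


@scraper(1)
def scrape_v1(html):
    # Stand-in for the original scraper: collect paragraph text.
    return re.findall(r"<p>(.*?)</p>", html, flags=re.S)


@scraper(2)
def scrape_v2(html):
    # Stand-in for a second scraper aimed at list-based layouts.
    return re.findall(r"<li>(.*?)</li>", html, flags=re.S)


def scrape_site(site, html):
    """Run the scraper named by the site's scraper_version (default 1)."""
    version = site.get("scraper_version", 1)
    try:
        scrape = SCRAPERS[version]
    except KeyError:
        raise ValueError(f"unknown scraper version: {version}") from None
    return scrape(html)
```

Defaulting to version 1 keeps existing site rows working unchanged, which is where the backwards-compatibility benefit comes from.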

Dump indexed content into FTS5 table

Search bogs down a lot when the entirety of my index is passed to the client. I think SQLite can manage the search entirely on its end via FTS5 tables. Building a secure endpoint to query it will take a little research, but returning only the matches should speed up the client considerably.
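
A minimal sketch of the idea using Python's built-in sqlite3 module, assuming the SQLite build includes the FTS5 extension. The table name page_index and its url/title/body columns are hypothetical, not the project's actual schema.

```python
import sqlite3


def build_index(conn, pages):
    """Create the FTS5 table and load (url, title, body) rows into it."""
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS page_index "
        "USING fts5(url UNINDEXED, title, body)"
    )
    conn.executemany(
        "INSERT INTO page_index (url, title, body) VALUES (?, ?, ?)", pages
    )
    conn.commit()


def search(conn, user_query, limit=20):
    """Return (url, title) rows that contain every term in user_query."""
    terms = user_query.split()
    if not terms:
        return []
    # Quote each term so user input is treated as text, not FTS5 query syntax.
    match = " ".join('"' + t.replace('"', '""') + '"' for t in terms)
    rows = conn.execute(
        "SELECT url, title FROM page_index WHERE page_index MATCH ? "
        "ORDER BY rank LIMIT ?",
        (match, limit),
    )
    return rows.fetchall()
```

The query string is always passed as a bound parameter, and each term is quoted before it reaches MATCH, which covers part of what a secure endpoint would need. Only the matching rows leave the database, so the client no longer receives the whole index.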

Better Element Query

My use of BeautifulSoup today only allows selection by element tag and class, which is sometimes not enough to identify the correct element. The code that converts a query into bs4 logic needs revision and abstraction, and additional parameters need to be taken into account. Here are the most obvious:

The use of an id
The use of a non-class attribute (aria-label or role, for example)
Specifying the nth match. Sometimes nothing else distinguishes the second section from the first
Greater specificity on the final element. Right now it just looks for 'a' and 'p' tags, but greater specificity here would particularly benefit less hierarchical sites
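
The list above could be captured in a small query object that is parsed once and then translated into bs4 calls. The sketch below covers only the parsing half and assumes a made-up, CSS-like syntax (tag, #id, .class, [attr=value], :n); ElementQuery and parse_query are hypothetical names, and the syntax is not the one chaos-index uses today.

```python
import re
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ElementQuery:
    tag: Optional[str] = None
    classes: list = field(default_factory=list)
    attrs: dict = field(default_factory=dict)  # id and other attributes
    nth: int = 1  # 1-based position among the matches


_TAG = re.compile(r"[a-zA-Z][\w-]*")
_TOKEN = re.compile(
    r"#(?P<id>[\w-]+)"
    r"|\.(?P<cls>[\w-]+)"
    r"|\[(?P<attr>[\w-]+)=(?P<val>[^\]]*)\]"
    r"|:(?P<nth>\d+)"
)


def parse_query(query):
    """Parse e.g. 'section.posts:2' or 'a[aria-label="Next"]'."""
    parsed = ElementQuery()
    pos = 0
    tag = _TAG.match(query)
    if tag:
        parsed.tag = tag.group()
        pos = tag.end()
    while pos < len(query):
        token = _TOKEN.match(query, pos)
        if not token:
            raise ValueError(f"cannot parse {query!r} at position {pos}")
        if token.group("id"):
            parsed.attrs["id"] = token.group("id")
        elif token.group("cls"):
            parsed.classes.append(token.group("cls"))
        elif token.group("attr"):
            # Allow the value to be written with or without quotes.
            parsed.attrs[token.group("attr")] = token.group("val").strip("\"'")
        else:
            parsed.nth = int(token.group("nth"))
        pos = token.end()
    return parsed
```

BeautifulSoup's select() and select_one() also accept CSS selectors directly, so it may be worth checking whether they already cover the id, attribute and nth cases before maintaining a custom syntax.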
