aavache / llmwebcrawler

A web crawler based on LLMs, implemented with Ray and Hugging Face. The embeddings are saved into a vector database for fast clustering and retrieval.

llmwebcrawler's Introduction

LLM-based Web Crawler

A scalable web crawler. Its main features are:

  • The service crawls the web recursively, storing each link, its text, and the corresponding text embedding.
  • A large language model (e.g. BERT) produces the text embeddings, i.e. a vector representation of the text present on each website.
  • The service is scalable: Ray distributes the work across multiple workers.
  • The entries are stored in a vector database. Vector databases are well suited to saving and retrieving samples according to a vector representation.
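
The recursive crawl in the first bullet can be sketched as a breadth-first traversal with a depth limit. This is only an illustration, not the repository's implementation: the `crawl` function, the `fetch` callable (assumed to return a page's text and its outgoing links), and the in-memory `FAKE_WEB` are all hypothetical names introduced here so the example runs offline.

```python
from collections import deque

def crawl(initial_urls, fetch, max_depth):
    """Breadth-first crawl, following links up to max_depth hops.

    fetch(url) is a hypothetical callable returning (page_text, outgoing_links).
    Returns a dict mapping every visited URL to its page text.
    """
    start = list(dict.fromkeys(initial_urls))  # drop duplicate seeds, keep order
    seen = set(start)
    queue = deque((url, 0) for url in start)
    pages = {}
    while queue:
        url, depth = queue.popleft()
        text, links = fetch(url)
        pages[url] = text
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return pages

# A tiny in-memory "web" standing in for real HTTP requests.
FAKE_WEB = {
    "a": ("page a", ["b", "c"]),
    "b": ("page b", ["d"]),
    "c": ("page c", ["a"]),
    "d": ("page d", []),
}

pages = crawl(["a"], lambda url: FAKE_WEB[url], max_depth=1)
```

In a real crawler, `fetch` would download and parse the page, and the text of each visited page would be embedded and stored instead of kept in a dict.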

By saving the representations in a vector database, you can retrieve similar pages according to how close two vectors are. This is critical for a search engine that must return the most relevant results.
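
To make "how close two vectors are" concrete, here is a minimal brute-force sketch using cosine similarity. The choice of cosine similarity is an assumption (the README does not say which metric is configured), and `cosine_similarity`, `most_similar`, and the example URLs are hypothetical. A vector database such as Milvus performs this kind of search with indexes rather than a full scan.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)

def most_similar(query, entries, top_k=3):
    """Return the top_k (url, score) pairs, sorted by descending similarity."""
    scored = [(url, cosine_similarity(query, vec)) for url, vec in entries.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]

# Toy 2-dimensional "embeddings"; real BERT embeddings have hundreds of dimensions.
entries = {
    "https://example.com/u1": [1.0, 0.0],
    "https://example.com/u2": [0.0, 1.0],
    "https://example.com/u3": [1.0, 1.0],
}
results = most_similar([1.0, 0.1], entries, top_k=2)
```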

CLI

Run the crawler from the terminal:

$ python cli_crawl.py --help

options:
  -h, --help            show this help message and exit
  -u INITIAL_URLS [INITIAL_URLS ...], --initial-urls INITIAL_URLS [INITIAL_URLS ...]
  -lm LANGUAGE_MODEL, --language-model LANGUAGE_MODEL
  -m MAX_DEPTH, --max-depth MAX_DEPTH
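
For readers who want to see how such options are typically declared, the following `argparse` sketch produces the same three flags. It is not taken from cli_crawl.py: the `build_parser` helper, the `int` type for the depth, and both default values are assumptions made for illustration.

```python
import argparse

def build_parser():
    """Hypothetical parser mirroring the flags shown in the help output."""
    parser = argparse.ArgumentParser(prog="cli_crawl.py")
    # nargs="+" accepts one or more seed URLs after -u.
    parser.add_argument("-u", "--initial-urls", nargs="+")
    # Default model is an assumption; the README names bert-base-uncased.
    parser.add_argument("-lm", "--language-model", default="bert-base-uncased")
    # Default depth is an assumption.
    parser.add_argument("-m", "--max-depth", type=int, default=1)
    return parser

# Equivalent to: python cli_crawl.py -u https://example.com -m 2
args = build_parser().parse_args(["-u", "https://example.com", "-m", "2"])
```

argparse turns the dashes in the long option names into underscores, so the parsed values are available as `args.initial_urls`, `args.language_model`, and `args.max_depth`.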

API

Serve the FastAPI app with uvicorn:

uvicorn api_app:app --host 0.0.0.0 --port 80

Take a look at the example in start_api_and_head_node.sh. Note that the Ray head node needs to be initialized first.

Large Language Model

For our use case, we simply use the BERT model implemented by Hugging Face to extract embeddings from the web text. More precisely, we use bert-base-uncased. Note that the code is model-agnostic: new models can be registered and added with a few lines of code. Take a look at llm/best.py.
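
One common way to make model registration a matter of a few lines is a name-to-function registry. The sketch below is an assumption about the general pattern, not the repository's code: `EMBEDDERS`, `register_embedder`, `embed`, and the toy embedder are hypothetical, and the toy function counts vowels instead of running a real language model.

```python
# Maps a model name to a function that turns text into a vector.
EMBEDDERS = {}

def register_embedder(name):
    """Decorator that registers an embedding function under a model name."""
    def decorator(fn):
        EMBEDDERS[name] = fn
        return fn
    return decorator

@register_embedder("toy-vowel-count")
def toy_embedder(text):
    """Stand-in for a real model: vowel counts, not a learned embedding."""
    lowered = text.lower()
    return [float(lowered.count(vowel)) for vowel in "aeiou"]

def embed(text, model_name):
    """Look up the registered model and embed the text with it."""
    try:
        return EMBEDDERS[model_name](text)
    except KeyError:
        raise ValueError(f"unknown model {model_name!r}") from None

vector = embed("Banana", "toy-vowel-count")
```

With this pattern, supporting another model means writing one decorated function; the crawler only ever calls `embed` with the name passed on the command line.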

Saving crawled data

We use Milvus as our main database. We chose a vector database for its inherent ability to store and search entries based on vector representations (embeddings).

Milvus lite

Start your standalone Milvus server as follows. I suggest using a terminal multiplexer such as tmux:

tmux new -s milvus
milvus-server

Take a look under scripts/ to see some of the basic requests to Milvus.

Docker compose

You can also use the official docker compose template:

docker compose --file milvus-docker-compose.yml up -d

Parallel computation

We use Ray, a great Python framework for distributed and parallel processing. Ray follows the master-worker paradigm, where a head node submits tasks to be executed by the connected workers.

Start the head and the worker nodes in Ray

Head node

  1. Set up the head node
ray start --head
  2. Connect your program to the head node
import ray

# Connect to the head
ray.init("auto")

To stop the Ray node:

ray stop

To check the cluster status:

ray status

Worker node

  1. Initialize the worker node
ray start

The worker node does not need a copy of the code: the head node serializes the implementation and its arguments and submits them to the workers.

Future features

The current implementation is a PoC. Many improvements can be made:

  • [Important] New API endpoint to search for similar URLs given a text query.
  • Optimize search and API.
  • Add new LLMs and new chunking strategies with popular libraries, e.g. LangChain.
  • Store more features in the vector DB, for example generated summaries.

Contributing

All issues and PRs are welcome 🙂.
