
duanyu / embedding_application


Some applications of text embedding models, e.g., semantic retrieval and clustering.

Jupyter Notebook 100.00%
clustering embeddings natural-langauge-processing semantic-retrieval text-embedding


embedding_application

Some applications of embedding models, such as semantic retrieval and clustering.

Semantic retrieval

Application: searching the Complete Works of Lu Xun

embedding_luxun_search.ipynb

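The Complete Works of Lu Xun are split into passages, each passage is embedded with bge-large-zh-v1.5, and the embeddings are imported into Milvus. After that, you can search the collection for all kinds of content.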

Because the search is semantic, even a query like “内卷与躺平” ("involution and lying flat") returns good results.
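The split, embed, index, search flow can be sketched in plain Python. This is a minimal illustration, not the notebook's code: `toy_embed` is a hypothetical stand-in for bge-large-zh-v1.5 (it counts character bigrams, so it matches wording rather than meaning), and an in-memory list with brute-force cosine similarity stands in for Milvus. The function names and the `max_chars` parameter are assumptions made for this example.

```python
import math
from collections import Counter


def split_passages(text, max_chars=200):
    """Split text on blank lines, then cut long paragraphs into max_chars chunks."""
    passages = []
    for para in text.split("\n\n"):
        para = para.strip()
        for start in range(0, len(para), max_chars):
            passages.append(para[start:start + max_chars])
    return passages


def toy_embed(text):
    """Hypothetical stand-in for a real embedding model: sparse bigram counts."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_index(passages):
    """Stand-in for the vector database: keep (passage, vector) pairs in a list."""
    return [(p, toy_embed(p)) for p in passages]


def search(index, query, top_k=3):
    """Return the top_k (score, passage) pairs, best match first."""
    q = toy_embed(query)
    scored = [(cosine(q, vec), passage) for passage, vec in index]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]


corpus = (
    "The moon rose over the old town.\n\n"
    "He wrote essays about society and change.\n\n"
    "The river froze in winter."
)
index = build_index(split_passages(corpus))
results = search(index, "essays about society", top_k=1)
```

With a real semantic model in place of `toy_embed`, the same flow is what lets a query share no words with a passage and still retrieve it.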

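Notes:

  1. Runs directly in the modelscope-GPU environment; the required libraries are listed in the notebook. In other environments, you also need to run pip3 install modelscope.
  2. Uses about 2 GB of GPU memory. It also runs on CPU, but very slowly.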

Interested readers can read this blog post for more details.

Clustering

Application: clustering morning news briefings

embedding_news_clustering.ipynb

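Morning news briefings from several WeChat official accounts are parsed into (title, passages) format. The titles are then embedded with bge-large-zh-v1.5 and clustered with DBSCAN to group identical and related news items.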

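Notes:

  1. Runs directly in the modelscope-GPU environment; the required libraries are listed in the notebook. In other environments, you also need to run pip3 install modelscope.
  2. For the DBSCAN hyperparameters, metric is cosine distance, eps is 0.4-0.45, and min_samples=2. A larger eps pulls in more “related” news; a smaller eps restricts clusters to “identical” news.
  3. URLs are parsed with unstructured, which depends on punkt and averaged_perceptron_tagger from nltk_data. If the nltk download is slow, use the pre-downloaded copies of punkt and averaged_perceptron_tagger directly (they are already in this project's source directory).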
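To show how eps and min_samples shape the clusters, here is a small pure-Python DBSCAN over cosine distance. It is a sketch for illustration only and assumes non-zero vectors; the notebook presumably relies on a library implementation, and the tiny 3-dimensional vectors below are made-up stand-ins for title embeddings.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def dbscan(vectors, eps=0.4, min_samples=2):
    """Return one label per vector; -1 marks noise, clusters are numbered from 0."""
    n = len(vectors)
    # Each neighborhood includes the point itself, so min_samples=2 means
    # "the point plus at least one other point within eps".
    neighbors = [
        [j for j in range(n) if cosine_distance(vectors[i], vectors[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # provisionally noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # j is a core point, keep expanding
    return labels


# Made-up vectors: two near-duplicate pairs and one unrelated item.
title_vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.9, 0.1],
    [0.0, 0.0, 1.0],
]
labels = dbscan(title_vectors, eps=0.4, min_samples=2)
```

Because min_samples=2, any two titles within eps of each other already form a cluster, and a title with no close neighbor is labeled -1 (noise) rather than forced into a group.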

Interested readers can read this blog post for more details.

