Here are some ideas to get you started:
- 🔭 I'm looking for like-minded friends to work with
- 🌱 I’m currently learning AIGC
- 💬 Ask me about anything, I am happy to help
- 📫 How to reach me: [email protected]
A universal local knowledge base solution based on a vector database and GPT-3.5 (基于向量数据库与GPT3.5的通用本地知识库方案)
License: GNU Affero General Public License v3.0
Hello, thank you very much for your contribution; it helped me better understand the implementation. I've run into a problem and would like to ask for your advice.
If I train an embedding model on the data I want to build the knowledge base from, and then, when vectorizing my local data, also vectorize and store that same training data in Qdrant, is that a bad idea?
I'd like to train the embedding model on data our company wants to consolidate into a knowledge base, hoping that similarity and accuracy during vector storage and search will be somewhat higher. How should I go about this?
Hello! In vectors_config=VectorParams(size=1536, distance=Distance.COSINE), how is 1536 derived? Changing it to other values such as 1000, 3000, or 532 raises an error. Thanks.
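For context on the question above: the `size` in `VectorParams` is not a tunable parameter; it must equal the fixed output dimensionality of the embedding model that produces the vectors. 1536 is the output dimensionality of OpenAI's text-embedding-ada-002 (the likely embedding model here, given that size), which is why other values fail at upsert or search time. A small sketch; the text2vec entry is an assumption based on its model card, not taken from this repo:

```python
# The `size` passed to VectorParams must equal the embedding model's fixed
# output dimensionality; it cannot be chosen freely.
EMBEDDING_DIMS = {
    "text-embedding-ada-002": 1536,  # OpenAI embeddings API output size
    "text2vec-large-chinese": 1024,  # assumed from the LERT-large hidden size
}

def vector_size_for(model_name: str) -> int:
    """Return the collection vector size to use for a known embedding model."""
    return EMBEDDING_DIMS[model_name]

print(vector_size_for("text-embedding-ada-002"))  # 1536
```

If you switch embedding models, recreate the collection with the new model's dimensionality rather than editing the number in isolation.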
Thanks for sharing!
I have a question for you. As I understand it, the GanymedeNil/text2vec-large-chinese model is based on the LERT pre-trained model and was trained on the Chinese STS-B dataset using the CoSENT method.
I now have some domain-specific texts and would like to fine-tune this embedding model on them, but the texts are unlabeled, so supervised fine-tuning with CoSENT is not possible.
So would a feasible approach for me be the following:
Intuitively, would these two approaches work? I would like to hear your thoughts.
Looking forward to your reply.
I would like to ask how you designed the task for the second-stage training.
I wish I had found this sooner. After doing my own research recently I arrived at the same idea, only to find you already had it five months ago.
I am worried that fine-tuning on my own data alone would hurt the original performance, so I would like to take your dataset and add my data on top of it for training. Would it be convenient to share the dataset?
Not the medical dataset, but the datasets used to train the two models you open-sourced.
Hello, the README says that fine-tuning GPT improves the results, but I don't see any fine-tuning related operations in the code.
May I ask where the GPT model used for fine-tuning comes from? Or is it one of the open-source foundation models?
Thanks.
How do I add content to the local knowledge base?
After converting the user's query q into a vector, shouldn't the topK results retrieved from the vector database be texts similar to q, rather than the answer to q? Or is my understanding of vector databases wrong? Thanks.
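On the question above: similarity search does return texts similar to q. In Q&A-style pipelines the usual trick is in what gets embedded versus what gets stored: the question (or title) is embedded, and its answer rides along as payload, so each of the topK similar questions carries an answer with it. A minimal pure-Python sketch, with toy 2-d vectors standing in for real embeddings and made-up Q&A pairs:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy index: each entry pairs a *question* vector with its answer as payload.
index = [
    {"vector": [1.0, 0.0],
     "payload": {"q": "How do I reset my password?", "a": "Use the reset link."}},
    {"vector": [0.0, 1.0],
     "payload": {"q": "How do I delete my account?", "a": "Contact support."}},
]

def answer(query_vector, k=1):
    """Return the answers attached to the k most similar stored questions."""
    ranked = sorted(index, key=lambda p: cosine(query_vector, p["vector"]),
                    reverse=True)
    return [p["payload"]["a"] for p in ranked[:k]]

print(answer([0.9, 0.1]))  # nearest stored question is the password one
```

In Qdrant terms, the answer text lives in each point's payload, so the search response already contains it; no second lookup is needed.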
Is the fine-tuning done through OpenAI's API, training on the newly accumulated Q&A data?
Regarding the content format of the demo txt: do entries have to be separated by #####, or is that just a single record you excerpted from the original MSD database? Thanks.
I chose the cMedQQ dataset, and I suspect that using the question as the title gives poor quality. After building the knowledge base from this dataset (around 9,000 rows), the topK results returned for my queries are quite bad.
Your self-trained version looks good. How well does it perform? Do you have evaluation metrics on the relevant datasets that you could share?
For example, a comparison between text2vec-large-chinese and text2vec-base-chinese would help people choose between them.
Thanks!
That is, the problem of OpenAI making things up (hallucination).
I built something similar following your project, computing similarity by brute force. Roughly at what data scale does a vector database become necessary?
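On the brute-force question above, a rough rule of thumb (an assumption, not a measured benchmark): an exact scan costs O(N·d) per query, which is usually fine up to somewhere in the 10^5 to 10^6 vector range with vectorized math, and a vector database's ANN index starts paying off beyond that, or earlier if query latency matters. For reference, exact brute-force top-k looks like this:

```python
import heapq
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def brute_force_top_k(query, corpus, k=3):
    """Exact search: score every vector, keep the k best. O(N*d) per query."""
    scored = ((cosine(query, vec), idx) for idx, vec in enumerate(corpus))
    return heapq.nlargest(k, scored)  # list of (score, index), best first

corpus = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
print(brute_force_top_k([1.0, 0.2], corpus, k=2))
```

The upside of the exact scan is perfect recall and zero infrastructure; the ANN index trades a little recall for sublinear query time.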
https://github.com/thinksoso/MedChat
youngvonli AT gmail
openai.error.APIConnectionError: Error communicating with OpenAI: HTTPSConnectionPool(host='api.openai.com', port=443): Max retries exceeded with url: /v1/embeddings (Caused by SSLError(
[2023-04-02 13:45:47,078] ERROR in app: Exception on /search [POST]
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/httpcore/_exceptions.py", line 10, in map_exceptions
yield
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/httpcore/backends/sync.py", line 94, in connect_tcp
sock = socket.create_connection(
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/socket.py", line 845, in create_connection
raise err
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/socket.py", line 833, in create_connection
sock.connect(sa)
ConnectionRefusedError: [Errno 61] Connection refused
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/httpx/_transports/default.py", line 60, in map_httpcore_exceptions
yield
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/httpx/_transports/default.py", line 218, in handle_request
resp = self._pool.handle_request(req)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/httpcore/_sync/connection_pool.py", line 253, in handle_request
raise exc
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/httpcore/_sync/connection_pool.py", line 237, in handle_request
response = connection.handle_request(request)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/httpcore/_sync/connection.py", line 86, in handle_request
raise exc
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/httpcore/_sync/connection.py", line 63, in handle_request
stream = self._connect(request)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/httpcore/_sync/connection.py", line 111, in _connect
stream = self._network_backend.connect_tcp(**kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/httpcore/backends/sync.py", line 93, in connect_tcp
with map_exceptions(exc_map):
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/contextlib.py", line 153, in __exit__
self.gen.throw(typ, value, traceback)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/httpcore/_exceptions.py", line 14, in map_exceptions
raise to_exc(exc)
httpcore.ConnectError: [Errno 61] Connection refused
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/qdrant_client/http/api_client.py", line 95, in send_inner
response = self._client.send(request)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/httpx/_client.py", line 908, in send
response = self._send_handling_auth(
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/httpx/_client.py", line 936, in _send_handling_auth
response = self._send_handling_redirects(
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/httpx/_client.py", line 973, in _send_handling_redirects
response = self._send_single_request(request)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/httpx/_client.py", line 1009, in _send_single_request
response = transport.handle_request(request)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/httpx/_transports/default.py", line 217, in handle_request
with map_httpcore_exceptions():
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/contextlib.py", line 153, in __exit__
self.gen.throw(typ, value, traceback)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/httpx/_transports/default.py", line 77, in map_httpcore_exceptions
raise mapped_exc(message) from exc
httpx.ConnectError: [Errno 61] Connection refused
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/flask/app.py", line 2528, in wsgi_app
response = self.full_dispatch_request()
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/flask/app.py", line 1825, in full_dispatch_request
rv = self.handle_user_exception(e)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/flask/app.py", line 1823, in full_dispatch_request
rv = self.dispatch_request()
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/flask/app.py", line 1799, in dispatch_request
return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
File "/Users/xiaolu/Desktop/github/document.ai/code/server/server.py", line 105, in search
res = query(search)
File "/Users/xiaolu/Desktop/github/document.ai/code/server/server.py", line 64, in query
search_result = client.search(
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/qdrant_client/qdrant_client.py", line 253, in search
return self._client.search(
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/qdrant_client/qdrant_remote.py", line 419, in search
search_result = self.http.points_api.search_points(
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/qdrant_client/http/api/points_api.py", line 963, in search_points
return self._build_for_search_points(
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/qdrant_client/http/api/points_api.py", line 488, in _build_for_search_points
return self.api_client.request(
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/qdrant_client/http/api_client.py", line 68, in request
return self.send(request, type)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/qdrant_client/http/api_client.py", line 85, in send
response = self.middleware(request, self.send_inner)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/qdrant_client/http/api_client.py", line 188, in __call__
return call_next(request)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/qdrant_client/http/api_client.py", line 97, in send_inner
raise ResponseHandlingException(e)
qdrant_client.http.exceptions.ResponseHandlingException: [Errno 61] Connection refused
For example, for my own notes on some topic in Notion, should they be recorded in a form like the following?
{
title: "Research on some area",
text: "the content of a specific piece of research, possibly very complex and fragmented"
}
Also, many knowledge bases have a tag system, and I tag certain content. How should that information be incorporated into the knowledge base or the vectors?
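On the tag question just above, one common pattern (a sketch, not this repo's implementation) is to keep tags in each point's payload next to the vector and filter candidates by tag at query time; Qdrant supports the same idea natively via payload filters. The field names below (`title`, `tags`) are illustrative:

```python
# Each point stores its metadata (title, tags) as payload next to the vector;
# at query time, candidates are restricted by tag before or during ranking.
points = [
    {"id": 1, "payload": {"title": "Research notes A", "tags": ["ml", "notes"]}},
    {"id": 2, "payload": {"title": "Meeting minutes", "tags": ["work"]}},
]

def with_tag(points, tag):
    """Keep only points whose payload carries the given tag."""
    return [p for p in points if tag in p["payload"]["tags"]]

print([p["id"] for p in with_tag(points, "ml")])  # [1]
```

Filtering by payload keeps tags out of the embedding itself, so re-tagging content never requires re-embedding it.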
import_data.py error
qdrant_client.http.exceptions.UnexpectedResponse: Unexpected Response: 503 (Service Unavailable)
Raw response content:
b''
How to set the parameters of the clients?
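Regarding the connection errors above: `[Errno 61] Connection refused` from `qdrant_client` usually means no Qdrant server is listening at the host/port the client was given (Qdrant's default HTTP port is 6333). A quick stdlib probe can confirm this before constructing the client; the host/port defaults below are the usual ones, adjust them to your setup:

```python
import socket

def qdrant_reachable(host: str = "localhost", port: int = 6333,
                     timeout: float = 1.0) -> bool:
    """TCP probe: returns False when nothing is listening at host:port,
    i.e. the 'Connection refused' case in the traceback above."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

print(qdrant_reachable())
```

If the probe fails, start the Qdrant server first (for example via its Docker image) and make sure the client is pointed at the same host and port.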
The long response times are because the OpenAI endpoint has been blocked. There are already many public workarounds; please make good use of search.
Hello, thank you very much for your project, I have learned a lot from it!
Did you crawl the MSD dataset from the official site yourself? I saw that OpenAI's official examples include a webQA crawler; I wonder if that approach would also work here.
Thanks for sharing. For people with limited compute, is it possible to fine-tune an existing embedding model on domain-specific knowledge, or do incremental training in a LoRA-like way?