ganymedenil / document.ai

A universal local knowledge base solution based on a vector database and GPT-3.5

License: GNU Affero General Public License v3.0

Python 64.55% HTML 35.45%

document.ai's Introduction

Hi, I'm GanymedeNil, a Developer 🚀 from China.

Here are some ideas to get you started:

  • 🔭 I'm looking for like-minded friends to work with
  • 🌱 I’m currently learning AIGC
  • 💬 Ask me about anything, I am happy to help
  • 📫 How to reach me: [email protected]

document.ai's People

Contributors

coderabbit214, ganymedenil, hiqiuyi

document.ai's Issues

OpenAI API - Access Terminated😭

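I added one line of data to import:
cat source_data/004.txt
Asthma#####An asthma attack can present as sudden wheezing, coughing, and shortness of breath. Sometimes asthma comes on slowly, with symptoms gradually worsening. In either type of attack, the patient first notices shortness of breath, coughing, or chest tightness. An attack can end within minutes or last for hours or days. Itching of the skin on the chest or neck can be an early symptom of asthma, especially in children. A dry cough at night or during exercise can be the only symptom of asthma.
[image]

My API access was terminated right away. I checked the usage policy: the API may not be used for health diagnosis. Be careful, everyone, so your API access does not get banned as well.
[image]

P.S. That said, big praise for this project. It feels like the right way to use AIGC: you can train a knowledge base for a niche vertical domain from your own data. Really awesome.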
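
The sample record above puts a title and a body on one line, separated by `#####`. A minimal sketch of reading such a record into a title/text pair follows; it is based only on that one sample line, and the function name `parse_record` and the returned dictionary keys are assumptions for illustration, not necessarily what the project's import script does.

```python
from typing import Dict

SEPARATOR = "#####"  # separator seen in the sample record above


def parse_record(raw: str) -> Dict[str, str]:
    """Split one 'title#####text' record into a dict with 'title' and 'text'."""
    title, sep, text = raw.partition(SEPARATOR)
    # Reject lines without a separator or with an empty title or body.
    if not sep or not title.strip() or not text.strip():
        raise ValueError("expected a record of the form 'title#####text'")
    return {"title": title.strip(), "text": text.strip()}


record = parse_record("Asthma#####An asthma attack can present as sudden wheezing.")
print(record["title"])
```

Validating the record before it is embedded makes a malformed line fail early instead of being stored with an empty title.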

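Question about a self-trained Embedding model

Hello, thank you very much for your contribution; it helped me understand the implementation much better. I have run into a problem and would like your advice.

If I train an Embedding model on the data I want to build the knowledge base from, and then, when vectorizing the local data, also vectorize the Embedding training data and store it in Qdrant, is that a bad idea?

I want to train an Embedding model on the data our company already has and wants to consolidate into a knowledge base, hoping that similarity and accuracy will be somewhat better during vector storage and search. How should I go about it?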

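How to fine-tune an existing Embedding model with unsupervised data?

Thanks for sharing!

I have a question. As I understand it, the GanymedeNil/text2vec-large-chinese model is based on the LERT pre-trained model and was trained with the CoSENT method on the Chinese STS-B dataset.

I now have some domain-specific text and want to fine-tune this Embedding model on it, but the text is unlabeled, so I cannot do supervised fine-tuning with CoSENT.
Would the following be feasible approaches?

  • Fine-tune LERT on the domain text with unsupervised MLM, then fine-tune on STS-B;
  • Try to construct text pairs from the domain text and fine-tune with CoSENT;

Intuitively, would these two approaches work? I would like to hear your view.
Looking forward to your reply.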
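
The second option in the list above, constructing text pairs from unlabeled domain text, could look roughly like the sketch below. The heuristic is an assumption made for illustration and is not described in the issue or confirmed by the author: neighbouring sentences in one document are treated as similar (label 1) and sentences drawn from a different document as dissimilar (label 0). The name `build_pseudo_pairs` is hypothetical, and labels produced this way are noisy.

```python
import random
from typing import List, Tuple

Pair = Tuple[str, str, int]  # (sentence_a, sentence_b, pseudo_label)


def build_pseudo_pairs(docs: List[List[str]], seed: int = 0) -> List[Pair]:
    """Build weakly labeled sentence pairs from unlabeled documents.

    Assumed heuristic: adjacent sentences of one document get label 1,
    a sentence paired with one from another document gets label 0.
    """
    rng = random.Random(seed)  # fixed seed keeps the sampling reproducible
    pairs: List[Pair] = []
    for i, sentences in enumerate(docs):
        # Candidate negatives: every sentence that lives in a different document.
        others = [s for j, doc in enumerate(docs) if j != i for s in doc]
        for a, b in zip(sentences, sentences[1:]):
            pairs.append((a, b, 1))
            if others:
                pairs.append((a, rng.choice(others), 0))
    return pairs


docs = [["a1", "a2", "a3"], ["b1", "b2"]]
for pair in build_pseudo_pairs(docs):
    print(pair)
```

Pairs in this form could then be fed to a pairwise objective such as CoSENT; whether the pseudo-labels are clean enough to help is something only an evaluation on held-out domain queries can show.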

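Hello, would you be able to share the datasets for text2vec-large/base-chinese?

I wish I had found this sooner. After my recent research I arrived at the same idea as you, but you already had it five months ago.

I am worried that fine-tuning on my own data alone would degrade the original performance, so I would like to take your dataset, add my data to it, and train on the combination. Would it be possible for you to provide the dataset?

I do not mean the medical dataset, but the datasets used to train your two open-source models.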

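About GPT fine-tune

Hello, the readme says that fine-tuning GPT improves the results, but I could not find any fine-tuning operations in the code.
May I ask where the GPT model used for fine-tuning comes from? Or do you use some open-source foundation model?

Thanks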

About Fine-tune

Is fine-tuning done by training the model through OpenAI's API, based on newly accumulated Q&A data?

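Inaccurate vector matching

I use the GPT API to insert vectors and payloads into the vector database. When a user enters a query, the data returned from the vector database is not accurate, as shown in the screenshots below.
[image]
[image]
What causes this?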
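
When matches look wrong, it can help to inspect the raw similarity scores rather than only the returned text. The sketch below ranks stored vectors by cosine similarity and drops results under a threshold. It is plain Python for illustration only; the names `top_k` and `min_score` and the toy two-dimensional vectors are assumptions, and it does not diagnose the specific case reported above.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float],
    corpus: List[Tuple[str, Sequence[float]]],
    k: int = 3,
    min_score: float = 0.0,
) -> List[Tuple[str, float]]:
    """Return up to k (title, score) pairs with score >= min_score, best first."""
    scored = [(title, cosine_similarity(query, vec)) for title, vec in corpus]
    kept = [item for item in scored if item[1] >= min_score]
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept[:k]


corpus = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [1.0, 1.0])]
print(top_k([1.0, 0.0], corpus, k=2))
```

If the top scores for a query are all low or nearly identical, the stored text or the embedding model is a more likely culprit than the search call itself.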

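Question about choosing the title when uploading points to the qdrant database

I am using the cMedQQ dataset. I suspect that using the question as the title gives poor quality, because after building the database from this dataset (a little over 9,000 rows), the topK results returned for a query are very poor.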

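How well do the text2vec models perform?

The author's self-trained versions look good. How do they perform? Are there evaluation metrics on relevant datasets that you could share?

For example, a comparison between text2vec-large-chinese and text2vec-base-chinese would help people choose.

Thanks!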

It says I have hit my quota, even though I just topped up my account

[2023-04-02 13:45:47,078] ERROR in app: Exception on /search [POST]
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/httpcore/_exceptions.py", line 10, in map_exceptions
yield
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/httpcore/backends/sync.py", line 94, in connect_tcp
sock = socket.create_connection(
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/socket.py", line 845, in create_connection
raise err
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/socket.py", line 833, in create_connection
sock.connect(sa)
ConnectionRefusedError: [Errno 61] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/httpx/_transports/default.py", line 60, in map_httpcore_exceptions
yield
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/httpx/_transports/default.py", line 218, in handle_request
resp = self._pool.handle_request(req)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/httpcore/_sync/connection_pool.py", line 253, in handle_request
raise exc
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/httpcore/_sync/connection_pool.py", line 237, in handle_request
response = connection.handle_request(request)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/httpcore/_sync/connection.py", line 86, in handle_request
raise exc
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/httpcore/_sync/connection.py", line 63, in handle_request
stream = self._connect(request)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/httpcore/_sync/connection.py", line 111, in _connect
stream = self._network_backend.connect_tcp(**kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/httpcore/backends/sync.py", line 93, in connect_tcp
with map_exceptions(exc_map):
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/contextlib.py", line 153, in __exit__
self.gen.throw(typ, value, traceback)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/httpcore/_exceptions.py", line 14, in map_exceptions
raise to_exc(exc)
httpcore.ConnectError: [Errno 61] Connection refused

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/qdrant_client/http/api_client.py", line 95, in send_inner
response = self._client.send(request)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/httpx/_client.py", line 908, in send
response = self._send_handling_auth(
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/httpx/_client.py", line 936, in _send_handling_auth
response = self._send_handling_redirects(
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/httpx/_client.py", line 973, in _send_handling_redirects
response = self._send_single_request(request)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/httpx/_client.py", line 1009, in _send_single_request
response = transport.handle_request(request)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/httpx/_transports/default.py", line 217, in handle_request
with map_httpcore_exceptions():
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/contextlib.py", line 153, in __exit__
self.gen.throw(typ, value, traceback)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/httpx/_transports/default.py", line 77, in map_httpcore_exceptions
raise mapped_exc(message) from exc
httpx.ConnectError: [Errno 61] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/flask/app.py", line 2528, in wsgi_app
response = self.full_dispatch_request()
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/flask/app.py", line 1825, in full_dispatch_request
rv = self.handle_user_exception(e)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/flask/app.py", line 1823, in full_dispatch_request
rv = self.dispatch_request()
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/flask/app.py", line 1799, in dispatch_request
return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
File "/Users/xiaolu/Desktop/github/document.ai/code/server/server.py", line 105, in search
res = query(search)
File "/Users/xiaolu/Desktop/github/document.ai/code/server/server.py", line 64, in query
search_result = client.search(
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/qdrant_client/qdrant_client.py", line 253, in search
return self._client.search(
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/qdrant_client/qdrant_remote.py", line 419, in search
search_result = self.http.points_api.search_points(
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/qdrant_client/http/api/points_api.py", line 963, in search_points
return self._build_for_search_points(
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/qdrant_client/http/api/points_api.py", line 488, in _build_for_search_points
return self.api_client.request(
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/qdrant_client/http/api_client.py", line 68, in request
return self.send(request, type_)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/qdrant_client/http/api_client.py", line 85, in send
response = self.middleware(request, self.send_inner)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/qdrant_client/http/api_client.py", line 188, in __call__
return call_next(request)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/qdrant_client/http/api_client.py", line 97, in send_inner
raise ResponseHandlingException(e)
qdrant_client.http.exceptions.ResponseHandlingException: [Errno 61] Connection refused
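
The last exception in the traceback is a `Connection refused` raised while `qdrant_client` was sending the search request, which points at the Qdrant server being unreachable rather than at an API quota. A quick way to check this is to test whether anything is listening on the server's port. The sketch below uses only the standard library; the host `127.0.0.1` and port `6333` (Qdrant's default HTTP port) are assumptions, since the traceback does not show which address `server.py` connects to, and `is_port_open` is a hypothetical helper name.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers ConnectionRefusedError, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Assumed address: a local Qdrant instance on its default HTTP port.
    print(is_port_open("127.0.0.1", 6333))
```

If this prints `False`, starting the Qdrant service (or correcting the host and port in the client configuration) is the first thing to try before looking at billing.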

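Problem calling SentenceTransformer

Many thanks to the author for sharing!
While using the LangChain-chatglm project I ran into problems loading the model files. I have roughly determined that the cause is that the model files are missing modules.json, the pooling folder, and sentence_xlnet_config.json. Would there be any problem if I copied all of these over from shibing624/text2vec? Are any other special settings needed?

[image]

[image]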

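How should a knowledge base that includes tags be organized?

For example, should my notes on some topic in Notion be recorded in a form like the following?

{
    title: "Research in some field",
    text: "The content of a specific piece of research, which may be very complex and fragmented"
}

Also, many knowledge bases have a tag system, and I tag my content. How can this information be incorporated into the knowledge base or the vectors?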
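
One possible way to carry tags along, sketched under assumptions and not confirmed by the project: keep the tags as a list in the stored payload so they can be filtered on, and optionally append them to the text that gets embedded so they also influence similarity. The names `build_point`, `embed_input`, and `filter_by_tag` are hypothetical.

```python
from typing import Dict, List


def build_point(title: str, text: str, tags: List[str]) -> Dict[str, object]:
    """Prepare the text to embed and the payload to store for one note."""
    # Normalize tags: trim, lowercase, drop empties and duplicates.
    clean_tags = sorted({t.strip().lower() for t in tags if t.strip()})
    embed_input = f"{title}\n{text}"
    if clean_tags:
        # Assumption: appending tags lets them influence the embedding.
        embed_input += "\nTags: " + ", ".join(clean_tags)
    payload = {"title": title, "text": text, "tags": clean_tags}
    return {"embed_input": embed_input, "payload": payload}


def filter_by_tag(points: List[Dict[str, object]], tag: str) -> List[Dict[str, object]]:
    """Keep only points whose payload carries the given tag."""
    wanted = tag.strip().lower()
    return [p for p in points if wanted in p["payload"]["tags"]]


point = build_point("Research in some field", "Details of one study.", ["AI", " ai ", "Notes"])
print(point["payload"]["tags"])
```

Storing tags in the payload keeps them available for exact filtering, while the embedded text stays free-form; which of the two matters more depends on how the notes are queried.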

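Question about crawling the dataset

Hello, thank you very much for your project; I have learned a lot from it.

Did you crawl your MSD (默沙东) dataset from the official website yourself? I saw that OpenAI's official examples include a web QA crawler example. Would doing it that way work as well?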
