ganymedenil / document.ai


A universal local knowledge base solution based on a vector database and GPT-3.5

License: GNU Affero General Public License v3.0

Python 64.55% HTML 35.45%

document.ai's Introduction

Hi, I'm GanymedeNil, a Developer 🚀 from China.

Here are some ideas to get you started:

🔭 I'm looking for like-minded friends to work with
🌱 I’m currently learning AIGC
💬 Ask me about anything, I am happy to help
📫 How to reach me: [email protected]


document.ai's Issues

OpenAI API - Access Terminated😭

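I added one line of data to import:
cat source_data/004.txt
Asthma#####An asthma attack can present as sudden wheezing, coughing, and shortness of breath. Sometimes an attack develops slowly, with symptoms gradually worsening. Whichever type of attack it is, people with asthma first notice shortness of breath, coughing, or chest tightness. An attack may end within a few minutes or last for hours or days. Itching of the skin on the chest or neck can be an early symptom of asthma, especially in children. A dry cough at night or during exercise may be the only symptom of asthma.

My API access was disabled right away. I checked the usage policies, and the API may not be used for health diagnosis. Be careful, everyone, so that your API access doesn't get banned.

P.S. That said, big praise for this project. It feels like the right way to use AIGC: being able to train a knowledge base for a narrow vertical domain from your own data is awesome.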

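A question about self-trained Embedding models

Hello, and thank you very much for your contribution; it helped me understand the implementation approach much better. I've run into a problem and would like to ask for your advice.

If I train an Embedding model on the data I want to build the knowledge base from, and then, when vectorizing my local data, also vectorize the data used to train the Embedding model and store it in Qdrant, is that inappropriate?

I'd like to train the Embedding model on data our company already has and wants to consolidate into a knowledge base, hoping that similarity and accuracy will be somewhat better for vector storage and search. How should I go about it?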

Question about the vectors_config=VectorParams(size=1536, distance=Distance.COSINE) parameter

Hello! In vectors_config=VectorParams(size=1536, distance=Distance.COSINE), how is 1536 calculated? If I change it to other values, such as 1000, 3000, or 532, I get errors. Thank you.
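
For context, size in Qdrant's VectorParams is the dimensionality of the vectors a collection stores, so it normally has to equal the length of the vectors the embedding model returns. OpenAI's text-embedding-ada-002 returns 1536-dimensional vectors, which would explain the value if that is the model the project calls (an assumption here; the issue does not name the model). The sketch below is a plain-Python stand-in for that length check. EXPECTED_DIM and validate_vector are hypothetical names, not part of qdrant_client.

```python
EXPECTED_DIM = 1536  # assumed: output length of OpenAI text-embedding-ada-002


def validate_vector(vector, expected_dim=EXPECTED_DIM):
    """Raise ValueError if the vector length differs from the collection size."""
    if len(vector) != expected_dim:
        raise ValueError(
            f"vector has {len(vector)} dimensions, collection expects {expected_dim}"
        )
    return vector


if __name__ == "__main__":
    embedding = [0.0] * 1536      # stands in for a real embedding
    validate_vector(embedding)    # passes: lengths agree
    try:
        # Mimics a collection created with size=1000 receiving 1536-dim vectors.
        validate_vector(embedding, expected_dim=1000)
    except ValueError as err:
        print(err)
```

Under this reading, size is not computed by the caller; it is fixed by whichever embedding model produces the vectors.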

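How can I fine-tune an existing Embedding model with unlabeled data?

Thanks for sharing!

I have a question for you. As I understand it, the GanymedeNil/text2vec-large-chinese model is based on the LERT pre-trained model and was trained with the CoSENT method on the Chinese STS-B dataset.

I now have some domain-specific text and would like to fine-tune this Embedding model on it, but the text is unlabeled, so I can't use CoSENT for supervised fine-tuning.
Would either of the following be a feasible approach?

Fine-tune LERT on the domain text with unsupervised MLM, then fine-tune on STS-B;
Try to construct text pairs from the domain text and fine-tune with CoSENT.

Intuitively, would these two approaches work? I'd like to hear your thoughts.
Looking forward to your reply.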

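Hello, would you be willing to share the datasets for text2vec-large/base-chinese?

I really wish I had found this sooner. After my recent research I arrived at the same idea as you, but you already had it five months ago.

I'm worried that fine-tuning on my own data alone would degrade the original performance, so I'd like to take your dataset and add my data to it for training. Would you be able to provide the dataset?

I don't mean the medical dataset, but the datasets used to train your two open-source models.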

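About GPT fine-tune

Hello, the readme says that fine-tuning GPT improves the results, but I couldn't find any fine-tuning code in the repository.
May I ask where the GPT model used for fine-tuning comes from? Or is it one of the open-source foundation models?

Thanks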

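Shouldn't the vector database return content similar to the question?

After the user's q is converted to a vector, the topK results retrieved from the vector database should be items similar to q, not the answers to q, right? Or am I misunderstanding how vector databases work? Thanks.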
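
To make the question concrete: a nearest-neighbour search ranks the stored vectors by similarity to the query vector and returns whatever payload was stored alongside each one. Whether that payload reads as an answer depends on which text was embedded and which text was stored with it. A minimal brute-force sketch in plain Python follows, using made-up 3-dimensional vectors and hypothetical names (cosine, top_k); it is not the project's code.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, points, k=2):
    """Return the k (score, payload) pairs most similar to query_vec."""
    scored = [(cosine(query_vec, p["vector"]), p["payload"]) for p in points]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


# Toy collection: each point pairs a vector with the payload stored next to it.
points = [
    {"vector": [1.0, 0.0, 0.0], "payload": {"title": "A", "text": "text stored with A"}},
    {"vector": [0.9, 0.1, 0.0], "payload": {"title": "B", "text": "text stored with B"}},
    {"vector": [0.0, 0.0, 1.0], "payload": {"title": "C", "text": "text stored with C"}},
]

hits = top_k([1.0, 0.0, 0.0], points, k=2)
for score, payload in hits:
    print(round(score, 3), payload["title"], payload["text"])
```

The search itself only knows about vector similarity; the text field in each hit is simply what was saved with the matching vector.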

About Fine-tune

Is fine-tuning done by training the model through OpenAI's API, using the new question-and-answer data that has been accumulated?

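Are there any mandatory format requirements for the source data?

Looking at the txt files in the demo, must the content be separated by #####, or is that just how the entry you took from the original MSD database looked? Thanks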
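
For illustration, the sample line quoted in the first issue has the shape title#####text. Assuming each line of a source file follows this shape and that ##### is simply the separator the import step splits on (an assumption; the import script is not shown here), a parser could look like the sketch below. parse_line is a hypothetical helper.

```python
SEPARATOR = "#####"  # assumed delimiter, as seen in source_data/004.txt


def parse_line(line, separator=SEPARATOR):
    """Split one 'title#####text' line into a payload dict."""
    title, sep, text = line.strip().partition(separator)
    if not sep or not title or not text:
        raise ValueError(f"expected 'title{separator}text', got: {line!r}")
    return {"title": title, "text": text}


record = parse_line("Asthma#####An asthma attack can present as sudden wheezing.\n")
print(record["title"])
print(record["text"])
```

If the importer works this way, any separator would do as long as it never occurs inside a title or body.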

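Inaccurate vector matching

I use the GPT API to insert vectors and payloads into the vector database. When a user enters a query, the data returned from the vector database is not accurate, as shown in the image below.

What could be causing this?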

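About choosing the title when uploading points to Qdrant

I chose the cMedQQ dataset. I suspect that using its question field as the title gives poor quality: after building the collection from this dataset (a little over 9,000 rows), the topK results returned for a query seem quite poor.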

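How well do the text2vec models perform?

The versions you trained yourself look good. How do they perform? Are there evaluation metrics on relevant datasets that you could share?

For example, a comparison of text2vec-large-chinese and text2vec-base-chinese, so that people can choose between them.

Thanks!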

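Data scale

I built a project modeled on yours that computes similarity by brute force. Roughly at what data scale does a vector database become necessary?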
https://github.com/thinksoso/MedChat

After startup I get: openai.error.APIConnectionError: Error communicating with OpenAI. Is this caused by a failed SSL request?

openai.error.APIConnectionError: Error communicating with OpenAI: HTTPSConnectionPool(host='api.openai.com', port=443): Max retries exceeded with url: /v1/embeddings (Caused by SSLError(

[2023-04-02 13:45:47,078] ERROR in app: Exception on /search [POST]
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/httpcore/_exceptions.py", line 10, in map_exceptions
yield
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/httpcore/backends/sync.py", line 94, in connect_tcp
sock = socket.create_connection(
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/socket.py", line 845, in create_connection
raise err
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/socket.py", line 833, in create_connection
sock.connect(sa)
ConnectionRefusedError: [Errno 61] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/httpx/_transports/default.py", line 60, in map_httpcore_exceptions
yield
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/httpx/_transports/default.py", line 218, in handle_request
resp = self._pool.handle_request(req)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/httpcore/_sync/connection_pool.py", line 253, in handle_request
raise exc
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/httpcore/_sync/connection_pool.py", line 237, in handle_request
response = connection.handle_request(request)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/httpcore/_sync/connection.py", line 86, in handle_request
raise exc
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/httpcore/_sync/connection.py", line 63, in handle_request
stream = self._connect(request)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/httpcore/_sync/connection.py", line 111, in _connect
stream = self._network_backend.connect_tcp(**kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/httpcore/backends/sync.py", line 93, in connect_tcp
with map_exceptions(exc_map):
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/contextlib.py", line 153, in __exit__
self.gen.throw(typ, value, traceback)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/httpcore/_exceptions.py", line 14, in map_exceptions
raise to_exc(exc)
httpcore.ConnectError: [Errno 61] Connection refused

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/qdrant_client/http/api_client.py", line 95, in send_inner
response = self._client.send(request)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/httpx/_client.py", line 908, in send
response = self._send_handling_auth(
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/httpx/_client.py", line 936, in _send_handling_auth
response = self._send_handling_redirects(
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/httpx/_client.py", line 973, in _send_handling_redirects
response = self._send_single_request(request)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/httpx/_client.py", line 1009, in _send_single_request
response = transport.handle_request(request)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/httpx/_transports/default.py", line 217, in handle_request
with map_httpcore_exceptions():
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/contextlib.py", line 153, in __exit__
self.gen.throw(typ, value, traceback)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/httpx/_transports/default.py", line 77, in map_httpcore_exceptions
raise mapped_exc(message) from exc
httpx.ConnectError: [Errno 61] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/flask/app.py", line 2528, in wsgi_app
response = self.full_dispatch_request()
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/flask/app.py", line 1825, in full_dispatch_request
rv = self.handle_user_exception(e)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/flask/app.py", line 1823, in full_dispatch_request
rv = self.dispatch_request()
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/flask/app.py", line 1799, in dispatch_request
return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
File "/Users/xiaolu/Desktop/github/document.ai/code/server/server.py", line 105, in search
res = query(search)
File "/Users/xiaolu/Desktop/github/document.ai/code/server/server.py", line 64, in query
search_result = client.search(
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/qdrant_client/qdrant_client.py", line 253, in search
return self._client.search(
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/qdrant_client/qdrant_remote.py", line 419, in search
search_result = self.http.points_api.search_points(
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/qdrant_client/http/api/points_api.py", line 963, in search_points
return self._build_for_search_points(
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/qdrant_client/http/api/points_api.py", line 488, in _build_for_search_points
return self.api_client.request(
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/qdrant_client/http/api_client.py", line 68, in request
return self.send(request, type_)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/qdrant_client/http/api_client.py", line 85, in send
response = self.middleware(request, self.send_inner)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/qdrant_client/http/api_client.py", line 188, in __call__
return call_next(request)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/qdrant_client/http/api_client.py", line 97, in send_inner
raise ResponseHandlingException(e)
qdrant_client.http.exceptions.ResponseHandlingException: [Errno 61] Connection refused
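
Reading the chained tracebacks above, the final exception is raised inside qdrant_client with Connection refused, which suggests the server could not reach Qdrant, separately from whatever happened with the OpenAI request. One quick way to check whether anything is listening is a plain TCP probe. The sketch assumes Qdrant's default HTTP port 6333 on localhost; port_open is a hypothetical helper.

```python
import socket


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # 6333 is Qdrant's default HTTP port; adjust if the server is configured differently.
    print("Qdrant reachable:", port_open("localhost", 6333))
```

If the probe reports the port as closed, starting the Qdrant service (or pointing the client at the right host and port) would be the first thing to try.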

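{
    title: "Research in some field",
    text: "The detailed content of a particular study, which can be very complex and fragmented"
}

Another thing: many knowledge bases have a tag system, and I tag individual pieces of content. How can this information be incorporated into the knowledge base or the vectors?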
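
One possible way to carry tags, sketched here as an assumption rather than as the project's design: keep them in the payload next to title and text, and filter on them before ranking by similarity. Qdrant payloads are JSON objects and its search accepts payload filters; the plain-Python stand-in below only mimics that idea. filter_by_tag and the tags field are hypothetical names.

```python
def filter_by_tag(points, tag):
    """Keep only the points whose payload lists the given tag."""
    return [p for p in points if tag in p["payload"].get("tags", [])]


# Toy records: tags live in the payload, next to title and text.
points = [
    {"id": 1, "payload": {"title": "Study A", "text": "body A", "tags": ["biology", "2023"]}},
    {"id": 2, "payload": {"title": "Study B", "text": "body B", "tags": ["physics"]}},
    {"id": 3, "payload": {"title": "Study C", "text": "body C"}},  # untagged
]

candidates = filter_by_tag(points, "biology")
print([p["id"] for p in candidates])
```

Keeping tags out of the embedded text and in the payload means they act as exact filters instead of influencing the similarity score; embedding them together with the text would be a different design with different trade-offs.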
