Comments (4)
Thank you for reporting this issue. Can you show how you are creating the Embeddings object? Are you using transformer-backed or word-backed vectors?
This is a good find with the default tokenizer. Another issue (#39) is most likely running into this same problem.
As a workaround, can you try passing lists to the index/search/similarity calls? This skips tokenization.
In other words:
embeddings.index([(uid, text.split(), None) for uid, text in enumerate(sections)])
embeddings.search(query.split())
or
embeddings.similarity(query, [x.split() for x in sections])
from txtai.
Thank you for your reply. I solved this problem by replacing
return [token for token in tokens if re.match(r"^\d*[a-z][-.0-9:_a-z]{1,}$", token) and token not in Tokenizer.STOP_WORDS]
with
return tokens
The original filter returns [] when the string contains no English letters, because no token matches r"^\d*[a-z][-.0-9:_a-z]{1,}$". SentenceTransformer.encode() accepts strings as input, so in my case I pass the strings directly to SentenceTransformer.encode(), which solves the problem.
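For anyone who wants to reproduce the behavior without txtai installed, here is a minimal sketch of the filter quoted above. The regular expression is copied from this thread; `STOP_WORDS` is a small hypothetical stand-in for `Tokenizer.STOP_WORDS` (the real set is not shown here), `filter_tokens` is an illustrative name, and the input is assumed to be already lowercased and split on whitespace.

```python
import re

# Pattern copied from the filter quoted in this thread.
PATTERN = re.compile(r"^\d*[a-z][-.0-9:_a-z]{1,}$")

# Hypothetical stand-in for Tokenizer.STOP_WORDS; the real set is not shown here.
STOP_WORDS = {"the", "is", "and"}


def filter_tokens(tokens):
    """Keep tokens that match the pattern and are not stop words."""
    return [t for t in tokens if PATTERN.match(t) and t not in STOP_WORDS]


# English text keeps its content words.
english = filter_tokens("the search engine is fast".split())

# Text with no a-z letters matches nothing, so every token is dropped.
non_english = filter_tokens("搜索 引擎 很 快".split())

print(english)
print(non_english)
```

Under these assumptions the second call yields an empty list, which is why skipping the filter (or skipping tokenization entirely) avoids the problem for non-English input.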
from txtai.
Thanks for the update. We'll keep this issue open and address it in the next release.
from txtai.
Issue should now be resolved. Tokenization can be disabled by setting the config option:
Embeddings({"method": "transformers", "path": "/path/to/model", "tokenize": False})
from txtai.