Comments (2)
Hi, using bytes to represent documents seems counter-intuitive, so let me explain why we do it.
In the very early versions of GNES, we did send text as vanilla `str`/`List[str]`, images as `ndarray`, etc. However, we soon realized that there are two problems:
- These data types are Python-only, not universal. In principle, we don't want to restrict end users to a Python client. They can use a Java, Go or JavaScript client to communicate with GNES. If we insisted on such a Python-oriented design, a serializer would have to exist on both the client and the server side to convert one type to another. That is a bad experience for developers, and it is error-prone.
- These data types are too specific, not generic. GNES is not for NLP only; image/video/audio retrieval is also in its scope. So how do you represent content of every modality using a single, generic representation? Bytes is the only option.
One follow-up question you may have: if all incoming data is in bytes, how can GNES know what is what, and how to deserialize these bytes into the correct modality?
The answer is the Preprocessor. It deserializes the bytes into the correct modality and Python data type. Note how the class attribute `doc_type` affects all inherited preprocessor classes.
gnes/gnes/preprocessor/base.py
Lines 41 to 54 in b4d2c8c
For example, a `SentSplitPreprocessor` (inherited from `BaseTextPreprocessor`) converts bytes into `str`, while a `WeightedSlidingPreprocessor` (inherited from `BaseImagePreprocessor`) converts bytes into `ndarray`.
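To make the inheritance point concrete, here is a minimal sketch of the idea, not the actual GNES code. Every name in it (`DocType`, `ToyPreprocessor`, `ToyTextPreprocessor`, `ToySentSplitPreprocessor`, `ToyImagePreprocessor`) is hypothetical, the image byte layout is an assumption made up for the example, and a nested list stands in for `ndarray` so that the sketch needs only the standard library.

```python
import struct
from enum import Enum


class DocType(Enum):
    """Hypothetical modality tag, standing in for the real doc_type values."""
    UNKNOWN = 0
    TEXT = 1
    IMAGE = 2


class ToyPreprocessor:
    """Hypothetical base class; subclasses inherit or override doc_type."""
    doc_type = DocType.UNKNOWN

    def decode(self, raw_bytes: bytes):
        raise NotImplementedError


class ToyTextPreprocessor(ToyPreprocessor):
    doc_type = DocType.TEXT

    def decode(self, raw_bytes: bytes) -> str:
        # Text modality: bytes become a Python str.
        return raw_bytes.decode("utf-8")


class ToySentSplitPreprocessor(ToyTextPreprocessor):
    # doc_type is not redefined here; it is inherited as DocType.TEXT.
    def split(self, raw_bytes: bytes) -> list:
        text = self.decode(raw_bytes)
        return [s.strip() for s in text.split(".") if s.strip()]


class ToyImagePreprocessor(ToyPreprocessor):
    doc_type = DocType.IMAGE

    def decode(self, raw_bytes: bytes) -> list:
        # Assumed toy layout: 2-byte height, 2-byte width (big-endian),
        # followed by one byte per grayscale pixel.
        height, width = struct.unpack(">HH", raw_bytes[:4])
        pixels = raw_bytes[4:]
        if len(pixels) != height * width:
            raise ValueError("pixel count does not match the header")
        return [list(pixels[r * width:(r + 1) * width]) for r in range(height)]
```

The same `bytes` payload type goes into every class; only the class-level `doc_type` and the matching `decode` decide which Python type comes out.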
To summarize, let me go through the whole procedure once more.
- The client (e.g. `CLIClient`) converts everything into bytes and fills in the `docs.raw_bytes` field defined in our Protobuf. As Protobuf is universal, one can use whatever language he/she likes to perform this task.
- The message is sent to the GNES frontend, and the first service in the stack, i.e. the preprocessor, takes over the message.
- The preprocessor service processes the message and deserializes it according to the module loaded; that is, the `docs` and, most importantly, the `chunks` information is enriched based on `raw_bytes` and the deserialization logic. You can of course customize a preprocessor and use it in the way described in GNES Hub.
- The follow-up services take over the message and use the preprocessed `chunks` or `docs`.
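The steps above can be sketched end to end as follows. This is only an illustration under stated assumptions: `ToyDocument` is a hypothetical dataclass standing in for the Protobuf message (only the `raw_bytes` and `chunks` field names come from the description above), and the three functions are made-up stand-ins for the client, the preprocessor service and a follow-up service.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyDocument:
    """Hypothetical stand-in for the Protobuf document message."""
    raw_bytes: bytes = b""
    chunks: List[str] = field(default_factory=list)


def client_make_doc(text: str) -> ToyDocument:
    # Step 1: the client converts its content to bytes and fills raw_bytes.
    return ToyDocument(raw_bytes=text.encode("utf-8"))


def preprocess(doc: ToyDocument) -> ToyDocument:
    # Steps 2-3: the preprocessor deserializes raw_bytes and enriches chunks.
    text = doc.raw_bytes.decode("utf-8")
    doc.chunks = [s.strip() for s in text.split(".") if s.strip()]
    return doc


def follow_up_service(doc: ToyDocument) -> List[int]:
    # Step 4: a later service works on chunks only, never on raw_bytes.
    # The "encoding" here is a toy: the length of each chunk.
    return [len(chunk) for chunk in doc.chunks]


doc = preprocess(client_make_doc("GNES is generic. Bytes go in."))
lengths = follow_up_service(doc)
```

If `preprocess` is skipped, `chunks` stays empty and the follow-up service has nothing to work on.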
In short, a GNES flow/stack without a `Preprocessor` service is useless, as it won't know how to handle a message correctly.
If you feel this idea should be better known to others, you are welcome to make a PR about it, either via Python docstrings or by improving the logging info. ❤️
from gnes.
Hi @hanxiao ,
I see the problem with using Python data types to represent various types of documents (text, image, video, audio).
Thanks