
Comments (2)

hanxiao commented on May 18, 2024

Hi, using bytes to represent documents may seem counter-intuitive, so let me explain why we do it.

In the very early versions of GNES, we did send text as plain str/List[str], images as ndarray, and so on. However, we soon realized that this causes two problems:

  • These data types are Python-only, not universal. In principle, we don't want to restrict end users to a Python client. They should be able to use a Java, Go or JavaScript client to communicate with GNES. If we insisted on such a Python-oriented design, a serializer would have to exist on both the client and the server side to convert one type to another. That is a bad experience for developers and is error-prone.
  • These data types are too specific, not generic. GNES is not NLP-only; image, video and audio retrieval are also in scope. So how do you represent content of every modality with a single, generic representation? Bytes is the only option.
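
As a rough illustration of the second point (a sketch, not GNES code), text and a tiny grayscale image can both be reduced to bytes using only the Python standard library. The pixel values and variable names are made up for the example.

```python
import struct

# Text: any client language can produce UTF-8 bytes.
text = "hello gnes"
text_bytes = text.encode("utf-8")

# Image: a hypothetical 2x2 grayscale image, one unsigned byte per pixel.
pixels = [0, 64, 128, 255]
image_bytes = struct.pack("4B", *pixels)

# Both payloads now share one type, so one message field can carry either.
assert isinstance(text_bytes, bytes) and isinstance(image_bytes, bytes)

# The receiver still needs to know the modality to decode them correctly.
decoded_text = text_bytes.decode("utf-8")
decoded_pixels = list(struct.unpack("4B", image_bytes))
```

The last two lines are the catch: once everything is bytes, something on the server side has to know which decoding to apply.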

One follow-up question you may have: if all incoming data is in bytes, how does GNES know what is what, and how does it deserialize these bytes into the correct modality?

The answer is the Preprocessor. It deserializes the bytes into the correct modality and Python data type. Note how the class attribute doc_type affects all preprocessor classes that inherit from these bases.

class BaseTextPreprocessor(BasePreprocessor):
    doc_type = gnes_pb2.Document.TEXT

class BaseAudioPreprocessor(BasePreprocessor):
    doc_type = gnes_pb2.Document.AUDIO

class BaseImagePreprocessor(BasePreprocessor):
    doc_type = gnes_pb2.Document.IMAGE

class BaseVideoPreprocessor(BasePreprocessor):
    doc_type = gnes_pb2.Document.VIDEO

For example, a SentSplitPreprocessor (which inherits from BaseTextPreprocessor) converts bytes into str, while a WeightedSlidingPreprocessor (which inherits from BaseImagePreprocessor) converts bytes into ndarray.
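
A minimal sketch of that idea, with stand-ins for the real classes: `DocType` below substitutes for the `gnes_pb2.Document` enum, and the `Toy...` classes and their `apply` method are hypothetical, not the GNES API. It only shows how a class-level `doc_type` is inherited and how each subclass decides what the bytes become.

```python
from enum import Enum
from typing import List


class DocType(Enum):  # stand-in for gnes_pb2.Document.TEXT / IMAGE / ...
    TEXT = 1
    IMAGE = 2


class ToyBasePreprocessor:
    doc_type = None  # set once per modality, inherited by subclasses

    def apply(self, raw_bytes: bytes):
        raise NotImplementedError


class ToyTextPreprocessor(ToyBasePreprocessor):
    doc_type = DocType.TEXT


class ToyImagePreprocessor(ToyBasePreprocessor):
    doc_type = DocType.IMAGE


class ToySentSplit(ToyTextPreprocessor):
    def apply(self, raw_bytes: bytes) -> List[str]:
        # bytes -> str, then a naive split on periods
        text = raw_bytes.decode("utf-8")
        return [s.strip() for s in text.split(".") if s.strip()]


class ToyPixelReader(ToyImagePreprocessor):
    def apply(self, raw_bytes: bytes) -> List[int]:
        # bytes -> pixel intensities (a real one would build an ndarray)
        return list(raw_bytes)
```

Neither concrete class sets `doc_type` itself; `ToySentSplit.doc_type` resolves to `DocType.TEXT` through its parent, which is the inheritance effect described above.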

To summarize, here is the whole procedure again.

  1. The client (e.g. CLIClient) converts everything into bytes and fills in the docs.raw_bytes field defined in our Protobuf. As Protobuf is language-neutral, you can use whatever language you like to perform this task.
  2. The message is sent to the GNES frontend, and the first service in the stack, i.e. the preprocessor, takes over the message.
  3. The preprocessor service processes the message and deserializes it according to the module loaded: the docs and, most importantly, the chunks are enriched based on raw_bytes and the deserialization logic. You can of course write a custom preprocessor and use it in the way described in GNES Hub.
  4. Follow-up services take over the message and use the preprocessed chunks or docs.
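
The four steps can be sketched end to end in plain Python. Everything here is a simplified stand-in: `ToyDocument` and `ToyChunk` mimic the idea of a document message with `raw_bytes` and chunks but are not the GNES Protobuf definitions, and the function names are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyChunk:
    text: str


@dataclass
class ToyDocument:
    raw_bytes: bytes
    chunks: List[ToyChunk] = field(default_factory=list)


def client_build(text: str) -> ToyDocument:
    # Step 1: the client only has to produce bytes.
    return ToyDocument(raw_bytes=text.encode("utf-8"))


def preprocess(doc: ToyDocument) -> ToyDocument:
    # Steps 2-3: the first service deserializes raw_bytes and fills in chunks.
    text = doc.raw_bytes.decode("utf-8")
    doc.chunks = [ToyChunk(s.strip()) for s in text.split(".") if s.strip()]
    return doc


def downstream(doc: ToyDocument) -> List[int]:
    # Step 4: later services work on chunks, never on the raw bytes.
    return [len(c.text) for c in doc.chunks]


doc = preprocess(client_build("GNES is generic. Bytes go in."))
lengths = downstream(doc)
```

If `preprocess` is skipped, `doc.chunks` stays empty and `downstream` has nothing to work with.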

In short, a GNES flow/stack without a Preprocessor service is useless, as it won't know how to handle a message correctly.

If you feel this idea should be better known, you are welcome to make a PR about it, either via Python docstrings or by improving the logging info. ❤️


ilham-bintang commented on May 18, 2024

Hi @hanxiao,

I see the problem with using Python data types to represent the various types of document (text, image, video, audio).

Thanks
