Consistent API for fieldSpecs (pond, CLOSED)

esnet commented on June 11, 2024
Consistent API for fieldSpecs

from pond.

Comments (5)

montegoode commented on June 11, 2024

After discussion, we settled on the following:

There are three field spec variants and they are named thusly:

  • field_spec can be a string, an array/tuple, or None. The string can name a single column or a.deep.column. The array can be a list of columns, including deep ones such as ['using.deep', 'column.syntax']. A field_spec can therefore select one or more columns. If it is None, it falls back to the default column, value.
  • field_spec_list is an array/tuple of strings. It is used when the arg requires requesting multiple columns. Each element can be a simple column name or can use the deep syntax, e.g. ['use.deep', 'column.syntax'].
  • field_path is a lower level deal that can only address a single column. It can be a string that is either a single simple column or a.deep.column, OR an array/tuple that contains only the segments of a single column path, e.g. ['a', 'deep', 'column']. These values will eventually hit the sanitizer and be turned into the array form.

These names should be used exclusively in class methods, for clarity, both as incoming args and as internal args. To wit:

    @staticmethod
    def is_valid_value(event, field_path=None):
        val = event.value(field_path)
        return not bool(val is None or val == '' or is_nan(val))

As for the aforementioned "single sanitizer method": it is to be deployed inside Event.get() to handle simple one-off cases where it makes sense to call in = e.get('in'), but the developer should also use the same method "upstream" so that splitting deep.string.paths does not keep happening inside a loop. The final test to see whether .get() got an array is inexpensive.
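To make the sanitizer idea concrete, here is a minimal sketch. The names sanitize_field_path and get are hypothetical stand-ins written for this illustration, operating on a plain nested dict; they are not taken from the pypond source, and the 'value' default is assumed from the field_spec description above.

```python
def sanitize_field_path(field_path=None):
    """Hypothetical helper: normalize a field_path into a list of segments."""
    if field_path is None:
        # Assumed default column, per the field_spec description above.
        return ['value']
    if isinstance(field_path, (list, tuple)):
        # Already split: this is the cheap check that skips the string split.
        return list(field_path)
    if isinstance(field_path, str):
        return field_path.split('.')
    raise TypeError('field_path must be a string, list, tuple, or None')


def get(data, field_path=None):
    """Toy stand-in for Event.get(): walk a nested dict by path segments."""
    node = data
    for segment in sanitize_field_path(field_path):
        if not isinstance(node, dict) or segment not in node:
            return None
        node = node[segment]
    return node
```

A caller can pass 'a.deep.column' for a one-off lookup, or call sanitize_field_path once upstream and hand the resulting list to get() on every iteration.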

@pjm17971 please review.

montegoode commented on June 11, 2024

There is one final issue while we’re picking at this scab that I wanted to hammer out before we close this issue and make everything holy. It’s the issue of what the default for field_spec should be. Of course it should be value…mostly.

I propose the following:

Method prototypes handling a field_spec argument should set the default value to None in python land (or whatever the proper equivalent is in JS land), and the uber sanitize method should be responsible for setting the default if it receives None as the value.

Reasoning:

  1. That value really never needs to be set until the moment before it finally hits Event.get() and needs to become ['value']. So there really isn't any reason to pepper the entire rest of the code base with a million (..., field_spec=['value']) method prototypes. Set it to one language-specific null value and get on with your life.
  2. If we ever want to change the default (which we will never want to do, until we do), it's easy, per point 1, because the sanitizer is the only place that sets it.
  3. There are some methods (Event.map(), et al) that take a field_spec and if that is None/etc it defaults to "mapping all the columns."

And that makes the two cases basically consistent: not providing a field_spec at all telegraphs "do your default thing" to all of those methods, with the least amount of code to make it happen.
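A rough sketch of that proposal, under stated assumptions: sanitize_field_spec, select, and map_columns are hypothetical names made up for this example (not pypond functions), and the data is a plain dict. Every prototype defaults to None, and each method decides what None means only at the last moment.

```python
DEFAULT_COLUMN = 'value'  # assumed default; it is set in exactly one place


def sanitize_field_spec(field_spec=None):
    """Hypothetical sanitizer: return a list of paths, each a list of segments."""
    if field_spec is None:
        field_spec = [DEFAULT_COLUMN]
    elif isinstance(field_spec, str):
        field_spec = [field_spec]
    return [path.split('.') for path in field_spec]


def select(data, field_spec=None):
    """None means 'the default column'; the sanitizer fills it in."""
    selected = {}
    for path in sanitize_field_spec(field_spec):
        node = data
        for segment in path:
            node = node[segment]
        selected['.'.join(path)] = node
    return selected


def map_columns(data, func, field_spec=None):
    """Toy analogue of Event.map(): None means 'map all the columns'.

    This sketch handles top-level columns only.
    """
    if field_spec is None:
        names = list(data)
    elif isinstance(field_spec, str):
        names = [field_spec]
    else:
        names = list(field_spec)
    return {name: func(data[name]) for name in names}
```

Neither prototype mentions ['value'], so changing DEFAULT_COLUMN would be a one-line edit.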

@pjm17971 also for your perusal.

montegoode commented on June 11, 2024

The Collapser processor stores a value from a passed-in Option in an internal attribute named _field_spec. For consistency, this should be renamed to _field_spec_list, because that's what Event.collapse() takes as an argument.

montegoode commented on June 11, 2024

I've taken a pass through the python code and renamed everything using a consistent naming scheme and I've also come up with cut-and-paste arg docstrings so it's clear what is doing what. Both of these things help when you have a situation where TimeSeries.collapse() calls Pipeline.collapse() which invokes the Collapser processor, which in turn calls Event.collapse().

Whew.

To be on the same page I have used the following scheme:

  • field_spec can be a string, an array/tuple, or None. The string can name a single column or a.deep.column. The array can be a list of columns, including deep ones such as ['using.deep', 'column.syntax']. A field_spec can therefore select one or more columns. If it is None, it falls back to the default column, value.
  • field_spec_list is an array/tuple of strings. It is used when the arg requires requesting multiple columns. Each element can be a simple column name or can use the deep syntax, e.g. ['use.deep', 'column.syntax'].
  • field_path_array is a lower level deal that can only address a single column. It is used to access a column and if the array has multiple values, they are segments of a.deep.column.path.

These names are used on "both sides" of a method. Example: Event.is_valid_value() is a very light abstraction around Event.get() which takes a field path array. So Event.is_valid_value() looks like this:

    @staticmethod
    def is_valid_value(event, field_path_array=None):
        val = event.value(field_path_array)
        return not bool(val is None or val == '' or is_nan(val))

And things calling it should, when possible, look like this example from Collection.clean():

    def clean(self, field_path_array=None):

        flt_events = list()

        for i in self.events():
            if Event.is_valid_value(i, field_path_array):
                flt_events.append(i)

        return Collection(flt_events)

This not only makes things consistent but also makes it easier to track down the points where we should pre-split the field_path_array into an actual array so that's not happening inside a loop.

Speaking of optimizing those splits, we could consider removing any code in Event.get() that can handle a string. That would force the developer to split 'split.this' into ['split', 'this'] farther upstream. Both code bases still do that split in get(), which could silently lead to writing non-optimal code.
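As an illustration of what a string-free get() would force on callers, here is a small sketch. get_strict and clean are hypothetical functions written for this example over plain nested dicts; they are not the pypond implementation.

```python
def get_strict(data, field_path_array):
    """Hypothetical strict get(): accepts only a pre-split list of segments."""
    if isinstance(field_path_array, str):
        raise TypeError('split the path upstream, e.g. path.split(".")')
    node = data
    for segment in field_path_array:
        if not isinstance(node, dict) or segment not in node:
            return None
        node = node[segment]
    return node


def clean(rows, field_path='value'):
    """Toy analogue of Collection.clean(): keep rows with a non-None value."""
    # The split happens once here, outside the loop, not once per row.
    if isinstance(field_path, str):
        path = field_path.split('.')
    else:
        path = list(field_path)
    return [row for row in rows if get_strict(row, path) is not None]
```

With this shape, a forgotten upstream split fails loudly with a TypeError instead of quietly re-splitting the string on every row.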

The python code is currently doing the split with the sanitizer method in get() but that is by design while I was re-orging things. My next step is to use the renamed methods to swim back upstream to see where we should do the splits.

pjm17971 commented on June 11, 2024

For reference in the future, the python changes are mixed in here:
esnet/pypond@1048844

The javascript changes are here:
c933d09

Closing now.
