The current Python implementation of the HeapManager behaviour under memory pressure is a clear bottleneck of the system.
Although this only affects applications that are memory-bound (which, arguably, are not many right now), it may become relevant at some point.
The current strategy is to look at the system memory usage and react accordingly by serializing objects one by one, checking after each one whether the system is still under memory pressure.
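The sequential strategy can be sketched roughly as follows (the function name, the 0.75 threshold, and the callbacks are hypothetical; `usage` and `store` stand in for the real system-memory probe and the serialization/persistence code):

```python
from typing import Callable

PRESSURE_THRESHOLD = 0.75  # hypothetical fraction of memory in use

def evict_until_eased(heap: list,
                      usage: Callable[[], float],
                      store: Callable[[object], None]) -> int:
    """Serialize and evict objects one by one, re-checking the
    memory usage after each store (the current sequential loop)."""
    evicted = 0
    while heap and usage() > PRESSURE_THRESHOLD:
        obj = heap.pop()
        store(obj)  # serialize + persist: I/O-bound, done serially
        evicted += 1
    return evicted
```

The re-check after every single store is exactly what makes this a bottleneck: each iteration pays the full serialization and I/O latency before the next object is even considered.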
- We may want to use Python interpreter memory usage instead of the system memory usage.
This is especially relevant in deployments with several Execution Environments: if there is an imbalance between them, it doesn't make sense for all of them to start evicting objects at the same time.
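A per-interpreter check could be sketched with the stdlib `tracemalloc` module, which only tracks allocations made by this Python process (the budget value and function name are hypothetical, and `tracemalloc.start()` must have been called beforehand):

```python
import tracemalloc

PER_EE_BUDGET_BYTES = 512 * 1024 * 1024  # hypothetical per-EE budget

def ee_under_pressure() -> bool:
    # Compare *this* interpreter's allocations against its own budget,
    # so only the overloaded Execution Environment starts evicting,
    # instead of every EE reacting to system-wide memory usage.
    current_bytes, _peak = tracemalloc.get_traced_memory()
    return current_bytes > PER_EE_BUDGET_BYTES
```

Note that `tracemalloc` adds tracing overhead; a cheaper (but coarser, POSIX-only) alternative would be `resource.getrusage(resource.RUSAGE_SELF).ru_maxrss`.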
- We may improve performance by storing several objects in parallel.
When there is memory pressure, we will probably need to serialize (in order to evict from memory) a bunch of objects, so parallelizing that is an obvious HPC strategy, given that there are a lot of I/O overheads and locking calls.
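Since each store is dominated by I/O, a thread pool is a natural first sketch (function names are hypothetical; the GIL is not a problem here because the workers spend most of their time in I/O waits):

```python
from concurrent.futures import ThreadPoolExecutor

def flush_parallel(objects, store_one, max_workers: int = 8) -> None:
    # Serialize and persist several objects concurrently; each
    # store_one() call is I/O-bound, so threads overlap the waits.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() drains the iterator, forcing completion and
        # re-raising any exception from a worker.
        list(pool.map(store_one, objects))
```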
- We should improve the stop-I-don't-have-enough-memory behaviour.
When memory is full, some applications may generate new data faster than the eviction rate. This can result in OOM errors (it has happened to me). I did some tests adding blocking on the server side, but a proper gRPC "not enough resources" error, such as a RESOURCE_EXHAUSTED status code, may be cleaner and more scalable.
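A minimal sketch of such an admission check (the threshold, the function name, and the use of `MemoryError` as a stand-in are all hypothetical); in a real gRPC servicer the `raise` would instead be `context.abort(grpc.StatusCode.RESOURCE_EXHAUSTED, ...)`, so the client receives a retryable status rather than the server going OOM:

```python
def admit_store(used_fraction: float,
                pressure_threshold: float = 0.85) -> None:
    """Reject new stores while under memory pressure.

    In the gRPC handler this would become:
        context.abort(grpc.StatusCode.RESOURCE_EXHAUSTED,
                      "evicting objects, retry later")
    letting well-behaved clients back off and retry.
    """
    if used_fraction >= pressure_threshold:
        raise MemoryError("store rejected: under memory pressure")
```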
- We may want to add hysteresis to the memory pressure threshold
I have a dirty workaround which consists in using a MEMORY_EASE threshold: when the eviction of objects starts, instead of comparing against the memory pressure threshold, it compares against MEMORY_EASE. This ensures that there is more margin. Also, if we implement the RESOURCE_EXHAUSTED approach (see above), we may use that status code for the memory pressure threshold but keep evicting until the memory ease threshold is reached.
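The two-threshold hysteresis could look like this (the class name and threshold values are hypothetical; `MEMORY_EASE` is the name from the workaround above):

```python
class PressureGovernor:
    # Hysteresis: start evicting above MEMORY_PRESSURE, and keep
    # evicting until usage drops below MEMORY_EASE, so eviction
    # doesn't flap on and off right at the pressure threshold.
    MEMORY_PRESSURE = 0.85  # hypothetical
    MEMORY_EASE = 0.70      # hypothetical

    def __init__(self) -> None:
        self.evicting = False

    def update(self, used_fraction: float) -> bool:
        if used_fraction >= self.MEMORY_PRESSURE:
            # This is also where a RESOURCE_EXHAUSTED reply could
            # be triggered for incoming stores (see above).
            self.evicting = True
        elif used_fraction < self.MEMORY_EASE:
            self.evicting = False
        return self.evicting
```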