scriptfusion / porter

:lipstick: Durable and asynchronous data imports for consuming data at scale and publishing testable SDKs.

License: GNU Lesser General Public License v3.0

PHP 100.00%
porter data-import framework data-transformation php-development abstraction scalability durability asynchronous library

porter's People

Contributors

a-barzanti, bilge, markchalloner, samvdb

porter's Issues

Integrate async throttle

It was previously thought that directly integrating the async throttle with Porter was not needed because we can just throttle high level Porter import operations. However, this is false, for two reasons:

  1. Internally, fetches can be retried (by default up to 5 times).
  2. Any given import may pull down any number of resources to satisfy the import operation. The most common case is enumerating a paginated resource that results in n requests for n pages.

Each of these additional requests must be throttled independently to avoid triggering limits, whether a retry or the next resource in a sequence. For this to be possible, the throttle must be integrated into ImportConnector so it can throttle transparently without burdening the developer with additional calls or configuration.

A default throttle should be provided for async imports but it should be possible to override with a custom configuration or implementation via AsyncImportSpecification. Throttling will not be available for sync imports until such a time as the sync API converges with the async API internally.
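
The point that retries and paginated requests must each pass through the throttle can be illustrated with a minimal, synchronous sketch. The CountingThrottle and ThrottledConnector classes below are hypothetical stand-ins, not Porter's API, and the throttle only counts acquisitions instead of delaying.

```php
<?php
declare(strict_types=1);

// Hypothetical throttle: only counts how often permission was requested.
// A real throttle would delay callers to respect a rate limit.
final class CountingThrottle
{
    public $acquired = 0;

    public function acquire(): void
    {
        ++$this->acquired;
    }
}

// Hypothetical stand-in for ImportConnector: every fetch attempt,
// including each retry, passes through the throttle first.
final class ThrottledConnector
{
    private $throttle;
    private $fetch;
    private $maxAttempts;

    public function __construct(CountingThrottle $throttle, callable $fetch, int $maxAttempts = 5)
    {
        $this->throttle = $throttle;
        $this->fetch = $fetch;
        $this->maxAttempts = $maxAttempts;
    }

    public function fetch(string $source): string
    {
        for ($attempt = 1; ; ++$attempt) {
            $this->throttle->acquire();

            try {
                return ($this->fetch)($source);
            } catch (RuntimeException $exception) {
                if ($attempt >= $this->maxAttempts) {
                    throw $exception;
                }
            }
        }
    }
}

// Simulated paginated resource: page 2 fails once before succeeding.
$throttle = new CountingThrottle();
$failures = 1;
$connector = new ThrottledConnector($throttle, function (string $source) use (&$failures): string {
    if ($source === 'page/2' && $failures-- > 0) {
        throw new RuntimeException('Temporary failure.');
    }

    return "data for $source";
});

foreach (['page/1', 'page/2', 'page/3'] as $page) {
    $connector->fetch($page);
}

// One import operation, but three pages plus one retry hit the throttle.
echo $throttle->acquired, "\n";
```

Throttling only the high-level import would have counted this as a single operation, which is why the throttle has to sit at the connector level.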

Integrate hydrators into the architecture

Porter's notion of records is arrays, which are very flexible to pass between interfaces, but once data leaves Porter it is common for applications to want to work with objects instead. The job of a hydrator is to use array data to populate object fields. We should investigate the value of designing a hydrator interface and whether there are any existing hydration libraries fit for purpose.
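
As a starting point for discussion, a hydrator interface might look like the sketch below. The Hydrator interface, PublicPropertyHydrator and ForexRate are hypothetical names invented for illustration; Porter defines none of them, and this naive implementation only handles public properties.

```php
<?php
declare(strict_types=1);

// Hypothetical interface: turns one array record into an object.
interface Hydrator
{
    /**
     * @param array $record Record as emitted by an import.
     * @param string $className Class to instantiate and populate.
     *
     * @return object
     */
    public function hydrate(array $record, string $className);
}

// Naive implementation: copies array keys onto matching public properties
// and ignores keys that have no corresponding property.
final class PublicPropertyHydrator implements Hydrator
{
    public function hydrate(array $record, string $className)
    {
        $object = new $className();

        foreach ($record as $field => $value) {
            if (property_exists($object, $field)) {
                $object->$field = $value;
            }
        }

        return $object;
    }
}

// Hypothetical application object.
final class ForexRate
{
    public $currency;
    public $rate;
}

$rate = (new PublicPropertyHydrator())->hydrate(
    ['currency' => 'USD', 'rate' => 1.08, 'ignored' => true],
    ForexRate::class
);

echo $rate->currency, ': ', $rate->rate, "\n";
```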

CachingConnector::fetch should allow passing in the cache key

If we allow CachingConnector to take a cache key parameter, then it can be used with existing or shared caches where the keys are not of the form CachingConnector::hash produces.

My use case for this is using existing ODM Mongo documents to cache values with the document ID being the cache key.

To maintain backward compatibility, the parameter should be optional and, when it is null, CachingConnector should fall back to generating the cache key using CachingConnector::hash.
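
The proposed signature could be sketched roughly as follows. SketchCachingConnector is a simplified, hypothetical stand-in: a plain array replaces the PSR-6 cache pool, and the hash() body is an assumption, since the real CachingConnector::hash may compute keys differently.

```php
<?php
declare(strict_types=1);

// Simplified stand-in for CachingConnector (hypothetical, not Porter's class).
final class SketchCachingConnector
{
    private $cache = [];
    private $fetch;

    public function __construct(callable $fetch)
    {
        $this->fetch = $fetch;
    }

    // The cache key is optional; null falls back to the generated hash.
    public function fetch(string $source, ?string $cacheKey = null): string
    {
        $key = $cacheKey ?? $this->hash($source);

        if (!array_key_exists($key, $this->cache)) {
            $this->cache[$key] = ($this->fetch)($source);
        }

        return $this->cache[$key];
    }

    // Assumption: stands in for CachingConnector::hash.
    private function hash(string $source): string
    {
        return md5($source);
    }
}

$calls = 0;
$connector = new SketchCachingConnector(function (string $source) use (&$calls): string {
    ++$calls;

    return "response for $source";
});

$connector->fetch('http://example.com/a', 'document-42'); // Miss: stored under custom key.
$connector->fetch('http://example.com/b', 'document-42'); // Hit: same custom key.
$connector->fetch('http://example.com/a');                // Miss: stored under generated hash.

echo $calls, "\n";
```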

ExponentialAsyncDelayRecoverableExceptionHandler not being cloned correctly

A recent high-concurrency import, which fails catastrophically when the target service is down, triggered an integer overflow, indicating that state is somehow being shared across instances of the default recoverable exception handler.

A debugging session shows the handler is being cloned, and initialize() is called at least once, but somehow the series of delays keeps growing beyond the default five retries.

In case it matters, the specific resource implementation calls fetchAsync() 80 times, but each call should still be independent as the ImportConnector clones a new handler for each fetch*() call.

Integrate formatters into the architecture

This ticket is an open discussion about whether there is a good way to integrate formatters into the architecture. Data might flow through objects in the following order.

Connector → Formatter → ProviderResource

However, we need to understand what the interface for Formatter must be and how it integrates into the rest of the system in a meaningful and reusable way.

Dependency on psr/cache:^1

I just wanted to take Porter for a quick spin, created a new Symfony project and tried to require the Porter package, resulting in this error:

scriptfusion/porter 7.0.0 requires psr/cache ^1 -> found psr/cache[1.0.0, 1.0.1] but the package is fixed to 3.0.0 (lock file version)

Is an update feasible, ideally for psr/container as well?

Laravel and CachingConnector?

How do I enable CachingConnector in Laravel?

    public function handle()
    {
        app()->bind(HttpConnector::class, CachingConnector::class);
        app()->bind(EuropeanCentralBankProvider::class, EuropeanCentralBankProvider::class);

        $porter = new Porter(app());

        $specification = new ImportSpecification(new DailyForexRates());
        $specification->enableCache();
        $rates = $porter->import($specification);

        foreach ($rates as $rate) {
            echo "$rate[currency]: $rate[rate]\n";
        }
    }

ScriptFUSION\Porter\Cache\CacheUnavailableException : Cannot cache: connector does not support caching.

HttpConnectorTest intermittent CI failure

Travis occasionally fails to pass HttpConnectorTest with an error similar to the following.

There was 1 error:

1) ScriptFUSIONTest\Functional\Porter\Net\Http\HttpConnectorTest::testConnectionToLocalWebserver
ScriptFUSION\Retry\FailingTooHardException: Operation failed after 5 attempt(s).

/home/travis/build/ScriptFUSION/Porter/vendor/scriptfusion/retry/src/retry.php:29
/home/travis/build/ScriptFUSION/Porter/test/Functional/Porter/Net/Http/HttpConnectorTest.php:96
/home/travis/build/ScriptFUSION/Porter/test/Functional/Porter/Net/Http/HttpConnectorTest.php:34

Caused by
ScriptFUSION\Porter\Net\Http\HttpConnectionException: file_get_contents(http://[::1]:12345/test?baz=qux): failed to open stream: Connection refused

/home/travis/build/ScriptFUSION/Porter/src/Net/Http/HttpConnector.php:65
/home/travis/build/ScriptFUSION/Porter/src/Connector/CachingConnector.php:62
/home/travis/build/ScriptFUSION/Porter/test/Functional/Porter/Net/Http/HttpConnectorTest.php:110
/home/travis/build/ScriptFUSION/Porter/test/Functional/Porter/Net/Http/HttpConnectorTest.php:86
/home/travis/build/ScriptFUSION/Porter/vendor/scriptfusion/retry/src/retry.php:26
/home/travis/build/ScriptFUSION/Porter/test/Functional/Porter/Net/Http/HttpConnectorTest.php:96
/home/travis/build/ScriptFUSION/Porter/test/Functional/Porter/Net/Http/HttpConnectorTest.php:34

This never used to be a problem, and thanks to the five retries it should have plenty of time to spin up the server. However, this is also the first test in the suite so it may have something to do with PHPUnit start-up time. We should consider moving slower tests to the end of the suite, and if that doesn't work, we'll have to increase the retry delay coefficient.

Retry delays are tied to the lifetime of an import specification

Since ImportSpecification creates the ExponentialBackoffExceptionHandler, the current retry delay is tied to the lifetime of the specification. That is, if an import fails five times and the same specification is used to import again, the next delay begins with the sixth attempt delay time instead of restarting from one.

Ideally the retry counter would restart at the beginning of a new import regardless of whether the specification is reused. However, this tends to be a low-impact bug because specifications are typically not reused. As a workaround, anyone encountering this issue can create a new specification for each import instead of reusing specifications.
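
The effect and the workaround can be reproduced with a small sketch. BackoffHandler and Specification are hypothetical simplifications of ExponentialBackoffExceptionHandler and ImportSpecification, and the 100 ms base delay is an arbitrary assumption.

```php
<?php
declare(strict_types=1);

// Hypothetical simplification of an exponential backoff handler:
// the attempt counter lives as long as the handler does.
final class BackoffHandler
{
    private $attempt = 0;

    public function nextDelayMs(): int
    {
        return 100 * (2 ** $this->attempt++);
    }
}

// Hypothetical specification that creates and owns its handler.
final class Specification
{
    public $handler;

    public function __construct()
    {
        $this->handler = new BackoffHandler();
    }
}

// The first import fails five times using this specification.
$reused = new Specification();
for ($i = 0; $i < 5; ++$i) {
    $reused->handler->nextDelayMs();
}

// Reusing the specification continues from the sixth delay.
$continued = $reused->handler->nextDelayMs();

// Workaround: a fresh specification restarts from the first delay.
$fresh = (new Specification())->handler->nextDelayMs();

printf("Reused: %d ms, fresh: %d ms\n", $continued, $fresh);
```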

Make Mapper a suggested dependency

Mapper is currently a required dependency, but users who do not use mappings do not need to install it at all. To make Mapper a suggested dependency, care must be taken to ensure Porter works correctly when Mapper is unavailable, including tests to verify correct operation in this scenario.

Durability is broken for subsequent generator iterations after the first

Durability is provided for the $provider->fetch call, but Provider::fetch is declared to return Iterator, which is typically implemented using generators. Generators imply deferred code execution, which means that if the generator throws an exception after the fetch call has returned, the retry handler does not catch it because execution has already left that code block.

This common case is not captured by PorterTest because it only tests that Provider::fetch throws an exception directly instead of the generator throwing an exception.
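
The deferred execution described above can be reproduced without Porter. In this sketch, fetchRecords() is a hypothetical generator standing in for Provider::fetch, and the first try block stands in for the code guarded by the retry handler.

```php
<?php
declare(strict_types=1);

// Hypothetical generator: the first record succeeds, the second fails.
function fetchRecords(): Iterator
{
    yield ['id' => 1];

    throw new RuntimeException('Connection lost while fetching the second record.');
}

$caughtInsideGuardedBlock = false;
$caughtDuringIteration = false;

// Stand-in for the block protected by the retry handler.
try {
    $records = fetchRecords();
    // Runs the generator up to its first yield only.
    $records->valid();
} catch (RuntimeException $exception) {
    $caughtInsideGuardedBlock = true;
}

// The consumer iterates later, outside the guarded block.
try {
    foreach ($records as $record) {
        // Consuming records advances the generator past the first yield.
    }
} catch (RuntimeException $exception) {
    $caughtDuringIteration = true;
}

var_dump($caughtInsideGuardedBlock, $caughtDuringIteration);
```

The exception surfaces in the consumer's loop, so any retry logic wrapped only around the initial call never sees it.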

Reconsider whether forcing resources to return arrays is correct

Currently Porter assumes resources should always return structured data as an array. However, there may be use cases where structured data is either unavailable or undesirable. I have yet to encounter any compelling cases but am very interested to hear about them.

If we open up the return type to be mixed, this would allow resources to return objects, which would solve #12. Allowing objects can be convenient for object-oriented applications, but if resources return objects as the de-facto standard, this could be inefficient for applications that just want to work with raw data. However, mixed would even permit resources to return different types depending on some configuration parameter.

Forcing the array return type is nice because it feeds into the transformers subsystem, giving transformers a consistent type to work with. However, I'm willing to forgo the entire transformers system in a future version, or change it to only be available when the return type is array, or change it to work with any return type, as necessary. Ultimately, the consequences for the transformers system are not important because Porter's primary responsibility is fetching data reliably, not transforming it.

Lazy-load registered providers

A typical Porter factory might load many providers to support all use cases of an application, even though only a smaller subset may actually be used during one execution life-cycle. Therefore we would like a mechanism to lazy-load registered providers only when they are required.

One such mechanism may be a factory interface that looks similar to the following.

interface PorterProviderFactory
{
    public function getProviderClassName() : string;

    public function createProvider() : Provider;
}
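
One way such a factory could be consumed is sketched below. The interface is repeated so the example is self-contained; the empty Provider interface, ExampleProvider, ExampleProviderFactory and LazyProviderRegistry are hypothetical names used only for illustration.

```php
<?php
declare(strict_types=1);

// Placeholder for Porter's Provider interface (assumption: members omitted).
interface Provider
{
}

interface PorterProviderFactory
{
    public function getProviderClassName(): string;

    public function createProvider(): Provider;
}

// Hypothetical provider that counts how often it is constructed.
final class ExampleProvider implements Provider
{
    public static $instances = 0;

    public function __construct()
    {
        ++self::$instances;
    }
}

final class ExampleProviderFactory implements PorterProviderFactory
{
    public function getProviderClassName(): string
    {
        return ExampleProvider::class;
    }

    public function createProvider(): Provider
    {
        return new ExampleProvider();
    }
}

// Hypothetical registry: stores factories and creates providers on first use.
final class LazyProviderRegistry
{
    private $factories = [];
    private $providers = [];

    public function register(PorterProviderFactory $factory): void
    {
        $this->factories[$factory->getProviderClassName()] = $factory;
    }

    public function get(string $className): Provider
    {
        if (!isset($this->providers[$className])) {
            if (!isset($this->factories[$className])) {
                throw new InvalidArgumentException("No provider registered: $className.");
            }

            $this->providers[$className] = $this->factories[$className]->createProvider();
        }

        return $this->providers[$className];
    }
}

$registry = new LazyProviderRegistry();
$registry->register(new ExampleProviderFactory());
$instancesAfterRegistration = ExampleProvider::$instances;

$registry->get(ExampleProvider::class);
$registry->get(ExampleProvider::class);
$instancesAfterUse = ExampleProvider::$instances;

printf("After registration: %d, after use: %d\n", $instancesAfterRegistration, $instancesAfterUse);
```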

Document Porter's main API

Explicitly document the public methods of Porter, specifically import(), importOne(), the provider methods, including details about tagging, and all other public methods.

Dev mode

Introducing a developer mode would apply an opinionated preset to Porter's feature set, in contrast to its defaults, enabling or disabling certain features or modifying default values to be more conducive to development work.

For example, developer mode may:

  • reduce automatic retries from 5 to 1
  • <add more ideas here...>

Document Symfony integration best practices

The readme is written in a framework-agnostic way, as if one were to use Porter in isolation, which is a good default tone since it makes no assumptions. However, many people use Symfony, and it would be useful to describe what a Porter integration with Symfony should look like for people getting started in a Symfony environment.

Add FAQ or cookbook to documentation

Add examples either in the form of an FAQ or "cookbook" to demonstrate pattern solutions to common problems.

Scenarios:

  • Importing binary data
  • Import two or more collections at once (collections of collections)

ConnectorOptions is a bad design

At first glance, one would think tying options to a connector would create concurrency issues where two requests could set different options on the connector at the same time. Due to cloning, this is not an issue; however, the problems with ConnectorOptions reach further than just potential concurrency issues. Since connectors may be decorated, finding the options you need to modify often means traversing the stack of connectors from ImportConnector down, which is cumbersome and error-prone.

We cannot simply remove connector options and let implementations do as they please because the cache needs knowledge of the particular options exported by the connector in order to determine whether two requests are identical and thus the cache may be reused.

We propose changing the signature of fetch(string) to fetch(object), where object is some implementation-defined object that encapsulates both the original source string and the connector options. In this way, everything needed to define the request is passed through all connectors in the stack and can be inspected or modified as need be when it passes through. This also removes the need to clone the connector (and its options), which makes implementations much easier and cleaner.

This change would be a BC break and, moreover, the signature is less convenient than simply passing a string, which can be sufficient for HTTP GET requests and some others. We may consider supporting object|string; however, this complicates the interface and makes it more taxing to implement.

Rather than just fetch(object) where object is literally typed to object, which is unsupported in PHP 7.1 anyway, we should probably have a Source interface that specifies toArray and serializes all configurable options as an array, for use with caching.
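
A minimal sketch of that idea follows. Only the Source name and its toArray method come from the proposal; HttpSource, its constructor parameters and the cacheKey() helper are hypothetical, and the key derivation is an assumption.

```php
<?php
declare(strict_types=1);

// Proposed interface: exports everything that defines a request.
interface Source
{
    public function toArray(): array;
}

// Hypothetical implementation encapsulating the source string plus options.
final class HttpSource implements Source
{
    private $url;
    private $method;
    private $headers;

    public function __construct(string $url, string $method = 'GET', array $headers = [])
    {
        $this->url = $url;
        $this->method = $method;
        $this->headers = $headers;
    }

    public function toArray(): array
    {
        return [
            'url' => $this->url,
            'method' => $this->method,
            'headers' => $this->headers,
        ];
    }
}

// Hypothetical helper: identical requests produce identical cache keys.
function cacheKey(Source $source): string
{
    return md5(serialize($source->toArray()));
}

$first = cacheKey(new HttpSource('http://example.com'));
$same = cacheKey(new HttpSource('http://example.com'));
$different = cacheKey(new HttpSource('http://example.com', 'POST'));

var_dump($first === $same, $first === $different);
```

Because the cache only depends on toArray, it needs no knowledge of any particular connector's options.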

Document static data imports

It is possible to use Porter to import data we already have using static imports via StaticDataImportSpecification. This brings with it the same post-import benefits as importing data over a network and is especially useful in testing.

CachingConnector is a poor user experience

Having to wrap a connector in CachingConnector just to use caching is not as easy to use as if the cache just worked with any connector. Moreover, cache + connector is a violation of SRP. The cache should be refactored as a separate entity, apart from connectors.

SingleRecord interface

Instead of requiring consumers to guess whether to use import() or importOne(), resources that emit only one record should implement a new SingleRecord interface. This clearly indicates that importOne() should be used, and we can use it to verify that the correct method has been called.

This provides a clear mechanism for data publishers to express intent and makes sense, because resources always know if they export one or multiple records, so they should have a way to express this.
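
The verification could work along the lines of the following sketch. SingleRecord is the proposed marker interface; the two resource classes and the sketchImport() and sketchImportOne() functions are hypothetical stand-ins for real resources and for Porter::import() and Porter::importOne(). Rejecting SingleRecord resources in import() is an assumption about how strict the check should be.

```php
<?php
declare(strict_types=1);

// Proposed marker interface for resources that emit exactly one record.
interface SingleRecord
{
}

// Hypothetical resource emitting many records.
final class ManyRatesResource
{
    public function fetch(): Iterator
    {
        yield ['currency' => 'USD'];
        yield ['currency' => 'GBP'];
    }
}

// Hypothetical resource emitting one record.
final class LatestRateResource implements SingleRecord
{
    public function fetch(): Iterator
    {
        yield ['currency' => 'USD'];
    }
}

// Stand-in for Porter::import().
function sketchImport($resource): Iterator
{
    if ($resource instanceof SingleRecord) {
        throw new LogicException('Resource emits a single record: use importOne().');
    }

    return $resource->fetch();
}

// Stand-in for Porter::importOne().
function sketchImportOne($resource): array
{
    if (!$resource instanceof SingleRecord) {
        throw new LogicException('Resource may emit many records: use import().');
    }

    return $resource->fetch()->current();
}

$one = sketchImportOne(new LatestRateResource());

try {
    sketchImportOne(new ManyRatesResource());
    $rejected = false;
} catch (LogicException $exception) {
    $rejected = true;
}

var_dump($one, $rejected);
```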

Enable cache substitution in connectors

There's no point in implementing PSR-6 caching interfaces if the default caching implementation cannot be changed. However, due to an oversight, none of the first-party connectors expose a method to change the cache implementation.

Document multiple instances of same provider

Although we normally add a provider to the container by its class name and expect a single instance of each provider in the container, there are many valid use cases for adding the same provider multiple times. Document these use cases with examples and how-tos.

Often, we may operate multiple accounts with a given provider for various reasons. Examples:

  • Multiple Stripe accounts for handling payments in different currencies
  • Multiple Discord bots to leverage separate request rate limits

Drop PHP 5 support

It is currently planned to drop support for PHP 5 and target either 7.0 or 7.1 for Porter v5.

Spawn a temporary SOAP server to test SoapConnector

The only file not fully tested, and thus preventing 100% code coverage, is SoapConnector. Its analogue, HttpConnector, is tested by the functional test, HttpConnectorTest, that spawns a temporary HTTP server using php -S to test the connector. In a similar fashion I suggest spawning a temporary SOAP server to test SoapConnector, however I do not know the best way to do this.

A question posted to StackOverflow asking how to write a minimum valid WSDL has received no answers.

Document FetchExceptionHandlers

After rewriting a 4000-word manual for Porter v4, I didn't really feel like writing about FetchExceptionHandlers. This feature will seldom be required, and for those who do need it, if they can't figure it out for themselves, the docblocks in the file should probably suffice. Nevertheless, we should document the interface properly at some point.

Add asynchronous fetch support

Performing many sub-imports synchronously is equivalent to queuing a series of I/O-bound operations whose total execution time is the sum of the individual execution times of all imports. By running sub-imports concurrently, we reduce the total execution time to that of the longest-running sub-import only. For highly concurrent sub-imports this is a significant time saving.
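
As a worked illustration of the arithmetic, with made-up durations and ignoring scheduling overhead, the totals compare as follows.

```php
<?php
declare(strict_types=1);

// Hypothetical durations of three sub-imports, in milliseconds.
$durations = [1200, 800, 2500];

// One after another: the total is the sum of all durations.
$sequentialTotal = array_sum($durations);

// Concurrently (idealized): the total is the longest duration only.
$concurrentTotal = max($durations);

printf("Sequential: %d ms, concurrent: %d ms\n", $sequentialTotal, $concurrentTotal);
```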

Rate Limiter

Any thoughts on adding some type of rate limiter functionality, so as not to clobber the servers?

[BC-BREAK] scriptfusion/retry 1.1.2

Hi,

When running Porter 3.*, retry 1.1.2 will be installed because of the following Composer requirement:

"scriptfusion/retry": "^1.1",

Porter works with retry 1.1.1, but upgrading to 1.1.2 breaks it.

Specific lines in the retry lib that are triggered:

if ($result instanceof \Generator) {
    throw new \UnexpectedValueException('Cannot retry a Generator. You probably meant something else.');
}

Porter triggers this because a generator is returned in Porter.php line 98:

function () use ($provider, $resource) {
    if (($records = $provider->fetch($resource)) instanceof \Iterator) {
        // Force generator to run until first yield to provoke an exception.
        $records->valid();
    }

    return $records; // <----- this breaks
},

Fix specification cloning in Porter::import()

The specification is cloned too late during import(): members of the specification are shared with other objects before cloning takes place, which creates shared mutable state. The specification must be cloned before any of its members are shared.
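
The ordering problem can be sketched in isolation. Handler and Spec are hypothetical simplifications, and the sketch assumes the specification deep-clones its members in __clone; the two functions contrast cloning after and before a member is handed out.

```php
<?php
declare(strict_types=1);

// Hypothetical mutable member of a specification.
final class Handler
{
    public $attempts = 0;
}

// Hypothetical specification that deep-clones its handler (assumption).
final class Spec
{
    private $handler;

    public function __construct()
    {
        $this->handler = new Handler();
    }

    public function __clone()
    {
        $this->handler = clone $this->handler;
    }

    public function getHandler(): Handler
    {
        return $this->handler;
    }
}

// Bug: the handler is shared before the specification is cloned,
// so the import mutates the caller's handler.
function importCloningLate(Spec $spec): void
{
    $handler = $spec->getHandler();
    $spec = clone $spec;
    ++$handler->attempts;
}

// Fix: clone first, then share only members of the clone.
function importCloningFirst(Spec $spec): void
{
    $spec = clone $spec;
    $handler = $spec->getHandler();
    ++$handler->attempts;
}

$late = new Spec();
importCloningLate($late);

$first = new Spec();
importCloningFirst($first);

printf("Late: %d, first: %d\n", $late->getHandler()->attempts, $first->getHandler()->attempts);
```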
