singer-io / singer-python
Writes the Singer format from Python

Home Page: https://singer.io

License: Apache License 2.0

Python 99.65% Makefile 0.35%

singer-python's People

Contributors

awm33, b-ryan, bi1yeu, bryantgray, ccapurso, cosimon, dmosorast, flash716, indigojump, iterati, jacobrobertbaca, joshtemple, judahrand, kallan357, karstendick, leslievandemark, luandy64, madlittlemods, mdelaurentis, nick-mccoy, psantacl, rushit0122

singer-python's Issues

Ratelimit helper does not support multi-threading or async

It seems to me that the ratelimiting helper function found in singer.utils.ratelimit is rather limited. Perhaps we should replace it with an implementation based on this rather excellent package, which does support multi-threading. There was also a PR open which added support for async, but for some reason it was closed (tomasbasham/ratelimit#35). Maybe if lack of maintenance is a concern, the whole package should be forked and maintained as part of the Singer.io project? It seems like a universally useful package!

Add UTF-8 validity checking to schema

For data-type "string", the _transform function just attempts to do str(data) and catches an exception to determine if the string is valid. Binary strings with null bytes or other invalid UTF-8 character sequences will pass through this function as valid strings. However, targets may expect strings to be valid encoded text, such as UTF-8.

UTF-8 encoding validation can be enforced with a pre_hook when calling transform, but this doesn't inform the target about the type of string. It'd be helpful to somehow include character encoding as part of the schema so that downstream targets can know what to expect and choose the appropriate data type. For example, MySQL has TEXT and BLOB types to separately handle text and binary strings. One natural place to put this could be the "format" parameter, though it'd be tedious to have to explicitly specify UTF-8 for every string when that is the default. It'd be convenient to have a way to make UTF-8 the default for all strings in a schema and override it with binary (the current behavior) explicitly for binary fields.
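As an illustration of the pre_hook approach, here is a minimal sketch. The hook signature follows the (data, typ, schema) shape that Transformer passes to pre_hook; the helper name is my own:

```python
def ensure_utf8(data, typ, schema):
    """pre_hook sketch: decode byte strings, rejecting invalid UTF-8."""
    if typ == 'string' and isinstance(data, bytes):
        # Raises UnicodeDecodeError for binary data that is not valid UTF-8,
        # instead of letting str(data) silently produce a repr like "b'...'"
        return data.decode('utf-8')
    return data
```

This enforces validity but, as noted above, still gives the target no schema-level hint about the encoding.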

Async support

Any plans to introduce async support into this library?

--catalog vs -p (properties) parameters

It might be me, but the use of the --catalog and -p (properties) parameters is somewhat unclear to me.

They are used intermixed throughout different sections of the docs (e.g. Allowing Users to Select Streams to Sync vs. Sync mode).

The source code mentions that -p is deprecated in favour of --catalog; however, replacing -p with --catalog leads to different behaviour: the "selected": true on a schema definition in the catalog file is not honoured by --catalog, but is by the -p parameter.

It would be helpful if this could be clarified a bit more in the docs.

Is the requirement for backoff==1.3.2 necessary?

Singer-Python is dependent on an older version of backoff (1.3.2). Is there actual functionality broken with newer versions of backoff (say 1.4.3)? Could those dependencies be remedied so singer-python would work with newer backoff versions?

Support for inline configuration strings

Sometimes when productionizing tap and target executions, it is inconvenient to rely on an actual configuration file stored on the system; it would be much easier to be able to pass such config as a JSON string in the command-line parameters.
So something like this

tap-mysql --config '{"host": "mysql-host.com", "port": "3306", "user": "$USR_PROD", "password": "$PWD_PROD"}'

One simple hack for supporting that would be changing this line to something like this

def load_json(path):
    try:
        # If the argument parses as JSON, treat it as an inline config string
        return json.loads(path)
    except ValueError:
        # Otherwise fall back to treating it as a path to a JSON file
        with open(path) as fil:
            return json.load(fil)

Feature Request: Add support for --stream_name argument

Proposed Feature Description:

As a user and developer of the Singer platform, I would LOVE to have access to a --stream-name argument in the standard/global tap CLI. When specified, a given tap would only extract data for the targeted stream. Essentially, this logic would intersect and further refine what the 'selected' attribute currently designates within the json file - but without having to edit JSON.

(For reference, my company's JSON catalog for Salesforce (tap-salesforce) is currently >300K lines of code.)

Cost of not having the feature:

The cost of not having this feature is that for large taps, there's no way to run one stream at a time without modifying very large and fragile json files. There's likewise no way to run multiple streams in parallel (which can be done if the stream name is passed as an argument), and there's no good way to retry/rerun just a single stream.

Similarly, during initial development and testing, if the 5th stream out of 9 fails (for instance), there's no way to start by running just the 5th stream. Or if, as a developer, I'm changing just the 9th stream, I have to rerun all streams just to test the final one.

Current Workaround:

In order to get the desired behavior today, we have created another program to wrap around the tap and target which takes as input: (1) a path to catalog_full.json and (2) a --stream_name argument specifying the name of the requested stream. With those inputs, the wrapper parses the full catalog and creates a temporary catalog file {{stream-name}}-catalog-tmp.json. The tap can then be executed for only the specified stream by passing the new stream-specific catalog file instead of the full catalog.
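The catalog-splitting step of that wrapper can be sketched as follows. The function and file names are hypothetical; it assumes the standard catalog layout, a top-level "streams" array whose entries carry a "stream" name:

```python
import json

def write_stream_catalog(full_catalog_path, stream_name, out_path):
    """Extract a single stream's entry from a full catalog file."""
    with open(full_catalog_path) as f:
        catalog = json.load(f)
    selected = [s for s in catalog['streams'] if s.get('stream') == stream_name]
    if not selected:
        raise ValueError('stream {!r} not found in catalog'.format(stream_name))
    # Write a stream-specific catalog that the tap can be pointed at
    with open(out_path, 'w') as f:
        json.dump({'streams': selected}, f)
```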

Additional Info:

I am willing and able to contribute code to this effort if the feature is accepted. ⚡️ Thanks!

Schema class does not support additionalProperties key

As a result, if you do Schema.from_dict it will drop the additionalProperties key and not output it during discovery. This would be a problem if you used the schemas from discovery as the schema when you are writing records during sync. But it also created a lot of confusion for me just now.

Bump pytz version to >= 2018.9

Hello ✋

Is there any good reason for the constraint pytz==2018.4? It makes singer-python incompatible with ZenPy, which requires at least 2018.9. The ZenPy library is also used in tap-zendesk (which is now pinned to a very old, buggy version, ZenPy==2.0.0).

So, can we bump the version of pytz?

Thanks in advance.

Transformer dumps JSON incompatible string

When Transformer recognizes the type to be str, it will convert the (sub) object to str type. The issue is, if such (sub) object's original type is dict, the current method of converting to str produces JSON incompatible string:

return True, str(data)

This results in the conversion from a dict

{'active': True, 'note': None}

to

"{'active': True, 'note': None}"

instead of

'{"active": true, "note": null}'

str(data) conversion seems to produce problems with escape characters as well.

I am wondering if it is acceptable to replace str(data) with json.dumps(data)
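The difference is easy to reproduce with the standard library:

```python
import json

data = {'active': True, 'note': None}

str(data)         # "{'active': True, 'note': None}" -- Python repr, not JSON
json.dumps(data)  # '{"active": true, "note": null}' -- valid JSON
```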

One may argue that the tap should fully specify the schema so that the (sub)object is written out as a dict. However, many REST APIs include fields whose schema is not static.
An example is the GitHub API's event object: event.payload is a (dict) object, but its schema depends on the event type.
https://docs.github.com/en/free-pro-team@latest/developers/webhooks-and-events/github-event-types#event-object-common-properties
In fact, I discovered this issue while I was debugging tap-github's usage of Transformer:
https://github.com/singer-io/tap-github/blob/master/tap_github/__init__.py#L361

If the str conversion was done through json.dumps, it would have been possible to parse JSON in the target datastore such as BigQuery and Redshift.

utils.strftime does not format the year properly

4Y is set as the date

>>> now = pytz.utc.localize(datetime.utcnow())
>>> now
datetime.datetime(2017, 12, 2, 19, 58, 13, 276787, tzinfo=<UTC>)
>>> utils.strftime(now)
'4Y-12-02T19:58:13.276787Z'

Looks like this was introduced by #52

`log_debug` cannot work...for very long

Problem

Singer-Python is the root repo for pretty much every tap and target, and the suggested way to log "out" is to use singer.get_logger() or one of the helpers of log_info, or...log_debug.

For target-postgres we use the DEBUG level for logging in tests, and for gaining more information for issues/bug reports etc.

To enable DEBUG logging, we have a single call out to singer.get_logger() followed up by setLevel('DEBUG') (loosely). This works pretty well up until get_logger() gets called again.

Once get_logger gets called again, the fileConfig code gets run again, and the root logger gets reset to the logging.conf and having the level set to INFO.

logging.config.fileConfig(path, disable_existing_loggers=False)

Once this happens, log_debug no longer works.
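The reset is reproducible with the standard library alone. The config below is a stand-in modelled on a typical logging.conf with the root logger at INFO, not necessarily the exact file singer-python ships:

```python
import logging
import logging.config
import tempfile
import textwrap

# Minimal stand-in for a logging.conf with the root logger at INFO
conf = textwrap.dedent('''\
    [loggers]
    keys=root

    [handlers]
    keys=console

    [formatters]
    keys=plain

    [logger_root]
    level=INFO
    handlers=console

    [handler_console]
    class=StreamHandler
    level=NOTSET
    formatter=plain
    args=(sys.stderr,)

    [formatter_plain]
    format=%(message)s
''')

with tempfile.NamedTemporaryFile('w', suffix='.conf', delete=False) as f:
    f.write(conf)
    path = f.name

root = logging.getLogger()
logging.config.fileConfig(path, disable_existing_loggers=False)  # first get_logger()
root.setLevel('DEBUG')                  # caller opts in to DEBUG logging
assert root.level == logging.DEBUG
logging.config.fileConfig(path, disable_existing_loggers=False)  # any later get_logger()
assert root.level == logging.INFO       # the DEBUG setting is silently reverted
```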

Question

Is there a suggested way to use/get DEBUG output while also leveraging singer-python?

...or...

Is this a 🐛?

Suggested Musical Pairing

https://soundcloud.com/winnetka-bowling-league/slow-dances

load_schema doesn't work

load_schema is supposed to load a schema from the schemas directory of a Tap or Target source tree. Unfortunately it doesn't work at all, because load_schema lives in singer-python and it doesn't know the absolute path of the caller's file. There may be a way to get the absolute path to the caller's file, but in the meantime we may want to just remove load_schema and get_abs_path, since these two functions don't do what they advertise.

Singer.get_logger issue

Hello,

I am trying to create a target using the "getting started" guide.
However, my program terminated during the import process of "singer" module.
Details are below:
Traceback (most recent call last):
  File "tap_ip.py", line 5, in <module>
    import singer
  File "/usr/local/lib/python3.6/site-packages/singer/__init__.py", line 8, in <module>
    from singer import transform
  File "/usr/local/lib/python3.6/site-packages/singer/transform.py", line 7, in <module>
    LOGGER = singer.get_logger()
AttributeError: module 'singer' has no attribute 'get_logger'

Any help is appreciated.

Thanks.

JSONSchema Draft 7 array Tuple Validation unsupported? Schema.from_dict() raise exception

From the JSON Schema draft 7.0 specification: array types can be used to validate tuples as follows:

{
  "type": "array",
  "items": [
    {
      "type": "something"
    },
    {
      "type": "otherthing"
    }
  ]
}

... and as such, the items property of a "type": "array" property can be a python list.

Using the above schema will raise an exception with singer-python==5.1.5 when entering the @singer.utils.handle_top_exception(LOGGER) decorator, although I also can see the error in the master version of the file: https://github.com/singer-io/singer-python/blob/master/singer/schema.py#L107 where the items variable is expected to be a dict (it can also be a list, as stated above)

I think I have a trivial fix which would be to add a isinstance(items, dict) check:

    @classmethod
    def from_dict(cls, data, **schema_defaults):
        '''Initialize a Schema object based on the JSON Schema structure.
        :param schema_defaults: The default values to the Schema
        constructor.'''
        kwargs = schema_defaults.copy()
        properties = data.get('properties')
        items = data.get('items')

        if properties is not None:
            kwargs['properties'] = {
                k: Schema.from_dict(v, **schema_defaults)
                for k, v in properties.items()
            }
        if items is not None and isinstance(items, dict):
            kwargs['items'] = Schema.from_dict(items, **schema_defaults)
        for key in STANDARD_KEYS:
            if key in data:
                kwargs[key] = data[key]
        return Schema(**kwargs)

and I'm happy to raise a PR for it. However I wanted to have the opinion of someone who's closer to the library to know if we even want to support Tuple Validation from the JSON Schema draft 7.0 specifications?

edit: made a bit more readable

use singer taps and targets programmatically

Hi, I'm participating in development of dataflows which has similar goals to your projects, and we would like to be able to integrate between the libraries - use singer taps / targets inside a data flow, and use a data flow as a singer tap / target (datahq/dataflows#16)

To enable this integration we need to be able to call singer taps / targets from Python code, this is easy to do using subprocess.Popen, see example here

I think it would be really useful to have this in a more standard way as part of the singer-python library.

Example: singer.read_tap

Install the tap: pip install tap-exchangeratesapi

Read from the tap:

>>> tap = singer.read_tap('exchangeratesapi', {"base": "ILS", "start_date": "2018-10-01"})
>>> for message in tap:
...     print(message)  # SchemaMessage / RecordMessage / StateMessage

Problem in README example?

I don't remember how I got here and I'm really not a python person, but I think there might be an issue with your README example (that taught me something!)

In python 3 (which the README says your project depends on), I don't think i is in scope in the write_state line? (Turns out it leaks in python 2 -- http://stackoverflow.com/a/4199355/387413 -- and probably how this was tested?)

Cheers! (PS: cool project)

Invalid format string %04Y on Windows

Invalid format string when using %04Y on Windows 10 Python 3.7.1

$ python
>>> from datetime import datetime
>>> datetime(90, 1, 1).strftime("%04Y")
ValueError: Invalid format string
>>> datetime.strptime("2018-10-31 22:29:29.553000", "%Y-%m-%d %H:%M:%S.%f").strftime("%04Y-%m-%dT%H:%M:%S.%fZ")
ValueError: Invalid format string

Some platforms support modifiers from POSIX 2008 (and others). On Linux the format "%04Y" assures a minimum of four characters and zero-padding. The internal code (as used on Windows and by default on macOS) uses zero-padding by default.

https://www.rdocumentation.org/packages/base/versions/3.5.1/topics/strptime#l_sections


Related issues

  • #81
  • %04Y introduced in #52
  • macOS support added in #69

Messages are not valid JSON

The message formatting utility https://github.com/singer-io/singer-python/blob/master/singer/messages.py#L222 uses the simplejson library. By default simplejson does not produce valid JSON:

If allow_nan is true (the default), then NaN, Infinity, and -Infinity will be encoded
as such. This behavior is not JSON specification compliant, but is consistent with
most JavaScript based encoders and decoders. Otherwise, it will be a ValueError
to encode such floats. See also ignore_nan for ECMA-262 compliant behavior.

I'm getting an error trying to parse record messages from tap-salesforce using NodeJS because it's producing invalid JSON that contains NaN.
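The stdlib json module has the same default (allow_nan=True), which makes the behaviour easy to demonstrate; simplejson additionally offers ignore_nan for the spec-compliant behaviour quoted above:

```python
import json

record = {'amount': float('nan')}

json.dumps(record)  # '{"amount": NaN}' -- not valid per the JSON spec
try:
    json.dumps(record, allow_nan=False)  # strict mode refuses to emit it
except ValueError as err:
    print(err)
```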

Incompatible taps & targets

At this point I've found it impossible to continue with any singer project, as basically no combination of taps & target work together due to any number of dependency errors.

pip install tap-shopify target-postgres

ERROR: target-postgres 1.1.3 has requirement singer-python==5.1.1, but you'll have singer-python 5.4.1 which is incompatible.

How about CSV?

pip install tap-shopify target-csv

ERROR: target-csv 0.3.0 has requirement singer-python==2.1.4, but you'll have singer-python 5.4.1 which is incompatible.

Maybe Klaviyo will work? Nope.

pip install tap-klaviyo target-postgres

ERROR: target-postgres 1.1.3 has requirement singer-python==5.1.1, but you'll have singer-python 3.2.1 which is incompatible.
ERROR: target-csv 0.3.0 has requirement singer-python==2.1.4, but you'll have singer-python 3.2.1 which is incompatible.
ERROR: tap-shopify 1.1.10 has requirement singer-python==5.4.1, but you'll have singer-python 3.2.1 which is incompatible.

Is there a plan to adopt semantic versioning (major/minor) so that packages can be updated to NOT rely on a specific version? Excited about the potential of singer, but disappointed in the number of roadblocks that pop up to get even a trivial example working.

5.9.0: Transformer.filter_data_by_metadata() doesn't filter unselected nodes where selected unspecified

Hi,

I'm attempting to write a tap, using singer.transform.Transformer.filter_data_by_metadata in order to filter the data.

    def filter_data_by_metadata(self, data, metadata):
        if isinstance(data, dict) and metadata:
            for field_name in list(data.keys()):
                selected = singer.metadata.get(metadata, ('properties', field_name), 'selected')
                inclusion = singer.metadata.get(metadata, ('properties', field_name), 'inclusion')
                if inclusion == 'automatic':
                    continue

                if selected is False:
                    data.pop(field_name, None)
                    # Track that a field was filtered because the customer
                    # didn't select it.
                    self.filtered.add(field_name)

                if inclusion == 'unsupported':
                    data.pop(field_name, None)
                    # Track that the field was filtered because the tap
                    # declared it as unsupported.
                    self.filtered.add(field_name)

        return data

This logic has two problems:

  • If selected is missing in the metadata the field is not filtered
  • selected-by-default is totally ignored

The expected behaviour would be the following:

  • selected set to True, do nothing
  • selected set to False, filter field
  • selected missing and selected-by-default set to True, do nothing
  • selected missing and selected-by-default set to False, filter field

Add giveup function for requests

Most of our Taps use a combination of the Python requests and backoff to make HTTP requests that retry with a backoff strategy. A typical Tap will have a bit of code that looks like this:

def giveup(error):
    response = error.response
    return not (response.status_code == 429 or
                response.status_code >= 500)


@backoff.on_exception(backoff.constant,
                      (requests.exceptions.RequestException),
                      jitter=backoff.random_jitter,
                      max_tries=5,
                      giveup=giveup,
                      interval=30)
def request(url, access_token, params={}):
    requests.request(...)

We've seen an issue with the Outbrain tap where a ConnectionException is raised because of a snapped connection. There is no HTTP response in this case, so the error argument to giveup has no response property, and giveup throws an exception when we try to access error.response.status_code.

This could be fixed with a simple change to giveup:

def giveup(error):
    response = error.response
    if response is None:
        return False
    return not (response.status_code == 429 or
                response.status_code >= 500)

I think this logic is getting complex enough that we should add an implementation of giveup that does something like the above into the singer-python library. If we don't, it's likely that every Tap will experience the same error trying to access properties on a null error.response object at some point.

However, I'm hesitant to add a hard dependency on requests and backoff. So I'm thinking that we should make a module called singer.requests that can contain helper functions like this one that are specific to the requests library. We won't need to modify setup.py to add a dependency on requests, and it's up to a Tap whether they want to import that module at all.

We should give this giveup function a specific name, like giveup_on_http_4xx_except_429, to make room for other giveup strategies.

I don't want to put the decorated request function in this library, because I think it's pretty likely that different Taps would want to use different backoff strategies.

So a Tap implementation would then look more like this:

@backoff.on_exception(backoff.constant,
                      (requests.exceptions.RequestException),
                      jitter=backoff.random_jitter,
                      max_tries=5,
                      giveup=giveup_on_http_4xx_except_429,
                      interval=30)
def request(url, access_token, params={}):
    requests.request(...)

Catalog.streams is generator expression instead of list

I am running into a problem with singer-python 3.5.2.

The catalog object works the first time the streams attribute is accessed but not in subsequent times.

I believe the issue is related to how the "streams" attribute is initialized.

It is a generator expression instead of a list.

This shows the problem:

import singer
c = singer.catalog.Catalog.load("catalog_categories.json")
c.to_dict() # this works
c.to_dict() # this returns an empty object : {'streams': []}
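The symptom is the usual single-use behaviour of generator expressions:

```python
streams = (s for s in ['users', 'orders'])  # generator expression, like Catalog.streams
list(streams)  # ['users', 'orders'] -- the first iteration consumes it
list(streams)  # [] -- a generator cannot be iterated twice
```

Initializing the attribute with a list comprehension instead would make repeated access safe.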

UTF-8 encoding woes

While using tap-pipedrive I noticed that the output (produced ultimately by format_message in messages.py, which calls simplejson with the default value of ensure_ascii=True) is encoded with Python's escaped unicode (a literal \u followed by 4 hexadecimal digits).

This confuses a lot of my later processing. I am not sure how to properly fix that later on.

I set PYTHONIOENCODING to utf-8 and it looks like the setting is working:

$ python -c'import sys; print(sys.stdout.encoding)'
utf8

The output from tap-pipedrive is unchanged, though.

A way to change the output encoding is to set ensure_ascii=False when calling simplejson.dumps. Would you accept a PR for that?
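The stdlib json module has the same ensure_ascii default, so the effect of the flag is easy to see:

```python
import json

json.dumps({'name': 'Müller'})                      # '{"name": "M\u00fcller"}' -- escaped
json.dumps({'name': 'Müller'}, ensure_ascii=False)  # '{"name": "Müller"}' -- raw UTF-8
```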
