singer-io / singer-python
Writes the Singer format from Python
Home Page: https://singer.io
License: Apache License 2.0
It seems to me that the rate-limiting helper function found in `singer.utils.ratelimit` is rather limited. Perhaps we should replace it with an implementation using this rather excellent package, which does support multithreading. There was also a PR open that added async support, but for some reason it was closed (tomasbasham/ratelimit#35). Maybe, if lack of maintenance is a concern, the whole package should be forked and maintained as part of the Singer.io project? It seems like a universally useful package!
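To illustrate what "supports multithreading" means in practice, here is a minimal stdlib-only sliding-window limiter sketch (the class name and numbers are illustrative; this is neither the singer-python helper nor the ratelimit package's API):

```python
import threading
import time

class SlidingWindowLimiter:
    """Minimal thread-safe rate limiter: at most `calls` per `period` seconds."""

    def __init__(self, calls, period):
        self.calls = calls
        self.period = period
        self._lock = threading.Lock()
        self._stamps = []  # admission times of recent calls

    def acquire(self):
        with self._lock:
            now = time.monotonic()
            # Forget calls that have fallen out of the window.
            self._stamps = [t for t in self._stamps if now - t < self.period]
            wait = 0.0
            if len(self._stamps) >= self.calls:
                # Sleep until the oldest call leaves the window.
                wait = self.period - (now - self._stamps[0])
            self._stamps.append(now + wait)
        if wait > 0:
            time.sleep(wait)
```

The lock around the timestamp bookkeeping is the part the current single-threaded helper lacks.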
For data-type `"string"`, the `_transform` function just attempts to do `str(data)` and catches an exception to determine if the string is valid. Binary strings with null bytes or other invalid UTF-8 character sequences will pass through this function as valid strings. However, targets may expect strings to be valid encoded text, such as UTF-8.
UTF-8 encoding validation can be enforced with a `pre_hook` when calling transform, but this doesn't inform the target about the type of string. It'd be helpful to somehow include character encoding as part of the schema so that downstream targets can know what to expect and choose the appropriate data type. For example, MySQL has `TEXT` and `BLOB` types to separately handle text and binary strings. One natural place to put this could be the `"format"` parameter, though it'd be tedious to have to explicitly specify UTF-8 for every string when that is the default. It'd be convenient to have a way to make UTF-8 the default for all strings in a schema and override it with binary (the current behavior) explicitly for binary fields.
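As a sketch of the validation half, a hook could decode byte strings before transformation (an assumption for illustration: the hook receives the value, a type name, and the schema; the function name is hypothetical):

```python
def ensure_utf8(data, typ, schema):
    """Illustrative pre_hook: only admit byte strings that decode as UTF-8."""
    if typ == "string" and isinstance(data, bytes):
        # Raises UnicodeDecodeError on invalid UTF-8 sequences that the
        # current str(data) check happily lets through.
        return data.decode("utf-8")
    return data
```

This enforces encoding at transform time, but as noted above it still does not tell the target anything about the string's type.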
It seems to be related to some of the recent date handling changes. 2.4.1 works fine but 3.5.2 fails with AttributeError: 'str' object has no attribute 'tostr'.
Attached are the schema and a sample data element that produce this error.
data_and_schema.json.zip
Any plans to introduce async support into this library?
It might be me, but the use of the `--catalog` and `--properties` (`-p`) parameters is somewhat unclear to me.
They are used interchangeably throughout different sections of the docs (e.g. Allowing Users to Select Streams to Sync vs. Sync mode).
The source code mentions that `-p` is deprecated in favour of `--catalog`; however, replacing `-p` with `--catalog` leads to different behaviour: `"selected": true` on a schema definition in the catalog file is honoured by `-p` but not by `--catalog`.
It would be helpful if this could be clarified a bit more in the docs.
Singer-Python is dependent on an older version of backoff (1.3.2). Is there actual functionality broken with newer versions of backoff (say 1.4.3)? Could those dependencies be remedied so singer-python would work with newer backoff versions?
Sometimes, when productionizing tap and target executions, it is inconvenient to have to rely on an actual configuration file stored on the system; it would be much easier to pass the config as a JSON string in the command line parameters.
So something like this:
tap-mysql --config '{"host": "mysql-host.com", "port": "3306", "user": "$USR_PROD", "password": "$PWD_PROD"}'
One simple hack for supporting that would be changing this line to something like this:
def load_json(path):
    try:
        # If the argument parses as JSON, treat it as an inline config string.
        return json.loads(path)
    except ValueError:
        # Otherwise treat it as a path to a JSON file.
        with open(path) as fil:
            return json.load(fil)
As a user and developer of the Singer platform, I would LOVE to have access to a `--stream-name` argument in the standard/global tap CLI. When specified, a given tap would only extract data for the targeted stream. Essentially, this logic would intersect and further refine what the 'selected' attribute currently designates within the JSON file, but without having to edit JSON.
(For reference, my company's JSON catalog for Salesforce (`tap-salesforce`) is currently >300K lines.)
The cost of not having this feature is that for large taps, there's no way to run one stream at a time without modifying very large and fragile JSON files. There's likewise no way to run multiple streams in parallel (which can be done if the stream name is passed as an argument), and there's no good way to retry/rerun just a single stream.
Similarly, during initial development and testing, if the 5th stream out of 9 fails (for instance), there's no way to start by running just the 5th stream. Or if, as a developer, I'm changing just the 9th stream, I have to rerun all streams just to test the final one.
To get the desired behavior today, we have created another program to wrap around the tap and target, which takes as input (1) a path to `catalog_full.json` and (2) a `--stream_name` argument specifying the name of the requested stream. With those inputs, the wrapper parses the full catalog and creates a temporary catalog file `{{stream-name}}-catalog-tmp.json`. The tap can then be executed for only the specified stream by passing the new stream-specific catalog file instead of the full catalog.
I am willing and able to contribute code to this effort if the feature is accepted. ⚡️ Thanks!
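The wrapper's core step can be sketched in a few lines (the function name is illustrative; the assumption is the standard Singer catalog shape, where each entry in "streams" carries a "stream" name):

```python
import json

def catalog_for_stream(full_catalog_path, stream_name):
    """Illustrative wrapper helper: reduce a full catalog to one stream."""
    with open(full_catalog_path) as f:
        catalog = json.load(f)
    streams = [s for s in catalog.get("streams", [])
               if s.get("stream") == stream_name]
    if not streams:
        raise ValueError("stream %r not found in catalog" % stream_name)
    return {"streams": streams}
```

The reduced dict can then be written to a temporary file and passed to the tap in place of the full catalog.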
As a result, if you do `Schema.from_dict`, it will drop the `additionalProperties` key and not output it during discovery. This would be a problem if you used the schemas from discovery as the schema when writing records during sync. It also created a lot of confusion for me just now.
Hello ✋
Is there any good reason for the constraint `pytz==2018.4`? It makes singer-python incompatible with ZenPy, which requires at least 2018.9. The ZenPy library is also used in `tap-zendesk` (which currently pins a very old, buggy ZenPy==2.0.0).
So, can we bump the version of pytz?
Thanks in advance.
When Transformer recognizes the type to be str, it converts the (sub-)object to str type. The issue is, if such a (sub-)object's original type is dict, the current method of converting to str produces a JSON-incompatible string:
singer-python/singer/transform.py, line 288 in 6472683
This results in the conversion from a dict
{'active': True, 'note': None}
to
"{'active': True, 'note': None}"
instead of
'{"active": true, "note": null}'
The `str(data)` conversion seems to produce problems with escape characters as well.
I am wondering if it is acceptable to replace `str(data)` with `json.dumps(data)`.
One may argue that the tap should fully specify the schema so that the (sub-)object is written out as a dict. However, many REST APIs include fields whose schema is not static.
An example is the GitHub API's event object: event.payload is a (dict) object, but its schema depends on the event type.
https://docs.github.com/en/free-pro-team@latest/developers/webhooks-and-events/github-event-types#event-object-common-properties
In fact, I discovered this issue while debugging tap-github's usage of Transformer:
https://github.com/singer-io/tap-github/blob/master/tap_github/__init__.py#L361
If the str conversion were done through json.dumps, it would be possible to parse the JSON in the target datastore, such as BigQuery or Redshift.
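The difference is easy to reproduce with the standard library:

```python
import json

payload = {'active': True, 'note': None}

# str() yields a Python repr that a JSON parser rejects:
try:
    json.loads(str(payload))
except json.JSONDecodeError:
    pass  # single quotes and 'True'/'None' are not valid JSON

# json.dumps() yields JSON that round-trips cleanly:
assert json.dumps(payload) == '{"active": true, "note": null}'
assert json.loads(json.dumps(payload)) == payload
```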
I would expect the `anyOf` property of a `Schema` object to be a list of `Schema` objects; however, as you can currently see, the `Schema.to_dict()` function does not deal with this case!
This means that `singer.catalog.write_catalog()` fails.
`4Y` is set as the year in the formatted date:
>>> now = pytz.utc.localize(datetime.utcnow())
>>> now
datetime.datetime(2017, 12, 2, 19, 58, 13, 276787, tzinfo=<UTC>)
>>> utils.strftime(now)
'4Y-12-02T19:58:13.276787Z'
Looks like this was introduced by #52
`singer-python` is the root repo for pretty much every tap and target, and the suggested way to log "out" is to use `singer.get_logger()` or one of the helpers such as `log_info` or `log_debug`.
For `target-postgres` we use the `DEBUG` level for logging in tests, and for gaining more information for issues/bug reports etc.
To enable `DEBUG` logging, we have a single call out to `singer.get_logger()` followed by `setLevel('DEBUG')` (loosely). This works pretty well up until `get_logger()` gets called again.
Once `get_logger` gets called again, the `fileConfig` code gets run again, and the root logger gets reset to `logging.conf`, with the level set back to `INFO`:
logging.config.fileConfig(path, disable_existing_loggers=False)
Once this happens, `log_debug` no longer works.
Is there a suggested way to use/get `DEBUG` output while also leveraging `singer-python`?
...or...
Is this a 🐛?
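One possible shape of a fix, sketched with the stdlib only (this is not the actual singer-python implementation; `basicConfig` stands in for the `fileConfig(path, ...)` call): apply the config once, so a caller's `setLevel` survives later calls.

```python
import logging

_CONFIGURED = False

def get_logger():
    """Idempotent sketch: load the logging config only on the first call."""
    global _CONFIGURED
    if not _CONFIGURED:
        logging.basicConfig(level=logging.INFO)  # stand-in for fileConfig(path)
        _CONFIGURED = True
    return logging.getLogger("singer")
```

With this shape, a later `get_logger()` no longer resets a level the caller raised to `DEBUG`.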
`load_schema` is supposed to load a schema from the `schemas` directory of a Tap or Target source tree. Unfortunately it doesn't work at all, because `load_schema` lives in `singer-python` and doesn't know the absolute path of the caller's file. There may be a way to get the absolute path to the caller's file, but in the meantime we may want to just remove `load_schema` and `get_abs_path`, since these two functions don't do what they advertise.
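For what it's worth, the caller's path can be recovered from the call stack; a hedged sketch (stack inspection is fragile under decorators and frozen apps, so this is an illustration, not a proposed API):

```python
import inspect
import os

def caller_dir():
    """Return the directory of the file that called this function,
    by looking one frame up the call stack."""
    caller_file = inspect.stack()[1].filename
    return os.path.dirname(os.path.abspath(caller_file))
```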
Hello,
I am trying to create a target using the "getting started" guide.
However, my program terminated while importing the "singer" module. Details below:
Traceback (most recent call last):
  File "tap_ip.py", line 5, in <module>
    import singer
  File "/usr/local/lib/python3.6/site-packages/singer/__init__.py", line 8, in <module>
    from singer import transform
  File "/usr/local/lib/python3.6/site-packages/singer/transform.py", line 7, in <module>
    LOGGER = singer.get_logger()
AttributeError: module 'singer' has no attribute 'get_logger'
Any help is appreciated.
Thanks.
From the JSON Schema draft 7.0 specification: array types can be used to validate tuples, like so:
{
  "type": "array",
  "items": [
    {
      "type": "something"
    },
    {
      "type": "otherthing"
    }
  ]
}
...and as such, the `items` property of a `"type": "array"` property can be a Python `list`.
Using the above schema will raise an exception with singer-python==5.1.5 when entering the `@singer.utils.handle_top_exception(LOGGER)` decorator, although I can also see the error in the `master` version of the file: https://github.com/singer-io/singer-python/blob/master/singer/schema.py#L107 where the `items` variable is expected to be a `dict` (it can also be a `list`, as stated above).
I think I have a trivial fix, which would be to add an `isinstance(items, dict)` check:
@classmethod
def from_dict(cls, data, **schema_defaults):
    '''Initialize a Schema object based on the JSON Schema structure.

    :param schema_defaults: The default values to the Schema constructor.'''
    kwargs = schema_defaults.copy()
    properties = data.get('properties')
    items = data.get('items')
    if properties is not None:
        kwargs['properties'] = {
            k: Schema.from_dict(v, **schema_defaults)
            for k, v in properties.items()
        }
    if items is not None and isinstance(items, dict):
        kwargs['items'] = Schema.from_dict(items, **schema_defaults)
    for key in STANDARD_KEYS:
        if key in data:
            kwargs[key] = data[key]
    return Schema(**kwargs)
and I'm happy to raise a PR for it. However, I wanted the opinion of someone closer to the library: do we even want to support Tuple Validation from the JSON Schema draft 7.0 specification?
edit: made a bit more readable
Hi, I'm participating in the development of dataflows, which has similar goals to your project, and we would like to be able to integrate the libraries: use singer taps / targets inside a data flow, and use a data flow as a singer tap / target (datahq/dataflows#16).
To enable this integration we need to be able to call singer taps / targets from Python code. This is easy to do using subprocess.Popen; see example here.
I think it would be really useful to have this in a more standard way as part of the singer-python library.
Install the tap: pip install tap-exchangeratesapi
Read from the tap:
>>> tap = singer.read_tap('exchangeratesapi', {"base": "ILS", "start_date": "2018-10-01"})
>>> for message in tap:
...     print(message)  # SchemaMessage / RecordMessage / StateMessage
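A minimal sketch of the subprocess approach such a helper could wrap (the function name is illustrative and not part of singer-python; it assumes the tap writes one JSON message per stdout line, as Singer taps do):

```python
import json
import subprocess

def iter_tap_messages(argv):
    """Run a tap command line and yield each emitted Singer message as a dict."""
    proc = subprocess.Popen(argv, stdout=subprocess.PIPE, text=True)
    try:
        for line in proc.stdout:
            line = line.strip()
            if line:
                yield json.loads(line)
    finally:
        proc.stdout.close()
        proc.wait()
```

Usage would look like `iter_tap_messages(["tap-exchangeratesapi", "--config", "config.json"])`.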
I don't remember how I got here and I'm really not a python person, but I think there might be an issue with your README example (that taught me something!)
In Python 3 (which the README says your project depends on), I don't think `i` is in scope on the `write_state` line? (It turns out it leaks in Python 2 -- http://stackoverflow.com/a/4199355/387413 -- which is probably how this was tested.)
Cheers! (PS: cool project)
The list of allowed JSON STANDARD_KEYS currently reflects only the Draft 4 JSON Schema and was last updated in 2018 (#92).
It would be very helpful if the list could be updated to include more recent keys -- such as `required`, `minProperties`, and `maxProperties` -- that are now available in Draft 7.
Thanks!
Invalid format string when using `%04Y` on Windows 10, Python 3.7.1:
$ python
>>> from datetime import datetime
>>> datetime(90, 1, 1).strftime("%04Y")
ValueError: Invalid format string
>>> datetime.strptime("2018-10-31 22:29:29.553000", "%Y-%m-%d %H:%M:%S.%f").strftime("%04Y-%m-%dT%H:%M:%S.%fZ")
ValueError: Invalid format string
From the R `strptime` documentation: "Some platforms support modifiers from POSIX 2008 (and others). On Linux the format "%04Y" assures a minimum of four characters and zero-padding. The internal code (as used on Windows and by default on macOS) uses zero-padding by default."
https://www.rdocumentation.org/packages/base/versions/3.5.1/topics/strptime#l_sections
The message formatting utility https://github.com/singer-io/singer-python/blob/master/singer/messages.py#L222 uses the simplejson library. By default, simplejson does not produce valid JSON; from its documentation:
    If allow_nan is true (the default), then NaN, Infinity, and -Infinity will be encoded
    as such. This behavior is not JSON specification compliant, but is consistent with
    most JavaScript based encoders and decoders. Otherwise, it will be a ValueError
    to encode such floats. See also ignore_nan for ECMA-262 compliant behavior.
I'm getting an error trying to parse record messages from `tap-salesforce` using NodeJS, because it's producing invalid JSON that contains NaN.
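The standard-library `json` module has the same `allow_nan` default, so the behavior is easy to reproduce; passing `allow_nan=False` makes the encoder raise instead of emitting invalid JSON:

```python
import json

# Default: NaN is serialized as the bare token `NaN`, which is not valid JSON.
assert json.dumps({"value": float("nan")}) == '{"value": NaN}'

# With allow_nan=False the encoder fails fast instead:
try:
    json.dumps({"value": float("nan")}, allow_nan=False)
except ValueError:
    pass
```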
At this point I've found it impossible to continue with any Singer project, as basically no combination of taps & targets works together, due to any number of dependency errors.
pip install tap-shopify target-postgres
ERROR: target-postgres 1.1.3 has requirement singer-python==5.1.1, but you'll have singer-python 5.4.1 which is incompatible.
How about CSV?
pip install tap-shopify target-csv
ERROR: target-csv 0.3.0 has requirement singer-python==2.1.4, but you'll have singer-python 5.4.1 which is incompatible.
Maybe Klaviyo will work? Nope.
pip install tap-klaviyo target-postgres
ERROR: target-postgres 1.1.3 has requirement singer-python==5.1.1, but you'll have singer-python 3.2.1 which is incompatible.
ERROR: target-csv 0.3.0 has requirement singer-python==2.1.4, but you'll have singer-python 3.2.1 which is incompatible.
ERROR: tap-shopify 1.1.10 has requirement singer-python==5.4.1, but you'll have singer-python 3.2.1 which is incompatible.
Is there a plan to adopt semantic versioning (major/minor) so that packages can be updated to NOT rely on a specific version? Excited about the potential of singer, but disappointed in the number of roadblocks that pop up to get even a trivial example working.
Hi,
I'm attempting to write a tap, using `singer.transform.Transformer.filter_data_by_metadata` in order to filter the data.
def filter_data_by_metadata(self, data, metadata):
    if isinstance(data, dict) and metadata:
        for field_name in list(data.keys()):
            selected = singer.metadata.get(metadata, ('properties', field_name), 'selected')
            inclusion = singer.metadata.get(metadata, ('properties', field_name), 'inclusion')
            if inclusion == 'automatic':
                continue
            if selected is False:
                data.pop(field_name, None)
                # Track that a field was filtered because the customer
                # didn't select it.
                self.filtered.add(field_name)
            if inclusion == 'unsupported':
                data.pop(field_name, None)
                # Track that the field was filtered because the tap
                # declared it as unsupported.
                self.filtered.add(field_name)
    return data
This logic has two problems:
- if `selected` is missing in the metadata, the field is not filtered
- `selected-by-default` is totally ignored

The expected behaviour would be the following:
- `selected` set to True: do nothing
- `selected` set to False: filter the field
- `selected` missing and `selected-by-default` set to True: do nothing
- `selected` missing and `selected-by-default` set to False: filter the field

Line 12 in 3ddbcc6
For example, pytz==2018.5 is out. It would be nice to change `==` to `>=` for version dependencies. I use pipenv to pin application versions but like to leave library versions looser.
Most of our Taps use a combination of the Python requests and backoff libraries to make HTTP requests that retry with a backoff strategy. A typical Tap will have a bit of code that looks like this:
def giveup(error):
    response = error.response
    return not (response.status_code == 429 or
                response.status_code >= 500)

@backoff.on_exception(backoff.constant,
                      (requests.exceptions.RequestException),
                      jitter=backoff.random_jitter,
                      max_tries=5,
                      giveup=giveup,
                      interval=30)
def request(url, access_token, params={}):
    requests.request(...)
We've seen an issue with the Outbrain tap where a ConnectionError is raised because of a snapped connection. There is no HTTP response in this case, so the `error` argument to `giveup` has `response` set to None, and `giveup` throws an exception when we try to access `error.response.status_code`.
This could be fixed with a simple change to `giveup`:
def giveup(error):
    response = error.response
    if response is None:
        return False
    return not (response.status_code == 429 or
                response.status_code >= 500)
I think this logic is getting complex enough that we should add an implementation of `giveup` that does something like the above into the `singer-python` library. If we don't, it's likely that every Tap will eventually hit the same error trying to access properties on a null `error.response` object.
However, I'm hesitant to add a hard dependency on `requests` and `backoff`. So I'm thinking that we should make a module called `singer.requests` that can contain helper functions like this one that are specific to the requests library. We won't need to modify `setup.py` to add a dependency on `requests`, and it's up to a Tap whether they want to import that module at all.
We should give this `giveup` function a specific name, like `giveup_on_http_4xx_except_429`, to make room for other giveup strategies.
I don't want to put the decorated `request` function in this library, because I think it's pretty likely that different Taps would want to use different backoff strategies.
So a Tap implementation would then look more like this:
@backoff.on_exception(backoff.constant,
                      (requests.exceptions.RequestException),
                      jitter=backoff.random_jitter,
                      max_tries=5,
                      giveup=giveup_on_http_4xx_except_429,
                      interval=30)
def request(url, access_token, params={}):
    requests.request(...)
I am running into a problem with singer-python 3.5.2.
The catalog object works the first time the streams attribute is accessed, but not on subsequent accesses.
I believe the issue is related to how the "streams" attribute is initialized: it is a generator expression instead of a list.
This shows the problem:
import singer
c = singer.catalog.Catalog.load("catalog_categories.json")
c.to_dict() # this works
c.to_dict() # this returns an empty object : {'streams': []}
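The underlying pitfall is that a generator expression is exhausted after one pass, so anything that iterates it a second time sees nothing:

```python
# A generator can only be consumed once...
streams = (name.upper() for name in ["users", "orders"])
assert list(streams) == ["USERS", "ORDERS"]
assert list(streams) == []  # second pass: already exhausted

# ...whereas a list supports repeated iteration, which is the likely fix.
streams = [name.upper() for name in ["users", "orders"]]
assert list(streams) == ["USERS", "ORDERS"]
assert list(streams) == ["USERS", "ORDERS"]
```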
While using tap-pipedrive I noticed that the output produced -- ultimately by `format_message` in `messages.py`, which uses simplejson with the default value of `ensure_ascii=True` -- is encoded in Python's escaped-unicode form (a literal \u followed by 4 hexadecimal digits).
This confuses a lot of my later processing, and I am not sure how to properly fix it downstream.
I set PYTHONIOENCODING to utf-8 and it looks like the setting is working:
$ python -c'import sys; print(sys.stdout.encoding)'
utf8
The output from tap-pipedrive is unchanged, though.
A way to change the output encoding is to set `ensure_ascii=False` when calling simplejson.dumps. Would you accept a PR for that?
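The effect is the same with the standard-library `json` encoder:

```python
import json

record = {"name": "Müller"}

# Default ensure_ascii=True escapes every non-ASCII character:
assert json.dumps(record) == '{"name": "M\\u00fcller"}'

# ensure_ascii=False keeps the output as readable UTF-8 text:
assert json.dumps(record, ensure_ascii=False) == '{"name": "Müller"}'
```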