networktocode / diffsync
A utility library for comparing and synchronizing different datasets.
Home Page: https://diffsync.readthedocs.io/
License: Other
A Diff object should be able to provide a summary of its contents (i.e., the number of objects that would be created/updated/deleted if this diff were used for a synchronization between systems).
Logging, usability.
Currently the print_detailed APIs on DSync, DSyncModel, Diff, and DiffElement print to stdout when called. These should be refactored so that they instead construct and return an assembled string, which the caller can then print(), log.debug(), etc. as desired.
The current functionality is useful for debugging but not for integrated use cases where logging would be more appropriate.
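One way the refactor could look, sketched on a stand-in class (the method name str_detailed() is an assumption, not the actual diffsync API):

```python
class DiffElement:
    """Minimal stand-in for diffsync.diff.DiffElement, illustrating the refactor."""

    def __init__(self, obj_type, name):
        self.type = obj_type
        self.name = name

    def str_detailed(self, indent=0):
        """Build and return the detailed output instead of printing it."""
        return f"{' ' * indent}{self.type}: {self.name}"

    def print_detailed(self, indent=0):
        """Kept for backwards compatibility; now delegates to str_detailed()."""
        print(self.str_detailed(indent=indent))
```

The caller can then do log.debug(element.str_detailed()) instead of being forced into stdout.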
I found that the most recent version 1.4.3 of diffsync probably introduces a breaking change which is not recorded in the release notes: the removal of Enum from DiffSyncActions's base classes. This change causes backwards incompatibility:
- DiffSyncActions is no longer iterable.
- Enum members such as name and value are no longer accessible on DiffSyncActions members.
To make it easier to understand the impact of the breaking change, I wrote a short code snippet that reproduces it.
The following code runs well in 1.4.2 but crashes in 1.4.3.
import diffsync.enum
from enum import Enum

print(issubclass(diffsync.enum.DiffSyncActions, Enum))
try:
    print(list(diffsync.enum.DiffSyncActions))
except Exception as ex:
    print(ex)
print(diffsync.enum.DiffSyncActions.CREATE.name)
Output in diffsync 1.4.2:
True
[<DiffSyncActions.CREATE: 'create'>, <DiffSyncActions.UPDATE: 'update'>, <DiffSyncActions.DELETE: 'delete'>, <DiffSyncActions.NO_CHANGE: None>]
CREATE
Output in diffsync 1.4.3:
False
'type' object is not iterable
Traceback (most recent call last):
File "a.py", line 9, in <module>
print(diffsync.enum.DiffSyncActions.CREATE.name)
AttributeError: 'str' object has no attribute 'name'
Maybe recording these changes in the release notes would help avoid user confusion when updating to this version?
Kind regards,
Currently the functions sync_to/sync_from do not return anything, whether or not the sync completed.
It would be useful to return at least the status of the sync, and eventually the diff that was generated by the function.
Developer experience
dsync is not available as a package name on PyPI, so we need to give the package another name. We could use:
diff-sync
diffsync
ntc-dsync
My preference would be to use diff-sync for the package name and keep the name in Python as dsync.
Add a mandatory name attribute to each DSync object and pass this name to the Diff object
Currently during a diff we are missing a user-friendly identifier to indicate what we are comparing and where some objects are missing.
Right now the diff uses generic SOURCE and DEST identifiers, but it's not always clear which one is SOURCE and which one is DEST.
With a name clearly defined for each object, it will be easier to identify where a given piece of data is coming from.
A DSync subclass, or an instance thereof, should be able to specify a preferred Diff subclass and have its diff_from/diff_to APIs automatically return an instance of this subclass instead of the base Diff class.
This is needed for the SOT Sync project to allow the creation of Diff
subclass instances that are serializable to the NetBox ORM.
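A possible shape for this, sketched with stand-in classes (the diff_class attribute name is an assumption):

```python
class Diff:
    """Minimal stand-in for diffsync.diff.Diff."""


class NetBoxDiff(Diff):
    """Hypothetical Diff subclass whose elements serialize to the NetBox ORM."""


class Adapter:
    """Stand-in for a DSync subclass."""

    diff_class = Diff  # default; subclasses may override with a preferred Diff subclass

    def diff_from(self, source):
        # diff_from()/diff_to() consult self.diff_class rather than
        # hardcoding the base Diff class when building the result.
        return self.diff_class()


class NetBoxAdapter(Adapter):
    diff_class = NetBoxDiff
```

With this pattern, callers of diff_from() transparently get the preferred subclass.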
(This didn't seem to fit either issue template so I'm not using one, sorry!)
I think enabling the Discussions feature on GitHub for this project would be beneficial to the project overall. I, for one, have ideas and questions regarding diffsync, and the NTC Slack is just too ephemeral to hash them out meaningfully; they also don't make sense as "Issues". Plus, those Slack discussions will be lost to future diffsync users who are going through the same discoveries.
Thanks for doing what you do!
When we publish a new release, the pipeline fails to publish the new version to GitHub and PyPI.
https://github.com/networktocode/diffsync/actions/runs/1746029264
Improve the documentation, specifically around the usage of the FLAGS for the DiffSync class and DiffSyncModel class.
For reference, the flags are defined here: https://github.com/networktocode/diffsync/blob/master/diffsync/enum.py
diffsync 1.4.1
Python 3.9.10
......
File "/Users/x/pwsync/.venv/lib/python3.9/site-packages/pwsync/sync.py", line 10, in <module>
from diffsync.logging import enable_console_logging
File "/Users/x/pwsync/.venv/lib/python3.9/site-packages/diffsync/logging.py", line 22, in <module>
from packaging import version
ModuleNotFoundError: No module named 'packaging'
There should be no need to make my script depend on packaging; it should be handled as a transitive dependency of the diffsync module.
It seems the diffsync module is not declaring a dependency on the packaging module.
It seems triggered by this code:
from diffsync.logging import enable_console_logging
It would be very useful to have a few more model flags to control which CRUD methods (create/update/delete) would be called during a sync().
I would like to propose:
CRUD_NO_UPDATE: Do not call update() on the DiffSyncModel during sync(); the model and the changes will still be visible in the diff.
CRUD_NO_DELETE: Do not call delete() on the DiffSyncModel during sync(); the model and the changes will still be visible in the diff.
CRUD_NO_UPDATE_DELETE = CRUD_NO_UPDATE | CRUD_NO_DELETE
I wish we could support CRUD_NO_CREATE, but I don't think this is possible right now because we can't pass context to this method since the model doesn't exist yet.
The main use case for me would be to protect some objects as READ_ONLY while still having them show up in the diff, while other objects of the same type remain READ_WRITE.
As an example:
When using the network-importer, once we have done the initial import of the SOT and the data has been cleaned up, it would be useful to protect some objects from being updated/deleted in the SOT while still having these objects show up in the diff.
Today for this use case we use the IGNORE flag, but the object is then completely ignored and doesn't show up in the diff at all.
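The proposal could be sketched as flag members (the bit values and the existing IGNORE member are illustrative; the CRUD_NO_* members are the proposal):

```python
from enum import IntFlag


class DiffSyncModelFlags(IntFlag):
    """Sketch only; real member values may differ."""

    NONE = 0b0
    IGNORE = 0b1  # existing behavior: hide the model from the diff entirely
    CRUD_NO_UPDATE = 0b10  # proposed: show in the diff, but skip update() during sync()
    CRUD_NO_DELETE = 0b100  # proposed: show in the diff, but skip delete() during sync()
    CRUD_NO_UPDATE_DELETE = CRUD_NO_UPDATE | CRUD_NO_DELETE


# sync() would consult the flags before invoking the CRUD method:
flags = DiffSyncModelFlags.CRUD_NO_UPDATE_DELETE
skip_update = bool(flags & DiffSyncModelFlags.CRUD_NO_UPDATE)
skip_delete = bool(flags & DiffSyncModelFlags.CRUD_NO_DELETE)
```

Because IntFlag members combine with `|`, the combined CRUD_NO_UPDATE_DELETE flag falls out for free.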
We could have an option for DiffSync to refresh data model contents from the underlying backend system or dataset after doing a create, update, or delete operation (e.g. in _sync_from_diff_element()), so as to verify that the operation was actually reflected in the backend.
With this option enabled, DiffSync could report a create() as failed if no underlying record was actually created, a create() or update() as incomplete if some attributes were not set correctly, a delete() as failed if the underlying record still exists, etc.
This option should probably be off by default for performance reasons; also this would be mostly used as a debugging tool during development.
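The verification pass could look roughly like this (a sketch with a fake backend; the status names "failed"/"incomplete"/"success" are assumptions):

```python
class FakeBackend:
    """Stand-in for a backend system, keyed by unique id."""

    def __init__(self, records):
        self.records = records

    def get(self, uid):
        return self.records.get(uid)


def verify_create(backend, uid, wanted_attrs):
    """Re-read the record after create() and classify the outcome."""
    record = backend.get(uid)
    if record is None:
        return "failed"  # no underlying record was actually created
    if any(record.get(key) != value for key, value in wanted_attrs.items()):
        return "incomplete"  # some attributes were not set correctly
    return "success"
```

The same re-read pattern applies to update() (check attributes) and delete() (the record must be absent).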
When developing a new adapter and associated DiffSyncModel classes, the default model implementations of create, update, and delete report success and update their local status without actually interacting with the backend in any way. If these methods are left unimplemented, or only partially implemented (creating/updating the uid keys of a model without setting its optional attributes, for example), this can give a false impression of success/completeness that will only be corrected by inspecting the backend and/or running another sync attempt. Being able to automatically identify and flag incomplete synchronization actions would make gap analysis during development much easier.
When performing a diff or a sync, it needs to be possible to configure DSync to either continue after encountering failures or abort gracefully after encountering the first failure.
This is needed for the SOT Sync project.
In the dict constructed by Diff.dict(), change the _dst and _src keys to something more intuitive, aesthetically pleasing, and/or "diff-like" -- perhaps - and +, or < and >?
The current "_src" and "_dst" keys were selected to avoid any likely conflict with the child DiffElement names, e.g.:
'DC1': {'_dst': {'parent_location_name': 'New York', 'status': 'in-transit'},
        '_src': {'parent_location_name': 'Tennessee', 'status': 'active'},
        'device': {...},
        'prefix': {...}},
but they're kinda ugly as keys.
Allow users to control the order in which objects are created/updated/deleted during a sync.
This logic could differ per type of object (device, interface, etc.).
In some cases the order in which objects are created on a remote system is important because one object can be dependent on another one.
For example, if we have a list of interfaces, with a lag interface and 2 lag members, we need to ensure that the lag interface
gets created first but deleted last.
Add a function that takes the identifiers of an object and either gets an existing object or creates a new one.
The current workaround is to wrap an object get in an ObjectNotFound exception handler, or to use try/except to catch the case where the object already exists:
try:
    self.add(vrf)
except ObjectAlreadyExists:
    pass
Expected use
self.get_or_create(vrf)
self.update_or_create(vrf)
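A sketch of how get_or_create() could work, using stand-ins for the adapter's store and exceptions (the real diffsync signatures may differ):

```python
class ObjectNotFound(Exception):
    """Stand-in for diffsync.exceptions.ObjectNotFound."""


class Adapter:
    """Stand-in for a DiffSync adapter's internal object store."""

    def __init__(self):
        self._objects = {}

    def get(self, key):
        try:
            return self._objects[key]
        except KeyError as exc:
            raise ObjectNotFound(key) from exc

    def add(self, key, obj):
        self._objects[key] = obj

    def get_or_create(self, key, factory):
        """Return (obj, created): the existing object, or a newly added one."""
        try:
            return self.get(key), False
        except ObjectNotFound:
            obj = factory()
            self.add(key, obj)
            return obj, True
```

Returning a (obj, created) tuple, as Django's get_or_create() does, lets callers tell the two cases apart.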
Today, the list of valid actions (create, update, delete) is not clearly defined, and the values are hardcoded in multiple places in the code.
It would be good to create a proper enum for these values and use it everywhere instead of having hardcoded values.
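A minimal version of such an enum; the member names come from the issue, while basing it on str (a design choice, not necessarily what the library does) keeps comparisons against the previously hardcoded strings working:

```python
from enum import Enum


class DiffSyncActions(str, Enum):
    """Sketch of a central enum for the valid sync actions."""

    CREATE = "create"
    UPDATE = "update"
    DELETE = "delete"
    NO_CHANGE = "no-change"  # illustrative; the library may model "no change" differently
```

Existing code that compares against the raw string keeps working because of the str mixin.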
Housekeeping
Currently, to add bulk write operations to a DiffSync subclass, one has to override the sync_from() method inherited from the superclass. This is in contrast to each individual DiffSyncModel's create(), update(), and delete() methods, which are more idempotent in nature and are called by the inherited sync_from() method.
It may be beneficial to leave the downstream create(), update(), and delete() DiffSyncModel methods (called by sync_from() by default) unimplemented and instead add a write() method to the DiffSync class. This would provide a framework for bulk write operations without blowing away the logic implemented in sync_from().
class BackendYAML(DiffSync):
    def write(self, source):
        """Bulk write operation to dump data to disk from another backend.

        Called automatically by super().sync_from().

        Args:
            source (DiffSync): DiffSync object from which data is being synchronized
        """
        # Validate whether or not any changes need to be made to the circuits/providers files
        self._write_providers_from(source)
        self._write_circuits_from(source)
DiffSync should provide hooks for status reporting. An example API to consider is urllib's reporthook (https://docs.python.org/3/library/urllib.request.html#urllib.request.URLopener.retrieve):
"If reporthook is given, it must be a function accepting three numeric parameters: a chunk number, the maximum size chunks are read in, and the total size of the download (-1 if unknown). It will be called once at the start and after each chunk of data is read from the network."
When using DiffSync for a large set of records, both diffing and syncing may take some time to complete. Although logging can be enabled to get highly detailed information about DiffSync's progress, it would be useful to have a less-detailed status/progress information API as well, which could be used to (for example) update a progress bar.
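Following the reporthook pattern, the diff/sync loops could accept an optional callback (a sketch; the callback signature is an assumption):

```python
def calculate_diffs(records, callback=None):
    """Stand-in diff loop that reports progress through an optional callback.

    callback(stage, current, total) is invoked once per processed record,
    mirroring urllib's reporthook style.
    """
    total = len(records)
    results = []
    for index, record in enumerate(records, start=1):
        results.append(record)  # real code would compare the record here
        if callback is not None:
            callback("diff", index, total)
    return results


progress = []
calculate_diffs(
    ["rtr1", "rtr2", "rtr3"],
    callback=lambda stage, current, total: progress.append((stage, current, total)),
)
```

A progress bar would simply map current/total to its fill level on each call.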
Add a model flag that can be used to control whether an unmatched model class or instance will trigger deletion (and/or creation) of records when a sync operation is run.
The existing global IGNORE_UNMATCHED_DST flag is not sufficiently granular as it applies to all records and all models. In some cases that may be adequate, but in others there needs to be per-model or even per-record control over this behavior -- for example, an application may not wish to delete unmatched Device records (perhaps a device is temporarily offline and hence not included in the source data), but may still wish to delete unmatched Interface records (as they reflect incorrect information about existing devices).
When using diffsync within Nautobot-plugin-ssot, it would be nice to allow the Destination (Target) to have no data.
An empty DiffSync() evaluates to False. This causes a failure to proceed to the Diff.
Because the base DiffSync class defines a __len__() method (and no __bool__()), an object's default bool() casting uses it to determine whether the object evaluates as truthy:
>>> diffsync = DiffSync()
>>> bool(diffsync)
False
>>> diffsync.add(DiffSyncModel())
>>> bool(diffsync)
True
Expected behavior: diffsync proceeds to diff and then syncs data to the Destination (Target).
Documentation generation and publication to readthedocs.org, including examples
Discoverability, usability.
Both Keepass and Bitwarden manage "credentials" but have different schemas and access methods.
Examples 1 & 2 of diffsync describe the use of datasets with identical "schemas" and the same access method.
The README.md confuses me (I don't seem to find what I'm looking for).
Is diffsync the right module to try to synchronize these Keepass and Bitwarden databases?
Is this the way the creator of this module envisioned synchronization:
A subclass of DiffSyncModel, e.g. CredModel, so that DiffSync can compare the important elements in a generic way. This could then be the base class for a KeepassCredModel and a BitwardenCredModel. These two latter classes would each have their own specific create/update/delete class methods, coping with the specific schema and access method. There would be two dataset classes, KeepassDataset and BitwardenDataset, both inheriting from DiffSync.
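The layout described above can be sketched as follows (stand-in base classes are used so the snippet is self-contained; in real code they would be diffsync.DiffSyncModel and diffsync.DiffSync):

```python
class DiffSyncModel:
    """Stand-in for diffsync.DiffSyncModel."""


class DiffSync:
    """Stand-in for diffsync.DiffSync."""


class CredModel(DiffSyncModel):
    """Generic credential model: the fields DiffSync would compare."""

    _modelname = "cred"
    _identifiers = ("title",)
    _attributes = ("username", "password", "url")


class KeepassCredModel(CredModel):
    @classmethod
    def create(cls, diffsync, ids, attrs):
        """Talk to the Keepass database to create the credential."""


class BitwardenCredModel(CredModel):
    @classmethod
    def create(cls, diffsync, ids, attrs):
        """Talk to the Bitwarden vault to create the credential."""


class KeepassDataset(DiffSync):
    cred = KeepassCredModel
    top_level = ["cred"]


class BitwardenDataset(DiffSync):
    cred = BitwardenCredModel
    top_level = ["cred"]
```

The shared CredModel defines what is compared; the backend-specific subclasses define how changes are written.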
Currently, if there are records in the target system that aren't in the source system, DSync will delete these unmatched records from the target system when performing a sync. In some scenarios, it may be desirable to instead preserve these records without modification. This should be a configurable option.
This is needed for the SOT Sync project.
Currently, doing a diff_to/diff_from followed by a sync_to/sync_from results in calculating the diff twice, because sync_to/sync_from calculate a new diff automatically.
The proposal is to extend sync_from and sync_to to accept an existing diff:
self.log_info(message="Loading current data from Data Source...")
diffsync1 = DataSourceDiffSync(job=self, sync=self.sync)
diffsync1.load()

self.log_info(message="Loading current data from Nautobot...")
diffsync2 = NautobotDiffSync(job=self, sync=self.sync)
diffsync2.load()

diffsync_flags = DiffSyncFlags.CONTINUE_ON_FAILURE

self.log_info(message="Calculating diffs...")
diff = diffsync1.diff_to(diffsync2, flags=diffsync_flags)

if not self.kwargs["dry_run"]:
    self.log_info(message="Syncing from Data Source to Nautobot...")
    diffsync1.sync_to(diffsync2, flags=diffsync_flags, diff=diff)  # <-- proposed: reuse the existing diff
    self.log_info(message="Sync complete")
Performance improvement, there is no need to calculate the diff twice on the same dataset
Currently, when an attribute is defined as a list, DSync will report a diff if the lists have the same content but in a different order.
In some cases that's the expected behavior, but in other cases the order doesn't matter, and it's hard to predict how things will be loaded on both adapters.
It would be great to be able to explicitly define whether a list should be sorted when calculating the diff.
In some cases it's hard to predict how a list will be loaded, which can lead to false positives when generating the diff. A possible workaround is to ensure that, as we construct the list, the content is always ordered, but that adds some complexity in the adapter.
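The order sensitivity boils down to plain list equality; sorting at load time is the workaround described above:

```python
# Two adapters load the same interfaces in different orders.
loaded_a = ["ge-0/0/1", "ge-0/0/0"]
loaded_b = ["ge-0/0/0", "ge-0/0/1"]

order_sensitive = loaded_a == loaded_b  # False: reported as a diff today
order_insensitive = sorted(loaded_a) == sorted(loaded_b)  # True: what a sort-before-compare option would see
```

A per-attribute "sorted" option would effectively apply the second comparison automatically.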
Update the documentation to explain how to create and use a custom Diff class
For reference,
It would be nice to get the existing object from the exception, rather than having to potentially scan all the objects to find the correct one, when handling the ObjectAlreadyExists
exception.
When a .has_diffs() call is made on a diffsync.diff.DiffElement object, the object returns False even when its children do, indeed, have diffs. This occurs whether or not the include_children=True argument is passed into the method call.
I expected .has_diffs() to evaluate to True in the case that diffs do not exist in the parent object but do exist in one of the children, and for the method to also evaluate to True when include_children=True is passed in as an argument and diffs exist in the children.
Steps to reproduce:
1. Define a model with slug, name, and site fields. Add slug to the _identifiers tuple, name to the _attributes tuple, and a dictionary of {'site': 'sites'} to the _children attribute.
2. Load two adapters whose data differ in name and slug, each with regions in the top_level attribute.
3. Call the sync_from() method on the YAML backend, taking one argument of source, and define a breakpoint (or a pdb.set_trace()) inside the method.
4. Inside the method, using the source argument:
diff = self.diff_from(source)
regions = [region for region in diff.get_children()]
regions[0].has_diffs()
regions[0].has_diffs(include_children=True)
Observe that the region whose children have diffs does not itself show any diffs, regardless of whether include_children is passed in as an argument or not.
Rename master branch to main
Align with other NTC repositories
When installing DiffSync from pip, some mandatory dependencies like pydantic or structlog are not being installed automatically
All mandatory dependencies should be installed by default
Currently DiffSync Adapters are always leveraging an internal in-memory datastore that is storing the entire dataset.
It would be great to support different types of datastore, like Redis in addition to the in-memory datastore.
As an option it would be useful to deactivate the internal datastore as well or provide a solution to pull the data directly from the remote system.
When we are dealing with a large dataset, the volume of data stored in memory can become very large and can present some challenges. An external datastore like Redis would reduce the volume of data stored in memory.
In some cases, DiffSync is running very closely to an existing database and duplicating the data in memory is redundant and inefficient.
Currently the pyproject.toml file restricts structlog to major version 20:
structlog = "^20.1.0"
The current major version is 21.
For my project, there is another required package that has pinned a minimum structlog version of 21.0.0. When adding diffsync as a requirement to my project, pip will install a much older version of the other required package where the minimum structlog version was 20. This causes an issue with needed features not being available.
Implement either a global or model flag (or both) called IGNORE_CASE that will tell DiffSync to ignore case-only mismatches.
Example for Global Flags:
from diffsync.enum import DiffSyncFlags
flags = DiffSyncFlags.IGNORE_CASE
diff = nautobot.diff_from(local, flags=flags)
Example for Model Flags:
from diffsync import DiffSync
from diffsync.enum import DiffSyncModelFlags
from model import MyDeviceModel


class MyAdapter(DiffSync):
    device = MyDeviceModel

    def load(self, data):
        """Load all devices into the adapter and add the flag IGNORE_CASE to all firewall devices."""
        for device in data.get("devices"):
            obj = self.device(name=device["name"])
            if "firewall" in device["name"]:
                obj.model_flags = DiffSyncModelFlags.IGNORE_CASE
            self.add(obj)
Currently, if we are trying to sync the same object from different backends where the names match except for case (e.g. "my-device" & "My-Device"), they will be marked as different, thus deleting the first device and replacing it with the new one.
Below is an example showing the current limitations of not having such a flag. As you can see from the DATA_BACKEND_A and DATA_BACKEND_B variables, the values are the same except for case.
from diffsync.logging import enable_console_logging
from diffsync import DiffSync
from diffsync import DiffSyncModel


class Site(DiffSyncModel):
    _modelname = "site"
    _identifiers = ("name",)

    name: str

    @classmethod
    def create(cls, diffsync, ids, attrs):
        print(f"Create {cls._modelname}")
        return super().create(ids=ids, diffsync=diffsync, attrs=attrs)

    def update(self, attrs):
        print(f"Update {self._modelname}")
        return super().update(attrs)

    def delete(self):
        print(f"Delete {self._modelname}")
        super().delete()
        return self


DATA_BACKEND_A = ["SITE-A"]
DATA_BACKEND_B = ["site-A"]


class BackendA(DiffSync):
    site = Site
    top_level = ["site"]

    def load(self):
        for site_name in DATA_BACKEND_A:
            site = self.site(name=site_name)
            self.add(site)


class BackendB(DiffSync):
    site = Site
    top_level = ["site"]

    def load(self):
        for site_name in DATA_BACKEND_B:
            site = self.site(name=site_name)
            self.add(site)


def main():
    enable_console_logging(verbosity=0)
    backend_a = BackendA(name="Backend-A")
    backend_a.load()
    backend_b = BackendB(name="Backend-B")
    backend_b.load()
    backend_a.sync_to(backend_b)


if __name__ == "__main__":
    main()
Upon executing this script, the output is:
Create site
Delete site
So we are replacing an object that could potentially be the same.
Implementing this flag could help mitigate unexpected results when the user knows they might have case-insensitive data in both backends, and would remove the need to use functions such as .lower() or .casefold() each time they create a new object.
The README makes mention of extending a "base" DiffSyncModel for handling CRUD actions in a backend, but doesn't do a great job of visualizing this concept.
you need to extend your DiffSyncModel class(es) to define your own create, update and/or delete methods for each model.
I think extending the example out a little more would go a long way to showing how to build your models and adapters.
from typing import List, Optional

from diffsync import DiffSyncModel


class Device(DiffSyncModel):
    """Example model of a network Device."""

    _modelname = "device"
    _identifiers = ("name",)
    _attributes = ()
    _children = {"interface": "interfaces"}

    name: str
    site_name: Optional[str]  # note that this attribute is NOT included in _attributes
    role: Optional[str]  # note that this attribute is NOT included in _attributes
    interfaces: List = list()


class SystemADevice(Device):
    system_A_unique_field: Optional[str] = None

    @classmethod
    def create(cls, diffsync, ids, attrs):
        """Talk to SystemA to create device"""
        pass


class SystemBDevice(Device):
    system_B_unique_field: Optional[str] = None

    @classmethod
    def create(cls, diffsync, ids, attrs):
        """Talk to SystemB to create device"""
        pass
This should help newbies (like myself) to get a better idea of how to architect a diffsync-based integration.
This came up in internal conversations around usage and some potential improvements.
Potentially add a flag that will set create/update to process parents before children, and the reverse for deletion: children are deleted before parents.
An example would be Nautobot and the dependencies of objects within Nautobot. Say you want to delete a site, but child objects such as devices exist; you need to delete the devices before deleting the site. This caused some intermittent and hard-to-troubleshoot scenarios.
It was brought up that some of these deletions could be deferred as well, which may not be wanted.
Extend the DiffSyncModel create, update, and delete APIs with an additional logger parameter, or provide a public log API on the already-included diffsync instance.
Currently a DiffSyncModel implementation must construct its own logging context from scratch and lacks access to the context of any surrounding sync operation.
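The first option could look like this (a sketch on a stand-in model, using the stdlib logging module; the parameter name logger is an assumption):

```python
import logging


class DeviceModel:
    """Stand-in for a DiffSyncModel subclass; the logger parameter is the proposal."""

    _modelname = "device"

    @classmethod
    def create(cls, diffsync, ids, attrs, logger=None):
        # Proposed: reuse the caller's logger, which carries the surrounding
        # sync context, instead of constructing a fresh one from scratch.
        log = logger or logging.getLogger(__name__)
        log.info("create %s ids=%s attrs=%s", cls._modelname, ids, attrs)
        return cls()
```

The engine driving the sync would pass its own contextual logger down to each CRUD call.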
When attempting a diff between two SoTs using the SSoT plugin I get the following Exception error:
Traceback (most recent call last):
File "<console>", line 1, in <module>
File "/usr/local/lib/python3.7/site-packages/diffsync/__init__.py", line 543, in diff_from
return differ.calculate_diffs()
File "/usr/local/lib/python3.7/site-packages/diffsync/helpers.py", line 92, in calculate_diffs
self.diff.add(diff_element)
File "/usr/local/lib/python3.7/site-packages/diffsync/diff.py", line 58, in add
raise ObjectAlreadyExists(f"Already storing a {element.type} named {element.name}")
diffsync.exceptions.ObjectAlreadyExists: Already storing a port named 0/0
This error message is unhelpful, as I can't determine from it what parent context the object is in.
I would expect more information to be provided in the error message denoting the parent context of the object that already exists, so further investigation of the issue can be done. The current message is unhelpful because I have no idea which device the referenced 0/0 port is on.
Migrate CI to GitHub Actions and deprecate Travis
Align with other NTC projects
Improve the documentation with more information regarding the usage of the callback function in sync_to / sync_from, etc.
Glenn covered some of it in the second example
Extend the list of counters returned by diff.summary() to include skip, in addition to create, update, delete & no-change.
Currently, the models that are being skipped because of global or model flags like SKIP_UNMATCHED_SRC | SKIP_UNMATCHED_DST are not accounted for in the diff summary.
#90 introduced a breaking API change in that DiffElement.action changed from a string value to an enum value. This impacts projects such as network-importer (networktocode/network-importer#256) and any other project relying on the value of DiffElement.action, such as to implement custom Diff ordering based on the action.
In the short term it may be simplest to just revert the entirety of #90 and cut a new DiffSync release.
Expected behavior: the API remains stable in minor and patch releases.
Any given DSync model should be able to generate log messages using a generic API, without knowing or caring whether these logs are going to stdout, creating a set of NetBox database records, etc. This API needs to be configurable to specify its target.
This is needed for the SOT Sync project.
When generating a diff or a sync with a custom diff_class, the main class is instantiated with the proper class, but the children of this class are still instantiated using the default Diff class.
When a custom diff_class is provided, the main class and all its children should be created using the custom diff_class, with child_diff defined accordingly. I believe the issue is at line 155 in the diff.py file:
https://github.com/networktocode/dsync/blob/master/dsync/diff.py#L155
Hi,
The last release 1.3.0 is from 30 April 2021. Since then, many new features were added (such as the get_or_instantiate()
method).
It would be nice to make a new release to include these changes. This would also mean updating the CHANGELOG.md file.
Thanks.
Before making this repo public we need to add a license at the root of the repo and at the top of each file.
I've been using this one for the onboarding plugin
Copyright 2020 Network to Code <[email protected]>
Network to Code, LLC
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
The build pipeline for Read the Docs appears to be broken at the moment.
PR #95 certainly didn't help, but I think it was broken even before that.
Looking at other projects, it doesn't look like we have a clear pattern in place, but I think we should replicate what we have in netutils, with a dedicated requirements.txt file just for RTD:
https://github.com/networktocode/netutils/blob/develop/docs/requirements.txt
A new version of the documentation should be built and published to RTD whenever there is a new commit on main.
Add a flag per DSyncModel object to indicate that a specific object should be ignored during the diff/sync.
We have a situation right now with the network-importer where the NetBox adapter is getting some cables from NetBox, and for various reasons these cables should be ignored because they can't be touched.
Currently we don't have a way to solve this situation: if the objects exist in the network adapter but are removed from the NetBox adapter, they will be flagged as MISSING and the sync will try to create them.
Adding a per-object flag that indicates the object should be ignored altogether would solve this situation, and I can imagine other use cases where it would be useful to explicitly ignore an object.