elastic / connectors

Source code for all Elastic connectors, developed by the Search team at Elastic, and home of our Python connector development framework

Home Page: https://www.elastic.co/guide/en/enterprise-search/master/index.html

License: Other

Python 99.06% Makefile 0.08% Shell 0.86% Dockerfile 0.01% DIGITAL Command Language 0.01%
enterprise-search app-search elastic elastic-stack elasticsearch workplace-search

connectors' Introduction


Elastic connectors


Connectors

This repository contains the source code for all Elastic connectors, developed by the Search team at Elastic. Use connectors to sync data from popular data sources to Elasticsearch.

These connectors are available both as Elastic-managed connectors and as self-managed connector clients.

ℹī¸ For an overview of the steps involved in deploying connector clients refer to Connector clients in the official Elastic documentation.

To get started quickly with self-managed connectors using Docker Compose, check out this README file.

Connector documentation

The main documentation for using connectors lives in the Search solution's docs, which also includes the individual reference pages for each connector.

For everything to do with developing connectors, you'll find that here in this repo.

API documentation

Since 8.12.0, you can manage connectors and sync jobs programmatically using APIs. Refer to the Connector API documentation in the Elasticsearch docs.
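
The exact request shapes are documented there; as a rough illustration only (the URL, API key, and connector id below are placeholders), listing connectors and kicking off an on-demand sync job over HTTP could look like this:

import requests

ES_URL = "http://localhost:9200"  # placeholder Elasticsearch endpoint
HEADERS = {
    "Authorization": "ApiKey <your-api-key>",  # placeholder credentials
    "Content-Type": "application/json",
}

# List existing connectors (Connector API, available since 8.12.0).
resp = requests.get(f"{ES_URL}/_connector", headers=HEADERS)
resp.raise_for_status()
for connector in resp.json().get("results", []):
    print(connector.get("id"), connector.get("index_name"))

# Request an on-demand sync job for one connector.
resp = requests.post(
    f"{ES_URL}/_connector/_sync_job",
    headers=HEADERS,
    json={"id": "<connector-id>", "job_type": "full"},  # placeholder connector id
)
resp.raise_for_status()
print(resp.json())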

Command-line interface

Learn about our CLI tool in docs/CLI.md.

Connector service code

In addition to the source code for individual connectors, this repo also contains the connector service code, which handles tasks like running connectors and managing scheduling, syncs, and cleanup. This shared code is not used by individual connectors, but coordinates and runs a deployed instance/process.

Connector framework

This repo is also the home of the Elastic connector framework. This framework enables developers to build Elastic-supported connector clients. The framework implements common functionalities out of the box, so developers can focus on the logic specific to integrating their chosen data source.

The framework ensures compatibility and makes it easier for our team to review PRs and help out in the development process. When you build using our framework, we provide a pathway for the connector to become officially supported by Elastic.
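
As a hand-wavy sketch of what a connector client built on the framework looks like (the base-class API, method names, and configuration shape below are assumptions patterned after the sources in connectors/sources/, so check connectors/source.py and the framework guides before relying on them):

from connectors.source import BaseDataSource


class FruitDataSource(BaseDataSource):
    """Hypothetical data source that syncs a tiny in-memory dataset."""

    name = "Fruit"
    service_type = "fruit"  # hypothetical service type

    @classmethod
    def get_default_configuration(cls):
        # Configurable fields surfaced in Kibana; the keys here are assumptions.
        return {
            "host": {"label": "Fruit API host", "order": 1, "type": "str", "value": ""},
        }

    async def ping(self):
        # Cheap connectivity check; raise if the remote system is unreachable.
        pass

    async def get_docs(self, filtering=None):
        # Yield (document, lazy_download) pairs; None means "no attachment to download".
        for i, fruit in enumerate(["apple", "banana"]):
            yield {"_id": str(i), "name": fruit}, None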

Running a self-managed stack

This repo provides a set of scripts to allow a user to set up a full Elasticsearch, Kibana, and Connectors service stack using Docker. This is useful to get up and running with the Connectors framework with minimal effort, and provides a guided set of prompts for setup and configuration. For more information, instructions, and options, see the README file in the stack folder.

Framework use cases

The framework serves two distinct, but related use cases:

  • Customizing an existing Elastic connector client
  • Building a new connector client

Guides for using the framework

connectors' People

Contributors

abhishekjoshi-crest, acrewdson, afoucret, akanshi-crest, akanshi-elastic, artem-shelkovnikov, danajuratoni, dependabot[bot], dianajourdan, efegurkan, j-bennet, jedrazb, jignesh-crest, leemthompo, markjhoy, mchernyavskaya, moxarth-elastic, navarone-feekery, oli-g, parth-crest, parth-elastic, parthpuri-elastic, praveen-elastic, saarikabhasi, seanstory, sphilipse, tarekziade, timgrein, vidok, wangch079


connectors' Issues

Shorter MySQL configurable field names

As MySQL does not need to be part of each configurable field name, we need to rename the fields as listed below.

Acceptance criteria

  • Host
  • Port
  • Username
  • Password
  • Databases

Note:
This change should be visible only for 8.6 and will require the elastic.co documentation to be updated.

Recurring sync scheduled every one minute from the UI does not work as expected; after the first sync completes, the second sync is triggered after only a few seconds

Bug Description

A recurring sync scheduled every one minute from the UI does not work as expected: after the first sync is completed, the second sync gets triggered immediately, after only a few seconds.

Pre-requisites

  1. Create an index from the UI
  2. Generate the API key and assign privileges to the API key
  3. Configure config.yml file with proper API key and source.py file with appropriate parameter values
  4. Update the package

To Reproduce

Steps to reproduce the behavior:

  1. Execute the indexing using the command elastic-ingest -c config-file --action poll
  2. Set schedule in the UI for every 1 minute and click Sync
  3. Observe the execution process and notice if the next sync starts 1 minute after the previous sync is completed

Expected behavior

Next sync should start 1 minute after the previous sync is completed

Actual behavior

Next sync starts immediately after the previous sync is completed

Screenshots

Please find attached screenshots for reference
next sync triggers immediately

Environment

Linux CentOS 7

Count of total sync done is always 0 irrespective of the total number of documents indexed

Steps to reproduce:

  1. Edit config.yml file and configure correct value for elasticsearch host
  2. Go to connectors/Sources and edit connector.py file
  3. Update the package and Run python3 kibana.py --config-file config.yml --service-type <service_type_file> --index-name <index_name>
  4. Upload a new file with .txt extension in the source
  5. Execute command elastic-ingest -c config.yml --action poll to index the file in elasticsearch
  6. Observe the count displayed on logs

Actual Result:

Sync done: 0 indexed, 0  deleted.

Expected Result:

Sync done: 1 indexed, 0  deleted.

Note: probably a regression from the #14 fix.

Error handling issue in MySQL

Bug Description

There might be an error handling issue. https://github.com/elastic/connectors-python/blob/main/connectors/sources/mysql.py#L257 indicates that the list of databases provided should be comma-separated. If I run it with just the first database, it creates a bunch of documents. But if I then update the config to use the comma notation and specify a second (but bogus) database, it seems to fail the whole job and delete all the documents previously indexed when only the first database was configured. Shouldn't it fail only the second database and not delete the entries previously indexed from the first? Or is this an all-or-nothing design?

Use string data type in place of long for field 'id'

Problem Description

When we run the connector for a source with very large ids and scheduling enabled, the first interval (first sync) runs successfully. However, the second interval (next sync) throws a KeyError for the field 'id'.

We have charted down the following observation:
In the byoei.py file, the content of the _id field is duplicated into another field 'id':

https://github.com/elastic/connectors-python/blob/05ec3b4f40a3562dde518c636314f07dd798001d/connectors/byoei.py#L152

However, Elasticsearch maps 'id' as type long, a signed 64-bit integer, so it only allows values up to 2^63 - 1.
So, when the bulk API is invoked inside the byoei file, the response stored in the res variable on this line:
https://github.com/elastic/connectors-python/blob/05ec3b4f40a3562dde518c636314f07dd798001d/connectors/byoei.py#L35
shows that an error is thrown: mapper_parsing_exception, reason: failed to parse field [id] of type [long] in document.

Proposed Solution

Since _id allows values up to 512 bytes, but the long type is limited to 2^63 - 1, we should store 'id' as a string:

doc_id = doc["id"] = str(doc.pop("_id"))

Additional Context

Attaching few screenshots for your reference:

Error logs
image

response on hitting bulk API.
image

MySQL align BE to reflect "List of MySQL databases" is required

Current behavior

Configured MySQL with the vault data, leaving "List of MySQL databases" empty.
Sync gets triggered, but 0 docs are indexed.
image

Expected behavior

For 8.6 we should align the BE code to reflect that no databases should be synced if the parameter "List of MySQL databases" is empty.

Context

This ticket is a short-/mid-term solution. The long-term solution is to add a UI option where the user can explicitly select that they want all databases; that enhancement depends on adding support for rich configurable fields in the framework.

Ref code

Errors while running the connector terminate the loop

Describe the bug

If an error happens during the connector run, then the application hangs indefinitely until terminated.

To Reproduce

Steps to reproduce the behavior:

  1. Create a mongodb connector
  2. In Kibana UI assign invalid port to it - for example '123456'
  3. Run the connector host
  4. See error:
[FMWK][14:55:17][CRITICAL] Port must be an integer between 0 and 65535: '270211'
Traceback (most recent call last):
  File "/Users/artemshelkovnikov/git_tree/connectors-py/connectors/runner.py", line 100, in poll
    data_source = get_data_source(connector, self.config)
  File "/Users/artemshelkovnikov/git_tree/connectors-py/connectors/source.py", line 118, in get_data_source
    _CACHED_SOURCES[service_type] = get_source_klass(
  File "/Users/artemshelkovnikov/git_tree/connectors-py/connectors/sources/mongo.py", line 20, in __init__
    self.client = AsyncIOMotorClient(
  File "/Users/artemshelkovnikov/git_tree/connectors-py/lib/python3.9/site-packages/motor/core.py", line 148, in __init__
    delegate = self.__delegate_class__(*args, **kwargs)
  File "/Users/artemshelkovnikov/git_tree/connectors-py/lib/python3.9/site-packages/pymongo/mongo_client.py", line 743, in __init__
    seeds.update(uri_parser.split_hosts(entity, port))
  File "/Users/artemshelkovnikov/git_tree/connectors-py/lib/python3.9/site-packages/pymongo/uri_parser.py", line 376, in split_hosts
    nodes.append(parse_host(entity, port))
  File "/Users/artemshelkovnikov/git_tree/connectors-py/lib/python3.9/site-packages/pymongo/uri_parser.py", line 137, in parse_host
    raise ValueError("Port must be an integer between 0 and 65535: %r" % (port,))
ValueError: Port must be an integer between 0 and 65535: '270211'
Traceback (most recent call last):
  File "/Users/artemshelkovnikov/git_tree/connectors-py/bin/elastic-ingest", line 33, in <module>
    sys.exit(load_entry_point('elasticsearch-connectors', 'console_scripts', 'elastic-ingest')())
  File "/Users/artemshelkovnikov/git_tree/connectors-py/connectors/cli.py", line 77, in main
    return run(args)
  File "/Users/artemshelkovnikov/git_tree/connectors-py/connectors/runner.py", line 147, in run
    logger.info("Bye")
  File "/usr/local/Cellar/[email protected]/3.9.12/Frameworks/Python.framework/Versions/3.9/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "/Users/artemshelkovnikov/git_tree/connectors-py/connectors/runner.py", line 114, in poll
    await self.connectors.close()
  File "/Users/artemshelkovnikov/git_tree/connectors-py/connectors/runner.py", line 66, in raise_if_spurious
    raise exception
  File "/Users/artemshelkovnikov/git_tree/connectors-py/connectors/runner.py", line 100, in poll
    data_source = get_data_source(connector, self.config)
  File "/Users/artemshelkovnikov/git_tree/connectors-py/connectors/source.py", line 118, in get_data_source
    _CACHED_SOURCES[service_type] = get_source_klass(
  File "/Users/artemshelkovnikov/git_tree/connectors-py/connectors/sources/mongo.py", line 20, in __init__
    self.client = AsyncIOMotorClient(
  File "/Users/artemshelkovnikov/git_tree/connectors-py/lib/python3.9/site-packages/motor/core.py", line 148, in __init__
    delegate = self.__delegate_class__(*args, **kwargs)
  File "/Users/artemshelkovnikov/git_tree/connectors-py/lib/python3.9/site-packages/pymongo/mongo_client.py", line 743, in __init__
    seeds.update(uri_parser.split_hosts(entity, port))
  File "/Users/artemshelkovnikov/git_tree/connectors-py/lib/python3.9/site-packages/pymongo/uri_parser.py", line 376, in split_hosts
    nodes.append(parse_host(entity, port))
  File "/Users/artemshelkovnikov/git_tree/connectors-py/lib/python3.9/site-packages/pymongo/uri_parser.py", line 137, in parse_host
    raise ValueError("Port must be an integer between 0 and 65535: %r" % (port,))
ValueError: Port must be an integer between 0 and 65535: '270211'
  5. Observe the output in the terminal - nothing further will happen

Expected behavior

Service should display the error and continue running. Once config is corrected, service properly runs mongodb connector.

Add support for environment variables

We want to be able to use elasticsearch environment variables in docker, so the config file can look like:

elasticsearch:
  host: ${host}
  user: ${username}
  password: ${password}
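
One possible way to support this (a sketch of the general idea, not necessarily how the service should implement it) is to expand environment variables while loading the YAML:

import os

import yaml


def load_config(path):
    """Read a YAML config file and expand ${VAR}-style environment variables."""
    with open(path) as f:
        raw = f.read()
    # os.path.expandvars replaces ${VAR} and $VAR with the value from the
    # environment and leaves unknown variables untouched.
    return yaml.safe_load(os.path.expandvars(raw))


# e.g. docker run -e host=http://elasticsearch:9200 -e username=elastic -e password=changeme ...
# config = load_config("config.yml")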

Plug the new mappings creation

Now that we have #27 we can plug it, so the index is properly created if it does not exist.

Steps:

  • move templates/elasticsearch/index to connectors/index (that also unshadows elasticsearch)
  • move all tests located in templates/elasticsearch/index/tests to connectors/tests
  • in connectors/byoc.py change the elastic_server.prepare_index(self.index_name) so it passes the mappings and settings baked by the new helpers.

Bonus:

  1. add a high-level function that returns settings and mappings in a single call.
  2. consider merging index/mappings.py and index/settings.py into a single index.py file to simplify the code layout
  3. if 2. is done, consider renaming IndexMappings to Mappings or DefaultMappings so one can do from connectors.index import Mappings.

Tests fail when run locally - test_aws.py::test_get_docs tries to connect to 169.254.169.254

When running make test locally, 1 test fails:

================================================================================================================================================= short test summary info ==================================================================================================================================================
FAILED connectors/sources/tests/test_aws.py::test_get_docs - botocore.exceptions.ConnectTimeoutError: Connect timeout on endpoint URL: "http://169.254.169.254/latest/api/token"

It seems like this test is still trying to access the real endpoint; this logic should be mocked.
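
One common way to keep botocore-based tests away from the real instance-metadata endpoint (a sketch; the repo's test fixtures may solve this differently, e.g. by mocking the S3 client itself) is to inject fake credentials and disable the metadata lookup via environment variables:

import pytest


@pytest.fixture(autouse=True)
def aws_test_environment(monkeypatch):
    """Keep botocore away from http://169.254.169.254 during unit tests."""
    monkeypatch.setenv("AWS_ACCESS_KEY_ID", "testing")
    monkeypatch.setenv("AWS_SECRET_ACCESS_KEY", "testing")
    monkeypatch.setenv("AWS_DEFAULT_REGION", "us-east-1")
    # botocore honors this flag and skips the EC2 instance-metadata service.
    monkeypatch.setenv("AWS_EC2_METADATA_DISABLED", "true")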

Getting `NoneType is not callable` when lazy_download is set to None

Bug Description

For sources that don't have attachments, we pass an explicit None for the lazy_download field. This is accepted in the first sync, but in the second sync it raises a "NoneType object is not callable" error.

Expected behavior

The second sync should skip content extraction when lazy_download is set to None, just like the first sync does. The first sync probably works only because the index is empty and the timestamp is not being checked.

Screenshots

Error raised while 2nd sync is running

image

Screenshot of the code where the issue persists

image

Generate resource and detailed memory usage reports in functional tests

Problem Description

Even with the new mem tool I added, it takes some time to dig into memory usage.
The same goes for profiling code: it requires running a series of perf tools.

Proposed Solution

Let's unify all our measurement tools under one script: https://github.com/tarekziade/perf8/

It'll run

  • psutil (memory, CPU, fds)
  • detailed memory usage with memray
  • profiling info with py-spy

Below are a few screenshots of the reports we get.

memray flame graph reports (two screenshots attached).

psutil outputs a CSV, from which we generate a matplotlib report (screenshot attached).

We can add this to make ftest.

Deleted document count displayed in the console along with the created document count

Bug Description

After creating a new index and indexing a new set of documents, the console shows a deleted document count along with the created count.

Pre-requisites

  1. An index already exists with 50k docs indexed
  2. Create a new index from the UI
  3. Generate the API key and assign privileges to the API key
  4. Configure the config.yml file with the proper API key and the source.py file (with a database name having limited records - 28 in this case) with appropriate parameter values
  5. Update the package

To Reproduce

Steps to reproduce the behavior:

  1. Execute the indexing using the command elastic-ingest -c config-file --action poll
  2. Set schedule in the UI for every 1 minute and click Sync
  3. Observe the execution process and the logs printed

Expected behavior

Count should be displayed as {'create': 28} and Sync done: 28 indexed, 0 deleted

Actual behavior

Count is displayed as {'create': 28, 'delete': 50000} and Sync done: 28 indexed, 50000 deleted. The deleted document count refers to the documents indexed in the older index, which is 50k.

Screenshots

indexed and deleted document count

Environment

Elasticsearch v8.4.1
macOS Monterey v12.5

FileNotFoundError observed while executing elastic-ingest poll command

Bug Description

Connector gives FileNotFoundError while running the poll command.

To Reproduce

Steps to reproduce the behavior:

  1. Go to config file and add host, username, password in elasticsearch and add connector in sources
  2. Go to connectors/Sources and add configuration parameters of source
  3. Update the package using the command pip install .
  4. Run python3 kibana.py --config-file config.yml --service-type <service_type_file> --index-name <index_name>
  5. Execute elastic-ingest -c config.yml --action poll
  6. Observe the logs

Expected behavior

Connector should successfully index the data to Elasticsearch without any error.

Screenshots

image

Additional context

The cause could be that the yml file is not installed when using "pip install ." to install the package. This might be fixed by including .yml files in the package data in setup.cfg, as shown below.

[options.package_data]
* = *.yml

Inform user about MySQL connection errors (e.g. invalid database)

If a sync is triggered on a non-existent database, don't obfuscate not-found errors. Currently the user does not get any feedback about issues with the connection or possible misconfiguration, and the sync reports 0 documents being indexed.

Expected behavior

Ideally, we should validate that the configuration is correct and inform the user about any possible issues.
An error message with a relevant explanation should guide the user, stating that the chosen database is not available. Ideally, also provide a list of the available databases.

This should be reported in the UI in a similar manner to how MongoDB does it:
image

NoneType: None is printed when unable to connect to Elasticsearch

Steps to reproduce:

  1. Edit config.yml file and configure incorrect value for elasticsearch host
  2. Edit sources.py file and configure all the correct parameter values for mysql server
  3. Update the package using the command pip3.7 install .
  4. Run python3.7 kibana.py
  5. Execute elastic-ingest -c config.yml --action poll

Actual Result:

NoneType: None gets printed on the console

Expected Result:

NoneType: None looks weird; is this expected? A proper connection error should be displayed instead.

MySQL_NoneTypeError

Serialize `settings`

Make sure we serialize connectors.index.Settings

elastic_transport.SerializationError: Unable to serialize to JSON: {'mappings': {'dynamic': 'true', 'dynamic_templates': [{'data': {'match_mapping_type': 'string', 'mapping': {'type': 'text', 'analyzer': 'iq_text_base', 'index_options': 'freqs', 'fields': {'stem': {'type': 'text', 'analyzer': 'iq_text_stem'}, 'prefix': {'type': 'text', 'analyzer': 'i_prefix', 'search_analyzer': 'q_prefix', 'index_options': 'docs'}, 'delimiter': {'type': 'text', 'analyzer': 'iq_text_delimiter', 'index_options': 'freqs'}, 'joined': {'type': 'text', 'analyzer': 'i_text_bigram', 'search_analyzer': 'q_text_bigram', 'index_options': 'freqs'}, 'enum': {'type': 'keyword', 'ignore_above': 2048}}}}}], 'properties': {'id': {'type': 'keyword'}, '_subextracted_as_of': {'type': 'date'}, '_subextracted_version': {'type': 'keyword'}}}, 'settings': <connectors.index.Settings object at 0x7fe0cd1bf400>} (type: dict)
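
The error shows a plain Python object being handed to the Elasticsearch client inside the request body. A generic way to avoid it (a sketch of the idea, not the repo's actual fix) is to convert the helper object into plain dicts before building the body:

# Sketch: only hand JSON-serializable data (dicts, lists, strings, numbers) to the client.

class Settings:
    """Stand-in for connectors.index.Settings; the real class lives in this repo."""

    def __init__(self, language_code=None, analysis_icu=False):
        self.language_code = language_code
        self.analysis_icu = analysis_icu

    def to_dict(self):
        # Whatever index settings the helper computes, returned as plain data.
        return {"analysis": {"analyzer": {}}}


settings = Settings()
body = {
    "mappings": {"dynamic": "true"},
    "settings": settings.to_dict(),  # plain dict, so elastic_transport can serialize it
}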

service_type not configurable through kibana UI(v8.4)

Bug Description

While running the connector through the CLI we have an option to set service_type; however, in the Kibana UI in version 8.4 there is no option to configure the service_type parameter, so it is passed as null by default when the .elastic-connectors index is created.

Expected behavior

Since we can set the service_type from the CLI, it should also be possible to do the same from the Kibana UI.

Memory leakage observed in MySQL connector while indexing large dataset

Bug Description

Memory leakage observed in the MySQL connector while indexing a large dataset of approx 10 GB and 200k documents (taken as example data).

Pre-requisite

Create a data setup in a MySQL DB with 5 databases, 50 tables, and 800 records per table, with a size of 50 KB per row

To Reproduce

Steps to reproduce the behavior:

  • Create an index in Elasticsearch using Kibana v8.5.0-SNAPSHOT and update privileges for the API key generated
  • Add MySQL configuration from the UI and provide all 5 databases as a comma separated values
  • Do necessary changes in the config file and utils.py file
  • Update package and execute poll command
  • Observe the behaviour of the connector and check if all documents are correctly indexed in Elasticsearch or not

Expected behavior

  • All the documents should be successfully indexed in Elasticsearch
  • Memory leakage should not happen

Actual Result

  • It indexed only a limited number of documents
  • We observed an error in the Kibana log file; the error snippet is below for reference
  • RAM usage gradually increases at the beginning and becomes fully utilized, to a point where document fetching stops and no further documents are indexed

Screenshots or Attachments

Click here for reference https://watch.screencastify.com/v/b4gPnMzPuPVPptMNUG2m

ResponseError: [parent] Data too large, data for [<http_request>] would be [513871946/490mb], which is larger than the limit of [510027366/486.3mb], real usage: [513870000/490mb], new bytes reserved: [1946/1.9kb], usages [eql_sequence=0/0b, fielddata=9080/8.8kb, request=1654784/1.5mb, inflight_requests=106877370/101.9mb, model_inference=0/0b]: circuit_breaking_exception: [circuit_breaking_exception] Reason: [parent] Data too large, data for [<http_request>] would be [513871946/490mb], which is larger than the limit of [510027366/486.3mb], real usage: [513870000/490mb], new bytes reserved: [1946/1.9kb], usages [eql_sequence=0/0b, fielddata=9080/8.8kb, request=1654784/1.5mb, inflight_requests=106877370/101.9mb, model_inference=0/0b]

Environment

  • OS: Linux CentOS7
  • h/w config - 1-core CPU, 2 GB RAM (also tested with 6-core CPU, 12 GB RAM and 8-core CPU, 16 GB RAM; same behaviour observed)
  • Elasticsearch version - v8.5.0-SNAPSHOT (8.5.0-c52257ee-SNAPSHOT)

Additional context

  • We also tried this with a single database and a minimal script for MySQL, but noted the same behavior. Sharing the script in Slack for a quick look.
  • To re-confirm whether this is specific to the MySQL connector, we checked this issue with the Network Drive connector on a large dataset and found that it shows the same error.

Connection timed out error while executing elastic-ingest with large dataset

Bug Description

Connection timed out error while executing elastic-ingest with large dataset

To Reproduce

Steps to reproduce the behavior:

  1. Configure config.yml file with proper values
  2. Configure mysql.py files with proper values for connecting to MySQL server and pass empty list as an argument in database parameter
  3. Update the package and execute kibana.py
  4. Execute elastic-ingest -c config.yml --action poll command

Expected behavior

Connection timed out error should not occur and all the documents should be successfully indexed into elasticsearch

Actual behaviour

  1. Facing a connection timed out error, and documents are being missed from indexing.
  2. There are around 800k records but only approximately 400k records are getting indexed, so documents are also being missed from indexing into Elasticsearch due to the connection timed out error

Note - Please find attached screenshot for more details

MySQL_LargeData_ConnectionTimedOut

Environment

Linux VM CentOS7

Misleading count information in logger message

Pre-requisite:

  1. One document already indexed to Elasticsearch
  2. Added one new file in source

Test Steps:

  1. Go to connectors/Sources and edit connector.py file
  2. Update the package and Run python3 kibana.py --config-file config.yml --service-type <service_type_file> --index-name <index_name>
  3. Upload a new file with .txt extension in the source
  4. Execute command elastic-ingest -c config.yml --action poll to index the file in elasticsearch
  5. Observe the count displayed on logs
  6. Login to elastic search and Check the newly indexed file

Expected Result:

  1. The count shown on the console should reflect the newly added file.
    In this case only 1 txt file was uploaded, so the count should be create: 1 and Sync done: 1

Actual Result:
An extra count is shown on the console for the txt file as {'create': 1, 'update': 1} and Sync done: 2 indexed.

get_default_configuration() method is not being loaded properly through Kibana UI(v8.4)

Bug Description

When running the connector through the Kibana UI (v8.4), we are unable to load the default configuration of a connector properly because the method get_default_configuration() is not being used. When running the connector from the CLI, we passed the default configuration in the kibana file.

Expected behavior

In the Kibana UI (v8.4) there should be a way to load the default configuration of a connector.

Connector is unable to sync when running with schedule set by Kibana

Describe the bug

The connector is unable to run when created through the Kibana flow, due to a problem with the Quartz cron expression stored by Kibana.

To Reproduce

Steps to reproduce the behavior:

  1. Create a connector record in Kibana, you can use this query:
POST .elastic-connectors/_doc/
{
  "configuration": {
    "database": {
      "label": "MongoDB Database",
      "value": "listingsAndReviews"
    },
    "host": {
      "label": "MongoDB Server Hostname",
      "value": "127.0.0.1:27028"
    },
    "collection": {
      "label": "MongoDB Collection",
      "value": "sample_airbnb"
    }
  },
  "index_name": "search-mongodb",
  "language": null,
  "last_seen": "2022-08-25T10:16:45.502+00:00",
  "last_sync_error": null,
  "last_sync_status": null,
  "last_synced": null,
  "name": "mongodb",
  "scheduling": {
    "enabled": false,
    "interval": "0 0 0 * * ?"
  },
  "service_type": "mongo",
  "status": "configured",
  "sync_now": false
}
  2. Go to Kibana, change the scheduling interval to any interval, and enable scheduling
  3. Run the connector
  4. See error:
[12:44:58][CRITICAL] cannot use '?' in the 'year' field
Traceback (most recent call last):
  File "/Users/artemshelkovnikov/git_tree/connectors-py/connectors/runner.py", line 71, in poll
    await connector.sync(data_source, es, self.idling)
  File "/Users/artemshelkovnikov/git_tree/connectors-py/connectors/byoc.py", line 212, in sync
    next_sync = self.next_sync()
  File "/Users/artemshelkovnikov/git_tree/connectors-py/connectors/byoc.py", line 177, in next_sync
    return CronTab(self.scheduling["interval"]).next(default_utc=True)
  File "/Users/artemshelkovnikov/git_tree/connectors-py/lib/python3.9/site-packages/crontab/_crontab.py", line 386, in __init__
    self.matchers = self._make_matchers(crontab, loop, random_seconds)
  File "/Users/artemshelkovnikov/git_tree/connectors-py/lib/python3.9/site-packages/crontab/_crontab.py", line 410, in _make_matchers
    matchers = [_Matcher(which, entry, loop) for which, entry in enumerate(ct)]
  File "/Users/artemshelkovnikov/git_tree/connectors-py/lib/python3.9/site-packages/crontab/_crontab.py", line 410, in <listcomp>
    matchers = [_Matcher(which, entry, loop) for which, entry in enumerate(ct)]
  File "/Users/artemshelkovnikov/git_tree/connectors-py/lib/python3.9/site-packages/crontab/_crontab.py", line 204, in __init__
    al, en = self._parse_crontab(which, it)
  File "/Users/artemshelkovnikov/git_tree/connectors-py/lib/python3.9/site-packages/crontab/_crontab.py", line 325, in _parse_crontab
    _assert(which in (DAY_OFFSET, WEEK_OFFSET),
  File "/Users/artemshelkovnikov/git_tree/connectors-py/lib/python3.9/site-packages/crontab/_crontab.py", line 181, in _assert
    raise ValueError(message%args)
ValueError: cannot use '?' in the 'year' field

Expected behavior

Connector is able to sync, respecting the schedule

An index is not getting created when the index name has any uppercase letter

An index is not getting created when the index name has any uppercase letter.

To Reproduce

Steps to reproduce the behavior:

  1. Edit the config.yml file and configure the correct value for the Elasticsearch host.
  2. Go to connectors/Sources and edit the connector.py file
  3. Update the package and run python3 kibana.py --config-file config.yml --service-type <service_type_file> --index-name <index_name>. The index name should contain uppercase letters, e.g. search-NetworkDrive
  4. Execute the command elastic-ingest -c config.yml --action poll to index the file in Elasticsearch. It will successfully index the documents.
  5. Check for the index created in the Index Management.

Actual Result:
The kibana and elastic-ingest executions complete and no error is shown about the index not being created. On searching for the index in Index Management, there is no such index with an uppercase letter in its name, i.e. search-NetworkDrive

Expected behavior

The user should be shown an error message that the index is not created when its name contains uppercase letters, just as the error is shown when creating an index with an uppercase letter via the API.

Screenshots

error while creating index with upper case letter via API
sync done with elastic-ingest
kibana executed with upper case letter in index name NetworkDrive
no index found

Additional context

If the user creates the index via the API, it shows an error message: "type": "invalid_index_name_exception", "reason": "Invalid index name [search-NetworkDrive], must be lowercase"

This is a test

Bug Description

Sean is filing this to test our new github actions. He'll close this momentarily

To Reproduce

Steps to reproduce the behavior:

  1. Go to '...'
  2. Click on '....'
  3. Scroll down to '....'
  4. See error

Expected behavior

Screenshots

Environment

  • OS: [e.g. iOS]
  • Browser [e.g. chrome, safari]
  • Version [e.g. 22]

Additional context

Include Tika in the AWS connector so the SUPPORTED_FILETYPE list can include csv, json and xml files

Problem Description

The new AWS connector connects to S3 - people place standard data file types here, e.g. log.json, table.csv, and old.xml files.
Our currently supported types target programming-language files. S3 isn't the normal place to keep your Python, Ruby, and shell scripts.

Proposed Solution

Since Tika is used throughout Enterprise Search to handle multiple file types, we should use it within the AWS connector so we can support these file types here as well.

Alternatives

The alternative is to pull in the binary representation and use an ingest pipeline (using Tika) to perform the extraction, as in the sketch below.
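
Roughly (illustrative only: the index name, pipeline name, file, and local URL are placeholders, and the attachment processor must be available on the cluster), that alternative could look like:

import base64

import requests

ES_URL = "http://localhost:9200"  # placeholder
HEADERS = {"Content-Type": "application/json"}

# Create a pipeline that runs Tika-based extraction on a base64-encoded "data" field.
requests.put(
    f"{ES_URL}/_ingest/pipeline/s3-attachments",
    headers=HEADERS,
    json={"processors": [{"attachment": {"field": "data"}}]},
).raise_for_status()

# Index a raw file through the pipeline; the extracted text lands in attachment.content.
with open("table.csv", "rb") as f:
    payload = {"data": base64.b64encode(f.read()).decode()}
requests.post(
    f"{ES_URL}/my-s3-index/_doc?pipeline=s3-attachments",
    headers=HEADERS,
    json=payload,
).raise_for_status()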

Additional Context

This is an awesome connector, and the directory one too.

Someone asked about Avro files stored in S3, so I assume the list of requested file types is endless.
For something like Avro, I would assume we should build an ingest pipeline to handle these non-Tika types.

MySQL logging improvement

These messages, logged at debug level:
Next sync for mysql due in -1 seconds
create confusion and should be made more explicit.

Logs should be rephrased to state that the sync is disabled.

The recurring sync keeps executing (every minute) in spite of scheduling enabled being set to FALSE

The recurring sync keeps executing (every minute) in spite of scheduling enabled being set to FALSE.

  1. Keep the scheduling enabled default value as enabled: False and run the connector with sync_now: true.
  2. Execute the command elastic-ingest -c config.yml --action poll and wait for all the documents to be indexed.
  3. Once the execution is completed for the first time, check whether another recurring sync happens.
  4. Validate the recurring sync after the indexing execution of the previous command is completed.

Expected Result:
The recurring sync should not happen, since scheduling enabled is set to false.

Actual Result:
The recurring sync is happening every minute; it is not respecting enabled being set to false. The sync still happens every minute.

Error observed while executing elastic-ingest command with list argument

Steps to reproduce:

  1. Install the connector using pip install command
  2. After successful installation, execute the command elastic-ingest -c config.yml --action list

Actual Result:
Error appears as "Attribute Error:'NoneType' object has no attribute 'strip'"

Expected Result:
List of sources should be displayed on executing elastic-ingest with list argument.

@tarekziade Can you please take a look into it?
MySQL_Error on list argument output

Attaching screenshot for reference.

Sync button in the UI is exhibiting inconsistent behaviour

Bug Description

Once the user sets the schedule and executes the poll command, the first sync completes successfully, but at times the next sync does not get invoked as per the schedule and the user needs to manually click the Sync button in the UI.

Pre-requisites:

  1. Create an index from the UI
  2. Generate the API key and assign privileges to the API key
  3. Configure config.yml file with proper API key and source.py file with appropriate parameter values
  4. Update the package

To Reproduce

Steps to reproduce the behavior:

  1. Execute the indexing using the command elastic-ingest -c config-file --action poll
  2. Set schedule in the UI for every 1 minute and click Sync
  3. Let the first sync complete successfully and notice whether the next sync gets triggered automatically as per the schedule

Expected behavior

The next sync should get triggered automatically as per the schedule

Actual behavior

The next sync does not trigger as per the schedule, and the user needs to manually click the Sync button in the UI to start the sync.
This is observed as inconsistent behaviour.

Environment

Elasticsearch v8.4.1
macOS Monterey v12.5

The elastic-ingest execution remains incomplete when the scheduling enabled is set to FALSE

It is observed that when scheduling enabled is set to False, a second manual execution of the elastic-ingest command does not complete; the user has to terminate the script, as it keeps running without indexing any documents.

Steps:

  1. Set the scheduling enabled: False in kibana.py and run the connector
  2. Execute the command elastic-ingest -c config.yml --action poll and wait for all the documents to be indexed
  3. Once the execution is completed for the first time, terminate the script manually (abruptly killing the cmd service) to ensure the connector service is completely stopped
  4. Run the poll service again with the same config i.e. scheduling enable: False
  5. Wait for the service to index the documents or print proper logs in case the documents are not indexed

Actual Result:
The service continuously keeps running without indexing any documents

Expected Result:
The service should either index the documents or print a proper logger message

What we understand from the scheduling flag in the config is that it decides whether or not to schedule the connector. Setting it to False would mean the connector is not a scheduled run but should still run on demand, i.e. with scheduling set to False, if I want to index documents on demand, I can start the poll service at any point in time and the service should index all the documents present up to that time.
However, this only happens for the first run, i.e. if I try to execute the connector again (after killing the service completely), the documents are not indexed and no proper logs are shown.

@tarekziade Please look into this and confirm the behavior when scheduling enabled is set to False.

Build fails on Python 3.6

The pinned versions of some of the libraries listed in requirements.txt do not support Python 3.6, so setup will fail when running on Python 3.6.

  • motor 3.0.0, aioboto3 9.6.0, and pytest-asyncio 0.19.0 require Python >= 3.7

The setup file says 'Requires Python 3.6 or superior'. I think it is safe to remove support for Python 3.6 as it has reached its EOL.

Suggested change in setup file:

if sys.version_info < (3, 7):
    raise ValueError("Requires Python 3.7 or superior")

Run buildkite on centos 7

Problem Description

We need to support centos 7, let's make sure the service works there

Proposed Solution

A new Dockerfile for CentOS 7 + Python 3.7, used in CI to run our tests, in parallel with the Python image

FROM centos:7

RUN yum update -y
RUN yum groupinstall "Development Tools" -y
RUN yum install openssl-devel libffi-devel bzip2-devel -y
RUN yum install wget -y

# open ssl
RUN wget https://www.openssl.org/source/openssl-3.0.5.tar.gz
RUN yum install perl-IPC-Cmd -y
RUN tar xzf openssl-3.0.5.tar.gz
RUN cd openssl-3.0.5 && ./config --prefix=/usr/local/custom-openssl --libdir=lib --openssldir=/etc/ssl && make -j1 depend && make -j8 && make install_sw

# python 3.10
RUN wget https://www.python.org/ftp/python/3.10.0/Python-3.10.0.tgz
RUN tar xvf Python-3.10.0.tgz
RUN cd Python-3.10.0 && ./configure --with-openssl=/usr/local/custom-openssl --with-openssl-rpath=auto --prefix=/usr/local/custom-openssl && make -j8 && make altinstall
RUN yum install -y python3-pip
RUN pip3 install certifi
RUN python3 -c 'import urllib.request;  print(urllib.request.urlopen("https://python.org/").status)'

Alternatives

Additional Context

Clarify `QueueFull` exception

Problem Description

On QueueFull exception in https://github.com/elastic/connectors-python/blob/main/connectors/utils.py#L292
it's hard to know why we've reached the limit.

Proposed Solution

When we raise the error, let's add more debug info:

  • self._current_memsize
  • item_size
  • refresh_timeout

This will help us understand the size of the queue and the size of the item we are trying to put in it; maybe that particular item is huge and bigger than the queue limit.

We can also surface refresh_timeout.
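
For example, the raise site could put those numbers straight into the exception message (a sketch; the attribute names come from the bullet list above and may not match the code exactly):

import asyncio


class MemQueue:
    """Stripped-down stand-in for the memory-aware queue in connectors/utils.py."""

    def __init__(self, maxmemsize, refresh_timeout):
        self._maxmemsize = maxmemsize
        self._current_memsize = 0
        self.refresh_timeout = refresh_timeout

    def _raise_queue_full(self, item_size):
        # Include everything needed to understand *why* the limit was hit.
        raise asyncio.QueueFull(
            f"Queue max memory of {self._maxmemsize} bytes reached: "
            f"currently holding {self._current_memsize} bytes, "
            f"incoming item is {item_size} bytes, "
            f"gave up after waiting {self.refresh_timeout}s for space to free up"
        )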

Alternatives

Additional Context

Intermittent test failure

def test_next_run():
        assert next_run("1 * * * * *") < 70.0
        assert next_run("* * * * * *") == 0
>       assert next_run("0/5 14,18,3-39,52 * ? JAN,MAR,SEP MON-FRI 2002-2010") > 0


connectors/tests/test_utils.py:8: 

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
connectors/utils.py:93: in next_run
    when = cron_obj.next_trigger()
    connectors/quartz.py:1081: in next_trigger
    self._process_time_unit_queue(overflow, unit_names)
    connectors/quartz.py:1042: in _process_time_unit_queue
    ) = self._get_parser(unit_name)().parse(

connectors/quartz.py:306: in parse
    ) = self._comma_handler(date_pointer, value, _trigger_secondary)

connectors/quartz.py:177: in _comma_handler
    values = sorted([int(val) for val in value.split(",")])
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

0 = <list_iterator object at 0x7f0b004ff250>
>   values = sorted([int(val) for val in value.split(",")])

E   ValueError: invalid literal for int() with base 10: '3-39'


connectors/quartz.py:177: ValueError

Verify.py file has elasticsearch config parameter as `user`, should be `username`

make ftest fails due to an incorrect config parameter name in the verify.py file.

To Reproduce

Steps to reproduce the behavior:

  1. Configure docker and config file from the fixtures
  2. execute the command: make run-stack
  3. install the package: make install
  4. run the ftest: make ftest NAME=<connector_name>

Expected behavior

The ftest should complete, and verify.py should check that more than 10k documents are indexed.

Actual result

make ftest fails with KeyError: 'user' in the verify.py file

Screenshots

image

Additional context

The config.yml has the Elasticsearch parameters as username and password, so username should be used in place of user in verify.py too.
https://github.com/elastic/connectors-python/blob/e3e257d9967f5283097e419421f92abad372e8a5/config.yml#L3

Unable to connect AWS S3

Steps to reproduce:

- Edit config.yml file and configure correct value for elasticsearch host
- Go to connectors/Sources and update configuration values in aws.py file
- Update the package and Run python3 kibana.py --config-file config.yml --service-type s3 --index-name <index_name>
- Execute command elastic-ingest -c config.yml --action poll to index the files in elasticsearch
- Observe the logs [Connection Timeout Error]

Actual Result:

Connect timeout on endpoint URL: "https://169.254.169.254/latest/api/token"

Expected Result:

Sync done: 1 indexed, 0  deleted.

Screenshot 2022-08-24 190552

Misleading information in the count displayed in .elastic-connectors-sync-jobs index

Bug Description
Misleading information in the count displayed in .elastic-connectors-sync-jobs index

Pre-requisite
Go to Elasticsearch v8.4.1 and perform below steps:

  1. Add integration -> Build connector
  2. Provide index name and generate API key
  3. Assign privileges to the API
  4. Set recurring schedule using Sync

Steps to reproduce the behavior:

  1. Edit config.yml file and configure correct value for elasticsearch host and API key
  2. Go to connectors/Sources and edit connector.py file
  3. Update the package and execute command elastic-ingest -c config.yml --action poll to index the file in elasticsearch
  4. Observe the count displayed in the logs and in .elastic-connectors-sync-jobs index in kibana

Expected Result:
Count should be properly displayed in console and in .elastic-connectors-sync-jobs index in kibana

Actual Result:
The count is properly displayed in the console but is misleading in the .elastic-connectors-sync-jobs index in Kibana.
The actual number of documents indexed is shown in deleted_document_count, and indexed_document_count is shown as 0.
Also, after the first sync is completed, from the second sync onwards the count is not updated at all in the .elastic-connectors-sync-jobs index in Kibana, irrespective of any new documents indexed or existing documents deleted from the source.
Screen Shot 2022-09-09 at 2 54 51 PM
Screen Shot 2022-09-09 at 3 07 03 PM
Screen Shot 2022-09-09 at 3 07 25 PM

While indexing data from Network Drive to Elasticsearch, a version_conflict_engine_exception occurs

Bug Description

While indexing data from Network Drive to Elasticsearch, a version_conflict_engine_exception occurs.

To Reproduce

Steps to reproduce the behavior:

  1. Create an Index in Elasticsearch
  2. Index documents by executing the elastic-ingest --action poll -c config.yml command and check for sync completion.
  3. Validate the sync completion process

Expected behavior

The sync should get completed by indexing all the files available in the path of a network drive.

Actual behavior:

The sync is not getting completed and shows a version_conflict_engine_exception.

Screenshots:

Attaching a log file:
https://drive.google.com/file/d/1EEi6QaWAK4boL8z3hcJw4qWqW39r-jm-/view?usp=sharing

Environment

Linux CentOS7

Additional context

We think this might have occurred due to the event loop, as update_by_query takes a snapshot of the index before updating it, which conflicts with the document version in Elasticsearch.
