elastic / connectors

Source code for all Elastic connectors, developed by the Search team at Elastic, and home of our Python connector development framework

Home Page: https://www.elastic.co/guide/en/enterprise-search/master/index.html

License: Other

Python 99.06% Makefile 0.08% Shell 0.86% Dockerfile 0.01% DIGITAL Command Language 0.01%
enterprise-search app-search elastic elastic-stack elasticsearch workplace-search

connectors' Introduction


Elastic connectors


Connectors

This repository contains the source code for all Elastic connectors, developed by the Search team at Elastic. Use connectors to sync data from popular data sources to Elasticsearch.

These connectors are available both as Elastic-managed connectors and as self-managed connector clients.

ℹī¸ For an overview of the steps involved in deploying connector clients refer to Connector clients in the official Elastic documentation.

To get started quickly with self-managed connectors using Docker Compose, check out this README file.

Connector documentation

The main documentation for using connectors lives in the Search solution's docs, which also includes the individual reference pages for each connector.

For everything to do with developing connectors, you'll find that here in this repo.

API documentation

Since 8.12.0, you can manage connectors and sync jobs programmatically using APIs. Refer to the Connector API documentation in the Elasticsearch docs.
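
The exact request shapes are documented there; as a rough illustration only (the URL, API key, and connector id below are placeholders), listing connectors and kicking off an on-demand sync job over HTTP could look like this:

import requests

ES_URL = "http://localhost:9200"  # placeholder Elasticsearch endpoint
HEADERS = {
    "Authorization": "ApiKey <your-api-key>",  # placeholder credentials
    "Content-Type": "application/json",
}

# List existing connectors (Connector API, available since 8.12.0).
resp = requests.get(f"{ES_URL}/_connector", headers=HEADERS)
resp.raise_for_status()
for connector in resp.json().get("results", []):
    print(connector.get("id"), connector.get("index_name"))

# Request an on-demand sync job for one connector.
resp = requests.post(
    f"{ES_URL}/_connector/_sync_job",
    headers=HEADERS,
    json={"id": "<connector-id>", "job_type": "full"},  # placeholder connector id
)
resp.raise_for_status()
print(resp.json())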

Command-line interface

Learn about our CLI tool in docs/CLI.md.

Connector service code

In addition to the source code for individual connectors, this repo also contains the connector service code, which handles tasks like running connectors and managing scheduling, syncs, and cleanup. This shared code is not used by individual connectors, but coordinates and runs a deployed instance/process.

Connector framework

This repo is also the home of the Elastic connector framework. This framework enables developers to build Elastic-supported connector clients. The framework implements common functionalities out of the box, so developers can focus on the logic specific to integrating their chosen data source.

The framework ensures compatibility and makes it easier for our team to review PRs and help out in the development process. When you build using our framework, we provide a pathway for the connector to become officially supported by Elastic.
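
As a hand-wavy sketch of what a connector client built on the framework looks like (the base-class API, method names, and configuration shape below are assumptions patterned after the sources in connectors/sources/, so check connectors/source.py and the framework guides before relying on them):

from connectors.source import BaseDataSource


class FruitDataSource(BaseDataSource):
    """Hypothetical data source that syncs a tiny in-memory dataset."""

    name = "Fruit"
    service_type = "fruit"  # hypothetical service type

    @classmethod
    def get_default_configuration(cls):
        # Configurable fields surfaced in Kibana; the keys here are assumptions.
        return {
            "host": {"label": "Fruit API host", "order": 1, "type": "str", "value": ""},
        }

    async def ping(self):
        # Cheap connectivity check; raise if the remote system is unreachable.
        pass

    async def get_docs(self, filtering=None):
        # Yield (document, lazy_download) pairs; None means "no attachment to download".
        for i, fruit in enumerate(["apple", "banana"]):
            yield {"_id": str(i), "name": fruit}, None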

Running a self-managed stack

This repo provides a set of scripts to allow a user to set up a full Elasticsearch, Kibana, and Connectors service stack using Docker. This is useful to get up and running with the Connectors framework with minimal effort, and provides a guided set of prompts for setup and configuration. For more information, instructions, and options, see the README file in the stack folder.

Framework use cases

The framework serves two distinct, but related use cases:

  • Customizing an existing Elastic connector client
  • Building a new connector client

Guides for using the framework

connectors' People

Contributors

abhishekjoshi-crest, acrewdson, afoucret, akanshi-crest, akanshi-elastic, artem-shelkovnikov, danajuratoni, dependabot[bot], dianajourdan, efegurkan, j-bennet, jedrazb, jignesh-crest, leemthompo, markjhoy, mchernyavskaya, moxarth-elastic, navarone-feekery, oli-g, parth-crest, parth-elastic, parthpuri-elastic, praveen-elastic, saarikabhasi, seanstory, sphilipse, tarekziade, timgrein, vidok, wangch079


connectors' Issues

Shorter MySQL configurable field names

As MySQL does not need to be part of each configurable field name, we need to rename the fields as listed below.

Acceptance criteria

  • Host
  • Port
  • Username
  • Password
  • Databases

Note:
This change should be visible only for 8.6 and will require the elastic.co documentation to be updated.

Recurring sync scheduled every one minute from the UI does not work as expected; after the first sync completes, the second sync is triggered after only a few seconds

Bug Description

A recurring sync scheduled every one minute from the UI does not work as expected: after the first sync is completed, the second sync gets triggered immediately, after only a few seconds.

Pre-requisites

  1. Create an index from the UI
  2. Generate the API key and assign privileges to the API key
  3. Configure config.yml file with proper API key and source.py file with appropriate parameter values
  4. Update the package

To Reproduce

Steps to reproduce the behavior:

  1. Execute the indexing using the command elastic-ingest -c config-file --action poll
  2. Set schedule in the UI for every 1 minute and click Sync
  3. Observe the execution process and notice if the next sync starts 1 minute after the previous sync is completed

Expected behavior

Next sync should start 1 minute after the previous sync is completed

Actual behavior

Next sync starts immediately after the previous sync is completed

Screenshots

Please find attached screenshots for reference
next sync triggers immediately

Environment

Linux CentOS 7

Count of total sync done is always 0 irrespective of the total number of documents indexed

Steps to reproduce:

  1. Edit config.yml file and configure correct value for elasticsearch host
  2. Go to connectors/Sources and edit connector.py file
  3. Update the package and Run python3 kibana.py --config-file config.yml --service-type <service_type_file> --index-name <index_name>
  4. Upload a new file with .txt extension in the source
  5. Execute command elastic-ingest -c config.yml --action poll to index the file in elasticsearch
  6. Observe the count displayed on logs

Actual Result:

Sync done: 0 indexed, 0  deleted.

Expected Result:

Sync done: 1 indexed, 0  deleted.

Note: probably a regression from the #14 fix.

Error handling issue in MySQL

Bug Description

There might be an error handling issue. https://github.com/elastic/connectors-python/blob/main/connectors/sources/mysql.py#L257 indicates that the list of databases provided should be comma-separated. If I run it with just the first database, it creates a bunch of documents. But if I then update the config to use the comma notation and specify a second (but bogus) database, it seems to fail the whole job and delete all the documents previously indexed when only the first database was configured. Shouldn't it fail only the second database and not delete the entries previously indexed from the first? Or is this an all-or-nothing design?

Use string data type in place of long for field 'id'

Problem Description

When we run the connector for a source with very large ids and scheduling enabled, the first interval (first sync) runs successfully. However, the second interval (next sync) throws a KeyError for the field 'id'.

We have charted down the following observation:
In the byoei.py file, the content of the _id field is duplicated into another field 'id':

https://github.com/elastic/connectors-python/blob/05ec3b4f40a3562dde518c636314f07dd798001d/connectors/byoei.py#L152

However, Elasticsearch maps 'id' as type long, a signed 64-bit integer, so it only allows values up to 2^63 - 1.
So, when the bulk API is invoked inside the byoei file, the response stored in the res variable on this line:
https://github.com/elastic/connectors-python/blob/05ec3b4f40a3562dde518c636314f07dd798001d/connectors/byoei.py#L35
shows that an error is thrown: mapper_parsing_exception, reason: failed to parse field [id] of type [long] in document.

Proposed Solution

Since _id allows values up to 512 bytes, but the long type is limited to 2^63 - 1, we should store 'id' as a string:

doc_id = doc["id"] = str(doc.pop("_id"))

Additional Context

Attaching few screenshots for your reference:

Error logs
image

response on hitting bulk API.
image

MySQL align BE to reflect "List of MySQL databases" is required

Current behavior

Configured MySQL with the vault data, leaving "List of MySQL databases" empty.
Sync gets triggered, but 0 docs are indexed.
image

Expected behavior

For 8.6 we should align the BE code to reflect that no databases should be synced if the parameter "List of MySQL databases" is empty.

Context

This ticket is a short-/mid-term solution. The long-term solution is to add a UI option where the user can explicitly select that they want all databases; that enhancement depends on adding support for rich configurable fields in the framework.

Ref code

Errors while running the connector terminate the loop

Describe the bug

If an error happens during the connector run, then the application hangs indefinitely until terminated.

To Reproduce

Steps to reproduce the behavior:

  1. Create a mongodb connector
  2. In Kibana UI assign invalid port to it - for example '123456'
  3. Run the connector host
  4. See error:
[FMWK][14:55:17][CRITICAL] Port must be an integer between 0 and 65535: '270211'
Traceback (most recent call last):
  File "/Users/artemshelkovnikov/git_tree/connectors-py/connectors/runner.py", line 100, in poll
    data_source = get_data_source(connector, self.config)
  File "/Users/artemshelkovnikov/git_tree/connectors-py/connectors/source.py", line 118, in get_data_source
    _CACHED_SOURCES[service_type] = get_source_klass(
  File "/Users/artemshelkovnikov/git_tree/connectors-py/connectors/sources/mongo.py", line 20, in __init__
    self.client = AsyncIOMotorClient(
  File "/Users/artemshelkovnikov/git_tree/connectors-py/lib/python3.9/site-packages/motor/core.py", line 148, in __init__
    delegate = self.__delegate_class__(*args, **kwargs)
  File "/Users/artemshelkovnikov/git_tree/connectors-py/lib/python3.9/site-packages/pymongo/mongo_client.py", line 743, in __init__
    seeds.update(uri_parser.split_hosts(entity, port))
  File "/Users/artemshelkovnikov/git_tree/connectors-py/lib/python3.9/site-packages/pymongo/uri_parser.py", line 376, in split_hosts
    nodes.append(parse_host(entity, port))
  File "/Users/artemshelkovnikov/git_tree/connectors-py/lib/python3.9/site-packages/pymongo/uri_parser.py", line 137, in parse_host
    raise ValueError("Port must be an integer between 0 and 65535: %r" % (port,))
ValueError: Port must be an integer between 0 and 65535: '270211'
Traceback (most recent call last):
  File "/Users/artemshelkovnikov/git_tree/connectors-py/bin/elastic-ingest", line 33, in <module>
    sys.exit(load_entry_point('elasticsearch-connectors', 'console_scripts', 'elastic-ingest')())
  File "/Users/artemshelkovnikov/git_tree/connectors-py/connectors/cli.py", line 77, in main
    return run(args)
  File "/Users/artemshelkovnikov/git_tree/connectors-py/connectors/runner.py", line 147, in run
    logger.info("Bye")
  File "/usr/local/Cellar/[email protected]/3.9.12/Frameworks/Python.framework/Versions/3.9/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "/Users/artemshelkovnikov/git_tree/connectors-py/connectors/runner.py", line 114, in poll
    await self.connectors.close()
  File "/Users/artemshelkovnikov/git_tree/connectors-py/connectors/runner.py", line 66, in raise_if_spurious
    raise exception
  File "/Users/artemshelkovnikov/git_tree/connectors-py/connectors/runner.py", line 100, in poll
    data_source = get_data_source(connector, self.config)
  File "/Users/artemshelkovnikov/git_tree/connectors-py/connectors/source.py", line 118, in get_data_source
    _CACHED_SOURCES[service_type] = get_source_klass(
  File "/Users/artemshelkovnikov/git_tree/connectors-py/connectors/sources/mongo.py", line 20, in __init__
    self.client = AsyncIOMotorClient(
  File "/Users/artemshelkovnikov/git_tree/connectors-py/lib/python3.9/site-packages/motor/core.py", line 148, in __init__
    delegate = self.__delegate_class__(*args, **kwargs)
  File "/Users/artemshelkovnikov/git_tree/connectors-py/lib/python3.9/site-packages/pymongo/mongo_client.py", line 743, in __init__
    seeds.update(uri_parser.split_hosts(entity, port))
  File "/Users/artemshelkovnikov/git_tree/connectors-py/lib/python3.9/site-packages/pymongo/uri_parser.py", line 376, in split_hosts
    nodes.append(parse_host(entity, port))
  File "/Users/artemshelkovnikov/git_tree/connectors-py/lib/python3.9/site-packages/pymongo/uri_parser.py", line 137, in parse_host
    raise ValueError("Port must be an integer between 0 and 65535: %r" % (port,))
ValueError: Port must be an integer between 0 and 65535: '270211'
  5. Observe the output in the terminal - nothing further will happen

Expected behavior

Service should display the error and continue running. Once config is corrected, service properly runs mongodb connector.

Add support for environment variables

We want to be able to use elasticsearch environment variables in docker, so the config file can look like:

elasticsearch:
  host: ${host}
  user: ${username}
  password: ${password}
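
One possible way to support this (a sketch of the general idea, not necessarily how the service should implement it) is to expand environment variables while loading the YAML:

import os

import yaml


def load_config(path):
    """Read a YAML config file and expand ${VAR}-style environment variables."""
    with open(path) as f:
        raw = f.read()
    # os.path.expandvars replaces ${VAR} and $VAR with the value from the
    # environment and leaves unknown variables untouched.
    return yaml.safe_load(os.path.expandvars(raw))


# e.g. docker run -e host=http://elasticsearch:9200 -e username=elastic -e password=changeme ...
# config = load_config("config.yml")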

Plug the new mappings creation

Now that we have #27 we can plug it, so the index is properly created if it does not exist.

Steps:

  • move templates/elasticsearch/index to connectors/index (that also unshadows elasticsearch)
  • move all tests located in templates/elasticsearch/index/tests to connectors/tests
  • in connectors/byoc.py change the elastic_server.prepare_index(self.index_name) so it passes the mappings and settings baked by the new helpers.

Bonus:

  1. add a high-level function that returns settings and mappings in a single call.
  2. consider merging index/mappings.py and index/settings.py into a single index.py file to simplify the code layout
  3. if 2. is done, consider renaming IndexMappings to Mappings or DefaultMappings so one can do from connectors.index import Mappings.

Tests fail when run locally - test_aws.py::test_get_docs tries to connect to 169.254.169.254

When running make test locally, 1 test fails:

================================================================================================================================================= short test summary info ==================================================================================================================================================
FAILED connectors/sources/tests/test_aws.py::test_get_docs - botocore.exceptions.ConnectTimeoutError: Connect timeout on endpoint URL: "http://169.254.169.254/latest/api/token"

It seems like this test is still trying to access the real endpoint; this logic should be mocked.
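
One common way to keep botocore-based tests away from the real instance-metadata endpoint (a sketch; the repo's test fixtures may solve this differently, e.g. by mocking the S3 client itself) is to inject fake credentials and disable the metadata lookup via environment variables:

import pytest


@pytest.fixture(autouse=True)
def aws_test_environment(monkeypatch):
    """Keep botocore away from http://169.254.169.254 during unit tests."""
    monkeypatch.setenv("AWS_ACCESS_KEY_ID", "testing")
    monkeypatch.setenv("AWS_SECRET_ACCESS_KEY", "testing")
    monkeypatch.setenv("AWS_DEFAULT_REGION", "us-east-1")
    # botocore honors this flag and skips the EC2 instance-metadata service.
    monkeypatch.setenv("AWS_EC2_METADATA_DISABLED", "true")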

Getting `NoneType is not callable` when lazy_download is set to None

Bug Description

For sources that don't have attachments, we pass an explicit None for the lazy_download field. This is accepted in the first sync, but in the second sync it raises a "NoneType object is not callable" error.

Expected behavior

The second sync should skip content extraction when lazy_download is set to None, just like the first sync does. The first sync probably works only because the index is empty and the timestamp is not being checked.

Screenshots

Error raised while 2nd sync is running

image

Screenshot of the code where the issue persists

image

Generate resource and detailed memory usage reports in functional tests

Problem Description

Even with the new mem tool I added, it takes some time to dig into memory usage.
The same goes for profiling code: it requires running a series of perf tools.

Proposed Solution

Let's unify all our measurement tools under one script: https://github.com/tarekziade/perf8/

It'll run

  • psutil (memory, CPU, fds)
  • detailed memory usage with memray
  • profiling info with py-spy

Below are a few screenshots of the reports we get.

memray flame graph reports (two screenshots attached).

psutil outputs a CSV, from which we generate a matplotlib report (screenshot attached).

We can add this to make ftest.

Deleted document count displayed in the console along with the created document count

Bug Description

After creating a new index and indexing a new set of documents, the console shows a deleted document count along with the created count.

Pre-requisites

  1. An index already exists with 50k docs indexed
  2. Create a new index from the UI
  3. Generate the API key and assign privileges to the API key
  4. Configure the config.yml file with the proper API key and the source.py file (with a database name having limited records - 28 in this case) with appropriate parameter values
  5. Update the package

To Reproduce

Steps to reproduce the behavior:

  1. Execute the indexing using the command elastic-ingest -c config-file --action poll
  2. Set schedule in the UI for every 1 minute and click Sync
  3. Observe the execution process and the logs printed

Expected behavior

Count should be displayed as {'create': 28} and Sync done: 28 indexed, 0 deleted

Actual behavior

Count is displayed as {'create': 28, 'delete': 50000} and Sync done: 28 indexed, 50000 deleted. The deleted document count refers to the documents indexed in the older index, which is 50k.

Screenshots

indexed and deleted document count

Environment

Elasticsearch v8.4.1
macOS Monterey v12.5

FileNotFoundError observed while executing elastic-ingest poll command

Bug Description

Connector gives FileNotFoundError while running the poll command.

To Reproduce

Steps to reproduce the behavior:

  1. Go to config file and add host, username, password in elasticsearch and add connector in sources
  2. Go to connectors/Sources and add configuration parameters of source
  3. Update the package using the command pip install .
  4. Run python3 kibana.py --config-file config.yml --service-type <service_type_file> --index-name <index_name>
  5. Execute elastic-ingest -c config.yml --action poll
  6. Observe the logs

Expected behavior

Connector should successfully index the data to Elasticsearch without any error.

Screenshots

image

Additional context

The cause could be that the yml file is not installed when using "pip install ." to install the package. This might be fixed by including .yml files in the package data in setup.cfg, as shown below.

[options.package_data]
* = *.yml

Inform user about MySQL connection errors (e.g. invalid database)

If a sync is triggered on a non-existent database, don't obfuscate not-found errors. Currently the user does not get any feedback about issues with the connection or possible misconfiguration, and the sync reports 0 documents being indexed.

Expected behavior

Ideally, we should validate that the configuration is correct and inform the user about any possible issues.
An error message with a relevant explanation should guide the user, stating that the chosen database is not available. Ideally, also provide a list of the available databases.

This should be reported in the UI in a similar manner to how MongoDB does it:
image

NoneType: None is printed when unable to connect to Elasticsearch

Steps to reproduce:

  1. Edit config.yml file and configure incorrect value for elasticsearch host
  2. Edit sources.py file and configure all the correct parameter values for mysql server
  3. Update the package using the command pip3.7 install .
  4. Run python3.7 kibana.py
  5. Execute elastic-ingest -c config.yml --action poll

Actual Result:

NoneType: None gets printed on the console

Expected Result:

NoneType: None looks weird; is this expected? A proper connection error should be displayed instead.

MySQL_NoneTypeError

Serialize `settings`

Make sure we serialize connectors.index.Settings

elastic_transport.SerializationError: Unable to serialize to JSON: {'mappings': {'dynamic': 'true', 'dynamic_templates': [{'data': {'match_mapping_type': 'string', 'mapping': {'type': 'text', 'analyzer': 'iq_text_base', 'index_options': 'freqs', 'fields': {'stem': {'type': 'text', 'analyzer': 'iq_text_stem'}, 'prefix': {'type': 'text', 'analyzer': 'i_prefix', 'search_analyzer': 'q_prefix', 'index_options': 'docs'}, 'delimiter': {'type': 'text', 'analyzer': 'iq_text_delimiter', 'index_options': 'freqs'}, 'joined': {'type': 'text', 'analyzer': 'i_text_bigram', 'search_analyzer': 'q_text_bigram', 'index_options': 'freqs'}, 'enum': {'type': 'keyword', 'ignore_above': 2048}}}}}], 'properties': {'id': {'type': 'keyword'}, '_subextracted_as_of': {'type': 'date'}, '_subextracted_version': {'type': 'keyword'}}}, 'settings': <connectors.index.Settings object at 0x7fe0cd1bf400>} (type: dict)
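
The error shows a plain Python object being handed to the Elasticsearch client inside the request body. A generic way to avoid it (a sketch of the idea, not the repo's actual fix) is to convert the helper object into plain dicts before building the body:

# Sketch: only hand JSON-serializable data (dicts, lists, strings, numbers) to the client.

class Settings:
    """Stand-in for connectors.index.Settings; the real class lives in this repo."""

    def __init__(self, language_code=None, analysis_icu=False):
        self.language_code = language_code
        self.analysis_icu = analysis_icu

    def to_dict(self):
        # Whatever index settings the helper computes, returned as plain data.
        return {"analysis": {"analyzer": {}}}


settings = Settings()
body = {
    "mappings": {"dynamic": "true"},
    "settings": settings.to_dict(),  # plain dict, so elastic_transport can serialize it
}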

service_type not configurable through kibana UI(v8.4)

Bug Description

While running the connector through the CLI we have an option to set service_type; however, in the Kibana UI in version 8.4 there is no option to configure the service_type parameter, so it is passed as null by default when the .elastic-connectors index is created.

Expected behavior

Since we can set the service_type from the CLI, it should also be possible to do the same from the Kibana UI.

Memory leakage observed in MySQL connector while indexing large dataset

Bug Description

Memory leakage observed in the MySQL connector while indexing a large dataset of approx 10 GB and 200k documents (taken as example data).

Pre-requisite

Create a data setup in a MySQL DB with 5 databases, 50 tables, and 800 records per table, with a size of 50 KB per row

To Reproduce

Steps to reproduce the behavior:

  • Create an index in Elasticsearch using Kibana v8.5.0-SNAPSHOT and update privileges for the API key generated
  • Add MySQL configuration from the UI and provide all 5 databases as a comma separated values
  • Do necessary changes in the config file and utils.py file
  • Update package and execute poll command
  • Observe the behaviour of the connector and check if all documents are correctly indexed in Elasticsearch or not

Expected behavior

  • All the documents should be successfully indexed in Elasticsearch
  • Memory leakage should not happen

Actual Result

  • It indexed only a limited number of documents
  • We observed an error in the Kibana log file; the error snippet is below for reference
  • RAM usage gradually increases at the beginning and becomes fully utilized, to a point where document fetching stops and no further documents are indexed

Screenshots or Attachments

Click here for reference https://watch.screencastify.com/v/b4gPnMzPuPVPptMNUG2m

ResponseError: [parent] Data too large, data for [<http_request>] would be [513871946/490mb], which is larger than the limit of [510027366/486.3mb], real usage: [513870000/490mb], new bytes reserved: [1946/1.9kb], usages [eql_sequence=0/0b, fielddata=9080/8.8kb, request=1654784/1.5mb, inflight_requests=106877370/101.9mb, model_inference=0/0b]: circuit_breaking_exception: [circuit_breaking_exception] Reason: [parent] Data too large, data for [<http_request>] would be [513871946/490mb], which is larger than the limit of [510027366/486.3mb], real usage: [513870000/490mb], new bytes reserved: [1946/1.9kb], usages [eql_sequence=0/0b, fielddata=9080/8.8kb, request=1654784/1.5mb, inflight_requests=106877370/101.9mb, model_inference=0/0b]

Environment

  • OS: Linux CentOS7
  • h/w config - 1-core CPU, 2 GB RAM (also tested with 6-core CPU, 12 GB RAM and 8-core CPU, 16 GB RAM; same behaviour observed)
  • Elasticsearch version - v8.5.0-SNAPSHOT (8.5.0-c52257ee-SNAPSHOT)

Additional context

  • We also tried this with a single database and a minimal script for MySQL, but noted the same behavior. Sharing the script in Slack for a quick look.
  • To re-confirm whether this is specific to the MySQL connector, we checked this issue with the Network Drive connector on a large dataset and found that it shows the same error.

Connection timed out error while executing elastic-ingest with large dataset

Bug Description

Connection timed out error while executing elastic-ingest with large dataset

To Reproduce

Steps to reproduce the behavior:

  1. Configure config.yml file with proper values
  2. Configure mysql.py files with proper values for connecting to MySQL server and pass empty list as an argument in database parameter
  3. Update the package and execute kibana.py
  4. Execute elastic-ingest -c config.yml --action poll command

Expected behavior

Connection timed out error should not occur and all the documents should be successfully indexed into elasticsearch

Actual behaviour

  1. Facing a connection timed out error, and documents are being missed from indexing.
  2. There are around 800k records but only approximately 400k records are getting indexed, so documents are also being missed from indexing into Elasticsearch due to the connection timed out error

Note - Please find attached screenshot for more details

MySQL_LargeData_ConnectionTimedOut

Environment

Linux VM CentOS7

Misleading count information in logger message

Pre-requisite:

  1. One document already indexed to Elasticsearch
  2. Added one new file in source

Test Steps:

  1. Go to connectors/Sources and edit connector.py file
  2. Update the package and Run python3 kibana.py --config-file config.yml --service-type <service_type_file> --index-name <index_name>
  3. Upload a new file with .txt extension in the source
  4. Execute command elastic-ingest -c config.yml --action poll to index the file in elasticsearch
  5. Observe the count displayed on logs
  6. Login to elastic search and Check the newly indexed file

Expected Result:

  1. The count shown on the console should reflect the newly added file.
    In this case only 1 txt file was uploaded, so the count should be create: 1 and Sync done: 1

Actual Result:
An extra count is shown on the console for the txt file as {'create': 1, 'update': 1} and Sync done: 2 indexed.

get_default_configuration() method is not being loaded properly through Kibana UI(v8.4)

Bug Description

When running the connector through the Kibana UI (v8.4), we are unable to load the default configuration of a connector properly because the method get_default_configuration() is not being used. When running the connector from the CLI, we passed the default configuration in the kibana file.

Expected behavior

In the Kibana UI (v8.4) there should be a way to load the default configuration of a connector.

Connector is unable to sync when running with schedule set by Kibana

Describe the bug

The connector is unable to run when created through the Kibana flow, due to a problem with the Quartz cron expression stored by Kibana.

To Reproduce

Steps to reproduce the behavior:

  1. Create a connector record in Kibana, you can use this query:
POST .elastic-connectors/_doc/
{
  "configuration": {
    "database": {
      "label": "MongoDB Database",
      "value": "listingsAndReviews"
    },
    "host": {
      "label": "MongoDB Server Hostname",
      "value": "127.0.0.1:27028"
    },
    "collection": {
      "label": "MongoDB Collection",
      "value": "sample_airbnb"
    }
  },
  "index_name": "search-mongodb",
  "language": null,
  "last_seen": "2022-08-25T10:16:45.502+00:00",
  "last_sync_error": null,
  "last_sync_status": null,
  "last_synced": null,
  "name": "mongodb",
  "scheduling": {
    "enabled": false,
    "interval": "0 0 0 * * ?"
  },
  "service_type": "mongo",
  "status": "configured",
  "sync_now": false
}
  2. Go to Kibana, change the scheduling interval to any interval, and enable scheduling
  3. Run the connector
  4. See error:
[12:44:58][CRITICAL] cannot use '?' in the 'year' field
Traceback (most recent call last):
  File "/Users/artemshelkovnikov/git_tree/connectors-py/connectors/runner.py", line 71, in poll
    await connector.sync(data_source, es, self.idling)
  File "/Users/artemshelkovnikov/git_tree/connectors-py/connectors/byoc.py", line 212, in sync
    next_sync = self.next_sync()
  File "/Users/artemshelkovnikov/git_tree/connectors-py/connectors/byoc.py", line 177, in next_sync
    return CronTab(self.scheduling["interval"]).next(default_utc=True)
  File "/Users/artemshelkovnikov/git_tree/connectors-py/lib/python3.9/site-packages/crontab/_crontab.py", line 386, in __init__
    self.matchers = self._make_matchers(crontab, loop, random_seconds)
  File "/Users/artemshelkovnikov/git_tree/connectors-py/lib/python3.9/site-packages/crontab/_crontab.py", line 410, in _make_matchers
    matchers = [_Matcher(which, entry, loop) for which, entry in enumerate(ct)]
  File "/Users/artemshelkovnikov/git_tree/connectors-py/lib/python3.9/site-packages/crontab/_crontab.py", line 410, in <listcomp>
    matchers = [_Matcher(which, entry, loop) for which, entry in enumerate(ct)]
  File "/Users/artemshelkovnikov/git_tree/connectors-py/lib/python3.9/site-packages/crontab/_crontab.py", line 204, in __init__
    al, en = self._parse_crontab(which, it)
  File "/Users/artemshelkovnikov/git_tree/connectors-py/lib/python3.9/site-packages/crontab/_crontab.py", line 325, in _parse_crontab
    _assert(which in (DAY_OFFSET, WEEK_OFFSET),
  File "/Users/artemshelkovnikov/git_tree/connectors-py/lib/python3.9/site-packages/crontab/_crontab.py", line 181, in _assert
    raise ValueError(message%args)
ValueError: cannot use '?' in the 'year' field

Expected behavior

Connector is able to sync, respecting the schedule

An index is not getting created when the index name has any uppercase letter

An index is not getting created when the index name has any uppercase letter.

To Reproduce

Steps to reproduce the behavior:

  1. Edit the config.yml file and configure the correct value for the Elasticsearch host.
  2. Go to connectors/Sources and edit the connector.py file
  3. Update the package and run python3 kibana.py --config-file config.yml --service-type <service_type_file> --index-name <index_name>. The index name should contain uppercase letters, e.g. search-NetworkDrive
  4. Execute the command elastic-ingest -c config.yml --action poll to index the file in Elasticsearch. It will successfully index the documents.
  5. Check for the index created in the Index Management.

Actual Result:
The kibana and elastic-ingest executions complete and no error is shown about the index not being created. On searching for the index in Index Management, there is no such index with an uppercase letter in its name, i.e. search-NetworkDrive

Expected behavior

The user should be shown an error message that the index is not created when its name contains uppercase letters, just as the error is shown when creating an index with an uppercase letter via the API.

Screenshots

error while creating index with upper case letter via API
sync done with elastic-ingest
kibana executed with upper case letter in index name NetworkDrive
no index found

Additional context

If the user creates the index via the API, it shows an error message: "type": "invalid_index_name_exception", "reason": "Invalid index name [search-NetworkDrive], must be lowercase"

This is a test

Bug Description

Sean is filing this to test our new github actions. He'll close this momentarily

To Reproduce

Steps to reproduce the behavior:

  1. Go to '...'
  2. Click on '....'
  3. Scroll down to '....'
  4. See error

Expected behavior

Screenshots

Environment

  • OS: [e.g. iOS]
  • Browser [e.g. chrome, safari]
  • Version [e.g. 22]

Additional context

Include Tika in the AWS connector so the SUPPORTED_FILETYPE list can include csv, json and xml files

Problem Description

The new AWS connector connects to S3 - people place standard data file types here, e.g. log.json, table.csv, and old.xml files.
Our currently supported types target programming-language files. S3 isn't the normal place to keep your Python, Ruby, and shell scripts.

Proposed Solution

Since Tika is used throughout Enterprise Search to handle multiple file types, we should use it within the AWS connector so we can support these file types here as well.

Alternatives

The alternative is to pull in the binary representation and use an ingest pipeline (using Tika) to perform the extraction, as in the sketch below.
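
Roughly (illustrative only: the index name, pipeline name, file, and local URL are placeholders, and the attachment processor must be available on the cluster), that alternative could look like:

import base64

import requests

ES_URL = "http://localhost:9200"  # placeholder
HEADERS = {"Content-Type": "application/json"}

# Create a pipeline that runs Tika-based extraction on a base64-encoded "data" field.
requests.put(
    f"{ES_URL}/_ingest/pipeline/s3-attachments",
    headers=HEADERS,
    json={"processors": [{"attachment": {"field": "data"}}]},
).raise_for_status()

# Index a raw file through the pipeline; the extracted text lands in attachment.content.
with open("table.csv", "rb") as f:
    payload = {"data": base64.b64encode(f.read()).decode()}
requests.post(
    f"{ES_URL}/my-s3-index/_doc?pipeline=s3-attachments",
    headers=HEADERS,
    json=payload,
).raise_for_status()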

Additional Context

This is an awesome connector, and the directory one too.

Someone asked about Avro files stored in S3, so I assume the list of requested file types is endless.
For something like Avro, I would assume we should build an ingest pipeline to handle these non-Tika types.

MySQL logging improvement

These messages, logged at debug level:
Next sync for mysql due in -1 seconds
create confusion and should be made more explicit.

Logs should be rephrased to state that the sync is disabled.

The recurring sync keeps executing (every minute) in spite of scheduling enabled being set to FALSE

The recurring sync keeps executing (every minute) in spite of scheduling enabled being set to FALSE.

  1. Keep the scheduling enabled default value as enabled: False and run the connector with sync_now: true.
  2. Execute the command elastic-ingest -c config.yml --action poll and wait for all the documents to be indexed.
  3. Once the execution is completed for the first time, check whether another recurring sync happens.
  4. Validate the recurring sync after the indexing execution of the previous command is completed.

Expected Result:
The recurring sync should not happen, since scheduling enabled is set to false.

Actual Result:
The recurring sync is happening every minute; it is not respecting enabled being set to false. The sync still happens every minute.

Error observed while executing elastic-ingest command with list argument

Steps to reproduce:

  1. Install the connector using pip install command
  2. After successful installation, execute the command elastic-ingest -c config.yml --action list

Actual Result:
Error appears as "Attribute Error:'NoneType' object has no attribute 'strip'"

Expected Result:
List of sources should be displayed on executing elastic-ingest with list argument.

@tarekziade Can you please take a look into it?
MySQL_Error on list argument output

Attaching screenshot for reference.

Sync button in the UI is exhibiting inconsistent behaviour

Bug Description

Once the user sets the schedule and executes the poll command, the first sync completes successfully, but at times the next sync does not get invoked as per the schedule and the user needs to manually click the Sync button in the UI.

Pre-requisites:

  1. Create an index from the UI
  2. Generate the API key and assign privileges to the API key
  3. Configure config.yml file with proper API key and source.py file with appropriate parameter values
  4. Update the package

To Reproduce

Steps to reproduce the behavior:

  1. Execute the indexing using the command elastic-ingest -c config-file --action poll
  2. Set schedule in the UI for every 1 minute and click Sync
  3. Let the first sync complete successfully and notice whether the next sync gets triggered automatically as per the schedule

Expected behavior

The next sync should get triggered automatically as per the schedule

Actual behavior

The next sync does not trigger as per the schedule, and the user needs to manually click the Sync button in the UI to start the sync.
This is observed as inconsistent behaviour.

Environment

Elasticsearch v8.4.1
macOS Monterey v12.5

The elastic-ingest execution remains incomplete when the scheduling enabled is set to FALSE

It is observed that when scheduling enabled is set to False, a second manual execution of the elastic-ingest command does not complete; the user has to terminate the script, as it keeps running without indexing any documents.

Steps:

  1. Set the scheduling enabled: False in kibana.py and run the connector
  2. Execute the command elastic-ingest -c config.yml --action poll and wait for all the documents to be indexed
  3. Once the execution is completed for the first time, terminate the script manually (abruptly killing the cmd service) to ensure the connector service is completely stopped
  4. Run the poll service again with the same config i.e. scheduling enable: False
  5. Wait for the service to index the documents or print proper logs in case the documents are not indexed

Actual Result:
The service continuously keeps running without indexing any documents

Expected Result:
The service should either index the documents or print a proper logger message

What we understand from the scheduling flag in the config is that it decides whether or not to schedule the connector. Setting it to False would mean the connector is not a scheduled run but should still run on demand, i.e. with scheduling set to False, if I want to index documents on demand, I can start the poll service at any point in time and the service should index all the documents present up to that time.
However, this only happens for the first run, i.e. if I try to execute the connector again (after killing the service completely), the documents are not indexed and no proper logs are shown.

@tarekziade Please look into this and confirm the behavior when scheduling enabled is set to False.

Build fails on Python 3.6

The pinned versions of some of the libraries listed in requirements.txt do not support Python 3.6, so setup will fail when running on Python 3.6.

  • motor 3.0.0, aioboto3 9.6.0, and pytest-asyncio 0.19.0 require Python >= 3.7

The setup file says 'Requires Python 3.6 or superior'. I think it is safe to remove support for Python 3.6 as it has reached its EOL.

Suggested change in setup file:

if sys.version_info < (3, 7):
    raise ValueError("Requires Python 3.7 or superior")

Run buildkite on centos 7

Problem Description

We need to support centos 7, let's make sure the service works there

Proposed Solution

A new Dockerfile for CentOS 7 + Python 3.7, used in CI to run our tests, in parallel with the Python image

FROM centos:7

RUN yum update -y
RUN yum groupinstall "Development Tools" -y
RUN yum install openssl-devel libffi-devel bzip2-devel -y
RUN yum install wget -y

# open ssl
RUN wget https://www.openssl.org/source/openssl-3.0.5.tar.gz
RUN yum install perl-IPC-Cmd -y
RUN tar xzf openssl-3.0.5.tar.gz
RUN cd openssl-3.0.5 && ./config --prefix=/usr/local/custom-openssl --libdir=lib --openssldir=/etc/ssl && make -j1 depend && make -j8 && make install_sw

# python 3.10
RUN wget https://www.python.org/ftp/python/3.10.0/Python-3.10.0.tgz
RUN tar xvf Python-3.10.0.tgz
RUN cd Python-3.10.0 && ./configure --with-openssl=/usr/local/custom-openssl --with-openssl-rpath=auto --prefix=/usr/local/custom-openssl && make -j8 && make altinstall
RUN yum install -y python3-pip
RUN pip3 install certifi
RUN python3 -c 'import urllib.request;  print(urllib.request.urlopen("https://python.org/").status)'

Alternatives

Additional Context

Clarify `QueueFull` exception

Problem Description

On QueueFull exception in https://github.com/elastic/connectors-python/blob/main/connectors/utils.py#L292
it's hard to know why we've reached the limit.

Proposed Solution

When we raise the error, let's add more debug info:

  • self._current_memsize
  • item_size
  • refresh_timeout

This will help us understand the size of the queue and the size of the item we are trying to put in it; maybe that particular item is huge and bigger than the queue limit.

We can also surface refresh_timeout.
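
For example, the raise site could put those numbers straight into the exception message (a sketch; the attribute names come from the bullet list above and may not match the code exactly):

import asyncio


class MemQueue:
    """Stripped-down stand-in for the memory-aware queue in connectors/utils.py."""

    def __init__(self, maxmemsize, refresh_timeout):
        self._maxmemsize = maxmemsize
        self._current_memsize = 0
        self.refresh_timeout = refresh_timeout

    def _raise_queue_full(self, item_size):
        # Include everything needed to understand *why* the limit was hit.
        raise asyncio.QueueFull(
            f"Queue max memory of {self._maxmemsize} bytes reached: "
            f"currently holding {self._current_memsize} bytes, "
            f"incoming item is {item_size} bytes, "
            f"gave up after waiting {self.refresh_timeout}s for space to free up"
        )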

Alternatives

Additional Context

Intermittent test failure

def test_next_run():
        assert next_run("1 * * * * *") < 70.0
        assert next_run("* * * * * *") == 0
>       assert next_run("0/5 14,18,3-39,52 * ? JAN,MAR,SEP MON-FRI 2002-2010") > 0


connectors/tests/test_utils.py:8: 

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
connectors/utils.py:93: in next_run
    when = cron_obj.next_trigger()
    connectors/quartz.py:1081: in next_trigger
    self._process_time_unit_queue(overflow, unit_names)
    connectors/quartz.py:1042: in _process_time_unit_queue
    ) = self._get_parser(unit_name)().parse(

connectors/quartz.py:306: in parse
    ) = self._comma_handler(date_pointer, value, _trigger_secondary)

connectors/quartz.py:177: in _comma_handler
    values = sorted([int(val) for val in value.split(",")])
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

0 = <list_iterator object at 0x7f0b004ff250>
>   values = sorted([int(val) for val in value.split(",")])

E   ValueError: invalid literal for int() with base 10: '3-39'


connectors/quartz.py:177: ValueError

Verify.py file has elasticsearch config parameter as `user`, should be `username`

make ftest fails due to an incorrect config parameter name in the verify.py file.

To Reproduce

Steps to reproduce the behavior:

  1. Configure docker and config file from the fixtures
  2. execute the command: make run-stack
  3. install the package: make install
  4. run the ftest: make ftest NAME=<connector_name>

Expected behavior

The ftest should complete, and verify.py should check that more than 10k documents are indexed.

Actual result

make ftest fails with KeyError: 'user' in the verify.py file

Screenshots

image

Additional context

The config.yml has the Elasticsearch parameters as username and password, so username should be used in place of user in verify.py too.
https://github.com/elastic/connectors-python/blob/e3e257d9967f5283097e419421f92abad372e8a5/config.yml#L3

Unable to connect AWS S3

Steps to reproduce:

- Edit config.yml file and configure correct value for elasticsearch host
- Go to connectors/Sources and update configuration values in aws.py file
- Update the package and Run python3 kibana.py --config-file config.yml --service-type s3 --index-name <index_name>
- Execute command elastic-ingest -c config.yml --action poll to index the files in elasticsearch
- Observe the logs [Connection Timeout Error]

Actual Result:

Connect timeout on endpoint URL: "https://169.254.169.254/latest/api/token"

Expected Result:

Sync done: 1 indexed, 0  deleted.

Screenshot 2022-08-24 190552

Misleading information in the count displayed in .elastic-connectors-sync-jobs index

Bug Description
Misleading information in the count displayed in .elastic-connectors-sync-jobs index

Pre-requisite
Go to Elasticsearch v8.4.1 and perform below steps:

  1. Add integration -> Build connector
  2. Provide index name and generate API key
  3. Assign privileges to the API
  4. Set recurring schedule using Sync

Steps to reproduce the behavior:

  1. Edit config.yml file and configure correct value for elasticsearch host and API key
  2. Go to connectors/Sources and edit connector.py file
  3. Update the package and execute command elastic-ingest -c config.yml --action poll to index the file in elasticsearch
  4. Observe the count displayed in the logs and in .elastic-connectors-sync-jobs index in kibana

Expected Result:
Count should be properly displayed in console and in .elastic-connectors-sync-jobs index in kibana

Actual Result:
The count is properly displayed in the console but is misleading in the .elastic-connectors-sync-jobs index in Kibana.
The actual number of documents indexed is shown in deleted_document_count, and indexed_document_count is shown as 0.
Also, after the first sync is completed, from the second sync onwards the count is not updated at all in the .elastic-connectors-sync-jobs index in Kibana, irrespective of any new documents indexed or existing documents deleted from the source.
Screen Shot 2022-09-09 at 2 54 51 PM
Screen Shot 2022-09-09 at 3 07 03 PM
Screen Shot 2022-09-09 at 3 07 25 PM

While indexing data from Network Drive to Elasticsearch, a version_conflict_engine_exception occurs

Bug Description

While indexing data from Network Drive to Elasticsearch, a version_conflict_engine_exception occurs.

To Reproduce

Steps to reproduce the behavior:

  1. Create an Index in Elasticsearch
  2. Index documents by executing the elastic-ingest --action poll -c config.yml command and check for sync completion.
  3. Validate the sync completion process

Expected behavior

The sync should get completed by indexing all the files available in the path of a network drive.

Actual behavior:

The sync is not getting completed and shows a version_conflict_engine_exception.

Screenshots:

Attaching a log file:
https://drive.google.com/file/d/1EEi6QaWAK4boL8z3hcJw4qWqW39r-jm-/view?usp=sharing

Environment

Linux CentOS7

Additional context

We think this might have occurred due to the event loop, as update_by_query takes a snapshot of the index before updating it, which conflicts with the document version in Elasticsearch.
