oliver006 / elasticsearch-test-data
Generate and upload test data to Elasticsearch for performance and load testing
License: MIT License
If people use this tool against an insecure https endpoint whose cert cannot be verified, they will get an error message similar to:
SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed ....
Tornado supports a validate_cert parameter in tornado.httpclient.HTTPRequest, but this tool doesn't expose it.
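Tornado's HTTPRequest does accept validate_cert=False; until the tool exposes it, the same idea can be sketched with a standard-library SSL context that skips verification (the commented fetch URL is illustrative only, and this should only ever be used against test endpoints):

```python
import ssl

# Build a context that skips certificate verification -- the same effect
# as Tornado's validate_cert=False. Test endpoints only.
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

# urllib.request.urlopen("https://localhost:9200", context=ctx)  # illustrative
```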
It is a good tool for generating bulk Elasticsearch test data; however, the parameters passed are not handled correctly, and the parameter names shown in the --help output are not consistent with the variables used in the script (- in the help output vs _ in the script), for example index-name vs index_name.
When specifying index_name, it still creates the index as test_data.
When I try to upload data to a live cluster, I get the following error:
RuntimeError: Cannot run the event loop while another loop is running
Please see attached doc for output.
Python_Errors.txt
Hi, I'm running this on localhost; the ES server is a Vagrant box running locally, and I'm forwarding port 9200 from the guest to 9201 on the host. I'd like to get this working, see the errors below. Any help would be appreciated. Thanks
python3 es_test_data.py --es_url=http://127.0.0.1:9201
[I 210612 04:39:13 es_test_data:55] Trying to create index http://127.0.0.1:9201/test_data
[I 210612 04:39:13 es_test_data:60] Looks like the index exists already
[I 210612 04:39:13 es_test_data:228] Generating 100000 docs, upload batch size is 1000
[I 210612 04:39:13 es_test_data:81] Upload: FAILED - upload took: 152ms, total docs uploaded: 1000
[I 210612 04:39:13 es_test_data:81] Upload: FAILED - upload took: 5ms, total docs uploaded: 2000
[I 210612 04:39:13 es_test_data:81] Upload: FAILED - upload took: 42ms, total docs uploaded: 3000
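The log only says FAILED without a reason. A minimal sketch, assuming the bulk response body is available in upload_batch, of surfacing the first per-item error so the root cause becomes visible (the response body below is illustrative, not taken from this run):

```python
import json

# Illustrative bulk-API response body; a real one comes from the POST /_bulk call
response_body = b'{"errors": true, "items": [{"index": {"status": 400, "error": {"type": "mapper_parsing_exception"}}}]}'

result = json.loads(response_body.decode("utf-8"))
first_error = None
if result["errors"]:
    # Pull the error type of the first item that failed
    first_error = next(item["index"]["error"]["type"]
                       for item in result["items"]
                       if item["index"].get("status", 200) >= 300)
```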
Running es_test_data.py under Python 3.7.3 and tornado==6.0.3 (the version is not specified in requirements.txt) throws the following error:
$ python es_test_data.py --es_url=http://localhost:9200 --count=100 --index_name=fluentd-test-data --num_of_shards=10 --num_of_replicas=1
[I 190709 17:04:16 es_test_data:46] Trying to create index http://localhost:9200/fluentd-test-data
Traceback (most recent call last):
File "es_test_data.py", line 276, in <module>
tornado.ioloop.IOLoop.instance().run_sync(generate_test_data)
File "/Users/me/.pyenv/versions/es-test-data-py37/lib/python3.7/site-packages/tornado/ioloop.py", line 532, in run_sync
return future_cell[0].result()
File "/Users/me/.pyenv/versions/es-test-data-py37/lib/python3.7/site-packages/tornado/gen.py", line 209, in wrapper
yielded = next(result)
File "es_test_data.py", line 193, in generate_test_data
create_index(tornado.options.options.index_name)
File "es_test_data.py", line 48, in create_index
response = tornado.httpclient.HTTPClient().fetch(request)
File "/Users/me/.pyenv/versions/es-test-data-py37/lib/python3.7/site-packages/tornado/httpclient.py", line 107, in __init__
self._async_client = self._io_loop.run_sync(make_client)
File "/Users/me/.pyenv/versions/es-test-data-py37/lib/python3.7/site-packages/tornado/ioloop.py", line 526, in run_sync
self.start()
File "/Users/me/.pyenv/versions/es-test-data-py37/lib/python3.7/site-packages/tornado/platform/asyncio.py", line 148, in start
self.asyncio_loop.run_forever()
File "/Users/me/.pyenv/versions/3.7.3/lib/python3.7/asyncio/base_events.py", line 529, in run_forever
'Cannot run the event loop while another loop is running')
RuntimeError: Cannot run the event loop while another loop is running
Installing tornado==4.5.3 solves the issue. I found the workaround here: jupyter/notebook#3397
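The root cause can be reproduced without Tornado at all: Tornado 5+ runs on asyncio, and the synchronous HTTPClient spins up a second event loop inside the already-running one, which asyncio forbids. A minimal sketch:

```python
import asyncio

async def inner():
    # Starting a second loop from inside a running one is what Tornado 6's
    # synchronous HTTPClient effectively does here, and asyncio rejects it.
    loop = asyncio.new_event_loop()
    try:
        loop.run_forever()
    except RuntimeError as exc:
        return str(exc)
    finally:
        loop.close()

msg = asyncio.run(inner())  # "Cannot run the event loop while another loop is running"
```

Pinning tornado==4.5.3 avoids this because Tornado 4 still uses its own IOLoop rather than asyncio.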
Nice work! At an initial glance, the first couple of features I think would be great are:
1. The ability to provide a JSON file to describe the format for test messages ...I realise of course that this would require all data types (including nested objects) to be parseable.
2. The ability to specify different dicts and tie different field names to those dicts.
3. Generating (separate or combined) random street address data... although this might be covered by feature (2) above.
Hi,
I'm trying to override the index config defaults using the below:
docker run --rm -it --network host oliver006/es-test-data --es_url=http://... --batch_size=1000 --num_of_shards=1 --num_of_replicas=2 --index_name=test_data4
However, it seems to ignore the options for shards & replicas:
curl http://.../test_data4/ | jq
{
  "test_data4": {
    "aliases": {},
    "mappings": {
      "properties": {
        "age": {
          "type": "long"
        },
        "last_updated": {
          "type": "long"
        },
        "name": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        }
      }
    },
    "settings": {
      "index": {
        "creation_date": "1621856375498",
        "number_of_shards": "1",
        "number_of_replicas": "1",
        "uuid": "_LBis-bSTmOqmEIi2lnJNQ",
        "version": {
          "created": "7040199"
        },
        "provided_name": "test_data4"
      }
    }
  }
}
Can you confirm that this is working? Am I missing something?
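For those options to take effect, the index-creation PUT has to carry them in its body. A minimal sketch of the settings payload create_index would need to send (the values match the docker run command above):

```python
import json

# Settings payload for the index-creation PUT; shard/replica counts
# match the command above
body = json.dumps({
    "settings": {
        "index": {
            "number_of_shards": 1,
            "number_of_replicas": 2,
        }
    }
})
# PUT http://.../test_data4 with this body and Content-Type: application/json
```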
How to reproduce: run with the --force_delete_index=True option. A 400 Bad Request error is returned by Elasticsearch and the index is not deleted.
Fix: remove ?refresh=true from the url variable in the delete_index function:
- url = "%s/%s?refresh=true" % (tornado.options.options.es_url, idx_name)
+ url = "%s/%s" % (tornado.options.options.es_url, idx_name)
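As a function, the fixed URL construction looks like this; Elasticsearch's delete-index API rejects the refresh query parameter, which is what produced the 400 Bad Request:

```python
def delete_index_url(es_url, idx_name):
    # No "?refresh=true": DELETE /<index> does not accept a refresh parameter
    return "%s/%s" % (es_url, idx_name)

# delete_index_url("http://localhost:9200", "test_data")
# -> "http://localhost:9200/test_data"
```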
I have a local Elasticsearch 2.3.2 instance and am using Python 3.5...
I ran the following:
python es_test_data.py --es_url=http://localhost:9200
Here's the output with exception:
[I 160515 23:59:36 es_test_data:47] Trying to create index http://localhost:9200/test_data
[I 160515 23:59:37 es_test_data:50] Creating index "test_data" done b'{"acknowledged":true}'
[I 160515 23:59:37 es_test_data:217] Generating 10000 docs, upload batch size is 1000
Traceback (most recent call last):
File "es_test_data.py", line 274, in <module>
tornado.ioloop.IOLoop.instance().run_sync(generate_test_data)
File "C:\Python35\lib\site-packages\tornado\ioloop.py", line 453, in run_sync
return future_cell[0].result()
File "C:\Python35\lib\site-packages\tornado\concurrent.py", line 232, in result
raise_exc_info(self._exc_info)
File "<string>", line 3, in raise_exc_info
File "C:\Python35\lib\site-packages\tornado\gen.py", line 1014, in run
yielded = self.gen.throw(*exc_info)
File "es_test_data.py", line 235, in generate_test_data
yield upload_batch(upload_data_txt)
File "C:\Python35\lib\site-packages\tornado\gen.py", line 1008, in run
value = future.result()
File "C:\Python35\lib\site-packages\tornado\concurrent.py", line 232, in result
raise_exc_info(self._exc_info)
File "<string>", line 3, in raise_exc_info
File "C:\Python35\lib\site-packages\tornado\gen.py", line 1017, in run
yielded = self.gen.send(value)
File "es_test_data.py", line 68, in upload_batch
result = json.loads(response.body)
File "C:\Python35\lib\json\__init__.py", line 312, in loads
s.__class__.__name__))
TypeError: the JSON object must be str, not 'bytes'
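Python 3.5's json.loads() accepts only str (bytes became acceptable in 3.6), so the fix is to decode Tornado's response body first. A minimal sketch of the change in upload_batch (the response body is illustrative):

```python
import json

response_body = b'{"errors": false, "took": 5}'  # illustrative bulk response

# Decode before parsing; on Python 3.5, json.loads(bytes) raises TypeError
result = json.loads(response_body.decode("utf-8"))
```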
I got the following error when I ran this command:
python es_test_data.py --es_url=http://localhost:9200 --index_name=test
The error:
[I 200622 08:48:49 es_test_data:52] Trying to create index http://localhost:9200/test
Traceback (most recent call last):
File "es_test_data.py", line 281, in <module>
tornado.ioloop.IOLoop.instance().run_sync(generate_test_data)
File "C:\Program Files (x86)\Python\Python38-32\lib\site-packages\tornado\ioloop.py", line 532, in run_sync
return future_cell[0].result()
File "C:\Program Files (x86)\Python\Python38-32\lib\site-packages\tornado\gen.py", line 209, in wrapper
yielded = next(result)
File "es_test_data.py", line 199, in generate_test_data
create_index(tornado.options.options.index_name)
File "es_test_data.py", line 54, in create_index
response = tornado.httpclient.HTTPClient().fetch(request)
File "C:\Program Files (x86)\Python\Python38-32\lib\site-packages\tornado\httpclient.py", line 107, in __init__
self._async_client = self._io_loop.run_sync(make_client)
File "C:\Program Files (x86)\Python\Python38-32\lib\site-packages\tornado\ioloop.py", line 526, in run_sync
self.start()
File "C:\Program Files (x86)\Python\Python38-32\lib\site-packages\tornado\platform\asyncio.py", line 149, in start
self.asyncio_loop.run_forever()
File "C:\Program Files (x86)\Python\Python38-32\lib\asyncio\windows_events.py", line 316, in run_forever
super().run_forever()
File "C:\Program Files (x86)\Python\Python38-32\lib\asyncio\base_events.py", line 560, in run_forever
self._check_running()
File "C:\Program Files (x86)\Python\Python38-32\lib\asyncio\base_events.py", line 554, in _check_running
raise RuntimeError(
RuntimeError: Cannot run the event loop while another loop is running
@oliver006 Thank you for this great script, it works perfectly.
Just thought you might want to use this Dockerfile that I've created. Please note that I'm copying from tests/es_test_data, so you'll need to remove the tests prefix if you build it from this repository.
DockerHub unfor19/es-test-data
Size unpacked: 47.9MB
Size @ DockerHub: 17.21MB
Example:
$ docker-compose up -d # docker-compose.yml at the bottom
$ docker run --rm -it --network host unfor19/es-test-data \
--es_url=http://localhost:9200 \
--batch_size=10000 \
--username=elastic \
--password="esbackup-password"
[I 210317 22:22:19 es_test_data:54] Trying to create index http://localhost:9200/test_data
[I 210317 22:22:19 es_test_data:61] Looks like the index exists already
[I 210317 22:22:19 es_test_data:238] Generating 100000 docs, upload batch size is 10000
[I 210317 22:22:20 es_test_data:82] Upload: OK - upload took: 731ms, total docs uploaded: 10000
[I 210317 22:22:21 es_test_data:82] Upload: OK - upload took: 749ms, total docs uploaded: 20000
[I 210317 22:22:22 es_test_data:82] Upload: OK - upload took: 679ms, total docs uploaded: 30000
[I 210317 22:22:24 es_test_data:82] Upload: OK - upload took: 729ms, total docs uploaded: 40000
[I 210317 22:22:25 es_test_data:82] Upload: OK - upload took: 691ms, total docs uploaded: 50000
[I 210317 22:22:26 es_test_data:82] Upload: OK - upload took: 729ms, total docs uploaded: 60000
[I 210317 22:22:27 es_test_data:82] Upload: OK - upload took: 766ms, total docs uploaded: 70000
[I 210317 22:22:28 es_test_data:82] Upload: OK - upload took: 698ms, total docs uploaded: 80000
[I 210317 22:22:30 es_test_data:82] Upload: OK - upload took: 709ms, total docs uploaded: 90000
[I 210317 22:22:31 es_test_data:82] Upload: OK - upload took: 705ms, total docs uploaded: 100000
[I 210317 22:22:31 es_test_data:272] Done - total docs uploaded: 100000, took 12 seconds
Dockerfile
### --------------------------------------------------------------------
### Docker Build Arguments
### Available only during Docker build - `docker build --build-arg ...`
### --------------------------------------------------------------------
ARG DEBIAN_VERSION="buster"
ARG ALPINE_VERSION="3.12"
ARG PYTHON_VERSION="3.9.1"
ARG APP_NAME="tornado"
ARG APP_VERSION="4.5.3"
ARG APP_PYTHON_USERBASE="/app"
ARG APP_USER_NAME="appuser"
ARG APP_USER_ID="1000"
ARG APP_GROUP_NAME="appgroup"
ARG APP_GROUP_ID="1000"
# Reminder- the ENTRYPOINT is hardcoded so make sure you change it (remove this comment afterwards)
### --------------------------------------------------------------------
### --------------------------------------------------------------------
### Build Stage
### --------------------------------------------------------------------
FROM python:"$PYTHON_VERSION"-slim-"${DEBIAN_VERSION}" as build
ARG APP_PYTHON_USERBASE
ARG APP_VERSION
ARG APP_NAME
# Define env vars
ENV PIP_DISABLE_PIP_VERSION_CHECK=1 \
PYTHONUSERBASE="$APP_PYTHON_USERBASE" \
PATH="${APP_PYTHON_USERBASE}/bin:${PATH}"
# Upgrade pip and then install build tools
RUN pip install --upgrade pip && \
    pip install --upgrade setuptools wheel
# Define workdir
WORKDIR "$APP_PYTHON_USERBASE"
# Install the app
RUN pip install --ignore-installed --no-warn-script-location --prefix="/dist" "$APP_NAME"=="$APP_VERSION"
WORKDIR /dist/
COPY tests/es_test_data.py .
# For debugging the Build Stage
CMD ["bash"]
### --------------------------------------------------------------------
### --------------------------------------------------------------------
### App Stage
### --------------------------------------------------------------------
FROM python:"$PYTHON_VERSION"-alpine"${ALPINE_VERSION}" as app
# Fetch values from ARGs that were declared at the top of this file
ARG APP_NAME
ARG APP_PYTHON_USERBASE
ARG APP_USER_ID
ARG APP_USER_NAME
ARG APP_GROUP_ID
ARG APP_GROUP_NAME
# Define env vars
ENV HOME="$APP_PYTHON_USERBASE" \
PYTHONUSERBASE="$APP_PYTHON_USERBASE" \
APP_NAME="$APP_NAME" \
PYTHONUNBUFFERED=0
ENV PATH="${PYTHONUSERBASE}/bin:${PATH}"
# Define workdir
WORKDIR "$PYTHONUSERBASE"
# Run as a non-root user
RUN \
addgroup -g "${APP_GROUP_ID}" "${APP_GROUP_NAME}" && \
adduser -H -D -u "$APP_USER_ID" -G "$APP_GROUP_NAME" "$APP_USER_NAME" && \
chown -R "$APP_USER_ID":"$APP_GROUP_ID" "$PYTHONUSERBASE"
USER "$APP_USER_NAME"
# Copy artifacts from Build Stage
COPY --from=build --chown="$APP_USER_NAME":"$APP_GROUP_ID" /dist/ "$PYTHONUSERBASE"/
# The container runs the application, or any other supplied command, such as "bash" or "echo hello"
# CMD python -m ${APP_NAME}
# Use ENTRYPOINT instead of CMD to force the container to start the application
ENTRYPOINT ["python", "es_test_data.py"]
docker-compose.yml
version: "3.7"
### ------------------------------------------------------------------
### Variables
### ------------------------------------------------------------------
x-variables:
  exposed-port: &exposed-port 9200
  es-base: &es-base
    image: docker.elastic.co/elasticsearch/elasticsearch:7.9.1
    ulimits:
      memlock:
        soft: -1
        hard: -1
    networks:
      - elastic
  data-path: &data-path /usr/share/elasticsearch/data
  snapshots-repository-path: &snapshots-repository-path /usr/share/elasticsearch/backup
  volume-snapshots-repository: &volume-snapshots-repository
    - type: volume
      source: snapshots-repository
      target: *snapshots-repository-path
  services-es-env: &es-env-base
    "cluster.name": "es-docker-cluster"
    "cluster.initial_master_nodes": "es01,es02"
    "bootstrap.memory_lock": "true"
    "ES_JAVA_OPTS": "-Xms512m -Xmx512m"
    "ELASTIC_PASSWORD": "esbackup-password"
    "xpack.security.enabled": "true"
    "path.repo": *snapshots-repository-path
### ------------------------------------------------------------------
services:
  es01: # master
    <<: *es-base
    container_name: es01
    environment:
      <<: *es-env-base
      node.name: es01
      discovery.seed_hosts: es02
    volumes:
      - <<: *volume-snapshots-repository
      - type: volume
        source: data01
        target: *data-path
    ports:
      - published: *exposed-port
        target: 9200
        protocol: tcp
        mode: host
  es02:
    <<: *es-base
    container_name: es02
    environment:
      <<: *es-env-base
      node.name: es02
      discovery.seed_hosts: es01
    volumes:
      - <<: *volume-snapshots-repository
      - type: volume
        source: data02
        target: *data-path
volumes:
  data01:
    driver: local
  data02:
    driver: local
  snapshots-repository:
    driver: local
networks:
  elastic:
    driver: bridge
Is it possible to reduce CPU usage by using predefined strings in memory as field values instead of generating random strings each time? The reason for this request is that I observed 100% CPU utilization when running this tool; each random string generation consumes CPU cycles. Further, as this is a single-threaded script, it does not make use of the available CPUs on multicore nodes, so I am not able to fully stress the Elasticsearch nodes. When the single thread's CPU utilization reaches 100%, indexing latency increases even though CPU, load, memory, and IOPS are not a bottleneck on the ES node. Can the script use a multi-threading option?
In addition to just inserts, an option for updates together with search queries could make it even better at simulating realistic cases.
[root@es01 elasticsearch-test-data-master]# python es_test_data.py --es_url=http://10.10.1.63:39201 --username=elastic --password=elastic --count=10000 --num_of_shards=5 --num_of_replicas=2 --batch_size=1000 --index_name=index03
[I 220905 21:34:08 es_test_data:56] Trying to create index http://10.10.1.63:39201/index03
Traceback (most recent call last):
File "es_test_data.py", line 319, in <module>
tornado.ioloop.IOLoop.instance().run_sync(generate_test_data)
File "/usr/local/lib64/python3.6/site-packages/tornado/ioloop.py", line 530, in run_sync
return future_cell[0].result()
File "/usr/lib64/python3.6/asyncio/futures.py", line 243, in result
raise self._exception
File "/usr/local/lib64/python3.6/site-packages/tornado/gen.py", line 234, in wrapper
yielded = ctx_run(next, result)
File "/usr/local/lib64/python3.6/site-packages/tornado/gen.py", line 162, in _fake_ctx_run
return f(*args, **kw)
File "es_test_data.py", line 215, in generate_test_data
create_index(tornado.options.options.index_name)
File "es_test_data.py", line 58, in create_index
response = tornado.httpclient.HTTPClient().fetch(request)
File "/usr/local/lib64/python3.6/site-packages/tornado/httpclient.py", line 135, in fetch
functools.partial(self._async_client.fetch, request, **kwargs)
AttributeError: 'Task' object has no attribute 'fetch'
Exception ignored in: <bound method HTTPClient.__del__ of <tornado.httpclient.HTTPClient object at 0x7f82db854160>>
Traceback (most recent call last):
File "/usr/local/lib64/python3.6/site-packages/tornado/httpclient.py", line 113, in __del__
File "/usr/local/lib64/python3.6/site-packages/tornado/httpclient.py", line 118, in close
AttributeError: 'Task' object has no attribute 'close'
It seems that you can't specify field names with a '.' in them,
e.g.:
$ ... '--format=sourceSystem:text,sourceSystemCustomerId:text,tenant:text,firstName:text,lastName:text,email.address'
[I 190521 14:58:02 es_test_data:46] Trying to create index ......
[I 190521 14:58:02 es_test_data:51] Looks like the index exists already
[I 190521 14:58:02 es_test_data:220] Generating 1 docs, upload batch size is 1000
Traceback (most recent call last):
File "es_test_data.py", line 276, in <module>
tornado.ioloop.IOLoop.instance().run_sync(generate_test_data)
File "/home/ubuntu/.local/lib/python2.7/site-packages/tornado/ioloop.py", line 576, in run_sync
return future_cell[0].result()
File "/home/ubuntu/.local/lib/python2.7/site-packages/tornado/concurrent.py", line 261, in result
raise_exc_info(self._exc_info)
File "/home/ubuntu/.local/lib/python2.7/site-packages/tornado/gen.py", line 326, in wrapper
yielded = next(result)
File "es_test_data.py", line 223, in generate_test_data
item = generate_random_doc(format)
File "es_test_data.py", line 155, in generate_random_doc
f_key, f_val = get_data_for_format(f)
File "es_test_data.py", line 81, in get_data_for_format
field_type = split_f[1]
IndexError: list index out of range
while this works
$ ... '--format=sourceSystem:text,sourceSystemCustomerId:text,tenant:text,firstName:text,lastName:text'
To be fair, I am a CS rookie.
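For what it's worth, the traceback points at the trailing email.address spec, which has no :type suffix, so split_f[1] is out of range. A hypothetical parser (parse_format is not a function in the script) that splits on the first ':' only, which also keeps dots in field names safe:

```python
def parse_format(fmt):
    """Parse a --format string like "name:text,age:int" into (field, type) pairs."""
    fields = []
    for spec in fmt.split(","):
        parts = spec.split(":", 1)  # split once, so dots in names are fine
        if len(parts) != 2:
            # e.g. a bare "email.address" with no type
            raise ValueError("field spec %r is missing ':<type>'" % spec)
        fields.append(tuple(parts))
    return fields
```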
[E 220930 18:28:48 es_test_data:76] upload failed, error: Stream closed
[E 220930 18:28:48 es_test_data:76] upload failed, error: Stream closed
[E 220930 18:28:48 es_test_data:76] upload failed, error: Stream closed
[E 220930 18:28:48 es_test_data:76] upload failed, error: Stream closed
[E 220930 18:28:48 es_test_data:76] upload failed, error: Stream closed
[E 220930 18:28:48 es_test_data:76] upload failed, error: Stream closed
Help would be appreciated.
root@DESKTOP-71H61H4:~/elasticsearch-test-data# python3 es_test_data.py --es_url=http://localhost:9200
[I 210806 10:48:07 es_test_data:55] Trying to create index http://localhost:9200/test_data
Traceback (most recent call last):
File "es_test_data.py", line 284, in <module>
tornado.ioloop.IOLoop.instance().run_sync(generate_test_data)
File "/usr/local/lib/python3.6/dist-packages/tornado/ioloop.py", line 530, in run_sync
return future_cell[0].result()
File "/usr/lib/python3.6/asyncio/futures.py", line 243, in result
raise self._exception
File "/usr/local/lib/python3.6/dist-packages/tornado/gen.py", line 234, in wrapper
yielded = ctx_run(next, result)
File "/usr/local/lib/python3.6/dist-packages/tornado/gen.py", line 162, in _fake_ctx_run
return f(*args, **kw)
File "es_test_data.py", line 202, in generate_test_data
create_index(tornado.options.options.index_name)
File "es_test_data.py", line 57, in create_index
response = tornado.httpclient.HTTPClient().fetch(request)
File "/usr/local/lib/python3.6/dist-packages/tornado/httpclient.py", line 135, in fetch
functools.partial(self._async_client.fetch, request, **kwargs)
AttributeError: 'Task' object has no attribute 'fetch'
Exception ignored in: <bound method HTTPClient.__del__ of <tornado.httpclient.HTTPClient object at 0x7f0902ab3198>>
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/tornado/httpclient.py", line 113, in __del__
File "/usr/local/lib/python3.6/dist-packages/tornado/httpclient.py", line 118, in close
AttributeError: 'Task' object has no attribute 'close'
Hello, this seems like a very useful tool for testing indexing performance, but I really cannot figure out how to use it over https.
Is there an option like curl's "--insecure"? I tried validate_cert=false with no success. Thanks