nikepan / clickhouse-bulk
Collects many small inserts to ClickHouse and sends them as big inserts.
License: Apache License 2.0
Hi,
In the NewClickhouse method, the ConnectTimeout option is implemented with the wrong behavior:

c.ConnectTimeout = connectTimeout
if c.ConnectTimeout > 0 {
	c.ConnectTimeout = 10
}

So if I set any positive value for ConnectTimeout, it is not used but overwritten with 10 seconds.
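A minimal sketch of the presumably intended logic (the struct and constructor shape here are assumptions, not the project's actual code): honor a caller-supplied positive timeout and only fall back to 10 seconds when none is given.

```go
package main

import "fmt"

// Clickhouse is a stand-in for the real struct; only the field
// relevant to this report is included.
type Clickhouse struct {
	ConnectTimeout int // seconds
}

// NewClickhouse keeps a caller-supplied positive timeout and applies
// the 10-second default only when no value (or a non-positive one) is given.
func NewClickhouse(connectTimeout int) *Clickhouse {
	c := &Clickhouse{ConnectTimeout: connectTimeout}
	if c.ConnectTimeout <= 0 { // the reported code checked > 0, overwriting valid values
		c.ConnectTimeout = 10
	}
	return c
}

func main() {
	fmt.Println(NewClickhouse(3).ConnectTimeout) // caller value preserved
	fmt.Println(NewClickhouse(0).ConnectTimeout) // default applied
}
```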
Hi, first of all, thank you for making clickhouse-bulk 💐
I am running with this config
{
"listen": ":8125",
"flush_count": 10000,
"flush_interval": 3000,
"debug": true,
"dump_dir": "dumps",
"clickhouse": {
"down_timeout": 300,
"servers": [
"http://127.0.0.1:8123"
]
}
}
Shouldn't this config collect incoming requests and insert them in bulk every 3 seconds? I am watching the logs and seeing every insert I send via HTTP (using Python's requests library) being processed immediately as I send it. What am I doing wrong?
After an error occurs, sending data keeps failing until the service is restarted. Is this a bug?
Hi!
Right now, on one of the instances there are 197 files in dumps directory, they contain seemingly failed queries and it looks like this:
-rw-r--r-- 1 app app 5130 Nov 26 13:21 dump202311241214102-98-500.dmp
-rw-r--r-- 1 app app 12358 Nov 26 13:21 dump202311241214102-99-500.dmp
...
app@clickhouse-bulk-576cb9c658-h2dvx $ ls -1 /app/dumps | wc -l
197
What does clickhouse-bulk do with all these files, should it remove them after resending?
Hey there,
To install clickhouse-bulk on our server I added a systemd service for it. I just noticed that we frequently get issues with our journal log filling up the whole disk space.
I've tried to reconfigure systemd-journal to limit disk usage, but as the clickhouse-bulk log spits out so much data we miss out a lot of data then (e.g. log file is growing >100mb per hour)
Could we make the "sending x rows" and "sent x rows" messages configurable so I can deactivate them and only log warn/error messages?
Thanks in advance
Under load, the following log appears periodically:
clickhouse-bulk_1 | 2021/03/05 11:03:10.847398 ERROR: Send (503) No working clickhouse servers; response
clickhouse-bulk_1 | 2021/03/05 11:03:10.847752 ERROR: Send (503) No working clickhouse servers; response
clickhouse-bulk_1 | 2021/03/05 11:03:10.847858 ERROR: Send (503) No working clickhouse servers; response
clickhouse-bulk_1 | 2021/03/05 11:03:10.847954 ERROR: Send (503) No working clickhouse servers; response
clickhouse-bulk_1 | 2021/03/05 11:03:10.848023 ERROR: Send (503) No working clickhouse servers; response
clickhouse-bulk_1 | 2021/03/05 11:03:10.848081 ERROR: Send (503) No working clickhouse servers; response
clickhouse-bulk_1 | 2021/03/05 11:03:10.848224 ERROR: Send (503) No working clickhouse servers; response
clickhouse-bulk_1 | 2021/03/05 11:03:10.848425 ERROR: Send (503) No working clickhouse servers; response
clickhouse-bulk_1 | 2021/03/05 11:03:10.848488 ERROR: Send (503) No working clickhouse servers; response
clickhouse-bulk_1 | 2021/03/05 11:03:10.848615 ERROR: Send (503) No working clickhouse servers; response
clickhouse-bulk_1 | 2021/03/05 11:03:10.848839 ERROR: Send (503) No working clickhouse servers; response
clickhouse-bulk_1 | 2021/03/05 11:03:10.848950 ERROR: Send (503) No working clickhouse servers; response
clickhouse-bulk_1 | 2021/03/05 11:03:10.849238 ERROR: Send (503) No working clickhouse servers; response
clickhouse-bulk_1 | 2021/03/05 11:03:10.849771 ERROR: Send (503) No working clickhouse servers; response
clickhouse-bulk_1 | 2021/03/05 11:03:10.850151 ERROR: Send (503) No working clickhouse servers; response
clickhouse-bulk_1 | 2021/03/05 11:03:10.850361 ERROR: Send (503) No working clickhouse servers; response
clickhouse-bulk_1 | 2021/03/05 11:03:10.850426 ERROR: Send (503) No working clickhouse servers; response
clickhouse-bulk_1 | 2021/03/05 11:03:10.850513 ERROR: Send (503) No working clickhouse servers; response
clickhouse-bulk_1 | 2021/03/05 11:03:10.850565 ERROR: Send (503) No working clickhouse servers; response
clickhouse-bulk_1 | 2021/03/05 11:03:10.850753 ERROR: Send (503) No working clickhouse servers; response
clickhouse-bulk_1 | 2021/03/05 11:04:32.243796 INFO: sending 26 rows to http://default:root@11111111:8123 of INSERT INTO lkdn_profiles.employees (
clickhouse-bulk_1 | 2021/03/05 11:04:42.244128 ERROR: server down (502): Post http://default:***@11111111:8123: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
clickhouse-bulk_1 | 2021/03/05 11:04:42.244156 INFO: sending 26 rows to http://default:root@11111111:8123 of INSERT INTO lkdn_profiles.employees (
clickhouse-bulk_1 | 2021/03/05 11:04:52.244517 ERROR: server down (502): Post http://default:***@11111111:8123: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
clickhouse-bulk_1 | 2021/03/05 11:04:52.244552 INFO: sending 26 rows to http://default:root@11111111:8123 of INSERT INTO lkdn_profiles.employees (
clickhouse-bulk_1 | 2021/03/05 11:05:02.244919 ERROR: server down (502): Post http://default:***@11111111:8123: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
clickhouse-bulk_1 | 2021/03/05 11:05:02.244950 INFO: sending 26 rows to http://default:root@11111111:8123 of INSERT INTO lkdn_profiles.employees (
clickhouse-bulk_1 | 2021/03/05 11:05:12.245236 ERROR: server down (502): Post http://default:***@11111111:8123: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
clickhouse-bulk_1 | 2021/03/05 11:05:12.245261 INFO: sending 26 rows to http://default:root@11111111:8123 of INSERT INTO lkdn_profiles.employees (
clickhouse-bulk_1 | 2021/03/05 11:05:22.245596 ERROR: server down (502): Post http://default:***@11111111:8123: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
clickhouse-bulk_1 | 2021/03/05 11:05:22.245626 ERROR: server error (503) No working clickhouse servers
but at the same time the ClickHouse server itself is alive:
echo 'SELECT 1' | curl 'http://default:root@1111:8123/' --data-binary @-
1
Periodically I get either a "database not found" error or an auth error.
-- auto-generated definition
create table test
(
id Int32,
name String
)
engine = MergeTree PARTITION BY id
PRIMARY KEY id
ORDER BY (id, name)
SETTINGS index_granularity = 8192;
⇨ http server started on [::]:8124
2021/09/10 21:26:10.503957 DEBUG: query INSERT INTO gc.test (id, name) VALUES (7, 'xcvbx')
2021/09/10 21:26:11.506905 INFO: sending 1 rows to http://10.0.10.141:8123 of INSERT INTO gc.test (id, name) VALUES
2021/09/10 21:26:11.521327 INFO: sent 1 rows to http://10.0.10.141:8123 of INSERT INTO gc.test (id, name) VALUES
2021/09/10 21:26:16.768973 DEBUG: query INSERT INTO gc.test (id, name) VALUES (8, 'xcvbx')
2021/09/10 21:26:17.504073 INFO: sending 1 rows to http://10.0.10.142:8123 of INSERT INTO gc.test (id, name) VALUES
2021/09/10 21:26:17.517043 INFO: sent 1 rows to http://10.0.10.142:8123 of INSERT INTO gc.test (id, name) VALUES
2021/09/10 21:26:17.517161 ERROR: Send (500) Wrong server status 500:
response: Code: 516, e.displayText() = DB::Exception: chtdidx: Authentication failed: password is incorrect or there is no user with such name (version 21.2.2.8 (official build))
request: "INSERT INTO gc.test (id, name) VALUES\n(8, 'xcvbx')"; response Code: 516, e.displayText() = DB::Exception: chtdidx: Authentication failed: password is incorrect or there is no user with such name (version 21.2.2.8 (official build))
2021/09/10 21:26:19.228692 DEBUG: query INSERT INTO gc.test (id, name) VALUES (8, 'xcvbx')
2021/09/10 21:26:19.508245 INFO: sending 1 rows to http://10.0.10.143:8123 of INSERT INTO gc.test (id, name) VALUES
2021/09/10 21:26:19.522896 INFO: sent 1 rows to http://10.0.10.143:8123 of INSERT INTO gc.test (id, name) VALUES
2021/09/10 21:26:19.523012 ERROR: Send (500) Wrong server status 500:
response: Code: 516, e.displayText() = DB::Exception: chtdidx: Authentication failed: password is incorrect or there is no user with such name (version 21.2.2.8 (official build))
request: "INSERT INTO gc.test (id, name) VALUES\n(8, 'xcvbx')"; response Code: 516, e.displayText() = DB::Exception: chtdidx: Authentication failed: password is incorrect or there is no user with such name (version 21.2.2.8 (official build))
2021/09/10 21:26:22.539692 DEBUG: query INSERT INTO gc.test (id, name) VALUES (8, 'xcvbx')
2021/09/10 21:26:23.503982 INFO: sending 1 rows to http://10.0.10.141:8123 of INSERT INTO gc.test (id, name) VALUES
2021/09/10 21:26:23.531772 INFO: sent 1 rows to http://10.0.10.141:8123 of INSERT INTO gc.test (id, name) VALUES
ParseQuery in collector.go incorrectly handles the case where there are params both before and after the query:
Line 315 in b34e73c
Depending on the length of the params, it might produce incorrect results or panic with: runtime error: slice bounds out of range.
Instead of
if eoq >= 0 {
	q = queryString[i+6 : eoq+6]
	params = queryString[:i] + queryString[eoq+7:]
}
It should be:
if eoq >= 0 {
	q = queryString[i+6 : i+eoq+6]
	params = queryString[:i] + queryString[i+eoq+7:]
}
Example of problematic string:
queryString = "a=11111111111111111111111111111&query=insert into x format fmt&a=1"
Currently all logs are stored in syslog, which is not very useful. Please add an option to choose where logs go (a file, syslog, or no logging at all) and make the log level configurable (for example, I don't want to store logs like 'clickhouse-bulk[40680]: 2020/08/17 13:58:01 INFO: send 0 rows to ...').
Hello. I wrote a simple test script in Go:
package main

import (
	"fmt"
	"io/ioutil"
	"net/http"
	"strings"
)

func main() {
	fmt.Println("Hello, playground")
	for i := 0; i < 500; i++ {
		post(fmt.Sprintf("(%d)", i))
	}
	println("done")
}

func post(b string) {
	bod := strings.NewReader(b)
	req, err := http.NewRequest("POST", "http://127.0.0.1:8124/?query=INSERT%20INTO%20t%20VALUES", bod)
	if err != nil {
		panic(err)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	_, err = ioutil.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
}
I have the default params in the config. I see these records in the log:
2019/11/18 18:14:07 DEBUG: query query=INSERT%20INTO%20t%20VALUES (493)
2019/11/18 18:14:07 DEBUG: query query=INSERT%20INTO%20t%20VALUES (494)
2019/11/18 18:14:07 DEBUG: query query=INSERT%20INTO%20t%20VALUES (495)
2019/11/18 18:14:07 DEBUG: query query=INSERT%20INTO%20t%20VALUES (496)
2019/11/18 18:14:07 DEBUG: query query=INSERT%20INTO%20t%20VALUES (497)
2019/11/18 18:14:07 DEBUG: query query=INSERT%20INTO%20t%20VALUES (498)
2019/11/18 18:14:07 DEBUG: query query=INSERT%20INTO%20t%20VALUES (499)
2019/11/18 18:14:08 INFO: send 500 rows to http://u:pass@ip:8123 of INSERT INTO t VALUES
But I see in CH:
curl 'some:8123?query=SELECT%20MAX(a)%20FROM%20t'
255
It looks like the first packet was sent twice:
250
251
252
253
254
255
0
1
2
3
4
5
6
Hi,
Why do you use the HTTP interface for ClickHouse?
Is it better than clickhouse-go?
Hi there,
Could you kindly confirm my initial thoughts about using your tool as a savior to my system?
I have a scenario where small inserts of data are posted to Clickhouse (e.g gps updates from a number of mobile devices).
Often Clickhouse returns http 500 error due to max connection count reached or due to timeout.
There are some MV that are being calculated on inserts so that might slow it down.
I changed the default value from 100 to 500, but it doesn't seem to help; just more queries end up waiting.
I thought that using your tool can improve the situation as bulk inserts are advised due to performance boost.
Other option that I think of is usage of buffer tables.
Thanks!
The last lines in log file:
2023/04/21 08:29:37.756987 ERROR: server down (502): Post "CH_URL": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2023/04/21 08:29:37.756995 ERROR: Send (503) No working clickhouse servers; response
After that, data isn't sent to the CH server, and the pod's RAM usage has increased over time.
I've checked app status and it looks ok:
/app # curl -s http://127.0.0.1:8124/metrics | grep "^ch_"
ch_bad_servers 0
ch_dump_count 14763
ch_good_servers 1
ch_queued_dumps 14743
ch_received_count 5.4660103e+07
ch_sent_count 2.154301e+06
Version: 1.3.3
The upstream servers' credentials are leaked into the log output by:
use servers XXXX
sending N rows to XXXX
sent N rows to XXXX
{
"listen": ":8124",
"flush_count": 10000,
"flush_interval": 1000,
"dump_check_interval": 300,
"debug": false,
"dump_dir": "dumps",
"clickhouse": {
"down_timeout": 60,
"connect_timeout": 10,
"servers": [
"http://clickhouse:[email protected]:8123"
]
}
}
$ curl -s http://127.0.0.1:8124/metrics | grep "^ch_"
ch_bad_servers 0
ch_dump_count 0
ch_good_servers 0
ch_queued_dumps 0
ch_received_count 0
ch_sent_count 0
$ curl http://clickhouse:[email protected]:8123
Ok.
It seems that it does not see the servers. How can I solve this problem?
So I have two databases, we shall call them default
and newdb
containing identical tables, and if I connect to clickhouse directly using http://username:password@localhost:8123/default
or http://username:password@localhost:8123/newdb
I am able to submit queries to the correct database.
However, with clickhouse-bulk and the exact same connection strings as above, inserts to both databases are aggregated into the same DB.
I can run (and am now running) a copy of clickhouse-bulk per database, but this seems sub-optimal. At an absolute minimum, clickhouse-bulk should reject queries sent to /newdb
if it is only going to insert into a single DB.
Hi.
The proxy cannot execute queries and prints this log after the update, although it worked before the update:
request: "INSERT INTO `display` (`uuid`,`user_id`,`app_uuid`,`uuid1`,`uuid2`,`created_at`) VALUES\n('7ecd41eb-58b6-44d2-ab59-01235bc32135',86,'00806453-89a0-4fd2-9f9f-2b012f45049e','0069f823-f901-48c6-b8bb-3d5a5d61d470','4264487b-fa40-47ae-939b-a492df46caaa','1618914155')"
2021/04/21 07:12:41.249482 INFO: sending 1 rows to http://192.168.88.1:8123 of INSERT INTO `display` (`uuid`,`user_id`,`app_uuid`,`uuid1`,`uuid1`,`created_at`) VALUES
2021/04/21 07:12:41.257645 INFO: sent 1 rows to http://192.168.88.1:8123 of INSERT INTO `display` (`uuid`,`user_id`,`app_uuid`,`uuid1`,`uuid1`,`created_at`) VALUES
2021/04/21 07:12:41.257817 ERROR: server error (400) Wrong server status 400:
Please take a look, it is very urgent for us.
Best Regards
Arthur
When forwarding queries to a server requiring authentication, if the URL is of the form http://username:password@localhost:8123, these credentials are disclosed in the log file by
Line 182 in 4f084dd
We should redact the password portion of this string before echoing it.
I want to add bulk features to chproxy. Can you help me with how to do it?
just getting 400 status
We have noticed that we have some old dumps with 502 errors. They are not resent. When the service is restarted, the dumps are resent after 5 minutes. It looks like we have a bug with d.LockedFiles and these dumps stay locked.
The data accumulates in memory, and if the service dies while a lot of data is buffered, that data will be lost.
Hi,
I can't send queries to clickhouse-bulk.
I added clickhouse-bulk to my docker-compose:
clickhouse:
image: yandex/clickhouse-server:21.1.2
ports:
- 8123:8123
- 9090:9000
clickhouse-bulk:
image: nikepan/clickhouse-bulk:1.3.3
ports:
- "8124:8124"
environment:
- CLICKHOUSE_SERVERS=http://0.0.0.0:8123
When I try to send queries to port 8124, I get a 503:
* Trying 127.0.0.1:8124...
* TCP_NODELAY set
* Connected to 127.0.0.1 (127.0.0.1) port 8124 (#0)
> POST /?query=CREATE&DATABASE&IF&NO&EXISTS&test HTTP/1.1
> Host: 127.0.0.1:8124
> User-Agent: curl/7.68.0
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 503 Service Unavailable
< Content-Type: text/plain; charset=UTF-8
< Date: Mon, 26 Jul 2021 23:10:49 GMT
< Content-Length: 0
<
* Connection #0 to host 127.0.0.1 left intact
Both containers are up, and curl http://0.0.0.0:8124/metrics | grep "^ch_" works.
What am I doing wrong?
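For what it's worth, http://0.0.0.0:8123 inside the clickhouse-bulk container does not point at the clickhouse container; a common fix, assuming default Compose networking, is to address the ClickHouse service by its service name (a sketch, not a confirmed diagnosis):

```yaml
clickhouse-bulk:
  image: nikepan/clickhouse-bulk:1.3.3
  ports:
    - "8124:8124"
  environment:
    # use the compose service name, not 0.0.0.0, so the bulk
    # container can reach the clickhouse container
    - CLICKHOUSE_SERVERS=http://clickhouse:8123
```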
When making requests using the library https://github.com/smi2/phpClickHouse (the best one at the moment), it seems that clickhouse-bulk accepts auth config only via query params, but the new version of the phpClickHouse lib sends authorization via headers.
Unfortunately, phpClickHouse doesn't provide a way to send custom query params, so I can't work around this by adding the username/password to the query string.
Is there any solution for this? Or could you update clickhouse-bulk to also accept the credentials the new way?
One of the official auth methods of ClickHouse uses the headers X-ClickHouse-Key and X-ClickHouse-User.
Here are all the options:
$ echo 'SELECT 1' | curl 'http://user:password@localhost:8123/' -d @-
$ echo 'SELECT 1' | curl 'http://localhost:8123/?user=user&password=password' -d @-
$ echo 'SELECT 1' | curl -H 'X-ClickHouse-User: user' -H 'X-ClickHouse-Key: password' 'http://localhost:8123/' -d @-
We have the following configuration:
{
"listen": ":8124",
"flush_count": 30000,
"flush_interval": 1000,
"debug": false,
"dump_dir": "dumps",
"clickhouse": {
"down_timeout": 300,
"servers": [
"http://127.0.0.1:8123",
"http://172.16.10.78:8123"
]
}
}
For example, this query:
INSERT INTO test (date, args) VALUES ('2019-06-13', 'query=select%20args%20from%20test%20group%20by%20date%20FORMAT%20JSON')
or this
INSERT INTO test (date, args) VALUES ('2019-06-13', 'query=select%2520args%2520from%2520test%2520group%2520by%2520date%2520FORMAT%2520JSON')
generates an error:
2019/06/13 12:49:46 Send ERROR 500: Code: 27, e.displayText() = DB::Exception: Cannot parse input: expected ( before: \'NDA\', ...: (at row 2)
It'd be great to have an ability (either automatic or manual, or ideally both :)) to easily resend the queries that failed and were dumped.
This option could be helpful in cases where the service is killed by the OOM killer, the server reboots unexpectedly, etc., to prevent losing all the data collected in memory and to guarantee delivery after the service recovers.
Useful new options:
{
"listen": ":8123",
"flush_count": 10000,
"flush_interval": 3000,
"debug": true,
"dump_dir": "dumps",
"clickhouse": {
"down_timeout": 300,
"servers": [
"http://0.0.0.0:8070"
]
}
}
2018/07/11 09:15:27 query query=Insert+into+Log_buffer+FORMAT+JSONEachRow&input_format_skip_unknown_fields=1 {"ts":"2018-07-11 09:15:27","level":"DEBUG","logger":"plugins.base_core","pid":19847,"procname":"wkr:1","file":"base_core.py:352","body":"Action start for '***********'","node":"US-2","jobid":"51399907","uid":"2","type":"monitor","plug":"*****"}
2018/07/11 09:15:27 Send ERROR 502: No working clickhouse servers
while a direct insert into CH works fine:
$ curl 0.0.0.0:8087
Ok.
🙏
PS: Does it not withstand the load?
Hi!
I use a library that can send inserts only in the CSVWithNames format, and it doesn't work with clickhouse-bulk.
It would be cool if clickhouse-bulk supported the CSVWithNames format too.
Just curious the reason for "For better performance words FORMAT and VALUES must be uppercase." Why is this so?
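One guess (not confirmed from the source) is that the collector does a plain case-sensitive substring search for these tokens, which is a single cheap scan, whereas case-insensitive matching would require folding the whole query first:

```go
package main

import (
	"fmt"
	"strings"
)

// findValues illustrates the assumed parsing strategy: a case-sensitive
// search only matches the uppercase token. This is a speculation about
// the parser, not the project's actual code.
func findValues(query string) int {
	return strings.Index(query, " VALUES")
}

func main() {
	fmt.Println(findValues("INSERT INTO t (a) VALUES (1)")) // uppercase: found
	fmt.Println(findValues("INSERT INTO t (a) values (1)")) // lowercase: -1, not matched
}
```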
Is it possible to start the service with TCP and send to port 9906 of ClickHouse?
For example, if I send an INSERT request with a wrong date format, "2020-02-06 07:15:23.364727" (Ruby Time format), then this request will never be executed and it wastes computation power.
Could you make a solution for this?
For example: if the exception is not about the ClickHouse server connection, then remove this bad request from the sending cycle into a separate dump file with the list of bad requests.
There are two locks (FileDumper.mu and Clickhouse.mu) that are called in different order.
When dumping files to disk Clickhouse.Dump locks Clickhouse.mu, then FileDumper.Dump is called with a lock on FileDumper.mu. At the same time, FileDumper.Listen calls ProcessNextDump with a FileDumper.mu lock and then calls SendQuery->GetNextQuery with a Clickhouse.mu lock.
One potential solution is to remove a lock from Clickhouse.Dump, since it already locks FileDumper.mu in FileDumper.Dump.
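That proposal can be sketched as follows; the type and method names mirror the issue, but the bodies are stand-ins, not the real implementation. With Clickhouse.Dump delegating locking entirely to FileDumper.Dump, only FileDumper.mu is held while dumping, so the two mutexes are never taken in opposite orders:

```go
package main

import (
	"fmt"
	"sync"
)

type FileDumper struct {
	mu    sync.Mutex
	dumps int
}

type Clickhouse struct {
	mu     sync.Mutex // still guards other Clickhouse state elsewhere
	dumper *FileDumper
}

func (f *FileDumper) Dump(query string) {
	f.mu.Lock()
	defer f.mu.Unlock()
	f.dumps++ // stand-in for writing the dump file
}

// Dump no longer takes Clickhouse.mu, removing the lock-order cycle.
func (c *Clickhouse) Dump(query string) {
	c.dumper.Dump(query)
}

func main() {
	f := &FileDumper{}
	c := &Clickhouse{dumper: f}
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			c.Dump("INSERT ...")
		}()
	}
	wg.Wait()
	fmt.Println(f.dumps)
}
```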
Hello, Nikolay!
I get a Request Timeout while sending more than X (about 1000 in my case) rows at once.
I batch requests on the client side (Node.js) and send this batch every 2500 ms. Everything goes well until my batch size reaches about 1000-1100 rows. I'm using the official ClickHouse client for Node.js.
The clickhouse-bulk instance is deployed on another machine, reachable via a 1 Gbit network.
Do you have any idea why and how this could happen?
I use the bulker in one of my projects via the binary. I get logs from AWS Lambda and send them to the bulker. I discovered that memory usage increases linearly and never stabilizes. Do you have any clue what the reason could be?
Here is my config file:
{
"listen": ":8124",
"flush_count": 100000,
"flush_interval": 5000,
"dump_check_interval": 300,
"debug": false,
"dump_dir": "dumps",
"clickhouse": {
"down_timeout": 60,
"connect_timeout": 10,
"servers": [
"http://X.X.X.X:8123",
"http://X.X.X.X:8123",
"http://X.X.X.X:8123",
"http://X.X.X.X:8123"
]
}
}
One common practice after creating a connection is to check for a Ping/Pong from the server:
if err := connect.Ping(); err != nil {
logger.Fatal(err)
return nil, err
}
This method works as intended when connecting to a ClickHouse server via either HTTP or TCP.
When I instead connect to clickhouse-bulk over HTTP, I receive a "bad connection" error from the driver.