htcat / htcat

Parallel and Pipelined HTTP GET Utility

License: BSD 2-Clause "Simplified" License

Go 100.00%

htcat's Introduction

htcat

htcat is a utility to perform parallel, pipelined execution of a single HTTP GET. htcat is intended for the purpose of incantations like:

htcat https://host.net/file.tar.gz | tar -zx

It is tuned (and only really useful) for faster interconnects:

$ htcat http://test.com/file | pv -a > /dev/null
[ 109MB/s]

This is on a gigabit network, between an AWS EC2 instance and S3. This represents 91% use of the theoretical maximum of gigabit (119.2 MiB/s).
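The 91% figure can be reproduced with a few lines of arithmetic. This is only a sketch: it assumes the 109 reported by pv is in MiB/s and that "gigabit" means exactly 10^9 bits per second with no protocol overhead. The function name is made up for illustration.

```go
package main

import "fmt"

// lineRateMiBPerSec converts a line rate in bits per second to MiB per second.
func lineRateMiBPerSec(bitsPerSec float64) float64 {
	return bitsPerSec / 8 / (1 << 20)
}

func main() {
	ceiling := lineRateMiBPerSec(1e9) // assumed: 1 gigabit = 10^9 bits/s
	fmt.Printf("theoretical max: %.1f MiB/s\n", ceiling)          // 119.2
	fmt.Printf("utilization at 109: %.0f%%\n", 109/ceiling*100)   // 91
}
```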

Installation

This program depends on Go 1.1 or later. One can use go get to download and compile it from source:

$ go get github.com/htcat/htcat/cmd/htcat

Help and Reporting Bugs

For correspondence of all sorts, write to [email protected]. Bugs can be filed at htcat's GitHub Issues page.

Approach

htcat works by determining the size of the resource at the URL passed (from its Content-Length), and then partitioning the work into a series of GETs that use the Range header in the request. The notable exception is the first GET issued, which has no Range header and is used both to start the transfer and to attempt to determine the size of the resource.
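To make the partitioning concrete, here is a minimal sketch of how a known size could be turned into Range header values. It is not htcat's actual code: the function name, the start offset from which ranged GETs begin, and the fixed part size are all assumptions made for illustration.

```go
package main

import "fmt"

// rangeHeaders splits the bytes [start, size) into parts of at most partSize
// bytes and returns one Range header value per part. HTTP byte ranges are
// inclusive on both ends, hence the "- 1" when computing each end offset.
func rangeHeaders(start, size, partSize int64) []string {
	if partSize <= 0 {
		return nil // avoid looping forever on a nonsensical part size
	}
	var out []string
	for off := start; off < size; off += partSize {
		end := off + partSize - 1
		if end > size-1 {
			end = size - 1 // the last part may be shorter
		}
		out = append(out, fmt.Sprintf("bytes=%d-%d", off, end))
	}
	return out
}

func main() {
	// A 10-byte resource in 4-byte parts: bytes=0-3, bytes=4-7, bytes=8-9.
	for _, h := range rangeHeaders(0, 10, 4) {
		fmt.Println("Range:", h)
	}
}
```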

Unlike most programs that do similar Range-based splitting, htcat limits the requests performed in parallel to some number of bytes ahead of the data emitted so far, instead of splitting the entire byte stream evenly. The purpose is to emit bytes as soon as reasonably possible, so that another tool in the pipeline can proceed in parallel too.
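The look-ahead limit can be expressed as a simple predicate. The following is a sketch under assumed names and parameters (canRequest, lookahead), not the check htcat itself performs.

```go
package main

import "fmt"

// canRequest reports whether a part starting at nextOff may be requested yet,
// given that bytes [0, emitted) have already been written to the output and
// that requests may start at most lookahead bytes ahead of the emitted data.
func canRequest(nextOff, emitted, lookahead, size int64) bool {
	return nextOff < size && nextOff < emitted+lookahead
}

func main() {
	// With nothing emitted and an 8-byte window, offset 4 is allowed
	// but offset 8 has to wait until more output has been emitted.
	fmt.Println(canRequest(4, 0, 8, 100)) // true
	fmt.Println(canRequest(8, 0, 8, 100)) // false
	fmt.Println(canRequest(8, 4, 8, 100)) // true
}
```

An even split would instead issue all requests up front, so the bytes needed next by the downstream tool would compete for bandwidth with bytes that cannot be emitted for a long time.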

These requests may complete slightly out of order, and are held in reserve until contiguous bytes can be emitted by a defragmentation routine that concatenates the complete, consecutive payloads in memory for emission.

Tweaking the number of simultaneous transfers and the size of each GET trades off latency to fill the output pipeline, memory usage, and churn in requests and connections, with their associated start-up costs.

If htcat's peer on the server side processes Range requests more slowly than a regular GET without a Range header, then htcat's performance can suffer relative to a simpler, single-stream GET.

Numbers

These measurements fall well short of real benchmarks; they are intended to give a rough sense of the performance improvements that may be useful to you. They were taken from an AWS EC2 instance connecting to S3, and there is definitely some variation between runs, sometimes very significant, especially at the higher speeds.

Tool	TLS	Rate
htcat	no	109 MB/s
curl	no	36 MB/s
aria2c -x5	no	113 MB/s
htcat	yes	59 MB/s
curl	yes	5 MB/s
aria2c -x5	yes	17 MB/s

On somewhat small files, the situation changes: htcat chooses smaller parts, so as to still get some parallelism.

Below are results while performing a 13MB transfer from S3 (Seattle) to an EC2 instance in Virginia. Notably, TLS being on or off did not seem to matter; perhaps in this case it was not a bottleneck.

Tool	Time
curl	5.20s
curl	7.75s
curl	6.36s
htcat	2.69s
htcat	2.50s
htcat	3.25s

Results while performing a transfer of the same 13MB file from S3 to EC2, but all within Virginia:

Tool	TLS	Time
curl	no	0.29s
curl	no	0.75s
curl	no	0.44s
htcat	no	0.30s
htcat	no	0.30s
htcat	no	0.48s
curl	yes	2.69s
curl	yes	2.69s
curl	yes	2.62s
htcat	yes	1.37s
htcat	yes	0.45s
htcat	yes	0.59s

Results while performing a 4.6MB transfer on a fast (same-region) link. This file is small enough that htcat disables multi-request parallelism. Given that, it's unclear why htcat performs markedly better on the TLS tests than curl.

Tool	TLS	Time
curl	no	0.14s
curl	no	0.13s
curl	no	0.14s
htcat	no	0.23s
htcat	no	0.16s
htcat	no	0.17s
curl	yes	0.95s
curl	yes	0.97s
curl	yes	0.99s
htcat	yes	0.38s
htcat	yes	0.34s
htcat	yes	0.24s

htcat's Issues

Avoid goroutine leaks

Right now, HtCat.WriteTo does not close the channels it creates for other processes to notify it of various status updates, e.g. cancellations and piece registration.

This may not matter for the cmd/htcat use case, which exits immediately after the transfer finishes or hits an error, but if someone embeds the HtCat type and operator in a long-lived program, that would be unfortunate.
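For illustration, one common Go pattern for avoiding this kind of leak is to give every sending goroutine a cancellation signal to select on, and to wait for it to exit before returning. The sketch below is generic and does not reflect htcat's actual channel layout; produce and consumeSome are invented names.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// produce sends part numbers on out until it has sent n of them or ctx is
// cancelled, whichever comes first.
func produce(ctx context.Context, n int, out chan<- int, wg *sync.WaitGroup) {
	defer wg.Done()
	for i := 0; i < n; i++ {
		select {
		case out <- i:
		case <-ctx.Done():
			return // the consumer went away; do not block forever on send
		}
	}
}

// consumeSome reads only the first want values (want must not exceed 100
// here) and then stops early, the way a transfer might stop on an error.
func consumeSome(want int) int {
	ctx, cancel := context.WithCancel(context.Background())
	out := make(chan int)
	var wg sync.WaitGroup
	wg.Add(1)
	go produce(ctx, 100, out, &wg)

	got := 0
	for got < want {
		<-out
		got++
	}
	cancel()  // tell the producer to stop
	wg.Wait() // returns only once the producer goroutine has exited
	return got
}

func main() {
	fmt.Println(consumeSome(3)) // 3
}
```

Without the ctx.Done() case, the producer would stay blocked on its next send after the consumer returns, which is harmless in a short-lived command but accumulates in a long-lived program.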

error in fetched data

Hi, I'm trying to use htcat, freshly built from github yesterday, to retrieve files from the reddit comments corpus, but when I fetch with
./htcat http://archive.org/download/2015_reddit_comments_corpus/reddit_data/2011/RC_2011-08.bz2 | pv > rc-2011-08-htcat.bz2
I get different data than if I fetch with
wget http://archive.org/download/2015_reddit_comments_corpus/reddit_data/2011/RC_2011-08.bz2 -q -O /proc/self/fd/1 | pv > rc-2011-08-wget.bz2

Specifically,

[nk@final-gateway ~]$ cksum *bz2
1344497712 1104365673 rc-2011-08-htcat.bz2
3401997682 1104365673 rc-2011-08-wget.bz2

Using some cmp trickery, it looks like the files are the same up through offset 0x1400000, about 21MB in.
Can I provide any other diagnostic info?

Spuriously decodes URL path

I was getting inconsistent behavior between htcat and curl and after applying the following instrumentation:

diff --git a/http.go b/http.go
index 297694a..dfd5024 100644
--- a/http.go
+++ b/http.go
@@ -5,6 +5,7 @@ import (
        "fmt"
        "io"
        "net/http"
+       "net/http/httputil"
        "net/url"
        "strconv"
        "sync"
@@ -53,6 +54,13 @@ func (cat *HtCat) startup(parallelism int) {
                return
        }

+       reqStr, err := httputil.DumpRequestOut(&req, true)
+       if err != nil {
+               fmt.Println(err)
+       }
+
+       fmt.Printf("%v\n", string(reqStr))
+
        // Check for non-200 OK response codes from the startup-GET.
        if resp.StatusCode != 200 {
                err = HttpStatusError{

I tracked it down to differences in how percent-encoded entities are treated in htcat and curl. Here is a request made with the instrumented htcat:

maciek@mothra:~$ ./htcat http://example.com/foo%2Bbar
GET /foo+bar HTTP/1.1
Host: example.com
User-Agent: Go 1.1 package http
Accept-Encoding: gzip


2015/08/11 15:00:44 aborting: could not write to output stream: Expected HTTP Status 200, received: "404 Not Found"

and here is the same URL requested with curl:

maciek@mothra:~$ curl -I -v http://example.com/foo%2Bbar
* Hostname was NOT found in DNS cache
*   Trying 93.184.216.34...
* Connected to example.com (93.184.216.34) port 80 (#0)
> HEAD /foo%2Bbar HTTP/1.1
> User-Agent: curl/7.38.0
> Host: example.com
> Accept: */*
> 
...

Note that curl uses the path "/foo%2Bbar" whereas htcat appears to have decoded that percent-encoding to "/foo+bar".

I'm not positive which behavior is correct but it seems like they should be consistent, and my money's on curl here. Thoughts?
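As a possibly related data point, Go's net/url package keeps the decoded form of the path in URL.Path, and in Go 1.5 and later also offers URL.EscapedPath, which preserves the original encoding when it is valid. The sketch below only shows that difference; whether it is the cause in htcat has not been verified, and the paths helper is made up for this example.

```go
package main

import (
	"fmt"
	"net/url"
)

// paths parses raw and returns both the decoded path and the escaped path
// as reported by net/url.
func paths(raw string) (decoded, escaped string, err error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", "", err
	}
	return u.Path, u.EscapedPath(), nil
}

func main() {
	decoded, escaped, err := paths("http://example.com/foo%2Bbar")
	if err != nil {
		panic(err)
	}
	fmt.Println(decoded) // /foo+bar
	fmt.Println(escaped) // /foo%2Bbar
}
```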

Degrade if range requests are ignored

A server may decide not to handle a range request; this leaves htcat in a bad state.

First and foremost, 1fb7e83 should be reverted. A standards-compliant server should return a 206. If a 200 is returned, htcat should degrade into non-parallel fetching. If a server returns a 206 but later returns a 200, that should be treated as an error.
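The proposed rules could be sketched as a small decision function. This is an illustration of the proposal only, not code from htcat; the action type, its values, and classify are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

type action int

const (
	invalid     action = iota
	useFragment        // 206: the body is the requested range
	degrade            // 200 before any 206: Range was ignored, use one stream
)

var errRangeDropped = errors.New("server returned 200 after previously honoring Range")

// classify decides how to treat the status of a response to a ranged GET.
// seen206 records whether an earlier ranged GET in this transfer got a 206.
func classify(status int, seen206 bool) (action, error) {
	switch {
	case status == http.StatusPartialContent:
		return useFragment, nil
	case status == http.StatusOK && !seen206:
		return degrade, nil
	case status == http.StatusOK:
		return invalid, errRangeDropped
	default:
		return invalid, fmt.Errorf("unexpected HTTP status %d", status)
	}
}

func main() {
	// A server that honors Range once and then stops doing so.
	seen206 := false
	for _, status := range []int{206, 200} {
		act, err := classify(status, seen206)
		fmt.Println(status, act, err)
		if act == useFragment {
			seen206 = true
		}
	}
}
```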

Possible explanation for performance on serial request

From the README:

Results while performing a 4.6MB transfer on a fast (same-region) link. This file is small enough that htcat disables multi-request parallelism. Given that, it's unclear why htcat performs markedly better on the TLS tests than curl.

It might be that Go's HTTP library performs optimizations that curl does not? I can't imagine what optimizations those might be, but it seems the only plausible explanation.

man page or help?

If I run htcat --help, man htcat, or simply htcat, I don't get any of the instructions or information listed in this README.

Doesn't work with go 1.2

Performance in IO bound Systems

Howdy! First, thanks, this is a really cool tool. Also, as a side note, I made a Heroku buildpack (https://github.com/schneems/htcat_buildpack) for easy Heroku consumption.

Now, down to business. I've got an app that downloads large tar archives and unzips them. Perfect for htcat, or so it would seem. However I ran into an issue. I'm running my program via sidekiq which scales out through multiple threads. I'm currently using curl and wanted to benchmark against htcat simulating a sidekiq environment (shelling out in multiple threads). The results on performance were somewhat surprising to me.

Here's an approximation of the benchmark I wrote: https://gist.github.com/schneems/fb86d87cce1b543a5882

When running htcat inside multiple processes (inside multiple Ruby threads), I saw htcat outperform curl for 1 or 2 threads. Above that (running on a Heroku dyno), htcat becomes massively slower. When I tried downloading 50 archives (varying from 30 to 150 MB in size) in parallel, curl completed in around 80 seconds, and htcat took so long that it errored out.

What I think is happening is that at such high levels of parallel downloading, the network is faster than the disk, and the IO is the bottleneck. When this happens, htcat seems to perform far worse than curl, though I'm not sure why. Maybe too many threads in the mix cause too much context switching? I'm not sure, but I thought it was an interesting data point.

If you're interested, I have a running benchmark that is not suitable for public consumption, but is on a heroku dyno, ping me via hipchat and I can give you access.