Dataiku Cloud Transport Client
Ran dctc ls ssh://dataiku@test1.dataiku.com:/home/dataiku/ and got:
debug: SshFileBuilder: path[0]: test1.dataiku.com
debug: SshFileBuilder: path[1]: /home/dataiku/
debug: SshFileBuilder: user[0]: dataiku
debug: SshFileBuilder: user[1]: null
Exception in thread "main" java.lang.NoClassDefFoundError: com/jcraft/jsch/JSchException
at com.dataiku.dctc.file.SshFileBuilder.buildFile(SshFileBuilder.java:69)
at com.dataiku.dctc.file.FileBuilder.buildFile(FileBuilder.java:59)
at com.dataiku.dctc.command.Command.build(Command.java:159)
at com.dataiku.dctc.command.Command.getArgs(Command.java:129)
at com.dataiku.dctc.command.Ls.perform(Ls.java:35)
at com.dataiku.dctc.Main.main(Main.java:144)
Caused by: java.lang.ClassNotFoundException: com.jcraft.jsch.JSchException
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
... 6 more
Same issue with dctc cp.
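The NoClassDefFoundError above means jsch.jar is missing from the runtime classpath. A minimal sketch of a preflight check that would turn this crash into a readable message (class and method names here are illustrative, not dctc's actual API):

```java
// Sketch: fail fast with a readable message when an optional protocol
// dependency (here JSch, needed for ssh://) is missing from the classpath.
class DependencyCheck {
    /** Returns true if the named class can be found on the classpath. */
    static boolean isPresent(String className) {
        try {
            Class.forName(className, false, DependencyCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }
}
```

Running such a check before SshFileBuilder.buildFile would let dctc print "ssh:// support requires jsch.jar on the classpath" instead of a raw stack trace.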
Running threads are considered done.
Hadoop parameters could be handled the following way: run
hadoop classpath
to retrieve the Hadoop classpath.
dctc rm: ERROR: cannot remove 'ssh://localhost/home/vash/.': Is a directory
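The hadoop classpath suggestion above can be sketched with a small helper that captures a command's stdout (assuming the hadoop binary is on PATH; the helper itself is command-agnostic and the names are illustrative):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

// Sketch: run an external command and capture its stdout, e.g. to read
// the output of `hadoop classpath` at startup.
class CommandOutput {
    static String run(String... cmd) throws IOException, InterruptedException {
        Process p = new ProcessBuilder(cmd).redirectErrorStream(true).start();
        StringBuilder out = new StringBuilder();
        try (BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = r.readLine()) != null) {
                out.append(line).append('\n');
            }
        }
        p.waitFor();
        // e.g. run("hadoop", "classpath") would return the classpath string
        return out.toString().trim();
    }
}
```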
Running dctc sync ftp://noaa@pub/data/gsod/2013 . fails with java.io.IOException: Fail Connect: Connection closed without indication.
The directory contains 12K+ files, and dctc stays stuck on [0] [main] [INFO ] [dctc.sync] - Check ftp://anonymous:*****@ftp.ncdc.noaa.gov/pub/data/gsod/2013 for a very long time before failing.
Is it relevant / doable to add the HTTP protocol for file transfer?
An example would be:
dctc cp http://download.geonames.org/export/dump/FR.zip .
Need to use a pagination mechanism in help.
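A pagination mechanism for long help output could look like this sketch (the page size and names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: split long help output into fixed-size pages so it can be
// shown one screen at a time.
class Paginator {
    static List<List<String>> paginate(List<String> lines, int pageSize) {
        List<List<String>> pages = new ArrayList<>();
        for (int i = 0; i < lines.size(); i += pageSize) {
            // subList's upper bound is clamped so the last page may be short
            pages.add(lines.subList(i, Math.min(i + pageSize, lines.size())));
        }
        return pages;
    }
}
```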
dctc ls ssh://test2@:/
shows a directory only if the user has full read/write/exec permissions on it (I did a chmod 777 to be able to see it).
Probably the same bug as with s3.
Issuing dctc cp ftp://noaa@pub/data/gsod/ish-history.csv . gives the following error:
dctc cp: ERROR: Omitting directory `ftp://ftp.ncdc.noaa.gov/pub/data/gsod/ish-history.csv'
(the file exists and can be read via dctc head ftp://noaa@pub/data/gsod/ish-history.csv, configuration set in .dctcrc)
done: 2,15M/2,95M in 0m26s - 0/1 files done - 78,06kBps - 1 transfer(s) running.
dctc cat s3://dctc-test
Exception in thread "main" com.amazonaws.AmazonClientException: Unable to unmarshall response (null)
at com.amazonaws.http.AmazonHttpClient.handleResponse(AmazonHttpClient.java:551)
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:291)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:164)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:2906)
at com.amazonaws.services.s3.AmazonS3Client.getObject(AmazonS3Client.java:885)
at com.dataiku.dctc.file.S3File.inputStream(S3File.java:183)
at com.dataiku.dctc.AutoGZip.buildInput(AutoGZip.java:25)
at com.dataiku.dctc.command.Cat.print(Cat.java:55)
at com.dataiku.dctc.command.Cat.perform(Cat.java:33)
at com.dataiku.dctc.command.Command.perform(Command.java:51)
at com.dataiku.dctc.Main.main(Main.java:144)
Caused by: java.lang.NullPointerException
at com.amazonaws.services.s3.internal.ServiceUtils.isMultipartUploadETag(ServiceUtils.java:81)
at com.amazonaws.services.s3.internal.S3ObjectResponseHandler.handle(S3ObjectResponseHandler.java:49)
at com.amazonaws.services.s3.internal.S3ObjectResponseHandler.handle(S3ObjectResponseHandler.java:31)
at com.amazonaws.http.AmazonHttpClient.handleResponse(AmazonHttpClient.java:530)
... 10 more
./dctc.sh ls ftp://a:a
Exception in thread "main" java.lang.NumberFormatException: For input string: "a"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
at java.lang.Integer.parseInt(Integer.java:449)
at java.lang.Short.parseShort(Short.java:120)
at java.lang.Short.parseShort(Short.java:78)
at com.dataiku.dctc.file.FTPFileBuilder.build(FTPFileBuilder.java:68)
at com.dataiku.dctc.file.FTPFileBuilder.buildFile(FTPFileBuilder.java:44)
at com.dataiku.dctc.file.FTPFileBuilder.buildFile(FTPFileBuilder.java:1)
at com.dataiku.dctc.file.FileBuilder.buildFile(FileBuilder.java:59)
at com.dataiku.dctc.command.Command.build(Command.java:159)
at com.dataiku.dctc.command.Command.getArgs(Command.java:129)
at com.dataiku.dctc.command.Ls.perform(Ls.java:35)
at com.dataiku.dctc.Main.main(Main.java:144)
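The trace shows the text after the colon in ftp://a:a being parsed unconditionally as a port number. A hedged sketch of defensive parsing (the default and the names are illustrative):

```java
// Sketch: parse the optional port after "host:", falling back to the
// default and rejecting non-numeric values with a clear error instead
// of an uncaught NumberFormatException.
class PortParser {
    static final int DEFAULT_FTP_PORT = 21;

    static int parsePort(String s) {
        if (s == null || s.isEmpty()) {
            return DEFAULT_FTP_PORT;       // "ftp://host:" or "ftp://host"
        }
        try {
            int port = Integer.parseInt(s);
            if (port < 1 || port > 65535) {
                throw new IllegalArgumentException("port out of range: " + s);
            }
            return port;
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("invalid port: '" + s + "'");
        }
    }
}
```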
% dctc ls ftp://tamere.enstring.sur.internet.com/pub
dctc ls: ERROR: ls failed
A bug has been introduced with the latest modification to the file builder.
Previously, the two following syntaxes were accepted:
ftp://localhost/path/to/some/file
ftp://localhost:/path/to/some/file
Now only the first syntax is accepted. It's a real bug when the default_path
option is set in the configuration file. Indeed, if we set this option to /a/real,
this:
ftp://localhost:path/to/some/file
becomes:
ftp://localhost:/a/real/path/to/some/file
while the following doesn't change:
ftp://localhost:/path/to/some/file
This syntax allows the usage of the default_path option for ftp (#32) and follows the syntax used for ssh.
The syntax commonly used for ftp must also be supported so as not to disturb the user.
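The resolution described above (a relative path after the colon is resolved against default_path, an absolute one is kept as-is) can be sketched as:

```java
// Sketch of the intended semantics: "ftp://host:/abs" stays absolute,
// "ftp://host:rel" is resolved against the configured default_path.
// Names are illustrative, not dctc's actual code.
class PathResolver {
    static String resolve(String path, String defaultPath) {
        if (path.startsWith("/")) {
            return path;                   // already absolute
        }
        if (defaultPath == null || defaultPath.isEmpty()) {
            return "/" + path;             // no default configured
        }
        return defaultPath + (defaultPath.endsWith("/") ? "" : "/") + path;
    }
}
```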
dctc still hangs in the main thread if some tasks fail.
Support for Dropbox as a new protocol would be great in dctc (especially if integrated with standalone products).
I have 2 aws accounts defined via dctc add-account, and dctc ls s3:// will list the buckets / files from the last created one. How do I access the other one?
Mighty cool feature!
Now, we need to:
I just did a:
dctc cp ~/Downloads/SFPD_Incidents_-_Previous_Three_Months.csv ssh://test2@/path/to/data
and got:
Weird path encountered: test2.dataiku.com/path/to/data/path/to/data
done: 5,40M/5,40M in 0m1s - 1/1 files done - 819,20kBps - 0 transfer(s) running.
Copied 5,40M in 0m1s
The target directory path is duplicated, and when listing the target directory, nothing is there...
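The duplicated path suggests the destination is joined onto itself somewhere. A hedged sketch of a join that guards against re-appending an already-qualified child (a guess at the failure mode; this is illustrative, not dctc's actual code):

```java
// Sketch: join parent and child paths without duplicating the child
// when the parent already ends with it.
class PathJoin {
    static String join(String parent, String child) {
        String p = parent.endsWith("/") ? parent.substring(0, parent.length() - 1) : parent;
        String c = child.startsWith("/") ? child.substring(1) : child;
        if (p.endsWith("/" + c) || p.equals(c)) {
            return p;                      // child already present: don't duplicate
        }
        return p + "/" + c;
    }
}
```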
Issuing "dctc add-account --help" does not print anything; it should explain what the command is used for.
caused by 7566bc7.
Currently:
[ftp]
company.hostname = ftp.mycompany.com
Should be:
[ftp]
company.host = ftp.mycompany.com
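If the key is renamed, existing .dctcrc files could keep working via a small compatibility shim; a sketch using the two key names above:

```java
// Sketch: normalize the legacy config key so both "company.hostname"
// and "company.host" resolve to the same setting.
class ConfigKeys {
    static String normalize(String key) {
        if (key.endsWith(".hostname")) {
            return key.substring(0, key.length() - "hostname".length()) + "host";
        }
        return key;
    }
}
```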
tick is a loop counter, not a time counter. It becomes imprecise whenever a loop iteration is not immediate.
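A monotonic clock avoids that imprecision; a minimal sketch:

```java
// Sketch: measure elapsed time with a monotonic clock rather than a loop
// counter, so slow iterations don't skew the rate display.
class Stopwatch {
    private final long startNanos = System.nanoTime();

    /** Elapsed wall-clock milliseconds since this stopwatch was created. */
    long elapsedMillis() {
        return (System.nanoTime() - startNanos) / 1_000_000L;
    }
}
```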
Right now I'm stuck when trying to define an FTP or an SSH connection in .dctcrc (trying to follow the doc).
Maybe:
-> .dctcrc should contain templates for configuring the different types of protocols
-> the doc should reflect how to use the dctc commands on a remote machine
For example, in the doc, in the SCP/SFTP section:
"
On the command line
Connect using ssh://user@host:/absolute/path or ssh://user@host:path/relative/to/homedir
"
Then how do I run dctc commands?
Currently, it's placed directly in the user dir. Warning: don't use hardcoded paths; there are variables to find out where to put it.
Upload the docs, protocols.html afterwards.
I smell a very nasty synchronization bug. Like, we create the bucket using out.mkpath() in each CopyTaskRunnable, and it fails because we try to create the same bucket several times.
7:58 clement@MacBook-Air-de-Clement ~/code/dctc% ./dctc-eclipse.sh cp -r dist s3://dctc-dist
[0] [main] [INFO ] [dctc.command] - Add task local:///Users/clement/code/dctc/dist/dataiku-core.jar
[1] [main] [INFO ] [dctc.command] - Add task local:///Users/clement/code/dctc/dist/dataiku-dctc.jar
[1] [main] [INFO ] [dctc.command] - Add task local:///Users/clement/code/dctc/dist/dctc-tool.jar
done: 5,20M/8,45M in 0m1s - 0/3 files done - 0,00Bps - 3 transfer(s) running.Exception in thread "Thread-2" Status Code: 404, AWS Service: Amazon S3, AWS Request ID: 8B34469723ED844A, AWS Error Code: NoSuchBucket, AWS Error Message: The specified bucket does not exist, S3 Extended Request ID: nryRkPhMQHb6ypS4sC6MQ2fUdus15sHrHWxqkFG5VooWRHY8AijlfI7RYD1Ya/UO
at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:610)
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:310)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:164)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:2906)
at com.amazonaws.services.s3.AmazonS3Client.initiateMultipartUpload(AmazonS3Client.java:2157)
at com.amazonaws.services.s3.model.AmazonS3OutputStream.<init>(AmazonS3OutputStream.java:18)
at com.dataiku.dctc.file.S3File.outputStream(S3File.java:222)
at com.dataiku.dctc.copy.DirectCopyTaskRunnable.work(DirectCopyTaskRunnable.java:46)
at com.dataiku.dctc.copy.CopyTaskRunnable.run(CopyTaskRunnable.java:56)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
at java.lang.Thread.run(Thread.java:680)
[884] [Thread-4] [INFO ] [org.apache.http.impl.client.DefaultHttpClient] - I/O exception (java.net.SocketException) caught when processing request: Connection reset
[884] [Thread-4] [INFO ] [org.apache.http.impl.client.DefaultHttpClient] - Retrying request
done: 5,20M/8,45M in 0m2s - 0/3 files done - 1,56MBps - 3 transfer(s) running.Exception in thread "Thread-4" Status Code: 404, AWS Service: Amazon S3, AWS Request ID: 8B84BCA2058A65D9, AWS Error Code: NoSuchBucket, AWS Error Message: The specified bucket does not exist, S3 Extended Request ID: q6F98s3KNcTO3Lcsf+6+mYG3Juh+I/2yIMpz06AU3NJC+ZXuEc6cgfupoQZHh9O3
at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:610)
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:310)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:164)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:2906)
at com.amazonaws.services.s3.AmazonS3Client.abortMultipartUpload(AmazonS3Client.java:2066)
at com.amazonaws.services.s3.model.AmazonS3OutputStream.send(AmazonS3OutputStream.java:57)
at com.amazonaws.services.s3.model.AmazonS3OutputStream.write(AmazonS3OutputStream.java:28)
at java.io.OutputStream.write(OutputStream.java:99)
at com.dataiku.dctc.copy.DirectCopyTaskRunnable.work(DirectCopyTaskRunnable.java:68)
at com.dataiku.dctc.copy.CopyTaskRunnable.run(CopyTaskRunnable.java:56)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
at java.lang.Thread.run(Thread.java:680)
done: 5,20M/8,45M in 0m27s - 1/3 files done - 0,00Bps - 2 transfer(s) running.^Czsh: exit 130 ./dctc-eclipse.sh cp -r dist s3://dctc-dist
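The race described above (every CopyTaskRunnable calling out.mkpath() for the same bucket) could be guarded so only one task actually creates each bucket; a minimal sketch with illustrative names:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: make bucket creation idempotent across concurrent copy tasks.
// Set.add on a ConcurrentHashMap-backed set is atomic, so exactly one
// caller per bucket name sees true and issues the create request.
class BucketCreator {
    private final Set<String> created = ConcurrentHashMap.newKeySet();

    /** Returns true only for the first caller per bucket name. */
    boolean shouldCreate(String bucket) {
        return created.add(bucket);
    }
}
```

The first task to return true would create the bucket (and the others would need to wait for it to exist before uploading), but the sketch shows the core deduplication.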
Because it calls getAbsolutePath, which performs calls over SSH, bypassing all normal error handling.
Only do it if there are actually some globbing chars
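A sketch of that guard (the exact set of metacharacters is an assumption):

```java
// Sketch: only attempt glob expansion when the argument actually
// contains globbing metacharacters.
class Globs {
    static boolean containsGlobChars(String s) {
        for (char c : s.toCharArray()) {
            if (c == '*' || c == '?' || c == '[' || c == '{') {
                return true;
            }
        }
        return false;
    }
}
```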
When using the "cp" command on a windows terminal, a new line is printed every second to update the transfer status.
dctc head s3://path/to/my/file.pig works but not dctc tail s3://path/to/my/file.pig:
Exception in thread "main" java.lang.NullPointerException
at java.io.Reader.<init>(Reader.java:61)
at java.io.InputStreamReader.<init>(InputStreamReader.java:80)
at org.apache.commons.io.IOUtils.copy(IOUtils.java:1435)
at com.dataiku.dctc.command.Tail.performLine(Tail.java:84)
at com.dataiku.dctc.command.Tail.perform(Tail.java:55)
at com.dataiku.dctc.command.Command.perform(Command.java:51)
at com.dataiku.dctc.Main.main(Main.java:144)
Commit d96bfe2 added the ability for dctc to handle local://path and file://path. This naming was already changed by b4afc605.
Did we really need two namings for local files, when we didn't really use them (I prefer to write path instead of local://path)?
Tested on my Mac (Mac OS X)
dctc ls: ERROR: `ssh://10.1.204.7/' failed: dctc Sshfile: failed: /
java.io.IOException: dctc Sshfile: failed: /
at com.dataiku.dctc.file.SshFile.exec(SshFile.java:458)
at com.dataiku.dctc.file.SshFile.resolve(SshFile.java:435)
at com.dataiku.dctc.file.SshFile.exists(SshFile.java:221)
at com.dataiku.dctc.command.Ls.perform(Ls.java:150)
at com.dataiku.dctc.command.Ls.perform(Ls.java:40)
at com.dataiku.dctc.Main.main(Main.java:157)
Caused by: com.jcraft.jsch.JSchException: java.lang.ArrayIndexOutOfBoundsException: 766
at com.jcraft.jsch.KnownHosts.setKnownHosts(KnownHosts.java:171)
at com.jcraft.jsch.KnownHosts.setKnownHosts(KnownHosts.java:60)
at com.jcraft.jsch.JSch.setKnownHosts(JSch.java:269)
at com.dataiku.dctc.file.SshFile.openSessionAndResolveHome(SshFile.java:114)
at com.dataiku.dctc.file.SshFile.connect(SshFile.java:152)
at com.dataiku.dctc.file.SshFile.exec(SshFile.java:451)
... 5 more
Caused by: java.lang.ArrayIndexOutOfBoundsException: 766
at com.jcraft.jsch.Util.fromBase64(Util.java:51)
at com.jcraft.jsch.KnownHosts.setKnownHosts(KnownHosts.java:157)
... 10 more
Will copy 15724 file(s).
done: 12.58G/194.17G in 2m51s - 1124/15724 files done - 72.26MBps - 24 transfer(s) running.Exception in thread "Thread-14" java.lang.IllegalArgumentException: java.net.URISyntaxException: Illegal character in path at index 102: http://storage.googleapis.com/cds-dp-bck-eu/dtm_fact_omn_daily_tracking_visitors/dt=$%7Bhiveconf%3ADAY}/000000_0.gz
at com.google.api.client.http.GenericUrl.<init>(GenericUrl.java:102)
at com.dataiku.dctc.file.GSFile.copy(GSFile.java:224)
at com.dataiku.dctc.copy.DirectCopyTaskRunnable.work(DirectCopyTaskRunnable.java:79)
at com.dataiku.dctc.copy.CopyTaskRunnable.run(CopyTaskRunnable.java:51)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:722)
Caused by: java.net.URISyntaxException: Illegal character in path at index 102: http://storage.googleapis.com/cds-dp-bck-eu/dtm_fact_omn_daily_tracking_visitors/dt=$%7Bhiveconf%3ADAY}/000000_0.gz
at java.net.URI$Parser.fail(URI.java:2829)
at java.net.URI$Parser.checkChars(URI.java:3002)
at java.net.URI$Parser.parseHierarchical(URI.java:3086)
at java.net.URI$Parser.parse(URI.java:3034)
at java.net.URI.<init>(URI.java:595)
at com.google.api.client.http.GenericUrl.<init>(GenericUrl.java:100)
... 6 more
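The crash comes from building a java.net.URI from a raw string whose path still contains an illegal '}'. One possible fix (a sketch, not necessarily how GSFile should do it) is the multi-argument URI constructor, which percent-encodes characters that are illegal in a path:

```java
import java.net.URI;
import java.net.URISyntaxException;

// Sketch: build the request URL from components; the multi-argument URI
// constructor quotes illegal path characters (such as '{' and '}')
// instead of throwing URISyntaxException like the single-string parse.
class SafeUrl {
    static String encode(String host, String path) throws URISyntaxException {
        return new URI("http", host, path, null).toASCIIString();
    }
}
```

For example, encode("storage.googleapis.com", "/bucket/dt=${hiveconf:DAY}/000000_0.gz") yields a URL with the braces encoded as %7B and %7D.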
Prefer:
foo -> bar/foo
Instead of:
foo -> bar/./foo
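A sketch of the segment cleanup that would print bar/foo instead of bar/./foo (handles relative paths, matching the example; names are illustrative):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch: drop "." segments and resolve ".." when displaying copy
// targets, so "bar/./foo" is shown as "bar/foo".
class Normalize {
    static String normalize(String path) {
        Deque<String> parts = new ArrayDeque<>();
        for (String seg : path.split("/")) {
            if (seg.isEmpty() || seg.equals(".")) {
                continue;                  // skip empty and "." segments
            } else if (seg.equals("..") && !parts.isEmpty() && !parts.peekLast().equals("..")) {
                parts.removeLast();        // ".." cancels the previous segment
            } else {
                parts.addLast(seg);
            }
        }
        return String.join("/", parts);
    }
}
```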
Ran dctc mkdir s3://dctc-thomas with no warning or error messages.
==> Tested dctc cp s3://aws_account2@dataiku-emr/piggybank.jar s3://dctc-thomas and got:
done: 0,00/0m1s in 339,21k - 0/1 files done - 0,00Bps - 1 transfer(s) running.Exception in thread "Thread-2" Status Code: 404, AWS Service: Amazon S3, AWS Request ID: 897A547412FFA2D7, AWS Error Code: NoSuchBucket, AWS Error Message: The specified bucket does not exist, S3 Extended Request ID: iL16jedxPvyYAQEV7g9KAfx2n6IQ5gz4AjsEnK/cN2YZTr5QqrvvkE/y/igjuRXm
at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:610)
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:310)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:164)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:2906)
at com.amazonaws.services.s3.AmazonS3Client.copyObject(AmazonS3Client.java:1235)
at com.dataiku.dctc.file.S3File.directCopy(S3File.java:208)
at com.dataiku.dctc.copy.DirectCopyTaskRunnable.work(DirectCopyTaskRunnable.java:29)
at com.dataiku.dctc.copy.CopyTaskRunnable.run(CopyTaskRunnable.java:54)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
at java.lang.Thread.run(Thread.java:680)
done: 0,00/6m33s in 339,21k - 0/1 files done - 0,00Bps - 1 transfer(s) running.
Stopped after 6 minutes
==> then tested dctc ls s3://dctc-thomas and got dctc ls: ERROR: `s3://dctc-thomas' failed: Not Found