
dctc's People

Contributors

cstenac, fterrazzoni, fulmicoton, pbertin, rluta


Forkers

rluta, echiu64

dctc's Issues

Can't make ssh protocol work

Ran dctc ls ssh://[email protected]:/home/dataiku/ and got:

debug: SshFileBuilder: path[0]: test1.dataiku.com
debug: SshFileBuilder: path[1]: /home/dataiku/
debug: SshFileBuilder: user[0]: dataiku
debug: SshFileBuilder: user[1]: null
Exception in thread "main" java.lang.NoClassDefFoundError: com/jcraft/jsch/JSchException
at com.dataiku.dctc.file.SshFileBuilder.buildFile(SshFileBuilder.java:69)
at com.dataiku.dctc.file.FileBuilder.buildFile(FileBuilder.java:59)
at com.dataiku.dctc.command.Command.build(Command.java:159)
at com.dataiku.dctc.command.Command.getArgs(Command.java:129)
at com.dataiku.dctc.command.Ls.perform(Ls.java:35)
at com.dataiku.dctc.Main.main(Main.java:144)
Caused by: java.lang.ClassNotFoundException: com.jcraft.jsch.JSchException
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
... 6 more

Same issue with dctc cp.

Hadoop configuration

Hadoop parameters could be handled the following way (a sketch follows the list):

  • execute hadoop classpath to retrieve the Hadoop classpath
    (perhaps cache the result)
  • look up the Hadoop jars / core config in this classpath
    (perhaps cache the result)
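
A minimal sketch of that approach, assuming a hypothetical HadoopClasspath helper (the class and method names are illustrative, not dctc's actual API):

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.IOException;
    import java.io.InputStreamReader;

    // Hypothetical helper: run `hadoop classpath` once and cache the result.
    public class HadoopClasspath {
        private static String cached; // cache so the hadoop script is forked only once per run

        public static synchronized String get() throws IOException, InterruptedException {
            if (cached == null) {
                Process p = new ProcessBuilder("hadoop", "classpath").start();
                BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()));
                cached = r.readLine(); // `hadoop classpath` prints a single line
                p.waitFor();
            }
            return cached;
        }

        // Look up the Hadoop jars / core config directories listed in that classpath.
        public static String[] entries() throws IOException, InterruptedException {
            return get().split(File.pathSeparator);
        }
    }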

Syncing from an FTP directory with a large number of files times out

Running dctc sync ftp://noaa@pub/data/gsod/2013 . fails with java.io.IOException: Fail Connect: Connection closed without indication.

The directory contains 12K+ files, and dctc stays stuck on [0] [main] [INFO ] [dctc.sync] - Check ftp://anonymous:*****@ftp.ncdc.noaa.gov/pub/data/gsod/2013 for a very long time before failing.

cp of an FTP file doesn't work

Issuing dctc cp ftp://noaa@pub/data/gsod/ish-history.csv . gives the following error:
dctc cp: ERROR: Omitting directory `ftp://ftp.ncdc.noaa.gov/pub/data/gsod/ish-history.csv'

(the file exists and can be read via dctc head ftp://noaa@pub/data/gsod/ish-history.csv; the configuration is set in .dctcrc)

Weird error message on "cat folder" on S3

 dctc cat s3://dctc-test      
Exception in thread "main" com.amazonaws.AmazonClientException: Unable to unmarshall response (null)
        at com.amazonaws.http.AmazonHttpClient.handleResponse(AmazonHttpClient.java:551)
        at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:291)
        at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:164)
        at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:2906)
        at com.amazonaws.services.s3.AmazonS3Client.getObject(AmazonS3Client.java:885)
        at com.dataiku.dctc.file.S3File.inputStream(S3File.java:183)
        at com.dataiku.dctc.AutoGZip.buildInput(AutoGZip.java:25)
        at com.dataiku.dctc.command.Cat.print(Cat.java:55)
        at com.dataiku.dctc.command.Cat.perform(Cat.java:33)
        at com.dataiku.dctc.command.Command.perform(Command.java:51)
        at com.dataiku.dctc.Main.main(Main.java:144)
Caused by: java.lang.NullPointerException
        at com.amazonaws.services.s3.internal.ServiceUtils.isMultipartUploadETag(ServiceUtils.java:81)
        at com.amazonaws.services.s3.internal.S3ObjectResponseHandler.handle(S3ObjectResponseHandler.java:49)
        at com.amazonaws.services.s3.internal.S3ObjectResponseHandler.handle(S3ObjectResponseHandler.java:31)
        at com.amazonaws.http.AmazonHttpClient.handleResponse(AmazonHttpClient.java:530)
        ... 10 more

Bad error message when wrong FTP port given

./dctc.sh ls ftp://a:a            
Exception in thread "main" java.lang.NumberFormatException: For input string: "a"
        at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
        at java.lang.Integer.parseInt(Integer.java:449)
        at java.lang.Short.parseShort(Short.java:120)
        at java.lang.Short.parseShort(Short.java:78)
        at com.dataiku.dctc.file.FTPFileBuilder.build(FTPFileBuilder.java:68)
        at com.dataiku.dctc.file.FTPFileBuilder.buildFile(FTPFileBuilder.java:44)
        at com.dataiku.dctc.file.FTPFileBuilder.buildFile(FTPFileBuilder.java:1)
        at com.dataiku.dctc.file.FileBuilder.buildFile(FileBuilder.java:59)
        at com.dataiku.dctc.command.Command.build(Command.java:159)
        at com.dataiku.dctc.command.Command.getArgs(Command.java:129)
        at com.dataiku.dctc.command.Ls.perform(Ls.java:35)
        at com.dataiku.dctc.Main.main(Main.java:144)
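
A sketch of friendlier port handling, assuming a hypothetical helper (this is not the actual FTPFileBuilder code, and the error wording is illustrative):

    // Hypothetical sketch: parse the port part of ftp://host:port and fail
    // with a readable message instead of a raw NumberFormatException.
    class FtpPortParser {
        static int parsePort(String portPart, String uri) {
            try {
                return Integer.parseInt(portPart);
            } catch (NumberFormatException e) {
                throw new IllegalArgumentException(
                    "dctc: invalid FTP port `" + portPart + "' in `" + uri + "': expected a number", e);
            }
        }
    }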

ftp://localhost:/path/to/some/file

A bug has been introduced with the latest modification of the file builder.
Before, the two following syntaxes were accepted:

ftp://localhost/path/to/some/file
ftp://localhost:/path/to/some/file

Now only the first syntax is accepted. It's a real bug when the default_path option is set in the configuration file. Indeed, if we set this option to /a/real, this:

ftp://localhost:path/to/some/file

becomes:

ftp://localhost:/a/real/path/to/some/file

while the following stays unchanged:

ftp://localhost:/path/to/some/file

This syntax allows the usage of default_path for FTP (#32) and follows the syntax used for SSH.

The syntax commonly used for FTP must also be supported so as not to disturb the user.
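
A small sketch of the expected resolution, using an illustrative helper (not dctc's actual code): a path after host: that does not start with / gets default_path prepended, while an absolute path stays untouched.

    // Hypothetical sketch of the expected resolution for FTP paths.
    class FtpPathResolver {
        static String resolve(String pathAfterColon, String defaultPath) {
            if (pathAfterColon.isEmpty() || pathAfterColon.startsWith("/")) {
                return pathAfterColon;                 // ftp://host:/abs/path stays absolute
            }
            return defaultPath + "/" + pathAfterColon; // ftp://host:rel/path gets default_path prepended
        }
    }
    // e.g. resolve("path/to/some/file", "/a/real") -> "/a/real/path/to/some/file"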

Integration with Dropbox

Support for Dropbox as a new protocol would be great in dctc (especially if integrated with standalone products).

Aliases

Mighty cool feature!

Now, we need to:

  • Add at least one to the default autogenerated conf. Nothing works better than self-documentation. I would recommend setting ls = ls -G :) (an example follows this list)
  • Document it at the top of the "commands" page (check out the gh-pages branch, edit, push, and the website will update)
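
For reference, a hedged example of what such a default entry could look like in the autogenerated conf (the [alias] section name is a guess; the actual .dctcrc layout may differ):

    [alias]
    ls = ls -G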

ssh: cp produces strange path and issues with no warning

I just did a:

dctc cp ~/Downloads/SFPD_Incidents_-_Previous_Three_Months.csv ssh://test2@/path/to/data

and got:
Weird path encountered: test2.dataiku.com/path/to/data/path/to/data
done: 5,40M/5,40M in 0m1s - 1/1 files done - 819,20kBps - 0 transfer(s) running.
Copied 5,40M in 0m1s

The target directory path is duplicated, and when listing the target directory, nothing is there...

No help on add-account

Issuing "dctc add-account --help" does not give anything, should explain what it should be used for

Templates for configurations

Right now I'm stuck when trying to define an FTP or an SSH connection in .dctcrc (trying to follow the doc).
Maybe:
-> .dctcrc should contain templates for configuring the different types of protocols
-> the doc should reflect how to use the dctc commands on a remote machine
For example, in the doc, in the SCP/SFTP section:
"
On the command line
Connect using ssh://user@host:/absolute/path or ssh://user@host:path/relative/to/homedir
"
Then how do I run dctc commands?

Copy to a new bucket fails for some files

I smell a very nasty synchronization bug. Like, we create the bucket using out.mkpath() in each CopyTaskRunnable, and it fails because we try to create the same bucket several times (a possible guard is sketched after the log below).

7:58 clement@MacBook-Air-de-Clement ~/code/dctc% ./dctc-eclipse.sh cp -r dist s3://dctc-dist
[0] [main] [INFO ] [dctc.command]  - Add task local:///Users/clement/code/dctc/dist/dataiku-core.jar
[1] [main] [INFO ] [dctc.command]  - Add task local:///Users/clement/code/dctc/dist/dataiku-dctc.jar
[1] [main] [INFO ] [dctc.command]  - Add task local:///Users/clement/code/dctc/dist/dctc-tool.jar
done: 5,20M/8,45M in 0m1s - 0/3 files done - 0,00Bps - 3 transfer(s) running.Exception in thread "Thread-2" Status Code: 404, AWS Service: Amazon S3, AWS Request ID: 8B34469723ED844A, AWS Error Code: NoSuchBucket, AWS Error Message: The specified bucket does not exist, S3 Extended Request ID: nryRkPhMQHb6ypS4sC6MQ2fUdus15sHrHWxqkFG5VooWRHY8AijlfI7RYD1Ya/UO
        at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:610)
        at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:310)
        at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:164)
        at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:2906)
        at com.amazonaws.services.s3.AmazonS3Client.initiateMultipartUpload(AmazonS3Client.java:2157)
        at com.amazonaws.services.s3.model.AmazonS3OutputStream.<init>(AmazonS3OutputStream.java:18)
        at com.dataiku.dctc.file.S3File.outputStream(S3File.java:222)
        at com.dataiku.dctc.copy.DirectCopyTaskRunnable.work(DirectCopyTaskRunnable.java:46)
        at com.dataiku.dctc.copy.CopyTaskRunnable.run(CopyTaskRunnable.java:56)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
        at java.lang.Thread.run(Thread.java:680)
[884] [Thread-4] [INFO ] [org.apache.http.impl.client.DefaultHttpClient]  - I/O exception (java.net.SocketException) caught when processing request: Connection reset
[884] [Thread-4] [INFO ] [org.apache.http.impl.client.DefaultHttpClient]  - Retrying request
done: 5,20M/8,45M in 0m2s - 0/3 files done - 1,56MBps - 3 transfer(s) running.Exception in thread "Thread-4" Status Code: 404, AWS Service: Amazon S3, AWS Request ID: 8B84BCA2058A65D9, AWS Error Code: NoSuchBucket, AWS Error Message: The specified bucket does not exist, S3 Extended Request ID: q6F98s3KNcTO3Lcsf+6+mYG3Juh+I/2yIMpz06AU3NJC+ZXuEc6cgfupoQZHh9O3
        at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:610)
        at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:310)
        at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:164)
        at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:2906)
        at com.amazonaws.services.s3.AmazonS3Client.abortMultipartUpload(AmazonS3Client.java:2066)
        at com.amazonaws.services.s3.model.AmazonS3OutputStream.send(AmazonS3OutputStream.java:57)
        at com.amazonaws.services.s3.model.AmazonS3OutputStream.write(AmazonS3OutputStream.java:28)
        at java.io.OutputStream.write(OutputStream.java:99)
        at com.dataiku.dctc.copy.DirectCopyTaskRunnable.work(DirectCopyTaskRunnable.java:68)
        at com.dataiku.dctc.copy.CopyTaskRunnable.run(CopyTaskRunnable.java:56)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
        at java.lang.Thread.run(Thread.java:680)
done: 5,20M/8,45M in 0m27s - 1/3 files done - 0,00Bps - 2 transfer(s) running.^Czsh: exit 130   ./dctc-eclipse.sh cp -r dist s3://dctc-dist
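
If the race described above is the cause, a possible mitigation sketch is to make the bucket/path creation happen exactly once across the copy threads instead of calling out.mkpath() from every CopyTaskRunnable (illustrative code, not dctc's actual classes):

    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical guard: ensure a given bucket/path is created exactly once,
    // even when many copy threads ask for it concurrently; other threads asking
    // for the same path block until the first creation has completed.
    class PathCreator {
        private final ConcurrentHashMap<String, Boolean> created = new ConcurrentHashMap<>();

        void mkpathOnce(String path, Runnable mkpath) {
            created.computeIfAbsent(path, p -> {
                mkpath.run(); // e.g. the existing out.mkpath() call
                return Boolean.TRUE;
            });
        }
    }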

Transfer log on Windows

When using the "cp" command in a Windows terminal, a new line is printed every second to update the transfer status, instead of the current line being updated in place.

tail fails on s3 file

dctc head s3://path/to/my/file.pig works but not dctc tail s3://path/to/my/file.pig:
Exception in thread "main" java.lang.NullPointerException
at java.io.Reader.<init>(Reader.java:61)
at java.io.InputStreamReader.<init>(InputStreamReader.java:80)
at org.apache.commons.io.IOUtils.copy(IOUtils.java:1435)
at com.dataiku.dctc.command.Tail.performLine(Tail.java:84)
at com.dataiku.dctc.command.Tail.perform(Tail.java:55)
at com.dataiku.dctc.command.Command.perform(Command.java:51)
at com.dataiku.dctc.Main.main(Main.java:144)

local vs. file

The commit d96bfe2 added the ability for dctc to handle local://path and file://path. This naming had already been changed by b4afc605.

Do we really need two namings for local files, when we don't really use them (I prefer to write path instead of local://path)?

dctc ls ssh://user:pass@10.1.204.7:/ -V fails

Tested on my Mac (Mac OS X)

dctc ls: ERROR: `ssh://10.1.204.7/' failed: dctc Sshfile: failed: /
java.io.IOException: dctc Sshfile: failed: /
at com.dataiku.dctc.file.SshFile.exec(SshFile.java:458)
at com.dataiku.dctc.file.SshFile.resolve(SshFile.java:435)
at com.dataiku.dctc.file.SshFile.exists(SshFile.java:221)
at com.dataiku.dctc.command.Ls.perform(Ls.java:150)
at com.dataiku.dctc.command.Ls.perform(Ls.java:40)
at com.dataiku.dctc.Main.main(Main.java:157)
Caused by: com.jcraft.jsch.JSchException: java.lang.ArrayIndexOutOfBoundsException: 766
at com.jcraft.jsch.KnownHosts.setKnownHosts(KnownHosts.java:171)
at com.jcraft.jsch.KnownHosts.setKnownHosts(KnownHosts.java:60)
at com.jcraft.jsch.JSch.setKnownHosts(JSch.java:269)
at com.dataiku.dctc.file.SshFile.openSessionAndResolveHome(SshFile.java:114)
at com.dataiku.dctc.file.SshFile.connect(SshFile.java:152)
at com.dataiku.dctc.file.SshFile.exec(SshFile.java:451)
... 5 more
Caused by: java.lang.ArrayIndexOutOfBoundsException: 766
at com.jcraft.jsch.Util.fromBase64(Util.java:51)
at com.jcraft.jsch.KnownHosts.setKnownHosts(KnownHosts.java:157)
... 10 more

URL encoding issue in GS

Will copy 15724 file(s).
done: 12.58G/194.17G in 2m51s - 1124/15724 files done - 72.26MBps - 24 transfer(s) running.Exception in thread "Thread-14" java.lang.IllegalArgumentException: java.net.URISyntaxException: Illegal character in path at index 102: http://storage.googleapis.com/cds-dp-bck-eu/dtm_fact_omn_daily_tracking_visitors/dt=$%7Bhiveconf%3ADAY}/000000_0.gz
        at com.google.api.client.http.GenericUrl.<init>(GenericUrl.java:102)
        at com.dataiku.dctc.file.GSFile.copy(GSFile.java:224)
        at com.dataiku.dctc.copy.DirectCopyTaskRunnable.work(DirectCopyTaskRunnable.java:79)
        at com.dataiku.dctc.copy.CopyTaskRunnable.run(CopyTaskRunnable.java:51)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:722)
Caused by: java.net.URISyntaxException: Illegal character in path at index 102: http://storage.googleapis.com/cds-dp-bck-eu/dtm_fact_omn_daily_tracking_visitors/dt=$%7Bhiveconf%3ADAY}/000000_0.gz
        at java.net.URI$Parser.fail(URI.java:2829)
        at java.net.URI$Parser.checkChars(URI.java:3002)
        at java.net.URI$Parser.parseHierarchical(URI.java:3086)
        at java.net.URI$Parser.parse(URI.java:3034)
        at java.net.URI.<init>(URI.java:595)
        at com.google.api.client.http.GenericUrl.<init>(GenericUrl.java:100)
        ... 6 more
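
A hedged sketch of the likely fix: percent-encode each segment of the object path before building the GenericUrl, so characters such as } and $ no longer break java.net.URI (the helper name is illustrative):

    import java.io.UnsupportedEncodingException;
    import java.net.URLEncoder;

    // Hypothetical helper: encode each segment of a GS object path so the
    // resulting URL parses as a valid java.net.URI ('}' -> %7D, '$' -> %24, ...).
    class GsPathEncoder {
        static String encodeObjectPath(String objectPath) throws UnsupportedEncodingException {
            StringBuilder sb = new StringBuilder();
            for (String segment : objectPath.split("/", -1)) {
                if (sb.length() > 0) sb.append('/');
                sb.append(URLEncoder.encode(segment, "UTF-8").replace("+", "%20"));
            }
            return sb.toString();
        }
    }
    // e.g. encodeObjectPath("dt=${hiveconf:DAY}/000000_0.gz")
    //   -> "dt%3D%24%7Bhiveconf%3ADAY%7D/000000_0.gz"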

Issue with mkdir on s3

Ran dctc mkdir s3://dctc-thomas with no warning or error messages.

==> Tested dctc cp s3://aws_account2@dataiku-emr/piggybank.jar s3://dctc-thomas and got:
done: 0,00/0m1s in 339,21k - 0/1 files done - 0,00Bps - 1 transfer(s) running.Exception in thread "Thread-2" Status Code: 404, AWS Service: Amazon S3, AWS Request ID: 897A547412FFA2D7, AWS Error Code: NoSuchBucket, AWS Error Message: The specified bucket does not exist, S3 Extended Request ID: iL16jedxPvyYAQEV7g9KAfx2n6IQ5gz4AjsEnK/cN2YZTr5QqrvvkE/y/igjuRXm
at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:610)
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:310)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:164)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:2906)
at com.amazonaws.services.s3.AmazonS3Client.copyObject(AmazonS3Client.java:1235)
at com.dataiku.dctc.file.S3File.directCopy(S3File.java:208)
at com.dataiku.dctc.copy.DirectCopyTaskRunnable.work(DirectCopyTaskRunnable.java:29)
at com.dataiku.dctc.copy.CopyTaskRunnable.run(CopyTaskRunnable.java:54)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
at java.lang.Thread.run(Thread.java:680)
done: 0,00/6m33s in 339,21k - 0/1 files done - 0,00Bps - 1 transfer(s) running.

Stopped after 6 minutes

==> then tested dctc ls s3://dctc-thomas and got dctc ls: ERROR: `s3://dctc-thomas' failed: Not Found
