hrbrmstr / drill-url-tools

A set of Apache Drill UDFs for working with URLs

License: MIT License

Makefile 8.86% Java 91.14%
apache-drill drill udf url-parser

drill-url-tools

A set of Apache Drill UDFs for working with URLs

It uses the galimatias Java library for the parsing.

UDFs

The following UDFs are included:

  • url_parse(url-string): Given a URL/URI string input, returns a map containing a set of fields (url, scheme, username, password, host, port, path, query, and fragment).
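
To make the field list above concrete, here is a minimal sketch of how a URL breaks down into those fields. It is not the UDF's code: it uses the JDK's java.net.URI as a stand-in for the galimatias parser, and the class name UrlFieldsSketch, the helper parseUrl, and the default-port rule for http/https are assumptions made for illustration. The empty username and null password mirror the example output later in this README.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;

public class UrlFieldsSketch {

    // Hypothetical helper (not part of drill-url-tools): splits a URL into
    // the field names that url_parse is documented to return.
    static Map<String, String> parseUrl(String url) {
        URI uri = URI.create(url); // JDK parser, standing in for galimatias
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("url", url);
        fields.put("scheme", uri.getScheme());

        // userinfo is "user" or "user:password"; absent parts stay ""/null
        String username = "";
        String password = null;
        String userInfo = uri.getUserInfo();
        if (userInfo != null) {
            int colon = userInfo.indexOf(':');
            username = colon < 0 ? userInfo : userInfo.substring(0, colon);
            password = colon < 0 ? null : userInfo.substring(colon + 1);
        }
        fields.put("username", username);
        fields.put("password", password);
        fields.put("host", uri.getHost());

        // Assumption: fall back to the scheme's well-known port when none is given
        int port = uri.getPort();
        if (port == -1) {
            if ("https".equals(uri.getScheme())) port = 443;
            else if ("http".equals(uri.getScheme())) port = 80;
        }
        fields.put("port", port == -1 ? null : String.valueOf(port));

        fields.put("path", uri.getPath());
        fields.put("query", uri.getQuery());
        fields.put("fragment", uri.getFragment());
        return fields;
    }

    public static void main(String[] args) {
        parseUrl("https://www.test.url/something/a.cgi?first=no#frag")
            .forEach((k, v) -> System.out.println(k + "\t" + v));
    }
}
```

java.net.URI and galimatias follow different parsing rules, so edge cases (internationalized hosts, malformed input) may differ from what the UDF returns.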

Building

Retrieve the dependencies and build the UDF:

make deps
make udf

To automatically install it locally, ensure DRILL_HOME is set (the Makefile has a default of /usr/local/drill) and run:

make install

Assuming you're running in standalone mode, you can then do:

make restart

Alternatively, after a successful build, you can manually copy the following files to your $DRILL_HOME/jars/3rdparty directory and then restart Drill yourself:

  • target/drill-url-tools-1.0.jar
  • target/drill-url-tools-1.0-sources.jar
  • deps/galimatias-0.2.0.jar
  • deps/icu4j-53.1.jar

Example

Using the following query:

SELECT
  a.url AS url,
  a.rec.scheme AS scheme,
  a.rec.username AS username,
  a.rec.password AS password,
  a.rec.host AS host,
  a.rec.port AS port,
  a.rec.path AS path,
  a.rec.query AS query,
  a.rec.fragment AS fragment
FROM
  (SELECT url, url_parse(url) AS rec
  FROM
    (SELECT 'https://www.test.url/something/a.cgi?first=no#frag' AS url
     FROM (VALUES((1))))) a;

Here's the output:

$ drill-conf
apache drill 1.14.0-SNAPSHOT
"this isn't your grandfather's sql"
0: jdbc:drill:> !set outputFormat vertical
0: jdbc:drill:> SELECT
. . . . . . . >   a.url AS url,
. . . . . . . >   a.rec AS rec
. . . . . . . > FROM
. . . . . . . >   (SELECT url, url_parse(url) AS rec
. . . . . . . >   FROM
. . . . . . . >     (SELECT 'https://www.test.url/something/a.cgi?first=no#frag' AS url
. . . . . . . >      FROM (VALUES((1))))) a;
url       https://www.test.url/something/a.cgi?first=no#frag
scheme    https
username
password  null
host      www.test.url
port      443
path      /something/a.cgi
query     first=no
fragment  frag

1 row selected (0.126 seconds)

drill-url-tools's Issues

Unexpected result for URLs containing "--" in the third and fourth positions of any domain level

Examples
`
apache drill> select url_parse('http://www.gooogle.com.uk') /* OK */;
+----------------------------------------------------------------------------------+
| EXPR$0 |
+----------------------------------------------------------------------------------+
| {"scheme":"http","username":"","host":"www.gooogle.com.uk","port":"80","path":"/"} |
+----------------------------------------------------------------------------------+

apache drill> select url_parse('http://www.go--oogle.com.uk') /* NOK */;
+--------+
| EXPR$0 |
+--------+
| {} |
+--------+

apache drill> select url_parse('http://www.g--ooogle.com.uk') /* OK */;
+----------------------------------------------------------------------------------+
| EXPR$0 |
+----------------------------------------------------------------------------------+
| {"scheme":"http","username":"","host":"www.g--ooogle.com.uk","port":"80","path":"/"} |
+----------------------------------------------------------------------------------+

apache drill> select url_parse('http://www.goo--ogle.com.uk') /* OK */;
+----------------------------------------------------------------------------------+
| EXPR$0 |
+----------------------------------------------------------------------------------+
| {"scheme":"http","username":"","host":"www.goo--ogle.com.uk","port":"80","path":"/"} |
+----------------------------------------------------------------------------------+

apache drill> select url_parse('http://www.gooogle.co--m.uk') /* NOK */ ;
+--------+
| EXPR$0 |
+--------+
| {} |
+--------+

apache drill> select url_parse('http://tv--portal.blogspot.de/2011/07/tvn-en-vivo-panama.html') /* NOK */;
+--------+
| EXPR$0 |
+--------+
| {} |
+--------+
`
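
The pattern in the examples above can be checked mechanically. The sketch below is a hypothetical helper (HyphenLabelCheck and hasHyphensInThirdAndFourth are names made up for illustration, not part of drill-url-tools) that flags hosts in which any dot-separated label has "-" in both its third and fourth positions. IDNA validation rules reserve that position for prefixes such as "xn--", so the empty results may come from the host failing IDNA validation in the parsing library. That cause is an assumption; the issue does not confirm it.

```java
public class HyphenLabelCheck {

    // Returns true if any dot-separated label of the host has '-' in both
    // the third and fourth character positions (indexes 2 and 3).
    static boolean hasHyphensInThirdAndFourth(String host) {
        for (String label : host.split("\\.")) {
            if (label.length() >= 4
                    && label.charAt(2) == '-'
                    && label.charAt(3) == '-') {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Hosts taken from the examples in this issue
        String[] hosts = {
            "www.gooogle.com.uk",
            "www.go--oogle.com.uk",
            "www.g--ooogle.com.uk",
            "www.goo--ogle.com.uk",
            "www.gooogle.co--m.uk",
            "tv--portal.blogspot.de"
        };
        for (String host : hosts) {
            System.out.println(host + " -> " + hasHyphensInThirdAndFourth(host));
        }
    }
}
```

The hosts this check flags are exactly the ones marked NOK above, which is consistent with the issue title.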

Bad uses of reallocIfNeeded

The source contains calls of the form
buffer.reallocIfNeeded(...);
instead of
buffer = buffer.reallocIfNeeded(...);

Because the return value is discarded, the size of the buffer is unchanged and problems may appear. Some of these problems are currently caught by the try/catch in the code, but they still produce bad results.
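
The difference between the two call forms can be illustrated without Drill on the classpath. In the sketch below, Buf is a hypothetical stand-in for a Drill buffer, written on the assumption (implied by this issue) that reallocIfNeeded returns a possibly new, larger buffer and leaves the original untouched. The method names writeDiscardingResult and writeKeepingResult are also made up for illustration.

```java
public class ReallocSketch {

    // Hypothetical stand-in for a fixed-capacity buffer; not Drill's DrillBuf.
    static final class Buf {
        private final byte[] data;

        Buf(int capacity) { data = new byte[capacity]; }

        int capacity() { return data.length; }

        // Returns this buffer if it is large enough, otherwise a NEW larger
        // buffer. The receiver itself is never resized.
        Buf reallocIfNeeded(int size) {
            return size <= data.length ? this : new Buf(size);
        }

        void setBytes(byte[] src) {
            if (src.length > data.length) {
                throw new IndexOutOfBoundsException("need " + src.length + ", have " + data.length);
            }
            System.arraycopy(src, 0, data, 0, src.length);
        }
    }

    // Buggy form: the larger buffer is dropped, so the write can overflow.
    static boolean writeDiscardingResult(Buf buffer, byte[] value) {
        buffer.reallocIfNeeded(value.length);
        try {
            buffer.setBytes(value);
            return true;
        } catch (IndexOutOfBoundsException e) {
            return false; // swallowed, as a try/catch in a UDF might do
        }
    }

    // Fixed form: keep the reference that reallocIfNeeded hands back.
    static boolean writeKeepingResult(Buf buffer, byte[] value) {
        buffer = buffer.reallocIfNeeded(value.length);
        buffer.setBytes(value);
        return true;
    }

    public static void main(String[] args) {
        byte[] host = "patriot-auto.ru".getBytes();
        System.out.println("discarding result: " + writeDiscardingResult(new Buf(4), host));
        System.out.println("keeping result:    " + writeKeepingResult(new Buf(4), host));
    }
}
```

In this sketch the discarded-result variant fails its write and hides the failure, which is the kind of silent bad result the issue describes.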

Unexpected result on big dataset

With a big dataset (more than 1,000,000 rows), url_parse produces unexpected results.
Example:
SELECT * FROM (SELECT R.Url, R.f.host AS host FROM (SELECT T.Url, url_parse(T.Url) AS f FROM ... AS T) AS R) WHERE host LIKE '80to%' LIMIT 1;
+----------------------------------------------------------------------------------+-----------------+
| Url | host |
+----------------------------------------------------------------------------------+-----------------+
| http://patriot-auto.ru/addnews.html | 80togh-pro.rozb |
+----------------------------------------------------------------------------------+-----------------+

However, parsing the same URL on its own returns the expected host:
SELECT t.p.host FROM (SELECT url_parse('http://patriot-auto.ru/addnews.html') AS p) AS t;
+-----------------+
| EXPR$0 |
+-----------------+
| patriot-auto.ru |
+-----------------+
