ak-coram / cl-duckdb

Common Lisp CFFI wrapper around the DuckDB C API

Home Page: https://github.com/ak-coram/cl-duckdb/blob/main/README.org

License: MIT License

Common Lisp 100.00%
c-bindings common-lisp data-science duckdb lisp olap parquet sql

cl-duckdb's Issues

About that NULL handling again…

I have thought about this for a while and have now reached an opinion: NULL should always be represented by :NULL on the CL side. Let me explain why.

  1. NULL values clash with booleans and with everything represented by lists on the CL side. An empty list is not always the same as NULL.
  2. Requiring the user to write a specific query to distinguish NULL values causes problems when building libraries on top of duckdb (leaky abstraction).
  3. It is simply more convenient for the user: one less thing to remember while using the package.
  4. It is a solution already well proven by postmodern.

This is, of course, just my personal opinion.
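To make point 1 concrete, here is a minimal sketch in plain Common Lisp with no cl-duckdb calls. It assumes NULL would otherwise come back as NIL, and `describe-value` is a hypothetical helper made up for this illustration, not part of the library.

```lisp
;; Hypothetical helper, not part of cl-duckdb: classify a value read from a
;; result set when SQL NULL is represented by the keyword :NULL.
(defun describe-value (value)
  (cond ((eq value :null) :sql-null)
        ((null value) :false-or-empty-list)
        (t :other)))

;; In Common Lisp, false and the empty list are the same object (NIL), so a
;; NULL that is also NIL cannot be told apart from either of them.
(describe-value :null) ; => :SQL-NULL
(describe-value nil)   ; => :FALSE-OR-EMPTY-LIST
(describe-value '())   ; => :FALSE-OR-EMPTY-LIST
```

With a dedicated :NULL marker the three-way distinction survives without a special query.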

Remote parquet data sources?

The following example of loading a parquet file uses a file on local disk:

(ddb:initialize-default-connection)
(ddb:query "SELECT * FROM '~/Downloads/yellow_tripdata_2023-01.parquet' LIMIT 5" nil)

The duckdb FROM clause doesn't seem to offer http download as an option.

Edit: There is HTTP Parquet Import, but it requires the httpfs extension to be installed. Is there a way to do that from cl-duckdb?
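One possible approach, sketched under assumptions: DuckDB exposes extension management as SQL (`INSTALL httpfs` and `LOAD httpfs`), so the statements could be sent through the same `ddb:query` call used above. The helper name `remote-parquet-statements` and the example URL are hypothetical, and whether the extension can actually be downloaded depends on the DuckDB build and on network access.

```lisp
;; Hypothetical helper: build the SQL statements needed to read a remote
;; Parquet file. INSTALL/LOAD and read_parquet are DuckDB SQL; the URL is
;; supplied by the caller and must not contain a single quote.
(defun remote-parquet-statements (url &key (limit 5))
  (list "INSTALL httpfs"
        "LOAD httpfs"
        (format nil "SELECT * FROM read_parquet('~a') LIMIT ~d" url limit)))

;; With cl-duckdb loaded, each statement could be passed to DDB:QUERY in
;; the same way as the local-file example (untested assumption):
;; (dolist (sql (remote-parquet-statements "https://example.com/data.parquet"))
;;   (ddb:query sql nil))
```

Keeping the statement list separate from the connection makes it easy to inspect what would be run before touching the network.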

Another thing to consider is a wrapper, similar to what Lisp-Stat does in read-csv, that handles multiple stream protocols so the user does not have to be concerned with the location of the source. The I/O simplification is where the real work lies.

Figure out a way to remove trivial-with-current-source

@snunez1: It looks like our witch-hunt based on licensing can never end! This is also a non-permissively licensed dependency pulled in by esrap which itself is pulled in by local-time-duration. It looks like they have a single call to it here:

https://github.com/enaeher/local-time-duration/blob/fa20a4a03a1ee076eada649796e2f2345c930c21/iso8601.lisp#L156

Maybe we can resolve it with the project itself (it is licensed under MIT itself), but it hasn't seen any activity for 5 years now.

@enaeher: Hello! Would you be so kind as to chime in on this? Your project includes a transitive dependency under the LGPLv3, which is more restrictive than the MIT license your project itself is licensed under. We'd like to keep using local-time-duration for representing duration values that come out of DuckDB (see this project for details): would you be open to a discussion on how we could resolve this issue? Thank you!

Querying Lisp Data-frames?

How much effort would it be for cl-duckdb to query Lisp-Stat data frames? They are a hashtable of vectors, and from the duckdb documentation it looks as though they might be suitable for direct query scanning, in the same way that Pandas, Julia and R data frames can be scanned.
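As a starting point short of direct scanning, a hashtable of vectors can be reshaped into the alist of (name . vector) pairs that with-static-table takes in the README example. This is only a sketch: `columns-table-to-alist` is a hypothetical helper, and the assumption that column names are strings or symbols is mine, not a statement about how Lisp-Stat stores its data frames.

```lisp
;; Hypothetical helper: turn a hash table of column-name -> vector into an
;; alist of (name . vector) pairs. The vectors themselves are not copied.
(defun columns-table-to-alist (table)
  (let ((columns '()))
    (maphash (lambda (name column)
               (push (cons (string name) column) columns))
             table)
    ;; Hash tables are unordered, so sort by name for a stable column order.
    (sort columns #'string< :key #'car)))

;; Example with a tiny two-column "frame":
(let ((frame (make-hash-table :test #'equal)))
  (setf (gethash "p" frame) (vector 2 3 5)
        (gethash "i" frame) (vector 1 2 3))
  (columns-table-to-alist frame))
```

The vectors would still need element types that the static table code accepts.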

Phase out trivia dependency

Trivia is (was?) a transitive dependency, possibly via ironclad via uuid. It's definitely not from ascii-table, spark, or let-plus. Mostly this is a reminder to check. (Is there a way to print the dependency tree with ASDF? That would be quite useful.)
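On the ASDF question, a rough sketch is possible with `asdf:find-system` and `asdf:system-depends-on`. The two function names below are hypothetical helpers. The sketch only follows `:depends-on` entries (not `:defsystem-depends-on`), prints shared subtrees once per parent, and silently stops at systems ASDF cannot find.

```lisp
(require :asdf)

;; Extract the system name from a dependency spec, which may be a plain
;; name or a list such as (:version "name" "1.0") or (:feature expr spec).
(defun dependency-name (spec)
  (cond ((atom spec) (asdf:coerce-name spec))
        ((eq (first spec) :feature) (dependency-name (third spec)))
        (t (dependency-name (second spec)))))

;; Print SYSTEM and, indented beneath it, everything it depends on.
(defun print-dependency-tree (system &optional (depth 0)
                                               (stream *standard-output*))
  (let ((name (dependency-name system)))
    (format stream "~a~a~%"
            (make-string (* 2 depth) :initial-element #\Space)
            name)
    (let ((found (asdf:find-system name nil)))
      (when found
        (dolist (dependency (asdf:system-depends-on found))
          (print-dependency-tree dependency (1+ depth) stream))))))

;; Usage, assuming the system is visible to ASDF:
;; (print-dependency-tree "cl-duckdb")
```

Searching the printed output for "trivia" would then show which branch pulls it in.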

Type mapping

I've spent some time looking at how to easily map to/from data frames. Whilst of course anything is possible, the exercise is turning out to be a bit more painful than it perhaps needs to be. The main problem is in tacking on the duckdb type into the alist of column values in with-static-table. This is turning out to be somewhat unwieldy, and prevents us from easily manipulating the data alist from the CL side.

There is an analogous situation in Lisp-Stat when we load CSV files, and there we pass an alist that specifies the mapping for missing values. The problem we're trying to solve is that we have no way of knowing what the input might use as a missing value, nor what the user might want as a missing value in his data frame.

Should we consider an alist to map types in duckdb? Perhaps as a special variable that can be set globally for a system that's using cl-duckdb as a driver, or overridden in a let as required?

Here's a sketch of how that might work in the case of Lisp-Stat: I use column (vector) types defined in the generic system numerical-utilities, such as simple-double-float-vector, simple-fixnum-vector, simple-boolean-vector, etc. Using cl-duckdb as a driver, I want to map these to cl-duckdb, like so:

'((simple-double-float-vector . duckdb-double)
  (simple-single-float-vector . duckdb-real)
 ...)

You can see this used in delimited-text.lisp.

I may have over-engineered the description in this issue. Basically I'm trying to find a way to separate the data alist from the type information and more easily work with the data alist in CL.
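A minimal sketch of the special-variable idea, in plain Common Lisp. The variable `*column-type-map*`, the function `duckdb-type-for` and the symbol `duckdb-bigint` are hypothetical names chosen for illustration; only the two alist entries come from the example above, and none of this is existing cl-duckdb API.

```lisp
;; Hypothetical special variable: maps column vector types to DuckDB types.
(defvar *column-type-map*
  '((simple-double-float-vector . duckdb-double)
    (simple-single-float-vector . duckdb-real)))

;; The default is read at call time, so a LET rebinding takes effect.
(defun duckdb-type-for (column-type &optional (type-map *column-type-map*))
  "Return the DuckDB type mapped to COLUMN-TYPE, or NIL when unmapped."
  (cdr (assoc column-type type-map)))

(duckdb-type-for 'simple-double-float-vector) ; => DUCKDB-DOUBLE

;; Override locally by rebinding the special variable:
(let ((*column-type-map* (acons 'simple-fixnum-vector 'duckdb-bigint
                                *column-type-map*)))
  (duckdb-type-for 'simple-fixnum-vector))    ; => DUCKDB-BIGINT
```

This keeps the data alist free of type annotations: the mapping lives in one place and the driver consults it when building columns.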

Interval query test failure on ECL & ARM only

Something seems to be off with SQL intervals:

 QUERY-INTERVAL in DUCKDB []:
      TS
  evaluated to
      @3057-04-09T09:57:42.002001Z
  which is not
      LOCAL-TIME:TIMESTAMP=
  to
      @3057-04-09T10:57:42.002001Z

Phase out serapeum dependency

This library currently prevents loading cl-duckdb with Clasp as it doesn't compile:
clasp-developers/clasp#1365

Only serapeum:mvlet* and serapeum:count-cpus are currently used, which doesn't warrant the inclusion of such a large library dependency.
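For the mvlet* half, plain Common Lisp already covers the use case with nested `multiple-value-bind` and `let`. The sketch below is illustrative only: `quotient-remainder-total` is a made-up function, not code from cl-duckdb. As far as I know there is no standard equivalent for count-cpus, so that call would need a small implementation-specific replacement.

```lisp
;; With serapeum this could be written as:
;;   (mvlet* ((q r (floor n d))
;;            (total (+ q r)))
;;     (list q r total))
;; The same sequential binding in plain Common Lisp:
(defun quotient-remainder-total (n d)
  (multiple-value-bind (q r) (floor n d)
    (let ((total (+ q r)))
      (list q r total))))

(quotient-remainder-total 17 5) ; => (3 2 5)
```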

String vectors as columns in a query?

I am trying to create a database using the example in the README. The example is:

(ddb:initialize-default-connection) ; => #<DUCKDB::CONNECTION {10074E8BE3}>

;; Use vectors as columns in a query:
(let ((indexes (make-array '(10) :element-type '(unsigned-byte 8)
                                 :initial-contents '(1 2 3 4 5 6 7 8 9 10)))
      (primes (make-array '(10) :element-type '(unsigned-byte 8)
                                :initial-contents '(2 3 5 7 11 13 17 19 23 29))))
  (ddb:with-static-table ("primes" `(("i" . ,indexes)
                                     ("p" . ,primes)))
    (ddb:format-query "SELECT * FROM primes" nil)))

The code I'm trying to create should 'round trip' a data frame:

  1. Create a database and populate it from an alist
  2. Query with select * and get the original values back

In lisp-stat, the code is:

(ddb:with-static-table ("mtcars" (as-alist mtcars))
  (ddb:format-query "SELECT * from mtcars" nil))

mtcars is a common data set from R and loaded by default in lisp-stat. The first column is the model of the car, a string. However when the alist contains a vector of strings, I get:

#("Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive"
  "Hornet Sportabout" "Valiant" "Duster 360" "Merc 240D"
  "Merc 230" "Merc 280" "Merc 280C" "Merc 450SE" "Merc 450SL"
  "Merc 450SLC" "Cadillac Fleetwood" "Lincoln Continental"
  "Chrysler Imperial" "Fiat 128" "Honda Civic" "Toyota Corolla"
  "Toyota Corona" "Dodge Challenger" "AMC Javelin" "Camaro Z28"
  "Pontiac Firebird" "Fiat X1-9" "Porsche 914-2" "Lotus Europa"
  "Ford Pantera L" "Ferrari Dino" "Maserati Bora"
  "Volvo 142E") fell through ETYPECASE expression.
Wanted one of ((SIMPLE-ARRAY BIT)
               (SIMPLE-ARRAY (UNSIGNED-BYTE 8))
               (SIMPLE-ARRAY (SIGNED-BYTE 8))
               (SIMPLE-ARRAY (UNSIGNED-BYTE 16))
               (SIMPLE-ARRAY (SIGNED-BYTE 16))
               (SIMPLE-ARRAY (UNSIGNED-BYTE 32))
               (SIMPLE-ARRAY (SIGNED-BYTE 32))
               (SIMPLE-ARRAY (UNSIGNED-BYTE 64))
               (SIMPLE-ARRAY (SIGNED-BYTE 64))
               (SIMPLE-ARRAY SINGLE-FLOAT)
               (SIMPLE-ARRAY DOUBLE-FLOAT)).
   [Condition of type SB-KERNEL:CASE-FAILURE]

with the source of the error being:

Backtrace:
  0: (DUCKDB-API::MAKE-STATIC-COLUMN MODEL #("Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" "Hornet Sportabout" "Valiant" ...) NIL :LENGTH NIL)
      Locals:
        #:G0 = NIL
        #:G46 = NIL
        NAME = MODEL
        VALUES = #("Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" "Hornet Sportabout" "Valiant" ...)
  1: (DUCKDB-API:MAKE-STATIC-COLUMNS ((MODEL . #("Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" "Hornet Sportabout" "Valiant" ...)) (MPG . #(21 21 22.8d0 21.4d0 18.7d0 18.1d0 ...)) (CYL . #(6 6 ..
  2: ((LAMBDA ()))
  3: (SB-INT:SIMPLE-EVAL-IN-LEXENV (DUCKDB:WITH-STATIC-TABLE ("mtcars" (AS-ALIST MTCARS)) (DUCKDB:FORMAT-QUERY "SELECT * from mtcars" NIL)) #<NULL-LEXENV>)
  4: (EVAL (DUCKDB:WITH-STATIC-TABLE ("mtcars" (AS-ALIST MTCARS)) (DUCKDB:FORMAT-QUERY "SELECT * from mtcars" NIL)))

Any ideas?
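To check an alist before handing it over, the type list from the error message can be reused directly. A sketch under assumptions: `*static-column-types*`, `supported-column-p` and `unsupported-columns` are hypothetical names, and the list reflects only what the ETYPECASE message above reports. Note that the MPG column in the backtrace mixes integers and double-floats, so it is presumably a general vector too and would fall through in the same way.

```lisp
;; Array types accepted by the ETYPECASE, copied from the error message.
(defparameter *static-column-types*
  '((simple-array bit)
    (simple-array (unsigned-byte 8))  (simple-array (signed-byte 8))
    (simple-array (unsigned-byte 16)) (simple-array (signed-byte 16))
    (simple-array (unsigned-byte 32)) (simple-array (signed-byte 32))
    (simple-array (unsigned-byte 64)) (simple-array (signed-byte 64))
    (simple-array single-float)
    (simple-array double-float)))

(defun supported-column-p (column)
  (some (lambda (type) (typep column type)) *static-column-types*))

(defun unsupported-columns (alist)
  "Return the names of columns whose vectors match none of the types."
  (loop :for (name . column) :in alist
        :unless (supported-column-p column)
          :collect name))

;; A string vector is reported, a specialized double-float vector is not:
(unsupported-columns
 (list (cons "model" (vector "Mazda RX4" "Datsun 710"))
       (cons "mpg" (make-array 2 :element-type 'double-float
                                 :initial-contents '(21d0 22.8d0)))))
```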

Round trip example?

Is it possible to 'round trip' data between CL and a duckdb table, ideally with an example that covers every valid type?

For example if I get an alist from ddb:query, can I then use that somehow in the with-static-table macro to recreate the table, with a different name?

Investigate table functions failure

The current implementation for static tables seems to sporadically fail:

CL-USER> (let ((integers (make-array (list 100000) :element-type '(signed-byte 32))))
           (loop :for i :of-type (signed-byte 32) :below 100000
                 :do (setf (aref integers i) i))
           (ddb:with-transient-connection
             (count t (loop :for _ :below 1000
                            :collect (eql 4999950000
                                          (ddb:with-static-table ("integers" `(("i" . ,integers)))
                                            (ddb:get-result (ddb:query "SELECT sum(i) AS x FROM static_table('integers')" nil)
                                                            'x 0)))))))
996 (10 bits, #x3E4)
CL-USER> (let ((integers (make-array (list 100000) :element-type '(signed-byte 32))))
           (loop :for i :of-type (signed-byte 32) :below 100000
                 :do (setf (aref integers i) i))
           (ddb:with-transient-connection
             (count t (loop :for _ :below 1000
                            :collect (eql 4999950000
                                          (ddb:with-static-table ("integers" `(("i" . ,integers)))
                                            (ddb:get-result (ddb:query "SELECT sum(i) AS x FROM static_table('integers')" nil)
                                                            'x 0)))))))
995 (10 bits, #x3E3)
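For reference, the constant 4999950000 in the test is the sum of the integers 0 through 99999, which can be confirmed without DuckDB. `sum-below` is a hypothetical helper for this check only.

```lisp
;; Closed form for 0 + 1 + ... + (n - 1).
(defun sum-below (n)
  (/ (* n (1- n)) 2))

(sum-below 100000)                               ; => 4999950000
(= (sum-below 100000)
   (loop :for i :below 100000 :sum i))           ; => T
;; The total does not fit in (signed-byte 32), the column's element type:
(typep (sum-below 100000) '(signed-byte 32))     ; => NIL
```

So each run above returned the wrong sum in 4 or 5 of 1000 iterations.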
