clj-fast's People

Contributors: bsless

Forkers: yuhan0
clj-fast's Issues

Faster fast-merge

I was independently working on something similar, albeit with a focus on improving clojure.core/merge performance. Watching Tommi's talk reminded me of the slowness (and I end up using it in a lot of legacy code on hot paths...). I tried just optimizing the persistent map-based variant, which got me into the same ballpark you referenced (~30%). Going with transients and a useful heuristic yields some better fruit though (~50%)...

I did some testing to eliminate sources of slowdown:

  • prefer discrete arities over the default var-args-for-everything version,
  • use transients instead of persistent map-based conj,
  • use direct method invocation everywhere,
  • accumulate keys from r->l instead of l->r, since we can prune more (in some cases, e.g. 2-arg versions like (merge {:a 2 :b 3 :c 4} {:a 1 :b 1 :c 1}), we can exclude all keys from l since they already appear in r, leading to 0 actual assocs)
(defn rmerge! [^clojure.lang.IKVReduce l  r]
  (.kvreduce l
             (fn [^clojure.lang.ITransientAssociative acc k v]
               (if-not (acc k)
                 (.assoc acc k v)
                 acc)) r))

;;~50% faster.
(defn fast-merge
  ([] {})
  ([m] m)
  ([m1 m2]          (->> (transient m2) (rmerge! m1) persistent!))
  ([m1 m2 m3]       (->> (transient m3) (rmerge! m2) (rmerge! m1) persistent!))
  ([m1 m2 m3 m4]    (->> (transient m4) (rmerge! m3) (rmerge! m2) (rmerge! m1) persistent!))
  ([m1 m2 m3 m4 m5] (->> (transient m5) (rmerge! m4) (rmerge! m3) (rmerge! m2) (rmerge! m1) persistent!))
  ([m1 m2 m3 m4 m5 & ms]
   (let [rs (reverse ms)]
     (->> (reduce (fn [acc m] (rmerge! m acc)) (transient (first rs)) (rest rs))
          (rmerge! m5)
          (rmerge! m4)
          (rmerge! m3)
          (rmerge! m2)
          (rmerge! m1)
          persistent!))))
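
As a self-contained sanity check that right-to-left accumulation preserves clojure.core/merge semantics, here is a minimal sketch of the 2-arity path above (`rmerge-2!` and `fast-merge-2` are illustration names, not part of the proposal):

```clojure
;; Minimal, self-contained sketch of the 2-arity path, to sanity-check
;; that right-to-left accumulation matches clojure.core/merge semantics.
(defn rmerge-2! [^clojure.lang.IKVReduce l r]
  (.kvreduce l
             (fn [^clojure.lang.ITransientAssociative acc k v]
               (if-not (acc k)          ; caveat: treats nil/false values as absent
                 (.assoc acc k v)
                 acc))
             r))

(defn fast-merge-2 [m1 m2]
  (persistent! (rmerge-2! m1 (transient m2))))

;; All of m1's keys already appear in m2, so zero assocs happen:
(fast-merge-2 {:a 2 :b 3 :c 4} {:a 1 :b 1 :c 1})   ;; => {:a 1, :b 1, :c 1}
(= (merge {:a 2 :b 3} {:a 1 :c 1})
   (fast-merge-2 {:a 2 :b 3} {:a 1 :c 1}))          ;; => true
```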

fast-case

I ran across this a while back, when optimizing for some identity-based comparisons. It reared its head again when I was optimizing clojure's defrecord during the ICFPC 2019 optimization exercise, specifically based on how clojure.core/defrecord handles its implementations of valAt, assoc, and a (lack of) IFn lookup-style implementation. I'll post the alternative I'm working with (fastrecord) in another issue. Here, though, are some optimizations for clojure.core/case that leverage efficient identical? checks in the case where you have all keywords.

clojure.core/case already optimizes its internal lookup scheme based on the test constants. I haven't dug into the int test case (I assume it's already optimal). The other paths apply when you have keyword test constants, or otherwise structural hasheq constants that can be tested by hashing.

In the case of records, clojure uses clojure.core/case to dispatch based on the keyword associated with the field for many operations, which will go down the identical? lookup path. This is good in general, with a substantial caveat. For test case cardinality <=20, it is empirically faster to do a linear scan through the test cases and check identical? rather than the scheme clojure.core/case uses, which is to look up in a map (I "think" a persistent one, not sure).

This means that, for small sets of test constants, you always pay the price of a hash lookup. You also eschew any short-circuiting opportunities that may arise by design or naturally from the data.
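
The reason identical? is viable here at all is that keywords are interned: two occurrences of the same keyword are the same object. A quick REPL illustration of this property (not from the original post):

```clojure
;; Keywords are interned, so pointer equality is reliable for keyword tests:
(identical? :a :a)                    ;; => true
(identical? (keyword "a") :a)         ;; => true, interning returns the same object
;; Strings, by contrast, are not safe to compare with identical?:
(identical? "ab" (str "a" "b"))       ;; => false
(= "ab" (str "a" "b"))                ;; => true
```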

My alternative is twofold. First, a macro that establishes a case-like form specific to identical? comparisons (e.g. keywords, or anything else for which the caller is confident identical? is appropriate):

(defmacro case-identical?
  "Like clojure.core/case, except instead of a lookup map, we
   use `condp` and `identical?` in an unfolding macroexpansion
   to allow fast case lookups for smaller cases where we may
   beat the O(1) cost of hashing that clojure.core/case incurs
   via its lookup map.  Some workloads are substantially (~3x)
   faster using linear lookup and `identical?` checks.

   Caller should be aware of the differences between `identical?`
   and `=` or other structural hashing comparisons.  `identical?`
   is appropriate for object (e.g. pointer) equality between
   instances, and is more restrictive than structural equality
   per `clojure.core/=`; objects may be = but not `identical?`,
   where `identical?` objects are almost certainly `=`."
  [e & clauses]
  (let [ge      (with-meta (gensym) {:tag Object})
        default (if (odd? (count clauses))
                  (or (last clauses) ::nil)
                  `(throw (IllegalArgumentException. (str "No matching clause: " ~ge))))
        conj-flat   (fn [acc [k v]]
                      (conj acc k v))]
    (if (> 2 (count clauses))
      `(let [~ge ~e] ~default)
      (let [pairs     (->> (partition 2 clauses)
                           (reduce (fn [acc [l r]]
                                     (if (seq? l)
                                       (reduce conj acc (for [x l] [x r]))
                                       (conj acc [l r])))  []))
            dupes    (->> pairs
                          (map first)
                          frequencies
                          (filter (fn [[k v]]
                                    (> v 1)))
                          (map first))
            args     (reduce conj-flat [] pairs)]
        (when (seq dupes)
          (throw (ex-info (str "Duplicate case-identical? test constants: " (vec dupes)) {:dupes dupes})))
        `(let [~ge ~e]
           (condp identical? ~ge
             ~@(if default (conj args (case default ::nil nil default)) args)))))))
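
For reference (hand-expanded, with the gensym spelled as `G` for readability), a three-clause call reduces to a flat condp chain:

```clojure
;; (case-identical? :b :a 1 :b 2 :c 3 :none) expands to roughly:
(let [G :b]
  (condp identical? G
    :a 1
    :b 2
    :c 3
    :none))
;; => 2
```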

Second, a drop-in replacement for clojure.core/case, fast-case, which detects conditions where it is better to use a linear scan rather than hashing and lookup:

(defmacro fast-case
   "Drop-in replacement for clojure.core/case that attempts to optimize
    identical? case comparison (e.g. keywords).
    Takes an expression, and a set of clauses.

    Each clause can take the form of either:

    test-constant result-expr

    (test-constant1 ... test-constantN)  result-expr

    The test-constants are not evaluated. They must be compile-time
    literals, and need not be quoted.  If the expression is equal to a
    test-constant, the corresponding result-expr is returned. A single
    default expression can follow the clauses, and its value will be
    returned if no clause matches. If no default expression is provided
    and no clause matches, an IllegalArgumentException is thrown.

    Unlike cond and condp, fast-case does a constant-time dispatch for
    ints and non-keyword constants; the clauses are not considered
    sequentially.

    If all test cases are keywords, then fast-case will leverage an
    optimized path for `identical?` checks, where we balance the
    performance of a linear comparison of entries by object
    identity with the cost of an associative lookup and hashing
    of the case objects.  This can yield significant savings
    for cases that are all keywords, and when there may be
    benefit for short-circuiting operations (e.g. the most
    likely case is first).

    All manner of constant expressions are acceptable in case,
    including numbers, strings, symbols, keywords, and (Clojure)
    composites thereof. Note that since lists are used to group
    multiple constants that map to the same expression, a vector
    can be used to match a list if needed. The test-constants
    need not be all of the same type."
  [e & clauses]
  (let [ge (with-meta (gensym) {:tag Object})
        default (if (odd? (count clauses))
                  (last clauses)
                  `(throw (IllegalArgumentException. (str "No matching clause: " ~ge))))
        conj-flat   (fn [acc [k v]]
                      (conj acc k v))]
    (if (> 2 (count clauses))
      `(let [~ge ~e] ~default)
      (let [pairs (->> (partition 2 clauses)
                       (reduce (fn [acc [l r]]
                                 (if (seq? l)
                                   (reduce conj acc (for [x l] [x r]))
                                   (conj acc [l r])))  []))]
        (if (and (every? keyword? (map first pairs))
                 (<= (count pairs) 20))
          `(case-identical? ~e ~@clauses)
          `(clojure.core/case ~e ~@clauses))))))

Up to around 20 test constants, even looking up the 20th (worst-case) value is still faster than the O(1) hash lookup clojure.core/case performs.
Fewer values see far more substantial gains (around 2-3x for small test sets). For code on a hot path (like the internal implementations of clojure.core/defrecord and its lookups in valAt, assoc, etc., which dispatch to clojure.core/case off of keyword identity), this can provide substantial performance savings (which is the intent of records and similar performance accelerators, after all).

(let [k :a]
  (c/quick-bench
   (fast-case k :a 0 :b 1 :c 2 :d 3 :e 4 :f 5 :g 6 :h 7 :i 9 :j 10 :k 11 :l
     12 :m 13 :n 14 :o 15 :p 16 :q 17 :r 18 :s 19 :none)))

;; Evaluation count : 108369882 in 6 samples of 18061647 calls.
;; Execution time mean : 3.791586 ns

(let [k :a]
  (c/quick-bench
   (case k :a 0 :b 1 :c 2 :d 3 :e 4 :f 5 :g 6 :h 7 :i 9 :j 10 :k 11 :l 12 :m
         13 :n 14 :o 15 :p 16 :q 17 :r 18 :s 19 :none)))

;; Evaluation count : 55105488 in 6 samples of 9184248 calls.
;; Execution time mean : 8.826299 ns

(let [k :s]
  (c/quick-bench
   (fast-case k :a 0 :b 1 :c 2 :d 3 :e 4 :f 5 :g 6 :h 7 :i 9 :j 10 :k 11 :l
              12 :m 13 :n 14 :o 15 :p 16 :q 17 :r 18 :s 19 :none)))

;; Execution time mean : 9.388600 ns



(let [k :a] (c/quick-bench (fast-case k :a 0 :b 1 :none)))

;; Evaluation count : 105935916 in 6 samples of 17655986 calls.
;; Execution time mean : 3.748773 ns


(let [k :missing] (c/quick-bench (fast-case k :a 0 :b 1 :none)))

;; Evaluation count : 101388984 in 6 samples of 16898164 calls.
;; Execution time mean : 4.107746 ns

(let [k :a] (c/quick-bench (case k :a 0 :b 1 :none)))

;; Evaluation count : 58064934 in 6 samples of 9677489 calls.
;; Execution time mean : 8.382319 ns

(let [k :missing] (c/quick-bench (case k :a 0 :b 1 :none)))

;; Evaluation count : 56006400 in 6 samples of 9334400 calls.
;; Execution time mean : 8.979109 ns

[edit] added a default case that doesn't flip out on nil for case-identical?

not-found arities of get and get-in

Why does the inline/get macro take [m k & nf] arguments?
Passing more than one not-found argument in the expanded (clojure.lang.RT/get ..) form is clearly an error.
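
A sketch of what a fixed signature might look like (hypothetical: named `inline-get` here to avoid shadowing clojure.core/get, and clj-fast's actual macro may differ):

```clojure
;; Hypothetical sketch: discrete arities instead of [m k & nf], so the
;; expansion can never receive more than one not-found argument.
(defmacro inline-get
  ([m k]    `(clojure.lang.RT/get ~m ~k))
  ([m k nf] `(clojure.lang.RT/get ~m ~k ~nf)))

(inline-get {:a 1} :a)            ;; => 1
(inline-get {:a 1} :b :missing)   ;; => :missing
```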

Similarly, is there a reason why inline/get-in does not take a not-found argument?

I added one like so:

(defmacro get-in
  "Like `get-in` but faster and uses code generation.
  `ks` must be either a vector, list, or set."
  ([m ks]
   {:pre [(u/simple-seq? ks)]}
   (lens/get (fn [k] `(get ~k)) m ks))
  ([m ks nf]
   {:pre [(u/simple-seq? ks)]}
   (let [g (gensym)]
     `(let [~g ~nf]
        ~(lens/get (fn [k] `(get ~k ~g)) m ks)))))

make sure inlining is wrapped in a function that can be jit'd

It may not make a difference here, but it's something I missed when doing some recent toy profiling at the REPL, measuring traversals over arrays and vectors:

When testing areduce over a primitive array vs. a call to reduce over a vector, the raw areduce expression ended up being confusingly slower than (or close to) the boxed HAMT vector reduction. This seemed very odd, since prior experience indicated primitive array traversal was blazingly fast. I then ensured that the macroexpansion from areduce happened inside a wrapper function, like traverse-arr:

user=> (time (dotimes [i 1] (areduce ys idx acc 0 (aget ^longs ys idx))))  
"Elapsed time: 30.8967 msecs"
user=> (time (dotimes [i 1] (areduce ys idx acc 0 (aget ^longs ys idx))))
"Elapsed time: 16.4168 msecs"
nil
user=> (time (dotimes [i 1] (areduce ys idx acc 0 (aget ^longs ys idx))))
"Elapsed time: 16.7463 msecs"
nil
user=> (time (dotimes [i 1] (areduce ys idx acc 0 (aget ^longs ys idx))))
"Elapsed time: 15.8058 msecs"
nil
user=> (time (dotimes [i 1] (areduce ys idx acc 0 (aget ^longs ys idx))))
"Elapsed time: 18.5438 msecs"
nil
user=> (time (dotimes [i 1] (areduce ys idx acc 0 (aget ^longs ys idx))))
"Elapsed time: 19.4693 msecs"
nil
user=> (time (dotimes [i 1] (areduce ys idx acc 0 (aget ^longs ys idx))))
"Elapsed time: 17.4582 msecs"

(defn traverse-arr [^longs arr]
  (areduce arr idx acc 0 (aget arr idx)))

and got my expected performance.

user=> (time (dotimes [i 1] (traverse-arr ys)))
"Elapsed time: 30.0336 msecs"
nil
user=> (time (dotimes [i 1] (traverse-arr ys)))
"Elapsed time: 17.178 msecs"
nil
user=> (time (dotimes [i 1] (traverse-arr ys)))
"Elapsed time: 3.6118 msecs"
nil
user=> (time (dotimes [i 1] (traverse-arr ys)))
"Elapsed time: 10.2396 msecs"
nil
user=> (time (dotimes [i 1] (traverse-arr ys)))
"Elapsed time: 15.9891 msecs"
nil
user=> (time (dotimes [i 1] (traverse-arr ys)))
"Elapsed time: 4.4253 msecs"
nil
user=> (time (dotimes [i 1] (traverse-arr ys)))
"Elapsed time: 3.7816 msecs"
nil
user=> (time (dotimes [i 1] (traverse-arr ys)))
"Elapsed time: 3.9419 msecs"
nil

It seems the JIT is kicking in on the tiny function, but not the areduce call, which is a macro expansion into a loop/recur form. Very interesting. Might be worth a look to make sure the JIT isn't being restricted in the inlined forms as well (I haven't looked hard).

Operations on ordered collections (probing) can be faster

For reference, I am using a sorted map to denote a set of intervals. I would like to quickly do intersection tests for an arbitrary point x to see if it falls within a known interval. So, for the ordered map {a b}, the key a denotes the left bound of an interval and the value b the right, e.g. [a b]. A value x contained by the interval satisfies (>= x a) and (<= x b). Since we already have sorted maps out of the box, this seems like a simple use case: we can use rsubseq to exploit the tree's structure, searching for entries whose key fits the first criterion, then testing the val against the second. Notably, clojure.data.avl provides more robust features and is possibly faster, so we'll include it as an option.
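
The probing trick depends on rsubseq yielding the greatest key <= x first; a small illustration (`ivals` is an illustration name):

```clojure
;; rsubseq walks the tree from the high end, so the first entry it yields
;; has the greatest key <= x -- the only interval that could contain x.
(def ivals (sorted-map 10 20 35 40 50 60))

(first (rsubseq ivals <= 12))   ;; => [10 20]; 12 <= 20, so 12 is inside [10 20]
(first (rsubseq ivals <= 30))   ;; => [10 20]; but 30 > 20, so no intersection
(first (rsubseq ivals <= 5))    ;; => nil; 5 is below every interval
```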

Let's look at performance.

(require '[criterium.core :as c])
(require '[clojure.data.avl :as avl])

(def samples (sorted-map 10 20 35 40 50 60))
(def avl-samples (avl/sorted-map 10 20 35 40 50 60))

;;slow but portable version...
(defn slow-intersection [sm k]
  (when-let [ab (first (rsubseq sm <= k))]
    (when (<= k (val ab))
      ab)))
;;user> (c/quick-bench (slow-intersection samples 10))
;;Execution time mean : 1.142835 µs

;;fast but not portable version...
(defn intersection [sm k]
  (when-let [ab (when-let [xs (.seqFrom ^clojure.lang.PersistentTreeMap sm k false)]
                  (.first ^clojure.lang.ISeq xs))]
    (when (<= k (.getValue ^java.util.Map$Entry ab))
      ab)))

;;user> (c/quick-bench (intersection samples 10))
;;Execution time mean : 112.924740 ns

(set! *unchecked-math* true)
(defn fast-intersection [sm k]
  (when-let [ab (when-let [xs (.seqFrom ^clojure.lang.PersistentTreeMap sm k false)]
                  (.first ^clojure.lang.ISeq xs))]
    (when (<= ^long k ^long (.getValue ^java.util.Map$Entry ab))
      ab)))
(set! *unchecked-math* false)
;;user> (c/quick-bench (fast-intersection samples 10))
;;Execution time mean : 91.716445 ns


(defn avl-intersection [sm k]
  (when-let [ab (avl/nearest sm <= k)]
    (when (<= k (val ab))
      ab)))
;;Execution time mean : 163.168134 ns

(defn fast-avl-intersection [sm k]
  (when-let [ab (avl/nearest sm <= k)]
    (when (<= k (.val ^clojure.lang.MapEntry ab))
      ab)))
;;Execution time mean : 156.294912 ns

(set! *unchecked-math* true)
(defn fastest-avl-intersection [sm k]
  (when-let [ab (avl/nearest sm <= k)]
    (when (<= ^long k ^long (.val ^clojure.lang.MapEntry ab))
      ab)))
(set! *unchecked-math* false)
;; user> (c/quick-bench (fastest-avl-intersection avl-samples 10))
;; Execution time mean : 142.348268 ns

So at least for this operation (and maybe others), direct method invocation and hinting really help performance: ~113 ns vs. ~1.1 µs is pretty substantial. This could open the door to additional performance questions around the less-often-used sorted collections in core.

Array cloning performance diffs

Doing some performance work for a local search optimization via simulated annealing. My travels included exploring different numeric representations and trade-offs, including using copy-on-write (COW) semantics for numeric arrays in some places vs. the persistent vector/transient vector defaults. It appears aclone is slow for some unknown reason (to me!):

(set! *unchecked-math* true)

(defn cow-update-slow [^longs arr ^long idx ^long v]
  (let [^longs res (aclone arr)]
    (aset res idx v)
    res))

user> (let [xs (long-array 15)] (c/quick-bench (cow-update-slow xs 10 2)))
             Execution time mean : 114.164498 ns
    Execution time std-deviation : 0.747223 ns
   Execution time lower quantile : 113.189737 ns ( 2.5%)
   Execution time upper quantile : 114.879358 ns (97.5%)
                   Overhead used : 11.166062 ns

(defn cow-update [^longs arr ^long idx ^long v]
  (let [^longs res (java.util.Arrays/copyOf arr (alength arr))]
    (aset res idx v)
    res))

user> (let [xs (long-array 15)] (c/quick-bench (cow-update xs 10 2)))
Evaluation count : 12961884 in 6 samples of 2160314 calls.
             Execution time mean : 35.007116 ns
    Execution time std-deviation : 0.240233 ns
   Execution time lower quantile : 34.668212 ns ( 2.5%)
   Execution time upper quantile : 35.291867 ns (97.5%)
                   Overhead used : 11.166062 ns
nil

java.util.Arrays/copyOf beats it handily, and its performance is a hair faster (for small arrays) than a normal persistent vector's hinted assocN for a similar operation. I always thought aclone was more or less optimal... apparently not.

General inliner

All of my inlining implementations, although close to the original implementations they mimic, are ad hoc. They require prior knowledge of the inlined form.
Ideally, I would want to expose a macro like definline that includes a sophisticated :inline function. This inliner would examine the call site and try to inline what it can.

Prior work on this includes F-expressions:
https://github.com/halgari/heliotrope
https://web.wpi.edu/Pubs/ETD/Available/etd-090110-124904/unrestricted/jshutt.pdf

And Haskell's Core compiler
https://gitlab.haskell.org/ghc/ghc/-/wikis/commentary/compiler/core-to-core-pipeline

I started trying to generalize an implementation, beginning by replacing function application with a rewrite rule that substitutes an application with the instructions for writing that application:

(defn abstract-call
  [sym]
  (fn [& args]
    `(~sym ~@args)))

(defmacro ac
  [sym]
  `(abstract-call ~sym))

This makes it quite trivial to port an existing definition:

(defn a-get-in
  [m ks]
  (reduce (ac `get) m ks))

(a-get-in 'm '[a b c])
;=>
(clojure.core/get (clojure.core/get (clojure.core/get m a) b) c)

It requires some massaging of more complex definitions but essentially works:

(defn a-assoc-in
  [m [k & ks] v]
  (let [g (gensym)]
    `(let [~g ~m]
       ~(if ks
          ((ac `assoc) g k (a-assoc-in ((ac `get) g k) ks v))
          ((ac `assoc) g k v)))))

(a-assoc-in 'm '[a b c] 'v)
;=>
(clojure.core/let
 [G__11548 m]
 (clojure.core/assoc
  G__11548
  a
  (clojure.core/let
   [G__11549 (clojure.core/get G__11548 a)]
   (clojure.core/assoc
    G__11549
    b
    (clojure.core/let
     [G__11550 (clojure.core/get G__11549 b)]
     (clojure.core/assoc G__11550 c v))))))

Things get more complicated when trying to generalize this.

This is my initial abortive attempt:

(defn replace-args
  [args form]
  (let [argm (zipmap args (repeatedly gensym))
        imap (interleave (vals argm) (keys argm))]
    `(let [~@imap]
       ~(walk/postwalk-replace argm form))))

(defn fnsym
  [sym]
  (when-let [v (resolve sym)]
    (when (and (ifn? (deref v)) (not (:macro (meta v))))
      (symbol v))))

(defn abstract-fn
  [sym]
  (if-let [sym (fnsym sym)]
    `(abstract-call ~sym)
    sym))

(comment
  ((abstract-fn 'get) 'm 'k))

(defn abstract-form
  [name form]
  (walk/postwalk
   (fn [expr]
     (if-let [expr (and (symbol? expr)
                        (not= name expr)
                        (abstract-fn expr))]
       expr
       expr))
   form))

(defn regenerate-form
  [name args form]
  (fn [& args']
    (let [gnosis (map (fn [argn arg] (if (known-at-callsite? arg)
                                       (with-meta argn (assoc (meta argn) :known true))
                                       argn)) args args')
          known (filter (comp :known meta) gnosis)
          unknown (remove (comp :known meta) gnosis)]
      (if (seq known)
        (->> form
             (replace-args unknown)
             (abstract-form name))
        'boring))))

(defn emit-inliner
  [name args form]
  (let [generator (regenerate-form name args form)]
    (fn [& callsite]
      (let [emitter (apply generator callsite)
            emitter (eval `(fn [~@args] ~emitter))]
        (apply emitter callsite)))))

(defmacro definline+
  [name & decl]
  (let [[pre-args [args expr]] (split-with (comp not vector?) decl)]
    `(do
       (defn ~name ~@pre-args ~args ~expr)
       (alter-meta! (var ~name) assoc :inline (emit-inliner (quote ~name) (quote ~args) (quote ~expr)))
       (var ~name))))

(definline+ my-assoc-in
  [m [k & ks] v]
  (if ks
    (assoc m k (my-assoc-in (get m k) ks v))
    (assoc m k v)))

(defn foo
  [m v]
  (my-assoc-in m [1 2 3] v))

There are heuristics to be considered when attempting to inline, some of which are discussed in the Haskell Core compiler literature, such as the number of occurrences of a variable in a form determining if it needs to be bound (see the input map in assoc-in).

cc: @joinr, what do you think, does a solution present itself readily here, or is it completely inscrutable?

Clojurescript support?

Are there any performance improvements for the same cases possible in cljs?
Can some of the files be moved to .cljc?

May not be in scope, but fast detection of set intersection

I ran into a use case for a work product that needed to quickly test membership between 2 sets to
see if "any" member existed in both, and return the first intersecting member as fast as possible.

A naive approach one might reach for is:

(defn slow-some-member [s1 s2]
  (first (clojure.set/intersection s1 s2)))

My original implementation was:

(defn some-member [s1 s2]
  (let [[l r] (if (< (count s1) (count s2))
                [s1 s2]
                [s2 s1])]
    (reduce (fn [acc x]
              (if (r x)
                (reduced x)
                acc)) nil l)))

Which of course has some inefficiencies due to destructuring and slow function calls through clojure.lang.RT.

The revised version assumes sets as input:

(defn some-member2 [^clojure.lang.APersistentSet s1
                    ^clojure.lang.APersistentSet s2]
  (let [l (if (< (.count s1) (.count s2)) s1 s2)
        r (if (identical? l s1) s2 s1)]
    (reduce (fn [acc x]
              (if (r x)
                (reduced x)
                acc)) nil l)))

user> (let [s1 #{:a :b } s2 #{:g :a}] (c/quick-bench (slow-some-member s1 s2)))
Evaluation count : 975378 in 6 samples of 162563 calls.
             Execution time mean : 612.287581 ns
    Execution time std-deviation : 3.310927 ns
   Execution time lower quantile : 608.261646 ns ( 2.5%)
   Execution time upper quantile : 615.629475 ns (97.5%)
                   Overhead used : 2.233057 ns
nil

user> (let [s1 #{:a :b } s2 #{:g :a}] (c/quick-bench (some-member s1 s2)))
Evaluation count : 2004582 in 6 samples of 334097 calls.
             Execution time mean : 300.056678 ns
    Execution time std-deviation : 2.946006 ns
   Execution time lower quantile : 297.321682 ns ( 2.5%)
   Execution time upper quantile : 303.695844 ns (97.5%)
                   Overhead used : 2.233057 ns
nil
user> (let [s1 #{:a :b } s2 #{:g :a}] (c/quick-bench (some-member2 s1 s2)))
Evaluation count : 3298134 in 6 samples of 549689 calls.
             Execution time mean : 181.087016 ns
    Execution time std-deviation : 0.816367 ns
   Execution time lower quantile : 180.408265 ns ( 2.5%)
   Execution time upper quantile : 182.115366 ns (97.5%)
                   Overhead used : 2.233057 ns
nil

Gains are variable, and I haven't studied them for a wide range of set combinations. I'm interested in any faster means of doing this, although I just realized that my specific use case is amenable to memoization, which improves runtime (via memo-2) about 10x over my original "optimized" some-member.
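
memo-2 isn't shown in the post; a minimal sketch of what a two-argument memoizer along those lines might look like (hypothetical, the author's version may differ):

```clojure
;; Hypothetical minimal two-arg memoizer (not the author's memo-2): avoids
;; the var-arg/seq overhead of clojure.core/memoize by keying on a plain
;; two-element vector.  Caveat: nil/false results are re-computed.
(defn memo-2 [f]
  (let [cache (atom {})]
    (fn [a b]
      (let [k [a b]]
        (or (@cache k)
            (let [v (f a b)]
              (swap! cache assoc k v)
              v))))))

(def calls (atom 0))
(def add* (memo-2 (fn [a b] (swap! calls inc) (+ a b))))
(add* 1 2)   ;; => 3, computed
(add* 1 2)   ;; => 3, cached -- @calls is still 1
```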

user> (let [f (memo-2 some-member2) s1 #{:a :b } s2 #{:g :a}] (c/quick-bench (f s1 s2)))
Evaluation count : 19762416 in 6 samples of 3293736 calls.
             Execution time mean : 28.191368 ns
    Execution time std-deviation : 0.181916 ns
   Execution time lower quantile : 28.035774 ns ( 2.5%)
   Execution time upper quantile : 28.427699 ns (97.5%)
                   Overhead used : 2.233057 ns

fastrecord

Building off of #8, we have a better implementation for records. During my optimization exercise for the ICFPC 2019 contest, I ran into a couple of peculiar performance phenomena with records. Records generally did fine with direct field lookups, but other map behaviors (particularly things invoking valAt) tended to perform worse than array maps. Looking up keys outside the fields also performed worse, and by a far larger margin than array maps.

It turns out, records have overhead on lookup in a couple of areas:

  • implementation uses (relatively) slow identical? case lookup to detect fields (keyword tests)
  • implementation uses clojure.core/get to look up non-field keys in the __extmap hashmap, despite
    knowing that this is a Clojure persistent map. clojure.core/get invokes a bunch of overhead...

These are found throughout the implementation, particularly valAt, which is used a bunch.

Fortunately, we can hack the defrecord implementation to use more efficient code paths with identical semantics. A better implementation would be to rewrite the defrecord macro and emitters from the ground up, but here I use code walking and transforms to define a faster record variant, the fastrecord:

(defmacro fastrecord
  "Like defrecord, but adds default map-like function application
   semantics to the record.  Fields are checked first in O(1) time,
   then general map lookup is performed.  Users may supply an optional
   ^:strict hint for the arg vector, which will enforce the invariant
   that the record always and only has the pre-defined fields, and
   will throw an exception on any operation that tries to access
   fields outside of the predefined static fields.  This moves
   the record into more of a struct-like object.

   Note: this is not a full re-implementation of defrecord,
   and still leverages the original's code emission.  The main
   difference is the implementation of key-lookup semantics
   ala maps-as-functions, and drop-in performance that should
   be equal-to or superior to the clojure.core/defrecord
   implementation.  Another refinement, which is what makes arraymaps
   superior for <= 8 fields, is the avoidance of a case dispatch,
   which is slower in practice than a linear scan or a sequential
   evaluation of identical? checks.
   Small records defined in this way should be competitive
   in general purpose map operations."
  [name keys & impls]
  (let [fields (map keyword keys)
        binds  (reduce (fn [acc [l r]]
                         (conj acc l r))
                       []
                       (map vector fields (map #(with-meta % {}) keys)))
        [_ name keys & impls] &form
        this (gensym "this")
        k    (gensym "k")
        extmap (with-meta '__extmap {:tag 'clojure.lang.ILookup})
        default (gensym "default")
        n       (count keys)
        caser   'spork.util.general/fast-case
        lookup (fn [method]
                 `[(~method [~this ~k]
                    (~caser ~k
                     ~@binds
                     ~(if (-> keys meta :strict)
                        `(throw (ex-info "key not in strict record" {:key ~k}))
                        `(if ~extmap
                           (~'.valAt ~extmap ~k)))))
                   (~method [~this ~k ~default]
                    (~caser ~k
                     ~@binds
                     ~(if (-> keys meta :strict)
                        `(throw (ex-info "key not in strict record" {:key ~k}))
                        `(if ~extmap
                           (~'.valAt ~extmap ~k ~default)))))])
        replace-val-at (fn [impls]
                         (->> impls
                              (remove (fn [impl]
                                        (and (seq impl)
                                             (#{'valAt  'clojure.core/valAt}
                                              (first impl)))))
                              (concat (lookup 'valAt))))
        replace-deftype (fn [emitted]
                          (->> emitted
                               (reduce (fn [acc x]
                                         (if (and (seq? x)
                                                  (= (first x) 'deftype*))
                                           (let [init (take 6 x)
                                                 impls (drop 6 x)]
                                             (conj acc (concat init
                                                               (replace-val-at impls))))
                                           (conj acc x))) [])
                               seq))
        rform (->> `(~'defrecord ~name ~keys ~@impls
                     ~'clojure.lang.IFn
                     ~@(lookup 'invoke))
                   macroexpand-1
                   replace-deftype
                   (walk/postwalk-replace {'clojure.core/case caser
                                           'case caser}))]
    `(~@rform)))

fastrecord gets us back to efficient lookups, and even regains our performance for non-field (extmap) keys.

spork.util.record> (fastrecord bf [x y])
spork.util.record.bf
spork.util.record> (defrecord blah [x y])
spork.util.record.blah
spork.util.record> (let [r (bf. 10 10)] (c/quick-bench (get r :x)))
Evaluation count : 75102528 in 6 samples of 12517088 calls.
             Execution time mean : 6.308735 ns
    Execution time std-deviation : 0.267464 ns
   Execution time lower quantile : 5.993483 ns ( 2.5%)
   Execution time upper quantile : 6.654294 ns (97.5%)
                   Overhead used : 1.794957 ns
nil
spork.util.record> (let [r (blah. 10 10)] (c/quick-bench (get r :x)))
Evaluation count : 47213862 in 6 samples of 7868977 calls.
             Execution time mean : 11.291437 ns
    Execution time std-deviation : 0.296358 ns
   Execution time lower quantile : 10.737574 ns ( 2.5%)
   Execution time upper quantile : 11.523370 ns (97.5%)
                   Overhead used : 1.794957 ns

Found 1 outliers in 6 samples (16.6667 %)
	low-severe	 1 (16.6667 %)
 Variance from outliers : 13.8889 % Variance is moderately inflated by outliers
nil
spork.util.record> (let [r (bf. 10 10)] (c/quick-bench (.valAt ^clojure.lang.ILookup r :x)))
Evaluation count : 102439014 in 6 samples of 17073169 calls.
             Execution time mean : 4.110668 ns
    Execution time std-deviation : 0.094549 ns
   Execution time lower quantile : 4.003708 ns ( 2.5%)
   Execution time upper quantile : 4.218361 ns (97.5%)
                   Overhead used : 1.794957 ns
nil
spork.util.record> (let [r (blah. 10 10)] (c/quick-bench (.valAt ^clojure.lang.ILookup r :x)))
Evaluation count : 53414316 in 6 samples of 8902386 calls.
             Execution time mean : 9.790087 ns
    Execution time std-deviation : 0.165520 ns
   Execution time lower quantile : 9.627463 ns ( 2.5%)
   Execution time upper quantile : 10.015848 ns (97.5%)
                   Overhead used : 1.794957 ns
nil
spork.util.record> (let [r (assoc (bf. 10 10) :a 1 :b 2 :c 3 :d 4)] (c/quick-bench (.valAt ^clojure.lang.ILookup r :a)))
Evaluation count : 77783904 in 6 samples of 12963984 calls.
             Execution time mean : 6.005716 ns
    Execution time std-deviation : 0.258388 ns
   Execution time lower quantile : 5.702144 ns ( 2.5%)
   Execution time upper quantile : 6.358060 ns (97.5%)
                   Overhead used : 1.794957 ns

Found 2 outliers in 6 samples (33.3333 %)
	low-severe	 1 (16.6667 %)
	low-mild	 1 (16.6667 %)
 Variance from outliers : 13.8889 % Variance is moderately inflated by outliers
nil
spork.util.record> (let [r (assoc (blah. 10 10) :a 1 :b 2 :c 3 :d 4)] (c/quick-bench (.valAt ^clojure.lang.ILookup r :a)))
Evaluation count : 27405528 in 6 samples of 4567588 calls.
             Execution time mean : 20.762105 ns
    Execution time std-deviation : 0.445198 ns
   Execution time lower quantile : 20.163919 ns ( 2.5%)
   Execution time upper quantile : 21.269557 ns (97.5%)
                   Overhead used : 1.794957 ns

Found 2 outliers in 6 samples (33.3333 %)
	low-severe	 1 (16.6667 %)
	low-mild	 1 (16.6667 %)
 Variance from outliers : 13.8889 % Variance is moderately inflated by outliers
nil
spork.util.record> (let [r (assoc (bf. 10 10) :a 1 :b 2 :c 3 :d 4)] (c/quick-bench (r :a)))
Evaluation count : 79352808 in 6 samples of 13225468 calls.
             Execution time mean : 5.945103 ns
    Execution time std-deviation : 0.130571 ns
   Execution time lower quantile : 5.726372 ns ( 2.5%)
   Execution time upper quantile : 6.076753 ns (97.5%)
                   Overhead used : 1.794957 ns

Found 1 outliers in 6 samples (16.6667 %)
	low-severe	 1 (16.6667 %)
 Variance from outliers : 13.8889 % Variance is moderately inflated by outliers
nil
spork.util.record> (let [r (assoc (bf. 10 10) :a 1 :b 2 :c 3 :d 4)] (c/quick-bench (r :x)))
Evaluation count : 99531684 in 6 samples of 16588614 calls.
             Execution time mean : 4.073560 ns
    Execution time std-deviation : 0.093923 ns
   Execution time lower quantile : 3.900763 ns ( 2.5%)
   Execution time upper quantile : 4.162056 ns (97.5%)
                   Overhead used : 1.794957 ns

Found 1 outliers in 6 samples (16.6667 %)
	low-severe	 1 (16.6667 %)
 Variance from outliers : 13.8889 % Variance is moderately inflated by outliers
nil

In my use case, recovering that ~3.3x improvement for looking up non-static keys ended up being important. I added an IFn implementation in this case, which uses the optimized valAt pathways, but that doesn't have to be the case (and indeed deviates from defrecord). One option would be detecting user-supplied IFn implementations and using that instead, otherwise providing a map-lookup semantics for records. There are other improvements to make, including better hashing for primitive fields, but these are a step in the right direction (I think).

[edit] Added a fix for empty dynamic map.

Extend assoc-in to take multiple k-v pairs?

Sometimes when I want to do several updates on a deeply nested map, I end up writing something like this:

(-> m
  (assoc-in [:x :y] 1)
  (assoc-in [:x :z :a] 2)
  (assoc-in [:x :z :b] 3)
  (assoc-in [:x :z :c] 4))

Replacing with the clj-fast inlined assoc-in does not help much, because there are still many redundant operations being made across macro boundaries (e.g. (get m :x) being calculated 4 times).

Instead, it should be possible for a macro that takes multiple kv pairs to statically unroll it into something like:

(inline/assoc-in m
  [:x :y]    1
  [:x :z :a] 2
  [:x :z :b] 3
  [:x :z :c] 4)

;; expands into
(let [x (get m :x)
      z (get x :z)]
  (assoc m :x
    (-> x
      (assoc :y 1)
      (assoc :z (-> z
                  (assoc :a 2)
                  (assoc :b 3)
                  (assoc :c 4))))))

I did a quick criterium benchmark and found it to be 6x faster than the naive version above, and 4.5x faster than a drop-in replacement of the inline/assoc-in.

I haven't looked too closely at the implementation details, but is this something that's achievable or within the scope of this library?

Thanks!

better memoize example

This is an example (for the 2 arg version) of how discrete memoize can kick the crap out of clojure.core/memoize (which uses varargs - slow hashing - plus a hashmap - slow lookup).

If we use mutable containers in the background, and threadsafe ones, we get the same semantics but way better performance. If we eliminate inefficient hashing and use discrete args (many functions are like this), we get even better lookup than the naive clojure.core version.

I've used variants like this (I think I saw it demoed by Zach Tellman once) for 1-arg and 2-arg, but there's likely a macro for defining arbitrary arities.
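For the 1-arg case, the same idea can be sketched with computeIfAbsent (memo-1 is a hypothetical name, not from spork or clj-fast; note that computeIfAbsent treats a nil result as "no mapping", so nil returns would be recomputed each call):

```clojure
(defn memo-1 [f]
  (let [cache   (java.util.concurrent.ConcurrentHashMap.)
        ;; adapt the Clojure fn to java.util.function.Function once, up front
        compute (reify java.util.function.Function
                  (apply [_ k] (f k)))]
    (fn [x]
      ;; computeIfAbsent runs f at most once per key under contention
      (.computeIfAbsent cache x compute))))
```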

;;4x faster than clojure.core/memoize...
;;we can do better with a macro, but I haven't sussed it out.
;;This is as much as we probably need for now though, meh.
(defn memo-2 [f]
  (let [xs (java.util.concurrent.ConcurrentHashMap.)]
    (fn [x y]
      (if-let [^java.util.concurrent.ConcurrentHashMap ys (.get xs x)]
        (if-let [res (.get ys y)]
          res
          (let [res (f x y)]
            (do (.putIfAbsent ys y res)
                res)))
        (let [res     (f x y)
              ys    (doto (java.util.concurrent.ConcurrentHashMap.)
                      (.putIfAbsent y res))
              _     (.putIfAbsent xs x ys)]
          res)))))

I explored using tuples from clj-tuple for this purpose, to have a generalized variant with a simpler implementation. You get a drastic speedup over the stock clojure.core/memoize, since tuple hashing is typically pretty good. This (old) implementation uses a HashMap, where we'd probably prefer to use a concurrent map for threadsafety I guess (maybe it doesn't matter if the map gets clobbered by multiple writers though). The macro spork.util.general/memo-fn is a replacement for the (memo (fn [....])) idiom that's more efficient:

user> (require 'spork.util.general)
nil
user> (ns spork.util.general)
nil
spork.util.general> (memo-fn [x y z] (+ x y z))
#function[spork.util.general/eval15554/memoized--14673--auto----15555]
spork.util.general> (def blah (memo-fn [x y z] (+ x y z)))
#'spork.util.general/blah
spork.util.general> (def blee (memoize (fn [x y z] (+ x y z))))
#'spork.util.general/blee
spork.util.general> (time (dotimes [i 1000000] (blah 1 2 3)))
"Elapsed time: 27.303857 msecs"
nil
spork.util.general> (time (dotimes [i 1000000] (blee 1 2 3)))
"Elapsed time: 207.389697 msecs"
nil

FileNotFoundException when try to use clj-fast

Hi,

I added this library to the dependencies list in my Leiningen project.clj file like:

:dependencies [; ...
               [bsless/clj-fast "0.0.9"]
               ; ...
               ]

And then I started the REPL and tried to use something from the README, but got an error:

x.y.core=> (use '[clj-fast.collections.concurrent_hash_map :as chm])

Execution error (FileNotFoundException) at x.y.core/eval39551 (form-init17436836735770775491.clj:1).
Could not locate clj_fast/collections/concurrent_hash_map__init.class, clj_fast/collections/concurrent_hash_map.clj or clj_fast/collections/concurrent_hash_map.cljc on classpath. Please check that namespaces with dashes use underscores in the Clojure file name.

What am I doing wrong?
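For anyone hitting this: the error message itself points at the convention. Assuming the namespace matches the classpath in the message above, the file on disk uses underscores but the namespace symbol in require/use forms is written with dashes:

```clojure
;; file: clj_fast/collections/concurrent_hash_map.clj  (underscores)
;; ns:   clj-fast.collections.concurrent-hash-map      (dashes)
(require '[clj-fast.collections.concurrent-hash-map :as chm])
```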

static-merge

If we enforce the invariant that we have literal maps, with distinct keys, we can merge statically with type hints and avoid iteration costs. I noticed I do this "a lot" in my legacy code, so this refactoring came out during optimization passes. It only works with Associatives.

(defmacro static-merge
  [& ms]
  (assert (every? map? (rest ms)))
  (assert (every? #(= 1 %) (->> (rest ms) (mapcat keys) frequencies vals)))
  (let [kvs (mapcat seq (rest ms))]
    (reduce (fn [acc [k v]]
              `(assoc ~acc ~k ~v)) (first ms) kvs)))

;;type-hinted method invocations avoid clojure.lang.RT and shave off a bit of overhead
(defmacro static-merge2
  [& ms]
  (assert (every? map? (rest ms)))
  (assert (every? #(= 1 %) (->> (rest ms) (mapcat keys) frequencies vals)))
  (let [kvs (mapcat seq (rest ms))
        assoc! (fn [m k v]
                 `(.assoc  ~(with-meta m {:tag 'clojure.lang.Associative})
                              ~k ~v))]
    (reduce (fn [acc [k v]]
              (assoc! acc k v)) (first ms) kvs)))
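For illustration, the second variant unrolls at compile time into a chain of direct, type-hinted .assoc calls rather than any runtime iteration; hand-expanded (approximately — the real expansion hints every call the same way):

```clojure
;; (static-merge2 m {:c 4 :e 5}) expands to roughly:
(let [m {:a 2 :b 3}]
  (.assoc ^clojure.lang.Associative
          (.assoc ^clojure.lang.Associative m :c 4)
          :e 5))
;; => {:a 2, :b 3, :c 4, :e 5}
```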

Fastest to slowest. Not sure if there's any downside to using .assoc directly over clojure.core/assoc (assumedly clojure.core/assoc is more general, but...)

user> (let [m {:a 2 :b 3}] (c/quick-bench (static-merge2  m {:c 4 :e 5} {:d 6 :f 7} {:h 9 :j 10})))
Evaluation count : 2711934 in 6 samples of 451989 calls.
             Execution time mean : 221.200769 ns
    Execution time std-deviation : 1.056652 ns
   Execution time lower quantile : 219.797515 ns ( 2.5%)
   Execution time upper quantile : 222.274280 ns (97.5%)
                   Overhead used : 1.859794 ns
nil
user> (let [m {:a 2 :b 3}] (c/quick-bench (static-merge  m {:c 4 :e 5} {:d 6 :f 7} {:h 9 :j 10})))
Evaluation count : 2555754 in 6 samples of 425959 calls.
             Execution time mean : 231.992038 ns
    Execution time std-deviation : 2.440858 ns
   Execution time lower quantile : 230.296052 ns ( 2.5%)
   Execution time upper quantile : 235.110715 ns (97.5%)
                   Overhead used : 1.859794 ns
nil
user> (let [m {:a 2 :b 3}] (c/quick-bench (fast-merge  m {:c 4 :e 5} {:d 6 :f 7} {:h 9 :j 10})))
Evaluation count : 870594 in 6 samples of 145099 calls.
             Execution time mean : 695.676793 ns
    Execution time std-deviation : 9.920187 ns
   Execution time lower quantile : 685.828793 ns ( 2.5%)
   Execution time upper quantile : 711.901821 ns (97.5%)
                   Overhead used : 1.859794 ns

Found 2 outliers in 6 samples (33.3333 %)
	low-severe	 1 (16.6667 %)
	low-mild	 1 (16.6667 %)
 Variance from outliers : 13.8889 % Variance is moderately inflated by outliers
nil
user> (let [m {:a 2 :b 3}] (c/quick-bench (merge  m {:c 4 :e 5} {:d 6 :f 7} {:h 9 :j 10})))
Evaluation count : 567864 in 6 samples of 94644 calls.
             Execution time mean : 1.063203 µs
    Execution time std-deviation : 8.960491 ns
   Execution time lower quantile : 1.053826 µs ( 2.5%)
   Execution time upper quantile : 1.073839 µs (97.5%)
                   Overhead used : 1.859794 ns
nil

In this sample, static-merge is ~4.8x faster than clojure.core/merge, and ~3.14x (lol) faster than the fast-merge function submitted earlier.

Better get-in behavior (maybe already implemented)

So the default clojure.core nested operations like get-in can be optimized if we have a path of literals known a-priori. We can compile that down to a sequence of lookups (preferably optimized .get ops on a java.util.Map or .valAt, or for broad generality, clojure.core/get). This avoids creating a vector at runtime, reducing over the vector, etc.

We can't do that in the case where values are passed in at runtime and we don't know the compile-time path (e.g. we have a path vector that includes symbols).

We can still kick clojure.core/get-in to the curb if we replace the default implementation with one that's slightly restrictive on types, in this case java.util.Map:

(defmacro get-in-map [m ks]
  (if (seq ks)
    `(let [^java.util.Map m# ~m]
       (if-let [res# (.get ^java.util.Map m# ~(first ks))]
         (get-in-map res# ~(rest ks))))
    `~m))

The only restriction is that we provide a literal collection rather than a symbol for the path, but it works with normal clojure idioms.

user> (let [m {:a {:b {:c 3}}} x :c] (time (dotimes [i 1000000] (get-in m [:a :b x]))))
"Elapsed time: 116.3892 msecs"

user> (let [m {:a {:b {:c 3}}} x :c] (time (dotimes [i 1000000] (get-in-map m [:a :b x]))))
"Elapsed time: 28.9272 msecs"

user> (let [m {:a {:b {:c 3}}} x :c]  (get-in-map m [:a :b x]))
3

I dug this up since I overlooked get-in before when doing deep-assoc and friends, and I did not at the time know how costly clojure.core/get is.

Oh, and unlike clojure.core/get-in, this variant is naturally short-circuiting if the path leads to nothing, so we get a bit more optimization (potentially a lot if there are deeply nested lookups):

user> (let [m {:a {:b {:c 3}}} x :d] (time (dotimes [i 1000000] (get-in m [:a x :c]))))
"Elapsed time: 119.1608 msecs"

user> (let [m {:a {:b {:c 3}}} x :d] (time (dotimes [i 1000000] (get-in-map m [:a x :c]))))
"Elapsed time: 19.3677 msecs"

Let me know if this is old news :)
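One caveat worth flagging: because the expansion uses if-let, a stored nil or false is indistinguishable from a missing key, so paths through falsey values short-circuit where clojure.core/get-in would continue. A small sketch of the difference:

```clojure
(get-in     {:a {:b false}} [:a :b]) ;; => false
(get-in-map {:a {:b false}} [:a :b]) ;; => nil, treats false as absent
```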
