Babbage: A Clojure library for accumulation and graph computation

[+] kenko|13 years ago|reply

Here's the announcement email (essentially a very boiled-down version of the README):

babbage is a library for easily gathering data and computing summary measures in a declarative way.

The summary measure functionality allows you to compute multiple measures over arbitrary partitions of your input data simultaneously and in a single pass. You just say what you want to compute:

    > (def my-fields {:y (stats :y count)
                      :x (stats :x count)
                      :both (stats #(+ (or (:x %) 0) (or (:y %) 0)) count sum mean)})

and the sets that are of interest:

    > (def my-sets (-> (sets {:has-y #(contains? % :y)})
                       (complement :has-y))) ;; could also take intersections, unions

And then run it with some data:

    > (calculate my-sets my-fields [{:x 1 :y 2} {:x 10} {:x 4 :y 3} {:x 5}])
    {:not-has-y
     {:y {:count 0}, :x {:count 2}, :both {:mean 7.5, :sum 15, :count 2}},
     :has-y
     {:y {:count 2}, :x {:count 2}, :both {:mean 5.0, :sum 10, :count 2}},
     :all
     {:y {:count 2}, :x {:count 4}, :both {:mean 6.25, :sum 25, :count 4}}}

The functions :x, :y, and #(+ (or (:x %) 0) (or (:y %) 0)) defined in the fields map are called once per input element no matter how many sets the element contributes to. The function #(contains? % :y) is also called once per input element, no matter how many unions, intersections, complements, etc. the set :has-y contributes to.
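To make the "once per input element" point concrete, here is a minimal sketch in plain Clojure. It does not use babbage and is not how babbage is implemented; `records`, `both`, `add-value`, `step`, and `summarize` are hypothetical names made up for illustration. It only shows the general idea of evaluating the predicate and the extractor once per record and then updating every set the record belongs to, all in one pass.

```clojure
;; Plain-Clojure sketch; none of these names come from babbage.
(def records [{:x 1 :y 2} {:x 10} {:x 4 :y 3} {:x 5}])

(defn both
  "Extractor: sum of :x and :y, treating a missing key as 0."
  [r]
  (+ (or (:x r) 0) (or (:y r) 0)))

(defn add-value
  "Fold one extracted value v into a {:count :sum} accumulator."
  [acc v]
  (-> (or acc {:count 0 :sum 0})
      (update :count inc)
      (update :sum + v)))

(defn step
  "Process one record: evaluate the predicate and the extractor once each,
  then update the accumulator of every set the record belongs to."
  [result r]
  (let [has-y? (contains? r :y)   ; predicate evaluated once per record
        v      (both r)           ; extractor evaluated once per record
        sets   [:all (if has-y? :has-y :not-has-y)]]
    (reduce (fn [res s] (update res s add-value v)) result sets)))

(defn summarize
  "Single pass over rs, then derive :mean from :sum and :count."
  [rs]
  (into {}
        (map (fn [[k acc]]
               [k (assoc acc :mean (double (/ (:sum acc) (:count acc))))]))
        (reduce step {} rs)))

(summarize records)
```

The :sum, :count, and :mean values this produces for :all, :has-y, and :not-has-y should line up with the :both entries in the output above.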

A variety of measure functions, and structured means of combining them, are supplied; it's also easy to define additional measures.

babbage also supplies a method for running computations structured as dependency graphs; this can make gathering the initial data for summarizing simpler to express. To give an example that's probably familiar from another context:

    > (defgraphfn sum [xs]
        (apply + xs))
    > (defgraphfn sum-squared [xs]
        (sum (map #(* % %) xs)))
    > (defgraphfn count-input :count [xs]
        (count xs))
    > (defgraphfn mean [count sum]
        (double (/ sum count)))
    > (defgraphfn mean2 [count sum-squared]
        (double (/ sum-squared count)))
    > (defgraphfn variance [mean mean2]
        (- mean2 (* mean mean)))
    > (run-graph {:xs [1 2 3 4]} sum variance sum-squared count-input mean mean2)
    {:sum 10
     :count 4
     :sum-squared 30
     :mean 2.5
     :variance 1.25
     :mean2 7.5
     :xs [1 2 3 4]}

Options are provided for parallel, sequential, and lazy computation of the elements of the result map, and for resolving the dependency graph in advance of running the computation for a given input, either at runtime or at compile time.
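For anyone curious what resolving a dependency graph amounts to, here is a rough sketch in plain Clojure. This is an assumption-laden illustration, not babbage's API or implementation: the `nodes` map format and the `run-nodes` function are hypothetical, and dependencies are listed explicitly rather than read from argument names as `defgraphfn` does. It evaluates sequentially, computing each node as soon as everything it depends on is present in the result map.

```clojure
;; Plain-Clojure sketch; the node-map format and run-nodes are hypothetical,
;; not babbage's API.
(def nodes
  {:sum         {:deps [:xs]                 :f (fn [xs] (apply + xs))}
   :sum-squared {:deps [:xs]                 :f (fn [xs] (apply + (map #(* % %) xs)))}
   :count       {:deps [:xs]                 :f (fn [xs] (count xs))}
   :mean        {:deps [:count :sum]         :f (fn [n s] (double (/ s n)))}
   :mean2       {:deps [:count :sum-squared] :f (fn [n s2] (double (/ s2 n)))}
   :variance    {:deps [:mean :mean2]        :f (fn [m m2] (- m2 (* m m)))}})

(defn run-nodes
  "Repeatedly evaluate every node whose dependencies are already in the
  result map until all nodes are computed. Throws if no progress is possible."
  [nodes inputs]
  (loop [result  inputs
         pending nodes]
    (if (empty? pending)
      result
      (let [ready (filter (fn [[_ {:keys [deps]}]]
                            (every? #(contains? result %) deps))
                          pending)]
        (when (empty? ready)
          (throw (ex-info "Unsatisfiable or cyclic dependencies"
                          {:pending (keys pending)})))
        (recur (reduce (fn [res [k {:keys [deps f]}]]
                         ;; look up each dependency's value and call the node fn
                         (assoc res k (apply f (map res deps))))
                       result
                       ready)
               (apply dissoc pending (map first ready)))))))

(run-nodes nodes {:xs [1 2 3 4]})
```

Each pass of the loop handles one "layer" of the graph (first the nodes that need only :xs, then :mean and :mean2, then :variance). Nodes within a layer do not depend on each other, which is what would make parallel evaluation possible in principle.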

[+] defrost|13 years ago|reply

Cutting to the chase: does this make the summary results available in the midst of the sequence? E.g., if it takes two hours to gather pressure data (or any other time series data), does this expose the running variance 10 minutes in, an hour in, etc.?

[+] eschulte|13 years ago|reply

I actually wrote something similar in bash, which I use frequently when I need to munge a table of numbers on the command line [1]. The whole time I was thinking I should really be doing this in Common Lisp.

[1] http://eschulte.github.com/data-wrapper/

[+] yayitswei|13 years ago|reply

This will be great for building our stats dashboard. Thanks!

[+] innovate|13 years ago|reply

We use this actively internally @ReadyForZero for a variety of analyses; hopefully it's helpful for you and others.

[+] furqanrydhan|13 years ago|reply

This is great!
