michaelg7x | 1 year ago
KDB is in some respects as dumb as a bag of rocks. There is no execution profiler, no explain plan, no query analysis at all. When running your query over tabular data it simply applies the where-clause constraints in order, passing the boolean vector result from one to the next, which refines the rows still under active consideration. It's for this reason that newbies are always told to put the date constraint first, or they'll try to load the entire history (typically of VOD.L) into memory.
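To make that concrete, here's a sketch against a hypothetical date-partitioned `trade` table (names invented):

```q
/ good: the date constraint runs first, so only one partition is touched
select from trade where date=2023.01.16, sym=`VOD.L, size>1000

/ bad: there is no planner to reorder this for you -- q works through
/ the whole history before the date constraint gets a chance to narrow it
select from trade where sym=`VOD.L, size>1000, date=2023.01.16
```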
KDB really is very fast at processing vector data. Writing nested vectors or dictionaries to individual cells could easily be slowing you down; I've heard of one approach which writes nested dictionaries into vectors with the addition of a column to contain the dictionary keys. Then you get KDB to go faster over the 1-D data, nicely laid out on disk. You really do need to write it down in a way that is sympathetic to the way you will eventually process it.
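As a toy sketch of that flattening idea (invented names), instead of a dictionary per cell you store one row per dictionary key:

```q
/ one dictionary per row: each cell is its own object, poor locality on disk
nested:([] sym:`VOD.L`BARC.L; greeks:(`delta`vega!0.52 0.11; `delta`gamma!0.47 0.02))

/ one row per key: three flat vectors, nicely laid out,
/ which q scans (and compresses) far more happily
flat:([] sym:`VOD.L`VOD.L`BARC.L`BARC.L;
  greek:`delta`vega`delta`gamma;
  val:0.52 0.11 0.47 0.02)

exec val from flat where sym=`VOD.L, greek=`delta
```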
You can create hashmap indices over column data, but the typical way of writing down equity L1 data is to "partition by date" (write it into a date directory) and "apply the parted attribute" to the symbol column (group by symbol, sort by time ascending). Each of the remaining vectors (time, price, size, exchange, whatnot) is obviously sorted to match, and finding the next or previous trade for a given symbol is O(1) simplicity itself. I've never worked on options data and so can't opine on the problems it presents, but if you've been asked to write this down without any help, then it's pretty "rubbish" of the KDB guys in your firm. You have asked for help, right?
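The canonical one-liner for that write-down is `.Q.dpft` (directory, partition, parted field, table name) -- a sketch, assuming an in-memory `trade` table that's already time-sorted:

```q
/ .Q.dpft enumerates symbols against the HDB's sym file, sorts the table
/ by sym (stably, so time order within each sym survives), applies the
/ parted (`p#) attribute to sym, and splays it under the date directory
.Q.dpft[`:/data/hdb; 2023.01.16; `sym; `trade]
```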
I'm really going on a bit but just a few more things:
- KDB will compress IPC data — if it wants to. The data needs to exceed some size threshold and, I think, it must be travelling between hosts: it won't bother compressing to localhost, at least according to some wisdom received from one of the guys at Kx many moons ago. The IPC format itself is more or less a tag-length-value format, and good enough. It gained support for vectors bigger than INT32_MAX a while ago, but most IPC-interop libraries don't tend to advertise support for that later version that lets you send silly-big amounts of data around, so my guess is you may not want to load data out of KDB a day at a time. Try to do the processing in KDB!
You said you're scared to do queries that return a lot of data, and that it often freezes. Are you sure the problem is at the KDB end? This may sound glib, but you wouldn't be the first person to have been given a VM for dev work that isn't quite up to the job. You can find out the size of the payload you're trying to read by running the same query under the `-22!` internal function: it'll tell you how many bytes it's trying to send. Surely there's help to be had from the KDB guys if you reach out?
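For example (hypothetical query), `-22!` serialises its argument and just reports the length instead of sending anything:

```q
/ bytes the IPC payload would occupy, without actually shipping it
-22! select from trade where date=2023.01.16, sym=`VOD.L
```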
- I'm confused by the use of the term "data lake": to me this includes unstructured data. I'm not sure I'd ever characterise a KDB HDB as such.
- If your firm has had KDB for ages there's a good chance it's big enough to be signed up to one of the research groups who maintain a set of test-suites they will run over a vendor's latest hardware offering, letting them claim the crown for the fastest Greeks or something. If your firm is a member you may be able to access the test-suites and look at how the data in the options tests is being written and read, and there are quite a few, I think.
- KDB can scale horizontally. It can employ a number (I forget whether it's bounded) of slave instances and farm out work; I think I read that the latest version has a better work-stealing algo. It's often about the data, though: if the data for a particular symbol/date tuple is on that one server over there, then you're probably better off doing big historic reads on that server alone. I doubt very much you're compute-bound, or you'd have told us that your KDB licence limited you to one core, or N cores, rather than any number.
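For what it's worth, the farming-out is explicit: you start q with secondary threads and use `peach`. An illustrative sketch (invented query):

```q
/ started with e.g.: q -s 4   (four secondary threads)
/ peach runs the lambda for each sym on a different secondary thread
{exec sum size from trade where date=2023.01.16, sym=x} peach `VOD.L`BARC.L`HSBA.L
```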
- Many years ago I was told never to run KDB on NFS. Except Solaris' NFS. I have no idea whether this is relevant ;)
Good luck, sonthonax
sonthonax | 1 year ago
But firstly:
> If your firm has had KDB for ages there's a good chance it's big enough to be signed up to one of the research groups who maintain a set of test-suites they will run over a vendor's latest hardware offering, letting them claim the crown for the fastest Greeks or something. If your firm is a member you may be able to access the test-suites and look at how the data in the options tests is being written and read, and there are quite a few, I think.
Unfortunately my firm isn't that big: ~150 people in total and maybe ~40 developers, of which two are full-time KDB devs whose job is mostly maintaining the ingestion and writing some quite basic functions like `as_of`. Only two people work on our options desk as developers, so there's a lack of resourcing for KDB. When I have these performance issues with KDB, it's quite hard to get support within my firm from the two KDB devs (one of whom is very junior).
> I've never worked on options data and so can't opine on the problems it presents
The thing about options data is that it's generally lower frequency but a lot more complex. If spot data is one-dimensional and futures data is two-dimensional, options are three-dimensional. You also have a lot more parameterizations, which leads me to the second point :)
> you may not want to load data out of KDB a day at a time. Try to do the processing in KDB
Just to give you a very specific example of the processing I need to do. I have a data structure in KDB like this (sort of typescript notation):
This is a vol surface. Converting that into a volatility takes several steps. Then, in order to calculate the log_moneyness, I need to calculate the forward price from an interest rate, which is slightly more straightforward. Now I have a base from which I can start generating data like the delta, but this also requires a lot of math.
I was pulling this stuff out of KDB because I already had my code in rust that does all of this.
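That said, the forward/log-moneyness step is the kind of thing that would stay cheap if done KDB-side, as suggested above. A sketch with made-up column names:

```q
/ forward from spot s, continuously compounded rate r, year fraction t
fwd:{[s;r;t] s*exp r*t}
/ log-moneyness of strike k against forward f
logm:{[k;f] log k%f}

/ hypothetical surface table with spot/rate/tau/strike columns
update lm:logm[strike; fwd[spot;rate;tau]] from surface
```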
> You said you're scared to do queries that return a lot of data, and that it often freezes. Are you sure the problem is at the KDB end?
Yeah, I'm pretty sure in my case. We have some functions for getting data, written by the KDB guys. Even functions that return thirty-something rows, like an as_of query, take ~10s.
rak1507 | 1 year ago