michaelg7x | 1 year ago
KDB is in some respects as dumb as a bag of rocks. There is no execution profiler, no explain plan, no query analysis at all. When running your query over tabular data it simply applies the where-clause constraints in order, passing the boolean vector result from one to the next, which refines the rows still under active consideration. It's for this reason that newbies are always told to put the date constraint first, or they'll try to load the entire history (typically of VOD.L) into memory.
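To make that concrete, here's a sketch against a hypothetical date-partitioned `trade` table (names invented):

```q
/ good: the date constraint runs first, so only one partition is touched
select from trade where date=2023.01.16, sym=`VOD.L, size>1000

/ bad: there is no planner to reorder this for you -- q works through
/ the whole history before the date constraint gets a chance to narrow it
select from trade where sym=`VOD.L, size>1000, date=2023.01.16
```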
KDB really is very fast at processing vector data. Writing nested vectors or dictionaries to individual cells could easily be slowing you down; I've heard of one approach which writes nested dictionaries into vectors with the addition of a column to contain the dictionary keys. Then you get KDB to go faster over the 1-D data, nicely laid out on disk. You really do need to write it down in a way that is sympathetic to the way you will eventually process it.
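As a toy sketch of that flattening idea (invented names), instead of a dictionary per cell you store one row per dictionary key:

```q
/ one dictionary per row: each cell is its own object, poor locality on disk
nested:([] sym:`VOD.L`BARC.L; greeks:(`delta`vega!0.52 0.11; `delta`gamma!0.47 0.02))

/ one row per key: three flat vectors, nicely laid out,
/ which q scans (and compresses) far more happily
flat:([] sym:`VOD.L`VOD.L`BARC.L`BARC.L;
  greek:`delta`vega`delta`gamma;
  val:0.52 0.11 0.47 0.02)

exec val from flat where sym=`VOD.L, greek=`delta
```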
You can create hashmap indices over column data, but the typical way of writing down equity L1 data is to "partition by date" (write it into a date directory) and "apply the parted attribute" to the symbol column (group by symbol, sort by time ascending). Each of the remaining vectors (time, price, size, exchange, whatnot) is obviously sorted to match, and finding the next or previous trade for a given symbol is O(1) simplicity itself. I've never worked on options data and so can't opine on the problems it presents, but if you've been asked to write this down without any help, then it's pretty "rubbish" of the KDB guys in your firm. You have asked for help, right?
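The canonical one-liner for that write-down is `.Q.dpft` (directory, partition, parted field, table name) -- a sketch, assuming an in-memory `trade` table that's already time-sorted:

```q
/ .Q.dpft enumerates symbols against the HDB's sym file, sorts the table
/ by sym (stably, so time order within each sym survives), applies the
/ parted (`p#) attribute to sym, and splays it under the date directory
.Q.dpft[`:/data/hdb; 2023.01.16; `sym; `trade]
```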
I'm really going on a bit but just a few more things:
- KDB will compress IPC data — if it wants to. The data needs to exceed some size threshold and, I think, it must be travelling between hosts: it won't bother compressing to localhost, at least according to some wisdom received from one of the guys at Kx many moons ago. The IPC format itself is more or less a tag-length-value format, and good enough. It gained support for vectors bigger than INT32_MAX a while ago, but most IPC-interop libraries don't tend to advertise support for that later version that lets you send silly-big amounts of data around, so my guess is you may not want to load data out of KDB a day at a time. Try to do the processing in KDB!
You said you're scared to do queries that return a lot of data, and that it often freezes. Are you sure the problem is at the KDB end? This may sound glib, but you wouldn't be the first person to have been given a VM for dev work that isn't quite up to the job. You can find out the size of the payload you're trying to read by running the same query under the `-22!` internal function: it'll tell you how many bytes it's trying to send. Surely there's help to be had from the KDB guys if you reach out?
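For example (hypothetical query), `-22!` serialises its argument and just reports the length instead of sending anything:

```q
/ bytes the IPC payload would occupy, without actually shipping it
-22! select from trade where date=2023.01.16, sym=`VOD.L
```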
- I'm confused by the use of the term "data lake": to me this includes unstructured data. I'm not sure I'd ever characterise a KDB HDB as such.
- If your firm has had KDB for ages there's a good chance it's big enough to be signed up to one of the research groups who maintain a set of test-suites they will run over a vendor's latest hardware offering, letting them claim the crown for the fastest Greeks or something. If your firm is a member you may be able to access the test-suites and look at how the data in the options tests is being written and read, and there are quite a few, I think.
- KDB can scale horizontally. It can employ a number (I forget whether it's bounded) of slave instances and farm out work; I think I read that the latest version has a better work-stealing algo. It's often about the data, though: if the data for a particular symbol/date tuple is on that one server over there, then you're probably better off doing big historic reads on that server alone. I doubt very much you're compute-bound, or you'd have told us that your KDB licence limited you to one core, or N cores, rather than any number.
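For what it's worth, the farming-out is explicit: you start q with secondary threads and use `peach`. An illustrative sketch (invented query):

```q
/ started with e.g.: q -s 4   (four secondary threads)
/ peach runs the lambda for each sym on a different secondary thread
{exec sum size from trade where date=2023.01.16, sym=x} peach `VOD.L`BARC.L`HSBA.L
```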
- Many years ago I was told never to run KDB on NFS. Except Solaris' NFS. I have no idea whether this is relevant ;)
Good luck, sonthonax
sonthonax | 1 year ago
But firstly:
> If your firm has had KDB for ages there's a good chance it's big enough to be signed up to one of the research groups who maintain a set of test-suites they will run over a vendor's latest hardware offering, letting them claim the crown for the fastest Greeks or something. If your firm is a member you may be able to access the test-suites and look at how the data in the options tests is being written and read, and there are quite a few, I think.
Unfortunately my firm isn't that big: ~150 people in total and maybe ~40 developers, of which two are full-time KDB devs whose job is mostly maintaining the ingestion and writing some quite basic functions like `as_of`. Only two people work on our options desk as developers, so there's a lack of resourcing for KDB. When I have these performance issues with KDB, it's quite hard to get support within my firm from the two KDB devs (one of whom is very junior).
> I've never worked on options data and so can't opine on the problems it presents
The thing about options data is that it's generally lower frequency but a lot more complex. If spot data is one-dimensional and futures data is two-dimensional, options are three-dimensional. You also have a lot more parameterizations, which leads me to the second point :)
> you may not want to load data out of KDB a day at a time. Try to do the processing in KDB
Just to give you a very specific example of the processing I need to do. I have a data structure in KDB like this (sort of typescript notation):
This is a vol surface. Converting that into a volatility takes several steps. Then, in order to calculate the log_moneyness, I need to calculate the forward price from an interest rate, which is slightly more straightforward. Now I have a base from which I can start generating data like the delta, but this also requires a lot of math.
I was pulling this stuff out of KDB because I already had my code in rust that does all of this.
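That said, the forward/log-moneyness step is the kind of thing that would stay cheap if done KDB-side, as suggested above. A sketch with made-up column names:

```q
/ forward from spot s, continuously compounded rate r, year fraction t
fwd:{[s;r;t] s*exp r*t}
/ log-moneyness of strike k against forward f
logm:{[k;f] log k%f}

/ hypothetical surface table with spot/rate/tau/strike columns
update lm:logm[strike; fwd[spot;rate;tau]] from surface
```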
> You said you're scared to do queries that return a lot of data, and that it often freezes. Are you sure the problem is at the KDB end?
Yeah, I'm pretty sure in my case. We have some functions for getting data, written by the KDB guys. Even functions that return thirty-something rows, like an as_of query, take ~10s.
rak1507 | 1 year ago