
didgetmaster | 1 month ago

Interesting. Are you willing to try out some 'experimental' software?

As I indicated in my previous post, I have a unique kind of data management system that I have built over the years as a hobby project.

It was originally designed to be a replacement for conventional file systems. It is an object store in which you can keep millions or billions of files in a single container and attach metadata tags to each one; searches for data can then be driven by those tags. I had to design a whole new kind of metadata manager to handle the tags.
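To make the idea concrete, here is a minimal sketch of a tag-indexed object store. All names (`TagStore`, `put`, `find`) are my own illustration, not the author's API: an inverted index maps each (tag, value) pair to the set of object ids that carry it, so a tag search is just a set intersection.

```python
# Hypothetical sketch, not the author's code: objects get an id, and an
# inverted index maps (tag, value) pairs to the ids carrying them.
from collections import defaultdict

class TagStore:
    def __init__(self):
        self.objects = {}              # id -> payload
        self.index = defaultdict(set)  # (tag, value) -> {ids}
        self.next_id = 0

    def put(self, payload, tags):
        oid = self.next_id
        self.next_id += 1
        self.objects[oid] = payload
        for tag, value in tags.items():
            self.index[(tag, value)].add(oid)
        return oid

    def find(self, **tags):
        """Return ids of objects matching every (tag, value) pair."""
        sets = [self.index[(t, v)] for t, v in tags.items()]
        return set.intersection(*sets) if sets else set()

store = TagStore()
a = store.put(b"scan1", {"patient": "p1", "modality": "MRI"})
b = store.put(b"scan2", {"patient": "p2", "modality": "MRI"})
print(store.find(modality="MRI"))                 # both ids
print(store.find(patient="p1", modality="MRI"))   # just the first id
```

Searching on several tags at once narrows the result with each intersection, which is the behavior the comment describes.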

Since thousands or millions of different kinds of tags could be defined, each with thousands or millions of unique values, the whole system started to look like a very wide, sparse relational table.

I found that I could use the individual 'columnar stores' that I built to also build conventional database tables. I was actually surprised at how well it worked when I started benchmarking it against popular database engines.
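A minimal sketch of how per-column stores can back a conventional table, under my own assumptions rather than the author's design: each column is its own sparse map of row number to value, so a row only exists where it has data, and a query materializes only the columns it asks for.

```python
# Illustrative columnar table (my sketch, not the author's code): one
# sparse dict per column, rows assembled on demand.
class ColumnarTable:
    def __init__(self, columns):
        self.columns = {name: {} for name in columns}
        self.rows = 0

    def insert(self, values):
        r = self.rows
        self.rows += 1
        for name, v in values.items():
            self.columns[name][r] = v   # absent columns cost nothing
        return r

    def select(self, names):
        """Project only the requested columns, row by row."""
        for r in range(self.rows):
            yield tuple(self.columns[n].get(r) for n in names)

t = ColumnarTable(["id", "name", "age"])
t.insert({"id": 1, "name": "Ada"})
t.insert({"id": 2, "name": "Bob", "age": 41})
print(list(t.select(["name", "age"])))   # [('Ada', None), ('Bob', 41)]
```

Rows with missing values come back as `None` in the projection, which mirrors how a sparse wide table behaves.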

I would test my code by downloading and importing various public datasets and then running analytics against that data. My system handles both analytic and transactional operations pretty well.

Most of the datasets had only a few dozen columns, and many had millions of rows, but I didn't find any with over a thousand columns.

As I said before, I had previously only tested it out to 10,000 columns. But since reading your original question, I started to play with large numbers of columns.

After tweaking the code, I got it to create tables with up to a million columns and add some random test data to them. A 'SELECT *' query against such a table can take a long time, but queries that return only a few dozen of the columns ran very fast.
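The fast-projection behavior falls naturally out of a one-store-per-column layout. A sketch under my own assumptions (not the author's code): query cost scales with the columns *selected*, not the million columns *defined*, because untouched column stores are never opened.

```python
# My illustration of why narrow projections stay fast on a very wide,
# sparse table: column stores exist only once written, and a query
# opens only the stores it names.
from collections import defaultdict

NUM_COLUMNS = 1_000_000        # schema width (names only, nothing allocated)
columns = defaultdict(dict)    # a store appears only when first written

for row in range(1000):        # populate just two of the million columns
    columns["c7"][row] = row * 2
    columns["c42"][row] = row * 3

def select(names, num_rows):
    """Project the named columns; unnamed stores are never touched."""
    return [tuple(columns[n].get(r) for n in names)
            for r in range(num_rows)]

print(select(["c7", "c42"], 3))   # [(0, 0), (2, 3), (4, 6)]
print(len(columns))               # 2 stores materialized, not 1,000,000
```

A `SELECT *`, by contrast, would have to visit all one million column names per row, which is why it takes so long on the same table.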

How many patients were represented in your dataset? I assume that most rows did not have a value in every column.
