mbroecheler's comments

mbroecheler | 11 months ago | on: OpenAI adds MCP support to Agents SDK

Second that. A lot of our use cases are "remote tooling", i.e. calling APIs. Implementing an MCP server to wrap APIs seems very complex - both in terms of implementation and infrastructure.

We have found GraphQL to be a great "semantic" interface for API tool definitions, since a GraphQL schema allows descriptions in the spec and is very human-readable. For "data-heavy" AI use cases, the flexibility of GraphQL lets you expose different levels of "data depth", which is very useful for controlling the cost (i.e. context window usage) and performance of LLM apps.
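
As a sketch of the "data depth" idea (all names and token numbers below are made up for illustration, and this is not the acorn.js API): the same GraphQL field can be exposed through query variants of different depth, and the tool layer can pick the deepest variant whose estimated result still fits the context budget.

```python
# Toy sketch: choose a GraphQL query variant by "data depth" so the tool
# result stays within an LLM context budget. Queries and cost numbers are
# illustrative only.

SHALLOW = """
query Orders($limit: Int) {
  orders(limit: $limit) { id status total }
}
"""

DEEP = """
query Orders($limit: Int) {
  orders(limit: $limit) {
    id status total
    items { sku quantity price }
    customer { name email }
  }
}
"""

def pick_query(estimated_rows: int, token_budget: int,
               tokens_per_row_deep: int = 60) -> str:
    """Return the deepest variant whose estimated result fits the budget."""
    if estimated_rows * tokens_per_row_deep <= token_budget:
        return DEEP
    return SHALLOW
```

The same schema serves both variants; only the selection set changes, which is what makes GraphQL convenient for this.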

In case anybody else wants to call GraphQL APIs as tools in their chatbot/agents/LLM apps, we open sourced a library for the boilerplate code: https://github.com/DataSQRL/acorn.js

mbroecheler | 1 year ago | on: The future of kdb+?

I agree that being able to write one piece of code that solves your use case is a big benefit over having to cobble together a message queue, stream processor, database, query engine, etc.

We've been playing around with the idea of building such an integration layer in SQL on top of open-source technologies like Kafka, Flink, Postgres, and Iceberg, with some syntactic sugar to make timeseries processing nicer in SQL: https://github.com/DataSQRL/sqrl/

The idea is to give you the power of kdb+ with open-source technologies and SQL in an integrated package by transpiling SQL, building the computational DAG, and then running a cost-based optimizer to "cut" the DAG to the underlying data technologies.
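
As a toy illustration of the "cutting" idea, deliberately simplified to a linear pipeline (this is not DataSQRL's actual optimizer): each step can either be maintained incrementally in the stream engine or evaluated at query time in the database, and the optimizer picks the cut that minimizes total cost.

```python
# Toy sketch of cost-based DAG cutting for a linear pipeline: steps before
# the cut run in the stream engine (e.g. Flink), steps after it run at
# query time in the database (e.g. Postgres). Cost numbers are made up.

def best_cut(precompute_cost, query_cost):
    """precompute_cost[i]: cost to maintain step i incrementally.
    query_cost[i]: cost to evaluate step i at query time.
    Returns (cut, total): steps < cut are precomputed, the rest are not."""
    n = len(precompute_cost)
    best = (0, sum(query_cost))  # cut=0: do everything at query time
    for cut in range(1, n + 1):
        total = sum(precompute_cost[:cut]) + sum(query_cost[cut:])
        if total < best[1]:
            best = (cut, total)
    return best
```

A real optimizer works on a DAG with many more constraints (state size, freshness, consistency), but the shape of the decision is the same.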

mbroecheler | 2 years ago | on: Uplevel database development with DataSQRL: A compiler for the data layer

You are totally right. We did not want to create a new language, and we are trying to keep it as close to SQL as possible. The problem is that SQL lacks the streaming constructs you need for temporal joins or for creating streams from relational tables. Jennifer Widom's group at Stanford did a lot of work on this (e.g. [1]). We are adding their operators to SQL in a way that is hopefully "easy enough". The rest is just syntactic sugar.
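
For illustration, here is a toy version of one such construct, a temporal join, where each stream event joins against the table version that was valid at the event's timestamp (names and data shapes are made up):

```python
import bisect

# Toy temporal join: an event at time ts joins the latest version of a
# dimension row with valid_from <= ts. Illustrative only; stream engines
# implement this with versioned state, not in-memory lists.

def temporal_join(events, versions):
    """events: [(ts, key)]; versions: {key: [(valid_from, value)] sorted
    ascending by valid_from}. Returns [(ts, key, value_at_ts)]."""
    out = []
    for ts, key in events:
        hist = versions.get(key, [])
        # index of the last version that became valid at or before ts
        i = bisect.bisect_right([v[0] for v in hist], ts) - 1
        out.append((ts, key, hist[i][1] if i >= 0 else None))
    return out
```

Plain SQL has no way to say "join against the row version as of the event time", which is exactly the kind of operator CQL-style extensions add.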

But we are not tied to SQRL and totally open to ideas for making the language piece less of a hurdle.

GPT-4 is surprisingly good at writing SQRL scripts with few-shot learning.

You are also right on the schema piece. We are trying to track schemas like dependencies in software engineering. So you can keep them in a repo and let a package manager + compiler handle schema compatibility and synchronization. https://dev.datasqrl.com/ is an early prototype of the repository idea.
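
As a hypothetical sketch of the kind of check such a package manager + compiler could run (not DataSQRL's actual implementation; the schema representation here is made up):

```python
# Toy backward-compatibility check for tracked schemas: a new schema
# version stays compatible with old readers if it keeps every existing
# field at the same type and only adds optional fields.

def backward_compatible(old: dict, new: dict) -> bool:
    """Schemas as {field: (type, required)}."""
    for field, (ftype, _required) in old.items():
        # removing or retyping a field breaks consumers of the old schema
        if field not in new or new[field][0] != ftype:
            return False
    # newly added fields must be optional so old writers remain valid
    return all(not req for f, (_t, req) in new.items() if f not in old)
```

A package manager can run this on every dependency bump, the same way a compiler catches breaking API changes in code.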

[1] Arasu, A., Babu, S., & Widom, J. (2006). The CQL continuous query language: semantic foundations and query execution. The VLDB Journal, 15, 121-142.

mbroecheler | 2 years ago | on: Uplevel database development with DataSQRL: A compiler for the data layer

Exactly, there are so many amazing dataflow engines, stream processors, and databases out there. We are not competing with those.

We are trying to "compile away" all of the data plumbing code you have to write to integrate those systems into your application, so that it becomes easier to use them.

MySQL support in DataSQRL is definitely on the short-list.

mbroecheler | 2 years ago | on: Uplevel database development with DataSQRL: A compiler for the data layer

Yes, the idea of maintaining materialized views for standing queries so that queries become instantaneous is the same. In addition, DataSQRL handles the ingest (e.g. consuming events off a queue, pre-processing the data, and populating the database) and egress (i.e. serving the data through an API) so that all your data logic can live in one place.
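
A minimal sketch of that shared idea, a standing aggregation query maintained incrementally on ingest and served directly on read (illustrative only, not either system's implementation):

```python
# Toy materialized view for a standing query like
#   SELECT user, SUM(amount) FROM events GROUP BY user;
# updated per event on ingest, so reads are a single lookup.

class RunningTotals:
    def __init__(self):
        self.totals = {}

    def ingest(self, event):
        """Incremental update, e.g. per event consumed off a queue."""
        user, amount = event["user"], event["amount"]
        self.totals[user] = self.totals.get(user, 0) + amount

    def serve(self, user):
        """Instantaneous read, e.g. behind an API endpoint."""
        return self.totals.get(user, 0)
```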

Another key difference to Noria is that DataSQRL is an abstraction layer on top of existing technologies like Postgres, Flink, Kafka, etc and does not aim to be another datastore. That way, you can use the technologies you already trust without having to write the integration code.

mbroecheler | 10 years ago | on: Titan: Distributed Graph Database

I suppose our communication around Titan has caused some confusion after the acquisition by DataStax. As one of the Titan devs, I can say that we have no plans to abandon Titan. What we were trying to say is that we will have less time to dedicate to the project, in order to encourage others in the community to step up and contribute. That has happened. Over the last couple of months, other Titan users have actively helped out on the mailing list to get newcomers started and have contributed bugfixes and features via pull requests. This has allowed us to keep the Titan 1.0 release on its originally planned date. What we are trying to do is make the Titan project less dependent on Dan and myself and more open and inviting to other developers who wish to contribute. For instance, we have dedicated more time than before to reviewing PRs. I realize there is still more work that we need to do here, but so far the increased contributions have been an encouraging sign that we are heading in the right direction. So, Titan is here to stay and - as others have pointed out - there is more momentum than ever behind the project.

mbroecheler | 10 years ago | on: The Gremlin Graph Traversal Language

Take a look at Gremlin 3 - it now supports both declarative and imperative queries. In fact, you can even mix and match the two.

You want to match a complex pattern? Use declarative Gremlin so the query optimizer can figure out the best execution strategy for you. You have a highly custom path traversal? Use imperative Gremlin which gives you full control over the execution and provides you with everything you'd expect from a pipeline language. You have both? Combine them in a single traversal.
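
To illustrate the distinction outside Gremlin itself, here is a toy Python version of both styles over a tiny edge set: the imperative traversal fixes the execution order step by step, while the declarative pattern just states the triples to match and leaves ordering to a solver (here a naive backtracking one; a real query optimizer would reorder the pattern).

```python
# Toy graph as (subject, label, object) triples; not Gremlin, just an
# illustration of imperative vs declarative querying.

GRAPH = {("alice", "knows", "bob"), ("bob", "knows", "carol"),
         ("alice", "created", "app")}

def imperative_foaf(start):
    """Imperative: you dictate every hop of the two-step walk."""
    first = {t for s, l, t in GRAPH if s == start and l == "knows"}
    return {t for s, l, t in GRAPH if s in first and l == "knows"}

def declarative_match(pattern, binding=None):
    """Declarative: pattern is a list of (s, label, o) triples where terms
    starting with '?' are variables; returns all consistent bindings."""
    binding = binding or {}
    if not pattern:
        return [binding]
    (s, l, o), rest = pattern[0], pattern[1:]
    results = []
    for gs, gl, go in GRAPH:
        if gl != l:
            continue
        b = dict(binding)
        for term, value in ((s, gs), (o, go)):
            if term.startswith("?"):
                if b.get(term, value) != value:
                    b = None  # conflicts with an earlier binding
                    break
                b[term] = value
            elif term != value:
                b = None  # constant term does not match this triple
                break
        if b is not None:
            results += declarative_match(rest, b)
    return results
```

In Gremlin 3 the two styles correspond roughly to explicit step chains versus the pattern-matching step, and they can be mixed in one traversal.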

While Gremlin 2 was an imperative query language, Gremlin 3 is a new type of query language that aims to combine the best of both worlds.

mbroecheler | 11 years ago | on: Titan Distributed Graph Database 0.5.0

Yes, it works :-) Support for multiple storage backends gives Titan a lot of deployment flexibility and allows it to inherit some great features like multi-DC support. Software component reuse is pretty standard these days. What led you to the conclusion that it is the worst of all worlds?

mbroecheler | 13 years ago | on: The Distributed Graph Database Titan Provides Real-Time Big Graph Data

I think you are looking at a very different use case here. The systems that I think you are referring to analyze a static graph representation. The Graph500 benchmark in particular loads one big static, unlabeled, undirected, property-free graph and then runs extensive (BFS) analysis algorithms on it. The fact that the graph is not changing allows significant investment into building locality-optimizing data structures (which is essentially what space decomposition is all about).

Titan, on the other hand, is a transactional database system designed to handle large, multi-relational (labeled) graphs with heterogeneous properties. A Titan graph is constantly evolving (as in the posted benchmark). For graphs (unlike geo-spatial domains), applying space decomposition techniques first requires a metric space embedding, which is a non-trivial and computationally expensive process. For changing graphs, this embedding will change as well, making it very difficult to use in practice. The best approaches I know of for achieving locality therefore use adaptive graph partitioning techniques instead. However, for the types of OLTP workloads that Titan is optimized for, this would be overkill in the sense that the time spent on partitioning will likely exceed the time saved at runtime. At very large scale, it is most important for OLTP systems to focus on access path optimization based on the ACTUAL query load experienced by the system and not some perceived sense of locality based on connectedness. I published a paper a while ago suggesting one approach to do so: http://www.knowledgefrominformation.com/2010/08/01/cosi-clou... The Graph500 benchmark explicitly prohibits this optimization ("The first kernel constructs an undirected graph in a format usable by all subsequent kernels. No subsequent modifications are permitted to benefit specific kernels").

mbroecheler | 13 years ago | on: The Distributed Graph Database Titan Provides Real-Time Big Graph Data

Absolutely, without NoSQL solutions like Cassandra, Titan would not be possible.

Regarding Zookeeper: We actually built a locking system into Titan that uses quorum reads/writes with time-outs and cleanup to ensure consistency for certain edge/property types as defined inside Titan. This gives you consistency guarantees out of the box without having to introduce another component (like Zookeeper) into your deployment. For infrequent lock usage (which I strongly encourage ;-)) this should be sufficient. For frequent locking, something like Zookeeper is far superior.
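
A toy sketch of the quorum-claim idea (far simpler than Titan's actual locking implementation; the in-memory replica model and TTL handling are made up for illustration):

```python
import time

# Toy quorum-based lock with time-outs: a lock is held iff a majority of
# replicas accepted the owner's claim; expired claims are cleaned up on
# the next acquire attempt.

class QuorumLock:
    def __init__(self, n_replicas=3, ttl=10.0):
        self.replicas = [dict() for _ in range(n_replicas)]  # key -> (ts, owner)
        self.quorum = n_replicas // 2 + 1
        self.ttl = ttl

    def acquire(self, key, owner, now=None):
        now = time.time() if now is None else now
        acks = 0
        for rep in self.replicas:
            claim = rep.get(key)
            # accept if slot is free, the old claim expired, or we own it
            if claim is None or now - claim[0] > self.ttl or claim[1] == owner:
                rep[key] = (now, owner)
                acks += 1
        return acks >= self.quorum
```

The real thing has to handle partial failures and clock skew across nodes, but this shows why no separate coordinator is needed for infrequent locking.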

mbroecheler | 13 years ago | on: The Distributed Graph Database Titan Provides Real-Time Big Graph Data

Hey,

- the data we used was crawled by Kwak et al. in 2009. We wanted to use a real social network dataset for the experiment, and that was the largest/most useful one we could find. Other than de-duplication, we did not make any modifications to the dataset, so the statistics reported in their paper still hold: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.153....

- You mean what is the overhead induced by pre-computing the stream edge rather than collecting the relevant streams at query time? You are right that this requires a significant amount of storage space; however, as you also pointed out, this data will get cold quickly and sit on disk only (i.e. not take up space in the valuable cache). The reason this is so efficient is the time-based vertex-centric index we build for the stream edges, which allows us to quickly pull out the most recent tweets for any user. If we had to compute those at query time, we would have to traverse to each person followed, get their 10 most recent tweets, and then merge those in-memory. That would be significantly more expensive, and since stream reading is probably the most frequent activity on Twitter, pre-computing it saves a lot of time at the expense of inexpensive disk storage.
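
A toy sketch of that trade-off (data shapes are made up): with a precomputed, time-sorted stream edge list per user, a read is a single range scan, whereas computing the stream at query time means merging every followee's tweets.

```python
import heapq
import itertools

# Toy comparison of the two read paths. Tweets are (ts, user, text)
# tuples; each per-user list is sorted newest-first.

def query_time_stream(follows, tweets_by_user, user, k=10):
    """Merge every followee's tweet list at read time (k-way merge)."""
    streams = [tweets_by_user.get(f, []) for f in follows.get(user, [])]
    merged = heapq.merge(*streams, reverse=True)  # newest first
    return list(itertools.islice(merged, k))

def precomputed_stream(stream_index, user, k=10):
    """With a time-sorted stream edge index, a read is one range scan."""
    return stream_index.get(user, [])[:k]
```

Both return the same result; the precomputed path trades cheap disk space at write time for much less work on the hot read path.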
