Since starting my career, the tech stack I use has changed numerous times. Technologies have come, gone, evolved. But except for a brief period in 2010 when we thought we could kill it, SQL is the one technology that has stayed around. It might be the only TECHNICAL knowledge I learned in the first year of my career that is still relevant. You're probably not making the wrong decision if you decide to build on top of SQL.
> except for a brief period in 2010 where we thought we could kill it
Unfortunately, we never actually tried to kill SQL. We tried to kill relational databases, with the death of SQL as a side effect. This proved to be a stupid idea - non-relational data stores have value of course, but so do relational databases.
SQL, as you point out, is 48 years old. It shows. The ergonomics are awful. It uses natural-language inspired syntax that has fallen completely out of favor (with good reason imho). The ordering of fields and tables precludes the use of autocomplete. And it uses inconsistent syntax for essentially identical constructions.
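A minimal illustration of the kind of mismatch usually cited (my own example, not necessarily the one the comment originally showed, sketched with Python's built-in sqlite3): INSERT and UPDATE both mean "set these columns to these values", yet use two unrelated shapes:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")

# Writing a row: the column list and the value list are kept apart...
conn.execute("INSERT INTO users (name, age) VALUES (?, ?)", ("alice", 30))

# ...but updating the very same columns pairs them up with SET col = val.
conn.execute("UPDATE users SET name = ?, age = ? WHERE id = 1", ("alice", 31))

print(conn.execute("SELECT name, age FROM users").fetchone())  # ('alice', 31)
```

Autocomplete suffers for a related reason: by the time you type the column list in a SELECT, the editor still hasn't seen the FROM clause.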
And worst of all, it comes with such a woefully inadequate standard that half of the universally supported features are expressed in a completely different fashion for each vendor.
It is very common to argue that its longevity is due to its excellence as a language. I think its longevity is due to its broad target audience, which includes plenty of people in less technical roles who are high on the org chart. Tech workers are happy to jump on new technology that improves ergonomics (or shiny new toys that are strictly more interesting because they are new). Executives whose primary way of interacting with the tech stack is querying a read-only replica are much less open to learning a new language.
A huge benefit of SQL is that it's a declarative language (you describe what the output should be); whereas with imperative languages like C/Python/Ruby you have to describe how to generate the output with specific instructions and procedures.
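A small sketch of the contrast, using Python's built-in sqlite3 (illustrative data and names):

```python
import sqlite3

rows = [("books", 12.5), ("books", 7.0), ("games", 30.0)]

# Imperative: spell out *how* to aggregate, step by step.
totals = {}
for category, price in rows:
    totals[category] = totals.get(category, 0) + price

# Declarative: state *what* you want; the engine decides the how.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (category TEXT, price REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
sql_totals = dict(conn.execute(
    "SELECT category, SUM(price) FROM sales GROUP BY category"))

assert totals == sql_totals  # both give {'books': 19.5, 'games': 30.0}
```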
I'm sure this will ruffle some feathers, but the technical skill that has been the most persistently valuable across the past 25 years of my career is object-oriented programming and design. It's not the only paradigm I use, but being able to organize large programs in terms of OOP has been incredibly valuable and hugely productive for me across several languages and domains.
I feel my abilities as a developer really took off when I took the time to really learn SQL and take advantage of the features that my database offers. For example, learning to use window functions felt like that galaxy brain meme where you become one with the universe.
Being able to take 50 lines of ruby code (manual (anti-)joins and aggregates, result partitioning, etc...) and replace it with a couple lines of sql that is much faster and less buggy is a life changing experience.
The only other time I had such a dramatic shift in the way that I look at building applications is when I learned Scala (functional programming).
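The window-function jump described above can be sketched like this (illustrative schema, using Python's built-in sqlite3; window functions require SQLite 3.25+): top earner per department in one query, instead of a hand-rolled partition, sort, and take-first loop:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE salaries (dept TEXT, name TEXT, pay INTEGER)")
conn.executemany("INSERT INTO salaries VALUES (?, ?, ?)", [
    ("eng", "ada", 120), ("eng", "bob", 100), ("ops", "cyd", 90),
])

# ROW_NUMBER() numbers rows within each department, highest paid first.
top = conn.execute("""
    SELECT dept, name FROM (
        SELECT dept, name,
               ROW_NUMBER() OVER (PARTITION BY dept ORDER BY pay DESC) AS rn
        FROM salaries
    ) WHERE rn = 1
    ORDER BY dept
""").fetchall()
print(top)  # [('eng', 'ada'), ('ops', 'cyd')]
```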
I’ve been using HTML since the mid 1990s. That is probably the oldest technology that I still use on a regular basis. SQL would be a close second (or maybe third - I probably started learning sql around 2001 and I can’t remember when JavaScript came out. Surely before that.)
Honestly, mine has mostly stayed the same. I've played with various techs, but I mostly work with the same underlying tech. Some items get added, but very rarely does something get replaced or removed. The tech gets improved and new features get added.
The last time I can remember something getting replaced was Vagrant giving way to Docker. And honestly, I think that's a mistake for many companies. Before that, it was Subversion giving way to Git.
The only way I can see tech coming and going is if you're always an early adopter and need to replace things that failed. If you build on stable tech, that stuff stays around for decades even if it was terrible in the first place. A good example is PHP. It was terrible at first, but it's still around and improving year on year.
The problem I see here is building SQL on top of the cloud, which is like building SQL on top of a database where the tables are always changing - not just adding columns but the meaning of the columns and rows, or even the existence of the table.
I think this is a good idea, but the cloud doesn't seem to care too much about backwards compatibility. If you use MySQL/PostgreSQL/SQL Server, you can be fairly sure in 20 years time your SQL will still work :-).
The need for this library is more an indictment of "the cloud" than anything. I am mostly an Azure user so AWS/Google might be better, but man, it changes its UI and CLI interfaces a lot - way too much.
SQL is a technology that engineers should know in and out, and while in the early 2000s you could not move an inch without knowing it, I find that the new generation of devs really considers database calls a freebie - hidden behind abstractions and ORMs. It's just a line of code, right?
This leads to highly non-performant software since, unless we have bent the laws of physics in the last 15 years, your data access layer is the most expensive one.

But we are crushing LeetCode, yeah?
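To make that cost concrete, here is the classic ORM-shaped failure - the N+1 query pattern - next to the single JOIN that replaces it (an illustrative sketch with sqlite3; against a real server, each extra query is a network round trip):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    INSERT INTO authors VALUES (1, 'ann'), (2, 'ben');
    INSERT INTO posts VALUES (1, 1, 'a'), (2, 1, 'b'), (3, 2, 'c');
""")

# N+1: one query for the list, then one more *per row*.
queries = 0
authors = conn.execute("SELECT id, name FROM authors").fetchall(); queries += 1
for author_id, _name in authors:
    conn.execute("SELECT title FROM posts WHERE author_id = ?",
                 (author_id,)).fetchall()
    queries += 1
assert queries == 3  # 1 + N, and N grows with your data

# The same data in a single round trip:
joined = conn.execute(
    "SELECT a.name, p.title FROM authors a JOIN posts p ON p.author_id = a.id"
).fetchall()
assert len(joined) == 3
```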
> Since starting my career, the tech stack I use has changed numerous times. Technologies have come, gone, evolved. But except for a brief period in 2010 where we thought we could kill it, SQL is the one technology that has stayed around.
Ethernet comes to mind when I think of technologies that just can't be killed ...
> It might be the only TECHNICAL knowledge I learned in the first year of my career that is still relevant.
Agree. IIRC, I wrote my first SQL in 1996 and wrote some today. While not perfect, it’s amazing at what it does. I suggest reading Joe Celko to really up your SQL skills.
That's not random. SQL is basically math. The syntax is a little old school and arcane (but not terrible) but the principles behind it are set theory and logic.
Excited to see Steampipe shared here - thanks kiyanwang! I'm a lead on the project, so sharing some quick info below and happy to answer any questions.
Steampipe is open source [1] and uses Postgres foreign data wrappers under the hood [2]. We have 84+ plugins for querying AWS, GitHub, Slack, HN, etc. with SQL [3]. Mods (written in HCL) provide dashboards as code and automated security & compliance benchmarks [3]. We'd love your help & feedback!

1 - https://github.com/turbot/steampipe
2 - https://steampipe.io/docs/develop/overview
3 - https://hub.steampipe.io/
Hey Nathan. Can you comment on some of the known performance pitfalls of steampipe? I'm not super familiar with the Postgres foreign data wrappers API. I assume steampipe inherits many of its technical limitations from this API.
Having done some work in this space, I'm aware that it's no small thing to compile a high-level SQL statement that describes some analytics task to be executed on a dataset into a low-level program that will efficiently perform that task. This is especially true in the context of big data and distributed analytics. Also true if you're trying to blend different data sources that themselves might not have efficient query engines.
Would love to use this tool, but just curious about some of the details of its implementation.
Hey!
Steampipe looks great and I think the architecture you chose is very smart. Some feedback from myself:
I've tried setting up Steampipe with Metabase to do some dashboarding. However, I’ve found that it mostly exposes "configuration" parameters, so to speak. I couldn't find dynamic info, like S3 bucket size or autoscaling group instance count.
Have I done something backwards or not noticed a table, or is that a design decision of some sort? That was half a year ago, so things might've changed since then, too.
It might just be a coincidence, but an hour before this HN post, I discovered it way back in our queue of things to review for https://golangweekly.com/ and featured it in today's issue. Hopefully kiyanwang is one of our readers :-D
Congratulations! This looks incredibly powerful and I'm excited to check it out.
Although this is pitched primarily as a "live" query tool, it feels like we could get the most value out of combining this with our existing warehouse, ELT, and BI toolchain.
Do you see people trying to do this, and any advice on approaches? For example, do folks perform joins in the BI layer? (Notably, Looker doesn't do that well.) Or do people just do bulk queries to basically treat your plugins as a Fivetran/Stitch competitor?
Steampipe killed my startup. I couldn't be happier about it.
I started a company six years ago to do exactly this -- make a SQL interface for the AWS API. Steampipe out-executed us and made a way better product, in part because their technology choices were much smarter than ours. I shut down the startup as soon as I saw steampipe.

I wish them all the best, it's a great product!
Writing custom scripts for even the simplest of queries comes nowhere near the convenience of using PostgreSQL to get what you want. If you wanted to find and report all instances in an AWS account across all regions, with steampipe it's just:
SELECT * FROM aws_ec2_instance;
Even the simplest implementation with a Python script won't be nearly as nice, and once you start combining tables and doing more complicated queries, there's just no competition.
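For comparison, a rough boto3 sketch of the same "all instances, all regions" report (the function names are mine, and it assumes configured AWS credentials - a sketch, not a drop-in tool):

```python
def summarize(reservations):
    """Flatten describe_instances() pages into (id, type, state) tuples."""
    return [
        (i["InstanceId"], i["InstanceType"], i["State"]["Name"])
        for r in reservations
        for i in r["Instances"]
    ]

def all_instances():
    import boto3  # deferred so the pure helper above works without boto3 installed
    regions = [r["RegionName"]
               for r in boto3.client("ec2").describe_regions()["Regions"]]
    rows = []
    for region in regions:
        # One client per region, paginating each region's results.
        pages = boto3.client("ec2", region_name=region).get_paginator("describe_instances")
        for page in pages.paginate():
            rows.extend(summarize(page["Reservations"]))
    return rows
```

Even this minimal version needs a region loop, a paginator, and a flattening step before you can filter anything - the part SQL gives you for free.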
Tried Steampipe out to replace my simple Python script. It wasn't able to handle our 1k+ accounts and multiple regions; initialization took over an hour before I was even able to run a query.
Meanwhile my python script can run any API calls in all our accounts and regions and finish in a few minutes. Maybe 100 lines of python using boto3 and multiprocessing that outputs to json or yaml.
I don't agree that this simple query wouldn't be as nice with boto3.
But combining different services is definitely a nice feature, if you have the use case for it.
I'm the author of CloudQuery (https://github.com/cloudquery) and I believe we started at the same time, though we took some different approaches.
PG FDW is def an exciting piece of PostgreSQL tech and is great for things like on-demand querying and filtering.
We took a more standard ELT approach where we have a protocol between source plugins and destination plugins, so we can support multiple databases, data lakes, and storage layers - kinda similar to Fivetran and Airbyte, but with a focus on infrastructure tools.
Also, as a more standard ELT approach, our policies use standard SQL directly and dashboards reuse existing BI tools rather than implementing dashboarding in-house.
I can see the power of all-in-one and FDW and def will be following this project!
How does this handle pagination and also how does it behave towards rate limiters? For example, my API (keygen.sh) has a page size limit of 100, and often times there can be hundreds or even thousands of pages for certain resources. How would Steampipe handle `select count(*) from keygen_licenses` while obeying the rate limiter? What if a heavy query is run twice?

The plugin SDK provides a default retry/backoff mechanism, and the plugin author can enhance that. If a heavy query runs twice it'll load from cache the second time, if within the (user-configurable) cache TTL.
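A generic sketch of how a plugin layer might handle this - exponential backoff when the API rate-limits, plus a TTL cache so a heavy query run twice only hits the API once (an illustrative model, not Steampipe's actual implementation):

```python
import time

class RateLimited(Exception):
    """Raised when the upstream API pushes back, e.g. HTTP 429."""

class CachedFetcher:
    """Exponential backoff on rate limits, plus a TTL cache for repeat queries."""

    def __init__(self, fetch, ttl=300, retries=5, base_delay=1.0,
                 sleep=time.sleep, now=time.monotonic):
        self.fetch, self.ttl, self.retries = fetch, ttl, retries
        self.base_delay, self.sleep, self.now = base_delay, sleep, now
        self.cache = {}  # key -> (expires_at, value)

    def get(self, key):
        hit = self.cache.get(key)
        if hit and hit[0] > self.now():  # heavy query run twice? serve from cache
            return hit[1]
        for attempt in range(self.retries):
            try:
                value = self.fetch(key)
            except RateLimited:
                self.sleep(self.base_delay * 2 ** attempt)  # back off, then retry
            else:
                self.cache[key] = (self.now() + self.ttl, value)
                return value
        raise RateLimited(f"gave up after {self.retries} attempts: {key}")
```

The injectable `sleep` and `now` arguments are there so the backoff logic can be exercised without real waiting.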
This makes me want to create a Grafana dashboard for everything. Instead of relying on custom plugins (or pulling & storing the data yourself), now it's just another postgres data source. I hope this becomes mainstream!
Yes, we've focused on SELECT for read-only - which provides a lot of value as a "universal API" to live query these many sources. Steampipe is built on Postgres Foreign Data Wrappers [1] under the hood, so it could be extended to full CRUD in time. Our current belief is that it's better to follow CQRS [2] in this case though - using SQL for the queries and more native tools for the commands.

1 - https://www.postgresql.org/docs/current/ddl-foreign-data.htm...
2 - https://martinfowler.com/bliki/CQRS.html
I'm in the process of determining which RDS instance type to switch to, and I've had to have a browser open with a million tabs, copying and pasting information and pricing data from AWS. Could I use Steampipe to help me with this?
How does this relate to or differ from Multicorn? I see that in most of the "Type" columns in the "specific web wrappers" section of the Postgres FDW page.
There is another contender in the infra-SQL realm called CloudQuery [1]. Steampipe has been around for a very long time. Mind you, Steampipe has an AGPL license.

1 - https://www.cloudquery.io
I really like how it just worked when I installed it and the AWS plugin. The second one I tried was Tailscale... I get different results for the same query, which is weird and broken, and an error "device can only be managed by original tailnet (403) (SQLSTATE HV000)"
Still, getting access to all this stuff with a SQL interface is very compelling so I'll try to get the plugins for all the stuff I have working properly.
Glad AWS worked. Sorry you hit a bump with Tailscale - we'd love feedback / contributions for that plugin [1]. We just released it this month, so there may be rough edges for more complex configurations :-(

1 - https://github.com/turbot/steampipe-plugin-tailscale
We built a UI-centric cross-cloud search engine, too. <edit: link redacted as other commenter perceived this to be an ad>
The first VC we pitched to felt like this was too niche of a problem. They wanted us to come back with a different, grander pitch, so we'll see. In the current fundraising climate, it's been difficult to gather data points on whether we are on the right track (this post makes us feel like we're not crazy). We reached out to investors outside America, where we're based, and we're quickly realizing the VCs aren't as tech-savvy as we expected. After going through the YC application process this cycle, we have a much greater appreciation for YC. They understand tech and startups, both. For one, we were instructed to define a toy / small problem - "don't talk, do" - as opposed to pretty slides and ideas.
Best of luck to Steampipe and whoever else is working on this problem.
So this looks interesting and I wanted to test it on a use-case I have currently. I want to query both a csv and a zipped csv living online but it doesn't appear that is possible. The csv plugin implies you can only work with local files, nothing over http(s). This kind of seems like a big oversight for a product like this, no?
I have recently tried to set up Steampipe to query all our cloud costs, but I noticed that costs are only available for AWS. Azure, GCP, DigitalOcean costs are not available in their respective plugins.
Has anyone found a way to get the costs in those plugins? It is a limitation of the plugins or of the underlying cloud APIs?
Please open GitHub issues (or PRs) for any tables you'd like to see added! If there is a cost API, then we'd definitely like to have a table to query it. In the meantime, you may find the "Thrifty" mods useful - they have many prebuilt queries to look for cost-saving opportunities in AWS, Azure, GCP and DigitalOcean - https://hub.steampipe.io/mods?q=thrifty
The real power of SQL is locked until you can join different data sources.
No way they did not know about that name collision, as SteamPipe has been around since 2013-ish.
https://en.wikipedia.org/wiki/Yahoo!_Query_Language
https://wiki.postgresql.org/wiki/Foreign_data_wrappers