top | item 31874767

Xlite: Query Excel and Open Document spreadsheets as SQLite virtual tables

252 points| thunderbong | 3 years ago |github.com | reply

29 comments

order
[+] eatonphil|3 years ago|reply
This is a cool project! But if you just want a simple CLI, you can query Excel and ODS files with dsq [0]. It's the same end result without needing to carry plugins around and with support for many additional data formats. Plus a growing standard library of functions that don't come built into SQLite such as best-effort date parsing, URL parsing/extraction, statistical aggregation functions, math functions, string and regex helpers, hashing functions and so on [1].

An annoying thing about this extension-based style of file support is needing to create a new table for every new file if the schema is different. This is a limitation [2] of sqlite unfortunately. dsq doesn't work this way so it doesn't have that limit.

On the other hand, if you go this route you can more easily combine with other extensions. That's not really possible with dsq right now.

[0] https://github.com/multiprocessio/dsq

[1] https://github.com/multiprocessio/go-sqlite3-stdlib

[2] https://sqlite.org/forum/forumpost/ec944414fa

[+] sitkack|3 years ago|reply
I think when Wasm gets a little bit more mature, it will enable the creation of more composable tools. One might be able to hot create modules? Kinda ironic that this is how it is structured given that arbitrary projections is what SQL does.

Is there a way to hook lower in the SQLite stack, and make it think that your DS is something it already understands?

What about a function that returns a view that it dynamically generates?

[+] mwenge|3 years ago|reply
[+] olah_1|3 years ago|reply
I'm looking for a way to have a searchable/filterable CSV file of 30k rows as a web app.

Some kind of fork of DirtyLittleSQL looks like it is the solution!

[+] _qua|3 years ago|reply
Well this is the coolest thing I've seen all week.
[+] loxias|3 years ago|reply
This is cool, and kudos to you for writing it, and getting it out there. I've always found that more than half the "hard work" is actually finishing something and getting it out there, and you get all my respect for doing that.

Querying Excel spreadsheets with SQL, however, is something that can already be done with no additional tools, no custom plugins, just good old fashioned command line knowledge.

(first, download superstore.xls, some test data I found [0])

  loxias@host:~$ sqlite3 -csv :memory: -cmd ".import '| ssconvert -T Gnumeric_stf:stf_csv superstore.xls fd://1' orders" -cmd '.mode column' 'SELECT State, AVG(Profit), COUNT(*) FROM orders GROUP BY State ORDER BY AVG(Profit) DESC LIMIT 5'

  State         AVG(Profit)       COUNT(*)
  ------------  ----------------  --------
  Vermont       204.088936363636  11      
  Rhode Island  130.100523214286  56      
  Indiana       123.375411409396  149     
  Montana       122.2219          15      
  Minnesota     121.608847191011  89 
     
  loxias@host:~$ sqlite3 -csv :memory: -cmd ".import '| ssconvert -T Gnumeric_stf:stf_csv superstore.xls fd://1' orders" -cmd '.mode column' 'SELECT State, AVG(Profit), COUNT(*) FROM orders GROUP BY State ORDER BY AVG(Profit) ASC LIMIT 5'

  State           AVG(Profit)        COUNT(*)
  --------------  -----------------  --------
  Ohio            -36.1863040511728  469     
  Colorado        -35.8673510989011  182     
  North Carolina  -30.0839847389558  249     
  Tennessee       -29.1895825136612  183     
  Pennsylvania    -26.5075984667803  587
And there you have the 5 top and 5 worst performing states by profit margin.

Look ma! No code! No temp files even! :D

[0] https://community.tableau.com/s/question/0D54T00000CWeX8SAL/...*

[+] eatonphil|3 years ago|reply
You just went from requiring one bit of code to requiring another. I.e. Does every Linux distro and Mac and Windows come with ssconvert?
[+] AtlasBarfed|3 years ago|reply
It is a crime against humanity how opaque office documents are to data extraction.

So much potential and efficiency lost. as in likely hundreds of billions of dollars.

Practically every single office in the world uses office suite products. There's probably a 10s to 100s of billions of person-hours or more of collective work invested in office suite documents and spreadsheets, and getting at it programmatically is not easy.

Not easy is a bad term.

Intentionally walled, obfuscated, undocumented, and constantly changed to maintain monopolies in core Office software as well as other "back office" products.

And it's still willingly accepted by virtually all corporations and organizations worldwide.

[+] justsomeuser|3 years ago|reply
How do virtual tables compare with regular tables for performance?

I would assume SQLite has some optimisations for native tables (rather than reading the data from another virtual table backed file)?

[+] lazypenguin|3 years ago|reply
It’s implementation dependent. You implement an interface and it’s up to the implementation to optimize the calls. The interface is also rather simplistic so there’s some optimization that can’t be done at the implementation level since SQLite didn’t provide all the information at the time of the query.
[+] don-code|3 years ago|reply
In addition to just loading and querying data, spreadsheets are great as database mocks in integration tests. I long ago used XLSQL (note: there is a new XLSQL project unrelated to this one) to mock an Oracle database: https://sourceforge.net/projects/xlsql/.
[+] tanin|3 years ago|reply
I built a desktop for a similar purpose, and it can handle GBs of CSV files pretty quickly. It is for CSV though.

It is more convenient if you do this regularly for your job.

Please check it out: https://superintendent.app

[+] kbd|3 years ago|reply
Do any of these tools take into account hidden Excel rows? It’s a constant frustration.
[+] eatonphil|3 years ago|reply
Dunno! But if you send me a sample I can try it out for dsq.
[+] Fnoord|3 years ago|reply
rga (ripgrep all) is basically rg (ripgrep) for binaries.