What If OpenDocument Used SQLite? (2014)

[+] nabla9|5 years ago|reply

Sustainability of Digital Formats: Planning for Library of Congress Collections https://www.loc.gov/preservation/digital/formats/fdd/fdd0004...

The Library of Congress Recommended Formats Statement (RFS) includes SQLite as a preferred format for datasets. The RFS does not specify a particular version of SQLite.

[+] svnpenn|5 years ago|reply

I'd say better to export tables to CSV files, and dump the schema to an SQL file.

[+] danellis|5 years ago|reply

This reminds of the WebSQL spec being dropped because the only implementations used SQLite:

> The W3C Web Applications Working Group ceased working on the specification in November 2010, citing a lack of independent implementations (i.e. using database system other than SQLite as the backend) as the reason the specification could not move forward to become a W3C Recommendation.

-- Wikipedia

> This document was on the W3C Recommendation track but specification work has stopped. The specification reached an impasse: all interested implementors have used the same SQL backend (Sqlite), but we need multiple independent implementations to proceed along a standardisation path.

-- W3C

I wonder how much of their reasoning applies here.

[+] chungy|5 years ago|reply

I've grown to not mind WebSQL being dropped, despite that 10 years ago, I wish it were standardized.

WebAssembly's introduction a few years later means that SQLite can be built and used as a wasm module, and you aren't limited by whatever old version of SQLite was used for WebSQL -- you can always update it independently according to your needs.

A lot of sites actually do this.

[+] tasogare|5 years ago|reply

It strange that Microsoft didn’t try to use Sql Server Compact Edition, which was available on Windows Phone, to make the distinct required implementation.

[+] indolering|5 years ago|reply

The biggest problem with WebSQL == SQLite was that the semantics couldn't be nailed down beyond a pile of C code. That is doable if there is a single application vendor for a given file format, but a total nightmare if you are trying to write code for the last 3 versions of 5 different browsers.

But SQLite was also a problem for anyone that wanted different semantics than what SQLite offered, like statically typed columns. Better to just provide a fast indexing mechanism and let people define semantics on top of that.

[+] Shorel|5 years ago|reply

I think that was faulty reasoning back then, and it is no different now.

[+] garmaine|5 years ago|reply

Seems like the problem is in the standards process here...

[+] wmf|5 years ago|reply

Standards like OpenDocument are expected to have multiple independent implementations. I'm not sure such a thing is feasible with SQLite (this is why WebSQL was rejected).

I'm personally sympathetic to the idea that certain implementation-defined monocultures are OK but I don't think that idea has critical mass.

[+] otabdeveloper4|5 years ago|reply

Lack of multiple independent implementations means we can't ever switch to a different programing language or radically different system architecture.

Nobody is feasibly going to rewrite Sqlite in Rust.

So yeah, it is important.

[+] usefulcat|5 years ago|reply

I can think of reasons why re-implementing SQLite might lack appeal, buy why would it be infeasible? Even if the format is poorly- or under- specified (note: I have no idea whether that is true), you could definitely check the source.

[+] Ecco|5 years ago|reply

Kind of reminds me of http://utf8everywhere.org/, but in that case it'd be sqlite-everywhere. Which is a pretty good point. A large number of Apple-made iOS apps went this route, and they seem to be doing ok!

[+] takluyver|5 years ago|reply

I think a lot of Android apps use Sqlite as well. But the achilles heel of Sqlite in my experience is network filesystems - it's well known & documented that Sqlite can suffer corruption on NFS, because of issues with locking. That doesn't really matter for iOS & Android apps, where you know storage is local, but it's a serious sticking point for desktop applications, where e.g. NFS home directories are not unusual.

[+] zelly|5 years ago|reply

Maybe this is the right thread to ask:

I vaguely remember a file archive format similar to JAR/WAR (for Java) or ASAN (for Electron) which was designed for bundling software with its dependencies and static resources. It is not a container because it's just files with no specification for its runtime. Unlike JAR, it is language agnostic. These bundles could be compiled and compressed on your local machine then scp'd to a server that knows how to execute this bundle. If I recall correctly, it was some open source project from a big company.

I've spent hours trying to find it again and am wondering if I might be remembering something from a dream.

[+] Twisol|5 years ago|reply

XAR, perchance?

https://github.com/facebookincubator/xar

[+] theblazehen|5 years ago|reply

You might be thinking of AppImage?

[+] dathinab|5 years ago|reply

While I generally agree with the points they make I'm not sure I would want a (stock) sqlite being used for "less" trusted files.

While sqlite has a very comprehensive test suite there had been more than one attack with "manipulated" database files and as far as I know (I might be mistaken) the test suite(s) are not focused on testing that case much.

Through I'm positive that it would be fine with a "hardened" sqlite where certain features are not compiled in at all and in turn the attack surface is reduced.

[+] infogulch|5 years ago|reply

The article describes the current format as a zip of files where the main content of presentation slides is stored in a `content.xml` file within. Yes, sqlite may have had file format vulnerabilities in the past, but I cannot imagine trusting it less than any xml library in existence.

[+] teraflop|5 years ago|reply

You could always sandbox it. This has actually already been done at least once: years ago, there used to be a "pure Java" JDBC driver for SQLite that was generated by compiling the original C code to a MIPS binary, and then translating the result into Java bytecode.

http://web.archive.org/web/20080703234650/http://www.zentus....

[+] remram|5 years ago|reply

Can you elaborate about this? What are the gotchas when opening an untrusted SQLite3 database?

(I am developing a web application that will allow users to upload application files, which are SQLite3 databases)

[+] yencabulator|5 years ago|reply

Further reading: https://research.checkpoint.com/2019/select-code_execution-f...

I've never seen mention of this being fixed in any way. SQLite is not an acceptable format to transfer between computers.

[+] Aeolun|5 years ago|reply

I think the same is true for less trusted MS office files though? E.g. VBA can do some crazy stuff.

[+] knighthack|5 years ago|reply

The article convinces me on the failure of the ZIP method in at least 3 respects:

1) For efficient incremental updates. Especially when you're saving a large file often, making constant single changes to ZIP files is just inefficient.

2) The insane memory gulps that ZIP takes.

3) The pile of files method. The data in OpenDocument files could be much better sorted/accessed through SQLite tables, rather than through XML files in ZIPs.

I think SQLite makes far more sense for OpenDocument. Incidentally, SQLite has also made good arguments previously for its use to replace application file formats: https://www.sqlite.org/appfileformat.html

[+] etimberg|5 years ago|reply

In a past job i tried hard to replace an Access database file with sqlite for a c++ desktop application. One big win from sqlite was on the testing front. You can just open an in-memory sqlite database and test against that without worrying about disk state, or state leaking between test runs.

[+] coddle-hark|5 years ago|reply

We use in-memory sqlite databases for unit testing a MySQL backed web service. It’s great! You can build and test the code without setting up a MySQL server and no risk of state leaking like you said. The syntax isn’t exactly the same but we have a small wrapper layer that rewrites MySQL-specific queries into their SQLite equivalents.

The only real downside is that it’s a C dependency which makes cross compiling a bit of a pain until you figure out the magic incantation the compiler wants.

[+] alkonaut|5 years ago|reply

The whole "Writes are atomic" still breaks down if they assume that disks are local. Having worked with one of these zip document format for 20 years now, I have found that local disk is the edge case, and SMB is the norm in corporate environments.

Sure you can download the sqlite file at the start of a session and work on it locally in a temp directory with atomic writes - but in the end the user doesn't think they have "saved" until version N+1 of the file is actually on their network share.

[+] galgalesh|5 years ago|reply

SQLite tries to detect when the database is on a network share and changes it's behavior accordingly.

[+] wmertens|5 years ago|reply

Isn't that the same problem with zip files? So if we're assuming that the compressed SQLite format will take up the same or less space as the zip file, wouldn't that mean that exactly the same mitigations work?

[+] mattxxx|5 years ago|reply

In my experience, sqlite is rarely the wrong choice for a first pass at almost anything. It’s - Straightforward - Portable - Performant

So, unless you have a good reason to not use it: - Human-readable/editable file format - Non-trivial queries across multiple servers - Specific performance optimizations needed - Etc.

It’s a good first bet / gets you through mvp (and then some)

[+] sargun|5 years ago|reply

The problem I have with sqlite for being a general purpose file format is that it’s difficult to take a sqlite file in any language and work with it without linking against code.

I kind of wish that there were native implementations of sqlite in every language.

[+] simonw|5 years ago|reply

WASM is really interesting here.

https://github.com/sql-js/sql.js is SQLite complied to WASM, which means you can run it not just in the browser but also in any WASM environment.

https://github.com/wasmerio/wasmer-python lets you execute WASM compiled binaries within Python.

https://github.com/wasmerio/wasmer-ruby handles Ruby.

So it should be possible to take the WASM compiled SQLite and run it in other languages.

This is a thing that really excites me about WASM: as a mainly-Python programmer it has the potential to give me access to a much wider range of tools.

[+] Ecco|5 years ago|reply

What's the problem of linking against SQLite? In my book it's much better because:

- SQLite is coded in C, so performance and weight are going to be equal or better than any other reimplementation.

- Having one code base means no compatibility issue between implementations

- Also SQLite's codebase is so heavily tested and its track record is so good that I really don't see the point.

[+] an_d_rew|5 years ago|reply

In this particular case though you have to weigh linking against SQLight versus zlib and libxml.

[+] AtlasBarfed|5 years ago|reply

Yes.

Possibly if this was done however, we'd get native sqllite file format read/write libraries in most major languages.

Might overall be good. Sqllite-as-file-format isn't necessarily a bad idea if this infrastructure exists.

[+] 7thaccount|5 years ago|reply

Why can't you just have your language call to SQLite via the command line?

[+] systematical|5 years ago|reply

I use SQLite extensively for unit tests and demo code bases for my OSS libraries. For the latter because its super easy for someone to clone the code base and run the entire thing in seconds. Huge fan of SQLite in those two scenarios.

[+] molyss|5 years ago|reply

Does that still apply if the DB is in WAL mode ? according to https://www.sqlite.org/fileformat2.html I'd say "no", because it sounds like the journal file might be necessary if the DB dies unexpectedly

[+] teleforce|5 years ago|reply

Why not go to the extreme and just use TileDB for all the documents? By using it you can utilize universal storage engine supporting Arrow format that should work with SQL, NoSQL, dataframe and even streaming systems.

[+] ncmncm|5 years ago|reply

All I would like from OpenDocument, and LibreOffice, is an ASCII storage format that is not specifically designed to be incompatible with revision control systems.

As it is, they have an almost-serviceable format that fails only by having each line start with a random number that differs from the number on the corresponding line that was originally read in.

Fixing it would not even be an incompatible change! They could just remember the number that was there, and write it back out, next time.

There is a report for this in their bug tracker that mentions SVN because there was no Git, yet, when it was posted.

[+] MayeulC|5 years ago|reply

Not exactly a complete answer to your problem, but you can at least improve diffs for odt documents by using git textconv:

    *.ods diff=odf
    *.odt diff=odf
    *.odp diff=odf
    
    *.ods difftool=odf
    *.odt difftool=odf
    *.odp difftool=odf
    [diff "odf"]
      textconv=odt2txt

https://git.wiki.kernel.org/index.php/Textconv

[+] sally1620|5 years ago|reply

There is a single XML variant of the open document format. although images and videos won't be embedded in the document that way.

I used this format when I was generating ODF documents using a script.

[+] swiley|5 years ago|reply

I'm a huge fan of sqlite but I'm not sure it's an improvement over xml for holding a DOM. Especially since there's more or less just one implementation.

[+] rlpb|5 years ago|reply

> The atomic update capabilities of SQLite allow small incremental changes to be safely written into the document. This reduces total disk I/O and improves File/Save performance, enhancing the user experience.

How would this work without breaking the conventional "Save" paradigm? Unless the app tracks all incremental changes to apply when the user clicks "Save" instead of dumping it all back to disk, which seems difficult and error-prone to implement.

[+] teraflop|5 years ago|reply

> Unless the app tracks all incremental changes to apply when the user clicks "Save" instead of dumping it all back to disk, which seems difficult and error-prone to implement.

That's the point of using SQLite instead of implementing it yourself. Because of how SQLite supports transactions, you can just update the database as you go; your changes will be physically written to disk, but they won't actually replace the old version from a reader's perspective until you explicitly commit, which is an atomic (and typically fast) operation.

The downside of this approach is that if the application crashes while writing to disk, the database file itself might be in an inconsistent state, accompanied by a rollback journal (or write-ahead log) that contains the necessary information to recover it. But if you manually delete the journal, or otherwise separate it from its database, you get data corruption.

That's probably fine as long as the user is a sysadmin who can be trusted to Just Not Do That(tm), but it's very user-hostile behavior for an office suite.

[+] trevorishere|5 years ago|reply

Microsoft more or less does this already with their Office Open XML file format in SharePoint Online/OneDrive. Small changes are read/written as users edit the doc. Real-time co-authoring locks on a per-paragraph basis.

[+] okennedy|5 years ago|reply

SQLite already has logic for this built in: transactions (https://www.sqlite.org/lang_transaction.html)

More broadly though, tech seems to be migrating away from explicit saves in general (e.g., when's the last time you needed to save something in a web-app). Maybe that's not such a bad thing? I learned to develop a reflexive cmd-/ctrl-s twitch as a youth, but I would have been just as happy to have not lost hours of work to power outages and/or cooperative multitasking.

[+] mlyle|5 years ago|reply

Modern word processing formats write down incremental changes, to avoid writing all of a massive document back to disk.

Sometimes this causes people embarrassment, because if you send a normally saved file to someone else, they can often see the change history and intermediate versions.

[+] WClayFerguson|5 years ago|reply

I am the developer of a platform that hosts documents in a NoSQL (MongoDB) so that every paragraph of text is a 'record' in MongoDB and there's a 'path' property which is what builds out a 'tree' structure to give the documents a hierarchy. You guys might be interested if you're interested in new kinds of document/wiki innovations.

https://quanta.wiki

[+] jamesfisher|5 years ago|reply

Saving the versioning in the same file would lead to security issues. Imagine redacting a file before sending it to someone. All they have to do is roll back a few versions to recover the original.

(Note this is not a criticism of the "use SQLite" argument; just a criticism of the specific schema.)

[+] cyingfan|5 years ago|reply

The only application/document file format I know that uses sqlite is https://www.giuspen.com/cherrytree/. Is there others out there worth mentioning?

194 comments