mmyrte's comments
mmyrte | 2 years ago | on: Understanding Parquet, Iceberg and Data Lakehouses
My thinking goes as follows: I'm trying to read chunks from n-dimensional data with a minimum of skips/random reads. For user-facing analytics and drilling down into the data, these chunks tend to be relatively few, and I'd like to have them close to one another. For high-level statistics, however, I only care that the data for each chunk of work be contiguous, since I'm going to read all chunks eventually anyway.
You can reach these goals with a partitioning strategy in HDF, zarr, or parquet, but you could also reach them with blob fields in a more traditional DB, be it relational or document-based or whatever. Since storage and memory are linear anyway, I don't care whether a row-major or column-major array is populated from a 1-d vector in columnar storage (with dimensionality metadata) or from an explicitly array-based storage format; I just trust that a table with good columnar compression doesn't waste too much space on what is implicit in (dense) array storage.
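To make the "1-d vector plus dimensionality metadata" point concrete, here's a minimal pure-Python sketch (the helper names are mine, not from any particular library): a dense 2-d chunk round-trips through a flat vector and a shape tuple, which is all that distinguishes "array storage" from a blob column in a table.

```python
def flatten(chunk):
    """Row-major (C-order) flattening of a list-of-lists chunk.

    Returns the flat data vector plus the shape metadata needed
    to reinterpret it as an n-d array later.
    """
    shape = (len(chunk), len(chunk[0]))
    data = [v for row in chunk for v in row]
    return data, shape

def element(data, shape, i, j):
    """Index into the flat vector using the stored shape metadata."""
    _, ncols = shape
    return data[i * ncols + j]

chunk = [[1, 2, 3],
         [4, 5, 6]]
data, shape = flatten(chunk)
assert element(data, shape, 1, 2) == chunk[1][2]
```

Whether `data` lives in a zarr chunk, an HDF dataset, or a blob/array column in a database, the reader does the same arithmetic; only where the shape metadata lives differs.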
Often, I've found that even climatological data, _as it pertains to a specific analytic scenario_, is actually a sparse subset of an originally dense n-d array, e.g. only the cells over land. This has led me to advocate for more tabular approaches, but that is very domain-specific.
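As a toy illustration of the land-only case (the grid values here are made up): once most cells of the dense grid are empty, a coordinate-list table stores only the valid cells, which is the tabular approach in miniature.

```python
# Hypothetical 3x3 grid; None marks ocean cells with no data.
dense = [
    [None, None, 1.5],
    [None, 2.0,  2.5],
    [3.0,  None, None],
]

# Tabular (coordinate-list) representation: one (row, col, value)
# record per land cell, dropping the empty ocean cells entirely.
table = [
    (i, j, v)
    for i, row in enumerate(dense)
    for j, v in enumerate(row)
    if v is not None
]

assert len(table) == 4  # 4 land cells stored instead of 9 grid cells
```

At real sparsity levels (land is roughly 30% of Earth's surface) the row count drops accordingly, and columnar compression on the coordinate columns recovers much of what dense chunking gives you for free.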
mmyrte | 2 years ago | on: My uBlock Origin filters to remove distractions
mmyrte | 2 years ago | on: Using Lidar to map tree shadows
edit: If you mean GIS (geographical information systems/science), there are plenty of undergraduate courses strewn across GitHub. IMO, the R geospatial ecosystem is more mature than its Python counterpart, but both are very usable.