
nooorofe | 2 years ago

There is a nice summary on the topic: https://aws.amazon.com/blogs/big-data/choosing-an-open-table... ("Optimizing read performance"). Those technologies are primarily about "Data Management at Scale", but they also extend the capabilities of raw storage formats such as Parquet. So they may help you, but the question is whether you really need them. I haven't worked with BigQuery; it may include [similar features](https://cloud.google.com/bigquery/docs/search-index).

You need to define what "latency" means in your case and what counts as "quite high levels". We are talking about analytical data storage, which is designed for efficient batch processing. Finding a single record is not a primary goal of that architecture - you will need some kind of caching/indexing for fast search. Sometimes just adding "limit 1" to your single-record search solves the problem.
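To see why "limit 1" helps: a scan that stops at the first match touches far less data than one that checks every row. A toy Python sketch of the idea (in-memory dicts standing in for table rows; all names here are hypothetical, not any BigQuery API):

```python
# Toy model of "limit 1": stop scanning at the first match instead of
# checking every row. The record layout is invented for illustration.
records = [{"id": i, "value": i * 10} for i in range(1_000_000)]

def find_all(rows, key):
    # Full scan: always touches every row, like a query without LIMIT.
    return [r for r in rows if r["id"] == key]

def find_first(rows, key):
    # Early exit: the generator stops at the first hit, like LIMIT 1.
    return next((r for r in rows if r["id"] == key), None)

print(find_first(records, 42))  # {'id': 42, 'value': 420}
```

The early-exit version still has to scan up to the match, which is why partitioning/indexing (below) matters for making that prefix short.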

Be sure you are using an efficient data storage format such as Parquet, check the size of your files to be sure you don't have the ["small file problem"](https://www.royalcyber.com/blog/data-services/managing-small...), then check whether you are using the relevant BigQuery features. Before and after those checks, run "explain" on your query: if you don't use partition keys or indexed columns, your search won't be instant in any big data system.
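A rough way to see the "small file problem": every file adds a fixed open/metadata cost, so the same rows split into thousands of tiny files are far more expensive to scan. A back-of-the-envelope model in Python (the overhead constants are invented for illustration, not measured):

```python
# Back-of-the-envelope cost model for the "small file problem".
# The constants below are illustrative assumptions, not real measurements.
OPEN_OVERHEAD_MS = 5.0   # fixed per-file cost (open, footer/metadata read)
ROW_COST_MS = 0.00001    # incremental cost per row scanned

def scan_cost_ms(total_rows, n_files):
    # Same data split across n_files: per-file overhead dominates
    # once the files get tiny.
    return n_files * OPEN_OVERHEAD_MS + total_rows * ROW_COST_MS

rows = 10_000_000
print(scan_cost_ms(rows, 10))      # 150.0  -> 10 well-sized files
print(scan_cost_ms(rows, 10_000))  # 50100.0 -> same rows, 10,000 tiny files
```

The row-scanning term is identical in both cases; only the per-file overhead changes, which is why compacting small files (or letting the table format do it) helps reads so much.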
