Glad to be helpful-- I'm in the business of data process automation, so I appreciate the opportunity to learn about new use cases. If you are willing to share what your end goal was in more detail (even as simple as an SQL query that you would now run want to run against your current schema), I'd be interested to see how an optimal process could be designed to easily generate that, and possibly suggest some tooling you could find useful. You may also want to try posting questions like this in forums such as the Seattle Data Guy's discord channel, and I'm suspect you will get lots of suggestions and advice.
Rendello|1 year ago
First, I have to find the characteristic I'm looking for:
I can see what I'm looking for, id 455, which is "Total - Mother tongue for the total population excluding institutional residents - 100% data; Single responses; Non-official languages; Indigenous languages; Inuktut (Inuit) languages; Inuktitut" (since these characteristics have parents and children, I can build up a graph with a CTE, which could be useful).To use it, I query the view with:
Which gives: So it appears to work as intended (although the men+women count isn't exactly the same as the total count, I suspect that's some StatCan wierdness that I'll have to look at). It's slow, but I'm not too worried about that, it's not meant for fast real time queries, and I can always add indexes later especially since it's now a read-only database in practice.My main concerns are regarding the schema and use of REALs and NULLs. Where there are empty strings in the CSV, I have NULLs in the database. I suspect that's the best move, but I suspect having REALs instead of DECIMALs (ie. TEXT processed with the SQLite extension) may be the wrong abstraction. For my usecase, I think the database covers all of my needs except having the areas associated with a province/territory, although I know which dataset I would need for that information and that will likely be in a future iteration of the database.
I say "it appears to work" because I'm not sure the best way to test the data in the database without reimplementing the parsing and rechecking every row in each CSV against the database. I'm wary of blindly trusting the data is correct and all there.
mattewong|1 year ago