top | item 42331245

(no title)

Glad to be helpful-- I'm in the business of data process automation, so I appreciate the opportunity to learn about new use cases. If you are willing to share what your end goal was in more detail (even as simple as an SQL query that you would now run want to run against your current schema), I'd be interested to see how an optimal process could be designed to easily generate that, and possibly suggest some tooling you could find useful. You may also want to try posting questions like this in forums such as the Seattle Data Guy's discord channel, and I'm suspect you will get lots of suggestions and advice.

discuss

Rendello|1 year ago

Here's an example case: getting all municipalities in Canada with more than 1500 native speakers of Inuktitut (the language of the Inuit):

First, I have to find the characteristic I'm looking for:

   SELECT id, description FROM characteristic WHERE description LIKE "%Inuktitut%";

I can see what I'm looking for, id 455, which is "Total - Mother tongue for the total population excluding institutional residents - 100% data; Single responses; Non-official languages; Indigenous languages; Inuktut (Inuit) languages; Inuktitut" (since these characteristics have parents and children, I can build up a graph with a CTE, which could be useful).

To use it, I query the view with:

  SELECT geo_name, c1_count_total, c2_count_men, c3_count_women FROM cview
  WHERE geo_level_name = 'Census subdivision' -- Roughly a municipality
  AND  characteristic_id = 455
  AND c1_count_total > 1500 -- Total count (speakers, in this case)
  ORDER BY c1_count_total DESC;

Which gives:

  ┌─────────────────────────────────────┬────────────────┬──────────────┬────────────────┐
  │              geo_name               │ c1_count_total │ c2_count_men │ c3_count_women │
  ├─────────────────────────────────────┼────────────────┼──────────────┼────────────────┤
  │ 'Arviat, Hamlet (HAM)'              │ 2465.0         │ 1235.0       │ 1235.0         │
  │ 'Iqaluit, City (CY)'                │ 2195.0         │ 985.0        │ 1215.0         │
  │ 'Puvirnituq, Village nordique (VN)' │ 1965.0         │ 940.0        │ 1030.0         │
  │ 'Igloolik, Hamlet (HAM)'            │ 1820.0         │ 975.0        │ 845.0          │
  │ 'Inukjuak, Village nordique (VN)'   │ 1745.0         │ 910.0        │ 835.0          │
  │ 'Kuujjuaq, Village nordique (VN)'   │ 1745.0         │ 795.0        │ 945.0          │
  └─────────────────────────────────────┴────────────────┴──────────────┴────────────────┘

So it appears to work as intended (although the men+women count isn't exactly the same as the total count, I suspect that's some StatCan wierdness that I'll have to look at). It's slow, but I'm not too worried about that, it's not meant for fast real time queries, and I can always add indexes later especially since it's now a read-only database in practice.

My main concerns are regarding the schema and use of REALs and NULLs. Where there are empty strings in the CSV, I have NULLs in the database. I suspect that's the best move, but I suspect having REALs instead of DECIMALs (ie. TEXT processed with the SQLite extension) may be the wrong abstraction. For my usecase, I think the database covers all of my needs except having the areas associated with a province/territory, although I know which dataset I would need for that information and that will likely be in a future iteration of the database.

I say "it appears to work" because I'm not sure the best way to test the data in the database without reimplementing the parsing and rechecking every row in each CSV against the database. I'm wary of blindly trusting the data is correct and all there.

mattewong|1 year ago

That is great, thank you. I'd love to continue the conversation-- maybe easier in a separate forum. Can I follow-up via the email address on your profile (gaven...)?