jlpk | 9 months ago
We have a template with accompanying documentation - partners with technical teams and well-structured data can basically self-serve with SQL. We have a series of meetings where we map columns and values, review errors, etc. More irritating than the data transformations is understanding the structure of the data and practical use, e.g., the same column for Partner A means something entirely different for Partner B. Transforming it might not be the problem - making sure everyone across our teams understands what it means and where it should go is a large pain point in mapping, but the coding of this logic is trivial.
For non-technical partners where we handle the prep, over time I wrote a small internal Python library built on DataFrames that contains a few hundred tests to validate the data, plus a set of commonly used data-cleaning functions that are flexible in their inputs. We connect to APIs to verify addresses, time zones, etc., where we can. We now use LLMs more frequently to parse and structure fields to our standard, but it is still painful to review the LLM's results and ensure correctness. Each incoming file gets an accompanying Jupyter notebook that runs all the relevant scripts.
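As a rough illustration of the kind of flexible cleaning function described above (the function, column names, and behavior here are hypothetical, assuming pandas):

```python
import pandas as pd

def clean_text(df: pd.DataFrame, columns=None, *, lower: bool = False,
               strip: bool = True) -> pd.DataFrame:
    """Normalize text columns; flexible in which columns it accepts.

    If `columns` is None, all object (string-like) columns are cleaned.
    """
    out = df.copy()
    cols = columns if columns is not None else out.select_dtypes("object").columns
    for col in cols:
        s = out[col].astype("string")
        if strip:
            # Trim edges and collapse internal runs of whitespace.
            s = s.str.strip().str.replace(r"\s+", " ", regex=True)
        if lower:
            s = s.str.lower()
        out[col] = s
    return out

raw = pd.DataFrame({"name": ["  Ada   Lovelace ", "Grace  Hopper"]})
print(clean_text(raw, lower=True)["name"].tolist())
# ['ada lovelace', 'grace hopper']
```

Functions like this compose well in a per-file notebook: each call is explicit about which columns it touched, which helps with the traceability problem mentioned below.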
The issue with working in Excel (formulas, Office JS/Python scripts, manual edits) has always been version tracking - it's difficult to replicate work if new versions of files come in while we are prepping the data, and if we find an error post-import, it is great to be able to track down where and why we made the decision or coding error. I haven't tried Power Query, though. I have tried OpenRefine, but sometimes it slows me down for easy functions, and API-based data transformations become a separate task.
When we have interns, coordinating cleaning / prep on a single file across users can be tricky.
We did try an internal POC of a UI-based tool to let partners map their own data and start cleaning it, but honestly, either a) the partner had a data analyst or steward who didn't need it, or b) the partner didn't have enough of a skillset to feel comfortable cleaning it. Plus, pretty often we need to join data from multiple spreadsheets but conditionally use certain rows from one or the other to get the most up-to-date data, which is difficult to do in a UI. It didn't feel worth the effort to continue with.
Fuzzy de-duplication validation is a pain, as is anything where you actually want to check correctness with the partner - like when I suspect an email is misspelled because it almost matches the person's last name but is one letter off - it all becomes a long list of bullet points in an email. Something I would like is an easy way to flag and share errors, have the partner correct them in a friendly UI, and then store those changes as code, without a long series of emails or meetings. A few times we've uploaded an Excel file to SharePoint with a script that automatically adds a comment to cells with errors -> but again, some people just aren't comfortable working with Excel, or get confused when they see a very structured version of their data with additional formatting, and there is no convenient way to track or replicate their changes.
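The email-almost-matches-last-name check can be sketched with a cheap similarity ratio from the standard library (the function name and 0.8 threshold are my own assumptions, not the author's code):

```python
from difflib import SequenceMatcher

def suspicious_email(email: str, last_name: str, threshold: float = 0.8) -> bool:
    """Flag an email whose local part is almost, but not exactly, the last name."""
    local = email.split("@", 1)[0].lower()
    name = last_name.lower()
    if local == name:
        return False  # exact match: nothing to question
    # ratio() is 2*matches/total_chars; near-misses score just below 1.0
    return SequenceMatcher(None, local, name).ratio() >= threshold

print(suspicious_email("smtih@example.com", "Smith"))  # near-miss -> True
print(suspicious_email("smith@example.com", "Smith"))  # exact -> False
```

Anything this flags still needs human judgment (the transposition might be intentional), which is exactly why it ends up as a bullet list for the partner rather than an automatic fix.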
There will always be one-off corrections to make as well - I've just sort of accepted this, so we generate a list of warnings / recommendations for the partner to review or correct post-import based on the test results, rather than trying to do it all ourselves. That has worked fine.
_xsbz | 9 months ago
A few things particularly stood out to me:
It sounds like you've built a robust internal Python library to manage a lot of this, which is impressive. When you use Jupyter notebooks for each incoming file, do the tests stay consistent across files or vary per file? And for your usage of LLMs, is that mostly for mapping source fields to target fields? Thanks again for sharing such a comprehensive view, it's incredibly helpful for understanding the friction points and the need for traceable, repeatable solutions.
jlpk | 9 months ago
The tests are always consistent across files and primarily check the validity of the upload, whether type (right data type, for example) or logic (start times have end times, etc.). Every test should work for every file ever, since they are based on the platform and upload constraints rather than partner data.
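A minimal sketch of what one such logic test might look like (function and column names are hypothetical; assuming pandas, as elsewhere in the thread):

```python
import pandas as pd

def check_end_after_start(df: pd.DataFrame,
                          start_col: str = "start_time",
                          end_col: str = "end_time") -> pd.DataFrame:
    """Return the rows whose end time precedes their start time."""
    start = pd.to_datetime(df[start_col], errors="coerce")
    end = pd.to_datetime(df[end_col], errors="coerce")
    # Only compare rows where both timestamps parsed successfully.
    bad = start.notna() & end.notna() & (end < start)
    return df[bad]

events = pd.DataFrame({
    "start_time": ["2024-05-01 09:00", "2024-05-01 18:00"],
    "end_time":   ["2024-05-01 17:00", "2024-05-01 10:00"],
})
print(check_end_after_start(events))  # flags the second row
```

Because the rule reflects a platform constraint ("start times have end times, in order") rather than any one partner's data, the same test can run unmodified against every incoming file.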
Short answer is they don't. For onboarding interns into the process, I write a skeleton notebook that imports the internal library and walks them through cleaning a file. But we hire interns who have a background in coding and data cleaning. Starting out, rather than change an existing function, they might add a line of code in the notebook that changes the data in a way that lets the existing function work, for example. There are cases where business-specific logic needs to be coded into a function, but we just write those ad hoc. This isn't an upload that needs to happen automatically or very quickly, so that hasn't been a problem.
The only reason other team members would contribute to this is around shaping the business logic of which fields should be uploaded where, how they should be formatted, etc. But the data review that sometimes needs to happen is very tricky, e.g., checking that the transformed results are what they want. It mostly happens in Excel - we did build a POC UI where they would upload a CSV and could click through each record to review and suggest changes in a cleaner way.
For LLMs, we don't use them for mapping, but we've tested it and it works fine. The LLM mapping doesn't really save us a ton of time compared to us just looking through the file and assigning columns in our mapping file (which is composed of about 120 columns). If we wanted to deploy it and allow the partner to upload a sheet and the LLM to suggest a mapping, that could work. (The mapping step in the demo video on your site looks great!) Instead we use them to format and extract unstructured data -> for example, we might need to pull tags out of descriptive text, or extract event hours out of a string of text describing the event. The LLM can do this really well now with structured JSON outputs, but reviewing it to ensure it is correct is still a manual process.
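Cheap automated sanity checks can narrow the manual review of the LLM's structured output. A minimal sketch, assuming the model returns JSON like the hard-coded string below (the field names, the `validate_hours` helper, and the checks are all hypothetical):

```python
import json

# Hypothetical structured output from an LLM asked to extract event hours
# from descriptive text; in practice this string would come from the model API.
llm_response = '{"event": "Farmers Market", "open": "09:00", "close": "14:00"}'

def validate_hours(payload: str) -> dict:
    """Parse the model's JSON and apply cheap sanity checks before human review."""
    data = json.loads(payload)
    for key in ("event", "open", "close"):
        if key not in data:
            raise ValueError(f"missing field: {key}")
    # Zero-padded HH:MM strings compare correctly as plain strings.
    if not data["open"] < data["close"]:
        raise ValueError("open time is not before close time")
    return data

print(validate_hours(llm_response))
```

Checks like these only catch structural and logical errors; confirming the model extracted the *right* hours from the text still needs a human, which is the remaining pain point.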