jlpk | 9 months ago
We have a template with accompanying documentation - partners with technical teams and well-structured data can basically self-serve with SQL. We have a series of meetings where we map columns and values, review errors, etc. More irritating than the data transformations is understanding the structure of the data and practical use, e.g., the same column for Partner A means something entirely different for Partner B. Transforming it might not be the problem - making sure everyone across our teams understands what it means and where it should go is a large pain point in mapping, but the coding of this logic is trivial.
For non-technical partners where we handle the prep, over time I wrote a small internal Python library built on DataFrames that contains a few hundred tests to validate the data, plus a set of commonly used data-cleaning functions that are flexible in their inputs. We connect to APIs to verify addresses, time zones, etc., where we can. We now use LLMs more frequently to parse and structure fields to our standard, but it is still painful to review the LLM's results and ensure correctness. Each incoming file gets an accompanying Jupyter notebook that runs all the relevant scripts.
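As a rough illustration of the kind of flexible cleaning function described above (the function, column names, and behavior here are hypothetical, assuming pandas):

```python
import pandas as pd

def clean_text(df: pd.DataFrame, columns=None, *, lower: bool = False,
               strip: bool = True) -> pd.DataFrame:
    """Normalize text columns; flexible in which columns it accepts.

    If `columns` is None, all object (string-like) columns are cleaned.
    """
    out = df.copy()
    cols = columns if columns is not None else out.select_dtypes("object").columns
    for col in cols:
        s = out[col].astype("string")
        if strip:
            # Trim edges and collapse internal runs of whitespace.
            s = s.str.strip().str.replace(r"\s+", " ", regex=True)
        if lower:
            s = s.str.lower()
        out[col] = s
    return out

raw = pd.DataFrame({"name": ["  Ada   Lovelace ", "Grace  Hopper"]})
print(clean_text(raw, lower=True)["name"].tolist())
# ['ada lovelace', 'grace hopper']
```

Functions like this compose well in a per-file notebook: each call is explicit about which columns it touched, which helps with the traceability problem mentioned below.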
The issue with working in Excel (formulas, Office JS/Python scripts, manual edits) has always been version tracking - it's difficult to replicate work if new versions of files come in while we are prepping the data, and if we find an error post-import, it is great to be able to track down where and why we made the decision or coding error. I haven't tried Power Query, though. I have tried OpenRefine, but sometimes it slows me down for easy functions, and API-based data transformations become a separate task.
When we have interns, coordinating cleaning / prep on a single file across users can be tricky.
We did try an internal POC of a UI-based tool to let partners map their own data and start cleaning it, but honestly, either a) the partner had a data analyst or steward who didn't need it, or b) the partner didn't have enough of a skillset to feel comfortable cleaning it. Plus, pretty often we need to join data from multiple spreadsheets but conditionally use certain rows from one or the other to get the most up-to-date data, which is difficult to do in a UI. It didn't feel worth the effort to continue with.
Fuzzy de-duplication validation is a pain, as is anything where you actually want to check correctness with the partner - like when I suspect an email is misspelled because it almost matches the person's last name but is one letter off - it all becomes a long list of bullet points in an email. Something I would like is an easy way to flag and share errors, have the partner correct them in a friendly UI, and then store those changes as code, without a long series of emails or meetings. A few times we've uploaded an Excel file to SharePoint with a script that automatically adds a comment to cells with errors -> but again, some people just aren't comfortable working with Excel, or get confused when they see a very structured version of their data with additional formatting, and there is no convenient way to track or replicate their changes.
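The email-almost-matches-last-name check can be sketched with a cheap similarity ratio from the standard library (the function name and 0.8 threshold are my own assumptions, not the author's code):

```python
from difflib import SequenceMatcher

def suspicious_email(email: str, last_name: str, threshold: float = 0.8) -> bool:
    """Flag an email whose local part is almost, but not exactly, the last name."""
    local = email.split("@", 1)[0].lower()
    name = last_name.lower()
    if local == name:
        return False  # exact match: nothing to question
    # ratio() is 2*matches/total_chars; near-misses score just below 1.0
    return SequenceMatcher(None, local, name).ratio() >= threshold

print(suspicious_email("smtih@example.com", "Smith"))  # near-miss -> True
print(suspicious_email("smith@example.com", "Smith"))  # exact -> False
```

Anything this flags still needs human judgment (the transposition might be intentional), which is exactly why it ends up as a bullet list for the partner rather than an automatic fix.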
There will always be one-off corrections to make as well - I've just sort of accepted this, so we generate a list of warnings / recommendations for the partner to review or correct post-import based on the test results, rather than trying to do it all ourselves. That has worked fine.
_xsbz | 9 months ago
A few things particularly stood out to me:
It sounds like you've built a robust internal Python library to manage a lot of this, which is impressive. When you use Jupyter notebooks for each incoming file, do the tests stay consistent across files or vary per file? And for your usage of LLMs, is that mostly for mapping source fields to target fields? Thanks again for sharing such a comprehensive view, it's incredibly helpful for understanding the friction points and the need for traceable, repeatable solutions.
jlpk | 9 months ago
The tests are always consistent across files and primarily check the validity of the upload, whether type (right data type, for example) or logic (start times have end times, etc.). Every test should work for every file ever, since they are based on the platform and upload constraints rather than partner data.
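A minimal sketch of what one such logic test might look like (function and column names are hypothetical; assuming pandas, as elsewhere in the thread):

```python
import pandas as pd

def check_end_after_start(df: pd.DataFrame,
                          start_col: str = "start_time",
                          end_col: str = "end_time") -> pd.DataFrame:
    """Return the rows whose end time precedes their start time."""
    start = pd.to_datetime(df[start_col], errors="coerce")
    end = pd.to_datetime(df[end_col], errors="coerce")
    # Only compare rows where both timestamps parsed successfully.
    bad = start.notna() & end.notna() & (end < start)
    return df[bad]

events = pd.DataFrame({
    "start_time": ["2024-05-01 09:00", "2024-05-01 18:00"],
    "end_time":   ["2024-05-01 17:00", "2024-05-01 10:00"],
})
print(check_end_after_start(events))  # flags the second row
```

Because the rule reflects a platform constraint ("start times have end times, in order") rather than any one partner's data, the same test can run unmodified against every incoming file.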
Short answer is they don't. For onboarding interns into the process, I write a skeleton notebook that imports the internal library and walks them through cleaning a file. But we hire interns who have a background in coding and data cleaning. Starting out, rather than change an existing function, they might add a line of code in the notebook that changes the data in a way that lets the existing function work, for example. There are cases where business-specific logic needs to be coded into a function, but we just write those ad hoc. This isn't an upload that needs to happen automatically or very quickly, so that hasn't been a problem.
The only reason other team members would contribute to this is around shaping the business logic of which fields should be uploaded where, how they should be formatted, etc. But the data review that sometimes needs to happen is very tricky, e.g., checking that the transformed results are what they want. It mostly happens in Excel - we did build a POC UI where they would upload a CSV and could click through each record to review and suggest changes in a cleaner way.
For LLMs, we don't use them for mapping, but we've tested it and it works fine. The LLM mapping doesn't really save us a ton of time compared to us just looking through the file and assigning columns in our mapping file (which is composed of about 120 columns). If we wanted to deploy it and allow the partner to upload a sheet and the LLM to suggest a mapping, that could work. (The mapping step in the demo video on your site looks great!) Instead we use them to format and extract unstructured data -> for example, we might need to pull tags out of descriptive text, or extract event hours out of a string of text describing the event. The LLM can do this really well now with structured JSON outputs, but reviewing it to ensure it is correct is still a manual process.
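Cheap automated sanity checks can narrow the manual review of the LLM's structured output. A minimal sketch, assuming the model returns JSON like the hard-coded string below (the field names, the `validate_hours` helper, and the checks are all hypothetical):

```python
import json

# Hypothetical structured output from an LLM asked to extract event hours
# from descriptive text; in practice this string would come from the model API.
llm_response = '{"event": "Farmers Market", "open": "09:00", "close": "14:00"}'

def validate_hours(payload: str) -> dict:
    """Parse the model's JSON and apply cheap sanity checks before human review."""
    data = json.loads(payload)
    for key in ("event", "open", "close"):
        if key not in data:
            raise ValueError(f"missing field: {key}")
    # Zero-padded HH:MM strings compare correctly as plain strings.
    if not data["open"] < data["close"]:
        raise ValueError("open time is not before close time")
    return data

print(validate_hours(llm_response))
```

Checks like these only catch structural and logical errors; confirming the model extracted the *right* hours from the text still needs a human, which is the remaining pain point.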