Would that book be useful as a reference to introduce data journalism students to AI? I'm less interested in the basics of using the API or claude code etc than best practices for workflows dealing with unstructured data, entity extraction, automated pipelines (with evals)? Although I do have some decent workflows around this I'd be interested in reading from someone who lives and breathes this kind of work. Pure data analysis to me is also something where I haven't found a good bridge between the current "generate a python script for me that I'll double check" paradigm and the spreadsheet centric world of most data journalists.
The book is likely a good fit for this type of work. The chapter on structured outputs shows how to extract data from text, walking through prompt engineering and k-shot examples to generate JSON, then Pydantic models, then batch processing with the different providers.
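Not the book's code verbatim, but the shape of that k-shot → JSON → validation pipeline is roughly this (the entity schema, worked example, and canned model reply are all made up for illustration; the book's version uses Pydantic for the validation step):

```python
import json

# K-shot prompt: a couple of worked examples, then the new text.
# (Schema and example are illustrative, not from the book.)
EXAMPLES = [
    ("Acme Corp hired Jane Doe as CFO in 2021.",
     {"entities": [{"name": "Jane Doe", "type": "PERSON"},
                   {"name": "Acme Corp", "type": "ORG"}]}),
]

def build_prompt(text: str) -> str:
    shots = "\n\n".join(
        f"Text: {t}\nJSON: {json.dumps(j)}" for t, j in EXAMPLES
    )
    return (
        "Extract entities as JSON with an 'entities' list of "
        "{name, type} objects.\n\n"
        f"{shots}\n\nText: {text}\nJSON:"
    )

def validate(raw: str) -> dict:
    """Stand-in for the Pydantic validation step: parse and check shape."""
    data = json.loads(raw)
    assert isinstance(data["entities"], list)
    for ent in data["entities"]:
        assert set(ent) == {"name", "type"}
    return data

# Canned model reply; a real pipeline would call the provider API here.
reply = '{"entities": [{"name": "John Smith", "type": "PERSON"}]}'
parsed = validate(reply)
print(parsed["entities"][0]["name"])
```

The batch-processing piece is then just running `build_prompt`/`validate` over many documents via each provider's batch API.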
The book also shows how to set up evals in several places. (Depending on what you want to do, the structured outputs chapter has evals comparing models/prompt changes against ground truth, and the agent chapter has LLM-as-a-judge evals.)
I’m always curious why local models aren’t being pushed more for certain types of data the person is handling. Data leakage to a 3rd party LLM is top on my list of concerns.
I am not as concerned with that with API usage as I am with the GUI tools.
Most of the day gig is structured extraction and agents, where the foundation LLMs are much better than any of the small models. (And I would not be able to provision the necessary compute for large models given our throughput.)
I do have evaluating Textract vs. the smaller OCR models on the to-do list though (in the book I show using docling; there are others, like the newer GLM-OCR). Our spend for that on AWS is large enough, and those models are small enough, that I could spin up sufficient resources to meet our demand.
Part of the reason the book goes through examples with AWS/Google (in addition to OpenAI/Anthropic) is that I suspect many individuals will be stuck with whatever cloud provider their org uses out of the box. So I wanted to have as wide coverage as possible for those folks.
Worth noting that AWS Bedrock makes it easy to have zero retention with premier Claude models. Not quite local, but it feels local-adjacent for security while getting affordable access to top-performing models. GCP appears to be a bit harder to set up this way.
Biggest gap I see in most "LLM for practitioners" guides is they skip the evaluation piece. Getting a prompt working on 5 examples is easy — knowing if it actually generalizes across your domain is the hard part. Especially for analysts who are used to statistical rigor, the vibes-based evaluation most LLM tutorials teach feels deeply unsatisfying.
Totally agree it is critical. Chapters 4/5/6 each have specific sections demonstrating testing. For structured outputs, it walks through building an example ground-truth set and calculating accuracy, with a demo comparing Haiku 3 vs. 4.5.
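For flavor, the ground-truth comparison can be as simple as field-level accuracy over a hand-labeled set (the records and model outputs below are fabricated placeholders, not the book's data):

```python
# Field-level accuracy of extracted records against a hand-labeled gold set.
# Records and model outputs are fabricated to illustrate the scoring.
gold = [
    {"name": "Jane Doe", "agency": "DOJ"},
    {"name": "John Roe", "agency": "FBI"},
]
preds = {
    "model-a": [{"name": "Jane Doe", "agency": "DOJ"},
                {"name": "John Roe", "agency": "DEA"}],
    "model-b": [{"name": "Jane Doe", "agency": "DOJ"},
                {"name": "John Roe", "agency": "FBI"}],
}

def accuracy(pred_rows, gold_rows):
    hits = total = 0
    for p, g in zip(pred_rows, gold_rows):
        for field, value in g.items():
            total += 1
            hits += (p.get(field) == value)
    return hits / total

for model, rows in preds.items():
    print(model, accuracy(rows, gold))
```

Once it is framed that way, swapping prompts or models and re-scoring is cheap, which is the point of keeping a fixed gold set around.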
For Chapter 5 on RAG, it goes through precision/recall (with emphasis typically on recall for RAG systems).
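Concretely, recall@k just asks what fraction of the known-relevant documents made it into the top-k retrieved set, averaged over queries (the doc ids below are toy values, not from the book):

```python
# Recall@k for a retriever: fraction of relevant documents that appear
# in the top-k results, averaged over queries. Doc ids are toy values.
def recall_at_k(retrieved, relevant, k=5):
    scores = []
    for qid, rel_ids in relevant.items():
        topk = set(retrieved[qid][:k])
        scores.append(len(topk & rel_ids) / len(rel_ids))
    return sum(scores) / len(scores)

retrieved = {"q1": ["d3", "d7", "d1", "d9", "d2"],
             "q2": ["d5", "d4", "d8", "d6", "d0"]}
relevant  = {"q1": {"d1", "d2"}, "q2": {"d4", "d9"}}

print(recall_at_k(retrieved, relevant, k=5))
```

The emphasis on recall makes sense because a reranker or the LLM itself can discard irrelevant retrieved chunks, but it can never recover a relevant chunk that was never retrieved.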
For Chapter 6, I show a demo of LLM-as-a-judge (using structured outputs to specify the particular errors it looks for) to evaluate a fuzzier objective (writing a report based on table output).