augment_me | 2 months ago
HellaSwag is a multiple-choice dataset with 4 candidate endings per question, 3 wrong and 1 right: https://huggingface.co/datasets/Rowan/hellaswag.
Your vibe-coded eval cheats by collapsing this into a binary selection on row 46 in https://github.com/Anima-Core/an1-core/blob/main/experiments..., which raises the random-choice baseline from 25% to 50% and makes the problem much easier. HellaSwag is specifically constructed with adversarial distractors that look plausible. By not including them, the eval becomes much easier.
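To make the criticism concrete, here is a minimal sketch of what "collapsing to binary" means. This is hypothetical (the function name and details are mine, and the actual code at the truncated URL may differ): instead of scoring all four endings, only the gold ending and one sampled distractor are kept, so random guessing succeeds half the time.

```python
import random

def binary_collapse(example, rng):
    """Hypothetical sketch of the collapse being criticized: keep the gold
    ending plus ONE sampled distractor instead of all three distractors."""
    label = int(example["label"])
    gold = example["endings"][label]
    distractors = [e for i, e in enumerate(example["endings"]) if i != label]
    wrong = rng.choice(distractors)
    pair = [gold, wrong]
    rng.shuffle(pair)
    # 2 options instead of 4 -> random baseline is 1/2 instead of 1/4
    return pair, pair.index(gold)
```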
---
Then, in extract_fields_from_model, there is more cheating going on. The extraction logic (h[:, -1, :]) fails to account for padding in batches, so it likely extracts EOS/pad tokens instead of the intended content tokens. This suggests the probe is relying on global sentence summaries (standard embeddings in causal structures) rather than the novel 'meaning fields' claimed in the paper.
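For reference, the standard fix for this bug: with right-padding, `h[:, -1, :]` lands on pad positions for every sequence shorter than the batch maximum, so the last real token has to be looked up via the attention mask. A minimal sketch (function name is mine, not from the repo):

```python
import torch

def last_content_token(hidden, attention_mask):
    """Select the hidden state of the last *non-padding* token per sequence.
    hidden: (B, T, D); attention_mask: (B, T) with 1 for real tokens, 0 for pads.
    Contrast with hidden[:, -1, :], which grabs pad positions under right-padding."""
    lengths = attention_mask.sum(dim=1) - 1              # index of last real token
    batch = torch.arange(hidden.size(0), device=hidden.device)
    return hidden[batch, lengths]                        # (B, D)
```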
---
I don't have time to look at more of this, and I only looked at how the eval is made, but please don't waste people's time when you don't even know what you are evaluating.
anima-core | 2 months ago
1. The HellaSwag “binary collapse” is intentional and not a leaderboard claim. This work doesn’t attempt to benchmark HellaSwag in the standard four-choice setting. The goal is to probe whether a single frozen layer carries enough information for a small head to distinguish correct versus incorrect continuations. That's a representational geometry test, not a SOTA claim. Binary framing raises the baseline, but that's expected and documented. It's not meant to compare against full LLM HellaSwag results.
2. No adversarial filtering was done. I am using HuggingFace’s standard split directly. Nothing was removed or curated. The experiment doesn't claim robustness or benchmark competitiveness, so the “easier eval” framing doesn’t really apply.
3. EOS extraction isn't cheating, it's the whole point of the probe. The extraction logic takes the final token’s hidden state, which is basic and standard for classification heads and probing studies. If the EOS token captures a high-level sequence summary, that's exactly the structural feature being examined. The result is meant to show how much task-relevant signal is already present in that early representation, not to present a new generative mechanism.
4. The purpose of the work is clearly narrow by design. This is not proposed as a drop-in replacement for full-transformer inference. The paper states that directly. The contribution is about how much structure a single early layer encodes and how far a tiny head can go under strict frozen-teacher constraints. So several of the criticisms make assumptions about goals the work never even claimed.
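The setup described in points 1 and 3, a tiny trainable head reading the last hidden state of a frozen layer to separate correct from incorrect continuations, can be sketched as follows. All names here are hypothetical illustrations of that description, not the repo's actual code:

```python
import torch
import torch.nn as nn

class TinyProbe(nn.Module):
    """Hypothetical linear probe: maps a frozen layer's final-token
    hidden state to a single correct-vs-incorrect logit."""
    def __init__(self, d_model):
        super().__init__()
        self.head = nn.Linear(d_model, 1)

    def forward(self, h_last):                 # h_last: (B, D) frozen features
        return self.head(h_last).squeeze(-1)   # (B,) binary logits

# Usage sketch: features come from a frozen teacher; only the head trains.
probe = TinyProbe(d_model=768)
feats = torch.randn(4, 768)    # stand-in for frozen-layer hidden states
logits = probe(feats)
```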
Thank you for the feedback and for taking the time.
augment_me | 2 months ago
https://www.animacore.ai/
And you literally write out "CUDA-compatible drop-in" there.
Look at your post being flagged, and think for yourself about what you are actually doing. This seems to be some kind of LLM-induced psychosis; here is a good read that could ground you: https://www.lesswrong.com/posts/rarcxjGp47dcHftCP/your-llm-a...