nimitkalra's comments

nimitkalra | 9 months ago | on: Positional preferences, order effects, prompt sensitivity undermine AI judgments

One method to get a better estimate is to extract the token log-probabilities of "YES" and "NO" from the final logits of the LLM and take a weighted sum [1] [2]. If the LLM is calibrated for your task, a borderline sample should have roughly a 50% chance of sampling YES (1) and a 50% chance of NO (0), yielding 0.5.

But generally you wouldn't use a binary outcome when you can have samples that are 50/50 pass/fail. Better to use a discrete scale of 1..3 or 1..5 and specify exactly what makes a sample a 2/5 vs. a 4/5, for example.
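As a rough illustration of the weighted sum described above, here is a minimal sketch that works for both the YES/NO case and a 1..5 scale. The function name `expected_score`, the hard-coded log-probabilities, and the renormalization over the answer tokens are all assumptions made for this example, not something taken from the linked paper or library. In practice the log-probabilities would come from whatever API exposes them for the judge's answer token.

```python
import math


def expected_score(token_logprobs, token_values):
    """Probability-weighted score over the answer tokens we care about.

    token_logprobs: answer token -> log-probability reported by the judge
    token_values:   answer token -> numeric score assigned to that token
    (Both names are hypothetical, chosen for this sketch.)
    """
    # Keep only the tokens we know how to score; ignore everything else.
    kept = {t: lp for t, lp in token_logprobs.items() if t in token_values}
    if not kept:
        raise ValueError("no scorable answer tokens in token_logprobs")

    # Assumption: renormalize so the kept tokens' probabilities sum to 1.
    # Subtracting the max log-probability keeps exp() numerically stable.
    max_lp = max(kept.values())
    weights = {t: math.exp(lp - max_lp) for t, lp in kept.items()}
    total = sum(weights.values())

    return sum(token_values[t] * w / total for t, w in weights.items())


# Binary judge: YES counts as 1, NO counts as 0.
binary_score = expected_score(
    {"YES": math.log(0.5), "NO": math.log(0.5)},
    {"YES": 1.0, "NO": 0.0},
)

# Discrete 1..5 scale: made-up probabilities for illustration only.
scale_logprobs = {
    "1": math.log(0.05),
    "2": math.log(0.10),
    "3": math.log(0.20),
    "4": math.log(0.40),
    "5": math.log(0.25),
}
scale_score = expected_score(
    scale_logprobs,
    {str(k): float(k) for k in range(1, 6)},
)

print(round(binary_score, 3), round(scale_score, 3))
```

The score is a continuous value rather than a single sampled token, so an evenly split judge lands at the midpoint of the scale instead of flipping between the extremes from run to run.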

You are correct to question the weaknesses of a panel. This class of methods depends on diversity through high-temperature sampling, which can lead to spurious YES/NO responses that don't generalize well and are effectively noise.

[1]: https://arxiv.org/abs/2303.16634

[2]: https://verdict.haizelabs.com/docs/concept/extractor/#token-...

nimitkalra | 9 months ago | on: Positional preferences, order effects, prompt sensitivity undermine AI judgments

There are technical quirks that make LLM judges particularly high variance, sensitive to artifacts in the prompt, and positively or negatively skewed, as opposed to the subjectivity of human judges. These largely arise from their training distribution and post-training, and can be contained with careful calibration.

nimitkalra | 9 months ago | on: Positional preferences, order effects, prompt sensitivity undermine AI judgments

Some other known distributional biases include self-preference bias (gpt-4o prefers gpt-4o generations over claude generations, for example) and structured output/JSON-mode bias [1]. Interestingly, some models have a more positive or negative skew than others as well. This library [2] also provides some methods for calibrating/stabilizing them.

[1]: https://verdict.haizelabs.com/docs/cookbook/distributional-b...

[2]: https://github.com/haizelabs/verdict

nimitkalra | 10 years ago | on: Homebrew Now Publicly Displays Package Download Counts

In order to see download counts for specific packages, you'll need to navigate to:

  bintray.com/homebrew/bottles/{PACKAGE}/view#statistics

(where {PACKAGE} is installed with `brew install {PACKAGE}`)

nimitkalra | 10 years ago | on: Show HN: Python to C++14 transpiler

From what I've noticed, these "transpilers" output code that is readable (the code itself is written as though a human wrote it) whereas "compilers" output code that has been optimized and shows the effects of name mangling, etc. Just an observation.

I think it makes sense to use a different term for this "compiler"-esque behavior. For example, I might edit the JavaScript that CoffeeScript generates, whereas I wouldn't know how to modify the output of gcc.

nimitkalra | 10 years ago | on: Trix: A rich text editor for everyday writing

Demo: https://wells.ee/trix-demo/

nimitkalra | 10 years ago | on: GitLab Mattermost, an open source on-premises Slack alternative

Duplicate: https://news.ycombinator.com/item?id=10081105

nimitkalra | 10 years ago | on: Project Sunroof

Risks to getting solar power (according to Project Sunroof) are something to consider.

> As with any investment, there are some risks, though a well-installed system will make most risks extremely rare. Risks include PV systems catching fire, installations leading to roof leaks, theft, obsolescence, and hail damage and/or wind damage to the solar system itself.

> Fast-growing trees can shade solar installations, reducing production over time. Utilities can change how much they charge their customers for electricity, changing the savings from solar.

nimitkalra | 10 years ago | on: Intel open sourced Stephen Hawking’s speech system

The code in the GitHub repository [1] is pretty interesting to look around in.

[1] https://github.com/01org/acat