
base76 | 2 days ago

I built a two-stage prompt compressor that runs entirely locally before your prompt hits any frontier model API.

  How it works:
  1. llama3.2:1b (via Ollama) compresses the prompt to its semantic minimum
  2. nomic-embed-text validates that the compressed version preserves the original meaning (cosine ≥ 0.85)
  3. If validation fails → original is returned unchanged. No silent corruption.
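The steps above can be sketched as a validate-or-fallback loop. This is a minimal illustration, not the repo's actual code: the model calls are abstracted behind callables (in practice `compress_fn` would call llama3.2:1b via Ollama and `embed_fn` would call nomic-embed-text), and only the 0.85 threshold and the fallback behavior come from the description above.

```python
from math import sqrt

def cosine(a, b):
    """Plain cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def compress_prompt(text, compress_fn, embed_fn, threshold=0.85):
    """Stage 1: compress via a small local LLM (compress_fn).
    Stage 2: accept the compression only if the embeddings of the
    original and compressed text clear the cosine threshold; otherwise
    return the original unchanged, so a bad compression never reaches
    the frontier-model API."""
    compressed = compress_fn(text)
    if cosine(embed_fn(text), embed_fn(compressed)) >= threshold:
        return compressed
    return text  # validation failed: no silent corruption
```

Keeping the model calls injectable also makes the fallback path trivially unit-testable with stub embeddings.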

  When it actually helps:
  The effect is meaningful only on longer inputs. Short prompts are skipped entirely — no cost, no risk.

  ┌─────────────────────────────────┬────────────┬────────┐
  │              Input              │   Tokens   │ Saving │
  ├─────────────────────────────────┼────────────┼────────┤
  │ < 80 tokens                     │ skipped    │ 0%     │
  ├─────────────────────────────────┼────────────┼────────┤
  │ Academic abstract (207t)        │ 207 → 78   │ 62%    │
  ├─────────────────────────────────┼────────────┼────────┤
  │ Structured research doc (1116t) │ 1116 → 275 │ 75%    │
  ├─────────────────────────────────┼────────────┼────────┤
  │ Short command (4t)              │ skipped    │ 0%     │
  └─────────────────────────────────┴────────────┴────────┘

  If you're sending short one-liners, this won't help. If you're injecting long context, research text, or system prompts — it pays off from the first call.
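The length gate itself can be a one-line check before stage 1 ever runs. The 80-token cutoff comes from the table above; counting whitespace-split words here is a rough stand-in for whatever tokenizer the real implementation uses.

```python
MIN_TOKENS = 80  # below this, compression overhead outweighs any saving

def should_compress(text: str) -> bool:
    """Skip short prompts entirely: no model call, no cost, no risk."""
    # Crude proxy for a token count; a real implementation would use
    # the target model's tokenizer instead of str.split().
    return len(text.split()) >= MIN_TOKENS
```

A 4-token command like `ls -la` falls straight through to the API untouched.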

  Known limitation:
  Cosine similarity is blind to negation. "way smaller" vs "way larger" scores 0.985. The LLM stage handles this by explicitly preserving negations and conditionals, but it's an open research question, tracked in issue #1.

  Install as MCP (Claude Code):
  {
    "mcpServers": {
      "token-compressor": {
        "command": "python3",
        "args": ["/path/to/token-compressor/mcp_server.py"]
      }
    }
  }

  Requires: Ollama + llama3.2:1b + nomic-embed-text

  Repo: https://github.com/base76-research-lab/token-compressor-
