SamVoice: an AI that learns exactly how you write — then helps you write more like yourself.
AI-generated text sounds the same no matter who prompted it. “Transformative,” “it’s worth noting,” “let’s dive in” — none of that is you.
You know your voice when you see it, but can you describe it? Most writers can’t articulate why their writing sounds like them.
There is no tool that tells you “this paragraph drifted from your voice” or scores how well a draft matches your actual writing patterns.
In 130,000 words of published essays, “However” appears exactly 3 times. Semicolons: 0.2 per thousand words. But every AI draft is littered with both. The signal is in the data — if you measure it.
Five years of Sunday Letters (2021–2025). 229 blog essays. 50 letters. Every word counted, every punctuation mark mapped, every transition cataloged.
Per-thousand-word frequencies extracted from the corpus. This is what makes a voice measurable, not just describable.
| Mark | Freq / 1K words | Signature |
|---|---|---|
| Scare quotes (“x”) | 10.6 | THE defining punctuation — interrogates terms constantly |
| Parenthetical asides | 7.0 | Thinks out loud — full clauses in parentheses |
| Question marks | 3.7 | Direct questions to the reader, mid-paragraph |
| “But” transitions | 3.2 | The engine of the prose, used where most writers reach for “However” (3 uses in the whole corpus) |
| Semicolons | 0.2 | Near zero. This alone flags AI-generated text. |
Contractions are NOT universal: “it is” appears 79 times vs “it’s” 60 times. “That is” 96 times vs “that’s” 17. The mix is deliberate rhythm, not inconsistency.
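A minimal sketch of how per-thousand-word frequencies like these could be measured. The regular expressions and the function names `per_thousand` and `contraction_mix` are illustrative assumptions, not the extraction rules the project uses; the scare-quote pattern in particular is a rough heuristic (any short quoted span).

```python
import re

# Hypothetical tokenizer: letters, digits and apostrophes count as one word.
WORD = re.compile(r"[A-Za-z0-9’']+")

# Hypothetical patterns for the marks in the table above.
MARK_PATTERNS = {
    "scare_quotes": re.compile(r'“[^”]{1,40}”|"[^"]{1,40}"'),
    "parentheticals": re.compile(r"\([^()]+\)"),
    "question_marks": re.compile(r"\?"),
    # "But" at the start of a line or right after sentence-ending punctuation.
    "but_transitions": re.compile(r"(?:^|[.!?]\s+)But\b", re.MULTILINE),
    "semicolons": re.compile(r";"),
}


def per_thousand(text):
    """Return occurrences of each mark per 1,000 words of text."""
    words = len(WORD.findall(text))
    if words == 0:
        return {name: 0.0 for name in MARK_PATTERNS}
    return {
        name: 1000 * len(pattern.findall(text)) / words
        for name, pattern in MARK_PATTERNS.items()
    }


def contraction_mix(text, pairs=(("it is", "it’s"), ("that is", "that’s"))):
    """Count each uncontracted form against its contraction."""
    lowered = text.lower().replace("'", "’")  # normalize straight apostrophes

    def count(phrase):
        return len(re.findall(r"\b" + re.escape(phrase) + r"\b", lowered))

    return {full: (count(full), count(short)) for full, short in pairs}
```

Run over the whole corpus, `per_thousand` would give one row per mark and `contraction_mix` the raw counts behind the "it is" vs "it’s" comparison. The numbers will depend on how the patterns are tuned.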
No GPU needed. A data-derived voice prompt encodes the precise punctuation fingerprint, vocabulary patterns, and anti-patterns. A retrieval tool pulls topic-relevant exemplar paragraphs from the corpus to calibrate the frontier model.
GPU required. Two LoRA adapters on Qwen2.5-3B-Instruct, trained on the real corpus. One scores text, one rewrites it.
The voice prompt approach is recommended for daily use. The LoRA models are for batch processing and research — scoring entire books paragraph by paragraph.
Load all 226 essays (~1,700 paragraphs). Score each paragraph on voice dimensions: informal quantifiers, hedging, parentheticals, scare quotes, transitions, sentence variety.
For a given topic, rank paragraphs by (topic relevance × voice quality). If there are not enough topical matches, backfill with the highest-voice-score paragraphs.
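The ranking-plus-backfill step could look roughly like this. The `Paragraph` fields, the `min_relevance` cutoff, and the function name `select_exemplars` are assumptions for illustration; how the real tool computes topic relevance is not described here.

```python
from dataclasses import dataclass


@dataclass
class Paragraph:
    text: str
    topic_relevance: float  # assumed 0..1 score for the requested topic
    voice_score: float      # assumed 0..1 score from the voice dimensions


def select_exemplars(paragraphs, k=5, min_relevance=0.2):
    """Pick k exemplars: topical matches first, then best-voice backfill."""
    # Topical candidates, ranked by relevance x voice quality.
    topical = [p for p in paragraphs if p.topic_relevance >= min_relevance]
    topical.sort(key=lambda p: p.topic_relevance * p.voice_score, reverse=True)
    chosen = topical[:k]

    # Not enough topical matches: fill the rest by voice score alone.
    if len(chosen) < k:
        rest = [p for p in paragraphs if p.topic_relevance < min_relevance]
        rest.sort(key=lambda p: p.voice_score, reverse=True)
        chosen += rest[: k - len(chosen)]
    return chosen
```

Multiplying the two scores means a paragraph has to be both on topic and strongly in voice to rank high. The backfill keeps the exemplar count stable even for topics the corpus barely covers.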
Combine the quantitative fingerprint (word frequencies, punctuation ratios, sentence stats) with structural rules, anti-patterns, and the retrieved exemplars.
Send to Claude with the voice-calibrated system prompt. Then score the result paragraph by paragraph, flagging formal transitions, AI slop, missing uncontracted forms, and semicolons.
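As a rough illustration of the paragraph-by-paragraph flagging pass, here is a sketch that checks three of the flags using only phrases named in this document. The flag names and functions are hypothetical, and the "missing uncontracted forms" check is left out because it needs corpus-level counts rather than a single paragraph.

```python
import re

# Phrases taken from this document; the real lists are presumably longer.
FORMAL_TRANSITIONS = re.compile(r"\b(However|Moreover)\b")
AI_SLOP = ("transformative", "it’s worth noting", "let’s dive in")


def flag_paragraph(paragraph):
    """Return the list of voice violations found in one paragraph."""
    flags = []
    if FORMAL_TRANSITIONS.search(paragraph):
        flags.append("formal_transition")
    lowered = paragraph.lower().replace("'", "’")  # normalize apostrophes
    if any(phrase in lowered for phrase in AI_SLOP):
        flags.append("ai_slop")
    if ";" in paragraph:
        flags.append("semicolon")
    return flags


def flag_draft(draft):
    """Split a draft on blank lines and flag each paragraph by index."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", draft) if p.strip()]
    return [(i, flag_paragraph(p)) for i, p in enumerate(paragraphs)]
```

A paragraph that comes back with an empty list passed these checks. That says nothing yet about the positive voice dimensions, which need the scoring model.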
Hard negatives: real author paragraphs with one specific voice dimension violated. Each teaches the classifier a precise boundary.
“But” → “However” / “Moreover”
Removes “kind of”, “I think”, “I suspect”
Injects “transformative”, “it’s worth noting”
Adds semicolons (near-zero in real corpus)
Strips the signature parenthetical asides
Removes interrogative scare quotes
Adds “Not X. Not Y. Not Z.” speechwriter patterns
Contracts everything (Sam leaves some uncontracted)
Each violation type has a calibrated label (0.25–0.55) reflecting how far it drifts from authentic voice. 300+ hard negatives in v3 training data.
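A sketch of how one real paragraph might be turned into labeled hard negatives, one violated dimension at a time. Only three of the violation types are shown, and the label values are placeholders picked from inside the stated 0.25–0.55 range; the actual calibrated labels and the logic in `scripts/generate_hard_negatives.py` are not reproduced here.

```python
import re


def swap_but(p):
    """Replace the signature "But" with a formal transition."""
    return re.sub(r"\bBut\b", "However,", p)


def strip_parentheticals(p):
    """Remove parenthetical asides along with their leading space."""
    return re.sub(r"\s*\([^()]*\)", "", p)


def contract_everything(p):
    """Contract the forms the author sometimes leaves uncontracted."""
    return re.sub(r"\b([Ii]t|[Tt]hat) is\b", r"\1’s", p)


# Placeholder labels, NOT the project's calibrated values.
VIOLATIONS = {
    "formal_transition": (swap_but, 0.30),
    "strip_parentheticals": (strip_parentheticals, 0.40),
    "contract_everything": (contract_everything, 0.50),
}


def make_hard_negatives(paragraph):
    """Return one labeled negative per violation that changes the text."""
    negatives = []
    for name, (mutate, label) in VIOLATIONS.items():
        mutated = mutate(paragraph)
        if mutated != paragraph:  # skip violations that do not apply
            negatives.append({"text": mutated, "violation": name, "label": label})
    return negatives
```

Because each negative differs from the original in exactly one dimension, the classifier sees the same content with and without that trait, which is what makes the boundary precise.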
A voice pack is the interface contract between general-purpose writing tools and a specific voice. Any voice — a person, a character, a brand — can be a voice pack.
Quantitative fingerprint — word frequencies, punctuation ratios
6 scoring dimensions — Open, Dense, Anti-OE, Voice, Close, Feel
Signal phrases for pruning + densifier thresholds
3–5 ground-truth samples of authentic writing
Judge, rewrite, prune, densify — all work with any voice pack. The pipeline logic never changes.
What “good” sounds like, what to avoid, how to score. One pack per voice, portable across tools.
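To make the "interface contract" idea concrete, here is one possible in-memory shape for a voice pack. The class, field names, and `validate` method are assumptions for illustration and do not reflect the actual schema in `voice-pack/manifest.yaml` or `VOICE-PACK-SPEC.md`; only the four ingredients and the six dimension names come from the text above.

```python
from dataclasses import dataclass, field

# The six scoring dimensions named above.
DIMENSIONS = ("Open", "Dense", "Anti-OE", "Voice", "Close", "Feel")


@dataclass
class VoicePack:
    name: str
    fingerprint: dict          # mark or word -> frequency per 1K words
    signal_phrases: list       # phrases the pruner looks for
    samples: list              # ground-truth passages of authentic writing
    densifier_thresholds: dict = field(default_factory=dict)  # hypothetical keys
    dimensions: tuple = DIMENSIONS

    def validate(self):
        """Return a list of problems; an empty list means the pack is usable."""
        problems = []
        if not self.fingerprint:
            problems.append("fingerprint is empty")
        if len(self.dimensions) != 6:
            problems.append("expected 6 scoring dimensions")
        if not 3 <= len(self.samples) <= 5:
            problems.append("expected 3-5 ground-truth samples")
        return problems
```

A tool written against this shape never needs to know whose voice it is handling. It reads the fingerprint, dimensions, and samples, and the same judge or rewrite logic applies.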
11 scripts + 7 pipeline modules. Judge, rewrite, batch, prune, densify, cross-validate, and train.
Repository: samvoice/ at /home/samschillace/dev/ANext/samvoice
Files examined: README.md, VOICE-PACK-SPEC.md, context/VOICE_PROMPT.md, context/SAMVOICE.md, voice-pack/manifest.yaml, voice-pack/rubric-anchors.yaml, voice-pack/writing-patterns.md, scripts/voice_tool.py, scripts/generate_hard_negatives.py. Git log for commit counts and contributor data. Line counts via wc -l.
All numbers are from the repository. No estimates or projections.
130,000 words contain a precise, measurable, reproducible fingerprint. Not a vague style guide — a quantitative signature that a machine can learn and a human can verify.