Every LLM call has a handful of sampling knobs. Most people only touch temperature. Here's what the rest actually do.
The core idea
At each step, the model outputs a probability distribution over the next token. Sampling parameters decide how that distribution gets turned into an actual choice.
[Interactive demo in the original post: a fixed distribution over plausible next tokens for "The cat sat on the ___", with knobs that reshape and filter the distribution before the model rolls the dice.]

temperature (usually 0.0–2.0)
Rescales the distribution before sampling. Low values sharpen it (more deterministic, picks likely tokens); high values flatten it (more creative, picks rarer tokens).
- 0.0 — greedy, always picks the top token. Good for extraction, classification, structured output.
- 0.7 — a common default. Balanced.
- 1.0+ — creative writing, brainstorming. Above ~1.3 things get weird fast.
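Concretely, temperature divides the logits before the softmax. A minimal sketch with made-up toy logits (plain Python, not any provider's actual implementation):

```python
import math

def apply_temperature(logits, temperature):
    """Rescale logits by temperature, then softmax into probabilities."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 3.0, 1.0]  # toy logits for, say, "mat", "sofa", "moon"
print(apply_temperature(logits, 0.5))  # sharper: the top token dominates
print(apply_temperature(logits, 1.5))  # flatter: the tail gets more mass
```

At temperature 0 this division blows up, which is why temperature=0 is implemented as plain greedy argmax rather than sampling.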
top_p (nucleus sampling, 0.0–1.0)
Keeps only the smallest set of tokens whose cumulative probability meets p, then samples from that set. top_p=0.9 means "consider the top 90% of probability mass."
Prefer top_p over temperature when you want creativity without the long tail of garbage tokens.
top_k
Keeps only the top k tokens by probability. Blunter than top_p but predictable. Not exposed by all providers (Anthropic has it; OpenAI doesn't).
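The same idea with a fixed cutoff instead of a probability-mass cutoff, sketched with the same toy numbers:

```python
def top_k_filter(probs, k):
    """Keep only the k highest-probability tokens and renormalise."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in order)
    return {i: probs[i] / total for i in order}

probs = [0.55, 0.25, 0.12, 0.05, 0.03]
print(top_k_filter(probs, 2))  # only the top two tokens survive
```

Unlike top_p, the set size never adapts: k=2 keeps exactly two tokens whether the model is confident or clueless.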
Rule of thumb: tune one, not all three. They interact in confusing ways. Pick temperature or top_p and leave the other at its default.
max_tokens
Hard cap on output length. Set this deliberately — it's also your cost ceiling. Too low truncates mid-sentence; too high risks runaway generation.
frequency_penalty / presence_penalty (OpenAI-style, -2.0–2.0)
- frequency_penalty — penalises tokens proportional to how often they've already appeared. Reduces repetition.
- presence_penalty — flat penalty once a token has appeared at all. Encourages new topics.
Useful for long-form generation that loops. Leave at 0 otherwise.
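A rough model of how the two penalties adjust logits, loosely following the formula OpenAI documents (token IDs and values here are made up):

```python
from collections import Counter

def penalise(logits, generated, frequency_penalty=0.0, presence_penalty=0.0):
    """Subtract repetition penalties from raw logits: frequency scales
    with how many times a token has appeared; presence is a flat hit."""
    counts = Counter(generated)
    return [
        logit
        - frequency_penalty * counts[tok]
        - presence_penalty * (1 if counts[tok] else 0)
        for tok, logit in enumerate(logits)
    ]

logits = [2.0, 1.5, 1.0]
print(penalise(logits, generated=[0, 0, 1], frequency_penalty=0.5))
# → [1.0, 1.0, 1.0]: token 0 (seen twice) loses 1.0, token 1 loses 0.5
```

Negative values work too: they *reward* repetition, which is occasionally useful but usually a footgun.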
stop sequences
Strings that terminate generation when produced. Great for structured output ("stop at </answer>") or role-play transcripts ("stop at \nUser:").
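Providers apply this server-side during generation; the effect is equivalent to cutting the output at the first match of any stop string. A sketch:

```python
def truncate_at_stop(text, stops):
    """Cut text at the earliest occurrence of any stop sequence,
    mimicking what a provider does when you pass `stop`."""
    cut = len(text)
    for s in stops:
        i = text.find(s)
        if i != -1:
            cut = min(cut, i)
    return text[:cut]

print(truncate_at_stop("42</answer> extra rambling", ["</answer>"]))  # → "42"
```

The stop sequence itself is typically not included in the returned text, which is why "stop at \nUser:" gives you a clean single turn.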
seed
Some providers accept a seed for (best-effort) reproducibility. Not guaranteed — model updates and batching can still cause drift — but helpful for eval runs.
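A hedged example of what passing a seed looks like in an OpenAI-style request body (the parameter name and whether it's honoured at all vary by provider):

```python
# Request parameters only, no client call; "some-model" is a placeholder.
params = {
    "model": "some-model",
    "messages": [{"role": "user", "content": "Summarise this."}],
    "temperature": 0,
    "seed": 1234,  # best-effort reproducibility, not a guarantee
}
print(params["seed"])
```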
What I actually reach for
- Extraction / JSON output: temperature=0, tight max_tokens, stop sequences.
- Chat / general use: defaults.
- Creative writing: temperature=1.0 or top_p=0.95, nudge a penalty if it loops.
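Putting the main pieces together, one full sampling step might look like this (toy logits and plain lists; real implementations work on tensors):

```python
import math
import random

def sample(logits, temperature=1.0, top_p=1.0):
    """One sampling step: temperature rescale, softmax, nucleus filter,
    then draw a token index from the surviving distribution."""
    scaled = [l / max(temperature, 1e-6) for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    norm = sum(probs[i] for i in kept)
    return random.choices(kept, weights=[probs[i] / norm for i in kept])[0]

print(sample([4.0, 3.0, 1.0], temperature=0.01))  # near-greedy → index 0
```

Note the order: temperature reshapes first, then top_p filters what's left, which is exactly why tuning both at once interacts in confusing ways.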
Defaults are defaults for a reason. Tune only when you have a specific behaviour to fix.
