Every LLM call has a handful of sampling knobs. Most people only touch temperature. Here's what the rest actually do.

The core idea

At each step, the model outputs a probability distribution over the next token. Sampling parameters decide how that distribution gets turned into an actual choice.

Here is a fixed distribution over plausible next tokens for "The cat sat on the ___". Every knob below reshapes or filters this distribution before the model rolls the dice.

Prompt: The cat sat on the ___

  • mat: 74.9%
  • floor: 11.7%
  • couch: 6.6%
  • table: 2.4%
  • chair: 1.8%
  • rug: 1.0%
  • bed: 0.7%
  • stairs: 0.4%
  • windowsill: 0.2%
  • roof: 0.1%
  • fence: 0.1%
  • keyboard: 0.0%
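Concretely, "rolling the dice" is just a weighted random draw over that table. A toy sketch in Python (`dist` mirrors the example values above; the `sample` helper is a name I've made up):

```python
import random

# The example distribution above, as token -> probability.
dist = {
    "mat": 0.749, "floor": 0.117, "couch": 0.066, "table": 0.024,
    "chair": 0.018, "rug": 0.010, "bed": 0.007, "stairs": 0.004,
    "windowsill": 0.002, "roof": 0.001, "fence": 0.001, "keyboard": 0.0001,
}

def sample(dist, rng=random):
    """Weighted random draw: each token is picked in proportion
    to its probability."""
    tokens, weights = zip(*dist.items())
    return rng.choices(tokens, weights=weights, k=1)[0]
```

Run it a few times and "mat" comes up roughly three times out of four. Everything that follows is about reshaping those weights first.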

temperature (usually 0.0–2.0)

Rescales the distribution before sampling. Low values sharpen it (more deterministic, picks likely tokens); high values flatten it (more creative, picks rarer tokens).

  • 0.0 — greedy, always picks the top token. Good for extraction, classification, structured output.
  • 0.7 — a common default. Balanced.
  • 1.0+ — creative writing, brainstorming. Above ~1.3 things get weird fast.
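On probabilities, temperature amounts to raising each one to the power 1/T and renormalising (real implementations divide the logits by T before the softmax, which is the same thing). A hand-rolled sketch, assuming a token-to-probability dict like the demo's:

```python
def apply_temperature(dist, temperature):
    """Rescale a probability distribution: p_i ** (1/T), renormalised.

    Equivalent to dividing the logits by T before the softmax.
    T -> 0 approaches greedy decoding; T > 1 flattens the distribution.
    """
    if temperature == 0:  # greedy: all mass on the single top token
        top = max(dist, key=dist.get)
        return {tok: float(tok == top) for tok in dist}
    scaled = {tok: p ** (1.0 / temperature) for tok, p in dist.items()}
    total = sum(scaled.values())
    return {tok: p / total for tok, p in scaled.items()}
```

On the demo distribution, 0.5 pushes "mat" toward near-certainty, while 2.0 drains mass from it and fattens the tail.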

top_p (nucleus sampling, 0.0–1.0)

Keeps only the smallest set of tokens whose cumulative probability reaches p, renormalises, and samples from that set. top_p=0.9 means "consider only the top 90% of probability mass."

Prefer top_p over temperature when you want creativity without the long tail of garbage tokens.
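The cumulative cutoff is easy to sketch (again on a probability dict; real implementations work on sorted logits, and tie-breaking details vary by provider):

```python
def apply_top_p(dist, p):
    """Nucleus sampling filter: keep the smallest set of tokens whose
    cumulative probability reaches p, then renormalise over that set."""
    nucleus, cumulative = {}, 0.0
    for tok, prob in sorted(dist.items(), key=lambda kv: -kv[1]):
        nucleus[tok] = prob
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(nucleus.values())
    return {tok: prob / total for tok, prob in nucleus.items()}
```

On the demo distribution, top_p=0.9 keeps mat, floor, and couch (74.9 + 11.7 + 6.6 = 93.2% of the mass) and discards everything from table down, so the long tail never gets a chance to fire.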

top_k

Keeps only the top k tokens by probability. Blunter than top_p but predictable. Not exposed by all providers (Anthropic has it; OpenAI doesn't).
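The same sketch for top_k is shorter by comparison: sort, truncate, renormalise.

```python
def apply_top_k(dist, k):
    """Keep only the k highest-probability tokens, renormalised."""
    kept = dict(sorted(dist.items(), key=lambda kv: -kv[1])[:k])
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}
```

The bluntness is visible here: k is fixed regardless of how the mass is spread, whereas top_p adapts, keeping fewer tokens when the model is confident and more when it is not.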

Rule of thumb: tune one, not all three. They interact in confusing ways. Pick temperature or top_p and leave the other at its default.

max_tokens

Hard cap on output length. Set this deliberately — it's also your cost ceiling. Too low truncates mid-sentence; too high risks runaway generation.

frequency_penalty / presence_penalty (OpenAI-style, -2.0 to 2.0)

  • frequency_penalty — penalises tokens proportional to how often they've already appeared. Reduces repetition.
  • presence_penalty — flat penalty once a token has appeared at all. Encourages new topics.

Useful for long-form generation that loops. Leave at 0 otherwise.
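Both penalties act on the raw logits before the softmax, based on what has already been generated. A sketch of the OpenAI-style formula (the `apply_penalties` helper and the dict-of-logits representation are mine):

```python
from collections import Counter

def apply_penalties(logits, generated, frequency_penalty=0.0,
                    presence_penalty=0.0):
    """Subtract penalties from each token's logit: frequency scales
    with the token's count so far; presence is a flat one-off
    deduction once the token has appeared at all."""
    counts = Counter(generated)
    return {
        tok: logit
        - frequency_penalty * counts[tok]
        - presence_penalty * (1 if counts[tok] else 0)
        for tok, logit in logits.items()
    }
```

A token generated twice with both penalties at 0.5 loses 1.5 from its logit (0.5 × 2 for frequency plus a flat 0.5 for presence); an unseen token is untouched.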

stop sequences

Strings that terminate generation when produced. Great for structured output ("stop at </answer>") or role-play transcripts ("stop at \nUser:").

seed

Some providers accept a seed for (best-effort) reproducibility. Not guaranteed — model updates and batching can still cause drift — but helpful for eval runs.

What I actually reach for

  • Extraction / JSON output: temperature=0, tight max_tokens, stop sequences.
  • Chat / general use: defaults.
  • Creative writing: temperature=1.0 or top_p=0.95, nudge a penalty if it loops.
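As OpenAI-style keyword dicts (parameter names follow OpenAI's Chat Completions API; the preset names, the 256-token cap, and the 0.3 penalty are illustrative, not recommendations):

```python
# Illustrative presets; splat one into your client call, e.g.
# client.chat.completions.create(model=..., messages=..., **EXTRACTION)
EXTRACTION = {
    "temperature": 0.0,
    "max_tokens": 256,           # tight cap doubles as a cost ceiling
    "stop": ["</answer>"],
}
CREATIVE = {
    "temperature": 1.0,          # or drop this and set "top_p": 0.95
    "frequency_penalty": 0.3,    # nudge upward only if output loops
}
```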

Defaults are defaults for a reason. Tune only when you have a specific behaviour to fix.