Every LLM call has a handful of sampling knobs. Most people only touch temperature. Here's what the rest actually do.
The core idea
At each step, the model outputs a probability distribution over the next token. Sampling parameters decide how that distribution gets turned into an actual choice.
[Interactive demo in the original post: a fixed distribution over plausible next tokens for "The cat sat on the ___", with knobs that reshape and filter the distribution before the model rolls the dice.]

temperature (usually 0.0–2.0)
Rescales the distribution before sampling. Low values sharpen it (more deterministic, picks likely tokens); high values flatten it (more creative, picks rarer tokens).
- 0.0 — greedy, always picks the top token. Good for extraction, classification, structured output.
- 0.7 — a common default. Balanced.
- 1.0+ — creative writing, brainstorming. Above ~1.3 things get weird fast.
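Concretely, temperature divides the logits before the softmax. A minimal sketch with made-up toy logits (plain Python, not any provider's actual implementation):

```python
import math

def apply_temperature(logits, temperature):
    """Rescale logits by temperature, then softmax into probabilities."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 3.0, 1.0]  # toy logits for, say, "mat", "sofa", "moon"
print(apply_temperature(logits, 0.5))  # sharper: the top token dominates
print(apply_temperature(logits, 1.5))  # flatter: the tail gets more mass
```

At temperature 0 this division blows up, which is why temperature=0 is implemented as plain greedy argmax rather than sampling.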
top_p (nucleus sampling, 0.0–1.0)
Keeps only the smallest set of tokens whose cumulative probability meets p, then samples from that set. top_p=0.9 means "consider the top 90% of probability mass."
Prefer top_p over temperature when you want creativity without the long tail of garbage tokens.
top_k
Keeps only the top k tokens by probability. Blunter than top_p but predictable. Not exposed by all providers (Anthropic has it; OpenAI doesn't).
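The same idea with a fixed cutoff instead of a probability-mass cutoff, sketched with the same toy numbers:

```python
def top_k_filter(probs, k):
    """Keep only the k highest-probability tokens and renormalise."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in order)
    return {i: probs[i] / total for i in order}

probs = [0.55, 0.25, 0.12, 0.05, 0.03]
print(top_k_filter(probs, 2))  # only the top two tokens survive
```

Unlike top_p, the set size never adapts: k=2 keeps exactly two tokens whether the model is confident or clueless.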
Rule of thumb: tune one, not all three. They interact in confusing ways. Pick temperature or top_p and leave the other at its default.
max_tokens
Hard cap on output length. Set this deliberately — it's also your cost ceiling. Too low truncates mid-sentence; too high risks runaway generation.
frequency_penalty / presence_penalty (OpenAI-style, -2.0–2.0)
- frequency_penalty — penalises tokens proportional to how often they've already appeared. Reduces repetition.
- presence_penalty — flat penalty once a token has appeared at all. Encourages new topics.
Useful for long-form generation that loops. Leave at 0 otherwise.
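A rough model of how the two penalties adjust logits, loosely following the formula OpenAI documents (token IDs and values here are made up):

```python
from collections import Counter

def penalise(logits, generated, frequency_penalty=0.0, presence_penalty=0.0):
    """Subtract repetition penalties from raw logits: frequency scales
    with how many times a token has appeared; presence is a flat hit."""
    counts = Counter(generated)
    return [
        logit
        - frequency_penalty * counts[tok]
        - presence_penalty * (1 if counts[tok] else 0)
        for tok, logit in enumerate(logits)
    ]

logits = [2.0, 1.5, 1.0]
print(penalise(logits, generated=[0, 0, 1], frequency_penalty=0.5))
# → [1.0, 1.0, 1.0]: token 0 (seen twice) loses 1.0, token 1 loses 0.5
```

Negative values work too: they *reward* repetition, which is occasionally useful but usually a footgun.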
stop sequences
Strings that terminate generation when produced. Great for structured output ("stop at </answer>") or role-play transcripts ("stop at \nUser:").
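Providers apply this server-side during generation; the effect is equivalent to cutting the output at the first match of any stop string. A sketch:

```python
def truncate_at_stop(text, stops):
    """Cut text at the earliest occurrence of any stop sequence,
    mimicking what a provider does when you pass `stop`."""
    cut = len(text)
    for s in stops:
        i = text.find(s)
        if i != -1:
            cut = min(cut, i)
    return text[:cut]

print(truncate_at_stop("42</answer> extra rambling", ["</answer>"]))  # → "42"
```

The stop sequence itself is typically not included in the returned text, which is why "stop at \nUser:" gives you a clean single turn.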
seed
Some providers accept a seed for (best-effort) reproducibility. Not guaranteed — model updates and batching can still cause drift — but helpful for eval runs.
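A hedged example of what passing a seed looks like in an OpenAI-style request body (the parameter name and whether it's honoured at all vary by provider):

```python
# Request parameters only, no client call; "some-model" is a placeholder.
params = {
    "model": "some-model",
    "messages": [{"role": "user", "content": "Summarise this."}],
    "temperature": 0,
    "seed": 1234,  # best-effort reproducibility, not a guarantee
}
print(params["seed"])
```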
What I actually reach for
- Extraction / JSON output: temperature=0, tight max_tokens, stop sequences.
- Chat / general use: defaults.
- Creative writing: temperature=1.0 or top_p=0.95, nudge a penalty if it loops.
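Putting the main pieces together, one full sampling step might look like this (toy logits and plain lists; real implementations work on tensors):

```python
import math
import random

def sample(logits, temperature=1.0, top_p=1.0):
    """One sampling step: temperature rescale, softmax, nucleus filter,
    then draw a token index from the surviving distribution."""
    scaled = [l / max(temperature, 1e-6) for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    norm = sum(probs[i] for i in kept)
    return random.choices(kept, weights=[probs[i] / norm for i in kept])[0]

print(sample([4.0, 3.0, 1.0], temperature=0.01))  # near-greedy → index 0
```

Note the order: temperature reshapes first, then top_p filters what's left, which is exactly why tuning both at once interacts in confusing ways.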
Defaults are defaults for a reason. Tune only when you have a specific behaviour to fix.
