pairwiseLLM uses large language models (LLMs) to compare
pairs of writing samples and decide which sample is better on a given
trait (for example, Overall Quality).
If a prompt template systematically nudges the model toward the first or second position, then scores derived from these comparisons may be biased. This vignette documents how we tested the package's prompt templates for positional bias and shows how to run the same checks on templates of your own. For basic function usage and advanced batch-processing workflows, see the package's other vignettes.
At a high level, the testing pipeline works as follows:

1. **Trait and samples.** Choose a trait (e.g., "overall_quality") and obtain its description with trait_description(). Use example_writing_samples or your own dataset of writing samples.

2. **Generate forward and reverse pairs.** Use make_pairs() to generate all ordered pairs, alternate_pair_order() to build a deterministic "forward" set, and sample_reverse_pairs() with reverse_pct = 1 to build a fully "reversed" set, where SAMPLE_1 and SAMPLE_2 are swapped for all pairs.

3. **Prompt templates.** Define the five test templates ("test1"–"test5"), register them in the template registry, and retrieve each with get_prompt_template("testX").

4. **Batch calls to LLM providers.** For each combination of template (test1–test5), model (e.g., claude-sonnet-4-5, gpt-4o, gemini-3-pro-preview), thinking mode ("no_thinking" vs "with_thinking", where applicable), and order (forward vs reverse), submit the forward and reverse pairs to the provider's batch API using dev scripts such as dev/dev-positional-bias-all-models.R, dev/dev-positional-bias-all-models-rebuild.R, and dev/dev-together-template-positional-bias.R. Store responses as CSVs, including the model's <BETTER_SAMPLE> decision and the derived better_id.

5. **Reverse-order consistency.** For each (template, provider, model, thinking) combination, compare the forward and reverse results. Use compute_reverse_consistency() to compute prop_consistent: the proportion of comparisons where reversing the order yields the same underlying winner.

6. **Positional-bias statistics.** Use check_positional_bias() on the reverse-consistency results to quantify prop_pos1 (the proportion of all comparisons where SAMPLE_1 is chosen as better) and p_sample1_overall (the p-value from a binomial test of whether the probability of choosing SAMPLE_1 differs from 0.5).

7. **Summarize and interpret.** A well-behaved configuration shows prop_consistent close to 1, prop_pos1 close to 0.5, and a non-significant p_sample1_overall (not < .05).

In the sections below we show how to retrieve the templates, how they are intended to be used, and how to examine the summary statistics for the experiment.
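First, a compact sketch of steps 2, 5, and 6. The batch-submission step is provider-specific and omitted here, and the exact arguments of compute_reverse_consistency() and check_positional_bias() may differ from this illustration:

library(pairwiseLLM)

# Step 2: forward and fully reversed pair sets
pairs_forward <- example_writing_samples |>
  make_pairs() |>
  alternate_pair_order()
pairs_reverse <- sample_reverse_pairs(pairs_forward, reverse_pct = 1, seed = 2002)

# ... submit pairs_forward and pairs_reverse to a provider's batch API
# (see the dev/ scripts) and collect the <BETTER_SAMPLE> decisions ...

# Steps 5-6: consistency and bias statistics (argument names illustrative)
# consistency <- compute_reverse_consistency(forward_results, reverse_results)
# bias <- check_positional_bias(consistency)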
In the tests, we evaluated samples for overall quality.
td <- trait_description("overall_quality")
td
#> $name
#> [1] "Overall Quality"
#>
#> $description
#> [1] "Overall quality of the writing, considering how well ideas are expressed,\n how clearly the writing is organized, and how effective the language and\n conventions are."In pairwiseLLM, every pairwise comparison evaluates writing samples on a trait — a specific dimension of writing quality, such as:
The trait determines what the model should focus on when choosing which sample is better. Each trait has:

- an ID used in code (e.g., "overall_quality")
- a display name (e.g., "Overall Quality")

The function that supplies these definitions is trait_description().
The package includes some predefined traits accessible by name, such as "overall_quality" and "organization". Calling a built-in trait returns a list with:

- name: the trait's display name
- description: the full trait definition

The call to trait_description("overall_quality") above is an example.
This description is inserted into your chosen prompt template
wherever {TRAIT_DESCRIPTION} appears.
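Conceptually this is plain placeholder substitution. A minimal base-R illustration of the idea (build_prompt() performs this, along with the other placeholders, for you):

# Illustrative placeholder substitution only; not the package's internals
fill_placeholder <- function(template, placeholder, value) {
  gsub(placeholder, value, template, fixed = TRUE)
}
fill_placeholder(
  "DEFINITION: {TRAIT_DESCRIPTION}",
  "{TRAIT_DESCRIPTION}",
  td$description
)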
To switch evaluations to another trait, simply pass its ID:
td <- trait_description("organization")
prompt <- build_prompt(
template = get_prompt_template("test1"),
trait_name = td$name,
trait_desc = td$description,
text1 = sample1,
text2 = sample2
)

This will automatically update all trait-specific wording in the prompt.
If your study requires a new writing dimension, you can define your own trait directly in the call:
td <- trait_description(
custom_name = "Clarity",
custom_description = "Clarity refers to how easily a reader can understand the writer's ideas, wording, and structure."
)
td$name
#> [1] "Clarity"
td$description
#> [1] "Clarity refers to how easily ..."No built-in name needs to be supplied when using custom text:
prompt <- build_prompt(
template = get_prompt_template("test2"),
trait_name = td$name,
trait_desc = td$description,
text1 = sample1,
text2 = sample2
)

Traits determine the criterion of comparison, and different traits may produce different sensitivity patterns in LLM behavior.
Because positional bias interacts with how the model interprets the trait, every trait–template combination can be evaluated using the same workflow described earlier in this vignette.
The positional-bias experiments in this vignette use the
example_writing_samples dataset that ships with the
package.
Each row represents a student writing sample and includes:

- ID: a unique sample identifier
- text: the full written response
- quality_score: an integer quality rating

Below we print the 20 writing samples included in the file.
This dataset provides a reproducible testing base; in real applications,
you would use your own writing samples.
data("example_writing_samples", package = "pairwiseLLM")
# Inspect the structure
glimpse(example_writing_samples)
#> Rows: 20
#> Columns: 3
#> $ ID <chr> "S01", "S02", "S03", "S04", "S05", "S06", "S07", "S08", …
#> $ text <chr> "Writing assessment is hard. People write different thin…
#> $ quality_score <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
# Print the 20 samples (full text)
example_writing_samples |>
kable(
caption = "20 example writing samples included with pairwiseLLM."
)

| ID | text | quality_score |
|---|---|---|
| S01 | Writing assessment is hard. People write different things. It is confusing. | 1 |
| S02 | It is hard to grade writing. Some are long and some are short. I do not know which is best. | 2 |
| S03 | Assessing writing is difficult because everyone writes differently and it can be hard to decide what is good or bad. | 3 |
| S04 | Grading essays is tough work. You have to read a lot. Sometimes the handwriting is bad or the grammar is wrong, and that makes it hard to give a score. | 4 |
| S05 | Writing assessment is challenging because teachers must judge ideas, organization, grammar, and style all at once. Different raters may focus on different things. | 5 |
| S06 | It is difficult to assess writing because it is subjective. One teacher might like a creative style while another teacher wants a strict structure. This makes the scores unfair sometimes. | 6 |
| S07 | Writing assessment is difficult because writing is a complex skill. Raters must consider ideas, organization, style, and conventions, and these features do not always align. | 7 |
| S08 | A paper with strong ideas might have weak grammar, while another has flawless sentences but no clear argument. Deciding which one deserves a higher score is a major challenge in assessment. | 8 |
| S09 | Assessing writing is difficult because the construct is multidimensional. Even with detailed rubrics, raters interpret criteria differently, and their judgments can be influenced by fatigue or expectations. | 9 |
| S10 | The difficulty in writing assessment lies in consistency. Because raters bring their own background knowledge and preferences to the task, achieving high inter-rater reliability requires extensive training and calibration. | 10 |
| S11 | Writing assessment is difficult because we are trying to compress a rich, multi-dimensional performance into a single score. Raters must weigh content, organization, style, and mechanics, while also dealing with time pressure. | 11 |
| S12 | Evaluating writing is challenging because no rubric can fully capture what makes a text effective for a particular audience. Two essays might receive the same score for completely different reasons, obscuring the feedback loop. | 12 |
| S13 | Writing assessment is difficult because it is context-dependent. A style that works for a narrative is inappropriate for a report. Raters must constantly adjust their internal standard based on the specific purpose of the prompt. | 13 |
| S14 | The challenge of writing assessment is distinguishing between surface-level errors and deep structural flaws. Raters often over-penalize mechanical mistakes while missing more significant issues in logic or argumentation due to cognitive load. | 14 |
| S15 | Writing assessment is difficult because it sits at the intersection of measurement and interpretation. Raters must translate complex judgments about ideas, voice, and language into discrete rubric categories, often losing nuance in the process. | 15 |
| S16 | Assessing writing is inherently difficult because it requires balancing consistency with sensitivity. A rubric describes general qualities, but individual texts vary in genre and voice. Raters must decide if an unconventional choice is a mistake or a stylistic innovation. | 16 |
| S17 | Writing assessment is challenging because of the trade-off between validity and reliability. Highly standardized scoring protocols often strip away the subjective appreciation of voice and creativity, while holistic scoring captures the ‘whole’ but risks being unreliable. | 17 |
| S18 | The fundamental difficulty in writing assessment is cognitive complexity. The rater must construct a mental model of the writer’s argument while simultaneously evaluating against specific criteria. This dual processing makes the task prone to bias and halo effects. | 18 |
| S19 | Writing assessment is difficult because it asks us to quantify something fundamentally qualitative. To evaluate a piece of writing, raters integrate judgments about content, organization, and style, while also considering task demands. Scores often reflect both the text and the rater’s implicit theory of writing. | 19 |
| S20 | Writing assessment is inherently problematic because it attempts to standardize a socially situated act. The assessment process often decontextualizes the writing, stripping it of its communicative purpose. Consequently, the score represents a construct of ‘school writing’ rather than authentic communication, creating a validity gap that simple psychometrics cannot resolve. | 20 |
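In practice you would swap in your own data. A minimal sketch, assuming (as with example_writing_samples) that the downstream pairing functions only need an ID and a text column:

library(tibble)

# Hypothetical user-supplied samples with the same column layout
my_samples <- tibble(
  ID   = c("W01", "W02", "W03"),
  text = c(
    "First writing sample ...",
    "Second writing sample ...",
    "Third writing sample ..."
  )
)

# my_samples can then be passed to make_pairs() in place of
# example_writing_samples.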
The tested templates are stored as plain-text files in the package and exposed via the template registry. Use get_prompt_template() to retrieve and view the text:
cat(substr(get_prompt_template("test1"), 1, 500), "...\n")
#> You are a debate adjudicator. Your task is to weigh the comparative strengths of two writing samples regarding a specific trait.
#>
#> TRAIT: {TRAIT_NAME}
#> DEFINITION: {TRAIT_DESCRIPTION}
#>
#> SAMPLES:
#>
#> === SAMPLE_1 ===
#> {SAMPLE_1}
#>
#> === SAMPLE_2 ===
#> {SAMPLE_2}
#>
#> EVALUATION PROCESS (Mental Simulation):
#>
#> 1. **Advocate for SAMPLE_1**: Mentally list the single strongest point of evidence that makes SAMPLE_1 the winner.
#> 2. **Advocate for SAMPLE_2**: Mentally list the single strongest point of evidence that mak ...

The same pattern works for all templates:
# Retrieve another template
tmpl_test3 <- get_prompt_template("test3")
# Use it to build a concrete prompt for a single comparison
pairs <- example_writing_samples |>
make_pairs() |>
head(1)
prompt_text <- build_prompt(
template = tmpl_test3,
trait_name = td$name,
trait_desc = td$description,
text1 = pairs$text1[1],
text2 = pairs$text2[1]
)
cat(prompt_text)

Here is a small example of how we constructed forward and reverse datasets for each experiment:
pairs_all <- example_writing_samples |>
make_pairs()
pairs_forward <- pairs_all |>
alternate_pair_order()
pairs_reverse <- sample_reverse_pairs(
pairs_forward,
reverse_pct = 1.0,
seed = 2002
)
pairs_forward[1:3, c("ID1", "ID2")]
#> # A tibble: 3 × 2
#> ID1 ID2
#> <chr> <chr>
#> 1 S01 S02
#> 2 S03 S01
#> 3 S01 S04
pairs_reverse[1:3, c("ID1", "ID2")]
#> # A tibble: 3 × 2
#> ID1 ID2
#> <chr> <chr>
#> 1 S18 S02
#> 2 S18 S06
#> 3 S07   S08

In pairs_reverse, SAMPLE_1 and SAMPLE_2 are swapped for
every pair relative to pairs_forward. All other metadata
(IDs, traits, etc.) remain consistent so that we can compare results
pairwise.
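As a quick sanity check, the two sets should contain the same unordered pairs regardless of row order. A sketch using only the ID1/ID2 columns shown above:

# Build an order-insensitive key for each pair, then compare the sets
pair_key <- function(d) sort(paste(pmin(d$ID1, d$ID2), pmax(d$ID1, d$ID2)))
identical(pair_key(pairs_forward), pair_key(pairs_reverse))  # should be TRUE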
Many LLM providers now expose reasoning-enhanced decoding
modes (sometimes called “thinking,” “chain-of-thought modules,” or
“structured reasoning engines”).
In pairwiseLLM, these modes are exposed through a simple
parameter:
thinking = "no_thinking" # standard inference mode
thinking = "with_thinking" # activates provider's reasoning system
However, the actual meaning of these settings is backend-specific. Below we describe the exact configurations used in our positional-bias tests.
Anthropic’s batch API allows explicit control over the reasoning system.
thinking = "no_thinking"reasoning = "none"temperature = 0thinking = "with_thinking"reasoning = "enabled"temperature = 1include_thoughts = TRUEthinking_budget = 1024 (max internal reasoning
tokens)This mode yields more reflective but less deterministic decisions.
Gemini’s batch API exposes reasoning through the
thinkingLevel field.
thinking = "with_thinking" was usedSettings used:
thinkingLevel = "low"includeThoughts = TRUEtemperature left at provider
defaultThis yields lightweight reasoning comparable to Anthropic’s enabled mode.
OpenAI supports two distinct APIs:
- chat.completions — standard inference
- responses — reasoning-enabled (formerly “Chain of Thought” via o-series)

thinking = "no_thinking" (used for all models, including gpt-5.1):

- chat.completions
- temperature = 0

thinking = "with_thinking" (gpt-5.1 only):

- responses API
- reasoning = "low"
- include_thoughts = TRUE
- no temperature parameter (OpenAI ignores it for this endpoint)

This mode returns reasoning metadata that is stripped prior to analysis.
For Together.ai we ran positional-bias experiments using the Chat Completions API (/v1/chat/completions) for the following models: DeepSeek-R1, DeepSeek-V3, Kimi-K2-Instruct-0905, and Qwen3-235B-A22B-Instruct-2507.

DeepSeek-R1 emits internal reasoning wrapped in <think>…</think> tags; this block is extracted into a thoughts field, leaving the visible answer in content.

Temperature settings used in testing:

- deepseek-ai/DeepSeek-R1: temperature = 0.6
- DeepSeek-V3, Kimi-K2, Qwen3: temperature = 0.0
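As an illustration of handling DeepSeek-R1 output, here is a small sketch that strips the <think>…</think> block from a raw completion. The helper name is hypothetical; it only mirrors the extraction described above:

# Remove the <think>...</think> reasoning block, keeping the visible answer
strip_think <- function(x) {
  trimws(gsub("(?s)<think>.*?</think>", "", x, perl = TRUE))
}
strip_think("<think>Sample 2 argues more clearly.</think>\n<BETTER_SAMPLE>SAMPLE_2</BETTER_SAMPLE>")
# leaves only: <BETTER_SAMPLE>SAMPLE_2</BETTER_SAMPLE>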
| Backend | Thinking Mode | What It Controls | Temperature Used | Notes |
|---|---|---|---|---|
| Anthropic | no_thinking | reasoning=none, no thoughts | 0 | deterministic |
| Anthropic | with_thinking | reasoning enabled, thoughts included, budget=1024 | 1 | rich internal reasoning |
| Gemini | with_thinking only | thinkingLevel="low", includeThoughts | provider default | batch API does not support pure no-thinking mode |
| OpenAI | no_thinking | chat.completions, no reasoning | 0 | deterministic |
| OpenAI | with_thinking (5.1) | responses API with reasoning=low | ignored / N/A | only applied to gpt-5.1 |
| Together | with_thinking | Chat Completions with <think>…</think> extracted to thoughts | 0.6 (default) | internal reasoning always on; visible answer in content |
| Together | no_thinking | Chat Completions, no explicit reasoning toggle | 0 | reasoning not supported in these specific models |
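The table above can be expressed as a simple lookup. A sketch (not the package’s API) that maps the single thinking switch onto the backend-specific settings:

# Map (backend, thinking) to the provider settings from the table above
thinking_config <- function(backend, thinking) {
  switch(
    paste(backend, thinking, sep = "/"),
    "anthropic/no_thinking"   = list(reasoning = "none", temperature = 0),
    "anthropic/with_thinking" = list(reasoning = "enabled", temperature = 1,
                                     include_thoughts = TRUE,
                                     thinking_budget = 1024),
    "gemini/with_thinking"    = list(thinkingLevel = "low",
                                     includeThoughts = TRUE),
    "openai/no_thinking"      = list(api = "chat.completions", temperature = 0),
    "openai/with_thinking"    = list(api = "responses", reasoning = "low",
                                     include_thoughts = TRUE),
    stop("unsupported backend/thinking combination")
  )
}

thinking_config("anthropic", "with_thinking")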
The results from the experiments are stored in a CSV included in the
package (for example, under
inst/extdata/template_test_summary_all.csv). We load and
lightly clean that file here.
summary_path <- system.file("extdata", "template_test_summary_all.csv", package = "pairwiseLLM")
if (!nzchar(summary_path)) stop("Data file not found in installed package.")
summary_tbl <- readr::read_csv(summary_path, show_col_types = FALSE)
head(summary_tbl)
#> # A tibble: 6 × 7
#> template_id backend model thinking prop_consistent prop_pos1 p_sample1_overall
#> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 test1 anthro… clau… no_thin… 0.895 0.505 0.878
#> 2 test1 anthro… clau… with_th… 0.932 0.497 0.959
#> 3 test1 anthro… clau… no_thin… 0.884 0.516 0.573
#> 4 test1 anthro… clau… with_th… 0.905 0.484 0.573
#> 5 test1 anthro… clau… no_thin… 0.884 0.442 0.0273
#> 6 test1       anthro… clau… with_th…           0.884     0.447            0.0453

The columns in summary_tbl are:
template_id
ID of the prompt template (e.g., "test1").
backend
LLM backend ("anthropic", "gemini",
"openai", "together").
model
Specific model (e.g., "claude-sonnet-4-5",
"gpt-4o", "gemini-3-pro-preview").
thinking
Reasoning configuration (usually "no_thinking" or
"with_thinking"). The exact meaning depends on the provider
and dev script (for example, reasoning turned on vs off, or
thinking-level settings for Gemini).
prop_consistent
Proportion of comparisons that remained consistent when the pair order
was reversed. Higher values indicate greater order-invariance.
prop_pos1
Proportion of comparisons where SAMPLE_1 was chosen as better. Values
near 0.5 indicate little or no positional bias toward the first
position.
p_sample1_overall
p-value from a binomial test of whether the probability of choosing
SAMPLE_1 differs from 0.5. Smaller p-values suggest that the observed
preference (for or against SAMPLE_1) is unlikely to be due to chance
alone.
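For intuition, p_sample1_overall is a two-sided binomial test that you can reproduce in base R. A sketch with purely illustrative counts:

# Two-sided binomial test of H0: P(choose SAMPLE_1) = 0.5.
# Illustrative counts: SAMPLE_1 chosen 168 times out of 380 comparisons.
binom.test(x = 168, n = 380, p = 0.5)$p.value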
The three key statistics for each (template, provider, model, thinking) combination are:
Proportion consistent
(prop_consistent)
Proportion choosing SAMPLE_1
(prop_pos1)
Binomial test p-value
(p_sample1_overall)
As an example, a row with:

- prop_consistent = 0.93
- prop_pos1 = 0.48
- p_sample1_overall = 0.57

suggests that decisions are largely order-invariant and show no detectable positional bias.

By contrast, a row with:

- prop_consistent = 0.83
- prop_pos1 = 0.42
- p_sample1_overall = 0.001

would suggest weaker order-invariance and a statistically significant preference against the first position.
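As a quick screen, you can flag configurations that meet such criteria directly in summary_tbl. A sketch with illustrative thresholds:

summary_tbl |>
  filter(
    prop_consistent >= 0.90,        # strong order-invariance
    abs(prop_pos1 - 0.5) <= 0.05,   # little preference for either position
    p_sample1_overall >= 0.05       # no statistically significant bias
  ) |>
  arrange(desc(prop_consistent))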
In this section we present, for each template:
The full template text (as used in the experiments).
A simple summary table with one row per (backend, model, thinking) configuration and columns:
Backend, Model, Thinking, Prop_Consistent, Prop_SAMPLE_1, and Binomial_Test_p.

test1

cat(get_prompt_template("test1"))
#> You are a debate adjudicator. Your task is to weigh the comparative strengths of two writing samples regarding a specific trait.
#>
#> TRAIT: {TRAIT_NAME}
#> DEFINITION: {TRAIT_DESCRIPTION}
#>
#> SAMPLES:
#>
#> === SAMPLE_1 ===
#> {SAMPLE_1}
#>
#> === SAMPLE_2 ===
#> {SAMPLE_2}
#>
#> EVALUATION PROCESS (Mental Simulation):
#>
#> 1. **Advocate for SAMPLE_1**: Mentally list the single strongest point of evidence that makes SAMPLE_1 the winner.
#> 2. **Advocate for SAMPLE_2**: Mentally list the single strongest point of evidence that makes SAMPLE_2 the winner.
#> 3. **Adjudicate**: Compare the *strength of the evidence* identified in steps 1 and 2. Which sample provided the more compelling demonstration of the definition above?
#>
#> CRITICAL:
#> - You must construct a mental argument for BOTH samples before deciding.
#> - Do not default to the first sample read.
#> - If the samples are close, strictly follow the trait definition to break the tie.
#>
#> FINAL DECISION:
#> Output your decision based on the stronger evidence.
#>
#> <BETTER_SAMPLE>SAMPLE_1</BETTER_SAMPLE>
#> OR
#> <BETTER_SAMPLE>SAMPLE_2</BETTER_SAMPLE>
#>
#> (Provide only the XML tag).

summary_tbl |>
filter(template_id == "test1") |>
arrange(backend, model, thinking) |>
mutate(
Prop_Consistent = round(prop_consistent, 3),
Prop_SAMPLE_1 = round(prop_pos1, 3),
Binomial_Test_p = formatC(p_sample1_overall, format = "f", digits = 3)
) |>
select(
Backend = backend,
Model = model,
Thinking = thinking,
Prop_Consistent,
Prop_SAMPLE_1,
Binomial_Test_p
) |>
kable(
align = c("l", "l", "l", "r", "r", "r")
)

| Backend | Model | Thinking | Prop_Consistent | Prop_SAMPLE_1 | Binomial_Test_p |
|---|---|---|---|---|---|
| anthropic | claude-haiku-4-5 | no_thinking | 0.884 | 0.516 | 0.573 |
| anthropic | claude-haiku-4-5 | with_thinking | 0.905 | 0.484 | 0.573 |
| anthropic | claude-opus-4-5 | no_thinking | 0.884 | 0.442 | 0.027 |
| anthropic | claude-opus-4-5 | with_thinking | 0.884 | 0.447 | 0.045 |
| anthropic | claude-sonnet-4-5 | no_thinking | 0.895 | 0.505 | 0.878 |
| anthropic | claude-sonnet-4-5 | with_thinking | 0.932 | 0.497 | 0.959 |
| gemini | gemini-3-pro-preview | with_thinking | 0.926 | 0.521 | 0.442 |
| openai | gpt-4.1 | no_thinking | 0.937 | 0.479 | 0.442 |
| openai | gpt-4o | no_thinking | 0.837 | 0.418 | 0.002 |
| openai | gpt-5.1 | no_thinking | 0.926 | 0.474 | 0.330 |
| openai | gpt-5.1 | with_thinking | 0.858 | 0.429 | 0.006 |
| together | DeepSeek-R1 | with_thinking | 0.837 | 0.576 | 0.003 |
| together | DeepSeek-V3 | no_thinking | 0.921 | 0.487 | 0.644 |
| together | Kimi-K2-Instruct-0905 | no_thinking | 0.889 | 0.455 | 0.090 |
| together | Qwen3-235B-A22B-Instruct-2507 | no_thinking | 0.821 | 0.416 | 0.001 |
test2

cat(get_prompt_template("test2"))
#> You are an impartial, expert writing evaluator. You will be provided with two student writing samples.
#>
#> YOUR GOAL: Identify which sample is better regarding {TRAIT_NAME}.
#>
#> ***
#> SAMPLE_1 START
#> ***
#> {SAMPLE_1}
#> ***
#> SAMPLE_1 END
#> ***
#>
#> ***
#> SAMPLE_2 START
#> ***
#> {SAMPLE_2}
#> ***
#> SAMPLE_2 END
#> ***
#>
#> EVALUATION CRITERIA:
#> Trait: {TRAIT_NAME}
#> Definition: {TRAIT_DESCRIPTION}
#>
#> DECISION PROTOCOL:
#> 1. Ignore the order in which the samples appeared.
#> 2. Mentally 'shuffle' the samples. If Sample 1 was read second, would it still be better/worse?
#> 3. Focus STRICTLY on the definition above. Ignore length, vocabulary complexity, or style unless explicitly mentioned in the definition.
#> 4. If the samples are effectively tied, scrutinize them for the slightest advantage in {TRAIT_NAME} to break the tie.
#>
#> OUTPUT FORMAT:
#> You must output ONLY one of the following tags. Do not produce any other text, reasoning, or preamble.
#>
#> <BETTER_SAMPLE>SAMPLE_1</BETTER_SAMPLE>
#> or
#> <BETTER_SAMPLE>SAMPLE_2</BETTER_SAMPLE>

summary_tbl |>
filter(template_id == "test2") |>
arrange(backend, model, thinking) |>
mutate(
Prop_Consistent = round(prop_consistent, 3),
Prop_SAMPLE_1 = round(prop_pos1, 3),
Binomial_Test_p = formatC(p_sample1_overall, format = "f", digits = 3)
) |>
select(
Backend = backend,
Model = model,
Thinking = thinking,
Prop_Consistent,
Prop_SAMPLE_1,
Binomial_Test_p
) |>
kable(
align = c("l", "l", "l", "r", "r", "r")
)

| Backend | Model | Thinking | Prop_Consistent | Prop_SAMPLE_1 | Binomial_Test_p |
|---|---|---|---|---|---|
| anthropic | claude-haiku-4-5 | no_thinking | 0.863 | 0.442 | 0.027 |
| anthropic | claude-haiku-4-5 | with_thinking | 0.932 | 0.487 | 0.644 |
| anthropic | claude-opus-4-5 | no_thinking | 0.895 | 0.458 | 0.112 |
| anthropic | claude-opus-4-5 | with_thinking | 0.926 | 0.474 | 0.330 |
| anthropic | claude-sonnet-4-5 | no_thinking | 0.926 | 0.468 | 0.238 |
| anthropic | claude-sonnet-4-5 | with_thinking | 0.916 | 0.484 | 0.573 |
| gemini | gemini-3-pro-preview | with_thinking | 0.879 | 0.561 | 0.021 |
| openai | gpt-4.1 | no_thinking | 0.932 | 0.466 | 0.200 |
| openai | gpt-4o | no_thinking | 0.884 | 0.442 | 0.027 |
| openai | gpt-5.1 | no_thinking | 0.853 | 0.426 | 0.005 |
| openai | gpt-5.1 | with_thinking | 0.853 | 0.426 | 0.005 |
| together | DeepSeek-R1 | with_thinking | 0.916 | 0.511 | 0.720 |
| together | DeepSeek-V3 | no_thinking | 0.874 | 0.563 | 0.016 |
| together | Kimi-K2-Instruct-0905 | no_thinking | 0.905 | 0.458 | 0.112 |
| together | Qwen3-235B-A22B-Instruct-2507 | no_thinking | 0.858 | 0.434 | 0.012 |
test3

cat(get_prompt_template("test3"))
#> You are an expert writing assessor.
#>
#> Your task: Determine which of two writing samples demonstrates superior {TRAIT_NAME}.
#>
#> {TRAIT_NAME} is defined as:
#> {TRAIT_DESCRIPTION}
#>
#> Below are two samples. They appear in arbitrary order—neither position indicates quality.
#>
#> ═══════════════════════════════════════
#> FIRST SAMPLE:
#> {SAMPLE_1}
#>
#> ═══════════════════════════════════════
#> SECOND SAMPLE:
#> {SAMPLE_2}
#>
#> ═══════════════════════════════════════
#>
#> ASSESSMENT PROTOCOL:
#>
#> Step 1: Read both samples in their entirety.
#>
#> Step 2: For each sample independently, assess the degree to which it demonstrates {TRAIT_NAME} based solely on the definition provided.
#>
#> Step 3: Compare your assessments. Determine which sample shows stronger {TRAIT_NAME}.
#>
#> Step 4: Select the sample with better {TRAIT_NAME}. If extremely close, choose the one with any detectable advantage. No ties are allowed.
#>
#> Step 5: Verify your selection reflects the CONTENT quality, not the presentation order.
#>
#> RESPONSE FORMAT:
#>
#> Respond with exactly one line using this format:
#>
#> <BETTER_SAMPLE>SAMPLE_1</BETTER_SAMPLE>
#>
#> if the first sample is better, OR
#>
#> <BETTER_SAMPLE>SAMPLE_2</BETTER_SAMPLE>
#>
#> if the second sample is better.
#>
#> Output only the XML tag with your choice. No explanations or additional text.

summary_tbl |>
filter(template_id == "test3") |>
arrange(backend, model, thinking) |>
mutate(
Prop_Consistent = round(prop_consistent, 3),
Prop_SAMPLE_1 = round(prop_pos1, 3),
Binomial_Test_p = formatC(p_sample1_overall, format = "f", digits = 3)
) |>
select(
Backend = backend,
Model = model,
Thinking = thinking,
Prop_Consistent,
Prop_SAMPLE_1,
Binomial_Test_p
) |>
kable(
align = c("l", "l", "l", "r", "r", "r")
)

| Backend | Model | Thinking | Prop_Consistent | Prop_SAMPLE_1 | Binomial_Test_p |
|---|---|---|---|---|---|
| anthropic | claude-haiku-4-5 | no_thinking | 0.921 | 0.461 | 0.137 |
| anthropic | claude-haiku-4-5 | with_thinking | 0.916 | 0.463 | 0.166 |
| anthropic | claude-opus-4-5 | no_thinking | 0.905 | 0.463 | 0.166 |
| anthropic | claude-opus-4-5 | with_thinking | 0.916 | 0.463 | 0.166 |
| anthropic | claude-sonnet-4-5 | no_thinking | 0.884 | 0.453 | 0.072 |
| anthropic | claude-sonnet-4-5 | with_thinking | 0.937 | 0.489 | 0.720 |
| gemini | gemini-3-pro-preview | with_thinking | 0.911 | 0.545 | 0.090 |
| openai | gpt-4.1 | no_thinking | 0.916 | 0.458 | 0.112 |
| openai | gpt-4o | no_thinking | 0.832 | 0.416 | 0.001 |
| openai | gpt-5.1 | no_thinking | 0.879 | 0.445 | 0.035 |
| openai | gpt-5.1 | with_thinking | 0.863 | 0.432 | 0.009 |
| together | DeepSeek-R1 | with_thinking | 0.953 | 0.487 | 0.644 |
| together | DeepSeek-V3 | no_thinking | 0.884 | 0.453 | 0.072 |
| together | Kimi-K2-Instruct-0905 | no_thinking | 0.879 | 0.455 | 0.090 |
| together | Qwen3-235B-A22B-Instruct-2507 | no_thinking | 0.805 | 0.408 | 0.000 |
test4

cat(get_prompt_template("test4"))
#> You are an expert writing assessor.
#>
#> Evaluate which sample better demonstrates {TRAIT_NAME}.
#>
#> {TRAIT_NAME}: {TRAIT_DESCRIPTION}
#>
#> ---
#> SAMPLE 1:
#> {SAMPLE_1}
#>
#> ---
#> SAMPLE 2:
#> {SAMPLE_2}
#>
#> ---
#>
#> TASK:
#> - Assess both samples on {TRAIT_NAME} only
#> - Choose the sample with stronger {TRAIT_NAME}
#> - If nearly equal, select the marginally better one
#>
#> The samples above appear in random order. Base your judgment only on which content better demonstrates {TRAIT_NAME}, not on position.
#>
#> Respond with only one line:
#>
#> <BETTER_SAMPLE>SAMPLE_1</BETTER_SAMPLE> if Sample 1 is better
#>
#> <BETTER_SAMPLE>SAMPLE_2</BETTER_SAMPLE> if Sample 2 is better

summary_tbl |>
filter(template_id == "test4") |>
arrange(backend, model, thinking) |>
mutate(
Prop_Consistent = round(prop_consistent, 3),
Prop_SAMPLE_1 = round(prop_pos1, 3),
Binomial_Test_p = formatC(p_sample1_overall, format = "f", digits = 3)
) |>
select(
Backend = backend,
Model = model,
Thinking = thinking,
Prop_Consistent,
Prop_SAMPLE_1,
Binomial_Test_p
) |>
kable(
align = c("l", "l", "l", "r", "r", "r")
)

| Backend | Model | Thinking | Prop_Consistent | Prop_SAMPLE_1 | Binomial_Test_p |
|---|---|---|---|---|---|
| anthropic | claude-haiku-4-5 | no_thinking | 0.937 | 0.468 | 0.238 |
| anthropic | claude-haiku-4-5 | with_thinking | 0.937 | 0.474 | 0.328 |
| anthropic | claude-opus-4-5 | no_thinking | 0.900 | 0.461 | 0.137 |
| anthropic | claude-opus-4-5 | with_thinking | 0.895 | 0.458 | 0.112 |
| anthropic | claude-sonnet-4-5 | no_thinking | 0.911 | 0.461 | 0.137 |
| anthropic | claude-sonnet-4-5 | with_thinking | 0.900 | 0.482 | 0.505 |
| gemini | gemini-3-pro-preview | with_thinking | 0.916 | 0.542 | 0.112 |
| openai | gpt-4.1 | no_thinking | 0.884 | 0.442 | 0.027 |
| openai | gpt-4o | no_thinking | 0.884 | 0.442 | 0.027 |
| openai | gpt-5.1 | no_thinking | 0.858 | 0.429 | 0.006 |
| openai | gpt-5.1 | with_thinking | 0.832 | 0.416 | 0.001 |
| together | DeepSeek-R1 | with_thinking | 0.905 | 0.474 | 0.330 |
| together | DeepSeek-V3 | no_thinking | 0.932 | 0.503 | 0.959 |
| together | Kimi-K2-Instruct-0905 | no_thinking | 0.942 | 0.503 | 0.959 |
| together | Qwen3-235B-A22B-Instruct-2507 | no_thinking | 0.768 | 0.384 | 0.000 |
test5

cat(get_prompt_template("test5"))
#> You are a critique-focused evaluator. Instead of looking for general quality, you will look for deviations from the ideal.
#>
#> Target Trait: {TRAIT_NAME}
#> Ideal Standard: {TRAIT_DESCRIPTION}
#>
#> SAMPLES:
#>
#> >>> TEXT_BLOCK_1 (Refers to SAMPLE_1)
#> {SAMPLE_1}
#>
#> >>> TEXT_BLOCK_2 (Refers to SAMPLE_2)
#> {SAMPLE_2}
#>
#> EVALUATION METHOD (Gap Analysis):
#>
#> 1. Scrutinize TEXT_BLOCK_1. Where does it fail, hesitate, or deviate from the Ideal Standard?
#> 2. Scrutinize TEXT_BLOCK_2. Where does it fail, hesitate, or deviate from the Ideal Standard?
#> 3. Compare the 'Distance from Ideal'. Which sample is closer to the definition provided?
#> 4. Select the sample with the FEWEST or LEAST SEVERE deficits regarding {TRAIT_NAME}.
#>
#> IMPORTANT:
#> - Ignore the order of presentation.
#> - Focus purely on which text adheres more tightly to the definition.
#> - If both are excellent, select the one with the higher 'ceiling' (stronger peak performance).
#>
#> FINAL SELECTION:
#> <BETTER_SAMPLE>SAMPLE_1</BETTER_SAMPLE>
#> or
#> <BETTER_SAMPLE>SAMPLE_2</BETTER_SAMPLE>

summary_tbl |>
filter(template_id == "test5") |>
arrange(backend, model, thinking) |>
mutate(
Prop_Consistent = round(prop_consistent, 3),
Prop_SAMPLE_1 = round(prop_pos1, 3),
Binomial_Test_p = formatC(p_sample1_overall, format = "f", digits = 3)
) |>
select(
Backend = backend,
Model = model,
Thinking = thinking,
Prop_Consistent,
Prop_SAMPLE_1,
Binomial_Test_p
) |>
kable(
align = c("l", "l", "l", "r", "r", "r")
)

| Backend | Model | Thinking | Prop_Consistent | Prop_SAMPLE_1 | Binomial_Test_p |
|---|---|---|---|---|---|
| anthropic | claude-haiku-4-5 | no_thinking | 0.905 | 0.463 | 0.166 |
| anthropic | claude-haiku-4-5 | with_thinking | 0.926 | 0.489 | 0.719 |
| anthropic | claude-opus-4-5 | no_thinking | 0.874 | 0.447 | 0.045 |
| anthropic | claude-opus-4-5 | with_thinking | 0.926 | 0.489 | 0.720 |
| anthropic | claude-sonnet-4-5 | no_thinking | 0.900 | 0.482 | 0.505 |
| anthropic | claude-sonnet-4-5 | with_thinking | 0.900 | 0.476 | 0.383 |
| gemini | gemini-3-pro-preview | with_thinking | 0.932 | 0.508 | 0.798 |
| openai | gpt-4.1 | no_thinking | 0.911 | 0.476 | 0.383 |
| openai | gpt-4o | no_thinking | 0.863 | 0.463 | 0.166 |
| openai | gpt-5.1 | no_thinking | 0.877 | 0.451 | 0.086 |
| openai | gpt-5.1 | with_thinking | 0.789 | 0.400 | 0.000 |
| together | DeepSeek-R1 | with_thinking | 0.847 | 0.497 | 0.959 |
| together | DeepSeek-V3 | no_thinking | 0.811 | 0.484 | 0.573 |
| together | Kimi-K2-Instruct-0905 | no_thinking | 0.795 | 0.482 | 0.505 |
| together | Qwen3-235B-A22B-Instruct-2507 | no_thinking | 0.800 | 0.400 | 0.000 |
It is often useful to examine positional-bias metrics within each backend, to see whether particular templates, models, or thinking configurations behave differently for a given provider.
The tables below show, for each provider, the key statistics: Prop_Consistent, Prop_SAMPLE_1, and Binomial_Test_p.
Each row corresponds to a (template, model, thinking) configuration used in testing.
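Before the per-backend tables, a compact cross-backend view can also be helpful. A sketch aggregating the same statistics by template:

# Average the key statistics across backends, models, and thinking modes
summary_tbl |>
  group_by(template_id) |>
  summarise(
    mean_consistent = mean(prop_consistent),
    mean_pos1       = mean(prop_pos1),
    n_significant   = sum(p_sample1_overall < 0.05)
  ) |>
  arrange(desc(mean_consistent))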
summary_tbl |>
filter(backend == "anthropic") |>
arrange(template_id, model, thinking) |>
mutate(
Prop_Consistent = round(prop_consistent, 3),
Prop_SAMPLE_1 = round(prop_pos1, 3),
Binomial_Test_p = formatC(p_sample1_overall, format = "f", digits = 3)
) |>
select(
Template = template_id,
Model = model,
Thinking = thinking,
Prop_Consistent,
Prop_SAMPLE_1,
Binomial_Test_p
) |>
kable(
caption = "Anthropic: Positional-bias summary by template, model, and thinking configuration.",
align = c("l", "l", "l", "r", "r", "r")
)

| Template | Model | Thinking | Prop_Consistent | Prop_SAMPLE_1 | Binomial_Test_p |
|---|---|---|---|---|---|
| test1 | claude-haiku-4-5 | no_thinking | 0.884 | 0.516 | 0.573 |
| test1 | claude-haiku-4-5 | with_thinking | 0.905 | 0.484 | 0.573 |
| test1 | claude-opus-4-5 | no_thinking | 0.884 | 0.442 | 0.027 |
| test1 | claude-opus-4-5 | with_thinking | 0.884 | 0.447 | 0.045 |
| test1 | claude-sonnet-4-5 | no_thinking | 0.895 | 0.505 | 0.878 |
| test1 | claude-sonnet-4-5 | with_thinking | 0.932 | 0.497 | 0.959 |
| test2 | claude-haiku-4-5 | no_thinking | 0.863 | 0.442 | 0.027 |
| test2 | claude-haiku-4-5 | with_thinking | 0.932 | 0.487 | 0.644 |
| test2 | claude-opus-4-5 | no_thinking | 0.895 | 0.458 | 0.112 |
| test2 | claude-opus-4-5 | with_thinking | 0.926 | 0.474 | 0.330 |
| test2 | claude-sonnet-4-5 | no_thinking | 0.926 | 0.468 | 0.238 |
| test2 | claude-sonnet-4-5 | with_thinking | 0.916 | 0.484 | 0.573 |
| test3 | claude-haiku-4-5 | no_thinking | 0.921 | 0.461 | 0.137 |
| test3 | claude-haiku-4-5 | with_thinking | 0.916 | 0.463 | 0.166 |
| test3 | claude-opus-4-5 | no_thinking | 0.905 | 0.463 | 0.166 |
| test3 | claude-opus-4-5 | with_thinking | 0.916 | 0.463 | 0.166 |
| test3 | claude-sonnet-4-5 | no_thinking | 0.884 | 0.453 | 0.072 |
| test3 | claude-sonnet-4-5 | with_thinking | 0.937 | 0.489 | 0.720 |
| test4 | claude-haiku-4-5 | no_thinking | 0.937 | 0.468 | 0.238 |
| test4 | claude-haiku-4-5 | with_thinking | 0.937 | 0.474 | 0.328 |
| test4 | claude-opus-4-5 | no_thinking | 0.900 | 0.461 | 0.137 |
| test4 | claude-opus-4-5 | with_thinking | 0.895 | 0.458 | 0.112 |
| test4 | claude-sonnet-4-5 | no_thinking | 0.911 | 0.461 | 0.137 |
| test4 | claude-sonnet-4-5 | with_thinking | 0.900 | 0.482 | 0.505 |
| test5 | claude-haiku-4-5 | no_thinking | 0.905 | 0.463 | 0.166 |
| test5 | claude-haiku-4-5 | with_thinking | 0.926 | 0.489 | 0.719 |
| test5 | claude-opus-4-5 | no_thinking | 0.874 | 0.447 | 0.045 |
| test5 | claude-opus-4-5 | with_thinking | 0.926 | 0.489 | 0.720 |
| test5 | claude-sonnet-4-5 | no_thinking | 0.900 | 0.482 | 0.505 |
| test5 | claude-sonnet-4-5 | with_thinking | 0.900 | 0.476 | 0.383 |
summary_tbl |>
filter(backend == "gemini") |>
arrange(template_id, model, thinking) |>
mutate(
Prop_Consistent = round(prop_consistent, 3),
Prop_SAMPLE_1 = round(prop_pos1, 3),
Binomial_Test_p = formatC(p_sample1_overall, format = "f", digits = 3)
) |>
select(
Template = template_id,
Model = model,
Thinking = thinking,
Prop_Consistent,
Prop_SAMPLE_1,
Binomial_Test_p
) |>
kable(
caption = "Gemini: Positional-bias summary by template, model, and thinking configuration.",
align = c("l", "l", "l", "r", "r", "r")
)

| Template | Model | Thinking | Prop_Consistent | Prop_SAMPLE_1 | Binomial_Test_p |
|---|---|---|---|---|---|
| test1 | gemini-3-pro-preview | with_thinking | 0.926 | 0.521 | 0.442 |
| test2 | gemini-3-pro-preview | with_thinking | 0.879 | 0.561 | 0.021 |
| test3 | gemini-3-pro-preview | with_thinking | 0.911 | 0.545 | 0.090 |
| test4 | gemini-3-pro-preview | with_thinking | 0.916 | 0.542 | 0.112 |
| test5 | gemini-3-pro-preview | with_thinking | 0.932 | 0.508 | 0.798 |
summary_tbl |>
filter(backend == "openai") |>
arrange(template_id, model, thinking) |>
mutate(
Prop_Consistent = round(prop_consistent, 3),
Prop_SAMPLE_1 = round(prop_pos1, 3),
Binomial_Test_p = formatC(p_sample1_overall, format = "f", digits = 3)
) |>
select(
Template = template_id,
Model = model,
Thinking = thinking,
Prop_Consistent,
Prop_SAMPLE_1,
Binomial_Test_p
) |>
kable(
caption = "OpenAI: Positional-bias summary by template, model, and thinking configuration.",
align = c("l", "l", "l", "r", "r", "r")
)

| Template | Model | Thinking | Prop_Consistent | Prop_SAMPLE_1 | Binomial_Test_p |
|---|---|---|---|---|---|
| test1 | gpt-4.1 | no_thinking | 0.937 | 0.479 | 0.442 |
| test1 | gpt-4o | no_thinking | 0.837 | 0.418 | 0.002 |
| test1 | gpt-5.1 | no_thinking | 0.926 | 0.474 | 0.330 |
| test1 | gpt-5.1 | with_thinking | 0.858 | 0.429 | 0.006 |
| test2 | gpt-4.1 | no_thinking | 0.932 | 0.466 | 0.200 |
| test2 | gpt-4o | no_thinking | 0.884 | 0.442 | 0.027 |
| test2 | gpt-5.1 | no_thinking | 0.853 | 0.426 | 0.005 |
| test2 | gpt-5.1 | with_thinking | 0.853 | 0.426 | 0.005 |
| test3 | gpt-4.1 | no_thinking | 0.916 | 0.458 | 0.112 |
| test3 | gpt-4o | no_thinking | 0.832 | 0.416 | 0.001 |
| test3 | gpt-5.1 | no_thinking | 0.879 | 0.445 | 0.035 |
| test3 | gpt-5.1 | with_thinking | 0.863 | 0.432 | 0.009 |
| test4 | gpt-4.1 | no_thinking | 0.884 | 0.442 | 0.027 |
| test4 | gpt-4o | no_thinking | 0.884 | 0.442 | 0.027 |
| test4 | gpt-5.1 | no_thinking | 0.858 | 0.429 | 0.006 |
| test4 | gpt-5.1 | with_thinking | 0.832 | 0.416 | 0.001 |
| test5 | gpt-4.1 | no_thinking | 0.911 | 0.476 | 0.383 |
| test5 | gpt-4o | no_thinking | 0.863 | 0.463 | 0.166 |
| test5 | gpt-5.1 | no_thinking | 0.877 | 0.451 | 0.086 |
| test5 | gpt-5.1 | with_thinking | 0.789 | 0.400 | 0.000 |
summary_tbl |>
filter(backend == "together") |>
arrange(template_id, model, thinking) |>
mutate(
Prop_Consistent = round(prop_consistent, 3),
Prop_SAMPLE_1 = round(prop_pos1, 3),
Binomial_Test_p = formatC(p_sample1_overall, format = "f", digits = 3)
) |>
select(
Template = template_id,
Model = model,
Thinking = thinking,
Prop_Consistent,
Prop_SAMPLE_1,
Binomial_Test_p
) |>
kable(
caption = "TogetherAI: Positional-bias summary by template, model, and thinking configuration.",
align = c("l", "l", "l", "r", "r", "r")
)

| Template | Model | Thinking | Prop_Consistent | Prop_SAMPLE_1 | Binomial_Test_p |
|---|---|---|---|---|---|
| test1 | DeepSeek-R1 | with_thinking | 0.837 | 0.576 | 0.003 |
| test1 | DeepSeek-V3 | no_thinking | 0.921 | 0.487 | 0.644 |
| test1 | Kimi-K2-Instruct-0905 | no_thinking | 0.889 | 0.455 | 0.090 |
| test1 | Qwen3-235B-A22B-Instruct-2507 | no_thinking | 0.821 | 0.416 | 0.001 |
| test2 | DeepSeek-R1 | with_thinking | 0.916 | 0.511 | 0.720 |
| test2 | DeepSeek-V3 | no_thinking | 0.874 | 0.563 | 0.016 |
| test2 | Kimi-K2-Instruct-0905 | no_thinking | 0.905 | 0.458 | 0.112 |
| test2 | Qwen3-235B-A22B-Instruct-2507 | no_thinking | 0.858 | 0.434 | 0.012 |
| test3 | DeepSeek-R1 | with_thinking | 0.953 | 0.487 | 0.644 |
| test3 | DeepSeek-V3 | no_thinking | 0.884 | 0.453 | 0.072 |
| test3 | Kimi-K2-Instruct-0905 | no_thinking | 0.879 | 0.455 | 0.090 |
| test3 | Qwen3-235B-A22B-Instruct-2507 | no_thinking | 0.805 | 0.408 | 0.000 |
| test4 | DeepSeek-R1 | with_thinking | 0.905 | 0.474 | 0.330 |
| test4 | DeepSeek-V3 | no_thinking | 0.932 | 0.503 | 0.959 |
| test4 | Kimi-K2-Instruct-0905 | no_thinking | 0.942 | 0.503 | 0.959 |
| test4 | Qwen3-235B-A22B-Instruct-2507 | no_thinking | 0.768 | 0.384 | 0.000 |
| test5 | DeepSeek-R1 | with_thinking | 0.847 | 0.497 | 0.959 |
| test5 | DeepSeek-V3 | no_thinking | 0.811 | 0.484 | 0.573 |
| test5 | Kimi-K2-Instruct-0905 | no_thinking | 0.795 | 0.482 | 0.505 |
| test5 | Qwen3-235B-A22B-Instruct-2507 | no_thinking | 0.800 | 0.400 | 0.000 |
To evaluate new prompt templates on your own data:

1. Add the templates. Place the template files in inst/templates/ (or wherever your registry expects them) and confirm that get_prompt_template("my_new_template") works.
2. Update the dev script. Extend template_ids in your dev script to include the new IDs (for example, dev-anthropic-gemini-template-ab-test.R and/or dev-openai-template-ab-test.R).

This vignette demonstrates a reproducible workflow for detecting and quantifying positional bias in prompt templates.
Including the template text and summary statistics side by side allows rapid inspection and informed template selection. Templates that show:

- high Prop_Consistent (e.g., ≥ 0.90) across providers and models, and
- Prop_SAMPLE_1 close to 0.5 with a non-significant Binomial_Test_p

are strong candidates for production scoring pipelines in pairwiseLLM.
Mercer, S. (2025). Prompt Template Positional Bias Testing (Version 1.0.0) [R package vignette]. In pairwiseLLM: Pairwise Comparison Tools for Large Language Model-Based Writing Evaluation. https://shmercer.github.io/pairwiseLLM/