Prompt Stability Scoring

For Text Annotation with Large Language Models

LMs in Social Science

LMs are now being used for everything

LMs in Social Science

  1. To annotate data
  2. To function as ABMs (agent-based models)
  3. To generate synthetic data
  4. To run whole experiments

Standard practices for annotating data

Let’s say we want to classify something as:

  • Republican/Democrat
  • Populist/Not populist
  • Positive/Negative

The zero-shot revolution

A prompt is “all you need”
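
For example, the entire “classifier” can be a single instruction. The prompt below is purely illustrative (the wording is an assumption, not taken from the paper):

# An illustrative zero-shot annotation prompt (wording is an assumption)
prompt = (
    "Classify the following statement as 'Populist' or 'Not populist'. "
    "Respond with the label only.\n\n"
    "Statement: {text}"
)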

We analyze the results, and then we tend to…

What’s wrong with this approach?

  1. The prompt might be super fragile
    • So a diagnostic test would be helpful
  2. The author might have “prompt-hacked”
    • So a robustness requirement would be helpful

This is what we’ve already been doing…

Parallels with human coders and inter-rater reliability (IRR):

  • Intra-coder reliability is like asking the same LM to do the same job again.
  • Inter-coder reliability is (sort of) like asking the same LM to do a similar job again.

Why is this reliability important?

Why might things go wrong?

Current approaches

  1. Repeating prompt variants
    • Mainly to fine-tune models or gauge uncertainty
  2. Elaborating new prompt scaffolding
    • Mainly to boost accuracy
  3. Prompt “perturbation” or substitutions
    • To gauge stability

But we lack a dedicated routine for assessing prompt stability…

Our approach

  1. Take data for annotation
  2. Annotate the same data with the same prompt for n iterations
  3. Annotate the same data with i different prompts j times each at k temperatures
  4. Calculate Krippendorff’s Alpha of the annotations (α = 1 − D_o/D_e, i.e. one minus the ratio of observed to expected disagreement)
  5. Bootstrap standard errors for each iteration/temperature

Our approach

# Function: intra_prompt_stability

import pandas as pd  # used below to collect annotations into a DataFrame

def intra_prompt_stability(original_text, prompt_postfix, iterations, bootstrap_samples):
    """
    Calculate intra-prompt stability using repeated annotations and bootstrapping.

    Args:
        original_text (str): The base text for the prompt.
        prompt_postfix (str): Additional text appended to the prompt.
        iterations (int): Number of times to repeat the annotation run.
        bootstrap_samples (int): Number of bootstrap samples for Krippendorff's Alpha.

    Returns:
        dict: KA scores (mean_alpha, ci_lower, ci_upper) for each iteration.
    """
    # Combine original text and prompt postfix to create the full prompt
    prompt = f"{original_text} {prompt_postfix}"

    # Initialize storage for annotations and KA scores
    all_annotations = []
    ka_scores = {}

    # Repeat annotation for each iteration
    for i in range(1, iterations + 1):
        annotations = []

        # Annotate data points
        for d in data:  # Assumes 'data' is defined elsewhere
            annotation = annotate(d, prompt)  # Assumes 'annotate' function exists
            annotations.append({
                "id": d["id"],
                "text": d["text"],
                "annotation": annotation,
                "iteration": i
            })

        # Store all annotations
        all_annotations.extend(annotations)

    # Convert annotations to a DataFrame
    all_annotated = pd.DataFrame(all_annotations)

    # Calculate KA scores over iterations using bootstrapping
    for i in range(1, iterations + 1):
        subset_annotated = all_annotated[all_annotated["iteration"] <= i]

        # Bootstrap Krippendorff's Alpha
        mean_alpha, ci_lower, ci_upper = bootstrap_krippendorff(
            subset_annotated, bootstrap_samples
        )  # Assumes 'bootstrap_krippendorff' function exists

        ka_scores[i] = {
            "mean_alpha": mean_alpha,
            "ci_lower": ci_lower,
            "ci_upper": ci_upper
        }

    return ka_scores
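
The listing above assumes a bootstrap_krippendorff helper. A minimal sketch of such a helper, assuming the krippendorff PyPI package and nominal labels (an illustration, not the implementation used here):

# Sketch of the assumed 'bootstrap_krippendorff' helper (illustrative only)
# Assumes: the 'krippendorff' PyPI package; nominal labels; one annotation per coder/unit pair
import numpy as np
import pandas as pd
import krippendorff

def bootstrap_krippendorff(annotated, bootstrap_samples, seed=0):
    """Bootstrap Krippendorff's Alpha by resampling units (documents) with replacement."""
    rng = np.random.default_rng(seed)

    # Rows = "coders" (iterations or prompt variants), columns = units (document ids)
    coder_col = "prompt_id" if "prompt_id" in annotated.columns else "iteration"
    matrix = annotated.pivot_table(index=coder_col, columns="id",
                                   values="annotation", aggfunc="first")

    # Map (possibly string) labels to numeric codes; missing cells become NaN
    codes, _ = pd.factorize(matrix.to_numpy().ravel())
    coded = codes.reshape(matrix.shape).astype(float)
    coded[coded == -1] = np.nan  # pd.factorize encodes missing values as -1

    n_units = coded.shape[1]
    alphas = []
    for _ in range(bootstrap_samples):
        cols = rng.integers(0, n_units, size=n_units)  # resample columns (units)
        alphas.append(krippendorff.alpha(reliability_data=coded[:, cols],
                                         level_of_measurement="nominal"))

    alphas = np.asarray(alphas)
    return alphas.mean(), np.percentile(alphas, 2.5), np.percentile(alphas, 97.5)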
  

Our approach

# Function: interprompt_stability

import pandas as pd  # used below to collect annotations into DataFrames

def interprompt_stability(original_text, prompt_postfix, nr_variations,
                          temperatures, iterations, bootstrap_samples):
    """
    Calculate inter-prompt stability using paraphrased prompts and bootstrapping.

    Args:
        original_text (str): Base text for the prompt.
        prompt_postfix (str): Modifier text appended to each paraphrased prompt.
        nr_variations (int): Number of prompt variations to generate.
        temperatures (list): List of temperatures for generating paraphrases.
        iterations (int): Number of iterations to perform.
        bootstrap_samples (int): Number of bootstrap samples for Krippendorff's Alpha.

    Returns:
        dict: For each temperature, a list of KA scores (mean_alpha, ci_lower, ci_upper) by iteration.
    """
    # Dictionary for KA scores
    ka_scores = {}

    # List for storing all annotated data
    all_annotated = []

    # For each temperature, generate paraphrases and annotate
    for temp in temperatures:
        # Assumes 'generate_paraphrases' function exists and returns dicts with 'id' and 'text'
        paraphrases = generate_paraphrases(original_text, prompt_postfix, nr_variations, temp)
        annotated = []

        # Annotate data for each paraphrase across iterations
        for i in range(1, iterations + 1):
            for paraphrase in paraphrases:
                for d in data:  # Assumes 'data' is defined elsewhere
                    annotation = annotate(d, paraphrase["text"])  # Assumes 'annotate' function exists

                    # Store annotations with metadata
                    annotated.append({
                        "id": d["id"],
                        "text": d["text"],
                        "annotation": annotation,
                        "prompt_id": paraphrase["id"],
                        "prompt": paraphrase["text"],
                        "original": original_text,
                        "temperature": temp,
                        "iteration": i
                    })

        # Convert this temperature's annotations to a DataFrame
        annotated_data = pd.DataFrame(annotated)
        all_annotated.append(annotated_data)

    # Combine all annotated data
    all_annotated_df = pd.concat(all_annotated, ignore_index=True)

    # Calculate KA scores for each temperature using bootstrapping
    for temp in temperatures:
        temp_data = all_annotated_df[all_annotated_df["temperature"] == temp]

        # Iterate over the number of iterations
        for i in range(1, iterations + 1):
            subset_data = temp_data[temp_data["iteration"] <= i]

            # Bootstrap Krippendorff's Alpha
            mean_alpha, ci_lower, ci_upper = bootstrap_krippendorff(
                subset_data, bootstrap_samples
            )  # Assumes 'bootstrap_krippendorff' function exists

            # Store KA scores
            if temp not in ka_scores:
                ka_scores[temp] = []
            ka_scores[temp].append({
                "iteration": i,
                "mean_alpha": mean_alpha,
                "ci_lower": ci_lower,
                "ci_upper": ci_upper
            })

    return ka_scores
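
Putting the two functions together might look like the call below. Everything here is illustrative (toy data, a made-up prompt, arbitrary settings), and the module-level data and helper functions are assumed to exist as in the listings above:

# Illustrative usage only; data, prompt wording, and settings are assumptions
data = [
    {"id": 1, "text": "We must take our country back from the corrupt elite."},
    {"id": 2, "text": "The committee will review the budget proposal next week."},
]

original_text = "Classify the following statement as 'Populist' or 'Not populist'."
prompt_postfix = "Respond with the label only."

intra_scores = intra_prompt_stability(original_text, prompt_postfix,
                                       iterations=10, bootstrap_samples=500)

inter_scores = interprompt_stability(original_text, prompt_postfix,
                                      nr_variations=5,
                                      temperatures=[0.5, 1.0, 2.0, 3.0],
                                      iterations=10, bootstrap_samples=500)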

Our approach

To generate variants we:

  1. Use PEGASUS and iterate through temperatures (see the sketch below)
  2. Allow manual input
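
A minimal sketch of this paraphrase-generation step, assuming Hugging Face transformers and the tuner007/pegasus_paraphrase checkpoint (both are assumptions about the setup, not a description of the exact pipeline used):

# Sketch of the assumed 'generate_paraphrases' helper (illustrative only)
# Assumes: Hugging Face transformers; the 'tuner007/pegasus_paraphrase' checkpoint;
# sampling-based generation so that temperature has an effect
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

checkpoint = "tuner007/pegasus_paraphrase"
tokenizer = PegasusTokenizer.from_pretrained(checkpoint)
model = PegasusForConditionalGeneration.from_pretrained(checkpoint)

def generate_paraphrases(original_text, prompt_postfix, nr_variations, temperature):
    """Return nr_variations paraphrases of original_text, keeping prompt_postfix fixed."""
    inputs = tokenizer([original_text], truncation=True, padding="longest",
                       return_tensors="pt")
    outputs = model.generate(**inputs,
                             max_length=60,
                             do_sample=True,          # sample so that temperature matters
                             temperature=temperature,
                             num_return_sequences=nr_variations)
    texts = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    return [{"id": i, "text": f"{t} {prompt_postfix}"} for i, t in enumerate(texts)]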

Our approach

To annotate data we:

  1. Use gpt-3.5 from OpenAI (see the sketch below)
  2. Other LMs are available
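
The annotation call itself can then be a thin wrapper around a chat-completion request. A minimal sketch of the assumed annotate helper (the model name and message layout are assumptions):

# Sketch of the assumed 'annotate' helper (illustrative only)
# Assumes: the openai>=1.0 Python client and an OPENAI_API_KEY environment variable
from openai import OpenAI

client = OpenAI()

def annotate(d, prompt):
    """Return a single-label annotation for one data point d = {'id': ..., 'text': ...}."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,  # keep the annotation call itself as deterministic as possible
        messages=[
            {"role": "system", "content": prompt},
            {"role": "user", "content": d["text"]},
        ],
    )
    return response.choices[0].message.content.strip()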

Our approach

We call this:

“Prompt stability scoring”

And the score:

“Prompt stability score (PSS)”

Data


Results

Takeaways

  1. Where the construct is unclear, stability is low
  2. Where the data is unsuitable, stability is low
  3. Where the construct is clearly delimited, stability is high

Results

Takeaways

  1. Stability decreases as temperature increases
  2. PSS sometimes drops rapidly even at low temperatures

Recommendations

  1. Provide prompts, prompt variants, and outcomes in full
  2. Use our package promptstability to do this
    • Version 0.1.0: https://pypi.org/project/promptstability/
  3. A 0.8 threshold of acceptability is advisable for intra-PSS
  4. Researchers should investigate why PSS falls below 0.8 at a given temperature (and especially at low temperatures)

Further steps

  1. Try out with larger n to determine the necessary test size
  2. Test with alternative (SOTA) language models
  3. Adapt to paraphrase models other than PEGASUS
  4. Audience suggestions?
