Prompt Stability Scoring

For Text Annotation with Large Language Models

LMs in Social Science

LMs are now being used for everything

LMs in Social Science

  1. To annotate data
  2. To function as ABMs (agent-based models)
  3. To generate synthetic data
  4. To run whole experiments

Standard practices for annotating data

Let’s say we want to classify something as:

  • Republican/Democrat
  • Populist/Not populist
  • Positive/Negative

The zero-shot revolution

A prompt is “all you need”
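
For example, the entire “classifier” can be a single instruction. The prompt below is purely illustrative (the wording is an assumption, not taken from the paper):

# An illustrative zero-shot annotation prompt (wording is an assumption)
prompt = (
    "Classify the following statement as 'Populist' or 'Not populist'. "
    "Respond with the label only.\n\n"
    "Statement: {text}"
)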

We analyze the results, and then we tend to…

What’s wrong with this approach?

  1. The prompt might be super fragile
    • So a diagnostic test would be helpful
  2. The author might have “prompt-hacked”
    • So a robustness requirement would be helpful

This is what we’ve already been doing…

Parallels with human coders and inter-rater reliability (IRR):

  • Intra-coder reliability is like asking the same LM to do the same job again.
  • Inter-coder reliability is (sort of) like asking the same LM to do a similar job again.

Why is this reliability important?

Why might things go wrong?

Current approaches

  1. Repeating prompt variants
    • Mainly to fine-tune models or gauge uncertainty
  2. Elaborating new prompt scaffolding
    • Mainly to boost accuracy
  3. Prompt “perturbation” or substitutions
    • To gauge stability

But we lack a dedicated routine for assessing prompt stability…

Our approach

  1. Take data for annotation
  2. Annotate the same data with the same prompt for n iterations
  3. Annotate the same data with i different prompts j times each at k temperatures
  4. Calculate Krippendorff’s Alpha of the annotations (α = 1 − D_o/D_e, i.e. one minus the ratio of observed to expected disagreement)
  5. Bootstrap standard errors for each iteration/temperature

Our approach

# Function: intra_prompt_stability

import pandas as pd  # used below to collect annotations into a DataFrame

def intra_prompt_stability(original_text, prompt_postfix, iterations, bootstrap_samples):
    """
    Calculate intra-prompt stability using repeated annotations and bootstrapping.

    Args:
        original_text (str): The base text for the prompt.
        prompt_postfix (str): Additional text appended to the prompt.
        iterations (int): Number of times to repeat the annotation run.
        bootstrap_samples (int): Number of bootstrap samples for Krippendorff's Alpha.

    Returns:
        dict: KA scores (mean_alpha, ci_lower, ci_upper) for each iteration.
    """
    # Combine original text and prompt postfix to create the full prompt
    prompt = f"{original_text} {prompt_postfix}"

    # Initialize storage for annotations and KA scores
    all_annotations = []
    ka_scores = {}

    # Repeat annotation for each iteration
    for i in range(1, iterations + 1):
        annotations = []

        # Annotate data points
        for d in data:  # Assumes 'data' is defined elsewhere
            annotation = annotate(d, prompt)  # Assumes 'annotate' function exists
            annotations.append({
                "id": d["id"],
                "text": d["text"],
                "annotation": annotation,
                "iteration": i
            })

        # Store all annotations
        all_annotations.extend(annotations)

    # Convert annotations to a DataFrame
    all_annotated = pd.DataFrame(all_annotations)

    # Calculate KA scores over iterations using bootstrapping
    for i in range(1, iterations + 1):
        subset_annotated = all_annotated[all_annotated["iteration"] <= i]

        # Bootstrap Krippendorff's Alpha
        mean_alpha, ci_lower, ci_upper = bootstrap_krippendorff(
            subset_annotated, bootstrap_samples
        )  # Assumes 'bootstrap_krippendorff' function exists

        ka_scores[i] = {
            "mean_alpha": mean_alpha,
            "ci_lower": ci_lower,
            "ci_upper": ci_upper
        }

    return ka_scores
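
The listing above assumes a bootstrap_krippendorff helper. A minimal sketch of such a helper, assuming the krippendorff PyPI package and nominal labels (an illustration, not the implementation used here):

# Sketch of the assumed 'bootstrap_krippendorff' helper (illustrative only)
# Assumes: the 'krippendorff' PyPI package; nominal labels; one annotation per coder/unit pair
import numpy as np
import pandas as pd
import krippendorff

def bootstrap_krippendorff(annotated, bootstrap_samples, seed=0):
    """Bootstrap Krippendorff's Alpha by resampling units (documents) with replacement."""
    rng = np.random.default_rng(seed)

    # Rows = "coders" (iterations or prompt variants), columns = units (document ids)
    coder_col = "prompt_id" if "prompt_id" in annotated.columns else "iteration"
    matrix = annotated.pivot_table(index=coder_col, columns="id",
                                   values="annotation", aggfunc="first")

    # Map (possibly string) labels to numeric codes; missing cells become NaN
    codes, _ = pd.factorize(matrix.to_numpy().ravel())
    coded = codes.reshape(matrix.shape).astype(float)
    coded[coded == -1] = np.nan  # pd.factorize encodes missing values as -1

    n_units = coded.shape[1]
    alphas = []
    for _ in range(bootstrap_samples):
        cols = rng.integers(0, n_units, size=n_units)  # resample columns (units)
        alphas.append(krippendorff.alpha(reliability_data=coded[:, cols],
                                         level_of_measurement="nominal"))

    alphas = np.asarray(alphas)
    return alphas.mean(), np.percentile(alphas, 2.5), np.percentile(alphas, 97.5)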
  

Our approach

# Function: interprompt_stability

import pandas as pd  # used below to collect annotations into DataFrames

def interprompt_stability(original_text, prompt_postfix, nr_variations,
                          temperatures, iterations, bootstrap_samples):
    """
    Calculate inter-prompt stability using paraphrased prompts and bootstrapping.

    Args:
        original_text (str): Base text for the prompt.
        prompt_postfix (str): Modifier text appended to each paraphrased prompt.
        nr_variations (int): Number of prompt variations to generate.
        temperatures (list): List of temperatures for generating paraphrases.
        iterations (int): Number of iterations to perform.
        bootstrap_samples (int): Number of bootstrap samples for Krippendorff's Alpha.

    Returns:
        dict: For each temperature, a list of KA scores (mean_alpha, ci_lower, ci_upper) by iteration.
    """
    # Dictionary for KA scores
    ka_scores = {}

    # List for storing all annotated data
    all_annotated = []

    # For each temperature, generate paraphrases and annotate
    for temp in temperatures:
        # Assumes 'generate_paraphrases' function exists and returns dicts with 'id' and 'text'
        paraphrases = generate_paraphrases(original_text, prompt_postfix, nr_variations, temp)
        annotated = []

        # Annotate data for each paraphrase across iterations
        for i in range(1, iterations + 1):
            for paraphrase in paraphrases:
                for d in data:  # Assumes 'data' is defined elsewhere
                    annotation = annotate(d, paraphrase["text"])  # Assumes 'annotate' function exists

                    # Store annotations with metadata
                    annotated.append({
                        "id": d["id"],
                        "text": d["text"],
                        "annotation": annotation,
                        "prompt_id": paraphrase["id"],
                        "prompt": paraphrase["text"],
                        "original": original_text,
                        "temperature": temp,
                        "iteration": i
                    })

        # Convert this temperature's annotations to a DataFrame
        annotated_data = pd.DataFrame(annotated)
        all_annotated.append(annotated_data)

    # Combine all annotated data
    all_annotated_df = pd.concat(all_annotated, ignore_index=True)

    # Calculate KA scores for each temperature using bootstrapping
    for temp in temperatures:
        temp_data = all_annotated_df[all_annotated_df["temperature"] == temp]

        # Iterate over the number of iterations
        for i in range(1, iterations + 1):
            subset_data = temp_data[temp_data["iteration"] <= i]

            # Bootstrap Krippendorff's Alpha
            mean_alpha, ci_lower, ci_upper = bootstrap_krippendorff(
                subset_data, bootstrap_samples
            )  # Assumes 'bootstrap_krippendorff' function exists

            # Store KA scores
            if temp not in ka_scores:
                ka_scores[temp] = []
            ka_scores[temp].append({
                "iteration": i,
                "mean_alpha": mean_alpha,
                "ci_lower": ci_lower,
                "ci_upper": ci_upper
            })

    return ka_scores
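
Putting the two functions together might look like the call below. Everything here is illustrative (toy data, a made-up prompt, arbitrary settings), and the module-level data and helper functions are assumed to exist as in the listings above:

# Illustrative usage only; data, prompt wording, and settings are assumptions
data = [
    {"id": 1, "text": "We must take our country back from the corrupt elite."},
    {"id": 2, "text": "The committee will review the budget proposal next week."},
]

original_text = "Classify the following statement as 'Populist' or 'Not populist'."
prompt_postfix = "Respond with the label only."

intra_scores = intra_prompt_stability(original_text, prompt_postfix,
                                       iterations=10, bootstrap_samples=500)

inter_scores = interprompt_stability(original_text, prompt_postfix,
                                      nr_variations=5,
                                      temperatures=[0.5, 1.0, 2.0, 3.0],
                                      iterations=10, bootstrap_samples=500)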

Our approach

To generate variants we:

  1. Use PEGASUS and iterate through temperatures (see the sketch below)
  2. Allow manual input
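
A minimal sketch of this paraphrase-generation step, assuming Hugging Face transformers and the tuner007/pegasus_paraphrase checkpoint (both are assumptions about the setup, not a description of the exact pipeline used):

# Sketch of the assumed 'generate_paraphrases' helper (illustrative only)
# Assumes: Hugging Face transformers; the 'tuner007/pegasus_paraphrase' checkpoint;
# sampling-based generation so that temperature has an effect
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

checkpoint = "tuner007/pegasus_paraphrase"
tokenizer = PegasusTokenizer.from_pretrained(checkpoint)
model = PegasusForConditionalGeneration.from_pretrained(checkpoint)

def generate_paraphrases(original_text, prompt_postfix, nr_variations, temperature):
    """Return nr_variations paraphrases of original_text, keeping prompt_postfix fixed."""
    inputs = tokenizer([original_text], truncation=True, padding="longest",
                       return_tensors="pt")
    outputs = model.generate(**inputs,
                             max_length=60,
                             do_sample=True,          # sample so that temperature matters
                             temperature=temperature,
                             num_return_sequences=nr_variations)
    texts = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    return [{"id": i, "text": f"{t} {prompt_postfix}"} for i, t in enumerate(texts)]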

Our approach

To annotate data we:

  1. Use gpt-3.5 from OpenAI (see the sketch below)
  2. Other LMs are available
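
The annotation call itself can then be a thin wrapper around a chat-completion request. A minimal sketch of the assumed annotate helper (the model name and message layout are assumptions):

# Sketch of the assumed 'annotate' helper (illustrative only)
# Assumes: the openai>=1.0 Python client and an OPENAI_API_KEY environment variable
from openai import OpenAI

client = OpenAI()

def annotate(d, prompt):
    """Return a single-label annotation for one data point d = {'id': ..., 'text': ...}."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,  # keep the annotation call itself as deterministic as possible
        messages=[
            {"role": "system", "content": prompt},
            {"role": "user", "content": d["text"]},
        ],
    )
    return response.choices[0].message.content.strip()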

Our approach

We call this:

“Prompt stability scoring”

And the score:

“Prompt stability score (PSS)”

Data


Results

Takeaways

  1. Where the construct is unclear, stability is low
  2. Where the data is unsuitable, stability is low
  3. Where the construct is clearly delimited, stability is high

Results

Takeaways

  1. Stability decreases as temperature increases
  2. PSS sometimes drops rapidly even at low temperatures

Recommendations

  1. Provide prompts, prompt variants, and outcomes in full
  2. Use our package promptstability to do this
    • Version 0.1.0: https://pypi.org/project/promptstability/
  3. A 0.8 threshold of acceptability is advisable for intra-PSS
  4. Researchers should investigate why PSS falls below 0.8 at a given temperature (and especially at low temperatures)

Further steps

  1. Try out with larger n to determine the necessary test size
  2. Test with alternative (SOTA) language models
  3. Adapt to paraphrase models other than PEGASUS
  4. Audience suggestions?
