Reduce Disparity Between LLMs and Humans: Optimal LLM Sample Calibration

Research Questions

  1. How can LLM outputs be calibrated across demographic groups to better approximate human responses?
  2. Can this approach be made model-agnostic (applicable to any LLM)?
  3. To what extent can calibration be transferred across domains or geographic regions?

Results

  • Human Mimicry Calibration (HMC) significantly improved alignment between LLM responses and human data.
  • HMC outperformed traditional population-density–based weighting methods.
  • Spatial transfer (Texas → New York, California, Florida) was successful, while topic transfer was limited, especially for political and other sensitive domains.
  • The highest weights were assigned to the young (18–34) and low-income persona groups, whose outputs were the most similar to human responses.
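
The reweighting idea behind these results can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes each persona group yields a distribution over the same answer options, that a human answer distribution is available, and that weights are fitted by minimizing squared error with an exponentiated-gradient update. The function names, the optimizer, and every number below are hypothetical.

```python
import math

def mix(weights, persona_dists):
    """Weighted mixture of the per-group answer distributions."""
    n_options = len(persona_dists[0])
    return [sum(w * dist[k] for w, dist in zip(weights, persona_dists))
            for k in range(n_options)]

def fit_group_weights(persona_dists, human_dist, steps=5000, lr=0.5):
    """Fit positive weights summing to 1 so that the mixture of persona
    distributions approximates the human distribution (squared error).
    Assumed objective and optimizer, chosen only for illustration."""
    n_groups = len(persona_dists)
    weights = [1.0 / n_groups] * n_groups      # start from the uniform baseline
    for _ in range(steps):
        mixture = mix(weights, persona_dists)
        residual = [m - h for m, h in zip(mixture, human_dist)]
        # Gradient of the squared error with respect to each group weight.
        grads = [2.0 * sum(r * p for r, p in zip(residual, dist))
                 for dist in persona_dists]
        # The multiplicative update keeps every weight positive.
        weights = [w * math.exp(-lr * g) for w, g in zip(weights, grads)]
        total = sum(weights)
        weights = [w / total for w in weights]  # renormalize to sum to 1
    return weights

# Toy numbers, invented for illustration only.
persona_dists = [
    [0.7, 0.2, 0.1],   # hypothetical persona group A
    [0.2, 0.5, 0.3],   # hypothetical persona group B
    [0.1, 0.3, 0.6],   # hypothetical persona group C
]
human_dist = [0.49, 0.30, 0.21]

weights = fit_group_weights(persona_dists, human_dist)
calibrated = mix(weights, persona_dists)
```

Under these assumptions, groups whose outputs sit closer to the human distribution end up with larger weights, which is the sense in which some demographic groups receive "the highest weights". A population-weighted baseline would instead fix the weights to each group's population share and skip the fitting step.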

Findings

  • Calibration Technique:

    • HMC improves LLM-human alignment by reweighting demographic persona outputs toward groups that better mimic human data.
  • Performance:

    • HMC achieved 20–30% higher accuracy than the uniform and population-weighted baselines.
  • Transferability:

    • Geographic transfer worked effectively.
    • Cross-domain transfer, particularly in sensitive or political topics, showed reduced accuracy.
  • Evaluation Metrics:

    • Differences between human and LLM responses were quantified using Kendall's Tau (rank agreement) and Wasserstein Distance (distributional gap); both indicated a substantially smaller human-LLM gap after calibration.
  • Key Insight:

    • Younger and lower-income persona groups produced the most human-like LLM outputs.
  • LLM Models: 5

  • Synthetic Data: 4

  • Method: 5

  • Speed: 3

  • Ethics: 3

  • Accuracy: 5

  • Demographics: 5
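
The two evaluation metrics named in the findings can be illustrated with a short, dependency-free sketch. It assumes responses are summarized as distributions over the same ordered answer options with unit spacing between options, and it uses the simple tau-a variant of Kendall's Tau; the summary does not say which variant or spacing the study used. The function names and all numbers are hypothetical.

```python
def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) pairs / all pairs.
    Ranges from -1 (reversed ordering) to 1 (identical ordering)."""
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            product = (x[i] - x[j]) * (y[i] - y[j])
            score += (product > 0) - (product < 0)  # +1, -1, or 0 for a tie
    return score / (n * (n - 1) / 2)

def wasserstein_1d(p, q):
    """1-Wasserstein distance between two distributions over the same
    ordered options, assuming unit spacing between adjacent options."""
    distance, cdf_gap = 0.0, 0.0
    for p_k, q_k in zip(p, q):
        cdf_gap += p_k - q_k       # running difference of the two CDFs
        distance += abs(cdf_gap)
    return distance

# Toy numbers, invented for illustration only.
human = [0.49, 0.30, 0.21]
raw_llm = [0.20, 0.30, 0.50]       # hypothetical uncalibrated LLM output
calibrated = [0.45, 0.32, 0.23]    # hypothetical output after reweighting

tau_before = kendall_tau(human, raw_llm)
tau_after = kendall_tau(human, calibrated)
dist_before = wasserstein_1d(human, raw_llm)
dist_after = wasserstein_1d(human, calibrated)
```

Note the opposite directions of the two metrics: a better match shows up as a higher Kendall's Tau and a lower Wasserstein Distance. In practice, `scipy.stats.kendalltau` and `scipy.stats.wasserstein_distance` provide tested implementations of both.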
