Using Large Language Models to Simulate Multiple Humans and Replicate Human Subject Studies

Research Questions

  1. To what extent can LLMs simulate human behavior across different populations in experimental settings?
  2. Is it possible to reproduce classical behavioral experiments (Ultimatum Game, Garden Path Sentences, Milgram Shock, Wisdom of Crowds) using LLMs?
  3. How realistically can demographic variation (e.g., name, gender) be reflected in LLM outputs?
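The core mechanic behind these questions is prompting an LLM once per simulated participant while varying demographic cues in the prompt text. A minimal sketch of how such condition prompts could be generated for the Ultimatum Game is below; the names, template wording, and `parse_decision` helper are illustrative assumptions, not the study's exact materials.

```python
# Hypothetical sketch: build one Ultimatum Game prompt per simulated
# participant, varying responder/proposer names to probe demographic effects.

TEMPLATE = (
    "{responder} is playing a game with {proposer}. {proposer} has $10 "
    "and offers {responder} ${offer}. If {responder} accepts, the money "
    "is split as proposed; if {responder} rejects, both players get "
    "nothing.\nAnswer: {responder} decides to"
)

def make_prompts(responders, proposers, offers):
    """Generate one prompt per (responder, proposer, offer) condition."""
    return [
        TEMPLATE.format(responder=r, proposer=p, offer=o)
        for r in responders
        for p in proposers
        for o in offers
    ]

def parse_decision(completion: str) -> bool:
    """Map an LLM completion to accept (True) / reject (False)."""
    return "accept" in completion.lower()

# 2 responders x 2 proposers x 2 offer levels = 8 conditions
prompts = make_prompts(["Adam", "Sarah"], ["Emily", "John"], [1, 5])
```

Each prompt would then be sent to the model, and the completion parsed into an accept/reject decision; aggregating decisions by name pair is what surfaces effects like the gender-based pattern reported below.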

Results

  • LLMs successfully replicated several known patterns of human behavior.
  • Larger models produced responses that were more human-like.
  • Gender-based behavioral differences emerged in LLM outputs (e.g., the chivalry effect — men accepting unfair offers more often when the proposer was female).
  • Newer and more aligned LLMs exhibited hyper-accuracy distortion, performing unrealistically well on certain knowledge tasks.

Findings

  • Behavioral Imitation:

    • Large language models reflected established human behavioral patterns in experiments such as the Ultimatum Game and Garden Path Sentences.
  • Demographic Variation:

    • LLMs were able to simulate behavioral differences based on demographic cues such as name and gender.
  • Hyper-Accuracy:

    • Some models produced answers that were too accurate compared to typical human performance, failing to mirror real-world human knowledge distributions.
  • Data Contamination Concerns:

    • Because LLMs may have been exposed to these classical experiments during training, questions arise regarding the originality and validity of the reproduced behaviors.
  • Ethical Risks:

    • Simulating harmful experiments—such as the Milgram Shock Experiment—raises significant ethical concerns.
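For the Wisdom of Crowds replication, the key operation is aggregating many independent simulated estimates into a single crowd answer. The sketch below mocks each LLM "participant" with a noisy numeric guess and takes the median; in the actual study each estimate would come from a separate LLM prompt, and the noise model here is purely an assumption for illustration.

```python
import random
import statistics

def simulated_estimate(truth: float, rng: random.Random) -> float:
    """Stand-in for one LLM 'participant': a noisy guess around the truth."""
    return truth * rng.uniform(0.5, 1.5)

def crowd_estimate(truth: float, n: int, seed: int = 0) -> float:
    """Aggregate n independent simulated estimates via the median."""
    rng = random.Random(seed)
    return statistics.median(simulated_estimate(truth, rng) for _ in range(n))
```

A hyper-accurate model would break this setup in a telling way: if every simulated participant answers near-perfectly, the individual estimates no longer show the human-like error spread that makes crowd aggregation interesting.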
Ratings

  • LLM Models: 5
  • Synthetic Data: 4
  • Method: 4
  • Speed: 3
  • Ethics: 4
  • Accuracy: 3
  • Demographics: 4

More detailed information is available in the article's supplementary material.
