What do European Futures look like according to fine-tuned GPT-3, prompt-tuned GPT-3.5, and few-shot prompted GPT-4?
This study investigates the potential and challenges of using three
specifically tuned and trained generations of Generative Pre-trained
Transformer (GPT) models, GPT-3, GPT-3.5, and GPT-4, to generate
scenarios for Futures Research. The methodology combines a literature
review, a coding-based experiment, and an expert survey. After fine- and prompt-tuning GPT-3 davinci, prompt-tuning GPT-3.5 text-davinci-003, and few-shot prompting GPT-4 with human-made scenarios concerning Europe in 2030, 2040, and 2050, each model's output was analyzed quantitatively and qualitatively to bridge the gap between objective and subjective evaluations. The expert survey invited 42 practitioners from Futures Research, NLP, and AI to assess whether differences in content and plausibility could be identified, while classifying which scenarios were human-made and which were generated by GPT models. The findings suggest that, compared to human-made scenarios, the scenarios of all GPT models center heavily on technology-driven topics and display more neutral sentiment. Additionally, GPT-3- and GPT-4-generated scenarios were difficult to distinguish from human-made scenarios, while GPT-3.5 performed more poorly. However, comparing human-made to machine-generated scenarios remains complex: each model learned in its own way from human-made content, while Futures Researchers may themselves have applied machine-assisted methods in their scenario generation process, making every approach a hybrid of human-machine collaboration by nature. The study therefore concludes by discussing the implications of using GPT-3, GPT-3.5, and GPT-4 for Futures Research, addressing each model's weaknesses and potential before providing directions for further research. The entire process, including code and corpus, is publicly accessible.