Example rows where human screening decisions differ from GPT decisions. Each row is a (study × prompt) screening outcome.

Usage

disagreements

Format

A tibble/data.frame with one row per screened (studyid, promptid) and 17 columns:

Column                  Type       Description
author                  character  Study authors
human_code              numeric    Human screening decision (1 include, 0 exclude)
studyid                 integer    Unique study identifier
title                   character  Study title
abstract                character  Study abstract
promptid                integer    Prompt identifier
prompt                  character  Original short screening prompt text
model                   character  Model used for the run
question                character  Full constructed question sent to model
top_p                   numeric    Nucleus sampling parameter
incl_p                  numeric    Estimated probability of inclusion (if repetitions)
final_decision_gpt      character  GPT final label: Include / Exclude / Check
final_decision_gpt_num  numeric    Numeric GPT decision (1 include/check, 0 exclude)
longest_answer          character  Longest rationale text returned
reps                    integer    Number of repetitions attempted
n_mis_answers           integer    Count of missing answers across reps
submodel                character  Specific model variant (if applicable)
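
Examples

The following is a minimal sketch of how the decision columns might be compared. It does not load the shipped data: `toy` is a hypothetical stand-in with invented values that reuses a few of the column names listed above, and `all_differ`, `direction`, `direction_counts`, and `per_study` are illustrative names only. It assumes, following the description, that each row is a case where human_code differs from final_decision_gpt_num.

```r
# Hypothetical stand-in for a few columns of `disagreements`.
# All values are invented for illustration only.
toy <- data.frame(
  studyid = c(1L, 1L, 2L, 3L),
  promptid = c(1L, 2L, 1L, 1L),
  human_code = c(1, 1, 0, 0),
  final_decision_gpt = c("Exclude", "Exclude", "Include", "Check"),
  final_decision_gpt_num = c(0, 0, 1, 1),
  stringsAsFactors = FALSE
)

# Check the assumed property: the human and GPT decisions differ in every row.
all_differ <- all(toy$human_code != toy$final_decision_gpt_num)

# Label the direction of each disagreement.
toy$direction <- ifelse(
  toy$human_code == 1,
  "human include, GPT exclude",
  "human exclude, GPT include/check"
)

# Count disagreements by direction and by study.
direction_counts <- table(toy$direction)
per_study <- table(toy$studyid)

print(direction_counts)
print(per_study)
```

The same steps should apply to the real object once it is available in the session, by replacing `toy` with `disagreements`. Because "Check" is coded as 1 in final_decision_gpt_num, cross-tabulating human_code against final_decision_gpt instead keeps the Include and Check labels separate.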