Comparing reasoning models for screening: GPT-4o-mini, GPT-5-mini, and GPT-5.1
Source: vignettes/articles/comparing-reasoning-models.qmd
Introduction
As Vembye et al. (2025) focus exclusively on the screening performance of GPT-4 models, it remains unclear how their proposed setup generalizes to newer GPT-5 models. In this article, we compare the performance of three OpenAI models for screening tasks: gpt-4o-mini, gpt-5-mini, and gpt-5.1. We evaluate their effectiveness in systematic literature review screening tasks using a real-world dataset. The comparison is based on a single repetition (reps = 1) for all models. For the reasoning models (gpt-5-mini and gpt-5.1), we use the default reasoning effort of medium and verbosity low. The aim of this comparison is to assess the trade-offs between the models in terms of accuracy and efficiency for screening tasks, and to investigate the impact of reasoning capabilities on screening performance.
As can be seen below, we find that prompting GPT-5 models differs from prompting GPT-4 models. They appear to require more detailed information within a single prompt. In other words, they seem to be better at multitasking (i.e., handling multiple pieces of different information simultaneously) compared to GPT-4 models, contrasting some of the findings from Vembye et al. (2025).
Setup
First, we load the necessary packages for our analysis.
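The exact package-loading code is not shown in this rendering; a plausible setup might look like the following sketch. The tabscreen_gpt() and screen_analyzer() functions come from AIscreenR, and plan() comes from the future package; dplyr is an assumed convenience addition for inspecting results.

```r
library(AIscreenR) # screening functions such as tabscreen_gpt() and screen_analyzer()
library(future)    # provides plan() for parallel API requests
library(dplyr)     # assumed: general data handling when inspecting results
```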
Methodology: Running the Comparison
To compare the models, we performed a screening task on the friends_dat dataset. The goal was to identify studies about the FRIENDS preventive programme.
The Prompt
The following prompt was used for all models. It is designed to be very specific about the required output format to ensure consistent results.
prompt <- "We are screening titles and abstracts of studies for a systematic review about FRIENDS-family interventions for children/adolescents.
Your task: decide INCLUDE (1) vs EXCLUDE (0) based ONLY on title + abstract.
INCLUDE (decision = 1) if ALL are true:
A) Intervention is FRIENDS-family OR clearly derived from it:
- Explicitly named FRIENDS / FRIENDS for Life / Fun FRIENDS, OR
- Explicitly described as an adaptation/translation/derivative of FRIENDS-family, OR
- Described as a school-/group-based CBT anxiety prevention/resilience program that is 'based on' or 'informed by' FRIENDS/Fun FRIENDS (treat this as FRIENDS-family unless the abstract clearly indicates it is unrelated).
B) The study measures, evaluates, or reports on anxiety, internalizing symptoms, OR social-emotional/coping outcomes:
- The abstract explicitly mentions anxiety/depression outcomes or anxiety reduction (e.g., 'decreased anxiety', 'anxiety symptoms improved', 'anxiety outcomes'), OR
- The abstract indicates anxiety/depression/internalizing/emotional coping skills are measured, OR
- The FRIENDS intervention is delivered with assessment of emotional/social/coping competencies, OR
- The intervention is explicitly described as targeting anxiety/depression reduction (measurement specifics unclear from abstract).
EXCLUDE if ANY are true:
1) Not FRIENDS-family and not clearly derived from FRIENDS-family (mere generic CBT with no FRIENDS link).
2) Discussion/review/conceptual paper with no empirical study described.
3) Study explicitly focuses ONLY on non-symptom outcomes (e.g., social validity, acceptability, satisfaction, implementation fidelity, teacher/student attendance) WITHOUT mentioning measurement of anxiety or internalizing symptoms.
4) Outcomes are only non-symptom constructs (e.g., social skills/SEL, cooperation) with NO indication that anxiety/internalizing symptoms are being measured.
When uncertain: Lean towards INCLUDE
Remember: Include studies if the abstract suggests that the full text might reveal that the study meets the criteria, even if the abstract is not perfectly clear. Exclude only if the abstract clearly indicates that the study does not meet the criteria.
"

Note that the prompt is changed from the original prompt used in the previous comparison of GPT-4.1-mini, GPT-4.1, and GPT-4.1-nano to better suit the reasoning capabilities of the GPT-5 models. The prompt is more detailed and provides clearer instructions to guide the reasoning process of the models.
Running the Screening
For each model (gpt-4o-mini, gpt-5-mini, gpt-5.1), we ran the screening using tabscreen_gpt(). We ran the process with a single repetition (reps = 1) for all models. We used gpt-4o-mini as a baseline as this is the default model used in AIscreenR for screening tasks, and we wanted to compare the performance of the newer GPT-5 models against this baseline.
For the gpt-4o-mini model we used the following code:
# Example code to run the screening for one model (here gpt-4o-mini with 1 repetition)
plan(multisession)
result_obj <-
tabscreen_gpt(
data = friends_dat, # The dataset containing the studies to be screened
prompt = prompt, # The prompt defined above
studyid = studyid, # The column in the dataset that contains the study IDs
title = title, # The column in the dataset that contains the study titles
abstract = abstract, # The column in the dataset that contains the study abstracts
model = "gpt-4o-mini", # The model to use for screening
reps = 1, # Number of repetitions (set to 1 for this comparison)
decision_description = FALSE # Whether to include the model's reasoning in the output (set to FALSE for this comparison)
)
plan(sequential)

For the gpt-5-mini model we used the following code:
# Example code to run the screening for one model (e.g., gpt-5-mini with 1 repetition)
plan(multisession)
result_obj <-
tabscreen_gpt(
data = friends_dat, # The dataset containing the studies to be screened
prompt = prompt, # The prompt defined above
studyid = studyid, # The column in the dataset that contains the study IDs
title = title, # The column in the dataset that contains the study titles
abstract = abstract, # The column in the dataset that contains the study abstracts
model = "gpt-5-mini", # The model to use for screening
reps = 1, # Number of repetitions (set to 1 for this comparison)
decision_description = FALSE, # Whether to include the model's reasoning in the output (set to FALSE for this comparison)
reasoning_effort = "medium", # The reasoning effort level for the GPT-5 models (set to "medium" for this comparison)
verbosity = "low" # The verbosity level for the GPT-5 models (set to "low" for this comparison)
)
plan(sequential)

For the gpt-5.1 model we used the following code:
# Example code to run the screening for one model (here gpt-5.1 with 1 repetition)
plan(multisession)
result_obj <-
tabscreen_gpt(
data = friends_dat, # The dataset containing the studies to be screened
prompt = prompt, # The prompt defined above
studyid = studyid, # The column in the dataset that contains the study IDs
title = title, # The column in the dataset that contains the study titles
abstract = abstract, # The column in the dataset that contains the study abstracts
model = "gpt-5.1", # The model to use for screening
reps = 1, # Number of repetitions (set to 1 for this comparison)
decision_description = FALSE, # Whether to include the model's reasoning in the output (set to FALSE for this comparison)
reasoning_effort = "medium", # The reasoning effort level for the GPT-5 models (set to "medium" for this comparison)
verbosity = "low" # The verbosity level for the GPT-5 models (set to "low" for this comparison)
)
plan(sequential)

Results
We now load the pre-computed results from the screening runs. The performance metrics in the table below were calculated using the screen_analyzer() function. The key metrics are:
- Recall: The proportion of truly relevant studies that the model correctly identified. High recall is crucial to avoid missing relevant studies.
- Specificity: The proportion of truly irrelevant studies that the model correctly identified.
- Balanced Accuracy (bAcc): The average of recall and specificity, providing a single measure that balances performance on both relevant and irrelevant studies.
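Balanced accuracy is simply the mean of recall and specificity, so it can be recomputed directly from the reported figures. A quick sketch:

```r
# Balanced accuracy: the mean of recall and specificity
bacc <- function(recall, specificity) (recall + specificity) / 2

bacc(1.000, 0.974) # GPT-4o-mini: 0.987
bacc(0.984, 0.984) # GPT-5-mini and GPT-5.1: 0.984
```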
| model | p_agreement | recall | specificity | false_negatives | false_positives | price (USD) |
|---|---|---|---|---|---|---|
| GPT-4o-mini | 0.975 | 1.000 | 0.974 | 0 | 66 | 0.3982 |
| GPT-5-mini | 0.981 | 0.984 | 0.981 | 1 | 49 | 2.2387 |
| GPT-5.1 | 0.984 | 0.984 | 0.984 | 1 | 41 | 5.2719 |
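A table like the one above can be produced with screen_analyzer(). The sketch below assumes that result_obj is the object returned by tabscreen_gpt() and that the screened dataset contains the human screening decisions against which the model's decisions are compared.

```r
# Sketch: compute performance metrics for a completed screening run.
# screen_analyzer() compares the model's decisions with the human codes
# and returns metrics such as agreement, recall, and specificity.
performance <- screen_analyzer(result_obj)
performance
```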
Interpretation
The results demonstrate strong performance across all three models, with notable differences in specificity and consistency. GPT-4o-mini achieved a perfect recall of 1.000 but a specificity of 0.974 and agreement of 0.975. GPT-5-mini and GPT-5.1 both achieved slightly lower recall (0.984) but notably higher specificity (0.981 and 0.984, respectively) and agreement (0.981 and 0.984, respectively).
Examining the errors in detail, GPT-4o-mini had 0 false negatives and 66 false positives, while GPT-5-mini and GPT-5.1 each had 1 false negative, with 49 and 41 false positives, respectively. In the single false negative case (Study 2540), the GPT-5 models identified the paper as a discussion/review with no empirical results and excluded it on that basis. The human coders also disagreed on this study, with one including and one excluding it, suggesting that the abstract is genuinely ambiguous and that this false negative for GPT-5-mini and GPT-5.1 may be a borderline case rather than a genuine error.
The key trade-off is between recall and specificity. GPT-4o-mini prioritizes sensitivity (catching all relevant studies) at the cost of more false positives, suggesting a more inclusive screening approach. In contrast, GPT-5-mini and GPT-5.1 trade slightly lower recall for substantially better specificity, resulting in fewer false positives. This indicates that the reasoning models follow the prompt more closely and make more conservative inclusion decisions.
From a cost perspective, GPT-4o-mini was the most economical at $0.3982, while GPT-5-mini cost $2.2387 and GPT-5.1 cost $5.2719. Despite having fewer false positives, the reasoning models are roughly 5.6 to 13.2 times more expensive than GPT-4o-mini.
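These cost ratios can be checked directly from the reported prices:

```r
# Total screening cost (USD) per model, as reported above
prices <- c(gpt4o_mini = 0.3982, gpt5_mini = 2.2387, gpt5_1 = 5.2719)

# Cost relative to the GPT-4o-mini baseline
round(prices / prices[["gpt4o_mini"]], 1)
#> gpt4o_mini  gpt5_mini     gpt5_1
#>        1.0        5.6       13.2
```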
Conclusion
The choice of model depends on the specific priorities of the systematic review and available budget. GPT-4o-mini remains the most cost-effective option, achieving perfect recall and high specificity (0.974), making it ideal for minimizing false negatives when budget is a primary concern. However, if reducing false positives and improving overall agreement with human judgments is a priority, GPT-5-mini and GPT-5.1 provide superior performance with agreement rates of 0.981 and 0.984, respectively, despite the higher cost.
The reasoning models (GPT-5-mini and GPT-5.1) demonstrate that stronger reasoning capabilities lead to more conservative and consistent screening decisions that closely align with the provided prompt. When using reasoning models, a well-crafted, detailed prompt that clearly outlines inclusion and exclusion criteria is essential to leverage their advantages. For reviews where accuracy and precision are most important and budget allows, GPT-5.1 offers the best performance with the lowest false positive rate (41 studies), though at the highest cost.
References
Vembye, M. H., Christensen, J., Mølgaard, A. B., & Schytt, F. L. W. (2025). Generative pretrained transformer models can function as highly reliable second screeners of titles and abstracts in systematic reviews: A proof of concept and common guidelines. Psychological Methods, Online first. https://doi.org/10.1037/met0000769