Comparing reasoning models for screening: GPT-4o-mini, GPT-5-mini, and GPT-5.1
Source: vignettes/articles/comparing-reasoning-models.qmd
Introduction
As Vembye et al. (2025) focus exclusively on the screening performance of GPT-4 models, it remains unclear how their proposed setup generalizes to newer GPT-5 models. In this article, we compare the performance of three OpenAI models for screening tasks: gpt-4o-mini, gpt-5-mini, and gpt-5.1. We evaluate their effectiveness in systematic literature review screening tasks using a real-world dataset. The comparison is based on a single repetition (reps = 1) for all models. For the reasoning models (gpt-5-mini and gpt-5.1), we use the default reasoning effort of medium and verbosity low. The aim of this comparison is to assess the trade-offs between the models in terms of accuracy and efficiency for screening tasks, and to investigate the impact of reasoning capabilities on screening performance.
As can be seen below, we find that prompting GPT-5 models differs from prompting GPT-4 models. They appear to require more detailed information within a single prompt. In other words, they seem to be better at multitasking (i.e., handling multiple pieces of different information simultaneously) compared to GPT-4 models, contrasting some of the findings from Vembye et al. (2025).
Setup
First, we load the necessary packages for our analysis.
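The exact package-loading code is not shown in this rendering; a plausible setup might look like the following sketch. The tabscreen_gpt() and screen_analyzer() functions come from AIscreenR, and plan() comes from the future package; dplyr is an assumed convenience addition for inspecting results.

```r
library(AIscreenR) # screening functions such as tabscreen_gpt() and screen_analyzer()
library(future)    # provides plan() for parallel API requests
library(dplyr)     # assumed: general data handling when inspecting results
```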
Methodology: Running the Comparison
To compare the models, we performed a screening task on the friends_dat dataset. The goal was to identify studies about the FRIENDS preventive programme.
The Prompt
The following prompt was used for all models. It is designed to be very specific about the required output format to ensure consistent results.
prompt <- "We are screening titles and abstracts of studies for a systematic review about FRIENDS-family interventions for children/adolescents.
Your task: decide INCLUDE (1) vs EXCLUDE (0) based ONLY on title + abstract.
INCLUDE (decision = 1) if ALL are true:
A) Intervention is FRIENDS-family OR clearly derived from it:
- Explicitly named FRIENDS / FRIENDS for Life / Fun FRIENDS, OR
- Explicitly described as an adaptation/translation/derivative of FRIENDS-family, OR
- Described as a school-/group-based CBT anxiety prevention/resilience program that is 'based on' or 'informed by' FRIENDS/Fun FRIENDS (treat this as FRIENDS-family unless the abstract clearly indicates it is unrelated).
B) The study measures, evaluates, or reports on anxiety, internalizing symptoms, OR social-emotional/coping outcomes:
- The abstract explicitly mentions anxiety/depression outcomes or anxiety reduction (e.g., 'decreased anxiety', 'anxiety symptoms improved', 'anxiety outcomes'), OR
- The abstract indicates anxiety/depression/internalizing/emotional coping skills are measured, OR
- The FRIENDS intervention is delivered with assessment of emotional/social/coping competencies, OR
- The intervention is explicitly described as targeting anxiety/depression reduction (measurement specifics unclear from abstract).
EXCLUDE if ANY are true:
1) Not FRIENDS-family and not clearly derived from FRIENDS-family (mere generic CBT with no FRIENDS link).
2) Discussion/review/conceptual paper with no empirical study described.
3) Study explicitly focuses ONLY on non-symptom outcomes (e.g., social validity, acceptability, satisfaction, implementation fidelity, teacher/student attendance) WITHOUT mentioning measurement of anxiety or internalizing symptoms.
4) Outcomes are only non-symptom constructs (e.g., social skills/SEL, cooperation) with NO indication that anxiety/internalizing symptoms are being measured.
When uncertain: Lean towards INCLUDE
Remember: Include studies if the abstract suggests that the full text might reveal that the study meets the criteria, even if the abstract is not perfectly clear. Exclude only if the abstract clearly indicates that the study does not meet the criteria.
"

Note that the prompt is changed from the original prompt used in the previous comparison of GPT-4.1-mini, GPT-4.1, and GPT-4.1-nano to better suit the reasoning capabilities of the GPT-5 models. The prompt is more detailed and provides clearer instructions to guide the reasoning process of the models.
Running the Screening
For each model (gpt-4o-mini, gpt-5-mini, gpt-5.1), we ran the screening using tabscreen_gpt(). We ran the process with a single repetition (reps = 1) for all models. We used gpt-4o-mini as a baseline as this is the default model used in AIscreenR for screening tasks, and we wanted to compare the performance of the newer GPT-5 models against this baseline.
For the gpt-4o-mini model we used the following code:
# Example code to run the screening for one model (here gpt-4o-mini with 1 repetition)
plan(multisession)
result_obj <-
tabscreen_gpt(
data = friends_dat, # The dataset containing the studies to be screened
prompt = prompt, # The prompt defined above
studyid = studyid, # The column in the dataset that contains the study IDs
title = title, # The column in the dataset that contains the study titles
abstract = abstract, # The column in the dataset that contains the study abstracts
model = "gpt-4o-mini", # The model to use for screening
reps = 1, # Number of repetitions (set to 1 for this comparison)
decision_description = FALSE # Whether to include the model's reasoning in the output (set to FALSE for this comparison)
)
plan(sequential)

For the gpt-5-mini model we used the following code:
# Example code to run the screening for one model (e.g., gpt-5-mini with 1 repetition)
plan(multisession)
result_obj <-
tabscreen_gpt(
data = friends_dat, # The dataset containing the studies to be screened
prompt = prompt, # The prompt defined above
studyid = studyid, # The column in the dataset that contains the study IDs
title = title, # The column in the dataset that contains the study titles
abstract = abstract, # The column in the dataset that contains the study abstracts
model = "gpt-5-mini", # The model to use for screening
reps = 1, # Number of repetitions (set to 1 for this comparison)
decision_description = FALSE, # Whether to include the model's reasoning in the output (set to FALSE for this comparison)
reasoning_effort = "medium", # The reasoning effort level for the GPT-5 models (set to "medium" for this comparison)
verbosity = "low" # The verbosity level for the GPT-5 models (set to "low" for this comparison)
)
plan(sequential)

For the gpt-5.1 model we used the following code:
# Example code to run the screening for one model (here gpt-5.1 with 1 repetition)
plan(multisession)
result_obj <-
tabscreen_gpt(
data = friends_dat, # The dataset containing the studies to be screened
prompt = prompt, # The prompt defined above
studyid = studyid, # The column in the dataset that contains the study IDs
title = title, # The column in the dataset that contains the study titles
abstract = abstract, # The column in the dataset that contains the study abstracts
model = "gpt-5.1", # The model to use for screening
reps = 1, # Number of repetitions (set to 1 for this comparison)
decision_description = FALSE, # Whether to include the model's reasoning in the output (set to FALSE for this comparison)
reasoning_effort = "medium", # The reasoning effort level for the GPT-5 models (set to "medium" for this comparison)
verbosity = "low" # The verbosity level for the GPT-5 models (set to "low" for this comparison)
)
plan(sequential)

Results
We now load the pre-computed results from the screening runs. The performance metrics in the table below were calculated using the screen_analyzer() function. The key metrics are:
- Recall: The proportion of truly relevant studies that the model correctly identified. High recall is crucial to avoid missing relevant studies.
- Specificity: The proportion of truly irrelevant studies that the model correctly identified.
- Balanced Accuracy (bAcc): The average of recall and specificity, providing a single measure that balances performance on both relevant and irrelevant studies.
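Balanced accuracy is simply the mean of recall and specificity, so it can be recomputed directly from the reported figures. A quick sketch:

```r
# Balanced accuracy: the mean of recall and specificity
bacc <- function(recall, specificity) (recall + specificity) / 2

bacc(1.000, 0.974) # GPT-4o-mini: 0.987
bacc(0.984, 0.984) # GPT-5-mini and GPT-5.1: 0.984
```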
| model | p_agreement | recall | specificity | false_negatives | false_positives | price (USD) |
|---|---|---|---|---|---|---|
| GPT-4o-mini | 0.975 | 1.000 | 0.974 | 0 | 66 | 0.3982 |
| GPT-5-mini | 0.981 | 0.984 | 0.981 | 1 | 49 | 2.2387 |
| GPT-5.1 | 0.984 | 0.984 | 0.984 | 1 | 41 | 5.2719 |
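A table like the one above can be produced with screen_analyzer(). The sketch below assumes that result_obj is the object returned by tabscreen_gpt() and that the screened dataset contains the human screening decisions against which the model's decisions are compared.

```r
# Sketch: compute performance metrics for a completed screening run.
# screen_analyzer() compares the model's decisions with the human codes
# and returns metrics such as agreement, recall, and specificity.
performance <- screen_analyzer(result_obj)
performance
```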
Interpretation
The results demonstrate strong performance across all three models, with notable differences in specificity and consistency. GPT-4o-mini achieved a perfect recall of 1.000 but a specificity of 0.974 and agreement of 0.975. GPT-5-mini and GPT-5.1 both achieved slightly lower recall (0.984) but notably higher specificity (0.981 and 0.984, respectively) and agreement (0.981 and 0.984, respectively).
Examining the errors in detail, GPT-4o-mini had 0 false negatives and 66 false positives, while GPT-5-mini and GPT-5.1 each had 1 false negative, with 49 and 41 false positives, respectively. In the single false negative case (Study 2540), the GPT-5 models identified the paper as a discussion/review with no empirical results and excluded it on that basis. The human coders also disagreed on this study, with one including and one excluding it, suggesting that the abstract is genuinely ambiguous and that this false negative for GPT-5-mini and GPT-5.1 may be a borderline case rather than a genuine error.
The key trade-off is between recall and specificity. GPT-4o-mini prioritizes sensitivity (catching all relevant studies) at the cost of more false positives, suggesting a more inclusive screening approach. In contrast, GPT-5-mini and GPT-5.1 trade slightly lower recall for substantially better specificity, resulting in fewer false positives. This indicates that the reasoning models follow the prompt more closely and make more conservative inclusion decisions.
From a cost perspective, GPT-4o-mini was the most economical at $0.3982, while GPT-5-mini cost $2.2387 and GPT-5.1 cost $5.2719. Despite having fewer false positives, the reasoning models are roughly 5.6 to 13.2 times more expensive than GPT-4o-mini.
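These cost ratios can be checked directly from the reported prices:

```r
# Total screening cost (USD) per model, as reported above
prices <- c(gpt4o_mini = 0.3982, gpt5_mini = 2.2387, gpt5_1 = 5.2719)

# Cost relative to the GPT-4o-mini baseline
round(prices / prices[["gpt4o_mini"]], 1)
#> gpt4o_mini  gpt5_mini     gpt5_1
#>        1.0        5.6       13.2
```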
Conclusion
The choice of model depends on the specific priorities of the systematic review and available budget. GPT-4o-mini remains the most cost-effective option, achieving perfect recall and high specificity (0.974), making it ideal for minimizing false negatives when budget is a primary concern. However, if reducing false positives and improving overall agreement with human judgments is a priority, GPT-5-mini and GPT-5.1 provide superior performance with agreement rates of 0.981 and 0.984, respectively, despite the higher cost.
The reasoning models (GPT-5-mini and GPT-5.1) demonstrate that stronger reasoning capabilities lead to more conservative and consistent screening decisions that closely align with the provided prompt. When using reasoning models, a well-crafted, detailed prompt that clearly outlines inclusion and exclusion criteria is essential to leverage their advantages. For reviews where accuracy and precision are most important and budget allows, GPT-5.1 offers the best performance with the lowest false positive rate (41 studies), though at the highest cost.
References
Vembye, M. H., Christensen, J., Mølgaard, A. B., & Schytt, F. L. W. (2025). Generative pretrained transformer models can function as highly reliable second screeners of titles and abstracts in systematic reviews: A proof of concept and common guidelines. Psychological Methods, Online first. https://doi.org/10.1037/met0000769