

Once both the human and the AI title and abstract screening have been completed, this function calculates performance measures for the screening, including overall accuracy, specificity, and sensitivity, as well as inter-rater reliability (kappa) statistics.


screen_analyzer(x, human_decision = human_code, key_result = TRUE)



x

An object of class 'gpt' or 'chatgpt', or a dataset of class 'gpt_tbl', 'chatgpt_tbl', or 'gpt_agg_tbl'.


human_decision

The variable in the data that contains the human decision. This variable must be numeric and contain only 1 (for included references) and 0 (for excluded references).
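A minimal sketch of preparing such a variable, assuming the human decisions are stored as text labels. The data frame `refs` and the column `decision_label` are hypothetical names used only for illustration; they are not part of the package.

```r
# Hypothetical reference data; the object and column names are assumptions.
refs <- data.frame(
  studyid = 1:4,
  decision_label = c("include", "exclude", "exclude", "include"),
  stringsAsFactors = FALSE
)

# Recode the text labels to the numeric 1/0 coding described above.
refs$human_code <- ifelse(refs$decision_label == "include", 1, 0)

# Confirm that the variable is numeric and contains only 0 and 1.
stopifnot(is.numeric(refs$human_code), all(refs$human_code %in% c(0, 1)))
```

Running a check like the final `stopifnot()` before screening catches stray codings (for example `NA` or 2) early.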


key_result

Logical indicating whether only the raw agreement, recall, and specificity measures should be returned. Default is TRUE.


A tibble with screening performance measures. The tibble includes the following variables:

promptid            integer    indicating the prompt ID.
model               character  indicating the specific gpt-model used.
reps                integer    indicating the number of times the same question was sent to the GPT server.
top_p               numeric    indicating the applied top_p.
n_screened          integer    indicating the number of screened references.
n_missing           numeric    indicating the number of missing responses.
n_refs              integer    indicating the total number of references expected to be screened for the given condition.
human_in_gpt_ex     numeric    indicating the number of references included by humans and excluded by gpt.
human_ex_gpt_in     numeric    indicating the number of references excluded by humans and included by gpt.
human_in_gpt_in     numeric    indicating the number of references included by humans and included by gpt.
human_ex_gpt_ex     numeric    indicating the number of references excluded by humans and excluded by gpt.
accuracy            numeric    indicating the overall percent disagreement between human and gpt (Gartlehner et al., 2019).
p_agreement         numeric    indicating the overall percent agreement between human and gpt.
precision           numeric    "measures the ability to include only articles that should be included" (Syriani et al., 2023).
recall              numeric    "measures the ability to include all articles that should be included" (Syriani et al., 2023).
npv                 numeric    Negative predictive value (NPV) "measures the ability to exclude only articles that should be excluded" (Syriani et al., 2023).
specificity         numeric    "measures the ability to exclude all articles that should be excluded" (Syriani et al., 2023).
bacc                numeric    "capture the accuracy of deciding both inclusion and exclusion classes" (Syriani et al., 2023).
F2                  numeric    F-measure that "consider the cost of getting false negatives twice as costly as getting false positives" (Syriani et al., 2023).
mcc                 numeric    indicating percent agreement for excluded references (Gartlehner et al., 2019).
irr                 numeric    indicating the inter-rater reliability as described in McHugh (2012).
se_irr              numeric    indicating the standard error of the inter-rater reliability.
cl_irr              numeric    indicating the lower bound of the confidence interval for the inter-rater reliability.
cu_irr              numeric    indicating the upper bound of the confidence interval for the inter-rater reliability.
level_of_agreement  character  interpretation of the inter-rater reliability as suggested by McHugh (2012).
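To make the relationship between the four count variables and the derived measures concrete, the following sketch computes several of them by hand using the standard textbook definitions (precision, recall, NPV, specificity, balanced accuracy, the F-beta score with beta = 2, and Cohen's kappa). The counts are invented for illustration, and the formulas are assumptions about how such measures are conventionally defined; they are not taken from the package source and may differ in detail from what `screen_analyzer()` computes internally.

```r
# Illustrative counts (made up for this sketch, not real screening results).
human_in_gpt_in <- 40   # included by both human and gpt
human_ex_gpt_in <- 10   # excluded by human, included by gpt
human_in_gpt_ex <- 5    # included by human, excluded by gpt
human_ex_gpt_ex <- 145  # excluded by both human and gpt

n <- human_in_gpt_in + human_ex_gpt_in + human_in_gpt_ex + human_ex_gpt_ex

# Raw agreement: share of references where human and gpt decide the same.
p_agreement <- (human_in_gpt_in + human_ex_gpt_ex) / n

# Conventional confusion-matrix measures, treating the human decision as reference.
precision   <- human_in_gpt_in / (human_in_gpt_in + human_ex_gpt_in)
recall      <- human_in_gpt_in / (human_in_gpt_in + human_in_gpt_ex)
npv         <- human_ex_gpt_ex / (human_ex_gpt_ex + human_in_gpt_ex)
specificity <- human_ex_gpt_ex / (human_ex_gpt_ex + human_ex_gpt_in)
bacc        <- (recall + specificity) / 2

# F-beta with beta = 2, which weights recall more heavily than precision.
F2 <- 5 * precision * recall / (4 * precision + recall)

# Cohen's kappa: observed agreement corrected for chance agreement.
gpt_in   <- human_in_gpt_in + human_ex_gpt_in
gpt_ex   <- human_in_gpt_ex + human_ex_gpt_ex
human_in <- human_in_gpt_in + human_in_gpt_ex
human_ex <- human_ex_gpt_in + human_ex_gpt_ex
p_chance <- (gpt_in * human_in + gpt_ex * human_ex) / n^2
kappa    <- (p_agreement - p_chance) / (1 - p_chance)

round(c(p_agreement = p_agreement, precision = precision, recall = recall,
        npv = npv, specificity = specificity, bacc = bacc, F2 = F2,
        kappa = kappa), 3)
```

Because missed relevant studies are usually the costlier error in screening, recall and F2 tend to be the measures to inspect first.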


Gartlehner, G., Wagner, G., Lux, L., Affengruber, L., Dobrescu, A., Kaminski-Hartenthaler, A., & Viswanathan, M. (2019). Assessing the accuracy of machine-assisted abstract screening with DistillerAI: a user study. Systematic Reviews, 8(1), 277. doi:10.1186/s13643-019-1221-3

McHugh, M. L. (2012). Interrater reliability: The kappa statistic. Biochemia Medica, 22(3), 276-282.

Syriani, E., David, I., & Kumar, G. (2023). Assessing the ability of ChatGPT to screen articles for systematic reviews. arXiv preprint arXiv:2307.06464.


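if (FALSE) { # \dontrun{

# Screening prompt sent to the model
prompt <- "Is this study about a Functional Family Therapy (FFT) intervention?"

# Screen the first two references in the example data
res <- tabscreen_gpt(
  data = filges2015_dat[1:2,],
  prompt = prompt,
  studyid = studyid,
  title = title,
  abstract = abstract
)

# Compare the GPT decisions with the human decisions
# (human_decision defaults to the human_code variable)
res |> screen_analyzer()

} # }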