Analyze the performance of the AI screening relative to the human screening.
Source: R/screen_analyzer.R
When both the human and AI title and abstract screening have been completed, this function
calculates performance measures for the screening, including the overall
accuracy, specificity, and sensitivity of the screening, as well as
inter-rater reliability (kappa) statistics.
Arguments
- x
An object of class 'gpt' or 'chatgpt', or a dataset of class 'gpt_tbl', 'chatgpt_tbl', or 'gpt_agg_tbl'.
- human_decision
Indicates the variable in the data that contains the human decisions. This variable must be numeric, containing only 1 (for included references) and 0 (for excluded references). See the usage sketch after this argument list.
- key_result
Logical indicating whether only the raw agreement, recall, and specificity measures should be returned. Default is TRUE.
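A minimal usage sketch of supplying the human decision variable, assuming the screened data contain a numeric column named human_code holding the 1/0 human decisions (the object name res and the column name human_code are assumptions for illustration only):
# 'res' is assumed to be an object returned by tabscreen_gpt(); 'human_code' is a
# hypothetical column with 1 (included) / 0 (excluded) human decisions.
res |> screen_analyzer(human_decision = human_code)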
Value
A tibble with screening performance measures. The tibble includes the following variables:
variable | type | description |
promptid | integer | indicating the prompt ID. |
model | character | indicating the specific gpt-model used. |
reps | integer | indicating the number of times the same question was sent to the GPT server. |
top_p | numeric | indicating the applied top_p. |
n_screened | integer | indicating the number of screened references. |
n_missing | numeric | indicating the number of missing responses. |
n_refs | integer | indicating the total number of references expected to be screened for the given condition. |
human_in_gpt_ex | numeric | indicating the number of references included by humans and excluded by gpt. |
human_ex_gpt_in | numeric | indicating the number of references excluded by humans and included by gpt. |
human_in_gpt_in | numeric | indicating the number of references included by humans and included by gpt. |
human_ex_gpt_ex | numeric | indicating the number of references excluded by humans and excluded by gpt. |
accuracy | numeric | indicating the overall accuracy of the gpt decisions measured against the human decisions (Gartlehner et al., 2019). |
p_agreement | numeric | indicating the overall percent agreement between human and gpt. |
precision | numeric | "measures the ability to include only articles that should be included" (Syriani et al., 2023). |
recall | numeric | "measures the ability to include all articles that should be included" (Syriani et al., 2023). |
npv | numeric | Negative predictive value (NPV) "measures the ability to exclude only articles that should be excluded" (Syriani et al., 2023). |
specificity | numeric | "measures the ability to exclude all articles that should be excluded" (Syriani et al., 2023). |
bacc | numeric | balanced accuracy, which "capture[s] the accuracy of deciding both inclusion and exclusion classes" (Syriani et al., 2023). |
F2 | numeric | F-measure that "consider[s] the cost of getting false negatives twice as costly as getting false positives" (Syriani et al., 2023). |
mcc | numeric | Matthews correlation coefficient, a balanced measure of agreement between the human and gpt classifications (Syriani et al., 2023). |
irr | numeric | indicating the inter-rater reliability as described in McHugh (2012). |
se_irr | numeric | indicating standard error for the inter-rater reliability. |
cl_irr | numeric | indicating the lower bound of the confidence interval for the inter-rater reliability. |
cu_irr | numeric | indicating the upper bound of the confidence interval for the inter-rater reliability. |
level_of_agreement | character | interpretation of the inter-rater reliability as suggested by McHugh (2012). |
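The classification measures follow directly from the 2x2 table of human and gpt decisions, where human_in_gpt_in, human_in_gpt_ex, human_ex_gpt_in, and human_ex_gpt_ex play the roles of true positives, false negatives, false positives, and true negatives when the human decision is treated as the reference standard. The following is an illustrative sketch of the standard formulas with made-up cell counts, not the package's exact implementation:
# Hypothetical cell counts from a human-vs-gpt 2x2 table (illustration only)
TP <- 40  # human_in_gpt_in: included by both
FN <- 10  # human_in_gpt_ex: included by human, excluded by gpt
FP <- 20  # human_ex_gpt_in: excluded by human, included by gpt
TN <- 130 # human_ex_gpt_ex: excluded by both

accuracy    <- (TP + TN) / (TP + FP + FN + TN)
precision   <- TP / (TP + FP)             # include only what should be included
recall      <- TP / (TP + FN)             # sensitivity: include all that should be included
npv         <- TN / (TN + FN)             # negative predictive value
specificity <- TN / (TN + FP)             # exclude all that should be excluded
bacc        <- (recall + specificity) / 2 # balanced accuracy
F2          <- 5 * precision * recall / (4 * precision + recall)
mcc         <- (TP * TN - FP * FN) /
  sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))

# Cohen's kappa (inter-rater reliability, McHugh, 2012)
n   <- TP + FP + FN + TN
p_o <- (TP + TN) / n                                          # observed agreement
p_e <- ((TP + FP) * (TP + FN) + (TN + FP) * (TN + FN)) / n^2  # agreement expected by chance
kappa <- (p_o - p_e) / (1 - p_e)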
References
Gartlehner, G., Wagner, G., Lux, L., Affengruber, L., Dobrescu, A., Kaminski-Hartenthaler, A., & Viswanathan, M. (2019). Assessing the accuracy of machine-assisted abstract screening with DistillerAI: a user study. Systematic Reviews, 8(1), 277. doi:10.1186/s13643-019-1221-3
McHugh, M. L. (2012). Interrater reliability: The kappa statistic. Biochemia Medica, 22(3), 276-282. https://pubmed.ncbi.nlm.nih.gov/23092060/
Syriani, E., David, I., & Kumar, G. (2023). Assessing the Ability of ChatGPT to Screen Articles for Systematic Reviews. arXiv preprint arXiv:2307.06464.
Examples
if (FALSE) { # \dontrun{
library(future)

# Store the OpenAI API key for the session
set_api_key()

prompt <- "Is this study about a Functional Family Therapy (FFT) intervention?"

# Screen in parallel across multiple sessions
plan(multisession)

res <- tabscreen_gpt(
  data = filges2015_dat[1:2,],
  prompt = prompt,
  studyid = studyid,
  title = title,
  abstract = abstract
)

plan(sequential)

# Compare the gpt decisions with the human decisions
res |> screen_analyzer()
} # }
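To obtain the full set of performance measures listed under Value, rather than only the key measures, key_result can be set to FALSE (a usage sketch based on the argument description above):
# Return all performance measures instead of only raw agreement, recall, and specificity
res |> screen_analyzer(key_result = FALSE)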