Intra- and interobserver agreement of proposed objective transvaginal ultrasound image-quality scoring system for use in artificial intelligence algorithm development.
The development of valuable artificial intelligence (AI) tools to assist with ultrasound diagnosis depends on algorithms developed using high-quality data. This study aimed to test the intra- and interobserver agreement of a proposed image-quality scoring system to quantify the quality of gynecological transvaginal ultrasound (TVS) images, which could be used in clinical practice and AI tool development.
A proposed scoring system to quantify TVS image quality was created following a review of the literature. This system involved a score of 1-4 (2 = poor, 3 = suboptimal and 4 = optimal image quality) assigned by a rater for individual ultrasound images. If the image was deemed inaccurate, it was assigned a score of 1, corresponding to 'reject'. Six professionals, including two radiologists, two sonographers and two sonologists, reviewed 150 images (50 images of the uterus and 100 images of the ovaries) obtained from 50 women, assigning each image a score of 1-4. The review of all images was repeated a second time by each rater after a period of at least 1 week. Mean scores were calculated for each rater. Overall interobserver agreement was assessed using intraclass correlation coefficient (ICC), and interobserver agreement between paired professionals and intraobserver agreement for all professionals were assessed using weighted Cohen's kappa and ICC.
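As an illustration of the paired-rater statistic named above, the following is a minimal sketch of weighted Cohen's kappa for two raters scoring images on the 1-4 scale. The abstract does not state which weighting scheme was used, so linear weights are an assumption here (quadratic weights are available through the hypothetical `power` parameter), and the function name and example scores are illustrative only.

```python
from collections import Counter


def weighted_kappa(r1, r2, categories=(1, 2, 3, 4), power=1):
    """Weighted Cohen's kappa for two raters.

    power=1 gives linear disagreement weights (assumed here),
    power=2 gives quadratic weights.
    """
    n = len(r1)
    if n == 0 or n != len(r2):
        raise ValueError("need two equal-length, non-empty rating lists")
    k = len(categories)
    idx = {c: i for i, c in enumerate(categories)}

    # Joint and marginal counts of the two raters' scores.
    observed = Counter((idx[a], idx[b]) for a, b in zip(r1, r2))
    m1 = Counter(idx[a] for a in r1)
    m2 = Counter(idx[b] for b in r2)

    num = den = 0.0
    for i in range(k):
        for j in range(k):
            w = (abs(i - j) / (k - 1)) ** power  # 0 on the diagonal
            num += w * observed[(i, j)] / n              # observed disagreement
            den += w * (m1[i] / n) * (m2[j] / n)         # chance disagreement
    if den == 0:
        raise ValueError("kappa is undefined when both raters use one category")
    return 1.0 - num / den


# Hypothetical scores from two raters for six images (not study data).
rater_a = [4, 3, 3, 2, 1, 4]
rater_b = [4, 3, 2, 2, 1, 3]
print(round(weighted_kappa(rater_a, rater_b), 3))
```

The same function applied to one rater's first and second reads would give an intraobserver value; the ICC reported in the abstract is a separate statistic and is not sketched here.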
Poor levels of interobserver agreement were obtained between the six raters for all 150 images (ICC, 0.480 (95% CI, 0.363-0.586)), as well as for assessment of the uterine images only (ICC, 0.359 (95% CI, 0.204-0.523)). Moderate agreement was achieved for the ovarian images (ICC, 0.531 (95% CI, 0.417-0.636)). Agreement between the paired sonographers and sonologists was poor for all images (ICC, 0.336 (95% CI, -0.078 to 0.619) and 0.425 (95% CI, 0.014-0.665), respectively), as well as when images were grouped into uterine images (ICC, 0.253 (95% CI, -0.097 to 0.577) and 0.299 (95% CI, -0.094 to 0.606), respectively) and ovarian images (ICC, 0.400 (95% CI, -0.043 to 0.669) and 0.469 (95% CI, 0.088-0.689), respectively). Agreement between the paired radiologists was moderate for all images (ICC, 0.600 (95% CI, 0.487-0.693)) and for their assessment of uterine images (ICC, 0.538 (95% CI, 0.311-0.707)) and ovarian images (ICC, 0.621 (95% CI, 0.483-0.728)). Weak-to-moderate intraobserver agreement was seen for each of the raters with weighted Cohen's kappa ranging from 0.533 to 0.718 for all images and from 0.467 to 0.751 for ovarian images. Similarly, for all raters, the ICC indicated moderate-to-good intraobserver agreement for all images overall (ICC ranged from 0.636 to 0.825) and for ovarian images (ICC ranged from 0.596 to 0.862). Slightly better intraobserver agreement was seen for uterine images, with weighted Cohen's kappa ranging from 0.568 to 0.808 indicating weak-to-strong agreement, and ICC ranging from 0.546 to 0.893 indicating moderate-to-good agreement. All measures were statistically significant (P < 0.001).
The proposed image quality scoring system was shown to have poor-to-moderate interobserver agreement and mostly weak-to-moderate levels of intraobserver agreement. More refinement of the scoring system may be needed to improve agreement, although it remains unclear whether quantification of image quality can be achieved, given the highly subjective nature of ultrasound interpretation. Although some AI systems can tolerate labeling noise, most will favor clean (high-quality) data. As such, innovative data-labeling strategies are needed. © 2025 The Author(s). Ultrasound in Obstetrics & Gynecology published by John Wiley & Sons Ltd on behalf of International Society of Ultrasound in Obstetrics and Gynecology.
Deslandes A, Avery JC, Chen HT, Leonardi M, Knox S, Lo G, O'Hara R, Condous G, Hull ML, Collaborators
《-》
Prediction of vesicouterine adhesions by transvaginal sonographic sliding sign technique: validation study.
Adhesions between the uterus, bladder and anterior abdominal wall are associated with clinical sequelae, including chronic pelvic pain and dyspareunia, and can also yield complications during surgery. The transvaginal sonographic (TVS) sliding bladder sign is a minimally invasive diagnostic tool to evaluate the presence of vesicouterine adhesions. This study aimed to determine the predictive value and intra- and interobserver variation of the TVS sliding bladder sign in the assessment of vesicouterine adhesions.
This was a prospective observational double-blind diagnostic accuracy study conducted at the Amsterdam University Medical Center. Patients scheduled for gynecological laparoscopic surgery for a benign disorder between January 2020 and December 2022 were included consecutively. All patients underwent preoperative TVS, including a dynamic sliding bladder sign examination in our outpatient clinic. Videoclips of the TVS scans were stored for offline assessment and used as an index test. The recordings of both TVS and laparoscopy were evaluated for diagnostic characteristics of vesicouterine adhesions by independent assessors, who were blinded to the clinical situation in addition to the laparoscopic findings when assessing recordings of TVS and vice versa. The presence of adhesions on laparoscopy was used as the reference standard. The positive predictive value (PPV), negative predictive value (NPV), specificity and sensitivity of the sliding bladder sign were calculated. In addition, inter- and intraobserver variability of the sliding bladder sign on TVS were assessed.
Of 116 included women, 57 had a negative sliding bladder sign on TVS, while on laparoscopy, 51 women had mild and 28 had severe vesicouterine adhesions. A negative sliding bladder sign had a PPV of 94.7% (95% CI, 88.9-100%) for the presence of any vesicouterine adhesions, and a positive sliding bladder sign had a specificity of 91.9% (95% CI, 83.1-100%). For severe adhesions, the negative sliding bladder sign had a sensitivity of 89.3% (95% CI, 77.8-100%) and a positive sliding bladder sign had a NPV of 94.9% (95% CI, 89.3-100%). When using Cohen's kappa coefficient, inter- and intraobserver agreement between assessors was good.
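The reported percentages can be reproduced from a 2x2 table. The sketch below assumes cell counts that are back-calculated from the figures above (116 women, 57 with a negative sliding sign, 51 mild plus 28 severe adhesions); the abstract does not state these cell counts directly, so they should be read as a consistency check rather than as reported data. A negative sliding sign is treated as the test result that indicates adhesions.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Standard accuracy measures from the four cells of a 2x2 table."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


# Reference standard: ANY adhesions on laparoscopy (51 + 28 = 79 of 116).
# Assumed cells: 54 of the 57 negative sliding signs had adhesions.
any_adhesions = diagnostic_metrics(tp=54, fp=3, fn=25, tn=34)

# Reference standard: SEVERE adhesions only (28 of 116).
# Assumed cells: 25 of the 28 severe cases had a negative sliding sign.
severe_adhesions = diagnostic_metrics(tp=25, fp=32, fn=3, tn=56)

print(round(any_adhesions["ppv"] * 100, 1))             # 94.7
print(round(any_adhesions["specificity"] * 100, 1))     # 91.9
print(round(severe_adhesions["sensitivity"] * 100, 1))  # 89.3
print(round(severe_adhesions["npv"] * 100, 1))          # 94.9
```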
Sliding bladder sign evaluation using TVS is a reliable diagnostic tool for the prediction of vesicouterine adhesions on laparoscopy. A negative sliding bladder sign indicates the presence of vesicouterine adhesions, while a positive sliding bladder sign makes the presence of severe adhesions unlikely. Establishing vesicouterine adhesions by TVS may optimize preoperative planning, and can be used for future studies to evaluate the relationship between symptomatology and vesicouterine adhesions and, subsequently, the effect of adhesion-prevention interventions. © 2024 The Authors. Ultrasound in Obstetrics & Gynecology published by John Wiley & Sons Ltd on behalf of International Society of Ultrasound in Obstetrics and Gynecology.
Min N, van Keizerswaard J, Visser RH, Burger NB, Rake JWT, Aarts JWM, Van den Bosch T, Leonardi M, Huirne JAF, de Leeuw RA
《-》
Comparison of Two Modern Survival Prediction Tools, SORG-MLA and METSSS, in Patients With Symptomatic Long-bone Metastases Who Underwent Local Treatment With Surgery Followed by Radiotherapy and With Radiotherapy Alone.
Survival estimation for patients with symptomatic skeletal metastases ideally should be made before a type of local treatment has already been determined. Currently available survival prediction tools, however, were generated using data from patients treated either operatively or with local radiation alone, raising concerns about whether they would generalize well to all patients presenting for assessment. The Skeletal Oncology Research Group machine-learning algorithm (SORG-MLA), trained with institution-based data of surgically treated patients, and the Metastases location, Elderly, Tumor primary, Sex, Sickness/comorbidity, and Site of radiotherapy model (METSSS), trained with registry-based data of patients treated with radiotherapy alone, are two of the most recently developed survival prediction models, but they have not been tested on patients whose local treatment strategy is not yet decided.
(1) Which of these two survival prediction models performed better in a mixed cohort made up both of patients who received local treatment with surgery followed by radiotherapy and who had radiation alone for symptomatic bone metastases? (2) Which model performed better among patients whose local treatment consisted of only palliative radiotherapy? (3) Are laboratory values used by SORG-MLA, which are not included in METSSS, independently associated with survival after controlling for predictions made by METSSS?
Between 2010 and 2018, we provided local treatment for 2113 adult patients with skeletal metastases in the extremities at an urban tertiary referral academic medical center using one of two strategies: (1) surgery followed by postoperative radiotherapy or (2) palliative radiotherapy alone. Every patient's survivorship status was ascertained either by their medical records or the national death registry from the Taiwanese National Health Insurance Administration. After applying a priori designated exclusion criteria, 91% (1920) were analyzed here. Among them, 48% (920) of the patients were female, and the median (IQR) age was 62 years (53 to 70 years). Lung was the most common primary tumor site (41% [782]), and 59% (1128) of patients had other skeletal metastases in addition to the treated lesion(s). In general, the indications for surgery were the presence of a complete pathologic fracture or an impending pathologic fracture, defined as having a Mirels score of ≥ 9, in patients with an American Society of Anesthesiologists (ASA) classification of less than or equal to IV and who were considered fit for surgery. The indications for radiotherapy were relief of pain, local tumor control, prevention of skeletal-related events, and any combination of the above. In all, 84% (1610) of the patients received palliative radiotherapy alone as local treatment for the target lesion(s), and 16% (310) underwent surgery followed by postoperative radiotherapy. Neither METSSS nor SORG-MLA was used at the point of care to aid clinical decision-making during the treatment period. Survival was retrospectively estimated by these two models to test their potential for providing survival probabilities. We first compared SORG-MLA to METSSS in the entire population. Then, we repeated the comparison in patients who received local treatment with palliative radiation alone.
We assessed model performance by area under the receiver operating characteristic curve (AUROC), calibration analysis, Brier score, and decision curve analysis (DCA). The AUROC measures discrimination, which is the ability to distinguish patients with the event of interest (such as death at a particular time point) from those without. AUROC typically ranges from 0.5 to 1.0, with 0.5 indicating random guessing and 1.0 a perfect prediction, and in general, an AUROC of ≥ 0.7 indicates adequate discrimination for clinical use. Calibration refers to the agreement between the predicted outcomes (in this case, survival probabilities) and the actual outcomes, with a perfect calibration curve having an intercept of 0 and a slope of 1. A positive intercept indicates that the actual survival is generally underestimated by the prediction model, and a negative intercept suggests the opposite (overestimation). When comparing models, an intercept closer to 0 typically indicates better calibration. Calibration can also be summarized as log(O:E), the logarithm scale of the ratio of observed (O) to expected (E) survivors. A log(O:E) > 0 signals an underestimation (the observed survival is greater than the predicted survival); and a log(O:E) < 0 indicates the opposite (the observed survival is lower than the predicted survival). A model with a log(O:E) closer to 0 is generally considered better calibrated. The Brier score is the mean squared difference between the model predictions and the observed outcomes, and it ranges from 0 (best prediction) to 1 (worst prediction). The Brier score captures both discrimination and calibration, and it is considered a measure of overall model performance. In Brier score analysis, the "null model" assigns a predicted probability equal to the prevalence of the outcome and represents a model that adds no new information. A prediction model should achieve a Brier score at least lower than the null-model Brier score to be considered as useful. 
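The performance measures described above can be sketched in a few lines. This is a minimal illustration on made-up data, not the study's analysis code: outcomes are coded 1 for a patient who survived to the time point and 0 otherwise, predictions are survival probabilities, and all function names are hypothetical.

```python
import math


def brier_score(probs, outcomes):
    """Mean squared difference between predictions and observed outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(outcomes)


def null_brier_score(outcomes):
    """Brier score of a model that always predicts the outcome prevalence."""
    prevalence = sum(outcomes) / len(outcomes)
    return brier_score([prevalence] * len(outcomes), outcomes)


def log_observed_to_expected(probs, outcomes):
    """log(O:E): > 0 means survival was underestimated, < 0 overestimated."""
    return math.log(sum(outcomes) / sum(probs))


def auroc(probs, outcomes):
    """Probability that a random survivor is ranked above a random
    non-survivor (ties count as half), i.e. the Mann-Whitney form of AUROC."""
    pos = [p for p, y in zip(probs, outcomes) if y == 1]
    neg = [p for p, y in zip(probs, outcomes) if y == 0]
    wins = sum((a > b) + 0.5 * (a == b) for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical predictions for four patients (not study data).
predicted_survival = [0.9, 0.8, 0.3, 0.2]
survived = [1, 1, 0, 0]

print(brier_score(predicted_survival, survived))
print(null_brier_score(survived))
print(log_observed_to_expected(predicted_survival, survived))
print(auroc(predicted_survival, survived))
```

In this toy example the model's Brier score falls below the null-model score, which is the minimum requirement for usefulness described above, and log(O:E) is slightly negative because the predicted survivors (2.2) exceed the observed survivors (2).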
The DCA was developed as a method to determine whether using a model to inform treatment decisions would do more good than harm. It plots the net benefit of making decisions based on the model's predictions across all possible risk thresholds (or cost-to-benefit ratios) in relation to the two default strategies of treating all or no patients. The care provider can decide on an acceptable risk threshold for the proposed treatment in an individual and assess the corresponding net benefit to determine whether consulting with the model is superior to adopting the default strategies. Finally, we examined whether laboratory data, which were not included in the METSSS model, were independently associated with survival after controlling for the METSSS model's predictions, using multivariable logistic regression and Cox proportional hazards regression analyses.
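The net benefit that a decision curve plots can be sketched as follows. This uses the standard decision-curve formula, net benefit = TP/n - (FP/n) x pt/(1 - pt) at risk threshold pt, which the abstract does not spell out; the data and function names are hypothetical, and `events` marks patients who had the outcome of interest.

```python
def net_benefit(risks, events, threshold):
    """Net benefit of treating patients whose predicted risk >= threshold."""
    n = len(events)
    tp = sum(1 for r, e in zip(risks, events) if r >= threshold and e == 1)
    fp = sum(1 for r, e in zip(risks, events) if r >= threshold and e == 0)
    # False positives are penalized by the odds of the risk threshold.
    return tp / n - fp / n * threshold / (1 - threshold)


def net_benefit_treat_all(events, threshold):
    """Default strategy of treating everyone; treating no one scores 0."""
    prevalence = sum(events) / len(events)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Hypothetical predicted risks and outcomes for four patients (not study data).
predicted_risk = [0.9, 0.7, 0.4, 0.1]
had_event = [1, 1, 0, 0]

for pt in (0.2, 0.5):
    print(pt,
          net_benefit(predicted_risk, had_event, pt),
          net_benefit_treat_all(had_event, pt))
```

A model is worth consulting at a given threshold only if its net benefit exceeds both default strategies there, which is the comparison described above.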
Between the two models, only SORG-MLA achieved adequate discrimination (an AUROC of > 0.7) in the entire cohort (of patients treated operatively or with radiation alone) and in the subgroup of patients treated with palliative radiotherapy alone. SORG-MLA outperformed METSSS by a wide margin on discrimination, calibration, and Brier score analyses in not only the entire cohort but also the subgroup of patients whose local treatment consisted of radiotherapy alone. In both the entire cohort and the subgroup, DCA demonstrated that SORG-MLA provided more net benefit compared with the two default strategies (of treating all or no patients) and compared with METSSS when risk thresholds ranged from 0.2 to 0.9 at both 90 days and 1 year, indicating that using SORG-MLA as a decision-making aid was beneficial when a patient's individualized risk threshold for opting for treatment was 0.2 to 0.9. Higher albumin, lower alkaline phosphatase, lower calcium, higher hemoglobin, lower international normalized ratio, higher lymphocytes, lower neutrophils, lower neutrophil-to-lymphocyte ratio, lower platelet-to-lymphocyte ratio, higher sodium, and lower white blood cells were independently associated with better 1-year and overall survival after adjusting for the predictions made by METSSS.
Based on these discoveries, clinicians might choose to consult SORG-MLA instead of METSSS for survival estimation in patients with long-bone metastases presenting for evaluation of local treatment. Basing a treatment decision on the predictions of SORG-MLA could be beneficial when a patient's individualized risk threshold for opting to undergo a particular treatment strategy ranged from 0.2 to 0.9. Future studies might investigate relevant laboratory items when constructing or refining a survival estimation model because these data demonstrated prognostic value independent of the predictions of the METSSS model, and future studies might also seek to keep these models up to date using data from diverse, contemporary patients undergoing both modern operative and nonoperative treatments.
Level III, diagnostic study.
Lee CC, Chen CW, Yen HK, Lin YP, Lai CY, Wang JL, Groot OQ, Janssen SJ, Schwab JH, Hsu FM, Lin WH
《-》