Sample size in multistakeholder Delphi surveys: at what minimum sample size do replicability of results stabilize?
The minimum sample size for multistakeholder Delphi surveys remains understudied. Drawing on three large international multistakeholder Delphi surveys, this study aimed to: 1) investigate the effect of increasing sample size on the replicability of results; and 2) assess whether the level of replicability differed by participant characteristics (for example, gender, age, and profession).
We used data from Delphi surveys to develop guidance for improved reporting of health-care intervention trials: SPIRIT (Standard Protocol Items: Recommendations for Interventional Trials) and CONSORT (Consolidated Standards of Reporting Trials) extension for surrogate end points (n = 175, 22 items rated); CONSORT-SPI [CONSORT extension for Social and Psychological Interventions] (n = 333, 77 items rated); and a core outcome set for burn care (n = 553, 88 items rated). Resampling with replacement was used to draw random subsamples from the participant data set in each of the three surveys. For each subsample, the median value of all rated survey items was calculated and compared to the medians from the full participant data set. The median number (and interquartile range) of medians replicated was used to calculate the percentage replicability (and variability). High replicability was defined as ≥80% and moderate as ≥60% and <80%.
The average median replicability (variability) as a percentage of the total number of items rated from the three datasets was 81% (10%) at a sample size of 60. In one of the datasets (CONSORT-SPI), ≥80% replicability was reached at a sample size of 80. On average, increasing the sample size from 80 to 160 increased the replicability of results by a further 3% and reduced variability by 1%. In subgroup analyses based on participant characteristics (eg, gender, age, professional role), resampled samples of 20 to 100 showed that a sample size of 20 to 30 resulted in moderate replicability levels of 64% to 77%.
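The resampling procedure described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' code. It assumes that a median counts as "replicated" only when the subsample median exactly equals the full-sample median, and that there are no missing ratings. The function name `replicability`, the number of resamples, and the synthetic 1 to 9 ratings in the usage section are all hypothetical choices.

```python
import random
import statistics


def replicability(ratings, subsample_size, n_resamples=1000, seed=0):
    """Return (median % of item medians replicated, IQR as % of items).

    ratings: list of participants, each a list of item scores.
    Assumption: "replicated" means the subsample median equals the
    full-sample median exactly.
    """
    rng = random.Random(seed)
    n_items = len(ratings[0])
    # Reference medians computed from the full participant data set.
    full_medians = [
        statistics.median(p[i] for p in ratings) for i in range(n_items)
    ]

    replicated_counts = []
    for _ in range(n_resamples):
        # Draw participants with replacement.
        sample = rng.choices(ratings, k=subsample_size)
        matches = sum(
            statistics.median(p[i] for p in sample) == full_medians[i]
            for i in range(n_items)
        )
        replicated_counts.append(matches)

    # Quartiles of the number of replicated medians across resamples.
    q1, _, q3 = statistics.quantiles(replicated_counts, n=4)
    median_pct = 100 * statistics.median(replicated_counts) / n_items
    iqr_pct = 100 * (q3 - q1) / n_items
    return median_pct, iqr_pct


if __name__ == "__main__":
    # Synthetic data only: 175 participants rating 22 items on a 1-9 scale.
    data_rng = random.Random(1)
    synthetic = [[data_rng.randint(1, 9) for _ in range(22)] for _ in range(175)]
    for size in (20, 60, 100):
        print(size, replicability(synthetic, size, n_resamples=200))
```

Running the function over a range of subsample sizes and plotting the two returned values against size is one way to see where replicability levels off.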
We found that a minimum sample size of 60-80 participants in multistakeholder Delphi surveys provides a high level of replicability (≥80%) in the results. For Delphi studies limited to individual stakeholder groups (such as researchers, clinicians, patients), a sample size of 20 to 30 per group may be sufficient.
Manyara AM
,Purvis A
,Ciani O
,Collins GS
,Taylor RS
... -
《-》
Falls prevention interventions for community-dwelling older adults: systematic review and meta-analysis of benefits, harms, and patient values and preferences.
About 20-30% of older adults (≥ 65 years old) experience one or more falls each year, and falls are associated with substantial burden to the health care system, individuals, and families from resulting injuries, fractures, and reduced functioning and quality of life. Many interventions for preventing falls have been studied, and their effectiveness, factors relevant to their implementation, and patient preferences may determine which interventions to use in primary care. The aim of this set of reviews was to inform recommendations by the Canadian Task Force on Preventive Health Care (task force) on fall prevention interventions. We undertook three systematic reviews to address questions about the following: (i) the benefits and harms of interventions, (ii) how patients weigh the potential outcomes (outcome valuation), and (iii) patient preferences for different types of interventions, and their attributes, shown to offer benefit (intervention preferences).
We searched four databases for benefits and harms (MEDLINE, Embase, AgeLine, CENTRAL, to August 25, 2023) and three for outcome valuation and intervention preferences (MEDLINE, PsycINFO, CINAHL, to June 9, 2023). For benefits and harms, we relied heavily on a previous review for studies published until 2016. We also searched trial registries, references of included studies, and recent reviews. Two reviewers independently screened studies. The population of interest was community-dwelling adults ≥ 65 years old. We did not limit eligibility by participant fall history. The task force rated several outcomes, decided on their eligibility, and provided input on the effect thresholds to apply for each outcome (fallers, falls, injurious fallers, fractures, hip fractures, functional status, health-related quality of life, long-term care admissions, adverse effects, serious adverse effects). For benefits and harms, we included a broad range of non-pharmacological interventions relevant to primary care. Although usual care was the main comparator of interest, we included studies comparing interventions head-to-head and conducted network meta-analyses (NMAs) for each outcome, enabling analysis of interventions lacking direct comparisons to usual care. For benefits and harms, we included randomized controlled trials with a minimum 3-month follow-up and reporting on one of our fall outcomes (fallers, falls, injurious fallers); for the other questions, we preferred quantitative data but considered qualitative findings to fill gaps in evidence. No date limits were applied for benefits and harms, whereas for outcome valuation and intervention preferences we included studies published in 2000 or later. All data were extracted by one trained reviewer and verified for accuracy and completeness.
For benefits and harms, we relied on the previous review team's risk-of-bias assessments for benefit outcomes, but otherwise, two reviewers independently assessed the risk of bias (within and across studies). For the other questions, one reviewer verified another's assessments. Consensus was used, with adjudication by a lead author when necessary. A coding framework, modified from the ProFANE taxonomy, classified interventions and their attributes (e.g., supervision, delivery format, duration/intensity). For benefit outcomes, we employed random-effects NMA using a frequentist approach and a consistency model. Transitivity and coherence were assessed using meta-regressions and global and local coherence tests, as well as through graphical display and descriptive data on the composition of the nodes with respect to major pre-planned effect modifiers. We assessed heterogeneity using prediction intervals. For intervention-related adverse effects, we pooled proportions, except for vitamin D, for which we considered data in the control groups and undertook random-effects pairwise meta-analysis using a relative risk (any adverse effects) or risk difference (serious adverse effects). For outcome valuation, we pooled disutilities (representing the impact of a negative event, e.g., a fall, on one's usual quality of life, with 0 = no impact, 1 = death, and ~0.05 indicating important disutility) from the EQ-5D utility measurement using the inverse variance method and a random-effects model and explored heterogeneity. When studies only reported other data, we compared the findings with our main analysis. For intervention preferences, we used a coding schema identifying whether there were strong, clear, no, or variable preferences within, and then across, studies. We assessed the certainty of evidence for each outcome using CINeMA for benefit outcomes and GRADE for all other outcomes.
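The inverse-variance, random-effects pooling mentioned for disutilities can be illustrated with a short sketch. The abstract does not say which between-study variance estimator was used, so the DerSimonian-Laird estimator below is an assumption, as are the function name `pool_random_effects` and the input values in the usage section (which are invented and not taken from the review).

```python
import math


def pool_random_effects(estimates, variances):
    """Inverse-variance random-effects pooling.

    Assumption: between-study variance (tau^2) is estimated with the
    DerSimonian-Laird method. Returns (pooled estimate, standard error, tau^2).
    """
    k = len(estimates)
    # Fixed-effect (inverse-variance) weights and pooled estimate.
    w = [1.0 / v for v in variances]
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, estimates)) / sw
    # Cochran's Q measures dispersion of study estimates around the fixed estimate.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, estimates))
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    # Random-effects weights add tau^2 to each study's variance.
    w_re = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_re, estimates)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))
    return pooled, se, tau2


if __name__ == "__main__":
    # Hypothetical study-level disutilities and their variances.
    est = [0.12, 0.20, 0.15, 0.25]
    var = [0.0010, 0.0020, 0.0015, 0.0030]
    pooled, se, tau2 = pool_random_effects(est, var)
    low, high = pooled - 1.96 * se, pooled + 1.96 * se
    print(f"pooled={pooled:.3f} (95% CI {low:.3f} to {high:.3f}), tau2={tau2:.5f}")
```

When the study estimates agree closely, tau^2 is truncated to zero and the result equals the fixed-effect inverse-variance estimate.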
A total of 290 studies were included across the reviews, with two studies included in multiple questions. For benefits and harms, we included 219 trials reporting on 167,864 participants and created 59 interventions (nodes). Transitivity and coherence were assessed as adequate. Across eight NMAs, the number of contributing trials ranged between 19 and 173, and the number of interventions ranged from 19 to 57. Approximately half of the interventions in each network had at least low certainty for benefit. The fallers outcome had the highest number of interventions with moderate certainty for benefit (18/57). For the non-fall outcomes (fractures, hip fracture, long-term care [LTC] admission, functional status, health-related quality of life), many interventions had very low certainty evidence, often from lack of data. We prioritized findings from 21 interventions where there was moderate certainty for at least some benefit. Fourteen of these had a focus on exercise, the majority being supervised (for > 2 sessions) and of long duration (> 3 months), and with balance/resistance and group Tai Chi interventions generally having the most outcomes with at least low certainty for benefit. None of the interventions having moderate certainty evidence focused on walking. Whole-body vibration or home-hazard assessment (HHA) plus exercise provided to everyone showed moderate certainty for some benefit. No multifactorial intervention alone showed moderate certainty for any benefit. Six interventions only had very low certainty evidence for the benefit outcomes. Two interventions had moderate certainty of harmful effects for at least one benefit outcome, though the populations across studies were at high risk for falls. Vitamin D and most single-component exercise interventions are probably associated with minimal adverse effects. Some uncertainty exists about possible adverse effects from other interventions.
For outcome valuation, we included 44 studies of which 34 reported EQ-5D disutilities. Admission to long-term care had the highest disutility (1.0), but the evidence was rated as low certainty. Both fall-related hip (moderate certainty) and non-hip (low certainty) fracture may result in substantial disutility (0.53 and 0.57) in the first 3 months after injury. Disutility for both hip and non-hip fractures is probably lower 12 months after injury (0.16 and 0.19, with high and moderate certainty, respectively) compared to within the first 3 months. No study measured the disutility of an injurious fall. Fractures are probably more important than either falls (0.09 over 12 months) or functional status (0.12). Functional status may be somewhat more important than falls. For intervention preferences, 29 studies (9 qualitative) reported on 17 comparisons among single-component interventions showing benefit. Exercise interventions focusing on balance and/or resistance training appear to be clearly preferred over Tai Chi and other forms of exercise (e.g., yoga, aerobic). For exercise programs in general, there is probably variability among people in whether they prefer group or individual delivery, though there was high certainty that individual was preferred over group delivery of balance/resistance programs. Balance/resistance exercise may be preferred over education, though the evidence was low certainty. There was low certainty for a slight preference for education over cognitive-behavioral therapy, and group education may be preferred over individual education.
To prevent falls among community-dwelling older adults, evidence is most certain for benefit, at least over 1-2 years, from supervised, long-duration balance/resistance and group Tai Chi interventions, whole-body vibration, high-intensity/dose education or cognitive-behavioral therapy, and interventions of comprehensive multifactorial assessment with targeted treatment plus HHA, HHA plus exercise, or education provided to everyone. Adding other interventions to exercise does not appear to substantially increase benefits. Overall, effects appear most applicable to those with elevated fall risk. Choice among the effective interventions that are available may best depend on individual patient preferences, though when implementing new balance/resistance programs, delivering individual rather than group sessions, when feasible, may be most acceptable. Data on more patient-important outcomes, including fall-related fractures and adverse effects, would be beneficial, as would studies focusing on equity-deserving populations and on programs delivered virtually.
Not registered.
Pillay J
,Gaudet LA
,Saba S
,Vandermeer B
,Ashiq AR
,Wingert A
,Hartling L
... -
《Systematic Reviews》
The effect of sample site and collection procedure on identification of SARS-CoV-2 infection.
Sample collection is a key driver of accuracy in the diagnosis of SARS-CoV-2 infection. Viral load may vary at different anatomical sampling sites and accuracy may be compromised by difficulties obtaining specimens and the expertise of the person taking the sample. It is important to optimise sampling accuracy within cost, safety and accessibility constraints.
To compare the sensitivity of different sampling collection sites and methods for the detection of current SARS-CoV-2 infection with any molecular or antigen-based test.
Electronic searches of the Cochrane COVID-19 Study Register and the COVID-19 Living Evidence Database from the University of Bern (which includes daily updates from PubMed and Embase and preprints from medRxiv and bioRxiv) were undertaken on 22 February 2022. We included independent evaluations from national reference laboratories, FIND and the Diagnostics Global Health website. We did not apply language restrictions.
We included studies of symptomatic or asymptomatic people with suspected SARS-CoV-2 infection undergoing testing. We included studies of any design that compared results from different sample types (anatomical location, operator, collection device) collected from the same participant within a 24-hour period.
Within a sample pair, we defined a reference sample and an index sample collected from the same participant within the same clinical encounter (within 24 hours). Where the sample comparison was different anatomical sites, the reference standard was defined as a nasopharyngeal or combined naso/oropharyngeal sample collected into the same sample container and the index sample as the alternative anatomical site. Where the sample comparison was concerned with differences in the sample collection method from the same site, we defined the reference sample as that closest to standard practice for that sample type. Where the sample pair comparison was concerned with differences in personnel collecting the sample, the sample collected by the more skilled or experienced operator was considered the reference sample. Two review authors independently assessed the risk of bias and applicability concerns using the QUADAS-2 and QUADAS-C checklists, tailored to this review. We present estimates of the difference in sensitivity (reference sample sensitivity (%) minus index sample sensitivity (%)) in a pair and as an average across studies for each index sampling method using forest plots and tables. We examined heterogeneity between studies according to population (age, symptom status) and index sample (time post-symptom onset, operator expertise, use of transport medium) characteristics.
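The paired sensitivity difference described in these methods can be illustrated with a small sketch. The abstract does not state how infection status was determined within a pair, so the code assumes a participant counts as infected if either sample in the pair tested positive. That composite rule and the function name `sensitivity_difference` are hypothetical, and the example counts are invented.

```python
def sensitivity_difference(pairs):
    """Compute paired sensitivities and their difference.

    pairs: list of (reference_positive, index_positive) booleans,
    one tuple per participant.
    Assumption: a participant is infected if either sample is positive.
    Returns (reference sensitivity %, index sensitivity %,
    reference minus index in percentage points).
    """
    infected = [(ref, idx) for ref, idx in pairs if ref or idx]
    if not infected:
        raise ValueError("no infected participants in the sample pairs")
    n = len(infected)
    ref_sens = 100.0 * sum(ref for ref, _ in infected) / n
    idx_sens = 100.0 * sum(idx for _, idx in infected) / n
    return ref_sens, idx_sens, ref_sens - idx_sens


if __name__ == "__main__":
    # Invented counts: 8 pairs positive on both samples, 2 positive on the
    # reference sample only, 90 negative on both.
    example = [(True, True)] * 8 + [(True, False)] * 2 + [(False, False)] * 90
    print(sensitivity_difference(example))
```

With the reference-minus-index definition, a positive difference means the index sample is less sensitive. Negating the result gives the index-minus-reference convention, under which a less sensitive index sample has a negative difference.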
This review includes 106 studies reporting 154 evaluations and 60,523 sample pair comparisons, of which 11,045 had SARS-CoV-2 infection. Ninety evaluations were of saliva samples, 37 nasal, seven oropharyngeal, six gargle, six oral and four combined nasal/oropharyngeal samples. Four evaluations were of the effect of operator expertise on the accuracy of three different sample types. The majority of included evaluations (146) used molecular tests, of which 140 used RT-PCR (reverse transcription polymerase chain reaction). Eight evaluations were of nasal samples used with Ag-RDTs (rapid antigen tests). The majority of studies were conducted in Europe (35/106, 33%) or the USA (27%) and conducted in dedicated COVID-19 testing clinics or in ambulatory hospital settings (53%). Targeted screening or contact tracing accounted for only 4% of evaluations. Where reported, the majority of evaluations were of adults (91/154, 59%), 28 (18%) were in mixed populations with only seven (4%) in children. The median prevalence of confirmed SARS-CoV-2 was 23% (interquartile range (IQR) 13% to 40%). Risk of bias and applicability assessment were hampered by poor reporting in 77% and 65% of included studies, respectively. Risk of bias was low across all domains in only 3% of evaluations due to inappropriate inclusion or exclusion criteria, unclear recruitment, lack of blinding, nonrandomised sampling order or differences in testing kit within a sample pair. Sixty-eight percent of evaluation cohorts were judged as being at high or unclear applicability concern either due to inflation of the prevalence of SARS-CoV-2 infection in study populations by selectively including individuals with confirmed PCR-positive samples or because there was insufficient detail to allow replication of sample collection.
When used with RT-PCR:
• There was no evidence of a difference in sensitivity between gargle and nasopharyngeal samples (on average -1 percentage points, 95% CI -5 to +2, based on 6 evaluations, 2138 sample pairs, of which 389 had SARS-CoV-2).
• There was no evidence of a difference in sensitivity between saliva collection from the deep throat and nasopharyngeal samples (on average +10 percentage points, 95% CI -1 to +21, based on 2192 sample pairs, of which 730 had SARS-CoV-2).
• There was evidence that saliva collection using spitting, drooling or salivating was, on average, 12 percentage points less sensitive (95% CI -16 to -8, based on 27,253 sample pairs, of which 4636 had SARS-CoV-2) compared to nasopharyngeal samples. We did not find any evidence of a difference in the sensitivity of saliva collected using spitting, drooling or salivating (sensitivity difference: range from -13 percentage points (spit) to -21 percentage points (salivate)).
• Nasal samples (anterior and mid-turbinate collection combined) were, on average, 12 percentage points less sensitive compared to nasopharyngeal samples (95% CI -17 to -7), based on 9291 sample pairs, of which 1485 had SARS-CoV-2. We did not find any evidence of a difference in sensitivity between nasal samples collected from the mid-turbinates (3942 sample pairs) and those collected from the anterior nares (8272 sample pairs).
• There was evidence that oropharyngeal samples were, on average, 17 percentage points less sensitive than nasopharyngeal samples (95% CI -29 to -5), based on seven evaluations, 2522 sample pairs, of which 511 had SARS-CoV-2.
A much smaller volume of evidence was available for combined nasal/oropharyngeal samples and oral samples. Age, symptom status and use of transport media do not appear to affect the sensitivity of saliva samples and nasal samples.
When used with Ag-RDTs:
• There was no evidence of a difference in sensitivity between nasal samples and nasopharyngeal samples (sensitivity difference, on average, 0 percentage points, -0.2 to +0.2, based on 3688 sample pairs, of which 535 had SARS-CoV-2).
When used with RT-PCR, there is no evidence for a difference in sensitivity of self-collected gargle or deep-throat saliva samples compared to nasopharyngeal samples collected by healthcare workers. Use of these alternative, self-collected sample types has the potential to reduce cost and discomfort and improve the safety of sampling by reducing the risk of transmission from aerosol spread, which occurs as a result of coughing and gagging during the nasopharyngeal or oropharyngeal sample collection procedure. This may, in turn, improve access to and uptake of testing. Other types of saliva, nasal, oral and oropharyngeal samples are, on average, less sensitive compared to healthcare worker-collected nasopharyngeal samples, and it is unlikely that sensitivities of this magnitude would be acceptable for confirmation of SARS-CoV-2 infection with RT-PCR. When used with Ag-RDTs, there is no evidence of a difference in sensitivity between nasal samples and healthcare worker-collected nasopharyngeal samples for detecting SARS-CoV-2. The implications of this for self-testing are unclear as evaluations did not report whether nasal samples were self-collected or collected by healthcare workers. Further research is needed in asymptomatic individuals and children, and on Ag-RDTs, and to investigate the effect of operator expertise on accuracy. Quality assessment of the evidence base underpinning these conclusions was restricted by poor reporting. There is a need for further high-quality studies, adhering to reporting standards for test accuracy studies.
Davenport C
,Arevalo-Rodriguez I
,Mateos-Haro M
,Berhane S
,Dinnes J
,Spijker R
,Buitrago-Garcia D
,Ciapponi A
,Takwoingi Y
,Deeks JJ
,Emperador D
,Leeflang MMG
,Van den Bruel A
,Cochrane COVID-19 Diagnostic Test Accuracy Group
... -
《Cochrane Database of Systematic Reviews》