Medical literature: Classes of evidence, FDA guidelines, basic statistics, and study design

Author(s): Oscar O. Ortiz Vargas, MD

Originally published: 09/20/2014

Last updated: 09/07/2018

Why all clinicians need to use scientific medical literature

Medical knowledge changes rapidly1, and what can be learned attending CME activities or didactic sessions is not enough to keep pace with the changes2,3.

When and how to use medical literature

Reading all medical literature available to stay abreast of medical advances is not only impractical, but counterproductive. The volume is simply too vast.

The most effective way to gather pertinent and high quality medical information is through the “pull and push” approach4. The “push” approach, or “just in case” learning, refers to learning from periodical sources such as medical journals, webpages, etc. This method is useful for important new, valid research. The “pull” approach, also called “just in time” learning, refers to gathering information when needed, when clinical questions arise.

Pull approach steps:

  1. Formulation of an answerable question
  2. Search for the best evidence
  3. Critical appraisal of the evidence

 

  1. Formulation of an answerable question

To find relevant answers, it is essential to formulate detailed, well-focused questions. Although non-formatted search queries might work5, the population-intervention-comparison-outcome (PICO) format retrieves more relevant information6 when used in PubMed Clinical Queries.

To write a PICO question, simply frame the question describing population or problem (P) of interest, indicator or intervention (I), comparison (C), and the outcome (O) of interest.

Example of a PICO question:

Clinical question: 60-year-old male with medial compartment knee OA and pain. Would a lateral wedge insole be better at improving knee pain than a valgus brace?
P: knee OA
I: lateral wedge insoles
C: valgus knee brace
O: pain
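
The PICO elements of the example can be combined into a Boolean search string. The sketch below is only illustrative: `PicoQuestion` and `to_query` are hypothetical names, and joining synonyms with OR and elements with AND is a common search convention assumed here, not a format prescribed by PubMed Clinical Queries.

```python
from dataclasses import dataclass, field


@dataclass
class PicoQuestion:
    """Hypothetical container for the four PICO elements."""
    population: list[str]
    intervention: list[str]
    comparison: list[str] = field(default_factory=list)
    outcome: list[str] = field(default_factory=list)

    def to_query(self) -> str:
        """Join synonyms with OR inside each element, and elements with AND."""
        groups = []
        for terms in (self.population, self.intervention,
                      self.comparison, self.outcome):
            if terms:  # skip elements that were left empty
                quoted = [f'"{term}"' for term in terms]
                groups.append("(" + " OR ".join(quoted) + ")")
        return " AND ".join(groups)


question = PicoQuestion(
    population=["knee osteoarthritis", "knee OA"],
    intervention=["lateral wedge insole"],
    comparison=["valgus knee brace"],
    outcome=["pain"],
)
query = question.to_query()
print(query)
```

Dropping the comparison or outcome terms broadens the search, which may help when the full question retrieves too few papers.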

Background questions (general knowledge about disorders, tests, or treatments) are best answered using reference textbooks or review articles such as PM&R Knowledge NOW. 

  2. Search for the best evidence

Several electronic bibliographic databases are available for searches. Pre-appraised databases might save time, but are influenced by expert opinion and are not as comprehensive as primary databases.

Pre-appraised databases
·         Trip database

·         PEDro

·         REHAB+

·         UpToDate

·         Cochrane Library

Among primary sources, Clinical Queries from PubMed is recommended as a starting point. It includes filters to improve the efficiency and specificity of the search7. 

Recommended search strategy

Start with an unfiltered search on PubMed “Clinical Queries” and then progressively add filters and Boolean operators until the number of papers retrieved is 20-50. If the search does not retrieve relevant articles, add synonyms or decrease terms used in the search. If yield is still insufficient, do a direct search in PubMed or Google Scholar. Once enough papers are found, prioritize them from highest to lowest level of evidence, and review them in descending order until the question is answered.
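
The final prioritization step can be sketched in a few lines. This is a minimal illustration that assumes each retrieved paper has already been assigned an Oxford level (as listed in Table 1); the function name and the paper titles are placeholders, not real search results.

```python
# Oxford levels as listed in Table 1, ordered from highest to lowest.
LEVEL_ORDER = ["1a", "1b", "1c", "2a", "2b", "2c", "3a", "3b", "4", "5"]


def prioritize(papers):
    """Sort (title, level) pairs from highest to lowest level of evidence."""
    rank = {level: position for position, level in enumerate(LEVEL_ORDER)}
    return sorted(papers, key=lambda paper: rank[paper[1]])


# Hypothetical search results: the titles are placeholders.
retrieved = [
    ("Case series A", "4"),
    ("Systematic review B", "1a"),
    ("Cohort study C", "2b"),
    ("Randomized trial D", "1b"),
]
reading_list = prioritize(retrieved)
for title, level in reading_list:
    print(level, title)
```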

Study design and levels of evidence

The study design determines the level of evidence of the study. Clinicians should identify the type of study independently to avoid common mislabeling8.

An easy way to determine the study design is to use the Oxford Centre for Evidence-Based Medicine algorithm.

What was the aim of the study?

  • To describe occurrence of outcome –> descriptive study (case reports, case series)

  • To quantify association between treatment/exposure and outcome –> analytic (relational) study

If analytic, was the intervention randomly allocated?

  • Yes? –> randomized trial or randomized controlled trial (RCT)

  • No? –> observational study

If observational study, when were the outcomes determined?

1.      Some time after the exposure or intervention? –> cohort study (prospective study)

2.      At the same time as the exposure or intervention? –> cross-sectional study or survey

3.      Before the exposure was determined? –> case-control study (retrospective study based on recall of the exposure)
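
The three questions of this algorithm map naturally onto a small decision function. The sketch below is one possible encoding; the function name, argument names, and the string labels for the answers are illustrative assumptions rather than part of the Oxford algorithm itself.

```python
def classify_study_design(aim, randomized=None, outcome_timing=None):
    """Walk the three questions of the algorithm and return a design label.

    aim: "describe" or "quantify"
    randomized: True or False (only consulted for analytic studies)
    outcome_timing: "after", "same_time", or "before" relative to the
        exposure (only consulted for observational studies)
    """
    if aim == "describe":
        return "descriptive study"
    if aim != "quantify":
        raise ValueError(f"unknown aim: {aim!r}")
    if randomized:
        return "randomized controlled trial"
    designs = {
        "after": "cohort study",
        "same_time": "cross-sectional study",
        "before": "case-control study",
    }
    if outcome_timing not in designs:
        raise ValueError(f"unknown outcome timing: {outcome_timing!r}")
    return designs[outcome_timing]


print(classify_study_design("quantify", randomized=False, outcome_timing="before"))
```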

There are several proposed evidence ranking systems to facilitate the rating of medical literature. The Oxford levels of evidence were designed for clinicians searching for answers to clinical questions.

Table 1. Oxford levels of evidence (adapted from the Oxford Centre for Evidence-Based Medicine)

Level | Therapy / Prevention, Etiology / Harm | Prognosis | Diagnosis | Differential diagnosis / symptom prevalence study
1a | SR* (with homogeneity) of RCTs | SR (with homogeneity) of inception cohort studies | SR (with homogeneity) of Level 1 diagnostic studies | SR (with homogeneity) of prospective cohort studies
1b | Individual RCT (with narrow confidence interval) | Individual inception cohort study with > 80% follow-up | Validating cohort study with good reference standards | Prospective cohort study with good follow-up
1c | All or none | All or none case-series | Absolute SpPins and SnNouts# | All or none case-series
2a | SR (with homogeneity) of cohort studies | SR (with homogeneity) of either retrospective cohort studies or untreated control groups in RCTs | SR (with homogeneity) of Level >2 diagnostic studies | SR (with homogeneity) of 2b and better studies
2b | Individual cohort study (including low quality RCT; e.g., <80% follow-up) | Retrospective cohort study or follow-up of untreated control patients in an RCT | Exploratory cohort study with good reference standards | Retrospective cohort study, or poor follow-up
2c | Outcomes research; ecological studies | Outcomes research | - | Ecological studies
3a | SR (with homogeneity) of case-control studies | - | SR (with homogeneity) of 3b and better studies | SR (with homogeneity) of 3b and better studies
3b | Individual case-control study | - | Non-consecutive study; or without consistently applied reference standards | Non-consecutive cohort study, or very limited population
4 | Case-series (and poor quality cohort and case-control studies) | Case-series (and poor quality prognostic cohort studies) | Case-control study, poor or non-independent reference standard | Case-series or superseded reference standards
5 | Expert opinion without explicit critical appraisal, or based on physiology, bench research or “first principles” | Expert opinion without explicit critical appraisal, or based on physiology, bench research or “first principles” | Expert opinion without explicit critical appraisal, or based on physiology, bench research or “first principles” | Expert opinion without explicit critical appraisal, or based on physiology, bench research or “first principles”

* SR = systematic review

# Absolute SpPin = diagnostic finding whose Specificity is so high that a Positive result rules-in the diagnosis. Absolute SnNout = diagnostic finding whose Sensitivity is so high that a Negative result rules-out the diagnosis.

  3. Critical appraisal of the evidence

Clinicians should critically appraise literature and draw their own conclusions: only 5% of primary care studies9 and less than 1% in rehabilitation are considered high quality, mostly due to bias (~50%) and poor scientific design (~20%)10.

Basic statistics concepts to appraise medical literature
The purpose of the scientific method is to translate a research question into a mathematical formulation that supports only two possible hypotheses. The hypothesis to prove false is the null hypothesis (H0) and the one to bolster is the alternative hypothesis (HA). H0 usually reflects the status quo or no observed effect.

Example of a research hypothesis:

  • H0: No association between intervention and disease
  • HA: Association between intervention and disease

The researcher then designs an experiment to test the H0. A test statistic is calculated from the outcome data of the sample. If this test statistic falls within the critical region defined by the alpha level (equivalently, if the p-value is smaller than alpha), then the result is considered statistically significant. In this case, we reject H0 and accept HA. In practice, the proper selection of the statistical methods depends on the nature of the data and study design.

Example of hypothesis testing of a single population mean (adapted from Paul Christos, Dr.P.H.):

  • The population mean cholesterol and standard deviation for men in Lexington are 200 and 50 mg/dL, respectively, based on past literature.
  • A sample mean cholesterol of 300 mg/dL is obtained from a new study of 30 men in Lexington.
  • Does this new sample value validate or contradict the population mean cholesterol previously reported in literature?
  • H0: Population mean cholesterol (µ) = 200 mg/dL -> based on past literature
  • HA: Population mean cholesterol (µ) ≠ 200 mg/dL -> population mean cholesterol is not equal to 200 as reported in past literature
  • Set the α-value (level of statistical significance) to 0.05. For a two-tailed test, the critical values are the z-scores at cumulative probabilities of 0.025 and 0.975, which are ±1.96. If our test statistic falls above 1.96 or below -1.96, then it falls within the critical region.
  • Based on the new sample mean and the population standard deviation, a test statistic (z-score) is calculated: (300-200)/(50/√30) = 10.95. Since 10.95 > 1.96, the test statistic falls in the critical region.
  • Therefore, we reject H0 and accept HA (µ ≠ 200).
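
The same calculation can be reproduced with Python's standard library. This is a minimal sketch of a two-tailed one-sample z-test using the numbers from the example; the variable names are illustrative, and the test assumes the population standard deviation is known.

```python
from math import sqrt
from statistics import NormalDist

mu_0 = 200.0         # population mean under H0 (mg/dL)
sigma = 50.0         # population standard deviation (mg/dL)
sample_mean = 300.0  # mean observed in the new study (mg/dL)
n = 30               # sample size
alpha = 0.05         # level of statistical significance

# Two-tailed critical value: the z-score at cumulative probability 1 - alpha/2.
z_critical = NormalDist().inv_cdf(1 - alpha / 2)

# Test statistic: distance of the sample mean from mu_0 in standard errors.
z = (sample_mean - mu_0) / (sigma / sqrt(n))

# H0 is rejected when the statistic falls in either tail of the critical region.
reject_h0 = abs(z) > z_critical
print(f"z = {z:.2f}, critical value = ±{z_critical:.2f}, reject H0: {reject_h0}")
```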

The level of statistical significance is usually set at < 5% (p < 0.05), as in the above example. Alpha (type I) error (usually set at 5%) is the probability of rejecting a true null hypothesis, or concluding there is an association between exposure and disease when in fact there is not. Beta (type II) error (usually set at 20%) represents the probability of not rejecting a false null hypothesis, or concluding there is no association between exposure and disease when in fact there really is one.

When interpreting p-values, remember:

  1. P-values close to the level of significance do not mean that there is a “weak” correlation or “weak” statistical significance.
  2. The magnitude of the p-values does not correlate with the strength of the HA.
  3. Statistical significance may not be clinically significant, but clinical significance is usually statistically significant.
  4. Lack of statistical significance is not proof of “no effect.” “Accepting” the H0 is not actually proving it.

 

3.1  Evaluate study quality

There are several templates available to guide the evaluation of specific study designs. The general steps are:

  1. Identify the research question (H0 and HA). Is the purpose of the study appropriately translated to an answerable scientific question?
  2. Identify the study design. Is the design capable of answering the question?
  3. Identify possible selection bias. Were the inclusion and exclusion criteria reasonable choices? Was it appropriately randomized? Did all groups have the same prognosis?
  4. Identify possible researcher bias. Were researchers blinded to outcomes and interventions? Check authors’ disclosures.
  5. Identify if placebo effect was reasonably controlled. Were patients blinded to the intervention? Were all groups treated equally?
  6. Evaluate the flow of the subjects throughout the study. Were all subjects kept in the same groups they were initially assigned to, including the dropouts? Did >80% of subjects finish the study? Was intention-to-treat analysis performed?

3.2  Evaluate study results

Clinicians need to be familiar with the interpretation of common summary statistics and effect size measurements to put results into perspective. Appreciation of the effect size can be significantly altered by how the results are presented.

Example (adapted from “Stating the Meaning of Effect Size Measures in Plain English”):

RCT of 20-year sunscreen use vs. placebo for prevention of melanoma:

  • EER (Experimental Event Rate) = 1/1000 = 0.001 –> “The risk of developing melanoma over 20 years in the sunscreen experimental group was 0.1% or 1 in 1000.”
  • CER (Control Event Rate) = 5/1000 = 0.005 –> “The risk of developing melanoma over 20 years with placebo is 0.5%.”
  • ARR (Absolute Risk Reduction) = CER – EER = 0.004 –> “0.4% of patients, or 4 of 1000, are prevented from developing melanoma by using sunscreen.”
  • RR (Relative Risk) = EER/CER = 0.20 –> “People who do not use sunscreens have a 5 times greater risk of developing melanoma over 20 years as compared to those who use sunscreens.”
  • RRR (Relative Risk Reduction) = (CER – EER)/CER = 1 – RR = 0.80 –> “Sunscreen use decreases the risk of developing melanoma by 80% compared with no sunscreen.”
  • NNT (Number-needed-to-treat) = 1/ARR = 250 –> “250 patients would need to be treated with sunscreen rather than placebo for 20 years to prevent one additional case of melanoma.”
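
These measures all derive from the same two event rates, which a short function makes explicit. In this sketch the function name is illustrative, the group sizes of 1000 are an assumption read off the example's denominators, and ARR is taken as CER minus EER so that a reduction in risk is a positive number.

```python
def effect_sizes(events_exp, n_exp, events_ctrl, n_ctrl):
    """Return common effect size measures for a two-arm trial."""
    eer = events_exp / n_exp    # experimental event rate
    cer = events_ctrl / n_ctrl  # control event rate
    arr = cer - eer             # absolute risk reduction
    return {
        "EER": eer,
        "CER": cer,
        "ARR": arr,
        "RR": eer / cer,        # relative risk
        "RRR": arr / cer,       # relative risk reduction, equal to 1 - RR
        "NNT": 1 / arr,         # number needed to treat
    }


# Sunscreen example: 1 event per 1000 treated, 5 events per 1000 on placebo.
results = effect_sizes(1, 1000, 5, 1000)
for name, value in results.items():
    print(f"{name}: {value:.3f}")
```

The function does not guard against a control event rate of zero or identical event rates in both arms, where RR and NNT are undefined.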

 

If the study shows statistically significant results, evaluate:

  1. Misuse or abuse of statistical methods in the process and presentation of the results. For instance, using means to compare ordinal or nominal variables.
  2. The impact or clinical importance of the results. How large is the magnitude of the effect found in the study? Are those effects or differences clinically important? What is the minimally clinically important difference (MCID)?
  3. The possibility that positive results are false positives. The risk of finding false positive results increases with the presence of small samples, small effect sizes, non-standard study designs, multiple null hypotheses tested, or multiple comparisons11.
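
The effect of multiple comparisons on false positives can be quantified with the family-wise error rate. The sketch below assumes the tests are independent and that every null hypothesis is true, which is a simplification; the function name is illustrative.

```python
def familywise_error_rate(alpha, n_tests):
    """Probability of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests


# Chance of at least one spurious "significant" result at alpha = 0.05.
for n_tests in (1, 10, 20):
    rate = familywise_error_rate(0.05, n_tests)
    print(f"{n_tests} tests: {rate:.1%}")
```

Under these assumptions, a study reporting many comparisons at an unadjusted alpha of 0.05 is likely to contain at least one false positive.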

If the study shows statistically non-significant results, analyze confidence intervals (CI)12. A 95% confidence interval (based on alpha = 0.05) is built by a procedure that captures the true value in 95 of 100 repeated samples; the other five intervals would miss it (error rate). If the CI includes a value that can be interpreted as clinically important, then it is reasonable to contemplate the possibility that the study is underpowered and the results are falsely negative.
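
As an illustration of reading a CI, the sketch below computes an approximate 95% interval for the absolute risk reduction in the sunscreen example. It assumes 1000 participants per arm (the example gives only rates) and uses the simple normal-approximation (Wald) interval, which is crude for rare events; the function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def arr_confidence_interval(events_exp, n_exp, events_ctrl, n_ctrl, level=0.95):
    """Normal-approximation (Wald) interval for the absolute risk reduction."""
    eer = events_exp / n_exp
    cer = events_ctrl / n_ctrl
    arr = cer - eer
    # Standard error of a difference between two independent proportions.
    se = sqrt(eer * (1 - eer) / n_exp + cer * (1 - cer) / n_ctrl)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return arr - z * se, arr + z * se


# Assumed group sizes: 1000 per arm.
low, high = arr_confidence_interval(1, 1000, 5, 1000)
print(f"95% CI for ARR: {low:.4f} to {high:.4f}")
```

Under these assumptions the interval crosses zero, so the difference would not reach statistical significance, yet it also contains reductions larger than the observed 0.4%. That is the pattern described above: a non-significant result whose CI still includes clinically important values.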

Power (1 – beta) is usually set at 80% or 90%. A study with only 60% power implies that it only had a 60% chance of correctly accepting HA if it was true. If the results from this underpowered study are not found to be statistically significant with p > 0.05, then the conclusions are questionable. In contrast, a well-powered study (99%) has a 99% chance of correctly accepting HA if it was true. If the results are without statistical significance (p > 0.05), then H0 is acceptable.
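
Power depends on the sample size, the variability, and the size of the true effect. The sketch below computes the power of a two-tailed one-sample z-test; the 10 mg/dL difference and the sample sizes are hypothetical values chosen for illustration, with the 50 mg/dL standard deviation borrowed from the cholesterol example.

```python
from math import sqrt
from statistics import NormalDist


def z_test_power(delta, sigma, n, alpha=0.05):
    """Power of a two-tailed one-sample z-test to detect a true shift delta."""
    std_normal = NormalDist()
    z_critical = std_normal.inv_cdf(1 - alpha / 2)
    shift = delta * sqrt(n) / sigma  # true difference in standard-error units
    # Probability that the statistic lands in either tail of the critical region.
    return std_normal.cdf(shift - z_critical) + std_normal.cdf(-shift - z_critical)


# Hypothetical true difference of 10 mg/dL with sigma = 50 mg/dL.
for n in (30, 200):
    print(f"n = {n}: power = {z_test_power(delta=10, sigma=50, n=n):.2f}")
```

With the smaller sample, a real 10 mg/dL difference would usually be missed, so a non-significant result from such a study says little about H0.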

FDA guidelines

The FDA assures the safety, efficacy and security of drugs, biological products, and medical devices. FDA regulations apply to clinical investigations conducted on medical products that will be marketed in the U.S., including but not limited to drugs, devices, and biologics.

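Applications are required for:

  • Investigational New Drug (IND)
  • Investigational Device Exemption (IDE)
  • New Drug Application (NDA)
  • Biologics License Application (BLA)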

An IND has been previously screened for pharmacological activity and acute toxicity potential in animals, and is pending investigation for its diagnostic or therapeutic potential in humans. This investigation is typically divided into three phases:

  1. Phase 1: introduction of IND to patients or control volunteer subjects (n = 20-80) to determine drug pharmacokinetics and pharmacologic effects in humans, side effects, and if possible, clinical efficacy.
  2. Phase 2: controlled studies, typically no more than several hundred subjects, to evaluate drug efficacy for specific indications, common short-term side effects and associated risk.
  3. Phase 3: expanded controlled and uncontrolled trials ranging from several hundred to several thousand subjects, in order to gather further information on efficacy and safety, determine the overall drug benefit-risk relationship, and provide a foundation for physician labeling.

Optional phase 4 studies occur post-approval to determine long-term efficacy and safety of certain drugs. Not all drugs will require phase 4 trials. After an IND is shown to have clinical efficacy and an acceptable safety profile, it can then become the subject of an NDA or BLA.

An IDE application allows an investigational device to be used in a clinical study to collect safety and effectiveness data. This includes clinical evaluation of certain modifications or new intended uses of legally marketed devices. All clinical evaluations of investigational devices must have an approved IDE before the study is initiated.

In the U.S., interventional studies with a high level of evidence (often randomized trials, equivalent to OCEBM level 2 or above) will be considered by the FDA to approve an innovative drug or intervention. The design of the study (patient population, outcome measures, doses) is reflected in the terms of the approved indication for the intervention. The FDA publishes guidelines regarding good practices and acceptable outcomes to target, as well as the way to measure them in a trial.

REFERENCES

  1. Densen P. Challenges and opportunities facing medical education. Trans Am Clin Climatol Assoc. 2010;122:48-58.
  2. Bordage G, Carlin B, Mazmanian PE; American College of Chest Physicians Health and Science Policy Committee. Continuing medical education effect on physician knowledge: effectiveness of continuing medical education: American College of Chest Physicians Evidence-Based Educational Guidelines. Chest. 2009;135(Suppl 3):29S-36S.
  3. Davis D, O’Brien M, Freemantle N, Wolf F, Mazmanian P, Taylor-Vaisey A. Impact of formal continuing medical education: do conferences, workshops, rounds, and other traditional continuing education activities change physician behavior or health care outcomes? JAMA. 1999;282(9):867-874.
  4. Nutley S, Davies H, Walter I. Evidence-based policy and practice: cross-sector lessons from the United Kingdom. Social Policy J New Zealand. 2003;20:29-48.
  5. Hoogendam A, de Vries Robbe P, Overbeke A. Comparing patient characteristics, type of intervention, control, and outcome (PICO) queries with unguided searching: a randomized controlled crossover trial. J Med Libr Assoc. 2012;100(2):121-126.
  6. Agoritsas T, Merglen A, Courvoisier D, et al. Sensitivity and predictive value of 15 PubMed search strategies to answer clinical questions rated against full systematic reviews. J Med Internet Res. 2012;14(3):e81.
  7. Haynes R, McKibbon K, Wilczynski N, Walter S, Werre S, Hedges T. Optimal search strategies for retrieving scientifically strong studies of treatment from Medline: analytical survey. BMJ. 2005;330(7501):1179.
  8. Mayo NE, Goldberg MS. When is a case-control study not a case-control study? J Rehabil Med. 2009;41:209-216.
  9. McKibbon K, Wilczynski N, Haynes R. What do evidence-based secondary journals tell us about the publication of clinically important articles in primary healthcare journals? BMC Med. 2004;2:33.
  10. Chalmers I, Glasziou P. Avoidable waste in the production and reporting of research evidence. Lancet. 2009;374(9683):86-89.
  11. Ioannidis JP. Why most published research findings are false. PLoS Med. 2005;2(8):e124.
  12. Colegrave N, Ruxton GD. Confidence intervals are a more useful complement to nonsignificant tests than are power calculations. Behavioral Ecology. 2003;14(3):446-447.

Original Version of the Topic

Oscar O. Ortiz Vargas, MD. Medical literature: Classes of evidence, FDA guidelines, basic statistics, and study design for understanding medical literature. 09/20/2014.

Author Disclosure

Xiaoning Yuan, MD
Nothing to Disclose
