Medical Literature: Classes of Evidence, FDA Guidelines, Basic Statistics, and Study Design

Author(s): Sunil K Jain, MD, Julia H DeLuca, BS

Originally published: September 20, 2014 Last updated: September 13, 2023

Overview and Description

Why all clinicians need to use scientific medical literature

Medical knowledge changes rapidly,¹ and what can be learned attending CME activities or didactic sessions is not enough to keep pace with the changes.²,³

When and how to use medical literature

Reading all medical literature available to stay abreast of medical advances is not only impractical, but counterproductive. The volume is simply too vast.

The most effective way to gather pertinent and high quality medical information is through the “pull and push” approach.⁴ The “push” approach, or “just in case” learning, refers to learning from periodical sources such as medical journals, webpages, etc. This method is useful for important new, valid research. The “pull” approach, also called “just in time” learning, refers to gathering information when needed, when clinical questions arise.

Pull approach steps

Formulation of an answerable question
Search for the best evidence
Critical appraisal of the evidence

Relevance To Clinical Practice

Formulation of an Answerable Question

To find relevant answers, it is essential to formulate detailed, well-focused questions. Although non-formatted search queries might work,⁵ the population-intervention-comparison-outcome (PICO) format retrieves more relevant information⁶ when used in PubMed Clinical Queries.

To write a PICO question, simply frame the question describing population or problem (P) of interest, indicator or intervention (I), comparison (C), and the outcome (O) of interest.

Example of a PICO question

Clinical Question	60-year-old male with medial compartment knee OA and pain. Would a lateral wedge insole be better at improving knee pain than a valgus brace?
P	knee OA
I	lateral wedge insoles
C	valgus knee brace
O	Pain
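As an illustration of how the four PICO elements can be turned into a Boolean search string, here is a minimal Python sketch. The function name `build_pico_query` and the synonym "knee osteoarthritis" are assumptions made for this example. The output is a generic AND/OR string, not an official PubMed syntax guarantee.

```python
def build_pico_query(pico):
    """Combine PICO elements into one Boolean search string.

    `pico` maps each letter (P, I, C, O) to a list of synonyms.
    Synonyms are joined with OR; the four elements are joined with AND.
    """
    groups = []
    for key in ("P", "I", "C", "O"):
        terms = pico.get(key, [])
        if terms:  # skip elements left empty
            quoted = " OR ".join(f'"{term}"' for term in terms)
            groups.append(f"({quoted})")
    return " AND ".join(groups)


# Example based on the clinical question above (synonym list is illustrative).
pico = {
    "P": ["knee OA", "knee osteoarthritis"],
    "I": ["lateral wedge insoles"],
    "C": ["valgus knee brace"],
    "O": ["pain"],
}
query = build_pico_query(pico)
print(query)
```

Dropping an element from the dictionary broadens the search, which mirrors the advice to decrease the terms used when too few articles are retrieved.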

Background questions (general knowledge about disorders, tests, or treatments) are best answered using reference textbooks or review articles such as PM&R Knowledge NOW.

Search for the Best Evidence

Several electronic bibliographic databases are available for searches. Pre-appraised databases might save time but are influenced by expert opinion and are not as comprehensive as primary databases.

Pre-appraised databases

• Trip database
• PEDro
• Essential Evidence Plus
• REHAB+
• UpToDate
• Cochrane Library

Among primary sources, Clinical Queries from PubMed is recommended as a starting point. It includes filters to improve the efficiency and specificity of the search⁷.

Recommended search strategy

Start with an unfiltered search on PubMed “Clinical Queries” and then progressively add filters and Boolean operators until the number of papers retrieved is 20-50. If the search does not retrieve relevant articles, add synonyms or decrease terms used in the search. If yield is still insufficient, do a direct search in PubMed or Google Scholar. Once enough papers are found, prioritize them from highest to lowest level of evidence, and review them in descending order until the question is answered.

Study design and levels of evidence

The study design determines the level of evidence of the study. Clinicians should identify the type of study independently to avoid common mislabeling.⁸

An easy way to determine the study design is to use the Oxford Centre for Evidence-Based Medicine algorithm.

What was the aim of the study?
• To describe occurrence of outcome –> descriptive study (case reports, case series)
• To quantify association between treatment/exposure and outcome –> analytic (relational) study

If analytic, was the intervention randomly allocated?
• Yes? –> randomized trial or randomized controlled trial (RCT)
• No? –> observational study

If observational study, when were the outcomes determined?
• Sometime after the exposure or intervention? –> cohort study (prospective study)
• At the same time as the exposure or intervention? –> cross-sectional study or survey
• Before the exposure was determined? –> case-control study (retrospective study based on recall of the exposure)
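The three questions above form a small decision tree, which can be sketched in Python as follows. The function name and the string labels used for the answers ("describe", "quantify", "after", "same_time", "before") are assumptions chosen for this example.

```python
def classify_study_design(aim, randomized=False, outcome_timing=None):
    """Walk the three-question algorithm and return the study design.

    aim: "describe" (occurrence of outcome) or "quantify" (association).
    randomized: True if the intervention was randomly allocated.
    outcome_timing: "after", "same_time", or "before" the exposure.
    """
    if aim == "describe":
        return "descriptive study"
    if aim != "quantify":
        raise ValueError("aim must be 'describe' or 'quantify'")
    if randomized:
        return "randomized controlled trial"
    # Not randomized: an observational study, split by outcome timing.
    designs = {
        "after": "cohort study",
        "same_time": "cross-sectional study",
        "before": "case-control study",
    }
    if outcome_timing not in designs:
        raise ValueError("outcome_timing must be 'after', 'same_time', or 'before'")
    return designs[outcome_timing]


print(classify_study_design("quantify", randomized=False, outcome_timing="before"))
```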

There are several proposed evidence ranking systems to facilitate the rating of medical literature. The Oxford levels of evidence were designed for clinicians searching for answers to clinical questions.

Table 1. Oxford levels of evidence (adapted from the Oxford Centre for Evidence-Based Medicine)

Level	Therapy / Prevention, Etiology / Harm	Prognosis	Diagnosis	Differential diagnosis / symptom prevalence study
1a	SR^* (with homogeneity) of RCTs	SR (with homogeneity) of inception cohort studies; CDR† validated in different populations	SR (with homogeneity) of Level 1 diagnostic studies; CDR† with 1b studies from different clinical centers	SR (with homogeneity) of prospective cohort studies
1b	Individual RCT (with narrow Confidence Interval)	Individual inception cohort study with > 80% follow-up; CDR† validated in a single population	Validating cohort study with good reference standards; or CDR† tested within one clinical center	Prospective cohort study with good follow-up
1c	All or none	All or none case-series	Absolute SpPins and SnNouts^#	All or none case-series
2a	SR (with homogeneity) of cohort studies	SR (with homogeneity) of either retrospective cohort studies or untreated control groups in RCTs	SR (with homogeneity) of Level >2 diagnostic studies	SR (with homogeneity) of 2b and better studies
2b	Individual cohort study (including low quality RCT; e.g., <80% follow-up)	Retrospective cohort study or follow-up of untreated control patients in an RCT; Derivation of CDR† or validated on split-sample only	Exploratory cohort study with good reference standards; CDR† after derivation, or validated only on split-sample or databases	Retrospective cohort study, or poor follow-up
2c	Outcomes Research; Ecological studies	Outcomes Research		Ecological studies
3a	SR (with homogeneity) of case-control studies		SR (with homogeneity) of 3b and better studies	SR (with homogeneity) of 3b and better studies
3b	Individual Case-Control Study		Non-consecutive study; or without consistently applied reference standards	Non-consecutive cohort study, or very limited population
4	Case-series (and poor quality cohort and case-control studies)	Case-series (and poor quality prognostic cohort studies)	Case-control study, poor or non-independent reference standard	Case-series or superseded reference standards
5	Expert opinion without explicit critical appraisal, or based on physiology, bench research or “first principles”	Expert opinion without explicit critical appraisal, or based on physiology, bench research or “first principles”	Expert opinion without explicit critical appraisal, or based on physiology, bench research or “first principles”	Expert opinion without explicit critical appraisal, or based on physiology, bench research or “first principles”

^* SR = systematic review
† CDR = Clinical Decision Rule. (These are algorithms or scoring systems that lead to a prognostic estimation or a diagnostic category.)
^# Absolute SpPin = diagnostic finding whose Specificity is so high that a Positive result rules-in the diagnosis. Absolute SnNout = diagnostic finding whose Sensitivity is so high that a Negative result rules-out the diagnosis.
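Once each retrieved paper has been assigned an Oxford level, prioritizing the reading list from highest to lowest level of evidence amounts to a sort. The following Python sketch assumes the level labels from Table 1; the paper titles are invented placeholders, not real references.

```python
# Oxford level labels in descending order of evidence strength (from Table 1).
LEVEL_ORDER = ["1a", "1b", "1c", "2a", "2b", "2c", "3a", "3b", "4", "5"]


def rank_by_evidence(papers):
    """Return papers sorted from highest to lowest level of evidence."""
    return sorted(papers, key=lambda paper: LEVEL_ORDER.index(paper["level"]))


# Hypothetical search results for illustration only.
papers = [
    {"title": "Case series of brace users", "level": "4"},
    {"title": "Individual RCT of insoles vs. brace", "level": "1b"},
    {"title": "Retrospective cohort study", "level": "2b"},
    {"title": "Systematic review of RCTs", "level": "1a"},
]
ranked = rank_by_evidence(papers)
for paper in ranked:
    print(paper["level"], paper["title"])
```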

Evidence Base

Critical appraisal of the evidence

Clinicians should critically appraise literature and draw their own conclusions: only 5% of primary care studies⁹ and less than 1% in rehabilitation are considered high quality, mostly due to bias (~50%) and poor scientific design (~20%).¹⁰ With studies and trials becoming more complex, it is important for readers to be familiar with analyzing study designs, data, and conclusions to be able to appropriately and efficiently interpret outcomes.¹¹

Basic statistics concepts to appraise medical literature

The purpose of the scientific method is to translate a research question into a mathematical formulation that supports only two possible hypotheses. The hypothesis to prove false is the null hypothesis (H₀) and the one to bolster is the alternative hypothesis (H_A). H₀ usually reflects the status quo or no observed effect.

Example of a research hypothesis:

H₀: No association between intervention and disease
H_A: Association between intervention and disease

The researcher then designs an experiment to test the H₀. A test statistic is calculated from the outcome data of the sample. If this test statistic falls within the critical region defined by the alpha level (equivalently, if the p-value is below alpha), then it is considered statistically significant. In this case, we reject H₀ and accept H_A. In practice, the proper selection of the statistical methods depends on the nature of the data and study design.

Example of hypothesis testing of a single population mean (adapted from Paul Christos, D.Ph.):

The population mean cholesterol and standard deviation for men in Lexington are 200 and 50 mg/dL, respectively, based on past literature.
A sample mean cholesterol of 300 mg/dL is obtained from a new study of 30 men in Lexington chosen at random.
Does this new sample value validate or contradict the population mean cholesterol previously reported in literature?
H₀: Population mean cholesterol (µ) = 200 mg/dL -> based on past literature
H_A: Population mean cholesterol (µ) ≠ 200 mg/dL -> population mean cholesterol is not equal to 200 as reported in past literature
Set alpha (level of statistical significance) to 0.05. For a two-tailed test, the critical values are the z-scores at cumulative probabilities of 0.025 and 0.975, that is, ±1.96. If our test statistic falls above 1.96 or below -1.96, then it falls within the critical region.
Based on the new sample mean and the population standard deviation, a test statistic (z-score) is calculated: (300-200)/(50/√30) = 10.95. Since 10.95 > 1.96, the test statistic falls in the critical region.
Therefore, we reject H₀ and accept H_A (µ ≠ 200).
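The arithmetic of this example can be reproduced with a short Python sketch using only the standard library. The variable names are chosen for this illustration, and the test is a one-sample, two-tailed z-test with a known population standard deviation, as in the example above.

```python
import math
from statistics import NormalDist

mu_0 = 200.0         # population mean under H0 (mg/dL)
sigma = 50.0         # population standard deviation (mg/dL)
n = 30               # sample size
sample_mean = 300.0  # mean observed in the new study (mg/dL)
alpha = 0.05         # level of statistical significance

# The standard error of the mean shrinks with the square root of n.
standard_error = sigma / math.sqrt(n)
z = (sample_mean - mu_0) / standard_error

# Two-tailed critical value: alpha is split between both tails.
z_critical = NormalDist().inv_cdf(1 - alpha / 2)

# Two-tailed p-value for the observed test statistic.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

reject_h0 = abs(z) > z_critical
print(f"z = {z:.2f}, critical value = ±{z_critical:.2f}, reject H0: {reject_h0}")
```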

The level of statistical significance is usually set at < 5% (p < 0.05), as in the above example. Alpha (type I) error (usually set at 5%) is the probability of rejecting a true null hypothesis, or concluding there is an association between exposure and disease when in fact there is not. Beta (type II) error (usually set at 20%) represents the probability of not rejecting a false null hypothesis, or concluding there is no association between exposure and disease when in fact there really is one.

When interpreting p-values, remember:

P-values close to the level of significance do not mean that there is a “weak” correlation or “weak” statistical significance.
The magnitude of the p-values does not correlate with the strength of the H_A.
Statistical significance may not be clinically significant, but clinical significance is usually statistically significant.
Lack of statistical significance is not proof of “no effect.” “Accepting” the H₀ is not actually proving it.

Evaluate study quality

There are several templates available to guide the evaluation of specific study designs. The general steps are:

Identify the research question (H₀ and H_A). Is the purpose of the study appropriately translated to an answerable scientific question?
Identify the study design. Is the design capable of answering the question?
Identify possible selection bias. Were the inclusion and exclusion criteria reasonable choices? Was it appropriately randomized? Did all groups have the same prognosis?
Identify possible researcher bias. Were researchers blinded to outcomes and interventions? Check authors’ disclosures.
Identify if placebo effect was reasonably controlled. Were patients blinded to the intervention? Were all groups treated equally?
Evaluate the flow of the subjects throughout the study. Were all subjects kept in the same groups they were initially assigned to, including the dropouts? Did >80% of subjects finish the study? Was intention-to-treat analysis performed?
Determine if the study findings are applicable to other patient populations. Is the sample group generalizable? Is the sample size large enough?

Evaluate study results

Clinicians need to be familiar with the interpretation of common summary statistics and effect size measurements to put results into perspective. An effect size can either quantify how different two groups are from one another or measure the strength of a relationship between two variables¹². Appreciation of the effect size can be significantly altered by how the results are presented. Example:

RCT of 20-year sunscreen use vs. placebo for prevention of melanoma:
• EER (Experimental Event Rate) = 1/1000 = 0.001 –> “The risk of developing melanoma over 20 years in the sunscreen experimental group was 0.1% or 1 in 1000.”
• CER (Control Event Rate) = 5/1000 = 0.005 –> “The risk of developing melanoma over 20 years with placebo is 0.5%.”
• ARR (Absolute Risk Reduction) = CER – EER = 0.004 –> “0.4% of patients, or 4 of 1000, are prevented from developing melanoma by using sunscreen.”
• RR (Relative Risk) = EER/CER = 0.20 –> “People who do not use sunscreens have a 5 times greater risk of developing melanoma over 20 years as compared to those who use sunscreens.”
• RRR (Relative Risk Reduction) = (CER – EER)/CER = 1 – RR = 0.80 –> “Sunscreen use decreases the risk of developing melanoma by 80% compared with no sunscreen.”
• NNT (Number-needed-to-treat) = 1/ARR = 250 –> “250 patients would need to be treated with sunscreen rather than placebo for 20 years to prevent one additional case of melanoma.”
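The measures in this example can be computed from the raw event counts with a few lines of Python. This is a sketch for illustration: the function name is an assumption, and the absolute risk reduction is taken as CER minus EER so that a beneficial treatment gives a positive value.

```python
def effect_sizes(events_exp, n_exp, events_ctrl, n_ctrl):
    """Return common effect size measures from event counts in two groups."""
    eer = events_exp / n_exp    # experimental event rate
    cer = events_ctrl / n_ctrl  # control event rate
    arr = cer - eer             # absolute risk reduction
    rr = eer / cer              # relative risk
    rrr = arr / cer             # relative risk reduction, equal to 1 - RR
    nnt = 1 / arr               # number needed to treat
    return {"EER": eer, "CER": cer, "ARR": arr, "RR": rr, "RRR": rrr, "NNT": nnt}


# Sunscreen example: 1/1000 events with sunscreen, 5/1000 with placebo.
results = effect_sizes(events_exp=1, n_exp=1000, events_ctrl=5, n_ctrl=1000)
for name, value in results.items():
    print(f"{name}: {value:.3f}")
```

The same trial can be reported as an 80% relative reduction or a 0.4% absolute reduction, which is why it helps to compute all of these measures before judging the size of an effect.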

If the study shows statistically significant results, evaluate:

Misuse or abuse of statistical methods in the process and presentation of the results. For instance, using means to compare ordinal or nominal variables.
The impact or clinical importance of the results. How large is the magnitude of the effect found in the study? Are those effects or differences clinically important? What is the minimally clinically important difference (MCID)?
The possibility that positive results are false positives. The risk of finding false positive results increases with the presence of small samples, small effect sizes, non-standard study designs, multiple null hypotheses tested, or multiple comparisons.¹³
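To see why multiple comparisons raise the risk of false positives, the chance of at least one false positive across several tests can be sketched in Python. This illustration assumes the tests are independent and that every null hypothesis is true. The Bonferroni adjustment shown is one common correction and is not taken from the source text.

```python
def familywise_error_rate(alpha, n_tests):
    """Probability of at least one false positive among independent tests."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_alpha(alpha, n_tests):
    """Per-test significance threshold under the Bonferroni correction."""
    return alpha / n_tests


for n_tests in (1, 5, 10, 20):
    rate = familywise_error_rate(0.05, n_tests)
    print(f"{n_tests:>2} tests: chance of any false positive = {rate:.2f}")
```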

If the study shows statistically non-significant results, analyze confidence intervals (CI).¹⁴ A 95% confidence interval (based on alpha = 0.05) is constructed so that, across repeated studies, about 95 of 100 such intervals contain the true value. The other five would miss it (error rate). If the CI includes a value that can be interpreted as clinically important, then it is reasonable to contemplate the possibility that the study is underpowered and the results are falsely negative.
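The reasoning above can be sketched in Python for a normally distributed estimate. All numbers here are hypothetical: a between-group difference of 4 points with a standard error of 3, and an assumed minimal clinically important difference (MCID) of 5 points.

```python
from statistics import NormalDist


def confidence_interval(estimate, standard_error, level=0.95):
    """Return the (low, high) normal-approximation confidence interval."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return estimate - z * standard_error, estimate + z * standard_error


# Hypothetical study result (not from any real trial).
low, high = confidence_interval(estimate=4.0, standard_error=3.0)
mcid = 5.0  # assumed minimal clinically important difference

# The result is non-significant if the interval contains zero (no effect).
not_significant = low <= 0.0 <= high
# If a clinically important value is still inside the interval,
# the study may simply have been too small to detect it.
possibly_underpowered = not_significant and (low <= mcid <= high)

print(f"95% CI: ({low:.2f}, {high:.2f})")
print("Possibly underpowered:", possibly_underpowered)
```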

Power (1 – beta) is usually set at 80% or 90%. A study with only 60% power implies that it only had a 60% chance of correctly accepting H_A if it was true. If the results from this underpowered study are not found to be statistically significant with p > 0.05, then the conclusions are questionable. In contrast, a well-powered study (99%) has a 99% chance of correctly accepting H_A if it was true. If the results are without statistical significance (p > 0.05), then H₀ is acceptable.
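How power depends on sample size can be illustrated with a Python sketch for a two-tailed, one-sample z-test. The true difference of 20 mg/dL used below is a hypothetical value, and the standard deviation of 50 mg/dL is borrowed from the cholesterol example purely for illustration.

```python
import math
from statistics import NormalDist


def z_test_power(true_difference, sigma, n, alpha=0.05):
    """Power of a two-tailed one-sample z-test with known sigma."""
    normal = NormalDist()
    z_critical = normal.inv_cdf(1 - alpha / 2)
    # Expected test statistic if the true difference is real.
    shift = abs(true_difference) * math.sqrt(n) / sigma
    # Probability that the statistic lands in either tail of the critical region.
    return normal.cdf(shift - z_critical) + normal.cdf(-shift - z_critical)


power_n30 = z_test_power(true_difference=20, sigma=50, n=30)
power_n50 = z_test_power(true_difference=20, sigma=50, n=50)
print(f"n = 30: power = {power_n30:.2f}")
print(f"n = 50: power = {power_n50:.2f}")
```

Under these assumptions, the smaller study falls short of the usual 80% target while the larger one reaches it, so a non-significant result from the smaller study would be harder to interpret.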

FDA guidelines

The FDA assures the safety, efficacy and security of drugs, biological products, and medical devices. FDA regulations apply to clinical investigations conducted on medical products that will be marketed in the U.S., including but not limited to drugs, devices, and biologics.

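Applications are required for:

• Investigational New Drug (IND)
• New Drug Application (NDA)
• Biologics License Application (BLA)
• Investigational Device Exemption (IDE)

An IND has been previously screened for pharmacological activity and acute toxicity potential in animals and is pending investigation for its diagnostic or therapeutic potential in humans. This investigation is typically divided into three phases: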

Phase 1: introduction of IND to patients or control volunteer subjects (n = 20-80) to determine drug pharmacokinetics and pharmacologic effects in humans, side effects, and if possible, clinical efficacy.
Phase 2: controlled studies, typically no more than several hundred subjects, to evaluate drug efficacy for specific indications, common short-term side effects and associated risk.
Phase 3: expanded controlled and uncontrolled trials ranging from several hundred to several thousand subjects, in order to gather further information on efficacy and safety, determine the overall drug benefit-risk relationship, and provide a foundation for physician labeling.

Optional phase 4 studies occur post-approval to determine long-term efficacy and safety of certain drugs. Not all drugs will require phase 4 trials. After an IND is shown to have clinical efficacy and an acceptable safety profile, it can then become the subject of an NDA or BLA.

There is overlap between the NDA and BLA applications, but there are a few differences that dictate which to apply to. In general, NDA is used for drugs intended to be compliant with the United States Federal Food, Drug, and Cosmetic (FD&C) Act, while a BLA is required for biological products (biologics) subject to licensure under the Public Health Service (PHS) Act.¹⁵ In order to be approved for either application, the sponsor must provide specific data for review which includes chemistry, pharmacology, medical, biopharmaceutics, and statistics. Once approved, the drug may begin to be marketed in the United States.

An IDE application allows an investigational device to be used in a clinical study to collect safety and effectiveness data. This includes clinical evaluation of certain modifications or new intended uses of legally marketed devices. All clinical evaluations of investigational devices must have an approved IDE before the study is initiated. There are three types of device studies: significant risk (SR), nonsignificant risk (NSR), and exempt studies.¹⁶

Significant Risk (SR): a device that presents a potential for serious risk to the health, safety, or welfare of a subject. These studies require more stringent approval processes, record keeping, and reporting.
Nonsignificant Risk (NSR): a device that does not present a potential for serious risk to the health, safety, or welfare of a subject and does not meet the definition of an SR device. These studies have abbreviated requirements and fewer reporting responsibilities.

In the US, interventional studies with a high level of evidence (often randomized trials, equivalent to OCEBM level 2 or above) are considered by the FDA when approving an innovative drug or intervention. The design of the study (patient population, outcome measures, doses) is reflected in the approved indication for the intervention. The FDA publishes guidelines regarding good practices and acceptable outcomes to target, as well as the way to measure them in a trial.

Types of Reviews

A review article is a kind of scientific literature that synthesizes and summarizes what is currently known about a particular topic, drawing on data from previously published primary literature. There are multiple kinds of review articles, including scoping reviews, narrative reviews, systematic reviews, and meta-analyses.

Scoping Review: Preliminary assessment of available research on a topic. These allow for identifications of gaps in knowledge and recommendations of future research to be conducted in order to obtain a meaningful full systematic review.¹⁷
Narrative Review: Broad literature review and discussion about what is known about a topic. Also referred to as a literature review.¹⁸
Systematic Review: A comprehensive literature review that answers a scientific question.¹⁹
Meta-analysis: Combines the results of multiple independently executed studies into a single statistical analysis. This approach essentially leads to a larger sample size and more precise estimates compared to individual studies. It can also help in reaching a concise answer about a topic when studies from different research groups have conflicting conclusions.
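One common way to combine study results statistically is fixed-effect, inverse-variance pooling, sketched below in Python. This is an assumption made for illustration rather than a method named in the source, and the three effect estimates and standard errors are invented.

```python
import math


def pool_fixed_effect(estimates, standard_errors):
    """Inverse-variance weighted average of study estimates.

    More precise studies (smaller standard error) receive more weight.
    Returns the pooled estimate and its standard error.
    """
    weights = [1 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)
    pooled = sum(w * est for w, est in zip(weights, estimates)) / total_weight
    pooled_se = math.sqrt(1 / total_weight)
    return pooled, pooled_se


# Hypothetical effect estimates from three independent studies.
estimates = [2.0, 4.0, 3.0]
standard_errors = [1.0, 2.0, 1.0]
pooled, pooled_se = pool_fixed_effect(estimates, standard_errors)
print(f"Pooled estimate = {pooled:.2f}, standard error = {pooled_se:.2f}")
```

The pooled standard error is smaller than that of any single study in the example, which reflects the gain in precision from the larger combined sample.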

Cutting Edge/Unique Concepts/Emerging Issues

Artificial intelligence and machine learning

Explainable Artificial Intelligence (AI) and Interpretable Machine Learning (ML) are new tools that are gaining exponential popularity in many fields of research, including medicine.²⁰ AI technologies utilize machines that are able to learn on their own. Within AI lies the branch of ML, which involves automatic learning from experience. Explainable AI and Interpretable ML go a step further to provide support or rationale for models created. These types of approaches aim to efficiently process information similarly to how a human would digest complex ideas and associations.²¹

There is an abundance of applications for these technologies such as diagnosis and clinical reasoning scenarios, but one research-specific use for physicians is literature mapping. This is a way of exploring connections between publications.²² There are multiple free and easy to use online ML tools that group together papers that have cited a paper of interest.

Literature Mapping ML Tools

• Connected Papers
• Inciteful
• Litmaps

References

1. Densen P. Challenges and opportunities facing medical education. Trans Am Clin Climatol Assoc. 2010;122:48-58.
2. Bordage G, Carlin B, Mazmanian PE; American College of Chest Physicians Health and Science Policy Committee. Continuing medical education effect on physician knowledge: effectiveness of continuing medical education: American College of Chest Physicians Evidence-Based Educational Guidelines. Chest. 2009;135(Supp 3):29s-36s.
3. Davis D, O’Brien M, Freemantle N, Wolf F, Mazmanian P, Taylor-Vaisey A. Impact of formal continuing medical education: do conferences, workshops, rounds, and other traditional continuing education activities change physician behavior or health care outcomes? JAMA. 1999;282(9):867-874.
4. Nutley S, Davies H, Walter I. Evidence-based policy and practice: cross-sector lessons from the United Kingdom. Social Policy J New Zealand. 2003;20:29-48.
5. Hoogendam A, de Vries Robbe P, Overbeke A. Comparing patient characteristics, type of intervention, control, and outcome (PICO) queries with unguided searching: a randomized controlled crossover trial. J Med Libr Assoc. 2012;100(2):121-126.
6. Agoritsas T, Merglen A, Courvoisier D, et al. Sensitivity and predictive value of 15 PubMed search strategies to answer clinical questions rated against full systematic reviews. J Med Internet Res. 2012;14(3):e81.
7. Haynes R, McKibbon K, Wilczynski N, Walter S, Werre S, Hedges T. Optimal search strategies for retrieving scientifically strong studies of treatment from Medline: analytical survey. BMJ. 2005;330(7501):1179.
8. Mayo NE, Goldberg MS. When is a case-control study not a case-control study? J Rehabil Med. 2009;41:209-216.
9. McKibbon K, Wilczynski N, Haynes R. What do evidence-based secondary journals tell us about the publication of clinically important articles in primary healthcare journals? BMC Med. 2004;2:33.
10. Chalmers I, Glasziou P. Avoidable waste in the production and reporting of research evidence. Lancet. 2009;374(9683):86-89.
11. Incze MA, Parks AL, Stern RJ. Teaching a Deeper Understanding of the Medical Literature. J Gen Intern Med. 2023;38(4):1059-1060. doi:10.1007/s11606-022-07851-4
12. Durlak JA. How to select, calculate, and interpret effect sizes. J Pediatr Psychol. 2009;34(9):917-928. doi:10.1093/jpepsy/jsp004
13. Ioannidis JP. Why most published research findings are false. PLoS Med. 2005;2(8):e124. Epub 2005 Aug 30.
14. Colegrave N, Ruxton GD. Confidence intervals are a more useful complement to nonsignificant tests than are power calculations. Behavioral Ecology. 2003;14(3):446-447.
15. Rare Diseases Registry Program. New Drug Application. https://registries.ncats.nih.gov/glossary/new-drug-application/. Accessed 17 July 2023.
16. U.S. Food & Drug Administration. Information Sheet Guidance For IRBs, Clinical Investigators, and Sponsors: Significant Risk and Nonsignificant Risk Medical Device Studies. Published January 2006. Accessed 17 July 2023.
17. University of Houston Libraries. Type of Reviews. https://guides.lib.uh.edu/c.php?g=1035985&p=7737345. Published 2023. Accessed 17 July 2023.
18. University of Alabama at Birmingham Libraries. Reviews: From Systematic to Narrative: Narrative Review. https://guides.library.uab.edu/sysrev. Published 5 December 2022. Accessed 17 July 2023.
19. Duke University Medical Center Library & Archives. Systematic Reviews. https://guides.mclibrary.duke.edu/sysreview. Published 5 July 2023. Accessed 17 July 2023.
20. Parashar G, Chaudhary A, Rana A. Systematic Mapping Study of AI/Machine Learning in Healthcare and Future Directions. SN Comput Sci. 2021;2(6):461.
21. Buch VH, Ahmed I, Maruthappu M. Artificial intelligence in medicine: current trends and future possibilities. Br J Gen Pract. 2018;68(668):143-144.
22. Princeton University Library. Literature Mapping. https://libguides.princeton.edu/litmapping. Published 13 April 2023. Accessed 16 July 2023.

Original Version of the Topic

Oscar O. Ortiz Vargas, MD. Medical literature: Classes of evidence, FDA guidelines, basic statistics, and study design for understanding medical literature. 9/20/2014.

Previous Revision(s) of the Topic

Xiaoning Yuan, MD, PhD, Michael W. O’Dell, MD. Medical literature: Classes of evidence, FDA guidelines, basic statistics, and study design for understanding medical literature. 9/7/2018.

Author Disclosure

Sunil K Jain, MD
Nothing to Disclose

Julia H DeLuca, BS
Nothing to Disclose
