Knowledge integration includes knowledge management, synthesis, and translation processes. It aims to maximize the use of collected scientific information and accelerate translation of discoveries into individual and population health benefits. Accumulated evidence in cancer epidemiology constitutes a large share of the 2.7 million articles on cancer in PubMed. We examine the landscape of knowledge integration in cancer epidemiology. Past approaches have mostly used retrospective efforts of knowledge management and traditional systematic reviews and meta-analyses. Systematic searches identify 2,332 meta-analyses, about half of which are on genetics and epigenetics. Meta-analyses represent 1:89 to 1:1,162 of published articles in various cancer subfields. Recently, there have been more collaborative meta-analyses with individual-level data, including those with prospective collection of measurements [e.g., genotypes in genome-wide association studies (GWAS)]; this may help increase the reliability of inferences in the field. However, most meta-analyses are still done retrospectively with published information. There is also a flurry of candidate gene meta-analyses with spuriously prevalent “positive” results. Prospective design of large research agendas, registration of datasets, and public availability of data and analyses may improve our ability to identify knowledge gaps, maximize and accelerate translational progress or, at a minimum, recognize dead ends in a more timely fashion. Cancer Epidemiol Biomarkers Prev; 22(1); 3–10. ©2012 AACR.
Given the rapid expansion of scientific information, there is a critical need to ensure that maximal use is made of the collected data in a most efficient and unbiased way. “Knowledge integration” describes the processes that aim at effective use of information from many sources for accelerating translation of scientific discoveries into clinical applications, evidence-based recommendations, use in practice, and eventually health benefits for individuals and populations (Fig. 1). The term has been used with different definitions and perspectives (1–3), but here we adopt the definition that includes knowledge management (KM), knowledge synthesis (KS), and knowledge translation (KT; ref. 1). These 3 processes can inform the continuum of translational research from discovery to population health impact (1). The knowledge integration engine can drive research progress, especially in data-intensive fields.
Previously, knowledge integration has depended mostly on retrospective efforts of horizon scanning, traditional systematic reviews and meta-analyses, and nonsystematic knowledge brokering between stakeholders. These efforts may be inadequate in the current era of rapid accumulation of multilevel information—from molecular to the macro-level of environmental exposures and health system attributes. Revamping the current knowledge integration processes may help drive the future of cancer epidemiology across the translational research continuum. Here, we overview the accumulated experience on knowledge integration with emphasis on cancer epidemiology. We discuss what methods have been used over time, their strengths and limitations, and what alternative approaches might lead to more efficient integration of emerging information.
Published information on cancer
PubMed (search August 27, 2012) lists 2,673,926 articles with the search on “cancer,” of which more than three quarters (n = 2,146,156) are tagged with “human.” Table 1 shows selective subsets that present an overall picture of published evidence for different types of designs relating to cancer epidemiology. More than 50,000 articles are identified with the search word “cohort,” and slightly more with “case control,” whereas the term “risk” retrieves more than 200,000 articles, and “biomarker” just as many. At the same time, there are also more than 100,000 clinical trial publications, many of which are randomized trials. Efforts to synthesize this information are represented by more than 6,000 meta-analyses, almost 30,000 systematic reviews, and more than 300,000 nonsystematic reviews. Thus, most efforts at evidence integration still use subjective opinion and nonsystematic methods for data overview and interpretation.
It is difficult to accurately separate the literature on associations from the literature on treatments and interventions; sometimes they are intermingled in the same article. Subset searches in Table 1 should be interpreted with caution given the imperfect sensitivity and specificity of PubMed searches. However, articles with relevance to treatment clearly far outnumber articles with relevance to prevention across all subsets. Table 1 also provides search results with the string “NOT (trial* OR treatment)” to further exclude articles on clinical trials and treatment-related research (which can be either randomized or observational, e.g., predictive tools for treatment response). As shown, studies with “cohort” or “case control” are split between the literature with and without trials/treatment implications.
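The subset searches described above can be written as boolean query strings. The following is a minimal sketch, assuming that counts would be obtained from the NCBI E-utilities `esearch` endpoint; the helper names `build_query` and `count_url` are hypothetical, and the exact strings behind Table 1 are not given in the text beyond the quoted terms. The code only builds a URL and sends no request.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint (used here only to build a URL).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Exclusion filter quoted in the text.
EXCLUDE = "NOT (trial* OR treatment)"


def build_query(base="cancer", subset=None, exclude_trials=False):
    """Combine the base term, an optional subset term, and the exclusion filter."""
    query = base
    if subset:
        # Quote multi-word subset terms such as "case control".
        term = f'"{subset}"' if " " in subset else subset
        query = f"({base} AND {term})"
    if exclude_trials:
        query = f"{query} {EXCLUDE}"
    return query


def count_url(query):
    """URL asking esearch for only the number of matching PubMed records."""
    params = {"db": "pubmed", "term": query, "rettype": "count"}
    return ESEARCH + "?" + urlencode(params)


cohort_no_trials = build_query(subset="cohort", exclude_trials=True)
# cohort_no_trials == "(cancer AND cohort) NOT (trial* OR treatment)"
```

Counts returned for such strings change as PubMed is updated, so the figures quoted for August 27, 2012 would not be reproduced by running the same query today.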
Knowledge management: studies, data, analyses
KM efforts can take many different forms, depending on whether one is tapping into published information; retrieving unpublished information; developing databases with raw data; or allowing a live stream of all collected data and analyses (Table 2). Published information is only a fraction of the total raw data that have been collected for or repurposed for research purposes and of the analyses that have been formally conducted, probed, or contemplated (4).
KM efforts for published data usually concentrate on search optimization, curation, cleaning, and harmonization. Published data may often be a selected, even distorted, subset of the whole information, chosen based on statistical significance and/or other selection filters (5). Although some journals have begun efforts to publish “null” results (6, 7), these remain underrepresented in the published literature. If so, KM targeting past published data may yield largely misleading results. Empirical meta-research evaluations on the credibility of research findings in different research fields and with different methods and technologies may help anchor some credibility estimates. They may help decide whether a field is so severely biased that it is a wasteful effort to collect, clean, and use published information. Conversely, the KM process may indicate that a systematic synthesis can indeed yield reliable results.
Unpublished data and analysis results are notoriously difficult to unearth. Some investigators may claim that unpublished data can be ignored, as they have not passed peer review. However, the seal of peer review is not a perfect discriminant. Registration of protocols and analyses would help to gauge the depth of the problem, but publication of these documents is limited. For example, it is well documented that many clinical trial protocols and/or analyses are never published. Of those that are published, half or more of the originally considered outcomes are not reported, whereas many others are reported with analyses and results that deviate from the originally intended analyses (8–10). For observational epidemiology, these problems are probably more common (11). However, a priori registration of protocols is difficult for such studies, given their exploratory and iterative nature. Instead, it has been argued that registration should focus on study datasets (4, 12), that is, information on what variables have been collected and measured. This allows an assessment of how many registered datasets could have undergone specific analyses.
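The idea of dataset registration can be illustrated with a small sketch. It assumes a registry that records only which variables each dataset holds; the registry entries, variable names, and the functions `datasets_supporting` and `reporting_fraction` are all hypothetical and are not taken from refs. 4 or 12.

```python
# Hypothetical registry: each entry lists only the variables a dataset measured.
REGISTRY = [
    {"id": "cohort_A", "variables": {"smoking", "alcohol", "bmi", "lung_cancer"}},
    {"id": "cohort_B", "variables": {"smoking", "lung_cancer"}},
    {"id": "case_control_C", "variables": {"alcohol", "bmi", "colorectal_cancer"}},
]


def datasets_supporting(registry, required):
    """Return ids of datasets that measured every variable in `required`."""
    required = set(required)
    return [d["id"] for d in registry if required <= d["variables"]]


def reporting_fraction(registry, required, n_published):
    """Share of eligible datasets with a published analysis of this question.

    Returns None when no registered dataset could address the question.
    """
    eligible = len(datasets_supporting(registry, required))
    return n_published / eligible if eligible else None


eligible_ids = datasets_supporting(REGISTRY, ["smoking", "lung_cancer"])
```

With such a denominator, one published smoking and lung cancer analysis out of two eligible datasets would give a reporting fraction of 0.5, a quantity that cannot be estimated from the published literature alone.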
Public deposition of raw datasets has attracted increasing attention over time, with several successful efforts for laboratory research, for example, genomic sequencing databases and functional databases (13). For microarray, macromolecular, and protein data, most high-impact journals have policies that require delineating some plan of making raw data, protocols, and analysis codes publicly available, routinely or after request to the authors; public deposition of such information is often a prerequisite for publication, but these policies are not necessarily enforced (14, 15).
There are several challenges related to deposition of raw datasets (Table 2). Some datasets may be deposited with poor documentation that hinders their usage by an outsider or may lead to erroneous data readings and misleading inferences. In addition, some public databases have minimum requirements when depositing data. Even if investigators are required to adhere to data-sharing policies (either from funding agencies or journal requirements), they often enter the minimal amount of required information. There are different modes of data access: open-to-all, cursory approval–based, and access to select investigators passing stringent standards of recognized expertise. Striking a balance between credit and independence is also challenging. Original investigators could (or should) be credited for analyses conducted on their data. However, it may also be advisable to keep further analyses separate from them: subsequent investigators who then use these published data should feel free to repeat and challenge the original analyses.
Finally, the live stream information model suggests that data, protocols, and analyses are readily available and visible to a wider circle, even the full public, as they accumulate, change, and evolve. This practice has been piloted in experiments trying to replicate the finding of bacteria with arsenic-containing DNA (16). Other fields, including cancer epidemiology, may also learn from it.
Knowledge synthesis of same-level information
There is a wide variety of KS methods (Table 2), but systematic reviews and meta-analyses are the most common. The majority of such reviews still depend on published information. Meta-analyses of published data are popular in many disciplines, especially those where unadjusted estimates and plain 2 × 2 tables are convenient. Some efforts may also be made to retrieve and include unpublished information, but success in this endeavor varies. For fields where there are many meta-analyses, field synopses have emerged (17) with simultaneous compilation of tens to hundreds of meta-analyses on the same field, as in several applications in human genome epidemiology, for example, AlzGene, SzGene, and PDGene (18–20).
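To show why 2 × 2 tables are convenient for meta-analyses of published data, here is a minimal sketch of one common approach, fixed-effect inverse-variance pooling of log odds ratios. It is an illustration under stated assumptions, not the method of any cited meta-analysis: the study counts are hypothetical, all cells are assumed non-zero, and heterogeneity between studies is ignored.

```python
import math


def log_odds_ratio(a, b, c, d):
    """Log odds ratio and its variance from one 2 x 2 table.

    a, b: exposed cases, exposed controls; c, d: unexposed cases, unexposed controls.
    Assumes all four cells are non-zero (no continuity correction is applied).
    """
    log_or = math.log((a * d) / (b * c))
    variance = 1 / a + 1 / b + 1 / c + 1 / d
    return log_or, variance


def pooled_odds_ratio(tables):
    """Fixed-effect inverse-variance pooled OR with an approximate 95% CI."""
    estimates = [log_odds_ratio(*t) for t in tables]
    weights = [1 / v for _, v in estimates]
    pooled = sum(w * y for w, (y, _) in zip(weights, estimates)) / sum(weights)
    se = math.sqrt(1 / sum(weights))
    bounds = (pooled, pooled - 1.96 * se, pooled + 1.96 * se)
    return tuple(math.exp(x) for x in bounds)


# Hypothetical study tables (a, b, c, d); not data from any cited study.
studies = [(20, 80, 10, 90), (30, 70, 20, 80), (15, 85, 12, 88)]
or_hat, lower, upper = pooled_odds_ratio(studies)
```

Because only four published counts per study are needed, such syntheses are easy to produce, which is also why they inherit any selective reporting present in the source articles.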
Other KS efforts involve investigators who control existing primary data. Such collaborative meta-analyses use a central secretariat to collect, query, clean, and synthesize individual-level data or statistics derived from individual-level data analyses procured by the original investigators of each included study. The advantages (potential standardization or harmonization of data and analyses, consistency of adjustments, multivariable models, interactions, and other complex models) and disadvantages (cost, effort, inability to fully standardize post hoc, selective availability of information, political difficulties) of this approach versus meta-analyses of the published literature have been extensively discussed previously (21, 22).
There is increasing interest in collaborative meta-analyses that use prospectively collected measurements from existing studies. This is the dominant paradigm in meta-analyses of GWAS (23, 24) that have led to a massive increase in the number of discovered genetic variants with strong statistical support (25). These meta-analyses may avoid the potential for selective reporting bias that threatens collaborative meta-analyses of previously collected data. Consortia working in this framework are common in human genome epidemiology but less common in other fields. Finally, one can envision prospective meta-analyses, where not only specific measurements but also the primary studies are designed prospectively, with the plan to eventually combine them. Such examples currently come mostly from randomized trials (26). Nevertheless, the concept can conceivably be applied to prospectively designed case–control studies, cohort studies, and biobanks (27), with prospective standardization of their designs and of their data collection and analysis procedures (28).
Landscape of KS methods used in cancer epidemiology
Table 3 shows the landscape of practiced KS methods for the PubMed subset of “Cancer NOT (trial* OR treatment)”. The large majority of systematic reviews (85.5%) and all but 13 meta-analyses addressed human data. More effort is needed to systematically appraise evidence from animal studies (29, 30), which can be informative and influential for judging biologic plausibility and for other preclinical inferences.
The fields of genetics and epigenetics account for almost a third of the literature. The same share applies to systematic reviews, whereas these 2 disciplines account for about half of the existing meta-analyses. The literatures on biomarkers, hormones, and infectious agents are extensive but have relatively fewer meta-analyses (n = 51–109 in each). Conversely, other areas with smaller shares of the total literature have as many or more published meta-analyses, in particular the smoking, occupational, and nutritional fields.
The number of systematic reviews is 2- to 3-fold larger than the number of meta-analyses in most areas, except for biomarkers, immune/allergy/asthma, and social/socioeconomic factors, where the ratio is even larger. This may reflect the difficulty of conducting quantitative syntheses (e.g., for social and socioeconomic factors with extreme heterogeneity of definitions and measurements) or less established traditions for conducting meta-analyses. The ratio of all published articles per published meta-analysis is 556 overall; despite their emerging popularity, meta-analyses are still a small portion of the literature. Moreover, there is large variability in this ratio across different fields. It is smaller [n (all)/n (MA) = 89–133] for smoking, occupational, nutritional, and lifestyle areas; modestly high [n (all)/n (MA) = 215–350] for alcohol, social, genetics, carcinogens, and radiation; and very high [n (all)/n (MA) = 728–1,162] in epigenetics, biomarkers, immune factors, hormones, and infectious agents.
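The ratio n (all)/n (MA) used above is simple arithmetic, sketched here for clarity. The counts in `examples` are hypothetical and were chosen only to reproduce the two extremes quoted in the text; they are not the Table 3 counts, and the function names are illustrative.

```python
def articles_per_meta_analysis(n_all, n_ma):
    """Number of published articles per published meta-analysis, n(all)/n(MA)."""
    if n_ma <= 0:
        raise ValueError("n_ma must be positive")
    return n_all / n_ma


def as_share(ratio):
    """Format the ratio as a share of the literature, e.g. '1:89'."""
    return f"1:{round(ratio)}"


# Hypothetical (n_all, n_ma) pairs; not the actual counts behind Table 3.
examples = {"field_low": (8900, 100), "field_high": (58100, 50)}
shares = {k: as_share(articles_per_meta_analysis(*v)) for k, v in examples.items()}
```

A lower ratio means that meta-analyses make up a larger share of a field's literature.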
A more detailed examination of a sample of meta-analyses published in 1992, 2002, and 2012 shows the evolution of the application of these methods over time. “Cancer NOT (trial* OR treatment) AND meta-analysis [type]” yields 25 items in 1992, 49 in 2002, and 232 in the first 8 months of 2012 alone. On closer examination, 20 of the 25 meta-analysis–tagged articles published in 1992 are indeed meta-analyses related to cancer epidemiology, and the same applies to 44 of 49 tagged articles in 2002 and 50 of the 53 latest indexed articles in 2012.
Besides the geometric increase in the number of published meta-analysis articles each year, the areas represented have changed over time. The advent of meta-analyses on genetics and epigenetics is impressive. In 1992, there was only one quantitative review on leukemia cytogenetics. In 2002, of the 44 meta-analyses, 8 (18%) assessed genetic variants, 1 (2%) genomic hybridization, and 1 (2%) microarrays. In 2012, of the 50 most recently indexed published meta-analyses, 25 (50%) were on genetic variants, 2 (4%) on epigenetics, and another one on gene–menopause interaction. No other field in 2012 had such a staggering increase in the number of meta-analyses (smoking, n = 3; alcohol, n = 3; biomarkers, n = 2; infectious agents, n = 2; dietary, n = 2; social, n = 1; occupational, n = 1; diagnostic tests, n = 3; other, n = 3 among 50 meta-analyses examined).
Moreover, there has been an increasing number of genes and genetic variants examined in meta-analyses over time. All genetic meta-analyses in 2002 focused on specific genes. Conversely, meta-analyses in 2012 also included genome-wide association meta-analyses, consortium analyses examining a number of variants, and field synopses. There is also a discernible change in the types of meta-analyses conducted over time, in particular in the use of consortium approaches and of individual-level data (31). In the examined samples, there were 2 meta-analyses using individual-level data in 1992 (among 20), 5 in 2002 (among 44), and 9 among the most recent 50 meta-analyses in 2012. All the meta-analyses with individual-level data in 1992 and 2002 combined information that had already been collected in existing studies. One meta-analysis combined publicly available data from microarray experiments, whereas all the other meta-analyses created collaborative structures where investigators contributed their data and participated in the final manuscripts. These analyses pertained to nutritional factors (n = 4), hormones (n = 1), or smoking and alcohol (n = 1). Conversely, in 2012, the 9 meta-analyses using individual-level data targeted a very different set of risk factors: genetic factors (n = 6), gene expression data (n = 1), biomarkers (n = 1), and endometriosis (n = 1). The 6 meta-analyses of genetic factors were done by consortia with genotype data generated prospectively for the project. The gene expression meta-analysis used data from publicly available databases, and the other 2 meta-analyses were done by investigators contributing previously collected data.
Caveats in current meta-analyses in cancer epidemiology
Despite the increase in the number and proportion of meta-analyses with individual-level data over time, these still represent the minority. Most currently published meta-analyses in cancer epidemiology continue to depend on published summary data. Many of these meta-analyses focus on genetic variants, often targeting a single or a few candidate genes and variants thereof. Interestingly, among the 50 published meta-analyses from 2012, 12 were done in China and focused on specific candidate genes from the era preceding GWAS. With one exception (32), all of these Chinese meta-analyses concluded that the examined candidate genes are significantly associated with the phenotypes of interest, although the P values were always very modest. On the basis of previous experience with candidate gene associations (33, 34), the credibility of these associations is very low. Another 3 meta-analyses from China addressed genetic variants previously highlighted from GWAS, also included primary data that the authors generated in their own sample, and claimed replication of the genetic effects. Including other fields beyond genetics, overall, 19 of the 50 (38%) meta-analyses in 2012 were from China and 17 of these 19 concluded with significant, favorable results. Previous empirical evaluations suggest that studies from China in different fields have frequent or even ubiquitous “positive” findings (35, 36).
The very high prevalence of “positive” meta-analyses, in the face of what should be mostly null associations, is worrisome. Apparently automation has allowed the massive production of potentially unreliable meta-analyses. The problem seems to be most acute for genetic epidemiology, which carries the lion's share of currently published meta-analyses, but may also extend to other disciplines.
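A rough calculation conveys why such a prevalence of “positive” results is implausible under the null. The sketch below takes the counts quoted above (17 significant among 19) and adds assumptions that are not in the text: every association is truly null, the meta-analyses are independent, and significance is declared at α = 0.05. The text says “mostly null,” not all null, so this is an extreme reference scenario rather than an estimate of bias.

```python
from math import comb


def binomial_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Counts from the text: 17 of 19 meta-analyses reported significant results.
n_meta, n_positive = 19, 17

# Assumption (not from the text): all associations are truly null, tests are
# independent, and significance is declared at alpha = 0.05.
alpha = 0.05

# Expected number of significant results under these assumptions.
expected_positive = n_meta * alpha

# Probability of seeing 17 or more significant results by chance alone.
tail_probability = binomial_tail(n_meta, n_positive, alpha)
```

Under these assumptions about one significant result would be expected, and the chance of observing 17 or more is vanishingly small, so some mixture of true effects and selective reporting must be at work; the calculation cannot say in what proportion.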
Knowledge synthesis: multiple-level information
Besides KS involving the same type of information combined across different studies, KS may also try to synthesize multiple levels of information and/or simulated rather than real data. Cross-design synthesis approaches can combine data from different types of designs, and umbrella reviews try to compile information on different aspects of questions of interest, for example, incidence, prevalence, associations, predictive performance, and clinical treatment effects, if pertinent (37). The IARC monographs combine basic and epidemiologic data to arrive at a systematic approach to classification of carcinogens (38). The HuGENet Venice criteria attempt to do this for genetic associations (39), and Boffetta and colleagues recently proposed a merging of IARC monograph and Venice methods for appraising evidence on gene–environment interactions (40).
As knowledge progresses from discovery to health-related applications, mixed methods become increasingly used to examine the evidence of validity and use of the information. Examples of KS using a mixture of methods for more advanced translation steps include the US Preventive Services Task Force documents for clinical preventive services (41) such as prostate and breast cancer screening (T2 translation stage, “does it work”), the CDC Community Guide for Preventive Services (ref. 42; T3 translation stage, “how does it work in community settings”), and CISNet, an NCI-funded consortium that uses modeling and empirical data to evaluate the impact of different interventions on real-world population outcomes (T4 translation stage), as in a recent modeling article on the contribution of screening and survival differences to racial disparities in colorectal cancer rates (43).
Knowledge synthesis: meta-research
Meta-research (research on research) may allow obtaining wide views on evidence concerning multiple research questions across one or more fields. It may help understand general patterns of study design, reporting, and biases. For example, meta-research evaluations have documented the problems of selective reporting and excess significance biases in cancer epidemiology studies and their meta-analyses (44–50). One may list here also efforts to reproduce published results. Such efforts, ranging from “forensic bioinformatics” to “reproducibility checks,” have shown major reproducibility problems in several research fields such as -omics signatures (51) or preclinical data on drug targets (52).
Knowledge translation: using science to influence research, policy, and practice
KM and KS are not sufficient to move promising applications and interventions into practice. KT is a proactive process that involves communicating and disseminating synthesized information to influence policy, guideline development, practice, and research across the translation continuum. This is the most “messy” component of knowledge integration; it requires the “buy-in” of stakeholders with different perspectives (see, for example, the recently discussed dissemination and implementation agenda for NIH; ref. 53).
Many forces affect the diffusion, adoption, and implementation of evidence-based recommendations into policy and practice and often operate independently from KS. They include public and private investments in research and development, policy and legal frameworks, oversight and regulation, product marketing, coverage and reimbursements, consumer advocacy, provider awareness, access, and health care services development and implementation. Deverka and colleagues (54) showed that for cancer genomic applications, different stakeholders hold disparate views of the synthesized knowledge presented to them. For example, payers generally require a higher level of evidence of clinical use than genomic researchers or test developers. Issues around differential access and implementation may contribute to the “lost in translation” phenomenon (55).
One aspect of KT may involve convening stakeholders around KS to address differences in evidentiary thresholds that drive decision making. This convening function of “knowledge brokering” links researchers and policy makers to facilitate interactions and forge partnerships to use evidence from existing knowledge and define areas for future research. In fields with a rapidly changing landscape such as cancer genomics, KT knowledge brokering may need more robust proactive stakeholder engagement earlier in the decision-making process rather than later (1).
Conclusions and future prospects
The landscape of knowledge integration in cancer epidemiology has changed substantially over time and it continues to change. New methods are used more widely for managing, synthesizing, and translating information. Table 4 summarizes some possibilities that may enhance knowledge integration efforts in the future.
KM may benefit from more proactive steps rather than waiting to handle selectively reported, fragmented published data. Registration of observational datasets (4, 56), more systematic availability of raw data and analysis codes (57, 58), facilitation of repeatability and reproducibility checks (59), building a replication culture (60), and even consideration of live streaming of information may accelerate science, allow prompt recognition of false positives and dead ends, and facilitate translation of interesting observations that can be repeated and validated. The optimal way and implementation mode to achieve these changes need careful study.
In KS, the paradigm of large-scale international collaboration with prospective data collection has become dominant in human genome epidemiology (29), but it can also permeate many other fields more broadly. New epidemiologic studies and biobanks (25) may also be designed with the outlook that they will form part of a larger prospective network, rather than isolated proprietary experiments. For many diseases and subfields, it is possible that several consortia with overlapping purposes may continue to co-exist. This may not necessarily be a drawback, as it may promote competition and independent replication across consortia. Regardless, it is important to have wide views of the information that is or could become available. This would help to avoid funding yet another study when there are hundreds available that can easily address the same question (4) and to prioritize studies and data collection in fields where the wider map of global evidence seems to have a dearth of data.
Finally, KT could benefit by wider spread and brokering of sound evidence from more reliable KM and KS efforts. It may be easier to set upfront goals, expectations, and rules of engagement and make all stakeholders aware of them, rather than wait for debates to be settled post hoc. A fascinating aspect of science is that not everything can be anticipated, but this does not mean that we cannot try to have some upfront planning and more transparency in protocols, analysis plans, and results.
Disclosure of Potential Conflicts of Interest
No potential conflicts of interest were disclosed.
- Received October 9, 2012.
- Revision received October 13, 2012.
- Accepted October 15, 2012.
- ©2012 American Association for Cancer Research.