Evidence for Multi-Dimensional Research Evaluation: Philosophy, History, and Global Practice
Research value cannot be captured by any single metric or philosophical framework. The scholarly literature reveals at least seven distinct philosophical traditions justifying research value—from epistemic truth-seeking to critical emancipation—each with legitimate claims and corresponding evaluation practices. Historical evidence shows that what counts as "valuable research" has transformed dramatically across six major paradigm shifts, from medieval patronage through the contemporary "responsible assessment" movement. Geographic comparison reveals fundamentally incompatible cultural orientations: the Anglophone accountability model, Humboldtian scholarly autonomy, East Asian state-directed modernization, and Global South calls for epistemic justice represent genuinely different conceptions of knowledge's purpose. This plurality provides strong evidence that a customizable, multi-dimensional framework—allowing users to select philosophical lenses and weight dimensions according to context—is not merely practical but philosophically necessary.
Philosophical foundations reveal irreducible plurality
The philosophical justifications for research value cannot be collapsed into a single framework. Seven major traditions have shaped evaluation practices, each tracing to distinct intellectual lineages and generating different criteria for what makes research valuable.
Epistemic/truth-seeking perspectives hold that knowledge has intrinsic value independent of practical applications. Thomas Kuhn identified five epistemic values for assessing theories—accuracy, consistency, broad scope, simplicity, and fruitfulness—that continue to influence contemporary research assessment. Karl Popper's falsifiability criterion elevated bold, testable predictions as the hallmark of valuable science, while Imre Lakatos distinguished progressive from degenerating research programs based on whether they predict novel facts. These criteria appear directly in the UK Research Excellence Framework's evaluation of "originality, significance, and rigour."
Utilitarian/consequentialist perspectives trace from Francis Bacon's vision of science for practical benefit through John Dewey's instrumentalism, which holds that ideas are tools for solving practical problems. This framework dominates contemporary impact assessment: the UK REF requires impact case studies documenting "reach and significance of impacts on economy, society and/or culture," while funders increasingly demand evidence of social benefit, economic return, and policy influence.
Social constructivist perspectives, developed by Helen Longino and Michael Polanyi, argue that scientific objectivity is achieved through community practices rather than individual insight. Longino proposed alternative epistemic values including ontological heterogeneity, complexity of interaction, and decentralization of power—values that influence calls for diverse representation in peer review and recognition that evaluation criteria themselves are socially constructed.
Values-in-science perspectives, articulated by Heather Douglas and Philip Kitcher, demonstrate that non-epistemic values legitimately influence scientific reasoning. Douglas's concept of inductive risk—the possibility of error requiring consideration of consequences—connects directly to responsible research assessment. Kitcher's "well-ordered science" proposes that research agendas should reflect democratic deliberation about significant questions.
Critical/emancipatory perspectives from the Frankfurt School argue that research should serve liberation from domination. Jürgen Habermas identified three knowledge-constitutive interests: technical (prediction and control), practical (mutual understanding), and emancipatory (liberation). These perspectives demand that evaluation attend to who benefits from research and whether power relations are transformed.
Postcolonial/decolonial perspectives, developed by scholars including Linda Tuhiwai Smith, Walter Mignolo, and Boaventura de Sousa Santos, challenge the hegemony of Eurocentric knowledge production. Smith's Decolonizing Methodologies (1999) established a radical research agenda engaging with indigenous knowledge systems, arguing that the pursuit of epistemic justice requires drawing on all sources of knowledge—recognizing diverse ways of knowing that conventional evaluation frameworks systematically exclude.
Aesthetic perspectives value research for elegance, beauty, and intellectual craftsmanship. While rarely explicit in formal criteria, aesthetic judgments pervade peer review, and contemporary frameworks like RESQUE (Research Quality Evaluation) include "ambition, creativity, and innovation" as legitimate dimensions.
These philosophical traditions are not merely academic abstractions—they generate genuinely different evaluation criteria. A utilitarian framework privileges demonstrable social impact; an epistemic framework privileges theoretical advancement regardless of application; a critical framework demands transformation of power relations. A multi-dimensional evaluation system must accommodate this plurality rather than arbitrarily privileging one perspective.
Historical turning points show evaluation as socially constructed
The criteria for evaluating scholarly research have transformed dramatically across six major paradigm shifts, each reflecting underlying philosophical and institutional changes. This historical evidence demonstrates that evaluation systems are social constructions that can be deliberately redesigned.
Pre-modern knowledge production (pre-1660s) operated through patronage systems and religious authority. Knowledge was valued by alignment with doctrine and ancient texts rather than empirical demonstration. Medieval universities at Bologna, Paris, and Oxford prioritized knowledge transmission over knowledge creation, with evaluation based on institutional membership and faithfulness to received wisdom.
The Enlightenment institutionalization (1660s-1800) marked a transition to experimental demonstration and peer witnessing. The Royal Society of London (1662) championed "matters of fact" established through witnessed experiments, publishing Philosophical Transactions (1665), the first journal devoted exclusively to science. Steven Shapin identifies five key transformations: mechanization of nature, depersonalization of knowledge, mechanization of knowledge-making, aspirations for reformed knowledge to serve social ends, and an evolving notion that science was "pure, powerful, benign, and disinterested."
The Humboldtian revolution (1810-1900) established the modern research university model. Wilhelm von Humboldt founded the University of Berlin on principles that transformed global higher education: the unity of teaching and research (Einheit von Forschung und Lehre), academic freedom (Lehr- und Lernfreiheit), and Bildung (education as personal formation). Evaluation shifted to competitive appointment based on scholarly merit, with recognition as an "original contributor to Wissenschaft" becoming the primary criterion for academic advancement.
The post-WWII social contract (1945-1970s) established government-funded basic research through Vannevar Bush's Science: The Endless Frontier (1945). Bush argued that "basic research is the pacemaker of technological progress" and that "scientific progress on a broad front results from the free play of free intellects." This created the linear model (basic research → applied research → development → societal benefit) and institutionalized peer review and merit-based funding through agencies like the National Science Foundation.
The metrics turn (1980s-2010s) introduced quantified accountability through bibliometrics. Eugene Garfield's Science Citation Index and Journal Impact Factor, originally designed to help librarians select journals, became proxies for individual research quality. Jorge Hirsch's h-index (2005), university rankings (from 2003), and output-based funding transformed evaluation into numerical targets. As Jerry Muller documents in The Tyranny of Metrics (2018), this "metric fixation" involved replacing judgment with numerical values, publishing numbers for transparency, and managing people through targets tied to rewards—with significant unintended consequences including gaming, short-termism, and discouraging innovation.
The responsible assessment movement (2010s-present) represents a reaction against metric fixation. The San Francisco Declaration on Research Assessment (DORA, 2012) declared that journal-based metrics should not serve as surrogates for individual article quality. The Leiden Manifesto (2015) established ten principles for responsible metrics, emphasizing that "quantitative evaluation should support qualitative, expert assessment." The Coalition for Advancing Research Assessment (CoARA, 2022) has gathered 700+ organizational signatories committing to recognize diverse contributions, base assessment on qualitative judgment, and abandon inappropriate uses of publication metrics. This trajectory reveals recurring tensions between expert judgment and standardized measurement, autonomy and accountability, intrinsic and instrumental valuation of knowledge.
Geographic traditions embody different knowledge philosophies
Five distinct academic traditions demonstrate fundamentally different philosophical orientations toward research value, shaped by historical and cultural factors. These are not merely variations on a universal theme but represent genuinely incompatible conceptions of knowledge's purpose.
The Western/Anglophone tradition (UK, US, Australia, Canada) emphasizes accountability to taxpayers and demonstrable public benefit. The UK's Research Excellence Framework evaluates outputs (60%), impact (25%), and environment (15%), with impact defined as "effect on, change or benefit to economy, society, culture, public policy or services, health, environment or quality of life, beyond academia." The US tenure system, rooted in the 1940 AAUP Statement of Principles on Academic Freedom and Tenure, evaluates research, teaching, and service, with publication count and journal prestige heavily weighted at elite institutions. The implicit philosophical values include meritocratic competition, utilitarian impact, and market-driven metrics focused on individual researcher achievements.
The Continental European (Humboldtian) tradition prioritizes scholarly autonomy and Bildung. The core principles—unity of teaching and research, academic freedom, and Wissenschaft as unified systematic inquiry—privilege the scholar-teacher pursuing truth independently. Traditional evaluation relied on peer reputation within guild-like academic communities with minimal external assessment. While Germany's Excellence Initiative (2005) and France's HCERES have introduced competitive elements, these systems remain in tension with traditional values. The Netherlands' "Room for Everyone's Talent" program (2019) and Finland's Good Practice in Research Assessment represent European efforts to resist bibliometric dominance.
East Asian traditions (China, Japan, South Korea) reflect Confucian educational philosophy emphasizing knowledge as moral cultivation and social harmony, effort over innate ability, and education as a pathway to national service. China's Project 211 (1995), Project 985 (1998), and Double First-Class Initiative (2015/2017) represent state-driven modernization viewing science as a tool for national competitiveness. Until recent reforms, publication in Science Citation Index journals was mandatory for degrees, hiring, and promotion. The Ministry of Science and Technology's 2020 reform now emphasizes "representative publications" and local Chinese journals, though the Ministry of Education continues to apply Web of Science-based criteria—revealing tensions between metrics as modernization and recognition of context-appropriate excellence.
Global South/postcolonial perspectives challenge what Boaventura de Sousa Santos calls "cognitive injustice." Raewyn Connell's "Southern Theory" and Sabelo Ndlovu-Gatsheni's decolonial critique expose how evaluation systems perpetuate colonial hierarchies: North-South knowledge asymmetry (theory produced in North, data extracted from South), journal imperialism (Anglophone journals and citation networks privileging Global North), and methodological colonialism (Western paradigms imposed as universal). The CLACSO-FOLEC Declaration (2022) with 220+ adherents commits to responsible assessment valuing regional journals, indigenous knowledge, and multilingualism. The Research Quality Plus (RQ+) framework developed by Canada's International Development Research Centre explicitly assesses what matters to research recipients.
Soviet/post-Soviet traditions treated science as state function serving socialist construction. The Academy of Sciences structure separated research institutes from universities, with priorities set by Five-Year Plans. The Akademgorodok model (1957) created scientific communities with greater intellectual freedom than Moscow, while the Naukograd system established approximately 70 science cities across Russia. Current Russian reforms introduce bibliometric pressures from international ranking aspirations while maintaining state-directed research priorities.
The International Science Council's synthesis (2023) identifies reform convergence: narrative CVs replacing exhaustive publication lists, "representative publications" (5-10 selected works rather than raw counts), recognition of diverse research activities (mentoring, data sharing, public engagement), and context-sensitive implementation. Yet these convergences should not obscure genuine philosophical differences about knowledge's ultimate purpose.
Evaluation methods each capture different philosophical values
Current evaluation mechanisms form an ecosystem, with each method measuring particular dimensions of research value while missing others. Understanding what each method actually captures is essential for building a multi-dimensional framework.
Peer review captures epistemic values (truth, rigor, validity) and professional community standards. Its advantages include expert assessment of scientific quality and identification of methodological flaws. However, studies show significant inter-rater disagreement (the average correlation between reviewers' ratings is just 0.34), inability to detect fraud (the Surgisphere papers passed peer review at The Lancet and NEJM, yet outside observers flagged the fraud within days of publication), and a reviewer crisis, with editors now sending up to 35 invitations to secure two reviewers for a single manuscript. Reform proposals include paid professional reviewers, open peer review with published reports (as implemented by eLife and Frontiers), registered reports that review methods before results are known, and post-publication peer review as the primary mechanism.
Journal Impact Factor measures aggregate attention within academia—a prestige proxy originally invented to help librarians select journals. Its limitations are well-documented: highly skewed distributions (top 20% of articles receive 80% of citations even in Nature), field-specific differences making cross-disciplinary comparison invalid, a 2-year window disadvantaging fields with longer research cycles, and manipulation through coercive citation practices. Eugene Garfield himself warned against "misuse in evaluating individuals" due to "wide variation from article to article within a single journal." DORA explicitly states that JIF should not be used "as a surrogate measure of the quality of individual research articles."
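For clarity, the conventional two-year calculation is a simple ratio. Writing $C_Y$ for the citations a journal receives in year $Y$ to items it published in the two preceding years, and $N_{Y-1}$, $N_{Y-2}$ for the numbers of citable items it published in those years:

$$
\mathrm{JIF}(Y) \;=\; \frac{C_Y}{N_{Y-1} + N_{Y-2}}
$$

Because this is a mean over a highly skewed citation distribution, a handful of heavily cited articles can dominate the figure, which is why it says little about any individual paper.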
The h-index attempts to balance productivity with impact—a researcher has index h if h of their papers have each been cited at least h times. It shows some correlation with academic honors but suffers from career-stage bias (it can only increase over time, disadvantaging early-career researchers), no accounting for author order or individual contribution, field-specific differences, and vulnerability to self-citation gaming. As the maxim often attributed to Einstein puts it, "Not everything that counts can be counted, and not everything that can be counted counts."
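The definition translates directly into a few lines of code. This is a minimal sketch; the function name and the example citation counts are illustrative, not drawn from any dataset:

```python
def h_index(citations):
    """Largest h such that at least h papers have each been cited >= h times."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h

# Five papers cited 10, 8, 5, 4 and 3 times give h = 4:
print(h_index([10, 8, 5, 4, 3]))  # 4
```

The career-stage bias noted above is visible in the code itself: adding new papers or new citations can only leave h unchanged or raise it, never lower it.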
Field-normalized indicators (FWCI, CNCI, SNIP) address some limitations by benchmarking against disciplinary expectations. A Field-Weighted Citation Impact of 1.44 means 44% more citations than expected globally. These enable fairer cross-disciplinary comparison but remain limited to measuring academic attention and are database-specific.
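In simplified form, the field-weighted ratio behind these indicators is:

$$
\mathrm{FWCI} \;=\; \frac{\text{citations actually received by an output}}{\text{average citations of outputs of the same field, document type, and publication year}}
$$

so a value of 1.0 matches the world-average expectation for comparable work, and the 1.44 above corresponds to 44% more citations than expected.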
Altmetrics capture democratic engagement and societal reach through social media mentions, policy citations, news coverage, and Wikipedia references. They provide speed (generated immediately, versus years for citations) and diversity (capturing public discussion, policy relevance, media uptake). However, meta-analysis shows a pooled correlation with citation counts of only 0.19 in health sciences, and the meaning of a tweet remains ambiguous—is it commercial interest, curiosity, or disagreement? Gender bias has also been documented: journal articles by female scholars receive altmetric scores that are 27% lower on average.
REF Impact Case Studies capture utilitarian outcomes and public accountability through 4-page narratives demonstrating effects beyond academia. The UK's 6,781 case studies in REF2021 provide rich evidence of "complex, diverse and unique" impact pathways. Criticisms include high cost (£246 million total for REF2014, £55 million for impact alone), reliability issues (panel members struggled to distinguish between rating levels), and linear pathway assumptions that don't capture serendipitous impacts or long-term foundational research.
Economic measures (ROI, patents, licensing) capture utilitarian/commercial value. From 1996-2013, university technology transfer contributed an estimated $518 billion to US GDP and supported 3.8 million jobs. However, Campbell's Law applies: "The more any quantitative social indicator is used for social decision making, the more subject it will be to corruption pressures." Economic impacts take decades to materialize, and most universities' technology-transfer operations do not turn a profit.
Societal impact frameworks (UN SDGs alignment, Responsible Research and Innovation) capture global justice, sustainability, and procedural values. The SDGs provide a universal framework recognized globally, but broad categories may not distinguish quality of contribution, and "tagging" research can be superficial.
This inventory reveals that no single method captures research value comprehensively. Each mechanism embeds particular values and inevitably shapes researcher behavior. The choice of metric is not merely technical but normative—determining what kinds of research, researchers, and impacts are valued.
Existing multi-dimensional frameworks provide design evidence
Several multi-dimensional frameworks already exist, demonstrating feasibility and providing design principles for customizable evaluation. The strongest evidence comes from the Research Quality Plus (RQ+) framework developed by Canada's International Development Research Centre.
RQ+ rests on three core tenets: context matters (quality must be assessed relative to enabling/constraining factors), quality is multi-dimensional (cannot be reduced to single metrics), and systematic empirical appraisal enables transparency. The framework assesses four dimensions: research integrity/scientific rigour (proper design, logical consistency), research legitimacy (stakeholder involvement, ethics), research importance (contribution to knowledge, societal relevance), and positioning for use (potential for uptake, knowledge translation). Crucially, it explicitly evaluates contextual factors including data availability, research capacity, enabling environment, risk profile, and field maturity. The framework is designed as a "dynamic, evolving tool" users can adapt—weights can be adjusted based on organization's mandate, values, and purpose. It has been successfully adapted for multiple contexts including co-production (RQ+ 4 Co-Pro).
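To make the idea of adjustable weights concrete, the sketch below combines the four RQ+ dimension names with purely hypothetical scores, weights, and an aggregation rule; the numbers and the scoring scale are illustrative assumptions, not part of the framework itself:

```python
# Illustrative sketch only: the dimension names come from RQ+, but the scores,
# weights, and weighted-sum aggregation are hypothetical examples of how an
# organization might adapt the tool to its own mandate and values.

def weighted_quality(scores, weights):
    """Combine per-dimension scores into a single weighted appraisal."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(scores[d] * weights[d] for d in scores)

scores = {
    "research integrity":  7,   # rigour, design, logical consistency
    "research legitimacy": 6,   # stakeholder involvement, ethics
    "research importance": 8,   # contribution to knowledge, societal relevance
    "positioning for use": 5,   # potential for uptake, knowledge translation
}
# A funder whose mandate centres on uptake might weight "positioning for use"
# more heavily than a curiosity-driven funder would.
weights = {
    "research integrity":  0.25,
    "research legitimacy": 0.20,
    "research importance": 0.25,
    "positioning for use": 0.30,
}
print(weighted_quality(scores, weights))  # roughly 6.45
```

Changing only the weights, not the evidence, shifts the appraisal, which is why making those choices explicit and transparent matters.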
The Responsible Research and Innovation (RRI) framework developed by the European Commission embeds procedural values through the AREA framework: Anticipate (foresight on impacts), Reflect (critical examination of assumptions), Engage (stakeholder involvement), and Act (adapting based on learning). The six key implementation areas—public engagement, open access, gender equality, ethics, science education, and governance—prioritize how research is conducted alongside what it produces.
The Productive Interactions framework (Spaapen & van Drooge, 2011) reconceptualizes impact as emerging through interactions rather than linear pathways. It identifies three interaction types: direct (personal contacts, meetings, collaboration), indirect (mediated through texts, reports, products, policy instruments), and financial (funding, cooperative agreements). This process-oriented framework makes research-society relationships visible and addresses attribution problems by narrowing the gap between research activity and impact.
The Coalition for Advancing Research Assessment (CoARA) Agreement, signed by 700+ organizations from 55+ countries, sets out ten commitments (four core and six supporting), including: recognize the diversity of contributions, practices, and careers; base assessment primarily on qualitative judgment supported by peer review; abandon inappropriate uses of journal-based metrics; avoid the use of rankings in assessment; and review assessment criteria to account for this diversity. Signatories commit to publishing an action plan within one year and to completing at least one review cycle of their reforms by 2027.
Stakeholder weighting research provides evidence for customization. A Springer Nature survey of 6,600+ researchers (2024) found that respondents estimate publication outputs currently carry about 60% of the weight in how they are evaluated but believe the figure should be closer to 46%, with more weight given to contributions to society and to research culture. Taylor et al.'s (2023) discrete choice experiment identified four weighted criteria: appropriateness (28.9%), significance (25.8%), relevance (25.8%), and feasibility (19.5%)—with research consumers weighting relevance higher and feasibility lower than researchers did.
Cross-framework synthesis reveals common dimensions appearing across multiple systems: scientific rigour/integrity, societal relevance/impact, stakeholder engagement, knowledge translation/use, ethics, open science, diversity/inclusion, and context-sensitivity. These recurring dimensions provide empirical support for a multi-dimensional architecture.
Disciplines embody different epistemological commitments
The imposition of standardized metrics—particularly citation-based indicators developed for STEM fields—systematically disadvantages disciplines with different knowledge production modes. Understanding disciplinary variation is essential for any customizable framework.
Natural sciences (physics, chemistry, biology) privilege epistemic-positivist values: novelty, replication, prediction, and universality. Papers must "represent an advance in understanding likely to influence thinking in the field." Publication culture features journal articles as the dominant output (more than 90% of outputs in epidemiology), rapid publication cycles (a 3-year window captures 97% of physicists' output), multi-author papers, and a strong preprint culture (arXiv: 2+ million documents). Citation metrics work relatively well here, though the replication crisis highlights the tension between novelty-seeking and reproducibility.
Social sciences balance epistemic and utilitarian values, with the rigor-relevance debate central. The WT Grant Foundation (2017) argues "the dichotomy of rigor versus relevance is false—there is no inevitable trade-off between producing rigorous research and producing research with relevance." Publication culture features journal articles predominant but longer than STEM, smaller collaborations, and significant methodological pluralism. Psychology's behavioral science reforms (registered reports, pre-registration) represent responses to replication failures.
Humanities privilege aesthetic-interpretive values: interpretive quality, depth of understanding, scholarly contribution to ongoing disciplinary conversations, and cultural/historical situatedness. Publication culture is radically different: more than 60% of English literature publications are books or chapters, publication rhythm is slower (only 42% of historians publish a book within a 3-year window, versus 81% over a 10-year window), single authorship is normative, and citation databases capture the field poorly. The HuMetricsHSS Initiative found that "it may not be possible to develop a one-size-fits-all list of core values that can inform metrics, even within the boundaries of the humanities."
Arts and creative disciplines validate aesthetic-experiential values through practice-based research—the principle that "creative work in itself is a form of research and generates detectable research outputs." Non-traditional outputs (compositions, performances, exhibitions) require different evaluation approaches. The literature identifies fundamental tension: "The introduction of the arts into research evaluation has brought into co-existence two different modes of evaluation: the artistic mode and the research mode. Those two modes are linked with potentially competing or incompatible values."
Applied fields (engineering, medicine, law) prioritize utilitarian-practical values: impact orientation, problem-solving, translation, and stakeholder utility. Medical research evaluation increasingly uses the translational science continuum (T0-T4), tracking progress from basic biological research through implementation and population health outcomes. Engineering privileges conference proceedings (computer science publishes more in proceedings than in journals) and patents.
Frontiers in Research Metrics and Analytics (2022) documents that "bibliometric evaluation using only journal articles fails to capture more than 50% of the published works in 26 of 170 disciplines over a 10-year timeframe, almost all of which are in the humanities." The Metric Tide report (HEFCE, 2015) established five principles for "responsible metrics": robustness, humility (supporting not supplanting qualitative assessment), transparency, diversity (accounting for disciplinary differences and plurality of career paths), and reflexivity.
Design principles for a customizable framework
The evidence supports building a modular, multi-dimensional framework with the following design principles derived from existing frameworks and empirical research.
Principle 1: Enable qualitative expert judgment as foundation, supplemented by metrics. The Leiden Manifesto's first principle states that "quantitative evaluation should support qualitative, expert assessment." Every major reform framework (DORA, CoARA, Metric Tide) converges on this position. Metrics can inform judgment but cannot replace it.
Principle 2: Require explicit context specification before assessment. RQ+ demonstrates the value of explicitly assessing contextual factors—data availability, research capacity, enabling environment, risk profile, field maturity—before evaluating research quality. Context shapes what constitutes "excellence."
Principle 3: Allow philosophical framework selection with transparent weighting. Users should be able to select which philosophical lenses apply (epistemic, utilitarian, critical, aesthetic, etc.) and assign weights according to their values and purposes. This makes normative commitments explicit rather than hidden (see the configuration sketch after Principle 8 below).
Principle 4: Include process indicators alongside outcome indicators. The Productive Interactions framework demonstrates value in assessing research-stakeholder interactions, not just final outputs. Process indicators capture dimensions of quality (engagement, reflexivity, adaptation) that outcome metrics miss.
Principle 5: Build in stakeholder participation mechanisms. RRI's emphasis on anticipation, reflection, engagement, and responsiveness, combined with evidence that different stakeholders legitimately weight criteria differently, supports incorporating stakeholder input into framework design and application.
Principle 6: Provide explicit guidance on disciplinary variation. The Leiden Manifesto's sixth principle—"account for variation by field in publication and citation practices"—is supported by extensive evidence of disciplinary differences. A customizable framework must enable field-appropriate calibration.
Principle 7: Enable evaluation of diverse outputs beyond publications. DORA and CoARA explicitly call for valuing datasets, software, mentoring, peer review, public engagement, and other contributions. The framework should accommodate non-traditional research outputs appropriate to different fields and career stages.
Principle 8: Support narrative/portfolio approaches. The movement toward narrative CVs (Netherlands, Norway, Finland, UK experiments) and impact case studies demonstrates value in contextualized stories over reduced metrics. The framework should enable researchers to position their work within appropriate interpretive frames.
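As a concrete illustration of Principles 2 and 3, the sketch below shows one way an evaluation profile could declare its context and philosophical lens weights up front; the class name, lens labels, and numbers are hypothetical assumptions for illustration, not part of any existing framework:

```python
# Hypothetical sketch: an "evaluation profile" that states its context
# (Principle 2) and its philosophical lens weighting (Principle 3) explicitly,
# so the normative commitments of an assessment are visible and auditable.
from dataclasses import dataclass, field

LENSES = {"epistemic", "utilitarian", "constructivist", "values-in-science",
          "critical", "decolonial", "aesthetic"}

@dataclass
class EvaluationProfile:
    context: str                                      # Principle 2: explicit context
    lens_weights: dict = field(default_factory=dict)  # Principle 3: explicit weights

    def validate(self):
        unknown = set(self.lens_weights) - LENSES
        if unknown:
            raise ValueError(f"unknown lenses: {unknown}")
        if abs(sum(self.lens_weights.values()) - 1.0) > 1e-9:
            raise ValueError("lens weights must be stated and must sum to 1")
        return self

# A mission-driven funder might declare its commitments like this:
profile = EvaluationProfile(
    context="community-health research programme review",
    lens_weights={"utilitarian": 0.4, "epistemic": 0.3,
                  "critical": 0.2, "decolonial": 0.1},
).validate()
print(profile.context, profile.lens_weights)
```

A different user, say a humanities department valuing interpretive depth, would declare a different profile rather than inheriting hidden defaults; the point is that the weighting becomes an explicit, reviewable artifact.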
Conclusion
The evidence strongly supports building a customizable, multi-dimensional framework for positioning research by value. Seven philosophical traditions offer legitimate but distinct justifications for research value, from epistemic truth-seeking through critical emancipation. Six historical paradigm shifts demonstrate that evaluation criteria are social constructions amenable to deliberate redesign. Five geographic traditions reveal fundamentally different cultural orientations that cannot be collapsed into universal standards. Current evaluation methods each capture different dimensions—peer review measures epistemic validity, impact factors measure academic attention, altmetrics measure public engagement, impact cases measure societal benefit—with none comprehensive.
Existing frameworks like RQ+, RRI, Productive Interactions, and CoARA demonstrate feasibility of multi-dimensional assessment. Disciplinary variation evidence confirms that one-size-fits-all metrics systematically disadvantage humanities, arts, and fields with different knowledge production modes. Stakeholder research shows legitimate variation in how researchers, funders, and publics weight different dimensions.
The key insight is that there is no neutral, universal standard for research quality—evaluation criteria inevitably embody particular philosophical commitments about knowledge's nature and purpose. A customizable framework that makes these commitments explicit, allows dimension selection and weighting, requires context specification, and accommodates disciplinary plurality represents not merely a practical improvement but a philosophically necessary response to irreducible pluralism in how research can legitimately be valued.