Technical Overview

How Colo Works

A plain-language explanation of the analysis pipeline, the quality controls built into it, and what the outputs actually mean, so you can use them critically.

Contents
  1. The big picture
  2. The two agents and why they disagree
  3. Evidence tags and what they mean
  4. The INFERENCE rule: separating reasoning from citation
  5. Consensus checkpoints
  6. How the literature is retrieved
  7. iCite citation quality weighting
  8. What a VERDICT means
  9. Known limitations
  10. Further reading

1. The big picture

Colo is a literature synthesis engine. You give it a research question or hypothesis. It retrieves relevant peer-reviewed papers from PubMed, runs two opposing expert perspectives through a structured debate that grades the evidence behind your idea, then carries the result forward into formal methodology and grant scaffolding. Every step is anchored to specific cited papers.

The synthesis runs across a five-stage pipeline: Setup → Adversarial debate → Methods design → Scaffold → Export. The transcript, citations, and verdicts from each stage carry into the next, so by the end you have a study you could actually run, traceable line-by-line back to the literature you started from.

The output is not a summary. It's a graded, contested evaluation of the hypothesis followed by structured next steps, produced by a system explicitly designed to challenge itself before reaching a conclusion. Think of it as a rapid, literature-grounded peer review that doesn't stop at "this is plausible." It keeps going until you have a study design.

Why this matters

A general chatbot can produce a confident-sounding paragraph about a research question in five seconds. It can't anchor that paragraph in 1.7M+ peer-reviewed abstracts, surface unresolved disagreements between cited studies, or carry the answer forward into a methodology section that meets RCT or NIH standards. Colo is built around the parts of research a generalist tool treats as out of scope.

2. The two agents and why they disagree

Each analysis is run by two agents with different roles:

Agent A: Clinical Researcher

Agent A evaluates the hypothesis from the perspective of applied, patient-facing evidence. It prioritizes clinical trial data, patient outcomes, treatment responses, and real-world feasibility. Its job is to ask: what does the evidence say happens in actual patients?

Agent B: Translational Researcher

Agent B evaluates from the perspective of mechanism and biology. It digs into signaling pathways, resistance mechanisms, cell biology, and whether the clinical conclusions are actually supported by what we know about how the underlying biology works. Its job is to ask: does the mechanism support what the clinical data claims?

This split is intentional. Clinical trial results sometimes outrun the mechanistic explanation. Mechanistic findings sometimes don't translate to patients. Forcing both perspectives into the same conversation, and requiring them to resolve disagreements before reaching a verdict, produces conclusions that are more complete than either perspective alone.

Both roles are fixed by the system: Agent A is always the clinical researcher, and Agent B is always the translational researcher. The domain (e.g., Oncology) and the hypothesis you enter during setup are injected into both prompts so the agents reason within scope.
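
As a rough illustration of the injection described above, the sketch below builds one prompt per agent from a shared template. The template wording, the `ROLES` text, and the `build_prompts` helper are all hypothetical; Colo's actual prompts are not shown in this document.

```python
from string import Template

# Hypothetical role descriptions, paraphrased from the section above.
ROLES = {
    "A": "Clinical Researcher: weigh clinical trial data, patient outcomes, and real-world feasibility.",
    "B": "Translational Researcher: weigh mechanisms, signaling pathways, and biological plausibility.",
}

# Hypothetical template; only the domain and hypothesis vary between runs.
PROMPT = Template(
    "You are Agent $agent, the $role\n"
    "Domain: $domain\n"
    "Hypothesis under evaluation: $hypothesis\n"
    "Stay within this domain and hypothesis."
)


def build_prompts(domain: str, hypothesis: str) -> dict[str, str]:
    """Return one prompt per agent; the roles are fixed, the scope is injected."""
    return {
        agent: PROMPT.substitute(
            agent=agent, role=role, domain=domain, hypothesis=hypothesis
        )
        for agent, role in ROLES.items()
    }


prompts = build_prompts("Oncology", "Drug X improves progression-free survival")
```

Both prompts carry the same domain and hypothesis, so the two agents differ only in the perspective they are asked to take.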

3. Evidence tags and what they mean

Every time an agent cites a paper, it is required to tag the citation with an evidence type. These tags reflect how much confidence the study design lends to a given conclusion.

Tag | Study type | What it means for the claim
[RCT] | Randomized controlled trial | Highest level of clinical evidence. Supports causal claims about treatment effects.
[META] | Meta-analysis or systematic review | Synthesizes multiple studies. Strong for establishing consensus across a body of evidence.
[COHORT] | Observational cohort study | Observational. Establishes associations, not causation. Useful for real-world patterns.
[PRECLINICAL] | Animal model or in vitro study | Mechanistic evidence. Cannot directly support clinical claims without translational data.
[EXPERT] | Expert opinion, review, or editorial | Interpretive, not primary evidence. Useful for framing but carries the lowest evidentiary weight.

How to read this

A verdict supported only by [PRECLINICAL] evidence is much weaker than one supported by [RCT] or [META]. Look at the tags attached to the EVIDENCE lines in each agent's response. They tell you how strong the foundation actually is.
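
One way to apply this when reading a transcript is to pull out the tags in a response and find the strongest one. The sketch below assumes tags appear literally in square brackets, as in the table. The numeric ranks are an illustrative assumption: the text places [RCT] highest and [EXPERT] lowest, and the ordering in between simply follows the table.

```python
import re
from typing import Optional

# Assumed ordinal ranks (higher = stronger support for a clinical claim).
TAG_RANK = {"EXPERT": 1, "PRECLINICAL": 2, "COHORT": 3, "META": 4, "RCT": 5}

TAG_PATTERN = re.compile(r"\[(RCT|META|COHORT|PRECLINICAL|EXPERT)\]")


def strongest_tag(response: str) -> Optional[str]:
    """Return the highest-ranked evidence tag found in the text, or None."""
    tags = TAG_PATTERN.findall(response)
    if not tags:
        return None
    return max(tags, key=lambda tag: TAG_RANK[tag])


def rests_on_weak_evidence(response: str) -> bool:
    """True when no [RCT] or [META] tag appears anywhere in the response."""
    tag = strongest_tag(response)
    return tag is None or TAG_RANK[tag] < TAG_RANK["META"]
```

A response whose strongest tag is [PRECLINICAL] or [COHORT] would be flagged by `rests_on_weak_evidence`, which is the case the paragraph above asks you to read with caution.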

4. The INFERENCE rule: separating reasoning from citation

Agents are required to distinguish between two kinds of statements:

Cited claims: statements directly supported by a paper in the retrieved literature. These are attached to an EVIDENCE line with a PMID or paper title and an evidence tag.

Inferred claims: conclusions that follow logically from the evidence but are not themselves directly stated in any cited paper. These must be prefixed with INFERENCE:.

This separation matters because it prevents an agent from presenting a logical leap as if it were an established finding. A verdict cannot be built on INFERENCE-labeled claims alone. The agents are instructed to enforce this themselves, and the consensus checkpoint requires a verdict to be traceable to [RCT] or [META] evidence.

What to watch for

If you see a VERDICT that relies heavily on INFERENCE-prefixed claims rather than cited papers, treat it with more skepticism. The reasoning may be plausible, but it hasn't been anchored to the literature.
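
The check described above can be approximated mechanically. The sketch below assumes each claim sits on its own line and that cited claims start with `EVIDENCE:` while inferred claims start with `INFERENCE:`. The exact line format of Colo's transcripts is an assumption here, and the function names are hypothetical.

```python
import re

STRONG_TAGS = {"RCT", "META"}
TAG_PATTERN = re.compile(r"\[([A-Z]+)\]")


def classify_lines(response: str) -> dict[str, list[str]]:
    """Split a response into cited, inferred, and other lines."""
    buckets: dict[str, list[str]] = {"cited": [], "inferred": [], "other": []}
    for raw in response.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("INFERENCE:"):
            buckets["inferred"].append(line)
        elif line.startswith("EVIDENCE:"):
            buckets["cited"].append(line)
        else:
            buckets["other"].append(line)
    return buckets


def verdict_is_traceable(response: str) -> bool:
    """True if at least one EVIDENCE line carries an [RCT] or [META] tag."""
    cited = classify_lines(response)["cited"]
    return any(
        tag in STRONG_TAGS
        for line in cited
        for tag in TAG_PATTERN.findall(line)
    )
```

Tags that appear only on INFERENCE lines are ignored by design, so a response cannot pass the check through inferred claims alone.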

5. Consensus checkpoints

Every four agent turns, Colo injects a consensus checkpoint into the conversation. At a checkpoint, both agents are required to stop debating and respond with only three things:

VERDICT: The single most actionable research direction that has emerged, stated as a testable hypothesis. Must be supported by [RCT] or [META] evidence.
AGREE / DISAGREE: Whether their verdict aligns with the other agent's. If DISAGREE, they must state what evidence would resolve the conflict.
NEXT LANE: Which aspect of the topic should be the focus of the next exchange, and why.

If one agent returns DISAGREE, the system flags the unresolved conflict and continues the debate. A final verdict requires both agents to agree, or for the disagreement to be resolved with additional evidence. This prevents a premature consensus when the underlying evidence is genuinely contradictory.
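
The checkpoint logic can be sketched as follows. The four-turn interval comes from the text. The reply format (one `LABEL: value` pair per line, with either an `AGREE:` or a `DISAGREE:` line) and all function names are assumptions made for illustration.

```python
CHECKPOINT_INTERVAL = 4  # from the text: a checkpoint every four agent turns

CHECKPOINT_LABELS = {"VERDICT", "AGREE", "DISAGREE", "NEXT LANE"}


def is_checkpoint(turn_number: int) -> bool:
    """True after every fourth agent turn, counting turns from 1."""
    return turn_number > 0 and turn_number % CHECKPOINT_INTERVAL == 0


def parse_checkpoint(reply: str) -> dict[str, str]:
    """Extract the labeled fields from a checkpoint reply."""
    fields: dict[str, str] = {}
    for line in reply.splitlines():
        label, sep, value = line.partition(":")
        label = label.strip().upper()
        if sep and label in CHECKPOINT_LABELS:
            fields[label] = value.strip()
    return fields


def consensus_reached(reply_a: str, reply_b: str) -> bool:
    """Both agents must give a verdict and agree; any DISAGREE blocks consensus."""
    parsed = (parse_checkpoint(reply_a), parse_checkpoint(reply_b))
    return all(
        "VERDICT" in fields and "AGREE" in fields and "DISAGREE" not in fields
        for fields in parsed
    )
```

When `consensus_reached` returns False, the debate would continue, matching the behavior described above for an unresolved conflict.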

6. How the literature is retrieved

When RAG (Retrieval-Augmented Generation) is enabled, Colo searches a pre-built index of PubMed abstracts to find relevant papers before each agent turn. The retrieval works as follows:

Step 1: Three parallel queries

Rather than searching once with the literal hypothesis text, Colo generates three query angles from the recent conversation: one framing the topic clinically, one mechanistically, and one from a biomarker perspective. This prevents the retrieval from narrowing too early.

Step 2: Semantic similarity search

Each query is converted into a numerical vector using a sentence embedding model and compared against the vectors of every abstract in the index. Papers are ranked by how closely their meaning matches the query, not just whether they share keywords.

Step 3: PMID deduplication

Results from all three queries are merged and deduplicated by PubMed ID so the same paper doesn't appear multiple times in the context. The first occurrence (most semantically relevant) is kept.

Step 4: Quality re-ranking

The top results are re-ranked using a combined score: 70% semantic similarity + 30% iCite citation quality (see below). The final 12 abstracts are passed to both agents as their literature context for that turn.
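
The search, deduplication, and re-ranking steps above can be sketched in a few functions. This is a minimal illustration, not Colo's implementation: the index layout (dicts with hypothetical `pmid`, `vector`, and `quality` keys), the `top_n` default, and the assumption that the citation quality signal is already scaled to the range 0 to 1 are all made up for the example. Query vectors are taken as given, so the embedding model is out of scope. Merged hits are sorted by similarity before deduplication so that the first occurrence kept is the most similar one.

```python
import math

SEMANTIC_WEIGHT = 0.7  # from the text
QUALITY_WEIGHT = 0.3   # from the text
FINAL_K = 12           # from the text


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def search(query_vec: list[float], index: list[dict], top_n: int) -> list[dict]:
    """Rank every abstract in the index against one query vector."""
    hits = [
        {
            "pmid": paper["pmid"],
            "similarity": cosine(query_vec, paper["vector"]),
            "quality": paper["quality"],  # assumed pre-scaled to 0..1
        }
        for paper in index
    ]
    return sorted(hits, key=lambda h: h["similarity"], reverse=True)[:top_n]


def retrieve(query_vecs, index, top_n: int = 20, final_k: int = FINAL_K):
    """Merge hits from all queries, deduplicate by PMID, then re-rank."""
    merged = []
    for query_vec in query_vecs:
        merged.extend(search(query_vec, index, top_n))

    # Sort so the first occurrence of each PMID is its most similar hit.
    merged.sort(key=lambda h: h["similarity"], reverse=True)
    seen, unique = set(), []
    for hit in merged:
        if hit["pmid"] not in seen:
            seen.add(hit["pmid"])
            unique.append(hit)

    for hit in unique:
        hit["score"] = (
            SEMANTIC_WEIGHT * hit["similarity"] + QUALITY_WEIGHT * hit["quality"]
        )
    return sorted(unique, key=lambda h: h["score"], reverse=True)[:final_k]
```

A production index of this size would normally use an approximate nearest-neighbor library rather than the exhaustive loop shown here; the brute-force version is kept only because it makes each step visible.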

7. iCite citation quality weighting

Not all published papers carry equal evidential weight. A landmark randomized trial with 3,000 citations is not the same as a single-center case series with 4. Colo uses the NIH iCite Relative Citation Ratio (RCR) to account for this difference.

What is RCR?

The Relative Citation Ratio is a field-normalized citation metric developed by the NIH. It measures how often a paper is cited relative to other papers in the same research field and publication year. An RCR of 1.0 means a paper is cited at the average rate for its field. An RCR of 5.0 means it is cited at five times the average rate, a sign of outsized influence.

Because RCR is field-normalized, it is more meaningful than a raw citation count. A paper in molecular biology accumulates citations faster than one in a narrow clinical subspecialty. RCR adjusts for this so comparisons across subfields are fair.
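
To make the normalization concrete, the sketch below treats RCR as a simple ratio of a paper's citation rate to the expected rate for its field and year. This is a deliberate simplification for illustration: the actual iCite method defines the field and the benchmark in more detail than shown here. The function name and the example numbers are hypothetical.

```python
def simplified_rcr(citations_per_year: float, field_expected_rate: float) -> float:
    """Ratio of a paper's citation rate to the expected rate for its field."""
    if field_expected_rate <= 0:
        raise ValueError("field_expected_rate must be positive")
    return citations_per_year / field_expected_rate


# Made-up numbers: two papers with very different raw citation rates
# end up with the same field-normalized score.
paper_fast_field = simplified_rcr(citations_per_year=12.0, field_expected_rate=4.0)
paper_slow_field = simplified_rcr(citations_per_year=3.0, field_expected_rate=1.0)
```

Both papers are cited at three times their field's expected rate, so they compare as equals even though one collects four times as many citations per year.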

How Colo uses it

When retrieving papers, Colo blends semantic relevance (how closely a paper's content matches the query) with citation quality (how influential the paper is in its field). The blend is currently 70% semantic / 30% RCR. This means a highly relevant but low-impact paper can still surface, but a highly cited paper on the exact topic will rank near the top.
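
RCR is unbounded while semantic similarity typically falls between 0 and 1, so some scaling is needed before the two can be blended. The text does not say how Colo does this. The sketch below assumes a simple cap-and-divide scheme; `RCR_CAP` and both function names are hypothetical, and only the 70/30 weights come from the text.

```python
RCR_CAP = 5.0  # hypothetical: RCR values at or above this count as maximum quality


def normalize_rcr(rcr: float, cap: float = RCR_CAP) -> float:
    """Clamp RCR to the range 0..cap and scale it to 0..1."""
    return min(max(rcr, 0.0), cap) / cap


def blended_score(similarity: float, rcr: float) -> float:
    """70% semantic similarity plus 30% normalized citation quality."""
    return 0.7 * similarity + 0.3 * normalize_rcr(rcr)


# Made-up papers illustrating the trade-off described above.
on_topic_low_impact = blended_score(similarity=0.9, rcr=0.5)
on_topic_high_impact = blended_score(similarity=0.9, rcr=5.0)
off_topic_high_impact = blended_score(similarity=0.5, rcr=5.0)
```

Under these assumptions the on-topic, highly cited paper ranks first, and the on-topic but low-impact paper still edges out the off-topic, highly cited one.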

Each retrieved abstract is displayed with its RCR score and raw citation count so you can see the quality signal directly.

Known limitation

RCR has a recency bias: papers published in the last 1–2 years haven't had time to accumulate citations, so they will rank lower than older papers even if they are more current. Recent high-quality work may be underweighted. This is a known limitation we are actively working to address.

8. What a VERDICT means

A VERDICT is the agents' consensus on the single most actionable research direction that has emerged from the debate, stated as a testable hypothesis. It is not a clinical recommendation. It is not a definitive answer. It is a synthesis of what the retrieved literature supports, filtered through adversarial debate.

A VERDICT is only considered valid under three conditions:

  1. It is traceable to [RCT] or [META] evidence, as the consensus checkpoint requires.
  2. It does not rest on INFERENCE-labeled claims alone.
  3. Both agents agree, or their disagreement has been resolved with additional evidence.

When you see a verdict card appear in the dialogue, it means the agents have reached this bar. It does not mean the verdict is correct. It means it is internally consistent with the evidence they were given and the logic they were required to follow. Always verify against the cited papers directly.

9. Known limitations

We believe in being transparent about what Colo cannot do. The following are active limitations you should factor into how you use and interpret its outputs. Many of these are scoped for future development and will be addressed as the product matures.

Limitation | What it means in practice
Abstract-only retrieval | The pre-built corpus indexes PubMed abstracts, not full-text articles. Methods sections, supplemental data, and full results tables are not accessible to the agents. Workaround: use the Build custom corpus option in setup to index a domain or paper set you have full-text access to, or paste full-text excerpts into the Literature panel on the adversarial screen. Both routes give the agents direct access to the deeper material.
Publication bias | PubMed skews toward positive results. Studies with null or negative findings are underrepresented, so agents may overweight evidence for efficacy.
Date cutoffs | The corpus covers a defined range (typically 2015–present). Landmark studies outside that window are not indexed and will not be cited unless added manually.
RCR recency bias | Recently published papers have accumulated fewer citations and will be underweighted by the RCR ranking signal, even when they represent important new evidence.
No retraction detection | Papers retracted after indexing are not flagged. Always verify the retraction status of any cited paper before relying on it.
AI reasoning errors | Agents can misread, misattribute, or hallucinate details even when given the abstract. The INFERENCE rule and evidence tagging reduce this risk but do not eliminate it.
Domain coverage | The corpus currently covers oncology, cardiology, neurology, immunology, pharmacology, and infectious disease (~1.3M papers, 2015–2025). Domains outside this scope (e.g., psychiatry, dermatology, ophthalmology) are not yet indexed; queries about them will not surface relevant literature.

10. Further reading

A short list of papers and analyses worth knowing if you want to follow where language-model literature synthesis is heading. The field is moving quickly, and we expect to refactor as the evidence shifts. Colo's commitment is to rigor, not to any specific architecture.

Some entries are arXiv preprints rather than peer-reviewed publications. Treat them as directional signal, not settled findings.

Suggestions welcome at privacy@colo-sci.com. This list is revised as papers are retracted, replicated, or superseded.