Citations, Timestamps, and Trust: How SageTube Answers Are Grounded

May 18, 2026 · by marcin · grounding, citations, product, rag

Every SageTube answer cites the source video and timestamp. When you ask an Expert “what does this creator say about state management?” you get a paragraph with numbered references like [1] [2], and clicking each one jumps to the exact second in the source video where the claim was made. This post explains how that pipeline works under the hood, what happens when no transcript supports your question, and why we treat citation grounding as the core product invariant rather than a bolt-on feature.

We built SageTube around a simple idea: an answer without a verifiable source is just a guess. The product is interesting only if the citations are trustworthy. Everything below is the work we do to keep them trustworthy.

What does “grounded” actually mean when an AI answers a question about video?

A grounded answer is one where every factual claim ties back to specific source material that you can verify in seconds. For SageTube, that source material is a transcript chunk — a few sentences of text extracted from a real YouTube video, with the video ID and the start/end timestamps preserved. When the answer says “Theo recommends Zustand for medium apps because it avoids the boilerplate of Redux,” the [1] next to that sentence opens a YouTube embed at the precise moment Theo said it.

This is different from how ChatGPT or Perplexity cite the web. Their citations point to whole pages. You get a URL, you skim a 3,000-word article, you decide whether the sentence you needed was actually in there. SageTube citations point to a clip, usually 30–60 seconds long, and the user verifies the claim by playing it. Verification takes seconds rather than minutes.

The four-step pipeline that produces a grounded answer:

1. Ingest. When you add a YouTube channel or video to an Expert, SageTube fetches the transcript. About 85% of YouTube videos have captions; we pull those directly. The remaining 15% go through audio extraction (yt-dlp) and API transcription (OpenAI primary, AssemblyAI fallback). Either way, the output is structured text with timestamps preserved per sentence.
2. Chunk + embed. The transcript gets split into a parent/child hierarchy: child chunks of around 500 tokens (roughly 350–400 words) are embedded for retrieval, and parent chunks of around 1,000 tokens are returned to the LLM as broader context for each retrieved child. Every child chunk gets embedded into a 1,536-dimensional vector via OpenAI’s embedding model. Vectors land in Qdrant, our self-hosted vector database. Each chunk row in MySQL keeps the source video ID, the start and end timestamps, and a reference back to the raw transcript.
3. Retrieve. When you ask a question, your query gets embedded the same way. SageTube runs an approximate-nearest-neighbour search against the Expert’s Qdrant collection and pulls the top-N most semantically similar chunks. The cutoff varies by retrieval strategy, but typically 5–20 chunks make it to the next step.
4. Compose + validate. The retrieved chunks become the LLM’s evidence. The model writes a structured answer referencing the chunks by ID. Before the answer reaches you, a server-side validator inspects every reference and strips any that the model tried to invent.

Each of those four steps is a place where the citation could go wrong, and each step has its own defence.
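The retrieve step can be pictured as ranking chunks by cosine similarity to the query vector. The sketch below is an illustration only: it scans every vector by brute force, whereas Qdrant runs an approximate search over an index, and the `Chunk` fields, the ID format, and the `top_n` helper are hypothetical names rather than SageTube's actual code.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Chunk:
    chunk_id: str               # hypothetical ID format
    video_id: str
    start_s: float              # chunk start, seconds into the video
    end_s: float                # chunk end, seconds into the video
    vector: Tuple[float, ...]   # embedding (1,536 dimensions in production)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_n(query_vec: Sequence[float], chunks: List[Chunk], n: int = 5):
    """Return the n highest-scoring (score, chunk) pairs, best first."""
    scored = [(cosine(query_vec, c.vector), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:n]
```

Because every returned chunk carries its video ID and timestamps, whatever the model later cites can be traced back to a specific moment.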

How does SageTube actually emit a citation in the rendered answer?

The visible [1] in a SageTube answer is a hydrated reference. The LLM doesn’t write the URL directly — it writes a short evidence ID (something like E-1, E-2) that points to one of the retrieved chunks. The rendering layer turns those IDs into clickable links via the CitationHydrator service, which knows how to map an evidence ID to the source URL.

For a YouTube-backed Expert, the source URL is built straight from the chunk’s video ID. The format is https://youtube.com/watch?v=<video_id> with the timestamp appended as &t=<seconds> when available — so clicking the citation in your browser opens YouTube at the right moment. For Experts backed by uploaded PDFs or audio files, the source URL points to the relevant page or audio segment within the user’s own storage.
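As a minimal sketch of that URL format, the helper below builds the link from a video ID and an optional start time. The function name and the choice to round down to whole seconds are assumptions made for illustration.

```python
from typing import Optional
from urllib.parse import urlencode


def youtube_citation_url(video_id: str, start_seconds: Optional[float] = None) -> str:
    """Build a watch URL, appending &t=<seconds> only when a timestamp is known."""
    params = {"v": video_id}
    if start_seconds is not None:
        # Assumption: truncate to whole seconds so the link never starts late.
        params["t"] = str(int(start_seconds))
    return "https://youtube.com/watch?" + urlencode(params)


print(youtube_citation_url("VIDEO_ID", 754.8))
# -> https://youtube.com/watch?v=VIDEO_ID&t=754
```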

The hydration step is deliberately a separate service. The LLM’s output is treated as untrusted input, and it can go wrong in three ways: (1) it might mention an evidence ID that doesn’t exist in the retrieved set (a hallucination), (2) it might mention the same ID twice (a duplicate that should be coalesced), or (3) it might leave out IDs that should clearly be cited (an omission). The validator handles cases 1 and 2; case 3 is harder and is part of why we have multiple retrieval strategies. The hydrator runs after the validator, so by the time URLs are emitted, every citation either matches a real chunk or has been stripped.
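A validation pass of this kind might look like the following sketch. It assumes the model writes references inline as `[E-1]`, `[E-2]` and so on; that bracket syntax, the function name, and the return shape are illustrative guesses, not the real AnswerValidator interface.

```python
import re
from typing import List, Set, Tuple

EVIDENCE_REF = re.compile(r"\[(E-\d+)\]")  # assumed inline syntax, e.g. "[E-2]"


def validate_references(answer: str, retrieved_ids: Set[str]) -> Tuple[str, List[str], List[str]]:
    """Strip references to unknown evidence IDs and coalesce adjacent duplicates.

    Returns (cleaned_answer, kept_ids, stripped_ids).
    """
    kept: List[str] = []
    stripped: List[str] = []

    def replace(match):
        ref = match.group(1)
        if ref not in retrieved_ids:
            stripped.append(ref)      # invented reference: remove it
            return ""
        if ref not in kept:
            kept.append(ref)
        return match.group(0)

    cleaned = EVIDENCE_REF.sub(replace, answer)
    # Collapse runs such as "[E-1] [E-1]" into a single reference.
    cleaned = re.sub(r"(\[E-\d+\])(\s*\1)+", r"\1", cleaned)
    # Tidy whitespace left behind by removed references.
    cleaned = re.sub(r"\s+([.,;])", r"\1", cleaned)
    cleaned = re.sub(r" {2,}", " ", cleaned)
    return cleaned.strip(), kept, stripped
```

Only a membership check is shown here. Omitted citations cannot be detected this way, which matches the point that omissions need a different remedy.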

The Chrome extension renders the same citations via the popup chat UI. Same evidence IDs, same hydration pipeline, same validator pass. Users running version 1.35.4 of the extension see numbered references that open YouTube directly in a new tab.

What happens when no transcript supports the question?

This is where SageTube most clearly diverges from a general-purpose LLM. If you ask ChatGPT about something it doesn’t know, it tends to write a confident-sounding answer based on adjacent knowledge — sometimes correct, sometimes not. SageTube refuses.

Concretely: when retrieval returns no relevant chunks for your question, the answer surface says so directly. There’s no fallback to “general knowledge about YouTube” or “what I think this creator probably believes.” The product invariant is that an answer must be supported by something in the Expert’s indexed content, or it must say “I don’t have enough source material to answer this.”
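In code, the invariant amounts to a guard in front of the compose step. The sketch below assumes retrieval yields `(score, chunk)` pairs and that relevance is judged by a score threshold; the `min_score` value and the `compose` callback are hypothetical.

```python
REFUSAL = "I don't have enough source material to answer this."


def answer_or_refuse(scored_chunks, compose, min_score=0.3):
    """Compose an answer only when at least one chunk clears the relevance bar."""
    relevant = [chunk for score, chunk in scored_chunks if score >= min_score]
    if not relevant:
        # No fallback to general knowledge: say so and stop.
        return REFUSAL
    return compose(relevant)
```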

There’s a secondary failure mode worth describing. Retrieval might return chunks that the LLM then tries to over-interpret — citing the chunks for a claim they don’t actually support. AnswerValidator catches this by stripping references that don’t appear in the retrieved evidence set. If the model wrote “Theo says X” and cited evidence ID E-5, but E-5 doesn’t talk about X, the validator removes the [5] reference and the editorial post-processing flags the paragraph as thinly-supported. The user sees a more honest answer with fewer claims rather than a confident-sounding answer built on phantom citations.

The validator also enforces a “distinct sources” floor for any answer that asserts a generalization. A paragraph saying “creators on YouTube tend to recommend X” needs evidence from more than one creator’s chunks; if only one chunk supports it, the generalization gets weakened to “this creator recommends X” before being shipped.
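The distinct-sources floor can be sketched as a count over the creators behind the cited chunks. The mapping from chunk ID to creator and the default floor of two are assumptions drawn from the "more than one creator" wording above, not a documented setting.

```python
from typing import Dict, Iterable, Set


def distinct_creators(cited_ids: Iterable[str], creator_by_chunk: Dict[str, str]) -> Set[str]:
    """Creators behind the cited chunks, ignoring IDs with no known creator."""
    return {creator_by_chunk[c] for c in cited_ids if c in creator_by_chunk}


def generalization_allowed(cited_ids: Iterable[str], creator_by_chunk: Dict[str, str], floor: int = 2) -> bool:
    """True when a cross-creator claim is backed by at least `floor` creators."""
    return len(distinct_creators(cited_ids, creator_by_chunk)) >= floor
```

When the check fails, the paragraph would be narrowed to a claim about the single creator who is actually cited.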

How precise are the timestamps, and why does precision matter?

Timestamps are chunk-boundary precise. A chunk in SageTube is typically 200–400 words of transcript, which usually translates to 30–90 seconds of video depending on speaking rate. The citation hyperlink jumps to the chunk’s start timestamp, so the user lands a few seconds before the cited claim and hears the surrounding context.

This level of precision matters for verification. A citation that drops you within a 60-second window means a user can confirm a claim in under a minute. A citation that drops you on a 30-minute video with no timestamp means a user can’t realistically verify anything — they go back to taking the AI’s word for it. Anchoring claims to specific moments is the difference between “you can audit me” and “trust me.”

The timestamps come from the original transcript source. YouTube captions ship with per-line timestamps directly; the chunker preserves the start of the first line and the end of the last line in each chunk. AssemblyAI and OpenAI’s transcription APIs return word-level timestamps; the chunker aggregates those into chunk-level start/end pairs. Either way, the timestamps in the chunk table reflect the original audio, not a post-hoc estimate.
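Aggregating word-level timestamps into chunk-level pairs can be sketched as follows. For brevity the example splits on a fixed word count rather than a token budget or sentence boundaries, and the `Word` and `TimedChunk` types are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Word:
    text: str
    start_s: float
    end_s: float


@dataclass(frozen=True)
class TimedChunk:
    text: str
    start_s: float   # start of the first word in the chunk
    end_s: float     # end of the last word in the chunk


def chunk_words(words: List[Word], max_words: int) -> List[TimedChunk]:
    """Group words into fixed-size chunks, keeping the original audio timestamps."""
    chunks = []
    for i in range(0, len(words), max_words):
        group = words[i:i + max_words]
        chunks.append(TimedChunk(
            text=" ".join(w.text for w in group),
            start_s=group[0].start_s,
            end_s=group[-1].end_s,
        ))
    return chunks
```

The same first-start and last-end rule applies to caption lines; only the granularity of the input differs.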

How is SageTube different from NotebookLM, Eightify, or general web-search citations?

NotebookLM treats each source as a research notebook page. Its citations link to a page within the notebook and reference a paragraph there. The flow is “upload sources, then ask questions.” Comparing against SageTube: NotebookLM doesn’t have native YouTube channel sync (you paste video links one at a time), doesn’t have a Chrome extension that lives inside YouTube, and citations point to extracted text rather than to a timestamped video clip. NotebookLM is a research-notebook tool; SageTube is a YouTube-native tool.

Eightify focuses on per-video summaries with key-point extraction. Its strength is the single-video summary view. SageTube’s strength is cross-video question answering — asking one question across an entire channel’s catalogue and getting an answer that draws from multiple videos with citations to each. Eightify and SageTube can coexist for different use cases; the citation models are different (Eightify cites by key-point, SageTube cites by chunk + timestamp).

General web-search citations in ChatGPT point to a whole URL. The user has to scan the page to find the supporting sentence. SageTube citations point to a clip, which is verifiable in seconds. The difference matters most for long-form video where the supporting moment might be hidden 47 minutes into a 90-minute episode.

The unifying invariant across all three comparisons: SageTube cites the moment, not the document. When you build the product around making the moment verifiable, the whole user trust model shifts. Users stop reading the answer wondering “where did this come from” and start clicking through to verify the claims they care about.

Why we measure our own citation rate

The other side of being grounded is being cited. SageTube wants to be the source AI engines like ChatGPT and Perplexity quote when users ask “how do I search a YouTube channel with AI?” or “what’s the best YouTube knowledge base tool?” Citation by other AI engines is the leading indicator that the content we publish (this blog included) is reaching the model context windows that compose AI answers.

We measure this with a self-built QA test, GeoVisibilityTest, that runs every 6 hours. It hits OpenAI and Google Vertex’s grounded-generation APIs with a fixed set of ~20 conversational queries, parses the responses, and counts how often sagetube.ai shows up as a cited source. The test produces a Citation Score that we trend over time. Across 160 runs in the last 24 hours as of this writing, the test cost $0.20 to run and recorded 0 citations — the expected baseline for a product that hasn’t yet been discovered by AI crawlers.
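One plausible way to compute such a score is the share of responses that cite the domain at least once. The exact formula behind the Citation Score is not spelled out here, so treat the definition below, along with the function names, as an assumption.

```python
from typing import Iterable, List
from urllib.parse import urlparse


def cites_domain(urls: Iterable[str], domain: str = "sagetube.ai") -> bool:
    """True if any URL's host is the domain or one of its subdomains."""
    for url in urls:
        host = (urlparse(url).hostname or "").lower()
        if host == domain or host.endswith("." + domain):
            return True
    return False


def citation_score(responses: Iterable[List[str]]) -> float:
    """Fraction of responses (each a list of cited URLs) that cite the domain."""
    responses = list(responses)
    if not responses:
        return 0.0
    hits = sum(1 for urls in responses if cites_domain(urls))
    return hits / len(responses)
```

Matching on the parsed hostname rather than a substring avoids counting lookalike domains as citations.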

This blog post is part of the work to change that. Every published post adds a Markdown page with full Article + FAQPage JSON-LD schema, gets included in /sitemap.xml, and ships in /blog/rss.xml. AI search indexers crawl those surfaces; when they answer a question that overlaps with our content, our pages become candidate citations.
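For readers unfamiliar with the schema, a FAQPage block follows the schema.org shape generated below. The helper name is hypothetical and the sample question is lifted from this post; how the real pages are generated is not shown here.

```python
import json


def faq_jsonld(pairs):
    """Build a schema.org FAQPage object from (question, answer) pairs."""
    return {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }


markup = json.dumps(
    faq_jsonld([
        ("How precise are SageTube's timestamps?",
         "Timestamps are chunk-boundary precise."),
    ]),
    indent=2,
)
print(markup)  # embed in a <script type="application/ld+json"> tag
```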

The citation tracker will tell us whether the strategy worked in 60–90 days. The active-closure self-monitor Phase2bCitationDeltaDeadlineTest measures the delta against today’s zero-citation baseline on 2026-09-25, about 18 weeks after this post lands.

What we won’t do

We won’t ship answers without citations. We won’t fall back to “general knowledge” when retrieval comes up empty. We won’t paraphrase a source so loosely that the citation becomes plausible-deniability rather than verification. The cost of any of those would be the only thing that makes SageTube interesting: the user’s ability to trust the answer because the moment is one click away.

If you’re a SageTube user and you ever see an answer with a thin citation or no source on a claim that needed one, that’s a bug, not a feature. Report it — we treat citation regressions as P0.

Frequently asked questions

Does SageTube cite every claim?

Yes. Every paragraph in a SageTube answer carries one or more references back to specific transcript chunks. References that the LLM tries to invent get stripped by AnswerValidator before the answer reaches you: if the LLM cites a chunk that wasn't in the retrieved evidence, that reference is removed.

What happens when no transcript supports the question?

SageTube refuses to fabricate. If retrieval returns no relevant chunks, the answer says so directly rather than inventing a plausible-sounding response. When chunks exist but the model still hallucinates citations, AnswerValidator deletes the unsupported references and the visible answer flags the gap.

How precise are SageTube's timestamps?

Timestamps are chunk-boundary precise, typically within a few seconds of the cited moment. Each chunk is stored with the start/end timestamps from the original transcript (YouTube captions or AssemblyAI/OpenAI transcription output), and the citation hyperlink jumps the user directly to that point in the embedded video.

How do SageTube citations differ from web-search citations in ChatGPT?

Web-search citations point to a whole page. SageTube citations point to a specific transcript clip in a specific video at a specific timestamp. The user can verify the claim in seconds by playing the clip instead of skimming a multi-thousand-word article looking for the supporting sentence.

Does the answer change if the model is wrong about which chunk supports a claim?

AnswerValidator runs a server-side validation pass that classifies every reference in the model's output as either valid (present in the retrieved evidence) or invalid (a hallucination), strips the invalid ones, and counts the distinct sources remaining. If too many references were stripped, the system surfaces the gap rather than silently shipping a thin answer.
