What Does SPDF Stand For?

SPDF stands for Semantic PDF, a universal document format designed to make any type of content deeply searchable using AI. While the name references PDF, the SPDF format goes far beyond traditional documents. It serves as a unified container for PDFs, videos, audio recordings, and images, all enriched with semantic embeddings that enable intelligent search and retrieval.

The SPDF format was created to solve a fundamental problem in research and knowledge management: traditional file formats store content but not meaning. A regular PDF contains text and layout information, but it has no understanding of what the text actually says. SPDF bridges that gap by pairing every piece of content with a dense vector embedding that captures its semantic meaning.

Why Regular PDFs Are Not Enough

Standard PDFs were designed for printing and visual display, not for machine understanding. When you search inside a regular PDF, you are limited to exact keyword matching. If a paper discusses "neural network architectures" but you search for "deep learning models," a standard PDF search will return nothing, even though the concepts are closely related.
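The gap is easy to reproduce. The sketch below is a plain substring search standing in for the kind of exact matching described above. The function name `keyword_search` and the sample pages are made up for illustration and do not come from any PDF reader's API.

```python
def keyword_search(pages, query):
    """Return indices of pages that contain the query as an exact,
    case-insensitive substring (a stand-in for lexical PDF search)."""
    needle = query.lower()
    return [i for i, text in enumerate(pages) if needle in text.lower()]


# Hypothetical page texts, invented for this example.
pages = [
    "We compare several neural network architectures on image classification.",
    "The appendix lists hyperparameters for each experiment.",
]

print(keyword_search(pages, "neural network architectures"))  # [0]
print(keyword_search(pages, "deep learning models"))          # []
```

The second query describes the same topic as the first page, yet nothing matches because no page contains that exact phrase.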

Beyond text search limitations, regular PDFs have several other shortcomings for research workflows:

No semantic understanding: Search is purely lexical, missing synonyms, paraphrases, and related concepts.
No cross-modal support: A PDF cannot link its text to related video lectures, audio recordings, or figure descriptions in a unified way.
Poor metadata extraction: Title, authors, and publication details are embedded in visual layout rather than structured data, making automated extraction unreliable.
No chunking or segmentation: There is no built-in concept of meaningful text segments optimized for retrieval.

SPDF addresses every one of these limitations by adding structured layers of semantic information on top of the original content.

The SPDF Structure

An SPDF file is a structured container with several key components that work together to enable intelligent search and retrieval.

Metadata

Every SPDF includes rich, structured metadata extracted from the source document:

Title, authors, publication date, and DOI
Source file information and format details
Processing timestamps and model versions used
Custom tags and library associations

This metadata is extracted automatically during conversion using specialized OCR and extraction models, though users can also edit and correct it manually.
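To make the list above concrete, here is one way such a metadata record could be modeled. This is only a sketch: the class name `SpdfMetadata`, every field name, and the JSON serialization are assumptions made for illustration, not the actual SPDF schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List, Optional


@dataclass
class SpdfMetadata:
    """Hypothetical metadata record; field names are illustrative only."""
    title: str
    authors: List[str]
    publication_date: Optional[str] = None  # ISO 8601 date string, if known
    doi: Optional[str] = None
    source_file: str = ""
    source_format: str = ""
    processed_at: str = ""                  # processing timestamp
    model_versions: Dict[str, str] = field(default_factory=dict)
    tags: List[str] = field(default_factory=list)
    libraries: List[str] = field(default_factory=list)


# Placeholder values, not a real publication.
meta = SpdfMetadata(
    title="Example Paper",
    authors=["A. Author", "B. Author"],
    source_file="example.pdf",
    source_format="pdf",
    tags=["to-read"],
)

# A manual correction is just a field update on the structured record.
meta.title = "Example Paper (Corrected Title)"

print(json.dumps(asdict(meta), indent=2))
```

Keeping the fields structured, rather than buried in page layout, is what makes both automatic extraction and later manual edits straightforward.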

Chunks

The document content is split into semantically meaningful chunks, each representing a coherent unit of information such as a paragraph, a section, or a figure caption. Each chunk stores:

The raw text content
Its position within the source document (page number, bounding box)
A unique identifier for precise citation

Chunking respects paragraph and section boundaries rather than splitting at arbitrary character counts. As a result, each chunk tends to carry a complete thought, which improves both search relevance and citation accuracy.
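A minimal sketch of boundary-respecting chunking follows. It assumes paragraphs are separated by blank lines and packs whole paragraphs into a chunk until a size budget is reached. The `Chunk` class, the `chunk_paragraphs` function, the `max_chars` budget, and the ID scheme are all hypothetical, and the real Scholaris chunker may work quite differently.

```python
import hashlib
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Chunk:
    chunk_id: str   # unique identifier usable in citations
    text: str       # raw text content
    page: int       # page number in the source document
    bbox: Optional[Tuple[float, float, float, float]] = None  # not computed here


def chunk_paragraphs(text: str, page: int, max_chars: int = 400) -> List[Chunk]:
    """Group whole paragraphs into chunks of at most max_chars characters.

    A paragraph is never cut: one that exceeds the budget on its own
    becomes a single oversized chunk.
    """
    paragraphs = [" ".join(p.split()) for p in text.split("\n\n") if p.strip()]
    groups: List[List[str]] = []
    current: List[str] = []
    for para in paragraphs:
        candidate = current + [para]
        if current and len("\n\n".join(candidate)) > max_chars:
            groups.append(current)   # close the chunk at a paragraph boundary
            current = [para]
        else:
            current = candidate
    if current:
        groups.append(current)

    chunks = []
    for index, group in enumerate(groups):
        body = "\n\n".join(group)
        digest = hashlib.sha1(body.encode("utf-8")).hexdigest()[:8]
        chunks.append(Chunk(f"p{page}-c{index}-{digest}", body, page))
    return chunks


sample = (
    "First paragraph about attention.\n\n"
    "Second paragraph, still short.\n\n"
    "Third paragraph that starts a new chunk."
)
for chunk in chunk_paragraphs(sample, page=1, max_chars=70):
    print(chunk.chunk_id, "|", chunk.text.replace("\n\n", " / "))
```

The design choice worth noting is that the size limit only decides where to stop adding paragraphs. It never cuts a paragraph in the middle.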

Embeddings

This is the heart of what makes SPDF powerful. Every chunk is paired with a dense vector embedding generated by a multimodal embedding model. These embeddings capture the semantic meaning of the content in a high-dimensional vector space where similar concepts cluster together.

Scholaris uses the Qwen3-VL-Embedding model, which generates unified embeddings for text, images, and video frames. This means a text query like "graph showing training loss over epochs" can match against a figure in a document even if the figure caption does not contain those exact words.
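The retrieval step behind this can be sketched as a cosine-similarity ranking. The three-dimensional vectors below are hand-written toy values, not output from Qwen3-VL-Embedding, and `cosine_similarity` and `rank` are illustrative helper names rather than Scholaris functions.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank(query_vec: Sequence[float],
         index: List[Tuple[str, Sequence[float]]]) -> List[Tuple[str, float]]:
    """Score every indexed item against the query, best match first."""
    scored = [(label, cosine_similarity(query_vec, vec)) for label, vec in index]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors: in a real system these would come from the embedding model.
toy_index = [
    ("figure: loss curve over epochs", [0.9, 0.1, 0.0]),
    ("paragraph: dataset licensing", [0.0, 0.2, 0.9]),
]
toy_query = [0.8, 0.2, 0.1]  # stands in for "graph showing training loss over epochs"

for label, score in rank(toy_query, toy_index):
    print(f"{score:.2f}  {label}")
```

Because the figure and the text query are assumed to live in the same vector space, the same scoring function works regardless of which modality each item came from.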

The embedding model supports Matryoshka Representation Learning (MRL), so the embedding dimension can be set anywhere from 64 to 2048, trading precision against storage depending on the requirements of your setup.
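With MRL-style embeddings, a shorter vector is typically obtained by keeping the leading dimensions and re-normalizing. The sketch below shows that general technique on a made-up four-dimensional vector. Whether Scholaris shortens vectors exactly this way is an assumption, and `truncate_embedding` is a hypothetical helper.

```python
import math
from typing import List


def truncate_embedding(vec: List[float], dims: int) -> List[float]:
    """Keep the first `dims` components and rescale them to unit length."""
    if not 1 <= dims <= len(vec):
        raise ValueError("dims must be between 1 and len(vec)")
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize an all-zero prefix")
    return [x / norm for x in head]


full = [3.0, 4.0, 12.0, 0.0]   # made-up vector standing in for a 2048-d embedding
short = truncate_embedding(full, 2)
print(short)                    # [0.6, 0.8]
```

Halving the dimension halves the storage per chunk, at the cost of discarding whatever information the dropped components carried.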

Pages and Visual Content

For PDF sources, the SPDF stores page-level information including preview images and the spatial layout of text blocks. For video and audio sources, the format stores:

Video frames: Keyframes extracted at scene boundaries and fixed intervals, each with its own embedding
Video segments: Time-coded sections with transcribed text and speaker labels
Audio segments: Transcribed speech segments with timestamps and speaker diarization

This unified structure is what enables cross-modal search, where a single query can return results from text documents, video recordings, and audio files simultaneously.

Cross-Modal Search

One of the most powerful features enabled by the SPDF format is cross-modal search. Because text, images, video frames, and audio transcriptions all exist in the same embedding space, you can search across all of them with a single query.

For example, imagine you have a library containing a research paper (PDF), a recorded conference talk (video), and a podcast interview (audio) all related to the same topic. With SPDF, a search query like "limitations of attention mechanisms" would return:

The relevant paragraph from the PDF
The video segment where the speaker discusses the topic (with a timestamp you can jump to)
The audio clip from the podcast where the interviewee mentions the limitations

Each result includes a precise citation: page numbers for PDFs, timestamps for video and audio. Any finding can therefore be traced back to its original source.
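As a rough illustration of how one result type can carry either kind of citation, the sketch below formats page-based and time-based hits. The `SearchHit` class, its fields, the citation strings, and the file names are invented for this example and are not the Scholaris result format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SearchHit:
    source: str                            # file the hit came from
    kind: str                              # "pdf", "video", or "audio"
    page: Optional[int] = None             # set for PDF hits
    start_seconds: Optional[float] = None  # set for video and audio hits


def format_timestamp(seconds: float) -> str:
    """Render a number of seconds as HH:MM:SS, dropping fractions."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def format_citation(hit: SearchHit) -> str:
    """Cite PDFs by page and time-based media by timestamp."""
    if hit.kind == "pdf":
        if hit.page is None:
            raise ValueError("PDF hit needs a page number")
        return f"{hit.source}, p. {hit.page}"
    if hit.start_seconds is None:
        raise ValueError("video/audio hit needs a start time")
    return f"{hit.source} @ {format_timestamp(hit.start_seconds)}"


hits = [
    SearchHit("paper.pdf", "pdf", page=7),
    SearchHit("talk.mp4", "video", start_seconds=1325.0),
    SearchHit("podcast.mp3", "audio", start_seconds=3725.4),
]
for hit in hits:
    print(format_citation(hit))
# paper.pdf, p. 7
# talk.mp4 @ 00:22:05
# podcast.mp3 @ 01:02:05
```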

How Scholaris Creates SPDFs

When you upload a document to Scholaris, a multi-stage pipeline processes it into SPDF format:

Text extraction: OCR models (GLM-OCR) extract text from every page, handling scanned documents, handwriting, and complex layouts.
Metadata detection: Title, authors, dates, and other metadata are identified from the document structure.
Intelligent chunking: The extracted text is segmented into semantically coherent chunks.
Embedding generation: Each chunk is passed through the multimodal embedding model to produce its vector representation.
Indexing: The chunks and embeddings are stored and indexed for fast retrieval.
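The five stages above can be sketched as a chain of small functions. Every function here is a placeholder written for this example: `extract_text` does no OCR, `detect_metadata` uses a toy first-line heuristic, and `embed_chunks` returns meaningless numbers rather than semantic vectors. Only the order of the stages mirrors the pipeline described above.

```python
from typing import Dict, List


def extract_text(raw_pages: List[str]) -> List[str]:
    # Stand-in for OCR: the input is already text, so only trim it.
    return [page.strip() for page in raw_pages]


def detect_metadata(pages: List[str]) -> Dict[str, str]:
    # Toy heuristic: treat the first non-empty line of page 1 as the title.
    lines = [ln.strip() for ln in pages[0].splitlines() if ln.strip()] if pages else []
    return {"title": lines[0] if lines else ""}


def chunk_pages(pages: List[str]) -> List[dict]:
    # One chunk per blank-line-separated paragraph, tagged with its page.
    chunks: List[dict] = []
    for page_number, text in enumerate(pages, start=1):
        for para in text.split("\n\n"):
            if para.strip():
                chunks.append({"id": f"c{len(chunks)}", "page": page_number,
                               "text": para.strip()})
    return chunks


def embed_chunks(chunks: List[dict]) -> Dict[str, List[float]]:
    # Placeholder vectors (character and word counts), NOT real embeddings.
    return {c["id"]: [float(len(c["text"])), float(len(c["text"].split()))]
            for c in chunks}


def build_index(chunks: List[dict], vectors: Dict[str, List[float]]) -> Dict[str, dict]:
    # Store each chunk next to its vector, keyed by chunk ID.
    return {c["id"]: {"chunk": c, "vector": vectors[c["id"]]} for c in chunks}


def convert(raw_pages: List[str]) -> dict:
    pages = extract_text(raw_pages)          # 1. text extraction
    metadata = detect_metadata(pages)        # 2. metadata detection
    chunks = chunk_pages(pages)              # 3. chunking
    vectors = embed_chunks(chunks)           # 4. embedding generation
    index = build_index(chunks, vectors)     # 5. indexing
    return {"metadata": metadata, "chunks": chunks, "index": index}


doc = convert(["A Title\n\nBody paragraph one.", "Second page text."])
print(doc["metadata"]["title"], "|", len(doc["chunks"]), "chunks")
```

Keeping each stage as its own function makes it easy to swap a placeholder for a real model without touching the rest of the chain.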

For video files, additional steps include audio extraction, speech transcription with speaker diarization, keyframe extraction, and frame-level embedding. For audio files, the process includes transcription, speaker identification, and segment-level embedding.

The entire pipeline runs locally on your machine, ensuring that your documents are never sent to external servers. Processing speed depends on your hardware: a 15-page PDF takes about three minutes on an NVIDIA RTX 3070, while a one-hour video takes approximately 15 to 20 minutes.

Getting Started with SPDF

If you are ready to start building your own semantic document library, check out our Getting Started with Scholaris guide. It walks you through installation, uploading your first document, and running your first semantic search query.

The SPDF format is the foundation of everything Scholaris does. By transforming static documents into semantically rich, searchable objects, it turns your personal library into an intelligent knowledge base that understands what your documents mean, not just what words they contain.