Where file import plugs in: Each format parser produces HMBlockNode[], then the existing flattenToOperations() converts those into DocumentOperation[] (ReplaceBlock + MoveBlocks ops). From there, the standard signing and publishing flow takes over — no changes needed downstream.
For updates (document update --replace-body), the CLI already has matchBlockIds() and computeReplaceOps() in block-diff.ts that diff old vs new block trees and emit minimal operations. File import can use the same path.
DOCX — mammoth
npm:
mammothBundle size: ~160 KB gzipped
Browser support: Yes
Structural quality: Excellent
Converts
.docxto clean semantic HTML. Maps Word styles to HTML elements:| Word Style | HTML Output | |-----------|-------------| | Heading 1 |
<h1>| | Heading 2 |<h2>| | Bold |<strong>| | Italic |<em>| | Lists |<ol>/<ul>| | Tables |<table>| | Images |<img>(base64) |Custom style mappings are supported.
Repo: https://github.com/mwilliamson/mammoth.js
SDK Integration
Two possible conversion paths:
Path A — mammoth → HTML → BlockNote's HTML-to-blocks
BlockNote (used in
packages/editor) already hastryParseHTMLToBlocks()which converts HTML into its internal block model. From there, the existing editor-to-HMBlockNode conversion can be reused. This path gets the richest output with minimal new code but couples the import to the editor package.Path B — mammoth → HTML → custom HMBlockNode[] mapper
Write a standalone HTML-to-HMBlockNode converter. Walk the DOM tree and map:
<h1>–<h6>→{type: 'Heading', text, annotations}with heading-level nesting<p>→{type: 'Paragraph', text, annotations}<strong>,<em>,<code>,<a>→ annotation spans withstarts/endscharacter positions<ul>/<ol>→ Paragraph blocks withchildrenType: 'Unordered' | 'Ordered'<table>→ Code block (current CLI behavior) or future Table block type<img>→ extract base64 data, upload via blob storage, create{type: 'Image', link: ipfsUrl}
Then call
flattenToOperations(blocks)and feed into the standarddocument createordocument updateflow.Image handling note: mammoth extracts images as base64
<img>tags. These need to be uploaded to blob storage (UnixFS chunking, ~256KB chunks, viastoreBlobswith 4MB message limit) and replaced with IPFS CID links before creating the Image blocks.CLI usage would look like:
seed-cli document create z6Mk... --title "Imported Doc" --docx report.docx --key main seed-cli document update hm://z6Mk.../doc --replace-docx updated.docx --key mainAlternative:
officeparsernpm:
officeparserProduces a hierarchical AST (paragraphs, headings, tables, lists, bold/italic)
Also handles
.pptx,.xlsx,.odt,.pdf,.rtfNewer and less battle-tested than mammoth
XLSX — read-excel-file or xlsx (SheetJS)
Option A: read-excel-file (lightweight, recommended)
npm:
read-excel-fileBundle size: ~37 KB gzipped
Browser support: Yes (browser-first design)
Returns rows as arrays of cells (string/number/Date/boolean)
Supports schema-based parsing for typed JSON objects
Does not handle formulas
Option B: xlsx / SheetJS (full-featured)
npm:
xlsxBundle size: ~300–500 KB gzipped
Browser support: Yes
Handles merged cells, formulas, number formatting
Caveat: npm version stuck at 0.18.5; newer versions distributed via
cdn.sheetjs.comDocs: https://docs.sheetjs.com/
SDK Integration
Spreadsheets don't map to the existing markdown parser at all — they need a dedicated converter.
Conversion flow:
.xlsx → read-excel-file/SheetJS → rows[][] per sheet
→ for each sheet:
Heading block (sheet name)
→ Table-like structure as child blocks
→ flattenToOperations(blocks)
→ standard signing/publish flow
Block mapping options:
As a Code block (simplest, matches current CLI behavior for tables): Render the spreadsheet as a markdown/CSV table string inside a
{type: 'code-block', text: csvString}block. The CLI already renders markdown tables as Code blocks.As nested Paragraph blocks (richer but verbose): Each row becomes a Paragraph block, cells separated by formatting. Poor UX for large sheets.
Multi-sheet handling: Each worksheet becomes a Heading block with the sheet's content as children. The flattenToOperations() function already handles nested parent-child block trees via MoveBlocks operations.
CLI usage would look like:
seed-cli document create z6Mk... --title "Q4 Report" --xlsx data.xlsx --key main
# Each sheet becomes a section under the document
Practical recommendation: Start with Option A (read-excel-file) rendering sheets as Code blocks containing markdown tables. This requires zero schema changes and works with the existing pipeline immediately.
PDF — pdfjs-dist
npm:
pdfjs-distBundle size: ~400–800 KB gzipped (heavy)
Browser support: Yes (this is Firefox's PDF engine)
Structural quality: Low — PDFs have no semantic structure
Extracts text items with position coordinates (x, y), font info, and text content per page via
page.getTextContent(). Does not natively give you headings, paragraphs, or tables — you must reconstruct structure from:Headings: larger font size
Paragraphs: spatial proximity grouping
Tables: grid-aligned text detection
Images: separate rendering pass
Repo: https://github.com/mozilla/pdf.js
SDK Integration
PDF is the hardest format because
pdfjs-distreturns positioned text items, not semantic blocks. A heuristic layer is needed between the parser andHMBlockNode[].Conversion flow:
.pdf → pdfjs getTextContent() → TextItem[] with {str, transform, fontName, ...} → heuristic grouping: 1. Group text items into lines (same Y coordinate within tolerance) 2. Group lines into paragraphs (vertical gap < threshold) 3. Detect headings (font size > body font size) 4. Detect lists (lines starting with "•", "-", "1.") → HMBlockNode[] (mostly Paragraph + Heading blocks) → flattenToOperations(blocks) → standard signing/publish flowWhat maps cleanly to HMBlockNode:
Large-font text →
{type: 'Heading', text}with annotationsBody text groups →
{type: 'Paragraph', text}with bold/italic from font metadataBullet/numbered patterns → Paragraph with
childrenType: 'Unordered' | 'Ordered'
What doesn't map well:
Multi-column layouts (text items interleave columns)
Tables (must detect grid alignment — complex heuristic)
Images (requires separate canvas rendering per page, then blob upload)
Headers/footers (appear as regular text items)
Mathematical formulas (not extractable as LaTeX)
The existing
parseInlineFormatting()inmarkdown.tsbuilds annotation spans from character positions. The same pattern applies here — as you concatenate text items into a paragraph string, track font changes to create Bold/Italic annotations with correctstarts/endsoffsets.CLI usage would look like:
seed-cli document create z6Mk... --title "Research Paper" --pdf paper.pdf --key main # Best-effort text extraction with heuristic heading detectionPractical recommendation: Offer this as a "basic text import" rather than promising structural fidelity. The
--verboseflag could show warnings about low-confidence heading detection or skipped elements.Alternative:
unpdfnpm:
unpdfBundle size: ~390 KB gzipped
Cleaner async/await API wrapper around pdf.js
Methods:
extractText,extractLinks,getMetaRepo: https://github.com/unjs/unpdf
SDK note:
extractTextreturns plain string per page — simpler but loses all font/position metadata needed for heading detection. Only viable for plain-text-only import.
Summary
All three formats converge at the same point: once you have HMBlockNode[], the existing flattenToOperations() → createChange() → createVersionRef() → client.publish() pipeline handles the rest unchanged.
DOCX is the clear quick win — mammoth's HTML output maps almost 1:1 to the block model. XLSX is straightforward but limited to table-like rendering. PDF requires the most custom heuristic code for the least structural fidelity.
Do you like what you are reading?. Subscribe to receive updates.
Unsubscribe anytime