HTML in HSV

Parallel-parseable structured documents

HSV enables fully parallel processing of HTML-like structured documents. Represent HTML as HSV, split at any delimiter, parse chunks on separate cores. No speculation, no state synchronization, linear scaling.

Proof-of-concept: go-html-parallel — parallel HTML parsing in ~50 lines of Go.

The Problem with HTML Parsing

HTML and XML parsing has been sequential for 30 years. Research into parallel HTML parsing achieved limited results:

Project	Year	Approach	Result
HPar	2013	Speculative data-parallel	2.4x on 4 cores
ZOOMM	2013	Qualcomm parallel browser	2x (whole engine)
Servo	2017	Off-main-thread parsing	Tokenization only

Why so little progress? HTML's parsing model is inherently stateful:

State machine: Track whether you're inside a tag, attribute, comment, CDATA...
Tag matching: Opening tags must match closing tags
Escape sequences: &amp;, &lt;, &gt;, &quot;
Context-dependent: < means different things in different places

You cannot split an HTML file in the middle and parse the chunks independently, because the meaning of each byte depends on everything before it. The parser must process the input byte by byte from the start.

HTML in HSV

HSV represents HTML-like structures using its own delimiters:

Traditional HTML

<div class="container">
  <p>Hello world</p>
  <a href="https://example.com">
    Click here
  </a>
</div>

HSV

[STX] html:div [US] [SSA]
  html:class [US] container [RS]
  html:p [US] Hello world [RS]
  html:a [US] [SSA]
    html:href [US] https://example.com [RS]
    html:text [US] Click here
  [ESA]
[ESA] [ETX]

The structure uses HSV control characters:

[STX] / [ETX] (0x02/0x03): start and end of the document
[SSA] / [ESA] (0x86/0x87): nesting (like opening/closing tags)
[RS] (0x1E): sibling elements
[US] (0x1F): attribute/value pairs
`html:` prefix: namespace convention
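To make the layout concrete, here is a minimal sketch in Go that builds the example document above byte for byte. The helper names (pair, nested, buildExample) are hypothetical and not part of any HSV library. The sketch assumes the indentation and spaces in the example are only there for readability, and that SSA and ESA are written as the code points U+0086 and U+0087.

```go
package main

import (
	"fmt"
	"strings"
)

// Control characters used in the example above. RS and US are ASCII.
// SSA and ESA are C1 controls, written here as the code points U+0086
// and U+0087 (an assumption about how they are encoded in a Go string).
const (
	STX = "\x02"
	ETX = "\x03"
	RS  = "\x1e"
	US  = "\x1f"
	SSA = "\u0086"
	ESA = "\u0087"
)

// pair joins a key and its value with US.
func pair(key, value string) string { return key + US + value }

// nested wraps already-encoded children in SSA ... ESA, separated by RS.
func nested(children ...string) string {
	return SSA + strings.Join(children, RS) + ESA
}

// buildExample encodes the <div class="container"> example without the
// presentational whitespace.
func buildExample() string {
	return STX + pair("html:div", nested(
		pair("html:class", "container"),
		pair("html:p", "Hello world"),
		pair("html:a", nested(
			pair("html:href", "https://example.com"),
			pair("html:text", "Click here"),
		)),
	)) + ETX
}

func main() {
	// %q makes the otherwise invisible control characters visible.
	fmt.Printf("%q\n", buildExample())
}
```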

Parallel Parsing

HTML-in-HSV inherits HSV's parallel-parseable structure.

Aspect	Traditional HTML	HTML in HSV
Parsing	Sequential state machine	Parallel split
Escaping	Required (`<`, `&`)	Never
Split point	Cannot split safely	Any delimiter
Multi-core	Single-threaded	Trivially parallel

Find a delimiter (SSA, ESA, RS, US), split, parse chunks on separate cores. Same as any HSV data.
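The split-then-parse idea can be sketched in Go as follows. This is not the go-html-parallel code. It is a simplified illustration that handles only flat RS/US records, and the names Field, splitAtDelimiter, parseRecords, and parseParallel are hypothetical. Nested SSA/ESA content would need an extra step to reassemble the tree, which is not shown here.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

const (
	RS = "\x1e" // record separator
	US = "\x1f" // unit separator
)

// Field is one key/value pair.
type Field struct {
	Key, Value string
}

// splitAtDelimiter cuts data into at most n chunks. Each cut is moved
// forward to the next RS, so no record is ever split in half.
func splitAtDelimiter(data string, n int) []string {
	var chunks []string
	start := 0
	for i := 1; i < n && start < len(data); i++ {
		target := len(data) * i / n
		if target < start {
			target = start
		}
		off := strings.Index(data[target:], RS)
		if off < 0 {
			break // no delimiter left: the rest is one chunk
		}
		end := target + off
		chunks = append(chunks, data[start:end])
		start = end + len(RS)
	}
	return append(chunks, data[start:])
}

// parseRecords parses one chunk sequentially, skipping empty records.
func parseRecords(chunk string) []Field {
	var fields []Field
	for _, rec := range strings.Split(chunk, RS) {
		if rec == "" {
			continue
		}
		key, value, _ := strings.Cut(rec, US)
		fields = append(fields, Field{Key: key, Value: value})
	}
	return fields
}

// parseParallel parses each chunk in its own goroutine and concatenates
// the results in the original order.
func parseParallel(data string, workers int) []Field {
	chunks := splitAtDelimiter(data, workers)
	parts := make([][]Field, len(chunks))
	var wg sync.WaitGroup
	for i, chunk := range chunks {
		wg.Add(1)
		go func(i int, chunk string) {
			defer wg.Done()
			parts[i] = parseRecords(chunk) // each goroutine writes only its own slot
		}(i, chunk)
	}
	wg.Wait()
	var out []Field
	for _, p := range parts {
		out = append(out, p...)
	}
	return out
}

func main() {
	data := strings.Join([]string{
		"html:h1" + US + "The Title",
		"html:p" + US + "First paragraph",
		"html:p" + US + "Second <b> & more",
	}, RS)
	for _, f := range parseParallel(data, 2) {
		fmt.Printf("%s = %q\n", f.Key, f.Value)
	}
}
```

Because each goroutine only reads its own chunk and writes its own result slot, the workers need no locks or shared parser state. The only synchronization is the final wait.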

No Escaping

Quotes, angles, and ampersands are just content:

HTML (escaping required)

<p>Use &lt;div&gt; for containers</p>
<p>A &amp; B &amp; C</p>
<a href="?a=1&amp;b=2">Link</a>

HSV (no escaping)

[STX] html:p [US] Use <div> for containers [FS]
html:p [US] A & B & C [FS]
html:a [US] [SSA] html:href [US] ?a=1&b=2 [RS] html:text [US] Link [ESA] [ETX]

HSV's control characters (such as 0x86, 0x87, 0x1E, 0x1F) do not appear in normal text, so content never needs escaping.
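As a small illustration, the Go sketch below stores a string full of HTML metacharacters in a US-delimited field and reads it back unchanged, next to what html.EscapeString from the standard library would have to do to the same text. The function names encodeField, decodeField, and containsReserved are hypothetical. The reserved set is an assumption based on the control characters that appear in this document. A writer may still want to reject input that contains them.

```go
package main

import (
	"fmt"
	"html"
	"strings"
)

const US = "\x1f"

// reserved lists the control characters used as delimiters in this
// document (STX, ETX, FS, RS, US, SSA, ESA). The exact set is an assumption.
const reserved = "\x02\x03\x1c\x1e\x1f\u0086\u0087"

// encodeField stores the value verbatim: nothing is escaped.
func encodeField(key, value string) string { return key + US + value }

// decodeField splits at the first US and returns the value untouched.
func decodeField(field string) (key, value string, ok bool) {
	return strings.Cut(field, US)
}

// containsReserved reports whether s contains a delimiter character,
// which a writer would have to reject or strip before encoding.
func containsReserved(s string) bool {
	return strings.ContainsAny(s, reserved)
}

func main() {
	text := `Use <div> for containers, A & B, "quoted"`

	// HTML has to rewrite the metacharacters.
	fmt.Println(html.EscapeString(text))

	// The delimited field carries the text as-is.
	_, value, _ := decodeField(encodeField("html:p", text))
	fmt.Println(value == text, containsReserved(text))
}
```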

Rich Text Content

Multi-line content with formatting:

[STX] html:article [US] [SSA]
  html:h1 [US] The Title [RS]
  html:p [US] First paragraph with "quotes" and <angle brackets>. [RS]
  html:p [US] Second paragraph.
This continues on a new line.
And another. [RS]
  html:blockquote [US] [SSA]
    html:p [US] A nested quote with special chars: <>&"'
  [ESA]
[ESA] [ETX]

Newlines are literal. Quotes are literal. Everything except the reserved control characters is data.

AST Representation

HTML-in-HSV is essentially an AST (Abstract Syntax Tree) format:

[STX] tag [US] div [RS]
  attr:class [US] container [RS]
  children [US] [SSA]
    tag [US] p [RS] text [US] Hello [RS]
    tag [US] a [RS] attr:href [US] /link [RS] text [US] Click
  [ESA] [ETX]

This makes it ideal for:

Template engines: Compile templates to HSV, render in parallel
DOM manipulation: Parse once, transform, serialize
Static site generators: Process pages in parallel
Document storage: Store structured content without escaping nightmares
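Reading such a document into an in-memory tree takes little code. The Go sketch below is one possible reader, not a reference implementation. It assumes compact input (no presentational whitespace), that a value is either plain text or an SSA ... ESA group of RS-separated fields, and that SSA and ESA are the code points U+0086 and U+0087. Node, Parse, and parseFields are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	RS  = '\x1e'
	US  = '\x1f'
	SSA = '\u0086'
	ESA = '\u0087'
)

// Node is one field: a key with either a text value or nested children.
type Node struct {
	Key      string
	Value    string
	Children []*Node
}

// Parse strips the STX/ETX wrapper and parses the top-level fields.
func Parse(doc string) []*Node {
	doc = strings.TrimSuffix(strings.TrimPrefix(doc, "\x02"), "\x03")
	nodes, _ := parseFields([]rune(doc), 0)
	return nodes
}

// parseFields reads RS-separated fields starting at pos until it meets
// an ESA or the end of input, and returns the nodes and the new position.
func parseFields(in []rune, pos int) ([]*Node, int) {
	var nodes []*Node
	for pos < len(in) && in[pos] != ESA {
		node := &Node{}
		start := pos
		for pos < len(in) && in[pos] != US && in[pos] != RS && in[pos] != ESA {
			pos++
		}
		node.Key = string(in[start:pos])
		if pos < len(in) && in[pos] == US {
			pos++ // skip US
			if pos < len(in) && in[pos] == SSA {
				// Nested group: recurse, then skip the closing ESA.
				node.Children, pos = parseFields(in, pos+1)
				if pos < len(in) && in[pos] == ESA {
					pos++
				}
			} else {
				start = pos
				for pos < len(in) && in[pos] != RS && in[pos] != ESA {
					pos++
				}
				node.Value = string(in[start:pos])
			}
		}
		nodes = append(nodes, node)
		if pos < len(in) && in[pos] == RS {
			pos++ // skip the sibling separator
		}
	}
	return nodes, pos
}

// dump prints the tree with two spaces of indentation per level.
func dump(nodes []*Node, indent string) {
	for _, n := range nodes {
		fmt.Printf("%s%s: %q\n", indent, n.Key, n.Value)
		dump(n.Children, indent+"  ")
	}
}

func main() {
	doc := "\x02tag\x1fdiv\x1eattr:class\x1fcontainer\x1echildren\x1f\u0086tag\x1fp\x1etext\x1fHello\u0087\x03"
	dump(Parse(doc), "")
}
```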

Converting HTML to HSV

The mapping is straightforward:

HTML	HSV
`<tag>`	`html:tag [US] [SSA]`
`</tag>`	`[ESA]`
`attr="value"`	`html:attr [US] value`
Text content	Value after `[US]`
Sibling elements	Separated by `[RS]` or `[FS]`
`&lt; &gt; &amp;`	`< > &` (literal)
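The mapping table can be turned into a small converter. The Go sketch below uses the standard library's encoding/xml tokenizer, so it only accepts well-formed, XML-like markup (every tag closed, attributes quoted) and is not a full HTML parser. ConvertHTML and convertElement are hypothetical names. Collapsing a text-only element to `html:tag [US] text` follows the examples in this document, and dropping whitespace-only text is an assumption.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

const (
	STX = "\x02"
	ETX = "\x03"
	RS  = "\x1e"
	US  = "\x1f"
	SSA = "\u0086"
	ESA = "\u0087"
)

// ConvertHTML converts the first element in src, with everything
// inside it, to an STX ... ETX wrapped document.
func ConvertHTML(src string) (string, error) {
	dec := xml.NewDecoder(strings.NewReader(src))
	for {
		tok, err := dec.Token()
		if err != nil {
			return "", err // includes io.EOF when src has no element
		}
		if start, ok := tok.(xml.StartElement); ok {
			body, err := convertElement(dec, start)
			if err != nil {
				return "", err
			}
			return STX + body + ETX, nil
		}
	}
}

// convertElement consumes tokens up to the matching end tag and returns
// the encoded element.
func convertElement(dec *xml.Decoder, start xml.StartElement) (string, error) {
	name := "html:" + start.Name.Local
	var fields []string
	textOnly := len(start.Attr) == 0
	for _, a := range start.Attr {
		// attr="value" becomes html:attr US value; entities are already decoded.
		fields = append(fields, "html:"+a.Name.Local+US+a.Value)
	}
	for {
		tok, err := dec.Token()
		if err != nil {
			return "", err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			textOnly = false
			child, err := convertElement(dec, t)
			if err != nil {
				return "", err
			}
			fields = append(fields, child)
		case xml.CharData:
			// Whitespace-only text between tags is dropped (an assumption).
			if text := strings.TrimSpace(string(t)); text != "" {
				fields = append(fields, text)
			}
		case xml.EndElement:
			if textOnly && len(fields) == 1 {
				return name + US + fields[0], nil // <p>text</p> becomes html:p US text
			}
			for i, f := range fields {
				if !strings.Contains(f, US) {
					fields[i] = "html:text" + US + f // bare text next to other fields
				}
			}
			return name + US + SSA + strings.Join(fields, RS) + ESA, nil
		}
	}
}

func main() {
	out, err := ConvertHTML(`<div class="container"><p>A &amp; B</p></div>`)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%q\n", out)
}
```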

Summary

HTML-in-HSV gives you:

Parallel parsing of structured documents
No escaping for quotes, angles, ampersands
AST-level representation
Same HSV tooling for documents and data

30 years of sequential HTML parsing. HSV changes that.