「‍」 Lingenic

HTML in HSV

Parallel-parseable structured documents

HSV enables fully parallel processing of HTML-like structured documents. Represent HTML as HSV, split at any delimiter, parse chunks on separate cores. No speculation, no state synchronization, linear scaling.

Proof-of-concept: go-html-parallel — parallel HTML parsing in ~50 lines of Go.

The Problem with HTML Parsing

HTML and XML parsing has been sequential for 30 years. Research into parallel HTML parsing achieved limited results:

Project | Year | Approach                  | Result
HPar    | 2013 | Speculative data-parallel | 2.4x on 4 cores
ZOOMM   | 2013 | Qualcomm parallel browser | 2x (whole engine)
Servo   | 2017 | Off-main-thread parsing   | Tokenization only

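Why so little progress? HTML's parsing model is inherently stateful: how a byte is interpreted depends on everything that came before it. The same < can open a tag, sit inside a comment or script, or be part of a quoted attribute value.

You cannot split an HTML file in the middle and parse the chunks independently, because a parser that starts mid-file does not know which state it is in. The parser must process the input sequentially from the start.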

HTML in HSV

HSV represents HTML-like structures using its own delimiters:

Traditional HTML

<div class="container">
  <p>Hello world</p>
  <a href="https://example.com">
    Click here
  </a>
</div>

HSV

[STX] html:div [US] [SSA]
  html:class [US] container [RS]
  html:p [US] Hello world [RS]
  html:a [US] [SSA]
    html:href [US] https://example.com [RS]
    html:text [US] Click here
  [ESA]
[ESA] [ETX]

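The structure uses HSV control characters, written here as bracketed names. As used in the examples on this page:

[STX] / [ETX] (0x02 / 0x03): start and end of the document
[US] (0x1F): separates a name from its value
[RS] (0x1E): separates properties within the same element
[FS] (0x1C): separates top-level records
[SSA] / [ESA] (0x86 / 0x87): open and close a nested structure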
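As a rough illustration, the bracketed names can be written as Go constants and the example above assembled by concatenation. This is a sketch under two assumptions not stated in the text: that the document is UTF-8 (so SSA and ESA are the code points U+0086 and U+0087), and that the whitespace around delimiters in the listing is only there for readability. The function name exampleDoc is hypothetical.

```go
package main

import "fmt"

// Control characters written as [NAME] in the examples.
// Assumption: the document is UTF-8 text, so SSA and ESA are the code
// points U+0086 and U+0087 (two bytes each once encoded).
const (
	STX = "\x02"   // start of text
	ETX = "\x03"   // end of text
	FS  = "\x1c"   // file separator
	RS  = "\x1e"   // record separator
	US  = "\x1f"   // unit separator
	SSA = "\u0086" // start of selected area
	ESA = "\u0087" // end of selected area
)

// exampleDoc builds the <div class="container"> example, without the
// whitespace that the listing adds for readability (an assumption).
func exampleDoc() string {
	return STX + "html:div" + US + SSA +
		"html:class" + US + "container" + RS +
		"html:p" + US + "Hello world" + RS +
		"html:a" + US + SSA +
		"html:href" + US + "https://example.com" + RS +
		"html:text" + US + "Click here" +
		ESA + ESA + ETX
}

func main() {
	// %q makes the otherwise invisible control characters visible.
	fmt.Printf("%q\n", exampleDoc())
}
```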

Parallel Parsing

HTML-in-HSV inherits HSV's parallel-parseable structure.

Aspect      | Traditional HTML         | HTML in HSV
Parsing     | Sequential state machine | Parallel split
Escaping    | Required (&lt;, &amp;)   | Never
Split point | Cannot split safely      | Any delimiter
Multi-core  | Single-threaded          | Trivially parallel

Find a delimiter (SSA, ESA, RS, US), split, parse chunks on separate cores. Same as any HSV data.
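The split-and-parse step might look roughly like the following in Go. This is a minimal sketch, not the go-html-parallel code: it handles only a flat run of name/value pairs separated by RS, starts one goroutine per chunk for brevity (a real parser would more likely hand byte ranges to a fixed pool of workers), and the names Pair, parseChunk, and parseParallel are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

const (
	RS = "\x1e" // record separator
	US = "\x1f" // unit separator
)

// Pair is one name/value property, e.g. html:p -> "Hello world".
type Pair struct{ Name, Value string }

// parseChunk parses a single RS-delimited chunk. It needs no context
// from neighbouring chunks, which is what makes the split safe.
func parseChunk(chunk string) Pair {
	name, value, _ := strings.Cut(chunk, US)
	return Pair{Name: name, Value: value}
}

// parseParallel splits at every RS and parses each chunk in its own
// goroutine. Each goroutine writes only to its own slot in out, so no
// locking is needed and the original order is preserved.
func parseParallel(doc string) []Pair {
	chunks := strings.Split(doc, RS)
	out := make([]Pair, len(chunks))
	var wg sync.WaitGroup
	for i, c := range chunks {
		wg.Add(1)
		go func(i int, c string) {
			defer wg.Done()
			out[i] = parseChunk(c)
		}(i, c)
	}
	wg.Wait()
	return out
}

func main() {
	doc := "html:h1" + US + "The Title" + RS +
		"html:p" + US + "First paragraph" + RS +
		"html:p" + US + "A & B <c>"
	for _, p := range parseParallel(doc) {
		fmt.Printf("%s = %q\n", p.Name, p.Value)
	}
}
```

Nested documents would need an extra step to pair up SSA and ESA across chunks once the chunks are parsed; that step is not shown here.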

No Escaping

Quotes, angles, and ampersands are just content:

HTML (escaping required)

<p>Use &lt;div&gt; for containers</p>
<p>A &amp; B &amp; C</p>
<a href="?a=1&amp;b=2">Link</a>

HSV (no escaping)

[STX] html:p [US] Use <div> for containers [FS] html:p [US] A & B & C [FS] html:a [US] [SSA] html:href [US] ?a=1&b=2 [RS] html:text [US] Link [ESA] [ETX]

The delimiters are control characters (SSA 0x86, ESA 0x87, RS 0x1E, US 0x1F) that do not occur in normal text, so content never needs escaping.
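To make the contrast concrete, here is a small Go sketch that writes the same value once as escaped HTML and once as an HSV field. It assumes that STX, ETX, and FS are reserved along with the four characters listed above. What a writer should do with input that already contains a reserved character is not covered on this page; this sketch simply rejects it. The helper name field is hypothetical.

```go
package main

import (
	"fmt"
	"html"
	"strings"
)

// reserved lists the control characters used as delimiters in the examples.
// Assumption: STX, ETX and FS are reserved alongside SSA, ESA, RS and US.
const reserved = "\x02\x03\x1c\x1e\x1f\u0086\u0087"

// field joins a name and a value with US. Nothing is escaped; input that
// already contains a reserved character is rejected instead.
func field(name, value string) (string, error) {
	if strings.ContainsAny(name+value, reserved) {
		return "", fmt.Errorf("reserved control character in %q", name+value)
	}
	return name + "\x1f" + value, nil
}

func main() {
	raw := "?a=1&b=2"

	// In an HTML attribute the ampersand has to become &amp;.
	fmt.Println(html.EscapeString(raw))

	// The HSV field carries the value byte-for-byte.
	f, err := field("html:href", raw)
	fmt.Printf("%q %v\n", f, err)
}
```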

Rich Text Content

Multi-line content with formatting:

[STX] html:article [US] [SSA]
  html:h1 [US] The Title [RS]
  html:p [US] First paragraph with "quotes" and . [RS]
  html:p [US] Second paragraph.
This continues on a new line.
And another. [RS]
  html:blockquote [US] [SSA]
    html:p [US] A nested quote with special chars: <>&"'
  [ESA]
[ESA] [ETX]

Newlines are literal. Quotes are literal. Everything except the reserved control characters is data.

AST Representation

HTML-in-HSV is essentially an AST (Abstract Syntax Tree) format:

[STX] tag [US] div [RS]
  attr:class [US] container [RS]
  children [US] [SSA]
    tag [US] p [RS] text [US] Hello [RS]
    tag [US] a [RS] attr:href [US] /link [RS] text [US] Click
  [ESA] [ETX]
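One way to picture the tree is to read such a document into a generic node type. The sketch below is sequential for brevity and assumes a grammar inferred from the examples rather than taken from a specification: a property is a name, then US, then either a plain value or a nested list wrapped in SSA and ESA, with RS or FS between siblings. The names Node, parser, and Parse are hypothetical.

```go
package main

import "fmt"

const (
	STX rune = 0x02
	ETX rune = 0x03
	FS  rune = 0x1c
	RS  rune = 0x1e
	US  rune = 0x1f
	SSA rune = 0x86 // assumption: code point U+0086 in UTF-8 text
	ESA rune = 0x87 // assumption: code point U+0087 in UTF-8 text
)

// Node is one property: a name with either a value or nested children.
type Node struct {
	Name     string
	Value    string
	Children []*Node
}

type parser struct {
	r   []rune
	pos int
}

func isDelim(c rune) bool {
	return c == STX || c == ETX || c == FS || c == RS || c == US || c == SSA || c == ESA
}

// at reports whether the rune at the current position equals c.
func (p *parser) at(c rune) bool { return p.pos < len(p.r) && p.r[p.pos] == c }

// readText consumes runes up to the next control character.
func (p *parser) readText() string {
	start := p.pos
	for p.pos < len(p.r) && !isDelim(p.r[p.pos]) {
		p.pos++
	}
	return string(p.r[start:p.pos])
}

// parseProp reads: name US (SSA props ESA | value).
func (p *parser) parseProp() *Node {
	n := &Node{Name: p.readText()}
	if p.at(US) {
		p.pos++
		if p.at(SSA) {
			p.pos++
			n.Children = p.parseProps()
			if p.at(ESA) {
				p.pos++
			}
		} else {
			n.Value = p.readText()
		}
	}
	return n
}

// parseProps reads sibling properties until ESA, ETX, or end of input.
func (p *parser) parseProps() []*Node {
	var nodes []*Node
	for p.pos < len(p.r) && !p.at(ESA) && !p.at(ETX) {
		start := p.pos
		nodes = append(nodes, p.parseProp())
		if p.at(RS) || p.at(FS) {
			p.pos++ // separator between siblings
		}
		if p.pos == start {
			p.pos++ // skip an unexpected delimiter so the loop always advances
		}
	}
	return nodes
}

// Parse returns the top-level nodes of an HSV document.
func Parse(doc string) []*Node {
	p := &parser{r: []rune(doc)}
	if p.at(STX) {
		p.pos++
	}
	return p.parseProps()
}

func main() {
	doc := "\x02html:div\x1f\u0086html:class\x1fcontainer\x1ehtml:p\x1fHello\u0087\x03"
	root := Parse(doc)[0]
	fmt.Println(root.Name, len(root.Children))
	for _, c := range root.Children {
		fmt.Printf("  %s = %q\n", c.Name, c.Value)
	}
}
```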

This makes it ideal for:

Converting HTML to HSV

The mapping is straightforward:

HTML             | HSV
<tag>            | html:tag [US] [SSA]
</tag>           | [ESA]
attr="value"     | html:attr [US] value
Text content     | Value after [US]
Sibling elements | Separated by [RS] or [FS]
&lt; &gt; &amp;  | < > & (literal)
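The mapping can be sketched as a small converter in Go. To stay within the standard library it uses encoding/xml, so it only accepts well-formed, XML-style markup rather than real-world HTML. It also simplifies the mapping in one respect: every element is written in the nested [SSA] ... [ESA] form with its text as an html:text property, even where the examples on this page use the shorter html:p [US] text form. The function name toHSV is hypothetical.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

const (
	STX = "\x02"
	ETX = "\x03"
	RS  = "\x1e"
	US  = "\x1f"
	SSA = "\u0086" // assumption: UTF-8 encoded U+0086
	ESA = "\u0087" // assumption: UTF-8 encoded U+0087
)

// toHSV converts well-formed markup to HSV following the mapping table.
func toHSV(src string) (string, error) {
	dec := xml.NewDecoder(strings.NewReader(src))
	var b strings.Builder

	// written[i] records whether the element at depth i already has a
	// property, so that RS goes between siblings but not before the first.
	written := []bool{false}
	sep := func() {
		if written[len(written)-1] {
			b.WriteString(RS)
		}
		written[len(written)-1] = true
	}

	b.WriteString(STX)
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			break
		}
		if err != nil {
			return "", err
		}
		switch t := tok.(type) {
		case xml.StartElement: // <tag> -> html:tag [US] [SSA]
			sep()
			b.WriteString("html:" + t.Name.Local + US + SSA)
			written = append(written, false)
			for _, a := range t.Attr { // attr="value" -> html:attr [US] value
				sep()
				b.WriteString("html:" + a.Name.Local + US + a.Value)
			}
		case xml.EndElement: // </tag> -> [ESA]
			b.WriteString(ESA)
			written = written[:len(written)-1]
		case xml.CharData: // text -> value after [US]; entities already decoded
			if text := strings.TrimSpace(string(t)); text != "" {
				sep()
				b.WriteString("html:text" + US + text)
			}
		}
	}
	b.WriteString(ETX)
	return b.String(), nil
}

func main() {
	out, err := toHSV(`<a href="?a=1&amp;b=2">Link</a>`)
	fmt.Printf("%q %v\n", out, err)
}
```

The decoder resolves &amp;, &lt;, and &gt; before the text reaches the writer, which is why the HSV output contains the literal characters.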

Summary

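HTML-in-HSV gives you:

Parallel parsing: split at any delimiter and parse chunks on separate cores
No escaping: quotes, angle brackets, and ampersands are plain content
An AST-like structure with a straightforward mapping from HTML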

30 years of sequential HTML parsing. HSV changes that.