So a scraped job description looked correct in the employer dashboard, but on our public job page the salary line vanished. The HTML still contained $101,000.00 - $126,000.00 / year. It sat between </p> and <p>, not wrapped in its own tag.
The browser was not wrong. Our React pipeline was.
This post walks through loose text nodes in third-party HTML, why dangerouslySetInnerHTML can look fine until a useEffect rewrites the DOM, and how we keep sanitization without stripping the fragments scrapers and ATS exports leave behind.
What we will learn
- Why text outside tags is valid HTML and common in imported job copy.
- The difference between
childrenandchildNodeswhen we post-process markup in the browser. - Where sanitization belongs (server vs client) and what it does not fix.
- A preserve-order rebuild pattern we use after bullet-list normalization.
Prerequisites
- React (Pages or App Router) rendering CMS or API HTML with
dangerouslySetInnerHTML. - Job descriptions (or similar rich text) from external sources: ATS exports, scrapers, Word paste, legacy WYSIWYG.
- Optional: a sanitizer library such as DOMPurify if we allow arbitrary HTML.
The symptom
We inject description HTML:
<p>About the role</p>$101,000.00 - $126,000.00 / year<p>Responsibilities</p><p>• Build features</p>
After first paint, salary might flash briefly, then disappear. Or it never shows if our normalization runs before paint in strict mode double-mount scenarios.
Support ticket version: “Scraper broke the description.” Often the scraper did something ugly but valid. Our DOM rewrite dropped the ugly part.
Facts: how the browser parses this
When React sets innerHTML (via dangerouslySetInnerHTML), the browser builds a DOM tree:
| Node type | Example in markup above |
|---|---|
Element (<p>) | paragraph blocks |
| Text node | $101,000.00 - $126,000.00 / year sitting between elements |
Text nodes are first-class. They have no tag name. They still render.
Rule of thumb: if we only ever iterate elements, we silently delete anything that is not an element.
Where React makes this worse (usually not dangerouslySetInnerHTML itself)
Initial injection tends to work:
<div id="jobContentDesc" className="prose" dangerouslySetInnerHTML={{ __html: job.description ?? '' }}/>
The bug often arrives in a useEffect that “cleans up” employer HTML:
- turn
•paragraphs into<ul><li>, - strip empty
<p><br></p>spacers, - merge split bullet lines.
That effect frequently does this:
// Bug pattern: children is elements onlyconst kids = Array.from(container.children);container.replaceChildren(...rebuiltElements);
HTMLElement.children returns an HTMLCollection of element nodes only. Text nodes between <p> blocks never appear in kids. When we replaceChildren with rebuilt elements, loose text is gone.
Fix we rely on: childNodes, not children
useEffect(() => { const container = document.getElementById('jobContentDesc'); if (!container || !job?.description) return; const kids = Array.from(container.childNodes); const out: Node[] = []; let i = 0; while (i < kids.length) { const cur = kids[i]; // Preserve text nodes, comments, etc. in original order if (cur.nodeType !== Node.ELEMENT_NODE) { out.push(cur.cloneNode(true)); i += 1; continue; } // ... element-specific bullet / <p> logic ... i += 1; } container.replaceChildren(...out);}, [job?.description]);
Opinion: any DOM-normalization pass that walks the tree should treat non-element nodes as sacred unless we explicitly intend to strip them.
Quick reference:
| API | Includes text nodes? |
|---|---|
element.children | No |
element.childNodes | Yes |
element.textContent | Flattened string (loses structure) |
Sanitization: separate concern, same pipeline
Sanitizers like DOMPurify remove unsafe markup (scripts, event handlers, javascript: URLs). They do not automatically fix layout normalization bugs, and they will not invent wrappers for loose text.
Where we sanitize
| Layer | Pros | Cons |
|---|---|---|
| Server (API route, SSR, ingest job) | One canonical clean string in DB; safer default | Must re-run if allowlist changes |
Client (before dangerouslySetInnerHTML) | Easy to add late | XSS window if we ever SSR uns sanitized HTML |
| Both | Defense in depth | Duplicated config unless shared |
Example (client or isomorphic with isomorphic-dompurify):
import DOMPurify from 'isomorphic-dompurify';const safeHtml = DOMPurify.sanitize(job.description ?? '', { USE_PROFILES: { html: true }, // tighten ALLOWED_TAGS / ALLOWED_ATTR to our prose subset});return ( <div dangerouslySetInnerHTML={{ __html: safeHtml }} />);
Facts:
- DOMPurify generally keeps text nodes that survive its allowlist.
- If loose text vanishes after sanitize but before our effect, suspect normalization, not DOMPurify.
- If loose text vanishes immediately on first paint with no effect, suspect the source HTML never contained it (API truncation, wrong field).
Decision tree: fix at source vs fix in the browser

Option A – Browser normalization (what we did)
Keep employer HTML as-is in the database. Fix the React effect to preserve text nodes. Fastest when many legacy rows already exist.
Option B – Server-side normalization (longer-term)
On ingest or API read, parse HTML and wrap orphan text:
<!-- before --></p>Salary here<p><!-- after --></p><p class="job-description-orphan">Salary here</p><p>
Libraries: node-html-parser, cheerio, rehype/remark if we already run MDX pipelines.
Opinion: server wrap is cleaner for new data; client childNodes fix is the honest patch when we cannot re-import ten thousand jobs this week.
Verification we actually run
- API check: fetch
job.descriptionraw string. Confirm salary substring exists outside tags. - First paint: temporarily disable the normalization
useEffect. If text returns, the effect was the culprit. - DevTools Elements: select
#jobContentDesc, expand child list. Look for#textnodes between<p>elements. - Console snippet:
const el = document.getElementById('jobContentDesc');[...el.childNodes].map((n) => n.nodeType === Node.TEXT_NODE ? `#text: ${JSON.stringify(n.textContent)}` : n.tagName);
- Regression fixture: save one real broken HTML blob in the repo (redacted) and unit-test the normalizer output string or DOM child count.
Pitfalls beyond children
innerHTMLround-trips: readinginnerHTMLand writing it back can collapse whitespace differently than cloning nodes.- React Strict Mode double effects: normalization may run twice; idempotent rebuilds help.
- Blank text nodes: whitespace-only
#textnodes matter for spacing; clone them unless we mean to collapse. - Assuming
<span>wrappers: scrapers rarely wrap salary in<span>. Do not require tags that were never there. - Sanitizer over-tightening: stripping
styleor unknown tags is fine for security; stripping text usually means misconfiguration, not “bad HTML.”
Minimal checklist before we ship
- Post-process loop uses
childNodes(or does not rebuild the container at all). - Sanitize once at a documented layer with a shared allowlist.
- At least one fixture with inter-tag salary / location text in CI or Storybook.
- Product knows we display third-party HTML as imported; fixing upstream ATS export is a separate ticket.
Closing
Rich job descriptions teach a DOM lesson we keep relearning: HTML is a tree, not a bag of tags. Text nodes count. children lies by omission. Sanitization keeps users safe; it does not replace walking the tree honestly.
When the next ticket says “missing salary in description,” we ask one question first: did the text node make it into the container? Everything after that is either preservation or ingest, and we stop blaming the scraper until we check.

Leave a Reply