Data Scraping: Rendering rich text when content has text nodes outside tags (React / sanitization)

So a scraped job description looked correct in the employer dashboard, but on our public job page the salary line vanished. The HTML still contained $101,000.00 - $126,000.00 / year. It sat between </p> and <p>, not wrapped in its own tag.

The browser was not wrong. Our React pipeline was.

This post walks through loose text nodes in third-party HTML, why dangerouslySetInnerHTML can look fine until a useEffect rewrites the DOM, and how we keep sanitization without stripping the fragments scrapers and ATS exports leave behind.

What we will learn

Why text outside tags is valid HTML and common in imported job copy.
The difference between children and childNodes when we post-process markup in the browser.
Where sanitization belongs (server vs client) and what it does not fix.
A preserve-order rebuild pattern we use after bullet-list normalization.

Prerequisites

React (for example Next.js with the Pages or App Router) rendering CMS or API HTML with dangerouslySetInnerHTML.
Job descriptions (or similar rich text) from external sources: ATS exports, scrapers, Word paste, legacy WYSIWYG.
Optional: a sanitizer library such as DOMPurify if we allow arbitrary HTML.

The symptom

We inject description HTML:

			
<p>About the role</p>
$101,000.00 - $126,000.00 / year
<p>Responsibilities</p>
<p>• Build features</p>

After first paint, the salary might flash briefly, then disappear. Or it never shows at all, if our normalization runs before paint (for example in React Strict Mode double-mount scenarios).

Support ticket version: “Scraper broke the description.” Often the scraper did something ugly but valid. Our DOM rewrite dropped the ugly part.

Facts: how the browser parses this

When React sets innerHTML (via dangerouslySetInnerHTML), the browser builds a DOM tree:

Node type	Example in markup above
Element (`<p>`)	paragraph blocks
Text node	`$101,000.00 - $126,000.00 / year` sitting between elements

Text nodes are first-class. They have no tag name. They still render.

Rule of thumb: if we only ever iterate elements, we silently delete anything that is not an element.

Where React makes this worse (usually not `dangerouslySetInnerHTML` itself)

Initial injection tends to work:

			
<div
  id="jobContentDesc"
  className="prose"
  dangerouslySetInnerHTML={{ __html: job.description ?? '' }}
/>

		

The bug often arrives in a useEffect that “cleans up” employer HTML:

turn • paragraphs into <ul><li>,
strip empty &nbsp; spacers,
merge split bullet lines.

That effect frequently does this:

			
// Bug pattern: children is elements only, so loose text never reaches kids
const kids = Array.from(container.children);
// ... build rebuiltElements from kids ...
container.replaceChildren(...rebuiltElements);

Element.children returns an HTMLCollection of element nodes only. Text nodes between <p> blocks never appear in kids. When we call replaceChildren with the rebuilt elements, the loose text is gone.

Fix we rely on: `childNodes`, not `children`

			
useEffect(() => {
  const container = document.getElementById('jobContentDesc');
  if (!container || !job?.description) return;
  const kids = Array.from(container.childNodes);
  const out: Node[] = [];
  let i = 0;
  while (i < kids.length) {
    const cur = kids[i];
    // Preserve text nodes, comments, etc. in original order
    if (cur.nodeType !== Node.ELEMENT_NODE) {
      out.push(cur.cloneNode(true));
      i += 1;
      continue;
    }
    // ... element-specific bullet / <p> logic, which pushes the rebuilt element(s) onto out ...
    i += 1;
  }
  container.replaceChildren(...out);
}, [job?.description]);

		

Opinion: any DOM-normalization pass that walks the tree should treat non-element nodes as sacred unless we explicitly intend to strip them.
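
A minimal sketch of what the elided element logic might look like, written against a DOM-free model so it can run in a unit test without a browser. The MiniNode types and the normalizeBullets name are our assumptions for illustration, not the production effect. In the browser the same loop would walk Array.from(container.childNodes) and branch on nodeType instead of kind.

```typescript
// DOM-free stand-ins for a container's direct children (hypothetical types, not the DOM API).
type TextNode = { kind: 'text'; text: string };
type ElementNode = { kind: 'element'; tag: string; children: MiniNode[] };
type MiniNode = TextNode | ElementNode;

// A paragraph counts as a bullet line when its text starts with "•".
const BULLET = /^\s*•\s*/;

// Concatenated text of a node, similar in spirit to DOM textContent.
function textOf(node: MiniNode): string {
  return node.kind === 'text' ? node.text : node.children.map(textOf).join('');
}

function isBulletParagraph(node: MiniNode): boolean {
  return node.kind === 'element' && node.tag === 'p' && BULLET.test(textOf(node));
}

function isBlankText(node: MiniNode): boolean {
  return node.kind === 'text' && node.text.trim() === '';
}

// Rebuild the child list: runs of "• ..." paragraphs become one <ul>;
// every other node, loose text included, keeps its original position.
function normalizeBullets(kids: MiniNode[]): MiniNode[] {
  const out: MiniNode[] = [];
  let openList: ElementNode | null = null;
  for (const cur of kids) {
    if (isBulletParagraph(cur)) {
      if (openList === null) {
        openList = { kind: 'element', tag: 'ul', children: [] };
        out.push(openList);
      }
      // Inline markup inside the paragraph is flattened to text in this sketch.
      const label = textOf(cur).replace(BULLET, '');
      openList.children.push({
        kind: 'element',
        tag: 'li',
        children: [{ kind: 'text', text: label }],
      });
    } else if (openList !== null && isBlankText(cur)) {
      // Whitespace between bullet paragraphs moves into the list so the run stays unbroken.
      openList.children.push(cur);
    } else {
      // Non-blank text and ordinary elements close the list and pass through untouched.
      openList = null;
      out.push(cur);
    }
  }
  return out;
}
```

Fed the markup from the symptom section, the salary text node stays between the two paragraphs and the bullet paragraph becomes a one-item list. Moving whitespace-only text into an open list is a design choice: without it, the newline between two bullet paragraphs would split one list into several.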

Quick reference:

API	Includes text nodes?
`element.children`	No
`element.childNodes`	Yes
`element.textContent`	Flattened string (loses structure)

Sanitization: separate concern, same pipeline

Sanitizers like DOMPurify remove unsafe markup (scripts, event handlers, javascript: URLs). They do not automatically fix layout normalization bugs, and they will not invent wrappers for loose text.

Where we sanitize

Layer	Pros	Cons
Server (API route, SSR, ingest job)	One canonical clean string in DB; safer default	Must re-run if allowlist changes
Client (before `dangerouslySetInnerHTML`)	Easy to add late	XSS window if we ever SSR unsanitized HTML
Both	Defense in depth	Duplicated config unless shared
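
One way to act on the "duplicated config unless shared" caveat is a single module that both layers import. This is a sketch under assumptions: the file name, the constant name, and the specific tags and attributes are ours for illustration and would need review against real content. ALLOWED_TAGS and ALLOWED_ATTR are DOMPurify's documented option names.

```typescript
// sanitize-config.ts (hypothetical shared module, imported by server and client code)
type ProseSanitizeConfig = { ALLOWED_TAGS: string[]; ALLOWED_ATTR: string[] };

// Illustrative allowlist for job-description prose; adjust to the markup we actually accept.
const PROSE_SANITIZE_CONFIG: ProseSanitizeConfig = {
  ALLOWED_TAGS: ['p', 'br', 'ul', 'ol', 'li', 'strong', 'em', 'b', 'i', 'a', 'h2', 'h3'],
  ALLOWED_ATTR: ['href', 'title'],
};

// Intended use on either layer (assumes DOMPurify is imported there):
//   const safeHtml = DOMPurify.sanitize(job.description ?? '', PROSE_SANITIZE_CONFIG);
```

An allowlist like this restricts elements and attributes only. As we read the DOMPurify README, USE_PROFILES overrides ALLOWED_TAGS, so a config like this would replace a profile rather than sit next to it.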

Example (client or isomorphic with isomorphic-dompurify):

			
import DOMPurify from 'isomorphic-dompurify';
const safeHtml = DOMPurify.sanitize(job.description ?? '', {
  USE_PROFILES: { html: true },
  // tighten ALLOWED_TAGS / ALLOWED_ATTR to our prose subset
});
return (
  <div dangerouslySetInnerHTML={{ __html: safeHtml }} />
);

		

Facts:

DOMPurify generally keeps text nodes that survive its allowlist.
If loose text is still present after sanitizing but vanishes once our effect runs, suspect normalization, not DOMPurify.
If loose text vanishes immediately on first paint with no effect, suspect the source HTML never contained it (API truncation, wrong field).

Decision tree: fix at source vs fix in the browser

Option A – Browser normalization (what we did)

Keep employer HTML as-is in the database. Fix the React effect to preserve text nodes. Fastest when many legacy rows already exist.

Option B – Server-side normalization (longer-term)

On ingest or API read, parse HTML and wrap orphan text:

			
<!-- before -->
</p>Salary here<p>
<!-- after -->
</p><p class="job-description-orphan">Salary here</p><p>

Libraries: node-html-parser, cheerio, rehype/remark if we already run MDX pipelines.
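
A dependency-free sketch of the wrap step, assuming reasonably well-formed markup in which literal < and > inside text are entity-escaped. The wrapOrphanText name and the depth-counting tokenizer are our illustration, not how any of the libraries above work; for production ingest a real parser is the safer choice.

```typescript
// Tags that never have a closing tag, so they must not increase nesting depth.
const VOID_TAGS = new Set([
  'area', 'base', 'br', 'col', 'embed', 'hr', 'img',
  'input', 'link', 'meta', 'source', 'track', 'wbr',
]);

// Wrap non-blank top-level text in <p class="...">. Hypothetical helper for illustration.
function wrapOrphanText(html: string, className = 'job-description-orphan'): string {
  // The capture group keeps tags and comments in the result, alternating with text.
  const tokens = html.split(/(<!--[\s\S]*?-->|<[^>]+>)/);
  let depth = 0; // number of currently open elements
  let out = '';
  for (const token of tokens) {
    if (token === '') continue;
    const isMarkup = token.startsWith('<') && token.endsWith('>');
    if (isMarkup) {
      const tag = /^<(\/?)([a-zA-Z][a-zA-Z0-9-]*)[^>]*?(\/?)>$/.exec(token);
      if (tag) {
        const isClosing = tag[1] === '/';
        const name = (tag[2] ?? '').toLowerCase();
        const isSelfClosed = tag[3] === '/' || VOID_TAGS.has(name);
        if (isClosing) depth = Math.max(0, depth - 1);
        else if (!isSelfClosed) depth += 1;
      }
      out += token; // tags, comments, and doctype pass through unchanged
      continue;
    }
    // Only non-blank text outside every element is an orphan; it is trimmed and wrapped.
    const text = token.trim();
    out += depth === 0 && text !== '' ? `<p class="${className}">${text}</p>` : token;
  }
  return out;
}
```

Text already inside an element is left alone, so running the function twice gives the same result as running it once. Whitespace-only text between blocks passes through unchanged, while the whitespace around a wrapped fragment is dropped.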

Opinion: server wrap is cleaner for new data; client childNodes fix is the honest patch when we cannot re-import ten thousand jobs this week.

Verification we actually run

API check: fetch job.description raw string. Confirm salary substring exists outside tags.
First paint: temporarily disable the normalization useEffect. If text returns, the effect was the culprit.
DevTools Elements: select #jobContentDesc, expand child list. Look for #text nodes between <p> elements.
Console snippet:

			
const el = document.getElementById('jobContentDesc');
[...el.childNodes].map((n) =>
  n.nodeType === Node.TEXT_NODE ? `#text: ${JSON.stringify(n.textContent)}` : n.tagName
);

Regression fixture: save one real broken HTML blob in the repo (redacted) and unit-test the normalizer output string or DOM child count.

Pitfalls beyond `children`

innerHTML round-trips: reading innerHTML and writing it back can collapse whitespace differently than cloning nodes.
React Strict Mode double effects: normalization may run twice; idempotent rebuilds help.
Blank text nodes: whitespace-only #text nodes matter for spacing; clone them unless we mean to collapse.
Assuming <p> wrappers: scrapers rarely wrap salary in <p>. Do not require tags that were never there.
Sanitizer over-tightening: stripping style or unknown tags is fine for security; stripping text usually means misconfiguration, not “bad HTML.”

Minimal checklist before we ship

Post-process loop uses childNodes (or does not rebuild the container at all).
Sanitize once at a documented layer with a shared allowlist.
At least one fixture with inter-tag salary / location text in CI or Storybook.
Product knows we display third-party HTML as imported; fixing upstream ATS export is a separate ticket.

Closing

Rich job descriptions teach a DOM lesson we keep relearning: HTML is a tree, not a bag of tags. Text nodes count. children lies by omission. Sanitization keeps users safe; it does not replace walking the tree honestly.

When the next ticket says “missing salary in description,” we ask one question first: did the text node make it into the container? Everything after that is either preservation or ingest, and we stop blaming the scraper until we check.
