Data-Driven Link Verification with Playwright (Python) + pytest

When you refactor menus, migrate content, or run a large SEO cleanup, you often need to prove that specific anchor texts on specific pages now point to the new target URLs. Doing that by hand is slow and error-prone. This article shows a clean, repeatable way to automate it using:

Playwright (Python) — renders the real page (including JavaScript-built links)
pytest — data-driven tests from a CSV
pytest-html / Allure — human-friendly reports you can hand to stakeholders

What we’ll verify

For each row in a spreadsheet:

Open the Page URL.
Find one or more <a> elements whose visible text matches the Anchor Text
(case/whitespace-insensitive; “exact or contains” by default).
Confirm at least one of those links resolves (after redirects) to the New Anchor URL.
(Optional) Record the Old Anchor URL for reporting context.

Typical uses: sitewide SEO retargeting, product renames, marketing page rewrites, CMS migrations.

Project layout

qa-link-check-py/
  ├─ data/
  │  └─ anchors.csv                 # your spreadsheet exported to CSV
  ├─ tests/
  │  └─ test_links.py               # the test
  ├─ utils/
  │  └─ normalize.py                # text/URL normalization helpers
  ├─ conftest.py                    # adds custom columns to pytest-html
  ├─ requirements.txt
  ├─ pytest.ini
  └─ README.md

anchors.csv (sample)

Page URL,Anchor Text,Old Anchor URL,New Anchor URL
https://example.com/marketing,Small Business SEO,https://example.com/old-small-biz,https://example.com/local-seo/affordable-seo-services-for-small-business/

Install once

python -m venv .venv
source .venv/bin/activate              # Windows: .venv\Scripts\Activate.ps1
pip install -r requirements.txt
python -m playwright install --with-deps

requirements.txt

pytest==8.2.2
playwright==1.46.0
pytest-playwright==0.5.0
pytest-html==4.1.1
allure-pytest==2.13.5        # optional

pytest.ini

[pytest]
addopts = -ra

Core logic

Normalization helpers (keep comparisons sane)

utils/normalize.py

import re
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

_IGNORE = {"utm_source","utm_medium","utm_campaign","utm_term","utm_content",
           "gclid","fbclid","msclkid","utm_id"}

def normalize_text(s: str) -> str:
    if not s: return ""
    s = s.replace("\u00A0", " ")
    return re.sub(r"\s+", " ", s).strip().lower()

def normalize_url(u: str) -> str:
    if not u: return ""
    try:
        p = urlsplit(u.strip())
        scheme = p.scheme or "https"
        netloc = re.sub(r":(80|443)$", "", p.netloc.lower())
        path = re.sub(r"/{2,}", "/", p.path or "")
        if path.endswith("/") and path != "/":
            path = path[:-1]
        q = [(k,v) for (k,v) in parse_qsl(p.query, keep_blank_values=True) if k not in _IGNORE]
        return urlunsplit((scheme, netloc, path, urlencode(q, doseq=True), ""))
    except Exception:
        return u.strip()

The test (data-driven from CSV)

tests/test_links.py

import csv, os
from typing import Dict, List
import pytest, allure
from playwright.sync_api import Page, APIRequestContext
from utils.normalize import normalize_text, normalize_url

CSV_PATH = os.path.join(os.path.dirname(__file__), "..", "data", "anchors.csv")

def _load_rows(path: str) -> List[Dict[str, str]]:
    if not os.path.exists(path):
        raise FileNotFoundError(f"Missing {path}")
    rows = []
    with open(path, newline="", encoding="utf-8-sig") as f:
        r = csv.DictReader(f)
        for i, row in enumerate(r, start=2):  # header is row 1
            rows.append({
                "sheet_row": i,
                "page_url": (row.get("Page URL") or "").strip(),
                "anchor_text": (row.get("Anchor Text") or "").strip(),
                "old_anchor_url": (row.get("Old Anchor URL") or "").strip(),
                "new_anchor_url": (row.get("New Anchor URL") or "").strip(),
            })
    return rows

TEST_ROWS = _load_rows(CSV_PATH)

@pytest.mark.parametrize("row", TEST_ROWS, ids=lambda r: f"Row_{r['sheet_row']}")
def test_anchor_targets(row: Dict[str, str], page: Page, record_property):
    page_url     = row["page_url"]
    anchor_text  = row["anchor_text"]
    old_anchor   = row["old_anchor_url"]
    expected_url = row["new_anchor_url"]

    # show these in the report details
    for k, v in [("Page URL", page_url), ("Anchor Text", anchor_text),
                 ("Old Anchor URL", old_anchor), ("New Anchor URL", expected_url)]:
        record_property(k, v)
        allure.dynamic.parameter(k, v)

    assert page_url and anchor_text and expected_url, "CSV has empty required field(s)."

    page.goto(page_url, wait_until="networkidle")

    # find candidate anchors by visible text
    target = normalize_text(anchor_text)
    anchors = page.locator("a[href]")
    candidates: List[Dict[str, str]] = []
    for i in range(anchors.count()):
        a = anchors.nth(i)
        text = normalize_text(a.inner_text() or "")
        if not text: continue
        if text == target or target in text or text in target:
            href = a.get_attribute("href")
            if href:
                abs_url = page.evaluate("(u)=>new URL(u, window.location.href).toString()", href)
                candidates.append({"text": text, "href": abs_url})

    allure.attach("\n".join(f"{c['text']} -> {c['href']}" for c in candidates[:20]) or "(none)",
                  name="candidates.txt", attachment_type=allure.attachment_type.TEXT)
    assert candidates, f"No <a> with matching anchor text found on {page_url}"

    # resolve redirects, compare normalized
    expected_norm = normalize_url(expected_url)
    matched, wrong = False, []
    api: APIRequestContext = page.context.request
    for c in candidates:
        try:
            resp = api.get(c["href"], max_redirects=10)
            final_url = resp.url
        except Exception:
            final_url = c["href"]
        final_norm = normalize_url(final_url)
        if final_norm == expected_norm:
            matched = True
            break
        wrong.append(final_norm)

    allure.attach(f"expected: {expected_norm}\nwrong examples:\n" + "\n".join(dict.fromkeys(wrong))[:10000],
                  name="summary.txt", attachment_type=allure.attachment_type.TEXT)

    if not matched:
        try:
            allure.attach(page.screenshot(full_page=True), "screenshot.png", allure.attachment_type.PNG)
        except Exception:
            pass
        allure.attach(page.content(), "page.html", allure.attachment_type.HTML)

    assert matched, "Anchor text found but URL mismatch.\nExamples:\n" + "\n".join(dict.fromkeys(wrong[:5]))

Make the HTML report show your four fields as columns

conftest.py

from py.xml import html

FIELDS = [("Page URL","page_url"), ("Anchor Text","anchor_text"),
          ("Old Anchor URL","old_anchor_url"), ("New Anchor URL","new_anchor_url")]

def pytest_itemcollected(item):
    callspec = getattr(item, "callspec", None)
    if not callspec: return
    row = callspec.params.get("row")
    if not isinstance(row, dict): return

    props = list(getattr(item, "user_properties", []))
    # replace any existing key; add only non-empty values
    def set_prop(label, value):
        nonlocal props
        props = [p for p in props if p[0] != label]
        if value: props.append((label, value))
    for label, key in FIELDS:
        set_prop(label, row.get(key, ""))

    item.user_properties = props

def _get_prop(report, name: str, default=""):
    val = default
    for k, v in getattr(report, "user_properties", []):
        if k == name and str(v).strip():
            val = v
    return val

def pytest_html_results_table_header(cells):
    cells.insert(2, html.th("Page URL"))
    cells.insert(3, html.th("Anchor Text"))
    cells.insert(4, html.th("Old Anchor URL"))
    cells.insert(5, html.th("New Anchor URL"))

def pytest_html_results_table_row(report, cells):
    page = _get_prop(report, "Page URL")
    anchor = _get_prop(report, "Anchor Text")
    oldu = _get_prop(report, "Old Anchor URL")
    newu = _get_prop(report, "New Anchor URL")

    cells.insert(2, html.td(html.a(page, href=page)) if page else html.td(""))
    cells.insert(3, html.td(anchor))
    cells.insert(4, html.td(html.a(oldu, href=oldu)) if oldu else html.td(""))
    cells.insert(5, html.td(html.a(newu, href=newu)) if newu else html.td(""))

def pytest_html_report_title(report):
    report.title = "Anchor Link Verification Report"

Run it

All tests with HTML report:

python -m pytest --html=reports/report.html --self-contained-html -ra

Run one spreadsheet row (helpful for smoke tests):

By node id (after --collect-only to see ids):

python -m pytest 'tests/test_links.py::test_anchor_targets[chromium-Row_2]' --html=reports/report.html --self-contained-html -ra

Or add a tiny env switch (optional):

# after TEST_ROWS = _load_rows(...)
import os
ONLY_ROW = os.getenv("ONLY_ROW", "")
if ONLY_ROW:
    TEST_ROWS = [r for r in TEST_ROWS if str(r["sheet_row"]) == ONLY_ROW]

Then run: ONLY_ROW=2 python -m pytest --html=reports/report.html --self-contained-html -ra

CI example (GitHub Actions)

.github/workflows/anchor-check.yml

name: Anchor Link Check (Py)
on:
  workflow_dispatch:
  schedule: [{ cron: "0 3 * * 1" }]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.11" }
      - run: python -m venv .venv && . .venv/bin/activate && pip install -r requirements.txt
      - run: python -m playwright install --with-deps
      - run: . .venv/bin/activate && python -m pytest --html=reports/report.html --self-contained-html -ra
      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: pytest-html
          path: reports/

Matching policy & useful tweaks

Text matching: current rule is exact OR contains both ways after lowercasing and space-collapsing.
Make it stricter (exact only) by changing:

if text == target:
    ...

Hidden links: skip non-visible anchors:

if not a.is_visible(): continue

Redirect policy: we follow up to 10 redirects and compare the final URL.
URL normalization: ignores trailing slash differences and tracking params like utm_*, gclid, etc. Add or remove keys in _IGNORE as needed.
Flake control: retry once at the pytest level with -n auto (via pytest-xdist) if you parallelize later.

Troubleshooting

“pytest not found” → run inside your venv, or use python -m pytest ….
--headed=false error → --headed is a flag (no value). Use --headed or omit it.
No tests run with -k "Row 2" → -k parser dislikes spaces; use node id or the ONLY_ROW env var.
Blank columns in HTML summary → ensure conftest.py is present; it injects user properties at collection time so pytest-html can render them.
JS-built links missing → Playwright already renders JS; ensure you use wait_until="networkidle" and scan a[href] after page settles.

Why this approach works well

Auditable: each spreadsheet row becomes a test with evidence (trace, screenshot, DOM).
Scalable: add more rows—no code change.
Portable: CSV in, HTML report out; easy for SEO/content teams to review.
CI-friendly: one workflow step gives you a weekly compliance check on links.

Optional: .gitignore

__pycache__/
*.py[cod]
.venv/
.pytest_cache/
reports/
playwright-report/
allure-results/
allure-report/
test-results/
.DS_Store
Thumbs.db
.vscode/
.idea/
data/*.csv
!data/anchors.sample.csv

That’s the whole flow. Drop your CSV, run the tests, and ship a clear report showing exactly which anchors point where the links don’t work as expected—no manual crawling, no guesswork.