When you refactor menus, migrate content, or run a large SEO cleanup, you often need to prove that specific anchor texts on specific pages now point to the new target URLs. Doing that by hand is slow and error-prone. This article shows a clean, repeatable way to automate it using:

  • Playwright (Python) — renders the real page (including JavaScript-built links)
  • pytest — data-driven tests from a CSV
  • pytest-html / Allure — human-friendly reports you can hand to stakeholders

What we’ll verify

For each row in a spreadsheet:

  • Open the Page URL.
  • Find one or more <a> elements whose visible text matches the Anchor Text
    (case/whitespace-insensitive; “exact or contains” by default).
  • Confirm at least one of those links resolves (after redirects) to the New Anchor URL.
  • (Optional) Record the Old Anchor URL for reporting context.

Typical uses: sitewide SEO retargeting, product renames, marketing page rewrites, CMS migrations.

Project layout

qa-link-check-py/
  ├─ data/
  │  └─ anchors.csv                 # your spreadsheet exported to CSV
  ├─ tests/
  │  └─ test_links.py               # the test
  ├─ utils/
  │  └─ normalize.py                # text/URL normalization helpers
  ├─ conftest.py                    # adds custom columns to pytest-html
  ├─ requirements.txt
  ├─ pytest.ini
  └─ README.md

anchors.csv (sample)

Page URL,Anchor Text,Old Anchor URL,New Anchor URL
https://example.com/marketing,Small Business SEO,https://example.com/old-small-biz,https://example.com/local-seo/affordable-seo-services-for-small-business/
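
Because the loader looks columns up by header name, a renamed or misspelled header would quietly produce empty fields. As a minimal sketch (the helper name `missing_columns` and the `REQUIRED_COLUMNS` tuple are illustrative assumptions, not part of the project files), a header check before running the suite could look like this:

```python
import csv
import io
from typing import List

# Assumption: "Old Anchor URL" is treated as optional reporting context.
REQUIRED_COLUMNS = ("Page URL", "Anchor Text", "New Anchor URL")

def missing_columns(csv_text: str) -> List[str]:
    """Return required header names that are absent from the CSV's first row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    present = {(name or "").strip() for name in (reader.fieldnames or [])}
    return [col for col in REQUIRED_COLUMNS if col not in present]

sample = (
    "Page URL,Anchor Text,Old Anchor URL,New Anchor URL\n"
    "https://example.com/marketing,Small Business SEO,"
    "https://example.com/old-small-biz,https://example.com/new\n"
)
print(missing_columns(sample))                     # []
print(missing_columns("Page,Anchor Text\nx,y\n"))  # ['Page URL', 'New Anchor URL']
```

Failing fast on a non-empty result gives a clearer message than a wall of "CSV has empty required field(s)" assertions.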

Install once

python -m venv .venv
source .venv/bin/activate              # Windows: .venv\Scripts\Activate.ps1
pip install -r requirements.txt
python -m playwright install --with-deps

requirements.txt

pytest==8.2.2
playwright==1.46.0
pytest-playwright==0.5.0
pytest-html==4.1.1
allure-pytest==2.13.5        # tests/test_links.py imports allure, so keep it installed even if you only read the HTML report

pytest.ini

[pytest]
addopts = -ra

Core logic

Normalization helpers (keep comparisons sane)

utils/normalize.py

import re
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

_IGNORE = {"utm_source","utm_medium","utm_campaign","utm_term","utm_content",
           "gclid","fbclid","msclkid","utm_id"}

def normalize_text(s: str) -> str:
    if not s: return ""
    s = s.replace("\u00A0", " ")
    return re.sub(r"\s+", " ", s).strip().lower()

def normalize_url(u: str) -> str:
    if not u: return ""
    try:
        p = urlsplit(u.strip())
        scheme = p.scheme or "https"
        netloc = re.sub(r":(80|443)$", "", p.netloc.lower())
        path = re.sub(r"/{2,}", "/", p.path or "")
        if path.endswith("/") and path != "/":
            path = path[:-1]
        q = [(k,v) for (k,v) in parse_qsl(p.query, keep_blank_values=True) if k not in _IGNORE]
        return urlunsplit((scheme, netloc, path, urlencode(q, doseq=True), ""))
    except Exception:
        return u.strip()
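
To see which differences the URL helper treats as equal, here is a small self-contained sketch. It repeats a condensed copy of `normalize_url` (without the try/except fallback) so it runs on its own; the sample URLs are made up for illustration.

```python
import re
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

TRACKING = {"utm_source", "utm_medium", "utm_campaign", "utm_term",
            "utm_content", "utm_id", "gclid", "fbclid", "msclkid"}

def normalize_url(u: str) -> str:
    # Condensed copy of the helper above so this snippet runs standalone.
    p = urlsplit(u.strip())
    netloc = re.sub(r":(80|443)$", "", p.netloc.lower())
    path = re.sub(r"/{2,}", "/", p.path or "")
    if path.endswith("/") and path != "/":
        path = path[:-1]
    query = [(k, v) for k, v in parse_qsl(p.query, keep_blank_values=True)
             if k not in TRACKING]
    return urlunsplit((p.scheme or "https", netloc, path, urlencode(query), ""))

a = normalize_url("https://Example.com:443/local-seo//page/?utm_source=x&id=5#top")
b = normalize_url("https://example.com/local-seo/page?id=5")
print(a)       # https://example.com/local-seo/page?id=5
print(a == b)  # True

# Differences the helper keeps: the scheme, and a bare host versus a host with "/".
print(normalize_url("http://example.com/a") == normalize_url("https://example.com/a"))  # False
print(normalize_url("https://example.com") == normalize_url("https://example.com/"))    # False
```

If your spreadsheet mixes `http` and `https`, or lists the home page both with and without the trailing slash, expect mismatches unless you extend the helper.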

The test (data-driven from CSV)

tests/test_links.py

import csv, os
from typing import Dict, List
import pytest, allure
from playwright.sync_api import Page, APIRequestContext
from utils.normalize import normalize_text, normalize_url

CSV_PATH = os.path.join(os.path.dirname(__file__), "..", "data", "anchors.csv")

def _load_rows(path: str) -> List[Dict[str, str]]:
    if not os.path.exists(path):
        raise FileNotFoundError(f"Missing {path}")
    rows = []
    with open(path, newline="", encoding="utf-8-sig") as f:
        r = csv.DictReader(f)
        for i, row in enumerate(r, start=2):  # header is row 1
            rows.append({
                "sheet_row": i,
                "page_url": (row.get("Page URL") or "").strip(),
                "anchor_text": (row.get("Anchor Text") or "").strip(),
                "old_anchor_url": (row.get("Old Anchor URL") or "").strip(),
                "new_anchor_url": (row.get("New Anchor URL") or "").strip(),
            })
    return rows

TEST_ROWS = _load_rows(CSV_PATH)

@pytest.mark.parametrize("row", TEST_ROWS, ids=lambda r: f"Row_{r['sheet_row']}")
def test_anchor_targets(row: Dict[str, str], page: Page, record_property):
    page_url     = row["page_url"]
    anchor_text  = row["anchor_text"]
    old_anchor   = row["old_anchor_url"]
    expected_url = row["new_anchor_url"]

    # show these in the report details
    for k, v in [("Page URL", page_url), ("Anchor Text", anchor_text),
                 ("Old Anchor URL", old_anchor), ("New Anchor URL", expected_url)]:
        record_property(k, v)
        allure.dynamic.parameter(k, v)

    assert page_url and anchor_text and expected_url, "CSV has empty required field(s)."

    page.goto(page_url, wait_until="networkidle")

    # find candidate anchors by visible text
    target = normalize_text(anchor_text)
    anchors = page.locator("a[href]")
    candidates: List[Dict[str, str]] = []
    for i in range(anchors.count()):
        a = anchors.nth(i)
        text = normalize_text(a.inner_text() or "")
        if not text: continue
        if text == target or target in text or text in target:
            href = a.get_attribute("href")
            if href:
                abs_url = page.evaluate("(u)=>new URL(u, window.location.href).toString()", href)
                candidates.append({"text": text, "href": abs_url})

    allure.attach("\n".join(f"{c['text']} -> {c['href']}" for c in candidates[:20]) or "(none)",
                  name="candidates.txt", attachment_type=allure.attachment_type.TEXT)
    assert candidates, f"No <a> with matching anchor text found on {page_url}"

    # resolve redirects, compare normalized
    expected_norm = normalize_url(expected_url)
    matched, wrong = False, []
    api: APIRequestContext = page.context.request
    for c in candidates:
        try:
            resp = api.get(c["href"], max_redirects=10)
            final_url = resp.url
        except Exception:
            final_url = c["href"]
        final_norm = normalize_url(final_url)
        if final_norm == expected_norm:
            matched = True
            break
        wrong.append(final_norm)

    allure.attach(f"expected: {expected_norm}\nwrong examples:\n" + "\n".join(dict.fromkeys(wrong))[:10000],
                  name="summary.txt", attachment_type=allure.attachment_type.TEXT)

    if not matched:
        try:
            allure.attach(page.screenshot(full_page=True), "screenshot.png", allure.attachment_type.PNG)
        except Exception:
            pass
        allure.attach(page.content(), "page.html", allure.attachment_type.HTML)

    assert matched, "Anchor text found but URL mismatch.\nExamples:\n" + "\n".join(dict.fromkeys(wrong[:5]))
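
On pages with hundreds of links, the per-anchor loop above makes several browser round-trips for each `<a>`. A possible alternative, sketched here under assumptions, is to pull text and resolved href for every anchor in one call with `Locator.evaluate_all` and filter in Python. The helper names `filter_candidates` and `collect_candidates` are hypothetical, and `normalize_text` is re-implemented inline so the snippet runs standalone.

```python
from typing import Dict, List

def normalize_text(s: str) -> str:
    # Same idea as utils/normalize.py: NBSP to space, collapse whitespace, lowercase.
    return " ".join((s or "").replace("\u00a0", " ").split()).lower()

def filter_candidates(raw: List[Dict[str, str]], anchor_text: str) -> List[Dict[str, str]]:
    """Keep links whose normalized text equals, contains, or is contained in the target."""
    target = normalize_text(anchor_text)
    kept = []
    for link in raw:
        text = normalize_text(link.get("text", ""))
        if text and (text == target or target in text or text in target):
            kept.append({"text": text, "href": link["href"]})
    return kept

def collect_candidates(page, anchor_text: str) -> List[Dict[str, str]]:
    """`page` is a Playwright Page; hrefs are resolved against the document base URI."""
    raw = page.locator("a[href]").evaluate_all(
        """els => els.map(el => ({
            text: el.innerText || "",
            href: new URL(el.getAttribute("href"), document.baseURI).href
        }))"""
    )
    return filter_candidates(raw, anchor_text)

# The filtering step can be exercised without a browser:
raw = [
    {"text": "  Small\u00a0Business   SEO ", "href": "https://example.com/new"},
    {"text": "Contact", "href": "https://example.com/contact"},
]
print(filter_candidates(raw, "small business seo"))
# [{'text': 'small business seo', 'href': 'https://example.com/new'}]
```

Keeping the matching rule in a pure function also makes it easy to unit-test the policy separately from page rendering.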

Make the HTML report show your four fields as columns

conftest.py

from html import escape

FIELDS = [("Page URL","page_url"), ("Anchor Text","anchor_text"),
          ("Old Anchor URL","old_anchor_url"), ("New Anchor URL","new_anchor_url")]

def pytest_itemcollected(item):
    callspec = getattr(item, "callspec", None)
    if not callspec: return
    row = callspec.params.get("row")
    if not isinstance(row, dict): return

    props = list(getattr(item, "user_properties", []))
    # replace any existing key; add only non-empty values
    def set_prop(label, value):
        nonlocal props
        props = [p for p in props if p[0] != label]
        if value: props.append((label, value))
    for label, key in FIELDS:
        set_prop(label, row.get(key, ""))

    item.user_properties = props

def _get_prop(report, name: str, default=""):
    val = default
    for k, v in getattr(report, "user_properties", []):
        if k == name and str(v).strip():
            val = v
    return val

# pytest-html 4.x takes each cell as an HTML string; the py.xml builders
# used by older versions are not available with the pinned requirements.
def _text_cell(value) -> str:
    return f"<td>{escape(str(value))}</td>"

def _link_cell(url) -> str:
    if not url:
        return "<td></td>"
    safe = escape(str(url), quote=True)
    return f'<td><a href="{safe}">{safe}</a></td>'

def pytest_html_results_table_header(cells):
    cells.insert(2, "<th>Page URL</th>")
    cells.insert(3, "<th>Anchor Text</th>")
    cells.insert(4, "<th>Old Anchor URL</th>")
    cells.insert(5, "<th>New Anchor URL</th>")

def pytest_html_results_table_row(report, cells):
    cells.insert(2, _link_cell(_get_prop(report, "Page URL")))
    cells.insert(3, _text_cell(_get_prop(report, "Anchor Text")))
    cells.insert(4, _link_cell(_get_prop(report, "Old Anchor URL")))
    cells.insert(5, _link_cell(_get_prop(report, "New Anchor URL")))

def pytest_html_report_title(report):
    report.title = "Anchor Link Verification Report"

Run it

All tests with HTML report:

python -m pytest --html=reports/report.html --self-contained-html -ra

Run one spreadsheet row (helpful for smoke tests):

  • By node id (after --collect-only to see ids):
python -m pytest 'tests/test_links.py::test_anchor_targets[chromium-Row_2]' --html=reports/report.html --self-contained-html -ra
  • Or add a tiny env switch (optional):
# after TEST_ROWS = _load_rows(...)
import os
ONLY_ROW = os.getenv("ONLY_ROW", "")
if ONLY_ROW:
    TEST_ROWS = [r for r in TEST_ROWS if str(r["sheet_row"]) == ONLY_ROW]

Then run: ONLY_ROW=2 python -m pytest --html=reports/report.html --self-contained-html -ra
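
If you often recheck a handful of rows, the same idea extends to a comma-separated list. This is only a sketch: the `ONLY_ROWS` variable and both helper names are hypothetical and not part of the test file above.

```python
import os
from typing import Dict, List, Set

def parse_only_rows(value: str) -> Set[int]:
    """Parse '2,5, 9' into {2, 5, 9}; blank or non-numeric pieces are ignored."""
    return {int(part) for part in value.split(",") if part.strip().isdecimal()}

def select_rows(rows: List[Dict], only: Set[int]) -> List[Dict]:
    """Return every row when no filter is given, else only the requested sheet rows."""
    return [r for r in rows if r["sheet_row"] in only] if only else rows

# Stand-in for TEST_ROWS; in the real test file you would filter the loaded rows.
rows = [{"sheet_row": n} for n in (2, 3, 4, 5)]
print(select_rows(rows, parse_only_rows(os.getenv("ONLY_ROWS", ""))))
```

Usage would then be along the lines of ONLY_ROWS=2,5 python -m pytest --html=reports/report.html --self-contained-html -ra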

CI example (GitHub Actions)

.github/workflows/anchor-check.yml

name: Anchor Link Check (Py)
on:
  workflow_dispatch:
  schedule: [{ cron: "0 3 * * 1" }]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.11" }
      - run: python -m venv .venv && . .venv/bin/activate && pip install -r requirements.txt
      - run: . .venv/bin/activate && python -m playwright install --with-deps
      - run: . .venv/bin/activate && python -m pytest --html=reports/report.html --self-contained-html -ra
      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: pytest-html
          path: reports/

Matching policy & useful tweaks

  • Text matching: the current rule is exact OR contains in either direction, after lowercasing and whitespace collapsing.
    To make it stricter (exact only), reduce the condition to:
if text == target:
    ...

  • Hidden links: skip non-visible anchors:
if not a.is_visible(): continue
  • Redirect policy: we follow up to 10 redirects and compare the final URL.
  • URL normalization: ignores trailing slash differences and tracking params like utm_*, gclid, etc. Add or remove keys in _IGNORE as needed.
  • Flake control: to retry a failed row once, add the pytest-rerunfailures plugin and pass --reruns 1. Note that -n auto (via pytest-xdist) only parallelizes the run; it does not retry anything.
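
For the flake-control point above, a lighter option than rerunning a whole test is to retry only the navigation step. The sketch below assumes a hypothetical helper named `goto_with_retry`; `FlakyPage` is a stand-in object used purely to demonstrate the behavior without launching a browser.

```python
import time

def goto_with_retry(page, url: str, attempts: int = 2, pause_s: float = 1.0):
    """Call page.goto, retrying on any exception; re-raise the last error if all attempts fail."""
    attempts = max(1, attempts)
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return page.goto(url, wait_until="networkidle")
        except Exception as exc:  # Playwright timeouts and navigation errors land here
            last_error = exc
            if attempt < attempts:
                time.sleep(pause_s)
    raise last_error

class FlakyPage:
    """Fake page whose first navigation fails, to show the retry path."""
    def __init__(self):
        self.calls = 0

    def goto(self, url, wait_until=None):
        self.calls += 1
        if self.calls == 1:
            raise TimeoutError("simulated timeout")
        return "ok"

fake = FlakyPage()
print(goto_with_retry(fake, "https://example.com", pause_s=0))  # ok
print(fake.calls)                                               # 2
```

In the test you would call goto_with_retry(page, page_url) in place of the direct page.goto call; a genuine outage still fails after the last attempt.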

Troubleshooting

  • “pytest not found” → run inside your venv, or use python -m pytest ….
  • --headed=false error → --headed is a flag (no value). Use --headed or omit it.
  • No tests run with -k "Row 2" → the -k parser dislikes spaces, and the generated ids use an underscore (Row_2); use the node id or the ONLY_ROW env var.
  • Blank columns in HTML summary → ensure conftest.py is present; it injects user properties at collection time so pytest-html can render them.
  • JS-built links missing → Playwright already renders JS; ensure you use wait_until="networkidle" and scan a[href] after page settles.

Why this approach works well

  • Auditable: each spreadsheet row becomes a test with evidence (the candidate links, plus a screenshot and the page HTML on failure, attached via Allure).
  • Scalable: add more rows—no code change.
  • Portable: CSV in, HTML report out; easy for SEO/content teams to review.
  • CI-friendly: one workflow step gives you a weekly compliance check on links.

Optional: .gitignore

__pycache__/
*.py[cod]
.venv/
.pytest_cache/
reports/
playwright-report/
allure-results/
allure-report/
test-results/
.DS_Store
Thumbs.db
.vscode/
.idea/
data/*.csv
!data/anchors.sample.csv

That’s the whole flow. Drop in your CSV, run the tests, and ship a clear report showing exactly which anchors point to the expected URLs and which do not. No manual crawling, no guesswork.
