Developing a custom website crawler can be an exciting and educational journey for developers, especially when deploying to a headless Linux environment such as AWS EC2. Recently, I built a website crawler using Java, Selenium, and JSoup to programmatically discover and analyze all internal URLs of a given domain.

This article outlines my development and deployment process, including the key challenges I faced and how I overcame them. If you’re trying to build something similar, I hope this step-by-step breakdown helps you avoid common pitfalls.


Problem Statement

I wanted to build a crawler that:

  • Automatically discovers all internal links on a website
  • Handles JavaScript-rendered content (not just static HTML)
  • Saves unique internal URLs to a database
  • Operates reliably in a headless Linux server environment (Ubuntu on AWS EC2)

Technology Stack

  • Java 17
  • Selenium WebDriver with ChromeDriver
  • JSoup for fast and clean HTML parsing
  • Chromium-browser for headless operation
  • Spring Boot for integrating with backend services
  • AWS EC2 Ubuntu as the deployment environment

Initial Crawler Design with JSoup

I started with a basic version of the crawler using only JSoup:

  • JSoup connects to each URL
  • Extracts all <a href=""> links
  • Normalizes the URLs
  • Saves new internal URLs to a database

This was sufficient for many static pages, but not for single-page applications (SPAs) or content rendered via JavaScript.

// Fetch and parse the page with JSoup (static HTML only)
Document doc = Jsoup.connect(currentUrl)
    .userAgent("Mozilla/5.0") // some servers reject requests with no User-Agent
    .timeout(10_000)          // fail fast on slow hosts: 10-second timeout
    .get();

// Select every anchor element that carries an href attribute
Elements links = doc.select("a[href]");
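The normalization step above deserves a closer look, since without it the same page gets crawled under several spellings. Here is a minimal sketch of what I mean; the `UrlNormalizer` class and its rules (resolve relative links against the current page, drop the `#fragment`, lowercase the authority) are my illustration, not a fixed API:

```java
import java.net.URI;
import java.net.URISyntaxException;

public class UrlNormalizer {
    // Resolve a possibly-relative href against the page it was found on,
    // strip the #fragment, and lowercase the authority so that
    // http://Example.com/a#top and http://example.com/a dedupe to one key.
    public static String normalize(String baseUrl, String href) {
        try {
            URI resolved = new URI(baseUrl).resolve(href);
            URI cleaned = new URI(
                    resolved.getScheme(),
                    resolved.getAuthority() == null ? null : resolved.getAuthority().toLowerCase(),
                    resolved.getPath(),
                    resolved.getQuery(),
                    null); // null fragment drops any "#..."
            return cleaned.toString();
        } catch (URISyntaxException e) {
            return href; // leave unparseable links untouched
        }
    }
}
```

Deduplicating on the normalized form keeps the visited set and the database free of near-identical entries.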

Moving to Selenium for Dynamic Content

To deal with JavaScript-heavy websites, I integrated Selenium WebDriver with headless Chrome.

Challenge: Headless Chrome on AWS EC2

When deploying to an AWS Ubuntu instance, I hit this error:

Could not start a new session. Response code 500. Message: session not created: probably user data directory is already in use

This was resolved by:

  • Specifying a unique --user-data-dir using UUID.randomUUID()
  • Ensuring /usr/bin/chromedriver and /usr/bin/chromium-browser were explicitly configured

// Point Selenium at the Chromium binary instead of relying on PATH
options.setBinary("/usr/bin/chromium-browser");
// Give every session its own profile directory to avoid lock conflicts
options.addArguments("--user-data-dir=/tmp/chrome-profile-" + UUID.randomUUID());
// Tell Selenium where the chromedriver executable lives
System.setProperty("webdriver.chrome.driver", "/usr/bin/chromedriver");

This prevented Chrome from crashing due to shared profile conflicts in headless mode.
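For the binaries themselves: a fresh Ubuntu EC2 instance ships with neither the browser nor the driver. Something along these lines worked for me; package names and install paths vary by Ubuntu release (on newer releases Chromium ships as a snap), so treat this as a starting point rather than a recipe:

```shell
# Install Chromium and a matching chromedriver (Ubuntu package names;
# may differ on newer releases where Chromium is snap-packaged)
sudo apt-get update
sudo apt-get install -y chromium-browser chromium-chromedriver

# Confirm the paths match what the Java code configures explicitly
which chromium-browser
which chromedriver
```

Keeping the browser and driver from the same package source avoids version-mismatch errors at session startup.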


Full Crawl Logic

  1. Start with a given root URL
  2. Create a queue and set of visited URLs
  3. Load each page using Selenium
  4. Extract all anchor and raw URLs using Selenium + Regex
  5. Add new internal links to the queue
  6. Store all discovered URLs in the database via a JPA repository

// Collect anchors and <link> elements rendered in the live DOM
List<WebElement> linkElements = driver.findElements(By.cssSelector("a[href], link[href]"));

// Also scan the raw page source, catching URLs outside href attributes
String pageSource = driver.getPageSource();
Matcher matcher = Pattern.compile("https?://[^\\s\"'<>]+").matcher(pageSource);
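Steps 4 and 5 boil down to filtering the harvested URLs down to the root domain before queueing them. The sketch below isolates that filter as a self-contained method (the `LinkExtractor` class name is mine); in the crawler, the pageSource argument is what driver.getPageSource() returns:

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractor {
    private static final Pattern URL_PATTERN =
            Pattern.compile("https?://[^\\s\"'<>]+");

    // Scan raw page source for absolute URLs and keep only those on rootHost.
    // A LinkedHashSet dedupes while preserving discovery order for the queue.
    public static Set<String> extractInternal(String pageSource, String rootHost) {
        Set<String> internal = new LinkedHashSet<>();
        Matcher matcher = URL_PATTERN.matcher(pageSource);
        while (matcher.find()) {
            String url = matcher.group();
            try {
                if (rootHost.equalsIgnoreCase(new URI(url).getHost())) {
                    internal.add(url);
                }
            } catch (URISyntaxException ignored) {
                // the regex can match strings that are not valid URIs; skip them
            }
        }
        return internal;
    }
}
```

Every URL that survives the filter and is not yet in the visited set goes onto the queue and into the database.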


Key Learnings

  • User Data Directory Conflicts: In headless environments, Chrome often fails if the --user-data-dir is not unique per session.
  • Selenium + JSoup Combo: For dynamic + static content crawling, combining both tools provides full coverage.
  • Explicit Binaries Matter: Don’t rely on system PATH. Explicitly set chromedriver and chromium-browser binaries.
  • Thread.sleep vs WebDriverWait: For JS-heavy sites, a fixed Thread.sleep is fragile; an explicit WebDriverWait on an expected condition ensures content is fully loaded before extracting URLs.
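For that last point, the explicit wait looks roughly like this with Selenium 4's Duration-based API. The ten-second ceiling and the a[href] condition are arbitrary choices for illustration, and driver is assumed to be an already-initialized WebDriver:

```java
import java.time.Duration;
import org.openqa.selenium.By;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

// Block until at least one link is present in the DOM (up to 10s),
// instead of sleeping a fixed amount and hoping the JS has finished.
new WebDriverWait(driver, Duration.ofSeconds(10))
        .until(ExpectedConditions.presenceOfElementLocated(By.cssSelector("a[href]")));
```

Unlike Thread.sleep, the wait returns as soon as the condition is met, so fast pages are not penalized.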

Final Thoughts

This crawler is now part of a broader data pipeline that analyzes page indexing status, page speed insights, and UX metrics using Google APIs.

By solving one issue at a time and debugging with logs and headless Chrome insights, I was able to get it working smoothly on my production EC2 instance.

If you’re building your own crawler, I hope this gives you both technical insights and confidence to tackle headless browser issues and dynamic content extraction.
