
How to Extract All Links from a Website Using Puppeteer

CaptureKit Team

Extracting all links from a website is a common task in web scraping and automation. Whether you're building a crawler, analyzing a website's structure, or gathering data, having access to all links can be invaluable. In this guide, we'll explore two approaches: using Puppeteer for manual extraction and using CaptureKit API for a simpler solution.

Method 1: Using Puppeteer

Puppeteer is a powerful Node.js library that provides a high-level API to control Chrome or Chromium over the DevTools Protocol. Here's how you can use it to extract all URLs from a page:

const puppeteer = require('puppeteer');

async function extractLinks(url) {
	// Launch the browser
	const browser = await puppeteer.launch();
	const page = await browser.newPage();

	try {
		// Navigate to the URL and wait until the network is idle
		await page.goto(url, { waitUntil: 'networkidle0' });

		// Extract all links (anchor.href is the resolved, absolute URL)
		const links = await page.evaluate(() => {
			const anchors = document.querySelectorAll('a');
			return Array.from(anchors).map((anchor) => anchor.href);
		});

		// Remove duplicates
		const uniqueLinks = [...new Set(links)];

		return uniqueLinks;
	} catch (error) {
		console.error('Error:', error);
		throw error;
	} finally {
		// Always close the browser, even if extraction failed
		await browser.close();
	}
}

// Usage example
async function main() {
	const url = '';
	const links = await extractLinks(url);
	console.log('Found links:', links);
}

main();

This code will:

  1. Launch a headless browser using Puppeteer
  2. Navigate to the specified URL
  3. Extract all <a> tags from the page
  4. Read each tag's href property, which the browser resolves to an absolute URL
  5. Remove any duplicate links
  6. Return the unique list of URLs
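
The deduplication step only removes exact string matches, so the same page can still appear several times with different fragments, and non-web links such as mailto: addresses pass straight through. Here is a minimal sketch of one way to tidy the list after extraction. It assumes you want to treat URLs that differ only by their #fragment as the same page; `normalizeLinks` is a hypothetical helper name, not part of Puppeteer, and the example.com addresses are placeholders.

```javascript
// Hypothetical helper: clean up the raw href list returned by page.evaluate().
function normalizeLinks(hrefs) {
	const seen = new Set();

	for (const href of hrefs) {
		let parsed;
		try {
			parsed = new URL(href);
		} catch {
			continue; // Skip empty or malformed values
		}

		// Keep only web links; this drops mailto:, tel:, javascript:, etc.
		if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') continue;

		// Assumption: "#section" fragments point to the same page
		parsed.hash = '';
		seen.add(parsed.href);
	}

	return [...seen];
}

console.log(
	normalizeLinks([
		'https://example.com/docs#intro',
		'https://example.com/docs#usage',
		'mailto:team@example.com',
		'',
	])
);
```

With this sample input, both fragment variants collapse into a single https://example.com/docs entry, and the mailto: address and the empty string are dropped.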

Handling Dynamic Content

If you're dealing with a website that loads content dynamically, you might need to wait for the content to load:

// Wait for specific elements to load
await page.waitForSelector('a');

// Or wait for network to be idle
await page.waitForNetworkIdle();
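
One caveat: waitForSelector rejects with a timeout error if no matching element ever appears, which would abort the extraction on a page that simply has no links. The sketch below shows one way to treat that case as "no links found" instead of a failure. It assumes `page` is a Puppeteer Page; the `waitForAnchors` name and the 5-second default are illustrative choices, not part of the library.

```javascript
// Hypothetical helper: resolves to true if at least one <a> shows up in time,
// or false if the wait times out. Any other error is rethrown.
async function waitForAnchors(page, timeoutMs = 5000) {
	try {
		await page.waitForSelector('a', { timeout: timeoutMs });
		return true;
	} catch (error) {
		// Puppeteer's timeout errors carry the name 'TimeoutError'
		if (error && error.name === 'TimeoutError') return false;
		throw error;
	}
}

// Usage inside extractLinks(), after page.goto():
// const hasLinks = await waitForAnchors(page);
// if (!hasLinks) return [];
```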

You can also filter links based on specific criteria:

const links = await page.evaluate(() => {
	const anchors = document.querySelectorAll('a');
	return Array.from(anchors)
		.map((anchor) => anchor.href)
		.filter((href) => {
			// Keep only internal links (filters out external ones)
			return href.startsWith('');
			// Or filter by specific patterns
			// return href.includes('/blog/');
		});
});
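
A prefix check can misclassify lookalike hosts: a link to https://example.com.evil.test/ also starts with https://example.com. As an alternative sketch, the links can be split after extraction by comparing origins with the standard URL class. `splitByOrigin` is a hypothetical helper, and the addresses are placeholders. Note that under this rule subdomains count as external, which may or may not be what you want.

```javascript
// Hypothetical helper: split hrefs by whether they share the page's origin
// (scheme + host + port).
function splitByOrigin(hrefs, pageUrl) {
	const pageOrigin = new URL(pageUrl).origin;
	const result = { internal: [], external: [] };

	for (const href of hrefs) {
		let origin;
		try {
			origin = new URL(href).origin;
		} catch {
			continue; // Ignore values that are not absolute URLs
		}
		(origin === pageOrigin ? result.internal : result.external).push(href);
	}

	return result;
}

const { internal, external } = splitByOrigin(
	['https://example.com/blog/post', 'https://other.example.org/'],
	'https://example.com/'
);
console.log({ internal, external });
```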

Method 2: Using CaptureKit API

While Puppeteer is powerful, setting up and maintaining a web scraping solution can be time-consuming and complex. That's where CaptureKit API comes in. Our API provides a simple, reliable way to extract all links from any website, with additional features like link categorization and metadata extraction.

Here's how to use CaptureKit API:

curl ""

The API response includes categorized links and additional metadata:

{
	"success": true,
	"data": {
		"links": {
			"internal": ["", ""],
			"external": ["", ""],
			"social": []
		},
		"metadata": {
			"title": "Tailwind CSS - Rapidly build modern websites without ever leaving your HTML.",
			"description": "Tailwind CSS is a utility-first CSS framework.",
			"favicon": "",
			"ogImage": ""
		}
	}
}
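
Once the response is parsed, the categorized links are easy to reshape for further processing. The sketch below flattens them into rows of category and URL. It assumes only the response shape shown above (`success`, `data.links.internal`, `data.links.external`, `data.links.social`); `flattenLinks` is a hypothetical helper, and the sample object is placeholder data, not real API output.

```javascript
// Hypothetical helper: flatten categorized links into { category, url } rows.
function flattenLinks(response) {
	if (!response || response.success !== true) {
		throw new Error('Request was not successful');
	}

	const links = response.data?.links ?? {};
	return Object.entries(links).flatMap(([category, urls]) =>
		(Array.isArray(urls) ? urls : []).map((url) => ({ category, url }))
	);
}

// Placeholder data for illustration only
const sample = {
	success: true,
	data: {
		links: {
			internal: ['https://example.com/docs'],
			external: ['https://example.org/'],
			social: [],
		},
	},
};

console.log(flattenLinks(sample));
```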

Benefits of Using CaptureKit API

  1. Categorized Links: Links are automatically categorized into internal, external, and social links
  2. Additional Metadata: Get website title, description, favicon, and OpenGraph image
  3. Reliability: No need to handle browser automation, network issues, or rate limiting
  4. Speed: Results are returned in seconds, not minutes
  5. Maintenance-Free: No need to update code when websites change their structure


While Puppeteer provides a powerful way to extract URLs programmatically, it requires significant setup and maintenance. For most use cases, using CaptureKit API is the recommended approach, offering a simpler, more reliable solution with additional features like link categorization and metadata extraction.

Choose the method that best fits your needs:

  • Use Puppeteer if you need full control over the scraping process or have specific requirements
  • Use CaptureKit API if you want a quick, reliable solution with additional features