
How to Extract All Links from a Website Using Puppeteer

CaptureKit Team
puppeteer · web-scraping · automation · tutorial

Extracting all links from a website is a common task in web scraping and automation. Whether you're building a crawler, analyzing a website's structure, or gathering data, having access to all links can be invaluable. In this guide, we'll explore two approaches: using Puppeteer for manual extraction and using CaptureKit API for a simpler solution.

Method 1: Using Puppeteer

Puppeteer is a powerful Node.js library that provides a high-level API to control Chrome or Chromium over the DevTools Protocol. Here's how you can use it to extract all URLs from a website:

const puppeteer = require('puppeteer');

async function extractLinks(url) {
	// Launch the browser
	const browser = await puppeteer.launch();
	const page = await browser.newPage();

	try {
		// Navigate to the URL
		await page.goto(url, { waitUntil: 'networkidle0' });

		// Extract all links
		const links = await page.evaluate(() => {
			const anchors = document.querySelectorAll('a');
			return Array.from(anchors).map((anchor) => anchor.href);
		});

		// Remove duplicates
		const uniqueLinks = [...new Set(links)];

		return uniqueLinks;
	} catch (error) {
		console.error('Error:', error);
		throw error;
	} finally {
		await browser.close();
	}
}

// Usage example
async function main() {
	const url = 'https://example.com';
	const links = await extractLinks(url);
	console.log('Found links:', links);
}

main().catch(console.error);

This code will:

  1. Launch a headless browser using Puppeteer
  2. Navigate to the specified URL
  3. Extract all <a> tags from the page
  4. Get their href attributes
  5. Remove any duplicate links
  6. Return the unique list of URLs
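
If you're gathering data, you'll usually want to persist the result rather than just log it. Here's a minimal sketch using Node's built-in fs module (the output filename is arbitrary):

const fs = require('fs');

async function saveLinks() {
	const links = await extractLinks('https://example.com');
	// Write the unique URLs to disk as a JSON array
	fs.writeFileSync('links.json', JSON.stringify(links, null, 2));
	console.log(`Saved ${links.length} links to links.json`);
}

saveLinks().catch(console.error);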

Handling Dynamic Content

If you're dealing with a website that loads content dynamically, you might need to wait for the content to load:

// Wait for specific elements to load
await page.waitForSelector('a');

// Or wait for network to be idle
await page.waitForNetworkIdle();
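
Some sites only attach more links as you scroll (infinite scroll or lazy loading). In that case, a small scrolling helper can surface them before extraction. This is a sketch, not part of Puppeteer's API; scrollToBottom is a hypothetical helper, and the step and interval values are arbitrary:

async function scrollToBottom(page) {
	await page.evaluate(async () => {
		await new Promise((resolve) => {
			let scrolled = 0;
			const step = 500; // pixels per tick
			const timer = setInterval(() => {
				window.scrollBy(0, step);
				scrolled += step;
				// Stop once we've covered the full page height
				if (scrolled >= document.body.scrollHeight) {
					clearInterval(timer);
					resolve();
				}
			}, 100); // tick every 100 ms
		});
	});
}

// Call it before extracting links
await scrollToBottom(page);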

You can also filter links based on specific criteria:

const links = await page.evaluate(() => {
	const anchors = document.querySelectorAll('a');
	return Array.from(anchors)
		.map((anchor) => anchor.href)
		.filter((href) => {
			// Filter out external links
			return href.startsWith('https://example.com');
			// Or filter by specific patterns
			// return href.includes('/blog/');
		});
});
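
Hardcoding the domain works for a one-off script. If you'd rather keep whichever links share the page's own origin, you can pass the URL into evaluate and compare origins with the standard URL API. A sketch, assuming url holds the address you navigated to:

const links = await page.evaluate((baseUrl) => {
	const origin = new URL(baseUrl).origin;
	return Array.from(document.querySelectorAll('a'))
		.map((anchor) => anchor.href)
		.filter((href) => {
			try {
				// Keep only links on the same origin as the page
				return new URL(href).origin === origin;
			} catch {
				return false; // Skip values that aren't valid URLs
			}
		});
}, url);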

Method 2: Using CaptureKit API

While Puppeteer is powerful, setting up and maintaining a web scraping solution can be time-consuming and complex. That's where CaptureKit API comes in. Our API provides a simple, reliable way to extract all links from any website, with additional features like link categorization and metadata extraction.

Here's how to use CaptureKit API:

curl "https://api.capturekit.dev/content?url=https://tailwindcss.com&access_key=YOUR_ACCESS_KEY"

The API response includes categorized links and additional metadata:

{
	"success": true,
	"data": {
		"links": {
			"internal": ["https://tailwindcss.com/", "https://tailwindcss.com/docs"],
			"external": ["https://tailwindui.com", "https://shopify.com"],
			"social": [
				"https://github.com/tailwindlabs/tailwindcss",
				"https://x.com/tailwindcss"
			]
		},
		"metadata": {
			"title": "Tailwind CSS - Rapidly build modern websites without ever leaving your HTML.",
			"description": "Tailwind CSS is a utility-first CSS framework.",
			"favicon": "https://tailwindcss.com/favicons/favicon-32x32.png",
			"ogImage": "https://tailwindcss.com/opengraph-image.jpg"
		}
	}
}
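
Since the categorized links come back as plain arrays, working with them is straightforward. For example, building on the fetchLinks sketch above:

fetchLinks('https://tailwindcss.com').then(({ data }) => {
	console.log(`Internal links: ${data.links.internal.length}`);
	console.log(`External links: ${data.links.external.length}`);
	console.log('Page title:', data.metadata.title);
});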

Benefits of Using CaptureKit API

  1. Categorized Links: Links are automatically categorized into internal, external, and social links
  2. Additional Metadata: Get website title, description, favicon, and OpenGraph image
  3. Reliability: No need to handle browser automation, network issues, or rate limiting
  4. Speed: Results are returned in seconds, not minutes
  5. Maintenance-Free: No need to update code when websites change their structure

Conclusion

While Puppeteer provides a powerful way to extract URLs programmatically, it requires significant setup and maintenance. For most use cases, using CaptureKit API is the recommended approach, offering a simpler, more reliable solution with additional features like link categorization and metadata extraction.

Choose the method that best fits your needs:

  • Use Puppeteer if you need full control over the scraping process or have specific requirements
  • Use CaptureKit API if you want a quick, reliable solution with additional features