
How to Extract All Links from a Website Using Puppeteer

CaptureKit Team
puppeteer · web-scraping · automation · tutorial

Extracting all links from a website is a common task in web scraping and automation. Whether you're building a crawler, analyzing a website's structure, or gathering data, having access to all links can be invaluable. In this guide, we'll explore two approaches: using Puppeteer for manual extraction and using CaptureKit API for a simpler solution.

Method 1: Using Puppeteer

Puppeteer is a powerful Node.js library that provides a high-level API to control Chrome or Chromium over the DevTools Protocol. Here's how you can use it to extract all URLs from a website:

const puppeteer = require('puppeteer');

async function extractLinks(url) {
	// Launch the browser
	const browser = await puppeteer.launch();
	const page = await browser.newPage();

	try {
		// Navigate to the URL
		await page.goto(url, { waitUntil: 'networkidle0' });

		// Extract all links
		const links = await page.evaluate(() => {
			const anchors = document.querySelectorAll('a');
			return Array.from(anchors).map((anchor) => anchor.href);
		});

		// Remove duplicates
		const uniqueLinks = [...new Set(links)];

		return uniqueLinks;
	} catch (error) {
		console.error('Error:', error);
		throw error;
	} finally {
		await browser.close();
	}
}

// Usage example
async function main() {
	const url = 'https://example.com';
	const links = await extractLinks(url);
	console.log('Found links:', links);
}

main().catch(console.error);

This code will:

  1. Launch a headless browser using Puppeteer
  2. Navigate to the specified URL
  3. Extract all <a> tags from the page
  4. Get their href attributes
  5. Remove any duplicate links
  6. Return the unique list of URLs
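
If you're gathering data, you'll usually want to persist the result rather than just log it. Here's a minimal sketch using Node's built-in fs module (the output filename is arbitrary):

const fs = require('fs');

async function saveLinks() {
	const links = await extractLinks('https://example.com');
	// Write the unique URLs to disk as a JSON array
	fs.writeFileSync('links.json', JSON.stringify(links, null, 2));
	console.log(`Saved ${links.length} links to links.json`);
}

saveLinks().catch(console.error);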

Handling Dynamic Content

If you're dealing with a website that loads content dynamically, you might need to wait for the content to load:

// Wait for specific elements to load
await page.waitForSelector('a');

// Or wait for network to be idle
await page.waitForNetworkIdle();
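
Some sites only attach more links as you scroll (infinite scroll or lazy loading). In that case, a small scrolling helper can surface them before extraction. This is a sketch, not part of Puppeteer's API; scrollToBottom is a hypothetical helper, and the step and interval values are arbitrary:

async function scrollToBottom(page) {
	await page.evaluate(async () => {
		await new Promise((resolve) => {
			let scrolled = 0;
			const step = 500; // pixels per tick
			const timer = setInterval(() => {
				window.scrollBy(0, step);
				scrolled += step;
				// Stop once we've covered the full page height
				if (scrolled >= document.body.scrollHeight) {
					clearInterval(timer);
					resolve();
				}
			}, 100); // tick every 100 ms
		});
	});
}

// Call it before extracting links
await scrollToBottom(page);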

You can also filter links based on specific criteria:

const links = await page.evaluate(() => {
	const anchors = document.querySelectorAll('a');
	return Array.from(anchors)
		.map((anchor) => anchor.href)
		.filter((href) => {
			// Filter out external links
			return href.startsWith('https://example.com');
			// Or filter by specific patterns
			// return href.includes('/blog/');
		});
});
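
Hardcoding the domain works for a one-off script. If you'd rather keep whichever links share the page's own origin, you can pass the URL into evaluate and compare origins with the standard URL API. A sketch, assuming url holds the address you navigated to:

const links = await page.evaluate((baseUrl) => {
	const origin = new URL(baseUrl).origin;
	return Array.from(document.querySelectorAll('a'))
		.map((anchor) => anchor.href)
		.filter((href) => {
			try {
				// Keep only links on the same origin as the page
				return new URL(href).origin === origin;
			} catch {
				return false; // Skip values that aren't valid URLs
			}
		});
}, url);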

Method 2: Using CaptureKit API

While Puppeteer is powerful, setting up and maintaining a web scraping solution can be time-consuming and complex. That's where CaptureKit API comes in. Our API provides a simple, reliable way to extract all links from any website, with additional features like link categorization and metadata extraction.

Here's how to use CaptureKit API:

curl "https://api.capturekit.dev/content?url=https://tailwindcss.com&access_key=YOUR_ACCESS_KEY"

The API response includes categorized links and additional metadata:

{
	"success": true,
	"data": {
		"links": {
			"internal": ["https://tailwindcss.com/", "https://tailwindcss.com/docs"],
			"external": ["https://tailwindui.com", "https://shopify.com"],
			"social": [
				"https://github.com/tailwindlabs/tailwindcss",
				"https://x.com/tailwindcss"
			]
		},
		"metadata": {
			"title": "Tailwind CSS - Rapidly build modern websites without ever leaving your HTML.",
			"description": "Tailwind CSS is a utility-first CSS framework.",
			"favicon": "https://tailwindcss.com/favicons/favicon-32x32.png",
			"ogImage": "https://tailwindcss.com/opengraph-image.jpg"
		}
	}
}
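
Since the categorized links come back as plain arrays, working with them is straightforward. For example, building on the fetchLinks sketch above:

fetchLinks('https://tailwindcss.com').then(({ data }) => {
	console.log(`Internal links: ${data.links.internal.length}`);
	console.log(`External links: ${data.links.external.length}`);
	console.log('Page title:', data.metadata.title);
});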

Benefits of Using CaptureKit API

  1. Categorized Links: Links are automatically categorized into internal, external, and social links
  2. Additional Metadata: Get website title, description, favicon, and OpenGraph image
  3. Reliability: No need to handle browser automation, network issues, or rate limiting
  4. Speed: Results are returned in seconds, not minutes
  5. Maintenance-Free: No need to update code when websites change their structure

Conclusion

While Puppeteer provides a powerful way to extract URLs programmatically, it requires significant setup and maintenance. For most use cases, using CaptureKit API is the recommended approach, offering a simpler, more reliable solution with additional features like link categorization and metadata extraction.

Choose the method that best fits your needs:

  • Use Puppeteer if you need full control over the scraping process or have specific requirements
  • Use CaptureKit API if you want a quick, reliable solution with additional features