How to Extract All Links from a Website Using Puppeteer

Learn how to extract all links from a website using Puppeteer, and discover an easier alternative using the CaptureKit API.
Extracting all links from a website is a common task in web scraping and automation. Whether you're building a crawler, analyzing a website's structure, or gathering data, having access to all links can be invaluable. In this guide, we'll explore two approaches: using Puppeteer for manual extraction and using CaptureKit API for a simpler solution.
Method 1: Using Puppeteer
Puppeteer is a powerful Node.js library that provides a high-level API to control Chrome or Chromium over the DevTools Protocol. Here's how you can use it to extract all URLs from a website:
```javascript
const puppeteer = require('puppeteer');

async function extractLinks(url) {
    // Launch the browser
    const browser = await puppeteer.launch();
    const page = await browser.newPage();

    try {
        // Navigate to the URL
        await page.goto(url, { waitUntil: 'networkidle0' });

        // Extract all links
        const links = await page.evaluate(() => {
            const anchors = document.querySelectorAll('a');
            return Array.from(anchors).map((anchor) => anchor.href);
        });

        // Remove duplicates
        const uniqueLinks = [...new Set(links)];
        return uniqueLinks;
    } catch (error) {
        console.error('Error:', error);
        throw error;
    } finally {
        await browser.close();
    }
}

// Usage example
async function main() {
    const url = 'https://example.com';
    const links = await extractLinks(url);
    console.log('Found links:', links);
}

main();
```
This code will:

- Launch a headless browser using Puppeteer
- Navigate to the specified URL
- Extract all `<a>` tags from the page
- Get their `href` attributes
- Remove any duplicate links
- Return the unique list of URLs
Handling Dynamic Content
If you're dealing with a website that loads content dynamically, you might need to wait for the content to load:
```javascript
// Wait for specific elements to load
await page.waitForSelector('a');

// Or wait for the network to be idle
await page.waitForNetworkIdle();
```
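If links are injected even after the network goes idle, another option is to poll until the number of anchors on the page stops changing. Here is a minimal sketch of that idea; the `waitForStableCount` helper and its parameters are ours, written against a plain async counter so it is not tied to any particular Puppeteer API:

```javascript
// Poll getCount() until it returns the same value twice in a row,
// or give up after maxTries attempts.
async function waitForStableCount(getCount, { intervalMs = 200, maxTries = 25 } = {}) {
    let previous = -1;
    for (let i = 0; i < maxTries; i++) {
        const current = await getCount();
        if (current === previous) return current; // count has stabilized
        previous = current;
        await new Promise((resolve) => setTimeout(resolve, intervalMs));
    }
    throw new Error('Link count never stabilized');
}

// With Puppeteer, getCount would typically wrap page.evaluate, e.g.:
// () => page.evaluate(() => document.querySelectorAll('a').length)
```

Polling like this trades a little latency for robustness on pages where no single network or selector event signals "done".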
Filtering Links
You can also filter links based on specific criteria:
```javascript
const links = await page.evaluate(() => {
    const anchors = document.querySelectorAll('a');
    return Array.from(anchors)
        .map((anchor) => anchor.href)
        .filter((href) => {
            // Filter out external links
            return href.startsWith('https://example.com');
            // Or filter by specific patterns
            // return href.includes('/blog/');
        });
});
```
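Note that a plain `startsWith` check misses subdomains and protocol variants. A slightly more robust sketch compares hostnames using the standard `URL` class (the `isSameSite` helper name is ours):

```javascript
// Returns true when href belongs to the same host as baseUrl
// (or one of its subdomains). Unparseable URLs are filtered out.
function isSameSite(href, baseUrl) {
    try {
        const target = new URL(href);
        const base = new URL(baseUrl);
        return (
            target.hostname === base.hostname ||
            target.hostname.endsWith('.' + base.hostname)
        );
    } catch {
        return false; // not a parseable absolute URL
    }
}

// Example: keep only links pointing back at example.com
const sameSite = [
    'https://example.com/about',
    'http://example.com/legacy',
    'https://blog.example.com/post',
    'https://other.com/page',
].filter((href) => isSameSite(href, 'https://example.com'));
// other.com is dropped; the first three remain
```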
Method 2: Using CaptureKit API (Recommended)
While Puppeteer is powerful, setting up and maintaining a web scraping solution can be time-consuming and complex. That's where CaptureKit API comes in. Our API provides a simple, reliable way to extract all links from any website, with additional features like link categorization and metadata extraction.
Here's how to use CaptureKit API:
```bash
curl "https://api.capturekit.dev/v1/content?url=https://tailwindcss.com&x_api_key=YOUR_ACCESS_KEY"
```
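The same request can be made from Node.js (18+, where `fetch` is built in). The endpoint and `x_api_key` parameter mirror the curl call above; the helper functions themselves are our sketch, not part of an official SDK:

```javascript
// Build the CaptureKit request URL with properly encoded parameters.
function buildContentUrl(targetUrl, apiKey) {
    const endpoint = new URL('https://api.capturekit.dev/v1/content');
    endpoint.searchParams.set('url', targetUrl);
    endpoint.searchParams.set('x_api_key', apiKey);
    return endpoint.toString();
}

// Fetch the categorized links for a page.
async function getLinks(targetUrl, apiKey) {
    const response = await fetch(buildContentUrl(targetUrl, apiKey));
    if (!response.ok) {
        throw new Error(`Request failed: ${response.status}`);
    }
    const { data } = await response.json();
    return data.links; // { internal: [...], external: [...], social: [...] }
}
```

Using the `URL` class for construction avoids encoding bugs when the target URL itself contains query strings.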
The API response includes categorized links and additional metadata:
```json
{
    "success": true,
    "data": {
        "links": {
            "internal": ["https://tailwindcss.com/", "https://tailwindcss.com/docs"],
            "external": ["https://tailwindui.com", "https://shopify.com"],
            "social": [
                "https://github.com/tailwindlabs/tailwindcss",
                "https://x.com/tailwindcss"
            ]
        },
        "metadata": {
            "title": "Tailwind CSS - Rapidly build modern websites without ever leaving your HTML.",
            "description": "Tailwind CSS is a utility-first CSS framework.",
            "favicon": "https://tailwindcss.com/favicons/favicon-32x32.png",
            "ogImage": "https://tailwindcss.com/opengraph-image.jpg"
        }
    }
}
```
Benefits of Using CaptureKit API
- Categorized Links: Links are automatically categorized into internal, external, and social links
- Additional Metadata: Get website title, description, favicon, and OpenGraph image
- Reliability: No need to handle browser automation, network issues, or rate limiting
- Speed: Results are returned in seconds, not minutes
- Maintenance-Free: No need to update code when websites change their structure
Conclusion
While Puppeteer provides a powerful way to extract URLs programmatically, it requires significant setup and maintenance. For most use cases, using CaptureKit API is the recommended approach, offering a simpler, more reliable solution with additional features like link categorization and metadata extraction.
Choose the method that best fits your needs:
- Use Puppeteer if you need full control over the scraping process or have specific requirements
- Use CaptureKit API if you want a quick, reliable solution with additional features