
How to Extract HTML from Web Pages with Puppeteer

CaptureKit Team
puppeteer · web-scraping · html · automation · tutorial

Extracting HTML content from websites is a fundamental task for web scrapers, data scientists, and developers building automation tools. Puppeteer, a Node.js library developed by Google, provides a robust way to interact with web pages programmatically. In this guide, we'll explore how to extract HTML content effectively with Puppeteer and address common challenges.

What is Puppeteer?

Puppeteer is a powerful Node.js library that provides a high-level API to control Chrome or Chromium browsers. It enables developers to:

  • Scrape web content and extract data
  • Automate form submissions and user interactions
  • Generate screenshots and PDFs
  • Run automated testing
  • Monitor website performance
  • Crawl single-page applications (SPAs)

Let's dive into using Puppeteer for HTML extraction.

Setting Up Puppeteer

First, install Puppeteer via npm:

npm install puppeteer

This command installs both Puppeteer and a compatible version of Chromium. If you'd prefer to use your existing Chrome installation, use puppeteer-core instead:

npm install puppeteer-core

Basic HTML Extraction

Here's a simple script to extract the entire HTML from a webpage:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');
  
  // Get the page's HTML content
  const html = await page.content();
  console.log(html);
  
  await browser.close();
})();

This script:

  1. Launches a headless browser
  2. Opens a new page
  3. Navigates to https://example.com
  4. Extracts the full HTML content
  5. Closes the browser
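One thing the script above doesn't handle is failure: if `goto()` or the extraction throws, the browser is never closed and the process can hang. A small wrapper keeps the cleanup in one place. This is a minimal sketch; the `launch` function is injected as a parameter (a design choice, not a Puppeteer API) so the same helper works with `() => puppeteer.launch()` or any object with the same shape:

```javascript
// Minimal sketch: run a task against a fresh page, always closing the browser.
// `launch` is injected so the helper works with puppeteer (`() => puppeteer.launch()`)
// or anything that returns a browser-like object.
async function withPage(launch, url, task) {
  const browser = await launch();
  try {
    const page = await browser.newPage();
    await page.goto(url);
    return await task(page);
  } finally {
    await browser.close(); // runs even if goto() or task() throws
  }
}
```

Usage then becomes a one-liner: `const html = await withPage(() => puppeteer.launch(), 'https://example.com', page => page.content());`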

Extracting HTML from Specific Elements

To extract HTML from a specific element on the page:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');
  
  // Extract HTML from a specific element
  const elementHtml = await page.evaluate(() => {
    const element = document.querySelector('.main-content');
    return element ? element.outerHTML : null;
  });
  
  console.log(elementHtml);
  await browser.close();
})();

Waiting for Dynamic Content

Modern websites often load content dynamically after the initial HTML arrives. To give that content time to load before extraction, wait for network activity to settle:

await page.goto('https://example.com', { 
  waitUntil: 'networkidle2' 
});

With networkidle2, navigation is considered finished once there have been no more than two network connections for at least 500 ms.

For pages with specific elements that load asynchronously:

await page.waitForSelector('.dynamic-content', { visible: true });
const html = await page.content();
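Puppeteer's waitForSelector (and the related page.waitForFunction) is the right tool inside the browser, but the underlying idea is simple: poll a predicate until it returns true or a timeout expires. A generic sketch of that pattern is handy for Node-side waits that Puppeteer's helpers don't cover (the helper name and options here are our own, not a Puppeteer API):

```javascript
// Generic polling sketch: resolve once `predicate()` returns true,
// reject after `timeout` ms. This mirrors what selector-waiting helpers
// do under the hood; useful for Node-side conditions as well.
function waitFor(predicate, { timeout = 5000, interval = 100 } = {}) {
  const deadline = Date.now() + timeout;
  return new Promise((resolve, reject) => {
    const tick = async () => {
      if (await predicate()) return resolve();
      if (Date.now() > deadline) return reject(new Error('waitFor: timed out'));
      setTimeout(tick, interval);
    };
    tick();
  });
}
```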

Extracting Text Content

If you only need the text content without HTML tags:

const textContent = await page.evaluate(() => {
  return document.body.innerText;
});

For a specific element:

const elementText = await page.$eval('.article', el => el.textContent);

Extracting Metadata

To extract a webpage's metadata like title, description, and Open Graph data:

const metadata = await page.evaluate(() => {
  return {
    title: document.title,
    description: document.querySelector('meta[name="description"]')?.content || null,
    ogTitle: document.querySelector('meta[property="og:title"]')?.content || null,
    ogDescription: document.querySelector('meta[property="og:description"]')?.content || null,
    ogImage: document.querySelector('meta[property="og:image"]')?.content || null
  };
});

console.log(metadata);
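The callback passed to page.evaluate can mirror a plain function that takes any document-like object, which lets you unit-test the selector logic against a stub without launching a browser. A sketch (the function name and the `doc` parameter are our own; inside Puppeteer you would inline the same body in page.evaluate):

```javascript
// Sketch: the same extraction logic as a plain function over a
// document-like object, so it can be tested without a browser.
function extractMetadata(doc) {
  const meta = (selector) => doc.querySelector(selector)?.content || null;
  return {
    title: doc.title,
    description: meta('meta[name="description"]'),
    ogTitle: meta('meta[property="og:title"]'),
    ogDescription: meta('meta[property="og:description"]'),
    ogImage: meta('meta[property="og:image"]'),
  };
}
```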

Extracting Links

To extract all links from a webpage:

const links = await page.evaluate(() => {
  return Array.from(document.querySelectorAll('a')).map(a => {
    return {
      text: a.textContent.trim(),
      href: a.href
    };
  });
});

console.log(links);
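Once the links are extracted, a common next step is splitting them into internal and external sets. A sketch using Node's built-in URL class to compare hostnames (the helper name is our own):

```javascript
// Sketch: split extracted links into internal and external by comparing
// each href's hostname against the page's hostname. Invalid or
// non-http(s) hrefs (mailto:, javascript:) are skipped.
function classifyLinks(links, pageUrl) {
  const pageHost = new URL(pageUrl).hostname;
  const internal = [];
  const external = [];
  for (const { href } of links) {
    let url;
    try { url = new URL(href); } catch { continue; }
    if (url.protocol !== 'http:' && url.protocol !== 'https:') continue;
    (url.hostname === pageHost ? internal : external).push(href);
  }
  return { internal, external };
}
```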

Handling Authentication

For websites that require authentication:

await page.goto('https://example.com/login');
await page.type('#username', 'your_username');
await page.type('#password', 'your_password');

// Click and wait for navigation together, so a fast redirect isn't missed
await Promise.all([
  page.waitForNavigation(),
  page.click('#login-button'),
]);

// Now that we're logged in, extract the protected content
const html = await page.content();
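Logging in on every run is slow and can trip rate limits. Puppeteer's page.cookies() and page.setCookie() let you save a session to disk and restore it later; before restoring, it's worth dropping cookies that have expired. A sketch of that filter (the helper name is our own; Puppeteer cookies carry an `expires` timestamp in seconds, with -1 meaning a session cookie):

```javascript
// Sketch: drop expired cookies before restoring a saved session with
// page.setCookie(...cookies). `expires` is in seconds; -1 (or absence)
// marks a session cookie, which we keep.
function freshCookies(cookies, nowMs = Date.now()) {
  return cookies.filter(
    c => c.expires === undefined || c.expires < 0 || c.expires * 1000 > nowMs
  );
}
```

Typical use: save with `fs.writeFileSync('cookies.json', JSON.stringify(await page.cookies()))` after login, then restore in a later run with `await page.setCookie(...freshCookies(JSON.parse(fs.readFileSync('cookies.json'))))`.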

Avoiding Detection

Many websites implement anti-bot measures. Use stealth mode to avoid detection:

npm install puppeteer-extra puppeteer-extra-plugin-stealth

Then register the plugin and launch as usual:

const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');
puppeteer.use(StealthPlugin());

// Now use puppeteer as usual
const browser = await puppeteer.launch();

Saving Extracted HTML to a File

To save the extracted HTML to a file:

const fs = require('fs');

// Extract HTML
const html = await page.content();

// Write to file
fs.writeFileSync('extracted-page.html', html);

Working with iframes

To extract HTML from an iframe:

const frameContent = await page.frames()[1].content(); // frames()[0] is the main frame, so [1] is the first child frame

// Or find a frame by its name
const namedFrame = page.frames().find(frame => frame.name() === 'frameName');
const namedFrameContent = await namedFrame.content();

Alternative to Puppeteer: CaptureKit API

Setting up and maintaining Puppeteer for HTML extraction can be challenging. If you need a reliable, scalable solution without infrastructure headaches, consider using CaptureKit API:

curl "https://api.capturekit.dev/content?url=https://example.com&access_key=YOUR_ACCESS_KEY&include_html=true"
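Calling the same endpoint from Node is easier if URLSearchParams builds the query string, since it also handles percent-encoding the target URL. A sketch assuming the endpoint and parameter names shown in the curl example above:

```javascript
// Sketch: build the request URL from the curl example above, letting
// URLSearchParams handle encoding. Endpoint and parameter names are
// taken from the example; extra options pass through as-is.
function buildContentUrl(targetUrl, accessKey, options = {}) {
  const params = new URLSearchParams({
    url: targetUrl,
    access_key: accessKey,
    ...options,
  });
  return `https://api.capturekit.dev/content?${params}`;
}
```

Then fetch it as usual: `const res = await fetch(buildContentUrl('https://example.com', 'YOUR_ACCESS_KEY', { include_html: 'true' }));`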

Benefits of CaptureKit API

  • Complete Solution: Extract not just HTML, but also metadata, links, and structured content
  • No Browser Management: No need to maintain browser instances
  • Scale Effortlessly: Handle high-volume extraction without infrastructure concerns

Example Response from CaptureKit API:

{
  "success": true,
  "data": {
    "metadata": {
      "title": "Tailwind CSS - Rapidly build modern websites without ever leaving your HTML.",
      "description": "Tailwind CSS is a utility-first CSS framework.",
      "favicon": "https://tailwindcss.com/favicons/favicon-32x32.png",
      "ogImage": "https://tailwindcss.com/opengraph-image.jpg"
    },
    "links": {
      "internal": ["https://tailwindcss.com/", "https://tailwindcss.com/docs"],
      "external": ["https://tailwindui.com", "https://shopify.com"],
      "social": [
        "https://github.com/tailwindlabs/tailwindcss",
        "https://x.com/tailwindcss"
      ]
    },
    "html": "<html><body><h1>Hello, world!</h1></body></html>"
  }
}

Conclusion

Puppeteer offers powerful capabilities for extracting HTML from websites, but it can be complex to set up and maintain. For developers who need a reliable, maintenance-free solution that provides more than just raw HTML, CaptureKit API offers a compelling alternative with comprehensive data extraction capabilities. By choosing the right approach for your needs, you can streamline your web scraping workflows and focus on using the extracted data rather than managing the extraction process. 🚀