Back to all posts

How to Extract Links from a Sitemap

Jonathan Geiger
sitemapweb-scrapingautomationtutorial

Sitemaps provide an organized map of a website's content, making them invaluable for SEO analysis, content auditing, and web scraping. In this guide, we'll show you how to extract links from a sitemap using Node.js and XML parsing libraries, then demonstrate how CaptureKit API offers a simpler alternative.

Sitemaps are XML files that list all the important URLs on a website. To extract links from a sitemap, we'll need to:

  1. Find the sitemap URL (usually at /sitemap.xml)
  2. Fetch and parse the XML content
  3. Extract the links from the parsed XML

Here's a complete solution using Node.js with axios and xml2js:

import axios from 'axios';
import { parseStringPromise } from 'xml2js';

// Maximum number of links to fetch
const MAX_LINKS = 100;

// Main function to find and extract sitemap links
async function extractSitemapLinks(url) {
  try {
    // Step 1: Find the sitemap URL
    const sitemapUrl = await getSitemapUrl(url);
    if (!sitemapUrl) {
      console.log('No sitemap found for this website');
      return [];
    }
    
    // Step 2: Fetch links from the sitemap
    const links = await fetchSitemapLinks(sitemapUrl);
    return links;
  } catch (error) {
    console.error('Error extracting sitemap links:', error);
    return [];
  }
}

// Function to determine the sitemap URL
export async function getSitemapUrl(url) {
  try {
    const { origin, fullPath } = formatUrl(url);

    // If the URL already points to an XML file, verify if it's a valid sitemap
    if (fullPath.endsWith('.xml')) {
      const isValidSitemap = await verifySitemap(fullPath);
      return isValidSitemap ? fullPath : null;
    }

    // Common sitemap paths to check in order of popularity
    const commonSitemapPaths = [
      '/sitemap.xml',
      '/sitemap_index.xml',
      '/sitemap-index.xml',
      '/sitemaps.xml',
      '/sitemap/sitemap.xml',
      '/sitemaps/sitemap.xml',
      '/sitemap/index.xml',
      '/wp-sitemap.xml', // WordPress
      '/sitemap_news.xml', // News specific
      '/sitemap_products.xml', // E-commerce
      '/post-sitemap.xml', // Blog specific
      '/page-sitemap.xml', // Page specific
      '/robots.txt', // Sometimes sitemap URL is in robots.txt
    ];

    // Try each path in order
    for (const path of commonSitemapPaths) {
      // If we're checking robots.txt, we need to extract the sitemap URL from it
      if (path === '/robots.txt') {
        try {
          const robotsUrl = `${origin}${path}`;
          const robotsResponse = await axios.get(robotsUrl);
          const robotsContent = robotsResponse.data;

          // Extract sitemap URL from robots.txt
          const sitemapMatch = robotsContent.match(/Sitemap:\s*(.+)/i);
          if (sitemapMatch && sitemapMatch[1]) {
            const robotsSitemapUrl = sitemapMatch[1].trim();
            const isValid = await verifySitemap(robotsSitemapUrl);
            if (isValid) return robotsSitemapUrl;
          }
        } catch (e) {
          // If robots.txt check fails, continue to the next option
          continue;
        }
      } else {
        const sitemapUrl = `${origin}${path}`;
        const isValidSitemap = await verifySitemap(sitemapUrl);
        if (isValidSitemap) return sitemapUrl;
      }
    }

    return null;
  } catch (error) {
    console.error('Error determining sitemap URL:', error);
    return null;
  }
}

// Verify if a URL is a valid sitemap
export async function verifySitemap(sitemapUrl) {
  try {
    const response = await axios.get(sitemapUrl);
    const parsedSitemap = await parseStringPromise(response.data);

    // Check for <urlset> or <sitemapindex> tags
    return Boolean(parsedSitemap.urlset || parsedSitemap.sitemapindex);
  } catch (e) {
    console.error(`Invalid sitemap at ${sitemapUrl}`);
    return false;
  }
}

// Format URL to ensure it has proper scheme and structure
export function formatUrl(url) {
  if (!url.startsWith('http')) {
    url = `https://${url}`;
  }

  const { origin, pathname } = new URL(url.replace('http:', 'https:'));

  return {
    origin,
    fullPath: `${origin}${pathname}`.replace(/\/$/, ''),
  };
}

// Fetch and process links from the sitemap
export async function fetchSitemapLinks(sitemapUrl, maxLinks = MAX_LINKS) {
  const links = [];

  try {
    const response = await axios.get(sitemapUrl);
    const sitemapContent = response.data;

    const parsedSitemap = await parseStringPromise(sitemapContent);

    // Handle sitemaps with <urlset>
    if (parsedSitemap.urlset?.url) {
      for (const urlObj of parsedSitemap.urlset.url) {
        try {
          if (links.length >= maxLinks) break;
          links.push(urlObj.loc[0]);
        } catch (e) {
          console.error('Error parsing sitemap:', e);
        }
      }
    }
    
    // Handle nested sitemaps with <sitemapindex>
    if (parsedSitemap.sitemapindex?.sitemap) {
      for (const sitemapObj of parsedSitemap.sitemapindex.sitemap) {
        try {
          if (links.length >= maxLinks) break; // Stop processing further sitemaps
          const nestedSitemapUrl = sitemapObj.loc[0];
          const nestedLinks = await fetchSitemapLinks(
            nestedSitemapUrl,
            maxLinks - links.length
          );
          links.push(...nestedLinks.slice(0, maxLinks - links.length));
        } catch (e) {
          console.error('Error parsing sitemap:', e);
        }
      }
    }

    return links.slice(0, maxLinks);
  } catch (error) {
    console.error(`Error fetching sitemap from ${sitemapUrl}:`, error);
    return links || [];
  }
}

// Usage example
async function main() {
  const url = 'https://example.com';
  const links = await extractSitemapLinks(url);
  console.log('Found links:', links);
}

main();

How It Works

This code handles several important aspects of sitemap processing:

  1. Sitemap Discovery: It checks multiple common sitemap locations, including robots.txt.
  2. Sitemap Validation: It verifies that XML files are valid sitemaps by checking for standard sitemap elements.
  3. Nested Sitemaps: It processes sitemap indexes that point to other sitemaps.
  4. Link Extraction: It extracts the <loc> elements from both regular sitemaps and sitemap indexes.

Handling Different Sitemap Types

Sitemaps come in different forms:

  1. Standard Sitemaps: These contain a list of URLs in <url> elements.
  2. Sitemap Indexes: These contain links to other sitemaps in <sitemap> elements.
  3. Specialized Sitemaps: Some sites have separate sitemaps for news, products, images, or videos.

Our code handles all these cases by:

  • Checking for both <urlset> (standard sitemap) and <sitemapindex> (sitemap index) elements
  • Recursively processing nested sitemaps
  • Limiting the total number of links to avoid memory issues

While the Node.js approach is flexible, it requires handling HTTP requests, XML parsing, error handling, and recursive sitemap traversal. CaptureKit API offers a simpler solution that handles all these complexities for you.

Here's how to use CaptureKit API to extract sitemap data:

curl "https://api.capturekit.dev/content?url=https://example.com&access_key=YOUR_ACCESS_KEY&include_sitemap=true"

The API response includes organized sitemap data along with other useful website information:

{
  "success": true,
  "data": {
    "metadata": {
      "title": "Example Website",
      "description": "Example website description",
      "favicon": "https://example.com/favicon.ico",
      "ogImage": "https://example.com/og-image.png"
    },
    "sitemap": {
      "source": "https://example.com/sitemap.xml",
      "totalLinks": 150,
      "links": [
        "https://example.com/",
        "https://example.com/page-content",
        "https://example.com/ai"
        // More links here...
      ]
    }
  }
}

Benefits of Using CaptureKit API

  1. Simplicity: One API call instead of dozens of lines of code
  2. Reliability: Handles all edge cases, redirects, and error conditions
  3. Performance: Optimized for speed and efficiency
  4. Additional Data: Get website metadata alongside sitemap information
  5. No Maintenance: No need to update your code when sitemap formats change

Conclusion

Extracting links from sitemaps is essential for many web scraping and SEO tasks. While our Node.js solution provides a comprehensive approach with full control, CaptureKit API offers a more convenient alternative that handles the complexities for you.

Choose the method that best fits your needs:

  • Use the Node.js solution if you need full control over the extraction process
  • Use CaptureKit API if you want a quick, reliable solution with minimal code

By leveraging sitemaps, you can efficiently access a website's structure and content without having to crawl every page manually, saving time and resources while ensuring you don't miss important content.