How to Run Puppeteer on AWS Lambda

CaptureKit Team
puppeteer, web-scraping, AWS, serverless, automation

Running Puppeteer on AWS Lambda can be challenging due to the serverless environment's limitations and Chrome's resource requirements. However, with the right setup and optimizations, it's possible to create a reliable web scraping solution that scales automatically. In this guide, we'll explore how to set up Puppeteer on AWS Lambda and provide a working boilerplate solution.

Why Run Puppeteer on AWS Lambda?

Running Puppeteer on AWS Lambda offers several advantages:

  • Serverless Architecture: No need to manage servers or worry about uptime
  • Cost-Effective: Pay only for the compute time you use
  • Auto-Scaling: Automatically handle varying workloads
  • Easy Integration: Works well with other AWS services

However, there are some challenges to consider:

  • Lambda's execution time limits (up to 15 minutes)
  • Memory constraints (up to 10GB)
  • Cold starts affecting performance
  • Chrome binary compatibility issues

Setting Up Puppeteer on AWS Lambda

We've created a boilerplate repository that handles these challenges for you. Let's walk through the setup process:

Prerequisites

  1. Node.js 18.x (recommended)
  2. AWS Account with Lambda and S3 access
  3. AWS CLI configured for local deployment

Local Development Setup

First, clone the repository and set up your local environment:

# Install Node.js 18
nvm install 18
nvm use 18

# Install dependencies
npm install

# Create environment file
echo "SECRET=your-secret-key-here" > .env

# Run locally
node index.js
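
For orientation, here is a minimal sketch of what the index.js handler can look like. This is not the boilerplate's exact code: the response shape and the title extraction are illustrative, but the combination of puppeteer-extra, the stealth plugin, and @sparticuz/chromium is the pattern the boilerplate relies on.

// index.js - a minimal sketch; the actual boilerplate handler may differ
require('dotenv').config();
const chromium = require('@sparticuz/chromium');
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');

// Register the stealth plugin before launching the browser
puppeteer.use(StealthPlugin());

exports.handler = async (event) => {
  // Reject requests that don't carry the shared secret header
  if ((event.headers || {}).secret !== process.env.SECRET) {
    return { statusCode: 401, body: 'Unauthorized' };
  }

  const { url } = JSON.parse(event.body || '{}');
  if (!url) {
    return { statusCode: 400, body: 'Missing "url" in request body' };
  }

  let browser;
  try {
    // @sparticuz/chromium resolves a Lambda-compatible Chromium binary
    browser = await puppeteer.launch({
      args: chromium.args,
      defaultViewport: chromium.defaultViewport,
      executablePath: await chromium.executablePath(),
      headless: chromium.headless,
    });
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'networkidle2' });
    const title = await page.title();
    return { statusCode: 200, body: JSON.stringify({ title }) };
  } finally {
    // Always close the browser so the Lambda sandbox can be reused cleanly
    if (browser) await browser.close();
  }
};

Note that @sparticuz/chromium ships a binary built for Lambda; for local runs you'll typically fall back to the full puppeteer package's bundled Chrome, which is why both puppeteer and puppeteer-core appear in the dependency list later in this post.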

AWS Configuration

  1. Create an S3 bucket for your Lambda deployment package
  2. Create a Lambda function with these recommended settings (a CLI equivalent follows the list):
    • Runtime: Node.js 18.x
    • Memory: 1024 MB
    • Timeout: 30 seconds
    • Architecture: x86_64
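
If you prefer the AWS CLI, the same function can be created in one command. The function name, role ARN, and bucket below are placeholders for your own values:

aws lambda create-function \
  --function-name your-function-name \
  --runtime nodejs18.x \
  --memory-size 1024 \
  --timeout 30 \
  --architectures x86_64 \
  --handler index.handler \
  --role arn:aws:iam::123456789012:role/your-lambda-execution-role \
  --code S3Bucket=your-bucket-name,S3Key=lambda.zip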

Deployment Options

Manual Deployment

# Create deployment package
zip -r lambda.zip index.js node_modules

# Upload to S3
aws s3 cp lambda.zip s3://your-bucket-name/lambda.zip

Then update your Lambda function through the AWS Console (or via the CLI, shown after these steps):

  1. Go to AWS Lambda Console
  2. Select your function
  3. Go to Code tab
  4. Click "Upload from" -> "Amazon S3 location"
  5. Paste the S3 URL of your uploaded zip file
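
Alternatively, point the function at the uploaded package straight from the CLI:

aws lambda update-function-code \
  --function-name your-function-name \
  --s3-bucket your-bucket-name \
  --s3-key lambda.zip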

Automated Deployment with GitHub Actions

The boilerplate includes a GitHub Actions workflow for automated deployment (a sketch of its shape follows the steps). To set it up:

  1. Add these secrets to your GitHub repository:

    • AWS_ACCESS_KEY_ID
    • AWS_SECRET_ACCESS_KEY
  2. Update the workflow file (.github/workflows/main.yml) with your values:

    • Replace {{your-bucket-name}} with your S3 bucket name
    • Replace {{your-function-name}} with your Lambda function name
  3. Push to main to trigger deployment
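
The workflow file in the boilerplate may differ in its details, but a minimal deployment workflow along these lines would work; the region and resource names are placeholders:

name: Deploy to Lambda

on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 18
      - run: npm ci
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1
      - run: zip -r lambda.zip index.js node_modules
      - run: aws s3 cp lambda.zip s3://your-bucket-name/lambda.zip
      - run: |
          aws lambda update-function-code \
            --function-name your-function-name \
            --s3-bucket your-bucket-name \
            --s3-key lambda.zip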

Using the Lambda Function

The function accepts POST requests with this structure:

{
  "url": "https://example.com"
}

Required header:

secret: your-secret-key
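
How you call the function depends on how it's exposed. Assuming a Lambda function URL (the endpoint here is a placeholder), a test request looks like:

curl -X POST \
  -H "Content-Type: application/json" \
  -H "secret: your-secret-key" \
  -d '{"url": "https://example.com"}' \
  https://your-url-id.lambda-url.us-east-1.on.aws/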

Key Features of the Boilerplate

  1. Stealth Mode: Uses puppeteer-extra-plugin-stealth to avoid detection
  2. AWS Compatibility: Uses @sparticuz/chromium for Lambda compatibility
  3. Security: Secret key authentication
  4. Automated Deployment: GitHub Actions workflow included

Dependencies

The boilerplate uses these key dependencies:

  • @sparticuz/chromium: ^123.0.1
  • puppeteer-extra: ^3.3.4
  • puppeteer-core: 19.6
  • puppeteer-extra-plugin-stealth: ^2.11.1
  • puppeteer: ^21.5.0
  • dotenv: ^16.4.5

Alternative Solution: CaptureKit

While running Puppeteer on AWS Lambda is powerful, it requires significant maintenance and handling of edge cases. If you're looking for a managed solution that handles all the infrastructure and maintenance, consider using CaptureKit. It provides three powerful APIs in one platform:

Screenshot API

  • Reliable screenshot capture with no infrastructure management
  • Full-page screenshots with lazy loading support
  • Built-in ad and cookie banner blocking
  • Multiple output formats (PNG, WebP, JPEG, PDF)
  • Direct S3 upload integration

Content Extraction API

  • Clean, structured HTML extraction
  • Metadata parsing (title, description, OpenGraph & Schema data)
  • Link scraping (internal and external)
  • Consistent data without maintenance headaches
  • Perfect for data pipelines and web scraping

AI Analysis API

  • Instant webpage summarization
  • Key insights extraction
  • AI-powered content analysis
  • Scale your web research process
  • Focus on creating content, not extracting it

Every CaptureKit API is:

  • Developer-first, with instant access
  • Available on a free tier, no credit card required
  • Backed by lightning-fast support
  • Built for production use cases

Best Practices and Tips

  1. Memory Management

    • Monitor Lambda memory usage
    • Adjust memory allocation based on your needs
    • Clean up resources properly
  2. Performance Optimization

    • Use Lambda layers for dependencies
    • Implement connection pooling
    • Cache frequently accessed data
  3. Error Handling

    • Implement proper error logging
    • Set up CloudWatch alarms
    • Handle timeouts gracefully (see the sketch after this list)
  4. Security

    • Never commit AWS credentials
    • Use environment variables for secrets
    • Implement proper IAM roles
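
As a concrete illustration of the timeout point, a handler can derive its navigation budget from however much execution time Lambda has left. scrapeWithDeadline is a hypothetical helper, but getRemainingTimeInMillis is part of the standard Lambda context object:

// Hypothetical helper: budget page.goto against the remaining Lambda time
async function scrapeWithDeadline(page, url, context) {
  // Leave a 5-second buffer to close the browser before Lambda times out
  const budget = context.getRemainingTimeInMillis() - 5000;
  try {
    await page.goto(url, { waitUntil: 'networkidle2', timeout: budget });
  } catch (err) {
    // console.error output lands in CloudWatch Logs, where alarms can fire
    console.error('Navigation failed', { url, message: err.message });
    throw err;
  }
}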

Conclusion

Running Puppeteer on AWS Lambda is a powerful solution for serverless web scraping, but it requires careful setup and maintenance. The provided boilerplate handles many common challenges and provides a solid foundation for your projects.

For those who want to focus on their core business logic without managing infrastructure, CaptureKit offers a comprehensive solution that handles all the complexities of web scraping and content extraction.

Choose the approach that best fits your needs:

  • Use the Puppeteer Lambda boilerplate if you need full control and customization
  • Use CaptureKit if you want a managed solution with additional features