Sitemap Page Extractor: Discover, Clean, and Save Website URLs to Google Sheets

Go to Workflow
0 views
Built by Incrementors Incrementors
Created on June 05, 2026

Description

Description:
Automatically extracts all page URLs from website sitemaps, filters out unwanted sitemap links, and saves clean URLs to Google Sheets for SEO analysis and reporting.

How It Works:
This workflow automates the process of discovering and extracting all page URLs from a website's sitemap structure. Here's how it works step-by-step:

Step 1: URL Input
The workflow starts when you submit a website URL through a simple form interface.

Step 2: Sitemap Discovery
The system automatically generates and tests multiple possible sitemap URLs including /sitemap.xml, /sitemap_index.xml, /robots.txt, and other common variations.

Step 3: Valid Sitemap Identification
It sends HTTP requests to each potential sitemap URL and filters out empty or invalid responses, keeping only accessible sitemaps.

Step 4: Nested Sitemap Processing
For sitemap index files, the workflow extracts all nested sitemap URLs and processes each one individually to ensure complete coverage.

Step 5: Page URL Extraction
From each valid sitemap, it parses the XML content and extracts all individual page URLs using both XML <loc> tags and HTML links.

Step 6: URL Filtering
The system removes any URLs containing "sitemap" to ensure only actual content pages (like product, service, or blog pages) are retained.

Step 7: Google Sheets Integration
Finally, all clean page URLs are automatically saved to a Google Sheets document with duplicate prevention for easy analysis and reporting.

Setup Steps:

Estimated Setup Time: 10-15 minutes

1. Import the Workflow:
Import the provided JSON file into your n8n instance.

2. Configure Google Sheets Integration:
Set up Google Sheets OAuth2 credentials in n8n
Create a new Google Sheet or use an existing one
Update the "Save Page URLs to Sheet" node with your Google Sheet URL
Ensure your sheet has a tab named "Your sheet tab name" with a column header "Column name"


3. Test the Workflow:

Activate the workflow in n8n
Use the form trigger URL to submit a test website URL
Verify that URLs are being extracted and saved to your Google Sheet

4. Customize (Optional):

Modify the sitemap URL patterns in the "Build sitemap URLs" node if needed
Adjust the filtering criteria in the "Exclude the Sitemap URLs" node
Update the Google Sheets column mapping as required

Important Notes:

Ensure your Google Sheets credentials have proper read/write permissions
The workflow handles both XML sitemaps and robots.txt sitemap references
Duplicate URLs are automatically prevented when saving to Google Sheets
The workflow continues processing even if some sitemap URLs are inaccessible

Need Help?

For technical support or questions about this workflow:
✉️ [email protected]
or
fill out this form: Contact Us

Nodes Used (3)

Code
n8n-nodes-base.code
Google Sheets
n8n-nodes-base.googleSheets
HTTP Request
n8n-nodes-base.httpRequest