Transform Websites into a Conversational Knowledge Base with OpenAI RAG & Supabase
Overview
This workflow combines recursive web scraping with Retrieval-Augmented Generation (RAG) to turn a website into a queryable knowledge base. It crawls the target site, extracts its content, and indexes that content in a vector database so an AI chat model can answer questions about it.
How the system works
Intelligent Web Scraping and RAG Pipeline
Recursive Web Scraper - Automatically crawls every accessible page of a target website
Data Extraction - Collects text, metadata, emails, links, and PDF documents
Supabase Integration - Stores content in PostgreSQL tables for scalability
RAG Vectorization - Generates embeddings and stores them for semantic search
AI Query Layer - Connects embeddings to an AI chat engine with citations
Error Handling - Automatically retries failed URLs
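
As an illustration of the data extraction stage listed above, here is a minimal sketch of pulling email addresses out of scraped text. The regular expression and the `extract_emails` name are assumptions made for this example, not the workflow's actual node configuration, and the pattern is deliberately simple rather than RFC-complete.

```python
import re

# Simplified pattern (an assumption for this sketch): local part, "@",
# then a domain with at least one dot-separated label.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def extract_emails(text: str) -> list[str]:
    """Return unique email-like strings, lowercased, in order of first appearance."""
    return list(dict.fromkeys(match.lower() for match in EMAIL_RE.findall(text)))
```

Lowercasing before deduplication keeps the same address from being stored twice when a site writes it with different capitalization.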
Setup Instructions
Estimated setup time: 30-45 minutes
Prerequisites
Self-hosted n8n instance (v0.200.0 or higher)
Supabase account and project (PostgreSQL enabled)
API key for a chat model (OpenAI, Gemini, or Claude) and for an embeddings model
Optional: External vector database (Pinecone, Qdrant)
Detailed configuration steps
Step 1: Supabase configuration
Project creation: New Supabase project with PostgreSQL enabled
Generating credentials: API keys (anon key and service_role key) and connection string
Security configuration: RLS policies according to your access requirements
Step 2: Connect Supabase to n8n
Configure Supabase node: Add credentials to n8n Credentials
Test connection: Verify with a simple query
Configure PostgreSQL: Direct connection for advanced operations
Step 3: Preparing the database
Main tables:
pages: URLs, content, metadata, scraping statuses
documents: Extracted and processed PDF files
embeddings: Vectors for semantic search
links: Link graph for navigation
Management functions: Scripts to reactivate failed URLs and manage retries
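
The logic of a management function that reactivates failed URLs could look roughly like the sketch below. It works on in-memory records rather than a real Supabase table, and the `status` values, the `retry_count` field, and the `max_retries` limit are all assumptions introduced for the example.

```python
from dataclasses import dataclass


@dataclass
class PageRecord:
    """Hypothetical in-memory stand-in for one row of the pages table."""
    url: str
    status: str = "pending"   # assumed values: pending, done, failed
    retry_count: int = 0


def reactivate_failed(pages: list[PageRecord], max_retries: int = 3) -> list[str]:
    """Reset failed pages to 'pending' unless they have used up their retries.

    Returns the URLs that were put back in the queue.
    """
    requeued = []
    for page in pages:
        if page.status == "failed" and page.retry_count < max_retries:
            page.status = "pending"
            page.retry_count += 1
            requeued.append(page.url)
    return requeued
```

Capping retries per URL keeps permanently broken pages from being requeued on every run.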
Step 4: Configuring automation
Recursive scraper: Starting URL, crawling depth, CSS selectors
HTTP extraction: User-Agent, headers, timeouts, and retry policies
Supabase storage: Batch insertion, data validation, duplicate management
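
To make the starting URL and crawling depth settings concrete, here is a minimal sketch of a depth-limited crawl written with the Python standard library. It is not the workflow's implementation: the recursion is expressed as a breadth-first queue, the same-host rule is an assumption, and the hypothetical `fetch` callable (URL in, HTML out) is injected so the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from typing import Callable
from urllib.parse import urldefrag, urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects href values from <a> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def crawl(start_url: str, fetch: Callable[[str], str], max_depth: int = 2) -> dict[str, str]:
    """Breadth-first crawl restricted to the start URL's host.

    Returns a mapping of visited URL to the HTML that `fetch` returned.
    """
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    pages: dict[str, str] = {}
    while queue:
        url, depth = queue.popleft()
        html = fetch(url)
        pages[url] = html
        if depth >= max_depth:
            continue  # store the page but do not follow its links
        parser = LinkParser()
        parser.feed(html)
        for href in parser.hrefs:
            # Resolve relative links and drop #fragments to avoid duplicates.
            link, _fragment = urldefrag(urljoin(url, href))
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return pages
```

The `seen` set plays the role of duplicate management: each URL is queued at most once, however many pages link to it.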
Step 5: Error handling and re-executions
Failure monitoring: Automatic detection of failed URLs
Manual triggers: Selective re-execution by domain or date
Recovery sub-workflows: Retry logic with exponential backoff
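
Retry logic with exponential backoff can be sketched as follows. The function name, the default of four attempts, and the one-second base delay are assumptions for illustration; the `sleep` parameter is injectable only so the example can be exercised without actually waiting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation` until it succeeds, doubling the wait after each failure."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
    raise ValueError("max_attempts must be at least 1")
```

With the defaults, the waits between attempts are 1, 2, and 4 seconds, which gives a briefly overloaded server time to recover.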
Step 6: RAG processing
Embedding generation: Text-embedding models with intelligent chunking
Vector storage: Supabase pgvector or external database
Conversational engine: Connection to chat models with source citations
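
The retrieval step behind source citations can be illustrated with the sketch below. It is a toy: `toy_embed` is a bag-of-words stand-in for a real text-embedding model, the chunks live in a Python list instead of pgvector, and the prompt wording is an assumption. What it shows is the general shape of the process: rank chunks by cosine similarity, then pass them to the chat model with numbered sources.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question: str, chunks: list[dict], top_k: int = 2) -> list[dict]:
    """Return the top_k chunks most similar to the question, keeping their source URL."""
    query_vec = toy_embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, toy_embed(c["text"])), reverse=True)
    return ranked[:top_k]


def build_prompt(question: str, hits: list[dict]) -> str:
    """Number each retrieved chunk so the chat model can cite it as [1], [2], ..."""
    context = "\n".join(
        f"[{i}] {hit['text']} (source: {hit['url']})" for i, hit in enumerate(hits, 1)
    )
    return f"Answer using only the sources below and cite them by number.\n{context}\nQuestion: {question}"
```

Keeping the URL next to every chunk is what makes citations possible: the model only ever sees text that can be traced back to a page.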
Data structure
Main Supabase tables
| Table | Content | Usage |
|-------|---------|-------|
| pages | URLs, HTML content, metadata | Main storage for scraped content |
| documents | PDF files, extracted text | Downloaded and processed documents |
| embeddings | Vectors, text chunks | Semantic search and RAG |
| links | Link graph, navigation | Relationships between pages |
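
One way to picture the rows of these four tables is as typed records. The sketch below is only a reading aid: every field name is an assumption derived from the table descriptions above, not the workflow's actual column names.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    url: str
    html: str = ""
    metadata: dict = field(default_factory=dict)
    status: str = "pending"   # assumed scraping status values


@dataclass
class Document:
    page_url: str             # page where the PDF was found
    file_url: str
    extracted_text: str = ""


@dataclass
class Embedding:
    source_url: str           # page or document the chunk came from
    chunk_text: str
    vector: list[float]


@dataclass
class Link:
    from_url: str
    to_url: str


def outgoing_links(links: list[Link], url: str) -> list[str]:
    """Walk the link graph one step: every URL that `url` points to."""
    return [link.to_url for link in links if link.from_url == url]
```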
Use cases
Business and enterprise
Competitive intelligence with conversational querying
Market research from complex web domains
Compliance monitoring and regulatory watch
Research and academia
Literature extraction with semantic search
Building datasets from fragmented sources
Legal and technical
Scraping legal repositories with intelligent queries
Technical documentation transformed into a conversational assistant
Key features
Advanced scraping
Recursive crawling with automatic link discovery
Multi-format extraction (HTML, PDF, emails)
Intelligent error handling and retry
Intelligent RAG
Contextual embeddings for semantic search
Multi-document queries with citations
Intuitive conversational interface
Performance and scalability
Processing of thousands of pages per execution
Embedding cache for fast responses
Scalable architecture with Supabase
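
An embedding cache can be as simple as memoizing on a hash of the chunk text, so unchanged content is never embedded twice. The class below is a sketch under that assumption; the `EmbeddingCache` name and the injected `embed` callable are hypothetical, and a real deployment would presumably persist the store rather than keep it in memory.

```python
import hashlib
from typing import Callable


class EmbeddingCache:
    """Memoizes embeddings by a SHA-256 hash of the chunk text."""

    def __init__(self, embed: Callable[[str], list[float]]) -> None:
        self._embed = embed
        self._store: dict[str, list[float]] = {}
        self.misses = 0  # number of times the embedding model was actually called

    def get(self, text: str) -> list[float]:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed(text)
        return self._store[key]
```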
Technical Architecture
Main flow: Target URL → Recursive scraping → Content extraction → Supabase storage → Vectorization → Conversational interface
Supported types: HTML pages, PDF documents, metadata, links, emails
Performance specifications
Capacity: 10,000+ pages per run
Response time: < 5 seconds for RAG queries
Accuracy: >90% relevance for specific domains
Scalability: Distributed architecture via Supabase
Advanced configuration
Customization
Crawling depth and scope controls
Domain and content type filters
Chunking settings to optimize RAG
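
Chunking settings usually come down to two numbers: how large each chunk is and how much neighboring chunks overlap. The sketch below splits on whitespace and counts words, which is an assumption made for simplicity; the workflow may count characters or tokens instead, and the default values are illustrative only.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into windows of `chunk_size` words that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

Overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.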
Monitoring
Real-time monitoring in Supabase
Cost and performance metrics
Detailed conversation logs