Process Documents & Build Semantic Search with OpenAI, Gemini & Qdrant
🎯 Overview
This n8n workflow ingests documents from two sources (Google Drive and web forms) into a Qdrant vector database for semantic search. It handles batch processing, document analysis, embedding generation, and vector storage, with error handling and execution tracking throughout.
🚀 Key Features
Dual Input Sources: Accepts files from both Google Drive folders and web form uploads
Batch Processing: Processes files one at a time to prevent memory issues and ensure reliability
AI-Powered Analysis: Uses Google Gemini to extract metadata and understand document context
Vector Embeddings: Generates OpenAI embeddings for semantic search capabilities
Automated Cleanup: Optionally deletes processed files from Google Drive (configurable)
Loop Processing: Handles multiple files efficiently with Split In Batches nodes
Interactive Chat Interface: Built-in chatbot for testing semantic search queries against indexed documents
📋 Use Cases
Knowledge Base Creation: Build searchable document repositories for organizations
Document Compliance: Process and index legal/regulatory documents (like Fair Work documents)
Content Management: Automatically categorize and store uploaded documents
Research Libraries: Create semantic search capabilities for research papers or reports
Customer Support: Enable instant answers to policy and documentation questions via chat interface
🔧 Workflow Components
Input Methods
Google Drive Integration
Monitors a specific folder for new files
Processes existing files in batch mode
Supports automatic file conversion to PDF
Web Form Upload
Public-facing form for document submission
Accepts PDF, DOCX, DOC, and CSV files
Processes multiple file uploads in a single submission
Processing Pipeline
File Splitting: Separates multiple uploads into individual items
Document Analysis: Google Gemini extracts document understanding
Text Extraction: Converts documents to plain text
Embedding Generation: Creates vector embeddings via OpenAI
Vector Storage: Inserts documents with embeddings into Qdrant
Loop Control: Manages batch processing with proper state handling
Key Nodes
Split In Batches: Processes files one at a time with reset: false to maintain state
Google Gemini: Analyzes documents for context and metadata
Langchain Vector Store: Handles Qdrant insertion with embeddings
HTTP Request: Direct API calls for custom operations
Chat Interface: Interactive chatbot for testing vector search queries
🛠️ Technical Implementation
Batch Processing Logic
The workflow loops over files with a Split In Batches node:
Split In Batches with batchSize: 1 ensures single-file processing
reset: false maintains loop state across iterations
Loop continues until all files are processed
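The loop described above can be pictured, outside n8n, as a small stateful batcher. The sketch below is not n8n's implementation: `BatchLoop`, `next_batch`, and `process_all` are hypothetical names, and the code only mirrors the described behaviour (batch size 1, with the cursor kept between calls instead of being reset).

```python
from typing import Generic, List, Sequence, TypeVar

T = TypeVar("T")


class BatchLoop(Generic[T]):
    """Hypothetical stand-in for a Split In Batches node (not n8n's code)."""

    def __init__(self, items: Sequence[T], batch_size: int = 1) -> None:
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        self._items = list(items)
        self._batch_size = batch_size
        self._cursor = 0  # kept between calls, analogous to reset: false

    def next_batch(self) -> List[T]:
        """Return the next batch, or an empty list once all items are used."""
        batch = self._items[self._cursor:self._cursor + self._batch_size]
        self._cursor += len(batch)
        return batch


def process_all(files: Sequence[str]) -> List[str]:
    """Loop back into the batcher until no files remain."""
    loop = BatchLoop(files, batch_size=1)
    processed: List[str] = []
    while True:
        batch = loop.next_batch()
        if not batch:
            break  # the "done" branch: every file has been handled
        for name in batch:
            processed.append(name)  # analysis and insertion would happen here
    return processed
```

If the cursor were reset on every pass, the loop would hand out the first file again each time, which is why the state has to survive across iterations.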
Error Handling
Nodes set continueOnFail where appropriate, so a single failing file does not stop the run
Execution logs are preserved for debugging
File deletion only occurs after successful insertion
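The ordering rule above (delete only after a successful insert, keep going on failure) can be sketched in a few lines. This is an illustration under assumptions, not the workflow's node logic: `ingest_then_delete` and its `insert` and `delete` callbacks are hypothetical placeholders for the Qdrant insertion and the Google Drive delete step.

```python
from typing import Callable, Dict, List


def ingest_then_delete(
    file_ids: List[str],
    insert: Callable[[str], None],
    delete: Callable[[str], None],
) -> Dict[str, List[str]]:
    """Insert each file; delete the source only if its insert succeeded."""
    report: Dict[str, List[str]] = {"deleted": [], "kept": []}
    for file_id in file_ids:
        try:
            insert(file_id)
        except Exception as exc:  # continue-on-fail: record it and move on
            print(f"insert failed for {file_id}: {exc}")
            report["kept"].append(file_id)
            continue
        delete(file_id)  # reached only when insert() did not raise
        report["deleted"].append(file_id)
    return report
```

Files listed under `kept` stay in the source folder, so they can be retried in a later run.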
Data Flow
Form Upload → Split Files → Batch Loop → Analyze → Insert → Loop Back
Google Drive → List Files → Batch Loop → Download → Analyze → Insert → Delete → Loop Back
📊 Performance Considerations
Processing Time: ~20-30 seconds per file
Batch Size: Set to 1 for reliability (configurable)
Memory Usage: Optimized for files under 10MB
API Costs: Uses OpenAI embeddings (text-embedding-3-large model)
🔐 Required Credentials
Google Drive OAuth2: For file access and management
OpenAI API: For embedding generation
Qdrant API: For vector database operations
Google Gemini API: For document analysis
💡 Implementation Tips
Start Small: Test with a few files before processing large batches
Monitor Costs: Track OpenAI API usage for embedding generation
Backup First: Consider archiving instead of deleting processed files
Check Collections: Ensure Qdrant collection exists before running
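For the "Check Collections" tip, a pre-flight check against Qdrant's REST API might look like the sketch below. It assumes the standard `GET /collections/{name}` and `PUT /collections/{name}` endpoints, cosine distance, and the default output sizes of the two OpenAI models (3072 and 1536 dimensions). The function names, base URL, and collection name are hypothetical, and the workflow itself may configure its collection differently.

```python
import json
import urllib.error
import urllib.request
from typing import Callable, Dict, Optional

# Default output sizes of the two OpenAI embedding models named in this document.
EMBEDDING_DIMS: Dict[str, int] = {
    "text-embedding-3-large": 3072,
    "text-embedding-3-small": 1536,
}


def collection_request(
    base_url: str, name: str, model: str, api_key: Optional[str] = None
) -> urllib.request.Request:
    """Build the PUT request that creates a collection sized for `model`."""
    body = {"vectors": {"size": EMBEDDING_DIMS[model], "distance": "Cosine"}}
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["api-key"] = api_key
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/collections/{name}",
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="PUT",
    )


def ensure_collection(
    base_url: str,
    name: str,
    model: str,
    api_key: Optional[str] = None,
    opener: Callable = urllib.request.urlopen,  # injectable for offline testing
) -> bool:
    """Return True if the collection had to be created, False if it existed."""
    url = f"{base_url.rstrip('/')}/collections/{name}"
    headers = {"api-key": api_key} if api_key else {}
    try:
        with opener(urllib.request.Request(url, headers=headers, method="GET")):
            return False  # a successful GET means the collection already exists
    except urllib.error.HTTPError as err:
        if err.code != 404:
            raise  # auth or server errors should not be hidden
    with opener(collection_request(base_url, name, model, api_key)):
        return True
```

The vector size must match the embedding model, so switching models later means creating a new collection and re-indexing rather than reusing the old one.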
🎨 Customization Options
Change Embedding Model: Switch to text-embedding-3-small for cost savings
Adjust Chunk Size: Modify text splitting parameters for different document types
Add Metadata: Extend the Gemini prompt to extract specific fields
Archive vs Delete: Replace delete operation with move to "processed" folder
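To make the "Adjust Chunk Size" option concrete, here is a minimal sketch of fixed-size character chunking with overlap. It is not the text splitter the workflow uses: `chunk_text` is a hypothetical helper, and the default values of 1000 and 200 characters are illustrative assumptions only.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> List[str]:
    """Split text into chunk_size-character windows that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks
```

Smaller chunks tend to suit short, dense clauses such as legal agreements, while larger chunks keep more context together for narrative reports. Either way, each chunk becomes one embedding call and one stored vector.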
📈 Real-World Application
This workflow was developed to process business documents and legal agreements, making them searchable through semantic queries. It is particularly useful for organizations that need quick access to large volumes of regulatory documentation.
Chat Interface Testing
The integrated chatbot interface allows users to:
Query processed documents using natural language
Test semantic search capabilities in real-time
Verify document indexing and retrieval accuracy
Ask questions about specific topics (e.g., "What are the pay rates for junior employees?")
Get instant AI-powered responses based on the indexed content
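Behind a chat query like the ones above, semantic search comes down to comparing the query's embedding with the stored embeddings. The toy sketch below shows that ranking step with cosine similarity. In the workflow this is done by Qdrant on OpenAI embeddings, so the helper names here are hypothetical and any vectors passed in are stand-ins for real embeddings.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float], index: Dict[str, Sequence[float]], k: int = 3
) -> List[Tuple[str, float]]:
    """Return the k stored entries most similar to the query, best first."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

The chatbot then passes the text of the top-ranked chunks to the language model as context for its answer.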
🌟 Benefits
Automation: Eliminates manual document processing
Scalability: Handles individual files or bulk uploads
Intelligence: AI-powered understanding of document content
Flexibility: Multiple input sources and processing options
Reliability: Robust error handling and state management
👨💻 About the Creator
Jeremy Dawes is the CEO of Jezweb, specializing in AI and automation deployment solutions. This workflow represents practical, production-ready automation that solves real business challenges while maintaining simplicity and reliability.
📝 Notes
The workflow intelligently handles the n8n form upload pattern where multiple files create a single item with multiple binary properties (Files_0, Files_1, etc.)
The Split In Batches pattern with reset: false is crucial for proper loop execution
Direct API integration provides more control than a pure LangChain implementation
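The form upload pattern in the first note (one item carrying Files_0, Files_1, and so on) can be split into one item per file along the lines of the sketch below. It models an n8n item as a plain dictionary with `json` and `binary` keys. The function name, the output key `data`, and the added `fileIndex` field are hypothetical choices for illustration, not the workflow's actual node code.

```python
import re
from typing import Any, Dict, List, Tuple

FILE_KEY = re.compile(r"^Files_(\d+)$")  # matches Files_0, Files_1, ...


def split_form_item(item: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Turn one item with several Files_N binaries into one item per file."""
    binaries: Dict[str, Any] = item.get("binary", {})
    numbered: List[Tuple[int, str]] = []
    for key in binaries:
        match = FILE_KEY.match(key)
        if match:
            numbered.append((int(match.group(1)), key))
    results: List[Dict[str, Any]] = []
    for index, key in sorted(numbered):  # numeric order: Files_2 before Files_10
        results.append(
            {
                "json": {**item.get("json", {}), "fileIndex": index},
                "binary": {"data": binaries[key]},  # one file per output item
            }
        )
    return results
```

Once each file sits in its own item under a single, predictable key, the downstream batch loop can treat form uploads and Google Drive files the same way.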
This workflow demonstrates practical automation that bridges document management with modern AI capabilities, creating intelligent document processing systems that scale with your needs.