π Process YouTube Transcripts with Apify, OpenAI & Pinecone Database
Go to WorkflowDescription
π YouTube Transcript Indexing Backend for Pinecone π₯πΎ
This tutorial explains how to build the backend workflow in n8n that indexes YouTube video transcripts into a Pinecone vector database. Note: This workflow handles the processing and indexing of transcripts onlyβthe retrieval agent (which searches these embeddings) is implemented separately.
π Workflow Overview
This backend workflow performs the following tasks:
Fetch Video Records from Airtable π₯
Retrieves video URLs and related metadata.
Scrape YouTube Transcripts Using Apify π¬
Triggers an Apify actor to scrape transcripts with timestamps from each video.
Update Airtable with Transcript Data π
Stores the fetched transcript JSON back in Airtable linked via video ID.
Process & Chunk Transcripts βοΈ
Parses the transcript JSON, converts "mm:ss" timestamps to seconds, and groups entries into meaningful chunks. Each chunk is enriched with metadataβsuch as video title, description, start/end timestamps, and a direct URL linking to that video moment.
Generate Embeddings & Index in Pinecone πΎ
Uses OpenAI to create vector embeddings for each transcript chunk and indexes them in Pinecone. This enables efficient semantic searches later by a separate retrieval agent.
π§ Step-by-Step Guide
Step 1: Retrieve Video Records from Airtable π₯
Airtable Search Node:**
Setup: Configure the node to fetch video records (with essential fields like url and metadata) from your Airtable base.
Loop Over Items:**
Use a SplitInBatches node to process each video record individually.
Step 2: Scrape YouTube Transcripts Using Apify π¬
Trigger Apify Actor:**
HTTP Request Node ("Apify NinjaPost"):
Method: POST
Endpoint: https://api.apify.com/v2/acts/topaz_sharingan~youtube-transcript-scraper-1/runs?token=<YOUR_TOKEN>
Payload Example:
{
"includeTimestamps": "Yes",
"startUrls": ["{{ $json.url }}"]
}
Purpose: Initiates transcript scraping for each video URL.
Wait for Processing:**
Wait Node:
Duration: Approximately 1 minute to allow Apify to generate the transcript.
Retrieve Transcript Data:**
HTTP Request Node ("Get JSON TS"):
Method: GET
Endpoint: https://api.apify.com/v2/acts/topaz_sharingan~youtube-transcript-scraper-1/runs/last/dataset/items?token=<YOUR_TOKEN>
Step 3: Update Airtable with Transcript Data π
Format Transcript Data:**
Code Node ("Code"):
Task: Convert the fetched transcript JSON into a formatted string.
const jsonObject = items[0].json;
const jsonString = JSON.stringify(jsonObject, null, 2);
return { json: { stringifiedJson: jsonString } };
Extract the Video ID:**
Set Node ("Edit Fields"):
Expression:
{{$json.url.split('v=')[1].split('&')[0]}}
Update Airtable Record:**
Airtable Update Node ("Airtable1"):
Updates:
ts: Stores the transcript string.
videoid: Uses the extracted video ID to match the record.
Step 4: Process Transcripts into Semantic Chunks βοΈ
Retrieve Updated Records:**
Airtable Search Node ("Airtable2"):
Purpose: Fetch records that now contain transcript data.
Parse and Chunk Transcripts:**
Code Node ("Code4"):
Functionality:
Parses transcript JSON.
Converts "mm:ss" timestamps to seconds.
Groups transcript entries into chunks based on a 3-second gap.
Creates an object for each chunk that includes:
Text: The transcript segment.
Video Metadata: Video ID, title, description, published date, thumbnail.
Chunk Details: Start and end timestamps.
Direct URL: A link to the exact moment in the video (e.g., https://youtube.com/watch?v=VIDEOID&t=XXs).
Enrich & Split Text:**
Default Data Loader Node:
Attaches additional metadata (e.g., video title, description) to each chunk.
Recursive Character Text Splitter Node:
Settings: Typically set to 500-character chunks with a 50-character overlap.
Purpose: Ensures long transcript texts are broken into manageable segments for embedding.
Step 5: Generate Embeddings & Index in Pinecone πΎ
Generate Embeddings:**
Embeddings OpenAI Node:
Task: Convert each transcript chunk into a vector embedding.
Tip: Adjust the batch size (e.g., 512) based on your data volume.
Index in Pinecone:**
Pinecone Vector Store Node:
Configuration:
Index: Specify your Pinecone index (e.g., "videos").
Namespace: Use a dedicated namespace (e.g., "transcripts").
Outcome: Each enriched transcript chunk is stored in Pinecone, ready for semantic retrieval by a separate retrieval agent.
π Final Thoughts
This backend workflow is dedicated to processing and indexing YouTube video transcripts so that a separate retrieval agent can perform efficient semantic searches. With this setup:
Transcripts Are Indexed:**
Chunks of transcripts are enriched with metadata and stored as vector embeddings.
Instant Topic Retrieval:**
A retrieval agent (implemented separately) can later query Pinecone to find the exact moment in a video where a topic is discussed, thanks to the direct URL and metadata stored with each chunk.
Scalable & Modular:**
The separation between indexing and retrieval allows for easy updates and scalability.
Happy automating and enjoy building powerful search capabilities with your YouTube content! π