Uploading Documents¶

This guide covers how to add content to your RAG Chatbot knowledge base through file uploads and media imports.

Overview¶

Documents are the foundation of your chatbot's knowledge. When you upload a document:

The content is extracted and parsed
Text is split into searchable chunks
AI embeddings are generated for semantic search
The document becomes queryable via chat

Supported Formats¶

Documents¶

Format	Extension	Features
PDF	`.pdf`	Text, tables, images (OCR)
Word	`.docx`	Text, tables, formatting
PowerPoint	`.pptx`	Slides, speaker notes
Excel	`.xlsx`	Spreadsheet data
HTML	`.html`, `.htm`	Web page content

Text Files¶

Format	Extension	Description
Plain Text	`.txt`	Simple text files
Markdown	`.md`	Formatted documentation
CSV	`.csv`	Tabular data
JSON	`.json`	Structured data
XML	`.xml`	Structured markup

Media Files¶

Format	Extension	Processing
Images	`.png`, `.jpg`, `.gif`, `.webp`, `.tiff`, `.bmp`	OCR text extraction
Audio	`.wav`, `.mp3`, `.m4a`, `.ogg`, `.flac`	Speech transcription
Subtitles	`.srt`, `.vtt`	Direct text import
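
As a pre-flight check before uploading, a client could map file extensions to the categories in the tables above. This is a minimal sketch: the function name `classify_upload` and the category labels are illustrative and not part of the chatbot's API. Only the extensions listed in this guide are included.

```python
from pathlib import Path
from typing import Optional

# Extensions copied from the Supported Formats tables in this guide.
SUPPORTED = {
    "document": {".pdf", ".docx", ".pptx", ".xlsx", ".html", ".htm"},
    "text": {".txt", ".md", ".csv", ".json", ".xml"},
    "media": {".png", ".jpg", ".gif", ".webp", ".tiff", ".bmp",
              ".wav", ".mp3", ".m4a", ".ogg", ".flac", ".srt", ".vtt"},
}


def classify_upload(filename: str) -> Optional[str]:
    """Return the category for a filename, or None if its extension is not listed."""
    ext = Path(filename).suffix.lower()  # compare case-insensitively
    for category, extensions in SUPPORTED.items():
        if ext in extensions:
            return category
    return None
```

A `None` result suggests the upload would hit the "Unsupported format" case described under Troubleshooting.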

Online Media¶

Supported platforms for URL import:

YouTube (single videos and playlists)
Vimeo
SoundCloud
TED Talks
Twitch VODs
Twitter/X videos
TikTok
Dailymotion
And 1000+ more platforms

Uploading Files¶

Open the chat widget
Click the upload button (↑) in the header
Select the Upload File tab
Drag and drop files onto the drop zone, or click to browse
(Optional) Add more files to the queue
Click Upload & Import

The progress bar shows upload and processing status.

Importing Media URLs¶

Open the chat widget
Click the upload button (↑) in the header
Select the Media URL tab
Paste the video or audio URL
Click Upload & Import

For YouTube videos, the system will:

First attempt to download existing subtitles
Fall back to AI transcription if no subtitles exist
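
The subtitle-first behaviour can be pictured as a simple fallback. The sketch below is conceptual, not the system's actual code: `fetch_subtitles` and `transcribe` are hypothetical callables supplied by the caller, standing in for whatever the backend uses.

```python
from typing import Callable, Optional, Tuple


def get_transcript(
    url: str,
    fetch_subtitles: Callable[[str], Optional[str]],  # hypothetical: returns None if no subtitles
    transcribe: Callable[[str], str],                 # hypothetical: slower AI transcription
) -> Tuple[str, str]:
    """Return (text, source), preferring existing subtitles over transcription."""
    text = fetch_subtitles(url)
    if text:  # subtitles exist and are non-empty
        return text, "subtitles"
    return transcribe(url), "transcription"
```

Trying subtitles first avoids the cost of transcription whenever the platform already provides text.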

Importing Playlists¶

When you paste a YouTube playlist URL:

The system fetches playlist information
Each video is processed sequentially
Progress updates show the current video
Failed videos display retry buttons

Large Playlists

For playlists with many videos, consider importing in batches. Very large playlists may time out.
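
One way to import in batches is to split the list of video URLs yourself and submit one group at a time. The helper below is a sketch under that assumption; the name `batched` and any batch size you pick are illustrative, since this guide does not state a playlist size limit.

```python
from typing import Iterator, List, Sequence


def batched(urls: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most batch_size URLs."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(urls), batch_size):
        yield list(urls[start:start + batch_size])
```

Each batch can then be imported on its own, so a failure or timeout affects only the videos in that group.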

Using the API¶

File Upload via API¶

curl -X POST https://your-domain.com/chatbot/public/upload.php \
  -F "file=@document.pdf" \
  -F 'metadata={"source": "training", "department": "HR"}'

URL Import via API¶

curl -X POST https://your-domain.com/chatbot/public/upload.php \
  -F "url=https://www.youtube.com/watch?v=VIDEO_ID" \
  -F "subtitle_langs=en,es"

With API Key Authentication¶

If `UPLOAD_API_KEY` is configured, send the key in the `X-API-Key` header:

curl -X POST https://your-domain.com/chatbot/public/upload.php \
  -H "X-API-Key: your-api-key" \
  -F "file=@document.pdf"
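
For scripted uploads without curl, the same multipart request can be assembled with Python's standard library. This is a sketch that mirrors the curl examples above (the `file` and `metadata` form fields and the `X-API-Key` header); the function name `build_upload_request` is illustrative, and the request is only built here, not sent.

```python
import json
import uuid
import urllib.request
from typing import Dict, Optional


def build_upload_request(
    endpoint: str,
    filename: str,
    content: bytes,
    metadata: Optional[Dict[str, object]] = None,
    api_key: Optional[str] = None,
) -> urllib.request.Request:
    """Build (but do not send) a multipart/form-data POST like the curl examples."""
    boundary = uuid.uuid4().hex
    parts = []
    if metadata is not None:
        # Optional "metadata" field carrying a JSON object.
        parts.append(
            f'--{boundary}\r\n'
            'Content-Disposition: form-data; name="metadata"\r\n\r\n'
            f'{json.dumps(metadata)}\r\n'.encode("utf-8")
        )
    # The "file" field with the raw file bytes.
    parts.append(
        f'--{boundary}\r\n'
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        'Content-Type: application/octet-stream\r\n\r\n'.encode("utf-8")
        + content + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))  # closing boundary

    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    if api_key:
        headers["X-API-Key"] = api_key  # only needed when UPLOAD_API_KEY is configured
    return urllib.request.Request(
        endpoint, data=b"".join(parts), headers=headers, method="POST"
    )
```

Passing the returned object to `urllib.request.urlopen` would send it. The response format is not described in this guide, so inspect it before relying on any particular field.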

Document Metadata¶

Automatic Metadata¶

The system automatically captures:

Field	Description
`filename`	Original file name
`file_size`	Size in bytes
`word_count`	Number of words
`page_count`	Pages (for documents)
`upload_date`	When imported
`source_url`	For media imports
`duration`	For audio/video

Custom Metadata¶

Add your own metadata during upload:

{
  "department": "Engineering",
  "version": "2.0",
  "author": "John Smith",
  "confidential": false
}

Custom metadata can be used for filtering searches in future queries.
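
This guide does not describe the filter syntax, so the following is only a conceptual sketch of exact-match filtering over metadata like the example above. The function name `matches_filter` is hypothetical.

```python
from typing import Any, Dict


def matches_filter(metadata: Dict[str, Any], required: Dict[str, Any]) -> bool:
    """True if every required key is present in metadata with an equal value."""
    missing = object()  # sentinel so a stored None is not confused with "absent"
    return all(metadata.get(key, missing) == value for key, value in required.items())
```

Keeping custom metadata flat, with simple string, number, or boolean values, makes this kind of matching straightforward.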

Processing Pipeline¶

1. Document Parsing¶

The Docling service extracts content:

Text extraction from PDFs and documents
OCR for scanned documents and images
Table recognition preserves tabular data
Layout analysis maintains document structure

2. Chunking¶

Documents are split into searchable chunks:

Target size: ~500 tokens per chunk
15% overlap between chunks (preserves context)
Respects sentence boundaries
Maintains paragraph coherence
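
To make the chunking rules concrete, here is a minimal sketch of sentence-aware chunking with overlap. It is not the system's implementation: whitespace-separated words stand in for tokens, sentences are split with a simple regular expression, and the name `chunk_text` is illustrative.

```python
import re
from typing import List


def chunk_text(text: str, target_tokens: int = 500, overlap_ratio: float = 0.15) -> List[str]:
    """Pack whole sentences into chunks of roughly target_tokens words.

    Words approximate tokens here; a real tokenizer would count differently.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    overlap_budget = round(target_tokens * overlap_ratio)
    chunks: List[str] = []
    current: List[str] = []  # sentences in the chunk being built
    current_len = 0          # word count of `current`

    for sentence in sentences:
        n = len(sentence.split())
        if current and current_len + n > target_tokens:
            chunks.append(" ".join(current))
            # Carry trailing sentences forward until the overlap budget is used up.
            carried: List[str] = []
            carried_len = 0
            for prev in reversed(current):
                prev_len = len(prev.split())
                if carried_len + prev_len > overlap_budget:
                    break
                carried.insert(0, prev)
                carried_len += prev_len
            current, current_len = carried, carried_len
        current.append(sentence)
        current_len += n

    if current:
        chunks.append(" ".join(current))
    return chunks
```

Because chunks are cut only between sentences, a single very long sentence can push a chunk past the target size. The overlap means a sentence near a boundary appears in both neighbouring chunks, so a query that matches it retrieves context from either side.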

3. Embedding Generation¶

Each chunk gets a vector embedding:

Uses OpenAI's text-embedding-3-small model
1536-dimensional vectors
Enables semantic similarity search
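
Semantic similarity between two embeddings is commonly measured with cosine similarity. This guide does not state which distance metric the system uses, so treat the following as an illustration of the idea rather than the actual scoring code.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # define similarity with a zero vector as 0
    return dot / (norm_a * norm_b)
```

In practice both arguments would be 1536-dimensional vectors, one for the query and one for a stored chunk.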

4. Indexing¶

Chunks are indexed for fast retrieval:

Vector index (HNSW) for semantic search
Full-text index (GIN) for keyword search
Combined in hybrid search for best results
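
This guide does not say how the vector and keyword results are merged. One widely used approach is reciprocal rank fusion (RRF), sketched below as an assumption, not as the system's confirmed method. The function name and the chunk IDs are illustrative; `k = 60` is the constant conventionally used with RRF.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Merge several ranked lists of chunk IDs into one list, best first."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)  # higher ranks contribute more
    # Sort by descending score; break ties by ID for a stable order.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))
```

A chunk that ranks well in both the semantic and the keyword list ends up ahead of one that appears in only a single list.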

Best Practices¶

File Organization¶

Use descriptive filenames - "Q3-2025-Financial-Report.pdf" not "report.pdf"
Group related content - Upload all chapters of a manual together
Keep files focused - One topic per document improves search accuracy

Content Quality¶

Text-based PDFs work better than scanned images
Clear audio produces better transcriptions
Structured documents (headings, sections) chunk more effectively

Size Considerations¶

Consideration	Recommendation
Individual file	Keep under 50 MB for faster processing
Total knowledge base	No hard limit, but search may slow with millions of chunks
Chunk count	Aim for <100,000 chunks for optimal performance
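
To judge how close a knowledge base is to the chunk-count recommendation, you can estimate chunks from token counts. The sketch below assumes the chunking parameters given under Processing Pipeline (about 500 tokens per chunk with 15% overlap) and an idealized fixed stride; real chunk counts will differ because chunks respect sentence boundaries.

```python
import math


def estimate_chunks(total_tokens: int, chunk_tokens: int = 500, overlap_ratio: float = 0.15) -> int:
    """Rough number of chunks produced for a document of total_tokens tokens."""
    if total_tokens <= 0:
        return 0
    if total_tokens <= chunk_tokens:
        return 1
    # Each chunk after the first adds only the non-overlapping part.
    stride = chunk_tokens - round(chunk_tokens * overlap_ratio)
    return 1 + math.ceil((total_tokens - chunk_tokens) / stride)
```

Under these assumptions each additional chunk covers 425 new tokens, so 100,000 chunks corresponds to roughly 42.5 million tokens of source text.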

Duplicate Handling¶

The system checks for duplicates by filename:

Uploading "report.pdf" twice won't create duplicates
Change the filename to create a new version
Delete old versions before re-uploading updates
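
Since duplicates are detected by filename, a client can check a batch against the names already imported before uploading. This sketch assumes an exact, case-sensitive comparison, which this guide does not confirm; the function name is illustrative.

```python
from typing import Iterable, List, Tuple


def split_new_and_duplicate(
    candidates: Iterable[str], existing: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Separate candidate filenames into (new, duplicate) lists."""
    seen = set(existing)
    new: List[str] = []
    duplicates: List[str] = []
    for name in candidates:
        if name in seen:
            duplicates.append(name)
        else:
            new.append(name)
            seen.add(name)  # a name repeated within the batch is also a duplicate
    return new, duplicates
```

Files reported as duplicates either need a new filename or require the old version to be deleted first.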

Monitoring Uploads¶

The upload modal shows:

Upload progress percentage
Processing status messages
Success/failure notifications

Via Debug Endpoint¶

List imported documents:

curl "https://your-domain.com/chatbot/public/debug-rag.php"

View a specific document:

curl "https://your-domain.com/chatbot/public/debug-rag.php?doc=42"

Troubleshooting¶

Upload Fails Immediately¶

Issue	Solution
File too large	Reduce file size or increase `MAX_UPLOAD_SIZE_MB`
Unsupported format	Check supported formats list
API key missing	Add the `X-API-Key` header if `UPLOAD_API_KEY` is configured

Processing Fails¶

Issue	Solution
Docling service down	Check that the service is running at the configured URL
Timeout	Increase `DOCLING_TIMEOUT` for large files
Wrong OCR language	Set `DOCLING_OCR_LANGUAGE` to match the document's language

Media Import Fails¶

Issue	Solution
Video unavailable	Check if video is public/accessible
No subtitles	Enable transcription (slower but works)
Rate limited	Wait and retry, or use proxy rotation
Age-restricted	Video cannot be accessed without login

Content Not Searchable¶

After upload, if content isn't found in searches:

Verify that the document appears in the debug endpoint output
Check that chunks were created for it
Test with exact phrases from the document
Temporarily lower the search threshold

Deleting Documents¶

Currently, document deletion is available via direct database access:

-- Delete a specific document (cascades to chunks and embeddings)
DELETE FROM documents WHERE id = 42;

-- Delete by filename
DELETE FROM documents WHERE filename = 'old-report.pdf';
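
When deleting from a script, a parameterized query avoids quoting mistakes. The sketch below demonstrates the cascade on an in-memory SQLite database that stands in for the production database. The `chunks` table and its `document_id` column are assumptions made for the demo; only the `documents` table with `id` and `filename` appears in this guide.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Create a throwaway database with one document and two chunks."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces cascades only when enabled
    conn.executescript("""
        CREATE TABLE documents (id INTEGER PRIMARY KEY, filename TEXT NOT NULL);
        CREATE TABLE chunks (
            id INTEGER PRIMARY KEY,
            document_id INTEGER NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
            content TEXT NOT NULL
        );
        INSERT INTO documents (id, filename) VALUES (42, 'old-report.pdf');
        INSERT INTO chunks (document_id, content) VALUES (42, 'first'), (42, 'second');
    """)
    return conn


def delete_document(conn: sqlite3.Connection, document_id: int) -> int:
    """Delete one document by ID and return the number of document rows removed."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute("DELETE FROM documents WHERE id = ?", (document_id,))
    return cur.rowcount
```

Checking the returned row count is a cheap way to confirm that the ID existed before assuming the document is gone.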

Future Feature

A web-based document management interface is planned for future releases.

Next Steps¶

Learn about the API for programmatic uploads
Configure the Docling service for advanced processing
Understand security settings for upload protection