Uploading Documents¶
This guide covers how to add content to your RAG Chatbot knowledge base through file uploads and media imports.
Overview¶
Documents are the foundation of your chatbot's knowledge. When you upload a document:
- The content is extracted and parsed
- Text is split into searchable chunks
- AI embeddings are generated for semantic search
- The document becomes queryable via chat
Supported Formats¶
Documents¶
| Format | Extension | Features |
|---|---|---|
| PDF | .pdf | Text, tables, images (OCR) |
| Word | .docx | Text, tables, formatting |
| PowerPoint | .pptx | Slides, speaker notes |
| Excel | .xlsx | Spreadsheet data |
| HTML | .html, .htm | Web page content |
Text Files¶
| Format | Extension | Description |
|---|---|---|
| Plain Text | .txt | Simple text files |
| Markdown | .md | Formatted documentation |
| CSV | .csv | Tabular data |
| JSON | .json | Structured data |
| XML | .xml | Structured markup |
Media Files¶
| Format | Extension | Processing |
|---|---|---|
| Images | .png, .jpg, .gif, .webp, .tiff, .bmp | OCR text extraction |
| Audio | .wav, .mp3, .m4a, .ogg, .flac | Speech transcription |
| Subtitles | .srt, .vtt | Direct text import |
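
As a rough illustration, a client could check a file's extension against the tables above before uploading. This sketch is not part of the product: the function name `processing_kind` and the category labels are hypothetical, and the extension sets simply mirror the tables.

```python
from pathlib import Path
from typing import Optional

# Extension sets copied from the tables above; the category names are illustrative.
PROCESSING_BY_EXTENSION = {
    "document": {".pdf", ".docx", ".pptx", ".xlsx", ".html", ".htm"},
    "text": {".txt", ".md", ".csv", ".json", ".xml"},
    "ocr": {".png", ".jpg", ".gif", ".webp", ".tiff", ".bmp"},
    "transcription": {".wav", ".mp3", ".m4a", ".ogg", ".flac"},
    "subtitle": {".srt", ".vtt"},
}


def processing_kind(filename: str) -> Optional[str]:
    """Return the category for a filename, or None if its extension is not listed."""
    suffix = Path(filename).suffix.lower()
    for kind, extensions in PROCESSING_BY_EXTENSION.items():
        if suffix in extensions:
            return kind
    return None
```

A `None` result corresponds to the "Unsupported format" case, so a client can catch it before sending anything.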
Online Media¶
Supported platforms for URL import:
- YouTube (single videos and playlists)
- Vimeo
- SoundCloud
- TED Talks
- Twitch VODs
- Twitter/X videos
- TikTok
- Dailymotion
- And 1000+ more platforms
Using the Chat Widget¶
Uploading Files¶
- Open the chat widget
- Click the upload button (↑) in the header
- Select the Upload File tab
- Drag and drop files onto the drop zone, or click to browse
- (Optional) Add more files to the queue
- Click Upload & Import
The progress bar shows upload and processing status.
Importing Media URLs¶
- Open the chat widget
- Click the upload button (↑) in the header
- Select the Media URL tab
- Paste the video or audio URL
- Click Upload & Import
For YouTube videos, the system will:
- First attempt to download existing subtitles
- Fall back to AI transcription if no subtitles exist
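
The subtitles-first behaviour can be pictured as a simple fallback. In this minimal sketch the two callables `fetch_subtitles` and `transcribe` are hypothetical stand-ins for whatever the system uses internally; only the order of the attempts comes from the description above.

```python
from typing import Callable, Optional, Tuple


def import_transcript(
    url: str,
    fetch_subtitles: Callable[[str], Optional[str]],
    transcribe: Callable[[str], str],
) -> Tuple[str, str]:
    """Return (text, method), trying existing subtitles before transcription."""
    text = fetch_subtitles(url)
    if text:  # subtitles exist and are non-empty
        return text, "subtitles"
    # No usable subtitles: fall back to the slower transcription path.
    return transcribe(url), "transcription"
```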
Importing Playlists¶
When you paste a YouTube playlist URL:
- The system fetches playlist information
- Each video is processed sequentially
- Progress updates show current video
- Failed videos display retry buttons
Large Playlists
For playlists with many videos, consider importing in batches. Very large playlists may time out.
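
One way to import in batches is to split a list of video URLs on the client side and submit each group separately. The helper below is a minimal sketch; the name `batches` and the choice of batch size are assumptions, and no particular size is recommended by this guide.

```python
from typing import Iterator, List, Sequence


def batches(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `size` items, preserving order."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])
```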
Using the API¶
File Upload via API¶
curl -X POST https://your-domain.com/chatbot/public/upload.php \
  -F "file=@document.pdf" \
  -F 'metadata={"source": "training", "department": "HR"}'
URL Import via API¶
curl -X POST https://your-domain.com/chatbot/public/upload.php \
  -F "url=https://www.youtube.com/watch?v=VIDEO_ID" \
  -F "subtitle_langs=en,es"
With API Key Authentication¶
If UPLOAD_API_KEY is configured:
curl -X POST https://your-domain.com/chatbot/public/upload.php \
  -H "X-API-Key: your-api-key" \
  -F "file=@document.pdf"
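
For scripted uploads, the same request can be assembled in Python with only the standard library. This is a sketch under assumptions: the helper names `form_part` and `build_upload_request` are hypothetical, the field names (`file`, `metadata`) and the `X-API-Key` header are taken from the curl examples, and the server's response format is not covered here. The function builds the request without sending it.

```python
import json
import urllib.request
import uuid
from typing import Dict, Optional


def form_part(boundary: str, name: str, value: bytes, filename: Optional[str] = None) -> bytes:
    """Encode one multipart/form-data part."""
    disposition = f'form-data; name="{name}"'
    if filename:
        disposition += f'; filename="{filename}"'
    head = f"--{boundary}\r\nContent-Disposition: {disposition}\r\n\r\n"
    return head.encode() + value + b"\r\n"


def build_upload_request(
    endpoint: str,
    filename: str,
    content: bytes,
    metadata: Optional[Dict[str, str]] = None,
    api_key: Optional[str] = None,
) -> urllib.request.Request:
    """Build (but do not send) a multipart POST for a file upload."""
    boundary = uuid.uuid4().hex
    body = b""
    if metadata is not None:
        body += form_part(boundary, "metadata", json.dumps(metadata).encode())
    body += form_part(boundary, "file", content, filename=filename)
    body += f"--{boundary}--\r\n".encode()

    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    if api_key:
        headers["X-API-Key"] = api_key  # only needed when UPLOAD_API_KEY is configured
    return urllib.request.Request(endpoint, data=body, headers=headers, method="POST")
```

Passing the returned object to `urllib.request.urlopen` would send it. Keeping construction separate from sending makes the request easy to inspect or log first.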
Document Metadata¶
Automatic Metadata¶
The system automatically captures:
| Field | Description |
|---|---|
| filename | Original file name |
| file_size | Size in bytes |
| word_count | Number of words |
| page_count | Pages (for documents) |
| upload_date | When imported |
| source_url | For media imports |
| duration | For audio/video |
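
To make the fields concrete, here is a sketch of how a few of them could be derived for a plain-text file. It is illustrative only: the function `basic_metadata` is hypothetical, the word count uses simple whitespace splitting, and the timestamp format is an assumption rather than the system's actual format.

```python
from datetime import datetime, timezone
from typing import Dict, Union


def basic_metadata(filename: str, content: bytes) -> Dict[str, Union[str, int]]:
    """Derive a subset of the automatic fields from raw file bytes."""
    text = content.decode("utf-8", errors="replace")
    return {
        "filename": filename,
        "file_size": len(content),        # size in bytes
        "word_count": len(text.split()),  # whitespace-separated words
        "upload_date": datetime.now(timezone.utc).isoformat(),
    }
```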
Custom Metadata¶
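Add your own metadata during upload by passing a JSON object in the metadata form field, as in the File Upload via API example ({"source": "training", "department": "HR"}).
Custom metadata can be used to filter searches in later queries.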
Processing Pipeline¶
1. Document Parsing¶
The Docling service extracts content:
- Text extraction from PDFs and documents
- OCR for scanned documents and images
- Table recognition preserves tabular data
- Layout analysis maintains document structure
2. Chunking¶
Documents are split into searchable chunks:
- Target size: ~500 tokens per chunk
- 15% overlap between chunks (preserves context)
- Respects sentence boundaries
- Maintains paragraph coherence
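
The chunking rules above can be sketched as follows. This is a simplified illustration, not the system's chunker: it approximates tokens by whitespace-separated words, detects sentences with a basic regular expression, and ignores paragraph structure. The function name `chunk_text` is hypothetical.

```python
import re
from typing import List


def chunk_text(text: str, target_tokens: int = 500, overlap_ratio: float = 0.15) -> List[str]:
    """Group whole sentences into chunks of roughly target_tokens words with overlap."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    overlap_budget = int(target_tokens * overlap_ratio)
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and current_len + n > target_tokens:
            chunks.append(" ".join(current))
            # Carry trailing sentences (within the overlap budget) into the next chunk.
            carried: List[str] = []
            carried_len = 0
            for prev in reversed(current):
                prev_len = len(prev.split())
                if carried_len + prev_len > overlap_budget:
                    break
                carried.insert(0, prev)
                carried_len += prev_len
            current, current_len = carried, carried_len
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Because only whole sentences are carried over, the actual overlap can be smaller than 15% when the last sentence of a chunk is long.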
3. Embedding Generation¶
Each chunk gets a vector embedding:
- Uses OpenAI's text-embedding-3-small model
- 1536-dimensional vectors
- Enables semantic similarity search
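
Semantic similarity between two embeddings is commonly measured with cosine similarity. The sketch below shows that measure on small toy vectors. Whether this system uses cosine similarity or another distance is an assumption, and real embeddings would have 1536 dimensions rather than two.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_a * norm_b)
```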
4. Indexing¶
Chunks are indexed for fast retrieval:
- Vector index (HNSW) for semantic search
- Full-text index (GIN) for keyword search
- Combined in hybrid search for best results
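
This guide does not say how the two result lists are combined. One widely used option is reciprocal rank fusion (RRF), sketched below purely as an illustration. The function name, the chunk IDs, and the constant `k = 60` are assumptions, not the system's documented behaviour.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Merge ranked lists of chunk IDs; items ranked high in several lists score best."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by descending score, breaking ties by ID for a stable order.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))
```

RRF uses only rank positions, so it avoids having to normalise vector distances and keyword scores onto a common scale.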
Best Practices¶
File Organization¶
- Use descriptive filenames - "Q3-2025-Financial-Report.pdf" not "report.pdf"
- Group related content - Upload all chapters of a manual together
- Keep files focused - One topic per document improves search accuracy
Content Quality¶
- Text-based PDFs work better than scanned images
- Clear audio produces better transcriptions
- Structured documents (headings, sections) chunk more effectively
Size Considerations¶
| Consideration | Recommendation |
|---|---|
| Individual file | Keep under 50 MB for faster processing |
| Total knowledge base | No hard limit, but search may slow with millions of chunks |
| Chunk count | Aim for <100,000 chunks for optimal performance |
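
To judge how close a knowledge base is to the chunk-count guideline, a rough estimate can be derived from the chunking parameters given earlier (about 500 tokens per chunk, 15% overlap). The sketch below assumes fixed-size sliding windows, so real counts will differ because the chunker respects sentence boundaries. The function name `estimate_chunks` is hypothetical.

```python
import math


def estimate_chunks(total_tokens: int, chunk_tokens: int = 500, overlap_ratio: float = 0.15) -> int:
    """Estimate the chunk count for a fixed-size sliding window with overlap."""
    if total_tokens <= 0:
        return 0
    if total_tokens <= chunk_tokens:
        return 1
    overlap = round(chunk_tokens * overlap_ratio)
    stride = chunk_tokens - overlap  # new tokens covered by each additional chunk
    return 1 + math.ceil((total_tokens - chunk_tokens) / stride)
```

Under these assumptions each additional chunk covers 425 new tokens, so 100,000 chunks corresponds to roughly 42.5 million tokens of source text.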
Duplicate Handling¶
The system checks for duplicates by filename:
- Uploading "report.pdf" twice won't create duplicates
- Change the filename to create a new version
- Delete old versions before re-uploading updates
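
Since duplicates are detected by filename, a client that wants to keep several versions needs a distinct name for each upload. The helper below is one hypothetical naming scheme (`-v2`, `-v3`, ...). It assumes exact, case-sensitive filename matching, which this guide does not confirm.

```python
from pathlib import Path
from typing import Set


def next_available_name(filename: str, existing: Set[str]) -> str:
    """Return filename if unused, else insert -v2, -v3, ... before the extension."""
    if filename not in existing:
        return filename
    path = Path(filename)
    version = 2
    while f"{path.stem}-v{version}{path.suffix}" in existing:
        version += 1
    return f"{path.stem}-v{version}{path.suffix}"
```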
Monitoring Uploads¶
Via Chat Widget¶
The upload modal shows:
- Upload progress percentage
- Processing status messages
- Success/failure notifications
Via Debug Endpoint¶
Use the debug endpoint to list imported documents or to view a specific document.
Troubleshooting¶
Upload Fails Immediately¶
| Issue | Solution |
|---|---|
| File too large | Reduce file size or increase MAX_UPLOAD_SIZE_MB |
| Unsupported format | Check supported formats list |
| API key missing | Add X-API-Key header if configured |
Processing Fails¶
| Issue | Solution |
|---|---|
| Docling service down | Check if service is running at configured URL |
| Timeout | Increase DOCLING_TIMEOUT for large files |
| OCR language | Set correct DOCLING_OCR_LANGUAGE |
Media Import Fails¶
| Issue | Solution |
|---|---|
| Video unavailable | Check if video is public/accessible |
| No subtitles | Enable AI transcription as a fallback (slower, but works without subtitles) |
| Rate limited | Wait and retry, or use proxy rotation |
| Age-restricted | Video cannot be accessed without login |
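
For the rate-limited case, "wait and retry" is often implemented as exponential backoff. The sketch below is a generic illustration, not part of the product: `retry_with_backoff` is a hypothetical helper, the delays are arbitrary, and real code would catch only the specific rate-limit error rather than every exception.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    action: Callable[[], T],
    attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call action, doubling the wait after each failure, up to `attempts` tries."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return action()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

The `sleep` parameter exists so that the waiting behaviour can be replaced in tests.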
Content Not Searchable¶
After upload, if content isn't found in searches:
- Verify document appears in debug endpoint
- Check that chunks were created
- Test with exact phrases from the document
- Lower the search threshold temporarily
Deleting Documents¶
Currently, document deletion is only available through direct database access:
-- Delete a specific document (cascades to chunks and embeddings)
DELETE FROM documents WHERE id = 42;
-- Delete by filename
DELETE FROM documents WHERE filename = 'old-report.pdf';
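
The cascading behaviour can be tried out safely on a throwaway database. The sketch below uses an in-memory SQLite database as a stand-in. The two-table schema is a hypothetical simplification, and the production database and its real schema may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces cascades when enabled
conn.executescript(
    """
    CREATE TABLE documents (id INTEGER PRIMARY KEY, filename TEXT NOT NULL);
    CREATE TABLE chunks (
        id INTEGER PRIMARY KEY,
        document_id INTEGER NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
        content TEXT NOT NULL
    );
    INSERT INTO documents (id, filename) VALUES (42, 'old-report.pdf');
    INSERT INTO chunks (document_id, content) VALUES (42, 'first'), (42, 'second');
    """
)

# Delete the document inside a transaction; its chunks are removed by the cascade.
with conn:
    conn.execute("DELETE FROM documents WHERE id = ?", (42,))

remaining_chunks = conn.execute("SELECT COUNT(*) FROM chunks").fetchone()[0]
```

Using a parameterised query (`?`) rather than string formatting avoids SQL injection if the ID or filename ever comes from user input.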
Future Feature
A web-based document management interface is planned for future releases.
Next Steps¶
- Learn about the API for programmatic uploads
- Configure the Docling service for advanced processing
- Understand security settings for upload protection