Uploading Documents

This guide covers how to add content to your RAG Chatbot knowledge base through file uploads and media imports.

Overview

Documents are the foundation of your chatbot's knowledge. When you upload a document:

  1. The content is extracted and parsed
  2. Text is split into searchable chunks
  3. AI embeddings are generated for semantic search
  4. The document becomes queryable via chat

Supported Formats

Documents

Format      Extension    Features
PDF         .pdf         Text, tables, images (OCR)
Word        .docx        Text, tables, formatting
PowerPoint  .pptx        Slides, speaker notes
Excel       .xlsx        Spreadsheet data
HTML        .html, .htm  Web page content

Text Files

Format      Extension  Description
Plain Text  .txt       Simple text files
Markdown    .md        Formatted documentation
CSV         .csv       Tabular data
JSON        .json      Structured data
XML         .xml       Structured markup

Media Files

Format     Extension                               Processing
Images     .png, .jpg, .gif, .webp, .tiff, .bmp    OCR text extraction
Audio      .wav, .mp3, .m4a, .ogg, .flac           Speech transcription
Subtitles  .srt, .vtt                              Direct text import
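
The extension lists above lend themselves to a quick client-side check before uploading. The following is a minimal sketch, not the service's own validation logic: the category names and the `classify_file` helper are hypothetical, and only the extensions themselves come from this guide.

```python
from pathlib import Path
from typing import Optional

# Extensions copied from the format lists in this guide.
# The category names are illustrative labels, not values the service uses.
SUPPORTED_EXTENSIONS = {
    "document": {".pdf", ".docx", ".pptx", ".xlsx", ".html", ".htm"},
    "text": {".txt", ".md", ".csv", ".json", ".xml"},
    "image": {".png", ".jpg", ".gif", ".webp", ".tiff", ".bmp"},
    "audio": {".wav", ".mp3", ".m4a", ".ogg", ".flac"},
    "subtitle": {".srt", ".vtt"},
}


def classify_file(filename: str) -> Optional[str]:
    """Return the category for a filename, or None if its extension is not listed."""
    suffix = Path(filename).suffix.lower()  # case-insensitive match
    for category, extensions in SUPPORTED_EXTENSIONS.items():
        if suffix in extensions:
            return category
    return None
```

A caller could skip files for which `classify_file` returns `None` instead of waiting for the server to reject them.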

Online Media

Supported platforms for URL import:

  • YouTube (single videos and playlists)
  • Vimeo
  • SoundCloud
  • TED Talks
  • Twitch VODs
  • Twitter/X videos
  • TikTok
  • Dailymotion
  • And 1000+ more platforms

Using the Chat Widget

Uploading Files

  1. Open the chat widget
  2. Click the upload button (↑) in the header
  3. Select the Upload File tab
  4. Drag and drop files onto the drop zone, or click to browse
  5. (Optional) Add more files to the queue
  6. Click Upload & Import

The progress bar shows upload and processing status.

Importing Media URLs

  1. Open the chat widget
  2. Click the upload button (↑) in the header
  3. Select the Media URL tab
  4. Paste the video or audio URL
  5. Click Upload & Import

For YouTube videos, the system will:

  • First attempt to download existing subtitles
  • Fall back to AI transcription if no subtitles exist
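
The subtitle-first behaviour described above can be sketched as a simple fallback. This is an illustration only: `get_transcript` and both callables are hypothetical stand-ins, since this guide does not document the internal functions the system uses.

```python
from typing import Callable, Optional, Tuple


def get_transcript(
    url: str,
    fetch_subtitles: Callable[[str], Optional[str]],  # hypothetical subtitle downloader
    transcribe: Callable[[str], str],                 # hypothetical AI transcription step
) -> Tuple[str, str]:
    """Return (text, source), where source is "subtitles" or "transcription"."""
    text = fetch_subtitles(url)
    if text:
        # Existing subtitles were found, so the slower transcription is skipped.
        return text, "subtitles"
    return transcribe(url), "transcription"
```

Passing the two steps in as callables keeps the fallback order easy to test without touching the network.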

Importing Playlists

When you paste a YouTube playlist URL:

  1. The system fetches playlist information
  2. Each video is processed sequentially
  3. Progress updates show the current video
  4. Failed videos display retry buttons

Large Playlists

For playlists with many videos, consider importing in batches. Very large playlists may time out.
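
One way to import in batches is to split a list of individual video URLs into fixed-size groups and submit each group separately. The helper below is a sketch under that assumption; the `batches` name and the default batch size of 10 are illustrative, not values taken from the system.

```python
from typing import Iterator, List, Sequence


def batches(video_urls: Sequence[str], batch_size: int = 10) -> Iterator[List[str]]:
    """Yield consecutive groups of at most batch_size URLs, preserving order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(video_urls), batch_size):
        yield list(video_urls[start:start + batch_size])
```

Each yielded group can then be imported one URL at a time, with a pause between groups if rate limiting is a concern.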

Using the API

File Upload via API

curl -X POST https://your-domain.com/chatbot/public/upload.php \
  -F "file=@document.pdf" \
  -F 'metadata={"source": "training", "department": "HR"}'

URL Import via API

curl -X POST https://your-domain.com/chatbot/public/upload.php \
  -F "url=https://www.youtube.com/watch?v=VIDEO_ID" \
  -F "subtitle_langs=en,es"

With API Key Authentication

If UPLOAD_API_KEY is configured:

curl -X POST https://your-domain.com/chatbot/public/upload.php \
  -H "X-API-Key: your-api-key" \
  -F "file=@document.pdf"
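
For callers who prefer Python to curl, the same request can be assembled with the standard library. This is a minimal sketch rather than an official client: `build_upload_request` is a hypothetical helper, and it assumes the endpoint accepts the same `file` and `metadata` form fields and `X-API-Key` header shown in the curl examples.

```python
import json
import uuid
from typing import Dict, Optional, Tuple


def build_upload_request(
    filename: str,
    content: bytes,
    metadata: Optional[dict] = None,
    api_key: Optional[str] = None,
) -> Tuple[Dict[str, str], bytes]:
    """Build headers and a multipart/form-data body mirroring the curl calls."""
    boundary = uuid.uuid4().hex
    parts = []
    if metadata is not None:
        # Same "metadata" form field as in the curl example, serialized as JSON.
        field = (
            f"--{boundary}\r\n"
            'Content-Disposition: form-data; name="metadata"\r\n\r\n'
            f"{json.dumps(metadata)}\r\n"
        )
        parts.append(field.encode("utf-8"))
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    )
    parts.append(file_header.encode("utf-8") + content + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))

    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    if api_key:
        # Only needed when UPLOAD_API_KEY is configured on the server.
        headers["X-API-Key"] = api_key
    return headers, b"".join(parts)
```

The returned pair can be passed to `urllib.request.Request(url, data=body, headers=headers, method="POST")` and sent with `urllib.request.urlopen`.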

Document Metadata

Automatic Metadata

The system automatically captures:

Field        Description
filename     Original file name
file_size    Size in bytes
word_count   Number of words
page_count   Pages (for documents)
upload_date  When imported
source_url   For media imports
duration     For audio/video

Custom Metadata

Add your own metadata during upload:

{
  "department": "Engineering",
  "version": "2.0",
  "author": "John Smith",
  "confidential": false
}

Custom metadata can be used to filter searches in later queries.
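
This guide does not specify the filter syntax, so the following only illustrates the idea of metadata filtering: keep a document when every requested key matches exactly. The `matches` helper is hypothetical.

```python
def matches(metadata: dict, filters: dict) -> bool:
    """Return True when every filter key is present in metadata with an equal value."""
    return all(key in metadata and metadata[key] == value for key, value in filters.items())


# Example using the custom metadata shown above.
doc_metadata = {
    "department": "Engineering",
    "version": "2.0",
    "author": "John Smith",
    "confidential": False,
}
engineering_only = matches(doc_metadata, {"department": "Engineering"})
```

Consistent key names and value types across uploads (for example, always a boolean for `confidential`) make this kind of exact matching reliable.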

Processing Pipeline

1. Document Parsing

The Docling service extracts content:

  • Text extraction from PDFs and documents
  • OCR for scanned documents and images
  • Table recognition preserves tabular data
  • Layout analysis maintains document structure

2. Chunking

Documents are split into searchable chunks:

  • Target size: ~500 tokens per chunk
  • 15% overlap between chunks (preserves context)
  • Respects sentence boundaries
  • Maintains paragraph coherence
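
The chunking rules above can be sketched as a sentence-aware splitter with overlap. This is an approximation under stated assumptions, not the pipeline's actual code: tokens are estimated by whitespace-separated words, sentences are split with a simple regular expression, and `chunk_text` is a hypothetical name.

```python
import re
from typing import List


def chunk_text(text: str, target_tokens: int = 500, overlap_ratio: float = 0.15) -> List[str]:
    """Split text into chunks of whole sentences, repeating trailing sentences as overlap."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    overlap_budget = int(target_tokens * overlap_ratio)
    chunks: List[str] = []
    current: List[str] = []  # sentences in the chunk being built
    current_len = 0          # approximate token count of `current`
    for sentence in sentences:
        n = len(sentence.split())
        if current and current_len + n > target_tokens:
            chunks.append(" ".join(current))
            # Carry whole trailing sentences into the next chunk, up to the overlap budget.
            carried: List[str] = []
            carried_len = 0
            for prev in reversed(current):
                prev_len = len(prev.split())
                if carried_len + prev_len > overlap_budget:
                    break
                carried.insert(0, prev)
                carried_len += prev_len
            current, current_len = carried, carried_len
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Because sentences are never cut, a chunk can exceed the target when a single sentence is long; the target acts as a guideline rather than a hard cap.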

3. Embedding Generation

Each chunk gets a vector embedding:

  • Uses OpenAI's text-embedding-3-small model
  • 1536-dimensional vectors
  • Enables semantic similarity search
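
Semantic similarity search compares the query's embedding with each chunk's embedding. The guide does not state which distance metric the index uses; cosine similarity is a common choice for embedding vectors, so the sketch below assumes it. Short toy vectors stand in for the 1536-dimensional ones.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Chunks whose vectors point in nearly the same direction as the query vector score close to 1 and rank first.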

4. Indexing

Chunks are indexed for fast retrieval:

  • Vector index (HNSW) for semantic search
  • Full-text index (GIN) for keyword search
  • Combined in hybrid search for best results
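
The guide does not say how the vector and keyword results are combined. Reciprocal rank fusion (RRF) is one widely used way to merge ranked lists, shown here purely as an assumed illustration; the constant `k = 60` is the conventional default for RRF, not a documented setting of this system.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Merge ranked lists of chunk IDs; higher-ranked items in any list score more."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by descending score, breaking ties by chunk ID for a stable order.
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))


vector_hits = ["a", "b", "c"]    # hypothetical semantic search results
keyword_hits = ["b", "d", "a"]   # hypothetical keyword search results
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

Chunks that appear in both lists ("a" and "b" here) rise above chunks found by only one method.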

Best Practices

File Organization

  • Use descriptive filenames - "Q3-2025-Financial-Report.pdf" not "report.pdf"
  • Group related content - Upload all chapters of a manual together
  • Keep files focused - One topic per document improves search accuracy

Content Quality

  • Text-based PDFs work better than scanned images
  • Clear audio produces better transcriptions
  • Structured documents (headings, sections) chunk more effectively

Size Considerations

Consideration         Recommendation
Individual file       Keep under 50 MB for faster processing
Total knowledge base  No hard limit, but search may slow with millions of chunks
Chunk count           Aim for <100,000 chunks for optimal performance
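
To relate these recommendations to document size, a rough chunk count can be estimated from the chunking parameters. The sketch assumes the ~500-token target and 15% overlap described in this guide mean each new chunk advances by about 425 tokens; `estimate_chunks` is a hypothetical helper and real counts will vary with sentence boundaries.

```python
import math


def estimate_chunks(total_tokens: int, target_tokens: int = 500, overlap_ratio: float = 0.15) -> int:
    """Approximate number of chunks for a document of total_tokens tokens."""
    if total_tokens <= 0:
        return 0
    if total_tokens <= target_tokens:
        return 1
    # Each additional chunk covers only the non-overlapping part of the target size.
    stride = target_tokens - round(target_tokens * overlap_ratio)
    return 1 + math.ceil((total_tokens - target_tokens) / stride)
```

Under these assumptions, 100,000 chunks correspond to about 42.5 million tokens of source text.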

Duplicate Handling

The system checks for duplicates by filename:

  • Uploading "report.pdf" twice won't create duplicates
  • Change the filename to create a new version
  • Delete old versions before re-uploading updates
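
The filename-based check can be pictured as a lookup keyed by filename. The class below is a toy model, assuming a repeated upload is skipped rather than overwritten (which the advice to delete old versions first suggests); `KnowledgeBase` and its methods are hypothetical.

```python
from typing import Dict


class KnowledgeBase:
    """Toy in-memory model of filename-based duplicate detection."""

    def __init__(self) -> None:
        self._documents: Dict[str, str] = {}

    def add(self, filename: str, content: str) -> bool:
        """Store the document; return False if the filename already exists."""
        if filename in self._documents:
            return False  # treated as a duplicate, existing content is kept
        self._documents[filename] = content
        return True

    def delete(self, filename: str) -> bool:
        """Remove the document; return True if something was deleted."""
        return self._documents.pop(filename, None) is not None
```

In this model, updating `report.pdf` means calling `delete("report.pdf")` first and then `add` again, or adding the new version under a different filename.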

Monitoring Uploads

Via Chat Widget

The upload modal shows:

  • Upload progress percentage
  • Processing status messages
  • Success/failure notifications

Via Debug Endpoint

Check imported documents:

curl "https://your-domain.com/chatbot/public/debug-rag.php"

View specific document:

curl "https://your-domain.com/chatbot/public/debug-rag.php?doc=42"

Troubleshooting

Upload Fails Immediately

Issue               Solution
File too large      Reduce file size or increase MAX_UPLOAD_SIZE_MB
Unsupported format  Check supported formats list
API key missing     Add X-API-Key header if configured

Processing Fails

Issue                 Solution
Docling service down  Check if service is running at configured URL
Timeout               Increase DOCLING_TIMEOUT for large files
OCR language          Set correct DOCLING_OCR_LANGUAGE

Media Import Fails

Issue              Solution
Video unavailable  Check if video is public/accessible
No subtitles       Enable transcription (slower but works)
Rate limited       Wait and retry, or use proxy rotation
Age-restricted     Video cannot be accessed without login

Content Not Searchable

After upload, if content isn't found in searches:

  1. Verify that the document appears in the debug endpoint output
  2. Check that chunks were created
  3. Test with exact phrases from the document
  4. Lower the search threshold temporarily

Deleting Documents

Currently, document deletion is available via direct database access:

-- Delete a specific document (cascades to chunks and embeddings)
DELETE FROM documents WHERE id = 42;

-- Delete by filename
DELETE FROM documents WHERE filename = 'old-report.pdf';
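
When deletions are scripted rather than typed by hand, a parameterized query avoids quoting mistakes. The sketch below uses an in-memory SQLite database only so that it is self-contained; the two-table schema is hypothetical, and the production database and its schema may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces cascades only when this is on
conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, filename TEXT)")
conn.execute(
    "CREATE TABLE chunks ("
    "id INTEGER PRIMARY KEY, "
    "document_id INTEGER REFERENCES documents(id) ON DELETE CASCADE, "
    "content TEXT)"
)
conn.execute("INSERT INTO documents VALUES (42, 'old-report.pdf')")
conn.execute("INSERT INTO chunks (document_id, content) VALUES (42, 'first chunk')")


def delete_document(connection: sqlite3.Connection, document_id: int) -> int:
    """Delete one document by ID and return the number of document rows removed."""
    with connection:  # commits on success, rolls back on error
        cursor = connection.execute("DELETE FROM documents WHERE id = ?", (document_id,))
    return cursor.rowcount
```

A return value of 0 means no document had that ID, which is a useful check before assuming the deletion succeeded.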

Future Feature

A web-based document management interface is planned for future releases.

Next Steps

  1. Learn about the API for programmatic uploads
  2. Configure the Docling service for advanced processing
  3. Understand security settings for upload protection