The Datasource Catalog is a global registry of all datasource definitions available in Findable. It lives in the datasources Cosmos DB container and serves as the single source of truth for what datasources exist and who can use them.

Three-Tier Architecture

Datasources operate across three distinct tiers — from global definitions down to per-chat runtime configuration:
Tier 1 — DATASOURCE CATALOG (Cosmos: 'datasources')
         Global registry of definitions.
         ┌─────────────────────────┬─────────────────────────┐
         │  origin: 'builtIn'      │  origin: 'admin'        │
         │  System templates       │  Admin connections       │
         │  • Locked fields        │  • Full CRUD             │
         │  • No concrete index    │  • Has indexName          │
         │  • Not deletable        │  • Has ACL               │
         └────────────┬────────────┴────────────┬────────────┘
                      │                         │
                      │ datasourceId FK          │ datasourceId FK
                      │ (provenance link)        │ (provenance + entitlement)
                      ▼                         ▼
Tier 2 — WORKSPACE SETTINGS (IAppSettings)
         Controls which catalog entries are activated by default.
         ┌────────────────────────────────────────────────────┐
         │  workspaceDataSources[]:                           │
         │    Default sources for workspace (personal) chats  │
         │                                                    │
         │  defaultChatDataSources[]:                         │
         │    Default sources for new shared chats             │
         │                                                    │
         │  Each entry: { datasourceId, defaultEnabled,       │
         │                allowUserToggle, sort, label }       │
         └───────────────────────┬────────────────────────────┘
                                 │
                                 │ Populates on chat creation
                                 ▼
Tier 3 — CHAT dataSources[] (per-chat runtime config)
         Concrete configuration for each chat.
         ┌────────────────────────────────────────────────────┐
         │  Each entry: IDataSourceConfig                     │
         │  • datasourceId FK to catalog (provenance)         │
         │  • type, enabled, allowUserToggle                  │
         │  • indexName (concrete, once created)               │
         │  • selectedFolder, searchEndpoint, flowId, etc.     │
         └────────────────────────────────────────────────────┘
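As a rough illustration of how the three tiers connect, the sketch below builds a chat's dataSources[] from the workspace defaults and the catalog at chat creation. The interface shapes are simplified, and the function name buildChatDataSources, the generated id format, and the decision to skip defaults whose catalog entry is missing are all assumptions made for illustration, not the actual implementation.

```typescript
// Simplified, hypothetical shapes; the real IDataSource, IAppSettings and
// IDataSourceConfig interfaces carry more fields than shown here.
interface CatalogEntry {            // Tier 1: catalog definition
  id: string;
  origin: "builtIn" | "admin";
  type: string;
  indexName?: string;               // concrete only for admin entries
  searchEndpoint?: string;
}

interface WorkspaceDefault {        // Tier 2: workspace settings entry
  datasourceId: string;
  defaultEnabled: boolean;
  allowUserToggle: boolean;
  sort: number;
  label: string;
}

interface ChatDataSource {          // Tier 3: per-chat runtime config
  id: string;
  datasourceId: string;
  type: string;
  enabled: boolean;
  allowUserToggle: boolean;
  indexName: string;
  searchEndpoint?: string;
  label: string;
}

function buildChatDataSources(
  catalog: CatalogEntry[],
  defaults: WorkspaceDefault[]
): ChatDataSource[] {
  const byId: Record<string, CatalogEntry> = {};
  for (const entry of catalog) byId[entry.id] = entry;

  const result: ChatDataSource[] = [];
  const ordered = defaults.slice().sort((a, b) => a.sort - b.sort);
  for (const d of ordered) {
    const entry = byId[d.datasourceId];
    if (!entry) continue; // assumption: stale defaults are skipped
    result.push({
      id: `${entry.id}:${result.length}`, // hypothetical id format
      datasourceId: entry.id,             // provenance link back to the catalog
      type: entry.type,
      enabled: d.defaultEnabled,
      allowUserToggle: d.allowUserToggle,
      // Built-in templates start without a concrete index;
      // admin entries inherit theirs from the catalog.
      indexName: entry.origin === "admin" ? (entry.indexName ?? "") : "",
      searchEndpoint: entry.searchEndpoint,
      label: d.label,
    });
  }
  return result;
}
```

In this sketch a built-in entry always lands on the chat with an empty indexName, which mirrors the "concrete, once created" note in the Tier 3 box.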

Datasource Origin (DataSourceOrigin)

Every catalog entry has an origin field that determines who created it and what rules apply:
| Origin | Description | CRUD | Deletable | Locked Fields |
|---|---|---|---|---|
| builtIn | System-bootstrapped templates and capabilities. Seeded at startup via seedDatasources. | Name, description, tags, ACL editable only | No | type, origin, indexName, searchEndpoint, connection fields |
| admin | Admin-authored external connections (SharePoint, Flow Retriever, etc.). Created via the Datasources admin UI. | Full CRUD | Yes | None |
The origin field replaces the previous builtIn: boolean flag with a more expressive enum that can accommodate future origin types.
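The origin rules could be enforced with a guard along the following lines. This is only a sketch: the list of locked fields contains just the four fields named explicitly for built-in entries (the unspecified "connection fields" are left out), and the names LOCKED_FIELDS_SKETCH, findLockedFieldViolations, and canDelete are hypothetical rather than taken from the codebase.

```typescript
type DataSourceOrigin = "builtIn" | "admin";

// Simplified, hypothetical catalog entry shape.
interface CatalogEntrySketch {
  id: string;
  origin: DataSourceOrigin;
  type: string;
  name: string;
  description?: string;
  indexName?: string;
  searchEndpoint?: string;
}

// Assumed subset of the locked fields for built-in entries.
const LOCKED_FIELDS_SKETCH: (keyof CatalogEntrySketch)[] = [
  "type",
  "origin",
  "indexName",
  "searchEndpoint",
];

// Returns the locked fields that a patch tries to change.
// Admin entries have no locked fields, so the result is always empty for them.
function findLockedFieldViolations(
  existing: CatalogEntrySketch,
  patch: Partial<CatalogEntrySketch>
): string[] {
  if (existing.origin !== "builtIn") return [];
  return LOCKED_FIELDS_SKETCH.filter(
    (field) => field in patch && patch[field] !== existing[field]
  );
}

// Only admin-authored entries are deletable.
function canDelete(entry: CatalogEntrySketch): boolean {
  return entry.origin === "admin";
}
```

A router handler could reject an update when the returned list is non-empty, and reject a delete when canDelete returns false.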

Three Kinds of Datasources

The catalog contains three conceptually different kinds of datasources, all represented as IDataSource entities:
Kind 1: VIRTUAL CAPABILITIES                     Kind 2: FILE STORAGE TEMPLATES
(No infrastructure — purely virtual)              (Template → concrete on first upload)

┌────────────────────────────┐                   ┌────────────────────────────┐
│  LLM Knowledge             │                   │  Shared Files              │
│  origin: builtIn           │                   │  origin: builtIn           │
│  type: llmknowledge        │                   │  type: shared              │
│  indexName: ∅              │                   │  indexName: ∅ (template)   │
│                            │                   │                            │
│  Web Search                │                   │  Personal Files            │
│  origin: builtIn           │                   │  origin: builtIn           │
│  type: websearch           │                   │  type: personal            │
│  indexName: ∅              │                   │  indexName: ∅ (template)   │
│                            │                   │                            │
│  Flow Retriever            │                   │  Workspace                 │
│  origin: builtIn           │                   │  origin: builtIn           │
│  type: flowretriever       │                   │  type: workspace           │
│  indexName: ∅              │                   │  indexName: ∅ (runtime)    │
└────────────────────────────┘                   └────────────────────────────┘

Kind 3: ADMIN EXTERNAL CONNECTIONS
(Concrete from the start — pre-existing infrastructure)

┌────────────────────────────┐
│  "HR SharePoint"           │
│  origin: admin             │
│  type: sharepoint          │
│  sharePointSite: { … }     │
└────────────────────────────┘

Datasource Lifecycle by Kind

Each kind follows a different lifecycle from catalog definition through to runtime.

Kind 1 (Virtual, no infrastructure):
Catalog (template) ──FK──→ Chat dataSources[] (enabled/disabled)
                            No index. No infrastructure.
                            LLM Knowledge = model's training data
                            Web Search = live web results
Kind 2 (File Storage, infrastructure created on demand):
Catalog (template)  ──FK──→  Chat dataSources[]     ──upload──→  Azure Infrastructure
origin: builtIn              indexName: ""                        Index: shared-my-project
type: shared                 (virtual until upload)               Blob: Shared/My Project/
                                     │                            Indexer, Skillset, etc.
                                     │
                                     └── indexName populated ──→  indexName: "shared-my-project"
                                         after first upload        (stored on the chat)
For Personal/Workspace types, the index name is resolved at runtime from the user’s UPN; it is never stored on the chat.

Kind 3 (Admin External, pre-existing infrastructure):
Catalog (concrete)  ──FK──→  Chat dataSources[]
origin: admin                Inherits indexName, searchEndpoint
indexName: sales-2024        from catalog. ACL checked via
searchEndpoint: https://…    entitlement service.
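The Kind 2 transition from template to concrete source can be sketched as follows for a shared source. The slug rule (lowercase, non-alphanumerics collapsed to hyphens) is an assumption inferred from the single "My Project" → shared-my-project example above, and slugify and materializeOnFirstUpload are hypothetical names. The provisioning of the index, indexer, and skillset is not modelled.

```typescript
// Hypothetical, reduced view of a shared chat data source.
interface SharedSourceSketch {
  type: "shared";
  indexName: string;        // "" while the source is still virtual
  selectedFolder?: string;
}

// Assumed naming rule; the real sanitisation logic may differ.
function slugify(name: string): string {
  return name
    .trim()
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}

// Called on first upload: fills in indexName once, then leaves it untouched.
function materializeOnFirstUpload(
  source: SharedSourceSketch,
  folderName: string
): SharedSourceSketch {
  if (source.indexName !== "") return source; // already concrete
  return {
    ...source,
    selectedFolder: folderName,
    indexName: `shared-${slugify(folderName)}`,
  };
}
```

Returning the source unchanged when indexName is already set keeps the operation idempotent, so later uploads do not rename an existing index.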

Retrieval Pipeline

The origin-based classification above describes who creates a datasource. The retrieval pipeline describes how data flows from a source into the RAG context at query time. Every datasource follows one of three paths:
                         ┌──────────────────────────┐
                         │    Chat dataSources[]     │
                         │   (IDataSourceConfig[])   │
                         └─────┬─────┬─────┬────────┘
                               │     │     │
            ┌──────────────────┘     │     └──────────────────┐
            │                        │                        │
            ▼                        ▼                        ▼
  ┌───────────────────┐  ┌────────────────────┐  ┌─────────────────────────┐
  │      VIRTUAL       │  │  PHYSICAL (INDEX)   │  │  PHYSICAL (CONNECTION)   │
  │  No infrastructure │  │  Azure AI Search    │  │  External system via     │
  │                    │  │                     │  │  Data Connection          │
  │  • llmknowledge    │  │  • shared           │  │                          │
  │  • websearch       │  │  • personal         │  │  ┌─────────────────────┐ │
  │                    │  │  • workspace        │  │  │  Data Connection     │ │
  │                    │  │  • sharepoint       │  │  │  (connectionId FK)   │ │
  │                    │  │                     │  │  └──────────┬──────────┘ │
  │                    │  │   Storage            │  │        ┌────┴────┐       │
  │                    │  │     │                │  │        │         │       │
  │                    │  │     ▼                │  │        ▼         ▼       │
  │                    │  │   Azure AI Search    │  │   Vector DB   Retriever  │
  │                    │  │   Index              │  │   (direct)    Flow       │
  │                    │  │     │                │  │              (headless)  │
  │                    │  │     ▼                │  │                │         │
  │                    │  │   Documents          │  │                ▼         │
  │                    │  │                     │  │            Documents     │
  └─────────┬─────────┘  └─────────┬────────────┘  └────────────┬────────────┘
            │                      │                             │
            └──────────────────────┼─────────────────────────────┘
                                   ▼
                       ┌──────────────────────────┐
                       │   Merged RAG Context      │
                       │   → LLM Response          │
                       └──────────────────────────┘
Virtual: No retrieval infrastructure. llmknowledge uses the model’s training data; websearch injects live web results (Tavily, Brave, etc.).

Physical (Index): Documents live in storage (Azure Blob, OneDrive, or SharePoint) and are searchable via an Azure AI Search index. The index is auto-created per chat/user/library.

Physical (Connection): Documents are retrieved from an external system through a registered Data Connection (IDataConnection):
| Path | Type | Connection Target | How It Works |
|---|---|---|---|
| Vector → Connection | vectorretriever | Pinecone, Qdrant, Weaviate, Chroma, Redis, pgvector, etc. | vectorConnectionId resolves to a Data Connection → direct vector similarity search → documents |
| Connection → Flow Retriever | flowretriever | SQL databases, REST APIs, custom logic | flowId resolves to a headless retriever flow → flow executes with connection params → documents |
Both connection-based paths are ACL-checked — the user must have access to the Data Connection in the registry before the query executes.
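The routing described in this section can be sketched as a lookup from source type to retrieval path, followed by an access check for connection-backed sources. The names RETRIEVAL_PATH and planRetrieval, the single connectionId field, and the choice to silently skip sources the user cannot access are assumptions for illustration; the real pipeline may reject the request instead.

```typescript
type DataSourceType =
  | "shared" | "personal" | "workspace" | "sharepoint"
  | "flowretriever" | "vectorretriever" | "websearch" | "llmknowledge";

type RetrievalPath = "virtual" | "index" | "connection";

// One path per type, following the three columns of the diagram above.
const RETRIEVAL_PATH: Record<DataSourceType, RetrievalPath> = {
  llmknowledge: "virtual",
  websearch: "virtual",
  shared: "index",
  personal: "index",
  workspace: "index",
  sharepoint: "index",
  vectorretriever: "connection",
  flowretriever: "connection",
};

interface SourceSketch {
  id: string;
  type: DataSourceType;
  enabled?: boolean;       // defaults to true
  connectionId?: string;   // hypothetical FK to the Data Connection
}

// Groups enabled source ids by path; connection sources are kept only
// when the caller-supplied ACL check passes.
function planRetrieval(
  sources: SourceSketch[],
  canUseConnection: (connectionId: string) => boolean
): Record<RetrievalPath, string[]> {
  const plan: Record<RetrievalPath, string[]> = { virtual: [], index: [], connection: [] };
  for (const s of sources) {
    if (s.enabled === false) continue;
    const path = RETRIEVAL_PATH[s.type];
    if (path === "connection") {
      if (!s.connectionId || !canUseConnection(s.connectionId)) continue;
    }
    plan[path].push(s.id);
  }
  return plan;
}
```

Each group would then be queried through its own mechanism and the results merged into one RAG context.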

Security Model

The chat security guard (chatSecurityGuard.ts) validates datasource access differently per kind:
| Kind | Validation | Details |
|---|---|---|
| Kind 1 (Virtual) | Trusted | No index to protect. Enabled/disabled per chat config. |
| Kind 2 (File Storage) | Prefix / ownership | Personal/Workspace: index name prefix must match user’s UPN. Shared: user must be entitled to the chat that owns the index. |
| Kind 3 (Admin External) | Catalog ACL | datasourceId is looked up in the catalog → entitlement service checks the user’s access against the catalog entity’s ACL. |
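The per-kind checks in the table above might look roughly like this. It is not the code in chatSecurityGuard.ts: the request and dependency shapes, the way the kind is derived from origin and type, and the deny-by-default branch are all assumptions, and the Data Connection ACL check for connection-backed types is not modelled here.

```typescript
// Hypothetical request describing one source being validated.
interface GuardRequest {
  type: string;
  origin: "builtIn" | "admin";
  indexName: string;         // resolved index name about to be queried
  datasourceId?: string;
  owningChatId?: string;     // assumption: chat that owns a shared index
}

// Hypothetical lookups supplied by the caller.
interface GuardDeps {
  userIndexPrefix: string;   // assumption: derived from the sanitized UPN
  isEntitledToChat: (chatId: string) => boolean;
  isEntitledToDatasource: (datasourceId: string) => boolean;
}

function isSourceAllowed(req: GuardRequest, deps: GuardDeps): boolean {
  // Kind 3: admin entries are checked against the catalog ACL.
  if (req.origin === "admin") {
    return req.datasourceId !== undefined && deps.isEntitledToDatasource(req.datasourceId);
  }
  switch (req.type) {
    // Kind 2, personal/workspace: index name must start with the user's prefix.
    case "personal":
    case "workspace":
      return deps.userIndexPrefix.length > 0
        && req.indexName.indexOf(deps.userIndexPrefix) === 0;
    // Kind 2, shared: user must be entitled to the owning chat.
    case "shared":
      return req.owningChatId !== undefined && deps.isEntitledToChat(req.owningChatId);
    // Kind 1: nothing to protect.
    case "llmknowledge":
    case "websearch":
      return true;
    // Anything else is denied in this sketch.
    default:
      return false;
  }
}
```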

Built-In Datasource IDs

The following UUIDs are reserved for built-in datasource templates (prefix 0a):
| ID | Name | Type |
|---|---|---|
| 0a..001 | LLM Knowledge | llmknowledge |
| 0a..002 | Workspace | workspace |
| 0a..003 | Shared Files | shared |
| 0a..004 | Personal Files | personal |
| 0a..005 | Web Search | websearch |
| 0a..006 | Flow Retriever | flowretriever |
| 0a..007 | Vector Store | vectorretriever |
| 0a..008 | SharePoint | sharepoint |

Admin UI

The Datasources admin page (Admin → Datasource Catalog [#/admin/datasources]) presents all catalog entries in a data grid:
  • Built-in entries show a lock icon and cannot be deleted. Only name, description, tags, and permissions are editable.
  • Admin entries support full CRUD including connection details, type, and permissions.
  • Workspace Datasource Settings configure which catalog entries are activated by default for workspace and shared chats (IAppSettings.workspaceDataSources[] and defaultChatDataSources[]).

Chat Types & Data Sources

Chats in Findable use a multi-source RAG architecture. Each chat’s dataSources array (IDataSourceConfig[]) is the sole source of truth for where documents come from. A chat with an empty array operates in LLM-only mode (no retrieval).

Data Source Types (DATASOURCE_TYPE)

Each entry in the dataSources array has a type that determines storage, indexing, and retrieval behaviour:
| Type | Enum Value | Storage | Search Index | Use Case |
|---|---|---|---|---|
| Shared | shared | Azure Blob → Shared/{folderName}/ | Auto-created per folder | Team knowledge bases with uploaded documents |
| Personal | personal | Azure Blob → Personal/{sanitized-upn}/ | Auto-created per user | Private user files (persisted chats) |
| Workspace | workspace | Azure Blob Private/ or OneDrive | Resolved at runtime from user UPN | Ephemeral session workspace files |
| SharePoint | sharepoint | SharePoint Online | Auto-created per library | Index and search SharePoint document libraries |
| Flow Retriever | flowretriever | N/A (virtual) | N/A (flow returns documents directly) | Headless retriever flows that query databases, APIs, or custom logic |
| Vector Retriever | vectorretriever | External vector store | N/A (queries vector DB directly) | Query Pinecone, Qdrant, Weaviate, Chroma, Redis, pgvector, etc. |
| Web Search | websearch | N/A (live results) | N/A | Ground responses in real-time web search results |
| LLM Knowledge | llmknowledge | N/A | N/A | LLM-only mode: no retrieval, uses model’s training data |

IDataSourceConfig Fields

Every data source entry supports the following fields:
| Field | Type | Description |
|---|---|---|
| id | string | Unique identifier for this source |
| datasourceId | string? | FK to IDataSource.id in the datasource catalog (provenance link) |
| indexName | string | Azure Search index name. PERSONAL: persisted from UPN + chatTitle. WORKSPACE: empty, resolved at runtime. |
| searchEndpoint | string? | Search endpoint ID (uses default if not set) |
| weight | number? | Result weighting 0.0–1.0 (default: 1.0) |
| maxResults | number? | Max documents from this source |
| filter | string? | Static OData filter expression |
| label | string? | Display name (e.g. “Company Docs”, “My Files”) |
| type | DATASOURCE_TYPE? | Source type; determines storage and retrieval |
| enabled | boolean? | Whether this source is active (default: true) |
| enableAclFiltering | boolean? | Enable native ACL filtering on the search index |
| allowUserToggle | boolean? | End users can disable this source at runtime from the sidebar |
| allowUserWeightEdit | boolean? | End users can adjust the source weight at runtime |
| allowUserMaxResultsEdit | boolean? | End users can adjust max results at runtime |
| selectedFolder | string? | Blob folder name for file-backed sources |
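For orientation, the field table above translates into roughly the following TypeScript shape. The interface is transcribed from the table, not copied from the codebase, so the name DataSourceConfigSketch, the string-literal union standing in for DATASOURCE_TYPE, and the helper functions are illustrative assumptions. The helpers only encode the documented defaults (enabled: true, weight: 1.0) and the rule that an empty array means LLM-only mode.

```typescript
type DataSourceTypeSketch =
  | "shared" | "personal" | "workspace" | "sharepoint"
  | "flowretriever" | "vectorretriever" | "websearch" | "llmknowledge";

interface DataSourceConfigSketch {
  id: string;
  datasourceId?: string;
  indexName: string;
  searchEndpoint?: string;
  weight?: number;
  maxResults?: number;
  filter?: string;
  label?: string;
  type?: DataSourceTypeSketch;
  enabled?: boolean;
  enableAclFiltering?: boolean;
  allowUserToggle?: boolean;
  allowUserWeightEdit?: boolean;
  allowUserMaxResultsEdit?: boolean;
  selectedFolder?: string;
}

// enabled defaults to true when omitted.
function isEnabled(source: DataSourceConfigSketch): boolean {
  return source.enabled !== false;
}

// weight defaults to 1.0 when omitted.
function effectiveWeight(source: DataSourceConfigSketch): number {
  return source.weight === undefined ? 1.0 : source.weight;
}

// A chat with an empty dataSources array runs without retrieval.
function isLlmOnly(dataSources: DataSourceConfigSketch[]): boolean {
  return dataSources.length === 0;
}

// Example entry for a shared source after its first upload.
const companyDocs: DataSourceConfigSketch = {
  id: "src-1",
  indexName: "shared-my-project",
  type: "shared",
  label: "Company Docs",
  weight: 0.8,
  allowUserToggle: true,
  selectedFolder: "My Project",
};
```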

Multi-Source Result Merging

When a chat has multiple data sources, results are merged using the resultMergeStrategy field (RESULT_MERGE_STRATEGY enum) on IChatTabDBItem:
| Strategy | Enum Value | Description |
|---|---|---|
| Interleave | interleave | Round-robin results from each source (default) |
| Weighted | weighted | Score-based ranking with per-source weight values |
| Sequential | sequential | Results from source 1, then source 2, etc. |
The deduplicateResults boolean flag removes duplicate documents that appear across multiple sources.
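A minimal sketch of the three strategies follows. It makes several assumptions that are not stated above: documents carry an id and a score, the weighted strategy ranks by score multiplied by the source weight, and deduplication keeps the first occurrence of each id. The function name mergeResults and the input shapes are hypothetical.

```typescript
type ResultMergeStrategy = "interleave" | "weighted" | "sequential";

interface RetrievedDoc {
  id: string;
  score: number;
}

interface SourceResults {
  weight?: number;          // defaults to 1.0
  docs: RetrievedDoc[];     // already ranked within the source
}

function mergeResults(
  sources: SourceResults[],
  strategy: ResultMergeStrategy = "interleave",
  deduplicateResults: boolean = false
): RetrievedDoc[] {
  let merged: RetrievedDoc[] = [];

  if (strategy === "sequential") {
    // Source 1 in full, then source 2, and so on.
    for (const s of sources) merged = merged.concat(s.docs);
  } else if (strategy === "interleave") {
    // Round-robin: first doc of every source, then second, and so on.
    const longest = Math.max(0, ...sources.map((s) => s.docs.length));
    for (let i = 0; i < longest; i++) {
      for (const s of sources) {
        if (i < s.docs.length) merged.push(s.docs[i]);
      }
    }
  } else {
    // Weighted: rank by score * source weight, highest first (assumed formula).
    const scored: { doc: RetrievedDoc; weighted: number }[] = [];
    for (const s of sources) {
      const w = s.weight ?? 1.0;
      for (const doc of s.docs) scored.push({ doc, weighted: doc.score * w });
    }
    scored.sort((a, b) => b.weighted - a.weighted);
    merged = scored.map((entry) => entry.doc);
  }

  if (!deduplicateResults) return merged;

  // Keep the first occurrence of each document id.
  const seen: Record<string, boolean> = {};
  return merged.filter((doc) => {
    if (seen[doc.id]) return false;
    seen[doc.id] = true;
    return true;
  });
}
```

Deduplicating after the merge means the surviving copy of a duplicated document is the one the chosen strategy ranked first.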

Chat Classification (CHAT_KIND)

CHAT_KIND is a virtual, non-persisted runtime label derived at query time by getChatKind(chat). It is never stored in Cosmos DB and must not be used for ACL or storage decisions — those are governed by EntityScope.
| Kind | Derivation | Description |
|---|---|---|
| CHAT_KIND.SHARED | scope === EntityScope.Shared | ACL-governed shared chat |
| CHAT_KIND.WORKSPACE | isWorkspace === true (server-set) | Ephemeral workspace session |
| CHAT_KIND.PERSONAL | All other cases | Personal persisted chat |
Use isSharedChat(), isWorkspaceChat(), isPersonalChat() from @eaai/shared rather than reading the enum value directly.
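The derivation rules in the table can be sketched as below. This is not the @eaai/shared implementation: the string values chosen for the kinds and for the shared scope are hypothetical, the names carry a Sketch suffix to make that clear, and the sketch assumes the rules are evaluated in table order (shared first, then workspace, then personal).

```typescript
// Hypothetical values; the real enums live in @eaai/shared.
const CHAT_KIND_SKETCH = {
  SHARED: "shared",
  WORKSPACE: "workspace",
  PERSONAL: "personal",
} as const;

type ChatKindSketch = (typeof CHAT_KIND_SKETCH)[keyof typeof CHAT_KIND_SKETCH];

const ENTITY_SCOPE_SHARED_SKETCH = "shared"; // stand-in for EntityScope.Shared

// Reduced chat shape with only the fields the derivation reads.
interface ChatSketch {
  scope?: string;
  isWorkspace?: boolean;
}

// Derived on demand; the result is never persisted.
function getChatKindSketch(chat: ChatSketch): ChatKindSketch {
  if (chat.scope === ENTITY_SCOPE_SHARED_SKETCH) return CHAT_KIND_SKETCH.SHARED;
  if (chat.isWorkspace === true) return CHAT_KIND_SKETCH.WORKSPACE;
  return CHAT_KIND_SKETCH.PERSONAL;
}

// Predicate helper in the style of isSharedChat() / isWorkspaceChat().
function isWorkspaceChatSketch(chat: ChatSketch): boolean {
  return getChatKindSketch(chat) === CHAT_KIND_SKETCH.WORKSPACE;
}
```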

Key Files

| File | Purpose |
|---|---|
| shared/src/domain/datasources/datasource.ts | IDataSource interface, DataSourceOrigin enum, BUILT_IN_LOCKED_FIELDS |
| src/bootstrap/bootstrapdatasources.ts | Built-in datasource definitions and default workspace/chat settings |
| src/datasources/seedDatasources.ts | Bootstrap seed logic with migration support |
| src/datasources/datasourcesrouter.ts | CRUD API with origin-based guards |
| src/azure/ai/chatSecurityGuard.ts | Runtime datasource validation per kind |
| client/src/components/Datasources/ | Admin UI: catalog grid, editor dialog, workspace settings |