Enrichment Agent


The Enrichment Agent enriches entries in a vector store, such as Pinecone, with classification metadata. It invokes an external classification service and persists the results back to the vector store. It supports both scheduled enrichment and on-demand enrichment through an API.

This service operates as part of the PlainID Edge enrichment flow and prepares vector metadata for downstream authorization and governance use cases.


Configuration

Configuration is loaded from a YAML file and merged with default values. All configuration values support environment variable substitution.

Configuration File and Loading

  • Default path: config/config.yaml, or a path provided to the application at startup.

  • Environment variable substitution:

    • Use ${VAR} for required variables.
    • Use ${VAR:default} for optional variables with a default value, for example ${LOG_LEVEL:info}.
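
The substitution rules above can be sketched in Python as follows. This is an illustrative sketch, not the agent's actual loader: the function name substitute, the regular expression, and the choice to raise an error for a missing required variable are assumptions made for the example.

```python
import os
import re

# Matches ${VAR} and ${VAR:default}; the default may be empty, as in ${JWKS_URL:}.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)(?::([^}]*))?\}")


def substitute(text, env=None):
    """Replace ${VAR} and ${VAR:default} placeholders in text."""
    env = os.environ if env is None else env

    def _replace(match):
        name, default = match.group(1), match.group(2)
        if name in env:
            return env[name]
        if default is not None:
            return default
        # Assumption: a missing required variable is treated as an error.
        raise KeyError(f"required environment variable {name} is not set")

    return _PLACEHOLDER.sub(_replace, text)
```

For example, substitute("level: ${LOG_LEVEL:info}", {}) falls back to the default and returns "level: info", while the same call with {"LOG_LEVEL": "debug"} returns "level: debug".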

The following top-level keys are used by the application: server, log, http, management, jwt, and databases.

The framework consumes server and log. All other sections are consumed by the enrichment agent application.


Parameters

Section       Description
server.name   Optional. Application name.
log.level     Optional. Log level, for example info or debug.
http          HTTP server and API configuration.
management    Health and metrics server configuration.
jwt           JWT validation configuration for the enrichment API.
databases     List of vector databases and enrichment targets.

http Parameters

  • port (integer, default 8080). API server port.

Other fields such as useMux, openApiSpecPath, enableXSSValidator, xssWhitelistType, enableExternalMonitor, and externalMonitorPath follow the micro-infra HttpConfig. The application may apply default values in code.


Management Parameters

  • port (integer, default 8081). Port for health and metrics endpoints.
  • prefix (string, default /health). Path prefix for readiness, liveness, and metrics endpoints, for example /health/readiness or /health/metrics.

JWT Parameters

  • jwksUrl (string). URL of the JWKS endpoint. Required when enabled is set to true.
  • enabled (boolean). Enables or disables JWT validation for the enrichment API.

Databases Parameters

Each entry in the databases array defines a single enrichment target.

Parameter              Type    Required  Description
id                     string  Yes       Unique database identifier used by the scheduler and API.
type                   string  Yes       Vendor type, for example PINECONE.
periodicStart          string  No        Cron expression for scheduled enrichment. An empty value disables scheduling.
classificationService  object  Yes       Classification service used to assign categories to vectors.
metadataKey            string  No        Metadata key where the category is written in the vector store. Default is category.
vendor                 object  Yes       Vendor-specific connection and filtering configuration.
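
As an illustration of the required fields and defaults listed above, the following Python sketch checks a single databases entry that has already been parsed from YAML into a dictionary. The function name normalize_database and the derived scheduled flag are hypothetical and are not part of the agent.

```python
REQUIRED_KEYS = ("id", "type", "classificationService", "vendor")


def normalize_database(entry):
    """Return a copy of one databases entry with documented defaults applied."""
    missing = [key for key in REQUIRED_KEYS if key not in entry]
    if missing:
        raise ValueError(f"database entry is missing required keys: {missing}")

    normalized = dict(entry)
    # Documented default: the category is written under the key "category".
    normalized.setdefault("metadataKey", "category")
    # Hypothetical helper flag: an empty or absent periodicStart disables scheduling.
    normalized["scheduled"] = bool(entry.get("periodicStart"))
    return normalized
```

An entry without metadataKey and periodicStart comes back with metadataKey set to category and scheduled set to False.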

Classification Service Parameters

  • serviceUrl (string, required). Base URL of the HTTP classification service.

Pinecone Vendor Configuration

  • pinecone (object, required):

    • apiKey (string, required). Pinecone API key.
  • collections (object, optional). Controls which namespaces are processed. Matching is applied to the string indexName_namespaceName, for example my-index_users.

    • mode (string). One of include or exclude.
    • patterns (array of strings). Regular expression patterns.

With include, only namespaces matching at least one pattern are processed. With exclude, all namespaces are processed except those matching at least one pattern.

Filter Behavior

If no patterns are defined:

  • include processes all namespaces.
  • exclude processes no namespaces.

Patterns are compiled and evaluated as regular expressions.
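
The filter rules can be sketched in Python as shown below. This is a sketch under stated assumptions rather than the agent's implementation: the function name should_process is hypothetical, patterns are assumed to match anywhere in the indexName_namespaceName string (unanchored search), and a missing collections block is assumed to mean that every namespace is processed.

```python
import re


def should_process(index_name, namespace, mode=None, patterns=()):
    """Decide whether a namespace is enriched under a collections filter."""
    if mode is None:
        # Assumption: no collections block means no filtering.
        return True
    if mode not in ("include", "exclude"):
        raise ValueError(f"unknown mode: {mode!r}")
    if not patterns:
        # Documented behavior when no patterns are defined.
        return mode == "include"
    target = f"{index_name}_{namespace}"
    # Assumption: unanchored matching, so "users" matches "my-index_users".
    matched = any(re.search(pattern, target) for pattern in patterns)
    return matched if mode == "include" else not matched
```

With mode exclude and the patterns users and books_.*, the namespace my-index_users is skipped and my-index_orders is processed. If the real service anchors its patterns, results for short patterns such as users would differ.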


Configuration Examples

Minimal Pinecone Configuration

Below is an example configuration with a single database, no schedule, and no namespace filtering:

server:
  name: enrichment-agent
log:
  level: info
jwt:
  jwksUrl: ${JWKS_URL:}
  enabled: false
databases:
  - id: pineconeDb
    type: PINECONE
    classificationService:
      serviceUrl: http://localhost:8000/classify
    metadataKey: category
    vendor:
      pinecone:
        apiKey: ${PINECONE_API_KEY}

Full Pinecone Configuration

Below is an example with scheduled enrichment, namespace filtering, JWT enabled, and environment variable substitution.

Key elements include:

  • periodicStart: "0 * 20 * * ?" uses a six-field cron format with seconds. Because the minute field is *, enrichment is triggered every minute from 20:00 through 20:59 each day. To run once a day at 20:00, use "0 0 20 * * ?".
  • collections.mode and collections.patterns control namespace inclusion or exclusion.
  • Environment variables such as CLASSIFICATION_SERVICE_URL, PINECONE_API_KEY, JWKS_URL, JWT_VALIDATION_ENABLED, and LOG_LEVEL supply the substituted values.

server:
  name: enrichment-agent
log:
  level: ${LOG_LEVEL:info}
jwt:
  jwksUrl: ${JWKS_URL:}
  enabled: ${JWT_VALIDATION_ENABLED:true}
databases:
  - id: pineconeDb
    type: PINECONE
    periodicStart: "0 * 20 * * ?"
    classificationService:
      serviceUrl: ${CLASSIFICATION_SERVICE_URL}
    metadataKey: category
    vendor:
      pinecone:
        apiKey: ${PINECONE_API_KEY}
      collections:
        mode: exclude
        patterns:
          - users
          - books_.*

Multiple Pinecone Databases

Multiple enrichment targets can be defined with different identifiers, API keys, classification services, filters, or metadata keys.

Example:

databases:
  - id: pineconeProd
    type: PINECONE
    periodicStart: "0 0 2 * * ?"
    classificationService:
      serviceUrl: https://classifier.example.com/classify
    metadataKey: category
    vendor:
      pinecone:
        apiKey: ${PINECONE_PROD_API_KEY}
      collections:
        mode: include
        patterns:
          - "index1_.*"
  - id: pineconeStaging
    type: PINECONE
    classificationService:
      serviceUrl: http://localhost:8000/classify
    metadataKey: category
    vendor:
      pinecone:
        apiKey: ${PINECONE_STAGING_API_KEY}

Cron Scheduling (periodicStart)

The periodicStart field uses a six-field cron format, including seconds:

second minute hour day-of-month month day-of-week

Example:

"0 * 20 * * ?" runs at second 0 of every minute from 20:00 through 20:59 each day. To run once a day at 20:00:00, use "0 0 20 * * ?".

Leave this value empty to disable scheduled enrichment for a specific database.
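
To make the field order concrete, the following Python sketch splits a six-field expression into named fields and checks whether a given time satisfies it. It is a simplified illustration, not the scheduler used by the agent: the names parse_cron and matches are hypothetical, and only plain integers and the wildcards * and ? are handled.

```python
from datetime import datetime

FIELDS = ("second", "minute", "hour", "day_of_month", "month", "day_of_week")


def parse_cron(expression):
    """Split a six-field cron expression into named fields."""
    parts = expression.split()
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected 6 fields, got {len(parts)}")
    return dict(zip(FIELDS, parts))


def matches(expression, moment):
    """Check whether moment satisfies the expression.

    Ranges, lists, steps, and day-of-week values are out of scope here.
    """
    fields = parse_cron(expression)
    if fields["day_of_week"] not in ("*", "?"):
        raise ValueError("day-of-week values are not supported in this sketch")
    actual = {
        "second": moment.second,
        "minute": moment.minute,
        "hour": moment.hour,
        "day_of_month": moment.day,
        "month": moment.month,
    }
    for name, value in actual.items():
        spec = fields[name]
        # A wildcard accepts any value; otherwise the integer must be equal.
        if spec not in ("*", "?") and int(spec) != value:
            return False
    return True
```

Under these rules, "0 0 2 * * ?" matches 02:00:00 only, whereas "0 * 20 * * ?" also matches 20:15:00 because the minute field is a wildcard.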


Common Environment Variables

The following environment variables are commonly used.

Variable                    Description
LOG_LEVEL                   Log level, for example info or debug.
JWKS_URL                    JWKS URL for JWT validation. Required when JWT is enabled.
JWT_VALIDATION_ENABLED      Enables or disables JWT validation.
CLASSIFICATION_SERVICE_URL  Base URL of the classification HTTP service.
PINECONE_API_KEY            Pinecone API key.

Define environment variables in the runtime environment, for example through Helm env or secret values, and reference them in the configuration file using the ${VAR} or ${VAR:default} syntax.