Solr Tutorial: Advanced Exercise - Vector Search

Posted on December 10, 2024

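This exercise introduces Solr's vector search capabilities and shows how to use vector embeddings to build semantic search and recommendation systems.

What Is Vector Search?

Vector search is a search technique based on the mathematical similarity between vectors:

Converts text, images, and other content into high-dimensional vectors
Finds similar content by computing the distance between vectors
Matches on semantic meaning rather than simple keyword overlap
Supports recommendation systems and similarity search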

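Core Concepts

Vector Embeddings

Convert content into fixed-dimension numeric vectors
Similar content maps to vectors that lie close together in the space
Can capture semantic relationships

Similarity Measures

Cosine similarity: measures how closely the directions of two vectors align
Euclidean distance: measures the straight-line distance between two vectors
Dot product: a similarity measure that combines magnitude and direction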
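The three measures above can be compared on a tiny example. This is a minimal sketch in plain Python; the function names and the sample vectors are illustrative assumptions, not part of Solr.

```python
import math

def dot(a, b):
    """Dot product: grows with both alignment and vector length."""
    return sum(x * y for x, y in zip(a, b))

def euclidean_distance(a, b):
    """Straight-line distance between the two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between the vectors; ignores their length."""
    norm_a = math.sqrt(dot(a, a))
    norm_b = math.sqrt(dot(b, b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot(a, b) / (norm_a * norm_b)

a = [1.0, 0.0]
b = [2.0, 0.0]  # same direction as a, twice as long
c = [0.0, 1.0]  # perpendicular to a

for name, other in (("b", b), ("c", c)):
    print(name, cosine_similarity(a, other), euclidean_distance(a, other), dot(a, other))
```

Note that `a` and `b` point in the same direction, so their cosine similarity is maximal even though their Euclidean distance is not zero.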

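The HNSW Algorithm

Hierarchical Navigable Small World (HNSW):

An efficient approximate nearest neighbor search algorithm
Builds a multi-layer graph structure to speed up search
Strikes a good balance between accuracy and performance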
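For contrast, the exact search that HNSW approximates can be sketched as a brute-force scan over every vector. The helper `exact_top_k` and the sample data are hypothetical; this is a baseline for reasoning about the algorithm, not how Solr searches.

```python
import heapq
import math

def exact_top_k(query, vectors, k):
    """Return the ids of the k vectors closest to query (Euclidean distance).

    Every vector is compared with the query, so the cost grows linearly
    with the collection size. HNSW avoids this full scan by walking a graph.
    """
    def distance(item):
        _, vector = item
        return math.dist(query, vector)

    nearest = heapq.nsmallest(k, vectors.items(), key=distance)
    return [doc_id for doc_id, _ in nearest]

vectors = {"a": [0.0, 0.0], "b": [1.0, 1.0], "c": [5.0, 5.0]}
print(exact_top_k([0.9, 0.9], vectors, k=2))
```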

Setting Up Vector Search

1. Create a Collection with a Vector Field

Define a schema with a 10-dimensional vector field:

<fieldType name="knn_vector" class="solr.DenseVectorField" 
           vectorDimension="10" 
           similarityFunction="cosine"/>

<field name="movie_vector" type="knn_vector" indexed="true" stored="true"/>

Alternatively, use the Schema API:

curl -X POST -H 'Content-type:application/json' \
  --data-binary '{
    "add-field-type": {
      "name": "knn_vector",
      "class": "solr.DenseVectorField",
      "vectorDimension": 10,
      "similarityFunction": "cosine",
      "knnAlgorithm": "hnsw"
    }
  }' http://localhost:8983/solr/films/schema

curl -X POST -H 'Content-type:application/json' \
  --data-binary '{
    "add-field": {
      "name": "movie_vector",
      "type": "knn_vector",
      "indexed": true,
      "stored": true
    }
  }' http://localhost:8983/solr/films/schema

2. Index Vector Data

A sample document with a vector:

{
  "id": "movie_1",
  "name": "The Matrix",
  "genre": ["Sci-Fi", "Action"],
  "movie_vector": [0.1, 0.5, 0.3, -0.2, 0.8, 0.4, -0.1, 0.6, 0.2, 0.9]
}

Bulk indexing:

curl -X POST -H 'Content-type:application/json' \
  --data-binary '[
    {
      "id": "1",
      "name": "Inception",
      "movie_vector": [0.2, 0.6, 0.4, -0.1, 0.7, 0.5, 0.0, 0.8, 0.3, 0.9]
    },
    {
      "id": "2", 
      "name": "Interstellar",
      "movie_vector": [0.3, 0.5, 0.4, -0.2, 0.8, 0.6, -0.1, 0.7, 0.2, 0.85]
    }
  ]' 'http://localhost:8983/solr/films/update?commit=true'
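The same bulk update can be prepared from a script. The sketch below builds the request with Python's standard library and leaves the actual send commented out, since it assumes a Solr instance with a `films` collection at the URL used in this tutorial; `build_update_request` is a hypothetical helper name.

```python
import json
import urllib.request

SOLR_UPDATE_URL = "http://localhost:8983/solr/films/update?commit=true"

def build_update_request(movies, url=SOLR_UPDATE_URL):
    """Build (but do not send) a JSON update request for a list of documents."""
    body = json.dumps(movies).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

movies = [
    {"id": "1", "name": "Inception",
     "movie_vector": [0.2, 0.6, 0.4, -0.1, 0.7, 0.5, 0.0, 0.8, 0.3, 0.9]},
    {"id": "2", "name": "Interstellar",
     "movie_vector": [0.3, 0.5, 0.4, -0.2, 0.8, 0.6, -0.1, 0.7, 0.2, 0.85]},
]
request = build_update_request(movies)

# To index the documents against a running Solr instance:
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```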

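Advanced Vector Search Techniques

1. Recommendation Search

Find the movies most similar to a given movie:

# Run the KNN clause as the main query so the 5 nearest movies come back ranked by similarity
curl "http://localhost:8983/solr/films/select" -d '
{
  "query": "{!knn f=movie_vector topK=5}[0.1, 0.5, 0.3, -0.2, 0.8, 0.4, -0.1, 0.6, 0.2, 0.9]"
}'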

2. Filter Out Watched Movies

Exclude movies the user has already seen from the recommendations:

curl "http://localhost:8983/solr/films/select" -d '
{
  "query": "{!knn f=movie_vector topK=10}[0.1, 0.5, 0.3, -0.2, 0.8, 0.4, -0.1, 0.6, 0.2, 0.9]",
  "filter": [
    "-id:(movie_1 OR movie_5 OR movie_8)"
  ]
}'
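In an application, the query vector and the watched list usually come from user data, so the request body is assembled in code. A minimal sketch, assuming the `movie_vector` field from this tutorial and placing the KNN clause in the main query so results are ranked by similarity; the helper names are hypothetical.

```python
import json

def knn_query(field, vector, top_k):
    """Format a vector as a {!knn} local-params query string."""
    values = ", ".join(str(float(v)) for v in vector)
    return f"{{!knn f={field} topK={top_k}}}[{values}]"

def exclude_ids_filter(ids):
    """Build a negative filter that removes the given document ids."""
    return "-id:(" + " OR ".join(ids) + ")"

def recommendation_request(vector, watched_ids, top_k=10):
    """Return the JSON body for a recommendation query."""
    body = {"query": knn_query("movie_vector", vector, top_k)}
    if watched_ids:
        body["filter"] = [exclude_ids_filter(watched_ids)]
    return json.dumps(body)

print(recommendation_request([0.1, 0.5, 0.3], ["movie_1", "movie_5"], top_k=5))
```

The ids are inserted verbatim, so ids containing spaces or query syntax characters would need escaping first.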

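3. Search Within a Genre

Search for similar movies only within a specific genre:

# KNN is the main query so results are ranked by similarity; the genre clause only filters
curl "http://localhost:8983/solr/films/select" -d '
{
  "query": "{!knn f=movie_vector topK=5}[0.1, 0.5, 0.3, -0.2, 0.8, 0.4, -0.1, 0.6, 0.2, 0.9]",
  "filter": "genre:Sci-Fi"
}'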

4. Re-rank Results

Re-rank search results by vector similarity:

# rq and rqq are plain request parameters, so they are passed under "params"
curl "http://localhost:8983/solr/films/select" -d '
{
  "query": "name:star",
  "params": {
    "rq": "{!rerank reRankQuery=$rqq reRankDocs=100 reRankWeight=2}",
    "rqq": "{!knn f=movie_vector topK=100}[0.1, 0.5, 0.3, -0.2, 0.8, 0.4, -0.1, 0.6, 0.2, 0.9]"
  }
}'
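With the default additive operator, the re-rank stage adds `reRankWeight` times the re-rank query score to the original score for the top `reRankDocs` documents. The sketch below is a simplified model of that arithmetic with made-up scores, not Solr's implementation; `rerank` is a hypothetical helper.

```python
def rerank(original_scores, rerank_scores, rerank_docs=100, rerank_weight=2.0):
    """Re-score the top rerank_docs hits; the remaining hits keep their order."""
    ranked = sorted(original_scores.items(), key=lambda kv: kv[1], reverse=True)
    head, tail = ranked[:rerank_docs], ranked[rerank_docs:]
    rescored = [
        # Documents that do not match the re-rank query contribute 0.
        (doc_id, score + rerank_weight * rerank_scores.get(doc_id, 0.0))
        for doc_id, score in head
    ]
    rescored.sort(key=lambda kv: kv[1], reverse=True)
    return rescored + tail

lexical = {"a": 3.0, "b": 2.0, "c": 1.0}   # illustrative scores for name:star
vector = {"b": 0.9, "c": 0.2}              # illustrative KNN similarity scores
result = rerank(lexical, vector, rerank_docs=2, rerank_weight=2.0)
print(result)
```

In this example `b` overtakes `a` after re-ranking, while `c` sits outside the re-rank window and keeps its original score.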

5. Hybrid Scoring

Combine lexical search with vector similarity:

curl "http://localhost:8983/solr/films/select" -d '
{
  "query": {
    "bool": {
      "should": [
        "name:inception^1.0",
        {
          "boost": {
            "b": "2.0",
            "query": {
              "knn": {
                "f": "movie_vector",
                "topK": "10",
                "query": "[0.1, 0.5, 0.3, -0.2, 0.8, 0.4, -0.1, 0.6, 0.2, 0.9]"
              }
            }
          }
        }
      ]
    }
  }
}'

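Real-World Use Cases

1. E-commerce Recommendation Systems

{
  "product_vector": "[product feature vector]",
  "applications": [
    "Similar product recommendations",
    "Personalized search results",
    "Cross-sell suggestions",
    "Substitute recommendations for out-of-stock items"
  ]
}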

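2. Content Recommendation Platforms

{
  "content_vector": "[content embedding vector]",
  "use_cases": [
    "Related article recommendations",
    "Personalized news feeds",
    "Video recommendations",
    "Playlist generation"
  ]
}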

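3. Semantic Search Engines

{
  "text_vector": "[text embedding vector]",
  "features": [
    "Understanding query intent",
    "Automatic synonym recognition",
    "Multilingual search",
    "Question answering systems"
  ]
}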

Best Practices for Generating Vectors

1. Choose a Suitable Embedding Model

Text embeddings:

Sentence-BERT
OpenAI Embeddings
Google Universal Sentence Encoder

Image embeddings:

ResNet
CLIP
Vision Transformer

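2. Choosing the Vector Dimension

Low-dimensional (50-100):
- Faster search
- Less storage
- Suited to simple tasks

Medium-dimensional (200-500):
- Balances performance and accuracy
- The usual choice for most applications

High-dimensional (768-1536):
- Higher accuracy
- Requires more resources
- Suited to complex semantic tasks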
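The storage side of this trade-off is easy to estimate. The sketch below assumes 4 bytes per value (32-bit floats) and counts only the raw vectors, so the HNSW graph and any stored copies of the field add to these figures; `raw_vector_bytes` is a hypothetical helper.

```python
def raw_vector_bytes(num_docs, dimension, bytes_per_value=4):
    """Bytes needed for the raw vector values alone."""
    return num_docs * dimension * bytes_per_value

for dimension in (100, 384, 768, 1536):
    size_gib = raw_vector_bytes(1_000_000, dimension) / 2**30
    print(f"{dimension:>5} dims: {size_gib:.2f} GiB per million documents")
```

Doubling the dimension doubles this raw footprint, which is one reason lower-dimensional embeddings are attractive when accuracy allows it.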

3. Vector Normalization

# Python example: vector normalization
import numpy as np

def normalize_vector(vector):
    norm = np.linalg.norm(vector)
    if norm == 0:
        return vector
    return vector / norm
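One reason to normalize is that, for unit-length vectors, the dot product equals the cosine similarity, so a field configured with the `dot_product` similarity function can rank normalized vectors the same way `cosine` would. A small check in plain Python, with illustrative vectors and hypothetical helper names:

```python
import math

def normalize(vector):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return list(vector) if norm == 0 else [x / norm for x in vector]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [3.0, 4.0]
b = [4.0, 3.0]

# Both values agree up to floating-point rounding.
print(cosine(a, b), dot(normalize(a), normalize(b)))
```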

Performance Optimization

1. Tuning HNSW Parameters

<fieldType name="knn_vector" class="solr.DenseVectorField"
           vectorDimension="10"
           similarityFunction="cosine"
           knnAlgorithm="hnsw"
           hnswMaxConnections="16"
           hnswBeamWidth="100"/>

Parameters:

hnswMaxConnections: maximum number of connections per node in the graph (default 16)
hnswBeamWidth: number of nearest-neighbor candidates tracked while inserting each node during index construction (default 100)

2. Optimizing Bulk Indexing

# Use a larger batch size
curl -X POST -H 'Content-type:application/json' \
  --data-binary @vectors_batch_1000.json \
  'http://localhost:8983/solr/films/update?commit=false'

# Commit periodically
curl "http://localhost:8983/solr/films/update?commit=true"
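Splitting a large document list into fixed-size batches can be sketched as follows. Each batch would be posted with `commit=false`, followed by a single commit at the end; the generator name `batches` and the batch size of 1000 are illustrative assumptions.

```python
def batches(docs, batch_size=1000):
    """Yield consecutive slices of docs, each at most batch_size long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(docs), batch_size):
        yield docs[start:start + batch_size]

docs = [{"id": str(i)} for i in range(2500)]
for number, batch in enumerate(batches(docs), start=1):
    # Here each batch would be sent to /update?commit=false.
    print(f"batch {number}: {len(batch)} documents")
```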

3. Caching Strategy

<query>
  <filterCache size="512" initialSize="512" autowarmCount="0"/>
  <queryResultCache size="512" initialSize="512" autowarmCount="0"/>
  <documentCache size="512" initialSize="512" autowarmCount="0"/>
</query>

Monitoring and Debugging

1. Inspect the Vector Field

curl "http://localhost:8983/solr/films/schema/fields/movie_vector"

2. Debug Vector Search

Add the debug parameter:

curl "http://localhost:8983/solr/films/select?debug=true" -d '
{
  "query": "{!knn f=movie_vector topK=5}[0.1, 0.5, ...]"
}'

3. Performance Metrics

Key metrics to monitor:

Query latency
Indexing throughput
Memory usage
Cache hit rate

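Troubleshooting Common Problems

1. Vector Dimension Mismatch

Error: Vector dimension mismatch

Solutions:

Make sure every vector has the same dimension
Check the output of the embedding model
Verify the field definition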
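A quick way to locate the offending documents before indexing is to compare each vector's length with the dimension declared in the schema. A minimal sketch, assuming the 10-dimensional `movie_vector` field from this tutorial; `find_dimension_mismatches` is a hypothetical helper.

```python
def find_dimension_mismatches(docs, field="movie_vector", expected=10):
    """Return (id, actual_length) for every document whose vector length is wrong.

    A document without the field is reported with length 0.
    """
    mismatches = []
    for doc in docs:
        actual = len(doc.get(field, []))
        if actual != expected:
            mismatches.append((doc.get("id"), actual))
    return mismatches

docs = [
    {"id": "ok", "movie_vector": [0.1] * 10},
    {"id": "short", "movie_vector": [0.1] * 8},
    {"id": "missing"},
]
print(find_dimension_mismatches(docs))
```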

2. Inaccurate Search Results

Possible causes:

Poor vector quality
An unsuitable similarity function
A topK value that is too small
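To tell these causes apart, it can help to measure recall: the fraction of the true nearest neighbors (for example from an exact brute-force scan over a sample) that the approximate search actually returned. High recall with poor relevance points at the embeddings or the similarity function rather than at the search itself. A minimal sketch with made-up ids; `recall_at_k` is a hypothetical helper.

```python
def recall_at_k(approximate_ids, exact_ids):
    """Fraction of the exact top-k ids that also appear in the approximate results."""
    exact = set(exact_ids)
    if not exact:
        raise ValueError("exact_ids must not be empty")
    return len(exact & set(approximate_ids)) / len(exact)

approximate = ["movie_2", "movie_7", "movie_9"]   # ids returned by the KNN query
exact = ["movie_2", "movie_7", "movie_4"]         # ids from an exact scan
print(f"recall: {recall_at_k(approximate, exact):.2f}")
```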