Skip to main content

How to write

1. How to write data to a vector database

We use news data as an example to demonstrate how to write news data into the vector database and perform ANN Search. You can download the pre-prepared dataset from ipfs.

About Dataset

This dataset contains 82 news records, with content related to the release of the iPhone 16. The data includes the news ID, URL, title, and content. We hope to perform vector retrieval on the content to find news articles that are of interest to us.

FieldData Type
idBIGINT1853351417715849015
urlSTRINGhttps://www.macrumors.com/2024/09/11/iphone-17-pro-rumor-recap/
titleSTRINGSkipping the iPhone 16 Pro? Here's What's Rumored for iPhone 17 Pro
contentSTRINGWill you be skipping the iPhone 16 Pro and waiting another year to upgrade? If so, we already have
some iPhone 17 Pro rumors for you...
vector_titleFLOAT_VECTORembedding the title
vector_contentFLOAT_VECTORembedding the content

Prepare Data

Initialize the Milvus client and call create_schema() to define the table schema. Call add_field() to add fields, where title_vector and content_vector are of DataType.FLOAT_VECTOR type, which are respectively the results of embedding title and content. The embedding model needs to be selected based on different application scenarios. Meanwhile, the setting of the dim parameter is related to the specific model.

Python

from pymilvus import MilvusClient, DataType
milvus_client = MilvusClient(token='')
schema = MilvusClient.create_schema(auto_id=False, enable_dynamic_field=False)
schema.add_field(field_name="id", datatype=DataType.INT64, is_primary=True)
schema.add_field(field_name="url", datatype=DataType.VARCHAR, max_length=1000)
schema.add_field(field_name="title", datatype=DataType.VARCHAR, max_length=2000)
schema.add_field(field_name="content", datatype=DataType.VARCHAR, max_length=65535)
schema.add_field(field_name="title_vector", datatype=DataType.FLOAT_VECTOR, dim=1536) # dim取决于embeding模型的选择
schema.add_field(field_name="content_vector", datatype=DataType.FLOAT_VECTOR, dim=1536)
schema.verify()

Call add_index() to set index for fields that need to be searched. In this example, we create index on title_vector and content_vector because the dataset is small, we choose index type AUTOINDEX to let engine automatically select the index type. In production environment, the index type needs to be selected according to the characteristics of the dataset. Please refer to the documentation for more information.

Python

index_params = milvus_client.prepare_index_params()
index_params.add_index(field_name="id")
index_params.add_index(field_name="title_vector", index_type="AUTOINDEX", metric_type="L2")
index_params.add_index(field_name="content_vector", index_type="AUTOINDEX", metric_type="L2")

Call the create_collection() method to create a new data collection named news_demo.

Python

milvus_client.create_collection(collection_name="news_demo", schema=schema, index_params=index_params)

How to embeding

Choose a suitable model for vectorization. Here we use OpenAI's text-embedding-3-small model. You can choose other models like text-embedding-3-large, BAAI/bge-m3, etc. Different models have different dimensions. You can use dimensions to control the dimension of the vector returned by the model. Note that the dimension returned by the model needs to be consistent with the dimension defined in the schema. The default dimension of text-embedding-3-large is 1536.

Python

from openai import OpenAI
openai_client = OpenAI()

def get_embedding(text):
text = text.replace("\n", " ")
model = "text-embedding-3-small"
return openai_client.embeddings.create(input=[text], model=model, dimensions=1536).data[0].embedding

The expected return result of get_embedding() is a 1536-dimensional vector:

Json

[
0.07234696,
0.24083021,
-0.2613889,
-0.25943092,
-0.66101044,
-0.42331296,
0.38689473,
0.051249847,
0.15712704,
0.08947919
...
...
]

Insert Data

Start data writing, first vectorize title and content, then call insert method to write to vector database.

Python

import requests
import io

url = 'https://ipfs.glitterprotocol.dev/ipfs/Qmdnkm5a7PDL8AjS4zV2nbjaaw2ADpr3fPuH1taRZQkH27'
s = requests.get(url).content
df = pd.read_csv(io.StringIO(s.decode('utf-8')))
# Vectorize the title and content.
df['title_vector'] = df.title.apply(lambda x: get_embedding(x))
df['content_vector'] = df.content.apply(lambda x: get_embedding(x))
# write to milvus
for index, row in df.iterrows():
milvus_client.insert("news_demo", {
"id": row["id"],
"url": row["url"],
"title": row["title"],
"content": row["content"],
"title_vector": row["title_vector"],
"content_vector": row["content_vector"],
})

2.How to use engine-server

Run an engine-server node and modify its configuration file to support vector databases. The configuration file needs to specify the Milvus node address and embedding_provider related information. The configuration file is as follows, and the following text will explain the meaning of the configuration items:

Config

[ [ database ] ]
name = "vec" # the name of dataset
engine = "milvus" # the type of engine
[database.config]
endpoint = "localhost:19530" #milvus endpoint
dail_timeout = "3s"
[[database.config.embedding_provider]]
name="text-embedding-3-small"
api_key="sk-xxxxx"
endpoint="https://api.openai.com/v1/embeddings"
timeout = "3s"
default=true

How to use embedding_provider

The engine-server integrates with embedding_provider through TEXT_TO_VEC, where the embedding_provider service returns vectors that are compatible with Milvus indexes. You can use third-party embedding services, such as those provided by OpenAI, or develop your own embedding_provider.

Instructions for using TEXT_TO_VEC:

  • TEXT_TO_VEC('AI','BAAI/bge-m3') will match the provider named BAAI/bge-m3.
  • TEXT_TO_VEC('AI','text-embedding-3-small') will match the provider named text-embedding-3-small.
  • TEXT_TO_VEC('AI') will match the provider with default=true, which is text-embedding-3-small.

The first parameter of TEXT_TO_VEC is the text to be vectorized, and the second parameter is the embedding_provider to be called (the engine-server configuration file will specify relevant configurations).

Examples of using TEXT_TO_VEC:

SELECT id,
url,
title,
VECTOR_L2_DISTANCE(
content_vector,
TEXT_TO_VEC('Your text string goes here')
) AS distance
FROM vec.news_demo
ORDER BY distance LIMIT
2;

How to develop an embedding_provider

An embedding_provider service needs to adhere to the following protocol, which is compatible with OPENAI-Embedding:

Request body:

  • model_name(string) The name of the model used for encoding. Required.
  • api_key(string) The key for accessing the embedding service. Required.
  • dimensions(int) The dimensions of the embedding result. Optional.

Returns:

A list of embedding objects.

An example request to the embedding_provider is as follows, which is initiated by the engine-server:

curl

curl --location --request POST 'https://your.embedding.provider/v1/embeddings' \
--header 'Authorization: Bearer $APIKEY' \
--header 'Content-Type: application/json' \
--data-raw '{
"input": "Your text string goes here",
"model": "text-embedding-3-small",
"dimensions":1000
}'

The expected return is as follows:

Json

{
"model": "text-embedding-3-small",
"object": "list",
"data": [
{
"object": "embedding",
"index": 0,
"embedding": [
0.07234696,
0.24083021,
-0.2613889,
-0.25943092,
-0.66101044,
-0.42331296,
0.38689473,
0.051249847,
0.15712704,
0.08947919
]
}
],
"usage": {
"prompt_tokens": 5,
"completion_tokens": 0,
"total_tokens": 5
}
}

Implement an embedding_provider that supports the BAAI/bge-m3 model, and configure it in the engine-server.

Yaml

[[database]]
name = "vec" #
engine = "milvus" #
[database.config]
endpoint = "localhost:19530" #
dail_timeout = "3s"
[[database.config.embedding_provider]]
name="text-embedding-3-small"
api_key="sk-xxxxx"
endpoint="https://api.openai.com/v1/embeddings"
timeout = "3s"
default=true
[[database.config.embedding_provider]]
name="BAAI/bge-m3"
api_ke=""
endpoint="http://127.0.0.1:5000/embeddings"
timeout = "3s"

Using the embedding-server in queries

sql

SELECT
id,
url,
title,
VECTOR_L2_DISTANCE (
content_vector,
TEXT_TO_VEC ('Your text string goes here','BAAI/bge-m3')
) AS distance
FROM
vec.news_demo
ORDER BY
distance
LIMIT
2;

3.Register dataset

Before querying a dataset, the dataset information needs to be written to the chain so that data consumers can discover and use the dataset. The following shows how to register a dataset through go-sdk.

Register dataset, specifying the dataset name "vec" and the address of the host (engine-server).

golang

import (
"context"
"errors"

"github.com/glitternetwork/glitter-sdk-go/client"
"github.com/glitternetwork/glitter-sdk-go/key"
chaindepindextype "github.com/glitternetwork/chain-dep/glitter_proto/glitterchain/index/types"
)

func CreateDatasetBySDK() {
mnemonicKey := ""
chainID := "glitter_12001-4"
endPoint :="https://orlando-api.glitterprotocol.tech"

pk, _ := key.DerivePrivKeyBz(mnemonicKey, key.CreateHDPath(0, 0))
privateKey, _ := key.PrivKeyGen(pk)
chainClient := client.New(chainID, privateKey, client.WithChainEndpoint(endPoint))

rsp, err := chainClient.CreateDataset(
context.Background(),
"vec",
chaindepindextype.ServiceStatus(chaindepindextype.ServiceStatus_value["Service_Status_Start"]),
"http://sg5.testnet.glitter.link:8085",
"",
"",
"{\"description\":\"a vector dataset\",\"cover_img\":\"https://ipfs.glitterprotocol.dev/ipfs/QmdUs8w7KTgGojHmuHkzfj67yZsvuWdzcfWgsJ1TnX2LXR\"}",
30*86400,
)
}

The dataset can also be registered through our developer platform DataMap:

tokenomics tokenomics tokenomics