E-commerce

Magento 2 search on pgvector: a 12k-SKU swap playbook

The renewal from the search vendor grew enough that the CFO asked: do we still need this? A playbook for swapping Magento 2 search to pgvector on a 12,000-SKU Dutch B2B catalogue.

Jacob Molkenboer· Founder · A Brand New Company· 6 Jun 2026· 8 min

Brass two-pan scale with small craft-paper parcel on left pan and stack of paper tags on right, on ivory paper surface.

The renewal quote from the search vendor landed on a Wednesday morning. It had grown enough that the CFO forwarded it to the operations lead with a single line: "Do we still need this?" The catalogue is around 12,000 SKUs of fasteners, fittings and industrial spare parts, sold to Dutch installers and contractors. The store runs on Magento 2.4. Search is the only part of the stack that lives on a third-party service, and everything else already sits on Postgres.

This is the playbook we follow when a client decides to bring catalogue search back in-house onto Postgres with pgvector, while leaving the checkout untouched. It works because Magento 2's search layer is one of the most cleanly abstracted parts of the platform, and because B2B search is mostly a problem of disambiguating SKU codes and Dutch synonyms, not a problem of crawl scale.

Why pgvector and not Elasticsearch

Magento 2.4 dropped MySQL search and went all-in on Elasticsearch, with OpenSearch as the supported fork from 2.4.6 onward. Both work. Neither is what you reach for first when the rest of your stack is Postgres and your search problem is 12,000 documents.

An OpenSearch cluster sized for production has a JVM heap, a replica, a coordinator and a maintenance schedule. pgvector is a Postgres extension that adds a vector column type and two index methods (IVFFlat and HNSW). On 12k rows it answers in under 5 ms on a small Hetzner box, and it indexes overnight on the same database that already holds your orders. You back it up with pg_dump. You scale it by upgrading Postgres.

The trade-off: pgvector is not a search engine. There is no document mapping language, no analyzer chain, no relevance tuning UI. You get a vector column and a similarity operator, and you write the rest in SQL. For 12k SKUs that is a feature, not a limitation.

What stays, what changes, and where the checkout line sits

Magento 2 routes every search query through Magento\Search\Model\SearchEngine, which dispatches to whatever adapter is registered. The product result page, the layered navigation, the search-as-you-type widget and the "did you mean" block all consume that adapter. The cart, the quote, the tax calculation and the payment capture do not. They go through different services entirely.

That gives you a clean cut. You replace the search adapter and the indexer that feeds it. You leave the rest of Magento alone. A customer can still add a product to cart, place an order, and pay through iDEAL or your card processor with no code path crossing pgvector.

Takeaway

If your swap touches anything past the search results page, you have scoped it wrong. The boundary is the adapter interface, and Magento defends it for you.

Step 1: mirror the catalogue into Postgres

Even if your operational database is MySQL (Magento's default), the search index lives in its own Postgres database. Treat it as a read-side projection. Magento publishes catalog events you can subscribe to: catalog_product_save_after, catalog_product_delete_commit_after, attribute set changes and stock updates.

For the first build, do a full dump. A direct query over the EAV tables works for small catalogues, but it's brittle. Use the REST endpoint /rest/V1/products with a store-scoped token. It already flattens EAV, resolves attribute labels, and respects multi-store visibility. Page it 200 at a time and write to a staging table.

CREATE TABLE catalog_search_doc (
  sku            text PRIMARY KEY,
  entity_id      bigint NOT NULL,
  name           text NOT NULL,
  description    text,
  manufacturer   text,
  ean            text,
  attrs          jsonb NOT NULL DEFAULT '{}'::jsonb,
  price_cents    integer,
  in_stock       boolean,
  updated_at     timestamptz NOT NULL DEFAULT now()
);

The attrs JSONB is where the B2B-specific fields live: thread pitch, head type, material grade, certificate references. You do not need to normalise these into columns. You will query them with attrs ? 'rvs_a2' style predicates, and GIN makes that cheap.

Step 2: pick the embedding model

For a Dutch B2B catalogue the embedding model matters more than the vector index. English-only models miss synonyms a fastener buyer takes for granted: "RVS A2" against "stainless steel 304", "zeskant" against "hex", "verzinkt" against "galvanised". You want a multilingual model trained on European languages.

Two reasonable picks: BAAI/bge-m3 (1024 dimensions, runs on CPU for a 12k catalogue, MIT-style licence) and OpenAI's text-embedding-3-small (1536 dimensions, hosted, costs roughly a euro for the full corpus). We default to bge-m3 because the embeddings stay on your hardware and re-embedding the catalogue on a model upgrade is free.

Embed the concatenation of name, manufacturer, key attributes and the first 300 words of the description. Do not embed the whole HTML body. Most product descriptions in B2B catalogues are 80% boilerplate ("Levertijd 2-3 werkdagen, prijzen excl. BTW") that drowns the signal.

def doc_text(row):
    parts = [
        row["name"],
        row["manufacturer"] or "",
        " ".join(f"{k}: {v}" for k, v in row["attrs"].items()),
        (row["description"] or "")[:1200],
    ]
    return "\n".join(p for p in parts if p).strip()

embeddings = model.encode(
    [doc_text(r) for r in batch],
    normalize_embeddings=True,
)

Step 3: index design

The final index has three layers: a generated tsvector for lexical search, a vector column for semantic search, and a couple of GIN indexes for structured filters.

ALTER TABLE catalog_search_doc
  ADD COLUMN embedding vector(1024),
  ADD COLUMN fts tsvector GENERATED ALWAYS AS (
    setweight(to_tsvector('dutch',  coalesce(name,'')),         'A') ||
    setweight(to_tsvector('simple', coalesce(sku,'')),          'A') ||
    setweight(to_tsvector('dutch',  coalesce(manufacturer,'')), 'B') ||
    setweight(to_tsvector('dutch',  coalesce(description,'')),  'C')
  ) STORED;

CREATE INDEX catalog_fts_idx   ON catalog_search_doc USING gin (fts);
CREATE INDEX catalog_attrs_idx ON catalog_search_doc USING gin (attrs);
CREATE INDEX catalog_hnsw_idx  ON catalog_search_doc
  USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);

The 'simple' dictionary on the SKU column is deliberate. You do not want Postgres stemming "M8x40" into "m8x". The 'dutch' dictionary on the human-readable fields gives you stemming on Dutch words (so "schroeven" matches "schroef") without you maintaining a synonym list.

Warning

HNSW build time scales with row count. On 12k rows it finishes in seconds; on 12 million it can take an hour. If you grow into the millions, build the index on a replica and promote.

Step 4: the hybrid query

A vector-only query loses to a lexical query the moment a buyer types an exact SKU. A lexical-only query loses the moment a buyer types "schroef voor buitengevel rvs". You want both, weighted, with a small tiebreaker on stock status. This is one SQL statement:

WITH q AS (
  SELECT
    $1::text   AS qtext,
    $2::vector AS qemb
)
SELECT
  d.sku,
  d.name,
  d.entity_id,
  ts_rank_cd(d.fts, plainto_tsquery('dutch', q.qtext)) AS lex,
  1 - (d.embedding <=> q.qemb)                         AS sem,
  (
      0.45 * ts_rank_cd(d.fts, plainto_tsquery('dutch', q.qtext))
    + 0.55 * (1 - (d.embedding <=> q.qemb))
    + CASE WHEN d.in_stock THEN 0.05 ELSE 0 END
  ) AS score
FROM catalog_search_doc d, q
WHERE
  d.fts @@ plainto_tsquery('dutch', q.qtext)
  OR (d.embedding <=> q.qemb) < 0.55
ORDER BY score DESC
LIMIT 48;

The 0.45/0.55 split is a starting point, not a law. Sit with the operations lead, run twenty real queries from last week's search logs, watch which result list they prefer, and adjust. We have seen B2B catalogues where the lexical weight needs to climb to 0.7 because SKU traffic dominates, and consumer catalogues where it drops to 0.25.

Step 5: register the adapter in Magento

Magento's search engine selection goes through etc/search_engine.xml in your module. Implement Magento\Framework\Search\AdapterInterface, register it as pgvector, and set it as the active engine in app/etc/env.php or through the admin.

<?php
namespace Abn\PgvectorSearch\SearchAdapter;

use Magento\Framework\Search\AdapterInterface;
use Magento\Framework\Search\RequestInterface;
use Magento\Framework\Search\Response\QueryResponse;
use Magento\Framework\Search\Response\QueryResponseFactory;

class Adapter implements AdapterInterface
{
    public function __construct(
        private \Abn\PgvectorSearch\Client $client,
        private QueryResponseFactory $responseFactory,
    ) {}

    public function query(RequestInterface $request): QueryResponse
    {
        $text = $this->extractQueryText($request);
        $hits = $this->client->search($text, limit: 48);

        $documents = array_map(
            fn($h) => ['entity_id' => $h['entity_id'], 'score' => $h['score']],
            $hits,
        );

        return $this->responseFactory->create([
            'documents'    => $documents,
            'aggregations' => $this->client->facets($text),
            'total'        => count($documents),
        ]);
    }
}

The client class is a thin PHP wrapper around the SQL above. It calls a small embedding service (a FastAPI process running bge-m3) for the query vector, then runs the SQL through PDO. On a warm connection pool you are looking at a few milliseconds per request.

Step 6: shadow traffic before cutover

Do not flip the search engine in production on a Friday afternoon. Run pgvector in shadow mode for two weeks. Every search request goes to the existing engine for the user-visible result and to pgvector for logging only. Compare the top-10 results on the same query and log the symmetric difference.

SELECT
  qtext,
  count(*) FILTER (WHERE jaccard_top10 >= 0.7) AS agree,
  count(*) FILTER (WHERE jaccard_top10 <  0.3) AS diverge,
  count(*) AS n
FROM search_shadow_log
WHERE captured_at > now() - interval '7 days'
GROUP BY qtext
ORDER BY n DESC
LIMIT 200;

Sort the divergent queries by frequency. Walk through the top 50 with someone who knows the catalogue. In most of those, pgvector is right and the old engine was wrong, but you need a human to call it. The cases where pgvector is wrong are almost always missing synonyms in your embedded text, not a model failure.

Operating notes

A few things we wish we had known sooner.

Reindex incrementally, not nightly. Magento's catalog_product_save_after event fires per save. Queue the SKU, debounce 5 seconds, re-embed, upsert. A nightly full reindex is acceptable on 12k SKUs but it masks data quality issues until they hurt.

Watch maintenance_work_mem when you rebuild the HNSW index. The default of 64MB is fine for a small index, but at 1024 dimensions and a million rows you want at least 2GB.

Cache the query embedding, not the result set. The vector for "schroef m8x40 rvs" is identical every time. The result set changes the moment stock changes.

Keep the old engine warm for one full quarter. Switching back is a one-line config change. Throwing the cluster away is irreversible.

What you can do this afternoon

When we built the pgvector swap for the fastener catalogue described above, the part that took the most time was not the SQL or the Magento module. It was sitting with the warehouse team and writing down the twenty Dutch synonyms their customers actually type. That work pays off whether you ship pgvector, Typesense, or a tuned OpenSearch. If you want help running the same swap end to end, our legacy-stack migration practice covers the Magento adapter, the indexer and the shadow-traffic gate.

Today, open your search logs, sort the queries by frequency, and read the top fifty out loud. The patterns will tell you whether your search problem is lexical, semantic, or a stock-data problem wearing a search costume.

Key takeaway

Magento 2's search adapter is the cleanest swap point in the stack. Replace it with a pgvector index and the checkout never knows you moved.

FAQ

Does this break Magento's layered navigation or facets?

No. The adapter still returns aggregations. You compute facets in SQL over the catalog_search_doc table, grouped by attribute, and Magento renders them through its existing layered nav UI.

What happens to the existing Elasticsearch or third-party engine during cutover?

Run both in shadow mode for two weeks. Every query hits pgvector for logging only while the old engine still serves users. Cutover is a one-line config change once the divergence on top queries is acceptable.

Do I need a GPU to run the embeddings?

Not for 12,000 SKUs. The bge-m3 model on CPU embeds the full catalogue in around an hour. Incremental updates run at a few hundred SKUs per minute, well within an event-driven reindex budget.

How do you handle stock and price changes?

Subscribe to Magento's stock and price events and upsert the in_stock and price_cents columns. You do not need to re-embed when only stock or price changes; the vector is unaffected.

magentoe-commercemigrationarchitectureintegrations

Building something?

Start a project