optimize: Optimize batch query performance by lokidundun · Pull Request #2982 · apache/hugegraph

lokidundun · 2026-03-29T14:43:54Z

Purpose of the PR

close [Improve] Gremlin query ids batch query to backend such as HBase/HStore #2674

Main Changes

This PR improves the performance of Gremlin queries like g.V(id1, id2, ...) when using RPC‑based backends such as HBase and HStore.
Previously, all vertex ids were either queried one by one or packed into a single large IdQuery, which led to many small RPC calls and poor latency in real production workloads.

Verifying these changes

Trivial rework / code cleanup without any test coverage. (No Need)
Already covered by existing tests, such as (please modify tests here).
Need tests and can be verified as follows:
- xxx

Does this PR potentially affect the following parts?

Dependencies (add/update license info & regenerate_known_dependencies.sh)
Modify configurations
The public API
Other affects (typed here)
Nope

Documentation Status

Doc - TODO
Doc - Done
Doc - No Need

lokidundun · 2026-03-30T14:38:27Z

This CI failure is unrelated to the changes in this PR. The PR focuses on optimizing RocksDB batch query performance, and the failing build check does not involve the code modified here.

imbajin · 2026-03-31T06:56:02Z

Already reran CI (you could also check whether the tests pass in your local env)

imbajin · 2026-03-31T08:03:19Z

Seems a little strange (still failing after 3 reruns)

Temporarily use super.queryByIds() instead of getByIds() for batch version support.

lokidundun · 2026-04-06T11:02:10Z

@imbajin could you please take another look when convenient? ❤️

Copilot

Pull request overview

This PR introduces batched backend fetching for queryVerticesByIds() to reduce overhead when querying many vertex ids (e.g., g.V(id1, id2, ...)) by splitting backend id lookups into multiple IdQuery requests.

Changes:

Collect backend-only vertex ids during queryVerticesByIds() and issue backend queries in batches using query.batch_size.
Add per-batch capacity checks before executing backend queries.
Minor formatting adjustment in RocksDB table iterator helper signature.
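The batching described in this overview can be illustrated with a minimal, self-contained sketch. It is not the code from the PR: `IdBatcher` and `partition` are hypothetical names, and plain lists stand in for `IdQuery`. It only shows how a list of ids might be split into consecutive batches of at most `batchSize` elements.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of HugeGraph.
final class IdBatcher {

    /** Splits ids into consecutive batches of at most batchSize elements. */
    static <T> List<List<T>> partition(List<T> ids, int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < ids.size(); start += batchSize) {
            int end = Math.min(start + batchSize, ids.size());
            // Copy the view so each batch is independent of the input list
            batches.add(new ArrayList<>(ids.subList(start, end)));
        }
        return batches;
    }
}
```

Each resulting batch would correspond to one backend request, so the number of requests is the number of ids divided by the batch size, rounded up.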

Reviewed changes

Copilot reviewed 1 out of 2 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| hugegraph-server/hugegraph-core/src/main/java/org/apache/hugegraph/backend/tx/GraphTransaction.java | Batch backend id queries in `queryVerticesByIds()` instead of building a single `IdQuery` for all backend ids. |
| hugegraph-server/hugegraph-rocksdb/src/main/java/org/apache/hugegraph/backend/store/rocksdb/RocksDBTable.java | Formatting-only change to method parameter indentation. |


Copilot · 2026-04-06T15:35:24Z

+        if (!backendIds.isEmpty()) {
            // Query from backend store
-            query.mustSortByInput(false);
-            Iterator<HugeVertex> it = this.queryVerticesFromBackend(query);
-            QueryResults.fillMap(it, vertices);
+            final int batch = this.batchSize > 0 ? this.batchSize : backendIds.size();
+            for (int i = 0; i < backendIds.size(); i += batch) {
+                int end = Math.min(i + batch, backendIds.size());
+                IdQuery query = new IdQuery(type);


The PR title says this is a RocksDB optimization, but the functional change here is backend-agnostic batching in GraphTransaction (affecting all stores, especially RPC-based ones). Please align the PR title (or add RocksDB-specific changes) so the intent matches what’s actually being modified.

Copilot · 2026-04-06T15:35:24Z

@@ -799,17 +800,30 @@ protected Iterator<Vertex> queryVerticesByIds(Object[] vertexIds, boolean adjace
                // Found from local tx
                vertices.put(vertex.id(), vertex);
            } else {
-                // Prepare to query from backend store
-                query.query(id);
+                // store the IDs queried from backend
+                backendIds.add(id);
            }


This change allocates backendIds as a second full list of ids (in addition to ids), which doubles memory usage for large g.V(id1,id2,...) calls (up to Query.DEFAULT_CAPACITY). Consider batching/issuing backend IdQuery requests incrementally during the main loop (flush when reaching batch size) to avoid retaining all backend ids at once.
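A rough sketch of the incremental approach suggested in this comment, under simplifying assumptions: a `Map` stands in for the local transaction, a `Function` stands in for the backend query, and the class and method names are hypothetical. Only one pending batch is held at a time instead of a second full list of backend ids.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch; the Map and Function stand in for the local tx and backend store.
final class IncrementalBatchFetch {

    static <K, V> Map<K, V> fetch(List<K> ids, Map<K, V> localTx,
                                  Function<List<K>, Map<K, V>> backend,
                                  int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        Map<K, V> found = new HashMap<>();
        List<K> pending = new ArrayList<>();
        for (K id : ids) {
            V local = localTx.get(id);
            if (local != null) {
                // Found in the local transaction, no backend access needed
                found.put(id, local);
                continue;
            }
            pending.add(id);
            if (pending.size() >= batchSize) {
                // Flush a full batch; pass a copy because pending is reused
                found.putAll(backend.apply(new ArrayList<>(pending)));
                pending.clear();
            }
        }
        if (!pending.isEmpty()) {
            // Flush the final, possibly partial batch
            found.putAll(backend.apply(pending));
        }
        return found;
    }
}
```

The trade-off is that backend calls are interleaved with the local lookup loop, which may matter if the real code relies on all local checks finishing before any backend access.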

Copilot · 2026-04-06T15:35:25Z

+            final int batch = this.batchSize > 0 ? this.batchSize : backendIds.size();
+            for (int i = 0; i < backendIds.size(); i += batch) {
+                int end = Math.min(i + batch, backendIds.size());
+                IdQuery query = new IdQuery(type);
+                for (int j = i; j < end; j++) {
+                    Id id = backendIds.get(j);
+                    query.query(id);
+                }


With batching, duplicated ids that fall into different batches will trigger repeated backend reads/RPCs for the same id. You can keep the output behavior (duplicates preserved via ids) while deduplicating backend fetches (e.g., track a seen-set for backendIds or build per-batch unique ids) to avoid redundant backend queries.
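One way to read this suggestion, sketched with hypothetical names and a `Function` standing in for the backend: fetch each distinct id once, then rebuild the output from the original id list so that duplicates in the input still produce duplicate results.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Hypothetical sketch; not the GraphTransaction implementation.
final class DedupBackendFetch {

    static <K, V> List<V> fetch(List<K> ids, Function<Set<K>, Map<K, V>> backend) {
        // Deduplicate for the backend call, keeping first-seen order
        Set<K> unique = new LinkedHashSet<>(ids);
        Map<K, V> fetched = backend.apply(unique);

        // Rebuild the output from the original ids so duplicates are preserved
        List<V> results = new ArrayList<>(ids.size());
        for (K id : ids) {
            V value = fetched.get(id);
            if (value != null) {
                results.add(value);
            }
        }
        return results;
    }
}
```

Missing ids are simply skipped here; how they should be reported (for example via NotFoundException) is left out of the sketch.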

Copilot · 2026-04-06T15:35:25Z

-            query.mustSortByInput(false);
-            Iterator<HugeVertex> it = this.queryVerticesFromBackend(query);
-            QueryResults.fillMap(it, vertices);
+            final int batch = this.batchSize > 0 ? this.batchSize : backendIds.size();


final int batch = this.batchSize > 0 ? this.batchSize : backendIds.size(); is effectively always this.batchSize because query.batch_size is configured with a minimum of 1 (see CoreOptions.QUERY_BATCH_SIZE). Removing the dead fallback branch will simplify the code.

Suggested change

-            final int batch = this.batchSize > 0 ? this.batchSize : backendIds.size();
+            final int batch = this.batchSize;

Copilot · 2026-04-06T15:35:25Z

+        if (!backendIds.isEmpty()) {
            // Query from backend store
-            query.mustSortByInput(false);
-            Iterator<HugeVertex> it = this.queryVerticesFromBackend(query);
-            QueryResults.fillMap(it, vertices);
+            final int batch = this.batchSize > 0 ? this.batchSize : backendIds.size();
+            for (int i = 0; i < backendIds.size(); i += batch) {
+                int end = Math.min(i + batch, backendIds.size());
+                IdQuery query = new IdQuery(type);
+                for (int j = i; j < end; j++) {
+                    Id id = backendIds.get(j);
+                    query.query(id);
+                }
+                // Single batch capacity check
+                Query.checkForceCapacity(query.idsSize());
+
+                // Query from backend store
+                query.mustSortByInput(false);
+                Iterator<HugeVertex> it = this.queryVerticesFromBackend(query);
+                QueryResults.fillMap(it, vertices);
+            }


The new multi-batch path isn’t covered by tests. Please add a unit/integration test that exercises queryVerticesByIds() with vertexIds.length > query.batch_size, including (1) duplicates across a batch boundary and (2) mixed local-tx + backend ids, to ensure results and NotFoundException behavior remain unchanged.

imbajin · 2026-04-11T15:34:53Z

        // NOTE: allowed duplicated vertices if query by duplicated ids
        List<Id> ids = InsertionOrderUtil.newList();
        Map<Id, HugeVertex> vertices = new HashMap<>(vertexIds.length);
+        Set<Id> fetchedIds = InsertionOrderUtil.newSet();


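⚠️ With fetchedIds added here, non-adjacent duplicate ids are now deduplicated globally as well, whereas in the old logic IdQuery.query() only collapsed adjacent duplicate ids. The final result should still preserve duplicates in the output, but the actual backend access path has changed. I'd suggest adding a regression test that covers at least these combined scenarios: more ids than query.batch_size, duplicate ids across batches, and missing ids + checkMustExist, so that later changes here don't silently shift the semantics.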
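The difference this comment points at can be shown with a small model. The two helpers below are hypothetical and only mimic the behaviors described in the comment (collapsing adjacent duplicates versus deduplicating across the whole input); they are not the actual IdQuery code.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Objects;

// Hypothetical model of the two dedup behaviors discussed in the review.
final class DuplicateIdHandling {

    /** Drops an id only when it equals the id immediately before it. */
    static <T> List<T> collapseAdjacent(List<T> ids) {
        List<T> out = new ArrayList<>();
        for (T id : ids) {
            if (out.isEmpty() || !Objects.equals(out.get(out.size() - 1), id)) {
                out.add(id);
            }
        }
        return out;
    }

    /** Drops every repeated id, keeping first-seen order. */
    static <T> List<T> dedupGlobally(List<T> ids) {
        return new ArrayList<>(new LinkedHashSet<>(ids));
    }
}
```

For an input such as v1, v1, v2, v1 the first helper still sends v1 to the backend twice, while the second sends it once, which is the change in backend access path the comment describes.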

imbajin · 2026-04-11T15:34:54Z

        Map<Id, HugeVertex> vertices = new HashMap<>(vertexIds.length);
+        Set<Id> fetchedIds = InsertionOrderUtil.newSet();
+        IdQuery batchQuery = null;
+        final int batchSize = this.batchSize;


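⚠️ Putting the batching in the generic GraphTransaction layer affects all backends, not just the RPC backends mentioned in issue #2674.

Take RocksDB as an example: its current queryByIds() still expands into per-id queries and does not really use multi-get. Forcing a split into multiple IdQuery requests by query.batch_size will most likely just add extra query/iterator rounds. I'd suggest pushing this kind of batching strategy down into the specific backends, or at least restricting it to RPC backends such as HBase/HStore via feature/store type, so that a targeted optimization does not become a global behavior change.

PS: Going forward we should have RocksDB use its native multi-get API (this should be an earlier TODO)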

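GraphTransaction no longer splits ids by batch_size; it issues the complete IdQuery in one go;

RocksDBTables.Vertex/Edge now override queryByIds(): when !session.hasChanges() they use the native multiGetAsList() batch read, and a dirty session safely falls back to a per-id scan;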
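The control flow described in this reply can be sketched as follows. This is a simplified model under stated assumptions: the `Session` interface, its `multiGet`/`get` methods, and the string keys are hypothetical stand-ins (the real code works with RocksDBSessions.Session and byte arrays), and a per-key `get` stands in for the per-id scan.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of "multi-get when clean, per-id fallback when dirty".
final class MultiGetWithFallback {

    /** Stand-in for a storage session; not the HugeGraph API. */
    interface Session {
        boolean hasChanges();
        // Assumed contract: one entry per key, in order, null when missing
        List<String> multiGet(List<String> keys);
        String get(String key);
    }

    static List<String> queryByIds(Session session, List<String> ids) {
        List<String> values = new ArrayList<>(ids.size());
        if (!session.hasChanges()) {
            // Clean session: one batched read for all ids
            for (String value : session.multiGet(ids)) {
                if (value != null) {
                    values.add(value);
                }
            }
            return values;
        }
        // Dirty session: fall back to reading ids one by one
        for (String id : ids) {
            String value = session.get(id);
            if (value != null) {
                values.add(value);
            }
        }
        return values;
    }

    /** Tiny in-memory session used only to exercise the sketch. */
    static final class InMemorySession implements Session {
        final Map<String, String> data = new HashMap<>();
        boolean dirty;
        int multiGetCalls;
        int getCalls;

        @Override
        public boolean hasChanges() {
            return this.dirty;
        }

        @Override
        public List<String> multiGet(List<String> keys) {
            this.multiGetCalls++;
            List<String> values = new ArrayList<>(keys.size());
            for (String key : keys) {
                values.add(this.data.get(key));
            }
            return values;
        }

        @Override
        public String get(String key) {
            this.getCalls++;
            return this.data.get(key);
        }
    }
}
```

In this model both paths skip missing ids and keep duplicate input ids, so callers see the same results whichever path is taken; only the number of storage calls differs.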


imbajin · 2026-04-17T09:06:37Z

                                                   Collection<Id> ids) {
-            // TODO: use getByIds() after batch version multi-get is ready
+            if (!session.hasChanges()) {
+                return this.getByIds(session, ids);


⚠️ This RocksDB-specific optimization looks reasonable to me as a focused improvement on its own. One thing that may be worth clarifying is how HStore should be handled, since it is the main distributed backend we maintain and its batch-query path is a separate concern from RocksDB.

If you think it makes sense, we could either cover HStore in this PR as well, or keep this PR scoped to RocksDB and follow up with a separate PR for HStore so the behavior and performance trade-offs can be reviewed independently.

Thanks! Let's keep this PR focused on RocksDB for now, and I'll handle HStore in a separate follow-up PR.

imbajin · 2026-04-17T09:06:37Z

+        protected BackendColumnIterator queryByIds(RocksDBSessions.Session session,
+                                                   Collection<Id> ids) {
+            if (!session.hasChanges()) {
+                return this.getByIds(session, ids);


⚠️ Since this adds a new multi-get path for vertex/edge id queries, it would be great to add a small RocksDB regression test here. The core cases that seem worth covering are:

batch query with multiple existing ids

batch query with existing + missing ids mixed together

duplicate ids in the input

fallback to the old path when session.hasChanges() is true

Testing the fallback path (session.hasChanges() = true) is not feasible in unit tests, as RocksDBStdSessions asserts !this.hasChanges() on all read operations (get/scan) when assertions are enabled.

optimize: Optimize RocksDB batch query performance

5ba6d4a

dosubot bot added size:XS This PR changes 0-9 lines, ignoring generated files. perf store Store module labels Mar 29, 2026

github-project-automation bot moved this to In progress in HugeGraph PD-Store Tasks Mar 29, 2026

github-project-automation bot added this to HugeGraph PD-Store Tasks Mar 29, 2026

lokidundun closed this Mar 30, 2026

github-project-automation bot moved this from In progress to Done in HugeGraph PD-Store Tasks Mar 30, 2026

lokidundun reopened this Mar 30, 2026

github-project-automation bot moved this from Done to In progress in HugeGraph PD-Store Tasks Mar 30, 2026

lokidundun and others added 5 commits March 31, 2026 20:28

Refactor getByIds to queryByIds in RocksDBTable

f5405a0

Modify queryByIds to use super method temporarily

939ace0

Temporarily use super.queryByIds() instead of getByIds() for batch version support.

Refactor queryByIds to getByIds with HashSet

502c7df

Update RocksDBTables.java

352f66b

fix: fix ci

0d9052d

dosubot bot added size:M This PR changes 30-99 lines, ignoring generated files. and removed size:XS This PR changes 0-9 lines, ignoring generated files. labels Apr 6, 2026

imbajin requested a review from Copilot April 6, 2026 15:30

Copilot started reviewing on behalf of imbajin April 6, 2026 15:31 View session

Copilot AI reviewed Apr 6, 2026

View reviewed changes

lokidundun changed the title ~~optimize: Optimize RocksDB batch query performance~~ optimize: Optimize batch query performance Apr 7, 2026

optimize: optimize the batch query

45298f9

imbajin reviewed Apr 11, 2026

View reviewed changes

optimize: optimize rockDb query

223fb28

dosubot bot removed the size:M This PR changes 30-99 lines, ignoring generated files. label Apr 16, 2026

dosubot bot added the size:S This PR changes 10-29 lines, ignoring generated files. label Apr 16, 2026

lokidundun and others added 6 commits April 16, 2026 12:36

fix: fix ci test

ce4e2cb

fix: preserve input id multiplicity in RocksDB multi-get path

6fe08a7

Agent-Logs-Url: https://github.com/lokidundun/incubator-hugegraph/sessions/d37721a0-2300-48d4-a06c-f8e1b4b6c3d6 Co-authored-by: lokidundun <[email protected]>

fix: fallback getByIds when rocksdb session has pending changes

0391069

Agent-Logs-Url: https://github.com/lokidundun/incubator-hugegraph/sessions/ab26085b-737d-4e58-aa7c-6c88194b2032 Co-authored-by: lokidundun <[email protected]>

revert: restore rocksdb query files to ce4e2cb state

ce802b9

Agent-Logs-Url: https://github.com/lokidundun/incubator-hugegraph/sessions/e035a01d-313f-4f83-8efb-e85a306744c9 Co-authored-by: lokidundun <[email protected]>

fix: fix wrong usage

c06225c

Merge branch 'improvequery' of https://github.com/lokidundun/incubato…

a99491b

…r-hugegraph into improvequery

lokidundun requested a review from imbajin April 17, 2026 08:18

imbajin reviewed Apr 17, 2026

View reviewed changes

test: add test for new multi-get path

10cb0a4

dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. and removed size:S This PR changes 10-29 lines, ignoring generated files. labels Apr 17, 2026



4 participants
