The Symptom
An API for editing order details started throwing a business error: ORDER_ITEM_CAN_NOT_DELETE — only pending or cancelled items can be deleted. It happened whenever the order contained deposit (prepaid) entries. The older version of the same endpoint, calling the same downstream logic, worked fine.
The developer’s initial instinct: “MyBatis-Plus caching issue; the first and second query return different results.” Directionally correct, but the mechanism was not what anyone expected. The root cause turned out to be a single .add() call written seventeen months before the bug appeared.
The Clue: Same Object, Different Size
The call chain was: updateOrderDetail calls getOrderDetail (which queries order items), then calls saveOrderDetail (which queries order items again before diffing). Two queries to the same mapper, in the same @Transactional method.
I added identity probes at both query sites, logging System.identityHashCode(list) and list.size() for the list each query returned.
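A probe of this kind can be as small as the sketch below. The class and method names (CacheProbe, describe) are hypothetical and not taken from the original code; only System.identityHashCode and List.size() are standard Java.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: formats the identity hash and size of a list so that
// two query sites can be compared in the log.
final class CacheProbe {
    private CacheProbe() {}

    static String describe(String site, List<?> list) {
        // identityHashCode ignores any equals/hashCode override on the list,
        // so two equal-but-distinct lists normally print different values.
        return site + " identity=" + System.identityHashCode(list) + " size=" + list.size();
    }

    public static void main(String[] args) {
        List<String> items = new ArrayList<>(Arrays.asList("a", "b"));
        System.out.println(describe("query#1", items));
        items.add("c"); // same object: identity is unchanged while size grows
        System.out.println(describe("query#2", items));
    }
}
```

Identity hashes are not guaranteed to be unique, so a matching value is strong evidence of a shared reference rather than proof; comparing the two references with == in a debugger settles it.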
Results with the failing endpoint:
query#1 identity=1697248320 size=4
query#2 identity=1697248320 size=6
Same identity hash. Same object reference. But size grew from 4 to 6 between the two calls.
So where did the extra 2 come from? The database had 4 rows in the order item table and 2 deposit entries in a separate table.
4 + 2 = 6. A search through the call chain between the two query sites found it: getOrderDetail was .add()-ing deposit entries into the list after query #1. And somehow query #2 was seeing that mutation instead of hitting the database for a fresh result. The save logic diffed old items against new, saw 2 deposit entries that did not exist in the incoming list, tried to delete them, and failed the status check.
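The failure path in the save logic can be reproduced with a small model. The Item class, the status strings, and the diff rule below are assumptions made for illustration; the real saveOrderDetail is not shown in this article.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical model of "diff existing items against the incoming ones and
// delete whatever is no longer present".
final class OrderDiffSketch {
    static final class Item {
        final long id;
        final String status; // assumed values: "PENDING", "CANCELLED", "PAID"
        Item(long id, String status) { this.id = id; this.status = status; }
    }

    // Returns the ids that would be deleted; throws if one of them is not deletable.
    static List<Long> idsToDelete(List<Item> existing, Set<Long> incomingIds) {
        List<Long> toDelete = new ArrayList<>();
        for (Item item : existing) {
            if (incomingIds.contains(item.id)) {
                continue; // still present in the request, nothing to delete
            }
            boolean deletable = item.status.equals("PENDING") || item.status.equals("CANCELLED");
            if (!deletable) {
                throw new IllegalStateException("ORDER_ITEM_CAN_NOT_DELETE: " + item.id);
            }
            toDelete.add(item.id);
        }
        return toDelete;
    }
}
```

With the 4 real items as the existing list and the same 4 ids incoming, nothing is deleted. With a polluted list of 6, the 2 appended deposit entries are missing from the incoming ids, so the diff treats them as removed and the status check throws.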
The Bug: Three Commits That Conspired
Three changes, each independently reasonable, combined to create this bug.
Commit 1: The Mutation (November 2024)
A developer added deposit entries to the order item list for display purposes, appending them with .add() to the list the mapper had returned.
This made the API response include deposits alongside regular items — a reasonable UI requirement.
The problem: listByOrderId is a MyBatis mapper call. MyBatis’s first-level (L1) cache, managed by BaseExecutor, stores query results in a PerpetualCache, which is a thin wrapper around a HashMap<Object, Object>. When a subsequent query hits the cache, localCache.getObject(key) returns the raw reference held in the map.
No defensive copy. No Collections.unmodifiableList(). The exact same List object is returned. So when getOrderDetail called items.add(convertedDeposit), it was mutating the list that the L1 cache held a reference to.
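The aliasing can be reproduced without MyBatis. The sketch below is a toy model, not MyBatis source: ToyL1Cache stands in for the session-level cache, query for the cached lookup path, and every name in it is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Toy model of a session-level cache that hands out the stored reference.
final class ToyL1Cache {
    private final Map<Object, Object> cache = new HashMap<>();
    int databaseHits = 0;

    @SuppressWarnings("unchecked")
    <E> List<E> query(Object key, Supplier<List<E>> database) {
        List<E> cached = (List<E>) cache.get(key);
        if (cached != null) {
            return cached; // no defensive copy: the caller gets the cached object itself
        }
        List<E> fresh = database.get();
        databaseHits++;
        cache.put(key, fresh);
        return fresh;
    }

    public static void main(String[] args) {
        ToyL1Cache session = new ToyL1Cache();
        Supplier<List<String>> db =
                () -> new ArrayList<>(Arrays.asList("item1", "item2", "item3", "item4"));

        List<String> first = session.query("listByOrderId:42", db);
        first.add("deposit1"); // mutates the list the cache still references
        first.add("deposit2");

        List<String> second = session.query("listByOrderId:42", db);
        System.out.println(first == second); // true
        System.out.println(second.size());   // 6
    }
}
```

The second lookup never reaches the supplier, so the "database" is hit once and the caller's two .add() calls are visible in what looks like a fresh query result.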
At this point, the mutation was harmless: getOrderDetail was always the terminal consumer. Nobody queried the same mapper method again in the same transaction.
Commit 2: The Second Query (April 2026, Week 1)
A new feature added operation logging: “show what changed between the old and new order.” This required fetching the current state before applying updates, so updateOrderDetail now called getOrderDetail before saveOrderDetail.
Now two calls to the same mapper existed in the same call chain. But this alone did not trigger the bug. Without @Transactional, Spring’s SqlSessionTemplate opens and closes a new SqlSession for each mapper call. Two separate sessions, two separate caches, no pollution.
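The effect of one session per mapper call can be modelled in plain Java. This is a sketch under the assumption that each call starts with a fresh, empty cache; PerCallSessions and its methods are hypothetical names, and the real SqlSessionTemplate does considerably more.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model: without a transaction, every mapper call runs in its own
// short-lived session, so no cache entry survives from one call to the next.
final class PerCallSessions {
    static List<String> loadFromDatabase() {
        return new ArrayList<>(Arrays.asList("item1", "item2", "item3", "item4"));
    }

    static List<String> mapperCall(String key) {
        // A new session with an empty cache; it is discarded when the call returns.
        Map<String, List<String>> sessionCache = new HashMap<>();
        return sessionCache.computeIfAbsent(key, k -> loadFromDatabase());
    }

    public static void main(String[] args) {
        List<String> first = mapperCall("listByOrderId:42");
        first.add("deposit1"); // mutates a list no other session can see
        List<String> second = mapperCall("listByOrderId:42");
        System.out.println(first == second); // false
        System.out.println(second.size());   // 4
    }
}
```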
Commit 3: The Transaction (April 2026, Week 2)
The developer added DML operations to updateOrderDetail and correctly wrapped the method in @Transactional.
@Transactional binds a single SqlSession to the current thread for the duration of the method. Now query #1 and query #2 share the same executor, the same localCache.
MyBatis builds a CacheKey from the statement ID, the RowBounds offset and limit, the SQL string, and each parameter value. Both queries had identical inputs, so both produced the same CacheKey. Query #2 hit the cache and got back the exact same List object that query #1 returned. The one that .add() had already mutated with 2 extra deposit entries.
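A simplified model of that key, for illustration only. MyBatis’s real CacheKey accumulates its components incrementally and also includes the environment id; the hypothetical ToyCacheKey below just collects the listed components and relies on List equality. The mapper name and SQL in the usage note are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy cache key: two keys are equal when every component is equal, in order.
final class ToyCacheKey {
    private final List<Object> parts = new ArrayList<>();

    ToyCacheKey(String statementId, int offset, int limit, String sql, Object... params) {
        parts.add(statementId);
        parts.add(offset);
        parts.add(limit);
        parts.add(sql);
        parts.addAll(Arrays.asList(params)); // each parameter value is part of the key
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof ToyCacheKey && parts.equals(((ToyCacheKey) o).parts);
    }

    @Override
    public int hashCode() {
        return parts.hashCode();
    }
}
```

Two keys built from ("OrderItemMapper.listByOrderId", 0, Integer.MAX_VALUE, the same SQL string, 42L) compare equal and land in the same map slot; changing the order id to 43L produces a different key and therefore a cache miss.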
The Timeline
Nov 2024 items.add(deposit) — pollution source (dormant)
Apr 2026 W1 added getOrderDetail() call — two queries in same call chain
Apr 2026 W2 added @Transactional — shared SqlSession, cache hit, bug fires
Seventeen months between the first commit and the bug surfacing. Each commit passed code review. Each was correct in isolation.
Why the Old Endpoint Worked
The older endpoint called the same getOrderDetail (query #1) and the same saveOrderDetail (query #2). It also had @Transactional. It should have had the same bug.
It didn’t, by accident. Between query #1 and query #2, the old endpoint ran six DML operations.
The first DML wiped the cache: BaseExecutor.update() calls clearLocalCache() before it executes any DML statement.
The clearing is all-or-nothing: any INSERT, UPDATE, or DELETE in the current session wipes every cached query result, regardless of which table was involved. MyBatis does not track table dependencies.
So the first updateCustomerInfo triggered clearLocalCache(), the polluted list was gone, and query #2 missed the cache, hit the database, and got the correct 4 rows.
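A toy model shows why an intervening write hides the problem. The names below are hypothetical and the code is not MyBatis source; it only mirrors the rule that any update clears the whole local cache first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy session: any update clears the entire local cache before it runs.
final class ToySessionWithUpdate {
    private final Map<String, List<String>> localCache = new HashMap<>();

    List<String> query(String key) {
        // Returns the cached reference if present, otherwise "hits the database".
        return localCache.computeIfAbsent(key,
                k -> new ArrayList<>(Arrays.asList("item1", "item2", "item3", "item4")));
    }

    int update(String statement) {
        localCache.clear(); // all-or-nothing: no per-table tracking
        return 1;           // pretend one row was affected
    }

    public static void main(String[] args) {
        ToySessionWithUpdate session = new ToySessionWithUpdate();
        List<String> first = session.query("listByOrderId:42");
        first.add("deposit1");
        first.add("deposit2");

        session.update("update on an unrelated table"); // cache is wiped anyway

        List<String> second = session.query("listByOrderId:42");
        System.out.println(first == second); // false
        System.out.println(second.size());   // 4
    }
}
```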
The old endpoint was protected by luck, not by design. If someone refactored away those intermediate DMLs, or reordered them to run after query #2, the same bug would appear.
I verified this with a controlled experiment: commenting out the six DML operations in the old endpoint produced identical behavior to the new one. Same identity hash, size 6, same business error.
The saveBatch Trap
The bug above happened because .add() polluted a cached list. But there is a second way L1 can silently serve stale data, and it is easier to walk into. You do not need to mutate anything; just using the wrong save method is enough.
MyBatis-Plus’s batch operations (saveBatch, saveOrUpdateBatch, removeBatchByIds) open a separate SqlSession with ExecutorType.BATCH.
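Each SqlSession has its own BaseExecutor, its own localCache. The batch session’s DML clears the batch session’s cache, not the outer transaction’s. So a query that the outer session cached before saveBatch is served from that cache again afterwards, without the newly inserted rows.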
The two sessions share the same JDBC connection (same database transaction), so the data is consistent at the DB level. But the L1 cache in the outer session has no idea that rows were inserted by the batch session.
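The two-session situation can be sketched as follows. This is a toy model with hypothetical names: one shared list plays the database table, and the second Session object stands in for the ExecutorType.BATCH session.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model: two sessions, each with its own cache, over one shared "table".
final class TwoSessionSketch {
    static final class Session {
        private final List<String> table; // shared storage, like one DB transaction
        private final Map<String, List<String>> localCache = new HashMap<>();

        Session(List<String> table) { this.table = table; }

        List<String> selectAll() {
            return localCache.computeIfAbsent("selectAll", k -> new ArrayList<>(table));
        }

        void insert(String row) {
            localCache.clear(); // clears only this session's cache
            table.add(row);
        }
    }

    public static void main(String[] args) {
        List<String> table = new ArrayList<>(Arrays.asList("row1", "row2"));
        Session outer = new Session(table);
        Session batch = new Session(table);

        System.out.println(outer.selectAll().size()); // 2, now cached in the outer session
        batch.insert("row3");                         // outer cache is not cleared
        System.out.println(outer.selectAll().size()); // 2 (stale), although the table has 3 rows
        outer.insert("row4");                         // a DML on the outer session clears its cache
        System.out.println(outer.selectAll().size()); // 4
    }
}
```

The stale read only occurs when the outer session already cached the same query before the batch write; a query issued for the first time after the batch goes to the database as usual.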
The Fix
One line: copy the mapper’s result into a new ArrayList at the call site before adding deposits to it.
This breaks the reference link between the caller’s list and the L1 cache. Subsequent .add() calls mutate the copy, not the cached original.
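In context, the copy might look like the sketch below. The method and variable names (withDeposits, cachedItems) are hypothetical; the point is only that the list being appended to is no longer the object the cache holds.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class CopyBeforeMutate {
    // cachedItems stands in for the list returned by the mapper (and held by L1).
    static List<String> withDeposits(List<String> cachedItems, List<String> deposits) {
        List<String> items = new ArrayList<>(cachedItems); // shallow copy breaks the aliasing
        items.addAll(deposits);                            // mutates the copy only
        return items;
    }

    public static void main(String[] args) {
        List<String> cached = new ArrayList<>(Arrays.asList("item1", "item2", "item3", "item4"));
        List<String> response = withDeposits(cached, Arrays.asList("deposit1", "deposit2"));
        System.out.println(response.size()); // 6
        System.out.println(cached.size());   // 4
    }
}
```

The copy is shallow: the element objects are still shared with the cached list, so changing a field on an element would still be visible through the cache. Only structural changes such as add and remove are isolated.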
A more thorough fix would be to wrap the return value at the domain service layer, protecting all callers.
This is safer long-term but changes behavior for every caller at once. Higher confidence, higher blast radius. For the immediate fix, wrapping at the call site is lower risk.
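One way to read "wrap at the domain service layer" is a service method that never exposes the mapper’s list directly. The sketch below assumes a defensive copy; OrderItemService and its Function-based mapper stand-in are hypothetical, not the project’s actual classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

final class OrderItemService {
    private final Function<Long, List<String>> mapper; // stand-in for the real mapper

    OrderItemService(Function<Long, List<String>> mapper) {
        this.mapper = mapper;
    }

    List<String> listByOrderId(long orderId) {
        // Every caller gets its own list; the mapper's (possibly cached) list never escapes.
        return new ArrayList<>(mapper.apply(orderId));
    }
}
```

Returning Collections.unmodifiableList(...) instead would make accidental mutation fail fast with UnsupportedOperationException, at the cost of breaking any caller that currently modifies the result. That trade-off is the blast radius described above.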
Detecting L1 Cache Pollution
When an API returns a business error but the database looks correct, three signals confirm L1 cache pollution:
- Enable MyBatis SQL logging (log-impl: STDOUT_LOGGING). If the same SELECT appears only once in the log but your code calls it twice, the second call hit L1.
- Print System.identityHashCode(list) at both query sites. Same hash = same object = cache hit.
- Compare list.size() at both sites against the actual SELECT COUNT(*) from the database. A mismatch means the list was mutated after caching.
All three signals together are diagnostic. Any one alone could have other explanations.
Takeaways
- Treat any List returned by a MyBatis mapper as immutable. If you need to add or remove elements, copy it first with new ArrayList<>(source). The L1 cache holds a direct reference to the same object. Your mutation is the cache’s mutation.
- L1 cache scope = SqlSession scope. Under Spring @Transactional, that means the entire method. Two identical queries in the same transaction return the same cached object, and any DML between them wipes the cache entirely (not per-table; all of it). No DML between the two queries means the second one silently returns the cached result, mutations included.
- saveBatch and saveOrUpdateBatch run in a separate SqlSession. Their DML does not clear the outer transaction’s L1 cache. A selectList after saveBatch in the same @Transactional method returns stale data unless you manually clear the cache or run a non-batch DML first.
- When commits months apart each look correct in isolation, the bug they create together is invisible to single-diff code review. Enforce mapper return immutability as a convention. It is simple enough to check in review without tracing the full call graph.