API Performance P95 7s → 0.1s Backend Optimization — Finding the Bottleneck Was Harder Than Fixing It
Hi, I’m Jeongil Jeong, a backend developer working at a proptech platform.
One day, as usual, I was checking the Grafana dashboard for monitoring. The average response times were mostly within a few hundred milliseconds, so I was about to move on thinking “looks fine” — but out of habit, I glanced at the P95 (the slowest 5% of requests) metrics. Things looked quite different there. Some APIs had tail latencies lingering in the multi-second range.
What does a P95 of several seconds mean? It means that 1 out of every 20 users is waiting several seconds for a response. Honestly, I used to think “it’s only the top 5%, so most users are fine, right?” But when you think about it, for high-traffic APIs, the absolute number of that “1 person” is far from small. For an API handling tens of thousands of requests per day, 5% translates to thousands of users experiencing slow responses every single day. I’m confident that if I were a user of the service, I wouldn’t want that experience.
So naturally, I couldn’t just let it slide. I spent about 10 days on intensive performance optimization, and I’d like to share the thought process, decisions, and results from that journey. I hope my experience can be of some help to those facing similar situations. That said, I’m still learning and growing myself, so if you have better approaches, I’d love to hear them.
Starting Point: Where Do We Even Begin?
I knew performance improvements were needed, but when it came time to start, I wasn’t sure where to begin. You can’t set a direction based solely on a vague feeling that things are “slow.”
There’s one thing I want to address here first. I believe the most dangerous thing in performance optimization is starting with the assumption “this is probably the slow part.” The “this” I’m referring to could be not just the API itself, but a specific section of code, a database query, a network bottleneck, or an external API call.
Initially, I thought “wouldn’t introducing a cache for the bottlenecked API help?” But when I actually measured, the bottlenecks were often in places I hadn’t expected. So the very first thing I did was accurately assess the current state.
I started by analyzing bottleneck points one by one using Prometheus metrics, Grafana dashboards, and Tempo’s distributed tracing. From there, I specifically identified whether the bottleneck was in the DB, in a particular section of code, or in external API calls.
If a DB query was the issue, I examined the query execution plan with EXPLAIN. Things that were completely invisible just by reading code became crystal clear when you put numbers in front of them.
After organizing the analysis results, most bottleneck causes fell into three broad categories:
graph LR
Root["Performance Bottleneck"] --> A["Query Inefficiency"]
Root --> B["Sequential Processing"]
Root --> C["Repeated Lookups"]
A --> A1["Non-sargable WHERE"]
A --> A2["N+1 Queries"]
A --> A3["Full Table Scan"]
B --> B1["Sequential API Calls"]
B --> B2["Serial Processing of Independent APIs"]
C --> C1["Repeated DB Lookups for Same Data"]
C --> C2["Duplicate Data Fetching"]
style A fill:#f85149,color:#fff
style B fill:#b35000,color:#fff
style C fill:#1f6feb,color:#fff
- Query inefficiency — WHERE clauses that can’t use indexes, N+1 queries, missing indexes
- Sequential processing — Independent API calls being executed one after another
- Repeated lookups — Fetching the same data from the DB on every request
Now that we’ve identified the bottleneck points, let’s walk through each one in detail.
Query Optimization — There Was a Reason Indexes Weren’t Helping
Ranking API: Why P95 Tail Latency Was Abnormally High
Since our company runs a real estate platform, we have APIs like “transaction volume ranking” and “unit price ranking.” Most single-month queries for these APIs were being processed within normal ranges, but when a cross-year date range query came in, things changed dramatically. In those cases, P95 spiked up to 7.24 seconds.
7 seconds. A 7-second API response. From a user’s perspective, the screen must have felt frozen. They probably gave up and left.
You might be wondering how we could have been running the service in this state. To give some additional context: the client previously only allowed monthly queries, so cross-year date range queries were never actually made. However, customer feedback indicated that “cross-year date range queries were needed,” and we decided to add this feature. The backend was already functionally capable of handling cross-year queries — only the client (frontend) needed updating. This problem was only discovered after the frontend update was completed. So it was a situation where “the service was operational, but customers were experiencing slow responses.”
The data consisted of 1.2 million rows in the monthly price table and 870,000 rows in the monthly transaction table. At the million-row scale, if indexes are properly utilized, there’s no way this level of latency should occur. So I suspected there was some other cause.
First, let me explain the schema of these tables. The year-month information wasn’t stored as a single column like DATE or TIMESTAMP, but was split into aggregation_year (INT) and aggregation_month (INT) — the year and month were in separate columns.
It wasn’t until I ran EXPLAIN that I found the cause.
| |
Can you spot the problem with this query?
Looking at just the code, it might not seem like there’s a problem. It creates an integer like 202603 from year * 100 + month and uses BETWEEN for the range.
But the problem becomes visible when you look at it from the MySQL optimizer’s perspective. When a column expression like aggregation_year * 100 + aggregation_month appears in the WHERE clause, the optimizer cannot use an index. The computed result of the column values isn’t what’s stored in the index. This type of condition is called a Non-sargable (Search ARGument ABLE) expression.
The EXPLAIN result makes it crystal clear:
| Field | Value |
|---|---|
| type | ALL (full table scan) |
| rows | 1,195,038 |
| Extra | Using where; Using temporary; Using filesort |
It was reading all 1.2 million rows, creating a temporary table, and sorting on top of that. The index was completely useless because of this WHERE clause. Sound familiar? Have you ever added an index and thought “that should be fine now”? I certainly have…
If it had been a single date column, a simple range condition like BETWEEN '2025-01' AND '2026-03' would have used the index. But since this was a production table, changing the schema wasn’t feasible. I had to write the best possible query within this split-column structure.
First Improvement: Converting to Sargable Conditions
Once the cause was identified, the direction was clear. Remove the column expression so the optimizer can use the index.
| |
What was originally expressed cleanly in a single formula is now more verbose, but from the optimizer’s perspective, it can now directly compare the aggregation_year and aggregation_month columns.
I added a composite index on (aggregation_year, aggregation_month), and for the household count ranking API, I removed an unnecessary GROUP BY and changed it to query only the most recent month. Household count is an inherent property of an apartment — it doesn’t change monthly. There was no reason to GROUP BY across the entire date range.
Second Improvement: Conditional Branching Based on Query Patterns
The first improvement showed better EXPLAIN results, but analyzing actual usage patterns revealed one more optimization opportunity.
graph TB
Request["Date Range Query Request"] --> Check{"Condition Analysis"}
Check -->|"Single month (most frequent)"| Single["year = ? AND month = ?<br/>→ ref index (equality condition)"]
Check -->|"Same year range"| SameYear["year = ? AND month BETWEEN ? AND ?<br/>→ range scan"]
Check -->|"Cross-year (rare)"| CrossYear["OR pattern maintained<br/>→ range scan"]
style Single fill:#3fb950,color:#fff
style SameYear fill:#1f6feb,color:#fff
style CrossYear fill:#484f58,color:#fff
The vast majority of actual traffic was single-month queries like “this month” or “one month ago.” For a single month, an equality condition of year = ? AND month = ? would suffice, but it was going through the general date range query logic, causing the optimizer to perform a range scan.
So I added branching based on query patterns: equality conditions for single months using ref index lookups, BETWEEN for same-year ranges, and the existing OR pattern only for the rare cross-year cases.
Results
The EXPLAIN results showed immediate improvement:
| Field | Before | After |
|---|---|---|
| type | ALL | ref |
| rows | 1,195,038 | 14,122 |
| Extra | Using temporary; Using filesort | — |
A full scan of 1.2 million rows became an index lookup of 14,000 rows.
Below are screenshots from the production Grafana dashboard showing the transaction volume ranking API’s P95 response time. The first image shows the response time dropping sharply at the deployment point (dotted line), and the second image shows it maintaining stability over 5 days after optimization.


| API | req/s | Before | P50 (prod) | P95 (prod) |
|---|---|---|---|---|
| Transaction Volume Ranking | 16.0 | 1.9~2.4s | 49ms | 95ms |
| Unit Price Ranking | 1.2 | 7.24s | 126ms | 3.1s |
| Household Count Ranking | 0.8 | 6.33s | 125ms | 792ms |
Transaction volume ranking is stable with both P50 and P95 under 100ms. For unit price and household count rankings, most requests (P50) improved to ~125ms, but tail latency remains for cross-year date range queries. The fundamental limitation lies in the schema structure where year and month are separated. Adding a single column like year_month (INT, e.g., 202603) using MySQL’s Generated Column would solve this with a simple BETWEEN, but since it involves migrating a production table with over 1.2 million rows, we plan to proceed with this over time.
This experience really drove home that clean-looking expressions like year * 100 + month can actually cause full scans of 1.2 million rows. Query optimization isn’t simply about whether an index exists — it’s about writing queries that can actually use the index. If the optimizer can’t use the index due to how the query is written, having the index is no different from not having one at all.
Batch N+1: Processing 1,000 Records Was Generating 4,000 Queries
Our platform has a batch job that runs every morning. It sends notifications to users when new transactions occur for apartments they’ve bookmarked. The average execution time had reached about 50 minutes. While it sometimes finished in the 30-minute range, on days with heavy new transaction data, it repeatedly spiked to 70–87 minutes (based on the BATCH_JOB_EXECUTION metadata table).
An average of 50 minutes, maximum of 87 minutes — that’s quite long. While the batch was running, new transaction notifications would queue up, and users would experience “a transaction closed but the notification came late.”
The batch had to process tens of thousands of bookmarked apartments. The problem was the number of queries generated while processing those tens of thousands.
Simplified, the existing code flow looked like this:
sequenceDiagram
participant Batch as Batch Job
participant DB as Database
loop Per bookmarked apartment
Batch->>DB: Query notification history
Batch->>DB: Query floor plan info
Batch->>DB: Query new transactions (4M~7M row table)
Batch->>DB: Query notification settings
end
Note over Batch,DB: Processing 1,000 records → ~4,001 queries
A classic N+1 problem. To process 1,000 bookmarked apartments, about 4,001 queries were being fired. For tens of thousands… the math is painful to even think about.
On top of that, the Reader query that fetched bookmarked apartments itself had issues. It was using a SELECT MAX(id) subquery for deduplication, and this subquery was causing a full table scan on every page. The more data accumulated, the slower it got — reading a single page took 35.6 seconds.
Solution: Page-Level Prefetch + Batch Queries
There are several ways to address N+1. Let me first outline the trade-offs:
| Strategy | Pros | Cons | Best For |
|---|---|---|---|
@BatchSize | One annotation to apply | Requires entity relationship access | JPA lazy loading N+1 |
JOIN FETCH | Solves in one query | No pagination, Cartesian product risk | Few, small related entities |
| Prefetch + IN clause | No structural constraints | Manual implementation, cache management | Direct repository call pattern |
Initially, I considered @BatchSize — it’s the simplest. But looking closely at the batch structure, there was a blocker. Spring Batch’s Reader fetched bookmarked apartments page by page, and the Processor queried supplementary data (notification history, floor plan info, etc.) via separate queries for each item. Since it was directly calling repositories rather than accessing through entity relationships, JPA-level fetch strategies couldn’t help. JOIN FETCH was ruled out for the same reason.
So the only option left was prefetching. I implemented a prefetch cache that queries supplementary data in bulk using IN clauses when the Reader fetches a page, storing results in a ConcurrentHashMap.
sequenceDiagram
participant Batch as Batch Job
participant Cache as Prefetch Cache
participant DB as Database
Batch->>DB: Query bookmarked apartments page (batch unit)
Batch->>DB: Bulk query notification history (IN clause)
Batch->>DB: Bulk query floor plan info (IN clause)
Batch->>Cache: Load into cache
loop Per bookmarked apartment
Batch->>Cache: Lookup from cache (no DB call)
end
Note over Batch,DB: Processing 1,000 records → 2 batch queries
Four queries per item became 2 batch queries per page.
I also changed the Reader’s deduplication query to a NOT EXISTS pattern leveraging a composite index.
For notification settings lookups, there was a pattern of repeatedly querying the same user’s settings, so I applied an in-memory cache based on ConcurrentHashMap. Duplicate notification filtering was also changed from per-item EXISTS queries to batch queries + in-memory filtering.
Results
| Metric | Before | After | Improvement |
|---|---|---|---|
| Reader 1 page query | 35.6s | 15ms | ~2,400x |
| Query count (per 1,000 records) | ~4,001 | 2 | ~2,000x |
With Reader optimization and N+1 resolution combined, overall batch execution time dropped to an average of about 21 minutes (based on the BATCH_JOB_EXECUTION metadata table). The pre-optimization spikes of 70–87 minutes completely disappeared, and the maximum stabilized to within 25 minutes.
One thing that might seem puzzling: query count decreased 2,000x, but overall execution time only decreased by 59%? Even after optimization, each item was still taking over 10 seconds on average. The reason is that most of this time was spent on external notification API calls — email, SMS, push, etc. The DB bottleneck was resolved, but external API calls were an area we couldn’t control. In the end, it was a numerical confirmation of what might seem obvious: “no matter how much you optimize DB queries, if the majority of time is spent on external calls, there’s a limit.”
I also wrote thorough test code. Honestly, modifying batch code was the most nerve-wracking part. If something goes wrong, incorrect notifications or missing notifications could affect tens of thousands of users. That’s why I believe securing ample test coverage is crucial.
Contract Statistics API: Aggregation Was Happening in Code
The contract statistics API also showed a P95 tail latency of 2.4 seconds. It’s an API that simply shows counts and totals — why was it taking so long? Honestly, I would have missed this if I hadn’t sorted by P95 in Grafana. The average response time looked perfectly fine. That’s when I realized the danger of relying on averages.
Looking at the code, there were two issues.
First, it queried by settlement date, but there was no index on the settlement date column. EXPLAIN showed the same pattern as the ranking API:
| Field | Value |
|---|---|
| type | ALL (full table scan) |
| rows | 2,021 |
| key | NULL (no index used) |
A table with only 2,000 rows was doing a full scan. Since the data volume was small, it returned results even without an index, which is likely why index design was overlooked during development. The impact was less noticeable because of the small dataset, but structurally, it was the same problem as the ranking API.
Second, there was a more fundamental issue. Instead of using COUNT and SUM in SQL, the code was fetching all data and then counting with .size and summing with .sumOf {}. When the DB could send just one result row with aggregation, it was pulling all rows to the application and aggregating in memory.
| |
| |
I added an index on the settlement date column and switched to SQL aggregation. Running EXPLAIN again showed type changed to ref, properly using the index. P95 dropped from 2.4 seconds to under 100ms.
The code change was less than 50 lines. Yet it eliminated unnecessary data transfer and reduced application memory usage in one stroke. Cases like this remind me why the basic principle of “let the DB do what the DB does best” matters so much.
Parallelization — Why Were Independent Operations Running Sequentially?
After finishing query optimization, I turned my attention to sequential processing sections. Looking at Tempo trace span waterfalls, I kept seeing this pattern: calls with no dependencies on each other lined up one after another.
Notification Delivery: Email and SMS Were Being Sent Serially
Our service is a platform that connects real estate agents with users. When a match is made, we send matching notifications via email and SMS. The notification delivery API also showed high tail latency at P95.
When both email and SMS were sent, the response delays from both external services added up serially, reaching up to 4.9 seconds. Opening the Tempo trace immediately revealed the cause:
sequenceDiagram
participant Client
participant Server as Notification Service
participant Email as Email Service
participant Sms as SMS Service
Client->>Server: POST /matching-alarm
Server->>Server: Query matching info (91ms)
Server->>Server: Query agent info (27ms)
Server->>Email: Send email (~1.3s)
Note over Server,Email: Waiting for completion
Email-->>Server: Done
Server->>Sms: Send SMS (~1.3s)
Note over Server,Sms: Waiting for completion
Sms-->>Server: Done
Server-->>Client: Response (P95 ~4.9s)
It was a sequential execution structure where SMS couldn’t start until email was done. Even though email and SMS have absolutely no dependency on each other. Of course, notifications don’t have a dependency on the matching itself, so they don’t strictly need to be sent in real-time — but I thought it was best to avoid structures that create bottlenecks.
Does the email result affect SMS? Generally, no. So why were they running sequentially? Probably because when the code was first written, it was natural to write top-to-bottom. Honestly, I catch myself doing the same thing when I’m not being deliberate about it, so I couldn’t point fingers.
Solution: Parallel Execution with Coroutines
No dependencies, so the answer was clear. I changed it to parallel execution using coroutine async.
sequenceDiagram
participant Client
participant Server as Notification Service
participant Email as Email Service
participant Sms as SMS Service
Client->>Server: POST /matching-alarm
Server->>Server: Query matching/agent info (~120ms)
par Parallel Execution
Server->>Email: Send email (~1.3s)
and
Server->>Sms: Send SMS (~1.3s)
end
Email-->>Server: Done
Sms-->>Server: Done
Server-->>Client: Response (P95 ~2s)
Inside a coroutineScope, both tasks start simultaneously with async, and awaitAll() waits for both to complete.
There was one consideration here. If email delivery fails, should SMS also fail? With the default coroutineScope, if one child coroutine fails, the rest get cancelled. But email and SMS are different channels. If we skip SMS just because email failed, the user receives no notification at all. So I used supervisorScope to isolate failures, ensuring one channel’s failure doesn’t affect the other.
| |
Response time changed from sum(email, SMS) to max(email, SMS), and measured P95 dropped from 4.9 seconds to about 2 seconds. I continued finding and improving other areas that could benefit from parallelization.
Independent DB Queries in Parallel Too
A similar pattern was found in DB queries. Within a single API, 3 DB queries were executing sequentially, but analysis showed the third query had no dependency on the first or second.
graph LR
subgraph Before["Before: Serial Execution (~150ms)"]
direction LR
B1["1. Query A<br/>~50-80ms"] --> B2["2. Query B (needs A's result)<br/>~30-50ms"]
B2 --> B3["3. Query C (independent)<br/>~30-50ms"]
end
subgraph After["After: Partial Parallel Execution (~100ms)"]
direction LR
A1["1. Query A<br/>~50-80ms"] --> A2["2. Query B (needs A's result)<br/>~30-50ms"]
A3["3. Query C (independent)<br/>~30-50ms"]
end
style B3 fill:#f85149,color:#fff
style A3 fill:#3fb950,color:#fff
Query B needs Query A’s result, so sequential execution is correct. But Query C can run independently of A and B. I changed Query C to execute in parallel with A+B, reducing this logic’s execution time from about 150ms to 100ms. While the absolute time difference isn’t huge, this logic was shared across multiple modules, so a single change improved P95 by about 30% across several APIs.
Caching — Why Fetch the Same Data Every Time?
After fixing queries and parallelizing independent calls, the next thing that caught my eye was the pattern of fetching the same data from the DB on every request.
Apartment Details: 13+ DB Queries Per Request
Building a single detail page requires various types of data. Basic info, nearby facilities, transportation, loan info, etc. are distributed across different modules and databases. Looking inside these APIs, I found that a single request triggered 13+ DB queries and heavy queries joining 6+ tables. When I first saw this span list in the Tempo trace, I thought “all of this happens in one request?”
graph TB
User["User Request"] --> Detail["Detail Page"]
Detail --> Basic["Basic Info<br/>DB JOIN 6 tables"]
Detail --> Fac["Facilities/Transportation<br/>12 DB queries"]
Detail --> Loan["Loan Info<br/>DB query"]
Detail --> Type["Floor Plan Info<br/>DB query"]
Detail --> Recon["Reconstruction Schedule<br/>DB query"]
style Fac fill:#f85149,color:#fff
style Basic fill:#b35000,color:#fff
Facilities and transportation info was particularly bad. Fetching POI (Point of Interest) data for schools, hospitals, marts, parks, buses, subways, highways, etc. required 12 DB queries. Without caching, when all 12 queries hit, P95 tail latency naturally climbed to multiple seconds.
But let me ask a question here. Do the schools and subway stations near an apartment change on every request? Probably not. Schools don’t relocate overnight and subway lines don’t suddenly change. This data is semi-static with extremely low change frequency. Yet we were making 12 DB queries on every request. There was no reason to hit the DB every time.
Applying Redis Cache-Aside Pattern
When considering a caching strategy, I first weighed Write-Through vs. Cache-Aside.
| Strategy | Mechanism | Pros | Cons | Best For |
|---|---|---|---|---|
| Write-Through | Update cache on write | High cache-DB consistency | Unnecessary overhead for rarely written data | Frequently read and written data |
| Cache-Aside | Check cache first on read | Maximizes read performance, simple implementation | Temporarily stale on TTL expiry | Low-change, read-heavy data |
The data we wanted to cache was low-change, read-heavy data. There was almost never a need to update the cache on write. Previous caching experience had confirmed that Cache-Aside works well for this pattern, and the same judgment applied here. However, this time I needed to differentiate TTL strategies based on data characteristics.
sequenceDiagram
participant Client
participant App as Application
participant Redis
participant DB
rect rgb(35, 134, 54)
Note over Client,DB: Cache Hit
Client->>App: GET /aparts/{id}/facilities
App->>Redis: GET apart:detail:facilities:{id}
Redis-->>App: Return cached data
App-->>Client: Response (~2ms)
end
rect rgb(248, 81, 73)
Note over Client,DB: Cache Miss
Client->>App: GET /aparts/{id}/facilities
App->>Redis: GET apart:detail:facilities:{id}
Redis-->>App: null
App->>DB: 12 DB queries
DB-->>App: Data
App--)Redis: SET (async, TTL 14 days ± 1 day)
App-->>Client: Response (~500ms)
end
Initially, I was going to apply a uniform 24-hour TTL to all caches. But it seemed inefficient for school and subway information around apartments to expire after just one day. Conversely, setting too long a TTL like 14 days for loan information containing interest rates might show users outdated rates. I concluded that “a blanket TTL won’t work” and ended up examining each data type’s change frequency to set appropriate TTLs.
| Data | TTL | Rationale |
|---|---|---|
| POI (schools, hospitals, marts, parks) | 14 days ± 1 day | Extremely low change frequency |
| Transportation (bus, subway, highway) | 14 days ± 1 day | Low route change frequency |
| Loan info | 7 days ± 1 day | Interest rate fluctuation cycles |
| Basic info (DB JOIN) | 6 hours ± 30 min | Needs transaction data updates |
| Reconstruction schedule | 7 days ± 1 day | Schedule updates relatively rare |
| Floor plan info | 7 days ± 1 day | Structural changes rare |
Did you notice the ± random jitter next to the TTLs? There’s a reason for that. Popular apartments get viewed simultaneously by many users. What happens if caches with the same TTL all expire at the same time? Dozens of requests flood the DB at once — a Cache Stampede phenomenon.
In the actual implementation, TTLs are defined as enums, and jitter is calculated using ThreadLocalRandom at query time.
| |
Since each instance and request time gets a different expiration time even for the same data, stampedes are naturally distributed.
I also ensured caches are invalidated when the underlying data gets updated. For example, loan information cache is cleared when interest rates change, and reconstruction schedule cache is cleared when the schedule is modified. This way, caches can be refreshed before TTL expiration, reducing situations where users see stale information.
That covers “what to cache and for how long.” I wish that were the end of it, but TTL wasn’t the only consideration when introducing caching. Questions like “what if the cache causes an outage?” and “is this actually effective?” kept coming up, and these concerns turned out to be more important than I expected.
Async Cache Storage
I also applied caching to the autocomplete API used for searching apartments and subway stations, and another consideration came up.
Autocomplete is an API that gets called every time a user types a character. If the response is slower than typing speed, user experience degrades drastically — it’s an API extremely sensitive to latency. But when a cache miss occurs and we synchronously execute both the DB fetch and Redis storage, the response is delayed by the Redis write time.
I asked myself a question: what happens if cache storage fails? The next request will just try again, right? If so, there’s no reason to block the user’s response. I switched to fire-and-forget async storage.
sequenceDiagram
participant User
participant App as Application
participant DB
participant Redis
rect rgb(248, 81, 73)
Note over User,Redis: Before: Synchronous Storage
User->>App: Type "Raemian"
App->>Redis: GET (miss)
App->>DB: Search
DB-->>App: Results
App->>Redis: SET (blocking wait)
Redis-->>App: OK
App-->>User: Response
Note over App,Redis: Response delayed by Redis SET time
end
rect rgb(35, 134, 54)
Note over User,Redis: After: Async Storage (fire-and-forget)
User->>App: Type "Raemian"
App->>Redis: GET (miss)
App->>DB: Search
DB-->>App: Results
App-->>User: Response (returned immediately)
App--)Redis: SET (separate coroutine, non-blocking)
end
| |
Honestly, I initially implemented it with GlobalScope.launch. It worked fine. But during code review, two issues were flagged. First, running coroutines wouldn’t be properly cleaned up during application shutdown. Second, using a regular Job meant one cache storage failure could cancel other coroutines in the same scope. I almost moved on thinking “it works, so it’s fine” — but after the review, I realized these were problems that could absolutely surface in production. This is where the power of code review really shines. So I changed to SupervisorJob for failure isolation and DisposableBean’s destroy() to cancel the scope on shutdown.
The Service Must Keep Running Even When Cache Fails
Another critical point when introducing caching: “The service must keep running even if the cache (e.g., Redis) goes down.”
I believe performance-optimizing caches should be treated strictly as a means, and if Redis has issues, the system should gracefully fall back to the DB.
| |
Both Redis reads and writes are wrapped in try-catch. If Redis has issues, it’s treated as a cache miss and falls back to the original path (direct DB query). Slower performance is acceptable, but the service itself going down is not. I think the role of a cache is “fast when available, normal when not.” I didn’t want caching to be the cause of an outage.
Ensuring Cache Observability
After introducing caching, I wanted to verify “is this actually working?” Without numbers to confirm whether TTLs are appropriate, what the hit rate is, and whether Redis is slowing down, you’re just running on gut feeling.
So I set up Micrometer counters to collect cache hit/miss counts, tagged by cache type (basic, facilities, loans, types, reconstruction, price) for Grafana monitoring. Redis GET/SET latencies are also collected as timer metrics. I created a dedicated Cache Hit Rate Monitor dashboard to track hit rates by cache type in real-time, which proved quite useful later as the basis for TTL adjustments. Monitoring is needed before deployment, after deployment, and continuously.
Results
These are production Prometheus metric results. The P95 response times shown represent what actual users experience, with a mix of cache hits and misses. The first image shows spikes before the caching deployment, and the second shows the stable current state after deployment.

Due to Prometheus storage limitations (4GB), the above Before image shows only a few days of records before deployment. The Before P95 values in the table below (4,045ms, 4,840ms) were recorded during peak traffic hours before caching was introduced and had already expired from Prometheus during the optimization period.

| API | req/s | Before P50 | After P50 | Before P95 | After P95 | After P99 |
|---|---|---|---|---|---|---|
| Basic Info | 568 | 183ms | 43ms | 4,045ms | 50ms | 50ms |
| Facilities/Transportation | 525 | 497ms | 119ms | 4,840ms | 168ms | 205ms |
| Reconstruction Schedule | 29 | 124ms | 42ms | — | 50ms | 55ms |
| Floor Plan Info | 28 | 147ms | 46ms | — | 63ms | 109ms |
The basic info and facilities/transportation APIs handle 568 and 525 requests per second respectively. On a daily basis, that’s approximately 49 million and 45 million requests. P95 dropped from the 4-second range to 50ms for basic info and from 4.8 seconds to 168ms for facilities/transportation, with P99 staying under 200ms.
You might wonder “doesn’t a cache miss actually make things slower by the Redis query time?” When I checked, Redis GET returning null took only 1~5ms. Even on a cache miss, the performance impact compared to the original is negligible. I did set a Redis timeout of 100ms to cut off early if Redis becomes unresponsive — this timeout rarely triggers, but if it does, the cache query is treated as a failure and falls back to the DB.
Overall Results Summary
After about 10 days of work, here’s the complete summary. Before listing the numbers, I want to note that all these metrics are from the production Prometheus environment. The Grafana dashboards below provide a side-by-side comparison of before and after optimization.


| Area | API / Component | Before | After (prod) | Improvement |
|---|---|---|---|---|
| Query | Transaction Volume Ranking | P95 1.9~2.4s | 95ms | ~21x |
| Query | Unit Price Ranking | P50 7.24s | 126ms | ~57x |
| Query | Household Count Ranking | P50 6.33s | 125ms | ~51x |
| Query | Contract Summary | P95 2.4s | <100ms | ~24x |
| Batch | Reader 1 page | 35.6s | 15ms | ~2,400x |
| Batch | Query count (per 1K records) | ~4,001 | 2 | ~2,000x |
| Batch | Total execution time | ~50 min (max 87 min) | ~21 min (max 25 min) | ~59% |
| Parallelization | Notification delivery (P95) | 4.9s | ~2s | ~59% |
| Parallelization | Independent DB query parallelization | ~150ms | ~100ms | ~30% |
| Caching | Basic info (P95, 568 req/s) | 4,045ms | 50ms | ~99% |
| Caching | Facilities/Transportation (P95, 525 req/s) | 4,840ms | 168ms | ~97% |
Unit price and household count rankings use P50 as the Before baseline. Even the majority of requests (median) were in the 7-second range, so P50 was used instead of P95 for the Before measurement.
graph LR
subgraph QueryOpt["1. Query Optimization"]
Q1["Non-sargable → Sargable"]
Q2["N+1 → Batch Prefetch"]
Q3["Code Aggregation → SQL Aggregation"]
end
subgraph Parallelization["2. Parallelization"]
P1["Serial Calls → Parallel"]
P2["Independent API Parallelization"]
end
subgraph Caching["3. Caching"]
C1["Redis Cache-Aside"]
C2["Per-data TTL + Jitter"]
C3["Async Cache Storage"]
end
QueryOpt --> Result["Overall P95<br/>Dramatically Improved"]
Parallelization --> Result
Caching --> Result
style QueryOpt fill:#238636,color:#fff
style Parallelization fill:#1f6feb,color:#fff
style Caching fill:#8957e5,color:#fff
Reflections
“Having an Index” and “Using an Index” Are Different Problems
Even before this work, I used EXPLAIN — but honestly, I admit there were times I thought “the index exists, so it should be fine” and moved on. I suspect I’m not the only one. After experiencing the non-sargable query issue firsthand, I truly internalized that having an index and the optimizer actually using that index are completely different problems. It reminded me to be much more thorough going forward.
I also realized that code reviews alone can’t catch performance issues. Queries that look clean in code can trigger full scans, and batches that seem to work fine can fire thousands of unnecessary queries. I too initially approached things by guessing and ended up digging in the wrong places. In the end, whether it’s EXPLAIN, Tempo traces, or Prometheus metrics, there’s no substitute for approaching things based on numbers and data from your tools.
What Technology to Use Matters Less Than What the Problem Is
When introducing caching, I initially started by pondering “what’s the optimal cache strategy?” — Write-Through vs. Cache-Aside, how to set TTLs. I jumped straight to technology selection. But looking back, identifying “what data is being repeatedly queried right now” should have come first. Once you accurately understand the problem, the technology choice follows naturally.
The same was true for parallelization. Rather than which async framework to use, the key was first identifying which sequential operations were actually independent. Ultimately, I believe performance optimization starts not with finding better technology, but with accurately identifying current inefficiencies. It might sound obvious, but you need to pinpoint the problem before the right solution can emerge.
Without Monitoring, We Couldn’t Have Even Started
The monitoring system we built previously was a prerequisite for this entire effort. I can’t emphasize this enough. Even the fact that the facilities/transportation API was making 12 DB queries — by code alone, spread across multiple modules, the full picture wasn’t visible at a glance. It was only after expanding the Tempo trace span waterfall that I discovered “oh, all of this happens in a single request.” The saying “you can’t improve what you can’t measure” has never resonated with me as strongly as it did this time.
Closing Thoughts
Looking back, none of the techniques used in this work — fixing non-sargable queries, resolving N+1, adding indexes, parallelization, Cache-Aside — were new technologies. They’re all fundamentals you’ve probably heard of at least once. It might sound cliche, but in the end, the basics are the most powerful.
That said, just because something is fundamental doesn’t mean it’s easy to apply. In fact, finding where things are slow was the hardest part, and once the cause was identified, the fix was often straightforward. A line of EXPLAIN, a few lines of coroutine async, a few lines of Redis cache configuration. The amount of code changed wasn’t much, but the impact was significant. That’s why I titled this article “Finding the Bottleneck Was Harder Than Fixing It.”
One more thing I did after completing this work: this time, I happened to discover the issues by habitually checking the dashboard. But I couldn’t rely on chance next time. So I set up Grafana Alerts to notify me when any API’s P95 exceeds a certain threshold. Next time, the system will tell me first.
Of course, there are still remaining challenges. Some heavy query patterns haven’t been addressed yet, and N+1 issues in legacy service stored procedures remain. These have larger scope, so I plan to improve them gradually over time. It’s a bit disappointing that I can’t say it’s perfectly done, but performance optimization is an area that never truly “ends.” There are no perfect answers — only choices that fit your service’s current situation.
If you’re facing a similar situation, I’d recommend starting with EXPLAIN. The cause of slow queries might be simpler than you think. I certainly learned that firsthand this time.
Thank you for reading this long article.
References
Related Posts
Query Optimization
- Use The Index, Luke — Functions in WHERE Clause — Non-sargable queries and index invalidation patterns
- MySQL Official Docs — EXPLAIN Output Format — Guide to interpreting EXPLAIN output
- MySQL Official Docs — Optimization and Indexes — Composite index design strategies
Caching
- AWS — Database Caching Strategies Using Redis — Cache-Aside vs Write-Through comparison
- Cache Stampede Problem and Solutions — Instagram’s Thundering Herd mitigation
Parallelization / Async
- Kotlin Official Docs — Coroutines Basics — Coroutine fundamentals and structured concurrency
- Kotlin Official Docs — Shared Mutable State and Concurrency — SupervisorJob, exception propagation and isolation
Monitoring
- Micrometer Official Docs — Cache Instrumentations — Cache hit/miss metric collection