Identifying and Eliminating Extra DB Latency

I would like to share my experiences with troubleshooting a recent issue on Oracle databases in our datacenters. While the details in this post involve batch jobs, the lessons learned can apply to any business with Oracle SQL traffic between two or more datacenters.

The problem

We noticed that the running time on a set of batch jobs had suddenly increased substantially. Some increase was understandable, since we had moved a significant percentage of the database hosts accessed by these batch jobs to another eBay datacenter. However, the latency had jumped beyond what had been expected or could be explained.

The investigation

The first step was to check our logs for the most expensive SQL statements.

For batch jobs, it is very easy to distinguish intra-colo from cross-colo transactions, since all batches run on boxes located in the same datacenter. I immediately found a discrepancy between the SQL times for intra-colo calls compared to cross-colo calls. I expected to see a consistent difference of 20 ms for both minimum and average times, since 20 ms is the network latency between the two datacenters involved. In reality, the differences far exceeded 20 ms. The following table shows the statistics collected for approximately 8500 calls in an hour:




              Intra-colo        Cross-colo
Exec Count    ~8500             ~8500
Total Time    49028.00 ms       307758.00 ms

On further investigation, I discovered an additional complication. For some queries, the time difference was consistently 20 ms, and for other queries it was consistently 40 ms.
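A quick sanity check on the numbers above bears this out (the per-side execution count is an assumption based on the "approximately 8500 calls" figure; the totals are from the table):

```python
# Sanity-check the per-call latency gap from the table above.
# Assumes ~8500 executions on each side (the approximate figure from the text).

calls = 8500
intra_total_ms = 49028.0    # total SQL time, intra-colo
cross_total_ms = 307758.0   # total SQL time, cross-colo

intra_avg = intra_total_ms / calls   # average intra-colo call time
cross_avg = cross_total_ms / calls   # average cross-colo call time
extra = cross_avg - intra_avg        # extra latency per cross-colo call

print(round(intra_avg, 1), round(cross_avg, 1), round(extra, 1))  # 5.8 36.2 30.4
```

An average gap of about 30 ms, rather than the expected 20 ms, is exactly what a mix of 20 ms and 40 ms queries would produce.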

Next, I looked into database performance. I found that the query execution time was the same on all of our database hosts, and that the issue was reproducible independent of the batch jobs, using only tnsping and the Oracle SQL client.

At this point, I had all but ruled out a database issue; the statistics showed the same results for all DB hosts in the datacenter. Instead, I suspected an issue related to either the network or network security.

A possible firewall capacity problem was ruled out because a firewall's latency contribution is constant and much lower than the latency introduced by the network. When we looked at the network, ping consistently showed 20 ms, regardless of the time of day or the traffic activity. Of course, in real life we do not merely exchange pings; we use actual SQL queries to communicate with DB hosts, so we next looked at a tcpdump analysis on the DB hosts.

As the root cause of our problem was still unknown, there was no guarantee that the issue could be reproduced by using tcpdump while running queries manually. After trying twice and not getting enough information, we concluded the only reliable way to reproduce the issue was to run tcpdump while the batch job was running.

The last thing to do was to wait until the next high-volume batch run. Finally, we got enough data, and we almost immediately saw an answer (saving us from having to dive even deeper and look at MTU or window size as the cause of additional round trips).  Here is the analysis of the packets for a single query:

No.  Time     Source  Destination  Protocol  Length  Info
1    0.000000 [IP1]   [IP2]        TNS       549     Request, Data (6), Data
2    0.000714 [IP2]   [IP1]        TNS       852     Response, Data (6), Data
3    0.021257 [IP1]   [IP2]        TNS       71      Request, Data (6), Data
4    0.021577 [IP2]   [IP1]        TNS       153     Response, Data (6), Data

Processing the query required four packets spanning about 21.6 ms of capture time. Since the round-trip time (RTT) between the datacenters is about 20 ms, the one-way transit time is about 10 ms: the request spent roughly 10 ms in flight before it was captured in the second datacenter as packet 1, and after packet 4 the final response spent another roughly 10 ms returning to the first datacenter. The total time for these four packets, as observed from the first datacenter, was therefore about 41.6 ms.
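The total can be reconstructed directly from the capture timestamps above, assuming symmetric paths (one-way transit is half the 20 ms RTT):

```python
# Reconstruct the client-observed query time from the tcpdump above.
rtt_ms = 20.0                 # measured inter-datacenter round-trip time
one_way_ms = rtt_ms / 2       # assume symmetric paths: ~10 ms each way

capture_span_ms = 21.577      # packet 1 -> packet 4, from the capture timestamps
total_ms = one_way_ms + capture_span_ms + one_way_ms

print(round(total_ms, 1))     # 41.6 -- two full round trips for a single query
```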

We saw that the SELECT statement was in packet 1 and that the response appeared to be in packet 2. We didn’t know what packets 3 and 4 were, but their small size suggested they were some kind of acknowledgement.

The solution

At this point, we had confirmation that each query involved four network packets instead of two. The fact that every execution took two round trips indicated that more soft parsing was occurring than necessary. The question was: why were commonly run queries being parsed multiple times? Once parsed, a query should be added to the app server's statement cache. We increased the statement cache size parameter, which dictates how many parsed SQL statements can remain in the cache at a given time, from 80 to 500.
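To see why a too-small cache forces constant re-parsing, here is a minimal sketch: an LRU map stands in for the app server's statement cache, and the workload of 100 distinct statements is hypothetical, but any working set larger than the cache behaves the same way.

```python
from collections import OrderedDict

def hit_rate(cache_size, statements, rounds):
    """Simulate an LRU statement cache over a round-robin workload."""
    cache = OrderedDict()
    hits = misses = 0
    for _ in range(rounds):
        for sql in statements:
            if sql in cache:
                cache.move_to_end(sql)   # hit: handle reused, no parse round trip
                hits += 1
            else:
                misses += 1              # miss: soft parse = extra round trip
                cache[sql] = True
                if len(cache) > cache_size:
                    cache.popitem(last=False)  # evict least recently used
    return hits / (hits + misses)

workload = [f"SELECT ... /* stmt {i} */" for i in range(100)]  # hypothetical set
print(hit_rate(80, workload, 10))   # 0.0 -- LRU thrash: every execution re-parses
print(hit_rate(500, workload, 10))  # 0.9 -- only the first pass misses
```

With a working set of 100 statements cycling through an 80-entry LRU cache, each statement is evicted just before it is needed again, so every single execution pays the parse round trip; once the cache exceeds the working set, only the warm-up pass misses.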

When the tuned settings went into effect, we were able to eliminate an extra cross-colo round trip for the majority of SQL requests. We immediately saw a positive impact, observing a 30% reduction in execution time across all instances of the batch job.

This experience demonstrates how even small changes in DB settings can positively affect execution response times. And for me, it was both challenging and interesting to drive problem resolution in an area where I did not have much prior experience.

9 thoughts on “Identifying and Eliminating Extra DB Latency”

  1. Xu Jian

Does Connection.prepareStatement(sql) always involve a round trip between the app server and the DB?

  2. jametong

The parse phase needs a round trip from the client to the Oracle database. With the prepared statement cache, if the statement already exists in the cache, the Oracle JDBC driver can reuse that statement handle to execute, saving the round trip for the parse phase.

  3. jametong

I recommend that the blog writer keep an eye on the JVM memory usage of the system, since a much larger prepared statement cache needs more JVM memory to hold the cursors and associated data structures.

    1. Konstantin Yurin Post author

      Thank you for emphasizing the importance of proper application tuning. Certainly, application owners must carefully balance performance and resource usage!

  4. jametong

Using the OCIStmtPrepare() call passes to it a previously allocated statement handle. This is a completely local call, requiring no round trip to the server.

Passing a previously allocated statement handle is a must; if the client sends a parse call to the server, it will need a round trip to the server.

  5. Erik Brandsberg

Great analysis. As people move to the cloud, one of the silent performance killers is that latency between nodes can become a much larger issue. Many applications issue dozens of database queries per page render, and adding even 1 ms to each query, versus being connected to the same switch as the database, can impact performance dramatically. Even if individual page renders are acceptable, having more pages in the process of being rendered at any one time can increase memory usage and other variables that slow the system down even further. A common solution is to leverage in-memory caching technologies such as Hazelcast or Redis, although the main problem is the cost of modifying the code and knowing what to cache. One answer to this is a new product called Heimdall Data (disclaimer: I’m CTO), which provides a transparent system to help analyze the database performance of production systems, isolate cacheable queries, and implement a rule-based system for caching without code changes while maintaining data integrity. The fastest queries are those that are never sent to the database in the first place, no matter what the RTT.
