eBay Tech Blog

Identifying and Eliminating Extra DB Latency

by Konstantin Yurin on 09/05/2012

in Software Engineering

I would like to share my experiences with troubleshooting a recent issue on Oracle databases in our datacenters. While the details in this post involve batch jobs, the lessons learned can apply to any business having Oracle SQL traffic between two or more datacenters.

The problem

We noticed that the running time on a set of batch jobs had suddenly increased substantially. Some increase was understandable, since we had moved a significant percentage of the database hosts accessed by these batch jobs to another eBay data center. However, the latency had jumped beyond what had been expected or could be explained.

The investigation

The first step was to check our logs for the most expensive SQL statements.

For batch jobs, it is very easy to distinguish cross-colo from intra-colo transactions, since all batches run on boxes located in the same datacenter. I immediately found a discrepancy between the SQL times for intra-colo calls and cross-colo calls. I expected to see a consistent difference of 20 ms in both the minimum and average times, since 20 ms is the network latency between the two datacenters involved. In reality, the differences far exceeded 20 ms. The following table shows the statistics collected for approximately 8500 calls in an hour:

Statistic            Intra-Colo          Cross-Colo
Exec Count           8600                8490
Min/Max/Avg (ms)     1.00/119.00/5.70    21.00/487.00/36.25
Total Time           49028.00 ms         307758.00 ms

On further investigation, I discovered an additional complication. For some queries, the time difference was consistently 20 ms, and for other queries it was consistently 40 ms.
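As a quick sanity check (a sketch; the figures come from the table above, and 20 ms is the measured cross-colo round-trip time), the average extra latency falls between one and two round trips, consistent with that mix of 20 ms and 40 ms differences:

```java
public class LatencySanityCheck {
    public static void main(String[] args) {
        double avgIntraColoMs = 5.70;   // average intra-colo SQL time (from the table)
        double avgCrossColoMs = 36.25;  // average cross-colo SQL time
        double rttMs = 20.0;            // measured cross-colo round-trip time

        double extraMs = avgCrossColoMs - avgIntraColoMs;   // 30.55 ms
        double roundTrips = extraMs / rttMs;                // ~1.5 round trips

        System.out.printf("extra latency: %.2f ms (~%.2f round trips)%n",
                extraMs, roundTrips);
    }
}
```

An average of roughly 1.5 round trips of unexplained latency is exactly what a mix of one-extra-round-trip and two-extra-round-trip queries would produce.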

Next, I looked into database performance. I found that the query execution time was the same on all of our database hosts, and that the issue was reproducible independent of the batch jobs, using only tnsping and the Oracle SQL client.

At this point, I had all but ruled out a database issue; the statistics showed the same results for all DB hosts in the datacenter. Instead, I suspected an issue related to either the network or network security.

A possible firewall capacity problem was ruled out because the firewall's latency contribution is constant and much lower than the latency introduced by the network. When we looked at the network, we saw that ping consistently showed 20 ms, regardless of the time of day or the traffic activity. Of course, in real life we do not merely exchange pings, but rather use actual SQL queries to communicate with DB hosts, so we looked at a tcpdump analysis of the DB hosts next.

As the root cause of our problem was still unknown, there was no guarantee that the issue could be reproduced by using tcpdump while running queries manually. After trying twice and not getting enough information, we concluded the only reliable way to reproduce the issue was to run tcpdump while the batch job was running.

The last thing to do was to wait until the next high-volume batch run. Finally, we got enough data, and we almost immediately saw an answer (saving us from having to dive even deeper and look at MTU or window size as the cause of additional round trips).  Here is the analysis of the packets for a single query:

No.  Time     Source  Destination  Protocol  Length  Info
1    0.000000 [IP1]   [IP2]        TNS       549     Request, Data (6), Data
2    0.000714 [IP2]   [IP1]        TNS       852     Response, Data (6), Data
3    0.021257 [IP1]   [IP2]        TNS       71      Request, Data (6), Data
4    0.021577 [IP2]   [IP1]        TNS       153     Response, Data (6), Data

Processing the query required four packets, captured over about 21.6 ms. Since the round-trip time (RTT) is about 20 ms, there was an additional 10 ms prior to the first packet: the transit time from when the packet left the first datacenter until it arrived at the second datacenter, where it was captured as packet 1. After packet 4, there was another 10 ms of transit time back to the first datacenter. The total time for these four packets was therefore about 41.6 ms, or two full round trips.
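Putting numbers on that arithmetic (a sketch using the capture timestamps above, with a one-way transit of half the ~20 ms RTT):

```java
public class PacketTiming {
    public static void main(String[] args) {
        // Timestamps from the capture, in seconds, relative to packet 1
        // (captured in the second datacenter, where the DB host lives).
        double firstPacketSec = 0.000000;  // packet 1 arrives
        double lastPacketSec  = 0.021577;  // packet 4 leaves
        double oneWaySec      = 0.010;     // half of the ~20 ms cross-colo RTT

        // Add the transit before packet 1 arrived and after packet 4 left.
        double totalSec = oneWaySec + (lastPacketSec - firstPacketSec) + oneWaySec;
        System.out.printf("total query time: %.1f ms%n", totalSec * 1000); // 41.6 ms
    }
}
```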

We saw that the SELECT statement was in packet 1 and that the response appeared to be in packet 2. We didn’t know what packets 3 and 4 were, but their small size suggested they were some kind of acknowledgement.

The solution

At this point, we had confirmation that each query involved four network packets instead of two. The fact that every execution took two round trips indicated that more soft-parsing was occurring than necessary. The question was: why were frequently run queries being parsed multiple times? Once parsed, a query should be added to the app server's statement cache. We increased the statement cache size parameter, which dictates how many parsed SQL statements can remain in the cache at a given time, from 80 to 500.
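For Oracle's thin JDBC driver, enabling and sizing the implicit statement cache looks roughly like this. This is a sketch only: the host, service name, credentials, and query are hypothetical, the cache size of 500 matches the tuned value above, and a connection pool or framework may expose the same setting under its own name. It also requires a live Oracle database to actually run.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import oracle.jdbc.OracleConnection;
import oracle.jdbc.pool.OracleDataSource;

public class StatementCacheConfig {
    public static void main(String[] args) throws Exception {
        OracleDataSource ds = new OracleDataSource();
        ds.setURL("jdbc:oracle:thin:@//db-host:1521/SERVICE"); // hypothetical host/service
        ds.setUser("app");
        ds.setPassword("secret");
        // Turn on the driver's implicit statement cache.
        ds.setImplicitCachingEnabled(true);

        try (Connection conn = ds.getConnection()) {
            // Allow up to 500 parsed statements per connection (was 80).
            ((OracleConnection) conn).setStatementCacheSize(500);

            // Re-preparing the same SQL now reuses the cached cursor,
            // avoiding the extra parse round trip to the database.
            try (PreparedStatement ps = conn.prepareStatement(
                    "SELECT id FROM orders WHERE buyer_id = ?")) {
                ps.setLong(1, 42L);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) { /* process rows */ }
                }
            }
        }
    }
}
```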

When the tuned settings went into effect, we were able to eliminate an extra cross-colo round trip for a majority of SQL requests. We immediately saw a positive impact, observing a 30% reduction in execution time for all instances of a batch job.

This experience demonstrates how even small changes in DB settings can positively affect execution response times. And for me, it was both challenging and interesting to drive problem resolution in an area where I did not have much prior experience.

Comments

Xu Jian September 6, 2012 at 2:55AM

does Connection.prepareStatement(sql); always involve a round trip between app server and DB?


jametong September 9, 2012 at 5:31AM

The parse phase needs a round trip from the client to the Oracle database. With the prepared statement cache, if the statement already exists in the cache, the Oracle JDBC driver can reuse that statement handle to execute, so it saves the round trip for the parse phase.


jametong September 9, 2012 at 5:32AM

More detail about the prepared statement cache can be found at dbafree.net (a Chinese blog).


jametong September 9, 2012 at 5:42AM

I recommend that the blog writer keep an eye on the JVM memory usage of the system, since a much larger prepared statement cache needs more JVM memory to hold the cursors and associated data structures.


Konstantin Yurin September 13, 2012 at 6:25PM

Thank you for emphasizing the importance of proper application tuning. Certainly, application owners must carefully balance performance and resource usage!


Jonathan Stone September 14, 2012 at 4:38AM

Can you expand on the line “The fact that every execution took two round trips indicated that more soft-parsing was occurring than was necessary”?

With the OCI interface, parsing is described as “a completely local call, requiring no round trip to the server”:
http://docs.oracle.com/cd/E14072_01/appdev.112/e10646/oci04sql.htm#i429200
Is this not the case with the thick/thin JDBC driver?


Konstantin Yurin September 17, 2012 at 4:25PM

The thin JDBC driver does not provide that capability.


jametong September 16, 2012 at 11:22PM

The documentation says: "Using the OCIStmtPrepare() call and passing to it *a previously allocated statement handle*. This is a completely local call, requiring no round trip to the server."

A previously allocated statement handle is a must; if the client sends a parse call to the server, it will need a round trip to the server.


