Troubleshooting SQL Server I/O performance with SAP - Identifying rogue transactions



By Dmitry Gorbatov for Beyond The Blocks - Wednesday, March 15, 2017

BLOG_SQL_Header.jpg
Use a SQL query to detect rogue transactions

A slow-running report in a small company may be acceptable if it happens only occasionally, but in a large enterprise, any degradation in performance while executing a database query or running a report is not welcome: there is a high risk of affecting a revenue-generating application, which could ultimately impact the business.

Such performance problems due to I/O bottlenecks are not acceptable and must be eliminated as quickly as possible.

System administrators know where to look – storage subsystems, network, and servers are investigated to find the root cause of slow performance. It is not uncommon to blame a storage array for performance issues.

"Not so fast…" Storage vendors are the first ones to be called to the rescue regardless of where the issue lies on.

Storage is not always the reason for poor performance. Storage systems are getting faster, bulking up on capacity and add-on features every cycle. Every storage vendor is talking about the speed of their solution. Some say it is all about low latency. Some say it is all about high throughput.

There is no argument that the newest media technologies have helped improve overall storage subsystem performance. But even the fastest media will not deliver optimal performance if the application accessing it is poorly designed or written.

In this blog post, I want to share my recent experience with one of our customers running a heavy-duty SAP environment on physical Windows servers.

Hopefully, other customers and even vendors can benefit and learn from it.

Environment - SAP on MS SQL Server 2008

The main application in the environment is SAP running on Microsoft SQL Server 2008. Overall the environment can be considered dated or even ancient.

All servers are physical Dell or HP machines running MS Windows Server 2008 R2. The SQL Server version is 2008 SP1 CU1. Up until recently, the storage system in use was a Dell Compellent. Two servers run in an MSCS configuration for the production databases. There are a total of seven production databases, all located on the same D:\ drive (LUN).

Logs are stored on an independent LUN; QA and DEV are on separate dedicated servers.

Curiously, disk drives in the Compellent were failing at an alarming rate – two to three a week! That is way outside the acceptable range. The Dell Compellent was pegged at 12,000 IOPS and 750-800 MB/s of throughput. Latency ranged from 5 to 20ms, with spikes up to 100ms.

The Dell Compellent used ten dedicated “high performance” SSDs just for SAP.

Certain SAP queries run directly against the production databases, and if multiple users attempted to run reports at the same time, the environment would come to a screeching halt. When the customer made the decision to go with the Reduxio HX550, the expectation was that it would be faster.

After all, it is five years younger and uses flash and other innovative technologies. Since the SAP database is the “cash register” for the customer, he felt that using Reduxio BackDating™ to protect his most precious data with one-second granularity was appropriate.

SAP I/O performance bottleneck in production

After migrating the production SAP environment to Reduxio, a few observations were made. One, performance did not improve much; two, IOPS were unusually steady, hovering somewhere between 8,000 and 12,000 with spikes up to 21,000.

More interesting was the fact that throughput was unusually high – over 500 MB/sec most of the time, day or night.

Nobody could explain what the application was doing all the time to create such a load. The SAP application is 99% read / 1% write, but it should not have been reading that much data around the clock. The graphs below are from StorSense™ – a cloud-based proactive monitoring system that keeps our full install base under constant surveillance.

If an unusual condition is detected on any of our storage systems, it is immediately reflected in StorSense™.

Our support personnel monitor the StorSense central console continuously and will investigate any unusual events reported.

Storsense Screenshot.png

Here is the view from the Reduxio UI. It is important to note that from the storage system’s perspective, the latency was still very low.

Reduxio_UI_low_latency.png

The user experience was still poor, and the customer was second-guessing his decision to switch storage vendors. Can you blame him?

Investigation

Any storage vendor would start the investigation by checking its own proverbial backyard. After all, what changed? You went from the Fibre Channel (FC) protocol to Ethernet (iSCSI), you replaced an 8Gbps FC HBA with a 10G Ethernet NIC, you are now using the Microsoft iSCSI software initiator – all of this must have caused the problem, right?

Makes perfect sense.

You can get into a religious discussion about FC vs. iSCSI. Old-school IT guys would tell you that iSCSI is not a real block protocol, that it is not reliable, and so on. If you are interested in digging a little further, this blog on FC vs. iSCSI might pique your interest.

But we are here to find out what was wrong with this particular customer’s environment. So, as a reputable storage vendor, we started checking the best practices: the timeouts, the initiator settings, the MPIO configuration. We verified that the NIC driver in use was up to date, and checked seemingly everything else.

Then Check for Problems at the Application Level

We started to tackle the problem at the application level - understanding the problematic queries, improving the indexes on the SQL Server, and so on. Little did we know, the application would turn out to be much more complex than we anticipated. It has over 150,000 tables, with many queries of different types running simultaneously, and it was not clear whether a single query was the culprit. We searched far and wide, looked in dark corners of Google, and tried to describe the symptoms in many different ways, but all to no avail.

Nagle Algorithm

Since we were looking into iSCSI, we came across something on the Microsoft community site talking about the “Nagle Algorithm.” According to Microsoft “the Nagle algorithm was originally implemented to prevent lots of very small TCP packets to congest the network by grouping them together and sending them at once. This increases bandwidth (fewer header bytes per user data) but at the cost of latency.” We thought – here is the “smoking gun” - this is exactly what is happening here – SAP with a bunch of small reads… 

As good as it sounded, changing that registry setting requires a server reboot, and we could not afford to take an outage. Once you reboot, it takes the SAP database literally days to rebuild its cache, and as you will see below, the cache hit ratio was low anyway. So, we kept digging. There were also some discussions about increasing the number of iSCSI sessions to help performance, but that had little effect.

Using SAP ST04 DB performance analysis transaction code

Unfortunately, I do not have historical data to compare against, as it was always assumed that the Dell Compellent was the performance bottleneck. With Reduxio as the main storage array, we started looking at what the SAP application was seeing and reporting.

There are SAP performance monitoring tools that you can use in a situation like this:

First, we started looking at the SAP ST04 DB performance analysis transaction code. According to SAP experts, the transaction code ST04 is used to show details of the database behavior and usage. It will show you both real-time and historical data.

Right away we knew something was not right. You want to see a data cache hit ratio above 95% (meaning fewer physical reads from disk). This system showed less than 18%.

ST04_Hit Ratio.png
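For reference, a similar figure can be approximated from the SQL Server side rather than from ST04. Below is a minimal sketch using the engine's performance counter DMV; the counter names are the standard ones exposed by SQL Server 2008, and the result is cumulative since the last restart, so treat it as an approximation of the ST04 number rather than an exact match.

-- Approximate buffer cache hit ratio from SQL Server's own performance counters.
-- The ratio is exposed as two counters that have to be divided manually.
SELECT CAST(100.0 * hit.cntr_value / NULLIF(base.cntr_value, 0) AS DECIMAL(5, 2))
       AS buffer_cache_hit_ratio_pct
FROM   sys.dm_os_performance_counters AS hit
JOIN   sys.dm_os_performance_counters AS base
       ON  base.object_name  = hit.object_name
       AND base.counter_name = 'Buffer cache hit ratio base'
WHERE  hit.counter_name = 'Buffer cache hit ratio'
AND    hit.object_name LIKE '%Buffer Manager%';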

 

You want to see response times of less than 15ms. We were getting 9,000 to as much as 35,000ms (!!), and D:\ drive utilization was at 100% all the time. No wonder the Dell drives were failing. In the end, you will see that it was not their fault.

 

9D.png

 

The customer has a support contract with SAP consulting firm Secure-24. As a side note, these guys are fantastic – they know SAP inside and out and have a great customer-focused attitude. Here is what they saw:

 

PRD.png

Performance_millisecond_latency.png

 

When they got engaged, Secure-24 opened a support ticket with SAP and worked with us on collecting and analyzing performance data. They saw that response times for certain DB transactions were unusually high.

Reports that should take seconds were taking multiple minutes at best.

 

Transaction Time.png

 

There were also tables that had no indexes but would have benefited from having them.

 

DB Schema.png
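There is no single way to find such tables, but one rough way to surface candidates on the SQL Server side is the missing-index DMVs. The sketch below is illustrative, not the exact report Secure-24 produced:

-- Tables and columns for which the optimizer has recorded missing indexes,
-- ordered by a rough estimate of how much an index would have helped.
SELECT TOP (20)
       d.statement                               AS table_name,
       d.equality_columns,
       d.inequality_columns,
       d.included_columns,
       s.user_seeks + s.user_scans               AS times_wanted,
       s.avg_total_user_cost * s.avg_user_impact AS rough_benefit
FROM   sys.dm_db_missing_index_details AS d
JOIN   sys.dm_db_missing_index_groups AS g
       ON g.index_handle = d.index_handle
JOIN   sys.dm_db_missing_index_group_stats AS s
       ON s.group_handle = g.index_group_handle
ORDER BY rough_benefit DESC;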

 

We suggested that the customer open a support ticket with Microsoft. Microsoft recommended upgrading the OS and upgrading SQL Server to SP3, and suggested this was a SAN problem. It was good advice, but it did not really help the customer or us, since there was pressure to resolve the issue without any downtime.

Getting into the weeds with Wireshark

At this point it had been a long time with no progress, so we decided to bring one of our best engineers onsite to do SQL data collection and an analysis of the application.

We installed Wireshark and took a TCP dump. What we found was that the storage response to the SQL Server was always within acceptable levels - 2-5ms at most, as opposed to an average of 20-40ms according to both Windows and SQL Server. This information reaffirmed our conviction that the problem was not storage related.
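The 20-40ms figure on the SQL Server side comes from the engine's own per-file I/O statistics. A sketch of the kind of query that exposes them (the numbers are cumulative since the last restart, so they show averages rather than momentary spikes):

-- Average read/write latency per database file, as SQL Server itself measures it.
SELECT DB_NAME(vfs.database_id)                             AS database_name,
       mf.physical_name,
       vfs.num_of_reads,
       vfs.io_stall_read_ms  / NULLIF(vfs.num_of_reads, 0)  AS avg_read_latency_ms,
       vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_latency_ms
FROM   sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN   sys.master_files AS mf
       ON  mf.database_id = vfs.database_id
       AND mf.file_id     = vfs.file_id
ORDER BY avg_read_latency_ms DESC;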

But in good conscience, you don’t point fingers; you have to see it through.

We started looking at the queries that were running at any given time at the SQL Server level.

Here is the kind of query we used to see whether any “rogue” transactions were running.
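A minimal sketch of such a query, joining the currently executing requests to their SQL text through the standard DMVs (the exact filters used in the field may have differed; this is representative rather than verbatim):

-- Currently executing requests with their SQL text, ordered by how long
-- they have been running and how much I/O they have generated.
SELECT r.session_id,
       r.status,
       r.start_time,
       r.total_elapsed_time / 1000 AS elapsed_seconds,
       r.reads,
       r.logical_reads,
       DB_NAME(r.database_id)      AS database_name,
       t.text                      AS sql_text
FROM   sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE  r.session_id <> @@SPID       -- ignore this query itself
ORDER BY r.total_elapsed_time DESC;

Anything that has been running for many minutes and racking up millions of reads against a production database is a good candidate for the “rogue” label.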

The result was surreal: there were 3-4 queries, each scanning an entire 80GB database table without an index, only to retrieve a single value!! And that value was constant!!

At first, we thought it was a one-time thing. But after a while, it turned out that this pattern was recurring - after each of those queries finished (and each took ~10 minutes), another similar query would start running right away.

It took us a while to pinpoint the culprit, but it turned out to be an SAP Business Objects reporting application running on a development server (used for developing reports before they go into production).

Yes, you heard that correctly.

Dev Environment was running against Production DB

The development environment was running against the production database, and nobody knew about it. Apparently, it had been happening since 2012, when Business Objects was installed!!

We had a long call with an SAP Business Objects expert, hired by the customer, who could not figure out exactly why those queries were running - they seemed to run regardless of whether any report was being executed; a report only needed to be “installed,” and those queries would run periodically.

Eventually, the SAP BO expert stopped those queries from being called by deleting the problematic reports from the server altogether (even though they were not being accessed).

After we got confirmation and tested it against our assumptions, the issue was finally resolved.

The results were amazing.

Read throughput on Reduxio dropped to zero most of the time (see below), and everything was working blazing fast. The cache hit ratio went up to over 99%.

This means that the entire database workload was being served from cache at the SQL Server level, without going to the storage subsystem at all.

Results.png

My favorite indicator is that after the problem was fixed, the disk drive with the highest latency from the SAP perspective was the server’s internal C:\ drive.

c drive.png

Just a brief look at StorSense data showed that the storage was performing great.

Storsense Data.png

IOPS were low, as was throughput.

Storsense Data 1.png

Here is what it looked like from the Reduxio UI.

Reduxio 30k IOPs sharpen.png

Interestingly, peak IOPS broke through the previous 20,000 mark to over 30,000, with relatively low throughput.

Reduxio IOPS 7mb_sharpen.png

 

Takeaway

If you followed along, what actually allowed this problem to happen was poor separation of the PROD and QA/DEV environments. The customer was using the production database for development because that was the only way to get the reports they needed to run the business.

Going forward, they can improve the business flow tremendously by utilizing some new features that Reduxio brought to them.

Instead of pounding the production database with reports and queries that may not always be very efficient, they now have the ability to create an instant clone of the entire production DB – and do it on demand. Once that clone is created, it gets mounted to a VM on their newly built ESX cluster.

The VM is running Windows Server 2012 R2 with SQL Server 2008 SP1. The Business Objects Universe is now pointed at the clone instead of production.

Together with the customer, we concluded that this is a better setup than SQL replication - it is more lightweight for just about every component in the system (storage, network, CPU), and it even behaves better.

In the end, we scripted the process of clone creation, attachment, and presentation.
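The exact scripts are specific to this environment, but on the SQL Server side the recurring step is simply attaching the cloned data and log files once the cloned LUN has been presented and mounted. A hedged sketch, with the database name and file paths made up purely for illustration:

-- Attach the cloned database files after the cloned LUN has been mounted.
-- The database name, E:\ paths, and file names below are placeholders.
CREATE DATABASE PRD_REPORT_CLONE
ON      (FILENAME = 'E:\SAPDATA\PRD_data.mdf'),
        (FILENAME = 'E:\SAPLOG\PRD_log.ldf')
FOR ATTACH;

The idea is that a fresh clone can be attached on demand for reporting without touching production.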

Customer is happy, so are we.



Written by Dmitry Gorbatov

Having worked abroad deploying and supporting various voicemail, SMS text, and multimedia messaging value-add systems, and directing support and field engineering teams in Southeast Asia for Comverse; then more recently, a 10-year period with NetApp in New England as their Senior Solution Principal and Enterprise Systems Engineer, Dmitry brings end-user and vendor perspectives together. Now the regional lead of Systems Engineering for Reduxio Systems, Dmitry offers his experience, professional expertise, as well as his personal passions for golf, distance running, and anything to do with his son’s youth basketball successes.


