Twenty years ago, I/O operations were relatively slow. Technological advancements have made writing and reading data from disks much faster today, so we can easily take fast I/O for granted and neglect it when investigating the performance of large enterprise systems.
During one of our recent projects, we noticed that disk latency was occasionally high. This finding prompted us to examine storage-related metrics more thoroughly, and we eventually discovered "I/O commands queued" error messages in the VMware layer.
In our case, the customer ran their workloads on the VMware platform. We traced all of their business applications with Dynatrace, and after noticing the disk latencies, we also integrated the VMware layer into our tracing.
During peak hours, when the databases generated a large number of read and write operations, Dynatrace reported "I/O commands queued" events ("Insufficient storage device queue depth for NETAPP iSCSI Disk").
Analysis of I/O Latency Problems
A review of the VMware layer and the NETAPP configuration exposed the root cause: our customer was using software iSCSI with a queue length of 1024 and a device (LUN) queue length of 128.
When I/O traffic was high and more than 128 parallel I/O requests were present in the iSCSI adapter, the device (LUN) queues became a bottleneck because they were limited to 128 entries.
How We Fixed These I/O Latency Issues
SCSI device drivers have a configurable parameter called the LUN queue depth that determines how many commands can be active at once for a given LUN. If the host generates more commands to a LUN, the excess commands are queued in the VMkernel.
Increase the queue depth if the sum of active commands from all virtual machines consistently exceeds the LUN queue depth.
The procedure to increase the queue depth depends on the type of storage adapter the host uses.
When multiple virtual machines are active on a LUN, change the Disk.SchedNumReqOutstanding (DSNRO) parameter so that it matches the queue depth value.
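As a rough illustration of how such a change can be applied on an ESXi host, the sketch below shells out to esxcli from Python. It assumes the software iSCSI adapter is backed by the iscsi_vmk module (whose iscsivmk_LunQDepth parameter VMware documents for this purpose), uses a placeholder device identifier and an example target depth of 256, and should be verified against your ESXi version before use.

```python
import subprocess

# Placeholder identifiers and values -- adjust to your environment.
DEVICE = "naa.xxxxxxxxxxxxxxxx"   # NAA ID of the affected NETAPP LUN
NEW_QUEUE_DEPTH = 256             # example target depth, not a recommendation

def esxcli(*args):
    """Run an esxcli command on the ESXi host and return its output."""
    result = subprocess.run(["esxcli", *args], check=True,
                            capture_output=True, text=True)
    return result.stdout

# Raise the LUN queue depth of the software iSCSI initiator (iscsi_vmk module).
# Module parameter changes typically take effect only after a host reboot.
esxcli("system", "module", "parameters", "set",
       "-m", "iscsi_vmk", "-p", f"iscsivmk_LunQDepth={NEW_QUEUE_DEPTH}")

# Align Disk.SchedNumReqOutstanding (DSNRO) with the new queue depth so the
# scheduler does not throttle the LUN earlier when several VMs compete for it.
esxcli("storage", "core", "device", "set",
       "-d", DEVICE, "-O", str(NEW_QUEUE_DEPTH))

# Verify the effective settings for the device.
print(esxcli("storage", "core", "device", "list", "-d", DEVICE))
```

Because the module parameter change only takes effect after a reboot, this kind of adjustment is best planned for a maintenance window.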
The best way to minimize I/O operation latency is to spread VM deployment across multiple data stores.
When investigating I/O or storage-related performance issues, we review the following components:
Network Bandwidth and Latency
Bandwidth: The network infrastructure can become a bottleneck if iSCSI traffic saturates the available bandwidth.
Latency: High network latency can adversely affect iSCSI performance.
Host System Resources
CPU Utilization: Software iSCSI relies on the host's CPU for processing iSCSI protocol overhead. High CPU utilization, especially in a high I/O environment, can become a bottleneck, reducing overall performance.
Memory Usage: Sufficient memory is required to handle considerable queue lengths efficiently. If the host system is low on memory, it could lead to swapping and degraded performance.
Storage Array Performance
I/O Processing Capability: The storage array or SAN (Storage Area Network) must handle many I/O operations. If the array cannot process I/O requests quickly enough, it can become a bottleneck.
LUN Configuration: Each LUN's configuration, including its underlying physical disks and RAID configuration, affects performance. A queue length of 128 per LUN may not fully utilize the storage array's capabilities if the storage backend is highly performant.
Queue Depth Management
Software iSCSI Queue Length: A queue length of 1024 may lead to a situation in which the software iSCSI initiator queues up too many requests, overwhelming the host or the network stack.
LUN Queue Length: A LUN queue length of 128 means the LUN can handle 128 concurrent I/O requests. If the workload exceeds this, additional I/O requests must wait, causing increased latency and reduced throughput.
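Little's Law (outstanding I/Os = throughput × response time) helps put rough numbers on this. The figures in the sketch below, an assumed average service time and an assumed peak concurrency, are purely illustrative and not measurements from our environment:

```python
# Back-of-the-envelope queue math using Little's Law; all numbers are
# illustrative assumptions, not measurements from our environment.
lun_queue_depth = 128        # concurrent I/Os the LUN accepts
service_time_s = 0.002       # assumed average array service time per I/O (2 ms)

# Maximum throughput the LUN can sustain with 128 I/Os in flight.
max_iops = lun_queue_depth / service_time_s
print(f"~{max_iops:,.0f} IOPS before the 128-deep LUN queue saturates")

# Anything offered beyond the queue depth waits in the VMkernel, and that
# wait is added on top of every queued request's latency.
offered_outstanding = 300    # assumed concurrency during the peak
queued = max(0, offered_outstanding - lun_queue_depth)
extra_wait_ms = queued / max_iops * 1000
print(f"{queued} I/Os queue in the VMkernel, adding roughly {extra_wait_ms:.1f} ms of wait")
```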
Disk Subsystem Bottlenecks
Disk Speed: The type of disks (e.g., HDD vs. SSD) in the storage array will significantly impact performance. Slower disks may struggle to keep up with high I/O demands.
RAID Configuration: RAID configurations can introduce additional overhead, impacting I/O performance. For instance, RAID 5 and RAID 6 involve parity calculations, which can slow down write operations.
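As a quick back-of-the-envelope illustration of that overhead, the sketch below uses the commonly cited write penalties of 2 for RAID 10, 4 for RAID 5, and 6 for RAID 6; the disk count and per-disk IOPS are assumptions, not values from our environment:

```python
# Rough front-end write capability of a RAID group, ignoring controller caches.
# Disk count and per-disk IOPS are assumptions for illustration only.
disks = 8
iops_per_disk = 180                                  # typical 10k rpm HDD
write_penalty = {"RAID 10": 2, "RAID 5": 4, "RAID 6": 6}

raw_iops = disks * iops_per_disk
for level, penalty in write_penalty.items():
    # Each front-end write costs `penalty` back-end I/Os (data plus mirror/parity).
    print(f"{level}: ~{raw_iops // penalty} write IOPS out of {raw_iops} raw back-end IOPS")
```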
I/O Workload Characteristics
Random vs. Sequential I/O: Random I/O operations are generally more taxing on storage systems compared to sequential I/O. High random I/O can quickly reveal bottlenecks in both the network and storage systems.
Read vs. Write Operations: Write operations, especially in RAID configurations, can be more resource-intensive than read operations.
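To illustrate the gap between access patterns, the rough calculation below compares random and sequential 4 KiB reads on a single 7,200 rpm disk, using typical textbook figures rather than measured values:

```python
# Rough comparison of random vs. sequential 4 KiB reads on a single 7,200 rpm
# HDD, using typical textbook figures rather than measured values.
seek_ms = 8.0                        # average seek time
rotational_ms = 60_000 / 7_200 / 2   # half a rotation on average (~4.2 ms)
transfer_mb_per_s = 150.0            # sustained sequential transfer rate

# Random reads pay seek plus rotational latency on every I/O.
random_iops = 1_000 / (seek_ms + rotational_ms)
# Sequential reads stream blocks at the sustained transfer rate.
sequential_iops = transfer_mb_per_s * 1_024 / 4      # 4 KiB blocks per second

print(f"random 4 KiB reads:     ~{random_iops:.0f} IOPS")
print(f"sequential 4 KiB reads: ~{sequential_iops:,.0f} IOPS")
```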
Good Practices to Identify I/O Performance Problems
Performance Monitoring: Use performance monitoring tools to track CPU, memory, network bandwidth, and disk I/O metrics on both the host and storage array.
Queue Depth Tuning: Experiment with different queue depths for both the software iSCSI initiator and the LUNs to find the optimal settings; a minimal consistency check is sketched after this list.
Network Optimization: Ensure that the network infrastructure, including switches and NICs (Network Interface Cards), supports the required bandwidth and has low latency.
Storage System Analysis: Evaluate the performance capabilities of the storage array and ensure it is appropriately configured to handle the expected I/O load.
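As one example of such a check, the sketch below lists the devices on an ESXi host and flags LUNs where DSNRO is lower than the device queue depth. The field labels are taken from typical esxcli storage core device list output and may differ between ESXi builds, so treat this as a starting point rather than a finished tool.

```python
import subprocess

# Minimal sketch for spotting queue-related throttling on an ESXi host.
# The field labels below come from typical `esxcli storage core device list`
# output and may vary between ESXi builds.
raw = subprocess.run(["esxcli", "storage", "core", "device", "list"],
                     check=True, capture_output=True, text=True).stdout

for block in raw.strip().split("\n\n"):              # one block per device
    lines = block.splitlines()
    device = lines[0].strip()                        # e.g. naa.xxxxxxxx...
    fields = {}
    for line in lines[1:]:
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    try:
        max_depth = int(fields["Device Max Queue Depth"])
        dsnro = int(fields["No of outstanding IOs with competing worlds"])
    except (KeyError, ValueError):
        continue                                     # fields missing on this build
    if dsnro < max_depth:
        print(f"{device}: DSNRO={dsnro} is below the device queue depth of "
              f"{max_depth}; competing VMs are throttled before the LUN limit")
```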
Keep up the great work! Happy Performance Engineering!