Josef Mayrhofer

Why should Performance Engineers expect the unexpected?

SOLR is the search backbone of the world's largest websites. It comes with near real-time indexing capabilities and ensures that search queries are always fast and return the correct data.


SOLR is your proven choice, no matter how big or critical your search workloads are. In addition, it ensures replication and fault tolerance out of the box.

It's easy to fall into the trap of taking SOLR's performance for granted. However, in one of our performance engineering projects, we learned that even such a highly optimized search engine can be responsible for significant performance problems.

The Environment

  • Insurance application

  • Multi-Cloud

  • Real-time cloud data protection

  • SOLR search engine

  • Insurance backend layer

  • 3rd Party integrations

  • Kafka

  • Kubernetes

  • Tomcat

  • Nginx


Performance Requirements

  • 95% of API response times < 500 ms

  • 100 requests per second
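These two numbers are only useful if every test run is checked against them. Below is a minimal sketch, in Python, of how such a check could be automated against a JMeter results file; the file name results.jtl and the column names assume JMeter's default CSV output (timeStamp in epoch milliseconds, elapsed in milliseconds), and the script is an illustration rather than our actual reporting harness.

    # Sketch: validate p95 response time and throughput from a JMeter CSV results file.
    import csv
    import math

    def check_sla(jtl_path="results.jtl", p95_limit_ms=500, target_rps=100):
        elapsed, stamps = [], []
        with open(jtl_path, newline="") as f:
            for row in csv.DictReader(f):
                elapsed.append(int(row["elapsed"]))      # response time in ms
                stamps.append(int(row["timeStamp"]))     # request start, epoch ms
        if not elapsed:
            return False
        elapsed.sort()
        p95 = elapsed[math.ceil(0.95 * len(elapsed)) - 1]    # 95th percentile
        duration_s = (max(stamps) - min(stamps)) / 1000 or 1
        rps = len(elapsed) / duration_s                      # achieved throughput
        print(f"p95={p95} ms (limit {p95_limit_ms} ms), throughput={rps:.1f} rps")
        return p95 < p95_limit_ms and rps >= target_rps

    if __name__ == "__main__":
        raise SystemExit(0 if check_sla() else 1)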


Performance Engineering approach

  • Observability first

  • Shift left

  • Load testing using JMeter

  • Performance Monitoring using Dynatrace

  • Performance validation on UAT and Production
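Shift left and testing every deployment only work when a load test is cheap to trigger. The following sketch shows how a JMeter run could be started in non-GUI mode from a deployment hook; the plan name smoke_plan.jmx, the property target.host, and the host name are assumptions for illustration, and the resulting results.jtl can then be fed into an SLA check like the one above.

    # Sketch: trigger a JMeter load test in non-GUI mode from a CI/CD step.
    import subprocess
    import sys

    def run_load_test(plan="smoke_plan.jmx", results="results.jtl", host="uat.example.com"):
        cmd = [
            "jmeter", "-n",              # non-GUI mode
            "-t", plan,                  # JMeter test plan
            "-l", results,               # raw results for the SLA check
            f"-Jtarget.host={host}",     # property the plan can read via __P(target.host)
        ]
        return subprocess.run(cmd).returncode

    if __name__ == "__main__":
        sys.exit(run_load_test())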


The Problem

Response times were excellent during our early load testing activities and well within the agreed SLAs. However, business was slow during the COVID pandemic, and this customer had not yet launched their insurance portal. At the same time, new versions of almost every component came out, and the customer deployed many of them in the brand-new production environment. Our customer therefore hoped that the final production validation performance test would be a small exercise completed within a few days. But they were mistaken: in the brand-new production environment, response times under a two-requests-per-second scenario were up to 1 minute.
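A symptom like this is easy to make visible even outside a full JMeter scenario. The sketch below sends roughly two requests per second and prints each response time; the endpoint URL is a placeholder, and this is only an illustration of the kind of probe that exposes such outliers.

    # Sketch: probe the portal API at ~2 requests per second and print each response time.
    import time
    import requests   # third-party: pip install requests

    URL = "https://portal.example.com/api/quotes"   # placeholder endpoint

    for i in range(60):
        start = time.perf_counter()
        r = requests.get(URL, timeout=120)
        ms = (time.perf_counter() - start) * 1000
        print(f"{i:02d} status={r.status_code} {ms:.0f} ms")
        time.sleep(0.5)   # pacing, roughly two requests per second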


The Solution

In the first step, we simulated the expected load on the UAT stage. This load test resulted in perfect response times, and everyone was happy to give the green light for deployment to production.

You can imagine the immense frustration on the customer's side when we published our production validation load testing and monitoring results. With only a few weeks left until the scheduled go-live date, response times of up to 15 seconds were the last thing anyone expected.


Figure: Response time chart showing spikes during production testing


It's useless to blame anyone in such situations. Instead, we carefully investigated layer by layer and found that the production environment was quite different: online replication was in place, and several security features were turned on that we had never used on the UAT stage. In addition, real-time indexing for SOLR was turned on.
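Whether real-time indexing is actually active is visible in SOLR's effective configuration. The sketch below asks the Config API for the commit settings; the host, port, and collection name are placeholders, and the JSON layout follows the Config API of recent Solr versions. A very small autoSoftCommit maxTime means soft commits happen almost continuously, which is what near real-time indexing boils down to.

    # Sketch: read the effective commit settings from SOLR's Config API.
    import requests

    SOLR = "http://solr.prod.internal:8983/solr"   # placeholder host/port
    COLLECTION = "policies"                        # placeholder collection

    cfg = requests.get(f"{SOLR}/{COLLECTION}/config", timeout=10).json()
    update_handler = cfg["config"]["updateHandler"]

    print("autoCommit.maxTime     :", update_handler["autoCommit"].get("maxTime"))
    print("autoSoftCommit.maxTime :", update_handler["autoSoftCommit"].get("maxTime"))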

Initially, we thought all these security features, such as Defender, were responsible for the performance degradation, so we disabled them and executed a benchmark test. But disabling these security tools had only a minimal performance impact, so they could not explain the slowdown.


After reviewing our full-stack monitoring results and discussing the load testing results with the developers and architects of this insurance portal, we suspected that we were losing too much time in the multi-cloud communication. However, we still had some monitoring blind spots because our observability solution was not installed on all components.


The Tuning

A few days before our production load tests, the environment had gone through a failover testing exercise, during which real-time indexing on SOLR and online data replication were enabled. Once we learned about this significant change, we decided to:

  • adjust real-time indexing for SOLR

  • use SOLR in a Single Node Cluster

  • move SOLR to the same Network Segment as our remaining components

  • and run another load test
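For the first item, adjusting real-time indexing mostly means relaxing how often SOLR issues soft commits. A minimal sketch of that change via the Config API could look like the following; the host and collection name are placeholders, 60 seconds is an example interval, and the right value depends on how fresh the search results really need to be.

    # Sketch: relax the soft-commit interval so the index refreshes every 60 s
    # instead of on (almost) every write.
    import requests

    SOLR = "http://solr.prod.internal:8983/solr"   # placeholder host/port
    COLLECTION = "policies"                        # placeholder collection

    payload = {"set-property": {"updateHandler.autoSoftCommit.maxTime": 60000}}  # ms
    resp = requests.post(f"{SOLR}/{COLLECTION}/config", json=payload, timeout=10)
    resp.raise_for_status()
    print("soft-commit interval updated, HTTP", resp.status_code)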

Surprisingly, the massive spike in response time disappeared, and service latency returned to the desired 500 ms.


Lessons learned
  1. Never take performance for granted.

  2. Performance risk assessment of all changes is a must.

  3. Load test early, often, and on every deployment.

  4. Agree on Load test reporting standards.

  5. Ensure you have full-stack monitoring in place for all critical components so that you can see all services.


Every minor change can result in massive performance problems. If you develop business-critical applications, continuous performance validation is necessary to manage your performance risks.

I wish you happy performance engineering!

