Stressed: How performance remediation can help your systems and your business
Exploring barriers to performance and how to overcome them
December 15, 2020 | By James Pulley and Patrick Guindon-Slater
Amid a global pandemic and economic uncertainty, it’s safe to say that in 2020 everyone and everything is stressed. Here, we’ll cover just one kind: stressed technology systems.
Where and how people work has shifted. Most companies’ collaboration tools weren’t built for the load of a nearly or fully remote workforce. Anything once handled via impromptu face-to-face interactions has moved online. VoIP, VPN and security policies are adding to network complexity, which comes with an additional performance cost.
More people are shopping, banking, applying for services and seeing the doctor virtually. Government agencies are facing significantly higher than normal volumes of applications for small businesses as well as for other benefits such as unemployment. Their systems are stressed at every level, so poor performance and any associated failures are much more visible. This can put an organisation’s reputation and its ability to meet its mission at risk.
The thing about performance is that when it works, no one notices. But when it doesn’t, it gets noticed and talked about. Every day there are examples of sites failing under load, such as remote education and unemployment services. It’s always easier to prevent a newsworthy event than to respond to one. Unfortunately, it’s hard to get leaders to take the risk seriously, since a failure is only a “possible” event and is never assumed to be likely. Let’s discuss a few ways to change the conversation.
The case for change
Inside your organisation, there are applications that users don’t like to use because of poor performance; it’s the most common complaint across IT. Externally, economic uncertainty and a competitive market leave no room for error: failure to scale could cause a company to shut its doors.
Let’s take a look at an e-commerce example. If you give every customer a cart when they arrive at your site, you lock up a set amount of resources for each user, which slows the site. Picture every shopper in a store lugging a large shopping cart even if they’re just browsing. A few years ago, a major retailer noticed significant performance issues and, by looking through logs and testing, identified a cart issue as the cause of the slowdown.
Through performance remediation, they changed the architecture so that shopping carts were no longer handed out on arrival. Site revenue ultimately went up by more than $10 million a month because of increased speed and customer conversion. In a competitive market, customers will leave to make purchases elsewhere if your site performs poorly.
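The change described above can be pictured as a move from eager to lazy allocation. This is a minimal sketch, not the retailer’s actual code: the `Cart`, `EagerSession` and `LazySession` classes, the visitor count and the 1-in-20 purchase rate are all hypothetical, invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Cart:
    """Hypothetical cart; stands in for whatever server-side state a real cart holds."""
    items: List[str] = field(default_factory=list)


class EagerSession:
    """Allocates a cart as soon as the visitor arrives, even if they only browse."""

    def __init__(self) -> None:
        self.cart: Optional[Cart] = Cart()

    def add_item(self, sku: str) -> None:
        self.cart.items.append(sku)


class LazySession:
    """Allocates a cart only when the visitor first adds an item."""

    def __init__(self) -> None:
        self.cart: Optional[Cart] = None

    def add_item(self, sku: str) -> None:
        if self.cart is None:
            self.cart = Cart()
        self.cart.items.append(sku)


def carts_allocated(sessions) -> int:
    """Count the sessions that currently hold a cart."""
    return sum(1 for s in sessions if s.cart is not None)


# Assumption: 1,000 visitors, of whom 1 in 20 adds something to a cart.
eager = [EagerSession() for _ in range(1000)]
lazy = [LazySession() for _ in range(1000)]
for i in range(0, 1000, 20):
    eager[i].add_item("sku-123")
    lazy[i].add_item("sku-123")

print(carts_allocated(eager))  # 1000
print(carts_allocated(lazy))   # 50
```

Under these assumed numbers, the lazy version holds a twentieth of the cart state for the same set of visitors.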
You have to make the value messaging appropriate for your leadership. That means turning conversations about load, resources and CPU/memory into things that your executive team care deeply about: poor service, potential loss of revenue, reputational damage and employee frustration.
Barriers to performance
Performance is wrongly considered extra, a non-functional requirement. However, if you can’t access necessary functionality of the system then the system isn’t going to meet business needs. Here are a few examples of common barriers to performance.
Barrier #1: Organisations don’t know how to evaluate performance
Performance doesn’t mean a tool. Performance engineering finds ways to improve the efficiency of a system so that it scales better or responds faster to business demands. Organisations don’t always understand how to determine the value of performance until it is missing. The technology industry has failed to educate developers on the root cause of poor performance, which is resource usage: how often to grab resources, how large a block to grab and how long to hold onto it. When developers need more resources, they default to asking for a bigger pool rather than using the pool they have more efficiently.
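One way to reason about using an existing pool more efficiently is Little’s Law: the average number of resources in use equals the arrival rate multiplied by the average time each one is held. The sketch below applies it to a connection pool. The request rate and hold times are assumed figures chosen for illustration, not measurements from any real system.

```python
import math


def pool_size_needed(requests_per_second: int, hold_ms: int) -> int:
    """Little's Law: average resources in use = arrival rate x average hold time."""
    in_use = requests_per_second * hold_ms / 1000
    return math.ceil(in_use)


# Assumed workload: 200 requests per second.
# Case 1: each request holds a connection for its whole 500 ms lifetime.
held_for_whole_request = pool_size_needed(200, 500)

# Case 2: the connection is taken only for the 50 ms of actual database work.
held_only_while_needed = pool_size_needed(200, 50)

print(held_for_whole_request)  # 100
print(held_only_while_needed)  # 10
```

These are averages, so a real pool would need headroom for peaks. The point is that shortening the hold time shrinks the pool you need without adding hardware.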
Barrier #2: Performance lives under the covers
Unlike functionality that you can see (like a button you can push), performance and security are under the covers and hard to perceive. You have to design for them. Think of a sports car with the wrong engine. At first you don’t know. Then you turn the key, hear the wrong noises and realise it doesn’t go fast. Performance means designing for what is under the covers.
Barrier #3: Business, marketing, sales and technology teams aren’t aligned on goals
For example, the marketing team choose a gorgeous but large image (over 45 megabytes) for a website’s home page. Downloading that image locks up so many resources that the site runs out and fails to respond. Ultimately, this will not benefit the sales process. Make sure that the goal for your site or system is known, and that all involved parties are aligned to that goal and to what they need to do to achieve it.
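A rough calculation shows why an image of that size is a problem. The link speeds below are assumptions picked for illustration, and the sums ignore protocol overhead, latency and compression, so treat the results as a best case.

```python
def transfer_seconds(size_megabytes: float, link_megabits_per_second: float) -> float:
    """Time to move one file over a link (1 megabyte = 8 megabits)."""
    return size_megabytes * 8 / link_megabits_per_second


def downloads_per_second(size_megabytes: float, link_megabits_per_second: float) -> float:
    """How many copies of the file a link can deliver each second when saturated."""
    return link_megabits_per_second / (size_megabytes * 8)


IMAGE_MB = 45             # lower bound taken from the example above
VISITOR_LINK_MBPS = 100   # assumed visitor connection speed
SERVER_LINK_MBPS = 1000   # assumed server uplink (1 Gbps)

print(f"{transfer_seconds(IMAGE_MB, VISITOR_LINK_MBPS):.1f} seconds per visitor")
# 3.6 seconds per visitor
print(f"{downloads_per_second(IMAGE_MB, SERVER_LINK_MBPS):.1f} home page loads per second")
# 2.8 home page loads per second
```

Under these assumptions, fewer than three visitors a second are enough to saturate the uplink with the home page image alone.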
Barrier #4: Perfection isn’t attainable
In general, there is an acceptable level of failure. If my site is up 99% of the time, that’s not bad. But if you’re an online retailer, it is not acceptable to fail at critical points, e.g. when launching a new brand or during peak times. The problem is that when you accept that there will be downtime or security breaches and settle for just containing them, you’re not really measuring their full impact on your business.
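It helps to turn an uptime percentage into hours. The sketch below does that arithmetic for a few availability levels. The levels other than 99% are added for comparison and do not come from the discussion above.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def downtime_hours_per_year(availability_percent: float) -> float:
    """Hours per year a system may be down at a given availability level."""
    return (100 - availability_percent) / 100 * HOURS_PER_YEAR


for pct in (99.0, 99.9, 99.99):
    hours = downtime_hours_per_year(pct)
    print(f"{pct}% uptime allows {hours:.2f} hours of downtime a year")
```

At 99%, that is 87.6 hours a year, or about 3.65 days. The percentage says nothing about when those hours fall, which is why the same figure can be fine in a quiet month and very costly during a launch or a peak period.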
How to begin performance remediation
To evaluate performance, you need a series of measurements or diagnostics that show how long actions take, along with a record of what resources were used. Forensics looks back on an issue, identifies the root cause and then remediates it so it doesn’t happen again. Capacity planning looks ahead. Where you cannot pull data from a live environment, performance testing can generate the measurements of end-user response times and resource usage. Think of it as a dress rehearsal or perfect-storm test: you design for those situations in advance so you can actually handle the event if it happens.
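As a small illustration of the two kinds of measurement mentioned above, the following sketch records how long an action takes and how much memory it used at its peak, using only the Python standard library. The `measure` helper and the `build_report` action name are hypothetical. A real system would more likely rely on its monitoring or load-testing tooling than on hand-written timers.

```python
import time
import tracemalloc
from contextlib import contextmanager

measurements = []  # one record per measured action


@contextmanager
def measure(action: str):
    """Record elapsed time and peak Python memory for the enclosed block.

    Not safe to nest: tracemalloc is started and stopped on each use.
    """
    tracemalloc.start()
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        _, peak_bytes = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        measurements.append(
            {"action": action, "seconds": elapsed, "peak_bytes": peak_bytes}
        )


# Hypothetical action standing in for a real user request.
with measure("build_report"):
    rows = [str(i) * 10 for i in range(100_000)]

for m in measurements:
    print(f"{m['action']}: {m['seconds']:.4f} s, peak {m['peak_bytes'] / 1e6:.1f} MB")
```

Collected over many requests, records like these are the raw material for spotting where time is spent and which actions hold the most resources.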
Performance is about patterns. From a remediation perspective, you’re going to look for known patterns of behaviour, evidence of where time is being spent and measurements of resource usage to explain why something is running longer. Where and for how long are you holding resources for a user? Dig into the components that reveal the largest lock on resources.
Some issues can be resolved through configuration, or by having portions of user requests served by a specialised caching provider such as a content delivery network. This lets you set policies that remove the load associated with common elements (images, page components, style sheets) that are used most frequently as users make their way through the system and that don’t change from one user to the next. Ultimately, this reduces the resources used on your core system to the minimum required, gets users through the system faster and frees those resources to be reused.
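A caching policy of this kind often comes down to telling shared caches which responses are the same for every user. The sketch below picks an HTTP `Cache-Control` value by file extension. The extension list, the one-day lifetime and the example paths are assumptions for illustration; an actual content delivery network will have its own configuration format and recommended settings.

```python
import posixpath

# Assumed policy: the extensions and lifetime are illustrative, not recommendations.
SHARED_STATIC_EXTENSIONS = {".css", ".js", ".png", ".jpg", ".svg", ".woff2"}
ONE_DAY_SECONDS = 24 * 60 * 60


def cache_control_for(path: str) -> str:
    """Return a Cache-Control header value for a request path."""
    _, extension = posixpath.splitext(path.lower())
    if extension in SHARED_STATIC_EXTENSIONS:
        # Same bytes for every user, so a shared cache such as a CDN may store it.
        return f"public, max-age={ONE_DAY_SECONDS}"
    # Per-user or dynamic content: keep it out of shared caches.
    return "private, no-store"


for path in ("/static/site.css", "/images/hero.png", "/cart", "/account/orders"):
    print(path, "->", cache_control_for(path))
```

With a policy like this, requests for style sheets and images can be answered at the edge, and only the per-user pages reach the core system.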
Without a content delivery network, every single asset has to be served from the data centre, which increases stress on the network. A resource-heavy development model hurts scalability and requires a lot of memory on the servers. So, when you reach a high-load situation, you run out of resources very fast and can’t accept more users.
Prioritising performance is a new way of working. With a majority of your employees and customers connecting with your organisation virtually, it’s an absolute priority. By leveraging performance forensics to identify the root causes of issues and remediating them in a timely manner, you’ll reap the ultimate in bottom-line outcomes: increased revenue, customer retention and employee satisfaction.
James Pulley is the practice manager of performance engineering and testing for TEKsystems. He has spent the last 20 years helping customers with software application performance and scalability as a performance tester and engineer.
Patrick Guindon-Slater is the practice manager for continuous testing for TEKsystems.