Powered by AI and the LinkedIn community
1 Monitor system metrics
2 Identify the root cause
3 Apply the best solution
4 Prevent future issues
5 Here’s what else to consider
Performance bottlenecks are factors that limit the efficiency and responsiveness of a system. They can cause delays, errors, and frustration for users and administrators alike. As a system administrator, you need to be able to identify and resolve performance bottlenecks before they affect your system's availability and reliability. In this article, you will learn some methods and tools to help you diagnose and troubleshoot performance issues in system administration.
1 Monitor system metrics
The first step in identifying performance bottlenecks is to monitor your system's key metrics: CPU, memory, disk, network, and processes. These metrics show how well the system is using its resources and where problems may be developing. You can collect and analyze them with tools such as top, ps, vmstat, iostat, netstat, and sar. Graphical tools and web-based dashboards let you visualize and compare the same metrics over time.
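As a rough illustration of the kind of figure these tools report, the sketch below derives memory utilization from /proc/meminfo. It assumes a Linux host whose kernel exposes the MemAvailable field; the function names and the SAMPLE text are invented for this example, and the sample is only a fallback for systems without /proc.

```python
from pathlib import Path


def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into {field: number}.

    Most fields are in kB; a few, such as HugePages_Total, are plain counts.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def memory_used_percent(fields: dict[str, int]) -> float:
    """Share of RAM that is not available to new workloads, in percent."""
    total = fields["MemTotal"]
    available = fields["MemAvailable"]
    return 100.0 * (total - available) / total


# Hypothetical sample, used only when /proc/meminfo does not exist.
SAMPLE = """\
MemTotal:       16000000 kB
MemFree:         2000000 kB
MemAvailable:    4000000 kB
"""

if __name__ == "__main__":
    meminfo = Path("/proc/meminfo")
    text = meminfo.read_text() if meminfo.exists() else SAMPLE
    print(f"memory used: {memory_used_percent(parse_meminfo(text)):.1f}%")
```

Sampling a value like this at a fixed interval and storing it is the simplest form of the "compare over time" view that dashboards provide.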
- Yoandri Gallardo Senior System Engineer @ Amadeus | Infrastructure, Cloud, Virtualization, Automation, Scripting
There are some commands and tools that are always good to have handy. Some ship with the OS, depending on the distribution, and others need to be installed. The ones below are part of my tool set:
- ifconfig (network)
- ethtool (network)
- nethogs (network)
- glances (overall performance)
- top (overall performance)
- htop (overall performance)
- vmstat (memory)
- free (memory)
- iostat (disk I/O performance)
- nfsiostat (NFS performance)
- df (disk space)
- ps (processes)
- netstat (services and ports)
- ss (services and ports)
Other dedicated monitoring solutions offer more capabilities, a UI, and historical data. To name some:
- Nagios
- Zabbix
- Grafana
- Thanos
- Monit
- Icinga
- Cacti
- Bryan Brandau Sr. Director Cloud Platforms, Infrastructure Engineering and Operations at Best Buy
Monitoring almost always grows incrementally as you learn more about your systems, so you need a flexible, composable monitoring system at your disposal. More importantly, you need to understand how your application interacts with the systems it uses. Do you understand how your JVM, thread pools, garbage collection (GC), NUMA, network throughput, and network calls work? These deeper-level details are what you end up examining when you look for performance bottlenecks. If you don't have telemetry telling you that thread pools are near exhaustion, add it. The effort and passion you put into monitoring is where you get the most back. Instead of saying "I don't have that data," you want to be able to say "I have that data; I just need to add this alert."
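To make the thread-pool point concrete, here is an editorial sketch (not the contributor's code, and in Python rather than on the JVM) of a pool that exposes its own saturation. InstrumentedPool, saturation, and the 0.9 alert ratio are hypothetical names and values chosen for the example; a real setup would export the number to whatever monitoring system is in use.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class InstrumentedPool:
    """Wraps ThreadPoolExecutor and counts submitted-but-unfinished tasks."""

    def __init__(self, max_workers: int, alert_ratio: float = 0.9):
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._max_workers = max_workers
        self._alert_ratio = alert_ratio  # assumed threshold, tune per workload
        self._lock = threading.Lock()
        self._in_flight = 0

    def submit(self, fn, *args, **kwargs):
        with self._lock:
            self._in_flight += 1
        future = self._pool.submit(fn, *args, **kwargs)
        # The callback runs when the task finishes, fails, or is cancelled.
        future.add_done_callback(self._task_done)
        return future

    def _task_done(self, _future):
        with self._lock:
            self._in_flight -= 1

    def saturation(self) -> float:
        """In-flight tasks divided by worker count; above 1.0 means tasks are queueing."""
        with self._lock:
            return self._in_flight / self._max_workers

    def near_exhaustion(self) -> bool:
        return self.saturation() >= self._alert_ratio

    def shutdown(self):
        self._pool.shutdown(wait=True)
```

With the number available, "add this alert" becomes a one-line rule on saturation() instead of a guess after the fact.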
- Tarun Chakraborty TOGAF | AWS | GCP | M365 | DevOps | Platform Engineering | IT Lead | Engineering Manager | MLOPS
Monitoring is always a challenge, but before monitoring we need to understand what workload is running on the system. In the first 60 seconds, use tools like top, uptime, dmesg | tail, vmstat 1, mpstat -P ALL 1, pidstat 1, iostat -xz 1, free -m, sar -n DEV 1, and sar -n TCP,ETCP 1 to understand where the problem is. Then look at error rates; if IoT applications are running, for example, the errors could be connectivity timeouts.
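A checklist like this is easy to script. The sketch below is an editorial illustration, assuming a Linux host with the procps and sysstat tools installed: it runs bounded variants of the commands (an added count of 5 so they terminate) and skips any tool that is missing. CHECKLIST and run_checklist are names invented here; top and dmesg | tail are left out because the first is interactive and the second needs a shell pipe and often elevated privileges.

```python
import shutil
import subprocess

# Bounded variants of the checklist commands: "1 5" means five 1-second samples.
CHECKLIST = [
    ["uptime"],
    ["vmstat", "1", "5"],
    ["mpstat", "-P", "ALL", "1", "5"],
    ["pidstat", "1", "5"],
    ["iostat", "-xz", "1", "5"],
    ["free", "-m"],
    ["sar", "-n", "DEV", "1", "5"],
    ["sar", "-n", "TCP,ETCP", "1", "5"],
]


def run_checklist(commands, timeout=30):
    """Run each command once; return {command line: output or a skip note}."""
    results = {}
    for argv in commands:
        label = " ".join(argv)
        if shutil.which(argv[0]) is None:
            results[label] = "skipped: not installed"
            continue
        try:
            done = subprocess.run(
                argv, capture_output=True, text=True, timeout=timeout
            )
            results[label] = done.stdout or done.stderr
        except subprocess.TimeoutExpired:
            results[label] = "skipped: timed out"
    return results
```

Calling run_checklist(CHECKLIST) and printing each entry gives a single snapshot to attach to an incident ticket; the commands run one after another, so the full pass takes a little over 25 seconds with these counts.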
-
Performance testing: This is the process of simulating different scenarios and loads on the system and measuring its response time, throughput, resource utilization, and other key performance indicators (KPIs). Performance testing can help identify the system’s capacity, limitations, and potential bottlenecks under various conditions. It can be done using automated tools such as LoadRunner, WebLOAD, or Apache JMeter.
Profiling: This is the process of analyzing the behavior and characteristics of a specific component or process within the system, such as an application, a database, a network device, or a CPU. Profiling can be done using tools that monitor and collect data on the component or process, such as code analyzers, debuggers, profilers, or tracers.
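As a small editorial example of profiling a single component, the sketch below uses Python's built-in cProfile module on two versions of the same lookup. The functions slow_lookup, fast_lookup, and profile are hypothetical and exist only to give the profiler something to measure.

```python
import cProfile
import io
import pstats


def slow_lookup(items, targets):
    """Deliberately inefficient: membership in a list is a linear scan."""
    return sum(1 for t in targets if t in items)


def fast_lookup(items, targets):
    """Same result using a set, where membership is a hash lookup."""
    item_set = set(items)
    return sum(1 for t in targets if t in item_set)


def profile(fn, *args):
    """Run fn under cProfile; return (result, text report of the top 5 entries)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(fn, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(5)
    return result, buffer.getvalue()


if __name__ == "__main__":
    items = list(range(2000))
    targets = list(range(0, 4000, 2))
    for fn in (slow_lookup, fast_lookup):
        result, report = profile(fn, items, targets)
        print(fn.__name__, "->", result)
        print(report)
```

Both functions return the same count; the report shows where the time goes, which is the information a profiler adds on top of system-level metrics.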
- Ron Eckart Systems Engineer III at Everstream Solutions
Using an external monitoring system is key both to identifying bottlenecks within a system and to gathering the system's normal operating metrics.
2 Identify the root cause
Once you have a general idea of where the bottleneck is, you need to dig deeper and find out the root cause of the problem. This may involve using more specific tools or commands to inspect the details of your system's components, such as lsof, strace, perf, ping, traceroute, and tcpdump. You may also need to check the logs, configuration files, and documentation of your system and its applications to look for clues and errors. You should try to isolate the source of the problem and eliminate other possible causes.
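When checking logs for clues, it often helps to collapse near-identical messages so the dominant error stands out. The following is a minimal sketch under stated assumptions: the log layout (timestamp, level, message) and the SAMPLE_LOG lines are hypothetical, and signature and top_errors are names invented for the example.

```python
import re
from collections import Counter


def signature(message: str) -> str:
    """Replace digit runs with '#' so messages that differ only by IDs or durations group together."""
    return re.sub(r"\d+", "#", message)


def top_errors(lines, limit=3):
    """Count ERROR lines by normalized message, most frequent first."""
    counts = Counter()
    for line in lines:
        _, marker, message = line.partition(" ERROR ")
        if marker:
            counts[signature(message.strip())] += 1
    return counts.most_common(limit)


# Hypothetical log excerpt for illustration only.
SAMPLE_LOG = """\
2024-05-01 10:00:01 INFO request 101 served in 12 ms
2024-05-01 10:00:02 ERROR upstream db-2 timed out after 3000 ms
2024-05-01 10:00:03 ERROR upstream db-2 timed out after 3002 ms
2024-05-01 10:00:04 ERROR disk /dev/sdb1 is 95% full
2024-05-01 10:00:05 ERROR upstream db-7 timed out after 3001 ms
""".splitlines()

if __name__ == "__main__":
    for message, count in top_errors(SAMPLE_LOG):
        print(count, message)
```

In the sample, three timeout lines collapse into one signature, which points the investigation at the upstream dependency before the single disk warning.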
-
It is imperative to distinguish software limitations from hardware limitations in order to limit downtime and choose cost-effective solutions. Often the issue isn't hardware related but software related, for example interface timeouts, application crashes, or errors in reporting.
- Miguel Maloney Thompson IT Professional | Cyber Security Analyst
Identifying the root cause can be broken down into four perspectives:
1. Diagnostic perspective: get a granular view of the problem using tools that inspect specific areas of a system, for example system calls and network routes. This pinpoints exactly where the issue lies.
2. Historical perspective: analyze logs and configuration with tools like Splunk or the ELK Stack to get the historical context of the issue.
3. Comparative perspective: compare the baseline with the current state using tools that keep historical data, like Zabbix and Prometheus, so the problem state can be set against a known good state.
4. Collaborative perspective: draw on internal documentation and on team members who may have encountered similar issues.
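The comparative perspective can be reduced to a simple calculation: how far is the current reading from the baseline, measured in baseline standard deviations? The sketch below is an editorial illustration and not how Zabbix or Prometheus implement comparisons; the function names and the default threshold of 3.0 are assumptions.

```python
from statistics import mean, stdev


def deviation_from_baseline(baseline, current):
    """Return how many baseline standard deviations `current` lies from the baseline mean."""
    spread = stdev(baseline)
    if spread == 0:
        raise ValueError("baseline has no variation; compare raw values instead")
    return (current - mean(baseline)) / spread


def is_anomalous(baseline, current, threshold=3.0):
    """True when the current value is more than `threshold` deviations from the baseline."""
    return abs(deviation_from_baseline(baseline, current)) > threshold


if __name__ == "__main__":
    # Hypothetical response times in milliseconds from a known good period.
    known_good = [10, 12, 11, 9, 13]
    for reading in (12, 20):
        print(reading, "anomalous" if is_anomalous(known_good, reading) else "normal")
```

The baseline must come from a period known to be healthy; a baseline recorded during an incident makes every later reading look normal.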
- Stacy Gray
Every performance bottleneck has a finite number of possible causes. Test the possible causes against the facts about the issue to eliminate false causes until you identify the most probable cause. Consider the following scenario: multiple users complaining about timeouts. Since it isn't just one user, don't waste time looking at their individual machines.
3 Apply the best solution
After you have identified the root cause of the performance bottleneck, you need to apply the best solution to fix it. This may involve tuning the parameters, upgrading the hardware, optimizing the code, changing the architecture, or adding more resources. You should always test the impact of your solution before applying it to the production environment and monitor the results after the implementation. You should also document your findings and actions for future reference and improvement.
- Miguel Maloney Thompson IT Professional | Cyber Security Analyst
An ecommerce website experienced an unexplained slowdown. Short-term monitoring via htop indicated high CPU and memory usage. Long-term historical data in Splunk ruled out internal server issues and instead showed increased traffic to specific pages unrelated to online purchases. Collaborative analysis with IT forums and team insights suggested bad traffic as the culprit. To mitigate this, we opted for a Web Application Firewall (WAF). Before enabling WAF filter rules, we verified the presence of bot traffic in the WAF stats. Upon rule activation, the site's performance returned to normal.
This approach aligns with industry best practices for identifying and mitigating non-legitimate traffic, ensuring both availability and security.
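The traffic analysis in this story can be approximated from raw web server logs. The sketch below is an editorial illustration, not the Splunk or WAF workflow the contributor used: it assumes access logs in the Combined Log Format, and the sample lines, addresses, and the traffic_summary name are all invented.

```python
import re
from collections import Counter

# Combined Log Format:
# ip ident user [time] "METHOD path proto" status size "referer" "agent"
LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def traffic_summary(lines, limit=3):
    """Return the most requested paths and the busiest user agents."""
    paths, agents = Counter(), Counter()
    for line in lines:
        match = LINE.match(line)
        if match:
            paths[match["path"]] += 1
            agents[match["agent"]] += 1
    return paths.most_common(limit), agents.most_common(limit)


# Hypothetical log lines using documentation-reserved IP ranges.
SAMPLE_ACCESS_LOG = [
    '203.0.113.5 - - [01/May/2024:10:00:01 +0000] "GET /catalog/search HTTP/1.1" 200 512 "-" "ExampleBot/1.0"',
    '203.0.113.6 - - [01/May/2024:10:00:01 +0000] "GET /catalog/search HTTP/1.1" 200 512 "-" "ExampleBot/1.0"',
    '203.0.113.7 - - [01/May/2024:10:00:02 +0000] "GET /catalog/search HTTP/1.1" 200 512 "-" "ExampleBot/1.0"',
    '198.51.100.7 - - [01/May/2024:10:00:03 +0000] "POST /checkout HTTP/1.1" 200 1024 "https://shop.example/cart" "Mozilla/5.0"',
]

if __name__ == "__main__":
    top_paths, top_agents = traffic_summary(SAMPLE_ACCESS_LOG)
    print("paths:", top_paths)
    print("agents:", top_agents)
```

A path and a user agent that dominate together, with no matching rise in checkouts, is the kind of pattern that would justify looking at bot filtering.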
4 Prevent future issues
The final step is to prevent performance bottlenecks from recurring. This may involve establishing a baseline for your system's performance, setting up alerts and thresholds for your metrics, automating your monitoring and analysis tasks, and following the best practices and standards for your system and its applications. You should also keep your system updated, secure, and backed up, and conduct regular audits and reviews to identify and address any potential issues.
Performance bottlenecks can affect your system's performance and user satisfaction. By following these steps, you can identify and resolve them effectively and efficiently.
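One way to set up alerts and thresholds without paging on every brief spike is to require the metric to stay above the threshold for several consecutive samples. The class below is a minimal sketch of that idea; the name SustainedThresholdAlert and the example values (90 percent, 3 samples) are assumptions, not settings from any particular monitoring product.

```python
from collections import deque


class SustainedThresholdAlert:
    """Fires only when the last `window` samples all exceed `threshold`."""

    def __init__(self, threshold: float, window: int):
        self.threshold = threshold
        self.recent = deque(maxlen=window)  # keeps only the newest `window` samples

    def observe(self, value: float) -> bool:
        """Record one sample and report whether the alert condition holds."""
        self.recent.append(value)
        full = len(self.recent) == self.recent.maxlen
        return full and all(v > self.threshold for v in self.recent)


if __name__ == "__main__":
    alert = SustainedThresholdAlert(threshold=90.0, window=3)
    for cpu_percent in [95, 50, 92, 93, 97, 80]:
        state = "ALERT" if alert.observe(cpu_percent) else "ok"
        print(cpu_percent, state)
```

In the example run only the fifth sample raises the alert, because it is the first point at which three readings in a row are above 90.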
- Jazeem Ilyas M DevOps Engineer | Architecting Cost-Optimized Infrastructure for Secure, Scalable, High-Availability Systems
One proactive approach that allows you to address scalability issues before they impact users is to test how your system performs under different levels of load. By simulating heavy usage or traffic, you can identify how well your system handles stress and whether it's scalable. Load testing tools like Apache JMeter, Siege and many other tools can help you mimic real-world conditions and discover potential bottlenecks that might only show up when the system is under heavy demand.
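The idea behind such tools can be shown in a few lines. The sketch below is an editorial illustration, not a substitute for JMeter or Siege: handle_request is a hypothetical stand-in for the system under test (here it just sleeps for 5 ms), and load_test measures throughput and latency percentiles at a chosen concurrency level.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import quantiles


def handle_request(_):
    """Hypothetical stand-in for the system under test."""
    time.sleep(0.005)


def timed_call(arg):
    """Return the wall-clock duration of one request, in seconds."""
    start = time.perf_counter()
    handle_request(arg)
    return time.perf_counter() - start


def load_test(concurrency: int, requests: int) -> dict:
    """Issue `requests` calls from `concurrency` workers; report throughput and latency."""
    started = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(timed_call, range(requests)))
    elapsed = time.perf_counter() - started
    cuts = quantiles(latencies, n=100)  # 99 cut points; index 94 is the 95th percentile
    return {
        "concurrency": concurrency,
        "throughput_rps": requests / elapsed,
        "p50_s": cuts[49],
        "p95_s": cuts[94],
    }


if __name__ == "__main__":
    for level in (1, 4, 16):
        print(load_test(concurrency=level, requests=80))
```

Repeating the run at increasing concurrency shows where throughput stops growing or the 95th-percentile latency starts to climb, which is the point where a bottleneck has been reached.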
- Matt Le Digital Transformation | Global Contribution | Service Excellence | Founder
Maintaining a connection between business users and ICT is a pivotal strategy in the battle against performance bottlenecks. While we often talk about the essential technical steps to prevent bottlenecks, the human element is often underestimated. Business users possess invaluable insights into the system’s real-world usage and can help bridge the gap between technical optimisation and user expectations, both of which are ever evolving.
By fostering collaboration, business users can communicate their needs and pain points, leading to more effective alert setting and metric selection. They can provide context that metrics alone cannot convey, ensuring that thresholds and alerts align with the actual impact on business operations.
- Miguel Maloney Thompson IT Professional | Cyber Security Analyst
Upgrades & updates can introduce issues, risking destabilizing environments. To mitigate, incorporating a development platform for testing is crucial. This allows vetting updates & upgrades for compatibility & stability before deploying to production. For instance, the recent end-of-life announcement for the Linux distro CentOS poses potential future issues. Organizations relying on CentOS may face compatibility challenges. Using a development platform, new Linux distros can be tested for compatibility with existing systems. Once confirmed stable, updates can then be safely rolled out to the production environment. This approach minimizes risks & ensures a smooth transition, thereby maintaining system reliability & performance.
- Joseph Marhee Container Technology Consultant at SUSE
Regular load testing, emergency management drills, and similar exercises allow systems teams to surface issues before they occur in production. It is also crucial to routinely audit monitors and evaluate whether they measure what they are believed to measure, and whether they will fire in a timely and useful way. For example, by the time an issue occurs, are other prerequisite, detectable failures likely to have occurred already that you are not yet monitoring for?
5 Here’s what else to consider
This is a space to share examples, stories, or insights that don’t fit into any of the previous sections. What else would you like to add?
- Piyush Jaiswal Associate Director Of Engineering @ Mobileum | Innovative Leader Driving Product Performance
"Bottleneck" is a wonderfully descriptive term. It describes an artificial constraint on some form of communication, interaction, or transfer of information. And it leads one to believe that some magical combination of luck, money, and ingenuity can smash that bottleneck and let all good things flow.
The trouble with performance bottlenecks is that they can be tough to identify. Is it the CPU? The network? A clumsy bit of code? Often, the most obvious culprit is actually downstream of something larger and more mystifying. And when performance riddles remain unsolved, IT management may find itself faced with a Hobson's choice between admitting ignorance and making up excuses.
- Miguel Maloney Thompson IT Professional | Cyber Security Analyst
The IT department supports other departments, so understanding their workflows and requirements is crucial for effective system management. Once our IT team gained insight into the marketing department's needs for A/B testing and PPC campaigns, which require fast page loading times, we tailored our system optimizations accordingly and scheduled maintenance and updates in a way that minimized the impact on crucial marketing activities. Essentially, understanding the operational needs of other departments helped us avoid creating performance bottlenecks for A/B tests or PPC campaigns. Cross-departmental understanding is key to ensuring IT operations align with organizational objectives and department-specific requirements.