Slow computers

2019-02-12

tl;dr: in the mid-2000s, at least, when your computer was slow, it was almost always due to disk.

In my first full-time systems administration job, I configured, installed, and ran the campus ERP ("Enterprise Resource Planning") system. This is the system that ran all the business on campus: student registration, hiring, financial aid, payroll, and pretty much anything involving money. I owned the systems administration; other people were the DBAs, application administrators, etc.

We were the first clients to run this application in production on Linux. To that end we bought some really fancy hardware: IBM eServer xSeries 445s. At that time Red Hat had three versions of Advanced Server's kernels, one of which was made specifically for this hardware. You could add additional CPU trays to the machine and hook them up with a special cable. You could even buy multiple chassis and link them together with these cables.

Once we got everything implemented, my job was to maintain the physical servers, operating system, and tiny parts of the application such as the job submission tool. An extremely common ticket/report was, "The system is slow."

Building good problem statements

Any time you're working on a technical problem, you need a good problem statement. See step three of Tom Limoncelli's paper, "Deconstructing User Requests and the Nine Step Model." A problem statement lets you investigate and test the issue.

In contrast to a good problem statement, we may encounter FUD: Fear, Uncertainty, and Doubt. FUD is when people say things like, "it's bad" or "it never works." Beyond FUD, there are also "wiggle words" that make it harder to know what's going on, such as "it takes a long time." "Long" could mean 3 seconds or 3 minutes or 30 minutes. I took a Lean A3 class last year; our instructor said that one company installed call bells (like at a hotel) in each meeting room for people to ring whenever they heard imprecise language in a meeting.

So in attempting to address "the system is slow," it's useful to figure out the problem statement. This requires identifying good investigative tools.

Investigating slowness

I feel that investigating slowness is one of the toughest systems administration challenges. You have to understand the entire system, from the CPU and disk to the network and database. You have to be able to cut through FUD and wiggle words to identify specific times/issues and quantify slowness.

Here are a few of my lessons learned from back then. Hopefully these concepts are useful even though I'd guess most of the specific tools are not relevant anymore.

Measure performance over time. I used atop for this, recording top-like performance every 10 minutes so we could go back in time to see what was going on. There are many other tools for this even going back to sar(1). (I hypothesize people are aware of sar if and only if they are aware of cpio.)
Identify possible bottlenecks, and for each type of bottleneck, find a tool that measures it.
- Disk slow: iostat(1), or other tools depending on your disk configuration (e.g. tools to diagnose your RAID array performance or your storage area network).
- Network slow: I used atop for this or sometimes Cacti. As a side bonus this introduced me to the wonders of SNMP. lsof(1) can also be helpful for spotting unusual volumes of network connections.
- CPU slow: top/atop, but especially watch user vs. system time. High system time can actually indicate blocking on resource contention, because "active waits" count as system time. Also look for threading issues, such as processes that don't thread properly or that need a flag to run multi-threaded.
- One process is slow: strace(1) is great if you figure out one process that's got some issue, e.g. if it's blocking trying to get an exclusive lock or something.
- RAM slow: top/atop, vmstat(1), and especially watch swap activity (if swap is even a thing nowadays).
- Database slow: use a good database-specific query monitoring tool, and use EXPLAIN when you get to a specific query. This is one small example of why database administration is a career.
Learn to read uptime. As a rule of thumb, the load average that uptime(1) reports should be less than or equal to the number of CPUs/cores in the system. For example, a load average of 8 is fine if there are at least 8 processors.
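
The uptime rule of thumb can be written down as a small check. This is only a sketch in Python, not a tool I used back then: the function names are made up for illustration, and the "more than 1.0 per core" threshold is just the rule of thumb above, not a hard limit.

```python
import os


def load_per_core(load_average: float, cpu_count: int) -> float:
    """Return the load average divided by the number of CPUs/cores."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return load_average / cpu_count


def is_overloaded(load_average: float, cpu_count: int) -> bool:
    """Rule of thumb: a load average above the core count is suspect."""
    return load_per_core(load_average, cpu_count) > 1.0


def check_this_machine() -> str:
    """Summarize the 1-, 5-, and 15-minute load averages for this host."""
    cpus = os.cpu_count() or 1  # cpu_count() can return None
    one, five, fifteen = os.getloadavg()  # available on Unix-like systems only
    status = "overloaded" if is_overloaded(five, cpus) else "ok"
    return f"load {one:.2f} {five:.2f} {fifteen:.2f} on {cpus} CPUs: {status}"
```

Calling check_this_machine() on a Unix-like host returns a one-line summary based on the 5-minute average. Picking the 5-minute figure is an arbitrary choice for this sketch.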

The above helped me work with database administrators to find slow queries and database optimizations.

Although it was important to have adequate CPU, in my investigations nine out of ten times the slowness was due to disk contention.
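
To put a number on disk contention, one option on Linux is to sample /proc/diskstats twice and see how much of the interval each device spent busy, which is roughly what iostat reports as %util. The following is a minimal sketch, assuming the field layout described in the kernel's I/O statistics documentation (device name in the third column, milliseconds spent doing I/O in the thirteenth). The function names are hypothetical.

```python
import time


def parse_diskstats(text: str) -> dict:
    """Map device name -> cumulative milliseconds spent doing I/O."""
    busy_ms = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 13:
            continue  # skip blank or short lines
        # fields[2] is the device name; fields[12] is time spent doing I/O (ms)
        busy_ms[fields[2]] = int(fields[12])
    return busy_ms


def utilization(before: str, after: str, interval_seconds: float) -> dict:
    """Percent of the interval each device was busy, similar to iostat's %util."""
    start = parse_diskstats(before)
    end = parse_diskstats(after)
    interval_ms = interval_seconds * 1000.0
    return {
        name: 100.0 * (end[name] - start[name]) / interval_ms
        for name in end
        if name in start
    }


def sample_utilization(interval_seconds: float = 10.0) -> dict:
    """Read /proc/diskstats twice on this (Linux) host and compare the samples."""
    with open("/proc/diskstats") as f:
        before = f.read()
    time.sleep(interval_seconds)
    with open("/proc/diskstats") as f:
        after = f.read()
    return utilization(before, after, interval_seconds)
```

A device that stays near 100% while users are reporting slowness is a good candidate for a specific, testable problem statement.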

Relevance to today

The immediate relevance to me of this article is that I've been using a laptop for the last couple of days with a 7200 rpm SATA drive in it and it's killing my productivity. Solid state drives have spoiled me. I'm sure the problem has been compounded by my building systems that require lots of file reads/writes because they're so cheap on SSDs.

John Borwick
