A Practical Toolkit for Troubleshooting Java in Production

When you are diagnosing Java problems, sound reasoning matters most, but tools are what make the work possible. The right tool can cut the effort in half, and without it some investigations simply stall. What matters first is knowing what is available; usage can often be learned on demand. Still, it helps a lot to have handled these tools at least once before you truly need them.

Log tools come first

Most investigations lean heavily on logs, so being comfortable with a few command-line basics goes a long way. In practice, tail, find, fgrep, and awk are often enough to cover a large share of day-to-day log analysis.

That naturally leads to a more important point: exception logging and key informational logs must be done well. Poor exception handling makes troubleshooting far harder than it needs to be. A common example is an application with its own ServletContextListener implementation that throws a RuntimeException, causing Tomcat to exit directly—while Tomcat may not print that exception at all. Cases like that are extremely frustrating to track down, even if there are workarounds.
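One way to avoid the silent-exit case is to log the failure yourself before the container ever sees it. The following is a minimal sketch, not Tomcat-specific code: `StartupGuard` and `runLogged` are hypothetical names, and `java.util.logging` is used only to keep the example self-contained. The same pattern could wrap the body of a listener's initialization method.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical helper: logs a startup failure with its full stack trace, then rethrows.
final class StartupGuard {
    private static final Logger LOG = Logger.getLogger(StartupGuard.class.getName());

    static void runLogged(String stepName, Runnable step) {
        try {
            step.run();
        } catch (RuntimeException | Error e) {
            // Log first, so the cause is on record even if the caller never prints it.
            LOG.log(Level.SEVERE, "Startup step failed: " + stepName, e);
            throw e;
        }
    }

    public static void main(String[] args) {
        try {
            runLogged("load-config", () -> {
                throw new IllegalStateException("config file missing");
            });
        } catch (IllegalStateException e) {
            System.out.println("rethrown after logging: " + e.getMessage());
        }
    }
}
```

Rethrowing keeps the original failure behavior intact; the only change is that the stack trace is guaranteed to reach a log you control.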

Log standardization matters just as much. If logs are not standardized, sometimes you cannot even figure out where they are. In distributed systems, standardized logs also make tracing much easier, which is a huge advantage when locating the source of a problem.

CPU-related tools

For CPU issues, a small set of tools covers most situations.

`top (-H)`

top lets you watch CPU metrics in real time, including the state of individual cores, which is often much more useful than looking only at aggregate CPU usage. The -H option helps identify which thread is consuming CPU, and that alone can solve many straightforward high-CPU cases.

`sar`

sar is valuable because it gives you historical metrics, not just current ones. And CPU is only part of the picture—you can also inspect memory, disk, network, and more. Since many incidents are already over by the time someone starts investigating, historical data is critical.

`jstack`

jstack shows what the threads inside a Java process are doing. It is often useful when an application becomes unresponsive or extremely slow. By default it shows only Java stacks. jstack -m prints mixed-mode output with both Java and native frames, but frames for JIT-compiled Java methods may not be shown clearly, and most frequently executed Java methods are compiled.
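To connect a hot thread from top -H to its stack in a jstack dump, the decimal thread ID from top has to be matched against the native ID in the dump. The sketch below assumes the JDK 8-style thread header, where that ID is printed as lowercase hex (`nid=0x...`); newer JDKs may print it differently, so check your own dump first. `NidLookup` and the sample dump lines are illustrative, not real output.

```java
import java.util.Optional;

final class NidLookup {
    // Assumption: jstack prints native thread ids as lowercase hex, e.g. nid=0x10e1.
    static String toNid(long tid) {
        return "nid=0x" + Long.toHexString(tid);
    }

    // Returns the first thread header line in the dump whose nid matches the given TID.
    static Optional<String> findThreadHeader(String dump, long tid) {
        String token = toNid(tid) + " "; // trailing space avoids matching a longer nid
        for (String line : dump.split("\\R")) {
            if (line.startsWith("\"") && (line + " ").contains(token)) {
                return Optional.of(line);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Illustrative dump fragment, not output from a real process.
        String dump = String.join("\n",
            "\"main\" #1 prio=5 os_prio=0 tid=0x00007f0000001000 nid=0x10e0 waiting on condition",
            "\"worker-1\" #12 prio=5 os_prio=0 tid=0x00007f0000002000 nid=0x10e1 runnable");
        System.out.println(toNid(4321));
        System.out.println(findThreadHeader(dump, 4321).orElse("not found"));
    }
}
```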

`pstack`

pstack is useful when you want to inspect the native stack of a Java process.

`perf`

Simple CPU consumption problems can often be resolved with top -H plus jstack. Once the issue gets more complicated, perf becomes one of the most powerful tools you can bring in.

`cat /proc/interrupts`

This one is especially relevant for distributed applications. Heavy network traffic can make interrupt handling itself a significant source of CPU overhead. At that point, NIC multi-queue configuration and interrupt balancing become important. So if si (the share of CPU time spent on software interrupts, as shown by top) is noticeably high, it is worth checking how interrupts are distributed across cores.
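A quick way to judge interrupt balance is to total the counts per CPU column. The sketch below parses text in the /proc/interrupts layout (a header row of CPU names, then one row per interrupt source); `InterruptTotals` is a hypothetical name, and rows that carry a single global counter instead of one per CPU are skipped.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

final class InterruptTotals {
    // Sums interrupt counts per CPU column from text in the /proc/interrupts layout.
    static long[] perCpuTotals(String text) {
        String[] lines = text.split("\\R");
        int cpus = lines[0].trim().split("\\s+").length; // header: CPU0 CPU1 ...
        long[] totals = new long[cpus];
        for (int i = 1; i < lines.length; i++) {
            int colon = lines[i].indexOf(':');
            if (colon < 0) continue;
            String[] fields = lines[i].substring(colon + 1).trim().split("\\s+");
            int numeric = 0;
            while (numeric < fields.length && numeric < cpus && fields[numeric].matches("\\d+")) {
                numeric++;
            }
            if (numeric != cpus) continue; // e.g. a row with one global counter
            for (int c = 0; c < cpus; c++) {
                totals[c] += Long.parseLong(fields[c]);
            }
        }
        return totals;
    }

    public static void main(String[] args) throws IOException {
        Path source = Paths.get("/proc/interrupts"); // Linux only
        if (!Files.isReadable(source)) {
            System.out.println("/proc/interrupts is not available on this system");
            return;
        }
        String text = new String(Files.readAllBytes(source), StandardCharsets.UTF_8);
        System.out.println(Arrays.toString(perCpuTotals(text)));
    }
}
```

If one column dwarfs the others, interrupts are probably concentrated on a single core, which is the situation multi-queue and interrupt balancing are meant to address.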

Memory-related tools

Memory problems usually require a different set of tools, and the investigation often moves back and forth between observation, dumping, and tracing.

`jstat`

Commands such as jstat -gcutil or -gc help monitor GC behavior in real time. That said, GC logs are often more convenient to work with over time.
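When attaching jstat is not convenient, similar counters can be read from inside the process. This is a rough sketch using the standard `java.lang.management` API, not a replacement for jstat: it reports only cumulative collection counts and times per collector, and `GcSnapshot` is a hypothetical name.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

final class GcSnapshot {
    static String format(String name, long count, long timeMillis) {
        return name + " collections=" + count + " timeMillis=" + timeMillis;
    }

    // One line per collector with its cumulative collection count and elapsed time.
    static List<String> snapshot() {
        List<String> out = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            out.add(format(gc.getName(), gc.getCollectionCount(), gc.getCollectionTime()));
        }
        return out;
    }

    public static void main(String[] args) throws InterruptedException {
        // Print three samples one second apart, similar in spirit to a jstat interval.
        for (int i = 0; i < 3; i++) {
            snapshot().forEach(System.out::println);
            Thread.sleep(1000);
        }
    }
}
```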

`jmap`

When you need a heap dump to see what is actually in memory, jmap -dump is the standard choice. jmap also has a second use: jmap -histo:live forces a full GC as a side effect. With a collector such as CMS, where heap fragmentation is unavoidable, you will occasionally need some way to force a full collection, and this is one. It pauses the application, so it is not a command to run casually.
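A heap dump can also be requested from inside the JVM. The sketch below assumes a HotSpot-based JDK, since `com.sun.management.HotSpotDiagnosticMXBean` is a HotSpot-specific API; `HeapDumper` and the file names are placeholders. Passing `live = true` dumps only reachable objects, which on HotSpot implies a full collection first, so the same caution applies as for jmap.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;

final class HeapDumper {
    // Writes a heap dump of the current JVM; the target file must not exist yet.
    static void dump(Path target, boolean live) throws IOException {
        HotSpotDiagnosticMXBean bean =
            ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        bean.dumpHeap(target.toString(), live);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("heapdump");
        Path file = dir.resolve("app.hprof"); // recent JDKs require the .hprof suffix
        dump(file, true);
        System.out.println("wrote " + file + " (" + Files.size(file) + " bytes)");
    }
}
```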

`gcore`

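Compared with jmap -dump, gcore usually feels faster, and for that reason it is often preferable. The result is a core file rather than a heap dump; on JDK 8, jmap can produce a heap dump from it when given the java executable and the core file. Some JDK versions do not cooperate with gcore very well, though, so in those situations you still need jmap -dump.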

`mat`

A heap dump without a proper analysis tool is not very useful. mat (the Eclipse Memory Analyzer) is one of the best tools for this job.

`btrace`

A small number of memory issues can be understood directly from a heap dump in mat, but in many cases you will still need dynamic tracing. That is where btrace becomes a real powerhouse for Java troubleshooting.

A simple example: suppose you need to find where a running Java application is creating ArrayList instances with an array size greater than 1000. Without dynamic tracing, that can be painful. With btrace, it becomes a very quick task.

`gperftools`

The tools above are usually enough for memory consumed inside the Java heap. Off-heap memory is much more troublesome. In practice, gperftools (Google's perftools, not the unrelated GNU gperf hash generator) is one of the more usable choices here. From experience, Direct ByteBuffer and Deflater/Inflater are common sources of this kind of problem.
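For the Direct ByteBuffer case specifically, the JVM itself tracks how much memory the direct buffer pool holds, and that figure can be read through the standard `BufferPoolMXBean`. This is only a partial view, offered as a sketch: native allocations made by Deflater/Inflater do not show up here and still need a native-level profiler. `DirectBufferUsage` is a hypothetical name.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;

final class DirectBufferUsage {
    // Bytes currently held by the named buffer pool ("direct" or "mapped"), or -1 if absent.
    static long memoryUsed(String poolName) {
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if (pool.getName().equals(poolName)) {
                return pool.getMemoryUsed();
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        long before = memoryUsed("direct");
        ByteBuffer buffer = ByteBuffer.allocateDirect(1 << 20); // 1 MiB off-heap
        long after = memoryUsed("direct");
        System.out.println("direct pool: " + before + " -> " + after
            + " bytes (buffer capacity " + buffer.capacity() + ")");
    }
}
```

Sampling this value over time and logging it next to the GC records makes a slow direct-buffer leak much easier to spot.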

Beyond tools, recorded memory information matters just as much as logs do. GC logging should always be enabled so that when a problem occurs, you can go back and compare behavior against the GC records. Startup parameters such as -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc: should be part of the standard configuration. These are the JDK 8 and earlier spellings; from JDK 9 on, unified logging (-Xlog:gc*) takes their place.

ClassLoader troubleshooting

If you write Java long enough, ClassLoader problems are nearly impossible to avoid.

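The easiest starting point is often -XX:+TraceClassLoading (on JDK 9 and later, -Xlog:class+load; -verbose:class works on both). If you already know which class is involved, a practical approach is to inspect the JARs in all candidate lib directories directly and look for duplicate or conflicting classes. Run jar -tf on each JAR in turn for this: jar reads only one archive per invocation, so a bare jar -tvf *.jar lists just the first file the shell expands.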
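The JAR inspection step can be automated. The sketch below lists every .class entry that appears in more than one JAR of a single directory, using only `java.util.jar`. `DuplicateClassFinder` and the default `lib` directory are placeholders; entries such as module-info.class legitimately appear in many JARs and will show up as noise.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

final class DuplicateClassFinder {
    // Maps each .class entry found in more than one JAR under dir to the JARs containing it.
    static Map<String, List<String>> findDuplicates(Path dir) throws IOException {
        Map<String, List<String>> owners = new TreeMap<>();
        try (DirectoryStream<Path> jars = Files.newDirectoryStream(dir, "*.jar")) {
            for (Path jar : jars) {
                try (JarFile jarFile = new JarFile(jar.toFile())) {
                    Enumeration<JarEntry> entries = jarFile.entries();
                    while (entries.hasMoreElements()) {
                        String name = entries.nextElement().getName();
                        if (name.endsWith(".class")) {
                            owners.computeIfAbsent(name, k -> new ArrayList<>())
                                  .add(jar.getFileName().toString());
                        }
                    }
                }
            }
        }
        owners.values().removeIf(jarNames -> jarNames.size() < 2);
        return owners;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Paths.get(args.length > 0 ? args[0] : "lib"); // "lib" is a placeholder
        findDuplicates(dir).forEach((cls, jars) -> System.out.println(cls + " -> " + jars));
    }
}
```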

If that still does not settle it, btrace is again the heavy-duty option: you can trace operations such as ClassLoader.defineClass to see what is actually happening at load time.

Other useful tools

`jinfo`

Java has a huge number of startup options, and just as many defaults. Documentation is not always reliable; jinfo -flags is, because it reports what the running JVM is actually using. To check a single option, jinfo -flag followed by the option name and the process ID prints its current value.
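The same information is available from inside the process. The sketch below assumes a HotSpot-based JDK, because `com.sun.management.HotSpotDiagnosticMXBean` is HotSpot-specific; `VmFlags` is a hypothetical name, and the two flags queried are just examples.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import com.sun.management.VMOption;
import java.lang.management.ManagementFactory;

final class VmFlags {
    // Returns "name=value (origin)" for a HotSpot flag; unknown names throw IllegalArgumentException.
    static String describe(String flagName) {
        HotSpotDiagnosticMXBean bean =
            ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        VMOption option = bean.getVMOption(flagName);
        return option.getName() + "=" + option.getValue() + " (" + option.getOrigin() + ")";
    }

    public static void main(String[] args) {
        for (String flag : new String[] {"MaxHeapSize", "ThreadStackSize"}) {
            System.out.println(describe(flag));
        }
    }
}
```

The origin field shows whether a value came from the default, the command line, or the JVM's own ergonomics, which is often the detail the documentation cannot tell you.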

`dmesg`

If your Java process suddenly disappears, dmesg is often a good place to look first. A common cause is the kernel's OOM killer, which records the processes it kills in the kernel log.

`systemtap`

Some issues cannot be fully understood at the Java layer. When you need to trace function calls deeper in the operating system, systemtap can be the right tool.

`gdb`

For harder and stranger failures, especially when you have a core dump, gdb is the next level.

I/O-related issues are not covered in detail here. There are tools for them too, but a bare list without context from real use would not help much.

In the end, most tools can be learned when needed. But before that, the important thing is simply to know they exist and roughly what each one is good for. If possible, try them in advance—when a real incident happens, even a little familiarity makes a big difference.

