SRE/DevOps Interview Questions — Linux Troubleshooting (Extended)
This is an extension of my previous post, SRE/DevOps Interview Questions — Linux Troubleshooting.
In technical interviews, questions often start vague and become more specific as the conversation progresses. This “drill-down” approach helps interviewers gauge your depth of knowledge. In this blog, we’ll explore more complex troubleshooting scenarios and the Linux internals that power them.
Scenario: The Mystery of the Exhausted System
Interviewer: “A service on a Linux system is reported as ‘not running properly.’ How do you start your investigation?”
Candidate: “After verifying DNS and basic connectivity, I’d attempt to SSH into the system to check its state.”
Interviewer: “You try to SSH, but you get this error: ssh: connect to host example.com port 22: Resource temporarily unavailable. However, you have IPMI/OOB access and manage to log in. What’s next?”
Candidate: “Once inside, I’ll run standard tools like top, ps, or lsof to see what’s consuming resources.”
Interviewer: “Every command you type—top, ls, ps—returns the same error: fork: retry: Resource temporarily unavailable. How do you troubleshoot a machine when you cannot execute any external commands?”
The Solution: Shell Built-ins and /proc
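At this point, it is clear the system has exhausted a critical resource. A failing fork points most directly at the process or thread limit (for example ulimit -u or kernel.pid_max), though file descriptor exhaustion is worth ruling out as well. Every external command requires a fork() system call, so while the system cannot fork you must rely on shell built-ins.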
1. Identify Built-ins
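In Bash, type help to see a list of commands that are internal to the shell and do not require forking a new process. help is itself a Bash built-in; in other shells such as Zsh, type <name> tells you whether a given command is a built-in.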
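As a small sketch of the check described above, the loop below asks the shell how it resolves a few names. The function name show_kinds is made up for this example. Keep in mind that pipelines, $(...) command substitution, and ( ... ) subshells also fork, so they fail on a fork-starved machine just like external commands do.

```shell
# Report how the current shell resolves each name.
# `type` is itself a built-in, so this loop creates no new process.
show_kinds() {
  for name in "$@"; do
    type "$name" 2>&1
  done
}

# Built-ins are reported as "shell builtin"; external tools are reported with a path.
show_kinds echo read cd
```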
2. Leverage the /proc Filesystem
Almost all performance data provided by ps, top, and vmstat is actually sourced from the /proc virtual filesystem. You can “read” this data using the built-in read and for loops.
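A minimal sketch of that idea, assuming the usual layout of /proc/loadavg and /proc/meminfo. The function names show_load and show_mem are illustrative, and the optional file argument exists only so the functions can be tried against a saved copy.

```shell
# Print load averages from a loadavg-style file (default: /proc/loadavg).
# Fields: 1-, 5-, 15-minute load, running/total tasks, last PID.
show_load() {
  read -r one five fifteen procs lastpid < "${1:-/proc/loadavg}"
  echo "load: $one $five $fifteen  running/total: $procs"
}

# Print a few memory fields from a meminfo-style file (default: /proc/meminfo).
show_mem() {
  while read -r key value unit; do
    case "$key" in
      MemTotal:|MemAvailable:|SwapFree:) echo "$key $value $unit" ;;
    esac
  done < "${1:-/proc/meminfo}"
}

# Usage on a live system (no fork needed for either call):
#   show_load
#   show_mem
```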
Example: Counting File Descriptors per Process
If you suspect file descriptor exhaustion, you can iterate through /proc without calling ls:
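# Count open FDs per PID using only globbing and built-ins (no ls, no wc)
for pid_dir in /proc/[0-9]*; do
  count=0
  for fd in "$pid_dir"/fd/*; do
    if [ -L "$fd" ]; then count=$((count + 1)); fi   # skip unmatched globs
  done
  echo "${pid_dir#/proc/}: $count"
done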
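Since a failing fork usually points at a task limit, the same globbing trick can count tasks (processes plus their threads) and set the total next to the kernel limits. This is a sketch: count_tasks, report_limits, and TASK_COUNT are names invented here, and the result is stored in a variable rather than captured with $(...), because command substitution would itself need to fork.

```shell
# Count every task (thread) under a /proc-style tree using only globbing.
# The result is left in TASK_COUNT so no command substitution is needed.
count_tasks() {
  TASK_COUNT=0
  for t in "${1:-/proc}"/[0-9]*/task/[0-9]*; do
    if [ -d "$t" ]; then
      TASK_COUNT=$((TASK_COUNT + 1))
    fi
  done
}

# Compare the task count with the kernel-wide limits (Linux paths, Bash ulimit).
report_limits() {
  count_tasks
  read -r pid_max < /proc/sys/kernel/pid_max
  read -r threads_max < /proc/sys/kernel/threads-max
  echo "tasks: $TASK_COUNT  pid_max: $pid_max  threads-max: $threads_max"
  ulimit -u   # per-user process limit; ulimit is a built-in
}
```

Running report_limits on the affected machine shows at a glance whether the task count is pressing against pid_max, threads-max, or the per-user limit.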
Example: Checking Command Lines
To see what a specific PID is actually running without using ps or cat:
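# Arguments in /proc/PID/cmdline are NUL-separated, so read them one at a time.
# Assumes PID is already set to the process ID you want to inspect (Bash only).
while IFS= read -r -d '' arg; do
  printf '%s ' "$arg"
done < "/proc/${PID}/cmdline"
echo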
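The same pattern works for /proc/PID/status, which gives the name, state, parent, and thread count that ps would normally show. A minimal sketch, where proc_summary is a hypothetical helper and the second argument (an alternate root directory) exists only for trying it out on saved files:

```shell
# Print a short summary of one process from its status file.
# $1 = PID, $2 = optional root directory (default: /proc).
proc_summary() {
  while read -r key value; do
    case "$key" in
      Name:|State:|PPid:|Threads:) echo "$key $value" ;;
    esac
  done < "${2:-/proc}/$1/status"
}

# Usage on a live system: proc_summary 1
```

A process with an unusually large Threads value, or many processes sharing one PPid, is a good lead when the machine has run out of tasks.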
Interactive Troubleshooting: Data Interpretation
Question: Look at the vmstat output below. What is happening on this machine?

Key areas to focus on:
- r (run queue): Are processes waiting for CPU?
- b (blocked): Are processes stuck waiting for I/O?
- si/so (swap in/out): Is the system thrashing due to memory pressure?
- in (interrupts) vs. cs (context switches): High numbers here often indicate a specific type of workload or a potential bottleneck in kernel scheduling.
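To connect these columns to their source, the sketch below pulls the underlying counters from /proc/stat and /proc/vmstat. The function name vmstat_raw and the output labels are made up for this example. Note that intr, ctxt, pswpin, and pswpout are cumulative totals since boot, whereas vmstat reports per-interval rates, so you need two samples and a subtraction to compare with its output.

```shell
# Print the raw kernel counters behind vmstat's r, b, in, cs, si and so columns.
# $1 = stat file (default /proc/stat), $2 = vmstat file (default /proc/vmstat).
vmstat_raw() {
  while read -r key value rest; do
    case "$key" in
      procs_running) echo "r=$value" ;;
      procs_blocked) echo "b=$value" ;;
      intr)          echo "in_total=$value" ;;
      ctxt)          echo "cs_total=$value" ;;
    esac
  done < "${1:-/proc/stat}"
  while read -r key value; do
    case "$key" in
      pswpin)  echo "si_pages_total=$value" ;;
      pswpout) echo "so_pages_total=$value" ;;
    esac
  done < "${2:-/proc/vmstat}"
}

# Usage on a live system: vmstat_raw
```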
Question: Compare these two vmstat outputs. What has changed?

Networking & Security Deep Dives
Question: TCP vs. IP Datagrams
“Given a TCP connection between two machines, what happens if the packets move as raw IP Datagrams? What are the implications for latency and reliability?”
Question: The curl Request Flow
“When you execute curl example.com, walk me through the entire lifecycle of that request.”
This is a multi-layered question covering:
- System Calls: The fork() and exec() process.
- DNS: Resolution of the hostname via /etc/hosts or resolv.conf.
- Transport: TCP 3-way handshake.
- Security: SSL/TLS negotiation (Asymmetric vs. Symmetric encryption).
- Application: HTTP request/response headers.
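One way to make these layers visible is curl's -w (write-out) option, which can print how long each phase took. The sketch below is illustrative: CURL_PHASES and curl_phases are names chosen here, while the %{...} variables are standard curl write-out variables. Each time is measured from the start of the request, so the phases are cumulative.

```shell
# Write-out format: one timing per layer of the request lifecycle.
CURL_PHASES='dns=%{time_namelookup}s tcp=%{time_connect}s tls=%{time_appconnect}s first_byte=%{time_starttransfer}s total=%{time_total}s http=%{http_code}\n'

# Fetch a URL, discard the body, and print only the phase timings.
curl_phases() {
  curl -sS -o /dev/null -w "$CURL_PHASES" "$1"
}

# Usage: curl_phases https://example.com
```

A large gap between tcp and tls points at TLS negotiation, while a large gap between tls and first_byte points at the server or application.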
Follow-up: “The output of your curl shows ‘301 Moved Permanently’. Why did this happen, and how do you fix it?”
Discuss response codes and the -L (follow redirects) flag.
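A 301 response carries a Location header that names the new URL, and that is what -L follows. As a sketch, the helper below (redirect_target is a hypothetical name) extracts that header from raw response headers so you can see where the server is sending you before deciding to follow.

```shell
# Read HTTP response headers on stdin and print the Location value, if any.
redirect_target() {
  cr=$(printf '\r')
  while IFS= read -r line; do
    line=${line%"$cr"}            # HTTP header lines end in CRLF
    case "$line" in
      [Ll]ocation:*)
        value=${line#*:}
        echo "${value# }"
        return 0
        ;;
    esac
  done
  return 1
}

# Usage:
#   curl -sI http://example.com | redirect_target   # show the redirect target
#   curl -sL http://example.com                     # follow it automatically
```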
Essential Configuration Files
- nsswitch.conf: Controls the order of lookups (e.g., should the system check /etc/hosts before DNS?).
- nscd: The Name Service Cache Daemon. Why is it used, and what are its pitfalls?
- /etc/services: Maps friendly service names to port numbers and protocols (TCP/UDP).
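To see the lookup order a machine actually uses, you can read the hosts line of nsswitch.conf directly. A minimal sketch, with hosts_lookup_order as an invented helper name and the file argument optional:

```shell
# Print the source order for host lookups from an nsswitch.conf-style file.
hosts_lookup_order() {
  while read -r db sources; do
    case "$db" in
      hosts:) echo "$sources"; return 0 ;;
    esac
  done < "${1:-/etc/nsswitch.conf}"
  return 1
}

# Usage:
#   hosts_lookup_order            # e.g. prints "files dns" on many systems
#   getent hosts example.com      # resolves through that same NSS order
#   getent services ssh           # looks up a name in the services database
```

Because getent goes through the name service switch, it is a better check of what applications see than a tool that queries DNS directly.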
Advanced Scenarios
Simulating Packet Loss
“How do you simulate 50% packet drop for testing purposes?”
Answer: Use tc (Traffic Control) with netem.
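A sketch of what that looks like in practice. The helpers below only print the tc commands so they can be reviewed before anything runs as root; the function names are made up, and eth0 is an assumed interface name that you should replace with your own. A netem qdisc attached at the root shapes outgoing traffic on that interface.

```shell
# Print the command that adds random packet loss on an interface.
# $1 = interface (assumed default: eth0), $2 = loss percentage (default: 50).
netem_loss_cmd() {
  echo "tc qdisc add dev ${1:-eth0} root netem loss ${2:-50}%"
}

# Print the command that removes the netem qdisc again.
netem_clear_cmd() {
  echo "tc qdisc del dev ${1:-eth0} root netem"
}

# Usage (requires root to actually apply):
#   netem_loss_cmd eth0 50 | sudo sh
#   ping -c 20 <peer>               # observe the loss
#   netem_clear_cmd eth0 | sudo sh
```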
Short-Lived Processes
“How do you troubleshoot CPU spikes caused by processes that only live for a few milliseconds?”
Answer: Standard top won’t catch them, because it samples at intervals and a process that starts and exits between refreshes never appears. Use execsnoop (perf-tools) to capture short-lived process execution.
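To practice this scenario, it helps to have a reproducible source of short-lived processes. The sketch below is a hypothetical workload generator (spawn_short_lived and SPAWNED are names invented here): it repeatedly starts a shell that exits immediately. Run it in one terminal, then watch with top and with execsnoop in another to see the difference.

```shell
# Start N short-lived processes back to back (default: 1000).
# Each iteration forks and execs /bin/sh, which exits at once.
spawn_short_lived() {
  SPAWNED=0
  while [ "$SPAWNED" -lt "${1:-1000}" ]; do
    /bin/sh -c ':'
    SPAWNED=$((SPAWNED + 1))
  done
}

# Usage: spawn_short_lived 5000
```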
Conclusion
These questions are designed to move beyond “knowing the command” into “understanding the system.” I’ll continue to document these scenarios as they evolve.