Troubleshooting in Practice - connection leak detection as an example

Introduction

At the time of writing, I have been working for over five years. As my experience has accumulated, more and more colleagues and peers have come to me for help with challenging problems, such as Node.js memory leaks, database connection pool leaks (where connections are held for a long time), performance optimization, and deadlocks. Today, I will use database connection pool leak detection as an example to summarize my approach to troubleshooting complex issues, for future reference.

Overall, my approach to analyzing complex issues can be broken down into eight steps.

Step 1: Define Problem

The first and most crucial step in troubleshooting is to define the problem precisely. Start by gathering information about the symptoms. In our connection leak case, the problem was:

Responses were very slow, and in some cases there was no response at all: the gateway returned an HTTP 503 status code.

This is our problem definition, and it is very straightforward.

Step 2: Gather Information

A slow response can have many possible causes, so there is no need to rush into investigating immediately. Instead, to understand the problem better, we should first collect as much information as possible: environment, code version, logs, error messages, and so on.

In our scenario, the logs showed that the /metrics endpoint was timing out. There was no further information except that a database connection could not be acquired.
log1

That meant the database connection pool was always full. We suspected that some connections were never released after use, so we developed a detection routine to gather more logs about connection leaks, as below:

const connectionMap = new WeakMap<Connection, IConInfo>();
export function registerForLeaks(label: string, con: Connection, stackErr?: Error) {
    if (connectionMap.has(con)) { // already registered
        return;
    }

    const info: IConInfo = {
        id: uuid(),
        stackErr,
    };

    Error.captureStackTrace(info, registerForLeaks); // capture stack excluding current function

    const warnAfterUnusedMs: number = (process.env.DB_LEAK_WARN_AFTER_UNUSED && parseInt(process.env.DB_LEAK_WARN_AFTER_UNUSED, 10)) || TimeUnit.MINUTES.toMillis(30);
    const intervalMs: number = (process.env.DB_LEAK_INTERVAL_CHECK && parseInt(process.env.DB_LEAK_INTERVAL_CHECK, 10)) || TimeUnit.MINUTES.toMillis(1);
    const interval = setInterval(() => {
        if (!connectionMap.has(con)) {
            clearInterval(interval);
            return;
        }
        const unusedSinceMs = Date.now() - (info.lastUsedTimeStamp || info.connectionTimestamp);

        // Report the probable leak once, then stop checking this connection
        if (unusedSinceMs > warnAfterUnusedMs) {
            logError(
                `[${label}] probable connection leak detected with connection id: ${info.id}, unused since ${TimeUnit.MILLISECONDS.toMinutes(unusedSinceMs)} mins`,
            );
            clearInterval(interval);
            connectionMap.delete(con);
        }
    }, intervalMs);

    connectionMap.set(con, info);

    con.on('connected', (evt) => {
        info.connectionTimestamp = Date.now();
        logDebug(`[${label}] connected ${info.id} ${info.connectionTimestamp}`);
    });
    con.on('disconnected', (evt) => {
        connectionMap.delete(evt.connection);
        clearInterval(interval);
    });
    con.on('executed', (evt) => {
        info.lastUsedTimeStamp = Date.now();
    });
    con.on('prepared', (evt) => {
        info.lastUsedTimeStamp = Date.now();
    });
}
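
The heart of the detector above is the idle-time comparison. As a minimal sketch, that comparison can be pulled out into a pure function, which makes it easy to test without timers or a real database driver. The names TrackedConnection and findProbableLeaks are hypothetical and not part of the code above; the sample data is made up.

```typescript
// Hypothetical shape, mirroring the fields the detector above reads from IConInfo.
interface TrackedConnection {
    id: string;
    connectionTimestamp: number; // when the connection was established (ms since epoch)
    lastUsedTimeStamp?: number;  // when a statement was last prepared or executed
}

// Returns the ids of connections that have been idle longer than warnAfterUnusedMs.
function findProbableLeaks(
    tracked: TrackedConnection[],
    nowMs: number,
    warnAfterUnusedMs: number,
): string[] {
    return tracked
        .filter((info) => {
            // A connection that was never used falls back to its connection time.
            const lastActivity = info.lastUsedTimeStamp ?? info.connectionTimestamp;
            return nowMs - lastActivity > warnAfterUnusedMs;
        })
        .map((info) => info.id);
}

const MINUTE_MS = 60 * 1000;
const nowMs = 100 * MINUTE_MS;
const sample: TrackedConnection[] = [
    // never used, idle for 45 minutes
    { id: 'a', connectionTimestamp: nowMs - 45 * MINUTE_MS },
    // used 2 minutes ago
    { id: 'b', connectionTimestamp: nowMs - 45 * MINUTE_MS, lastUsedTimeStamp: nowMs - 2 * MINUTE_MS },
    // last used 31 minutes ago
    { id: 'c', connectionTimestamp: nowMs - 90 * MINUTE_MS, lastUsedTimeStamp: nowMs - 31 * MINUTE_MS },
];

const leaks = findProbableLeaks(sample, nowMs, 30 * MINUTE_MS);
console.log(leaks);
```

With a 30-minute threshold, only the connections idle for 45 and 31 minutes are reported; the recently used one is not.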

Step 3: Reproduce Problem

Before diagnosing the issue, it is much better if we can reproduce it consistently. Reproducing the problem lets us work in a controlled environment and narrow down the scope of the investigation.
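
To illustrate what a controlled reproduction can look like, here is a minimal sketch using a tiny in-memory pool. TinyPool, leakyHandler, and callsUntilExhausted are hypothetical names invented for this example; they stand in for a real connection pool and a real request handler, and are not the code from our incident.

```typescript
// Hypothetical in-memory pool used only for illustration; it stands in for
// a real database connection pool with a fixed maximum size.
class TinyPool {
    private inUse = 0;

    constructor(private readonly maxSize: number) {}

    acquire(): { release: () => void } {
        if (this.inUse >= this.maxSize) {
            throw new Error('failed to acquire connection: pool exhausted');
        }
        this.inUse++;
        let released = false;
        return {
            release: () => {
                // Guard against double release so the counter stays accurate.
                if (!released) {
                    released = true;
                    this.inUse--;
                }
            },
        };
    }

    get activeCount(): number {
        return this.inUse;
    }
}

// A handler that leaks: it acquires a connection and never releases it.
function leakyHandler(pool: TinyPool): void {
    pool.acquire();
}

// Calls the handler until the pool refuses to hand out a connection,
// and returns how many calls succeeded before the first failure.
function callsUntilExhausted(
    pool: TinyPool,
    handler: (p: TinyPool) => void,
    maxCalls: number,
): number {
    for (let i = 0; i < maxCalls; i++) {
        try {
            handler(pool);
        } catch {
            return i;
        }
    }
    return maxCalls;
}

const pool = new TinyPool(3);
const succeeded = callsUntilExhausted(pool, leakyHandler, 10);
console.log(`calls before exhaustion: ${succeeded}, active: ${pool.activeCount}`);
```

Keeping the pool small makes the failure show up after a handful of calls instead of after hours of traffic, which is the point of a controlled reproduction.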

Step 4: Investigate Root Cause

Now that we’ve reproduced the problem, it’s time to analyze the data we’ve collected. Look for patterns, anomalies, and potential causes.

Identifying the root cause is often the most challenging part of troubleshooting. We can follow an isolate-investigate-validate pattern to work our way to the root cause.

In our scenario, the logs showed that the /metrics endpoint was timing out.

log1

We then confined the investigation to the /metrics endpoint. It turned out that a large number of historical log entries pointed to this endpoint.

log2
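
One way to carry out this kind of isolation is to count the failing log entries per endpoint. The sketch below assumes the logs have already been parsed into structured entries; the LogEntry shape, the countByEndpoint helper, and the sample data are all made up for illustration, and real log formats will differ.

```typescript
// Hypothetical structured log entry; real log formats will differ.
interface LogEntry {
    endpoint: string;
    message: string;
}

// Counts, per endpoint, the entries whose message contains the given text.
function countByEndpoint(entries: LogEntry[], needle: string): Map<string, number> {
    const counts = new Map<string, number>();
    for (const entry of entries) {
        if (entry.message.includes(needle)) {
            counts.set(entry.endpoint, (counts.get(entry.endpoint) ?? 0) + 1);
        }
    }
    return counts;
}

// Made-up sample data for illustration only.
const entries: LogEntry[] = [
    { endpoint: '/metrics', message: 'db connection failed to acquire' },
    { endpoint: '/health', message: 'ok' },
    { endpoint: '/metrics', message: 'db connection failed to acquire' },
    { endpoint: '/orders', message: 'db connection failed to acquire' },
    { endpoint: '/metrics', message: 'db connection failed to acquire' },
];

const counts = countByEndpoint(entries, 'failed to acquire');
counts.forEach((count, endpoint) => console.log(`${endpoint}: ${count}`));
```

An endpoint that dominates the counts is a reasonable place to look first, although it may only be the victim of a leak elsewhere, so the result still has to be validated against the code.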

Digging into more logs for the /metrics endpoint, we found that the connection acquired by retriveErrorLocalizedStrings had never been released.

log3

Looking at the corresponding code, we noticed that retriveErrorLocalizedStrings occasionally acquired a new connection for a new task but never released it.

Step 5: Fix Problem

Once we’ve identified the root cause, it’s time to fix the problem. This might involve writing code to address the issue, reconfiguring system settings, or applying a patch.

Step 6: Validate Solution

After implementing the fix, it’s essential to validate the solution. Test the application again to ensure the problem no longer occurs.

Step 7: Recap and Protection

Troubleshooting shouldn’t end when the issue is resolved. Take the time to recap what we’ve learned from the experience. Document the problem, the root cause, and the solution we applied. Consider implementing protective measures to prevent similar issues in the future.
In this example, we fixed the problem and enhanced connection lifecycle management with executeWithNewConnection:

export async function executeWithNewConnection<T>(
    connection: Connection,
    task: Task<T>,
    taskName?: string
): Promise<T> {
    const localTaskName = getTaskName(taskName)
    const newConnection = createNewConnection(connection)
    return run(newConnection, task, localTaskName)
}

async function run<T>(newConnection: Connection, task: Task<T>, taskName: string): Promise<T> {
    try {
        return await task(newConnection)
    } finally {
        // Always finish the connection, whether the task resolved or rejected
        await newConnection.finish()
    }
}

Whether the task succeeds or fails, we finish the new connection. This means the lifecycle of the newly created connection is fully controlled.
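
To check this guarantee in isolation, here is a minimal sketch with a fake connection. FakeConnection, runAndFinish, and demo are hypothetical names for this example only; runAndFinish mirrors the shape of the run helper above but does not use the real Connection type.

```typescript
// Hypothetical stand-in for the real Connection type; it only records
// whether finish() was called.
class FakeConnection {
    finished = false;

    async finish(): Promise<void> {
        this.finished = true;
    }
}

type Task<T> = (con: FakeConnection) => Promise<T>;

// Same shape as the run() helper above: the connection is finished in
// `finally`, so it is released whether the task resolves or rejects.
async function runAndFinish<T>(con: FakeConnection, task: Task<T>): Promise<T> {
    try {
        return await task(con);
    } finally {
        await con.finish();
    }
}

async function demo(): Promise<{ okFinished: boolean; failedFinished: boolean; errorSeen: boolean }> {
    // Successful task: the connection should be finished afterwards.
    const okCon = new FakeConnection();
    await runAndFinish(okCon, async () => 'done');

    // Failing task: the connection should still be finished,
    // and the task's error should still reach the caller.
    const failCon = new FakeConnection();
    let errorSeen = false;
    try {
        await runAndFinish(failCon, async () => {
            throw new Error('task failed');
        });
    } catch {
        errorSeen = true;
    }

    return { okFinished: okCon.finished, failedFinished: failCon.finished, errorSeen };
}

demo().then((result) => console.log(result));
```

One caveat of this pattern: if finish() itself throws inside the finally block, that error replaces the task's original error, so a real implementation may want to log or wrap it.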

Step 8: Knowledge Transfer

The final step is to transfer knowledge to our team and other relevant stakeholders. Share our experience and the steps we took to resolve the issue. This not only fosters a culture of continuous improvement but also empowers our team to handle similar problems in the future.