Howdy

Factors to Overcome an Error in Production

David Obregón teaches us, from his experience as a seasoned developer, how to overcome errors in production without dying in the attempt.

Published 2025-03-13
David Obregón
Senior Software Engineer

Greetings, Tech Enthusiasts! I'm David, a seasoned software engineer with over 9 years of hands-on experience, typically the guardian of flawless code. However, I have a recent confession to share: I unintentionally triggered a digital earthquake in our partner's production environment. Yep, chaos ensued, and it was my bad 🫠

Ever caused a ripple in the production pond? I found myself inadvertently starring in a real-life drama, causing a high-impact error for our partner (cue GitLab déjà vu 🤯).

Now, how do we bounce back from such a situation? Here are some crucial factors:

How to Overcome an Error in Production

Understanding that Mistakes Happen

The first and most important step is to accept that these things happen, whether you're an experienced developer or just starting out in this field. Surely, at some point, we will all face a similar situation (I'm not wishing any harm upon you). Don't blame or judge yourself; take responsibility with your head held high and contribute to the solution.

Shared Responsibility

An error is not the responsibility of a single person. Even when one person seems "responsible", the entire team is accountable. On my team, code reviews were conducted, validations were performed, tests were run, and the error still occurred. Building a blameless culture matters because it helps the team move past the incident and see that an error usually points to other things that may not be right. Handled that way, a serious incident can bring more good than bad.

Familiarize Yourself with Company Processes

Find out how your company handles production incidents so that you can act appropriately. A good process should include proactive monitoring that alerts on potential issues before they significantly affect the end user, effective logs and tracing, a rapid response team, and good documentation, among other aspects. If your company doesn't have a process, this can be a great opportunity for you to propose one.
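
To make "proactive monitoring" a bit more concrete, here is a minimal sketch of an alert that fires when too many recent requests fail. The class name `ErrorRateMonitor`, the window size, and the threshold are all assumptions made up for this illustration; a real setup would most likely rely on whatever monitoring tooling your company already runs.

```python
from collections import deque


class ErrorRateMonitor:
    """Tracks the outcome of the most recent requests and flags a high error rate."""

    def __init__(self, window_size: int = 100, threshold: float = 0.05) -> None:
        # window_size and threshold are illustrative values, not recommendations.
        self.threshold = threshold
        # Only the last `window_size` outcomes are kept; True means the request failed.
        self.outcomes = deque(maxlen=window_size)

    def record(self, failed: bool) -> None:
        self.outcomes.append(failed)

    def error_rate(self) -> float:
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)

    def should_alert(self) -> bool:
        # Alert only when the rate is strictly above the threshold.
        return self.error_rate() > self.threshold


monitor = ErrorRateMonitor(window_size=10, threshold=0.2)
for failed in [False, False, True, False, True, True]:
    monitor.record(failed)

if monitor.should_alert():
    print(f"ALERT: error rate is {monitor.error_rate():.0%}")
```

The sliding window is a deliberate choice here: it makes the alert react to what is happening right now instead of to the average since the last deploy.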

Conduct a Post-Mortem

There are many post-mortem formats you can use. My recommendation is to focus on determining the root cause without blaming anyone, include a section for lessons learned and for things that can be done to avoid such incidents in the future, and assign follow-up tasks to owners who will work on the agreed-upon improvements.
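
If it helps to see those sections side by side, here is one possible way to sketch the structure in code. This is only an illustration under my own assumptions: the names `PostMortem` and `FollowUpTask`, the fields, and the placeholder text are hypothetical, and your team's template may look quite different.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class FollowUpTask:
    description: str
    owner: str  # a team or role that drives the task, not someone to blame
    done: bool = False


@dataclass
class PostMortem:
    title: str
    root_cause: str  # describe what failed in the system, not who failed
    lessons_learned: list[str] = field(default_factory=list)
    follow_ups: list[FollowUpTask] = field(default_factory=list)

    def open_tasks(self) -> list[FollowUpTask]:
        """Return the follow-up tasks that still need work."""
        return [task for task in self.follow_ups if not task.done]

    def render(self) -> str:
        """Build a plain-text report with one section per part of the post-mortem."""
        lines = [self.title, "", "Root cause:", f"  {self.root_cause}", "", "Lessons learned:"]
        lines += [f"  - {lesson}" for lesson in self.lessons_learned]
        lines += ["", "Follow-up tasks:"]
        for task in self.follow_ups:
            mark = "x" if task.done else " "
            lines.append(f"  [{mark}] {task.description} (owner: {task.owner})")
        return "\n".join(lines)


# Placeholder content only; it does not describe a real incident.
report = PostMortem(
    title="Post-mortem: <incident name>",
    root_cause="<what in the system allowed the error to reach production>",
    lessons_learned=["<lesson 1>"],
    follow_ups=[FollowUpTask("<improvement 1>", owner="<team or role>")],
)
print(report.render())
```

Keeping `open_tasks()` around is a small nudge to revisit the document later, so the agreed-upon improvements don't quietly get forgotten.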

My manager shared something very wise with me: think about how many good things you have done for this company and write them down. Whenever you feel bad about the incident, read that list and focus on the positive. Perhaps these kinds of situations can also help you discover whether you're in the right place with the right people. Ready to bounce back from blunders? Let's tackle it together.