Recovery is the process of restoring an application to a stable and functional state after a failure or incident. This involves identifying the root cause of the issue, implementing necessary fixes, and ensuring that the system is resilient enough to handle similar incidents in the future.
For example, if a web application experiences a database outage, the recovery process would involve restoring the database from backups and implementing measures to prevent similar outages in the future.
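The database example can be sketched in a few lines of code. This is a minimal illustration, not a real restore procedure: the `Database` class, `take_backup`, and `recover` are hypothetical names, and an in-memory dictionary stands in for an actual database and its backup storage.

```python
from dataclasses import dataclass, field


@dataclass
class Database:
    """Hypothetical in-memory stand-in for a real database."""
    rows: dict = field(default_factory=dict)
    healthy: bool = True


def take_backup(db: Database) -> dict:
    # Copy the rows so later changes to the database do not alter the backup.
    return dict(db.rows)


def recover(db: Database, backup: dict) -> bool:
    """Restore the database from a backup and confirm it is usable again."""
    if db.healthy:
        return True  # Nothing to recover.
    db.rows = dict(backup)    # Restore the last known good state.
    db.healthy = True         # Mark the service as available again.
    return db.rows == backup  # Verify the restored data matches the backup.


db = Database(rows={"user:1": "alice"})
backup = take_backup(db)

# Simulate an outage that loses data.
db.rows.clear()
db.healthy = False

recovered = recover(db, backup)
```

The verification step at the end of `recover` matters as much as the restore itself: recovery is only complete once the application is confirmed to be back in a stable, functional state.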
Explore related concepts
Incident
An incident is an unexpected disruption in the normal operation of an application. Incidents can range from complete service outages to performance degradation, and they often require immediate attention to restore normal functionality.
Reliability
Reliability is the ability of an application to consistently 'do what it says on the tin': to perform as expected even when conditions are not optimal. It involves minimizing the occurrence of failures and ensuring that the system can recover quickly when failures happen.
Resiliency
Resiliency is an application's ability to gracefully handle and recover from failures, ensuring minimal impact on overall functionality. Building resilient applications means designing applications and infrastructure that are fault-tolerant and respond well to unexpected issues.
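One common way to handle failures gracefully is to retry a failing call with a growing delay and fall back to a degraded answer if it keeps failing. The sketch below assumes this approach for illustration only; `call_with_retry`, `flaky_lookup`, and the cached fallback value are hypothetical names, not part of any specific library.

```python
import time


def call_with_retry(operation, fallback, attempts=3, base_delay=0.01):
    """Try `operation` up to `attempts` times, then return `fallback()`."""
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            # Wait longer after each failure, but skip the sleep after the last one.
            if attempt < attempts - 1:
                time.sleep(base_delay * (2 ** attempt))
    # Every attempt failed: degrade gracefully instead of raising.
    return fallback()


calls = {"count": 0}


def flaky_lookup():
    # Hypothetical dependency that fails twice before succeeding.
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "fresh result"


def always_down():
    # Hypothetical dependency that never recovers.
    raise ConnectionError("still down")


result = call_with_retry(flaky_lookup, fallback=lambda: "cached result")
degraded = call_with_retry(always_down, fallback=lambda: "cached result")
```

In the first call the transient failures are absorbed by the retries; in the second the caller still gets a usable, if stale, answer, which keeps the impact on overall functionality small.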