Herding Elephants
Herding Elephants describes how Heroku operates the largest fleet of PostgreSQL databases using a set of Ruby applications, with an emphasis on service-oriented architecture, infrastructure as code, and robust fault tolerance. Speaker Clint Shryock, a support engineer at Heroku, uses humor and personal anecdotes to connect with the audience while walking through the technical side of the team's approach to database management.

Key Points:

  • Introduction to Heroku and Its Postgres Team:

    • Clint clarifies his role at Heroku and distinguishes his team's responsibilities, noting that they are a small unit managing a vast infrastructure with thousands of PostgreSQL databases.
    • Explains the database-as-a-service model and the role of Heroku Postgres as a Heroku add-on, noting that it was an early entry in the add-on marketplace.
  • Evolution of Heroku Postgres:

    • Heroku Postgres began as a single, simple Sinatra application and has since grown into a constellation of applications that together manage the fleet.
    • Describes a distributed architecture in which each application handles a specific task, so that operational responsibilities are clearly divided.
  • Monitoring and Managing Databases:

    • Stresses the importance of continuously monitoring the fleet's databases to spot issues early.
    • Describes an outside-in approach: workers gather information from outside each database to assess its status, rather than relying solely on monitoring software installed on the database servers.
  • State Machines and Stateless Workers:

    • Clint outlines the use of state machines to manage complex behaviors and transitions among various states (e.g., up, down, uncertain) for server resources.
    • Discusses the efficiency of stateless workers, which execute tasks quickly without holding long-lived state of their own and can therefore recover rapidly from failures.
  • Incident Management:

    • Incident resolution protocols are designed so that issues are documented and addressed systematically.
    • Playbooks for common incidents promote knowledge sharing across team members, reducing reliance on individual expertise.
  • Handling Failures and Escalations:

    • If resolution efforts fail, escalation procedures bring in a human to resolve the more complex problems.
    • Stresses that failures are an inherent part of operating at scale and should be expected, and that the team should keep a positive attitude when they occur.
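
To make the state-machine and stateless-worker points above concrete, here is a minimal Ruby sketch. It is not Heroku's code: the transition table, `next_state`, `run_check`, the in-memory `store`, and the probe lambdas are all hypothetical names chosen for illustration. Only the three states (up, down, uncertain) come from the summary, and the rule that one failed probe moves a server to uncertain before it is marked down is an assumption.

```ruby
# Hypothetical sketch; names and transition rules are illustrative assumptions.

# Transition table: [current state, did the outside-in probe succeed?] => next state.
TRANSITIONS = {
  [:up,        true]  => :up,
  [:up,        false] => :uncertain,
  [:uncertain, true]  => :up,
  [:uncertain, false] => :down,
  [:down,      true]  => :uncertain,
  [:down,      false] => :down
}.freeze

# Pure function: the next state depends only on its two arguments.
def next_state(current, probe_ok)
  TRANSITIONS.fetch([current, probe_ok])
end

# One stateless worker step: read the recorded state, observe the database
# from the outside, write the new state back. The worker keeps nothing in
# memory between calls, so any worker can pick up any database.
def run_check(store, db_id, probe)
  current = store.fetch(db_id, :uncertain) # unknown databases start as uncertain
  updated = next_state(current, probe.call(db_id))
  store[db_id] = updated
  updated
end

store         = { "db-1" => :up }          # stand-in for a shared datastore
failing_probe = ->(_db_id) { false }       # stand-in for a real connectivity check
healthy_probe = ->(_db_id) { true }

run_check(store, "db-1", failing_probe) # => :uncertain
run_check(store, "db-1", failing_probe) # => :down
run_check(store, "db-1", healthy_probe) # => :uncertain
```

Because all state lives in the store rather than in the worker, a worker that crashes mid-run can simply be replaced; the next run picks up from the last recorded state.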

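The incident-handling flow described above (systematic documentation, shared playbooks, escalation to a human when resolution fails) could be sketched along the following lines. Everything here is an assumption made for illustration: the `Incident` struct, the `PLAYBOOKS` entries, the limit of three attempts, and the `remediate` and `page_human` callables do not come from the talk.

```ruby
# Hypothetical sketch; the playbook contents and retry limit are invented.

# Playbooks keyed by incident kind, so the steps live with the team
# rather than in one person's head.
PLAYBOOKS = {
  disk_full:   ["expand the volume", "confirm writes succeed"],
  unreachable: ["check the instance status", "fail over to the follower"]
}.freeze

MAX_ATTEMPTS = 3 # assumed limit before a human is paged

Incident = Struct.new(:kind, :db_id, :attempts, :log)

# Retry remediation up to MAX_ATTEMPTS, recording every step in the
# incident log; if nothing works, hand the incident and its playbook
# to a human.
def handle(incident, remediate:, page_human:)
  while incident.attempts < MAX_ATTEMPTS
    incident.attempts += 1
    if remediate.call(incident)
      incident.log << "resolved after #{incident.attempts} attempt(s)"
      return :resolved
    end
    incident.log << "attempt #{incident.attempts} failed"
  end

  page_human.call(incident, PLAYBOOKS.fetch(incident.kind, []))
  incident.log << "escalated to on-call"
  :escalated
end

incident = Incident.new(:disk_full, "db-1", 0, [])
outcome  = handle(
  incident,
  remediate:  ->(_incident) { false },                  # always fails in this demo
  page_human: ->(inc, steps) { puts "Paging for #{inc.db_id}: #{steps.join(', ')}" }
)
puts outcome            # prints "escalated"
puts incident.log.last  # prints "escalated to on-call"
```

Keeping the log on the incident itself means the person who gets paged sees what has already been tried alongside the playbook steps.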

Clint’s talk illustrates the significance of simplicity in design, effective monitoring, and error management in complex systems. He asserts that embracing the inevitability of failures while having a structured approach to handling them is crucial for success. This presentation serves as a valuable resource for engineering teams looking to improve database management processes while maintaining system reliability.

