rubyday 2014

Service Oriented Architecture for robust and scalable systems

by Ole Michaelis

In this presentation titled 'Service Oriented Architecture for Robust and Scalable Systems' by Ole Michaelis at RubyDay 2014, the focus is on the importance and practicalities of implementing Service Oriented Architecture (SOA) as businesses grow and their systems become more complex. The speaker shares his experience transitioning from a monolithic architecture to a distributed system, emphasizing how such architectures can lead to improved maintainability and scalability.

Key Points Discussed:
- Introduction to SOA: Michaelis introduces SOA as a design strategy that encapsulates data with the business logic, only allowing access through public service interfaces.
- Personal Journey: He reflects on his career, detailing his experiences at an incubator and Jimdo, illustrating the differences between small teams in startups and larger, more complex organizations.
- Complexity of Monolithic Systems: The challenges of managing large codebases are highlighted, including the slow pace of shipping features in an aged codebase and the necessity for refactoring to prevent entrapment in monolithic architecture.
- Benefits of Distributed Architecture: SOA helps in speeding up decision-making, improving team specialization, and facilitating quicker deployments while also allowing independent scaling of services (e.g., payment services vs. template services).
- Designing for the Future: Michaelis advises that applications should be as stateless as possible to make scaling easier and discusses the importance of authentication, timeouts, and reducing dependencies between services.
- Patterns for Durability: The presentation covers design practices such as employing circuit breakers and back-pressure patterns to handle service failures gracefully.
- Integration Techniques: Possible communication protocols for services, such as Thrift and Protocol Buffers, are discussed as well as the importance of clear API design.
- Case Studies: Michaelis references notable companies like Amazon, Netflix, and Twitter to illustrate successful transitions to distributed architectures, noting the lessons learned along the way.
- Conclusion: The talk wraps up with the importance of maintaining team communication as architecture evolves, while also acknowledging the challenges of logging, monitoring, and handling failures in distributed systems.

Takeaways:
- SOA is a critical strategy for growing businesses, enabling them to manage complexity effectively.
- Each service should be designed with independence, minimal state dependencies, and a clear, cohesive interface.
- Implementing SOA requires attention to both technological and human aspects, fostering collaboration and accountability within teams.

00:00:03.600 Thank you very much for having me here, and thanks for staying. You had the option of two rooms, and you decided to follow me for the next 45 minutes. I hope you won't regret it.
00:00:10.000 Today, I'd like to talk about service-oriented architecture. But before we dive deep into the topic, let me introduce myself.
00:00:16.960 My name is Ole Michaelis, I'm from Hamburg, Germany, and I'm 26 years old. I consider myself a web nerd, user groupie, and sometimes a conference speaker. If you feel I'm talking too much, just tweet me at Codestars.
00:00:27.439 You can find my blog at post.u, and you might know me because I organized a conference called "SoCoded" in Hamburg.
00:00:36.000 Currently, I work with Jimdo. Jimdo is a website builder aimed at helping everyone create their own websites without requiring HTML, CSS, or JavaScript knowledge.
00:00:45.039 I am part of the template team and have been working with Jimdo for about a year and a half. I started on January 1st last year and have switched teams quite a bit. Now, as mentioned, I'm part of the template team where my job title is somewhat humorously called 'shipment'.
00:00:56.960 In my team, we are the first in my company implementing service-oriented architecture (SOA). We are trying to move services out of our monolithic application, and I want to share that story and what I learned along the way.
00:01:09.040 This talk is titled 'Service Oriented Architecture for Robust and Scalable Systems,' but I feel like this title is a bit misleading. While the content has mostly stayed the same, I want to relabel it to 'Distributed Architectures' instead. This is because people tend to have a certain picture in mind when they hear 'service-oriented architecture' versus 'microservices'.
00:01:16.080 I'd be open to discussions about this distinction, and a whole other talk could be dedicated to where the differences truly lie. From my point of view, it’s really about distributed architectures since both concepts apply.
00:01:27.440 Before joining Jimdo, I worked at an incubator in the middle of 2012. My job there was to bootstrap projects, and every three months, we would have a new project to initiate.
00:01:35.600 I was responsible for choosing the frameworks, programming languages, databases, and all that kind of stuff. Each new project was super exciting because I had a comprehensive vision of what I wanted to achieve. You have to focus on software architecture, right?
00:01:46.000 As a good software engineer, you have a picture in mind of how your software should look, the architecture, making plans, and selecting the right frameworks.
00:01:57.120 The first project we tackled was a mobile flea market application, where the idea was to create an app for people to sell their unused items. We had a solid plan to get started.
00:02:04.240 However, it was a young project, and back then, we used PHP. I learned my lessons during that project about writing quality software. You aim to write beautiful software following principles like hexagonal architecture and decoupling components.
00:02:19.120 But alongside this ideal, there’s significant business pressure to ship quickly, and that's a contentious topic. I had many discussions with my CTO about finding a middle ground.
00:02:34.079 The project ended up launching after six months instead of the planned three. That's just how software development works—so often, we plan three months but end up needing six.
00:02:41.439 After we had moved through various projects, the incubator ultimately went bankrupt, which is unusual because incubators typically have ample funds. It was a serious and complicated situation, and in the end, they had to let everyone go, including me.
00:02:57.840 However, as software engineers, we often land on our feet. With the abundance of job opportunities, it wasn’t too difficult to find another position.
00:03:06.480 I ended up at Jimdo, which was quite the change. I came from a two-person incubator where I made decisions quickly and decisively, to Jimdo's established environment.
00:03:13.040 When I joined, Jimdo had been around for nearly eight years. When I started, there were about 150 employees and 50 engineers.
00:03:20.000 The experience showed me what a grown code base looks like—very tightly coupled with a tremendous amount of complexity.
00:03:28.400 There were a million lines of code in one repository, with different databases and patterns scattered everywhere. Some sections were true spaghetti code, while others were just over-engineered.
00:03:35.200 For example, if I wanted to store a picture in the general code base, it affected 16 classes. On the other hand, there were classes with about 6000 methods crammed into a single file, making it a mess.
00:03:44.640 I found it amusing because just a month earlier, I had been part of a very small team where all decisions were made very quickly, and now I was facing a big code base where decisions were far more complicated.
00:03:55.600 I often thought about how the mobile flea market app would evolve over seven years, and while it may still look the same, it would depend greatly on business constraints.
00:04:02.880 But I think we can all agree that overly complex architecture is detrimental since it slows down progress.
00:04:09.839 At Jimdo, we have felt that shipping features in this aged system takes an unusually long time. For instance, it took us two years to develop the API layer necessary for our iOS app.
00:04:19.840 All the business logic was there; we just needed a new controller layer, and yet it took us two years.
00:04:26.000 So, we face two extremes: on one side, we have easy startups with simple code and innovative solutions, while on the other, we have this complex, grown-up company architecture.
00:04:34.760 Many of you might feel like you're on one side or the other; rarely do we find ourselves in between.
00:04:41.639 It’s a gradual process to move from a simple base to a more complex code base, and when you find yourself in a big monolithic system, it feels like you're stuck.
00:04:50.080 Thus, the importance of knowing when to change your architecture becomes vital before it’s too late.
00:05:00.240 People often ask me when is the right time to change, and honestly, I can’t answer that definitively. It depends on your project and your timeline.
00:05:09.760 For Jimdo, as a bootstrapped company without external funding, we have to think carefully about how we spend our resources.
00:05:19.440 Investing in refactoring often does not yield quick returns; it’s a long-term investment that becomes harder when you're bootstrapped.
00:05:28.480 To avoid being trapped in a monolithic architecture with overgrown code, it’s essential to distribute your architecture.
00:05:39.600 Even if you have a large product, aim to have small, independent code bases that communicate with each other.
00:05:46.639 That’s the essence of distributed architecture, SOA, or microservices.
00:05:53.760 So, what is a distributed architecture in terms of SOA? Here’s a quote from Werner Vogels, the CTO of Amazon: "Service orientation means encapsulating data with the business logic that operates on that data, with access only through published service interfaces."
00:06:01.919 For those of you hoping to learn what SOA is, this quote sums it up perfectly. But we have 40 more minutes to fill, so let's delve deeper.
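The quote can be illustrated with a minimal sketch. It assumes a hypothetical `TemplateService` whose `publish` and `render` methods are the only published operations; the names and the in-memory dictionary are illustrative assumptions, and in a real distributed setup these calls would go over the network rather than happen in-process.

```python
class TemplateService:
    """Hypothetical service: owns its data and exposes only published operations."""

    def __init__(self):
        # Private storage, standing in for the service's own database.
        # Callers are not supposed to reach into it directly.
        self._templates = {}

    def publish(self, name, html):
        """Published operation: validate the input, then store it."""
        if not name or not html.strip():
            raise ValueError("template needs a name and non-empty HTML")
        self._templates[name] = html

    def render(self, name, **values):
        """Published operation: the business logic lives next to the data."""
        return self._templates[name].format(**values)


service = TemplateService()
service.publish("greeting", "<h1>Hello {visitor}</h1>")
page = service.render("greeting", visitor="rubyday")
```

Because other code only ever calls `publish` and `render`, the storage behind them could change without any caller noticing.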
00:06:09.840 Jeff Bezos, known for his role at Amazon, once wrote to all tech employees in 2006.
00:06:16.480 He mandated that all teams expose their data and functionalities exclusively through service interfaces.
00:06:23.680 Overall, the only communication permitted is through service calls over the network, and teams must design their interfaces to be externalizable.
00:06:31.120 He closed the memo by stating that anyone who did not comply would be fired. That's a tough stance.
00:06:39.760 If I received such an email, I would probably quit. However, reflecting on it, back in the day, sticking with Amazon's philosophy was likely a wise choice.
00:06:48.800 Since 2006, Amazon has become the leader in cloud platforms and scalable systems. When we talk about distributed architecture, Amazon serves as a reference point.
00:06:58.639 This email marked a pivotal start for AWS, although the groundwork began in 2003.
00:07:07.360 The last quote I want to mention is Conway's Law, which states that organizations that design systems are constrained to produce designs that replicate the communication structures of those organizations.
00:07:15.040 In this context, how you communicate with your colleagues directly influences the architecture of your code.
00:07:24.000 This definitely applies to Jimdo's operation. Interestingly, we inverted Conway's Law when we molded the architecture toward how we wanted our code base to be.
00:07:31.440 Two years ago, the Jimdo development team decided to break into feature teams that would take charge of their own codebases.
00:07:39.760 Initially, we had one large development team managing a monolith, but then we split into eight feature teams. This led to the birth of our first services.
00:07:48.400 As teams, we want to maintain our independent codebases, which enhances simplicity.
00:07:55.440 There are various benefits to SOA. One of them is faster decision-making.
00:08:02.560 When you reduce the number of people involved in a codebase, you can make decisions more quickly. Have you ever tried to make a decision as a group of 50? It can take forever.
00:08:11.160 Even a team of three can struggle to agree on a new database or logging framework, let alone with dozens of differing opinions.
00:08:22.080 More importantly, splitting teams allows you to distribute responsibilities effectively.
00:08:28.960 For instance, I'm part of the template team, so I handle everything related to templates. I don’t have to deal with payment systems or APIs I don’t enjoy.
00:08:36.279 This specialization reduces complexity and facilitates easier onboarding, ultimately leading to faster development speeds and, consequently, happier customers.
00:08:43.520 Moreover, quicker deployments lead to more satisfied developers. Smaller codebases allow for faster continuous integrations and deployments.
00:08:50.640 In many ways, smaller codebases equate to developer happiness and bring about the startup feel, which some of us thrive on.
00:09:00.000 Working in smaller teams leads to improvements in scalability as well. Each component of the system can be scaled independently.
00:09:07.480 So, if the payment service becomes a bottleneck, I can scale the payment part without affecting the template service.
00:09:14.040 As we look ahead, you may be asking how to implement this in your own environment.
00:09:20.480 One essential lesson is to build up your platform by ensuring your applications remain as stateless as possible.
00:09:29.760 The services you write should primarily avoid state. While some state is necessary, it should be minimal.
00:09:36.560 If your architecture has to maintain state, it easily complicates scaling, leading to substantial difficulties.
00:09:43.280 A cautionary tale I want to share involves one of my coworkers at a conference in 2008 who was excited about SQLite.
00:09:52.160 His idea was to give every customer their own database file. However, this resulted in significant challenges.
00:10:01.440 If something went wrong with a customer’s file, it would lead to access restrictions for their server and make recovery very difficult.
00:10:10.080 After five years of that implementation, we finally managed to eliminate it, but it took us half a year to remove it.
00:10:18.000 When dealing with state, be very careful about where it is managed to ensure manageable scaling.
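As a rough sketch of what "stateless" can mean in practice, the example below keeps all data in one shared store and lets the workers hold nothing per customer. `SharedStore`, `EditWorker`, and the site identifiers are hypothetical names chosen for illustration; the dictionary stands in for whatever external database or cache a real system would use.

```python
class SharedStore:
    """Stand-in for an external store such as a database or cache (assumption)."""

    def __init__(self):
        self._edits = {}

    def append_edit(self, site_id, change):
        """Record a change for a site and return how many edits it now has."""
        self._edits.setdefault(site_id, []).append(change)
        return len(self._edits[site_id])


class EditWorker:
    """Stateless worker: holds a handle to the store, but no per-site data."""

    def __init__(self, store):
        self._store = store

    def handle(self, site_id, change):
        # Everything the worker needs arrives with the request or comes
        # from the shared store, so any worker can serve any request.
        return self._store.append_edit(site_id, change)


store = SharedStore()
worker_a = EditWorker(store)
worker_b = EditWorker(store)
first = worker_a.handle("site-1", "new header")
second = worker_b.handle("site-1", "new footer")
```

Since neither worker remembers anything between calls, adding a third worker or replacing one needs no data migration; the state lives in exactly one place.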
00:10:27.200 Moreover, build your confidence. When planning your services in a distributed architecture, determine the best means of authentication.
00:10:35.000 Do you need a VPN for security? Typically, I believe VPNs create more issues than they solve due to accessibility constraints.
00:10:43.320 When you authenticate over HTTP, always use HTTPS so that credentials are never sent in the clear.
00:10:51.440 It's crucial to design your architecture with independence and isolation in mind. A service shouldn’t depend too much on shared components.
00:11:01.920 It’s acceptable for services to connect with one another, but try to minimize reliance on shared components.
00:11:10.200 Creating reliability is also paramount. Aim for automation, which facilitates fast recovery. Offloading the management of hard drives to someone else can also save precious time.
00:11:19.440 Focus on your core business goals by adopting single responsibility principles. Each service should address a specific concern.
00:11:28.399 You should avoid creating a single manager class to handle multiple responsibilities.
00:11:36.160 Instead, each responsibility should be handled by its designated service, such as a user service managing user-specific tasks.
00:11:44.720 A crucial design pattern to mention is the circuit breaker.
00:11:51.279 When one service in your infrastructure struggles, it’s critical to avoid overwhelming it with requests, because a flood of requests prevents it from recovering.
00:11:59.200 Introducing a circuit breaker at the network boundary will help manage requests effectively.
00:12:07.200 Some implementations of this pattern include Netflix's Hystrix and a JavaScript version called Circuit Breaker.js.
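The circuit breaker idea can be sketched in a few lines. This is a simplified illustration, not how Hystrix or any other library implements it; the class name, the `max_failures` and `reset_after_s` parameters, and the injectable `clock` are all assumptions made for the example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses to call the struggling service."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                # Still cooling down: fail fast without touching the service.
                raise CircuitOpenError("circuit open, skipping call")
            # Cool-down elapsed: let one trial call through (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.max_failures:
                # Too many failures: (re)open the circuit from now.
                self._opened_at = self._clock()
            raise
        # A success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each remote call, for example `breaker.call(fetch_profile, user_id)` with a hypothetical `fetch_profile`, and treat `CircuitOpenError` as a signal to use a fallback instead of waiting on a service that is already down.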
00:12:15.120 Another important pattern is applying back-pressure to a struggling service. Don't unnecessarily propagate errors through the system.
00:12:23.040 You should take action at the service level where the issue arises, like discarding non-critical requests during system overload.
00:12:30.720 For instance, if you gather metrics on system performance, it may be acceptable to discard those metrics rather than overwhelming a service.
00:12:39.440 However, if it's a critical payment request, you must ensure proper handling rather than dropping it.
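One way to picture this is a bounded intake queue that sheds non-critical work when full but pushes back loudly on critical work. The sketch below is an assumption-laden illustration: `RequestIntake`, `OverloadedError`, and the `critical` flag are hypothetical names, and a real service would more likely answer with something like an HTTP 503 than raise a Python exception.

```python
import queue


class OverloadedError(Exception):
    """Signals the caller to back off and retry a critical request later."""


class RequestIntake:
    def __init__(self, capacity):
        # A bounded queue is what makes overload visible at all.
        self._queue = queue.Queue(maxsize=capacity)
        self.dropped = 0

    def submit(self, request, critical=False):
        """Return True if accepted, False if a non-critical request was shed."""
        try:
            self._queue.put_nowait(request)
            return True
        except queue.Full:
            if critical:
                # Never drop critical work silently: push back on the caller.
                raise OverloadedError("service overloaded, retry later")
            # Non-critical work (for example metrics) is simply discarded.
            self.dropped += 1
            return False
```

With a capacity of two, a third metrics sample is dropped and counted, while a payment request submitted at the same moment raises `OverloadedError` so the caller can retry or alert.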
00:12:48.000 An essential element in a distributed environment is to set reasonable timeouts.
00:12:56.640 One fantastic example highlights Travis CI, a continuous integration platform, where an excessive timeout caused significant delays within their queue.
00:13:05.120 When GitHub modified their API without warning, each call waited out a ten-minute timeout, causing the queue to overflow.
00:13:12.960 Always set reasonable timeouts, as they are crucial in not overwhelming your system with requests.
00:13:20.160 You might have heard that people can only remember about three things from a presentation. Well, one point I want you to remember is to always default to timeouts.
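In code, defaulting to timeouts can be as small as never making a network call without one. The sketch below uses Python's standard `urllib.request.urlopen`, which accepts a `timeout` argument in seconds; the `fetch` helper, its two-second default, and the injectable `opener` parameter are assumptions added for illustration and testing.

```python
import socket
import urllib.error
import urllib.request


def fetch(url, timeout_s=2.0, opener=urllib.request.urlopen):
    """Return the response body, or None if the call fails or takes too long."""
    try:
        # The timeout bounds both connecting and each blocking read.
        with opener(url, timeout=timeout_s) as response:
            return response.read()
    except (urllib.error.URLError, socket.timeout):
        # Give up quickly so a slow dependency cannot stall the caller.
        return None
```

The right value depends on the service being called; the point is that a number is chosen deliberately rather than inheriting a default of "wait indefinitely".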
00:13:29.360 Now, let’s discuss how you can integrate your services.
00:13:37.440 You can leverage high-level protocols such as Thrift, XML-RPC, or Protocol Buffers. Each has pros and cons depending on your specific use case.
00:13:44.240 If you’re developing internal services, Protobuf and Thrift may be beneficial, while external services often favor RESTful HTTP communication.
00:13:52.320 Message queues like RabbitMQ, and coordination services like ZooKeeper, are also reliable options for handling service interactions.
00:13:59.920 For example, one service can produce a job in the queue, while another retrieves the job from the queue.
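The producer and consumer roles can be sketched with Python's in-process `queue.Queue` standing in for a real broker such as RabbitMQ. This is only an analogy: the job fields, the `None` sentinel used to stop the consumer, and the single worker thread are assumptions for the example, and a real broker adds durability, acknowledgements, and network transport.

```python
import queue
import threading

jobs = queue.Queue()  # in-process stand-in for a message broker
results = []


def producer(count):
    """One service puts jobs on the queue and moves on."""
    for job_id in range(count):
        jobs.put({"job_id": job_id, "task": "render_thumbnail"})
    jobs.put(None)  # sentinel: tells the consumer to stop


def consumer():
    """Another service takes jobs off the queue at its own pace."""
    while True:
        job = jobs.get()
        if job is None:
            break
        results.append(job["job_id"])


worker = threading.Thread(target=consumer)
worker.start()
producer(3)
worker.join()
```

The producer never calls the consumer directly, so either side can be slow, restarted, or scaled without the other needing to know.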
00:14:07.568 I advocate for HTTP APIs as a straightforward interface for service communication. When implementing them, be mindful of headers such as Accept and Content-Type.
00:14:15.520 I've personally run into APIs that return unexpected content types, which usually points to implementation issues.
00:14:23.920 It’s crucial that your URIs point to resources and that the requested format is negotiated through headers, rather than through file extensions.
00:14:31.360 Information that belongs in headers should not be put in the payload; send it as headers, the way HTTP intends.
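A minimal sketch of header-based format selection follows. It is deliberately simplified: the `negotiate` and `respond` helpers and the two supported media types are assumptions, and quality values (`q=`) in the `Accept` header are ignored, which a production implementation would not do.

```python
import json

SUPPORTED = ("application/json", "text/plain")


def negotiate(accept_header):
    """Pick the first supported media type listed in Accept (q-values ignored)."""
    for part in accept_header.split(","):
        media_type = part.split(";")[0].strip()
        if media_type in SUPPORTED:
            return media_type
        if media_type == "*/*":
            return SUPPORTED[0]
    return None


def respond(resource, accept_header="*/*"):
    """Return (status, headers, body) for a single resource URI."""
    media_type = negotiate(accept_header)
    if media_type is None:
        # 406 Not Acceptable: none of the requested formats is available.
        return 406, {}, b""
    if media_type == "application/json":
        body = json.dumps(resource).encode("utf-8")
    else:
        body = ", ".join(f"{k}={v}" for k, v in resource.items()).encode("utf-8")
    # The format is announced in a header, not encoded in the URI.
    return 200, {"Content-Type": media_type}, body
```

The same URI, say a hypothetical `/templates/42`, then serves JSON or plain text depending on what the client asks for, and the response always states what it actually returned.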
00:14:39.840 Lastly, when integrating services, it’s essential to ensure they form one cohesive product.
00:14:49.280 In conclusion, it's imperative to consider both technology and human aspects in distributed architecture.
00:14:55.920 At Jimdo, splitting the team structure meant we didn’t just split the components. It also encouraged increased communication and collaboration.
00:15:05.360 Ensure that as you decentralize architecture, you maintain the essence of team communication to prevent unintentional redundancy.
00:15:14.240 Even as we address the merits of SOA, we must always be aware of its downsides, such as logging and monitoring challenges.
00:15:22.400 In a distributed architecture, coordinating logging becomes complex as different teams may adopt various logging infrastructures.
00:15:31.040 When debugging, navigating through different monitoring services can be quite cumbersome.
00:15:40.800 Another issue arises when failures occur. Poorly handled cascading failures can lead to chaotic situations.
00:15:48.720 As each service struggles, orchestrating recovery becomes a challenge—determining which service to spin up first can be a complicated task.
00:15:57.920 In distributed systems, failure is inevitable. The question is not whether it will happen but when it will.
00:16:06.080 Finally, responsibility in a distributed architecture can be a double-edged sword. Though accountability is important, it can manifest in implicit ways.
00:16:15.120 Each team must recognize that they are responsible for their provided services, which reinforces the importance of their role.
00:16:20.960 Having a resilient system means that even when a service fails, the repercussions should ideally be minimal for users.
00:16:28.640 Netflix often emphasizes the importance of resilient systems through their design—users aren’t usually aware if a specific service isn't performing.
00:16:37.440 To wrap up the discussion, I'd like to share some real-world examples of distributed architecture in action.
00:16:45.440 Yammer shared a diagram at a recent conference showcasing many services they implemented. It demonstrates their approach, although it can be hard to follow all the interdependencies.
00:16:53.440 One interesting aspect was a service called Vario, which had no dependencies at all—a noteworthy ambition.
00:17:00.960 Another example comes from Twitter. They transitioned from a monolithic design, which they affectionately called 'Monorail,' to a microservice-oriented structure.
00:17:08.000 After realizing their existing structure couldn’t scale, they managed to start breaking it down into smaller components.
00:17:17.680 It is essential to note that completely eliminating the old monolith is a monumental task that takes dedication.
00:17:27.120 We are currently embracing similar approaches at Jimdo, where our initial dynamic templating service is being split into smaller components.
00:17:34.960 As we drill deeper into our structured services, continuous communication remains vital as teams manage responsibilities.
00:17:42.640 In summary, while SOA has inherent challenges, it also lays the groundwork for scalable architectures that can provide happier developers and end-users.
00:17:51.360 Ultimately, each part of a service can be independently managed and scaled, which significantly enhances the overall efficiency of the system.
00:18:00.080 Thank you all for your time. And I would love to take any questions you may have!
00:18:06.720 I hope you remember the three key elements from this session: timeouts, HTTP, and distributed architecture!
00:18:13.680 If you have any questions, feel free to approach me throughout the day. Thank you!