Tech Debt Burndown Podcast Series 3 E4: Sarah Wells

Show Notes

Recording date: Nov 15, 2024

Download at Apple Podcasts, Spotify, iHeartRadio, Spreaker or wherever you get your podcasts.

Sarah Wells

Sarah Wells joins Nick and Chris to talk about the lessons learned from her decade at the Financial Times and the FT’s journey to implementing microservices, the same lessons that informed her book, ‘Enabling Microservice Success’. We touch on whether the purpose of microservices is to scale a system or to scale an engineering organisation. By breaking down monoliths into independent domains, teams can reduce cognitive complexity and release software hundreds of times a day. Sarah warns, however, that this shift introduces a “maintenance treadmill”, requiring robust automation for library upgrades, security patching, and cross-service governance to prevent a sprawl of unmanageable tech debt.

The conversation goes deep on human factors, and the role of organizational culture in technical success. Sarah emphasises the importance of a ‘generative culture’, as defined by Ron Westrum, where information is shared openly. By taking a ‘you build it, you run it’ approach supported by a strong platform team, organizations can balance individual team autonomy with the necessary guardrails for long-term operational health and efficiency.

Show notes (links)

Enabling Microservice Success

Generative organizational culture (DORA) [Westrum model]

Transcript (automated)

Nick: It’s the Tech Debt Burndown podcast. I’m Nick Selby.

Chris: And I’m Chris Swan, and we’re here today with Sarah Wells. Sarah, could you please introduce yourself?

Sarah: Hi. I’m an independent consultant and also I recently wrote a book about ‘Enabling Microservice Success’ based on my experience working at the Financial Times for over a decade, while we went from very monolithic architecture where we delivered releases in production 12 to 20 times a year, to being able to do hundreds of releases a day with a microservice architecture, with a very large number of microservices.

Nick: [00:00:30] Chris and I are both lucky enough to be able to work with you in different areas. We’ve done some consulting together, you and I, Sarah.

Chris: So I’ve worked with Sarah on numerous QCon events. Most recently, we’ve both been on the program committee for QCon London. And so we also get to sort of bump into each other at those events and chat about what’s happening in the tech world.

Nick: Yeah, I did want to kick off just to talk about the book. The book isn’t about microservices. Sarah, do you want to say what the book’s about?

Sarah: [00:01:00] So it’s in the context of microservices, but really when you start saying, how can you be successful with microservices? You’re talking about how to be successful with modern software development. How do you approach building something that you can maintain and run for a long period of time, and not have to restart and not spend a huge amount of your time doing maintenance? I think you could read it and find there’s lots of useful stuff there, even if you’ve got a fairly monolithic application. And I also make the argument that no one really has a monolith anymore. Everyone’s making calls over the network for some part of their system. It’s very rarely one single application. Even when the FT had a monolithic architecture, we had multiple applications and they spoke to each other. It’s just how distributed is it and how independently can you deploy change to different parts of it?

Chris: [00:02:00] I think that also touches on one of the things I obsess about a little sometimes, which is even if you’re not making a single network call, you’ve probably got a distributed system. If you’re using anything modern because of multi-core processors and multi-threaded applications. And so distributed is happening at a much finer grained level there as well. But sometimes we’re just needing to start introducing that network call when it’s kind of necessary from a scaling the system point of view.

Sarah: Yeah. And also quite often you have a monolithic application, but you have a database and maybe that’s not on the same machine. And so you’re already having to deal with the network at that point. Or you’ve got two instances.

Chris: [00:02:30] I think in some ways, the title of the book frames things in the context of microservices being a recognized good thing that you would want to achieve. And some of this conversation was cued up by the fact that you’ve been speaking to people where it’s, “Oh, we’re not ready for microservices,” or “We don’t want to do that yet.” Where does that take you and where does tech debt come into this picture?

Sarah: [00:03:00] Interestingly, in both my book and Sam Newman’s book on microservices, there’s a chapter quite early on that covers all of the cases where you shouldn’t be thinking about doing this, where this would be a mistake, because there are lots of situations where a microservice totally doesn’t make sense. They’re usually a solution to an organizational scaling problem. They’re normally something that you feel the need of when you have lots of people in multiple teams trying to make changes to the same codebase. If you can split those out into separate domains and have teams work on an individual domain, you are able to make progress more quickly. If you’re a small company or a startup and you have one team, there probably aren’t as many reasons why you’d want to do microservices. You might wait until you’re feeling that pain. You also don’t really necessarily understand the domain that you have when you’re first starting to work on it, so you’re unlikely to find those right boundaries, the divisions between services, which means you might find that they actually aren’t as independent as you think. There are a lot of reasons to be cautious before you adopt microservices. You have to be able to see how they’re going to help you. And that is normally because you are slowing down in being able to build new features, because you’re constantly waiting for some other team to finish using the shared integration testing environment, or you’re having to hand something over to somebody else to do something. And whenever you do that, it slows things down.

Chris: You touch upon, for me, the most important misunderstanding of microservices, which is I encounter far too many people who seem to think that the purpose is to scale your system versus the purpose is to scale your engineering organization. I know at one point Adrian Cockcroft, who some see as the father of microservices, will step up and wave his hands and draw attention to this point, but it also feels that that is oft ignored.

Sarah: [00:04:30] I think the problem is when you are a really massive company, microservices do help you with scaling your system. But for a lot of people, for something like the Financial Times, we’re not dealing with huge amounts of data. At the Financial Times, we have multiple instances of services for reliability reasons, not because you can’t fit all of the content the FT ever published on your laptop. I mean, you can; that’s not why you would scale. There is another reason for using microservices: you have scaling issues and you want to scale some part of your system in a different way than others. But that’s not the case for most people. The reason that you would do it really is around being able to reduce the complexity of what you need to understand. You can work on a microservice, a part of the domain in a microservice architecture, and not have to worry about what’s going on elsewhere. It’s a bit scary at first because you’re thinking there’s this whole context for the systems in our organization, and I have no idea what’s happening over there. Well, that’s fine, because there’s very little chance that they’re going to do something that’s going to impact your ability to keep working on your stuff.

Nick: [00:06:00] In the pre-chat, you had said maybe the title’s not good. I disagree; I think it’s actually a great title. I just think that people don’t take the time to read the title, which is a bit of a problem, right? But ‘Enabling Microservice Success’ is different from enabling microservices. So part of the reason that we thought of talking to you today was, I’ve had the chance to observe your work. And what I like is that there’s a lot of similarities between the stuff that Chris and I do and talk about all the time, which is, all right, where are we now? What do we have that works? What do we have that doesn’t work? How do we know the difference? And it’s this sort of bottom-up approach with a top-down prioritization, but a bottom-up approach to see what we have. And in that, the concept of tech debt versus process debt versus just bad practice and other kinds of things, what do you intend to do? What’s your desired outcome, but how are you going about it? And looking at the way those things don’t quite mesh is something that we’ve done quite a bit of talking about. Where do you start? How would you describe what you do, and if you have a small or midsize company, where do you even start?

Sarah: [00:07:00] I’m always interested in how easy it is for someone to make a small change and release it to production. Because I saw the transformation. When you don’t have to remember what you were doing six weeks ago, at the point you wrote this code that’s been packaged with a load of other stuff and it’s all gone live, and now you’re trying to work out, well, what was I thinking? And which of these changes made a difference? So I’m normally thinking, if you can have a new idea and code it and release it without having to wait a long time, then you’re in a good place, whatever your architecture is. But if you haven’t got that, how could you get there? And is it the easiest way to do it, to start trying to take some part of your system and separate it out, maybe as a microservice? That’s an important aspect for me. The other thing that I think is really important is thinking, how are the decisions we’re making now going to impact us in 3 or 4 years? Because it’s very easy to make progress to start with. But one of the challenges with microservices is that you can feel like it’s a bit of a treadmill because you’re constantly upgrading things.

Sarah: [00:08:00] Part of that is just because you have more services. So let’s say that you need to upgrade a library that you depend on, maybe because there’s a security vulnerability. You don’t do it once; you do it a hundred times. You have to make it really automated and very quick to do, because a hundred times five minutes is a lot better than a hundred times two hours. When you’re in a monolith, you can afford to spend half a day to do that upgrade because you’re doing it once. But with microservices, you have to build things in that allow you to know where the thing is and fix it. There’s a whole range of things like that around maintenance. Microservices allow you to have different kinds of data stores, and that’s brilliant because it is a lot easier, for example, to build a system that manages metadata if you use a graph database underneath it, because metadata is a graph: people are linked to companies, and companies are linked to industries. There’s a whole graph there. It’s much harder to do that in a relational database. That made it much quicker for us to create new features for metadata when I was at the FT, but now you’ve got an additional database that has to be upgraded.
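
Illustrative sketch (not from the recording): the arithmetic Sarah describes, plus one way a team might find which services still pin a vulnerable library version. The `Service` record, the service names, and the library name `examplelib` are made-up assumptions, and the version parser only handles plain dotted numbers.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Service:
    """Hypothetical inventory record: a service and its pinned library versions."""
    name: str
    dependencies: dict = field(default_factory=dict)  # library name -> version string


def parse_version(version: str) -> tuple:
    # Assumes simple dotted numeric versions such as "1.4.0".
    return tuple(int(part) for part in version.split("."))


def services_needing_upgrade(services, library: str, fixed_version: str) -> list:
    """Names of services that pin `library` below the version containing the fix."""
    fixed = parse_version(fixed_version)
    return [
        s.name
        for s in services
        if library in s.dependencies
        and parse_version(s.dependencies[library]) < fixed
    ]


def total_effort_hours(service_count: int, minutes_per_service: float) -> float:
    """Total hands-on time when the same upgrade is repeated per service."""
    return service_count * minutes_per_service / 60


# 100 services at 5 minutes each is roughly 8.3 hours of work;
# 100 services at 2 hours each is 200 hours.
automated = total_effort_hours(100, 5)
manual = total_effort_hours(100, 120)
```

The inventory lookup is the "know where the thing is" half of the problem; the per-service time is what automation is meant to drive down.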

Sarah: [00:09:00] You have to be able to do backup and restore. And suddenly there’s a whole set of additional things that you have to do. But one of the things about the original idea with microservices was it’s amazing. You can completely choose a whole bunch of different languages, and you’ve got complete freedom to do stuff. And it’s fine because you build it in Clojure. And when you don’t have anyone that knows Clojure, you can rebuild it in a very short period of time. Well, first of all, it’s very hard to persuade non-tech people that actually the thing to do is to completely rebuild a system that’s already there, even when it’s microservices. They’re asking, “Why would you do that?” And the second thing is you just end up with something that’s really hard to support. You don’t want five different languages. You don’t want a huge number of things in complete freedom because you can’t easily move people between teams. You can’t move services between teams. You can’t build tools that can help support everyone because you now have to build them in five languages. There’s a definite balance. What do you have to do that keeps it manageable?

Chris: What did you typically see as the relationship between service boundaries and team boundaries?

Sarah: [00:10:00] Well, at the Financial Times, we had a large number of microservices because we were early adopters and we believed this idea of them being smaller and having a single responsibility. You can make that work. What you don’t want is to ever have a service that’s owned by more than one team. You could have each team own a microservice, and then if they find it makes sense to split that up into other services, that’s great. There might be reasons for them to do that, and they might come from some of the other things that work with microservices: the services are talking to different databases, or it just helps to be able to consider them separately. But never having more than one team working on a particular service is a really good way of trying to keep people able to make progress.

Chris: [00:10:30] Did you encounter situations where there were sort of cross-cutting concerns for new products or launches, which were forcing the organization to address multiple microservices all at once? And as a background to this, I was once talking to Ranney, who was at Uber at the time, and they’d been an adopter of microservices. They were pretty happy with how that worked for them, except, he said, “When we did Uber Eats and when we did Uber Truck, we had to touch everything all at once, and it felt as if we were teleported back to the bad old days of doing monolithic releases.”

Sarah: [00:11:00] Sometimes you’re making changes where you know you have to make changes in multiple services. That happened to me far more for the services that existed within a particular group within the FT. So maybe the content API and platform had multiple teams working on it. There was more likely to be something that affected different areas there. And we worked fairly closely together anyway. After we had learned this, we were really careful about the information that we sent between systems and about having services only look for the fields they needed. It was JSON over HTTP; services would ignore anything they didn’t understand. That made it easier to at least work out an order in which you were going to do things. You can add the new field and all the services that don’t understand it will ignore it. And you can then start updating those services that need to use that new field. That’s not as big as introducing Uber Eats, but at least it means you don’t have to release everything together, which is a nightmare. Occasionally you had something that crossed the boundaries between, say, the content publishing team and the web team, and that needed a bit more care and communication.
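
Illustrative sketch (not from the recording): the "ignore anything you don’t understand" approach is often called a tolerant reader. A minimal Python version follows, assuming hypothetical field names (`uuid`, `title`, `readingTimeMinutes`) rather than the FT’s actual content format.

```python
import json

# Hypothetical: the only fields this consumer cares about.
KNOWN_FIELDS = ("uuid", "title")


def read_article(payload: str) -> dict:
    """Parse a JSON message, keeping known fields and silently ignoring the rest."""
    data = json.loads(payload)
    return {name: data[name] for name in KNOWN_FIELDS if name in data}


# A producer adds a new field; this consumer has not been updated yet.
old_message = '{"uuid": "a1", "title": "Markets update"}'
new_message = '{"uuid": "a1", "title": "Markets update", "readingTimeMinutes": 4}'

# The un-upgraded consumer sees the same thing either way,
# so the producer can ship the new field first.
same_view = read_article(old_message) == read_article(new_message)
```

Because unknown fields are dropped rather than rejected, the producer can release first and each consumer that needs the new field can be updated afterwards, in whatever order suits its team.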

Sarah: [00:12:00] The biggest thing actually, I found with having to touch every microservice is when you’re doing big architectural changes. We wanted to move from, believe it or not, our own hand-created container orchestration system because we adopted containers really early. We wanted to move to Kubernetes. Well, that means moving every service to be running in a different way. And that was a challenge because we had six months to a year where we were maintaining two production environments while we migrated services into Kubernetes. And you’ve got to work out: how do you update? Do you update both? Do you deploy everything to two locations? How do you make sure that they’re working in both if only one of them is serving production traffic? That required a lot of careful planning and trial and error to try and work out: do we feel comfortable that we understand what’s happening here?

Chris: [00:13:00] You’ve touched upon a couple of things that I guess are governance topics. Earlier, you said it was scary being able to just get stuck in working on something within a service context and not have to understand the whole picture. But you’re also touching upon a few things here where somebody does have to understand the composition picture and maybe make choices about using these languages and frameworks and not just have a free-for-all. And so who does that? And how does the organization provide a context for that governance?

Sarah: [00:13:30] So in the end, I started off doing microservices. I was in a team that was building them and I ended up moving across to be that person, that team, the one that’s trying to provide that structure because it’s such an interesting challenge. And there are two things. The first thing is to say that normally you should use the standard ways of doing things. This idea of paving the road or the golden path, you say, well, okay, we will provide standard ways of doing things, and there has to be a good reason for you to opt out of that, because we know that there’s a cost to it, and there are certain things where you just end up saying, I’m sorry, but all of the logs need to go to our log aggregation system. You don’t get to decide to do logs differently, because it’s such a challenge to have some part of an event in a different logging tool. You say, here is some common stuff, and if you do go and do something yourself, here’s a set of our expectations. Maybe you call them guardrails, but they’re a checklist saying, if you are going to do this, if you’re going to build your own version of this, it does have to have security scanning in place.

Sarah: [00:14:30] You do have to keep it patched and upgraded. Here’s our patching policy. You have to make sure that the logs go somewhere. There have to be metrics. So you’re providing people with a guideline that you could do your own thing, but it has to meet quality standards. The other thing you start to do is think about how you have those conversations about introducing a new programming language. And we did it through a forum where if you were making a decision that had wider impact than your own team or small group of teams, you should come to the forum and make sure people are aware of it. It was an RFP process where you write a couple of pages to say, here’s the challenge we’ve got, here’s what we’ve considered, and here’s what we want to do. And the rule was you had to go and make sure you had the conversations with everyone to start to get through all of the disagreements before the meeting. So the meeting was very much that we’ve already got broad agreement and are doing the final bits, but it made people go and have that conversation.
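
Illustrative sketch (not from the recording): one way the guardrail checklist could be expressed so that it can be checked automatically. The four guardrail keys mirror the expectations Sarah lists (security scanning, patching, logs, metrics); the key names and the shape of the component description are assumptions.

```python
# Hypothetical guardrail keys, each with a human-readable expectation.
GUARDRAILS = {
    "security_scanning": "security scanning is in place",
    "patching_policy": "it is kept patched and upgraded under the patching policy",
    "central_logging": "logs go to the log aggregation system",
    "metrics": "metrics are published",
}


def unmet_guardrails(component: dict) -> list:
    """Return the expectations a self-built component does not yet meet.

    `component` maps guardrail keys to True/False; a missing key counts as unmet.
    """
    return [
        expectation
        for key, expectation in GUARDRAILS.items()
        if not component.get(key, False)
    ]


# A team that built its own tool and has only done part of the checklist.
custom_tool = {"security_scanning": True, "central_logging": True}
gaps = unmet_guardrails(custom_tool)
```

A team opting out of the paved road would aim for an empty list here before asking anyone else to support what it built.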

Sarah: [00:15:30] Sometimes people would not do that, and I’d find out that there’d been something introduced that had a massive impact. But generally you had the opportunity to say, now I understand why you’re saying we should introduce this new thing because you have some needs not met by the existing one, and we’ve balanced it up. As an owner of the platform engineering team, I always wanted to know: what’s your plan? Is this always going to be maintained by your team, or is it eventually going to be handed over to the platform engineering team? Because very often, particularly a couple of years in, where the people who made the decision have moved on and now they’ve got a team that are having to manage something they’re not very excited about, they just want to get the platform engineering team to take it over. I was trying really hard to say that it’s not a given that my team are going to accept whatever it is that you built. It may not be something we want to make as part of our common paved road.

Chris: And did that come about because of operational concerns? It sounds as if you’re starting with “you build it, you run it,” but that some of the “you run it” was being shared to a platform team, and there would be different expectations around the running of the platform.

Sarah: [00:16:30] There’s a couple of things that I want to unpack from that. I love “you build it, you run it.” I think you should run the thing that you build. If you make a different decision architecturally or how you build stuff, you know you’re the person that has to debug it and fix it at two in the morning. But I think there’s an element where you realize that some of the things that go wrong are actually underneath the application, and you do need a platform team to be supporting those sorts of things. There’s no point phoning a team to say their application is not available if it’s actually because there’s a problem with the CDN or there’s a problem with the Amazon region. I do think there is an element of you have to have support. It’s also really tough to expect every team to have knowledge and expertise in every area. If you do “you build it, you run it” and you don’t have anything central, you’re relying on every team understanding what’s needed for security and doing it. They’re doing that while balancing it against a push to deliver new features. You ought to be aware of how many security vulnerabilities are still in place in all of your software estate, because there’s probably a team that’s a bit overwhelmed that’s not quite got around to it. There is an operational side of things. I think as an organization there are organizational concerns that you really care about, where you ought to be able to have someone say, “Yeah, we’re doing okay cost-wise.” For example, you want to be efficient in your use of cloud resources, but each individual developer that’s spinning up a server to host the thing they’re building wants it not to run out of CPU or memory, so they’re going to overprovision. You’ve got these two tensions and somehow you have to work it out together and make sure that you rightsize everything.
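
Illustrative sketch (not from the recording): a simple way to surface the overprovisioning tension is to compare what each service has provisioned against its observed peak. The `Usage` record, the service names, and the 2x headroom threshold are assumptions chosen for the example, not a recommendation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Usage:
    """Hypothetical per-service figures, e.g. pulled from a metrics system."""
    service: str
    provisioned_cpu: float  # vCPUs requested
    peak_cpu: float         # highest observed vCPU use


def overprovisioned(usages, max_headroom: float = 2.0) -> list:
    """Names of services provisioned at more than `max_headroom` times their peak."""
    return [
        u.service
        for u in usages
        if u.provisioned_cpu > max_headroom * u.peak_cpu
    ]


fleet = [
    Usage("content-api", provisioned_cpu=4.0, peak_cpu=1.0),  # 4x its peak
    Usage("web", provisioned_cpu=2.0, peak_cpu=1.5),          # within headroom
]
candidates = overprovisioned(fleet)
```

A report like this gives the platform team and the owning team something concrete to discuss, rather than either side resizing things unilaterally.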

Nick: [00:18:30] In the past four months, I think that if there’s one single thing I’ve heard you say the most, it’s that sentence about “if you build it, you run it.” I’m really impressed. Your primary concern in every meeting that I’ve been in with you has been around the human factor much more so. I think it’s been surprising to some of the engineers that that is your first concern—the organizational constraints. Can you talk about that? It’s what Chris was asking, but it’s also a little different because nothing works unless the humans do. And you’ve been really good at that.

Sarah: [00:19:00] You have to get the interactions and the culture right and build those relationships within teams, because of the number of times where there’s a problem and you realize it’s because the two teams just don’t have any natural way to talk to each other; they haven’t quite realized that both people are doing the right thing for their constraints and their context. Having that understanding is really important. We ran an internal tech conference at the FT and it really helped to have everybody talking about what they were doing. The feedback after the very first one we did was, “Oh, I didn’t understand all the legacy that that team was dealing with, and that’s why they couldn’t move as quickly as we could.” I think that culturally, there’s a huge aspect of organizational culture: if you’ve not got an open learning culture and you don’t feel that people can talk about what’s really happening, it’s much harder to get things done. I really like the idea of generative culture that Ron Westrum came up with. He said there are three types of culture: bureaucratic, pathological, and generative. Bureaucratic is, “I could do this, but I have to ask my boss to ask your boss to ask you.”

Sarah: [00:20:00] Pathological is all about the power; no one really trusts each other. Generative is actually very open and it’s focused on making sure information gets to the right people at the right time. There’s a lot of broadcast of information so you have the information you need to make decisions. I used to run the operational team at the Financial Times, and the single thing that makes you most likely to be able to keep the systems up and running is that everyone feels very able to say, “I think there’s a problem, and it might be related to the change that I just made.” As soon as those people know that, everyone’s reaction is going to be, “Let’s get on a call, let’s work out what’s going on and see how we can fix it.” Then you’re much more likely to be able to fix things more quickly. That blameless culture is really important, and it’s the first assessment that I have when I’m talking to an organization: is that something that exists or that could exist?

Nick: I don’t mean to have a leading question, but do you see at least a correlation, if not causality, between an organization’s ability to track accurately what people are doing—the metrics of who does what—with that cultural imperative to point fingers, or the lack of it? You probably know where I’m going, but I’m just trying to understand.

Sarah: [00:21:00] The culture that wants to point fingers will spend a lot of time building elaborate plans. I think at some point, I realized that on my team’s backlog of work, everything beyond the first four or five items was never going to be done. Everything changes, priorities change, and you change what you’re doing. You’d have spent quite a lot of time grooming the backlog and making decisions about all of this stuff, and I just moved to the point where I just want to know what the next thing up is, and we’ll talk about that one in detail. It takes a particular organizational culture to trust that that’s okay and that you’re creating a roadmap. But honestly, you shouldn’t turn around to me in six months’ time and say, “The thing you had on your roadmap for October is not what you did,” because you should expect that a roadmap is a direction of travel right now, but not a commitment that this is how long it’s going to take. I tend to feel that organizations that are great culturally in terms of trying to always think of what we can learn and how we can improve tend to be more open to the idea that you can’t predict exactly how long it’s going to take to do something. You should do it in steps and iterate; all of the good things that I believe in come together. I think those cultures tend to go together.

Nick: It just flashed into my head that a roadmap is not a GPS. It’s not supposed to tell you to the second when you’re going to arrive.

Sarah: [00:22:30] And a roadmap is communication of our intentions at the moment. I think that’s great. It’s really useful for people to see that, but I would love it if you could do a roadmap where it just got fuzzier the further out it was. For the first couple of months, you can see the details, and by the time you get out to next quarter, you can almost read the words, but you’re not completely able to.

Chris: You’re touching upon how we do roadmapping at Atsign.

Sarah: Nice.

Chris: [00:23:00] I’m very intentional about this because I understand the pitfalls of being entrapped in ceremony. Part of what you’re describing here is organizations that let their own processes devour them, and the people end up becoming servants of the process rather than the process supporting the people doing their best work. We do a thing where, close in, we’ve got some very focused expectations of what’s going to happen. And the further out you get, not only are we less specific about that, but also we start bracketing the time in bigger chunks. You’ve got the next couple of sprints, and then the next couple of months, and then the next couple of quarters, and then the next two halves.

Sarah: [00:23:30] I would agree with that completely, because I’ve just seen too many situations where people honestly believe the dates. The whole point when we’re doing software development generally is there are very few things where it’s incredibly predictable. You wouldn’t be building it if it was completely predictable because it would probably already exist; it might be available as a commodity. Every time there’s a burndown chart, you don’t really know. The only time I love a burndown chart is for something like moving boxes out of a data center; it’s pretty predictable how long it will take.

Chris: [00:24:00] You’re touching on people believing the dates against all previous evidence as well. One of the worst cases I’ve dealt with was a gigantic “Death Star” project, which by the time I encountered it, was many quarters overdue and considered absolutely crucial to the functioning of the organization. So, it’s going to be ready in two weeks’ time, right? How many times have you been told that it’s going to be ready in two weeks’ time? Dozens already. And how many times have you been disappointed? Dozens already. So why do you believe it this time?

Sarah: [00:24:30] I think when I joined the FT, there was a project that was a six-month project already in its second year. I also remember having a conversation with a Scrum Master about a feature that took six weeks, having been estimated at four weeks. We were going to have a retro to discuss why, and me and the other developers were saying that only being 50% over is miraculous. So, honestly, we were not that bothered by it. But also, we only released code every four weeks. There is no way we could have released this after four weeks just because of that. We had a long conversation about precision versus accuracy as a result.

Chris: Tied up in ceremony.

Nick: [00:25:00] Yeah, we’re going to have to let this go at that. Thank you so much.

Sarah: I could talk to both of you for ages.

Nick: I know, and my mind is racing for seven or eight other things to say, but I have to leave it there. So this has been the Tech Debt Burndown podcast. I’m Nick Selby.

Chris: And I’m Chris Swan, thanks for joining us.

Sarah: Thank you.

Guests

Sarah Wells

Hosts

Chris Swan

Nick Selby