First it's best to clarify what I mean by "microservice". I go by Martin Fowler's definition (https://martinfowler.com/articles/microservices.html). A common misconception concerns how large a microservice may be. The architectural style does not mandate that services be a certain size. What matters is the bounded context: each service should have a clear role that overlaps as little as possible with the other services in the architecture. This may mean that one service has only a few hundred lines of code while another has ten thousand. The question of size is revisited from another perspective below.
Why microservices
- Developer velocity
- Independent releases
- Limited scope of work
- Stronger decoupling
- Requires well defined bounded context
- Scalability
- Independent scaling
- Independent storage
- Limited scope is easier to optimize
- Requires well defined bounded context
- Resilience
- Strong error boundaries
- Partial failures are possible
- Though this requires a system design that prepares for upstream failures
- Still a lot easier than in a monolithic application where there is high and nigh-unavoidable interdependence at multiple levels
- Secondary benefits
- Polyglot
- Easier to test at least in isolation
- Most important reason: enabler for organizational scaling
- When a successful organization grows, there are more developers, teams, products, visibility (with a small audience partial failures may go unnoticed, which is very different with millions of customers), liability, and responsibility.
- As the organization grows, products become interdependent and so do teams
- Incurs synchronization cost
- Four key metrics of high-performing organizations (https://www.thoughtworks.com/radar/techniques/four-key-metrics)
- Lead time
- Deployment frequency
- MTTR
- Change fail percentage
- They're all negatively affected by synchronization
- An example of synchronization cost: when an issue occurs, you first need to figure out who owns it and who should start troubleshooting
- So a high-level objective should be to minimize synchronization and maximize independence, and microservices are a great fit for this
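The resilience point above says that partial failures are only a benefit if the design prepares for upstream failures. A minimal sketch of one way to do that, assuming a caller that can degrade gracefully: retry a few times, then fall back to a default. The names `call_with_fallback` and `fetch_recommendations` are hypothetical and used only for illustration.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def call_with_fallback(primary: Callable[[], T], fallback: T, retries: int = 2) -> T:
    """Call an upstream dependency; degrade to `fallback` if it stays unavailable."""
    for _ in range(retries + 1):
        try:
            return primary()
        except ConnectionError:
            continue  # upstream failed; retry until the attempts run out
    return fallback


def fetch_recommendations() -> List[str]:
    # Hypothetical upstream service, simulated here as being down.
    raise ConnectionError("recommendation service unreachable")


# The caller keeps working with an empty list instead of failing outright.
recommendations = call_with_fallback(fetch_recommendations, fallback=[])
print(recommendations)
```

A real system would likely add timeouts and a circuit breaker on top of this, but the principle is the same: the failure of one service is contained at the boundary instead of propagating to its callers.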
Lessons learned
- Small is good - it's much more efficient when developers, designers, etc. can focus on a small section of the overall architecture and reason about it without the details of the rest bleeding through
- Smaller interfaces
- Easier to reason about
- Easier to evolve
- Hard to keep small though
- Results in lower coupling which in turn supports the minimization of synchronization cost
- What makes microservice micro
- Not the amount of code but rather minimal API surface area
- Also the reason why you should never share the data source between two services - SQL API is massive
- Polyglot architecture is enabled by microservices
- Great promise of microservices, enables multiple tech stacks
- But incurs significant cost when the number of different stacks in a company is too high, by making it costly for people to switch teams or transfer service responsibility
- Number of languages even in many top tech firms is rather limited:
- Google 6
- Facebook 9
- Twitter 4
- Amazon 3
- Netflix ~3
- Spotify 4
- If you have more than one or two tech stacks for five thousand engineers you're doing it wrong
- There's a tension between the individualistic view of always selecting the best tool for the job at hand and keeping the overall architecture consistent
- Even organizations that initially allow total freedom to choose tools based on developer preference end up consolidating to a smaller number of stacks later
- A smaller number of stacks makes moving between teams easier and hiring easier
- For any regular organization 2-3 tech stacks is what you should shoot for
- Conway's Law is true - "Organizations which design systems are constrained to produce designs which are copies of the communication structures of these organizations"
- Means that you want to structure the teams in a way that supports the architecture you want and vice versa
- Common danger signs:
- One data store owned by multiple services
- Single system or domain or responsibility owned by multiple teams
- Single responsibility principle SRP - if you're constantly violating it
- Reorganize your services or your teams
- Operations matter
- Developer velocity = ship fast and strong
- No bottlenecks allowed:
- No global release
- Instead many small rapid iterations
- No centralized ops
- Distributed ownership, devops
- No manual deployment
- Fully automated deployment
- No static topology
- Ephemeral, autoscaled, serverless
- Automation is key - everything as code
- Infrastructure as code
- Automated deployment and CI/CD pipelines as code
- Automated monitoring, metrics, alerts
- Remediation i.e. automated rollbacks, A/B testing, canary releases, etc.
- Provisioning
- Automation empowers your teams
- Tools are still developing but at least we know what to shoot for
- All of this supports minimization of synchronization cost
Criticism
- One often raised (and to a degree valid) criticism of microservices is that by making the services smaller we just move some of the complexity from the code level to the architectural level
- This is indeed a problem when the architectural level does not provide better tooling to address the complexity
- This is why I wouldn't really recommend a full microservice architecture before going full-on Kubernetes, since I see K8S as a huge jump forward in the tooling to manage this complexity (better than is often practical at code level for monoliths)
What we haven't learned
- Insufficient knowledge of the "physics of distributed systems", e.g. the CAP theorem
- We're all building distributed systems and concurrency is key
- Disregarding CAP means
- You will end up with inconsistent data
- You will not scale
- You will lose data
- Observability
- Distributed systems are by nature
- Disaggregated
- Hard to reason about
- We are still deficient in tooling and methodology
- Tooling is getting better
- Tracing (Jaeger, Zipkin)
- Metrics (Grafana, Prometheus)
- Log aggregation (ELK et al)
- But that's not enough when you can't debug in isolation
- Have to know
- What to log
- What to count
- What to monitor
- How to make sense of it all
- There are no easy answers
- E.g. we're still counting average response time
- Why average response time is a bad metric was summarized nicely in a comment to the video: "Response time usually has a long tail distribution, and the mean/average value of that does not really tell too much. It can be high because there were some very long requests or it can be high because all requests started to take longer. Instead you can use percentiles which tells you what portion of requests are faster than their value. For instance, P50 (median) tells that every second request is faster than its value and every other second request is slower than its value. For samples with normal distribution median is the same as mean/average. Of course you can split the distribution at any arbitrary point (e.g. P75, P90, P99). The downside of percentiles is that you require all the samples at one place to compute them, although there are algorithms (e.g. TDigest) that can give you good estimates in distributed environments."
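To make the point about averages concrete, here is a small sketch comparing the mean with percentiles. The two sample sets are invented for illustration, and `percentile` is a simple nearest-rank helper written for this example, not a function from any monitoring tool.

```python
import math
import statistics
from typing import Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Invented response times in milliseconds.
long_tail = [100] * 98 + [5000] * 2  # two very slow requests
all_slower = [198] * 100             # every request got slower

for name, samples in (("long_tail", long_tail), ("all_slower", all_slower)):
    print(
        name,
        "mean:", statistics.mean(samples),
        "P50:", percentile(samples, 50),
        "P99:", percentile(samples, 99),
    )
```

Both sets have a mean of 198 ms, so the average cannot tell them apart. The percentiles can: in the first set P50 is 100 ms and P99 is 5000 ms, while in the second both are 198 ms.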
Recap
- We aim to
- Minimize synchronization
- Maximize independence
- We're struggling with
- Safety
- Scalability
- Observability
- The goals are met and issues alleviated via event-driven architectures
- Events make everything simpler (easier to build, easier to reason about, easier to test)
- Modeling interactions with events
- Enforces strong context boundaries
- Lets you scale services independently
- Lets you observe the system in motion
- Increases system reliability
- Persisting event streams (event sourcing)
- Observability and auditing built in
- Lets you scale use cases independently (CQRS)
- Precludes full consistency (good thing)
- Outside of the narrow field of software engineering, virtually nothing in life is fully consistent, no business is fully consistent
- To get availability we have to very often give up some consistency
- E.g. a bank account's balance is derived from the full history of its transactions
- Modeling workflows in terms of events is enabled by the Saga pattern (for cross-domain consistency), a viable alternative to transactions in most cases
- Concentrate on principles - not implementation details. If you're building a microservice framework then you're probably wasting your time
- These tools have already been commoditized, little reason to roll your own
- Kubernetes, Kafka
- Invest in studying
- CAP tradeoffs
- Domain modelling
- Event storming
- Event sourcing / CQRS
- Sagas
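The bank account example from the recap can be sketched in a few lines. This is a minimal illustration of event sourcing, assuming an in-memory list as the event log; the `Transaction` type and the `balance` function are hypothetical names, and a real system would persist the stream (for instance in Kafka).

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Transaction:
    kind: str    # "deposit" or "withdrawal"
    amount: int  # minor units (e.g. cents) to avoid float rounding


def balance(events: Iterable[Transaction]) -> int:
    """Derive the current balance by replaying the event history."""
    total = 0
    for event in events:
        if event.kind == "deposit":
            total += event.amount
        elif event.kind == "withdrawal":
            total -= event.amount
        else:
            raise ValueError(f"unknown event kind: {event.kind}")
    return total


# The log is append-only; the balance is never stored, only computed.
log = [
    Transaction("deposit", 10_000),
    Transaction("withdrawal", 2_500),
    Transaction("deposit", 500),
]
print(balance(log))      # current balance
print(balance(log[:2]))  # balance as of the second event
```

Because the state is a function of the log, auditing comes for free: replaying a prefix of the events gives the balance at any earlier point in time.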
What changed? Why abandon the traditional ways?
- 20-25+ years ago hardware was expensive, virtualization was just taking its first baby-steps and software ran on bare metal which had to be manually provisioned and managed
- Servers were costly centralized resources
- It made total sense to optimize for shared resource usage and build monoliths
- This is the foundation that changed: shared resource optimization is no longer the driving factor, which allows us to manage and execute different components of the overall architecture in almost complete isolation from each other