Journey of a Contemplative Architect: Let's talk about Microservices

Microservices are a big topic nowadays, and for good reason. Being so widely discussed, they are also the subject of many misconceptions (and genuinely differing opinions). I started writing my own take but then found a conference talk that summarizes the current situation so well that this post ended up being mostly a summary of it, with some of my own commentary included. The talk is Tomer Gabel's "Microservices: A Retrospective" (https://www.youtube.com/watch?v=DLRfT44e8uQ).

First, it's best to clarify what I mean by the word "microservice": I go by Martin Fowler's definition (https://martinfowler.com/articles/microservices.html). There are often questions and misconceptions about how big a service can be and still count as a microservice. The architectural style does not mandate that services be a certain size. It's more about the bounded context: the ideal structure gives each service a clear role that overlaps as little as possible with the other services in the architecture. This may mean that one service has only a few hundred lines of code while another has ten thousand. This is tackled from another perspective below.

Why microservices

Developer velocity

Independent releases
Limited scope of work
Stronger decoupling
Requires well defined bounded context

Scalability

Independent scaling
Independent storage
Limited scope is easier to optimize
Requires well defined bounded context

Resilience

Strong error boundaries
Partial failures are possible

This does require a system design that prepares for upstream failures
Still a lot easier than in a monolithic application where there is high and nigh-unavoidable interdependence at multiple levels
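
As a minimal sketch of what "preparing for upstream failures" can look like in code, the example below wraps a call to an upstream service and degrades to a fallback value when that service is unavailable. The recommendation service and all function names are hypothetical, chosen only for illustration; real systems would typically add timeouts, retries and circuit breaking on top of this.

```python
from typing import Callable, List


def with_fallback(call: Callable[[], List[str]], fallback: List[str]) -> List[str]:
    """Run an upstream call; on failure, degrade to a fallback instead of failing."""
    try:
        return call()
    except (TimeoutError, ConnectionError):
        # The upstream service is down or slow: serve a degraded response.
        return fallback


def fetch_recommendations_ok() -> List[str]:
    # Hypothetical upstream call that succeeds.
    return ["book-42", "book-7"]


def fetch_recommendations_down() -> List[str]:
    # Hypothetical upstream call that fails, simulating an outage.
    raise TimeoutError("recommendation service did not respond")


healthy = with_fallback(fetch_recommendations_ok, fallback=[])
degraded = with_fallback(fetch_recommendations_down, fallback=[])
print(healthy, degraded)
```

The caller still gets a usable (if empty) answer during the outage, which is the partial failure the list above refers to.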

Secondary benefits

Polyglot
Easier to test at least in isolation

Most important reason: enabler for organizational scaling

When a successful organization grows, there will be more developers, teams, products, visibility (with a small audience partial failures may go unnoticed, which is very different with millions of customers), liability and responsibility.
As they grow, products become interdependent, and so do teams

Incurs synchronization cost

Four key metrics of high-performing organizations (https://www.thoughtworks.com/radar/techniques/four-key-metrics)

Lead time
Deployment frequency
MTTR (mean time to restore)
Change fail percentage
They're all negatively affected by synchronization
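
To make the four metrics concrete, here is a rough sketch of how they could be computed from a list of deployment records. The `Deployment` record, its field names and the sample data are assumptions made for this example, and MTTR is simplified to "time from the failing deployment to restoration"; real measurements would come from your CI/CD and incident tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import List, Optional


@dataclass
class Deployment:
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause a production failure?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def hours(delta: timedelta) -> float:
    return delta.total_seconds() / 3600


def four_key_metrics(deploys: List[Deployment], period_days: int) -> dict:
    failures = [d for d in deploys if d.failed]
    lead_times = [hours(d.deployed_at - d.committed_at) for d in deploys]
    restore_times = [hours(d.restored_at - d.deployed_at) for d in failures if d.restored_at]
    return {
        "lead_time_hours": mean(lead_times),
        "deploys_per_day": len(deploys) / period_days,
        "mttr_hours": mean(restore_times) if restore_times else 0.0,
        "change_fail_pct": 100 * len(failures) / len(deploys),
    }


# Hypothetical sample: four deployments over four days, one of which failed.
t0 = datetime(2020, 1, 6, 9, 0)
day, hour = timedelta(days=1), timedelta(hours=1)
deploys = [
    Deployment(t0, t0 + 2 * hour),
    Deployment(t0 + day, t0 + day + 4 * hour, failed=True,
               restored_at=t0 + day + 5 * hour),
    Deployment(t0 + 2 * day, t0 + 2 * day + 3 * hour),
    Deployment(t0 + 3 * day, t0 + 3 * day + 3 * hour),
]
metrics = four_key_metrics(deploys, period_days=4)
print(metrics)
```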

An example of synchronization cost: when an issue occurs, you first need to figure out who owns it and who should start the troubleshooting
So a high-level objective should be to minimize synchronization and maximize independence, and microservices are a great fit for this

Lessons learned

Small is good - it's much more efficient when developers, designers, etc. can focus on a small section of the overall architecture and reason about it without the details of the rest bleeding through
Smaller interfaces

Easier to reason about
Easier to evolve
Hard to keep small though
Results in lower coupling which in turn supports the minimization of synchronization cost

What makes microservice micro

Not the amount of code but rather minimal API surface area
Also the reason why you should never share the data source between two services - SQL API is massive

Polyglot architecture is enabled by microservices

Great promise of microservices, enables multiple tech stacks
But it incurs significant cost when the number of different stacks in a company is too high, since it becomes costly for people to switch teams or to transfer responsibility for a service
Number of languages even in many top tech firms is rather limited:

Google 6
Facebook 9
Twitter 4
Amazon 3
Netflix ~3
Spotify 4
If you have more than one or two tech stacks for five thousand engineers you're doing it wrong

There's a tension between the individualistic view of always selecting the best tool for the job at hand and keeping the overall architecture consistent
Even organizations that initially allow total freedom to choose tools based on developer preference end up consolidating to a smaller number of stacks later
A smaller number of stacks makes both moving between teams and hiring easier
For any regular organization 2-3 tech stacks is what you should shoot for

Conway's Law is true - "Organizations which design systems are constrained to produce designs which are copies of the communication structures of these organizations"

Means that you want to structure the teams in a way that supports the architecture you want and vice versa
Common danger signs:

One data store owned by multiple services
Single system or domain or responsibility owned by multiple teams

Single responsibility principle (SRP): if you're constantly violating it

Reorganize your services or your teams

Operations matter

Developer velocity = ship fast and strong
No bottlenecks allowed:

No global release

Instead many small rapid iterations

No centralized ops

Distributed ownership, devops

No manual deployment

Fully automated deployment

No static topology

Ephemeral, autoscaled, serverless

Automation is key - everything as code

Infrastructure as code
Automated deployment and CI/CD pipelines as code
Automated monitoring, metrics, alerts
Remediation, e.g. automated rollbacks, A/B testing, canary releases
Provisioning
Automation empowers your teams
Tools are still developing but at least we know what to shoot for

All of this supports minimization of synchronization cost

Criticism

One often raised (and to a degree valid) criticism for microservices is that by making the services smaller we end up just moving some of the complexity from the code level to the architectural level

This is indeed a problem when the higher level does not provide tooling to address the complexity better
This is why I wouldn't really recommend full microservice architecture prior to going full-on Kubernetes since I see K8S being a huge jump forward in the tooling to manage this complexity (better than is often practical at code level for monoliths)

What we haven't learned

Insufficient knowledge of the "physics of distributed systems", e.g. the CAP theorem

We're all building distributed systems and concurrency is key
Disregarding CAP means

You will end up with inconsistent data
You will not scale
You will lose data

Observability

Distributed systems are by nature

Disaggregate
Hard to reason about
We are still deficient in tooling and methodology

Tooling is getting better

Tracing (Jaeger, Zipkin)
Metrics (Grafana, Prometheus)
Log aggregation (ELK et al)

But that's not enough when you can't debug in isolation
Have to know

What to log
What to count
What to monitor
How to make sense of it all
There are no easy answers
E.g. we're still counting average response time

Why average response time is a bad metric was summarized nicely in a comment to the video: "Response time usually has a long tail distribution, and the mean/average value of that does not really tell too much. It can be high because there were some very long requests or it can be high because all requests started to take longer. Instead you can use percentiles which tell you what portion of requests are faster than their value. For instance, P50 (median) tells that every second request is faster than its value and every other second request is slower than its value. For samples with normal distribution median is the same as mean/average. Of course you can split the distribution at any arbitrary point (e.g. P75, P90, P99). The downside of percentiles is that you require all the samples in one place to compute them, although there are algorithms (e.g. TDigest) that can give you good estimates in distributed environments."
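
The point of the comment can be illustrated with a small sketch using made-up numbers: 98 requests that take 100 ms and 2 that take 10 seconds. The data is hypothetical and the percentiles are computed with Python's standard `statistics.quantiles`, which needs all samples in one place, exactly the limitation the comment mentions.

```python
from statistics import mean, quantiles

# Hypothetical response times in milliseconds: 98 fast requests and 2 very slow ones.
response_times_ms = [100] * 98 + [10_000] * 2

avg = mean(response_times_ms)

# quantiles(n=100) returns the 99 cut points P1..P99.
cuts = quantiles(response_times_ms, n=100, method="inclusive")
p50, p90, p99 = cuts[49], cuts[89], cuts[98]

print(f"mean={avg} ms, P50={p50} ms, P90={p90} ms, P99={p99} ms")
```

Here the mean comes out at 298 ms, a value no request actually experienced, while P50 and P90 show that the typical request takes 100 ms and P99 exposes the 10-second tail.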

Recap

We aim to

Minimize synchronization
Maximize independence

We're struggling with

Safety
Scalability
Observability

The goals are met and issues alleviated via event-driven architectures

Events make everything simpler (easier to build, easier to reason about, easier to test)
Modeling interactions with events

Enforces strong context boundaries
Lets you scale services independently
Lets you observe the system in motion
Increases system reliability

Persisting event streams (event sourcing)

Observability and auditing built in
Lets you scale use cases independently (CQRS)
Precludes full consistency (good thing)

Outside of the narrow field of software engineering, virtually nothing in life is fully consistent, no business is fully consistent
To get availability we have to very often give up some consistency

E.g. a bank account's balance is derived from the series of all transactions in its history
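
The bank account example can be sketched in a few lines. This is only an illustration of the idea under simplifying assumptions (an in-memory list as the event log, a hypothetical `Transaction` event type); a real event-sourced system would persist the log in something like Kafka or a database and usually keep snapshots.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Transaction:
    """An immutable event: positive amounts are deposits, negative are withdrawals."""
    amount_cents: int
    description: str


def balance(events: Iterable[Transaction]) -> int:
    """Derive the current balance by folding over the event history."""
    return sum(e.amount_cents for e in events)


# The event log is append-only; the balance is never stored, only derived.
log: List[Transaction] = []
log.append(Transaction(10_000, "salary"))
log.append(Transaction(-2_500, "groceries"))
log.append(Transaction(-1_000, "coffee subscription"))

current = balance(log)
# Replaying a prefix of the log gives the balance at an earlier point in time,
# which is why auditing comes for free.
balance_after_first_two = balance(log[:2])
print(current, balance_after_first_two)
```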

Modeling workflows in terms of events is enabled by the Saga pattern (for cross-domain consistency) - a viable alternative to transactions in most cases
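
A rough sketch of the core idea behind a saga: each step has a compensating action, and when a step fails the already completed steps are undone in reverse order instead of relying on one big transaction. The order workflow and function names are hypothetical, and everything runs in a single process here for brevity; in a real system each step would be a local transaction in a separate service, coordinated through events or an orchestrator.

```python
from typing import Callable, List, Tuple

Step = Tuple[Callable[[], None], Callable[[], None]]  # (action, compensation)


def run_saga(steps: List[Step]) -> bool:
    """Run steps in order; if one fails, undo the completed ones in reverse."""
    done: List[Callable[[], None]] = []
    for action, compensate in steps:
        try:
            action()
            done.append(compensate)
        except Exception:
            for undo in reversed(done):
                undo()
            return False
    return True


journal: List[str] = []


def reserve_stock() -> None:
    journal.append("stock reserved")


def release_stock() -> None:
    journal.append("stock released")


def charge_card() -> None:
    # Simulated failure in the second step of the workflow.
    raise RuntimeError("card declined")


def refund_card() -> None:
    journal.append("card refunded")


ok = run_saga([(reserve_stock, release_stock), (charge_card, refund_card)])
print(ok, journal)
```

Because the charge never succeeded, only the stock reservation is compensated; the system ends up consistent again without a distributed transaction.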

Concentrate on principles - not implementation details. If you're building a microservice framework then you're probably wasting your time

These tools have already been commoditized, little reason to roll your own
Kubernetes, Kafka

Invest in studying

CAP tradeoffs
Domain modelling
Event storming
Event sourcing / CQRS
Sagas

What changed? Why abandon the traditional ways?

20-25+ years ago hardware was expensive, virtualization was just taking its first baby-steps and software ran on bare metal which had to be manually provisioned and managed
Servers were costly centralized resources

It made total sense to optimize for shared resource usage and build monoliths

This is the foundation that changed - shared resource optimization is no longer the driving factor, allowing us to instead manage and execute different components of the overall architecture in almost complete isolation from each other

Saturday, 11 January 2020
