Journey of a Contemplative Architect: December 2019

Friday, 6 December 2019

Kubernetes - is it secure?

Having won the container orchestration war, Kubernetes is increasingly business critical, so it follows that it will increasingly be targeted by black hats. It is therefore very important for organizations using it, or contemplating using it, to know that it also measures up on the security side of things - including the core ecosystem forming around it in addition to the Kube core itself.

Here is an interesting talk on the subject titled "The Devil in the Details: Kubernetes’ First Security Assessment" by Jay Beale and Aaron Small. The focus is more on Kubernetes internal development but there are definitely some good points also for Kubernetes users who're trying to secure their clusters.

https://www.youtube.com/watch?v=1kaqHTcF3iQ

Highlights and interesting picks:

Kubernetes manages containers at 69% of organizations surveyed (2017)
Kubernetes auditing philosophy:

Open: Public RFP and selection process
Transparent: Public audit GitHub repository

https://github.com/trailofbits/audit-kubernetes

Frugal: specific focuses, allowing for a series of assessments
Future-focused: Threat model and Attackers Guide

An attacker on a cluster is trying to compromise and escalate privilege from

Outside of the cluster

This kind of attack is very rare and most likely requires severe misconfiguration or similar
Attacker sees:

Ingress services
Possibly API server
Less probably kubelets, etcd servers, ..

Inside a container whose program they've compromised
In a control plane element they've compromised
In a node they've escalated privilege on

Attacker inside a cluster:

Usually sees every pod, etcd servers, worker and master nodes, etc.
May have access to the cloud provider APIs depending on configuration
Has the opportunity to observe or PitM traffic (note HTTP flows and unverified endpoints)

The assessment team found Kubernetes configuration and deployment to be non-trivial, with deficiencies in default configuration settings (see https://youtu.be/1kaqHTcF3iQ?t=1688 for full quote)

Overall I get the view from this that CNCF (the foundation overseeing Kubernetes and associated open source projects) takes security seriously and that there's a realistic and active approach to securing it.

On a Kubernetes user level, obviously one of the top priorities in securing a Kube cluster is having a defense against escalation from inside a container to the node level (even though there are still more trust zones, defense in layers). One of the most important practices on this front is allowing only rootless containers (Red Hat gets points here for mandating a rootless model in OpenShift 4, which is supported by going with the default CRI-O + Buildah model).

It's also useful to note here that CIS (Center for Internet Security, https://www.cisecurity.org/) provides a Kubernetes benchmark.

Sunday, 1 December 2019

Engineering culture beyond scrum / SAFe / agile with capital A

Agile manifesto and Agiles with capital A

Agile manifesto is still as true today as it was when it was published: https://agilemanifesto.org/ (also note the 12 principles https://agilemanifesto.org/principles.html)

For years now, what is most often associated with the word 'agile' is not the manifesto though but rather specific implementations of agility with their own doctrines. You've heard of these and most likely experienced one or several: XP, scrum and the later organizationally scaled up versions SAFe, lean business etc.

One thing that has often bothered me about many of these methodologies is the dogmatism their practitioners often preach, which easily goes against the foundation on which the methodologies are built.

The Agile manifesto, and the core principles on which it and its implementations are based, stem from the recognition of human imperfection. We cannot make detailed plans for the future - especially in a context with significant unknown unknowns (i.e. things we don't even know that we don't know). Thus an iterative model is forced upon us whether we want it or not - and the more we align our planning with human realities the better we're likely to do.

This iterative and somewhat unpredictable nature of building software naturally goes against traditional annual budgetary planning and everything that has traditionally followed from that (I'm looking at you waterfall planning) and the most modern variants attempt to reconcile these two worlds at the scale of large enterprises (e.g. SAFe - scaled agile framework).

A framework is only a framework though and eventually all the details are up to individuals.

The principles on which the entire frameworks are built are so easily forgotten in the heat of immediate priorities and restrictions. Many of the principles from the manifesto are so much easier to say than to actualize.

One of the core factors that these frameworks and individual companies often struggle with and try to balance is the control between individual teams and centralized functions (applying to operations, architectural guidelines, budgeting, prioritization and many other factors).

Spotify's squad model

Spotify provides a different model, having noticed the constraints of scrum and moved to a new model of their own making.

https://www.youtube.com/watch?v=4GK1NDTWbkY

In the Spotify model teams are entirely autonomous, being given only a mission, architectural guidelines, product strategy and short-term goals from outside.

This model enables teams to make decisions locally without the overhead of centralization and minimize handoffs to enable higher scaling.

Following the team's mission is very important i.e. "be autonomous but don't suboptimize".

Goal is "loosely coupled, tightly aligned squads".

You might think alignment and autonomy are at odds with each other but they can be thought of as the two dimensions of a grid where the quadrants are:

Low alignment + low autonomy = micromanagement with no high level purpose, just shut up and follow orders
High alignment + low autonomy = leaders are good at communicating what needs to be solved but are also telling you how to solve it
High alignment + high autonomy = leaders focus on what needs to be solved but let teams figure out how to solve them
Low alignment + high autonomy = teams do whatever they want without supervision or direction

Strong alignment affords higher autonomy. This is an important thought - trust has to be earned. It's impossible (for a multitude of reasons) to jump from a culture of low alignment + low autonomy straight to the opposite corner of the grid. Even if that is the direction, the progress must be made in steps so that on the one hand the teams get the opportunity to show they're up to the challenge and on the other hand leadership can show that they can keep their hands off the micromanagement wheel enough to give teams sufficient room to grow.

In the squad model architectural guidelines and standardization don't work in the usual way - it's more of a cross-pollination leading to standardization model (i.e. if enough teams think it's a good idea to start doing it then at some point a critical mass is encountered and the rest of the teams should start following along the same lines as well).

People are something that cannot be over-emphasized in this model. While high quality recruitment is essential for any company, in a place shooting for autonomous teams it becomes paramount. Having a positive, constructive and highly skilled team is essential to building the necessary trust for your team mates and other teams to be able to deliver on their mission and promises. It's a combination of competence and attitude.

In the Spotify model the cross-cutting concerns' information sharing is accomplished via forming chapters and guilds which collect together people performing similar responsibilities from multiple teams in larger groups (i.e. chapters for "tribes" that are a collection of teams and guilds for the entire company). These groups then have bi-annual conferences, meetups and other communal organization methods for information exchange and idea pollination.

One interesting note was that "most organizational charts are an illusion so we focus on community rather than hierarchical structure". This very much reflects my personal experience - hierarchy can only capture one perspective of the organization while all the others are usually left up to people to figure out impromptu (SAFe and some other methodologies understand this problem and propose some solutions to it).

"We found that a strong enough community can get away with informal decision structure: if you need to know exactly who is making decisions, you are in the wrong place"

This overall model is an enabler for the practice promoted by most agile ideologies - that is small frequent releases. In the Spotify model this is accomplished by the true decentralization of decisions.

This model also doesn't mean that everyone does everything - there are still squads that focus on infrastructure, squads that focus on client apps and squads that focus on features but even then handoffs are avoided like plague.

For the inevitably required synchronization between releases the squad model relies on a release train and converges with SAFe on this front, albeit from a different angle of approach. Feature toggles are interestingly mentioned as an enabler for keeping version control structures simple and for "releasing" unfinished features (which ship with the release but are not active yet). I'll have to write a post on feature toggles in the future since that is a subject which is popping up with increasing frequency.
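To make the feature toggle idea a bit more concrete, here is a minimal sketch of how an unfinished code path can ship dark. It is not taken from the Spotify talk; the names FeatureToggles, CheckoutPage and the "new-checkout" key are hypothetical, and a real setup would typically read the toggle state from configuration or a toggle service rather than a hard-coded set.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical toggle registry: an immutable set of enabled feature names.
final class FeatureToggles {
    private final Set<String> enabled;

    FeatureToggles(Set<String> enabledFeatures) {
        // Defensive copy so the toggle state cannot change behind our back.
        this.enabled = Collections.unmodifiableSet(new HashSet<>(enabledFeatures));
    }

    boolean isEnabled(String feature) {
        return enabled.contains(feature);
    }
}

// Hypothetical component with one finished and one unfinished code path.
final class CheckoutPage {
    private final FeatureToggles toggles;

    CheckoutPage(FeatureToggles toggles) {
        this.toggles = toggles;
    }

    String render() {
        // The new flow is merged and deployed but stays inactive
        // until the toggle is switched on.
        if (toggles.isEnabled("new-checkout")) {
            return "new checkout flow";
        }
        return "classic checkout flow";
    }
}

final class FeatureToggleDemo {
    public static void main(String[] args) {
        CheckoutPage dark = new CheckoutPage(
                new FeatureToggles(Collections.<String>emptySet()));
        CheckoutPage live = new CheckoutPage(
                new FeatureToggles(Collections.singleton("new-checkout")));

        System.out.println(dark.render()); // classic checkout flow
        System.out.println(live.render()); // new checkout flow
    }
}
```

The point of the design is that both paths live on the same branch, so there is no long-lived feature branch to merge back later.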

This model is not something you can just transition to over a few months. It's something you could set as a long term goal and start building the requisite team trust along the way (and inevitably exchange some people incompatible with the model along the way).

The pyramid of automated testing - or why manual testing as a starting point is a bad idea

The ideal versus the reality

Automated testing is one of those topics that when mentioned, everyone nods their heads in concert to the tune of "yeah we should do that". And in a similar union the response to "should we do it now?" is "we don't have the time for that".

Why is that?

One common reason is that we often approach automated testing entirely incorrectly and try to apply it as a magical remedy to legacy software (in this scenario meaning software that isn't designed to be automatically tested - which covers a large swathe of contemporary software landscape).

The cone of test automation from hell

And when we do attempt automated testing, it's often of the variety "let's build end-to-end UI based tests against this legacy UI with billionty services behind it, and cover all the error cases". Developers have perhaps heard about JUnit and are proud to have added a test suite of three unit tests to the code base. Someone also suggested that perhaps we could create a few tests which work against a specific system's SOAP web service API and seven tests like this were hand-written over a two week period.

And then the all-encompassing UI end-to-end tests were created to handle all the rest.

Sound familiar?

If not, count yourself amongst the lucky. This is a common approach to automated testing often seen in the corporate wilds.

It is known as the cone of test automation from hell (as characterized by Venkat Subramaniam, again, https://www.youtube.com/watch?v=uQ75fI1tqoM).

Why is it the cone of test automation from hell?

Because the tests that are the easiest to write, fastest to run and most handy to maintain are the fewest in number while the automated end-to-end and manual tests which take the most resources to maintain and execute are the most numerous. This is short-term thinking in play.

So what's the alternative?

Pyramid of test automation

The right way

The pyramid of test automation is structured as follows:

The base layer - creating the foundation everything else rests upon and of the highest volume - is constructed with unit testing at the code level

These are the tests which are very fast to both write and execute, and also easy to maintain considering the refactoring tooling in contemporary IDEs
Always run in their entirety as part of both developer builds and CI/CD pipeline build phase
Can cover close to or even fully 100% of cases in the code
Can cover non-happy paths extensively
Can ensure that various entirely unexpected input is also handled in a consistent manner providing high resiliency and fault tolerance
But what they do not do: ensure that different components work happily together, that is left to the next level
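As a small illustration of what the base layer can look like, the sketch below tests a single pure function on both happy and non-happy paths. PriceCalculator is a hypothetical unit invented for this example, and the hand-rolled check helper only stands in for a real framework such as JUnit so that the snippet stays self-contained.

```java
// Hypothetical unit under test: a pure function with explicit input validation.
final class PriceCalculator {
    /** Applies a percentage discount; rejects nonsensical input consistently. */
    static long discountedCents(long priceCents, int discountPercent) {
        if (priceCents < 0) {
            throw new IllegalArgumentException("price must not be negative");
        }
        if (discountPercent < 0 || discountPercent > 100) {
            throw new IllegalArgumentException("discount must be within 0..100");
        }
        return priceCents - (priceCents * discountPercent) / 100;
    }
}

// Stand-in for a JUnit test class; each check is one tiny, fast test case.
final class PriceCalculatorTest {
    static void check(boolean condition, String name) {
        if (!condition) {
            throw new AssertionError("failed: " + name);
        }
    }

    /** Returns true when the input is rejected with IllegalArgumentException. */
    static boolean rejects(long priceCents, int discountPercent) {
        try {
            PriceCalculator.discountedCents(priceCents, discountPercent);
            return false;
        } catch (IllegalArgumentException expected) {
            return true;
        }
    }

    public static void main(String[] args) {
        // Happy paths
        check(PriceCalculator.discountedCents(1000, 25) == 750, "25% off");
        check(PriceCalculator.discountedCents(1000, 0) == 1000, "no discount");
        check(PriceCalculator.discountedCents(1000, 100) == 0, "full discount");

        // Non-happy paths: unexpected input is handled in a consistent manner
        check(rejects(-1, 10), "negative price");
        check(rejects(1000, -5), "negative discount");
        check(rejects(1000, 101), "discount above 100");

        System.out.println("all unit checks passed");
    }
}
```

Because the function has no dependencies, the non-happy paths cost only one line each, which is what makes extensive coverage realistic at this level.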

The middle layer - service-level tests

These are the tests written against the API of a single service (often in a microservice architecture).
They ensure that the service behaves according to its public contract (the API, see my post on API First principle etc. https://contemplative-architect-journey.blogspot.com/2019/12/the-value-of-internal-api-ecosystem-and.html)
The database and upstream service integrations probably need to be mocked so that the service level tests can be run in isolation
There can also be variants of the service level tests that on one side test the individual service together with upstream mocks and on another test against some pre-configured mock data returning upstream services and database. Details dependent on context and implementation
High coverage of happy paths but only covers the major functional non-happy paths
These tests are not as fast to execute as the unit level but are still often incorporated in the CI/CD pipeline build phase

Optionally some portion can be part of a later stage of acceptance testing in case they take longer to execute
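The middle layer can be sketched along the same lines. In the example below the service is exercised only through its public contract while the upstream dependency is replaced with a stub. All names (InventoryClient, OrderService, StubInventoryClient and the result strings) are hypothetical, and the test calls the service in-process for brevity, whereas a real service-level test would usually go through the service's actual HTTP or messaging API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical upstream dependency of the service under test.
interface InventoryClient {
    int unitsInStock(String sku);
}

// Hypothetical service; placeOrder is its public contract.
final class OrderService {
    private final InventoryClient inventory;

    OrderService(InventoryClient inventory) {
        this.inventory = inventory;
    }

    String placeOrder(String sku, int quantity) {
        if (quantity <= 0) {
            return "REJECTED_INVALID_QUANTITY";
        }
        return inventory.unitsInStock(sku) >= quantity
                ? "ACCEPTED"
                : "REJECTED_OUT_OF_STOCK";
    }
}

// Stub standing in for the real upstream so the test runs in isolation.
final class StubInventoryClient implements InventoryClient {
    private final Map<String, Integer> stock = new HashMap<>();

    StubInventoryClient with(String sku, int units) {
        stock.put(sku, units);
        return this;
    }

    @Override
    public int unitsInStock(String sku) {
        // Unknown products are reported as having zero stock.
        return stock.getOrDefault(sku, 0);
    }
}

final class OrderServiceContractTest {
    static void check(boolean condition, String name) {
        if (!condition) {
            throw new AssertionError("failed: " + name);
        }
    }

    public static void main(String[] args) {
        OrderService service =
                new OrderService(new StubInventoryClient().with("book-1", 3));

        // Happy path gets most of the coverage at this level
        check("ACCEPTED".equals(service.placeOrder("book-1", 2)), "in stock");

        // Only the major functional non-happy paths are covered
        check("REJECTED_OUT_OF_STOCK".equals(service.placeOrder("book-1", 4)), "too many");
        check("REJECTED_OUT_OF_STOCK".equals(service.placeOrder("unknown", 1)), "unknown sku");
        check("REJECTED_INVALID_QUANTITY".equals(service.placeOrder("book-1", 0)), "zero quantity");

        System.out.println("all service-level checks passed");
    }
}
```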

The top layer - the fewest number of tests

Test full end-to-end chains starting from current service's context and covering upstream (obviously not downstream - that's the downstream services' / applications' job)
Either API or UI based depending on the type of service or application being tested
Usually executed in an environment which is production-like in that the connections between services correspond to production level interactions and nothing is mocked (data is often dummy stuff though but that shouldn't affect the overall interactions at API traffic level)
Expected to take significant time to execute
Ideally these don't cover individual features but rather the major customer use cases and flows

When new features are implemented, each new feature usually doesn't add new top layer tests but instead updates the use case test flows that it affects, depending on the case

Covers non-happy paths only when they're relevant for business scenarios and significantly differ functionally from the happy path

Tip of the iceberg - manual testing

Ideally not for verification at all (or only at a very light-weight level)
Instead for exploratory testing and to create insight about the customer experience and how to further improve the end-to-end chain in its various parts

Orthogonal testing aspects

There are some specific types of testing that are not covered by the above-mentioned perspective. This is a list of some of them (and these often need to be handled by specialized personnel in high quality environments):

Penetration testing

Security is a nasty business in that at worst a single mistake anywhere in the chain of calls could potentially expose you to data theft, content defacing or even backend admin access
There is no panacea for security testing. This is a topic which requires a separate post to get even started on the fundamentals
Don't try to include this in the standard test suite
Instead, follow secure-by-design coding practices and design principles and defense-in-depth or security-in-layer architectural principles
For most enterprises it's not feasible to build in-house penetration testing capability so if you're working with important customer data or otherwise work with data assets that are important to safeguard, consult firms specialized in this stuff

Performance testing

This should be done in-house and potentially by the same team but it could be a more centralized capability as well
Use some of the same API level or UI level test suites but tooling might be different
Aim to test some kind of peak load (which depends on usage scenario), scaling performance, etc.

Failover testing

This is also challenging to include in the functional tests but it potentially could be
Important aspect especially for highly available services is ensuring that failover happens smoothly
Less important in a modern Kubernetes context - there the more relevant testing is the next category

Chaos engineering / chaos testing

In an advanced context you should definitely start employing chaos engineering which is an extreme form of testing - randomly every now and then introducing various kinds of faults, delays, disruptions and even entire data centers going down - to your production systems!

A less extreme version is doing this in a test environment

Championed by Netflix as a method of ensuring that they're up and running even in the most extreme of circumstances
Has plenty of tooling available nowadays. Basically requires a highly functional Kubernetes based environment
Wiki page is a good starting point https://en.wikipedia.org/wiki/Chaos_engineering
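The core idea of fault injection can be illustrated at a very small scale in plain code. The sketch below is only a toy model, not how dedicated chaos tooling works: FaultInjector and ResilientCaller are hypothetical names, the fault is a simple exception, and a seeded Random is used so that a run can be repeated.

```java
import java.util.Random;
import java.util.function.Supplier;

// Hypothetical wrapper that makes a call fail with a given probability.
final class FaultInjector {
    private final Random random;
    private final double failureRate;

    FaultInjector(double failureRate, long seed) {
        this.failureRate = failureRate;
        this.random = new Random(seed); // seeded so experiments are repeatable
    }

    <T> Supplier<T> wrap(Supplier<T> call) {
        return () -> {
            // nextDouble() is in [0, 1): rate 0.0 never fails, rate 1.0 always fails
            if (random.nextDouble() < failureRate) {
                throw new IllegalStateException("injected fault");
            }
            return call.get();
        };
    }
}

// Hypothetical caller whose resilience we want to verify.
final class ResilientCaller {
    static <T> T callWithFallback(Supplier<T> call, T fallback) {
        try {
            return call.get();
        } catch (RuntimeException e) {
            return fallback;
        }
    }
}

final class ChaosDemo {
    public static void main(String[] args) {
        Supplier<String> flaky =
                new FaultInjector(0.3, 42L).wrap(() -> "live data");

        int fallbacks = 0;
        for (int i = 0; i < 1000; i++) {
            String result = ResilientCaller.callWithFallback(flaky, "cached data");
            if ("cached data".equals(result)) {
                fallbacks++;
            }
        }
        // Every call returned something usable despite the injected faults.
        System.out.println("fallback used " + fallbacks + " times out of 1000");
    }
}
```

The experiment passes if the caller keeps returning usable results while faults are being injected; the same question is what chaos engineering asks of an entire production system.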

Other rare non-service related failure scenarios testing

E.g. disaster recovery being a very important one that is often neglected - sometimes to a great detriment

Declarative programming, simplicity, productivity - and why all the fuss about functional programming is on-point

What's this fuss about functional programming

Functional programming is not new although you could be forgiven for getting that impression from following programming news, conference talks and other such outlets. Some of the very first programming languages were functional in the sixties but the rise of easy-to-get-started imperative and then object oriented languages (still in the imperative style) overwhelmed it so completely that functional style was for a good while forgotten from the mainstream.

Now the phoenix is rising from the ashes and wondering why everyone thinks it's new and shiny. Or why it burned out in the first place.

I agree with Venkat Subramaniam that functional programming is actually less consequential than the larger category it belongs to, which is declarative programming. At its base, declarative programming means describing what you want to achieve instead of the steps of how to do it (referring to conference talk https://www.youtube.com/watch?v=uQ75fI1tqoM and associated blog post https://contemplative-architect-journey.blogspot.com/2019/11/speed-without-discipline-recipe-for.html). But this is all categorization and wordplay, so what is the actual benefit?

What's the benefit - the state of the functional

Functional programming isn't at its very core so much about functions as it's about how we handle state. Which in the functional / declarative world means pretty much staying away from it as much as possible, and when you do have to deal with it, treat it as if it were toxic (which it is, which makes you take care).

"But how can you develop without state" is the natural next question from anyone not versed in the secrets of the immutable mindset.

Consider the following simple code example in Java (borrowed from one of Venkat's talks)

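import java.util.Arrays;
import java.util.List;

public class Sample {
    public static void main(String[] args) {
        List<Integer> numbers = Arrays.asList(1, 2, 3, 4, 5, 6, 7);

        // Mutable accumulator, updated on every matching iteration
        int total = 0;

        for (int i = 0; i < numbers.size(); i++) {
            if (numbers.get(i) % 2 == 0) {
                total += numbers.get(i) * 2;
            }
        }

        System.out.println(total);
    }
}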

What is happening here? This is a simple example and it's easy to deduce the intent behind the algorithm but still it takes a little moment.

How do you mentally debug this? It requires you to hold in your mind the state of the different variables of which the relevant block is composed and which could change and then mentally execute the logic.

Being an example, this is simple, but what about real world situations where there are actually 20 different variables that could change, whose values depend on the point in the algorithm you're thinking about, on which method calls have been made and on what may have been altered from outside the context you're looking at? Then add a couple of non-intuitive language features or method implementations (why does that get method alter the object's internal state?) and you've got a mess.

This is the situation developers refer to when they tell about being deep in thought debugging a problem when someone interrupts their chain of thought with a question or email or skype message or cat problems and to continue they need to build that mental state image all over again.

Now would you hazard to make a guesstimate on how many bugs and problems this leads to? Whatever your guess, I can make the educated guess that it still under-estimates the problem.

So how does functional style help? It does away with state mutation entirely unless it's absolutely necessary - and you would be really surprised how rarely that is the case when you come from a purely imperative background.

Let's take the same example from above and write it in functional style supported by contemporary Java:

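import java.util.Arrays;
import java.util.List;

public class Sample {
    public static void main(String[] args) {
        List<Integer> numbers = Arrays.asList(1, 2, 3, 4, 5, 6, 7);

        System.out.println(
            numbers.stream()
                .filter(e -> e % 2 == 0)   // keep the even numbers
                .mapToInt(e -> e * 2)      // double each one
                .sum());                   // add them up
    }
}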

The difference is that now you can look at each individual step in isolation from everything else and know what it takes in and what it returns without having to consider the state of everything else. In this oversimplified example the benefit is not considerable - and if you're not used to this style then the overhead of mentally parsing what's going on is actually higher than with the more familiar and traditional imperative style. But imagine what the effect will be in the cases where there would traditionally be 20 variables whose state you have to mentally monitor to be able to understand what is going on in the code. Or 50. Or 100? At some point you just can't deal with that. No one's mental capacity is sufficient. And no matter how the rest of the codebase changes, when you run these operations again they still return the same answer (reproducibility).

This is the essence of the functional style.

You can also consider an additional factor - read the code aloud from the first and second examples and think about how it relates to the intent and requirements behind the implementation. In the functional example you can read the lines as "get a stream of numbers", "then keep the even ones", "then multiply each by two" and finally "and get the sum of those". The reading of the code reflects the intent and the requirements as is, while in the imperative example you need to tease and parse out the intent from how it behaves - and even then you can make mistakes (which is why commentary in imperative code still is imperative, pun intended - and of course it often ends up misleading, especially after updates to the code and not to the commentary).

This is the essence of the declarative style.

You would be surprised how much of the stateful imperative code can be phrased in an entirely immutable fashion.
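As one more illustration of this, here is a sketch of a typical stateful task, counting occurrences, written in both styles. The WordCount class and its method names are made up for this example; the stream version uses the standard Collectors.groupingBy and Collectors.counting collectors. The collector does mutate a map internally, but that mutation is hidden from the calling code, which never sees a half-built result.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

final class WordCount {
    // Imperative version: a mutable map is updated step by step.
    static Map<String, Long> imperative(List<String> words) {
        Map<String, Long> counts = new HashMap<>();
        for (String word : words) {
            Long current = counts.get(word);
            counts.put(word, current == null ? 1L : current + 1L);
        }
        return counts;
    }

    // Declarative version: states what is wanted, "group by word and count".
    static Map<String, Long> declarative(List<String> words) {
        return words.stream()
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("a", "b", "a", "c", "b", "a");

        Map<String, Long> first = imperative(words);
        Map<String, Long> second = declarative(words);

        System.out.println(first.equals(second)); // true
        System.out.println(second.get("a"));      // 3
    }
}
```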

I have personally experienced the transformative effect this can have on your code quality. In my imperative object oriented days I would routinely write 10 to 20 lines of code, run it, encounter a bug, fix it, encounter another and after another fix or two get it working as intended. After my personal functional revolution I have many times written as many as 500 lines in one go (some of it non-trivial, and still in the same language as before), run it and it just worked (your mileage may vary). It's just amazing how much your productivity can increase when, after creating the overall structural plan in your mind, you can concentrate on just the current line / function in your implementation and not have to worry about the rest.

It's my view that the reason declarative and functional styles work so well and provide such benefits is the imperfection of the human mind. We have a very limited working memory so the more we can segregate orthogonal concerns to separate threads of thought and focus on a highly (and truly) restricted subset, the better we can do. The best coding styles create structures of a size which our minds are capable of handling (and not just in the happy path case but in the overall complexity of the section of code you're looking at with absolutely everything that could affect it).

So is Java functional or not? What about C?

Functional and declarative are styles, not languages. Almost any programming language is capable of benefiting from functional style of programming even when they do nothing to actively support it (while there are e.g. some basic variants etc. which don't enable it in any sensible way - and still it's possible to get some benefit). Then there are what are known as actually functional languages which to some degree enforce functional approach (Haskell to a very high degree combined with static typing, Clojure and other Lisp variants to a slightly lesser degree from the dynamically typed world, etc.) or at the very least provide extensive native tooling for it and treat it as the default (Scala, F#, etc.). We've also seen the trend of mainstream object oriented languages having more functional capabilities incorporated (Java, C#).

The question of "What is a functional programming language" is (outside of edge cases) more of a terminology issue. The more interesting question is to what degree you can benefit from learning a new way of thinking while keeping your current tooling. And whether there is a way to change your tooling to improve your outcomes based on what you learn.

You can bring the benefits of this immutability / functional thinking even into languages that traditionally get categorized as entirely imperative, such as C. I remember seeing a talk by John Carmack (original Doom's programmer) on how he transformed his C programming with functional style, which resulted in a significant drop in bugs and increased development speed. With some quick googling I could only find his article on the topic relating to C++: https://web.archive.org/web/20130819160454/http://www.altdevblogaday.com/2012/04/26/functional-programming-in-c/

But functional programming is hard - and how to make it less so

The hardest kind of change is when you have to acclimate yourself to an entirely new kind of thinking. In programming one of these transformations is from the imperative to the declarative styles (and as a subset, from object oriented to functional). That does take some getting used to.

But no one forces you to drop Java and go 100% Haskell - a total 180 is hardly beneficial given all the frictions it comes with (especially if mandated from up high).

To start getting benefits from the style of thinking and development approach highlighted above, you don't have to go 100% into it immediately.

What I recommend is that you start familiarizing yourself with one functional focused language from your domain (I recommend Clojure for Java stack folks and F# for .NET stack ones) and begin trying to build some practice program after watching a few tutorials. Then do some more tutorials and practice programs. After you feel like you're starting to get the hang of it, find the corresponding stream / LINQ functionalities from your familiar language of choice and start experimenting with making everything in your code immutable by default and see how far you get.
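For the "immutable by default" experiment in plain Java, a minimal sketch could look like the following. ShoppingCart is a hypothetical class invented for this example; the pattern is simply final fields, defensive copies and "with" methods that return a new object instead of changing the existing one.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical immutable value: nothing can change after construction.
final class ShoppingCart {
    private final List<String> items;

    ShoppingCart(List<String> items) {
        // Defensive copy plus read-only view, so neither the caller's list
        // nor the returned list can be used to mutate this cart.
        this.items = Collections.unmodifiableList(new ArrayList<>(items));
    }

    static ShoppingCart empty() {
        return new ShoppingCart(Collections.<String>emptyList());
    }

    // Returns a new cart; the cart it is called on stays as it was.
    ShoppingCart withItem(String item) {
        List<String> copy = new ArrayList<>(items);
        copy.add(item);
        return new ShoppingCart(copy);
    }

    List<String> items() {
        return items;
    }
}

final class ShoppingCartDemo {
    public static void main(String[] args) {
        ShoppingCart empty = ShoppingCart.empty();
        ShoppingCart one = empty.withItem("book");
        ShoppingCart two = one.withItem("pen");

        System.out.println(empty.items()); // []
        System.out.println(one.items());   // [book]
        System.out.println(two.items());   // [book, pen]
    }
}
```

Any code holding a reference to one of these carts can rely on it never changing, which removes one more variable from the mental state you have to track.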

This way you don't have to upend your world at once (which wouldn't be practical anyway). Instead you can perhaps start getting some view of what the benefits are after repeatedly hitting your head against the wall when trying to accomplish the Clojure / F# exercises.

And perhaps later you'll decide to challenge yourself, and your belief in your superior abstract thinking, by learning Haskell. You will feel stupid. A lot more wall-head-contact ensues (assuming you haven't given up) but you keep accruing valuable insight all the way. And mentoring and uplifting your colleagues on the way no doubt (hopefully in a positive fashion).

Immutability beyond the lines of code - the architectural perspective

Over the last few years it has been very interesting to see how the immutability principle and the declarative style have been brought to higher and higher levels of abstraction. Immutability is the basis of event sourcing, and it is an important component in the philosophy of build-once-deploy-everywhere CI/CD pipelines, declarative CI/CD pipelines (increasingly the de facto standard of CI/CD), immutable data engineering (https://medium.com/@maximebeauchemin/functional-data-engineering-a-modern-paradigm-for-batch-data-processing-2327ec32c42a - Airflow, etc.), immutable architectures, containers (basis of OpenShift security practices), etc. etc. etc.
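To show how immutability underpins event sourcing, here is a deliberately tiny sketch: state is never stored or updated in place but derived by replaying an append-only list of immutable events. AccountEvent, AccountProjection and the amounts are all hypothetical; a real system would add persistence, ordering guarantees and so on.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical immutable event: a fact that happened and never changes.
final class AccountEvent {
    enum Type { DEPOSITED, WITHDRAWN }

    final Type type;
    final long amountCents;

    AccountEvent(Type type, long amountCents) {
        this.type = type;
        this.amountCents = amountCents;
    }
}

// Current state is a pure function of the event log.
final class AccountProjection {
    static long balanceCents(List<AccountEvent> events) {
        return events.stream()
                .mapToLong(e -> e.type == AccountEvent.Type.DEPOSITED
                        ? e.amountCents
                        : -e.amountCents)
                .sum();
    }
}

final class EventSourcingDemo {
    public static void main(String[] args) {
        List<AccountEvent> log = Collections.unmodifiableList(Arrays.asList(
                new AccountEvent(AccountEvent.Type.DEPOSITED, 10_000),
                new AccountEvent(AccountEvent.Type.WITHDRAWN, 2_500),
                new AccountEvent(AccountEvent.Type.DEPOSITED, 500)));

        // Replaying the whole log gives the current balance ...
        System.out.println(AccountProjection.balanceCents(log));               // 8000
        // ... and replaying a prefix gives the balance at that point in time.
        System.out.println(AccountProjection.balanceCents(log.subList(0, 2))); // 7500
    }
}
```

Because the log is only ever appended to, replaying it always yields the same state, which is the same reproducibility property the stream example has at the level of a single method.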

It is literally everywhere providing improved decoupling and improving your (the developer's) ability to focus on one single thing at a time and actually and fully understand what is going on in any specific context you pay attention to.

In short, the declarative style enables you (or at the very least maximises your chance) to understand what is going on without having to keep the entire context of your application or architecture in your mind and pretending to actually understand the systemic complexity of it, black swans et al.

Referenced conference talks

https://www.youtube.com/watch?v=FQERMVABRrQ
https://www.youtube.com/watch?v=uQ75fI1tqoM

The value of internal API ecosystem and API first - or why this is an improvement over the SOA and ESB model

But isn't API ecosystem just the external stuff?

API ecosystem is a term most often associated with companies providing open interfaces externally for third-party integrations and a value-add application layer (most often the ones with a platform business approach).

I claim that for most companies the more important API ecosystem is the internal one. Internal API programs and the API First approach are important in developing a functional overall integration architecture and development practices in modern large enterprises with ambitions for agility (and who doesn't have those at this point?).

Isn't this just SOA all over again?

Service oriented architecture (SOA) and the often associated enterprise service bus (ESB) used to facilitate integration between the different services rose to prominence in the 2000s. My take is that the SOA philosophy was (and is) a good one but the tooling and the ESB approach that have traditionally been associated with it didn't measure up. The ESB approach with the SOAP protocol, web services etc. has a few built-in issues.

In practice the ESB model of SOA often works like this: one team provides a core service (often an out-of-the-box external product with some customization) which typically exposes 1-4 services with a variety of different protocols (sometimes HTTP/JSON, sometimes SOAP web service, sometimes custom binary formats, etc.) via an API which is designed heavily from the perspective of the product itself. After this the team considers its job done and the responsibility is often left to an internal but separate integrations team to figure out how to provide an internally consistent API for the other teams to consume.

Essentially this means that the team with the best knowledge about both the technical and functional aspects of the underlying service does nothing about the API from a re-usability and standards compliance point of view. The responsibility to define a re-usable and clear API which is easy for others to use (which are very often the stated goals of API creation) is thus left to a team which by default has no knowledge of the system, its requirements, its use cases or anything else and will need to learn it (in many cases with a game of broken telephone added to the specification chain). When phrased this way I'm sure you can start to see what kind of issues occur.

The ESB will inevitably end up containing business logic: information for usable APIs needs to be aggregated from multiple services making different assumptions, refinement requires mapping to a different format, databases with enrichment data end up living close to the ESB, etc. All this needs to be maintained and upgraded in cadence with the underlying systems. The cherry on top is that ESB vendors rarely provide or even enable modern CI/CD pipeline capabilities or robust versioning and diff tooling and, to add insult to injury, often opt for low code graphical editing interfaces (which are great for small contexts and creating stuff quickly but often break at larger scales and disregard maintenance and lifecycle concerns).

The integrations team & ESB model leads to some issues already when creating the APIs. So how about maintenance? Well, since the ESB ends up containing plenty of business knowledge implemented by people who're not the best experts on the underlying system, the ESB naturally also ends up with its share of bugs. Combine this with restricted visibility (end-to-end visibility is often not a feature naturally provided by ESB vendors and is thus left up to how well governance is handled by the implementing organization) and you'll end up with a situation where very often the finger is pointed and blame assigned by default to the integrations team when something goes awry in the end-to-end pipeline. This often muddies up the responsibilities, increases MTTR (mean time to repair) and polarizes the internal development departments.

Naturally an internal API ecosystem is no panacea - strong governance and architecture vision are required in both models - but the API First approach does offer solutions to multiple fundamental issues with the SOA / ESB model.

So how do API First and an internal API ecosystem help?

API First and internal API ecosystem are an evolution of service oriented architecture thinking suitable for modern microservice architectures (which are essentially just ways to design and manage large distributed systems at high level).

It's actually useful checking the wiki page on SOA to get a perspective of both the traditional and the modern approaches to it: https://en.wikipedia.org/wiki/Service-oriented_architecture

Here is an overview of how API First approach tackles the previously identified issues and what kinds of new complexities it brings with it (nothing is perfect, it's just tradeoffs all the way down).

API First principle

API First principle changes the overall dynamic of API design responsibilities. It borrows from the modern devops and agile cross-functional product or feature teams thinking in that the team is responsible for providing a working product as a service (maintenance included). The definition of the product is extended to include a re-usable API which follows the company's internal standards and is usable for at least the primary use cases by internal customers (i.e. other internal teams) as is. There is no centralized integrations team which needs to perform magic to transform a messy API into a re-usable one - it is one of the core responsibilities of the product team itself.

This means that the team with the best knowledge of both the technical and functional aspects of the product is also the one to design the API which it provides. While this means more responsibility for the team, it also removes several undesirable phenomena from the overall architecture and organization. For example, it is often a lot clearer immediately which system has broken down when there is an issue with the end-to-end functionality, so the defect analysis is almost automatically directed to the correct party much faster.

An important part of API First is that the API is designed before the implementation, driven by business requirements and owned by the business owner of the system itself. This provides increased business alignment from the get-go and reduces communication barriers that often exist between business and development teams in traditionally structured development organizations. This doesn't mean that the business analyst needs to be the one hand-writing Swagger or RAML specs (in fact no one should be hand-writing those) but even conceptually thinking from the perspective of API First is highly beneficial.

A non-obvious benefit of creating the API specification first and the implementation after it is that it becomes very hard to accidentally make breaking changes or couple the internal structure of the implementation to the public interface. If you generate the API spec from code, both become extremely easy. Breaking changes cause downstream pain: customer teams consuming your service have to change their implementation, which couples otherwise independent teams together and causes unnecessary organizational development overhead. Overall the spec-first model tends to motivate better up-front API design since you know you'll end up doing more work when you notice change requirements later in the process (which still happens since we're all imperfect - but the frequency should be lower).
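To make the idea of an accidental breaking change concrete, here is a minimal sketch of a check that a spec publishing pipeline could run between two versions of a spec. It is only an illustration: the `response_fields` key is a hypothetical simplification (real OpenAPI documents describe responses with nested schemas), and the function name is mine, not part of any tool.

```python
def breaking_changes(old_spec: dict, new_spec: dict) -> list[str]:
    """List changes between two spec versions that would break existing consumers.

    Assumes a simplified, hypothetical spec shape:
    {"paths": {"/orders": {"get": {"response_fields": ["id", "total"]}}}}
    """
    problems = []
    new_paths = new_spec.get("paths", {})
    for path, old_ops in old_spec.get("paths", {}).items():
        if path not in new_paths:
            problems.append(f"removed path {path}")
            continue
        for method, old_op in old_ops.items():
            new_op = new_paths[path].get(method)
            if new_op is None:
                problems.append(f"removed operation {method.upper()} {path}")
                continue
            # Adding fields is fine for tolerant consumers; removing them is not.
            removed = set(old_op.get("response_fields", [])) - set(
                new_op.get("response_fields", [])
            )
            for field in sorted(removed):
                problems.append(
                    f"removed response field {field!r} from {method.upper()} {path}"
                )
    return problems
```

A pipeline could fail the build whenever this list is non-empty and the major version has not been bumped. Purely additive changes pass untouched.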

One consequence of this model is that a strong governance framework is required to align all teams' APIs with regards to the shared cross-cutting concerns like authentication, authorization, specification style, common header fields etc. It's very easy for some drift to start happening after multiple teams get the freedom and responsibility to develop and expose their APIs so the architectural guardrails need to be well thought out, tested and evaluated before really starting to internally scale up this approach.
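Part of such guardrails can be automated. Below is a rough sketch of a spec linter that checks two example rules on an OpenAPI-3-style document: every operation must declare a security requirement, and every operation must declare a common header. The `X-Correlation-Id` header is an assumed company convention used only for illustration, and the sketch does not resolve `$ref` references.

```python
HTTP_METHODS = {"get", "put", "post", "delete", "patch", "head", "options"}
# Hypothetical company-wide standard; substitute your own conventions.
REQUIRED_HEADERS = {"X-Correlation-Id"}


def lint_spec(spec: dict) -> list[str]:
    """Return governance violations found in an OpenAPI-3-style spec dict."""
    violations = []
    globally_secured = bool(spec.get("security"))
    for path, item in spec.get("paths", {}).items():
        for method, op in item.items():
            if method not in HTTP_METHODS:
                continue  # skip path-level keys such as "parameters" or "summary"
            label = f"{method.upper()} {path}"
            # An operation-level "security" entry overrides the global one,
            # and an empty list there explicitly disables security.
            secured = bool(op["security"]) if "security" in op else globally_secured
            if not secured:
                violations.append(f"{label}: no security requirement declared")
            declared = {
                p.get("name") for p in op.get("parameters", []) if p.get("in") == "header"
            }
            for missing in sorted(REQUIRED_HEADERS - declared):
                violations.append(f"{label}: missing header parameter {missing}")
    return violations
```

Running a check like this in the publishing pipeline keeps drift visible without requiring a central team to review every spec by hand.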

This is where the tooling of the internal API ecosystem steps in.

Internal API ecosystem tooling

To really work, the API First principle alone isn't enough - it requires good tooling, governance, training and practice. The details of the tooling are often highly dependent on the individual company context but there are some must-have components with important responsibilities and features.

Internal developer portal

This is the most visible and centralized part of the internal API ecosystem. All internal APIs (perhaps with the exception of experience layer team / application specific external APIs depending on how you structure your ecosystem) should be published here and should be visible to all developers. It's also a good idea to connect this API visibility to enterprise architecture and service landscape documentation / tooling to improve the overall visibility and observability of the internal service offering.

The internal dev portal should provide many of the same capabilities as an external one, i.e. allow developers to self-onboard, manage their team's APIs (or whatever permissions structure you decide on, though ideally this happens only via automated CI/CD pipelines), subscribe to all development and testing environment APIs they want, play with sandbox queries, and get support for API lifecycle management, e.g. communication around publishing new versions and deprecating old ones, usage statistics and much more.

Dev portal should also be connected to the API gateways at least enough to be able to provide all end-point information in addition to the API specs, information on authentication and authorization etc.

API Gateways

This is where we get to the part which is very highly dependent (even at high level) on the corporate context. API gateways are an important authentication, authorization and policy enforcement checkpoint for segregating different security contexts - both externally and internally. They may also be useful as common endpoints to hide implementation details (in a large multicloud ecosystem there might be one or more API gateways which guard all services hosted in AWS and a separate set for GCP, for example) and potentially for intra-platform traffic as well (e.g. between different AWS accounts), which may simplify traffic routing configuration.

In case the different platforms are highly distributed geographically the gateway architecture is likely to be important for latency optimization as well while keeping service discovery as simple as possible (not considering intra-cluster discovery within e.g. Kube clusters or service mesh clusters with connected data / control planes which may obviate the requirement for API gw based routing).

Note that it's often desirable to have the API / service specific routing and policy configuration centralized and connected to the internal dev portal publishing process. I won't delve more into this subject here though.

It's important that API gateways don't perform any mapping or other payload ETL operations to keep them lightweight and to avoid the issues of the centralized ESB integrations teams. It's essential that the teams' services expose sensible APIs at the origin.

Some additional roles and responsibilities that the gateways may have:

Rate limiting (DDOS protection)
Authorization checks
Scope enforcement (e.g. it's possible to use the same gateway instances for both internal and external calls signed by trusted authentication providers but with different scopes which define which services they can call)
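As a rough illustration of two of these responsibilities, the sketch below combines a token bucket rate limiter with a scope check, the way a gateway might evaluate a request before proxying it. All names here (`TokenBucket`, `check_request`, the `orders:read` style scopes) are made up for the example. A real gateway product would handle this through its own policy configuration and would also verify the token signature, which is omitted here.

```python
import time


class TokenBucket:
    """Allow short bursts up to `capacity`, refilling at `rate_per_sec` tokens per second."""

    def __init__(self, rate_per_sec: float, capacity: int, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock  # injectable so the behaviour can be tested deterministically
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill according to elapsed time, never exceeding the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def check_request(bucket: TokenBucket, token_scopes: set[str], required_scope: str) -> int:
    """Return the HTTP status a gateway might answer with before proxying the call."""
    if not bucket.allow():
        return 429  # Too Many Requests
    if required_scope not in token_scopes:
        return 403  # authenticated, but the token lacks the scope for this service
    return 200
```

The same gateway instance can serve internal and external callers this way: the routing is identical, only the scopes carried by the token differ.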

API Design and publishing tooling

You'll want to have a standard CI/CD pipeline for managing and publishing the API specifications (separate from service implementation) following everything-as-code principle. Ideally service and API provider teams should have no reason to directly interact with the internal developer portal or gateway configuration, it's all Git config & CI/CD based.

API design and lifecycle management is another place where you'll definitely want robust tooling since writing Swagger specs by hand gets cumbersome right quick. Fortunately there is a large variety of tools already available that cover not just the design phase but many other roles as well. Take a look at https://openapi.tools/ for OpenAPI / Swagger tooling (for RAML my understanding is that it's the go-to option only in case you've chosen Mulesoft which provides a high degree of integrated tooling as part of the enterprise subscription).

Another essential tooling aspect is interface layer code generation based on the API - both for the service and for consumers separately. Since I promote explicitly creating the API spec first, it would be unnecessary duplication to write the same interface descriptions in your selected language of choice by hand when you can just generate the method stubs with requisite annotations / attributes / configuration from the spec (terminology here is language dependent). Same applies (and more acutely) for the consumer side. Easy code generation is an important factor in creating the governance practices to mandate which parts of the API spec are mandatory to enable easy usage and thus good DX.

Warning on anti-patterns of shared binary libraries

One common anti-pattern for helping customer teams consume the service's API is providing them with a domain or service customized binary library. At first glance this may seem very sensible - provide re-usability and reduce the need for duplication on the consumer side. The problem with the approach is that it creates a binary dependency between teams, forcing binary dependency updates which very easily cause breaking changes. One of the main aims of microservice architectures is to reduce coupling between teams and the binary libraries directly break this by increasing coupling.

You could say that there are ways to minimize the downsides e.g. by keeping all logic out of the library but the domain logic has a tendency to swim in when it has an avenue. The library also easily becomes a point of blame making responsibility less explicit since now the service of one team (team is supposed to have total ownership of its service) has another team's code running within and potentially causing trouble.

Yet another problem is that a library usually needs to be provided for a specific language and in a large polyglot corporate environment this means that many teams will have issues.

When using REST and non-strict field binding (consumers ignore fields they don't know), the APIs can often be improved without breaking anyone - and this too is easily made more challenging by shared libraries, which may require the consumer applications to be rebuilt to get a new version of the library (yet more coupling).
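Non-strict field binding is sometimes called the tolerant reader pattern. A minimal consumer-side sketch, assuming a hypothetical `Order` payload, could look like this:

```python
from dataclasses import dataclass, fields


@dataclass
class Order:
    # Hypothetical consumer-side view: only the fields this consumer actually needs.
    id: str
    total: float
    currency: str = "EUR"  # optional in the payload, so it gets a default


def parse_order(payload: dict) -> Order:
    """Bind only the fields we know about and silently ignore everything else."""
    known = {f.name for f in fields(Order)}
    return Order(**{key: value for key, value in payload.items() if key in known})
```

If the provider later adds a field such as `discount_code`, `parse_order` keeps working without a rebuild. A missing required field still fails loudly with a `TypeError`, which is the kind of change that genuinely is breaking.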

So in short, just don't do it unless you have an extremely good reason and rationale (of which a well known example is Netflix which has different scaling requirements compared to most corporate environments). Googling on the topic left up to you.

Mock data & mock server tools

An important part of governance is mandating providing sufficient mock data in the API specification itself. This is extremely valuable for multiple reasons:

It's automatic test data immediately
Provides useful test data for consumers to use speeding up development using the API
Provides ability to create generic sandboxes within dev portal or via API gws
Acts as additional concrete example based documentation

There are tools available both for creating the mock data and incorporating it within the API spec and also for creating local service instances from the API specs that respond with real results based on the incorporated mock data.
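To show why examples embedded in the spec are so useful, here is a small sketch of the lookup at the core of a mock server: given an OpenAPI-3-style document, return the example a provider team has attached to a response. It matches paths exactly and ignores path templating and `$ref` resolution, which real mock server tools handle, so treat it as an illustration only.

```python
def mock_response(spec: dict, method: str, path: str,
                  status: str = "200", media_type: str = "application/json"):
    """Return the example body documented for an operation, or None if there is none."""
    try:
        media = spec["paths"][path][method.lower()]["responses"][status]["content"][media_type]
    except KeyError:
        return None
    if "example" in media:  # single inline example
        return media["example"]
    for named in media.get("examples", {}).values():  # named examples: take the first
        if "value" in named:
            return named["value"]
    return None


# Hypothetical spec fragment with mock data embedded by the provider team.
SPEC = {
    "paths": {
        "/orders/1": {
            "get": {
                "responses": {
                    "200": {
                        "content": {
                            "application/json": {"example": {"id": "1", "total": 9.5}}
                        }
                    }
                }
            }
        }
    }
}
```

Calling `mock_response(SPEC, "GET", "/orders/1")` returns the embedded example, which a sandbox in the dev portal or a local stub server could serve as the response body.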

Word on what was left unsaid

The internal API ecosystem governance model was mentioned several times but I'm still developing my views on it so I'll leave a deeper look for another time.

REST / HTTP/json wasn't mentioned at all in the context of API First since at the level of principle and generic tooling it isn't actually essential. HTTP/json (and hopefully RESTful) services are increasingly the de facto standard though and it's the industry-wide trend so I don't see any point in fighting against it unless you have specific needs regarding extremely low latency or high throughput, in which case it's very useful to look at protocols like gRPC to standardize on.

There is much more to internal and external API ecosystems but hopefully this provides you with some understanding of what might be gained by improving existing ESB based integration architectures and how to structure development team responsibilities for maximum productivity.

The value of developer experience and how the concept connects a variety of important techniques

A focus on good developer experience (DX) as an essential enabler for agile, high quality and fast development has only quite recently risen as an important perspective. Naturally many people have recognized it as essential for quite a while but the branding of "developer experience" drawing on the more widely understood concept of user experience is fresh - and fitting.

Perhaps one reason for the rise of the concept of developer experience is the increasing burden of complexity being placed on the average developer. When you think about devops, GitOps, NoOps, agile cross-functional teams, full-stack development etc. a good developer increasingly has to master a larger number of tools, techniques and layers than before. There was a good reason for highly specific roles in the traditional development model since so much of the details had to be managed manually - getting the last ounce of performance out of an Oracle cluster truly was a task for a specialized professional but I'm glad we're mostly moving past that world now. Not to say that deep specialization isn't necessary and valuable - quite the opposite - but the level of automation and abstraction (that actually makes sense and doesn't leak) has improved to a degree that NoOps (no separate ops team - dev and ops truly combined) actually is within reach with competent people.

The perspective that drives the highly functional abstraction and automation is developer experience. With that in mind the tooling can be designed so as to provide maximum ability to reason about code, architecture and everything while automating all but the truly unique and value-adding decisions. This trend ties together with everything-as-code as another important enabler of high degree of automation and visibility.

So what creates a good developer experience in somewhat more detail without going to anything technology specific?

All repeatable tasks automated - or at the very least with clear and easy to follow point-by-point checklists
Relevant, up-to-date and tight documentation which is visible by default
Everything visible by default so it is easy to find out relevant information while restricting modification controls only to relevant teams on a need-only basis - but with openly available and very clear information on how to get the relevant permissions for when they're required

Internal API ecosystem with a functional developer portal is important at the integration level, providing visibility to all the systems available as services within the corporate ecosystem
Ties in to API First principles (will create a separate post about this)
Obviously secrets are not visible by default - but the information and practices surrounding secret management should be visible

This also reinforces the security aspect of not trying to get away with simple security-by-obscurity

Best practices baked into extensive and high quality reference models (of application sample projects, CI/CD pipelines, commonly repeated system architecture templates dependent on domain, etc.)

Also documented in tight format and linked to more extensive driving requirements, compliance, regulatory etc. documentations but understanding the base material extensively should not be a requirement at developer level for secure and compliant development
Reference models should include all common cross-cutting functionalities and non-functional requirements implemented in a default way (and included in platform (e.g. Kube) approach as often as possible). This means failover, HA capability, database usage patterns, logging approach, monitorability (highly dependent on approach), health & readiness checks, single project structure, etc.

Make the reference models so easy to start with that that's just the most fun and quick and effortless way to start a project - that way good recommendable practices start spreading almost by themselves. It's still of course important to have good tech leads, mentors and training and the best results are produced when all of these are aligned

Fully automated CI/CD pipeline with build-once-deploy-everywhere structure, incorporated unit and service level test automation runs, functional pull request based code reviews with minimal hassle
Consistent and unified service authentication and authorization approach

Depends a lot on what the internal integration service landscape looks like. Internal API ecosystem could standardize on a unified OIDC / OAuth2 model and proxy legacy ESB / SOAP

API First and high quality service API specifications

More work for a team when it is providing its own service but so much easier to start using other teams' services.
Again heavy tie to internal API ecosystem and developer portal and also the consistent authentication

Unified developer experience for different hosting platforms (on-premise / partner data centers, different cloud providers)

Ideally you would standardize on a single platform but in many cases for larger enterprises this is not ideal or possible due to highly varying requirements of different teams and departments. Plus there is the untenable overhead of overly centralized decision making. These factors, combined with enabling low-cost exits from specific cloud providers, have led to the rise of multicloud strategies which can cause serious headaches for DX if there isn't some common level of standardization at development platform level
This is where Kubernetes steps in. See my previous (and upcoming) blog posts on this topic specifically

It stays possible to reason about how any part of code or the overall architecture works, what it depends on and how it affects others

At code level this is developer / team responsibility
At higher level this is the responsibility of architecture guidance
This is a very important aspect both in how easy it is to develop something new and especially when something goes wrong and needs to be analyzed and fixed quickly
And easy-to-reason-about development is just more fun - we all know the hair-pulling feeling of not being able to figure out why it's not working and it turns out to be something extremely non-intuitive (not to say that doing away with this entirely is possible)

So far I've only talked about DX in the context of increased productivity (although it might not've been apparent from each point) but of course another, and nowadays very considerable, benefit is the higher retention of high competence developers. There is a serious shortage of highly competent developers in the market (and average or below developers (especially with attitude issues) sometimes hurt overall productivity instead of helping) and providing a good developer experience significantly increases the likelihood of both recruiting and retaining them.

Did I forget to list something important that you feel is an essential part of good DX?

Why Kubernetes for developers? Velocity

Ryan Jarvinen makes a very good point in his talk that Kubernetes is often pitched at operations and developers may have a harder time seeing the benefits it might bring. So that is the focus of his very interesting talk.

Q: Why Kubernetes?
A: Development Velocity

https://www.youtube.com/watch?v=_W6O_pfA00s

Summary and highlights - and plenty of commentary from me that's not included in the talk itself:

Kube doesn't really have a specific term to offer to the question "So what is an app here?"
Much of the terminology is centered around subjects that are more architecture and ops related than what a traditional developer working at the level of lines of code and classes is used to considering: load balancing, high availability, standardized terminology & packaging, scaling automation, delivery automation

Of course I personally think that these all should be part of every developer's considerations. I'm a strong proponent of the full-stack mentality, devops and the most radical form NoOps - I see a future where separate development and ops organizations are combined - the need for that separation simply won't serve any purpose anymore. This is already happening in some companies
With NoOps there will still be need for centralized platform and tooling administration. In SAFe (scaled agile) terminology this is often known as the Systems team

Convincing the team with minimal onboarding:

Getting started is easy (assuming the platform, reference models and some samples are in place)
Share what you know, be very explicit about and document data flows at low level (read-only, write-only folder, database, integrations)

Everyone will have a local Kube installation.

Minikube is recommended in the talk but K3S is a more modern and lighter-weight alternative and even better options are cooking

Typical container adoption path:

Docker
Volumes, PVs
Minikube
K8S modeling and scalability via spec files, pods and other abstractions
Charts, OpenShift templates, or hand-rolled manifest / spec templating
Monocular, kubeapps, ServiceCatalog
PaaS?

Draft - easy way to get started. A tool for developers to create cloud-native applications on Kubernetes

https://draft.sh/
Initial impression on a bit of googling is that this is essentially some kind of simple automated lift & shift packaging?

Charts - packaging format for Helm, essentially a Kube package management system

Question for myself and for later: what is the role of these in a high-quality enterprise environment? External public repos to get pre-packaged software components sounds like security and compliance nightmare - and would require extremely good practices inside the Kube cluster RBAC-wise.
Probably somewhat analogous to the traditional Maven dependency approach of internally hosting package repo with pre-approved (or with reactive stance with lightweight-enough process) packages.

Telepresence - make externally hosted services appear as if they were part of the local cluster (quite important for local development)

Another approach to this is using a service mesh. All service communication happens via the mesh so it would be conceivable that you could be running an individual pod in isolation and virtually connect it as part of a shared dev cluster (the service's "local" database included) that's externally hosted. This idea was pitched in a discussion a while ago and I haven't seen it used anywhere yet
Anyway these two approaches seem similar at high functional level but very likely many devils in the details

Minishift - OpenShift variant of Minikube

By default a more secure, multi-tenant enabled Kube environment
Good practice: you shouldn't be running containers as root in production so why do it in dev? It's a good security practice to go with rootless containers by default

Learning resource link collection:

Kubernetes.io Tutorials: https://kubernetes.io/docs/tutorials
Katacoda: https://katacoda.com/courses/kubernetes
RyanJ's K8S-workshops: http://bit.ly/k8s-workshops
Interactive learning for OpenShift: http://learn.openshift.com

On roles: for architects: figure out who owns manifest creation, maintenance and distribution

Current self-note: this should be shared responsibility in the same way that each team owns their own stuff
But how will the common policy enforcement function, considering that it's very often suggested that companies go with shared clusters with namespace-per-team?

Likely answer is Open Policy Agent but I haven't drilled down to details on this yet

Agree on RBAC, config and secrets management, secret rotation policies, monitoring etc. with security & compliance teams

Quote pick - this just resonated with me "The future is already here - it's just not very evenly distributed." (W. Gibson)

Consider for example how people use the internet. Some do actually use it as an amazing resource for learning almost anything - and mostly for free! And then the impression I get from mass usage is of social media flamewars and cat videos. This creates an ever-widening skills and understanding gap which is reflected not only in development but in a variety of societal aspects, politics etc. But I'll not get too much off-track on this, topic for another time, for another blog.

The talk itself didn't actually directly provide background for the answer of why Kube improves development velocity so here's my take on it (based on as yet light experience with Kube - this is very much still the beginning of the personal journey):

Kubernetes with service mesh takes care of many of the non-functional requirements that would otherwise have to be manually considered in the service implementation e.g. high availability, functional failover, health and readiness checks, multiple instance message routing, service discovery, etc.

While you'd think companies would have good reference models to provide these recurring cross-cutting concerns etc. this is rarely the case in my experience

Reduces exit costs and effort in case a specific cloud platform becomes too pricey and we want to move applications to another platform
Also the same development experience translates much better between different cloud and on-prem platforms when all have the shared Kubernetes layer - and this is not a low level platform but provides plenty of tooling at the level where developers actually work

Also makes it easier to rotate between different teams within a company for variety
And also provides relevant experience on interesting tech and improves developers' motivation by letting them develop with modern tooling

Brings cloud native development approach's benefits to on-premise hosted environments as well

Together with the platform-agnosticism enables development teams to create truly agile devops pipelines

Promotes good and secure development practices by highly emphasizing segregated, stateless services, enabling lightweight tooling for creating, maintaining and monitoring microservice architectures where it's truly possible to separate different concerns into separate services without creating a manual maintenance hell for your company

In short Kubernetes is the tool that finally enables actual microservice architectures to be created without burying yourself in maintenance overhead from manually administering the large web of tiny services where complexity has been simply moved from code level to the architectural level.
