Technology Leadership | joerg-aulich.de

Smarter Requirements: How AI Changes the Game (Part 2)

joerg.aulich — Wed, 19 Mar 2025 19:11:00 +0000

Last week I started talking about using AI in requirements engineering in Part 1 of this series. This week I continue the story.

A Tale of Three Projects: How Things Go Sideways (and How They Don’t)

Let’s start with stories, because numbers rarely change minds, but lived experience does.

Project Northwind looked simple: a new customer portal, a clean UI, a few integrations, and a go-live date shoved neatly between two quarterly board meetings. What could go wrong? Quite a lot, as it turned out. The initial statement—“Make sign-up fast and secure”—read well on a slide. Yet nobody agreed on “fast,” and “secure” meant different things to different teams. Legal wanted consent tracking, operations wanted smooth onboarding for call center reps, and security wanted strict password rules that clashed with UX. The build moved forward while definitions lagged. You can guess the ending: rework, delays, and a go-live that was technically successful but emotionally exhausting.

Project Halcyon tried to avoid that fate by writing everything down—pages upon pages of requirements. The pendulum swung the other way. The document spelled out not only what the system should do, but also which components should do it and how they should talk to one another. Engineers felt boxed in before discovery even began. When load assumptions changed, the team struggled to adapt because design decisions were locked into the requirement set. A tidy plan led to brittle delivery.

Project Cedar took a different path. The team set a simple standard for requirement entries—ID, description, rationale, acceptance criteria, and type (functional or non-functional). They also set up a review rhythm. After each workshop, someone turned raw inputs—emails, chat snippets, meeting notes—into structured entries. Ambiguous phrases were marked with a friendly warning. Missing non-functional needs were suggested, not dictated. The team kept the requirements focused on outcomes, not mechanisms, and let design evolve within guardrails. Did it all go smoothly? Of course not. But the bumps were visible early, discussed openly, and resolved before they grew teeth.

These three projects hint at a theme: clarity beats volume, structure beats improvisation, and gentle discipline beats heavy-handed control.

What Makes Ambiguity So Tempting?

Ambiguity hides in our favorite words: “simple,” “intuitive,” “robust.” They’re comforting because they don’t force decisions. Why commit to a 300 ms response time when “snappy” feels easier? Why spell out availability targets when “high uptime” sounds nice? The trouble is, these words solve meeting discomfort, not system design.

The fix isn’t to strip language of personality; it’s to ground it. If a requirement uses subjective phrasing, pair it with something testable: a number, a threshold, a clear condition. This is where AI helps. It can spot the soft spots and nudge: “You wrote fast. Do you mean time-to-first-byte, full render, or task completion time? Suggest a metric.” The nudge matters because it turns preference into intent.
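To make that nudge concrete, here is a minimal sketch that flags vague terms and pairs each with a question. It assumes a hand-maintained word list; the names VAGUE_TERMS and flag_vague_terms are made up for illustration, and a real assistant would likely rely on a language model rather than keyword matching.

```python
import re

# Hypothetical starter list: each vague term maps to the question a reviewer should ask.
VAGUE_TERMS = {
    "fast": "Do you mean time-to-first-byte, full render, or task completion time?",
    "snappy": "Which response time threshold, in milliseconds?",
    "secure": "Which controls: encryption, session policy, audit trail?",
    "intuitive": "Which task should a first-time user complete unaided?",
    "robust": "Which failure should the system survive, and how quickly should it recover?",
    "high uptime": "Which availability target, over which period?",
}


def flag_vague_terms(text: str) -> list[tuple[str, str]]:
    """Return (term, prompt) pairs for each vague term found in the text."""
    findings = []
    for term, prompt in VAGUE_TERMS.items():
        # Word boundaries keep "fast" from matching inside "breakfast".
        if re.search(rf"\b{re.escape(term)}\b", text, flags=re.IGNORECASE):
            findings.append((term, prompt))
    return findings


for term, prompt in flag_vague_terms("Make sign-up fast and secure"):
    print(f"You wrote '{term}'. {prompt}")
```

The output is a prompt, not a verdict. The author of the requirement still picks the metric.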

The Missing Middle: Non-Functional Needs

Non-functional requirements are like plumbing—you don’t notice them until something smells off. Performance, security, resilience, accessibility, observability, data retention—none of these sparkle in a demo, yet they decide whether a platform holds up in the real world.

Teams miss them for boring reasons. People assume they’re implicit. Backlogs favor visible features. And when schedules get tight, the quiet items are first to slip. AI can’t force anyone to care, but it can hold up a mirror. If it sees a payment flow with no fraud controls called out, it can ask questions. If it finds a login journey without a session timeout policy, it can suggest one. That’s not bureaucracy; that’s muscle memory.

Scope Creep Isn’t Evil—It’s a Signal

Scope creep usually signals learning. Stakeholders see a demo and realize a gap. A new regulation appears. A dependency shifts. The answer isn’t “never change scope.” The answer is “change it with eyes open.”

Good change handling looks like this: an impact note that traces the requirement to screens, services, test cases, and operational runbooks. A short assessment of time, risk, and trade-offs. A visible decision with a rationale. AI can assemble the first draft of that picture by following links and previous patterns. Decision-makers still decide—but now they’re deciding with context rather than gut feel alone.
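The "following links" part can be sketched as a plain graph walk. This assumes the links between requirements, screens, services, tests, and runbooks are already recorded somewhere; the LINKS dictionary, its identifiers, and the function name are hypothetical.

```python
from collections import deque

# Hypothetical link graph: each artefact points at the artefacts that depend on it.
LINKS = {
    "REQ-143": ["screen:address-form", "service:address-normalizer"],
    "service:address-normalizer": ["test:normalizer-suite", "runbook:normalizer-oncall"],
    "screen:address-form": ["test:address-form-e2e"],
}


def impacted_artefacts(start: str, links: dict[str, list[str]]) -> list[str]:
    """Breadth-first walk over the link graph, returning everything reachable from start."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        node = queue.popleft()
        for neighbour in links.get(node, []):
            if neighbour not in seen:  # guards against cycles and duplicate links
                seen.add(neighbour)
                order.append(neighbour)
                queue.append(neighbour)
    return order


for artefact in impacted_artefacts("REQ-143", LINKS):
    print(artefact)
```

The list is only the raw material for an impact note. The assessment of time, risk, and trade-offs still comes from people.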

Stakeholders Don’t Need More Documents; They Need Better Conversations

Stakeholder friction rarely stems from a lack of information. It stems from mismatched views. Executives think in goals and risk. Engineers think in constraints. Operations think in stability and cost-to-serve. The wrong artefact for the wrong audience creates friction.

AI can reframe. The same requirement can be summarized three ways without twisting meaning: a goal statement for leaders, acceptance criteria for testers, and integration notes for teams who run the thing at 3 a.m. That isn’t pandering; it’s respectful. It meets people where they are.

Over-Specification: The Quiet Thief of Agility

Telling engineers how to implement a requirement can feel helpful. It’s also a fast path to regret. When the world shifts—new constraints, new data—those baked-in decisions turn flexible architecture into poured concrete. Keep requirements about outcomes. If you must record a design decision early, mark it as a decision, not a requirement. That tiny act of labeling preserves room to improve.

Traceability Without the Pain

Traceability often dies under its own weight. Teams picture delicate diagrams, stale spreadsheets, and a chorus of sighs. It doesn’t have to be that way. The light version:

Every requirement has an ID.
Commits reference IDs.
Tests reference IDs.
A report shows coverage: which requirements have code, which have tests, which have neither.

AI helps by suggesting links rather than demanding them. “This change set mentions address normalization—likely connected to REQ-143. Link?” One click. Done. The goal isn’t to impress auditors. It’s to know what you built and why it exists.
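The coverage report from the light version needs very little machinery. A minimal sketch, assuming IDs follow the REQ-123 pattern and appear verbatim in commit messages and test names; the function names and sample data are invented for illustration.

```python
import re

REQ_ID = re.compile(r"\bREQ-\d+\b")  # assumed ID format, matching the REQ-143 example


def referenced_ids(texts: list[str]) -> set[str]:
    """Collect every requirement ID mentioned in a list of strings."""
    ids = set()
    for text in texts:
        ids.update(REQ_ID.findall(text))
    return ids


def coverage_report(requirements, commit_messages, test_names):
    """For each requirement, record whether any commit or test references it."""
    in_code = referenced_ids(commit_messages)
    in_tests = referenced_ids(test_names)
    return {
        req: {"code": req in in_code, "tests": req in in_tests}
        for req in requirements
    }


report = coverage_report(
    requirements=["REQ-143", "REQ-144", "REQ-145"],
    commit_messages=["REQ-143: normalize postal codes", "REQ-144 add consent flag"],
    test_names=["test_postal_code_normalized (REQ-143)"],
)
for req, status in report.items():
    print(req, status)
```

In this sample, REQ-145 has neither code nor tests, which is exactly the kind of row worth a conversation.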

What Standardization Actually Means (and Doesn’t)

Standardization gets a bad rap because it’s confused with uniformity. The point isn’t to make every project identical. The point is to make them legible to one another. Legibility gives you reuse, shared learning, and faster onboarding.

A light standard can live on a single page:

ID: Machine-friendly, human-memorable.
Type: Functional, non-functional, or regulatory.
Description: Outcome-focused, not mechanism-focused.
Rationale: Why this matters; the business or risk story.
Acceptance Criteria: Observable, verifiable conditions.
Links: To decisions, designs, code, tests, and runbooks.
Status & History: Proposed → Agreed → Implemented → Verified; with dated changes.

If that sounds dull, good. Boring is stable.
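For teams that keep requirements in code or in a tool with an API, the one-page standard maps onto a small data structure. This is a sketch under assumptions: the field names (req_id, kind, and so on) are my own, and the rule that a status may only move one step forward is one possible reading of the arrow sequence above.

```python
from dataclasses import dataclass, field
from datetime import date
from enum import Enum


class Status(Enum):
    PROPOSED = "Proposed"
    AGREED = "Agreed"
    IMPLEMENTED = "Implemented"
    VERIFIED = "Verified"


# Enum members iterate in definition order, which is the allowed progression.
_ORDER = list(Status)


@dataclass
class Requirement:
    req_id: str
    kind: str  # "functional", "non-functional", or "regulatory"
    description: str
    rationale: str
    acceptance_criteria: list[str]
    links: list[str] = field(default_factory=list)
    status: Status = Status.PROPOSED
    history: list[tuple[date, Status]] = field(default_factory=list)

    def advance(self, on: date) -> None:
        """Move to the next status and record the dated change."""
        position = _ORDER.index(self.status)
        if position == len(_ORDER) - 1:
            raise ValueError(f"{self.req_id} is already verified")
        self.status = _ORDER[position + 1]
        self.history.append((on, self.status))


req = Requirement(
    req_id="REQ-143",
    kind="functional",
    description="Customer addresses are stored in a normalized form.",
    rationale="Duplicate addresses cause failed deliveries.",
    acceptance_criteria=["Two spellings of the same address resolve to one record."],
)
req.advance(date(2025, 3, 19))
print(req.status.value, req.history)
```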

Turning Messy Inputs Into Clean Output

Here’s a day-in-the-life scene. Monday morning, your inbox groans. A regional sales lead has sent a “quick thought,” a support manager has forwarded an escalation, and compliance has attached a PDF with cheerful highlights. Meanwhile, your chat stream has a thread titled “Crazy idea, hear me out,” and your calendar holds a two-hour workshop that will absolutely run over.

AI acts like a diligent note-taker. It extracts requirement candidates, threads them back to sources, surfaces contradictions, and drafts entries that follow your standard. It flags uncertainty, not with scolding, but with prompts: “You mention fast. Consider a threshold.” “No acceptance criteria yet. Want suggested checks?” It’s the difference between starting from a blank page and editing a first draft.

Validation: Bring Testing Forward

Nothing makes a requirement more real than a test written early. Even a simple one. If your requirement says, “Customers can reset their password,” a quick script for the happy path changes the conversation. It tells you whether the requirement is clear, whether edge cases exist, and whether the success condition is observable.

AI can produce skeleton tests or BDD-style scenarios. These aren’t replacements for skilled testers; they’re conversation starters. They give stakeholders something to react to beyond words on a page.
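Here is what such a skeleton might look like for the password reset example. It is a sketch, not a real implementation: InMemoryAccounts is a hypothetical stand-in for the system, written only so the scenario can run before any production code exists.

```python
import secrets


class InMemoryAccounts:
    """Stand-in for the real system so the scenario can run before any code exists."""

    def __init__(self):
        self._passwords = {"ada@example.com": "old-secret"}
        self._tokens = {}

    def request_reset(self, email: str) -> str:
        # Open question for the review: what should happen for an unknown email?
        token = secrets.token_urlsafe(16)
        self._tokens[token] = email
        return token

    def reset_password(self, token: str, new_password: str) -> bool:
        email = self._tokens.pop(token, None)  # tokens are single-use
        if email is None:
            return False
        self._passwords[email] = new_password
        return True

    def can_log_in(self, email: str, password: str) -> bool:
        return self._passwords.get(email) == password


def test_customer_can_reset_password():
    # Given a customer with an existing account
    accounts = InMemoryAccounts()
    # When they request a reset and choose a new password
    token = accounts.request_reset("ada@example.com")
    assert accounts.reset_password(token, "new-secret")
    # Then the new password works and the old one does not
    assert accounts.can_log_in("ada@example.com", "new-secret")
    assert not accounts.can_log_in("ada@example.com", "old-secret")
    # And the reset token cannot be reused
    assert not accounts.reset_password(token, "another-secret")


test_customer_can_reset_password()  # a test runner would discover this by name instead
```

Even this small script surfaces questions the one-line requirement never asked: token reuse, unknown emails, and what "success" looks like from the outside.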

Design Without Premature Commitments

An enterprise can’t freeze design while it gathers requirements. Work happens in parallel. The trick is to reduce irreversible choices while knowledge is still forming. Name decisions clearly. Record assumptions. Keep a short, living risk log. Teach teams to treat early choices as pilots rather than monuments.

When AI spots language that looks like a design prescription disguised as a requirement, it can ask: “Is this a requirement or a design decision?” That question, asked at the right moment, protects future flexibility.

The Reference Architecture, Told as a Story

Imagine a river with six gentle falls from source to basin.

Sources: Raindrops of information—emails, chats, tickets, transcripts, policies—gather into streams.
Ingestion: The river is filtered. Sediment is removed; rocks are tagged. Audio becomes text; PDFs become paragraphs.
AI Processing: The water clears. Patterns appear. Similar stones cluster together. Outliers stand out. Drafts form.
Standardization & Compliance: The river runs through a straight channel. Entries take a shape—the one-pager structure everyone knows. Compliance checks the banks.
Output & Integration: The water feeds fields. Repositories get updated. Dashboards show coverage. Stakeholders see what they need without squinting.
Governance & Feedback: Sensors along the banks note changes. People review, correct, and refine. The river learns, season after season.

This isn’t poetry for its own sake. It’s a reminder that movement matters. Stagnant pools breed bugs. Flow turns mess into value.

Compliance Isn’t a Department; It’s a Habit

Compliance teams are often cast as the people who say “no.” That’s unfair and unwise. Treat them as partners early. Ask which obligations carry strict wording and which allow interpretation. Capture those constraints as requirements with IDs of their own. Tie them to tests that can be run often.

When obligations change—and they do—AI can highlight affected requirements and tests. What could be a scramble becomes a checklist.

Observability Starts at Requirements Time

You can’t operate what you can’t see. If a requirement is important enough to build, it’s important enough to observe. That means attaching an operational signal to it: a log line, a metric, an alert condition. “Customers receive order confirmations within two minutes” becomes real when there’s a timer to measure it and a report that shows performance across days.

AI can suggest observability hooks. It can remind teams that done is more than merged: it’s measurable in production.
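The order confirmation example translates into a very small measurement. A sketch, assuming you can get pairs of timestamps (order placed, confirmation sent) from your logs; the function name and the sample data are illustrative.

```python
from datetime import datetime, timedelta

THRESHOLD = timedelta(minutes=2)  # taken from the requirement text


def share_within_threshold(events, threshold=THRESHOLD):
    """events: iterable of (order_placed_at, confirmation_sent_at) pairs."""
    events = list(events)
    if not events:
        return None  # "nothing to measure" is different from 100 percent
    on_time = sum(1 for placed, sent in events if sent - placed <= threshold)
    return on_time / len(events)


base = datetime(2025, 3, 19, 9, 0, 0)
sample = [
    (base, base + timedelta(seconds=40)),
    (base, base + timedelta(seconds=95)),
    (base, base + timedelta(minutes=2)),  # exactly on the threshold counts as on time
    (base, base + timedelta(minutes=5)),
]
print(share_within_threshold(sample))  # 0.75
```

Run per day, this gives the report the requirement asks for. Whether "within" includes the boundary is itself a decision worth writing down.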

The Human Loop: Reviews That People Don’t Dread

No one wakes up excited for a two-hour requirement review. Make them shorter, more frequent, and focused. Send pre-reads. Start with the riskiest or most ambiguous items. Time-box the rest. Use comments rather than monologues. Celebrate deletions—dead requirements don’t haunt releases.

AI can tee this up by sorting items by risk, novelty, and dependency. It can remind you that REQ-208 touches three services and affects a holiday season peak. That little nudge shapes the meeting agenda in a useful way.
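The sorting itself is simple once the inputs exist. A minimal sketch, assuming the team assigns a risk score by hand and that novelty and dependency are reduced to a flag and a count; the ReviewItem fields and the sort order are one possible choice, not a rule.

```python
from dataclasses import dataclass


@dataclass
class ReviewItem:
    req_id: str
    risk: int              # 1 (low) to 5 (high), assigned by the team
    is_new: bool           # not reviewed before
    services_touched: int  # rough proxy for dependency


def review_order(items):
    """Highest risk first, then new items, then items touching more services."""
    return sorted(items, key=lambda i: (-i.risk, not i.is_new, -i.services_touched))


agenda = review_order([
    ReviewItem("REQ-101", risk=2, is_new=True, services_touched=1),
    ReviewItem("REQ-208", risk=4, is_new=False, services_touched=3),
    ReviewItem("REQ-310", risk=4, is_new=True, services_touched=1),
])
print([item.req_id for item in agenda])
```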

Metrics That Actually Help

Measure what improves behavior, not what looks tidy on a dashboard.

Useful signals include:

Ambiguity rate: Share of entries with flagged vague terms, trending down over sprints.
Coverage: Percentage of requirements with linked tests and code.
Change clarity: Fraction of scope changes with an impact note and decision record.
Lead time: Days from requirement proposed to verified.
Defect linkage: Bugs traced back to missing or unclear requirements.

If a metric triggers gaming, toss it. If it sparks a real conversation, keep it.
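Three of these signals can be computed from a plain export of the requirement entries. A sketch, assuming each entry carries a count of flagged vague terms, two link flags, and two dates; the field names and sample data are hypothetical.

```python
from datetime import date

entries = [
    {"id": "REQ-1", "vague_flags": 1, "has_tests": True, "has_code": True,
     "proposed": date(2025, 1, 6), "verified": date(2025, 1, 20)},
    {"id": "REQ-2", "vague_flags": 0, "has_tests": True, "has_code": True,
     "proposed": date(2025, 1, 6), "verified": date(2025, 1, 16)},
    {"id": "REQ-3", "vague_flags": 0, "has_tests": False, "has_code": True,
     "proposed": date(2025, 1, 8), "verified": None},
    {"id": "REQ-4", "vague_flags": 2, "has_tests": False, "has_code": False,
     "proposed": date(2025, 1, 9), "verified": None},
]


def ambiguity_rate(entries):
    """Share of entries with at least one flagged vague term."""
    return sum(1 for e in entries if e["vague_flags"] > 0) / len(entries)


def coverage(entries):
    """Share of entries linked to both tests and code."""
    return sum(1 for e in entries if e["has_tests"] and e["has_code"]) / len(entries)


def mean_lead_time_days(entries):
    """Average days from proposed to verified, over verified entries only."""
    durations = [(e["verified"] - e["proposed"]).days for e in entries if e["verified"]]
    return sum(durations) / len(durations) if durations else None


print(ambiguity_rate(entries), coverage(entries), mean_lead_time_days(entries))
```

The numbers matter less than the trend and the conversation they start.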

AI Pitfalls—and How to Avoid Them

AI is powerful and fallible. Four traps to watch:

Overconfidence: A smooth summary doesn’t equal truth. Keep the review step. Always.
Drift: Models learn from new data. That’s good until it isn’t. Schedule checks. Keep a small, curated set of gold-standard examples.
Privacy: Requirements often include sensitive context. Govern who sees what. Mask data where you can.
Bias: If past projects sidelined certain needs, a model can learn that habit. Counter with explicit guardrails—non-functional prompts, compliance lists, diversity of inputs.

The antidote to all four is the same: human judgment and simple rules you actually follow.

A 30–60–90 Day Adoption Plan

You can’t flip a switch and change culture. You can, however, stack small wins.

Days 1–30

Write the one-page standard and socialize it.
Pick one pilot team and one product slice.
Turn transcripts and emails into draft entries using AI; review together.
Add IDs to commits and tests. Keep it simple.

Days 31–60

Start change notes for scope shifts. Short, factual, linked.
Add ambiguity checks to your definition of ready.
Publish a tiny dashboard: coverage, ambiguity rate, lead time.

Days 61–90

Tie key requirements to observability signals.
Establish a rotating review squad from different functions.
Hold a retrospective: what to keep, what to drop, what to refine.

Three months won’t fix everything. It will build momentum and trust.

Global Teams, Local Realities

Large enterprises span cultures and time zones. Words carry different weight in different places. A “yes” may mean “I hear you,” not “I agree.” A “quick change” may be polite shorthand for “this is strategically important.”

Write requirements for a global audience: clear, literal, free of local idioms. Pair async reviews with short live sessions. Rotate meeting times so the same region isn’t always drinking midnight coffee. Small signals of respect buy a lot of goodwill.

Vendor and Partner Dance

Few enterprises build alone. Partners bring expertise and capacity—and their own habits. Share your standard early. Ask theirs. Map the differences. Keep a shared glossary. If IDs differ across systems, set up a translation table rather than fighting over naming.

Change control is trickier with external parties. AI can help by tracking cross-organization dependencies and reminding both sides when a decision in one place affects a milestone in another. Clear beats clever. Repetition beats surprise.

Security as a First-Class Requirement

Security isn’t a feature; it’s a property. Treat it like performance—measurable, discussed early, tracked over time. Define what “secure” means in context: encryption at rest and in transit, key rotation, session policies, rate limits, audit trails. Write them down as requirements with acceptance criteria you can test, not as warnings in a slide deck.

AI can surface typical gaps and flag risky phrasing. It’s not a gatekeeper; it’s a flashlight.

Accessibility Is Not a Nice-to-Have

If your system can’t be used by people with different abilities, you’ve limited your market and invited risk. More importantly, you’ve missed the point of building software for humans. Bake accessibility into requirements, not as a catch-all note but as specific, testable statements: keyboard navigation, color contrast, screen reader support, captions. Treat this like any other quality—not optional, not later.
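Color contrast is a good example of a testable statement, because WCAG 2.x defines a contrast ratio formula and sets 4.5:1 as the Level AA minimum for normal-size text. The sketch below implements that formula; the function names are my own, and it covers only this one criterion, not accessibility as a whole.

```python
def _linear(channel: int) -> float:
    """Convert an 8-bit sRGB channel to its linear value, per the WCAG 2.x definition."""
    c = channel / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb: tuple[int, int, int]) -> float:
    r, g, b = (_linear(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(fg: tuple[int, int, int], bg: tuple[int, int, int]) -> float:
    """Ratio of the lighter to the darker color, each offset by 0.05."""
    lighter, darker = sorted(
        (relative_luminance(fg), relative_luminance(bg)), reverse=True
    )
    return (lighter + 0.05) / (darker + 0.05)


AA_NORMAL_TEXT = 4.5

white = (255, 255, 255)
for name, color in [("black", (0, 0, 0)), ("mid grey", (119, 119, 119))]:
    ratio = contrast_ratio(color, white)
    verdict = "passes" if ratio >= AA_NORMAL_TEXT else "fails"
    print(f"{name} on white: {ratio:.2f}:1, {verdict} AA for normal text")
```

A requirement such as "body text meets a 4.5:1 contrast ratio against its background" can then be checked automatically against the design tokens.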

Performance in the Real World

Response times that look fine in a lab can crumble under peak traffic. Tie requirements to realistic loads and seasonal patterns. A retail site behaves differently in late November than in April. A travel platform shifts with holiday waves. Name those periods in your requirements. Attach test data that mirrors them. Add watchpoints to production and review them together.

The Power of Deletion

It’s thrilling to add requirements. It’s wise to retire them. Old constraints linger longer than anyone expects. Every quarter, ask: which requirements no longer reflect reality? Which were temporary? Which emerged from a workaround that no longer exists? Deleting with care is a mark of maturity.

AI can propose candidates for retirement by spotting unused test links or code paths with no recent activity tied to them. Use judgment, not autopilot—but have the conversation.

Story-Driven Templates That People Actually Use

Templates fail when they fight the way people think. Make them conversational:

As a [role], I need [capability], so [benefit].
Because [risk/goal], the system shall [observable behavior].
We’ll know we’re done when [acceptance criterion].

This blend of story and verification lowers the barrier for non-engineers and keeps engineers focused on outcomes. AI can keep the cadence consistent without turning it robotic.
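Keeping the cadence consistent can be as plain as a function that fills the three sentences and refuses to skip a slot. A small sketch; the function and parameter names are invented for illustration.

```python
def render_requirement(role, capability, benefit, driver, behavior, criterion):
    """Fill the three-sentence template, rejecting empty slots."""
    fields = {
        "role": role, "capability": capability, "benefit": benefit,
        "driver": driver, "behavior": behavior, "criterion": criterion,
    }
    empty = [name for name, value in fields.items() if not value.strip()]
    if empty:
        raise ValueError(f"Missing template fields: {', '.join(empty)}")
    return "\n".join([
        f"As a {role}, I need {capability}, so {benefit}.",
        f"Because {driver}, the system shall {behavior}.",
        f"We'll know we're done when {criterion}.",
    ])


print(render_requirement(
    role="call center rep",
    capability="to resend an order confirmation",
    benefit="customers are not left waiting",
    driver="missing confirmations drive repeat calls",
    behavior="resend the confirmation on request",
    criterion="a resent confirmation arrives within two minutes",
))
```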

When Speed Matters—and When It Doesn’t

Not every requirement needs a stopwatch. Some need clarity of flow or completeness of data. Be selective. Over-quantifying everything can produce a forest of numbers no one respects. Under-quantifying breeds drift. Strike balance through review and experience. Encourage teams to annotate why a metric was chosen or why a narrative standard suffices.

Frequently Asked (and Quietly Worried) Questions

“Will AI replace our analysts?” No. It will make their work saner by taking on the tedious parts. The hard work—trade-offs, negotiation, context—stays human.

“Can we trust automated links and summaries?” Treat them as drafts. Validate, correct, and move on. Over time, quality improves.

“What about sensitive content?” Define clear handling rules. Mask where feasible. Limit who can view raw sources. Keep logs of access.

“How do we keep from drowning in process?” Keep the standard short. Measure few things well. Review little and often. If a step doesn’t help, drop it.

A Seasonal Note: Peak Season Pressure

Every enterprise has its crunch periods. Year-end closing. Summer travel spikes. Holiday shopping. Write seasonality into requirements. Tie rehearsal drills to those waves. Let AI look back at last year’s signals and remind you where things creaked. Future you will send a thank-you note.

Closing: Quiet Confidence Over Noise

Strong requirements work doesn’t shout. It reads clean, tells a clear story, and leaves traces you can follow months later. With AI as a steady helper, you’ll catch ambiguity sooner, fill gaps faster, and handle change with fewer theatrics. The craft becomes calmer. Release nights feel less like cliff dives and more like well-timed steps.

That’s the point—not ceremony, not perfection. Just steady outcomes that match what people actually need. Less chaos. More quiet confidence. And software that does the job.

Smarter Requirements: How AI Changes the Game (Part 1)

joerg.aulich — Mon, 10 Mar 2025 14:16:56 +0000

You’d think requirements engineering would be easy by now. After all, decades of methodology, tooling, and frameworks have gone into it. Universities teach courses on it, certifications exist for it, and every seasoned engineer has war stories about requirements gone wrong. Yet projects still run off the rails, and fingers still point to “bad requirements” as the root cause.

So why is this practice so tricky? The truth is, requirements engineering lives at the messy intersection of human communication, organizational politics, and technical reality. It’s where abstract business desires collide with engineering feasibility. And it’s here that even seasoned professionals stumble.

Let’s explore the common pitfalls, the costs they incur, and how artificial intelligence can play a role in making this discipline less of a headache. Along the way, we’ll also cover why standards matter, what good documentation looks like, and how AI fits into a reference architecture for enterprise requirements engineering.

The Classic Pitfalls That Refuse to Go Away

Anyone who’s ever written or read requirements knows the pain points:

Ambiguity: Words like fast, secure, or user-friendly sound fine in meetings but unravel when engineers ask, “How fast? How secure?”
Incomplete coverage: Functional requirements get captured, but non-functional ones—like compliance, scalability, or resilience—slip through the cracks.
Scope creep: A few extra “must-haves” sneak in, and before you know it, deadlines are impossible.
Stakeholder friction: Marketing wants innovation, compliance wants control, and operations want stability. Who wins? Too often, the loudest voice.
Over-specification: Requirements dictate design choices too early, cutting off better options.
Lack of traceability: No clear links between requirements, design, code, and tests. Nobody’s sure if the end product matches the original intent.
Poor validation: Requirements that aren’t testable or measurable sneak through the net.

These aren’t theoretical risks. They show up every day in corporate projects, where complexity and scale amplify every misstep.

The Consequences in Corporate Life

In small teams, missed requirements are painful but manageable. In enterprises, they can be devastating. A missed compliance requirement may mean fines or legal trouble. An overlooked scalability need may cause outages that make headlines. A lack of traceability may cripple audits or erode customer trust.

The costs multiply as errors travel downstream. Fixing an unclear requirement during ideation is cheap. Fixing it after release is brutally expensive. There’s an old engineering adage: every stage you delay fixing a requirements issue increases cost tenfold. Enterprises live this reality all too often.

The impact isn’t just monetary. Broken trust between business and IT, frustrated engineers burning out from endless firefighting, and mounting technical debt all leave scars. In global organizations, different regions and business units pull in different directions, making the problem worse. Vendors and outsourcing arrangements add more moving parts. What could be a minor hiccup in a startup can escalate into a multimillion-dollar disaster in a large corporation.

AI as a Wingman, Not a Savior

Artificial intelligence has become the buzzword solution to everything, but let’s be clear: it won’t solve office politics or human indecision. What it can do is act as an untiring assistant, spotting issues, consolidating inputs, and suggesting improvements.

Think of AI as the junior analyst who never gets tired. It can:

Flag ambiguous wording.
Suggest clearer phrasing.
Highlight contradictions across documents.
Cluster similar requirements together.
Track dependencies and impacts when things change.

It doesn’t replace the judgment of experienced professionals. It lightens the load so they can spend time where their expertise really matters—negotiating trade-offs, understanding business drivers, and guiding design.

From Chaos to Clarity: Making Use of Everyday Inputs

Requirements rarely start life as clean, structured statements. They’re born in:

Emails from stakeholders.
Chat threads full of half-formed ideas.
Meeting transcripts.
Issue trackers.
Regulatory documents.

Traditionally, analysts had to comb through all this noise manually. AI changes that. It can parse communication streams, extract requirement-like statements, and organize them. Meeting transcripts become structured summaries with decisions, open issues, and draft requirements. Email chains become categorized and deduplicated.

Picture last week’s heated workshop. Five managers argued, three decisions got made, two got deferred, and one person stormed out. Instead of leaving with scattered notes, AI generates a summary: what was decided, what’s pending, and which points look like requirements. Imperfect? Sure. But miles better than relying on memory or sticky notes.

Standards Aren’t Boring—They’re Liberating

Talk of standards often makes teams groan. Templates, checklists, forms—it sounds like bureaucracy. But standards aren’t the enemy. They’re the shared grammar that keeps chaos at bay.

A solid requirement is:

Atomic: One clear statement, not a bundle.
Testable: You can check if it’s met.
Traceable: It links to design, code, and tests.
Structured: With IDs, rationale, acceptance criteria.

Think of it like cooking. Saying “make dinner” yields chaos. Saying “make a pasta dish with 200g spaghetti, boiled for 10 minutes, served with tomato sauce” creates consistency. Standards don’t kill agility—they enable collaboration.

What a Standard Should Look Like Without Killing Agility

The best standards are lightweight but effective. A simple template works wonders: ID, description, rationale, priority, acceptance criteria. Separate functional from non-functional requirements. Keep statements clear, singular, and versioned.

Agile teams sometimes fear that documentation slows them down. But the irony is, good standards save time. Less time arguing over what “fast” means. Less time fixing preventable mistakes later. Documentation isn’t bureaucracy—it’s efficiency.

How AI Fits the Puzzle

Here’s where the synergy shows. AI can take messy inputs and reshape them into structured requirements. That vague statement “System should be secure” transforms into:

Requirement ID: SEC-001
Type: Non-functional
Description: The system shall encrypt customer data at rest using AES-256.
Acceptance Criteria: Verify database encryption with AES-256.

AI can prompt for missing fields, validate compliance, and cross-reference new inputs against existing requirements. It can generate test cases, create mock-ups, and suggest workflows. It turns noise into order.
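Prompting for missing fields is the easiest of these to picture. A minimal sketch, assuming entries are plain dictionaries and that the required fields are the ones from the simple template above plus the type; the names are illustrative.

```python
# Assumed field list, based on the simple template described earlier plus the type.
REQUIRED_FIELDS = (
    "id", "type", "description", "rationale", "priority", "acceptance_criteria",
)


def missing_fields(entry: dict) -> list[str]:
    """Return required fields that are absent or empty, in template order."""
    return [name for name in REQUIRED_FIELDS if not entry.get(name)]


sec_001 = {
    "id": "SEC-001",
    "type": "Non-functional",
    "description": "The system shall encrypt customer data at rest using AES-256.",
    "acceptance_criteria": "Verify database encryption with AES-256.",
}

for name in missing_fields(sec_001):
    print(f"SEC-001 has no {name} yet. Want a suggestion?")
```

Run against the SEC-001 entry as written above, the check points out that rationale and priority are still open.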

The Reference Architecture: From Input to Governance

Imagine the process as a supply chain:

Input Sources – Emails, chats, tickets, documents, regulations.
Ingestion & Preprocessing – Parsing, cleaning, tagging.
AI Processing – Clarity checks, clustering, linking, test case generation.
Standardization & Compliance – Applying templates, verifying testability, ensuring regulations are met.
Output & Integration – Feeding requirements into repositories, dashboards, and tools.
Governance & Feedback – Human oversight, corrections, iterative learning.

This isn’t static. With each cycle, AI improves. With each correction, the system learns. With each project, governance builds trust.

Culture, Governance, and Trust

The more AI is used, the more vital people become. AI can flag ambiguity, but humans interpret context. AI can propose test cases, but humans decide what matters. Without human oversight, AI becomes noise. With it, AI becomes a partner.

Governance enforces accountability: version histories, rationales, approvals. It’s not red tape—it’s how organizations avoid chaos and prove compliance. Culture matters too. If teams see AI as a threat, they resist. If they see it as a helper, they embrace it. Adoption hinges on trust.

Why This Matters More Than Ever

Requirements engineering isn’t glamorous, but it’s the bedrock of enterprise software. Get it right, and you deliver systems that last, satisfy customers, and pass audits. Get it wrong, and you waste money, frustrate teams, and invite risk.

AI won’t erase the human messiness of corporate life, but it can make requirements clearer, faster, and more reliable. That means fewer nasty surprises, fewer compliance nightmares, and more energy spent building rather than arguing.

And really, isn’t that the point? Software that does what it’s supposed to do, built without unnecessary chaos.

The post continues with part 2.

From Domain-Driven Design to Event-Driven Micro-services: A Migration Playbook

joerg.aulich — Fri, 21 Jun 2024 20:26:00 +0000

Introduction

Let’s be honest—most of us built our first successful platforms the way everyone did in the 2010s: a big relational database, a monolithic app, and a pile of shared utilities humming quietly in the background. It worked—beautifully, even. Until it didn’t.

Back then, deployments were rare, regulators were quieter, and a change request didn’t feel like prepping for a moon landing. Fast forward to today, and that same simplicity has become our Achilles’ heel. A tiny bug fix? It requires a full redeploy. One team’s schema change? It breaks another’s feature without warning. And those shared tables? They’ve turned collaboration into a minefield. You tweak something for one product line, and suddenly three others are calling you in a panic. Sound familiar?

Some folks thought the answer was just to break things apart—slap a “micro-service” label on a few APIs and call it a day. But when everything is still synchronous, and no one really owns the data, you end up with a distributed mess. Calls fail. Systems spiral. No one knows which version of truth to believe. We’ve all seen those “micro-service” setups where just looking at a dashboard makes you nervous.

But here’s the good news: there’s a better way—and it’s not a mystery anymore. Over the past years, a repeatable path has emerged. It doesn’t start with tools or technology. It starts with rediscovering your business through Domain-Driven Design (DDD). You map what your business really does, define shared language that actually makes sense to everyone, and carve out boundaries that reflect real accountability.

Then, once you’ve drawn those lines, you connect them—not with brittle APIs—but with durable, auditable, immutable events. Think Kafka, Pulsar, or whatever log-based system fits your environment. You phase things out using strangler patterns, keep data consistent with outbox strategies and sagas, and you test contracts—not just features—so that change becomes something your teams don’t dread.

This playbook walks you through all of it, chapter by chapter. No vendor fluff, no motivational quotes—just practical, field-tested advice. Here’s what we’ll unpack together:

Why starting with DDD matters right now—not six months from now.
How to understand what you’ve already built, technically and organizationally.
How to map your domains and align your teams to them.
How to design an event model that won’t collapse under change.
How to select the right infrastructure backbone—and avoid surprises.
How to actually carve out services from a monolith safely.
How to guarantee resilience when things go sideways (because they will).
How to test, trace, and govern what you build—without slowing delivery.
How to upskill your teams and navigate the human side of all this.
And finally, how to avoid the most common traps we’ve all fallen into.

If you follow the guidance, by the middle of the year you could already have one bounded context running independently, emitting auditable events, and—get this—delivering change without fear.

Let’s get started.

What’s the Rush?

If you’re wondering why there’s such urgency around modernising software architecture lately, it’s not just a passing trend—it’s real pressure, coming from all sides. Regulators, competitors, and even your own finance department are turning up the heat. And if your systems are still monolithic, that heat feels like it’s boiling the whole pot.

Let’s talk about regulation for a second. In March 2024, the European Parliament adopted the Artificial Intelligence Act, and in May the Council of the EU gave its final approval. That’s not a distant threat; the obligations phase in over the next few years. If your platform includes any “high-risk” AI components, you’ll be legally on the hook for proving things like data origin, audit trails, and post-deployment monitoring. That’s a tall order for a monolith with shared data layers and spaghetti code. It’s almost laughable to imagine generating a reliable audit trail when you can’t even separate logs by team or feature. Deployments bundled into multi-hour windows? They bury any chance of traceability. Good luck meeting transparency standards when everything’s lumped together.

And regulation isn’t your only fire. Let’s pivot to performance. The 2024 DORA State of DevOps report drew a hard line between winners and laggards. The elite teams? They deploy several times a day and recover from outages in under an hour. The rest? Monthly deploys and multi-day outages. That gap doesn’t just show up in engineering metrics—it hits revenue. Fast movers test, ship, and iterate faster than traditional shops can even scope a feature. Their speed isn’t luxury—it’s competitive edge.

Now mix in a financial squeeze. After a string of interest rate hikes, finance leaders are asking tougher questions. “Why is our cloud bill still climbing?” According to Everest Group, average overspend sits north of 10%. In the UK alone, one study pinned delays and deployment drag at more than £100,000 per company. Not a rounding error.

This trio—legal scrutiny, operational expectation, and financial discipline—makes clinging to a monolith a dangerous game. You’re either agile and auditable, or you’re struggling and exposed.

So What Does a DDD-First, Event-Driven Architecture Actually Give You?

It gives you air to breathe.

Here’s the short list—but make no mistake, these benefits aren’t isolated. They stack. They multiply.

Autonomous teams: Each one owns its context—its code, its data, its release timeline. No more waiting for an ops window or tiptoeing around a central DB.
Compliance by construction: Events are immutable, timestamped, and self-describing. They aren’t just useful—they’re legally defensible.
Scalability with intent: Need to ramp up fraud scoring without touching checkout logic? No problem. Scale what matters, when it matters.
Focused innovation: A team working on a new feature in one context doesn’t need five sign-offs from risk management, operations, and legacy platform leads. That isolation is freedom.

And the ripple effects? Fewer meetings, tighter sprints, fewer late-night incident calls, and—this is a big one—the confidence to move fast even when the rules get stricter.

Taking Stock of the Current Landscape

You Can’t Change What You Don’t Understand

Here’s something we’ve all seen: a team dives into migration without truly grasping what they’ve built—or inherited. Then, halfway through, the project stalls because of “unexpected dependencies” or “surprise compliance blockers.” Sound familiar?

Truth is, every failed migration has one thing in common: it underestimated the mess. That’s why the first real move in this journey isn’t coding—it’s seeing.

Start with Business Capabilities, Not Code

Engineers love to open an IDE and trace function calls. But that’s the wrong door to walk through first.

Instead, start with your product managers. Ask them for a full list of your business capabilities—the stuff that actually earns you money or supports someone who does. Things like: checkout, recommendations, user profiles, fraud scoring.

Then attach meaningful data to each capability:

How much revenue does it drive?
How often does it change?
Is it subject to regulatory scrutiny?

Now make that list visual. Use a Miro board or Mural canvas and create a heatmap. The visual feedback is immediate. Suddenly, the high-value, high-risk zones pop out. These are your pressure points, and likely candidates for early refactoring.
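If you want numbers behind the colors, a small scoring pass can seed the heatmap. This is only a sketch: the capabilities, figures, and weights below are invented for illustration, and the weighting formula is an assumption you should tune with your product managers.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Capability:
    name: str
    revenue_share: float      # fraction of revenue, 0.0 to 1.0 (made-up numbers)
    changes_per_month: int    # how often the code behind it changes
    regulated: bool           # subject to regulatory scrutiny?

def heat_score(cap: Capability) -> float:
    """Hypothetical weighting: revenue and churn drive heat, regulation adds a flat bump."""
    churn = min(cap.changes_per_month / 20, 1.0)  # treat 20+ changes a month as maximum churn
    bump = 0.2 if cap.regulated else 0.0
    return round(0.5 * cap.revenue_share + 0.3 * churn + bump, 3)

def rank(capabilities: list[Capability]) -> list[tuple[str, float]]:
    """Return (name, score) pairs, hottest first."""
    scored = [(c.name, heat_score(c)) for c in capabilities]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

CAPABILITIES = [
    Capability("checkout", 0.60, 12, True),
    Capability("recommendations", 0.15, 18, False),
    Capability("user profiles", 0.05, 4, True),
    Capability("fraud scoring", 0.20, 8, True),
]
```

Calling rank(CAPABILITIES) gives you an ordered list to color the board with. The point is the conversation the ranking triggers, not the decimals.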

Then, Map Out Technical Coupling

Once you know what your business cares about, it’s time to trace what’s really tangled underneath.

Fire up your static analysis tools and look for:

Shared libraries: How tightly are modules bound at compile time?
Database joins: Are multiple modules touching the same tables?
Runtime calls: Who’s calling whom synchronously?

Service mesh telemetry or APM tools like New Relic and Dynatrace can help you expose these runtime dependencies.

You’ll end up with a spaghetti graph. That’s okay—it’s supposed to be ugly. Look for dense clusters. These are your danger zones. Ironically, they’re often the worst places to start breaking things apart. Why? Because complexity breeds paralysis. Instead, choose a capability with clean boundaries and visible business value. Your future success story needs to resonate.
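A rough way to find those clean boundaries is to count neighbours per module. The sketch below assumes you have already exported dependency edges from static analysis or APM data; the module names and the threshold are hypothetical.

```python
from collections import defaultdict

# Hypothetical dependency edges: (caller, callee). Invented for illustration.
EDGES = [
    ("checkout", "pricing"), ("checkout", "inventory"), ("checkout", "users"),
    ("pricing", "inventory"), ("inventory", "pricing"),
    ("recommendations", "users"),
]

def coupling_degree(edges):
    """Count distinct neighbours (callers plus callees) per module."""
    neighbours = defaultdict(set)
    for caller, callee in edges:
        neighbours[caller].add(callee)
        neighbours[callee].add(caller)
    return {module: len(peers) for module, peers in neighbours.items()}

def extraction_candidates(edges, max_degree=1):
    """Modules with few neighbours: cleaner boundaries, easier first moves."""
    degrees = coupling_degree(edges)
    return sorted(m for m, d in degrees.items() if d <= max_degree)
```

Low coupling alone doesn’t make a good first candidate. Cross-check the result against the business heatmap before you commit.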

Now, Overlay the Org Chart (Yes, Really)

Here’s the thing: Conway’s Law isn’t a law because someone said so—it’s a law because it happens whether you like it or not.

Whatever your org chart looks like, your software will mirror it. So take your coupling graph and sketch team boundaries on top. Watch what happens:

If one team owns a module but has to reach across five services to get work done, that’s a cognitive sinkhole.
If three teams constantly edit the same folder in the monolith, you’re looking at a domain that’s screaming for clarification.

In both cases, it’s time to think DDD. And if your engineering org is surprised by how misaligned things are? Even better. That’s exactly what this visibility is for.

Don’t Forget Regulation—It’s Not Optional

Let’s not kid ourselves: PCI DSS 4.0, the EU AI Act, and SEC incident reporting all assume you know which data is sensitive and where it lives. That means certain columns, tables, or message fields are legally sensitive. You can’t just copy them somewhere else and hope no one notices.

So tag your sensitive data:

Cardholder data
Personally identifiable info
Anything tied to high-risk AI inputs

This matters for two reasons:

It tells you where you can’t afford to be sloppy during migration.
It helps you plan what to migrate first—and what must wait until you’ve got the proper guardrails in place.

If you ignore this part? Expect to rewrite your migration roadmap when legal sends you a frantic Slack message two days before go-live.
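Tagging can start as something very small. Here is a minimal sketch, assuming a hand-maintained map from column to classification; the table names, tags, and the idea of a guardrails_ready set are all hypothetical.

```python
# Hypothetical classification map: "table.column" -> tag.
SENSITIVE = {
    "payments.card_number": "cardholder_data",
    "users.email": "pii",
    "fraud_features.device_fingerprint": "high_risk_ai_input",
}

def migration_blockers(columns_to_move, guardrails_ready):
    """Return sensitive columns whose tag has no guardrail in place yet."""
    return sorted(
        col for col in columns_to_move
        if col in SENSITIVE and SENSITIVE[col] not in guardrails_ready
    )
```

Run a check like this against each migration wave, and the list of blockers tells you what has to wait.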

Carving Bounded Contexts

So… What Even Is a Bounded Context?

Here’s the thing: every large system is already split into domains—it’s just that nobody’s named them, nobody owns them, and half of them are overlapping. That’s where things get messy.

A bounded context isn’t just a fancy term from Eric Evans’ playbook. It’s a boundary—both linguistic and technical—that says: “This is our language, our model, our data. Outside of it? Not our problem.” Once you start thinking in contexts, you stop solving for generic abstractions and start aligning to actual business flows.

But you can’t define bounded contexts in a vacuum. You need stories.

Lock the Right People in a Room and Tell Stories

Seriously. Get five to ten people in a room, including a product owner, a senior engineer, a tester, someone from ops, and your favorite data analyst. Then ask one deceptively simple question:

“When a customer places an order, what happens next?”

What happens is magic. People start talking. You’ll see whiteboards fill with actors, arrows, and notes: “Authorize Payment,” “Reserve Stock,” “Generate Invoice.” Everyone starts sketching the flow they live and breathe.

Color-code the steps. Mark ownership. You’ll notice patterns forming—terms like Order, Payment, Shipment—used repeatedly and consistently.

That shared vocabulary? That’s your ubiquitous language. Once it’s validated, it’s more than just words. It’s the blueprint for your system’s shape.

Draw the Map. Literally.

Take everything from the workshop and turn it into a context map.

Draw circles:

Core domains that differentiate your business
Supporting domains you still need, but don’t define you
Generic domains like identity, notifications, or file storage

Then connect the dots—literally. Use arrows to mark which domains depend on others. It’s not just academic; it’s strategic. Upstream domains influence, downstream ones depend. That matters when you’re sequencing your work.

Here’s the kicker: align this map with your org chart. Each bounded context should have an owning team. Clear KPIs, a deploy pipeline, the works.

If two contexts fall under the same team, maybe they belong together. If a team claims three wildly different contexts? Push back. It’s too much. Negotiate a split. The map is your contract. Treat it that way.

Anti-Patterns Lurking in the Shadows

Now, a warning. There are landmines here. Let’s call them out:

Micro-service per aggregate: It sounds clean. It isn’t. Turning every root entity into its own service leads to noisy networks and awkward conversations with ops when latency triples.
Technical slicing: Splitting by layers—API, logic, data—creates half-baked services that depend on each other like co-dependent roommates. Don’t do it.
Ignoring Conway’s Law: If your “context” spans four teams, it’s not a context—it’s an on-call nightmare waiting to happen.

Here’s a quick reality check: take three incidents from the past year. Ask, “Could the owning team have resolved this on their own?” If the answer is “No,” redraw the boundary.

You’re not just building micro-services. You’re building accountable units of delivery. That’s what a bounded context is. And when you get it right? Everything—from deploys to bug fixes—starts to feel a little lighter.

Designing a Durable Event Model

Events Aren’t Just Payloads. They’re Commitments.

Once you’ve defined your bounded contexts, you need a way for them to talk to each other—without yelling across the room.

That’s where events come in.

But hold up—this isn’t just about slapping messages onto a queue. An event isn’t an afterthought or a byproduct. It’s a business fact, frozen in time. And if you treat events that way—from day one—you’ll spare yourself a lot of pain down the road.

Domain Events vs. Integration Events: Yes, There’s a Difference

Let’s clear this up right away.

A domain event is a pure expression of what just happened. “Payment Authorized.” “Order Cancelled.” It’s born inside a bounded context, owns its truth, and never changes once published.
An integration event is a translation—sometimes filtered, sometimes enriched. Maybe it redacts personal data, maybe it adds some fluff for analytics. That’s fine—but don’t confuse the two.

Why does this matter? Because if your “Shipping” service starts treating a cleaned-up analytics event from “Payments” as the gospel truth, you’ve just introduced brittle coupling in disguise.

Naming Events: It’s Not Just Semantics

Event names matter. They’re not just for logs or dashboards—they’re part of your ubiquitous language. So treat them with care.

Use past tense: order.shipped.v1, not order.ship.
Be explicit about intent. “UserRegistered” tells you something meaningful. “UserUpdated”? Not so much—what was updated? Why?
Include a version suffix right in the name. It’s not overhead—it’s a signal. When breaking changes come (and they will), they come in cleanly as .v2, not silently through unexpected field removals.

And yes—never remove a required field. Add optional ones. Mark things as deprecated. But never yank something out from under consumers. You’re not just publishing events—you’re making contracts.
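These naming rules are easy to lint. The sketch below assumes a dotted, lowercase convention of aggregate, fact, and version; the regex and the list of vague words are assumptions, and a regex can’t verify past tense, so that part stays with code review.

```python
import re

# Assumed convention: <aggregate>.<fact>.v<major>, lowercase snake_case.
EVENT_NAME = re.compile(r"^[a-z][a-z_]*\.[a-z][a-z_]*\.v[1-9][0-9]*$")

# Facts too vague to act on; extend this to suit your own domain language.
VAGUE_FACTS = {"updated", "changed", "happened"}

def check_event_name(name: str) -> list[str]:
    """Return a list of problems; an empty list means the name passes."""
    if not EVENT_NAME.match(name):
        return ["expected <aggregate>.<fact>.v<N>"]
    _, fact, _ = name.split(".")
    if fact in VAGUE_FACTS:
        return [f"'{fact}' does not say what happened"]
    return []
```

Wire a check like this into the pull request pipeline and bad names never reach the broker.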

Picking the Right Schema Technology: Don’t Default to JSON

Let’s talk tech for a second.

JSON is easy. Everyone can read it. It’s human-friendly… until it isn’t.

Plain JSON doesn’t enforce contracts on its own. You won’t know something broke until a consumer quietly fails in production. Then you’re back in Slack trying to piece together what went wrong, which schema changed, and who forgot to update what.

Schema formats like Avro or Protobuf solve this. They encode compactly, support evolution rules, and work great with schema registries like Confluent Schema Registry or Apicurio. These registries act like the bouncers outside your event bus. If a new schema breaks backward compatibility, the pipeline halts. Good. Let it halt.

You’d rather fix a schema in CI than roll back a broken event in prod.
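To make the idea concrete, here is a toy version of the check a registry runs. It is a sketch over a simplified schema shape, not Avro or Protobuf resolution rules, and real registries define their compatibility modes more precisely. It only encodes the rules from this article: don’t remove required fields, don’t change types, and add new fields as optional.

```python
def breaking_changes(old: dict, new: dict) -> list[str]:
    """Compare two simplified schemas of the form {field: {"type": str, "required": bool}}."""
    problems = []
    for field, spec in old.items():
        if field not in new:
            if spec["required"]:
                problems.append(f"required field removed: {field}")
        elif new[field]["type"] != spec["type"]:
            problems.append(f"type changed: {field}")
    for field, spec in new.items():
        if field not in old and spec["required"]:
            problems.append(f"new field must be optional: {field}")
    return problems
```

An empty list means the change can ship under the same version. Anything else is either a fix or a .v2.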

Choreography vs. Orchestration: Know When to Let the Band Play

You’ve got your events. You’ve got your services. Now, how do they dance?

For simple, linear flows—like order placed → payment authorized → order fulfilled—event choreography works great. Each service listens, reacts, and emits.
But what about complex, reversible flows? Say you booked a shipment, but the warehouse fails to confirm. Now you have to roll back the charge, cancel the label, and notify the user.

That’s where orchestration comes in. A dedicated process manager runs the show—tracking state, handling retries, and issuing compensating actions. Yes, it introduces a bit of indirection. But it also saves you from days of outage triage when something goes wrong halfway through a ten-step process.

Use both patterns. Pick based on context. But above all, design for clarity.

Choosing the Event Backbone and Supporting Infrastructure

Your Events Deserve More Than a Message Queue

Once you’ve nailed down your event model, the next question is: where do those events actually live? And how do they move? If you get this part wrong, it won’t matter how clean your context map is—your services will be arguing over garbled messages or tripping over race conditions.

Let’s get one thing straight: this is about logs, not mailboxes.

Traditional queues (like RabbitMQ or SQS) are fine for fire-and-forget tasks, but building an event-driven system requires more than pushing bytes around. You need persistence. Replayability. Individual consumption offsets. In other words, you need a durable event log.

Kafka, Pulsar, or the Cloud Buffet?

Apache Kafka became the default choice for a reason. It gives you an append-only log, ordered within each partition, where each consumer group tracks its own progress. So your real-time fraud detector doesn’t get slowed down by some nightly batch job.
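The difference between a log and a mailbox is easier to see in code. This is an in-memory stand-in for a single partition, not a Kafka client; the class and method names are made up for illustration.

```python
class EventLog:
    """In-memory stand-in for one partition of a durable log."""

    def __init__(self):
        self._records = []   # append-only list of events
        self._offsets = {}   # consumer name -> next offset to read

    def append(self, event) -> int:
        self._records.append(event)
        return len(self._records) - 1   # offset of the new record

    def poll(self, consumer: str, max_records: int = 10) -> list:
        """Read from this consumer's own position; other consumers are unaffected."""
        start = self._offsets.get(consumer, 0)
        batch = self._records[start:start + max_records]
        self._offsets[consumer] = start + len(batch)
        return batch

    def rewind(self, consumer: str, offset: int = 0) -> None:
        """Replay: move one consumer back without touching the others."""
        self._offsets[consumer] = offset
```

Reading never removes anything. A slow consumer only falls behind on its own offset, and a replay is just a rewind.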

Apache Pulsar has some strengths of its own, like multi-tenancy out of the box and long-established tiered storage for archiving older events without clogging up your hot path (Kafka has since added tiered storage too). Depending on your scale and use case, it might be the right pick.

Don’t want to manage clusters? Totally fair. Cloud options like Google Pub/Sub, AWS MSK, or Azure Event Hubs abstract the infrastructure away—but watch out. You’ll trade off some fine-grained control, so understand what you’re giving up before going all-in.

When you evaluate, ask:

Do we need strict ordering for sagas or financial transactions?
Do we care about exactly-once semantics? (Spoiler: most teams settle for at-least-once + idempotency.)
What’s our latency tolerance—especially if we’re bridging between cloud and on-prem?

Schema Registry and Contract Testing: Your CI/CD’s Best Friend

If the event backbone is the nervous system, then the schema registry is the immune system. It prevents corrupted or incompatible events from entering the bloodstream.

Here’s how it works:

A developer updates an event schema.
The new version is pushed to the registry.
Compatibility checks run—both forward and backward.
If it passes, it gets published. If it fails, it stops cold.

Think of it like type-checking your entire event model—before deployment.

Run contract tests in your CI pipeline to ensure no one breaks a downstream consumer without knowing it. That simple step has saved countless hours of post-deploy panic.

Some popular choices:

Confluent Schema Registry (Kafka-native)
Apicurio
Open-source Kafka-compatible options

Trust me, once you’ve caught a bad schema in CI, you’ll wonder how you ever lived without it.

Observability: Don’t Fly Blind

Events are invisible unless you make them visible.

Here’s what you want:

Use OpenTelemetry traces to stitch together end-to-end flows—linking REST calls to Kafka offsets and back.
Treat every event as both a metric and a trace. Count them. Time them. Plot them.
Watch for lag—that’s your canary. If a consumer falls behind, that’s not just a tech issue; it could mean orders aren’t shipping, payments aren’t processing, users are churning.

Your observability stack should surface:

Events produced per minute
Consumer group lag
Time from publish to consumer acknowledgment

Without this visibility? You’re just hoping everything works. And hope is not a strategy.
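Consumer lag itself is simple arithmetic. A minimal sketch, assuming you can already fetch the latest offsets and the committed offsets per partition from your broker tooling; the threshold is arbitrary.

```python
def consumer_lag(end_offsets: dict, committed: dict) -> dict:
    """Lag per partition = latest offset in the log minus the consumer's committed offset."""
    return {p: end - committed.get(p, 0) for p, end in end_offsets.items()}

def lagging_partitions(end_offsets: dict, committed: dict, threshold: int = 1000) -> list:
    """Partitions whose lag exceeds the alert threshold."""
    lag = consumer_lag(end_offsets, committed)
    return sorted(p for p, n in lag.items() if n > threshold)
```

Alert on the trend as well as the absolute number. Lag that keeps growing matters more than lag that is merely high.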

Security and Compliance: Bake It In

Compliance can’t be bolted on later. Build it in now, or pay for it tenfold later.

Some essentials:

TLS everywhere. Encrypt traffic in flight. No excuses.
Access control lists (ACLs) so only the owning service can write to its topic.
Tokenization or encryption for sensitive fields—think card numbers, email addresses, anything PII.

If a downstream consumer doesn’t need the raw data, don’t give it to them. Use integration events with redacted payloads.
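Here is a rough sketch of that redaction step. The field names and the keyed-hash approach are assumptions for illustration. A keyed hash is not a substitute for a proper tokenization service where PCI DSS applies, so treat this as the shape of the idea only.

```python
import hashlib
import hmac

SENSITIVE_FIELDS = {"card_number", "email"}   # assumed field names

def tokenize(value: str, key: bytes) -> str:
    """Keyed hash: the same input always maps to the same token."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]

def to_integration_event(domain_event: dict, key: bytes) -> dict:
    """Copy the event, replacing sensitive values with tokens. The original is untouched."""
    return {
        field: tokenize(str(value), key) if field in SENSITIVE_FIELDS else value
        for field, value in domain_event.items()
    }
```

Downstream teams can still join on the token, but they never see the raw value.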

Not only does this keep you on the right side of GDPR, but it also positions you well for newer regulations like the AI Act, which demands explainability and transparency. Immutable events with tight access and audit trails? That’s compliance gold.

Migration Strategies in Depth

You Can’t Rewrite the Plane Mid-Flight—But You Can Reroute It

By now, you’ve probably realized: you’re not starting from scratch. There’s a monolith. It’s working—sort of. It’s got warts, sure, but it’s also paying the bills. You can’t just flip a switch and replace it with shiny micro-services. That’s a fantasy.

What you need is a strategy that lets you move incrementally, safely, and without breaking everything every other Tuesday.

Let’s talk tactics.

The Strangler Fig: Nature’s Guide to Legacy Decomposition

Martin Fowler coined the term, but nature got there first. The strangler fig grows around a host tree—bit by bit—until one day, it stands alone.

Here’s how it works in code:

You place a thin facade—say, an HTTP proxy or routing rule—between clients and your monolith.
You build new functionality as micro-services.
That facade selectively routes calls: new stuff goes to the service; old stuff stays with the monolith.
Over time, more traffic shifts to the service.
Eventually, the monolith’s old module becomes redundant—and gets deleted.

Zero downtime. No big-bang rewrites. Just quiet, steady progress.
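The facade’s core decision fits in a few lines. This sketch assumes routing by path prefix; the hostnames and prefixes are hypothetical, and in practice the rule usually lives in your proxy or gateway configuration.

```python
# Hypothetical routing table: path prefixes already migrated -> new service base URL.
MIGRATED_PREFIXES = {
    "/api/fraud": "http://fraud-service.internal",
}
MONOLITH = "http://monolith.internal"

def route(path: str) -> str:
    """Pick the upstream for a request path; the longest matching prefix wins."""
    matches = [p for p in MIGRATED_PREFIXES if path == p or path.startswith(p + "/")]
    if not matches:
        return MONOLITH
    return MIGRATED_PREFIXES[max(matches, key=len)]
```

Migrating another slice is one new entry in the table. Rolling back is deleting it.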

The Outbox Pattern: Say It Once, Say It Right

Here’s the problem with event-driven systems: what happens if the service updates the database but crashes before publishing the event?

Boom. You’ve got data that no one else knows about. Silent inconsistencies. The kind that haunt you at 3 a.m.

Enter the Outbox Pattern.

Instead of publishing events directly, you:

Write the event to a dedicated outbox table, in the same database transaction as your business logic.
A separate relay process reads from the outbox and publishes to Kafka (or whatever broker you’re using).

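Now, even if the service crashes right after the commit, the event is safely stored and the relay picks it up later. No lost events. No ghost updates. One caveat: if the relay crashes after publishing but before marking the row as sent, it will publish that event again, so what you get is at-least-once delivery.

Pair it with idempotent consumers and you have effectively-once processing, which is what matters most.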
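A minimal sketch of the pattern, using SQLite from the standard library so it runs anywhere. The table layout, the topic name, and the publish callback are assumptions. Note that the relay marks a row only after publishing, so a crash between those two steps leads to a re-send, and consumers should deduplicate on event_id.

```python
import json
import sqlite3
import uuid

def init_db(conn):
    conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT)")
    conn.execute(
        "CREATE TABLE outbox (event_id TEXT PRIMARY KEY, topic TEXT, payload TEXT, "
        "published INTEGER DEFAULT 0)"
    )

def place_order(conn, order_id):
    """Business write and outbox write commit together, or not at all."""
    with conn:  # commits on success, rolls back if either insert raises
        conn.execute("INSERT INTO orders VALUES (?, 'placed')", (order_id,))
        conn.execute(
            "INSERT INTO outbox (event_id, topic, payload) VALUES (?, ?, ?)",
            (str(uuid.uuid4()), "order.placed.v1", json.dumps({"order_id": order_id})),
        )

def relay_once(conn, publish):
    """Publish pending rows in insertion order, marking each one after it is sent."""
    rows = conn.execute(
        "SELECT event_id, topic, payload FROM outbox WHERE published = 0 ORDER BY rowid"
    ).fetchall()
    for event_id, topic, payload in rows:
        publish(topic, event_id, json.loads(payload))
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE event_id = ?", (event_id,))
    return len(rows)
```

In a real system the relay runs on a schedule or tails the table, and publish wraps your broker client.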

CDC: When You Can’t Touch the Monolith

Sometimes you’re stuck. The monolith’s ORM is older than your intern. The team that built it is long gone. You can’t risk changing anything inside.

That’s when Change-Data-Capture (CDC) becomes your secret weapon.

Tools like Debezium hook into your database’s transaction log (the write-ahead log in PostgreSQL, the binlog in MySQL). They listen for row-level changes and stream them out as events, without touching your application code.

It’s a clever workaround, but there’s a catch: CDC gives you technical changes, not business intent.

So, a row changes. Great. But what does it mean? Was the order cancelled? Was it just an address update? You’ll need a transformer layer to map those raw changes into proper domain events.
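A transformer layer can start as a plain function. The sketch below assumes a before/after change shape on a hypothetical orders table with status and address columns. The exact envelope differs per CDC tool and connector, so treat the keys as assumptions.

```python
def to_domain_event(change: dict):
    """Translate a row-level change on a hypothetical `orders` table into a domain event."""
    before, after = change.get("before"), change.get("after")
    if before is None and after is not None:
        return {"type": "order.placed.v1", "order_id": after["id"]}
    if before is None or after is None:
        return None  # deletes and empty changes: no business meaning assigned here
    if before["status"] != "cancelled" and after["status"] == "cancelled":
        return {"type": "order.cancelled.v1", "order_id": after["id"]}
    if before["address"] != after["address"]:
        return {"type": "order.address_changed.v1", "order_id": after["id"]}
    return None  # a technical change with no business meaning
```

Every branch here is a guess about intent that the monolith never wrote down. Validate those guesses with the people who know the domain.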

Still, when the monolith is off-limits, CDC is the way in.

Managing the Two Truths Problem

During migration, you’re in a messy state: some data lives in the monolith; some in your shiny new service. So how do you keep your story straight?

Here are your main options:

Immutable ownership: Only the owning context writes to a table. Others consume it as a read-only projection. No overlaps. No debates.
Temporal fences: The new service handles all future data—new users, new orders, etc. The monolith keeps the historical stuff. You just draw a date line and stick to it.
Graceful rollback: Always have a way back. Keep feature toggles that let you reroute traffic to the monolith if something goes sideways. The new service doesn’t get deleted—it just goes dark until you fix it.

Real talk: teams that practice rollback drills recover far faster than those that rely on duct tape and heroic last-minute debugging.

Ensuring Consistency and Resilience

Your System Will Fail. Now Design Like You Know That

Distributed systems aren’t gentle. Messages get dropped. Services restart mid-transaction. Someone restarts a Kafka broker without telling the team. It happens.

So the real question isn’t “How do we prevent failure?” It’s “How do we survive it—and stay consistent while we’re at it?”

This is where patterns like sagas, idempotency, and even a little chaos engineering come into play.

Sagas: The Narrative Backbone of Distributed Consistency

When a process stretches across multiple services—say, placing an order, charging a card, booking shipment, and confirming delivery—you can’t just wrap that in a traditional transaction. There’s no cross-service BEGIN/COMMIT in this world.

What you need is a saga.

Sagas are long-running, distributed workflows built from smaller, isolated transactions. Each step completes and emits an event. If a step fails, the saga kicks off compensating actions—think: refund the payment, restock the item, notify the user.

Two ways to manage this:

Choreography: Each service listens for specific events and emits follow-ups. Light, elegant, but a little opaque once things get hairy.
Orchestration: A process manager tracks the whole flow explicitly—logging state, making decisions, and coordinating retries. Slightly heavier, but far more visible.

Whichever you choose, persist the saga’s state. Otherwise, if the coordinator goes down mid-flight, you’ve got no recovery plan—and no trace of what was mid-air.

And make sure compensating actions also emit events. You’ll want to trace these steps later when something breaks and the postmortem starts.
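Here is a minimal sketch of the orchestration flavor. The journal list stands in for persisted saga state and for the events each step would emit; the step names and the run_saga helper are hypothetical.

```python
def run_saga(steps, journal):
    """Run (name, action, compensate) steps in order; undo completed ones on failure."""
    completed = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception as exc:
            journal.append(("failed", name, str(exc)))
            # Compensate in reverse order, recording each undo so it can be traced.
            for done_name, undo in reversed(completed):
                undo()
                journal.append(("compensated", done_name))
            return False
        completed.append((name, compensate))
        journal.append(("completed", name))
    return True
```

A production version would write the journal to durable storage before each step, so a restarted coordinator can pick up where it left off. It would also have to handle compensations that fail.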

Idempotency: Your Safety Net Against “Oops, It Happened Twice”

Here’s a law of distributed systems: If something can happen more than once, it will.

Network blips. Broker retries. Misconfigured consumers. You’ll get duplicate events. It’s not a bug—it’s a guarantee.

That’s why idempotency is your best friend.

Each event needs:

A unique event ID
A natural aggregate ID (like order number or payment ID)

When your service consumes an event, it checks: “Have I seen this before?” If yes, skip. If no, process and log the ID.

No complicated deduplication logic. No weird partial state. Just a clean record of what’s been handled.

Avoid relying on random UUIDs alone—they’re hard to trace and even harder to debug. Lean on domain-specific keys whenever you can.
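The “have I seen this before?” check can be sketched like this. The in-memory set is a stand-in for a durable store, and the event field names are assumptions.

```python
class IdempotentConsumer:
    """Wraps a handler so each (aggregate_id, event_id) pair is processed once."""

    def __init__(self, handler):
        self._handler = handler
        self._seen = set()   # in production this would be a durable store

    def consume(self, event: dict) -> bool:
        """Return True if the event was processed, False if it was a duplicate."""
        key = (event["aggregate_id"], event["event_id"])
        if key in self._seen:
            return False
        self._handler(event)
        self._seen.add(key)   # record only after the handler succeeds
        return True
```

Ideally the handler’s side effects and the “seen” record are written in the same database transaction. Otherwise a crash between the two can still produce a repeat.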

Chaos Engineering: If You Don’t Test Failure, It’ll Surprise You Later

Want to know if your system can handle failure? Don’t wait for prod to find out.

Instead, run controlled chaos:

Delay messages randomly.
Drop 5% of events at random.
Restart brokers mid-test.
Inject partition unavailability.
Simulate a replay storm during peak hours.

Your goal isn’t to break things for fun—it’s to build muscle memory:

Do your consumers retry with exponential backoff?
Does the dead-letter queue catch and alert on poisoned messages?
Can you replay lost events without corrupting state?
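The first two questions map to a small amount of code. This sketch assumes a handler that raises on failure and a plain list standing in for the dead-letter queue. Jitter is left out for brevity, and the sleep function is injected so tests don’t have to wait.

```python
def backoff_delays(base=0.5, factor=2.0, attempts=4, cap=30.0):
    """Delays before each retry: base, base*factor, and so on, capped at `cap` seconds."""
    return [min(base * factor ** i, cap) for i in range(attempts)]

def process_with_retry(event, handler, dead_letters, sleep, attempts=4):
    """Try the handler; after the last retry fails, park the event in dead_letters."""
    delays = backoff_delays(attempts=attempts)
    for attempt in range(attempts + 1):       # first try plus `attempts` retries
        try:
            handler(event)
            return True
        except Exception as exc:
            if attempt == attempts:
                dead_letters.append({"event": event, "error": str(exc)})
                return False
            sleep(delays[attempt])
```

In production, pass time.sleep as the sleep argument and alert whenever the dead-letter list grows.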

Make these drills a habit:

Run chaos scenarios in staging every sprint.
Do production game days once a quarter.

The teams that rehearse failure recover faster—and with fewer grey hairs.

Testing, Tracing, and Metrics

“It Works on My Machine” Doesn’t Cut It Anymore

When you move to an event-driven architecture, something shifts. You’re no longer just testing APIs—you’re testing conversations. And like any good conversation, what matters isn’t just what’s said, but when and how it’s said.

That means your test suite needs an upgrade. And your observability stack? It becomes a lifeline.

Let’s break it down.

Contract-Driven Testing: Trust, But Verify

In a world where services communicate through events, schemas are contracts. And contracts aren’t optional. If a producer makes a change, consumers need to know before it hits production.

So how do you keep everyone honest?

The producer team maintains event schema files in their codebase.
The consumer teams pull those schemas into their test suites using stub generators.
On every CI run, the system checks: is the change backward-compatible?

For example:

If a producer adds a new field? Cool—just make it optional.
If they remove or rename a required field? CI should fail. Hard.

No silent breakage. No weekend firefights. Just clean, predictable communication.
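From the consumer side, a contract test can be as small as “the fields I read still exist with the types I expect.” The schema and the consumer names below are hypothetical. In a real setup the expectations live in each consumer’s repository and the schema comes from the producer’s.

```python
PRODUCER_SCHEMA = {   # hypothetical schema for order.shipped.v1
    "order_id": "string",
    "shipped_at": "string",
    "carrier": "string",
}

CONSUMER_EXPECTATIONS = {   # what each consumer actually reads
    "notifications": {"order_id": "string", "carrier": "string"},
    "analytics": {"order_id": "string", "shipped_at": "string"},
}

def broken_consumers(schema, expectations):
    """Map consumer name -> fields it needs that the schema no longer provides."""
    broken = {}
    for consumer, needs in expectations.items():
        missing = sorted(f for f, t in needs.items() if schema.get(f) != t)
        if missing:
            broken[consumer] = missing
    return broken
```

Fail the build whenever the result is non-empty, and the error message already names who to talk to.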

And yes—make this a ritual, not a recommendation. If a pull request modifies a schema, it must include:

Compatibility results
Updated contract stubs
Migration notes for consumers

Automate it all. Humans forget. Pipelines don’t.

End-to-End Replay: Your Secret Weapon Against Edge Cases

Let’s face it—unit tests miss things. Integration tests get close, but they’re still limited. You know what catches the weird bugs? Replaying real-world events.

Here’s how to build your replay harness:

Capture a slice of production traffic (sanitized if needed).
Store it in a separate log or object store.
On every release candidate, replay those events into a staging environment.
Compare actual outcomes to expected state or traces.

This isn’t just testing—it’s simulation. You’ll uncover:

Events arriving in unexpected orders
Edge cases you didn’t even know existed
Latency-induced flakiness

Bonus: the replay harness becomes a living spec. Every new corner case you discover? Add it to the next run.
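Stripped to its core, a replay harness folds captured events through your service logic and compares the result to what you expected. The projection below is a toy, and the event shapes are assumptions.

```python
def apply_order_event(state, event):
    """Toy projection: track the latest status per order."""
    new_state = dict(state)
    if event["type"] == "order.placed.v1":
        new_state[event["order_id"]] = "placed"
    elif event["type"] == "order.shipped.v1":
        new_state[event["order_id"]] = "shipped"
    return new_state

def replay(events, apply, initial_state):
    """Fold captured events into state using the service's own apply function."""
    state = dict(initial_state)
    for event in events:
        state = apply(state, event)
    return state

def diff_states(actual, expected):
    """Keys whose values differ, as {key: (actual, expected)}."""
    keys = sorted(set(actual) | set(expected))
    return {k: (actual.get(k), expected.get(k)) for k in keys if actual.get(k) != expected.get(k)}
```

Feed it the same events in a shuffled order and this toy projection ends up in the wrong state. That’s exactly the kind of ordering bug the harness is there to surface.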

The Golden Signals of Event-Driven Systems

Traditional apps have RED metrics: Rate, Errors, Duration. That’s a good start—but event-driven systems need more.

Here’s what to track:

Event throughput: How many events are being produced and consumed per minute?
Consumer lag: Is any service falling behind? Lag is the canary in your coal mine.
Mean processing latency: How long does it take from event publish to final acknowledgment?
Saga failure rate: Are distributed workflows completing, compensating, or falling flat?

Now bring these metrics into business focus. Tie them to real KPIs:

A five-second delay in order.fraud_check? That might correlate with cart abandonment.
A spike in payment.refund_failed events? That’s a support nightmare in the making.

Visualize all of this in Grafana, Datadog, or whatever dashboard you live in. Don’t just throw alerts over the wall—make them actionable.

Governance and Change Management

It’s Not Just the Architecture That Changes—It’s the Culture

Here’s a not-so-secret truth: most migrations don’t fail because the tech is wrong. They fail because people weren’t aligned, weren’t prepared, or weren’t heard.

Governance isn’t about bureaucracy. Done right, it’s about reducing surprises, building trust, and making change feel safe. And when the foundation is events and bounded contexts, governance becomes something you bake into your pipelines—not just paste on at the end.

Let’s look at the rituals that make change stick.

Publishing Policy: One Team, One Topic

This is the golden rule. Repeat it. Tattoo it. Whisper it into your schema registry at night:

One team owns each topic. Period.

They—and only they—decide what goes into that schema, how it evolves, and when it changes. Cross-team consumers? Welcome, but they’re guests, not co-authors.

That means every pull request touching an event schema must include:

A registry compatibility pass
Updated contract stubs for any affected consumers
Migration notes in plain language

Set up automated checks in GitHub or GitLab. Don’t rely on engineers to remember every step. Let your CI yell if someone breaks the rules. It’s better than Slack yelling after prod breaks.

This isn’t control for control’s sake—it’s protection against accidental coupling and silent regressions.

Documentation Cadence: Keep the “Why” Alive

People come and go. Teams change. Six months from now, someone will ask, “Why does this topic even exist?” or “Why do we version that event instead of extending it?”

That’s where Architecture Decision Records (ADRs) come in.

Every schema change, boundary adjustment, or integration handshake should come with an ADR. Just a short doc that says:

What changed
Why it changed
Who decided
And when

Use a bot to post new ADRs in your team’s #architecture Slack channel. Once a quarter, clean up the old stuff. Keep your decision log readable—because it’s not just documentation. It’s your project memory.

People First: Training Is Not Optional

Most migration blockers aren’t technical—they’re human. People don’t like uncertainty, especially when their day-to-day changes.

So invest—early and often—in skills, language, and practice.

Here’s what actually works:

Event-storming workshops: Run by external coaches or experienced facilitators. They help teams discover domain boundaries together and define ubiquitous language without arguing about data models first.
Kata exercises: Tiny, low-risk practice sessions where devs build outbox-driven services from scratch, break them, and fix them—in a sandbox where failure is safe.
Shared vocabulary cheat sheets: Yes, like flashcards. So that testers, analysts, and devs all use the same nouns. It sounds small. It’s not. It’s alignment made visible.

These don’t cost much. Less than a week of downtime, for sure. But they build something far more valuable than code: confidence.

And when people feel confident in the platform and in each other, things ship faster. Reviews go smoother. Escalations vanish. Your migration becomes a shared achievement—not a top-down mandate.

Common Pitfalls

The Mistakes Everyone Makes (So You Don’t Have To)

You’re almost there. You’ve got your bounded contexts, your event model, your migration strategy, and your team on board. But let’s slow down for a moment—because even with the best intentions, things can go sideways.

Let’s walk through the usual suspects. Some are technical. Some are cultural. All are avoidable.

Pitfall #1: Event Spaghetti

You know those generic “updated” events? Like user.updated, product.changed, something.happened?

Yeah—don’t do that.

They sound flexible, but in practice they become dumping grounds for vague changes. Consumers can’t reason about them. Schemas balloon. Debugging becomes guesswork.

Instead, use explicit, domain-namespaced events. order.shipped.v1. user.email_changed.v1. You’re not just naming messages—you’re designing interfaces.

Clarity is power.

Pitfall #2: Overeager Slicing

We get it. You’re excited. The strangler worked once, and now you want to slice everything. But slow down.

Not every module needs to be a micro-service right away. Resist the urge to turn every aggregate or table into its own bounded context.

Start with one core domain. Prove it works. Learn from it. Then expand. Teams that migrate in waves succeed more often than those that try to “fix everything” in one roadmap cycle.

Pitfall #3: Zombie Contracts

Old event versions pile up. No one uses order.created.v1 anymore, but it’s still in the registry—just sitting there, waiting to confuse a new hire or trigger a bad deploy.

Solution? Quarterly registry pruning. Track consumer usage. Delete unused schemas with ceremony. Celebrate it. Dead contracts are technical debt in disguise.

Pitfall #4: Telemetry Sticker Shock

If you trace every event, log every payload, and monitor every span at 100% fidelity—your observability bill is going to look like a joke.

So be smart:

Sample traces during normal traffic
Compress logs before they hit storage
Archive cold events to object storage (e.g., S3, Azure Blob) after a week

You don’t need everything, forever. You need just enough, at the right fidelity, for the right audience.
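Sampling doesn’t have to be random. One option is to hash the trace ID so every service makes the same keep-or-drop decision for a given trace. The function below is a hypothetical sketch of that idea, not the sampler from any particular tracing library.

```python
import zlib

def keep_trace(trace_id: str, sample_rate: float) -> bool:
    """Deterministic sampling: the same trace ID always gets the same decision."""
    bucket = zlib.crc32(trace_id.encode("utf-8")) % 10_000
    return bucket < int(sample_rate * 10_000)
```

Raise the rate during incidents, and lower it again once things are calm.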

Pitfall #5: Unfunded Skill Gaps

This one’s subtle. You’ve planned the tech. You’ve drafted the migration board. But you forgot to budget for people.

Event-driven architecture isn’t just a new stack—it’s a new mindset. If your engineers have never done DDD, never written a consumer that handles retries gracefully, never worked with schema evolution, you can’t assume they’ll just “figure it out.”

So:

Make training part of the plan
Track DDD fluency like you track sprint velocity
Even bring in external coaches for a few sessions

Think of it this way: the cost of training is a rounding error compared to the cost of a failed migration.

Conclusion: No, You’re Not Just Refactoring

Let’s not sugarcoat it—breaking a monolith is hard. It’s not about code alone. It’s about reshaping how your organization thinks, speaks, and builds software.

But here’s the thing: it’s possible. And more than that, it’s necessary.

Domain-Driven Design gives you the compass. It helps you see the real business logic hiding in the mess of code. Event-driven architecture gives you the roadways—resilient, decoupled, scalable. Together, they let you build something that grows with your business, not against it.

Start by seeing your domain clearly.

Draw your bounded contexts.
Give each team a clear identity.
Let events tell the story of what matters.

Choose an event backbone that fits your latency and governance needs. Use outboxes and CDC to migrate carefully. Build sagas to orchestrate change. Keep your systems honest with contract tests and tracing.
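If the outbox idea is new to your team, the following sketch shows its core: the business row and the event row are written in one database transaction, and a separate relay publishes pending events afterwards. It uses an in-memory SQLite database purely for illustration. The table layout, the function names, and the publish callback are assumptions, not a prescribed design.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id TEXT PRIMARY KEY, total_cents INTEGER NOT NULL);
    CREATE TABLE outbox (
        seq INTEGER PRIMARY KEY AUTOINCREMENT,
        event_type TEXT NOT NULL,
        payload TEXT NOT NULL,
        published INTEGER NOT NULL DEFAULT 0
    );
""")

def place_order(order_id, total_cents):
    # One transaction: the order row and its event commit or roll back together.
    with conn:
        conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, total_cents))
        conn.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("order.created", json.dumps({"order_id": order_id})),
        )

def relay_outbox(publish):
    # Publish pending events in order, marking each row once it has been sent.
    rows = conn.execute(
        "SELECT seq, event_type, payload FROM outbox WHERE published = 0 ORDER BY seq"
    ).fetchall()
    for seq, event_type, payload in rows:
        publish(event_type, json.loads(payload))
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE seq = ?", (seq,))
    return len(rows)

sent = []
place_order("o-1", 4200)
relayed = relay_outbox(lambda event_type, payload: sent.append((event_type, payload)))
relayed_again = relay_outbox(lambda event_type, payload: sent.append((event_type, payload)))
print(sent, relayed, relayed_again)
```

Note the delivery guarantee: if the relay crashes after publishing but before marking the row, the event goes out again on the next run. That is at-least-once delivery, which is why consumers need to be idempotent.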

Measure what matters. Teach what’s missing. Move piece by piece. And never let the fear of complexity freeze you in place.

Because the alternative? That monolith you’ve been nursing for a decade? It’s not just slowing you down. It’s draining your team’s energy, your customers’ patience, and your ability to adapt.

Act now, and in a year, you’ll look back and wonder why “big-bang rewrite” ever seemed like the only option.

Kick-Starting Your Internal Developer Platform in 2024: What Actually Works (and What Doesn’t)

joerg.aulich — Fri, 05 Jan 2024 19:49:00 +0000

Let’s not sugarcoat it—those patchwork Jenkins jobs, Terraform scripts passed around like family recipes, and “Steve knows how it works” knowledge bases? They’re no match for the pace, complexity, and regulatory heat of 2024. Compliance has grown teeth. Incident costs have climbed. Every new team member means another potential wrench in the delivery chain.

Internal Developer Platforms (IDPs) have stepped in as the antidote. When done right, they lighten the cognitive load on developers, keep auditors from breathing down your neck, and let product teams focus on what matters—building and shipping features, not fighting friction.

This isn’t your typical clickbait playbook. What you’re reading here is a detailed, human-first walkthrough of what it actually takes to launch—or seriously overhaul—your IDP. It explains how to build buy-in, what governance models actually work, how to discover and shape golden paths, and how to keep momentum long after the initial glow wears off. Every example and data point is pulled from real-world trenches, not vendor keynotes.

Follow this approach, and you’ll have a battle-tested roadmap. One that starts with getting executive buy-in, ships a lovable MVP by end of Q1, and rolls through the rest of the year in tight, visible, weekly feedback loops that compound progress.

The Jenkins Era Is Over (And That’s Okay)

Let’s rewind to 2011. Heroku published its twelve-factor manifesto, promising a utopia where developers could push code and never worry about things like load balancers, SSL certs, or log retention. That vision sounded perfect for startups in a world of greenfield codebases.

But if you were working in a bank, a telco, or any other grown-up enterprise? Good luck. You had mainframes to deal with. ESBs. Government-mandated change control processes. PaaS solutions just couldn’t handle that kind of complexity.

So teams improvised. Jenkins, Puppet, Artifactory, homegrown shell scripts… you name it. A Frankensteinian ecosystem that somehow worked—for a while.

By 2018, though, it was painfully clear things were spinning out of control. Every team had their own “special” pipeline. A single bank might run fifty subtly different CI/CD setups—all aiming to solve the same problems, just in slightly incompatible ways. Then COVID hit. Suddenly, remote onboarding made all those “you kinda have to be in the office to understand it” processes a massive liability.

Google’s SRE practices gave us hope. But even those assumed a relatively clean world of microservices and containers. Meanwhile, most enterprises were juggling aging monoliths, real-time systems, Python notebooks, and increasingly, regulated AI models. Each artifact came with its own bespoke build, scan, and deploy needs.

Platform engineering emerged from this chaos, not as another attempt to sell black-box magic, but as a pragmatic middle layer. A product, not a tool. Something that stitched together all those disparate components into a cohesive developer experience. No wonder Gartner put platform engineering on its list of top strategic technology trends for 2024 and expects 80% of large software engineering organizations to have platform teams by 2026. Done right, it takes a large share of the cognitive load off developers.

Why January 2024 Is the Moment of Truth

You don’t need a hype cycle to tell you when to act. What you need is internal pressure—and right now, there’s plenty.

Three forces are bearing down:

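The EU AI Act is about to become law. EU lawmakers reached a provisional agreement in December 2023, and the text foresees fines of up to 7% of global annual turnover for the most serious violations. If your models can’t demonstrate traceability and controls, you’re exposed.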
Cloud costs are still spiraling. Everest Group reports that 66% of enterprises overshoot their cloud budgets by 10% or more.
Velocity gaps are widening. The latest State of DevOps shows that fewer than 20% of orgs deploy multiple times a day. But those that do? They’re dominating the digital game.

And here’s the kicker—January is budget season. Wait too long, and OKRs are locked in, the money’s allocated, and your platform dreams are deferred.

If you’re a platform lead with a vision? Now’s the time. Tie your IDP pitch to outcomes the business actually cares about: better compliance, faster delivery, and lower incident costs. Translate your roadmap into executive language. Secure the bag before the year locks up.

From PowerPoint to Product: Governance That Actually Works

Ever sat through a steering committee where nobody had decision rights and everyone was just hedging until the next meeting? That’s where good IDP ideas go to die.

Here’s how you avoid that fate:

Appoint a single product owner. Someone with a real budget and the power to say “yes” or “no.”
Pair them with a product manager. A person who actually interviews internal users, curates the backlog, and runs discovery just like they would for external features.
Add a tech lead who ensures architectural coherence.
Include a DX designer who obsesses over developer experience—every CLI command, portal click, and GitHub template.

Want to avoid friction with finance? In year one, keep the platform budget as a shared cost pool. Let teams opt in without being slapped with cross-charges. Add billing for premium stuff—like GPU clusters or emergency support—once adoption is solid.

Governance artifacts should be real—but lightweight:

A five-slide deck: problem, vision, first golden path, success metrics.
A public Git repo: part handbook, part RFC space.
A decision log: why you chose Argo CD over Flux, and under what conditions you might switch.

This isn’t bureaucracy. It’s clarity. And clarity scales.

Golden Paths Aren’t Myths—They’re Revealed by Maps

Most developer frustration doesn’t come from edge cases. It comes from doing the same annoying thing 20 times a month.

The trick? Don’t start with tech. Start with maps.

Run a full-day workshop. Bring 3–5 representative teams into a room. Chart everything from idea-to-ticket to code-in-prod. Mark pain points in red—manual approvals, waiting on infra, misaligned tooling.

You’ll almost always find that 70% of the delay comes from 2–3 repeatable workflows. These are your golden paths.

Prioritize with data:

How often does this path happen?
What’s the business or compliance impact?
How loud are the complaints?
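One way to turn those three questions into a ranking is a simple weighted score. The sketch below is only an illustration: the weights, the 1-to-5 impact scale, and the workshop numbers are invented, and you should agree on your own before the workshop ends.

```python
from dataclasses import dataclass

@dataclass
class Workflow:
    name: str
    runs_per_month: int   # how often the path happens
    impact: int           # business or compliance impact, 1 (low) to 5 (high)
    complaints: int       # complaints collected in the workshop

def rank_workflows(workflows, weights=(0.5, 0.3, 0.2)):
    """Rank by weighted score after scaling each signal to the 0..1 range."""
    def scaled(value, maximum):
        return value / maximum if maximum else 0.0

    max_runs = max(w.runs_per_month for w in workflows)
    max_impact = max(w.impact for w in workflows)
    max_complaints = max(w.complaints for w in workflows)

    def score(w):
        return (weights[0] * scaled(w.runs_per_month, max_runs)
                + weights[1] * scaled(w.impact, max_impact)
                + weights[2] * scaled(w.complaints, max_complaints))

    return sorted(workflows, key=score, reverse=True)

# Made-up workshop results.
candidates = [
    Workflow("database schema change", 6, 5, 9),
    Workflow("new stateless web API", 20, 3, 14),
    Workflow("quarterly cert rotation", 1, 4, 2),
]
ranked = rank_workflows(candidates)
print([w.name for w in ranked])
```

The exact weights matter less than publishing them. Once teams can see why a path won, the debate shifts from opinions to inputs.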

Then build your first golden path. Often, it’s something like a stateless web API. So you scaffold it:

Repo generator with sane defaults
Automated PR pipeline with static analysis and unit tests
Infra-as-code with GitOps
Canary deploys
Pre-wired observability

What used to take five days, three departments, and six tickets now takes fifteen minutes. Unattended.

That’s not magic. That’s design.
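The repo generator is the least mysterious piece of that scaffold. Here is a minimal sketch using only the standard library. The template set, file names, and pipeline format are hypothetical placeholders; a real generator would pull versioned templates from Git and also wire up the infrastructure and observability pieces.

```python
from pathlib import Path
from string import Template
import tempfile

# Hypothetical template set for a stateless web API.
TEMPLATES = {
    "README.md": "# $service\n\nOwned by team $team.\n",
    "src/app.py": 'SERVICE_NAME = "$service"\n',
    "deploy/values.yaml": "service: $service\nteam: $team\nreplicas: 2\n",
    ".ci/pipeline.yaml": "stages: [lint, unit-test, build]\n",
}

def scaffold(target_dir, service, team):
    """Render every template into target_dir and return the created paths."""
    created = []
    for relative_path, body in TEMPLATES.items():
        path = Path(target_dir) / relative_path
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(Template(body).substitute(service=service, team=team))
        created.append(relative_path)
    return sorted(created)

with tempfile.TemporaryDirectory() as tmp:
    files = scaffold(tmp, service="payments-api", team="checkout")
    readme = (Path(tmp) / "README.md").read_text()

print(files)
```

The sane defaults live in the templates, not in the generator. That keeps the tool boring and lets the platform team improve the golden path with ordinary pull requests.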

Building a Platform That Doesn’t Suck (or Stall)

Let’s talk architecture. Not in the abstract, buzzwordy sense. But in terms of what really makes up a usable, scalable IDP that developers will actually want to use.

Picture it as a four-layer cake. Each layer builds on the last, and skipping one? That’s how you get a soggy bottom.

Layer 1: The Self-Service Interface
This is the front door. Could be Backstage, could be a homegrown web portal, could even be a well-designed CLI. The key? Everything is discoverable in one place. Templates, docs, golden paths, environment info—it’s all there. If new joiners have to open six tabs and pray they’ve got the latest wiki link, you’ve already lost.

Layer 2: Orchestration and Workflow
Here’s where the magic happens. You’ve got GitOps pipelines (Argo CD, Flux—pick your flavor), automated policy checks (Open Policy Agent is a solid choice), and secrets pulled from a vault—not pasted into YAML. Every change is versioned. Every deployment is traceable. Every mistake? Rewindable.
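To give a feel for what an automated policy check does, here is a toy version in plain Python. It is not Rego and it does not use Open Policy Agent's API. The manifest shape and the three rules (digest-pinned images, no literal secrets, a required team label) are simplified assumptions for illustration.

```python
def check_deployment(manifest):
    """Return a list of policy violations for a simplified deployment manifest."""
    violations = []
    for container in manifest.get("containers", []):
        name = container.get("name", "<unnamed>")
        if "@sha256:" not in container.get("image", ""):
            violations.append(f"{name}: image must be pinned by digest")
        for env in container.get("env", []):
            looks_secret = any(
                marker in env["name"].upper()
                for marker in ("PASSWORD", "TOKEN", "SECRET")
            )
            if looks_secret and "value" in env:
                violations.append(
                    f"{name}: {env['name']} must come from the vault, not a literal"
                )
    if not manifest.get("labels", {}).get("team"):
        violations.append("manifest: missing required 'team' label")
    return violations

# Hypothetical manifests: one that breaks every rule, one that passes.
bad = {
    "labels": {},
    "containers": [{
        "name": "api",
        "image": "registry.example.com/api:latest",
        "env": [{"name": "DB_PASSWORD", "value": "hunter2"}],
    }],
}
good = {
    "labels": {"team": "checkout"},
    "containers": [{
        "name": "api",
        "image": "registry.example.com/api@sha256:" + "0" * 64,
        "env": [{"name": "DB_PASSWORD", "valueFrom": "vault:secret/db"}],
    }],
}
bad_result = check_deployment(bad)
good_result = check_deployment(good)
print(bad_result)
```

Run a check like this in the pull request pipeline and developers see the violation next to the line that caused it, instead of hearing about it from a failed deploy.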

Layer 3: Shared Services
This is your utility belt. Kubernetes for workload scheduling. Service mesh for encryption and traffic shaping. Feature flags, database provisioning, CI runners—all centrally managed. Why should every team reinvent toggles?

Layer 4: Data and Artifact Layer
Where does the stuff live? This layer covers logs, container images, SBOMs, traces—and ensures they’re all linked by labels or digests. So if something breaks in prod, you can trace it all the way back to the offending commit and dependency version. Bonus: auditors love it.
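Linking by digest sounds abstract until you see the lookup. The sketch below assumes the platform records, for every build, the image digest, the source commit, and an SBOM as a package-to-version map. The record type, field names, and sample values are invented for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class BuildRecord:
    image_digest: str
    commit: str
    sbom: dict = field(default_factory=dict)   # package name -> version

def trace_incident(running_digest, builds, vulnerable):
    """Map a running image digest back to its commit and any flagged dependencies."""
    by_digest = {build.image_digest: build for build in builds}
    build = by_digest.get(running_digest)
    if build is None:
        return None   # the image was not built by the platform, a finding in itself
    flagged = {
        package: version
        for package, version in build.sbom.items()
        if version in vulnerable.get(package, set())
    }
    return {"commit": build.commit, "flagged_dependencies": flagged}

# Made-up build history and advisory data.
builds = [
    BuildRecord("sha256:aaa", "9f2c1e7", {"openssl": "3.0.1", "requests": "2.31.0"}),
    BuildRecord("sha256:bbb", "41d0c3a", {"openssl": "3.0.13"}),
]
vulnerable = {"openssl": {"3.0.1"}}
report = trace_incident("sha256:aaa", builds, vulnerable)
unknown = trace_incident("sha256:zzz", builds, vulnerable)
print(report)
```

The same lookup answers the auditor's favorite question: which commit is running in production right now, and what is inside it?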

And remember: security isn’t a separate layer. It threads through them all. Admission controllers block unsigned images. Short-lived SPIFFE workload identities replace long-lived keys. Policies shift left, so devs get clear feedback in their pull requests, not a cryptic ticket weeks later.

Ninety Days to Something You Can Touch

Most platform projects fail for one simple reason: they try to do everything.

Spoiler: you can’t. And you shouldn’t. Set a tight scope. Build something lovable in 90 days.

Here’s one possible tempo:

Weeks 1–2: Scaffold the portal. Set up a working GitOps proof-of-concept.
Weeks 3–4: Developers can scaffold and deploy a demo service in a sandboxed cluster.
Weeks 5–8: Add a policy engine. Wire in observability and cost tagging.
Weeks 9–11: Finalize docs. Add canary releases. Polish onboarding.
Week 12: A real product team ships something through the platform and gives you an NPS score.

Tool selection debates will happen. People have strong opinions. That’s fine—just be open and structured about it. Publish a scorecard: strategic fit, ecosystem health, operational cost, licensing. And define up front when and how you’d walk back a decision. That avoids analysis paralysis.

Metrics That Matter—And Drive Real Action

If it’s not measurable, it’s guesswork. And guesswork won’t survive past Q2.

Your metrics should span four domains:

1. Business Outcomes
Use the DORA metrics: lead time, deployment frequency, change failure rate, mean time to recovery. These resonate with leadership and tie back to revenue.
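If you already log deployments somewhere, the four numbers fall out of a few lines of code. This is a rough sketch, assuming each deployment record carries a commit time, a deploy time, a failure flag, and a restore time. The field names are hypothetical, and using the median for lead time is my choice here, not an official definition.

```python
from datetime import datetime, timedelta
from statistics import median

def dora_metrics(deployments, window_days):
    """Compute the four DORA metrics from a non-empty list of deployment records."""
    lead_times = [d["deployed_at"] - d["committed_at"] for d in deployments]
    failures = [d for d in deployments if d["failed"]]
    recoveries = [d["restored_at"] - d["deployed_at"] for d in failures]
    return {
        "lead_time": median(lead_times),
        "deploys_per_day": len(deployments) / window_days,
        "change_failure_rate": len(failures) / len(deployments),
        "mean_time_to_recovery": (
            sum(recoveries, timedelta()) / len(recoveries) if recoveries else None
        ),
    }

def deploy(lead_hours, failed=False, restore_minutes=0):
    # Helper that builds one made-up deployment record.
    committed = datetime(2024, 1, 8, 9, 0)
    deployed = committed + timedelta(hours=lead_hours)
    return {
        "committed_at": committed,
        "deployed_at": deployed,
        "failed": failed,
        "restored_at": deployed + timedelta(minutes=restore_minutes),
    }

metrics = dora_metrics(
    [deploy(2), deploy(4, failed=True, restore_minutes=30), deploy(6), deploy(8)],
    window_days=2,
)
print(metrics)
```

Start with whatever data you have, even if it is incomplete. A trend line built on imperfect records beats a perfect dashboard that ships in Q4.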

2. Developer Experience
Run quarterly surveys. Embed NPS prompts in your developer portal. Track time spent on toil vs. time spent on product work.

3. Platform Adoption
Look at the real data. Are teams using the platform? Query clusters for workloads with standardized label sets. Count the number of services migrated to the golden path. Compare against legacy pipelines.
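Counting workloads with the standard label set is a small script once you have the workload list as data, for example exported from your clusters as JSON. The sketch below assumes a simplified record shape and a hypothetical three-label standard. Adapt both to your own conventions.

```python
REQUIRED_LABELS = {"team", "service", "golden-path"}   # hypothetical label standard

def adoption_rate(workloads, required=REQUIRED_LABELS):
    """Share of workloads carrying every required label with a non-empty value."""
    if not workloads:
        return 0.0
    compliant = [
        w for w in workloads
        if all(w.get("labels", {}).get(key) for key in required)
    ]
    return len(compliant) / len(workloads)

# Made-up cluster inventory.
workloads = [
    {"name": "payments-api", "labels": {"team": "checkout", "service": "payments", "golden-path": "web-api"}},
    {"name": "search-api", "labels": {"team": "discovery", "service": "search", "golden-path": "web-api"}},
    {"name": "cart-api", "labels": {"team": "checkout", "service": "cart", "golden-path": "web-api"}},
    {"name": "legacy-batch", "labels": {"team": "ops"}},
]
rate = adoption_rate(workloads)
print(f"{rate:.0%} of workloads are on the golden path")
```

Track this number per team as well as overall. A flat overall rate often hides one team that adopted everything and another that adopted nothing.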

4. Cost Metrics
Measure cost per deploy. Watch reserved instance coverage. Monitor log volume anomalies. Every dollar saved is headroom earned.

Use OpenTelemetry exporters for your telemetry stack. Keep observability costs sane with tail sampling and tiered storage. And don’t just alert on outages—alert on waste (e.g., exploding log volumes, underused compute).
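Alerting on waste does not need a machine learning project. A trailing baseline and a threshold catch most exploding log volumes. The sketch below assumes you can pull daily log volume in GB per service. The three-sigma threshold and the 5% floor are arbitrary starting points, not recommendations.

```python
from statistics import mean, stdev

def log_volume_alert(daily_gb, threshold_sigmas=3.0):
    """Flag the latest day's log volume if it sits far above the trailing baseline."""
    *history, today = daily_gb
    if len(history) < 2:
        return False   # not enough data for a baseline
    baseline, spread = mean(history), stdev(history)
    # Floor the spread so a perfectly flat history does not alert on tiny changes.
    limit = baseline + threshold_sigmas * max(spread, 0.05 * baseline)
    return today > limit

# Made-up daily volumes in GB; the last value is "today".
steady = [100, 104, 98, 101, 99, 102, 100, 103]
spike = [100, 104, 98, 101, 99, 102, 100, 180]
steady_alert = log_volume_alert(steady)
spike_alert = log_volume_alert(spike)
print(steady_alert, spike_alert)
```

Route these alerts to the owning team, not to the platform on-call. The people who added the noisy log line are the ones who can remove it.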

After the Confetti: Sustaining Momentum

The first time someone ships through your shiny new platform? It’s a moment. Celebrate it. Emojis in Slack. A shoutout at all-hands. Stickers, even.

But a launch is not a legacy. You need rituals to keep momentum alive:

Office hours every two weeks, open to all teams.
A newsletter that’s short, real, and helpful—not corporate filler.
Docs in Git—broken examples are fixed via PRs, not Jira tickets.
Public improvement backlog so feedback is visible, not siloed.

Track maturity using the CNCF Platform Engineering Maturity Model. Share quarterly updates with execs and engineers alike. And when feedback tanks—say, your NPS drops after a clunky redesign—own it and fix it.

The platform should evolve in full view of its users. Because that’s how trust is built.

Ten Classic Fails (and the Moves That Prevent Them)

You don’t need to invent new ways to fail. The classics still work:

The Logo Swap: You rebrand Ops and expect magic. Reality: same ticket queues, new name. Fix: Assign a true product owner with budget and power.
Gold-Plating: You overdesign before shipping anything. Fix: Enforce a 90-day MVP rule.
Mandates Over Incentives: You force adoption. Teams build shadow pipelines. Fix: Make the IDP the obvious choice, not the required one.
Telemetry Bloat: You drown in logs and traces. Fix: Tail-sample and tier your storage. Only keep what’s useful.
Empty Dashboards: You’ve got metrics, but nobody looks. Fix: Review data in your planning cadence. Make it real.
No DX Voice: You build without empathy. Fix: Always have a designer on the platform team.
Frozen Decisions: You chose tools once and never reevaluated. Fix: Document exit criteria.
Opaque Backlogs: Nobody knows what’s next. Fix: Maintain a transparent roadmap.
Tool Soup: Every team picks their stack. Fix: Offer golden paths that feel like a cheat code, not a compromise.
No Celebration: You launched, then moved on. Fix: Celebrate small wins often. Culture is cumulative.

Final Thoughts: The Paved Road Is Yours to Lay

Whether you plan for it or not, you’ll end up with a platform. The only question is: will it be coherent, productized, and lovable? Or will it be a chaotic patchwork that grows like weeds in the dark corners of your codebase?

A well-built Internal Developer Platform isn’t a luxury—it’s infrastructure for modern software delivery.

So start now. Get leadership backing. Design your governance model. Build that first golden path. Ship an MVP in 90 days. Measure obsessively. And keep listening.

Yes, it takes technical savvy. And product thinking. And loads of empathy.

But the payoff is real:

Developers get their joy back.
Auditors get real-time traceability.
Finance gets cost predictability.
Customers get features—faster and safer.

The paved road is already under construction.

All that’s missing is your first step.