# Observability 2.0: Observability Is About Asking Any Question

*For years, the observability industry has been telling developers two contradictory things: "Build reliable systems" and "Log less, it's too expensive." Think about that for a second.*

You're being asked to understand complex distributed systems (figure out why orders fail at 3am, why latency spikes for one restaurant but not another, why that deploy broke something nobody expected) by looking at *less data*.

That's like asking a detective to solve a murder but only letting them look at three pages of evidence. "We'd give you more, but filing cabinets are expensive."

The problem isn't that we need less data. The problem is that we've been collecting the **wrong shape** of data. For a decade. And the tools we built reinforced that shape, so nobody questioned it.

This post is about a different shape, and quality, of data. One that gives you **more answers for less money**. One that lets you ask questions you never thought to ask, without deploying new code. One that's been independently discovered by Stripe, Facebook, Honeycomb, and ClickStack, and now standardized by OpenTelemetry.

It's called **wide events**. And once you see it, you can't unsee it.

---

## What is observability?

Before we fix anything, let's be precise about what we're trying to achieve.

<span class="key-line">A system is **observable** if you can understand what's happening inside it by asking any question of its external outputs: without changing code, without deploying, without guessing.</span>

That's it. Not dashboards. Not Grafana. Not three pillars.

Here's the test. Something breaks in production.
You have five questions. Can your current system answer them right now, without deploying new code?

- What's the p99 latency of `/orders` for premium users in `ap-south-1`?
- Which deploy SHA introduced the latency regression?
- Are users on the new feature flag experiencing more errors than control?
- Which `restaurant_id` is causing 40% of our order-placement timeouts?
- Do cache misses correlate with the latency spike, and only for `plan=premium`?

If you can't answer three or more of these, that's not observability. That's monitoring with extra steps: you can only answer questions you anticipated and pre-built dashboards for.

Observability means you can ask **any question**, right now, without deploying new code, and get an answer. Not just the questions you anticipated when you built your dashboards. Any question. Including the ones that only become relevant at 3am when something you never imagined breaks.

If you can only answer questions you pre-planned for, that's <span class="key-line">monitoring</span>. Watching for known failures. Fire alarm, not detective work.

---

## Why can't we ask any question?

Because for the last decade, we pre-decided what questions matter. We split observability into three "pillars":

**Metrics**: pre-aggregated counters. `http_requests_total{status="500"}`. You decided in advance which dimensions to track. Want to filter by `user_id`? Can't. That's a "high cardinality" field, and the number of time series explodes.

**Logs**: text dumped per-event, structured or not. `{"level":"info","msg":"order placed","user_id":"u_123","items":3}`. Fifty lines per request, each with 2-3 fields of context. Can grep, maybe filter. But can't GROUP BY across fields that live in different log lines.

**Traces**: spans with timing. Honestly?
Mostly unread. Looked at occasionally when you needed to see a request flow across services, but never the central piece of observability. A disconnected afterthought, not the foundation everything is built on. Nobody queries traces like a database. Nobody builds SLOs from them. They just sit there.

Three separate systems. Three query languages. Three data stores. And when something breaks, you **bunny-hop** between all three trying to correlate what happened.

<div id="od-wrap">
<svg viewBox="0 0 580 240" style="width:100%;display:block;">
<defs><marker id="od-arrow" markerWidth="6" markerHeight="4" refX="6" refY="2" orient="auto"><path d="M0,0 L6,2 L0,4" fill="#475569"/></marker></defs>
<!-- App box -->
<rect x="10" y="90" width="80" height="40" rx="6" fill="#1e293b" stroke="#4ade80" stroke-width="1.5"/>
<text x="50" y="114" fill="#4ade80" font-size="10" font-family="IBM Plex Mono" text-anchor="middle">Application</text>
<!-- Arrows from app to pillars -->
<line x1="90" y1="100" x2="145" y2="50" stroke="#475569" stroke-width="1"/>
<line x1="90" y1="110" x2="145" y2="110" stroke="#475569" stroke-width="1"/>
<line x1="90" y1="120" x2="145" y2="170" stroke="#475569" stroke-width="1"/>
<!-- Pillar boxes -->
<rect x="145" y="30" width="75" height="34" rx="5" fill="#1e293b" stroke="#475569" stroke-width="1"/>
<text x="182" y="51" fill="#94a3b8" font-size="10" font-family="IBM Plex Mono" text-anchor="middle">Metrics</text>
<rect x="145" y="93" width="75" height="34" rx="5" fill="#1e293b" stroke="#475569" stroke-width="1"/>
<text x="182" y="114" fill="#94a3b8" font-size="10" font-family="IBM Plex Mono" text-anchor="middle">Traces</text>
<rect x="145" y="153" width="75" height="34" rx="5" fill="#1e293b" stroke="#475569" stroke-width="1"/>
<text x="182" y="174" fill="#94a3b8" font-size="10" font-family="IBM Plex Mono" text-anchor="middle">Logs</text>
<!-- Arrows to tools -->
<line x1="220" y1="47" x2="275" y2="47" stroke="#475569" stroke-width="1" marker-end="url(#od-arrow)"/> <line x1="220" y1="110" x2="275" y2="110" stroke="#475569" stroke-width="1" marker-end="url(#od-arrow)"/> <line x1="220" y1="170" x2="275" y2="170" stroke="#475569" stroke-width="1" marker-end="url(#od-arrow)"/> <!-- Tool boxes --> <rect x="278" y="30" width="100" height="34" rx="5" fill="#1e293b" stroke="#e5c07b" stroke-width="1"/> <text x="328" y="45" fill="#e5c07b" font-size="9" font-family="IBM Plex Mono" text-anchor="middle">Prometheus</text> <text x="328" y="56" fill="#64748b" font-size="8" font-family="IBM Plex Mono" text-anchor="middle">+ Grafana</text> <rect x="278" y="93" width="100" height="34" rx="5" fill="#1e293b" stroke="#e5c07b" stroke-width="1"/> <text x="328" y="108" fill="#e5c07b" font-size="9" font-family="IBM Plex Mono" text-anchor="middle">Jaeger / Tempo</text> <text x="328" y="119" fill="#64748b" font-size="8" font-family="IBM Plex Mono" text-anchor="middle">trace viewer</text> <rect x="278" y="153" width="100" height="34" rx="5" fill="#1e293b" stroke="#e5c07b" stroke-width="1"/> <text x="328" y="168" fill="#e5c07b" font-size="9" font-family="IBM Plex Mono" text-anchor="middle">ELK / Loki</text> <text x="328" y="179" fill="#64748b" font-size="8" font-family="IBM Plex Mono" text-anchor="middle">log search</text> <!-- Developer trying to connect all 3 --> <circle cx="480" cy="110" r="22" fill="#1e293b" stroke="#f472b6" stroke-width="1.5"/> <text x="480" y="107" fill="#f472b6" font-size="14" font-family="IBM Plex Mono" text-anchor="middle">?!</text> <text x="480" y="118" fill="#64748b" font-size="7" font-family="IBM Plex Mono" text-anchor="middle">dev</text> <!-- Messy arrows from developer to all 3 tools (they are the connector) --> <path d="M458,98 C430,70 400,50 378,47" fill="none" stroke="#f472b6" stroke-width="1.2" stroke-dasharray="3,3"/> <path d="M458,110 L378,110" fill="none" stroke="#f472b6" stroke-width="1.2" stroke-dasharray="3,3"/> 
<path d="M458,122 C430,150 400,165 378,170" fill="none" stroke="#f472b6" stroke-width="1.2" stroke-dasharray="3,3"/> <!-- Speech bubble --> <rect x="420" y="140" width="150" height="38" rx="5" fill="#1e293b" stroke="#475569" stroke-width="0.8"/> <text x="495" y="155" fill="#94a3b8" font-size="8" font-family="IBM Plex Mono" text-anchor="middle">"I am the correlation</text> <text x="495" y="166" fill="#94a3b8" font-size="8" font-family="IBM Plex Mono" text-anchor="middle">engine between these 3"</text> <polygon points="465,140 470,133 475,140" fill="#1e293b" stroke="#475569" stroke-width="0.8"/> <!-- No connection line between tools --> <line x1="328" y1="64" x2="328" y2="93" stroke="#334155" stroke-width="0.5" stroke-dasharray="2,4"/> <line x1="328" y1="127" x2="328" y2="153" stroke="#334155" stroke-width="0.5" stroke-dasharray="2,4"/> <text x="340" y="80" fill="#334155" font-size="7" font-family="IBM Plex Mono">no link</text> <text x="340" y="142" fill="#334155" font-size="7" font-family="IBM Plex Mono">no link</text> <!-- Title --> <text x="290" y="225" fill="#64748b" font-size="10" font-family="IBM Plex Mono" text-anchor="middle">The developer IS the correlation engine. 
That's the problem.</text>
</svg>
</div>

**The 3am debug dance**

1. **Grafana:** p99 latency spiked on /orders. Which service?
2. **Grafana:** order-service. But why? Metrics don't say.
3. *Context switch: open Kibana, construct query...*
4. **Kibana:** grep "error" in order-service. 2000 results. Which ones matter?
5. **Kibana:** found a timeout. But for which user? Which restaurant? Log doesn't say.
6. *Context switch: open Jaeger, find a trace...*
7. **Jaeger:** found a slow trace. DB call took 800ms. But is this the common case or an outlier?
8. *Context switch: back to Grafana, check DB metrics...*
9. **Grafana:** DB p99 is fine overall. So it's specific to... something. What?
10. *Still don't know: which restaurant? which region? which deploy? 45 minutes gone.*

45 minutes. 3 tools. Still no root cause.

This is the developer experience of "three pillars." The developer IS the correlation engine: hopping between tabs, working on **guesswork and intuition**, debugging based on scar tissue from past incidents.

Nothing connects these systems. The metric doesn't link to the trace. The trace doesn't carry the business context. The log doesn't know which user or which deploy caused it. The developer connects them manually, in their head, at 3am.

<span class="key-line">The three pillars aren't three separate things. They're three degraded views of one thing.</span>

---

## Wide, not deep

Here's the core insight. Read this carefully because it changes everything.

There are two ways to capture what happened during a request:

**Deep**: many events, each with little context:

```
[INFO] Request started: POST /v1/orders
[INFO] User authenticated: user_id=u_123
[INFO] Fetching restaurant: restaurant_id=r_456
[DEBUG] Cache miss for restaurant menu
[INFO] Order placed: items=3, total=750, currency=INR
[INFO] Calling delivery-assignment: eta_min=25
[INFO] Response sent: status=200, duration=340ms
```

Seven log lines. Each carries 1-2 bits of context. To answer "which restaurants have cache misses AND slow responses?" you need to **JOIN** across log lines: correlate line 3 with line 4 with line 7. At scale, with millions of requests, that JOIN is either slow or impossible.

Or, more commonly, the developer doesn't even try. They think of the question, give up on logs, and add a Prometheus metric directly from the app: `cache_miss_total{restaurant="..."}`. Next week, another question, another metric. The codebase fills up with ad-hoc counters, each one a pre-aggregated answer to exactly one question somebody thought of in advance.
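To make the cost of that JOIN concrete, here is a minimal Python sketch that stitches deep log lines back into one record per request before it can answer the cache-miss question. The `req=` prefix, the second request, and the helper names `stitch` and `slow_cache_missers` are assumptions for illustration; the log lines above carry no request ID at all, which makes the real problem harder.

```python
import re
from collections import defaultdict

# Hypothetical deep logs for two interleaved requests. The "req=" prefix is an
# assumption added for this sketch so the lines can be correlated at all.
DEEP_LOGS = """\
req=1 [INFO] Fetching restaurant: restaurant_id=r_456
req=2 [INFO] Fetching restaurant: restaurant_id=r_789
req=1 [DEBUG] Cache miss for restaurant menu
req=1 [INFO] Response sent: status=200, duration=340ms
req=2 [INFO] Response sent: status=200, duration=95ms
"""


def stitch(raw: str) -> dict[str, dict]:
    """Fold scattered log lines back into one record per request (the manual JOIN)."""
    requests: dict[str, dict] = defaultdict(lambda: {"cache_miss": False})
    for line in raw.splitlines():
        req_id = re.match(r"req=(\S+)", line).group(1)
        record = requests[req_id]
        if m := re.search(r"restaurant_id=(\S+)", line):
            record["restaurant_id"] = m.group(1)
        if "Cache miss" in line:
            record["cache_miss"] = True
        if m := re.search(r"duration=(\d+)ms", line):
            record["duration_ms"] = int(m.group(1))
    return dict(requests)


def slow_cache_missers(raw: str, threshold_ms: int = 300) -> list[str]:
    """Restaurants with a cache miss AND a slow response on the same request."""
    return sorted({
        r["restaurant_id"]
        for r in stitch(raw).values()
        if r["cache_miss"] and r.get("duration_ms", 0) > threshold_ms
    })


print(slow_cache_missers(DEEP_LOGS))
```

Every new question needs another regex and another stitching pass over the logs, which is one reason developers reach for an ad-hoc counter instead.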
What if instead, while processing the request, you emit **one event** and enrich it with everything? Every business attribute (restaurant, order, delivery), every infra attribute (k8s pod, region, deploy SHA), every HTTP attribute (route, status, duration), every dependency stat (cache hit, DB query count).

One event. All context. Written once at the end of the request. Then ask any question.

That's it. That's the simple, beautiful design.

**Wide**: one event, all context:

```json
{
  "trace_id": "abc-123",
  "service": "order-service",
  "http.method": "POST",
  "http.route": "/v1/orders",
  "http.status": 200,
  "duration_ms": 340,
  "user.id": "u_123",
  "user.plan": "premium",
  "restaurant.id": "r_456",
  "restaurant.cache_hit": false,
  "order.item_count": 3,
  "order.total": 750,
  "delivery.eta_min": 25,
  "deploy.sha": "a1b2c3d",
  "feature_flags.new_cart": true,
  "db.query_count": 3,
  "db.total_ms": 120,
  "region": "ap-south-1"
}
```

One event. Eighteen fields. No JOINs needed. GROUP BY anything. Filter by anything.
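As a rough illustration of the idea, the Python sketch below builds one wide event per request as a plain dict and then answers two different questions from the same list of events. The `handle_order` and `avg_duration_by` helpers and all sample values are hypothetical; a real service would ship each event to a queryable store instead of keeping it in memory.

```python
from collections import defaultdict
from typing import Callable


def handle_order(user_id: str, plan: str, restaurant_id: str,
                 cache_hit: bool, duration_ms: int) -> dict:
    """Hypothetical request handler: enrich one dict as the request proceeds."""
    event = {"service": "order-service", "http.route": "/v1/orders"}
    event.update({"user.id": user_id, "user.plan": plan})  # known after auth
    event.update({"restaurant.id": restaurant_id,
                  "restaurant.cache_hit": cache_hit})  # known after the menu fetch
    event.update({"http.status": 200, "duration_ms": duration_ms})  # known at response time
    return event  # emitted once, at the end of the request


def avg_duration_by(events: list[dict], group_by: str,
                    where: Callable[[dict], bool] = lambda e: True) -> dict:
    """Rough equivalent of SELECT group_by, AVG(duration_ms) ... WHERE ... GROUP BY group_by."""
    buckets = defaultdict(list)
    for e in events:
        if where(e):
            buckets[e[group_by]].append(e["duration_ms"])
    return {key: sum(vals) / len(vals) for key, vals in buckets.items()}


# Sample events with made-up values.
events = [
    handle_order("u_123", "premium", "r_456", cache_hit=False, duration_ms=340),
    handle_order("u_124", "free", "r_456", cache_hit=False, duration_ms=320),
    handle_order("u_125", "premium", "r_789", cache_hit=True, duration_ms=90),
]

# Question 1: which restaurants are slow on cache misses?
slow_misses = avg_duration_by(
    events, "restaurant.id",
    where=lambda e: not e["restaurant.cache_hit"] and e["duration_ms"] > 300,
)
# Question 2: same events, different GROUP BY, no new instrumentation.
by_plan = avg_duration_by(events, "user.plan")

print(slow_misses)
print(by_plan)
```

The second question needed no change to `handle_order`: because every field already sits on the same row, a new question is only a new filter and grouping key.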
<span class="key-line">Ask any question.</span>

Fold the seven log lines into one wide event, one at a time, and each step unlocks a new kind of question:

1. `Request started` adds `http.method` and `http.route`. Now you can filter by method and route.
2. `User authenticated` adds `user.id` and `user.plan`. Now you can GROUP BY `user.plan` and see if premium users are slower.
3. `Fetching restaurant` adds `restaurant.id`. Now you can correlate restaurant with everything else.
4. `Cache miss` adds `restaurant.cache_hit`. Now you can ask "do cache misses cause latency?" with one WHERE clause.
5. `Order placed` adds `order.total` and `order.item_count`. Now you can ask "which restaurants have highest order totals but slowest responses?"
6. `Delivery assignment` adds `delivery.eta_min`. Now you can ask "are high ETAs correlated with cache misses for premium users?" in ONE query. No JOINs.
7. `Response` adds `http.status` and `duration_ms`. Same data. One row. Far more queryable. Every question answered without deploying new code.

I had a [discussion on LinkedIn](https://www.linkedin.com/feed/update/urn:li:activity:7466852629856362497/?dashCommentUrn=urn%3Ali%3Afsd_comment%3A%287467155940493148160%2Curn%3Ali%3Aactivity%3A7466852629856362497%29&dashReplyUrn=urn%3Ali%3Afsd_comment%3A%287467818357317206016%2Curn%3Ali%3Aactivity%3A7466852629856362497%29) about this with Martin Thwaites (who's been building observability tooling for a decade) and the framing he used stuck with me, **"wide not deep"**: 100 custom dimensions on one event, not 100 log lines with
1-2 bits of context each.

The wide event isn't more data. It's the **same data, better shaped**. Every field that was scattered across 7 log lines is now in one row. No JOINs. No correlation. Just `WHERE restaurant.cache_hit = false AND duration_ms > 300 GROUP BY restaurant.id`.

<span class="key-line">You're not logging more. You're logging differently.</span>

---

## The power: ask any question

Ok so you have wide events. Now what? Let's see what "ask any question" actually looks like.

Here's one event from our dataset: a single request to the order service. Every request emits one of these:

<div id="se-wrap">
<span id="se-count">× 10,000 requests/sec</span>
<div id="se-json">{
<span class="se-k">"trace_id"</span>: <span class="se-v-str">"abc-123-def"</span>,
<span class="se-k">"service"</span>: <span class="se-v-str">"order-service"</span>,
<span class="se-k">"http.route"</span>: <span class="se-v-str">"/v1/orders"</span>,
<span class="se-k">"http.status"</span>: <span class="se-v-num">200</span>,
<span class="se-k">"duration_ms"</span>: <span class="se-v-num">340</span>,
<span class="se-k">"user.id"</span>: <span class="se-v-str">"u_123"</span>,
<span class="se-k">"user.plan"</span>: <span class="se-v-str">"premium"</span>,
<span class="se-k">"restaurant.id"</span>: <span class="se-v-str">"r_dosa_corner"</span>,
<span class="se-k">"cache_hit"</span>: <span class="se-v-bool">false</span>,
<span class="se-k">"deploy.sha"</span>: <span class="se-v-str">"sha_g7h8i9"</span>,
<span class="se-k">"region"</span>: <span class="se-v-str">"ap-south-1"</span>,
<span class="se-k">"order.total"</span>: <span class="se-v-num">750</span>,
<span class="se-k">"delivery.eta_min"</span>: <span class="se-v-num">25</span>
}</div>
</div>

We have thousands of these per second. Each one with all dimensions: infra, business, user, performance.

Now watch what happens when you query this data. Same dataset, different questions. Just change the GROUP BY:

<div id="sd-wrap">
<div id="sd-question"></div>
<div class="sd-presets">
<span class="sd-preset" data-g="region" data-w="" data-q="Which region is slowest?">Infra: by region</span>
<span class="sd-preset" data-g="restaurant.id" data-w="" data-q="Which restaurant is causing the most latency?">Business: by restaurant</span>
<span class="sd-preset" data-g="user.plan" data-w="" data-q="Are premium users experiencing worse performance?">By user plan</span>
<span class="sd-preset" data-g="cache_hit" data-w="" data-q="How much does a cache miss cost in latency?">Cache hit vs miss</span>
<span class="sd-preset" data-g="deploy.sha" data-w="cache_hit=false" data-q="Which deploy broke the cache?">deploy × cache miss</span>
<span class="sd-preset" data-g="restaurant.id" data-w="region=ap-south-1,user.plan=premium" data-q="Which restaurant is slow for premium users in ap-south-1?">premium + region + restaurant</span>
</div>
<div id="sd-controls-row">
<label>GROUP BY</label>
<select id="sd-groupby">
<option value="region">region</option>
<option value="restaurant.id">restaurant.id</option>
<option value="user.plan">user.plan</option>
<option value="cache_hit">cache_hit</option>
<option value="deploy.sha">deploy.sha</option>
</select>
<label>WHERE</label>
<select id="sd-where">
<option value="">(no filter)</option>
<option value="cache_hit=false">cache_hit = false</option>
<option value="region=ap-south-1">region = ap-south-1</option>
<option value="user.plan=premium">user.plan = premium</option>
<option value="duration_ms>300">duration_ms > 300</option>
</select> </div> <div id="sd-sql"></div> <div id="sd-chart"><canvas id="sd-canvas"></canvas></div> <div id="sd-legend"></div> <div id="sd-insight"></div> </div> (function(){ var canvas=document.getElementById('sd-canvas'); var ctx=canvas.getContext('2d'); var colors=['#4ade80','#3b82f6','#e5c07b','#f472b6','#a78bfa','#22d3ee']; var data=[]; var regions=['ap-south-1','us-east-1','eu-west-1']; var restaurants=['r_biryani','r_pizza','r_sushi','r_dosa','r_burger']; var plans=['premium','starter','free']; var deploys=['sha_a1b','sha_d4e','sha_g7h']; var seed=42;function srand(){seed=(seed*1103515245+12345)&0x7fffffff;return seed/0x7fffffff;} for(var t=0;t<10;t++){ for(var j=0;j<15;j++){ var ri=j%3;var region=regions[ri]; var restaurant=restaurants[j%5]; var plan=plans[j%3]; var deploy=t>=5?'sha_g7h':(t>=2?'sha_d4e':'sha_a1b'); var cache=srand()>0.3; var dur; if(region==='ap-south-1'){dur=cache?90+srand()*30:180+srand()*60;if(t>=5)dur+=40+t*15;} else if(region==='us-east-1'){dur=cache?70+srand()*20:140+srand()*40;} else{dur=cache?80+srand()*25:160+srand()*50;} if(!cache&&deploy==='sha_g7h')dur+=60+t*10; if(restaurant==='r_dosa'&&t>=6)dur+=100; data.push({t:t,region:region,'restaurant.id':restaurant,'user.plan':plan,cache_hit:cache,'deploy.sha':deploy,duration_ms:Math.round(dur)}); } } function query(groupBy,where){ var filtered=data; if(where){ where.split(',').forEach(function(w){ var parts=w.split('='); if(w.indexOf('>')>-1){parts=w.split('>');filtered=filtered.filter(function(d){return d[parts[0]]>parseFloat(parts[1]);});} else{var val=parts[1]==='false'?false:parts[1]==='true'?true:isNaN(parts[1])?parts[1]:parseFloat(parts[1]);filtered=filtered.filter(function(d){return d[parts[0]]===val||String(d[parts[0]])===parts[1];});} }); } var groups={}; filtered.forEach(function(d){ var key=String(d[groupBy]); if(!groups[key])groups[key]={timeBuckets:{}}; if(!groups[key].timeBuckets[d.t])groups[key].timeBuckets[d.t]={total:0,count:0}; 
groups[key].timeBuckets[d.t].total+=d.duration_ms; groups[key].timeBuckets[d.t].count++; }); var series=[]; for(var k in groups){ var points=[]; for(var t=0;t<10;t++){ var b=groups[k].timeBuckets[t]; points.push(b?Math.round(b.total/b.count):0); } series.push({label:k,points:points}); } series.sort(function(a,b){var la=a.points[a.points.length-1],lb=b.points[b.points.length-1];return lb-la;}); return series; } function drawChart(series){ var W=canvas.width=canvas.offsetWidth*2; var H=canvas.height=canvas.offsetHeight*2; ctx.clearRect(0,0,W,H); var pad={l:65,r:30,t:25,b:35}; var cW=W-pad.l-pad.r,cH=H-pad.t-pad.b; var maxVal=0; series.forEach(function(s){s.points.forEach(function(p){if(p>maxVal)maxVal=p;});}); maxVal=maxVal||100; ctx.strokeStyle='#334155';ctx.lineWidth=1; for(var i=0;i<=4;i++){ var y=pad.t+cH*(1-i/4); ctx.beginPath();ctx.moveTo(pad.l,y);ctx.lineTo(W-pad.r,y);ctx.stroke(); ctx.fillStyle='#64748b';ctx.font=(H>300?'20':'14')+'px IBM Plex Mono';ctx.textAlign='right'; ctx.fillText(Math.round(maxVal*i/4)+'ms',pad.l-8,y+5); } for(var t=0;t<10;t++){ var x=pad.l+t/(9)*cW; ctx.fillStyle='#475569';ctx.textAlign='center';ctx.font=(H>300?'18':'12')+'px IBM Plex Mono'; ctx.fillText('t'+t,x,H-8); } series.forEach(function(s,si){ ctx.beginPath();ctx.strokeStyle=colors[si%colors.length];ctx.lineWidth=3; s.points.forEach(function(p,pi){ var x=pad.l+pi/9*cW; var y=pad.t+cH*(1-p/maxVal); if(pi===0)ctx.moveTo(x,y);else ctx.lineTo(x,y); }); ctx.stroke(); var lastX=pad.l+9/9*cW; var lastY=pad.t+cH*(1-s.points[9]/maxVal); ctx.beginPath();ctx.arc(lastX,lastY,5,0,Math.PI*2);ctx.fillStyle=colors[si%colors.length];ctx.fill(); }); } function render(questionText){ var g=document.getElementById('sd-groupby').value; var w=document.getElementById('sd-where').value; var series=query(g,w); document.getElementById('sd-sql').innerHTML='<span class="kw">SELECT</span> <span class="col">'+g+'</span>, <span class="col">AVG(duration_ms)</span> <span class="kw">FROM</span> spans'+(w?' 
<span class="kw">WHERE</span> <span class="val">'+w+'</span>':'')+' <span class="kw">GROUP BY</span> <span class="col">'+g+'</span>'; drawChart(series); var legendEl=document.getElementById('sd-legend'); legendEl.innerHTML=series.map(function(s,i){return '<span class="sd-legend-item"><span class="sd-legend-dot" style="background:'+colors[i%colors.length]+'"></span>'+s.label+' ('+s.points[9]+'ms)</span>';}).join(''); var insightEl=document.getElementById('sd-insight'); /* skip the ratio when the fastest group has no data in the last bucket, otherwise it divides by zero */ if(series.length>1&&series[series.length-1].points[9]>0&&series[0].points[9]>1.5*series[series.length-1].points[9]){ insightEl.innerHTML='<span style="color:#4ade80;">Answer:</span> <strong>'+series[0].label+'</strong> is '+Math.round(series[0].points[9]/series[series.length-1].points[9])+'x slower than '+series[series.length-1].label+'. Found in one query — no new instrumentation needed.'; } else { insightEl.innerHTML='<span style="color:#94a3b8;">Results shown above — try different GROUP BY and WHERE combinations.</span>'; } if(questionText)document.getElementById('sd-question').textContent=questionText; } document.getElementById('sd-groupby').addEventListener('change',function(){render();}); document.getElementById('sd-where').addEventListener('change',function(){render();}); document.querySelectorAll('#slice-dice .sd-preset').forEach(function(p){ p.addEventListener('click',function(){ document.querySelectorAll('#slice-dice .sd-preset').forEach(function(x){x.classList.remove('sd-active');}); this.classList.add('sd-active'); document.getElementById('sd-groupby').value=this.getAttribute('data-g'); var w=this.getAttribute('data-w'); var sel=document.getElementById('sd-where'); sel.value=w; if(!sel.value&&w){var opt=document.createElement('option');opt.value=w;opt.textContent=w;sel.appendChild(opt);sel.value=w;} render(this.getAttribute('data-q')); }); }); render('Which region is slowest?'); document.querySelector('#slice-dice .sd-preset').classList.add('sd-active'); setTimeout(function(){ var 
presets=document.querySelectorAll('#slice-dice .sd-preset'); var idx=1; var auto=setInterval(function(){ if(idx>=4){clearInterval(auto);return;} presets[idx].click();idx++; },3000); },3000); })(); </div> Same data. Different questions. Infra team asks "which region is slow?" — change the GROUP BY. Product team asks "which restaurant is causing timeouts?" — change the GROUP BY. Combined question nobody planned for: "which deploy broke cache misses for premium users in ap-south-1?" — add a WHERE, change the GROUP BY. <span class="key-line">No new instrumentation. No new deploys. Just different queries on the same wide events.</span> --- ## Wait — what about my dashboards? I still need metrics. You do. But you don't need to **pre-aggregate them from the app anymore**. Every metric you've ever built from Prometheus counters — latency percentiles, error rates, throughput — is just a query on the same raw wide events. No separate instrumentation. No `prometheus.NewHistogram()` in your code. No cardinality limits. The raw data IS your metrics. 
You just query it differently: #mfs-wrap{font-family:'IBM Plex Mono',monospace;font-size:11px;max-width:680px;margin:1em auto;} .mfs-card{background:#1e293b;border:1px solid #334155;border-radius:8px;padding:14px;margin:10px 0;} .mfs-card-header{display:flex;justify-content:space-between;align-items:center;margin-bottom:8px;} .mfs-card-title{color:#c8d6e5;font-size:12px;font-weight:600;} .mfs-card-value{font-size:20px;font-weight:700;} .mfs-query{font-size:9px;color:#64748b;padding:6px 8px;background:#0f1219;border-radius:4px;margin-bottom:8px;font-family:'IBM Plex Mono';} .mfs-query .kw{color:#4ade80;} .mfs-mini-chart{height:50px;display:flex;align-items:flex-end;gap:2px;} .mfs-mini-bar{flex:1;border-radius:2px 2px 0 0;transition:height 0.3s;} .mfs-row{display:grid;grid-template-columns:1fr 1fr;gap:10px;} .mfs-section-title{color:#94a3b8;font-size:10px;text-transform:uppercase;letter-spacing:0.08em;margin:16px 0 6px;font-weight:600;} <div id="mfs-wrap"> <div class="mfs-section-title">APM — computed from raw spans, no pre-aggregation</div> <div class="mfs-row"> <div class="mfs-card"> <div class="mfs-card-header"> <span class="mfs-card-title">P99 Latency</span> <span class="mfs-card-value" style="color:#e5c07b;">482ms</span> </div> <div class="mfs-query"><span class="kw">SELECT</span> PERCENTILE(duration_ms, 99) <span class="kw">FROM</span> spans <span class="kw">WHERE</span> service = 'order-service'</div> <div class="mfs-mini-chart" id="mfs-latency"></div> </div> <div class="mfs-card"> <div class="mfs-card-header"> <span class="mfs-card-title">Error Rate</span> <span class="mfs-card-value" style="color:#f472b6;">2.3%</span> </div> <div class="mfs-query"><span class="kw">SELECT</span> SUM(CASE WHEN status &gt;= 500 THEN 1 ELSE 0 END) * 100.0 / COUNT(*) <span class="kw">FROM</span> spans <span class="kw">WHERE</span> service = 'order-service'</div> <div class="mfs-mini-chart" id="mfs-errors"></div> </div> </div> <div class="mfs-row"> <div class="mfs-card"> <div class="mfs-card-header"> <span 
class="mfs-card-title">Throughput</span> <span class="mfs-card-value" style="color:#4ade80;">12.4k rpm</span> </div> <div class="mfs-query"><span class="kw">SELECT</span> COUNT(*) / interval_minutes <span class="kw">FROM</span> spans <span class="kw">GROUP BY</span> minute</div> <div class="mfs-mini-chart" id="mfs-throughput"></div> </div> <div class="mfs-card"> <div class="mfs-card-header"> <span class="mfs-card-title">Cache Hit Rate</span> <span class="mfs-card-value" style="color:#22d3ee;">67%</span> </div> <div class="mfs-query"><span class="kw">SELECT</span> SUM(CASE WHEN cache_hit THEN 1 ELSE 0 END) * 100.0 / COUNT(*) <span class="kw">FROM</span> spans</div> <div class="mfs-mini-chart" id="mfs-cache"></div> </div> </div> <div class="mfs-section-title">Business dashboards — same data, business GROUP BY</div> <div class="mfs-row"> <div class="mfs-card"> <div class="mfs-card-header"> <span class="mfs-card-title">Orders by Region</span> <span class="mfs-card-value" style="color:#3b82f6;font-size:14px;">ap-south: 58%</span> </div> <div class="mfs-query"><span class="kw">SELECT</span> region, COUNT(*) <span class="kw">FROM</span> spans <span class="kw">GROUP BY</span> region</div> <div class="mfs-mini-chart" id="mfs-region"></div> </div> <div class="mfs-card"> <div class="mfs-card-header"> <span class="mfs-card-title">Avg Order Value by Plan</span> <span class="mfs-card-value" style="color:#a78bfa;font-size:14px;">premium: ₹920</span> </div> <div class="mfs-query"><span class="kw">SELECT</span> user_plan, AVG(order_total) <span class="kw">FROM</span> spans <span class="kw">GROUP BY</span> user_plan</div> <div class="mfs-mini-chart" id="mfs-aov"></div> </div> </div> </div> (function(){ function miniChart(id,values,color){ var el=document.getElementById(id);if(!el)return; var max=Math.max.apply(null,values); el.innerHTML=values.map(function(v){return '<div class="mfs-mini-bar" 
style="height:'+Math.max(4,Math.round(v/max*46))+'px;background:'+color+';opacity:'+(0.5+0.5*v/max)+';"></div>';}).join(''); } miniChart('mfs-latency',[120,135,128,145,160,180,220,310,420,482],'#e5c07b'); miniChart('mfs-errors',[0.5,0.8,0.6,0.9,1.1,1.4,1.8,2.0,2.1,2.3],'#f472b6'); miniChart('mfs-throughput',[11.2,11.8,12.0,12.1,12.4,12.3,12.5,12.2,12.6,12.4],'#4ade80'); miniChart('mfs-cache',[72,71,70,69,68,67,66,67,67,67],'#22d3ee'); miniChart('mfs-region',[58,55,52,54,56,58,57,58,59,58],'#3b82f6'); miniChart('mfs-aov',[880,890,900,910,905,915,920,918,920,920],'#a78bfa'); })(); </div> Every single one of these "metrics" is a query on the raw span data. No Prometheus counters in your code. No pre-aggregation. No cardinality explosion when you want to add a new dimension. And the killer part: **these dashboards are live views, not frozen snapshots**. Want to suddenly see error rate broken down by `restaurant.id`? Just add a GROUP BY. With pre-aggregated metrics you'd need to add new instrumentation, deploy, and wait. With raw wide events — it's already there. You just never asked before. <span class="key-line">Metrics don't disappear in this world. They become queries instead of code.</span> --- ## Automatic outlier detection: the system finds the problem for you Slicing and dicing is powerful when you know what to ask. But what about when you **don't know what to ask**? You just see "latency spiked" — no hypothesis, no clue where to start. This is where **automatic outlier detection** changes everything. You select the anomalous area on the heatmap. The system compares those traces against the baseline — and automatically surfaces which dimensions are different. The cause surfaces without you guessing. Think about it: a small restaurant with 10 orders/day has a problem. In aggregate metrics, their 10 slow requests are invisible against 100,000 fast ones. But with outlier detection — you circle the anomaly, and that restaurant's name surfaces immediately. 
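The comparison described above can be sketched in a few lines of Go. This is a minimal illustration, assuming each wide event is a flat map of string attributes; the `Event`, `Diff`, and `Compare` names are hypothetical and do not correspond to any vendor's API. It computes, for every `attribute=value` pair, its share of the selection and its share of the baseline, then ranks pairs by the gap.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// Event is one wide event, reduced here to string-valued attributes.
type Event map[string]string

// Diff holds the share (0..1) of events carrying Attr=Value in each group.
type Diff struct {
	Attr, Value         string
	BaseShare, SelShare float64
}

// shares returns, for every attr=value pair, the fraction of events carrying it.
func shares(events []Event) map[[2]string]float64 {
	counts := map[[2]string]int{}
	for _, e := range events {
		for k, v := range e {
			counts[[2]string{k, v}]++
		}
	}
	out := make(map[[2]string]float64, len(counts))
	for kv, n := range counts {
		out[kv] = float64(n) / float64(len(events))
	}
	return out
}

// Compare ranks attribute values by the absolute gap between their share of
// the selection and their share of the baseline, largest gap first.
func Compare(baseline, selection []Event) []Diff {
	base, sel := shares(baseline), shares(selection)
	seen := map[[2]string]bool{}
	var diffs []Diff
	for _, m := range []map[[2]string]float64{base, sel} {
		for kv := range m {
			if seen[kv] {
				continue
			}
			seen[kv] = true
			diffs = append(diffs, Diff{Attr: kv[0], Value: kv[1], BaseShare: base[kv], SelShare: sel[kv]})
		}
	}
	sort.Slice(diffs, func(i, j int) bool {
		gi := math.Abs(diffs[i].SelShare - diffs[i].BaseShare)
		gj := math.Abs(diffs[j].SelShare - diffs[j].BaseShare)
		if gi != gj {
			return gi > gj
		}
		// Tie-break on name so the ranking is deterministic.
		if diffs[i].Attr != diffs[j].Attr {
			return diffs[i].Attr < diffs[j].Attr
		}
		return diffs[i].Value < diffs[j].Value
	})
	return diffs
}

func main() {
	baseline := []Event{
		{"region": "us-east-1", "feature_flag.new_cart": "false"},
		{"region": "ap-south-1", "feature_flag.new_cart": "false"},
		{"region": "eu-west-1", "feature_flag.new_cart": "false"},
		{"region": "us-east-1", "feature_flag.new_cart": "true"},
	}
	selection := []Event{
		{"region": "ap-south-1", "feature_flag.new_cart": "true"},
		{"region": "us-east-1", "feature_flag.new_cart": "true"},
	}
	for _, d := range Compare(baseline, selection) {
		fmt.Printf("%s=%s baseline=%.0f%% selection=%.0f%%\n",
			d.Attr, d.Value, d.BaseShare*100, d.SelShare*100)
	}
}
```

The sketch only works because every event carries every dimension on the same row. A real implementation would also need to handle high-cardinality and numeric attributes, which this version ignores.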
#bu-wrap{font-family:'IBM Plex Mono',monospace;font-size:11px;max-width:680px;margin:1em auto;border:1px solid #334155;border-radius:8px;padding:14px;background:#0f1219;} #bu-heatmap{position:relative;height:160px;background:#1a1f2e;border-radius:6px;margin:8px 0;overflow:hidden;} #bu-canvas{width:100%;height:100%;} #bu-hint{text-align:center;font-size:10px;color:#64748b;margin:6px 0;} #bu-cards{display:grid;grid-template-columns:1fr 1fr 1fr;gap:8px;margin-top:12px;opacity:0;transition:opacity 0.5s;} #bu-cards.bu-show{opacity:1;} .bu-card{background:#1e293b;border:1px solid #334155;border-radius:6px;padding:10px;} .bu-card.bu-hot{border-color:#e5c07b;} .bu-card-title{font-size:9px;color:#64748b;margin-bottom:8px;display:flex;justify-content:space-between;align-items:center;} .bu-hot-badge{color:#e5c07b;font-weight:700;font-size:8px;text-transform:uppercase;letter-spacing:0.05em;} .bu-card-chart{display:flex;align-items:flex-end;gap:3px;height:50px;padding:0 4px;} .bu-bar-group{display:flex;flex-direction:column;align-items:center;flex:1;gap:0;} .bu-bar-pair{display:flex;gap:2px;align-items:flex-end;width:100%;justify-content:center;} .bu-bar{border-radius:2px 2px 0 0;width:10px;} .bu-bar-base{background:#3b82f6;} .bu-bar-sel{background:#e5c07b;} .bu-bar-label{font-size:7px;color:#475569;margin-top:3px;text-align:center;white-space:nowrap;} .bu-card-legend{display:flex;gap:10px;font-size:8px;color:#64748b;margin-top:6px;padding-top:6px;border-top:1px solid #334155;} #bu-verdict{margin-top:12px;padding:12px;background:#1e293b;border:1px solid #e5c07b;border-radius:6px;font-size:12px;color:#c8d6e5;display:none;text-align:center;line-height:1.5;} #bu-btn{display:block;margin:8px auto 0;background:#1e293b;border:1px solid #475569;color:#94a3b8;font-family:'IBM Plex Mono';font-size:10px;padding:5px 14px;border-radius:4px;cursor:pointer;} #bu-btn:hover{border-color:#e5c07b;color:#e5c07b;} <div id="bu-wrap"> <div 
style="display:flex;justify-content:space-between;align-items:center;margin-bottom:6px;"> <span style="color:#94a3b8;font-size:10px;">Latency over time — each dot = 1 request</span> <span style="font-size:9px;"><span style="color:#3b82f6;">●</span> baseline <span style="color:#e5c07b;">●</span> selection</span> </div> <div id="bu-heatmap"><canvas id="bu-canvas"></canvas></div> <div id="bu-hint">12 requests selected out of 132 total — anomaly area highlighted</div> <button id="bu-btn">Select anomaly area ▶</button> <div id="bu-cards"></div> <div id="bu-verdict"></div> </div> (function(){ var canvas=document.getElementById('bu-canvas'); var ctx=canvas.getContext('2d'); var W=680,H=160; canvas.width=W*2;canvas.height=H*2;ctx.scale(2,2); var seed=7;function sr(){seed=(seed*48271)%2147483647;return seed/2147483647;} var dots=[]; for(var i=0;i<120;i++){ var x=sr()*W; var y=H*0.55+sr()*H*0.35; dots.push({x:x,y:y,sel:false}); } for(var i=0;i<12;i++){ var x=W*0.35+sr()*W*0.2; var y=H*0.08+sr()*H*0.35; dots.push({x:x,y:y,sel:true}); } var selBox={x1:W*0.33,y1:H*0.02,x2:W*0.58,y2:H*0.48}; function draw(showSel){ ctx.clearRect(0,0,W,H); dots.forEach(function(d){ ctx.beginPath();ctx.arc(d.x,d.y,3.5,0,Math.PI*2); ctx.fillStyle=showSel&&d.sel?'#e5c07b':'#3b82f6'; ctx.globalAlpha=showSel&&d.sel?1:0.55; ctx.fill(); }); ctx.globalAlpha=1; if(showSel){ ctx.strokeStyle='#e5c07b';ctx.lineWidth=1.5;ctx.setLineDash([5,4]); ctx.strokeRect(selBox.x1,selBox.y1,selBox.x2-selBox.x1,selBox.y2-selBox.y1); ctx.setLineDash([]); } } draw(false); var cardsData=[ {name:'feature_flag.new_cart',hot:true,values:[ {label:'true',base:12,sel:92}, {label:'false',base:88,sel:8} ]}, {name:'cache_hit',hot:true,values:[ {label:'true',base:68,sel:8}, {label:'false',base:32,sel:92} ]}, {name:'region',hot:true,values:[ {label:'ap-south',base:35,sel:83}, {label:'us-east',base:33,sel:8}, {label:'eu-west',base:32,sel:9} ]}, {name:'user.plan',hot:false,values:[ {label:'premium',base:33,sel:42}, 
{label:'starter',base:34,sel:33}, {label:'free',base:33,sel:25} ]}, {name:'deploy.sha',hot:true,values:[ {label:'sha_a1b',base:30,sel:0}, {label:'sha_d4e',base:35,sel:8}, {label:'sha_g7h',base:35,sel:92} ]}, {name:'http.status',hot:true,values:[ {label:'200',base:96,sel:33}, {label:'500',base:4,sel:67} ]} ]; function showCards(){ draw(true); var cards=document.getElementById('bu-cards'); var html=''; cardsData.forEach(function(dim){ var maxPct=Math.max.apply(null,dim.values.map(function(v){return Math.max(v.base,v.sel);})); html+='<div class="bu-card'+(dim.hot?' bu-hot':'')+'"><div class="bu-card-title"><span>'+dim.name+'</span>'+(dim.hot?'<span class="bu-hot-badge">DIFFERENT</span>':'<span style="color:#475569;font-size:8px;">similar</span>')+'</div><div class="bu-card-chart">'; dim.values.forEach(function(v){ var bH=Math.max(2,Math.round(v.base/maxPct*44)); var sH=Math.max(2,Math.round(v.sel/maxPct*44)); html+='<div class="bu-bar-group"><div class="bu-bar-pair"><div class="bu-bar bu-bar-base" style="height:'+bH+'px;"></div><div class="bu-bar bu-bar-sel" style="height:'+sH+'px;"></div></div><div class="bu-bar-label">'+v.label+'</div></div>'; }); html+='</div><div class="bu-card-legend"><span><span style="color:#3b82f6;">■</span> baseline</span><span><span style="color:#e5c07b;">■</span> selection</span></div></div>'; }); cards.innerHTML=html; cards.classList.add('bu-show'); document.getElementById('bu-verdict').style.display='block'; document.getElementById('bu-verdict').innerHTML='<span style="color:#e5c07b;font-weight:700;">Root cause found:</span> The selected anomaly is overwhelmingly <span style="color:#e5c07b;">feature_flag.new_cart=true</span> + <span style="color:#e5c07b;">cache_hit=false</span> + <span style="color:#e5c07b;">deploy=sha_g7h</span>. The new feature flag broke the cache in ap-south-1. 
No hypothesis needed — the data told you.'; } document.getElementById('bu-btn').addEventListener('click',function(){ if(this.textContent.indexOf('Reset')>-1){ draw(false); document.getElementById('bu-cards').classList.remove('bu-show'); document.getElementById('bu-cards').innerHTML=''; document.getElementById('bu-verdict').style.display='none'; this.textContent='Select anomaly area ▶'; } else { showCards(); this.textContent='↺ Reset'; } }); setTimeout(showCards,3000); })(); </div>

That's **automatic outlier detection**. You see an anomaly cluster — latency spiked for some requests. You have no idea why. You select that area. The system compares those requests against the baseline and shows you: `feature_flag.new_cart=true` is 92% of the anomaly but only 12% of the baseline. The new feature flag broke the cache in one region. Found in seconds, not hours.

This only works with wide events. If your data is scattered across 50 log lines, you can't compare dimensions across the selection and the baseline — there's nothing to compare. <span class="key-line">Wide data enables automatic root cause discovery.</span>

---

## The data: what goes into a wide event

This is the most important part. Slice-and-dice and outlier detection only work if your attributes are **rich** — if they capture what happened, where it happened, how it happened, who it happened to, and what was different about this request.

Most teams put 5-10 attributes on a span and call it a day. That tells you very little when things break. You need dimensions that explain every part of the request's processing.

A truly wide event has **100-300 attributes** spanning infrastructure, HTTP, user context, business logic, dependencies, and runtime. The goal: when something breaks, the answer is already in the data. You don't need to add more instrumentation and wait for the next occurrence.
Here's what goes on every single request event — grouped by who cares about it: #ac-wrap{font-family:'IBM Plex Mono',monospace;font-size:11px;max-width:600px;margin:1em auto;border:1px solid #334155;border-radius:8px;background:#0f1219;overflow:hidden;} #ac-json{padding:14px 16px;max-height:420px;overflow-y:auto;line-height:1.7;color:#94a3b8;white-space:pre;} #ac-json::-webkit-scrollbar{width:6px;} #ac-json::-webkit-scrollbar-track{background:#0f1219;} #ac-json::-webkit-scrollbar-thumb{background:#334155;border-radius:3px;} #ac-json .ac-comment{color:#475569;font-style:italic;} #ac-json .ac-key{color:#94a3b8;} #ac-json .ac-str{color:#c8d6e5;} #ac-json .ac-num{color:#c8d6e5;} #ac-json .ac-bool{color:#c8d6e5;} #ac-footer{padding:8px 16px;background:#1e293b;border-top:1px solid #334155;display:flex;justify-content:space-between;align-items:center;font-size:10px;} <div id="ac-wrap"> <div id="ac-json">{ <span class="ac-comment">// ─── infrastructure & deploy ───────────────────</span> <span class="ac-key">"service.name"</span>: <span class="ac-str">"order-service"</span>, <span class="ac-key">"service.version"</span>: <span class="ac-str">"2.14.3"</span>, <span class="ac-key">"service.team"</span>: <span class="ac-str">"order-platform"</span>, <span class="ac-key">"k8s.pod.name"</span>: <span class="ac-str">"order-svc-7f8d4-xk9mv"</span>, <span class="ac-key">"k8s.namespace"</span>: <span class="ac-str">"production"</span>, <span class="ac-key">"cloud.region"</span>: <span class="ac-str">"ap-south-1"</span>, <span class="ac-key">"cloud.availability_zone"</span>: <span class="ac-str">"ap-south-1a"</span>, <span class="ac-key">"instance.type"</span>: <span class="ac-str">"c5.2xlarge"</span>, <span class="ac-key">"deploy.sha"</span>: <span class="ac-str">"a1b2c3d"</span>, <span class="ac-key">"deploy.time"</span>: <span class="ac-str">"2026-06-10T14:30:00Z"</span>, <span class="ac-key">"deploy.user"</span>: <span class="ac-str">"ravi@team.dev"</span>, <span 
class="ac-comment">// ─── HTTP request & response ──────────────────</span> <span class="ac-key">"http.method"</span>: <span class="ac-str">"POST"</span>, <span class="ac-key">"http.route"</span>: <span class="ac-str">"/v1/orders"</span>, <span class="ac-key">"http.status_code"</span>: <span class="ac-num">200</span>, <span class="ac-key">"http.request.body_size"</span>: <span class="ac-num">1240</span>, <span class="ac-key">"http.response.body_size"</span>: <span class="ac-num">380</span>, <span class="ac-key">"user_agent.device"</span>: <span class="ac-str">"mobile"</span>, <span class="ac-key">"user_agent.os"</span>: <span class="ac-str">"iOS 18.2"</span>, <span class="ac-comment">// ─── user & customer ──────────────────────────</span> <span class="ac-key">"user.id"</span>: <span class="ac-str">"u_8847291"</span>, <span class="ac-key">"user.plan"</span>: <span class="ac-str">"premium"</span>, <span class="ac-key">"user.org_id"</span>: <span class="ac-str">"org_acme_corp"</span>, <span class="ac-key">"user.age_days"</span>: <span class="ac-num">342</span>, <span class="ac-key">"user.country"</span>: <span class="ac-str">"IN"</span>, <span class="ac-key">"user.city"</span>: <span class="ac-str">"Bangalore"</span>, <span class="ac-comment">// ─── business / domain ────────────────────────</span> <span class="ac-key">"restaurant.id"</span>: <span class="ac-str">"r_dosa_corner"</span>, <span class="ac-key">"restaurant.cuisine"</span>: <span class="ac-str">"south-indian"</span>, <span class="ac-key">"restaurant.city"</span>: <span class="ac-str">"Bangalore"</span>, <span class="ac-key">"order.total"</span>: <span class="ac-num">750</span>, <span class="ac-key">"order.item_count"</span>: <span class="ac-num">3</span>, <span class="ac-key">"order.type"</span>: <span class="ac-str">"delivery"</span>, <span class="ac-key">"delivery.eta_min"</span>: <span class="ac-num">25</span>, <span class="ac-key">"delivery.distance_km"</span>: <span class="ac-num">4.2</span>, <span 
class="ac-key">"promo.code"</span>: <span class="ac-str">"MONSOON20"</span>, <span class="ac-key">"promo.discount_pct"</span>: <span class="ac-num">20</span>, <span class="ac-comment">// ─── performance & dependencies ───────────────</span> <span class="ac-key">"duration_ms"</span>: <span class="ac-num">340</span>, <span class="ac-key">"db.query_count"</span>: <span class="ac-num">4</span>, <span class="ac-key">"db.total_ms"</span>: <span class="ac-num">120</span>, <span class="ac-key">"cache.hit"</span>: <span class="ac-bool">false</span>, <span class="ac-key">"external.calls_count"</span>: <span class="ac-num">2</span>, <span class="ac-key">"external.total_ms"</span>: <span class="ac-num">85</span>, <span class="ac-key">"queue.wait_ms"</span>: <span class="ac-num">12</span>, <span class="ac-comment">// ─── feature flags ────────────────────────────</span> <span class="ac-key">"feature_flag.new_cart"</span>: <span class="ac-bool">true</span>, <span class="ac-key">"feature_flag.rec_v2"</span>: <span class="ac-bool">false</span>, <span class="ac-key">"feature_flag.dark_mode"</span>: <span class="ac-bool">true</span>, <span class="ac-comment">// ─── error context (when applicable) ──────────</span> <span class="ac-key">"error"</span>: <span class="ac-bool">false</span>, <span class="ac-key">"error.slug"</span>: <span class="ac-str">""</span> }</div> <div id="ac-footer"> <span style="color:#4ade80;font-weight:600;">46 attributes. One row. One write. That's "wide."</span> <span style="color:#64748b;">scroll ↕</span> </div> </div> </div> The key insight: **default to inclusion**. If you find yourself asking "will I ever need this field?" — the answer is yes. You will, at 3am, when something you never imagined breaks. Adding a field to a wide event costs almost nothing: it's still one write, one row, and the marginal cost of one more column in a columnar database is nearly zero. And notice the categories.
Infrastructure attributes (k8s, cloud, deploy) let you answer infra questions. User attributes let you answer business questions. **Combined** — "are premium users in ap-south-1 on the new deploy hitting more cache misses?" — that's the question nobody pre-planned for. That's observability.

---

## The anti-pattern: reaching for log.Info

Here's the developer instinct that needs to change. You're writing code. You want to record context about a search result. The instinct says:

```go
log.Info("product found", "product_id", productID)
log.Info("price fetched", "price", price)
```

Two new log lines. One field each. Deep.

The right instinct — OTEL semantic conventions for standard infra attributes, your org's own convention package for business attributes:

```go
import (
	semconv "go.opentelemetry.io/otel/semconv/v1.26.0" // standard: HTTP, k8s, cloud, DB

	orgconv "github.com/your-org/otelconv" // your org: business domain attributes
)

// infra — comes from semconv (standardized across the industry)
span.SetAttributes(
	semconv.HTTPRequestMethodKey.String("POST"),
	semconv.HTTPRouteKey.String("/v1/search"),
)

// business — comes from orgconv (standardized across your org)
span.SetAttributes(
	orgconv.ProductID(productID),
	orgconv.ProductPrice(price),
	orgconv.SearchResultCount(resultCount),
)
```

Two packages. One for industry-standard attributes (HTTP, gRPC, database, k8s, cloud — defined by OpenTelemetry). One for your org's business domain (product, order, restaurant, user segments — defined by your platform team). Both end up as attributes on the same span. No new events created. No extra storage rows. Just more columns on the same row.

The `orgconv` package is key — it standardizes business attribute names across all your services. When everyone uses `orgconv.RestaurantID(id)` instead of hand-typing `"restaurant_id"` or `"rest_id"` or `"restaurantId"`, your GROUP BY queries work across every service consistently.
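What might such a convention package look like inside? The sketch below is one possible shape, not a published library. The attribute names and helper functions are hypothetical, and `KeyValue` is a local stand-in for `attribute.KeyValue` from `go.opentelemetry.io/otel/attribute` so that the example runs without dependencies.

```go
package main

import "fmt"

// KeyValue is a stand-in for attribute.KeyValue from
// go.opentelemetry.io/otel/attribute, used here to avoid a dependency.
type KeyValue struct {
	Key   string
	Value any
}

// Attribute names are defined in exactly one place. These names are
// hypothetical examples, not an official convention.
const (
	RestaurantIDKey      = "restaurant.id"
	ProductIDKey         = "product.id"
	ProductPriceKey      = "product.price"
	SearchResultCountKey = "search.result_count"
)

// Typed constructors: the key cannot be misspelled at the call site, and the
// compiler rejects a value of the wrong type.
func RestaurantID(id string) KeyValue { return KeyValue{Key: RestaurantIDKey, Value: id} }

func ProductID(id string) KeyValue { return KeyValue{Key: ProductIDKey, Value: id} }

func ProductPrice(price float64) KeyValue { return KeyValue{Key: ProductPriceKey, Value: price} }

func SearchResultCount(n int) KeyValue { return KeyValue{Key: SearchResultCountKey, Value: n} }

func main() {
	attrs := []KeyValue{
		RestaurantID("r_dosa_corner"),
		ProductID("p_123"),
		ProductPrice(750),
		SearchResultCount(12),
	}
	for _, kv := range attrs {
		fmt.Printf("%s=%v\n", kv.Key, kv.Value)
	}
}
```

In a real package each helper would presumably return `attribute.KeyValue`, for example via `attribute.String(RestaurantIDKey, id)`, so the result can be passed straight to `span.SetAttributes`.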
#ic-wrap{font-family:'IBM Plex Mono',monospace;font-size:11px;max-width:600px;margin:1em auto;display:flex;gap:12px;} .ic-box{flex:1;border-radius:6px;padding:10px;border:1px solid #334155;} .ic-bad{border-color:#f472b6;background:rgba(244,114,182,0.03);} .ic-good{border-color:#4ade80;background:rgba(74,222,128,0.03);} .ic-title{font-size:10px;text-transform:uppercase;letter-spacing:0.08em;margin-bottom:6px;font-weight:600;} .ic-bad .ic-title{color:#f472b6;} .ic-good .ic-title{color:#4ade80;} .ic-code{background:#0f1219;padding:6px;border-radius:4px;font-size:10px;color:#94a3b8;margin:4px 0;white-space:pre-wrap;} .ic-result{font-size:10px;color:#64748b;margin-top:6px;padding-top:6px;border-top:1px solid #334155;} <div id="ic-wrap"> <div class="ic-box ic-bad"> <div class="ic-title">The instinct (deep)</div> <div class="ic-code">log.Info("product found", "product_id", pid) log.Info("price calculated", "price", price) log.Info("discount applied", "discount", disc)</div> <div class="ic-result">3 new rows in storage. Each isolated. To correlate product with discount: JOIN on trace_id across 3 events. Slow at scale.</div> </div> <div class="ic-box ic-good"> <div class="ic-title">The fix (wide)</div> <div class="ic-code">span.SetAttributes( attribute.String("product.id", pid), attribute.Float64("price", price), attribute.Float64("discount", disc), )</div> <div class="ic-result">0 new rows. 3 new columns on existing event. Already correlated with user, restaurant, duration, everything. One WHERE clause answers any question.</div> </div> </div> </div> The rule: **if you're about to write a new log line, ask "can this be an attribute on my existing span instead?" If yes — do that.** Client spans (HTTP calls, DB queries) are fine as separate events — they measure separate durations. But adding a new log just to record a product ID? That's deep when you should go wide. --- ## But what about the cost? "Sure, wide events sound great. 
But if I'm putting 150 fields on every request at 10,000 RPS — that's a LOT of data. Won't this bankrupt us?" This is the question everyone asks. And the answer is counterintuitive: wide events are actually **cheaper** than deep logging. Let me prove it. #cc-wrap{font-family:'IBM Plex Mono',monospace;font-size:11px;max-width:640px;margin:1em auto;border:1px solid #334155;border-radius:8px;padding:14px;background:#0f1219;} #cc-config{display:flex;gap:8px;flex-wrap:wrap;margin-bottom:12px;align-items:center;} #cc-config label{color:#64748b;font-size:10px;} #cc-config input{background:#1e293b;border:1px solid #334155;border-radius:4px;color:#f1f5f9;font-family:'IBM Plex Mono';font-size:11px;padding:3px 6px;width:70px;text-align:center;} #cc-config input:focus{border-color:#4ade80;outline:none;} #cc-compare{display:grid;grid-template-columns:1fr 1fr;gap:10px;margin-bottom:12px;} .cc-col{border-radius:6px;padding:10px;font-size:10px;} .cc-col-deep{background:#1e293b;border:1px solid #475569;} .cc-col-wide{background:#1e293b;border:1px solid #4ade80;} .cc-col-title{font-size:10px;text-transform:uppercase;letter-spacing:0.08em;margin-bottom:8px;font-weight:600;} .cc-col-deep .cc-col-title{color:#f472b6;} .cc-col-wide .cc-col-title{color:#4ade80;} .cc-line{padding:2px 0;color:#94a3b8;font-size:9px;border-bottom:1px solid #1a1f2e;} .cc-line-overhead{color:#64748b;} .cc-line-data{color:#c8d6e5;} #cc-math{background:#1e293b;border-radius:6px;padding:10px;margin-bottom:10px;} .cc-math-row{display:flex;justify-content:space-between;padding:4px 0;border-bottom:1px solid #0f1219;font-size:10px;} .cc-math-label{color:#94a3b8;} .cc-math-deep{color:#f472b6;min-width:80px;text-align:right;} .cc-math-wide{color:#4ade80;min-width:80px;text-align:right;} .cc-math-header{display:flex;justify-content:space-between;padding:4px 0;margin-bottom:4px;font-size:9px;text-transform:uppercase;letter-spacing:0.05em;} 
#cc-verdict{text-align:center;padding:10px;background:rgba(74,222,128,0.06);border:1px solid rgba(74,222,128,0.3);border-radius:6px;font-size:11px;} <div id="cc-wrap"> <div style="color:#e5c07b;font-size:11px;text-transform:uppercase;letter-spacing:0.1em;margin-bottom:8px;font-weight:600;">Same data, two shapes: apples to apples</div> <div id="cc-config"> <label>RPS:</label><input id="cc-rps" type="number" value="10000"> <label>Fields to capture:</label><input id="cc-fields" type="number" value="20"> <label>Cost/GB:</label><input id="cc-costgb" type="number" value="0.30" step="0.01"> <label>Retention:</label><input id="cc-days" type="number" value="30"><label>days</label> </div> <div id="cc-compare"> <div class="cc-col cc-col-deep"> <div class="cc-col-title">Deep — one log line per field</div> <div id="cc-deep-lines"></div> </div> <div class="cc-col cc-col-wide"> <div class="cc-col-title">Wide — one event, all fields</div> <div id="cc-wide-event"></div> </div> </div> <div id="cc-math"> <div class="cc-math-header"><span class="cc-math-label"></span><span style="color:#f472b6;">Deep</span><span style="color:#4ade80;">Wide</span></div> <div class="cc-math-row"><span class="cc-math-label">Rows per request</span><span class="cc-math-deep" id="cc-m-rows-d"></span><span class="cc-math-wide" id="cc-m-rows-w"></span></div> <div class="cc-math-row"><span class="cc-math-label">Overhead bytes (repeated per row)</span><span class="cc-math-deep" id="cc-m-overhead-d"></span><span class="cc-math-wide" id="cc-m-overhead-w"></span></div> <div class="cc-math-row"><span class="cc-math-label">Actual data bytes</span><span class="cc-math-deep" id="cc-m-data-d"></span><span class="cc-math-wide" id="cc-m-data-w"></span></div> <div class="cc-math-row"><span class="cc-math-label">Total bytes/request (raw)</span><span class="cc-math-deep" id="cc-m-total-d"></span><span class="cc-math-wide" id="cc-m-total-w"></span></div> <div class="cc-math-row"><span class="cc-math-label">Compression 
ratio</span><span class="cc-math-deep" id="cc-m-compress-d"></span><span class="cc-math-wide" id="cc-m-compress-w"></span></div> <div class="cc-math-row"><span class="cc-math-label">GB/day (compressed)</span><span class="cc-math-deep" id="cc-m-daily-d"></span><span class="cc-math-wide" id="cc-m-daily-w"></span></div> <div class="cc-math-row" style="border-bottom:none;padding-top:6px;font-weight:600;"><span class="cc-math-label" style="color:#c8d6e5;">Monthly cost</span><span class="cc-math-deep" id="cc-m-cost-d" style="font-size:13px;"></span><span class="cc-math-wide" id="cc-m-cost-w" style="font-size:13px;"></span></div> </div> <div id="cc-verdict"></div> </div> (function(){ var el=function(id){return document.getElementById(id);}; var overheadPerRow=65; var bytesPerField=25; var sampleFields=['user.id','user.plan','restaurant.id','cache_hit','duration_ms','http.status','region','deploy.sha','order.total','delivery.eta','db.query_count','db.total_ms','feature_flag.new_cart','http.method','http.route','error','order.item_count','user.city','instance.type','k8s.pod']; function calc(){ var rps=parseFloat(el('cc-rps').value)||1; var fields=parseInt(el('cc-fields').value)||1; var costgb=parseFloat(el('cc-costgb').value)||0; var days=parseFloat(el('cc-days').value)||1; var deepLinesHtml=''; var wideLinesHtml=''; var shown=Math.min(fields,sampleFields.length); for(var i=0;i<shown;i++){ deepLinesHtml+='<div class="cc-line"><span class="cc-line-overhead">ts,trace,svc,</span><span class="cc-line-data">'+sampleFields[i]+'=...</span></div>'; wideLinesHtml+='<div class="cc-line"><span class="cc-line-data">'+sampleFields[i]+'=...</span></div>'; } if(fields>shown)deepLinesHtml+='<div class="cc-line" style="color:#64748b;">...+'+(fields-shown)+' more lines</div>'; if(fields>shown)wideLinesHtml+='<div class="cc-line" style="color:#64748b;">...+'+(fields-shown)+' more fields</div>'; el('cc-deep-lines').innerHTML=deepLinesHtml; el('cc-wide-event').innerHTML='<div class="cc-line" 
style="color:#64748b;border:none;">ts, trace_id, service — once</div>'+wideLinesHtml; var deepOverhead=fields*overheadPerRow; var wideOverhead=overheadPerRow; var dataBytes=fields*bytesPerField; var deepTotal=deepOverhead+dataBytes; var wideTotal=wideOverhead+dataBytes; var deepCompress=7; var wideCompress=12; var deepDailyGB=(rps*deepTotal*86400)/(1024*1024*1024)/deepCompress; var wideDailyGB=(rps*wideTotal*86400)/(1024*1024*1024)/wideCompress; var deepCost=deepDailyGB*days*costgb; var wideCost=wideDailyGB*days*costgb; el('cc-m-rows-d').textContent=fields+' rows'; el('cc-m-rows-w').textContent='1 row'; el('cc-m-overhead-d').textContent=deepOverhead+' B ('+overheadPerRow+'×'+fields+')'; el('cc-m-overhead-w').textContent=wideOverhead+' B ('+overheadPerRow+'×1)'; el('cc-m-data-d').textContent=dataBytes+' B'; el('cc-m-data-w').textContent=dataBytes+' B (same!)'; el('cc-m-total-d').textContent=deepTotal+' B'; el('cc-m-total-w').textContent=wideTotal+' B'; el('cc-m-compress-d').textContent=deepCompress+'x'; el('cc-m-compress-w').textContent=wideCompress+'x'; el('cc-m-daily-d').textContent=deepDailyGB.toFixed(1)+' GB'; el('cc-m-daily-w').textContent=wideDailyGB.toFixed(1)+' GB'; el('cc-m-cost-d').textContent='