When Modularity Falls Short

Note: this will be the first in a series exploring Agentic Design Patterns. The goal is to build useful intuitions that can guide good agentic system design.


When Modularity Falls Short: A Cautionary Tale in System Design and the Computational Search Stack

In any engineered system—whether in software or in the real world—each added layer of complexity may inch us closer to ideal performance, yet it also compounds risk and fragility. True design wisdom lies in balancing the pursuit of optimality with the need for robustness, embracing complexity only when its gains clearly outweigh the loss of simplicity and reliability. One of the canonical ways to promote clarity and simplicity in traditional software engineering—modularity—counterintuitively introduces complexity in agentic LLM systems.

In this article, we'll explore an illustrative example: giving tailored insights from a series of research papers based on expressed interests (for instance, summarizing the training methodologies from a collection of machine learning papers). The key principle we will build up to: introduce additional sequential processing steps only when necessary constraints (e.g., cost, latency, or technological limitations) justify the compounded error risk.

The Promise of Modular Pipelines

Modern software system designs emphasize modularity. Breaking down a monolithic function into discrete, specialized components can improve maintainability, debuggability, and testability. However, due to the stochastic nature of LLMs, this can come at the cost of reliability. Let's consider a hypothetical pipeline for extracting tailored insights.

A rough system might look like the following (once the corpus of papers and interests are selected):

  1. Paper to Extracted Snippets: Use an LLM to ingest the paper and, for the given interests, output relevant snippets verbatim from the text. Repeat for each paper.
  2. Consolidate Extracted Snippets: Across all snippets from all the papers, consolidate the different ideas into a cohesive answer while noting their relative frequencies.
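
To make this concrete, here is a minimal Python sketch of the two-call design. The llm_call helper, the prompts, and the function names are hypothetical placeholders, not a prescribed implementation:

  # Minimal sketch of the original two-stage design.
  # llm_call is a hypothetical stand-in for whatever model API you use.
  def llm_call(prompt: str) -> str:
      # Placeholder: substitute a real model API call here.
      return f"<model output for {len(prompt)}-char prompt>"

  def extract_snippets(paper_text: str, interests: list[str]) -> str:
      # Step 1: one call per paper, returning verbatim snippets for the given interests.
      prompt = (
          "Given these interests:\n- " + "\n- ".join(interests)
          + "\n\nReturn verbatim snippets from the paper below that address them.\n\n"
          + paper_text
      )
      return llm_call(prompt)

  def consolidate(all_snippets: list[str]) -> str:
      # Step 2: one call over all snippets, noting relative frequencies of ideas.
      prompt = (
          "Consolidate the following snippets into a cohesive answer, "
          "noting how often each idea recurs:\n\n" + "\n\n".join(all_snippets)
      )
      return llm_call(prompt)

  def answer(papers: list[str], interests: list[str]) -> str:
      snippets = [extract_snippets(p, interests) for p in papers]
      return consolidate(snippets)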

A proposed refactor might include the following stages, where step 1 from the original splits into steps 1 and 2 below, and step 2 similarly splits into steps 3 and 4.

  1. Original Paper to Distilled Paper: Use a lightweight model to reduce the token count, remove redundant or irrelevant parts, and standardize formatting.
  2. Distilled Paper to Extracted Snippets: Run the extraction function on the distilled output of each paper.
  3. Deduplication: Deduplicate the snippets into one master list.
  4. Final Answer Synthesis: Apply an LLM to take the deduped snippets and form a user-friendly answer.
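
For comparison, here is a sketch of the refactored pipeline under the same assumptions; the model argument that routes the distillation stage to a lightweight model is likewise illustrative:

  # Minimal sketch of the refactored four-stage design.
  def llm_call(prompt: str, model: str = "strong") -> str:
      # Placeholder: substitute a real model API call here; `model` picks the tier.
      return f"<{model} model output for {len(prompt)}-char prompt>"

  def distill(paper_text: str) -> str:
      # Stage 1: a lightweight model trims redundant or irrelevant content.
      return llm_call("Condense this paper, keeping all substantive content:\n\n"
                      + paper_text, model="lightweight")

  def extract_snippets(distilled: str, interests: list[str]) -> str:
      # Stage 2: extraction now runs on the distilled paper.
      return llm_call("Return verbatim snippets relevant to: " + ", ".join(interests)
                      + "\n\n" + distilled)

  def deduplicate(all_snippets: list[str]) -> str:
      # Stage 3: merge duplicate snippets into one master list.
      return llm_call("Deduplicate these snippets into one master list:\n\n"
                      + "\n\n".join(all_snippets))

  def synthesize(deduped: str) -> str:
      # Stage 4: form the user-facing answer.
      return llm_call("Write a user-friendly summary of these snippets:\n\n" + deduped)

  def answer(papers: list[str], interests: list[str]) -> str:
      snippets = [extract_snippets(distill(p), interests) for p in papers]
      return synthesize(deduplicate(snippets))

Note that every stage is a separate LLM call, so every stage is a separate opportunity for error to enter the chain.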

At first glance, this pipeline promises improved performance by leveraging sequential modularity to reduce the complexity of each stage. However, the additional steps raise a critical question: do they actually enhance overall accuracy, or simply compound errors?

The Mathematical Perspective

Error Propagation in Cascaded Functions
Let’s consider a highly simplified mathematical model to examine this trade-off. To start, we will only consider the design's impact on performance (ignoring cost and latency concerns).

Let's abstract both designs from above into the following function flow:

Original Design:

B_orig -> D_orig

Refactored Design:

A -> B_seq -> C -> D_seq

A: Function A reduces the paper's complexity while preserving all relevant information (refactor step 1).

B_seq: Function B extracts snippets when working with the distilled paper (refactor step 2).

B_orig: Function B extracts snippets when working with the full, unmodified paper (original design step 1).

C: Function C deduplicates the extracted snippets (refactor step 3).

D_orig / D_seq: Function D consolidates the snippets into the final answer, in the original and refactored designs respectively (original step 2; refactor step 4).

Since B_orig is doing the job of both A and B_seq in the refactor, let's compare these first to see the impact of sequential modularity.

We define:

pA: The probability that Function A is correct.

pB_seq: The probability that Function B_seq is correct.

pB_orig: The probability that Function B_orig is correct.

For the two-step process (distillation followed by extraction) to be advantageous, we require:

pA * pB_seq > pB_orig

This is because if either A or B_seq is incorrect, the final output will be incorrect (treating the two stages' success probabilities as independent).

This inequality reveals a key insight: unless the combination of the distillation step and the subsequent extraction significantly improves extraction quality compared to the direct extraction on the full input, the extra step is not justified. In many cases, employing a lightweight model for distillation introduces a non-negligible error rate. Consequently, even if the downstream extractor benefits from a shorter input (increasing pB_seq), the overall process could underperform relative to simply using the extraction function alone (pB_orig).

Plugging in real-world values: say pA and pB_seq are each 90%. Then the extra step is only worth it if handling the full, undistilled context drops pB_orig below 0.9 * 0.9 = 81%!
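
A couple of lines make the break-even arithmetic explicit (a sketch using the illustrative 90% figures above):

  # Break-even check for adding a sequential stage (illustrative numbers).
  p_a = 0.90      # probability the distillation stage is correct
  p_b_seq = 0.90  # probability extraction on the distilled paper is correct

  # The refactor only wins if direct extraction on the full paper is worse than this:
  break_even = p_a * p_b_seq
  print(f"Refactor is worthwhile only if pB_orig < {break_even:.2f}")  # -> 0.81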

Coupled with the reality that performance becomes exponentially harder to improve as it approaches 100%, this explains why modularity with LLMs is not always desirable.

Reflections on Modularity in Systems

Why Modularity Works in Software
The natural question then becomes: why doesn't this apply to pure software, since the analysis wasn't LLM-specific? In traditional software systems, the functions are far more precisely specified, and the substitutability, maintainability, and testability that modularity brings drive a drastically lower developer error rate (alongside its other, less immediate benefits). In LLM systems, the models are opaque reasoners whose behavior is far less deterministic.

Reasons for Sequential Modularity in Agentic LLM Systems
I believe that constraints are the only reason to introduce modularity in AI systems. However, there are many types of constraints, and hence many reasons to introduce sequential modularity in AI systems. Some types of constraints include:

  • Performance – any given LLM can only handle tasks up to a certain complexity (see Tables 1 and 2 in the Appendix for the regimes where single-call performance is very poor). If you expect a single LLM call to do the work of an entire system, you may find the drop so significant that introducing a sequential stage is warranted.
  • Intermediate Verifiability – if an intermediate step has some ability to be verified accurately, then it can make sense to introduce sequential modularity.
  • Cost & Latency – generally, if cost or latency is a serious consideration, multiple cheaper (and often faster) models can be used; since each individually has less capacity for complex tasks, the work must be split into simpler sequential stages.
  • Context Window – certain models (such as the Gemini series) specialize in long context lengths. If your task is very large at the top of the funnel, then using a model specialized for that, or even moving to more traditional hybrid search techniques, may well be warranted.

It is also important to keep in mind that the Pareto frontier of these models is rapidly changing. What may be impossible under a given constraint set today may well be possible in 6-12 months. This is a future-proofing benefit of avoiding unnecessary stages.

Broader Implications in System Design and Search

This principle extends beyond paper processing. In search and recommendation systems, for example, cascaded models are often used to filter large datasets quickly, followed by more precise (but computationally expensive) models for ranking or refining results. The key is balancing the trade-off:

Cascaded Filtering: A preliminary step that efficiently reduces the search space. It must be sufficiently accurate (in terms of recall); otherwise, it risks eliminating relevant items.

Downstream Processing: Subsequent steps can then focus on finer-grained analysis, but their success hinges on the quality of the initial filtering.
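
As a rough illustration of how this compounds, if stage errors are treated as independent, end-to-end recall decays as the product of per-stage recalls (the numbers below are made up):

  # End-to-end recall of a cascade, assuming independent per-stage recall losses.
  stage_recalls = [0.95, 0.90, 0.98]  # e.g. candidate filter, ranker, re-ranker

  end_to_end = 1.0
  for recall in stage_recalls:
      end_to_end *= recall
  print(f"End-to-end recall = {end_to_end:.3f}")  # 0.838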

The risk of compounded error in cascaded systems is a fundamental challenge in system design. Every layer not only contributes to overall latency but also has the potential to degrade the final output if errors propagate unchecked. The decision to include a component should, therefore, be guided by rigorous cost-benefit analysis.

A Cautionary Tale for Designers

The distillation example serves as a cautionary tale for system designers:

  1. Assess Impact Early: Before introducing an extra processing layer, quantify its impact on overall accuracy and performance. Mathematical models, like the one we discussed, can help predict whether the added complexity is justified.
  2. Guard Against Compounding Errors: If an introduced intermediate step carries a risk of error (even a small one), understand how that risk compounds with downstream processing. In some cases, a single robust model might outperform a series of specialized but error-prone models.
  3. Evaluate Contextual Constraints: The move to a lower-powered model should be driven by real-world constraints. Without the pressure of cost or latency restrictions, it might be wiser to employ a more powerful model across the board, ensuring maximum accuracy.
  4. Iterate with Feedback: If modular pipelines are used, implement robust monitoring and feedback/logging. This way, any degradation due to the new layer can be quickly identified and addressed.

Conclusion

In the realm of system design and search, every design decision is a balance between trade-offs. The distillation step in our paper insight extraction pipeline illustrates a broader principle: Only “move down” the computational search stack and introduce sequential modularity when compelling constraints necessitate it. Otherwise, the introduction of an extra layer may result in compounded errors that degrade the overall performance of the system.

By rigorously analyzing each component’s contribution and understanding how errors propagate through cascaded models, engineers can build systems that are not only modular and scalable but also robust and accurate.

P.S. Stay tuned for a follow-up where we explore the idea of "critics" and when they're beneficial. There will also be a fast follow-up exploring parallel modularity.

Appendix:

B_orig \ Y 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
10% (0.1) 0.190 0.280 0.370 0.460 0.550 0.640 0.730 0.820 0.910 1.000
20% (0.2) 0.280 0.360 0.440 0.520 0.600 0.680 0.760 0.840 0.920 1.000
30% (0.3) 0.370 0.440 0.510 0.580 0.650 0.720 0.790 0.860 0.930 1.000
40% (0.4) 0.460 0.520 0.580 0.640 0.700 0.760 0.820 0.880 0.940 1.000
50% (0.5) 0.550 0.600 0.650 0.700 0.750 0.800 0.850 0.900 0.950 1.000
60% (0.6) 0.640 0.680 0.720 0.760 0.800 0.840 0.880 0.920 0.960 1.000
70% (0.7) 0.730 0.760 0.790 0.820 0.850 0.880 0.910 0.940 0.970 1.000
80% (0.8) 0.820 0.840 0.860 0.880 0.900 0.920 0.940 0.960 0.980 1.000
90% (0.9) 0.910 0.920 0.930 0.940 0.950 0.960 0.970 0.980 0.990 1.000
100% (1.0) 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000

Table 1: Stage B_seq values for different relative advantages (Y) over B_orig

In Table 1, Y is the percent improvement applied to the “headroom” (i.e. the gap between B_orig and 1) that B_seq experiences. For example, Y=0.1 means that B_seq recovers 10% of B_orig's headroom (so if B_orig is 0.1 then B_seq = 0.1 + 0.1*(1-0.1) = 0.19).
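
The Table 1 entries follow directly from this headroom formula; a short sketch that reproduces a couple of cells:

  # B_seq = B_orig + Y * (1 - B_orig): Y is the fraction of B_orig's headroom recovered.
  def b_seq(b_orig: float, y: float) -> float:
      return b_orig + y * (1 - b_orig)

  print(f"{b_seq(0.1, 0.1):.3f}")  # 0.190 -- Table 1, row 10%, column 10%
  print(f"{b_seq(0.5, 0.6):.3f}")  # 0.800 -- Table 1, row 50%, column 60%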

B_orig \ Y 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
10% (0.1) 0.526 0.357 0.270 0.217 0.182 0.156 0.137 0.122 0.110 0.100
20% (0.2) 0.714 0.556 0.455 0.385 0.333 0.294 0.263 0.238 0.217 0.200
30% (0.3) 0.811 0.682 0.588 0.517 0.462 0.417 0.380 0.349 0.323 0.300
40% (0.4) 0.870 0.769 0.690 0.625 0.571 0.526 0.488 0.455 0.426 0.400
50% (0.5) 0.909 0.833 0.769 0.714 0.667 0.625 0.588 0.556 0.526 0.500
60% (0.6) 0.938 0.882 0.833 0.789 0.750 0.714 0.682 0.652 0.625 0.600
70% (0.7) 0.959 0.921 0.886 0.854 0.824 0.796 0.769 0.745 0.722 0.700
80% (0.8) 0.976 0.952 0.930 0.909 0.889 0.870 0.851 0.833 0.816 0.800
90% (0.9) 0.989 0.978 0.968 0.957 0.947 0.938 0.928 0.918 0.909 0.900
100% (1.0) 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000

Table 2: Required performance for Stage A to make Sequential Modularity equivalent to B_orig

Table 2, more interestingly, shows the performance Stage A needs for the sequential design to match B_orig at different relative advantage (Y) values. Intuitively, if the relative advantage is small and B_orig already works well, then A needs to perform incredibly well just to break even. Conversely, if the relative advantage is large and B_orig performs poorly, then A can perform quite poorly and still match it.
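
The Table 2 entries come from solving pA * pB_seq = pB_orig for pA, using the same headroom formula for pB_seq; a short sketch:

  # Required accuracy for Stage A so that the chain A -> B_seq matches B_orig alone.
  def required_p_a(b_orig: float, y: float) -> float:
      b_seq = b_orig + y * (1 - b_orig)  # B_seq from the headroom formula
      return b_orig / b_seq

  print(f"{required_p_a(0.1, 0.1):.3f}")  # 0.526 -- Table 2, row 10%, column 10%
  print(f"{required_p_a(0.9, 0.5):.3f}")  # 0.947 -- Table 2, row 90%, column 50%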