How a 2,960-gate synthesizable IP block eliminates unnecessary CPU cycles at the hardware level — for every architecture, every sector, every process node.
Every processor in the world wastes energy confirming that data has not changed. This paper presents the Reflexive Processing Unit (RPU), a synthesizable hardware IP that monitors the temporal rate of change of incoming data and autonomously gates downstream clock and power — without software, without a PMU, and without modifying any existing logic. Validated at TSMC 65nm GP (625 MHz, 1.702 mW, 0 ps slack) and SkyWater SKY130 (100 MHz, 0.014 mW leakage). Integrated with lowRISC Ibex RISC-V over 5,000,004 simulation cycles: 99.998% CPU cycle reduction in stable data conditions, with a consistent 2-cycle wake-up latency across all tested scenarios. Zero false wake-ups.
Pick any digital system that processes a continuous data stream — a radar installation, a cardiac monitor, a data center telemetry agent, an autonomous vehicle sensor, an industrial vibration monitor. In every one of these systems, a CPU is executing a loop. It reads the sensor. It checks the value. It decides nothing has changed. It does this again. Millions of times per second. Every single iteration produces zero useful output.
This behavior is called polling, and it remains the default in a large share of always-on designs. Without dedicated hardware watching the data, the processor has no way to know that nothing has changed except by actively looking. So it looks, constantly, even when the answer is always the same: nothing happened.
In always-on systems — IoT sensors, radar receivers, data center monitors, medical implants, autonomous vehicle sensors — the majority of CPU active cycles are spent confirming stagnation. The processor runs at full power to produce the answer "no change." That energy is not a necessary cost. It is pure waste.
Existing solutions attempt to address this in software: interrupt-driven comparators, OS-level DVFS, centralized PMU controllers. All of them fail for the same reason: they require the CPU to decide when it should stop being awake. But deciding requires being awake. It is a circular dependency that software cannot resolve, because the software layer is part of the problem.
What is needed is a hardware element that can observe a data stream, compute how fast it is changing, compare that rate against a threshold, and autonomously gate power — all within a single clock cycle, without any instruction execution, without any external controller. Until the RPU, no such element existed as a synthesizable IP on standard CMOS.
The RPU does not ask "has the data changed?" It measures how fast the data is changing — the temporal rate of change, expressed as δ = |avg_new − avg_old|. When that rate falls below a configurable threshold, the RPU gates the downstream clock and power. When it rises above, it asserts a wake signal and the CPU activates in exactly 2 clock cycles.
The decision path is fully combinational. It involves no program counter, no instruction memory, no bus communication, and no software layer. The entire chain — from incoming data sample to clock gate enable — completes within a single clock cycle. The RPU does not need to know what the data means. It only needs to know whether the data is changing fast enough to justify waking the processor.
The RTL implementation (rpu_core.sv, 560 lines, SystemVerilog IEEE 1800-2017) consists of seven modules forming an integrated autonomous decision chain:
Module 101 — Input Module. Receives any digital data stream. Compatible with ADC output, sensor bus, memory interface, or any custom signal. No semantic interpretation of data content.
Module 102 — State Store. Circular ring buffer of configurable depth (default DEPTH=32, power-of-two). Maintains two running sums — sum_new and sum_old — for the new and old halves of the window. Updated in O(1) per sample regardless of buffer depth.
Module 103 — Temporal Change Analysis. Computes δ = |avg_new − avg_old| using bit-shift division. No hardware divider required. The result is available within the same clock cycle as the incoming sample.
Module 104 — Threshold Control. Adaptive asymmetric threshold engine. Raises the threshold (desensitizes) when the signal is noisy; lowers it (sensitizes) when the environment is quiet. Configurable at runtime via C-HAL without re-synthesis.
Module 105 — Reconfiguration Module. Connects directly to an Integrated Clock Gating (ICG) cell enable input or sleep transistor gate terminal. The path from decision to gate is purely combinational — no sequential dependency, no added latency.
Module 106 — Output Unit. Asserts wake_en to the CPU interrupt pin. Compatible with RISC-V irq_external_i, ARM Cortex-M NVIC, or any level-triggered interrupt controller. No firmware modifications required on the CPU side.
Module 107 — Guardian Sideband. An independent monitoring channel operating on the ungated clock. Reports last_delta, active_threshold, and alert_status even when the main clock is gated. Essential for defense watchdog compliance, safety-critical observability, and always-on monitoring applications.
The RPU connects in parallel with the existing data path. It does not interrupt it, replace it, or modify it. The complete integration is a single RTL instantiation:
    // 2 data inputs (plus clock and reset), 1 output. Nothing else in your system changes.
    rpu_core #(.DEPTH(32), .DATA_WIDTH(12)) u_rpu (
        .clk      (sys_clk),
        .rst_n    (sys_rst_n),
        .in_data  (sensor_data),
        .in_valid (data_valid),
        .wake_en  (irq_external_i)  // → CPU interrupt pin
    );
    // RISC-V: wake_en → irq_external_i
    // ARM Cortex-M: wake_en → any NVIC line
    // Fail-safe: remove this block → system reverts to polling, zero degradation
The fail-safe property is unconditional. Worst case: remove the RPU, system reverts to conventional polling. Zero difference.
The RPU is realizable in standard synchronous CMOS logic without exotic processes, special memory cells, or non-standard design flows. The implementation uses only combinational logic, flip-flops, and standard adder/subtractor cells available in any PDK.
The sliding window mechanism divides the ring buffer at its midpoint (HALF = DEPTH/2) into two equal subgroups. Running sums are maintained in separate accumulator registers. On each new sample: sum_new is incremented by the incoming value, the midpoint sample transitions from sum_new to sum_old, and the oldest sample is subtracted from sum_old. This update requires a constant number of operations regardless of buffer depth — O(1) complexity per clock cycle.
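The update rule described above can be sketched as a small behavioral model. This is an illustrative Python model, not the RTL: the class name `HalfWindowSums`, the zero-initialised buffer, and the `head` pointer convention are assumptions made for the example. Only `sum_new`, `sum_old`, DEPTH and HALF come from the text.

```python
class HalfWindowSums:
    """Behavioral model of a ring buffer split at its midpoint (not the RTL)."""

    def __init__(self, depth=32):
        assert depth >= 2 and depth & (depth - 1) == 0, "depth must be a power of two"
        self.depth = depth
        self.half = depth // 2
        self.buf = [0] * depth  # assumed to start at zero, as after reset
        self.head = 0           # index of the oldest sample; next write lands here
        self.sum_new = 0        # running sum of the newest HALF samples
        self.sum_old = 0        # running sum of the oldest HALF samples

    def push(self, sample):
        """Constant-time update: the same few operations for any depth."""
        oldest = self.buf[self.head]
        midpoint = self.buf[(self.head + self.half) % self.depth]
        self.sum_new += sample - midpoint   # new sample enters, midpoint leaves
        self.sum_old += midpoint - oldest   # midpoint enters, oldest leaves
        self.buf[self.head] = sample        # overwrite the oldest slot
        self.head = (self.head + 1) % self.depth


window = HalfWindowSums(depth=8)
for value in range(1, 9):                   # push 1, 2, ..., 8
    window.push(value)
print(window.sum_old, window.sum_new)       # prints "10 26"
```

Each call to `push` touches two buffer entries and two accumulators regardless of `depth`, which is the O(1) property the paragraph describes.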
Subgroup averages are computed by right-shifting the running sums by log₂(HALF) — binary division without a hardware divider. The temporal change metric δ is then the absolute difference of the two shifted values. The complete decision path from δ computation to ICG enable is combinational and closes timing within a single clock period.
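A minimal sketch of the shift-based average and the resulting δ, again as a Python model and not the RTL. The function names are hypothetical, and treating δ equal to the threshold as "quiet" is an assumption, since the text only specifies the below and above cases.

```python
def temporal_delta(sum_new, sum_old, half):
    """|avg_new - avg_old| using right shifts; half must be a power of two."""
    assert half > 0 and half & (half - 1) == 0
    shift = half.bit_length() - 1           # log2(half)
    return abs((sum_new >> shift) - (sum_old >> shift))


def wake_requested(delta, threshold):
    """True when the change rate exceeds the threshold (equality is assumed quiet)."""
    return delta > threshold


delta = temporal_delta(sum_new=26, sum_old=10, half=4)  # (26 >> 2) - (10 >> 2) = 6 - 2
print(delta, wake_requested(delta, threshold=3))         # prints "4 True"
```

Note that the shift truncates: the exact half-window averages here are 6.5 and 2.5, which the shift reduces to 6 and 2. The difference is still 4 in this case, but in general the shifted δ can differ from the exact value by up to one least-significant bit.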
All parameters — DEPTH, DATA_WIDTH, threshold bounds, adaptive step sizes — are compile-time configurable. Runtime adjustment of threshold parameters is available via the C-HAL memory-mapped register interface, which uses zero dynamic memory allocation (no malloc) and operates strictly out-of-band without interfering with the hardware decision path.
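To make the role of the threshold bounds and adaptive step sizes concrete, here is one possible shape for an asymmetric adaptation step. Everything in it is an assumption for illustration: the source does not give the update rule, the step values, the bounds, or which direction is faster. The names `step_up`, `step_down`, `th_min` and `th_max` are hypothetical.

```python
def update_threshold(threshold, delta, step_up=4, step_down=1, th_min=2, th_max=255):
    """One adaptation step of a hypothetical asymmetric threshold.

    A delta above the current threshold (noisy input) raises the threshold
    by step_up; otherwise (quiet input) it is lowered by step_down.
    The result is clamped to [th_min, th_max].
    """
    if delta > threshold:
        threshold += step_up      # desensitize
    else:
        threshold -= step_down    # sensitize
    return max(th_min, min(th_max, threshold))


threshold = 8
for delta in [20, 20, 1, 1, 1]:   # two noisy samples, then three quiet ones
    threshold = update_threshold(threshold, delta)
print(threshold)                  # prints "13"
```

With unequal steps the threshold reacts quickly in one direction and relaxes slowly in the other, which is one common reading of "asymmetric". The actual RPU behavior may differ.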
The RPU was validated across four independent evidence layers: ASIC synthesis on two process nodes (TSMC 65nm GP and SkyWater SKY130), FPGA comparative measurement, and full system-level RISC-V integration. No results in this section are estimated or extrapolated.
Synthesis for TSMC 65nm GP was performed using Cadence Genus with a target clock period of 1.6 ns (625 MHz). The design achieved full timing closure with zero violating paths and zero total negative slack. Power analysis used activity data derived from simulation (VCD).
Power decomposition: leakage 0.178 mW (10.5%), internal 1.051 mW (61.7%), switching 0.473 mW (27.8%). Internal power dominates, consistent with expected synchronous register-dominated behavior. Total cell area: 12,990 µm². Net area: 5,073 µm².
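As a quick consistency check, the three reported components can be summed and converted to shares. The snippet uses only the figures quoted above; the variable names are illustrative.

```python
# Reported TSMC 65nm power components, in mW
components_mw = {"leakage": 0.178, "internal": 1.051, "switching": 0.473}

total_mw = sum(components_mw.values())
shares_pct = {name: 100.0 * mw / total_mw for name, mw in components_mw.items()}

print(f"total = {total_mw:.3f} mW")   # prints "total = 1.702 mW"
for name, pct in shares_pct.items():
    print(f"{name:9s} {pct:5.2f} %")
```

The components sum to the 1.702 mW total, and the computed shares agree with the quoted percentages to within the rounding of the three-decimal inputs.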
The identical RTL was synthesized on SkyWater SKY130 with a target clock period of 10.0 ns (100 MHz). Full timing closure achieved. This supports technology portability: the architecture is not specific to TSMC 65nm, and the same RTL closes timing unchanged on a second, independent standard CMOS process.
Leakage power at SKY130: 0.014 mW (0.35% of total power), compared to 10.5% at TSMC 65nm. The significant leakage difference reflects the larger geometry. Total power at SKY130 is 3.876 mW — higher than TSMC 65nm due to less optimized standard cells, but the functionality and timing behavior are identical.
Comparative validation was performed on a Nexys A7-100T (Xilinx XC7A100TCSG324-1) using Vivado Power Analyzer with live physical sensor input. A light-dependent resistor (LDR) provided continuous analog readings digitized through the on-board ADC, capturing real-world ambient light variations including noise, drift, and transient changes.
The same live sensor data stream was fed simultaneously to a conventional always-on threshold circuit and the RPU, eliminating external variables and ensuring a direct hardware-to-hardware comparison under identical real-world conditions.
| Parameter | Conventional always-on threshold (AT) circuit | RPU circuit |
|---|---|---|
| Signal rate (Mt/s, million transitions per second) | 7.043 | 0.468 |
| Toggle rate | 17.5% | ≤12.5% |
| Operating mode | Always-on | Event-driven |
| Signal-rate reduction | (baseline) | ≈15× |
The 15× reduction in signal rate under identical sensor input shows that the RPU reaches the same logical output with far fewer signal transitions. Dynamic power scales with switching activity (P = α·C·V²·f), so at fixed capacitance, voltage, and clock frequency, a proportional reduction in dynamic power is expected on the affected nets.
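The ratio and the scaling argument can be reproduced in a few lines. The signal rates are the measured values from the table. The capacitance, voltage, and frequency are placeholder assumptions that cancel in the ratio and are not measurements from the FPGA run.

```python
def dynamic_power(alpha, c_farads, v_volts, f_hz):
    """Classic CMOS dynamic power estimate: P = alpha * C * V^2 * f."""
    return alpha * c_farads * v_volts ** 2 * f_hz


signal_rate_conventional = 7.043   # Mt/s, from the table
signal_rate_rpu = 0.468            # Mt/s, from the table
activity_ratio = signal_rate_conventional / signal_rate_rpu

# Placeholder electrical values (assumptions); they cancel in the ratio.
C, V, F = 1e-12, 1.0, 100e6
p_conventional = dynamic_power(activity_ratio, C, V, F)  # activity relative to RPU = 1.0
p_rpu = dynamic_power(1.0, C, V, F)

print(f"activity ratio = {activity_ratio:.2f}x")               # prints "activity ratio = 15.05x"
print(f"dynamic power ratio = {p_conventional / p_rpu:.2f}x")  # prints "dynamic power ratio = 15.05x"
```

The power ratio equals the activity ratio only while C, V, and f are held equal between the two circuits, which is the assumption behind the proportionality claim.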
The rpu_core RTL module was integrated with the lowRISC Ibex RISC-V processor in a complete SoC testbench comprising the Ibex core, dual-port RAM, a bus interconnect, and a timer peripheral. The wake_en output was connected directly to the Ibex external interrupt input (irq_external_i). Two firmware configurations were compared under identical simulated input data streams over 5,000,004 executed cycles using Verilator:
(1) A baseline polling firmware in which the CPU continuously reads and evaluates incoming data in a software loop; and (2) an RPU-assisted firmware in which the CPU enters WFI sleep and is awakened only when the RPU detects temporal change exceeding the configured threshold.
| Scenario | Polling Cycles | RPU Cycles | Reduction | Wake Latency |
|---|---|---|---|---|
| Stable data with small noise | 5,000,000 | 125 | 99.998% | 2 cycles |
| Sudden spike / rapid change | 5,000,000 | 338 | 99.993% | 2 cycles |
| Slow drift + large anomaly | 5,000,000 | 1,487,143 | 70.3% | 2 cycles |
Across all three scenarios, the wake-up latency from RPU asserting wake_en to the Ibex core executing its first post-WFI instruction was consistently 2 clock cycles. Not approximately 2. Exactly 2, every time. Zero false wake-ups in the stable scenario across 5,000,000 cycles.
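The reduction column follows directly from the two cycle counts in each row. A short check, using only the table values (the function name is illustrative):

```python
def cycle_reduction_pct(polling_cycles, rpu_cycles):
    """Share of polling cycles avoided, in percent."""
    return 100.0 * (1.0 - rpu_cycles / polling_cycles)


scenarios = {
    "Stable data with small noise": (5_000_000, 125),
    "Sudden spike / rapid change": (5_000_000, 338),
    "Slow drift + large anomaly": (5_000_000, 1_487_143),
}

for name, (polling, rpu) in scenarios.items():
    print(f"{name}: {cycle_reduction_pct(polling, rpu):.4f} %")
```

The three results come out at roughly 99.9975%, 99.9932%, and 70.2571%, consistent with the rounded figures in the table.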
Because the RPU operates on temporal rate of change rather than data content, it applies to any system that processes a continuous data stream — regardless of the data's origin, format, or meaning. The threshold and buffer depth are the only parameters that need to change between applications. The architecture is identical.
Defense and radar. Radar returns from empty airspace change very slowly or not at all. The CPU wakes only when the temporal change rate exceeds the threshold, indicating a genuine target or signal event. The Guardian Sideband maintains always-on observability and watchdog compliance even when the main clock is gated — critical for defense certification requirements.
Automotive ADAS. When a vehicle is stationary or the scene ahead has not changed, LiDAR point clouds, radar returns, and camera frames contain no actionable information. The RPU suppresses these frames before they consume GPU or CPU cycles. The 15× signal-rate reduction measured on FPGA suggests a comparable reduction in dynamic power for the gated logic on an automotive compute platform.
IoT and medical devices. Battery-powered sensors — cardiac monitors, glucose sensors, environmental loggers, wearables — spend the vast majority of their operational lifetime observing unchanged or slowly changing data. The RPU keeps the CPU in WFI sleep and wakes it in 2 cycles only when a meaningful change occurs. The SKY130 leakage of 0.014 mW establishes the floor for always-on RPU operation.
Data center and SmartNIC. Monitoring and telemetry pipelines continuously transmit metrics regardless of whether values have changed. An RPU at a SmartNIC or DPU interface suppresses unchanged telemetry before it triggers host CPU interrupts, kernel context switches, and downstream storage writes. Storage systems additionally suffer write amplification from identical data blocks being repeatedly committed to flash media — a temporal change filter on the write path addresses this directly.
Edge AI and inference. Video analytics and inference pipelines process high-frame-rate streams. In surveillance and industrial monitoring, the majority of frames depict static or near-static scenes. The RPU gates the inference accelerator front-end before stagnant input tensors consume GPU/TPU compute cycles and memory bus bandwidth.
Industrial and predictive maintenance. Machinery vibration and acoustic signals are nominal during 99% of operational lifetime. The RPU keeps the CPU asleep during normal operation and delivers a 2-cycle wake-up when an anomaly crosses the threshold. The C-HAL adaptive threshold adjusts sensitivity at runtime as operating conditions change.
Space and harsh environments. The combinational decision path contains no instruction-fetch vulnerability. The architecture synthesizes on any rad-tolerant CMOS node. The 2-cycle wake-up latency is deterministic regardless of environmental conditions. The Guardian Sideband provides always-on observability without breaking power isolation.
The core architectural distinction of the RPU is that the temporal change metric computation, threshold comparison, and physical gating execution all occur within a single integrated block — without external controller involvement. This combination does not exist in any prior synthesizable IP on standard CMOS.
Software-based DVFS and OS P-state management operate through a software layer with millisecond-scale latency and no data-change awareness. Conventional clock gating passively follows externally generated enable signals without internal temporal analysis. The decision does not originate from within the gated block.
Wang et al. (Nature Communications, 2024) present a memristor-based adaptive neuromorphic perception system. The critical limitation is that sensory data acquisition, feature extraction, and modulation scheme selection are managed through an external FPGA control platform — a Von Neumann sequential instruction model requiring program counters, instruction memory, ALU execution cycles, and bus communication. This is architecturally distinct from the RPU's combinational hardware reflex.
HP's memristor patent (US8450711B2) describes a resistive switching device that retains past electrical stimuli. However, it does not compute the temporal rate of change of its stored state, compare it against a threshold, or generate autonomous hardware decisions based on that comparison. It is a passive storage element without a reflexive decision mechanism.
| Capability | DVFS | Clk Gate | Wang 2024 | HP Memristor | RPU |
|---|---|---|---|---|---|
| Decision origin | OS/SW | External | FPGA+ALU | Passive | Within cell |
| ΔC/Δt computation | — | — | ALU-based | — | O(1) hardware |
| Clk + power isolation | Clk only | Clk only | None | None | Both |
| Wake-up latency | ms | External | µs | — | 2 clock cycles |
| CPU required | Yes | Partial | Yes | — | Zero |
| Standard CMOS | Yes | Yes | FPGA only | Memristive | Any node |
| Adaptive threshold | No | No | FPGA | No | Hardware |
| Fail-safe passthrough | No | No | No | No | Guaranteed |
In the TÜRKPATENT international search report for TR 2025/012696, HP's memristor patent was assigned a Y-code (relevant to inventive step only in combination with another document) and IBM's neuromorphic architecture (US11144718B2) was assigned an A-code (background art). Neither reference was cited as an X-category (novelty-destroying) document.
The RPU is protected under PCT/IB2026/053070, filed March 27, 2026, with priority from Turkish national application TR 2025/012696 (September 4, 2025). The patent application covers the ΔC/Δt principle, the direct causal path from temporal metric computation to clock/power gate control, the Guardian Sideband, watchdog integration, C-HAL, and clock/power gating mechanisms. 25 claims. Currently in the international phase under WIPO.
Our technical paper covers the complete architecture, dual-node ASIC validation, FPGA comparative results, RISC-V system integration, seven deployment scenarios, and detailed comparison with prior art.
The RPU is listed on Design & Reuse (design-reuse.com) under three categories: general purpose, RISC-V optimized, and IoT platforms. Listing: RPU-REFLEX 01 · Universal Reflexive Nerve System for IP Optimization.
The following are available immediately under NDA:
— RTL source: rpu_core.sv (560 lines, 7 modules, SystemVerilog IEEE 1800-2017)
— C-HAL driver (C99, no malloc, memory-mapped register interface)
— ASIC PPA reports: TSMC 65nm and SkyWater SKY130
— RISC-V benchmark data: all three scenarios, full Verilator simulation logs
— FPGA comparative data: Vivado Power Analyzer output
— Integration guide: step-by-step for RISC-V and ARM Cortex-M targets
Evaluation. Time-limited. Full RTL access under NDA. Integration support included. Free for qualified engineering teams — run the RPU on your own architecture, with your own data, before committing to a license.
Single Product. Per tape-out license. C-HAL and integration guide included. Technical support through tape-out.
Portfolio. Volume pricing for multiple products. Co-design consulting available.
Academic. Apache 2.0 post-publication. Research collaborations welcome.
Guarantee: If the RPU does not outperform your current polling implementation on your own benchmarks, there is no obligation. The worst case is a system that behaves identically to how it did without the RPU. The evaluation is free. The integration risk is zero.
"The cheapest computation is the one that never occurs."