Sumin Park

I am a Master’s student at the School of Computing, Korea Advanced Institute of Science and Technology (KAIST), advised by Professor Noseong Park. My research focuses on how architectural inductive biases and training dynamics give rise to structured internal representations and functional specialization in large-scale neural networks. More specifically, my current interests include:

Research Interests

Mechanistic interpretability of LLMs
Efficient sequence modeling with state space models and linear attention
Representation learning for input-network functional specialization

Building on my undergraduate background in neuroscience, my longer-term research direction is to introduce brain-inspired inductive biases as a conceptual framework for understanding and designing universal learning principles that can be shared by both brains and machines.

Ongoing Projects

Understanding attention failures through spectral regimes

My current project investigates whether different modes of model failure correspond to distinct spectral regimes of attention, rather than to a single pathology. Using a Gaussian-equivalent null model for attention, we analyze diffuse versus structured failure patterns through random matrix theory (RMT).
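As a generic illustration of the kind of spectral test RMT makes possible, the sketch below compares the eigenvalues of a sample covariance matrix against the Marchenko-Pastur upper edge expected for pure i.i.d. Gaussian noise. It is not the project's Gaussian-equivalent null model: the function names, the unit-variance assumption, and the 5% tolerance are all hypothetical choices made for this example.

```python
import numpy as np

def mp_upper_edge(n_rows, n_cols, sigma2=1.0):
    """Marchenko-Pastur upper bulk edge for (1/n) X^T X with i.i.d. entries
    of variance sigma2 (assumed known here; 1.0 by default)."""
    gamma = n_cols / n_rows
    return sigma2 * (1.0 + np.sqrt(gamma)) ** 2

def count_outliers(X, tol=1.05):
    """Count eigenvalues of (1/n) X^T X above the noise bulk edge.
    `tol` is an illustrative safety margin for finite-size fluctuations."""
    n, d = X.shape
    eigvals = np.linalg.eigvalsh(X.T @ X / n)
    return int(np.sum(eigvals > tol * mp_upper_edge(n, d)))

# Pure noise versus noise plus one planted rank-one direction.
rng = np.random.default_rng(0)
n, d = 2000, 200
noise = rng.standard_normal((n, d))
u = np.ones(n) / np.sqrt(n)          # unit vector over rows
v = np.zeros(d); v[0] = 1.0          # unit vector over columns
spiked = noise + 3.0 * np.sqrt(n) * np.outer(u, v)

print(count_outliers(noise), count_outliers(spiked))
```

In this toy setting, a spectrum that stays inside the bulk plays the role of a diffuse pattern, while an eigenvalue escaping the bulk marks low-rank structure.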

Selected Publications

2026

In Review

Q-Delta: Beyond Key–Value Associative State Evolution

Sumin Park, Seojin Kim, Noseong Park

Query-aware delta rule for linear attention that uses mixed key–query prediction errors, enabling richer, jointly corrective state evolution dynamics

Linear Attention SSMs LLMs

Abstract

Linear attention reformulates sequence modeling as recurrent state evolution, enabling efficient linear-time inference. Under the key–value associative paradigm, existing approaches restrict the role of the query to the readout operation, decoupling it from state evolution. We show that query-conditioned state readout induces a structured value prediction over accumulated memory that complements key-based retrieval. Based on this insight, we propose Q-Delta, a query-aware delta rule that integrates mixed key–query prediction errors into state evolution, enabling jointly corrective dynamics while preserving delta-rule efficiency. We establish stability guarantees for the resulting dynamics and derive a hardware-efficient chunkwise-parallel formulation with a custom Triton implementation. Empirical results demonstrate stable optimization, competitive throughput, and consistent improvements over strong baselines on language modeling and long-context retrieval tasks.
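For readers unfamiliar with the delta rule mentioned above, here is a minimal sketch of the standard key-based update for a linear-attention state, S_t = S_{t-1} + beta (v_t - S_{t-1} k_t) k_t^T, followed by a query readout. It does not reproduce Q-Delta: the abstract does not spell out the mixed key-query error, so only the baseline rule is shown, and the function names and the scalar `beta` are assumptions of this example.

```python
import numpy as np

def delta_rule_step(S, k, v, beta=1.0):
    """One standard delta-rule update of the (d_v, d_k) memory state S.

    The state is corrected by the prediction error between the target value v
    and the value currently retrieved for key k.
    """
    pred = S @ k                      # value the memory currently returns for k
    err = v - pred                    # key-based prediction error
    return S + beta * np.outer(err, k)

def readout(S, q):
    """Query-conditioned readout of the memory state."""
    return S @ q

# Store two associations under orthonormal keys, then read them back.
S = np.zeros((2, 3))
k1, v1 = np.array([1.0, 0.0, 0.0]), np.array([1.0, 2.0])
k2, v2 = np.array([0.0, 1.0, 0.0]), np.array([-1.0, 0.5])
S = delta_rule_step(S, k1, v1)
S = delta_rule_step(S, k2, v2)
print(readout(S, k1), readout(S, k2))
```

With unit-norm keys and `beta=1.0`, each update overwrites whatever the memory previously returned for that key, which is the corrective behavior that distinguishes the delta rule from purely additive linear attention.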

2026

In Review

STAR: Rethinking MoE Routing as Structure-Aware Subspace Learning

Sumin Park, Noseong Park

Input-aware MoE routing based on incremental subspace learning for evolving input representations

MoE Representation LLMs

Abstract

Mixture-of-Experts (MoE) scales model capacity efficiently by selectively routing inputs to a specialized subset of experts. However, input-expert specialization, the core motivation of MoE, critically depends on whether the router is actually aware of input structure. In practice, MoE routing is typically implemented as a shallow linear projection with limited awareness of input representation, which often leads to unstable and suboptimal specialization. We propose STAR, a STructure-Aware Routing method that rethinks MoE routing as a subspace learning problem by augmenting standard learnable routing with an evolving principal subspace that tracks dominant input structure via the Generalized Hebbian Algorithm (GHA). By aligning routing decisions directly with input structure, alongside the task supervision from the learnable gate, STAR enables stable and balanced expert specialization without relying on auxiliary load-balancing losses. We evaluate STAR on a controlled synthetic setup and on large-scale language and vision tasks, where it consistently improves routing quality and downstream performance over strong MoE baselines. Moreover, optional test-time subspace updates further enhance routing robustness under distribution shifts.
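The Generalized Hebbian Algorithm referenced above (also known as Sanger's rule) can be sketched in a few lines. The example below only shows how GHA incrementally tracks the leading principal directions of a data stream; how STAR combines that subspace with the learnable gate is not described here and is not reproduced. The helper names, learning rate, and synthetic data are assumptions made for illustration.

```python
import numpy as np

def gha_step(W, x, lr):
    """One GHA (Sanger's rule) update.

    W has shape (m, d); its rows estimate the top-m principal directions.
    Update: W += lr * (y x^T - LT[y y^T] W), with y = W x and LT the
    lower-triangular part (including the diagonal).
    """
    y = W @ x
    return W + lr * (np.outer(y, x) - np.tril(np.outer(y, y)) @ W)

def track_subspace(X, m, lr=0.005, seed=0):
    """Stream the rows of X through GHA, starting from a small random W."""
    rng = np.random.default_rng(seed)
    W = 0.1 * rng.standard_normal((m, X.shape[1]))
    for x in X:
        W = gha_step(W, x, lr)
    return W

# Synthetic zero-mean data whose dominant variance lies along the first axis.
rng = np.random.default_rng(1)
scales = np.array([2.0, 1.0, 0.5, 0.2])
X = rng.standard_normal((20000, 4)) * scales
W = track_subspace(X, m=2)
print(np.round(W, 2))
```

Because the update is purely online, the same step could in principle be applied at test time, which matches the spirit of the optional test-time subspace updates mentioned in the abstract.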

2026

AAAI

How Many Experts Are Enough? Towards Optimal Semantic Specialization for Mixture-of-Experts

Sumin Park, Noseong Park

Adaptive MoE expansion mechanism based on gradient-guided semantic drift signals

MoE

Abstract

Finding the optimal configuration of Sparse Mixture-of-Experts (SMoE) that maximizes semantic differentiation among experts is essential for exploiting the full potential of MoE architectures. However, existing SMoE frameworks either rely heavily on hyperparameter tuning or overlook the importance of diversifying semantic roles across experts when adapting the expert pool size. We propose Mixture-of-Experts for Adaptive Semantic Specialization (MASS), a semantic-aware MoE framework for adaptive expert expansion and dynamic routing. MASS introduces two key advancements: (i) a gradient-based semantic drift detector that prompts targeted expert expansion when the existing expert pool lacks the capacity to capture the full semantic diversity of the data, and (ii) an integrated adaptive routing strategy that dynamically adjusts expert usage based on token-level routing confidence mass. We first demonstrate, in a highly controlled synthetic setup, that MASS reliably converges to the optimal balance point of the cost-performance trade-off with notably improved semantic specialization. Further empirical results on real-world datasets across language and vision domains show that MASS consistently outperforms a range of strong MoE baselines, demonstrating its domain robustness and enhanced expert specialization.
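One common way to let the number of active experts vary per token is cumulative-probability (top-p style) selection over the router's softmax output. The sketch below illustrates that idea only; whether MASS uses exactly this rule for its "routing confidence mass" is an assumption, and the function name and the 0.8 threshold are hypothetical.

```python
import numpy as np

def select_experts_by_mass(logits, mass=0.8):
    """Pick the smallest set of experts whose softmax probabilities sum to
    at least `mass`. Returns (expert indices, their probabilities)."""
    logits = np.asarray(logits, dtype=float)
    probs = np.exp(logits - logits.max())   # numerically stable softmax
    probs /= probs.sum()
    order = np.argsort(-probs)              # experts sorted by confidence
    cum = np.cumsum(probs[order])
    k = min(int(np.searchsorted(cum, mass)) + 1, len(order))
    return order[:k], probs[order[:k]]

# A confident token activates one expert; an uncertain token activates all four.
confident_idx, _ = select_experts_by_mass([10.0, 0.0, 0.0, 0.0])
uncertain_idx, _ = select_experts_by_mass([0.0, 0.0, 0.0, 0.0])
print(len(confident_idx), len(uncertain_idx))
```

Under this rule, tokens the router is sure about consume little compute, while ambiguous tokens are spread over more experts.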

2024

ICML

PANDA: Expanded Width-Aware Message Passing Beyond Rewiring

Jeongwhan Choi, Sumin Park, Hyowon Wi, Sung-Bae Cho, Noseong Park

Expanded width-aware message passing for GNNs to address the over-squashing problem

GNNs

Abstract

Recent research in the field of graph neural networks (GNNs) has identified a critical issue known as "over-squashing," resulting from the bottleneck phenomenon in graph structures, which impedes the propagation of long-range information. Prior works have proposed a variety of graph rewiring concepts that aim to optimize the spatial or spectral properties of graphs to promote signal propagation. However, such approaches inevitably deteriorate the original graph topology, which may lead to a distortion of information flow. To address this, we introduce expanded width-aware (PANDA) message passing, a new message passing paradigm where nodes with high centrality, a potential source of over-squashing, are selectively expanded in width to encapsulate the growing influx of signals from distant nodes. Experimental results show that our method outperforms existing rewiring methods, suggesting that selectively expanding the hidden state of nodes can be a compelling alternative to graph rewiring for addressing over-squashing.