Machine learning models for clinical prediction rely on structured data extracted from Electronic Medical Records (EMRs), yet this process remains dominated by hardcoded, database-specific pipelines for cohort definition, feature selection, and code mapping. These manual efforts limit scalability, reproducibility, and cross-institutional generalization. To address this, we introduce EMR-AGENT (Automated Generalized Extraction and Navigation Tool), an agent-based framework that replaces manual rule writing with dynamic, language model-driven interaction to extract and standardize structured clinical data. Our framework automates cohort selection, feature extraction, and code mapping through interactive querying of databases. Our modular agents iteratively observe query results and reason over schema and documentation, using SQL not just for data retrieval but also as a tool for database observation and decision making. This eliminates the need for hand-crafted, schema-specific logic. To enable rigorous evaluation, we develop a benchmarking codebase for three EMR databases (MIMIC-III, eICU, SICdb), including both seen and unseen schema settings. Our results demonstrate strong performance and generalization across these databases, highlighting the feasibility of automating a process previously thought to require expert-driven design. The code will be released publicly.
Cohort Selection: Include only male ICU patients between the ages of 70 and 89 who have only a single admission in their lifetime.
Features: ICU stay ID, gender (Male/Female/Unknown), age (integer), mortality status (Dead/Alive/Unknown).
[Target Feature]: Mapping code of Hematocrit [Volume Fraction] of Blood (Lab-Test) for the SICdb database.
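To illustrate how SQL can serve as an observation tool for code mapping, the following minimal Python sketch searches a code dictionary for candidate entries matching a target feature. The table `lab_dictionary`, its columns, the codes, and the helper `find_candidate_codes` are hypothetical; they do not reflect the actual SICdb schema or the agents' implementation.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create a tiny in-memory database with a hypothetical code dictionary."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE lab_dictionary (code INTEGER PRIMARY KEY, name TEXT, unit TEXT)"
    )
    # Toy rows; real databases use their own codes and naming conventions.
    conn.executemany(
        "INSERT INTO lab_dictionary VALUES (?, ?, ?)",
        [
            (101, "Hematocrit (blood)", "%"),
            (102, "Hemoglobin", "g/dL"),
            (103, "Hematocrit (other fluid)", "%"),
            (104, "Heart rate", "bpm"),
        ],
    )
    return conn


def find_candidate_codes(conn: sqlite3.Connection, keyword: str) -> list:
    """Observation step: list dictionary rows whose name contains the keyword."""
    return conn.execute(
        "SELECT code, name, unit FROM lab_dictionary "
        "WHERE LOWER(name) LIKE '%' || LOWER(?) || '%' "
        "ORDER BY code",
        (keyword,),
    ).fetchall()


conn = build_demo_db()
candidates = find_candidate_codes(conn, "Hematocrit")
for row in candidates:
    print(row)
```

In this toy setting the query returns more than one candidate, so a keyword match alone does not settle the mapping. An agent would need further observations, for example sample values or the database documentation, to decide which code corresponds to the blood measurement.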
The two components of our EMR-AGENT framework, CFSA and CMA, can be applied to automatically generate an Event Stream Dataset for downstream clinical tasks, such as mortality prediction, without any hardcoded rules.
Here, we present an example application in which CMA generates an SQL query conditioned on time constraints; this query is then integrated with CFSA’s SQL query to produce the final Event Stream Dataset.
I want only male ICU patients between the ages of 70 and 89 who have only a single admission in their lifetime
and who have a heart rate measurement within 48 hours after ICU admission.
I am looking for age and gender as static features, mortality status as the prediction target,
and heart rate (vital sign) information, including name, values, unit, and measurement time, in the MIMIC-III database.
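The integration of a cohort query with a time-constrained feature query can be sketched as follows. This is a minimal Python illustration, not output produced by the agents: the tables `patients`, `icu_stays`, and `events`, the heart rate code `104`, and all data values are hypothetical and far simpler than the real MIMIC-III schema. As a simplifying assumption, a single lifetime admission is approximated by a patient having exactly one row in `icu_stays`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patients (patient_id INTEGER, gender TEXT, age INTEGER, mortality TEXT);
CREATE TABLE icu_stays (stay_id INTEGER, patient_id INTEGER, intime TEXT);
CREATE TABLE events (stay_id INTEGER, code INTEGER, value REAL, unit TEXT, charttime TEXT);
INSERT INTO patients VALUES
  (1, 'M', 75, 'Alive'), (2, 'M', 80, 'Dead'), (3, 'F', 72, 'Alive'),
  (4, 'M', 65, 'Alive'), (5, 'M', 85, 'Alive');
INSERT INTO icu_stays VALUES
  (10, 1, '2020-01-01 00:00:00'), (20, 2, '2020-02-01 00:00:00'),
  (21, 2, '2020-03-01 00:00:00'), (30, 3, '2020-01-05 00:00:00'),
  (40, 4, '2020-01-06 00:00:00'), (50, 5, '2020-01-07 00:00:00');
INSERT INTO events VALUES
  (10, 104, 88.0, 'bpm', '2020-01-01 06:00:00'),
  (10, 104, 92.0, 'bpm', '2020-01-02 12:00:00'),
  (10, 104, 90.0, 'bpm', '2020-01-04 00:00:00'),
  (10, 999, 37.0, 'C',   '2020-01-01 07:00:00'),
  (20, 104, 99.0, 'bpm', '2020-02-01 01:00:00'),
  (30, 104, 77.0, 'bpm', '2020-01-05 01:00:00'),
  (50, 104, 70.0, 'bpm', '2020-01-10 00:00:00');
""")

# Cohort query (the kind of query attributed to CFSA in the text):
# male, age 70 to 89, exactly one stay on record.
COHORT_SQL = """
SELECT s.stay_id, p.gender, p.age, p.mortality
FROM icu_stays s
JOIN patients p ON p.patient_id = s.patient_id
WHERE p.gender = 'M'
  AND p.age BETWEEN 70 AND 89
  AND (SELECT COUNT(*) FROM icu_stays s2 WHERE s2.patient_id = p.patient_id) = 1
"""

# Feature query with a time constraint (the kind of query attributed to CMA):
# heart rate events within 48 hours (2 days) after ICU admission.
EVENT_SQL = """
SELECT e.stay_id, 'Heart rate' AS name, e.value, e.unit, e.charttime
FROM events e
JOIN icu_stays s ON s.stay_id = e.stay_id
WHERE e.code = :hr_code
  AND julianday(e.charttime) >= julianday(s.intime)
  AND julianday(e.charttime) <= julianday(s.intime) + 2
"""

# Integrate both queries; the inner join keeps only cohort members
# that have at least one qualifying heart rate measurement.
FINAL_SQL = f"""
WITH cohort AS ({COHORT_SQL}),
     hr AS ({EVENT_SQL})
SELECT c.stay_id, c.gender, c.age, c.mortality,
       hr.name, hr.value, hr.unit, hr.charttime
FROM cohort c
JOIN hr ON hr.stay_id = c.stay_id
ORDER BY c.stay_id, hr.charttime
"""

rows = conn.execute(FINAL_SQL, {"hr_code": 104}).fetchall()
for row in rows:
    print(row)
```

Each output row pairs the static features of one ICU stay with a single timestamped measurement, which is one plausible shape for an event stream record. Keeping the cohort and feature queries as separate common table expressions mirrors the division of work between the two agents described above.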