DBTOKEN
Interactive Demo
Tokenization strategies for longitudinal EHR data
Records are synthetically generated by rule-based random generation and do not correspond to any real patient.
Text Tokenization
Text Mode
How non-numeric text is encoded. `concept` = one token per whole string; `auto` chooses between concept and BPE for each class, based on how many unique text values the class has.
concept
auto
Auto Threshold
In `auto` mode: classes with more unique text values than this use BPE; others use concept.
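A minimal sketch of how the `auto` decision could work, assuming each row is a `(class_name, text_value)` pair. The function name `pick_text_modes` and the row shape are hypothetical and not taken from DBTOKEN itself:

```python
from collections import defaultdict


def pick_text_modes(rows, auto_threshold):
    """Return {class_name: 'bpe' | 'concept'} from per-class cardinality.

    rows is an iterable of (class_name, text_value) pairs (assumed shape).
    """
    uniques = defaultdict(set)
    for class_name, text_value in rows:
        uniques[class_name].add(text_value)
    # Strictly more unique values than the threshold -> BPE, else concept.
    return {
        class_name: ("bpe" if len(values) > auto_threshold else "concept")
        for class_name, values in uniques.items()
    }


rows = [("note", "a"), ("note", "b"), ("note", "c"), ("sex", "M"), ("sex", "F")]
modes = pick_text_modes(rows, auto_threshold=2)
```

Here `note` has three unique values and exceeds the threshold of 2, so it would use BPE, while `sex` stays in concept mode.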
Vocab Size
Target BPE vocabulary size. Special tokens and concept tokens are reserved first; the remaining slots are filled by BPE merges.
Class Overrides
Force the selected classes to use the `Override Mode` below, regardless of what `auto` would choose.
Override Mode
The text mode applied to every class selected above.
concept
Numeric Tokenization
How numeric values become tokens.
Numeric Type
`discrete` = quantile bins (Q-tokens); `continuous` = float value scaled by per-group distribution.
discrete
continuous
Numeric Sequence
`factored` = numeric as its own token; `fused` = concept+value merged (e.g. `hemoglobin::Q3`).
factored
fused
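The difference between the two sequence layouts can be illustrated with a short sketch. The helper `emit_numeric` is hypothetical; only the `::` separator and the `hemoglobin::Q3` example come from the description above:

```python
def emit_numeric(concept, value_token, numeric_sequence):
    """Return the token(s) for one numeric observation."""
    if numeric_sequence == "fused":
        # Concept and value merged into a single token.
        return [f"{concept}::{value_token}"]
    if numeric_sequence == "factored":
        # Concept token followed by the numeric value as its own token.
        return [concept, value_token]
    raise ValueError(f"unknown numeric_sequence: {numeric_sequence!r}")


fused = emit_numeric("hemoglobin", "Q3", "fused")
factored = emit_numeric("hemoglobin", "Q3", "factored")
```

Fused output keeps sequences shorter but multiplies the vocabulary by the number of bins per concept; factored output does the opposite.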
Bins
Number of quantile bins for discrete numerics (and time deltas in discrete mode).
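As a rough illustration of quantile binning, the sketch below fits bin edges on a group's training values and maps a new value to a Q-token. The function names and the zero-based `Q0`…`Q{bins-1}` numbering are assumptions; the tool's actual numbering may differ:

```python
from bisect import bisect_right
from statistics import quantiles


def fit_bin_edges(values, bins):
    """Return the bins-1 interior cut points that split values into quantile bins."""
    return quantiles(values, n=bins)


def to_q_token(value, edges):
    """Map a value to its quantile-bin token (zero-based index, assumed)."""
    return f"Q{bisect_right(edges, value)}"


edges = fit_bin_edges(list(range(1, 101)), bins=4)
low_token = to_q_token(10, edges)
high_token = to_q_token(99, edges)
```

Because the edges are quantiles, each bin receives roughly the same number of training values regardless of how skewed the raw distribution is.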
Level Threshold
If a group has ≤ this many unique values, use categorical L-tokens instead of bins/scaling.
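One way to picture the level threshold: a group with few distinct values gets one categorical token per value instead of being binned. This is only a sketch; the helper name and the `L0`, `L1`, … naming in sorted order are assumptions:

```python
def level_tokens(values, level_threshold):
    """Return {value: 'L<i>'} if the group is low-cardinality, else None."""
    levels = sorted(set(values))
    if len(levels) > level_threshold:
        return None  # too many distinct values: fall back to bins or scaling
    return {value: f"L{i}" for i, value in enumerate(levels)}


pain_scores = [0, 1, 2, 2, 1, 0, 3]
mapping = level_tokens(pain_scores, level_threshold=5)
too_many = level_tokens(list(range(100)), level_threshold=5)
```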
Clipping
Percentile clipping before binning/fitting. Protects against outliers.
Bin Clip Min %
Lower percentile edge for quantile binning.
Bin Clip Max %
Upper percentile edge for quantile binning.
Cont. Clip Min %
Lower percentile clip before distribution fitting (continuous mode).
Cont. Clip Max %
Upper percentile clip before distribution fitting (continuous mode).
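Percentile clipping can be sketched as follows, using linear interpolation between ranks. The helper names are hypothetical, and the interpolation rule is an assumption rather than the tool's documented behavior:

```python
def percentile(sorted_values, pct):
    """Linearly interpolated percentile of an already sorted, non-empty list."""
    if not sorted_values:
        raise ValueError("need at least one value")
    k = (len(sorted_values) - 1) * pct / 100
    lo = int(k)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (k - lo)


def clip_to_percentiles(values, min_pct, max_pct):
    """Clamp every value into the [min_pct, max_pct] percentile range."""
    ordered = sorted(values)
    lower = percentile(ordered, min_pct)
    upper = percentile(ordered, max_pct)
    return [min(max(v, lower), upper) for v in values]


clipped = clip_to_percentiles(list(range(101)), min_pct=10, max_pct=90)
```

Clipping before binning or fitting keeps a handful of extreme readings from stretching the bin edges or the fitted distribution.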
Milestones & Clinical
Calendar/shift-boundary tokens emitted on change (Kalman-filter style).
IP/OP state machine (admission/discharge)
Builds state_transitions from HOSPITAL_ADMISSION → inpatient, HOSPITAL_DISCHARGE → outpatient.
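The admission/discharge state machine could look roughly like this. The event names come from the description above; the dictionary layout, the function name, and the initial `outpatient` state are assumptions:

```python
# Event class -> state entered when that event is seen.
STATE_TRANSITIONS = {
    "HOSPITAL_ADMISSION": "inpatient",
    "HOSPITAL_DISCHARGE": "outpatient",
}


def label_states(event_names, initial_state="outpatient"):
    """Pair each event with the state in effect after processing it."""
    state = initial_state
    labeled = []
    for name in event_names:
        state = STATE_TRANSITIONS.get(name, state)  # other events keep the state
        labeled.append((name, state))
    return labeled


labeled = label_states(
    ["LAB", "HOSPITAL_ADMISSION", "LAB", "HOSPITAL_DISCHARGE", "VISIT"]
)
```

The current state then selects which milestone granularity applies to each stretch of the sequence.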
Milestone (Inpatient)
Milestone granularity during inpatient state.
none
8hr
12hr
daily
week
month
Milestone (Outpatient)
Milestone granularity during outpatient state.
none
week
month
daily
12hr
8hr
Shift Start Hour
Hour-of-day (0–23) for shift boundary. Only used when a milestone is 8hr/12hr.
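To illustrate how a shift start hour might interact with 8hr/12hr milestones, the sketch below assigns each timestamp to a shift bucket and reports where the bucket changes. All names are hypothetical, and emitting a milestone only on change (never before the first event) is an assumption:

```python
from datetime import datetime, timedelta


def shift_bucket(ts, shift_hours, shift_start_hour):
    """Identify the shift containing ts, with shifts anchored at shift_start_hour."""
    shifted = ts - timedelta(hours=shift_start_hour)
    return (shifted.date(), shifted.hour // shift_hours)


def milestone_positions(timestamps, shift_hours, shift_start_hour):
    """Indices of events before which a shift-boundary token would be emitted."""
    positions = []
    previous = None
    for i, ts in enumerate(timestamps):
        bucket = shift_bucket(ts, shift_hours, shift_start_hour)
        if previous is not None and bucket != previous:
            positions.append(i)
        previous = bucket
    return positions


events = [
    datetime(2024, 1, 1, 6, 0),
    datetime(2024, 1, 1, 6, 30),
    datetime(2024, 1, 1, 7, 30),
    datetime(2024, 1, 1, 20, 0),
]
positions = milestone_positions(events, shift_hours=12, shift_start_hour=7)
```

With 12-hour shifts starting at 07:00, the first two events fall in the previous night's shift, the third opens the day shift, and the fourth opens the next night shift.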
Time Scales (sec)
Comma-separated thresholds (sec) to split time-delta distributions. e.g. "86400" = under/over 24h. Leave blank for a single global distribution.
Scale Names
Optional comma-separated labels, one per band; there must be exactly len(thresholds)+1 of them. Leave blank for automatic labels "0","1",…
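A small sketch of how the two fields above might be parsed and applied. The function names are hypothetical, and placing a delta exactly equal to a threshold in the upper band is an assumption:

```python
from bisect import bisect_right


def parse_scales(time_scales, scale_names=""):
    """Parse the comma-separated fields into (thresholds, band_names)."""
    thresholds = sorted(float(t) for t in time_scales.split(",") if t.strip())
    names = [n.strip() for n in scale_names.split(",") if n.strip()]
    if not names:
        # Automatic labels "0", "1", ... when no names are given.
        names = [str(i) for i in range(len(thresholds) + 1)]
    if len(names) != len(thresholds) + 1:
        raise ValueError("need exactly len(thresholds)+1 scale names")
    return thresholds, names


def band_for(delta_sec, thresholds, names):
    """Return the name of the band that a time delta falls into."""
    return names[bisect_right(thresholds, delta_sec)]


thresholds, names = parse_scales("86400", "short,long")
hour_band = band_for(3600, thresholds, names)
two_day_band = band_for(172800, thresholds, names)
```

Each band would then get its own time-delta distribution, so sub-day gaps and multi-day gaps are binned or scaled separately.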
Birth Date Class
Class name for birth-date demographic rows.
Birth Date Text Value
Text-value marker for birth-date rows (blank = none).
Display
Patient ID
Which patient's sequence to encode & visualize.
Show rows
Max debug-table rows.
Show tokens
Max token chips in the synchronized stream.
▶ Run Tokenizer
↺ Reset Settings
⇄ Side-by-Side Comparison
Run two configs on the same patient and compare output side-by-side.
Configure Panel B settings
Full configuration for comparison panel B. Unspecified fields inherit from panel A.
Text Mode
concept
auto
Auto Threshold
Vocab Size
Class Overrides
Override Mode
concept
Numeric Type
discrete
continuous
Numeric Sequence
factored
fused
Bins
Level Threshold
Bin Clip Min %
Bin Clip Max %
Cont. Clip Min %
Cont. Clip Max %
IP/OP state machine
Milestone (IP)
none
8hr
12hr
daily
week
month
Milestone (OP)
none
week
month
daily
12hr
8hr
Shift Start Hour
Time Scales (sec)
Scale Names
Birth Date Class
Birth Date Text Value
⇄ Compare A vs B
⚡ Pattern Search
+ Add Element
collector
Accumulator run across matched elements (e.g. `collect_pairs` → [(token, time), …]).
none
count
sum_times
collect_tokens
collect_pairs
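The collectors can be read as simple reductions over the matched `(token, time)` pairs. The sketch below is illustrative only: the function name is hypothetical, and the exact meaning of `count` and `sum_times` (number of matched elements, sum of their times) is an assumption; only the `collect_pairs` shape is stated above:

```python
def run_collector(collector, matched):
    """Reduce a list of matched (token, time) pairs according to the collector."""
    if collector == "none":
        return None
    if collector == "count":
        return len(matched)
    if collector == "sum_times":
        return sum(time for _, time in matched)
    if collector == "collect_tokens":
        return [token for token, _ in matched]
    if collector == "collect_pairs":
        return list(matched)
    raise ValueError(f"unknown collector: {collector!r}")


matched = [("hemoglobin::Q3", 0), ("creatinine::Q1", 3600)]
pairs = run_collector("collect_pairs", matched)
tokens = run_collector("collect_tokens", matched)
```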
max_dt (sec)
Max duration of the whole pattern. Blank = ∞.
Greedy (backtrackable=False on all elements)
Show Python script
📋 Copy script
⚡ Search Pattern
Configure parameters and click Run Tokenizer to see the output.