How it works
This is a guide to how chardet’s detection algorithm works internally.
You may also be interested in the research paper which originally inspired the Mozilla implementation that chardet is based on: A composite approach to language/encoding detection.
Overview
The main entry point is universaldetector.py, which contains the
UniversalDetector class. (The detect function in
chardet/__init__.py is a convenience wrapper that creates a
UniversalDetector, feeds it data, and returns the result.)
UniversalDetector processes input through a pipeline of probers,
each specialized for a category of encodings. Detection proceeds
through these stages in order:
BOM detection — immediate identification of UTF-8-SIG, UTF-16, or UTF-32 via byte order marks.
UTF-16/32 without BOM —
UTF1632Proberdetects UTF-16/32 by analyzing null-byte patterns and byte distributions.Escaped encodings —
EscCharSetProberdetects 7-bit encodings that use escape sequences (ISO-2022-JP,ISO-2022-KR,HZ-GB-2312).Multi-byte encodings —
MBCSGroupProberruns probers forUTF-8,GB18030,Big5,EUC-JP,EUC-KR,Shift-JIS,CP949, andJohab.Single-byte encodings —
SBCSGroupProberruns hundreds of encoding+language-specific probers using bigram frequency models.Encoding era filtering — results are filtered by the requested
EncodingEratier, and close confidence scores are broken by preferring more modern encodings.
BOM detection
If the text starts with a byte order mark (BOM), UniversalDetector
immediately identifies the encoding as UTF-8-SIG, UTF-16 BE/LE,
or UTF-32 BE/LE and returns the result without further processing.
UTF-16/32 without BOM
UTF1632Prober (defined in utf1632prober.py) detects UTF-16 and
UTF-32 encoded text that lacks a BOM. It analyzes the distribution of
null bytes: UTF-32 produces characteristic patterns of 3 null bytes per
character for ASCII-range text, while UTF-16 produces alternating null
and non-null bytes.
Escaped encodings
If the text contains escape sequences, UniversalDetector creates an
EscCharSetProber (defined in escprober.py) which runs state
machines for HZ-GB-2312, ISO-2022-JP, and ISO-2022-KR
(defined in escsm.py). Each state machine processes the text one
byte at a time. If any state machine uniquely identifies the encoding,
the result is returned immediately. State machines that encounter
illegal sequences are dropped.
Multi-byte encodings
When high-bit characters are detected, UniversalDetector creates a
MBCSGroupProber (defined in mbcsgroupprober.py) which manages
probers for each multi-byte encoding:
UTF8Prober— UTF-8GB18030Prober— GB18030 / GB2312 (Simplified Chinese)Big5Prober— Big5 (Traditional Chinese)EUCJPProber— EUC-JP (Japanese)SJISProber— Shift-JIS (Japanese)EUCKRProber— EUC-KR (Korean)CP949Prober— CP949 (Korean)JOHABProber— Johab (Korean)
Each multi-byte prober inherits from MultiByteCharSetProber (defined
in mbcharsetprober.py) and uses two analysis techniques:
Coding state machines (defined in mbcssm.py) process the text
one byte at a time, looking for byte sequences that are valid or invalid
in the target encoding. An illegal sequence immediately eliminates that
encoding from consideration. A uniquely identifying sequence produces an
immediate positive result.
Character distribution analysis (defined in chardistribution.py)
uses language-specific frequency tables to measure how well the decoded
characters match expected usage patterns. Once enough text has been
processed, a confidence rating is calculated.
The case of Japanese is more complex. Single-character distribution
analysis alone cannot always distinguish EUC-JP from Shift-JIS,
so SJISProber (defined in sjisprober.py) also uses 2-character
context analysis. SJISContextAnalysis and EUCJPContextAnalysis
(both defined in jpcntx.py) check the frequency of Hiragana
syllabary characters to help distinguish between the two encodings.
Single-byte encodings
SBCSGroupProber (defined in sbcsgroupprober.py) manages hundreds
of SingleByteCharSetProber instances, one for each combination of
single-byte encoding and language. For example, Windows-1252 is
paired with English, French, German, Spanish, and many other Western
European languages, while KOI8-R is paired with Russian.
Every single-byte encoding is detected the same way: each
SingleByteCharSetProber (defined in sbcharsetprober.py) takes a
bigram language model as input. These models (stored in
lang*model.py files) define how frequently each pair of consecutive
characters appears in typical text for that language and encoding. The
prober tallies bigram frequencies in the input and calculates a
confidence score.
The bigram models are trained using the create_language_model.py
script from the CulturaX multilingual corpus, covering 45+ languages.
This unified approach replaces the older system where only a few
languages had trained models and Western encodings relied on
special-case heuristics.
Hebrew is handled as a special case by HebrewProber (defined in
hebrewprober.py), which distinguishes between Visual Hebrew (stored
right-to-left, displayed verbatim) and Logical Hebrew (stored in
reading order, rendered right-to-left by the client) by analyzing the
positions of final-form characters.
Encoding era filtering and tie-breaking
After all probers report their confidence scores, UniversalDetector
filters results by the requested EncodingEra. Only encodings
belonging to the selected era(s) are considered.
When multiple encodings have very close confidence scores, the detector
prefers encodings from more modern tiers (MODERN_WEB over
LEGACY_ISO over LEGACY_MAC, and so on). This prevents legacy
encodings from winning ties against their modern equivalents.
See Supported encodings for which encodings belong to each era.