Changelog¶

7.0.0 (2026-03-02)¶

Ground-up, MIT-licensed rewrite of chardet. Same package name, same public API — drop-in replacement for chardet 5.x/6.x.

Highlights:

MIT license (previous versions were LGPL)
96.8% accuracy on 2,179 test files (+2.3pp vs chardet 6.0.0, +7.7pp vs charset-normalizer)
41x faster than chardet 6.0.0 with mypyc (28x pure Python), 7.5x faster than charset-normalizer
Language detection for every result (90.5% accuracy across 49 languages)
99 encodings across six eras (MODERN_WEB, LEGACY_ISO, LEGACY_MAC, LEGACY_REGIONAL, DOS, MAINFRAME)
12-stage detection pipeline — BOM, UTF-16/32 patterns, escape sequences, binary detection, markup charset, ASCII, UTF-8 validation, byte validity, CJK gating, structural probing, statistical scoring, post-processing
Bigram frequency models trained on CulturaX multilingual corpus data for all supported language/encoding pairs
Optional mypyc compilation — 1.49x additional speedup on CPython
Thread-safe detect() and detect_all() with no measurable overhead; scales on free-threaded Python 3.13t+
Negligible import memory (96 B)
Zero runtime dependencies

Breaking changes vs 6.0.0:

detect() and detect_all() now default to encoding_era=EncodingEra.ALL (6.0.0 defaulted to MODERN_WEB)
Internal architecture is completely different (probers replaced by pipeline stages). Only the public API is preserved.
LanguageFilter is accepted but ignored (deprecation warning emitted)
chunk_size is accepted but ignored (deprecation warning emitted)

Features:

Unified single-byte charset detection with proper language-specific bigram models for all single-byte encodings (replaces Latin1Prober and MacRomanProber heuristics)
38 new languages: Arabic, Belarusian, Breton, Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Farsi, Finnish, French, German, Icelandic, Indonesian, Irish, Italian, Kazakh, Latvian, Lithuanian, Macedonian, Malay, Maltese, Norwegian, Polish, Portuguese, Romanian, Scottish Gaelic, Serbian, Slovak, Slovene, Spanish, Swedish, Tajik, Ukrainian, Vietnamese, Welsh
EncodingEra filtering via new encoding_era parameter
max_bytes and chunk_size parameters for detect(), detect_all(), and UniversalDetector
-e/--encoding-era CLI flag
EBCDIC detection (CP037, CP500)
Direct GB18030 support (replaces redundant GB2312 prober)
Binary file detection
Python 3.12, 3.13, and 3.14 support

Breaking changes:

Fixes:

Added should_rename_legacy argument to remap legacy encoding names to modern equivalents
Added MacRoman encoding prober
Added --minimal flag to chardetect CLI
Added type annotations and mypy CI
Added support for Python 3.11
Removed support for Python 3.6

Added Johab Korean prober
Added UTF-16/32 BE/LE probers
Added test data for Croatian, Czech, Hungarian, Polish, Slovak, Slovene, Greek, Turkish
Improved XML tag filtering
Made detect_all return child prober confidences
Dropped Python 2.7, 3.4, 3.5 (requires Python 3.6+)

Added Turkish ISO-8859-9 detection
Modernized naming conventions (typical_positive_ratio instead of mTypicalPositiveRatio)
Added language property to probers and results
Switched from Travis to GitHub Actions
Fixed CharsetGroupProber.state not being set to FOUND_IT

Initial release: Python 2 port of Mozilla’s universal charset detector (Mark Pilgrim)