Changelog
7.1.0 (2026-03-11)
Features:
Added PEP 263 encoding declaration detection: ``# -*- coding: ... -*-`` and ``# coding=...`` declarations on lines 1–2 of Python source files are now recognized with confidence 0.95 (#249)
Added ``chardet.universaldetector`` backward-compatibility stub so that ``from chardet.universaldetector import UniversalDetector`` works with a deprecation warning (#341)
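As an illustration of the PEP 263 item above, the following is a minimal sketch of how a declaration on the first two lines can be found. It uses the regular expression published in PEP 263 and is not chardet's own implementation; the function name ``find_declared_encoding`` is hypothetical, and the sketch ignores the tokenizer's rule that line 2 only counts when line 1 is blank or a comment.

```python
import re
from typing import Optional

# Regular expression from PEP 263, applied to bytes.
_CODING_RE = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def find_declared_encoding(data: bytes) -> Optional[str]:
    """Return the encoding declared on line 1 or 2, or None if absent."""
    for line in data.splitlines()[:2]:
        match = _CODING_RE.match(line)
        if match:
            return match.group(1).decode("ascii")
    return None


source = b"#!/usr/bin/env python\n# -*- coding: latin-1 -*-\nx = 1\n"
print(find_declared_encoding(source))
```

A declaration on line 3 or later is deliberately not picked up, which matches the "lines 1–2" restriction in the entry.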
Fixes:
Fixed false UTF-7 detection of ASCII text containing ``++`` or ``+word`` patterns (#332)
Fixed 0.5s startup cost on first ``detect()`` call: model norms are now computed during loading instead of lazily iterating 21M entries (#333)
Fixed undocumented encoding name changes between chardet 5.x and 7.0: ``detect()`` now returns chardet 5.x-compatible names by default (#338)
Improved ISO-2022-JP family detection: recognizes ESC sequences for ISO-2022-JP-2004 (JIS X 0213) and ISO-2022-JP-EXT (JIS X 0201 Kana)
Fixed silent truncation of corrupt model data (``iter_unpack`` yielded fewer tuples instead of raising)
Fixed incorrect date in LICENSE
Performance:
5.5x faster first-detect time (~0.42s → ~0.075s) by computing model norms as a side-product of ``load_models()``
~40% faster model parsing via ``struct.iter_unpack`` for bulk entry extraction (eliminates ~305K individual ``unpack`` calls)
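The bulk-extraction technique named above can be sketched with the standard library alone. The record layout below (two byte values plus a 16-bit weight) and the ``parse_entries`` helper are assumptions made for illustration; chardet's actual model format is not described in this changelog. The explicit length check shows one way to reject short data instead of silently returning fewer records.

```python
import struct
from typing import List, Tuple

# Hypothetical record layout: two byte values and an unsigned 16-bit weight.
RECORD = struct.Struct("<BBH")


def parse_entries(payload: bytes, expected_count: int) -> List[Tuple[int, int, int]]:
    """Decode every record in one pass; raise if the payload size is wrong."""
    expected_size = expected_count * RECORD.size
    if len(payload) != expected_size:
        raise ValueError(f"expected {expected_size} bytes, got {len(payload)}")
    # One iter_unpack call replaces expected_count separate unpack calls.
    return list(RECORD.iter_unpack(payload))


payload = RECORD.pack(0x61, 0x62, 300) + RECORD.pack(0x62, 0x63, 500)
entries = parse_entries(payload, 2)
```

Checking the size against a stored count catches a payload that was cut off on a record boundary, which ``iter_unpack`` on its own would accept.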
New API parameters:
Added ``compat_names`` parameter (default ``True``) to ``detect()``, ``detect_all()``, and ``UniversalDetector``: set to ``False`` to get raw Python codec names instead of chardet 5.x/6.x compatible display names
Added ``prefer_superset`` parameter (default ``False``): remaps legacy ISO/subset encodings to their modern Windows/CP superset equivalents (e.g., ASCII → Windows-1252, ISO-8859-1 → Windows-1252). This will default to ``True`` in the next major version (8.0).
Deprecated ``should_rename_legacy`` in favor of ``prefer_superset``: a deprecation warning is emitted when used
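To see why remapping ISO-8859-1 to Windows-1252 can matter in practice, here is a small standard-library sketch. It does not call chardet; it only shows how the same bytes decode under the two codecs, using a made-up sample string.

```python
# 0x93 and 0x94 are curly quotes in Windows-1252 but C1 control characters
# in ISO-8859-1; 0xE9 is "é" in both encodings.
sample = b"\x93caf\xe9\x94"

as_latin1 = sample.decode("iso-8859-1")
as_cp1252 = sample.decode("cp1252")

# Pure ASCII input decodes identically under the superset codec.
ascii_same = b"plain text".decode("ascii") == b"plain text".decode("cp1252")
```

Text produced on Windows often contains bytes in the 0x80–0x9F range, so labelling it with the superset encoding tends to yield readable punctuation rather than invisible control characters.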
Improvements:
Switched internal canonical encoding names to Python codec names (e.g., ``"utf-8"`` instead of ``"UTF-8"``), with ``compat_names`` controlling the public output format. See Usage for the full mapping table.
Added ``lookup_encoding()`` to ``registry`` for case-insensitive resolution of arbitrary encoding name input to canonical names
Achieved 100% line coverage across all source modules (+31 tests)
Updated benchmark numbers: 98.2% encoding accuracy, 95.2% language accuracy on 2,510 test files
Pinned test-data cloning to chardet release version tags for reproducible builds
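The idea of resolving arbitrary encoding labels to canonical Python codec names can be approximated with the standard library's ``codecs.lookup``. The helper below is a sketch under that assumption; it is not ``lookup_encoding()`` itself, whose signature and return values are not given in this changelog, and the name ``canonical_codec_name`` is hypothetical.

```python
import codecs
from typing import Optional


def canonical_codec_name(label: str) -> Optional[str]:
    """Map a user-supplied label to Python's codec name, or None if unknown."""
    try:
        # codecs.lookup is case-insensitive and understands common aliases.
        return codecs.lookup(label.strip()).name
    except LookupError:
        return None


for label in ("UTF-8", "utf8", "Latin-1", "not-a-codec"):
    print(label, "->", canonical_codec_name(label))
```

Returning ``None`` for unknown labels keeps the caller in control of error handling; raising would be an equally reasonable design.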
7.0.1 (2026-03-04)
Fixes:
Fixed false UTF-7 detection of SHA-1 git hashes (#324)
Fixed ``_SINGLE_LANG_MAP`` missing aliases for single-language encoding lookup (e.g., ``big5`` → ``big5hkscs``)
Fixed PyPy ``TypeError`` in UTF-7 codec handling
Improvements:
Retrained bigram models — 24 previously failing test cases now pass
Updated language equivalences for mutual intelligibility (Slovak/Czech, East Slavic + Bulgarian, Malay/Indonesian, Scandinavian languages)
7.0.0 (2026-03-02)
Ground-up, MIT-licensed rewrite of chardet. Same package name, same public API — drop-in replacement for chardet 5.x/6.x.
Highlights:
MIT license (previous versions were LGPL)
96.8% accuracy on 2,179 test files (+2.3pp vs chardet 6.0.0, +7.7pp vs charset-normalizer)
41x faster than chardet 6.0.0 with mypyc (28x pure Python), 7.5x faster than charset-normalizer
Language detection for every result (90.5% accuracy across 49 languages)
99 encodings across six eras (MODERN_WEB, LEGACY_ISO, LEGACY_MAC, LEGACY_REGIONAL, DOS, MAINFRAME)
12-stage detection pipeline — BOM, UTF-16/32 patterns, escape sequences, binary detection, markup charset, ASCII, UTF-8 validation, byte validity, CJK gating, structural probing, statistical scoring, post-processing
Bigram frequency models trained on CulturaX multilingual corpus data for all supported language/encoding pairs
Optional mypyc compilation — 1.49x additional speedup on CPython
Thread-safe ``detect()`` and ``detect_all()`` with no measurable overhead; scales on free-threaded Python 3.13t+
Negligible import memory (96 B)
Zero runtime dependencies
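The first stage listed in the pipeline, BOM detection, is simple enough to sketch with the standard library. This is an illustration of the general technique rather than chardet's code; the ``sniff_bom`` name and the choice of returned codec names are assumptions.

```python
import codecs
from typing import Optional

# UTF-32 is tested before UTF-16 because the UTF-32 LE BOM (FF FE 00 00)
# begins with the UTF-16 LE BOM (FF FE). The returned codecs consume the BOM.
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def sniff_bom(data: bytes) -> Optional[str]:
    """Return a codec name if data starts with a known BOM, else None."""
    for bom, codec_name in _BOMS:
        if data.startswith(bom):
            return codec_name
    return None


data = codecs.BOM_UTF8 + "héllo".encode("utf-8")
codec = sniff_bom(data)
text = data.decode(codec) if codec else None
```

A real detector also has to consider that FF FE 00 00 could be a UTF-16 LE BOM followed by a NUL character, which is one reason a BOM check is usually followed by further validation stages.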
Breaking changes vs 6.0.0:
``detect()`` and ``detect_all()`` now default to ``encoding_era=EncodingEra.ALL`` (6.0.0 defaulted to ``MODERN_WEB``)
Internal architecture is completely different (probers replaced by pipeline stages). Only the public API is preserved.
``LanguageFilter`` is accepted but ignored (deprecation warning emitted)
``chunk_size`` is accepted but ignored (deprecation warning emitted)
6.0.0 (2026-02-22)
Features:
Unified single-byte charset detection with proper language-specific bigram models for all single-byte encodings (replaces ``Latin1Prober`` and ``MacRomanProber`` heuristics)
38 new languages: Arabic, Belarusian, Breton, Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Farsi, Finnish, French, German, Icelandic, Indonesian, Irish, Italian, Kazakh, Latvian, Lithuanian, Macedonian, Malay, Maltese, Norwegian, Polish, Portuguese, Romanian, Scottish Gaelic, Serbian, Slovak, Slovene, Spanish, Swedish, Tajik, Ukrainian, Vietnamese, Welsh
``EncodingEra`` filtering via new ``encoding_era`` parameter
``max_bytes`` and ``chunk_size`` parameters for ``detect()``, ``detect_all()``, and ``UniversalDetector``
``-e``/``--encoding-era`` CLI flag
EBCDIC detection (CP037, CP500)
Direct GB18030 support (replaces redundant GB2312 prober)
Binary file detection
Python 3.12, 3.13, and 3.14 support
Breaking changes:
Dropped Python 3.7, 3.8, and 3.9 (requires Python 3.10+)
Removed ``Latin1Prober`` and ``MacRomanProber``
Removed EUC-TW support
Removed ``LanguageFilter.NONE``
``detect()`` default changed to ``encoding_era=EncodingEra.MODERN_WEB``
Fixes:
Fixed CP949 state machine
Fixed SJIS distribution analysis (second-byte range >= 0x80)
Fixed UTF-16/32 detection for non-ASCII-heavy text
Fixed GB18030 ``char_len_table``
Fixed UTF-8 state machine
Fixed ``detect_all()`` returning inactive probers
Fixed early cutoff bug
5.2.0 (2023-08-01)
Added support for running the CLI via ``python -m chardet``
5.1.0 (2022-12-01)
Added ``should_rename_legacy`` argument to remap legacy encoding names to modern equivalents
Added MacRoman encoding prober
Added ``--minimal`` flag to ``chardetect`` CLI
Added type annotations and mypy CI
Added support for Python 3.11
Removed support for Python 3.6
5.0.0 (2022-06-25)
Added Johab Korean prober
Added UTF-16/32 BE/LE probers
Added test data for Croatian, Czech, Hungarian, Polish, Slovak, Slovene, Greek, Turkish
Improved XML tag filtering
Made ``detect_all`` return child prober confidences
Dropped Python 2.7, 3.4, 3.5 (requires Python 3.6+)
4.0.0 (2020-12-10)
Added ``detect_all()`` function returning all candidate encodings
Converted single-byte charset probers to nested dicts (performance)
``CharsetGroupProber`` now short-circuits on definite matches (performance)
Added ``language`` field to ``detect_all`` output
Dropped Python 2.6, 3.4, 3.5
3.0.4 (2017-06-08)
Fixed packaging issue with ``pytest_runner``
Updated old URLs in README and docs
3.0.3 (2017-05-16)
Fixed crash when debug logging was enabled
3.0.2 (2017-04-12)
Fixed ``detect`` sometimes returning ``None`` instead of a result dict
3.0.1 (2017-04-11)
Fixed crash in EUC-TW prober with certain strings
3.0.0 (2017-04-11)
Added Turkish ISO-8859-9 detection
Modernized naming conventions (``typical_positive_ratio`` instead of ``mTypicalPositiveRatio``)
Added ``language`` property to probers and results
Switched from Travis to GitHub Actions
Fixed ``CharsetGroupProber.state`` not being set to ``FOUND_IT``
2.3.0 (2014-10-07)
Added CP932 detection
Fixed UTF-8 BOM not detected as UTF-8-SIG
Switched ``chardetect`` to use ``argparse``
2.2.1 (2013-12-18)
Fixed missing parenthesis in ``chardetect.py``
2.2.0 (2013-12-16)
First release after merger with charade (Python 3 support)
2.1.1 (2012-10-01)
Bumped version past Mark Pilgrim’s last release
``chardetect`` can now read from stdin (Erik Rose)
Fixed BOM byte strings for UCS-4-2143 and UCS-4-3412 (Toshio Kuratomi)
Restored Mark Pilgrim’s original docs and COPYING file (Toshio Kuratomi)
1.1 (2012-07-27)
Added ``chardetect`` CLI tool (Erik Rose)
Fixed ``utf8prober`` crash when character is out of range (David Cramer)
Cleaned up detection logic to fail gracefully (David Cramer)
Fixed feed encoding errors (David Cramer)
1.0.1 (2008-04-19)
Packaging fix, added egg distributions for Python 2.4 and 2.5 (Mark Pilgrim)
1.0 (2006-12-23)
Initial release: Python 2 port of Mozilla’s universal charset detector (Mark Pilgrim)