Changelog

7.1.0 (2026-03-11)

Features:

Added PEP 263 encoding declaration detection — # -*- coding: ... -*- and # coding=... declarations on lines 1–2 of Python source files are now recognized with confidence 0.95 (#249)
Added chardet.universaldetector backward-compatibility stub so that from chardet.universaldetector import UniversalDetector works with a deprecation warning (#341)
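
As an illustration of the PEP 263 item above, here is a minimal sketch of how a declaration on the first two lines could be located. It uses only the standard library and the pattern published in PEP 263; the function name declared_encoding is hypothetical and this is not chardet's actual implementation. It also skips the interpreter's rule that line 2 only counts when line 1 is blank or a comment.

```python
import re
from typing import Optional

# Pattern taken from PEP 263, applied to bytes because detection runs on raw input.
_CODING_RE = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def declared_encoding(data: bytes) -> Optional[str]:
    """Return the encoding named in a PEP 263 declaration, or None.

    Hypothetical helper: only lines 1 and 2 are inspected, as PEP 263 requires.
    """
    for line in data.splitlines()[:2]:
        match = _CODING_RE.match(line)
        if match:
            return match.group(1).decode("ascii")
    return None


source = b"#!/usr/bin/env python\n# -*- coding: latin-1 -*-\nprint('hi')\n"
print(declared_encoding(source))
```

A detector built this way would still need to confirm that the bytes decode under the declared codec, which may be why the reported confidence is 0.95 rather than 1.0.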

Fixes:

Fixed false UTF-7 detection of ASCII text containing ++ or +word patterns (#332)
Fixed 0.5s startup cost on first detect() call — model norms are now computed during loading instead of lazily iterating 21M entries (#333)
Fixed undocumented encoding name changes between chardet 5.x and 7.0 — detect() now returns chardet 5.x-compatible names by default (#338)
Improved ISO-2022-JP family detection — recognizes ESC sequences for ISO-2022-JP-2004 (JIS X 0213) and ISO-2022-JP-EXT (JIS X 0201 Kana)
Fixed silent truncation of corrupt model data (iter_unpack yielded fewer tuples instead of raising)
Fixed incorrect date in LICENSE

Performance:

5.5x faster first-detect time (~0.42s → ~0.075s) by computing model norms as a side-product of load_models()
~40% faster model parsing via struct.iter_unpack for bulk entry extraction (eliminates ~305K individual unpack calls)
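
The two performance items and the truncation fix can be pictured together with a small sketch. It assumes a made-up record layout of three unsigned bytes (first byte, second byte, weight) and a hypothetical load_model() helper; the real model format and the signature of load_models() are not described in this changelog.

```python
import math
import struct
from typing import Dict, Tuple

# Hypothetical record layout: bigram first byte, bigram second byte, weight.
ENTRY = struct.Struct(">BBB")


def load_model(buf: bytes, count: int) -> Tuple[Dict[Tuple[int, int], int], float]:
    """Parse `count` entries in bulk and return (table, Euclidean norm)."""
    expected = count * ENTRY.size
    # Check the length up front: slicing a short buffer would otherwise just
    # produce fewer tuples without any error.
    if len(buf) < expected:
        raise ValueError(f"model data truncated: need {expected} bytes, got {len(buf)}")
    table: Dict[Tuple[int, int], int] = {}
    sum_of_squares = 0
    # One iter_unpack call replaces one struct.unpack call per entry.
    for first, second, weight in ENTRY.iter_unpack(buf[:expected]):
        table[(first, second)] = weight
        sum_of_squares += weight * weight  # norm accumulated during the same pass
    return table, math.sqrt(sum_of_squares)


table, norm = load_model(bytes([0x61, 0x62, 3, 0x62, 0x63, 4]), count=2)
print(table, norm)
```

Accumulating the sum of squares inside the parsing loop avoids a second pass over the entries on the first detect() call.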

New API parameters:

Added compat_names parameter (default True) to detect(), detect_all(), and UniversalDetector — set to False to get raw Python codec names instead of chardet 5.x/6.x compatible display names
Added prefer_superset parameter (default False), which remaps legacy ISO/subset encodings to their modern Windows/CP superset equivalents (e.g., ASCII → Windows-1252, ISO-8859-1 → Windows-1252). The default will change to True in the next major version (8.0).
Deprecated should_rename_legacy in favor of prefer_superset — a deprecation warning is emitted when used
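
The interaction between prefer_superset and the deprecated should_rename_legacy can be sketched as follows. The remap_name() helper and its table are hypothetical and contain only the two mappings quoted above; chardet's real mapping table and internals may differ.

```python
import warnings
from typing import Optional

# Only the two example mappings given in this changelog; the real table is larger.
_SUPERSETS = {
    "ASCII": "Windows-1252",
    "ISO-8859-1": "Windows-1252",
}


def remap_name(
    name: str,
    *,
    prefer_superset: bool = False,
    should_rename_legacy: Optional[bool] = None,
) -> str:
    """Hypothetical helper showing one way to honor a deprecated alias."""
    if should_rename_legacy is not None:
        warnings.warn(
            "should_rename_legacy is deprecated; use prefer_superset instead",
            DeprecationWarning,
            stacklevel=2,
        )
        prefer_superset = prefer_superset or should_rename_legacy
    if prefer_superset:
        return _SUPERSETS.get(name, name)
    return name


print(remap_name("ISO-8859-1", prefer_superset=True))
```

Using None as the default for the old parameter lets the function tell "not passed" apart from an explicit False, so the warning fires only for callers that still use the old name.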

Improvements:

Switched internal canonical encoding names to Python codec names (e.g., "utf-8" instead of "UTF-8"), with compat_names controlling the public output format. See Usage for the full mapping table.
Added lookup_encoding() to registry for case-insensitive resolution of arbitrary encoding name input to canonical names
Achieved 100% line coverage across all source modules (+31 tests)
Updated benchmark numbers: 98.2% encoding accuracy, 95.2% language accuracy on 2,510 test files
Pinned test-data cloning to chardet release version tags for reproducible builds
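
For readers unfamiliar with Python codec names, the standard library already performs case-insensitive alias resolution of the kind described for lookup_encoding(). The sketch below uses codecs.lookup to show the idea; resolve_codec_name is a hypothetical stand-in, not the registry function itself, whose exact behavior is not documented here.

```python
import codecs
from typing import Optional


def resolve_codec_name(name: str) -> Optional[str]:
    """Map an arbitrary spelling to Python's canonical codec name, or None."""
    try:
        return codecs.lookup(name.strip()).name
    except LookupError:
        return None


for spelling in ("UTF-8", "Latin-1", "windows-1252", "no-such-codec"):
    print(spelling, "->", resolve_codec_name(spelling))
```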

7.0.1 (2026-03-04)

Fixes:

Fixed false UTF-7 detection of SHA-1 git hashes (#324)
Fixed _SINGLE_LANG_MAP missing aliases for single-language encoding lookup (e.g., big5 → big5hkscs)
Fixed PyPy TypeError in UTF-7 codec handling

Improvements:

Retrained bigram models — 24 previously failing test cases now pass
Updated language equivalences for mutual intelligibility (Slovak/Czech, East Slavic + Bulgarian, Malay/Indonesian, Scandinavian languages)

7.0.0 (2026-03-02)

Ground-up, MIT-licensed rewrite of chardet. Same package name, same public API — drop-in replacement for chardet 5.x/6.x.

Highlights:

MIT license (previous versions were LGPL)
96.8% accuracy on 2,179 test files (+2.3pp vs chardet 6.0.0, +7.7pp vs charset-normalizer)
41x faster than chardet 6.0.0 with mypyc (28x pure Python), 7.5x faster than charset-normalizer
Language detection for every result (90.5% accuracy across 49 languages)
99 encodings across six eras (MODERN_WEB, LEGACY_ISO, LEGACY_MAC, LEGACY_REGIONAL, DOS, MAINFRAME)
12-stage detection pipeline — BOM, UTF-16/32 patterns, escape sequences, binary detection, markup charset, ASCII, UTF-8 validation, byte validity, CJK gating, structural probing, statistical scoring, post-processing
Bigram frequency models trained on CulturaX multilingual corpus data for all supported language/encoding pairs
Optional mypyc compilation — 1.49x additional speedup on CPython
Thread-safe detect() and detect_all() with no measurable overhead; scales on free-threaded Python 3.13t+
Negligible import memory (96 B)
Zero runtime dependencies
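
The staged design listed above can be pictured as a chain of functions tried in order, where the first stage that reaches a verdict ends the run. The sketch below implements only three of the twelve stages (BOM, ASCII, UTF-8 validation) with the standard library; the stage names, return values, and run_pipeline() are assumptions made for illustration, not chardet's internal API.

```python
import codecs
from typing import Callable, Optional, Tuple

Stage = Callable[[bytes], Optional[str]]

# UTF-32 BOMs are checked first because the UTF-32 LE BOM begins with the UTF-16 LE BOM.
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def bom_stage(data: bytes) -> Optional[str]:
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


def ascii_stage(data: bytes) -> Optional[str]:
    return "ascii" if data.isascii() else None


def utf8_stage(data: bytes) -> Optional[str]:
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return None
    return "utf-8"


STAGES: Tuple[Stage, ...] = (bom_stage, ascii_stage, utf8_stage)


def run_pipeline(data: bytes) -> Optional[str]:
    """Return the first stage's verdict, or None if no stage decides."""
    for stage in STAGES:
        verdict = stage(data)
        if verdict is not None:
            return verdict
    return None  # a full detector would continue to statistical scoring


print(run_pipeline("héllo".encode("utf-8")))
```

Ordering cheap, decisive checks ahead of statistical scoring means that most modern input is resolved before any model is consulted.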

Breaking changes vs 6.0.0:

detect() and detect_all() now default to encoding_era=EncodingEra.ALL (6.0.0 defaulted to MODERN_WEB)
Internal architecture is completely different (probers replaced by pipeline stages). Only the public API is preserved.
LanguageFilter is accepted but ignored (deprecation warning emitted)
chunk_size is accepted but ignored (deprecation warning emitted)
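
To show what the change of default from MODERN_WEB to ALL means in practice, the sketch below models era filtering with a plain enum.Flag. The Era class mirrors the six era names listed in the highlights, but the class itself, the candidate table, and the era assigned to each encoding are illustrative assumptions rather than chardet's actual data.

```python
from enum import Flag
from typing import Dict, List


class Era(Flag):
    """Hypothetical stand-in for EncodingEra, with explicit bit values."""
    MODERN_WEB = 1
    LEGACY_ISO = 2
    LEGACY_MAC = 4
    LEGACY_REGIONAL = 8
    DOS = 16
    MAINFRAME = 32
    ALL = 63


# Illustrative assignments only; the real registry covers 99 encodings.
CANDIDATE_ERAS: Dict[str, Era] = {
    "utf-8": Era.MODERN_WEB,
    "cp1252": Era.MODERN_WEB,
    "iso8859-1": Era.LEGACY_ISO,
    "mac-roman": Era.LEGACY_MAC,
    "cp437": Era.DOS,
    "cp037": Era.MAINFRAME,
}


def candidates_for(allowed: Era = Era.ALL) -> List[str]:
    """Return the encodings whose era overlaps the allowed set."""
    return [name for name, era in CANDIDATE_ERAS.items() if era & allowed]


print(candidates_for())                # every candidate is considered
print(candidates_for(Era.MODERN_WEB))  # narrower set, as under the 6.0.0 default
```

Callers that relied on the narrower 6.0.0 behavior would need to pass the era explicitly after upgrading.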

6.0.0 (2026-02-22)

Features:

Unified single-byte charset detection with proper language-specific bigram models for all single-byte encodings (replaces Latin1Prober and MacRomanProber heuristics)
38 new languages: Arabic, Belarusian, Breton, Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Farsi, Finnish, French, German, Icelandic, Indonesian, Irish, Italian, Kazakh, Latvian, Lithuanian, Macedonian, Malay, Maltese, Norwegian, Polish, Portuguese, Romanian, Scottish Gaelic, Serbian, Slovak, Slovene, Spanish, Swedish, Tajik, Ukrainian, Vietnamese, Welsh
EncodingEra filtering via new encoding_era parameter
max_bytes and chunk_size parameters for detect(), detect_all(), and UniversalDetector
-e/--encoding-era CLI flag
EBCDIC detection (CP037, CP500)
Direct GB18030 support (replaces redundant GB2312 prober)
Binary file detection
Python 3.12, 3.13, and 3.14 support

Breaking changes:

Dropped Python 3.7, 3.8, and 3.9 (requires Python 3.10+)
Removed Latin1Prober and MacRomanProber
Removed EUC-TW support
Removed LanguageFilter.NONE
detect() default changed to encoding_era=EncodingEra.MODERN_WEB

Fixes:

Fixed CP949 state machine
Fixed SJIS distribution analysis (second-byte range >= 0x80)
Fixed UTF-16/32 detection for non-ASCII-heavy text
Fixed GB18030 char_len_table
Fixed UTF-8 state machine
Fixed detect_all() returning inactive probers
Fixed early cutoff bug

5.2.0 (2023-08-01)

Added support for running the CLI via python -m chardet

5.1.0 (2022-12-01)

Added should_rename_legacy argument to remap legacy encoding names to modern equivalents
Added MacRoman encoding prober
Added --minimal flag to chardetect CLI
Added type annotations and mypy CI
Added support for Python 3.11
Removed support for Python 3.6

5.0.0 (2022-06-25)

Added Johab Korean prober
Added UTF-16/32 BE/LE probers
Added test data for Croatian, Czech, Hungarian, Polish, Slovak, Slovene, Greek, Turkish
Improved XML tag filtering
Made detect_all return child prober confidences
Dropped Python 2.7, 3.4, 3.5 (requires Python 3.6+)

4.0.0 (2020-12-10)

Added detect_all() function returning all candidate encodings
Converted single-byte charset probers to nested dicts (performance)
CharsetGroupProber now short-circuits on definite matches (performance)
Added language field to detect_all output
Dropped Python 2.6, 3.4, 3.5

3.0.4 (2017-06-08)

Fixed packaging issue with pytest_runner
Updated old URLs in README and docs

3.0.3 (2017-05-16)

Fixed crash when debug logging was enabled

3.0.2 (2017-04-12)

Fixed detect sometimes returning None instead of a result dict

3.0.1 (2017-04-11)

Fixed crash in EUC-TW prober with certain strings

3.0.0 (2017-04-11)

Added Turkish ISO-8859-9 detection
Modernized naming conventions (typical_positive_ratio instead of mTypicalPositiveRatio)
Added language property to probers and results
Switched from Travis to GitHub Actions
Fixed CharsetGroupProber.state not being set to FOUND_IT

2.3.0 (2014-10-07)

Added CP932 detection
Fixed UTF-8 BOM not detected as UTF-8-SIG
Switched chardetect to use argparse

2.2.1 (2013-12-18)

Fixed missing parenthesis in chardetect.py

2.2.0 (2013-12-16)

First release after merger with charade (Python 3 support)

2.1.1 (2012-10-01)

Bumped version past Mark Pilgrim’s last release
chardetect can now read from stdin (Erik Rose)
Fixed BOM byte strings for UCS-4-2143 and UCS-4-3412 (Toshio Kuratomi)
Restored Mark Pilgrim’s original docs and COPYING file (Toshio Kuratomi)

1.1 (2012-07-27)

Added chardetect CLI tool (Erik Rose)
Fixed utf8prober crash when character is out of range (David Cramer)
Cleaned up detection logic to fail gracefully (David Cramer)
Fixed feed encoding errors (David Cramer)

1.0.1 (2008-04-19)

Packaging fix, added egg distributions for Python 2.4 and 2.5 (Mark Pilgrim)

1.0 (2006-12-23)

Initial release: Python 2 port of Mozilla’s universal charset detector (Mark Pilgrim)