Changelog
=========

7.1.0 (2026-03-11)
-------------------

**Features:**

- Added PEP 263 encoding declaration detection — ``# -*- coding: ... -*-``
  and ``# coding=...`` declarations on lines 1–2 of Python source files are
  now recognized with confidence 0.95 (#249)
- Added ``chardet.universaldetector`` backward-compatibility stub so that
  ``from chardet.universaldetector import UniversalDetector`` works with a
  deprecation warning (#341)

**Fixes:**

- Fixed false UTF-7 detection of ASCII text containing ``++`` or ``+word``
  patterns (#332)
- Fixed 0.5s startup cost on first ``detect()`` call — model norms are now
  computed during loading instead of lazily iterating 21M entries (#333)
- Fixed undocumented encoding name changes between chardet 5.x and 7.0 —
  ``detect()`` now returns chardet 5.x-compatible names by default (#338)
- Improved ISO-2022-JP family detection — recognizes ESC sequences for
  ISO-2022-JP-2004 (JIS X 0213) and ISO-2022-JP-EXT (JIS X 0201 Kana)
- Fixed silent truncation of corrupt model data (``iter_unpack`` yielded
  fewer tuples instead of raising)
- Fixed incorrect date in LICENSE

**Performance:**

- 5.5x faster first-detect time (~0.42s → ~0.075s) by computing model norms
  as a side-product of ``load_models()``
- ~40% faster model parsing via ``struct.iter_unpack`` for bulk entry
  extraction (eliminates ~305K individual ``unpack`` calls)

**New API parameters:**

- Added ``compat_names`` parameter (default ``True``) to
  :func:`~chardet.detect`, :func:`~chardet.detect_all`, and
  :class:`~chardet.UniversalDetector` — set to ``False`` to get raw Python
  codec names instead of chardet 5.x/6.x compatible display names
- Added ``prefer_superset`` parameter (default ``False``) — remaps legacy
  ISO/subset encodings to their modern Windows/CP superset equivalents
  (e.g., ASCII → Windows-1252, ISO-8859-1 → Windows-1252).
  **This will default to ``True`` in the next major version (8.0).**
- Deprecated ``should_rename_legacy`` in favor of ``prefer_superset`` — a
  deprecation warning is emitted when used

**Improvements:**

- Switched internal canonical encoding names to Python codec names (e.g.,
  ``"utf-8"`` instead of ``"UTF-8"``), with ``compat_names`` controlling the
  public output format. See :doc:`usage` for the full mapping table.
- Added ``lookup_encoding()`` to ``registry`` for case-insensitive
  resolution of arbitrary encoding name input to canonical names
- Achieved 100% line coverage across all source modules (+31 tests)
- Updated benchmark numbers: 98.2% encoding accuracy, 95.2% language
  accuracy on 2,510 test files
- Pinned test-data cloning to chardet release version tags for reproducible
  builds

7.0.1 (2026-03-04)
-------------------

**Fixes:**

- Fixed false UTF-7 detection of SHA-1 git hashes (#324)
- Fixed ``_SINGLE_LANG_MAP`` missing aliases for single-language encoding
  lookup (e.g., ``big5`` → ``big5hkscs``)
- Fixed PyPy ``TypeError`` in UTF-7 codec handling

**Improvements:**

- Retrained bigram models — 24 previously failing test cases now pass
- Updated language equivalences for mutual intelligibility (Slovak/Czech,
  East Slavic + Bulgarian, Malay/Indonesian, Scandinavian languages)

7.0.0 (2026-03-02)
-------------------

Ground-up, MIT-licensed rewrite of chardet. Same package name, same public
API — drop-in replacement for chardet 5.x/6.x.


**Highlights:**

- **MIT license** (previous versions were LGPL)
- **96.8% accuracy** on 2,179 test files (+2.3pp vs chardet 6.0.0, +7.7pp
  vs charset-normalizer)
- **41x faster** than chardet 6.0.0 with mypyc (**28x** pure Python),
  **7.5x faster** than charset-normalizer
- **Language detection** for every result (90.5% accuracy across 49
  languages)
- **99 encodings** across six eras (MODERN_WEB, LEGACY_ISO, LEGACY_MAC,
  LEGACY_REGIONAL, DOS, MAINFRAME)
- **12-stage detection pipeline** — BOM, UTF-16/32 patterns, escape
  sequences, binary detection, markup charset, ASCII, UTF-8 validation,
  byte validity, CJK gating, structural probing, statistical scoring,
  post-processing
- **Bigram frequency models** trained on CulturaX multilingual corpus data
  for all supported language/encoding pairs
- **Optional mypyc compilation** — 1.49x additional speedup on CPython
- **Thread-safe** ``detect()`` and ``detect_all()`` with no measurable
  overhead; scales on free-threaded Python 3.13t+
- **Negligible import memory** (96 B)
- **Zero runtime dependencies**

**Breaking changes vs 6.0.0:**

- ``detect()`` and ``detect_all()`` now default to
  ``encoding_era=EncodingEra.ALL`` (6.0.0 defaulted to ``MODERN_WEB``)
- Internal architecture is completely different (probers replaced by
  pipeline stages). Only the public API is preserved.
- ``LanguageFilter`` is accepted but ignored (deprecation warning emitted)
- ``chunk_size`` is accepted but ignored (deprecation warning emitted)

6.0.0 (2026-02-22)
-------------------

**Features:**

- Unified single-byte charset detection with proper language-specific
  bigram models for all single-byte encodings (replaces ``Latin1Prober``
  and ``MacRomanProber`` heuristics)
- 38 new languages: Arabic, Belarusian, Breton, Croatian, Czech, Danish,
  Dutch, English, Esperanto, Estonian, Farsi, Finnish, French, German,
  Icelandic, Indonesian, Irish, Italian, Kazakh, Latvian, Lithuanian,
  Macedonian, Malay, Maltese, Norwegian, Polish, Portuguese, Romanian,
  Scottish Gaelic, Serbian, Slovak, Slovene, Spanish, Swedish, Tajik,
  Ukrainian, Vietnamese, Welsh
- ``EncodingEra`` filtering via new ``encoding_era`` parameter
- ``max_bytes`` and ``chunk_size`` parameters for ``detect()``,
  ``detect_all()``, and ``UniversalDetector``
- ``-e``/``--encoding-era`` CLI flag
- EBCDIC detection (CP037, CP500)
- Direct GB18030 support (replaces redundant GB2312 prober)
- Binary file detection
- Python 3.12, 3.13, and 3.14 support

**Breaking changes:**

- Dropped Python 3.7, 3.8, and 3.9 (requires Python 3.10+)
- Removed ``Latin1Prober`` and ``MacRomanProber``
- Removed EUC-TW support
- Removed ``LanguageFilter.NONE``
- ``detect()`` default changed to ``encoding_era=EncodingEra.MODERN_WEB``

**Fixes:**

- Fixed CP949 state machine
- Fixed SJIS distribution analysis (second-byte range >= 0x80)
- Fixed UTF-16/32 detection for non-ASCII-heavy text
- Fixed GB18030 ``char_len_table``
- Fixed UTF-8 state machine
- Fixed ``detect_all()`` returning inactive probers
- Fixed early cutoff bug

5.2.0 (2023-08-01)
-------------------

- Added support for running the CLI via ``python -m chardet``

5.1.0 (2022-12-01)
-------------------

- Added ``should_rename_legacy`` argument to remap legacy encoding names to
  modern equivalents
- Added MacRoman encoding prober
- Added ``--minimal`` flag to ``chardetect`` CLI
- Added type annotations and mypy CI
- Added support for Python 3.11
- Removed support for Python 3.6

5.0.0 (2022-06-25)
-------------------

- Added Johab Korean prober
- Added UTF-16/32 BE/LE probers
- Added test data for Croatian, Czech, Hungarian, Polish, Slovak, Slovene,
  Greek, Turkish
- Improved XML tag filtering
- Made ``detect_all`` return child prober confidences
- Dropped Python 2.7, 3.4, 3.5 (requires Python 3.6+)

4.0.0 (2020-12-10)
-------------------

- Added ``detect_all()`` function returning all candidate encodings
- Converted single-byte charset probers to nested dicts (performance)
- ``CharsetGroupProber`` now short-circuits on definite matches
  (performance)
- Added ``language`` field to ``detect_all`` output
- Dropped Python 2.6, 3.4, 3.5

3.0.4 (2017-06-08)
-------------------

- Fixed packaging issue with ``pytest_runner``
- Updated old URLs in README and docs

3.0.3 (2017-05-16)
-------------------

- Fixed crash when debug logging was enabled

3.0.2 (2017-04-12)
-------------------

- Fixed ``detect`` sometimes returning ``None`` instead of a result dict

3.0.1 (2017-04-11)
-------------------

- Fixed crash in EUC-TW prober with certain strings

3.0.0 (2017-04-11)
-------------------

- Added Turkish ISO-8859-9 detection
- Modernized naming conventions (``typical_positive_ratio`` instead of
  ``mTypicalPositiveRatio``)
- Added ``language`` property to probers and results
- Switched from Travis to GitHub Actions
- Fixed ``CharsetGroupProber.state`` not being set to ``FOUND_IT``

2.3.0 (2014-10-07)
-------------------

- Added CP932 detection
- Fixed UTF-8 BOM not detected as UTF-8-SIG
- Switched ``chardetect`` to use ``argparse``

2.2.1 (2013-12-18)
-------------------

- Fixed missing parenthesis in ``chardetect.py``

2.2.0 (2013-12-16)
-------------------

- First release after merger with charade (Python 3 support)

2.1.1 (2012-10-01)
-------------------

- Bumped version past Mark Pilgrim's last release
- ``chardetect`` can now read from stdin (Erik Rose)
- Fixed BOM byte strings for UCS-4-2143 and UCS-4-3412 (Toshio Kuratomi)
- Restored Mark Pilgrim's original docs and COPYING file (Toshio Kuratomi)

1.1 (2012-07-27)
-----------------

- Added ``chardetect`` CLI tool (Erik Rose)
- Fixed ``utf8prober`` crash when character is out of range (David Cramer)
- Cleaned up detection logic to fail gracefully (David Cramer)
- Fixed feed encoding errors (David Cramer)

1.0.1 (2008-04-19)
-------------------

- Packaging fix, added egg distributions for Python 2.4 and 2.5
  (Mark Pilgrim)

1.0 (2006-12-23)
-----------------

- Initial release: Python 2 port of Mozilla's universal charset detector
  (Mark Pilgrim)