Changelog ========= .. note:: Entries marked "via Claude" were developed with `Claude Code `_. Dan directed the design, reviewed all output, and takes responsibility for the result. Unmarked entries by Dan were written without AI assistance. 7.4.1 (2026-04-07) ------------------- **Bug Fixes:** - BOM-prefixed UTF-16 and UTF-32 input now reports ``utf-16`` and ``utf-32`` instead of the endian-specific variants. Python's ``utf-16-le``/``utf-16-be``/``utf-32-le``/``utf-32-be`` codecs keep the BOM as a U+FEFF in the decoded string, while ``utf-16``/``utf-32`` strip it, so callers passing the detection result directly to ``.decode()`` were getting a stray BOM at the start of their text. BOM-less UTF-16/32 detection (via null-byte patterns) is unchanged and still returns the endian-specific name. (`Dan Blanchard `_ via Claude, `#364 `_, `#365 `_) 7.4.0 (2026-03-26) ------------------- **Performance:** - Switched to dense zlib-compressed model format (v2): models are now stored as contiguous ``memoryview`` slices of a single decompressed blob, eliminating per-model ``struct.unpack`` overhead. Cold start (import + first detect) dropped from ~75ms to ~13ms with mypyc. (`Dan Blanchard `_ via Claude, `#354 `_) **Accuracy:** - Accuracy improved from 98.6% to 99.3% (2499/2517 files) through a combination of training and scoring improvements: - Eliminated train/test data overlap by content-fingerprinting test suite articles and excluding them from training data (`#351 `_) - Added MADLAD-400 and Wikipedia as supplemental training sources to fill gaps left by exclusion filtering (`#351 `_) - Improved non-ASCII bigram scoring: high-byte bigrams are now preserved during training (instead of being crushed by global normalization), and weighted by per-bigram IDF so encoding-specific byte patterns contribute proportionally to how discriminative they are (`#352 `_) - Added encoding-aware substitution filtering: character substitutions during training now only apply for characters the target encoding cannot represent - Increased training samples from 15K to 25K per language/encoding pair (`Dan Blanchard `_ via Claude) **Bug Fixes:** - Added dedicated structural analyzers for CP932, CP949, and Big5-HKSCS: these superset encodings previously shared their base encoding's byte-range analyzer, missing extended ranges unique to each superset (`Dan Blanchard `_ via Claude, `#353 `_) 7.3.0 (2026-03-24) ------------------- **License:** - **0BSD license** — the project license has been changed from MIT to `0BSD `_, a maximally permissive license with no attribution requirement. All prior 7.x releases should also be considered 0BSD licensed as of this release. (`Dan Blanchard `_ via Claude) **Features:** - Added ``mime_type`` field to detection results — identifies file types for both binary (via magic number matching) and text content. Returned in all ``detect()``, ``detect_all()``, and ``UniversalDetector`` results. (`Dan Blanchard `_ via Claude, `#350 `_) - New ``pipeline/magic.py`` module detects 40+ binary file formats including images, audio/video, archives, documents, executables, and fonts. ZIP-based formats (XLSX, DOCX, JAR, APK, EPUB, wheel, OpenDocument) are distinguished by entry filenames. (`Dan Blanchard `_ via Claude, `#350 `_) **Bug Fixes:** - Fixed incorrect equivalence between UTF-16-LE and UTF-16-BE in accuracy testing — these are distinct encodings with different byte order, not interchangeable (`Dan Blanchard `_ via Claude) **Performance:** - Added 4 new modules to mypyc compilation (orchestrator, confusion, magic, ascii), bringing the total to 11 compiled modules (`Dan Blanchard `_ via Claude) - Capped statistical scoring at 16 KB — bigram models converge quickly, so large files no longer score the full 200 KB. Worst-case detection time dropped from 62ms to 26ms with no accuracy loss. (`Dan Blanchard `_ via Claude) - Replaced ``dataclasses.replace()`` with direct ``DetectionResult`` construction on hot paths, eliminating ~354k function calls per full test suite run (`Dan Blanchard `_ via Claude) **Build:** - Added riscv64 to the mypyc wheel build matrix — prebuilt wheels are now published for RISC-V Linux alongside existing architectures (`Bruno Verachten `_, `#348 `_) 7.2.0 (2026-03-17) ------------------- **Features:** - Added ``include_encodings`` and ``exclude_encodings`` parameters to :func:`~chardet.detect`, :func:`~chardet.detect_all`, and :class:`~chardet.UniversalDetector` — restrict or exclude specific encodings from the candidate set, with corresponding ``-i``/``--include-encodings`` and ``-x``/``--exclude-encodings`` CLI flags (`Dan Blanchard `_ via Claude, `#343 `_) - Added ``no_match_encoding`` (default ``"cp1252"``) and ``empty_input_encoding`` (default ``"utf-8"``) parameters — control which encoding is returned when no candidate survives the pipeline or the input is empty, with corresponding CLI flags (`Dan Blanchard `_ via Claude, `#343 `_) - Added ``-l``/``--language`` flag to ``chardetect`` CLI — shows the detected language (ISO 639-1 code and English name) alongside the encoding (`Dan Blanchard `_ via Claude, `#342 `_) 7.1.0 (2026-03-11) ------------------- **Features:** - Added PEP 263 encoding declaration detection — ``# -*- coding: ... -*-`` and ``# coding=...`` declarations on lines 1–2 of Python source files are now recognized with confidence 0.95 (`Dan Blanchard `_ via Claude, `#249 `_) - Added ``chardet.universaldetector`` backward-compatibility stub so that ``from chardet.universaldetector import UniversalDetector`` works with a deprecation warning (`Dan Blanchard `_ via Claude, `#341 `_) **Fixes:** - Fixed false UTF-7 detection of ASCII text containing ``++`` or ``+word`` patterns (`Dan Blanchard `_, `#332 `_, `#335 `_) - Fixed 0.5s startup cost on first ``detect()`` call — model norms are now computed during loading instead of lazily iterating 21M entries (`Dan Blanchard `_ via Claude, `#333 `_, `#336 `_) - Fixed undocumented encoding name changes between chardet 5.x and 7.0 — ``detect()`` now returns chardet 5.x-compatible names by default (`Dan Blanchard `_ via Claude, `#338 `_) - Improved ISO-2022-JP family detection — recognizes ESC sequences for ISO-2022-JP-2004 (JIS X 0213) and ISO-2022-JP-EXT (JIS X 0201 Kana) (`Dan Blanchard `_ via Claude) - Fixed silent truncation of corrupt model data (``iter_unpack`` yielded fewer tuples instead of raising) (`Dan Blanchard `_ via Claude) - Fixed incorrect date in LICENSE (`Dan Blanchard `_) **Performance:** - 5.5x faster first-detect time (~0.42s → ~0.075s) by computing model norms as a side-product of ``load_models()`` (`Dan Blanchard `_ via Claude) - ~40% faster model parsing via ``struct.iter_unpack`` for bulk entry extraction (eliminates ~305K individual ``unpack`` calls) (`Dan Blanchard `_ via Claude) **New API parameters:** - Added ``compat_names`` parameter (default ``True``) to :func:`~chardet.detect`, :func:`~chardet.detect_all`, and :class:`~chardet.UniversalDetector` — set to ``False`` to get raw Python codec names instead of chardet 5.x/6.x compatible display names (`Dan Blanchard `_ via Claude) - Added ``prefer_superset`` parameter (default ``False``) — remaps legacy ISO/subset encodings to their modern Windows/CP superset equivalents (e.g., ASCII → Windows-1252, ISO-8859-1 → Windows-1252). **This will default to ``True`` in the next major version (8.0).** (`Dan Blanchard `_ via Claude) - Deprecated ``should_rename_legacy`` in favor of ``prefer_superset`` — a deprecation warning is emitted when used (`Dan Blanchard `_ via Claude) **Improvements:** - Switched internal canonical encoding names to Python codec names (e.g., ``"utf-8"`` instead of ``"UTF-8"``), with ``compat_names`` controlling the public output format. See :doc:`usage` for the full mapping table. (`Dan Blanchard `_ via Claude) - Added ``lookup_encoding()`` to ``registry`` for case-insensitive resolution of arbitrary encoding name input to canonical names (`Dan Blanchard `_ via Claude) - Achieved 100% line coverage across all source modules (+31 tests) (`Dan Blanchard `_ via Claude) - Updated benchmark numbers: 98.2% encoding accuracy, 95.2% language accuracy on 2,510 test files (`Dan Blanchard `_ via Claude) - Pinned test-data cloning to chardet release version tags for reproducible builds (`Dan Blanchard `_ via Claude) 7.0.1 (2026-03-04) ------------------- **Fixes:** - Fixed false UTF-7 detection of SHA-1 git hashes (`Alex Rembish `_, `#324 `_) - Fixed ``_SINGLE_LANG_MAP`` missing aliases for single-language encoding lookup (e.g., ``big5`` → ``big5hkscs``) (`Dan Blanchard `_) - Fixed PyPy ``TypeError`` in UTF-7 codec handling (`Dan Blanchard `_) **Improvements:** - Retrained bigram models — 24 previously failing test cases now pass (`Dan Blanchard `_ via Claude) - Updated language equivalences for mutual intelligibility (Slovak/Czech, East Slavic + Bulgarian, Malay/Indonesian, Scandinavian languages) (`Dan Blanchard `_ via Claude) 7.0.0 (2026-03-02) ------------------- Ground-up, 0BSD-licensed rewrite of chardet (`Dan Blanchard `_ via Claude, `#322 `_). Same package name, same public API — drop-in replacement for chardet 5.x/6.x. **Highlights:** - **0BSD license** (previous versions were LGPL) - **96.8% accuracy** on 2,179 test files (+2.3pp vs chardet 6.0.0, +7.7pp vs charset-normalizer) - **41x faster** than chardet 6.0.0 with mypyc (**28x** pure Python), **7.5x faster** than charset-normalizer - **Language detection** for every result (90.5% accuracy across 49 languages) - **99 encodings** across six eras (MODERN_WEB, LEGACY_ISO, LEGACY_MAC, LEGACY_REGIONAL, DOS, MAINFRAME) - **12-stage detection pipeline** — BOM, UTF-16/32 patterns, escape sequences, binary detection, markup charset, ASCII, UTF-8 validation, byte validity, CJK gating, structural probing, statistical scoring, post-processing - **Bigram frequency models** trained on CulturaX multilingual corpus data for all supported language/encoding pairs - **Optional mypyc compilation** — 1.49x additional speedup on CPython - **Thread-safe** ``detect()`` and ``detect_all()`` with no measurable overhead; scales on free-threaded Python 3.13t+ - **Negligible import memory** (96 B) - **Zero runtime dependencies** **Breaking changes vs 6.0.0:** - ``detect()`` and ``detect_all()`` now default to ``encoding_era=EncodingEra.ALL`` (6.0.0 defaulted to ``MODERN_WEB``) - Internal architecture is completely different (probers replaced by pipeline stages). Only the public API is preserved. - ``LanguageFilter`` is accepted but ignored (deprecation warning emitted) - ``chunk_size`` is accepted but ignored (deprecation warning emitted) 6.0.0.post1 (2026-02-22) ------------------------- - Fixed ``__version__`` not being set correctly in the package (`Dan Blanchard `_) 6.0.0 (2026-02-22) ------------------- **Features:** - Unified single-byte charset detection with proper language-specific bigram models for all single-byte encodings (replaces ``Latin1Prober`` and ``MacRomanProber`` heuristics) (`Dan Blanchard `_) - 38 new languages: Arabic, Belarusian, Breton, Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Farsi, Finnish, French, German, Icelandic, Indonesian, Irish, Italian, Kazakh, Latvian, Lithuanian, Macedonian, Malay, Maltese, Norwegian, Polish, Portuguese, Romanian, Scottish Gaelic, Serbian, Slovak, Slovene, Spanish, Swedish, Tajik, Ukrainian, Vietnamese, Welsh (`Dan Blanchard `_) - ``EncodingEra`` filtering via new ``encoding_era`` parameter (`Dan Blanchard `_) - ``max_bytes`` and ``chunk_size`` parameters for ``detect()``, ``detect_all()``, and ``UniversalDetector`` (`Dan Blanchard `_) - ``-e``/``--encoding-era`` CLI flag (`Dan Blanchard `_ via Claude) - EBCDIC detection (CP037, CP500) (`Dan Blanchard `_) - Direct GB18030 support (replaces redundant GB2312 prober) (`Dan Blanchard `_) - Binary file detection (`Dan Blanchard `_) - Python 3.12, 3.13, and 3.14 support (`Hugo van Kemenade `_, `#283 `_) - GitHub Codespaces support (`oxygen dioxide `_, `#312 `_) **Breaking changes:** - Dropped Python 3.7, 3.8, and 3.9 (requires Python 3.10+) - Removed ``Latin1Prober`` and ``MacRomanProber`` - Removed EUC-TW support - Removed ``LanguageFilter.NONE`` - ``detect()`` default changed to ``encoding_era=EncodingEra.MODERN_WEB`` **Fixes:** - Fixed CP949 state machine (`nenw* `_, `#268 `_) - Fixed SJIS distribution analysis (second-byte range >= 0x80) (`Kadir Can Ozden `_, `#315 `_) - Fixed ``max_bytes`` not being passed to ``UniversalDetector`` (`Kadir Can Ozden `_, `#314 `_) - Fixed UTF-16/32 detection for non-ASCII-heavy text (`Dan Blanchard `_) - Fixed GB18030 ``char_len_table`` (`Dan Blanchard `_) - Fixed UTF-8 state machine (`Dan Blanchard `_) - Fixed ``detect_all()`` returning inactive probers (`Dan Blanchard `_) - Fixed early cutoff bug (`Dan Blanchard `_) - Updated LGPLv2.1 license text for remote-only FSF address (`Ben Beasley `_, `#307 `_) 5.2.0 (2023-08-01) ------------------- - Added support for running the CLI via ``python -m chardet`` (`Dan Blanchard `_) 5.1.0 (2022-12-01) ------------------- - Added ``should_rename_legacy`` argument to remap legacy encoding names to modern equivalents (`Dan Blanchard `_, `#264 `_) - Added MacRoman encoding prober (`Elia Robyn Lake `_) - Added ``--minimal`` flag to ``chardetect`` CLI (`Dan Blanchard `_, `#214 `_) - Added type annotations and mypy CI (`Jon Dufresne `_, `#261 `_) - Added support for Python 3.11 (`Hugo van Kemenade `_, `#274 `_) - Added ISO-8859-15 capital letter sharp S handling (`Simon Waldherr `_, `#222 `_) - Clarified LGPL version in license trove classifier (`Ben Beasley `_, `#255 `_) - Removed support for Python 3.6 (`Jon Dufresne `_, `#260 `_) 5.0.0 (2022-06-25) ------------------- - Added Johab Korean prober (`grizlupo `_, `#172 `_, `#207 `_) - Added UTF-16/32 BE/LE probers (`Jason Zavaglia `_, `#109 `_, `#206 `_) - Added test data for Croatian, Czech, Hungarian, Polish, Slovak, Slovene, Greek, Turkish (`Dan Blanchard `_) - Improved XML tag filtering (`Dan Blanchard `_, `#208 `_) - Made ``detect_all`` return child prober confidences (`Dan Blanchard `_, `#210 `_) - Added support for Python 3.10 (`Hugo van Kemenade `_, `#232 `_) - Slight performance increase (`deedy5 `_, `#252 `_) - Dropped Python 2.7, 3.4, 3.5 (requires Python 3.6+) 4.0.0 (2020-12-10) ------------------- - Added ``detect_all()`` function returning all candidate encodings (`Damien `_, `#111 `_) - Converted single-byte charset probers to nested dicts (performance) (`Dan Blanchard `_, `#121 `_) - ``CharsetGroupProber`` now short-circuits on definite matches (performance) (`Dan Blanchard `_, `#203 `_) - Added ``language`` field to ``detect_all`` output (`Dan Blanchard `_) - Switched from Travis to GitHub Actions (`Dan Blanchard `_, `#204 `_) - Dropped Python 2.6, 3.4, 3.5 3.0.4 (2017-06-08) ------------------- - Fixed packaging issue with ``pytest_runner`` (`Zac Medico `_, `#119 `_) - Included ``test.py`` in source distribution (`Zac Medico `_, `#118 `_) - Updated old URLs in README and docs (`Qi Fan `_, `#123 `_; `Jon Dufresne `_, `#129 `_) 3.0.3 (2017-05-16) ------------------- - Fixed crash when debug logging was enabled (`Dan Blanchard `_, `#117 `_) 3.0.2 (2017-04-12) ------------------- - Fixed ``detect`` sometimes returning ``None`` instead of a result dict (`Dan Blanchard `_, `#114 `_) 3.0.1 (2017-04-11) ------------------- - Fixed crash in EUC-TW prober with certain strings (`Dan Blanchard `_) 3.0.0 (2017-04-11) ------------------- - Added Turkish ISO-8859-9 detection (`queeup `_) - Modernized naming conventions (``typical_positive_ratio`` instead of ``mTypicalPositiveRatio``) (`Dan Blanchard `_, `#107 `_) - Added ``language`` property to probers and results (`Dan Blanchard `_, `#108 `_) - Switched from Travis to GitHub Actions (`Dan Blanchard `_) - Fixed ``CharsetGroupProber.state`` not being set to ``FOUND_IT`` (`Dan Blanchard `_) - Added Hypothesis-based fuzz testing (`David R. MacIver `_, `#66 `_) - Don't indicate byte order for UTF-16/32 with given BOM, for compatibility with ``decode()`` (`Sebastian Noack `_, `#73 `_) - Stop reading file immediately when file type is known (`Jason Zavaglia `_, `#103 `_) chardet 2.3.0 (2014-10-07) -------------------------- - Added CP932 detection (`hashy `_) - Fixed UTF-8 BOM not detected as UTF-8-SIG (`atbest `_, `#32 `_) - Switched ``chardetect`` to use ``argparse`` (`Dan Blanchard `_) chardet 2.2.1 (2013-12-18) --------------------------- - Fixed missing parenthesis in ``chardetect.py`` (`Owen `_, `#12 `_) chardet 2.2.0 (2013-12-16) --------------------------- Merged the charade fork back into chardet, unifying Python 2 and Python 3 support under the original package name. - Added CP949 detection (`Kyung-hown Chung `_) - Fixed BOM detection (`Jean Boussier `_) charade 1.0.3 (2013-01-18) --------------------------- - Fixed codecs usage for compatibility (`Ian Cordasco `_) charade 1.0.2 (2013-01-18) --------------------------- - Fixed BOM detection (`Jean Boussier `_) - Improved multibyte sequence handling (`Kyung-hown Chung `_) charade 1.0.1 (2012-12-03) --------------------------- - Version fix (`Ian Cordasco `_) charade 1.0.0 (2012-12-02) --------------------------- - Initial release: Python 3 port of chardet, forked as a separate package (`Ian Cordasco `_) chardet 2.1.1 (2012-10-01) --------------------------- - Bumped version past Mark Pilgrim's last release - ``chardetect`` can now read from stdin (`Erik Rose `_) - Fixed BOM byte strings for UCS-4-2143 and UCS-4-3412 (`Toshio Kuratomi `_) - Restored Mark Pilgrim's original docs and COPYING file (`Toshio Kuratomi `_) chardet 1.1 (2012-07-27) ------------------------- - Added ``chardetect`` CLI tool (`Erik Rose `_) - Fixed ``utf8prober`` crash when character is out of range (`David Cramer `_) - Cleaned up detection logic to fail gracefully (`David Cramer `_) - Fixed feed encoding errors (`David Cramer `_) chardet 1.0.1 (2008-04-19) --------------------------- - Packaging fix, added egg distributions for Python 2.4 and 2.5 (`Mark Pilgrim `_) chardet 1.0 (2006-12-23) ------------------------- - Initial release: Python 2 port of Mozilla's universal charset detector (`Mark Pilgrim `_)