Changelog
=========

.. note::

   Entries marked "via Claude" were developed with `Claude Code `_.
   Dan directed the design, reviewed all output, and takes responsibility
   for the result. Unmarked entries by Dan were written without AI
   assistance.

7.4.3 (2026-04-13)
-------------------

**Bug Fixes:**

- Fixed ``ValueError: embedded null character`` crash when input contained
  a ```` declaration with a null byte in the encoding name (e.g.
  ``b''``). ``codecs.lookup()`` raises ``ValueError`` on embedded nulls,
  and ``lookup_encoding()`` was only catching ``LookupError``. Also added
  defensive ``ValueError`` catches in ``_validate_bytes()`` and
  ``_to_utf8()`` for completeness.
  (`Dan Blanchard `_ via Claude, `#369 `_)

7.4.2 (2026-04-12)
-------------------

**Bug Fixes:**

- Fixed ``RuntimeError: pipeline must always return at least one result``
  on ~2% of all possible two-byte inputs (e.g. ``b"\xf9\x92"``).
  Multi-byte encodings like CP932 and Johab could score above the
  structural confidence threshold on very short inputs, but then
  statistical scoring would return nothing, leaving the pipeline with an
  empty result list instead of falling through to the
  ``no_match_encoding`` fallback.
  (`Jason Barnett `_ via Claude, `#367 `_, `#368 `_)

**Improvements:**

- Added ~90 encoding aliases from the WHATWG Encoding Standard and IANA
  Character Sets registry so that ```` labels like ``x-cp1252``,
  ``x-sjis``, ``dos-874``, ``csUTF8``, and the ``cswindows*`` family all
  resolve correctly through the markup detection stage. Every alias was
  driven by a failing spec-compliance test.
  (`Dan Blanchard `_ via Claude, `#366 `_)
- Added a spec-compliance test suite covering Python decode round-trips
  for all 86 registry encodings, WHATWG web-platform label resolution,
  IANA preferred MIME names, and Unicode/RFC conformance (BOM sniffing,
  UTF-8 boundary cases, UTF-16 surrogate pairs). This is the test suite
  that would have caught the 7.4.1 BOM bug before release.
  (`Dan Blanchard `_ via Claude, `#366 `_)

7.4.1 (2026-04-07)
-------------------

**Bug Fixes:**

- BOM-prefixed UTF-16 and UTF-32 input now reports ``utf-16`` and
  ``utf-32`` instead of the endian-specific variants. Python's
  ``utf-16-le``/``utf-16-be``/``utf-32-le``/``utf-32-be`` codecs keep the
  BOM as a U+FEFF in the decoded string, while ``utf-16``/``utf-32`` strip
  it, so callers passing the detection result directly to ``.decode()``
  were getting a stray BOM at the start of their text. BOM-less UTF-16/32
  detection (via null-byte patterns) is unchanged and still returns the
  endian-specific name.
  (`Dan Blanchard `_ via Claude, `#364 `_, `#365 `_)

7.4.0 (2026-03-26)
-------------------

**Performance:**

- Switched to dense zlib-compressed model format (v2): models are now
  stored as contiguous ``memoryview`` slices of a single decompressed
  blob, eliminating per-model ``struct.unpack`` overhead. Cold start
  (import + first detect) dropped from ~75ms to ~13ms with mypyc.
  (`Dan Blanchard `_ via Claude, `#354 `_)

**Accuracy:**

- Accuracy improved from 98.6% to 99.3% (2499/2517 files) through a
  combination of training and scoring improvements:

  - Eliminated train/test data overlap by content-fingerprinting test
    suite articles and excluding them from training data (`#351 `_)
  - Added MADLAD-400 and Wikipedia as supplemental training sources to
    fill gaps left by exclusion filtering (`#351 `_)
  - Improved non-ASCII bigram scoring: high-byte bigrams are now preserved
    during training (instead of being crushed by global normalization),
    and weighted by per-bigram IDF so encoding-specific byte patterns
    contribute proportionally to how discriminative they are (`#352 `_)
  - Added encoding-aware substitution filtering: character substitutions
    during training now only apply for characters the target encoding
    cannot represent
  - Increased training samples from 15K to 25K per language/encoding pair

  (`Dan Blanchard `_ via Claude)

**Bug Fixes:**

- Added dedicated structural analyzers for CP932, CP949, and Big5-HKSCS:
  these superset encodings previously shared their base encoding's
  byte-range analyzer, missing extended ranges unique to each superset
  (`Dan Blanchard `_ via Claude, `#353 `_)

7.3.0 (2026-03-24)
-------------------

**License:**

- **0BSD license** — the project license has been changed from MIT to
  `0BSD `_, a maximally permissive license with no attribution
  requirement. All prior 7.x releases should also be considered 0BSD
  licensed as of this release.
  (`Dan Blanchard `_ via Claude)

**Features:**

- Added ``mime_type`` field to detection results — identifies file types
  for both binary (via magic number matching) and text content. Returned
  in all ``detect()``, ``detect_all()``, and ``UniversalDetector``
  results.
  (`Dan Blanchard `_ via Claude, `#350 `_)
- New ``pipeline/magic.py`` module detects 40+ binary file formats
  including images, audio/video, archives, documents, executables, and
  fonts. ZIP-based formats (XLSX, DOCX, JAR, APK, EPUB, wheel,
  OpenDocument) are distinguished by entry filenames.
  (`Dan Blanchard `_ via Claude, `#350 `_)

**Bug Fixes:**

- Fixed incorrect equivalence between UTF-16-LE and UTF-16-BE in accuracy
  testing — these are distinct encodings with different byte order, not
  interchangeable
  (`Dan Blanchard `_ via Claude)

**Performance:**

- Added 4 new modules to mypyc compilation (orchestrator, confusion,
  magic, ascii), bringing the total to 11 compiled modules
  (`Dan Blanchard `_ via Claude)
- Capped statistical scoring at 16 KB — bigram models converge quickly, so
  large files no longer score the full 200 KB. Worst-case detection time
  dropped from 62ms to 26ms with no accuracy loss.
  (`Dan Blanchard `_ via Claude)
- Replaced ``dataclasses.replace()`` with direct ``DetectionResult``
  construction on hot paths, eliminating ~354k function calls per full
  test suite run
  (`Dan Blanchard `_ via Claude)

**Build:**

- Added riscv64 to the mypyc wheel build matrix — prebuilt wheels are now
  published for RISC-V Linux alongside existing architectures
  (`Bruno Verachten `_, `#348 `_)

7.2.0 (2026-03-17)
-------------------

**Features:**

- Added ``include_encodings`` and ``exclude_encodings`` parameters to
  :func:`~chardet.detect`, :func:`~chardet.detect_all`, and
  :class:`~chardet.UniversalDetector` — restrict or exclude specific
  encodings from the candidate set, with corresponding
  ``-i``/``--include-encodings`` and ``-x``/``--exclude-encodings`` CLI
  flags
  (`Dan Blanchard `_ via Claude, `#343 `_)
- Added ``no_match_encoding`` (default ``"cp1252"``) and
  ``empty_input_encoding`` (default ``"utf-8"``) parameters — control
  which encoding is returned when no candidate survives the pipeline or
  the input is empty, with corresponding CLI flags
  (`Dan Blanchard `_ via Claude, `#343 `_)
- Added ``-l``/``--language`` flag to ``chardetect`` CLI — shows the
  detected language (ISO 639-1 code and English name) alongside the
  encoding
  (`Dan Blanchard `_ via Claude, `#342 `_)

7.1.0 (2026-03-11)
-------------------

**Features:**

- Added PEP 263 encoding declaration detection —
  ``# -*- coding: ... -*-`` and ``# coding=...`` declarations on lines 1–2
  of Python source files are now recognized with confidence 0.95
  (`Dan Blanchard `_ via Claude, `#249 `_)
- Added ``chardet.universaldetector`` backward-compatibility stub so that
  ``from chardet.universaldetector import UniversalDetector`` works with a
  deprecation warning
  (`Dan Blanchard `_ via Claude, `#341 `_)

**Fixes:**

- Fixed false UTF-7 detection of ASCII text containing ``++`` or
  ``+word`` patterns
  (`Dan Blanchard `_, `#332 `_, `#335 `_)
- Fixed 0.5s startup cost on first ``detect()`` call — model norms are now
  computed during loading instead of lazily iterating 21M entries
  (`Dan Blanchard `_ via Claude, `#333 `_, `#336 `_)
- Fixed undocumented encoding name changes between chardet 5.x and 7.0 —
  ``detect()`` now returns chardet 5.x-compatible names by default
  (`Dan Blanchard `_ via Claude, `#338 `_)
- Improved ISO-2022-JP family detection — recognizes ESC sequences for
  ISO-2022-JP-2004 (JIS X 0213) and ISO-2022-JP-EXT (JIS X 0201 Kana)
  (`Dan Blanchard `_ via Claude)
- Fixed silent truncation of corrupt model data (``iter_unpack`` yielded
  fewer tuples instead of raising)
  (`Dan Blanchard `_ via Claude)
- Fixed incorrect date in LICENSE (`Dan Blanchard `_)

**Performance:**

- 5.5x faster first-detect time (~0.42s → ~0.075s) by computing model
  norms as a side-product of ``load_models()``
  (`Dan Blanchard `_ via Claude)
- ~40% faster model parsing via ``struct.iter_unpack`` for bulk entry
  extraction (eliminates ~305K individual ``unpack`` calls)
  (`Dan Blanchard `_ via Claude)

**New API parameters:**

- Added ``compat_names`` parameter (default ``True``) to
  :func:`~chardet.detect`, :func:`~chardet.detect_all`, and
  :class:`~chardet.UniversalDetector` — set to ``False`` to get raw Python
  codec names instead of chardet 5.x/6.x compatible display names
  (`Dan Blanchard `_ via Claude)
- Added ``prefer_superset`` parameter (default ``False``) — remaps legacy
  ISO/subset encodings to their modern Windows/CP superset equivalents
  (e.g., ASCII → Windows-1252, ISO-8859-1 → Windows-1252).
  **This will default to ``True`` in the next major version (8.0).**
  (`Dan Blanchard `_ via Claude)
- Deprecated ``should_rename_legacy`` in favor of ``prefer_superset`` — a
  deprecation warning is emitted when used
  (`Dan Blanchard `_ via Claude)

**Improvements:**

- Switched internal canonical encoding names to Python codec names (e.g.,
  ``"utf-8"`` instead of ``"UTF-8"``), with ``compat_names`` controlling
  the public output format. See :doc:`usage` for the full mapping table.
  (`Dan Blanchard `_ via Claude)
- Added ``lookup_encoding()`` to ``registry`` for case-insensitive
  resolution of arbitrary encoding name input to canonical names
  (`Dan Blanchard `_ via Claude)
- Achieved 100% line coverage across all source modules (+31 tests)
  (`Dan Blanchard `_ via Claude)
- Updated benchmark numbers: 98.2% encoding accuracy, 95.2% language
  accuracy on 2,510 test files
  (`Dan Blanchard `_ via Claude)
- Pinned test-data cloning to chardet release version tags for
  reproducible builds
  (`Dan Blanchard `_ via Claude)

7.0.1 (2026-03-04)
-------------------

**Fixes:**

- Fixed false UTF-7 detection of SHA-1 git hashes
  (`Alex Rembish `_, `#324 `_)
- Fixed ``_SINGLE_LANG_MAP`` missing aliases for single-language encoding
  lookup (e.g., ``big5`` → ``big5hkscs``)
  (`Dan Blanchard `_)
- Fixed PyPy ``TypeError`` in UTF-7 codec handling (`Dan Blanchard `_)

**Improvements:**

- Retrained bigram models — 24 previously failing test cases now pass
  (`Dan Blanchard `_ via Claude)
- Updated language equivalences for mutual intelligibility (Slovak/Czech,
  East Slavic + Bulgarian, Malay/Indonesian, Scandinavian languages)
  (`Dan Blanchard `_ via Claude)

7.0.0 (2026-03-02)
-------------------

Ground-up, 0BSD-licensed rewrite of chardet
(`Dan Blanchard `_ via Claude, `#322 `_).
Same package name, same public API — drop-in replacement for chardet
5.x/6.x.

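
As an illustration of what "drop-in" means for calling code, here is a minimal sketch. It assumes only that a detector returns a dictionary with an ``encoding`` key, and it treats a missing encoding as possible. ``decode_with_detector``, ``Detection``, and ``fake_detect`` are hypothetical names made up for this example; they are not part of chardet.

```python
from typing import Callable, Optional, TypedDict


class Detection(TypedDict):
    """Result shape assumed by this sketch (not chardet's own type)."""

    encoding: Optional[str]
    confidence: float


def decode_with_detector(
    data: bytes,
    detect: Callable[[bytes], Detection],
    fallback: str = "utf-8",
) -> str:
    """Decode ``data`` using whatever encoding ``detect`` reports."""
    result = detect(data)
    # Fall back when the detector reports no encoding.
    encoding = result["encoding"] or fallback
    # errors="replace" keeps the sketch from raising on a wrong guess.
    return data.decode(encoding, errors="replace")


def fake_detect(data: bytes) -> Detection:
    """Hypothetical stand-in so the example runs without chardet installed."""
    return {"encoding": "cp1252", "confidence": 1.0}


text = decode_with_detector("café".encode("cp1252"), fake_detect)
print(text)
```

With chardet installed, passing ``chardet.detect`` in place of ``fake_detect`` would be the real call; code written this way depends only on the result dictionary, so it should not care which chardet release produced it.
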

**Highlights:**

- **0BSD license** (previous versions were LGPL)
- **96.8% accuracy** on 2,179 test files (+2.3pp vs chardet 6.0.0, +7.7pp
  vs charset-normalizer)
- **41x faster** than chardet 6.0.0 with mypyc (**28x** pure Python),
  **7.5x faster** than charset-normalizer
- **Language detection** for every result (90.5% accuracy across 49
  languages)
- **99 encodings** across six eras (MODERN_WEB, LEGACY_ISO, LEGACY_MAC,
  LEGACY_REGIONAL, DOS, MAINFRAME)
- **12-stage detection pipeline** — BOM, UTF-16/32 patterns, escape
  sequences, binary detection, markup charset, ASCII, UTF-8 validation,
  byte validity, CJK gating, structural probing, statistical scoring,
  post-processing
- **Bigram frequency models** trained on CulturaX multilingual corpus data
  for all supported language/encoding pairs
- **Optional mypyc compilation** — 1.49x additional speedup on CPython
- **Thread-safe** ``detect()`` and ``detect_all()`` with no measurable
  overhead; scales on free-threaded Python 3.13t+
- **Negligible import memory** (96 B)
- **Zero runtime dependencies**

**Breaking changes vs 6.0.0:**

- ``detect()`` and ``detect_all()`` now default to
  ``encoding_era=EncodingEra.ALL`` (6.0.0 defaulted to ``MODERN_WEB``)
- Internal architecture is completely different (probers replaced by
  pipeline stages). Only the public API is preserved.
- ``LanguageFilter`` is accepted but ignored (deprecation warning emitted)
- ``chunk_size`` is accepted but ignored (deprecation warning emitted)

6.0.0.post1 (2026-02-22)
-------------------------

- Fixed ``__version__`` not being set correctly in the package
  (`Dan Blanchard `_)

6.0.0 (2026-02-22)
-------------------

**Features:**

- Unified single-byte charset detection with proper language-specific
  bigram models for all single-byte encodings (replaces ``Latin1Prober``
  and ``MacRomanProber`` heuristics)
  (`Dan Blanchard `_)
- 38 new languages: Arabic, Belarusian, Breton, Croatian, Czech, Danish,
  Dutch, English, Esperanto, Estonian, Farsi, Finnish, French, German,
  Icelandic, Indonesian, Irish, Italian, Kazakh, Latvian, Lithuanian,
  Macedonian, Malay, Maltese, Norwegian, Polish, Portuguese, Romanian,
  Scottish Gaelic, Serbian, Slovak, Slovene, Spanish, Swedish, Tajik,
  Ukrainian, Vietnamese, Welsh
  (`Dan Blanchard `_)
- ``EncodingEra`` filtering via new ``encoding_era`` parameter
  (`Dan Blanchard `_)
- ``max_bytes`` and ``chunk_size`` parameters for ``detect()``,
  ``detect_all()``, and ``UniversalDetector``
  (`Dan Blanchard `_)
- ``-e``/``--encoding-era`` CLI flag (`Dan Blanchard `_ via Claude)
- EBCDIC detection (CP037, CP500) (`Dan Blanchard `_)
- Direct GB18030 support (replaces redundant GB2312 prober)
  (`Dan Blanchard `_)
- Binary file detection (`Dan Blanchard `_)
- Python 3.12, 3.13, and 3.14 support
  (`Hugo van Kemenade `_, `#283 `_)
- GitHub Codespaces support (`oxygen dioxide `_, `#312 `_)

**Breaking changes:**

- Dropped Python 3.7, 3.8, and 3.9 (requires Python 3.10+)
- Removed ``Latin1Prober`` and ``MacRomanProber``
- Removed EUC-TW support
- Removed ``LanguageFilter.NONE``
- ``detect()`` default changed to ``encoding_era=EncodingEra.MODERN_WEB``

**Fixes:**

- Fixed CP949 state machine (`nenw* `_, `#268 `_)
- Fixed SJIS distribution analysis (second-byte range >= 0x80)
  (`Kadir Can Ozden `_, `#315 `_)
- Fixed ``max_bytes`` not being passed to ``UniversalDetector``
  (`Kadir Can Ozden `_, `#314 `_)
- Fixed UTF-16/32 detection for non-ASCII-heavy text (`Dan Blanchard `_)
- Fixed GB18030 ``char_len_table`` (`Dan Blanchard `_)
- Fixed UTF-8 state machine (`Dan Blanchard `_)
- Fixed ``detect_all()`` returning inactive probers (`Dan Blanchard `_)
- Fixed early cutoff bug (`Dan Blanchard `_)
- Updated LGPLv2.1 license text for remote-only FSF address
  (`Ben Beasley `_, `#307 `_)

5.2.0 (2023-08-01)
-------------------

- Added support for running the CLI via ``python -m chardet``
  (`Dan Blanchard `_)

5.1.0 (2022-12-01)
-------------------

- Added ``should_rename_legacy`` argument to remap legacy encoding names
  to modern equivalents (`Dan Blanchard `_, `#264 `_)
- Added MacRoman encoding prober (`Elia Robyn Lake `_)
- Added ``--minimal`` flag to ``chardetect`` CLI
  (`Dan Blanchard `_, `#214 `_)
- Added type annotations and mypy CI (`Jon Dufresne `_, `#261 `_)
- Added support for Python 3.11 (`Hugo van Kemenade `_, `#274 `_)
- Added ISO-8859-15 capital letter sharp S handling
  (`Simon Waldherr `_, `#222 `_)
- Clarified LGPL version in license trove classifier
  (`Ben Beasley `_, `#255 `_)
- Removed support for Python 3.6 (`Jon Dufresne `_, `#260 `_)

5.0.0 (2022-06-25)
-------------------

- Added Johab Korean prober (`grizlupo `_, `#172 `_, `#207 `_)
- Added UTF-16/32 BE/LE probers (`Jason Zavaglia `_, `#109 `_, `#206 `_)
- Added test data for Croatian, Czech, Hungarian, Polish, Slovak, Slovene,
  Greek, Turkish (`Dan Blanchard `_)
- Improved XML tag filtering (`Dan Blanchard `_, `#208 `_)
- Made ``detect_all`` return child prober confidences
  (`Dan Blanchard `_, `#210 `_)
- Added support for Python 3.10 (`Hugo van Kemenade `_, `#232 `_)
- Slight performance increase (`deedy5 `_, `#252 `_)
- Dropped Python 2.7, 3.4, 3.5 (requires Python 3.6+)

4.0.0 (2020-12-10)
-------------------

- Added ``detect_all()`` function returning all candidate encodings
  (`Damien `_, `#111 `_)
- Converted single-byte charset probers to nested dicts (performance)
  (`Dan Blanchard `_, `#121 `_)
- ``CharsetGroupProber`` now short-circuits on definite matches
  (performance) (`Dan Blanchard `_, `#203 `_)
- Added ``language`` field to ``detect_all`` output (`Dan Blanchard `_)
- Switched from Travis to GitHub Actions (`Dan Blanchard `_, `#204 `_)
- Dropped Python 2.6, 3.4, 3.5

3.0.4 (2017-06-08)
-------------------

- Fixed packaging issue with ``pytest_runner``
  (`Zac Medico `_, `#119 `_)
- Included ``test.py`` in source distribution (`Zac Medico `_, `#118 `_)
- Updated old URLs in README and docs
  (`Qi Fan `_, `#123 `_; `Jon Dufresne `_, `#129 `_)

3.0.3 (2017-05-16)
-------------------

- Fixed crash when debug logging was enabled
  (`Dan Blanchard `_, `#117 `_)

3.0.2 (2017-04-12)
-------------------

- Fixed ``detect`` sometimes returning ``None`` instead of a result dict
  (`Dan Blanchard `_, `#114 `_)

3.0.1 (2017-04-11)
-------------------

- Fixed crash in EUC-TW prober with certain strings (`Dan Blanchard `_)

3.0.0 (2017-04-11)
-------------------

- Added Turkish ISO-8859-9 detection (`queeup `_)
- Modernized naming conventions (``typical_positive_ratio`` instead of
  ``mTypicalPositiveRatio``) (`Dan Blanchard `_, `#107 `_)
- Added ``language`` property to probers and results
  (`Dan Blanchard `_, `#108 `_)
- Switched from Travis to GitHub Actions (`Dan Blanchard `_)
- Fixed ``CharsetGroupProber.state`` not being set to ``FOUND_IT``
  (`Dan Blanchard `_)
- Added Hypothesis-based fuzz testing (`David R. MacIver `_, `#66 `_)
- Don't indicate byte order for UTF-16/32 with given BOM, for
  compatibility with ``decode()`` (`Sebastian Noack `_, `#73 `_)
- Stop reading file immediately when file type is known
  (`Jason Zavaglia `_, `#103 `_)

chardet 2.3.0 (2014-10-07)
--------------------------

- Added CP932 detection (`hashy `_)
- Fixed UTF-8 BOM not detected as UTF-8-SIG (`atbest `_, `#32 `_)
- Switched ``chardetect`` to use ``argparse`` (`Dan Blanchard `_)

chardet 2.2.1 (2013-12-18)
---------------------------

- Fixed missing parenthesis in ``chardetect.py`` (`Owen `_, `#12 `_)

chardet 2.2.0 (2013-12-16)
---------------------------

Merged the charade fork back into chardet, unifying Python 2 and Python 3
support under the original package name.

- Added CP949 detection (`Kyung-hown Chung `_)
- Fixed BOM detection (`Jean Boussier `_)

charade 1.0.3 (2013-01-18)
---------------------------

- Fixed codecs usage for compatibility (`Ian Cordasco `_)

charade 1.0.2 (2013-01-18)
---------------------------

- Fixed BOM detection (`Jean Boussier `_)
- Improved multibyte sequence handling (`Kyung-hown Chung `_)

charade 1.0.1 (2012-12-03)
---------------------------

- Version fix (`Ian Cordasco `_)

charade 1.0.0 (2012-12-02)
---------------------------

- Initial release: Python 3 port of chardet, forked as a separate package
  (`Ian Cordasco `_)

chardet 2.1.1 (2012-10-01)
---------------------------

- Bumped version past Mark Pilgrim's last release
- ``chardetect`` can now read from stdin (`Erik Rose `_)
- Fixed BOM byte strings for UCS-4-2143 and UCS-4-3412
  (`Toshio Kuratomi `_)
- Restored Mark Pilgrim's original docs and COPYING file
  (`Toshio Kuratomi `_)

chardet 1.1 (2012-07-27)
-------------------------

- Added ``chardetect`` CLI tool (`Erik Rose `_)
- Fixed ``utf8prober`` crash when character is out of range
  (`David Cramer `_)
- Cleaned up detection logic to fail gracefully (`David Cramer `_)
- Fixed feed encoding errors (`David Cramer `_)

chardet 1.0.1 (2008-04-19)
---------------------------

- Packaging fix, added egg distributions for Python 2.4 and 2.5
  (`Mark Pilgrim `_)

chardet 1.0 (2006-12-23)
-------------------------

- Initial release: Python 2 port of Mozilla's universal charset detector
  (`Mark Pilgrim `_)