Changelog
=========

7.1.0 (2026-03-11)
-------------------

**Features:**

- Added PEP 263 encoding declaration detection — ``# -*- coding: ... -*-``
  and ``# coding=...`` declarations on lines 1–2 of Python source files are
  now recognized with confidence 0.95 (`#249
  <https://github.com/chardet/chardet/issues/249>`_)
- Added ``chardet.universaldetector`` backward-compatibility stub so that
  ``from chardet.universaldetector import UniversalDetector`` works with a
  deprecation warning (`#341
  <https://github.com/chardet/chardet/issues/341>`_)
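
As a rough illustration of the PEP 263 check described above, the sketch
below scans the first two lines of a byte string with the regular expression
given in PEP 263. The helper name ``find_pep263_encoding`` is hypothetical
and is not part of chardet; the real detection stage may differ in detail
(for example, it also assigns the 0.95 confidence).

.. code-block:: python

   import re
   from typing import Optional

   # Regular expression from PEP 263, applied to bytes.
   _CODING_RE = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


   def find_pep263_encoding(data: bytes) -> Optional[str]:
       """Return the encoding declared on lines 1-2, or None if absent."""
       for line in data.splitlines()[:2]:
           match = _CODING_RE.match(line)
           if match:
               return match.group(1).decode("ascii")
       return None


   source = b"#!/usr/bin/env python\n# -*- coding: latin-1 -*-\nprint('hi')\n"
   declared = find_pep263_encoding(source)

A declaration on line 3 or later is ignored by this sketch, matching the
"lines 1–2" restriction above.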

**Fixes:**

- Fixed false UTF-7 detection of ASCII text containing ``++`` or ``+word``
  patterns (`#332 <https://github.com/chardet/chardet/issues/332>`_)
- Fixed 0.5s startup cost on first ``detect()`` call — model norms are now
  computed during loading instead of lazily iterating 21M entries (`#333
  <https://github.com/chardet/chardet/issues/333>`_)
- Fixed undocumented encoding name changes between chardet 5.x and 7.0 —
  ``detect()`` now returns chardet 5.x-compatible names by default (`#338
  <https://github.com/chardet/chardet/issues/338>`_)
- Improved ISO-2022-JP family detection — recognizes ESC sequences for
  ISO-2022-JP-2004 (JIS X 0213) and ISO-2022-JP-EXT (JIS X 0201 Kana)
- Fixed silent truncation of corrupt model data (``iter_unpack`` yielded
  fewer tuples instead of raising)
- Fixed incorrect date in LICENSE
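
The truncation fix above can be pictured with a small sketch.
``struct.iter_unpack`` raises only when the buffer length is not a multiple
of the record size, so a buffer that is short by whole records parses
without error. The layout below (a 4-byte count followed by 3-byte records)
and the name ``parse_entries`` are assumptions made for illustration, not
chardet's actual model format.

.. code-block:: python

   import struct

   # Hypothetical record layout: two byte values plus a weight.
   _ENTRY = struct.Struct(">BBB")


   def parse_entries(blob: bytes) -> list:
       """Parse a count-prefixed blob, raising if records are missing."""
       (declared,) = struct.unpack_from(">I", blob, 0)
       payload = blob[4:]
       expected = declared * _ENTRY.size
       if len(payload) != expected:
           raise ValueError(
               f"corrupt model data: expected {expected} bytes, got {len(payload)}"
           )
       return list(_ENTRY.iter_unpack(payload))


   good = struct.pack(">I", 2) + bytes([1, 2, 3, 4, 5, 6])
   entries = parse_entries(good)

Dropping the last three bytes of ``good`` leaves one complete record, which
``iter_unpack`` alone would accept; the explicit length check turns that
case into an error.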

**Performance:**

- 5.5x faster first-detect time (~0.42s → ~0.075s) by computing model
  norms as a side-product of ``load_models()``
- ~40% faster model parsing via ``struct.iter_unpack`` for bulk entry
  extraction (eliminates ~305K individual ``unpack`` calls)
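
A minimal sketch of the idea behind both items, assuming a hypothetical
3-byte record layout and assuming that a "norm" is the Euclidean norm of the
weights (neither is confirmed by this changelog): a single
``iter_unpack`` pass builds the lookup table and accumulates the sum of
squares, so no second pass over the entries is needed.

.. code-block:: python

   import math
   import struct

   # Hypothetical record layout: first byte, second byte, weight.
   _ENTRY = struct.Struct(">BBB")


   def load_model(payload: bytes):
       """Return (weights, norm), computing the norm while parsing."""
       weights = {}
       sum_of_squares = 0
       for first, second, weight in _ENTRY.iter_unpack(payload):
           weights[(first, second)] = weight
           sum_of_squares += weight * weight
       return weights, math.sqrt(sum_of_squares)


   weights, norm = load_model(bytes([0x61, 0x62, 3, 0x62, 0x63, 4]))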

**New API parameters:**

- Added ``compat_names`` parameter (default ``True``) to
  :func:`~chardet.detect`, :func:`~chardet.detect_all`, and
  :class:`~chardet.UniversalDetector` — set to ``False`` to get raw Python
  codec names instead of chardet 5.x/6.x compatible display names
- Added ``prefer_superset`` parameter (default ``False``) — remaps legacy
  ISO/subset encodings to their modern Windows/CP superset equivalents
  (e.g., ASCII → Windows-1252, ISO-8859-1 → Windows-1252).
  **This will default to** ``True`` **in the next major version (8.0).**
- Deprecated ``should_rename_legacy`` in favor of ``prefer_superset`` —
  a deprecation warning is emitted when used
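
To show why a superset remap can be useful, the standard-library sketch
below decodes the same bytes as ISO-8859-1 and as Windows-1252. The mapping
table and the ``to_superset`` helper are hypothetical and cover only the two
examples named above; they do not reproduce chardet's actual table.

.. code-block:: python

   # Hypothetical subset-to-superset table using Python codec names.
   _SUPERSETS = {"ascii": "cp1252", "iso-8859-1": "cp1252"}


   def to_superset(name: str) -> str:
       """Return the superset codec for a name, or the name unchanged."""
       return _SUPERSETS.get(name, name)


   data = b"\x93quoted\x94"

   # ISO-8859-1 maps 0x93 and 0x94 to invisible C1 control characters.
   as_latin1 = data.decode("iso-8859-1")

   # Windows-1252 maps the same bytes to curly quotation marks.
   as_cp1252 = data.decode(to_superset("iso-8859-1"))

Pure ASCII input decodes identically under either codec, so for such data
the remap changes only the reported name.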

**Improvements:**

- Switched internal canonical encoding names to Python codec names
  (e.g., ``"utf-8"`` instead of ``"UTF-8"``), with ``compat_names``
  controlling the public output format.  See :doc:`usage` for the full
  mapping table.
- Added ``lookup_encoding()`` to ``registry`` for case-insensitive
  resolution of arbitrary encoding name input to canonical names
- Achieved 100% line coverage across all source modules (+31 tests)
- Updated benchmark numbers: 98.2% encoding accuracy, 95.2% language
  accuracy on 2,510 test files
- Pinned test-data cloning to chardet release version tags for
  reproducible builds
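
Case-insensitive name resolution of the kind ``lookup_encoding()`` provides
can be approximated with the standard library alone. The sketch below is a
stand-in built on ``codecs.lookup``; the function name
``resolve_codec_name`` is hypothetical, and the real ``registry``
implementation may accept more aliases or return different canonical names.

.. code-block:: python

   import codecs
   from typing import Optional


   def resolve_codec_name(name: str) -> Optional[str]:
       """Map an arbitrary encoding label to Python's codec name."""
       try:
           return codecs.lookup(name.strip()).name
       except LookupError:
           return None


   canonical = resolve_codec_name("UTF8")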

7.0.1 (2026-03-04)
-------------------

**Fixes:**

- Fixed false UTF-7 detection of SHA-1 git hashes (`#324
  <https://github.com/chardet/chardet/issues/324>`_)
- Fixed ``_SINGLE_LANG_MAP`` missing aliases for single-language encoding
  lookup (e.g., ``big5`` → ``big5hkscs``)
- Fixed PyPy ``TypeError`` in UTF-7 codec handling

**Improvements:**

- Retrained bigram models — 24 previously failing test cases now pass
- Updated language equivalences for mutual intelligibility (Slovak/Czech,
  East Slavic + Bulgarian, Malay/Indonesian, Scandinavian languages)

7.0.0 (2026-03-02)
-------------------

Ground-up, MIT-licensed rewrite of chardet. Same package name, same
public API — drop-in replacement for chardet 5.x/6.x.

**Highlights:**

- **MIT license** (previous versions were LGPL)
- **96.8% accuracy** on 2,179 test files (+2.3pp vs chardet 6.0.0,
  +7.7pp vs charset-normalizer)
- **41x faster** than chardet 6.0.0 with mypyc (**28x** pure Python),
  **7.5x faster** than charset-normalizer
- **Language detection** for every result (90.5% accuracy across 49
  languages)
- **99 encodings** across six eras (MODERN_WEB, LEGACY_ISO, LEGACY_MAC,
  LEGACY_REGIONAL, DOS, MAINFRAME)
- **12-stage detection pipeline** — BOM, UTF-16/32 patterns, escape
  sequences, binary detection, markup charset, ASCII, UTF-8 validation,
  byte validity, CJK gating, structural probing, statistical scoring,
  post-processing
- **Bigram frequency models** trained on CulturaX multilingual corpus
  data for all supported language/encoding pairs
- **Optional mypyc compilation** — 1.49x additional speedup on CPython
- **Thread-safe** ``detect()`` and ``detect_all()`` with no measurable
  overhead; scales on free-threaded Python 3.13t+
- **Negligible import memory** (96 B)
- **Zero runtime dependencies**
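
For readers unfamiliar with staged detection, here is a minimal sketch of
what a first, BOM-only stage could look like using only the standard
library. The function name ``detect_bom`` and the returned codec names are
assumptions for illustration; chardet's own stage may be structured
differently.

.. code-block:: python

   import codecs
   from typing import Optional

   # Longer BOMs come first: the UTF-32-LE BOM starts with the UTF-16-LE BOM.
   _BOMS = (
       (codecs.BOM_UTF32_LE, "utf-32"),
       (codecs.BOM_UTF32_BE, "utf-32"),
       (codecs.BOM_UTF8, "utf-8-sig"),
       (codecs.BOM_UTF16_LE, "utf-16"),
       (codecs.BOM_UTF16_BE, "utf-16"),
   )


   def detect_bom(data: bytes) -> Optional[str]:
       """Return a BOM-aware codec name, or None so later stages can run."""
       for bom, name in _BOMS:
           if data.startswith(bom):
               return name
       return None


   sample = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
   codec = detect_bom(sample)

Returning ``None`` rather than a default lets a pipeline fall through to the
next stage when no BOM is present.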

**Breaking changes vs 6.0.0:**

- ``detect()`` and ``detect_all()`` now default to
  ``encoding_era=EncodingEra.ALL`` (6.0.0 defaulted to ``MODERN_WEB``)
- Internal architecture is completely different (probers replaced by
  pipeline stages). Only the public API is preserved.
- ``LanguageFilter`` is accepted but ignored (deprecation warning
  emitted)
- ``chunk_size`` is accepted but ignored (deprecation warning emitted)
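
The "accepted but ignored" behaviour follows a common compatibility
pattern, sketched below with a hypothetical ``detect_stub`` function. It is
not chardet's code and returns a placeholder result; it shows only how a
retired parameter can stay in the signature while emitting a
``DeprecationWarning``.

.. code-block:: python

   import warnings


   def detect_stub(data: bytes, chunk_size=None) -> dict:
       """Placeholder detector that tolerates a retired parameter."""
       if chunk_size is not None:
           warnings.warn(
               "chunk_size is accepted for compatibility but ignored",
               DeprecationWarning,
               stacklevel=2,
           )
       # Placeholder result; a real detector would inspect ``data``.
       return {"encoding": None, "confidence": 0.0, "language": None}


   with warnings.catch_warnings(record=True) as caught:
       warnings.simplefilter("always")
       detect_stub(b"example", chunk_size=1024)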

6.0.0 (2026-02-22)
-------------------

**Features:**

- Unified single-byte charset detection with proper language-specific
  bigram models for all single-byte encodings (replaces ``Latin1Prober``
  and ``MacRomanProber`` heuristics)
- 38 new languages: Arabic, Belarusian, Breton, Croatian, Czech, Danish,
  Dutch, English, Esperanto, Estonian, Farsi, Finnish, French, German,
  Icelandic, Indonesian, Irish, Italian, Kazakh, Latvian, Lithuanian,
  Macedonian, Malay, Maltese, Norwegian, Polish, Portuguese, Romanian,
  Scottish Gaelic, Serbian, Slovak, Slovene, Spanish, Swedish, Tajik,
  Ukrainian, Vietnamese, Welsh
- ``EncodingEra`` filtering via new ``encoding_era`` parameter
- ``max_bytes`` and ``chunk_size`` parameters for ``detect()``,
  ``detect_all()``, and ``UniversalDetector``
- ``-e``/``--encoding-era`` CLI flag
- EBCDIC detection (CP037, CP500)
- Direct GB18030 support (replaces redundant GB2312 prober)
- Binary file detection
- Python 3.12, 3.13, and 3.14 support
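
The GB18030 item can be illustrated with Python's built-in codecs: text that
fits in GB2312 produces the same bytes under GB18030, while GB18030 also
covers characters GB2312 cannot represent. The ``encodable`` helper is
introduced here for illustration only.

.. code-block:: python

   def encodable(text: str, codec: str) -> bool:
       """Return True if ``text`` can be encoded with ``codec``."""
       try:
           text.encode(codec)
       except UnicodeEncodeError:
           return False
       return True


   text = "\u4e2d\u6587"  # two common Chinese characters
   gb2312_bytes = text.encode("gb2312")

   # GB2312-encoded data is also valid GB18030 with the same meaning.
   same_bytes = gb2312_bytes == text.encode("gb18030")
   decoded = gb2312_bytes.decode("gb18030")

   emoji = "\U0001F600"  # outside GB2312, representable in GB18030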

**Breaking changes:**

- Dropped Python 3.7, 3.8, and 3.9 (requires Python 3.10+)
- Removed ``Latin1Prober`` and ``MacRomanProber``
- Removed EUC-TW support
- Removed ``LanguageFilter.NONE``
- ``detect()`` default changed to ``encoding_era=EncodingEra.MODERN_WEB``

**Fixes:**

- Fixed CP949 state machine
- Fixed SJIS distribution analysis (second-byte range >= 0x80)
- Fixed UTF-16/32 detection for non-ASCII-heavy text
- Fixed GB18030 ``char_len_table``
- Fixed UTF-8 state machine
- Fixed ``detect_all()`` returning inactive probers
- Fixed early cutoff bug

5.2.0 (2023-08-01)
-------------------

- Added support for running the CLI via ``python -m chardet``

5.1.0 (2022-12-01)
-------------------

- Added ``should_rename_legacy`` argument to remap legacy encoding names
  to modern equivalents
- Added MacRoman encoding prober
- Added ``--minimal`` flag to ``chardetect`` CLI
- Added type annotations and mypy CI
- Added support for Python 3.11
- Removed support for Python 3.6

5.0.0 (2022-06-25)
-------------------

- Added Johab Korean prober
- Added UTF-16/32 BE/LE probers
- Added test data for Croatian, Czech, Hungarian, Polish, Slovak,
  Slovene, Greek, Turkish
- Improved XML tag filtering
- Made ``detect_all`` return child prober confidences
- Dropped Python 2.7, 3.4, 3.5 (requires Python 3.6+)

4.0.0 (2020-12-10)
-------------------

- Added ``detect_all()`` function returning all candidate encodings
- Converted single-byte charset probers to nested dicts (performance)
- ``CharsetGroupProber`` now short-circuits on definite matches
  (performance)
- Added ``language`` field to ``detect_all`` output
- Dropped Python 2.6, 3.4, 3.5

3.0.4 (2017-06-08)
-------------------

- Fixed packaging issue with ``pytest_runner``
- Updated old URLs in README and docs

3.0.3 (2017-05-16)
-------------------

- Fixed crash when debug logging was enabled

3.0.2 (2017-04-12)
-------------------

- Fixed ``detect`` sometimes returning ``None`` instead of a result dict

3.0.1 (2017-04-11)
-------------------

- Fixed crash in EUC-TW prober with certain strings

3.0.0 (2017-04-11)
-------------------

- Added Turkish ISO-8859-9 detection
- Modernized naming conventions (``typical_positive_ratio`` instead of
  ``mTypicalPositiveRatio``)
- Added ``language`` property to probers and results
- Switched from Travis to GitHub Actions
- Fixed ``CharsetGroupProber.state`` not being set to ``FOUND_IT``

2.3.0 (2014-10-07)
-------------------

- Added CP932 detection
- Fixed UTF-8 BOM not detected as UTF-8-SIG
- Switched ``chardetect`` to use ``argparse``
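
The practical difference between the two names in the BOM fix shows up when
decoding. This standard-library sketch is independent of chardet and only
demonstrates how Python's ``utf-8`` and ``utf-8-sig`` codecs treat a leading
BOM.

.. code-block:: python

   import codecs

   data = codecs.BOM_UTF8 + "caf\u00e9".encode("utf-8")

   # The plain codec keeps the BOM as a U+FEFF character in the text.
   with_plain = data.decode("utf-8")

   # The -sig variant strips the BOM during decoding.
   with_sig = data.decode("utf-8-sig")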

2.2.1 (2013-12-18)
-------------------

- Fixed missing parenthesis in ``chardetect.py``

2.2.0 (2013-12-16)
-------------------

- First release after merger with charade (Python 3 support)

2.1.1 (2012-10-01)
-------------------

- Bumped version past Mark Pilgrim's last release
- ``chardetect`` can now read from stdin (Erik Rose)
- Fixed BOM byte strings for UCS-4-2143 and UCS-4-3412 (Toshio Kuratomi)
- Restored Mark Pilgrim's original docs and COPYING file (Toshio Kuratomi)

1.1 (2012-07-27)
-----------------

- Added ``chardetect`` CLI tool (Erik Rose)
- Fixed ``utf8prober`` crash when character is out of range (David Cramer)
- Cleaned up detection logic to fail gracefully (David Cramer)
- Fixed feed encoding errors (David Cramer)

1.0.1 (2008-04-19)
-------------------

- Packaging fix, added egg distributions for Python 2.4 and 2.5
  (Mark Pilgrim)

1.0 (2006-12-23)
-----------------

- Initial release: Python 2 port of Mozilla's universal charset detector
  (Mark Pilgrim)