Changelog

7.3.0 (2026-03-24)

License:

  • 0BSD license — the project license has been changed from MIT to 0BSD, a maximally permissive license with no attribution requirement. All prior 7.x releases should also be considered 0BSD licensed as of this release. (Dan Blanchard)

Features:

  • Added mime_type field to detection results — identifies file types for both binary (via magic number matching) and text content. Returned in all detect(), detect_all(), and UniversalDetector results. (Dan Blanchard, #350)

  • New pipeline/magic.py module detects 40+ binary file formats including images, audio/video, archives, documents, executables, and fonts. ZIP-based formats (XLSX, DOCX, JAR, APK, EPUB, wheel, OpenDocument) are distinguished by entry filenames. (Dan Blanchard, #350)

Bug Fixes:

  • Fixed incorrect equivalence between UTF-16-LE and UTF-16-BE in accuracy testing — these are distinct encodings with different byte order, not interchangeable (Dan Blanchard)
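
To see why the two byte orders cannot be treated as equivalent, the following standard-library snippet decodes the same bytes both ways. It is only an illustration and does not touch chardet itself.

```python
# "Hi" encoded as UTF-16 little-endian: each code unit is low byte, high byte.
raw = "Hi".encode("utf-16-le")

as_le = raw.decode("utf-16-le")  # the intended text
as_be = raw.decode("utf-16-be")  # same bytes, opposite byte order

# The big-endian reading yields two unrelated CJK code points.
print(repr(as_le), repr(as_be))
```

Both decodes succeed without error, which is exactly why an accuracy test that accepts either answer would hide real misdetections.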

Performance:

  • Added 4 new modules to mypyc compilation (orchestrator, confusion, magic, ascii), bringing the total to 11 compiled modules (Dan Blanchard)

  • Capped statistical scoring at 16 KB — bigram models converge quickly, so large files no longer score the full 200 KB. Worst-case detection time dropped from 62ms to 26ms with no accuracy loss. (Dan Blanchard)

  • Replaced dataclasses.replace() with direct DetectionResult construction on hot paths, eliminating ~354k function calls per full test suite run (Dan Blanchard)
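
The last item above trades dataclasses.replace() for direct construction. A minimal sketch of the two equivalent forms follows; the Result class and its fields are a hypothetical stand-in, since the actual DetectionResult definition is not shown in this changelog.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Result:  # hypothetical stand-in for DetectionResult
    encoding: str
    confidence: float
    language: str


base = Result("utf-8", 0.5, "en")

# Generic copy-with-changes: convenient, but goes through extra
# function calls and field introspection on every use.
via_replace = replace(base, confidence=0.9)

# Direct construction: more verbose, but a single constructor call.
via_direct = Result(base.encoding, 0.9, base.language)
```

Both forms produce equal objects and leave the original untouched, so the choice only matters on code paths that run often enough for the call overhead to add up.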

Build:

  • Added riscv64 to the mypyc wheel build matrix — prebuilt wheels are now published for RISC-V Linux alongside existing architectures (Bruno Verachten, #348)

7.2.0 (2026-03-17)

Features:

  • Added include_encodings and exclude_encodings parameters to detect(), detect_all(), and UniversalDetector — restrict or exclude specific encodings from the candidate set, with corresponding -i/--include-encodings and -x/--exclude-encodings CLI flags (Dan Blanchard, #343)

  • Added no_match_encoding (default "cp1252") and empty_input_encoding (default "utf-8") parameters — control which encoding is returned when no candidate survives the pipeline or the input is empty, with corresponding CLI flags (Dan Blanchard, #343)

  • Added -l/--language flag to chardetect CLI — shows the detected language (ISO 639-1 code and English name) alongside the encoding (Dan Blanchard, #342)
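
The interaction between the include/exclude filters and the two fallback parameters can be sketched as below. The function names and the candidate list are hypothetical; only the parameter names and their defaults ("cp1252" and "utf-8") come from the entries above, and the real pipeline is certainly more involved.

```python
from typing import Iterable, List, Optional


def filter_candidates(
    candidates: Iterable[str],
    include: Optional[Iterable[str]] = None,
    exclude: Optional[Iterable[str]] = None,
) -> List[str]:
    """Keep candidates allowed by include (if given) and not in exclude."""
    include_set = None if include is None else {n.lower() for n in include}
    exclude_set = {n.lower() for n in (exclude or ())}
    return [
        c
        for c in candidates
        if (include_set is None or c.lower() in include_set)
        and c.lower() not in exclude_set
    ]


def pick(
    data: bytes,
    candidates: Iterable[str],
    *,
    include_encodings: Optional[Iterable[str]] = None,
    exclude_encodings: Optional[Iterable[str]] = None,
    no_match_encoding: str = "cp1252",
    empty_input_encoding: str = "utf-8",
) -> str:
    """Hypothetical selection step: first surviving candidate or a fallback."""
    if not data:
        return empty_input_encoding
    survivors = filter_candidates(candidates, include_encodings, exclude_encodings)
    return survivors[0] if survivors else no_match_encoding
```

In this sketch the comparison is case-insensitive, which is an assumption; the changelog does not say how names passed to the filters are normalized.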

7.1.0 (2026-03-11)

Features:

  • Added PEP 263 encoding declaration detection — # -*- coding: ... -*- and # coding=... declarations on lines 1–2 of Python source files are now recognized with confidence 0.95 (Dan Blanchard, #249)

  • Added chardet.universaldetector backward-compatibility stub so that from chardet.universaldetector import UniversalDetector works with a deprecation warning (Dan Blanchard, #341)
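
For readers unfamiliar with PEP 263, here is a minimal sketch of finding such a declaration in the first two lines of a byte string. The regular expression follows the pattern given in PEP 263; the declared_encoding helper is hypothetical and is not chardet's implementation.

```python
import re
from typing import Optional

# Pattern from PEP 263, applied to bytes because the encoding is not yet known.
_CODING_RE = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def declared_encoding(data: bytes) -> Optional[str]:
    """Return the encoding named on line 1 or 2, or None if there is none."""
    for line in data.splitlines()[:2]:
        match = _CODING_RE.match(line)
        if match:
            return match.group(1).decode("ascii")
    return None
```

The sketch returns the name exactly as written in the file; mapping it to a canonical codec name and attaching a confidence value would be separate steps.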

Fixes:

  • Fixed false UTF-7 detection of ASCII text containing ++ or +word patterns (Dan Blanchard, #332, #335)

  • Fixed 0.5s startup cost on first detect() call — model norms are now computed during loading instead of lazily iterating 21M entries (Dan Blanchard, #333, #336)

  • Fixed undocumented encoding name changes between chardet 5.x and 7.0 — detect() now returns chardet 5.x-compatible names by default (Dan Blanchard, #338)

  • Improved ISO-2022-JP family detection — recognizes ESC sequences for ISO-2022-JP-2004 (JIS X 0213) and ISO-2022-JP-EXT (JIS X 0201 Kana) (Dan Blanchard)

  • Fixed silent truncation of corrupt model data (iter_unpack yielded fewer tuples instead of raising) (Dan Blanchard)

  • Fixed incorrect date in LICENSE (Dan Blanchard)

Performance:

  • 5.5x faster first-detect time (~0.42s → ~0.075s) by computing model norms as a side-product of load_models() (Dan Blanchard)

  • ~40% faster model parsing via struct.iter_unpack for bulk entry extraction (eliminates ~305K individual unpack calls) (Dan Blanchard)
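
The bulk extraction mentioned above, together with the truncation problem listed under Fixes, can be illustrated with a small sketch. The record layout (a 4-byte count followed by 3-byte entries) and the parse_entries name are assumptions for illustration; the real model file format is not described in this changelog.

```python
import struct
from typing import List, Tuple

_HEADER = struct.Struct("<I")   # hypothetical: number of entries that follow
_ENTRY = struct.Struct("<BBB")  # hypothetical: two byte values and a weight


def parse_entries(blob: bytes) -> List[Tuple[int, int, int]]:
    """Decode all entries in one pass and verify the declared count."""
    try:
        (count,) = _HEADER.unpack_from(blob, 0)
        # iter_unpack decodes every record without a Python-level call per
        # entry, but it raises only when the data is not a whole number of
        # records.
        entries = list(_ENTRY.iter_unpack(blob[_HEADER.size:]))
    except struct.error as exc:
        raise ValueError("model data is malformed") from exc
    # Data cut off exactly at a record boundary parses cleanly, so the
    # count has to be checked explicitly.
    if len(entries) != count:
        raise ValueError(f"expected {count} entries, found {len(entries)}")
    return entries
```

The explicit count check is the important part: without it, a file truncated on a record boundary would silently load with fewer entries.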

New API parameters:

  • Added compat_names parameter (default True) to detect(), detect_all(), and UniversalDetector — set to False to get raw Python codec names instead of chardet 5.x/6.x compatible display names (Dan Blanchard)

  • Added prefer_superset parameter (default False): remaps legacy ISO/subset encodings to their modern Windows/CP superset equivalents (e.g., ASCII → Windows-1252, ISO-8859-1 → Windows-1252). This will default to True in the next major version (8.0). (Dan Blanchard)

  • Deprecated should_rename_legacy in favor of prefer_superset — a deprecation warning is emitted when used (Dan Blanchard)
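
A rough sketch of what a superset remapping amounts to is shown below. The mapping contains only the two pairs named above, and the apply_superset helper is hypothetical; the full table used by chardet is not reproduced here.

```python
# Only the two remappings named in the entry above; the real table is larger.
_SUPERSETS = {
    "ascii": "Windows-1252",
    "iso-8859-1": "Windows-1252",
}


def apply_superset(name: str, prefer_superset: bool = False) -> str:
    """Return the superset name when requested, otherwise the name unchanged."""
    if not prefer_superset:
        return name
    return _SUPERSETS.get(name.lower(), name)


# Why the remapping is safe for decoding: every ASCII byte, and every
# ISO-8859-1 byte from 0xA0 upward, decodes to the same character in cp1252.
ascii_bytes = bytes(range(0x80))
high_bytes = bytes(range(0xA0, 0x100))
same_ascii = ascii_bytes.decode("ascii") == ascii_bytes.decode("cp1252")
same_high = high_bytes.decode("latin-1") == high_bytes.decode("cp1252")
```

The two encodings differ only in the 0x80 to 0x9F range, where ISO-8859-1 has control codes and Windows-1252 has printable characters.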

Improvements:

  • Switched internal canonical encoding names to Python codec names (e.g., "utf-8" instead of "UTF-8"), with compat_names controlling the public output format. See Usage for the full mapping table. (Dan Blanchard)

  • Added lookup_encoding() to registry for case-insensitive resolution of arbitrary encoding name input to canonical names (Dan Blanchard)

  • Achieved 100% line coverage across all source modules (+31 tests) (Dan Blanchard)

  • Updated benchmark numbers: 98.2% encoding accuracy, 95.2% language accuracy on 2,510 test files (Dan Blanchard)

  • Pinned test-data cloning to chardet release version tags for reproducible builds (Dan Blanchard)
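
Case-insensitive resolution of encoding names to Python codec names, as mentioned in the lookup_encoding() item above, can be approximated with the standard library alone. The canonical_codec_name helper below is a hypothetical stand-in built on codecs.lookup(); it is not chardet's registry function and may resolve some names differently.

```python
import codecs
from typing import Optional


def canonical_codec_name(name: str) -> Optional[str]:
    """Resolve an arbitrary spelling to Python's codec name, or None."""
    try:
        # codecs.lookup() normalizes case and common separators itself.
        return codecs.lookup(name.strip()).name
    except LookupError:
        return None
```

For example, canonical_codec_name("UTF-8") and canonical_codec_name("utf8") resolve to the same codec name, while an unknown name returns None instead of raising.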

7.0.1 (2026-03-04)

Fixes:

  • Fixed false UTF-7 detection of SHA-1 git hashes (Alex Rembish, #324)

  • Fixed _SINGLE_LANG_MAP missing aliases for single-language encoding lookup (e.g., big5 → big5hkscs) (Dan Blanchard)

  • Fixed PyPy TypeError in UTF-7 codec handling (Dan Blanchard)

Improvements:

  • Retrained bigram models — 24 previously failing test cases now pass (Dan Blanchard)

  • Updated language equivalences for mutual intelligibility (Slovak/Czech, East Slavic + Bulgarian, Malay/Indonesian, Scandinavian languages) (Dan Blanchard)

7.0.0 (2026-03-02)

Ground-up, MIT-licensed rewrite of chardet (Dan Blanchard, #322). Same package name, same public API: a drop-in replacement for chardet 5.x/6.x.

Highlights:

  • MIT license (previous versions were LGPL)

  • 96.8% accuracy on 2,179 test files (+2.3pp vs chardet 6.0.0, +7.7pp vs charset-normalizer)

  • 41x faster than chardet 6.0.0 with mypyc (28x pure Python), 7.5x faster than charset-normalizer

  • Language detection for every result (90.5% accuracy across 49 languages)

  • 99 encodings across six eras (MODERN_WEB, LEGACY_ISO, LEGACY_MAC, LEGACY_REGIONAL, DOS, MAINFRAME)

  • 12-stage detection pipeline — BOM, UTF-16/32 patterns, escape sequences, binary detection, markup charset, ASCII, UTF-8 validation, byte validity, CJK gating, structural probing, statistical scoring, post-processing

  • Bigram frequency models trained on CulturaX multilingual corpus data for all supported language/encoding pairs

  • Optional mypyc compilation — 1.49x additional speedup on CPython

  • Thread-safe detect() and detect_all() with no measurable overhead; scales on free-threaded Python 3.13t+

  • Negligible import memory (96 B)

  • Zero runtime dependencies
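
The staged design listed in the highlights can be pictured as a chain of checks where the first stage with a definite answer wins. The sketch below implements only three simplified stages (BOM, ASCII, UTF-8 validation); the stage functions, their order, and the cp1252 fallback are assumptions for illustration, not the actual 12-stage pipeline.

```python
import codecs
from typing import Callable, List, Optional

Stage = Callable[[bytes], Optional[str]]

# UTF-32 BOMs are checked before UTF-16 because the UTF-32-LE BOM
# begins with the same two bytes as the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def bom_stage(data: bytes) -> Optional[str]:
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


def ascii_stage(data: bytes) -> Optional[str]:
    return "ascii" if data.isascii() else None


def utf8_stage(data: bytes) -> Optional[str]:
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return None
    return "utf-8"


STAGES: List[Stage] = [bom_stage, ascii_stage, utf8_stage]


def run_pipeline(data: bytes, fallback: str = "cp1252") -> str:
    """Return the first definite answer from the stages, else the fallback."""
    for stage in STAGES:
        result = stage(data)
        if result is not None:
            return result
    return fallback
```

Cheap, high-certainty checks run first so that most inputs never reach the expensive statistical stages, which is the general motivation for ordering a pipeline this way.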

Breaking changes vs 6.0.0:

  • detect() and detect_all() now default to encoding_era=EncodingEra.ALL (6.0.0 defaulted to MODERN_WEB)

  • Internal architecture is completely different (probers replaced by pipeline stages). Only the public API is preserved.

  • LanguageFilter is accepted but ignored (deprecation warning emitted)

  • chunk_size is accepted but ignored (deprecation warning emitted)
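
The "accepted but ignored" behavior in the last two items follows a common compatibility pattern: keep the parameter in the signature, warn when it is used, and otherwise do nothing with it. A minimal sketch, using a hypothetical detect_sketch function rather than chardet's own code:

```python
import warnings
from typing import Dict, Optional


def detect_sketch(data: bytes, chunk_size: Optional[int] = None) -> Dict[str, Optional[str]]:
    """Hypothetical detector that tolerates a retired keyword argument."""
    if chunk_size is not None:
        warnings.warn(
            "chunk_size is accepted for compatibility but has no effect",
            DeprecationWarning,
            stacklevel=2,  # point the warning at the caller's line
        )
    # Placeholder result; real detection logic is out of scope here.
    return {"encoding": "ascii" if data.isascii() else None}
```

Existing callers keep working unchanged, and those who run with warnings enabled learn that the argument can be dropped.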

6.0.0.post1 (2026-02-22)

  • Fixed __version__ not being set correctly in the package (Dan Blanchard)

6.0.0 (2026-02-22)

Features:

  • Unified single-byte charset detection with proper language-specific bigram models for all single-byte encodings (replaces Latin1Prober and MacRomanProber heuristics) (Dan Blanchard)

  • 38 new languages: Arabic, Belarusian, Breton, Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Farsi, Finnish, French, German, Icelandic, Indonesian, Irish, Italian, Kazakh, Latvian, Lithuanian, Macedonian, Malay, Maltese, Norwegian, Polish, Portuguese, Romanian, Scottish Gaelic, Serbian, Slovak, Slovene, Spanish, Swedish, Tajik, Ukrainian, Vietnamese, Welsh (Dan Blanchard)

  • EncodingEra filtering via new encoding_era parameter (Dan Blanchard)

  • max_bytes and chunk_size parameters for detect(), detect_all(), and UniversalDetector (Dan Blanchard)

  • -e/--encoding-era CLI flag (Dan Blanchard)

  • EBCDIC detection (CP037, CP500) (Dan Blanchard)

  • Direct GB18030 support (replaces redundant GB2312 prober) (Dan Blanchard)

  • Binary file detection (Dan Blanchard)

  • Python 3.12, 3.13, and 3.14 support (Hugo van Kemenade, #283)

  • GitHub Codespaces support (oxygen dioxide, #312)

Breaking changes:

  • Dropped Python 3.7, 3.8, and 3.9 (requires Python 3.10+)

  • Removed Latin1Prober and MacRomanProber

  • Removed EUC-TW support

  • Removed LanguageFilter.NONE

  • detect() default changed to encoding_era=EncodingEra.MODERN_WEB

Fixes:

5.2.0 (2023-08-01)

  • Added support for running the CLI via python -m chardet (Dan Blanchard)

5.1.0 (2022-12-01)

5.0.0 (2022-06-25)

4.0.0 (2020-12-10)

  • Added detect_all() function returning all candidate encodings (Damien, #111)

  • Converted single-byte charset probers to nested dicts (performance) (Dan Blanchard, #121)

  • CharsetGroupProber now short-circuits on definite matches (performance) (Dan Blanchard, #203)

  • Added language field to detect_all output (Dan Blanchard)

  • Switched from Travis to GitHub Actions (Dan Blanchard, #204)

  • Dropped Python 2.6, 3.4, 3.5

3.0.4 (2017-06-08)

3.0.3 (2017-05-16)

3.0.2 (2017-04-12)

  • Fixed detect sometimes returning None instead of a result dict (Dan Blanchard, #114)

3.0.1 (2017-04-11)

  • Fixed crash in EUC-TW prober with certain strings (Dan Blanchard)

3.0.0 (2017-04-11)

chardet 2.3.0 (2014-10-07)

  • Added CP932 detection (hashy)

  • Fixed UTF-8 BOM not detected as UTF-8-SIG (atbest, #32)

  • Switched chardetect to use argparse (Dan Blanchard)

chardet 2.2.1 (2013-12-18)

  • Fixed missing parenthesis in chardetect.py (Owen, #12)

chardet 2.2.0 (2013-12-16)

Merged the charade fork back into chardet, unifying Python 2 and Python 3 support under the original package name.

charade 1.0.3 (2013-01-18)

charade 1.0.2 (2013-01-18)

charade 1.0.1 (2012-12-03)

charade 1.0.0 (2012-12-02)

  • Initial release: Python 3 port of chardet, forked as a separate package (Ian Cordasco)

chardet 2.1.1 (2012-10-01)

  • Bumped version past Mark Pilgrim’s last release

  • chardetect can now read from stdin (Erik Rose)

  • Fixed BOM byte strings for UCS-4-2143 and UCS-4-3412 (Toshio Kuratomi)

  • Restored Mark Pilgrim’s original docs and COPYING file (Toshio Kuratomi)

chardet 1.1 (2012-07-27)

chardet 1.0.1 (2008-04-19)

  • Packaging fix, added egg distributions for Python 2.4 and 2.5 (Mark Pilgrim)

chardet 1.0 (2006-12-23)

  • Initial release: Python 2 port of Mozilla’s universal charset detector (Mark Pilgrim)