Changelog¶

Note

Entries marked “via Claude” were developed with Claude Code. Dan directed the design, reviewed all output, and takes responsibility for the result. Unmarked entries by Dan were written without AI assistance.

7.4.2 (2026-04-12)¶

Bug Fixes:

Fixed RuntimeError: pipeline must always return at least one result on ~2% of all possible two-byte inputs (e.g. b"\xf9\x92"). Multi-byte encodings like CP932 and Johab could score above the structural confidence threshold on very short inputs, but then statistical scoring would return nothing, leaving the pipeline with an empty result list instead of falling through to the no_match_encoding fallback. (Jason Barnett via Claude, #367, #368)

Improvements:

Added ~90 encoding aliases from the WHATWG Encoding Standard and IANA Character Sets registry so that <meta charset> labels like x-cp1252, x-sjis, dos-874, csUTF8, and the cswindows* family all resolve correctly through the markup detection stage. Every alias was driven by a failing spec-compliance test. (Dan Blanchard via Claude, #366)
Added a spec-compliance test suite covering Python decode round-trips for all 86 registry encodings, WHATWG web-platform label resolution, IANA preferred MIME names, and Unicode/RFC conformance (BOM sniffing, UTF-8 boundary cases, UTF-16 surrogate pairs). This is the test suite that would have caught the 7.4.1 BOM bug before release. (Dan Blanchard via Claude, #366)

7.4.1 (2026-04-07)¶

Bug Fixes:

BOM-prefixed UTF-16 and UTF-32 input now reports utf-16 and utf-32 instead of the endian-specific variants. Python’s utf-16-le/utf-16-be/utf-32-le/utf-32-be codecs keep the BOM as a U+FEFF in the decoded string, while utf-16/utf-32 strip it, so callers passing the detection result directly to .decode() were getting a stray BOM at the start of their text. BOM-less UTF-16/32 detection (via null-byte patterns) is unchanged and still returns the endian-specific name. (Dan Blanchard via Claude, #364, #365)

7.4.0 (2026-03-26)¶

Performance:

Switched to dense zlib-compressed model format (v2): models are now stored as contiguous memoryview slices of a single decompressed blob, eliminating per-model struct.unpack overhead. Cold start (import + first detect) dropped from ~75ms to ~13ms with mypyc. (Dan Blanchard via Claude, #354)

Accuracy:

Accuracy improved from 98.6% to 99.3% (2499/2517 files) through a combination of training and scoring improvements:
- Eliminated train/test data overlap by content-fingerprinting test suite articles and excluding them from training data (#351)
- Added MADLAD-400 and Wikipedia as supplemental training sources to fill gaps left by exclusion filtering (#351)
- Improved non-ASCII bigram scoring: high-byte bigrams are now preserved during training (instead of being crushed by global normalization), and weighted by per-bigram IDF so encoding-specific byte patterns contribute proportionally to how discriminative they are (#352)
- Added encoding-aware substitution filtering: character substitutions during training now only apply for characters the target encoding cannot represent
- Increased training samples from 15K to 25K per language/encoding pair (Dan Blanchard via Claude)

Bug Fixes:

Added dedicated structural analyzers for CP932, CP949, and Big5-HKSCS: these superset encodings previously shared their base encoding’s byte-range analyzer, missing extended ranges unique to each superset (Dan Blanchard via Claude, #353)

7.3.0 (2026-03-24)¶

License:

0BSD license — the project license has been changed from MIT to 0BSD, a maximally permissive license with no attribution requirement. All prior 7.x releases should also be considered 0BSD licensed as of this release. (Dan Blanchard via Claude)

Features:

Added mime_type field to detection results — identifies file types for both binary (via magic number matching) and text content. Returned in all detect(), detect_all(), and UniversalDetector results. (Dan Blanchard via Claude, #350)
New pipeline/magic.py module detects 40+ binary file formats including images, audio/video, archives, documents, executables, and fonts. ZIP-based formats (XLSX, DOCX, JAR, APK, EPUB, wheel, OpenDocument) are distinguished by entry filenames. (Dan Blanchard via Claude, #350)

Bug Fixes:

Fixed incorrect equivalence between UTF-16-LE and UTF-16-BE in accuracy testing — these are distinct encodings with different byte order, not interchangeable (Dan Blanchard via Claude)

Performance:

Added 4 new modules to mypyc compilation (orchestrator, confusion, magic, ascii), bringing the total to 11 compiled modules (Dan Blanchard via Claude)
Capped statistical scoring at 16 KB — bigram models converge quickly, so large files no longer score the full 200 KB. Worst-case detection time dropped from 62ms to 26ms with no accuracy loss. (Dan Blanchard via Claude)
Replaced dataclasses.replace() with direct DetectionResult construction on hot paths, eliminating ~354k function calls per full test suite run (Dan Blanchard via Claude)

Build:

Added riscv64 to the mypyc wheel build matrix — prebuilt wheels are now published for RISC-V Linux alongside existing architectures (Bruno Verachten, #348)

7.2.0 (2026-03-17)¶

Features:

Added include_encodings and exclude_encodings parameters to detect(), detect_all(), and UniversalDetector — restrict or exclude specific encodings from the candidate set, with corresponding -i/--include-encodings and -x/--exclude-encodings CLI flags (Dan Blanchard via Claude, #343)
Added no_match_encoding (default "cp1252") and empty_input_encoding (default "utf-8") parameters — control which encoding is returned when no candidate survives the pipeline or the input is empty, with corresponding CLI flags (Dan Blanchard via Claude, #343)
Added -l/--language flag to chardetect CLI — shows the detected language (ISO 639-1 code and English name) alongside the encoding (Dan Blanchard via Claude, #342)

7.1.0 (2026-03-11)¶

Features:

Added PEP 263 encoding declaration detection — # -*- coding: ... -*- and # coding=... declarations on lines 1–2 of Python source files are now recognized with confidence 0.95 (Dan Blanchard via Claude, #249)
Added chardet.universaldetector backward-compatibility stub so that from chardet.universaldetector import UniversalDetector works with a deprecation warning (Dan Blanchard via Claude, #341)

Fixes:

Fixed false UTF-7 detection of ASCII text containing ++ or +word patterns (Dan Blanchard, #332, #335)
Fixed 0.5s startup cost on first detect() call — model norms are now computed during loading instead of lazily iterating 21M entries (Dan Blanchard via Claude, #333, #336)
Fixed undocumented encoding name changes between chardet 5.x and 7.0 — detect() now returns chardet 5.x-compatible names by default (Dan Blanchard via Claude, #338)
Improved ISO-2022-JP family detection — recognizes ESC sequences for ISO-2022-JP-2004 (JIS X 0213) and ISO-2022-JP-EXT (JIS X 0201 Kana) (Dan Blanchard via Claude)
Fixed silent truncation of corrupt model data (iter_unpack yielded fewer tuples instead of raising) (Dan Blanchard via Claude)
Fixed incorrect date in LICENSE (Dan Blanchard)

Performance:

5.5x faster first-detect time (~0.42s → ~0.075s) by computing model norms as a side-product of load_models() (Dan Blanchard via Claude)
~40% faster model parsing via struct.iter_unpack for bulk entry extraction (eliminates ~305K individual unpack calls) (Dan Blanchard via Claude)

New API parameters:

Added compat_names parameter (default True) to detect(), detect_all(), and UniversalDetector — set to False to get raw Python codec names instead of chardet 5.x/6.x compatible display names (Dan Blanchard via Claude)
Added prefer_superset parameter (default False) — remaps legacy ISO/subset encodings to their modern Windows/CP superset equivalents (e.g., ASCII → Windows-1252, ISO-8859-1 → Windows-1252). This will default to ``True`` in the next major version (8.0). (Dan Blanchard via Claude)
Deprecated should_rename_legacy in favor of prefer_superset — a deprecation warning is emitted when used (Dan Blanchard via Claude)

Improvements:

Switched internal canonical encoding names to Python codec names (e.g., "utf-8" instead of "UTF-8"), with compat_names controlling the public output format. See Usage for the full mapping table. (Dan Blanchard via Claude)
Added lookup_encoding() to registry for case-insensitive resolution of arbitrary encoding name input to canonical names (Dan Blanchard via Claude)
Achieved 100% line coverage across all source modules (+31 tests) (Dan Blanchard via Claude)
Updated benchmark numbers: 98.2% encoding accuracy, 95.2% language accuracy on 2,510 test files (Dan Blanchard via Claude)
Pinned test-data cloning to chardet release version tags for reproducible builds (Dan Blanchard via Claude)

7.0.1 (2026-03-04)¶

Fixes:

Fixed false UTF-7 detection of SHA-1 git hashes (Alex Rembish, #324)
Fixed _SINGLE_LANG_MAP missing aliases for single-language encoding lookup (e.g., big5 → big5hkscs) (Dan Blanchard)
Fixed PyPy TypeError in UTF-7 codec handling (Dan Blanchard)

Improvements:

Retrained bigram models — 24 previously failing test cases now pass (Dan Blanchard via Claude)
Updated language equivalences for mutual intelligibility (Slovak/Czech, East Slavic + Bulgarian, Malay/Indonesian, Scandinavian languages) (Dan Blanchard via Claude)

7.0.0 (2026-03-02)¶

Ground-up, 0BSD-licensed rewrite of chardet (Dan Blanchard via Claude, #322). Same package name, same public API — drop-in replacement for chardet 5.x/6.x.

Highlights:

0BSD license (previous versions were LGPL)
96.8% accuracy on 2,179 test files (+2.3pp vs chardet 6.0.0, +7.7pp vs charset-normalizer)
41x faster than chardet 6.0.0 with mypyc (28x pure Python), 7.5x faster than charset-normalizer
Language detection for every result (90.5% accuracy across 49 languages)
99 encodings across six eras (MODERN_WEB, LEGACY_ISO, LEGACY_MAC, LEGACY_REGIONAL, DOS, MAINFRAME)
12-stage detection pipeline — BOM, UTF-16/32 patterns, escape sequences, binary detection, markup charset, ASCII, UTF-8 validation, byte validity, CJK gating, structural probing, statistical scoring, post-processing
Bigram frequency models trained on CulturaX multilingual corpus data for all supported language/encoding pairs
Optional mypyc compilation — 1.49x additional speedup on CPython
Thread-safe detect() and detect_all() with no measurable overhead; scales on free-threaded Python 3.13t+
Negligible import memory (96 B)
Zero runtime dependencies

Breaking changes vs 6.0.0:

detect() and detect_all() now default to encoding_era=EncodingEra.ALL (6.0.0 defaulted to MODERN_WEB)
Internal architecture is completely different (probers replaced by pipeline stages). Only the public API is preserved.
LanguageFilter is accepted but ignored (deprecation warning emitted)
chunk_size is accepted but ignored (deprecation warning emitted)

6.0.0.post1 (2026-02-22)¶

Fixed __version__ not being set correctly in the package (Dan Blanchard)

6.0.0 (2026-02-22)¶

Features:

Unified single-byte charset detection with proper language-specific bigram models for all single-byte encodings (replaces Latin1Prober and MacRomanProber heuristics) (Dan Blanchard)
38 new languages: Arabic, Belarusian, Breton, Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Farsi, Finnish, French, German, Icelandic, Indonesian, Irish, Italian, Kazakh, Latvian, Lithuanian, Macedonian, Malay, Maltese, Norwegian, Polish, Portuguese, Romanian, Scottish Gaelic, Serbian, Slovak, Slovene, Spanish, Swedish, Tajik, Ukrainian, Vietnamese, Welsh (Dan Blanchard)
EncodingEra filtering via new encoding_era parameter (Dan Blanchard)
max_bytes and chunk_size parameters for detect(), detect_all(), and UniversalDetector (Dan Blanchard)
-e/--encoding-era CLI flag (Dan Blanchard via Claude)
EBCDIC detection (CP037, CP500) (Dan Blanchard)
Direct GB18030 support (replaces redundant GB2312 prober) (Dan Blanchard)
Binary file detection (Dan Blanchard)
Python 3.12, 3.13, and 3.14 support (Hugo van Kemenade, #283)
GitHub Codespaces support (oxygen dioxide, #312)

Breaking changes:

Dropped Python 3.7, 3.8, and 3.9 (requires Python 3.10+)
Removed Latin1Prober and MacRomanProber
Removed EUC-TW support
Removed LanguageFilter.NONE
detect() default changed to encoding_era=EncodingEra.MODERN_WEB

Fixes:

Fixed CP949 state machine (nenw*, #268)
Fixed SJIS distribution analysis (second-byte range >= 0x80) (Kadir Can Ozden, #315)
Fixed max_bytes not being passed to UniversalDetector (Kadir Can Ozden, #314)
Fixed UTF-16/32 detection for non-ASCII-heavy text (Dan Blanchard)
Fixed GB18030 char_len_table (Dan Blanchard)
Fixed UTF-8 state machine (Dan Blanchard)
Fixed detect_all() returning inactive probers (Dan Blanchard)
Fixed early cutoff bug (Dan Blanchard)
Updated LGPLv2.1 license text for remote-only FSF address (Ben Beasley, #307)

5.2.0 (2023-08-01)¶

Added support for running the CLI via python -m chardet (Dan Blanchard)

5.1.0 (2022-12-01)¶

Added should_rename_legacy argument to remap legacy encoding names to modern equivalents (Dan Blanchard, #264)
Added MacRoman encoding prober (Elia Robyn Lake)
Added --minimal flag to chardetect CLI (Dan Blanchard, #214)
Added type annotations and mypy CI (Jon Dufresne, #261)
Added support for Python 3.11 (Hugo van Kemenade, #274)
Added ISO-8859-15 capital letter sharp S handling (Simon Waldherr, #222)
Clarified LGPL version in license trove classifier (Ben Beasley, #255)
Removed support for Python 3.6 (Jon Dufresne, #260)

5.0.0 (2022-06-25)¶

Added Johab Korean prober (grizlupo, #172, #207)
Added UTF-16/32 BE/LE probers (Jason Zavaglia, #109, #206)
Added test data for Croatian, Czech, Hungarian, Polish, Slovak, Slovene, Greek, Turkish (Dan Blanchard)
Improved XML tag filtering (Dan Blanchard, #208)
Made detect_all return child prober confidences (Dan Blanchard, #210)
Added support for Python 3.10 (Hugo van Kemenade, #232)
Slight performance increase (deedy5, #252)
Dropped Python 2.7, 3.4, 3.5 (requires Python 3.6+)

4.0.0 (2020-12-10)¶

Added detect_all() function returning all candidate encodings (Damien, #111)
Converted single-byte charset probers to nested dicts (performance) (Dan Blanchard, #121)
CharsetGroupProber now short-circuits on definite matches (performance) (Dan Blanchard, #203)
Added language field to detect_all output (Dan Blanchard)
Switched from Travis to GitHub Actions (Dan Blanchard, #204)
Dropped Python 2.6, 3.4, 3.5

3.0.4 (2017-06-08)¶

Fixed packaging issue with pytest_runner (Zac Medico, #119)
Included test.py in source distribution (Zac Medico, #118)
Updated old URLs in README and docs (Qi Fan, #123; Jon Dufresne, #129)

3.0.3 (2017-05-16)¶

Fixed crash when debug logging was enabled (Dan Blanchard, #117)

3.0.2 (2017-04-12)¶

Fixed detect sometimes returning None instead of a result dict (Dan Blanchard, #114)

3.0.1 (2017-04-11)¶

Fixed crash in EUC-TW prober with certain strings (Dan Blanchard)

3.0.0 (2017-04-11)¶

Added Turkish ISO-8859-9 detection (queeup)
Modernized naming conventions (typical_positive_ratio instead of mTypicalPositiveRatio) (Dan Blanchard, #107)
Added language property to probers and results (Dan Blanchard, #108)
Switched from Travis to GitHub Actions (Dan Blanchard)
Fixed CharsetGroupProber.state not being set to FOUND_IT (Dan Blanchard)
Added Hypothesis-based fuzz testing (David R. MacIver, #66)
Don’t indicate byte order for UTF-16/32 with given BOM, for compatibility with decode() (Sebastian Noack, #73)
Stop reading file immediately when file type is known (Jason Zavaglia, #103)

chardet 2.3.0 (2014-10-07)¶

Added CP932 detection (hashy)
Fixed UTF-8 BOM not detected as UTF-8-SIG (atbest, #32)
Switched chardetect to use argparse (Dan Blanchard)

chardet 2.2.1 (2013-12-18)¶

Fixed missing parenthesis in chardetect.py (Owen, #12)

chardet 2.2.0 (2013-12-16)¶

Merged the charade fork back into chardet, unifying Python 2 and Python 3 support under the original package name.

Added CP949 detection (Kyung-hown Chung)
Fixed BOM detection (Jean Boussier)

charade 1.0.3 (2013-01-18)¶

Fixed codecs usage for compatibility (Ian Cordasco)

charade 1.0.2 (2013-01-18)¶

Fixed BOM detection (Jean Boussier)
Improved multibyte sequence handling (Kyung-hown Chung)

charade 1.0.1 (2012-12-03)¶

Version fix (Ian Cordasco)

charade 1.0.0 (2012-12-02)¶

Initial release: Python 3 port of chardet, forked as a separate package (Ian Cordasco)

chardet 2.1.1 (2012-10-01)¶

Bumped version past Mark Pilgrim’s last release
chardetect can now read from stdin (Erik Rose)
Fixed BOM byte strings for UCS-4-2143 and UCS-4-3412 (Toshio Kuratomi)
Restored Mark Pilgrim’s original docs and COPYING file (Toshio Kuratomi)

chardet 1.1 (2012-07-27)¶

Added chardetect CLI tool (Erik Rose)
Fixed utf8prober crash when character is out of range (David Cramer)
Cleaned up detection logic to fail gracefully (David Cramer)
Fixed feed encoding errors (David Cramer)

chardet 1.0.1 (2008-04-19)¶

Packaging fix, added egg distributions for Python 2.4 and 2.5 (Mark Pilgrim)

chardet 1.0 (2006-12-23)¶

Initial release: Python 2 port of Mozilla’s universal charset detector (Mark Pilgrim)