Changelog
Note
Entries marked “via Claude” were developed with Claude Code. Dan directed the design, reviewed all output, and takes responsibility for the result. Unmarked entries by Dan were written without AI assistance.
7.4.0 (2026-03-26)
Performance:
Switched to dense zlib-compressed model format (v2): models are now stored as contiguous memoryview slices of a single decompressed blob, eliminating per-model struct.unpack overhead. Cold start (import + first detect) dropped from ~75ms to ~13ms with mypyc. (Dan Blanchard via Claude, #354)
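As a rough illustration of the dense-format idea (not chardet's actual v2 layout, which is not described here), the sketch below packs several fixed-size tables into one zlib-compressed blob, decompresses it once, and hands out zero-copy memoryview slices. The table size, the model names, and the helpers pack_models and load_models_dense are all hypothetical.

```python
import zlib

TABLE_SIZE = 256  # hypothetical: one byte of weight per table entry


def pack_models(models: dict[str, bytes]) -> tuple[list[str], bytes]:
    """Concatenate fixed-size tables and compress them as a single blob."""
    names = sorted(models)
    for name in names:
        if len(models[name]) != TABLE_SIZE:
            raise ValueError(f"{name}: expected {TABLE_SIZE} bytes")
    return names, zlib.compress(b"".join(models[n] for n in names))


def load_models_dense(names: list[str], blob: bytes) -> dict[str, memoryview]:
    """Decompress once, then slice the buffer without copying."""
    raw = memoryview(zlib.decompress(blob))
    if len(raw) != TABLE_SIZE * len(names):
        raise ValueError("corrupt or truncated model blob")
    return {
        name: raw[i * TABLE_SIZE:(i + 1) * TABLE_SIZE]
        for i, name in enumerate(names)
    }


# Hypothetical model data, only here to exercise the round trip.
models = {
    "model-a": bytes(range(256)),
    "model-b": bytes(reversed(range(256))),
}
names, blob = pack_models(models)
loaded = load_models_dense(names, blob)
```

Because every slice is a view onto the same decompressed buffer, loading costs one decompression plus a handful of slice objects, with no per-entry unpacking.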
Accuracy:
Accuracy improved from 98.6% to 99.3% (2499/2517 files) through a combination of training and scoring improvements:
Eliminated train/test data overlap by content-fingerprinting test suite articles and excluding them from training data (#351)
Added MADLAD-400 and Wikipedia as supplemental training sources to fill gaps left by exclusion filtering (#351)
Improved non-ASCII bigram scoring: high-byte bigrams are now preserved during training (instead of being crushed by global normalization), and weighted by per-bigram IDF so encoding-specific byte patterns contribute proportionally to how discriminative they are (#352)
Added encoding-aware substitution filtering: character substitutions during training now only apply for characters the target encoding cannot represent
Increased training samples from 15K to 25K per language/encoding pair (Dan Blanchard via Claude)
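To make the per-bigram IDF idea above concrete, here is a minimal sketch using the textbook formula idf = log(N / df). The exact weighting chardet uses is not stated in this changelog, so the formula, the function names, and the sample data are assumptions.

```python
import math
from collections import Counter


def bigrams(data: bytes) -> list[bytes]:
    """Return every overlapping byte pair in data."""
    return [data[i:i + 2] for i in range(len(data) - 1)]


def bigram_idf(samples: dict[str, bytes]) -> dict[bytes, float]:
    """Weight each bigram by log(N / df); df counts the samples containing it."""
    doc_freq: Counter = Counter()
    for data in samples.values():
        doc_freq.update(set(bigrams(data)))
    n = len(samples)
    return {bg: math.log(n / df) for bg, df in doc_freq.items()}


# Hypothetical training text: the same word encoded three ways.
text = "café"
samples = {enc: text.encode(enc) for enc in ("utf-8", "cp1252", "cp437")}
idf = bigram_idf(samples)
```

Bigrams made only of ASCII bytes appear in every sample and get a weight of zero, while bigrams that involve the encoded "é" are unique to one encoding and get the highest weight.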
Bug Fixes:
Added dedicated structural analyzers for CP932, CP949, and Big5-HKSCS: these superset encodings previously shared their base encoding’s byte-range analyzer, missing extended ranges unique to each superset (Dan Blanchard via Claude, #353)
7.3.0 (2026-03-24)
License:
0BSD license — the project license has been changed from MIT to 0BSD, a maximally permissive license with no attribution requirement. All prior 7.x releases should also be considered 0BSD licensed as of this release. (Dan Blanchard via Claude)
Features:
Added mime_type field to detection results: identifies file types for both binary (via magic number matching) and text content. Returned in all detect(), detect_all(), and UniversalDetector results. (Dan Blanchard via Claude, #350)
New pipeline/magic.py module detects 40+ binary file formats including images, audio/video, archives, documents, executables, and fonts. ZIP-based formats (XLSX, DOCX, JAR, APK, EPUB, wheel, OpenDocument) are distinguished by entry filenames. (Dan Blanchard via Claude, #350)
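The two techniques named above, magic number matching and telling ZIP-based formats apart by entry filenames, can be sketched with the standard library alone. This is not the code in pipeline/magic.py: the signature table, the function names sniff_mime and sniff_zip, and the filename rules are simplified assumptions.

```python
import io
import zipfile
from typing import Optional

# A few well-known signatures; a real table would be much longer.
MAGIC = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"%PDF-", "application/pdf"),
    (b"\x1f\x8b", "application/gzip"),
]

DOCX = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
XLSX = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"


def sniff_zip(data: bytes) -> str:
    """Refine a ZIP container by looking at its entry filenames."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            names = archive.namelist()
    except zipfile.BadZipFile:
        return "application/zip"
    if any(name.startswith("word/") for name in names):
        return DOCX
    if any(name.startswith("xl/") for name in names):
        return XLSX
    if "META-INF/MANIFEST.MF" in names:
        return "application/java-archive"
    return "application/zip"


def sniff_mime(data: bytes) -> Optional[str]:
    """Return a MIME type for recognized binary content, else None."""
    if data.startswith(b"PK\x03\x04"):
        return sniff_zip(data)
    for magic, mime in MAGIC:
        if data.startswith(magic):
            return mime
    return None


# Build a tiny DOCX-like archive in memory to exercise the ZIP branch.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", "<w:document/>")
docx_like = buffer.getvalue()
```

Calling sniff_mime(docx_like) takes the ZIP branch and reports the DOCX type, while input with no known signature yields None so a text-detection path can take over.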
Bug Fixes:
Fixed incorrect equivalence between UTF-16-LE and UTF-16-BE in accuracy testing — these are distinct encodings with different byte order, not interchangeable (Dan Blanchard via Claude)
Performance:
Added 4 new modules to mypyc compilation (orchestrator, confusion, magic, ascii), bringing the total to 11 compiled modules (Dan Blanchard via Claude)
Capped statistical scoring at 16 KB — bigram models converge quickly, so large files no longer score the full 200 KB. Worst-case detection time dropped from 62ms to 26ms with no accuracy loss. (Dan Blanchard via Claude)
Replaced dataclasses.replace() with direct DetectionResult construction on hot paths, eliminating ~354k function calls per full test suite run (Dan Blanchard via Claude)
Build:
Added riscv64 to the mypyc wheel build matrix — prebuilt wheels are now published for RISC-V Linux alongside existing architectures (Bruno Verachten, #348)
7.2.0 (2026-03-17)
Features:
Added include_encodings and exclude_encodings parameters to detect(), detect_all(), and UniversalDetector: restrict the candidate set to specific encodings or exclude specific encodings from it, with corresponding -i/--include-encodings and -x/--exclude-encodings CLI flags (Dan Blanchard via Claude, #343)
Added no_match_encoding (default "cp1252") and empty_input_encoding (default "utf-8") parameters: control which encoding is returned when no candidate survives the pipeline or the input is empty, with corresponding CLI flags (Dan Blanchard via Claude, #343)
Added -l/--language flag to chardetect CLI: shows the detected language (ISO 639-1 code and English name) alongside the encoding (Dan Blanchard via Claude, #342)
7.1.0 (2026-03-11)
Features:
Added PEP 263 encoding declaration detection: # -*- coding: ... -*- and # coding=... declarations on lines 1–2 of Python source files are now recognized with confidence 0.95 (Dan Blanchard via Claude, #249)
Added chardet.universaldetector backward-compatibility stub so that from chardet.universaldetector import UniversalDetector works with a deprecation warning (Dan Blanchard via Claude, #341)
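For reference, a PEP 263 declaration can be recognized with the regular expression given in the PEP itself. The sketch below applies it to the first two lines of a byte string. The function name declared_encoding is hypothetical, the 0.95 confidence value is not modeled, and the check is simpler than Python's own tokenizer, which also requires the first line to be blank or a comment before it looks at the second.

```python
import re
from typing import Optional

# Pattern from PEP 263, adapted to operate on bytes.
CODING_RE = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def declared_encoding(data: bytes) -> Optional[str]:
    """Return the encoding named on line 1 or 2 of Python source, if any."""
    for line in data.splitlines()[:2]:
        match = CODING_RE.match(line)
        if match:
            return match.group(1).decode("ascii")
    return None


source = b"#!/usr/bin/env python\n# -*- coding: latin-1 -*-\nx = 1\n"
found = declared_encoding(source)
```

A declaration on line 3 or later is ignored, which matches the "lines 1–2" rule described above.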
Fixes:
Fixed false UTF-7 detection of ASCII text containing ++ or +word patterns (Dan Blanchard, #332, #335)
Fixed 0.5s startup cost on first detect() call: model norms are now computed during loading instead of lazily iterating 21M entries (Dan Blanchard via Claude, #333, #336)
Fixed undocumented encoding name changes between chardet 5.x and 7.0: detect() now returns chardet 5.x-compatible names by default (Dan Blanchard via Claude, #338)
Improved ISO-2022-JP family detection: recognizes ESC sequences for ISO-2022-JP-2004 (JIS X 0213) and ISO-2022-JP-EXT (JIS X 0201 Kana) (Dan Blanchard via Claude)
Fixed silent truncation of corrupt model data (iter_unpack yielded fewer tuples instead of raising) (Dan Blanchard via Claude)
Fixed incorrect date in LICENSE (Dan Blanchard)
Performance:
5.5x faster first-detect time (~0.42s → ~0.075s) by computing model norms as a by-product of load_models() (Dan Blanchard via Claude)
~40% faster model parsing via struct.iter_unpack for bulk entry extraction (eliminates ~305K individual unpack calls) (Dan Blanchard via Claude)
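The bulk-parsing change, together with the truncation fix listed under Fixes, can be illustrated with a small stdlib sketch. The three-byte entry layout and the function names are hypothetical; chardet's real model format is not shown in this changelog.

```python
import struct

ENTRY = struct.Struct(">BBB")  # hypothetical entry: first byte, second byte, weight


def parse_one_by_one(payload: bytes) -> list[tuple[int, int, int]]:
    """Baseline: one unpack_from call per entry."""
    return [
        ENTRY.unpack_from(payload, offset)
        for offset in range(0, len(payload), ENTRY.size)
    ]


def parse_bulk(payload: bytes, expected: int) -> list[tuple[int, int, int]]:
    """Bulk parse with iter_unpack, then verify the entry count explicitly."""
    entries = list(ENTRY.iter_unpack(payload))
    if len(entries) != expected:
        # A payload cut at an entry boundary simply yields fewer tuples,
        # so the count has to be checked by the caller.
        raise ValueError(f"expected {expected} entries, got {len(entries)}")
    return entries


payload = b"".join(ENTRY.pack(a, b, a ^ b) for a in range(4) for b in range(4))
entries = parse_bulk(payload, 16)
```

iter_unpack does raise struct.error when the buffer length is not a multiple of the entry size, but a buffer truncated on an entry boundary passes silently, which is why the explicit count check is there.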
New API parameters:
Added compat_names parameter (default True) to detect(), detect_all(), and UniversalDetector: set to False to get raw Python codec names instead of chardet 5.x/6.x compatible display names (Dan Blanchard via Claude)
Added prefer_superset parameter (default False): remaps legacy ISO/subset encodings to their modern Windows/CP superset equivalents (e.g., ASCII → Windows-1252, ISO-8859-1 → Windows-1252). This will default to True in the next major version (8.0). (Dan Blanchard via Claude)
Deprecated should_rename_legacy in favor of prefer_superset: a deprecation warning is emitted when used (Dan Blanchard via Claude)
Improvements:
Switched internal canonical encoding names to Python codec names (e.g., "utf-8" instead of "UTF-8"), with compat_names controlling the public output format. See Usage for the full mapping table. (Dan Blanchard via Claude)
Added lookup_encoding() to registry for case-insensitive resolution of arbitrary encoding name input to canonical names (Dan Blanchard via Claude)
Achieved 100% line coverage across all source modules (+31 tests) (Dan Blanchard via Claude)
Updated benchmark numbers: 98.2% encoding accuracy, 95.2% language accuracy on 2,510 test files (Dan Blanchard via Claude)
Pinned test-data cloning to chardet release version tags for reproducible builds (Dan Blanchard via Claude)
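Python's own codec registry already resolves encoding labels case-insensitively, which gives a feel for what "canonical Python codec names" means here. The helper below is a stand-in built on codecs.lookup, not chardet's lookup_encoding(), whose signature is not shown in this changelog.

```python
import codecs
from typing import Optional


def canonical_codec_name(name: str) -> Optional[str]:
    """Resolve an encoding label to Python's canonical codec name, or None."""
    try:
        return codecs.lookup(name.strip()).name
    except LookupError:
        return None


resolved = [canonical_codec_name(label) for label in ("UTF-8", "Latin-1", "CP1252")]
```

Labels such as "UTF-8" and "utf8" both resolve to "utf-8", and an unknown label yields None instead of raising.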
7.0.1 (2026-03-04)
Fixes:
Fixed false UTF-7 detection of SHA-1 git hashes (Alex Rembish, #324)
Fixed _SINGLE_LANG_MAP missing aliases for single-language encoding lookup (e.g., big5 → big5hkscs) (Dan Blanchard)
Fixed PyPy TypeError in UTF-7 codec handling (Dan Blanchard)
Improvements:
Retrained bigram models — 24 previously failing test cases now pass (Dan Blanchard via Claude)
Updated language equivalences for mutual intelligibility (Slovak/Czech, East Slavic + Bulgarian, Malay/Indonesian, Scandinavian languages) (Dan Blanchard via Claude)
7.0.0 (2026-03-02)
Ground-up, 0BSD-licensed rewrite of chardet (Dan Blanchard via Claude, #322). Same package name, same public API — drop-in replacement for chardet 5.x/6.x.
Highlights:
0BSD license (previous versions were LGPL)
96.8% accuracy on 2,179 test files (+2.3pp vs chardet 6.0.0, +7.7pp vs charset-normalizer)
41x faster than chardet 6.0.0 with mypyc (28x pure Python), 7.5x faster than charset-normalizer
Language detection for every result (90.5% accuracy across 49 languages)
99 encodings across six eras (MODERN_WEB, LEGACY_ISO, LEGACY_MAC, LEGACY_REGIONAL, DOS, MAINFRAME)
12-stage detection pipeline — BOM, UTF-16/32 patterns, escape sequences, binary detection, markup charset, ASCII, UTF-8 validation, byte validity, CJK gating, structural probing, statistical scoring, post-processing
Bigram frequency models trained on CulturaX multilingual corpus data for all supported language/encoding pairs
Optional mypyc compilation — 1.49x additional speedup on CPython
Thread-safe detect() and detect_all() with no measurable overhead; scales on free-threaded Python 3.13t+
Negligible import memory (96 B)
Zero runtime dependencies
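To give a feel for how a staged pipeline can be arranged, the sketch below implements only two of the stages named above (BOM and ASCII) and runs them in order. It assumes each stage either returns an answer or defers to the next one. The function names, the stage list, and the returned encoding names are illustrative assumptions, not chardet's internals.

```python
import codecs
from typing import Callable, Optional

# Longest BOMs first: the UTF-32-LE BOM begins with the UTF-16-LE BOM.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def detect_bom(data: bytes) -> Optional[str]:
    """Stage 1: a byte order mark settles the question immediately."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


def detect_ascii(data: bytes) -> Optional[str]:
    """Later stage: pure 7-bit input needs no statistical scoring."""
    return "ascii" if data.isascii() else None


STAGES: list[Callable[[bytes], Optional[str]]] = [detect_bom, detect_ascii]


def run_stages(data: bytes) -> Optional[str]:
    """Return the first definite answer, or None if every stage defers."""
    for stage in STAGES:
        result = stage(data)
        if result is not None:
            return result
    return None


sample = codecs.BOM_UTF32_LE + "hi".encode("utf-32-le")
answer = run_stages(sample)
```

Ordering matters even inside one stage: checking the UTF-16-LE BOM before the UTF-32-LE BOM would misreport UTF-32 input, because the first two bytes are identical.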
Breaking changes vs 6.0.0:
detect() and detect_all() now default to encoding_era=EncodingEra.ALL (6.0.0 defaulted to MODERN_WEB)
Internal architecture is completely different (probers replaced by pipeline stages). Only the public API is preserved.
LanguageFilter is accepted but ignored (deprecation warning emitted)
chunk_size is accepted but ignored (deprecation warning emitted)
6.0.0.post1 (2026-02-22)
Fixed __version__ not being set correctly in the package (Dan Blanchard)
6.0.0 (2026-02-22)
Features:
Unified single-byte charset detection with proper language-specific bigram models for all single-byte encodings (replaces Latin1Prober and MacRomanProber heuristics) (Dan Blanchard)
38 new languages: Arabic, Belarusian, Breton, Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Farsi, Finnish, French, German, Icelandic, Indonesian, Irish, Italian, Kazakh, Latvian, Lithuanian, Macedonian, Malay, Maltese, Norwegian, Polish, Portuguese, Romanian, Scottish Gaelic, Serbian, Slovak, Slovene, Spanish, Swedish, Tajik, Ukrainian, Vietnamese, Welsh (Dan Blanchard)
EncodingEra filtering via new encoding_era parameter (Dan Blanchard)
max_bytes and chunk_size parameters for detect(), detect_all(), and UniversalDetector (Dan Blanchard)
-e/--encoding-era CLI flag (Dan Blanchard via Claude)
EBCDIC detection (CP037, CP500) (Dan Blanchard)
Direct GB18030 support (replaces redundant GB2312 prober) (Dan Blanchard)
Binary file detection (Dan Blanchard)
Python 3.12, 3.13, and 3.14 support (Hugo van Kemenade, #283)
GitHub Codespaces support (oxygen dioxide, #312)
Breaking changes:
Dropped Python 3.7, 3.8, and 3.9 (requires Python 3.10+)
Removed Latin1Prober and MacRomanProber
Removed EUC-TW support
Removed LanguageFilter.NONE
detect() default changed to encoding_era=EncodingEra.MODERN_WEB
Fixes:
Fixed SJIS distribution analysis (second-byte range >= 0x80) (Kadir Can Ozden, #315)
Fixed max_bytes not being passed to UniversalDetector (Kadir Can Ozden, #314)
Fixed UTF-16/32 detection for non-ASCII-heavy text (Dan Blanchard)
Fixed GB18030 char_len_table (Dan Blanchard)
Fixed UTF-8 state machine (Dan Blanchard)
Fixed detect_all() returning inactive probers (Dan Blanchard)
Fixed early cutoff bug (Dan Blanchard)
Updated LGPLv2.1 license text for remote-only FSF address (Ben Beasley, #307)
5.2.0 (2023-08-01)
Added support for running the CLI via python -m chardet (Dan Blanchard)
5.1.0 (2022-12-01)
Added should_rename_legacy argument to remap legacy encoding names to modern equivalents (Dan Blanchard, #264)
Added MacRoman encoding prober (Elia Robyn Lake)
Added --minimal flag to chardetect CLI (Dan Blanchard, #214)
Added type annotations and mypy CI (Jon Dufresne, #261)
Added support for Python 3.11 (Hugo van Kemenade, #274)
Added ISO-8859-15 capital letter sharp S handling (Simon Waldherr, #222)
Clarified LGPL version in license trove classifier (Ben Beasley, #255)
Removed support for Python 3.6 (Jon Dufresne, #260)
5.0.0 (2022-06-25)
Added Johab Korean prober (Dan Blanchard, #207)
Added UTF-16/32 BE/LE probers (Dan Blanchard, #206)
Added test data for Croatian, Czech, Hungarian, Polish, Slovak, Slovene, Greek, Turkish (Dan Blanchard)
Improved XML tag filtering (Dan Blanchard, #208)
Made detect_all return child prober confidences (Dan Blanchard, #210)
Added support for Python 3.10 (Hugo van Kemenade, #232)
Dropped Python 2.7, 3.4, 3.5 (requires Python 3.6+)
4.0.0 (2020-12-10)
Added detect_all() function returning all candidate encodings (Damien, #111)
Converted single-byte charset probers to nested dicts (performance) (Dan Blanchard, #121)
CharsetGroupProber now short-circuits on definite matches (performance) (Dan Blanchard, #203)
Added language field to detect_all output (Dan Blanchard)
Switched from Travis to GitHub Actions (Dan Blanchard, #204)
Dropped Python 2.6, 3.4, 3.5
3.0.4 (2017-06-08)
Fixed packaging issue with pytest_runner (Zac Medico, #119)
Included test.py in source distribution (Zac Medico, #118)
Updated old URLs in README and docs (Qi Fan, #123; Jon Dufresne, #129)
3.0.3 (2017-05-16)
Fixed crash when debug logging was enabled (Dan Blanchard, #117)
3.0.2 (2017-04-12)
Fixed detect sometimes returning None instead of a result dict (Dan Blanchard, #114)
3.0.1 (2017-04-11)
Fixed crash in EUC-TW prober with certain strings (Dan Blanchard)
3.0.0 (2017-04-11)
Added Turkish ISO-8859-9 detection (queeup)
Modernized naming conventions (typical_positive_ratio instead of mTypicalPositiveRatio) (Dan Blanchard, #107)
Added language property to probers and results (Dan Blanchard, #108)
Switched from Travis to GitHub Actions (Dan Blanchard)
Fixed CharsetGroupProber.state not being set to FOUND_IT (Dan Blanchard)
Added Hypothesis-based fuzz testing (David R. MacIver, #66)
Don’t indicate byte order for UTF-16/32 with given BOM, for compatibility with decode() (Sebastian Noack, #73)
Stop reading file immediately when file type is known (Jason Zavaglia, #103)
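The BOM entry above matters because Python's byte-order-specific codecs keep the BOM as a character, while the generic codec consumes it. A short demonstration using only the standard library:

```python
import codecs

data = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")

generic = data.decode("utf-16")      # BOM selects the byte order and is dropped
specific = data.decode("utf-16-le")  # BOM survives as a leading U+FEFF
```

Reporting the generic name for BOM-prefixed input therefore lets callers pass the detected encoding straight to decode() without having to strip a stray U+FEFF afterwards.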
chardet 2.3.0 (2014-10-07)
Added CP932 detection (hashy)
Switched chardetect to use argparse (Dan Blanchard)
chardet 2.2.1 (2013-12-18)
chardet 2.2.0 (2013-12-16)
Merged the charade fork back into chardet, unifying Python 2 and Python 3 support under the original package name.
Added CP949 detection (Kyung-hown Chung)
Fixed BOM detection (Jean Boussier)
charade 1.0.3 (2013-01-18)
Fixed codecs usage for compatibility (Ian Cordasco)
charade 1.0.2 (2013-01-18)
Fixed BOM detection (Jean Boussier)
Improved multibyte sequence handling (Kyung-hown Chung)
charade 1.0.1 (2012-12-03)
Version fix (Ian Cordasco)
charade 1.0.0 (2012-12-02)
Initial release: Python 3 port of chardet, forked as a separate package (Ian Cordasco)
chardet 2.1.1 (2012-10-01)
Bumped version past Mark Pilgrim’s last release
chardetect can now read from stdin (Erik Rose)
Fixed BOM byte strings for UCS-4-2143 and UCS-4-3412 (Toshio Kuratomi)
Restored Mark Pilgrim’s original docs and COPYING file (Toshio Kuratomi)
chardet 1.1 (2012-07-27)
Added chardetect CLI tool (Erik Rose)
Fixed utf8prober crash when character is out of range (David Cramer)
Cleaned up detection logic to fail gracefully (David Cramer)
Fixed feed encoding errors (David Cramer)
chardet 1.0.1 (2008-04-19)
Packaging fix, added egg distributions for Python 2.4 and 2.5 (Mark Pilgrim)
chardet 1.0 (2006-12-23)
Initial release: Python 2 port of Mozilla’s universal charset detector (Mark Pilgrim)