Usage¶
Installation¶
pip install chardet
Basic Detection¶
Use chardet.detect() to detect the encoding of a byte string:
import chardet
result = chardet.detect(
"München ist die Hauptstadt Bayerns und eine der"
" schönsten Städte Deutschlands.".encode("windows-1252")
)
print(result)
# {'encoding': 'windows-1252', 'confidence': 0.34, 'language': 'de', 'mime_type': 'text/plain'}
The result is a dictionary with four keys:
- "encoding": the detected encoding name (e.g., "utf-8", "windows-1252"), or None if detection failed
- "confidence": a float between 0 and 1
- "language": the detected language (e.g., "French"), or None
- "mime_type": the detected MIME type (e.g., "text/plain", "image/png"), or None if unknown. For binary files detected via magic number signatures, this identifies the file format.
Multiple Candidates¶
Use chardet.detect_all() to get all candidate encodings ranked by
confidence:
results = chardet.detect_all(data)
for r in results:
    print(f"{r['encoding']}: {r['confidence']:.2f}")
By default, results below the minimum confidence threshold (0.20) are
filtered out. Pass ignore_threshold=True to see all candidates.
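The ranked list is also useful for judging how decisive a detection was. The sketch below measures the gap between the two best candidates. The helper names and the 0.10 margin are assumptions chosen for illustration, and the functions only rely on the "confidence" key described above.

```python
def top_margin(results):
    """Gap between the best and second-best confidence (hypothetical helper)."""
    scores = sorted((r["confidence"] for r in results), reverse=True)
    if not scores:
        return 0.0
    if len(scores) == 1:
        # A single candidate has no competitor; use its own confidence.
        return scores[0]
    return scores[0] - scores[1]


def is_ambiguous(results, margin=0.10):
    """True when the top candidate does not clearly beat the runner-up."""
    return top_margin(results) < margin
```

For example, is_ambiguous(chardet.detect_all(data)) could be used to decide whether to ask the user to confirm the encoding.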
Streaming Detection¶
For large files or streaming data, use chardet.UniversalDetector:
from chardet import UniversalDetector
detector = UniversalDetector()
with open("somefile.txt", "rb") as f:
    for line in f:
        detector.feed(line)
        if detector.done:
            break
detector.close()
print(detector.result)
Call reset() to reuse the detector for
another file.
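Iterating by line can produce very large "lines" when the data has few newlines, so fixed-size chunks are a reasonable alternative. The sketch below wraps the feed/done/close sequence in a reusable function. The helper names and the 64 KB chunk size are assumptions, and the detector argument is expected to expose reset(), feed(), done, close(), and result as in the example above.

```python
def iter_chunks(stream, chunk_size=64 * 1024):
    """Yield fixed-size byte chunks from a binary file object."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return
        yield chunk


def detect_chunks(detector, chunks):
    """Feed chunks until the detector is done, then return its result."""
    detector.reset()  # start clean so the same detector can be reused
    for chunk in chunks:
        detector.feed(chunk)
        if detector.done:
            break
    detector.close()
    return detector.result


# Possible usage (requires chardet):
# detector = UniversalDetector()
# with open("somefile.txt", "rb") as f:
#     result = detect_chunks(detector, iter_chunks(f))
```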
The constructor accepts the same tuning parameters as detect():
from chardet import EncodingEra, UniversalDetector
detector = UniversalDetector(
    encoding_era=EncodingEra.MODERN_WEB,  # restrict candidate encodings
    max_bytes=50_000,  # stop buffering after 50 KB
)
Encoding Eras¶
By default, chardet considers all supported encodings for maximum
accuracy. Use the encoding_era parameter to restrict the search to a
specific subset:
from chardet import detect, EncodingEra
# Default: all encodings considered
result = detect(data)
# Restrict to modern web encodings only
result = detect(data, encoding_era=EncodingEra.MODERN_WEB)
# Only legacy ISO encodings
result = detect(data, encoding_era=EncodingEra.LEGACY_ISO)
Available eras (can be combined with |):
- ALL: All supported encodings (default)
- MODERN_WEB: UTF-8, Windows codepages, CJK encodings
- LEGACY_ISO: ISO-8859 family
- LEGACY_MAC: Mac encodings
- LEGACY_REGIONAL: Regional codepages (KOI8-T, KZ-1048, etc.)
- DOS: DOS codepages (CP437, CP850, etc.)
- MAINFRAME: EBCDIC encodings
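One pattern that era restriction makes possible is to start with a narrow era and widen the search only when the result is weak. The sketch below is an illustration of that idea rather than a chardet feature. The function name, the 0.5 cutoff, and the era ordering in the usage comment are assumptions, and the detect argument is expected to accept an encoding_era keyword as chardet.detect does.

```python
def detect_widening(data, detect, eras, min_confidence=0.5):
    """Try each era in order and return the first confident result.

    Hypothetical helper. If no era yields a confident result, the
    result from the last era tried is returned.
    """
    if not eras:
        raise ValueError("eras must not be empty")
    result = None
    for era in eras:
        result = detect(data, encoding_era=era)
        confident = result.get("confidence", 0.0) >= min_confidence
        if result.get("encoding") is not None and confident:
            break
    return result


# Possible usage (requires chardet):
# from chardet import detect, EncodingEra
# result = detect_widening(
#     data,
#     detect,
#     eras=[
#         EncodingEra.MODERN_WEB,
#         EncodingEra.MODERN_WEB | EncodingEra.LEGACY_ISO,
#         EncodingEra.ALL,
#     ],
# )
```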
Encoding Name Options¶
By default, chardet returns encoding names compatible with chardet 5.x/6.x
(e.g., "utf-8", "ascii", "SHIFT_JIS"). Two parameters control
how encoding names are returned:
- compat_names (default True): map internal Python codec names to chardet 5.x/6.x compatible display names. Set to False to get raw Python codec names (e.g., "shift_jis_2004" instead of "SHIFT_JIS").
- prefer_superset (default False): remap legacy ISO/subset encodings to their modern Windows/CP superset equivalents (e.g., ASCII → Windows-1252, ISO-8859-1 → Windows-1252).
# Default: chardet 5.x/6.x compatible names
chardet.detect(data)
# {'encoding': 'ascii', ...}
# Raw Python codec names
chardet.detect(data, compat_names=False)
# {'encoding': 'ascii', ...}
# Superset remapping with compat names
chardet.detect(data, prefer_superset=True)
# {'encoding': 'Windows-1252', ...}
# Superset remapping with raw codec names
chardet.detect(data, prefer_superset=True, compat_names=False)
# {'encoding': 'cp1252', ...}
These parameters apply to detect(), detect_all(),
and UniversalDetector.
The deprecated should_rename_legacy=True parameter is equivalent to
prefer_superset=True and is still accepted with a deprecation warning.
The following table shows every encoding whose name changes depending on
the compat_names and prefer_superset settings. Encodings not listed
here return the same name in all modes.
| Internal name |
|---|
| ascii |
| big5hkscs |
| cp855 |
| cp866 |
| euc_jis_2004 |
| euc_kr |
| iso2022_jp_2 |
| iso8859-1 |
| iso8859-2 |
| iso8859-5 |
| iso8859-6 |
| iso8859-7 |
| iso8859-8 |
| iso8859-9 |
| ISO-8859-11 |
| iso8859-13 |
| kz1048 |
| mac-cyrillic |
| mac-greek |
| mac-iceland |
| mac-latin2 |
| mac-roman |
| mac-turkish |
| shift_jis_2004 |
| tis-620 |
| utf-8 |
Encoding Filters¶
Use include_encodings and exclude_encodings to control exactly which
encodings chardet considers:
# Only consider UTF-8 and Windows-1252
result = chardet.detect(data, include_encodings=["utf-8", "windows-1252"])
# Consider everything except EBCDIC
result = chardet.detect(data, exclude_encodings=["cp037", "cp500"])
Encoding names are resolved through Python’s codec system, so aliases work
(e.g., "latin-1" for "iso8859-1"). An empty iterable raises
ValueError — pass None (the default) to disable filtering.
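When the filter list comes from user input or configuration, it can help to validate it first. The sketch below uses the standard library's codecs.lookup to resolve aliases and maps an empty list to None so that the ValueError described above is avoided. The helper name is an assumption, and whether an empty filter should silently disable filtering depends on the application.

```python
import codecs


def clean_encoding_filter(names):
    """Return canonical codec names, or None when there is nothing to filter on.

    Hypothetical helper built on the standard library only.
    """
    if names is None:
        return None
    canonical = []
    for name in names:
        # codecs.lookup raises LookupError for names Python does not know.
        resolved = codecs.lookup(name).name
        if resolved not in canonical:
            canonical.append(resolved)
    # An empty filter would raise ValueError, so map it to None instead.
    return canonical or None
```

The cleaned value could then be passed along, for example chardet.detect(data, include_encodings=clean_encoding_filter(user_choice)).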
When filtering removes all candidates, chardet returns the
no_match_encoding (default "cp1252") with low confidence. If even
that encoding is excluded by the filters, chardet returns
encoding=None with a warning. Similarly, empty_input_encoding
(default "utf-8") controls the result for empty input:
# Custom fallbacks
result = chardet.detect(
    data,
    include_encodings=["utf-8", "shift_jis"],
    no_match_encoding="utf-8",
    empty_input_encoding="shift_jis",
)
These parameters apply to detect(),
detect_all(), and UniversalDetector.
Limiting Bytes¶
By default, chardet examines up to 200,000 bytes. Use max_bytes to
adjust:
# Examine only the first 10 KB
result = chardet.detect(data, max_bytes=10_000)
Smaller values are faster but may reduce accuracy for encodings that require more data to distinguish.
Deprecated Parameters¶
The following parameters are accepted for backward compatibility with chardet 5.x/6.x but have no effect:
- chunk_size on detect() and detect_all(): previously controlled how data was chunked for streaming probers. A deprecation warning is emitted if a non-default value is passed.
- lang_filter on UniversalDetector: previously restricted detection to specific language groups via LanguageFilter. A deprecation warning is emitted if set to anything other than ALL.
Command-Line Tool¶
chardet includes a chardetect command:
# Detect encoding of files
chardetect somefile.txt anotherfile.csv
# somefile.txt: utf-8 with confidence 0.99
# anotherfile.csv: ascii with confidence 1.0
# Output only the encoding name
chardetect --minimal somefile.txt
# utf-8
# Include detected language
chardetect -l somefile.txt
# somefile.txt: utf-8 en (English) with confidence 0.99
# Minimal output with language
chardetect --minimal -l somefile.txt
# utf-8 en
# Specific encoding era
chardetect -e dos somefile.txt
# somefile.txt: cp850 with confidence 0.10
# Only consider specific encodings
chardetect -i utf-8,windows-1252 somefile.txt
# somefile.txt: utf-8 with confidence 0.99
# Exclude specific encodings
chardetect -x cp037,cp500 somefile.txt
# somefile.txt: utf-8 with confidence 0.99
# Custom fallback when detection is inconclusive
chardetect --no-match-encoding utf-8 somefile.bin
# somefile.bin: utf-8 with confidence 0.10
# Custom encoding for empty input
chardetect --empty-input-encoding shift_jis empty.txt
# empty.txt: SHIFT_JIS with confidence 0.10
# Read from stdin
cat somefile.txt | chardetect
# stdin: utf-8 with confidence 0.99
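Scripts that call chardetect may want its output as structured data. The sketch below parses one output line into a dictionary. The line format is inferred only from the sample output above, so the regular expression and the function name are assumptions, and lines in any other shape simply yield None.

```python
import re

# Matches "<name>: <encoding> with confidence <float>" and the -l variant
# "<name>: <encoding> <code> (<Language>) with confidence <float>".
_LINE = re.compile(
    r"(?P<name>.+?): (?P<encoding>\S+)"
    r"(?: (?P<code>\S+) \((?P<language>[^)]*)\))?"
    r" with confidence (?P<confidence>[0-9]+(?:\.[0-9]+)?)"
)


def parse_chardetect_line(line):
    """Parse one chardetect output line, or return None if it does not match."""
    match = _LINE.fullmatch(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["confidence"] = float(fields["confidence"])
    return fields
```

The --minimal output has no file name or confidence, so it is intentionally not handled here.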