Usage

Installation

pip install chardet

Basic Detection

Use chardet.detect() to detect the encoding of a byte string:

import chardet

result = chardet.detect(
    "München ist die Hauptstadt Bayerns und eine der"
    " schönsten Städte Deutschlands.".encode("windows-1252")
)
print(result)
# {'encoding': 'windows-1252', 'confidence': 0.34, 'language': 'de', 'mime_type': 'text/plain'}

The result is a dictionary with four keys:

"encoding" — the detected encoding name (e.g., "utf-8", "windows-1252"), or None if detection failed
"confidence" — a float between 0 and 1
"language" — the detected language code (e.g., "de" for the German text above), or None
"mime_type" — the detected MIME type (e.g., "text/plain", "image/png"), or None if unknown. For binary files detected via magic number signatures, this identifies the file format.

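The dictionary can be handed to a small decoding helper. The sketch below is illustrative and not part of chardet: decode_detected and its fallback parameter are hypothetical names, and the helper relies only on the "encoding" key described above.

```python
def decode_detected(data: bytes, result: dict, fallback: str = "utf-8") -> str:
    """Decode data using a detect()-style result dictionary.

    Uses `fallback` when "encoding" is None and replaces undecodable
    bytes instead of raising.
    """
    encoding = result.get("encoding") or fallback
    try:
        return data.decode(encoding, errors="replace")
    except LookupError:
        # The reported name is not a codec this Python build knows.
        return data.decode(fallback, errors="replace")


# Hand-written result in the documented shape (not real detector output).
sample = "schönsten Städte".encode("windows-1252")
text = decode_detected(sample, {"encoding": "windows-1252", "confidence": 0.34})
print(text)  # schönsten Städte
```

In real use the second argument would be the value returned by chardet.detect(data).
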
Multiple Candidates

Use chardet.detect_all() to get all candidate encodings ranked by confidence:

results = chardet.detect_all(data)
for r in results:
    print(f"{r['encoding']}: {r['confidence']:.2f}")

By default, results below the minimum confidence threshold (0.20) are filtered out. Pass ignore_threshold=True to see all candidates.
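
One way to use the ranked list is to walk it until a candidate decodes the data without errors. This is a minimal sketch under the assumption that each entry has the same "encoding" key as a detect() result; first_clean_decode is a hypothetical helper name, not a chardet function.

```python
from typing import Optional


def first_clean_decode(data: bytes, candidates: list) -> Optional[tuple]:
    """Return (encoding, text) for the first candidate that decodes cleanly.

    `candidates` is a list of dicts in ranked order, as returned by
    detect_all(). Returns None when no candidate works.
    """
    for candidate in candidates:
        encoding = candidate.get("encoding")
        if not encoding:
            continue
        try:
            return encoding, data.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # Wrong guess or unknown codec name: try the next candidate.
            continue
    return None
```

A clean decode does not prove the guess is right, since many single-byte encodings accept almost any input, so the ranking should still be treated as the primary signal.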

Streaming Detection

For large files or streaming data, use chardet.UniversalDetector:

from chardet import UniversalDetector

detector = UniversalDetector()
with open("somefile.txt", "rb") as f:
    for line in f:
        detector.feed(line)
        if detector.done:
            break
detector.close()
print(detector.result)

Call reset() to reuse the detector for another file.
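
Reusing one detector across several files might look like the sketch below. The function detect_files and its block_size parameter are hypothetical; the detector argument is assumed to expose the feed(), done, close(), result, and reset() members shown above. It reads fixed-size blocks rather than lines, which is a design choice for files that contain few newlines.

```python
def detect_files(paths, detector, block_size=64 * 1024):
    """Run one detector over several files, resetting it between files.

    `detector` is any object with the feed()/done/close()/result/reset()
    interface, for example a chardet.UniversalDetector instance.
    """
    results = {}
    for path in paths:
        detector.reset()
        with open(path, "rb") as f:
            while not detector.done:
                block = f.read(block_size)
                if not block:
                    break  # end of file
                detector.feed(block)
        detector.close()
        # Copy the result so a later reset() cannot alter stored entries.
        results[path] = dict(detector.result)
    return results
```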

The constructor accepts the same tuning parameters as detect():

from chardet import EncodingEra, UniversalDetector

detector = UniversalDetector(
    encoding_era=EncodingEra.MODERN_WEB,  # restrict candidate encodings
    max_bytes=50_000,                      # stop buffering after 50 KB
)

Encoding Eras

By default, chardet considers all supported encodings for maximum accuracy. Use the encoding_era parameter to restrict the search to a specific subset:

from chardet import detect, EncodingEra

# Default: all encodings considered
result = detect(data)

# Restrict to modern web encodings only
result = detect(data, encoding_era=EncodingEra.MODERN_WEB)

# Only legacy ISO encodings
result = detect(data, encoding_era=EncodingEra.LEGACY_ISO)

Available eras (can be combined with |):

ALL — All supported encodings (default)
MODERN_WEB — UTF-8, Windows codepages, CJK encodings
LEGACY_ISO — ISO-8859 family
LEGACY_MAC — Mac encodings
LEGACY_REGIONAL — Regional codepages (KOI8-T, KZ-1048, etc.)
DOS — DOS codepages (CP437, CP850, etc.)
MAINFRAME — EBCDIC encodings
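
When era names come from configuration or user input, they can be combined programmatically. The sketch below assumes EncodingEra behaves like a standard Python flag enum (lookup by member name, combination with |); combine_eras is a hypothetical helper, not part of chardet.

```python
from functools import reduce
from operator import or_


def combine_eras(era_enum, names):
    """OR together the members of a flag enum, looked up by name.

    `era_enum` is expected to be chardet's EncodingEra. Names such as
    "MODERN_WEB" are matched case-insensitively here for convenience.
    Unknown names raise KeyError.
    """
    members = [era_enum[name.strip().upper()] for name in names]
    if not members:
        raise ValueError("at least one era name is required")
    return reduce(or_, members)
```

Under that assumption, combine_eras(EncodingEra, ["modern_web", "legacy_iso"]) should be equivalent to EncodingEra.MODERN_WEB | EncodingEra.LEGACY_ISO, and the result can be passed as encoding_era.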

Encoding Name Options

By default, chardet returns encoding names compatible with chardet 5.x/6.x (e.g., "utf-8", "ascii", "SHIFT_JIS"). Two parameters control how encoding names are returned:

compat_names (default True) — map internal Python codec names to chardet 5.x/6.x compatible display names. Set to False to get raw Python codec names (e.g., "shift_jis_2004" instead of "SHIFT_JIS").
prefer_superset (default False) — remap legacy ISO/subset encodings to their modern Windows/CP superset equivalents (e.g., ASCII → Windows-1252, ISO-8859-1 → Windows-1252).

# Default: chardet 5.x/6.x compatible names
chardet.detect(data)
# {'encoding': 'ascii', ...}

# Raw Python codec names
chardet.detect(data, compat_names=False)
# {'encoding': 'ascii', ...}

# Superset remapping with compat names
chardet.detect(data, prefer_superset=True)
# {'encoding': 'Windows-1252', ...}

# Superset remapping with raw codec names
chardet.detect(data, prefer_superset=True, compat_names=False)
# {'encoding': 'cp1252', ...}

These parameters apply to detect(), detect_all(), and UniversalDetector.

The deprecated should_rename_legacy=True parameter is equivalent to prefer_superset=True and is still accepted with a deprecation warning.

The following table shows every encoding whose name changes depending on the compat_names and prefer_superset settings. Encodings not listed here return the same name in all modes.

Encoding names by parameter combination
Internal name	`compat_names=True` (default)	`compat_names=False`	`prefer_superset=True`	`prefer_superset=True, compat_names=False`
ascii	`ascii`	`ascii`	`Windows-1252`	`cp1252`
big5hkscs	`Big5`	`big5hkscs`	`Big5`	`big5hkscs`
cp855	`IBM855`	`cp855`	`IBM855`	`cp855`
cp866	`IBM866`	`cp866`	`IBM866`	`cp866`
euc_jis_2004	`EUC-JP`	`euc_jis_2004`	`EUC-JP`	`euc_jis_2004`
euc_kr	`EUC-KR`	`euc_kr`	`CP949`	`cp949`
iso2022_jp_2	`ISO-2022-JP`	`iso2022_jp_2`	`ISO-2022-JP`	`iso2022_jp_2`
iso8859-1	`ISO-8859-1`	`iso8859-1`	`Windows-1252`	`cp1252`
iso8859-2	`ISO-8859-2`	`iso8859-2`	`Windows-1250`	`cp1250`
iso8859-5	`ISO-8859-5`	`iso8859-5`	`Windows-1251`	`cp1251`
iso8859-6	`ISO-8859-6`	`iso8859-6`	`Windows-1256`	`cp1256`
iso8859-7	`ISO-8859-7`	`iso8859-7`	`Windows-1253`	`cp1253`
iso8859-8	`ISO-8859-8`	`iso8859-8`	`Windows-1255`	`cp1255`
iso8859-9	`ISO-8859-9`	`iso8859-9`	`Windows-1254`	`cp1254`
ISO-8859-11	`ISO-8859-11`	`ISO-8859-11`	`CP874`	`cp874`
iso8859-13	`ISO-8859-13`	`iso8859-13`	`Windows-1257`	`cp1257`
kz1048	`KZ1048`	`kz1048`	`KZ1048`	`kz1048`
mac-cyrillic	`MacCyrillic`	`mac-cyrillic`	`MacCyrillic`	`mac-cyrillic`
mac-greek	`MacGreek`	`mac-greek`	`MacGreek`	`mac-greek`
mac-iceland	`MacIceland`	`mac-iceland`	`MacIceland`	`mac-iceland`
mac-latin2	`MacLatin2`	`mac-latin2`	`MacLatin2`	`mac-latin2`
mac-roman	`MacRoman`	`mac-roman`	`MacRoman`	`mac-roman`
mac-turkish	`MacTurkish`	`mac-turkish`	`MacTurkish`	`mac-turkish`
shift_jis_2004	`SHIFT_JIS`	`shift_jis_2004`	`SHIFT_JIS`	`shift_jis_2004`
tis-620	`TIS-620`	`tis-620`	`CP874`	`cp874`
utf-8	`utf-8`	`utf-8`	`utf-8`	`utf-8`

Encoding Filters

Use include_encodings and exclude_encodings to control exactly which encodings chardet considers:

# Only consider UTF-8 and Windows-1252
result = chardet.detect(data, include_encodings=["utf-8", "windows-1252"])

# Consider everything except the EBCDIC codecs cp037 and cp500
result = chardet.detect(data, exclude_encodings=["cp037", "cp500"])

Encoding names are resolved through Python’s codec system, so aliases work (e.g., "latin-1" for "iso8859-1"). An empty iterable raises ValueError — pass None (the default) to disable filtering.
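
Alias resolution can be checked with the standard library before a filter is passed in. The sketch below uses codecs.lookup for this; whether chardet calls the same function internally is an assumption. prepare_encoding_filter is a hypothetical helper that also turns an empty list into None, so that the ValueError described above is avoided.

```python
import codecs


def prepare_encoding_filter(names):
    """Validate codec names for include_encodings or exclude_encodings.

    Returns the list of names, or None when no names were given.
    Raises LookupError if Python does not know one of the names.
    """
    cleaned = [name for name in (names or []) if name]
    for name in cleaned:
        codecs.lookup(name)  # raises LookupError for unknown codecs
    return cleaned or None


print(codecs.lookup("latin-1").name)  # iso8859-1
print(prepare_encoding_filter([]))    # None
```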

When filtering removes all candidates, chardet returns the no_match_encoding (default "cp1252") with low confidence. If even that encoding is excluded by the filters, chardet returns encoding=None with a warning. Similarly, empty_input_encoding (default "utf-8") controls the result for empty input:

# Custom fallbacks
result = chardet.detect(
    data,
    include_encodings=["utf-8", "shift_jis"],
    no_match_encoding="utf-8",
    empty_input_encoding="shift_jis",
)

These parameters apply to detect(), detect_all(), and UniversalDetector.

Limiting Bytes

By default, chardet examines up to 200,000 bytes. Use max_bytes to adjust:

# Examine only the first 10 KB
result = chardet.detect(data, max_bytes=10_000)

Smaller values are faster but may reduce accuracy for encodings that require more data to distinguish.
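
max_bytes limits how much data is examined, but the caller still decides how much is loaded. For large files it may help to read only the head of the file first. A minimal sketch follows, where read_head is a hypothetical helper and the default limit simply mirrors the 200,000-byte default mentioned above.

```python
def read_head(path, limit=200_000):
    """Read at most `limit` bytes from the start of a file."""
    with open(path, "rb") as f:
        return f.read(limit)
```

The returned bytes can then be passed to chardet.detect(), optionally with a matching max_bytes value.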

Deprecated Parameters

The following parameters are accepted for backward compatibility with chardet 5.x/6.x but have no effect:

chunk_size on detect() and detect_all() — previously controlled how data was chunked for streaming probers. A deprecation warning is emitted if a non-default value is passed.
lang_filter on UniversalDetector — previously restricted detection to specific language groups via LanguageFilter. A deprecation warning is emitted if set to anything other than ALL.
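
To find call sites that still pass these parameters, the warnings can be captured with the standard warnings module. The sketch assumes the warnings are issued as DeprecationWarning (or a subclass), which the text above does not state explicitly; call_and_collect_deprecations is a hypothetical helper.

```python
import warnings


def call_and_collect_deprecations(func, *args, **kwargs):
    """Call func and return (result, list of deprecation messages)."""
    with warnings.catch_warnings(record=True) as caught:
        # Record every warning, including ones hidden by default filters.
        warnings.simplefilter("always")
        result = func(*args, **kwargs)
    messages = [
        str(w.message)
        for w in caught
        if issubclass(w.category, DeprecationWarning)
    ]
    return result, messages
```

For example, call_and_collect_deprecations(chardet.detect, data, chunk_size=1024) would be expected to return the usual result together with one message, if the assumption about the warning category holds.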

Command-Line Tool

chardet includes a chardetect command:

# Detect encoding of files
chardetect somefile.txt anotherfile.csv
# somefile.txt: utf-8 with confidence 0.99
# anotherfile.csv: ascii with confidence 1.0

# Output only the encoding name
chardetect --minimal somefile.txt
# utf-8

# Include detected language
chardetect -l somefile.txt
# somefile.txt: utf-8 en (English) with confidence 0.99

# Minimal output with language
chardetect --minimal -l somefile.txt
# utf-8 en

# Specific encoding era
chardetect -e dos somefile.txt
# somefile.txt: cp850 with confidence 0.10

# Only consider specific encodings
chardetect -i utf-8,windows-1252 somefile.txt
# somefile.txt: utf-8 with confidence 0.99

# Exclude specific encodings
chardetect -x cp037,cp500 somefile.txt
# somefile.txt: utf-8 with confidence 0.99

# Custom fallback when detection is inconclusive
chardetect --no-match-encoding utf-8 somefile.bin
# somefile.bin: utf-8 with confidence 0.10

# Custom encoding for empty input
chardetect --empty-input-encoding shift_jis empty.txt
# empty.txt: SHIFT_JIS with confidence 0.10

# Read from stdin
cat somefile.txt | chardetect
# stdin: utf-8 with confidence 0.99
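
Scripts that consume the default (non-minimal) output can parse it line by line. The pattern below is inferred only from the sample lines shown above, so other versions or options may print a different layout; parse_chardetect_line and its field names are hypothetical.

```python
import re

# Layout inferred from the sample output above:
#   <source>: <encoding> [<code> (<Language>)] with confidence <number>
_LINE = re.compile(
    r"^(?P<source>.+): (?P<encoding>\S+)"
    r"(?: (?P<lang_code>\S+) \((?P<lang_name>[^)]*)\))?"
    r" with confidence (?P<confidence>\d+(?:\.\d+)?)$"
)


def parse_chardetect_line(line):
    """Parse one chardetect output line into a dict, or return None."""
    match = _LINE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["confidence"] = float(fields["confidence"])
    return fields


parsed = parse_chardetect_line(
    "somefile.txt: utf-8 en (English) with confidence 0.99"
)
print(parsed["encoding"], parsed["lang_code"], parsed["confidence"])
```

Lines produced with --minimal do not match this pattern and yield None.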