Usage

Installation

pip install chardet

Basic Detection

Use chardet.detect() to detect the encoding of a byte string:

import chardet

result = chardet.detect(
    "München ist die Hauptstadt Bayerns und eine der"
    " schönsten Städte Deutschlands.".encode("windows-1252")
)
print(result)
# {'encoding': 'windows-1252', 'confidence': 0.34, 'language': 'de', 'mime_type': 'text/plain'}

The result is a dictionary with four keys:

  • "encoding" — the detected encoding name (e.g., "utf-8", "windows-1252"), or None if detection failed

  • "confidence" — a float between 0 and 1

  • "language" — the detected language as a language code (e.g., "de" in the example above), or None

  • "mime_type" — the detected MIME type (e.g., "text/plain", "image/png"), or None if unknown. For binary files detected via magic number signatures, this identifies the file format.
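The keys are usually consumed together when decoding. The following is a minimal sketch, assuming a result dictionary shaped like the one above. The decode_detected helper and its fallback parameter are hypothetical and not part of chardet, and treating every non-"text/" MIME type as binary is a deliberate simplification.

```python
from typing import Optional


def decode_detected(data: bytes, result: dict, fallback: str = "utf-8") -> Optional[str]:
    """Decode data using a chardet-style result dictionary.

    Hypothetical helper: returns None for content flagged as non-text.
    """
    mime_type = result.get("mime_type")
    # Simplification: any known MIME type outside text/* is treated as binary.
    if mime_type is not None and not mime_type.startswith("text/"):
        return None
    # "encoding" may be None when detection failed, so fall back explicitly.
    encoding = result.get("encoding") or fallback
    # errors="replace" avoids an exception if the guess is slightly off.
    return data.decode(encoding, errors="replace")


sample = "München".encode("windows-1252")
result = {"encoding": "windows-1252", "confidence": 0.34,
          "language": "de", "mime_type": "text/plain"}
text = decode_detected(sample, result)  # text == "München"
```

With chardet installed, the dictionary would come from chardet.detect(sample) instead of being written by hand.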

Multiple Candidates

Use chardet.detect_all() to get all candidate encodings ranked by confidence:

results = chardet.detect_all(data)
for r in results:
    print(f"{r['encoding']}: {r['confidence']:.2f}")

By default, results below the minimum confidence threshold (0.20) are filtered out. Pass ignore_threshold=True to see all candidates.
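One way to use the ranked list is to try each candidate in order until one decodes cleanly. This is a sketch under assumptions: first_strict_decode is a hypothetical helper, and the candidate list below is made-up data shaped like detect_all() output rather than real detection results.

```python
from typing import Iterable, Optional, Tuple


def first_strict_decode(data: bytes, candidates: Iterable[dict]) -> Optional[Tuple[str, str]]:
    """Return (encoding, text) for the first candidate that decodes without errors."""
    for candidate in candidates:
        encoding = candidate.get("encoding")
        if encoding is None:
            continue  # this entry carries no encoding to try
        try:
            return encoding, data.decode(encoding)  # strict decoding
        except (UnicodeDecodeError, LookupError):
            # Wrong guess, or a name Python has no codec for: try the next one.
            continue
    return None


data = "naïve café".encode("windows-1252")
candidates = [  # made-up values shaped like detect_all() output
    {"encoding": "utf-8", "confidence": 0.50},
    {"encoding": "windows-1252", "confidence": 0.40},
]
match = first_strict_decode(data, candidates)
# match == ("windows-1252", "naïve café"): the UTF-8 attempt fails, the next one succeeds
```

Strict decoding only proves that the bytes are valid in a given encoding, not that the encoding is the right one, so the confidence ranking still matters.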

Streaming Detection

For large files or streaming data, use chardet.UniversalDetector:

from chardet import UniversalDetector

detector = UniversalDetector()
with open("somefile.txt", "rb") as f:
    for line in f:
        detector.feed(line)
        if detector.done:
            break
detector.close()
print(detector.result)

Call reset() to reuse the detector for another file.
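Reusing one detector across several files might look like the sketch below. The detect_files helper is hypothetical; it relies only on the feed(), done, close(), result, and reset() members described above, and calling reset() before each file (rather than after) is a choice made here, not something the text prescribes.

```python
def detect_files(paths, detector):
    """Run one detector over several files, resetting it before each file.

    `detector` is any object with the feed/done/close/result/reset interface
    described above, for example a chardet.UniversalDetector instance.
    """
    results = {}
    for path in paths:
        detector.reset()  # clear state left over from the previous file
        with open(path, "rb") as f:
            for line in f:
                detector.feed(line)
                if detector.done:
                    break  # the detector is confident; skip the rest of the file
        detector.close()
        results[path] = detector.result
    return results
```

A call such as detect_files(["a.txt", "b.txt"], UniversalDetector()) would then return one result per path.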

The constructor accepts the same tuning parameters as detect():

from chardet import EncodingEra, UniversalDetector

detector = UniversalDetector(
    encoding_era=EncodingEra.MODERN_WEB,  # restrict candidate encodings
    max_bytes=50_000,                     # stop buffering after 50 KB
)

Encoding Eras

By default, chardet considers all supported encodings for maximum accuracy. Use the encoding_era parameter to restrict the search to a specific subset:

from chardet import detect, EncodingEra

# Default: all encodings considered
result = detect(data)

# Restrict to modern web encodings only
result = detect(data, encoding_era=EncodingEra.MODERN_WEB)

# Only legacy ISO encodings
result = detect(data, encoding_era=EncodingEra.LEGACY_ISO)

Eras can be combined with the | operator, for example EncodingEra.MODERN_WEB | EncodingEra.LEGACY_ISO.
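How | combines eras can be pictured with Python's standard enum.Flag. This sketch assumes EncodingEra behaves like a flag enum, which the text implies but does not state. The Era class and its three members are hypothetical stand-ins, not chardet's actual definition.

```python
from enum import Flag, auto


class Era(Flag):
    """Hypothetical stand-in for chardet's EncodingEra."""
    MODERN_WEB = auto()
    LEGACY_ISO = auto()
    DOS = auto()


# Combine two eras into a single value with the | operator.
combined = Era.MODERN_WEB | Era.LEGACY_ISO

print(Era.MODERN_WEB in combined)  # True
print(Era.DOS in combined)         # False
```

With the real enum, the combined value would be passed as the encoding_era argument in the same way as a single era.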

Encoding Name Options

By default, chardet returns encoding names compatible with chardet 5.x/6.x (e.g., "utf-8", "ascii", "SHIFT_JIS"). Two parameters control how encoding names are returned:

  • compat_names (default True) — map internal Python codec names to chardet 5.x/6.x compatible display names. Set to False to get raw Python codec names (e.g., "shift_jis_2004" instead of "SHIFT_JIS").

  • prefer_superset (default False) — remap legacy ISO/subset encodings to their modern Windows/CP superset equivalents (e.g., ASCII → Windows-1252, ISO-8859-1 → Windows-1252).

# Default: chardet 5.x/6.x compatible names
chardet.detect(data)
# {'encoding': 'ascii', ...}

# Raw Python codec names
chardet.detect(data, compat_names=False)
# {'encoding': 'ascii', ...}

# Superset remapping with compat names
chardet.detect(data, prefer_superset=True)
# {'encoding': 'Windows-1252', ...}

# Superset remapping with raw codec names
chardet.detect(data, prefer_superset=True, compat_names=False)
# {'encoding': 'cp1252', ...}

These parameters apply to detect(), detect_all(), and UniversalDetector.

The deprecated should_rename_legacy=True parameter is equivalent to prefer_superset=True and is still accepted with a deprecation warning.

The following table shows every encoding whose name changes depending on the compat_names and prefer_superset settings. Encodings not listed here return the same name in all modes.

Encoding names by parameter combination

Internal name   compat_names=True (default)  compat_names=False  prefer_superset=True  prefer_superset=True, compat_names=False
--------------  ---------------------------  ------------------  --------------------  ----------------------------------------
ascii           ascii                        ascii               Windows-1252          cp1252
big5hkscs       Big5                         big5hkscs           Big5                  big5hkscs
cp855           IBM855                       cp855               IBM855                cp855
cp866           IBM866                       cp866               IBM866                cp866
euc_jis_2004    EUC-JP                       euc_jis_2004        EUC-JP                euc_jis_2004
euc_kr          EUC-KR                       euc_kr              CP949                 cp949
iso2022_jp_2    ISO-2022-JP                  iso2022_jp_2        ISO-2022-JP           iso2022_jp_2
iso8859-1       ISO-8859-1                   iso8859-1           Windows-1252          cp1252
iso8859-2       ISO-8859-2                   iso8859-2           Windows-1250          cp1250
iso8859-5       ISO-8859-5                   iso8859-5           Windows-1251          cp1251
iso8859-6       ISO-8859-6                   iso8859-6           Windows-1256          cp1256
iso8859-7       ISO-8859-7                   iso8859-7           Windows-1253          cp1253
iso8859-8       ISO-8859-8                   iso8859-8           Windows-1255          cp1255
iso8859-9       ISO-8859-9                   iso8859-9           Windows-1254          cp1254
ISO-8859-11     ISO-8859-11                  ISO-8859-11         CP874                 cp874
iso8859-13      ISO-8859-13                  iso8859-13          Windows-1257          cp1257
kz1048          KZ1048                       kz1048              KZ1048                kz1048
mac-cyrillic    MacCyrillic                  mac-cyrillic        MacCyrillic           mac-cyrillic
mac-greek       MacGreek                     mac-greek           MacGreek              mac-greek
mac-iceland     MacIceland                   mac-iceland         MacIceland            mac-iceland
mac-latin2      MacLatin2                    mac-latin2          MacLatin2             mac-latin2
mac-roman       MacRoman                     mac-roman           MacRoman              mac-roman
mac-turkish     MacTurkish                   mac-turkish         MacTurkish            mac-turkish
shift_jis_2004  SHIFT_JIS                    shift_jis_2004      SHIFT_JIS             shift_jis_2004
tis-620         TIS-620                      tis-620             CP874                 cp874
utf-8           utf-8                        utf-8               utf-8                 utf-8

Encoding Filters

Use include_encodings and exclude_encodings to control exactly which encodings chardet considers:

# Only consider UTF-8 and Windows-1252
result = chardet.detect(data, include_encodings=["utf-8", "windows-1252"])

# Consider everything except EBCDIC
result = chardet.detect(data, exclude_encodings=["cp037", "cp500"])

Encoding names are resolved through Python’s codec system, so aliases work (e.g., "latin-1" for "iso8859-1"). An empty iterable raises ValueError — pass None (the default) to disable filtering.
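Alias resolution can be previewed with the standard library. This sketch assumes chardet's resolution matches codecs.lookup; the text says names go through Python's codec system, but the exact call is an assumption. canonical_codec_name is a hypothetical helper.

```python
import codecs


def canonical_codec_name(name: str) -> str:
    """Return the name Python's codec registry uses for an encoding label."""
    # codecs.lookup raises LookupError for labels Python does not know.
    return codecs.lookup(name).name


print(canonical_codec_name("latin-1"))       # iso8859-1
print(canonical_codec_name("windows-1252"))  # cp1252
print(canonical_codec_name("UTF8"))          # utf-8
```

Checking labels this way before passing them to include_encodings or exclude_encodings can catch misspelled names early.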

When filtering removes all candidates, chardet returns the no_match_encoding (default "cp1252") with low confidence. If even that encoding is excluded by the filters, chardet returns encoding=None with a warning. Similarly, empty_input_encoding (default "utf-8") controls the result for empty input:

# Custom fallbacks
result = chardet.detect(
    data,
    include_encodings=["utf-8", "shift_jis"],
    no_match_encoding="utf-8",
    empty_input_encoding="shift_jis",
)

These parameters apply to detect(), detect_all(), and UniversalDetector.

Limiting Bytes

By default, chardet examines up to 200,000 bytes. Use max_bytes to adjust:

# Examine only the first 10 KB
result = chardet.detect(data, max_bytes=10_000)

Smaller values are faster but may reduce accuracy for encodings that require more data to distinguish.
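When the data lives in a file, a similar cap can be applied at read time so the whole file is never loaded into memory. This is a sketch, not part of chardet: read_prefix is a hypothetical helper, and its default of 200_000 simply mirrors the limit stated above.

```python
def read_prefix(path: str, limit: int = 200_000) -> bytes:
    """Read at most `limit` bytes from the start of a file."""
    with open(path, "rb") as f:
        # read(limit) returns fewer bytes if the file is shorter than the limit.
        return f.read(limit)
```

The returned bytes can then be handed to detection, for example chardet.detect(read_prefix("somefile.txt", 10_000)).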

Deprecated Parameters

The following parameters are accepted for backward compatibility with chardet 5.x/6.x but have no effect:

  • chunk_size on detect() and detect_all() — previously controlled how data was chunked for streaming probers. A deprecation warning is emitted if a non-default value is passed.

  • lang_filter on UniversalDetector — previously restricted detection to specific language groups via LanguageFilter. A deprecation warning is emitted if set to anything other than ALL.

Command-Line Tool

chardet includes a chardetect command:

# Detect encoding of files
chardetect somefile.txt anotherfile.csv
# somefile.txt: utf-8 with confidence 0.99
# anotherfile.csv: ascii with confidence 1.0

# Output only the encoding name
chardetect --minimal somefile.txt
# utf-8

# Include detected language
chardetect -l somefile.txt
# somefile.txt: utf-8 en (English) with confidence 0.99

# Minimal output with language
chardetect --minimal -l somefile.txt
# utf-8 en

# Specific encoding era
chardetect -e dos somefile.txt
# somefile.txt: cp850 with confidence 0.10

# Only consider specific encodings
chardetect -i utf-8,windows-1252 somefile.txt
# somefile.txt: utf-8 with confidence 0.99

# Exclude specific encodings
chardetect -x cp037,cp500 somefile.txt
# somefile.txt: utf-8 with confidence 0.99

# Custom fallback when detection is inconclusive
chardetect --no-match-encoding utf-8 somefile.bin
# somefile.bin: utf-8 with confidence 0.10

# Custom encoding for empty input
chardetect --empty-input-encoding shift_jis empty.txt
# empty.txt: SHIFT_JIS with confidence 0.10

# Read from stdin
cat somefile.txt | chardetect
# stdin: utf-8 with confidence 0.99
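Scripts that call chardetect may want its default output as structured data. The sketch below parses lines in the "name: encoding with confidence value" shape shown in the examples. The parse_chardetect_line helper and its regular expression are assumptions based only on those sample lines; the -l and --minimal formats are not handled.

```python
import re
from typing import Optional, Tuple

# Matches lines shaped like "somefile.txt: utf-8 with confidence 0.99".
_LINE = re.compile(
    r"^(?P<name>.+): (?P<encoding>\S+) with confidence (?P<confidence>[0-9.]+)$"
)


def parse_chardetect_line(line: str) -> Optional[Tuple[str, str, float]]:
    """Parse one line of default chardetect output, or return None if it does not match."""
    match = _LINE.match(line.strip())
    if match is None:
        return None
    return match["name"], match["encoding"], float(match["confidence"])


print(parse_chardetect_line("somefile.txt: utf-8 with confidence 0.99"))
# ('somefile.txt', 'utf-8', 0.99)
```

For anything beyond quick scripting, calling chardet.detect() directly from Python avoids depending on the textual output format.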