Usage¶

Installation¶

pip install chardet

Basic Detection¶

Use chardet.detect() to detect the encoding of a byte string:

import chardet

result = chardet.detect(
    "München ist die Hauptstadt Bayerns und eine der"
    " schönsten Städte Deutschlands.".encode("windows-1252")
)
print(result)
# {'encoding': 'windows-1252', 'confidence': 0.34, 'language': 'de'}

The result is a dictionary with three keys:

"encoding" — the detected encoding name (e.g., "utf-8", "windows-1252"), or None if detection failed
"confidence" — a float between 0 and 1
"language" — the detected language (e.g., "French"), or None

Multiple Candidates¶

Use chardet.detect_all() to get all candidate encodings ranked by confidence:

results = chardet.detect_all(data)
for r in results:
    print(f"{r['encoding']}: {r['confidence']:.2f}")

By default, results below the minimum confidence threshold (0.20) are filtered out. Pass ignore_threshold=True to see all candidates.

Streaming Detection¶

For large files or streaming data, use chardet.UniversalDetector:

from chardet import UniversalDetector

detector = UniversalDetector()
with open("somefile.txt", "rb") as f:
    for line in f:
        detector.feed(line)
        if detector.done:
            break
detector.close()
print(detector.result)

Call reset() to reuse the detector for another file.

The constructor accepts the same tuning parameters as detect():

detector = UniversalDetector(
    encoding_era=EncodingEra.MODERN_WEB,  # restrict candidate encodings
    max_bytes=50_000,                      # stop buffering after 50 KB
)

Encoding Eras¶

By default, chardet considers all supported encodings for maximum accuracy. Use the encoding_era parameter to restrict the search to a specific subset:

from chardet import detect, EncodingEra

# Default: all encodings considered
result = detect(data)

# Restrict to modern web encodings only
result = detect(data, encoding_era=EncodingEra.MODERN_WEB)

# Only legacy ISO encodings
result = detect(data, encoding_era=EncodingEra.LEGACY_ISO)

Available eras (can be combined with |):

ALL — All supported encodings (default)
MODERN_WEB — UTF-8, Windows codepages, CJK encodings
LEGACY_ISO — ISO-8859 family
LEGACY_MAC — Mac encodings
LEGACY_REGIONAL — Regional codepages (KOI8-T, KZ-1048, etc.)
DOS — DOS codepages (CP437, CP850, etc.)
MAINFRAME — EBCDIC encodings

Legacy Renaming¶

By default, chardet remaps legacy encoding names to their modern equivalents (e.g., "gb2312" becomes "gb18030"). Set should_rename_legacy=False to get the raw detection name:

# Default: legacy names are remapped
chardet.detect(data)
# {'encoding': 'gb18030', ...}

# Disable renaming to get the original detection name
chardet.detect(data, should_rename_legacy=False)
# {'encoding': 'gb2312', ...}

This applies to detect(), detect_all(), and UniversalDetector.

Limiting Bytes¶

By default, chardet examines up to 200,000 bytes. Use max_bytes to adjust:

# Examine only the first 10 KB
result = chardet.detect(data, max_bytes=10_000)

Smaller values are faster but may reduce accuracy for encodings that require more data to distinguish.

Deprecated Parameters¶

The following parameters are accepted for backward compatibility with chardet 5.x/6.x but have no effect:

chunk_size on detect() and detect_all() — previously controlled how data was chunked for streaming probers. A deprecation warning is emitted if a non-default value is passed.
lang_filter on UniversalDetector — previously restricted detection to specific language groups via LanguageFilter. A deprecation warning is emitted if set to anything other than ALL.

Command-Line Tool¶

chardet includes a chardetect command:

# Detect encoding of files
chardetect somefile.txt anotherfile.csv

# Output only the encoding name
chardetect --minimal somefile.txt

# Specific encoding era
chardetect -e dos somefile.txt

# Read from stdin
cat somefile.txt | chardetect