API Reference

Top-level Functions

chardet.detect(byte_str, should_rename_legacy=False, encoding_era=<EncodingEra.ALL: 63>, chunk_size=65536, max_bytes=200000, *, prefer_superset=False, compat_names=True, include_encodings=None, exclude_encodings=None, no_match_encoding='cp1252', empty_input_encoding='utf-8')

Detect the encoding of the given byte string.

Parameters:
  • byte_str (bytes | bytearray) – The byte sequence to detect encoding for.

  • should_rename_legacy (bool) – Deprecated alias for prefer_superset.

  • encoding_era (EncodingEra) – Restrict candidate encodings to the given era.

  • chunk_size (int) – Deprecated – accepted for backward compatibility but has no effect.

  • max_bytes (int) – Maximum number of bytes to examine from byte_str.

  • prefer_superset (bool) – If True, remap ISO subset encodings to their Windows/CP superset equivalents (e.g., ISO-8859-1 -> Windows-1252).

  • compat_names (bool) – If True (default), return encoding names compatible with chardet 5.x/6.x. If False, return raw Python codec names.

  • include_encodings (Iterable[str] | None) – If given, restrict detection to only these encodings (names or aliases).

  • exclude_encodings (Iterable[str] | None) – If given, remove these encodings from the candidate set.

  • no_match_encoding (str) – Encoding to return when no candidate survives the pipeline. Defaults to "cp1252".

  • empty_input_encoding (str) – Encoding to return for empty input. Defaults to "utf-8".

Returns:

A dictionary with keys "encoding", "confidence", and "language".

Return type:

DetectionDict
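A minimal usage sketch, assuming chardet is installed. The sample text and the fallback codec are illustrative choices rather than part of the library, and only the positional byte_str argument is used.

```python
import chardet

# Illustrative sample: UTF-8 bytes containing non-ASCII characters.
data = "Grüße aus München. Schöne Grüße an alle!".encode("utf-8")

result = chardet.detect(data)

# "encoding" is typed str | None, so fall back to a caller-chosen codec.
encoding = result["encoding"] or "utf-8"

# errors="replace" keeps the decode from raising if the guess is wrong.
text = data.decode(encoding, errors="replace")

print(result["encoding"], result["confidence"])
```

Treat the confidence value as a hint: for short inputs it may be worth checking it before trusting the detected encoding.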

chardet.detect_all(byte_str, ignore_threshold=False, should_rename_legacy=False, encoding_era=<EncodingEra.ALL: 63>, chunk_size=65536, max_bytes=200000, *, prefer_superset=False, compat_names=True, include_encodings=None, exclude_encodings=None, no_match_encoding='cp1252', empty_input_encoding='utf-8')

Detect all possible encodings of the given byte string.

When ignore_threshold is False (the default), results with confidence <= MINIMUM_THRESHOLD (0.20) are filtered out. If no result exceeds the threshold, the full unfiltered list is returned as a fallback, so the caller always receives at least one result.

Parameters:
  • byte_str (bytes | bytearray) – The byte sequence to detect encoding for.

  • ignore_threshold (bool) – If True, return all candidate encodings regardless of confidence score.

  • should_rename_legacy (bool) – Deprecated alias for prefer_superset.

  • encoding_era (EncodingEra) – Restrict candidate encodings to the given era.

  • chunk_size (int) – Deprecated – accepted for backward compatibility but has no effect.

  • max_bytes (int) – Maximum number of bytes to examine from byte_str.

  • prefer_superset (bool) – If True, remap ISO subset encodings to their Windows/CP superset equivalents.

  • compat_names (bool) – If True (default), return encoding names compatible with chardet 5.x/6.x. If False, return raw Python codec names.

  • include_encodings (Iterable[str] | None) – If given, restrict detection to only these encodings (names or aliases).

  • exclude_encodings (Iterable[str] | None) – If given, remove these encodings from the candidate set.

  • no_match_encoding (str) – Encoding to return when no candidate survives the pipeline. Defaults to "cp1252".

  • empty_input_encoding (str) – Encoding to return for empty input. Defaults to "utf-8".

Returns:

A list of dictionaries, sorted by descending confidence.

Return type:

list[DetectionDict]
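The threshold rule described for detect_all() can be sketched in plain Python. This is not the library's code: apply_threshold and the candidate list are hypothetical, and the sketch only mirrors the documented behaviour (strictly greater than the threshold, with a fallback to the full list).

```python
MINIMUM_THRESHOLD = 0.20  # value documented for chardet.MINIMUM_THRESHOLD


def apply_threshold(results, ignore_threshold=False):
    """Hypothetical helper mirroring the documented filtering rule."""
    # Results are returned sorted by descending confidence.
    ranked = sorted(results, key=lambda r: r["confidence"], reverse=True)
    if ignore_threshold:
        return ranked
    # Keep only results strictly above the threshold.
    kept = [r for r in ranked if r["confidence"] > MINIMUM_THRESHOLD]
    # Fallback: if nothing clears the threshold, return everything.
    return kept if kept else ranked


# Made-up candidates for illustration only.
candidates = [
    {"encoding": "cp1252", "confidence": 0.15, "language": None},
    {"encoding": "utf-8", "confidence": 0.90, "language": None},
    {"encoding": "cp1250", "confidence": 0.20, "language": None},
]

filtered = apply_threshold(candidates)
everything = apply_threshold(candidates, ignore_threshold=True)
weak_only = apply_threshold(candidates[:1])
```

Here filtered keeps only the utf-8 entry, while weak_only still contains its single low-confidence entry because of the fallback.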

UniversalDetector

class chardet.UniversalDetector(lang_filter=<LanguageFilter.ALL: 31>, should_rename_legacy=False, encoding_era=<EncodingEra.ALL: 63>, max_bytes=200000, *, prefer_superset=False, compat_names=True, include_encodings=None, exclude_encodings=None, no_match_encoding='cp1252', empty_input_encoding='utf-8')

Streaming character encoding detector.

Implements a feed/close pattern for incremental detection of character encoding from byte streams. Compatible with the chardet 6.x API.

All detection is performed by the same pipeline used by chardet.detect() and chardet.detect_all(), ensuring consistent results regardless of which API is used.

Note

This class is not thread-safe. Each thread should create its own UniversalDetector instance.

Parameters:
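  • lang_filter (LanguageFilter) – Accepted for chardet 6.x compatibility but not used; the pipeline does not filter by language group.

  • should_rename_legacy (bool) – Deprecated alias for prefer_superset.

  • encoding_era (EncodingEra) – Restrict candidate encodings to the given era.

  • max_bytes (int) – Maximum number of bytes to buffer and examine.

  • prefer_superset (bool) – If True, remap ISO subset encodings to their Windows/CP superset equivalents.

  • compat_names (bool) – If True (default), return encoding names compatible with chardet 5.x/6.x. If False, return raw Python codec names.

  • include_encodings (Iterable[str] | None) – If given, restrict detection to only these encodings (names or aliases).

  • exclude_encodings (Iterable[str] | None) – If given, remove these encodings from the candidate set.

  • no_match_encoding (str) – Encoding to return when no candidate survives the pipeline. Defaults to "cp1252".

  • empty_input_encoding (str) – Encoding to return for empty input. Defaults to "utf-8".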

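As a rough illustration of why prefer_superset can matter, the sketch below copies three entries from this class's LEGACY_MAP attribute and decodes the same bytes with a subset codec and its superset. to_superset is a hypothetical helper, and treating the remapping as a plain dictionary lookup is an assumption, not a description of the library's internals.

```python
# Three entries copied from the documented LEGACY_MAP (not the full table).
LEGACY_MAP_EXCERPT = {
    "iso8859-1": "cp1252",
    "iso8859-2": "cp1250",
    "euc_kr": "cp949",
}


def to_superset(name):
    """Hypothetical helper: return the superset codec, or the name unchanged."""
    return LEGACY_MAP_EXCERPT.get(name, name)


# Bytes 0x93 and 0x94 are curly quotes in cp1252 but C1 control
# characters in ISO-8859-1, so the superset decodes them more usefully.
raw = b"\x93quoted\x94"
as_latin1 = raw.decode("iso8859-1")
as_cp1252 = raw.decode(to_superset("iso8859-1"))

print(repr(as_latin1))
print(repr(as_cp1252))
```

Names that have no entry in the table, such as "utf-8", pass through to_superset unchanged.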
MINIMUM_THRESHOLD = 0.2

LEGACY_MAP: ClassVar[MappingProxyType] = mappingproxy({'ascii': 'cp1252', 'euc_kr': 'cp949', 'iso8859-1': 'cp1252', 'iso8859-2': 'cp1250', 'iso8859-5': 'cp1251', 'iso8859-6': 'cp1256', 'iso8859-7': 'cp1253', 'iso8859-8': 'cp1255', 'iso8859-9': 'cp1254', 'iso8859-11': 'cp874', 'iso8859-13': 'cp1257', 'tis-620': 'cp874'})

feed(byte_str)

Feed a chunk of bytes to the detector.

Data is accumulated in an internal buffer. Once max_bytes have been buffered, done is set to True and further data is ignored until reset() is called.

Parameters:

byte_str (bytes | bytearray) – The next chunk of bytes to examine.

Raises:

ValueError – If called after close() without a reset().

Return type:

None

close()

Finalize detection and return the best result.

Runs the full detection pipeline on the buffered data.

Returns:

A dictionary with keys "encoding", "confidence", and "language".

Return type:

DetectionDict

reset()

Reset the detector to its initial state for reuse.

Return type:

None

property done: bool

Whether detection is complete and no more data is needed.

property result: DetectionDict

The current best detection result.
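The feed/close pattern can be sketched as follows, assuming chardet is installed. detect_stream, read_size, and the sample text are hypothetical names and data chosen for illustration; only feed(), done, and close() come from the class described here.

```python
import io

from chardet import UniversalDetector


def detect_stream(stream, read_size=4096):
    """Feed a binary stream to a detector in chunks and return the result."""
    detector = UniversalDetector()
    while True:
        chunk = stream.read(read_size)
        if not chunk:
            break  # end of stream
        detector.feed(chunk)
        if detector.done:
            break  # the detector needs no more data
    # close() finalizes detection and returns the best result.
    return detector.close()


# Illustrative input: repeated Czech text encoded as UTF-8.
sample = ("Příliš žluťoučký kůň úpěl ďábelské ódy. " * 100).encode("utf-8")
result = detect_stream(io.BytesIO(sample))

print(result["encoding"], result["confidence"])
```

To reuse one detector for several inputs, call reset() between them, since feed() raises ValueError after close(). The sketch creates a new instance per call instead, which also avoids sharing a detector across threads.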

Enumerations

class chardet.EncodingEra(*values)

Bit flags representing encoding eras for filtering detection candidates.

MODERN_WEB = 1
LEGACY_ISO = 2
LEGACY_MAC = 4
LEGACY_REGIONAL = 8
DOS = 16
MAINFRAME = 32
ALL = 63

class chardet.LanguageFilter(*values)

Language filter flags for UniversalDetector (chardet 6.x API compat).

Accepted but not used: the detection pipeline does not filter by language group.

Deprecated: Retained only for backward compatibility with chardet 6.x callers. Will be removed in a future major version.

CHINESE_SIMPLIFIED = 1
CHINESE_TRADITIONAL = 2
JAPANESE = 4
KOREAN = 8
NON_CJK = 16
ALL = 31
CHINESE = 3
CJK = 15
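The EncodingEra values are powers of two, so eras can be combined with bitwise OR, and ALL (63) is the union of the six individual flags. The sketch below uses a local stand-in, EraSketch, that mirrors the documented values. Modelling it as enum.IntFlag is an assumption; the real class may be built differently.

```python
from enum import IntFlag


class EraSketch(IntFlag):
    """Local stand-in mirroring the documented EncodingEra values."""

    MODERN_WEB = 1
    LEGACY_ISO = 2
    LEGACY_MAC = 4
    LEGACY_REGIONAL = 8
    DOS = 16
    MAINFRAME = 32
    ALL = 63


# Combine two eras into one filter value.
web_and_iso = EraSketch.MODERN_WEB | EraSketch.LEGACY_ISO

# OR-ing every individual flag gives the same value as ALL.
union = (
    EraSketch.MODERN_WEB
    | EraSketch.LEGACY_ISO
    | EraSketch.LEGACY_MAC
    | EraSketch.LEGACY_REGIONAL
    | EraSketch.DOS
    | EraSketch.MAINFRAME
)

print(int(web_and_iso))
print(EraSketch.LEGACY_ISO in web_and_iso)
print(EraSketch.DOS in web_and_iso)
```

With the real enum, a combined value of this kind would presumably be passed as encoding_era to detect(), detect_all(), or UniversalDetector.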

Result Types

class chardet.DetectionResult(encoding, confidence, language, mime_type=None)

A single encoding detection result.

Frozen dataclass holding the encoding name, confidence score, and optional language identifier returned by the detection pipeline.

Parameters:
  • encoding (str | None)

  • confidence (float)

  • language (str | None)

  • mime_type (str | None)

encoding: str | None
confidence: float
language: str | None
mime_type: str | None

to_dict()

Convert this result to a plain dict.

Returns:

A dict with 'encoding', 'confidence', 'language', and 'mime_type' keys.

Return type:

DetectionDict
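To show what a frozen dataclass with a to_dict() method implies for callers, here is a local stand-in with the same four fields. ResultSketch is hypothetical, and using dataclasses.asdict is only one plausible way to build the dictionary; the library's own implementation may differ.

```python
from dataclasses import FrozenInstanceError, asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class ResultSketch:
    """Local stand-in with the fields documented for DetectionResult."""

    encoding: Optional[str]
    confidence: float
    language: Optional[str]
    mime_type: Optional[str] = None

    def to_dict(self):
        # Assumption: a field-by-field copy into a plain dict.
        return asdict(self)


res = ResultSketch("utf-8", 0.99, None)
as_dict = res.to_dict()

try:
    res.confidence = 0.5  # frozen dataclasses reject attribute assignment
    mutated = True
except FrozenInstanceError:
    mutated = False

print(as_dict)
print(mutated)
```

Because instances are frozen, a result can be shared or used as a dictionary key without risk of later modification; the dict returned by to_dict() is an independent, mutable copy.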

class chardet.DetectionDict

Dictionary representation of a detection result.

Returned by chardet.detect(), chardet.detect_all(), and chardet.UniversalDetector.result.

encoding: str | None
confidence: float
language: str | None
mime_type: str | None

Constants

chardet.DEFAULT_MAX_BYTES: int = 200000

Default maximum number of bytes to examine during detection.

chardet.MINIMUM_THRESHOLD: float = 0.20

Default minimum confidence threshold for filtering results in chardet.detect_all().