API Reference
Top-level Functions
- chardet.detect(byte_str, should_rename_legacy=False, encoding_era=<EncodingEra.ALL: 63>, chunk_size=65536, max_bytes=200000, *, prefer_superset=False, compat_names=True, include_encodings=None, exclude_encodings=None, no_match_encoding='cp1252', empty_input_encoding='utf-8')
Detect the encoding of the given byte string.
- Parameters:
byte_str (bytes | bytearray) – The byte sequence to detect encoding for.
should_rename_legacy (bool) – Deprecated alias for prefer_superset.
encoding_era (EncodingEra) – Restrict candidate encodings to the given era.
chunk_size (int) – Deprecated – accepted for backward compatibility but has no effect.
max_bytes (int) – Maximum number of bytes to examine from byte_str.
prefer_superset (bool) – If True, remap ISO subset encodings to their Windows/CP superset equivalents (e.g., ISO-8859-1 -> Windows-1252).
compat_names (bool) – If True (default), return encoding names compatible with chardet 5.x/6.x. If False, return raw Python codec names.
include_encodings (Iterable[str] | None) – If given, restrict detection to only these encodings (names or aliases).
exclude_encodings (Iterable[str] | None) – If given, remove these encodings from the candidate set.
no_match_encoding (str) – Encoding to return when no candidate survives the pipeline. Defaults to "cp1252".
empty_input_encoding (str) – Encoding to return for empty input. Defaults to "utf-8".
- Returns:
A dictionary with keys "encoding", "confidence", and "language".
- Return type:
DetectionDict
- chardet.detect_all(byte_str, ignore_threshold=False, should_rename_legacy=False, encoding_era=<EncodingEra.ALL: 63>, chunk_size=65536, max_bytes=200000, *, prefer_superset=False, compat_names=True, include_encodings=None, exclude_encodings=None, no_match_encoding='cp1252', empty_input_encoding='utf-8')
Detect all possible encodings of the given byte string.
When ignore_threshold is False (the default), results with confidence <= MINIMUM_THRESHOLD (0.20) are filtered out. If no result exceeds the threshold, the full unfiltered list is returned as a fallback so the caller always receives at least one result.
- Parameters:
byte_str (bytes | bytearray) – The byte sequence to detect encoding for.
ignore_threshold (bool) – If True, return all candidate encodings regardless of confidence score.
should_rename_legacy (bool) – Deprecated alias for prefer_superset.
encoding_era (EncodingEra) – Restrict candidate encodings to the given era.
chunk_size (int) – Deprecated – accepted for backward compatibility but has no effect.
max_bytes (int) – Maximum number of bytes to examine from byte_str.
prefer_superset (bool) – If True, remap ISO subset encodings to their Windows/CP superset equivalents.
compat_names (bool) – If True (default), return encoding names compatible with chardet 5.x/6.x. If False, return raw Python codec names.
include_encodings (Iterable[str] | None) – If given, restrict detection to only these encodings (names or aliases).
exclude_encodings (Iterable[str] | None) – If given, remove these encodings from the candidate set.
no_match_encoding (str) – Encoding to return when no candidate survives the pipeline. Defaults to "cp1252".
empty_input_encoding (str) – Encoding to return for empty input. Defaults to "utf-8".
- Returns:
A list of dictionaries, sorted by descending confidence.
- Return type:
list[DetectionDict]
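As a rough illustration of how the result dictionary from the functions above might be consumed, the sketch below decodes a byte string with the reported encoding. decode_detected is a hypothetical helper, not part of chardet. The detector is passed in as a callable so the sketch does not depend on a particular chardet version, and the fallback used when "encoding" is missing or None is an assumption made for illustration.

```python
from typing import Any, Callable, Mapping


def decode_detected(
    data: bytes,
    detect: Callable[[bytes], Mapping[str, Any]],
    fallback: str = "cp1252",
) -> str:
    """Decode ``data`` with whatever encoding ``detect`` reports.

    ``detect`` is expected to behave like chardet.detect(): it takes bytes
    and returns a mapping with an "encoding" key.
    """
    result = detect(data)
    # Assumption: a detector may report None, so fall back in that case.
    encoding = result.get("encoding") or fallback
    # errors="replace" keeps the sketch from raising on a wrong guess.
    return data.decode(encoding, errors="replace")


# Usage, assuming chardet is installed:
#     import chardet
#     text = decode_detected(raw_bytes, chardet.detect)
```

Passing the detector in, rather than importing it inside the helper, also makes it easy to swap in a call that sets keyword arguments such as max_bytes through functools.partial.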
UniversalDetector
- class chardet.UniversalDetector(lang_filter=<LanguageFilter.ALL: 31>, should_rename_legacy=False, encoding_era=<EncodingEra.ALL: 63>, max_bytes=200000, *, prefer_superset=False, compat_names=True, include_encodings=None, exclude_encodings=None, no_match_encoding='cp1252', empty_input_encoding='utf-8')
Streaming character encoding detector.
Implements a feed/close pattern for incremental detection of character encoding from byte streams. Compatible with the chardet 6.x API.
All detection is performed by the same pipeline used by chardet.detect() and chardet.detect_all(), ensuring consistent results regardless of which API is used.
Note
This class is not thread-safe. Each thread should create its own UniversalDetector instance.
- Parameters:
lang_filter (LanguageFilter)
should_rename_legacy (bool)
encoding_era (EncodingEra)
max_bytes (int)
prefer_superset (bool)
compat_names (bool)
include_encodings (Iterable[str] | None)
exclude_encodings (Iterable[str] | None)
no_match_encoding (str)
empty_input_encoding (str)
- MINIMUM_THRESHOLD = 0.2
- LEGACY_MAP: ClassVar[MappingProxyType] = mappingproxy({'ascii': 'cp1252', 'euc_kr': 'cp949', 'iso8859-1': 'cp1252', 'iso8859-2': 'cp1250', 'iso8859-5': 'cp1251', 'iso8859-6': 'cp1256', 'iso8859-7': 'cp1253', 'iso8859-8': 'cp1255', 'iso8859-9': 'cp1254', 'iso8859-11': 'cp874', 'iso8859-13': 'cp1257', 'tis-620': 'cp874'})
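A minimal sketch of how a mapping like LEGACY_MAP could be applied when prefer_superset is enabled. The to_superset helper is hypothetical, and the lowercase normalization before lookup is an assumption; the library's own lookup may differ.

```python
from types import MappingProxyType

# Values copied from the LEGACY_MAP attribute shown above.
LEGACY_MAP = MappingProxyType({
    "ascii": "cp1252",
    "euc_kr": "cp949",
    "iso8859-1": "cp1252",
    "iso8859-2": "cp1250",
    "iso8859-5": "cp1251",
    "iso8859-6": "cp1256",
    "iso8859-7": "cp1253",
    "iso8859-8": "cp1255",
    "iso8859-9": "cp1254",
    "iso8859-11": "cp874",
    "iso8859-13": "cp1257",
    "tis-620": "cp874",
})


def to_superset(encoding: str) -> str:
    """Return the superset codec for ``encoding``, or ``encoding`` unchanged."""
    # Assumption: names are matched case-insensitively.
    return LEGACY_MAP.get(encoding.lower(), encoding)
```

Encodings without an entry, such as utf-8, pass through untouched, so the helper is safe to apply to any detected name.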
- feed(byte_str)
Feed a chunk of bytes to the detector.
Data is accumulated in an internal buffer. Once max_bytes have been buffered, done is set to True and further data is ignored until reset() is called.
- Parameters:
byte_str (bytes | bytearray) – The next chunk of bytes to examine.
- Raises:
ValueError – If called after close() without a reset().
- Return type:
None
- close()
Finalize detection and return the best result.
Runs the full detection pipeline on the buffered data.
- Returns:
A dictionary with keys "encoding", "confidence", and "language".
- Return type:
- reset()
Reset the detector to its initial state for reuse.
- Return type:
None
- property result: DetectionDict
The current best detection result.
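The feed/close pattern described above can be sketched as a small driver loop. detect_chunks and the FeedCloseDetector protocol are hypothetical names introduced for illustration; the loop relies only on the feed(), close(), reset(), and done members documented for UniversalDetector, and works with any object that provides them.

```python
from typing import Any, Iterable, Protocol


class FeedCloseDetector(Protocol):
    """Structural type for the feed/close interface documented above."""

    done: bool

    def feed(self, byte_str: bytes) -> None: ...
    def close(self) -> Any: ...
    def reset(self) -> None: ...


def detect_chunks(chunks: Iterable[bytes], detector: FeedCloseDetector) -> Any:
    """Feed chunks until the detector reports done, then finalize."""
    # Start from a clean state so the detector can be reused across inputs.
    detector.reset()
    for chunk in chunks:
        detector.feed(chunk)
        # Once done is True further data is ignored, so stop reading early.
        if detector.done:
            break
    return detector.close()


# Usage, assuming chardet is installed and "data.bin" is a placeholder path:
#     from chardet import UniversalDetector
#     with open("data.bin", "rb") as f:
#         result = detect_chunks(iter(lambda: f.read(4096), b""), UniversalDetector())
```

Because the class is not thread-safe, a driver like this would create one detector per thread rather than sharing an instance.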
Enumerations
- class chardet.EncodingEra(*values)
Bit flags representing encoding eras for filtering detection candidates.
- MODERN_WEB = 1
- LEGACY_ISO = 2
- LEGACY_MAC = 4
- LEGACY_REGIONAL = 8
- DOS = 16
- MAINFRAME = 32
- ALL = 63
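Because the members are bit flags, eras can be combined with bitwise OR and tested with bitwise AND. The sketch below uses a stand-in class, EraSketch, that mirrors the values listed above; that the real EncodingEra is built on enum.IntFlag is an assumption, and era_allows is a hypothetical helper.

```python
import enum


class EraSketch(enum.IntFlag):
    """Stand-in mirroring the EncodingEra values listed above."""

    MODERN_WEB = 1
    LEGACY_ISO = 2
    LEGACY_MAC = 4
    LEGACY_REGIONAL = 8
    DOS = 16
    MAINFRAME = 32
    ALL = 63  # equals the OR of the six single-bit members


# Members combine with bitwise OR to widen the candidate set.
web_and_iso = EraSketch.MODERN_WEB | EraSketch.LEGACY_ISO


def era_allows(selected: EraSketch, era: EraSketch) -> bool:
    """Return True if ``era`` is included in the ``selected`` combination."""
    return bool(selected & era)
```

With the real enumeration, a combined value of this kind would presumably be passed as the encoding_era argument of detect(), detect_all(), or UniversalDetector.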
- class chardet.LanguageFilter(*values)
Language filter flags for UniversalDetector (chardet 6.x API compatibility).
Accepted but not used: the pipeline does not filter by language group.
Deprecated: Retained only for backward compatibility with chardet 6.x callers. Will be removed in a future major version.
- CHINESE_SIMPLIFIED = 1
- CHINESE_TRADITIONAL = 2
- JAPANESE = 4
- KOREAN = 8
- NON_CJK = 16
- ALL = 31
- CHINESE = 3
- CJK = 15
Result Types
- class chardet.DetectionResult(encoding, confidence, language, mime_type=None)
A single encoding detection result.
Frozen dataclass holding the encoding name, confidence score, optional language identifier, and optional MIME type (mime_type, default None) returned by the detection pipeline.
- to_dict()
Convert this result to a plain dict.
- Returns:
A dict with 'encoding', 'confidence', 'language', and 'mime_type' keys.
- Return type:
- class chardet.DetectionDict
Dictionary representation of a detection result.
Returned by chardet.detect(), chardet.detect_all(), and chardet.UniversalDetector.result.
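To make the relationship between the frozen dataclass and its dictionary form concrete, here is a minimal sketch of a class shaped like DetectionResult. ResultSketch is a stand-in, not the library class; the field names come from the constructor signature above, while the field types are assumptions.

```python
from dataclasses import asdict, dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class ResultSketch:
    """Stand-in with the same field names as DetectionResult."""

    encoding: Optional[str]      # type is an assumption
    confidence: float            # type is an assumption
    language: Optional[str]      # type is an assumption
    mime_type: Optional[str] = None

    def to_dict(self) -> dict[str, Any]:
        """Convert this result to a plain dict keyed by field name."""
        return asdict(self)
```

Freezing the dataclass means an attempt to assign to a field after construction raises an error, so a result can be shared without defensive copying.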
Constants
- chardet.DEFAULT_MAX_BYTES: int = 200000
Default maximum number of bytes to examine during detection.
- chardet.MINIMUM_THRESHOLD: float = 0.20
Default minimum confidence threshold for filtering results in chardet.detect_all().
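The filtering rule that this threshold drives in chardet.detect_all() can be sketched in plain Python. filter_by_threshold is a hypothetical helper that follows the documented behavior (drop results with confidence at or below the threshold, fall back to the full list if nothing remains); it is not the library's implementation.

```python
from typing import Any

MINIMUM_THRESHOLD = 0.20  # value taken from the constant documented above


def filter_by_threshold(
    results: list[dict[str, Any]],
    ignore_threshold: bool = False,
) -> list[dict[str, Any]]:
    """Apply the documented detect_all() filtering rule to a result list."""
    # Sort by descending confidence, matching the documented return order.
    ordered = sorted(results, key=lambda r: r["confidence"], reverse=True)
    if ignore_threshold:
        return ordered
    # Results at or below the threshold are dropped.
    kept = [r for r in ordered if r["confidence"] > MINIMUM_THRESHOLD]
    # Fallback: if nothing clears the threshold, return the unfiltered list.
    return kept or ordered
```

The fallback is what guarantees that a caller receives at least one entry whenever the input list is non-empty.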