API Reference
Top-level Functions
- chardet.detect(byte_str, should_rename_legacy=True, encoding_era=<EncodingEra.ALL: 63>, chunk_size=65536, max_bytes=200000)
Detect the encoding of the given byte string.
Parameters match chardet 6.x for backward compatibility. chunk_size is accepted but has no effect.
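As an illustration of how the returned dictionary might be consumed, the sketch below decodes a byte string with the reported encoding. The helper name `decode_with_detection`, its `fallback` parameter, and the handling of a missing or unknown encoding are assumptions made for this example and are not part of chardet. The detector is passed in as a callable (normally `chardet.detect`) so the sketch runs without the library installed.

```python
from typing import Any, Callable, Mapping


def decode_with_detection(
    data: bytes,
    detect: Callable[[bytes], Mapping[str, Any]],
    fallback: str = "utf-8",
) -> str:
    """Decode data using the encoding reported by a detect()-style callable."""
    result = detect(data)
    # Assumption: "encoding" may be None when detection fails, so fall back
    # to a caller-chosen codec in that case.
    encoding = result.get("encoding") or fallback
    try:
        return data.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        # Unknown codec name, or bytes that do not fit it: decode leniently.
        return data.decode(fallback, errors="replace")
```

With chardet installed, this would be called as `decode_with_detection(raw, chardet.detect)`.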
- Parameters:
byte_str (bytes | bytearray) – The byte sequence to detect encoding for.
should_rename_legacy (bool) – If True (the default), remap legacy encoding names to their modern equivalents.
encoding_era (EncodingEra) – Restrict candidate encodings to the given era.
chunk_size (int) – Deprecated – accepted for backward compatibility but has no effect.
max_bytes (int) – Maximum number of bytes to examine from byte_str.
- Returns:
A dictionary with keys "encoding", "confidence", and "language".
- Return type:
DetectionDict
- chardet.detect_all(byte_str, ignore_threshold=False, should_rename_legacy=True, encoding_era=<EncodingEra.ALL: 63>, chunk_size=65536, max_bytes=200000)
Detect all possible encodings of the given byte string.
Parameters match chardet 6.x for backward compatibility. chunk_size is accepted but has no effect.
When ignore_threshold is False (the default), results with confidence <= MINIMUM_THRESHOLD (0.20) are filtered out. If no result exceeds the threshold, the full unfiltered list is returned as a fallback so the caller always receives at least one result.
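The filtering rule described above can be sketched in plain Python. This is a reconstruction of the documented behavior, not chardet's actual implementation; the function name `filter_by_threshold` is hypothetical, and the candidate dictionaries are assumed to carry a numeric "confidence" key.

```python
MINIMUM_THRESHOLD = 0.20  # value documented for chardet.MINIMUM_THRESHOLD


def filter_by_threshold(results, ignore_threshold=False):
    """Mimic the documented filtering of detect_all() candidates."""
    # Sort by descending confidence, matching the documented return order.
    ranked = sorted(results, key=lambda r: r["confidence"], reverse=True)
    if ignore_threshold:
        return ranked
    # Keep only candidates strictly above the threshold.
    kept = [r for r in ranked if r["confidence"] > MINIMUM_THRESHOLD]
    # Fallback: if nothing clears the threshold, return the unfiltered list.
    return kept or ranked
```

A candidate with confidence exactly 0.20 is dropped, because the documented comparison is `<=`.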
- Parameters:
byte_str (bytes | bytearray) – The byte sequence to detect encoding for.
ignore_threshold (bool) – If True, return all candidate encodings regardless of confidence score.
should_rename_legacy (bool) – If True (the default), remap legacy encoding names to their modern equivalents.
encoding_era (EncodingEra) – Restrict candidate encodings to the given era.
chunk_size (int) – Deprecated – accepted for backward compatibility but has no effect.
max_bytes (int) – Maximum number of bytes to examine from byte_str.
- Returns:
A list of dictionaries, each with keys "encoding", "confidence", and "language", sorted by descending confidence.
- Return type:
list[DetectionDict]
UniversalDetector
- class chardet.UniversalDetector(lang_filter=<LanguageFilter.ALL: 31>, should_rename_legacy=True, encoding_era=<EncodingEra.ALL: 63>, max_bytes=200000)
Streaming character encoding detector.
Implements a feed/close pattern for incremental detection of character encoding from byte streams. Compatible with the chardet 6.x API.
All detection is performed by the same pipeline used by chardet.detect() and chardet.detect_all(), ensuring consistent results regardless of which API is used.
Note
This class is not thread-safe. Each thread should create its own UniversalDetector instance.
- Parameters:
lang_filter (LanguageFilter)
should_rename_legacy (bool)
encoding_era (EncodingEra)
max_bytes (int)
- MINIMUM_THRESHOLD = 0.2
- LEGACY_MAP: ClassVar[MappingProxyType] = mappingproxy({'ascii': 'Windows-1252', 'euc-kr': 'CP949', 'iso-8859-1': 'Windows-1252', 'iso-8859-2': 'Windows-1250', 'iso-8859-5': 'Windows-1251', 'iso-8859-6': 'Windows-1256', 'iso-8859-7': 'Windows-1253', 'iso-8859-8': 'Windows-1255', 'iso-8859-9': 'Windows-1254', 'iso-8859-11': 'CP874', 'iso-8859-13': 'Windows-1257', 'tis-620': 'CP874'})
- feed(byte_str)
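To show how a mapping like LEGACY_MAP could drive the should_rename_legacy behavior, here is a minimal sketch. The mapping is copied from the documented value; the function name `rename_legacy` and the case-insensitive lookup are assumptions for illustration and may differ from what chardet does internally.

```python
from types import MappingProxyType

# Copied from the documented UniversalDetector.LEGACY_MAP value.
LEGACY_MAP = MappingProxyType({
    "ascii": "Windows-1252",
    "euc-kr": "CP949",
    "iso-8859-1": "Windows-1252",
    "iso-8859-2": "Windows-1250",
    "iso-8859-5": "Windows-1251",
    "iso-8859-6": "Windows-1256",
    "iso-8859-7": "Windows-1253",
    "iso-8859-8": "Windows-1255",
    "iso-8859-9": "Windows-1254",
    "iso-8859-11": "CP874",
    "iso-8859-13": "Windows-1257",
    "tis-620": "CP874",
})


def rename_legacy(encoding):
    """Return the modern name for a legacy encoding, or the name unchanged."""
    if encoding is None:
        return None
    # Assumption: lookups are case-insensitive because the keys are lowercase.
    return LEGACY_MAP.get(encoding.lower(), encoding)
```

Names that are not keys of the mapping, such as "utf-8", pass through unchanged.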
Feed a chunk of bytes to the detector.
Data is accumulated in an internal buffer. Once max_bytes have been buffered, done is set to True and further data is ignored until reset() is called.
- Parameters:
byte_str (bytes | bytearray) – The next chunk of bytes to examine.
- Raises:
ValueError – If called after close() without a reset().
- Return type:
None
- close()
Finalize detection and return the best result.
Runs the full detection pipeline on the buffered data.
- Returns:
A dictionary with keys "encoding", "confidence", and "language".
- Return type:
DetectionDict
- reset()
Reset the detector to its initial state for reuse.
- Return type:
None
- property result: DetectionDict
The current best detection result.
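The feed/close pattern documented for UniversalDetector might be used roughly as follows. The helper `detect_stream` is a hypothetical name, and the detector is passed in as an argument so the sketch runs without chardet installed; it relies only on the documented reset(), feed(), done, and close() members.

```python
from typing import Iterable


def detect_stream(chunks: Iterable[bytes], detector):
    """Feed chunks to a UniversalDetector-style object and return its result."""
    detector.reset()  # start from a clean state in case the detector is reused
    for chunk in chunks:
        detector.feed(chunk)
        # done becomes True once max_bytes have been buffered; stop reading.
        if detector.done:
            break
    return detector.close()
```

With chardet installed, one would pass `chardet.UniversalDetector()` as the detector and, for example, an iterator over fixed-size reads of a file opened in binary mode. Because the class is not thread-safe, each thread would need its own detector instance.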
Enumerations
- class chardet.EncodingEra(*values)
Bit flags representing encoding eras for filtering detection candidates.
- MODERN_WEB = 1
- LEGACY_ISO = 2
- LEGACY_MAC = 4
- LEGACY_REGIONAL = 8
- DOS = 16
- MAINFRAME = 32
- ALL = 63
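Because the members are bit flags, eras can be combined with bitwise OR. The stand-in below mirrors the documented values so that it runs without chardet; modelling it as `enum.IntFlag` is an assumption, since the reference only states that the members are bit flags.

```python
import enum


class EncodingEra(enum.IntFlag):
    """Stand-in mirroring the documented chardet.EncodingEra values."""

    MODERN_WEB = 1
    LEGACY_ISO = 2
    LEGACY_MAC = 4
    LEGACY_REGIONAL = 8
    DOS = 16
    MAINFRAME = 32
    ALL = 63  # bitwise OR of the six members above


# Combine eras with bitwise OR to allow several at once.
web_and_iso = EncodingEra.MODERN_WEB | EncodingEra.LEGACY_ISO

# Test membership with bitwise AND (or the `in` operator).
allows_dos = bool(web_and_iso & EncodingEra.DOS)
```

A combined value such as `web_and_iso` is the kind of argument the `encoding_era` parameter is documented to accept for restricting candidates.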
- class chardet.LanguageFilter(*values)
Language filter flags for UniversalDetector (chardet 6.x API compat).
Accepted but not used; our pipeline does not filter by language group.
Deprecated: Retained only for backward compatibility with chardet 6.x callers. Will be removed in a future major version.
- CHINESE_SIMPLIFIED = 1
- CHINESE_TRADITIONAL = 2
- JAPANESE = 4
- KOREAN = 8
- NON_CJK = 16
- ALL = 31
- CHINESE = 3
- CJK = 15
Result Types
- class chardet.DetectionResult(encoding, confidence, language)
A single encoding detection result.
Frozen dataclass holding the encoding name, confidence score, and optional language identifier returned by the detection pipeline.
- to_dict()
Convert this result to a plain dict.
- Returns:
A dict with 'encoding', 'confidence', and 'language' keys.
- Return type:
dict
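The shape described above can be illustrated with a small stand-in. This is not chardet's class: the field types are assumptions (the reference only names the three fields and says language is optional), and `asdict` is simply one convenient way to produce the plain dict.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class DetectionResult:
    """Stand-in with the three documented fields; the types are assumptions."""

    encoding: Optional[str]
    confidence: float
    language: Optional[str]

    def to_dict(self):
        # Plain dict with 'encoding', 'confidence', and 'language' keys.
        return asdict(self)
```

Since the dataclass is frozen, assigning to a field after construction raises an error.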
- class chardet.DetectionDict
Dictionary representation of a detection result.
Returned by chardet.detect(), chardet.detect_all(), and chardet.UniversalDetector.result.
Constants
- chardet.DEFAULT_MAX_BYTES: int = 200000
Default maximum number of bytes to examine during detection.
- chardet.MINIMUM_THRESHOLD: float = 0.20
Default minimum confidence threshold for filtering results in chardet.detect_all().