chardet package
Module contents
- class chardet.EncodingEra(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]
Bases: Flag
This enum represents different eras of character encodings, used to filter which encodings are considered during detection.
The numeric values also serve as preference tiers for tie-breaking when confidence scores are very close. Lower values = more preferred/modern.
MODERN_WEB: UTF-8/16/32, Windows-125x, CP874, KOI8-R/U, CJK multi-byte (widely used on the web)
LEGACY_ISO: ISO-8859-x (legacy but well-known standards)
LEGACY_MAC: Mac-specific encodings (MacRoman, MacCyrillic, etc.)
LEGACY_REGIONAL: Uncommon regional/national encodings (KOI8-T, KZ1048, CP1006, etc.)
DOS: DOS/OEM code pages (CP437, CP850, CP866, etc.)
MAINFRAME: EBCDIC variants (CP037, CP500, etc.)
- ALL = 63
- DOS = 16
- LEGACY_ISO = 2
- LEGACY_MAC = 4
- LEGACY_REGIONAL = 8
- MAINFRAME = 32
- MODERN_WEB = 1
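Because EncodingEra is a Flag, eras can be combined with | and tested with the in operator. The sketch below uses a local stand-in enum that mirrors the values listed above, so it runs without chardet installed. The prefer helper is hypothetical: it shows one way the "lower value wins when confidences are very close" rule could be applied, and is not taken from the library's code.

```python
from enum import Flag


class EncodingEra(Flag):
    """Local stand-in mirroring the documented values (not imported from chardet)."""

    MODERN_WEB = 1
    LEGACY_ISO = 2
    LEGACY_MAC = 4
    LEGACY_REGIONAL = 8
    DOS = 16
    MAINFRAME = 32
    ALL = 63


# Value documented on UniversalDetector.
VERY_CLOSE_THRESHOLD = 0.005


def prefer(a, b):
    """Hypothetical tie-breaker; a and b are (era, confidence) pairs."""
    (era_a, conf_a), (era_b, conf_b) = a, b
    if abs(conf_a - conf_b) < VERY_CLOSE_THRESHOLD:
        # Confidences are very close: the lower (more modern) era tier wins.
        return a if era_a.value <= era_b.value else b
    return a if conf_a > conf_b else b


# Combine two eras into one filter value.
web_and_iso = EncodingEra.MODERN_WEB | EncodingEra.LEGACY_ISO
```

A combined value such as web_and_iso is the kind of argument the encoding_era parameters documented on this page accept.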
- class chardet.UniversalDetector(lang_filter: ~chardet.enums.LanguageFilter = <LanguageFilter.ALL: 31>, should_rename_legacy: bool | None = None, encoding_era: ~chardet.enums.EncodingEra = <EncodingEra.MODERN_WEB: 1>, max_bytes: int = 200000)[source]
Bases: object
The UniversalDetector class underlies the chardet.detect function and coordinates all of the different charset probers.
To get a dict containing an encoding and its confidence, you can simply run:
    u = UniversalDetector()
    u.feed(some_bytes)
    u.close()
    detected = u.result
- ESC_DETECTOR = re.compile(b'(\x1b|~{)')
- HIGH_BYTE_DETECTOR = re.compile(b'[\x80-\xff]')
- ISO_WIN_MAP = {'iso-8859-1': 'Windows-1252', 'iso-8859-13': 'Windows-1257', 'iso-8859-2': 'Windows-1250', 'iso-8859-5': 'Windows-1251', 'iso-8859-6': 'Windows-1256', 'iso-8859-7': 'Windows-1253', 'iso-8859-8': 'Windows-1255', 'iso-8859-9': 'Windows-1254'}
- LEGACY_MAP = {'ascii': 'Windows-1252', 'euc-kr': 'CP949', 'iso-8859-1': 'Windows-1252', 'iso-8859-11': 'CP874', 'iso-8859-13': 'Windows-1257', 'iso-8859-2': 'Windows-1250', 'iso-8859-5': 'Windows-1251', 'iso-8859-6': 'Windows-1256', 'iso-8859-7': 'Windows-1253', 'iso-8859-8': 'Windows-1255', 'iso-8859-9': 'Windows-1254', 'tis-620': 'CP874'}
- MINIMUM_THRESHOLD = 0.2
- VERY_CLOSE_THRESHOLD = 0.005
- property active_probers: list[CharSetProber]
Get a flat list of all active (not falsey and not in NOT_ME state) nested charset probers.
- property charset_probers: list[CharSetProber]
- close() ResultDict[source]
Stop analyzing the current document and come up with a final prediction.
- Returns:
The result attribute, a dict with the keys encoding, confidence, and language.
- feed(byte_str: bytes | bytearray) None[source]
Takes a chunk of a document and feeds it through all of the relevant charset probers.
After calling feed, you can check the value of the done attribute to see if you need to continue feeding the UniversalDetector more data, or if it has made a prediction (in the result attribute).
Note
You should always call close when you’re done feeding in your document if done is not already True.
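The feed/done/close protocol described above can be sketched as a small loop. The function relies only on the documented feed, done, close, and result members. FakeDetector is a hypothetical stand-in so the snippet runs without chardet installed; its BOM check is a placeholder, not the library's detection logic.

```python
def detect_incrementally(detector, chunks):
    """Feed chunks until the detector reports done, then return its result."""
    for chunk in chunks:
        detector.feed(chunk)
        if detector.done:
            break
    if not detector.done:
        # Per the note above: close when feeding is finished and done is not yet True.
        detector.close()
    return detector.result


class FakeDetector:
    """Hypothetical stand-in exposing the same four members as UniversalDetector."""

    def __init__(self):
        self.done = False
        self.fed = 0
        self.result = {"encoding": None, "confidence": 0.0, "language": None}

    def feed(self, byte_str):
        self.fed += 1
        if byte_str.startswith(b"\xef\xbb\xbf"):  # UTF-8 byte order mark
            self.result = {"encoding": "UTF-8-SIG", "confidence": 1.0, "language": ""}
            self.done = True

    def close(self):
        self.done = True
        return self.result
```

With a real detector, chunks would typically come from reading a file in binary mode, one block or line at a time.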
- property has_win_bytes: bool
Check if Windows-specific bytes were detected by the SBCS prober.
- property input_state: int
- property nested_probers: list[CharSetProber]
Get a flat list of all nested charset probers.
- chardet.detect(byte_str: bytes | bytearray, should_rename_legacy: bool | None = None, encoding_era: ~chardet.enums.EncodingEra = <EncodingEra.MODERN_WEB: 1>, chunk_size: int = 65536, max_bytes: int = 200000) ResultDict[source]
Detect the encoding of the given byte string.
- Parameters:
byte_str (bytes or bytearray) – The byte sequence to examine.
should_rename_legacy (bool or None) – Should we rename legacy encodings to their more modern equivalents? If None (default), automatically enabled when encoding_era is MODERN_WEB.
encoding_era (EncodingEra) – Which era of encodings to consider during detection.
chunk_size (int) – Size of chunks to process at a time.
max_bytes (int) – Maximum number of bytes to examine.
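The default for should_rename_legacy can be illustrated with a short sketch. It copies LEGACY_MAP from the UniversalDetector attributes listed on this page. The helper names, the boolean era_is_modern_web parameter, and the case-insensitive lookup are assumptions made for illustration, not the library's code.

```python
# Copied from the LEGACY_MAP attribute documented on UniversalDetector.
LEGACY_MAP = {
    "ascii": "Windows-1252",
    "euc-kr": "CP949",
    "iso-8859-1": "Windows-1252",
    "iso-8859-11": "CP874",
    "iso-8859-13": "Windows-1257",
    "iso-8859-2": "Windows-1250",
    "iso-8859-5": "Windows-1251",
    "iso-8859-6": "Windows-1256",
    "iso-8859-7": "Windows-1253",
    "iso-8859-8": "Windows-1255",
    "iso-8859-9": "Windows-1254",
    "tis-620": "CP874",
}


def resolve_rename(should_rename_legacy, era_is_modern_web):
    """None means: rename only when the era filter is MODERN_WEB."""
    if should_rename_legacy is None:
        return era_is_modern_web
    return should_rename_legacy


def rename_if_requested(encoding, should_rename_legacy=None, era_is_modern_web=True):
    """Hypothetical helper: map a detected name to its modern equivalent."""
    if encoding is None or not resolve_rename(should_rename_legacy, era_is_modern_web):
        return encoding
    # Assumption: the map is consulted with a lowercased name.
    return LEGACY_MAP.get(encoding.lower(), encoding)
```

Under these assumptions, names that have no entry in the map pass through unchanged.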
- chardet.detect_all(byte_str: bytes | bytearray, ignore_threshold: bool = False, should_rename_legacy: bool | None = None, encoding_era: ~chardet.enums.EncodingEra = <EncodingEra.MODERN_WEB: 1>, chunk_size: int = 65536, max_bytes: int = 200000) list[ResultDict][source]
Detect all the possible encodings of the given byte string.
- Parameters:
byte_str (bytes or bytearray) – The byte sequence to examine.
ignore_threshold (bool) – Include encodings that are below UniversalDetector.MINIMUM_THRESHOLD in results.
should_rename_legacy (bool or None) – Should we rename legacy encodings to their more modern equivalents? If None (default), automatically enabled when encoding_era is MODERN_WEB.
encoding_era (EncodingEra) – Which era of encodings to consider during detection.
chunk_size (int) – Size of chunks to process at a time.
max_bytes (int) – Maximum number of bytes to examine.
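The effect of ignore_threshold can be sketched as a filter over candidate results. The threshold value comes from UniversalDetector.MINIMUM_THRESHOLD above. The function name, the sample candidates, and the descending sort are illustrative assumptions rather than the library's implementation.

```python
MINIMUM_THRESHOLD = 0.2  # documented on UniversalDetector


def filter_candidates(candidates, ignore_threshold=False):
    """Drop candidates below the threshold unless ignore_threshold is set."""
    kept = [
        c for c in candidates
        if ignore_threshold or c["confidence"] >= MINIMUM_THRESHOLD
    ]
    # Assumption: the most confident candidate is listed first.
    return sorted(kept, key=lambda c: c["confidence"], reverse=True)


# Made-up sample data, not real detector output.
sample = [
    {"encoding": "Windows-1252", "confidence": 0.73, "language": "French"},
    {"encoding": "KOI8-R", "confidence": 0.05, "language": "Russian"},
    {"encoding": "UTF-8", "confidence": 0.87, "language": ""},
]
```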
Submodules
chardet.enums module
All of the Enums that are used throughout the chardet package.
- author:
Dan Blanchard (dan.blanchard@gmail.com)
- class chardet.enums.CharacterCategory(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]
Bases: IntEnum
This enum represents the different categories language models for SingleByteCharSetProber put characters into.
Anything less than DIGIT is considered a letter.
- CONTROL = 254
- DIGIT = 251
- LINE_BREAK = 252
- SYMBOL = 253
- UNDEFINED = 255
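The "anything less than DIGIT is a letter" rule amounts to a single comparison. The sketch below uses a local stand-in enum with the values listed above; is_letter is a hypothetical helper name.

```python
from enum import IntEnum


class CharacterCategory(IntEnum):
    """Local stand-in mirroring the documented values."""

    DIGIT = 251
    LINE_BREAK = 252
    SYMBOL = 253
    CONTROL = 254
    UNDEFINED = 255


def is_letter(order):
    """An order below DIGIT is a letter's frequency rank, not a category."""
    return order < CharacterCategory.DIGIT
```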
- class chardet.enums.EncodingEra(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]
Bases: Flag
This enum represents different eras of character encodings, used to filter which encodings are considered during detection.
The numeric values also serve as preference tiers for tie-breaking when confidence scores are very close. Lower values = more preferred/modern.
MODERN_WEB: UTF-8/16/32, Windows-125x, CP874, KOI8-R/U, CJK multi-byte (widely used on the web)
LEGACY_ISO: ISO-8859-x (legacy but well-known standards)
LEGACY_MAC: Mac-specific encodings (MacRoman, MacCyrillic, etc.)
LEGACY_REGIONAL: Uncommon regional/national encodings (KOI8-T, KZ1048, CP1006, etc.)
DOS: DOS/OEM code pages (CP437, CP850, CP866, etc.)
MAINFRAME: EBCDIC variants (CP037, CP500, etc.)
- ALL = 63
- DOS = 16
- LEGACY_ISO = 2
- LEGACY_MAC = 4
- LEGACY_REGIONAL = 8
- MAINFRAME = 32
- MODERN_WEB = 1
- class chardet.enums.InputState(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]
Bases: IntEnum
This enum represents the different states a universal detector can be in.
- ESC_ASCII = 1
- HIGH_BYTE = 2
- PURE_ASCII = 0
- class chardet.enums.LanguageFilter(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]
Bases: Flag
This enum represents the different language filters we can apply to a UniversalDetector.
- ALL = 31
- CHINESE = 3
- CHINESE_SIMPLIFIED = 1
- CHINESE_TRADITIONAL = 2
- CJK = 15
- JAPANESE = 4
- KOREAN = 8
- NON_CJK = 16
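The composite values above are consistent with bitwise unions of the single-language flags. The sketch below checks that arithmetic with a local stand-in enum; the union definitions and the wants helper are assumptions for illustration.

```python
from enum import Flag


class LanguageFilter(Flag):
    """Local stand-in; composites are written as unions (an assumption)."""

    CHINESE_SIMPLIFIED = 1
    CHINESE_TRADITIONAL = 2
    JAPANESE = 4
    KOREAN = 8
    NON_CJK = 16
    CHINESE = CHINESE_SIMPLIFIED | CHINESE_TRADITIONAL
    CJK = CHINESE | JAPANESE | KOREAN
    ALL = CJK | NON_CJK


def wants(lang_filter, language):
    """True when the filter includes the given language flag."""
    return bool(lang_filter & language)
```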
- class chardet.enums.MachineState(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]
Bases: IntEnum
This enum represents the different states a state machine can be in.
- ERROR = 1
- ITS_ME = 2
- START = 0
chardet.resultdict module
chardet.universaldetector module
Module containing the UniversalDetector detector class, which is the primary
class a user of chardet should use.
- author:
Mark Pilgrim (initial port to Python)
- author:
Shy Shalom (original C code)
- author:
Dan Blanchard (major refactoring for 3.0)
- author:
Ian Cordasco
- class chardet.universaldetector.UniversalDetector(lang_filter: ~chardet.enums.LanguageFilter = <LanguageFilter.ALL: 31>, should_rename_legacy: bool | None = None, encoding_era: ~chardet.enums.EncodingEra = <EncodingEra.MODERN_WEB: 1>, max_bytes: int = 200000)[source]
Bases: object
The UniversalDetector class underlies the chardet.detect function and coordinates all of the different charset probers.
To get a dict containing an encoding and its confidence, you can simply run:
    u = UniversalDetector()
    u.feed(some_bytes)
    u.close()
    detected = u.result
- ESC_DETECTOR = re.compile(b'(\x1b|~{)')
- HIGH_BYTE_DETECTOR = re.compile(b'[\x80-\xff]')
- ISO_WIN_MAP = {'iso-8859-1': 'Windows-1252', 'iso-8859-13': 'Windows-1257', 'iso-8859-2': 'Windows-1250', 'iso-8859-5': 'Windows-1251', 'iso-8859-6': 'Windows-1256', 'iso-8859-7': 'Windows-1253', 'iso-8859-8': 'Windows-1255', 'iso-8859-9': 'Windows-1254'}
- LEGACY_MAP = {'ascii': 'Windows-1252', 'euc-kr': 'CP949', 'iso-8859-1': 'Windows-1252', 'iso-8859-11': 'CP874', 'iso-8859-13': 'Windows-1257', 'iso-8859-2': 'Windows-1250', 'iso-8859-5': 'Windows-1251', 'iso-8859-6': 'Windows-1256', 'iso-8859-7': 'Windows-1253', 'iso-8859-8': 'Windows-1255', 'iso-8859-9': 'Windows-1254', 'tis-620': 'CP874'}
- MINIMUM_THRESHOLD = 0.2
- VERY_CLOSE_THRESHOLD = 0.005
- property active_probers: list[CharSetProber]
Get a flat list of all active (not falsey and not in NOT_ME state) nested charset probers.
- property charset_probers: list[CharSetProber]
- close() ResultDict[source]
Stop analyzing the current document and come up with a final prediction.
- Returns:
The result attribute, a dict with the keys encoding, confidence, and language.
- feed(byte_str: bytes | bytearray) None[source]
Takes a chunk of a document and feeds it through all of the relevant charset probers.
After calling feed, you can check the value of the done attribute to see if you need to continue feeding the UniversalDetector more data, or if it has made a prediction (in the result attribute).
Note
You should always call close when you’re done feeding in your document if done is not already True.
- property has_win_bytes: bool
Check if Windows-specific bytes were detected by the SBCS prober.
- property input_state: int
- property nested_probers: list[CharSetProber]
Get a flat list of all nested charset probers.
- reset() None[source]
Reset the UniversalDetector and all of its probers back to their initial states. This is called by __init__, so you only need to call this directly in between analyses of different documents.
- result: ResultDict
chardet.charsetprober module
- class chardet.charsetprober.CharSetProber(*, lang_filter: ~chardet.enums.LanguageFilter = <LanguageFilter.ALL: 31>, encoding_era: ~chardet.enums.EncodingEra = <EncodingEra.ALL: 63>)[source]
Bases: object
- SHORTCUT_THRESHOLD = 0.95
- property charset_name: str | None
- feed(byte_str: bytes | bytearray) ProbingState[source]
- static filter_international_words(buf: bytes | bytearray) bytearray[source]
Filter out ASCII-only words for non-Latin scripts.
Byte classes:
  alphabet: ASCII letters [a-zA-Z]
  international: bytes with high bit set [\x80-\xFF]
  marker: everything else [^a-zA-Z\x80-\xFF]
The buffer is treated as a sequence of “words” separated by marker bytes. We KEEP only those words that contain at least one high-byte character, i.e. match the pattern: optional ASCII prefix + >=1 high-byte + optional ASCII suffix, plus at most one trailing marker. Pure ASCII words are discarded as noise when the target language model excludes ASCII letters (“English words in other-language pages” — paper §4.7 summary).
Why we retain surrounding ASCII letters instead of stripping them:
  Preserves real adjacency for bigram modeling around high-byte letters.
  Avoids creating artificial bigrams between non-adjacent high-byte chars.
Trailing marker normalization: a single marker at word end is converted to a space if it is an ASCII punctuation/control, collapsing runs of markers into one delimiter (reduces noise like repeated punctuation or HTML artifacts).
Usage is conditional: callers apply this ONLY when the language model’s keep_ascii_letters is False (see SingleByteCharSetProber.feed). Latin-script languages skip this and instead use remove_xml_tags.
This behavior mirrors the original universalchardet / uchardet approach and aligns with the training pipeline which excludes ASCII letters for non-Latin alphabets.
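The word filter described above can be sketched with one regular expression over bytes. This minimal version is written from the description (optional ASCII prefix, one or more high bytes, optional ASCII suffix, at most one trailing marker). The names are hypothetical and the library's actual implementation may differ in detail.

```python
import re

# Optional ASCII prefix, >=1 high byte, optional ASCII suffix, optional marker.
INTERNATIONAL_WORD = re.compile(
    rb"[a-zA-Z]*[\x80-\xFF]+[a-zA-Z]*[^a-zA-Z\x80-\xFF]?"
)


def keep_international_words(buf):
    """Keep only words containing a high byte; pure ASCII words are dropped."""
    filtered = bytearray()
    for word in INTERNATIONAL_WORD.findall(bytes(buf)):
        body, last = word[:-1], word[-1:]
        # A trailing ASCII marker (punctuation, control, digit) becomes a space.
        if not last.isalpha() and last < b"\x80":
            last = b" "
        filtered += body + last
    return filtered
```

For the input b"hello \xe4\xf6\xfc, world abc\xe9!" the two ASCII words are dropped and the two words containing high bytes are kept, each followed by a single space.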
- property language: str | None
- static remove_xml_tags(buf: bytes | bytearray) bytearray[source]
Returns a copy of buf that retains only the sequences of English alphabet and high byte characters that are not between <> characters. This filter can be applied to all scripts which contain both English characters and extended ASCII characters, but is currently only used by Latin1Prober.
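As a rough illustration of that description, the sketch below removes anything between < and >, then keeps runs of ASCII letters and high bytes joined by single spaces. It is a simplified reading of the docstring with hypothetical names; the library's method may treat delimiters and unterminated tags differently.

```python
import re

TAG = re.compile(rb"<[^>]*>")
KEPT_RUN = re.compile(rb"[a-zA-Z\x80-\xFF]+")


def strip_tags_keep_letters(buf):
    """Remove <...> spans, then keep letter/high-byte runs separated by spaces."""
    without_tags = TAG.sub(b" ", bytes(buf))
    return bytearray(b" ".join(KEPT_RUN.findall(without_tags)))
```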
- property state: ProbingState
chardet.charsetgroupprober module
- class chardet.charsetgroupprober.CharSetGroupProber(*, lang_filter: ~chardet.enums.LanguageFilter = <LanguageFilter.ALL: 31>, encoding_era: ~chardet.enums.EncodingEra = <EncodingEra.ALL: 63>)[source]
Bases: CharSetProber
- property charset_name: str | None
- feed(byte_str: bytes | bytearray) ProbingState[source]
- property language: str | None
chardet.codingstatemachine module
- class chardet.codingstatemachine.CodingStateMachine(sm: CodingStateMachineDict)[source]
Bases: object
A state machine to verify a byte sequence for a particular encoding. For each byte the detector receives, it will feed that byte to every active state machine available, one byte at a time. The state machine changes its state based on its previous state and the byte it receives. There are 3 states in a state machine that are of interest to an auto-detector:
- START state: This is the state to start with, or the state reached once a legal byte sequence (i.e. a valid code point) for a character has been identified.
- ME state: This indicates that the state machine identified a byte sequence that is specific to the charset it is designed for and that there is no other possible encoding which can contain this byte sequence. This will lead to an immediate positive answer for the detector.
- ERROR state: This indicates the state machine identified an illegal byte sequence for that encoding. This will lead to an immediate negative answer for this encoding. The detector will exclude this encoding from consideration from here on.
- property language: str
chardet.codingstatemachinedict module
- class chardet.codingstatemachinedict.CodingStateMachineDict[source]
Bases: TypedDict
- char_len_table: tuple[int, ...]
- class_factor: int
- class_table: tuple[int, ...]
- language: str
- name: str
- state_table: tuple[int, ...]
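The state machine description and the CodingStateMachineDict fields can be tied together with a toy model. The sketch below assumes the next state is looked up at index state * class_factor + byte_class in state_table. The toy encoding (ASCII bytes stand alone, high bytes come in pairs), the EXPECT_TRAIL state, and run_machine are all hypothetical.

```python
from enum import IntEnum


class MachineState(IntEnum):
    """Local stand-in mirroring chardet.enums.MachineState."""

    START = 0
    ERROR = 1
    ITS_ME = 2


EXPECT_TRAIL = 3  # hypothetical intermediate state: a lead byte was seen

TOY_MODEL = {
    # Byte class 0: ASCII (0x00-0x7F); byte class 1: high byte (0x80-0xFF).
    "class_table": tuple(0 if b < 0x80 else 1 for b in range(256)),
    "class_factor": 2,
    "state_table": (
        MachineState.START, EXPECT_TRAIL,          # from START
        MachineState.ERROR, MachineState.ERROR,    # from ERROR
        MachineState.ITS_ME, MachineState.ITS_ME,  # from ITS_ME
        MachineState.ERROR, MachineState.START,    # from EXPECT_TRAIL
    ),
    "char_len_table": (1, 2),
    "name": "toy-double-byte",
    "language": "",
}


def run_machine(model, data):
    """Feed bytes one at a time; stop early on ERROR or ITS_ME."""
    state = MachineState.START
    for byte in data:
        byte_class = model["class_table"][byte]
        state = model["state_table"][state * model["class_factor"] + byte_class]
        if state in (MachineState.ERROR, MachineState.ITS_ME):
            break
    return state
```

This toy model never reaches ITS_ME because nothing in it is unique to one encoding. A real model would map a charset-specific sequence, such as an escape sequence, to that state.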
Multi-byte encoding probers
chardet.mbcharsetprober module
- class chardet.mbcharsetprober.MultiByteCharSetProber(lang_filter: ~chardet.enums.LanguageFilter = <LanguageFilter.ALL: 31>, encoding_era: ~chardet.enums.EncodingEra = <EncodingEra.ALL: 63>)[source]
Bases: CharSetProber
- feed(byte_str: bytes | bytearray) ProbingState[source]
chardet.mbcsgroupprober module
- class chardet.mbcsgroupprober.MBCSGroupProber(*, lang_filter: ~chardet.enums.LanguageFilter = <LanguageFilter.ALL: 31>, encoding_era: ~chardet.enums.EncodingEra = <EncodingEra.ALL: 63>)[source]
Bases:
CharSetGroupProber
chardet.mbcssm module
- chardet.mbcssm.BIG5_SM_MODEL: CodingStateMachineDict = {'char_len_table': (0, 1, 1, 2, 0), 'class_factor': 5, 'class_table': (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 0), 'name': 'Big5', 'state_table': (MachineState.ERROR, MachineState.START, MachineState.START, 3, MachineState.ERROR, MachineState.ERROR, MachineState.ERROR, MachineState.ERROR, MachineState.ERROR, MachineState.ERROR, MachineState.ITS_ME, MachineState.ITS_ME, MachineState.ITS_ME, MachineState.ITS_ME, MachineState.ITS_ME, MachineState.ERROR, MachineState.ERROR, MachineState.START, MachineState.START, MachineState.START, MachineState.START, MachineState.START, MachineState.START, MachineState.START)}
# Classes
  0: Unused
  1: 00-40, 5B-60, 7B-7F : Ascii
  2: C7-FD
  3: C9,FE : User-Defined Area
  4: 41-52
  5: 53-5A, 61-7A
  6: 81-A0
  7: A1-AC, B0-C5
  8: AD-AF
  9: C6
# Byte 1
  Ascii: 00-7F : 1 + 4 + 5
  State 3: 81-AC, B0-C5 : 6 + 7
  State 4: AD-AF : 8
  State 5: C6 : 9
  State 6: C7-FE : 2 (+ 3)
# Byte 2
  State 3: 41-5A, 61-7A, 81-FE : 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9
  State 4: 41-5A, 61-7A, 81-A0 : 4 + 5 + 6
  State 5: 41-52, A1-FE : 2 + 3 + 4 + 7 + 8 + 9
  State 6: A1-FE : 2 + 3 + 7 + 8 + 9
chardet.utf8prober module
chardet.utf1632prober module
- class chardet.utf1632prober.UTF1632Prober[source]
Bases: CharSetProber
This class simply looks for occurrences of zero bytes and infers whether the file is UTF-16 or UTF-32 (little-endian or big-endian). For instance, files looking like ( \0 \0 \0 [nonzero] )+ have a good probability of being UTF-32BE. Files looking like ( \0 [nonzero] )+ may be guessed to be UTF-16BE, and inversely for the little-endian varieties.
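The zero-byte heuristic can be sketched by measuring how often each byte position within a 2- or 4-byte unit is zero. The function names and the 0.9 cutoff below are assumptions chosen for illustration. The sketch does not use the class's own EXPECTED_RATIO or MIN_RATIO constants, whose exact role is not described here.

```python
def zero_byte_profile(data, width):
    """Fraction of zero bytes at each position within width-sized units."""
    usable = len(data) - len(data) % width
    if usable == 0:
        return [0.0] * width
    units = usable // width
    return [
        sum(1 for i in range(pos, usable, width) if data[i] == 0) / units
        for pos in range(width)
    ]


def guess_utf_variant(data, threshold=0.9):
    """Hypothetical guesser based only on where zero bytes appear."""

    def mostly_zero(v):
        return v >= threshold

    def mostly_nonzero(v):
        return v <= 1 - threshold

    p4 = zero_byte_profile(data, 4)
    if all(mostly_zero(v) for v in p4[:3]) and mostly_nonzero(p4[3]):
        return "utf-32-be"
    if mostly_nonzero(p4[0]) and all(mostly_zero(v) for v in p4[1:]):
        return "utf-32-le"
    p2 = zero_byte_profile(data, 2)
    if mostly_zero(p2[0]) and mostly_nonzero(p2[1]):
        return "utf-16-be"
    if mostly_nonzero(p2[0]) and mostly_zero(p2[1]):
        return "utf-16-le"
    return None
```

This works best on text that is mostly ASCII or Latin. Text in scripts whose code units rarely contain a zero byte would need a lower threshold or a different signal.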
- EXPECTED_RATIO = 0.94
- MIN_CHARS_FOR_DETECTION = 20
- MIN_RATIO = 0.08
- property charset_name: str
- feed(byte_str: bytes | bytearray) ProbingState[source]
- property language: str
- property state: ProbingState
- validate_utf16_characters(pair: list[int]) None[source]
Validate if the pair of bytes is valid UTF-16.
UTF-16 is valid in the range 0x0000 - 0xFFFF excluding 0xD800 - 0xDFFF, with an exception for surrogate pairs, which must be a value in the range 0xD800-0xDBFF followed by a value in the range 0xDC00-0xDFFF.
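The surrogate rule can be checked over a sequence of 16-bit code units as follows. This is a standalone sketch of the rule itself. The function name and the list-of-units input are assumptions, and the method above takes a pair of bytes rather than whole code units.

```python
def utf16_units_are_valid(units):
    """True if every surrogate is part of a well-formed high/low pair."""
    expecting_low = False
    for unit in units:
        if expecting_low:
            if not 0xDC00 <= unit <= 0xDFFF:
                return False  # high surrogate not followed by a low one
            expecting_low = False
        elif 0xD800 <= unit <= 0xDBFF:
            expecting_low = True  # high surrogate: the next unit must be low
        elif 0xDC00 <= unit <= 0xDFFF:
            return False  # low surrogate with no preceding high surrogate
    # A trailing high surrogate with nothing after it is also invalid.
    return not expecting_low
```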
chardet.big5prober module
- class chardet.big5prober.Big5Prober[source]
Bases: MultiByteCharSetProber
- property charset_name: str
- property language: str
chardet.gb18030prober module
- class chardet.gb18030prober.GB18030Prober[source]
Bases: MultiByteCharSetProber
- property charset_name: str
- property language: str
chardet.eucjpprober module
chardet.euckrprober module
- class chardet.euckrprober.EUCKRProber[source]
Bases: MultiByteCharSetProber
- property charset_name: str
- property language: str
chardet.cp949prober module
- class chardet.cp949prober.CP949Prober[source]
Bases: MultiByteCharSetProber
- property charset_name: str
- property language: str
chardet.sjisprober module
chardet.johabprober module
- class chardet.johabprober.JOHABProber[source]
Bases: MultiByteCharSetProber
- property charset_name: str
- property language: str
chardet.escprober module
- class chardet.escprober.EscCharSetProber(lang_filter: ~chardet.enums.LanguageFilter = <LanguageFilter.ALL: 31>, encoding_era: ~chardet.enums.EncodingEra = <EncodingEra.ALL: 63>)[source]
Bases: CharSetProber
This CharSetProber uses a “code scheme” approach for detecting encodings, whereby easily recognizable escape or shift sequences are relied on to identify these encodings.
- property charset_name: str | None
- feed(byte_str: bytes | bytearray) ProbingState[source]
- property language: str | None
chardet.escsm module
Single-byte encoding probers
chardet.sbcharsetprober module
- class chardet.sbcharsetprober.SingleByteCharSetModel(charset_name, language, char_to_order_map, language_model, typical_positive_ratio, keep_ascii_letters, alphabet)[source]
Bases: NamedTuple
- alphabet: str
Alias for field number 6
- char_to_order_map: Mapping[int, CharacterCategory | int]
Alias for field number 2
- charset_name: str
Alias for field number 0
- keep_ascii_letters: bool
Alias for field number 5
- language: str
Alias for field number 1
- language_model: Mapping[int, Mapping[int, SequenceLikelihood | int]]
Alias for field number 3
- typical_positive_ratio: float
Alias for field number 4
- class chardet.sbcharsetprober.SingleByteCharSetProber(model: SingleByteCharSetModel, is_reversed: bool = False, name_prober: CharSetProber | None = None)[source]
Bases: CharSetProber
- NEGATIVE_SHORTCUT_THRESHOLD = 0.05
- POSITIVE_SHORTCUT_THRESHOLD = 0.95
- SB_ENOUGH_REL_THRESHOLD = 1024
- property charset_name: str | None
- feed(byte_str: bytes | bytearray) ProbingState[source]
- property language: str | None
chardet.sbcsgroupprober module
- class chardet.sbcsgroupprober.SBCSGroupProber(lang_filter: ~chardet.enums.LanguageFilter = <LanguageFilter.ALL: 31>, encoding_era: ~chardet.enums.EncodingEra = <EncodingEra.MODERN_WEB: 1>)[source]
Bases: CharSetGroupProber
- feed(byte_str: bytes | bytearray) ProbingState[source]
chardet.hebrewprober module
- class chardet.hebrewprober.HebrewProber[source]
Bases: CharSetProber
- FINAL_KAF = 234
- FINAL_MEM = 237
- FINAL_NUN = 239
- FINAL_PE = 243
- FINAL_TSADI = 245
- LOGICAL_HEBREW_NAME = 'WINDOWS-1255'
- MIN_FINAL_CHAR_DISTANCE = 5
- MIN_MODEL_DISTANCE = 0.01
- NORMAL_KAF = 235
- NORMAL_MEM = 238
- NORMAL_NUN = 240
- NORMAL_PE = 244
- NORMAL_TSADI = 246
- SPACE = 32
- VISUAL_HEBREW_NAME = 'ISO-8859-8'
- property charset_name: str
- feed(byte_str: bytes | bytearray) ProbingState[source]
- property language: str
- set_model_probers(logical_prober: SingleByteCharSetProber, visual_prober: SingleByteCharSetProber) None[source]
- property state: ProbingState
Analysis modules
chardet.chardistribution module
- class chardet.chardistribution.Big5DistributionAnalysis[source]
Bases:
CharDistributionAnalysis
- class chardet.chardistribution.CharDistributionAnalysis[source]
Bases: object
- ENOUGH_DATA_THRESHOLD = 1024
- MINIMUM_DATA_THRESHOLD = 3
- SURE_NO = 0.01
- SURE_YES = 0.99
- class chardet.chardistribution.EUCJPDistributionAnalysis[source]
Bases:
CharDistributionAnalysis
- class chardet.chardistribution.EUCKRDistributionAnalysis[source]
Bases:
CharDistributionAnalysis
- class chardet.chardistribution.GB2312DistributionAnalysis[source]
Bases:
CharDistributionAnalysis
- class chardet.chardistribution.JOHABDistributionAnalysis[source]
Bases:
CharDistributionAnalysis
- class chardet.chardistribution.SJISDistributionAnalysis[source]
Bases:
CharDistributionAnalysis
chardet.jpcntx module
- class chardet.jpcntx.EUCJPContextAnalysis[source]
Bases:
JapaneseContextAnalysis
- class chardet.jpcntx.JapaneseContextAnalysis[source]
Bases: object
- DONT_KNOW = -1
- ENOUGH_REL_THRESHOLD = 100
- MAX_REL_THRESHOLD = 1000
- MINIMUM_DATA_THRESHOLD = 4
- NUM_OF_CATEGORY = 6
- class chardet.jpcntx.SJISContextAnalysis[source]
Bases: JapaneseContextAnalysis
- property charset_name: str
Frequency tables
chardet.big5freq module
chardet.euckrfreq module
chardet.gb2312freq module
chardet.jisfreq module
chardet.johabfreq module
Language models
These modules contain bigram frequency models for single-byte encoding
detection. They are generated by create_language_model.py and should
not be edited manually.
CLI module
chardet.cli.chardetect module
Script which takes one or more file paths and reports on their detected encodings
Example:
% chardetect somefile someotherfile
somefile: windows-1252 with confidence 0.5
someotherfile: ascii with confidence 1.0
If no paths are provided, it takes its input from stdin.
- chardet.cli.chardetect.description_of(lines: ~collections.abc.Iterable[bytes], name: str = 'stdin', minimal: bool = False, should_rename_legacy: bool = False, encoding_era: ~chardet.enums.EncodingEra = <EncodingEra.MODERN_WEB: 1>) str | None[source]
Return a string describing the probable encoding of a file or list of strings.
- Parameters:
lines (Iterable of bytes) – The lines to get the encoding of.
name (str) – Name of file or collection of lines
should_rename_legacy (bool) – Should we rename legacy encodings to their more modern equivalents?
encoding_era (EncodingEra) – Which era of encodings to consider during detection.
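The output lines in the example at the top of this module follow a simple pattern, which can be sketched as a formatter over a result dict. The describe function and its "no result" fallback wording are hypothetical. Only the "name: encoding with confidence value" shape is taken from the example above.

```python
def describe(name, result):
    """Format a result dict in the style of the chardetect example output."""
    encoding = result.get("encoding")
    if encoding is None:
        # Hypothetical fallback wording for an undetected encoding.
        return f"{name}: no result"
    return f"{name}: {encoding} with confidence {result['confidence']}"
```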
- chardet.cli.chardetect.main(argv: list[str] | None = None) None[source]
Handles command line arguments and gets things started.
- Parameters:
argv (list of str) – List of arguments, as if specified on the command-line. If None, sys.argv[1:] is used instead.
Metadata subpackage
chardet.metadata.charsets module
Metadata about charsets used by our model training code and test file generation code. Could be used for other things in the future.
- class chardet.metadata.charsets.Charset(name: str, is_multi_byte: bool, encoding_era: EncodingEra, language_filter: LanguageFilter)[source]
Bases: object
Metadata about charsets useful for training models and generating test files.
- encoding_era: EncodingEra
- is_multi_byte: bool
- language_filter: LanguageFilter
- name: str
chardet.metadata.languages module
Metadata about languages used by our model training code for our SingleByteCharSetProbers. Could be used for other things in the future.
This code was originally based on the language metadata from the uchardet project.
- class chardet.metadata.languages.Language(name: str, iso_code: str, use_ascii: bool, charsets: list[str], alphabet: str, num_training_docs: int | None = None, num_training_chars: int | None = None)[source]
Bases: object
Metadata about a language useful for training models
- Variables:
name – The human name for the language, in English.
iso_code – 2-letter ISO 639-1 if possible, 3-letter ISO code otherwise, or use another catalog as a last resort.
use_ascii – Whether or not ASCII letters should be included in trained models.
charsets – The charsets we want to support and create data for.
alphabet – The characters in the language’s alphabet. If use_ascii is True, you only need to add those not in the ASCII set.
num_training_docs – Number of documents from CulturaX to use for training. This represents approximately 300M characters of training data. None means the count hasn’t been determined yet.
num_training_chars – Number of characters from CulturaX used for training. The goal is for this to be at least 300M characters, but some languages may not have that much data available. None means the count hasn’t been determined yet.
- alphabet: str
- charsets: list[str]
- iso_code: str
- name: str
- num_training_chars: int | None = None
- num_training_docs: int | None = None
- use_ascii: bool
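The interaction between use_ascii and alphabet can be illustrated with a stand-in dataclass carrying the same fields. The full_alphabet helper and the two sample entries, including their charsets and alphabet strings, are illustrative assumptions and not data copied from the package.

```python
import string
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Language:
    """Local stand-in with the documented fields (not imported from chardet)."""

    name: str
    iso_code: str
    use_ascii: bool
    charsets: list
    alphabet: str
    num_training_docs: Optional[int] = None
    num_training_chars: Optional[int] = None


def full_alphabet(language):
    """Hypothetical helper: add ASCII letters when use_ascii is True."""
    letters = set(language.alphabet)
    if language.use_ascii:
        letters |= set(string.ascii_letters)
    return "".join(sorted(letters))


# Illustrative entries only.
GERMAN = Language("German", "de", True, ["ISO-8859-1", "Windows-1252"], "ÄÖÜßäöü")
RUSSIAN_SAMPLE = Language("Russian", "ru", False, ["KOI8-R"], "абв")
```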