chardet package

Module contents

class chardet.EncodingEra(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: Flag

This enum represents different eras of character encodings, used to filter which encodings are considered during detection.

The numeric values also serve as preference tiers for tie-breaking when confidence scores are very close. Lower values = more preferred/modern.

  • MODERN_WEB: UTF-8/16/32, Windows-125x, CP874, KOI8-R/U, CJK multi-byte (widely used on the web)

  • LEGACY_ISO: ISO-8859-x (legacy but well-known standards)

  • LEGACY_MAC: Mac-specific encodings (MacRoman, MacCyrillic, etc.)

  • LEGACY_REGIONAL: Uncommon regional/national encodings (KOI8-T, KZ1048, CP1006, etc.)

  • DOS: DOS/OEM code pages (CP437, CP850, CP866, etc.)

  • MAINFRAME: EBCDIC variants (CP037, CP500, etc.)

ALL = 63
DOS = 16
LEGACY_ISO = 2
LEGACY_MAC = 4
LEGACY_REGIONAL = 8
MAINFRAME = 32
MODERN_WEB = 1
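
Because EncodingEra is a Flag, eras can be combined with bitwise operators. The sketch below uses a local stand-in class that copies the member values listed above, so it runs without chardet installed; with the package available you would import EncodingEra from chardet instead.

```python
from enum import Flag


# Local stand-in mirroring the documented member values (an assumption made
# so the example is self-contained; it is not the library's own class).
class EncodingEra(Flag):
    MODERN_WEB = 1
    LEGACY_ISO = 2
    LEGACY_MAC = 4
    LEGACY_REGIONAL = 8
    DOS = 16
    MAINFRAME = 32
    ALL = 63


# Combine eras with | to widen the set of candidate encodings.
web_and_iso = EncodingEra.MODERN_WEB | EncodingEra.LEGACY_ISO

# Membership can be tested with the `in` operator.
print(EncodingEra.LEGACY_ISO in web_and_iso)  # True
print(EncodingEra.DOS in web_and_iso)         # False
print(web_and_iso.value)                      # 3
```

A combined value like this is what one would pass as the encoding_era argument of detect or UniversalDetector.
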
class chardet.UniversalDetector(lang_filter: ~chardet.enums.LanguageFilter = <LanguageFilter.ALL: 31>, should_rename_legacy: bool | None = None, encoding_era: ~chardet.enums.EncodingEra = <EncodingEra.MODERN_WEB: 1>, max_bytes: int = 200000)[source]

Bases: object

The UniversalDetector class underlies the chardet.detect function and coordinates all of the different charset probers.

To get a dict containing an encoding and its confidence, you can simply run:

from chardet import UniversalDetector

u = UniversalDetector()
u.feed(some_bytes)  # some_bytes: the bytes (or bytearray) to analyze
u.close()
detected = u.result

ESC_DETECTOR = re.compile(b'(\x1b|~{)')
HIGH_BYTE_DETECTOR = re.compile(b'[\x80-\xff]')
ISO_WIN_MAP = {'iso-8859-1': 'Windows-1252', 'iso-8859-13': 'Windows-1257', 'iso-8859-2': 'Windows-1250', 'iso-8859-5': 'Windows-1251', 'iso-8859-6': 'Windows-1256', 'iso-8859-7': 'Windows-1253', 'iso-8859-8': 'Windows-1255', 'iso-8859-9': 'Windows-1254'}
LEGACY_MAP = {'ascii': 'Windows-1252', 'euc-kr': 'CP949', 'iso-8859-1': 'Windows-1252', 'iso-8859-11': 'CP874', 'iso-8859-13': 'Windows-1257', 'iso-8859-2': 'Windows-1250', 'iso-8859-5': 'Windows-1251', 'iso-8859-6': 'Windows-1256', 'iso-8859-7': 'Windows-1253', 'iso-8859-8': 'Windows-1255', 'iso-8859-9': 'Windows-1254', 'tis-620': 'CP874'}
MINIMUM_THRESHOLD = 0.2
VERY_CLOSE_THRESHOLD = 0.005
property active_probers: list[CharSetProber]

Get a flat list of all active (not falsey and not in NOT_ME state) nested charset probers.

property charset_probers: list[CharSetProber]
close() ResultDict[source]

Stop analyzing the current document and come up with a final prediction.

Returns:

The result attribute, a dict with the keys encoding, confidence, and language.

feed(byte_str: bytes | bytearray) None[source]

Takes a chunk of a document and feeds it through all of the relevant charset probers.

After calling feed, you can check the value of the done attribute to see if you need to continue feeding the UniversalDetector more data, or if it has made a prediction (in the result attribute).

Note

You should always call close when you’re done feeding in your document if done is not already True.
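
The feed/done/close protocol described above can be sketched as a small helper. The function name detect_incrementally is hypothetical, and the helper is duck-typed: it accepts any object that exposes reset(), feed(), done and close() as documented for UniversalDetector, which keeps the sketch runnable without chardet installed.

```python
from typing import Iterable


def detect_incrementally(detector, chunks: Iterable[bytes]) -> dict:
    """Feed chunks until the detector reports it is done, then close it.

    `detector` is expected to behave like UniversalDetector; this helper
    itself is illustrative and not part of chardet.
    """
    detector.reset()               # start from a clean state
    for chunk in chunks:
        detector.feed(chunk)
        if detector.done:          # a prediction is already available
            break
    return detector.close()        # finalize and return the result dict


# Intended use, assuming chardet is installed:
#   from chardet import UniversalDetector
#   with open("some_file.txt", "rb") as fh:
#       result = detect_incrementally(UniversalDetector(), fh)
```

Stopping as soon as done is True avoids reading the rest of a large file once the detector has made up its mind.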

property has_win_bytes: bool

Check if Windows-specific bytes were detected by the SBCS prober.

property input_state: int
property nested_probers: list[CharSetProber]

Get a flat list of all nested charset probers.

reset() None[source]

Reset the UniversalDetector and all of its probers back to their initial states. This is called by __init__, so you only need to call this directly in between analyses of different documents.

chardet.detect(byte_str: bytes | bytearray, should_rename_legacy: bool | None = None, encoding_era: ~chardet.enums.EncodingEra = <EncodingEra.MODERN_WEB: 1>, chunk_size: int = 65536, max_bytes: int = 200000) ResultDict[source]

Detect the encoding of the given byte string.

Parameters:
  • byte_str (bytes or bytearray) – The byte sequence to examine.

  • should_rename_legacy (bool or None) – Should we rename legacy encodings to their more modern equivalents? If None (default), automatically enabled when encoding_era is MODERN_WEB.

  • encoding_era (EncodingEra) – Which era of encodings to consider during detection.

  • chunk_size (int) – Size of chunks to process at a time.

  • max_bytes (int) – Maximum number of bytes to examine.
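
To illustrate what should_rename_legacy does, here is a minimal sketch of a lookup against a subset of the LEGACY_MAP values documented on UniversalDetector. The function name rename_legacy and the case-insensitive lookup are assumptions made for the example, not a description of the library's internals.

```python
from typing import Optional

# Subset of the LEGACY_MAP entries documented on UniversalDetector.
LEGACY_MAP = {
    "ascii": "Windows-1252",
    "iso-8859-1": "Windows-1252",
    "euc-kr": "CP949",
    "tis-620": "CP874",
}


def rename_legacy(encoding: Optional[str], should_rename_legacy: bool = True) -> Optional[str]:
    """Map a legacy encoding name to its modern superset, if one is listed."""
    if encoding is None or not should_rename_legacy:
        return encoding
    # Assumption: keys are matched case-insensitively.
    return LEGACY_MAP.get(encoding.lower(), encoding)


print(rename_legacy("ISO-8859-1"))         # Windows-1252
print(rename_legacy("utf-8"))              # utf-8
print(rename_legacy("ISO-8859-1", False))  # ISO-8859-1
```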

chardet.detect_all(byte_str: bytes | bytearray, ignore_threshold: bool = False, should_rename_legacy: bool | None = None, encoding_era: ~chardet.enums.EncodingEra = <EncodingEra.MODERN_WEB: 1>, chunk_size: int = 65536, max_bytes: int = 200000) list[ResultDict][source]

Detect all the possible encodings of the given byte string.

Parameters:
  • byte_str (bytes or bytearray) – The byte sequence to examine.

  • ignore_threshold (bool) – Include encodings that are below UniversalDetector.MINIMUM_THRESHOLD in results.

  • should_rename_legacy (bool or None) – Should we rename legacy encodings to their more modern equivalents? If None (default), automatically enabled when encoding_era is MODERN_WEB.

  • encoding_era (EncodingEra) – Which era of encodings to consider during detection.

  • chunk_size (int) – Size of chunks to process at a time.

  • max_bytes (int) – Maximum number of bytes to examine.
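
The effect of ignore_threshold can be sketched as a filter over candidate results. The candidate dicts below are made-up illustrative inputs, and both the strict comparison and the descending sort are assumptions about how such a filter might behave rather than statements about detect_all itself.

```python
MINIMUM_THRESHOLD = 0.2  # value documented on UniversalDetector


def filter_candidates(candidates, ignore_threshold=False):
    """Drop low-confidence candidates unless ignore_threshold is set."""
    kept = [
        c for c in candidates
        if ignore_threshold or c["confidence"] > MINIMUM_THRESHOLD
    ]
    # Most confident candidate first.
    return sorted(kept, key=lambda c: c["confidence"], reverse=True)


# Hypothetical candidates, shaped like ResultDict.
candidates = [
    {"encoding": "Windows-1252", "confidence": 0.15, "language": ""},
    {"encoding": "utf-8", "confidence": 0.87, "language": ""},
]

print([c["encoding"] for c in filter_candidates(candidates)])
# ['utf-8']
print([c["encoding"] for c in filter_candidates(candidates, ignore_threshold=True)])
# ['utf-8', 'Windows-1252']
```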

Submodules

chardet.enums module

All of the Enums that are used throughout the chardet package.

author:

Dan Blanchard (dan.blanchard@gmail.com)

class chardet.enums.CharacterCategory(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: IntEnum

This enum represents the different categories language models for SingleByteCharsetProber put characters into.

Anything less than DIGIT is considered a letter.

CONTROL = 254
DIGIT = 251
LINE_BREAK = 252
SYMBOL = 253
UNDEFINED = 255
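
The rule "anything less than DIGIT is considered a letter" can be expressed as a one-line check. The class below is a local stand-in that copies the documented values so the sketch runs without chardet; is_letter is a hypothetical helper name.

```python
from enum import IntEnum


# Local stand-in mirroring the documented member values.
class CharacterCategory(IntEnum):
    DIGIT = 251
    LINE_BREAK = 252
    SYMBOL = 253
    CONTROL = 254
    UNDEFINED = 255


def is_letter(order: int) -> bool:
    """Orders below DIGIT are frequency ranks of letters."""
    return order < CharacterCategory.DIGIT


print(is_letter(12))                        # True
print(is_letter(CharacterCategory.SYMBOL))  # False
```
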
class chardet.enums.EncodingEra(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: Flag

This enum represents different eras of character encodings, used to filter which encodings are considered during detection.

The numeric values also serve as preference tiers for tie-breaking when confidence scores are very close. Lower values = more preferred/modern.

  • MODERN_WEB: UTF-8/16/32, Windows-125x, CP874, KOI8-R/U, CJK multi-byte (widely used on the web)

  • LEGACY_ISO: ISO-8859-x (legacy but well-known standards)

  • LEGACY_MAC: Mac-specific encodings (MacRoman, MacCyrillic, etc.)

  • LEGACY_REGIONAL: Uncommon regional/national encodings (KOI8-T, KZ1048, CP1006, etc.)

  • DOS: DOS/OEM code pages (CP437, CP850, CP866, etc.)

  • MAINFRAME: EBCDIC variants (CP037, CP500, etc.)

ALL = 63
DOS = 16
LEGACY_ISO = 2
LEGACY_MAC = 4
LEGACY_REGIONAL = 8
MAINFRAME = 32
MODERN_WEB = 1
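
The tie-breaking role of the numeric values can be sketched as follows. This is only an illustration under stated assumptions: the function prefer is hypothetical, the candidates are made-up tuples, and pairing the era values with VERY_CLOSE_THRESHOLD (documented on UniversalDetector) is a guess at how "very close" might be defined.

```python
VERY_CLOSE_THRESHOLD = 0.005  # value documented on UniversalDetector


def prefer(a, b):
    """Pick between two (encoding, confidence, era_value) candidates.

    When the confidences are nearly equal, the lower era value wins;
    otherwise the higher confidence wins.
    """
    (_, conf_a, era_a), (_, conf_b, era_b) = a, b
    if abs(conf_a - conf_b) < VERY_CLOSE_THRESHOLD:
        return a if era_a <= era_b else b
    return a if conf_a > conf_b else b


# Hypothetical candidates: MODERN_WEB has value 1, LEGACY_ISO has value 2.
close_call = prefer(("Windows-1252", 0.730, 1), ("ISO-8859-1", 0.732, 2))
clear_win = prefer(("Windows-1252", 0.60, 1), ("ISO-8859-1", 0.90, 2))
print(close_call[0])  # Windows-1252
print(clear_win[0])   # ISO-8859-1
```
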
class chardet.enums.InputState(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: IntEnum

This enum represents the different states a universal detector can be in.

ESC_ASCII = 1
HIGH_BYTE = 2
PURE_ASCII = 0
class chardet.enums.LanguageFilter(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: Flag

This enum represents the different language filters we can apply to a UniversalDetector.

ALL = 31
CHINESE = 3
CHINESE_SIMPLIFIED = 1
CHINESE_TRADITIONAL = 2
CJK = 15
JAPANESE = 4
KOREAN = 8
NON_CJK = 16
class chardet.enums.MachineState(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: IntEnum

This enum represents the different states a state machine can be in.

ERROR = 1
ITS_ME = 2
START = 0
class chardet.enums.ProbingState(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: IntEnum

This enum represents the different states a prober can be in.

DETECTING = 0
FOUND_IT = 1
NOT_ME = 2
class chardet.enums.SequenceLikelihood(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: IntEnum

This enum represents the likelihood of a character following the previous one.

LIKELY = 2
NEGATIVE = 0
POSITIVE = 3
UNLIKELY = 1

chardet.resultdict module

class chardet.resultdict.ResultDict[source]

Bases: TypedDict

confidence: float
encoding: str | None
language: str | None

chardet.universaldetector module

Module containing the UniversalDetector detector class, which is the primary class a user of chardet should use.

author:

Mark Pilgrim (initial port to Python)

author:

Shy Shalom (original C code)

author:

Dan Blanchard (major refactoring for 3.0)

author:

Ian Cordasco

class chardet.universaldetector.UniversalDetector(lang_filter: ~chardet.enums.LanguageFilter = <LanguageFilter.ALL: 31>, should_rename_legacy: bool | None = None, encoding_era: ~chardet.enums.EncodingEra = <EncodingEra.MODERN_WEB: 1>, max_bytes: int = 200000)[source]

Bases: object

The UniversalDetector class underlies the chardet.detect function and coordinates all of the different charset probers.

To get a dict containing an encoding and its confidence, you can simply run:

from chardet import UniversalDetector

u = UniversalDetector()
u.feed(some_bytes)  # some_bytes: the bytes (or bytearray) to analyze
u.close()
detected = u.result

ESC_DETECTOR = re.compile(b'(\x1b|~{)')
HIGH_BYTE_DETECTOR = re.compile(b'[\x80-\xff]')
ISO_WIN_MAP = {'iso-8859-1': 'Windows-1252', 'iso-8859-13': 'Windows-1257', 'iso-8859-2': 'Windows-1250', 'iso-8859-5': 'Windows-1251', 'iso-8859-6': 'Windows-1256', 'iso-8859-7': 'Windows-1253', 'iso-8859-8': 'Windows-1255', 'iso-8859-9': 'Windows-1254'}
LEGACY_MAP = {'ascii': 'Windows-1252', 'euc-kr': 'CP949', 'iso-8859-1': 'Windows-1252', 'iso-8859-11': 'CP874', 'iso-8859-13': 'Windows-1257', 'iso-8859-2': 'Windows-1250', 'iso-8859-5': 'Windows-1251', 'iso-8859-6': 'Windows-1256', 'iso-8859-7': 'Windows-1253', 'iso-8859-8': 'Windows-1255', 'iso-8859-9': 'Windows-1254', 'tis-620': 'CP874'}
MINIMUM_THRESHOLD = 0.2
VERY_CLOSE_THRESHOLD = 0.005
property active_probers: list[CharSetProber]

Get a flat list of all active (not falsey and not in NOT_ME state) nested charset probers.

property charset_probers: list[CharSetProber]
close() ResultDict[source]

Stop analyzing the current document and come up with a final prediction.

Returns:

The result attribute, a dict with the keys encoding, confidence, and language.

feed(byte_str: bytes | bytearray) None[source]

Takes a chunk of a document and feeds it through all of the relevant charset probers.

After calling feed, you can check the value of the done attribute to see if you need to continue feeding the UniversalDetector more data, or if it has made a prediction (in the result attribute).

Note

You should always call close when you’re done feeding in your document if done is not already True.

property has_win_bytes: bool

Check if Windows-specific bytes were detected by the SBCS prober.

property input_state: int
property nested_probers: list[CharSetProber]

Get a flat list of all nested charset probers.

reset() None[source]

Reset the UniversalDetector and all of its probers back to their initial states. This is called by __init__, so you only need to call this directly in between analyses of different documents.

result: ResultDict

chardet.charsetprober module

class chardet.charsetprober.CharSetProber(*, lang_filter: ~chardet.enums.LanguageFilter = <LanguageFilter.ALL: 31>, encoding_era: ~chardet.enums.EncodingEra = <EncodingEra.ALL: 63>)[source]

Bases: object

SHORTCUT_THRESHOLD = 0.95
property charset: Charset | None

Return the Charset metadata for this prober’s encoding.

property charset_name: str | None
feed(byte_str: bytes | bytearray) ProbingState[source]
static filter_high_byte_only(buf: bytes | bytearray) bytes[source]
static filter_international_words(buf: bytes | bytearray) bytearray[source]

Filter out ASCII-only words for non-Latin scripts.

Byte classes:

  • alphabet: ASCII letters [a-zA-Z]

  • international: bytes with high bit set [\x80-\xFF]

  • marker: everything else [^a-zA-Z\x80-\xFF]

The buffer is treated as a sequence of “words” separated by marker bytes. We KEEP only those words that contain at least one high-byte character, i.e. match the pattern: optional ASCII prefix + >=1 high-byte + optional ASCII suffix, plus at most one trailing marker. Pure ASCII words are discarded as noise when the target language model excludes ASCII letters (“English words in other-language pages” — paper §4.7 summary).

Why we retain surrounding ASCII letters instead of stripping them:

  • Preserves real adjacency for bigram modeling around high-byte letters.

  • Avoids creating artificial bigrams between non-adjacent high-byte chars.

Trailing marker normalization: a single marker at word end is converted to a space if it is an ASCII punctuation/control, collapsing runs of markers into one delimiter (reduces noise like repeated punctuation or HTML artifacts).

Usage is conditional: callers apply this ONLY when the language model’s keep_ascii_letters is False (see SingleByteCharSetProber.feed). Latin-script languages skip this and instead use remove_xml_tags.

This behavior mirrors the original universalchardet / uchardet approach and aligns with the training pipeline which excludes ASCII letters for non-Latin alphabets.
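
The word filter described above can be sketched with a single bytes regex. This is an approximation written from the description, not the library's source: the names WORD_WITH_HIGH_BYTE and keep_international_words are hypothetical.

```python
import re

# Optional ASCII prefix + at least one high byte + optional ASCII suffix,
# followed by at most one marker byte.
WORD_WITH_HIGH_BYTE = re.compile(
    rb"[a-zA-Z]*[\x80-\xFF]+[a-zA-Z]*[^a-zA-Z\x80-\xFF]?"
)


def keep_international_words(buf: bytes) -> bytearray:
    """Keep only words containing a high byte; normalize a trailing marker."""
    filtered = bytearray()
    for word in WORD_WITH_HIGH_BYTE.findall(buf):
        filtered.extend(word[:-1])
        last = word[-1:]
        # A trailing ASCII non-letter is a marker: replace it with a space.
        if not last.isalpha() and last < b"\x80":
            last = b" "
        filtered.extend(last)
    return filtered


print(keep_international_words(b"hello \xe4\xf6x, world"))
# bytearray(b'\xe4\xf6x ')
```

Here "hello" and "world" are dropped as pure-ASCII words, while the word containing high bytes keeps its ASCII suffix and its trailing comma becomes a space.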

get_confidence() float[source]
property language: str | None
static remove_xml_tags(buf: bytes | bytearray) bytearray[source]

Returns a copy of buf that retains only the sequences of English alphabet and high byte characters that are not between <> characters. This filter can be applied to all scripts which contain both English characters and extended ASCII characters, but is currently only used by Latin1Prober.

reset() None[source]
property state: ProbingState

chardet.charsetgroupprober module

class chardet.charsetgroupprober.CharSetGroupProber(*, lang_filter: ~chardet.enums.LanguageFilter = <LanguageFilter.ALL: 31>, encoding_era: ~chardet.enums.EncodingEra = <EncodingEra.ALL: 63>)[source]

Bases: CharSetProber

property charset_name: str | None
feed(byte_str: bytes | bytearray) ProbingState[source]
get_confidence() float[source]
property language: str | None
reset() None[source]

chardet.codingstatemachine module

class chardet.codingstatemachine.CodingStateMachine(sm: CodingStateMachineDict)[source]

Bases: object

A state machine to verify a byte sequence for a particular encoding. For each byte the detector receives, it will feed that byte to every active state machine available, one byte at a time. The state machine changes its state based on its previous state and the byte it receives. There are 3 states in a state machine that are of interest to an auto-detector:

START state: This is the state to start with, or the state reached once a legal byte sequence (i.e. a valid code point) for a character has been identified.

ME state: This indicates that the state machine identified a byte sequence that is specific to the charset it is designed for and that there is no other possible encoding which can contain this byte sequence. This will lead to an immediate positive answer for the detector.

ERROR state: This indicates the state machine identified an illegal byte sequence for that encoding. This will lead to an immediate negative answer for this encoding. The detector will exclude this encoding from consideration from here on.

get_coding_state_machine() str[source]
get_current_charlen() int[source]
property language: str
next_state(c: int) int[source]
reset() None[source]
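
A table-driven machine of the kind described above can be sketched with a toy model. Everything here is illustrative: TOY_MODEL describes a made-up encoding (ASCII bytes stand alone, high bytes must come in pairs), run_machine is a hypothetical helper, and the table layout (one row per state, one column per byte class, row width given by class_factor) is an assumption inferred from the CodingStateMachineDict field names.

```python
START, ERROR, ITS_ME = 0, 1, 2  # values documented for MachineState

TOY_MODEL = {
    # Byte class 0 = ASCII (0x00-0x7F), byte class 1 = high byte (0x80-0xFF).
    "class_table": tuple(0 if b < 0x80 else 1 for b in range(256)),
    "class_factor": 2,
    "state_table": (
        START, 3,        # START: ASCII completes a char; a high byte opens a pair
        ERROR, ERROR,    # ERROR is absorbing
        ITS_ME, ITS_ME,  # ITS_ME is absorbing
        ERROR, START,    # state 3: the second byte must also be a high byte
    ),
    "char_len_table": (1, 2),
    "name": "toy-2byte",
    "language": "",
}


def run_machine(model, data: bytes) -> int:
    """Feed bytes one at a time and return the final state."""
    state = START
    for byte in data:
        byte_class = model["class_table"][byte]
        state = model["state_table"][state * model["class_factor"] + byte_class]
        if state in (ERROR, ITS_ME):  # a definite answer ends the scan
            break
    return state


print(run_machine(TOY_MODEL, b"a\x81\x82b"))  # 0 (START: all sequences legal)
print(run_machine(TOY_MODEL, b"\x81a"))       # 1 (ERROR: unpaired high byte)
```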

chardet.codingstatemachinedict module

class chardet.codingstatemachinedict.CodingStateMachineDict[source]

Bases: TypedDict

char_len_table: tuple[int, ...]
class_factor: int
class_table: tuple[int, ...]
language: str
name: str
state_table: tuple[int, ...]

Multi-byte encoding probers

chardet.mbcharsetprober module

class chardet.mbcharsetprober.MultiByteCharSetProber(lang_filter: ~chardet.enums.LanguageFilter = <LanguageFilter.ALL: 31>, encoding_era: ~chardet.enums.EncodingEra = <EncodingEra.ALL: 63>)[source]

Bases: CharSetProber

feed(byte_str: bytes | bytearray) ProbingState[source]
get_confidence() float[source]
reset() None[source]

chardet.mbcsgroupprober module

class chardet.mbcsgroupprober.MBCSGroupProber(*, lang_filter: ~chardet.enums.LanguageFilter = <LanguageFilter.ALL: 31>, encoding_era: ~chardet.enums.EncodingEra = <EncodingEra.ALL: 63>)[source]

Bases: CharSetGroupProber

chardet.mbcssm module

chardet.mbcssm.BIG5_SM_MODEL: CodingStateMachineDict = {'char_len_table': (0, 1, 1, 2, 0), 'class_factor': 5, 'class_table': (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 0), 'name': 'Big5', 'state_table': (MachineState.ERROR, MachineState.START, MachineState.START, 3, MachineState.ERROR, MachineState.ERROR, MachineState.ERROR, MachineState.ERROR, MachineState.ERROR, MachineState.ERROR, MachineState.ITS_ME, MachineState.ITS_ME, MachineState.ITS_ME, MachineState.ITS_ME, MachineState.ITS_ME, MachineState.ERROR, MachineState.ERROR, MachineState.START, MachineState.START, MachineState.START, MachineState.START, MachineState.START, MachineState.START, MachineState.START)}

# Classes
0: Unused
1: 00-40, 5B-60, 7B-7F : Ascii
2: C7-FD
3: C9,FE : User-Defined Area
4: 41-52
5: 53-5A, 61-7A
6: 81-A0
7: A1-AC, B0-C5
8: AD-AF
9: C6

# Byte 1
Ascii: 00-7F : 1 + 4 + 5
State 3: 81-AC, B0-C5 : 6 + 7
State 4: AD-AF : 8
State 5: C6 : 9
State 6: C7-FE : 2 (+ 3)

# Byte 2
State 3: 41-5A, 61-7A, 81-FE : 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9
State 4: 41-5A, 61-7A, 81-A0 : 4 + 5 + 6
State 5: 41-52, A1-FE : 2 + 3 + 4 + 7 + 8 + 9
State 6: A1-FE : 2 + 3 + 7 + 8 + 9

chardet.utf8prober module

class chardet.utf8prober.UTF8Prober[source]

Bases: CharSetProber

ONE_CHAR_PROB = 0.5
property charset_name: str
feed(byte_str: bytes | bytearray) ProbingState[source]
get_confidence() float[source]
property language: str
reset() None[source]

chardet.utf1632prober module

class chardet.utf1632prober.UTF1632Prober[source]

Bases: CharSetProber

This class simply looks for occurrences of zero bytes and infers whether the file is UTF-16 or UTF-32 (little-endian or big-endian). For instance, files looking like ( \0 \0 \0 [nonzero] )+ have a good probability of being UTF32BE. Files looking like ( \0 [nonzero] )+ may be guessed to be UTF16BE, and inversely for little-endian varieties.
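
The zero-byte idea can be sketched by measuring how often each byte offset within a 2-byte or 4-byte unit is zero. This is a simplified illustration, not the prober's actual bookkeeping: both function names are hypothetical, and reusing 0.94 (the documented EXPECTED_RATIO) as the cutoff is an assumption.

```python
from typing import List, Optional


def zero_byte_profile(data: bytes, width: int) -> List[float]:
    """Fraction of zero bytes at each offset within width-byte units."""
    units = len(data) // width
    if units == 0:
        return [0.0] * width
    counts = [0] * width
    for i in range(units * width):
        if data[i] == 0:
            counts[i % width] += 1
    return [c / units for c in counts]


def guess_utf_by_zero_bytes(data: bytes, threshold: float = 0.94) -> Optional[str]:
    """Guess a UTF-16/32 variant from where the zero bytes fall."""
    p32 = zero_byte_profile(data, 4)
    if all(r >= threshold for r in p32[:3]) and p32[3] < 1 - threshold:
        return "utf-32-be"  # ( \0 \0 \0 [nonzero] )+
    if all(r >= threshold for r in p32[1:]) and p32[0] < 1 - threshold:
        return "utf-32-le"  # ( [nonzero] \0 \0 \0 )+
    p16 = zero_byte_profile(data, 2)
    if p16[0] >= threshold and p16[1] < 1 - threshold:
        return "utf-16-be"  # ( \0 [nonzero] )+
    if p16[1] >= threshold and p16[0] < 1 - threshold:
        return "utf-16-le"  # ( [nonzero] \0 )+
    return None


sample = "hello world hello world"
print(guess_utf_by_zero_bytes(sample.encode("utf-32-be")))  # utf-32-be
print(guess_utf_by_zero_bytes(sample.encode("utf-16-le")))  # utf-16-le
print(guess_utf_by_zero_bytes(sample.encode("ascii")))      # None
```

The heuristic only works for text dominated by low code points such as ASCII, which is why the prober also validates the characters it sees.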

EXPECTED_RATIO = 0.94
MIN_CHARS_FOR_DETECTION = 20
MIN_RATIO = 0.08
approx_16bit_chars() float[source]
approx_32bit_chars() float[source]
property charset_name: str
feed(byte_str: bytes | bytearray) ProbingState[source]
get_confidence() float[source]
is_likely_utf16be() bool[source]
is_likely_utf16le() bool[source]
is_likely_utf32be() bool[source]
is_likely_utf32le() bool[source]
property language: str
reset() None[source]
property state: ProbingState
validate_utf16_characters(pair: list[int]) None[source]

Validate if the pair of bytes is valid UTF-16.

UTF-16 code units are valid in the range 0x0000 - 0xFFFF excluding 0xD800 - 0xDFFF, with an exception for surrogate pairs, which must be a code unit in the range 0xD800-0xDBFF followed by one in the range 0xDC00-0xDFFF.

https://en.wikipedia.org/wiki/UTF-16

validate_utf32_characters(quad: list[int]) None[source]

Validate if the quad of bytes is valid UTF-32.

UTF-32 is valid in the range 0x00000000 - 0x0010FFFF excluding 0x0000D800 - 0x0000DFFF

https://en.wikipedia.org/wiki/UTF-32
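
The two validity rules can be written out as standalone checks. This is a sketch of the rules as defined by Unicode (surrogates 0xD800-0xDFFF are legal in UTF-16 only as a high/low pair and never legal in UTF-32); the function names are hypothetical and unrelated to the prober's methods, which take raw byte pairs and quads.

```python
from typing import Sequence


def is_valid_utf32_code_point(cp: int) -> bool:
    """True if cp is in 0x0..0x10FFFF and is not a surrogate."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


def is_valid_utf16_units(units: Sequence[int]) -> bool:
    """True if every surrogate in the 16-bit unit sequence is correctly paired."""
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # A high surrogate must be followed by a low surrogate.
            if i + 1 >= len(units) or not (0xDC00 <= units[i + 1] <= 0xDFFF):
                return False
            i += 2
        elif 0xDC00 <= unit <= 0xDFFF:
            return False  # a low surrogate with no preceding high surrogate
        else:
            i += 1
    return True


print(is_valid_utf16_units([0x0041]))          # True
print(is_valid_utf16_units([0xD83D, 0xDE00]))  # True
print(is_valid_utf16_units([0xD83D, 0x0041]))  # False
print(is_valid_utf32_code_point(0x1F600))      # True
print(is_valid_utf32_code_point(0xD800))       # False
```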

chardet.big5prober module

class chardet.big5prober.Big5Prober[source]

Bases: MultiByteCharSetProber

property charset_name: str
property language: str

chardet.gb18030prober module

class chardet.gb18030prober.GB18030Prober[source]

Bases: MultiByteCharSetProber

property charset_name: str
property language: str

chardet.eucjpprober module

class chardet.eucjpprober.EUCJPProber[source]

Bases: MultiByteCharSetProber

property charset_name: str
feed(byte_str: bytes | bytearray) ProbingState[source]
get_confidence() float[source]
property language: str
reset() None[source]

chardet.euckrprober module

class chardet.euckrprober.EUCKRProber[source]

Bases: MultiByteCharSetProber

property charset_name: str
property language: str

chardet.cp949prober module

class chardet.cp949prober.CP949Prober[source]

Bases: MultiByteCharSetProber

property charset_name: str
property language: str

chardet.sjisprober module

class chardet.sjisprober.SJISProber[source]

Bases: MultiByteCharSetProber

property charset_name: str
feed(byte_str: bytes | bytearray) ProbingState[source]
get_confidence() float[source]
property language: str
reset() None[source]

chardet.johabprober module

class chardet.johabprober.JOHABProber[source]

Bases: MultiByteCharSetProber

property charset_name: str
property language: str

chardet.escprober module

class chardet.escprober.EscCharSetProber(lang_filter: ~chardet.enums.LanguageFilter = <LanguageFilter.ALL: 31>, encoding_era: ~chardet.enums.EncodingEra = <EncodingEra.ALL: 63>)[source]

Bases: CharSetProber

This CharSetProber uses a “code scheme” approach for detecting encodings, whereby easily recognizable escape or shift sequences are relied on to identify these encodings.

property charset_name: str | None
feed(byte_str: bytes | bytearray) ProbingState[source]
get_confidence() float[source]
property language: str | None
reset() None[source]

chardet.escsm module

Single-byte encoding probers

chardet.sbcharsetprober module

class chardet.sbcharsetprober.SingleByteCharSetModel(charset_name, language, char_to_order_map, language_model, typical_positive_ratio, keep_ascii_letters, alphabet)[source]

Bases: NamedTuple

alphabet: str

Alias for field number 6

char_to_order_map: Mapping[int, CharacterCategory | int]

Alias for field number 2

charset_name: str

Alias for field number 0

keep_ascii_letters: bool

Alias for field number 5

language: str

Alias for field number 1

language_model: Mapping[int, Mapping[int, SequenceLikelihood | int]]

Alias for field number 3

typical_positive_ratio: float

Alias for field number 4

class chardet.sbcharsetprober.SingleByteCharSetProber(model: SingleByteCharSetModel, is_reversed: bool = False, name_prober: CharSetProber | None = None)[source]

Bases: CharSetProber

NEGATIVE_SHORTCUT_THRESHOLD = 0.05
POSITIVE_SHORTCUT_THRESHOLD = 0.95
SB_ENOUGH_REL_THRESHOLD = 1024
property charset_name: str | None
feed(byte_str: bytes | bytearray) ProbingState[source]
get_confidence() float[source]
property language: str | None
reset() None[source]

chardet.sbcsgroupprober module

class chardet.sbcsgroupprober.SBCSGroupProber(lang_filter: ~chardet.enums.LanguageFilter = <LanguageFilter.ALL: 31>, encoding_era: ~chardet.enums.EncodingEra = <EncodingEra.MODERN_WEB: 1>)[source]

Bases: CharSetGroupProber

feed(byte_str: bytes | bytearray) ProbingState[source]
get_confidence() float[source]
reset() None[source]

chardet.hebrewprober module

class chardet.hebrewprober.HebrewProber[source]

Bases: CharSetProber

FINAL_KAF = 234
FINAL_MEM = 237
FINAL_NUN = 239
FINAL_PE = 243
FINAL_TSADI = 245
LOGICAL_HEBREW_NAME = 'WINDOWS-1255'
MIN_FINAL_CHAR_DISTANCE = 5
MIN_MODEL_DISTANCE = 0.01
NORMAL_KAF = 235
NORMAL_MEM = 238
NORMAL_NUN = 240
NORMAL_PE = 244
NORMAL_TSADI = 246
SPACE = 32
VISUAL_HEBREW_NAME = 'ISO-8859-8'
property charset_name: str
feed(byte_str: bytes | bytearray) ProbingState[source]
is_final(c: int) bool[source]
is_non_final(c: int) bool[source]
property language: str
reset() None[source]
set_model_probers(logical_prober: SingleByteCharSetProber, visual_prober: SingleByteCharSetProber) None[source]
property state: ProbingState

Analysis modules

chardet.chardistribution module

class chardet.chardistribution.Big5DistributionAnalysis[source]

Bases: CharDistributionAnalysis

get_order(byte_str: bytes | bytearray) int[source]
class chardet.chardistribution.CharDistributionAnalysis[source]

Bases: object

ENOUGH_DATA_THRESHOLD = 1024
MINIMUM_DATA_THRESHOLD = 3
SURE_NO = 0.01
SURE_YES = 0.99
feed(char: bytes | bytearray, char_len: int) None[source]

Feed a character with known length.

get_confidence() float[source]

Return confidence based on existing data.

get_order(_: bytes | bytearray) int[source]
got_enough_data() bool[source]
reset() None[source]

Reset the analyser and clear any state.

class chardet.chardistribution.EUCJPDistributionAnalysis[source]

Bases: CharDistributionAnalysis

get_order(byte_str: bytes | bytearray) int[source]
class chardet.chardistribution.EUCKRDistributionAnalysis[source]

Bases: CharDistributionAnalysis

get_order(byte_str: bytes | bytearray) int[source]
class chardet.chardistribution.GB2312DistributionAnalysis[source]

Bases: CharDistributionAnalysis

get_order(byte_str: bytes | bytearray) int[source]
class chardet.chardistribution.JOHABDistributionAnalysis[source]

Bases: CharDistributionAnalysis

get_order(byte_str: bytes | bytearray) int[source]
class chardet.chardistribution.SJISDistributionAnalysis[source]

Bases: CharDistributionAnalysis

get_order(byte_str: bytes | bytearray) int[source]

chardet.jpcntx module

class chardet.jpcntx.EUCJPContextAnalysis[source]

Bases: JapaneseContextAnalysis

get_order(byte_str: bytes | bytearray) tuple[int, int][source]
class chardet.jpcntx.JapaneseContextAnalysis[source]

Bases: object

DONT_KNOW = -1
ENOUGH_REL_THRESHOLD = 100
MAX_REL_THRESHOLD = 1000
MINIMUM_DATA_THRESHOLD = 4
NUM_OF_CATEGORY = 6
feed(byte_str: bytes | bytearray, num_bytes: int) None[source]
get_confidence() float[source]
get_order(_: bytes | bytearray) tuple[int, int][source]
got_enough_data() bool[source]
reset() None[source]
class chardet.jpcntx.SJISContextAnalysis[source]

Bases: JapaneseContextAnalysis

property charset_name: str
get_order(byte_str: bytes | bytearray) tuple[int, int][source]

Frequency tables

chardet.big5freq module

chardet.euckrfreq module

chardet.gb2312freq module

chardet.jisfreq module

chardet.johabfreq module

Language models

These modules contain bigram frequency models for single-byte encoding detection. They are generated by create_language_model.py and should not be edited manually.

CLI module

chardet.cli.chardetect module

Script which takes one or more file paths and reports on their detected encodings.

Example:

% chardetect somefile someotherfile
somefile: windows-1252 with confidence 0.5
someotherfile: ascii with confidence 1.0

If no paths are provided, it takes its input from stdin.

chardet.cli.chardetect.description_of(lines: ~collections.abc.Iterable[bytes], name: str = 'stdin', minimal: bool = False, should_rename_legacy: bool = False, encoding_era: ~chardet.enums.EncodingEra = <EncodingEra.MODERN_WEB: 1>) str | None[source]

Return a string describing the probable encoding of a file or list of strings.

Parameters:
  • lines (Iterable of bytes) – The lines to get the encoding of.

  • name (str) – Name of file or collection of lines.

  • should_rename_legacy (bool) – Should we rename legacy encodings to their more modern equivalents?

  • encoding_era (EncodingEra) – Which era of encodings to consider during detection.

chardet.cli.chardetect.main(argv: list[str] | None = None) None[source]

Handles command line arguments and gets things started.

Parameters:

argv (list of str) – List of arguments, as if specified on the command-line. If None, sys.argv[1:] is used instead.

Metadata subpackage

chardet.metadata.charsets module

Metadata about charsets used by our model training code and test file generation code. Could be used for other things in the future.

class chardet.metadata.charsets.Charset(name: str, is_multi_byte: bool, encoding_era: EncodingEra, language_filter: LanguageFilter)[source]

Bases: object

Metadata about charsets useful for training models and generating test files.

encoding_era: EncodingEra
is_multi_byte: bool
language_filter: LanguageFilter
name: str
chardet.metadata.charsets.get_charset(encoding_name: str) Charset[source]

Get the Charset metadata for a given encoding name.

Parameters:

encoding_name – The encoding name to look up

Returns:

The Charset for this encoding, defaults to a MODERN_WEB charset if unknown

chardet.metadata.charsets.is_unicode_encoding(encoding_name: str) bool[source]

Check if an encoding is a Unicode encoding (UTF-8, UTF-16, UTF-32).

Parameters:

encoding_name – The encoding name to check

Returns:

True if the encoding is Unicode, False otherwise

chardet.metadata.languages module

Metadata about languages used by our model training code for our SingleByteCharSetProbers. Could be used for other things in the future.

This code was originally based on the language metadata from the uchardet project.

class chardet.metadata.languages.Language(name: str, iso_code: str, use_ascii: bool, charsets: list[str], alphabet: str, num_training_docs: int | None = None, num_training_chars: int | None = None)[source]

Bases: object

Metadata about a language useful for training models.

Variables:
  • name – The human name for the language, in English.

  • iso_code – 2-letter ISO 639-1 if possible, 3-letter ISO code otherwise, or use another catalog as a last resort.

  • use_ascii – Whether or not ASCII letters should be included in trained models.

  • charsets – The charsets we want to support and create data for.

  • alphabet – The characters in the language’s alphabet. If use_ascii is True, you only need to add those not in the ASCII set.

  • num_training_docs – Number of documents from CulturaX to use for training. This represents approximately 300M characters of training data. None means the count hasn’t been determined yet.

  • num_training_chars – Number of characters from CulturaX used for training. The goal is for this to be at least 300M characters, but some languages may not have that much data available. None means the count hasn’t been determined yet.

alphabet: str
charsets: list[str]
iso_code: str
name: str
num_training_chars: int | None = None
num_training_docs: int | None = None
use_ascii: bool