chardet package

Module contents

class chardet.EncodingEra(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: Flag

This enum represents different eras of character encodings, used to filter which encodings are considered during detection.

The numeric values also serve as preference tiers for tie-breaking when confidence scores are very close. Lower values = more preferred/modern.

MODERN_WEB: UTF-8/16/32, Windows-125x, CP874, KOI8-R/U, CJK multi-byte (widely used on the web)
LEGACY_ISO: ISO-8859-x (legacy but well-known standards)
LEGACY_MAC: Mac-specific encodings (MacRoman, MacCyrillic, etc.)
LEGACY_REGIONAL: Uncommon regional/national encodings (KOI8-T, KZ1048, CP1006, etc.)
DOS: DOS/OEM code pages (CP437, CP850, CP866, etc.)
MAINFRAME: EBCDIC variants (CP037, CP500, etc.)

ALL = 63

DOS = 16

LEGACY_ISO = 2

LEGACY_MAC = 4

LEGACY_REGIONAL = 8

MAINFRAME = 32

MODERN_WEB = 1
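
As a rough illustration of how a Flag like this can act both as a filter and as a preference tier, the sketch below rebuilds the documented members locally instead of importing chardet. The prefer_modern helper, the 0.005 window, and the confidence numbers are all made up for the example; they are assumptions, not the library's tie-breaking code.

```python
from enum import Flag


class EncodingEra(Flag):
    """Local replica of the documented members, for illustration only."""

    MODERN_WEB = 1
    LEGACY_ISO = 2
    LEGACY_MAC = 4
    LEGACY_REGIONAL = 8
    DOS = 16
    MAINFRAME = 32
    ALL = 63


# Flags combine with | and membership is tested with `in`.
web_and_iso = EncodingEra.MODERN_WEB | EncodingEra.LEGACY_ISO


def prefer_modern(candidates, very_close=0.005):
    """Hypothetical tie-break: among near-best scores, pick the lowest era value."""
    best = max(conf for _, conf, _ in candidates)
    close = [c for c in candidates if best - c[1] <= very_close]
    return min(close, key=lambda c: c[2].value)


# Made-up (name, confidence, era) triples.
candidates = [
    ("ISO-8859-1", 0.731, EncodingEra.LEGACY_ISO),
    ("Windows-1252", 0.730, EncodingEra.MODERN_WEB),
    ("MacRoman", 0.600, EncodingEra.LEGACY_MAC),
]
winner = prefer_modern(candidates)
```

Here the two top candidates are within the window, so the one with the lower era value (MODERN_WEB) is chosen even though its score is marginally lower.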

class chardet.UniversalDetector(lang_filter: ~chardet.enums.LanguageFilter = <LanguageFilter.ALL: 31>, should_rename_legacy: bool | None = None, encoding_era: ~chardet.enums.EncodingEra = <EncodingEra.MODERN_WEB: 1>, max_bytes: int = 200000)[source]

Bases: object

The UniversalDetector class underlies the chardet.detect function and coordinates all of the different charset probers.

To get a dict containing an encoding and its confidence, you can simply run:

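from chardet import UniversalDetector

u = UniversalDetector()
u.feed(some_bytes)  # some_bytes: the bytes or bytearray to analyze
u.close()
detected = u.result  # dict with encoding, confidence, and language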

ESC_DETECTOR = re.compile(b'(\x1b|~{)')

HIGH_BYTE_DETECTOR = re.compile(b'[\x80-\xff]')

ISO_WIN_MAP = {'iso-8859-1': 'Windows-1252', 'iso-8859-13': 'Windows-1257', 'iso-8859-2': 'Windows-1250', 'iso-8859-5': 'Windows-1251', 'iso-8859-6': 'Windows-1256', 'iso-8859-7': 'Windows-1253', 'iso-8859-8': 'Windows-1255', 'iso-8859-9': 'Windows-1254'}

LEGACY_MAP = {'ascii': 'Windows-1252', 'euc-kr': 'CP949', 'iso-8859-1': 'Windows-1252', 'iso-8859-11': 'CP874', 'iso-8859-13': 'Windows-1257', 'iso-8859-2': 'Windows-1250', 'iso-8859-5': 'Windows-1251', 'iso-8859-6': 'Windows-1256', 'iso-8859-7': 'Windows-1253', 'iso-8859-8': 'Windows-1255', 'iso-8859-9': 'Windows-1254', 'tis-620': 'CP874'}

MINIMUM_THRESHOLD = 0.2

VERY_CLOSE_THRESHOLD = 0.005

property active_probers: list[CharSetProber]: Get a flat list of all active (not falsey and not in NOT_ME state) nested charset probers.

property charset_probers: list[CharSetProber]

close() → ResultDict[source]

Stop analyzing the current document and come up with a final prediction.

Returns:: The result attribute, a dict with the keys encoding, confidence, and language.

feed(byte_str: bytes | bytearray) → None[source]

Takes a chunk of a document and feeds it through all of the relevant charset probers.

After calling feed, you can check the value of the done attribute to see if you need to continue feeding the UniversalDetector more data, or if it has made a prediction (in the result attribute).

Note

You should always call close when you’re done feeding in your document if done is not already True.
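
The feed/done/close protocol described above can be sketched as a small loop. To keep the example self-contained, it uses StubDetector, a hypothetical stand-in with a toy rule; only the attribute and method names (feed, done, close, result) are taken from this page.

```python
from typing import Iterable


class StubDetector:
    """Hypothetical stand-in exposing feed/close/done/result like the documented class."""

    def __init__(self) -> None:
        self.done = False
        self.chunks_fed = 0
        self.result = {"encoding": None, "confidence": 0.0, "language": None}

    def feed(self, byte_str: bytes) -> None:
        self.chunks_fed += 1
        # Toy rule: any byte >= 0x80 settles the answer immediately.
        if any(b >= 0x80 for b in byte_str):
            self.result = {"encoding": "unknown-8bit", "confidence": 1.0, "language": None}
            self.done = True

    def close(self) -> dict:
        if not self.done:
            self.result = {"encoding": "ascii", "confidence": 1.0, "language": None}
            self.done = True
        return self.result


def detect_stream(detector, chunks: Iterable[bytes]) -> dict:
    """Feed chunks until the detector reports done; close it if it never did."""
    for chunk in chunks:
        detector.feed(chunk)
        if detector.done:
            break  # a prediction is already available in detector.result
    if not detector.done:
        detector.close()
    return detector.result
```

The early break avoids reading the rest of a large input once a prediction exists, and the final check covers the case where the input ends before the detector is done.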

property has_win_bytes: bool: Check if Windows-specific bytes were detected by the SBCS prober.

property input_state: int

property nested_probers: list[CharSetProber]: Get a flat list of all nested charset probers.

reset() → None[source]: Reset the UniversalDetector and all of its probers back to their initial states. This is called by __init__, so you only need to call this directly in between analyses of different documents.

chardet.detect(byte_str: bytes | bytearray, should_rename_legacy: bool | None = None, encoding_era: ~chardet.enums.EncodingEra = <EncodingEra.MODERN_WEB: 1>, chunk_size: int = 65536, max_bytes: int = 200000) → ResultDict[source]

Detect the encoding of the given byte string.

Parameters:

byte_str (bytes or bytearray) – The byte sequence to examine.
should_rename_legacy (bool or None) – Should we rename legacy encodings to their more modern equivalents? If None (default), automatically enabled when encoding_era is MODERN_WEB.
encoding_era (EncodingEra) – Which era of encodings to consider during detection.
chunk_size (int) – Size of chunks to process at a time.
max_bytes (int) – Maximum number of bytes to examine.
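
The default for should_rename_legacy can be read as a small decision rule. The sketch below is an assumption-laden illustration: resolve_rename and maybe_rename are hypothetical helpers, the map is a subset of the LEGACY_MAP entries listed for UniversalDetector, and the case-insensitive lookup is a guess at how names are matched.

```python
from typing import Optional

# Subset of the LEGACY_MAP entries listed for UniversalDetector.
LEGACY_MAP = {
    "ascii": "Windows-1252",
    "euc-kr": "CP949",
    "iso-8859-1": "Windows-1252",
    "tis-620": "CP874",
}


def resolve_rename(should_rename_legacy: Optional[bool], era_is_modern_web: bool) -> bool:
    """None means: rename only when the era is MODERN_WEB."""
    if should_rename_legacy is None:
        return era_is_modern_web
    return should_rename_legacy


def maybe_rename(encoding: Optional[str], rename: bool) -> Optional[str]:
    """Assumed lookup: case-insensitive key, unchanged name when there is no entry."""
    if encoding is None or not rename:
        return encoding
    return LEGACY_MAP.get(encoding.lower(), encoding)
```

An explicit True or False always wins; only None defers to the era.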

chardet.detect_all(byte_str: bytes | bytearray, ignore_threshold: bool = False, should_rename_legacy: bool | None = None, encoding_era: ~chardet.enums.EncodingEra = <EncodingEra.MODERN_WEB: 1>, chunk_size: int = 65536, max_bytes: int = 200000) → list[ResultDict][source]

Detect all the possible encodings of the given byte string.

Parameters:

byte_str (bytes or bytearray) – The byte sequence to examine.
ignore_threshold (bool) – Include encodings that are below UniversalDetector.MINIMUM_THRESHOLD in results.
should_rename_legacy (bool or None) – Should we rename legacy encodings to their more modern equivalents? If None (default), automatically enabled when encoding_era is MODERN_WEB.
encoding_era (EncodingEra) – Which era of encodings to consider during detection.
chunk_size (int) – Size of chunks to process at a time.
max_bytes (int) – Maximum number of bytes to examine.
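
To make the effect of ignore_threshold concrete, here is a minimal sketch of filtering a candidate list against the documented MINIMUM_THRESHOLD value. The select_candidates helper, the sort order, and the handling of a score exactly equal to the threshold are assumptions for illustration, not the library's code.

```python
MINIMUM_THRESHOLD = 0.2  # value listed for UniversalDetector.MINIMUM_THRESHOLD


def select_candidates(results, ignore_threshold=False):
    """Sort by confidence (highest first) and optionally drop weak candidates."""
    ranked = sorted(results, key=lambda r: r["confidence"], reverse=True)
    if ignore_threshold:
        return ranked
    # Assumption: a score exactly at the threshold is kept.
    return [r for r in ranked if r["confidence"] >= MINIMUM_THRESHOLD]


# Made-up candidates shaped like ResultDict.
sample_results = [
    {"encoding": "MacRoman", "confidence": 0.1, "language": None},
    {"encoding": "UTF-8", "confidence": 0.9, "language": None},
    {"encoding": "Windows-1252", "confidence": 0.5, "language": None},
]
```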

Submodules

chardet.enums module

All of the Enums that are used throughout the chardet package.

author:: Dan Blanchard (dan.blanchard@gmail.com)

class chardet.enums.CharacterCategory(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: IntEnum

This enum represents the different categories language models for SingleByteCharsetProber put characters into.

Anything less than DIGIT is considered a letter.

CONTROL = 254

DIGIT = 251

LINE_BREAK = 252

SYMBOL = 253

UNDEFINED = 255
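
The rule "anything less than DIGIT is considered a letter" amounts to a single comparison. The sketch below rebuilds the enum locally with the values listed above; is_letter is a hypothetical helper name.

```python
from enum import IntEnum


class CharacterCategory(IntEnum):
    """Local replica of the documented values, for illustration only."""

    DIGIT = 251
    LINE_BREAK = 252
    SYMBOL = 253
    CONTROL = 254
    UNDEFINED = 255


def is_letter(order: int) -> bool:
    """Orders below DIGIT are letter ranks; the rest are category markers."""
    return order < CharacterCategory.DIGIT
```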

class chardet.enums.EncodingEra(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: Flag

This enum represents different eras of character encodings, used to filter which encodings are considered during detection.

The numeric values also serve as preference tiers for tie-breaking when confidence scores are very close. Lower values = more preferred/modern.

MODERN_WEB: UTF-8/16/32, Windows-125x, CP874, KOI8-R/U, CJK multi-byte (widely used on the web)
LEGACY_ISO: ISO-8859-x (legacy but well-known standards)
LEGACY_MAC: Mac-specific encodings (MacRoman, MacCyrillic, etc.)
LEGACY_REGIONAL: Uncommon regional/national encodings (KOI8-T, KZ1048, CP1006, etc.)
DOS: DOS/OEM code pages (CP437, CP850, CP866, etc.)
MAINFRAME: EBCDIC variants (CP037, CP500, etc.)

ALL = 63

DOS = 16

LEGACY_ISO = 2

LEGACY_MAC = 4

LEGACY_REGIONAL = 8

MAINFRAME = 32

MODERN_WEB = 1

class chardet.enums.InputState(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: IntEnum

This enum represents the different states a universal detector can be in.

ESC_ASCII = 1

HIGH_BYTE = 2

PURE_ASCII = 0

class chardet.enums.LanguageFilter(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: Flag

This enum represents the different language filters we can apply to a UniversalDetector.

ALL = 31

CHINESE = 3

CHINESE_SIMPLIFIED = 1

CHINESE_TRADITIONAL = 2

CJK = 15

JAPANESE = 4

KOREAN = 8

NON_CJK = 16
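
The composite members follow from the single-bit ones: CHINESE is both Chinese variants, CJK adds Japanese and Korean, and ALL adds NON_CJK. A small local replica shows the relationships; the wants helper is a hypothetical name for an overlap test.

```python
from enum import Flag


class LanguageFilter(Flag):
    """Local replica of the documented values, for illustration only."""

    CHINESE_SIMPLIFIED = 1
    CHINESE_TRADITIONAL = 2
    JAPANESE = 4
    KOREAN = 8
    NON_CJK = 16
    CHINESE = 3
    CJK = 15
    ALL = 31


def wants(lang_filter: LanguageFilter, language: LanguageFilter) -> bool:
    """True when the filter shares at least one bit with the language."""
    return bool(lang_filter & language)
```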

class chardet.enums.MachineState(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: IntEnum

This enum represents the different states a state machine can be in.

ERROR = 1

ITS_ME = 2

START = 0

class chardet.enums.ProbingState(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: IntEnum

This enum represents the different states a prober can be in.

DETECTING = 0

FOUND_IT = 1

NOT_ME = 2

class chardet.enums.SequenceLikelihood(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: IntEnum

This enum represents the likelihood of a character following the previous one.

LIKELY = 2

NEGATIVE = 0

POSITIVE = 3

UNLIKELY = 1

chardet.resultdict module

class chardet.resultdict.ResultDict[source]

Bases: TypedDict

confidence: float

encoding: str | None

language: str | None

chardet.universaldetector module

Module containing the UniversalDetector class, which is the primary class a user of chardet should use.

author:: Mark Pilgrim (initial port to Python)
author:: Shy Shalom (original C code)
author:: Dan Blanchard (major refactoring for 3.0)
author:: Ian Cordasco

class chardet.universaldetector.UniversalDetector(lang_filter: ~chardet.enums.LanguageFilter = <LanguageFilter.ALL: 31>, should_rename_legacy: bool | None = None, encoding_era: ~chardet.enums.EncodingEra = <EncodingEra.MODERN_WEB: 1>, max_bytes: int = 200000)[source]

Bases: object

The UniversalDetector class underlies the chardet.detect function and coordinates all of the different charset probers.

To get a dict containing an encoding and its confidence, you can simply run:

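from chardet.universaldetector import UniversalDetector

u = UniversalDetector()
u.feed(some_bytes)  # some_bytes: the bytes or bytearray to analyze
u.close()
detected = u.result  # dict with encoding, confidence, and language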

ESC_DETECTOR = re.compile(b'(\x1b|~{)')

HIGH_BYTE_DETECTOR = re.compile(b'[\x80-\xff]')

ISO_WIN_MAP = {'iso-8859-1': 'Windows-1252', 'iso-8859-13': 'Windows-1257', 'iso-8859-2': 'Windows-1250', 'iso-8859-5': 'Windows-1251', 'iso-8859-6': 'Windows-1256', 'iso-8859-7': 'Windows-1253', 'iso-8859-8': 'Windows-1255', 'iso-8859-9': 'Windows-1254'}

LEGACY_MAP = {'ascii': 'Windows-1252', 'euc-kr': 'CP949', 'iso-8859-1': 'Windows-1252', 'iso-8859-11': 'CP874', 'iso-8859-13': 'Windows-1257', 'iso-8859-2': 'Windows-1250', 'iso-8859-5': 'Windows-1251', 'iso-8859-6': 'Windows-1256', 'iso-8859-7': 'Windows-1253', 'iso-8859-8': 'Windows-1255', 'iso-8859-9': 'Windows-1254', 'tis-620': 'CP874'}

MINIMUM_THRESHOLD = 0.2

VERY_CLOSE_THRESHOLD = 0.005

property active_probers: list[CharSetProber]: Get a flat list of all active (not falsey and not in NOT_ME state) nested charset probers.

property charset_probers: list[CharSetProber]

close() → ResultDict[source]

Stop analyzing the current document and come up with a final prediction.

Returns:: The result attribute, a dict with the keys encoding, confidence, and language.

feed(byte_str: bytes | bytearray) → None[source]

Takes a chunk of a document and feeds it through all of the relevant charset probers.

After calling feed, you can check the value of the done attribute to see if you need to continue feeding the UniversalDetector more data, or if it has made a prediction (in the result attribute).

Note

You should always call close when you’re done feeding in your document if done is not already True.

property has_win_bytes: bool: Check if Windows-specific bytes were detected by the SBCS prober.

property input_state: int

property nested_probers: list[CharSetProber]: Get a flat list of all nested charset probers.

reset() → None[source]: Reset the UniversalDetector and all of its probers back to their initial states. This is called by __init__, so you only need to call this directly in between analyses of different documents.

result: ResultDict

chardet.charsetprober module

class chardet.charsetprober.CharSetProber(*, lang_filter: ~chardet.enums.LanguageFilter = <LanguageFilter.ALL: 31>, encoding_era: ~chardet.enums.EncodingEra = <EncodingEra.ALL: 63>)[source]

Bases: object

SHORTCUT_THRESHOLD = 0.95

property charset: Charset | None: Return the Charset metadata for this prober’s encoding.

property charset_name: str | None

feed(byte_str: bytes | bytearray) → ProbingState[source]

static filter_high_byte_only(buf: bytes | bytearray) → bytes[source]

static filter_international_words(buf: bytes | bytearray) → bytearray[source]

Filter out ASCII-only words for non-Latin scripts.

Byte classes:
- alphabet: ASCII letters [a-zA-Z]
- international: bytes with high bit set [\x80-\xFF]
- marker: everything else [^a-zA-Z\x80-\xFF]

The buffer is treated as a sequence of “words” separated by marker bytes. We KEEP only those words that contain at least one high-byte character, i.e. match the pattern: optional ASCII prefix + >=1 high-byte + optional ASCII suffix, plus at most one trailing marker. Pure ASCII words are discarded as noise when the target language model excludes ASCII letters (“English words in other-language pages” — paper §4.7 summary).

Why we retain surrounding ASCII letters instead of stripping them:
- Preserves real adjacency for bigram modeling around high-byte letters.
- Avoids creating artificial bigrams between non-adjacent high-byte chars.

Trailing marker normalization: a single marker at word end is converted to a space if it is an ASCII punctuation/control, collapsing runs of markers into one delimiter (reduces noise like repeated punctuation or HTML artifacts).

Usage is conditional: callers apply this ONLY when the language model’s keep_ascii_letters is False (see SingleByteCharSetProber.feed). Latin-script languages skip this and instead use remove_xml_tags.

This behavior mirrors the original universalchardet / uchardet approach and aligns with the training pipeline which excludes ASCII letters for non-Latin alphabets.
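
The word-filtering rule described above maps naturally onto one regular expression. The following is a minimal sketch written from that description, not the library's source; keep_international_words and WORD_WITH_HIGH_BYTE are hypothetical names.

```python
import re

# Optional ASCII prefix, one or more high bytes, optional ASCII suffix,
# then at most one trailing marker byte.
WORD_WITH_HIGH_BYTE = re.compile(
    rb"[a-zA-Z]*[\x80-\xff]+[a-zA-Z]*[^a-zA-Z\x80-\xff]?"
)


def keep_international_words(buf: bytes) -> bytearray:
    """Keep only words containing a high byte; turn a trailing marker into a space."""
    kept = bytearray()
    for word in WORD_WITH_HIGH_BYTE.findall(buf):
        body, last = word[:-1], word[-1:]
        # A trailing ASCII non-letter is a marker and becomes one space.
        if last < b"\x80" and not last.isalpha():
            last = b" "
        kept += body + last
    return kept


sample = b"hello \xe4\xf6\xfc, world ab\xe9cd!"
filtered = keep_international_words(sample)
```

In the sample, the pure-ASCII words "hello" and "world" are dropped, while "ab\xe9cd" keeps its ASCII neighbours so the bigrams around the high byte stay intact.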

get_confidence() → float[source]

property language: str | None

static remove_xml_tags(buf: bytes | bytearray) → bytearray[source]: Returns a copy of buf that retains only the sequences of English alphabet and high byte characters that are not between <> characters. This filter can be applied to all scripts which contain both English characters and extended ASCII characters, but is currently only used by Latin1Prober.
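
One simple way to picture this kind of filter is a single pass that tracks whether the current byte is inside a tag. The sketch below is a simplified stand-in written from the description; strip_tags_keep_words is a hypothetical name, and details such as the separator between words are assumptions.

```python
def strip_tags_keep_words(buf: bytes) -> bytearray:
    """Keep runs of ASCII letters and high bytes that lie outside <...>."""
    out = bytearray()
    word = bytearray()
    in_tag = False

    def flush() -> None:
        # Emit the pending word followed by a single space separator.
        if word:
            out.extend(word)
            out.append(0x20)
            word.clear()

    for byte in buf:
        if byte == 0x3C:  # "<" opens a tag
            flush()
            in_tag = True
        elif byte == 0x3E:  # ">" closes a tag
            flush()
            in_tag = False
        elif in_tag:
            continue  # skip everything between < and >
        elif byte >= 0x80 or chr(byte).isalpha():
            word.append(byte)
        else:
            flush()  # digits, punctuation, whitespace end the word
    flush()
    return out
```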

reset() → None[source]

property state: ProbingState

chardet.charsetgroupprober module

class chardet.charsetgroupprober.CharSetGroupProber(*, lang_filter: ~chardet.enums.LanguageFilter = <LanguageFilter.ALL: 31>, encoding_era: ~chardet.enums.EncodingEra = <EncodingEra.ALL: 63>)[source]

Bases: CharSetProber

property charset_name: str | None

feed(byte_str: bytes | bytearray) → ProbingState[source]

get_confidence() → float[source]

property language: str | None

reset() → None[source]

chardet.codingstatemachine module

class chardet.codingstatemachine.CodingStateMachine(sm: CodingStateMachineDict)[source]

Bases: object

A state machine to verify a byte sequence for a particular encoding. For each byte the detector receives, it will feed that byte to every active state machine available, one byte at a time. The state machine changes its state based on its previous state and the byte it receives. There are 3 states in a state machine that are of interest to an auto-detector:

START state: This is the state to start with, or the state reached when a legal byte sequence (i.e. a valid code point) for a character has been identified.
ME state (MachineState.ITS_ME): This indicates that the state machine identified a byte sequence that is specific to the charset it is designed for and that there is no other possible encoding which can contain this byte sequence. This will lead to an immediate positive answer for the detector.
ERROR state: This indicates the state machine identified an illegal byte sequence for that encoding. This will lead to an immediate negative answer for this encoding. The detector will exclude this encoding from consideration from here on.
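
A table-driven machine of this kind can be sketched in a few lines. The model below is a toy for a made-up encoding, and the indexing scheme (state * class_factor + byte class) is an assumption about how the class_table, class_factor, and state_table fields fit together; EXPECT_TRAIL and run_machine are hypothetical names.

```python
from enum import IntEnum


class MachineState(IntEnum):
    START = 0
    ERROR = 1
    ITS_ME = 2


EXPECT_TRAIL = 3  # extra intermediate state used only by this toy model

# Hypothetical encoding: ASCII bytes stand alone; a byte >= 0x80 must be
# followed by exactly one more byte >= 0x80.
TOY_MODEL = {
    "class_table": tuple(0 if b < 0x80 else 1 for b in range(256)),
    "class_factor": 2,  # number of byte classes
    "state_table": (
        MachineState.START, EXPECT_TRAIL,          # from START
        MachineState.ERROR, MachineState.ERROR,    # from ERROR
        MachineState.ITS_ME, MachineState.ITS_ME,  # from ITS_ME
        MachineState.ERROR, MachineState.START,    # from EXPECT_TRAIL
    ),
}


def run_machine(model, data: bytes) -> int:
    """Feed bytes one at a time; stop early on ERROR or ITS_ME."""
    state = int(MachineState.START)
    for byte in data:
        byte_class = model["class_table"][byte]
        state = int(model["state_table"][state * model["class_factor"] + byte_class])
        if state in (MachineState.ERROR, MachineState.ITS_ME):
            break
    return state
```

A well-formed input ends back in START, a lone lead byte followed by ASCII ends in ERROR, and a truncated input is left in the intermediate state.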

get_coding_state_machine() → str[source]

get_current_charlen() → int[source]

property language: str

next_state(c: int) → int[source]

reset() → None[source]

chardet.codingstatemachinedict module

class chardet.codingstatemachinedict.CodingStateMachineDict[source]

Bases: TypedDict

char_len_table: tuple[int, ...]

class_factor: int

class_table: tuple[int, ...]

language: str

name: str

state_table: tuple[int, ...]

Multi-byte encoding probers

chardet.mbcharsetprober module

class chardet.mbcharsetprober.MultiByteCharSetProber(lang_filter: ~chardet.enums.LanguageFilter = <LanguageFilter.ALL: 31>, encoding_era: ~chardet.enums.EncodingEra = <EncodingEra.ALL: 63>)[source]

Bases: CharSetProber

feed(byte_str: bytes | bytearray) → ProbingState[source]

get_confidence() → float[source]

reset() → None[source]

chardet.mbcsgroupprober module

class chardet.mbcsgroupprober.MBCSGroupProber(*, lang_filter: ~chardet.enums.LanguageFilter = <LanguageFilter.ALL: 31>, encoding_era: ~chardet.enums.EncodingEra = <EncodingEra.ALL: 63>)[source]: Bases: CharSetGroupProber

chardet.mbcssm module

chardet.mbcssm.BIG5_SM_MODEL: CodingStateMachineDict = {'char_len_table': (0, 1, 1, 2, 0), 'class_factor': 5, 'class_table': (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 0), 'name': 'Big5', 'state_table': (MachineState.ERROR, MachineState.START, MachineState.START, 3, MachineState.ERROR, MachineState.ERROR, MachineState.ERROR, MachineState.ERROR, MachineState.ERROR, MachineState.ERROR, MachineState.ITS_ME, MachineState.ITS_ME, MachineState.ITS_ME, MachineState.ITS_ME, MachineState.ITS_ME, MachineState.ERROR, MachineState.ERROR, MachineState.START, MachineState.START, MachineState.START, MachineState.START, MachineState.START, MachineState.START, MachineState.START)}

# Classes
0: Unused
1: 00-40, 5B-60, 7B-7F : Ascii
2: C7-FD
3: C9,FE : User-Defined Area
4: 41-52
5: 53-5A, 61-7A
6: 81-A0
7: A1-AC, B0-C5
8: AD-AF
9: C6

# Byte 1
Ascii: 00-7F : 1 + 4 + 5
State 3: 81-AC, B0-C5 : 6 + 7
State 4: AD-AF : 8
State 5: C6 : 9
State 6: C7-FE : 2 (+ 3)

# Byte 2
State 3: 41-5A, 61-7A, 81-FE : 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9
State 4: 41-5A, 61-7A, 81-A0 : 4 + 5 + 6
State 5: 41-52, A1-FE : 2 + 3 + 4 + 7 + 8 + 9
State 6: A1-FE : 2 + 3 + 7 + 8 + 9

chardet.utf8prober module

class chardet.utf8prober.UTF8Prober[source]

Bases: CharSetProber

ONE_CHAR_PROB = 0.5

property charset_name: str

feed(byte_str: bytes | bytearray) → ProbingState[source]

get_confidence() → float[source]

property language: str

reset() → None[source]

chardet.utf1632prober module

class chardet.utf1632prober.UTF1632Prober[source]

Bases: CharSetProber

This class simply looks for occurrences of zero bytes and infers whether the file is UTF-16 or UTF-32 (little-endian or big-endian). For instance, files looking like ( \0 \0 \0 [nonzero] )+ have a good probability of being UTF-32BE. Files looking like ( \0 [nonzero] )+ may be guessed to be UTF-16BE, and inversely for the little-endian varieties.
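
The zero-byte idea can be illustrated by counting how often each position within a 4-byte or 2-byte unit is zero. This is a simplified sketch for mostly-ASCII text, not the prober's logic; zero_profile, guess_utf_family, and the 0.9 cut-off are assumptions chosen for the example.

```python
from typing import List, Optional


def zero_profile(data: bytes, width: int) -> List[float]:
    """Fraction of zero bytes at each offset within units of the given width."""
    usable = len(data) - len(data) % width
    units = usable // width
    if units == 0:
        return [0.0] * width
    counts = [0] * width
    for i in range(usable):
        if data[i] == 0:
            counts[i % width] += 1
    return [c / units for c in counts]


def guess_utf_family(data: bytes, threshold: float = 0.9) -> Optional[str]:
    """Guess a BOM-less UTF-16/32 variant from where the zero bytes fall."""
    p4 = zero_profile(data, 4)
    if all(r >= threshold for r in p4[:3]) and p4[3] < 1 - threshold:
        return "UTF-32BE"  # \0 \0 \0 [nonzero]
    if all(r >= threshold for r in p4[1:]) and p4[0] < 1 - threshold:
        return "UTF-32LE"  # [nonzero] \0 \0 \0
    p2 = zero_profile(data, 2)
    if p2[0] >= threshold and p2[1] < 1 - threshold:
        return "UTF-16BE"  # \0 [nonzero]
    if p2[1] >= threshold and p2[0] < 1 - threshold:
        return "UTF-16LE"  # [nonzero] \0
    return None
```

The 32-bit patterns are checked first because UTF-32 text also shows zeros at alternating positions and would otherwise be mistaken for UTF-16.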

EXPECTED_RATIO = 0.94

MIN_CHARS_FOR_DETECTION = 20

MIN_RATIO = 0.08

approx_16bit_chars() → float[source]

approx_32bit_chars() → float[source]

property charset_name: str

feed(byte_str: bytes | bytearray) → ProbingState[source]

get_confidence() → float[source]

is_likely_utf16be() → bool[source]

is_likely_utf16le() → bool[source]

is_likely_utf32be() → bool[source]

is_likely_utf32le() → bool[source]

property language: str

reset() → None[source]

property state: ProbingState

validate_utf16_characters(pair: list[int]) → None[source]

Validate if the pair of bytes is valid UTF-16.

UTF-16 is valid in the range 0x0000 - 0xFFFF excluding 0xD800 - 0xDFFF, with an exception for surrogate pairs, which must be in the range 0xD800-0xDBFF followed by 0xDC00-0xDFFF.

https://en.wikipedia.org/wiki/UTF-16
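
The surrogate rule can be written as a check over a sequence of 16-bit code units. This is a standalone sketch of the rule itself, with a hypothetical function name; it works on whole code units, whereas the method above takes a pair of bytes.

```python
from typing import Sequence


def is_valid_utf16(units: Sequence[int]) -> bool:
    """Check that surrogates only appear as high (D800-DBFF) + low (DC00-DFFF) pairs."""
    i = 0
    while i < len(units):
        unit = units[i]
        if not 0 <= unit <= 0xFFFF:
            return False
        if 0xD800 <= unit <= 0xDBFF:
            # A high surrogate must be followed by a low surrogate.
            if i + 1 >= len(units) or not 0xDC00 <= units[i + 1] <= 0xDFFF:
                return False
            i += 2
            continue
        if 0xDC00 <= unit <= 0xDFFF:
            return False  # low surrogate without a preceding high surrogate
        i += 1
    return True
```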

validate_utf32_characters(quad: list[int]) → None[source]

Validate if the quad of bytes is valid UTF-32.

UTF-32 is valid in the range 0x00000000 - 0x0010FFFF excluding 0x0000D800 - 0x0000DFFF

https://en.wikipedia.org/wiki/UTF-32
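
The UTF-32 range rule reduces to two comparisons on the decoded code point. The helper below is a hypothetical illustration that assumes a big-endian quad of byte values; it is not the method's implementation.

```python
from typing import Sequence


def is_valid_utf32_quad(quad: Sequence[int]) -> bool:
    """Interpret four big-endian byte values as one code point and range-check it."""
    code_point = (quad[0] << 24) | (quad[1] << 16) | (quad[2] << 8) | quad[3]
    if code_point > 0x10FFFF:
        return False  # beyond the last Unicode code point
    return not 0xD800 <= code_point <= 0xDFFF  # surrogates are not valid in UTF-32
```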

chardet.big5prober module

class chardet.big5prober.Big5Prober[source]

Bases: MultiByteCharSetProber

property charset_name: str

property language: str

chardet.gb18030prober module

class chardet.gb18030prober.GB18030Prober[source]

Bases: MultiByteCharSetProber

property charset_name: str

property language: str

chardet.eucjpprober module

class chardet.eucjpprober.EUCJPProber[source]

Bases: MultiByteCharSetProber

property charset_name: str

feed(byte_str: bytes | bytearray) → ProbingState[source]

get_confidence() → float[source]

property language: str

reset() → None[source]

chardet.euckrprober module

class chardet.euckrprober.EUCKRProber[source]

Bases: MultiByteCharSetProber

property charset_name: str

property language: str

chardet.cp949prober module

class chardet.cp949prober.CP949Prober[source]

Bases: MultiByteCharSetProber

property charset_name: str

property language: str

chardet.sjisprober module

class chardet.sjisprober.SJISProber[source]

Bases: MultiByteCharSetProber

property charset_name: str

feed(byte_str: bytes | bytearray) → ProbingState[source]

get_confidence() → float[source]

property language: str

reset() → None[source]

chardet.johabprober module

class chardet.johabprober.JOHABProber[source]

Bases: MultiByteCharSetProber

property charset_name: str

property language: str

chardet.escprober module

class chardet.escprober.EscCharSetProber(lang_filter: ~chardet.enums.LanguageFilter = <LanguageFilter.ALL: 31>, encoding_era: ~chardet.enums.EncodingEra = <EncodingEra.ALL: 63>)[source]

Bases: CharSetProber

This CharSetProber uses a “code scheme” approach for detecting encodings, whereby easily recognizable escape or shift sequences are relied on to identify these encodings.

property charset_name: str | None

feed(byte_str: bytes | bytearray) → ProbingState[source]

get_confidence() → float[source]

property language: str | None

reset() → None[source]

chardet.escsm module

Single-byte encoding probers

chardet.sbcharsetprober module

class chardet.sbcharsetprober.SingleByteCharSetModel(charset_name, language, char_to_order_map, language_model, typical_positive_ratio, keep_ascii_letters, alphabet)[source]

Bases: NamedTuple

alphabet: str: Alias for field number 6

char_to_order_map: Mapping[int, CharacterCategory | int]: Alias for field number 2

charset_name: str: Alias for field number 0

keep_ascii_letters: bool: Alias for field number 5

language: str: Alias for field number 1

language_model: Mapping[int, Mapping[int, SequenceLikelihood | int]]: Alias for field number 3

typical_positive_ratio: float: Alias for field number 4

class chardet.sbcharsetprober.SingleByteCharSetProber(model: SingleByteCharSetModel, is_reversed: bool = False, name_prober: CharSetProber | None = None)[source]

Bases: CharSetProber

NEGATIVE_SHORTCUT_THRESHOLD = 0.05

POSITIVE_SHORTCUT_THRESHOLD = 0.95

SB_ENOUGH_REL_THRESHOLD = 1024

property charset_name: str | None

feed(byte_str: bytes | bytearray) → ProbingState[source]

get_confidence() → float[source]

property language: str | None

reset() → None[source]

chardet.sbcsgroupprober module

class chardet.sbcsgroupprober.SBCSGroupProber(lang_filter: ~chardet.enums.LanguageFilter = <LanguageFilter.ALL: 31>, encoding_era: ~chardet.enums.EncodingEra = <EncodingEra.MODERN_WEB: 1>)[source]

Bases: CharSetGroupProber

feed(byte_str: bytes | bytearray) → ProbingState[source]

get_confidence() → float[source]

reset() → None[source]

chardet.hebrewprober module

class chardet.hebrewprober.HebrewProber[source]

Bases: CharSetProber

FINAL_KAF = 234

FINAL_MEM = 237

FINAL_NUN = 239

FINAL_PE = 243

FINAL_TSADI = 245

LOGICAL_HEBREW_NAME = 'WINDOWS-1255'

MIN_FINAL_CHAR_DISTANCE = 5

MIN_MODEL_DISTANCE = 0.01

NORMAL_KAF = 235

NORMAL_MEM = 238

NORMAL_NUN = 240

NORMAL_PE = 244

NORMAL_TSADI = 246

SPACE = 32

VISUAL_HEBREW_NAME = 'ISO-8859-8'

property charset_name: str

feed(byte_str: bytes | bytearray) → ProbingState[source]

is_final(c: int) → bool[source]

is_non_final(c: int) → bool[source]

property language: str

reset() → None[source]

set_model_probers(logical_prober: SingleByteCharSetProber, visual_prober: SingleByteCharSetProber) → None[source]

property state: ProbingState

Analysis modules

chardet.chardistribution module

class chardet.chardistribution.Big5DistributionAnalysis[source]

Bases: CharDistributionAnalysis

get_order(byte_str: bytes | bytearray) → int[source]

class chardet.chardistribution.CharDistributionAnalysis[source]

Bases: object

ENOUGH_DATA_THRESHOLD = 1024

MINIMUM_DATA_THRESHOLD = 3

SURE_NO = 0.01

SURE_YES = 0.99

feed(char: bytes | bytearray, char_len: int) → None[source]: feed a character with known length

get_confidence() → float[source]: return confidence based on existing data

get_order(_: bytes | bytearray) → int[source]

got_enough_data() → bool[source]

reset() → None[source]: reset analyser, clear any state

class chardet.chardistribution.EUCJPDistributionAnalysis[source]

Bases: CharDistributionAnalysis

get_order(byte_str: bytes | bytearray) → int[source]

class chardet.chardistribution.EUCKRDistributionAnalysis[source]

Bases: CharDistributionAnalysis

get_order(byte_str: bytes | bytearray) → int[source]

class chardet.chardistribution.GB2312DistributionAnalysis[source]

Bases: CharDistributionAnalysis

get_order(byte_str: bytes | bytearray) → int[source]

class chardet.chardistribution.JOHABDistributionAnalysis[source]

Bases: CharDistributionAnalysis

get_order(byte_str: bytes | bytearray) → int[source]

class chardet.chardistribution.SJISDistributionAnalysis[source]

Bases: CharDistributionAnalysis

get_order(byte_str: bytes | bytearray) → int[source]

chardet.jpcntx module

class chardet.jpcntx.EUCJPContextAnalysis[source]

Bases: JapaneseContextAnalysis

get_order(byte_str: bytes | bytearray) → tuple[int, int][source]

class chardet.jpcntx.JapaneseContextAnalysis[source]

Bases: object

DONT_KNOW = -1

ENOUGH_REL_THRESHOLD = 100

MAX_REL_THRESHOLD = 1000

MINIMUM_DATA_THRESHOLD = 4

NUM_OF_CATEGORY = 6

feed(byte_str: bytes | bytearray, num_bytes: int) → None[source]

get_confidence() → float[source]

get_order(_: bytes | bytearray) → tuple[int, int][source]

got_enough_data() → bool[source]

reset() → None[source]

class chardet.jpcntx.SJISContextAnalysis[source]

Bases: JapaneseContextAnalysis

property charset_name: str

get_order(byte_str: bytes | bytearray) → tuple[int, int][source]

Frequency tables

chardet.big5freq module

chardet.euckrfreq module

chardet.gb2312freq module

chardet.jisfreq module

chardet.johabfreq module

Language models

These modules contain bigram frequency models for single-byte encoding detection. They are generated by create_language_model.py and should not be edited manually.

CLI module

chardet.cli.chardetect module

Script which takes one or more file paths and reports on their detected encodings

Example:

% chardetect somefile someotherfile
somefile: windows-1252 with confidence 0.5
someotherfile: ascii with confidence 1.0

If no paths are provided, it takes its input from stdin.

chardet.cli.chardetect.description_of(lines: ~collections.abc.Iterable[bytes], name: str = 'stdin', minimal: bool = False, should_rename_legacy: bool = False, encoding_era: ~chardet.enums.EncodingEra = <EncodingEra.MODERN_WEB: 1>) → str | None[source]

Return a string describing the probable encoding of a file or list of strings.

Parameters:

lines (Iterable of bytes) – The lines to get the encoding of.
name (str) – Name of file or collection of lines
should_rename_legacy (bool) – Should we rename legacy encodings to their more modern equivalents?
encoding_era (EncodingEra) – Which era of encodings to consider during detection.

chardet.cli.chardetect.main(argv: list[str] | None = None) → None[source]

Handles command line arguments and gets things started.

Parameters:: argv (list of str) – List of arguments, as if specified on the command-line. If None, sys.argv[1:] is used instead.

Metadata subpackage

chardet.metadata.charsets module

Metadata about charsets used by our model training code and test file generation code. Could be used for other things in the future.

class chardet.metadata.charsets.Charset(name: str, is_multi_byte: bool, encoding_era: EncodingEra, language_filter: LanguageFilter)[source]

Bases: object

Metadata about charsets useful for training models and generating test files.

encoding_era: EncodingEra

is_multi_byte: bool

language_filter: LanguageFilter

name: str

chardet.metadata.charsets.get_charset(encoding_name: str) → Charset[source]

Get the Charset metadata for a given encoding name.

Parameters:: encoding_name – The encoding name to look up
Returns:: The Charset for this encoding, defaults to a MODERN_WEB charset if unknown

chardet.metadata.charsets.is_unicode_encoding(encoding_name: str) → bool[source]

Check if an encoding is a Unicode encoding (UTF-8, UTF-16, UTF-32).

Parameters:: encoding_name – The encoding name to check
Returns:: True if the encoding is Unicode, False otherwise

chardet.metadata.languages module

Metadata about languages used by our model training code for our SingleByteCharSetProbers. Could be used for other things in the future.

This code was originally based on the language metadata from the uchardet project.

class chardet.metadata.languages.Language(name: str, iso_code: str, use_ascii: bool, charsets: list[str], alphabet: str, num_training_docs: int | None = None, num_training_chars: int | None = None)[source]

Bases: object

Metadata about a language useful for training models

Variables:

name – The human name for the language, in English.
iso_code – 2-letter ISO 639-1 if possible, 3-letter ISO code otherwise, or use another catalog as a last resort.
use_ascii – Whether or not ASCII letters should be included in trained models.
charsets – The charsets we want to support and create data for.
alphabet – The characters in the language’s alphabet. If use_ascii is True, you only need to add those not in the ASCII set.
num_training_docs – Number of documents from CulturaX to use for training. This represents approximately 300M characters of training data. None means the count hasn’t been determined yet.
num_training_chars – Number of characters from CulturaX used for training. The goal is for this to be at least 300M characters, but some languages may not have that much data available. None means the count hasn’t been determined yet.

alphabet: str

charsets: list[str]

iso_code: str

name: str

num_training_chars: int | None = None

num_training_docs: int | None = None

use_ascii: bool