chardet package

Submodules

chardet.big5freq module

chardet.big5prober module

class chardet.big5prober.Big5Prober

Bases: chardet.mbcharsetprober.MultiByteCharSetProber

charset_name

chardet.chardetect module

chardet.chardistribution module

class chardet.chardistribution.Big5DistributionAnalysis

Bases: chardet.chardistribution.CharDistributionAnalysis

get_order(byte_str)
class chardet.chardistribution.CharDistributionAnalysis

Bases: object

ENOUGH_DATA_THRESHOLD = 1024
MINIMUM_DATA_THRESHOLD = 3
SURE_NO = 0.01
SURE_YES = 0.99
feed(char, char_len)

Feed a character with known length.

get_confidence()

Return confidence based on existing data.

get_order(byte_str)
got_enough_data()
reset()

Reset the analyser and clear any state.
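
The distribution analysers below all share one idea: map each character to its position ("order") in a per-language frequency table and measure how often characters fall into the most-common band. A minimal stdlib-only sketch of that idea, reusing the class constants listed above (the class name and the `frequent_orders` parameter are hypothetical; the real analysers consult per-encoding frequency tables):

```python
class ToyDistributionAnalysis:
    # Thresholds mirror the constants documented above.
    ENOUGH_DATA_THRESHOLD = 1024
    MINIMUM_DATA_THRESHOLD = 3
    SURE_NO = 0.01
    SURE_YES = 0.99

    def __init__(self, frequent_orders):
        # frequent_orders: frequency-table positions considered "common"
        self._frequent = frequent_orders
        self.reset()

    def reset(self):
        """Reset the analyser and clear any state."""
        self._total_chars = 0
        self._freq_chars = 0

    def feed(self, order):
        """Feed one character, already mapped to its frequency order."""
        if order < 0:          # character not in the frequency table
            return
        self._total_chars += 1
        if order in self._frequent:
            self._freq_chars += 1

    def got_enough_data(self):
        return self._total_chars > self.ENOUGH_DATA_THRESHOLD

    def get_confidence(self):
        """Return confidence based on existing data."""
        if self._total_chars < self.MINIMUM_DATA_THRESHOLD:
            return self.SURE_NO
        return min(self._freq_chars / self._total_chars, self.SURE_YES)
```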

class chardet.chardistribution.EUCJPDistributionAnalysis

Bases: chardet.chardistribution.CharDistributionAnalysis

get_order(byte_str)
class chardet.chardistribution.EUCKRDistributionAnalysis

Bases: chardet.chardistribution.CharDistributionAnalysis

get_order(byte_str)
class chardet.chardistribution.EUCTWDistributionAnalysis

Bases: chardet.chardistribution.CharDistributionAnalysis

get_order(byte_str)
class chardet.chardistribution.GB2312DistributionAnalysis

Bases: chardet.chardistribution.CharDistributionAnalysis

get_order(byte_str)
class chardet.chardistribution.SJISDistributionAnalysis

Bases: chardet.chardistribution.CharDistributionAnalysis

get_order(byte_str)

chardet.charsetgroupprober module

class chardet.charsetgroupprober.CharSetGroupProber(lang_filter=None)

Bases: chardet.charsetprober.CharSetProber

charset_name
feed(byte_str)
get_confidence()
reset()

chardet.charsetprober module

class chardet.charsetprober.CharSetProber(lang_filter=None)

Bases: object

SHORTCUT_THRESHOLD = 0.95
charset_name
feed(buf)
static filter_high_byte_only(buf)
static filter_international_words(buf)

We define three types of bytes:

- alphabet: English letters [a-zA-Z]
- international: international characters [€-ÿ]
- marker: everything else [^a-zA-Z€-ÿ]

The input buffer can be thought of as a series of words delimited by markers. This function retains only the words that contain at least one international character; all contiguous sequences of markers are replaced by a single ASCII space.

This filter applies to all scripts that do not use English characters.
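
A stdlib-only sketch of that contract (not chardet's actual implementation): treat bytes 0x80-0xFF as "international", split on markers, and keep only the words containing at least one international byte:

```python
import re

def filter_international_words(buf: bytes) -> bytes:
    # Words are runs of ASCII letters and high bytes; everything else
    # (the "markers") delimits them.
    words = re.split(b'[^a-zA-Z\x80-\xff]+', buf)
    # Keep only words with at least one international (high) byte,
    # rejoined with a single ASCII space where the markers were.
    kept = [w for w in words if re.search(b'[\x80-\xff]', w)]
    return b' '.join(kept)
```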

static filter_with_english_letters(buf)

Returns a copy of buf that retains only the sequences of English-alphabet and high-byte characters that are not between < and > characters. English-alphabet and high-byte characters immediately preceding a > are also retained.

This filter can be applied to all scripts that contain both English characters and extended ASCII characters, but it is currently only used by Latin1Prober.
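
A sketch of that behaviour (modelled on, but not guaranteed to be identical to, the real implementation): copy out runs of letters and high bytes, dropping anything that falls inside <...> tags:

```python
def filter_with_english_letters(buf: bytes) -> bytes:
    filtered = bytearray()
    in_tag = False
    prev = 0
    for curr in range(len(buf)):
        c = buf[curr:curr + 1]
        # Track whether we are entering or leaving an HTML-style tag.
        if c == b'>':
            in_tag = False
        elif c == b'<':
            in_tag = True
        # Any ASCII non-letter ends the current run of kept characters.
        if c < b'\x80' and not c.isalpha():
            if curr > prev and not in_tag:
                filtered.extend(buf[prev:curr])
                filtered.extend(b' ')
            prev = curr + 1
    if not in_tag:
        filtered.extend(buf[prev:])
    return bytes(filtered)
```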

get_confidence()
reset()
state

chardet.codingstatemachine module

class chardet.codingstatemachine.CodingStateMachine(sm)

Bases: object

A state machine to verify a byte sequence for a particular encoding. For each byte the detector receives, it feeds that byte to every active state machine, one byte at a time. Each state machine changes state based on its previous state and the byte it receives. Three states are of interest to an auto-detector:

START state: the starting state, reached again whenever a legal byte sequence (i.e. a valid code point) for a character has been identified.
ME state: the state machine has identified a byte sequence that is specific to the charset it is designed for and that no other possible encoding can contain. This leads to an immediate positive answer for the detector.
ERROR state: the state machine has identified an illegal byte sequence for that encoding. This leads to an immediate negative answer, and the detector excludes this encoding from consideration from here on.
get_coding_state_machine()
get_current_charlen()
next_state(c)
reset()
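
The three states can be illustrated with a hand-rolled machine for one encoding. This sketch (not chardet's table-driven implementation) validates UTF-8 lead and continuation bytes, treating a completed multi-byte sequence as the ME-style positive signal:

```python
START, ERROR, ITS_ME = 0, 1, 2

class ToyUTF8StateMachine:
    def __init__(self):
        self.reset()

    def reset(self):
        self.state = START
        self._pending = 0   # continuation bytes still expected

    def next_state(self, byte):
        if self.state == ERROR:
            return ERROR                    # errors are terminal
        if self._pending:
            if 0x80 <= byte <= 0xBF:        # valid continuation byte
                self._pending -= 1
                if self._pending == 0:
                    self.state = ITS_ME     # a full multi-byte char seen
            else:
                self.state = ERROR          # illegal sequence for UTF-8
        elif byte < 0x80:
            self.state = START              # plain ASCII: still legal
        elif 0xC2 <= byte <= 0xDF:
            self._pending = 1               # 2-byte lead
        elif 0xE0 <= byte <= 0xEF:
            self._pending = 2               # 3-byte lead
        elif 0xF0 <= byte <= 0xF4:
            self._pending = 3               # 4-byte lead
        else:
            self.state = ERROR              # 0x80-0xC1, 0xF5-0xFF cannot lead
        return self.state
```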

chardet.compat module

chardet.compat.wrap_ord(a)

chardet.constants module

chardet.cp949prober module

class chardet.cp949prober.CP949Prober

Bases: chardet.mbcharsetprober.MultiByteCharSetProber

charset_name

chardet.escprober module

class chardet.escprober.EscCharSetProber(lang_filter=None)

Bases: chardet.charsetprober.CharSetProber

This CharSetProber uses a “code scheme” approach for detecting encodings, whereby easily recognizable escape or shift sequences are relied on to identify these encodings.

charset_name
feed(byte_str)
get_confidence()
reset()
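
For illustration, a few of the well-known escape and shift sequences this approach keys on (the helper below is a sketch, not the prober's API):

```python
# Well-known escape/shift sequences and the encodings they imply.
ESC_SEQUENCES = {
    b'\x1b$B': 'ISO-2022-JP',    # ESC $ B: switch to JIS X 0208
    b'\x1b$)C': 'ISO-2022-KR',   # ESC $ ) C: designate KS X 1001
    b'~{': 'HZ-GB-2312',         # HZ shift-in sequence
}

def sniff_escape_encoding(buf: bytes):
    """Return the first encoding whose signature sequence appears in buf."""
    for seq, name in ESC_SEQUENCES.items():
        if seq in buf:
            return name
    return None
```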

chardet.escsm module

chardet.eucjpprober module

class chardet.eucjpprober.EUCJPProber

Bases: chardet.mbcharsetprober.MultiByteCharSetProber

charset_name
feed(byte_str)
get_confidence()
reset()

chardet.euckrfreq module

chardet.euckrprober module

class chardet.euckrprober.EUCKRProber

Bases: chardet.mbcharsetprober.MultiByteCharSetProber

charset_name

chardet.euctwfreq module

chardet.euctwprober module

class chardet.euctwprober.EUCTWProber

Bases: chardet.mbcharsetprober.MultiByteCharSetProber

charset_name

chardet.gb2312freq module

chardet.gb2312prober module

class chardet.gb2312prober.GB2312Prober

Bases: chardet.mbcharsetprober.MultiByteCharSetProber

charset_name

chardet.hebrewprober module

class chardet.hebrewprober.HebrewProber

Bases: chardet.charsetprober.CharSetProber

FINAL_KAF = 234
FINAL_MEM = 237
FINAL_NUN = 239
FINAL_PE = 243
FINAL_TSADI = 245
LOGICAL_HEBREW_NAME = 'windows-1255'
MIN_FINAL_CHAR_DISTANCE = 5
MIN_MODEL_DISTANCE = 0.01
NORMAL_KAF = 235
NORMAL_MEM = 238
NORMAL_NUN = 240
NORMAL_PE = 244
NORMAL_TSADI = 246
VISUAL_HEBREW_NAME = 'ISO-8859-8'
charset_name
feed(byte_str)
is_final(c)
is_non_final(c)
reset()
set_model_probers(logicalProber, visualProber)
state
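
The constants above encode the key trick: windows-1255 text is stored in logical (reading) order, so the five final-form letters land at the ends of words, while visual ISO-8859-8 text is stored pre-reversed, so they land at the beginnings. A rough stdlib-only sketch of that scoring (the function name is hypothetical; the real prober is more careful):

```python
# Byte values of the five final-form letters (FINAL_KAF ... FINAL_TSADI above).
FINAL_FORMS = {234, 237, 239, 243, 245}

def guess_hebrew_order(words):
    """words: sequences of byte values; returns the likelier charset name."""
    logical = visual = 0
    for word in words:
        if not word:
            continue
        if word[-1] in FINAL_FORMS:
            logical += 1      # final form at word end: logical order
        if word[0] in FINAL_FORMS:
            visual += 1       # final form at word start: reversed text
    return 'windows-1255' if logical >= visual else 'ISO-8859-8'
```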

chardet.jisfreq module

chardet.jpcntx module

class chardet.jpcntx.EUCJPContextAnalysis

Bases: chardet.jpcntx.JapaneseContextAnalysis

get_order(byte_str)
class chardet.jpcntx.JapaneseContextAnalysis

Bases: object

DONT_KNOW = -1
ENOUGH_REL_THRESHOLD = 100
MAX_REL_THRESHOLD = 1000
MINIMUM_DATA_THRESHOLD = 4
NUM_OF_CATEGORY = 6
feed(byte_str, num_bytes)
get_confidence()
get_order(byte_str)
got_enough_data()
reset()
class chardet.jpcntx.SJISContextAnalysis

Bases: chardet.jpcntx.JapaneseContextAnalysis

charset_name
get_order(byte_str)

chardet.langbulgarianmodel module

chardet.langcyrillicmodel module

chardet.langgreekmodel module

chardet.langhebrewmodel module

chardet.langhungarianmodel module

chardet.langthaimodel module

chardet.latin1prober module

class chardet.latin1prober.Latin1Prober

Bases: chardet.charsetprober.CharSetProber

charset_name
feed(byte_str)
get_confidence()
reset()

chardet.mbcharsetprober module

class chardet.mbcharsetprober.MultiByteCharSetProber(lang_filter=None)

Bases: chardet.charsetprober.CharSetProber

charset_name
feed(byte_str)
get_confidence()
reset()

chardet.mbcsgroupprober module

class chardet.mbcsgroupprober.MBCSGroupProber(lang_filter=None)

Bases: chardet.charsetgroupprober.CharSetGroupProber

chardet.mbcssm module

chardet.sbcharsetprober module

class chardet.sbcharsetprober.SingleByteCharSetProber(model, reversed=False, name_prober=None)

Bases: chardet.charsetprober.CharSetProber

NEGATIVE_SHORTCUT_THRESHOLD = 0.05
NUMBER_OF_SEQ_CAT = 4
POSITIVE_CAT = 3
POSITIVE_SHORTCUT_THRESHOLD = 0.95
SAMPLE_SIZE = 64
SB_ENOUGH_REL_THRESHOLD = 1024
SYMBOL_CAT_ORDER = 250
charset_name
feed(byte_str)
get_confidence()
reset()
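
The constants above hint at how this prober scores text: adjacent characters are looked up as pairs, and the fraction of pairs falling in the most "positive" (common for the language) category drives the confidence. A stdlib-only sketch of that idea (the class name and `positive_pairs` parameter are hypothetical; the real prober uses per-language sequence-category tables):

```python
class ToySingleByteProber:
    POSITIVE_SHORTCUT_THRESHOLD = 0.95
    NEGATIVE_SHORTCUT_THRESHOLD = 0.05

    def __init__(self, positive_pairs):
        # positive_pairs: set of (order, order) pairs common in the language
        self._positive = positive_pairs
        self.reset()

    def reset(self):
        self._total_seqs = 0
        self._positive_seqs = 0
        self._last_order = None

    def feed(self, orders):
        # orders: characters already mapped to frequency-table positions
        for order in orders:
            if self._last_order is not None:
                self._total_seqs += 1
                if (self._last_order, order) in self._positive:
                    self._positive_seqs += 1
            self._last_order = order

    def get_confidence(self):
        if not self._total_seqs:
            return 0.01
        return self._positive_seqs / self._total_seqs
```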

chardet.sbcsgroupprober module

class chardet.sbcsgroupprober.SBCSGroupProber

Bases: chardet.charsetgroupprober.CharSetGroupProber

chardet.sjisprober module

class chardet.sjisprober.SJISProber

Bases: chardet.mbcharsetprober.MultiByteCharSetProber

charset_name
feed(byte_str)
get_confidence()
reset()

chardet.universaldetector module

Module containing the UniversalDetector detector class, which is the primary class a user of chardet should use.

author: Mark Pilgrim (initial port to Python)
author: Shy Shalom (original C code)
author: Dan Blanchard (major refactoring for 3.0)
author: Ian Cordasco
class chardet.universaldetector.UniversalDetector(lang_filter=31)

Bases: object

The UniversalDetector class underlies the chardet.detect function and coordinates all of the different charset probers.

To get a dict containing an encoding and its confidence, you can simply run:

from chardet.universaldetector import UniversalDetector

u = UniversalDetector()
u.feed(some_bytes)
u.close()
detected = u.result
ESC_DETECTOR = re.compile(b'(\x1b|~{)')
HIGH_BYTE_DETECTOR = re.compile(b'[\x80-\xff]')
MINIMUM_THRESHOLD = 0.2
close()

Stop analyzing the current document and come up with a final prediction.

Returns: The result attribute if a prediction was made, otherwise None.
feed(byte_str)

Takes a chunk of a document and feeds it through all of the relevant charset probers.

After calling feed, you can check the value of the done attribute to see if you need to continue feeding the UniversalDetector more data, or if it has made a prediction (in the result attribute).

Note

You should always call close when you’re done feeding in your document if done is not already True.

reset()

Reset the UniversalDetector and all of its probers back to their initial states. This is called by __init__, so you only need to call this directly in between analyses of different documents.
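
Putting feed, done, and close together, a typical incremental run looks like this (the sample text is arbitrary; any byte source works, and chunk boundaries may split multi-byte characters, which feed handles):

```python
from chardet.universaldetector import UniversalDetector

data = "Der Fluss ist größer und schöner als je zuvor.".encode("utf-8")

detector = UniversalDetector()
# Feed the document in small chunks, as you would while reading a file.
for start in range(0, len(data), 16):
    detector.feed(data[start:start + 16])
    if detector.done:        # a confident prediction was reached early
        break
detector.close()             # always close if done never became True
result = detector.result     # dict with 'encoding' and 'confidence'
```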

chardet.utf8prober module

class chardet.utf8prober.UTF8Prober

Bases: chardet.charsetprober.CharSetProber

ONE_CHAR_PROB = 0.5
charset_name
feed(byte_str)
get_confidence()
reset()

Module contents

chardet.detect(byte_str)
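
The one-call convenience wrapper around UniversalDetector: it feeds the whole byte string at once, closes the detector, and returns the result dict.

```python
import chardet

raw = "Coração, ação, não: a detecção funciona.".encode("utf-8")
result = chardet.detect(raw)
# result is a dict such as {'encoding': 'utf-8', 'confidence': 0.87, ...}
print(result['encoding'], result['confidence'])
```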