chardet package

Submodules

chardet.big5freq module

chardet.big5prober module

class chardet.big5prober.Big5Prober

Bases: chardet.mbcharsetprober.MultiByteCharSetProber

charset_name

chardet.chardetect module

chardet.chardistribution module

class chardet.chardistribution.Big5DistributionAnalysis

Bases: chardet.chardistribution.CharDistributionAnalysis

get_order(byte_str)
class chardet.chardistribution.CharDistributionAnalysis

Bases: object

ENOUGH_DATA_THRESHOLD = 1024
MINIMUM_DATA_THRESHOLD = 3
SURE_NO = 0.01
SURE_YES = 0.99
feed(char, char_len)

Feed a character with known length.

get_confidence()

Return confidence based on existing data.

get_order(byte_str)
got_enough_data()
reset()

Reset the analyser and clear any state.
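
The distribution analysers below all share one idea: map each character to its position ("order") in a per-language frequency table and measure how often characters fall into the most-common band. A minimal stdlib-only sketch of that idea, reusing the class constants listed above (the class name and the `frequent_orders` parameter are hypothetical; the real analysers consult per-encoding frequency tables):

```python
class ToyDistributionAnalysis:
    # Thresholds mirror the constants documented above.
    ENOUGH_DATA_THRESHOLD = 1024
    MINIMUM_DATA_THRESHOLD = 3
    SURE_NO = 0.01
    SURE_YES = 0.99

    def __init__(self, frequent_orders):
        # frequent_orders: frequency-table positions considered "common"
        self._frequent = frequent_orders
        self.reset()

    def reset(self):
        """Reset the analyser and clear any state."""
        self._total_chars = 0
        self._freq_chars = 0

    def feed(self, order):
        """Feed one character, already mapped to its frequency order."""
        if order < 0:          # character not in the frequency table
            return
        self._total_chars += 1
        if order in self._frequent:
            self._freq_chars += 1

    def got_enough_data(self):
        return self._total_chars > self.ENOUGH_DATA_THRESHOLD

    def get_confidence(self):
        """Return confidence based on existing data."""
        if self._total_chars < self.MINIMUM_DATA_THRESHOLD:
            return self.SURE_NO
        return min(self._freq_chars / self._total_chars, self.SURE_YES)
```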

class chardet.chardistribution.EUCJPDistributionAnalysis

Bases: chardet.chardistribution.CharDistributionAnalysis

get_order(byte_str)
class chardet.chardistribution.EUCKRDistributionAnalysis

Bases: chardet.chardistribution.CharDistributionAnalysis

get_order(byte_str)
class chardet.chardistribution.EUCTWDistributionAnalysis

Bases: chardet.chardistribution.CharDistributionAnalysis

get_order(byte_str)
class chardet.chardistribution.GB2312DistributionAnalysis

Bases: chardet.chardistribution.CharDistributionAnalysis

get_order(byte_str)
class chardet.chardistribution.SJISDistributionAnalysis

Bases: chardet.chardistribution.CharDistributionAnalysis

get_order(byte_str)

chardet.charsetgroupprober module

class chardet.charsetgroupprober.CharSetGroupProber(lang_filter=None)

Bases: chardet.charsetprober.CharSetProber

charset_name
feed(byte_str)
get_confidence()
reset()

chardet.charsetprober module

class chardet.charsetprober.CharSetProber(lang_filter=None)

Bases: object

SHORTCUT_THRESHOLD = 0.95
charset_name
feed(buf)
static filter_high_byte_only(buf)
static filter_international_words(buf)

We define three types of bytes:

- alphabet: English letters [a-zA-Z]
- international: international characters [€-ÿ]
- marker: everything else [^a-zA-Z€-ÿ]

The input buffer can be thought of as a series of words delimited by markers. This function retains only the words that contain at least one international character; all contiguous sequences of markers are replaced by a single ASCII space.

This filter applies to all scripts that do not use English characters.
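
A stdlib-only sketch of that contract (not chardet's actual implementation): treat bytes 0x80-0xFF as "international", split on markers, and keep only the words containing at least one international byte:

```python
import re

def filter_international_words(buf: bytes) -> bytes:
    # Words are runs of ASCII letters and high bytes; everything else
    # (the "markers") delimits them.
    words = re.split(b'[^a-zA-Z\x80-\xff]+', buf)
    # Keep only words with at least one international (high) byte,
    # rejoined with a single ASCII space where the markers were.
    kept = [w for w in words if re.search(b'[\x80-\xff]', w)]
    return b' '.join(kept)
```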

static filter_with_english_letters(buf)

Returns a copy of buf that retains only the sequences of English-alphabet and high-byte characters that are not between < and > characters. English-alphabet and high-byte characters immediately preceding a > are also retained.

This filter can be applied to all scripts that contain both English characters and extended ASCII characters, but it is currently only used by Latin1Prober.
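
A sketch of that behaviour (modelled on, but not guaranteed to be identical to, the real implementation): copy out runs of letters and high bytes, dropping anything that falls inside <...> tags:

```python
def filter_with_english_letters(buf: bytes) -> bytes:
    filtered = bytearray()
    in_tag = False
    prev = 0
    for curr in range(len(buf)):
        c = buf[curr:curr + 1]
        # Track whether we are entering or leaving an HTML-style tag.
        if c == b'>':
            in_tag = False
        elif c == b'<':
            in_tag = True
        # Any ASCII non-letter ends the current run of kept characters.
        if c < b'\x80' and not c.isalpha():
            if curr > prev and not in_tag:
                filtered.extend(buf[prev:curr])
                filtered.extend(b' ')
            prev = curr + 1
    if not in_tag:
        filtered.extend(buf[prev:])
    return bytes(filtered)
```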

get_confidence()
reset()
state

chardet.codingstatemachine module

class chardet.codingstatemachine.CodingStateMachine(sm)

Bases: object

A state machine to verify a byte sequence for a particular encoding. For each byte the detector receives, it feeds that byte to every active state machine, one byte at a time. Each state machine changes state based on its previous state and the byte it receives. Three states are of interest to an auto-detector:

START state: the starting state, reached again whenever a legal byte sequence (i.e. a valid code point) for a character has been identified.
ME state: the state machine has identified a byte sequence that is specific to the charset it is designed for and that no other possible encoding can contain. This leads to an immediate positive answer for the detector.
ERROR state: the state machine has identified an illegal byte sequence for that encoding. This leads to an immediate negative answer, and the detector excludes this encoding from consideration from here on.
get_coding_state_machine()
get_current_charlen()
next_state(c)
reset()
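
The three states can be illustrated with a hand-rolled machine for one encoding. This sketch (not chardet's table-driven implementation) validates UTF-8 lead and continuation bytes, treating a completed multi-byte sequence as the ME-style positive signal:

```python
START, ERROR, ITS_ME = 0, 1, 2

class ToyUTF8StateMachine:
    def __init__(self):
        self.reset()

    def reset(self):
        self.state = START
        self._pending = 0   # continuation bytes still expected

    def next_state(self, byte):
        if self.state == ERROR:
            return ERROR                    # errors are terminal
        if self._pending:
            if 0x80 <= byte <= 0xBF:        # valid continuation byte
                self._pending -= 1
                if self._pending == 0:
                    self.state = ITS_ME     # a full multi-byte char seen
            else:
                self.state = ERROR          # illegal sequence for UTF-8
        elif byte < 0x80:
            self.state = START              # plain ASCII: still legal
        elif 0xC2 <= byte <= 0xDF:
            self._pending = 1               # 2-byte lead
        elif 0xE0 <= byte <= 0xEF:
            self._pending = 2               # 3-byte lead
        elif 0xF0 <= byte <= 0xF4:
            self._pending = 3               # 4-byte lead
        else:
            self.state = ERROR              # 0x80-0xC1, 0xF5-0xFF cannot lead
        return self.state
```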

chardet.compat module

chardet.compat.wrap_ord(a)

chardet.constants module

chardet.cp949prober module

class chardet.cp949prober.CP949Prober

Bases: chardet.mbcharsetprober.MultiByteCharSetProber

charset_name

chardet.escprober module

class chardet.escprober.EscCharSetProber(lang_filter=None)

Bases: chardet.charsetprober.CharSetProber

This CharSetProber uses a “code scheme” approach for detecting encodings, whereby easily recognizable escape or shift sequences are relied on to identify these encodings.

charset_name
feed(byte_str)
get_confidence()
reset()
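
For illustration, a few of the well-known escape and shift sequences this approach keys on (the helper below is a sketch, not the prober's API):

```python
# Well-known escape/shift sequences and the encodings they imply.
ESC_SEQUENCES = {
    b'\x1b$B': 'ISO-2022-JP',    # ESC $ B: switch to JIS X 0208
    b'\x1b$)C': 'ISO-2022-KR',   # ESC $ ) C: designate KS X 1001
    b'~{': 'HZ-GB-2312',         # HZ shift-in sequence
}

def sniff_escape_encoding(buf: bytes):
    """Return the first encoding whose signature sequence appears in buf."""
    for seq, name in ESC_SEQUENCES.items():
        if seq in buf:
            return name
    return None
```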

chardet.escsm module

chardet.eucjpprober module

class chardet.eucjpprober.EUCJPProber

Bases: chardet.mbcharsetprober.MultiByteCharSetProber

charset_name
feed(byte_str)
get_confidence()
reset()

chardet.euckrfreq module

chardet.euckrprober module

class chardet.euckrprober.EUCKRProber

Bases: chardet.mbcharsetprober.MultiByteCharSetProber

charset_name

chardet.euctwfreq module

chardet.euctwprober module

class chardet.euctwprober.EUCTWProber

Bases: chardet.mbcharsetprober.MultiByteCharSetProber

charset_name

chardet.gb2312freq module

chardet.gb2312prober module

class chardet.gb2312prober.GB2312Prober

Bases: chardet.mbcharsetprober.MultiByteCharSetProber

charset_name

chardet.hebrewprober module

class chardet.hebrewprober.HebrewProber

Bases: chardet.charsetprober.CharSetProber

FINAL_KAF = 234
FINAL_MEM = 237
FINAL_NUN = 239
FINAL_PE = 243
FINAL_TSADI = 245
LOGICAL_HEBREW_NAME = 'windows-1255'
MIN_FINAL_CHAR_DISTANCE = 5
MIN_MODEL_DISTANCE = 0.01
NORMAL_KAF = 235
NORMAL_MEM = 238
NORMAL_NUN = 240
NORMAL_PE = 244
NORMAL_TSADI = 246
VISUAL_HEBREW_NAME = 'ISO-8859-8'
charset_name
feed(byte_str)
is_final(c)
is_non_final(c)
reset()
set_model_probers(logicalProber, visualProber)
state
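
The constants above encode the key trick: windows-1255 text is stored in logical (reading) order, so the five final-form letters land at the ends of words, while visual ISO-8859-8 text is stored pre-reversed, so they land at the beginnings. A rough stdlib-only sketch of that scoring (the function name is hypothetical; the real prober is more careful):

```python
# Byte values of the five final-form letters (FINAL_KAF ... FINAL_TSADI above).
FINAL_FORMS = {234, 237, 239, 243, 245}

def guess_hebrew_order(words):
    """words: sequences of byte values; returns the likelier charset name."""
    logical = visual = 0
    for word in words:
        if not word:
            continue
        if word[-1] in FINAL_FORMS:
            logical += 1      # final form at word end: logical order
        if word[0] in FINAL_FORMS:
            visual += 1       # final form at word start: reversed text
    return 'windows-1255' if logical >= visual else 'ISO-8859-8'
```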

chardet.jisfreq module

chardet.jpcntx module

class chardet.jpcntx.EUCJPContextAnalysis

Bases: chardet.jpcntx.JapaneseContextAnalysis

get_order(byte_str)
class chardet.jpcntx.JapaneseContextAnalysis

Bases: object

DONT_KNOW = -1
ENOUGH_REL_THRESHOLD = 100
MAX_REL_THRESHOLD = 1000
MINIMUM_DATA_THRESHOLD = 4
NUM_OF_CATEGORY = 6
feed(byte_str, num_bytes)
get_confidence()
get_order(byte_str)
got_enough_data()
reset()
class chardet.jpcntx.SJISContextAnalysis

Bases: chardet.jpcntx.JapaneseContextAnalysis

charset_name
get_order(byte_str)

chardet.langbulgarianmodel module

chardet.langcyrillicmodel module

chardet.langgreekmodel module

chardet.langhebrewmodel module

chardet.langhungarianmodel module

chardet.langthaimodel module

chardet.latin1prober module

class chardet.latin1prober.Latin1Prober

Bases: chardet.charsetprober.CharSetProber

charset_name
feed(byte_str)
get_confidence()
reset()

chardet.mbcharsetprober module

class chardet.mbcharsetprober.MultiByteCharSetProber(lang_filter=None)

Bases: chardet.charsetprober.CharSetProber

charset_name
feed(byte_str)
get_confidence()
reset()

chardet.mbcsgroupprober module

class chardet.mbcsgroupprober.MBCSGroupProber(lang_filter=None)

Bases: chardet.charsetgroupprober.CharSetGroupProber

chardet.mbcssm module

chardet.sbcharsetprober module

class chardet.sbcharsetprober.SingleByteCharSetProber(model, reversed=False, name_prober=None)

Bases: chardet.charsetprober.CharSetProber

NEGATIVE_SHORTCUT_THRESHOLD = 0.05
NUMBER_OF_SEQ_CAT = 4
POSITIVE_CAT = 3
POSITIVE_SHORTCUT_THRESHOLD = 0.95
SAMPLE_SIZE = 64
SB_ENOUGH_REL_THRESHOLD = 1024
SYMBOL_CAT_ORDER = 250
charset_name
feed(byte_str)
get_confidence()
reset()
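
The constants above hint at how this prober scores text: adjacent characters are looked up as pairs, and the fraction of pairs falling in the most "positive" (common for the language) category drives the confidence. A stdlib-only sketch of that idea (the class name and `positive_pairs` parameter are hypothetical; the real prober uses per-language sequence-category tables):

```python
class ToySingleByteProber:
    POSITIVE_SHORTCUT_THRESHOLD = 0.95
    NEGATIVE_SHORTCUT_THRESHOLD = 0.05

    def __init__(self, positive_pairs):
        # positive_pairs: set of (order, order) pairs common in the language
        self._positive = positive_pairs
        self.reset()

    def reset(self):
        self._total_seqs = 0
        self._positive_seqs = 0
        self._last_order = None

    def feed(self, orders):
        # orders: characters already mapped to frequency-table positions
        for order in orders:
            if self._last_order is not None:
                self._total_seqs += 1
                if (self._last_order, order) in self._positive:
                    self._positive_seqs += 1
            self._last_order = order

    def get_confidence(self):
        if not self._total_seqs:
            return 0.01
        return self._positive_seqs / self._total_seqs
```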

chardet.sbcsgroupprober module

class chardet.sbcsgroupprober.SBCSGroupProber

Bases: chardet.charsetgroupprober.CharSetGroupProber

chardet.sjisprober module

class chardet.sjisprober.SJISProber

Bases: chardet.mbcharsetprober.MultiByteCharSetProber

charset_name
feed(byte_str)
get_confidence()
reset()

chardet.universaldetector module

Module containing the UniversalDetector detector class, which is the primary class a user of chardet should use.

author: Mark Pilgrim (initial port to Python)
author: Shy Shalom (original C code)
author: Dan Blanchard (major refactoring for 3.0)
author: Ian Cordasco
class chardet.universaldetector.UniversalDetector(lang_filter=31)

Bases: object

The UniversalDetector class underlies the chardet.detect function and coordinates all of the different charset probers.

To get a dict containing an encoding and its confidence, you can simply run:

from chardet.universaldetector import UniversalDetector

u = UniversalDetector()
u.feed(some_bytes)
u.close()
detected = u.result
ESC_DETECTOR = re.compile(b'(\x1b|~{)')
HIGH_BYTE_DETECTOR = re.compile(b'[\x80-\xff]')
MINIMUM_THRESHOLD = 0.2
close()

Stop analyzing the current document and come up with a final prediction.

Returns: The result attribute if a prediction was made, otherwise None.
feed(byte_str)

Takes a chunk of a document and feeds it through all of the relevant charset probers.

After calling feed, you can check the value of the done attribute to see if you need to continue feeding the UniversalDetector more data, or if it has made a prediction (in the result attribute).

Note

You should always call close when you’re done feeding in your document if done is not already True.

reset()

Reset the UniversalDetector and all of its probers back to their initial states. This is called by __init__, so you only need to call this directly in between analyses of different documents.
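
Putting feed, done, and close together, a typical incremental run looks like this (the sample text is arbitrary; any byte source works, and chunk boundaries may split multi-byte characters, which feed handles):

```python
from chardet.universaldetector import UniversalDetector

data = "Der Fluss ist größer und schöner als je zuvor.".encode("utf-8")

detector = UniversalDetector()
# Feed the document in small chunks, as you would while reading a file.
for start in range(0, len(data), 16):
    detector.feed(data[start:start + 16])
    if detector.done:        # a confident prediction was reached early
        break
detector.close()             # always close if done never became True
result = detector.result     # dict with 'encoding' and 'confidence'
```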

chardet.utf8prober module

class chardet.utf8prober.UTF8Prober

Bases: chardet.charsetprober.CharSetProber

ONE_CHAR_PROB = 0.5
charset_name
feed(byte_str)
get_confidence()
reset()

Module contents

chardet.detect(byte_str)
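
The one-call convenience wrapper around UniversalDetector: it feeds the whole byte string at once, closes the detector, and returns the result dict.

```python
import chardet

raw = "Coração, ação, não: a detecção funciona.".encode("utf-8")
result = chardet.detect(raw)
# result is a dict such as {'encoding': 'utf-8', 'confidence': 0.87, ...}
print(result['encoding'], result['confidence'])
```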