chardet package¶
Submodules¶
chardet.big5freq module¶
chardet.big5prober module¶
-
class
chardet.big5prober.
Big5Prober
[source]¶ Bases:
chardet.mbcharsetprober.MultiByteCharSetProber
-
charset_name
¶
-
language
¶
-
chardet.chardetect module¶
chardet.chardistribution module¶
-
class
chardet.chardistribution.
CharDistributionAnalysis
[source]¶ Bases:
object
-
ENOUGH_DATA_THRESHOLD
= 1024¶
-
MINIMUM_DATA_THRESHOLD
= 3¶
-
SURE_NO
= 0.01¶
-
SURE_YES
= 0.99¶
-
chardet.charsetgroupprober module¶
chardet.charsetprober module¶
-
class
chardet.charsetprober.
CharSetProber
(lang_filter=None)[source]¶ Bases:
object
-
SHORTCUT_THRESHOLD
= 0.95¶
-
charset_name
¶
-
static
filter_international_words
(buf)[source]¶ We define three types of bytes:
- alphabet: English letters [a-zA-Z]
- international: international characters [\x80-\xFF]
- marker: everything else [^a-zA-Z\x80-\xFF]
The input buffer can be thought of as a series of words delimited by markers. This function keeps only the words that contain at least one international character. All contiguous sequences of markers are replaced by a single ASCII space character.
This filter applies to all scripts which do not use English characters.
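The word filter described above can be sketched with two regular expressions. This is a minimal re-implementation based only on the byte classes listed in the docstring, not chardet's own code; the name `keep_international_words` is hypothetical.

```python
import re

# Marker bytes: anything that is neither an English letter nor a high byte.
MARKER_RUN = re.compile(rb"[^a-zA-Z\x80-\xff]+")
# International bytes: the high half of the byte range.
HIGH_BYTE = re.compile(rb"[\x80-\xff]")


def keep_international_words(buf: bytes) -> bytes:
    """Keep words containing at least one high byte, joined by single spaces."""
    # Splitting on runs of markers collapses each run into one word boundary.
    words = MARKER_RUN.split(buf)
    kept = [word for word in words if word and HIGH_BYTE.search(word)]
    return b" ".join(kept)


filtered = keep_international_words(b"hello caf\xe9, na\xefve 123 world")
print(filtered)
```

Words made only of English letters ("hello", "world") are dropped, which is why a filter like this suits scripts that do not use English characters.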
-
static
filter_with_english_letters
(buf)[source]¶ Returns a copy of buf that retains only the sequences of English alphabet and high byte characters that are not between <> characters. Also retains English alphabet and high byte characters immediately before occurrences of >.
This filter can be applied to all scripts which contain both English characters and extended ASCII characters, but is currently only used by Latin1Prober.
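A simplified sketch of this kind of filter follows. It assumes well-formed `<...>` spans and does not reproduce the special case for characters immediately before `>`, so it is an approximation of the described behaviour, not the library's implementation; the name `strip_markup_keep_letters` is hypothetical.

```python
import re

# A complete <...> span; unclosed "<" is left alone in this sketch.
TAG = re.compile(rb"<[^>]*>")
# Runs of English letters and high bytes.
LETTER_RUN = re.compile(rb"[a-zA-Z\x80-\xff]+")


def strip_markup_keep_letters(buf: bytes) -> bytes:
    """Drop <...> spans, then keep only letter/high-byte runs."""
    without_tags = TAG.sub(b" ", buf)
    return b" ".join(LETTER_RUN.findall(without_tags))


print(strip_markup_keep_letters(b"<b>caf\xe9</b> 42 ok"))
```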
-
state
¶
-
chardet.codingstatemachine module¶
-
class
chardet.codingstatemachine.
CodingStateMachine
(sm)[source]¶ Bases:
object
A state machine to verify a byte sequence for a particular encoding. For each byte the detector receives, it will feed that byte to every active state machine available, one byte at a time. The state machine changes its state based on its previous state and the byte it receives. There are 3 states in a state machine that are of interest to an auto-detector:
- START state: This is the state to start with, or the state reached once a legal byte sequence (i.e. a valid code point) for a character has been identified.
- ME state: This indicates that the state machine identified a byte sequence that is specific to the charset it is designed for and that there is no other possible encoding which can contain this byte sequence. This will lead to an immediate positive answer for the detector.
- ERROR state: This indicates the state machine identified an illegal byte sequence for that encoding. This will lead to an immediate negative answer for this encoding. The detector will exclude this encoding from consideration from here on.
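The three states can be illustrated with a toy machine. This sketch assumes a made-up 7-bit encoding in which the escape sequence ESC `$` `B` is unique to the charset and any byte at or above 0x80 is illegal. The `State` names, `next_state`, and `run_machine` are hypothetical and do not mirror chardet's transition tables.

```python
from enum import Enum


class State(Enum):
    START = 0           # initial state, or a legal sequence just completed
    ERROR = 1           # illegal byte sequence for this encoding
    ITS_ME = 2          # sequence that only this encoding can contain
    SAW_ESC = 3         # intermediate: ESC seen
    SAW_ESC_DOLLAR = 4  # intermediate: ESC '$' seen


def next_state(state: State, byte: int) -> State:
    """Pick the next state from the previous state and the incoming byte."""
    if byte >= 0x80:
        return State.ERROR  # the toy encoding is 7-bit only
    if state is State.START:
        return State.SAW_ESC if byte == 0x1B else State.START
    if state is State.SAW_ESC:
        return State.SAW_ESC_DOLLAR if byte == 0x24 else State.START
    if state is State.SAW_ESC_DOLLAR:
        return State.ITS_ME if byte == 0x42 else State.START
    return state


def run_machine(data: bytes) -> State:
    """Feed bytes one at a time; stop early on a decisive state."""
    state = State.START
    for byte in data:
        state = next_state(state, byte)
        if state in (State.ERROR, State.ITS_ME):
            break
    return state


print(run_machine(b"plain"), run_machine(b"\x1b$Bxyz"), run_machine(b"ab\xff"))
```

A detector built this way can drop a candidate as soon as its machine reports ERROR and can answer immediately when one reports ITS_ME.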
-
language
¶
chardet.compat module¶
chardet.constants module¶
chardet.cp949prober module¶
-
class
chardet.cp949prober.
CP949Prober
[source]¶ Bases:
chardet.mbcharsetprober.MultiByteCharSetProber
-
charset_name
¶
-
language
¶
-
chardet.escprober module¶
-
class
chardet.escprober.
EscCharSetProber
(lang_filter=None)[source]¶ Bases:
chardet.charsetprober.CharSetProber
This CharSetProber uses a “code scheme” approach for detecting encodings, whereby easily recognizable escape or shift sequences are relied on to identify these encodings.
-
charset_name
¶
-
language
¶
-
chardet.escsm module¶
chardet.eucjpprober module¶
chardet.euckrfreq module¶
chardet.euckrprober module¶
-
class
chardet.euckrprober.
EUCKRProber
[source]¶ Bases:
chardet.mbcharsetprober.MultiByteCharSetProber
-
charset_name
¶
-
language
¶
-
chardet.euctwfreq module¶
chardet.euctwprober module¶
-
class
chardet.euctwprober.
EUCTWProber
[source]¶ Bases:
chardet.mbcharsetprober.MultiByteCharSetProber
-
charset_name
¶
-
language
¶
-
chardet.gb2312freq module¶
chardet.gb2312prober module¶
-
class
chardet.gb2312prober.
GB2312Prober
[source]¶ Bases:
chardet.mbcharsetprober.MultiByteCharSetProber
-
charset_name
¶
-
language
¶
-
chardet.hebrewprober module¶
-
class
chardet.hebrewprober.
HebrewProber
[source]¶ Bases:
chardet.charsetprober.CharSetProber
-
FINAL_KAF
= 234¶
-
FINAL_MEM
= 237¶
-
FINAL_NUN
= 239¶
-
FINAL_PE
= 243¶
-
FINAL_TSADI
= 245¶
-
LOGICAL_HEBREW_NAME
= 'windows-1255'¶
-
MIN_FINAL_CHAR_DISTANCE
= 5¶
-
MIN_MODEL_DISTANCE
= 0.01¶
-
NORMAL_KAF
= 235¶
-
NORMAL_MEM
= 238¶
-
NORMAL_NUN
= 240¶
-
NORMAL_PE
= 244¶
-
NORMAL_TSADI
= 246¶
-
VISUAL_HEBREW_NAME
= 'ISO-8859-8'¶
-
charset_name
¶
-
language
¶
-
state
¶
-
chardet.jisfreq module¶
chardet.jpcntx module¶
chardet.langbulgarianmodel module¶
chardet.langcyrillicmodel module¶
chardet.langgreekmodel module¶
chardet.langhebrewmodel module¶
chardet.langhungarianmodel module¶
chardet.langthaimodel module¶
chardet.latin1prober module¶
chardet.mbcharsetprober module¶
chardet.mbcsgroupprober module¶
chardet.mbcssm module¶
chardet.sbcharsetprober module¶
chardet.sjisprober module¶
chardet.universaldetector module¶
Module containing the UniversalDetector detector class, which is the primary class a user of chardet should use.
author: Mark Pilgrim (initial port to Python)
author: Shy Shalom (original C code)
author: Dan Blanchard (major refactoring for 3.0)
author: Ian Cordasco
-
class
chardet.universaldetector.
UniversalDetector
(lang_filter=31)[source]¶ Bases:
object
The UniversalDetector class underlies the chardet.detect function and coordinates all of the different charset probers.
To get a dict containing an encoding and its confidence, you can simply run:
    u = UniversalDetector()
    u.feed(some_bytes)
    u.close()
    detected = u.result
-
ESC_DETECTOR
= re.compile(b'(\x1b|~{)')¶
-
HIGH_BYTE_DETECTOR
= re.compile(b'[\x80-\xff]')¶
-
ISO_WIN_MAP
= {'iso-8859-1': 'Windows-1252', 'iso-8859-13': 'Windows-1257', 'iso-8859-2': 'Windows-1250', 'iso-8859-5': 'Windows-1251', 'iso-8859-6': 'Windows-1256', 'iso-8859-7': 'Windows-1253', 'iso-8859-8': 'Windows-1255', 'iso-8859-9': 'Windows-1254'}¶
-
MINIMUM_THRESHOLD
= 0.2¶
-
WIN_BYTE_DETECTOR
= re.compile(b'[\x80-\x9f]')¶
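The byte patterns and the ISO-to-Windows map listed above can be tried out directly with the standard `re` module. The sketch below copies the values as printed; the helper names `classify` and `prefer_windows_name`, and the way they combine the patterns, are assumptions for illustration and not a description of the detector's internal logic.

```python
import re

ESC_DETECTOR = re.compile(b"(\x1b|~{)")
HIGH_BYTE_DETECTOR = re.compile(b"[\x80-\xff]")
WIN_BYTE_DETECTOR = re.compile(b"[\x80-\x9f]")
ISO_WIN_MAP = {
    "iso-8859-1": "Windows-1252",
    "iso-8859-13": "Windows-1257",
    "iso-8859-2": "Windows-1250",
    "iso-8859-5": "Windows-1251",
    "iso-8859-6": "Windows-1256",
    "iso-8859-7": "Windows-1253",
    "iso-8859-8": "Windows-1255",
    "iso-8859-9": "Windows-1254",
}


def classify(chunk: bytes) -> str:
    """Hypothetical helper: label a chunk by the kind of bytes it contains."""
    if HIGH_BYTE_DETECTOR.search(chunk):
        return "high-byte"
    if ESC_DETECTOR.search(chunk):
        return "escape"
    return "pure-ascii"


def prefer_windows_name(encoding: str, data: bytes) -> str:
    """Hypothetical helper: swap an ISO name for its Windows counterpart
    when the data contains bytes in the 0x80-0x9f range."""
    if WIN_BYTE_DETECTOR.search(data):
        return ISO_WIN_MAP.get(encoding.lower(), encoding)
    return encoding


print(classify(b"hello"), classify(b"\x1b$B"), classify(b"caf\xe9"))
print(prefer_windows_name("ISO-8859-1", b"\x93quoted\x94"))
```

Bytes in the 0x80-0x9f range are control codes in the ISO-8859 family but printable characters in the matching Windows code pages, which is why their presence is a reasonable hint toward the Windows name.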
-
close
()[source]¶ Stop analyzing the current document and come up with a final prediction.
Returns: The result attribute, a dict with the keys encoding, confidence, and language.
-
feed
(byte_str)[source]¶ Takes a chunk of a document and feeds it through all of the relevant charset probers.
After calling feed, you can check the value of the done attribute to see if you need to continue feeding the UniversalDetector more data, or if it has made a prediction (in the result attribute).
Note: You should always call close when you’re done feeding in your document if done is not already True.
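The feed/done/close protocol described above can be wrapped in a small helper. This is a sketch that relies only on the `feed`, `done`, `close`, and `result` members documented here; the names `detect_chunks` and `demo`, the sample text, and the 64-byte chunk size are illustrative choices.

```python
def detect_chunks(detector, chunks):
    """Feed chunks until the detector is done, then return its result."""
    for chunk in chunks:
        detector.feed(chunk)
        if detector.done:
            # A prediction is already available; no more data is needed.
            break
    if not detector.done:
        # Input ran out first, so ask for a final prediction.
        detector.close()
    return detector.result


def demo():
    """Example wiring with chardet; requires the chardet package."""
    from chardet.universaldetector import UniversalDetector

    sample = "Grüße aus Köln. ".encode("utf-8") * 50
    chunks = (sample[i:i + 64] for i in range(0, len(sample), 64))
    return detect_chunks(UniversalDetector(), chunks)
```

Stopping as soon as `done` is set avoids reading the rest of a large document once the detector has committed to an answer.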
-