Usage
Basic usage
The easiest way to use chardet is with the detect function.
Example: Using the detect function
The detect function takes a byte string and returns a dictionary
containing the auto-detected character encoding, a confidence level
from 0 to 1, and the detected language.
>>> import chardet
>>> chardet.detect('Strauß und Müller über Änderungen'.encode('windows-1252'))
{'encoding': 'WINDOWS-1252', 'confidence': 0.6316251912431836, 'language': 'German'}
The result dictionary always contains three keys:
encoding: the detected encoding name (or None if detection failed)
confidence: a float from 0 to 1
language: the detected language (or '' if not applicable)
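Because encoding can be None and confidence can be low, callers usually want a guard before decoding. The following is a minimal sketch, not part of chardet: the helper name decode_detected, the 0.5 threshold, and the UTF-8 fallback are all assumptions chosen for illustration. It takes the result dictionary as an argument, so it works with whatever detect returned.

```python
def decode_detected(data: bytes, result: dict,
                    min_confidence: float = 0.5,
                    fallback: str = "utf-8") -> str:
    """Decode data using a detection result, with a fallback encoding.

    Hypothetical helper: the threshold and fallback are illustrative
    defaults, not values recommended by chardet.
    """
    encoding = result["encoding"]
    # Detection failed or is too uncertain: use the fallback instead.
    if encoding is None or result["confidence"] < min_confidence:
        encoding = fallback
    try:
        return data.decode(encoding, errors="replace")
    except LookupError:
        # The detected name is not a codec Python knows about.
        return data.decode(fallback, errors="replace")
```

A typical call would be decode_detected(raw, chardet.detect(raw)). Using errors="replace" trades strictness for robustness; switch to the default "strict" if a wrong guess should raise instead.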
Controlling how much data to process
By default, detect() reads up to 200 KB of input in 64 KB chunks.
You can tune this with the max_bytes and chunk_size parameters:
import chardet
# data is the bytes object to analyze (for example, read from a file opened in 'rb' mode)
# Process at most 50 KB, feeding 8 KB at a time internally
result = chardet.detect(data, max_bytes=50_000, chunk_size=8192)
These parameters also apply to detect_all.
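To make the interaction between the two parameters concrete, here is a conceptual sketch of capped chunking over an in-memory buffer. It is not chardet's actual implementation; the function name iter_capped_chunks is hypothetical and only illustrates how a byte cap and a chunk size would combine.

```python
from typing import Iterator


def iter_capped_chunks(data: bytes, max_bytes: int,
                       chunk_size: int) -> Iterator[bytes]:
    """Yield chunk_size slices of data, never going past max_bytes.

    Illustration only: this mirrors the described behaviour of
    max_bytes and chunk_size, not chardet's internal code.
    """
    limit = min(len(data), max_bytes)
    for start in range(0, limit, chunk_size):
        # The final slice is shortened so the total stays within limit.
        yield data[start:min(start + chunk_size, limit)]
```

With 100,000 bytes of input, max_bytes=50_000, and chunk_size=8192, this yields seven chunks: six full 8192-byte chunks and a final chunk of 848 bytes.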
Filtering by encoding era
By default, detect() only considers modern web encodings (UTF-8,
Windows-125x, CJK multi-byte, etc.). If you’re working with legacy data,
you can expand the search using the encoding_era parameter:
from chardet import detect
from chardet.enums import EncodingEra
# Default: only modern web encodings
result = detect(data)
# Include all encoding eras
result = detect(data, encoding_era=EncodingEra.ALL)
# Only consider DOS-era encodings
result = detect(data, encoding_era=EncodingEra.DOS)
# Combine specific eras
result = detect(data, encoding_era=EncodingEra.MODERN_WEB | EncodingEra.LEGACY_ISO)
See Supported encodings for which encodings belong to each era.
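One way to combine these options is to try the default era first and widen the search only when the result is missing or weak. The sketch below assumes that strategy; detect_widening and its 0.5 threshold are hypothetical, and the detect callable is passed in so the helper itself has no dependencies.

```python
def detect_widening(data: bytes, detect, eras,
                    min_confidence: float = 0.5) -> dict:
    """Try each era in order and stop at the first confident result.

    Hypothetical helper: `detect` is any callable that accepts
    (data, encoding_era=...) and returns a result dictionary.
    """
    result = {"encoding": None, "confidence": 0.0, "language": ""}
    for era in eras:
        result = detect(data, encoding_era=era)
        if (result["encoding"] is not None
                and result["confidence"] >= min_confidence):
            break
    # If no era was confident, the result from the last era is returned.
    return result
```

For example, detect_widening(data, chardet.detect, [EncodingEra.MODERN_WEB, EncodingEra.ALL]) would only pay the cost of the wider search when the default era does not give a confident answer.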
Getting all candidates with detect_all
If you want to see all candidate encodings rather than just the best
guess, use detect_all:
>>> import chardet
>>> chardet.detect_all('Strauß und Müller über Änderungen'.encode('windows-1252'))[:3]
[{'encoding': 'WINDOWS-1252', 'confidence': 0.6316251912431836, 'language': 'German'},
{'encoding': 'WINDOWS-1250', 'confidence': 0.5220501710295528, 'language': 'Czech'},
{'encoding': 'WINDOWS-1257', 'confidence': 0.5197657012389119, 'language': 'Estonian'}]
Results are sorted by confidence (highest first). detect_all accepts
the same encoding_era, should_rename_legacy, max_bytes, and
chunk_size parameters as detect.
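Since the list is sorted by confidence, comparing the first two entries gives a rough signal of how contested the best guess is. This is a small sketch under that assumption; is_ambiguous and the 0.05 margin are hypothetical choices, not part of chardet.

```python
def is_ambiguous(candidates: list, margin: float = 0.05) -> bool:
    """Return True if the top two candidates are within `margin`.

    Hypothetical helper: expects a list of result dictionaries sorted
    by confidence, highest first, as detect_all is described to return.
    """
    if len(candidates) < 2:
        return False
    gap = candidates[0]["confidence"] - candidates[1]["confidence"]
    return gap < margin
```

For the German example above, the gap between WINDOWS-1252 and WINDOWS-1250 is roughly 0.11, so is_ambiguous returns False with the default margin.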
Advanced usage: incremental detection
In most cases, the max_bytes and chunk_size parameters on
detect() and detect_all() are sufficient for controlling how much
data is processed. However, if you need to feed data from a custom source
(such as a network stream or a decompressor), you can use
UniversalDetector directly.
Create a UniversalDetector object, then call its feed method
repeatedly with each block of data. If the detector reaches a minimum
threshold of confidence, it will set detector.done to True.
Once you’ve exhausted the source data, call detector.close() to
finalize detection. The result is then available in detector.result.
from chardet.universaldetector import UniversalDetector
detector = UniversalDetector()
with open('mystery-file.txt', 'rb') as f:
    for line in f:
        detector.feed(line)
        if detector.done:
            break
detector.close()
print(detector.result)
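The same feed/done/close pattern applies to any source that yields blocks of bytes, not just files. The sketch below wraps it in a hypothetical helper, detect_from_blocks, which takes the detector and an iterable of byte blocks; it relies only on the reset, feed, done, close, and result members described in this section.

```python
from typing import Iterable


def detect_from_blocks(detector, blocks: Iterable[bytes]) -> dict:
    """Feed byte blocks to a detector until it is done or input ends.

    Hypothetical helper: `detector` is expected to behave like
    UniversalDetector (reset, feed, done, close, result).
    """
    detector.reset()
    for block in blocks:
        detector.feed(block)
        if detector.done:
            # Confident enough: stop reading from the source early.
            break
    detector.close()
    return detector.result
```

With a socket, for instance, the blocks could come from iter(lambda: sock.recv(4096), b''); with a decompressor, from a generator that yields each decompressed block.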
UniversalDetector also accepts encoding_era and max_bytes
parameters:
from chardet.enums import EncodingEra
from chardet.universaldetector import UniversalDetector
detector = UniversalDetector(encoding_era=EncodingEra.ALL)
detector.feed(data)
detector.close()
print(detector.result)
If you want to detect the encoding of multiple texts (such as separate
files), you can re-use a single UniversalDetector object. Call
detector.reset() at the start of each file, feed as many times as
you like, then close() and check detector.result.
Example: Detecting encodings of multiple files
import glob
from chardet.universaldetector import UniversalDetector
detector = UniversalDetector()
for filename in glob.glob('*.xml'):
    detector.reset()
    with open(filename, 'rb') as f:
        for line in f:
            detector.feed(line)
            if detector.done:
                break
    detector.close()
    print(f'{filename}: {detector.result}')
Command-line tool
chardet includes a chardetect command-line tool:
$ chardetect somefile.txt someotherfile.txt
somefile.txt: Windows-1252 with confidence 0.73
someotherfile.txt: ascii with confidence 1.0
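If a script needs to consume this output, each line can be split back into its parts. The following sketch assumes every line has exactly the "name: encoding with confidence value" shape shown above; parse_chardetect_line is a hypothetical helper, and lines in any other shape are returned as None rather than guessed at.

```python
def parse_chardetect_line(line: str):
    """Split 'name: encoding with confidence value' into a tuple.

    Hypothetical helper based only on the sample output above.
    Returns (name, encoding, confidence) or None if the line does
    not have the expected shape.
    """
    # Split on the last ': ' so file names containing ': ' still work.
    name, _, rest = line.rstrip("\n").rpartition(": ")
    encoding, sep, confidence = rest.partition(" with confidence ")
    if not name or not sep:
        return None
    return name, encoding, float(confidence)
```

When only the encoding name is needed, the --minimal option listed in the help output is likely simpler than parsing.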
To consider all encoding eras (not just modern web encodings):
$ chardetect -e ALL somefile.txt
Other options:
$ chardetect --help
usage: chardetect [-h] [--minimal] [-l] [-e ENCODING_ERA] [--version]
                  [input ...]

Takes one or more file paths and reports their detected encodings

positional arguments:
  input                 File whose encoding we would like to determine.
                        (default: stdin)

options:
  -h, --help            show this help message and exit
  --minimal             Print only the encoding to standard output
  -l, --legacy          Rename legacy encodings to more modern ones.
  -e ENCODING_ERA, --encoding-era ENCODING_ERA
                        Which era of encodings to consider (default:
                        MODERN_WEB). Choices: MODERN_WEB, LEGACY_ISO,
                        LEGACY_MAC, LEGACY_REGIONAL, DOS, MAINFRAME, ALL
  --version             show program's version number and exit