Usage
=====

Basic usage
-----------

The easiest way to use chardet is with the ``detect`` function.


Example: Using the ``detect`` function
--------------------------------------

The ``detect`` function takes a byte string and returns a dictionary
containing the auto-detected character encoding, a confidence level
from ``0`` to ``1``, and the detected language.

.. code:: python

    >>> import chardet
    >>> chardet.detect('Strauß und Müller über Änderungen'.encode('windows-1252'))
    {'encoding': 'WINDOWS-1252', 'confidence': 0.6316251912431836, 'language': 'German'}

The result dictionary always contains three keys:

- ``encoding``: the detected encoding name (or ``None`` if detection failed)
- ``confidence``: a float from ``0`` to ``1``
- ``language``: the detected language (or ``''`` if not applicable)


Controlling how much data to process
-------------------------------------

By default, ``detect()`` reads up to 200 KB of input in 64 KB chunks.
You can tune this with the ``max_bytes`` and ``chunk_size`` parameters:

.. code:: python

    import chardet

    # Process at most 50 KB, feeding 8 KB at a time internally
    result = chardet.detect(data, max_bytes=50_000, chunk_size=8192)

These parameters also apply to ``detect_all``.


Filtering by encoding era
--------------------------

By default, ``detect()`` only considers modern web encodings (UTF-8,
Windows-125x, CJK multi-byte, etc.). If you're working with legacy data,
you can expand the search using the ``encoding_era`` parameter:

.. code:: python

    from chardet import detect
    from chardet.enums import EncodingEra

    # Default: only modern web encodings
    result = detect(data)

    # Include all encoding eras
    result = detect(data, encoding_era=EncodingEra.ALL)

    # Only consider DOS-era encodings
    result = detect(data, encoding_era=EncodingEra.DOS)

    # Combine specific eras
    result = detect(data, encoding_era=EncodingEra.MODERN_WEB | EncodingEra.LEGACY_ISO)

See :doc:`supported-encodings` for which encodings belong to each era.


Getting all candidates with ``detect_all``
------------------------------------------

If you want to see all candidate encodings rather than just the best
guess, use ``detect_all``:

.. code:: python

    >>> import chardet
    >>> chardet.detect_all('Strauß und Müller über Änderungen'.encode('windows-1252'))[:3]
    [{'encoding': 'WINDOWS-1252', 'confidence': 0.6316251912431836, 'language': 'German'},
     {'encoding': 'WINDOWS-1250', 'confidence': 0.5220501710295528, 'language': 'Czech'},
     {'encoding': 'WINDOWS-1257', 'confidence': 0.5197657012389119, 'language': 'Estonian'}]

Results are sorted by confidence (highest first). ``detect_all`` accepts
the same ``encoding_era``, ``should_rename_legacy``, ``max_bytes``, and
``chunk_size`` parameters as ``detect``.


Advanced usage: incremental detection
--------------------------------------

In most cases, the ``max_bytes`` and ``chunk_size`` parameters on
``detect()`` and ``detect_all()`` are sufficient for controlling how much
data is processed. However, if you need to feed data from a custom source
(such as a network stream or a decompressor), you can use
``UniversalDetector`` directly.

Create a ``UniversalDetector`` object, then call its ``feed`` method
repeatedly with each block of data. If the detector reaches a minimum
threshold of confidence, it will set ``detector.done`` to ``True``.

Once you've exhausted the source data, call ``detector.close()`` to
finalize detection. The result is then available in ``detector.result``.

.. code:: python

    from chardet.universaldetector import UniversalDetector

    detector = UniversalDetector()
    with open('mystery-file.txt', 'rb') as f:
        for line in f:
            detector.feed(line)
            if detector.done:
                break
    detector.close()
    print(detector.result)

``UniversalDetector`` also accepts ``encoding_era`` and ``max_bytes``
parameters:

.. code:: python

    from chardet.enums import EncodingEra
    from chardet.universaldetector import UniversalDetector

    detector = UniversalDetector(encoding_era=EncodingEra.ALL)
    detector.feed(data)
    detector.close()
    print(detector.result)

If you want to detect the encoding of multiple texts (such as separate
files), you can re-use a single ``UniversalDetector`` object. Call
``detector.reset()`` at the start of each file, ``feed`` as many times as
you like, then ``close()`` and check ``detector.result``.

Example: Detecting encodings of multiple files
----------------------------------------------

.. code:: python

    import glob
    from chardet.universaldetector import UniversalDetector

    detector = UniversalDetector()
    for filename in glob.glob('*.xml'):
        detector.reset()
        with open(filename, 'rb') as f:
            for line in f:
                detector.feed(line)
                if detector.done:
                    break
        detector.close()
        print(f'{filename}: {detector.result}')


Command-line tool
-----------------

chardet includes a ``chardetect`` command-line tool:

.. code:: bash

    $ chardetect somefile.txt someotherfile.txt
    somefile.txt: Windows-1252 with confidence 0.73
    someotherfile.txt: ascii with confidence 1.0

To consider all encoding eras (not just modern web encodings):

.. code:: bash

    $ chardetect -e ALL somefile.txt

Other options:

.. code:: text

    $ chardetect --help
    usage: chardetect [-h] [--minimal] [-l] [-e ENCODING_ERA] [--version]
                      [input ...]

    Takes one or more file paths and reports their detected encodings

    positional arguments:
      input                 File whose encoding we would like to determine.
                            (default: stdin)

    options:
      -h, --help            show this help message and exit
      --minimal             Print only the encoding to standard output
      -l, --legacy          Rename legacy encodings to more modern ones.
      -e ENCODING_ERA, --encoding-era ENCODING_ERA
                            Which era of encodings to consider (default:
                            MODERN_WEB). Choices: MODERN_WEB, LEGACY_ISO,
                            LEGACY_MAC, LEGACY_REGIONAL, DOS, MAINFRAME, ALL
      --version             show program's version number and exit