Usage
=====

Installation
------------

.. code-block:: bash

   pip install chardet

Basic Detection
---------------

Use :func:`chardet.detect` to detect the encoding of a byte string:

.. code-block:: python

   import chardet

   result = chardet.detect(
       "München ist die Hauptstadt Bayerns und eine der"
       " schönsten Städte Deutschlands.".encode("windows-1252")
   )
   print(result)
   # {'encoding': 'windows-1252', 'confidence': 0.34, 'language': 'de', 'mime_type': 'text/plain'}

The result is a dictionary with four keys:

- ``"encoding"`` — the detected encoding name (e.g., ``"utf-8"``,
  ``"windows-1252"``), or ``None`` if detection failed
- ``"confidence"`` — a float between 0 and 1
- ``"language"`` — the detected language code (e.g., ``"de"``), or ``None``
- ``"mime_type"`` — the detected MIME type (e.g., ``"text/plain"``,
  ``"image/png"``), or ``None`` if unknown. For binary files detected via
  magic number signatures, this identifies the file format.
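
Because ``"encoding"`` can be ``None``, code that decodes with the detected
encoding usually needs a fallback. The following is a minimal sketch; the
``decode_detected`` helper and its ``fallback`` parameter are illustrative and
not part of chardet:

.. code-block:: python

   import chardet


   def decode_detected(data: bytes, fallback: str = "utf-8") -> str:
       """Decode ``data`` with the detected encoding, or ``fallback`` if none."""
       result = chardet.detect(data)
       # "encoding" is None when detection failed, so substitute the fallback.
       encoding = result["encoding"] or fallback
       try:
           # errors="replace" avoids an exception if the guess is wrong.
           return data.decode(encoding, errors="replace")
       except LookupError:
           # The detected name is not a codec this Python build knows.
           return data.decode(fallback, errors="replace")


   sample = "Grüße aus München: schöne Städte, größere Plätze."
   text = decode_detected(sample.encode("utf-8"))
   print(text)

Using ``errors="replace"`` trades strictness for robustness; pass a stricter
error handler if silently substituted characters are unacceptable.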

Multiple Candidates
~~~~~~~~~~~~~~~~~~~

Use :func:`chardet.detect_all` to get all candidate encodings ranked by
confidence:

.. code-block:: python

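   import chardet

   # Any byte string works here; this sample is only for illustration.
   data = "München ist die Hauptstadt Bayerns.".encode("windows-1252")
   results = chardet.detect_all(data)
   for r in results:
       print(f"{r['encoding']}: {r['confidence']:.2f}")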

By default, results below the minimum confidence threshold (0.20) are
filtered out. Pass ``ignore_threshold=True`` to see all candidates.
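
To see the effect of the threshold, both calls can be compared on the same
input. This is a small sketch: the sample text is arbitrary, and the exact
candidates and scores depend on the installed chardet version:

.. code-block:: python

   import chardet

   data = "Ceci est un exemple de texte: déjà vu, café, élève.".encode("utf-8")

   # Default: candidates below the minimum confidence threshold are dropped.
   filtered = chardet.detect_all(data)

   # ignore_threshold=True also returns the low-confidence candidates.
   everything = chardet.detect_all(data, ignore_threshold=True)

   print(f"{len(filtered)} above the threshold, {len(everything)} in total")
   for candidate in everything:
       print(f"{candidate['encoding']}: {candidate['confidence']:.2f}")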

Streaming Detection
-------------------

For large files or streaming data, use :class:`chardet.UniversalDetector`:

.. code-block:: python

   from chardet import UniversalDetector

   detector = UniversalDetector()
   with open("somefile.txt", "rb") as f:
       for line in f:
           detector.feed(line)
           if detector.done:
               break
   detector.close()
   print(detector.result)

Call :meth:`~chardet.UniversalDetector.reset` to reuse the detector for
another file.
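
As a sketch of that reuse pattern, the helper below runs one detector over
several inputs and calls ``reset()`` before each. The ``detect_chunks`` name
and the in-memory chunk lists are illustrative assumptions; in practice the
chunks would come from files or sockets:

.. code-block:: python

   from chardet import UniversalDetector


   def detect_chunks(detector, chunks):
       """Feed an iterable of byte chunks to ``detector`` and return its result."""
       detector.reset()  # clear state left over from any previous input
       for chunk in chunks:
           detector.feed(chunk)
           if detector.done:  # stop early once the detector has decided
               break
       detector.close()  # finalize so that ``result`` is populated
       return detector.result


   detector = UniversalDetector()
   inputs = {
       "utf8": [
           "Grüße aus München.\n".encode("utf-8"),
           "Schöne Städte, größere Plätze.\n".encode("utf-8"),
       ],
       "ascii": [b"Plain ASCII text.\n", b"Nothing unusual here.\n"],
   }
   for name, chunks in inputs.items():
       result = detect_chunks(detector, chunks)
       print(name, result["encoding"], result["confidence"])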

The constructor accepts the same tuning parameters as :func:`~chardet.detect`:

.. code-block:: python

   from chardet import EncodingEra, UniversalDetector

   detector = UniversalDetector(
       encoding_era=EncodingEra.MODERN_WEB,  # restrict candidate encodings
       max_bytes=50_000,                      # stop buffering after 50 KB
   )

Encoding Eras
-------------

By default, chardet considers all supported encodings for maximum
accuracy. Use the ``encoding_era`` parameter to restrict the search to a
specific subset:

.. code-block:: python

   from chardet import detect, EncodingEra

   # Default: all encodings considered
   result = detect(data)

   # Restrict to modern web encodings only
   result = detect(data, encoding_era=EncodingEra.MODERN_WEB)

   # Only legacy ISO encodings
   result = detect(data, encoding_era=EncodingEra.LEGACY_ISO)

Available eras (can be combined with ``|``):

- :attr:`~chardet.EncodingEra.ALL` — All supported encodings (default)
- :attr:`~chardet.EncodingEra.MODERN_WEB` — UTF-8, Windows codepages,
  CJK encodings
- :attr:`~chardet.EncodingEra.LEGACY_ISO` — ISO-8859 family
- :attr:`~chardet.EncodingEra.LEGACY_MAC` — Mac encodings
- :attr:`~chardet.EncodingEra.LEGACY_REGIONAL` — Regional codepages
  (KOI8-T, KZ-1048, etc.)
- :attr:`~chardet.EncodingEra.DOS` — DOS codepages (CP437, CP850, etc.)
- :attr:`~chardet.EncodingEra.MAINFRAME` — EBCDIC encodings
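
Since the eras combine with ``|``, a search restricted to two of them can be
sketched as follows. The sample bytes are arbitrary, and no particular
detection result is implied:

.. code-block:: python

   from chardet import EncodingEra, detect

   # Consider modern web encodings plus the ISO-8859 family, and nothing else.
   eras = EncodingEra.MODERN_WEB | EncodingEra.LEGACY_ISO

   text = "Ελληνικό κείμενο για δοκιμή ανίχνευσης κωδικοποίησης."
   data = text.encode("iso8859-7")

   result = detect(data, encoding_era=eras)
   print(result["encoding"], result["confidence"])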

Encoding Name Options
---------------------

By default, chardet returns encoding names compatible with chardet 5.x/6.x
(e.g., ``"utf-8"``, ``"ascii"``, ``"SHIFT_JIS"``).  Two parameters control
how encoding names are returned:

- ``compat_names`` (default ``True``) — map internal Python codec names to
  chardet 5.x/6.x compatible display names.  Set to ``False`` to get raw
  Python codec names (e.g., ``"shift_jis_2004"`` instead of ``"SHIFT_JIS"``).
- ``prefer_superset`` (default ``False``) — remap legacy ISO/subset encodings
  to their modern Windows/CP superset equivalents (e.g., ASCII →
  Windows-1252, ISO-8859-1 → Windows-1252).

.. code-block:: python

   # Default: chardet 5.x/6.x-compatible names
   chardet.detect(data)
   # {'encoding': 'ascii', ...}

   # Raw Python codec names
   chardet.detect(data, compat_names=False)
   # {'encoding': 'ascii', ...}

   # Superset remapping with compat names
   chardet.detect(data, prefer_superset=True)
   # {'encoding': 'Windows-1252', ...}

   # Superset remapping with raw codec names
   chardet.detect(data, prefer_superset=True, compat_names=False)
   # {'encoding': 'cp1252', ...}

These parameters apply to :func:`~chardet.detect`, :func:`~chardet.detect_all`,
and :class:`~chardet.UniversalDetector`.

The deprecated ``should_rename_legacy=True`` parameter is equivalent to
``prefer_superset=True`` and is still accepted with a deprecation warning.

The following table shows every encoding whose name changes depending on
the ``compat_names`` and ``prefer_superset`` settings.  The ``utf-8`` row is
listed for comparison only, since its name is the same in every mode.
Encodings not listed here return the same name in all modes.

.. list-table:: Encoding names by parameter combination
   :header-rows: 1
   :widths: 20 20 20 20 20

   * - Internal name
     - ``compat_names=True`` (default)
     - ``compat_names=False``
     - ``prefer_superset=True``
     - ``prefer_superset=True, compat_names=False``
   * - ascii
     - ``ascii``
     - ``ascii``
     - ``Windows-1252``
     - ``cp1252``
   * - big5hkscs
     - ``Big5``
     - ``big5hkscs``
     - ``Big5``
     - ``big5hkscs``
   * - cp855
     - ``IBM855``
     - ``cp855``
     - ``IBM855``
     - ``cp855``
   * - cp866
     - ``IBM866``
     - ``cp866``
     - ``IBM866``
     - ``cp866``
   * - euc_jis_2004
     - ``EUC-JP``
     - ``euc_jis_2004``
     - ``EUC-JP``
     - ``euc_jis_2004``
   * - euc_kr
     - ``EUC-KR``
     - ``euc_kr``
     - ``CP949``
     - ``cp949``
   * - iso2022_jp_2
     - ``ISO-2022-JP``
     - ``iso2022_jp_2``
     - ``ISO-2022-JP``
     - ``iso2022_jp_2``
   * - iso8859-1
     - ``ISO-8859-1``
     - ``iso8859-1``
     - ``Windows-1252``
     - ``cp1252``
   * - iso8859-2
     - ``ISO-8859-2``
     - ``iso8859-2``
     - ``Windows-1250``
     - ``cp1250``
   * - iso8859-5
     - ``ISO-8859-5``
     - ``iso8859-5``
     - ``Windows-1251``
     - ``cp1251``
   * - iso8859-6
     - ``ISO-8859-6``
     - ``iso8859-6``
     - ``Windows-1256``
     - ``cp1256``
   * - iso8859-7
     - ``ISO-8859-7``
     - ``iso8859-7``
     - ``Windows-1253``
     - ``cp1253``
   * - iso8859-8
     - ``ISO-8859-8``
     - ``iso8859-8``
     - ``Windows-1255``
     - ``cp1255``
   * - iso8859-9
     - ``ISO-8859-9``
     - ``iso8859-9``
     - ``Windows-1254``
     - ``cp1254``
   * - ISO-8859-11
     - ``ISO-8859-11``
     - ``ISO-8859-11``
     - ``CP874``
     - ``cp874``
   * - iso8859-13
     - ``ISO-8859-13``
     - ``iso8859-13``
     - ``Windows-1257``
     - ``cp1257``
   * - kz1048
     - ``KZ1048``
     - ``kz1048``
     - ``KZ1048``
     - ``kz1048``
   * - mac-cyrillic
     - ``MacCyrillic``
     - ``mac-cyrillic``
     - ``MacCyrillic``
     - ``mac-cyrillic``
   * - mac-greek
     - ``MacGreek``
     - ``mac-greek``
     - ``MacGreek``
     - ``mac-greek``
   * - mac-iceland
     - ``MacIceland``
     - ``mac-iceland``
     - ``MacIceland``
     - ``mac-iceland``
   * - mac-latin2
     - ``MacLatin2``
     - ``mac-latin2``
     - ``MacLatin2``
     - ``mac-latin2``
   * - mac-roman
     - ``MacRoman``
     - ``mac-roman``
     - ``MacRoman``
     - ``mac-roman``
   * - mac-turkish
     - ``MacTurkish``
     - ``mac-turkish``
     - ``MacTurkish``
     - ``mac-turkish``
   * - shift_jis_2004
     - ``SHIFT_JIS``
     - ``shift_jis_2004``
     - ``SHIFT_JIS``
     - ``shift_jis_2004``
   * - tis-620
     - ``TIS-620``
     - ``tis-620``
     - ``CP874``
     - ``cp874``
   * - utf-8
     - ``utf-8``
     - ``utf-8``
     - ``utf-8``
     - ``utf-8``

Encoding Filters
----------------

Use ``include_encodings`` and ``exclude_encodings`` to control exactly which
encodings chardet considers:

.. code-block:: python

   # Only consider UTF-8 and Windows-1252
   result = chardet.detect(data, include_encodings=["utf-8", "windows-1252"])

   # Consider everything except two EBCDIC code pages
   result = chardet.detect(data, exclude_encodings=["cp037", "cp500"])

Encoding names are resolved through Python's codec system, so aliases work
(e.g., ``"latin-1"`` for ``"iso8859-1"``).  An empty iterable raises
:class:`ValueError` — pass ``None`` (the default) to disable filtering.
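
The alias resolution can be previewed with the standard library's
:func:`codecs.lookup`, which is also a convenient way to validate
user-supplied names before passing them on. This is a sketch: the
``canonical_names`` helper is hypothetical, and chardet's internal
normalization may differ in detail:

.. code-block:: python

   import codecs


   def canonical_names(names):
       """Return canonical codec names for ``names``, or None if there are none.

       Raises LookupError for a name Python's codec registry does not know.
       Returning None for empty input avoids handing chardet an empty
       iterable, which the filters reject with ValueError.
       """
       resolved = [codecs.lookup(name).name for name in names]
       return resolved or None


   print(canonical_names(["latin-1", "windows-1252", "utf8"]))
   # ['iso8859-1', 'cp1252', 'utf-8']
   print(canonical_names([]))
   # None

The returned value can then be passed as ``include_encodings`` or
``exclude_encodings``.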

When filtering removes all candidates, chardet returns the
``no_match_encoding`` (default ``"cp1252"``) with low confidence.  If even
that encoding is excluded by the filters, chardet returns
``encoding=None`` with a warning.  Similarly, ``empty_input_encoding``
(default ``"utf-8"``) controls the result for empty input:

.. code-block:: python

   # Custom fallbacks
   result = chardet.detect(
       data,
       include_encodings=["utf-8", "shift_jis"],
       no_match_encoding="utf-8",
       empty_input_encoding="shift_jis",
   )

These parameters apply to :func:`~chardet.detect`,
:func:`~chardet.detect_all`, and :class:`~chardet.UniversalDetector`.

Limiting Bytes
--------------

By default, chardet examines up to 200,000 bytes. Use ``max_bytes`` to
adjust:

.. code-block:: python

   # Examine only the first 10 KB
   result = chardet.detect(data, max_bytes=10_000)

Smaller values are faster but may reduce accuracy for encodings that
require more data to distinguish.
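
``max_bytes`` caps what chardet examines, but the caller still has to hold
``data`` in memory. For files, a similar effect can be had by reading only the
head of the file before calling :func:`~chardet.detect`. A minimal sketch, in
which the ``detect_file_head`` helper and its 10,000-byte default are
illustrative choices rather than part of chardet:

.. code-block:: python

   import tempfile
   from pathlib import Path

   import chardet


   def detect_file_head(path, limit=10_000):
       """Detect the encoding of a file from at most its first ``limit`` bytes."""
       with open(path, "rb") as f:
           head = f.read(limit)  # never reads more than ``limit`` bytes
       return chardet.detect(head)


   # Demonstration on a temporary file that is larger than the limit.
   with tempfile.TemporaryDirectory() as tmp:
       sample = Path(tmp) / "sample.txt"
       sample.write_bytes("Grüße aus München. ".encode("utf-8") * 1000)
       result = detect_file_head(sample, limit=4_000)
       print(result["encoding"], result["confidence"])

Cutting at a fixed byte offset can split a multi-byte character at the end of
the sample, so a limit that is too small may affect the result.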

Deprecated Parameters
---------------------

The following parameters are accepted for backward compatibility with
chardet 5.x/6.x but have no effect:

- ``chunk_size`` on :func:`~chardet.detect` and
  :func:`~chardet.detect_all` — previously controlled how data was
  chunked for streaming probers. A deprecation warning is emitted if a
  non-default value is passed.
- ``lang_filter`` on :class:`~chardet.UniversalDetector` — previously
  restricted detection to specific language groups via
  :class:`~chardet.LanguageFilter`. A deprecation warning is emitted if
  set to anything other than :attr:`~chardet.LanguageFilter.ALL`.

Command-Line Tool
-----------------

chardet includes a ``chardetect`` command:

.. code-block:: bash

   # Detect encoding of files
   chardetect somefile.txt anotherfile.csv
   # somefile.txt: utf-8 with confidence 0.99
   # anotherfile.csv: ascii with confidence 1.0

   # Output only the encoding name
   chardetect --minimal somefile.txt
   # utf-8

   # Include detected language
   chardetect -l somefile.txt
   # somefile.txt: utf-8 en (English) with confidence 0.99

   # Minimal output with language
   chardetect --minimal -l somefile.txt
   # utf-8 en

   # Specific encoding era
   chardetect -e dos somefile.txt
   # somefile.txt: cp850 with confidence 0.10

   # Only consider specific encodings
   chardetect -i utf-8,windows-1252 somefile.txt
   # somefile.txt: utf-8 with confidence 0.99

   # Exclude specific encodings
   chardetect -x cp037,cp500 somefile.txt
   # somefile.txt: utf-8 with confidence 0.99

   # Custom fallback when detection is inconclusive
   chardetect --no-match-encoding utf-8 somefile.bin
   # somefile.bin: utf-8 with confidence 0.10

   # Custom encoding for empty input
   chardetect --empty-input-encoding shift_jis empty.txt
   # empty.txt: SHIFT_JIS with confidence 0.10

   # Read from stdin
   cat somefile.txt | chardetect
   # stdin: utf-8 with confidence 0.99