Contributing
============

Development Setup
-----------------

chardet uses `uv <https://docs.astral.sh/uv/>`_ for dependency management:

.. code-block:: bash

   git clone https://github.com/chardet/chardet.git
   cd chardet
   uv sync                    # install dependencies
   prek install               # set up pre-commit hooks (ruff lint+format, etc.)

Running Tests
-------------

Tests use pytest. Test data is auto-cloned from the
`chardet/test-data <https://github.com/chardet/test-data>`_ repo on
first run (cached in ``tests/data/``, gitignored).

.. code-block:: bash

   uv run python -m pytest                              # run all tests
   uv run python -m pytest tests/test_api.py            # single file
   uv run python -m pytest tests/test_api.py::test_detect_empty  # single test
   uv run python -m pytest -x                           # stop on first failure

Accuracy tests are dynamically parametrized from the test data via
``conftest.py``.

Linting and Formatting
----------------------

chardet uses `Ruff <https://docs.astral.sh/ruff/>`_ with
``select = ["ALL"]`` and targeted ignores (see ``pyproject.toml``):

.. code-block:: bash

   uv run ruff check .        # lint
   uv run ruff check --fix .  # lint with auto-fix
   uv run ruff format .       # format

Pre-commit hooks run ruff automatically on each commit.

Training Models
---------------

Bigram frequency models are trained from the
`CulturaX <https://huggingface.co/datasets/uonlp/CulturaX>`_ multilingual
corpus (via Hugging Face) plus HTML data (separate from the evaluation
test suite):

.. code-block:: bash

   uv run python scripts/train.py

Training data is cached in ``data/`` (gitignored). Models are saved to
``src/chardet/models/models.bin``.

Benchmarks and Diagnostics
--------------------------

.. code-block:: bash

   uv run python scripts/benchmark_time.py     # latency benchmarks
   uv run python scripts/benchmark_memory.py   # memory usage benchmarks
   uv run python scripts/diagnose_accuracy.py  # detailed accuracy diagnostics
   uv run python scripts/compare_detectors.py  # compare against other detectors

Building Documentation
----------------------

.. code-block:: bash

   uv sync --group docs                          # install Sphinx, Furo, etc.
   uv run sphinx-build docs docs/_build          # build HTML docs
   uv run sphinx-build -W docs docs/_build       # build with warnings as errors

Docs are published to `ReadTheDocs <https://chardet.readthedocs.io>`_
on tag push.

Architecture Overview
---------------------

All detection flows through ``run_pipeline()`` in
``src/chardet/pipeline/orchestrator.py``, which runs stages in order —
each stage either returns a definitive result or passes to the next:

1. **BOM** (``bom.py``) — byte order mark
2. **UTF-16/32 patterns** (``utf1632.py``) — null-byte patterns
3. **Escape sequences** (``escape.py``) — ISO-2022-JP/KR, HZ-GB-2312
4. **Magic numbers** (``magic.py``) — binary file type identification
5. **Binary detection** (``binary.py``) — null bytes / control chars
6. **Markup charset** (``markup.py``) — ``<meta charset>`` / ``<?xml encoding>``
7. **ASCII** (``ascii.py``) — pure 7-bit check
8. **UTF-8** (``utf8.py``) — structural multi-byte validation
9. **Byte validity** (``validity.py``) — eliminate invalid encodings
10. **CJK gating** (in orchestrator) — eliminate spurious CJK candidates
11. **Structural probing** (``structural.py``) — multi-byte encoding fit
12. **Statistical scoring** (``statistical.py``) — bigram frequency models
13. **Post-processing** (orchestrator) — confusion groups, niche demotion

Key types:

- ``DetectionResult`` — frozen dataclass: ``encoding``, ``confidence``,
  ``language``, ``mime_type``
- ``EncodingInfo`` (``registry.py``) — frozen dataclass: ``name``,
  ``aliases``, ``era``, ``is_multibyte``, ``languages``
- ``EncodingEra`` (``enums.py``) — IntFlag for filtering candidates
- ``BigramProfile`` (``models/__init__.py``) — pre-computed bigram
  frequencies

Model format: binary file ``src/chardet/models/models.bin`` — sparse
bigram tables loaded via ``struct.unpack``. Each model is a 65,536-byte
lookup table indexed by ``(b1 << 8) | b2``.

Optional mypyc Compilation
--------------------------

Hot-path modules can be compiled to C extensions with
`mypyc <https://mypyc.readthedocs.io>`_:

.. code-block:: bash

   HATCH_BUILD_HOOK_ENABLE_MYPYC=true uv build

Compiled modules: ``models/__init__.py``, ``pipeline/structural.py``,
``pipeline/validity.py``, ``pipeline/statistical.py``,
``pipeline/utf1632.py``, ``pipeline/utf8.py``, ``pipeline/escape.py``,
``pipeline/orchestrator.py``, ``pipeline/confusion.py``,
``pipeline/magic.py``, ``pipeline/ascii.py``.

These modules cannot use ``from __future__ import annotations``
(``FA100`` is ignored for them in ruff config).

Versioning
----------

Version is derived from git tags via ``hatch-vcs``. The tag is the
single source of truth — no hardcoded version strings. The generated
``src/chardet/_version.py`` is gitignored and should never be committed.

Conventions
-----------

- ``from __future__ import annotations`` in all source files (except
  mypyc-compiled modules)
- Frozen dataclasses with ``slots=True`` for data types
- Ruff with ``select = ["ALL"]`` and targeted ignores
- Training data (CulturaX corpus + HTML) is never the same as
  evaluation data (chardet test suite)