Contributing
============
Development Setup
-----------------
chardet uses `uv `_ for dependency management:
.. code-block:: bash
git clone https://github.com/chardet/chardet.git
cd chardet
uv sync # install dependencies
prek install # set up pre-commit hooks (ruff lint+format, etc.)
Running Tests
-------------
Tests use pytest. Test data is auto-cloned from the
`chardet/test-data `_ repo on
first run (cached in ``tests/data/``, gitignored).
.. code-block:: bash
uv run python -m pytest # run all tests
uv run python -m pytest tests/test_api.py # single file
uv run python -m pytest tests/test_api.py::test_detect_empty # single test
uv run python -m pytest -x # stop on first failure
Accuracy tests are dynamically parametrized from the test data via
``conftest.py``.
Linting and Formatting
----------------------
chardet uses `Ruff `_ with
``select = ["ALL"]`` and targeted ignores (see ``pyproject.toml``):
.. code-block:: bash
uv run ruff check . # lint
uv run ruff check --fix . # lint with auto-fix
uv run ruff format . # format
Pre-commit hooks run ruff automatically on each commit.
Training Models
---------------
Bigram frequency models are trained from the
`CulturaX `_ multilingual
corpus (via Hugging Face) plus HTML data (separate from the evaluation
test suite):
.. code-block:: bash
uv run python scripts/train.py
Training data is cached in ``data/`` (gitignored). Models are saved to
``src/chardet/models/models.bin``.
Benchmarks and Diagnostics
--------------------------
.. code-block:: bash
uv run python scripts/benchmark_time.py # latency benchmarks
uv run python scripts/benchmark_memory.py # memory usage benchmarks
uv run python scripts/diagnose_accuracy.py # detailed accuracy diagnostics
uv run python scripts/compare_detectors.py # compare against other detectors
Building Documentation
----------------------
.. code-block:: bash
uv sync --group docs # install Sphinx, Furo, etc.
uv run sphinx-build docs docs/_build # build HTML docs
uv run sphinx-build -W docs docs/_build # build with warnings as errors
Docs are published to `ReadTheDocs `_
on tag push.
Architecture Overview
---------------------
All detection flows through ``run_pipeline()`` in
``src/chardet/pipeline/orchestrator.py``, which runs stages in order —
each stage either returns a definitive result or passes to the next:
1. **BOM** (``bom.py``) — byte order mark
2. **UTF-16/32 patterns** (``utf1632.py``) — null-byte patterns
3. **Escape sequences** (``escape.py``) — ISO-2022-JP/KR, HZ-GB-2312
4. **Binary detection** (``binary.py``) — null bytes / control chars
5. **Markup charset** (``markup.py``) — ```` / ````
6. **ASCII** (``ascii.py``) — pure 7-bit check
7. **UTF-8** (``utf8.py``) — structural multi-byte validation
8. **Byte validity** (``validity.py``) — eliminate invalid encodings
9. **CJK gating** (in orchestrator) — eliminate spurious CJK candidates
10. **Structural probing** (``structural.py``) — multi-byte encoding fit
11. **Statistical scoring** (``statistical.py``) — bigram frequency models
12. **Post-processing** (orchestrator) — confusion groups, niche demotion
Key types:
- ``DetectionResult`` — frozen dataclass: ``encoding``, ``confidence``,
``language``
- ``EncodingInfo`` (``registry.py``) — frozen dataclass: ``name``,
``aliases``, ``era``, ``is_multibyte``, ``python_codec``
- ``EncodingEra`` (``enums.py``) — IntFlag for filtering candidates
- ``BigramProfile`` (``models/__init__.py``) — pre-computed bigram
frequencies
Model format: binary file ``src/chardet/models/models.bin`` — sparse
bigram tables loaded via ``struct.unpack``. Each model is a 65,536-byte
lookup table indexed by ``(b1 << 8) | b2``.
Optional mypyc Compilation
--------------------------
Hot-path modules can be compiled to C extensions with
`mypyc `_:
.. code-block:: bash
HATCH_BUILD_HOOK_ENABLE_MYPYC=true uv build
Compiled modules: ``models/__init__.py``, ``pipeline/structural.py``,
``pipeline/validity.py``, ``pipeline/statistical.py``,
``pipeline/utf1632.py``, ``pipeline/utf8.py``, ``pipeline/escape.py``.
These modules cannot use ``from __future__ import annotations``
(``FA100`` is ignored for them in ruff config).
Versioning
----------
Version is derived from git tags via ``hatch-vcs``. The tag is the
single source of truth — no hardcoded version strings. The generated
``src/chardet/_version.py`` is gitignored and should never be committed.
Conventions
-----------
- ``from __future__ import annotations`` in all source files (except
mypyc-compiled modules)
- Frozen dataclasses with ``slots=True`` for data types
- Ruff with ``select = ["ALL"]`` and targeted ignores
- Training data (CulturaX corpus + HTML) is never the same as
evaluation data (chardet test suite)