Contributing ============ Development Setup ----------------- chardet uses `uv `_ for dependency management: .. code-block:: bash git clone https://github.com/chardet/chardet.git cd chardet uv sync # install dependencies prek install # set up pre-commit hooks (ruff lint+format, etc.) Running Tests ------------- Tests use pytest. Test data is auto-cloned from the `chardet/test-data `_ repo on first run (cached in ``tests/data/``, gitignored). .. code-block:: bash uv run python -m pytest # run all tests uv run python -m pytest tests/test_api.py # single file uv run python -m pytest tests/test_api.py::test_detect_empty # single test uv run python -m pytest -x # stop on first failure Accuracy tests are dynamically parametrized from the test data via ``conftest.py``. Linting and Formatting ---------------------- chardet uses `Ruff `_ with ``select = ["ALL"]`` and targeted ignores (see ``pyproject.toml``): .. code-block:: bash uv run ruff check . # lint uv run ruff check --fix . # lint with auto-fix uv run ruff format . # format Pre-commit hooks run ruff automatically on each commit. Training Models --------------- Bigram frequency models are trained from the `CulturaX `_ multilingual corpus (via Hugging Face) plus HTML data (separate from the evaluation test suite): .. code-block:: bash uv run python scripts/train.py Training data is cached in ``data/`` (gitignored). Models are saved to ``src/chardet/models/models.bin``. Benchmarks and Diagnostics -------------------------- .. code-block:: bash uv run python scripts/benchmark_time.py # latency benchmarks uv run python scripts/benchmark_memory.py # memory usage benchmarks uv run python scripts/diagnose_accuracy.py # detailed accuracy diagnostics uv run python scripts/compare_detectors.py # compare against other detectors Building Documentation ---------------------- .. code-block:: bash uv sync --group docs # install Sphinx, Furo, etc. uv run sphinx-build docs docs/_build # build HTML docs uv run sphinx-build -W docs docs/_build # build with warnings as errors Docs are published to `ReadTheDocs `_ on tag push. Architecture Overview --------------------- All detection flows through ``run_pipeline()`` in ``src/chardet/pipeline/orchestrator.py``, which runs stages in order — each stage either returns a definitive result or passes to the next: 1. **BOM** (``bom.py``) — byte order mark 2. **UTF-16/32 patterns** (``utf1632.py``) — null-byte patterns 3. **Escape sequences** (``escape.py``) — ISO-2022-JP/KR, HZ-GB-2312 4. **Magic numbers** (``magic.py``) — binary file type identification 5. **Binary detection** (``binary.py``) — null bytes / control chars 6. **Markup charset** (``markup.py``) — ```` / ```` 7. **ASCII** (``ascii.py``) — pure 7-bit check 8. **UTF-8** (``utf8.py``) — structural multi-byte validation 9. **Byte validity** (``validity.py``) — eliminate invalid encodings 10. **CJK gating** (in orchestrator) — eliminate spurious CJK candidates 11. **Structural probing** (``structural.py``) — multi-byte encoding fit 12. **Statistical scoring** (``statistical.py``) — bigram frequency models 13. **Post-processing** (orchestrator) — confusion groups, niche demotion Key types: - ``DetectionResult`` — frozen dataclass: ``encoding``, ``confidence``, ``language``, ``mime_type`` - ``EncodingInfo`` (``registry.py``) — frozen dataclass: ``name``, ``aliases``, ``era``, ``is_multibyte``, ``languages`` - ``EncodingEra`` (``enums.py``) — IntFlag for filtering candidates - ``BigramProfile`` (``models/__init__.py``) — pre-computed bigram frequencies Model format: binary file ``src/chardet/models/models.bin`` — sparse bigram tables loaded via ``struct.unpack``. Each model is a 65,536-byte lookup table indexed by ``(b1 << 8) | b2``. Optional mypyc Compilation -------------------------- Hot-path modules can be compiled to C extensions with `mypyc `_: .. code-block:: bash HATCH_BUILD_HOOK_ENABLE_MYPYC=true uv build Compiled modules: ``models/__init__.py``, ``pipeline/structural.py``, ``pipeline/validity.py``, ``pipeline/statistical.py``, ``pipeline/utf1632.py``, ``pipeline/utf8.py``, ``pipeline/escape.py``, ``pipeline/orchestrator.py``, ``pipeline/confusion.py``, ``pipeline/magic.py``, ``pipeline/ascii.py``. These modules cannot use ``from __future__ import annotations`` (``FA100`` is ignored for them in ruff config). Versioning ---------- Version is derived from git tags via ``hatch-vcs``. The tag is the single source of truth — no hardcoded version strings. The generated ``src/chardet/_version.py`` is gitignored and should never be committed. Conventions ----------- - ``from __future__ import annotations`` in all source files (except mypyc-compiled modules) - Frozen dataclasses with ``slots=True`` for data types - Ruff with ``select = ["ALL"]`` and targeted ignores - Training data (CulturaX corpus + HTML) is never the same as evaluation data (chardet test suite)