Contributing

Development Setup

chardet uses uv for dependency management:

git clone https://github.com/chardet/chardet.git
cd chardet
uv sync                    # install dependencies
prek install               # set up pre-commit hooks (ruff lint+format, etc.)

Running Tests

Tests use pytest. On the first run, the test data is cloned automatically from the chardet/test-data repo and cached in tests/data/, which is gitignored.

uv run python -m pytest                              # run all tests
uv run python -m pytest tests/test_api.py            # single file
uv run python -m pytest tests/test_api.py::test_detect_empty  # single test
uv run python -m pytest -x                           # stop on first failure

Accuracy tests are dynamically parametrized from the test data via conftest.py.
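As a rough illustration of dynamic parametrization, the sketch below shows how a conftest.py could turn sample files into individual test cases with pytest's pytest_generate_tests hook. It is not chardet's actual conftest.py: the directory layout (tests/data/<encoding>-<language>/), the collect_cases helper, and the fixture names expected_encoding and sample_path are all assumptions made for this example.

```python
from pathlib import Path

# Assumed layout (hypothetical): tests/data/<encoding>-<language>/<sample files>
DATA_DIR = Path("tests/data")


def collect_cases(data_dir: Path) -> list[tuple[str, Path]]:
    """Return (expected_encoding, sample_path) pairs found under data_dir."""
    cases: list[tuple[str, Path]] = []
    if not data_dir.is_dir():
        return cases
    for subdir in sorted(p for p in data_dir.iterdir() if p.is_dir()):
        # Assumed naming: everything before the last "-" is the encoding,
        # so "utf-8-english" maps to "utf-8".
        expected = subdir.name.rsplit("-", 1)[0]
        for sample in sorted(f for f in subdir.iterdir() if f.is_file()):
            cases.append((expected, sample))
    return cases


def pytest_generate_tests(metafunc) -> None:
    """pytest hook: emit one test per sample for tests that request both fixtures."""
    wanted = {"expected_encoding", "sample_path"}
    if wanted <= set(metafunc.fixturenames):
        cases = collect_cases(DATA_DIR)
        metafunc.parametrize(
            ("expected_encoding", "sample_path"),
            cases,
            ids=[f"{enc}/{path.name}" for enc, path in cases],
        )
```

With a hook like this, any test function that takes expected_encoding and sample_path as arguments runs once per sample file, and each case gets a readable ID in the pytest report.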

Linting and Formatting

chardet uses Ruff with select = ["ALL"] and targeted ignores (see pyproject.toml):

uv run ruff check .        # lint
uv run ruff check --fix .  # lint with auto-fix
uv run ruff format .       # format

Pre-commit hooks run ruff automatically on each commit.

Training Models

Bigram frequency models are trained on the CulturaX multilingual corpus (via Hugging Face) plus HTML data. Both sources are kept separate from the evaluation test suite:

uv run python scripts/train.py

Training data is cached in data/ (gitignored). Models are saved to src/chardet/models/models.bin.
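To make the idea of a bigram frequency model concrete, here is a minimal sketch of counting byte bigrams in encoded text and scaling them into a table of weights. It does not reproduce scripts/train.py: the function name train_bigram_table, the 1 to 255 scaling, and the use of plain Python strings as the corpus are assumptions for illustration only.

```python
from collections import Counter


def train_bigram_table(samples: list[str], encoding: str) -> bytearray:
    """Build a 65,536-entry table of byte-bigram weights in the range 0-255."""
    counts: Counter[int] = Counter()
    for text in samples:
        # Characters the target encoding cannot represent are skipped.
        data = text.encode(encoding, errors="ignore")
        for b1, b2 in zip(data, data[1:]):
            counts[(b1 << 8) | b2] += 1

    table = bytearray(65536)
    if not counts:
        return table
    top = max(counts.values())
    for index, count in counts.items():
        # Assumed scaling: most frequent bigram gets 255, and every observed
        # bigram gets at least 1 so it stays distinguishable from unseen ones.
        table[index] = max(1, round(255 * count / top))
    return table
```

For example, train_bigram_table(["abab"], "ascii") gives the bigram "ab" the top weight of 255, because it occurs twice while "ba" occurs once.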

Benchmarks and Diagnostics

uv run python scripts/benchmark_time.py     # latency benchmarks
uv run python scripts/benchmark_memory.py   # memory usage benchmarks
uv run python scripts/diagnose_accuracy.py  # detailed accuracy diagnostics
uv run python scripts/compare_detectors.py  # compare against other detectors

Building Documentation

uv sync --group docs                          # install Sphinx, Furo, etc.
uv run sphinx-build docs docs/_build          # build HTML docs
uv run sphinx-build -W docs docs/_build       # build with warnings as errors

Docs are published to Read the Docs when a tag is pushed.

Architecture Overview

All detection flows through run_pipeline() in src/chardet/pipeline/orchestrator.py, which runs stages in order — each stage either returns a definitive result or passes to the next:

BOM (bom.py) — byte order mark
UTF-16/32 patterns (utf1632.py) — null-byte patterns
Escape sequences (escape.py) — ISO-2022-JP/KR, HZ-GB-2312
Magic numbers (magic.py) — binary file type identification
Binary detection (binary.py) — null bytes / control chars
Markup charset (markup.py) — <meta charset> / <?xml encoding>
ASCII (ascii.py) — pure 7-bit check
UTF-8 (utf8.py) — structural multi-byte validation
Byte validity (validity.py) — eliminate invalid encodings
CJK gating (in orchestrator) — eliminate spurious CJK candidates
Structural probing (structural.py) — multi-byte encoding fit
Statistical scoring (statistical.py) — bigram frequency models
Post-processing (orchestrator) — confusion groups, niche demotion
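The staged flow above can be sketched as a list of functions that each either return a result or pass. This is a simplified illustration, not the code in orchestrator.py: run_stages, the three stage functions, and the use of a plain encoding name as the result are hypothetical, and decoding with the utf-8 codec stands in for the real structural validation.

```python
from __future__ import annotations

from typing import Callable, Optional

# A stage inspects the input and returns an encoding name, or None to pass.
Stage = Callable[[bytes], Optional[str]]


def detect_bom(data: bytes) -> Optional[str]:
    # Longer BOMs come first: the UTF-32-LE BOM begins with the UTF-16-LE BOM.
    boms = [
        (b"\xff\xfe\x00\x00", "utf-32-le"),
        (b"\x00\x00\xfe\xff", "utf-32-be"),
        (b"\xef\xbb\xbf", "utf-8-sig"),
        (b"\xff\xfe", "utf-16-le"),
        (b"\xfe\xff", "utf-16-be"),
    ]
    for bom, name in boms:
        if data.startswith(bom):
            return name
    return None


def detect_ascii(data: bytes) -> Optional[str]:
    # Pure 7-bit check.
    return "ascii" if data.isascii() else None


def detect_utf8(data: bytes) -> Optional[str]:
    # Stand-in for structural validation: accept only well-formed UTF-8.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return None
    return "utf-8"


STAGES: list[Stage] = [detect_bom, detect_ascii, detect_utf8]


def run_stages(data: bytes, stages: list[Stage]) -> Optional[str]:
    """Run stages in order and stop at the first definitive answer."""
    for stage in stages:
        result = stage(data)
        if result is not None:
            return result
    return None  # no stage was certain; later stages would take over here
```

Ordering matters in a design like this: cheap, high-certainty checks run first, so the more expensive stages only see input that the earlier ones could not settle.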

Key types:

DetectionResult — frozen dataclass: encoding, confidence, language, mime_type
EncodingInfo (registry.py) — frozen dataclass: name, aliases, era, is_multibyte, languages
EncodingEra (enums.py) — IntFlag for filtering candidates
BigramProfile (models/__init__.py) — pre-computed bigram frequencies

Model format: binary file src/chardet/models/models.bin. The file stores sparse bigram tables, which are read with struct.unpack. Once loaded, each model is a 65,536-byte lookup table indexed by (b1 << 8) | b2, where b1 and b2 are two consecutive bytes of the input.
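The relationship between sparse storage and a dense lookup table can be sketched as follows. The record layout used here (a little-endian 4-byte count followed by 3-byte b1, b2, weight records) is invented for the example and is not the actual models.bin format; pack_sparse, load_dense, and score are hypothetical helpers.

```python
import struct


def pack_sparse(entries: dict[tuple[int, int], int]) -> bytes:
    """Serialize {(b1, b2): weight} as a count followed by 3-byte records."""
    blob = struct.pack("<I", len(entries))
    for (b1, b2), weight in entries.items():
        blob += struct.pack("<BBB", b1, b2, weight)
    return blob


def load_dense(blob: bytes) -> bytearray:
    """Expand the sparse records into a 65,536-byte lookup table."""
    (count,) = struct.unpack_from("<I", blob, 0)
    table = bytearray(65536)  # every bigram not listed keeps weight 0
    offset = 4
    for _ in range(count):
        b1, b2, weight = struct.unpack_from("<BBB", blob, offset)
        table[(b1 << 8) | b2] = weight
        offset += 3
    return table


def score(data: bytes, table: bytearray) -> int:
    """Sum the weights of all adjacent byte pairs in data."""
    return sum(table[(b1 << 8) | b2] for b1, b2 in zip(data, data[1:]))
```

The index (b1 << 8) | b2 places the first byte in the high 8 bits and the second in the low 8 bits, so the 256 × 256 possible byte pairs map one-to-one onto indices 0 through 65,535 and each lookup is a single array access.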

Optional mypyc Compilation

Hot-path modules can be compiled to C extensions with mypyc:

HATCH_BUILD_HOOK_ENABLE_MYPYC=true uv build

Compiled modules: models/__init__.py, pipeline/structural.py, pipeline/validity.py, pipeline/statistical.py, pipeline/utf1632.py, pipeline/utf8.py, pipeline/escape.py, pipeline/orchestrator.py, pipeline/confusion.py, pipeline/magic.py, pipeline/ascii.py.

These modules cannot use from __future__ import annotations (FA100 is ignored for them in ruff config).

Versioning

Version is derived from git tags via hatch-vcs. The tag is the single source of truth — no hardcoded version strings. The generated src/chardet/_version.py is gitignored and should never be committed.

Conventions

from __future__ import annotations in all source files (except mypyc-compiled modules)
Frozen dataclasses with slots=True for data types
Ruff with select = ["ALL"] and targeted ignores
Training data (CulturaX corpus + HTML) never overlaps with evaluation data (chardet test suite)