Contributing
Development Setup
chardet uses uv for dependency management:
git clone https://github.com/chardet/chardet.git
cd chardet
uv sync # install dependencies
prek install # set up pre-commit hooks (ruff lint+format, etc.)
Running Tests
Tests use pytest. Test data is auto-cloned from the
chardet/test-data repo on
first run (cached in tests/data/, gitignored).
uv run python -m pytest # run all tests
uv run python -m pytest tests/test_api.py # single file
uv run python -m pytest tests/test_api.py::test_detect_empty # single test
uv run python -m pytest -x # stop on first failure
Accuracy tests are dynamically parametrized from the test data via
conftest.py.
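As a rough illustration of dynamic parametrization, the sketch below shows how a conftest.py hook could turn a directory of sample files into one test case per file. The directory layout (one subdirectory per expected encoding), the helper collect_cases, and the argument names expected_encoding and sample_path are assumptions made for this example; chardet's actual conftest.py may be organized differently.

```python
from pathlib import Path

# Hypothetical layout: <data_dir>/<expected-encoding>/<sample file>
DATA_DIR = Path("tests/data")


def collect_cases(data_dir: Path) -> list[tuple[str, Path]]:
    """Return (expected_encoding, sample_path) pairs found under data_dir."""
    if not data_dir.is_dir():
        return []
    cases = []
    for encoding_dir in sorted(p for p in data_dir.iterdir() if p.is_dir()):
        for sample in sorted(p for p in encoding_dir.iterdir() if p.is_file()):
            cases.append((encoding_dir.name, sample))
    return cases


def pytest_generate_tests(metafunc):
    """pytest hook: parametrize any test that requests both arguments."""
    wanted = {"expected_encoding", "sample_path"}
    if wanted <= set(metafunc.fixturenames):
        cases = collect_cases(DATA_DIR)
        metafunc.parametrize(
            ("expected_encoding", "sample_path"),
            cases,
            ids=[f"{enc}/{path.name}" for enc, path in cases],
        )
```

With a hook like this, a test function that declares expected_encoding and sample_path as arguments is run once per sample file, and each case gets a readable ID in the pytest output.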
Linting and Formatting
chardet uses Ruff with
select = ["ALL"] and targeted ignores (see pyproject.toml):
uv run ruff check . # lint
uv run ruff check --fix . # lint with auto-fix
uv run ruff format . # format
Pre-commit hooks run ruff automatically on each commit.
Training Models
Bigram frequency models are trained from the CulturaX multilingual corpus (via Hugging Face) plus HTML data (separate from the evaluation test suite):
uv run python scripts/train.py
Training data is cached in data/ (gitignored). Models are saved to
src/chardet/models/models.bin.
Benchmarks and Diagnostics
uv run python scripts/benchmark_time.py # latency benchmarks
uv run python scripts/benchmark_memory.py # memory usage benchmarks
uv run python scripts/diagnose_accuracy.py # detailed accuracy diagnostics
uv run python scripts/compare_detectors.py # compare against other detectors
Building Documentation
uv sync --group docs # install Sphinx, Furo, etc.
uv run sphinx-build docs docs/_build # build HTML docs
uv run sphinx-build -W docs docs/_build # build with warnings as errors
Docs are published to Read the Docs on tag push.
Architecture Overview
All detection flows through run_pipeline() in
src/chardet/pipeline/orchestrator.py, which runs stages in order —
each stage either returns a definitive result or passes to the next:
1. BOM (bom.py): byte order mark
2. UTF-16/32 patterns (utf1632.py): null-byte patterns
3. Escape sequences (escape.py): ISO-2022-JP/KR, HZ-GB-2312
4. Magic numbers (magic.py): binary file type identification
5. Binary detection (binary.py): null bytes / control chars
6. Markup charset (markup.py): <meta charset> / <?xml encoding>
7. ASCII (ascii.py): pure 7-bit check
8. UTF-8 (utf8.py): structural multi-byte validation
9. Byte validity (validity.py): eliminate invalid encodings
10. CJK gating (in orchestrator): eliminate spurious CJK candidates
11. Structural probing (structural.py): multi-byte encoding fit
12. Statistical scoring (statistical.py): bigram frequency models
13. Post-processing (orchestrator): confusion groups, niche demotion
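The "return a definitive result or pass to the next stage" flow described above can be sketched in a few lines. This is a simplified illustration, not the code in orchestrator.py: the stage functions, the run_stages name, and the returned encoding labels are all hypothetical, and only three of the stages are modelled.

```python
import codecs
from typing import Callable, Optional, Sequence

# A stage inspects the data and returns an encoding name, or None to defer.
Stage = Callable[[bytes], Optional[str]]


def bom_stage(data: bytes) -> Optional[str]:
    """Check for a byte order mark. UTF-32 is tested before UTF-16
    because the UTF-32-LE BOM begins with the UTF-16-LE BOM."""
    for bom, name in (
        (codecs.BOM_UTF8, "utf-8-sig"),
        (codecs.BOM_UTF32_LE, "utf-32"),
        (codecs.BOM_UTF32_BE, "utf-32"),
        (codecs.BOM_UTF16_LE, "utf-16"),
        (codecs.BOM_UTF16_BE, "utf-16"),
    ):
        if data.startswith(bom):
            return name
    return None


def ascii_stage(data: bytes) -> Optional[str]:
    """Pure 7-bit check."""
    return "ascii" if data.isascii() else None


def utf8_stage(data: bytes) -> Optional[str]:
    """Accept the data as UTF-8 only if it decodes without errors."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return None
    return "utf-8"


STAGES: Sequence[Stage] = (bom_stage, ascii_stage, utf8_stage)


def run_stages(data: bytes, stages: Sequence[Stage] = STAGES) -> Optional[str]:
    """Run stages in order and stop at the first definitive answer."""
    for stage in stages:
        result = stage(data)
        if result is not None:
            return result
    return None  # every stage deferred
```

Ordering matters in a design like this: cheap, unambiguous checks (such as the BOM) run first, so that the more expensive statistical stages only see input that nothing else could classify.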
Key types:
- DetectionResult: frozen dataclass with fields encoding, confidence, language, mime_type
- EncodingInfo (registry.py): frozen dataclass with fields name, aliases, era, is_multibyte, languages
- EncodingEra (enums.py): IntFlag for filtering candidates
- BigramProfile (models/__init__.py): pre-computed bigram frequencies
Model format: the binary file src/chardet/models/models.bin stores sparse
bigram tables, which are loaded via struct.unpack. Once loaded, each model is a
65,536-byte lookup table indexed by (b1 << 8) | b2.
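The sketch below illustrates the indexing scheme, assuming the sparse records are expanded into a dense table at load time. The on-disk layout used here (a little-endian uint32 record count followed by three-byte b1, b2, weight records) and the scoring function are invented for the example; they are not the actual models.bin format or chardet's scoring logic.

```python
import struct

TABLE_SIZE = 256 * 256  # 65,536 entries, one byte each


def pack_sparse(entries: dict[tuple[int, int], int]) -> bytes:
    """Serialize {(b1, b2): weight} using the assumed layout."""
    out = [struct.pack("<I", len(entries))]
    for (b1, b2), weight in sorted(entries.items()):
        out.append(struct.pack("<BBB", b1, b2, weight))
    return b"".join(out)


def load_dense(blob: bytes) -> bytearray:
    """Expand the sparse records into a dense 65,536-byte table."""
    (count,) = struct.unpack_from("<I", blob, 0)
    table = bytearray(TABLE_SIZE)  # bigrams not listed keep weight 0
    offset = 4
    for _ in range(count):
        b1, b2, weight = struct.unpack_from("<BBB", blob, offset)
        table[(b1 << 8) | b2] = weight
        offset += 3
    return table


def score(data: bytes, table: bytearray) -> int:
    """Sum the table weights of every adjacent byte pair in data."""
    return sum(table[(b1 << 8) | b2] for b1, b2 in zip(data, data[1:]))


# Usage: two known bigrams, "th" and "he".
blob = pack_sparse({(0x74, 0x68): 200, (0x68, 0x65): 150})
table = load_dense(blob)
print(score(b"the", table))  # 200 + 150 = 350
```

A dense table trades a fixed 64 KiB per model for constant-time lookups with no hashing, which suits a hot loop that visits every byte pair in the input.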
Optional mypyc Compilation
Hot-path modules can be compiled to C extensions with mypyc:
HATCH_BUILD_HOOK_ENABLE_MYPYC=true uv build
Compiled modules: models/__init__.py, pipeline/structural.py,
pipeline/validity.py, pipeline/statistical.py,
pipeline/utf1632.py, pipeline/utf8.py, pipeline/escape.py,
pipeline/orchestrator.py, pipeline/confusion.py,
pipeline/magic.py, pipeline/ascii.py.
These modules cannot use from __future__ import annotations
(FA100 is ignored for them in ruff config).
Versioning
Version is derived from git tags via hatch-vcs. The tag is the
single source of truth — no hardcoded version strings. The generated
src/chardet/_version.py is gitignored and should never be committed.
Conventions
- from __future__ import annotations in all source files (except mypyc-compiled modules)
- Frozen dataclasses with slots=True for data types
- Ruff with select = ["ALL"] and targeted ignores
- Training data (CulturaX corpus + HTML) is never the same as evaluation data (chardet test suite)