# corpa

**The ripgrep of text analysis.**

N-grams, readability, entropy, language detection, and BPE tokenization — all in milliseconds, from your terminal or Python.
## Features

### Fast
Parallel processing via rayon. Memory-mapped I/O. Analyzes multi-gigabyte corpora in under a second.

### Composable
Unix-friendly design with structured output. JSON, CSV, and table formats pipe seamlessly into jq, awk, and standard tooling.

### Comprehensive
Nine analysis commands covering vocabulary statistics, n-gram frequencies, readability indices, Shannon entropy, Zipf's law, and more.

### Multi-platform
Available as a native CLI binary and as a Python package via PyO3. An npm/WASM module for browser and Node.js is coming soon.

### Streaming
Process unbounded stdin streams with incremental chunk-based output: cumulative results are emitted after each chunk, with a configurable chunk size.
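The streaming model described above can be sketched in plain Python (an illustrative sketch with no corpa dependency; chunking here is by line count, and `stream_stats` is a hypothetical name, not corpa's API):

```python
import io
from typing import Iterator, TextIO

def stream_stats(stream: TextIO, chunk_lines: int = 2) -> Iterator[dict]:
    """Emit cumulative token/type counts after every `chunk_lines` lines."""
    tokens = 0
    types: set[str] = set()
    for i, line in enumerate(stream, start=1):
        words = line.split()
        tokens += len(words)
        types.update(w.lower() for w in words)
        if i % chunk_lines == 0:
            # Cumulative snapshot: totals over everything seen so far.
            yield {"lines": i, "tokens": tokens, "types": len(types)}

# Example over an in-memory "stream"
for snapshot in stream_stats(io.StringIO("a b c\nb c d\nc d e\n"), chunk_lines=1):
    print(snapshot)
```

Because each snapshot is cumulative, downstream consumers can pick up the latest line of output at any time without replaying the stream.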
## Quick Start

All commands accept file paths, directories, or stdin. The output format is controlled with `--format`.
```console
$ corpa stats prose.txt

corpa · prose.txt
┌─────────────────────┬────────────┐
│ Metric              ┆ Value      │
╞═════════════════════╪════════════╡
│ Tokens (words)      ┆ 175        │
│ Types (unique)      ┆ 95         │
│ Characters          ┆ 805        │
│ Sentences           ┆ 6          │
│ Type-Token Ratio    ┆ 0.5429     │
│ Hapax Legomena      ┆ 70 (73.7%) │
│ Avg Sentence Length ┆ 29.2 words │
└─────────────────────┴────────────┘

$ corpa lang document.txt --format json
[
  {"metric": "Language", "value": "English"},
  {"metric": "Script", "value": "Latin"},
  {"metric": "Confidence", "value": "0.99"}
]
```
```python
import corpa

# Analyze text statistics
result = corpa.stats("corpus.txt")
print(result["tokens"])            # 175
print(result["type_token_ratio"])  # 0.5429

# Or pass text directly
corpa.stats(text="The quick brown fox jumps over the lazy dog.")
# {'tokens': 9, 'types': 8, 'sentences': 1, ...}

# Language detection
lang = corpa.lang(text="Bonjour le monde")
print(lang["language"])    # "Français"
print(lang["confidence"])  # 0.99

# N-gram analysis
ngrams = corpa.ngrams("corpus.txt", n=2, top=10)
# [{'ngram': 'of the', 'frequency': 4521, 'relative_pct': 2.09}, ...]
```
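For intuition about what the stats metrics mean, they can be reproduced on the example sentence in a few lines of plain Python (a simplified sketch; corpa's own tokenizer and sentence splitter may differ, and `simple_stats` is a hypothetical helper, not part of corpa):

```python
import re

def simple_stats(text: str) -> dict:
    # Naive word tokenization; real tokenizers handle more edge cases.
    words = [w.lower() for w in re.findall(r"[A-Za-z']+", text)]
    counts: dict[str, int] = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    # Hapax legomena: types that occur exactly once.
    hapax = sum(1 for c in counts.values() if c == 1)
    return {
        "tokens": len(words),
        "types": len(counts),
        "type_token_ratio": round(len(counts) / len(words), 4),
        "hapax_legomena": hapax,
    }

print(simple_stats("The quick brown fox jumps over the lazy dog."))
# {'tokens': 9, 'types': 8, 'type_token_ratio': 0.8889, 'hapax_legomena': 7}
```

Note the case folding: "The" and "the" collapse to one type, which is why 9 tokens yield 8 types.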
## Commands
| Command | Description |
|---|---|
| `stats` | Token, type, and sentence counts. Type-token ratio, hapax legomena, average sentence length. |
| `ngrams` | N-gram frequency analysis with configurable N, top-K, minimum frequency, case folding, and stopword filtering. |
| `tokens` | Whitespace, sentence, and character tokenization. BPE token counts for GPT-3, GPT-4, and GPT-4o. |
| `readability` | Flesch-Kincaid Grade, Flesch Reading Ease, Coleman-Liau Index, Gunning Fog Index, SMOG Index. |
| `entropy` | Unigram, bigram, and trigram Shannon entropy. Entropy rate and vocabulary redundancy. |
| `perplexity` | N-gram language model perplexity with Laplace smoothing and Stupid Backoff. |
| `lang` | Language and script detection with confidence scoring. |
| `zipf` | Zipf's law rank-frequency distribution with exponent fitting and terminal sparkline plotting. |
| `completions` | Shell completion generation for bash, zsh, and fish. |
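As a reference for what the entropy command measures, unigram Shannon entropy over word frequencies can be computed like this (an illustrative sketch, not corpa's implementation; `unigram_entropy` is a hypothetical name):

```python
import math
from collections import Counter

def unigram_entropy(words: list[str]) -> float:
    """Shannon entropy H = -sum(p * log2(p)) over word frequencies, in bits."""
    counts = Counter(words)
    total = len(words)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# A uniform distribution over 4 distinct words carries exactly 2 bits per word.
print(unigram_entropy(["a", "b", "c", "d"]))  # 2.0
```

Lower entropy indicates a more repetitive, predictable vocabulary; higher-order (bigram, trigram) entropy extends the same idea to sequences of words.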
## Performance

Benchmarks were run on a 1 GB English text corpus (Apple M-series, 8 cores), measured against equivalent pure Python implementations.
## Installation

### Rust / Cargo

Install the native binary with full parallel processing and BPE tokenization support.

```sh
cargo install corpa
```

### pip / PyO3

All commands are exposed as native Python functions that accept file paths or text strings directly.

```sh
pip install corpa
```

### npm / WASM

Rust compiled to WebAssembly for browser and Node.js environments. Coming soon.