# corpa

**The ripgrep of text analysis.**

N-grams, readability, entropy, language detection, and BPE tokenization — all in milliseconds, from your terminal or Python.
## Features

### Fast
Parallel processing via rayon. Memory-mapped I/O. Analyzes multi-gigabyte corpora in under a second.

### Composable
Unix-friendly design with structured output. JSON, CSV, and table formats pipe seamlessly into jq, awk, and standard tooling.

### Comprehensive
Nine analysis commands covering vocabulary statistics, n-gram frequencies, readability indices, Shannon entropy, Zipf's law, and more.

### Multi-platform
Available as a native CLI binary and as a Python package via PyO3. An npm/WASM module for browser and Node.js is coming soon.

### Streaming
Process unbounded stdin streams with incremental chunk-based output: cumulative results are emitted after each chunk, with a configurable chunk size.
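The streaming model described above can be sketched in plain Python (an illustrative sketch with no corpa dependency; chunking here is by line count, and `stream_stats` is a hypothetical name, not corpa's API):

```python
import io
from typing import Iterator, TextIO

def stream_stats(stream: TextIO, chunk_lines: int = 2) -> Iterator[dict]:
    """Emit cumulative token/type counts after every `chunk_lines` lines."""
    tokens = 0
    types: set[str] = set()
    for i, line in enumerate(stream, start=1):
        words = line.split()
        tokens += len(words)
        types.update(w.lower() for w in words)
        if i % chunk_lines == 0:
            # Cumulative snapshot: totals over everything seen so far.
            yield {"lines": i, "tokens": tokens, "types": len(types)}

# Example over an in-memory "stream"
for snapshot in stream_stats(io.StringIO("a b c\nb c d\nc d e\n"), chunk_lines=1):
    print(snapshot)
```

Because each snapshot is cumulative, downstream consumers can pick up the latest line of output at any time without replaying the stream.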
## Quick Start

All commands accept file paths, directories, or stdin. The output format is controlled with `--format`.
```console
$ corpa stats prose.txt

corpa · prose.txt
┌─────────────────────┬────────────┐
│ Metric              ┆ Value      │
╞═════════════════════╪════════════╡
│ Tokens (words)      ┆ 175        │
│ Types (unique)      ┆ 95         │
│ Characters          ┆ 805        │
│ Sentences           ┆ 6          │
│ Type-Token Ratio    ┆ 0.5429     │
│ Hapax Legomena      ┆ 70 (73.7%) │
│ Avg Sentence Length ┆ 29.2 words │
└─────────────────────┴────────────┘

$ corpa lang document.txt --format json
[
  {"metric": "Language", "value": "English"},
  {"metric": "Script", "value": "Latin"},
  {"metric": "Confidence", "value": "0.99"}
]
```
```python
import corpa

# Analyze text statistics
result = corpa.stats("corpus.txt")
print(result["tokens"])            # 175
print(result["type_token_ratio"])  # 0.5429

# Or pass text directly
corpa.stats(text="The quick brown fox jumps over the lazy dog.")
# {'tokens': 9, 'types': 8, 'sentences': 1, ...}

# Language detection
lang = corpa.lang(text="Bonjour le monde")
print(lang["language"])    # "Français"
print(lang["confidence"])  # 0.99

# N-gram analysis
ngrams = corpa.ngrams("corpus.txt", n=2, top=10)
# [{'ngram': 'of the', 'frequency': 4521, 'relative_pct': 2.09}, ...]
```
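For intuition about what the stats metrics mean, they can be reproduced on the example sentence in a few lines of plain Python (a simplified sketch; corpa's own tokenizer and sentence splitter may differ, and `simple_stats` is a hypothetical helper, not part of corpa):

```python
import re

def simple_stats(text: str) -> dict:
    # Naive word tokenization; real tokenizers handle more edge cases.
    words = [w.lower() for w in re.findall(r"[A-Za-z']+", text)]
    counts: dict[str, int] = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    # Hapax legomena: types that occur exactly once.
    hapax = sum(1 for c in counts.values() if c == 1)
    return {
        "tokens": len(words),
        "types": len(counts),
        "type_token_ratio": round(len(counts) / len(words), 4),
        "hapax_legomena": hapax,
    }

print(simple_stats("The quick brown fox jumps over the lazy dog."))
# {'tokens': 9, 'types': 8, 'type_token_ratio': 0.8889, 'hapax_legomena': 7}
```

Note the case folding: "The" and "the" collapse to one type, which is why 9 tokens yield 8 types.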
## Commands
| Command | Description |
|---|---|
| `stats` | Token, type, and sentence counts. Type-token ratio, hapax legomena, average sentence length. |
| `ngrams` | N-gram frequency analysis with configurable N, top-K, minimum frequency, case folding, and stopword filtering. |
| `tokens` | Whitespace, sentence, and character tokenization. BPE token counts for GPT-3, GPT-4, and GPT-4o. |
| `readability` | Flesch-Kincaid Grade, Flesch Reading Ease, Coleman-Liau Index, Gunning Fog Index, SMOG Index. |
| `entropy` | Unigram, bigram, and trigram Shannon entropy. Entropy rate and vocabulary redundancy. |
| `perplexity` | N-gram language model perplexity with Laplace smoothing and Stupid Backoff. |
| `lang` | Language and script detection with confidence scoring. |
| `zipf` | Zipf's law rank-frequency distribution with exponent fitting and terminal sparkline plotting. |
| `completions` | Shell completion generation for bash, zsh, and fish. |
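As a reference for what the entropy command measures, unigram Shannon entropy over word frequencies can be computed like this (an illustrative sketch, not corpa's implementation; `unigram_entropy` is a hypothetical name):

```python
import math
from collections import Counter

def unigram_entropy(words: list[str]) -> float:
    """Shannon entropy H = -sum(p * log2(p)) over word frequencies, in bits."""
    counts = Counter(words)
    total = len(words)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# A uniform distribution over 4 distinct words carries exactly 2 bits per word.
print(unigram_entropy(["a", "b", "c", "d"]))  # 2.0
```

Lower entropy indicates a more repetitive, predictable vocabulary; higher-order (bigram, trigram) entropy extends the same idea to sequences of words.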
## Performance

Benchmarks were run on a 1 GB English text corpus (Apple M-series, 8 cores), measured against equivalent pure Python implementations.
## Installation

### Rust / Cargo

Install the native binary with full parallel processing and BPE tokenization support.

```sh
cargo install corpa
```

### pip / PyO3

All commands are exposed as native Python functions that accept file paths or text strings directly.

```sh
pip install corpa
```

### npm / WASM

Rust compiled to WebAssembly for browser and Node.js environments. Coming soon.