Text Analysis CLI

corpa

The ripgrep of text analysis.

N-grams, readability, entropy, language detection, and BPE tokenization — all in milliseconds, from your terminal or Python.

$ cargo install corpa

Features

Fast

Parallel processing via rayon. Memory-mapped I/O. Analyzes gigabyte-scale corpora in seconds.

Composable

Unix-friendly design with structured output. JSON, CSV, and table formats pipe seamlessly with jq, awk, and standard tooling.

Comprehensive

Nine analysis commands covering vocabulary statistics, n-gram frequencies, readability indices, Shannon entropy, Zipf's law, and more.

Multi-platform

Available as a native CLI binary and a Python package via PyO3. npm/WASM module for browser and Node.js coming soon.

Streaming

Process unbounded stdin streams with incremental output: cumulative results are emitted after each chunk, with a configurable chunk size.
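The chunked-cumulative pattern is easy to picture in plain Python. This is a sketch of the idea, not corpa's implementation; here `chunk_size` counts input lines:

```python
from collections import Counter

def stream_stats(lines, chunk_size=3):
    """Emit cumulative token/type counts after every chunk_size lines."""
    counts = Counter()
    results = []
    for i, line in enumerate(lines, 1):
        counts.update(line.lower().split())
        if i % chunk_size == 0:
            # Cumulative, not per-chunk: totals cover everything seen so far.
            results.append({"tokens": sum(counts.values()),
                            "types": len(counts)})
    return results
```

Feeding it `sys.stdin` works the same way, since file objects iterate line by line.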


Quick Start

All commands accept file paths, directories, or stdin. Output format is controlled with --format.

Terminal
$ corpa stats prose.txt

  corpa · prose.txt
┌─────────────────────┬────────────┐
│ Metric              │      Value │
├─────────────────────┼────────────┤
│ Tokens (words)      │        175 │
│ Types (unique)      │         95 │
│ Characters          │        805 │
│ Sentences           │          6 │
│ Type-Token Ratio    │     0.5429 │
│ Hapax Legomena      │ 70 (73.7%) │
│ Avg Sentence Length │ 29.2 words │
└─────────────────────┴────────────┘

$ corpa lang document.txt --format json
[
  {"metric": "Language", "value": "English"},
  {"metric": "Script", "value": "Latin"},
  {"metric": "Confidence", "value": "0.99"}
]
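Most of what `stats` reports reduces to a few lines of plain Python. The sketch below uses naive whitespace tokenization with case folding, which is an assumption for illustration, not corpa's actual tokenizer:

```python
from collections import Counter

def basic_stats(text):
    # Naive whitespace tokenization with case folding (a sketch;
    # not corpa's actual tokenizer).
    tokens = text.lower().split()
    freq = Counter(tokens)
    hapax = sum(1 for c in freq.values() if c == 1)  # words occurring once
    return {
        "tokens": len(tokens),
        "types": len(freq),
        "type_token_ratio": round(len(freq) / len(tokens), 4),
        "hapax_legomena": hapax,
    }

basic_stats("The quick brown fox jumps over the lazy dog")
# {'tokens': 9, 'types': 8, 'type_token_ratio': 0.8889, 'hapax_legomena': 7}
```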
example.py
import corpa

# Analyze text statistics
result = corpa.stats("corpus.txt")
print(result["tokens"])            # 175
print(result["type_token_ratio"])  # 0.5429

# Or pass text directly
corpa.stats(text="The quick brown fox jumps over the lazy dog.")
# {'tokens': 9, 'types': 8, 'sentences': 1, ...}

# Language detection
lang = corpa.lang(text="Bonjour le monde")
print(lang["language"])  # "Français"
print(lang["confidence"]) # 0.99

# N-gram analysis
ngrams = corpa.ngrams("corpus.txt", n=2, top=10)
# [{'ngram': 'of the', 'frequency': 4521, 'relative_pct': 2.09}, ...]
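Conceptually, an n-gram count is a sliding window over the token stream. A minimal pure-Python sketch of the n=2 case, with field names mirroring the output above (corpa's case-folding and stopword options are omitted):

```python
from collections import Counter

def top_bigrams(text, top=3):
    """Count adjacent word pairs; a sketch of ngrams with n=2."""
    tokens = text.lower().split()
    pairs = zip(tokens, tokens[1:])  # sliding window of width 2
    counts = Counter(" ".join(p) for p in pairs)
    total = sum(counts.values())
    return [{"ngram": g, "frequency": f,
             "relative_pct": round(100 * f / total, 2)}
            for g, f in counts.most_common(top)]

top_bigrams("to be or not to be", top=1)
# [{'ngram': 'to be', 'frequency': 2, 'relative_pct': 40.0}]
```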

Commands

Command        Description
stats          Token, type, and sentence counts. Type-token ratio, hapax legomena, average sentence length.
ngrams         N-gram frequency analysis with configurable N, top-K, minimum frequency, case folding, and stopword filtering.
tokens         Whitespace, sentence, and character tokenization. BPE token counts for GPT-3, GPT-4, and GPT-4o.
readability    Flesch-Kincaid Grade, Flesch Reading Ease, Coleman-Liau Index, Gunning Fog Index, SMOG Index.
entropy        Unigram, bigram, and trigram Shannon entropy. Entropy rate and vocabulary redundancy.
perplexity     N-gram language-model perplexity with Laplace smoothing and Stupid Backoff.
lang           Language and script detection with confidence scoring.
zipf           Zipf's law rank-frequency distribution with exponent fitting and terminal sparkline plotting.
completions    Shell completion generation for bash, zsh, and fish.
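The unigram entropy that the entropy command reports is the standard Shannon quantity H = -Σ p·log2(p) over the token distribution. A minimal sketch (naive whitespace tokenization assumed, as above):

```python
import math
from collections import Counter

def unigram_entropy(text):
    """Shannon entropy in bits per token of the word distribution."""
    tokens = text.lower().split()
    n = len(tokens)
    counts = Counter(tokens)
    # H = -sum over types of p * log2(p)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

A text that alternates two words evenly gives exactly 1 bit per token; a text repeating one word gives 0.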

Performance

Benchmarks on a 1 GB English text corpus (Apple M-series, 8 cores).

Task               corpa    Python    Speedup
Word count         1.9s     11.5s     6x
Bigram frequency   3.4s     53.9s     16x
Readability        5.4s     107.9s    20x

Benchmarked against equivalent pure Python implementations.


Installation

CLI (Rust / Cargo)

Install the native binary with full parallel processing and BPE tokenization support.

$ cargo install corpa

Python (pip / PyO3)

All commands exposed as native Python functions. Accepts file paths or text strings directly.

$ pip install corpa

JavaScript (npm / WASM)

Rust compiled to WebAssembly for browser and Node.js environments. Coming soon.