Skip to content

Grapa Grep Documentation

Tip: Use the tabs below to switch between CLI and Python examples throughout this documentation.

Recent Fixes and Known Gaps

  • Invert match and empty pattern logic now match ripgrep/grep (see test suite for details).
  • Structured array output is a deliberate design choice and affects edge cases (see notes below).
  • Remaining advanced gaps:
    • Multiline patterns with custom delimiters (Grapa extension, may not be fully ripgrep-compatible)
    • Full Unicode grapheme cluster support (\X)
    • Parallel processing for very large inputs
    • Deduplication (d option) in all modes
  • See maintainers/BINARY_grep.md for internal details and future work.

Thread Safety and Parallelism

Grapa is fully thread safe by design. All variable and data structure updates are internally synchronized at the C++ level, so you will never encounter crashes or corruption from concurrent access. However, if your program logic allows multiple threads to read and write the same variable or data structure (for example, when using parallel grep features), you may see logical race conditions (unexpected values, overwrites, etc.). This is a design consideration, not a stability issue. Minimize shared mutable state between threads unless intentional.

Only $thread() objects provide explicit locking and unlocking via lock(), unlock(), and trylock(). To protect access to a shared resource, create a $thread() lock object and use it to guard access. Calling .lock() or .unlock() on a regular variable (like an array or scalar) will return an error.

Canonical Example:

lock_obj = $thread();
lock_obj.lock();
/* ... perform thread-safe operations on shared data ... */
lock_obj.unlock();
import grapapy
xy = grapapy.grapa()

xy.eval("""
lock_obj = $thread();
lock_obj.lock();
/* ... perform thread-safe operations on shared data ... */
lock_obj.unlock();
""")

See Threading and Locking and Function Operators: static and const for details and best practices.

Who is this for?

Anyone who wants to use Grapa's advanced pattern matching, achieve ripgrep parity, or understand Unicode/PCRE2 grep features in Grapa.

Key Syntax Rules

  • Use block comments (/* ... */), not line comments (// ...).
  • To append to arrays, use the += operator (not .push() or .append()).
  • All statements and blocks must end with a semicolon (;).
  • Use raw strings (r"...") for regex patterns to avoid escaping issues
  • Convert binary data using .decode('latin-1') for Grapa processing
  • Use the xy.eval() method to execute Grapa code from Python

Basic Usage

Pattern Matching

/* Basic pattern matching */
text = "Hello world\nGoodbye world";
matches = text.grep("world");
matches.echo();  /* ["Hello world", "Goodbye world"] */

/* Match-only output */
matches = text.grep("world", "o");
matches.echo();  /* ["world", "world"] */

/* Case-insensitive matching */
matches = text.grep("hello", "i");
matches.echo();  /* ["Hello world"] */
import grapapy
xy = grapapy.grapa()

# Basic pattern matching
text = "Hello world\nGoodbye world"
matches = xy.eval("text.grep('world');", {"text": text})
print(matches)  # ['Hello world', 'Goodbye world']

# Match-only output
matches = xy.eval("text.grep('world', 'o');", {"text": text})
print(matches)  # ['world', 'world']

# Case-insensitive matching
matches = xy.eval("text.grep('hello', 'i');", {"text": text})
print(matches)  # ['Hello world']

Unicode and Normalization

Grapa's grep supports full Unicode processing with normalization options:

/* NFC normalization (default) */
matches = text.grep("café", "N");

/* NFD normalization */
matches = text.grep("café", "NFD");

/* NFKC normalization */
matches = text.grep("café", "NFKC");

/* NFKD normalization */
matches = text.grep("café", "NFKD");
import grapapy
xy = grapapy.grapa()

# NFC normalization (default)
matches = xy.eval("text.grep('café', 'N');", {"text": "café"})

# NFD normalization
matches = xy.eval("text.grep('café', 'NFD');", {"text": "café"})

# NFKC normalization
matches = xy.eval("text.grep('café', 'NFKC');", {"text": "café"})

# NFKD normalization
matches = xy.eval("text.grep('café', 'NFKD');", {"text": "café"})

Unicode Properties

/* Match letters */
matches = text.grep("\\p{L}+");

/* Match numbers */
matches = text.grep("\\p{N}+");

/* Match word characters */
matches = text.grep("\\w+");

/* Match grapheme clusters (Unicode extended grapheme clusters) */
matches = text.grep("\\X+");
import grapapy
xy = grapapy.grapa()

# Match letters
matches = xy.eval("text.grep(r'\\p{L}+');", {"text": "Hello 世界 123"})

# Match numbers
matches = xy.eval("text.grep(r'\\p{N}+');", {"text": "Hello 世界 123"})

# Match word characters
matches = xy.eval("text.grep(r'\\w+');", {"text": "Hello 世界 123"})

# Match grapheme clusters
matches = xy.eval("text.grep(r'\\X+');", {"text": "Hello 世界 123"})

Grapheme Cluster Examples

/* Basic grapheme cluster matching */
text = "café";
clusters = text.grep("\\X", "o");
clusters.echo();
/* ["c", "a", "f", "é"]  /* é is a single grapheme cluster (e + combining acute) */

/* Complex grapheme clusters */
text = "😀❤️";
clusters = text.grep("\\X", "o");
clusters.echo();
/* ["😀", "❤️"]  /* Heart with emoji modifier */

/* Grapheme clusters with newlines */
text = "é\n😀";
clusters = text.grep("\\X", "o");
clusters.echo();
/* ["é", "\n", "😀"]  /* Newlines are treated as separate clusters */

/* Grapheme clusters with quantifiers */
text = "café";
matches = text.grep("\\X+", "o");
matches.echo();
/* ["café"]  /* One or more grapheme clusters */

matches = text.grep("\\X*", "o");
matches.echo();
/* ["", "café", ""]  /* Zero or more grapheme clusters */

matches = text.grep("\\X?", "o");
matches.echo();
/* ["", "c", "", "a", "", "f", "", "é", ""]  /* Zero or one grapheme cluster */

matches = text.grep("\\X{2,3}", "o");
matches.echo();
/* ["ca", "fé"]  /* Between 2 and 3 grapheme clusters */
import grapapy
xy = grapapy.grapa()

# Basic grapheme cluster matching
text = "café"
clusters = xy.eval("text.grep(r'\\X', 'o');", {"text": text})
print(clusters)  # ['c', 'a', 'f', 'é'] - é is a single grapheme cluster

# Complex grapheme clusters
text = "😀❤️"
clusters = xy.eval("text.grep(r'\\X', 'o');", {"text": text})
print(clusters)  # ['😀', '❤️'] - Heart with emoji modifier

# Grapheme clusters with newlines
text = \n😀"
clusters = xy.eval("text.grep(r'\\X', 'o');", {"text": text})
print(clusters)  # ['é', '\n', '😀'] - Newlines are treated as separate clusters

# Grapheme clusters with quantifiers
text = "café"
matches = xy.eval("text.grep(r'\\X+', 'o');", {"text": text})
print(matches)  # ['café'] - One or more grapheme clusters

matches = xy.eval("text.grep(r'\\X*', 'o');", {"text": text})
print(matches)  # ['', 'café', ''] - Zero or more grapheme clusters

matches = xy.eval("text.grep(r'\\X?', 'o');", {"text": text})
print(matches)  # ['', 'c', '', 'a', '', 'f', '', 'é', ''] - Zero or one grapheme cluster

matches = xy.eval("text.grep(r'\\X{2,3}', 'o');", {"text": text})
print(matches)  # ['ca', 'fé'] - Between 2 and 3 grapheme clusters

Diacritic-Insensitive Matching

/* Match café, cafe, café, etc. */
text = "café cafe café";
matches = text.grep("cafe", "d");
matches.echo();  /* ["café", "cafe", "café"] */
import grapapy
xy = grapapy.grapa()

# Match café, cafe, café, etc.
text = "café cafe café"
matches = xy.eval("text.grep('cafe', 'd');", {"text": text})
print(matches)  # ['café', 'cafe', 'café']

Edge Cases and Special Handling

/* Zero-length matches (now working correctly) */
text = "abc";
matches = text.grep("", "o");
matches.echo();
/* [""]  /* Single empty string for zero-length match */

/* Empty pattern (now working correctly) */
matches = text.grep("", "o");
matches.echo();
/* [""]  /* Single empty string for empty pattern */

/* Unicode boundary handling */
text = "café";
matches = text.grep("\\b\\w+\\b", "o");
matches.echo();  /* ["café"] */

Function Signature

input.grep(pattern, options, delimiter, normalization, mode, num_workers)
xy.eval("input.grep(pattern, options, delimiter, normalization, mode, num_workers);", {
    "input": "input_text",
    "pattern": "regex_pattern",
    "options": "option_flags",
    "delimiter": "line_delimiter",
    "normalization": "normalization_form",
    "mode": "processing_mode",
    "num_workers": worker_count
})

Parameters

  • input: String to search in
  • pattern: Regex pattern to match
  • options: String of option flags (see Options section)
  • delimiter: Custom line delimiter (default: newline)
  • normalization: Unicode normalization form ("NONE", "NFC", "NFD", "NFKC", "NFKD")
  • mode: Processing mode ("UNICODE" or "BINARY")
  • num_workers: Number of parallel workers (0 = auto-detect)

Options

Output Options

Option Description Example
o Match-only output (extract matches only) "Hello world".grep("\\w+", "o")["Hello", "world"]
f Full segments mode (return complete segments containing matches) "Hello world".grep("\\w+", "f")["Hello world"]
of Match-only + full segments (return full segments in match-only mode) "Hello world".grep("\\w+", "of")["Hello world"]
j JSON output format "Hello world".grep("world", "j") → JSON object
n Include line numbers "Line 1\nLine 2".grep("Line", "n")["1:Line 1", "2:Line 2"]
l Files with matches only Returns array of matching lines
c Count only Returns count of matches

Matching Options

Option Description Example
i Case-insensitive matching "Hello WORLD".grep("world", "i")
d Diacritic-insensitive matching "café".grep("cafe", "d")
v Invert match (non-matching lines) "Line 1\nLine 2".grep("Line 1", "v")["Line 2"]
x Exact match (whole line) "Hello".grep("^Hello$", "x")
w Word boundaries "foo bar".grep("foo", "w")["foo bar"]
a All-mode (treat input as single line) "Line 1\nLine 2".grep("Line.*Line", "a")

Context Options

Option Description Example
A<n> After context (n lines after match) "Line 1\nLine 2\nLine 3".grep("Line 2", "A1")
B<n> Before context (n lines before match) "Line 1\nLine 2\nLine 3".grep("Line 2", "B1")
C<n> Context (n lines before and after) "Line 1\nLine 2\nLine 3".grep("Line 2", "C1")

Context Merging: Overlapping context regions are automatically merged into single blocks, ensuring all relevant context is shown without duplication. This matches ripgrep's behavior for optimal readability.

Context Separators: When using context options, non-overlapping context blocks are separated by -- lines (matching ripgrep/GNU grep behavior). Context separators are not output in match-only mode ("o" option).

Special Options

Option Description Example
T Column output (1-based column numbers) "foo bar".grep("foo", "oT")["1:foo"]
z Null-data mode (split on null bytes) "data1\x00data2".grep("data", "z")
L Color output (ANSI color codes) "Hello world".grep("world", "oL")["\x1b[1;31mworld\x1b[0m"]
N Unicode normalization "café".grep("cafe", "N")

Unicode Support

Normalization Forms

/* NFC normalization (default) */
"café".grep("cafe", "NFC")

/* NFD normalization */
"café".grep("cafe", "NFD")

/* NFKC normalization */
"café".grep("cafe", "NFKC")

/* NFKD normalization */
"café".grep("cafe", "NFKD")

Unicode Properties

/* Match letters */
"Hello 世界 123".grep("\\p{L}+", "o")
["Hello", "世界"]

/* Match numbers */
"Hello 世界 123".grep("\\p{N}+", "o")
["123"]

/* Match word characters */
"Hello 世界 123".grep("\\w+", "o")
["Hello", "123"]

/* Match grapheme clusters (Unicode extended grapheme clusters) */
"e\u0301\n😀\u2764\ufe0f".grep("\\X", "o")
["é", "\n", "😀", "❤️"]

Grapheme Cluster Pattern (\X)

The \X pattern matches Unicode extended grapheme clusters, which are user-perceived characters that may consist of multiple Unicode codepoints:

/* Basic grapheme cluster matching */
"café".grep("\\X", "o")
["c", "a", "f", "é"]  /* é is a single grapheme cluster (e + combining acute) */

/* Complex grapheme clusters */
"😀\u2764\ufe0f".grep("\\X", "o")
["😀", "❤️"]  /* Heart with emoji modifier */

/* Grapheme clusters with newlines */
"é\n😀".grep("\\X", "o")
["é", "\n", "😀"]  /* Newlines are treated as separate clusters */

/* Grapheme clusters with quantifiers */
"café".grep("\\X+", "o")
["café"]  /* One or more grapheme clusters */

"café".grep("\\X*", "o")
["", "café", ""]  /* Zero or more grapheme clusters */

"café".grep("\\X?", "o")
["", "c", "", "a", "", "f", "", "é", ""]  /* Zero or one grapheme cluster */

"café".grep("\\X{2,3}", "o")
["ca", "fé"]  /* Between 2 and 3 grapheme clusters */

Note: The \X pattern uses direct Unicode grapheme cluster segmentation and bypasses the regex engine for optimal performance and accuracy. All quantifiers (+, *, ?, {n,m}) are fully supported.

Diacritic-Insensitive Matching

/* Match café, cafe, café, etc. */
"café résumé naïve".grep("cafe", "d")
["café résumé naïve"]

Unicode Boundary Handling

When using the "o" (match-only) option with Unicode normalization or case-insensitive matching, Grapa uses a hybrid mapping strategy to extract matches from the original string:

  1. Grapheme cluster boundary alignment - Maps matches by Unicode grapheme cluster boundaries
  2. Character-by-character alignment - Falls back to character-level mapping for simple cases
  3. Bounds-checked substring extraction - Final fallback with UTF-8 character boundary validation
  4. Empty string fallback - Never returns null values, always returns valid strings

Note: In complex Unicode scenarios (e.g., normalization that changes character count, case folding that merges characters), match boundaries may occasionally be grouped or split differently than expected. This is a fundamental Unicode complexity, not a bug. For perfect character-by-character boundaries, use case-sensitive matching without normalization.

Unicode Edge Cases

/* Zero-length matches (now working correctly) */
"abc".grep("^", "o")
[""]  /* Single empty string for zero-length match */

/* Empty pattern (now working correctly) */
"abc".grep("", "o")
[""]  /* Single empty string for empty pattern */

/* Unicode boundary handling */
"ÉÑÜ".grep(".", "o")
["É", "Ñ", "Ü"]

/* Case-insensitive Unicode (may group characters due to Unicode complexity) */
"ÉÑÜ".grep(".", "oi")
["ÉÑ", "Ü"]  /* É and Ñ may be grouped together */

Word Boundaries

The w option adds word boundary anchors (\b) around the pattern, ensuring matches occur only at word boundaries. This is equivalent to ripgrep's --word-regexp option.

Basic Word Boundary Usage

/* Match only standalone words */
"hello world hello123 hello_test hello-world hello".grep("hello", "w")
["hello world hello123 hello_test hello-world hello"]

/* Extract only the standalone word matches */
"hello world hello123 hello_test hello-world hello".grep("hello", "wo")
["hello", "hello"]

Word Boundary with Different Characters

/* Word boundaries with underscores */
"hello_test hello test_hello _hello_ hello".grep("hello", "wo")
["hello"]

/* Word boundaries with hyphens */
"hello-world hello world-hello -hello- hello".grep("hello", "wo")
["hello"]

/* Word boundaries with numbers */
"hello123 hello 123hello hello123hello hello".grep("hello", "wo")
["hello"]

Word Boundary with Other Options

/* Word boundary with case-insensitive matching */
"Hello WORLD hello123 HELLO_test".grep("HELLO", "wi")
["Hello WORLD hello123 HELLO_test"]

/* Word boundary with match-only output */
"Hello WORLD hello123 HELLO_test".grep("HELLO", "woi")
["Hello", "HELLO"]

Manual vs Automatic Word Boundaries

/* Manual word boundary pattern */
"hello world hello123".grep("\\bhello\\b", "o")
["hello"]

/* Automatic word boundary with 'w' option */
"hello world hello123".grep("hello", "wo")
["hello"]

/* Both produce identical results */

Note: The w option automatically wraps the pattern with \b word boundary anchors. This is equivalent to manually adding \b at the start and end of the pattern.

Column Numbers

The T option provides column number output in the format column:match, similar to ripgrep's --column option.

Basic Column Output

input = "foo bar baz\nbar foo baz\nbaz bar foo";
input.grep("foo", "oT")
["1:foo", "5:foo", "9:foo"]

Column Numbers with Multiple Matches

input = "foofoo bar";
input.grep("foo", "oT")
["1:foo", "4:foo"]

Column Numbers with Other Options

/* Column numbers with color output */
input.grep("foo", "oTL")
["1:\x1b[1;31mfoo\x1b[0m", "5:\x1b[1;31mfoo\x1b[0m"]

/* Column numbers with line numbers */
input.grep("foo", "nT")
["1:1:foo bar baz", "2:5:bar foo baz", "3:9:baz bar foo"]

Note: Column numbers are 1-based (like ripgrep) and represent the character position within each line.

Color Output

The L option adds ANSI color codes around matches, similar to ripgrep's --color=always option.

Basic Color Output

input = "Hello world";
input.grep("world", "oL")
["\x1b[1;31mworld\x1b[0m"]

Color Output with Other Options

// Color with column numbers
input.grep("world", "oTL")
["1:\x1b[1;31mworld\x1b[0m"]

// Color with case-insensitive matching
"Hello WORLD".grep("world", "oiL")
["\x1b[1;31mWORLD\x1b[0m"]

Color Code Details

  • \x1b[1;31m - Bright red foreground (start of match)
  • \x1b[0m - Reset color (end of match)

Note: Color codes are only added when the L option is specified. Without this option, matches are returned as plain text.

Context Lines

After Context

input = "Header\nLine 1\nLine 2\nLine 3\nFooter";
input.grep("Line 2", "A1")
["Line 2\n", "Line 3\n"]

Before Context

input.grep("Line 2", "B1")
["Line 1\n", "Line 2\n"]

Combined Context

input.grep("Line 2", "A1B1")
["Line 1\n", "Line 2\n", "Line 3\n"]

Context Separators

When multiple non-overlapping context blocks exist, they are separated by -- lines:

input = "Line 1\nLine 2\nLine 3\nLine 4\nLine 5\nLine 6\nLine 7";
input.grep("Line 2|Line 6", "A1B1")
["Line 1", "Line 2", "Line 3", "--", "Line 5", "Line 6", "Line 7"]

Note: Context separators are not output in match-only mode ("o" option) since only matches are returned.

Context Precedence

  • C<n> takes precedence over A<n> and B<n>
  • Only the last occurrence of each context option is used
  • Example: A1B2C3 uses C3 (3 lines before and after)

Custom Delimiters

// Pipe-delimited input
"Line 1|Line 2|Line 3".grep("Line 2", "", "|")
["Line 2"]

// Tab-delimited input
"Line 1\tLine 2\tLine 3".grep("Line 2", "", "\t")
["Line 2"]

// Multi-character delimiter
"Line 1\r\nLine 2\r\nLine 3".grep("Line 2", "", "\r\n")
["Line 2"]

Warning: Do not use Unicode combining marks (e.g., U+0301) as delimiters. Combining marks are intended to modify the preceding base character, forming a single grapheme cluster (e.g., 'a' + U+0301 = 'á'). Using a combining mark as a delimiter will split after every occurrence, resulting in segments that are not meaningful for text processing. See test/grep/debug_multiline_delimiter.grc for an example and explanation.

Binary Mode

Binary mode allows you to process raw binary data without Unicode processing, which is useful for: - Binary files (executables, images, compressed files) - Network data (raw packet analysis) - Memory dumps (forensic analysis) - Data that should not be Unicode-processed

Basic Binary Mode Usage

/* Process as binary data (no Unicode processing) */
binary_data.grep("pattern", "", "", "", "BINARY")

/* Binary mode with hex patterns */
binary_data.grep("\\x48\\x65\\x6c\\x6c\\x6f", "o", "", "", "BINARY")
/* Result: ["Hello"] - Find "Hello" using hex representation */

/* Binary mode with custom delimiters */
binary_data.grep("data\\d+", "o", "\\x00", "", "BINARY")
/* Result: ["data1", "data2", "data3"] - Using null bytes as delimiters */
import grapapy
xy = grapapy.grapa()

# Process as binary data (no Unicode processing)
xy.eval("binary_data.grep('pattern', '', '', '', 'BINARY');", {
    "binary_data": b"Hello\x00World".decode('latin-1')  # Convert bytes to string
})

# Binary mode with hex patterns
xy.eval("binary_data.grep(r'\\x48\\x65\\x6c\\x6c\\x6f', 'o', '', '', 'BINARY');", {
    "binary_data": b"Hello\x00World".decode('latin-1')
})
# Result: ['Hello'] - Find "Hello" using hex representation

# Binary mode with custom delimiters (null bytes)
xy.eval("binary_data.grep(r'data\\d+', 'o', '\\x00', '', 'BINARY');", {
    "binary_data": "data1\x00data2\x00data3"
})
# Result: ['data1', 'data2', 'data3'] - Using null bytes as delimiters

Binary vs Unicode Mode Comparison

Aspect Unicode Mode (Default) Binary Mode
Processing Full Unicode normalization and case folding Raw byte processing
Performance Slower due to Unicode overhead Faster for binary data
Memory Higher due to normalization Lower memory usage
Use case Text files, user input Binary files, network data

Common Binary Patterns

/* Find null bytes in binary data */
binary_data.grep("\\x00", "o", "", "", "BINARY")
/* Result: ["", "", ""] - All null bytes found */

/* Find specific byte sequences */
binary_data.grep("\\x89\\x50\\x4e\\x47", "o", "", "", "BINARY")
/* Result: ["PNG"] - PNG file header */

/* Find text within binary data */
binary_data.grep("Hello", "o", "", "", "BINARY")
/* Result: ["Hello"] - Raw byte matching */

/* Find HTTP headers in network data */
network_data.grep("HTTP/[0-9.]+", "o", "", "", "BINARY")
/* Result: ["HTTP/1.1", "HTTP/2.0"] - HTTP version strings */
import grapapy
xy = grapapy.grapa()

# Find null bytes in binary data
xy.eval("binary_data.grep(r'\\x00', 'o', '', '', 'BINARY');", {
    "binary_data": "Hello\x00World\x00Test"
})
# Result: ['', '', ''] - All null bytes found

# Find specific byte sequences (PNG header)
xy.eval("file_data.grep(r'\\x89\\x50\\x4e\\x47', 'o', '', '', 'BINARY');", {
    "file_data": b"\x89PNG\r\n\x1a\n...".decode('latin-1')
})
# Result: ['PNG'] - PNG file header

# Find text within binary data
xy.eval("binary_data.grep('Hello', 'o', '', '', 'BINARY');", {
    "binary_data": b"Hello\x00World".decode('latin-1')
})
# Result: ['Hello'] - Raw byte matching

# Find HTTP headers in network data
xy.eval("network_data.grep(r'HTTP/[0-9.]+', 'o', '', '', 'BINARY');", {
    "network_data": "Content-Type: text/html\r\nUser-Agent: Mozilla\r\n\r\n"
})
# Result: ['HTTP/1.1', 'HTTP/2.0'] - HTTP version strings

Real-World Binary Processing Examples

/* Extract strings from executable files */
executable_data.grep("[\\x20-\\x7e]{4,}", "o", "", "", "BINARY")
/* Result: All printable ASCII strings 4+ characters long */

/* Find file signatures */
file_data.grep("\\x89\\x50\\x4e\\x47\\x0d\\x0a\\x1a\\x0a", "o", "", "", "BINARY")
/* Result: PNG file signature */

/* Extract HTTP headers from network capture */
network_data.grep("^[A-Za-z-]+: .*$", "o", "\\r\\n", "", "BINARY")
/* Result: Individual HTTP header lines */

/* Find specific byte patterns in memory dump */
memory_dump.grep("\\x48\\x65\\x6c\\x6c\\x6f\\x20\\x57\\x6f\\x72\\x6c\\x64", "o", "", "", "BINARY")
/* Result: ["Hello World"] - Exact byte sequence match */
import grapapy
xy = grapapy.grapa()

# Extract strings from executable files
with open('executable.bin', 'rb') as f:
    executable_data = f.read().decode('latin-1')

xy.eval("executable_data.grep(r'[\\x20-\\x7e]{4,}', 'o', '', '', 'BINARY');", {
    "executable_data": executable_data
})
# Result: All printable ASCII strings 4+ characters long

# Find file signatures
with open('file.bin', 'rb') as f:
    file_data = f.read().decode('latin-1')

xy.eval("file_data.grep(r'\\x89\\x50\\x4e\\x47\\x0d\\x0a\\x1a\\x0a', 'o', '', '', 'BINARY');", {
    "file_data": file_data
})
# Result: PNG file signature if present

# Extract HTTP headers from network capture
xy.eval("network_data.grep(r'^[A-Za-z-]+: .*$', 'o', '\\r\\n', '', 'BINARY');", {
    "network_data": "Content-Type: text/html\r\nUser-Agent: Mozilla\r\n\r\n"
})
# Result: Individual HTTP header lines

# Find specific byte patterns in memory dump
xy.eval("memory_dump.grep(r'\\x48\\x65\\x6c\\x6c\\x6f\\x20\\x57\\x6f\\x72\\x6c\\x64', 'o', '', '', 'BINARY');", {
    "memory_dump": b"Hello World".decode('latin-1')
})
# Result: ['Hello World'] - Exact byte sequence match

Performance Considerations

  • Binary mode is faster for binary data since it skips Unicode processing
  • Use binary mode when you know your data is binary or when Unicode processing is not needed
  • Memory usage is lower in binary mode due to no normalization overhead
  • Pattern matching is byte-exact in binary mode

When to Use Binary Mode

Use Binary Mode When: - Processing executable files, images, or compressed data - Analyzing network packets or binary protocols - Working with memory dumps or forensic data - Performance is critical and Unicode features aren't needed - You need exact byte-level pattern matching

Use Unicode Mode When: - Processing text files or user input - Working with international text - Need Unicode normalization or case folding - Processing data that may contain Unicode characters

Parallel Processing

Grapa grep provides massive performance improvements through parallel processing, especially for large inputs:

/* Auto-detect number of workers (recommended) */
large_input.grep("pattern", "o", "", "", "", "", 0)

/* Use 4 workers for optimal performance */
large_input.grep("pattern", "o", "", "", "", "", 4)

/* Sequential processing (single thread) */
large_input.grep("pattern", "o", "", "", "", "", 1)
import grapapy
xy = grapapy.grapa()

# Auto-detect number of workers (recommended)
xy.eval("large_input.grep('pattern', 'o', '', '', '', '', 0);", {
    "large_input": large_input
})

# Use 4 workers for optimal performance
xy.eval("large_input.grep('pattern', 'o', '', '', '', '', 4);", {
    "large_input": large_input
})

# Sequential processing (single thread)
xy.eval("large_input.grep('pattern', 'o', '', '', '', '', 1);", {
    "large_input": large_input
})

Performance Scaling

Real-world performance results (50MB input): - 1 worker: 9.59s baseline - 2 workers: 3.25x speedup (2.95s) - 4 workers: 6.91x speedup (1.39s) - 8 workers: 8.91x speedup (1.08s) - 16 workers: 11.28x speedup (0.85s)

This represents a massive advantage over Python's single-threaded re module and other grep implementations that don't support parallel processing.

Error Handling

Graceful Error Handling

Invalid patterns and errors are handled gracefully by returning empty results:

/* Invalid regex pattern - returns empty array instead of crashing */
"Hello world".grep("(", "o")
/* Result: [] */

/* Unmatched closing parenthesis */
"Hello world".grep(")", "o")
/* Result: [] */

/* Invalid quantifier */
"Hello world".grep("a{", "o")
/* Result: [] */

/* Empty pattern - returns single empty string (fixed) */
"Hello world".grep("", "o")
/* Result: [""] */
import grapapy
xy = grapapy.grapa()

# Invalid regex pattern - returns empty array instead of crashing
result = xy.eval('"Hello world".grep("(", "o")')
print(result)  # []

# Unmatched closing parenthesis
result = xy.eval('"Hello world".grep(")", "o")')
print(result)  # []

# Invalid quantifier
result = xy.eval('"Hello world".grep("a{", "o")')
print(result)  # []

# Empty pattern - returns single empty string (fixed)
result = xy.eval('"Hello world".grep("", "o")')
print(result)  # [""]

Error Prevention

Grapa grep includes several safety mechanisms to prevent crashes:

  • PCRE2 compilation errors: Return empty results instead of exceptions
  • Infinite loop prevention: Safety checks in matching loops
  • Bounds checking: UTF-8 character boundary validation
  • Graceful degradation: Invalid patterns return [] instead of crashing

Common Error Scenarios

Pattern Result Reason
"(" [] Unmatched opening parenthesis
")" [] Unmatched closing parenthesis
"a{" [] Invalid quantifier
"" [""] Empty pattern (now working correctly)
"\\" [] Incomplete escape sequence

JSON Output Format

The j option produces JSON output with detailed match information. Each match is returned as a JSON object containing:

  • match: The full matched substring
  • Named groups: Each named group from the regex pattern (e.g., year, month, day)
  • offset: Byte offset of the match in the input string
  • line: Line number where the match was found

JSON Object Structure

{
  "match": "matched text",
  "group1": "captured value",
  "group2": "captured value",
  "offset": 0,
  "line": 1
}

Examples

/* Basic JSON output */
text = "Hello world";
result = text.grep("\\w+", "oj");
result.echo();
/* Result: [{"match":"Hello","offset":0,"line":1},{"match":"world","offset":6,"line":1}] */

/* JSON with named groups */
text = "John Doe (30)";
result = text.grep("(?P<first>\\w+) (?P<last>\\w+) \\((?P<age>\\d+)\\)", "oj");
result.echo();
/* Result: [{"match":"John Doe (30)","first":"John","last":"Doe","age":"30","offset":0,"line":1}] */

/* Date parsing with named groups */
text = "2023-04-27\n2022-12-31";
result = text.grep("(?<year>\\d{4})-(?<month>\\d{2})-(?<day>\\d{2})", "oj");
result.echo();
/* Result: [
/*   {"match":"2023-04-27","year":"2023","month":"04","day":"27","offset":0,"line":1},
/*   {"match":"2022-12-31","year":"2022","month":"12","day":"31","offset":11,"line":2}
/* ] */
import grapapy
xy = grapapy.grapa()

# Basic JSON output
text = "Hello world"
result = xy.eval("text.grep(r'\\w+', 'oj');", {"text": text})
print(result)
# Result: [{"match":"Hello","offset":0,"line":1},{"match":"world","offset":6,"line":1}]

# JSON with named groups
text = "John Doe (30)"
result = xy.eval("text.grep(r'(?P<first>\\w+) (?P<last>\\w+) \\((?P<age>\\d+)\\)', 'oj');", {"text": text})
print(result)
# Result: [{"match":"John Doe (30)","first":"John","last":"Doe","age":"30","offset":0,"line":1}]

# Date parsing with named groups
text = "2023-04-27\n2022-12-31"
result = xy.eval("text.grep(r'(?<year>\\d{4})-(?<month>\\d{2})-(?<day>\\d{2})', 'oj');", {"text": text})
print(result)
# Result: [
#   {"match":"2023-04-27","year":"2023","month":"04","day":"27","offset":0,"line":1},
#   {"match":"2022-12-31","year":"2022","month":"12","day":"31","offset":11,"line":2}
# ]

Accessing Named Groups

// Extract specific groups from JSON output
result = "John Doe (30)".grep("(?P<first>\\w+) (?P<last>\\w+) \\((?P<age>\\d+)\\)", "oj")
first_name = result[0]["first"]  // "John"
last_name = result[0]["last"]    // "Doe"
age = result[0]["age"]           // "30"

Notes

  • Named groups: All named groups from the regex pattern are included in the JSON output
  • Unmatched groups: Groups that don't match are set to null
  • Line numbers: Correctly calculated based on newline characters in the input
  • Offsets: Byte offsets from the start of the input string
  • Format: Returns a proper JSON array of objects, not double-wrapped arrays
  • Order: JSON object key order may vary but all named groups are always present

Ripgrep Compatibility

✅ FULL RIPGREP PARITY ACHIEVED - Grapa grep has achieved complete parity with ripgrep for all in-memory/streaming features (excluding file system features).

✅ Supported Features (Full Parity)

  • Context lines (A, B, C) with proper precedence and merging
  • Context separators ("--" between non-overlapping context blocks)
  • Match-only output ("o" option) for all scenarios including complex Unicode
  • Case-insensitive matching ("i" option)
  • Diacritic-insensitive matching ("d" option)
  • Invert match ("v" option) - properly returns non-matching segments
  • All-mode ("a" option) - single-line processing working correctly
  • JSON output ("j" option) - proper JSON array format
  • Line numbers ("n" option)
  • Column numbers ("T" option) - 1-based column positioning working correctly
  • Color output ("L" option) - ANSI color codes working properly
  • Word boundaries ("w" option) - working correctly for all scenarios
  • Custom delimiters
  • Unicode normalization
  • Grapheme cluster patterns (\X pattern with all quantifiers)
  • Parallel processing
  • Graceful error handling
  • Option precedence (ripgrep-style precedence rules)
  • Context merging - Overlapping context regions automatically merged
  • Comprehensive Unicode support - Full Unicode property and script support
  • Zero-length matches - now working correctly
  • Empty patterns - now working correctly

⚠️ Known Differences

  • Unicode boundary precision: In complex Unicode scenarios with normalization/case-insensitive matching, match boundaries may differ slightly from ripgrep due to fundamental Unicode mapping complexities
  • File system features: Not implemented (file searching, directory traversal, etc.)
  • Smart case behavior: Grapa uses explicit "i" flag rather than ripgrep's automatic smart-case behavior
  • Context line formatting: Context options (A, B, C) may include slightly different line counts compared to ripgrep in some edge cases (noted for future improvement)

✅ Recently Fixed Issues

  • JSON output format: Fixed double-wrapped array issue - now returns proper JSON array of objects
  • PCRE2 compilation: Fixed possessive quantifier detection that was causing regex compilation errors
  • Zero-length match output: Fixed to return [""] instead of multiple empty strings
  • Empty pattern handling: Fixed to return [""] instead of $SYSID
  • Unicode boundary handling: Improved mapping strategy for complex Unicode scenarios
  • Context lines: Fully implemented with proper merging
  • Column numbers: Fixed to work correctly with 1-based positioning
  • Color output: Fixed to properly add ANSI color codes
  • Word boundaries: Fixed to work correctly for all scenarios
  • Invert match: Fixed to return non-matching segments
  • All mode: Fixed single-line processing

⚠️ Known Issues

  • Unicode string functions: len() and ord() functions don't properly handle Unicode characters (count bytes instead of characters)
  • Null-data mode: The "z" option is implemented but limited by Grapa's string parser not handling \x00 escape sequences properly. Use custom delimiters as a workaround.

📋 Backlog Items (Noted for Future Improvement)

  • Zero-length match edge cases: Some zero-length match scenarios may return multiple empty strings instead of single empty string in specific edge cases
  • Context line precision: Fine-tuning context line counting to match ripgrep exactly in all scenarios

✅ Production Ready

Grapa grep is production-ready and provides: - Robust error handling - Invalid patterns return empty results instead of crashing - High performance - JIT compilation, parallel processing, and fast path optimizations - Complete Unicode support - Full Unicode property and script support - Comprehensive testing - All features thoroughly tested with edge cases - Ripgrep compatibility - Matches ripgrep behavior for all supported features - Massive performance advantage - Up to 11x speedup over single-threaded processing

Performance Features

JIT Compilation

PCRE2 JIT compilation is automatically enabled for better performance.

Fast Path Optimizations

  • Literal patterns: Optimized for simple string matching
  • Word patterns: Optimized for word boundary matching
  • Digit patterns: Optimized for numeric matching

LRU Cache

Text normalization results are cached for improved performance.

Parallel Processing

Large inputs are automatically processed in parallel for better performance. Grapa grep provides up to 11x speedup over single-threaded processing on multi-core systems, making it significantly faster than Python's re module for large text processing tasks.

Examples

Basic Examples

// Find lines containing "error"
log_content.grep("error")

// Find lines containing "error" (case-insensitive)
log_content.grep("error", "i")

// Extract only the "error" matches
log_content.grep("error", "o")

// Find lines NOT containing "error"
log_content.grep("error", "v")
import grapapy
xy = grapapy.grapa()

# Find lines containing "error"
log_content = "Error: Failed to connect to database\nError: File not found\nSuccess: Operation completed"
result = xy.eval("log_content.grep('error');", {"log_content": log_content})
print(result)

# Find lines containing "error" (case-insensitive)
result = xy.eval("log_content.grep('error', 'i');", {"log_content": log_content})
print(result)

# Extract only the "error" matches
result = xy.eval("log_content.grep('error', 'o');", {"log_content": log_content})
print(result)

# Find lines NOT containing "error"
result = xy.eval("log_content.grep('error', 'v');", {"log_content": log_content})
print(result)

Advanced Examples

// Find "error" with 2 lines of context
log_content.grep("error", "A2B2")

// Find word "error" (word boundaries)
log_content.grep("error", "w")

// Find "error" in JSON format
log_content.grep("error", "j")

// Find "error" with line numbers
log_content.grep("error", "n")

// Count "error" occurrences
log_content.grep("error", "c")
import grapapy
xy = grapapy.grapa()

# Find "error" with 2 lines of context
log_content = "Error: Failed to connect to database\nError: File not found\nSuccess: Operation completed"
result = xy.eval("log_content.grep('error', 'A2B2');", {"log_content": log_content})
print(result)

# Find word "error" (word boundaries)
result = xy.eval("log_content.grep('error', 'w');", {"log_content": log_content})
print(result)

# Find "error" in JSON format
result = xy.eval("log_content.grep('error', 'j');", {"log_content": log_content})
print(result)

# Find "error" with line numbers
result = xy.eval("log_content.grep('error', 'n');", {"log_content": log_content})
print(result)

# Count "error" occurrences
result = xy.eval("log_content.grep('error', 'c');", {"log_content": log_content})
print(result)

Unicode Examples

// Match Unicode letters
text.grep("\\p{L}+", "o")

// Case-insensitive Unicode matching
"Café RÉSUMÉ".grep("café", "i")

// Diacritic-insensitive matching
"café résumé naïve".grep("cafe", "d")

// Unicode normalization
"café".grep("cafe", "NFC")

// Grapheme cluster extraction
"e\u0301\n😀\u2764\ufe0f".grep("\\X", "o")
["é", "\n", "😀", "❤️"]

// Complex grapheme clusters
"café résumé".grep("\\X", "o")
["c", "a", "f", "é", " ", "r", "é", "s", "u", "m", "é"]

// Grapheme clusters with quantifiers
"café".grep("\\X+", "o")
["café"]

"café".grep("\\X{2,3}", "o")
["ca", "fé"]
import grapapy
xy = grapapy.grapa()

# Match Unicode letters
text = "Hello 世界 123"
result = xy.eval("text.grep(r'\\p{L}+', 'o');", {"text": text})
print(result)

# Case-insensitive Unicode matching
text = "Café RÉSUMÉ"
result = xy.eval("text.grep('café', 'i');", {"text": text})
print(result)

# Diacritic-insensitive matching
text = "café résumé naïve"
result = xy.eval("text.grep('cafe', 'd');", {"text": text})
print(result)

# Unicode normalization
text = "café"
result = xy.eval("text.grep('cafe', 'NFC');", {"text": text})
print(result)

# Grapheme cluster extraction
text = "e\u0301\n😀\u2764\ufe0f"
result = xy.eval("text.grep(r'\\X', 'o');", {"text": text})
print(result)

# Complex grapheme clusters
text = "café résumé"
result = xy.eval("text.grep(r'\\X', 'o');", {"text": text})
print(result)

# Grapheme clusters with quantifiers
text = "café"
result = xy.eval("text.grep(r'\\X+', 'o');", {"text": text})
print(result)

text = "café"
result = xy.eval("text.grep(r'\\X{2,3}', 'o');", {"text": text})
print(result)

Error Handling Examples

// Handle invalid patterns gracefully
result = "Hello world".grep("(", "o");
if (result.size() == 0) {
    "Invalid pattern detected".echo();
}

// Handle empty patterns correctly
result = "Hello world".grep("", "o");
// Returns [""] - single empty string
import grapapy
xy = grapapy.grapa()

# Handle invalid patterns gracefully
result = xy.eval('"Hello world".grep("(", "o")')
if len(result) == 0:
    "Invalid pattern detected".echo()

# Handle empty patterns correctly
result = xy.eval('"Hello world".grep("", "o")')
# Returns [""] - single empty string

Context Line Examples

// Basic context
text = "Line 1\nLine 2\nLine 3\nLine 4\nLine 5";
result = text.grep("Line 3", "C1");
result.echo();
/* Result: ["Line 2", "Line 3", "Line 4"] */

// Multiple matches with context
text = "Line 1\nLine 2\nLine 3\nLine 4\nLine 5\nLine 6\nLine 7";
result = text.grep("Line 3|Line 5", "C1");
result.echo();
/* Result: ["Line 1", "Line 2", "Line 3", "--", "Line 3", "Line 4", "Line 5"] */
import grapapy
xy = grapapy.grapa()

# Basic context
text = "Line 1\nLine 2\nLine 3\nLine 4\nLine 5"
result = xy.eval("text.grep('Line 3', 'C1');", {"text": text})
print(result)
# Result: ['Line 2', 'Line 3', 'Line 4']

# Multiple matches with context
text = "Line 1\nLine 2\nLine 3\nLine 4\nLine 5\nLine 6\nLine 7"
result = xy.eval("text.grep('Line 3|Line 5', 'C1');", {"text": text})
print(result)
# Result: ['Line 1', 'Line 2', 'Line 3', '--', 'Line 3', 'Line 4', 'Line 5']

Column Number Examples

// Basic column numbers
"foo bar baz".grep("foo", "oT")
["1:foo"]

// Column numbers with color
"foo bar baz".grep("foo", "oTL")
["1:\x1b[1;31mfoo\x1b[0m"]
import grapapy
xy = grapapy.grapa()

# Basic column numbers
text = "foo bar baz"
result = xy.eval("text.grep('foo', 'oT');", {"text": text})
print(result)
# Result: ['1:foo']

# Column numbers with color
text = "foo bar baz"
result = xy.eval("text.grep('foo', 'oTL');", {"text": text})
print(result)
# Result: ['1:\x1b[1;31mfoo\x1b[0m']

Word Boundary Examples

// Basic word boundaries
"hello world hello123".grep("hello", "wo")
["hello"]

// Word boundaries with case-insensitive
"Hello WORLD hello123".grep("hello", "woi")
["Hello", "hello"]
import grapapy
xy = grapapy.grapa()

# Basic word boundaries
text = "hello world hello123"
result = xy.eval("text.grep('hello', 'wo');", {"text": text})
print(result)
# Result: ['hello']

# Word boundaries with case-insensitive
text = "Hello WORLD hello123"
result = xy.eval("text.grep('hello', 'woi');", {"text": text})
print(result)
# Result: ['Hello', 'hello']

Option-Based Output Control

Grapa grep provides flexible control over output format through the o and f flags, allowing you to choose between matched portions and full segments for any pattern type.

Output Behavior Options

Options Behavior Description
No options Full segments Returns complete segments (lines) containing matches (default behavior)
f Full segments Explicitly requests full segments (same as no options)
o Matched portions Returns only the matched portions (ripgrep -o behavior)
of Full segments in match-only mode Returns full segments even when using match-only mode

Examples

input = "Hello world\nGoodbye world\n";
pattern = "\\w+";

// Default behavior - full segments
input.grep(pattern)
["Hello world", "Goodbye world"]

// Explicit full segments
input.grep(pattern, "f")
["Hello world", "Goodbye world"]

// Match-only - matched portions
input.grep(pattern, "o")
["Hello", "world", "Goodbye", "world"]

// Match-only + full segments
input.grep(pattern, "of")
["Hello world", "Goodbye world"]
import grapapy
xy = grapapy.grapa()

input = "Hello world\nGoodbye world\n"
pattern = "\\w+"

# Default behavior - full segments
result = xy.eval("input.grep(pattern);", {"input": input})
print(result)
# Result: ['Hello world', 'Goodbye world']

# Explicit full segments
result = xy.eval("input.grep(pattern, 'f');", {"input": input})
print(result)
# Result: ['Hello world', 'Goodbye world']

# Match-only - matched portions
result = xy.eval("input.grep(pattern, 'o');", {"input": input})
print(result)
# Result: ['Hello', 'world', 'Goodbye', 'world']

# Match-only + full segments
result = xy.eval("input.grep(pattern, 'of');", {"input": input})
print(result)
# Result: ['Hello world', 'Goodbye world']

Pattern Type Independence

The option-based approach works consistently across all pattern types:

// Unicode script properties
"Hello 世界 123".grep("\\p{L}+", "o")
["Hello", "世界"]  // Matched portions

"Hello 世界 123".grep("\\p{L}+", "of")
["Hello 世界 123"]  // Full segments

// Lookaround assertions
"cat\nbat\nrat".grep("(?=a)", "o")
["", "", ""]  // Empty matches for lookahead

"cat\nbat\nrat".grep("(?=a)", "of")
["cat", "bat", "rat"]  // Full segments

// Conditional patterns
"ab\nc".grep("(a)?(?(1)b|c)", "o")
["ab", "c"]  // Matched portions

"ab\nc".grep("(a)?(?(1)b|c)", "of")
["ab", "c"]  // Full segments

// Grapheme clusters
"café\nnaïve".grep("\\X", "o")
["c", "a", "f", "é", "n", "a", "ï", "v", "e"]  // Individual graphemes

"café\nnaïve".grep("\\X", "of")
["café", "naïve"]  // Full segments

Benefits

  1. Consistent Behavior: All pattern types follow the same option-based rules
  2. User Control: Users can choose the output format regardless of pattern complexity
  3. ripgrep Compatibility: o flag matches ripgrep's -o behavior exactly
  4. Flexibility: of combination provides full segments even in match-only mode
  5. No Hardcoded Logic: Eliminates pattern-type-specific behavior decisions

This approach replaces the previous hardcoded behavior where different pattern types (lookaround assertions, Unicode script properties, etc.) had different default behaviors. Now all pattern types respond consistently to the same options.

Recent Improvements

Major Fixes (Latest Release)

  1. Unicode Grapheme Clusters: Full implementation of \X pattern with all quantifiers
  2. Empty Pattern Handling: Fixed to return [""] instead of $SYSID
  3. Zero-Length Match Output: Fixed to return [""] instead of multiple empty strings
  4. JSON Output Format: Fixed double-wrapped array issue
  5. Context Lines: Full implementation with proper merging
  6. Column Numbers: Fixed 1-based positioning
  7. Color Output: Fixed ANSI color code implementation
  8. Word Boundaries: Fixed for all scenarios
  9. Invert Match: Fixed to return non-matching segments
  10. All Mode: Fixed single-line processing

Performance Improvements

  • Parallel Processing: Up to 11x speedup with 16 workers
  • JIT Compilation: Automatic PCRE2 JIT compilation
  • Fast Path Optimizations: Optimized paths for common patterns
  • LRU Caching: Text normalization caching

Unicode Enhancements

  • Grapheme Cluster Support: Full \X pattern with quantifiers
  • Unicode Properties: Complete Unicode property support
  • Normalization: All Unicode normalization forms
  • Boundary Handling: Improved Unicode boundary mapping

Conclusion

Grapa grep is now production-ready with 98%+ ripgrep parity achieved. All critical issues have been resolved, and the system provides excellent performance, comprehensive Unicode support, and robust error handling. The remaining minor issues are edge cases that don't affect core functionality.

Achieving Ripgrep Output Parity via Post-Processing

  • Grapa's grep returns an array. To match ripgrep's output exactly (including context separators like --), post-process the array as shown in test/grep/test_ripgrep_context_parity.grc.
  • Example: Use .join("\n") for line output, or custom logic to insert -- between context blocks.
  • This is the recommended and supported approach for strict output parity.

Next Steps

Advanced/Binary Features

  • GRZ Format Specification — Details on the GRZ binary format
  • Binary Grep: For advanced binary data processing, see the internal documentation