Unicode Grep Documentation
See Also: Python Integration Guide for Python grep examples
Overview
The Unicode grep functionality in Grapa provides advanced text searching capabilities with full Unicode support, PCRE2-powered regular expressions, and comprehensive output options. It's designed to handle international text, emoji, and complex Unicode properties while maintaining high performance.
Output Formatting and Array Design
Why Arrays Instead of Strings?
Grapa grep is designed as an integrated programming language feature, not a standalone console tool. This fundamental difference explains the output format:
Grapa Grep (Integrated Language): - Returns arrays of strings for programmatic use - Removes delimiters from output strings (clean data for processing) - Designed for scripting and data manipulation - Example: ["line1", "line2", "line3"]
(no \n
in strings)
ripgrep/GNU grep (Console Tools): - Outputs single string with embedded delimiters - Preserves delimiters in output for console display - Designed for command-line text processing - Example: "line1\nline2\nline3\n"
(with \n
in string)
Delimiter Removal Behavior
Grapa grep automatically removes delimiters from output strings:
/* Input with custom delimiter */
input = "line1|||line2|||line3";
/* Grapa grep removes delimiters from output */
result = input.grep("line", "o", "|||");
/* Result: ["line1", "line2", "line3"] (clean strings, no |||) */
/* For console output, you can join with delimiter */
console_output = result.join("|||");
/* Result: "line1|||line2|||line3" */
Console Output Equivalence
To get console-equivalent output in Grapa:
/* Grapa approach */
input = "line1\nline2\nline3";
result = input.grep("line", "o"); /* ["line1", "line2", "line3"] */
console_output = result.join("\n"); /* "line1\nline2\nline3" */
/* This matches ripgrep output: "line1\nline2\nline3" */
Benefits of Array Design
- Programmatic Use: Arrays are easier to process in scripts
- Clean Data: No delimiter artifacts in output strings
- Flexible Output: Can join with any delimiter for different formats
- Language Integration: Natural fit with Grapa's array-based design
- Python Integration: Arrays map naturally to Python lists
Custom Delimiter Support
Grapa grep fully supports multi-character delimiters:
/* Single character delimiter */
"line1|line2|line3".grep("line", "o", "|")
/* Result: ["line1", "line2", "line3"] */
/* Multi-character delimiter */
"line1|||line2|||line3".grep("line", "o", "|||")
/* Result: ["line1", "line2", "line3"] */
/* Complex delimiter */
"line1<DELIM>line2<DELIM>line3".grep("line", "o", "<DELIM>")
/* Result: ["line1", "line2", "line3"] */
Note: All delimiters are automatically removed from output strings, regardless of length or complexity.
Key Features
π Unicode Support
- Full Unicode character handling (Cyrillic, Chinese, Japanese, Korean, Arabic, Hebrew, Thai, etc.)
- Unicode normalization (NFC, NFD, NFKC, NFKD)
- Advanced Unicode properties (
\p{L}
,\p{N}
,\p{Emoji}
,\p{So}
, etc.) - Unicode grapheme clusters (
\X
) - Case-insensitive matching with proper Unicode case folding
π― Advanced Regex Features
- PCRE2-powered regular expressions
- Named groups (
(?P<name>...)
) - Unicode properties and script extensions
- Atomic groups (
(?>...)
) - Lookaround assertions (
(?=...)
,(?<=...)
,(?!...)
,(?<!...)
) - Possessive quantifiers (
*+
,++
,?+
,{n,m}+
) - Conditional patterns (
?(condition)...
) - Unicode categories (
\p{L}
,\p{N}
,\p{Z}
,\p{P}
,\p{S}
,\p{C}
,\p{M}
) - Unicode scripts (
\p{sc=Latin}
,\p{sc=Han}
, etc.) - Unicode script extensions (
\p{scx:Han}
, etc.) - Unicode general categories (
\p{Lu}
,\p{Ll}
, etc.)
π Output Formats
- Standard text output
- JSON output with named groups, offsets, and line numbers
- Context lines (before/after matches)
- Line numbers and byte offsets
- Match-only or full-line output
Syntax
string.grep(pattern, options, delimiter, normalization, mode, num_workers)
Parameters
Required Parameters: - string: The input text to search - pattern: PCRE2 regular expression pattern with Unicode support
Optional Parameters (all have sensible defaults): - options: String containing option flags (default: ""
- no options) - delimiter: Custom line delimiter (default: "\n"
) - normalization: Unicode normalization form: "NONE"
, "NFC"
, "NFD"
, "NFKC"
, "NFKD"
(default: "NONE"
) - mode: Processing mode: "UNICODE"
for full Unicode processing, "BINARY"
for raw byte processing (default: "UNICODE"
) - num_workers: Number of worker threads for parallel processing: 0
for auto-detection, 1
for sequential, 2+
for parallel (default: 0
- auto-detection)
Simple Usage Examples
/* Minimal usage - only required parameters */
"Hello world".grep("world");
/* Result: ["Hello world"] */
/* With options */
"Hello world".grep("world", "i");
/* Result: ["Hello world"] */
/* With parallel processing (auto-detection) */
"Hello world".grep("world", "i", "", "", "", 0);
/* Result: ["Hello world"] - Uses optimal number of threads */
/* Manual parallel processing */
"Hello world".grep("world", "i", "", "", "", 4);
/* Result: ["Hello world"] - Uses 4 worker threads */
/* All parameters (rarely needed) */
"Hello world".grep("world", "i", "\n", "NONE", "UNICODE", 2);
/* Result: ["Hello world"] */
Options Reference
Basic Options
Option | Description | Example |
---|---|---|
a | All-mode (match across full input string, context options ignored) | "text".grep("pattern", "a") |
i | Case-insensitive matching | "Text".grep("text", "i") |
v | Invert match (return lines that do NOT match the pattern) | "text".grep("pattern", "v") |
x | Exact line match (whole line must match) | "text".grep("^text$", "x") |
N | Normalize input and pattern to NFC | "cafΓ©".grep("cafe", "N") |
d | Diacritic-insensitive matching (strip accents/diacritics from both input and pattern, robust Unicode-aware) | "cafΓ©".grep("cafe", "d") |
Diacritic-Insensitive Matching (d
option)
The d
option enables diacritic-insensitive matching. When enabled, both the input and the pattern are: 1. Unicode normalized (NFC by default, or as specified) 2. Case folded (Unicode-aware, not just ASCII) 3. Diacritics/accents are stripped (works for Latin, Greek, Cyrillic, Turkish, Vietnamese, and more)
This allows matches like: - "cafΓ©".grep("cafe", "d")
β ["cafΓ©"]
- "CAFΓ".grep("cafe", "di")
β ["CAFΓ"]
- "maΓ±ana".grep("manana", "d")
β ["maΓ±ana"]
- "Δ°stanbul".grep("istanbul", "di")
β ["Δ°stanbul"]
- "ΞΊΞ±ΟΞΟ".grep("ΞΊΞ±ΟΞ΅Ο", "d")
β ["ΞΊΞ±ΟΞΟ"]
- "ΠΊΠΎΡΠ΅".grep("ΠΊΠΎΡΠ΅", "di")
β ["ΠΊΠΎΡΠ΅"]
Special Capabilities
- Handles both precomposed (NFC) and decomposed (NFD) Unicode forms
- Supports diacritic-insensitive matching for Latin, Greek, Cyrillic, Turkish, Vietnamese, and more
- Works with case-insensitive (
i
) and normalization (N
, or normalization parameter) options - Robust for international text, including combining marks
Limitations
- Only covers scripts and diacritics explicitly mapped (Latin, Greek, Cyrillic, Turkish, Vietnamese, etc.)
- Does not transliterate between scripts (e.g., Greek to Latin)
- Does not remove all possible Unicode marks outside supported ranges (e.g., rare/archaic scripts)
- For full Unicode normalization, use with the normalization parameter (e.g.,
"NFC"
,"NFD"
) - Does not perform locale-specific collation (e.g., German Γ vs ss)
Example
input = "cafΓ©\nCAFΓ\ncafe\u0301\nCafe\nCAFΓ\nmaΓ±ana\nmanΜana\nΔ°stanbul\nistanbul\nISTANBUL\nstraΓe\nSTRASSE\nStraΓe\nΠΊΠΎΡΠ΅\nΠΠΎΡΠ΅\nΞΊΞ±ΟΞΟ\nΞΞ±ΟΞΟ\n";
result = input.grep(r"cafe", "di");
/* Result: ["cafΓ©", "CAFΓ", "cafeΜ", "Cafe", "CAFΓ"] */
Output Options
Option | Description | Example |
---|---|---|
o | Match-only (output only matched text) | "Hello world".grep("\\w+", "o") |
n | Prefix matches with line numbers | "text".grep("pattern", "n") |
l | Line number only output | "text".grep("pattern", "l") |
b | Output byte offset with matches | "text".grep("pattern", "b") |
j | JSON output format with named groups, offsets, and line numbers | "text".grep("pattern", "oj") |
Context Options
Option | Description | Example |
---|---|---|
A<n> | Show n lines after match | "text".grep("pattern", "A2") |
B<n> | Show n lines before match | "text".grep("pattern", "B1") |
C<n> | Show n lines before and after | "text".grep("pattern", "C3") |
A<n>B<m> | Show n lines after and m lines before | "text".grep("pattern", "A2B1") |
B<m>C<n> | Show m lines before and n lines before/after | "text".grep("pattern", "B1C2") |
Note: Context options can be combined flexibly. For example, "A2B1C3"
would show 2 lines after, 1 line before, and 3 lines before/after the match. Overlapping context lines are allowed (like ripgrep behavior) to ensure all relevant context is shown.
Processing Options
Option | Description | Example |
---|---|---|
c | Count of matches | "text".grep("pattern", "c") |
d | Deduplicate results | "text".grep("pattern", "d") |
g | Group results per line | "text".grep("pattern", "g") |
Important: Count-Only Behavior - The count-only option (c
) returns the count as a single item in an array, not as a number - Example: "Hello world\nGoodbye world".grep("Hello", "c")
returns ["2"]
not 2
- To get the count as a number: "Hello world\nGoodbye world".grep("Hello", "c")[0].int()
- This design maintains consistency with Grapa's array-based return values
Additional Parameters
Unicode Normalization
The normalization
parameter controls Unicode normalization:
Value | Description | Use Case |
---|---|---|
"NONE" | No normalization (default) | Standard text processing |
"NFC" | Normalization Form Canonical Composition | Most common for text storage |
"NFD" | Normalization Form Canonical Decomposition | Unicode analysis |
"NFKC" | Normalization Form Compatibility Composition | Search and matching |
"NFKD" | Normalization Form Compatibility Decomposition | Compatibility processing |
Note: Unicode normalization (N, or normalization parameter) does not remove diacritics or accents. It only canonicalizes Unicode forms. To match characters with and without accents (e.g.,
cafe
vscafΓ©
), you must use thed
option for diacritic-insensitive matching.
Processing Mode
The mode
parameter controls how the input is processed:
Value | Description | Use Case |
---|---|---|
"UNICODE" | Full Unicode processing (default) | Text files, user input |
"BINARY" | Raw byte processing | Binary files, network data |
Examples
Basic Usage
/* Simple pattern matching */
"Hello world".grep("world");
/* Result: ["Hello world"] */
/* Case-insensitive matching */
"Hello WORLD".grep("world", "i");
/* Result: ["Hello WORLD"] */
/* Match-only output */
"Hello world".grep("\\w+", "o");
/* Result: ["Hello", "world"] */
/* Raw string literals for better readability */
"Hello world".grep(r"\w+", "o");
/* Result: ["Hello", "world"] - No need to escape backslashes */
/* Complex patterns with raw strings */
"file.txt".grep(r"^[a-zA-Z0-9_]+\.txt$", "x");
/* Result: ["file.txt"] - Much more readable than "\\^[a-zA-Z0-9_]\\+\\.txt\\$" */
/* Raw strings preserve literal escape sequences */
"\\x45".grep(r"\x45", "o");
/* Result: ["\\x45"] - Literal string, not character "E" */
Unicode Examples
/* Unicode characters */
"ΠΡΠΈΠ²Π΅Ρ ΠΌΠΈΡ".grep("ΠΌΠΈΡ");
/* Result: ["ΠΡΠΈΠ²Π΅Ρ ΠΌΠΈΡ"] */
/* Unicode properties */
"Hello δΈη 123 β¬".grep("\\p{L}+", "o");
/* Result: ["Hello", "δΈη"] */
/* Emoji handling */
"Hello π world π".grep("(?:\\p{So}(?:\\u200D\\p{So})*)+", "o");
/* Result: ["π", "π"] */
/* Emoji sequence (family) */
"Family: π¨βπ©βπ§βπ¦".grep("(?:\\p{So}(?:\\u200D\\p{So})*)+", "o");
/* Result: ["π¨βπ©βπ§βπ¦"] */
/* Unicode grapheme clusters */
"Hello π world π".grep("\\X", "o");
/* Result: ["H", "e", "l", "l", "o", " ", "π", " ", "w", "o", "r", "l", "d", " ", "π"] */
/* Emoji sequences as grapheme clusters */
"π¨βπ©βπ§βπ¦".grep("\\X", "o");
/* Result: ["π¨βπ©βπ§βπ¦"] (entire family emoji as one grapheme cluster) */
/* Combining characters as grapheme clusters */
"cafΓ© maΓ±ana".grep("\\X", "o");
/* Result: ["c", "a", "f", "Γ©", " ", "m", "a", "Γ±", "a", "n", "a"] (Γ© and Γ± as single grapheme clusters) */
/* Unicode normalization */
"cafΓ©".grep("cafe", "N");
/* Result: ["cafΓ©"] */
Raw String Literals
For better readability of regex patterns, you can use raw string literals by prefixing the string with r
. This prevents escape sequence processing, making patterns much more readable:
/* Without raw string (requires escaping) */
"file.txt".grep("^[a-zA-Z0-9_]+\\.txt$", "x")
/* Result: ["file.txt"] */
/* With raw string (no escaping needed) */
"file.txt".grep(r"^[a-zA-Z0-9_]+\.txt$", "x")
/* Result: ["file.txt"] - Much cleaner! */
/* Complex patterns benefit greatly */
"user@domain.com".grep(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$", "x")
/* Result: ["user@domain.com"] */
/* Named groups with raw strings */
"John Doe (30)".grep(r"(?P<first>\\w+) (?P<last>\\w+) \((?P<age>\\d+)\)", "oj")
/* Result: [{"match":"John Doe (30)","first":"John","last":"Doe","age":"30","offset":0,"line":1}] */
Note: Raw strings suppress all escape sequences except for escaping the quote character used to enclose the string. This means \x45
becomes the literal string "\x45"
rather than the character "E"
. If you need hex or Unicode escapes to be processed, use regular string literals.
JSON Output Format
The j
option produces JSON output with detailed match information. Each match is returned as a JSON object containing:
match
: The full matched substring- Named groups: Each named group from the regex pattern (e.g.,
year
,month
,day
) offset
: Byte offset of the match in the input stringline
: Line number where the match was found
JSON Object Structure
{
"match": "matched text",
"group1": "captured value",
"group2": "captured value",
"offset": 0,
"line": 1
}
Examples
/* Basic JSON output */
"Hello world".grep("\\w+", "oj")
/* Result: [{"match":"Hello","offset":0,"line":1},{"match":"world","offset":6,"line":1}] */
/* JSON with named groups */
"John Doe (30)".grep("(?P<first>\\w+) (?P<last>\\w+) \\((?P<age>\\d+)\\)", "oj")
/* Result: [{"match":"John Doe (30)","first":"John","last":"Doe","age":"30","offset":0,"line":1}] */
/* Date parsing with named groups */
"2023-04-27\n2022-12-31".grep("(?<year>\\d{4})-(?<month>\\d{2})-(?<day>\\d{2})", "oj")
/* Result: [
{"match":"2023-04-27","year":"2023","month":"04","day":"27","offset":0,"line":1},
{"match":"2022-12-31","year":"2022","month":"12","day":"31","offset":11,"line":2}
] */
/* Complex JSON example with multiple patterns */
"Email: user@domain.com, Phone: +1-555-1234".grep("(?P<email>[\\w.-]+@[\\w.-]+)|(?P<phone>\\+\\d{1,3}-\\d{3}-\\d{4})", "oj")
/* Result: [
{"match":"user@domain.com","email":"user@domain.com","phone":null,"offset":7,"line":1},
{"match":"+1-555-1234","email":null,"phone":"+1-555-1234","offset":31,"line":1}
] */
Accessing Named Groups
/* Extract specific groups from JSON output */
result = "John Doe (30)".grep("(?P<first>\\w+) (?P<last>\\w+) \\((?P<age>\\d+)\\)", "oj")
first_name = result[0]["first"] /* "John" */
last_name = result[0]["last"] /* "Doe" */
age = result[0]["age"] /* "30" */
Notes
- Named groups: All named groups from the regex pattern are included in the JSON output
- Unmatched groups: Groups that don't match are set to
null
- Line numbers: Correctly calculated based on newline characters in the input
- Offsets: Byte offsets from the start of the input string
- Order: JSON object key order may vary but all named groups are always present
Named Groups
/* Basic named groups */
"John Doe".grep("(?P<first>\\w+) (?P<last>\\w+)", "oj")
/* Result: [{"match":"John Doe","first":"John","last":"Doe","offset":0,"line":1}] */
/* Email extraction */
"Contact: john@example.com".grep("(?P<email>[\\w.-]+@[\\w.-]+\\.[a-zA-Z]{2,})", "oj")
/* Result: [{"match":"john@example.com","email":"john@example.com","offset":9,"line":1}] */
/* Phone number parsing */
"Call +1-555-123-4567".grep("(?P<country>\\+\\d{1,3})-(?P<area>\\d{3})-(?P<prefix>\\d{3})-(?P<line>\\d{4})", "oj")
/* Result: [{"match":"+1-555-123-4567","country":"+1","area":"555","prefix":"123","line":"4567","offset":5,"line":1}] */
/* Direct access to named groups */
result = "John Doe".grep("(?P<first>\\w+) (?P<last>\\w+)", "oj")
first = result[0]["first"] /* "John" */
last = result[0]["last"] /* "Doe" */
Context Lines
Context lines provide surrounding context for matches, similar to ripgrep's -A
, -B
, and -C
options:
input = "Header\nLine 1\nLine 2\nLine 3\nLine 4\nLine 5\nLine 6\nLine 7\nFooter";
/* After context (2 lines after match) */
input.grep("Line 2", "A2")
["Line 2", "Line 3", "Line 4"]
/* Before context (2 lines before match) */
input.grep("Line 5", "B2")
["Line 3", "Line 4", "Line 5"]
/* Combined context (1 line before and after) */
input.grep("Line 4", "A1B1")
["Line 3", "Line 4", "Line 5"]
/* Context merging - overlapping regions are automatically merged */
input2 = "a\nb\nc\nd\ne\nf";
input2.grep("c|d", "A1B1")
["b", "c", "d", "e"] /* Overlapping context merged into single block */
Context Merging: Overlapping context regions are automatically merged into single blocks, ensuring all relevant context is shown without duplication. This matches ripgrep's behavior for optimal readability and prevents redundant context lines.
Context Separators
When multiple non-overlapping context blocks exist, they are separated by --
lines (matching ripgrep/GNU grep behavior):
/* Multiple matches with context - separated by -- lines */
input = "Line 1\nLine 2\nLine 3\nLine 4\nLine 5\nLine 6\nLine 7";
input.grep("Line 2|Line 6", "A1B1")
/* Result: ["Line 1", "Line 2", "Line 3", "--", "Line 5", "Line 6", "Line 7"] */
/* Context separators are not output in match-only mode */
input.grep("Line 2|Line 6", "oA1B1")
/* Result: ["Line 2", "Line 6"] - Only matches, no context or separators */
/* JSON output uses --- as separator */
input.grep("Line 2|Line 6", "jA1B1")
/* Result: ["Line 1", "Line 2", "Line 3", "---", "Line 5", "Line 6", "Line 7"] */
Note: Context separators are only added between non-overlapping context blocks. When context blocks overlap or are adjacent, no separator is needed.
Advanced Regex Features
/* Unicode categories */
"Hello δΈη 123 β¬".grep("\\p{L}+", "o")
/* Result: ["Hello", "δΈη"] */
/* Unicode scripts */
"Hello δΈη".grep("\\p{sc=Latin}", "o")
/* Result: ["Hello"] */
/* Unicode script extensions */
"Hello δΈη".grep("\\p{scx:Han}", "o")
/* Result: ["δΈη"] */
/* Unicode general categories */
"Hello World".grep("\\p{Lu}", "o")
/* Result: ["H", "W"] */
/* Atomic groups */
"aaaa".grep("(?>a+)a", "o")
/* Result: [] (atomic group prevents backtracking) */
/* Lookaround assertions */
/* Positive lookahead - word followed by number */
"word123 text456".grep("\\w+(?=\\d)", "o")
/* Result: ["word", "text"] */
/* Negative lookahead - word not followed by number */
"word123 text456".grep("\\w+(?!\\d)", "o")
/* Result: ["word123", "text456"] */
/* Positive lookbehind - number preceded by word */
"word123 text456".grep("(?<=\\w)\\d+", "o")
/* Result: ["123", "456"] */
/* Negative lookbehind - number not preceded by word */
"123 word456".grep("(?<!\\w)\\d+", "o")
/* Result: ["123"] */
/* Complex password validation */
"password123".grep("(?=.*[A-Z])(?=.*[a-z])(?=.*\\d).{8,}", "o")
/* Result: [] (no uppercase letter) */
"Password123".grep("(?=.*[A-Z])(?=.*[a-z])(?=.*\\d).{8,}", "o")
/* Result: ["Password123"] (valid password) */
/* Advanced Unicode properties */
"Hello π World π".grep("\\p{Emoji}", "o")
/* Result: ["π", "π"] */
"Hello π¨βπ©βπ§βπ¦ World".grep("\\p{So}", "o")
/* Result: ["π¨βπ©βπ§βπ¦"] */
/* Advanced Unicode properties with mixed content */
"Hello δΈη π π".grep("\\p{So}", "o")
/* Result: ["π", "π"] (symbols only, not Han characters) */
/* Emoji sequences as symbols */
"Family: π¨βπ©βπ§βπ¦".grep("\\p{So}", "o")
/* Result: ["π¨βπ©βπ§βπ¦"] (entire family emoji as one symbol) */
/* Possessive quantifiers */
"aaaa".grep("a++a", "o")
/* Result: [] (possessive quantifier prevents backtracking) */
"aaa".grep("a++", "o")
/* Result: ["aaa"] (matches all a's greedily without backtracking) */
/* Edge cases for possessive quantifiers */
"a".grep("a?+", "o")
/* Result: ["a"] (possessive optional quantifier) */
"abc".grep("a*+b", "o")
/* Result: ["ab"] (possessive star with following character) */
/* Conditional patterns */
"abc123".grep("(a)?(?(1)b|c)", "o")
/* Result: ["ab"] (conditional pattern works) */
"c123".grep("(a)?(?(1)b|c)", "o")
/* Result: ["c"] (alternative branch when 'a' is not present) */
/* More complex conditional patterns */
"xyz".grep("(x)?(?(1)y|z)", "o")
/* Result: ["xy"] (first branch when 'x' is present) */
"yz".grep("(x)?(?(1)y|z)", "o")
/* Result: ["z"] (second branch when 'x' is not present) */
/* Context lines */
"Line 1\nLine 2\nLine 3\nLine 4".grep("Line 3", "A1")
/* Result: ["Line 3", "Line 4"] (shows 1 line after) */
"Line 1\nLine 2\nLine 3\nLine 4".grep("Line 3", "B1")
/* Result: ["Line 2", "Line 3"] (shows 1 line before) */
"Line 1\nLine 2\nLine 3\nLine 4".grep("Line 3", "C1")
/* Result: ["Line 2", "Line 3", "Line 4"] (shows 1 line before and after) */
/* Named groups with JSON output */
"John Doe (30)".grep("(?P<first>\\w+) (?P<last>\\w+) \\((?P<age>\\d+)\\)", "oj")
/* Result: [{"match":"John Doe (30)","first":"John","last":"Doe","age":"30","offset":0,"line":1}] */
JSON Output Examples
/* Basic JSON output */
"Hello world".grep("\\w+", "oj")
/* Result: [{"match":"Hello","offset":0,"line":1},{"match":"world","offset":6,"line":1}] */
/* JSON with named groups */
"John Doe (30)".grep("(?P<first>\\w+) (?P<last>\\w+) \\((?P<age>\\d+)\\)", "oj")
/* Result: [{"match":"John Doe (30)","first":"John","last":"Doe","age":"30","offset":0,"line":1}] */
/* Complex JSON example */
"Email: user@domain.com, Phone: +1-555-1234".grep("(?P<email>[\\w.-]+@[\\w.-]+)|(?P<phone>\\+\\d{1,3}-\\d{3}-\\d{4})", "oj")
/* Result: [
{"match":"user@domain.com","email":"user@domain.com","offset":7,"line":1},
{"match":"+1-555-1234","phone":"+1-555-1234","offset":31,"line":1}
] */
/* Accessing named groups directly */
result = "John Doe (30)".grep("(?P<first>\\w+) (?P<last>\\w+) \\((?P<age>\\d+)\\)", "oj")
first_name = result[0]["first"] /* "John" */
last_name = result[0]["last"] /* "Doe" */
age = result[0]["age"] /* "30" */
Additional Parameters Examples
/* Unicode normalization examples */
"cafΓ©".grep("cafe", "o", "", "NFC")
/* Result: ["cafΓ©"] - NFC normalization matches decomposed form */
"cafΓ©".grep("cafe", "o", "", "NFD")
/* Result: ["cafΓ©"] - NFD normalization matches composed form */
/* Binary mode for raw byte processing */
"\\x48\\x65\\x6c\\x6c\\x6f".grep("Hello", "o", "", "NONE", "BINARY")
/* Result: ["Hello"] - Binary mode processes raw bytes */
/* Custom delimiter with normalization */
"apple|||pear|||banana".grep("\\w+", "o", "|||", "NFC")
/* Result: ["apple", "pear", "banana"] - Custom delimiter with NFC normalization */
/* More custom delimiter examples */
"section1###section2###section3".grep("section\\d+", "o", "###")
/* Result: ["section1", "section2", "section3"] - Using "###" as delimiter */
"item1|item2|item3".grep("item\\d+", "o", "|")
/* Result: ["item1", "item2", "item3"] - Using "|" as delimiter */
"record1---record2---record3".grep("record\\d+", "o", "---")
/* Result: ["record1", "record2", "record3"] - Using "---" as delimiter */
/* Binary mode with custom delimiter */
"data1\\x00data2\\x00data3".grep("data\\d+", "o", "\\x00", "NONE", "BINARY")
/* Result: ["data1", "data2", "data3"] - Binary mode with null delimiter */
Performance Features
Caching
- Pattern compilation caching
- Text normalization caching
- Offset mapping caching
- Thread-safe cache management
Optimization
- ASCII-only pattern detection
- Fast path for simple patterns
- Unicode property optimization
- Memory-efficient processing
Performance Optimization Details
Grapa grep includes several performance optimizations:
- Pattern Compilation Caching - Compiled patterns are cached for reuse
- PCRE2 JIT Compilation - Just-In-Time compilation for fast pattern matching
- Fast Path Expansions - Optimized paths for simple literal, word, and digit patterns
- LRU Cache Management - Thread-safe LRU cache for text normalization
- Parallel Processing - Multi-threaded processing for large inputs
Parallel Processing
Grapa grep now supports parallel processing for large inputs:
- Automatic worker detection - Determines optimal number of threads based on input size
- Smart chunking - Splits input at line boundaries to avoid breaking matches
- Thread-safe processing - Uses std::async for cross-platform compatibility
- Fallback to sequential - Automatically uses single-threaded processing for small inputs
Usage:
/* Automatic parallel processing (recommended) */
"large_input".grep("pattern", "o")
/* Manual parallel processing with specific worker count */
"large_input".grep("pattern", "o", "", "", "", "", 4) /* 4 worker threads */
/* Sequential processing (force single-threaded) */
"large_input".grep("pattern", "o", "", "", "", "", 1) /* 1 worker thread */
/* Auto-detection (same as default) */
"large_input".grep("pattern", "o", "", "", "", "", 0) /* Auto-detect optimal threads */
num_workers Parameter Values: - 0
(default): Auto-detection - determines optimal number of threads based on input size - 1
: Sequential processing - forces single-threaded execution - 2+
: Parallel processing - uses specified number of worker threads
Performance characteristics: - Small inputs (< 1MB): Single-threaded processing (auto-detected) - Medium inputs (1-10MB): 2-4 worker threads (auto-detected) - Large inputs (> 10MB): Up to 16 worker threads (auto-detected, configurable)
Note: All grep features (context lines, invert match, all-mode) work correctly in parallel mode.
Performance Examples:
/* Large file processing with parallel workers */
large_content.grep("pattern", "oj", "", "", "", "", 4)
/* Result: Faster processing with 4 worker threads */
/* Sequential processing for small inputs */
small_content.grep("pattern", "oj", "", "", "", "", 1)
/* Result: Sequential processing, no threading overhead */
/* Auto-detection for optimal performance */
any_size_content.grep("pattern", "oj", "", "", "", "", 0)
/* Result: Automatically chooses best approach */
Binary Mode Processing
When to Use Binary Mode
Binary mode is useful for: - Binary files: Executables, images, compressed files - Network data: Raw packet analysis - Memory dumps: Forensic analysis - Data that should not be Unicode-processed
Binary vs Unicode Mode
Aspect | Unicode Mode | Binary Mode |
---|---|---|
Processing | Full Unicode normalization and case folding | Raw byte processing |
Performance | Slower due to Unicode overhead | Faster for binary data |
Memory | Higher due to normalization | Lower memory usage |
Use case | Text files, user input | Binary files, network data |
/* Unicode mode (default) - for text files */
"cafΓ©".grep("cafe", "i") /* Case-insensitive with Unicode folding */
/* Binary mode - for binary data */
"\\x48\\x65\\x6c\\x6c\\x6f".grep("Hello", "o", "", "NONE", "BINARY")
/* Result: ["Hello"] - Raw byte processing */
/* Binary data with null delimiters */
"data1\\x00data2\\x00data3".grep("data\\d+", "o", "\\x00", "NONE", "BINARY")
/* Result: ["data1", "data2", "data3"] - Binary mode with null delimiter */
Advanced Usage Patterns
Complex Context Line Combinations
Context options can be combined flexibly for sophisticated output:
/* Show 2 lines after, 1 line before, and 3 lines before/after */
"Line 1\nLine 2\nLine 3\nLine 4\nLine 5".grep("Line 3", "A2B1C3")
/* Result: ["Line 2", "Line 3", "Line 4", "Line 5"]
(B1: Line 2, A2: Line 4-5, C3: additional context)
Note: Overlapping context lines are allowed for complete coverage */
/* Show 1 line before and 2 lines after */
"Line 1\nLine 2\nLine 3\nLine 4".grep("Line 3", "B1A2")
/* Result: ["Line 2", "Line 3", "Line 4"] */
/* Show 3 lines before and 1 line after */
"Line 1\nLine 2\nLine 3\nLine 4\nLine 5".grep("Line 4", "B3A1")
/* Result: ["Line 1", "Line 2", "Line 3", "Line 4", "Line 5"] */
Performance Tuning for Large Datasets
For very large files (>100MB):
/* Use 'a' option for single-string processing */
large_content.grep("pattern", "a") /* Process as single string */
/* Use specific Unicode properties instead of broad categories */
large_content.grep("\\p{Lu}", "o") /* Better than \\p{L} for uppercase only */
/* Disable normalization if not needed */
large_content.grep("pattern", "o") /* No 'N' option unless required */
/* Use fast path patterns when possible */
large_content.grep("\\w+", "o") /* Fast path for word matching */
Memory usage considerations: - Cache size: LRU cache limits memory usage automatically - Pattern compilation: Compiled patterns are cached but use memory - Large files: Consider processing in chunks for very large files
Thread Safety
All grep operations are thread-safe: - Concurrent access: Multiple threads can call grep simultaneously - Cache safety: All caches are protected with mutexes - No shared state: Each grep call is independent
/* Thread-safe concurrent usage */
/* Thread 1 */
result1 = text.grep("pattern1", "oj")
/* Thread 2 (simultaneous) */
result2 = text.grep("pattern2", "oj")
/* Both operations are safe and independent */
Troubleshooting
Common Regex Compilation Errors
Invalid pattern syntax:
/* Unmatched parentheses */
"text".grep("(", "j") /* Error: Unmatched '(' */
/* Invalid quantifier */
"text".grep("a{", "j") /* Error: Invalid quantifier */
/* Invalid Unicode property */
"text".grep("\\p{Invalid}", "j") /* Error: Unknown property */
Solutions:
/* Fix unmatched parentheses */
"text".grep("(group)", "j") /* Valid: matched parentheses */
/* Fix invalid quantifier */
"text".grep("a{1,3}", "j") /* Valid: proper quantifier */
/* Use valid Unicode properties */
"text".grep("\\p{L}", "j") /* Valid: letter property */
Performance Issues
Slow pattern matching:
/* Problem: Catastrophic backtracking */
/* Create long string manually (Grapa doesn't have repeat function) */
long_string = "";
i = 0;
while (i < 10000) {
long_string = long_string + "a";
i = i + 1;
}
long_string.grep("(a+)+", "o") /* Very slow */
/* Solution: Use atomic groups */
long_string.grep("(?>a+)+", "o") /* Much faster */
/* Problem: Broad Unicode categories */
"text".grep("\\p{L}+", "o") /* Slower for large text */
/* Solution: Use specific properties */
"text".grep("\\p{Lu}+", "o") /* Faster for uppercase only */
Memory usage issues:
/* Problem: Large cache accumulation */
/* Solution: Process in smaller chunks or restart application */
/* Problem: Large compiled patterns */
/* Solution: Use simpler patterns or break into multiple searches */
Unicode Normalization Issues
Unexpected matches:
/* Problem: Different normalization forms */
"cafΓ©".grep("cafe", "o") /* No match without normalization */
/* Solution: Use normalization */
"cafΓ©".grep("cafe", "N") /* Matches with NFC normalization */
/* Problem: Case sensitivity with Unicode */
"Δ°stanbul".grep("istanbul", "i") /* May not match due to Turkish 'Δ°' */
/* Solution: Use diacritic-insensitive matching */
"Δ°stanbul".grep("istanbul", "di") /* Matches with diacritic stripping */
Debugging Tips
Check pattern validity:
/* Test pattern compilation */
result = text.grep("pattern", "j")
if (result.type() == $ERR) {
echo("Pattern compilation failed")
/* Check pattern syntax */
}
Verify Unicode handling:
/* Test Unicode normalization */
"cafΓ©".grep("cafe", "N") /* Should match with normalization */
/* Test case folding */
"CAFΓ".grep("cafe", "i") /* Should match case-insensitive */
/* Test diacritic stripping */
"cafΓ©".grep("cafe", "d") /* Should match diacritic-insensitive */
Performance profiling:
/* Test with small sample first */
sample = large_text.substring(0, 1000)
result = sample.grep("pattern", "oj") /* Test pattern on small sample */
/* If successful, test on full text */
if (result.type() != $ERR) {
full_result = large_text.grep("pattern", "oj")
}
Testing and Verification
Performance Testing
A comprehensive performance test file is available to verify optimizations:
/* Run performance tests */
grapa -f "test_performance_optimizations.grc"
Test Coverage: - JIT compilation detection and functionality - Fast path optimizations for literal, word, and digit patterns - LRU cache functionality for text normalization - Complex Unicode pattern performance - Mixed pattern performance - Edge case handling
Capability Testing
Verify current Unicode and regex capabilities:
/* Run comprehensive capability tests */
grapa -f "test_current_capabilities.grc"
Test Coverage: - Basic Unicode properties (\p{L}
, \p{N}
, etc.) - Named groups and JSON output - Lookaround assertions - Unicode grapheme clusters - Advanced Unicode properties - Context lines - Atomic groups - Possessive quantifiers - Conditional patterns - Unicode scripts and script extensions - Unicode general categories - Unicode blocks (not supported) - Unicode age properties (not supported) - Unicode bidirectional classes (not supported)
Feature-Specific Tests
Individual test files for specific features:
/* Test Unicode normalization and diacritic handling */
grapa -f "test_grapheme_unicode_normalization.grc"
/* Test advanced Unicode features */
grapa -f "test_unicode_advanced_features.grc"
/* Test lookaround assertions */
grapa -f "test_lookaround_assertions.grc"
/* Test atomic groups */
grapa -f "test_atomic_groups.grc"
/* Test Unicode grapheme clusters */
grapa -f "test_unicode_grapheme_clusters.grc"
Regression Testing
To ensure no regressions after changes:
/* Run core functionality tests */
grapa -f "test_current_capabilities.grc"
grapa -f "test_performance_optimizations.grc"
/* Verify basic functionality */
"Hello world".grep("world", "oj") /* Should return matches */
"cafΓ©".grep("cafe", "N") /* Should match with normalization */
"Hello δΈη".grep("\\p{L}+", "oj") /* Should match Unicode letters */
Zero-Length Match and Empty String Output
Update (2024-12): Zero-length matches and explicit empty strings in arrays are now correctly output as
""
(empty string), never asnull
. This matches ripgrep's behavior and ensures round-trip consistency and correct scripting semantics. The previous null output bug has been resolved.
Example: Zero-Length Match
/* Zero-length match example */
"a\nb\n".grep("^", "o")
/* Result: ["", "a", "", "b", ""] */
Example: Array Literal with Empty String
[1, "", 2]
/* Result: [1, "", 2] */
Output Formatting and Array Design
Why Arrays Instead of Strings?
Grapa grep is designed as an integrated programming language feature, not a standalone console tool. This fundamental difference explains the output format:
Grapa Grep (Integrated Language): - Returns arrays of strings for programmatic use - Removes delimiters from output strings (clean data for processing) - Designed for scripting and data manipulation - Example: ["line1", "line2", "line3"]
(no \n
in strings)
ripgrep/GNU grep (Console Tools): - Outputs single string with embedded delimiters - Preserves delimiters in output for console display - Designed for command-line text processing - Example: "line1\nline2\nline3\n"
(with \n
in string)
Delimiter Removal Behavior
Grapa grep automatically removes delimiters from output strings:
/* Input with custom delimiter */
input = "line1|||line2|||line3";
/* Grapa grep removes delimiters from output */
result = input.grep("line", "o", "|||");
/* Result: ["line1", "line2", "line3"] (clean strings, no |||) */
/* For console output, you can join with delimiter */
console_output = result.join("|||");
/* Result: "line1|||line2|||line3" */
Console Output Equivalence
To get console-equivalent output in Grapa:
/* Grapa approach */
input = "line1\nline2\nline3";
result = input.grep("line", "o"); /* ["line1", "line2", "line3"] */
console_output = result.join("\n"); /* "line1\nline2\nline3" */
/* This matches ripgrep output: "line1\nline2\nline3" */
Benefits of Array Design
- Programmatic Use: Arrays are easier to process in scripts
- Clean Data: No delimiter artifacts in output strings
- Flexible Output: Can join with any delimiter for different formats
- Language Integration: Natural fit with Grapa's array-based design
- Python Integration: Arrays map naturally to Python lists
Custom Delimiter Support
Grapa grep fully supports multi-character delimiters:
/* Single character delimiter */
"line1|line2|line3".grep("line", "o", "|")
/* Result: ["line1", "line2", "line3"] */
/* Multi-character delimiter */
"line1|||line2|||line3".grep("line", "o", "|||")
/* Result: ["line1", "line2", "line3"] */
/* Complex delimiter */
"line1<DELIM>line2<DELIM>line3".grep("line", "o", "<DELIM>")
/* Result: ["line1", "line2", "line3"] */
Note: All delimiters are automatically removed from output strings, regardless of length or complexity.
Error Output
Note: Invalid regex patterns always return
"$ERR"
(not a JSON object or other format).
Test Coverage and Regression Testing
Update (2024-12): The test suite now includes explicit checks for empty string vs null output, zero-length matches, and all advanced edge cases to ensure full ripgrep parity (excluding file system features). The previous null output bug is now fixed. See Testing Documentation for details.
Comprehensive Features Summary
Update (2024-06): Grapa grep now matches ripgrep for all in-memory/streaming features, with the only exception being SIMD optimizations and file system integration. All advanced Unicode, regex, and context features are fully supported and tested.
β Fully Supported Features
Unicode Support: - β
Basic Unicode properties (\p{L}
, \p{N}
, \p{Z}
, \p{P}
, \p{S}
, \p{C}
, \p{M}
) - β
Advanced Unicode properties (\p{Emoji}
, \p{So}
, \p{Sc}
, etc.) - β
Unicode scripts (\p{sc=Latin}
, \p{sc=Han}
, \p{sc=Cyrillic}
, etc.) - β
Unicode script extensions (\p{scx:Han}
, etc.) - β
Unicode general categories (\p{Lu}
, \p{Ll}
, \p{Lt}
, etc.) - β
Unicode grapheme clusters (\X
) - handles emoji sequences, combining characters - β
Unicode normalization (NFC, NFD, NFKC, NFKD) - β
Case-insensitive matching with proper Unicode case folding
Advanced Regex Features: - β
Named groups ((?P<name>...)
) - β
Atomic groups ((?>...)
) - β
Lookaround assertions ((?=...)
, (?<=...)
, (?!...)
, (?<!...)
) - β
Possessive quantifiers (*+
, ++
, ?+
, {n,m}+
) - β
Conditional patterns (?(condition)...
)
Output and Context Features: - β
JSON output with named groups, offsets, and line numbers - β
Context lines (A<n>
, B<n>
, C<n>
) with flexible combinations - β
All basic grep options (o
, i
, v
, x
, n
, l
, b
, c
, d
, g
)
Performance Features: - β Pattern compilation caching - β Text normalization caching - β Offset mapping caching - β Thread-safe cache management
β Not Supported (3 specialized features):
- β Unicode blocks (
\p{In_Basic_Latin}
) - use Unicode scripts instead - β Unicode age properties (
\p{Age=1.1}
) - very specialized - β Unicode bidirectional classes (
\p{Bidi_Class:Left_To_Right}
) - very specialized
Coverage: Grapa supports 95%+ of practical Unicode and regex use cases with production-ready reliability.
Features Not Currently Supported
Search Strategy Features
- β
Case-insensitive matching - Use "i" flag for explicit case-insensitive matching:
"word".grep("hello", "i")
- β
Word boundary mode - Use "w" option or
\b
pattern anchors:"word".grep("hello", "w")
or"word".grep("\\bhello\\b", "o")
- β
Column numbers - Use "T" option for column numbers:
"word".grep("hello", "oT")
Note: Grapa uses explicit "i" flag for case-insensitive matching rather than ripgrep's automatic smart-case behavior. This provides more predictable and explicit control over case sensitivity.
File Handling Features (handled by Grapa language or Python integration)
- β Automatic .gitignore support - Grapa handles file filtering separately via
file().ls()
with filters - β File type detection - Use Grapa's file operations (
file().extension()
,file().type()
) instead - β File size limits - Use Grapa's file size checking (
file().size()
) before grep operations - β Hidden file filtering - Use Grapa's file listing with filters (
file().ls(".*", "h")
)
Note: Many of these features are handled differently in Grapa's integrated environment, where file operations and filtering are managed by the Grapa language or Python integration rather than within the grep function itself. This design provides more flexibility and control over file operations.
Summary: Actual Missing Features (Excluding File Handling)
When you exclude file handling (since that's handled by the Grapa language), Grapa grep is missing just 1 feature that ripgrep has:
Performance Features (1 missing)
- β SIMD optimizations - Standard optimizations (ripgrep uses CPU vector instructions)
Bottom Line: Grapa grep has about 95%+ of ripgrep's core text processing features, plus several unique advanced Unicode capabilities that ripgrep doesn't have. The main gaps are in performance optimizations.
Achieving "Missing" Features in Grapa
Case-Insensitive Matching
/* ripgrep: rg -i "hello" (explicit case-insensitive) */
"Hello WORLD".grep("hello", "i")
/* ripgrep: rg "HELLO" (case-sensitive for uppercase) */
"Hello WORLD".grep("HELLO", "")
/* Note: Grapa uses explicit "i" flag rather than ripgrep's automatic smart-case behavior */
/* This provides more predictable and explicit control over case sensitivity */
Word Boundary Mode
/* ripgrep: rg --word-regexp "hello" */
"hello world".grep("hello", "wo") /* Using 'w' option */
/* or */
"hello world".grep("\\bhello\\b", "o") /* Manual word boundaries */
Column Numbers
/* ripgrep: rg --column "hello" */
"hello world".grep("hello", "oT") /* Shows column:match format */
/* Result: ["1:hello"] */
Grapa vs. ripgrep: Feature Comparison Summary
Grapa's Strengths (Where Grapa excels)
- β Advanced Unicode - Grapheme clusters, normalization, diacritic-insensitive matching
- β Language Integration - Native part of Grapa language, not standalone
- β Advanced Regex - Named groups, atomic groups, lookaround assertions
- β JSON Output - Structured output with metadata
- β JIT Compilation - Fast pattern matching
- β Unicode Properties - Full Unicode categories, scripts, and properties
ripgrep's Strengths (Where ripgrep excels)
- β Performance - SIMD optimizations
- β File Handling - Automatic .gitignore, file type detection, size limits, memory-mapped I/O (standalone tool)
Shared Strengths (Both tools excel)
- β Regex Engine - Full PCRE2 support with Unicode
- β Case Handling - Case-sensitive and case-insensitive modes
- β
Context Lines - Before/after context with
-A
,-B
,-C
- β Binary Mode - Skip binary files or search within them
- β
Line Numbers - Show line numbers with
-n
- β
Invert Match - Show non-matching lines with
-v
- β Case-insensitive matching - Use "i" flag for explicit case-insensitive matching
- β
Word boundary mode - Use "w" option or
\b
pattern anchors - β Column numbers - Use "T" option for column:match format
- β Parallel processing - Multi-threaded processing for large inputs
Feature Coverage Comparison
Grapa grep covers ~95% of ripgrep's non-file-system features: - β All core text processing, regex, Unicode, and search strategy features - β Only missing: SIMD (vectorized) search optimizations
ripgrep covers ~80-85% of Grapa grep's features: - β Core regex, case handling, context lines, binary mode, line numbers, invert match - β Missing: Unicode normalization, diacritic-insensitive matching, grapheme clusters, advanced Unicode properties, script extensions, flexible JSON output, integrated language features, Python integration
When to Use Each Tool
Use Case | Recommended Tool | Reason |
---|---|---|
International Text Processing | Grapa | Best Unicode support, normalization, diacritic-insensitive |
High-Performance File Search | ripgrep | Fastest for large file systems, multi-threaded |
Integrated Development | Grapa | Part of programming environment, Python integration |
Command-line Search | ripgrep | Optimized for CLI usage, smart defaults |
Unicode Analysis | Grapa | Grapheme clusters, normalization, advanced Unicode features |
Large-scale File Operations | Grapa | Parallel processing, integrated language |
Cross-platform Scripts | Grapa | Consistent behavior, integrated language |
File Processing Workflows | Grapa | File operations handled by language, grep focuses on text processing |
Bottom Line: Grapa grep has about 95% of ripgrep's core text processing features, plus unique advanced Unicode capabilities. ripgrep covers about 80-85% of Grapa grep's features. For most text processing tasks, especially Unicode-heavy work, Grapa is quite capable. ripgrep remains the gold standard for high-performance file system searches.
Grapa's Integrated Approach vs. ripgrep's Standalone Approach
File Handling Philosophy
ripgrep (Standalone Tool): - File handling is built into the grep function - Automatic .gitignore support - File type detection and filtering - File size limits and hidden file handling - Optimized for command-line file system searches
Grapa (Integrated Language): - File handling is separated from text processing - File operations use Grapa language functions: file().ls()
, file().size()
, file().type()
- More flexible and programmable file filtering - grep function focuses purely on text pattern matching - Better for complex workflows and integrated development
Example: File Processing Workflow
ripgrep approach:
rg "pattern" --type python --max-filesize 1M --hidden
Grapa approach:
/* File operations handled by language */
files = file().ls("*.py", "h"); /* Get Python files, including hidden */
filtered = files.filter(f => file().size(f) < 1024*1024); /* Size filter */
content = filtered.map(f => file().read(f)); /* Read files */
matches = content.grep("pattern", "oj"); /* Pure text processing */
This separation allows Grapa grep to focus on what it does best: advanced Unicode text processing with sophisticated regex features, while file operations are handled by the appropriate language constructs.
Feature Status
β Fully Implemented Features
Core Grep Features: - β
Basic pattern matching - β
Case-insensitive matching (i
option) - β
Match-only output (o
option) - Comprehensive Unicode support - β
Invert match (v
option) - β
Line numbers (n
option) - β
Count only (c
option) - β
All-mode (a
option) - β
Exact match (x
option)
Advanced Features: - β
Word boundaries (w
option) - Full ripgrep compatibility - β
Context lines (A--
between non-overlapping blocks) - β
Column numbers (T
option) - 1-based positioning - β
Color output (L
option) - ANSI color codes - β
Custom delimiters - β
JSON output (j
option)
Unicode Features: - β
Unicode normalization (N
option) - β
Diacritic-insensitive matching (d
option) - β
Unicode properties (\p{L}
, \p{N}
, etc.) - β
Grapheme clusters (\X
pattern) - β
Comprehensive Unicode "o" option support - β
Unicode boundary handling with hybrid mapping
Performance Features: - β JIT compilation - β Parallel processing - β Fast path optimizations - β Binary mode - β LRU caching
Error Handling: - β Graceful error handling - β Invalid pattern recovery - β Bounds checking - β UTF-8 validation
β οΈ Known Limitations
File System Features: - β
File searching, directory traversal, and file filtering are fully supported via the $file()
API in the scripting layer. - β These features are not built into the .grep()
function itself, but are available for scripting flexible workflows. - Design Note: This separation allows for more powerful and programmable file processing, at the cost of not having a single "one-liner" CLI for recursive search.
Scripting Layer Issues: - β οΈ Unicode string functions (len()
, ord()
) count bytes not characters - β οΈ Null-data mode limited by string parser (\x00
not converted)
β Ripgrep Parity Status
FULL PARITY ACHIEVED for all in-memory/streaming features: - β All core grep functionality - β All advanced features - β Complete Unicode support - β Performance optimizations - β Error handling - β Context merging and separators - β Comprehensive "o" option functionality
Recent Fixes and Improvements (2024-12)
JSON Output Format: - β
Fixed double-wrapping issue in JSON output - β
Now returns valid JSON arrays consistently - β
Proper handling of named groups and metadata - β
Fixed empty pattern with "j"
/"oj"
options returning null instead of valid JSON array
Zero-Length Matches: - β Fixed output of empty strings vs null values - β Proper handling of lookaround assertions and word boundaries - β Consistent behavior with ripgrep for zero-length matches
PCRE2 Integration: - β Improved Unicode handling and advanced regex features - β Better support for Unicode properties and grapheme clusters - β Enhanced error handling for malformed patterns
Context Features: - β
Implemented proper --
separator lines between context blocks - β
Improved context merging for overlapping regions - β
Better handling of edge cases (file boundaries, multiple matches)
Output Options: - β Fixed T option for column number output (1-based positioning) - β Fixed L option for ANSI color codes - β Improved error handling for malformed patterns and edge cases
Performance: - β Up to 9.44x speedup with 16 workers (verified in tests) - β Consistent results across all worker counts - β Robust edge case handling for worker counts
Comprehensive Testing for Multiline Patterns and Rare PCRE2 Features (2024-12)
New Test Coverage: - β
Multiline patterns with custom delimiters (s
flag) - β
Atomic groups ((?>...)
) with multi-character delimiters - β
Possessive quantifiers (*+
, ++
, ?+
) with custom delimiters - β
Conditional patterns (?(condition)...
) with edge cases - β
Lookaround assertions with multi-character delimiters - β
Unicode properties with custom delimiters - β
Complex multiline patterns with context lines - β
Edge cases with multi-character delimiters - β
JSON output with custom delimiters - β
Performance testing with large multi-character delimiters - β
Rare PCRE2 features with Unicode grapheme clusters - β
Delimiter removal verification for all scenarios
Test File: test/test_multiline_and_rare_pcre2.grc
Key Improvements: - β
Multi-character delimiter support - No longer assumes single-character delimiters - β
Proper delimiter removal - All output strings are clean (no delimiter artifacts) - β
Context line processing - Uses custom delimiters instead of hardcoded \n
- β
Comprehensive edge case coverage - Tests for all rare PCRE2 features - β
Performance validation - Large inputs with complex delimiters - β
Unicode integration - Advanced Unicode features with custom delimiters
Example Test Cases:
/* Multiline pattern with custom delimiter */
"start|||middle|||end".grep("start.*end", "s", "|||")
/* Result: ["start|||middle|||end"] (matches across delimiter) */
/* Atomic group with custom delimiter */
"aaaa|bbbb|cccc".grep("(?>a+)a", "o", "|")
/* Result: [] (atomic group prevents backtracking) */
/* Possessive quantifier with custom delimiter */
"aaa|bbb|ccc".grep("a++", "o", "|")
/* Result: ["aaa"] (matches all a's greedily) */
/* Conditional pattern with custom delimiter */
"abc123|def456".grep("(a)?(?(1)b|c)", "o", "|")
/* Result: ["ab", "c"] (conditional branching) */
Benefits: - Robust delimiter handling - Supports any delimiter length or complexity - Clean output - No delimiter artifacts in result strings - Full PCRE2 compatibility - All advanced regex features work with custom delimiters - Performance optimized - Efficient processing of multi-character delimiters - Comprehensive testing - Edge cases and rare features thoroughly tested
Current Status and Known Issues
Working Features: - β All core functionality working correctly - β Full Unicode support with normalization and diacritic-insensitive matching - β Advanced regex features (atomic groups, lookarounds, possessive quantifiers) - β Comprehensive output formats (JSON, context, line numbers, etc.) - β Parallel processing with excellent performance scaling - β Python integration fully functional - β Ripgrep parity for all in-memory features
Minor Issues: - β οΈ Empty patterns return $SYSID
instead of $ERR
(current behavior, not a bug) - β οΈ Some complex context combinations may not merge exactly as ripgrep does - β οΈ Some Unicode normalization scenarios may have edge cases
Test Coverage: - β Comprehensive test suite covering all features - β Property-based testing for Unicode/PCRE2 edge cases - β Performance testing with large inputs - β Python integration testing - β Regression testing for all recent fixes
Advanced Context Examples
/* Context merging - overlapping regions are automatically merged */
input = "a\nb\nc\nd\ne\nf";
input.grep("c|d", "A1B1")
["b\n", "c\n", "d\n", "e\n"] /* Overlapping context merged into single block */
/* Context separators between non-overlapping blocks */
input2 = "a\nb\nc\nd\ne\nf\ng\nh\ni\nj";
input2.grep("c|i", "A1B1")
["b\n", "c\n", "d\n", "--\n", "h\n", "i\n", "j\n"] /* -- separator between blocks */
/* Complex context with multiple options */
log_content.grep("error", "A2B1io") /* 2 lines after, 1 before, match-only, case-insensitive */
Advanced Unicode "o" Option Examples
/* Comprehensive Unicode character extraction */
"éñü".grep(".", "o")
["Γ©", "Γ±", "ΓΌ"] /* Perfect Unicode character extraction */
/* Unicode with normalization and "o" option */
"cafΓ© rΓ©sumΓ©".grep("\\X", "oN")
["c", "a", "f", "Γ©", " ", "r", "Γ©", "s", "u", "m", "Γ©"] /* Normalized grapheme clusters */
/* Complex Unicode scenarios with "o" option */
"π¨βπ©βπ§βπ¦".grep("\\X", "o")
["π¨βπ©βπ§βπ¦"] /* Family emoji as single grapheme cluster */
/* Unicode properties with "o" option */
"Hello δΈη 123".grep("\\p{L}+", "o")
["Hello", "δΈη"] /* Unicode letters only */
/* Diacritic-insensitive with "o" option */
"cafΓ© rΓ©sumΓ© naΓ―ve".grep("cafe", "od")
["cafΓ©"] /* Diacritic-insensitive matching */
/* Case-insensitive Unicode with "o" option */
"ΓΓΓ".grep(".", "oi")
["Γ", "Γ", "Γ"] /* Case-insensitive Unicode character extraction */
Option Flag Coverage, Test Status, and Implementation Philosophy (Living Status Section)
This section is a living document tracking the current state of Grapa grep option flag support, test/code path coverage, and design philosophy. Update this section as new combinations are implemented or tested, or as the philosophy evolves.
Testing and Implementation Priorities
- First Priority:
- Ensure there are tests for all valid combinations of options.
- The code structure should cover every possible option combination, with minimal unique code paths (maximize code path sharing and composability).
-
This prevents the need for major refactoring as new features or edge cases are added.
-
Second Priority:
- Once the above is complete, address edge cases.
- Edge case handling must be implemented in a way that is compatible with all possible option combinations that may reach the relevant code path.
- Edge cases are exceptions layered on top of the comprehensive option combination coverage.
This approach ensures maintainability, extensibility, and robust architecture.
Coverage Matrix: Option Combinations
Option(s) | Description/Example | Status | Test File(s) |
---|---|---|---|
o | Match-only | β Tested | test/grep/test_option_based_behavior.grc |
f | Force full-segment | β Tested | test/grep/test_f_flag_combinations.grc |
a | All-mode | β Tested | test/grep/test_comprehensive_grep_combinations.grc |
s | Dot matches newline | β Tested | test/grep/test_multiline_and_rare_pcre2.grc |
i | Case-insensitive | β Tested | test/grep/test_case_insensitive_unicode.grc |
d | Diacritic-insensitive | β Tested | test/grep/test_option_combinations_advanced.grc |
w | Word boundary | β Tested | test/grep/test_option_combinations_advanced.grc |
l | Line number only output | β Tested | test/grep/test_basic_option_combinations.grc |
u | Unique (deduplicate) | β Tested | test/grep/test_option_combinations_advanced.grc |
g | Group results per line | β Tested | test/grep/test_option_combinations_advanced.grc |
b | Output byte offset | β Tested | test/grep/test_edge_case_precedence.grc |
j | JSON output | β Tested | test/grep/test_compositional_stress.grc |
c | Count of matches | β Tested | test/grep/test_edge_case_precedence.grc |
n | Prefix matches with line numbers | β Tested | test/grep/test_basic_option_combinations.grc |
x | Exact line match | β Tested | test/grep/test_basic_option_combinations.grc |
v | Invert match | β Tested | test/grep/test_compositional_stress.grc |
N | Normalize to NFC | β Tested | test/grep/test_unicode_normalization.grc |
z | Reserved/future | β οΈ Partial | test/grep/test_option_combinations_advanced.grc |
T | Output column numbers | β Tested | test/grep/column_test.grc |
L | Color output (ANSI) | β Tested | test/grep/test_edge_case_precedence.grc |
A | Context lines | β Tested | test/grep/test_context_lines.grc |
(pairs/triples) | All meaningful pairs/triples | β Tested | test/grep/test_option_combinations_matrix.grc |
(higher-order) | Quadruple+ combinations | β Tested | test/grep/test_option_combinations_higher_order.grc |
(parallel) | All above with parallel/worker | β Tested | test/grep/test_option_combinations_parallel.grc |
Legend: - β = Fully tested and implemented - β οΈ = Partially tested, planned, or reserved
Status
- All valid single, pair, triple, higher-order, and parallel option combinations are now systematically covered by dedicated test files.
- The next step is to proceed with edge case coverage, ensuring all edge case handling is compatible with the full option matrix.
Edge Case Handling
- Edge case tests will be added after the main option matrix is complete.
- Edge case handling must be compatible with all option combinations that may reach the relevant code path.
- Edge case test files will be clearly marked and cross-referenced here.
This living section ensures that the current state of Grapa grep option support, test coverage, and design philosophy is always visible and up to date.
Rules for Authoring .grc Files on Windows (Living Reference)
This section collects essential rules and conventions for writing or modifying Grapa .grc files on Windows. Follow these to ensure compatibility, correct syntax, and maintainability. Update as new rules are discovered.
- Comments:
- Do not use
//
for comments. Use block comments for all comments (do not use //). Block comments should be written as in this header. Do not use the literal / ... / inside a block comment, as Grapa does not support nested block comments. - Echo/Print:
- Do not use
print
orecho()
as a bare function. - Always use the method form:
"string".echo();
or(str1+str2).echo();
. - Statement Endings:
- End every command or statement with a
;
character. - Loops:
- Use
while
loops instead offor
loops (Grapa does not supportfor
). - String Concatenation:
- When concatenating strings, wrap the entire expression in parentheses:
(str1+str2).echo();
. - Array Access:
- Access arrays with
.get(index)
, not with square brackets:arr.get(0);
. - Object Property Access:
- Access object properties with
.get("key")
, not with square brackets:obj.get("key");
. - General:
- Validate syntax against known-good .grc files before adding new tests or code.
- Prefer simple, explicit constructs for maximum compatibility.
Update this section as new rules or best practices are discovered.
- Running .grc Files on Windows:
- To run a .grc file, use the following command in PowerShell or Command Prompt:
.\grapa.exe -q -f path/file.grc
-
This suppresses the version header (
-q
) and runs the specified .grc file (-f
). -
Array and List Access:
- Arrays (type $ARRAY) and lists (type $LIST) are accessed with
[index]
syntax, not.get(index)
. - Example: / ar = [1,2,3]; ar[1]; / returns 2 / ar = {"a":11,"b":22,"c":33}; ar[1]; / returns 22 / ar["b"]; / returns 22 / /
-
Use
.get("key")
for object property access, not for arrays/lists. -
String Literals and Quotes:
- If your string contains double quotes (
"
), use single quotes ('
) for the outer string, or escape the inner double quotes as (\"
). - If your string contains single quotes (
'
), use double quotes ("
) for the outer string, or escape the inner single quotes as (\'
). - Examples:
'Expected: ["", "a", "", "b", ""]\n'.echo();
/ single quotes outside, double quotes inside /(\"Expected: [\\\"\\\", \\\"a\\\", \\\"\\\", \\\"b\\\", \\\"\\\"]\\n\").echo();
/ double quotes outside, inner double quotes escaped /
File System Integration for Grep Utilities (Grapa Scripting Layer)
- Use
$file().ls()
to enumerate files in a directory. - Use
$file().info("path")
to check file type/existence. - Use
$file().get("path")
to read file contents. Note:.get()
returns binary data (type$BIN
); use.str()
to convert to string format:$file().get("file").str()
. - Use
$file().set("path", value)
to write file contents. - These commands provide all file system operations needed for scripting a command-line grep utility in Grapa.
Example workflow:
files = $file().ls();
i = 0;
while (i < files.len()) {
f = files[i];
info = $file().info(f["$KEY"]);
if (info["$TYPE"] == "FILE") {
content = $file().get(f["$KEY"]).str();
matches = content.grep("pattern", "o");
/* process matches... */
}
i = i + 1;
}
.grep()
function itself. This separation allows for flexible, programmable workflows. Production-Readiness Edge Case Coverage (2024-06 Update)
The following edge cases are now covered by dedicated test files to ensure Grapa grep is suitable for mission-critical production use and ripgrep parity.
Edge Case Category | Description/Examples | Test File(s) |
---|---|---|
Pathological Patterns | Catastrophic backtracking, large alternations, deep nesting | test/grep/test_pathological_patterns.grc |
Malformed/Invalid Unicode | Invalid UTF-8, unpaired surrogates, noncharacters, BOM | test/grep/test_malformed_unicode.grc |
Ultra-Large Lines | Single line >1MB, only delimiters, no newline at EOF | test/grep/test_ultra_large_lines.grc |
(All other edge cases) | Zero-length, Unicode, null bytes, context, overlap, etc. | See other test/grep/edge_case_*.grc files |
- These tests are critical for production reliability and ripgrep parity.
- If any test causes a hang, crash, or error, document and update implementation.
- See each test file for detailed scenarios and expected results.