Vector Performance Optimization Guide
Overview
This guide provides comprehensive performance optimization recommendations for Grapa's vector operations, based on systematic benchmarking and analysis. It covers performance characteristics, optimization strategies, and best practices for different use cases.
Performance Characteristics
Algorithm Complexities
Operation | Complexity | Performance Impact |
---|---|---|
Matrix multiplication | O(n³) | High for large matrices |
Determinant calculation | O(n³) | High for large matrices |
Eigenvalue calculation | O(n³) with iterations | Very high for large matrices |
Covariance calculation | O(n²m) | Moderate for large matrices |
Basic operations (sum, mean) | O(n²) | Low for all sizes |
Transpose | O(n²) | Very low for all sizes |
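The gap between the cubic and quadratic rows above can be made concrete with a quick operation count. The sketch below is in Python purely for illustration (the arithmetic is language-agnostic):

```python
# Rough operation counts for an n x n matrix (illustrative; constant factors ignored).
def matmul_ops(n):
    # Naive matrix multiplication: n * n output cells, n multiply-adds each.
    return n ** 3

def sum_ops(n):
    # Summing every element touches each cell exactly once.
    return n ** 2

# Doubling n multiplies matmul work by 8 but sum work by only 4,
# which is why multiplication dominates at large sizes.
for n in (10, 100, 500):
    print(n, matmul_ops(n), sum_ops(n))
```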
Performance Benchmarks
Matrix Size | Creation (ms) | Multiplication (ms) | Determinant (ms) | Memory (KB) |
---|---|---|---|---|
10x10 | <1 | <1 | <1 | 0.8 |
50x50 | <1 | <1 | <1 | 20 |
100x100 | <1 | <1 | <1 | 80 |
200x200 | <1 | ~200 | <1 | 320 |
500x500 | <1 | ~1,200 | <1 | 2,000 |
10,000x1 | <1 | ~530 | N/A | 80 |
Note: Performance scales linearly with data size (~0.053ms per sample for linear regression operations).
Optimization Strategies
1. Matrix Size Optimization
Real-time Applications
/* For real-time applications, keep matrices small */
// ✅ Good - Fast response
small_matrix = [[1, 2], [3, 4]].vector();
result = small_matrix.dot(small_matrix); // < 1ms
// ⚠️ Moderate - Good performance for most use cases
large_matrix = create_large_matrix(200); // ~200ms for 200x200
result = large_matrix.dot(large_matrix);
Recommendations:
- Keep matrices < 50x50 for sub-second response
- Use < 200x200 for interactive applications (good performance)
- Use < 500x500 for batch processing (reasonable performance)
- Consider breaking very large problems (> 1000x1000) into smaller blocks
Batch Processing
/* For batch processing, monitor memory usage */
// ✅ Good - Manageable memory
medium_matrix = create_matrix(200); // 320KB memory
result = medium_matrix.cov(); // Good performance
// ⚠️ Monitor - Large memory usage
large_matrix = create_matrix(500); // 2MB memory
result = large_matrix.cov(); // Monitor performance
2. Algorithm Selection
Matrix Multiplication
/* Use appropriate algorithms for your use case */
// For small matrices - standard multiplication is fine
small_result = mat_a.dot(mat_b); // Fast for < 100x100
// For large matrices - consider alternatives
if (mat_a.shape().getfield(0) > 100) {
    // Consider breaking into smaller blocks
    result = block_multiply(mat_a, mat_b);
} else {
    result = mat_a.dot(mat_b);
}
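`block_multiply` above is a hypothetical helper, not a built-in Grapa function. A minimal sketch of the underlying idea, written in Python for illustration: the total work is still O(n³), but each tile is small enough to stay cache-resident, which is what makes blocking pay off on large matrices.

```python
def block_multiply(a, b, block=2):
    """Blocked (tiled) matrix multiply over lists of lists.

    Same O(n^3) arithmetic as the naive algorithm, but processed in
    small block x block tiles so each tile fits in cache.
    """
    n = len(a)
    c = [[0] * n for _ in range(n)]
    for ii in range(0, n, block):
        for jj in range(0, n, block):
            for kk in range(0, n, block):
                # min(...) clamps the tile at the matrix edge, so sizes
                # that are not a multiple of `block` are handled too.
                for i in range(ii, min(ii + block, n)):
                    for j in range(jj, min(jj + block, n)):
                        s = c[i][j]
                        for k in range(kk, min(kk + block, n)):
                            s += a[i][k] * b[k][j]
                        c[i][j] = s
    return c
```

Tuning `block` to the cache size of the target machine is the usual design knob here; the result is identical to a straight multiply either way.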
Statistical Functions
/* Choose efficient statistical operations */
// ✅ Fast - Good for all sizes
sum_result = data.sum();
mean_result = data.mean();
// ⚠️ Moderate - Good for < 200x200
cov_result = data.cov();
// ❌ Slow - Use only for < 50x50
eigen_result = data.eigh();
3. Memory Management
Pre-allocation
/* Pre-allocate matrices when possible */
// ✅ Good - Reuse allocated memory
matrix = create_matrix(50);
for (i = 0; i < 1000; i++) {
    result = matrix.dot(matrix); // Reuse same matrix
}
// ❌ Poor - Repeated allocation
for (i = 0; i < 1000; i++) {
    matrix = create_matrix(50); // Allocate each time
    result = matrix.dot(matrix);
}
Memory Monitoring
/* Monitor memory usage for large operations */
large_matrix = create_matrix(200);
estimated_memory = large_matrix.shape().reduce(op(acc, dim){acc * dim}, 1) * 8 / 1024;
("Estimated memory usage: " + estimated_memory + " KB").echo();
4. Data Type Optimization
Choose Appropriate Types
/* Use INT for integer data, FLOAT for decimal */
// ✅ Good - Use INT for integer data
integer_data = [1, 2, 3, 4, 5].vector();
// ✅ Good - Use FLOAT for decimal data
decimal_data = [1.5, 2.7, 3.2, 4.1, 5.9].vector();
// ⚠️ Consider - Precision vs performance trade-off
high_precision = [1.123456789, 2.987654321].vector();
5. Precision Optimization
System-Level Precision Control
For performance-critical applications, you can significantly improve vector operation speed by reducing floating-point precision:
/* Set system precision for performance optimization */
// ✅ High performance - 32-bit precision (~2x faster)
32.setfloat(0);
result = large_matrix.dot(large_matrix); // Much faster
// ✅ Balanced - 64-bit precision (good speed/accuracy)
64.setfloat(0);
result = large_matrix.dot(large_matrix); // Good performance
// ✅ High accuracy - 128-bit precision (default)
128.setfloat(0);
result = large_matrix.dot(large_matrix); // Maximum accuracy
Performance Impact:
- 32-bit precision: ~2x faster than 128-bit precision
- 64-bit precision: ~1.1x faster than 128-bit precision
- 128-bit precision: Maximum accuracy (default)
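Grapa's `setfloat()` precision control has a close analogue in arbitrary-precision libraries: lowering the working precision shrinks both the digits produced and the work done per operation. A sketch of the same idea using Python's standard-library `decimal` module (an analogy for illustration, not Grapa's implementation):

```python
from decimal import Decimal, getcontext

def divide_at_precision(prec):
    # The context precision caps significant digits, and with them the
    # per-operation cost: fewer digits means less work per operation.
    getcontext().prec = prec
    return Decimal(1) / Decimal(7)

low = divide_at_precision(10)   # faster, 10 significant digits
high = divide_at_precision(50)  # slower, 50 significant digits
```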
Fixed-Point vs Floating-Point Accuracy
Grapa automatically switches between floating-point and fixed-point representations depending on the mathematical function, with internal optimizations in GrapaFloat.cpp that choose the best representation for each operation. You can set a system preference:
/* Set system preference for floating-point representation */
32.setfloat(0); // Sets system preference, but Grapa optimizes internally
result = matrix.dot(matrix); // Grapa chooses optimal representation per operation
/* Set system preference for fixed-point representation */
32.setfix(0); // Sets system preference, but Grapa optimizes internally
result = matrix.dot(matrix); // Grapa chooses optimal representation per operation
System Behavior:
- Automatic optimization: Grapa's internal functions automatically choose the best representation for each mathematical operation
- System preference: setfloat() and setfix() set a system preference, but Grapa overrides this when it knows better
- Same bit precision: Performance is similar; accuracy is optimized per operation
- Default recommendation: Use setfloat() as the default unless you have specific requirements
When the Choice Matters:
- Financial calculations: May benefit from a setfix() preference for decimal precision
- Scientific calculations: May benefit from a setfloat() preference for large dynamic ranges
- Most applications: The system default choice is sufficient due to internal optimizations
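The fixed-point vs floating-point distinction is easy to see outside Grapa as well. In Python (illustrative only), binary floating point cannot represent 0.1 exactly, while a decimal type can, which is why financial code often prefers decimal/fixed representations:

```python
from decimal import Decimal

# Ten additions of binary-float 0.1 accumulate representation error...
float_total = sum([0.1] * 10)
# ...while decimal values stay exact.
fixed_total = sum([Decimal("0.1")] * 10)

print(float_total)  # slightly off from 1.0
print(fixed_total)  # exactly 1.0
```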
Precision Performance Example
/* Linear regression with different precision settings */
n_samples = 10000;
// 32-bit precision - Fast (Grapa optimizes representation internally)
32.setfloat(0);
start_time = $TIME().utc();
result_32bit = perform_linear_regression(n_samples);
time_32bit = start_time.ms(); // ~277ms
// 128-bit precision - Maximum accuracy (Grapa optimizes representation internally)
128.setfloat(0);
start_time = $TIME().utc();
result_128bit = perform_linear_regression(n_samples);
time_128bit = start_time.ms(); // ~529ms
// 32-bit is ~1.9x faster with minimal accuracy loss
Use Case Optimization
Real-time Applications
Requirements: Sub-second response time
Recommended Matrix Size: < 50x50
/* Real-time optimization strategies */
// 1. Use small matrices
small_matrix = [[1, 2], [3, 4]].vector();
// 2. Pre-compute when possible
precomputed_result = expensive_operation(small_matrix);
// 3. Use fast operations
fast_result = small_matrix.sum(); // O(n²) - very fast
// 4. Avoid expensive operations
// ❌ Avoid in real-time
eigen_result = small_matrix.eigh(); // O(n³) with iterations
Interactive Applications
Requirements: < 5 second response time
Recommended Matrix Size: < 200x200
/* Interactive optimization strategies */
// 1. Use moderate matrix sizes
medium_matrix = create_matrix(50);
// 2. Provide progress feedback
("Computing...").echo();
result = medium_matrix.dot(medium_matrix);
("Complete!").echo();
// 3. Use appropriate operations
cov_result = medium_matrix.cov(); // Good performance
Batch Processing
Requirements: Efficient processing of large datasets
Recommended Matrix Size: < 1000x1000
/* Batch processing optimization strategies */
// 1. Monitor memory usage
large_matrix = create_matrix(200);
memory_usage = estimate_memory(large_matrix);
// 2. Use memory-efficient operations
sum_result = large_matrix.sum(); // Memory efficient
// 3. Consider breaking large problems
if (large_matrix.shape().getfield(0) > 200) {
    result = process_in_blocks(large_matrix);
} else {
    result = process_directly(large_matrix);
}
Data Science Applications
Requirements: Accurate results with reasonable performance
Recommended Matrix Size: < 500x500
/* Data science optimization strategies */
// 1. Use appropriate statistical functions
data = load_dataset();
cov_matrix = data.cov(); // Good for data analysis
// 2. Consider data characteristics
if (is_sparse(data)) {
    result = sparse_operations(data);
} else {
    result = dense_operations(data);
}
// 3. Use efficient algorithms
// For correlation analysis
correlation = data.cov(); // More efficient than manual calculation
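For readers who want to see what a covariance call is doing under the hood, here is a plain Python sketch of the sample covariance of two columns (a hypothetical helper for illustration; Grapa's `.cov()` conventions, such as the normalization denominator, may differ):

```python
def covariance(xs, ys):
    """Sample covariance of two equal-length sequences (n - 1 normalization)."""
    n = len(xs)
    mx = sum(xs) / n  # column means
    my = sum(ys) / n
    # One pass over the paired deviations: O(n) per column pair,
    # O(n^2 * m) for a full m-column covariance matrix.
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
```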
Edge Case Performance
Empty and Small Matrices
/* Edge cases perform excellently */
empty_vec = [].vector();
empty_sum = empty_vec.sum(); // Returns {"error":-1} for empty vectors
small_mat = [[1]].vector();
small_det = small_mat.det(); // 0ms - very fast
Special Matrix Types
/* Special matrices are optimized */
identity = [[1, 0], [0, 1]].vector();
id_det = identity.det(); // 0ms - very fast
sparse = [[1, 0, 0], [0, 1, 0], [0, 0, 1]].vector();
sparse_det = sparse.det(); // 0ms - very fast
Extreme Values
/* Extreme values handled efficiently */
large_nums = [[1e15, 2e15], [3e15, 4e15]].vector();
large_det = large_nums.det(); // Handled correctly
small_nums = [[1e-15, 2e-15], [3e-15, 4e-15]].vector();
small_det = small_nums.det(); // Handled correctly
Performance Monitoring
Timing Operations
/* Monitor operation performance */
start_time = $TIME().utc();
result = matrix.dot(matrix);
end_time = $TIME().utc();
operation_time = ((end_time - start_time) / 1000000).int();
("Operation took: " + operation_time + "ms").echo();
Memory Estimation
/* Estimate memory usage */
estimate_memory = op(matrix) {
    elements = matrix.shape().reduce(op(acc, dim){acc * dim}, 1);
    bytes = elements * 8; // 8 bytes per element
    kb = bytes / 1024;
    kb;
};
memory_usage = estimate_memory(my_matrix);
("Estimated memory: " + memory_usage + " KB").echo();
Best Practices Summary
Do's
- ✅ Use matrices < 50x50 for real-time applications
- ✅ Use matrices < 200x200 for interactive applications (good performance)
- ✅ Use matrices < 1000x1000 for batch processing
- ✅ Pre-allocate matrices when possible
- ✅ Use appropriate data types (INT vs FLOAT)
- ✅ Monitor memory usage for large matrices
- ✅ Use fast operations (sum, mean) for large datasets
- ✅ Consider breaking very large problems (> 1000x1000) into smaller blocks
- ✅ Use sequential loops for large datasets with simple operations
- ✅ Use parallel .map()/.filter() for smaller datasets with complex operations
- ✅ Prefer .reduce() for large datasets when possible (more efficient)
- ✅ Use 32.setfloat(0) for machine learning applications requiring maximum speed
- ✅ Use 64.setfloat(0) for balanced speed/accuracy in most applications
- ✅ Use 128.setfloat(0) for applications requiring maximum precision
Don'ts
- ❌ Don't use very large matrices (> 1000x1000) for real-time applications
- ❌ Don't repeatedly allocate large matrices
- ❌ Don't use eigenvalue calculations for large matrices (> 100x100)
- ❌ Don't ignore memory usage for large datasets
- ❌ Don't use expensive operations when fast alternatives exist
- ❌ Don't use .map()/.filter() on very large datasets with simple operations
- ❌ Don't ignore the copy overhead of parallel operations on large datasets
Performance Checklist
Before using vector operations, consider:
- Matrix Size: Is it appropriate for your use case?
- Operation Type: Are you using the most efficient operation?
- Memory Usage: Do you have sufficient memory?
- Data Types: Are you using appropriate data types?
- Pre-allocation: Can you reuse allocated matrices?
- Precision Settings: Can you use lower precision for better performance?
- Monitoring: Are you tracking performance and memory usage?
Real-World Performance Validation
Linear Regression Example Results
Recent testing with a real-world linear regression implementation demonstrates Grapa's vector performance:
Dataset Size | Training Time (128-bit) | Training Time (32-bit) | Performance |
---|---|---|---|
100 samples | 5.642ms | ~3ms (estimated) | ~0.056ms per sample |
10,000 samples | 529.003ms | 276.634ms | ~0.053ms per sample |
Key Insights:
- Linear scaling: Performance scales predictably with data size
- Real-world ready: 10,000 samples processed in under 1 second
- Machine learning capable: Suitable for practical ML applications
- Consistent performance: ~0.053ms per sample across different dataset sizes
- Precision optimization: 32-bit precision provides ~1.9x speedup with minimal accuracy loss
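The timings above came from a Grapa implementation; as a point of reference, the same single-pass least-squares fit can be sketched in a few lines of standard-library Python (illustrative only). The single O(n) pass over the samples is what produces the linear, per-sample scaling reported in the table:

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b.

    One O(n) pass accumulates the four sums the closed-form solution
    needs, which is why training time scales linearly with sample count.
    """
    n = len(xs)
    sx = sum(xs)
    sy = sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)  # slope
    b = (sy - a * sx) / n                          # intercept
    return a, b
```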
Conclusion
Grapa's vector operations provide excellent performance for real-world use cases, including machine learning applications. By following these optimization guidelines, you can achieve:
- Sub-second response for real-time applications
- Efficient processing for batch operations (10K+ samples in <1 second)
- Optimal memory usage for large datasets
- Robust error handling for edge cases
- Machine learning ready performance for practical applications
The key is choosing the right matrix size and operations for your specific use case, while monitoring performance and memory usage appropriately. Grapa's vector implementation is sufficiently fast for real-world machine learning and data science applications.