Table of Contents
- Block Design & Architecture
- Pipeline Organization
- Performance Optimization
- Testing & Validation
- Error Handling & Reliability
- Resource Management
- Collaboration & Version Control
- Monitoring & Observability
- Security & Governance
- Deployment & CI/CD
Block Design & Architecture
Single Responsibility Principle
Each block should have a single, well-defined purpose. This makes blocks easier to understand, test, and maintain. A focused loader block is sketched below.
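A minimal sketch of a focused loader block, following Mage's standard block template (the endpoint URL is hypothetical); it only loads raw data and leaves cleaning to a downstream transformer block:

```python
import io

import pandas as pd
import requests

if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader
if 'test' not in globals():
    from mage_ai.data_preparation.decorators import test


@data_loader
def load_orders(*args, **kwargs):
    """Load raw orders from the API; cleaning happens in a separate transformer."""
    # Hypothetical endpoint; replace with your real source.
    response = requests.get('https://api.example.com/v1/orders', timeout=30)
    response.raise_for_status()
    return pd.read_csv(io.StringIO(response.text))


@test
def test_output(output, *args) -> None:
    # Only verify that this block did its one job: loading data.
    assert output is not None, 'The output is undefined'
```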
Modularity & Reusability
Design blocks to be reusable across pipelines and projects. Use clear, descriptive names and avoid hardcoded values.

Block Naming Conventions
Use clear, descriptive names that indicate the block’s purpose:
- Data loaders: `load_<source>_<entity>` (e.g., `load_salesforce_contacts`)
- Transformers: `transform_<action>_<entity>` (e.g., `transform_clean_customer_data`)
- Data exporters: `export_<entity>_to_<destination>` (e.g., `export_customers_to_warehouse`)
Organizing Blocks in Subdirectories
Organize blocks in subdirectories within their respective block type folders to improve project structure and maintainability.

✅ Benefits:
- Better organization for large projects
- Easier to find related blocks
- Clearer project structure
- Supports team collaboration
Create subdirectories by category:
- By source type: `api/`, `database/`, `file/`, `cloud/`
- By domain: `sales/`, `marketing/`, `finance/`
- By function: `validation/`, `cleaning/`, `aggregation/`
Using the UI:
- When creating a block, name it with the subdirectory: `subfolder_name/block_name`
- Example: naming a data loader `api/load_stripe` creates `data_loaders/api/load_stripe.py`
- The subdirectory is automatically created if it doesn’t exist

Example structure:
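An illustrative layout under the standard block type folders (the subdirectory and file names are examples, not requirements):

```
data_loaders/
├── api/
│   ├── load_stripe.py
│   └── load_salesforce_contacts.py
└── database/
    └── load_postgres_orders.py
transformers/
├── cleaning/
│   └── transform_clean_customer_data.py
└── aggregation/
    └── transform_daily_revenue.py
data_exporters/
└── warehouse/
    └── export_customers_to_warehouse.py
```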
Best practices:
- Use consistent subdirectory naming across block types
- Keep subdirectory depth to 1-2 levels maximum
- Group related blocks together logically
- Document your organization structure for your team
Block Size Guidelines
Keep blocks focused and reasonably sized:
- Ideal: 50-200 lines of code per block
- Maximum: 500 lines (consider splitting if larger)
- Minimum: Enough code to accomplish one clear task
Use Block Templates
Create reusable block templates for common patterns:
- Standard ETL patterns
- Data validation patterns
- Error handling templates
- Logging templates
Pipeline Organization
Choosing the Right Pipeline Type
Mage supports three main pipeline types, each optimized for different use cases. Understanding when to use each type is crucial for building efficient and maintainable data pipelines.

Data Integration Pipelines
Use data integration pipelines when:
- You need to sync data between systems (external to internal, or internal to external)
- You’re working with pre-built connectors for SaaS tools, databases, or data warehouses
- You want low-code/no-code data synchronization
- You need incremental sync capabilities with automatic state tracking
- You’re doing reverse ETL (syncing data from your warehouse to external systems)
Key characteristics:
- Uses Singer spec-compliant connectors
- Handles schema migrations automatically
- Supports incremental and full sync modes
- Minimal code required - mostly configuration
- Can be combined with batch pipelines for custom transformations
Common use cases:
- Syncing Salesforce contacts to your data warehouse
- Loading Stripe transaction data into BigQuery
- Exporting customer segments from Snowflake to HubSpot
- Syncing product data from PostgreSQL to MongoDB
Batch Pipelines
Use batch pipelines when:
- You need scheduled data processing (hourly, daily, weekly)
- You require custom transformations using Python, SQL, or R
- You’re building ETL/ELT workflows with complex business logic
- You need to process large volumes of historical data
- You want to combine multiple data sources and apply custom logic
- You’re building analytics pipelines for reporting and dashboards
Key characteristics:
- Runs on a schedule, completes, and exits
- Full flexibility with custom code blocks
- Can combine data integration blocks with custom transformers
- Supports backfilling historical data
- Ideal for data warehousing and analytics workloads
Common use cases:
- Daily revenue reporting with complex aggregations
- Monthly financial reconciliation
- Customer segmentation analysis
- Data quality checks and validation
- Feature engineering for machine learning
Streaming Pipelines
Use streaming pipelines when:
- You need real-time data processing (sub-second to second latency)
- You’re processing continuous data streams (Kafka, Kinesis, etc.)
- You require immediate insights or alerts
- You’re building real-time analytics or monitoring systems
- You need event-driven architectures
- Low latency is critical (milliseconds to seconds)
Key characteristics:
- Long-running processes that continuously monitor sources
- Processes data as it arrives (not in batches)
- Event-driven architecture
- Lower latency than batch processing
- Supports stateful processing with window aggregations
Common use cases:
- Real-time fraud detection
- Live dashboard updates
- IoT sensor data processing
- Real-time recommendation systems
- Event-driven data transformations
- Real-time data quality monitoring
Decision Matrix
| Requirement | Data Integration | Batch | Streaming |
|---|---|---|---|
| Latency | Minutes to hours | Minutes to hours | Milliseconds to seconds |
| Code Complexity | Low (mostly config) | High (custom code) | Medium (transformers) |
| Use Case | Data sync | Scheduled processing | Real-time processing |
| Data Volume | Small to large | Large | Continuous streams |
| Custom Logic | Limited | Full flexibility | Transformers only |
| State Management | Automatic | Manual | Automatic (optional) |
You can also combine pipeline types:
- Use data integration pipelines to sync data into your warehouse
- Use batch pipelines to transform and aggregate the synced data
- Use streaming pipelines for real-time processing of critical events
Pipeline Structure
Organize pipelines logically:
- Data Integration Pipelines: Extract and load data from sources
- Transformation Pipelines: Transform and clean data
- Analytics Pipelines: Generate reports and metrics
- Orchestration Pipelines: Coordinate multiple pipelines
Pipeline Naming
Use consistent naming conventions:
- Format: `<domain>_<purpose>_<frequency>` (e.g., `sales_daily_revenue_report`)
- Examples: `marketing_campaign_attribution_hourly`, `finance_monthly_reconciliation`, `product_user_behavior_realtime`
Pipeline Tagging
Use tags to organize and group related pipelines for easier discovery and management.

✅ Best Practices for Tagging:
- Use consistent tag names: Establish a tagging convention across your team
  - Domain tags: `sales`, `marketing`, `finance`, `product`
  - Type tags: `etl`, `analytics`, `reporting`, `integration`
  - Environment tags: `production`, `staging`, `development`
  - Priority tags: `critical`, `high`, `low`
- Apply multiple tags: Pipelines can have multiple tags for flexible grouping
- Group by tags: Use the pipeline list view to group and filter by tags
  - Right-click a pipeline → “Add/Remove tags”
  - Select or type tags (separated by Enter)
  - Filter and group pipelines by tag on the Pipeline page
- Tag naming conventions:
  - Use lowercase with underscores: `data_quality`, `customer_analytics`
  - Keep tags short and descriptive
  - Avoid special characters and spaces

✅ Benefits:
- Quickly find related pipelines
- Filter pipelines by domain, type, or environment
- Organize pipelines for different teams or projects
- Group pipelines for monitoring and alerting
Limit Pipeline Complexity
✅ Recommended:
- Block count: 10-30 blocks per pipeline
- Depth: Maximum 5-7 levels of dependencies
- Branches: Limit parallel branches to 5-10
If a pipeline grows beyond these limits:
- Split into multiple pipelines
- Use orchestration pipelines to coordinate
- Extract common logic into reusable blocks
Dependency Management
- Minimize dependencies: Only create dependencies where data actually flows
- Avoid circular dependencies: Design acyclic graphs
- Use clear data contracts: Document expected input/output schemas
Performance Optimization
Efficient Data Processing
Batch Processing: process records in fixed-size batches rather than one at a time, as in the sketch below.
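A sketch of batch-style processing with pandas, assuming a DataFrame input; the batch size and the per-batch logic are illustrative:

```python
import pandas as pd


def enrich(batch: pd.DataFrame) -> pd.DataFrame:
    # Placeholder per-batch logic; replace with your transformation.
    batch = batch.copy()
    batch['amount_usd'] = batch['amount_cents'] / 100
    return batch


def transform_in_batches(df: pd.DataFrame, batch_size: int = 10_000) -> pd.DataFrame:
    # Process the DataFrame in fixed-size slices instead of row by row.
    batches = [
        enrich(df.iloc[start:start + batch_size])
        for start in range(0, len(df), batch_size)
    ]
    return pd.concat(batches, ignore_index=True)
```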
Leverage Spark for Large Datasets
For datasets larger than memory or requiring distributed processing, run the heavy lifting on Spark, for example:
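A minimal PySpark sketch, assuming Spark is available to the pipeline; the paths, columns, and aggregation are illustrative:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName('daily_revenue').getOrCreate()

# Read a dataset that is too large for a single machine's memory.
orders = spark.read.parquet('s3://example-bucket/orders/')  # illustrative path

daily_revenue = (
    orders
    .filter(F.col('status') == 'completed')
    .groupBy('order_date')
    .agg(F.sum('amount').alias('revenue'))
)

daily_revenue.write.mode('overwrite').parquet('s3://example-bucket/daily_revenue/')
```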
Optimize API Calls
Batch API Requests: group many small requests into fewer, larger calls when the API supports it, for example:
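A sketch of batching API calls, assuming a hypothetical endpoint that accepts multiple IDs per request:

```python
import requests


def fetch_customers(customer_ids, batch_size=100):
    """Fetch customers in batches instead of one request per ID."""
    results = []
    for start in range(0, len(customer_ids), batch_size):
        batch = customer_ids[start:start + batch_size]
        # Hypothetical endpoint that accepts a comma-separated list of IDs.
        response = requests.get(
            'https://api.example.com/v1/customers',
            params={'ids': ','.join(batch)},
            timeout=30,
        )
        response.raise_for_status()
        results.extend(response.json()['customers'])
    return results
```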
Memory Management
Stream Large Files: read large files in chunks instead of loading them fully into memory, for example:
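A sketch of streaming a large CSV in chunks with pandas instead of loading it all at once; the path, chunk size, and grouping column are illustrative:

```python
import pandas as pd


def summarize_large_csv(path: str = 'data/events_large.csv') -> pd.DataFrame:
    totals = []
    # Read the file in 100k-row chunks so memory usage stays bounded.
    for chunk in pd.read_csv(path, chunksize=100_000):
        totals.append(chunk.groupby('event_type').size())
    return pd.concat(totals).groupby(level=0).sum().to_frame('event_count')
```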
Pipeline Performance
Choose the Right Executor:
- local_python executor: Use for faster execution when you have sufficient local resources. Blocks run in the same process or in separate processes on the same machine, which reduces overhead and improves performance for small to medium-sized pipelines.
- k8s executor: Use for scalability when you need more resources or want to distribute block execution across multiple pods. Ideal for large-scale pipelines, resource-intensive transformations, or when you need to scale horizontally.

Configure process execution:
- run_pipeline_in_one_process: true: Speeds up pipeline execution by running all blocks in a single process. This reduces process-creation overhead and improves performance for pipelines with many small blocks or when blocks need to share memory efficiently.
- run_pipeline_in_one_process: false: Runs blocks in separate processes (for the local_python executor) or separate pods (for the k8s executor). Use this when:
  - Blocks need isolation (e.g., different resource requirements)
  - You want to leverage parallel execution across multiple processes/pods
  - Memory isolation is important to prevent one block from affecting others
Optimize block execution:
- Use dynamic blocks for parallel processing
- Leverage conditional execution for optional steps
- Cache intermediate results when appropriate
Manage concurrency at multiple levels:
- Global concurrency: Limit the maximum number of concurrent block runs across all pipelines
- Pipeline-level concurrency: Control concurrent block runs and pipeline runs per trigger
- Per-block concurrency: Set specific concurrency limits for individual blocks
Testing & Validation
Write Tests in Blocks
Mage’s built-in testing framework makes it easy to write tests alongside your code, for example:
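A sketch of block-level tests using Mage's @test decorator, which runs test functions against the block's output after it executes; the column names are illustrative:

```python
if 'test' not in globals():
    from mage_ai.data_preparation.decorators import test


@test
def test_no_empty_output(output, *args) -> None:
    assert output is not None, 'The output is undefined'
    assert len(output) > 0, 'The output is empty'


@test
def test_required_columns(output, *args) -> None:
    required = {'customer_id', 'order_date', 'amount'}
    missing = required - set(output.columns)
    assert not missing, f'Missing columns: {missing}'
```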
Data Quality Checks
Schema Validation: check that incoming data matches the expected columns and types before passing it downstream, for example:
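A sketch of schema validation in a transformer, checking columns and dtypes against an expected schema defined in the block; the schema itself is illustrative:

```python
import pandas as pd

EXPECTED_SCHEMA = {
    'customer_id': 'int64',
    'email': 'object',
    'signup_date': 'datetime64[ns]',
}


def validate_schema(df: pd.DataFrame) -> pd.DataFrame:
    missing = set(EXPECTED_SCHEMA) - set(df.columns)
    if missing:
        raise ValueError(f'Missing expected columns: {sorted(missing)}')
    for column, expected_dtype in EXPECTED_SCHEMA.items():
        actual = str(df[column].dtype)
        if actual != expected_dtype:
            raise TypeError(f'{column}: expected {expected_dtype}, got {actual}')
    return df
```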
Test-Driven Development
Write tests before or alongside implementation:
- Define expected behavior in test functions
- Implement the logic to pass tests
- Run tests automatically on each block execution
- Refactor while keeping tests green
Error Handling & Reliability
Graceful Error Handling
Handle Expected Errors: catch failure modes you can anticipate, handle them explicitly, and let unexpected errors surface with context, for example:
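A sketch of handling an expected failure mode (a missing source file) explicitly while still surfacing unexpected errors; the file path is illustrative:

```python
import logging

import pandas as pd

logger = logging.getLogger(__name__)


def load_daily_export(path: str) -> pd.DataFrame:
    try:
        return pd.read_csv(path)
    except FileNotFoundError:
        # Expected case: the upstream system has not delivered today's file yet.
        logger.warning('Daily export not found at %s; returning empty frame', path)
        return pd.DataFrame()
    except pd.errors.ParserError as exc:
        # Unexpected but actionable: fail loudly with context.
        raise ValueError(f'Malformed export file at {path}') from exc
```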
Retry Logic
Implement Retry for Transient Failures: You can implement retry logic in your code for handling transient failures, or use Mage’s built-in automatic retry feature for block runs. Code-level retry:
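A code-level retry sketch with exponential backoff for transient HTTP failures; the URL, attempt count, and delays are illustrative. When the whole block can simply be re-executed, the built-in automatic retry for block runs covers the same class of failures without any code.

```python
import time

import requests


def fetch_with_retry(url: str, max_attempts: int = 3, base_delay: float = 2.0) -> dict:
    for attempt in range(1, max_attempts + 1):
        try:
            response = requests.get(url, timeout=30)
            response.raise_for_status()
            return response.json()
        except (requests.ConnectionError, requests.Timeout):
            if attempt == max_attempts:
                raise
            # Exponential backoff: 2s, 4s, 8s, ...
            time.sleep(base_delay * 2 ** (attempt - 1))
```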
Idempotency
Design blocks to be idempotent (safe to run multiple times), for example by overwriting a deterministic output partition instead of appending:
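A sketch of an idempotent exporter that overwrites one deterministic file per run date, so rerunning the block leaves the same end state; the path layout is illustrative:

```python
from pathlib import Path

import pandas as pd


def export_daily_revenue(df: pd.DataFrame, run_date: str, base_dir: str = 'exports') -> Path:
    # One deterministic file per run date, overwritten on rerun:
    # running the block twice for the same date produces the same result.
    out_path = Path(base_dir) / f'daily_revenue_{run_date}.parquet'
    out_path.parent.mkdir(parents=True, exist_ok=True)
    df.to_parquet(out_path, index=False)
    return out_path
```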
Checkpointing
For long-running pipelines, implement checkpointing so a rerun can resume from the last completed step:
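A minimal checkpointing sketch that records completed steps in a JSON file so a rerun can skip them; the checkpoint location and step mapping are illustrative:

```python
import json
from pathlib import Path

CHECKPOINT = Path('.checkpoints/nightly_backfill.json')


def load_completed() -> set:
    if CHECKPOINT.exists():
        return set(json.loads(CHECKPOINT.read_text()))
    return set()


def mark_completed(step: str) -> None:
    done = load_completed() | {step}
    CHECKPOINT.parent.mkdir(parents=True, exist_ok=True)
    CHECKPOINT.write_text(json.dumps(sorted(done)))


def run_steps(steps: dict) -> None:
    """steps maps a step name to a zero-argument callable."""
    done = load_completed()
    for name, step_fn in steps.items():
        if name in done:
            continue  # already finished in a previous run
        step_fn()
        mark_completed(name)
```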
Resource Management
Compute Resources
Right-Size Your Resources:
- Use appropriate compute clusters for workload size
- Monitor CPU and memory usage in Cluster Manager
- Scale up for large transformations, scale down for simple tasks
For Spark workloads:
- Enable Spark for datasets > 10GB
- Configure Spark resources appropriately
- Monitor Spark job performance
Connection Management
Reuse Connections: open one database connection per block and reuse it across queries instead of reconnecting for each call, for example:
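A sketch of reusing one connection across several queries; psycopg2 and the queries are illustrative, and the same pattern applies to any database client:

```python
import psycopg2  # illustrative driver; swap in the client for your database


def run_reports(dsn: str) -> dict:
    # Open one connection and reuse it for every query in this block.
    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            cur.execute('SELECT count(*) FROM orders')
            order_count = cur.fetchone()[0]
            cur.execute('SELECT count(*) FROM customers')
            customer_count = cur.fetchone()[0]
        return {'order_count': order_count, 'customer_count': customer_count}
    finally:
        conn.close()
```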
Memory Optimization
Process Data in Chunks: avoid loading an entire dataset into memory when a chunked pass is enough, for example:
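A sketch of chunked reads from a database so only one chunk is in memory at a time; sqlite3, the query, and the chunk size are illustrative:

```python
import sqlite3

import pandas as pd


def count_events_by_type(db_path: str = 'analytics.db') -> pd.DataFrame:
    partial_counts = []
    conn = sqlite3.connect(db_path)
    try:
        # Stream the table in 50k-row chunks instead of one huge read.
        for chunk in pd.read_sql_query('SELECT event_type FROM events', conn, chunksize=50_000):
            partial_counts.append(chunk['event_type'].value_counts())
    finally:
        conn.close()
    return pd.concat(partial_counts).groupby(level=0).sum().to_frame('event_count')
```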
Collaboration & Version Control
Use Workspaces Effectively
Separate Environments:
- Development: For experimentation and testing
- Staging: For pre-production validation
- Production: For live data pipelines
Workspace best practices:
- Use consistent naming across workspaces
- Document workspace purposes
- Limit production workspace access
Version Control Integration
Commit Frequently:
- Commit after completing each block or feature
- Use descriptive commit messages
- Tag releases and important milestones
Branching Strategy:
- Use feature branches for new development
- Keep main/master branch stable
- Use pull requests for code review
Code Review
Review Checklist:
- Code follows team standards
- Tests are included and passing
- Error handling is appropriate
- Documentation is clear
- Performance considerations addressed
- Security concerns reviewed
Documentation
Document Your Blocks: give each block a docstring describing its inputs, outputs, and assumptions (a sketch follows the list below).

Document Your Pipelines:
- Document pipeline purpose and business logic
- Include data flow diagrams
- Note dependencies and requirements
- Document expected run times and resource needs
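A sketch of a documented transformer block whose docstring records the contract described above (inputs, outputs, assumptions); the column logic is illustrative:

```python
import pandas as pd

if 'transformer' not in globals():
    from mage_ai.data_preparation.decorators import transformer


@transformer
def transform_clean_customer_data(df: pd.DataFrame, *args, **kwargs) -> pd.DataFrame:
    """Standardize customer records for downstream analytics.

    Input: raw customers DataFrame with at least 'email' and 'country' columns.
    Output: same rows with emails lowercased and two-letter country codes.
    Assumptions: 'email' is never null upstream; duplicates are removed here.
    """
    df = df.copy()
    df['email'] = df['email'].str.lower()
    df['country'] = df['country'].str.upper().str.slice(0, 2)
    return df.drop_duplicates(subset='email')
```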
Monitoring & Observability
Logging
Use Structured Logging: emit searchable key/value context (pipeline, block, run date, row counts) with every log line, for example:
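A structured-logging sketch using Python's standard logging module, emitting one JSON object per line; the logger name and fields are illustrative:

```python
import json
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger('sales_daily_revenue_report')


def log_event(event: str, **context) -> None:
    # One JSON object per log line keeps logs easy to search and parse.
    logger.info(json.dumps({'event': event, **context}))


log_event('block_started', block='transform_daily_revenue', run_date='2024-01-01')
log_event('rows_processed', block='transform_daily_revenue', row_count=125_000)
```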
Monitoring Pipeline Health
Set Up Alerts:
- Pipeline failure alerts
- Performance degradation alerts
- Data quality threshold alerts
- Resource usage alerts
Track key metrics:
- Pipeline run duration
- Block execution times
- Data volume processed
- Error rates
- Resource utilization
Data Quality Monitoring
Track Data Quality Metrics: record row counts, duplicate counts, and null rates on every run so regressions are visible, for example:
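A sketch of computing basic data quality metrics each run and asserting thresholds; the metric set and thresholds are illustrative:

```python
import pandas as pd


def data_quality_metrics(df: pd.DataFrame, key_column: str) -> dict:
    return {
        'row_count': len(df),
        'duplicate_keys': int(df[key_column].duplicated().sum()),
        'null_rate_by_column': df.isna().mean().round(4).to_dict(),
    }


def check_thresholds(metrics: dict, min_rows: int = 1_000, max_null_rate: float = 0.05) -> None:
    assert metrics['row_count'] >= min_rows, f"Too few rows: {metrics['row_count']}"
    worst = max(metrics['null_rate_by_column'].values(), default=0.0)
    assert worst <= max_null_rate, f'Null rate too high: {worst:.2%}'
```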
Security & Governance
Authentication
User Authentication is Enabled by Default:
- Mage Pro: User authentication is always enabled by default (REQUIRE_USER_AUTHENTICATION=True)
- Mage OSS (0.9.78+): User authentication is enabled by default
- Mage OSS (0.8.4-0.9.77): Can be enabled by setting REQUIRE_USER_AUTHENTICATION=1
Restrict network access to your Mage instance:
- Load Balancer/Reverse Proxy: Configure IP whitelisting at the load balancer or reverse proxy level (e.g., Nginx, AWS ALB, Azure Application Gateway)
- Firewall Rules: Use network-level firewall rules to restrict access
- VPN/Private Network: Deploy Mage Pro in a private network accessible only via VPN
IP whitelisting best practices:
- Whitelist only necessary IP ranges (office networks, VPN endpoints)
- Regularly review and update the whitelist
- Use CIDR notation for network ranges
- Document all whitelisted IPs and their purposes
- Consider using a VPN for remote access instead of public IPs
Secrets Management
Use Mage Secrets:
- Store API keys, passwords, and tokens in Mage’s secrets management (see the sketch after this list)
- Never hardcode credentials in blocks
- Rotate secrets regularly
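A sketch of resolving a credential at runtime with Mage's get_secret_value helper instead of hardcoding it; the secret name and endpoint are illustrative:

```python
import requests

from mage_ai.data_preparation.shared.secrets import get_secret_value


def fetch_accounts():
    # Resolve the API key at runtime from Mage secrets; never commit it to code.
    api_key = get_secret_value('crm_api_key')  # illustrative secret name
    response = requests.get(
        'https://api.example.com/v1/accounts',
        headers={'Authorization': f'Bearer {api_key}'},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()
```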
Access Control
Follow Principle of Least Privilege:
- Grant minimum necessary permissions
- Use workspace-level access controls
- Review and audit access regularly
Data Privacy
Handle Sensitive Data Appropriately:
- Mask PII in logs and outputs
- Encrypt sensitive data at rest and in transit
- Comply with data protection regulations (GDPR, CCPA, etc.)
Code Security
Security Best Practices:
- Avoid executing user input directly
- Validate and sanitize all inputs
- Use parameterized queries for databases (see the sketch after this list)
- Keep dependencies updated
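A sketch contrasting a parameterized query with string formatting; sqlite3's ? placeholder is shown, and other drivers use their own placeholder style (e.g., %s for psycopg2):

```python
import sqlite3


def get_customer_orders(conn: sqlite3.Connection, customer_id: int):
    # Parameterized: the driver binds the value safely, preventing SQL injection.
    # Never build the query with string formatting, e.g.
    # f"SELECT ... WHERE customer_id = {customer_id}".
    cur = conn.execute(
        'SELECT order_id, amount FROM orders WHERE customer_id = ?',
        (customer_id,),
    )
    return cur.fetchall()
```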
Deployment & CI/CD
Use CI/CD Deployments (Mage Pro)
Deployment Workflow:
- Develop in development workspace
- Test in staging workspace
- Deploy to production using CI/CD
✅ Best practices:
- Automate deployments through CI/CD
- Use deployment pipelines for consistency
- Test deployments in staging first
- Roll back quickly if issues occur
Environment Management
Separate Configurations: keep environment-specific settings (schemas, connection profiles, resource sizes) out of block code and resolve them per environment, for example:
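A sketch of selecting environment-specific settings at runtime; the environment variable name, profiles, and settings are illustrative (Mage's io_config.yaml profiles serve a similar purpose for data connections):

```python
import os

# Illustrative per-environment settings; keep them out of block code.
CONFIGS = {
    'development': {'schema': 'analytics_dev', 'warehouse_size': 'small'},
    'staging': {'schema': 'analytics_staging', 'warehouse_size': 'small'},
    'production': {'schema': 'analytics', 'warehouse_size': 'large'},
}


def get_config() -> dict:
    # MAGE_ENV is an illustrative variable set per workspace/deployment.
    env = os.getenv('MAGE_ENV', 'development')
    return CONFIGS[env]
```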
Pipeline Scheduling
Schedule Appropriately:
- Match schedule to data freshness requirements
- Consider source system load
- Account for pipeline execution time
- Use appropriate time zones
Plan for failures:
- Set up retry policies using automatic retry for block runs
- Configure alerting for failures
- Implement dead letter queues for failed runs
Backfilling
Backfill Strategy:
- Use Mage’s built-in backfilling for historical data
- Process in chunks to avoid resource exhaustion
- Monitor backfill progress
- Validate backfilled data
Summary
Following these best practices will help you:
✅ Build reliable pipelines with proper error handling and testing
✅ Optimize performance through efficient data processing and resource management
✅ Maintain code quality with modular, reusable blocks
✅ Collaborate effectively using version control and workspaces
✅ Monitor and debug with comprehensive logging and observability
✅ Deploy safely using CI/CD and environment management

Remember: Start with these practices and adapt them to your team’s specific needs and constraints. The goal is to build maintainable, scalable, and reliable data pipelines.