Pipelines#
Pipelines are the core organizational unit in chartbook. A pipeline represents a complete analytics workflow that produces charts and dataframes.
What is a Pipeline?#
A pipeline is:
A self-contained analytics project
A collection of related charts and dataframes
A reproducible workflow with documentation
A unit that can be published and shared
Pipeline Structure#
A typical pipeline directory structure:
my-pipeline/
├── chartbook.toml # Configuration file
├── README.md # Pipeline documentation
├── dodo.py # Task automation (optional)
├── _data/ # Data files
│ ├── raw/ # Raw input data
│ └── processed/ # Processed data
├── _output/ # Generated outputs
│ ├── *.html # Interactive charts
│ └── *.ipynb # Notebooks
├── src/ # Source code
│ ├── data_processing.py
│ └── create_charts.py
├── docs_src/ # Documentation sources
│ ├── charts/ # Chart documentation
│ └── dataframes/ # Dataframe documentation
└── excel/ # Excel files (optional)
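If you are starting from scratch, the skeleton can be created with a few lines of Python. This is just a convenience sketch; the directory names mirror the tree above:
from pathlib import Path
# Create the standard pipeline skeleton shown above
for d in [
    "_data/raw", "_data/processed", "_output",
    "src", "docs_src/charts", "docs_src/dataframes", "excel",
]:
    Path("my-pipeline", d).mkdir(parents=True, exist_ok=True)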
Creating a Pipeline#
Step 1: Initialize Configuration#
Create a chartbook.toml file:
[config]
type = "pipeline"
chartbook_format_version = "0.0.6"
[site]
title = "My Analytics Pipeline"
author = "Data Team"
copyright = "2025"
[pipeline]
id = "ANALYTICS"
pipeline_name = "Business Analytics Pipeline"
pipeline_description = "Monthly business metrics and KPIs"
lead_pipeline_developer = "Jane Doe"
contributors = ["Jane Doe", "John Smith"]
git_repo_URL = "https://repository.yourcompany.org/scm/chart/repos/analytics"
README_file_path = "./README.md"
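chartbook itself validates this file during a build, but a quick self-check is easy to script. The sketch below uses Python's standard tomllib (Python 3.11+) and mirrors the required fields listed under Pipeline Configuration further down; it is illustrative, not part of the chartbook API:
import tomllib
from pathlib import Path
# Parse the config and verify the required [pipeline] fields are present
cfg = tomllib.loads(Path("chartbook.toml").read_text())
required = ["id", "pipeline_name", "pipeline_description", "lead_pipeline_developer"]
missing = [k for k in required if k not in cfg.get("pipeline", {})]
if missing:
    raise SystemExit(f"chartbook.toml is missing pipeline fields: {missing}")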
Step 2: Organize Your Data#
Store data in the _data directory:
import pandas as pd
from pathlib import Path
# Create the processed-data directory (matching the structure above)
data_dir = Path("_data/processed")
data_dir.mkdir(parents=True, exist_ok=True)
# Save processed data; process_raw_data() is your own function (see the sketch below)
df = process_raw_data()
df.to_parquet(data_dir / "metrics.parquet")
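process_raw_data() stands in for your own transformation logic. A minimal hypothetical implementation, assuming raw transaction CSVs in _data/raw/ with date, revenue, and costs columns, might look like:
import pandas as pd
from pathlib import Path
def process_raw_data() -> pd.DataFrame:
    # Hypothetical raw file name; replace with your actual source
    raw = pd.read_csv(Path("_data/raw/transactions.csv"), parse_dates=["date"])
    # Aggregate to monthly totals for charting
    monthly = raw.groupby(pd.Grouper(key="date", freq="MS"))[["revenue", "costs"]].sum()
    return monthly.reset_index()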
Step 3: Create Charts#
Generate charts in the _output directory:
import chartbook
from pathlib import Path
# Create output directory
output_dir = Path("_output")
output_dir.mkdir(exist_ok=True)
# Generate chart
chartbook.plotting.line(
df,
x="date",
y=["revenue", "costs"],
title="Revenue vs Costs",
).save(chart_id="revenue_costs", output_dir=output_dir)
Step 4: Document Everything#
Create documentation in docs_src:
# Revenue vs Costs Chart
This chart displays the relationship between revenue and costs over time.
## Key Insights
- Revenue growth outpaces cost growth
- Seasonal patterns evident in Q4
## Methodology
- Data sourced from financial system
- Monthly aggregation applied
Pipeline Configuration#
Required Fields#
Every pipeline must define:
[pipeline]
id = "UNIQUE_ID" # Unique identifier (uppercase)
pipeline_name = "Full Name" # Human-readable name
pipeline_description = "..." # Detailed description
lead_pipeline_developer = "Name"
Optional Fields#
Additional metadata:
[pipeline]
contributors = ["Name1", "Name2"]
software_modules_command = "module load python/3.11"
runs_on_grid_or_windows_or_other = "Windows/Linux"
git_repo_URL = "https://..."
README_file_path = "./README.md"
Best Practices#
1. Consistent Naming#
Use consistent naming conventions:
Pipeline ID: UPPERCASE_WITH_UNDERSCORES
File names: lowercase_with_underscores
Chart IDs: descriptive_chart_name
2. Version Control#
Track all pipeline files in git:
git init
git add chartbook.toml src/ docs_src/
git commit -m "Initial pipeline setup"
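Generated outputs usually do not belong in version control. A .gitignore along these lines keeps the repository clean (whether to also ignore _data depends on the size and sensitivity of your data; treat this as a starting point):
# .gitignore
_output/
_data/
__pycache__/
.ipynb_checkpoints/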
3. Data Management#
Store raw data separately from processed data
Use Parquet format for efficiency
Document data sources and transformations
Include data validation checks (see the sketch below)
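For the validation point above, even a few assertions catch most breakage early. A minimal sketch, assuming the metrics file and column names from the earlier examples:
import pandas as pd
df = pd.read_parquet("_data/processed/metrics.parquet")
# Verify the columns the charts depend on are present and well-formed
required = {"date", "revenue", "costs"}
missing = required - set(df.columns)
assert not missing, f"missing columns: {missing}"
assert df["date"].is_monotonic_increasing, "dates are out of order"
assert not df[["revenue", "costs"]].isna().any().any(), "unexpected null values"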
4. Reproducibility#
Make your pipeline reproducible:
# Set random seeds
import numpy as np
np.random.seed(42)
Document package versions in requirements.txt:
pandas==2.0.0
numpy==1.24.0
chartbook==0.0.6
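To pin the versions actually installed in your environment rather than writing them by hand, pip can generate this file:
pip freeze > requirements.txt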
5. Documentation#
Document at multiple levels:
README.md: Overall pipeline documentation
Chart docs: Individual chart explanations
Code comments: Implementation details
Dataframe docs: Data source information
Pipeline Automation#
Use dodo.py for task automation:
# dodo.py
from doit import task_params
@task_params([{"name": "year", "default": 2024, "type": int, "long": "year"}])
def task_process_data(year):
"""Process raw data for specified year."""
return {
'actions': [f'python src/process_data.py --year {year}'],
'file_dep': ['src/process_data.py'],
'targets': [f'_data/processed_{year}.parquet'],
}
def task_create_charts():
"""Generate all charts."""
return {
'actions': ['python src/create_all_charts.py'],
'file_dep': ['_data/processed_2024.parquet'],
'targets': ['_output/revenue_costs.html'],
}
Run all tasks:
doit
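Because task_process_data declares year via task_params with a long option, doit exposes it as a command-line flag, so a single task can be run for a different year:
doit process_data --year 2025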
Publishing Pipelines#
Prepare your pipeline for publication:
1. Clean up temporary files:
rm -rf __pycache__ .ipynb_checkpoints
2. Validate configuration:
chartbook build --project-dir .
3. Publish to staging:
chartbook publish --publish-dir ./staging
4. Review and publish to production:
chartbook publish
Troubleshooting#
Common Issues#
Missing dependencies: Ensure all required packages are installed
Path errors: Use relative paths in configuration
Data not found: Check file paths and extensions
Chart generation fails: Verify data types and column names
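For the last issue in particular, inspecting the dataframe interactively before generating charts usually reveals the problem quickly (the path here is the metrics file from the earlier examples):
import pandas as pd
df = pd.read_parquet("_data/processed/metrics.parquet")
print(df.dtypes)            # confirm date/numeric types match what the chart expects
print(df.columns.tolist())  # confirm column names match the x/y arguments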
Debugging Tips#
Use --keep-build-dirs to inspect intermediate files
Check logs in the _docs/_build/ directory
Validate TOML syntax with online validators
Test charts individually before full generation
Next Steps#
Learn about Charts to create visualizations
Understand Dataframes for data management