Git Integration for Users¶
Biotope makes metadata version control simple by using Git under the hood. If you know Git, you already know how to use biotope's version control!
Quick Start¶
Think of biotope as Git for your scientific metadata. The workflow is familiar:
# 1. Initialize your project
biotope init
# 2. Add your data files
biotope add data/raw/experiment.csv
# 3. Create metadata (like staging changes)
biotope annotate interactive --staged
# Or complete incomplete annotations
biotope annotate interactive --incomplete
# 4. Commit your metadata
biotope commit -m "Add RNA-seq dataset with quality metrics"
# 5. Share with others
biotope push
How It Works¶
Biotope stores your metadata in a .biotope/
folder that Git tracks automatically. Your data files stay in the data/
folder, which is excluded from Git tracking via .gitignore
. Biotope keeps track of your data files through metadata and checksums.
your-project/
├── .biotope/ # Your metadata (tracked by Git)
│ └── datasets/ # Metadata files with file references
├── data/ # Your data files (excluded from Git)
│ ├── raw/
│ └── processed/
├── .gitignore # Excludes data/ from Git tracking
└── .git/ # Git repository
Why Data Files Aren't in Git¶
Data files are intentionally excluded from Git tracking because:
- Size: Scientific data files are often large and would bloat the repository
- Metadata Tracking: Biotope tracks files through metadata in
.biotope/datasets/
- Data Integrity: SHA256 checksums ensure files haven't been corrupted
- Collaboration: Teams can share metadata without sharing large data files
- Flexibility: Different team members can have different data file locations
Commands You'll Use¶
biotope init
¶
Sets up your project and optionally initializes Git. Now includes project-level metadata collection for annotation pre-filling.
biotope init
# Follow the prompts to configure your project
# - Project name
# - Git integration
# - Knowledge graph (optional)
# - Project metadata (optional, for annotation pre-fill)
biotope config
¶
Manage project configuration and metadata settings.
# Set project-level metadata for annotation pre-fill
biotope config set-project-metadata
# View current project metadata
biotope config show-project-metadata
# Configure validation requirements
biotope config show-validation
biotope add
¶
Adds data files to your project and prepares them for metadata creation.
biotope mv
¶
Move tracked data files and update their metadata automatically.
biotope mv data/raw/experiment.csv data/processed/experiment.csv
biotope mv data/raw/old_name.csv data/raw/new_name.csv
biotope mv data/raw/ --recursive data/archive/ # Move entire directory
biotope mv data/raw/file.csv data/processed/ --force # Overwrite if exists
The mv
command:
- Moves data files to new locations
- Updates all metadata files to reflect the new paths
- Recalculates checksums for moved files
- Moves metadata files to mirror the new data file structure
- Stages changes for commit automatically
biotope status
¶
Shows what metadata changes are ready to commit.
biotope commit
¶
Saves your metadata changes (just like git commit
).
biotope commit -m "Add experiment dataset"
biotope commit -m "Update metadata" --author "Your Name <email@example.com>"
biotope commit -m "Fix typo" --amend # Fix last commit
biotope log
¶
Shows your metadata history.
biotope log # Full history
biotope log --oneline # One line per commit
biotope log -n 5 # Last 5 commits
biotope log --since "2024-01-01" # Commits since date
biotope push
/ biotope pull
¶
Share metadata with your team.
biotope push # Share your changes
biotope pull # Get latest changes from team
biotope pull --rebase # Pull with rebase
biotope check-data
¶
Verify your data files haven't been corrupted.
biotope check-data # Check all files
biotope check-data -f data/raw/experiment.csv # Check specific file
Your Git Knowledge Applies¶
Since biotope uses Git, all your Git skills work:
# Branching
git checkout -b new-experiment
biotope add data/raw/new-data.csv
biotope mv data/raw/new-data.csv data/processed/new-data.csv
biotope commit -m "Add new experiment"
git checkout main
git merge new-experiment
# Viewing changes
git diff .biotope/ # See metadata changes
git log -- .biotope/ # View metadata history
# Collaboration
git remote add origin https://github.com/team/project.git
biotope push
Understanding .gitignore¶
Biotope automatically creates a .gitignore
file that excludes the data/
directory from Git tracking. This means:
What's Excluded¶
data/
- All data files and subdirectoriesdownloads/
- Downloaded filestmp/
- Temporary files- Common development files (Python cache, IDE files, etc.)
What's Tracked¶
.biotope/
- All metadata and configurationconfig/
- User configuration filesschemas/
- Knowledge graph schema definitionsoutputs/
- Generated outputs (if small enough)
Benefits¶
- Clean Git Status:
git status
won't show data files as untracked - Focused Commits: Only metadata changes appear in Git history
- Small Repositories: Git repositories stay small and fast
- Team Collaboration: Share metadata without sharing large data files
Working with Data Files¶
Even though data files aren't in Git, biotope still tracks them:
# Add a data file (creates metadata, doesn't add to Git)
biotope add data/raw/experiment.csv
# Check what's tracked (shows metadata, not data files)
biotope status
# Verify data integrity
biotope check-data
# See all tracked files
git ls-files .biotope/
Common Workflows¶
Setting Up a New Project¶
# 1. Initialize project with project metadata
biotope init
# Enter project name, enable Git, set project metadata
# 2. Add your data files
biotope add data/raw/experiment.csv
# 3. Create metadata (pre-filled with project metadata)
biotope annotate interactive --staged
# 4. Commit and share
biotope commit -m "Add experiment dataset"
biotope push
Adding New Data¶
# 1. Add your data files
biotope add data/raw/new-experiment.csv
# 2. Create metadata (with project metadata pre-fill)
biotope annotate interactive --staged
# 3. Commit and share
biotope commit -m "Add new experiment: 24 samples, 3 conditions"
biotope push
Moving and Reorganizing Data¶
# 1. Move files to new locations
biotope mv data/raw/experiment.csv data/processed/experiment.csv
# 2. Move entire directories
biotope mv data/raw/experiment_1/ --recursive data/processed/experiment_1/
# 3. Rename files
biotope mv data/raw/old_name.csv data/raw/new_name.csv
# 4. Commit the reorganization
biotope commit -m "Reorganize data: move experiments to processed directory"
biotope push
Updating Existing Metadata¶
# 1. Check what needs updating
biotope status
# 2. Edit metadata files or re-annotate
biotope annotate interactive -f data/raw/experiment.csv
# 3. Commit changes
biotope commit -m "Update experiment description and add QC metrics"
Completing Incomplete Annotations¶
# 1. Check which files need annotation
biotope status
# 2. Complete annotations for all incomplete tracked files
biotope annotate interactive --incomplete
# 3. Commit the completed annotations
biotope commit -m "Complete metadata for all tracked datasets"
Managing Project Metadata¶
# 1. Set or update project metadata
biotope config set-project-metadata
# Enter: description, URL, creator, license, citation
# 2. View current project metadata
biotope config show-project-metadata
# 3. Use in annotation (automatically pre-filled)
biotope annotate interactive --staged
Working with Your Team¶
# 1. Get latest changes
biotope pull
# 2. Make your changes
biotope add data/raw/my-experiment.csv
biotope annotate interactive --staged
# 3. Share your work
biotope commit -m "Add my experiment dataset"
biotope push
Project Metadata Benefits¶
Setting up project-level metadata provides several benefits:
- Faster Annotation: Forms are pre-filled with project information
- Consistency: All datasets use the same project metadata
- Team Coordination: Everyone uses consistent project details
- Reduced Errors: Less manual entry means fewer typos
Example Project Metadata¶
project_metadata:
description: "Comprehensive protein structure analysis dataset"
url: "https://github.com/team/protein-project"
creator:
name: "Dr. Jane Smith"
email: "jane.smith@university.edu"
license: "MIT"
citation: "Smith, J. et al. (2024). Protein Structure Dataset. Nature Data."
Best Practices¶
Project Setup¶
- Set project metadata during initialization or early in the project
- Use consistent project metadata across all team members
- Update project metadata when project details change
Commit Messages¶
Write clear, descriptive commit messages:
# Good
biotope commit -m "Add RNA-seq dataset with quality metrics"
# Better
biotope commit -m "Add RNA-seq dataset: 24 samples, 3 conditions, QC passed"
Data Organization¶
Keep your data organized:
data/
├── raw/
│ ├── experiment_1/
│ │ ├── samples.csv
│ │ └── measurements.csv
│ └── experiment_2/
└── processed/
└── combined_results.csv
Regular Checks¶
- Run
biotope check-data
regularly to ensure data integrity - Use
biotope status
before committing to see what's changing - Keep metadata and data in sync
- Review project metadata periodically with
biotope config show-project-metadata
Troubleshooting¶
"Not in a Git repository"¶
"No changes to commit"¶
"Remote 'origin' not found"¶
Data integrity issues¶
Data files showing as untracked in Git¶
If you see data files in git status
as untracked:
# Check if .gitignore exists and includes data/
cat .gitignore
# If .gitignore is missing, create it:
echo "data/" >> .gitignore
echo "downloads/" >> .gitignore
echo "tmp/" >> .gitignore
# Or re-run biotope init to create a proper .gitignore
Want to track some data files in Git¶
If you need to track specific data files in Git (e.g., small configuration files):
# Force add specific files (overrides .gitignore)
git add -f data/config/small_config.csv
# Or modify .gitignore to be more specific
# Instead of "data/", use:
# data/*.csv
# data/*.txt
# !data/config/
Moving files with biotope mv¶
The biotope mv
command automatically handles metadata updates:
# Move a file and update its metadata
biotope mv data/raw/experiment.csv data/processed/experiment.csv
# Move a directory with all its tracked files
biotope mv data/raw/experiment_1/ --recursive data/processed/experiment_1/
# Force overwrite if destination exists
biotope mv data/raw/file.csv data/processed/file.csv --force
Note: Always use biotope mv
instead of the system mv
command for tracked files to ensure metadata stays in sync.
Moving data files¶
When you move data files, use the biotope mv
command to automatically update metadata:
# Move the file and update metadata automatically
biotope mv data/raw/old_location.csv data/raw/new_location.csv
# Commit the metadata change
biotope commit -m "Move data file to new location"
Or move entire directories:
# Move entire directory with all tracked files
biotope mv data/raw/experiment_1/ --recursive data/processed/experiment_1/
# Commit the changes
biotope commit -m "Move experiment_1 to processed directory"
What's Different from Git?¶
- Focus: Biotope focuses on metadata, not code
- Validation: Metadata is automatically validated before commits
- Checksums: Data integrity is tracked automatically
- Croissant ML: Metadata follows scientific standards
- File Operations:
biotope mv
automatically updates metadata when moving files
What's the Same as Git?¶
- Commands: Same workflow (add, mv, commit, push, pull)
- Options: Same flags and options work
- Collaboration: Same branching, merging, and remote workflows
- History: Same log, diff, and status functionality
That's it! Your Git knowledge transfers directly to biotope. The only difference is that you're versioning scientific metadata instead of code.
Annotation Validation and Status Reporting¶
Biotope now supports project-specific annotation requirements, allowing administrators to define what fields must be present in dataset metadata for it to be considered "annotated". This helps ensure data quality and consistency across your project.
How Annotation Status Works¶
- The
biotope status
command now shows, for each tracked and staged dataset, whether it is considered annotated (✅) or not (⚠️), based on the current project requirements. - The summary section reports how many datasets are annotated and how many are not.
What is "Annotated"?¶
A dataset is considered annotated if its metadata file (in .biotope/datasets/
) contains all required fields, and those fields meet the validation rules set by your project admin. By default, required fields include name
, description
, creator
, dateCreated
, and distribution
, but this can be customized.
Example: Status Output¶
$ biotope status
**Biotope Project Status**
Project: my-biotope
Location: /path/to/project
Git Repository: ✅
**Changes to be committed:**
Status File Annotated
A .biotope/datasets/mydata.jsonld ✅
**Tracked Datasets:**
Dataset Annotated Status
mydata ✅ Complete
rawdata ⚠️ Incomplete (2 issues)
**Summary:**
Staged: 1 file(s) (1 annotated, 0 unannotated)
Tracked datasets: 2 (1 annotated, 1 unannotated)
Customizing Annotation Requirements¶
Admins can configure what fields are required and how they are validated using the biotope config
command group.
Show Current Requirements¶
$ biotope config show-validation
**Annotation Validation Configuration**
Enabled: ✅
**Required Fields:**
Field Type Validation Rules
name string min_length: 1
description string min_length: 10
creator object required_keys: name
dateCreated string format: date
distribution array min_length: 1
Add a Required Field¶
Remove a Required Field¶
Enable/Disable Validation¶
Remote Validation Configuration¶
For institutional clusters or multi-site collaborations, you can use remote validation configurations to enforce consistent policies across all projects.
Set Remote Validation URL¶
# Set a remote validation configuration
$ biotope config set-remote-validation --url https://cluster.example.com/validation.yaml
# With custom cache duration (in seconds)
$ biotope config set-remote-validation --url https://cluster.example.com/validation.yaml --cache-duration 7200
# Disable fallback to local config if remote fails
$ biotope config set-remote-validation --url https://cluster.example.com/validation.yaml --no-fallback
Show Remote Validation Status¶
This shows: - Remote URL and configuration - Cache status and age - Effective configuration (remote + local merged)
Clear Validation Cache¶
Remove Remote Validation¶
Example Remote Configuration¶
# https://cluster.example.com/validation.yaml
annotation_validation:
enabled: true
minimum_required_fields:
- name
- description
- creator
- dateCreated
- distribution
- license
field_validation:
name:
type: string
min_length: 1
description:
type: string
min_length: 10
creator:
type: object
required_keys: [name]
license:
type: string
min_length: 5
How Remote Validation Works¶
- Caching: Remote configurations are cached locally for performance
- Merging: Local configurations can extend or override remote requirements
- Fallback: If remote is unavailable, falls back to local configuration
- Updates: Cache is refreshed based on configurable duration
Use Cases¶
- Institutional Clusters: Enforce consistent metadata standards
- Multi-site Collaborations: Share validation requirements
- Compliance: Ensure datasets meet regulatory requirements
- Quality Assurance: Maintain high metadata quality standards
See also: Admin documentation for advanced configuration and developer details.