Git Integration for Users

Biotope makes metadata version control simple by using Git under the hood. If you know Git, you already know how to use biotope's version control!

Quick Start

Think of biotope as Git for your scientific metadata. The workflow is familiar:

# 1. Initialize your project
biotope init

# 2. Add your data files
biotope add data/raw/experiment.csv

# 3. Create metadata (like staging changes)
biotope annotate interactive --staged

# Or complete incomplete annotations
biotope annotate interactive --incomplete

# 4. Commit your metadata
biotope commit -m "Add RNA-seq dataset with quality metrics"

# 5. Share with others
biotope push

How It Works

Biotope stores your metadata in a .biotope/ folder that Git tracks automatically. Your data files stay in the data/ folder, which is excluded from Git tracking via .gitignore. Biotope keeps track of your data files through metadata and checksums.

your-project/
├── .biotope/              # Your metadata (tracked by Git)
│   └── datasets/          # Metadata files with file references
├── data/                  # Your data files (excluded from Git)
│   ├── raw/
│   └── processed/
├── .gitignore             # Excludes data/ from Git tracking
└── .git/                  # Git repository

Why Data Files Aren't in Git

Data files are intentionally excluded from Git tracking because:

  • Size: Scientific data files are often large and would bloat the repository
  • Metadata Tracking: Biotope tracks files through metadata in .biotope/datasets/
  • Data Integrity: SHA256 checksums ensure files haven't been corrupted
  • Collaboration: Teams can share metadata without sharing large data files
  • Flexibility: Different team members can have different data file locations
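
To make the checksum idea concrete, here is a minimal sketch of how a SHA256 digest can be computed for a large data file without loading it into memory. It is not biotope's implementation; the function name `sha256_of_file` and the chunk size are assumptions made for this illustration.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks.

    Hypothetical helper for illustration; not part of biotope's API.
    """
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks so memory use stays flat for large files.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Because the digest depends only on the file's bytes, two team members who keep the same file in different locations would get the same value.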

Commands You'll Use

biotope init

Sets up your project and optionally initializes Git. It also collects project-level metadata, which is used to pre-fill annotations.

biotope init
# Follow the prompts to configure your project
# - Project name
# - Git integration
# - Knowledge graph (optional)
# - Project metadata (optional, for annotation pre-fill)

biotope config

Manage project configuration and metadata settings.

# Set project-level metadata for annotation pre-fill
biotope config set-project-metadata

# View current project metadata
biotope config show-project-metadata

# View current validation requirements
biotope config show-validation

biotope add

Adds data files to your project and prepares them for metadata creation.

biotope add data/raw/experiment.csv
biotope add data/raw/ --recursive  # Add entire directory

biotope mv

Move tracked data files and update their metadata automatically.

biotope mv data/raw/experiment.csv data/processed/experiment.csv
biotope mv data/raw/old_name.csv data/raw/new_name.csv
biotope mv data/raw/ --recursive data/archive/  # Move entire directory
biotope mv data/raw/file.csv data/processed/ --force  # Overwrite if exists

The mv command:

  • Moves data files to new locations
  • Updates all metadata files to reflect the new paths
  • Recalculates checksums for moved files
  • Moves metadata files to mirror the new data file structure
  • Stages changes for commit automatically

biotope status

Shows what metadata changes are ready to commit.

biotope status                    # See all changes
biotope status --biotope-only     # See only metadata changes

biotope commit

Saves your metadata changes (just like git commit).

biotope commit -m "Add experiment dataset"
biotope commit -m "Update metadata" --author "Your Name <email@example.com>"
biotope commit -m "Fix typo" --amend  # Fix last commit

biotope log

Shows your metadata history.

biotope log                       # Full history
biotope log --oneline             # One line per commit
biotope log -n 5                  # Last 5 commits
biotope log --since "2024-01-01"  # Commits since date

biotope push / biotope pull

Share metadata with your team.

biotope push                      # Share your changes
biotope pull                      # Get latest changes from team
biotope pull --rebase             # Pull with rebase

biotope check-data

Verify your data files haven't been corrupted.

biotope check-data                # Check all files
biotope check-data -f data/raw/experiment.csv  # Check specific file
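
Conceptually, an integrity check of this kind compares the checksum recorded for each file with the checksum of the file as it exists now. The sketch below illustrates that comparison only; `verify_files`, the `recorded` mapping, and the three status labels are hypothetical and do not describe how biotope stores or reports checksums.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Hex SHA256 digest of a file (reads the whole file; fine for a sketch)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def verify_files(recorded: dict, root: Path) -> dict:
    """Classify each recorded file as 'ok', 'changed', or 'missing'.

    `recorded` maps a path relative to `root` to its expected SHA256 digest.
    Both the mapping and the labels are assumptions for this example.
    """
    report = {}
    for relative_path, expected in recorded.items():
        target = root / relative_path
        if not target.is_file():
            report[relative_path] = "missing"
        elif file_sha256(target) == expected:
            report[relative_path] = "ok"
        else:
            report[relative_path] = "changed"
    return report
```

A file reported as changed or missing is the kind of case where you would re-download or regenerate the data.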

Your Git Knowledge Applies

Since biotope uses Git, all your Git skills work:

# Branching
git checkout -b new-experiment
biotope add data/raw/new-data.csv
biotope mv data/raw/new-data.csv data/processed/new-data.csv
biotope commit -m "Add new experiment"
git checkout main
git merge new-experiment

# Viewing changes
git diff .biotope/               # See metadata changes
git log -- .biotope/             # View metadata history

# Collaboration
git remote add origin https://github.com/team/project.git
biotope push

Understanding .gitignore

Biotope automatically creates a .gitignore file that excludes the data/ directory from Git tracking. The lists below show what is excluded and what remains tracked.

What's Excluded

  • data/ - All data files and subdirectories
  • downloads/ - Downloaded files
  • tmp/ - Temporary files
  • Common development files (Python cache, IDE files, etc.)

What's Tracked

  • .biotope/ - All metadata and configuration
  • config/ - User configuration files
  • schemas/ - Knowledge graph schema definitions
  • outputs/ - Generated outputs (if small enough)

Benefits

  • Clean Git Status: git status won't show data files as untracked
  • Focused Commits: Only metadata changes appear in Git history
  • Small Repositories: Git repositories stay small and fast
  • Team Collaboration: Share metadata without sharing large data files

Working with Data Files

Even though data files aren't in Git, biotope still tracks them:

# Add a data file (creates metadata, doesn't add to Git)
biotope add data/raw/experiment.csv

# Check what's tracked (shows metadata, not data files)
biotope status

# Verify data integrity
biotope check-data

# List the metadata files tracked by Git
git ls-files .biotope/

Common Workflows

Setting Up a New Project

# 1. Initialize project with project metadata
biotope init
# Enter project name, enable Git, set project metadata

# 2. Add your data files
biotope add data/raw/experiment.csv

# 3. Create metadata (pre-filled with project metadata)
biotope annotate interactive --staged

# 4. Commit and share
biotope commit -m "Add experiment dataset"
biotope push

Adding New Data

# 1. Add your data files
biotope add data/raw/new-experiment.csv

# 2. Create metadata (with project metadata pre-fill)
biotope annotate interactive --staged

# 3. Commit and share
biotope commit -m "Add new experiment: 24 samples, 3 conditions"
biotope push

Moving and Reorganizing Data

# 1. Move files to new locations
biotope mv data/raw/experiment.csv data/processed/experiment.csv

# 2. Move entire directories
biotope mv data/raw/experiment_1/ --recursive data/processed/experiment_1/

# 3. Rename files
biotope mv data/raw/old_name.csv data/raw/new_name.csv

# 4. Commit the reorganization
biotope commit -m "Reorganize data: move experiments to processed directory"
biotope push

Updating Existing Metadata

# 1. Check what needs updating
biotope status

# 2. Edit metadata files or re-annotate
biotope annotate interactive -f data/raw/experiment.csv

# 3. Commit changes
biotope commit -m "Update experiment description and add QC metrics"

Completing Incomplete Annotations

# 1. Check which files need annotation
biotope status

# 2. Complete annotations for all incomplete tracked files
biotope annotate interactive --incomplete

# 3. Commit the completed annotations
biotope commit -m "Complete metadata for all tracked datasets"

Managing Project Metadata

# 1. Set or update project metadata
biotope config set-project-metadata
# Enter: description, URL, creator, license, citation

# 2. View current project metadata
biotope config show-project-metadata

# 3. Use in annotation (automatically pre-filled)
biotope annotate interactive --staged

Working with Your Team

# 1. Get latest changes
biotope pull

# 2. Make your changes
biotope add data/raw/my-experiment.csv
biotope annotate interactive --staged

# 3. Share your work
biotope commit -m "Add my experiment dataset"
biotope push

Project Metadata Benefits

Setting up project-level metadata provides several benefits:

  1. Faster Annotation: Forms are pre-filled with project information
  2. Consistency: All datasets use the same project metadata
  3. Team Coordination: Everyone uses consistent project details
  4. Reduced Errors: Less manual entry means fewer typos

Example Project Metadata

project_metadata:
  description: "Comprehensive protein structure analysis dataset"
  url: "https://github.com/team/protein-project"
  creator:
    name: "Dr. Jane Smith"
    email: "jane.smith@university.edu"
  license: "MIT"
  citation: "Smith, J. et al. (2024). Protein Structure Dataset. Nature Data."
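
One way to picture pre-filling is as a dictionary merge: project-level values act as defaults, and anything entered for a specific dataset takes precedence. The sketch below assumes exactly that precedence, which this guide does not confirm; `prefill_annotation` is a hypothetical name.

```python
def prefill_annotation(project_metadata: dict, dataset_metadata: dict) -> dict:
    """Start from project-level defaults, then apply dataset-specific values.

    Hypothetical helper; the precedence rule is an assumption for illustration.
    """
    merged = dict(project_metadata)
    # Only non-empty dataset values replace the project defaults.
    merged.update(
        {key: value for key, value in dataset_metadata.items()
         if value not in (None, "", [], {})}
    )
    return merged


project = {
    "description": "Comprehensive protein structure analysis dataset",
    "license": "MIT",
    "creator": {"name": "Dr. Jane Smith", "email": "jane.smith@university.edu"},
}
dataset = {"name": "experiment", "description": "", "license": None}

annotation = prefill_annotation(project, dataset)
print(annotation["name"])     # experiment
print(annotation["license"])  # MIT
```

Empty dataset fields fall back to the project values, which is why the license and description above come from the project metadata.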

Best Practices

Project Setup

  • Set project metadata during initialization or early in the project
  • Use consistent project metadata across all team members
  • Update project metadata when project details change

Commit Messages

Write clear, descriptive commit messages:

# Good
biotope commit -m "Add RNA-seq dataset with quality metrics"

# Better
biotope commit -m "Add RNA-seq dataset: 24 samples, 3 conditions, QC passed"

Data Organization

Keep your data organized:

data/
├── raw/
│   ├── experiment_1/
│   │   ├── samples.csv
│   │   └── measurements.csv
│   └── experiment_2/
└── processed/
    └── combined_results.csv

Regular Checks

  • Run biotope check-data regularly to ensure data integrity
  • Use biotope status before committing to see what's changing
  • Keep metadata and data in sync
  • Review project metadata periodically with biotope config show-project-metadata

Troubleshooting

"Not in a Git repository"

# Initialize Git
git init
# Or run biotope init which offers Git initialization

"No changes to commit"

# Check if files are staged
biotope status
# Stage changes if needed
git add .biotope/

"Remote 'origin' not found"

# Add remote repository
git remote add origin https://github.com/username/repo.git

Data integrity issues

# Check for corrupted files
biotope check-data
# Re-download or regenerate corrupted files

Data files showing as untracked in Git

If you see data files in git status as untracked:

# Check if .gitignore exists and includes data/
cat .gitignore

# If .gitignore is missing, create it:
echo "data/" >> .gitignore
echo "downloads/" >> .gitignore
echo "tmp/" >> .gitignore

# Or re-run biotope init to create a proper .gitignore

Want to track some data files in Git

If you need to track specific data files in Git (e.g., small configuration files):

# Force add specific files (overrides .gitignore)
git add -f data/config/small_config.csv

# Or modify .gitignore to be more specific
# Instead of "data/", ignore everything directly under data/
# except the config/ directory:
# data/*
# !data/config/

Moving files with biotope mv

The biotope mv command automatically handles metadata updates:

# Move a file and update its metadata
biotope mv data/raw/experiment.csv data/processed/experiment.csv

# Move a directory with all its tracked files
biotope mv data/raw/experiment_1/ --recursive data/processed/experiment_1/

# Force overwrite if destination exists
biotope mv data/raw/file.csv data/processed/file.csv --force

Note: Always use biotope mv instead of the system mv command for tracked files to ensure metadata stays in sync.

Moving data files

When you move data files, use the biotope mv command to automatically update metadata:

# Move the file and update metadata automatically
biotope mv data/raw/old_location.csv data/raw/new_location.csv

# Commit the metadata change
biotope commit -m "Move data file to new location"

Or move entire directories:

# Move entire directory with all tracked files
biotope mv data/raw/experiment_1/ --recursive data/processed/experiment_1/

# Commit the changes
biotope commit -m "Move experiment_1 to processed directory"

What's Different from Git?

  • Focus: Biotope focuses on metadata, not code
  • Validation: Metadata is automatically validated before commits
  • Checksums: Data integrity is tracked automatically
  • Croissant ML: Metadata follows the Croissant ML standard for describing datasets
  • File Operations: biotope mv automatically updates metadata when moving files

What's the Same as Git?

  • Commands: Same workflow (add, mv, commit, push, pull)
  • Options: Familiar flags such as -m, --amend, --oneline, and --rebase work as they do in Git
  • Collaboration: Same branching, merging, and remote workflows
  • History: Same log, diff, and status functionality

That's it! Your Git knowledge transfers directly to biotope. The main difference is that you're versioning scientific metadata instead of code.

Annotation Validation and Status Reporting

Biotope now supports project-specific annotation requirements, allowing administrators to define what fields must be present in dataset metadata for it to be considered "annotated". This helps ensure data quality and consistency across your project.

How Annotation Status Works

  • The biotope status command now shows, for each tracked and staged dataset, whether it is considered annotated (✅) or not (⚠️), based on the current project requirements.
  • The summary section reports how many datasets are annotated and how many are not.

What is "Annotated"?

A dataset is considered annotated if its metadata file (in .biotope/datasets/) contains all required fields, and those fields meet the validation rules set by your project admin. By default, required fields include name, description, creator, dateCreated, and distribution, but this can be customized.
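
The required-field part of this rule can be sketched in a few lines. The field names are the defaults listed above; the functions `missing_fields` and `is_annotated` are hypothetical and only check presence, not the per-field validation rules.

```python
DEFAULT_REQUIRED_FIELDS = ("name", "description", "creator", "dateCreated", "distribution")


def missing_fields(metadata: dict, required=DEFAULT_REQUIRED_FIELDS) -> list:
    """Return the required fields that are absent or empty in `metadata`."""
    return [field for field in required if not metadata.get(field)]


def is_annotated(metadata: dict, required=DEFAULT_REQUIRED_FIELDS) -> bool:
    """A dataset counts as annotated when no required field is missing."""
    return not missing_fields(metadata, required)


rawdata = {
    "name": "rawdata",
    "description": "Raw sequencing reads",
    "creator": {"name": "Jane"},
}
print(missing_fields(rawdata))  # ['dateCreated', 'distribution']
print(is_annotated(rawdata))    # False
```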

Example: Status Output

$ biotope status

**Biotope Project Status**
Project: my-biotope
Location: /path/to/project
Git Repository: ✅

**Changes to be committed:**
Status  File                              Annotated
A       .biotope/datasets/mydata.jsonld   ✅

**Tracked Datasets:**
Dataset         Annotated   Status
mydata          ✅          Complete
rawdata         ⚠️          Incomplete (2 issues)

**Summary:**
  Staged: 1 file(s) (1 annotated, 0 unannotated)
  Tracked datasets: 2 (1 annotated, 1 unannotated)

Customizing Annotation Requirements

Admins can configure what fields are required and how they are validated using the biotope config command group.

Show Current Requirements

$ biotope config show-validation

**Annotation Validation Configuration**
Enabled: ✅

**Required Fields:**
Field         Type      Validation Rules
name          string    min_length: 1
description   string    min_length: 10
creator       object    required_keys: name
dateCreated   string    format: date
distribution  array     min_length: 1
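
To illustrate how rules like these might be applied to a metadata record, the sketch below checks type, min_length, required_keys, and a date format. It is a rough model rather than biotope's validator: the `RULES` dictionary mirrors the table above, the function names are hypothetical, and treating "format: date" as an ISO date (YYYY-MM-DD) is an assumption.

```python
from datetime import date

# Mirrors the default rules shown above; the structure is an assumption.
RULES = {
    "name": {"type": "string", "min_length": 1},
    "description": {"type": "string", "min_length": 10},
    "creator": {"type": "object", "required_keys": ["name"]},
    "dateCreated": {"type": "string", "format": "date"},
    "distribution": {"type": "array", "min_length": 1},
}

PYTHON_TYPES = {"string": str, "object": dict, "array": list}


def validate_field(name: str, value, rule: dict) -> list:
    """Return a list of human-readable issues for one field."""
    if not isinstance(value, PYTHON_TYPES[rule["type"]]):
        return [f"{name}: expected {rule['type']}"]
    issues = []
    if "min_length" in rule and len(value) < rule["min_length"]:
        issues.append(f"{name}: shorter than {rule['min_length']}")
    for key in rule.get("required_keys", []):
        if key not in value:
            issues.append(f"{name}: missing key '{key}'")
    if rule.get("format") == "date":
        try:
            date.fromisoformat(value)
        except ValueError:
            issues.append(f"{name}: not an ISO date (YYYY-MM-DD)")
    return issues


def validate(metadata: dict, rules: dict = RULES) -> list:
    """Collect issues for every required field, including absent ones."""
    issues = []
    for name, rule in rules.items():
        if name not in metadata:
            issues.append(f"{name}: missing")
        else:
            issues.extend(validate_field(name, metadata[name], rule))
    return issues
```

An empty result from `validate` would correspond to a dataset that meets every rule; each returned string corresponds to one issue to fix.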

Add a Required Field

$ biotope config set-validation --field license --type string --min-length 3

Remove a Required Field

$ biotope config remove-validation --field license

Enable/Disable Validation

$ biotope config toggle-validation --enabled
$ biotope config toggle-validation --disabled

Remote Validation Configuration

For institutional clusters or multi-site collaborations, you can use remote validation configurations to enforce consistent policies across all projects.

Set Remote Validation URL

# Set a remote validation configuration
$ biotope config set-remote-validation --url https://cluster.example.com/validation.yaml

# With custom cache duration (in seconds)
$ biotope config set-remote-validation --url https://cluster.example.com/validation.yaml --cache-duration 7200

# Disable fallback to local config if remote fails
$ biotope config set-remote-validation --url https://cluster.example.com/validation.yaml --no-fallback

Show Remote Validation Status

$ biotope config show-remote-validation

This shows:

  • Remote URL and configuration
  • Cache status and age
  • Effective configuration (remote + local merged)

Clear Validation Cache

$ biotope config clear-validation-cache

Remove Remote Validation

$ biotope config remove-remote-validation

Example Remote Configuration

# https://cluster.example.com/validation.yaml
annotation_validation:
  enabled: true
  minimum_required_fields:
    - name
    - description
    - creator
    - dateCreated
    - distribution
    - license
  field_validation:
    name:
      type: string
      min_length: 1
    description:
      type: string
      min_length: 10
    creator:
      type: object
      required_keys: [name]
    license:
      type: string
      min_length: 5

How Remote Validation Works

  1. Caching: Remote configurations are cached locally for performance
  2. Merging: Local configurations can extend or override remote requirements
  3. Fallback: If remote is unavailable, falls back to local configuration
  4. Updates: Cache is refreshed based on configurable duration

Use Cases

  • Institutional Clusters: Enforce consistent metadata standards
  • Multi-site Collaborations: Share validation requirements
  • Compliance: Ensure datasets meet regulatory requirements
  • Quality Assurance: Maintain high metadata quality standards
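
The caching, merging, and fallback behavior listed under How Remote Validation Works can be modeled roughly as follows. This is a sketch under stated assumptions, not biotope's code: the class `CachedRemoteConfig`, the injected `fetch` callable (standing in for the HTTP request), the one-hour default duration, and the shallow merge in which local keys win are all hypothetical.

```python
import time
from typing import Callable, Optional


class CachedRemoteConfig:
    """Cache a remotely fetched config and fall back to a local one on failure."""

    def __init__(self, fetch: Callable[[], dict], local: dict,
                 cache_duration: float = 3600.0, fallback: bool = True):
        self._fetch = fetch                  # stands in for the remote request
        self._local = local
        self._cache_duration = cache_duration  # seconds; default is an assumption
        self._fallback = fallback
        self._cached: Optional[dict] = None
        self._fetched_at = 0.0

    def _remote(self) -> dict:
        """Return the cached remote config, refetching once it has expired."""
        expired = time.monotonic() - self._fetched_at > self._cache_duration
        if self._cached is None or expired:
            self._cached = self._fetch()
            self._fetched_at = time.monotonic()
        return self._cached

    def effective(self) -> dict:
        """Merge remote and local config; use local alone if remote fails."""
        try:
            remote = self._remote()
        except Exception:
            if not self._fallback:
                raise
            return dict(self._local)
        # Shallow merge: local keys extend or override remote ones.
        return {**remote, **self._local}
```

Passing the fetch step in as a callable keeps the sketch independent of any particular HTTP library and makes the cache and fallback paths easy to exercise in isolation.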

See also: Admin documentation for advanced configuration and developer details.