Biotope Init¶

Draft stage

Biotope is in draft stage. Functionality may be missing or incomplete. The API is subject to change.

Overview¶

The biotope init command initializes a new biotope project with interactive configuration. It sets up the necessary directory structure and configuration files for metadata management.

Features¶

Interactive Configuration¶

The init process guides you through several configuration options:

Project Name: Set a name for your biotope project
Git Integration: Choose whether to initialize Git version control
Knowledge Graph: Optionally install a knowledge graph for enhanced data management
Output Format: Select output format (only shown if knowledge graph is enabled)
Project Metadata: Collect project-level metadata for annotation pre-filling

Project-Level Metadata Collection¶

During initialization, you can optionally provide project-level metadata, which is then used to pre-fill annotation fields:

Description: Brief description of the project and its purpose
URL: Project homepage, repository, or documentation URL
Creator: Name and contact information of the project maintainer
License: Data usage license (e.g., MIT, CC-BY)
Citation: How to cite the project or dataset

This metadata is stored in .biotope/config/biotope.yaml and automatically loaded when using biotope annotate interactive.
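
The pre-filling idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper `prefill_annotation` is hypothetical, and the rule that explicit annotation values take precedence over project-level defaults is an assumption, not the documented behaviour of biotope annotate interactive. The example dict mirrors the project_metadata fields listed above.

```python
from copy import deepcopy

# Example project_metadata section from .biotope/config/biotope.yaml,
# already parsed into a dict (values are placeholders).
PROJECT_METADATA = {
    "description": "Project description",
    "url": "https://example.com/project",
    "creator": {"name": "John Doe", "email": "john@example.com"},
    "license": "MIT",
    "citation": "Doe, J. (2024). Project Title. Journal Name.",
}


def prefill_annotation(annotation: dict, project_metadata: dict) -> dict:
    """Return a copy of `annotation` with missing fields taken from project metadata.

    Hypothetical helper: the real merge logic in biotope is not shown here.
    """
    result = deepcopy(annotation)
    for key, value in project_metadata.items():
        # Assumption: values already set in the annotation win over defaults.
        if key not in result or result[key] in (None, ""):
            result[key] = deepcopy(value)
    return result


draft = prefill_annotation(
    {"name": "my-dataset", "license": "CC-BY-4.0"}, PROJECT_METADATA
)
print(draft["license"])          # CC-BY-4.0 (explicit value kept)
print(draft["creator"]["name"])  # John Doe (filled from project metadata)
```

Copying the values with `deepcopy` keeps later edits to one annotation from leaking into the shared project metadata.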

Conditional Output Format Selection¶

The output format prompt appears only if you choose to install a knowledge graph, because the setting applies only to knowledge graph functionality.
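
A minimal sketch of this conditional flow, assuming a generic prompt callback: `collect_init_options` and its `ask` parameter are hypothetical names used for illustration, and the question wording and the `"some-format"` answer are placeholders rather than biotope's actual prompts or formats.

```python
from typing import Callable, Optional


def collect_init_options(ask: Callable[[str], str]) -> dict:
    """Gather init answers; `ask` maps a question to the user's answer.

    Hypothetical helper that only mirrors the order of questions described above.
    """
    options = {
        "project_name": ask("Project name"),
        "use_git": ask("Initialize Git? [y/n]").lower().startswith("y"),
        "install_kg": ask("Install knowledge graph? [y/n]").lower().startswith("y"),
    }
    # The output format question is skipped unless a knowledge graph is requested.
    output_format: Optional[str] = None
    if options["install_kg"]:
        output_format = ask("Output format")
    options["output_format"] = output_format
    return options


# Scripted answers stand in for interactive input.
answers = iter(["demo", "y", "n"])
opts = collect_init_options(lambda question: next(answers))
print(opts["output_format"])  # None, because no knowledge graph was requested
```

Passing the prompt function as an argument keeps the branching logic testable without a terminal.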

Usage¶

biotope init [OPTIONS]

Options¶

--dir, -d: Directory to initialize biotope project in (default: current directory)

Example¶

# Initialize in current directory
biotope init

# Initialize in specific directory
biotope init --dir /path/to/project

Configuration File Structure¶

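The initialization creates a .biotope/config/biotope.yaml file with the following structure. The project_metadata section is written only if project metadata was provided, and the key order in the generated file may differ from the listing because the configuration is serialized with yaml.dump, which sorts keys alphabetically by default: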

version: "1.0"
croissant_schema_version: "1.0"
default_metadata_template: "scientific"
data_storage:
  type: "local"
  path: "data"
checksum_algorithm: "sha256"
auto_stage: true
commit_message_template: "Update metadata: {description}"

# Project information (consolidated from internal metadata)
project_info:
  name: "my-project"
  created_at: "2024-01-01T00:00:00Z"
  biotope_version: "0.1.0"
  last_modified: "2024-01-01T00:00:00Z"
  builds: []
  knowledge_sources: []

# Project-level metadata for annotation pre-fill
project_metadata:
  description: "Project description"
  url: "https://example.com/project"
  creator:
    name: "John Doe"
    email: "john@example.com"
  license: "MIT"
  citation: "Doe, J. (2024). Project Title. Journal Name."

# Validation configuration
annotation_validation:
  enabled: true
  minimum_required_fields:
    - "name"
    - "description"
    - "creator"
    - "dateCreated"
    - "distribution"
  field_validation:
    name:
      type: "string"
      min_length: 1
    description:
      type: "string"
      min_length: 10
    creator:
      type: "object"
      required_keys: ["name"]
    dateCreated:
      type: "string"
      format: "date"
    distribution:
      type: "array"
      min_length: 1
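
To make the annotation_validation rules above more concrete, here is a hedged sketch of how such rules could be applied to a metadata record. The function `validate`, the `PY_TYPES` mapping, and the interpretation of `min_length`, `required_keys`, and `format: "date"` (taken here to mean an ISO 8601 date) are assumptions for illustration; biotope's own validator may behave differently.

```python
from datetime import date

# Rules copied from the annotation_validation section of the configuration.
VALIDATION = {
    "minimum_required_fields": [
        "name", "description", "creator", "dateCreated", "distribution",
    ],
    "field_validation": {
        "name": {"type": "string", "min_length": 1},
        "description": {"type": "string", "min_length": 10},
        "creator": {"type": "object", "required_keys": ["name"]},
        "dateCreated": {"type": "string", "format": "date"},
        "distribution": {"type": "array", "min_length": 1},
    },
}

# Assumed mapping from the type names in the config to Python types.
PY_TYPES = {"string": str, "object": dict, "array": list}


def validate(record: dict, rules: dict = VALIDATION) -> list[str]:
    """Return a list of problems; an empty list means the record passes."""
    errors = [
        f"missing field: {name}"
        for name in rules["minimum_required_fields"]
        if name not in record
    ]
    for field, rule in rules["field_validation"].items():
        if field not in record:
            continue  # already reported as missing, if required
        value = record[field]
        if not isinstance(value, PY_TYPES[rule["type"]]):
            errors.append(f"{field}: expected {rule['type']}")
            continue
        if len(value) < rule.get("min_length", 0):
            errors.append(f"{field}: shorter than {rule['min_length']}")
        for key in rule.get("required_keys", []):
            if key not in value:
                errors.append(f"{field}: missing key {key}")
        if rule.get("format") == "date":
            try:
                date.fromisoformat(value)  # assumption: ISO 8601 dates
            except ValueError:
                errors.append(f"{field}: not an ISO date")
    return errors


record = {
    "name": "my-dataset",
    "description": "short",
    "creator": {},
    "dateCreated": "2024-01-01",
}
problems = validate(record)
print(problems)
```

For this record the sketch reports the missing distribution field, the too-short description, and the creator object without a name key.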

Directory Structure¶

The init command creates the following directory structure:

project-root/
├── .biotope/
│   ├── config/
│   │   └── biotope.yaml          # Consolidated configuration (Git-like)
│   ├── datasets/                 # Croissant ML metadata files
│   ├── workflows/                # Bioinformatics workflow definitions
│   └── logs/                     # Command execution logs
├── config/
│   └── biotope.yaml              # User-facing configuration
├── data/
│   ├── raw/
│   └── processed/
├── schemas/
└── outputs/

Note: The configuration follows a Git-like approach where .biotope/config/biotope.yaml contains all biotope-specific configuration, similar to how Git uses .git/config for its configuration.
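
The layout shown in the tree above can be checked with a short script. This is a sketch only: `missing_dirs` is a hypothetical helper, not part of biotope, and the demonstration builds a throwaway directory that imitates the layout instead of running biotope init.

```python
import tempfile
from pathlib import Path

# Directories from the tree above, relative to the project root.
EXPECTED_DIRS = [
    ".biotope/config",
    ".biotope/datasets",
    ".biotope/workflows",
    ".biotope/logs",
    "config",
    "data/raw",
    "data/processed",
    "schemas",
    "outputs",
]


def missing_dirs(project_root: Path) -> list[str]:
    """Return the expected directories that do not exist under `project_root`."""
    return [d for d in EXPECTED_DIRS if not (project_root / d).is_dir()]


# Demonstration on a temporary directory that imitates the layout.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for d in EXPECTED_DIRS[:-1]:  # deliberately leave out "outputs"
        (root / d).mkdir(parents=True, exist_ok=True)
    report = missing_dirs(root)
    print(report)  # ['outputs']
```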

Managing Project Metadata¶

After initialization, you can manage project metadata using the biotope config command:

# Set project metadata
biotope config set-project-metadata

# Show current project metadata
biotope config show-project-metadata

Initialize command implementation.

`create_project_structure(directory, config, metadata, project_metadata=None)` ¶

Create the project directory structure and configuration files.

Parameters:

Name	Type	Description	Default
`directory`	`Path`	Project directory path	required
`config`	`dict`	User-facing configuration dictionary	required
`metadata`	`dict`	Internal metadata dictionary (now consolidated into biotope config)	required
`project_metadata`	`dict`	Project-level metadata for pre-filling annotations	`None`

Source code in biotope/commands/init.py

from pathlib import Path
from typing import Optional

import yaml


def create_project_structure(
    directory: Path,
    config: dict,
    metadata: dict,
    project_metadata: Optional[dict] = None,
) -> None:
    """
    Create the project directory structure and configuration files.

    Args:
        directory: Project directory path
        config: User-facing configuration dictionary
        metadata: Internal metadata dictionary (now consolidated into biotope config)
        project_metadata: Project-level metadata for pre-filling annotations

    """
    # Create directory structure - git-on-top layout
    dirs = [
        ".biotope",
        ".biotope/config",  # Configuration for biotope project
        ".biotope/datasets",  # Stores Croissant ML JSON-LD files
        ".biotope/workflows",  # Bioinformatics workflow definitions
        ".biotope/logs",  # Command execution logs
        "config",
        "data",
        "data/raw",
        "data/processed",
        "schemas",
        "outputs",
    ]

    for d in dirs:
        (directory / d).mkdir(parents=True, exist_ok=True)

    # Create user-facing config file
    (directory / "config" / "biotope.yaml").write_text(
        yaml.dump(config, default_flow_style=False),
    )

    # Create consolidated biotope config (Git-like approach)
    biotope_config = {
        "version": "1.0",
        "croissant_schema_version": "1.0",
        "default_metadata_template": "scientific",
        "data_storage": {"type": "local", "path": "data"},
        "checksum_algorithm": "sha256",
        "auto_stage": True,
        "commit_message_template": "Update metadata: {description}",
        "annotation_validation": {
            "enabled": True,
            "minimum_required_fields": [
                "name",
                "description",
                "creator",
                "dateCreated",
                "distribution",
            ],
            "field_validation": {
                "name": {"type": "string", "min_length": 1},
                "description": {"type": "string", "min_length": 10},
                "creator": {"type": "object", "required_keys": ["name"]},
                "dateCreated": {"type": "string", "format": "date"},
                "distribution": {"type": "array", "min_length": 1},
            },
        },
        # Consolidate internal metadata into config (Git-like approach)
        "project_info": {
            "name": metadata.get("project_name"),
            "created_at": metadata.get("created_at"),
            "biotope_version": metadata.get("biotope_version"),
            "last_modified": metadata.get("last_modified"),
            "builds": metadata.get("builds", []),
            "knowledge_sources": metadata.get("knowledge_sources", []),
        },
    }

    # Add project metadata if provided
    if project_metadata:
        biotope_config["project_metadata"] = project_metadata

    (directory / ".biotope" / "config" / "biotope.yaml").write_text(
        yaml.dump(biotope_config, default_flow_style=False),
    )

    # Create .gitignore file to exclude data files and other common files
    gitignore_content = """# Biotope data files (not tracked in Git)
# Data files are tracked through metadata in .biotope/datasets/
/data/
/downloads/
/tmp/

# Python
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# Virtual environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# IDEs
.vscode/
.idea/
*.swp
*.swo
*~

# OS
.DS_Store
.DS_Store?
._*
.Spotlight-V100
.Trashes
ehthumbs.db
Thumbs.db

# Jupyter
.ipynb_checkpoints
*/.ipynb_checkpoints/*

# Logs
*.log
logs/

# Temporary files
*.tmp
*.temp
"""
    (directory / ".gitignore").write_text(gitignore_content)

    # Note: No custom refs needed - Git handles all version control

    # Create README
    readme_content = f"""# {config["project"]["name"]}

A BioCypher knowledge graph project managed with biotope.

## Project Structure

- `config/`: User configuration files
- `data/`: Data files (not tracked in Git)
  - `raw/`: Raw input data
  - `processed/`: Processed data
- `schemas/`: Knowledge schema definitions
- `outputs/`: Generated knowledge graphs
- `.biotope/`: Biotope project management (Git-tracked)
  - `datasets/`: Croissant ML metadata files
  - `workflows/`: Bioinformatics workflow definitions
  - `config/`: Biotope configuration (Git-like approach)
  - `logs/`: Command execution history

## Git Integration

This project uses Git for metadata version control. The `.biotope/` directory is tracked by Git, allowing you to:
- Version control your metadata changes
- Collaborate with others on metadata
- Use standard Git tools and workflows

**Note**: Data files in the `data/` directory are intentionally excluded from Git tracking via `.gitignore`. This is because:
- Data files are often large and would bloat the repository
- Data files are tracked through metadata in `.biotope/datasets/`
- Checksums ensure data integrity without storing the actual files

## Getting Started

1. Add data files: `biotope add <data_file>`
2. Create metadata: `biotope annotate interactive --staged`
3. Check status: `biotope status`
4. Commit changes: `biotope commit -m "Add new dataset"`
5. View history: `biotope log`
6. Push/pull: `biotope push` / `biotope pull`

## Standard Git Commands

You can also use standard Git commands:
- `git status` - See all project changes
- `git log -- .biotope/` - View metadata history
- `git diff .biotope/` - See metadata changes
"""
    (directory / "README.md").write_text(readme_content)
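
As a usage sketch, the following shows how the `config` and `metadata` arguments could be assembled, based only on the keys the function above reads: `config["project"]["name"]` for the README and the `metadata.get(...)` lookups for project_info. The helper `build_init_inputs` is hypothetical, and any further keys that biotope init places in `config` are not shown in the source and are omitted here.

```python
from datetime import datetime, timezone


def build_init_inputs(project_name: str, biotope_version: str) -> tuple[dict, dict]:
    """Build `config` and `metadata` dicts in the shape create_project_structure reads.

    Hypothetical helper for illustration; not part of biotope.
    """
    now = datetime.now(timezone.utc).isoformat()
    # The README template reads config["project"]["name"].
    config = {"project": {"name": project_name}}
    # Keys looked up via metadata.get(...) when project_info is assembled.
    metadata = {
        "project_name": project_name,
        "created_at": now,
        "biotope_version": biotope_version,
        "last_modified": now,
        "builds": [],
        "knowledge_sources": [],
    }
    return config, metadata


config, metadata = build_init_inputs("my-project", "0.1.0")
# With biotope installed, the call would then look like:
# create_project_structure(Path("my-project"), config, metadata, project_metadata=None)
```

Because the function creates directories with `exist_ok=True` and rewrites the configuration, .gitignore, and README files, calling it on an existing project overwrites those files.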

`init(dir)` ¶

Initialize a new biotope project with interactive configuration in the specified directory.