Biotope Add

Draft stage

Biotope is in draft stage. Functionality may be missing or incomplete. The API is subject to change.

The biotope add command adds data files to your biotope project and prepares them for metadata creation. It calculates checksums for data integrity and creates basic Croissant ML metadata files.

Command Signature

biotope add [OPTIONS] [PATHS]...

Arguments

  • PATHS: One or more file or directory paths to add. Can be absolute or relative paths.

Options

  • --recursive, -r: Add directories recursively (default: False)
  • --force, -f: Force add even if file already tracked (default: False)

Examples

Add a single file

biotope add data/raw/experiment.csv

Add multiple files

biotope add data/raw/experiment1.csv data/raw/experiment2.csv

Add directory recursively

biotope add data/raw/ --recursive

Force add already tracked file

biotope add data/raw/experiment.csv --force

Add files with absolute paths

biotope add /absolute/path/to/experiment.csv

What It Does

  1. Validates Environment: Checks that you're in a biotope project and Git repository
  2. Calculates Checksums: Computes SHA256 checksums for data integrity
  3. Creates Metadata: Generates basic Croissant ML metadata files in .biotope/datasets/
  4. Stages Changes: Automatically stages metadata changes in Git
  5. Reports Results: Shows which files were added and which were skipped

Output

The command provides detailed feedback:

📁 Added data/raw/experiment.csv (SHA256: e471e5fc...)

✅ Added 1 file(s) to biotope project:
  + data/raw/experiment.csv

💡 Next steps:
  1. Run 'biotope status' to see staged files
  2. Run 'biotope annotate interactive --staged' to create metadata
  3. Run 'biotope commit -m "message"' to save changes

💡 For incomplete annotations:
  1. Run 'biotope status' to see which files need annotation
  2. Run 'biotope annotate interactive --incomplete' to complete them

Metadata Structure

Creates JSON-LD files in .biotope/datasets/ with this structure:

{
  "@context": {"@vocab": "https://schema.org/"},
  "@type": "Dataset",
  "name": "experiment",
  "description": "Dataset for experiment.csv",
  "distribution": [
    {
      "@type": "sc:FileObject",
      "@id": "file_e471e5fc",
      "name": "experiment.csv",
      "contentUrl": "data/raw/experiment.csv",
      "sha256": "e471e5fc1234567890abcdef...",
      "contentSize": 1024,
      "dateCreated": "2024-01-15T10:30:00Z"
    }
  ]
}
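
As a rough illustration of how a record with this shape could be assembled, the sketch below builds the same structure from a file on disk. The helper name `build_basic_metadata` is hypothetical, and deriving `@id` from the first eight hex characters of the checksum is an assumption inferred from the example above, not confirmed biotope behavior.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def build_basic_metadata(file_path: Path, project_root: Path) -> dict:
    """Assemble a minimal record shaped like the example above (hypothetical helper)."""
    # Reads the whole file at once for brevity; large files should be hashed in chunks.
    sha256 = hashlib.sha256(file_path.read_bytes()).hexdigest()
    return {
        "@context": {"@vocab": "https://schema.org/"},
        "@type": "Dataset",
        "name": file_path.stem,
        "description": f"Dataset for {file_path.name}",
        "distribution": [
            {
                "@type": "sc:FileObject",
                # Assumption: the id reuses the first 8 hex digits of the checksum.
                "@id": f"file_{sha256[:8]}",
                "name": file_path.name,
                # Path stored relative to the project root, with forward slashes.
                "contentUrl": file_path.relative_to(project_root).as_posix(),
                "sha256": sha256,
                "contentSize": file_path.stat().st_size,
                "dateCreated": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
            }
        ],
    }


# Demo on a throwaway project directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    data_file = root / "data" / "raw" / "experiment.csv"
    data_file.parent.mkdir(parents=True)
    data_file.write_bytes(b"sample,value\na,1\n")
    record = build_basic_metadata(data_file, root)
    print(json.dumps(record, indent=2))
```

Comparing the printed record with the file under .biotope/datasets/ is one way to review what the command generated before committing.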

Error Handling

Common Errors

  • "Not in a biotope project": Run biotope init first
  • "Not in a Git repository": Initialize Git with git init
  • "File already tracked": Use --force to override
  • "Path does not exist": Check the file path

Error Messages

❌ Not in a biotope project. Run 'biotope init' first.
❌ Not in a Git repository. Initialize Git first with 'git init'.
⚠️  File 'data/raw/experiment.csv' already tracked (use --force to override)
⚠️  Skipping directory 'data/raw/' (use --recursive to add contents)
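
The first two messages come from environment checks that run before any file is processed. The sketch below shows one plausible way such checks could work. The real `find_biotope_root` and `is_git_repo` helpers are not shown on this page, so the functions here (`find_root_sketch`, `has_git_dir_sketch`, `check_environment`) are hypothetical stand-ins, and detecting a repository by a `.git` entry at the project root is an assumption.

```python
import tempfile
from pathlib import Path
from typing import Optional


def find_root_sketch(start: Path) -> Optional[Path]:
    """Walk upward from `start` until a directory containing .biotope/ is found."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / ".biotope").is_dir():
            return candidate
    return None


def has_git_dir_sketch(root: Path) -> bool:
    """Assumption: a repository is recognized by a .git entry at the root."""
    return (root / ".git").exists()


def check_environment(start: Path) -> Optional[str]:
    """Return the matching error message from above, or None if both checks pass."""
    root = find_root_sketch(start)
    if root is None:
        return "❌ Not in a biotope project. Run 'biotope init' first."
    if not has_git_dir_sketch(root):
        return "❌ Not in a Git repository. Initialize Git first with 'git init'."
    return None


with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp)
    nested = project / "data" / "raw"
    nested.mkdir(parents=True)

    outside = check_environment(nested)   # no .biotope/ directory yet
    (project / ".biotope").mkdir()
    no_git = check_environment(nested)    # project found, but no .git entry
    (project / ".git").mkdir()
    ready = check_environment(nested)     # both checks pass
    print(outside, no_git, ready, sep="\n")
```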

Integration

With Other Commands

  • biotope status: See what files are staged
  • biotope annotate: Create detailed metadata
  • biotope commit: Save metadata changes
  • biotope check-data: Verify data integrity

Workflow Integration

# 1. Add files
biotope add data/raw/experiment.csv

# 2. Create metadata
biotope annotate interactive --staged

# 3. Commit changes
biotope commit -m "Add experiment dataset"

Technical Details

File Tracking

Files are tracked by their path relative to the biotope project root, so a file is recorded the same way whether it is passed as an absolute or a relative path.
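
A minimal sketch of this normalization, assuming it amounts to resolving the input and expressing it relative to the project root. The helper `to_tracked_path` is hypothetical and is not taken from biotope's source.

```python
import tempfile
from pathlib import Path


def to_tracked_path(path: Path, project_root: Path, cwd: Path) -> str:
    """Return `path` relative to the project root (hypothetical helper).

    Relative inputs are interpreted against `cwd`; absolute inputs are used as-is.
    Raises ValueError if the file lies outside the project root.
    """
    absolute = path if path.is_absolute() else cwd / path
    return absolute.resolve().relative_to(project_root.resolve()).as_posix()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    target = root / "data" / "raw" / "experiment.csv"
    target.parent.mkdir(parents=True)
    target.touch()

    # The same file, addressed relative to data/ and by its absolute path.
    from_relative = to_tracked_path(Path("raw/experiment.csv"), root, cwd=root / "data")
    from_absolute = to_tracked_path(target, root, cwd=root)
    print(from_relative, from_absolute)
```

Both calls yield the same project-relative path, which is what makes the tracked location independent of where the command was invoked.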

Checksum Calculation

Checksums use the SHA256 algorithm. The file is read in 4096-byte chunks, so large files are never loaded into memory at once:

import hashlib
from pathlib import Path


def calculate_file_checksum(file_path: Path) -> str:
    """Calculate SHA256 checksum of a file."""
    sha256_hash = hashlib.sha256()
    with open(file_path, "rb") as f:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(4096), b""):
            sha256_hash.update(chunk)
    return sha256_hash.hexdigest()
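
To illustrate how a recorded checksum can later be used for integrity verification, the sketch below compares a stored digest with the file's current contents. `verify_checksum` is a hypothetical helper written for this example; it is not part of biotope's documented API.

```python
import hashlib
import tempfile
from pathlib import Path


def calculate_file_checksum(file_path: Path) -> str:
    """Same chunked SHA256 routine as shown above."""
    sha256_hash = hashlib.sha256()
    with open(file_path, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            sha256_hash.update(chunk)
    return sha256_hash.hexdigest()


def verify_checksum(file_path: Path, expected_sha256: str) -> bool:
    """Hypothetical helper: True if the file still matches the recorded digest."""
    return calculate_file_checksum(file_path) == expected_sha256.lower()


with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "experiment.csv"
    data_file.write_bytes(b"sample,value\na,1\n")
    recorded = calculate_file_checksum(data_file)
    unchanged = verify_checksum(data_file, recorded)

    data_file.write_bytes(b"sample,value\na,2\n")  # simulate a modified file
    modified = verify_checksum(data_file, recorded)
    print(unchanged, modified)
```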

Git Integration

Automatically stages metadata changes:

import subprocess
from pathlib import Path


def _stage_git_changes(biotope_root: Path) -> None:
    """Stage .biotope/ changes in Git."""
    subprocess.run(["git", "add", ".biotope/"], cwd=biotope_root, check=True)

Best Practices

  1. Use Relative Paths: Prefer relative paths for better portability
  2. Organize Data: Keep data files in structured directories
  3. Check Status: Use biotope status to verify what was added
  4. Review Metadata: Always review generated metadata before committing

Limitations

  • Only supports local files (not URLs)
  • Requires Git repository
  • Metadata is basic and should be enhanced with biotope annotate
  • No support for symbolic links
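
Because symbolic links are not supported, one option is to filter them out before passing paths to biotope add. The sketch below collects regular files under a directory and leaves links out. `list_regular_files` is a hypothetical helper, and how biotope itself reacts to a symlink is not specified on this page.

```python
import os
import tempfile
from pathlib import Path


def list_regular_files(directory: Path) -> list[Path]:
    """Hypothetical helper: files under `directory`, excluding symbolic links."""
    return sorted(
        p for p in directory.rglob("*") if p.is_file() and not p.is_symlink()
    )


with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "data" / "raw"
    raw.mkdir(parents=True)
    (raw / "experiment.csv").write_text("sample,value\n")
    # A link to the same file; it should be left out of the result.
    os.symlink(raw / "experiment.csv", raw / "latest.csv")

    found = [p.name for p in list_regular_files(Path(tmp) / "data")]
    print(found)
```

The resulting paths could then be passed to biotope add as positional PATHS arguments.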

Add command implementation for tracking data files and metadata.

add(paths, recursive, force)

Add data files to biotope project and stage for metadata creation.

This command calculates checksums for data files and prepares them for metadata annotation. Files are tracked in the .biotope/datasets/ directory with their checksums embedded in Croissant ML metadata.

Parameters:

Name       Type               Description                        Default
paths      tuple[Path, ...]   Files or directories to add        required
recursive  bool               Add directories recursively        required
force      bool               Force add even if already tracked  required

Source code in biotope/commands/add.py
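from pathlib import Path

import click

# find_biotope_root, is_git_repo, _add_file and stage_git_changes are helpers
# defined elsewhere in biotope; they are not shown in this excerpt.


@click.command()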
@click.argument("paths", nargs=-1, type=click.Path(exists=True, path_type=Path))
@click.option(
    "--recursive",
    "-r",
    is_flag=True,
    help="Add directories recursively",
)
@click.option(
    "--force",
    "-f",
    is_flag=True,
    help="Force add even if file already tracked",
)
def add(paths: tuple[Path, ...], recursive: bool, force: bool) -> None:
    """
    Add data files to biotope project and stage for metadata creation.

    This command calculates checksums for data files and prepares them for metadata
    annotation. Files are tracked in the .biotope/datasets/ directory with their
    checksums embedded in Croissant ML metadata.

    Args:
        paths: Files or directories to add
        recursive: Add directories recursively
        force: Force add even if already tracked
    """
    if not paths:
        click.echo("❌ No paths specified. Use 'biotope add <file_or_directory>'")
        raise click.Abort

    # Find biotope project root
    biotope_root = find_biotope_root()
    if not biotope_root:
        click.echo("❌ Not in a biotope project. Run 'biotope init' first.")
        raise click.Abort

    # Check if we're in a Git repository
    if not is_git_repo(biotope_root):
        click.echo("❌ Not in a Git repository. Initialize Git first with 'git init'.")
        raise click.Abort

    datasets_dir = biotope_root / ".biotope" / "datasets"
    datasets_dir.mkdir(parents=True, exist_ok=True)

    added_files = []
    skipped_files = []

    for path in paths:
        if path.is_file():
            result = _add_file(path, biotope_root, datasets_dir, force)
            if result:
                added_files.append(path)
            else:
                skipped_files.append(path)
        elif path.is_dir() and recursive:
            for file_path in path.rglob("*"):
                if file_path.is_file():
                    result = _add_file(file_path, biotope_root, datasets_dir, force)
                    if result:
                        added_files.append(file_path)
                    else:
                        skipped_files.append(file_path)
        elif path.is_dir():
            click.echo(
                f"⚠️  Skipping directory '{path}' (use --recursive to add contents)"
            )
            skipped_files.append(path)

    # Stage changes in Git
    if added_files:
        stage_git_changes(biotope_root)

    # Report results
    if added_files:
        click.echo(f"\n✅ Added {len(added_files)} file(s) to biotope project:")
        for file_path in added_files:
            click.echo(f"  + {file_path}")

    if skipped_files:
        click.echo(f"\n⚠️  Skipped {len(skipped_files)} file(s):")
        for file_path in skipped_files:
            click.echo(f"  - {file_path}")

    if added_files:
        click.echo("\n💡 Next steps:")
        click.echo("  1. Run 'biotope status' to see staged files")
        click.echo(
            "  2. Run 'biotope annotate interactive --staged' to create metadata"
        )
        click.echo("  3. Run 'biotope commit -m \"message\"' to save changes")