---
name: pdf-processor
description: Extract text, tables, and metadata from PDF files, fill PDF forms, and merge/split PDFs. Use when user mentions PDFs, documents, forms, or needs to extract content from PDF files.
allowed-tools: Read, Bash(python *:*), Bash(pip *:*), Write
version: 1.0.0
---

# PDF Processor Skill

Process PDF files: extract text/tables, read metadata, fill forms, merge/split documents.

## Capabilities

### 1. Text Extraction
Extract text content from PDF files for analysis or conversion.

### 2. Table Extraction
Extract tables from PDFs and convert to CSV, JSON, or markdown.

### 3. Metadata Reading
Read PDF metadata (author, creation date, page count, etc.).

### 4. Form Filling
Fill interactive PDF forms programmatically.

### 5. Document Manipulation
- Merge multiple PDFs
- Split PDFs into separate pages
- Extract specific pages

## Trigger Words

Use this skill when the user mentions:
- PDF files, documents
- "extract from PDF", "read PDF", "parse PDF"
- "PDF form", "fill form"
- "merge PDFs", "split PDF", "combine PDFs"
- "PDF to text", "PDF to CSV"

## Dependencies

This skill uses Python's `PyPDF2` and `pdfplumber` libraries:

```bash
pip install PyPDF2 pdfplumber
```

## Usage Examples

### Example 1: Extract Text
```
User: "Extract text from report.pdf"
Assistant: [Uses this skill to extract and display text]
```

### Example 2: Extract Tables
```
User: "Get the data table from financial-report.pdf"
Assistant: [Extracts tables and converts to markdown/CSV]
```

### Example 3: Read Metadata
```
User: "What's in this PDF? Show me the metadata"
Assistant: [Displays author, page count, creation date, etc.]
```

## Instructions

When this skill is invoked:

### Step 1: Verify Dependencies
Check if required Python libraries are installed:
```bash
python -c "import PyPDF2, pdfplumber" 2>/dev/null || echo "Need to install"
```

If either library is missing, ask the user for permission before installing:
```bash
pip install PyPDF2 pdfplumber
```
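
The same check can also be run from Python, which makes it easy to report exactly which package is missing. This is a minimal sketch using only the standard library; the helper name `missing_modules` is hypothetical and not part of either library.

```python
import importlib.util

# Module names this skill relies on (taken from the Dependencies section)
REQUIRED_MODULES = ["PyPDF2", "pdfplumber"]

def missing_modules(modules=REQUIRED_MODULES):
    """Return the names from `modules` that cannot be imported."""
    return [name for name in modules if importlib.util.find_spec(name) is None]

if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Need to install:", ", ".join(missing))
    else:
        print("All PDF dependencies are available")
```

An empty result means both libraries are importable and Step 2 can start without prompting the user.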

### Step 2: Determine Task Type

Ask clarifying questions if ambiguous:
- "Would you like to extract text, tables, or metadata?"
- "Do you need all pages or specific pages?"
- "What output format do you prefer?"

### Step 3: Execute Based on Task

#### For Text Extraction:

```python
import PyPDF2

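def extract_text(pdf_path):
    """Return the text of every page, separated by blank lines."""
    with open(pdf_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        # extract_text() can yield nothing for image-only pages, so fall back to ""
        pages = [page.extract_text() or "" for page in reader.pages]
    return "\n\n".join(pages)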

# Usage
text = extract_text("path/to/file.pdf")
print(text)
```

#### For Table Extraction:

```python
import pdfplumber

def extract_tables(pdf_path):
    tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            page_tables = page.extract_tables()
            if page_tables:
                tables.extend(page_tables)
    return tables

# Usage
tables = extract_tables("path/to/file.pdf")
# Convert to markdown or CSV as needed
```
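
The conversion step can be sketched with the standard library alone. The helpers below assume the shape that `extract_tables()` returns (a list of tables, each a list of rows, each row a list of strings or `None`) and treat the first row of each table as the header. That header choice is an assumption and will not hold for every PDF. The function names are hypothetical.

```python
import csv
import io
import json

def _clean(row):
    """Replace None cells with "" and flatten line breaks inside cells."""
    return ["" if cell is None else str(cell).replace("\n", " ") for cell in row]

def table_to_markdown(table):
    """Render one table as markdown, using the first row as the header."""
    # Escape pipes so cell content cannot break the column layout
    rows = [[cell.replace("|", "\\|") for cell in _clean(row)] for row in table]
    if not rows:
        return ""
    header, *body = rows
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)

def table_to_csv(table):
    """Render one table as CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(_clean(row) for row in table)
    return buffer.getvalue()

def table_to_json(table):
    """Render one table as a JSON array of objects keyed by the header row."""
    rows = [_clean(row) for row in table]
    if not rows:
        return "[]"
    header, *body = rows
    return json.dumps([dict(zip(header, row)) for row in body], indent=2)
```

Each helper takes a single table, so a typical call loops over the list returned by `extract_tables(...)` and writes one output per table.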

#### For Metadata:

```python
import PyPDF2

def get_metadata(pdf_path):
    with open(pdf_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        # reader.metadata is None when the PDF has no document info dictionary
        info = reader.metadata or {}
        return {
            'Author': info.get('/Author', 'Unknown'),
            'Title': info.get('/Title', 'Unknown'),
            'Subject': info.get('/Subject', 'Unknown'),
            'Creator': info.get('/Creator', 'Unknown'),
            'Producer': info.get('/Producer', 'Unknown'),
            'CreationDate': info.get('/CreationDate', 'Unknown'),
            'ModDate': info.get('/ModDate', 'Unknown'),
            'Pages': len(reader.pages)
        }

# Usage
metadata = get_metadata("path/to/file.pdf")
for key, value in metadata.items():
    print(f"{key}: {value}")
```
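
The `/CreationDate` and `/ModDate` values are raw PDF date strings such as `D:20251020143000+02'00'`, which are hard to read as printed. Below is a hedged sketch of a converter to `datetime`. The helper name `parse_pdf_date` is hypothetical. It accepts the truncated forms the PDF date format allows and returns `None` for anything it does not recognise, including the `'Unknown'` placeholder used above.

```python
import re
from datetime import datetime, timedelta, timezone

# D:YYYYMMDDHHmmSS followed by an optional Z, +HH'mm' or -HH'mm' offset;
# every field after the year is optional
_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:([Zz+\-])(?:(\d{2})'?(\d{2})?'?)?)?$"
)

def parse_pdf_date(raw):
    """Convert a PDF date string to a datetime, or return None if it does not parse."""
    match = _PDF_DATE.match(str(raw).strip())
    if not match:
        return None
    year, month, day, hour, minute, second, sign, tz_hours, tz_minutes = match.groups()
    try:
        moment = datetime(
            int(year), int(month or 1), int(day or 1),
            int(hour or 0), int(minute or 0), int(second or 0),
        )
        if sign is None:
            return moment  # no offset given: leave the datetime naive
        if sign in "Zz":
            return moment.replace(tzinfo=timezone.utc)
        offset = timedelta(hours=int(tz_hours or 0), minutes=int(tz_minutes or 0))
        if sign == "-":
            offset = -offset
        return moment.replace(tzinfo=timezone(offset))
    except ValueError:
        # Out-of-range month, day, time, or offset
        return None
```

Calling `parse_pdf_date(metadata['CreationDate'])` then gives a value that can be formatted with `isoformat()` or `strftime()` before it is shown to the user.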

#### For Merging PDFs:

```python
import PyPDF2

def merge_pdfs(pdf_list, output_path):
    merger = PyPDF2.PdfMerger()
    for pdf in pdf_list:
        merger.append(pdf)
    merger.write(output_path)
    merger.close()

# Usage
merge_pdfs(["file1.pdf", "file2.pdf"], "merged.pdf")
```

#### For Splitting PDFs:

```python
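import os
import PyPDF2

def split_pdf(pdf_path, output_dir):
    # Create the output directory if it does not exist yet
    os.makedirs(output_dir, exist_ok=True)
    with open(pdf_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        for i, page in enumerate(reader.pages):
            writer = PyPDF2.PdfWriter()
            writer.add_page(page)
            # os.path.join avoids a doubled slash when output_dir ends with "/"
            output_file = os.path.join(output_dir, f"page_{i+1}.pdf")
            with open(output_file, 'wb') as output:
                writer.write(output)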

# Usage
split_pdf("document.pdf", "output/")
```
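
#### For Extracting Specific Pages:

The capabilities list mentions extracting specific pages, so here is a minimal sketch of one way to do it. It assumes the user describes pages with a one-based specification such as `"1-3, 5"`. That input format and both helper names (`parse_page_spec`, `extract_pages`) are hypothetical, not part of PyPDF2.

```python
def parse_page_spec(spec, total_pages):
    """Turn a one-based spec like "1-3, 5" into zero-based page indices."""
    indices = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        start_text, dash, end_text = part.partition("-")
        start = int(start_text)
        end = int(end_text) if dash else start
        if start < 1 or end > total_pages or start > end:
            raise ValueError(f"Page range {part!r} is outside 1-{total_pages}")
        indices.extend(range(start - 1, end))
    return indices

def extract_pages(pdf_path, spec, output_path):
    """Write the pages selected by `spec` to a new PDF at `output_path`."""
    # Imported here so parse_page_spec stays usable without PyPDF2 installed
    from PyPDF2 import PdfReader, PdfWriter

    reader = PdfReader(pdf_path)
    writer = PdfWriter()
    for index in parse_page_spec(spec, len(reader.pages)):
        writer.add_page(reader.pages[index])
    with open(output_path, "wb") as output:
        writer.write(output)

# Usage
# extract_pages("document.pdf", "1-3, 5", "selection.pdf")
```

The range check runs before anything is written, so an out-of-range request fails with a clear message and no partial output file.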

### Step 4: Present Results

- For text: Display extracted content or save to file
- For tables: Format as markdown table or save as CSV
- For metadata: Display in readable format
- For operations: Confirm success and output location

### Step 5: Offer Next Steps

Suggest related actions:
- "Would you like me to save this to a file?"
- "Should I analyze this content?"
- "Need to extract data from other PDFs?"

## Error Handling

### Common Errors

1. **File not found**
   - Verify path exists
   - Check file permissions

2. **Encrypted PDF**
   - Ask user for password
   - Use `reader.decrypt(password)`

3. **Corrupted PDF**
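   - Tell the user the file could not be parsed
   - Try `pdfplumber` as an alternative; it is built on a different parser (pdfminer.six) and may still read the file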

4. **Missing dependencies**
   - Install PyPDF2 and pdfplumber
   - Provide installation commands
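
The encrypted-PDF case can be handled with a small wrapper around `is_encrypted` and `decrypt(password)`. The sketch below is written against those two members only, so it works with a PyPDF2 `PdfReader` or any stand-in object. The name `unlock_reader` and the choice of `PermissionError` are assumptions made for illustration.

```python
def unlock_reader(reader, password=None):
    """Decrypt `reader` in place if needed and return it.

    Works with any object exposing `is_encrypted` and `decrypt(password)`,
    such as a PyPDF2 PdfReader.
    """
    if not reader.is_encrypted:
        return reader
    if password is None:
        raise PermissionError("PDF is encrypted; ask the user for the password")
    # decrypt() returns a falsy value (0) when the password does not match
    if not reader.decrypt(password):
        raise PermissionError("The supplied password did not unlock the PDF")
    return reader

# Usage (assumes PyPDF2 is installed):
# from PyPDF2 import PdfReader
# reader = unlock_reader(PdfReader("path/to/file.pdf"), password="secret")
```

Catching `PermissionError` at the call site gives one place to prompt the user for a password and retry.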

## Best Practices

1. **Always verify file path** before processing
2. **Ask for confirmation** before installing dependencies
3. **Handle large PDFs** carefully (show progress for many pages)
4. **Preserve formatting** when extracting tables
5. **Offer multiple output formats** (text, CSV, JSON, markdown)
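
The first practice can be made concrete with a pre-flight check that runs before any library touches the file. This is a sketch under two assumptions: the helper name `validate_pdf_path` is hypothetical, and the check only looks for the `%PDF-` marker in the first 1024 bytes, which is a quick sanity test and not full validation.

```python
from pathlib import Path

def validate_pdf_path(pdf_path):
    """Return a Path to an existing file that starts like a PDF, or raise."""
    path = Path(pdf_path).expanduser()
    if not path.is_file():
        raise FileNotFoundError(f"No such file: {path}")
    with path.open("rb") as handle:
        header = handle.read(1024)
    # A PDF begins with a "%PDF-<version>" marker near the top of the file
    if b"%PDF-" not in header:
        raise ValueError(f"{path} does not look like a PDF (missing %PDF- header)")
    return path
```

Running this first turns a wrong path or a mislabelled file into a short, specific error message instead of a parser traceback.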

## Tool Restrictions

This skill has access to:
- `Read` - For reading file paths and existing content
- `Bash(python *:*)` - For running Python scripts
- `Bash(pip *:*)` - For installing dependencies
- `Write` - For saving extracted content

**No access to** any other tools; this keeps the skill focused on PDF work.

## Testing Checklist

Before using with real user data:

- [ ] Test with simple single-page PDF
- [ ] Test with multi-page PDF
- [ ] Test with PDF containing tables
- [ ] Test with encrypted PDF
- [ ] Test merge operation
- [ ] Test split operation
- [ ] Verify error handling works
- [ ] Check output formatting is clear

## Advanced Features

### Form Filling

```python
from PyPDF2 import PdfReader, PdfWriter

def fill_form(template_path, data, output_path):
    reader = PdfReader(template_path)
    writer = PdfWriter()

    # Copy every page, then fill matching fields on each page that has form widgets
    writer.append_pages_from_reader(reader)
    for page in writer.pages:
        if "/Annots" in page:
            writer.update_page_form_field_values(page, data)

    with open(output_path, 'wb') as output:
        writer.write(output)
```

### OCR for Scanned PDFs

For scanned PDFs (images), suggest using OCR:
```bash
pip install pdf2image pytesseract
# Requires tesseract-ocr system package
```
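
A rough sketch of how the two packages might be combined follows. It first decides which pages need OCR by checking how little text normal extraction produced, then runs OCR on those pages only. The helper names and the `min_chars=20` threshold are assumptions made for illustration, and `ocr_pages` needs the `tesseract` and `poppler` system tools to be installed.

```python
def pages_needing_ocr(page_texts, min_chars=20):
    """Return zero-based indices of pages whose extracted text is too short."""
    return [
        index
        for index, text in enumerate(page_texts)
        if len((text or "").strip()) < min_chars
    ]

def ocr_pages(pdf_path, page_indices, dpi=300):
    """OCR the given zero-based pages and return {page_index: text}."""
    # Imported lazily: both packages are optional extras for this skill
    from pdf2image import convert_from_path
    import pytesseract

    results = {}
    for index in page_indices:
        # pdf2image uses one-based page numbers
        images = convert_from_path(
            pdf_path, dpi=dpi, first_page=index + 1, last_page=index + 1
        )
        results[index] = pytesseract.image_to_string(images[0])
    return results
```

Rendering one page at a time keeps memory use low on long scans, at the cost of starting the renderer once per page.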

## Version History

- **1.0.0** (2025-10-20): Initial release
  - Text extraction
  - Table extraction
  - Metadata reading
  - Merge/split operations

## Related Skills

- **document-converter** - Convert between document formats
- **data-analyzer** - Analyze extracted data
- **report-generator** - Create reports from PDF data

## Notes

- Works best with text-based PDFs
- For scanned PDFs, recommend OCR tools
- Large PDFs may take time to process
- Always preserve user's original files