FoundryVTT/.claude/skills/pdf-processor/SKILL.md
2025-11-06 14:04:48 +01:00

name: pdf-processor
description: Extract text, tables, and metadata from PDF files, fill PDF forms, and merge/split PDFs. Use when user mentions PDFs, documents, forms, or needs to extract content from PDF files.
allowed-tools: Read, Bash(python *:*), Bash(pip *:*), Write
version: 1.0.0

PDF Processor Skill

Process PDF files: extract text/tables, read metadata, fill forms, merge/split documents.

Capabilities

1. Text Extraction

Extract text content from PDF files for analysis or conversion.

2. Table Extraction

Extract tables from PDFs and convert to CSV, JSON, or markdown.

3. Metadata Reading

Read PDF metadata (author, creation date, page count, etc.).

4. Form Filling

Fill interactive PDF forms programmatically.

5. Document Manipulation

  • Merge multiple PDFs
  • Split PDFs into separate pages
  • Extract specific pages

Trigger Words

Use this skill when the user mentions:

  • PDF files, documents
  • "extract from PDF", "read PDF", "parse PDF"
  • "PDF form", "fill form"
  • "merge PDFs", "split PDF", "combine PDFs"
  • "PDF to text", "PDF to CSV"

Dependencies

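This skill uses the PyPDF2 and pdfplumber Python libraries. The examples rely on the PdfReader, PdfWriter, and PdfMerger classes, so they need a recent PyPDF2 release (3.x):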

pip install PyPDF2 pdfplumber

Usage Examples

Example 1: Extract Text

User: "Extract text from report.pdf"
Assistant: [Uses this skill to extract and display text]

Example 2: Extract Tables

User: "Get the data table from financial-report.pdf"
Assistant: [Extracts tables and converts to markdown/CSV]

Example 3: Read Metadata

User: "What's in this PDF? Show me the metadata"
Assistant: [Displays author, page count, creation date, etc.]

Instructions

When this skill is invoked:

Step 1: Verify Dependencies

Check if required Python libraries are installed:

python -c "import PyPDF2, pdfplumber" 2>/dev/null || echo "Need to install"

If they are not installed, ask the user for permission to install them:

pip install PyPDF2 pdfplumber
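
The same check can also be run from Python, which reports exactly which package is missing. This is a minimal sketch using only the standard library; the helper name `missing_modules` is illustrative and not part of either library:

```python
import importlib.util

def missing_modules(names):
    """Return the module names from `names` that cannot be imported here."""
    return [name for name in names if importlib.util.find_spec(name) is None]

if __name__ == "__main__":
    missing = missing_modules(["PyPDF2", "pdfplumber"])
    if missing:
        print("Need to install:", ", ".join(missing))
    else:
        print("All dependencies are available")
```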

Step 2: Determine Task Type

If the request is ambiguous, ask clarifying questions:

  • "Would you like to extract text, tables, or metadata?"
  • "Do you need all pages or specific pages?"
  • "What output format do you prefer?"
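
When the user asks for specific pages, their answer has to be turned into page indices. The sketch below assumes a hypothetical input convention of 1-based, comma-separated pages and ranges such as "1-3,5"; the function name `parse_page_spec` is illustrative:

```python
def parse_page_spec(spec, page_count):
    """Turn a 1-based spec such as "1-3,5" into sorted zero-based page indices."""
    indices = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        # Reject anything outside the document instead of silently clipping it
        if start < 1 or end > page_count or start > end:
            raise ValueError(f"Page range {part!r} is outside 1-{page_count}")
        indices.update(range(start - 1, end))
    return sorted(indices)
```

The returned indices are zero-based, so they can be used directly as `reader.pages[i]`.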

Step 3: Execute Based on Task

For Text Extraction:

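import PyPDF2

def extract_text(pdf_path):
    """Return the text of every page, separated by blank lines."""
    with open(pdf_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        # Image-only pages may yield no text, so fall back to an empty string
        pages = [page.extract_text() or "" for page in reader.pages]
    return "\n\n".join(pages)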

# Usage
text = extract_text("path/to/file.pdf")
print(text)

For Table Extraction:

import pdfplumber

def extract_tables(pdf_path):
    tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            page_tables = page.extract_tables()
            if page_tables:
                tables.extend(page_tables)
    return tables

# Usage
tables = extract_tables("path/to/file.pdf")
# Convert to markdown or CSV as needed
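
The markdown conversion mentioned in the comment above could look like the following sketch. It assumes each table is a list of rows whose cells are strings or None, which is the shape `extract_tables()` returns, and that the first row is the header; the helper name `table_to_markdown` is illustrative:

```python
def table_to_markdown(table):
    """Render one table (a list of rows) as a markdown table; the first row is the header."""
    def clean(cell):
        # Empty cells arrive as None; newlines and pipes would break a markdown row
        text = "" if cell is None else str(cell)
        return text.replace("\n", " ").replace("|", "\\|").strip()

    if not table:
        return ""
    width = max(len(row) for row in table)
    # Pad short rows so every row has the same number of columns
    rows = [[clean(c) for c in row] + [""] * (width - len(row)) for row in table]
    header, body = rows[0], rows[1:]
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join(["---"] * width) + " |"]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)
```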

For Metadata:

import PyPDF2

def get_metadata(pdf_path):
    with open(pdf_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        # metadata is None when the PDF has no document information dictionary
        info = reader.metadata or {}
        return {
            'Author': info.get('/Author', 'Unknown'),
            'Title': info.get('/Title', 'Unknown'),
            'Subject': info.get('/Subject', 'Unknown'),
            'Creator': info.get('/Creator', 'Unknown'),
            'Producer': info.get('/Producer', 'Unknown'),
            'CreationDate': info.get('/CreationDate', 'Unknown'),
            'ModDate': info.get('/ModDate', 'Unknown'),
            'Pages': len(reader.pages)
        }

# Usage
metadata = get_metadata("path/to/file.pdf")
for key, value in metadata.items():
    print(f"{key}: {value}")
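
CreationDate and ModDate are usually stored in the PDF date-string format (for example `D:20251020143000+01'00'`), which is hard to read as-is. A minimal parsing sketch follows; `parse_pdf_date` is a hypothetical helper, and it returns None for values it does not recognize rather than guessing:

```python
import re
from datetime import datetime, timedelta, timezone

# D:YYYYMMDDHHmmSS followed by an optional offset; every field after the year is optional
_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:([+\-Z])(?:(\d{2})'?(\d{2})?'?)?)?$"
)

def parse_pdf_date(value):
    """Parse a PDF date string into a datetime; return None if it does not match."""
    match = _PDF_DATE.match(str(value).strip())
    if not match:
        return None
    year, month, day, hour, minute, second, sign, tz_hour, tz_min = match.groups()
    tzinfo = None
    if sign == "Z":
        tzinfo = timezone.utc
    elif sign in ("+", "-"):
        offset = timedelta(hours=int(tz_hour or 0), minutes=int(tz_min or 0))
        tzinfo = timezone(offset if sign == "+" else -offset)
    try:
        return datetime(int(year), int(month or 1), int(day or 1),
                        int(hour or 0), int(minute or 0), int(second or 0),
                        tzinfo=tzinfo)
    except ValueError:
        # Out-of-range fields such as month 13
        return None
```

Calling `parse_pdf_date(metadata['CreationDate'])` on the dictionary above gives a datetime that can be formatted with `isoformat()`, or None when the value is 'Unknown'.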

For Merging PDFs:

import PyPDF2

def merge_pdfs(pdf_list, output_path):
    merger = PyPDF2.PdfMerger()
    for pdf in pdf_list:
        merger.append(pdf)
    merger.write(output_path)
    merger.close()

# Usage
merge_pdfs(["file1.pdf", "file2.pdf"], "merged.pdf")

For Splitting PDFs:

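import os
import PyPDF2

def split_pdf(pdf_path, output_dir):
    # Create the output directory if it does not exist yet
    os.makedirs(output_dir, exist_ok=True)
    with open(pdf_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        for i, page in enumerate(reader.pages):
            # One writer per page, so each output file holds a single page
            writer = PyPDF2.PdfWriter()
            writer.add_page(page)
            output_file = os.path.join(output_dir, f"page_{i+1}.pdf")
            with open(output_file, 'wb') as output:
                writer.write(output)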

# Usage
split_pdf("document.pdf", "output/")

Step 4: Present Results

  • For text: Display extracted content or save to file
  • For tables: Format as markdown table or save as CSV
  • For metadata: Display in readable format
  • For operations: Confirm success and output location
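
For the CSV option, and for JSON when the user prefers it, extracted tables can be serialized with the standard library alone. The sketch below assumes the same list-of-rows table shape as the extraction example and treats the first row as the header for JSON; both function names are illustrative:

```python
import csv
import io
import json

def table_to_csv(table):
    """Serialize one table (a list of rows) to CSV text; None cells become empty."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in table:
        writer.writerow(["" if cell is None else cell for cell in row])
    return buffer.getvalue()

def table_to_json(table):
    """Serialize a table to JSON records, using the first row as the keys."""
    if not table:
        return "[]"
    header = [str(cell or "") for cell in table[0]]
    records = [dict(zip(header, row)) for row in table[1:]]
    return json.dumps(records, indent=2)
```

The returned strings can be shown to the user or saved with the Write tool.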

Step 5: Offer Next Steps

Suggest related actions:

  • "Would you like me to save this to a file?"
  • "Should I analyze this content?"
  • "Need to extract data from other PDFs?"

Error Handling

Common Errors

  1. File not found

    • Verify path exists
    • Check file permissions
  2. Encrypted PDF

    • Ask user for password
    • Use reader.decrypt(password)
  3. Corrupted PDF

    • Tell the user the file could not be parsed
    • Suggest retrying with pdfplumber, which uses a different parser (pdfminer.six) and may still read the file
  4. Missing dependencies

    • Install PyPDF2 and pdfplumber
    • Provide installation commands
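
The file-not-found checks can run before any PDF library is involved. This is a sketch of such a pre-flight check using only the standard library; `check_pdf_path` is a hypothetical helper, and the header test is a heuristic, not a full validity check:

```python
import os

def check_pdf_path(pdf_path):
    """Return None if the file looks usable, otherwise a short message for the user."""
    if not os.path.isfile(pdf_path):
        return f"File not found: {pdf_path}"
    if not os.access(pdf_path, os.R_OK):
        return f"No read permission: {pdf_path}"
    with open(pdf_path, "rb") as file:
        head = file.read(1024)
    # PDF files begin with a "%PDF-" header; some carry a few stray bytes before it
    if b"%PDF-" not in head:
        return f"Not a PDF (missing %PDF- header): {pdf_path}"
    return None
```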

Best Practices

  1. Always verify file path before processing
  2. Ask for confirmation before installing dependencies
  3. Handle large PDFs carefully (show progress for many pages)
  4. Preserve formatting when extracting tables
  5. Offer multiple output formats (text, CSV, JSON, markdown)

Tool Restrictions

This skill has access to:

  • Read - For reading file paths and existing content
  • Bash(python *:*) - For running Python scripts
  • Bash(pip *:*) - For installing dependencies
  • Write - For saving extracted content

The skill has no access to other tools, which keeps it focused on PDF tasks.

Testing Checklist

Before using with real user data:

  • Test with simple single-page PDF
  • Test with multi-page PDF
  • Test with PDF containing tables
  • Test with encrypted PDF
  • Test merge operation
  • Test split operation
  • Verify error handling works
  • Check output formatting is clear

Advanced Features

Form Filling

from PyPDF2 import PdfReader, PdfWriter

def fill_form(template_path, data, output_path):
    reader = PdfReader(template_path)
    writer = PdfWriter()
    writer.append_pages_from_reader(reader)

    # Fill matching fields on every page, not just the first;
    # pages without annotations have no form fields to update
    for page in writer.pages:
        if "/Annots" in page:
            writer.update_page_form_field_values(page, data)

    with open(output_path, 'wb') as output:
        writer.write(output)

OCR for Scanned PDFs

For scanned PDFs (images), suggest using OCR:

pip install pdf2image pytesseract
# Requires tesseract-ocr system package
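
Before suggesting OCR, it can help to check whether the PDF really lacks a text layer. The sketch below is a rough heuristic over the text already extracted for each page; the function name and both threshold defaults are assumptions that will likely need tuning for real documents:

```python
def looks_scanned(page_texts, min_chars=20, max_text_ratio=0.1):
    """Guess whether a PDF is scanned from the text extracted for each page.

    page_texts: one string (or None) per page, e.g. from page.extract_text().
    Returns True when at most `max_text_ratio` of the pages carry real text.
    """
    if not page_texts:
        return False
    pages_with_text = sum(
        1 for text in page_texts if len((text or "").strip()) >= min_chars
    )
    return pages_with_text / len(page_texts) <= max_text_ratio
```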

Version History

  • 1.0.0 (2025-10-20): Initial release
    • Text extraction
    • Table extraction
    • Metadata reading
    • Merge/split operations

Related Skills

  • document-converter - Convert between document formats
  • data-analyzer - Analyze extracted data
  • report-generator - Create reports from PDF data

Notes

  • Works best with text-based PDFs
  • For scanned PDFs, recommend OCR tools
  • Large PDFs may take time to process
  • Always preserve user's original files