Claude Code 15355c35ea Initial commit: Fresh start with current state

2025-11-06 14:04:48 +01:00

7.1 KiB

name	description	allowed-tools	version
pdf-processor	Extract text, tables, and metadata from PDF files, fill PDF forms, and merge/split PDFs. Use when the user mentions PDFs, documents, forms, or needs to extract content from PDF files.	Read, Bash(python *:*), Bash(pip *:*), Write	1.0.0

PDF Processor Skill

Process PDF files: extract text/tables, read metadata, fill forms, merge/split documents.

Capabilities

1. Text Extraction

Extract text content from PDF files for analysis or conversion.

2. Table Extraction

Extract tables from PDFs and convert to CSV, JSON, or markdown.

3. Metadata Reading

Read PDF metadata (author, creation date, page count, etc.).

4. Form Filling

Fill interactive PDF forms programmatically.

5. Document Manipulation

Merge multiple PDFs
Split PDFs into separate pages
Extract specific pages

Trigger Words

Use this skill when the user mentions:

PDF files, documents
"extract from PDF", "read PDF", "parse PDF"
"PDF form", "fill form"
"merge PDFs", "split PDF", "combine PDFs"
"PDF to text", "PDF to CSV"

Dependencies

This skill uses Python's PyPDF2 and pdfplumber libraries:

pip install PyPDF2 pdfplumber

Usage Examples

Example 1: Extract Text

User: "Extract text from report.pdf"
Assistant: [Uses this skill to extract and display text]

Example 2: Extract Tables

User: "Get the data table from financial-report.pdf"
Assistant: [Extracts tables and converts to markdown/CSV]

Example 3: Read Metadata

User: "What's in this PDF? Show me the metadata"
Assistant: [Displays author, page count, creation date, etc.]

Instructions

When this skill is invoked:

Step 1: Verify Dependencies

Check if required Python libraries are installed:

python -c "import PyPDF2, pdfplumber" 2>/dev/null || echo "Need to install"

If they are not installed, ask the user for permission to install them:

pip install PyPDF2 pdfplumber
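
The same check can also be done from Python, which makes it possible to report exactly which library is missing. This is a minimal sketch; the helper name missing_dependencies is hypothetical and not part of the skill itself.

```python
import importlib.util

# The two libraries this skill relies on
REQUIRED = ("PyPDF2", "pdfplumber")

def missing_dependencies(modules=REQUIRED):
    """Return the names of modules that cannot be found on the import path."""
    return [name for name in modules if importlib.util.find_spec(name) is None]

missing = missing_dependencies()
if missing:
    # Only report here; installation still needs the user's permission
    print("Need to install:", " ".join(missing))
```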

Step 2: Determine Task Type

Ask clarifying questions if ambiguous:

"Would you like to extract text, tables, or metadata?"
"Do you need all pages or specific pages?"
"What output format do you prefer?"

Step 3: Execute Based on Task

For Text Extraction:

import PyPDF2

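def extract_text(pdf_path):
    with open(pdf_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        # extract_text() can yield nothing for image-only pages, so fall back to ""
        pages = [page.extract_text() or "" for page in reader.pages]
    # Separate pages with a blank line
    return "\n\n".join(pages)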

# Usage
text = extract_text("path/to/file.pdf")
print(text)

For Table Extraction:

import pdfplumber

def extract_tables(pdf_path):
    tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            page_tables = page.extract_tables()
            if page_tables:
                tables.extend(page_tables)
    return tables

# Usage
tables = extract_tables("path/to/file.pdf")
# Convert to markdown or CSV as needed
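
The conversion step mentioned in the comment above could look roughly like this. It is a sketch that assumes each table is a list of rows, each row a list of cells, with None for empty cells, which is the shape extract_tables() returns. The helper names table_to_markdown and table_to_csv are hypothetical.

```python
import csv
import io

def _cell(value):
    # Empty cells arrive as None; line breaks inside a cell are flattened
    return "" if value is None else str(value).replace("\n", " ").strip()

def table_to_markdown(table):
    """Render one table as a markdown table, using the first row as header."""
    if not table:
        return ""
    rows = [[_cell(c).replace("|", "\\|") for c in row] for row in table]
    width = max(len(row) for row in rows)
    # Pad short rows so every row has the same number of columns
    rows = [row + [""] * (width - len(row)) for row in rows]
    header, *body = rows
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join(["---"] * width) + " |"]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)

def table_to_csv(table):
    """Render one table as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows([[_cell(c) for c in row] for row in table])
    return buffer.getvalue()
```

Both helpers take a single table, so a caller would loop over the list returned by extract_tables() and convert each entry.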

For Metadata:

import PyPDF2

def get_metadata(pdf_path):
    with open(pdf_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        # metadata is None when the PDF has no document information dictionary
        info = reader.metadata or {}
        return {
            'Author': info.get('/Author', 'Unknown'),
            'Title': info.get('/Title', 'Unknown'),
            'Subject': info.get('/Subject', 'Unknown'),
            'Creator': info.get('/Creator', 'Unknown'),
            'Producer': info.get('/Producer', 'Unknown'),
            'CreationDate': info.get('/CreationDate', 'Unknown'),
            'ModDate': info.get('/ModDate', 'Unknown'),
            'Pages': len(reader.pages)
        }

# Usage
metadata = get_metadata("path/to/file.pdf")
for key, value in metadata.items():
    print(f"{key}: {value}")
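
CreationDate and ModDate are usually stored as PDF date strings such as D:20251020140448+01'00', which are hard to read. A small parser can turn them into datetime objects for display. This is a sketch under the assumption that the value follows the common D:YYYYMMDDHHmmSS layout with an optional time zone; parse_pdf_date is a hypothetical helper, and anything that does not match is returned as None.

```python
import re
from datetime import datetime, timedelta, timezone

# Everything after the year is optional in a PDF date string
_PDF_DATE = re.compile(
    r"^D:(?P<year>\d{4})(?P<month>\d{2})?(?P<day>\d{2})?"
    r"(?P<hour>\d{2})?(?P<minute>\d{2})?(?P<second>\d{2})?"
    r"(?P<tz>[Zz]|[+-]\d{2}'?\d{2}'?)?$"
)

def parse_pdf_date(value):
    """Return a datetime for a PDF date string, or None if it does not match."""
    match = _PDF_DATE.match(str(value).strip())
    if not match:
        return None
    parts = match.groupdict()
    tzinfo = None
    tz = parts["tz"]
    if tz:
        if tz.upper() == "Z":
            tzinfo = timezone.utc
        else:
            digits = tz[1:].replace("'", "")
            offset = timedelta(hours=int(digits[:2]), minutes=int(digits[2:]))
            tzinfo = timezone(-offset if tz[0] == "-" else offset)
    try:
        return datetime(
            int(parts["year"]), int(parts["month"] or 1), int(parts["day"] or 1),
            int(parts["hour"] or 0), int(parts["minute"] or 0),
            int(parts["second"] or 0), tzinfo=tzinfo,
        )
    except ValueError:
        # Out-of-range values such as month 13
        return None
```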

For Merging PDFs:

import PyPDF2

def merge_pdfs(pdf_list, output_path):
    merger = PyPDF2.PdfMerger()
    for pdf in pdf_list:
        merger.append(pdf)
    merger.write(output_path)
    merger.close()

# Usage
merge_pdfs(["file1.pdf", "file2.pdf"], "merged.pdf")

For Splitting PDFs:

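import os
import PyPDF2

def split_pdf(pdf_path, output_dir):
    # Create the output directory if it does not exist yet
    os.makedirs(output_dir, exist_ok=True)
    with open(pdf_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        for i, page in enumerate(reader.pages):
            writer = PyPDF2.PdfWriter()
            writer.add_page(page)
            # os.path.join avoids a doubled slash when output_dir ends with "/"
            output_file = os.path.join(output_dir, f"page_{i+1}.pdf")
            with open(output_file, 'wb') as output:
                writer.write(output)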

# Usage
split_pdf("document.pdf", "output/")
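
Extracting specific pages, listed under Document Manipulation, follows the same pattern as splitting. The sketch below assumes the user gives pages as a one-based spec such as "1-3,5"; both parse_page_spec and extract_pages are hypothetical helpers, and PyPDF2 is imported inside the function so the parser can be used without it.

```python
def parse_page_spec(spec, page_count):
    """Turn a spec such as "1-3,5" into zero-based page indexes, in order."""
    indexes = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        start, _, end = part.partition("-")
        first = int(start)
        last = int(end) if end else first
        if not 1 <= first <= last <= page_count:
            raise ValueError(f"Page range {part!r} is outside 1-{page_count}")
        indexes.extend(range(first - 1, last))
    return indexes

def extract_pages(pdf_path, spec, output_path):
    """Write the pages selected by spec to a new PDF."""
    import PyPDF2  # imported lazily; only needed when a file is processed

    reader = PyPDF2.PdfReader(pdf_path)
    writer = PyPDF2.PdfWriter()
    for index in parse_page_spec(spec, len(reader.pages)):
        writer.add_page(reader.pages[index])
    with open(output_path, "wb") as output:
        writer.write(output)

# Usage (file names are placeholders):
# extract_pages("document.pdf", "1-3,5", "selected.pdf")
```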

Step 4: Present Results

For text: Display extracted content or save to file
For tables: Format as markdown table or save as CSV
For metadata: Display in readable format
For operations: Confirm success and output location

Step 5: Offer Next Steps

Suggest related actions:

"Would you like me to save this to a file?"
"Should I analyze this content?"
"Need to extract data from other PDFs?"

Error Handling

Common Errors

File not found
- Verify path exists
- Check file permissions
Encrypted PDF
- Ask user for password
- Use reader.decrypt(password)
Corrupted PDF
- Inform user
- Suggest trying pdfplumber as an alternative
Missing dependencies
- Install PyPDF2 and pdfplumber
- Provide installation commands
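
The checks above can be combined into one helper that opens a PDF and returns a readable error message instead of raising. This is a sketch, not the skill's own implementation: open_pdf is a hypothetical name, and the messages are illustrative.

```python
import os

def open_pdf(pdf_path, password=None):
    """Return (reader, error); exactly one of the two is None."""
    # File not found / permissions
    if not os.path.isfile(pdf_path):
        return None, f"File not found: {pdf_path}"
    if not os.access(pdf_path, os.R_OK):
        return None, f"No read permission: {pdf_path}"
    # Missing dependencies
    try:
        import PyPDF2
        from PyPDF2.errors import PdfReadError
    except ImportError:
        return None, "Missing dependency: pip install PyPDF2 pdfplumber"
    try:
        reader = PyPDF2.PdfReader(pdf_path)
        # Encrypted PDF
        if reader.is_encrypted:
            if password is None:
                return None, "PDF is encrypted: ask the user for the password"
            if not reader.decrypt(password):
                return None, "Wrong password for encrypted PDF"
        len(reader.pages)  # force the page tree to be parsed
    except PdfReadError as exc:
        # Corrupted PDF
        return None, f"PDF appears corrupted: {exc}"
    return reader, None
```

Returning a message rather than raising keeps the decision about what to tell the user in one place.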

Best Practices

Always verify file path before processing
Ask for confirmation before installing dependencies
Handle large PDFs carefully (show progress for many pages)
Preserve formatting when extracting tables
Offer multiple output formats (text, CSV, JSON, markdown)
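
Progress reporting for large PDFs can be kept separate from the extraction itself. The sketch below assumes pages is any sequence with a length (such as reader.pages) and extract is a callable applied to each page; extract_with_progress and its parameters are hypothetical.

```python
def extract_with_progress(pages, extract, report=print, every=10):
    """Apply extract to each page and report progress every few pages."""
    total = len(pages)
    chunks = []
    for number, page in enumerate(pages, start=1):
        chunks.append(extract(page) or "")
        # Report at regular intervals and once more at the end
        if number % every == 0 or number == total:
            report(f"Processed {number}/{total} pages")
    return "\n\n".join(chunks)

# Usage with PyPDF2 (assumes a reader is already open):
# text = extract_with_progress(reader.pages, lambda page: page.extract_text())
```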

Tool Restrictions

This skill has access to:

Read - For reading file paths and existing content
Bash(python *:*) - For running Python scripts
Bash(pip *:*) - For installing dependencies
Write - For saving extracted content

The skill has no access to other tools, which keeps it focused on PDF processing.

Testing Checklist

Before using with real user data:

Test with simple single-page PDF
Test with multi-page PDF
Test with PDF containing tables
Test with encrypted PDF
Test merge operation
Test split operation
Verify error handling works
Check output formatting is clear

Advanced Features

Form Filling

from PyPDF2 import PdfReader, PdfWriter

def fill_form(template_path, data, output_path):
    reader = PdfReader(template_path)
    writer = PdfWriter()

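    # Copy the template pages into the writer
    writer.append_pages_from_reader(reader)
    # Fill matching fields on every page that carries form widgets,
    # not only the first page
    for page in writer.pages:
        if "/Annots" in page:
            writer.update_page_form_field_values(page, data)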

    with open(output_path, 'wb') as output:
        writer.write(output)

OCR for Scanned PDFs

For scanned PDFs (images), suggest using OCR:

pip install pdf2image pytesseract
# Requires tesseract-ocr system package
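
A rough way to decide when to suggest OCR is to check whether normal extraction produced almost no text. The sketch below is a heuristic under stated assumptions: the threshold of 20 characters per page is arbitrary, and looks_scanned and ocr_pdf are hypothetical helpers. pdf2image also needs the poppler system package in addition to tesseract.

```python
def looks_scanned(page_texts, min_chars_per_page=20):
    """Heuristic: treat the PDF as scanned if its pages yield almost no text."""
    if not page_texts:
        return False
    total = sum(len((text or "").strip()) for text in page_texts)
    return total < min_chars_per_page * len(page_texts)

def ocr_pdf(pdf_path, dpi=300, lang="eng"):
    """Render each page to an image and run Tesseract on it."""
    # Imported lazily: these are optional dependencies
    from pdf2image import convert_from_path
    import pytesseract

    images = convert_from_path(pdf_path, dpi=dpi)
    return "\n\n".join(
        pytesseract.image_to_string(image, lang=lang) for image in images
    )
```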

Version History

1.0.0 (2025-10-20): Initial release
- Text extraction
- Table extraction
- Metadata reading
- Merge/split operations

Related Skills

document-converter - Convert between document formats
data-analyzer - Analyze extracted data
report-generator - Create reports from PDF data

Notes

Works best with text-based PDFs
For scanned PDFs, recommend OCR tools
Large PDFs may take time to process
Always preserve user's original files
