| name | description | allowed-tools | version |
|---|---|---|---|
| pdf-processor | Extract text, tables, and metadata from PDF files, fill PDF forms, and merge/split PDFs. Use when user mentions PDFs, documents, forms, or needs to extract content from PDF files. | Read, Bash(python *:*), Bash(pip *:*), Write | 1.0.0 |
PDF Processor Skill
Process PDF files: extract text/tables, read metadata, fill forms, merge/split documents.
Capabilities
1. Text Extraction
Extract text content from PDF files for analysis or conversion.
2. Table Extraction
Extract tables from PDFs and convert to CSV, JSON, or markdown.
3. Metadata Reading
Read PDF metadata (author, creation date, page count, etc.).
4. Form Filling
Fill interactive PDF forms programmatically.
5. Document Manipulation
- Merge multiple PDFs
- Split PDFs into separate pages
- Extract specific pages
Trigger Words
Use this skill when the user mentions:
- PDF files, documents
- "extract from PDF", "read PDF", "parse PDF"
- "PDF form", "fill form"
- "merge PDFs", "split PDF", "combine PDFs"
- "PDF to text", "PDF to CSV"
Dependencies
This skill uses Python's PyPDF2 and pdfplumber libraries:
pip install PyPDF2 pdfplumber
Usage Examples
Example 1: Extract Text
User: "Extract text from report.pdf"
Assistant: [Uses this skill to extract and display text]
Example 2: Extract Tables
User: "Get the data table from financial-report.pdf"
Assistant: [Extracts tables and converts to markdown/CSV]
Example 3: Read Metadata
User: "What's in this PDF? Show me the metadata"
Assistant: [Displays author, page count, creation date, etc.]
Instructions
When this skill is invoked:
Step 1: Verify Dependencies
Check if required Python libraries are installed:
python -c "import PyPDF2, pdfplumber" 2>/dev/null || echo "Need to install"
If they are not installed, ask the user for permission to install them:
pip install PyPDF2 pdfplumber
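The same dependency check can be run from Python without a shell one-liner. This is a minimal sketch: `missing_modules` is a hypothetical helper name, and it assumes the import names `PyPDF2` and `pdfplumber` match the installed packages.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(["PyPDF2", "pdfplumber"])
if missing:
    # Report only; installation still needs the user's permission.
    print("Need to install: " + " ".join(missing))
```

Reporting exactly which packages are missing makes the permission request to the user more specific.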
Step 2: Determine Task Type
Ask clarifying questions if ambiguous:
- "Would you like to extract text, tables, or metadata?"
- "Do you need all pages or specific pages?"
- "What output format do you prefer?"
Step 3: Execute Based on Task
For Text Extraction:
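import PyPDF2

def extract_text(pdf_path):
    """Return the text of every page, separated by blank lines."""
    with open(pdf_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        text = ""
        for page in reader.pages:
            # Guard against pages that yield no text (e.g. image-only pages)
            text += (page.extract_text() or "") + "\n\n"
    return text

# Usage
text = extract_text("path/to/file.pdf")
print(text)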
For Table Extraction:
import pdfplumber

def extract_tables(pdf_path):
    tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            page_tables = page.extract_tables()
            if page_tables:
                tables.extend(page_tables)
    return tables

# Usage
tables = extract_tables("path/to/file.pdf")
# Convert to markdown or CSV as needed
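The markdown conversion mentioned above can be sketched as follows. This is a minimal sketch, assuming each table is a list of rows whose cells are strings or `None` (the shape `page.extract_tables()` returns) and that the first row is the header. `table_to_markdown` is a hypothetical helper name.

```python
def table_to_markdown(table):
    """Render one table (a list of rows) as a markdown table.

    The first row is treated as the header; None cells become empty strings.
    """
    if not table:
        return ""

    def clean(cell):
        text = "" if cell is None else str(cell)
        # Newlines and pipes inside a cell would break the markdown row.
        return text.replace("\n", " ").replace("|", "\\|").strip()

    rows = [[clean(cell) for cell in row] for row in table]
    width = max(len(row) for row in rows)
    # Pad short rows so every row has the same number of columns.
    rows = [row + [""] * (width - len(row)) for row in rows]

    lines = [
        "| " + " | ".join(rows[0]) + " |",
        "| " + " | ".join(["---"] * width) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in rows[1:]]
    return "\n".join(lines)


sample = [["Item", "Qty"], ["Widget", "2"], ["Gadget", None]]
print(table_to_markdown(sample))
```

Treating the first row as the header is a guess that holds for many tables but not all, so it is worth confirming with the user when the output looks wrong.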
For Metadata:
import PyPDF2

def get_metadata(pdf_path):
    with open(pdf_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        # reader.metadata is None when the PDF has no document info
        info = reader.metadata or {}
        return {
            'Author': info.get('/Author', 'Unknown'),
            'Title': info.get('/Title', 'Unknown'),
            'Subject': info.get('/Subject', 'Unknown'),
            'Creator': info.get('/Creator', 'Unknown'),
            'Producer': info.get('/Producer', 'Unknown'),
            'CreationDate': info.get('/CreationDate', 'Unknown'),
            'ModDate': info.get('/ModDate', 'Unknown'),
            'Pages': len(reader.pages)
        }

# Usage
metadata = get_metadata("path/to/file.pdf")
for key, value in metadata.items():
    print(f"{key}: {value}")
For Merging PDFs:
import PyPDF2

def merge_pdfs(pdf_list, output_path):
    merger = PyPDF2.PdfMerger()
    for pdf in pdf_list:
        merger.append(pdf)
    merger.write(output_path)
    merger.close()

# Usage
merge_pdfs(["file1.pdf", "file2.pdf"], "merged.pdf")
For Splitting PDFs:
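import os
import PyPDF2

def split_pdf(pdf_path, output_dir):
    # Create the output directory if it does not exist yet
    os.makedirs(output_dir, exist_ok=True)
    with open(pdf_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        for i, page in enumerate(reader.pages):
            writer = PyPDF2.PdfWriter()
            writer.add_page(page)
            # os.path.join avoids a doubled slash when output_dir ends in "/"
            output_file = os.path.join(output_dir, f"page_{i+1}.pdf")
            with open(output_file, 'wb') as output:
                writer.write(output)

# Usage
split_pdf("document.pdf", "output/")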
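The "extract specific pages" capability listed under Document Manipulation has no example of its own, so here is one possible sketch. The page-spec syntax (`"1-3,5"`, 1-based, inclusive) and the names `parse_page_ranges` and `extract_pages` are assumptions made for illustration, not part of the skill.

```python
def parse_page_ranges(spec, page_count):
    """Turn a spec like "1-3,5" into a list of zero-based page indexes.

    Page numbers in `spec` are 1-based and ranges are inclusive.
    Raises ValueError for malformed parts or pages outside 1..page_count.
    """
    indexes = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        start, sep, end = part.partition("-")
        first = int(start)
        last = int(end) if sep else first
        if first < 1 or last > page_count or first > last:
            raise ValueError(f"Invalid page range: {part!r}")
        indexes.extend(range(first - 1, last))
    return indexes


def extract_pages(pdf_path, spec, output_path):
    """Write the pages selected by `spec` to a new PDF at `output_path`."""
    import PyPDF2  # imported here so the parser above works without PyPDF2

    reader = PyPDF2.PdfReader(pdf_path)
    writer = PyPDF2.PdfWriter()
    for index in parse_page_ranges(spec, len(reader.pages)):
        writer.add_page(reader.pages[index])
    with open(output_path, "wb") as output:
        writer.write(output)
```

For example, `extract_pages("document.pdf", "1-3,5", "selection.pdf")` would copy pages 1, 2, 3, and 5 into a new file and leave the original untouched.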
Step 4: Present Results
- For text: Display extracted content or save to file
- For tables: Format as markdown table or save as CSV
- For metadata: Display in readable format
- For operations: Confirm success and output location
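For saving tables, one way to produce CSV and JSON text uses only the standard library. This is a sketch under assumptions: each table is a list of rows with string or `None` cells, the first row is the header for JSON output, and the helper names `table_to_csv` and `table_to_json` are hypothetical.

```python
import csv
import io
import json


def table_to_csv(table):
    """Return the table as CSV text; None cells become empty fields."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in table:
        writer.writerow(["" if cell is None else cell for cell in row])
    return buffer.getvalue()


def table_to_json(table):
    """Return the table as a JSON array of objects keyed by the header row."""
    if not table:
        return "[]"
    header, *rows = table
    # Fall back to a positional name when a header cell is empty.
    keys = [str(h) if h is not None else f"column_{i + 1}"
            for i, h in enumerate(header)]
    records = [dict(zip(keys, row)) for row in rows]
    return json.dumps(records, indent=2)


sample = [["Item", "Qty"], ["Widget", "2"], ["Gadget", None]]
print(table_to_csv(sample))
print(table_to_json(sample))
```

The returned strings can be displayed directly or written to a file with the Write tool.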
Step 5: Offer Next Steps
Suggest related actions:
- "Would you like me to save this to a file?"
- "Should I analyze this content?"
- "Need to extract data from other PDFs?"
Error Handling
Common Errors
- File not found
  - Verify path exists
  - Check file permissions
- Encrypted PDF
  - Ask user for password
  - Use reader.decrypt(password)
- Corrupted PDF
  - Inform user
  - Suggest using pdfplumber as an alternative
- Missing dependencies
  - Install PyPDF2 and pdfplumber
  - Provide installation commands
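The first three cases can be handled in one place when opening a file. The following is a sketch, not a definitive implementation: `open_pdf` is a hypothetical helper, and it assumes a PyPDF2 release (2.x or later) that provides `PyPDF2.errors.PdfReadError` and `reader.is_encrypted`.

```python
import os


def open_pdf(pdf_path, password=None):
    """Open a PDF and return a reader, raising a descriptive error on failure."""
    # File not found / permissions: check before touching the PDF library.
    if not os.path.isfile(pdf_path):
        raise FileNotFoundError(f"No such file: {pdf_path}")
    if not os.access(pdf_path, os.R_OK):
        raise PermissionError(f"File is not readable: {pdf_path}")

    from PyPDF2 import PdfReader
    from PyPDF2.errors import PdfReadError

    # Corrupted PDF: PyPDF2 signals unreadable files with PdfReadError.
    try:
        reader = PdfReader(pdf_path)
    except PdfReadError as exc:
        raise ValueError(f"PDF appears to be corrupted: {pdf_path}") from exc

    # Encrypted PDF: decrypt() returns a falsy value for a wrong password.
    if reader.is_encrypted:
        if password is None:
            raise ValueError("PDF is encrypted; a password is required")
        if not reader.decrypt(password):
            raise ValueError("Incorrect password for encrypted PDF")
    return reader
```

Each error message maps to one of the cases above, so the assistant can tell the user what went wrong and what to try next.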
Best Practices
- Always verify file path before processing
- Ask for confirmation before installing dependencies
- Handle large PDFs carefully (show progress for many pages)
- Preserve formatting when extracting tables
- Offer multiple output formats (text, CSV, JSON, markdown)
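Progress reporting for large PDFs can be added without changing the extraction loops. This is a minimal sketch, assuming a report every ten pages is enough; `with_progress` is a hypothetical helper name.

```python
import sys


def with_progress(items, total, every=10, stream=sys.stderr):
    """Yield items unchanged, printing a progress line every `every` items."""
    for count, item in enumerate(items, start=1):
        if count % every == 0 or count == total:
            print(f"Processed {count}/{total} pages", file=stream)
        yield item


# Usage with any page loop, for example:
# for page in with_progress(reader.pages, total=len(reader.pages)):
#     text += (page.extract_text() or "") + "\n\n"
```

Writing progress to stderr keeps it separate from extracted text printed to stdout.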
Tool Restrictions
This skill has access to:
- Read - For reading file paths and existing content
- Bash(python *:*) - For running Python scripts
- Bash(pip *:*) - For installing dependencies
- Write - For saving extracted content
No access to other tools to maintain focus.
Testing Checklist
Before using with real user data:
- Test with simple single-page PDF
- Test with multi-page PDF
- Test with PDF containing tables
- Test with encrypted PDF
- Test merge operation
- Test split operation
- Verify error handling works
- Check output formatting is clear
Advanced Features
Form Filling
from PyPDF2 import PdfReader, PdfWriter

def fill_form(template_path, data, output_path):
    reader = PdfReader(template_path)
    writer = PdfWriter()
    # Copy all pages from the template into the writer
    writer.append_pages_from_reader(reader)
    # Fill form fields; `data` maps field names to values.
    # Only fields on the first page are updated here.
    writer.update_page_form_field_values(
        writer.pages[0], data
    )
    with open(output_path, 'wb') as output:
        writer.write(output)
OCR for Scanned PDFs
For scanned PDFs (images), suggest using OCR:
pip install pdf2image pytesseract
# Requires tesseract-ocr system package
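An OCR fallback could look like the sketch below. It assumes `pdf2image` and `pytesseract` are installed along with the tesseract system package, and note that `pdf2image` also relies on the poppler utilities. `needs_ocr`, `ocr_pdf`, and the 20-character threshold are hypothetical choices for illustration.

```python
def needs_ocr(page_texts, min_chars=20):
    """Guess whether a PDF is scanned.

    Returns True when there is at least one page and every page yielded
    fewer than `min_chars` characters of extractable text.
    """
    texts = list(page_texts)
    return bool(texts) and all(
        len((text or "").strip()) < min_chars for text in texts
    )


def ocr_pdf(pdf_path, dpi=300):
    """Render each page to an image and run OCR on it; returns the text."""
    # Imported here so needs_ocr() works without the OCR packages installed.
    from pdf2image import convert_from_path
    import pytesseract

    images = convert_from_path(pdf_path, dpi=dpi)
    return "\n\n".join(pytesseract.image_to_string(image) for image in images)
```

A typical flow would run normal text extraction first, pass the per-page results to `needs_ocr`, and only suggest OCR to the user when it returns True.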
Version History
- 1.0.0 (2025-10-20): Initial release
  - Text extraction
  - Table extraction
  - Metadata reading
  - Merge/split operations
Related Skills
- document-converter - Convert between document formats
- data-analyzer - Analyze extracted data
- report-generator - Create reports from PDF data
Notes
- Works best with text-based PDFs
- For scanned PDFs, recommend OCR tools
- Large PDFs may take time to process
- Always preserve user's original files