---
name: pdf-processor
description: Extract text, tables, and metadata from PDF files, fill PDF forms, and merge/split PDFs. Use when user mentions PDFs, documents, forms, or needs to extract content from PDF files.
allowed-tools: Read, Bash(python *:*), Bash(pip *:*), Write
version: 1.0.0
---

# PDF Processor Skill

Process PDF files: extract text/tables, read metadata, fill forms, merge/split documents.

## Capabilities

### 1. Text Extraction
Extract text content from PDF files for analysis or conversion.

### 2. Table Extraction
Extract tables from PDFs and convert to CSV, JSON, or markdown.

### 3. Metadata Reading
Read PDF metadata (author, creation date, page count, etc.).

### 4. Form Filling
Fill interactive PDF forms programmatically.

### 5. Document Manipulation
- Merge multiple PDFs
- Split PDFs into separate pages
- Extract specific pages

## Trigger Words

Use this skill when the user mentions:
- PDF files, documents
- "extract from PDF", "read PDF", "parse PDF"
- "PDF form", "fill form"
- "merge PDFs", "split PDF", "combine PDFs"
- "PDF to text", "PDF to CSV"

## Dependencies

This skill uses the Python libraries `PyPDF2` and `pdfplumber`:

```bash
pip install PyPDF2 pdfplumber
```

## Usage Examples

### Example 1: Extract Text
```
User: "Extract text from report.pdf"
Assistant: [Uses this skill to extract and display text]
```

### Example 2: Extract Tables
```
User: "Get the data table from financial-report.pdf"
Assistant: [Extracts tables and converts to markdown/CSV]
```

### Example 3: Read Metadata
```
User: "What's in this PDF? Show me the metadata"
Assistant: [Displays author, page count, creation date, etc.]
```

## Instructions

When this skill is invoked:

### Step 1: Verify Dependencies
Check whether the required Python libraries are installed:
```bash
python -c "import PyPDF2, pdfplumber" 2>/dev/null || echo "Need to install"
```

If they are not installed, ask the user for permission to install them:
```bash
pip install PyPDF2 pdfplumber
```
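
The shell check above only reports that something is missing, not which library. A minimal Python sketch of the same check that names the missing modules, so the permission request can be specific. The helper name `missing_modules` is hypothetical and not part of either library.

```python
import importlib.util


def missing_modules(names):
    """Return the module names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # The two libraries this skill depends on.
    missing = missing_modules(["PyPDF2", "pdfplumber"])
    if missing:
        print("Need to install: " + ", ".join(missing))
    else:
        print("All dependencies available")
```

Only the modules reported as missing then need to be passed to `pip install`.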

### Step 2: Determine Task Type

Ask clarifying questions if the request is ambiguous:
- "Would you like to extract text, tables, or metadata?"
- "Do you need all pages or specific pages?"
- "What output format do you prefer?"

### Step 3: Execute Based on Task

#### For Text Extraction:

```python
import PyPDF2

def extract_text(pdf_path):
    """Return the text of every page, with a blank line after each page."""
    with open(pdf_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        text = ""
        for page in reader.pages:
            # Guard against pages that yield no text (e.g. image-only pages)
            text += (page.extract_text() or "") + "\n\n"
    return text

# Usage
text = extract_text("path/to/file.pdf")
print(text)
```

#### For Table Extraction:

```python
import pdfplumber

def extract_tables(pdf_path):
    """Return all tables in the PDF; each table is a list of rows (lists of cells)."""
    tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            page_tables = page.extract_tables()
            if page_tables:
                tables.extend(page_tables)
    return tables

# Usage
tables = extract_tables("path/to/file.pdf")
# Convert to markdown or CSV as needed
```
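
The conversion step mentioned in the last comment could look like the sketch below. It assumes each table is a list of rows whose cells are strings or `None`, and that the first row is the header; that is an assumption about the data, not something the PDF guarantees. The helper names `table_to_markdown` and `table_to_csv` are hypothetical.

```python
import csv
import io


def table_to_markdown(table):
    """Render a table (list of rows, first row treated as header) as markdown."""
    def clean(cell):
        # Empty cells may arrive as None; newlines and pipes would break a row.
        text = "" if cell is None else str(cell)
        return text.replace("\n", " ").replace("|", "\\|")

    rows = [[clean(cell) for cell in row] for row in table]
    if not rows:
        return ""
    width = max(len(row) for row in rows)
    rows = [row + [""] * (width - len(row)) for row in rows]  # pad ragged rows
    lines = ["| " + " | ".join(rows[0]) + " |",
             "| " + " | ".join(["---"] * width) + " |"]
    lines += ["| " + " | ".join(row) + " |" for row in rows[1:]]
    return "\n".join(lines)


def table_to_csv(table):
    """Render a table as CSV text, writing None cells as empty fields."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in table:
        writer.writerow(["" if cell is None else cell for cell in row])
    return buffer.getvalue()
```

The `csv` module handles quoting of commas and quotes inside cells, so the CSV path needs no manual escaping.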

#### For Metadata:

```python
import PyPDF2

def get_metadata(pdf_path):
    with open(pdf_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        # metadata is None when the PDF has no document information dictionary
        info = reader.metadata or {}
        return {
            'Author': info.get('/Author', 'Unknown'),
            'Title': info.get('/Title', 'Unknown'),
            'Subject': info.get('/Subject', 'Unknown'),
            'Creator': info.get('/Creator', 'Unknown'),
            'Producer': info.get('/Producer', 'Unknown'),
            'CreationDate': info.get('/CreationDate', 'Unknown'),
            'ModDate': info.get('/ModDate', 'Unknown'),
            'Pages': len(reader.pages)
        }

# Usage
metadata = get_metadata("path/to/file.pdf")
for key, value in metadata.items():
    print(f"{key}: {value}")
```
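
The `CreationDate` and `ModDate` values come back as raw PDF date strings such as `D:20251020143000+02'00'`, which are hard to read. A minimal sketch of a parser for that format, assuming the usual `D:YYYYMMDDHHmmSS` layout with an optional timezone suffix. The helper name `parse_pdf_date` is hypothetical, and malformed values simply return `None`.

```python
import re
from datetime import datetime, timedelta, timezone

# D:YYYY[MM[DD[HH[mm[SS]]]]] followed by an optional Z or +HH'mm' / -HH'mm'
_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:([Zz+-])(?:(\d{2})'?(\d{2})?'?)?)?$"
)


def parse_pdf_date(value):
    """Convert a PDF date string to a datetime, or return None if it does not parse."""
    match = _PDF_DATE.match(str(value).strip())
    if not match:
        return None
    year, month, day, hour, minute, second, sign, tz_hour, tz_minute = match.groups()
    try:
        tzinfo = None
        if sign in ("Z", "z"):
            tzinfo = timezone.utc
        elif sign in ("+", "-"):
            offset = timedelta(hours=int(tz_hour or 0), minutes=int(tz_minute or 0))
            tzinfo = timezone(-offset if sign == "-" else offset)
        # Missing components default to the start of the period.
        return datetime(int(year), int(month or 1), int(day or 1),
                        int(hour or 0), int(minute or 0), int(second or 0),
                        tzinfo=tzinfo)
    except ValueError:
        return None  # e.g. month 13 or an out-of-range timezone offset
```

When displaying metadata, the parsed value can be shown with `isoformat()`, falling back to the raw string when parsing returns `None`.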

#### For Merging PDFs:

```python
import PyPDF2

def merge_pdfs(pdf_list, output_path):
    merger = PyPDF2.PdfMerger()
    for pdf in pdf_list:
        merger.append(pdf)
    merger.write(output_path)
    merger.close()

# Usage
merge_pdfs(["file1.pdf", "file2.pdf"], "merged.pdf")
```

#### For Splitting PDFs:

```python
import os
import PyPDF2

def split_pdf(pdf_path, output_dir):
    os.makedirs(output_dir, exist_ok=True)  # create the output folder if needed
    with open(pdf_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        for i, page in enumerate(reader.pages):
            # Write each page to its own single-page PDF
            writer = PyPDF2.PdfWriter()
            writer.add_page(page)
            output_file = os.path.join(output_dir, f"page_{i+1}.pdf")
            with open(output_file, 'wb') as output:
                writer.write(output)

# Usage
split_pdf("document.pdf", "output/")
```
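
Extracting specific pages (listed under Capabilities) follows the same pattern as splitting, once the user's request is turned into page indices. A minimal sketch of that translation step, assuming the user gives 1-based pages in a form like `1-3,5`. The helper name `parse_page_ranges` and the spec format are hypothetical.

```python
def parse_page_ranges(spec, page_count):
    """Convert a 1-based spec such as "1-3,5" into sorted zero-based page indices."""
    indices = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        start_text, sep, end_text = part.partition("-")
        start = int(start_text)
        end = int(end_text) if sep else start
        if start < 1 or end > page_count or start > end:
            raise ValueError(
                f"invalid page range {part!r} for a {page_count}-page PDF"
            )
        indices.update(range(start - 1, end))
    return sorted(indices)
```

Each returned index can be used as `reader.pages[index]` and passed to `writer.add_page`, as in the splitting example, with `page_count` taken from `len(reader.pages)`.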

### Step 4: Present Results

- For text: Display extracted content or save to file
- For tables: Format as markdown table or save as CSV
- For metadata: Display in readable format
- For operations: Confirm success and output location

### Step 5: Offer Next Steps

Suggest related actions:
- "Would you like me to save this to a file?"
- "Should I analyze this content?"
- "Need to extract data from other PDFs?"

## Error Handling

### Common Errors

1. **File not found**
   - Verify that the path exists
   - Check file permissions

2. **Encrypted PDF**
   - Ask the user for the password
   - Use `reader.decrypt(password)`

3. **Corrupted PDF**
   - Inform the user
   - Suggest trying `pdfplumber` as an alternative

4. **Missing dependencies**
   - Install PyPDF2 and pdfplumber (after the user confirms)
   - Provide installation commands
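
The file-level checks above can run before any PDF library is involved. A minimal sketch using only the standard library; it assumes a valid PDF carries the `%PDF-` marker within its first 1024 bytes, which catches obviously wrong or truncated files but does not prove the file is intact. The helper name `preflight_check` is hypothetical.

```python
import os


def preflight_check(pdf_path):
    """Return a list of human-readable problems; an empty list means none were found."""
    if not os.path.isfile(pdf_path):
        return [f"File not found: {pdf_path}"]
    if not os.access(pdf_path, os.R_OK):
        return [f"File is not readable (check permissions): {pdf_path}"]
    with open(pdf_path, "rb") as handle:
        head = handle.read(1024)
    if b"%PDF-" not in head:
        return ["File does not look like a PDF (no %PDF- header); it may be corrupted"]
    return []
```

Encrypted files pass this check, so the password prompt still has to happen after the reader is opened.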

## Best Practices

1. **Always verify the file path** before processing
2. **Ask for confirmation** before installing dependencies
3. **Handle large PDFs** carefully (show progress for many pages)
4. **Preserve formatting** when extracting tables
5. **Offer multiple output formats** (text, CSV, JSON, markdown)
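
For the progress reporting mentioned in practice 3, one option is a small generator wrapped around the page loop. This is a sketch only: the helper name `iter_with_progress`, the message wording, and the default interval of 10 pages are all assumptions.

```python
def iter_with_progress(items, total, every=10, report=print):
    """Yield items unchanged, reporting after every `every` items and after the last."""
    for count, item in enumerate(items, start=1):
        yield item
        # Runs once the caller has finished with the item just yielded.
        if count % every == 0 or count == total:
            report(f"Processed {count}/{total} pages")
```

It would wrap a page loop as `for page in iter_with_progress(reader.pages, len(reader.pages)):`, leaving the loop body unchanged.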

## Tool Restrictions

This skill has access to:
- `Read` - For reading file paths and existing content
- `Bash(python *:*)` - For running Python scripts
- `Bash(pip *:*)` - For installing dependencies
- `Write` - For saving extracted content

**No access to** other tools, which keeps the skill focused.

## Testing Checklist

Before using with real user data:

- [ ] Test with simple single-page PDF
- [ ] Test with multi-page PDF
- [ ] Test with PDF containing tables
- [ ] Test with encrypted PDF
- [ ] Test merge operation
- [ ] Test split operation
- [ ] Verify error handling works
- [ ] Check output formatting is clear

## Advanced Features

### Form Filling

```python
from PyPDF2 import PdfReader, PdfWriter

def fill_form(template_path, data, output_path):
    """Fill first-page form fields; `data` maps field names to values."""
    reader = PdfReader(template_path)
    writer = PdfWriter()

    # Copy all pages, then fill the form fields on the first page only
    writer.append_pages_from_reader(reader)
    writer.update_page_form_field_values(
        writer.pages[0], data
    )

    with open(output_path, 'wb') as output:
        writer.write(output)
```

### OCR for Scanned PDFs

For scanned PDFs (images), suggest using OCR:
```bash
pip install pdf2image pytesseract
# Requires the tesseract-ocr system package (pdf2image also needs poppler)
```
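
Before suggesting OCR, it helps to confirm that the PDF really lacks a text layer. A rough heuristic sketch: given the per-page strings returned by text extraction, flag pages with almost no text. The helper name `pages_needing_ocr` and the threshold of 20 characters are assumptions, and pages that are legitimately near-empty will be flagged too.

```python
def pages_needing_ocr(page_texts, min_chars=20):
    """Return 1-based page numbers whose extracted text is shorter than `min_chars`."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if len((text or "").strip()) < min_chars
    ]
```

If most pages are flagged, the document is probably scanned and OCR is worth offering.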

## Version History

- **1.0.0** (2025-10-20): Initial release
  - Text extraction
  - Table extraction
  - Metadata reading
  - Merge/split operations

## Related Skills

- **document-converter** - Convert between document formats
- **data-analyzer** - Analyze extracted data
- **report-generator** - Create reports from PDF data

## Notes

- Works best with text-based PDFs
- For scanned PDFs, recommend OCR tools
- Large PDFs may take time to process
- Always preserve the user's original files