---
name: pdf-processor
description: Extract text, tables, and metadata from PDF files, fill PDF forms, and merge/split PDFs. Use when user mentions PDFs, documents, forms, or needs to extract content from PDF files.
allowed-tools: Read, Bash(python *:*), Bash(pip *:*), Write
version: 1.0.0
---

# PDF Processor Skill

Process PDF files: extract text/tables, read metadata, fill forms, merge/split documents.

## Capabilities

### 1. Text Extraction
Extract text content from PDF files for analysis or conversion.

### 2. Table Extraction
Extract tables from PDFs and convert to CSV, JSON, or markdown.

### 3. Metadata Reading
Read PDF metadata (author, creation date, page count, etc.).

### 4. Form Filling
Fill interactive PDF forms programmatically.

### 5. Document Manipulation
- Merge multiple PDFs
- Split PDFs into separate pages
- Extract specific pages

## Trigger Words

Use this skill when user mentions:
- PDF files, documents
- "extract from PDF", "read PDF", "parse PDF"
- "PDF form", "fill form"
- "merge PDFs", "split PDF", "combine PDFs"
- "PDF to text", "PDF to CSV"

## Dependencies

This skill uses Python's `PyPDF2` and `pdfplumber` libraries:

```bash
pip install PyPDF2 pdfplumber
```

## Usage Examples

### Example 1: Extract Text
```
User: "Extract text from report.pdf"
Assistant: [Uses this skill to extract and display text]
```

### Example 2: Extract Tables
```
User: "Get the data table from financial-report.pdf"
Assistant: [Extracts tables and converts to markdown/CSV]
```

### Example 3: Read Metadata
```
User: "What's in this PDF? Show me the metadata"
Assistant: [Displays author, page count, creation date, etc.]
```

## Instructions

When this skill is invoked:

### Step 1: Verify Dependencies

Check if required Python libraries are installed:

```bash
python -c "import PyPDF2, pdfplumber" 2>/dev/null || echo "Need to install"
```

If not installed, ask user permission to install:

```bash
pip install PyPDF2 pdfplumber
```

### Step 2: Determine Task Type

Ask clarifying questions if ambiguous:
- "Would you like to extract text, tables, or metadata?"
- "Do you need all pages or specific pages?"
- "What output format do you prefer?"

### Step 3: Execute Based on Task

#### For Text Extraction:

```python
import PyPDF2

def extract_text(pdf_path):
    with open(pdf_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        text = ""
        for page in reader.pages:
            # Image-only pages may yield no text, so fall back to ""
            text += (page.extract_text() or "") + "\n\n"
    return text

# Usage
text = extract_text("path/to/file.pdf")
print(text)
```

#### For Table Extraction:

```python
import pdfplumber

def extract_tables(pdf_path):
    tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            page_tables = page.extract_tables()
            if page_tables:
                tables.extend(page_tables)
    return tables

# Usage
tables = extract_tables("path/to/file.pdf")
# Convert to markdown or CSV as needed
```

#### For Metadata:

```python
import PyPDF2

def get_metadata(pdf_path):
    with open(pdf_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        # metadata is None when the PDF carries no document information
        info = reader.metadata or {}
        return {
            'Author': info.get('/Author', 'Unknown'),
            'Title': info.get('/Title', 'Unknown'),
            'Subject': info.get('/Subject', 'Unknown'),
            'Creator': info.get('/Creator', 'Unknown'),
            'Producer': info.get('/Producer', 'Unknown'),
            'CreationDate': info.get('/CreationDate', 'Unknown'),
            'ModDate': info.get('/ModDate', 'Unknown'),
            'Pages': len(reader.pages)
        }

# Usage
metadata = get_metadata("path/to/file.pdf")
for key, value in metadata.items():
    print(f"{key}: {value}")
```

#### For Merging PDFs:

```python
import PyPDF2

def merge_pdfs(pdf_list, output_path):
    merger = PyPDF2.PdfMerger()
    for pdf in pdf_list:
        merger.append(pdf)
    merger.write(output_path)
    merger.close()

# Usage
merge_pdfs(["file1.pdf", "file2.pdf"], "merged.pdf")
```

#### For Splitting PDFs:

```python
import os
import PyPDF2

def split_pdf(pdf_path, output_dir):
    # Create the output directory if it does not exist yet
    os.makedirs(output_dir, exist_ok=True)
    with open(pdf_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        for i, page in enumerate(reader.pages):
            # One single-page writer per source page
            writer = PyPDF2.PdfWriter()
            writer.add_page(page)
            output_file = os.path.join(output_dir, f"page_{i+1}.pdf")
            with open(output_file, 'wb') as output:
                writer.write(output)

# Usage
split_pdf("document.pdf", "output/")
```

### Step 4: Present Results

- For text: Display extracted content or save to file
- For tables: Format as markdown table or save as CSV
- For metadata: Display in readable format
- For operations: Confirm success and output location

### Step 5: Offer Next Steps

Suggest related actions:
- "Would you like me to save this to a file?"
- "Should I analyze this content?"
- "Need to extract data from other PDFs?"

## Error Handling

### Common Errors

1. **File not found**
   - Verify path exists
   - Check file permissions

2. **Encrypted PDF**
   - Ask user for password
   - Use `reader.decrypt(password)`

3. **Corrupted PDF**
   - Inform user
   - Suggest using `pdfplumber` as alternative

4. **Missing dependencies**
   - Install PyPDF2 and pdfplumber
   - Provide installation commands

## Best Practices

1. **Always verify file path** before processing
2. **Ask for confirmation** before installing dependencies
3. **Handle large PDFs** carefully (show progress for many pages)
4. **Preserve formatting** when extracting tables
5. **Offer multiple output formats** (text, CSV, JSON, markdown)

## Tool Restrictions

This skill has access to:
- `Read` - For reading file paths and existing content
- `Bash(python *:*)` - For running Python scripts
- `Bash(pip *:*)` - For installing dependencies
- `Write` - For saving extracted content

**No access to** other tools to maintain focus.

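
Step 4 and the Best Practices above call for presenting tables as markdown or CSV. A minimal sketch of that conversion follows, assuming each table is a list of rows whose cells are strings or `None` (the shape `page.extract_tables()` is documented to return) and that the first row is the header. The names `table_to_markdown` and `table_to_csv` are hypothetical helpers, not part of PyPDF2 or pdfplumber.

```python
import csv
import io

def table_to_markdown(table):
    """Render a list-of-rows table as markdown; the first row is treated as the header."""
    if not table:
        return ""

    def clean(cell):
        # None becomes empty, newlines are flattened, pipes are escaped
        text = "" if cell is None else str(cell)
        return text.replace("\n", " ").replace("|", "\\|").strip()

    rows = [[clean(cell) for cell in row] for row in table]
    # Pad short rows so every row has the same number of columns
    width = max(len(row) for row in rows)
    rows = [row + [""] * (width - len(row)) for row in rows]

    lines = ["| " + " | ".join(rows[0]) + " |",
             "| " + " | ".join(["---"] * width) + " |"]
    lines += ["| " + " | ".join(row) + " |" for row in rows[1:]]
    return "\n".join(lines)

def table_to_csv(table):
    """Serialize the same table shape as CSV text, mapping None cells to empty fields."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in table:
        writer.writerow(["" if cell is None else cell for cell in row])
    return buffer.getvalue()

# Usage with a hand-written table in the assumed shape
sample = [["Item", "Qty"], ["Widget", None], ["Gadget", "3"]]
print(table_to_markdown(sample))
print(table_to_csv(sample))
```

Using the `csv` module rather than joining on commas keeps quoting correct when a cell itself contains a comma or a quote.
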
## Testing Checklist

Before using with real user data:

- [ ] Test with simple single-page PDF
- [ ] Test with multi-page PDF
- [ ] Test with PDF containing tables
- [ ] Test with encrypted PDF
- [ ] Test merge operation
- [ ] Test split operation
- [ ] Verify error handling works
- [ ] Check output formatting is clear

## Advanced Features

### Form Filling

```python
from PyPDF2 import PdfReader, PdfWriter

def fill_form(template_path, data, output_path):
    reader = PdfReader(template_path)
    writer = PdfWriter()

    # Copy all pages, then fill form fields
    # (only fields on the first page are updated here)
    writer.append_pages_from_reader(reader)
    writer.update_page_form_field_values(
        writer.pages[0], data
    )

    with open(output_path, 'wb') as output:
        writer.write(output)
```

### OCR for Scanned PDFs

For scanned PDFs (images), suggest using OCR:

```bash
pip install pdf2image pytesseract
# Requires tesseract-ocr system package
```

## Version History

- **1.0.0** (2025-10-20): Initial release
  - Text extraction
  - Table extraction
  - Metadata reading
  - Merge/split operations

## Related Skills

- **document-converter** - Convert between document formats
- **data-analyzer** - Analyze extracted data
- **report-generator** - Create reports from PDF data

## Notes

- Works best with text-based PDFs
- For scanned PDFs, recommend OCR tools
- Large PDFs may take time to process
- Always preserve user's original files
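
The last note, preserving the user's original files, could be enforced before any merge, split, or form-fill writes its output. The sketch below is one possible guard using only the standard library. `safe_output_path` is a hypothetical helper name, and the `_1`, `_2` numbering scheme is an assumption rather than anything this skill prescribes.

```python
import tempfile
from pathlib import Path

def safe_output_path(source, output):
    """Return a path for writing that is neither the source file nor an existing file."""
    source = Path(source).resolve()
    target = Path(output).resolve()
    stem, ext = target.stem, target.suffix
    counter = 1
    # Append _1, _2, ... until the name is free and differs from the source
    while target == source or target.exists():
        target = target.with_name(f"{stem}_{counter}{ext}")
        counter += 1
    return target

# Usage: asking to write over the source yields a sibling file instead
workdir = Path(tempfile.mkdtemp()).resolve()
original = workdir / "form.pdf"
original.write_bytes(b"%PDF-1.4")
print(safe_output_path(original, original).name)
```

A function such as `fill_form` could pass its `output_path` through this helper first, so a caller who supplies the template path as the output does not overwrite the template.
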