storycove/EPUB_IMPORT_EXPORT_SPECIFICATION.md
2025-08-08 14:09:14 +02:00

# EPUB Import/Export Specification
## 🎉 Phase 1 Implementation Complete
**Status**: Phase 1 fully implemented and operational as of August 2025
**Key Achievements**:
- ✅ Complete EPUB import functionality with validation and error handling
- ✅ Single story EPUB export with XML validation fixes
- ✅ Reading position preservation using EPUB CFI standards
- ✅ Full frontend UI integration with navigation and authentication
- ✅ Moved export button to Story Detail View for better UX
- ✅ Added EPUB import to main Add Story menu dropdown
## Overview
This specification defines the requirements and implementation details for importing and exporting EPUB files in StoryCove. The feature enables users to import stories from EPUB files and export their stories/collections as EPUB files with preserved reading positions.
## Scope
### In Scope
- **EPUB Import**: Parse DRM-free EPUB files and import as stories
- **EPUB Export**: Export individual stories and collections as EPUB files
- **Reading Position Preservation**: Store and restore reading positions using EPUB standards
- **Metadata Handling**: Extract and preserve story metadata (title, author, cover, etc.)
- **Content Processing**: HTML content sanitization and formatting
### Out of Scope (Phase 1)
- DRM-protected EPUB files (future consideration)
- Real-time reading position sync between devices
- Advanced EPUB features (audio, video, interactive content)
- EPUB validation beyond basic structure
## Technical Architecture
### Backend Implementation
- **Language**: Java (Spring Boot)
- **Primary Library**: EPUBLib (com.positiondev.epublib:epublib-core:3.1)
- **Processing**: Server-side generation and parsing
- **File Handling**: Multipart file upload for import, streaming download for export
### Dependencies
```xml
<dependency>
    <groupId>com.positiondev.epublib</groupId>
    <artifactId>epublib-core</artifactId>
    <version>3.1</version>
</dependency>
```
### Phase 1 Implementation Notes
- **EPUBImportService**: Implemented with full validation, metadata extraction, and reading position handling
- **EPUBExportService**: Implemented with XML validation fixes for EPUB reader compatibility
- **ReadingPosition Entity**: Created with EPUB CFI support and database indexing
- **Authentication**: All endpoints secured with JWT authentication and proper frontend integration
- **UI Integration**: Export moved to Story Detail View, Import added to main navigation menu
- **XML Compliance**: Fixed XHTML validation issues by properly formatting self-closing tags (`<br>` → `<br />`)
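The self-closing tag fix noted above can be illustrated with a small regex-based sketch. This is not the actual EPUBExportService code: the class name `XhtmlVoidTagSketch` and the choice of tags (`br`, `hr`, `img`) are assumptions made for illustration, and the pattern does not handle attribute values that contain `<` or `>`.

```java
import java.util.regex.Pattern;

// Hypothetical helper: rewrites a few HTML void elements as XHTML self-closing tags.
final class XhtmlVoidTagSketch {
    // Matches <br>, <br/>, <hr >, <img ...> and similar (lowercase names only).
    private static final Pattern VOID_TAG = Pattern.compile(
        "<(br|hr|img)(?=[\\s/>])([^<>]*?)\\s*/?>");

    static String selfClose(String html) {
        // $1 is the tag name, $2 the attributes (possibly empty).
        return VOID_TAG.matcher(html).replaceAll("<$1$2 />");
    }
}
```

For example, `XhtmlVoidTagSketch.selfClose("a<br>b")` yields `a<br />b`. A parser-based serializer is usually more robust than a regex for arbitrary input.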
## EPUB Import Specification
### Supported Formats
- **EPUB 2.0** and **EPUB 3.x** formats
- **DRM-Free** files only
- **Maximum file size**: 50MB
- **Supported content**: Text-based stories with HTML content
### Import Process Flow
1. **File Upload**: User uploads EPUB file via web interface
2. **Validation**: Check file format, size, and basic EPUB structure
3. **Parsing**: Extract metadata, content, and resources using EPUBLib
4. **Content Processing**: Sanitize HTML content using existing Jsoup pipeline
5. **Story Creation**: Create Story entity with extracted data
6. **Preview**: Show extracted story details for user confirmation
7. **Finalization**: Save story to database with imported metadata
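The validation step (step 2) can be sketched with only the JDK's zip classes. This is a minimal sketch, not the actual `validateEPUBFile` implementation: the class name `EpubValidationSketch` and the constant `MAX_BYTES` are assumptions. It checks the 50MB limit and that the archive's first entry is `mimetype` with the content `application/epub+zip`, and it returns the error messages listed in this specification.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper; the real EPUBImportService.validateEPUBFile may differ.
final class EpubValidationSketch {
    static final long MAX_BYTES = 50L * 1024 * 1024; // 50MB limit from this spec

    /** Returns null when the basic checks pass, otherwise an error message. */
    static String validate(byte[] data) {
        if (data.length > MAX_BYTES) {
            return "File size exceeds 50MB limit";
        }
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(data))) {
            ZipEntry first = zip.getNextEntry();
            if (first == null || !"mimetype".equals(first.getName())) {
                return "Invalid EPUB file format";
            }
            // Read the content of the mimetype entry only.
            String mime = new String(zip.readAllBytes(), StandardCharsets.US_ASCII).trim();
            return "application/epub+zip".equals(mime) ? null : "Invalid EPUB file format";
        } catch (IOException e) {
            return "EPUB file appears to be corrupted";
        }
    }
}
```

Deeper structural checks (container.xml, package document) are left to the parsing step.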
### Metadata Mapping
```java
// EPUB Metadata → StoryCove Story Entity
epub.getMetadata().getFirstTitle()           → story.title
epub.getMetadata().getAuthors().get(0)       → story.authorName
epub.getMetadata().getDescriptions().get(0)  → story.summary
epub.getCoverImage()                         → story.coverPath
epub.getMetadata().getSubjects()             → story.tags
```
### Content Extraction
- **Multi-chapter EPUBs**: Combine all content files into single HTML
- **Chapter separation**: Insert `<hr>` or `<h2>` tags between chapters
- **HTML sanitization**: Apply existing sanitization rules
- **Image handling**: Extract and store cover images, inline images optional
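The chapter-combining rule above can be sketched as follows. This is an illustrative sketch only: the class name `ChapterCombinerSketch` is an assumption, the input is taken to be already-sanitized chapter bodies, and `<hr />` is chosen as the separator (the specification also allows `<h2>` headings).

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper: joins chapter bodies into one HTML string.
final class ChapterCombinerSketch {
    static String combine(List<String> chapterBodies) {
        return chapterBodies.stream()
            .map(String::trim)
            .filter(body -> !body.isEmpty())          // skip empty content files
            .collect(Collectors.joining("\n<hr />\n")); // separator between chapters
    }
}
```

Empty content files are dropped so that the result never contains two consecutive separators.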
### API Endpoints
#### POST /api/stories/import-epub
```java
@PostMapping("/import-epub")
public ResponseEntity<?> importEPUB(@RequestParam("file") MultipartFile file) {
    // Implementation in EPUBImportService
}
```
**Request**: Multipart file upload
**Response**:
```json
{
  "message": "EPUB imported successfully",
  "storyId": "uuid",
  "extractedData": {
    "title": "Story Title",
    "author": "Author Name",
    "summary": "Story description",
    "chapterCount": 12,
    "wordCount": 45000,
    "hasCovers": true
  }
}
```
## EPUB Export Specification
### Export Types
1. **Single Story Export**: Convert one story to EPUB
2. **Collection Export**: Multiple stories as single EPUB with chapters
### EPUB Structure Generation
```
story.epub
├── mimetype
├── META-INF/
│   └── container.xml
└── OEBPS/
    ├── content.opf        # Package metadata
    ├── toc.ncx            # Navigation
    ├── stylesheet.css     # Styling
    ├── cover.html         # Cover page
    ├── chapter001.xhtml   # Story content
    ├── images/
    │   └── cover.jpg      # Cover image
    └── fonts/ (optional)
```
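One detail of this layout is easy to get wrong when writing the archive by hand: the EPUB container format expects `mimetype` to be the first zip entry and to be stored uncompressed. The sketch below shows that ordering with the JDK's `ZipOutputStream`. The class name `EpubContainerSketch` is an assumption, and only the first two entries are written; it is meant to make the layout concrete, not to replace the EPUB library.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical helper: writes the start of an EPUB archive.
final class EpubContainerSketch {
    static byte[] writeSkeleton() {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            // 1. mimetype: first entry, STORED (no compression).
            byte[] mime = "application/epub+zip".getBytes(StandardCharsets.US_ASCII);
            CRC32 crc = new CRC32();
            crc.update(mime);
            ZipEntry mimetype = new ZipEntry("mimetype");
            mimetype.setMethod(ZipEntry.STORED);
            mimetype.setSize(mime.length);
            mimetype.setCompressedSize(mime.length);
            mimetype.setCrc(crc.getValue());
            zip.putNextEntry(mimetype);
            zip.write(mime);
            zip.closeEntry();

            // 2. META-INF/container.xml: points readers at the package document.
            String container = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
                + "<container version=\"1.0\" "
                + "xmlns=\"urn:oasis:names:tc:opendocument:xmlns:container\">\n"
                + "  <rootfiles>\n"
                + "    <rootfile full-path=\"OEBPS/content.opf\" "
                + "media-type=\"application/oebps-package+xml\"/>\n"
                + "  </rootfiles>\n"
                + "</container>\n";
            zip.putNextEntry(new ZipEntry("META-INF/container.xml"));
            zip.write(container.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```

The remaining entries under `OEBPS/` can use the default compressed method.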
### Reading Position Implementation
#### EPUB 3 CFI (Canonical Fragment Identifier)
```xml
<!-- In content.opf metadata -->
<meta property="epub-cfi" content="/6/4[chap01]!/4[body01]/10[para05]/3:142"/>
<meta property="reading-percentage" content="0.65"/>
<meta property="last-read-timestamp" content="2023-12-07T10:30:00Z"/>
```
#### StoryCove Custom Metadata (Fallback)
```xml
<meta name="storycove:reading-chapter" content="3"/>
<meta name="storycove:reading-paragraph" content="15"/>
<meta name="storycove:reading-offset" content="142"/>
<meta name="storycove:reading-percentage" content="0.65"/>
```
#### CFI Generation Logic
```java
public String generateCFI(ReadingPosition position) {
    return String.format("/6/%d[chap%02d]!/4[body01]/%d[para%02d]/3:%d",
        (position.getChapterIndex() * 2) + 4,
        position.getChapterIndex(),
        (position.getParagraphIndex() * 2) + 4,
        position.getParagraphIndex(),
        position.getCharacterOffset());
}
```
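Importing a reading position needs the inverse operation. The sketch below parses only CFIs of the exact shape produced by the format string above and recovers the three indices. The class name `CfiSketch`, the plain `int` parameters (instead of a `ReadingPosition`), and the `int[]` return type are assumptions made to keep the example self-contained; general EPUB CFI parsing is considerably more involved.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: round-trips the CFI shape used in this specification.
final class CfiSketch {
    private static final Pattern CFI = Pattern.compile(
        "^/6/(\\d{1,9})\\[chap\\d+\\]!/4\\[body01\\]/(\\d{1,9})\\[para\\d+\\]/3:(\\d{1,9})$");

    static String generate(int chapterIndex, int paragraphIndex, int characterOffset) {
        return String.format(Locale.ROOT, "/6/%d[chap%02d]!/4[body01]/%d[para%02d]/3:%d",
            (chapterIndex * 2) + 4, chapterIndex,
            (paragraphIndex * 2) + 4, paragraphIndex,
            characterOffset);
    }

    /** Returns {chapterIndex, paragraphIndex, characterOffset}, or null for other shapes. */
    static int[] parse(String cfi) {
        Matcher m = CFI.matcher(cfi);
        if (!m.matches()) {
            return null;
        }
        int chapterStep = Integer.parseInt(m.group(1));
        int paragraphStep = Integer.parseInt(m.group(2));
        // Steps are generated as (index * 2) + 4, so they must be even and at least 4.
        if (chapterStep < 4 || paragraphStep < 4
                || chapterStep % 2 != 0 || paragraphStep % 2 != 0) {
            return null;
        }
        return new int[] {
            (chapterStep - 4) / 2, (paragraphStep - 4) / 2, Integer.parseInt(m.group(3))
        };
    }
}
```

The parser derives indices from the numeric steps and ignores the bracketed ID assertions, so a foreign CFI with different IDs but the same step structure still maps to a position.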
### API Endpoints
#### GET /api/stories/{id}/export-epub
```java
@GetMapping("/{id}/export-epub")
public ResponseEntity<StreamingResponseBody> exportStory(@PathVariable UUID id) {
    // Implementation in EPUBExportService
}
```
**Response**: EPUB file download with headers:
```
Content-Type: application/epub+zip
Content-Disposition: attachment; filename="story-title.epub"
```
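The `story-title.epub` filename suggests that the story title is reduced to a safe slug before it is placed in the header. One possible way to do that is sketched below. The class name `ExportFilenameSketch`, the fallback name `story`, and the exact slug rules are assumptions, not the documented behavior of the export endpoint.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper: derives a header-safe download filename from a story title.
final class ExportFilenameSketch {
    static String toFilename(String title) {
        // Decompose accented characters, then drop the combining marks.
        String ascii = Normalizer.normalize(title, Normalizer.Form.NFD)
            .replaceAll("\\p{M}+", "");
        String slug = ascii.toLowerCase(Locale.ROOT)
            .replaceAll("[^a-z0-9]+", "-")   // collapse everything else to hyphens
            .replaceAll("^-+|-+$", "");      // trim leading/trailing hyphens
        return (slug.isEmpty() ? "story" : slug) + ".epub";
    }

    static String contentDisposition(String title) {
        return "attachment; filename=\"" + toFilename(title) + "\"";
    }
}
```

Restricting the name to `a-z`, `0-9`, and hyphens avoids quoting and encoding problems in the `Content-Disposition` header, at the cost of losing non-Latin titles.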
#### GET /api/collections/{id}/export-epub
```java
@GetMapping("/{id}/export-epub")
public ResponseEntity<StreamingResponseBody> exportCollection(@PathVariable UUID id) {
    // Implementation in EPUBExportService
}
```
**Response**: Multi-story EPUB with table of contents
## Data Models
### ReadingPosition Entity
```java
@Entity
@Table(name = "reading_positions")
public class ReadingPosition {
    @Id
    private UUID id;

    @ManyToOne(fetch = FetchType.LAZY)
    @JoinColumn(name = "story_id")
    private Story story;

    @Column(name = "chapter_index")
    private Integer chapterIndex = 0;

    @Column(name = "paragraph_index")
    private Integer paragraphIndex = 0;

    @Column(name = "character_offset")
    private Integer characterOffset = 0;

    @Column(name = "progress_percentage")
    private Double progressPercentage = 0.0;

    @Column(name = "epub_cfi")
    private String canonicalFragmentIdentifier;

    @Column(name = "last_read_at")
    private LocalDateTime lastReadAt;

    @Column(name = "device_identifier")
    private String deviceIdentifier;

    // Constructors, getters, setters
}
```
### EPUB Import Request DTO
```java
public class EPUBImportRequest {
    private String filename;
    private Long fileSize;
    private Boolean preserveChapterStructure = true;
    private Boolean extractCover = true;
    private String targetCollectionId; // Optional: add to specific collection
}
```
### EPUB Export Options DTO
```java
public class EPUBExportOptions {
    private Boolean includeReadingPosition = true;
    private Boolean includeCoverImage = true;
    private Boolean includeMetadata = true;
    private String cssStylesheet; // Optional custom CSS
    private EPUBVersion version = EPUBVersion.EPUB3;
}
```
## Service Layer Architecture
### EPUBImportService
```java
@Service
public class EPUBImportService {
    // Core import method
    public Story importEPUBFile(MultipartFile file, EPUBImportRequest request);

    // Helper methods
    private void validateEPUBFile(MultipartFile file);
    private Book parseEPUBStructure(InputStream inputStream);
    private Story extractStoryData(Book epub);
    private String combineChapterContent(Book epub);
    private void extractAndSaveCover(Book epub, Story story);
    private List<String> extractTags(Book epub);
    private ReadingPosition extractReadingPosition(Book epub);
}
```
### EPUBExportService
```java
@Service
public class EPUBExportService {
    // Core export methods
    public byte[] exportSingleStory(UUID storyId, EPUBExportOptions options);
    public byte[] exportCollection(UUID collectionId, EPUBExportOptions options);

    // Helper methods
    private Book createEPUBStructure(Story story, ReadingPosition position);
    private Book createCollectionEPUB(Collection collection, List<ReadingPosition> positions);
    private void addReadingPositionMetadata(Book book, ReadingPosition position);
    private String generateCFI(ReadingPosition position);
    private Resource createChapterResource(Story story);
    private Resource createStylesheetResource();
    private void addCoverImage(Book book, Story story);
}
```
## Frontend Integration
### Import UI Flow
1. **Upload Interface**: File input with EPUB validation
2. **Progress Indicator**: Show parsing progress
3. **Preview Screen**: Display extracted metadata for confirmation
4. **Confirmation**: Allow editing of title, author, summary before saving
5. **Success**: Redirect to created story
### Export UI Flow
1. **Export Button**: Available on story detail and collection pages
2. **Options Modal**: Allow selection of export options
3. **Progress Indicator**: Show EPUB generation progress
4. **Download**: Automatic file download on completion
### Frontend API Calls
```typescript
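// Import EPUB
const importEPUB = async (file: File) => {
  const formData = new FormData();
  formData.append('file', file);
  const response = await fetch('/api/stories/import-epub', {
    method: 'POST',
    body: formData,
  });
  if (!response.ok) {
    throw new Error(`EPUB import failed with status ${response.status}`);
  }
  return await response.json();
};

// Export Story (storyTitle is used only to name the downloaded file)
const exportStoryEPUB = async (storyId: string, storyTitle: string) => {
  const response = await fetch(`/api/stories/${storyId}/export-epub`, {
    method: 'GET',
  });
  if (!response.ok) {
    throw new Error(`EPUB export failed with status ${response.status}`);
  }
  const blob = await response.blob();
  const url = window.URL.createObjectURL(blob);
  const a = document.createElement('a');
  a.href = url;
  a.download = `${storyTitle}.epub`;
  a.click();
  window.URL.revokeObjectURL(url); // release the temporary object URL
};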
```
## Error Handling
### Import Errors
- **Invalid EPUB format**: "Invalid EPUB file format"
- **File too large**: "File size exceeds 50MB limit"
- **DRM protected**: "DRM-protected EPUBs not supported"
- **Corrupted file**: "EPUB file appears to be corrupted"
- **No content**: "EPUB contains no readable content"
### Export Errors
- **Story not found**: "Story not found or access denied"
- **Missing content**: "Story has no content to export"
- **Generation failure**: "Failed to generate EPUB file"
## Security Considerations
### File Upload Security
- **File type validation**: Verify EPUB MIME type and structure
- **Size limits**: Enforce maximum file size limits
- **Content sanitization**: Apply existing HTML sanitization
- **Virus scanning**: Consider integration with antivirus scanning
### Content Security
- **HTML sanitization**: Apply existing Jsoup rules to imported content
- **Image validation**: Validate extracted cover images
- **Metadata escaping**: Escape special characters in metadata
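Metadata escaping matters most when titles, author names, or tags are written into the OPF package document, where an unescaped `&` or `<` makes the XML invalid. A minimal sketch of such an escaper follows. The class name `XmlEscapeSketch` is an assumption, and a library that builds the XML through a proper writer may make manual escaping unnecessary.

```java
// Hypothetical helper: escapes the five XML special characters in metadata values.
final class XmlEscapeSketch {
    static String escape(String value) {
        StringBuilder sb = new StringBuilder(value.length());
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '&':  sb.append("&amp;");  break;
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '"':  sb.append("&quot;"); break;
                case '\'': sb.append("&apos;"); break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

The same function is safe for both element text and attribute values because it escapes both quote characters.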
## Testing Strategy
### Unit Tests
- EPUB parsing and validation logic
- CFI generation and parsing
- Metadata extraction accuracy
- Content sanitization
### Integration Tests
- End-to-end import/export workflow
- Reading position preservation
- Multi-story collection export
- Error handling scenarios
### Test Data
- Sample EPUB files for various scenarios
- EPUBs with and without reading positions
- Multi-chapter EPUBs
- EPUBs with covers and metadata
## Performance Considerations
### Import Performance
- **Streaming processing**: Process large EPUBs without loading entirely into memory
- **Async processing**: Consider async import for large files
- **Progress tracking**: Provide progress feedback for large imports
### Export Performance
- **Caching**: Cache generated EPUBs for repeated exports
- **Streaming**: Stream EPUB generation for large collections
- **Resource optimization**: Optimize image and content sizes
## Future Enhancements (Out of Scope)
### Phase 2 Considerations
- **DRM support**: Research legal and technical feasibility
- **Reading position sync**: Real-time sync across devices
- **Advanced EPUB features**: Enhanced typography, annotations
- **Bulk operations**: Import/export multiple EPUBs
- **EPUB validation**: Full EPUB compliance checking
### Integration Possibilities
- **Cloud storage**: Export directly to Google Drive, Dropbox
- **E-reader sync**: Direct sync with Kindle, Kobo devices
- **Reading analytics**: Track reading patterns and statistics
## Implementation Phases
### Phase 1: Core Functionality ✅ **COMPLETED**
- [x] Basic EPUB import (DRM-free)
- [x] Single story export
- [x] Reading position storage and retrieval
- [x] Frontend UI integration
### Phase 2: Enhanced Features
- [ ] Collection export
- [ ] Advanced metadata handling
- [ ] Performance optimizations
- [ ] Comprehensive error handling
### Phase 3: Advanced Features
- [ ] DRM exploration (legal research required)
- [ ] Reading position sync
- [ ] Advanced EPUB features
- [ ] Analytics and reporting
## Acceptance Criteria
### Import Success Criteria ✅ **COMPLETED**
- [x] Successfully parse EPUB 2.0 and 3.x files
- [x] Extract title, author, summary, and content accurately
- [x] Preserve formatting and basic HTML structure
- [x] Handle cover images correctly
- [x] Import reading positions when present
- [x] Provide clear error messages for invalid files
### Export Success Criteria ✅ **PHASE 1 COMPLETED**
- [x] Generate valid EPUB files compatible with major readers
- [x] Include accurate metadata and content
- [x] Embed reading positions using CFI standard
- [x] Support single story export
- [ ] Support collection export *(Phase 2)*
- [ ] Generate proper table of contents for collections *(Phase 2)*
- [x] Include cover images when available
---
*This specification serves as the implementation guide for the EPUB import/export feature. All implementation decisions should reference this document for consistency and completeness.*