Demo · StoryMine — AI picture-books starring your child (voice narration)

StoryMine takes a story a child tells out loud, creates a main character that resembles the child, draws an illustration for every page, and reads it aloud — a mobile-first storybook PWA. It supports Korean and English, and it's designed as a small but complete service (bookshelf, series, download, billing), not a one-shot toy.

1. What & for whom

Users: children and parents. The child speaks into the mic ("a dinosaur who flies to space"); the parent picks age, language, page count, art style, and narration voice.
Output: a storybook with a cover + body (per-page illustration + short text) + per-page voice narration. Pages auto-advance like a read-aloud picture book; books can be reopened from the shelf or downloaded as PDF.
Core value: ① the child becomes the hero, ② minimal waiting (parallel illustration), ③ privacy peace-of-mind (the child's photo is never stored).

2. User flow

Speak/Type           Design          Write           Art + Narration (parallel)   Reader
(STT·photo)  ─▶  story structure ─▶ page text  ─▶   gpt-image-1 ┐               auto-advance
name·age         (gpt-5.4)          (gpt-5.4)        Neural TTS  ┘─▶ store ─▶  ◀ prev/next ▶

Input — a story idea via voice (browser STT) or text; the child's photo & name are taken once up front.
Characterization (vision) — the photo is sent to a multimodal model only in the moment to derive an English character description (anchor) like "round brown hair, sky-blue hoodie…", and the original photo is discarded immediately (never stored).
Design → Write — gpt-5.4 builds the story structure and per-page text for the chosen age/language/length. The hero's name is locked to the parent's choice.
Art + Narration (generated together) — each page's illustration (gpt-image-1) and voice (Azure Neural TTS) are produced in parallel.
Reader — captions sit on a soft gradient so they don't cover the art, and pages auto-advance in step with the audio.

3. Architecture

                    ┌──────────────────────────── Azure ────────────────────────────┐
  mobile browser    │                                                                │
   (PWA, single HTML)│  Container Apps (FastAPI)                                      │
        │  HTTPS     │      │  keyless (Managed Identity, AAD token)                   │
        └───────────┼─────▶│──▶ Azure OpenAI  gpt-5.4      (story design·writing)     │
                    │      │──▶ Azure OpenAI  gpt-image-1  (page illustrations)        │
                    │      │──▶ Azure AI Speech  Neural TTS (page audio)               │
                    │      │──▶ (vision) multimodal — photo→description, NOT stored     │
                    │      │                                                          │
                    │      └──▶ PostgreSQL Flexible Server (keyless/MI)               │
                    │             · users·sessions·books·pages·usage(billing)         │
                    │             · assets table (BYTEA): cover·art·audio bytes        │
                    └────────────────────────────────────────────────────────────────┘

Frontend: a single-HTML PWA (installable, offline shell). No splash page — it goes straight to "create".
Backend: FastAPI on Container Apps. Every external AI call is keyless (Managed Identity + AAD token) — no keys in code or environment.
Persistence: PostgreSQL Flexible Server. Not only text metadata but also image/audio bytes live in an assets table (BYTEA), served back via an app proxy (/api/assets/...) so the browser never touches storage directly.
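
The asset-proxy idea can be sketched as follows. This is a minimal illustration, not the service's actual code: it uses the standard-library sqlite3 module as a stand-in for PostgreSQL (BLOB in place of BYTEA), and the column names, the put_asset/get_asset helpers, and the path segment after /api/assets/ are all assumptions.

```python
import sqlite3
from typing import Optional, Tuple

# Stand-in schema (assumed layout). In PostgreSQL the `data` column would be BYTEA.
SCHEMA = """
CREATE TABLE IF NOT EXISTS assets (
    asset_id     TEXT PRIMARY KEY,
    user_id      TEXT NOT NULL,
    content_type TEXT NOT NULL,
    data         BLOB NOT NULL
)
"""


def open_store() -> sqlite3.Connection:
    """Open an in-memory database and create the assets table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    return conn


def put_asset(conn: sqlite3.Connection, asset_id: str, user_id: str,
              content_type: str, data: bytes) -> str:
    """Store the bytes and return the app-proxy URL the browser will request."""
    conn.execute(
        "INSERT INTO assets (asset_id, user_id, content_type, data) VALUES (?, ?, ?, ?)",
        (asset_id, user_id, content_type, data),
    )
    conn.commit()
    return f"/api/assets/{asset_id}"  # hypothetical URL shape


def get_asset(conn: sqlite3.Connection, asset_id: str,
              user_id: str) -> Optional[Tuple[str, bytes]]:
    """Return (content_type, bytes) only if the asset belongs to this user."""
    row = conn.execute(
        "SELECT content_type, data FROM assets WHERE asset_id = ? AND user_id = ?",
        (asset_id, user_id),
    ).fetchone()
    return None if row is None else (row[0], bytes(row[1]))
```

A route handler would call get_asset and return the bytes with the stored content type, so the browser only ever sees an app URL and never a storage URL.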

4. Key components

Component	Role	Note
Azure OpenAI gpt-5.4	story design + page writing + photo→description (vision)	keyless (AAD)
Azure OpenAI gpt-image-1	page illustration 1024²	text anchor keeps the character consistent
Azure AI Speech (Neural TTS)	page narration (SSML)	ko/en preset voices, keyless (resource_id)
Container Apps	FastAPI hosting (scale 0–N)	image from ACR
PostgreSQL Flexible	books·pages·usage + asset bytes	keyless (MI), assets in one place
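
As an illustration of the SSML narration row, a page's text might be wrapped like this before being sent to Neural TTS. This is a sketch under assumptions: the two voice names are real Azure voices chosen only as examples (the source does not say which presets are used), and the slightly slowed prosody rate is a guess at a read-aloud setting.

```python
from xml.sax.saxutils import escape, quoteattr

# Example voices only; the presets actually used by the service are not stated.
VOICES = {"ko": "ko-KR-SunHiNeural", "en": "en-US-JennyNeural"}
LOCALES = {"ko": "ko-KR", "en": "en-US"}


def page_ssml(text: str, lang: str, rate: str = "-10%") -> str:
    """Build an SSML document for one page, escaping the story text."""
    voice = VOICES[lang]
    locale = LOCALES[lang]
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        f'xml:lang="{locale}">'
        f'<voice name="{voice}">'
        f"<prosody rate={quoteattr(rate)}>{escape(text)}</prosody>"
        "</voice>"
        "</speak>"
    )
```

Escaping matters here because page text comes from a model and may contain characters such as & or < that would otherwise break the SSML document.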

5. Design decisions & trade-offs (the meat)

5-1. Character consistency — text anchor instead of "seed"

The image model offers no seed that guarantees the same person across pages. Instead, the English character description (anchor) produced by vision is prepended to every page prompt, which keeps the same hero throughout. At the same time, each page prompt is required to differ in action, setting, and camera angle, so the art follows the story instead of producing near-duplicate pages.
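
A minimal sketch of the text-anchor approach, assuming a hypothetical PageScene structure and prompt wording (neither appears in the source): the anchor leads every prompt, and a small check flags pages whose action, setting, and camera angle repeat.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PageScene:
    """What must vary from page to page (hypothetical structure)."""
    action: str
    setting: str
    camera: str


def build_page_prompt(anchor: str, style: str, scene: PageScene) -> str:
    """Put the character anchor first, then the page-specific scene."""
    return (
        f"Main character (keep identical on every page): {anchor}. "
        f"Art style: {style}. "
        f"Action: {scene.action}. Setting: {scene.setting}. Camera: {scene.camera}."
    )


def find_duplicate_scenes(scenes: List[PageScene]) -> List[int]:
    """Return indexes of pages that repeat an earlier scene exactly."""
    seen = set()
    duplicates = []
    for index, scene in enumerate(scenes):
        if scene in seen:
            duplicates.append(index)
        seen.add(scene)
    return duplicates
```

An exact-match check is the simplest possible guard; a real pipeline would more likely ask the story model to vary scenes up front and regenerate any page that this check flags.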

5-2. Latency — parallel page art + audio

Art and audio are independent per page, so they are generated concurrently in a thread pool whose size is tunable. Drawing 8–16 pages sequentially takes minutes; run in parallel, the wait is bounded by the slowest batch rather than the sum of all pages. This is paired with an async UX that lets the user read or start another book while one is generating.
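
The fan-out can be sketched with the standard-library ThreadPoolExecutor. The draw_page and narrate_page functions below are stubs that sleep briefly in place of the real image and TTS calls, and the default worker count is an arbitrary assumption.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def draw_page(index: int) -> bytes:
    """Stub for the image call; sleeps to imitate network latency."""
    time.sleep(0.05)
    return f"art-{index}".encode()


def narrate_page(index: int) -> bytes:
    """Stub for the TTS call; sleeps to imitate network latency."""
    time.sleep(0.05)
    return f"audio-{index}".encode()


def generate_pages(page_count: int, max_workers: int = 8) -> List[Dict[str, object]]:
    """Submit art and audio for every page at once, then collect in page order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        art = [pool.submit(draw_page, i) for i in range(page_count)]
        audio = [pool.submit(narrate_page, i) for i in range(page_count)]
        return [
            {"page": i, "art": art[i].result(), "audio": audio[i].result()}
            for i in range(page_count)
        ]
```

Threads suit this workload because each task spends its time waiting on a remote service, and collecting results by index keeps the book in page order regardless of which call finishes first.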

5-3. Security — keyless end-to-end (Managed Identity)

OpenAI, Speech and the DB are all accessed keyless via AAD token / Managed Identity — no key leakage or rotation burden, no secrets in environment.
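
A hedged sketch of what keyless access can look like in Python. It assumes the azure-identity and openai packages, which are imported lazily so the helper above them stays usable without those packages; the function names and parameters are illustrative. The "aad#resource-id#token" string is the format Azure AI Speech accepts for Microsoft Entra tokens, which is why the Speech resource ID is needed.

```python
# Token scope used for Azure AI services with Microsoft Entra ID.
COGNITIVE_SCOPE = "https://cognitiveservices.azure.com/.default"


def speech_authorization_token(resource_id: str, aad_token: str) -> str:
    """Combine the Speech resource ID and an Entra token into the form Speech expects."""
    return f"aad#{resource_id}#{aad_token}"


def make_openai_client(endpoint: str, api_version: str):
    """Create an Azure OpenAI client that authenticates with a token, not a key.

    Both parameters come from deployment config; neither is a secret.
    """
    # Lazy imports: these third-party packages are only needed when this is called.
    from azure.identity import DefaultAzureCredential, get_bearer_token_provider
    from openai import AzureOpenAI

    token_provider = get_bearer_token_provider(DefaultAzureCredential(), COGNITIVE_SCOPE)
    return AzureOpenAI(
        azure_endpoint=endpoint,
        api_version=api_version,
        azure_ad_token_provider=token_provider,
    )
```

On Container Apps, DefaultAzureCredential picks up the managed identity automatically, so the same code runs locally with a developer login and in production without any stored secret.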

5-4. Governance avoidance — assets in Postgres, not Blob

Central governance policy periodically disables publicNetworkAccess on storage accounts, which would break Blob-hosted assets. Because every asset is already served through the app proxy (the browser never hits storage directly), the image and audio bytes were moved into a Postgres assets table (BYTEA). Asset URLs stayed the same, and the storage lock no longer affects delivery.

5-5. Data minimization — the child's photo is never stored

The photo is used only at the instant a character description is created, then discarded. What's kept is just the login account + the child's nickname + a text character description; the thumbnail is an anonymous name-based avatar (SVG), not a photo. On startup, any previously-stored photo assets are purged.
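
The describe-then-discard flow can be sketched as below. The ChildProfile fields, the describe callback (standing in for the vision call), and the avatar styling are assumptions for illustration; the point is that the photo bytes are passed through once and never become part of what is stored.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable
from xml.sax.saxutils import escape


@dataclass(frozen=True)
class ChildProfile:
    """Everything kept about the child. There is deliberately no photo field."""
    nickname: str
    anchor: str      # text-only character description
    avatar_svg: str  # derived from the nickname, not from the photo


def name_avatar(nickname: str) -> str:
    """Build an anonymous SVG avatar: a colored circle with the first letter."""
    hue = int(hashlib.sha256(nickname.encode("utf-8")).hexdigest(), 16) % 360
    initial = escape(nickname[:1].upper() or "?")
    return (
        '<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 64 64">'
        f'<circle cx="32" cy="32" r="32" fill="hsl({hue},60%,70%)"/>'
        f'<text x="32" y="42" font-size="28" text-anchor="middle">{initial}</text>'
        "</svg>"
    )


def build_profile(nickname: str, photo: bytes,
                  describe: Callable[[bytes], str]) -> ChildProfile:
    """Use the photo for one description call; return a profile without it."""
    anchor = describe(photo)
    return ChildProfile(nickname=nickname, anchor=anchor, avatar_svg=name_avatar(nickname))
```

Keeping the photo out of the data class means no later code path can persist it by accident, since there is nowhere on the profile to put it.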

5-6. Per-user isolation & billing limits

Books, child profiles, and usage are all isolated by login user_id, so shelves never mix. Billing tiers are free (1 book, lifetime), $9 (10 per month), and $19 (30 per month), enforced via a usage table. During development the limit is raised well above these values.
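
A minimal sketch of how the tier limits might be checked before a new generation starts. The plan keys ("free", "basic", "plus"), the Plan structure, and the assumption that usage is counted in books are hypothetical; in the real service the two counts would come from the usage table, filtered by user_id.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Plan:
    """One billing tier; a limit of None means that limit does not apply."""
    name: str
    lifetime_limit: Optional[int]
    monthly_limit: Optional[int]


# Hypothetical plan keys mapped to the tiers listed above.
PLANS: Dict[str, Plan] = {
    "free": Plan("free", lifetime_limit=1, monthly_limit=None),
    "basic": Plan("basic", lifetime_limit=None, monthly_limit=10),  # $9 tier
    "plus": Plan("plus", lifetime_limit=None, monthly_limit=30),    # $19 tier
}


def can_create_book(plan_key: str, books_total: int, books_this_month: int) -> bool:
    """Return True if this user may start another book under their plan."""
    plan = PLANS[plan_key]
    if plan.lifetime_limit is not None and books_total >= plan.lifetime_limit:
        return False
    if plan.monthly_limit is not None and books_this_month >= plan.monthly_limit:
        return False
    return True
```

Checking the quota before generation starts, rather than after, avoids paying for image and audio calls on a request that will be rejected.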

6. Privacy & regulation

Because it's a children's service, data protection comes first.

Korea PIPA: processing data of children under 14 requires guardian consent, and a face photo can count as biometric data → risk is removed at the source via no-photo-storage + data minimization, with guardian-consent / no-storage notices on the login & account screens.
COPPA (US, <13) / GDPR Art.8 (EU, children): designed in the same direction (minimal collection, guardian consent, purpose limitation).
Stored: login account, child nickname, non-identifying character description, generated books. Not stored: the child's photo or biometrics.

Regulatory specifics change over time, so these points should be verified against official sources (law.go.kr, pipc.go.kr) before launch.

7. Cost & performance lens

Hosting: Container Apps (scale-to-zero when idle) + a small Postgres Flexible server. Container Apps is billed by consumption, so idle compute cost is low; the small Postgres server is the fixed baseline.
Variable cost (per generation): text (gpt-5.4) + one image per page (gpt-image-1) + audio (TTS). Images dominate, so page count and concurrency are the cost/speed dials.
Speed: parallel illustration shortens "speak → finished"; the async UX lets other work continue while generating.
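
The cost and speed dials above can be written as two small formulas. This is a sketch only: every unit cost and timing is a parameter supplied by the caller, no real prices are implied, and counting the cover as one extra image is an assumption.

```python
import math


def estimate_book_cost(pages: int, text_cost: float, image_cost: float,
                       audio_cost: float, include_cover: bool = True) -> float:
    """Variable cost of one book: text once, one image and one audio clip per page."""
    images = pages + (1 if include_cover else 0)  # assumes the cover is also generated
    return text_cost + images * image_cost + pages * audio_cost


def estimate_image_wait(pages: int, concurrency: int, seconds_per_image: float) -> float:
    """Rough wall-clock time for page art when images run in waves of `concurrency`."""
    return math.ceil(pages / concurrency) * seconds_per_image
```

With these two functions, raising the page count increases both cost and wait, while raising concurrency shortens the wait without changing the per-book cost.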

8. Limits & what's next

Login: currently guest-centric (cookie-based). Google/Kakao OAuth is the next step for stable, portable, per-account isolation.
Voice: currently Azure Neural TTS preset voices (child / warm-narrator tones). Synthesizing the child's own voice needs extra consent and safeguards — handled with caution.
Billing: the Stripe structure (products·webhooks) is ready; keys are injected at go-live.

In one line

A private, kid-as-hero storybook service on keyless Azure AI (story·art·voice) + Container Apps + Postgres, with data minimization (no photo storage) and parallel generation for a fast feel as its core values.
