Taeyang Moon
Demo · Architecture write-up

StoryMine — AI picture-books starring your child (voice narration)

A mobile PWA that turns a spoken idea into characters, illustrations, and narration

Azure OpenAI gpt-5.4 · gpt-image-1 art · Voice narration (TTS) · Parallel image gen · Container Apps · PostgreSQL (keyless MI) · Data minimization (PIPA) · PWA

StoryMine takes a story a child tells out loud, creates a main character that resembles the child, draws an illustration for every page, and reads it aloud — a mobile-first storybook PWA. It supports Korean and English, and it's designed as a small but complete service (bookshelf, series, download, billing), not a one-shot toy.


1. What & for whom


2. User flow

Speak/Type           Design          Write           Art + Narration (parallel)   Reader
(STT·photo)  ─▶  story structure ─▶ page text  ─▶   gpt-image-1 ┐               auto-advance
name·age         (gpt-5.4)          (gpt-5.4)        Neural TTS  ┘─▶ store ─▶  ◀ prev/next ▶
  1. Input — a story idea via voice (browser STT) or text; the child's photo & name are taken once up front.
  2. Characterization (vision) — the photo is sent to a multimodal model once, solely to derive an English character description (the anchor) such as "round brown hair, sky-blue hoodie…"; the original photo is then discarded immediately and never stored.
  3. Design → Write — gpt-5.4 builds the story structure and per-page text for the chosen age/language/length. The hero's name is locked to the parent's choice.
  4. Art + Narration (generated together) — each page's illustration (gpt-image-1) and voice (Azure Neural TTS) are produced in parallel.
  5. Reader — captions sit on a soft gradient so they don't cover the art, and pages auto-advance in step with the audio.
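Steps 2 and 3 can be sketched as a function that takes the photo only as an argument to the vision call and returns nothing but text. This is a minimal sketch, not the service's code: the stage callables (`describe`, `design`, `write`), their signatures, `StoryDraft`, and the name-presence check are hypothetical stand-ins for the gpt-5.4 calls and the name lock.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical stage signatures; the real service wraps gpt-5.4 (vision + text) calls.
DescribeFn = Callable[[bytes], str]                        # photo -> English anchor
DesignFn = Callable[[str, str, int, str, int], str]        # idea, hero, age, lang, n_pages -> outline
WriteFn = Callable[[str, str, int, str, int], List[str]]   # outline, hero, age, lang, n_pages -> pages


@dataclass(frozen=True)
class StoryDraft:
    hero_name: str
    anchor: str       # text-only character description; there is no photo field
    pages: List[str]


def draft_story(idea: str, photo: bytes, hero_name: str, age: int, lang: str,
                n_pages: int, describe: DescribeFn, design: DesignFn,
                write: WriteFn) -> StoryDraft:
    # Step 2: the photo is used for this one call and is never stored or returned.
    anchor = describe(photo)
    # Step 3: design, then write; the parent's chosen name is passed to both stages.
    outline = design(idea, hero_name, age, lang, n_pages)
    pages = write(outline, hero_name, age, lang, n_pages)
    if len(pages) != n_pages:
        raise ValueError(f"expected {n_pages} pages, got {len(pages)}")
    # Hypothetical guard for the name lock: the chosen name must appear in the text.
    if not any(hero_name in text for text in pages):
        raise ValueError(f"hero name {hero_name!r} missing from story text")
    return StoryDraft(hero_name=hero_name, anchor=anchor, pages=pages)
```

Injecting the stages as callables keeps the data-minimization property easy to see: the returned draft has no field that could carry the photo.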

3. Architecture

                    ┌──────────────────────────── Azure ────────────────────────────┐
  mobile browser    │                                                                │
  (PWA, single HTML)│  Container Apps (FastAPI)                                       │
        │  HTTPS    │      │  keyless (Managed Identity, AAD token)                    │
        └───────────┼─────▶│──▶ Azure OpenAI  gpt-5.4      (story design·writing)     │
                    │      │──▶ Azure OpenAI  gpt-image-1  (page illustrations)        │
                    │      │──▶ Azure AI Speech  Neural TTS (page audio)               │
                    │      │──▶ (vision) multimodal — photo→description, NOT stored     │
                    │      │                                                          │
                    │      └──▶ PostgreSQL Flexible Server (keyless/MI)               │
                    │             · users·sessions·books·pages·usage(billing)         │
                    │             · assets table (BYTEA): cover·art·audio bytes        │
                    └────────────────────────────────────────────────────────────────┘

4. Key components

Component                     Role                                                      Note
Azure OpenAI gpt-5.4          story design + page writing + photo→description (vision)  keyless (AAD)
Azure OpenAI gpt-image-1      page illustration                                         1024²; text anchor keeps the character consistent
Azure AI Speech (Neural TTS)  page narration (SSML)                                     ko/en preset voices; keyless (resource_id)
Container Apps                FastAPI hosting (scale 0–N)                               image from ACR
PostgreSQL Flexible Server    books·pages·usage + asset bytes                           keyless (MI); assets in one place
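Narration goes to Neural TTS as SSML. The sketch below shows one way to wrap a page's text for a per-language preset voice. The voice names are assumptions (the write-up only says "ko/en preset voices"); both are standard Azure neural voices. The resulting string could be passed to the Speech SDK's `SpeechSynthesizer.speak_ssml_async`.

```python
from xml.sax.saxutils import escape

# Assumed preset voices; the write-up only says "ko/en preset voices".
VOICES = {
    "ko": ("ko-KR", "ko-KR-SunHiNeural"),
    "en": ("en-US", "en-US-JennyNeural"),
}


def page_ssml(text: str, lang: str) -> str:
    """Wrap one page's text in SSML for that language's preset voice."""
    locale, voice = VOICES[lang]
    # escape() turns &, < and > into entities so story text cannot break the XML.
    return (f'<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
            f'xml:lang="{locale}"><voice name="{voice}">{escape(text)}</voice></speak>')
```

Escaping matters here because page text is model-generated: a stray `&` or `<` would otherwise make the whole request invalid SSML.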

5. Design decisions & trade-offs (the meat)

5-1. Character consistency — text anchor instead of "seed"

There is no seed that guarantees the image model draws the same person across pages. Instead, the English character description (anchor) produced by the vision step is prepended to every page prompt, so each illustration shows the same hero. At the same time, each page is required to differ in action, setting, and camera angle, so the art actually follows the story instead of producing near-duplicate pages.
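A minimal sketch of the anchor-first prompt, assuming the story structure yields one (action, setting) beat per page. `Shot`, `plan_shots`, `page_prompt`, the camera list, and the prompt wording are all hypothetical; the point is the ordering (anchor first, per-page variation after it) and a cheap guard against repeated beats.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical camera rotation so consecutive pages use different framing.
CAMERAS = ["wide establishing shot", "medium shot", "close-up",
           "low-angle shot", "over-the-shoulder shot"]


@dataclass(frozen=True)
class Shot:
    action: str
    setting: str
    camera: str


def plan_shots(beats: List[Tuple[str, str]]) -> List[Shot]:
    """beats: one (action, setting) pair per page, taken from the story structure."""
    for prev, cur in zip(beats, beats[1:]):
        if prev == cur:
            raise ValueError(f"consecutive pages repeat the same beat: {cur}")
    return [Shot(a, s, CAMERAS[i % len(CAMERAS)]) for i, (a, s) in enumerate(beats)]


def page_prompt(anchor: str, style: str, shot: Shot) -> str:
    # The anchor leads every prompt so the hero stays the same person on each page;
    # action, setting and camera come from the page's own beat so pages differ.
    return (f"Main character: {anchor}. "
            f"Scene: {shot.action}, {shot.setting}. "
            f"Camera: {shot.camera}. Style: {style}.")
```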

5-2. Latency — parallel page art + audio

Each page's art and audio depend only on that page's text, so all of them are generated concurrently in a thread pool with a tunable worker count. Drawing 8–16 pages sequentially takes minutes; running them in parallel cuts the wait dramatically, and an async UX lets the user read or start another book while one is still generating.
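A sketch of the fan-out with `concurrent.futures.ThreadPoolExecutor`. `draw` and `speak` are hypothetical callables standing in for the gpt-image-1 and Neural TTS requests; `max_workers` is the tunable knob. Threads fit here because each task spends its time waiting on the network, not computing.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple


def render_pages(pages: List[str],
                 draw: Callable[[int, str], bytes],
                 speak: Callable[[int, str], bytes],
                 max_workers: int = 8) -> List[Tuple[bytes, bytes]]:
    """Generate art and audio for every page concurrently; results keep page order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit every art and audio job up front so they all overlap.
        art = [pool.submit(draw, i, text) for i, text in enumerate(pages)]
        audio = [pool.submit(speak, i, text) for i, text in enumerate(pages)]
        # result() blocks until that job finishes and re-raises its exception, if any.
        return [(a.result(), s.result()) for a, s in zip(art, audio)]
```

Keeping the futures in page order means the reader receives pages in sequence even though they finish in arbitrary order.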

5-3. Security — keyless end-to-end (Managed Identity)

OpenAI, Speech, and the database are all accessed keyless, using AAD tokens obtained through Managed Identity, so there are no keys to leak or rotate and no secrets in environment variables.

5-4. Governance avoidance — assets in Postgres, not Blob

A central governance policy periodically locks down storage accounts by disabling publicNetworkAccess. Because every asset is already proxied through the app (the browser never fetches from storage directly), the image and audio bytes were moved into a Postgres assets table (BYTEA). The policy no longer affects them, and asset URLs stay unchanged.
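A minimal sketch of the assets table and the two calls an app-proxied asset route needs. To stay self-contained it uses the standard-library `sqlite3` as a stand-in for Postgres (BLOB standing in for BYTEA); the schema, column names, and helper names are hypothetical. Since the asset key is the same one the URL already carries, switching the storage backend does not change any URL the browser sees.

```python
import sqlite3
from typing import Optional, Tuple

# sqlite3 stands in for Postgres here; in Postgres `data` would be BYTEA.
SCHEMA = """
CREATE TABLE IF NOT EXISTS assets (
    id           TEXT PRIMARY KEY,  -- same key the asset URL already carries
    content_type TEXT NOT NULL,     -- e.g. image/png, audio/mpeg
    data         BLOB NOT NULL
)
"""


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def put_asset(conn: sqlite3.Connection, asset_id: str, content_type: str, data: bytes) -> None:
    # Upsert; the Postgres equivalent is INSERT ... ON CONFLICT (id) DO UPDATE.
    with conn:  # commits on success, rolls back on error
        conn.execute("INSERT OR REPLACE INTO assets (id, content_type, data) VALUES (?, ?, ?)",
                     (asset_id, content_type, data))


def get_asset(conn: sqlite3.Connection, asset_id: str) -> Optional[Tuple[str, bytes]]:
    # What the asset route would return as (Content-Type, body); None maps to 404.
    row = conn.execute("SELECT content_type, data FROM assets WHERE id = ?",
                       (asset_id,)).fetchone()
    return None if row is None else (row[0], bytes(row[1]))
```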

5-5. Data minimization — the child's photo is never stored

The photo is used only at the moment the character description is generated, then discarded. All that's kept is the login account, the child's nickname, and a text character description; the thumbnail is an anonymous, name-based SVG avatar rather than a photo. On startup, any previously stored photo assets are purged.
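The name-based avatar needs nothing but the nickname. This sketch, with a hypothetical palette and layout, derives a stable background color from a hash of the nickname and draws its first character, so no image of the child is involved.

```python
import hashlib
from html import escape

# Hypothetical pastel palette.
PALETTE = ["#F8B4B4", "#FCD9A8", "#B8E0C8", "#A8D4F0", "#D3C4F3"]


def avatar_svg(nickname: str, size: int = 64) -> str:
    """Circle in a nickname-derived color with the nickname's first character."""
    name = nickname.strip() or "?"
    initial = name[0].upper()
    # Same nickname -> same color on every render; no photo or face data involved.
    idx = hashlib.sha256(name.encode("utf-8")).digest()[0] % len(PALETTE)
    r = size // 2
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{size}" height="{size}" '
        f'viewBox="0 0 {size} {size}">'
        f'<circle cx="{r}" cy="{r}" r="{r}" fill="{PALETTE[idx]}"/>'
        f'<text x="50%" y="50%" dominant-baseline="central" text-anchor="middle" '
        f'font-family="sans-serif" font-size="{size // 2}">{escape(initial)}</text>'
        "</svg>"
    )
```

Hangul nicknames work unchanged, since the first character of "민지" is simply "민".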

5-6. Per-user isolation & billing limits

Books, child profiles, and usage are all isolated by the login user_id, so shelves never mix between accounts. Billing tiers are free (1 book, lifetime), $9 (10 books/month), and $19 (30 books/month), enforced through a usage table; during development the limits are relaxed.
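A sketch of the limit check, assuming each created book writes one timestamped row to the usage table for its user_id. The tier keys (`free`, `basic`, `plus`) are hypothetical; the limits come from the tiers above. Treating "per month" as a calendar month in UTC is also an assumption, and a rolling 30-day window would be an equally valid choice.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class Plan:
    limit: int
    monthly: bool  # False means a lifetime cap


# Tier keys are hypothetical; the limits come from the write-up.
PLANS = {
    "free": Plan(limit=1, monthly=False),   # 1 book, lifetime
    "basic": Plan(limit=10, monthly=True),  # $9: 10 books/month
    "plus": Plan(limit=30, monthly=True),   # $19: 30 books/month
}


def books_remaining(plan_key: str, created: List[datetime],
                    now: Optional[datetime] = None) -> int:
    """created: UTC timestamps of this user_id's rows in the usage table."""
    plan = PLANS[plan_key]
    now = now or datetime.now(timezone.utc)
    if plan.monthly:
        # Assumption: count books in the current calendar month (UTC).
        used = sum(1 for t in created if (t.year, t.month) == (now.year, now.month))
    else:
        used = len(created)
    return max(plan.limit - used, 0)
```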


6. Privacy & regulation

Because it's a children's service, data protection comes first.

Regulatory specifics change over time, so the points here should be verified against official sources (law.go.kr, pipc.go.kr).


7. Cost & performance lens


8. Limits & what's next


In one line

A private, kid-as-hero storybook service built on keyless Azure AI (story·art·voice), Container Apps, and Postgres, whose core values are data minimization (no photo storage) and parallel generation for a fast experience.
