Field Guide № 01·Field notes for architects·April 2026

Prompting GPT Image 2

After a month with the model: ten patterns I keep returning to, three workflows I have found useful, and the questions I am still asking. A working notebook for architects and interior designers, not the last word.

By Chiang Ning · chiangning.net · 12 min read

Hybrid sketch-render of a contemporary civic building at dusk — Cover image, generated with gpt-image-2. A hybrid sketch-render: half photorealistic, half graphite line drawing, in a single pass.GPT-IMAGE-2 · SINGLE PASS

Foreword · Why I started this notebook

A working notebook, not a finished guide.

OpenAI released gpt-image-2 on 21 April 2026. A month in, I have spent enough hours with it to notice patterns, and enough to know that what I have found so far is a fraction of what is there.

I am an architect, not an ML researcher, and I have been treating the model the way I would treat any new tool that landed on my desk: using it on real briefs, watching where it surprises me, and writing down what seems to repeat. The ten patterns in Part One are the ones I keep returning to. The workflows in Part Two are the ones I now reach for first. Neither set is finished.

What I am noticing is that the model rewards specific visual facts over adjectives, and rewards constraint language over decorative praise. Findings that converge across early case studies from Archtene, Rendair, and the fal.ai prompting guide. OpenAI's own Cookbook is explicit: for photorealism, include the word “photorealistic” to engage that mode; for edits, repeat what must be preserved on every turn.

What pulled me in: gpt-image-2 is the first general-purpose image model that responds to a prompt the way a contractor responds to a drafting brief, with hierarchy, materials, constraints, and intent. That is a shift, and I do not think I have fully metabolised it yet.

Read this as a starting point, not a conclusion. Every feature image here is a gpt-image-2 output, not stock, not retouched. Each is also a thing I learned something from making. If you find something I have not, please tell me at chiangning.net.

The essentials · What I keep checking before I prompt

Before I write a brief, I check the machine.

Parameter-level realities I had to internalise the hard way, by asking the model for things it structurally could not deliver and burning credits to find out. It saves me a beat to glance at this first.

Max resolution2560 × 1440Edges in multiples of 16. Ratio ≤ 3:1. Longest edge under 3840 px.

Quality tierslow · med · highHigh for client-facing, medium for sprints, low for ideation volume.

Reference imagesUp to 16Each processed at high fidelity. Address by index in the prompt.

Cost / render~ $0.0411536×1024 high-quality landscape. Cheap enough for 20-image studies.

PNG · JPGDefault outputWebP and JPEG are faster and support output_compression. Transparent backgrounds are not supported; use background opaque and remove downstream.

n = 4Variant generationOne brief yields four parallel variants. Faster than four separate prompts. Best friend of concept-board work.

~ 2 minComplex promptsHeavy briefs can take up to two minutes. Plan iteration cycles; do not expect instant turnaround.

Part One · Ten patterns

Ten patterns I keep returning to.

I am not sure these are principles yet, that is a word for things you have tested across years. But in a month of using the model on real briefs, these are the moves that have shifted my output the most. Each one earned its place. Most still have edges I have not found.

Four-tier diagrammatic architectural analysis — A four-tier diagrammatic analysis I generated in a single call: circulation, program, environmental, structure. I have never been able to reproduce this exact composition twice. The model has a range, not a fixed style.SINGLE PASS · HIGH QUALITY

I · Structure

I write in a fixed order.

Scene, subject, details, constraints, use case.

I tested this against a single dump-everything-in paragraph for two weeks. The structured version won every time. For a render I write: scene is site and time of day, subject is the building or room, details are materials and camera framing, constraints are what must not change, and use case (competition board, client presentation, magazine editorial) sets the polish level.

Short labelled segments with line breaks, not a single run-on. The model treats a structured brief as a brief, and an adjective soup as an adjective soup. What I am still asking: does the order itself matter, or just the labelling?

II · Specificity

I replace style words with their visual atoms.

Minimalist, brutalist, Japandi on their own are weak triggers. I have seen the model deliver competent but generic outputs to all three, the same competent generic each time. The single highest-leverage move I have found is to break a style label into the materials, palette, and silhouette rules underneath it. Every canonical style compresses to a finite list. Write the list.

Prompt

Instead of: brutalist façade

Board-formed exposed concrete with visible wooden formwork striations, aggregate texture, cantilevers, deep recesses, raw concrete grey with rust staining, modular repetition, high-contrast raking shadows.

III · Realism

I trigger photorealism by name, then anchor it with texture.

The OpenAI Cookbook says it directly: including “photorealistic” engages a photorealistic mode. Pairing it with texture-forward nouns (real skin texture, pores, subtle film grain, brushed aluminium with micro-scratches, weathered copper patina, chipped paint, worn travertine) does more than the word alone.

Archtene's phrasing (smooth realistic quality like 3Ds Max and V-Ray rendering, accurate shadows, reflections and architectural realism, materials feel natural and premium, less plastic) pulls renders away from the default oversaturated gloss. I now use it as a tail on most photoreal briefs. What I have not worked out: is “less plastic” really doing work, or am I just superstitious about it?

IV · Framing

I treat camera language loosely. Composition precisely.

The Cookbook warns that detailed camera specs are interpreted loosely: lens lengths influence look, not physics. I write “medium close-up at eye level, 50mm lens feel, shallow depth of field, 35mm film aesthetic” rather than a sensor-and-aperture recipe. The mood lands; the dimensions do not.

For composition I am literal. “Corner perspective at eye level, slight three-quarter angle, hero object centred with generous negative space, horizon line in the lower third.” Corner perspectives outperform flat front elevations because they reveal depth and form, a thing Archtene flagged early and I keep verifying.

Photorealistic interior with fluted oak, exposed brick and marble — Principle III in evidence: fluted oak, exposed brick, marble veining, brass spotlights, two figures with credible eye contact.FIG. 1 · SINGLE PASS · HIGH QUALITY

V · Edits

I treat edits as a two-column contract: change plus preserve.

This is the move that has reshaped my iteration practice the most. Every edit prompt I write now states explicitly what changes and, repeated every turn, what must be preserved.

Prompt

OpenAI Cookbook template

Replace ONLY the white chairs with chairs made of wood. Preserve camera angle, room lighting, floor shadows, and surrounding objects. Keep all other aspects unchanged.

Without the preservation clause, I see drift in saturation, reflections, background. Without repeating it on every turn, the drift compounds. What I am still testing: how stable is preservation across six or more iterations on the same room? My informal answer so far is “less stable than I want.”

VI · Cadence

I iterate one change per turn.

Both fal.ai and Archtene arrived at this rule independently, and I keep landing in the same place. A single surgical edit (warm the lighting, mature the trees, soften the façade finish) beats a large rewrite every time. Combined with Pattern V: the preserve list stabilises what I already like while one knob moves. I think of it as CAD versioning. Wishing will not make it work.

“Is a prompt closer to a brief, or to a conversation?”

VII · Text

I treat in-image text as typography, not content.

For signage mockups, wayfinding studies, and presentation boards with captions, I put the literal string in quotes or ALL CAPS, specify font family (Inter, condensed sans, humanist serif), size, colour, placement, kerning, and add the bluntest line: “render the text exactly once, no duplicate text, no extra words.”

Use quality high for small text. PixVerse's informal fifty-prompt test found roughly nineteen of twenty generations returned legible first-pass text on gpt-image-2 when prompted this way, a result I did not quite believe until I ran my own and saw similar numbers. Open question: does the 19 in 20 hold for non-Latin scripts? I have not tested.

VIII · Mode

I name the render mode.

Photorealistic, or architectural?

“Photorealistic” for renders, “architectural render” for concept boards. Archtene's tested template (building type, style, materials, camera angle, lighting, site context, architectural render, realistic proportions, clean presentation, design-focused composition) produces the flat, even, competition-board aesthetic I want for concept work.

“Photorealistic candid photograph” gives me client-facing heroes with weather and atmosphere. Both modes are there. I name which one I want. What I am noticing: the boundary between modes is fuzzier than I first thought; there is a hybrid sketch-render territory I keep stumbling into accidentally.

Typographic poster rendered by gpt-image-2 — Principle VII in evidence: five sizes, four colours, multiple weights, an East Melbourne Art Market subhead, all rendered legibly in one pass.FIG. 2 · SINGLE PASS · HIGH QUALITY

IX · References

I treat reference images as indexed inputs, addressed by number.

For multi-reference compositions (a site photo plus a finishes board plus a furniture swatch) I label them explicitly in the prompt.

Prompt

Image 1 is the existing room to preserve. Image 2 is the wood grain reference. Image 3 is the sofa reference. Apply the wood from Image 2 to the flooring in Image 1; replace the sofa in Image 1 with the sofa from Image 3. Match scale, cast shadows, and white balance to Image 1.

Gpt-image-2 accepts up to sixteen references per edit call and processes every one at high fidelity, though I have not yet pushed past eight on a single brief. Open question: at what number does the model start dropping references?

X · Scale

I add scale cues. I do not trust proportion.

The model does not render to construction dimensions and never will. Rendair's caution (the model tends to dream over geometry) has held up across everything I have tested. So I seed scale explicitly, with bodies and objects at known sizes:

Prompt

One parked car at the curb, two pedestrians walking at adult height, café tables and chairs on the terrace, bicycles against the wall, garden lighting at knee height.

Human and object references calibrate the model's sense of storey height, door widths, and furniture depth more effectively than any numeric prompt I have tried.

Part Two · Applied workflows

Three workflows. Where the patterns earn their keep.

A grammar without sentences is inert. Three workflows are where I have found gpt-image-2 actually slotting into my existing process: concept boards, photoreal renders, and material studies. None of these has replaced anything in my stack. All three have changed the cadence of how I get to a first sketch.

Workflow 01 · Concept boards

The digital sketchpad. I generate four, then I commit.

For early ideation, the model has become what Rendair calls a digital sketchpad. The Cookbook's n=4 parameter is what makes this work: four variants from one brief, side-by-side, costs me roughly sixteen cents and a couple of minutes. Faster than writing four separate prompts, and the variety is genuinely useful.

Prompt

A brief I tested recently

Open sketchbook on a designer's desk, hand-drawn ink studies of a civic building: design intent annotations, form exploration thumbnails, structure detail. Pencils, sharpener, ruler, coffee mug arranged casually around. Soft warm lighting, top-down three-quarter view. Photorealistic.

Photorealistic open sketchbook with hand-drawn studies — From that exact brief. Annotations included, in a single pass. I am still surprised this works.WORKFLOW 01 · HIGH QUALITY

Workflow 02 · Photoreal renders

The hybrid sketch-render: my favourite accident with this model.

The output mode I keep returning to is what early users have started calling the hybrid sketch-render: half the image rendered photorealistically with full lighting and reflection, the other half left as graphite line drawing. The first time I got one I assumed it was a glitch. Then I worked out the prompt structure that triggers it, and now I use it deliberately for client-facing work where I want the design intent visible and the aesthetic rendered.

Prompt

The brief that gives me hybrid output

Photorealistic dusk view of a contemporary civic building with a vertical-fluted white facade, glowing warm interior, wet street reflections. The left third of the image transitions into a hand-drawn pencil sketch on white paper: visible graphite linework, ghosted trees, line-only buildings. Smooth blend through the middle third. Eye-level corner perspective.

For more conventional sketch-to-render, Archtene's recipe is the strongest documented workflow I have used for turning Revit, SketchUp, Rhino, or Archicad screenshots into client-ready renders.

Prompt

Standard sketch-to-render

Do a realistic render of this photo. Keep the same shapes and forms of buildings, fences, windows and doors. Add lush foliage to planter boxes and landscaping. Make materials feel natural and premium, less plastic.

Workflow 02 · Continued

Floorplan to perspective. I draw an arrow, I get a room.

The Architizer method (drawing a POV arrow on a floorplan, uploading the annotated plan, and prompting the model) has become my fastest path from plan to client-facing perspective. Boyuan Chen, the gpt-image-2 research lead, described 3D-style perspective shifts and complex spatial reasoning through simple text prompts as a headline capability. I think she undersold it.

Prompt

After uploading the annotated plan

Create an image of a 3D space from the angle shown on the floorplan as if you are a human standing there.

“If the sketch phase now costs four cents, what is the architect's role on Tuesday morning?”

Workflow 03 · Material & editorial studies

When the model produces a finished spread.

Material studies are where I run most of my edit-endpoint cycles: change one finish, hold the room, look. But the poster below is the output that genuinely surprised me. A complete editorial layout with hero render, four-panel analysis grid, design-evolution sequence, key-data sidebar, and full typographic hierarchy, generated in a single call from a structured brief. I do not yet know what to make of this.

Prompt

The brief that produced it

Single editorial poster, dark background. Hero: photorealistic Lego scale model of City Hall London at night, reflected in water. Left column: title block, four core concept icons with body text. Below hero: four-panel analysis grid (geometric form, solar performance, natural ventilation, spatial experience). Bottom: design evolution sequence, key-data sidebar. Typography: condensed sans titles, humanist sans body. Render text exactly as specified.

Single-pass editorial poster generated by gpt-image-2 — Eight visual elements, two columns of body copy, a consistent typographic system, one pass. I have not yet been able to reproduce this consistently, but the fact it is possible at all has changed what I think gpt-image-2 is actually for.WORKFLOW 03 · SINGLE PASS

The honest limits · Where it breaks

A CAD tool it is not. But the failures are interesting.

The places gpt-image-2 falls down are also the places I have learned the most about what it actually is. Five failure modes I have run into repeatedly, each followed by what I think it is telling me. None of these are settled.

It dreams over geometry. Anything that requires dimensional precision, repeating fenestration with consistent mullion spacing, or load paths that actually work, the model will confidently invent.

What that might mean — The model is a sketch tool that thinks in mood, not a drafting tool that thinks in measurements. I use it for the former and never the latter.

Logo reproduction is unreliable. Brand marks come back warped, wrong-coloured, or invented. I now composite logos in Photoshop after the fact.

What that might mean — The model has learned logo shape as a generic category, not specific marks as fixed assets. There is no library it is looking things up in.

iii

Complex prompts can take two minutes. A heavy brief with 12 references and a long preserve clause genuinely takes minutes to render.

What that might mean — The iteration pace I want, five cycles in fifteen minutes, is not always available. Some thinking has to happen inside the prompt before I press send, not after.

Adjective stacking degrades output. Stunning, ultra-detailed, cinematic, masterpiece, 8K, award-winning produces worse results than three terse visual facts.

What that might mean — The model is trained on captions, and good captions describe, they do not praise. Treat it like a caption-writer's audience, not a marketer's.

Vague preservation compounds. Keep the room the same while adding furniture quietly shifts wall colour, floor reflection, window framing. By turn three I have sometimes drifted to a different room.

What that might mean — There is no persistent state. Every turn is a fresh negotiation. Naming what stays is how I impose continuity the model does not have on its own.

Parameter discipline

A note on quality settings.

When I spend.

I use quality high whenever text appears in the image, when small-scale details (door hardware, stair nosings, tile grout) carry the design, or when the render is client-facing. I drop to medium for fast comparison sprints. Low when I am ideating at volume and do not yet care about polish.

Aspect and ratio.

I stay at or below 2560×1440 for reliable output. Above that is experimental territory I have not fully mapped. Landscape for exteriors and wide interior shots, portrait for towers and tall interior details, square only for social and mood-board tiles. All edges must be multiples of 16. Ratio capped at 3:1.

Client herohigh · 1536×1024

Comparison sprintmedium · 1024×1024

Ideation volumelow · 1024×1024

Mood boardhigh · square

Closing the loop · What I am still asking

The patterns are stabilising. The questions are not.

The ten patterns earned their place in my practice. The three workflows are now where I reach first for early-stage work. None of that means I have understood gpt-image-2, only that I have found the edges of what I tested.

01How stable is preservation across long edit chains? My answer so far is “less stable than I want,” but I have not pushed it methodically.
02Can the model maintain consistent material identity across separate sessions? Or is every session a fresh negotiation?
03What is the failure mode when I mix two architectural languages in one brief? Does the model average them, or pick one?
04If the sketch phase now costs four cents and two minutes, what does that change about how I bill, scope, and explain my work to clients?
05What is the half-life of any of these patterns? gpt-image-2 will be gpt-image-3 inside a year.
06And the one I think matters most: what does it mean for design practice that competent visualisation is now a commodity, while design judgment is not?

If you are working with the model and finding things I have not, I would genuinely like to know. Send notes via chiangning.net. I am collecting field reports for a follow-up.

All feature images here are gpt-image-2 outputs. None retouched. Each one taught me something I did not know about prompting before I made it.

Field Guide № 01 · April 2026 · End, for now

Back to resources