Text and Image Contradictions

Mar 09, 2026

This is a companion post to the Converting-Books-Into-Visual-Novels blog series I'm currently working on. It's not part of the series itself, because this post won't go into the how of the conversion process, but the why.

Namely: I want to share some of my thoughts on what I consider to be the main constraint of book-converted visual novels — that of being non-contradictory — and how that affects how I approach the conversions.

Contradictions in Text

Before talking about visual novels, with their text and images, let's consider the strictly simpler format of regular novels — non-illustrated ones — where all we have is text.

With novels, we have basic constraints like "Must be grammatically correct" and "Must convey meaning". These constraints don't apply to all possible texts, since the result of a monkey (or a postmodernist — anyone listed on this Wikipedia article really, especially Pynchon) on a typewriter would still be text; however, random texts do not novels make.

Similarly, for a text to function as a novel, it needs to be non-contradictory, and this is the constraint I want to explore.

Consider the following text that features a direct contradiction:

Richard is not wearing a hat. Richard is wearing a top hat.

Or this text, which contains an indirect contradiction:

Richard lives in medieval Europe. Richard is wearing a top hat.

Neither of these texts are self-consistent. Therefore, they fail at being valid text-based story constructions.

Now, both could be fixed via just removing one sentence each, or they could also both be fixed by adding more sentences:

Richard is not wearing a hat. Richards put on a hat. Richard is wearing a top hat.

Richard lives in medieval Europe. He got there by traveling back in time. Richard is wearing a top hat.

So it's not particularly difficult to fix contradictory texts, once we've spotted the inconsistency. And this is part of the reason why "non-contradiction" isn't usually thought of as a constraint in this way — since, to the the extent that the "constraint" exists, it can just be easily bypassed.

Text and Images Together

Where things get trickier is when we consider those mediums that combine text and images in their storytelling — mediums including visual novels, but also illustrated novels, graphic novels, comic strips, and so on.

If text by itself only had the one consistency constraint of "Text must not contradict itself", text and images together have four, but virtue of forming a two-by-two grid:

Let's consider the three new sub-constraints:

Images Must Not Contradict Themselves

Images must be consistent with their other images of the same story.

An example of an inconsistency between images could be: one image illustrates the character of Richard as wearing no hat, whereas the other illustrates the character as wearing a top hat.

Like with text, this isn't necessarily a contradiction if other images (or text) within the story can bridge the gap, explaining how we get from one state to the other. In this case, even the implicit passage of time from when one image is presented to when the next is could be sufficient. (Just like how, if "Richard is not wearing a hat" and "Richard is wearing a top hat" were separated by a scene change instead of being back-to-back, it wouldn't necessarily be a text contradiction either.)

A clearer example of a contradiction of images alone then would be in the depicting of something that definitely should not change without explanation, but does: for example, Richard's top hat's color changing from one image to the next. (Or, the contradiction could be of depicting something that definitely should change, yet doesn't.)

Lastly, we can note that, if we only have the one image of Richard, then that sidesteps the image-consistency problem altogether, since at that point, there wouldn't be anything for the one depiction to contradict.

Images Must Not Contradict Text

Given a text like:

Richard is wearing a top hat.

Then, if we are to have an image of what's supposed to be Richard, the image needs to show him wearing that top hat, in order to be non-contradictory.

Note that this doesn't mean we need to have an image of Richard. The constraint isn't "Images must reflect all details of text" — we're only concerned now with being non-contradictory with the text.

So we can satisfy the constraint by either having our image(s) showing off Richard in his top hat, OR having no images of Richard at all.

Text Must Not Contradict Images

This is the reverse of the above. If we have an image of Richard in a top hat, our text can't say "Richard is not wearing a hat". Instead, the text can either omit mention of Richard's headwear altogether, or just correctly identify it.

It may seem redundant to consider this a separate constraint from the one above, but note there's a bit of an asymmetry here:

Changing "Richard is not wearing a hat" to "Richard is wearing a top hat" is very easy, just as easy as deleting the sentence altogether. Whereas solving the contradiction from the image side — changing the image of Richard from top hat-wearing to hatless — is rather less easy.

AI image-editing tools certainly make the problem easier than would be the case years ago, but the easiest solution would still be to just not use the image. Since as established, from the perspective of fixing contradictions, deletion is just as effective as updating.

Visual Novels

Let's switch away from generic text-and-image pairings and now start talking specifically about visual novels. Or rather, even more specifically: visual novels converted in unabridged form from books.

The Public Domain Pulp project is all about taking public domain texts, breaking up those texts into shorter lines to accommodate the visual novel format, but otherwise leaving the text unmodified. The self-imposed rule is: we can add line breaks and we can adjust quotes to match the new line breaks, but we can't change any other aspect of the text.

This means that, as we add images to the text and start to deal with the four different possible subcontradictions of the contradictions grid, we can only actually make changes from the images side of things.

A regular visual novel, upon finding it contained a contradiction between its text and images, could choose to adjust either the text or the images. It could attack the problem from either the "Text must not contradict images" angle or the "Images must not contradict text" one.

But with converted visual novels, with their immutable texts, we can only change the images. So while the top half of the constraints grid still exists in theory, the two subconstraints there are therefore moot.

And that's fine for the "Text must not contradict itself" square, since just by virtue of stealing texts whole (good texts that is — the best texts), that square is guaranteed to always stay in a stable state.

But it's highly unfortunate for the "Text must not contradict images" square, since as established, it's easier to update text to match images than it is to adjust images to match text.

Furthermore, it's not just about adjusting images if and when we notice inconsistencies with the text (although, in the editing process, inconsistencies with the text are always found). It's also about: how do we even get the images in the first place, such that they have a chance of being consistent?

Sprites

With visuals, we have two main types of images: sprites and backgrounds.

Sprites aren't so bad, all things considered. While all the sprites of the same character need to be kept consistent with each other, they generally exist in isolation otherwise, from a "Images must not contradict themselves" perspective. And with the sprites being rather standardized, they're also relatively easy to keep consistent with the text too.

The texts' descriptions do often require creating many sprite variations for the characters, updating based on and and clothing and expression and other changes, but this doesn't make them more difficult than the backgrounds, even if they are more numerous and tedious to generate.

Partially it's again down to them still being more systemized in creation, but there's another key factor: sprites don't always need to be on-screen.

Remember: being consistent with text can take the form of following its details exactly in image depictions, but the non-contradiction constraint is also satisfied by just not showing an image. Therefore, if we have a particularly tricky text to visualize with respect to a character, we have a lot of control over whether the visual even needs to be displayed.

Backgrounds

Backgrounds are painful because they lack all of the above aspects that make sprites easier to work with:

Backgrounds need to always be on-screen and always cover the whole screen.

Backgrounds also lack any standardization: whereas sprites will always be a single character standing vertical on transparency, backgrounds can depict anything — they need to be able to depict anything that the text throws at them. And remember: we're dealing with texts that were never designed to be put into visual novel form, even though its obvious superiority would be apparent to all public domain authors, were they only alive still today — texts whose scattered descriptions are often not easy to wrangle into a cohesive picture.

Lastly, as compared to the relative ease of creating sprite variations off of the same character, whose updates are targeted and semi-automatable, creating background variations — whether that be changes to a background to represent the same location at different times or different sublocations within a larger one — is almost always difficult to get right. With background variations, we have to mind both the already tricky "Images must not contradict text" constraint, but then also the "Images must not contradict themselves" constraint.

Again, the fact that the constraint is only that of non-contradiction does help to make the task somewhat easier, since, although we do need to always display a background, the background images don't need to represent every last detail from the texts. And what this means is that we can often take refuge either in abstraction, or by just depicting only a portion of a described location. In either case: we lower the number of details to depict.

(Taken to the extreme, this results in fully abstract backgrounds with zero informational value: non-contradictory, but not exactly nice to look at, or effectively using the visual medium.)

The Details Problem

The fundamental problem with having lots of detailed backgrounds that overlap with lots of other details backgrounds is that: the more details are present, the more constraints we have to satisfy with regard to both the "Images must not contradict text" and "Images must not contradict themselves" rules.

And because this whole project is based on generating backgrounds, and because AI-generated images are probabilistic, the more non-contradiction constraints we have: the lower the likelihood of our generated images satisfying all the constraints (while also still looking good). And since violating even one of the constraints, whether that be from being inconsistent with another image or with some particularly detail from the text, is enough to be distracting to the point of it being better to have no image at all, you can hopefully see why detail-minimization is so critical.

Or rather, superfluous detail minimization is perhaps the better way to put it. We obviously still want many backgrounds, and for those backgrounds present to as comprehensibly represent the original text as possible; but those desires have to be considered in the tradeoff of how practical it is to even get the backgrounds in the first place (and then, how practical they'll be to update, if necessary).

While we can (and do) always re-roll any image prompt we want to generate for the best results, this doesn't avoid the limitation that, as the number of constraints increases, the closer the probability of getting a non-contradictory result moves towards zero.

This is why the other extreme of visual novel image generation — creating a unique background for every slide — is essentially impossible, at least if we want to avoid contradictions under the current capabilities of image generation AIs. The number of images we'd need would scale O(N), but the number of generations we'd need would scale… if not O(N^2), then still at least some super-linear function.

All this is to establish that: in the context of visual novels created here, the primary challenge is almost always the "art" of getting good backgrounds, which is downstream of minimizing the constraints the images are subjected to, balancing that minimization effort with the need to still have background variety and appeal.

As a whole, this is the fundamental constraint of the Public Domain Pulp conversion process, getting this aspect of the conversion correct. And that's why the process needs to revolve so heavily around background planning.

Closing Thoughts

For converted visual novels, where we can't change the text, the images must warp themselves to match the text, balancing the desire for detail with the need for consistency, often suffering in the process, as compared to a counterfactual project that would be able to modify both.

And yet, not modifying the source text is still the correct decision.

First, illustrated editions of books exist, and they have the same constraints. The constraints aren't as tricky for that medium, with its generally fewer images, which can also be strategically placed (though I suppose you could consider abstract backgrounds in visual novels as an effective off-state of the images themselves), and also have the advantage of not being probabilistically generated. But even still, the constraints are still fundamentally the same, which helps validate the concept of applying images after text in the first place.

Second, changing the Pulp rules to allow modifying text definitely wouldn't be all upside. Yes, it would make some tricky visualization problems easier (and it would also make some line formatting problems easier).

But: the key advantage of the non-modification is that, the text never needs to make any sacrifices to accommodate the images. The images may be required to warp, but the text never has to do so.

When reading non-converted visual novels, the prose is very often strange, unnatural, and clunky; and I believe a lot of this is down to the strategy of making the text fit the visuals, rather than the more difficult task of making the images fit the text. So the writing often reads both superfluous in unnecessary detail (required to keep up with visual changes), and yet still somehow lacking in substance.

Probably a lot of this is just down to average visual novel prose already being terrible. But at least part of it is the constraints problem.

Ideally: neither the text nor the visuals of any medium with both would need to sacrifice their quality to meet the other. But in practice, it does need to happen sometimes. And between the two, I'm happier to prioritize preserving the prose. Especially when the prose is, you know, the classics.

Of course, the visuals need to not be so sacrificed as to negate any supplementary value they have in telling the story. Otherwise, there'd be no point to reading a visual novel as compared to just reading the book.

And maybe after reading this, you agree with that statement, and go back to just the plain books. But as for me, I think the visual novels still have great potential for story enhancement, constrained as their visuals might be in cases. And after all, the better the constraints are understood, the better we can work around them!