Converting Books Into Visual Novels Part 1: The First Edit — Creating the Starter pulp.txt

Mar 20, 2026

This is the third in a series of posts explaining my processes for how I convert books into visual novels. See the full series links over on the website's README.md.

This post will walk through the first "edit" of the conversion process, wherein we create the starter pulp.txt file with its starter metadata annotations. In doing so, we'll leverage LLM-based text processing to the maximum extent possible, automating what can be automated.

If you want to skip ahead to just the specific commands and LLM instructions used as part of the first edit process, you can start reading from The Feeding Pipeline section further below. But before getting there, it helps to have some background explaining why the pipeline and instructions are set up in the way that they are:

What Can't Be Automated

As described in Part 0, which walked through all the metadata tags of the pulp.txt format, there are a lot of metadata annotations that need to be added as part of the overall book-to-VN conversion process. However, not all aspects of the metadata make sense to LLM-generate.

We only want to "automate" those parts whose LLM-generation will save us time in the long run. And for that, we need to take into account not just the time it takes for an LLM to spit out the metadata, but also the time it takes to review and clean up the LLM output. (Since if the cleanup time for a particular aspect of the metadata exceeds the time it would take to just enter that metadata's values manually, we aren't being efficient.)

Background Metadata Tags

As described in more detail in the constraints blog post, we do not want the LLM to be in charge of assigning backgrounds. Backgrounds are very tricky to make look good while also not contradicting anything said in the text or anything shown in other backgrounds.

We want to assign the background tags manually, because by that process we can build up the cohesive mental model of what the backgrounds should depict and how they should flow. Were we to have the LLM generate those background tags, we'd lose that mental model, making the already difficult process of background generating and re-generating and re-working and re-thinking that much harder. (It can take dozens of attempts for some backgrounds to get results that are both good and correct, and that's with the careful planning and prompt-tweaking. Absolute pain-in-the-ass.)

Therefore, we will not instruct our LLM converter to generate any b background tags or their related r reset-to-background tags.

Other Metadata Tags

For the rest of the metadata, there are three main reasons why we might not want to automate the generation of a particular tag type:

First, if a tag's metadata is dependent on knowing what the visuals are going to look like, we can't generate it at this stage. This rules out the c character tags, because we can't reason about whether character sprites should be on-screen before we know what the background images look like (or are planned to look like). It also rules out the v viewport tag (for sizing active sprites), the z zoom tag (for sizing characters relative to each other and to the speaker box), and the i speaker-counter index tag (for choosing among specific sprite variations).

Second, we don't want to automatically generate any of the tags whose use cases are too niche. The reasoning is that, the more rarely used a tag is, the higher the ratio of false-positives to good-outputs we'll end up seeing. We'll therefore save time by just entering the values manually in the first place, rather than having to clean up an LLM's output. This becomes less true the smarter the LLMs become, but as of time-of-writing, I think it can fairly be applied to all these tags:

Third and lastly, it's generally nice to avoid generating metadata whose values need to be carried over between chapters. The reason for this is that, in the course of processing book-length texts, we actually want to do the LLM-converting in chapter-length chunks (for reasons explained in the Text Feeding Approaches section further below). And this means that the LLM won't, by default, be able to see the metadata it generated for previous chapters, unless we do additional scripting work to carry specific values over. This doesn't rule out any specific metadata tags on its own, but it does make context-dependent tags like the background ones that much harder to generate automatically.

What Can Be Automated

The previous section showed off the many aspects of the metadata that aren't worth LLM-generating. But that was thinking about the automation in terms of annotation types, not annotation volume. And it's in reducing the volume of manual annotations that we get our big LLM-automation time-savings.

Flipping the script from above, then, the annotations we do want to generate are those that are the most common, easiest to verify and/or clean, and least complex to understand. (Since again, less complexity = fewer mistakes = less manual cleanup.)

This leaves three tags to generate, and one additional non-tag aspect of the metadata. Going from least-to-most common:

The (n)ame Tag

Every book character with a speaking role and/or visual appearance needs an n metadata tag, which assigns the character's ID and speaker display name.

In general, most characters need only one name declaration, or at most a few, when accounting for display name changes over the course of the book. But despite the n tags being relatively uncommon, we do still want them automated, since 1) the LLM-generated results here are generally pretty good, requiring minimal cleanup, and 2) we'll need the character IDs for generating the next two tag types.

Now, previously I mentioned how it's inconvenient to generate metadata whose values need to carry over between chapters. And the name tags certainly do need to carry over, since we need the character IDs to remain consistent between chapters. Fortunately, because name tags are both uncommon and easy to quickly validate, it's relatively painless to grep them out and then just copy them over into the prompt to be fed into the next chapter's generation. And since the name tags generally aren't position-dependent, this works fine for keeping the LLM updated on the current state of character annotations.
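As a concrete sketch of that grep-and-carry step, assuming the n:id=Display Name; tag syntax described in Part 0 (the names.txt filename is just made up for illustration):

```shell
# Collect every name declaration from the rolling pulp.txt, de-duplicated,
# ready to paste into the next chapter's prompt.
# Assumes tags look like "n:id=Display Name;" per the Part 0 format.
grep -Eo 'n:[^=]*=[^;]*' pulp.txt | sort -u > names.txt
```

The sort -u pass collapses repeated declarations, so the carried-over list stays short no matter how many chapters have been processed.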

The (s)peaker Tag

Next, we'll want the LLM to automatically assign s speaker tags, using the name IDs generated and continually fed in from above, to attribute the correct character IDs to each line of dialogue, setting the speaker back to none whenever dialogue ends.

In general, the LLMs are very good at this task, tending to make fewer mistakes than a human would over the course of a full book. (Or maybe I just suck at attributing dialogue consistently over long texts, idk.) Fewer mistakes, though not no mistakes. And it is interesting to note that, like a human, they tend to make more mistakes when fewer dialogue cues are given, like when there are many successive unattributed paragraphs in long back-and-forth conversations.

So proofreading is a necessity here — the speaker annotations are important to get correct. But the good news is that the overall conversion process requires at least two subsequent full manual re-read edits anyway, and thus we get a good level of manual double-checking more-or-less baked in.

We can also consider, for trickier texts that are more likely to have attribution mistakes, feeding the outputted pulp.txt back into another LLM, and have it double-double-check. (Although in my limited experiments testing out this approach, the false-positive flagging rate was rather high, so it may just be another case where manual checking is ultimately more efficient.)

The (e)xpression Tag

The last tag we want the LLMs to automatically generate is the e expression tag.

In theory, the expression tag should be at most as common as the number of speaker tags + thinker tags + active characters in the character tags. However, there's not really any downside to overproducing on our expression quotas, since an expression tag that doesn't cover any visible sprites just does nothing. (As opposed to overproducing on e.g., background tags, which would necessitate sifting through and cleaning up the unwanted extras.)

Therefore, in our eventual LLM instructions, we won't say anything about limiting the expression tags to only when characters are on-screen, since the LLM isn't going to know the full scope of that anyway (it doesn't know what the thinker or character tags will be set to yet). This gives us flexibility to add and shuffle tags around later on, while still having a good basis for keeping the expressions reasonably accurate, which is a good upside vs the minor annoyance of slightly noisier expression tag annotations.

When it comes to generating expressions, any expression value is valid in terms of what the Pulpifier accepts, whether it's set to grinning or gesticulating or genuflecting or whatever. But we don't want the LLM output to generate any arbitrary emotion — we want to constrain it to a pre-determined list of allowable emotions, since that'll both help reduce the number of near-redundant expression sprites to generate and also help make the expressions sprites easier to automatically generate.

We don't want the set to be too small, however, since the sprites should still allow for a wide range of emotions. The current set that we instruct the LLM to pull from is this:

The specific emotion list is fluid — it can be added to in order to address emotional gaps, and it can also be shrunk to reduce duplicates — so it may differ slightly from book-to-book.

But taking the above list as a baseline example: how well do the LLMs work in practice at assigning emotions from it? Well, there are a couple of aspects to consider:

First, the LLMs aren't perfect about restricting the emotions they generate to just the above list, but that's mostly okay. A one-off special emotion can even be desirable if it captures some unique state the list doesn't represent. And for those one-offs that we don't actually need, cleaning them up is easy enough.

Second, the accuracy of the emotion labeling is generally pretty good, but those are just the labels. The actual sprite or sprites that get generated for those expressions might differ somewhat in the intended meaning from the LLMs' labeling — emotion interpretation isn't exact.

Fortunately, unlike speaker attributions, which can only have one correct annotation, "correct" labeling of expressions is a little looser. There can certainly still be wrong assignments, but we often can't even say whether an expression is right or wrong until we can actually see it in its proper context, after the sprites and backgrounds have been generated.

In that sense then, expression tags are a case where it's actually faster and easier to let the LLM generate a close-enough initial draft of the metadata, which we can then fix and tweak as needed later on. For such a common tag, where heuristic interpretation is fine, this saves a lot of time vs trying to do all the initial expression tagging ourselves. Since either way, subsequent passes will be able to improve the labeling as-needed.

Line Breaks

Lastly, there's the insertion of line breaks themselves (and the subsequent necessary adjustment of quotes, as detailed in Part 0).

This is similar to expression attribution in that doing all the splitting ourselves would be highly tedious, as compared to just letting the LLM have the first pass and making adjustments as needed. Like with the expression tags, there's no one right answer on where all line breaks should be inserted, although there are plenty of heuristics for where good line breaks should go. We just feed those heuristics into the LLM's instructions, and it can generally do a fine job interpreting them.

There are definitely wrong places for line breaks to be inserted, such as between two random words with no punctuation separating them, but the LLMs rarely make this category of mistake. And even when they do, it's the kind of mistake the Pulpifier can automatically check for when comparing the generated pulp.txt against the original book.txt source. If it finds a misplaced line break (or bad handling of quotations across line breaks, or some other line break related mistake), it can just throw an exception.

The specific line breaks do get adjusted in places over the course of the conversion process, as the later edits figure out what works best for overall flow, taking into account the generated visuals and what-not. But that's again fine, since this is another situation where letting the LLM do the first draft gets us 90+% of the way there, saving lots of manual editing time.

Text Feeding Approaches

We've established what all metadata we want the LLMs to generate, and we know that we'll have to feed the book text into the LLM somehow to get the metadata back out. But how do we actually go about doing so?

The Full Context Approach

The most straightforward approach would be to feed in the full book.txt text all at once, and ask the LLM to output the full pulp.txt back out. Unfortunately, this doesn't work well.

On the input side, current LLMs can handle book-length texts just fine. One way you can verify this is to feed in an entire book, ask a specific question, one detailed enough that the text needs to be referenced directly, and observe the (usually correct) result it spits out. (In fact, this ability to quickly answer arbitrary questions comes in handy later on for double-checking images and their descriptions.)

The problem is on the output side. Even if we consider only a novella-length book of 30k words (and assume a 4:3 ratio for tokens:words), that's 40k tokens just to spit the text back out — the bare minimum. Then there's all the new stuff around the original text — the line breaks and metadata annotations — that push the output token count even higher.

And from observations in practice of seeing what the top-tier LLMs can handle, their ability to accurately spit back out the text with good line breaks and metadata annotations does drop as output text lengths increase. And eventually, the text either goes completely awry, or the LLM just errors out.

We can re-roll bad text generation outputs, but ideally we want to do this as infrequently as possible, and as the requested output length increases, the probability of eventually ending up with bad results approaches… if not one, then something close enough to it.

Smaller Context Approaches

Now, the next thing to try would be to feed in the whole book text, but only ask it to convert one chapter at-a-time. Then, feed the previously converted chapter metadata (the rolling pulp.txt generation) back into the LLM, either replacing the book.txt portions already converted, or else just adding to them. The point of feeding the already-converted metadata back in is that we need the character IDs generated by previous chapters' name tags in order to correctly add additional name, speaker, and expression tags.

Overall, this approach works better, but the failure mode rates are still too high. Even though the LLM can ingest the whole book, and even the whole book repeated again with added metadata annotations, there's still quality drop-off with the increased input context.

Third attempted approach then: feed in just the particular chapter text an LLM needs at any given time, plus all previously generated metadata-annotated text. This works better still, but still starts to collapse in later chapters of longer texts, as the full weight of the ever-growing pulp.txt text crowds the input context. While the LLMs can very well understand the pulp.txt in small doses, they do still tend to choke as the pulp.txt approaches full book length. Their capabilities aren't as impaired as in the previous approaches, but still impaired enough to cause problems.

The Chapter-by-Chapter Approach

So we simplify the input further. Rather than feeding the full metadata, with all its line breaks and speaker annotations and expression annotations, none of which really matter between chapters, we just feed in the one specific bit of metadata that does matter: the n name metadata tags. And this is how we arrive at the system mentioned earlier of grepping out just those tags, cleaning up their list, and appending just that small text set to our prompt for each chapter, fed in one-at-a-time.

This, finally, works pretty well, with the outputted results being fairly accurate on average in terms of adherence to the desired line break formatting, correct speaker attributions, strict adherence to the source text, and good expression attributions.

And, even when the results aren't good, doing things one chapter at-a-time makes re-rolling for better results much easier. (LLMs being magic probabilistic machines, running the same prompt with the same exact input can give results that vary wildly in quality, and we can just pick the best one.)

Chapter-by-Chapter Considerations

Now, a valid objection to all this might be: if we're only feeding chapters in one-at-a-time, can we really expect the LLMs to assign accurate emotions to the characters, not knowing the full context of what came before in the story? You could even argue it might hurt the speaker tag generation accuracy as well, without the full context of how each character is referred to and tends to speak.

The answer is that, in practice, this doesn't seem to be much of a concern, for a few reasons:

First, the accuracy benefits of having smaller input and output context sizes pretty well dwarf any other effects in terms of getting the most accurate results. Again, this just seems to be a fundamental limitation of how current-generation LLMs work. Perhaps that limitation might go away in the future, but it needs to be worked around in the present.

Second, as far as expressions go, cross-chapter emotional nuances aren't a common enough problem to even need worrying about. The emotion of a character is almost always rooted in their current state, which the chapter context will cover. And again, expressions are so fuzzy and heuristic-driven anyway that being "good enough" at this stage is plenty good enough. Adjustments around emotional subtleties need to happen later on anyway, once we can actually see what the sprites look like, so not having access to the full text doesn't really change much.

Speaker attributions, on the other hand, have much more rigid right-and-wrong answers. But again, in practice, the LLMs seem plenty good enough at reasoning on just a chapter-by-chapter basis when it comes to labeling. This seems to hold even when characters' full names aren't restated in each chapter. Partially this is down to us feeding in the full names as part of the name attribution metadata, and part of it comes down to:

The third factor: for the types of stories being processed on Public Domain Pulp — the literary classics — the LLMs will basically always already "know" the full story anyway. They don't need to be fed the full text of The Great Gatsby to know what the overall plot is, and therefore what "state" the characters should exist in. You can kind of demonstrate this by asking the image generators to create character sprites of characters from the books without providing any textual details — they'll still magically do a pretty good job of getting the details right. Not as good a job as when the details are specified directly, but certainly good enough that it doesn't seem likely that the LLMs will lose the plot as part of their conversions. (Or rather: when they do lose the plot, it's not because they don't know the plot.)

Now, perhaps this would become more of an issue when trying to convert lesser-known works. Though in practice, it still doesn't seem super likely to be problematic, since I think the LLMs would still be able to get good results out of the chapter-length contexts given to them for the limited set of metadata we want back out. But, without any real obscure texts attempted yet, it is still somewhat of an open question.

The Feeding Pipeline

We've established the pulp.txt format and its metadata attributes in detail.

We've established what specific subset of the metadata we want out from this initial conversion process.

We've established the overall strategy we're going to use for this conversion process: feeding in one chapter at-a-time, along with a rolling list of the previously generated name tags.

Now, we're finally ready to actually do the conversion.

Prepping the Source Text

The first thing we need to do is split up the source book.txt into chapter-length segments. To have an example to work with, we'll use Dr. Jekyll and Mr. Hyde as our source text.

The bash command to split up the chapters looks like this:

csplit -z -f 'chapter_' -b '%02d.txt' book.txt '/^## /' '{*}'

Where we're using ## as the chapter header matcher, since that's what that particular book uses. But note that other books might require three pounds, or either two or three pounds, or some other handling, to get the splits. We might not even necessarily want to break on every possible chapter/section divider for some books, if the outputted chunks are too small at their finest granularity. We can see, for example, that this basic split of Jekyll and Hyde produced some wildly different word counts, some of which are very short:

$ wc -w chapter_*.txt
    13 chapter_00.txt
  2402 chapter_01.txt
  2939 chapter_02.txt
   800 chapter_03.txt
  1669 chapter_04.txt
  1638 chapter_05.txt
  1496 chapter_06.txt
   555 chapter_07.txt
  4369 chapter_08.txt
  2800 chapter_09.txt
  6942 chapter_10.txt
 25623 total

Although there's an order of magnitude difference between the longest and shortest chapter here (note that chapter_00.txt just contains the book header, and so doesn't count), that's mostly fine for the purposes of what we're doing — we can still just feed things in one-at-a-time as-is. But note that we could always combine smaller chapters or split larger ones if we wanted a more strictly even breakdown.
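Combining, for instance, is a one-liner. A sketch using two adjacent chapters from the listing above (the merged filename is just made up):

```shell
# Merge two adjacent short chapters into a single chunk for conversion.
cat chapter_03.txt chapter_04.txt > chapter_03-04.txt
```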

While there's no hard limit for these inputs, it seems to be at about ~5-6k words that the quality drop-off ramps up into annoying territory. So for this particular example, that final chapter would be right around that threshold, and would probably make sense to split up. Certainly I think anything over ~10k words would be worth splitting to get a better tradeoff in terms of number of chunks to process vs chunk-processing accuracy.
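Splitting an oversized chapter takes slightly more work, since the cut should land on a paragraph boundary rather than mid-sentence. A rough sketch, assuming blank lines separate paragraphs (the output filenames are made up):

```shell
# Split a long chapter near its midpoint, cutting only at a blank line.
total=$(wc -l < chapter_10.txt)
awk -v half="$((total / 2))" '
    NR > half && NF == 0 && !done { done = 1; next }  # first blank line past midpoint: cut here
    !done { print > "chapter_10a.txt" }
     done { print > "chapter_10b.txt" }
' chapter_10.txt
```

This splits by line count rather than word count, which is close enough for picking a roughly even cut point.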

Feeding the Chapters

Now, we start feeding in chapters, starting with chapter_01.txt. We do this alongside the long boilerplate prompt created for this initial conversion process, which I've included the current version of as reference in the GitHub repo: vn-prompt.txt.

The prompt itself is pretty long, almost nine thousand words, but remember: input tokens aren't as bad for quality reduction as the equivalent number of output tokens; and more importantly: we do need all those tokens to accurately specify in full detail what the task is. We need to explain how the line-splitting should happen, what the three types of metadata are, how to generate each, and then a long example of book.txt input paired with the draft pulp.txt output. And the last part of the prompt is just the title of the book, the current running list of characters (empty for the first chapter), and the chapter header we want the output to start from.

With that big boilerplate ready, we can finally fire off the prompt, append the resulting output into our rolling pulp.txt file, and begin the examination and cleanup process.
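Mechanically, assembling each chapter's prompt is just concatenation. A sketch, where names.txt is a hypothetical file holding the cleaned-up rolling character list (the exact ordering of the trailing pieces follows the boilerplate's own template):

```shell
# Assemble the per-chapter prompt: boilerplate instructions,
# then the rolling character list, then the chapter text itself.
cat vn-prompt.txt names.txt chapter_01.txt > prompt_01.txt
```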

Failure Modes and Corrections

There are several ways the conversion process can go wrong. As mentioned previously, the probability of it going wrong is some function of the chapter length, but not a linear one, and bad results can come from any length of chapter, even if less likely from the shorter ones.

First off, the most entertaining conversion problem that can happen is when the LLM decides it doesn't want to transcribe the original text anymore and instead just starts writing its own continuation of the story. The LLMs will sometimes make simple one-character or one-word transcription errors, and they can usually recover from these, but sometimes small initial mistakes can spiral out into complete fabulizing of the remaining text. Essentially, the LLM just starts writing fanfiction — not what we want. It's funny that these "spinoff" sequences tend to be markedly more poorly written than the original text, which I suppose on one hand makes sense, since they're up against the greats of the classics. On the other hand, I would expect the smartest LLMs to do better at creative writing at this point. Though on the other other hand, I suppose they are being asked to do two things at once by that point, since they still also tend to spit out metadata against their fake story continuations.

Another, more common failure mode involves problems around line breaks. Sometimes the LLM will decide to only insert a single line break between two text lines without any metadata, which is incorrect, since we want a 3-cycle sequence. Other times it'll insert one extra line break, turning it into a 4-cycle. Sometimes it'll fail at the quote-handling part of the line breaks, not adding new quotes where they're necessary for the pulp.txt format. And sometimes, it'll put line breaks after every semicolon, instead of keeping sentences joined in whole units.
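Some of these slips can at least be flagged mechanically before doing any fixing. For example, an extra inserted line break shows up as two consecutive blank lines, which a short awk scan can spot (a sketch only; the Pulpifier's own checks are the real authority here):

```shell
# Flag any run of two consecutive blank lines (a symptom of a "4-cycle"
# line break mistake) with the line number where it occurs.
awk 'NR > 1 && prev == "" && $0 == "" { print NR ": double blank line" } { prev = $0 }' pulp.txt
```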

Some of the time, these formatting mistakes are fixable given the right find-and-replace regexes, if the mistakes are of a systematic and regular nature. For example, for fixing line breaks on semicolons, the find/replace pattern (in Gedit syntax) is this (with a few optional subsequent find/replaces to deduplicate redundant semicolons that come about as a result of joining multiple metadata lines together):

   Find: \n\n([^\n]*;)”?\n([^\n]*)\n\n“?([^\n]*)\n([^\n]*)
Replace: \n\n\1 \3\n\2;\4

Other times though, the line break mistakes are either too irregular to fix up via regex, or else of a more fundamental nature that just can't be systematically fixed up (e.g., putting every single sentence on its own line, or too aggressively separating dialogue from non-dialogue text). In these cases, like when the LLM just transcribes incorrect text, the best approach is to just re-roll for better results.

Then, there are some smaller mistake categories:

As mentioned above, sometimes the LLM just transcribes one particular token wrong. Often this is the LLM "fixing" what it sees as a mistake, like normalizing an uncommon spelling. This isn't what we want, but such mistakes are easily identified by the Pulpifier in its text comparison between the original text and the new output, and so point mistakes like that can just be manually fixed.

Sometimes the LLM will include weird extra annotations — stuff like [cite_start] with numbers pointing back to the original text. This seems to be some problem in its "thinking" process where it spills citation tokens into the raw text output. But since these follow a regular pattern, they can also be find/replace regexed out.
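A sketch of regexing that debris out from the command line. The exact shape of the spilled citation tokens varies between runs, so the bracket pattern here is just an assumed example to adapt:

```shell
# Strip leaked "[cite_start ...]"-style citation debris from the output.
sed -E 's/ ?\[cite_start[^]]*\]//g' pulp.txt > pulp_clean.txt
```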

The conversion process can also put line breaks in places that are valid, but just suboptimal for how we want the VN to read. Sometimes this is having lines that are too long, or too many lines in-a-row that are too short, or lines that combine multiple settings, or lines that combine dialogue and non-dialogue. Part of this initial first edit process requires scanning the outputted results anyway, and if mistakes such as these are noticed, they can be quickly fixed, though in general it's fine to leave them for subsequent edits to correct.

Speaker attributions are somewhat similar, although their mistakes can be a little trickier to identify if left for later: because we lose the original paragraph markers, we'll often need to go back and reference the original source text to see how the quotes were formatted there, in order to know whether back-and-forth dialogue is labeled correctly. But at least for speakers, mistakes tend to be much rarer, and usually still catchable in the second edit, or if not then, then the third.

Name attribution mistakes essentially never happen, though sometimes we'll want to tweak the display names of characters, or remove redundant name declarations, or add in additional ones for renaming, so as to not reveal a character's full name right away, or for whatever other reason. Since name tags are the least common tag and the most important to get right, and since we'll be feeding them forward anyway, we want to review all of them individually no matter what, correcting them if needed, as part of the first edit.

This regex can be used to quickly round up all the declared names:

grep -Eo 'n:[^=]*=[^;]*' pulp.txt

And note that throughout all this, we'll want to keep checking the compilation status of the new pulp.txt contents by running the PulpifierCLI tool against it. This identifies all the errors that can be automatically identified, such as the aforementioned transcription errors, invalid line breaks, and also some types of speaker attribution mistakes (e.g., active speaker on quote-less line).

Assuming the chapter has no remaining compilation errors within itself, the PulpifierCLI will still show one error if the chapter isn't the book's last: a mismatch on the expected header of the following chapter. And that's how we know that the chapter is fully Pulpifier-ready.

Feeding Forward

After we've gotten our chapter all nice and converted, we move on to the next one. But before we try feeding it in, we need to update the rolling character list. We can use the same regex included above for the name tag cleanup to pull the names out, and then copy them into the end of the boilerplate prompt, along with updating the desired target chapter header, and selecting the next chapter file for our conversion.

Note that we don't necessarily want to carry every last character over between chapters, since some of them might be one-offs that have no bearing on later chapters, and might have generic identifiers like waiter or policeman that we wouldn't want to get reused in subsequent chapters. (In fact, it's often beneficial to swap the IDs for these to ones like waiter1 or policeman1 to help make sure they don't get accidentally reused between chapters.)
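That ID-swapping can also be done mechanically. A sketch using GNU sed's word boundaries, with the waiter/waiter1 names just being the examples from above:

```shell
# Rename a generic one-off character ID so it can't be accidentally
# reused by a later chapter's generation. (\b is a GNU sed extension.)
sed 's/\bwaiter\b/waiter1/g' pulp.txt > pulp_renamed.txt
```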

But those minor prompt tweaks aside, it's basically just the same process again, and again and again, for each chapter, until we've gotten the whole book's pulp.txt out.

All Done (With the First Edit)

And with that, we've gotten our starter metadata-annotated text to build off of.

Overall, while the process takes a lot of explanatory setup to show why it is as it is, the actual conversion itself isn't so bad, even if it can involve a hefty amount of LLM output cleanup. But even then, this first edit is still by far the shortest of the edit steps, or of any of the steps really. The second and third edits, while conceptually simpler (and thus hopefully making for shorter blog posts), are fully manual, so they take more time.

But those subsequent step explanations and their own unique discussion topics will have to wait for the next parts.