RubyConf Taiwan 2023

Translating XML and EPUB using ChatGPT

#rubyconftw 2023

Natsukantou and epub-translator gems offer XML translation by allowing users to mix & match filters and translation services (DeepL/ChatGPT). The gem's architecture will be presented, showcasing its middleware pattern, the use cases and wizard configuration inference.

00:00:28.960 Hello everyone, I hope you're all back. We're going to start with the next speaker right now. He is Mark Chao from RubyConf Taiwan. He will share about translating XML and EPUB using ChatGPT. Welcome!
00:01:12.000 Hi everyone! So, this talk is not AI-centric. It's just a usage of GPT, the technology we use in Taiwan. Sorry, my name is Mark, and I go by the name 'lulalala' on the internet, specifically on GitHub and Twitter.
00:01:36.079 I'm a backend engineer at GitHub, and I'm looking for someone to review my pull request that has been sitting there for five weeks. If you're a contributor, please let me know.
00:02:01.439 I belong to a hobby group called Saku, which is an otaku group. We create things related to anime and earlier this year, we decided to do some research on Masaaki Yuasa, an anime director. I recommend his film called 'Mind Game'; it's really trippy.
00:02:15.280 While doing this research, I discovered a magazine called Eureka that published a special issue on Yuasa, which contains a lot of valuable interviews. Of course, my group wanted to read it, but we couldn't read Japanese, so we needed to rely on machine translation.
00:02:43.120 The ebook is a collection of HTML pages, and since the HTML inside an EPUB is XHTML, which is a form of XML, we needed a translation service that can handle XML. I found two services: one is DeepL, which everyone is familiar with, but it costs money and doesn't accept this credit card, which is a shame.
00:03:09.319 The other one I found is a free Japanese translation service called 'Auto Translation.' If you want to translate Japanese content, I suggest trying it out. I tried writing a draft using both services, one for DeepL and the other for Auto Translation.
00:03:27.319 As you can see from the code, there are lots of simple functions—one for DeepL and one for the other service. There's no test involved; I just copied and pasted one function to the other without adhering to object-oriented design. But I was able to translate the content to Chinese, and we were happy.
00:04:07.239 However, I thought I could turn this spaghetti code into a modular design. My idea was to create one gem that focuses on translating XML and another to handle translation of dynamic content.
00:04:34.000 Now, I will be talking about my XML translation gem. The name came from my visit to a place called Aryama Hoto in 2017 where I had a dessert called 'Natsukantou.' It’s a grapefruit that has the center removed, turned into jelly, and then reinserted. This process is quite similar to XML translation: you pull out text, translate it, and then reinsert it into the XML structure.
00:05:40.680 That’s why I chose this name for my gem, even though there was a slight misunderstanding during a presentation when my Japanese friend pointed out they had no idea what 'Natsukantou' meant without context. So, the moral of the story is, if you want to use a foreign language to name something, it's crucial to consult an expert in that language.
00:06:45.320 Let’s start with the first topic I want to discuss, which is the API. My goal is to maximize the customization of the gems to enhance translation.
00:07:00.400 I want to allow users to choose different translation engines and different parsers. You will type your input at the top, which is passed layer by layer down through filter layers. The filters can modify the text if they wish. Once the bottom layer is reached, it will be translated by the language model.
00:07:38.120 The translated text is returned layer by layer and the filters can also modify the resulting translated output. In code, this will look like the middleware pattern, which you may be familiar with if you've written Rails applications.
00:08:01.440 In this case, I've created a layer called 'Filter' and once that's set up, I can submit the source text and the desired language for translation. It then returns the translated XML to me.
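The layered flow described above can be sketched in a few lines of Ruby. This is a minimal illustration of the middleware pattern, not the gem's actual API: `FakeTranslator`, `StripWhitespace`, `build_stack`, and the `env` hash keys are all hypothetical names chosen for the example.

```ruby
# Hypothetical names throughout; this is not the natsukantou API.

# Bottom layer: stands in for a real translation service (DeepL, ChatGPT).
class FakeTranslator
  def call(env)
    env.merge(text: "[#{env[:lang]}] #{env[:text]}")
  end
end

# A filter wraps the next layer. It may change the text on the way down
# and the translated result on the way back up.
class StripWhitespace
  def initialize(app)
    @app = app
  end

  def call(env)
    result = @app.call(env.merge(text: env[:text].strip)) # way down
    result.merge(text: result[:text].squeeze(" "))        # way up
  end
end

# Builds the stack so that the first filter listed is the outermost layer.
def build_stack(filters, translator)
  filters.reverse.inject(translator) { |app, filter| filter.new(app) }
end

stack = build_stack([StripWhitespace], FakeTranslator.new)
output = stack.call(text: "  <p>Hello   world</p> ", lang: "fr")
puts output[:text]
```

Each filter only knows about the layer directly beneath it, which is what lets filters and translation services be mixed and matched freely.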
00:08:30.000 As an example, I want to write a middleware for ruby annotations. Ruby annotation is not related to Ruby the programming language; it is markup that lets you place small text on top of other text.
00:09:02.000 For example, I can place small text above a Japanese word to show its pronunciation. However, this can be problematic for translation services, as they may think they need to translate each character separately rather than as part of a phrase.
00:09:59.520 To improve the translation, I want to join the two words and present them as one single unit through middleware. The middleware starts with an initialize method that takes the app and the parameters for the middleware configuration.
00:10:40.560 Then, I need to access every single ruby tag in the XML. For each ruby tag, I will first remove all the unnecessary tags, which we do not want here. The result after this process is that we have well-structured XML to pass to the translation service.
00:11:11.920 Once we pass this XML to DeepL, we receive a better translation, demonstrating the effectiveness of middleware in translation.
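A filter along these lines might look like the sketch below. It uses REXML rather than whatever parser the gem really uses, the class name `FlattenRuby` is hypothetical, and it assumes the base text sits directly inside each `<ruby>` element (no `<rb>` wrappers).

```ruby
require "rexml/document"

# Hypothetical filter: replaces each <ruby> element with its base text
# before handing the XML to the next layer.
class FlattenRuby
  def initialize(app)
    @app = app
  end

  def call(xml)
    doc = REXML::Document.new(xml)
    REXML::XPath.match(doc, "//ruby").each do |ruby|
      # Keep only the text nodes directly inside <ruby>; the <rt> and <rp>
      # children (pronunciation and fallback parentheses) are discarded.
      base = ruby.texts.map(&:value).join
      ruby.replace_with(REXML::Text.new(base))
    end
    @app.call(doc.to_s)
  end
end

echo = ->(xml) { xml } # stands in for the translation layer
flattened = FlattenRuby.new(echo).call(
  "<p><ruby>東<rt>とう</rt>京<rt>きょう</rt></ruby>に行く</p>"
)
puts flattened
```

After flattening, the translation service sees the two characters as one word instead of two separately annotated fragments.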
00:11:31.560 Now, let’s talk about the user interface design for my gem. When we think about command line parameters, we usually think about flags and arguments. They work fine for small tasks, but once your program grows, like ImageMagick, it can be difficult to remember which arguments do what. My gem is similar; there are too many dynamic states, and it becomes hard to function just by using simple arguments.
00:12:51.159 Therefore, I think a wizard interface is more appropriate. It would essentially help generate my middleware and then utilize it through a Ruby file for translation.
00:13:01.120 My program is divided into two phases. The first phase asks the user to choose which middleware and translation engines they want. For each of the middleware, we will go through each of the arguments of its initialize method and ask the user to provide them.
00:13:30.120 Once I have configured three settings, I can use ERB to generate the necessary Ruby file. In the second phase, I will perform actual translations and ask for the target language.
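Generating a Ruby file from the wizard's answers could be sketched with ERB as follows. The `config` structure, the `use`/`run` lines in the generated file, and the class names are assumptions made for illustration; the real generated file may look different.

```ruby
require "erb"

# Hypothetical wizard answers: chosen translator, filters, and their arguments.
config = {
  translator: { name: "DeeplTranslator", args: { auth_key: "YOUR-KEY" } },
  filters: [{ name: "FlattenRuby", args: {} }]
}

# Renders keyword arguments as Ruby source, or nothing if there are none.
def format_args(args)
  return "" if args.empty?

  ", " + args.map { |key, value| "#{key}: #{value.inspect}" }.join(", ")
end

template = <<~'ERB'
  # Generated by the wizard; edit by hand if needed.
  <%- config[:filters].each do |filter| -%>
  use <%= filter[:name] %><%= format_args(filter[:args]) %>
  <%- end -%>
  run <%= config[:translator][:name] %><%= format_args(config[:translator][:args]) %>
ERB

# trim_mode "-" lets <%- and -%> suppress the whitespace around control lines.
generated = ERB.new(template, trim_mode: "-").result(binding)
puts generated
```

Because the output is plain Ruby, users can keep it under version control and adjust it by hand after the wizard has run.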
00:13:59.240 I'll then call the middleware with the XML or HTML file. If you are familiar with Ruby, you know how this complicates things. I enter the access token and a password to select the translation service.
00:14:18.959 You can select two options, or just one if you prefer. Then, the program will ask you which language to translate into, and, finally, what file you want to save the translations to.
00:14:38.899 This is a brief outline of my program, showcasing the user interface design.
00:15:06.959 Implementing this user interface relies on knowing which middlewares are available. I could scan a directory to find all the middlewares, but I believe developers would prefer to know explicitly which middlewares are accessible.
00:15:23.720 To achieve this, I used an auto-loading mechanism that allows developers to register their middleware by using a method that marks middleware for usage.
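One way to make registration explicit is a small registry that each middleware class opts into. The sketch below shows only that idea; `MiddlewareRegistry`, `register_as`, and the filter class names are hypothetical, and the auto-loading part mentioned in the talk is left out.

```ruby
# Hypothetical registry; the gem's real mechanism may differ.
module MiddlewareRegistry
  # Maps a registered name to its middleware class.
  def self.filters
    @filters ||= {}
  end

  # Available on any class that extends this module.
  def register_as(name)
    MiddlewareRegistry.filters[name] = self
  end
end

class RubyAnnotationFilter
  extend MiddlewareRegistry
  register_as :ruby_annotation
end

class GlossaryFilter
  extend MiddlewareRegistry
  register_as :glossary
end

puts MiddlewareRegistry.filters.keys.inspect
```

The wizard can then list `MiddlewareRegistry.filters` as its menu of choices instead of scanning the file system.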
00:15:49.760 Once we know which middleware we have, I need to ask for the necessary arguments that users must input. For example, they might need to enter an API token for DeepL.
00:16:05.840 Each middleware should have an initialize method that specifies what arguments it needs. I initially thought I would have to use RBS or similar for better type checking, but I discovered I could just use documentation to provide necessary instructions to users.
00:16:36.320 The documentation is crucial; if you're using a wizard to assist users, you need to clearly explain what they need to provide.
00:17:00.600 Firstly, they would provide a file path, and then from there, I can look into the YARD registry for information on the DeepL gem.
00:17:40.680 Later, I would look through its methods for an 'initialize' method. Once I have the 'initialize' method, I will call text, which grabs the comments and necessary texts to produce a clear instruction for the users.
00:18:02.680 This method ignores parameters that users do not need to enter and leaves us with the essential arguments to capture.
00:18:34.160 My wizard needs to be user-friendly, so I decided to use TTY::Prompt for a pleasing, interactive user interface. This allows the users to select from different options, such as predefined selections or input prompts.
00:19:15.279 For example, I ask the user to choose a translation engine from a list of available candidates.
00:19:59.400 Once I've confirmed a selection, I will call a method that shows all available parameters and their documentation.
00:20:01.919 Next, I do the same for middleware layers, allowing multiple selections. For each middleware, I also initialize parameters to ensure everything is clear.
00:20:44.560 Before asking users about initialization parameters, I specify the defaults for each to give them clarity.
00:21:05.560 For each parameter, I check if it is optional or required, and if it has a default value. These insights guide users on what information they need to input.
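The required-versus-optional check can be approximated with plain Ruby reflection, shown below as a stand-in for the YARD-based lookup described in the talk. `DeeplTranslator`, its arguments, and `wizard_questions` are hypothetical. Note that reflection reports whether a parameter has a default but not the default's value, which is one reason documentation is still needed.

```ruby
# Hypothetical translator class used only to demonstrate reflection.
class DeeplTranslator
  def initialize(app, auth_key:, timeout: 30)
    @app = app
    @auth_key = auth_key
    @timeout = timeout
  end
end

# Lists the initialize parameters the wizard should ask about, skipping
# ones the framework supplies itself (such as the next layer, `app`).
def wizard_questions(klass, skip: [:app])
  klass.instance_method(:initialize).parameters.filter_map do |kind, name|
    next if skip.include?(name)

    # :req and :keyreq have no default, so the user must supply a value.
    { name: name, required: %i[req keyreq].include?(kind) }
  end
end

questions = wizard_questions(DeeplTranslator)
questions.each do |question|
  label = question[:required] ? "required" : "optional"
  puts "#{question[:name]} (#{label})"
end
```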
00:21:32.360 Once I gather all the necessary inputs, I compile everything into a parameters configuration file. This includes the middleware details along with the users' inputs for each of the arguments.
00:22:48.560 The output from this ERB file establishes the middleware configuration file so that users do not have to re-enter the same information each time they wish to translate something.
00:23:20.048 Instead, they can simply specify the middleware file moving forward and initiate translations without hassle.
00:24:10.640 Now, regarding the integration of ChatGPT in translating XML, I thought of using the structured output but ended up opting for a more straightforward approach.
00:24:56.679 My main challenge was to instruct the model to handle XML structures accurately. I had to create very precise prompts to ensure it only returns XML.
00:25:12.919 I found that if I instructed it to translate XML from English to French without additional constraints, it sometimes responded inaccurately. I had to create a robust prompt indicating that the return must only be the translated XML.
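A constrained prompt of this kind could be assembled as in the sketch below. The payload follows the general shape of a chat completion request (a model name plus system and user messages), but the prompt wording and the helper name `translation_request` are assumptions, not the text the gem sends. No network call is made here.

```ruby
require "json"

# Builds a request body that asks for the translated XML and nothing else.
# The prompt wording is illustrative only.
def translation_request(xml, from:, to:, model: "gpt-3.5-turbo")
  {
    model: model,
    temperature: 0,
    messages: [
      { role: "system",
        content: "You translate XML from #{from} to #{to}. " \
                 "Keep every tag and attribute unchanged. " \
                 "Reply with the translated XML only, with no explanation." },
      { role: "user", content: xml }
    ]
  }
end

payload = translation_request("<p>Hello</p>", from: "English", to: "French")
puts JSON.pretty_generate(payload)
```

Putting the constraints in the system message and the raw XML alone in the user message keeps the document from being mistaken for instructions.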
00:26:19.920 Also, I want to discuss the idea of a glossary file. A glossary file is essentially a dictionary provided by the user for specific translations, sometimes following a limited format.
00:27:14.000 The approach I have in mind is to substitute text in the document with user-provided alternatives before sending it for translation.
00:27:55.919 This allows me to manage what portions I want to translate effectively. However, note that this method becomes tricky with models that don't have nice handling for exclusions, such as skipping certain texts.
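The substitution step might be sketched as below. The glossary contents and the `apply_glossary` helper are hypothetical, and the sketch works on the raw string, so a real filter would need to restrict it to text nodes to avoid touching tag or attribute names.

```ruby
# Hypothetical glossary: source term => preferred translation.
glossary = {
  "湯浅政明" => "Masaaki Yuasa",
  "マインド・ゲーム" => "Mind Game"
}

# Replaces every glossary term in the text before it is sent for translation.
def apply_glossary(text, glossary)
  return text if glossary.empty?

  # Longest terms first so a short term never clobbers part of a longer one.
  pattern = Regexp.union(glossary.keys.sort_by { |term| -term.length })
  text.gsub(pattern, glossary)
end

substituted = apply_glossary("<p>湯浅政明の『マインド・ゲーム』</p>", glossary)
puts substituted
```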
00:28:48.160 Another issue I've noted is integration with large models like GPT-4 for complex translation tasks. While I'm still employing version 3.5, it has its limitations and does not always read the previous context, leading to potentially inaccurate translations.
00:29:48.200 To provide a solution for this, I plan to implement a more contextual architecture to integrate better with prior text structures.
00:30:40.000 In the concluding part, I want to borrow some ideas from Java's Option Data Library. They have excellent functionality for combining translations side by side.
00:32:45.919 As for EPUB processing, I evaluated various libraries and considered their current maintenance status. For XML processing, I believe the main libraries to focus on are Nokogiri and REXML.
00:33:19.000 I chose certain paths for managing dependencies while ensuring functionality across platforms, which informs my decision-making.
00:34:14.560 To conclude my talk, I’d like to focus on my future plans, including support for alternative models and addressing integration challenges.
00:34:56.040 I appreciate your attention, and if you have any questions or thoughts, please feel free to ask!
00:35:41.560 Thank you! Does anyone have questions for Mark?
00:36:07.000 Participant: For the gem to work properly, does the source material need to be in a specific format? Mark: Yes, it requires a proper structure. It only supports text-based formats, not images.
00:37:07.760 Mark responds to other audience inquiries. Thank you for your participation. We are now moving to take a tea break and will reconvene before 3:30 PM.