Summarized using AI

Do LLMs dream of Type Inference?

Shunsuke "Kokuyou" Mori • November 15, 2024 • Chicago, IL • Talk

The video titled "Do LLMs dream of Type Inference?" presented by Shunsuke "Kokuyou" Mori at RubyConf 2024 focuses on the potential of Large Language Models (LLMs) to assist in type inference for Ruby code. The session highlights the development and evaluation of RBS Goose, an LLM-based type inference tool, that aims to redefine how type inference is performed compared to traditional methods.

Key points discussed in the session include:

- Introduction to Type Inference: The speaker explains the significance of type inference in programming, emphasizing the challenges posed by dynamic typing in Ruby, where types are determined at runtime rather than compile time.

- Limitations of Traditional Approaches: Traditional static type checking often fails to recognize certain dynamic elements of Ruby, making it difficult to identify type errors before code execution. Examples highlight common pitfalls in type inference using algorithms that rely solely on data flow analysis, lacking human-like intuition.

- RBS Goose Development: The presentation details how RBS Goose infers Ruby's type definitions from code without explicit type annotations. Using LLMs like ChatGPT, RBS Goose replaces unknown or untyped elements in code with inferred concrete types by analyzing context and conventions, which aligns more with human understanding.
- Evaluation Challenges: The speaker acknowledges that while RBS Goose has shown improved performance over traditional methods in several cases, the lack of standardized metrics makes it challenging to assess its effectiveness comprehensively. Currently, evaluations are manual, making it difficult to track improvements over time.

- Benchmark Proposal - TypeEvalRb: Discussions include plans for establishing a more detailed evaluation methodology, inspired by previous studies in type inference, such as Styper for Ruby and Type EV Pi for Python. The proposed benchmarks will help quantify the performance of type inference tools like RBS Goose, focusing on the comparison of expected and inferred types.

- Future Directions: The session concludes with the speaker expressing optimism about refining the evaluation process for RBS Goose. By combining insights from past studies with new benchmarks, there is potential for significant advancements in LLM-based type inference tools in Ruby.

Overall, the talk provides an insightful overview of how Large Language Models can enhance type inference, addressing both current successes and the need for improved evaluation frameworks to measure progress in this innovative field.

Do LLMs dream of Type Inference?
Shunsuke "Kokuyou" Mori • November 15, 2024 • Chicago, IL • Talk

Large Language Models (LLMs) have been rapidly evolving in recent years. They offer a completely different approach to challenges that were difficult to solve with classical algorithms.

In this session, I will describe how RBS Goose, an LLM-based type inference tool I am currently developing, works and its current evaluation.
For future improvement of RBS Goose, a more detailed evaluation mechanism is needed. In this regard, we will share ideas about TypeEvalRb, a type inference benchmark, by referring to examples from previous studies.

RubyConf 2024

00:00:15.839 yes let's start I'm shunori from Lina Technologies please call me Kyo it's my
00:00:22.160 online handle I'm excited to be Ruby K to talk about llm B type inference
00:00:30.800 llm stands for large language models such as chat
00:00:35.920 GPT can LM be use for typ reference and
00:00:41.039 can R determine how capable it is to answer those questions let's dive into
00:00:47.800 our session do l m streams of type
00:00:54.160 inference now some of you might be wondering why use LS for hyp inference
00:01:00.680 in the first place let's explore that with a simple
00:01:06.880 example in this code there is a bad class and the dark and good classes that
00:01:13.759 inherit from it and there's a Mak sound method which takes an argument named
00:01:21.079 bad we call the Mak sound method with both a dark object and a goose
00:01:27.400 object then what is the argument type of Mak sound
00:01:34.159 method the tral method to type refence is algorithmic I'll focus on data from
00:01:40.840 analysis here it analyes call to the me sound method and see that it's C with both
00:01:49.840 dark object and a goose object therefore it INF argument type is a union of dark
00:01:56.680 and Goose logically it this makes perfect sense but would you think the same way
00:02:05.439 probably not you might think have thought of it
00:02:11.560 this way oh there is a bad class and the
00:02:16.599 argument name is bad so perhaps the argument type is bad object that's
00:02:24.720 all this is a holistic approach using intuition and Contex rather than strict
00:02:31.879 logic and it of align better with how human actually write and understand
00:02:39.440 code so algorithms are great at logic but can miss the N of human WR
00:02:47.680 code llm such as chat GPT offers a
00:02:52.760 different approach then use fistic and context much like a human developer
00:03:00.360 this may bridge the gap between logic and intuition in type inference now can we actually use l m
00:03:09.159 for type inference as an example of using l m for
00:03:15.560 type inference I developed RVs goose and presented it at previous R
00:03:21.560 Ki it infers RVs type definition from Ruby code using
00:03:27.280 LM I named it a goose sounds like guess
00:03:32.400 because its behavior is more like an intuitive guess than a logical
00:03:39.120 inference not that the bad in the logo is not a goose but a
00:03:45.920 duck but this duck works like a goose so it can be used in the logo instead of
00:03:52.400 goose and you know this is called Goose typing and of course this a joke
00:04:00.680 I de gra let's get back to talk about type inference currently I have confirmed
00:04:07.760 that RBS Goods can generate RVs type definitions for some small rubby
00:04:12.959 codes however the overall performance of the AR Go is
00:04:17.959 unclear because there is no clear metrics to evate the performance of
00:04:23.360 typing FS this will be a problem for future improvements when I make make a change
00:04:31.039 how do I determine if the performance has improved so we need some metri of type
00:04:38.720 inance performance to see where RBS goose is now and to make sure that
00:04:44.360 changes are going correctly today I will explain how RVs
00:04:52.120 good works with l m and evaluate it already at this point RBS go show better
00:05:00.000 than traditional methods in several cases then I will share the idea of a
00:05:06.560 type inference Benchmark that I'm planning to create for more detailed evaluation for this I refer to several
00:05:14.440 previous studies the outline for this session is
00:05:20.319 here first we will review the basics of type system and type
00:05:25.759 inference next I will talk about the type AR gos I
00:05:31.039 developed it's architecture for typing fin using LM and its current performance
00:05:36.479 evaluation results after that I will focus on the
00:05:41.560 evolution method used in previous studies as a reference for a more detailed
00:05:47.000 evaluation finally based on this I will share my idea for type EV RB a ruby type
00:05:53.800 inference Benchmark now let's start by reviewing
00:05:59.680 the basics of type system and type
00:06:04.720 inference a type system is a mechanism for classifying the components of a program such as strings and
00:06:12.720 numbers Ruby is a dynamic type of the langage this means that types are
00:06:19.120 determined only at run time for example 1 + a raises a typ
00:06:26.520 layer at run time furthermore when this expression is rra in a false
00:06:33.840 the type eror is not R because 1 plus a is never
00:06:39.639 evaluated to the human eye the problem with this code is a
00:06:45.319 bias is it possible to find this problem on Ci or IDE before
00:06:54.039 execution St type checking is a mechanism to find such type ER before execution
00:07:01.080 to do this we need to know the type of each part in the code without executing
00:07:07.759 it unlike C or other langages Ruby does not use type annotations in each code so
00:07:16.440 some device is required one method is to describe the
00:07:23.120 type in RBS language to describe Ruby types in this case St tip can be used
00:07:30.639 for static type checking alternatively solve with RBI or
00:07:36.720 RDL can be used in the same way but are not covered in this
00:07:43.560 session as example let's look at the case of 1 plus
00:07:49.240 a we can detect type error if we know the types in three parts one is an
00:07:56.800 integer a is a string and and plus method of integer cannot accept a
00:08:03.800 string the first two are overos and the plus method type for integer is defined
00:08:10.280 in the RVs in the standard Library according to this type
00:08:16.400 definition the plus method of integer can accept an integer but not a
00:08:22.599 string taken together we can determine that this code coordinates a type error
00:08:28.080 without running it this is how static check type checking
00:08:35.039 works by the way it is hard to declare for every part in the
00:08:40.440 code to reduce this workload type inference is used type inference is a mechanism that
00:08:48.720 assist static type checking it inference the type of code without explicit type
00:08:55.920 annotations it is primar used in static Ty check to INF types that are overused
00:09:01.240 from Context such as local variables for another purpose it be used
00:09:08.040 to generate type definition for Ruby code with no type
00:09:13.120 definitions type Pro is a type inference tool that can be used for this
00:09:19.200 purpose it uses data flow analysis for INF types by tracking the data flow in
00:09:25.519 variable assignments and Method calls
00:09:30.880 data flow analysis works well in most cases however in some tricky cases the
00:09:37.880 result Will Made different from Human expectation consider the b class example
00:09:44.880 again in this case as the argument type of Mi sound human expect the generalized
00:09:52.760 type bad class however data form analysis tracks the type that actually appear in the
00:09:59.440 code and infa Union of sub types dck and goods this is different from
00:10:08.560 expectation as another example static analysis including data for analysis is
00:10:14.720 not good at Dynamic definitions this code uses defin method
00:10:20.839 to Define print F and print bar static analysis only loads the class
00:10:26.600 structure without running code So Def method is not
00:10:31.640 executed as a result the method print F and print bar are not recognized and in
00:10:38.279 first fails since Ruby is a many Dynamic
00:10:43.320 elements this will be a majure challenge now we will review the type
00:10:50.760 system and type inference so it is time to talk about typing fance using
00:10:56.200 LMS here I will share about RBS go I
00:11:02.160 developed our goose is a type inference tool using LM it takes rubby code without type
00:11:09.639 definitions and generate RVs type definitions this is almost the same
00:11:15.839 input and output as type Pro however the mechanism of type inference is very
00:11:23.320 different before explaining how our BAS Goose works let us briefly discuss llm M
00:11:29.839 the large language models llm is a language model
00:11:36.320 pre-trained on large data sets I am sure you all know about chat
00:11:43.360 GPT chat GPT is an El TR on conversational data and is adap and
00:11:49.560 answering questions let's look at a very simple
00:11:54.839 example the user enters input text answer in in one line about what ruby
00:12:01.200 and python have in common this input Tex is called a
00:12:08.360 prompt chart GPT will respond to this and gen Tex explaining the similarities
00:12:14.399 between Ruby and python it is very simple but as natural
00:12:19.720 and Powerful as if you were talking to about a human
00:12:25.800 being there are several techniques to make better LM outputs F short prompting is a technique
00:12:33.000 in which a prompt is embedded with example similar to the Target programs suppose you want to know the
00:12:40.639 color code for blue simply ask what is the color code for blue the L will
00:12:47.839 respond with a summary of a various color coded
00:12:53.000 representations in the F short prompt include examples instead question right answer sh ff0000
00:13:03.480 0 and question BL this alls the LM to understand that
00:13:11.079 the answer format is RGB evening the answer sh 000000
00:13:18.279 FF th fion proin can be used to define the format of response and suggest
00:13:25.279 inference method to the LM now I have finished explaining LM so
00:13:33.079 let's take a look how RS goes in first types with LM this figure is the architecture of AR
00:13:41.320 Go type inference I will explain component in
00:13:47.040 turn first RBS G runs RBS prototype
00:13:52.199 command to create a RVs type definition template for Target D code
00:13:59.199 in this template most types are untyped meaning that the type is
00:14:06.279 unknown next Lo the prepared examples each example consists of three
00:14:13.240 files a ruby code A RVs template and unexpected RVs type
00:14:21.079 definitions the expected RBS type definition is almost the same from as the RVs template with untyped replac
00:14:29.720 with the concrete types then these are combined to
00:14:35.160 construct The Prompt The Prompt contain the input and
00:14:40.279 output examples first following that the Target recode
00:14:45.800 and the Aras template are provided Ln for the given examples and
00:14:53.720 replace the an type in the Target RBS template with a concrete type
00:15:00.920 as a result the output of the LM looks like this the obvious type definition is
00:15:07.639 output in markdown and RVs Goods pass it and output into a
00:15:13.360 file this is how the RBS good inal
00:15:18.920 types there are a few key points of the prompt constructions llms are trained with huge
00:15:26.759 amount of data but contain few definitions of RVs types for this reason
00:15:33.759 llm is not good at generating RVs type definitions from
00:15:39.360 scratch to address this I provide an RVs template generated by RB protype
00:15:46.399 command this Define the type inference problem as a f in the Brank problem that
00:15:53.000 replaces untyped in the front template with concrete types as another Point fot prompting
00:16:01.199 controls the output format to pass easily it also provide examples of how
00:16:07.839 special syntax such as attribute reader should be handled in
00:16:14.600 RBS now let's see that type inference can be performed with RBS
00:16:20.720 Goods how about the examples of Bad Duck and goose with RVs Goose the argument type
00:16:28.720 of make sound is infa to be the bad class yes it works as
00:16:35.920 expected interestingly the Ruby code provided to RBS go doesn't include any
00:16:42.440 CES to make sound method so there is no explicit
00:16:47.839 relationship in the code between the Mi sound method and the bir
00:16:53.120 class nevertheless RVs good iner the b class as an argument type for Mi sound
00:17:01.880 methods this is probably due to the understanding in the convention of using
00:17:08.039 the same name as the class names for variable
00:17:14.079 names as another example let's look at the case of dynamic
00:17:19.120 definitions in this code print who and print bar are defined using defined
00:17:25.000 method as before the code calling this method is not
00:17:30.640 provided even so RBS good finds the definition of print F and print B and
00:17:37.520 correctly INF the written types from the result the LM correctly
00:17:45.000 understand the concept of dynamic definitions and what method are defined according to the
00:17:51.880 context l m is vely to outperformance transition methods in this area
00:18:00.320 so is type inference in by L M perfect not
00:18:06.400 quite this example Define the call method that takes a pro meth PL
00:18:13.039 object type Pro correctly in fact the pro Types on the other hand RVs Goods make a
00:18:20.280 mistake in the grammar of the pro types in these cases where Ruby or RVs
00:18:27.200 specific syntax is used the LM often makes
00:18:33.559 mistakes Ruby and RBS have various grammar elements and there are also
00:18:38.880 framework such rays that are widely used and have unique idioms to reveal the full performance of
00:18:46.600 type inference using LM it is necessary to evaluate on a variety test of
00:18:52.960 cases however only a few test cases are being evaluated at this time
00:18:59.960 currently all evaluations are done manually not standards for selecting
00:19:05.760 test cases and the evaluation result is visually checked for difference from the
00:19:11.120 expected types there are limitation to manual
00:19:16.159 evaluation this will be a pro problem for future improvements when I make a change how do
00:19:23.640 I determine if the performance has improved so we we need better method to
00:19:30.120 evaluate typing fence for reference let's look at how previous Studies have
00:19:35.840 evaluated this so far I shared how our BAS Goods
00:19:41.679 works at the current evaluation next we will survey how previous studies evaluate the
00:19:47.840 performance of typing fence while there are many previous
00:19:54.480 studies this session will focus on two the first is evalution method of s typ
00:20:03.120 which is typing forence to in Ruby and the second is type P which is
00:20:10.480 Type in micro Benchmark in Python let's start with the
00:20:17.600 first s typer is a rubby type inference tool that uses a constraint based
00:20:24.720 approach this tool is built on a ruby static type Checker called RDL which is
00:20:31.400 incompatible with the RBS we are currently working on what is important to us is how they
00:20:40.480 evaluate the performance of type inference this study Compares expected
00:20:46.760 and infant types for each argument return values and variables then for each the comparison
00:20:54.760 results are classified into three categories when both are exactly the same it is
00:21:02.600 classified as match if the best class match it but the
00:21:07.679 parameters are different classify as much up to parameter finally if the b class is not
00:21:15.480 matches it is classified as different this number of matches can be
00:21:22.279 used as metrix of typ inface
00:21:27.520 performance to evaluate this we need test data with the expected
00:21:33.679 types the study used four raise application with types described in
00:21:39.760 RDL and for libraries with extensive y
00:21:47.200 documentations a graph of the evolution results is shown here this graph shows
00:21:53.240 the result of the four approaches side by side the blue area is is the number of
00:21:59.400 matches and the right blue area is the number of matches up to
00:22:05.320 parameter we can determine that the approaches which has a large Blue Area
00:22:11.080 is performed better such metrics would be useful for quantitative
00:22:19.720 comparisons I wonder if this evation tool could be used for RBS Goods as well
00:22:25.159 and reflection data is provided however as BM image yes it's n
00:22:32.760 gab this is not open source and cannot be fed or
00:22:38.080 modified furthermore since it is an RDL based tool it could not be used directly
00:22:44.919 in RVs so even with Ruby type INF research
00:22:52.840 I found that the artifact were not always used directory for our case no
00:22:58.559 now less than as to previous studies in other languages type EV Pi is a micro
00:23:05.960 Benchmark for typing fence in python as the name micro Benchmark
00:23:12.039 suggests this tool covers small test of cases categorized by grammatical element
00:23:17.559 and other factors the evalution method is similar
00:23:22.880 to type Sim typ evaluation it Compares expected and INF type for each
00:23:29.520 parameters return values and variables however it covers only exact
00:23:39.080 matches now let's look at the retail of the test of
00:23:44.120 cases under the test case directory they are categorized by python feature such
00:23:49.960 as ARS assignments classes decorators and so
00:23:56.400 on each test case cons consist of a small python code and a Json file which
00:24:02.440 describes the expected types it's very
00:24:08.559 simple the result of this Benchmark are output in various format here the table
00:24:15.159 are shown AG aggregated by categories scar pair in the rightmost is
00:24:22.679 the name of the inference tool being evaluated in the build categories
00:24:29.039 the results are is zero this shows that scarer is not good at grammars in the B
00:24:37.240 category th micro benchmarks King can visualize the strength and weakness of
00:24:43.799 each inference method by classifying test of
00:24:49.559 cases now let's review what we learned from these previous
00:24:55.080 studies the evolution in both studies compare the expected and INF types for each
00:25:01.679 argument return vales and variables the number of mates can be
00:25:07.520 used as Matrix to evalate the type inference performance there are two types of test
00:25:15.080 data for evaluation using real world data such as
00:25:20.919 libraries and applications media practical performance on the other hand using
00:25:27.640 micro benchmark data based on grammar elements and other factors it allow us
00:25:33.279 to clarify the strength and weakness of each method each of these has its own
00:25:39.399 advantages and should be used for different
00:25:45.240 purpose we have looked at two previous studies using these previous studies as
00:25:50.799 a guide I am working on type RB a ruby type inference Benchmark to evaluate the
00:25:56.960 performance of RBS School still I work in progress but I will
00:26:02.760 share my ideas I am planning to build typ RB with
00:26:11.440 an architecture similar to previous studies differently I am plan to use RBS
00:26:18.399 for both expected and infant types the evaluation process is shown in
00:26:24.120 this figure prepare Ruby code and expected
00:26:29.200 RBS type definitions as test of data and generate typing for RBS with RBS
00:26:35.840 gos the infal RVs and expected RVs are compared and the number of matches is
00:26:42.679 counted as Benchmark results how does we construct the infer
00:26:49.840 type uh the infer RBS type with the expected I types I consider building a comparison
00:26:57.480 three using RBS environment the environment roller can
00:27:03.200 be used to rad the RBS for a given pass however the number of cross de or
00:27:11.000 is over 300 because STD leave and other are also loaded in this
00:27:17.840 environment to extract only Goose class we can use hash access to crass
00:27:25.880 Deckers the good class entry we this here so the is this is because clust
00:27:33.520 entry has a lot of information however only a small part of
00:27:39.240 the information is needed for the comparison
00:27:45.200 tree the comparison three construction process has been implemented shown
00:27:50.399 here I take only the classes defined in the Target RVs and create a type node
00:27:57.600 for each instance variables or method parameter and return return vales to
00:28:02.760 compare this is a work in progress and the Mi sound method of the object class
00:28:08.360 is missing however it seems likely that the construction of a comparison three could
00:28:15.240 be possible when it is all done we all have to do is to Traverse the comparison
00:28:22.159 three count the number matches and that is The Benchmark result
00:28:30.240 next what should we do use for test data micro benchmarking and real world
00:28:37.840 data each have their own advantages so both should be prepared
00:28:43.559 and us it according to the purpose like type EV Pi micro Benchmark
00:28:49.720 data is categorized by grammatical elements using this test data a detailed
00:28:55.440 evaluation that shows the strengths and weakness of each method should be
00:29:01.240 possible for real world data it would be better to for typ of evaluation and use
00:29:08.159 Ruby libraries and raise application with arious type de
00:29:14.200 definitions and evalation using this data should be able to measure practical
00:29:23.159 performance since the preparation of micro Benchmark data is work intensive
00:29:28.320 so I am exploring the possibility of using the GitHub compiled workspace a
00:29:33.720 task automation agent my R testing has not it very good
00:29:39.440 details but it is worth it to adjust the prompts once the construction three
00:29:45.799 construction is implemented I will get to work on this area
00:29:53.000 well I'm currently working on typ about RB in this direction
00:29:58.480 it is still in the early stages but I will gradually work on it as my private
00:30:05.840 project finally let's wrap up I shared how our B good works and
00:30:13.880 currently evaluation results this evaluation revealed that
00:30:18.919 llm provides better inference results than traditional methods in examples
00:30:24.399 such as generalization and dynamic definitions for more detailed evaluation we surveyed
00:30:31.600 how previous studies evaluates type inference performance this evaluation compare
00:30:37.240 expected and INF types and counted the number of matches for
00:30:42.840 metrix in addition both micro benchmarks and real world data were seen as test
00:30:49.039 data and both seems useful based on these previous studies I
00:30:55.600 shared the idea of type but be a type inference Benchmark for Ruby and
00:31:02.080 RBS this development is in the early stages but hopefully it will reveal the
00:31:07.720 type INF performance of LM and can be used for future improvements toward the
00:31:13.639 realization realization of a typing that's all thank you for
Explore all talks recorded at RubyConf 2024
+64