00:00:15.839
yes let's start I'm shunori from Lina Technologies please call me Kyo it's my
00:00:22.160
online handle I'm excited to be Ruby K to talk about llm B type inference
00:00:30.800
llm stands for large language models such as chat
00:00:35.920
GPT can LM be use for typ reference and
00:00:41.039
can R determine how capable it is to answer those questions let's dive into
00:00:47.800
our session do l m streams of type
00:00:54.160
inference now some of you might be wondering why use LS for hyp inference
00:01:00.680
in the first place let's explore that with a simple
00:01:06.880
example in this code there is a bad class and the dark and good classes that
00:01:13.759
inherit from it and there's a Mak sound method which takes an argument named
00:01:21.079
bad we call the Mak sound method with both a dark object and a goose
00:01:27.400
object then what is the argument type of Mak sound
00:01:34.159
method the tral method to type refence is algorithmic I'll focus on data from
00:01:40.840
analysis here it analyes call to the me sound method and see that it's C with both
00:01:49.840
dark object and a goose object therefore it INF argument type is a union of dark
00:01:56.680
and Goose logically it this makes perfect sense but would you think the same way
00:02:05.439
probably not you might think have thought of it
00:02:11.560
this way oh there is a bad class and the
00:02:16.599
argument name is bad so perhaps the argument type is bad object that's
00:02:24.720
all this is a holistic approach using intuition and Contex rather than strict
00:02:31.879
logic and it of align better with how human actually write and understand
00:02:39.440
code so algorithms are great at logic but can miss the N of human WR
00:02:47.680
code llm such as chat GPT offers a
00:02:52.760
different approach then use fistic and context much like a human developer
00:03:00.360
this may bridge the gap between logic and intuition in type inference now can we actually use l m
00:03:09.159
for type inference as an example of using l m for
00:03:15.560
type inference I developed RVs goose and presented it at previous R
00:03:21.560
Ki it infers RVs type definition from Ruby code using
00:03:27.280
LM I named it a goose sounds like guess
00:03:32.400
because its behavior is more like an intuitive guess than a logical
00:03:39.120
inference not that the bad in the logo is not a goose but a
00:03:45.920
duck but this duck works like a goose so it can be used in the logo instead of
00:03:52.400
goose and you know this is called Goose typing and of course this a joke
00:04:00.680
I de gra let's get back to talk about type inference currently I have confirmed
00:04:07.760
that RBS Goods can generate RVs type definitions for some small rubby
00:04:12.959
codes however the overall performance of the AR Go is
00:04:17.959
unclear because there is no clear metrics to evate the performance of
00:04:23.360
typing FS this will be a problem for future improvements when I make make a change
00:04:31.039
how do I determine if the performance has improved so we need some metri of type
00:04:38.720
inance performance to see where RBS goose is now and to make sure that
00:04:44.360
changes are going correctly today I will explain how RVs
00:04:52.120
good works with l m and evaluate it already at this point RBS go show better
00:05:00.000
than traditional methods in several cases then I will share the idea of a
00:05:06.560
type inference Benchmark that I'm planning to create for more detailed evaluation for this I refer to several
00:05:14.440
previous studies the outline for this session is
00:05:20.319
here first we will review the basics of type system and type
00:05:25.759
inference next I will talk about the type AR gos I
00:05:31.039
developed it's architecture for typing fin using LM and its current performance
00:05:36.479
evaluation results after that I will focus on the
00:05:41.560
evolution method used in previous studies as a reference for a more detailed
00:05:47.000
evaluation finally based on this I will share my idea for type EV RB a ruby type
00:05:53.800
inference Benchmark now let's start by reviewing
00:05:59.680
the basics of type system and type
00:06:04.720
inference a type system is a mechanism for classifying the components of a program such as strings and
00:06:12.720
numbers Ruby is a dynamic type of the langage this means that types are
00:06:19.120
determined only at run time for example 1 + a raises a typ
00:06:26.520
layer at run time furthermore when this expression is rra in a false
00:06:33.840
the type eror is not R because 1 plus a is never
00:06:39.639
evaluated to the human eye the problem with this code is a
00:06:45.319
bias is it possible to find this problem on Ci or IDE before
00:06:54.039
execution St type checking is a mechanism to find such type ER before execution
00:07:01.080
to do this we need to know the type of each part in the code without executing
00:07:07.759
it unlike C or other langages Ruby does not use type annotations in each code so
00:07:16.440
some device is required one method is to describe the
00:07:23.120
type in RBS language to describe Ruby types in this case St tip can be used
00:07:30.639
for static type checking alternatively solve with RBI or
00:07:36.720
RDL can be used in the same way but are not covered in this
00:07:43.560
session as example let's look at the case of 1 plus
00:07:49.240
a we can detect type error if we know the types in three parts one is an
00:07:56.800
integer a is a string and and plus method of integer cannot accept a
00:08:03.800
string the first two are overos and the plus method type for integer is defined
00:08:10.280
in the RVs in the standard Library according to this type
00:08:16.400
definition the plus method of integer can accept an integer but not a
00:08:22.599
string taken together we can determine that this code coordinates a type error
00:08:28.080
without running it this is how static check type checking
00:08:35.039
works by the way it is hard to declare for every part in the
00:08:40.440
code to reduce this workload type inference is used type inference is a mechanism that
00:08:48.720
assist static type checking it inference the type of code without explicit type
00:08:55.920
annotations it is primar used in static Ty check to INF types that are overused
00:09:01.240
from Context such as local variables for another purpose it be used
00:09:08.040
to generate type definition for Ruby code with no type
00:09:13.120
definitions type Pro is a type inference tool that can be used for this
00:09:19.200
purpose it uses data flow analysis for INF types by tracking the data flow in
00:09:25.519
variable assignments and Method calls
00:09:30.880
data flow analysis works well in most cases however in some tricky cases the
00:09:37.880
result Will Made different from Human expectation consider the b class example
00:09:44.880
again in this case as the argument type of Mi sound human expect the generalized
00:09:52.760
type bad class however data form analysis tracks the type that actually appear in the
00:09:59.440
code and infa Union of sub types dck and goods this is different from
00:10:08.560
expectation as another example static analysis including data for analysis is
00:10:14.720
not good at Dynamic definitions this code uses defin method
00:10:20.839
to Define print F and print bar static analysis only loads the class
00:10:26.600
structure without running code So Def method is not
00:10:31.640
executed as a result the method print F and print bar are not recognized and in
00:10:38.279
first fails since Ruby is a many Dynamic
00:10:43.320
elements this will be a majure challenge now we will review the type
00:10:50.760
system and type inference so it is time to talk about typing fance using
00:10:56.200
LMS here I will share about RBS go I
00:11:02.160
developed our goose is a type inference tool using LM it takes rubby code without type
00:11:09.639
definitions and generate RVs type definitions this is almost the same
00:11:15.839
input and output as type Pro however the mechanism of type inference is very
00:11:23.320
different before explaining how our BAS Goose works let us briefly discuss llm M
00:11:29.839
the large language models llm is a language model
00:11:36.320
pre-trained on large data sets I am sure you all know about chat
00:11:43.360
GPT chat GPT is an El TR on conversational data and is adap and
00:11:49.560
answering questions let's look at a very simple
00:11:54.839
example the user enters input text answer in in one line about what ruby
00:12:01.200
and python have in common this input Tex is called a
00:12:08.360
prompt chart GPT will respond to this and gen Tex explaining the similarities
00:12:14.399
between Ruby and python it is very simple but as natural
00:12:19.720
and Powerful as if you were talking to about a human
00:12:25.800
being there are several techniques to make better LM outputs F short prompting is a technique
00:12:33.000
in which a prompt is embedded with example similar to the Target programs suppose you want to know the
00:12:40.639
color code for blue simply ask what is the color code for blue the L will
00:12:47.839
respond with a summary of a various color coded
00:12:53.000
representations in the F short prompt include examples instead question right answer sh ff0000
00:13:03.480
0 and question BL this alls the LM to understand that
00:13:11.079
the answer format is RGB evening the answer sh 000000
00:13:18.279
FF th fion proin can be used to define the format of response and suggest
00:13:25.279
inference method to the LM now I have finished explaining LM so
00:13:33.079
let's take a look how RS goes in first types with LM this figure is the architecture of AR
00:13:41.320
Go type inference I will explain component in
00:13:47.040
turn first RBS G runs RBS prototype
00:13:52.199
command to create a RVs type definition template for Target D code
00:13:59.199
in this template most types are untyped meaning that the type is
00:14:06.279
unknown next Lo the prepared examples each example consists of three
00:14:13.240
files a ruby code A RVs template and unexpected RVs type
00:14:21.079
definitions the expected RBS type definition is almost the same from as the RVs template with untyped replac
00:14:29.720
with the concrete types then these are combined to
00:14:35.160
construct The Prompt The Prompt contain the input and
00:14:40.279
output examples first following that the Target recode
00:14:45.800
and the Aras template are provided Ln for the given examples and
00:14:53.720
replace the an type in the Target RBS template with a concrete type
00:15:00.920
as a result the output of the LM looks like this the obvious type definition is
00:15:07.639
output in markdown and RVs Goods pass it and output into a
00:15:13.360
file this is how the RBS good inal
00:15:18.920
types there are a few key points of the prompt constructions llms are trained with huge
00:15:26.759
amount of data but contain few definitions of RVs types for this reason
00:15:33.759
llm is not good at generating RVs type definitions from
00:15:39.360
scratch to address this I provide an RVs template generated by RB protype
00:15:46.399
command this Define the type inference problem as a f in the Brank problem that
00:15:53.000
replaces untyped in the front template with concrete types as another Point fot prompting
00:16:01.199
controls the output format to pass easily it also provide examples of how
00:16:07.839
special syntax such as attribute reader should be handled in
00:16:14.600
RBS now let's see that type inference can be performed with RBS
00:16:20.720
Goods how about the examples of Bad Duck and goose with RVs Goose the argument type
00:16:28.720
of make sound is infa to be the bad class yes it works as
00:16:35.920
expected interestingly the Ruby code provided to RBS go doesn't include any
00:16:42.440
CES to make sound method so there is no explicit
00:16:47.839
relationship in the code between the Mi sound method and the bir
00:16:53.120
class nevertheless RVs good iner the b class as an argument type for Mi sound
00:17:01.880
methods this is probably due to the understanding in the convention of using
00:17:08.039
the same name as the class names for variable
00:17:14.079
names as another example let's look at the case of dynamic
00:17:19.120
definitions in this code print who and print bar are defined using defined
00:17:25.000
method as before the code calling this method is not
00:17:30.640
provided even so RBS good finds the definition of print F and print B and
00:17:37.520
correctly INF the written types from the result the LM correctly
00:17:45.000
understand the concept of dynamic definitions and what method are defined according to the
00:17:51.880
context l m is vely to outperformance transition methods in this area
00:18:00.320
so is type inference in by L M perfect not
00:18:06.400
quite this example Define the call method that takes a pro meth PL
00:18:13.039
object type Pro correctly in fact the pro Types on the other hand RVs Goods make a
00:18:20.280
mistake in the grammar of the pro types in these cases where Ruby or RVs
00:18:27.200
specific syntax is used the LM often makes
00:18:33.559
mistakes Ruby and RBS have various grammar elements and there are also
00:18:38.880
framework such rays that are widely used and have unique idioms to reveal the full performance of
00:18:46.600
type inference using LM it is necessary to evaluate on a variety test of
00:18:52.960
cases however only a few test cases are being evaluated at this time
00:18:59.960
currently all evaluations are done manually not standards for selecting
00:19:05.760
test cases and the evaluation result is visually checked for difference from the
00:19:11.120
expected types there are limitation to manual
00:19:16.159
evaluation this will be a pro problem for future improvements when I make a change how do
00:19:23.640
I determine if the performance has improved so we we need better method to
00:19:30.120
evaluate typing fence for reference let's look at how previous Studies have
00:19:35.840
evaluated this so far I shared how our BAS Goods
00:19:41.679
works at the current evaluation next we will survey how previous studies evaluate the
00:19:47.840
performance of typing fence while there are many previous
00:19:54.480
studies this session will focus on two the first is evalution method of s typ
00:20:03.120
which is typing forence to in Ruby and the second is type P which is
00:20:10.480
Type in micro Benchmark in Python let's start with the
00:20:17.600
first s typer is a rubby type inference tool that uses a constraint based
00:20:24.720
approach this tool is built on a ruby static type Checker called RDL which is
00:20:31.400
incompatible with the RBS we are currently working on what is important to us is how they
00:20:40.480
evaluate the performance of type inference this study Compares expected
00:20:46.760
and infant types for each argument return values and variables then for each the comparison
00:20:54.760
results are classified into three categories when both are exactly the same it is
00:21:02.600
classified as match if the best class match it but the
00:21:07.679
parameters are different classify as much up to parameter finally if the b class is not
00:21:15.480
matches it is classified as different this number of matches can be
00:21:22.279
used as metrix of typ inface
00:21:27.520
performance to evaluate this we need test data with the expected
00:21:33.679
types the study used four raise application with types described in
00:21:39.760
RDL and for libraries with extensive y
00:21:47.200
documentations a graph of the evolution results is shown here this graph shows
00:21:53.240
the result of the four approaches side by side the blue area is is the number of
00:21:59.400
matches and the right blue area is the number of matches up to
00:22:05.320
parameter we can determine that the approaches which has a large Blue Area
00:22:11.080
is performed better such metrics would be useful for quantitative
00:22:19.720
comparisons I wonder if this evation tool could be used for RBS Goods as well
00:22:25.159
and reflection data is provided however as BM image yes it's n
00:22:32.760
gab this is not open source and cannot be fed or
00:22:38.080
modified furthermore since it is an RDL based tool it could not be used directly
00:22:44.919
in RVs so even with Ruby type INF research
00:22:52.840
I found that the artifact were not always used directory for our case no
00:22:58.559
now less than as to previous studies in other languages type EV Pi is a micro
00:23:05.960
Benchmark for typing fence in python as the name micro Benchmark
00:23:12.039
suggests this tool covers small test of cases categorized by grammatical element
00:23:17.559
and other factors the evalution method is similar
00:23:22.880
to type Sim typ evaluation it Compares expected and INF type for each
00:23:29.520
parameters return values and variables however it covers only exact
00:23:39.080
matches now let's look at the retail of the test of
00:23:44.120
cases under the test case directory they are categorized by python feature such
00:23:49.960
as ARS assignments classes decorators and so
00:23:56.400
on each test case cons consist of a small python code and a Json file which
00:24:02.440
describes the expected types it's very
00:24:08.559
simple the result of this Benchmark are output in various format here the table
00:24:15.159
are shown AG aggregated by categories scar pair in the rightmost is
00:24:22.679
the name of the inference tool being evaluated in the build categories
00:24:29.039
the results are is zero this shows that scarer is not good at grammars in the B
00:24:37.240
category th micro benchmarks King can visualize the strength and weakness of
00:24:43.799
each inference method by classifying test of
00:24:49.559
cases now let's review what we learned from these previous
00:24:55.080
studies the evolution in both studies compare the expected and INF types for each
00:25:01.679
argument return vales and variables the number of mates can be
00:25:07.520
used as Matrix to evalate the type inference performance there are two types of test
00:25:15.080
data for evaluation using real world data such as
00:25:20.919
libraries and applications media practical performance on the other hand using
00:25:27.640
micro benchmark data based on grammar elements and other factors it allow us
00:25:33.279
to clarify the strength and weakness of each method each of these has its own
00:25:39.399
advantages and should be used for different
00:25:45.240
purpose we have looked at two previous studies using these previous studies as
00:25:50.799
a guide I am working on type RB a ruby type inference Benchmark to evaluate the
00:25:56.960
performance of RBS School still I work in progress but I will
00:26:02.760
share my ideas I am planning to build typ RB with
00:26:11.440
an architecture similar to previous studies differently I am plan to use RBS
00:26:18.399
for both expected and infant types the evaluation process is shown in
00:26:24.120
this figure prepare Ruby code and expected
00:26:29.200
RBS type definitions as test of data and generate typing for RBS with RBS
00:26:35.840
gos the infal RVs and expected RVs are compared and the number of matches is
00:26:42.679
counted as Benchmark results how does we construct the infer
00:26:49.840
type uh the infer RBS type with the expected I types I consider building a comparison
00:26:57.480
three using RBS environment the environment roller can
00:27:03.200
be used to rad the RBS for a given pass however the number of cross de or
00:27:11.000
is over 300 because STD leave and other are also loaded in this
00:27:17.840
environment to extract only Goose class we can use hash access to crass
00:27:25.880
Deckers the good class entry we this here so the is this is because clust
00:27:33.520
entry has a lot of information however only a small part of
00:27:39.240
the information is needed for the comparison
00:27:45.200
tree the comparison three construction process has been implemented shown
00:27:50.399
here I take only the classes defined in the Target RVs and create a type node
00:27:57.600
for each instance variables or method parameter and return return vales to
00:28:02.760
compare this is a work in progress and the Mi sound method of the object class
00:28:08.360
is missing however it seems likely that the construction of a comparison three could
00:28:15.240
be possible when it is all done we all have to do is to Traverse the comparison
00:28:22.159
three count the number matches and that is The Benchmark result
00:28:30.240
next what should we do use for test data micro benchmarking and real world
00:28:37.840
data each have their own advantages so both should be prepared
00:28:43.559
and us it according to the purpose like type EV Pi micro Benchmark
00:28:49.720
data is categorized by grammatical elements using this test data a detailed
00:28:55.440
evaluation that shows the strengths and weakness of each method should be
00:29:01.240
possible for real world data it would be better to for typ of evaluation and use
00:29:08.159
Ruby libraries and raise application with arious type de
00:29:14.200
definitions and evalation using this data should be able to measure practical
00:29:23.159
performance since the preparation of micro Benchmark data is work intensive
00:29:28.320
so I am exploring the possibility of using the GitHub compiled workspace a
00:29:33.720
task automation agent my R testing has not it very good
00:29:39.440
details but it is worth it to adjust the prompts once the construction three
00:29:45.799
construction is implemented I will get to work on this area
00:29:53.000
well I'm currently working on typ about RB in this direction
00:29:58.480
it is still in the early stages but I will gradually work on it as my private
00:30:05.840
project finally let's wrap up I shared how our B good works and
00:30:13.880
currently evaluation results this evaluation revealed that
00:30:18.919
llm provides better inference results than traditional methods in examples
00:30:24.399
such as generalization and dynamic definitions for more detailed evaluation we surveyed
00:30:31.600
how previous studies evaluates type inference performance this evaluation compare
00:30:37.240
expected and INF types and counted the number of matches for
00:30:42.840
metrix in addition both micro benchmarks and real world data were seen as test
00:30:49.039
data and both seems useful based on these previous studies I
00:30:55.600
shared the idea of type but be a type inference Benchmark for Ruby and
00:31:02.080
RBS this development is in the early stages but hopefully it will reveal the
00:31:07.720
type INF performance of LM and can be used for future improvements toward the
00:31:13.639
realization realization of a typing that's all thank you for