Talks
Simple Bayesian Networks with Ruby
Summarized using AI

Simple Bayesian Networks with Ruby

by Carl Youngblood

The video, "Simple Bayesian Networks with Ruby" presented by Carl Youngblood at the MountainWest RubyConf 2007, explores Bayesian networks and their application in Ruby programming. Youngblood starts by discussing the limitations of early AI systems based on first-order logic and emphasizes the importance of probability theory in addressing uncertainty in decision-making. He establishes a foundation in probability concepts, covering random variables, probability distributions, and crucial principles such as Bayes' rule and conditional independence, which are essential for understanding Bayesian networks.

Key points discussed include:

- Foundational Concepts: The presentation begins with the challenges faced by logic-based AI systems, leading to the need for probabilistic reasoning.

- Probability Theory: Youngblood explains key concepts such as propositions, random variables, and the axioms of probability, delineating how they apply to Bayesian networks.

- Bayes' Rule: The significance and application of Bayes' rule for updating beliefs based on new evidence are highlighted with practical examples, such as diagnosing cancer in patients.

- Bayesian Networks Structure: Youngblood introduces Bayesian networks as directed acyclic graphs that model relationships between random variables, elaborating on how to calculate joint distributions and perform inference.

- Implementation in Ruby: Concluding the presentation, he showcases his library, Simple Bayesian Networks (SBN), designed to facilitate Bayesian network creation and inference within Ruby applications. He provides a practical example of using the SBN library to model a grass-wetting scenario based on weather conditions and sprinkler usage.

- Future Enhancements: Youngblood shares his vision for future updates to the library, including better exact inference and support for continuous variables.

Youngblood's presentation offers a comprehensive overview of Bayesian networks, providing insights into both their theoretical foundations and practical implementations in programming, specifically Ruby. He stresses that while Bayesian networks involve complex mathematics, they can be practically applied for effective decision-making in uncertain situations, with the SBN library serving as an accessible tool for developers.

00:00:07.720 Okay, welcome! It's good to be here. My presentation today is on Bayesian networks and specifically some tools for using them in Ruby. As a day job, I crimp cables. If you weren't here earlier, you would have seen me doing that. I also like to work with Ruby. First, I'd like to set the stage by talking about some of the problems these networks solve and how they work.
00:00:30.480 Much early research in the field of artificial intelligence focused on expert systems in first-order logic. Although nowadays, there are much more sophisticated systems, most of these fledgling systems attempted to solve problems by storing numerous facts about their area of expertise and deducing the answers using predicate logic. For example, consider this first-order logic statement: "For all x, if x is a person, then x is mortal. Socrates is a person; therefore, Socrates is mortal."
00:01:01.399 As AI became commercialized in the early '80s, the sector initially had great success, but the so-called AI bubble burst towards the end of the decade as companies failed to deliver on their extravagant promises. Perhaps part of the appeal of this emphasis on logic was that it relied on things that computers were really good at, like storing a lot of information and quickly executing many operations. However, it has some profound limitations.
00:01:40.400 Consider the rule: "For all patients P, if patient P is experiencing a toothache, then P has the disease called cavity." This rule is problematic because not all patients with toothaches have cavities. You could try to fix this by changing the rule to accommodate all possible causes of toothaches, such as gingivitis, an abscess, or maybe a punch in the mouth. Unfortunately, the list of possible causes is nearly limitless. You could also reverse the causal order to say that cavities cause toothaches, but this isn't true either, as some cavities don't produce toothaches.
00:02:22.440 The problem with logical agents is that it's beyond our capability to make a complete list of all possible antecedents and consequences needed to ensure an exceptionless rule. It's also too difficult to use such a rule even if we had one. Even in cases that aren't beyond our capability, there's a point of diminishing returns after which it's just not worth spending any more time. In other words, just solving the problem isn't enough; we also have to solve it in a reasonable amount of time.
00:03:09.840 There are no useful areas of knowledge that have been completely mapped out by a grand unified theory of everything. In fact, Gödel's incompleteness theorem tells us that in any useful consistent formal system, such as first-order logic, there will always be statements that are neither provably true nor provably false. In other words, any useful system can never be both complete and consistent. Even if it were possible to know all the rules, we might be uncertain about a particular situation because we don't have the time to run all the necessary tests.
00:03:48.080 It's interesting to consider that people deal with uncertainty all the time. We're never really able to know anything with a likelihood of 100%. We are forced to take these leaps of faith every day. Just getting out of bed in the morning requires us to make a number of assumptions, whether we realize it or not. We employ these strategies in every decision we make; our knowledge can at best provide only a degree of belief in the possible outcomes for a given decision.
00:04:16.400 The main tool that experts use to quantify and express these degrees of belief is called probability theory. Stuart Russell and Peter Norvig, in their monumental text on artificial intelligence, articulate this well; they say that probability provides a way of summarizing the uncertainty that comes from our ignorance. Notably, I wouldn't be inclined to call it laziness since there are often good reasons why we cannot always obsessively optimize our decisions.
00:05:03.400 I hope you'll bear with me as I take a few minutes to go over some of the basic concepts in probability theory that are used in Bayesian networks. I think it will help lay a foundation for understanding them a little better. First, propositions are statements about degrees of belief or assertions that certain situations are the case. For instance, this proposition states that our level of belief that a cavity is false is 90%.
00:05:35.840 In this case, 'cavity' is referred to as a random variable, or some aspect of the problem space that can take on multiple states. Random variables can be Boolean, meaning they can only be set to true or false, discrete (meaning they can take on many different distinct states), or continuous (meaning they can take on any value from the real numbers). Each random variable has a number of different states it can take on; the list of all possible states for a random variable is called its domain.
00:06:17.120 The domain of the variable 'cavity' is true or false. There are a few fundamental pillars or axioms of probability upon which everything else rests. First, every probability must be between zero (something that is never true) and one (something that is always true). Second, which I basically just said, is that anything that is always true has a probability of one, and anything that is always false has a probability of zero.
00:06:55.760 Third, the probability that either one proposition or another is true is equal to the probability of the first plus the probability of the second minus the probability that they are both true together. Think of it like this: we have one event, we add it to the probability of the other event to get the probability that either event happened. However, we've counted the intersection—when both happen—twice, so we have to subtract that intersection once to get the correct probability.
00:07:46.440 From these axioms, we can also derive a few other truths. Clearly, A or not A is a certain event; either one or the other has to be true, but both can't be true. Thus, a shorter way of stating this is that the events are mutually exclusive. From axiom two, which states that a certain event equals one, we conclude that the probability of A or not A equals one since A and not A are mutually exclusive events.
00:08:26.440 We can use Axiom 3 without worrying about the possibility that both A and not A might be true simultaneously. Therefore, we can simplify the axiom to the probability of A plus the probability of not A equals the probability of A or not A, which we've just learned equals one. From this, we can also learn that subtracting any probability from one gives the inverse of that probability, or the probability that the proposition is not true.
00:09:06.320 Bear with me for just a few more minutes; I won't take much longer on this probability stuff. The prior or unconditional probability of something is the degree of belief given to it in the absence of any other information or based on information external to the problem space. For example, the proposition states that our prior belief about the likelihood of a cavity, before we gain any other information, is 10%.
00:09:55.760 This raises the subject of an interesting debate that has been ongoing for a long time among statisticians: Where do probabilities come from? The frequentist approach states that probabilities come from experience about the world. That is, we count all the times something has been true and divide it by the total number of samples to get its probability.
00:10:35.920 The Bayesian approach, however, expands probability theory by allowing us to reason about beliefs under conditions of uncertainty. It recognizes that there are outcomes we have never experienced that still hold some probability. For example, we can express a belief about the odds of Barack Obama winning the 2008 presidential election, even though it has never happened before. Furthermore, even events that have occurred may not have occurred frequently enough for us to accurately determine their real probability.
00:11:24.720 For this reason, probabilities are usually expressed as degrees of belief, and all are influenced by something called a subjective prior, whose effect diminishes the more observations you make. An interesting observation is that some cosmologists and philosophers have noted that the so-called laws of physics are probabilistic, and that things like time and the speed of light may not actually be constant throughout the universe.
00:12:17.520 Pragmatist philosophers like William James and Charles Peirce believe that truth is what works, and that physical laws are like well-worn pathways that only become so after a long Darwinian sifting process. This is similar to the idea of a subjective prior that gets modified as we gain more experience. Sometimes we want to talk about the probabilities of all possible values of a variable, which we call a probability distribution.
00:12:50.800 This example shows the prior probability distribution of the variable 'weather'. To shorten the notation, we can use a bold non-italicized P to indicate that what follows will be a vector of probabilities for all the states of the given variable, rather than a single probability. Furthermore, sometimes we want to talk about the combined probabilities of all the variables in a system; a joint probability distribution that covers this complete set is called the full joint probability distribution.
00:13:30.400 This is the full joint probability distribution for weather and cavity. The probability of each combination of states is found in the rightmost column. I've also included the prior probability of each state next to its name in parentheses, so you can compare them with the combined probability commonly known as the posterior or conditional probability, which I'll explain shortly. A typical full joint distribution will not list the prior next to each of those variables.
00:14:19.480 Notice that the sum of all the probabilities equals one. In a logical world, the sum of all possible atomic or mutually exclusive probabilities must equal one. Notice also that the probability of 'sunny' and 'cavity' occurring together, plus the probability of 'sunny' and 'not cavity' occurring together, equals the probability of 'sunny'. Whenever this happens, you know that the variables are independent of one another.
00:15:10.840 In this case, the behavior of the weather is not dependent on whether or not you have a cavity. The summing process I just described is called marginalization. Given the full joint probability distribution of a set of variables, we can determine the probability of individual variables by summing up all the cases where they hold.
00:15:57.160 This process got its name from the way actuaries would write up the sums of observed frequencies in the margins of insurance tables. So, in this case, we could calculate the probability of 'weather' equaling 'Sunny' by summing all the probabilities. When we calculate the probability of variable A by summing all the applicable probabilities of variable B, we say that B is marginalized out of the probability of A.
00:16:43.640 Here we have a generalized formula for the marginal probability of any probability distribution A from the joint probability distribution of A and B. As we saw in the table, you just sum up the probability of A given each state of B. Note that the vector notation refers to the whole distribution, not individual probabilities. Another important note is that the comma here is simply another way of indicating logical 'and'.
00:17:37.640 It is important to mention that the complexity of inference using the full joint distribution is exponential for both time and space. The probability table for a domain described by n Boolean variables is on the order of 2 raised to the n. The time it takes to process the table is the same, and these bounds, of course, only worsen with multi-state discrete variables.
00:18:24.880 For these reasons, it's not practical to use the full joint probability distribution for inference, but it helps us understand some basic operations used in Bayesian networks. Once we've obtained some evidence about a previously unknown random variable, prior probabilities no longer apply. Instead, we use a conditional or posterior probability.
00:19:09.480 The notation for expressing a conditional probability is something given something else. The way to interpret this is to say 'the probability of A given B'. You can also think of a prior probability as a special type of conditional probability, where it's conditioned on no evidence. Conditional probabilities can be described in terms of unconditional probabilities, which is what this defining equation highlights.
00:19:58.640 The probability of A given B is equal to the probability of A and B together, divided by the probability of B. I quickly put together an example so you can see how this works: Let A denote the event that a puppy in a kennel is female and let B denote that the puppy is a Labrador in a kennel of 100 puppies. Suppose that 40 of them are Labradors and 10 of those are females.
00:20:47.760 Thus, the probability of A and B—puppies that are both Labradors and female—would be 10 over 100. The probability of B—puppies that are Labradors—equals 40 out of 100. Therefore, the probability of A given B is 10 over 40, implying 10 of the 40 Labradors are female. Plugging these values into the formula yields a valid equality, demonstrating how the product rule can be derived from this equation.
00:21:41.120 When I present equations, I like to show every step, as I am not the best at math! You start by multiplying both sides by the probability of B, which cancels out those terms. Next, you'd flip the equality statement, and alter the order of multiplication on the right-hand side, ultimately deriving the product rule. You can think of this as saying that for A and B to be true, B must be true. Additionally, you also need A to be true given B.
00:22:31.520 Conversely, you can also express it by stating that the probability of A and B equals the probability of B given A multiplied by the probability of A. From all these observations, we can construct Bayes' rule. The equation encapsulating conditional probability and the product rule can be combined, and this simple equation underlies all modern AI systems utilizing probabilistic inference.
00:23:25.480 Bayes' rule shows us how to update our belief about a hypothesis in light of new evidence. Our posterior belief, or the probability of A given B, is calculated by multiplying our prior belief by the likelihood that B will occur if A is true. We owe the Reverend Thomas Bayes a debt of gratitude for this equation. It was published posthumously in the Proceedings of the British Royal Society in 1761, thanks to a friend of Bayes who recognized his brilliance.
00:24:15.600 Although not extremely well known in his time, Bayes' work has now become foundational to much of probability theory. I should add that there are some anachronisms in the portrait of Bayes that cast doubt on its authenticity, but I find it hard to resist using an old image even if it's not quite how he looked.
00:25:06.560 The power of Bayes' rule is that it is often difficult to directly compute the probability of a given B, but we may have information about the probability of B given A. Bayes' rule allows us to compute A given B in terms of B given A. Let me demonstrate a quick example of how this might work: suppose we're interested in diagnosing cancer in patients who visit a clinic.
00:25:49.480 Let A be the event that the person visiting the clinic actually has cancer, while B is the event that the person visiting the clinic is a smoker. We know the prior probability of A from past data, as we understand that 10% of all patients visiting the clinic end up having cancer. We want to compute the probability of A given B—that is, to determine if being a smoker makes it more likely that they will get cancer.
00:26:46.400 Although determining this probability directly is challenging, we also know the probability of B because this information is collected from incoming patients. By checking our records, we can estimate the probability of B given A, which could be 80% if we find that 80% of patients diagnosed with cancer are also smokers. Thus, we can use Bayes' rule to compute the probability of A given B.
00:27:45.920 By using the evidence that a patient is a smoker, we can revise our prior probability of 0.1 (or 10%) to a new probability of 0.16. This represents a significant increase. However, it also shows that the generally low probability of cancer is the greater influence on the outcomes. Frequently, the probability of B is treated as a normalizing constant; thus, we can isolate it and later discover the probability of B through marginalization.
00:28:40.440 This allows us to present a general formulation of Bayes' rule with normalization. Additionally, another fundamental tool for Bayesian inference is the chain rule. Recall that we derived the product rule from the formula for conditional probability. We can extend the product rule for more variables, leading us to the generic version for an arbitrary number of variables, known as the chain rule.
00:29:29.320 The significance of this for Bayesian networks comes from its ability to calculate the full joint probability distribution of any domain characterized by random variables. This is achieved by starting with a known probability—something that is not conditioned on anything else—such as the probability of A_n. This, in turn, allows us to determine the complete full joint probability distribution.
00:30:23.760 Though this rule gives us a method to express full joint distributions, it does not aid in simplifying inference complexity. This is where the concept of conditional independence becomes vital. A is independent of B if its posterior probabilities given B are the same as its prior probabilities, meaning B does not affect the probability of A.
00:31:21.600 Another crucial idea for Bayesian networks is the notion of conditional independence. If the probability of A given B and C is the same as the probability of A given C, we assert that A and B are conditionally independent given C. Understanding this concept was critical to the development of Bayesian network theory. Before grasping this, the researchers were unsure how to reduce operational complexity.
00:32:09.640 To comprehend conditional independence more clearly, let's take a few examples. In our first example, suppose Alice tosses one coin and Bob tosses another. Here, event A is the result of Alice's toss, and event B is the result of Bob's toss. It's evident that A and B are independent; the outcome of one will not influence the outcome of the other.
00:32:52.560 Now, let’s assume Alice and Bob toss the same coin. If the coin is biased towards one side, events A and B become dependent. Observing that B is heads or tails influences our belief about A's result since we have a slight prior belief the coin is biased. In this example, both variables A and B show dependence on a separate variable C, the coin, which might be biased. Even if A and B are not independent, it turns out that once C's value is known, any evidence about B cannot change our belief about A.
00:33:46.200 Specifically, the probability of A given C is equivalent to the probability of A given B and C. In such instances, A and B are labeled conditionally independent given C. In real-world situations, variables thought to be independent might only be independent contingent on some other variable. For instance, let’s assume Alice and Bob live on different sides of the city and commute via entirely different means.
00:34:34.680 Alice takes the train while Bob drives his car. We can represent event A as Alice arriving late and event B as Bob arriving late. It is tempting to assume A and B are entirely independent; however, there could be external factors that might affect both of them, such as a fuel shortage or an unexpected event.
00:35:25.840 When constructing our probability model, we typically ignore these unusual scenarios. In this example, both A and B might be influenced by a variable C, such as a train strike. If C is true, it can increase the probability of A being true and similarly influence B due to increased road traffic. Initially, we might believe these events are independent, but once we understand C, we see that once C is known, A and B are independent.
00:36:09.880 However, understanding C still affects beliefs regarding A and B; just knowing A does not change our view of B, and vice versa. This leads us to the concept of Bayesian networks, which are directed acyclic graphs where nodes represent random variables, and edges depict influence or causality. Each node has a state table with a joint probability distribution of its own states and the states of its parents.
00:36:55.680 Conditional independence implies that nodes need only maintain conditional probabilities for nodes directly influencing them—their immediate parents. This results in a Bayesian network being a complete and non-redundant representation of the domain, also allowing it to be more compact than a full joint probability distribution.
00:37:47.760 The full joint distribution for a Bayesian network can be generated using a specific formula, requiring consideration of a node's parent states and the probability of a node given its parent states. When looking at the notation that indicates variable arrangement in the network, we can calculate the probability for those arrangements. This compactness is a characteristic frequently observed in structured or sparse systems.
00:38:31.520 In most domains, it's reasonable to assume each random variable is only influenced by at most k other variables for some constant k. If we assume we have n Boolean variables for simplicity, the information needed to specify each conditional probability table would be at most 2 raised to the k. Consequently, the complete network can be defined by just n multiplied by 2 raised to the k.
00:39:21.600 In contrast, the full joint probability distribution contains 2 raised to the n. For instance, if we possess 30 nodes influenced by five parents each—where k equals five—the Bayesian network requires only 960 numbers, while the full joint distribution necessitates over a billion. This compactness also empowers Bayesian networks to manage complex domains with numerous variables.
00:40:12.960 Let's examine a straightforward example of a Bayesian network. In this domain, we have four variables: 1. 'cloudy' (Is it cloudy outside?), 2. 'sprinkler' (Were the sprinklers turned on?), 3. 'rain' (Did it rain?), and 4. 'grass wet' (Is the grass wet?). Here, 'cloudy' has no parents so the factors influencing whether or not it's cloudy are outside our problem space. We thus assign a prior belief on 'cloudy' with a 50% probability.
00:41:05.600 The variables 'sprinkler' and 'rain' are influenced by whether or not it's cloudy. It's fairly straightforward that cloudy skies increase the chances of rain, but cloudy conditions also raise the likelihood that the custodian disables the sprinklers. The probability of sprinklers being enabled given that it's cloudy is 0.1, whereas if it is not cloudy, the probability rises to 0.5. The probability of it raining given that it is cloudy is 0.8, and given that it is not cloudy, it is 0.2.
00:41:53.120 Determining whether or not the grass is wet is influenced by both the sprinklers and the rain. Since 'grass wet' has two parents, it must establish conditional probabilities for all combinations of its state and its parents. Therefore, if both the sprinkler and rain are true, there’s a 99% likelihood that the grass is wet, while if one or the other is true, the probability drops to 0.9; otherwise, there’s no chance that it will be wet.
00:42:42.000 It’s important to notice that the tables presented illustrate only the rows for each variable's true state to conserve space on this slide. Internally, however, the variables must have probabilities for all possible combinations of their own states, including their parents. Consequently, this necessitates two for 'cloudy', four for 'sprinkler' and 'rain', and eight for 'grass wet'. Inference within Bayesian networks occurs by passing evidence based on variable observations to the network and then querying it for the posterior probabilities of unknown variables.
00:43:32.000 Here’s an example: let's query the network by asserting that the grass is wet. We want to understand the probabilities of the other variables states. When we do this, we find that the probability of 'cloudy' slightly increases in light of the evidence, while we can also observe that the probability of the sprinkler being true is significantly lower than that of rain. Rain emerges as a more logical explanation for the wetness based on our conditional probability tables.
00:44:20.640 Now, suppose we observe that both the sprinklers were turned on and that the grass is wet. By doing so, the fact that the sprinklers were activated becomes a more credible explanation for the grass being wet, causing the probabilities of both 'cloudy' and 'rain' to decline significantly. Additionally, you can query the network with no explanation or evidence to see the probabilities of those states based purely on their priors and conditionals.
00:45:09.560 In Bayesian networks, numerous different algorithms have been developed for conducting inference, utilizing combinations of previously discussed formulas such as marginalization, Bayes' rule, and the chain rule. These generally fall into two categories: first, exact inference. Though many networks exhibit characteristics allowing them to be resolved in polynomial time, the general problem of inference on Bayesian networks—or exact inference—is NP-hard.
00:46:05.920 For those unfamiliar with complexity theory, this means that the time or space required to resolve the issue increases exponentially with the problem size. One might find themselves waiting an impractical length of time for a solution—perhaps even for the lifetime of the universe—if presented with a sufficiently large problem.
00:46:52.000 However, in practice, most algorithms can efficiently handle fairly complex Bayesian networks on today's computers, though there are exceptions. The second general classification for inference methods is approximate inference, which is often used when quicker results are needed or when precise outcomes are unnecessary, especially for problems that are too large to practically solve using exact inference.
00:47:43.920 Most of these approximation algorithms employ stochastic methods and Monte Carlo algorithms that traverse the network, randomly setting variable states based on their probabilities. In Bayesian networks, learning occurs in two main ways. First, it is not always evident what the actual structure of the network should be, and whether a variable possesses a causal influence or is a consequence of another variable is often unclear.
00:48:41.200 As such, computer scientists have developed numerous algorithms for analyzing apparent independence or causality levels between variables and suggesting potential relationships between nodes. These algorithms utilize sample points—defining states for all variables in the system—to calculate conditional probabilities and analyze the influences among these probabilities.
00:49:32.080 Additionally, a fundamental task for learning in Bayesian networks involves determining the actual parameters. This requires the collection of sample points in order for each node to construct its state table based on the frequencies of various observed states. We've only just scratched the surface of the complex mathematics underlying Bayesian networks, and I hope I haven't intimidated anyone with the details.
00:50:22.920 In practice, they can be quite straightforward to work with. A significant advantage of Bayesian networks is their ability to allow the data to speak for itself. You might not be aware of all the relationships in the domain, but if you let the network learn from sufficient data, it can yield impressive results.
00:51:10.080 As a general principle, as long as your network structure is reasonably accurate and you utilize all the available training data during the learning phase, querying your network will result in predictions that optimally leverage all your knowledge of the domain. However, I caution that there are caveats; in certain situations, well-performing networks may require extended inference time.
00:52:00.200 For the remainder of this presentation, I'd like to demonstrate a library I've been developing that allows you to implement Bayesian networks within your Ruby applications. This library is called SBN, or Simple Bayesian Networks. Initially, it began as a class project I developed during my master's program in Computer Science at BYU, focusing on Bayesian methods.
00:52:57.720 I based its pseudocode on the algorithms outlined in the text 'Artificial Intelligence' by Russell and Norvig. While the original code was implemented in Ruby, it suffered from inefficiencies due to a tight deadline. I later rewrote it in C++, but I questioned the decision to abandon Ruby. Eventually, I started from scratch using Ruby, aiming to enhance efficiency, and remarkably, the latest Ruby version—void of any C code—has become faster than the C++ version.
00:53:55.040 This experience reiterates the adage that premature optimization is the source of all evil, as stated by the famous Donald Knuth. Initially, you should code simply and elegantly, then profile it, and optimize areas with clear inefficiencies. When drafting the conference abstract for this talk, I contemplated naming the project BN4R based on the C++ library, but I chose the name SBN out of respect for Sergio Espinosa's recent BN4R library, which bore a similar name.
00:54:38.200 Let’s take a look at how we can recreate the grass network with SBN. First, we require the necessary files and then instantiate a new network, assigning it a title. Next, we create each node in the network, passing along the network they belong to along with their names in the form of symbols. Strings will automatically convert to symbols during implementation. We then input the probabilities for their state tables and I will clarify more about formatting this in a moment.
00:55:35.680 If you don’t specify their states, they default to Boolean variables: true and false. The edges in the network are created by invoking either 'add_child' or 'add_parent', where adding a child also establishes a parent relationship and vice versa, meaning it doesn’t matter which method you utilize. Lastly, we specify observed evidence as a hash with symbols for keys and values.
00:56:32.560 In this scenario, the key signifies the name of the observed node, while the value indicates its state. As an example, we specify the sprinkler as false and rain as true. Subsequently, we can query the network about the probabilities of the grass being wet. Allow me to run this quickly and show how it works.
00:57:17.440 The output suggests that the probability of the grass being wet is approximately 0.1. However, it is necessary to note that SBN uses an approximation algorithm, so it does not yield precise results each time. Ideally, the exact result assuming perfect inference would be about 0.1 for false and 0.9 for true.
00:58:07.040 Depending on the number of samples utilized in the approximation algorithm, results may vary. The adopted algorithm is called the Markov Chain Monte Carlo (MCMC) algorithm. The order in which the probabilities are supplied is significant; they must alternate between the states of the variable whose probabilities you're supplying.
00:59:03.520 You should provide the probabilities of variable states for its parents as established in the order they were added, starting from the rightmost. For instance, let's say there’s a variable A with two parents B and C. A has three states, while B has two and C has four. You will supply A's states for each set of B and C's states, remembering that each set must total 1.0.
00:59:56.960 Alternatively, you can specify probabilities using a hash to denote the state combinations associated with the table entry and then assign the respective probabilities. Parameter learning is another function that SBN enables; usually, establishing the expected probabilities directly is complex.
01:00:49.040 Instead, you can start with a blank slate and make reasonable estimates of variables' probabilities after gathering sufficient data. Using SBN, you simply pass in complete sample points for all the network nodes, and as many as you require. You can also pass one sample point at a time, and once you've added enough sample points, you invoke a method to set the probabilities based on those sample points.
01:01:42.760 The network accommodates sample points you've provided, ensuring you can continually benefit from previous data. Furthermore, the library supports the serialization of your network; presently, it utilizes a format called XML BIF, one of the few open-source formats for Bayesian networks promoted by the creator of Java Bayes, another useful tool that you can explore.
01:02:30.760 The networks created with this library can be opened in Java Bayes for additional graphical exploration, should you wish. Currently, the sample points you transmit to your network do not save in the XML BIF file, but I aim to address this in the future. Additionally, there are several advanced variable types intended to facilitate handling of real-world data, and I will cover those now.
01:03:18.680 The first variable type we have is the string variable, used for managing string data. Instead of manually setting string variable states, you can let the learning process handle it. During the learning phase, you will pass the observed string for each sample point. Each observed string will be divided into components called n-grams, which are short character sequences.
01:04:08.880 A new variable gets created for each n-gram, with its state being true or false depending on whether that n-gram was observed in the evidence. These co-variables are managed by the primary string variable, inheriting the same relationships as their host variable. By segmenting the observed string data into fine-grained substrings and assessing separate probabilities for each substring occurrence, you can derive a highly accurate understanding of the data.
01:05:01.040 However, I must admit the practicality of this feature remains uncertain. It has the potential to significantly improve inference accuracy but can also complicate your network structure to the point of becoming intractable. Not unlike the early flight attempts depicted in those amusing films, it might be beneficial for some users.
01:05:52.960 Additionally, I developed a variable type known as the numeric variable, which is leveraged for managing numeric data. Numeric data is continuous and more challenging to categorize into discrete states. Unfortunately, the current inference algorithm that SBN employs can only handle discrete variables.
01:06:40.680 Consequently, the workaround is to apply a discernment process, essentially creating buckets for ranges of numbers. The thresholds for these buckets rely on the mean and standard deviation computed from the observed data incorporated into the network. The thresholds are recalculated every time learning is performed, ensuring adequate representation.
01:07:29.920 While some accuracy may be lost in the discretization of your numbers, the numeric variable eases handling by dynamically adapting to your data and managing the discretization process for you. As long as your data is reasonably normally distributed, this functionality should work adequately.
01:08:12.400 To conclude, I want to illustrate a real-world application of how you might utilize this library for better expense tracking in a budgeting application. Imagine I want to automatically categorize my transactions into specific budget categories with minimal human involvement while facilitating continuous learning as I categorize more items.
01:08:58.800 I would declare the category as the variable we query whenever new transactions arise that need classification. The remaining variables would represent other relevant data elements, and while we're uncertain if factors like the day of the week or month significantly influence the categorization, we can use the available data.
01:09:50.280 In this model, a string variable would encapsulate the merchant identification string, while the expenditure amount would be labeled a numeric variable. I had intended to demonstrate this today, but I'm still in the process of refining the implementation, as the outcomes have not met my expectations thus far.
01:10:39.760 Perhaps a different network structure might be needed. Nevertheless, I hope to share more concrete examples with you soon. In terms of future enhancements for this library, I would like to incorporate exact inference capabilities, enable the use of real continuous variables, and serialize sample data within the networks alongside their structure.
01:11:34.480 Moreover, I aim for overall speed improvements. If time had permitted, I also wanted to introduce a brief lesson on array languages, emphasizing vectorization and parallelization. Leveraging libraries like MacSTL can significantly accelerate processing times by enabling efficient handling of data in bulk.
01:12:39.680 This library converts your code into optimized versions suitable for multimedia instruction sets in modern processors. Rather than iterating through individual numbers, it allows batch processing. By utilizing array-like notation, you can benefit significantly from the performance improvements this library can offer.
01:13:33.120 I would also like to explore learning the network's structure; currently, you must specify this explicitly while gaining greater control over the precision of inference during approximate methods. In closing, I'd be remiss if I didn't acknowledge the contributions of my professors and the remarkable tools available for visualizing complex Bayesian networks.
01:14:26.280 Thank you for your attention. Are there any questions?
Explore all talks recorded at MountainWest RubyConf 2007
+13