Summarized using AI

The R language

Barbara Fusińska • March 12, 2016 • Wrocław, Poland

In this presentation, Barbara Fusińska discusses the R language and its application in analyzing GitHub comments. She highlights her personal journey with R, emphasizing the importance of choosing interesting domains for analysis rather than typical introductory examples.

Key Points Discussed:
- Introduction to R: R was created by Ross Ihaka and Robert Gentleman as a free counterpart to the S language from Bell Labs, and it is popular in the data science field alongside Python.
- Why Use R: R is a great choice for programmers interested in data transformation, exploration, and machine learning. Its robustness in statistical analysis is one of its key features.
- Getting Started: Users need to download R and can use various editors. RStudio is recommended for its user-friendly interface.
- Basics of R: R is unique as it handles everything as vectors. The language's syntax may initially confuse users with its use of arrows for assignments and indexing practices.
- Data Structures: R utilizes various data structures like vectors, lists, and data frames, each serving specific functions. Data frames are especially important for analysis as they resemble tables in databases and allow mixed data types.
- GitHub Comments Analysis: Fusińska explores language distributions using GitHub's API. She discusses challenges like missing language data and how it can affect analyses.
- Data Manipulation in R: She illustrates functions for reading JSON data, filtering, and plotting the results to visualize language distribution effectively.
- Community Support: The commitment and contributions of the R community enhance its capabilities, making it beneficial for data scientists.

Fusińska encourages practice and exploration of R, noting its strengths in statistical analysis and data visualization. She concludes by reflecting on the gradual learning process associated with R and invites the audience to consider its use in future projects.
Overall, the presentation aims to inspire interest in R and highlight its potential in data science.

wroclove.rb 2016

00:00:15.370 Today, I'll be talking to you about R and how I used it to analyze GitHub comments. Why GitHub comments? Because, of course, it makes my talk more interesting. But mostly, I hate introductory talks where you start with ‘Hello, World!’ and show how to do basic stuff that you can't even use in a production environment. So, this is the reason why I chose this particular domain for my work.
00:00:45.610 So, who am I? As I was already introduced, I'm currently a C# developer, yes, a C# developer at a Ruby conference, but I will not be talking about C#. Instead, I will focus on R. Why R? Because it is a programming language and I'm a programmer. Once you enter the big data and data science world, R and Python are the two languages you could learn and teach others. I have always been a math enthusiast, so R suited me perfectly, as it is a language developed by statisticians.
00:01:11.950 However, it can sometimes be a very bad idea, which I will try to show you in a moment. I will present code and slides; there will be no live demos because I am terrified of them—at least for me, they never work. This is one of the drawbacks of this language and environment. Once, I gave this talk with live demos. I learned that RStudio cannot create graphs if you don't have the proper resolution on your screen, which was quite an embarrassing moment during the presentation.
00:01:42.520 But then I discovered the issue with the resolution. This is why you will see my beautiful graphs on the slides. All the code I will be sharing today is in this GitHub repository, and if you go through the history, you will probably see that I struggled with this code as I was learning. What I will present is my journey and my final results—not everything that I tried out and learned in the process.
00:02:08.250 Alright, let's start. I will come here so I can see my slides a little better. I will start with an overview of the R ecosystem, followed by the basics of R, and then we will dive into analyzing GitHub events. Get ready because there will be a lot of code, followed by slides.
00:02:40.930 The R language was created at Bell Labs by two gentlemen, Ross Ihaka and Robert Gentleman. That's how R got its name. Some people theorize that the name came from the first letters of their names, while another theory suggests that the language was originally called S, which was a commercial version. Apparently, someone thought to make it free and open-source, hence the name R, which is the letter before S. I don’t know which theory is true, and Wikipedia remains undecided.
00:03:02.050 These days, people are debating whether Python or R is better for various uses, much like the long-standing discussions between Java and C# programmers. But honestly, it doesn't really matter; R is very popular, so if you want to step into this world, it’s a good idea to learn at least some basics.
00:03:38.340 To get started programming in R, you need to download the R software. That’s basically it. Then you can use any editor you want since R is not a compiled language—it is an interpreted language. You can run code line by line and immediately see the results in the console. However, if you prefer using IDEs (Integrated Development Environments), I recommend RStudio. It may not be the best editor or IDE in the world, but it is the most decent one available for R. Recently, Microsoft incorporated R into Visual Studio, which was big news, as now you can write R code within Visual Studio.
00:05:01.810 RStudio features an editor with tabs for writing code, a console for running that code line by line, and a convenient way to organize your code into files. The environment pane shows you the variables and their types, making it very convenient. Plots and packages are not in those tabs; each has its own tab in a separate pane.
00:05:52.640 So, let's talk about the basics of R. R is a great language for many purposes. I have three in mind: data transformation, data exploration, and machine learning algorithms. There are numerous libraries available; whenever you think of an algorithm, there’s probably a library for that. Plus, R has a robust community to support you.
00:06:17.990 It's also great for plotting, but if you look at the syntax, you might find it quite confusing. For example, instead of using an equal sign for assignments, R uses an arrow to assign values. You can also use the equal sign or other assignment signs, but there are some quirks. The assignment only works locally, and if you're not deeply familiar with R, understanding the differences can be challenging.
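As a rough illustration of the assignment quirks mentioned here, the sketch below shows the arrow, the equal sign, and the `<<-` operator. Reading the "works locally" remark as a reference to function scope is an assumption, and the names `counter`, `bump_local`, and `bump_global` are made up for the example.

```r
# Three ways to bind a value to a name at the top level
x <- 1        # the conventional arrow
y = 2         # the equal sign also assigns here
3 -> z        # the arrow can point the other way

# Inside a function, '<-' creates a local variable,
# while '<<-' modifies a variable in an enclosing environment.
counter <- 0

bump_local <- function() {
  counter <- counter + 1   # local copy; the outer 'counter' is untouched
  counter
}

bump_global <- function() {
  counter <<- counter + 1  # updates the outer 'counter'
  counter
}

bump_local()    # returns 1
counter         # still 0
bump_global()   # returns 1
counter         # now 1
```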
00:07:26.940 In R, everything is a vector, which can be confusing for those coming from languages like MATLAB, where everything is treated as a matrix. In addition to numbers, R handles various types, including characters and logical types. It's a dynamic language, meaning you can change variable types easily.
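A minimal sketch of what "everything is a vector" and dynamic typing look like in the console, using only base R functions:

```r
x <- 42
is.vector(x)   # TRUE: a single number is a vector of length one
length(x)      # 1

# The basic types mentioned in the talk
class(x)       # "numeric"
class("R")     # "character"
class(TRUE)    # "logical"

# Dynamic typing: the same name can be rebound to a value of another type
x <- "now a string"
class(x)       # "character"
```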
00:08:19.110 Now, R has its share of hidden errors. If you encounter them, don’t expect clear feedback. There are strange error messages, and if you Google ‘weird errors in R’, you'll find a dedicated page where people compile some common peculiarities. In R, empty values can be tricky; there is ‘NA’ for missing values, and you need to be careful to distinguish between them.
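The transcript names only `NA` here. The sketch below shows how `NA` behaves and, as an assumption about what "them" refers to, contrasts it with `NULL`. The `ages` values are made up.

```r
ages <- c(31, NA, 27)        # NA marks a missing value inside a vector

is.na(ages)                  # FALSE TRUE FALSE
mean(ages)                   # NA: missing values propagate through calculations
mean(ages, na.rm = TRUE)     # 29: drop missing values explicitly

NA == NA                     # NA, so test with is.na() rather than ==

# NULL (assumed to be the other "empty" value) means "nothing at all"
length(NA)                   # 1: NA still occupies a slot
length(NULL)                 # 0
c(1, NULL, 2)                # NULL disappears: 1 2
```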
00:09:04.199 Next, let's talk about vectors. To create a vector, you use the `c()` function, standing for 'combine'. Vectors in R are like arrays, but you have to call them vectors. When you print them, they look like single entries, even if they contain multiple elements. Understanding the indexing in R can also be tricky because R is one-indexed.
00:09:35.790 The type of the vector corresponds to the type of values contained within it, meaning that each vector can only hold one type of value. You can create vectors using sequences or default values, and you can access specific elements using their indices.
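The points about `c()`, one-based indexing, single-typed vectors, and sequences can be sketched as follows. The values are illustrative only.

```r
langs <- c("Ruby", "R", "Python")   # c() combines values into a vector

langs[1]        # "Ruby": indexing starts at 1, not 0
langs[2:3]      # "R" "Python"
langs[-1]       # everything except the first element
class(langs)    # "character"

# A vector holds one type, so mixed input is coerced to a common type
mixed <- c(1, "two", TRUE)
class(mixed)    # "character"
mixed           # "1" "two" "TRUE"

# Sequences and default values
1:5                      # 1 2 3 4 5
seq(0, 1, by = 0.25)     # 0.00 0.25 0.50 0.75 1.00
numeric(3)               # 0 0 0
rep("x", 2)              # "x" "x"
```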
00:10:09.980 Lists allow you to combine elements of different types. However, they can be challenging to navigate since they introduce more complexity than vectors. An important aspect of R lists is nesting—each element can contain a vector, which can create a complicated structure resembling an object in Java or JSON.
00:10:51.670 If you want to access elements in vectors or lists without dealing with the nested structure, you can name your elements. This can make accessing elements much simpler when working with data. In R, the dollar sign is used to access named elements in lists, offering a syntax similar to accessing properties of an object in other languages.
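A small sketch of a nested, named list and the dollar-sign access described above. The `event` structure and its field names are hypothetical and are not the real GitHub event format.

```r
# A list can mix types and nest other lists, much like a JSON object
event <- list(
  type   = "PullRequestEvent",
  repo   = list(name = "someuser/somerepo", language = "Ruby"),
  labels = c("bug", "help wanted")
)

event$type               # "PullRequestEvent"
event$repo$language      # "Ruby": $ chains through nested lists
event[["labels"]][2]     # "help wanted"

# Single brackets keep the list wrapper, double brackets extract the element
class(event["type"])     # "list"
class(event[["type"]])   # "character"

# Vectors can have names too
scores <- c(alice = 90, bob = 75)
scores[["bob"]]          # 75
```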
00:11:38.840 When it comes to data frames, they are one of the most commonly used structures in R, especially for statistical analysis. Data frames resemble tables in databases, where every column can contain different types of data. A data frame can be constructed by assembling vectors of data.
00:12:13.920 Upon creating a data frame, R offers a summary function, allowing you to gain insights quickly about your data. You can easily look into the basic statistical characteristics of your data, which simplifies the exploratory phases of your analysis.
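To make the data frame description concrete, here is a minimal sketch that assembles one from vectors and summarizes it. The `repos` data is invented for illustration.

```r
# Each column is a vector; columns may have different types
repos <- data.frame(
  name     = c("alpha", "beta", "gamma"),
  language = c("Ruby", "R", "Ruby"),
  stars    = c(120, 45, 300),
  stringsAsFactors = FALSE   # keep text columns as plain character vectors
)

str(repos)        # column names and types at a glance
summary(repos)    # per-column statistics (min, median, mean, max for numbers)

nrow(repos)       # 3
ncol(repos)       # 3
repos$stars       # a column is just a vector: 120 45 300
mean(repos$stars) # 155
```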
00:13:05.130 When filtering data based on conditions, R also shines. You can apply conditions directly inside brackets to filter your data easily. You can create functions that return logical conditions, allowing you to achieve complex conditional filtering. Additionally, R's ability to perform operations on data frames makes it powerful for exploratory data analysis.
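Filtering with conditions inside brackets might look like the sketch below. The data and the helper `is_popular_ruby` are hypothetical.

```r
repos <- data.frame(
  name     = c("alpha", "beta", "gamma"),
  language = c("Ruby", "R", "Ruby"),
  stars    = c(120, 45, 300),
  stringsAsFactors = FALSE
)

# A logical vector inside the brackets selects the matching rows
repos[repos$stars > 100, ]

# Conditions combine with & (and) and | (or); the second index picks columns
ruby_names <- repos[repos$language == "Ruby" & repos$stars > 100, "name"]
ruby_names   # "alpha" "gamma"

# A function that returns a logical vector can serve as a reusable filter
is_popular_ruby <- function(df, min_stars) {
  df$language == "Ruby" & df$stars >= min_stars
}
popular <- repos[is_popular_ruby(repos, 200), ]
popular$name   # "gamma"

# The same idea works on plain vectors
stars <- repos$stars
stars[stars < 200]   # 120 45
```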
00:14:59.400 Let's now analyze some GitHub comments. I wanted to explore language distributions, which is a hot topic in data analysis. GitHub provides an API called GitHub Archive that allows you to access events on open-source projects. You can download files in JSON format representing various events.
00:16:12.120 I ran a query for push events, create events, and pull request events for a specific time frame. I found that even on a holiday, the volume of events was substantial, leading to interesting conclusions regarding the languages displayed in those events. However, I noticed a limitation: language information was only available in pull request events, which made my previous analyses questionable.
00:17:54.510 I also had to deal with many missing languages, as GitHub doesn’t provide language data on every event. This observation raised concerns about the analyses of GitHub comments and prompted me to think critically about the conclusions drawn from incomplete data.
00:18:43.950 When handling the data, I read JSON lines, and R provided convenient functions to read and manipulate the data. The challenges I faced in structuring JSON data and converting it into valid R objects introduced me to useful functions, such as `lapply`, to apply a function across data sets.
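The transcript does not say which package was used to parse the JSON. The sketch below assumes the `jsonlite` package is installed and uses its `fromJSON` function with `lapply`. The two event lines are simplified, made-up stand-ins rather than the real GitHub Archive schema.

```r
library(jsonlite)   # assumption: installed via install.packages("jsonlite")

# One JSON document per line. In practice the lines would come from a file,
# e.g. json_lines <- readLines("events.json")   (hypothetical file name)
json_lines <- c(
  '{"type":"PushEvent","repo":{"name":"someuser/repo-a"}}',
  '{"type":"PullRequestEvent","repo":{"name":"someuser/repo-b","language":"Ruby"}}'
)

# lapply applies fromJSON to every line and returns a list of parsed events
events <- lapply(json_lines, fromJSON)
length(events)                 # 2

# sapply works like lapply but simplifies the result to a vector
types <- sapply(events, function(e) e$type)
types                          # "PushEvent" "PullRequestEvent"

events[[2]]$repo$language      # "Ruby"
```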
00:19:52.310 I had to filter through the data, ensuring I only worked with events that contained language information. I used various functions to extract the relevant data, performing checks for duplicates and undertaking operations to simplify my analysis.
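One possible way to keep only events that carry a language and to drop duplicates is sketched below in base R. The event fields are hypothetical. Treating "duplicates" as repeated repositories is an assumption, since the talk does not say what was deduplicated.

```r
# Already-parsed events; a NULL language stands for "not provided"
events <- list(
  list(type = "PushEvent",        repo = "someuser/repo-a", language = NULL),
  list(type = "PullRequestEvent", repo = "someuser/repo-b", language = "Ruby"),
  list(type = "PullRequestEvent", repo = "someuser/repo-b", language = "Ruby"),
  list(type = "PullRequestEvent", repo = "someuser/repo-c", language = "R")
)

# Logical vector: TRUE where the event has a language
has_language <- sapply(events, function(e) !is.null(e$language))
with_lang <- events[has_language]
length(with_lang)   # 3

# Flatten the interesting fields into a data frame
repos <- data.frame(
  repo     = sapply(with_lang, function(e) e$repo),
  language = sapply(with_lang, function(e) e$language),
  stringsAsFactors = FALSE
)

# Keep the first row for each repository so it is counted once
repos <- repos[!duplicated(repos$repo), ]
nrow(repos)         # 2
```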
00:21:16.440 Ultimately, I considered how to represent the language distribution using tables. Using the table function in R allowed me to assess the frequency of different languages easily. I also worked on aspects like plotting the data, as visual representation forms an essential part of data analysis.
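A minimal sketch of counting and plotting a language distribution with base R's `table` and `barplot`. The `languages` vector is invented sample data, not a result from the talk.

```r
languages <- c("Ruby", "JavaScript", "Ruby", "Python", "JavaScript", "Ruby")

# table() counts how often each distinct value occurs
counts <- table(languages)
counts[["Ruby"]]          # 3

# Sort so the most frequent language comes first
sorted <- sort(counts, decreasing = TRUE)
names(sorted)[1]          # "Ruby"

# Relative frequencies instead of raw counts
prop.table(counts)

# A simple bar chart of the distribution
barplot(sorted,
        main = "Languages in sample events",
        ylab = "Number of events",
        las  = 2)         # rotate axis labels so long names fit
```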
00:21:56.000 Through my journey, I shared diverse insights about R's functionality and its robustness in statistical analysis. I presented visual results using simple R library functions, demonstrating the potential of R for producing meaningful analyses efficiently.
00:22:41.410 I summed up the capabilities of R, from managing data types and filtering to handling missing values and plotting. I encouraged others who want to delve deeper into R to explore and practice, as the learning path is both challenging and rewarding. As I invite questions, I'm eager to discuss more about R and its applications.
00:23:56.370 After sharing my experiences, I raised a point about the gradual mastery of R, emphasizing that while it may be frustrating at times, consistent practice leads to an intuitive understanding of the language.
00:24:09.600 I also pointed out that while R excels in data visualization and exploratory data analysis, Python often works better as a general-purpose language. I urged using the right tool for the job, mentioning that R is outstanding for statistical analyses and creating visualizations, while Python could be employed for web applications.
00:25:07.520 Before concluding, I noted the growth of R's popularity and how it ties back to the community surrounding it. Many contribute to developing useful packages, which fuels the growth and attractiveness of R for data scientists and researchers alike. As I detailed the power of R, I expressed my hope to see more professionals explore its capabilities.
00:26:37.490 I hope this presentation inspires you to dive deeper into R and consider it for your future data science projects, as there are countless possibilities within this language. As we move forward, I encourage you to bring your curiosity and openness to learning while working with R.