Hiring Developers, with Science!

by Joe Mastey

In his talk titled "Hiring Developers, with Science!" at RailsConf 2016, Joe Mastey addresses the pressing issue of effectively hiring developers by applying principles from industrial and organizational psychology. He emphasizes the importance of designing interviews that accurately predict candidate performance rather than falling into common traps that many organizations experience. Mastey introduces three main concepts—validity, reliability, and usability—that should guide the development of effective hiring processes:

  • Validity: It's crucial to ensure that interview questions accurately reflect relevant job skills. Bad practices occur when we focus on unrelated traits instead of actual competencies. For example, assessing arithmetic ability should focus on relevant math questions rather than obscure memorization tasks.

  • Reliability: Consistency in interview results is paramount. The same candidate should receive similar evaluations regardless of who is interviewing them or when the interview occurs. Mastey warns that if interview processes yield varying results, they are likely measuring the interviewers rather than the candidates.

  • Usability: An effective interview process should consider candidates’ experiences. Complicated or excessive assignments can discourage promising candidates and inadvertently exclude certain demographics.

Through investigating common pitfalls in developer interviews, Mastey provides suggestions for improvement, such as:
- Rethinking job postings to maximize applicant diversity and avoid exclusionary language.
- Moving away from trivia questions and the FizzBuzz test, which can often lack real relevance to job performance.
- Considering behavioral interviews and hypothetical problem-solving scenarios that assess a candidate's capabilities in a practical context.
- Emphasizing structured feedback debriefs after interviews to minimize bias.

Mastey concludes by challenging organizations to reconsider their hiring standards. He argues that the perception of hiring the 'best' often narrows candidate evaluation instead of expanding it, suggesting that a thorough understanding of what qualifies as successful hiring can lead to better outcomes. By applying scientific principles to the interview process, companies can enhance their hiring practices and secure the right talent for their teams.

00:00:10.020 Thanks for showing up! I'm excited; it's my first talk where people aren't hungover. So, I'm super stoked that all of you are presumably awake. I'm told there's a lot of jet lag if you're coming all the way from Seattle, but hopefully, that'll work out.
00:00:22.800 My name is Joe Mastey, and let's talk a little bit about hiring developers. I am a consultant, and I work on various things with companies, but I tend to focus on their onboarding and hiring processes, including apprenticeships.
00:00:34.510 One of the things that I've noticed, both with companies that work with me and some that perhaps should, is that we have a problem with hiring rates. This is a big issue! How many of you at your job have a job posting that you cannot fill for a developer?
00:00:41.050 There are a lot of us! And in many cases, this is not just like, "Oh hey, we could use an extra person on the team." This is like an exigent threat to your business. This is a big deal.
00:00:54.059 Interviewing is hard. How many of you have had a terrible interview experience? Anyone ever attended a bad one? Yeah, so you go in and they don't have their act together. Does anyone want to admit to having given a terrible interview? Okay, I have.
00:01:10.140 Good! I was hoping that someone would be honest about it. What's funny is that even big companies (think about Google and Facebook), all these companies with 10,000 developers, are not actually doing better. Their interviews are just as terrible as the rest of ours.
00:01:21.430 And that points to the fact that interviewing is difficult and expensive. Anyone you have on your interview team also has a full-time job, right? These engineers are expected to take hours out of their day while also managing their regular tasks.
00:01:33.490 On the candidate side, anyone applying for your job may also have another job and other responsibilities in their life. If you think that this isn't your concern, remember that the candidates you really want to hire are probably already employed and applying to other places.
00:01:43.780 So, if you give someone, say, 100 hours of homework, they're probably just going to move on. We're ultimately making this worse for ourselves without realizing it.
00:01:57.190 My informal sense is that how we run interviews often comes down to how we ourselves got hired. I’ve been interviewed in many different ways, and if one way seemed kind of cool, maybe we’ll imitate that.
00:02:05.230 But this doesn’t do us any favors. I will tell you now: there is no perfect interview.
00:02:10.690 We're going to discuss how to make interviews better or worse, but there is no correct answer per se. I will say that there are many bad answers, and unfortunately, many of the bad answers are the ones we are currently employing.
00:02:18.730 This is one of the main contentions that I want to make to you, and I wanted to put it out early for you to think about this: The reason our interviews are bad is that we typically are not measuring what we think we are.
00:02:26.320 When you have a bad interview, with puzzles or abrasive interviewers, you're not actually measuring the candidate; you're measuring the interviewer. The entire point of the interview is to assess whether that candidate is good.
00:02:39.160 This is ultimately why you end up turning down good candidates and accepting bad ones, and this is why you have two hundred interviews and never offer anybody a job. It's because you're not accurately measuring.
00:02:50.170 I don’t think this is intentional—nobody’s trying to do a bad job—but what's happening is that we lack a toolset. We don’t have a mental schema for how to evaluate whether our interviews are good.
00:03:02.500 Many of us come from engineering backgrounds, not HR, and so we usually end up just winging it. Even if we haven't invented the correct ways to interview, there are other fields that have—specifically, psychology.
00:03:10.690 Today, I want to talk about industrial and organizational psychology. This is one of the major branches of psychology that began in the late 1800s. It really came to prominence in the 1910s, during the First World War.
00:03:18.730 Psychologists in the army needed to figure out where to place a million recruits, and they developed methods for this selection process.
00:03:30.280 I’m going to include references at the end of this talk, so if you find yourself wanting to look at primary material, remember that 'selection' is the name of the concept you want to Google.
00:03:37.550 It’s a little tough to cover 100 plus years of psychology, but I will give you a tool in three parts—a way to think about the interviews you're conducting. Then we will look at some common tropes—typical things we see in interviews—through that lens.
00:03:50.430 The first part is validity. If we have something we want to measure, we call this a construct in psychology.
00:03:54.559 Validity tells us whether our measurement accurately reflects that thing. Picture a target: the criterion is the bullseye, and the measurements are centered approximately on it.
00:04:03.299 It’s okay that they’re a bit spread out; it’s more important that they’re measuring the correct concept.
00:04:09.250 One type of validity concerns whether the questions you ask correspond properly to the concept you're attempting to test. For example, if I wanted to test arithmetic ability, asking you to list the digits of pi is not relevant.
00:04:16.530 On the other hand, asking if five plus five equals ten does relate to arithmetic. The second type is often referred to as external validity.
00:04:28.210 If I can test your arithmetic, does it correspond to a skill needed for the job? For instance, if I want to hire you as a carpenter, does testing arithmetic skills apply?
00:04:36.110 You have to know what the bullseye is to measure effectively. This is where we hit our first complication in hiring developers.
00:04:43.150 As it turns out, we probably don’t all agree on what makes a good developer, which complicates how we measure success in our field.
00:04:52.900 What might a great developer look like to you? Hopefully, it doesn’t look like some unrealistic expectation. What it likely resembles is people you know or team members who have been successful. You generalize their characteristics and say, 'Okay, that’s a good developer.'
00:05:01.250 But this isn’t a real concept of a good developer—it’s more a collection of characteristics. Some may relate well to whether someone is a good developer, and some may not at all.
00:05:12.250 When we begin to measure people based on what we think we’ve seen from success, we end up measuring unrelated traits and potentially biasing our outcomes.
00:05:19.570 Reliability is the second concept. Validity is about whether we're centered on the bullseye; reliability is about how similar our measurements are to each other.
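The bullseye picture can be put into rough numbers. The sketch below is only an illustration, not something from the talk: it assumes we somehow know a candidate's "true" score (in practice we never do) and that repeated interviews produce numeric scores. The function name `describe_measurements` and the sample scores are hypothetical.

```python
from statistics import mean, pstdev


def describe_measurements(true_score, measured_scores):
    """Return (offset, spread) for repeated measurements of one candidate.

    offset: distance between the average measurement and the true score,
            a rough stand-in for validity (smaller is better).
    spread: population standard deviation of the measurements,
            a rough stand-in for reliability (smaller is better).
    """
    offset = abs(mean(measured_scores) - true_score)
    spread = pstdev(measured_scores)
    return offset, spread


# Hypothetical scores on a 1-10 scale; the "true" skill is assumed to be 7.
valid_but_noisy = [4, 10, 7, 9, 5]    # centered on 7, widely scattered
reliable_but_off = [3, 3, 4, 3, 3]    # tightly clustered, far from 7

for name, scores in [("valid but noisy", valid_but_noisy),
                     ("reliable but off", reliable_but_off)]:
    offset, spread = describe_measurements(7, scores)
    print(f"{name}: offset={offset:.2f}, spread={spread:.2f}")
```

With these made-up numbers the first set has an offset of 0 and a spread of about 2.28, while the second has an offset of 3.8 and a spread of 0.4: one is centered but scattered, the other consistent but aimed at the wrong spot.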
00:05:28.870 For example, if I give you an interview and someone else gives you the same interview, you should receive the same score.
00:05:36.830 If you don’t, you're measuring the interviewer, not the candidate. That concept is known as inter-rater reliability.
00:05:44.860 The same goes for when I interview you one day and then interview you again later in a different mindset. These shifts should not impact measurements, but they often do.
00:05:53.290 That form, test-retest reliability, means that when I administer an interview once and again later, the score should remain consistent. If not, I have measured some random chance rather than the candidate.
00:06:03.840 Finally, for reliability, if I have multiple questions, those questions should yield consistent results. For instance, take our arithmetic example: if I ask what’s five times five and someone knows the answer, they should also be able to answer my other multiplication questions.
00:06:10.560 If one person happens to know eight times eight and another person has simply memorized a different pool of math facts, the questions won't agree with each other, and we aren’t measuring true ability.
00:06:18.530 All these concepts of reliability point towards consistency. If you use a scoring rubric and give the same interview in the same way, regardless of who is conducting it, consistency breeds reliability.
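One common way to put a number on agreement between two interviewers is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The talk does not name this statistic; it is offered here as one possible sketch, assuming two interviewers each gave the same candidates a categorical verdict. The function name and sample verdicts are hypothetical.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters who labeled the same items.

    1.0 means perfect agreement, 0.0 means no better than chance.
    """
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(ratings_a)
    # Fraction of candidates on which the two interviewers agreed.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical verdicts from two interviewers on the same six candidates.
interviewer_1 = ["hire", "hire", "no", "no", "hire", "no"]
interviewer_2 = ["hire", "no", "no", "no", "hire", "hire"]
print(round(cohens_kappa(interviewer_1, interviewer_2), 3))
```

Here the interviewers agree on four of six candidates, but half of that agreement would be expected by chance, so kappa comes out at about 0.333. A low value like this suggests the interview is measuring the interviewers more than the candidates.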
00:06:28.900 Usability is the third concept. If we could come up with valid, reliable interviews for developers, we might have one for every single task they’d perform. However, that approach is exhausting, and candidates don’t enjoy it.
00:06:44.000 We have limited opportunities to take measurements and must be careful with usability. This differs between companies as well.
00:06:52.270 It’s tragic if you steal Google’s interviewing process; they can afford to abuse their candidates because people still want to work for them. It might work for Google, but that doesn’t mean it will work for you.
00:07:02.790 Imagine giving a 20-hour homework assignment. For some candidates, if they don’t have jobs or just graduated from a bootcamp, that’s fine. But others, such as people with families or with limited time because of medical issues, may be inadvertently excluded.
00:07:12.880 To exclude people unintentionally does not help your hiring efforts. So we need to keep usability in mind.
00:07:21.400 Your target likely doesn't match some idealized concept but reflects the reality of your requirements. What you need to measure depends on your specific circumstances.
00:07:29.320 You cannot just simply take someone else's interview. You must balance their validity against other factors and ask yourself if you should trade off validity for reliability. Often, the answer is yes.
00:07:39.510 What you will end up getting is something that’s messy and imperfect. The only way to see if an interview works is to test it.
00:07:49.860 If you have an existing team, this can be beneficial. You can administer your interview to your existing team members. However, just because you’re measuring against them doesn’t mean it’s good.
00:07:55.570 If your team has specific characteristics, like all being right-handed, and you design an interview that favors right-handedness, then all of your team members will pass, but you haven't measured what you intended.
00:08:06.610 You need to test the interview against candidates who reflect a range of experiences and demographics. It’s essential to have a diverse candidate pool to ensure you’re measuring effectively.
00:08:16.320 We have our tools: validity, reliability, and usability. Let's talk through the interviewing process.
00:08:25.560 The interview process starts back at the job posting because, again, if you accidentally exclude a bunch of people with a poorly worded job posting, nothing in the interview will make a difference.
00:08:35.760 This is your opportunity! Consider what it takes for someone to be successful in your organization.
00:08:43.920 If there's something you're listing as required, you should measure it during the interview. If not, you probably don't need it.
00:08:50.600 Prioritize your needs because everyone is different. If your list consists of six things you find necessary, realistically, you may only need four or five.
00:08:59.760 We often want candidates who are great at everything, but that may not be realistic. Many people are socialized not to apply for jobs they don’t feel qualified for, which can exclude potential candidates.
00:09:11.410 This is not meant to be the definitive solution, but the actual wording in your job postings is incredibly relevant.
00:09:16.710 Have any of you ever declined to apply for a job because the posting asked for something off the wall? It could be things like a 'ninja' or a 'rockstar,' and I think that’s a deterrent.
00:09:27.110 This excludes people, even though that was not the intent. Similarly, consider where you post your job.
00:09:34.730 If your only posting venue is Carnegie Mellon, you're limiting your reach. This approach will lead your team to reflect only one background.
00:09:42.440 Your network is important too. If you or your team is homogeneous, your candidate pool will remain limited.
00:09:48.920 It’s essential to reach outside your comfort zone to find a more diverse talent pool, which is beneficial for your team.
00:09:57.450 Let’s discuss different methods of screening candidates, starting with trivia questions. These tend to be a litmus test for candidates.
00:10:05.410 Trivia questions, while interesting, can have very little signal. They may test recency rather than true knowledge.
00:10:14.300 If I ask you for the name of a specific method or the order of arguments, the question is about the right subject, but it mostly rewards those with recent exposure.
00:10:24.360 The validity of such questions becomes questionable. Could it be reliable? It could be if you standardized the queries, but usually, they’re not.
00:10:33.920 Is it usable? A single trivia question is an easy format, but it reflects nothing beyond superficial memorization.
00:10:41.920 Another common method is the FizzBuzz test. You might find FizzBuzz interesting because it can be a valid indicator of coding ability.
00:10:50.860 However, it’s become so well-known that candidates often prepare just to pass it, which diminishes its effectiveness as a measure.
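For readers who have not seen it, FizzBuzz in its usual form asks the candidate to walk through the numbers from 1 to n, replacing multiples of 3 with "Fizz", multiples of 5 with "Buzz", and multiples of both with "FizzBuzz". A minimal version might look like this; the function name and list-returning shape are just one way to write it.

```python
def fizzbuzz(n):
    """Return the FizzBuzz sequence for 1..n as a list of strings."""
    lines = []
    for i in range(1, n + 1):
        if i % 15 == 0:          # divisible by both 3 and 5
            lines.append("FizzBuzz")
        elif i % 3 == 0:
            lines.append("Fizz")
        elif i % 5 == 0:
            lines.append("Buzz")
        else:
            lines.append(str(i))
    return lines


print("\n".join(fizzbuzz(15)))
```

The exercise is short enough to memorize, which is exactly why a prepared candidate can pass it without it saying much about how they build software.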
00:11:00.590 Homework assignments are another option. Many companies implement this as part of their hiring process. Homework can provide valuable work samples.
00:11:10.160 However, it's important to note that homework has significant reliability issues. Different candidates will invest different amounts of time based on their commitments.
00:11:18.840 If your grading criteria don’t account for this disparity, results will vary widely, leading to potentially biased measurements.
00:11:28.640 One approach is to provide specific time parameters—like suggesting candidates spend around five hours—thereby encouraging comparability in submissions.
00:11:37.280 And avoid assignments that ask for extensive work that even your architects can’t accomplish—such as a full application rebuild!
00:11:46.900 Then we have these coding assessment sites, which promise to give you a score without any engineers involved, making it easy to administer.
00:11:56.980 While they may have reliability, their validity remains in question; they often produce questions whose relevance to building software can be tenuous.
00:12:06.760 I attended a talk by Carrie Miller a year ago at RailsConf about hiring, which highlighted the need to minimize variance across interviews.
00:12:12.730 Consistency in scheduling and expectations for both candidates and interviewers can greatly reduce variability and increase signal.
00:12:20.340 Your interviewers need to be trained! They should know what they are measuring, and candidates should be informed on what to expect.
00:12:30.420 Has anyone implemented a red-black tree at work in the last year? Likely no hands up; we don’t build software by memorizing algorithms.
00:12:38.640 The correspondence between such interviews and actual work performance is usually quite low. Thus, we may not be measuring true coding skills.
00:12:48.320 A better approach might be to provide the algorithm in writing and have candidates translate requirements into working code.
00:12:58.250 Moving on to problem-solving scenarios, hypotheticals can allow for valid assessments. You can modulate the complexity based on experience levels.
00:13:06.840 When it comes to behavioral interviews, asking candidates about past experiences allows you to assess qualities relevant to potential future performance.
00:13:13.390 However, training your candidates on how such interviews work can enhance their usability and make answers more reflective of true past experiences.
00:13:20.890 Culture fit can be another tricky metric. If "culture fit" only means mirroring your existing team, you are only measuring whether candidates are similar to you.
00:13:29.370 Ultimately, when it comes to debriefing after interviews, the subjective nature can skew evaluations.
00:13:38.760 To combat this, interviewers should record specific feedback and ensure they share their assessments simultaneously, minimizing potential influence.
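The idea of sharing assessments simultaneously can be sketched as a small collector that keeps every interviewer's feedback hidden until all of them have submitted. This is a hypothetical illustration, not a tool mentioned in the talk; the class name `BlindDebrief` and its methods are assumptions made for the example.

```python
class BlindDebrief:
    """Collect written feedback and reveal it only once everyone has submitted."""

    def __init__(self, interviewers):
        self._expected = set(interviewers)
        self._feedback = {}

    def submit(self, interviewer, score, notes):
        """Record one interviewer's score and notes; each may submit once."""
        if interviewer not in self._expected:
            raise ValueError(f"{interviewer} is not on this interview panel")
        if interviewer in self._feedback:
            raise ValueError(f"{interviewer} has already submitted feedback")
        self._feedback[interviewer] = (score, notes)

    def reveal(self):
        """Return all feedback, or raise if anyone has not submitted yet."""
        missing = self._expected.difference(self._feedback)
        if missing:
            raise RuntimeError(f"still waiting on: {sorted(missing)}")
        return dict(self._feedback)


# Hypothetical panel of two interviewers.
debrief = BlindDebrief(["alex", "sam"])
debrief.submit("alex", 3, "Clear explanation of trade-offs in the design question.")
debrief.submit("sam", 4, "Translated the written requirements into working code.")
print(debrief.reveal())
```

Because nobody can read anyone else's notes before writing their own, the first or loudest opinion in the room has less chance to pull the others toward it.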
00:13:47.060 In summary, if you want to hire well, you need to pick a set of constructs and design tests that are valid, reliable, and usable.
00:13:55.450 The last point I wish to make relates to a common refrain from clients about wanting to hire only the best.
00:14:00.790 They claim they don’t want to lower the bar. However, this mindset often reflects a limited understanding of what constitutes a successful hire.
00:14:08.570 I argue that our perception of the bar is often warped; it’s not a straightforward measure but an uneven spectrum where inadvertent biases come into play.
00:14:16.260 To be effective, we ought to ensure that we are accurately representing the expectations and evaluating potential candidates accordingly. Thank you!