Schemas for the Real World

RubyConf AU 2013

Carina C. Zona

In her talk at RubyConf AU 2013, Carina C. Zona addresses the complex intersection of social app development and users' identities, emphasizing the need for schemas that reflect real-world diversity. Zona highlights the inadequacies in traditional data normalization practices when applied to sociological aspects of human behavior, warning against reducing personal identities to simple lists or binary options. Key points include:

- User Inclusion: Modern social apps face challenges from users who feel marginalized by one-size-fits-all identity fields in forms. Incorporating varied identities is crucial for engagement.

- Beyond Normalization: Database normalization may clash with the intricate nature of user identities; developers must accommodate the messy reality of human relationships.

- Diverse Data Collection: Zona advocates for open-ended fields rather than closed options. Examples such as Metafilter illustrate how allowing users to express themselves freely results in richer, more authentic data collection.

- Complexity in Human Relationships: Traditional relational models struggle to represent multi-faceted identities. Platforms like Facebook, upon user feedback, expanded identity options to better reflect real-world complexities.

- Trust and Flexibility: By trusting users and avoiding mandatory fields, developers foster a more engaged community. Free-form responses can lead to unexpectedly valuable insights that are often overlooked in standardized surveys.

- Data Analysis and Discoveries: Ultimately, data should be viewed as a means for user expression rather than just for analytics. Methods like machine learning can offer ways to analyze free-form inputs and uncover trends.

In conclusion, Zona insists that developers must prioritize user experience and expressiveness over rigid schemas, ultimately leading to communities that feel more inclusive and supportive. Engagement and loyalty stem from allowing users to contribute authentically to their online personas. The talk highlights the necessity of evolving traditional approaches to meet the diverse needs of users in social apps, advocating for a balance between structure and chaos in data collection.

00:00:30.000 Imagine walking through the world knowing that everyone's first assumptions about how you see yourself, who you love, and what feels right for you are all completely wrong. Now, imagine signing up for a cool website and then being required to select an option from the drop-down menu that does not include anything that represents you at all. You'll feel defeated. You'll want to argue that whatever they think they're learning from that drop-down menu isn't really true of you.

00:00:43.040 You'll want to tell them that they're adding to your humiliation by making you do this, and that they are missing a huge part of you. Users have been giving pushback for a while when they feel that they're being left out. Social apps, in particular, are being pressed to adjust. Facebook, Google, and others have been dealing with these issues for years. Back in 2008, Facebook was already saying that this was old news, something they had been struggling with for a while.

00:01:07.680 So, if you feel like you don't know how to deal with this stuff, and you're trying to work it out and feel out of your depth, you're not alone. When developers talk about normalization, we're usually discussing databases. However, when dealing with human attributes, there's another construct and we wind up conflating the two: sociological normalization. We pursue that idealized norm to put in a form, right? Conflating database normalization and sociological normalization works just fine as long as you're okay with your user base only coming from that idealized select little norm.

00:01:38.720 If you don't need money from the rest of us, it's cool! But most of us are going for that broad user base, and so we often lose track of one of the core objectives of database normalization itself, which is that we're supposed to be mirroring real-world concepts and their interrelationships. The real world is really messy, which means our databases have to be allowed to be messy too. So why is that hard? Well, at first glance, there's stuff that makes this look really easy. It's forms! We've all done forms a million times; we've made forms and we've filled out forms. This doesn't seem complicated at all.

00:02:36.400 But we get tripped up by some flawed premises. First, there's the premise that deeply personal stuff about humans can be reduced to lists. Second, there's the assumption that canonical lists for these things can exist, or that we can create them in traditional engineering fashion. And third, there's our faith that those first two problems can be easily solved by just adding more list items.

00:03:01.360 So, we know that's not going to work, and the reality is that, at the end of the day, it risks making us look like fools. Here's the next flawed premise: that we think we intuitively understand the nature of social networks. Let's step aside from how social networking has been implemented as technology, and just ask, what is a social network in purely human terms?

00:03:37.840 That's the real life that our apps are meant to replicate and build upon. It's one-on-one, it's personal—our social lives are as personal as they can get. In social apps, identity is a dependency. And when we're talking about social networking, we're always talking about one-on-one. As much as we see this grand scale that seems like we're talking to a billion people, we're always doing one-on-one. As developers, we can break or build the community that we're creating with each line of our code. We should be asking ourselves: what are the one-on-one relationships that we're fostering through our code? What are we hindering? Are we even part of that relationship too?

00:04:42.720 What is our one-on-one with the current user, and who is missing? Who could be a current user if we were fostering that one-on-one with them? In real life, the guy who says, 'I know your personhood better than you do,' just sounds like a presumptuous jerk. Similarly, in real life, that woman who says, 'Who you are is completely invalid,' just sounds arrogant. An app that tells you your existence isn't possible comes across just as cluelessly as the people who say this to you.

00:05:51.120 It's not what we set out to do! We're not seeking to build communities that exclude or make us look foolish, or make others feel insulted. We're setting out to create things that feel cutting edge and exciting and engaging. The last thing we think we're doing is conveying a message that we're out of touch with modern reality.

00:06:59.200 But the subtext keeps coming up. Postel's law tells us to be conservative in what you do and be liberal in what you accept from others. What I want to discuss today is how we can do both: how are we going to bring those modern realities into data, views, and logic? We start at the schema level.

00:07:18.720 As developers, we get tugged in two directions: keep that code base manageable, yet design it for modern complexity, and that's really hard. We have to get our schemas into alignment. Just as there are different meanings of normalization, there are different meanings of schema. First, there's that mental or psychological concept of schema—it's a set of our preconceived ideas, our mental framework for representing some aspect of the world. It's a system that we use internally for organizing and perceiving new information; it's our filtering system.

00:08:45.879 Then there are database schemas, which are essentially the same thing imprinted onto a database. Your very personal schema is being sent out into the world, reflected through your app. When you use Rails scaffolding, you're creating a migration. You've probably created one pretty much like this, though hopefully more complex. That ultimately gets translated into a unified schema to be used by the environment's database. These are just the front-end manifestations of that mental schema, and they vary, don't they? Everyone's is different, which goes to show that the idea that there is one way to describe human attributes or human behavior is always going to be wrong. As much as other humans are varied, so are we as developers.

00:09:51.760 Schemas are this foundation for expressing something deeply intimate, for expressing our self-image, our identity, our important relationships, our values—even our spirit, creativity, and uniqueness, which are not things we usually think about. They're defining user experience. Our code is user experience. Our schemas and our UX are leaving people behind, but we can fix that. When developing a schema that's going to ask a person about their experiences, their feelings, and their sense of self, there isn't going to be a right way to do it.

00:11:17.760 What I can tell you is this: we can evaluate the trade-offs and ask what benefit will the user notice. Notice, I didn't say how will the user benefit; that lets us off the hook. It gives us a lot of latitude to decide that whatever great thing I want to give you is something you're going to benefit from and love, thus fulfilling your need. We need to think in terms of the user's mindset; what is it that they will notice, pay attention to, and what will feel relevant for their life—whether or not it's really cool. Evaluating from that user perspective gives us focus.

00:12:03.120 You can look at this chart and there isn't any point where everyone is going to see maximum benefit—whether you're thinking of all the users or developers versus user base. So, we have to start making strategic choices. Check boxes, radio buttons, and select menus imply that all possible values can be represented. We're saying, 'The user must pick one, pick the right one,' or in the select box case, 'maybe pick the right ones.' But we're still suggesting that everything there covers all the possible bases.

00:14:16.640 It's the real world being rejected because it didn't happen to look like our mental schema. Checking a box is a convenient option, and that's really simple for the user. It feels simple for us to convey. Entering a text string takes more work, and then we get results that are messier. So both users and developers kind of agree: which one seems more preferable? A free-form solution like a text area or a text input doesn't seem exciting at first, but a free-form field does deliver some real benefits for users. At Metafilter, gender has been a text field for over a decade now, and this is interesting because initially, the user base was very tilted towards developers, and there were quite a few who were horrified that Matt was going to do this.

00:15:32.160 That was for about five seconds, then they jumped on board because what they learned is that they could be creative and silly. So take a minute, go ahead. Because they could express this thing about themselves fully and with an authentic voice, they really went for it. That text field, that one text field for gender, has grown into a beloved institution. And what you as a user put into it now says something revealing about who you are—that you're allowed to put anything into it, or put in nothing at all. This says something revealing about what Metafilter is meant to be for.

00:16:51.379 That schema's trust of the user was the foundation for them to start asking for more: 'Give us more places to reveal ourselves.' All those requests were by user initiative. Today, Metafilter's users are trusted with a whole lot of free forms, including ones that most developers would, I think, instinctively try to constrain and validate away. 'This is not a valid location!' Right? Go for it! The field values are blatantly contradictory; they're wrong, they're ridiculous. We're not going to object to that. The message here is clear: 'User, it's okay! We've got your back. We get it—this is your home. We're going to make it as comfortable and personal for you, and you should make it as messy and comfortable for yourself because we can handle it as developers.'

00:18:03.840 When you're collecting data on people, you're in a very different realm now, moving from computer science to social science. If you want to run useful analytics about personal attributes and behavior, then data collection is going to have to meet some very different criteria. There are two minimum criteria you need to meet for gathering human subjects data. First, a field's options need to be exhaustive, meaning there can be nothing left in the universe that could possibly be true for that field and isn't on the list. Second, the field's options must be mutually exclusive; there can be no overlap at all. If we say the options are married and divorced, is there any possibility of overlap there at all? Some people get divorced and then they're married again. Do we really meet the criteria? How many places in your apps don't meet them?
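The two criteria above can be checked mechanically against sample data. What follows is a minimal sketch, not anything shown in the talk: the function name `check_options` and the three sample people are invented for illustration, and it assumes you already know which statuses are true of each person.

```python
def check_options(options, people):
    """Return (uncovered, overlapping) names for a proposed option list.

    `people` maps a name to the set of statuses that are true of that person.
    """
    offered = set(options)
    # Exhaustive: everyone must find at least one option that is true of them.
    uncovered = [name for name, truths in people.items() if not truths & offered]
    # Mutually exclusive: nobody may match more than one option.
    overlapping = [name for name, truths in people.items() if len(truths & offered) > 1]
    return uncovered, overlapping


# Hypothetical sample data, invented for illustration.
people = {
    "Ana": {"married"},
    "Ben": {"divorced", "married"},  # divorced once, now married again
    "Cy": {"widowed"},
}

uncovered, overlapping = check_options(["married", "divorced"], people)
print("not exhaustive for:", uncovered)
print("not mutually exclusive for:", overlapping)
```

With only "married" and "divorced" on offer, the list fails both criteria for this sample: one person has no option at all, and another fits two.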

00:20:31.680 We see this all the time; this seems like a pretty standard, clear, exhaustive set of values. And then you find out there are still a few more, and then more. We're still not anywhere near complete; we're never going to get there by adding more list items. And that's fine! Instead, we can choose to look at human data from a really different perspective. Data doesn't have to be for analysis, folks. It's really easy to get into the habit of structuring data for the purpose of analysis and lose track, thinking that that's the reason it exists.

00:21:51.680 But we can step back and wallow in that user perspective; that data is for sheer expressiveness on their part. It has character, individualism, distinctiveness; it's who they are. It's not just their data! Diaspora is an open-source social networking project. A couple of years ago, one of the developers, Sarah, made gender a text field there too. That was one of the commits, and just like on Metafilter, the users had such a blast with this stuff.

00:23:03.760 But again, developers were not so amused. Unlike Metafilter, which is a closed service run by just a few people in English, Diaspora is intended to be internationalized and is a traditional open-source project with a ton of people, all with their own personal opinions. One of the big complaints raised was that this messes with internationalization. If you take away clearly delineated genders, how do I make pronouns?

00:24:19.080 Here's my response to that: you can't. It's a rat hole! If you go for gendered pronouns, not only do you get tripped up in English, it gets so much worse in other languages that depend heavily on the relationship between nouns and gendered pronouns in ways that English doesn't conceptualize. It will get so messy, so fast. You shouldn't be thinking in terms of being wedded to gender for the purposes of language. You should think about how to avoid it and how to work around it, because depending on gender comes with such a level of complexity that you'll just end up in a sinkhole.

00:25:20.480 We all know who Randall is, by the way, right? XKCD. So if it's going to screw you over like that, you want to think about avoiding it. You want to think about rewording. You want to think about any possible scenario in which you're not depending on gender. You can collect it for other reasons, but if you're depending on it, eventually, when you internationalize, it's going to hurt. If it truly is a requirement for your app, then the next way to cope with it is to start asking, and this was Randall's solution.

00:26:24.160 He's examined this problem a bunch of times for English-language projects. If you've read any of his stuff, when he examines something, he does it in excruciating detail. After quite a lot of research and thought, he concluded: the bottom line is you have to ask straight up what pronouns do you prefer? That really was the best he could come up with, and this is simple for English. As complicated as this looks, it gets harder in other languages. So we're right back to the problem of how do we get this done? Yammer developed a really great solution for all their internationalization—they just crowdsourced it to the users.
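Asking for pronouns directly, and rewording when the user has not answered, might look roughly like this. It is a sketch under assumptions: the `Pronouns` and `User` classes and the `activity_line` function are hypothetical names, and the example covers English only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Pronouns:
    """Pronouns exactly as the user typed them; nothing is derived from gender."""
    subject: str      # e.g. "she", "they", "ze"
    obj: str          # e.g. "her", "them", "hir"
    possessive: str   # e.g. "her", "their", "hir"


@dataclass
class User:
    name: str
    pronouns: Optional[Pronouns] = None  # optional: the user may decline to answer


def activity_line(user: User) -> str:
    """Build a feed sentence, rewording to avoid pronouns when none were given."""
    if user.pronouns is None:
        return f"{user.name} updated a profile photo."
    return f"{user.name} updated {user.pronouns.possessive} profile photo."


print(activity_line(User("Sam")))
print(activity_line(User("Kai", Pronouns("ze", "hir", "hir"))))
```

The design choice here is that the sentence never branches on a gender field; it either uses what the user supplied or falls back to a wording that needs no pronoun at all.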

00:27:59.680 So you can ask individual users what do you want, and then ask your international user bases, 'What is this likely to be? What should this be?' and let users take it from there. You don't need to know all the pronouns! You just need to be able to give people a starting point. As developers, we have this vision of what a good code base should be and what it looks like in the most sensible ways.

00:29:00.040 We often arrive at solutions that are actually truthy and yet far removed from real-life utility. So we go after these goals, right? We like structured data, very predictable—linear—it lines up so nicely and groups so well; relational indexing to the nth degree. We think we're doing exhaustive work because we have this belief that all this stuff will lead us to nice, easy, useful analytics, and then we'll be able to make good data-driven decisions.

00:30:23.600 We're not going to be doing all this intuitive stuff with CS. We look at data, and what we actually do is impose a lot of work on ourselves that's getting in our way and interfering with the users. So we start adding all these validations to constrain the data in ways we want. We start throwing exceptions because the stuff isn't really following constraints that we think they can meet. We have conditionals and partials to deal with so many different scenarios because human beings are so variable. Now we have to deal with each of these possibilities, which are limitless. We're doing all this work, and it's premature optimization.

00:31:51.200 There are simpler ways to deal with this that don't have to confront the infinite cultural variability, the infinite individual variability, and the fact that we're always changing. Even if you can nail down all the possibilities of what humans are like today, by tomorrow, we're all screwed again. We're making decisions on data that looks really good, but... ARIS, the American Religious Identification Survey, is the largest ongoing survey of America's religious identifications, and it just asks a really simple open-ended question: what is your religion, if any? This nets them about 100 unique answers, and if you're creating a form based on this, you already know what's wrong.

00:32:32.600 Right? 100 list items is just not going to work. There's no form element that will make this easy for users to pick out something that fits for them. So ARIS found they could compress that list down to 13 major categories. If you're doing a drop-down or checkboxes, this feels a little more manageable, right? We could use that on a form, although it still leaves an awful lot of edge cases, and we want to focus on the genuinely major groupings, right? We're not trying to find out who the two people in the whole user base are who have some obscure religion.

00:34:04.640 So we can just boil it down to this: it turns out that about three-quarters of Americans are Christian. I didn't know that either, by the way! So we're done right there! Every religion other than Christian is just clutter! All edge cases! We're very comfortable leaving off edge cases, right? And then there's this sort of extra crummy data that we've got with it, and we probably start assigning nil values to some of these categories. Maybe not 'other,' 'don't know,' or 'refused'—these feel a little like nils. We'll just fix that for them.

00:35:39.520 This focuses attention on the next problem, which is that one in five of these responses are not useful answers for us—certainly not from an advertiser's perspective, and certainly not for what we're probably trying to do with our own internal analytics. So, those nils gotta go! What we're left with now is a really good, clear, well-normalized list that covers all the big stuff. If you get reductive enough for Americans, we discover that religion is, in fact, a binary from a storage standpoint. Isn't this awesome? I mean, we can just reduce this whole thing down to a boolean score! From 100 to boolean? Come on!

00:36:55.760 But we don't do this stuff! We don't, because we know that although it covers the biggest categories, we recognize that this leaves people out. People just aren't edge cases—they push back on being treated like one! So all this attempt to be reductive has problems. But it's not just the reductive; trying to scale upward also has problems. We've always got this difficulty of trying to balance between approaches.

00:38:02.040 And instinctually, we get kind of uncomfortable with this—not deliberately structuring data for easy analysis. This gets me a little uptight, frankly, I really do. The foundational question we have to ask, though, is again, what benefit will the user notice? It's not about servicing our discomfort; it's about alleviating theirs. If necessary, we can find ways to strike a middle ground on that.

00:39:00.920 A guided response, such as an auto-suggest, can be helpful. We can give a text field that allows them to be as expressive as they want but still suggest a couple of options that we think are good starting points, letting them have their way as soon as they deviate from the expected values. So that auto-suggest becomes that middle ground, but I would suggest being very minimalist about it. When you think there’s a subset of values you’re most interested in, just select the handful you can for minimal auto-suggest.

00:40:25.840 This isn’t a place to pre-populate with every field you already have in the database or every value you’ve got for that field. You want to just use those few values to provide structure to those who want it, while also giving that free form to incite the kind of Metafilter and Diaspora expressiveness and nuttiness that hopefully will get them excited about being on this site in the first place.
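A minimal auto-suggest along the lines described above could be sketched as follows. The three suggested values and the function names `suggest` and `accept` are assumptions made for illustration; the point is only that the suggestions are a short starting list and that whatever the user types is accepted as-is.

```python
# A deliberately short, hypothetical list of starting points (not a whitelist).
SUGGESTIONS = ["female", "male", "non-binary"]


def suggest(typed, suggestions=SUGGESTIONS, limit=3):
    """Return up to `limit` suggestions that start with what the user has typed."""
    prefix = typed.strip().casefold()
    if not prefix:
        return []
    return [s for s in suggestions if s.casefold().startswith(prefix)][:limit]


def accept(typed):
    """Accept any free-form value; an empty answer is stored as None."""
    value = typed.strip()
    return value or None


print(suggest("fe"))
print(suggest("robot"))
print(accept("  robot overlord "))
```

Note that `accept` never consults `SUGGESTIONS`: the list guides users who want structure and is ignored the moment they deviate from it.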

00:41:42.560 And even when you don’t do that (Metafilter has no structure at all and no suggestions), about 40% of the responses still look pretty normalized to me. I mean, you can figure out what they were going for, right? So, we don’t have to treat this as a choice between chaos and structure. There can be balanced ways of dealing with this stuff.
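To see how much of a free-form field is already close to normalized, one option is to tidy values at analysis time rather than at input time. This is a rough sketch with invented sample responses; the `normalize` helper only trims whitespace and folds case, and it leaves the stored values untouched.

```python
from collections import Counter


def normalize(value):
    """Collapse whitespace and fold case so near-identical answers group together."""
    return " ".join(value.split()).casefold()


# Hypothetical free-form responses, invented for illustration.
responses = ["Female", "female ", "MALE", "male", "yes please", "  Female", "robot"]

counts = Counter(normalize(r) for r in responses)
for value, count in counts.most_common():
    print(f"{count:>2}  {value}")
```

The original strings stay in the database exactly as the users wrote them; the grouping exists only in the analysis.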

00:42:58.760 It’s true that data quantity is lower. Freed from the obligation to provide data about ourselves, a lot of us don’t provide it. On the other hand, now we’re improving data quality because people aren’t being coerced into giving a response: any response, including a false one, or one given just because they’re annoyed and would like to mess with the person who did this to them.

00:44:02.920 It’s fine to mix and match approaches here, and you can find a solution that works best for you, for your users, and for your business objectives. Like I said, there isn’t one right answer I’m going to tell you you should do. Facebook makes relationship status completely optional, but then they get coercive. If you do decide to provide anything, you have to choose one of their values. Most users—60%—do provide relationship status.

00:45:35.720 So even in cases where people have free choice, a lot of them will opt-in if you give them the opportunity. The bottom line is that we really want people to feel excited about what we're building, right? That's the whole point. We feel excitement in making it, and we want them to feel excitement too.

00:46:56.760 We want them to feel that passion. We want them to be engaged, connected, staying in our community and building it up. We want analytics and investment and monetization to all be premised on good data that will help us build further. But the data that's collected through coercive approaches has the risk of just being complete garbage because of those people lying because they feel forced to.

00:47:10.960 So the conclusions we draw from that bad data are just going to misdirect our decision-making about the next stages of development. We're not winning by forcing people into false choices. The restrictive options—that stuff at the bottom left—those don’t actually have to be marked as required either! We have that bad habit.

00:48:35.760 But the way we set up schema often embeds assumptions that we should and that we will. So we do a field that's not allowed to be null, and it is destined to be mandatory. A field that sets a very short length is asserting that any reasonable value is going to fit within it. This migration implies you're either going to be male or female, not reasonable values if you're transgender.
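The assumptions described above can be made concrete with a small schema. This is not the migration from the talk's slide; it is a sketch using SQLite through Python's standard library, with a hypothetical `users` table and a hypothetical `try_insert` helper, that shows how each constraint turns a real person's answer into an error.

```python
import sqlite3

# Hypothetical restrictive schema, written to mirror the assumptions in the text.
RESTRICTIVE_SCHEMA = """
CREATE TABLE users (
    id     INTEGER PRIMARY KEY,
    name   TEXT NOT NULL,
    gender TEXT NOT NULL                    -- mandatory: no way to decline
           CHECK (length(gender) = 1)       -- assumes one character is enough
           CHECK (gender IN ('M', 'F'))     -- assumes only two valid values
);
"""


def try_insert(conn, name, gender):
    """Return True if the row is accepted, False if the schema rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO users (name, gender) VALUES (?, ?)", (name, gender)
            )
        return True
    except sqlite3.IntegrityError:
        return False


conn = sqlite3.connect(":memory:")
conn.executescript(RESTRICTIVE_SCHEMA)

results = {
    "M": try_insert(conn, "a", "M"),
    "genderqueer": try_insert(conn, "b", "genderqueer"),
    "declined": try_insert(conn, "c", None),
}
print(results)
```

Only the first row is stored. The other two users are rejected before any application code gets a chance to treat them well.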

00:49:41.440 If you're transgender, you wind up feeling coerced into a response that's inauthentic, and that user experience starts right here at the first thing we do. Boom! That's how we fix that! It's the foundation for an entirely different user experience, and the really cool move is that we win by doing less, making this stuff flexible at least up front. Optimize for storage later, decide what's valid later.
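A flexible-up-front schema in the spirit of this passage might look like the sketch below. It is an illustration in SQLite via Python's standard library rather than the migration shown in the talk, and the `users` table and sample rows are invented.

```python
import sqlite3

# Hypothetical flexible schema: the field is optional and unconstrained.
FLEXIBLE_SCHEMA = """
CREATE TABLE users (
    id     INTEGER PRIMARY KEY,
    name   TEXT NOT NULL,
    gender TEXT    -- nullable on purpose; free-form; no length or value limits
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(FLEXIBLE_SCHEMA)

# Invented sample rows: a common value, a less common one, no answer, a sentence.
rows = [
    ("a", "female"),
    ("b", "genderqueer"),
    ("c", None),
    ("d", "it's a long story"),
]
with conn:
    conn.executemany("INSERT INTO users (name, gender) VALUES (?, ?)", rows)

stored = conn.execute("SELECT gender FROM users ORDER BY id").fetchall()
print(stored)
```

Every row is accepted, including the one with no answer. Decisions about storage optimization and validity can be made later, once the collected values show what people actually say.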

00:51:15.600 Right now, the best thing you can do is to start collecting values and make discoveries. So, you have a discretionary field: let people respond in whatever way they want and discover what they're telling you. As developers, we look at stuff like this and think it's completely useless and redundant. Is it? It's true that allowing null is the default, right?

00:52:24.720 But by making this explicit, it’s a communication to the team and to your future self. It's a statement of intent; it's documenting a product decision that was consciously made. We often wonder what a canonical set of relationship statuses would look like. Three years ago, this was Facebook's list, and they figured it was pretty good—arguably even progressive.

00:53:38.480 I mean, hey, we got 'open relationship' in there, and 'it’s complicated.' This sounds pretty modern, right? But users disagreed completely and very strongly. So, under pressure, just two years later, they added many more options.

00:55:12.400 And when Google Plus launched, they largely adopted that list, notably leaving out 'separated' and 'divorced.' Observe that they also added something that Facebook hadn't: the option of choosing whether to say anything at all, allowing users to identify relationships with labels of greater personal significance. That's being driven by user experience.

00:56:32.560 How is it that some statuses seem universal while others don't? Naming a thing creates a scope. The assumed validity of a field's values gets constrained as soon as the field is named. For instance, marital status might lead to a list where you're assumed to be unmarried, preparing to be married, currently married, or formerly married—these are all the possible states.

00:57:13.440 That's if you're talking about marital status. If you change your paradigm and think about relationship status, then the possible values change. Notice these aren't polar opposites, and they're not completely inclusive. We still just have a different way of thinking. Now, we're thinking in terms of whether one has a current relationship or if their relationship status is defined by absence.

00:58:26.880 But our real lives aren't organized as neatly as either of these. We go through life experiencing many relationships, and sometimes new statuses don't replace the old ones. If you change the name entirely, you can shift the paradigm and the possibilities completely. I love what Flickr did here: they're not trying to be a dating site, but they recognized that users were trying to date and so they did something different entirely: 'singleness status.' They really upended the whole notion that the important thing here is to arrange your profile around marriage.

00:59:52.640 Naming fields with great specificity up front will make analyses more structured and powerful later. You know what it is you’re collecting, and you’re constantly reminding yourself every time you use that field what it's for.

01:00:57.680 I like to be truthful. 'It’s complicated' is really deceiving, for me. It’s not complicated for her—she’s separated, she’s still married. What status are you supposed to choose? Single? Separated? 'It’s complicated'? Sure! But then it’s leaving out the things that are there and are true of her.
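One way to let several statuses be true at once is to model status as one-to-many rather than as a single column. The sketch below is an assumption-laden illustration in SQLite via Python's standard library: the table names, the `statuses_for` helper, and the sample people are all hypothetical.

```python
import sqlite3

# Hypothetical one-to-many model: a user has zero, one, or many statuses.
SCHEMA = """
CREATE TABLE users (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE relationship_statuses (
    id      INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL REFERENCES users(id),
    label   TEXT NOT NULL    -- free-form, in the user's own words
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

with conn:
    conn.executemany(
        "INSERT INTO users (id, name) VALUES (?, ?)",
        [(1, "Ana"), (2, "Ben"), (3, "Cy")],
    )
    conn.executemany(
        "INSERT INTO relationship_statuses (user_id, label) VALUES (?, ?)",
        [(1, "married"), (1, "separated"), (2, "single")],
    )


def statuses_for(conn, user_id):
    """Return every status a user has chosen, in the order they were added."""
    rows = conn.execute(
        "SELECT label FROM relationship_statuses WHERE user_id = ? ORDER BY id",
        (user_id,),
    )
    return [label for (label,) in rows]


print(statuses_for(conn, 1))
print(statuses_for(conn, 3))
```

Here someone who is both married and separated can say so, and someone who prefers to say nothing simply has no rows.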

01:02:06.960 So tell me if you can spot the fatal flaw in this one; take a second, it’ll come to you. Say it louder! So what’s the relational model here? This is supposed to be inherently one-to-many. Alright, and of course, we're talking in database terms: one-to-many, right? Where the possible values can be zero, one, or many. Yep. This is the difficulty of mapping databases to people. But yeah, this is what happens when you bow to the will of the people in a really unthinking fashion. They said they wanted an open relationship? Fine! We’ll add it to the list!

01:03:12.480 So this is Facebook failing at modeling relationships in a relational database. I love it! Here are the takeaways that I hope you can have today. The first one is that modeling the real world is really complex, and that's fine. The next is that early constraints in schemas are going to net us crappy, misleading data.

01:04:40.640 We're assuming we know who users are, and that's surrendering our opportunity to discover who they actually are. We give ourselves a chance to find out. And third, you know that whole free-form text field? It's actually not going to kill us! As weird as it seems, data quality improves when lies are merely optional—not required. We get data that's rich and specific, and then we get to unearth patterns that are undetectable when data is generic and prefabricated.

01:05:56.960 We get to discover and adapt when we show trust in people. We also get to feel good about them placing trust in us. We get to have a conversation, and that one-to-one relationship between them and us leads to their engagement, passion, and loyalty—all of which we want. That's the foundation for great user experience.

01:07:22.760 Alright, so I happen to be a huge fan of Q&A, so I'm hoping that you have some questions.

01:08:07.280 We have other systems that we need to talk to that ask us for that information. I think the current responsibility is... Right, well, I will go back to my earlier argument that people, when they’re required to give something, can present data that looks incredibly clean but has no relationship to the real world.

01:10:01.920 Advertisers or whoever you’re monetizing this with feel really happy, but they could be getting better results. It’s worthwhile for us as developers— I mean we use opinionated software, right? We’re allowed to be opinionated developers and say, 'You’re wrong! We can give you better data!' We refuse to give you crappy data. I know not everyone will take that answer.

01:11:20.960 Can you speak up a little louder?

01:11:46.480 So, if you have legacy systems that are still dealing with that stuff, you have to deal with those systems too. So what can you do about that? I would actually like to hear from you how you’re dealing with it.

01:13:00.280 Essentially, they try to extract values that seem reasonable—what the advertiser is looking for. Whatever the other party expects to see, and try to pull out as much of that as possible. One thing you can do is simply be more clear with users about why you're asking for this data.

01:14:30.080 If you have to give them pre-selected fields and make it required, just saying why rather than leaving them to assume you’re just nosy or that you’re being like Facebook and Google—scooping up all the data that exists in the world—often makes them more willing to share that in a form that's easier for you.

01:15:48.600 I think a lot of this has to do with communication. Communicate with the users more candidly and communicate with the consumers of our data as well more openly about why we think that what you want is the wrong thing to do! You’re not going to be happy with the results of what you're asking for.

01:16:37.760 When you're doing analysis on the data you collect, how have you seen that analysis change when you're doing things this way? Okay, so if you start collecting free-form data, what can you do to analyze it then? How does that change analysis?

01:17:52.040 In part, you get into interesting stuff like machine learning! I mean, you can really start playing with this stuff. This is an opportunity to use things like NoSQL for storage. You can do document-based storage instead of being quite so wedded to relational databases. It's an opportunity, I think, to go technology spelunking, as I would describe it, and learn some new things!

01:18:47.120 And discover how data is changing over time. You're not necessarily having to look at all those individual values, but for instance, are you detecting a few trends? I would be particularly interested when you're starting out with something like gender or age or marital status to leave it open enough so that you get to find out how people describe themselves.

01:20:04.240 It may not even be that we're looking to constrain fields, but is there a particular way that they characterize themselves? If it's even something like age, do they have a natural way of lumping themselves together that you can use in the future? So we're not always looking for a breakdown in really strict demographic terms; sometimes it's about discovering interesting clusters we wouldn't know existed.
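One way to start looking for those natural clusters, before reaching for machine learning, is simply to tally how people describe themselves. The following is a minimal sketch in Python rather than anything shown in the talk; the `normalize` and `tally` helpers and the sample answers are hypothetical and exist only for illustration.

```python
from collections import Counter


def normalize(response: str) -> str:
    """Casefold and collapse whitespace so trivially different spellings group together."""
    return " ".join(response.casefold().split())


def tally(responses):
    """Count normalized free-form answers, skipping blanks (the field is optional)."""
    return Counter(normalize(r) for r in responses if r and r.strip())


# Hypothetical answers to an open-ended, optional "age" field.
answers = [
    "Thirtysomething", "thirtysomething ", "30s", "old enough",
    "", "30s", "Old  Enough", "30s",
]

counts = tally(answers)
for label, n in counts.most_common():
    print(f"{label}: {n}")
```

The groupings come from the users' own words rather than from a predefined list, so a label nobody anticipated shows up in the counts instead of being rejected at the form.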

01:21:37.840 Do you think there is a reasonable limit to how expressive users should be allowed to be, and how does that interact with legal requirements? Can you give me an example? Oh, did everyone hear that? For example, if you have a site that might be about alcohol, you have to be over a certain age to legally interact with that.

01:23:10.480 So how do you interact with legal requirements versus user expression? Yeah, it depends. I express myself as 30, but I'm actually 15! Yeah, my birthday is always on January 1st, by the way, so be sure to get me a gift! In the U.S., there is a law like this. Certain websites cannot collect personally identifiable information about youth under 13. The law is pretty specific about what constitutes personally identifiable information.

01:24:35.440 It doesn't mean that every site has to exclude 13-year-olds, but it becomes a common checkbox for every user: 'Please give us your birth date so we can verify that you're over 13, so we can feel comfortable knowing that no one's going to sue us later.' We do get that kind of stuff.

01:25:28.080 You have to make some choices. If you're dealing with a legal constraint, then I can't say this is your solution; in your social context, you are going to have to talk to your lawyer! But one of the things you can do is start thinking about what information you really need in that context. You do not need to know their birthdate.

01:26:47.920 What you genuinely need is the information: are you over X age? As long as the lawyer says that is sufficient documentation, why are we asking nosy questions?
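As a rough illustration of that point, a signup record can store only the yes/no answer instead of a birth date. This is a sketch under stated assumptions, not a legal recipe: the `Signup` class, the `may_register` function, and the `MINIMUM_AGE` value are all hypothetical, and whether a self-declared answer is sufficient is a question for your lawyer.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed threshold for illustration only; the real number depends on your legal context.
MINIMUM_AGE = 13


@dataclass
class Signup:
    display_name: str
    # The only age-related fact stored: the user's own yes/no answer to
    # "Are you at least MINIMUM_AGE years old?" (None means not yet answered).
    confirmed_minimum_age: Optional[bool] = None


def may_register(signup: Signup) -> bool:
    """Allow registration only when the user has explicitly answered yes."""
    return signup.confirmed_minimum_age is True


print(may_register(Signup("Ko", confirmed_minimum_age=True)))
print(may_register(Signup("Ko")))
```

No birth date ever enters the schema, so there is nothing extra to protect, leak, or explain to the user.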

01:27:55.160 Oh. Yes? Regarding free-form data, are you assuming that all your users will be providing that data in the same spoken language? Good question! In a single spoken language?

01:29:02.040 An interesting assumption here is, if people are providing free-form data, can we make the assumption they're providing it all in the same language? At least all in the same written language, in this case? I think that's totally up to your app. It may well be the case, and in fact I would say it has to be the case, that you're going to get results in languages you didn't anticipate and character sets you didn't expect to be there.

01:30:08.360 And that should be the point. That’s one of those interesting trends to discover, right? I mean, you might discover that, hey, we need to be branching out to Singapore immediately! We didn't know that. I have to say that one of the things I do particularly in this research is I hack forms all the time to see how strong their validation is.

01:31:10.720 Gender is the easy one to check because they all think that male and female are the only possible options. So it’s easy—just open up the developer tools and start seeing what happens if you give a value that supposedly isn’t possible, which is how you get stuff like 'By the way, your gender is invalid.' That nice little example from Spotify? Just so you know, Spotify changed it like a day after someone tweeted about it, so it helps to take note! Did that fully answer your question, though?

01:32:47.920 The assumption is: don't make the assumption. Yeah, that’s true of anything we’re putting in the text field. That’s sort of the point. The point of having a text field is to stop making assumptions and see what happens.

01:34:00.000 Yeah, great shirt there! There's an enjoyable classic book on this topic called 'Data and Reality' by William Kent. It's from the 1970s, but it's not really about identity particularly; it's more generally about how hard it is to model the real world. Oh, okay! Thank you for that! Could you tweet me that title? I would love to know.

01:36:00.000 Thank you very much! I'll check into that.

01:36:58.000 You've been raising your hand for a while back there in gray. And obviously that applies.

01:37:46.000 Because quite often you see first name, last name. Yeah, I think McKenzie had a great blog post about assumptions about names! Yes! That people even...

01:39:20.000 Yeah! I love that one! There are also some imitators. So what he's talking about is a post called 'Falsehoods Programmers Believe About Names.' That was just the first 40 that he provided, and I think users provided easily 100 more myths, mostly out of their own outrage about how their name had been abused!

01:40:03.000 Apparently, anyone with an apostrophe in their name has a really hard life! I feel so sorry for them! I actually had to leave this out, but it's a great topic that is part of this, and I'm glad you raised it. Throughout Spanish-speaking countries and Portuguese-speaking Latin America, it is standard to have a surname for your child that essentially brings together both parents' names.

01:41:00.000 So throughout the Latin American world, it's common that your last name doesn't match either parent's last name, because each child's surname is constructed from their own parents' names. But also, they get really long names! The idea that there's a first, middle, and last gets thrown into chaos.

01:41:57.000 If you have users from the Spanish-speaking world or the Portuguese-speaking world, you should care, because Spanish is near the top of the list; it might be the second or third most spoken native language in the world! So hopefully, you are interested in having fields long enough to accommodate their real names.

01:42:43.000 That example I also showed of trying to validate first name—one letter or what was it? One or more characters? I met a number of people who kept complaining, 'Yeah, I'm so glad you talked about this! I constantly get turned away by fields that think my two-letter name isn't enough!'

01:44:14.000 So I actually found a website cited in the resources here that lists like 195 really common two-letter first names around the world. By first name, I should say 'given name,' because that's part of the question you're asking. We have this model that assumes there’s a first name, possibly a middle, and there’s a last name. And you know that is not at all a given.

01:45:21.000 Then we try to add on some stuff, like we assume there's a title, and if so, it has one of maybe four or five categories: you know, Mr., Ms., Mrs., Doctor, and then that's it! Which, for one thing, is not true! And sometimes these things overlap.

01:46:35.000 How do I know how to address somebody if I don't know what their first name is? This is where we’re looking at the difference between a legal name and a term of address. The important thing is that person, regardless of what’s on their documentation, should probably have a name!

01:47:22.000 Sometimes it's essentially their username, or they have some sort of pseudonymous name that they want to be known by, and that might even be a username. The point is that we shouldn't care what their actual name is; we should care about addressing them by the name they want to be addressed by.

01:48:05.000 In that context. Yes! Or just 'name' without putting any constraints on it, like saying first name or last name. I mean, that’s one of the great ways to deal with the problems of those Spanish names, right? Just don’t try to break it out. Just say, 'What’s your name?'
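A single unconstrained name field can be sketched in a few lines. The example below is an illustrative assumption, not code from the talk: `clean_name` and `greeting` are hypothetical helpers that accept whatever the user typed, without splitting it into parts or enforcing a minimum length or character set.

```python
import unicodedata
from typing import Optional


def clean_name(raw: str) -> Optional[str]:
    """Return the name as typed, Unicode-normalized and trimmed, or None if blank.

    Deliberately no splitting into first/middle/last, no minimum length,
    and no restriction on characters such as apostrophes or accents.
    """
    name = unicodedata.normalize("NFC", raw).strip()
    return name or None


def greeting(name: Optional[str]) -> str:
    """Address the person by the name they gave, or generically if none was given."""
    return f"Hello, {name}!" if name else "Hello!"


# Hypothetical inputs: a two-letter name, an apostrophe, a long compound name, a blank.
for raw in ["Ng", "O'Brien", "María José García Márquez", "   "]:
    print(greeting(clean_name(raw)))
```

Because the field is stored and displayed exactly as entered, two-letter given names and long compound surnames need no special handling.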
