
It’s Not the Data; It’s You. What is an ethical approach to data?

August 2020

In the early 19th century, American scientist Samuel Morton began accumulating what would become the world’s largest collection of human skulls — more than 900 at his passing. Beyond being an odd pastime, his collection and the conclusions he drew from it have powerful lessons to teach us about how to avoid drawing dangerous inferences from data — both in his time and ours.

At the age of 21, Morton studied medicine at the University of Edinburgh. There he encountered George Combe, a Scottish phrenologist whose work explored the relationship between human biology and intelligence. Combe’s logic, so it went, was ‘the bigger, the better’.

Back home, Morton’s curiosity about the diverse biology of the world’s people grew, as did his collection of their skulls. He began acquiring new and rare skulls like many people collect stamps, comic books or trading cards.

With this collection, Morton began his great study of human intelligence. He painstakingly poured lead shot into each skull’s cranial cavity and then decanted it back into measuring cylinders. He aimed to classify and rank the implied intelligence of what he defined as the five ‘races’ of the world: Ethiopian (or African), Native American, Caucasian, Malay, and Mongolian.

Conducting this experiment several times, with 672 skulls, he consistently came to the same conclusion — Caucasians had the biggest brains and, therefore, the greatest intelligence, while Africans had the smallest brains and, consequently, the least intellect.

All manner of awful people embraced these findings — imperialists, colonialists and segregationists, through to a certain well-known German fascist. Each used the work as scientific ‘proof’ to validate their world view, justifying the supremacy of white men for decades and legitimising everything from slavery to workplace discrimination.

Despite being ‘proven with scientific certainty’, Morton’s measurements have since been found by modern researchers to contain egregious errors. Stephen Jay Gould argues there is evidence Morton crammed extra lead into Caucasian skulls while leaving the African skulls rather lightly packed. Morton also ignored the relatively straightforward fact that bodies and brains are usually proportionate to one another, whatever their owner’s race. People with bigger bodies just have bigger brains.

More importantly, the underlying assumptions in his work have been disproven too. Brain size does not have a relationship with intelligence. There are no separate human races that constitute different biological species, and there is no difference between the intellectual capacity of one group to another. His fundamental premise — that you can classify ‘races’ according to innate intellect or character — is deeply flawed.

‘The practice of classification that Morton used was inherently political, and his invalid assumptions about intelligence, race, and biology had far-ranging social and economic effects,’ writes Dr Kate Crawford in Atlas of AI (2021). She warns that all classification systems are embedded in a framework of societal norms, opinions, and values. Even the supposedly hyper-rational artificial intelligence programs we use today are highly susceptible to these problems.

As we look to our future, this lesson has never been more important to learn. Leaders have long been enticed by the promise that we could transcend the mistakes and prejudices of individual decision making if only we had more data, using it to make more accurate, objective decisions, faster. It is on this promise that the fields of statistics and economics are largely based.

They were right to be enticed. The modern world only exists because we were able to systematise and standardise decision making. This created the fields of medicine, education, banking, and even the internet itself, lifting billions out of poverty to live longer and happier lives.

Over the last few decades, statisticians have become more ambitious, adopting glossy terms like ‘Artificial Intelligence’, ‘Machine Learning’ and ‘Big Data’. Yet the more ambitious their work, the more concerned the public has become, rightly asking what data is being collected, how it is being collected, how it is processed and what kind of lessons are taken from the process.

The methodologies behind these approaches are more elaborate than those of the 1800s, but the underlying principles are remarkably similar. They are mathematical tools that help us identify patterns in a complex set of data — the signal in the noise.

Yet, they remain confusing and concerning to many due to the language we’re using to describe them. ‘Artificial intelligence’ is not true intelligence; our machines are only ‘learning’ in the loosest possible sense of the word.

These metaphors and personifications are used in an attempt to explain a machine’s workings, but obscure them instead. They suggest a black box that ingests ‘raw data’ and can then provide objective, independent and unfiltered answers to our most challenging social and economic questions — acting in many ways like an Old Testament God. We approach the algorithm and ask to hear our fate. Insufficient ‘faith’ risks disaster. Refusing to engage leaves you ostracised and excluded.

What we say, write and hear about these technologies profoundly affects what we expect them to do and how we interact with them.

The truth is that every step of the data-based decision-making process is defined and created by people. It is collected, structured and stored by people. The maths underpinning the algorithm and its training is done by people, and its recommendations are then implemented by people.

Getting the best from data-based decision making and managing its risks safely are deeply human challenges. Which is great news for anyone anxious about being hunted down by a robot, but bad news for people hoping to make thoughtful public policy and management choices. The challenge is social, not technical. And there is still so much work to do.

The language and legal lines we draw now will shape the future. We must ensure governments, companies and nonprofits are beginning with a strong foundation. Learning from Morton’s mistakes, we must challenge the biases and assumptions which underpin our thinking — and the appropriateness of the data we collect, process and analyse.

In the following passages, I’ll explore what ‘data’ actually is and what it isn’t. Who owns it? How is it useful? And how do we get the best out of it while reducing the potential risks?

First, however, what attracted people to using data to make decisions in the first place? It seems like an awful lot of work to collect, store and analyse it all — and then what if you don’t like the answer?

Over the last 10 years, I’ve had the privilege of advising cabinet ministers and executives of Australia’s largest NGOs on how technology can enrich and empower their work. Time and again, I’ve shared data-based analysis which shows the rational next decision is x, only for that leader to instead choose to do y. Or make no changes at all.

This hesitation is especially acute when a leader’s work requires a great deal of social interpretation, as it does for lawyers, politicians, journalists, front-line doctors or the leaders of universities, social service providers and NGOs. Their work, and therefore their success, often depends on human relationships rather than numbers. And there’s plenty of evidence to suggest they should be cynical. There are too many examples of data-based analysis leading us all astray, often with disastrous consequences for the decision-makers themselves.

Public opinion polls, economic forecasting, and technology often fail to help us make better decisions, leaving many feeling that ‘big data’ is about as reliable as astrology.

So what is causing this? Do we not have enough data? Are there errors in the data, or the equations? Sadly, to use the cliché, ‘The fault, dear Brutus, is not in our stars / But in ourselves’.

Our brains yearn for a simple world. If we stop and think, we know the world is seldom simple or straightforward. But that’s precisely the problem — our brains would prefer we didn’t think too much about any one decision. They’ve got enough going on! So they actively nudge us towards decisions that ‘just feel right’. We’re especially susceptible when the shortcut relates to deeper ideas like culture, politics and social values.

These elements create social norms — things we see as fixed truths or givens but which are actually entirely made up. Culture and values create deeply embedded mental shortcuts called ‘heuristics’. These rule-of-thumb strategies help us quickly solve problems and make decisions without constantly stopping to deliberate over our next course of action. Yet these shortcuts frequently cause more damage than good.

When we collect, manage and process data into useful information, each stage builds dangerously on a compounding series of heuristics and cognitive biases.

For example, public opinion polls are just as accurate as they were ten, twenty or thirty years ago. Looking at controversial electoral outcomes, such as the 2016 US Presidential Election, the 2016 Brexit vote, or even the 2019 Australian Federal election — the polling averages showed these would be extremely close contests. But still, people were shocked at the outcomes. This isn’t because the data was wrong, but because of the layers of interpretation added by journalists, pundits and commentators who wanted to create a simple, easy-to-digest story about what was happening across the country. The more a narrative takes hold, the more self-perpetuating it becomes.

Statisticians themselves ultimately second-guess their own work, creating the effect known as ‘herding’. They emphasise similarities in their findings rather than results which challenge the narrative. Layer upon layer, the story drifts away from the data in front of us.

We cannot rebuild our brains, but we can humbly acknowledge their limitations — and create a plan to correct for their failures. When thinking about what ethical data use looks like, we must make sure there are strong checks and balances around the information we collect, how it is gathered, how we store, categorise and structure it, as well as how we use and analyse it to help guide our decisions.

What is data?

Research by IBM suggests that most people don’t know precisely what ‘data’ even is; the term conjures different understandings from one person to the next.

So what, strictly, is data? At its most basic, the Oxford Dictionary defines it as ‘things that are known or assumed to be facts, collected and standardised for reference or analysis’. ‘Data’ derives from the Latin ‘datum’, or ‘something given’ — which hints at a long historical tension between the collector and analyser on one side and the subject of that analysis on the other.

This information typically falls into a few categories, which I define as follows (and sketch in code after the list):

  • Demographic information, which doesn’t typically change (e.g. your name, birth date, gender),

  • Personal information, which remains largely consistent for long periods of time (e.g. your employer, mobile number, email address, postal address),

  • Behavioural information, which is constantly changing and evolving (e.g. what you like to buy, where you like to go, what you may have clicked on or searched for),

as well as information that is derived from these sources through analysis, such as:

  • ‘interests’, which are assumptions about what you might care for or enjoy, based on people with similar demographic, personal and behavioural information,

  • ‘networks’, which are assumptions about who and what you might like, based on perceived similarities with other people you know.
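To make these categories concrete, here is a minimal sketch of how a single customer record might be structured along those lines. It is purely illustrative: the class and field names are hypothetical and not drawn from any real system.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Demographic:
    # Rarely changes over a lifetime
    name: str
    birth_date: str
    gender: str

@dataclass
class Personal:
    # Stable for long periods, but does change
    employer: str
    mobile: str
    email: str
    postal_address: str

@dataclass
class Behavioural:
    # Constantly changing and evolving
    purchases: List[str] = field(default_factory=list)
    places_visited: List[str] = field(default_factory=list)
    searches: List[str] = field(default_factory=list)

@dataclass
class Derived:
    # Never 'given' by you at all: generated by analysis of the above
    interests: List[str] = field(default_factory=list)  # inferred from people similar to you
    network: List[str] = field(default_factory=list)    # inferred from who you appear to know

@dataclass
class CustomerRecord:
    demographic: Demographic
    personal: Personal
    behavioural: Behavioural
    derived: Derived
```

The last class is the one worth dwelling on: the ‘derived’ fields are not things you ever handed over, but guesses generated about you.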

It is easy, and terrifying, to imagine an unseen and unknown person flicking through your personal file. It evokes the darkest stories of authoritarian regimes; party apparatchiks and goons bursting into your home in the early hours, grilling you about your sexual proclivities, the ideological symbolism behind your brand of milk and ‘why didn’t you buy those trainers you looked at on The Iconic three days ago? You’ve got the money! We’ve seen your bank account!’

Somewhat comfortingly, our personal data is rarely available to view like that. Rarely is it even all in one place. Typically it is anonymised and used in bulk, mixed in with the data of thousands of others to identify patterns and trends across large groups who share common characteristics. This allows organisations to answer questions like ‘Is this person more likely to prefer beach holidays or snow holidays?’, or ‘Is this person more or less likely to buy a new phone each year?’. It tells decision-makers when train stations are likely to be busiest, how quickly traffic is moving across the city or where COVID is most likely to spread next. I call this type of analysis ‘Cohort Analysis’.
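To give a feel for what cohort analysis involves, here is a minimal, hypothetical sketch. The records and characteristics are invented; the point is that the analysis groups anonymised records by shared characteristics and reports how common a behaviour is within each group, rather than stating anything certain about any individual.

```python
from collections import defaultdict

# Hypothetical, anonymised records: no names, just shared characteristics
# and one observed behaviour.
records = [
    {"age_band": "25-34", "city": "Sydney",    "bought_new_phone": True},
    {"age_band": "25-34", "city": "Sydney",    "bought_new_phone": True},
    {"age_band": "25-34", "city": "Sydney",    "bought_new_phone": False},
    {"age_band": "55-64", "city": "Melbourne", "bought_new_phone": False},
    {"age_band": "55-64", "city": "Melbourne", "bought_new_phone": True},
]

# Group people who share characteristics into cohorts...
cohorts = defaultdict(list)
for r in records:
    cohorts[(r["age_band"], r["city"])].append(r["bought_new_phone"])

# ...then estimate how common the behaviour is within each cohort.
for cohort, outcomes in cohorts.items():
    likelihood = sum(outcomes) / len(outcomes)
    print(cohort, f"about {likelihood:.0%} likely to buy a new phone this year")
```

An individual who happens to sit in the ‘25-34, Sydney’ cohort is then treated as ‘more likely’ to buy a new phone: a nudge for the organisation, not a fact about the person.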

Commentary around topics like Cambridge Analytica conceptualises this as a mystical and unknowable force, capable of manipulating or hijacking our brains. This way of thinking attributes unexpected societal changes to ‘data’ in the same way our forebears attributed them to the stars. Sadly, it just isn’t true.

Cohort analysis is suggestive, not determinative. It can help organisations improve the confidence and speed of their decision-making, giving adopters an edge over their competitors. It can help nudge us into choosing something that we may like to order or watch — but ultimately we need to make the call.

However, a second form of analysis is becoming widely used — one I call ‘Assessment Analysis’. This approach is increasingly used by organisations who want to quickly reduce uncertainty about you as an individual by comparing you to broader patterns — pursuing the most ‘desirable’ people while excluding others. It often perpetuates existing discrimination and structural inequalities when you’re applying for a job, a house, a home loan or an airfare.
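To show how different this is in kind from cohort analysis, here is a deliberately crude, hypothetical sketch of assessment analysis. An individual applicant is scored by how closely they resemble the pattern of tenants a business previously rated ‘desirable’. Nothing in the code mentions a protected attribute, yet the historical pattern it leans on can quietly reproduce existing discrimination.

```python
# Deliberately crude and hypothetical: the 'desirable' pattern comes from
# historical data about tenants a business previously favoured.
historical_pattern = {"income": 95_000, "years_in_current_job": 4, "past_addresses": 2}

def desirability_score(applicant: dict) -> float:
    """Score an individual by similarity to the historical 'desirable' profile."""
    score = 0.0
    for key, typical in historical_pattern.items():
        # Penalise any distance from the historical pattern, fair or not.
        score -= abs(applicant[key] - typical) / typical
    return score

applicants = [
    {"name": "Applicant A", "income": 92_000, "years_in_current_job": 5, "past_addresses": 2},
    # e.g. a younger renter or recent migrant with a shorter, messier history
    {"name": "Applicant B", "income": 58_000, "years_in_current_job": 1, "past_addresses": 6},
]

# Rank applicants: whoever least resembles the historical pattern goes to the bottom.
for a in sorted(applicants, key=desirability_score, reverse=True):
    print(a["name"], round(desirability_score(a), 2))
```

The person making the final call sees only a ranked list, not the assumptions baked into the scoring.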

Over the last decade of helping organisations navigate the murky waters of new technology policy, one troubling example of this comes to mind. Four years ago, a newly founded company approached me with a glossy sales pitch, promising to use technology to reduce friction and mistrust between landlords and renters, making it easier to get into a home. I decided not to take them on, for reasons which will soon become apparent, yet their business has blossomed. Through partnerships with 50,000+ real estate agents across Australia, they cover 77% of the industry (according to REIA data). If you want to rent in Australia, there’s a very high chance that the real estate agent will require you to use this third-party, commercial platform to apply. A platform whose owners can use the data collected in any way they please.

I experienced it first-hand last year while looking to move. For each home, the property manager mandated that applications be lodged via this platform, including the masses of personal identity, financial and employment information typically required. More than 100 points of individual data about each applicant are handed to this company — regardless of whether you’re ultimately selected. I could, if I pushed, complete a paper form at their office. But I was told that they ‘tend not to lease to people who do that’.

As an ordinary person at an open house, the consequences of this are not clear. You reasonably believe you’re entering into a direct relationship with the real estate agent and your primary concerns are whether you like the place, and how to charm the real estate agent to lock it down. You’re incentivised to be as cooperative and easy as possible. You typically look closely at the lease itself, not the T&Cs of the platform you use to apply in the first place.

You might ask, ‘Haven’t real estate agents always collected a lot of personal information? And surely that’s appropriate given they’re trying to filter out people who might miss rent or trash the place?’.

The difference here is that a commercial third party is covertly stepping into what was historically a one-to-one, one-off relationship between you and the real estate agent. Many people don’t realise their personal information is going into an enormous, highly detailed database which will be leveraged for private profit for years to come. Nor is this third party currently bound by the robust legal regulation which applies to the few other organisations that hold that level of personal information — such as your bank or the ATO.

Despite all of the debate about Google, Facebook, Microsoft and the rest of the big tech sector, when it comes to personal data it is these small-scale hustlers about whom I am most concerned.

Whether it’s the platform (an app or website) you use to book gym classes or lodge rental applications, the loyalty app at your coffee shop, or the site you use to check your credit score or get quotes for car insurance, there are plenty of suburban charlatans hoping to trick you into sharing information they can use to profile you and profit from it.

These actors are less interested in trends across large groups than in patterns within smaller, local and profitable subsections. Their practices are likely to exacerbate, not diminish, discrimination. Their small scale makes them more likely to cut corners: storing everything in one place, looking at it individually, or selling it on to another actor without you knowing. Not to mention they often have a sloppier approach to data security — leaving highly sensitive information exposed to hackers (or staff) who can use it to commit far more damaging identity theft.

There’s a clear and significant legal and regulatory failure here. We need new frameworks to enforce good behaviour in small and medium-sized organisations just as much as in multinationals.

What is data useful for?

The tidbits of information that make up ‘data’ are often depicted as a type of natural resource, similar to oil. As a result, there is a tendency to accumulate as much as possible or to stockpile it. This is a false comparison. Unlike oil, data doesn’t draw its value from scarcity, can be infinitely reproduced at little to no cost and typically increases in value as it’s ‘used’.

Having data isn’t enough. Without quality analysis, much data never achieves its potential value — creating competitive advantages or solving organisational challenges.

This analysis can be done by experienced data scientists (who are scarce and in high demand), or by algorithms (designed by experienced data scientists). Any advantage this analysis creates also has a remarkably short shelf life. To realise the value, organisations must act on their findings quickly.

Enduring value isn’t created by having data, but by enabling humans to develop methodologies, algorithms, and code for use on other data.

Now I know this is beginning to sound a little bit like a sci-fi film where time bends back on itself — but stay with me.

For example, ‘Artificial Intelligence’ didn’t beat grandmasters at chess and Go by being smarter. The feat was achieved by generating ‘training data’: learning from thousands upon thousands of ‘practice rounds’, played 24 hours a day, 7 days a week, 365 days a year. Many of the moves made in those practice rounds were quite stupid, but they informed a machine learning model; that is, they repeatedly taught a machine how to solve a problem by example.
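The underlying loop is more mundane than the mythology suggests: play, observe the outcome, nudge the model towards whatever led to a win, and repeat. Here is a toy sketch of that ‘learning by example’ loop. It is hypothetical and heavily simplified, nothing like the systems that actually play chess or Go, but the shape of the loop is the point.

```python
import random

# A trivial 'game': one of two moves wins more often, but the learner
# doesn't know which. It only discovers this through practice rounds.
win_probability = {"move_a": 0.4, "move_b": 0.6}   # hidden from the learner
value_estimate = {"move_a": 0.0, "move_b": 0.0}    # what the model currently 'believes'
times_played = {"move_a": 0, "move_b": 0}

for practice_round in range(10_000):
    # Mostly play the move currently believed to be best, occasionally explore.
    if random.random() < 0.1:
        move = random.choice(list(value_estimate))
    else:
        move = max(value_estimate, key=value_estimate.get)

    won = random.random() < win_probability[move]   # the outcome of this round

    # Learn from this single example: keep a running average of wins per move.
    times_played[move] += 1
    value_estimate[move] += (won - value_estimate[move]) / times_played[move]

print(value_estimate)   # after enough rounds, move_b is correctly rated higher
```

No single practice round teaches much; the value comes from the accumulated examples, which is exactly why the training data, rather than any flash of machine ‘insight’, is what matters.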

This is why my earlier example is particularly problematic. That company is collecting incredibly personal and rare data (very few people hold your full financial history), and it is both capable of analysing it and motivated to do so. Ultimately, it does this so it and its partners can include or exclude you from future services. Merely having the data isn’t the problem; the problem is the ability, capacity and motivation to drive action from the analysis of that data.

But who ‘owns’ all of this data?

Most people very reasonably see data about them as an asset that they own, thinking that companies don’t have a right to ‘take data from you’, or that you have a ‘right to ask for it back’.

Sadly, in most legal jurisdictions around the world, data cannot be ‘owned’ by anyone, at least in the traditional legal understanding of property. Legally recognised ‘property’ may be tangible (chairs, dogs and pencils) or intangible (software, creative writing, trademarks and patents). Often a significant component of intangible value lies in trade secrets or, as we usually call them in Australia, confidential information. Trade secrets are, however, generally not ‘property’.

Despite this, the stock market and investors clearly see value beyond the traditional definitions of assets or property. Trade secret ‘assets’ have often been valued at billions of dollars. Many trade secrets derive their value from closely guarded central control: the recipe for Coke, the Google search ranking algorithm, and so on. These trade secret ‘assets’ may not appear on the balance sheet as assets; their value is instead created by being closely held, which is how scarcity is created and managed. This is why our tech giants and many other businesses are so resistant to disclosing exactly how their algorithms work.

Additionally, adding friction to the sharing of personal data would disrupt much of our modern world. Ordering from Amazon or watching Netflix relies on a complex and often unseen data-sharing ecosystem of five or more data-holding entities to deliver a service to us: the platform itself, such as Amazon; the third party selling the goods; the warehouse or storage centre that physically holds them; an analytics provider that keeps track of how many goods are in the warehouse and where they need to go; a cloud data platform which ensures this data is safe and quickly available; and the credit card and delivery service providers.

But it is clear the status quo isn’t working.

The Edelman Trust Barometer shows public trust in business, government, NGOs and the media is at historic lows. 61% of people don’t trust that governments understand emerging technologies well enough to regulate them effectively.

But what should we do? If you lead one of these institutions, Harvard Professor Dustin Tingley offers a simple test — ask yourself ‘Is this the right thing to do?’ and ‘Can we do better?’.

But we know that hoping leaders act in good faith isn’t enough. We need to create a robust system of public accountability. Fortunately for regulators, this isn’t a new discovery. By my count, more than 175 sets of ethical data principles have been developed by governments and major global institutions like the OECD in the last few years.

These are often, sadly, hobby projects by isolated government agencies, but they share some useful common themes. Those who collect, store and analyse data on behalf of organisations should ensure that the people the data describes have a right to:

  • Ownership: Individuals should be able to legally own their personal information. This information is as much ‘ours’ as any other asset;

  • Consent: If an organisation asks for personal information, the organisation must ensure people provide informed and explicitly expressed consent of what personal data moves to whom, when, and for what purpose. This must be a negotiation, not a ‘take-it-or-leave-it’ proposition;

  • Transparency: When your data is used, you or a trusted regulator must have transparent oversight of the algorithm design used to generate conclusions from that data. Organisations must also be transparent about the decisions they are using the data to make;

  • Privacy: Whenever an organisation is the custodian of user data, it is obligated to ensure the data does not become publicly available: at the technical level (database security), the social level (ensuring strong ethical access standards among staff) and the procedural level (holding the data in a de-identified form as far as practicable);

  • Accountability of outcomes: Organisations must prevent situations where data analysis adversely affects one group of people with a fixed identity characteristic (e.g. race, sexuality, gender) over another. Additionally, individuals must be made aware when data plays a role in any decision that impacts their lives (e.g. understanding how an algorithm may have affected a home loan or rental application).

The Australian code also refers to broader values, expecting actors to actively ensure their tools support social equity and human, social and environmental well-being, ensuring fairness, reliability and safety for all Australians.

In most jurisdictions, these are still voluntary, in-principle codes. We need to see them backed by meaningful policy substance. This would likely include:

  • Using legal or regulatory tools to increase the transaction costs and decrease the payoff of opportunistic behaviour. Policymakers could achieve this by improving the enforcement mechanisms behind these principles, such as significant fines for poor behaviour;

  • Introducing a public right to transparency about how each tool is built and operates, removing the information asymmetry between actors and creating a more transparent, informed and accountable discussion about these issues;

  • Having governments and large organisations use incentives and coercion to encourage good behaviour, for example by only acquiring and using tools that meet these standards, or through an accreditation or standards process; and finally

  • Improving public literacy about these challenges and potential solutions, increasing the social expectation of good behaviour in our daily lives. This is one thing you can directly help to achieve! Share this Fabian essay with someone you know who would benefit from reading it.

As damaging as Morton’s work was, and as cruel as his advocates were, we should draw confidence from the fact that it was challenged, reviewed and replaced. Hundreds of other researchers worked in parallel, each in dialogue with and challenging the others. Only the most robust and best-supported conclusions ultimately survived and flourished.

In 1853, abolitionist minister Theodore Parker delivered a sermon that offers both a challenge and reassurance. He said, ‘I do not pretend to understand the moral universe. The arc is a long one. My eye reaches but little ways. I cannot calculate the curve and complete the figure by experience of sight. I can divine it by conscience. And from what I see I am sure it bends toward justice.’

That bend is not a given — it requires our active, intentional engagement. But I believe we have it in us to pull this arc towards justice.