Big Data: Ethics and Design

Roman Filippov
Mar 2, 2021

Data visualization is a powerful storytelling tool. In today’s world, it needs ethical rules to minimize misinterpretation and manipulation. Inspired by a presentation given by Armand Emamdjomeh and Andrew Ba Tran, investigative graphics reporters at The Washington Post, at the 2020 DC Design Week, I’ll share dramatic examples and important design rules for working correctly with data, from a designer’s perspective.

In the 2000s, there was a show on American TV called “Whose Line Is It Anyway?” Its participants performed comedy sketches and then received made-up points. The show had a slogan: “Where everything’s made up and the points don’t matter.” That phrase applies surprisingly well to the data underlying much of our world.

“All models are wrong, but some are useful”.

— George E. P. Box, Statistician

Dig deeper, and you’ll discover that all the figures managing large and small processes rest on a flimsy foundation. Take the coronavirus, for example. We have no clue about the real number of infected and dead. Yes, we have some general figures, but there are many factors to consider: mild asymptomatic cases, people who were ill and never tested, politics, undercounting due to hospital overload, staff fatigue, and much more. Human behavior is inconsistent and difficult to measure. Coronavirus statistics are attempts to assess it on a large scale. And this is true everywhere.

Economy

Macroeconomic indicators orchestrate politics and steer the plans and investments of large corporations. Yet GDP does not account for household labor or the shadow economy. Can such indicators be absolutely accurate?

In 2010, economists Carmen Reinhart and Kenneth Rogoff published an influential paper, “Growth in a Time of Debt.” It claimed that government debt exceeding 90% of GDP slows down growth. This conclusion became the basis for future political programs and the austerity mindset of the US Republican Party. British Chancellor of the Exchequer George Osborne relied on the paper to portray excess debt as the universal cause of financial crises: “As Rogoff and Reinhart demonstrate convincingly, all financial crises ultimately have their origins in one thing.”

Three years later, it turned out that there was an error in the Excel calculation behind the paper. The work was criticized by the expert community, and its conclusions were revised. This is a striking public incident: policies adopted on the basis of such an article could increase poverty and cost people their jobs. Just imagine how many other economic papers containing inaccuracies and errors have influenced important political and economic decisions.
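The bug was, at its core, a spreadsheet range that stopped short of the last rows. A toy sketch with made-up growth figures (not the Reinhart–Rogoff data) shows how easily that skews an average:

```python
# Synthetic growth rates, purely illustrative: a spreadsheet-style range
# that stops a few rows short quietly changes the headline average.
growth_by_country = {
    "A": 2.2, "B": -0.1, "C": 3.0, "D": 1.8,
    "E": 2.6, "F": 2.4, "G": 1.1,
}

values = list(growth_by_country.values())

full_mean = sum(values) / len(values)   # correct: averages all 7 rows
buggy_mean = sum(values[:5]) / 5        # "range" silently misses 2 rows

print(f"full: {full_mean:.2f}, buggy: {buggy_mean:.2f}")
```

Two excluded rows are enough to move the result, and nothing in the output warns you that they were dropped.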

Business

At the end of October, Spotify released a report stating that the service had reached 300M monthly active users. But what is the definition of an active user?

  • Is it someone who signed up for Spotify?
  • Is it someone who listened to at least one song?
  • Is it someone who logged into the app during the last month?
  • Is it someone who spends more than 30 minutes a day in the app?

This is an open question in all industries, especially in the digital arena. In theory, if you pay for something, you are a customer. But what if you are using a free trial? Or your credit card has expired and Spotify is trying to make you update your billing information? What if we collect statistics today, and tomorrow your subscription ends?

There are other figures that are difficult to define and unify — for example, work productivity or even income. Firstly, corporate financial reporting depends on estimates and judgments, which can be inaccurate even when the work is done properly. Secondly, the standard indicators may turn out to be a poor fit — just look at innovative startups — which leads to the emergence of unofficial alternatives. And finally, pressure on managers provokes deliberate reporting misstatements to please the board and the market.

Books and Scientific Articles

Have you ever wondered about the accuracy of the data published by famous authors? Only in the past few years have errors in such books sparked debate about the publishers’ obligation to take greater responsibility for the accuracy of published books, even though this is a very expensive and time-consuming process.

For example, in 2019, behavioral scientist and writer Paul Dolan published the book “Happy Ever After.” It received many positive reviews, including one from The Times. The book presents research findings on the correlation between marriage and happiness.

“Married people are happier than other population subgroups, but only when their spouse is in the room when they’re asked how happy they are. When the spouse is not present: f*cking miserable”.

— Paul Dolan, Behavioural scientist. Source: Vox

Later, economist and researcher Gray Kimbrough found that Dolan had misinterpreted the results of the American Time Use Survey (ATUS), in which the “absent spouse” category meant that the partner was living separately rather than merely being out of the room.

The situation with scientific journals is the same. Their publications are regularly criticized, even those of Nature and Science. There have been cases in which scientists deliberately submitted fake articles to journal editorial offices to expose the problems of peer review and of selection driven by trends and politics.

In 2018, three US scientists published fictional scientific articles in peer-reviewed journals that were frankly absurd, not supported by any data from serious scientific sources, and whose references consisted of non-existent research works. Among the works accepted for publication was even an article on feminism entitled “Our Struggle is My Struggle,” which turned out to be a barely rewritten chapter from Adolf Hitler’s “Mein Kampf.”

Such pranks often work in the exact sciences too, from the publication of the pseudoscientific article “Rooter: A Methodology for the Typical Unification of Access Points and Redundancy” (prepared by MIT students) to the authoritative publisher Springer’s sweeping purge of fake publications written by computer programs.

The data you rely on may be solid on the surface but chaotic underneath, made of imprecise values, research errors, sampling problems, plagiarism, and the other things surrounding human life. There is no real truth; every research study is a relative version of it, very similar to Plato’s Cave.

It should be remembered that the information you work with is to be treated critically, and here are a few simple tips:

  • Use only credible and widely cited sources with a good reputation.
  • Do not trust information without reference to sources just because it is “research by scientists” or “statistics”.
  • Study how and where the information referred to by the author was collected.
  • Try to find out if the source has a hidden motive to influence audience opinion.
  • See if there are outlying data, implausible figures, or cuts at certain points. It may be indicative of incorrect research, data collection errors, or manipulation.

Basic Rules for Big Data Visualizing

1. Be careful with the language you use in your infographics. Remember that there is a person behind each figure. If you have any doubts about potentially offensive phrases or expressions, please revise your text.

Now, imagine your COVID-19 visualization is seen by someone who has just lost a friend or family member. Maybe they didn’t lose anyone at all, but were infected themselves. What will their reaction be? So how do we treat these data souls with the reverence we need to show? Once you’ve completed your visualization, give it a graphic background and add text explaining what readers are about to see.

Here are a couple of good examples below.

To mark the coronavirus death toll approaching 100,000 in the US, The New York Times presented the names and biographical details of 1,000 COVID-19 victims, calling their deaths “An Incalculable Loss.”

The Washington Post published a long read about mass shootings in the US. It tells the stories of people killed by gun violence, ranging in age from the unborn to the elderly; 199 were children and teenagers. In addition, thousands of survivors were left with devastating injuries, shattered families, and psychological scars.

Always remember: Data points can represent lives.

2. Try not to round upward for emotional emphasis. If you do round upward, be sure to mention it.

3. Choose the right form.

Look at the maps below. They show how each county in the United States voted in the 2016 presidential election. It looks like a landslide. However, this is a wildly inaccurate representation of the population, because the little shapes representing counties have vastly different numbers of people living within them. Data scientist Karim Douïeb showed that a more accurate way to represent how people voted is to use colored dots whose sizes are proportional to the population of each county.

The 2016 Presidential Election. Try to impeach this? Challenge accepted!

But within each of those large blue dots, you still have plenty of people who voted red, and vice versa. These results only show you which party won the vote in each region.

Land doesn’t vote. People do.
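Douïeb’s trick can be sketched in a few lines. The populations and the `px_per_person` scaling constant below are arbitrary illustrative values; the point is that dot area, not radius, should grow with population:

```python
import math

def dot_radius(population: int, px_per_person: float = 1e-4) -> float:
    # Scale dot AREA with population: area = pi * r^2, so r grows
    # with the square root of the population, not linearly.
    return math.sqrt(population * px_per_person / math.pi)

r_rural = dot_radius(10_000)     # illustrative rural county
r_urban = dot_radius(1_000_000)  # illustrative urban county

# 100x the people -> 100x the area, but only 10x the radius
area_ratio = (r_urban / r_rural) ** 2
```

Sizing by radius instead of area would make the urban dot look 10,000 times bigger, which is exactly the kind of distortion the county map suffers from in reverse.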

Robert J. Vanderbei, a professor at Princeton, has also tried different methods of displaying presidential election results. When he saw a county results map the day after the 2000 election, he noticed that the county he lived in was shaded solid red despite a 51–49 split toward Bush. “Why not make it purple?” he asked. A week after the election, he published a map called “Purple America,” which shades each county on a continuous scale from blue to red.

The 2016 Presidential Election. Princeton University
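The idea behind “Purple America” is a linear blend between the two parties’ colors by two-party vote share. Here is a minimal sketch of that idea (a hypothetical re-creation, not Vanderbei’s exact palette):

```python
def purple_shade(dem_share: float) -> tuple[int, int, int]:
    """RGB color for a county, given the Democratic share of the
    two-party vote (0.0 = solid red, 1.0 = solid blue)."""
    return (round(255 * (1 - dem_share)), 0, round(255 * dem_share))

# A 51-49 county comes out nearly even purple instead of solid red.
close_county = purple_shade(0.49)
```

A winner-take-all fill throws away the vote share entirely; the blend keeps it, so near-ties look like near-ties.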

4. Adjust for the population.

Look at the two maps below. Both represent how COVID-19 spread throughout the US. The first map shows that Greater New York and California have the highest numbers of cases compared to other states. The second map displays the country’s hot spots by showing the number of infected people per capita.

COVID-19 Total Cases. New York Times (Jan 15, 2021)
COVID-19 Per Capita. New York Times (Jan 15, 2021)
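Normalizing by population is a one-line calculation. The figures below are rough, illustrative numbers, not the Times’ dataset:

```python
# Rough illustrative numbers: raw totals vs. cases per 100,000 residents.
states = {
    "California": {"cases": 2_900_000, "pop": 39_500_000},
    "North Dakota": {"cases": 96_000, "pop": 762_000},
}

def per_100k(cases: int, pop: int) -> float:
    return cases / pop * 100_000

rates = {name: round(per_100k(s["cases"], s["pop"]))
         for name, s in states.items()}
# California dwarfs North Dakota in raw counts,
# yet its per-capita rate here is the lower of the two.
```

The raw-count map and the per-capita map answer different questions, and only the second one tells a reader how hard their own community is being hit.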

5. Stick to basic rules. They exist for a reason.

The chart below is infamous for being extremely misleading. At a quick glance, it appears that after the law was enacted, the number of gun deaths decreased. But the white section isn’t actually the data — the red section is. The chart is upside down! This visualization goes against basic conventions: nobody expects charts to show data upside down. Conventions endure for a reason, and visualizations should always anticipate readers’ expectations.

6. Respect the proportions. This is more important than showing a visual difference from the viewpoint of design.

Donald Trump’s campaign has posted more than 40 bar-chart graphics showing favorable poll results. Here’s a chart the campaign tweeted on Oct. 5. Notice anything strange about it? How tall are the bars? Instead of ending at a clearly defined baseline, they fade to black. And if we add a baseline at the bottom of the graphic, the implied scale doesn’t match the numbers.

7. Adaptive design. Information should be readable both on the big screen and on mobile devices. Be mindful of UX & IA.

Always use a horizontal or square (1:1) aspect ratio.

8. Focus on the comparison you are talking about.

Which chart below is better at showing how US GDP changed from 2017 to 2019? Can you look at the left chart and determine the exact amount of growth? A zeroed axis at this aspect ratio obscures the idea. Now take a look at the right chart. It’s the exact same data, yet it clearly shows nearly $2 trillion of growth, from the low point of $19.5 trillion in 2017 to $21.4 trillion in 2019.

Bureau of Economic Analysis

Real-world criticism:

Use the right baseline. It’s OK not to start your y-axis at zero. Truncate the y-axis when small movements are important.

P.S. Always use a zeroed y-axis with column and bar charts.
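That rule of thumb (zero baseline for bars and columns, optional truncation for lines) can be captured in a tiny helper. This is a sketch of the convention, not any plotting library’s API:

```python
def y_limits(values: list[float], chart: str,
             pad: float = 0.05) -> tuple[float, float]:
    """Bars encode value by length, so they must start at zero;
    line charts may truncate the axis so small movements stay visible."""
    lo, hi = min(values), max(values)
    if chart in ("bar", "column"):
        return (0.0, hi * (1 + pad))
    span = hi - lo
    return (lo - span * pad, hi + span * pad)

gdp = [19.5, 20.5, 21.4]           # US GDP, trillions of dollars
line_lims = y_limits(gdp, "line")  # truncated: the climb is visible
bar_lims = y_limits(gdp, "bar")    # zeroed: bar lengths stay honest
```

The same data gets different limits depending on how the mark encodes the value, which is exactly the distinction the rule above draws.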

9. Take care to protect personal information if you use detailed infographics.

For instance, a video by the data analytics and visualization company Tectonix showed how cellphones that were on one Fort Lauderdale beach at the beginning of March spread across the country, up the Eastern Seaboard and further west, over the next two weeks. The personal information collected from users was fully anonymized. But if you dig into it, you can identify some individuals by comparing locations and dates.
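Here is a toy illustration of that risk; every record below is invented. Knowing just two public facts about a person (where they were on two dates) can single out one “anonymous” device:

```python
# Invented records: anonymized location pings keyed by a random device id.
pings = [
    {"device": "a91f", "date": "2020-03-07", "cell": "FtLauderdale-beach"},
    {"device": "a91f", "date": "2020-03-21", "cell": "NYC-midtown"},
    {"device": "77c2", "date": "2020-03-07", "cell": "FtLauderdale-beach"},
    {"device": "77c2", "date": "2020-03-21", "cell": "Miami-downtown"},
]

def devices_at(day: str, cell: str) -> set[str]:
    return {p["device"] for p in pings
            if p["date"] == day and p["cell"] == cell}

# Intersect two publicly known facts about one person:
candidates = (devices_at("2020-03-07", "FtLauderdale-beach")
              & devices_at("2020-03-21", "NYC-midtown"))
# a single remaining device pins the "anonymous" id to a named person
```

Location and date act as quasi-identifiers: stripping names is not enough when the trail itself is unique.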

10. Figures can be quoted out of context. Look at the story from above to see the full picture. Stereotypes are also insidious.

“According to statistics compiled by the Washington Post, the number of unarmed Black men killed by police so far this year is eight. The number of unarmed white men killed by police over the same time period is 11. And the overall numbers of police shootings has been decreasing.”

— William Barr, Attorney General. Source: NYT

Mr. Barr cited a database of police shootings compiled by The Washington Post. But the raw numbers obscure the pronounced racial disparity in such shootings. The statement was also an echo of Donald Trump’s technically accurate, but misleading claim that “more white” Americans are killed by the police than Black Americans. When factoring in population size, Black Americans are killed by the police at more than twice the rate as white Americans, according to the database.

Another case involves an article about Georgia: The Washington Post reported that the disparate effects of the virus were actually causing politicians to push for reopening because it was “Black and brown people dying.”

11. Always be completely honest. Mention in the infographic how the information was collected and also that some contextual data is not provided or missing.

The Washington Post reconstructed the movements of two D.C. Army National Guard helicopters that hovered nearly motionless over protesters, using flight-tracking data, images, and videos.

To calculate the approximate altitude of the Lakota helicopter, the Post used geospatial data from Open Data DC, building elevations, street widths, and measurements of other street objects to create a precise scaled model of the intersection. The Black Hawk’s transponder broadcast its unique code without coordinates, so the Post relied on videos and photographs instead to trace its path. The Post has made the data and scripts for analyzing the flight paths available on GitHub.

12. The chronological sequence makes it possible to clearly see the manipulations, and it facilitates the study of the processes’ dynamics.

Georgia’s Department of Public Health has been criticized in particular for sharing misleading data. It corrected a graph on its website that appeared to show confirmed case counts decreasing (shown below). The x-axis, however, was not ordered chronologically: the bars were sorted so that the highest values clustered on the left and the lowest on the right, regardless of when those values were recorded.

Georgia is striking for the number of times it has been criticized for its data — and the fact that misleading numbers are being presented to its citizens as they weigh how fully they should engage with their state’s reopening.

A Georgia Department of Public Health graphic prior to the correction
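The fix for a chart like Georgia’s is a single sort before plotting. The daily counts below are invented; shuffled the way that chart was, the bars appear to decline even though cases rose:

```python
from datetime import date

# Invented daily counts, deliberately out of order: plotted left to
# right as-is, they look like a decline even though cases rose.
counts = [
    (date(2020, 5, 9), 1120),
    (date(2020, 5, 2), 640),
    (date(2020, 5, 6), 890),
]

# The fix: always order the x-axis chronologically before plotting.
counts.sort(key=lambda point: point[0])
```

Any charting pipeline that lets the data source dictate bar order, rather than the calendar, can reproduce this mistake by accident.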

13. Don’t forget about captions and labels.

The project known simply as oceaniaeuropeamericasafricaasia uses the Olympic rings to compare data across five participating continents (the Americas are combined). While it is visually appealing, it fails to communicate any real information because it lacks labels.

14. Data visualization can impact and encourage social responsibility.

After the first case of COVID-19, the disease caused by the new strain of coronavirus, was announced in the United States, reports of further infections trickled in slowly. Two months later, that trickle had turned into a steady current. The Washington Post built a simulator showing how a virus spreads through a town of 200 people.

This is not covid-19, and these simulations vastly oversimplify the complexity of real life. Yet just as it spread through the networks of bouncing balls on your screen, covid-19 is spreading through our human networks — through our countries, our towns, our workplaces, our families. And, like a ball bouncing across the screen, a single person’s behavior can cause ripple effects that touch faraway people.

“If you want this to be more realistic,” an author said after seeing a preview of this story, “some of the dots should disappear.”

The issue of ethics in data visualization is, of course, not what comes to mind first; few people set out to cheat, yet a chart can mislead without a single altered figure. Nevertheless, the designer’s task is to present information in a crystal-clear and aesthetically beautiful way. Done properly, the art of data visualization can be an incredibly powerful tool for educating people.
