On the night of November 8, 2016, Brian Olson was attending an election night party at his local bar with a group of close friends. A data science expert and software engineer, Olson had come armed with three laptops, whose screens flickered with various newsfeeds and heatmaps as the country waited for the final votes to be tallied. When the news broke that Donald Trump had won the electoral vote, Olson was despondent. “Everyone immediately switched from cocktails to sad drinks,” he said.

Like many Americans, Olson was shocked by Trump’s unexpected victory. Virtually all major forecasts had given Clinton a 70-99% chance of winning. Even Nate Silver, the go-to political stats expert who founded FiveThirtyEight, predicted the Democratic candidate would clinch over 300 electoral votes.

But the problem wasn’t bad data. Forecasts are based on probability, not actual voter intent. “A 30% chance of Trump winning isn’t a long shot at all,” said Eric Siegel, founder of Predictive Analytics World, a conference series that covers the commercial deployment of machine learning and predictive analytics. “Something that happens 30% of the time is really pretty common.”
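Siegel’s point is easy to check with a quick simulation (a minimal sketch, not tied to any particular forecast model): an event given a 30% probability comes up roughly three times in every ten trials.

```python
import random

def simulate(prob, trials, seed=42):
    """Empirical frequency of an event with the given probability."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(trials) if rng.random() < prob)
    return hits / trials

# A "30% chance of winning" outcome, replayed 100,000 times.
freq = simulate(0.30, 100_000)
print(f"Observed frequency: {freq:.3f}")  # close to 0.300 — hardly a long shot
```

In other words, a forecast that gives a candidate a 30% chance isn’t wrong when that candidate wins; it is wrong only if such candidates win much more or less than 30% of the time.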

While horse race polling remains the most mainstream method for predicting election outcomes, data science has played an increasingly crucial role in election forecasting and influencing voter turnout. Companies like PredictWise use data mining to enable political candidates to mobilize voters using targeted ads, while AI-powered fact-checkers deploy knowledge engineering techniques to help stem the spread of fake news and disinformation on social media networks.

“We don’t focus on the current state of the race: our goal is to change the state of the race through the ads that are delivered,” Nikhil Garg, a principal data scientist at PredictWise, said recently during a keynote session at Rise, Springboard’s annual tech and design conference.

After Trump’s victory in 2016, pollsters took a big credibility hit. The New York Times described election analytics as a “young science” in which “some people have been misled into thinking Clinton’s win was assured.”

Some analysts say the faulty poll findings stemmed from weak signals that pollsters overlooked because those signals didn’t fit their existing models. In data science, such weak signals are known as “dark data”: data that is collected and stored but never properly identified for analysis. One theory holds that so-called “shy Trumpers” misrepresented their intentions in polls because they were reluctant to admit supporting a divisive and belligerent public figure, but the theory remains unproven in the absence of metadata (i.e. data that describes the data).

“I’m skeptical as to whether there is such a thing as a shy Trump voter,” said Siegel. “That’s a controversial idea and I haven’t seen data that proves it to be true.”

Lessons learned from the 2016 election

The 2016 election cycle set off a series of events that demonstrated how big data can be manipulated for less-than-benevolent ends. First, there were the Russian agents who created a “troll farm” to disseminate fabricated articles and social media posts favoring Donald Trump, which reached more than 126 million users on Facebook alone. They also published more than 131,000 posts on Twitter and uploaded over 1,000 videos to YouTube.

Then came the Cambridge Analytica scandal, which broke in 2018: a political data analytics firm linked to former Trump adviser Steve Bannon harvested the private data of 50 million Facebook users to build individual voter profiles for political ad targeting. (By contrast, a typical audience solutions provider uses anonymized, publicly available data from credit unions and state-owned voter registration databases.)

The last few years have been no less tumultuous. Social media networks have continued to promote “echo chambers” that reinforce cognitive biases, while algorithms that reward incendiary content fuel the circulation of fake news, deep fakes, and increasingly outlandish conspiracy theories, from Wayfair child-trafficking cabinets and claims that the coronavirus outbreak was planned to QAnon, a conspiracy theory involving Donald Trump and Satan-worshiping pedophiles. A June 2020 Pew Research Center survey found that 25% of Americans believe there is at least some truth to the coronavirus conspiracy theories; a more recent survey found that one in four Britons believe in the conspiracy theories circulated by QAnon.

The fact that these conspiracy theories have gained such alarming traction in the last few months is a testament to the times we live in. Social media usage has reached an all-time high, with nearly three in four U.S. adults using at least one social media site. Facebook, in particular, has come under fire for its role in the spread of misinformation and fake news, particularly in the leadup to the 2020 election. A 2019 Stanford study found that Facebook usage correlated with users’ political polarization and with their openness to the views and ideas of the opposing party. The platform’s algorithms spoon-feed users more of what they want so they’ll spend more time on the site: if a user signals certain political leanings by clicking on a post, commenting on it, liking it, or sharing it, they’ll see more content in their feed that aligns with their views.

Facebook’s own internal research points to the same conclusion. “Our algorithms exploit the human brain’s attraction to divisiveness,” read a slide from an internal Facebook presentation uncovered by The Wall Street Journal. “If left unchecked,” it warned, Facebook would feed users “more and more divisive content in an effort to gain user attention and increase time on the platform.”


Political campaigns looking to target voters with ads pay close attention to these engagement metrics and end up campaigning to voters who are already likely to vote for them, rather than appealing to a larger swath of constituents. “Some actions are more obvious than others. If you forward something [in an email], from a data standpoint that has a very definite segmentation,” said Samuel Lin, a data science mentor at Springboard and senior business intelligence developer at DocuSign. “If you only like something or view an article, that’s a weaker action, so you may not be categorized as a certain type of voter.”
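Lin’s distinction between strong and weak signals can be sketched as a simple weighted scoring scheme. The action weights and threshold below are entirely hypothetical, chosen for illustration; real ad-targeting pipelines use far richer models.

```python
# Hypothetical weights (illustrative only): deliberate actions like
# forwarding signal political affinity more strongly than passive ones.
ACTION_WEIGHTS = {"forward": 3.0, "share": 2.5, "comment": 2.0,
                  "like": 1.0, "view": 0.5}

def affinity_score(actions):
    """Sum a user's weighted actions into one signal strength."""
    return sum(ACTION_WEIGHTS.get(a, 0.0) for a in actions)

def segment(actions, threshold=3.0):
    """Classify a user only when the combined signal is strong enough."""
    return "likely-partisan" if affinity_score(actions) >= threshold else "unclassified"

print(segment(["forward"]))        # strong action alone crosses the bar
print(segment(["view", "like"]))   # weak actions leave the user unclassified
```

This mirrors Lin’s point: forwarding alone is a “very definite segmentation,” while views and likes are too weak on their own to categorize a voter.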

While Facebook announced last year that it would crack down on disinformation spread by removing fake accounts and reducing the reach of articles that have been flagged by third-party fact-checkers, this doesn’t prevent fake news stories from being disseminated elsewhere. A political troll looking to sow discord on social media doesn’t have to try very hard; inflammatory content spreads organically through social sharing and algorithms that triage posts in people’s feeds according to engagement metrics, even if they’re based on negative reactions.

“With the ease of online publishing today, it is becoming harder for people to determine which sources are credible and which aren’t, leading the way for toxic news to travel faster than truth,” said Lyric Jain, founder and CEO of Logically, an AI-powered fact-checking tool that’s available via mobile app or browser extension.

Can data science help turn things around?

While big data has the potential to jeopardize the democratic process when misused, it’s also an important part of the solution to fairer elections.

Facebook recently announced it would ban new political ads starting October 27, one week before election day, and would extend the ban indefinitely after the polls close on November 3. The social networking site’s moderators are also actively removing posts that dissuade people from voting, such as calls for people to engage in poll watching or other voter intimidation. Meanwhile, Twitter banned all political advertising last year and recently introduced a fact-check feature that would flag posts containing falsehoods by providing extra context.

Another major lesson learned from 2016 is that horse race polls must be correctly represented as mere probability, with more transparency around the degree of uncertainty or margin of error. An overwhelmingly positive or negative forecast can discourage people from voting either because they’re convinced their candidate will win, or they feel hopeless about the impact of their vote.
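For a simple random sample, that uncertainty is commonly summarized as the 95% margin of error, roughly 1.96 times the standard error of the sampled proportion. A minimal sketch (assuming simple random sampling, which real polls rarely achieve, so true error is usually larger):

```python
import math

def margin_of_error(p, n, z=1.96):
    """95% margin of error for a proportion p from a sample of size n."""
    return z * math.sqrt(p * (1 - p) / n)

# A 1,000-person poll showing a candidate at 50% support:
moe = margin_of_error(0.50, 1000)
print(f"±{moe * 100:.1f} points")  # ±3.1 points
```

A headline like “Clinton 52%, Trump 48%” from such a poll is therefore a statistical dead heat, which is exactly the context that a bare horse-race number hides.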

“In the case of predicting the outcome of an election, you’re predicting not just one individual human’s behavior—which is already hard enough—but that of many humans together with the complex processes of the electoral college,” said Siegel. “No matter how good the math is, and no matter how much data we have, we don’t have a magic crystal ball.”

Part of the problem of election bias originates from way before voters even cast their ballots. Partisan gerrymandering is a common practice used to manipulate the boundaries of an electoral constituency so as to favor one political party, effectively locking out votes for the opposing party. In this case, data is used to identify and group voters according to their political affiliations and either group them together (a practice known as “packing”) or separate them to break up a majority (“cracking”).
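A toy example makes the effect concrete (the numbers are invented for illustration): 100 voters, 55 backing party A and 45 backing party B, divided into five 20-voter districts. The same electorate yields very different seat counts depending on where the lines fall.

```python
def seats_won(plan, district_size=20):
    """Districts won by party A; each entry is A's voter count there."""
    return sum(1 for a in plan if a > district_size - a)

# "Cracking" B: spread B's 45 voters so they never reach a majority.
cracked = [11, 11, 11, 11, 11]   # A wins every district 11-9

# "Packing" A: cram A's voters into two lopsided districts.
packed = [20, 20, 5, 5, 5]       # A wins 2; B takes the other 3

assert sum(cracked) == sum(packed) == 55  # same voters, different maps
print(seats_won(cracked), seats_won(packed))  # → 5 2
```

With 55% of the vote, party A takes anywhere from two to all five seats purely on the strength of the map, which is why the party that draws the lines cares so much about the data behind them.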


While a growing number of states have embraced redistricting reforms after the 2020 census by appointing bipartisan commissions to draw maps for congressional and state legislative districts, the system can also be tackled using software. Olson, who is a principal software engineer at Algorand, has been building his own open-source redistricting software since 2005. His automatic redistricting algorithm redraws gerrymandered districts using census data, placing impartial district lines that minimize the average distance from each person to the center of their district. That distance varies with the size of the state, but the approach keeps every district spatially compact.

“My definition of an impartial district is one that is entirely based on compactness—keeping the people in a district as close to each other as possible,” Olson said. “If you have a group of people that are all clustered together they’re more likely to have something in common with each other.”
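Olson’s compactness criterion resembles a k-means clustering objective. The sketch below is not his actual software (which works from real census data and also balances district populations); it only illustrates the core idea of iteratively pulling people toward the nearest district center.

```python
import math
import random

def redistrict(points, k, iters=50, seed=0):
    """K-means-style compactness sketch: assign each person to the
    nearest district center, then move each center to the centroid
    of its assigned people. Illustrative only; it ignores the
    equal-population constraint real redistricting requires."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assign each person to their nearest district center.
        assign = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Recenter each district on its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = (sum(x for x, _ in members) / len(members),
                              sum(y for _, y in members) / len(members))
    return assign, centers

# Toy example: 200 random "residents" grouped into 4 compact districts.
random.seed(1)
people = [(random.random(), random.random()) for _ in range(200)]
districts, centers = redistrict(people, k=4)
```

Each iteration can only shrink the average person-to-center distance, which is exactly the quantity Olson’s definition of an impartial district minimizes.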

But a purely mathematical method is just one approach to fair redistricting. The data can also be used to ensure different groups are fairly represented. “We should also try to make majority-Black districts or majority-Hispanic districts in some parts of the country,” Olson said. “That’s gerrymandering for good: restorative justice.”

Across its many use cases in the democratic process, big data is a double-edged sword, serviceable for nefarious purposes and good causes alike. On the one hand, audience solutions providers like PredictWise could be perceived as helping political candidates promote incendiary ads. But done right, political advertising can educate voters on causes that matter to them and mobilize apathetic voters.

“A whole bunch of people have quite diverse views when it comes to these issues,” said Garg. “They might tend to vote for law and order but are liberal when it comes to healthcare and economic issues. Our work serves to inform people about these issues.”

Meanwhile, the work of AI-powered fact-checkers like Logically helps to raise awareness of disinformation campaigns, hold social media networks to account, and help people verify the credibility of the information they read online at scale. Having access to big data represents a major opportunity and responsibility to ensure data provides transparency and is used in the service of helping others rather than deceiving them.

“I think that any party that’s in a position of power has a responsibility to use that power for good,” said Siegel. “If you’re not fighting against the problem then you’re part of the problem.”

Election Day is November 3. Find out how to vote in your state.