~ Archive for Predicting ~

Replication, Verification and Availability for Big Data



The next step in the evolution of Social Computing research: formal recognition by the community of the credit-worthiness of Replication, Verification, and Availability of Big Data.

In his response to my posting on Research Replication in Social Computing, Dr. Bernardo Huberman pointed to his letter to Nature on a related issue: verification of results. Here I expand on it to include a proposal that I have heard others mention recently.

I totally agree, of course, that “Science is unique in that peer review, publication and replication are essential to its progress.” This is also what I propose above. He focuses on the need for having accessible data so that people can verify claims. For those who may not have access to his letter, I reproduce the central paragraph here:

“More importantly, we need to recognize that these results will only be meaningful if they are universal, in the sense that many other data sets reveal the same behavior. This actually uncovers a deeper problem. If another set of data does not validate results obtained with private data, how do we know if it is because they are not universal or the authors made a mistake? Moreover, as many practitioners of social network research are starting to discover, many of the results are becoming part of a “cabinet de curiosites” devoid of much generality and hard to falsify.”

Let me add something further, which I heard mentioned by Noshir Contractor and Steffen Staab at the WebScience Track during the WWW2012 conference, and which I think will complement the overall proposal: people who make their data available to others should get credit for that. After all, in Science a lot of time is spent collecting and cleaning data, and those who do that and make their data available to other researchers for verification, meta-analyses and the study of other research questions should be rewarded for their contributions.

I believe the time is right to introduce formal credit for replication of results on comparable data sets, verification on the same data set, and for making data accessible to others for further and meta-analysis. I plan to use much of my group’s research time on these issues this summer and publish our findings afterwards.

Research Replication in Social Computing


On the need for Research Replication in Social Computing

A call for replicating Social Computing results as a necessary step in the maturity of the field.

The field of Social Computing is a rather new one, but it has been one of the most active in Computer Science over the last few years. Many new conferences have been created to host the research efforts of computer scientists, social scientists, physicists, statisticians and many other researchers and practitioners. The excitement generated by the opportunities that opened up through the relative ease of retrieving large amounts of data has led many young researchers to dive in and uncover the modes of social interaction.

At the risk of oversimplifying, one could say that the research papers we produce follow the general pattern of observational sciences:

  • We collect data that arguably can capture the phenomenon we want to study,
  • we may apply some sophisticated statistical tools, test a hypothesis applying machine learning tools, and
  • analyze the results.

Our conclusions sometimes do not just state the phenomenon we observed; they expand from the specific findings to claim projections that go beyond the observed.

One of the reasons that this approach seems familiar is that it resembles the one used in Experimental Computer Science. There, we measure the characteristics of the systems or algorithms we have built, and study their performance experimentally when exact analysis is not easy or even possible. This is a tried and true approach since, in the systems we build, we make a great effort to avoid any behavior that is outside the specifications. In the artificial worlds we create, we try to control all of their aspects, and this process has produced amazing technological results.

On the other hand, this approach may be inappropriate or incomplete compared to those used in the Experimental Natural Sciences. Physicists, Biologists and Chemists would start with this approach to make initial sense of the data they are collecting, but this is just the beginning of the process. Replication of their research is normally needed to verify the validity of the original experiments. Sometimes the research results are not validated; nevertheless, even in this case the replication process provides insight into the workings of natural phenomena. Nature mostly repeats its phenomena consistently, though one may have to account for all the parameters that affect them. Sometimes this is not easy, and replication offers the best guarantee that the research findings are valid.

As we mentioned, Social Computing is now being done by researchers coming from many disciplines, but it is different from both Computer Science and the Natural Sciences. Though it has the potential of also becoming an experimental science, so far it is mostly an observational science. This, it turns out, is a very important distinction. Society is different from Nature in several important ways. Its basic building blocks are people, not atoms, chemical compounds or molecules. The complexity of their interactions is not easily tractable, to the degree that one may not even be able to enumerate all the factors that affect them. Moreover, people (and even social “bots” released in Social Media) do not behave consistently over time and under different conditions.

The closest relative of Social Computing is not Computer Science, we would argue, but Medical Science, where Natural Science phenomena are influenced by social conditions. In both the Medical and Natural Sciences, replication of results is considered an irreplaceable component of scientific progress. Any lab can make discoveries, but these discoveries are not considered valid until they have been independently replicated by other labs. Not surprisingly, replicating research findings is considered a publishable contribution, and researchers get credit for doing just that.

In Computer Science, replication has not been considered important or worth any credit, unless it reveals crucial flaws in the original research. It is unlikely, for example, that replicating Dijkstra’s shortest-paths algorithm would contribute to the development of our discipline, and so it makes sense not to give credit for its replication. On the other hand, the inability to replicate Hopcroft and Tarjan’s triconnected components algorithm was a significant development, and Gutwenger and Mutzel, who discovered and corrected the flaws, did receive credit for it.

We acknowledge the need for replicating Social Computing research results as a way of establishing the patterns that Social Media data reveal under all meaningful conditions. We believe that such research replication will give credibility to the field. Failing that, we may end up collecting a large number of conflicting results that could discredit the whole field.


Three Social Theorems


Dear Readers,

Below are my annotated notes from a talk I gave at Berkman’s Truthiness in Digital Media Symposium a few weeks ago. I introduced the concept of Social Theorems as a way of formulating the findings of the research that has been happening over the last few years in the study of Social Media. It is my impression that, while we publish a lot of papers, write a lot of blogs, and journalists report often on this work, we have trouble communicating our findings clearly. I believe that we need both to clarify our findings (thus the Social Theorems) and to repeat experiments so that we know we have enough evidence for what we really find. I am working on a longer version of this blog and your feedback is welcome!

P. Takis Metaxas

With the development of the Social Web and the availability of data produced by humans, Scientists and Mathematicians have taken an interest in studying issues traditionally of interest mainly to Social Scientists.

What we have also discovered is that Society is very different from Nature.

What do I mean by that? Natural phenomena are amenable to understanding using the scientific method and mathematical tools because they can be reproduced consistently every time. In the so-called STEM disciplines, we discover natural laws and mathematical theorems and keep building on our understanding of Nature. We can create hypotheses, design experiments and study their results, with the expectation that, when we repeat the experiments, the results will be substantially the same.

But when it comes to Social phenomena, we are far less clear about what tools and methods to use. We certainly use the ones we have used in Science, but they do not seem to produce the same concrete understanding that we enjoy with Nature. Humans may not always behave in the same, predictable ways and thus our experiments may not be easily reproducible.

What have we learned so far about Social phenomena from studying the data we collect on the Social Web? Below are three Social Theorems I have encountered in the research areas I am studying. I call them “Social Theorems” because, unlike mathematical Theorems, they are not expected to apply consistently in every situation; they apply most of the time, and when enough attention has been paid by enough people. Proving Social Theorems involves providing enough evidence of their validity, along with a description of their exceptions (situations in which they do not apply). It is also important to have a theory, an explanation, of why they are true. Disproving them involves showing that a significant number of counter-examples exists. It is not enough to have a single counter-example to disprove a Social Theorem, as people are able to create one just for fun. One has to show that at least a significant minority of all cases related to a Social Theorem are counter-examples.

SoThm 1. Re-tweets (unedited) about political issues indicate agreement and reveal communities of like-minded people.

SoThm 2. Given enough time and people’s attention, lies have short questioned lives.

SoThm 3. People with open minds and critical thinking abilities are better at figuring out truth than those without. (Technology can help in the process.)

So, what evidence do we have so far about the validity of these Social Theorems? Since this is simply a blog, I will try to outline the evidence with a couple of examples. I am currently working on a longer version of this blog, and your feedback is greatly appreciated.

Evidence for SoThm1.

There are a couple of papers that present evidence that re-tweets (unedited) about political issues indicate agreement and reveal communities of like-minded people. The first is From Obscurity to Prominence in Minutes: Political Speech and Real-Time Search, a paper that I co-authored with Eni Mustafaraj and presented at the WebScience 2010 conference. When we looked at the 200 most active Twitter users tweeting about the 2010 MA Special Senatorial election (those who sent at least 100 tweets in the week before the elections), we found that their re-tweets revealed their political affiliations. First, we characterized all of them as liberals or conservatives based on their profiles and their tweets. Then we looked at how they were retweeting. In fact, 99% of the conservatives were only retweeting other conservatives’ messages, and 96% of liberals only those of other liberals.
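To make the measurement concrete, here is a minimal sketch of how such in-group retweet percentages can be computed from a hand-coded list of users and a list of retweet pairs. The labels and edges below are invented for illustration, not our actual data set:

```python
from collections import Counter

def retweet_homophily(labels, retweets):
    """Fraction of each group's retweets aimed at same-label accounts.

    labels:   dict user -> "liberal" | "conservative" (hand-coded)
    retweets: list of (retweeter, original_author) pairs
    """
    same, total = Counter(), Counter()
    for src, dst in retweets:
        if src in labels and dst in labels:
            total[labels[src]] += 1
            if labels[src] == labels[dst]:
                same[labels[src]] += 1
    return {group: same[group] / total[group] for group in total}

labels = {"a": "conservative", "b": "conservative",
          "c": "liberal", "d": "liberal"}
retweets = [("a", "b"), ("b", "a"), ("c", "d"), ("d", "c"), ("c", "b")]
rates = retweet_homophily(labels, retweets)
print(rates)  # conservatives retweet in-group 2/2, liberals 2/3
```

With real data one would substitute the hand-coded affiliations and the collected retweet edges; the numbers we report in the paper come from exactly this kind of in-group/total ratio.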

Then we looked at the retweeting patterns of the 1000 most active accounts (those who sent at least 30 tweets in the week before the elections), and we discovered the graph below:

As you may have guessed, the liberals and conservatives are mostly re-tweeting the messages of their own camp. And it makes sense: the act of re-tweeting spreads a message to your own followers. If a liberal or conservative re-tweets (that is, repeats a message without modification), he/she wants this message to spread. In a politically charged climate, e.g., before important elections, he/she will not be willing to spread a message that he/she disagrees with.

The second piece of evidence comes from the paper “Political Polarization on Twitter” by Conover et al., presented at the 2011 ICWSM conference. The retweeting pattern, shown below, also indicates a highly polarized environment.

In both cases, the pattern of user behavior does not apply 100% of the time, but it does apply most of the time. That is what makes this a Social Theorem.

Evidence for SoThm2.

The “Given enough time and people’s attention, lies have short questioned lives” Social Theorem describes a more interesting phenomenon because people tend to worry that lies somehow are much more powerful than truths. This worry stems mostly from our wish that no lie ever wins out, though we each know several lies that have survived. (For example, one could claim that there are several major religions in existence today that are propagating major lies.)

In our networked world, things are better, the evidence indicates. The next table comes from the paper “Twitter Under Crisis: Can we trust what we RT?” by Mendoza et al., presented at the SOMA 2010 meeting. The authors examined some of the false and true rumors circulated after the Chilean earthquake in 2010. What they found is that rumors about confirmed truths had very few “denies” and were not questioned much during their propagation. On the other hand, confirmed false rumors were both questioned a lot and denied much more often (see the last two columns enclosed in red rectangles). Why does this make sense? Large crowds are not easily fooled, as the research on crowdsourcing has indicated.
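As a back-of-the-envelope illustration of the kind of comparison that table makes, one can tally the share of “deny” and “question” reactions per rumor. The counts below are invented for the sketch, not Mendoza et al.’s actual numbers:

```python
def reaction_rates(counts):
    """Per-rumor share of 'deny' and 'question' among all reactions.

    counts: dict rumor -> dict of reaction tallies
            (e.g., {"affirm": ..., "deny": ..., "question": ...})
    """
    rates = {}
    for rumor, c in counts.items():
        n = sum(c.values())
        rates[rumor] = {"deny": c.get("deny", 0) / n,
                        "question": c.get("question", 0) / n}
    return rates

# Illustrative tallies only (not the paper's data)
counts = {
    "confirmed truth": {"affirm": 95, "deny": 2, "question": 3},
    "false rumor":     {"affirm": 50, "deny": 28, "question": 22},
}
rates = reaction_rates(counts)
print(rates)
```

The qualitative pattern the paper reports is exactly this asymmetry: the deny and question shares stay tiny for confirmed truths and grow large for false rumors.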

Again, these findings do not claim that no lies will ever propagate, but that they will be confronted, questioned, and denied by others as they propagate. By comparison, truths will have a very different experience in their propagation.

The next piece of evidence comes from the London riots in August 2011. At the time, members of the UK government accused Twitter of spreading rumors and suggested it should be restricted in crises like these. The team that collected and studied the rumor propagation on Twitter found that this was not the case: false rumors were, again, short-lived and often questioned during the riots. In a great interactive tool, the Guardian shows in detail the propagation of 7 such false rumors. I am reproducing below an image of one of them; the interested reader should take a closer look at the Guardian link.



During the Truthiness symposium, another case was presented, one that supposedly shows the flip side of this social theorem: That “misinformation has longer life, further spread on Twitter than accompanying corrections”. I copy the graph that supposedly shows that, for reference.

Does this mean that the Social Theorem is wrong? Recall that a Social Theorem cannot be refuted by a single counter-example, but only by demonstrating that at least a significant minority of counter-examples exists.
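This refutation criterion can be written down as a simple decision rule. The 10% threshold below is my own placeholder for “significant minority”, not an established value:

```python
def refutes(cases, threshold=0.10):
    """Decide whether observed cases disprove a Social Theorem.

    cases: list of booleans, True where the theorem held.
    A single False is not enough; only when the fraction of
    counter-examples reaches the (assumed) threshold is the
    theorem considered refuted.
    """
    if not cases:
        return False
    counter = sum(1 for held in cases if not held)
    return counter / len(cases) >= threshold

# One counter-example out of 100 does not refute...
print(refutes([True] * 99 + [False]))
# ...but a 20% minority of counter-examples does.
print(refutes([True] * 80 + [False] * 20))
```

Of course, the hard part in practice is not the arithmetic but deciding what counts as a “case” and where the threshold should sit for a given theorem.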

Further, the above example may not be as bad as it looks initially. First, note that the graph shows that the false information had a short life: it did not last more than a few hours. Moreover, note that the false rumor’s spread was curbed as soon as the correction came out (see the red vertical line just before 7:30 PM). This indicates that the correction probably had a significant effect in curbing the false information, which otherwise might have continued to spread at the same rate as before.


Evidence for SoThm3.

I must admit that “People with open minds and critical thinking abilities are better at figuring out truth than those without” is a Social Theorem that I would like to be correct, and I believe it to be correct, but I am not sure how exactly to measure it. It makes sense: after all, our educational systems since the Enlightenment have been based on it. But how exactly do you create controlled experiments to prove or disprove it?

Here, Dear Reader, I ask for your suggestions.



Election time, and the predicting is easy…



As I am sure you have heard, the Iowa caucus results are in. Several journalists are reporting on the elections along with claims of “predictions” that social media are supposedly making. And the day after the Iowa caucus, they are wondering whether Twitter predicted correctly or not. And they look to the “professionals” for advice, such as Globalpoint, Sociagility, Socialbackers and other impressive-sounding companies.

Shepard Fairey meets Angry Birds: Poster of our 2011 ICWSM submission "Limits of Electoral Predictions using Twitter"

Well, Twitter did not get it right. That is not surprising to my co-authors and me. Yet, the journalists try to find a silver lining by claiming smaller predictions, such as “anticipating Santorum’s excellent performance” better than the national polls did. Of course, the fact that Twitter missed the mismatches with the other 5 candidates is ignored. Why can’t they see that?

A few years ago I created a questionnaire to help my students sharpen their critical thinking skills. One question that the vast majority got right was the following: “Is Microsoft the most creative tech company?” If one were to do a Web search on this question, the first hit (via the “I’m Feeling Lucky” button) would be Microsoft’s own Web page, because it had the title “Microsoft is the most creative tech company.” My students realized that Microsoft may not be providing an unbiased answer to this question, and ignored it.

It is exactly this critical thinking principle that journalists obsessed with election predictions are getting wrong: the companies I mentioned above (Globalpoint, Sociagility, Socialbackers) are all in the business of making money by promising magical abilities in their own predictions and metrics. One should not take their claims at face value, because they have a financial conflict of interest that rewards giving misleading answers (e.g., “Comparing our study data with polling data from respected independent US political polling firm Public Policy Polling, we discovered a strong, positive correlation between social media performance and voting intention in the Iowa caucus.” Note that even after the elections they talk about intentions, not results.)

That’s not the only example violating this basic critical thinking principle that I saw today. Earlier, I had received a tweet that “Americans more susceptible to online scams than believed, study finds“. The article reports that older, rich, highly educated men from the Midwest, politically affiliated with the Green Party, are far less susceptible to scams than young, poor, high-school-dropout women from the Southwest who support Independents. If you read the “study” findings, you will be even more confused about the quality of this study. A closer look reveals that the “study” was done by PC Tools, a company selling “online security and system utility software.” Apparently, neither the vagueness of the “survey” nor the financial conflict of interest of the surveying company raised any flags for the reporter.

In the Web era, information is finding us, not the other way around. Being able to think critically will be crucial.



Predict the Future (and Tell the World about it!)


In my previous posting (Predict the Future!) I discussed the benefits and risks of making predictions using data gathered from social media. I will take this opportunity to mention a Call for Papers that I am involved in. The online journal “Internet Research”, famous for having published the original article by Tim Berners-Lee on the creation of the WWW, is having a special issue on “The Power of Prediction with Social Media”, to be published in 2012. Below are the details. If you have any questions, please contact me or any of the other guest editors.


Special issue call for papers on
“The Power of Prediction with Social Media”
from Internet Research, ISSN: 1066-2243

Editor in Chief: Jim Jansen


Social media today provide an impressive amount of data about users and their societal interactions, thereby offering computer scientists, social scientists, economists, and statisticians many new opportunities for research exploration. Arguably one of the most interesting lines of work is that of forecasting future events and developments based on social media data, as we have recently seen in the areas of politics, finance, entertainment, market demands, health, etc.

But what can successfully be predicted and why? Since the first algorithms and techniques emerged rather recently, little is known about their overall potential, limitations and general applicability to different domains.

Better understanding the predictive power and limitations of social media is therefore of utmost importance in order, for example, to avoid false expectations, misinformation or unintended consequences. Current methods and techniques are far from well understood, and it is mostly unclear to what extent, or under what conditions, the different methods for prediction can be applied to social media. While there exists a respectable and growing amount of literature in this area, current work is fragmented and characterized by a lack of common evaluation approaches. Yet this research seems to have reached a sufficient level of interest and relevance to justify a dedicated special issue.

This special issue aims to shape a vision of important questions to be addressed in this field and fill the gaps in current research by soliciting presentations of early research on algorithms, techniques, methods and empirical studies aimed at the prediction of future or present events based on user generated content in social media.


To address this guiding theme the special issue will be articulated around, but not limited to, the following topics:

  1. Politics, branding, and public opinion mining (e.g., electoral, market or stock market prediction).
  2. Health, mood, and threats (e.g., epidemic outbreaks, social movements).
  3. Methodological aspects (e.g., data collection, data sampling, privacy and data de-identification).
  4. Success and failure case studies (e.g., reproducibility of previous research or selection of baselines).


  • Manuscript due date: June 1, 2012
  • Decisions due: August 1, 2012
  • Revised paper due: September 15, 2012
  • Notification of acceptance: October 1, 2012
  • Submission of final manuscript: October 31, 2012
  • Publication date: late 2012 / early 2013 (tentative)


All submitted manuscripts should be original contributions and not be under consideration in any other venue.

Publication of an enhanced version of a previously published conference paper is possible if the review process determines that the revision contains significant enhancements, amplification or clarification of the original material. Any prior appearance of a substantial amount of a submission should be noted in the submission letter and on the title page.

Submissions must adhere to the “Author Guidelines”.

Detailed instructions will be announced later this year.

Guest editors


Predict the Future!


The title may seem redundant. Of course, if you are going to predict, you should predict the future — what else, predict the past? But when referring to social media data, it may not be that redundant. In recent years there has been an increase in research on social media data predicting the future, predicting the present, and even predicting the past using knowledge acquired in the future.

Why is predicting important? Predicting is equivalent to intelligence, with an important qualification: We admire the intelligence of someone who can predict what is going to happen, but only when they can explain why they are able to do so. If one (e.g., an octopus) is able to predict without explanation, we tend to downgrade it as coincidence.

Earlier today, the Pew Research Center’s Project for Excellence in Journalism published an analysis entitled “Twitter and the Campaign“. They present a detailed study of millions of tweets and blogs about what people say on social media about the candidates for the 2012 elections. (Not too many nice things, it turns out, except for Ron Paul, who, at the same time, is trailing in the polls.)

So, what does this mean for the predictive power of Twitter? Is he going to win because tweets have good things to say about him, or will he lose because tweets have good things to say about him? (Hint: The answer is “yes”.)

Shepard Fairey meets Angry Birds: Poster of our 2011 ICWSM submission "Limits of Electoral Predictions using Twitter"

Earlier this year, with my colleagues Eni Mustafaraj and Dani Gayo-Avello and our student Catherine Lui, we studied this question: can one predict the outcome of the US congressional elections by analyzing social media data? We did not find encouraging results in either the Google Trends data or the Twitter data — thus the ingenious poster above that Dani designed.

When it comes to something as important as the elections, social media will be manipulated, because the stakes are too high. One should keep that in mind as we get closer to election time and “news articles” start appearing, arguing that someone will win or lose based on the number of friends or followers that candidate has. If the author gets it right, he will make sure to remind us in the future. If he gets it wrong, he will be the first to forget it.

Today's mentally flexible tweet. Why is this important? What is special about the last 24 hours? Who is missing?

This does not mean that nothing can be predicted using social media. Movie sales can be predicted, as Bernardo Huberman and his colleague showed. Flu outbreaks and periodic sales can be predicted, too. But not elections — at least not without some sophisticated filtering that would make these methods as representative as, and competitive with, the professional pollsters.


