Sponsor:

National Science Foundation (NSF)

Project Team Members:

Understanding, Analyzing, and Retrieving Knowledge from Social Media

Overview

Social media has become one of the most popular ways for users to communicate and share their interests without being in the same geographical location. With the rapid growth of social media sites such as Facebook, LinkedIn, and Twitter, customer review sites (e.g. Amazon, CNET), and blogs (e.g. WSJ Blogs), there is a vast amount of user-generated content. Our goal is to retrieve valuable nuggets of knowledge from this huge amount of data and help users make informed decisions. To achieve this goal, the specific objectives of the project are as follows.

Constructing useful information networks from raw datasets.
Understanding and analyzing useful structures, patterns, and communities in networks.
Analysis of temporal and spatial dynamics in networks.
Analysis of social reputation, influence, and trust.
User activity modeling, profiling, exploration, and recommendation systems.
Developing new data mining and machine learning algorithms for social networks.
New applications and impact of social media in other areas of research.

Our current progress on this project falls into two broad categories.

Community Mining

The rapid evolution of modern social networks motivates the design and understanding of networks based on users' interests. Using popular social media such as Facebook and Twitter, we present a new perspective that brings out more meaningful information about these networks. Instead of using the traditional user networks of Facebook and Twitter, we deduce user-interest based networks from posts, comments, and tweets. Our approach closely captures the relations found in static networks and also finds affiliations that are constantly evolving due to temporal or spatial activities. Further, we develop a new approach for mining communities to understand and analyze the structure of social networks. Together, our user-interest based model and community extraction algorithm can be used to identify target communities in the context of business requirements. Figure 1 shows several such focused communities, which belong to categories such as Technology (c231), Consumer Merchandise (c232), Retail (c243), Travel and Leisure (c244), Food (c248), and Baby Products (c251). We found many more such focused communities. Note that with existing approaches, most of these focused communities are merged into one large community, which does not reflect the structure of the network.

Figure 1. Partial dendrogram showing communities in Facebook.
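
To make the idea of a user-interest based network concrete, the following is a minimal sketch and not the project's actual pipeline. It assumes a hypothetical input of (user, page) pairs, each meaning that the user commented on that page, and links two pages with a weight equal to the number of users who commented on both. The function name, input format, and weighting scheme are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def build_interest_network(comments):
    """Build a weighted page-page graph from (user, page) comment records.

    Two pages are linked when at least one user commented on both; the
    edge weight is the number of such shared users. The input format and
    the weighting are illustrative assumptions, not the published model.
    """
    pages_by_user = defaultdict(set)
    for user, page in comments:
        pages_by_user[user].add(page)

    weights = defaultdict(int)
    for pages in pages_by_user.values():
        # Every pair of pages a user touched gains one unit of weight.
        for a, b in combinations(sorted(pages), 2):
            weights[(a, b)] += 1
    return dict(weights)


# Hypothetical toy data: three users commenting on four pages.
comments = [
    ("u1", "Apple"), ("u1", "Dell"),
    ("u2", "Apple"), ("u2", "Dell"), ("u2", "Target"),
    ("u3", "Target"), ("u3", "Walmart"),
]
network = build_interest_network(comments)
```

In this toy data the two technology pages end up with the heaviest edge because two users commented on both, which is the kind of signal a community extraction step could then exploit.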

In addition, we explore newer methods for extracting communities from a given network. In many real networks, vertices may belong to more than one group, and such groups form overlapping communities. Classical examples are social networks, where an individual usually belongs to several circles at the same time, such as work colleagues, family, and sports associations. Finding such overlapping communities is a challenging problem and is not supported by traditional community detection algorithms. We devised a hierarchical pruning based heuristic algorithm for finding the maximum clique of a given graph. The details of the maximum clique algorithm and its source code are available online. We are currently working on extending the clique algorithm to identify communities in massive networks. Building on the maximum clique algorithm, we also devised a clique-based community detection algorithm that is capable of finding overlapping communities. Figure 2 shows some of the communities detected. We see two isolated communities, one for popular singers and another for retail chains and products. We also see a community for news channels and politics, and a community of MSNBC and popular TV shows. The highlight of our algorithm is that it allows a node to be a member of more than one community, giving an overlapping community structure. Although the "news channels and politics" and "MSNBC and TV shows" communities are not directly related and otherwise have different members, they share a common member.

Figure 2. Some Facebook communities detected by our clique-based community finder.
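
The two ideas above, pruning during maximum clique search and letting a node belong to several clique-based communities, can be illustrated with a small sketch. This is not the published algorithm: it is a plain branch-and-bound search with simple size and degree pruning, followed by a basic Bron-Kerbosch enumeration in which every maximal clique of at least three vertices is treated as a community. The graph, vertex names, and the `min_size` threshold are hypothetical.

```python
from itertools import combinations


def from_edges(edges):
    """Build an adjacency map (vertex -> set of neighbours)."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def max_clique(adj):
    """Return one maximum clique using branch and bound with pruning."""
    best = []

    def expand(clique, candidates):
        nonlocal best
        if len(clique) > len(best):
            best = list(clique)
        candidates = set(candidates)
        while candidates:
            # Size pruning: the remaining candidates cannot beat `best`.
            if len(clique) + len(candidates) <= len(best):
                return
            v = candidates.pop()
            # Degree pruning: v has too few neighbours to improve on `best`.
            if len(adj[v]) < len(best):
                continue
            expand(clique + [v], candidates & adj[v])

    expand([], adj.keys())
    return best


def maximal_cliques(adj, min_size=3):
    """Enumerate maximal cliques with at least min_size vertices."""
    found = []

    def bron_kerbosch(r, p, x):
        if not p and not x:
            if len(r) >= min_size:
                found.append(frozenset(r))
            return
        for v in list(p):
            bron_kerbosch(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    bron_kerbosch(set(), set(adj), set())
    return found


# Hypothetical toy graph: two tightly knit groups sharing one vertex, "hub".
news = ["news1", "news2", "news3", "hub"]
shows = ["hub", "show1", "show2"]
adj = from_edges(list(combinations(news, 2)) + list(combinations(shows, 2)))

largest = max_clique(adj)
communities = maximal_cliques(adj)
overlap = [v for v in adj if sum(v in c for c in communities) > 1]
```

Here `overlap` contains only the shared vertex, mirroring the situation in Figure 2 where two otherwise unrelated communities have one member in common. Treating each maximal clique as a community is a deliberate simplification; a practical method would also merge or relax cliques.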

Text Mining

An enormous number of messages is published each day on social media sites. For example, Twitter processes 230 million tweets (messages of up to 140 characters) a day (twitterstats). This explosion of textual messages can cause information overload. Our goal is to design systems that can analyze and summarize social media content. The current work encompasses two main themes:

Sentiment Analysis

The goal is to mine opinions from social media data at the sentence and document level.
Design a framework that can determine the sentiment (positive, negative, or neutral) while coping with social slang and lingo as well as non-textual expressions and language.
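
As a rough illustration of these two goals, the sketch below scores a sentence with a tiny lexicon after normalizing slang and reading emoticons, then labels a document by majority vote over its sentences. It is not the SES system: the lexicons, the negation rule, and the voting scheme are hypothetical stand-ins chosen to keep the example short.

```python
import re

# Tiny hypothetical lexicons; a real system would use far larger resources.
SLANG = {"gr8": "great", "luv": "love", "h8": "hate"}
EMOTICONS = {":)": 1, ":-)": 1, ":(": -1, ":-(": -1}
POLARITY = {"great": 1, "love": 1, "good": 1, "hate": -1, "bad": -1, "awful": -1}
NEGATIONS = {"not", "never", "no"}


def sentence_sentiment(sentence):
    """Label one sentence as positive, negative, or neutral."""
    score = 0
    negate = False
    for token in sentence.lower().split():
        if token in EMOTICONS:  # non-textual expression
            score += EMOTICONS[token]
            continue
        word = re.sub(r"[^a-z0-9']", "", token)  # strip punctuation
        word = SLANG.get(word, word)  # map slang to a standard word
        if word in NEGATIONS:
            negate = True  # flip the next polarity word
            continue
        if word in POLARITY:
            value = POLARITY[word]
            score += -value if negate else value
            negate = False
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def document_sentiment(sentences):
    """Label a document by majority vote over its sentence labels."""
    labels = [sentence_sentiment(s) for s in sentences]
    pos, neg = labels.count("positive"), labels.count("negative")
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"


review = ["I luv this phone :)", "Battery is not bad", "Screen is awful"]
label = document_sentiment(review)
```

The example review has two positive sentences and one negative one, so the document-level label is positive even though one sentence disagrees.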

SES Framework

Figure 3: SES Architecture.

Text classification

The goal is to group similar texts or messages so that the information becomes more manageable.
Employ a framework that clusters similar texts belonging to the same topic, and groups several similar topics under a broad category.
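
A minimal sketch of grouping similar messages is shown below. It swaps in a deliberately simple technique, greedy single-pass clustering over bag-of-words vectors with cosine similarity, rather than the mixture-model framework used in the project. The threshold value and the sample headlines are hypothetical.

```python
import math
from collections import Counter


def vectorize(text):
    """Represent a text as a bag of lowercase words."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def group_texts(texts, threshold=0.3):
    """Assign each text to its most similar group, or start a new group.

    A text joins a group only if its similarity to the group's summed
    word counts reaches the (assumed) threshold.
    """
    groups = []  # each group: {"centroid": Counter, "members": [str]}
    for text in texts:
        vec = vectorize(text)
        best, best_sim = None, threshold
        for group in groups:
            sim = cosine(vec, group["centroid"])
            if sim >= best_sim:
                best, best_sim = group, sim
        if best is None:
            groups.append({"centroid": Counter(vec), "members": [text]})
        else:
            best["centroid"].update(vec)
            best["members"].append(text)
    return [group["members"] for group in groups]


# Hypothetical headlines covering two topics.
texts = [
    "senate passes budget bill",
    "lakers win playoff game",
    "budget bill heads to senate vote",
    "playoff game goes to overtime lakers lose",
]
topics = group_texts(texts)
```

The same routine could be applied a second time to the group centroids with a lower threshold to merge related topics into broader categories, which is one simple way to approximate the two-level grouping described above.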

Constraint Mixture Clustering Framework

Figure 4: The general framework of learning to cluster web text with the CML model.

Text and Network based Classification Framework

Figure 5: System Architecture.

Publications

Bharath Pattabiraman, Mostofa Patwary, Assefaw Gebremedhin, Wei-keng Liao, and Alok Choudhary. Fast Algorithms for the Maximum Clique Problem on Massive Sparse Graphs. In the 10th Workshop on Algorithms and Models for the Web Graph, December 2013. (pdf)
Diana Palsetia, Md Mostofa Ali Patwary, Kunpeng Zhang, Kathy Lee, Christopher Moran, Yusheng Xie, Daniel Honbo, Ankit Agrawal, Wei-keng Liao, and Alok Choudhary. User-Interest Based Community Extraction in Social Networks. In the Workshop on Social Network Mining and Analysis, held in conjunction with the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 2012. (pdf)
Yu Cheng, Kunpeng Zhang, Yusheng Xie, Ankit Agrawal, Wei-keng Liao, and Alok Choudhary. Learning to Group Web Text Incorporating Prior Information. In the Workshop on Optimization Based Methods for Emerging Data Mining Problems, held in conjunction with the IEEE International Conference on Data Mining, December 2011. (pdf)
Lalith Polepeddi, Ankit Agrawal, and Alok Choudhary. Poll: A Citation-Text-Based System for Identifying High-Impact Contributions of an Article. In the Workshop on Data Mining in Networks, held in conjunction with the IEEE International Conference on Data Mining, December 2011. (pdf)
Kunpeng Zhang, Yu Cheng, Yusheng Xie, Ankit Agrawal, Diana Palsetia, Kathy Lee, Wei-keng Liao, and Alok Choudhary. SES: Sentiment Elicitation System for Social Media Data. In the Workshop on Sentiment Elicitation from Natural Text for Information Retrieval and Extraction, held in conjunction with the IEEE International Conference on Data Mining, December 2011. (pdf)
Kathy Lee, Diana Palsetia, Md Mostofa Ali Patwary, Ankit Agrawal, Alok Choudhary, and Ramanathan Narayanan. Twitter Trending Topic Classification. In the Workshop on Optimization Based Methods for Emerging Data Mining Problems, held in conjunction with the IEEE International Conference on Data Mining, December 2011. (pdf)
Kunpeng Zhang, Yu Cheng, Wei-keng Liao, and Alok Choudhary. Mining Millions of Reviews: A Technique to Rank Products Based on Importance of Reviews. In the International Conference on Electronic Commerce, August 2011. (pdf) This paper was awarded Honourable Mention at the conference ICEC 2011, Liverpool, England.
Alok Choudhary, William Hendrix, Kathy Lee, Diana Palsetia, and Wei-keng Liao. Social Media Evolution of The Egyptian Revolution. Communications of the ACM, 55(5):74-80, ACM, New York, NY, USA, May 2012. (pdf)

Source Code

Fast algorithms to find maximum clique in massive graphs

Sentiment Service (API)

Datasets

Facebook comments, Tweets, Amazon reviews sample datasets
