Our current directions
Social media, including popular forums, the blogosphere, and social networks, has evolved tremendously in the past decade, and we have witnessed the proliferation of many successful services and applications: Twitter, Flickr, YouTube, Facebook, Hyves, LiveJournal, etc. Data generated by the very large numbers of users of these applications allow for many interesting studies that were hard to imagine before. How social media is produced, cited, and used; how information propagates and is searched for and found; how social communities evolve; what people talk or argue about; what opinions people hold: these and many other questions are interesting to academia and businesses alike.
Social media mining aims to facilitate traditional and new kinds of search, recommendation, and predictive modeling tasks. In the recent past we studied several topics related to social media mining.
Erik Tromp, in his thesis Multilingual Sentiment Analysis on Social Media, investigated automated sentiment analysis on multilingual data from different social media, including Twitter. We studied a four-step approach to this problem, comprising language identification, part-of-speech tagging, subjectivity detection, and polarity detection. For language identification and polarity detection Erik presented new algorithms called LIGA and RBEM, respectively. The experimental study illustrated the benefit of each step in the four-step approach and allowed us to quantify how important it is for the output of the technique used at each step to be as accurate as possible. Erik's thesis won two awards: the Best IT-thesis of the Netherlands 2011, granted by De Koninklijke Hollandsche Maatschappij der Wetenschappen, and the Berenschot thesis award.
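To give a feel for the language identification step, the following is a minimal sketch of a graph-based character trigram identifier. It is only loosely inspired by the idea named in the Benelearn 2011 paper title (Graph-Based N-gram Language Identification on Short Texts) and is not Erik's LIGA implementation: the class name, the scoring rule (sum of per-language relative frequencies of matched trigrams and trigram transitions), and the toy training sentences are all assumptions made for illustration.

```python
from collections import defaultdict


def trigrams(text):
    """Return the overlapping character trigrams of text (lowercased, space-padded)."""
    padded = f" {text.lower()} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


class TrigramGraphLanguageIdentifier:
    """Hypothetical, simplified graph model: nodes are trigrams, edges are
    transitions between consecutive trigrams, both counted per language."""

    def __init__(self):
        self.node_counts = defaultdict(lambda: defaultdict(int))  # trigram -> lang -> count
        self.edge_counts = defaultdict(lambda: defaultdict(int))  # (t1, t2) -> lang -> count
        self.node_totals = defaultdict(int)  # lang -> total trigram count
        self.edge_totals = defaultdict(int)  # lang -> total transition count

    def train(self, text, lang):
        """Add one labeled text to the graph."""
        grams = trigrams(text)
        for gram in grams:
            self.node_counts[gram][lang] += 1
            self.node_totals[lang] += 1
        for edge in zip(grams, grams[1:]):
            self.edge_counts[edge][lang] += 1
            self.edge_totals[lang] += 1

    def scores(self, text):
        """Score each language by the relative weight of matched nodes and edges."""
        grams = trigrams(text)
        result = defaultdict(float)
        for gram in grams:
            for lang, count in self.node_counts.get(gram, {}).items():
                result[lang] += count / self.node_totals[lang]
        for edge in zip(grams, grams[1:]):
            for lang, count in self.edge_counts.get(edge, {}).items():
                result[lang] += count / self.edge_totals[lang]
        return dict(result)

    def identify(self, text):
        """Return the best-scoring language, or None if nothing matched."""
        scored = self.scores(text)
        return max(scored, key=scored.get) if scored else None


# Toy usage with made-up training sentences (assumption, not project data).
identifier = TrigramGraphLanguageIdentifier()
identifier.train("the quick brown fox jumps over the lazy dog", "en")
identifier.train("de snelle bruine vos springt over de luie hond", "nl")
guess_en = identifier.identify("the dog is lazy")
guess_nl = identifier.identify("de hond")
```

Counting transitions as well as trigrams lets the model use a little ordering information, which can matter on short texts such as tweets where few trigrams are available.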
Murat Ongun, in his thesis Utilizing Social Media Data for Search Engine Marketing, studied how to align streaming data from social media with web analytics data and facilitate its mining for different search engine optimization tasks, including additional keyword generation, finding patterns related to geographical regions, and trend detection for managing keyword bids.
Samuel Louvan, in his thesis Web Page Segmentation & Structure Analysis for Eliminating Nonrelevant Content, studied how to identify relevant content in social media websites, including blogs and forums. Most previous approaches used heuristic rule sets to locate the main content. Our contribution in this work is mainly the development of a web content extraction module that uses a hybrid approach, combining machine learning with heuristics developed by Samuel, namely Largest Block String, String Length Smoothing, and Table Pattern. According to our experiments, the combination of machine learning and heuristics gives encouraging results and is competitive with current state-of-the-art web content extraction methods.
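As a rough illustration of what a content-locating heuristic can look like, the sketch below picks the longest run of text between block-level tags. It is an illustrative stand-in only: the exact definitions of Largest Block String, String Length Smoothing, and Table Pattern are not given on this page, and the tag sets, class name, and function name here are assumptions rather than Samuel's method.

```python
from html.parser import HTMLParser

# Assumed tag sets for this sketch; a real module would likely tune these.
BLOCK_TAGS = {"div", "p", "td", "article", "section", "li"}
SKIP_TAGS = {"script", "style"}


class BlockTextCollector(HTMLParser):
    """Collect the text found between block-level tag boundaries."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # finished text blocks
        self._current = []     # text pieces of the block being read
        self._skip_depth = 0   # > 0 while inside <script> or <style>

    def flush(self):
        """Close the current block, normalizing whitespace."""
        text = " ".join(" ".join(self._current).split())
        if text:
            self.blocks.append(text)
        self._current = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._current.append(data)


def largest_text_block(html):
    """Return the longest text block in html, or "" if there is none."""
    collector = BlockTextCollector()
    collector.feed(html)
    collector.close()
    collector.flush()
    return max(collector.blocks, key=len, default="")


# Toy page (made up for illustration).
page = (
    "<html><body><div>Home | About</div>"
    "<div><p>This is the long main post text of the blog entry.</p></div>"
    "<script>var x = 1;</script><div>Footer</div></body></html>"
)
main_text = largest_text_block(page)
```

A purely length-based rule like this is easily fooled by long comment threads or link lists, which is one reason a hybrid with a learned classifier, as described above, is attractive.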
Publications
- Tromp, E. and Pechenizkiy, M. Graph-Based N-gram Language Identification on Short Texts, Benelearn 2011
- Chambers, L., Tromp, E., Pechenizkiy, M. and Gaber, M. Mobile Sentiment Analysis, KES 2012
- Tromp, E. and Pechenizkiy, M. SentiCorr: Multilingual Sentiment Analysis of Personal Correspondence, Demo @ IEEE ICDM 2011 (project page)
- Tromp, E. and Pechenizkiy, M. RBEM: A Rule Based Approach to Polarity Detection, WISDOM @ KDD 2013
- Demirtas, E. and Pechenizkiy, M. Cross-lingual Polarity Detection with Machine Translation, WISDOM @ KDD 2013
Code & Datasets
We are working on making the software, source code, and datasets created and used in this project available to the research community (as long as there are no NDA, IP, ethical, or proprietary concerns). Currently, the following datasets are available:
- CINLP_datasets.zip (description.txt): Preprocessed labeled Twitter datasets, one automatically annotated and two manually annotated, as used in Tromp et al., 2013, submission to CINLP special issue.
- Turkish_Movie_Sentiment.zip and Turkish_Products_Sentiment.zip (descpription.txt): Movie review and multi-domain product review datasets (both in Turkish), as used in Demirtas & Pechenizkiy, WISDOM@KDD'13 (cross-lingual polarity detection with machine translation).
- LIGA_Benelearn11_dataset.zip (description.txt): Preprocessed labeled Twitter data in six languages, as used in Tromp & Pechenizkiy, Benelearn 2011.
- SA_Datasets_Thesis.zip (description.txt): All preprocessed datasets as used in Tromp 2011, MSc Thesis.

Collaboration and Funding
People
- Dept. Computer Science, TU/e: Mykola Pechenizkiy, Erik Tromp, Erkin Demirtas, Murat Ongun, Samuel Louvan
- Renzo de Hoogen
- Guido Budziak, Bob Nieme
- Arthur van Bunningen