News (From 2022)
- 03/2023 - Preprint of NoisyHate, an adversarial toxic-text dataset with human-written perturbations, is available
- 2023 - One paper on the plagiarism behaviors of LLMs is accepted at WWW'23
- 2023 - Preprint on Unattributable Authorship Text is available
- 2022 - Tutorial "Catch Me If You GAN: Generation, Detection, and Obfuscation of Deepfake Texts" accepted at WWW'23, with Prof. Dongwon Lee and Adaku Uchendu
- 2022 - One survey paper on Authorship Detection of Deepfake Texts will be published in SIGKDD Explorations
- 2022 - One demo paper on perturbations in the wild is accepted at ICDM'23
- 2022 - PC member: PKDD'22, EMNLP'22, WSDM'23, AAAI'23, WWW'23
- 2022 - Accepted a tenure-track faculty position at the University of Mississippi
- 2022 - Received the IST Ph.D. Student Award for Research Excellence, College of IST, PSU
- 2022 - Two papers on adversarial texts are accepted at ACL'22
- 2022 - One paper on RL-based Adversarial Socialbots is accepted at WWW'22
- 2022 - One paper on Explainable RL is accepted at AAMAS'22
|
NLP Language Models, Neural Text Generation, (Reverse) Turing Test
|
|
Do Language Models Plagiarize?
Jooyoung Lee, Thai Le, Jinghui Chen, Dongwon Lee
The ACM Web Conference (WWW), 2023
We investigate the privacy risks of large language models' over-memorization behaviors in the context of plagiarism, on both pre-trained and fine-tuned models. Specifically, we analyze three different types of plagiarism, namely verbatim, paraphrase, and idea plagiarism.
|
|
Attribution and Obfuscation of Neural Text Authorship: A Data Mining Perspective
Adaku Uchendu, Thai Le, Dongwon Lee
SIGKDD Explorations, Vol. 25, June 2023
In this survey, we present a comprehensive review of the recent literature on the attribution and obfuscation of neural text authorship from a data mining perspective, and share our view on their limitations and promising research directions.
|
|
TURINGBENCH: A Benchmark Environment for Turing Test in the Age of Neural Text Generation
Adaku Uchendu, Zeyu Ma, Thai Le, Rui Zhang, Dongwon Lee
Findings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021
While there are many legitimate applications of generative language models, there is also a rising need to distinguish machine-generated texts from human-written ones (e.g., fake news detection). In this work, we present the TURINGBENCH benchmark environment which comprises datasets to evaluate both Turing test and authorship attribution on neural texts.
|
|
Authorship Attribution for Neural Text Generation
Adaku Uchendu, Thai Le, Kai Shu, Dongwon Lee
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020
This paper investigates 8 large language models on two problems: (i) Turing Test, differentiating between human-written and machine-generated texts, and (ii) Authorship Attribution, differentiating among texts generated by different generative models.
|
|
Deep Headline Generation for Clickbait Detection
Kai Shu, Suhang Wang, Thai Le, Dongwon Lee, Huan Liu
IEEE International Conference on Data Mining (ICDM), 2018
This work proposes to generate synthetic headlines with specific styles and explores their utility in improving clickbait detection. In particular, we propose to generate stylized headlines from original documents via style transfer.
|
|
NoisyHate: Benchmarking Content Moderation Machine Learning Models with Human-Written Perturbations Online
Yiran Ye, Thai Le, Dongwon Lee
Preprint, 2023
We introduce a benchmark test set of human-written perturbations collected online for toxic speech detection models. We test several spell-correction algorithms on this dataset, as well as state-of-the-art language models such as BERT and RoBERTa and black-box APIs such as Perspective API, to demonstrate that adversarial attacks with real human-written perturbations remain effective.
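For a concrete flavor of this evaluation protocol, here is a minimal sketch (not the paper's code) comparing a toxicity classifier's outputs on clean text versus a human-perturbed variant; the model checkpoint and the example pairs are placeholder assumptions, not items from NoisyHate.
```python
# Hypothetical sketch: compare a toxicity classifier's outputs on clean
# text vs. a human-written perturbation. The checkpoint name and the pairs
# are placeholders, not items from NoisyHate.
from transformers import pipeline

clf = pipeline("text-classification", model="unitary/toxic-bert")

# (clean, perturbed) pairs; real pairs would come from the benchmark
pairs = [
    ("you are an idiot", "you are an id1ot"),
    ("i hate you so much", "i h@te you so much"),
]

for clean, noisy in pairs:
    c, n = clf(clean)[0], clf(noisy)[0]
    print(f"{clean!r}: {c['label']} ({c['score']:.2f})  ->  "
          f"{noisy!r}: {n['label']} ({n['score']:.2f})")
```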
|
|
UPTON: Unattributable Authorship Text via Data Poisoning
Ziyao Wang, Thai Le, Dongwon Lee
Preprint, 2023
This work proposes UPTON, which uses data poisoning to destroy the authorship features in training samples alone by perturbing them, making released textual data unlearnable by deep neural networks. It differs from previous obfuscation works, which use adversarial attacks to modify test samples and mislead an authorship attribution (AA) model, and from backdoor works, which insert trigger words into both training and test samples and change the model output only when the triggers occur.
|
|
CrypText: Interactive Discovery and Visualization of Human-Written Text Perturbations in the Wild
Thai Le, Yiran Ye, Yifan Hu, Dongwon Lee
ICDM (Demo), 2023
No existing framework explores and utilizes the human-written text perturbation patterns found online. We therefore introduce CrypText, an interactive system comprising a collection of tools for users to extract and interact with human-written perturbations. Specifically, CrypText helps users look up, perturb, and normalize (i.e., de-perturb) texts, and provides an interactive interface to monitor and analyze text perturbations online.
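To illustrate the look-up / perturb / normalize operations, here is a toy sketch over a made-up perturbation dictionary; the dictionary contents and function names are hypothetical, not CrypText's actual API.
```python
# Toy sketch of CrypText-style operations over a perturbation dictionary.
# Dictionary contents and function names are hypothetical.
import random

# word -> perturbed variants observed "in the wild" (made-up examples)
PERTURBATIONS = {
    "hate": ["h@te", "h8te"],
    "stupid": ["stup1d", "st*pid"],
}
# reverse index used for normalization (de-perturbation)
CANONICAL = {v: k for k, variants in PERTURBATIONS.items() for v in variants}

def lookup(word):
    """Return the known perturbed variants of a word."""
    return PERTURBATIONS.get(word.lower(), [])

def perturb(text):
    """Swap each word for a random known perturbation, when one exists."""
    return " ".join(random.choice(lookup(w)) if lookup(w) else w
                    for w in text.split())

def normalize(text):
    """Map perturbed tokens back to their canonical forms."""
    return " ".join(CANONICAL.get(w.lower(), w) for w in text.split())

print(perturb("i hate this stupid thing"))    # e.g. "i h8te this stup1d thing"
print(normalize("i h@te this stup1d thing"))  # "i hate this stupid thing"
```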
|
|
SHIELD: Defending Textual Neural Networks against Multiple Black-Box Adversarial Attacks with Stochastic Multi-Expert Patcher
[code]
Thai Le, Noseong Park, Dongwon Lee
Annual Meeting of the Association for Computational Linguistics (ACL), 2022
We propose the SHIELD algorithm, which transforms a textual NN model into a stochastic ensemble of multi-expert predictors by upgrading and re-training only its last layer. Whenever an adversary tries to fool the model, SHIELD confuses the attacker by automatically utilizing different subsets of predictors that are specialized in different sets of features, classes, and instances.
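A minimal PyTorch sketch of the underlying idea follows, assuming a frozen encoder whose output features feed several expert heads; the dimensions, expert count, and uniform-sampling gate are illustrative assumptions, not the released implementation.
```python
# Minimal sketch: swap a classifier's last layer for several expert heads
# and average a random subset per query, so repeated probes see different
# predictors. Sizes and the uniform gate are illustrative assumptions.
import torch
import torch.nn as nn

class StochasticMultiExpertHead(nn.Module):
    def __init__(self, hidden_dim=256, num_classes=2, num_experts=5, k=3):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Linear(hidden_dim, num_classes) for _ in range(num_experts)
        )
        self.k = k  # experts sampled per forward pass

    def forward(self, features):
        idx = torch.randperm(len(self.experts))[: self.k]  # random subset
        logits = torch.stack([self.experts[i](features) for i in idx])
        return logits.mean(dim=0)  # aggregate the sampled experts

head = StochasticMultiExpertHead()
features = torch.randn(4, 256)  # stand-in for a frozen encoder's output
print(head(features).shape)     # torch.Size([4, 2])
```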
|
|
Perturbations in the Wild: Leveraging Human-Written Text Perturbations for Realistic Adversarial Attack and Defense
[code]
Thai Le, Jooyoung Lee, Kevin Yen, Yifan Hu, Dongwon Lee
Annual Meeting of the Association for Computational Linguistics (ACL), 2022 (Findings)
We propose a novel algorithm, ANTHRO, that inductively extracts over 600K human-written text perturbations in the wild and leverages them for realistic adversarial attack and defense. Unlike existing character-based attacks, which often deductively hypothesize a set of manipulation strategies, our work is grounded in actual observations from real-world texts.
|
|
Socialbots on Fire: Modeling Adversarial Behaviors of Socialbots via Multi-Agent Hierarchical Reinforcement Learning
[code]
Thai Le, Long Tran-Thanh, Dongwon Lee
The Web Conference (WWW), 2022
The adversarial nature of socialbots has not yet been studied. This begs the question: "Can adversaries, controlling socialbots, exploit AI techniques to their advantage?" We successfully demonstrate that it is indeed possible for adversaries to exploit computational learning mechanisms such as reinforcement learning (RL) to maximize the influence of socialbots while avoiding detection.
|
|
A Sweet Rabbit Hole by DARCY: Using Honeypots to Detect Universal Trigger's Adversarial Attacks
[code]
Thai Le, Noseong Park, Dongwon Lee
Annual Meeting of the Association for Computational Linguistics (ACL), 2021
This work borrows the "honeypot" concept from the cybersecurity community and proposes DARCY, a novel honeypot-based defense framework against the UniTrigger attack. DARCY greedily searches for and injects multiple trapdoors into a neural network model to "bait and catch" potential attacks.
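A toy sketch of the "bait and catch" intuition follows; the trapdoor tokens, data, and string-match detector are made-up simplifications standing in for the paper's learned framework.
```python
# Toy sketch: plant trapdoor tokens in training copies so a
# universal-trigger search tends to rediscover one, then flag inputs
# carrying a trapdoor. Everything here is a made-up simplification.
TRAPDOORS = {"positive": ["zq_alpha"], "negative": ["zq_beta"]}

train = [("great movie", "positive"), ("terrible plot", "negative")]

# 1) augment the training data with trapdoor-carrying copies
augmented = list(train)
for text, label in train:
    for token in TRAPDOORS[label]:
        augmented.append((f"{token} {text}", label))
# (a classifier would now be trained on `augmented`)

# 2) at test time, flag inputs containing a known trapdoor token
def is_baited(text):
    tokens = set(text.split())
    return any(t in tokens for ts in TRAPDOORS.values() for t in ts)

print(is_baited("zq_alpha this film was awful"))  # True  -> likely an attack
print(is_baited("this film was awful"))           # False
```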
|
|
MALCOM: Generating Malicious Comments to Attack Neural Fake News Detection Models
Thai Le, Suhang Wang, Dongwon Lee
IEEE International Conference on Data Mining (ICDM), 2020
This work (i) proposes a novel attack scenario against fake news detectors, in which adversaries can post malicious comments toward news articles to mislead SOTA fake news detectors, and (ii) develops Malcom, an end-to-end adversarial comment generation framework to achieve such an attack.
|
|
A Novel Policy-Graph Approach with Natural Language and Counterfactual Abstractions for Explaining Reinforcement Learning Agents
Tongtong Liu, Joe McCalmon, Thai Le, Dongwon Lee, Sarra Alqahtani
Preprint, 2023
We propose a novel approach that summarizes an agent's policy in the form of a directed graph with natural language descriptions and counterfactual explanations.
|
|
GRACE: Generating Concise and Informative Contrastive Sample to Explain Neural Network Model’s Prediction
[code]
Thai Le, Suhang Wang, Dongwon Lee
ACM SIGKDD Int’l Conf. on Knowledge Discovery and Data Mining (KDD), 2020
This work borrows two notable ideas (i.e., "explanation by intervention" from causality and "explanations are contrastive" from philosophy) and proposes a novel solution, named GRACE, that better explains neural network models' predictions for tabular datasets. In particular, given a model's prediction of label X, GRACE intervenes and generates a minimally-modified contrastive sample classified as Y, together with an intuitive textual explanation, answering the question of "Why X rather than Y?"
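To make the "Why X rather than Y?" mechanic concrete, here is a minimal sketch on synthetic data: greedily grow single-feature edits until the classifier's prediction flips, then report the edit. The model, data, and greedy search are illustrative stand-ins, not GRACE's actual optimization.
```python
# Minimal sketch on synthetic data: grow single-feature edits until the
# classifier's prediction flips, then report the change.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] - X[:, 2] > 0).astype(int)  # label depends on features 0 and 2
model = LogisticRegression().fit(X, y)

def contrastive(x, step=0.25, max_steps=40):
    """Return (feature, delta, sample) for the smallest found label flip."""
    original = model.predict([x])[0]
    for s in range(1, max_steps + 1):          # increasingly large edits
        for i in range(len(x)):                # one feature at a time
            for delta in (s * step, -s * step):
                trial = x.copy()
                trial[i] += delta
                if model.predict([trial])[0] != original:
                    return i, delta, trial
    return None

x0 = X[0]
i, delta, x1 = contrastive(x0)
print(f"Why {model.predict([x0])[0]} rather than {model.predict([x1])[0]}? "
      f"Feature {i} changed by {delta:+.2f}.")
```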
|
|
CAPS: Comprehensible Abstract Policy Summaries for Explaining Reinforcement Learning Agents
Joe McCalmon, Thai Le, Sarra Alqahtani and Dongwon Lee
International Conference on Autonomous Agents and Multiagent Systems (AAMAS), 2022
This work proposes a novel approach that summarizes an agent's policy in the form of a directed graph with natural language descriptions. A decision-tree-based clustering method abstracts the state space of the task into fewer, condensed states, which makes the policy graphs more digestible to end users. This abstraction allows users to control the size of the policy graph to achieve their desired balance between comprehensibility and accuracy. In addition, we develop a heuristic optimization method to find the most explainable graph policy and present it to the users. Finally, we use user-defined predicates to enrich the abstract states with semantic meaning.
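A toy sketch of the graph-abstraction step follows, with k-means standing in for the paper's decision-tree clustering and random rollouts standing in for a trained agent; everything here is illustrative.
```python
# Toy sketch of policy-graph abstraction: cluster raw states into abstract
# states, then label each abstract edge with the agent's most frequent
# action. K-means replaces the paper's decision-tree clustering here.
from collections import Counter, defaultdict

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
states = rng.normal(size=(300, 4))                 # raw state vectors
actions = rng.choice(["left", "right"], size=300)  # agent's actions
next_states = states + rng.normal(scale=0.1, size=states.shape)

# 1) abstract the state space into a handful of condensed states
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
src = kmeans.fit_predict(states)
dst = kmeans.predict(next_states)

# 2) count actions per abstract edge and keep the most frequent one
edge_actions = defaultdict(Counter)
for s, a, t in zip(src, actions, dst):
    edge_actions[(s, t)][a] += 1

for (s, t), counts in sorted(edge_actions.items()):
    action, n = counts.most_common(1)[0]
    print(f"state {s} --[{action} x{n}]--> state {t}")
```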
|
Learning under Uncertainty
|
|
CHECKER: Detecting Clickbait Thumbnails with Weak Supervision and Co-Teaching
Tianyi Xie, Thai Le, Dongwon Lee
European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD), 2021
This work develops CHECKER to exploit: (1) the weak supervision framework to generate many noisy-but-useful labels, and (2) the co-teaching framework to learn robustly using such noisy labels to detect clickbait thumbnails on video-streaming websites.
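Below is a minimal sketch of a single co-teaching update, the second ingredient above: each model selects the small-loss (likely clean) samples in a batch, and its peer trains on them. The models, sizes, and keep ratio are toy assumptions, not CHECKER's implementation.
```python
# Minimal sketch of one co-teaching step: the two models exchange the
# small-loss samples in each batch, so each trains on examples its peer
# considers "clean". Models and data are toy stand-ins.
import torch
import torch.nn as nn

def coteach_step(model_a, model_b, opt_a, opt_b, x, y, keep_ratio=0.7):
    loss_fn = nn.CrossEntropyLoss(reduction="none")
    k = max(1, int(keep_ratio * len(y)))

    with torch.no_grad():
        idx_a = loss_fn(model_a(x), y).argsort()[:k]  # a's "clean" picks
        idx_b = loss_fn(model_b(x), y).argsort()[:k]  # b's "clean" picks

    # each model trains on the samples selected by its peer
    for model, opt, idx in ((model_a, opt_a, idx_b), (model_b, opt_b, idx_a)):
        opt.zero_grad()
        loss_fn(model(x[idx]), y[idx]).mean().backward()
        opt.step()

model_a, model_b = nn.Linear(8, 2), nn.Linear(8, 2)
opt_a = torch.optim.SGD(model_a.parameters(), lr=0.1)
opt_b = torch.optim.SGD(model_b.parameters(), lr=0.1)
coteach_step(model_a, model_b, opt_a, opt_b,
             torch.randn(32, 8), torch.randint(0, 2, (32,)))
```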
|
|
5 Sources of Clickbaits You Should Know! Using Synthetic Clickbaits to Improve Prediction and Distinguish between Bot-Generated and Human-Written Headlines
Thai Le, Kai Shu, Maria Molina, Dongwon Lee, Shyam Sundar, Huan Liu
IEEE/ACM Int’l Conf. on Social Networks Analysis and Mining (ASONAM), 2019
This work investigates how to exploit human and computer generative models to produce synthetic clickbaits as additional training data for better ML clickbait detectors. We observe an improvement in accuracy of up to 8.5% in AUC, even for top-ranked clickbait detectors from the Clickbait Challenge 2017.
|
Computational Misinformation
|
|
Does Clickbait Actually Attract More Clicks? Three Clickbait Studies You Must Read
Maria Molina, S. Shyam Sundar, Md Main Uddin Rony, Naeemul Hassan, Thai Le, Dongwon Lee
ACM Conference on Human Factors in Computing Systems (CHI), 2021
This work carries out three user-studies to investigate why users do not reliably click more often on headlines classified as clickbait by automated classifiers.
|
|
Reading, Commenting, and Sharing of Fake News: How Online Bandwagons and Bots Dictate User Engagement
Maria Molina, Jinping Wang, S. Shyam Sundar, Thai Le, Carlina DiRusso
Communication Research
Do social media users read, comment, and share false news more than real news? Does it matter if the story is written by a bot and whether it is endorsed by many others? We conducted a selective-exposure experiment (N = 171) to answer these questions.
|
|
“Fake News” Is Not Simply False Information: A Concept Explication and Taxonomy of Online Content
Maria Molina, S. Shyam Sundar, Thai Le, Dongwon Lee
American Behavioral Scientist, 2019
This work conducts an explication of “fake news” that, as a concept, has ballooned to include more than simply false information, with partisans weaponizing it to cast aspersions on the veracity of claims made by those who are politically opposed to them.
|
|
Effects of Bandwagon Cues and Automated Journalism on Reading, Commenting and Sharing of Real vs. False Information Online
Best Paper Award
Maria Molina, Jinping Wang, Thai Le, Carlina DiRusso, S. Shyam Sundar
Conference of the Association for Education in Journalism and Mass Communication (AEJMC), 2019
Do social media users read, comment, and share false news more than real news? Does it matter if the story is written by a bot, and whether it is endorsed by many others? We conducted a selective-exposure experiment (N = 171) to answer these questions.
|
|
How Gullible Are You? Predicting Susceptibility to Fake News
Jia Shen, Robert Cowell, Aditi Gupta, Thai Le, Amulya Yadav, Dongwon Lee
International ACM Web Science Conference (WebSci), 2019
This work hypothesizes that some social media users are more gullible to fake news than others, and accordingly investigates the susceptibility of users to fake news, i.e., how to identify susceptible users, what their characteristics are, and whether one can build a prediction model.
|
|
Machine Learning Based Detection of Clickbait Posts in Social Media
Xinyue Cao, Thai Le, Jason (Jiasheng) Zhang
arXiv, 2017
This work builds an effective computational model to detect clickbaits on Twitter as part of the Clickbait Challenge 2017.
|
|
Large-Scale Data-Driven Airline Market Influence Maximization
Duanshun Li, Jing Liu, Jinsung Jeon, Seoyoung Hong, Thai Le, Noseong Park, Dongwon Lee
ACM SIGKDD Int’l Conf. on Knowledge Discovery and Data Mining (KDD), 2021
This work presents a prediction-driven optimization framework to maximize the market influence in the US domestic air passenger transportation market by adjusting flight frequencies.
|
|
PathFinder: Graph-based Itemset Embedding for Learning Course Recommendation and Beyond
Jason (Jiasheng) Zhang, Thai Le, Yiming Liao, Dongwon Lee
IEEE International Conference on Data Mining (ICDM), 2019, Demo Paper
This paper demonstrates a tool that captures and visualizes rich latent relationships among courses as a graph, mines students’ past course performance data, and recommends pathways or top-k courses most helpful to a given student, using an itemset embedding based learning model.
|
|
A Machine Learning Framework for Automating Well Log Depth Matching
Thai Le, Lin Liang, Timon Zimmermann, Smaine Zeroug, Denis Helio
Petrophysics, 2019
This work develops a machine learning model and statistical metrics for depth matching well logs acquired from multiple logging passes in a single well.