Thai Le

thaile [at] olemiss [dot] edu | Google Scholar | GitHub

Welcome! I am an Assistant Professor in the Department of Computer and Information Science at the University of Mississippi (Ole Miss). I received my doctorate from the College of IST at Penn State, where my advisor was Professor Dongwon Lee. I am an ex-Amazonian at Alexa Privacy and an ex-intern at Yahoo Research and VMware OCTO. My research interests lie in Data Science and Machine Learning, with a focus on Natural Language Processing.

I am looking for highly motivated research interns and master's and Ph.D. students to do research in security & privacy for NLP applications. Ole Miss also offers a competitive four-year award, the FCN Founders Graduate Fellowship, for Ph.D. applicants.

News (From 2022)
  • 03/2023 - Preprint of NoisyHate - an adversarial toxic texts dataset with human-written perturbations - is available
  • 2023 - One paper on the plagiarism behaviors of LLMs is accepted to WWW'23
  • 2023 - Preprint on Unattributable Authorship Text is available
  • 2022 - Tutorial "Catch Me If You GAN: Generation, Detection, and Obfuscation of Deepfake Texts" accepted at WWW'23, with Prof. Dongwon Lee and Adaku Uchendu
  • 2022 - One survey paper on Authorship Detection of Deepfake Texts will be published in SIGKDD Explorations
  • 2022 - One demo paper on perturbations in the wild is accepted at ICDM'23
  • 2022 - PC Members: PKDD'22, EMNLP'22, WSDM'23, AAAI'23, WWW'23
  • 2022 - Accepted tenure-track faculty position at University of Mississippi
  • 2022 - Received the IST Ph.D. Student Award for Research Excellence, College of IST, PSU
  • 2022 - Two papers on adversarial texts are accepted at ACL'22
  • 2022 - One paper on RL-based Adversarial Socialbots is accepted at WWW'22
  • 2022 - One paper on Explainable RL is accepted at AAMAS'22
Resources: Tools and Datasets

NLP Language Models, Neural Text Generation, (Reverse) Turing Test

Arxiv22Z Do Language Models Plagiarize?
Jooyoung Lee, Thai Le, Jinghui Chen, Dongwon Lee
The ACM Web Conference (WWW), 2023

We investigate the privacy risks of large language models' over-memorization behaviors in the context of plagiarism, on both pre-trained and fine-tuned models. Specifically, we analyze three types of plagiarism: verbatim, paraphrase, and idea plagiarism.

Arxiv22X Attribution and Obfuscation of Neural Text Authorship: A Data Mining Perspective
Adaku Uchendu, Thai Le, Dongwon Lee
SIGKDD Explorations, Vol. 25, June 2023

In this survey, we comprehensively review recent literature on the attribution and obfuscation of neural text authorship from a Data Mining perspective, and share our view on its limitations and promising research directions.

EMNLP21 TURINGBENCH: A Benchmark Environment for Turing Test in the Age of Neural Text Generation
Adaku Uchendu, Zeyu Ma, Thai Le, Rui Zhang and Dongwon Lee
Findings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021

While there are many legitimate applications of generative language models, there is also a rising need to distinguish machine-generated texts from human-written ones (e.g., fake news detection). In this work, we present the TURINGBENCH benchmark environment which comprises datasets to evaluate both Turing test and authorship attribution on neural texts.

EMNLP20 Authorship Attribution for Neural Text Generation
Adaku Uchendu, Thai Le, Kai Shu, Dongwon Lee
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020

This paper investigates 8 large language models on two problems: (i) the Turing Test, differentiating between human-written and machine-generated texts, and (ii) Authorship Attribution, differentiating among texts generated by different generative models.

ICDM18 Deep headline generation for clickbait detection
Kai Shu, Suhang Wang, Thai Le, Dongwon Lee, Huan Liu
IEEE International Conference on Data Mining (ICDM), 2018

This work proposes to generate synthetic headlines with specific styles and explores their utility for improving clickbait detection. In particular, we propose to generate stylized headlines from original documents with style transfer.

Security and Privacy

Arxiv23B NoisyHate: Benchmarking Content Moderation Machine Learning Models with Human-Written Perturbations Online
Yiran Ye, Thai Le, Dongwon Lee
Preprint, 2023

We introduce a benchmark test set of human-written perturbations collected online for toxic speech detection models. We test several spell-correction algorithms on this dataset. We also test state-of-the-art language models, such as BERT and RoBERTa, and black-box APIs, such as Perspective API, on this data to demonstrate that adversarial attacks with real human-written perturbations are still effective.

Arxiv23B UPTON: Unattributable Authorship Text via Data Poisoning
Ziyao Wang, Thai Le, Dongwon Lee
Preprint, 2023

This work proposes UPTON, which uses data poisoning to destroy the authorship features in training samples only, perturbing them so that released textual data becomes unlearnable by deep neural networks. This differs from previous obfuscation works, which use adversarial attacks to modify test samples and mislead an authorship attribution (AA) model, and from backdoor works, which use trigger words in both test and training samples and change the model output only when the trigger words occur.

Arxiv22A CrypText: Interactive Discovery and Visualization of Human-Written Text Perturbations in the Wild
Thai Le, Yiran Ye, Yifan Hu, Dongwon Lee
ICDM (Demo), 2023

No available framework explores and utilizes the human-written text perturbation patterns found online. Therefore, we introduce an interactive system called CrypText, which is a collection of tools for users to extract and interact with human-written perturbations. Specifically, CrypText helps look up, perturb, and normalize (i.e., de-perturb) texts. CrypText also provides an interactive interface to monitor and analyze text perturbations online.

Arxiv21A SHIELD: Defending Textual Neural Networks against Multiple Black-Box Adversarial Attacks with Stochastic Multi-Expert Patcher [code]
Thai Le, Noseong Park, Dongwon Lee
Annual Meeting of the Association for Computational Linguistics (ACL), 2022

We propose the SHIELD algorithm, which transforms a textual NN model into a stochastic ensemble of multi-expert predictors by upgrading and re-training only its last layer. Whenever an adversary tries to fool the model, SHIELD confuses the attacker by automatically utilizing different subsets of predictors that are specialized in different sets of features, classes, and instances.

Arxiv21B Perturbations in the Wild: Leveraging Human-Written Text Perturbations for Realistic Adversarial Attack and Defense [code]
Thai Le, Jooyoung Lee, Kevin Yen, Yifan Hu, Dongwon Lee
Annual Meeting of the Association for Computational Linguistics (ACL), 2022 (Findings)

We propose a novel algorithm, ANTHRO, that inductively extracts over 600K human-written text perturbations in the wild and leverages them for realistic adversarial attack and defense. Unlike existing character-based attacks, which often deductively hypothesize a set of manipulation strategies, our work is grounded in actual observations from real-world texts.

Arxiv21B Socialbots on Fire: Modeling Adversarial Behaviors of Socialbots via Multi-Agent Hierarchical Reinforcement Learning. [code]
Thai Le, Long-Thanh Tran, Dongwon Lee
The Web Conference (WWW), 2022

The adversarial nature of socialbots has not yet been studied. This raises the question: "Can adversaries controlling socialbots exploit AI techniques to their advantage?" We demonstrate that adversaries can indeed exploit computational learning mechanisms such as reinforcement learning (RL) to maximize the influence of socialbots while avoiding detection.

ACL21 A Sweet Rabbit Hole by DARCY: Using Honeypots to Detect Universal Trigger’s Adversarial Attacks. [code]
Thai Le, Noseong Park, Dongwon Lee
Annual Meeting of the Association for Computational Linguistics (ACL), 2021

This work borrows the "honeypot" concept from the cybersecurity community and proposes DARCY, a novel honeypot-based defense framework against the UniTrigger attack. DARCY greedily searches for and injects multiple trapdoors into a neural network model to "bait and catch" potential attacks.

ICDM20 MALCOM: Generating Malicious Comments to Attack Neural Fake News Detection Models
Thai Le, Suhang Wang, Dongwon Lee
IEEE International Conference on Data Mining (ICDM), 2020

This work (i) proposes a novel attack scenario against fake news detectors, in which adversaries can post malicious comments on news articles to mislead SOTA fake news detectors, and (ii) develops MALCOM, an end-to-end adversarial comment generation framework to achieve such an attack.

Explainable AI

KDD20 A Policy-Graph Approach to Explain Reinforcement Learning Agents: A Novel Policy-Graph Approach with Natural Language and Counterfactual Abstractions for Explaining Reinforcement Learning Agents
Tongtong Liu, Joe McCalmon, Thai Le, Dongwon Lee, Sarra Alqahtani
Preprint, 2023

We propose a novel approach that summarizes an agent’s policy in the form of a directed graph with natural language descriptions and counterfactual explanations.

KDD20 GRACE: Generating Concise and Informative Contrastive Sample to Explain Neural Network Model’s Prediction [code]
Thai Le, Suhang Wang, Dongwon Lee
ACM SIGKDD Int’l Conf. on Knowledge Discovery and Data Mining (KDD), 2020

This work borrows two notable ideas (i.e., "explanation by intervention" from causality and "explanations are contrastive" from philosophy) and proposes a novel solution, named GRACE, that better explains neural network models' predictions for tabular datasets. In particular, given a model's prediction as label X, GRACE intervenes and generates a minimally modified contrastive sample to be classified as Y, with an intuitive textual explanation, answering the question "Why X rather than Y?"

KDD20 CAPS: Comprehensible Abstract Policy Summaries for Explaining Reinforcement Learning Agents
Joe McCalmon, Thai Le, Sarra Alqahtani and Dongwon Lee
International Conference on Autonomous Agents and Multiagent Systems (AAMAS), 2022

This work proposes a novel approach that summarizes an agent's policy in the form of a directed graph with natural language descriptions. A decision-tree-based clustering method abstracts the state space of the task into fewer, condensed states, which makes the policy graphs more digestible to end users. This abstraction allows users to control the size of the policy graph to achieve their desired balance between comprehensibility and accuracy. In addition, we develop a heuristic optimization method to find the most explainable policy graph and present it to the users. Finally, we use user-defined predicates to enrich the abstract states with semantic meaning.

Learning under Uncertainty

PKDD21 CHECKER: Detecting Clickbait Thumbnails with Weak Supervision and Co-Teaching
Tianyi Xie, Thai Le, Dongwon Lee
European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD), 2021

This work develops CHECKER to exploit: (1) the weak supervision framework to generate many noisy-but-useful labels, and (2) the co-teaching framework to learn robustly using such noisy labels to detect clickbait thumbnails on video-streaming websites.

ASONAM19 5 Sources of Clickbaits You Should Know! Using Synthetic Clickbaits to Improve Prediction and Distinguish between Bot-Generated and Human-Written Headlines
Thai Le, Kai Shu, Maria Molina, Dongwon Lee, Shyam Sundar, Huan Liu
IEEE/ACM Int’l Conf. on Social Networks Analysis and Mining (ASONAM), 2019

This work investigates how to exploit human and computer generative models to generate synthetic clickbaits as additional training data to train better ML clickbait detectors. We observe an improvement of up to 8.5% in AUC, even for top-ranked clickbait detectors from Clickbait Challenge 2017.

Computational Misinformation

CHI20 Does Clickbait Actually Attract More Clicks? Three Clickbait Studies You Must Read
Maria Molina, S. Shyam Sundar, Md Main Uddin Rony, Naeemul Hassan, Thai Le, Dongwon Lee
ACM Conference on Human Factors in Computing Systems (CHI), 2021

This work carries out three user studies to investigate why users do not reliably click more often on headlines classified as clickbait by automated classifiers.

CR21 Reading, Commenting and Sharing of Fake News: How Online Bandwagons and Bots Dictate User Engagement
Maria Molina, Jinping Wang, S. Shyam Sundar, Thai Le, Carlina DiRusso

Do social media users read, comment, and share false news more than real news? Does it matter if the story is written by a bot and whether it is endorsed by many others? We conducted a selective-exposure experiment (N = 171) to answer these questions.

ABS19 “Fake News” is Not Simply False Information: A Concept Explication and Taxonomy of Online Content
Maria Molina, Shyam Sundar, Thai Le, Dongwon Lee
American Behavioral Scientist, 2019

This work conducts an explication of “fake news” that, as a concept, has ballooned to include more than simply false information, with partisans weaponizing it to cast aspersions on the veracity of claims made by those who are politically opposed to them.

AEJMC19 Effects of Bandwagon Cues and Automated Journalism on Reading, Commenting and Sharing of Real vs. False Information Online

Best Paper Award

Maria Molina, Jinping Wang, Thai Le, Carlina DiRusso, S. Shyam Sundar
Conference of the Association for Education in Journalism and Mass Communication (AEJMC), 2019

Do social media users read, comment, and share false news more than real news? Does it matter if the story is written by a bot, and whether it is endorsed by many others? We conducted a selective-exposure experiment (N = 171) to answer these questions.

Websci19 How Gullible Are You? Predicting Susceptibility to Fake News
Jia Shen, Robert Cowell, Aditi Gupta, Thai Le, Amulya Yadav, Dongwon Lee
International ACM Web Science Conference (WebSci), 2019

This work hypothesizes that some social media users are more gullible to fake news than others, and accordingly investigates the susceptibility of users to fake news, i.e., how to identify susceptible users, what their characteristics are, and whether one can build a prediction model.

Arxiv17 Machine Learning Based Detection of Clickbait Posts in Social Media
Xinyue Cao, Thai Le, Jason (Jiasheng) Zhang
Arxiv, 2017

This work attempts to build an effective computational model to detect clickbaits on Twitter as part of the Clickbait Challenge 2017.


KDD21 Large-Scale Data-Driven Airline Market Influence Maximization
Duanshun Li, Jing Liu, Jinsung Jeon, Seoyoung Hong, Thai Le, Noseong Park, Dongwon Lee
ACM SIGKDD Int’l Conf. on Knowledge Discovery and Data Mining (KDD), 2021

This work presents a prediction-driven optimization framework to maximize market influence in the US domestic air passenger transportation market by adjusting flight frequencies.

ICDM19 PathFinder: Graph-based Itemset Embedding for Learning Course Recommendation and Beyond
Jason (Jiasheng) Zhang, Thai Le, Yiming Liao, Dongwon Lee
IEEE International Conference on Data Mining (ICDM), 2019, Demo Paper

This paper demonstrates a tool that captures and visualizes rich latent relationships among courses as a graph, mines students’ past course performance data, and recommends pathways or top-k courses most helpful to a given student, using an itemset embedding based learning model.

SPWLAALS19 A Machine Learning Framework for Automating Well Log Depth Matching
Thai Le, Lin Liang, Timon Zimmermann, Smaine Zeroug, Denis Helio
Journal of Petrophysics, 2019

This work develops a machine learning model and statistical metrics for depth matching well logs acquired from multiple logging passes in a single well.

Last updated on 03/18/2023