Aylin, greenie and Rebekah Overdorf
Stylometry is the study of linguistic style found in text. It existed long before computers, but the field is now dominated by artificial intelligence techniques.
Writing style is a marker of identity: the linguistic information in a document can be used to recognize its author. Authorship recognition is a threat to anonymity, but knowing how authors are identified also provides methods for anonymizing them. Even basic stylometry systems reach high accuracy in classifying authors correctly.
Stylometry can also be applied to source code to identify the author of a program. In this talk, we investigate methods to de-anonymize authors of C++ source code and authors writing across different domains. Source code authorship attribution could provide proof of authorship in court, automate the process of finding a cyber criminal from the source code left on an infected system, or aid in resolving copyright, copyleft, and plagiarism issues in programming. Programmers can obfuscate their variable or function names, but not the structures they subconsciously prefer to use or their favorite increment operators. Following this intuition, we create a new feature set that reflects coding style through properties derived from abstract syntax trees. We reach 99% accuracy in attributing 36 authors, each with ten files. We experiment with datasets of many different sizes and obtain high true positive rates. Such a representation of coding style has not previously been used as a machine learning feature for authorship attribution, which makes it a valuable contribution to the field.
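To make the abstract-syntax-tree intuition concrete, here is a minimal sketch of structural features that survive identifier renaming. It is an illustration under stated assumptions, not the feature set from the talk: the talk concerns C++, while this sketch uses Python's standard `ast` module on Python snippets as a stand-in, and the function name `ast_style_features`, the `freq:` prefix, and the sample snippets are all hypothetical.

```python
import ast
from collections import Counter


def ast_style_features(source: str) -> dict:
    """Return normalized AST node-type frequencies plus the maximum tree depth.

    Hypothetical illustration only: a stand-in for syntactic style features,
    not the feature set described in the talk.
    """
    tree = ast.parse(source)
    # Count node types only; identifier names are deliberately ignored.
    counts = Counter(type(node).__name__ for node in ast.walk(tree))
    total = sum(counts.values())
    features = {f"freq:{name}": n / total for name, n in counts.items()}

    def depth(node: ast.AST) -> int:
        # Depth of the subtree rooted at `node`, counting the node itself.
        children = list(ast.iter_child_nodes(node))
        return 1 + max((depth(child) for child in children), default=0)

    features["max_depth"] = depth(tree)
    return features


# Two hypothetical samples with the same structure but different identifiers.
sample_a = "total = 0\nfor i in range(10):\n    total += i\n"
sample_b = "acc = 0\nfor k in range(10):\n    acc += k\n"
# The same computation written with a while loop instead of a for loop.
sample_c = "total = 0\ni = 0\nwhile i < 10:\n    total += i\n    i += 1\n"

features_a = ast_style_features(sample_a)
features_b = ast_style_features(sample_b)
features_c = ast_style_features(sample_c)
```

Renaming variables leaves the features unchanged (`features_a == features_b`), while choosing a different loop construct changes them (`features_c` contains `freq:While` and no `freq:For`). In a full pipeline, vectors like these would be passed to a classifier trained on files of known authorship.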
We also examine the need for cross-domain stylometry, where the documents of known authorship and the documents in question are written in different contexts. Specifically, we look at blogs, Twitter feeds, and Reddit comments. While traditional stylometry methods that work well within one domain fail to identify authors across domains, we are able to improve the accuracy of cross-domain stylometry to as high as 80%. Being able to identify authors across domains makes it easier to link identities across the Internet, which makes this a key privacy concern: users can take other measures to protect their anonymity, but because of their unique writing style, they may not be as anonymous as they believe.
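The cross-domain setup can be sketched as follows: texts of known authorship come from one domain, and the text in question comes from another. This is a minimal sketch assuming a simple character-trigram baseline with cosine similarity, which is a common traditional approach and not the improved method from the talk. The names `char_ngrams`, `cosine`, and `attribute` are hypothetical helpers introduced for illustration.

```python
import math
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams after lowercasing and collapsing whitespace."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two n-gram count profiles (0.0 if either is empty)."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def attribute(known: dict, query: str) -> str:
    """Return the candidate author whose known text is most similar to the query.

    `known` maps an author label to text from one domain (for example blog posts);
    `query` is text from another domain (for example a tweet).
    """
    query_profile = char_ngrams(query)
    return max(known, key=lambda author: cosine(char_ngrams(known[author]), query_profile))
```

A baseline like this compares surface character patterns, which can shift with the conventions of each platform. That is one plausible reason why methods tuned within a single domain may not carry over to another.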