Using Email Writing Styles to Reduce Authorship Identification Workload: A Collaboration with the Dyfed Powys Police Cyber Crime Division

The end goal of the collaboration between Cardiff and Dyfed Powys is to produce a data analytics hub.

The hub will take a suspect and identify whether they are the author/operator of one or more possible alias accounts/devices that are already known to be part of criminal activity.

The hub will do this by taking in data from the suspect's "day to day" accounts and devices, such as emails, text messages and information from their social media accounts, and comparing it to similar data from the alias accounts/devices.

It will look at various aspects such as:

  • Writing style
  • Images
  • Connections to other users
  • Location Data

Each capability of the analytics hub (such as different data sources or different methods of analysis) can be developed in a modular fashion.

This enabled us to focus on one aspect and by the end of the 8 weeks hand over a module that was ready to reduce the workload of the cyber crime unit.

The focus chosen was identifying whether a suspect operates one or more "target" email accounts by comparing the writing style of the target accounts with the writing style of an account already known to be owned by the suspect.

Our approach to writing prints is based on previous work (Iqbal et al., 2010), which outlined a set of features the researchers believed could characterise people's writing styles. We also added to the list, bringing the total to over 400 features.

Examples of the features analysed:

  • How many times each character is used.
  • Vocabulary Richness (Yule's K method)
  • Ratio of short words to total words used
  • Whether they use an email signature
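
To make the listed features concrete, the following is a minimal sketch of how four of them might be computed for a single email. It is not the module's actual code: the tokenisation, the three-character cut-off for a "short" word, and the signature heuristic (a "--" delimiter or a common sign-off in the last three lines) are all assumptions chosen for illustration. Yule's K is computed with the standard formula K = 10^4 * (sum(m^2 * V_m) - N) / N^2, where N is the number of words and V_m is the number of distinct words used exactly m times.

```python
import re
from collections import Counter

# Hypothetical sign-off words; the real module's signature test is not described.
SIGN_OFFS = ("regards", "thanks", "cheers", "sincerely")


def char_frequencies(text):
    """Count how many times each character is used (case-insensitive)."""
    return Counter(text.lower())


def yules_k(words):
    """Yule's K vocabulary richness; a lower K means a richer vocabulary."""
    n = len(words)
    if n == 0:
        return 0.0
    occurrences = Counter(words)              # word -> times used
    spectrum = Counter(occurrences.values())  # m -> number of words used m times
    s2 = sum(m * m * v_m for m, v_m in spectrum.items())
    return 1e4 * (s2 - n) / (n * n)


def short_word_ratio(words, max_len=3):
    """Ratio of short words to total words (max_len is an assumed cut-off)."""
    if not words:
        return 0.0
    return sum(1 for w in words if len(w) <= max_len) / len(words)


def has_signature(text):
    """Crude check: do the last three non-empty lines look like a sign-off?"""
    lines = [ln.strip().lower() for ln in text.splitlines() if ln.strip()]
    return any(ln == "--" or ln.startswith(SIGN_OFFS) for ln in lines[-3:])


def extract_features(text):
    """Bundle the example features for one email into a dictionary."""
    words = re.findall(r"[a-z']+", text.lower())
    return {
        "char_frequencies": char_frequencies(text),
        "yules_k": yules_k(words),
        "short_word_ratio": short_word_ratio(words),
        "has_signature": has_signature(text),
    }
```

Calling extract_features on the body of each email in a cluster would give one feature dictionary per email, which is the kind of input the later cluster-level analysis needs.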

We looked at these features across a group of emails known to be sent by the same account (a cluster) and produced a "writing print" that consisted only of features that showed statistical significance across the entire cluster.

For example, if the majority of emails within a cluster showed a low language complexity, we would include "low complexity" within the writing print.
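
The idea of keeping only features that hold across a cluster can be sketched as follows. This is an illustration under stated assumptions, not the significance test used in the project: it assumes each email has already been reduced to a dictionary of feature values scaled to the range 0 to 1, buckets each value into "low", "medium" or "high" using hypothetical thresholds, and treats a feature as significant when more than half of the emails fall into the same bucket. The function names level and writing_print are also hypothetical.

```python
from collections import Counter


def level(value, low=0.33, high=0.66):
    """Bucket a scaled feature value into a coarse level (assumed thresholds)."""
    if value < low:
        return "low"
    if value > high:
        return "high"
    return "medium"


def writing_print(cluster, min_share=0.5):
    """Return {feature: level} for features most emails in the cluster agree on.

    cluster is a list of per-email feature dictionaries that share the same
    keys. A simple majority vote stands in for a real significance test.
    """
    if not cluster:
        return {}
    result = {}
    for name in cluster[0]:
        levels = Counter(level(email[name]) for email in cluster)
        top_level, count = levels.most_common(1)[0]
        if count / len(cluster) > min_share:
            result[name] = top_level
    return result


# Three emails from one account: complexity is consistently low,
# while the short-word ratio varies too much to be characteristic.
cluster = [
    {"complexity": 0.10, "short_word_ratio": 0.2},
    {"complexity": 0.20, "short_word_ratio": 0.5},
    {"complexity": 0.15, "short_word_ratio": 0.9},
]
print(writing_print(cluster))  # {'complexity': 'low'}
```

Raising min_share makes the print stricter, so fewer but more consistent features survive.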

We then compare these writing prints, and those that share similar significant features are rated as more likely to belong to the same person.

For now, the module simply presents the cyber crime team with an ordered list to work through, checking the most likely candidate first. As part of the larger hub, this will form part of a bigger picture, giving more robust results.
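
A rough sketch of how prints might be compared and turned into an ordered list is given below. It assumes each writing print is stored as a dictionary mapping a feature name to a level such as "low" or "high", and it uses Jaccard similarity over the (feature, level) pairs as the score. That measure is an assumption made for illustration; the comparison method actually developed in the project is not described here, and the account names are invented.

```python
def print_similarity(print_a, print_b):
    """Jaccard similarity over (feature, level) pairs, from 0.0 to 1.0."""
    a, b = set(print_a.items()), set(print_b.items())
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_targets(suspect_print, target_prints):
    """Order target accounts from most to least similar to the suspect."""
    scored = [
        (name, print_similarity(suspect_print, target_print))
        for name, target_print in target_prints.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical prints for one known account and three target accounts.
suspect = {"complexity": "low", "signature": "high"}
targets = {
    "alias_b": {"complexity": "high", "signature": "high"},
    "alias_a": {"complexity": "low", "signature": "high"},
    "alias_c": {},
}
for name, score in rank_targets(suspect, targets):
    print(f"{name}: {score:.2f}")
```

An investigator would start with the first account in the list and work downwards; the score is a prioritisation aid rather than evidence of authorship on its own.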

Whilst we took the concept of the writing print from previous work and added a few features to it, we also overhauled the way the writing prints could be analysed and compared, which allowed for more consistent results across larger groups of email clusters.

During our research we developed techniques to analyse and refine the importance of certain features. We looked at which features were producing false positives and which were key indicators of authorship match.

Obviously there are plenty more modules to be built as part of the larger analytics hub; however, there are also expansions on this specific work to explore:

  • Explore alternative methods for defining "significance" in a specific feature
  • Trial across more varied data sets