The database OCCRP used to publish the Troika Laundromat represents a banking data leak of enormous size and scope. Hundreds of bank account records, many with tens of thousands of transactions, form the core of the leak. Altogether, the records cover more than 1.3 million transactions worth more than US$470 billion.
The leak was compiled from multiple sources by OCCRP and 15min.lt. Additional data from OCCRP’s own database was combined with it to help make sense of the core leaks.
These transactional records are supported by tens of thousands of corporate documents, contracts, invoices, and emails. Together, this data provides a robust view of the activities of select clients of two now-defunct Lithuanian banks, Ukio Bankas and Bankas Snoras AB. Billions in dollars, euros, Swiss francs, and rubles flowed through these accounts over the course of more than 10 years, with the bulk of activity from 2003 to 2013, by which time both banks had been shuttered by Lithuanian authorities.
In order to facilitate a months-long cross-border investigation, OCCRP made the leak database available to a team of international journalists at more than 20 partner organizations. To present a holistic and unified view of such a large leak, two multi-layered systems were designed and employed to parse, recognize, organize, and represent the leaked data.
The document records were sent through a state-of-the-art information processing and retrieval technology stack called Aleph. Designed and built in-house by OCCRP, this tool allows journalists to easily access, search, and browse large volumes of multi-format document records. Interested readers can explore public records themselves at data.occrp.org and check out the open-source code base, along with other documentation, at https://github.com/alephdata/aleph.
The banking transactions within the leak were embedded in more than 20 different record and file formats in a mixture of Lithuanian and English, and required an artisanal approach. OCCRP employed a custom set of parsing methods to structure the often messy and undelimited transaction data. An array of methods was also used to standardize company names, convert transaction amounts accurately to common currencies, and extract and verify bank account numbers and addresses. The resulting structured data was compiled and made available to investigators in a PostgreSQL database. While we believe we accounted for the vast majority of name variations and other data discrepancies, no system can clean all data perfectly, and the published aggregated numbers should be taken as approximate.
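For readers curious what this kind of cleaning can look like in practice, here is a minimal Python sketch of three of the steps described above: standardizing company names, parsing and converting amounts, and checking account numbers. It is an illustration only, not OCCRP's actual pipeline. The function names, the list of legal-form suffixes, and the exchange rates are all hypothetical placeholders, and the account check assumes the numbers are IBANs, which can be verified with the standard mod-97 checksum.

```python
import re
import unicodedata
from decimal import Decimal

# Hypothetical list of legal-form suffixes to ignore when comparing names.
LEGAL_FORMS = {"ltd", "limited", "llp", "llc", "inc", "corp", "sa", "uab", "ab"}

# Placeholder rates for illustration only; these are NOT real exchange rates.
RATES_TO_USD = {"USD": Decimal("1"), "EUR": Decimal("1.25")}


def normalize_company_name(raw: str) -> str:
    """Fold accents, case, punctuation, and legal forms into one comparable key."""
    text = unicodedata.normalize("NFKD", raw)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9 ]+", " ", text.casefold())
    return " ".join(t for t in text.split() if t not in LEGAL_FORMS)


def parse_amount(raw: str) -> Decimal:
    """Parse '1 234,56' (Lithuanian style) or '1,234.56' (English style).

    A lone comma is assumed to be a decimal mark, so '1,234' is ambiguous
    and would need format-specific handling in a real pipeline.
    """
    cleaned = re.sub(r"\s", "", raw)
    if "," in cleaned and "." in cleaned:
        # Whichever separator comes last is treated as the decimal mark.
        if cleaned.rfind(",") > cleaned.rfind("."):
            cleaned = cleaned.replace(".", "").replace(",", ".")
        else:
            cleaned = cleaned.replace(",", "")
    elif "," in cleaned:
        cleaned = cleaned.replace(",", ".")
    return Decimal(cleaned)


def to_usd(amount: Decimal, currency: str, rates=RATES_TO_USD) -> Decimal:
    """Convert to a common currency using the supplied (placeholder) rates."""
    return (amount * rates[currency.upper()]).quantize(Decimal("0.01"))


def is_valid_iban(raw: str) -> bool:
    """Check an IBAN's structure and its mod-97 checksum."""
    iban = re.sub(r"\s+", "", raw).upper()
    if not re.fullmatch(r"[A-Z]{2}\d{2}[A-Z0-9]{11,30}", iban):
        return False
    rearranged = iban[4:] + iban[:4]
    # Letters map to numbers: A=10, B=11, ..., Z=35.
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1
```

With helpers like these, two spellings such as "Ūkio Bankas, AB" and "UKIO BANKAS" reduce to the same key, which is what allows transactions recorded under different name variants to be grouped together. Simple rules of this kind inevitably miss some variants, which is one reason aggregated totals remain approximate.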
Journalists interested in gaining access to the document or transaction records can contact data@occrp.org.