Posted by Sam Jefferies, Vice President of EMEA at Legal Futures Associate DocsCorp 
Data seems to be the word on everyone’s lips right now. Phrases like data breaches, data harvesting, and data regulations clog our news feeds until our brains start turning it all into white noise.
But don’t let the overwhelming amount of information out there blind you to what’s important – protecting the data you or your firm holds.
Most data leaks happen accidentally and could be as innocent as a reader uncovering information the sender may never have realised was there.
So, how can you take back control of what you share? The key lies in these three important terms.
One of the biggest worries among the firms I speak with is not knowing what’s in a document beyond what they have typed. Every Microsoft Word file carries metadata, such as total editing time, last modified date, and author names, that can tell the reader much more than would ever appear on a printed page.
Metadata isn’t all bad. In fact, a lot of it can help with document management. Metadata like tags, title, and creation date can be searched for in a file system like Microsoft SharePoint, making file discovery easier and more accurate.
However, you should be careful with what metadata you leave attached to documents sent to people outside of the business.
Metadata like total editing time, comments, and anything else never intended to live beyond the draft stage should be wiped, so it doesn’t end up in the wrong hands. Clients and opposing counsel could uncover a goldmine of bonus information in a comments section only meant for internal use.
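If you want to see what a Word file is carrying before it leaves the business, a short script can list it. The following is a minimal sketch rather than a finished tool. It assumes the file is a modern .docx, which is a zip package that normally stores its built-in properties in docProps/core.xml and docProps/app.xml. The function name `read_docx_properties` is my own, and the sketch only reads properties; it does not remove them, and it does not look at comments or Track Changes, which live elsewhere in the package.

```python
import zipfile
import xml.etree.ElementTree as ET

# Parts of a .docx package that normally hold the built-in document properties.
PROPERTY_PARTS = ("docProps/core.xml", "docProps/app.xml")


def read_docx_properties(path):
    """Return {property name: value} for the built-in properties of a .docx file.

    `path` may be a file name or a binary file-like object.
    """
    properties = {}
    with zipfile.ZipFile(path) as package:
        names = set(package.namelist())
        for part in PROPERTY_PARTS:
            if part not in names:
                continue  # some files omit one of the parts
            root = ET.fromstring(package.read(part))
            for element in root:
                # Strip the XML namespace: "{uri}creator" -> "creator"
                name = element.tag.rsplit("}", 1)[-1]
                # Keep only simple text values; nested structures are skipped.
                if element.text and element.text.strip():
                    properties[name] = element.text.strip()
    return properties
```

Calling `read_docx_properties("engagement-letter.docx")` (a hypothetical file name) would typically return entries such as `creator`, `lastModifiedBy`, and `TotalTime`, the last being the total editing time in minutes.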
Hidden data is content that is still in a file but not visible at first glance: text covered with a black box instead of being redacted correctly, text with the font colour turned white, embedded files, Track Changes, hidden formulas, and hidden columns in an Excel spreadsheet.
Any reader can uncover this text, or ‘unhide’ columns, and suddenly have access to a whole host of information that should have been kept private.
My company was once inadvertently given access to more than we bargained for. We were sent a spreadsheet with delegates registered for an event. But we noticed that the column letters in the file ran A, D, G rather than A, B, C. When we unhid the missing columns, we uncovered job titles, email addresses, and telephone numbers, which was much more than the event organisers had told us they could send.
It was a lesson to us as much as it was to them – always know what you’re sending.
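A check like the one we did by eye can be automated before a spreadsheet goes out. This is a rough sketch under a few assumptions: the file is an .xlsx (a zip package), each worksheet is stored under xl/worksheets/, and hidden columns are flagged with a `hidden` attribute on the sheet's `<col>` elements. The function names `column_letter` and `hidden_columns` are my own. It reports worksheet part names rather than the tab names you see in Excel, and it does not look for hidden rows, hidden sheets, or white text.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by worksheet XML inside an .xlsx package.
SHEET_NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"


def column_letter(index):
    """Convert a 1-based column index to its letter name (1 -> A, 27 -> AA)."""
    letters = ""
    while index > 0:
        index, remainder = divmod(index - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters


def hidden_columns(path):
    """Return {worksheet part: [hidden column letters]} for an .xlsx file.

    `path` may be a file name or a binary file-like object.
    """
    found = {}
    with zipfile.ZipFile(path) as package:
        for part in sorted(package.namelist()):
            if not re.fullmatch(r"xl/worksheets/[^/]+\.xml", part):
                continue
            root = ET.fromstring(package.read(part))
            letters = []
            for col in root.iter(SHEET_NS + "col"):
                if col.get("hidden") in ("1", "true"):
                    # Each <col> covers an inclusive range of column indexes.
                    first, last = int(col.get("min")), int(col.get("max"))
                    letters.extend(column_letter(i) for i in range(first, last + 1))
            if letters:
                found[part] = letters
    return found
```

For a file like the one we received, where only columns A, D, and G were visible, I would expect the result to list B, C, E, and F against the relevant worksheet, which is the cue to unhide and review them before sending.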
Where hidden data risks disclosing more information than intended, dark data is the opposite problem: information you hold but cannot find when you need it.
The lifecycle of dark data begins with scanned files, email attachments, and bulk file imports added into a file system. From there, they go dark simply because they lack the text layer that search technology relies on to find them.
Image-based files with no text layer, like scanned IDs or invoices, need to be processed through optical character recognition (OCR) technology to be searchable. OCR technology scans an image file and applies a text layer, so it can be searched for using on-page content like a client name or case number.
Dark data is a serious threat to GDPR compliance since a response to a data subject access request requires an organisation to provide all data relating to the requestor. Failing to provide all the information because the documents were undiscoverable can lead to costly disputes, drawn-out negotiations, and financial penalties.
Don’t wait for a leak to happen before you consider what hidden and dark data you’re working with. When it comes to data management, prevention is always better than cure.