A data manager's worst nightmare? Watching the data lake turn into a data swamp overrun with dark data. What does that mean?
Encouraged by the steady decline in the price of digital storage, more and more large groups and medium-sized businesses store all their data in a “data lake”, hosted on their own servers or in the cloud. They do so whatever the format: structured databases, but also emails, Word or PDF files, images, videos, recordings of customer calls… And whatever the source: employees' digital workstations, connected objects, SaaS (software as a service) applications, CRM for customer relations, ERP for operational management, social networks…
KEY CONCEPTS OF DARK DATA
Structured data: a classic example is the data in a spreadsheet, where the column and row titles precisely define the nature of the quantitative content of each cell. A structured database can be likened to a huge spreadsheet. Each data item is precisely identified and “calculable”.
Unstructured data: these are generally qualitative data, of varying sizes and shapes, contained in text files, videos, sound files, or published on social networks, etc.
Uncleaned data: these are data whose value is unreliable, as a result of a poor calculation method or methodological bias.
Dark data: data that has been captured but is neither identified nor used.
Black information: this term is used mainly in business intelligence to designate information protected by secrecy, which only a few people can access and which may be the target of hacking attempts or espionage by competitors.
Unlocking the Potential of Dark Data
Yet studies agree that more than half of companies' data (52%, according to the Californian company Veritas Technologies, for example) is in reality “dark data”, that is to say “unstructured data that is neither used nor analyzed, but simply stored as it is generated by the company and its ecosystem”, explains Philip Carnelley, associate vice president of software and analytics at research firm IDC Europe.
In other words, “it is data under the radar, which is not exploited or valued by companies”, says Véronique Mesguich, a consultant and trainer specializing in business intelligence.
Examples? “The annual reports downloaded by a bank's analysts, the multiple versions of contracts exchanged by email in PDF format between a company's sales representatives and its customers, and stored on those employees' computers…”, say Grégoire Colombet, specialist in AI for financial services, and Guilhaume Leroy-Meline, expert in cognitive transformation, both from IBM France.
Another possibility: data is used once and then forgotten. This is the case for information tied to many research projects in university laboratories or companies: if the work is abandoned, its results are swept under the rug. “Consequently, in many organizations, including those belonging to the military-industrial complex, we must reinvent the wheel by starting research from zero”, notes Bryan Heidorn, director of the Center for Digital Society and Data Studies at the University of Arizona (United States) and a dark data theorist, particularly in the academic world.
Better customer knowledge
To demonstrate the financial value of these “digital Sleeping Beauties”, Bryan Heidorn applied the concept of the “long tail” to them. The “long tail” usually describes how, on the Internet, small sales of a very large number of little-known products add up to a market as lucrative as the very large volumes generated by a few bestsellers. Transposed to dark data, the “long tail” shows that harnessing and interconnecting large amounts of dark data is a growth opportunity, allowing better management, more widespread predictive maintenance, better customer knowledge… IDC has put these productivity gains at 430 billion dollars (365 billion euros).
“Another example of dark data: your connection logs on a company's website, that is, which pages you visited and how long you stayed there. Imagine that we could link them to the loyalty card saved on your smartphone: we could then identify you, thanks to sensors, when you enter a store and offer you a personalized customer journey based on discount coupons”, enthuses Raphaël Savy, Vice-President Southern Europe at Alteryx.
Alteryx is one of the software publishers, along with Alfresco, Blue Prism, IBM, Invenis, M-Files, Splunk and many others, that have developed “dark analytics” tools. Thanks to advances in AI, these tools can identify dark data and structure it, or at least recognize its nature (a contract, etc.) and extract some information from it: the date of the document, the various stakeholders…
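To make the idea concrete, here is a minimal sketch of what “recognizing the nature of a document and extracting some information from it” can look like. It is not how any of the publishers named above work: the keyword lists, the `describe` function and the two regular expressions are illustrative assumptions, and a real tool would rely on trained models rather than keyword counts.

```python
import re

# Hypothetical keyword lists, chosen for illustration only.
KEYWORDS = {
    "contract": ("agreement", "hereinafter", "parties"),
    "invoice": ("invoice", "amount due", "vat"),
    "report": ("annual report", "fiscal year", "shareholders"),
}

# Assumption: dates are written in ISO form (YYYY-MM-DD).
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
# Assumption: stakeholders appear as "between X and Y," or "between X and Y."
PARTY_RE = re.compile(r"between\s+(.+?)\s+and\s+(.+?)[,.]", re.IGNORECASE)


def describe(text):
    """Guess the kind of document and pull out a date and stakeholders."""
    lowered = text.lower()
    # Score each kind by how often its keywords occur in the text.
    scores = {
        kind: sum(lowered.count(word) for word in words)
        for kind, words in KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    date = DATE_RE.search(text)
    parties = PARTY_RE.search(text)
    return {
        "kind": best if scores[best] > 0 else "unknown",
        "date": date.group(0) if date else None,
        "stakeholders": list(parties.groups()) if parties else [],
    }


sample = (
    "This agreement is made on 2020-03-15 between Acme SA and Globex Ltd, "
    "hereinafter the parties."
)
print(describe(sample))
```

Even this crude approach turns an anonymous file into a record with a type, a date and named parties, which is the first step toward making it searchable.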
Assess whether this data is worth keeping
This can lead to unpleasant surprises. “Dark data also represents potential liabilities in terms of legislation, the environment and cybersecurity”, warns François Royer, director and founder of Guanxi Labs, a Toulouse firm specializing in support for digital transformation. Careful examination of this data may reveal serious compliance breaches. “Storing it on energy-hungry, air-conditioned servers also worsens the ecological footprint…”, points out George Parapadakis, director at Alfresco. And if dark data is hacked, nobody notices…
The market for dark data identification and analysis is all the more buoyant because these unidentified, unused files can be real time bombs if they can be used to prove that their owner is breaking a law. This is the risk of “non-compliance”. “The GDPR obliges companies to identify all their files containing personal data”, points out Grégory Serrano, sales and marketing director of Invenis, a software publisher.
“Once analyzed, dark data can reveal to a bank's managers that it is dealing with a high-risk third party: for example, a shipowner who regularly trades with countries under US embargo”, says Thomas Knidler, Commercial Director of Blue Prism (business process automation software) for the banking sector. If this relationship came to the attention of OFAC, the US body responsible for enforcing international sanctions, the bank could face a very heavy fine.
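As a rough illustration of what identifying files containing personal data involves, the sketch below flags documents that contain email addresses or phone numbers. The two patterns and the function names (`find_personal_data`, `flag_files`) are assumptions made for this example; they cover only a small part of what the GDPR treats as personal data, and would miss names, addresses or identifiers.

```python
import re

# Simplified, assumed patterns: they will produce both misses and false hits.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "phone": re.compile(r"\+?\d[\d .-]{7,}\d"),
}


def find_personal_data(text):
    """Return the matches found in the text, grouped by pattern label."""
    hits = {}
    for label, pattern in PATTERNS.items():
        found = pattern.findall(text)
        if found:
            hits[label] = found
    return hits


def flag_files(documents):
    """documents maps a file name to its text; keep only files with hits."""
    flagged = {}
    for name, text in documents.items():
        hits = find_personal_data(text)
        if hits:
            flagged[name] = hits
    return flagged


docs = {
    "notes.txt": "Call Marie at +33 1 23 45 67 89 or marie@example.com.",
    "specs.txt": "Torque is 12 Nm.",
}
print(flag_files(docs))
```

A scan like this only produces a shortlist for human review; it does not by itself establish compliance.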
A mine to be exploited
Research is therefore continuing all over the world to make the analysis faster and finer-grained. Divesh Srivastava (AT&T Labs Research, USA), a specialist in real-time data analysis, Juliana Freire (New York University) and Renée J. Miller (University of Toronto, Canada) are among the scientists exploring new avenues, upstream or downstream of the phenomenon.
Upstream, “one of the approaches is to install data analysis software directly where the data is stored”, explains Guilhaume Leroy-Meline, of IBM. “This would make it possible to identify and qualify any incoming data and thus make it visible.”
Downstream, cross-referencing dark data would allow it to be better qualified. In France, the DGA (Directorate General for Armament) and the ANR (National Research Agency) are providing up to 700,000 euros in funding for the “Sources Say” project, led by Ioana Manolescu, research director at Inria and professor at Polytechnique. “The aim is to extract named entities (people, organizations, places…) very quickly from unstructured or structured documents, such as social network posts or texts, to locate all the files that cite these same entities, and to establish links between them in order to learn more about these entities”, concludes Ioana Manolescu. The hunt for “dark data” has only just begun.
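The principle of extracting entities and linking the files that cite them can be sketched in a few lines. This is not the project's method: the `ENTITY_RE` pattern below is a deliberately naive stand-in for real named-entity recognition (it treats any run of two or more capitalized words as an entity), and the function names are assumptions made for this example.

```python
import re
from collections import defaultdict
from itertools import combinations

# Naive stand-in for named-entity recognition: two or more capitalized words.
ENTITY_RE = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+\b")


def extract_entities(text):
    """Return the set of candidate entities found in the text."""
    return set(ENTITY_RE.findall(text))


def link_documents(documents):
    """Map each pair of file names to the entities that both files cite."""
    # Inverted index: entity -> names of the files that mention it.
    index = defaultdict(set)
    for name, text in documents.items():
        for entity in extract_entities(text):
            index[entity].add(name)
    # Every pair of files sharing an entity gets a link labeled with it.
    links = defaultdict(set)
    for entity, names in index.items():
        for first, second in combinations(sorted(names), 2):
            links[(first, second)].add(entity)
    return dict(links)


docs = {
    "mail.txt": "Meeting with Jane Doe about the port of Le Havre.",
    "post.txt": "Jane Doe spoke at the conference.",
    "memo.txt": "Budget review next week.",
}
print(link_documents(docs))
```

The inverted index is the key design choice: each file is read once, and links emerge from the entities they share rather than from comparing every file against every other.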
In this difficult period, entrepreneurs and managers of VSEs and SMEs need support more than ever. The Les Echos Entrepreneurs site is making its contribution by offering free information and testimonials for the next few weeks.