Why Unstructured Data Is Sorting Itself Out

Information, without order, is chaotic. Attempting to work with data without structure and form is rather like watching white noise fuzz on an un-cabled television set, where shapes are almost familiar, but devoid of any recognizable manifestation. Unstructured data inside organizations appears to be full of energy, but it is weighed down by an inertia which precludes it from being useful, primarily because it doesn’t know which home (application) it belongs to.

What Is Unstructured Data?

To define the term, let’s first say that structured data includes spreadsheets with their formalized rows and columns, “form-based” data resources where we know the fields in a document and so we know what values to expect… and of course relational databases, the purest form of an ordered and structured data repository. Unstructured data, therefore, is non-tabular data: records of phone calls and voicemails, raw video that has yet to be meta-tagged to explain its contents, blogs and web pages, emails and social media posts in all their forms.

Some data that may appear structured (such as sensor data from surveillance and internet of things devices) is still essentially unstructured. For example, 6,000 temperature readings and gyroscope movement records aren’t necessarily structured just because they are numbered by sequence; they need to be extracted, parsed, deduplicated and manipulated to become structured for productive use. In so many cases, unstructured data is regarded as an untapped source of real business context, but it is often the hardest to bring in line, the hardest to govern and the toughest to operationalize.
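As a minimal sketch of that parse-and-deduplicate step, the Python below turns raw sensor lines into typed records. The line format (sequence number, sensor name, temperature), the Reading type and the function names are all assumptions made for illustration; real device feeds will differ.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Reading:
    # Hypothetical record shape; real feeds carry more fields.
    seq: int
    sensor_id: str
    temperature_c: float


def parse_line(line: str) -> Optional[Reading]:
    """Parse one raw line such as '17,probe-a,21.5'; return None if malformed."""
    parts = [p.strip() for p in line.split(",")]
    if len(parts) != 3:
        return None
    try:
        return Reading(int(parts[0]), parts[1], float(parts[2]))
    except ValueError:
        return None


def structure(raw_lines: Iterable[str]) -> List[Reading]:
    """Extract, parse, deduplicate and order raw sensor lines."""
    seen = set()
    rows = []
    for line in raw_lines:
        reading = parse_line(line)
        if reading is None:
            continue  # drop lines that cannot be parsed
        key = (reading.sensor_id, reading.seq)
        if key in seen:
            continue  # drop repeats of the same sensor and sequence number
        seen.add(key)
        rows.append(reading)
    rows.sort(key=lambda r: (r.sensor_id, r.seq))
    return rows


raw = [
    "2,probe-a,21.7",
    "1,probe-a,21.5",
    "2,probe-a,21.7",          # duplicate
    "garbled###",              # malformed
    "1,probe-b, 19.0",
    "3,probe-a,not-a-number",  # unparseable value
]
rows = structure(raw)
```

Only once the readings are typed, deduplicated and ordered like this do they behave like rows in a table that an application can query.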

Technology analyst house IDC refers to the unclassified morass of information as the “unseen data conundrum” and estimates that unsiloed reserves of unstructured data now make up “the majority of enterprise information”. IDC also suggests that it is growing rapidly, by 55% each year. These data blind spots are thought to create operational risk and to potentially undermine the value of AI. This is important now because organizations are using unstructured data to power large language models and retrieval-augmented generation applications.

Unstructured Market Structure

There’s a whole marketplace structure of unstructured technology toolset vendors today. Amazon Web Services (AWS) offers an entire menu of functions in this space. Amazon Comprehend is a natural language processing and machine learning service capable of extracting metadata, extracting key phrases and determining sentiment from text in multiple languages. AWS positions this service alongside the Amazon Transcribe speech-to-text tools, the quirkily named Amazon Rekognition image and video analysis service… and there’s also Amazon Textract, which extracts metadata from scanned documents and images.

Given the breadth of AWS services in this market, it would be reasonable to expect similar-but-skewed proprietary versions of these functions in the major cloud service provider hyperscalers. Microsoft Azure Cosmos DB is a globally distributed, multi-model database with enough intelligence to be able to manage structured, semi-structured and unstructured data. This cloud-native database might be used alongside the playfully named Azure Blob Storage service, an object storage service designed for storing large amounts of unstructured data that might exist in images, videos, documents and other binary data. Also from Microsoft, Azure AI Document Intelligence uses machine learning to extract text, key-value pairs, tables and structures from documents automatically.

Not to be left out, Google Cloud Platform also works at this level. The cloud and search giant points to its BigQuery brand and the object tables function within it. “Object tables provides a structured record interface for unstructured data stored in Google Cloud Storage. This enables [users] to directly run analytics and machine learning on images, audio, documents and other file types using existing frameworks like SQL and remote functions natively in BigQuery itself,” noted Google Cloud’s Gaurav Saxena and Thibaud Hottelier at the time of this product’s launch a couple of years back.

An IT Sub-Discipline Of Its Own

Given the services that exist as fairly prominent functions in the major cloud providers, and the toolsets available from more specialized players, working with unstructured data is clearly now a more pressing need. Often referred to as enterprise content management, ECM is certainly growing in the combined shadow of big data analytics and the rise of artificial intelligence.

The natural evolution for a data market like this is the arrival of industry-specific services aligned to industry verticals. Known for its work in unstructured data management across the healthcare industry, Hyland treads a careful line with its messaging as the company clearly wants to be seen as applicable to all use cases. The company says Hyland Content Intelligence turns unstructured data into actionable, AI-ready content, with the 2025 arrival of its Knowledge Enrichment service (currently in beta) being among its star players.

Related technologies include IBM’s Watson Discovery for unstructured search and AI; Elastic for indexing and querying of unstructured text and logs; and Cloudera for Hadoop-based data lake services across unstructured and semi-structured data. Add Databricks, Collibra, Alation, Palantir and Varonis, to name but a mouthful, and it is clear that a lot of structure is being applied to the unstructured data space.

Black Box Blind Spots

“Unstructured data remains a black box for most organizations, [especially] as it becomes critical for AI and business operations,” said Jay Limburn, chief product officer at Ataccama. “Without a way to structure, govern and trust that information, enterprises risk missing the full value of their data.”

Limburn points to his firm’s Ataccama One platform as a means to combine data quality, governance, observability, lineage and master data management. Ataccama One is now available on Snowflake Marketplace as a new integration with Document AI, a Snowflake AI feature that uses Arctic-TILT, a proprietary large language model used to extract data from documents.

This fusion of data structuring services is billed as a means of turning unstructured content, such as contracts, invoices and PDFs, into structured data by running models directly within Snowflake. Businesspeople can use natural language prompts, such as “What is the effective date of the contract?”, which are then processed by Snowflake to create structured outputs written directly into Snowflake tables.
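The general shape of that question-to-row flow can be sketched in a few lines of Python. This is not Snowflake’s Document AI or Ataccama’s API: a plain regular expression stands in for the language model, and the function name, row fields and date format are hypothetical choices made purely for illustration.

```python
import re
from datetime import datetime
from typing import Dict, Optional

QUESTION = "What is the effective date of the contract?"

# Stand-in for the model: a simple pattern for dates written like "March 1, 2025".
_EFFECTIVE = re.compile(
    r"effective\s+(?:date\s*)?(?:as\s+of|on|from)?\s*:?\s*(\w+ \d{1,2}, \d{4})",
    re.IGNORECASE,
)


def answer_to_row(document_id: str, text: str) -> Dict[str, Optional[str]]:
    """Return one structured row holding the answer to QUESTION for a document."""
    effective_date = None
    match = _EFFECTIVE.search(text)
    if match:
        try:
            parsed = datetime.strptime(match.group(1), "%B %d, %Y")
            effective_date = parsed.date().isoformat()
        except ValueError:
            effective_date = None  # matched text was not a real date
    return {
        "document_id": document_id,
        "question": QUESTION,
        "effective_date": effective_date,
    }


contract = "This Agreement is effective as of March 1, 2025 between the parties."
row = answer_to_row("contract-001", contract)
```

The point of the sketch is the output rather than the extraction method: each document yields a row with fixed, typed fields that can be written to a table and queried like any other structured data.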

Unstructured AI Services

Where does the unstructured marketplace go next? If we accept the proposition that AI services are partly responsible for the surge in this sector (or let’s at least call it a sub-surge in a sub-sector), then we might actually see AI services themselves starting to shoulder the responsibility for structuring our unstructuredness.

Given the current debate over whether chat-based AI services will take over browser search – and the fact that OpenAI offers GPT-based APIs for text extraction, summarization, semantic intent analysis and classification – that might be exactly what happens.

– Adrian Bridgwater, Published Courtesy of Forbes.
