Data Discovery & Classification

Positively identify your content, both where it is and what it is, so you can determine how and when it should be handled.


Discover & analyze content in place.

Even with the advances of consolidated content platforms like Microsoft 365, the reality is most organizations will continue to use multiple systems to house enterprise content. The real risk many teams face is not knowing where their content is, what their content is, or who has access to it in public folders or systems.

Valora brings all previously unknown content into view by connecting to structured and unstructured data systems, then crawling, analyzing and reporting on that content in place. We deploy custom connectors using APIs or other direct database extraction methods to access, evaluate, protect, repair, and disposition the content at the source.

Uncover all enterprise content.

Valora connects to unstructured and structured cloud, on-prem and hybrid data environments, including:

  • On-prem unstructured systems: Windows fileshares.
  • On-prem structured systems: ECMs, databases, etc.
  • Cloud unstructured systems: SharePoint Online, Dropbox, Box, etc.
  • Cloud structured systems: Data lakes/warehouses, Oracle, MSSQL, etc.
  • Cloud applications: Salesforce, Workday, SAP, NetSuite, etc.
  • Email systems: Microsoft 365, Gmail, etc.

AutoClassify every file.

Valora’s AutoClassification platform performs a full-text analysis of every file, not just the existing file- or system-generated metadata, identifying key data elements that are unique to the document and important to your organization, industry or jurisdiction.

Based on this analysis, Valora AutoClassifies each document to your Records Retention Schedule, separates the content with legitimate business value from the content without it (ROT: redundant, obsolete, or trivial), and unifies the detailed analysis, management, reporting and disposition of your enterprise data estate.

Scanning unstructured content.

Manually categorizing unstructured data is challenging due to its volume, complexity, variability, and lack of predefined organization.

Valora’s PowerHouse processes and classifies unstructured data, such as emails and other corporate documents, using advanced AutoClassification algorithms. This full-text analysis allows PowerHouse to read and analyze every file without a human having to open and classify it.

Scanning structured data environments.

Although databases house data in structured fields, it can be difficult for Information Governance teams to access, classify and act on that data. Valora’s PowerHouse connects to, scans, analyzes, classifies and performs defensible disposition on fielded data in structured databases.

PowerHouse takes the extracted database text and turns each database record into its own “mockument” so users can read, understand, and interact with the content in BlackCat.
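As a rough illustration of the idea, the sketch below renders each row of a database table as a small plain-text document. It is not Valora's implementation: the function names `record_to_mockument` and `build_mockuments`, the text layout, and the use of SQLite are all assumptions made for the example.

```python
import sqlite3


def record_to_mockument(table, row):
    """Render one database row as a small plain-text document (hypothetical layout)."""
    lines = [f"Source table: {table}"]
    for column in row.keys():
        value = row[column]
        # NULL values are written as empty fields rather than the text "None".
        lines.append(f"{column}: {'' if value is None else value}")
    return "\n".join(lines)


def build_mockuments(conn, table):
    """Return one text document per record in `table`."""
    # sqlite3.Row lets us read column names alongside the values.
    conn.row_factory = sqlite3.Row
    # The table name is interpolated directly, so pass only trusted names.
    cursor = conn.execute(f'SELECT * FROM "{table}"')
    return [record_to_mockument(table, row) for row in cursor]
```

Once every record is plain text, the same full-text analysis used for unstructured files could in principle be applied to fielded data as well.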

Rich Metadata Tagging

AutoClassification enables the creation of a virtually unlimited range of enriched metadata tags. At Valora, we’ve encountered everything from straightforward tags like Document Type, Custodian and Keywords to more niche examples, such as Latitude/Longitude, Japanese Showa Dates, and everything in between. However, there is a core set of basic tags that almost every AutoClassification project will produce. They include:

Basic Identifiers

Document Type, Document Title, Document Date, Geographic Location, Document ID, Attachment Range

People Fields

Author/Custodian, Recipient/Audience, Employee Name, Customer/Patient Name, Contract Signatories

Records Management Fields

Record Class, Regulatory Citation, Retention Period, Expiration Date, Legal Hold, and ROT

Data Privacy Fields

Data Privacy Type & Detail: PII, PHI, PCI, Sensitivity Class, Minimization Status, Redacted/Pseudonymized

Attributes Fields

Keywords, Duplicate, Version, Language, Product Names, Client/Customer Data, Employee/Personnel Data

From there, the possibilities are endless. It is easy to create custom AutoClassification tags for certain verticals, document types, and specialized needs that might be for just your organization or department. Examples of custom rich metadata tags are:

Custom Document Types

unique to the client or their industry (ex: Variation Order, Radiology Report)

Geographic Tags

indicating City, Zip Code (also zip + 4), Office or Store Location, etc.

Personnel Fields

Employee Termination Date, Employee ID, Promotion Date, etc.

Lines of Business

Product name/number, Case matter name/number, Supervisor, etc.

Contractual & Finance Fields

Invoice amount, Contract parties, Effective & Termination dates

Benefits of using Valora for Data Discovery & Classification

True AutoClassification

Valora’s PowerHouse platform “reads” every file in every repository, performing a full-text analysis of the content. It AutoClassifies each file based on content and context – not just the existing file- or system-generated metadata – to identify and flag important data facets.

Determine True Document Type, Not File Type

Knowing how many Word docs, Excel spreadsheets or PDFs you have is nice if you want a file count for a given repository, but it tells you nothing about the contents of those files. PowerHouse reads every file to determine the true Document Type of each file, for example: a vendor contract vs. an employee contract.

Comprehensive Data Mapping

More than a simple data map, AutoClassification tells you precisely where the land mines are. Going well beyond how many files you have and where, PowerHouse determines the specific Document Type for each file (ex. Contract, Board Minutes, Balance Sheet, etc.), including whether it contains sensitive information (ex. employee records, client personal data, financial reports, etc.), how long it should be retained, and under what circumstances it can be defensibly deleted.

Reduce Risk

By tagging or classifying enterprise content, you are actively taking responsibility for the data you hold. You are removing costly “surprises” that can result from litigation, eDiscovery, data privacy and information security events, when unplanned activities necessitate a deep dive into “what you have or hold.”

Reduce Human Effort & Error

AutoClassification software processes content at speeds that no human, or team of humans, possibly can. It greatly reduces the frequency and impact of human error, inconsistency, and oversight – providing a comprehensive and thorough review of all enterprise data systematically.

Unlimited Metadata Values

Unlike other classification tools that are limited to point-in-time, single metadata values, Valora’s rich metadata applications support both multiple and hierarchical taxonomic data structures, including automatic updates based on workflows, events and other triggers. A single file can be classified as 1) a Contract, and more specifically 2) a Vendor Contract, and also tagged as 3) containing sensitive information about a certain topic or person, while also being 4) flagged as under Legal Hold.
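To make the multi-value, hierarchical idea concrete, here is a minimal sketch of one way such tags could be modeled. The `TaggedFile` class, its methods, and the slash-separated encoding of the hierarchy are hypothetical choices for illustration, not Valora's data model.

```python
from dataclasses import dataclass, field


@dataclass
class TaggedFile:
    path: str
    # Each tag name maps to a set of values, so one field can hold several values.
    tags: dict = field(default_factory=dict)

    def add_tag(self, name, value):
        """Add a value to a tag; hierarchy levels are separated by "/" (an assumption)."""
        self.tags.setdefault(name, set()).add(value)

    def has_tag(self, name, value):
        """True if a stored value equals `value` or sits beneath it in the hierarchy."""
        return any(
            stored == value or stored.startswith(value + "/")
            for stored in self.tags.get(name, ())
        )


doc = TaggedFile("contracts/acme_msa.pdf")  # hypothetical file path
doc.add_tag("Document Type", "Contract/Vendor Contract")
doc.add_tag("Sensitive Content", "PII")
doc.add_tag("Legal Hold", "Yes")
```

With this shape, a query for the broad class `doc.has_tag("Document Type", "Contract")` and a query for the narrower `"Contract/Vendor Contract"` both match the same file, while the Legal Hold and sensitivity tags sit alongside them independently.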

Data Discovery & Classification FAQ

How fast does Valora crawl and scan repositories?

The first baseline run – scanning and full text analysis of every file – runs at about 0.5 GB per processor-hour of uncompressed/expanded data. Valora’s AutoClassification engine, PowerHouse, is offered in three tiers. The higher the tier, the higher the number of processors and the faster the processing.

  • PowerHouse Starter – can process 2.5 GB, or approximately 6,250 files per hour
  • PowerHouse Foundation – can process 7.5 GB, or approximately 18,750 files per hour
  • PowerHouse Enterprise – can process 20 GB, or approximately 50,000 files per hour

Subsequent data processing runs (for data updates, new configurations or handling rules, etc.) typically run at about 1.5 – 5 GB per processor-hour.
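The baseline figures above lend themselves to a quick back-of-the-envelope estimate. The sketch below uses only the published tier rates and the 0.5 GB per processor-hour figure; the function names are hypothetical, and the processor counts it derives are an inference from those numbers rather than a stated specification.

```python
# Baseline throughput per tier, in GB per hour, as quoted above.
TIER_GB_PER_HOUR = {"Starter": 2.5, "Foundation": 7.5, "Enterprise": 20.0}
BASELINE_GB_PER_PROCESSOR_HOUR = 0.5


def baseline_hours(total_gb, tier):
    """Estimated hours for a first baseline run over `total_gb` of expanded data."""
    return total_gb / TIER_GB_PER_HOUR[tier]


def implied_processors(tier):
    """Processor count implied if the per-processor rate applies uniformly (an inference)."""
    return TIER_GB_PER_HOUR[tier] / BASELINE_GB_PER_PROCESSOR_HOUR
```

For a hypothetical 1,000 GB repository, this gives roughly 400 hours on Starter, about 133 hours on Foundation, and 50 hours on Enterprise. All three tiers also work out to the same 2,500 files per GB (for example, 6,250 files / 2.5 GB), which suggests the file counts assume a fixed average file size; actual run times will vary with file mix.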

Can I inventory my content without AutoClassifying every file?

Yes. Some clients opt for Lite Processing – a fast, thorough scan of the file inventory of a file share or other repository. Lite Processing results in a comprehensive file listing, including all identical duplicates, their size, full path, last accessed date, and last modified date. This approach is often used to create a starting point for risk assessment, gap analysis, and processing recommendations. Lite Processing ultimately yields an automated, rules-driven recommendation for remediating each share analyzed, based on its resulting risk profile.
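For readers who want a feel for what a file inventory of this kind contains, the following is a minimal sketch using only the Python standard library. It is not how Lite Processing is implemented; the `inventory` function, the record fields, and the choice of SHA-256 for detecting identical duplicates are assumptions for illustration.

```python
import hashlib
import os
from collections import defaultdict
from datetime import datetime, timezone


def inventory(root):
    """Walk `root`; return one record per file plus groups of identical duplicates."""
    records = []
    by_hash = defaultdict(list)
    for folder, _subdirs, names in os.walk(root):
        for name in names:
            path = os.path.join(folder, name)
            # Capture timestamps before reading, since reading may update access time.
            info = os.stat(path)
            digest = hashlib.sha256()
            with open(path, "rb") as handle:
                # Hash in 1 MiB chunks so large files are not loaded into memory at once.
                for chunk in iter(lambda: handle.read(1 << 20), b""):
                    digest.update(chunk)
            record = {
                "path": os.path.abspath(path),
                "size_bytes": info.st_size,
                "last_accessed": datetime.fromtimestamp(info.st_atime, tz=timezone.utc),
                "last_modified": datetime.fromtimestamp(info.st_mtime, tz=timezone.utc),
                "sha256": digest.hexdigest(),
            }
            records.append(record)
            by_hash[record["sha256"]].append(record["path"])
    # Files sharing a content hash are treated as identical duplicates.
    duplicates = {h: paths for h, paths in by_hash.items() if len(paths) > 1}
    return records, duplicates
```

A listing like this covers the size, full path, last accessed date, last modified date, and duplicate groups mentioned above; the risk assessment and remediation recommendations would be layered on top of it.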

Will it be able to classify document types or formats that are unique to my company?

Yes. While we have processed and identified thousands of different Document Types over the years, there may be Document Types unique to your organization or industry that we have not seen before. In such cases, Valora trains the system to identify your unique Document Type formats and attributes for accurate classification going forward.

Can Valora integrate and analyze my physical records?

Yes. There are two ways Valora integrates with physical records:

  1. Valora inherits digitized physical documents and metadata from your document storage provider. Each scanned file acts, appears, and is handled like its electronically-stored siblings, including full-text analysis, enriched metadata, and automated handling rules.
  2. For boxes and documents still in physical format, we integrate with your storage vendor’s inventory tracking systems, representing each physical box (or file) as a “Mockument” – a record placeholder inside BlackCat used to report on or trigger actions at the Box or Document level.

Can it AutoClassify files in other languages?

Yes. Some languages are supported with native, in-language processing, such as French, Spanish, and other Roman character-set languages.

For other languages, Valora identifies non-English documents and AutoTranslates them into whatever language your team is comfortable with. We integrate with Google Translate and its support of over 240 world languages. We create a dedicated Google API key for you and download languages to the same location where Valora systems are deployed (our cloud, your cloud, or on-prem). This allows for translating “onsite” without the content going to Google’s cloud.

What is the setup impact on my team? Do we need to “train” the system to recognize our files?

For initial set-up, if you know you have certain file types or documents you want to train the system on, it is helpful for your team to provide a description of the document or format or, better yet, 2-3 examples of specific files. This guidance data will help us train the system on what to look for and how to identify your unique Document Types.

Other than providing us a handful of templates (if you have them), the impact on your team is minimal. We will occasionally have tagging questions for your InfoGov or content-holder teams, and we will require basic service account access and permissions from your IT team.

Can Valora utilize and/or maintain previously tagged data?

Yes. Valora can read, inherit and apply existing metadata during the analysis process to maintain the metadata values already applied to your data. Furthermore, we will incorporate prior-tagged data into the resulting taxonomy and any subsequent rules processing.

Do Valora’s tags integrate with other systems that use tags, such as SharePoint, DMS systems and archival/preservation storage systems?

Yes. Valora integrates with third-party systems that use metadata tags. With the correct write permissions, Valora can send or migrate the file itself and its associated metadata to the target system.