AutoClassification 101

Leverage the power of methodology + technology to intelligently automate document processing, analysis and disposition.

What is AutoClassification?

AutoClassification is a suite of software that automates the analysis and classification of digital content or files – thus “AutoClassification.”

AutoClassification software uses both pattern-matching algorithms and machine learning to detect file contents and attributes, and assign contextual attributes (rich metadata) and disposition (rules) for each document or file.  AutoClassification answers:  What is this content? and How should it be managed throughout its lifecycle?

Classification answers the question: “What is this content and what should I do with it?” AutoClassification automates both the analysis and the resulting decision, so the answer is applied consistently every time.

Why AutoClassify?


AutoClassification software processes content at a speed that no human, or team of humans, possibly can. AutoClassification software performs 10,000,000 computations per second, saving months or years of manual file categorization and attribution.


Sophisticated algorithms remove the possibility of human error or oversight through the automated crawling and classification of every file – nothing is missed or overlooked.


Automating the classification of data allows for consistent and perpetual reconciliation across multiple data repositories, regardless of time of day, time of year, data set size or location.


AutoClassification is used where there are large amounts of disparate content across many data stores.  Use AutoClassification for:


  • data and content management (records management)
  • records retention 
  • content migration
  • litigation compliance (legal hold)
  • data privacy (PII/PHI/PCI)
  • file security (access & permissions)
  • data governance (regulated industries)

How it Works

Valora Technologies’ PowerHouse AutoClassification Suite applies a 5-step methodology for locating, identifying, analyzing, actioning and monitoring content across multiple data stores.

1. Crawl & Locate - Where is it?
  • Crawl one or many (100,000,000+) documents
  • Single and multiple shared drives
  • Email repositories and servers
  • Document Management Systems (DMS) & Enterprise Content Management Systems (ECM)
  • Collaborative sites (SharePoint, Box, Dropbox, Drive)
  • Personal shares and laptops
  • eDiscovery repositories
  • HR, ERP & billing systems
  • On-prem and cloud-based document repositories
2. Identify & Tag - What is it?
  • Tag for file metadata: doc type, size, date, revisions
  • Identify basic metadata tags: DocumentType, DocumentTitle, Author, Recipients & CCs, and Date
  • Tag for custom metadata:  topics, locations, departments, product names, employee ID
  • Tag for keywords or pattern-matching (regular expression)
  • Search by creator or custodian
  • Identify duplicates and near-duplicates
  • Identify meaningful content vs. Redundant, Obsolete & Trivial content (ROT)
  • Identify different types of content (contracts, messages, financial)
  • Identify PII/PHI/PCI 
3. Analyze & Understand - What am I looking at?
  • Preview documents
  • OCR text for unreadable files (images, scanned PDFs)
  • Translate foreign-language content into English
  • Transcribe audio and video files into text
  • Produce reports (high level and drill-down)
  • Send notifications (via email, IM, text)
4. Decide & Action - What do I do with it?
  • Rules-based and machine learning automation for disposition
  • Apply or append rich metadata
  • Apply retention schedules and legal hold
  • Apply security access controls
  • Migrate on demand
  • Delete and sequester
  • AutoRedact sensitive information
  • Initiate custom workflows
5. Monitor & Audit - How often do I update?
  • Set customized refresh and retention schedules based on content type or location
  • Crawl, identify and action only new, edited or acquired data
  • Run in the background with no performance draw on systems or repositories
  • Ensure retention schedules and compliance requirements are executed on time
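
The "only new, edited or acquired data" idea in Step 5 can be illustrated with a small sketch. This is not PowerHouse's implementation; it is a minimal example that assumes change detection by comparing file size and modification time between two crawls. The names Snapshot, take_snapshot and changed_since are hypothetical.

```python
from pathlib import Path

# Hypothetical snapshot format: relative path -> (size in bytes, mtime in ns)
Snapshot = dict[str, tuple[int, int]]


def take_snapshot(root: Path) -> Snapshot:
    """Record size and modification time for every file under root."""
    snapshot: Snapshot = {}
    for path in root.rglob("*"):
        if path.is_file():
            stat = path.stat()
            key = path.relative_to(root).as_posix()
            snapshot[key] = (stat.st_size, stat.st_mtime_ns)
    return snapshot


def changed_since(previous: Snapshot, current: Snapshot) -> dict[str, list[str]]:
    """Compare two crawls and report what needs to be re-processed."""
    new = sorted(k for k in current if k not in previous)
    edited = sorted(
        k for k in current if k in previous and current[k] != previous[k]
    )
    removed = sorted(k for k in previous if k not in current)
    return {"new": new, "edited": edited, "removed": removed}
```

Only the "new" and "edited" entries would need to be identified and actioned again on the next pass. Size and modification time are cheap to read but can miss an edit that preserves both; comparing content hashes is a slower, stricter alternative.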

AutoClassification in Action

AutoClassification applies anywhere documents, files or content need to be located, identified, analyzed and actioned across one or many data environments.

Organizations use Valora Technologies’ PowerHouse AutoClassification engine for:

File Clean-Up / ROT Processing

Keep relevant content, remove the junk.

Valora crawls and monitors multiple shared drives, folders, ECMs and email servers to locate, identify, tag and eliminate Redundant, Obsolete and Trivial (ROT) content within your organization.

  • Keeps repositories clean and compliant
  • Reduces storage costs by eliminating an average of 30-40% of ROT content 
  • Identifies & removes unnecessary content based on:
    • relevant content (keywords) or lack of relevant content (spam)
    • exact copies (duplicates) and versions (near duplicates)
    • file type (temp files) and file size (0 byte files)
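
As a rough illustration of the last two criteria above (exact duplicates, temp files and 0 byte files), the sketch below flags ROT candidates in a folder tree. It is an assumption-laden example, not Valora's method: exact duplicates are detected by SHA-256 content hash, the list of temp-file extensions is invented for the example, and near-duplicate detection is out of scope. The names TEMP_SUFFIXES, file_digest and find_rot_candidates are hypothetical.

```python
import hashlib
from pathlib import Path

# Assumed list of "temp file" extensions; a real policy would define its own.
TEMP_SUFFIXES = {".tmp", ".temp", ".bak"}


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    hasher = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def find_rot_candidates(root: Path) -> dict[str, list[Path]]:
    """Group files under root into zero-byte, temp-file and duplicate candidates."""
    candidates: dict[str, list[Path]] = {
        "zero_byte": [],
        "temp_file": [],
        "duplicate": [],
    }
    first_seen: dict[str, Path] = {}
    # Sort so that the "original" kept for each digest is deterministic.
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        if path.stat().st_size == 0:
            candidates["zero_byte"].append(path)
        elif path.suffix.lower() in TEMP_SUFFIXES:
            candidates["temp_file"].append(path)
        else:
            digest = file_digest(path)
            if digest in first_seen:
                candidates["duplicate"].append(path)
            else:
                first_seen[digest] = path
    return candidates
```

The function only reports candidates; deleting or sequestering them would be a separate, policy-driven step.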


Retention & Content Lifecycle Management

Easily implement retention schedules and workflows.

PowerHouse ensures files are only kept for the amount of time required per policy, per business process, per repository.

  • Continuously monitor files across multiple data repositories
  • Dispose of files properly and on time
  • Account for content under Legal Hold and exempt it from disposal
  • Migrate permanent records to appropriate archives 
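
A retention rule of this kind can be sketched in a few lines. The example below is illustrative only and does not describe PowerHouse internals: the document types, retention periods and the names RETENTION_DAYS, PERMANENT_TYPES, FileRecord and disposition are all hypothetical. It assumes retention is measured from the last-modified date and that Legal Hold always overrides disposal.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical retention schedule: document type -> days to keep.
RETENTION_DAYS = {"invoice": 7 * 365, "email": 2 * 365}
# Hypothetical set of types treated as permanent records.
PERMANENT_TYPES = {"board_minutes"}


@dataclass
class FileRecord:
    name: str
    doc_type: str
    last_modified: date
    legal_hold: bool = False


def disposition(record: FileRecord, today: date) -> str:
    """Return 'hold', 'archive', 'dispose', 'retain' or 'review' for a record."""
    if record.legal_hold:
        return "hold"  # Legal Hold takes precedence over every other rule
    if record.doc_type in PERMANENT_TYPES:
        return "archive"  # permanent records are migrated, never disposed
    days = RETENTION_DAYS.get(record.doc_type)
    if days is None:
        return "review"  # no schedule defined: route to a person
    expires = record.last_modified + timedelta(days=days)
    return "dispose" if today >= expires else "retain"
```

Checking Legal Hold first means a held file is never disposed of, even when its retention period has expired.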


Content Migration

Migrate content from multiple repositories into new ones.  Migrate content to the cloud.

PowerHouse connects with over 30 different on-prem and cloud-based content management systems to analyze, migrate or consolidate content from one place to another.

  • Removes unnecessary content
  • Classifies and applies rich metadata to content
  • Migrates content into new systems
  • Migrates content for archival purposes
  • Promotes content to the cloud
  • Establishes and respects taxonomies and ontologies


Mergers, Acquisitions & Divestitures

Faster due diligence, organized data rooms, post-acquisition data merges.

PowerHouse speeds up the due diligence process and eliminates the need for manual data searches by locating, identifying and tagging relevant content. After the deal closes, PowerHouse handles the post-acquisition merge, or the divestiture-based separation, of content into the acquiring or divested organization’s data environment.

  • Pre-Due Diligence: remove Redundant, Obsolete and Trivial data (ROT) from repositories
  • Due Diligence: locate, identify and tag relevant content (corporate records, contracts, bank records, etc)
  • Data room: migrate clean, organized content into third-party virtual data rooms
  • Post-acquisition: merge data into new business environments and/or divest data into component ones
  • Coordinate Legal Hold: across organizations, matters and jurisdictions


Data Privacy & Security

Identify Personally Identifiable Information (PII) and other sensitive data.

PowerHouse locates and identifies files that contain PII, PHI or PCI and identifies documents that may be sensitive in nature (contracts, employment agreements, etc).  Identify Personal Data (PD) of European citizens that may be subject to GDPR.

  • Locate and identify files that contain PII, PHI or PCI
  • Classify documents as PII-sensitive for proper handling
  • Apply security access controls and business processes for PII
  • Satisfy data protection and privacy legislation
  • Satisfy GDPR data controller and data processor compliance
  • Automate Data Subject Access Requests (DSARs)
  • Satisfy CCPA, NYDFS and other emerging privacy regulations
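
Pattern-matching for sensitive data can be sketched as follows. This is a simplified illustration, not the detection logic PowerHouse uses: the three regular expressions (US Social Security number format, email address, payment card number) are assumptions chosen for the example, and the names PATTERNS, luhn_valid and detect_sensitive are hypothetical. Card-number matches are filtered with the Luhn checksum to reduce false positives.

```python
import re

# Assumed example patterns; real PII/PHI/PCI detection needs far more coverage.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "card": re.compile(r"\b(?:\d[ -]?){12,18}\d\b"),
}


def luhn_valid(number: str) -> bool:
    """Check a candidate card number with the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:  # double every second digit from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def detect_sensitive(text: str) -> dict[str, list[str]]:
    """Return the matches found in text, keyed by pattern label."""
    hits: dict[str, list[str]] = {}
    for label, pattern in PATTERNS.items():
        matches = pattern.findall(text)
        if label == "card":
            matches = [m for m in matches if luhn_valid(m)]
        if matches:
            hits[label] = matches
    return hits
```

A file whose text produces any hits could then be tagged as PII-sensitive and routed to the access controls and handling rules listed above.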


Data Under Management

Set it and forget it with a single platform that is aware of content across the enterprise and the world.

Valora PowerHouse AutoClassification implements a tooled and automated approach to Content Management that runs perpetually in the background across all data repositories and silos.

  • Constantly monitors for and indexes new, edited, or acquired content
  • Sets and applies rich metadata attribution
  • Implements multi-faceted disposition automatically
  • Runs without interfering with other systems or business processes
  • Automatically schedules heavy system loads and refresh activities
  • Updates machine learning rules automatically as strategy, systems, personnel and regulations change


Legal & eDiscovery

Efficient and cost-effective document analysis.

Valora offers rapid, low-cost and highly defensible eDiscovery processing, review, hosting and professional services.  Our rules-based Technology-Assisted Review (TAR) options are optimized for document collection and analysis.

  • Cover the left side of the Electronic Discovery Reference Model (EDRM)
  • Gather and analyze relevant documents for Early Data Assessment (EDA)
  • Identify appropriate content to be placed under Legal Hold
  • Automate presumptive privilege, responsiveness and data requests
  • Migrate final data sets to third-party review platforms


Records & Information Management

Automate customized end-to-end records management solutions.

Valora’s PowerHouse content management platform analyzes, manages and automates large-scale and customizable records management solutions for any size and type of enterprise.

  • Create customized records management solutions, tailored to your Retention Schedules and policies
  • Manage and automate records and content management disposition across multiple data silos
  • Implement records retention schedules and workflows
  • Tag and remove ROT and duplicates
  • AutoClassify record types based on rich metadata (Document Type, Topic, Custodian, Source, CY+ dates)
  • Professional Services for document management & workflow consulting


Related Resources

Explore Valora Technologies’ Resource Library for helpful articles, videos, presentations, white papers, blog posts and more.