Microsoft’s Project Alexandria parses documents using unsupervised learning

In 2014, Microsoft launched Project Alexandria, a research effort within its Cambridge research division dedicated to discovering entities — topics of information — and their associated properties. Building on the research lab’s work in knowledge mining using probabilistic programming, the aim of Alexandria was to construct a full knowledge base from a set of documents automatically.

Alexandria technology powers the recently announced Microsoft Viva Topics, which automatically organizes large amounts of content and expertise in an organization. Specifically, the Alexandria team is responsible for identifying topics and metadata, employing AI to parse the content of documents in datasets.

To get a sense of how far Alexandria has come — and how far it still has to go — VentureBeat spoke with Viva Topics director of product development Naomi Moneypenny, Alexandria project lead John Winn, and Alexandria engineering manager Yordan Zaykov in an interview conducted via email. They shared insights on Alexandria’s goals and major breakthroughs to date, as well as the challenges the development team hopes to overcome with future innovations.

Parsing knowledge

Finding information in an enterprise can be hard, and a number of studies suggest that this inefficiency can impact productivity. According to one survey, employees could potentially save four to six hours a week if they didn’t have to search for information. And Forrester estimates that common business scenarios like onboarding new employees could be 20% to 35% faster.

Alexandria addresses this in two ways: topic mining and topic linking. Topic mining involves the discovery of topics in documents and the maintenance and upkeep of those topics as documents change. Topic linking brings together knowledge from a range of sources into a unified knowledge base.

“When I started this work, machine learning was mainly applied to arrays of numbers — images, audio. I was interested in applying machine learning to more structured things: collections, strings, and objects with types and properties,” Winn said. “Such machine learning is very well suited to knowledge mining, since knowledge itself has a rich and complex structure. It is very important to capture this structure in order to represent the world accurately and meet the expectations of our users.”

The idea behind Alexandria has always been to automatically extract knowledge into a knowledge base, initially with a focus on mining knowledge from websites like Wikipedia. But a few years ago, the project transitioned to the enterprise, working with data such as documents, messages, and emails.

“The transition to the enterprise has been very exciting. With public knowledge, there is always the possibility of using manual editors to create and maintain the knowledge base. But inside an organization, there is huge value to having a knowledge base be created automatically, to make the knowledge discoverable and useful for doing work,” Winn said. “Of course, the knowledge base can still be manually curated, to fill gaps and correct any errors. In fact, we have designed the Alexandria machine learning to learn from such feedback, so that the quality of the extracted knowledge improves over time.”

Knowledge mining

Alexandria achieves topic mining and linking through a machine learning approach called probabilistic programming, which describes the process by which topics and their properties are mentioned in documents. The same program can be run backward to extract topics from documents. An advantage of this approach is that information about the task is encoded in the probabilistic program itself, rather than in labeled data. That enables the process to run unsupervised, meaning it can perform these tasks automatically, without any human input.
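
To make the idea concrete, here is a minimal sketch of probabilistic-program-style extraction, assuming a toy forward model of how an entity’s properties generate phrases. The candidate values, scoring, and enumeration below are invented for illustration and are not Alexandria’s actual implementation.

```python
import itertools

# Toy "forward" model: given candidate property values for an entity, how likely
# is each observed phrase? This is an illustrative sketch, not Alexandria's code.
CANDIDATE_OWNERS = ["Jane Smith", "John Doe"]
CANDIDATE_DATES = ["9/12/2021", "10/1/2021"]

def likelihood(phrase, owner, date):
    """P(phrase | properties): reward phrases consistent with the property values."""
    score = 1e-6  # small base probability for phrases unrelated to the properties
    if owner in phrase:
        score += 0.5
    if date in phrase:
        score += 0.5
    return score

observed = [
    "Project Alpha will be released on 9/12/2021",
    "Project Alpha is run by Jane Smith",
]

# "Running the program backward": enumerate property combinations (uniform prior
# assumed) and score how well each one explains all of the observed phrases.
posterior = {}
for owner, date in itertools.product(CANDIDATE_OWNERS, CANDIDATE_DATES):
    p = 1.0
    for phrase in observed:
        p *= likelihood(phrase, owner, date)
    posterior[(owner, date)] = p

total = sum(posterior.values())
for props, p in sorted(posterior.items(), key=lambda kv: -kv[1]):
    print(props, round(p / total, 3))
```

Enumerating every candidate combination and normalizing the scores is the simplest possible stand-in for the inference machinery a real probabilistic programming system provides.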

“A lot of progress has been made in the project since its founding. In terms of machine learning capabilities, we built numerous statistical types to allow for extracting and representing a large number of entities and properties, such as the name of a project, or the date of an event,” Zaykov said. “We also developed a rigorous conflation algorithm to confidently determine whether the information retrieved from different sources refers to the same entity. As to engineering advancements, we had to scale up the system — parallelize the algorithms and distribute them across machines, so that they can operate on truly big data, such as all the documents of an organization or even the entire web.”

To narrow down the information that needs to be processed, Alexandria first runs a query engine that can scale to over a billion documents, extracting snippets from each document with a high probability of containing knowledge. For example, if the model was parsing a document related to a company initiative called Project Alpha, the engine would extract phrases likely to contain entity information, like “Project Alpha will be released on 9/12/2021” or “Project Alpha is run by Jane Smith.”
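
A drastically simplified illustration of that filtering step, assuming a hand-picked list of cue phrases (the real engine’s scoring is far more sophisticated and scales to over a billion documents):

```python
import re

# Hypothetical cue phrases that suggest a sentence carries entity properties.
CUES = ("released on", "is run by", "starts on", "led by")

def extract_snippets(document: str):
    """Return sentences likely to contain entity information."""
    sentences = re.split(r"(?<=[.!?])\s+", document)
    return [s for s in sentences if any(cue in s.lower() for cue in CUES)]

doc = (
    "The cafeteria menu changes weekly. "
    "Project Alpha will be released on 9/12/2021. "
    "Project Alpha is run by Jane Smith."
)
print(extract_snippets(doc))
# -> ['Project Alpha will be released on 9/12/2021.', 'Project Alpha is run by Jane Smith.']
```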

The parsing process requires identifying which parts of text snippets correspond to specific property values. In this approach, the model looks for a set of patterns — templates — such as “Project {name} will be released on {date}.” By matching a template to text, the process can identify which parts of the text correspond with certain properties. Alexandria performs unsupervised learning to create templates from both structured and unstructured text, and the model can readily work with thousands of templates.
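
The article’s example templates can be sketched as patterns with named slots. The two regular expressions below are hand-written for illustration, whereas Alexandria learns thousands of templates from structured and unstructured text without supervision.

```python
import re

# Two hand-written templates in the spirit of the article's example; Alexandria
# learns thousands of these from structured and unstructured text automatically.
TEMPLATES = [
    r"Project (?P<name>\w+) will be released on (?P<date>[\d/]+)",
    r"Project (?P<name>\w+) is run by (?P<owner>[A-Z][a-z]+ [A-Z][a-z]+)",
]

def parse_snippet(snippet: str):
    """Match a snippet against each template and collect the extracted values."""
    properties = {}
    for pattern in TEMPLATES:
        match = re.search(pattern, snippet)
        if match:
            properties.update(match.groupdict())
    return properties

print(parse_snippet("Project Alpha will be released on 9/12/2021"))
# -> {'name': 'Alpha', 'date': '9/12/2021'}
print(parse_snippet("Project Alpha is run by Jane Smith"))
# -> {'name': 'Alpha', 'owner': 'Jane Smith'}
```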

The next step is linking, which identifies duplicate or overlapping entities and merges them using a clustering process. Typically, Alexandria merges hundreds or thousands of items to create entries along with a detailed description of the extracted entity, according to Winn.
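
A toy version of the linking step might cluster extracted records that agree on their shared properties and merge each cluster into one entity. Alexandria’s actual conflation algorithm is probabilistic and weighs evidence far more rigorously; the greedy rule below is an assumption made for illustration.

```python
# Toy linking step: greedily cluster extracted records that agree on their shared
# properties, then merge each cluster into a single entity entry.
records = [
    {"name": "Project Alpha", "owner": "Jane Smith"},
    {"name": "Project Alpha", "date": "9/12/2021"},
    {"name": "Project Beta", "owner": "Sam Lee"},
]

def same_entity(a, b):
    """Records match if they share at least one property and all shared values agree."""
    shared = set(a) & set(b)
    return bool(shared) and all(a[key] == b[key] for key in shared)

clusters = []
for record in records:
    for cluster in clusters:
        if any(same_entity(record, member) for member in cluster):
            cluster.append(record)
            break
    else:
        clusters.append([record])

entities = []
for cluster in clusters:
    merged = {}
    for record in cluster:
        merged.update(record)
    entities.append(merged)

print(entities)
# -> [{'name': 'Project Alpha', 'owner': 'Jane Smith', 'date': '9/12/2021'},
#     {'name': 'Project Beta', 'owner': 'Sam Lee'}]
```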

Alexandria’s probabilistic program can also help sort out errors introduced by humans, like documents in which a project owner was recorded incorrectly. And the linking process can analyze knowledge coming from other sources, even if that knowledge wasn’t mined from a document. Wherever the information comes from, it’s linked together to provide a single unified knowledge base.

Real-world applications

As Alexandria pivoted to the enterprise, the team began exploring experiences that could support employees working with organizational knowledge. One of these experiences grew into Viva Topics, a module of Viva, Microsoft’s collaboration platform that brings together communications, knowledge, and continuous learning.

Viva Topics taps Alexandria to organize information into topics delivered through apps like SharePoint, Microsoft Search, and Office, and soon Yammer, Teams, and Outlook. Extracted projects, events, and organizations are presented in contextually aware cards, along with related metadata about people, content, acronyms, definitions, and conversations.

“With Viva Topics, [companies] are able to use our AI technology to do much of the heavy lifting. This frees [them] up to work on contributing [their] own perspectives and generating new knowledge and ideas based on the work of others,” Moneypenny said. “Viva Topics customers are organizations of all sizes with similar challenges: for example, when onboarding new people, changing roles within a company, scaling individual’s knowledge, or being able to transmit what has been learned faster from one team to another, and innovating on top of that shared knowledge.”

Technical challenges lie ahead for Alexandria, but also opportunities, according to Winn and Zaykov. In the near term, the team hopes to create a schema exactly tailored to the needs of each organization. This would let employees find all events of a given type (e.g. “machine learning talk”) happening at a given time (“the next two weeks”) in a given place (“the downtown office building”), for example.
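
Purely as a hypothetical illustration of what such a tailored schema could enable, the sketch below filters invented event records by type, place, and a two-week window. None of the fields or data reflect an actual Alexandria or Viva Topics API.

```python
from datetime import date, timedelta

# Hypothetical event records and fields, invented for illustration only.
events = [
    {"type": "machine learning talk", "date": date(2021, 7, 20), "place": "downtown office building"},
    {"type": "all-hands meeting", "date": date(2021, 7, 22), "place": "main campus"},
    {"type": "machine learning talk", "date": date(2021, 9, 1), "place": "downtown office building"},
]

def find_events(events, event_type, place, start, window_days=14):
    """Return events of a given type at a given place within the next window_days."""
    end = start + timedelta(days=window_days)
    return [
        event for event in events
        if event["type"] == event_type
        and event["place"] == place
        and start <= event["date"] <= end
    ]

today = date(2021, 7, 15)
print(find_events(events, "machine learning talk", "downtown office building", today))
```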

Beyond this, the Alexandria team aims to develop a knowledge base that leverages an understanding of what a user is trying to achieve and automatically provides relevant information to help them achieve it. Winn calls this “switching from passive to active use of knowledge,” because the idea is to switch from passively recording the knowledge in an organization to actively supporting work being done.

“We can learn from past examples what steps are required to achieve particular goals and help assist with and track these steps,” Winn explained. “This could be particularly useful when someone is doing a task for the first time, as it allows them to draw on the organization’s knowledge of how to do the task, what actions are needed, and what has and hasn’t worked in the past.”


Soniox taps unsupervised learning to build speech recognition systems

AI-powered speech transcription platforms are a dime a dozen in a market estimated to be worth over $1.6 billion. Deepgram and Otter.ai build voice recognition models for cloud-based real-time processing, while Verbit offers tech not unlike that of Oto, which combines intonation with acoustic data to bolster speech understanding. Amazon, Google, Facebook, and Microsoft offer their own speech transcription services.

But a new entrant launching out of beta this week claims its approach yields superior accuracy. Called Soniox, the company leverages vast amounts of unlabeled audio and text to teach its algorithms to recognize speech with accents, background noises, and “far-field” recording. In practice, Soniox says its system correctly transcribes 24% more words compared with other speech-to-text systems, achieving “super-human” recognition on “most domains of human knowledge.”

Those are bold claims, but Soniox founder and CEO Klemen Simonic says the accuracy improvements arise from the platform’s unsupervised learning techniques. With unsupervised learning, an algorithm — in Soniox’s case, a speech recognition algorithm — is fed “unknown” data for which no previously defined labels exist. The system must teach itself to classify the data, processing it to learn from its structure.

Unsupervised speech

At the advent of the modern AI era, when people realized powerful hardware and datasets could yield strong predictive results, the dominant form of machine learning fell into a category known as supervised learning. Supervised learning is defined by its use of labeled datasets to train algorithms to classify data, predict outcomes, and more.

Simonic, a former Facebook researcher and engineer who helped build the speech team at the social network, notes that supervised learning in speech-to-text is both time-consuming and expensive. Companies have to obtain tens of thousands of hours of audio and recruit human teams to manually transcribe the data. And this same process has to be repeated for each language.

“Google and Facebook have more than 50,000 hours of transcribed audio. One has to invest millions — more like tens of millions — of dollars into collecting transcribed data,” Simonic told VentureBeat via email. “Only then one can train a speech recognition AI on the transcribed data.”

A technique known as semi-supervised learning offers a potential solution. It can accept partially labeled data, and Google recently used it to obtain state-of-the-art results in speech recognition. In the absence of labels, however, unsupervised learning — also known as self-supervised learning — is the only way to fill gaps in knowledge.

Above: Soniox founder and CEO Klemen Simonic.

According to Simonic, Soniox’s self-supervised learning pipeline sources audio and text from the internet. In the first iteration of training, the company used the Librispeech dataset, which contains 960 hours of transcribed audiobooks.

Soniox’s iterative approach continuously refines the platform’s algorithms, enabling them to recognize more words as the system gains access to additional data. Currently, Soniox’s vocabulary spans people, places, and geography, as well as domains including education, technology, engineering, medicine, health, law, science, art, history, food, sports, and more.

“To do fine-tuning of a particular model on a particular dataset, you would need an actual transcribed audio dataset. We do not require transcribed audio data to train our speech AI. We do not do fine-tuning,” Simonic said.

Dataset and infrastructure

Soniox claims to have a proprietary dataset containing over 88,000 hours of audio and 6.6 billion words of preprocessed text. By comparison, recent speech recognition research from Facebook and Microsoft used between 13,100 and 65,000 hours of labeled, transcribed speech data. And Mozilla’s Common Voice, one of the largest public annotated voice corpora, has 9,000 hours of recordings.

While relatively underexplored in the speech domain, a growing body of research demonstrates the potential of learning from unlabeled data. Microsoft is using unsupervised learning to extract knowledge about disruptions to its cloud services. More recently, Facebook announced SEER, an unsupervised model trained on a billion images that ostensibly achieves state-of-the-art results on a range of computer vision benchmarks.

Soniox collects more data on a weekly basis, with the goal of expanding the range of vocabulary the platform can transcribe. However, Simonic points out that more audio and text isn’t necessarily required to improve word accuracy. Soniox’s algorithms can “extract” more about familiar words with multiple iterations, essentially learning to recognize particular words better than before.

Above: Soniox’s cloud platform.

Image Credit: Soniox

AI has a well-known bias problem, and unsupervised learning doesn’t eliminate the potential for bias in a system’s predictions. For example, unsupervised computer vision systems can pick up racial and gender stereotypes present in training datasets. Simonic says Soniox has taken care to ensure its audio data is “extremely diverse,” with speakers from most countries and accents around the world represented. He admits that the data distribution across accents isn’t balanced but claims the system still manages to perform “extremely well” with different speakers.

Soniox also built its own training hardware infrastructure, which it stores across multiple servers located in a colocation datacenter facility. Simonic says the company’s engineering team installed and optimized the system and machine learning frameworks and wrote the inference engine from scratch.

“It is utterly important to have under control every single bit of transfer and computation when you are training AI models at large scale. You need a rather large amount of computation to do just one iteration over a dataset of more than 88,000 hours,” Simonic said. “[The inferencing engine] is highly optimized and can potentially run on any hardware. This is super important for production deployment because speech recognition is computationally expensive to run compared to most other AI models and saving every bit of computation on a large volume amounts to large sums in savings — think of millions of hours of audio and video per month.”

Scaling up

After launching in beta earlier this year, Soniox is making its platform generally available. New users get five hours per month of free speech recognition, which can be used in Soniox’s web or iOS app to record live audio from a microphone or upload and transcribe files. Soniox offers an unlimited number of free recognition sessions for up to 30 seconds per session, and developers can use the hours to transcribe audio through the Soniox API.

It’s early days, but Soniox says it recently signed its first customer in DeepScribe, a transcription startup targeting health care. DeepScribe switched from a Google speech-to-text model because Soniox’s transcriptions of doctor-patient conversations were more accurate, Simonic claims.

“To make a business, developing novel technology is not enough. Thus we developed services and products around our new speech recognition technology,” Simonic said. “I expect there will be a lot more customers like DeepScribe once the word about Soniox gets around.”


Supervised vs. unsupervised learning: What’s the difference?

At the advent of the modern AI era, when it was discovered that powerful hardware and datasets could yield strong predictive results, the dominant form of machine learning fell into a category known as supervised learning. Supervised learning is defined by its use of labeled datasets to train algorithms to classify data, predict outcomes, and more. But while supervised learning can, for example, anticipate the volume of sales for a given future date, it has limitations in cases where data falls outside the context of a specific question.

That’s where semi-supervised and unsupervised learning come in. With unsupervised learning, an algorithm is subjected to “unknown” data for which no previously defined categories or labels exist. The machine learning system must teach itself to classify the data, processing the unlabeled data to learn from its inherent structure. In the case of semi-supervised learning — a bridge between supervised and unsupervised learning — an algorithm determines the correlations between data points and then uses a small amount of labeled data to mark those points. The system is then trained based on the newly applied data labels.

Unsupervised learning excels in domains where labeled data is scarce, but it’s not without its own weaknesses — nor is semi-supervised learning. That’s why, particularly in the enterprise, it helps to define the business problem in need of solving before deciding which machine learning approach to take. While supervised learning might be suited for classification tasks, like sorting business documents and spreadsheets, it would adapt poorly in a field like health care if used to identify anomalies in unannotated data, like test results.

Supervised learning

Supervised learning is the most common form of machine learning used in the enterprise. In a recent O’Reilly report, 82% of respondents said that their organization opted to adopt supervised learning versus unsupervised or semi-supervised learning. And according to Gartner, supervised learning will remain the type of machine learning that organizations leverage most through 2022.

Why the preference for supervised learning? It’s perhaps because it’s effective in a number of business scenarios, including fraud detection, sales forecasting, and inventory optimization. For example, a model could be fed data from thousands of bank transactions, with each transaction labeled as fraudulent or not, and learn to identify patterns that led to a “fraudulent” or “not fraudulent” output.

Supervised learning algorithms are trained on input data annotated for a particular output until they can detect the underlying relationships between the inputs and output results. During the training phase, the system is fed with labeled datasets, which tell it which output is related to each specific input value. The supervised learning process progresses by constantly measuring the resulting outputs and fine-tuning the system to get closer to the target accuracy.
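
A compact version of the fraud example above, assuming synthetic labeled transactions and a held-out test split so that the reported accuracy reflects generalization rather than memorization:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for labeled bank transactions: [amount, hour_of_day],
# with label 1 = fraudulent and 0 = not. Real systems use far richer features.
rng = np.random.default_rng(0)
normal = np.column_stack([rng.normal(50, 20, 500), rng.normal(14, 3, 500)])
fraud = np.column_stack([rng.normal(900, 200, 50), rng.normal(3, 2, 50)])
X = np.vstack([normal, fraud])
y = np.array([0] * 500 + [1] * 50)

# Hold out a test set so the reported accuracy reflects generalization.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```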

Supervised learning requires high-quality, balanced, normalized, and thoroughly cleaned training data. Biased or duplicate data will skew the system’s understanding, with data diversity usually determining how well it performs when presented with new cases. But high accuracy isn’t necessarily a good indication of performance — it might also mean the model is suffering from overfitting, where it’s overtuned to a particular dataset. In this case, the system will perform well in test scenarios but fail when presented with a real-world challenge.

One downside of supervised learning is that a failure to carefully vet the training datasets can lead to catastrophic results. An earlier version of ImageNet, a dataset used to train AI systems around the world, was found to contain photos of naked children, porn actresses, college parties, and more — all scraped from the web without those individuals’ consent. Another computer vision corpus, 80 Million Tiny Images, was found to have a range of racist, sexist, and otherwise offensive annotations, such as nearly 2,000 images labeled with the N-word, and labels like “rape suspect” and “child molester.”

Semi-supervised learning

In machine learning problems where supervised learning might be a good fit but there’s a lack of quality data available, semi-supervised learning offers a potential solution. Residing between supervised and unsupervised learning, semi-supervised learning accepts data that’s partially labeled or where the majority of the data lacks labels.

The ability to work with limited data is a key benefit of semi-supervised learning, because data scientists spend the bulk of their time cleaning and organizing data. In a recent report from Alation, a clear majority of respondents (87%) pegged data quality issues as the reason their organizations failed to successfully implement AI.

Semi-supervised learning is also applicable to real-world problems where the small amount of available labeled data would prevent supervised learning algorithms from functioning properly. For example, it can alleviate the data prep burden in speech analysis, where labeling audio files is typically very labor-intensive. Web classification is another potential application; organizing the knowledge available in billions of webpages would take an inordinate amount of time and resources if approached from a supervised learning perspective.
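
One common semi-supervised recipe is self-training, or pseudo-labeling: train on the small labeled set, label the unlabeled points the model is confident about, and retrain. The sketch below uses synthetic data and a simple classifier purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# A small labeled set plus a much larger unlabeled pool (synthetic data).
rng = np.random.default_rng(1)
X_labeled = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(4, 1, (10, 2))])
y_labeled = np.array([0] * 10 + [1] * 10)
X_unlabeled = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(4, 1, (200, 2))])

model = LogisticRegression().fit(X_labeled, y_labeled)

for _ in range(3):  # a few rounds of pseudo-labeling
    probs = model.predict_proba(X_unlabeled)
    confident = probs.max(axis=1) > 0.95  # keep only confident predictions
    pseudo_labels = probs.argmax(axis=1)[confident]
    X_augmented = np.vstack([X_labeled, X_unlabeled[confident]])
    y_augmented = np.concatenate([y_labeled, pseudo_labels])
    model = LogisticRegression().fit(X_augmented, y_augmented)

print("pseudo-labeled points used in the final round:", int(confident.sum()))
```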

Unsupervised learning

Where labeled datasets don’t exist, unsupervised learning — also known as self-supervised learning — can help to fill the gaps in domain knowledge. Clustering is the most common process used to identify similar items in unsupervised learning. The task is performed with the goal of finding similarities in data points and grouping similar data together.

Clustering similar data points helps to create more accurate profiles and attributes for different groups. Clustering can also be used to reduce the dimensionality of the data where there are significant amounts of data.

Reducing dimensions, a process that isn’t unique to unsupervised learning, decreases the number of attributes in datasets so that the data generated is more relevant to the problem being solved. Reducing dimensions also helps cut down on the storage space required to store datasets and can potentially improve performance.
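
The sketch below illustrates both workhorses on synthetic data, assuming scikit-learn is available: principal component analysis to reduce ten attributes to two, then k-means to group similar points without labels.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Synthetic 10-dimensional data containing two hidden groups.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (100, 10)), rng.normal(3, 1, (100, 10))])

# Dimensionality reduction: keep two components that capture most of the variance.
X_reduced = PCA(n_components=2).fit_transform(X)

# Clustering: group similar points together without any labels.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_reduced)
print("cluster sizes:", np.bincount(labels))
```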

Unsupervised learning can be used to flag high-risk gamblers, for example, by determining which spend more than a certain amount on casino websites. It can also help with characterizing interactions on social media by learning the relationships between things like likes, dislikes, shares, and comments.

Microsoft is using unsupervised learning to extract knowledge about disruptions to its cloud services. In a paper, researchers at the company detail SoftNER, a framework that Microsoft deployed internally to collate information regarding storage, compute, and outages. They claim that it eliminated the need to annotate a large amount of training data while scaling to a high volume of timeouts, slow connections, and other product interruptions.

More recently, Facebook announced SEER, an unsupervised model trained on a billion images that ostensibly achieves state-of-the-art results on a range of computer vision benchmarks. SEER learned to make predictions from random pictures found on Instagram profile pages.

Unfortunately, unsupervised learning doesn’t eliminate the potential for bias in the system’s predictions. For example, unsupervised computer vision systems can pick up racial and gender stereotypes present in training datasets. Some experts, including Facebook chief scientist Yann LeCun, theorize that removing these biases might require a specialized training of unsupervised models with additional, smaller datasets curated to “unteach” specific biases. But more research must be done to figure out the best way to accomplish this.

Choosing the right approach

Between supervised, semi-supervised, and unsupervised learning, there’s no flawless approach. So which is the right method to choose? Ultimately, it depends on the use case.

Supervised learning is best for tasks like forecasting, classification, performance comparison, predictive analytics, pricing, and risk assessment. Semi-supervised learning often makes sense for general data creation and natural language processing. As for unsupervised learning, it has a place in performance monitoring, sales functions, search intent, and potentially far more.

As new research emerges addressing the shortcomings of existing training approaches, the optimal mix of supervised, semi-supervised, and unsupervised learning is likely to change. But identifying where these techniques bring the most value — and do the least harm — to customers will always be the wisest starting point.


Analytics startup Unsupervised raises $35M to spot patterns in enterprise data

Boulder, Colorado-based Unsupervised, a big data analytics company leveraging AI to find patterns in business data, today announced that it raised $35 million in a series B round led by Cathay Innovation and Signalfire. Unsupervised says that it intends to use the funding to hire additional employees as it continues to develop its platform.

Most enterprises have to wrangle countless data buckets, some of which inevitably become underused or forgotten. A Forrester survey found that between 60% and 73% of all data within corporations is never analyzed for insights or larger trends. The opportunity cost of this unused data is substantial, with a Veritas report pegging it at $3.3 trillion by 2020. That’s perhaps why the corporate sector has taken an interest in solutions that ingest, understand, organize, and act on digital content from multiple digital sources.

Unsupervised claims to accomplish this by analyzing unstructured and structured datasets to arrive at insights “without ignoring the long tail.” The company automates data science processes including preparation and prioritization, making predictions on data in industries spanning transportation, supply chain, ecommerce, and sales and marketing.

“We’re seeing a shift in the market where customers are seeking out analytics and AI platforms that don’t just do simple reporting — they reveal opportunities to change the business. BI and traditional AI is great for probing handfuls of known problems, but when you’re really trying to understand what’s happening you need to investigate beyond known issues,” CEO Noah Horton told VentureBeat via email. “This is where unsupervised learning is uniquely valuable. COVID really revealed the need for what we’ve built and this round will help us expand our footprint faster.”

Unsupervised says that its AI can identify statistically significant patterns that highlight the differences across subgroups within the data. Using a technique called unsupervised learning or self-supervised learning, Unsupervised’s systems can generate labels from data by exposing the relationships between the data’s parts. That’s as opposed to traditional, supervised AI systems, which require annotated datasets in order to learn patterns and make predictions.

Above: Unsupervised’s web dashboard.

Image Credit: Unsupervised

For example, in the supply chain domain, Unsupervised’s AI can ostensibly look at the nuances of the local economy, logistics site, employee details, and shipments and inventory to spotlight areas with excess or insufficient supply. On the finance side, Unsupervised can draw on databases to find fraud schemes and spot financial trends, like where people are willing to spend versus save. The technology even has applications in health care, Unsupervised says, where it can reveal opportunities to minimize the time spent on administrative tasks.

Unsupervised’s platform presents AI-discovered patterns to customers for review in a web dashboard. Teams can track the performance of these patterns over time, and the AI system learns from what’s prioritized and acted on to continuously improve the insights.
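
As a generic illustration of what surfacing “statistically significant patterns across subgroups” can look like (this is not Unsupervised’s proprietary method), the sketch below compares each subgroup’s average spend against the rest of the data with a significance test on invented records.

```python
import numpy as np
from scipy import stats

# Invented customer records; not Unsupervised's data or algorithm.
rng = np.random.default_rng(3)
regions = rng.choice(["north", "south", "west"], size=600)
spend = rng.normal(100, 15, size=600)
spend[regions == "south"] += 25  # inject a subgroup pattern to be discovered

# For each subgroup, test whether its average spend differs from the rest.
for region in ["north", "south", "west"]:
    in_group = spend[regions == region]
    rest = spend[regions != region]
    t_stat, p_value = stats.ttest_ind(in_group, rest, equal_var=False)
    print(f"{region}: mean spend {in_group.mean():.1f}, p-value {p_value:.4f}")
```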

Momentum in the market

Unsupervised isn’t disclosing many customers at this point. That said, the company volunteered that it has “a number” of Fortune 500 customers using the product, including teams at ADP, Disney, and Coatue.

“Unsupervised’s customers use the platform for multiple use cases. The average customer is using the platform across three or more use cases. Some customers are supporting as many as seven use cases with Unsupervised at one time,” a spokesperson told VentureBeat.

In its recent Augmented Analytics Is the Future of Analytics report, Gartner predicts that by 2021, “augmented analytics” like Unsupervised’s will drive new purchases of analytics and business intelligence, as well as data science and machine learning platforms. Assuming this comes to pass, 75-employee Unsupervised’s prospects in the $168.8 billion business analytics market look bright — even in the face of competition from companies like Outlier.

“Most companies recognize that data is the new ‘gold’ but still struggle to derive meaningful insights given the deluge of siloed data, both structured and unstructured, across organizations — exasperating teams that are already understaffed and overwhelmed,” Cathay Innovation cofounder and CEO Denis Barrier told VentureBeat. “However, Unsupervised’s unique approach to ‘AI-augmented analytics’ has the potential to be a game-changing tool. It is disrupting the entire process by ingesting data from everywhere and automating the time consuming, tedious portions so users can quickly draw the most interesting insights that are revenue-generating and actionable. We’re honored to support the company on their journey, which very well may usher in a transformation of big data and decision-making in the enterprise.”

Eniac Ventures and Coatue also participated in the company’s latest funding round. It brings Unsupervised’s total raised to over $55 million following a $12.8 million series A round in August 2019.
