Data labeling will fuel the AI revolution


This article was contributed by Frederik Bussler, consultant and analyst.

AI fuels modern life, from the way we commute to how we order online, and how we find a date or a job. Counting Facebook and Google users alone, billions of people interact with AI-powered applications every day. And that is just the tip of the iceberg of AI's potential.

OpenAI, which recently made headlines again for making its models generally available, uses labeled data to "improve language model behavior," or to make its AI fairer and less biased. This is an important example, as OpenAI's models were long criticized for producing toxic and racist output.

Many of the AI applications we use day-to-day require a particular dataset to function well. To create these datasets, we need to label data for AI.

Why does AI need data labeling?

The term artificial intelligence is something of a misnomer. AI is not actually intelligent. It takes in data and uses algorithms to make predictions based on that data. This process requires a large amount of labeled data.

This is particularly the case when it comes to challenging domains like healthcare, content moderation, or autonomous vehicles. In many instances, human judgment is still required to ensure the models are accurate.

Consider the example of sarcasm in social media content moderation. A Facebook post might read, "Gosh, you're so smart!" However, that could be sarcastic in a way that a robot would miss. More perniciously, a language model trained on biased data can be sexist, racist, or otherwise toxic. For instance, the GPT-3 model once associated Muslims and Islam with terrorism, until labeled data was used to improve the model's behavior.

Provided human bias is itself kept in check, "supervised models allow for more control over bias in data selection," as a 2018 TechCrunch article put it. OpenAI's newer models are a perfect example of using labeled data to control bias. Controlling bias with data labeling is of vital importance. Low-quality AI models have even landed companies in court, as was the case with a firm that attempted to use AI as a screen reader, only to later agree to a settlement when the model didn't work as advertised.

The importance of high-quality AI models is making its way into regulatory frameworks as well. For example, the European Commission’s regulatory framework proposal on artificial intelligence would subject some AI systems to “high quality of the datasets feeding the system to minimize risks and discriminatory outcomes.”

Standardized language and tone analysis are also critical in content moderation. It’s not uncommon for people to have different definitions of the word “literally” or how literally they should take something such as “It was like banging your head against a wall!” To decide which posts are violating community standards, we need to analyze these types of subtleties.

Similarly, the AI startup Handl uses labeled data to more accurately convert documents to structured text. We've all heard of OCR (Optical Character Recognition), but with AI powered by labeled data, it's being taken to a whole new level.

To give another example, to train an algorithm to analyze medical images for signs of cancer, you would need a large dataset of medical images labeled with the presence or absence of cancer. When the goal is to outline tumor regions precisely, the task is referred to as image segmentation, and it can require labeling tens of thousands of pixels in each image. The more labeled data you have, the better your model will be at making accurate predictions.
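To see how labels drive supervised training, here is a minimal, purely illustrative sketch using scikit-learn. The "scans" are synthetic feature vectors, not real medical data, and the model is a stand-in for whatever architecture a real system would use:

```python
# Illustrative sketch: labeled examples are what let a supervised model
# learn to predict. The "scans" here are synthetic feature vectors.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Simulate 1,000 labeled "scans": 20 features each, label 1 = cancer present.
n_samples, n_features = 1000, 20
X = rng.normal(size=(n_samples, n_features))
true_weights = rng.normal(size=n_features)
y = (X @ true_weights + rng.normal(scale=0.5, size=n_samples) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The labels y_train are the supervision signal; without them there is
# nothing for the model to fit against.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f"Held-out accuracy: {accuracy:.2f}")
```

With more (and more accurately) labeled samples, held-out accuracy generally improves, which is the article's core point about data volume and quality.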

Sure, it's possible to train AI algorithms on unlabeled data, but this can lead to biased results, which could have serious implications in many real-world cases.

Applications using data labeling

Data labeling is vital for applications across search, computer vision, voice assistants, content moderation, and more.

Search was one of the first major AI use cases relying on human judgment to determine relevance. With labeled data, search results can be extremely accurate. For instance, Yandex turned to human "annotators" from Toloka to help improve its search engine.

Some of the most popular uses of AI in health care include helping to diagnose skin conditions and diabetic retinopathy, boosting recall rates for medication compliance reviews, and analyzing radiologist reports to detect eye conditions like glaucoma.

Content moderation has also seen significant advances thanks to AI applied to large quantities of labeled data. This is especially true for sensitive topics like violence or threats of violence. For example, people may post videos on YouTube threatening suicide, which need to be immediately detected and differentiated from informational videos about suicide.

Another important use of labeled data is teaching voice assistants like Alexa or Siri to understand voices with any accent or tone. This requires training an algorithm to recognize male and female speech patterns based on large volumes of labeled audio.

Human computing for labeling at scale

All this raises the question: How do you create labeled data at scale?

Manually labeling data for AI is an extremely labor-intensive process. It can take weeks or months to label a few hundred samples this way, and the accuracy rate is not very good, particularly for niche labeling tasks. Staying competitive also means continually updating datasets and building bigger ones than rivals.

The best way to scale data labeling is with a combination of machine learning and human expertise. Companies like Toloka, Appen, and others use AI to match the right people with the right tasks, so the experts do the work that only they can do. This allows firms to scale their labeling efforts. Further, AI can weight the answers from different respondents according to the quality of their responses, which ensures that each label has a high chance of being accurate.
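The weighting idea can be sketched in a few lines. This is a generic quality-weighted majority vote, not any particular vendor's algorithm, and the annotator names and quality scores are made up for illustration:

```python
# Minimal sketch: aggregate crowd labels by weighting each annotator's
# vote by an estimated quality score (higher = more trusted).
from collections import defaultdict

def weighted_majority(votes, quality):
    """votes: {annotator: label}; quality: {annotator: weight in (0, 1]}."""
    scores = defaultdict(float)
    for annotator, label in votes.items():
        # Unknown annotators get a neutral default weight.
        scores[label] += quality.get(annotator, 0.5)
    return max(scores, key=scores.get)

# One highly reliable annotator can outvote two unreliable ones.
votes = {"ann_a": "toxic", "ann_b": "ok", "ann_c": "ok"}
quality = {"ann_a": 0.95, "ann_b": 0.40, "ann_c": 0.40}
print(weighted_majority(votes, quality))  # prints "toxic" (0.95 vs 0.80)
```

In production systems, the quality scores themselves are usually estimated from each annotator's past agreement with consensus or with gold-standard items, rather than assigned by hand.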

With techniques like these, labeled data is fueling a new AI revolution. By combining AI with human judgment, companies can create accurate models of their data. These models can then be used to make better decisions that have a measurable impact on businesses.

Frederik Bussler is a consultant and analyst, with experience across innovative AI platforms such as Commerce.AI, Obviously.AI, and Apteo, as well as investment offices such as Supercap Digital, Maven 11 Capital, and Invictus Capital. He is featured in Forbes, Yahoo, and other outlets, and has presented for audiences including IBM and Nikkei.





CFM RISE open fan architecture jet engine could reduce fuel consumption by 20 percent

GE Aviation and Safran announced a new technology development program that aims to reduce fuel consumption for jet aircraft by 20 percent while also reducing CO2 emissions. The program is called CFM RISE (Revolutionary Innovation for Sustainable Engines) and will demonstrate and mature a range of new, disruptive technologies for future commercial aircraft engines that have the potential to enter service by the mid-2030s.

Both GE Aviation and Safran also agreed as part of the announcement to extend the CFM International 50/50 partnership through the year 2050. The company has a goal of reducing CO2 emissions by 50 percent by 2050. The two companies say that their relationship is the strongest it has ever been. They will work together with the RISE technology demonstration program to reinvent flight for the future.

The companies want to take next-generation single-aisle aircraft to a new level of fuel efficiency and reduced emissions. Executives working on the project say that the current LEAP engine has already reduced emissions by 15 percent compared to past generations of engines. The new RISE technology will reduce that number even further.

New engine technologies also ensure 100 percent compatibility with alternative energy sources, including Sustainable Aviation Fuels and hydrogen. Both companies say the RISE Program is the foundation for the next-generation CFM engine expected to be available by the middle of the 2030s. One of the key features of the new engine is an open fan architecture, which is the key to improved fuel efficiency while delivering the same travel speed and cabin experience offered by current generation aircraft.

The program will leverage hybrid electric capability to optimize the efficiency of the engines while enabling electrification for many aircraft systems. So far, the RISE program has more than 300 separate components, modules, and full engine builds. A demonstrator engine is scheduled to begin testing around the middle of the decade, with a flight test soon after.



Cyberattack prompts shutdown of major fuel pipeline in the US

One of the largest pipelines in the US has been taken offline by its operator following a cyberattack. First reported by the New York Times, Colonial Pipeline, which carries 45 percent of the fuel supplies for the eastern US, said in a statement late Friday that it took “certain systems offline to contain the threat, which has temporarily halted all pipeline operations and affected some of our IT systems.”

The pipeline is 5,500 miles long and carries jet fuel and refined gasoline from the Gulf Coast to New York, according to the Times, transporting some 2.5 million barrels daily.

It’s not yet clear whether the attack targeted Colonial’s industrial control systems, or if the attack was carried out by foreign government hackers. The Washington Post, citing a US official familiar with the matter, reported that the incident was a ransomware attack.

Alpharetta, Georgia-based Colonial said it had engaged a “leading third-party cybersecurity firm” to investigate the nature and scope of the incident, and has contacted law enforcement.

The company's statement continued: "Colonial Pipeline is taking steps to understand and resolve this issue. At this time, our primary focus is the safe and efficient restoration of our service and our efforts to return to normal operation. This process is already underway, and we are working diligently to address this matter and to minimize disruption to our customers and those who rely on Colonial Pipeline."

The Times reported that it was unlikely that the shutdown would cause immediate disruption to consumers, since most of the fuel goes into storage tanks, and the US has seen a reduction in energy use due to the pandemic. How long the pipeline may remain shut down was still unclear Saturday.

Update May 8th 10:14AM ET: Adds detail that the incident reportedly was a ransomware attack
