AI-produced images can’t fix diversity issues in dermatology databases

Image databases of skin conditions are notoriously biased towards lighter skin. Rather than wait for the slow process of collecting more images of conditions like cancer or inflammation on darker skin, one group wants to fill in the gaps using artificial intelligence. It’s working on an AI program to generate synthetic images of diseases on darker skin — and using those images for a tool that could help diagnose skin cancer.

“Having real images of darker skin is the ultimate solution,” says Eman Rezk, a machine learning expert at McMaster University in Canada working on the project. “Until we have that data, we need to find a way to close the gap.”

But other experts in the field worry that synthetic images could introduce problems of their own. The focus should be on adding more diverse real images to existing databases, says Roxana Daneshjou, a clinical scholar in dermatology at Stanford University. “Creating synthetic data sounds like an easier route than doing the hard work to create a diverse data set,” she says.

There are dozens of efforts to use AI in dermatology. Researchers build tools that can scan images of rashes and moles to figure out the most likely type of issue. Dermatologists can then use the results to help them make diagnoses. But most tools are built on databases of images that either don’t include many examples of conditions on darker skin or don’t have good information about the range of skin tones they include. That makes it hard for groups to be confident that a tool will be as accurate on darker skin.

That’s why Rezk and the team turned to synthetic images. The project has four main phases. The team already analyzed available image sets to understand how underrepresented darker skin tones were to begin with. It also developed an AI program that used images of skin conditions on lighter skin to produce images of those conditions on dark skin and validated the images that the model gave them. “Thanks to the advances in AI and deep learning, we were able to use the available white scan images to generate high-quality synthetic images with different skin tones,” Rezk says.

Next, the team will combine the synthetic images of darker skin with real images of lighter skin to create a program that can detect skin cancer. It will continuously check image databases for any new, real pictures of skin conditions on darker skin that it can add to the future model, Rezk says.

The team isn’t the first to create synthetic skin images — a group that included Google Health researchers published a paper in 2019 describing a method to generate them, and it could create images of varying skin tones. (Google is interested in dermatology AI and last spring announced a tool that can identify skin conditions.)

Rezk says synthetic images are a stopgap until more real pictures of conditions on darker skin are available. Daneshjou, though, worries about using synthetic images at all, even as a temporary solution. Research teams would have to carefully check whether AI-generated images have any unusual quirks that people wouldn’t be able to see with the naked eye. That type of quirk could theoretically skew results from an AI program. The only way to confirm that the synthetic images work as well as real images in a model would be to compare them with real images — which are in short supply. “Then goes back to the fact of, well, why not just work on trying to get more real images?” she says.

If a diagnostic model is based on synthetic images from one group and real images from another — even temporarily — that’s a concern, Daneshjou says. It could lead to the model performing differently on different skin tones.

Leaning on synthetic data could also make people less likely to push for real, diverse images, she says. “If you’re going to do that, are you actually going to keep doing the work?” she says. “I would actually like to see more people do work on getting real data that is diverse, rather than trying to do this workaround.”

Repost: Original Source and Author Link


MindsDB wants to give enterprise databases a brain


Databases are the cornerstone of most modern business applications, be it for managing payroll, tracking customer orders, or storing and retrieving just about any piece of business-critical information. With the right supplementary business intelligence (BI) tools, companies can derive all manner of insights from their vast swathes of data, such as establishing sales trends to inform future decisions. But when it comes to making accurate forecasts from historical data, that’s a whole new ball game, requiring different skillsets and technologies.

This is something that MindsDB is setting out to solve, with a platform that helps anyone leverage machine learning (ML) to future-gaze with big data insights. In the company’s own words, it wants to “democratize machine learning by giving enterprise databases a brain.”

Founded in 2017, Berkeley, California-based MindsDB enables companies to make predictions directly from their database using standard SQL commands, and visualize them in their application or analytics platform of choice.

To further develop and commercialize its product, MindsDB this week announced that it has raised $3.75 million, bringing its total funding to $7.6 million. The company also unveiled partnerships with some of the most recognizable database brands, including Snowflake, SingleStore, and DataStax, which will bring MindsDB’s ML platform directly to those data stores.

Using the past to predict the future

There are myriad use cases for MindsDB, such as predicting customer behavior, reducing churn, improving employee retention, detecting anomalies in industrial processes, credit-risk scoring, and predicting inventory demand — it’s all about using existing data to figure out what that data might look like at a later date.

An analyst at a large retail chain, for example, might want to know how much inventory they’ll need to fulfill demand in the future based on a number of variables. By connecting their database (e.g., MySQL, MariaDB, Snowflake, or PostgreSQL) to MindsDB, and then connecting MindsDB to their BI tool of choice (e.g., Tableau or Looker), they can ask questions and see what’s around the corner.

“Your database can give you a good picture of the history of your inventory because databases are designed for that,” MindsDB CEO Jorge Torres told VentureBeat. “Using machine learning, MindsDB enables your database to become more intelligent to also give you forecasts about what that data will look like in the future. With MindsDB you can solve your inventory forecasting challenges with a few standard SQL commands.”

Above: Predictions visualization generated by the MindsDB platform

Torres said that MindsDB enables what is known as In-Database ML (I-DBML), which lets users create, train, and use ML models in SQL, as if they were tables in a database.

“We believe that I-DBML is the best way to apply ML, and we believe that all databases should have this capability, which is why we have partnered with the best database makers in the world,” Torres explained. “It brings ML as close to the data as possible, integrates the ML models as virtual database tables, and can be queried with simple SQL statements.”
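The idea of querying a model the way you query a table can be sketched outside any particular product with SQLite's user-defined functions: fit a model on rows already in the database, then expose it as a scalar function so predictions are available to plain SQL. The table, data, and function name below are invented for illustration, and the "model" is a deliberately trivial least-squares line, not anything MindsDB actually ships.

```python
import sqlite3

# Invented "inventory" history: (week, units_sold). In a real deployment
# this would already live in the production database.
history = [(1, 100), (2, 110), (3, 120), (4, 130)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inventory (week INTEGER, units_sold INTEGER)")
conn.executemany("INSERT INTO inventory VALUES (?, ?)", history)

# "Train" a trivial model: ordinary least-squares fit of units_sold on week.
n = len(history)
sx = sum(w for w, _ in history)
sy = sum(u for _, u in history)
sxy = sum(w * u for w, u in history)
sxx = sum(w * w for w, _ in history)
slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
intercept = (sy - slope * sx) / n

# Expose the model to SQL as a scalar function, so predictions can be
# requested from an ordinary query -- the gist of in-database ML.
conn.create_function("predict_units", 1, lambda week: intercept + slope * week)

print(conn.execute("SELECT predict_units(5)").fetchone()[0])  # 140.0
```

In a full I-DBML system the training step itself would also be triggered from SQL, and the model would appear as a queryable virtual table rather than a function, but the data never has to leave the database in either case.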

MindsDB ships in three broad variations — a free, open source incarnation that can be deployed anywhere; an enterprise version that includes additional support and services; and a hosted cloud product that recently launched in beta, which charges on a per-usage basis.

The open source community has been a major focus for MindsDB so far; the company claims tens of thousands of installations by developers around the world — including developers working at companies such as PayPal, Verizon, Samsung, and American Express. While this organic approach will continue to form a big part of MindsDB’s growth strategy, Torres said his company is in the early stages of commercializing the product with companies across numerous industries, though he wasn’t at liberty to reveal any names.

“We are in the validation stage with several Fortune 100 customers, including financial services, retail, manufacturing, and gaming companies, that have highly sensitive data that is business critical — and [this] precludes disclosure,” Torres said.

The problem that MindsDB is looking to fix is one that impacts just about every business vertical, spanning businesses of all sizes — even the biggest companies won’t want to reinvent the wheel by developing every facet of their AI armory from scratch.

“If you have a robust, working enterprise database, you already have everything you need to apply machine learning from MindsDB,” Torres explained. “Enterprises have put vast resources into their databases, and some of them have even put decades of effort into perfecting their data stores. Then, over the past few years, as ML capabilities started to emerge, enterprises naturally wanted to leverage them for better predictions and decision-making.”

While companies might want to make better predictions from their data, the process of extracting, transforming, and loading (ETL) all that data into other systems is fraught with complexity and doesn’t always produce great outcomes. With MindsDB, the data is left where it is in the original database.

“That way, you’re dramatically reducing the timeline of the project from years or months to hours, and likewise you’re significantly reducing points of failure and cost,” Torres said.

The Switzerland of machine learning

The competitive landscape is fairly extensive, depending on how you consider the scope of the problem. Several big players have emerged to arm developers and analysts with AI tooling, such as the heavily VC-backed DataRobot and H2O, but Torres sees these types of companies as potential partners rather than direct competitors. “We believe we have figured out the best way to bring intelligence directly to the database, and that is potentially something that they could leverage,” Torres said.

And then there are the cloud platform providers themselves, such as Amazon, Google, and Microsoft, which offer their customers machine learning as add-ons. In those instances, however, these services are really just ways to sell more of their core product, which is compute and storage. Torres also sees potential for partnering with these cloud giants in the future. “We’re a neutral player — we’re the Switzerland of machine learning,” Torres added.

MindsDB’s seed funding includes investments from a slew of notable backers, including OpenOcean, which claims MariaDB cofounder Patrik Backman as a partner, YCombinator (MindsDB graduated YC’s winter 2020 batch), Walden Catalyst Ventures, SpeedInvest, and Berkeley’s SkyDeck fund.


VentureBeat’s mission is to be a digital town square for technical decision-makers to gain knowledge about transformative technology and transact.

Our site delivers essential information on data technologies and strategies to guide you as you lead your organizations. We invite you to become a member of our community, to access:

  • up-to-date information on the subjects of interest to you
  • our newsletters
  • gated thought-leader content and discounted access to our prized events, such as Transform 2021: Learn More
  • networking features, and more

Become a member



A deep dive into privacy-protecting databases


Privacy-protecting databases use a number of techniques to guard data. The complexity of these techniques has evolved as threats to data privacy have risen dramatically.

The simplest way to protect individuals’ records in databases may be to assign digital pseudonyms that can be stored in a separate database. Researchers are given only the first database, with the pseudonyms relieving them of the obligation to protect people’s real names. The database with real names may be stored in a second, more carefully protected location — or even completely discarded.
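A minimal sketch of that split, with invented names and records: the researcher-facing table carries only a random pseudonym and the medical data, while the mapping back to real names lives in a separate structure that can be locked away in the more protected location or deleted outright.

```python
import secrets

# Invented sample records for illustration.
patients = [
    {"name": "Alice Jones", "diagnosis": "eczema"},
    {"name": "Bob Smith", "diagnosis": "psoriasis"},
]

research_db = []   # handed to researchers: pseudonym + medical data only
identity_db = {}   # stored separately (or discarded): pseudonym -> real name

for record in patients:
    pseudonym = secrets.token_hex(8)  # random, unlinkable identifier
    research_db.append({"id": pseudonym, "diagnosis": record["diagnosis"]})
    identity_db[pseudonym] = record["name"]

# Researchers see no names; re-identification requires the second database.
print(research_db[0]["diagnosis"])  # eczema
```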

More sophisticated approaches use encryption or a one-way function to compute the pseudonym. This can give users the ability to retrieve their information from the database by reconstructing the pseudonym. But anyone who accesses the database can’t easily match the records up with names. My well-aged book, Translucent Databases, explored a number of different approaches to this, and there have been many innovations since then.

Some of the most complicated solutions are called “homomorphic encryption.” In these systems, sensitive information is completely encrypted, but complex algorithms are specially designed to allow some basic operations without decryption. For example, some computers may add up a list of numbers from an accounting database without being able to unscramble associated protected values.
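A toy version of the Paillier cryptosystem, an additively homomorphic scheme, makes the idea concrete. The primes here are absurdly small and hard-coded purely for illustration; a real deployment would use randomly generated keys thousands of bits long.

```python
import math
import random

# Toy Paillier keypair with insecurely small primes -- illustration only.
p, q = 17, 19
n = p * q                      # public modulus
n2 = n * n
lam = math.lcm(p - 1, q - 1)   # private key
mu = pow(lam, -1, n)           # valid because we fix the generator g = n + 1

def encrypt(m):
    r = random.randrange(2, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(2, n)
    # With g = n + 1, g^m mod n^2 simplifies to 1 + m*n.
    return ((1 + m * n) * pow(r, n, n2)) % n2

def decrypt(c):
    # L(x) = (x - 1) // n, then multiply by mu modulo n.
    return ((pow(c, lam, n2) - 1) // n * mu) % n

c1, c2 = encrypt(12), encrypt(30)
# Multiplying ciphertexts adds the underlying plaintexts -- the party
# doing the multiplication never needs to decrypt the inputs.
print(decrypt((c1 * c2) % n2))  # 42
```

The product of the two ciphertexts is a valid encryption of 12 + 30, which is exactly the "add up a list of numbers without unscrambling them" scenario described above.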

Homomorphic encryption is far from mature. Many of the early systems require too much computation to be practical, especially for large databases with many entries. They often require the encryption algorithms to be customized for the data analysis that might come. Still, mathematicians are doing exciting work in the area, and many recent innovations have dramatically reduced the workload involved.

In recent years, researchers have started seriously exploring how adding fake entries or shifting values by adding random noise can make it harder to identify individuals in a database. But if the noise is mixed in correctly, it will cancel out when computing some aggregated statistics, like averages — a technique referred to as “differential privacy.”
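A small simulation shows the trade-off. Each record is perturbed with Laplace noise large enough to obscure any individual value, yet the mean over many records stays close to the true mean. The data and noise scale are invented for the demo; calibrating that scale to a formal privacy budget is where the actual differential privacy machinery comes in.

```python
import random
import statistics

random.seed(0)  # reproducible demo

# Invented sensitive values, e.g. individual salaries.
true_values = [random.gauss(50_000, 8_000) for _ in range(10_000)]

def laplace_noise(scale):
    # A Laplace(0, scale) draw as the difference of two exponentials.
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

# Each released record is heavily perturbed, obscuring any one individual...
noisy_values = [v + laplace_noise(scale=5_000) for v in true_values]

# ...but the zero-mean noise largely cancels in the aggregate.
error = abs(statistics.mean(noisy_values) - statistics.mean(true_values))
print(error)  # small relative to the per-record noise scale of 5,000
```

With 10,000 records the error in the mean is typically well under 100, even though each individual value was shifted by thousands on average.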

What are some use cases?

  • Saving time and money on security by stripping the most valuable data from copies that are harder to protect. A local version of the database stored at a branch may delete names to remove the danger of loss. The central database can keep complete records for compliance in a more secure building.
  • Sharing data with researchers. If a business or a school wants to cooperate with a research program, it may ship a version of the database with obscured personal information while withholding a complete version if it’s ever necessary to discover the correct name connected to a record.
  • Encouraging compliance with rules for record-keeping while also maintaining customers’ privacy.
  • Offering strategic protection for military operations while also sharing sufficient data with allies for planning.
  • A commerce system designed to minimize the danger of insider trading while still tracking all transactions for compliance and settlement.
  • A fraud detecting accounting system that balances disclosure with privacy.

Vendor approaches to encryption

Makers of established databases have long experimented with encryption algorithms that scan and scramble the data in particular rows and columns so it can be viewed only by someone with the right access key. These algorithms can protect privacy, but many privacy-protecting approaches try to avoid blanket encryption. The goal is to balance secrecy with sharing: protect private information while revealing non-private information to researchers.

Often encryption algorithms are used as a component of this strategy. Personal information, like names and addresses, is encrypted, and the key is kept only by trusted insiders. Other users receive access to the unencrypted sections.

One common technique involves using one-way functions like the SHA256 hash algorithm to create keys for particular records. Anyone can store and retrieve their personal information because they can compute the key for the data by hashing their name, for example. But attackers who might be browsing the data can’t reverse the one-way function to recover the name.
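A minimal sketch with hashlib, using an invented name and record: the SHA-256 hash of the name serves as the lookup key, so the owner can always recompute it, while the stored keys reveal nothing directly. (In practice a secret salt or a keyed hash such as HMAC is added, because bare names are guessable — an attacker could hash candidate names and compare.)

```python
import hashlib

store = {}  # record key -> record data

def record_key(name):
    # One-way: easy to compute from the name, infeasible to invert.
    return hashlib.sha256(name.encode("utf-8")).hexdigest()

# A user stores and later retrieves data by re-hashing their own name.
store[record_key("Alice Jones")] = {"balance": 120}
print(store[record_key("Alice Jones")]["balance"])  # 120

# An attacker browsing the store sees only opaque 64-character hex digests.
print(list(store)[0])
```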

A newer option doesn’t require encryption, at least not directly. Sometimes fake data is mixed into the database, and other times the actual data values are distorted by a small amount. Identifying any individual’s records becomes difficult because of the noise.

Some companies are extending their product line with libraries that add differential privacy to data collections. Google recently open-sourced its internal tool called Privacy-on-Beam, a collection of libraries written in C++, Go, and Java. Users can inject noise before or after storing the information in a Google Cloud database.

Microsoft also recently offered a differential privacy toolkit that was developed in collaboration with computer scientists at Harvard. The team demonstrated how the tool can be employed for a variety of use cases, like sharing a dataset used for training an artificial intelligence application or computing statistics used for planning marketing campaigns.

Oracle has also been exploring using the algorithms to help protect interactions with researchers training a machine learning algorithm. One recent use case explores mixing differential privacy algorithms with federated learning that works with a distributed database.

Is open source a way forward?

Many of the early explorers of differential privacy are working together on an open source project called OpenDP. It aims to build a diverse collection of algorithms that share a common framework and data structure. Users will be able to combine multiple algorithms and build a layered approach to protecting the data.

Another approach concentrates on auditing and fixing any data issues. The Privacera platform’s suite of tools can search through files to identify and mask personally identifiable information (PII). It deploys a collection of machine learning techniques, and the tools are integrated with cloud APIs to simplify deployment across multiple clouds and vendors.

For more than a decade, IBM has been shipping homomorphic encryption. The company offers toolkits for Linux, iOS, and macOS to accommodate developers who want to incorporate homomorphic encryption into their software. The company also offers consulting services and a cloud environment for storing and processing the data securely.

Is there anything privacy-protecting databases can’t do?

The underlying math is often unimpeachable, but there can be many other weak links in the systems. Even if the algorithms don’t have any known weak spot, attackers can sometimes find vulnerabilities.

In some cases, bad actors simply attack the operating system. In others, they go after the communications layer. Some sophisticated attacks combine information from multiple sources to reconstruct the hidden data inside.

But using privacy-protecting techniques on data continues to provide another layer of assurance that can simplify compliance. It can also enable types of collaboration that wouldn’t be possible without it.


