As the training of artificial intelligence (AI) models continues to raise ethical questions, the Dataset Providers Alliance (DPA) aims to bring fairness to the AI landscape by empowering creators and rights holders to decide whether or not to license their material for training purposes.
The DPA was formed through the joint efforts of seven AI licensing companies: Rightsify, Global Copyright Exchange (GCX), vAIsual, Calliope Networks, ado, Datarade, and Pixta AI. It expects at least five new members to join in the fall. Since its founding last summer, the alliance has published a position paper detailing its views on AI-related issues, particularly unlicensed data scraping.
Unlicensed data scraping has dogged earlier generative AI tools, which were trained largely on datasets publicly available on the Internet. As content creators cry foul over the unauthorized use of their material and the demand for additional training data grows, the DPA is stepping forward to champion fair, standardized licensing agreements and to ensure sufficient datasets for further AI training.
Opt-in system as the ethical route
Existing arrangements between rights holders and AI companies range from no licensing at all to opt-out systems. The problem with opt-out systems is that they place the burden on rights holders to keep their material out of training datasets, and in some cases owners are never even aware that an opt-out exists.
The DPA believes that an opt-in system is a more ethical and moral route, as AI companies would have to get the consent of the creators before they can use their data. “Artists and creators should be on board. Selling publicly available datasets is one way to get sued and have no credibility,” said Rightsify CEO Alex Bestall.
Nevertheless, Shayne Longpre, the head of the Data Provenance Initiative, a volunteer group that examines AI datasets, warned that convincing companies to get on board with the opt-in standard could be difficult. “Under this regime, you’re either going to be data-starved, or you’re going to pay a lot,” Longpre explained. “It could be that only a few players, large tech companies, can afford to license all that data.”
Standardized compensation
The alliance also emphasizes that licensing should follow a “free market” approach, in which data owners and AI companies negotiate directly rather than operating under government-mandated terms. It further proposes five compensation structures to ensure that creators receive appropriate returns.
According to the paper, the proposed compensation structures include a subscription-based model, “usage-based licensing,” and “outcome-based” licensing, which could be applied to any type of content.
“Looking to standardize compensation structures is potentially a good thing. The Dataset Providers Alliance is in a very good position to put terms out there,” stated Bill Rosenblatt, a technologist who studies copyright. He added that an easy, frictionless process would be a valuable incentive for AI companies to adopt licensing in the mainstream.
Synthetic data for future AI training
Additionally, the DPA backs several uses of synthetic data, provided that the real data used to create it is properly licensed. It argues that synthetic data will soon constitute the majority of training datasets for AI models. The alliance also expects synthetic data to be regularly evaluated to “mitigate biases and ethical issues.”
Whether these proposals take hold will largely depend on the participation of the AI giants. “There are standards emerging for how to license data ethically. But not enough AI companies are adopting them,” said Ed Newton-Rex, an executive at the ethical AI nonprofit Fairly Trained.