Generative AI Models Have Been Trained Using Content Collected from the Web
- Training the generative AI models and systems that are now available has required huge quantities of data. A large percentage of that data was scraped or otherwise captured from the Web without asking creators' permission and without providing them with compensation (Kim, 2025).
- In the United States, authors, musicians, media publishers, and others have filed lawsuits claiming copyright infringement against generative AI companies including OpenAI, Anthropic, Microsoft, Perplexity, Meta, Suno, and Udio. In most of these cases, the AI companies are arguing that their ingestion of the content falls under fair use and should be considered transformative use (Madigan, 2025). Some of the argument for fair use may rest on the issue of how generative AI (genAI) systems operate. While they ingest data in one form, such as text, that data is typically converted into machine-readable data and processed further before it is output again in text, audio, or video format. For more on genAI systems and how they work, see Laubheimer (2024) and Zewe (2023).
- The U.S. Copyright Office has been preparing a three-part report on Copyright and AI. Part 1 deals with protecting people from "unauthorized digital replicas" (e.g., deepfakes) and was published in July 2024. Part 2, released on January 29, 2025, addresses the "copyrightability" of works that include AI-generated content: some works that were created with AI assistance can be copyrighted, depending on the level of human input, decision-making, assemblage, etc. Part 3 will tackle the issue of data ingestion from online sources for use in training genAI systems. See the U.S. Copyright Office website "Copyright and Artificial Intelligence."
Further Reading on Copyright Issues with AI