We consume countless pieces of content every day: news articles, blog posts, bestselling novels, YouTube videos, and even the photos we upload to social media. But what if all of this content is being used as AI training data without its creators' permission?
Recently, news broke that Meta trained its AI on 81.7TB of data. Assuming an e-book is roughly 1MB, that works out to about 81.7 million books, a scale rivaling the largest digital libraries in the world. AI models do need vast amounts of data to advance, but collection this indiscriminate can easily sweep in information that was never meant to be exposed.
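For intuition, here is the back-of-the-envelope arithmetic behind that comparison, a rough sketch assuming decimal units (1TB = 1,000,000MB) and a ballpark 1MB per e-book:

```python
# Rough sanity check of the "81.7 million books" comparison.
# Assumptions: decimal units (1 TB = 1,000,000 MB) and ~1 MB per e-book.
dataset_size_tb = 81.7
mb_per_tb = 1_000_000
avg_ebook_size_mb = 1.0  # real e-books vary widely; this is only a ballpark

equivalent_books = dataset_size_tb * mb_per_tb / avg_ebook_size_mb
print(f"{equivalent_books:,.0f} books")  # -> 81,700,000 books
```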

AI companies argue that their training data consists of publicly available information, but the fact that data can be reached on the internet does not make it safe to ingest. Many websites and online communities prohibit data crawling in their terms of service, yet those restrictions are often ignored for AI training. There have even been reports that some AI models were trained on data gathered from closed forums and internal document-sharing systems, raising concerns that corporate internal reports, research papers, private emails, and even restricted social media posts may have been scraped. This is not merely a copyright issue; it is a serious security risk.
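Terms of service are legal text rather than code, but the closest machine-readable signal a site can give is the Robots Exclusion Protocol (robots.txt). Here is a minimal sketch of how a well-behaved crawler would honor it, using Python's standard library with a hypothetical bot name and example URLs:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical crawler identity and target site, for illustration only.
USER_AGENT = "ExampleAIDataBot"
SITE = "https://example.com"

parser = RobotFileParser()
parser.set_url(f"{SITE}/robots.txt")
parser.read()  # fetch and parse the site's robots.txt rules

for url in (f"{SITE}/blog/public-post", f"{SITE}/forum/members-only-thread"):
    if parser.can_fetch(USER_AGENT, url):
        print(f"Allowed by robots.txt: {url}")
    else:
        print(f"Disallowed; a compliant crawler should skip: {url}")
```

Honoring the file is technically trivial; the concern here is that, in practice, such signals are often ignored.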
There have already been several cases in which AI posed a security threat. In 2023, Samsung Electronics employees pasted internal source code into ChatGPT while using it for work; concerns that this data could end up in training sets prompted Samsung to ban ChatGPT inside the company. That same year, a ChatGPT bug at OpenAI exposed parts of other users' conversation histories. Some researchers warn that AI models may retain user inputs and could potentially reconstruct sensitive data from them. U.S. authorities, including the Department of Justice, have also reportedly examined whether AI models were trained on sensitive government documents, and lawmakers are weighing new rules to limit the scope of AI training.
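One common mitigation after incidents like Samsung's is to screen prompts before they leave the corporate network. Below is a minimal, purely illustrative sketch of such a filter; the patterns are hypothetical and say nothing about what Samsung, OpenAI, or any vendor actually deployed:

```python
import re

# Illustrative patterns a company might block before text reaches an external chatbot.
# Real data-loss-prevention tooling is far more thorough than this sketch.
BLOCKED_PATTERNS = [
    re.compile(r"-----BEGIN (RSA |EC )?PRIVATE KEY-----"),    # private key material
    re.compile(r"\bAKIA[0-9A-Z]{16}\b"),                      # AWS-style access key IDs
    re.compile(r"(?i)(api[_-]?key|secret|password)\s*[:=]"),  # credential assignments
]

def safe_to_submit(prompt: str) -> bool:
    """Return False if the prompt appears to contain secrets or credentials."""
    return not any(pattern.search(prompt) for pattern in BLOCKED_PATTERNS)

if not safe_to_submit("password = 'hunter2'  # from internal config"):
    print("Blocked: this prompt looks like it contains internal credentials.")
```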

Data absorbed by AI is not just stored; it can be misused in new ways. If a model is trained on email data, for example, it can mimic a company's email patterns and generate phishing messages far more convincing than anything seen before.
Furthermore, if AI learns from internal communication data, it can map a company's organizational structure and workflows, raising the risk of targeted cyberattacks. More worrying still, AI could accelerate automated hacking by studying security vulnerabilities, creating new threats that traditional security tools struggle to detect.

To address these issues, countries around the world are discussing stronger rules for AI training data. The European Union is introducing strict requirements through the AI Act to keep AI from being trained on personal data without consent, including legal penalties for models trained on non-public data. The U.S. Federal Trade Commission (FTC) is investigating whether AI models violate copyright and privacy laws and is demanding that major AI firms be transparent about their training data sources.
South Korea and Japan are also reviewing how AI training data should be defined in law, with the aim of requiring companies to disclose their data sources transparently. Balancing industry growth with regulation, however, remains a major challenge.
AI will continue to evolve, but a discussion about how far AI training should be allowed to go is essential. If training data is restricted, technological progress could slow; if regulations are relaxed, privacy and corporate security may face severe threats. We now face a critical decision.
Should we open up data for the sake of AI advancement, or do we need strong regulations to protect security and privacy?