OpenAI launches a web crawler called GPTBot

To address the privacy and copyright disputes surrounding the extraction of data from the public web, OpenAI has announced a web crawler named GPTBot. The crawler is meant to gather the data needed for artificial intelligence training in a more transparent manner.

OpenAI stated that GPTBot will explicitly identify itself as a crawler through a user-agent token and a full user-agent string. The publicly available web data it collects will be used only to improve future artificial intelligence models, and content that requires payment for access will be excluded.
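
Because the crawler announces itself in the User-Agent header, a site can recognize its requests with a simple string check. The following is a minimal Python sketch of that idea, not OpenAI's own method. The function name `is_gptbot` is hypothetical, and the sample string is an assumed example of what the crawler might send; the exact value should be checked against OpenAI's published documentation.

```python
# Assumed example of a GPTBot User-Agent header (verify against OpenAI's docs).
SAMPLE_GPTBOT_UA = (
    "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; "
    "compatible; GPTBot/1.0; +https://openai.com/gptbot)"
)

# An ordinary browser-style header, used here only for comparison.
SAMPLE_BROWSER_UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"


def is_gptbot(user_agent: str) -> bool:
    """Return True if the header contains the "GPTBot" token (case-insensitive)."""
    return "gptbot" in user_agent.lower()


if __name__ == "__main__":
    for header in (SAMPLE_GPTBOT_UA, SAMPLE_BROWSER_UA):
        print(is_gptbot(header), header)
```

A header check like this relies on the crawler being honest about its identity, since any client can send any User-Agent value.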

However, web administrators who do not want their content captured by GPTBot, for instance because their pages contain large amounts of personal information, need only add a rule for the “GPTBot” user agent to the website’s robots.txt file. The same file can also specify which parts of the site GPTBot is allowed to capture. In addition, OpenAI publishes the IP address ranges that GPTBot uses, so administrators can block the crawler directly by restricting access from those ranges.
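
As a rough sketch of both opt-out routes, the Python snippet below parses two sample robots.txt files with the standard library's `urllib.robotparser` and checks a client address against a blocklist with `ipaddress`. The robots.txt contents, the `example.com` URLs, the directory names, and the helper names are assumptions made for illustration. The `192.0.2.0/24` range is a reserved documentation range, not one of OpenAI's actual ranges.

```python
import ipaddress
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks GPTBot from the whole site.
BLOCK_ALL = """\
User-agent: GPTBot
Disallow: /
"""

# Hypothetical robots.txt that lets GPTBot into one directory only.
PARTIAL = """\
User-agent: GPTBot
Allow: /public/
Disallow: /private/
"""

# Placeholder range (RFC 5737 documentation block), NOT OpenAI's real ranges.
PLACEHOLDER_RANGES = [ipaddress.ip_network("192.0.2.0/24")]


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check whether robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def is_blocked_ip(client_ip: str, ranges=PLACEHOLDER_RANGES) -> bool:
    """Check whether client_ip falls inside any of the blocked ranges."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in ranges)


if __name__ == "__main__":
    print(is_allowed(BLOCK_ALL, "GPTBot", "https://example.com/page"))
    print(is_allowed(PARTIAL, "GPTBot", "https://example.com/public/a.html"))
    print(is_allowed(PARTIAL, "GPTBot", "https://example.com/private/a.html"))
    print(is_blocked_ip("192.0.2.10"))
```

The robots.txt route depends on the crawler choosing to honor the file, while an IP block is enforced by the server itself, which is why a site might want both.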

In the past, many websites were already configured to prevent search engines from extracting their data through crawling. As artificial intelligence technology continues to grow, AI training increasingly relies on large amounts of public data. This reliance has heightened web administrators’ concerns that their content will be used for AI training, which could reduce the value of their information or affect privacy and security. As a result, there are calls for AI technology providers to access web data in a reasonable manner.