Apple has published a technical paper detailing the models that it developed to power Apple Intelligence, the range of generative AI features headed to iOS, macOS and iPadOS over the next few months.
In the paper, Apple pushes back against accusations that it took an ethically questionable approach to training some of its models, reiterating that it didn’t use private user data and drew on a combination of publicly available and licensed data for Apple Intelligence.
“[The] pre-training data set consists of … data we have licensed from publishers, curated publicly available open-sourced datasets and publicly available information crawled by our web crawler, Applebot,” Apple writes in the paper. “Given our focus on protecting user privacy, we note that no private Apple user data is included in the data mixture.”
In July, Proof News reported that Apple used a data set called The Pile, which contains subtitles from hundreds of thousands of YouTube videos, to train a family of models designed for on-device processing. Many YouTube creators whose subtitles were swept up in The Pile weren’t aware of and didn’t consent to this; Apple later released a statement saying that it didn’t intend to use those models to power any AI features in its products.
The technical paper, which peels back the curtain on the Apple Foundation Models (AFM) that Apple first revealed at WWDC 2024 in June, emphasizes that the training data for the AFM models was sourced in a “responsible” way, or responsible by Apple’s definition, at least.
The AFM models’ training data includes publicly available web data as well as licensed data from undisclosed publishers. According to The New York Times, Apple reached out to several publishers toward the end of 2023, including NBC, Condé Nast and IAC, about multi-year deals worth at least $50 million to train models on publishers’ news archives. Apple’s AFM models were also trained on open source code hosted on GitHub, specifically Swift, Python, C, Objective-C, C++, JavaScript, Java and Go code.
Training models on code without permission, even open code, is a point of contention among developers. Some open source codebases aren’t licensed or don’t allow for AI training in their terms of use, some developers argue. But Apple says that it “license-filtered” for code to try to include only repositories with minimal usage restrictions, like those under an MIT, ISC or Apache license.
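To make the idea of license filtering concrete, here is a minimal Python sketch. It is not Apple’s pipeline, which the paper does not describe at this level of detail. The repository records, the `license` field and the allowlist of license identifiers are all assumptions made for illustration.

```python
# Hypothetical allowlist of permissive licenses, written as SPDX-style
# identifiers. The article names MIT, ISC and Apache only as examples.
PERMISSIVE_LICENSES = {"MIT", "ISC", "Apache-2.0"}


def license_filter(repos):
    """Keep only repositories whose declared license is on the allowlist.

    Each repo is a dict with hypothetical "name" and "license" keys.
    Repositories with no declared license are dropped.
    """
    return [repo for repo in repos if repo.get("license") in PERMISSIVE_LICENSES]


# Made-up sample records, for illustration only.
sample_repos = [
    {"name": "example/parser", "license": "MIT"},
    {"name": "example/kernel-module", "license": "GPL-3.0-only"},
    {"name": "example/unlicensed-tool", "license": None},
    {"name": "example/http-client", "license": "Apache-2.0"},
]

kept = [repo["name"] for repo in license_filter(sample_repos)]
```

A filter like this trusts whatever license a repository declares, which is one reason developers dispute how much such filtering settles.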
To boost the AFM models’ mathematics skills, Apple specifically included in the training set math questions and answers from webpages, math forums, blogs, tutorials and seminars, according to the paper. The company also tapped “high-quality, publicly-available” data sets (which the paper doesn’t name) with “licenses that permit use for training … models,” filtered to remove sensitive information.
All told, the training data set for the AFM models weighs in at about 6.3 trillion tokens. (Tokens are bite-sized pieces of data that are generally easier for generative AI models to ingest.) For comparison, that’s less than half the number of tokens (15 trillion) that Meta used to train its flagship text-generating model, Llama 3.1 405B.
Apple sourced additional data, including data from human feedback and synthetic data, to fine-tune the AFM models and attempt to mitigate any undesirable behaviors, like spouting toxicity.
“Our models have been created with the purpose of helping users do everyday activities across their Apple products, grounded in Apple’s values, and rooted in our responsible AI principles at every stage,” the company says.
There’s no smoking gun or shocking insight in the paper, and that’s by careful design. Rarely are papers like these very revealing, owing to competitive pressures but also because disclosing too much could land companies in legal trouble.
Some companies training models by scraping public web data assert that their practice is protected by the fair use doctrine. But it’s a matter that’s very much up for debate and the subject of a growing number of lawsuits.
Apple notes in the paper that it allows webmasters to block its crawler from scraping their data. But that leaves individual creators in a lurch. What’s an artist to do if, for example, their portfolio is hosted on a site that refuses to block Apple’s data scraping?
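The article doesn’t say how the blocking works, but crawlers are conventionally blocked through a site’s robots.txt file. The sketch below assumes that mechanism and uses Python’s standard `urllib.robotparser` to check whether a given set of rules would admit a crawler identifying itself as `Applebot`. The robots.txt contents and the example.com URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that a site owner (not an individual artist)
# might publish. The rules are illustrative only.
ROBOTS_TXT = """\
User-agent: Applebot
Disallow: /portfolio/

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Applebot is shut out of the portfolio section but not the rest of the site.
applebot_portfolio = is_allowed(ROBOTS_TXT, "Applebot", "https://example.com/portfolio/piece.png")
applebot_blog = is_allowed(ROBOTS_TXT, "Applebot", "https://example.com/blog/post")
other_portfolio = is_allowed(ROBOTS_TXT, "OtherBot", "https://example.com/portfolio/piece.png")
```

Under that assumption the file belongs to whoever runs the site, so the artist in the example above can’t add the rule themselves.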
Courtroom battles will decide the fate of generative AI models and the way they’re trained. For now, though, Apple’s trying to position itself as an ethical player while avoiding unwanted legal scrutiny.


