How the DeepSeek-R1 AI model was taught to teach itself to reason | Explained

Posted on September 17, 2025 By admin


The story so far: For many decades, one of the great challenges in artificial intelligence (AI) has been teaching machines to reason. Reasoning goes beyond memorising facts or completing sentences. It’s the ability to follow steps, reflect on mistakes, and adjust strategies until the right answer is found.

Humans use reasoning for everything from solving maths problems to writing computer programmes, from negotiating their daily lives to deciding whom to vote for. Large language models (LLMs) such as GPT-4 or DeepSeek-V3 have surprised scientists by showing signs of reasoning when scaled to large sizes. Another method, called chain-of-thought prompting, where the model is nudged to “think step by step”, has also boosted performance.
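The prompting trick mentioned above needs no retraining at all: it changes only the input text. A minimal sketch, with a hypothetical `with_cot` helper, shows the idea:

```python
def with_cot(prompt: str) -> str:
    """Chain-of-thought prompting: append an instruction that nudges a
    language model to write out intermediate steps before answering.
    The model itself is untouched; only the prompt changes."""
    return prompt + "\nLet's think step by step."

question = "A train travels 60 km in 45 minutes. What is its speed in km/h?"
print(with_cot(question))
```

In practice the augmented prompt is sent to the model in place of the bare question; the extra instruction often elicits longer, more accurate multi-step answers.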

But both these approaches come with limits. Training models to reason usually demands human-made examples: people show an AI model how to solve problems, and the AI learns to copy the method. This is slow, expensive, and introduces human biases. It also caps the AI's creativity because the model can't explore problem-solving methods that humans didn't think of.

In a paper published in Nature on September 18, the DeepSeek-AI team reported that it was able to teach its model, called R1, to reason by asking an ambitious question: what if we allowed the model to teach itself to reason without showing it human examples first? That is, they found that R1 could develop new forms of reasoning using reinforcement learning, a method of trial and error guided only by rewards for correct answers.

What is reinforcement learning?

The team’s aim was to make the model smarter at maths and coding as well as to uncover how reasoning behaviours might emerge naturally when a machine is given the proper incentives.

DeepSeek researchers began with V3 Base, a large language model similar to other state-of-the-art systems. Instead of using the usual supervised fine-tuning, where humans provide the reasoning steps, they applied ‘group relative policy optimisation’, a reinforcement learning method designed for efficiency.
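The core idea of group relative policy optimisation is to sample several answers to the same problem and score each one against the group's own average, rather than training a separate value network. A minimal sketch of that advantage calculation, under the assumption of a simple correct/incorrect reward, might look like this:

```python
import statistics

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages: normalise each sampled answer's reward
    by the mean and standard deviation of its own group, so answers that
    beat the group average are reinforced and the rest are discouraged."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid dividing by zero
    return [(r - mean) / std for r in rewards]

# A group of four sampled answers to one problem; only the third was correct.
print(grpo_advantages([0.0, 0.0, 1.0, 0.0]))
```

The correct answer receives a positive advantage and the wrong ones negative advantages that sum to zero, which is what makes the method efficient: the group itself serves as the baseline.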

In this setup, the model, called R1-Zero at first, was asked to solve mathematical and algorithmic problems. For each attempt, it had to produce two parts: a reasoning process inside `<think>…</think>` tags and a final answer inside `<answer>…</answer>` tags. The only reward came from whether the final answer was correct, judged by rule-based systems like answer keys or code compilers. No one told the model how its reasoning should look.
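A rule-based reward of this kind is easy to sketch: extract whatever sits between the answer tags and compare it with a known answer key, ignoring the reasoning section entirely. The helper below is an illustration, not the team's actual grader:

```python
import re

def rule_based_reward(completion: str, answer_key: str) -> float:
    """Return 1.0 only if the text inside <answer>...</answer> matches
    the known correct answer. The <think> section is never graded, so
    the model is free to reason however it likes."""
    m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if m is None:
        return 0.0  # malformed output earns nothing
    return 1.0 if m.group(1).strip() == answer_key.strip() else 0.0

out = "<think>2 + 2: add the units, carry nothing, so 4.</think><answer>4</answer>"
print(rule_based_reward(out, "4"))  # → 1.0
```

For coding problems the comparison step would instead run the model's programme against test cases, but the principle is the same: only the final result is rewarded.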

Over thousands of training steps, the model learned by trial and error. If an answer was wrong, the path that led there was discouraged; if it was right, the path was reinforced. Importantly, the researchers also tracked how the model’s thinking time, i.e. the number of tokens it used in its reasoning section, changed. Strikingly, the model began writing longer and more reflective reasoning chains on its own, sometimes including phrases like “wait” or “let’s try again”, revealing an ability to self-correct.

Was there human intervention?

To address weaknesses such as poor readability and mixing English with Chinese, the team built R1 from R1-Zero. This process included adding incentives for consistently using one language and supervised fine-tuning with both reasoning and non-reasoning data. The final model thus inherited the raw reasoning power of R1-Zero while also becoming easier to use and safer.

The results were striking. On the American Invitational Mathematics Examination (AIME) 2024, a tough competition that usually only the smartest high-school students attempt, R1-Zero's accuracy jumped from just 15.6% at the start of training to 77.9% by the end. With more tuning, it reached 86.7%, surpassing the average performance of human students.

At a certain stage, R1-Zero began using the word "wait" more often in its reasoning, just as a human might when spotting a mistake. The researchers said this meant the model wasn't blindly following a path but actively rethinking steps when something seemed off. In effect, reinforcement learning had coaxed the AI into behaviours that resembled reflection and verification, both elements of reasoning.

The final R1 model was even stronger: it performed well at maths and coding as well as on benchmarks for general knowledge, question-answering, and instruction-following. Compared to its predecessors, R1 was also more consistent in its choice of language and better aligned with human preferences for helpfulness and safety. When evaluated with frameworks like AlpacaEval 2.0 and Arena-Hard, which test how well a model follows instructions, R1 improved by 25% and 17%, respectively, improvements considered large.

What are the pros and cons of reasoning?

Many large language models, including widely used systems like ChatGPT, often demand large amounts of computational resources during testing. R1, on the other hand, could adapt how much it "thought" depending on the task's difficulty. Simple problems were met with short reasoning chains while harder ones led to longer, more elaborate chains. This dynamic allocation avoided expending computing power on questions that didn't warrant it. However, reinforcement learning itself is energy-intensive.

Taken together, the findings confirm that reinforcement learning alone (with the right design) could produce reasoning behaviours that were previously thought to require human examples. This could change the way we think about how intelligence might grow in artificial systems. For instance, in future, researchers could build verifiers that check answers and let the model figure out its own strategies. If the answer to a maths problem, a computer programme or a factual question can be reliably checked, then reinforcement learning can do the rest. This could speed up progress while reducing human labour and bias.

Indeed, traditional LLM training pipelines bank heavily on large human-labelled datasets — people writing question-answer pairs, reasoning steps, preference judgments, etc. These are expensive and often assembled under exploitative labour conditions. If machines can be taught to reason using reinforcement learning alone, the demand for human-annotated data can shrink, thus also reducing pressure to source cheap labour worldwide. However, the paper also acknowledges that tasks without clear ground truths still rely on human-labelled data for reward models. So human input is not eliminated; only its scope may shrink to areas where no reliable verifier can be built.

A model that learns to reason will also demand better reward signals for open-ended tasks like writing, which is difficult, as well as stronger safeguards as it becomes capable of generating dangerous or manipulative content. In fact, watching a machine develop reflective behaviour (pausing, checking, revising, etc.) raises questions about how far such systems can go. If reasoning emerges from incentives rather than instructions, could creativity or deeper forms of understanding emerge in the same way?

Time will tell — unless DeepSeek-R1 figures it out first.

Published – September 17, 2025 08:30 pm IST


