Self-organizing systems: what, how, and why?

Mike's Notes

Carlos Gershenson's paper about self-organizing systems was recently published in npj Complexity, a Nature Portfolio journal. Carlos is the editor of Complexity Digest.

I removed the references from the original paper because importing them into Google Blogger is messy. Please refer to the original paper for the missing citations.

Resources

References


Repository

  • Home > Ajabbi Research > Library > Subscriptions > Complexity Digest
  • Home > Ajabbi Research > Library > Thermodynamics

Last Updated

05/04/2025

Self-organizing systems: what, how, and why?

By: Carlos Gershenson
Nature > npj complexity: 25/03/2025

Abstract

I present a personal account of self-organizing systems, framing relevant questions to better understand self-organization, information, complexity, and emergence. With this aim, I start with a notion and examples of self-organizing systems (what?), continue with their properties and related concepts (how?), and close with applications (why?) in physics, chemistry, biology, collective behavior, ecology, communication networks, robotics, artificial intelligence, linguistics, social science, urbanism, philosophy, and engineering.

What are self-organizing systems?

"Being ill defined is a feature common to all important concepts.” —Benoît Mandelbrot

I will not attempt to define a “self-organizing system”, as it involves the cybernetic problem of defining “system”, the informational problem of defining “organization”, and the ontological problem of defining “self”. Still, there are plenty of examples of systems that we can usefully call self-organizing: flocks of birds, schools of fish, swarms of insects, herds of cattle, and some crowds of people. In these animal examples, collective behavior is a product of the interactions of individuals, not determined by a leader or an external signal. There are also several examples from non-living systems, such as vortexes, crystallization, self-assembly, and pattern formation in general. In these cases, elements of a system also interact to achieve a global pattern.

Self-organization or similar concepts have been present since antiquity (see Section 3.12), so the idea itself is not new. Nevertheless, we still lack the proper conceptual framework to understand it properly. The term “self-organizing system” was coined by W. Ross Ashby in the early days of cybernetics. Ashby’s purpose was to describe deterministic machines that could change their own organization. Ever since, the concept has been used in a broad range of disciplines, including statistical mechanics, supramolecular chemistry, computer science, and artificial life.

There is an unavoidable subjectivity when speaking about self-organizing systems, as the same system can be described as self-organizing or not (see Section 2.1). Stafford Beer gave the following example: an ice cream at room temperature will thaw. This will increase its temperature and entropy, so it would be “self-disorganizing”. However, if we focus on the function of an ice cream, which is to be eaten, it would be “self-organizing”, because it would approach a pleasant temperature and consistency for eating, improving its “function”. Ashby also mentioned that one just needs to call the attractor of a dynamical system “organized”, and then almost any system will be self-organizing.

So, the question should not be whether a system is self-organizing, but rather (being pragmatic): when is it useful to describe a system as self-organizing? The answer will slowly unfold throughout this paper, but in short, self-organization is a useful description when we are interested in describing systems at multiple scales, and in understanding how these scales affect each other. For example, collective motion and cyber-physical systems can benefit from such a description, compared to a single-scale narrative/model. This is common with complexity, as interactions can generate novel information that is not present in the initial or boundary conditions, limiting predictability.

So rather than a definition, we can do with a notion: a system can be described as self-organizing when its elements interact to produce a global function or behavior. This is in contrast with centralized systems, where a single element or a few elements “control” the rest, or with simply distributed systems, where a global problem can be divided (reduced) so that each element does its part, but there is no need to interact or to integrate the elementary solutions. Thus, describing a system as self-organizing is useful when we want to relate individual behaviors and interactions to global patterns or functions. If we can describe a system fully (for our particular purposes) at a single scale, then self-organization could perhaps be identified, but it would be superfluous (not useful). And the “self” implies that the “control” comes from within the system, rather than from an external signal or controller that explicitly tells the elements what to do.

For example, we can decide to call a society “self-organizing” if we are interested in how individual interactions lead to the formation of fashion, ideologies, opinions, norms, and laws; but at the same time, how the emerging global properties affect the behavior of the individuals. If we were interested in an aggregate property of a population, e.g., its average height, then calling the group of individuals “self-organizing” would not give any extra information, and thus would not be useful.

It should be stressed that self-organization is not a property of systems per se. It is a way of describing systems, i.e., a narrative.

How can self-organizing systems be measured?

"It is the function of science to discover the existence of a general reign of order in nature and to find the causes governing this order. And this refers in equal measure to the relations of man — social and political — and to the entire universe as a whole.” —Dmitri Mendeleev

Even though self-organization has been described intuitively since antiquity — the seeds of the narrative were present — the proper tools for studying it became available only recently: computers. Since self-organizing systems require the explicit description of elements and interactions, our brains, blackboards, and notebooks are too limited to handle the number of variables required to study their properties. It was only through the relatively recent development of information technology that we were able to study the richness of self-organization, just as we were unable to study the microcosmos before microscopes and the macrocosmos before telescopes.

Information

Computation can be generally described as the transformation of information, although Alan Turing formally defined computable numbers with the purpose of proving limits of formal systems (in particular, Hilbert’s decision problem). In the same environment where the first digital computers were built in the mid XXth century, Claude Shannon defined information to quantify its transmission, showing that information could be reliably transmitted through unreliable communication channels. As it turned out, Shannon’s information H is mathematically equivalent to Boltzmann-Gibbs entropy:

$$H = -K \sum_{i=1}^{n} p_i \log p_i , \qquad (1)$$

where K is a positive constant and p_i is the probability of receiving symbol i from a finite alphabet of size n. This dimensionless measure is maximal for a homogeneous probability distribution, and minimal when only one symbol has probability p = 1. In binary, we have only two symbols (n = 2), and information would be minimal with a string of only ones or only zeroes (‘1111…’ or ‘0000…’). This implies that having more bits will not tell us anything new, because we already know what the next bits will be (assuming the probability distribution will not change). With a random-like string, such as a sequence of coin flips (‘11010001011011001010…’), information is maximal, because no matter how much previous information we have (even full knowledge of the probability distribution), we will not be able to predict the next bit better than chance.
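
To make this concrete, here is a minimal Python sketch that computes H for the two kinds of strings mentioned above, taking K = 1 and a base-2 logarithm so the result is in bits (the function name and example strings are just illustrative choices):

```python
import math
from collections import Counter

def shannon_entropy(s, base=2):
    """Shannon's H for a string of symbols, with K = 1."""
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * math.log(c / n, base) for c in counts.values())

print(shannon_entropy("11111111111111111111"))  # 0.0 bits: the next bit is fully predictable
print(shannon_entropy("11010001011011001010"))  # 1.0 bit per symbol: maximally unpredictable
```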

In parallel, Norbert Wiener — one of the founders of cybernetics — proposed an alternative measure of information, which was basically the same as Shannon’s, but without the minus sign. Wiener’s information measured what one knows already, so it is minimal when we have a random string (homogeneous probability distribution) because all the information we already have is “useless” (to predict the next symbol), and maximal when we have a single symbol repeating (maximally biased probability distribution), because the information we have allows us to predict exactly the next symbol. Nevertheless, Shannon’s information is the one that everyone has used, and we will do the same.

Shannon’s information is also known as Shannon’s entropy, which can also be used as a measure of “disorder”. We already saw that it is maximal for random strings and minimal for highly ordered strings. Then, we can use the negative of Shannon’s information (which would be Wiener’s information) as a measure of organization. If the organization is a result of internal dynamics, then we can also use this measure for self-organization.

Nevertheless, just like with many measures, the interpretation depends on how the observer performs the measurement. Figure 1 shows how the same system, divided into four microstates or two macrostates (with probabilities represented as shades of gray) can increase its entropy/information (become more homogeneous) or decrease it, depending on how it is observed.

Fig. 1: The same system, observed at different levels or with different coarse grainings can be said to be disorganizing (entropy increasing) or organizing (entropy decreasing), for arbitrary initial and final states.

Probabilities of the system being in a state (a1, a2, b1, and b2 at the lower level, which can be aggregated in different ways at a higher level) are represented as shades of gray, so one can observe which configurations are more homogeneous (i.e., with higher entropy): if there is a high contrast in different states (such as between A' and B' in their initial state), then this implies more organization (less entropy), while similar shades (as between A' and B' in their final state) imply less organization (more entropy).
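
The point of the figure can also be reproduced numerically. The sketch below uses illustrative microstate probabilities (not the figure’s actual shades) to show that the same underlying distribution looks more or less organized depending on how the microstates are grouped into macrostates:

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Illustrative microstate probabilities (not the figure's actual values).
micro = {"a1": 0.4, "a2": 0.4, "b1": 0.1, "b2": 0.1}

# Two different coarse-grainings of the same system into two macrostates.
contrast = [micro["a1"] + micro["a2"], micro["b1"] + micro["b2"]]  # {a1, a2} vs {b1, b2}
uniform  = [micro["a1"] + micro["b1"], micro["a2"] + micro["b2"]]  # {a1, b1} vs {a2, b2}

print(round(entropy(micro.values()), 3))  # 1.722 bits at the lower level
print(round(entropy(contrast), 3))        # 0.722 bits: high contrast, looks organized
print(round(entropy(uniform), 3))         # 1.0 bit: homogeneous, looks disorganized
```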

Still, the fact that self-organization is partially subjective does not mean that it cannot be useful. We just have to be aware that a shared description and interpretation should be agreed upon.

Complexity

Self-organizing systems are intimately related to complex systems. Again, the question is not so much whether a system is self-organizing or complex, but when is it useful to describe it as such. This is because most systems can be described as complex or not, depending on our context and purposes.

Etymologically, complexity comes from the Latin plexus, which can be translated as entwined. We can say that complex systems are those where interactions make it difficult to separate the components and study them in isolation, because of their interdependence. These interactions can generate novel information that limits predictability in an inherent way, as it is not present in the initial or boundary conditions. In other words, there is no shortcut to the future: we have to go through all the intermediate steps, as interactions partially determine the future states of the system.

For example, markets tend to be unpredictable because different agents make decisions depending on what they think other agents will decide. But since it is not possible to know what everyone will decide in advance, the predictability of markets is rather limited.

Complex systems can be confused with complicated or chaotic systems. Perhaps they are easier to distinguish by considering their opposites: complicated systems are the opposite of easy ones, chaotic systems (sensitive to initial conditions) are the opposite of robust ones, while complex systems are the opposite of separable ones.

Given the above notion of self-organizing systems, all of them would also be complex systems, but not necessarily vice versa. This is because interactions are an essential aspect of self-organizing systems, which would make them complex by definition. However, we could have a description of a complex system whose elements interact but do not produce a global pattern or function of interest within the timeframe of interest. In that case, the narrative of complexity would be useful, but not that of self-organization. Nevertheless, understanding complexity should be essential for the study of self-organization.

Emergence

One of the most relevant and controversial properties of complex systems is emergence. It could be seen as problematic because, last century, some people described emergent properties as “surprising”. Emergence would then be a measure of our ignorance, and it would be reduced once we understood the mechanisms behind emergent properties. Also, there are different flavors of emergence, some easier to study and accept than others. But in general, emergence can be described as information that is present at one scale and not at another scale.

For example, we can have full knowledge of the properties of carbon atoms. But if we focus only on the atoms, i.e., without interactions, we will not be able to know whether they are part of graphite, diamond, graphene, buckyballs, etc. (all composed only of carbon atoms), which have drastically different macroscopic properties. Thus, we cannot derive the conductivity, transparency, or density of these materials by looking only at the atomic properties of carbon. The difference lies precisely in how the atoms are organized, i.e., how they interact.

If emergence can be described in terms of information, Shannon’s measure can be used (understanding that we are measuring only the information that is absent from another scale). Thus, emergence would be the opposite of self-organization. This might seem contradictory, as emergence and self-organization are usually both present in complex systems. But if we take each to its extreme, we can see that maximum emergence (information) occurs when there is (quasi)randomness, so no organization. Maximum (self-)organization occurs when entropy is minimal (no new information, and thus no emergence). Because of this, complexity can be seen as a balance between emergence and self-organization.
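
This balance can be operationalized with the information measures above. The sketch below follows measures Gershenson and colleagues have proposed elsewhere (emergence as normalized entropy, self-organization as its complement, and complexity as their product); the specific normalization and the 4·E·S scaling are conventions of that proposal, and the example distributions are illustrative:

```python
import math

def emergence(probs):
    """Normalized Shannon entropy in [0, 1]: how much new information is produced."""
    n = len(probs)
    h = -sum(p * math.log2(p) for p in probs if p > 0)
    return h / math.log2(n) if n > 1 else 0.0

def self_organization(probs):
    return 1.0 - emergence(probs)

def complexity(probs):
    e = emergence(probs)
    return 4.0 * e * (1.0 - e)  # maximal when emergence and self-organization balance

for probs in ([1.0, 0.0], [0.89, 0.11], [0.5, 0.5]):
    print(probs, round(emergence(probs), 2),
          round(self_organization(probs), 2), round(complexity(probs), 2))
# Fully ordered and fully random distributions both give zero complexity;
# the intermediate one is close to the maximum.
```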

Why should we use self-organizing systems?

"It is as though a puzzle could be put together simply by shaking its pieces.” —Christian De Duve

Self-organization can be used to build adaptive systems. This is useful for non-stationary problems, i.e., those that change in time. Since interactions can generate novel information, complexity often leads to non-stationarity. Thus, when a problem changes, the elements of a self-organizing system can adapt through their interactions. Designers then do not need to specify the problem precisely beforehand, or how it will change, but just to define or regulate interactions to achieve a desired goal.

For example, if we want to improve passenger flow in public transportation systems, we cannot really change the elements of the system (the passengers). Still, we can change how they interact. In 2016, we successfully implemented such a change to regulate boarding and alighting in the Mexico City metro. In a similar way, we cannot change the teachers in an education system, but we can change their interactions to improve learning. We cannot change politicians, but we can regulate their interactions to reduce corruption and improve efficiency. We cannot change businesspeople, but we can control their interactions to promote sustainable economic growth.

There have been many other examples of applications of self-organization in different fields; the following is only a partial enumeration.

Physics

The Industrial Revolution led to the formalization of thermodynamics in the XIXth century. The second law of thermodynamics states that an isolated system will tend toward thermal equilibrium. In other words, it loses organization, as heterogeneities become homogeneous and entropy is eventually maximized. Still, non-equilibrium thermodynamics has studied how open systems can self-organize.

Lasers can be seen as self-organized light, which Hermann Haken used as an inspiration to propose the study of synergetics, which precisely studies self-organization in open systems far from thermodynamic equilibrium and is related to phase transitions, where criticality is found.

Self-organized criticality (SOC) was proposed to explain why power laws, scale-free-like distributions, and fractals are so prevalent in nature. SOC was illustrated with the sandpile model, where grains accumulate and lead to avalanches with a scale-free (critical) size distribution. Similarly, self-organization has been used to describe granular media.
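
A toy version of the sandpile (the Bak-Tang-Wiesenfeld model) fits in a few lines of Python; the grid size, number of grains, and toppling threshold below are illustrative choices:

```python
import random

def sandpile_avalanches(size=20, grains=5000, threshold=4):
    """Minimal Bak-Tang-Wiesenfeld sandpile: drop grains, topple unstable cells,
    and record the size of each avalanche (number of topplings)."""
    grid = [[0] * size for _ in range(size)]
    sizes = []
    for _ in range(grains):
        x, y = random.randrange(size), random.randrange(size)
        grid[x][y] += 1
        avalanche = 0
        unstable = [(x, y)]
        while unstable:
            i, j = unstable.pop()
            if grid[i][j] < threshold:
                continue
            grid[i][j] -= threshold
            avalanche += 1
            if grid[i][j] >= threshold:
                unstable.append((i, j))  # a cell can topple more than once
            for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < size and 0 <= nj < size:  # grains falling off the edge are lost
                    grid[ni][nj] += 1
                    if grid[ni][nj] >= threshold:
                        unstable.append((ni, nj))
        sizes.append(avalanche)
    return sizes

sizes = sandpile_avalanches()
print(max(sizes), sorted(sizes)[-10:])  # many tiny avalanches, a few very large ones
```

Plotting a histogram of the recorded sizes on log-log axes should show the heavy-tailed, approximately scale-free distribution that characterizes SOC.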

Generalizing principles of granular media, self-organization can be used to describe and design “optimal” configurations in biological, social, and economic systems.

Chemistry

Around 1950, Boris P. Belousov was interested in a simplified version of the Krebs cycle. He found that a solution of citric acid in water with acidified bromate and yellow ceric ions produced an oscillating reaction. His attempts to publish his findings were rejected, with the argument that they violated the second law of thermodynamics (which only applies to systems at equilibrium, while this system is far from equilibrium). In the 1960s, Anatol M. Zhabotinsky began working on this reaction, and only in the late 1960s and 1970s did the Belousov-Zhabotinsky reaction become known outside the Soviet Union. Since then, many chemical systems far from equilibrium have been studied. Some have been characterized as self-organizing, because they are able to use free energy to increase their organization.

More generally, self-organization has been used to describe pattern formation, which includes self-assembly.

Molecules are basically atoms joined by covalent bonds. Supramolecular chemistry studies chemical structures formed by weaker forces (van der Waals interactions, hydrogen bonds, electrostatic charges), and these can also be described in terms of self-organization.

Biology

The study of form in biology (morphogenesis) is far from new, but far from complete.

Alan Turing was one of the first to describe morphogenesis with differential equations. Morphogenesis can be seen as pattern formation with local activation and long-range inhibition (skins, scales), or as fractal growth (capillaries, neurons). These processes are more or less well understood. Still, matters become more sophisticated for embryogenesis and regeneration, where many open questions remain.
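
A minimal numerical sketch of this kind of pattern formation: the code below integrates a one-dimensional Gray-Scott reaction-diffusion system (one well-known member of the Turing-pattern family, not Turing's original equations), with illustrative parameter values:

```python
import numpy as np

# 1-D Gray-Scott reaction-diffusion sketch; parameters are illustrative choices.
n, steps = 200, 10000
Du, Dv, F, k = 0.16, 0.08, 0.035, 0.065   # diffusion rates, feed rate, kill rate
u, v = np.ones(n), np.zeros(n)
u[n // 2 - 5 : n // 2 + 5] = 0.5          # a small local perturbation seeds the pattern
v[n // 2 - 5 : n // 2 + 5] = 0.5

def laplacian(a):
    return np.roll(a, 1) + np.roll(a, -1) - 2 * a   # periodic boundary conditions

for _ in range(steps):
    uvv = u * v * v
    du = Du * laplacian(u) - uvv + F * (1 - u)
    dv = Dv * laplacian(v) + uvv - (F + k) * v
    u += du
    v += dv

print(np.round(v[::10], 2))   # spatial structure rather than a uniform profile
```

Starting from a nearly uniform state with a small perturbation, the concentrations should develop spatial structure instead of relaxing back to homogeneity.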

Humberto Maturana and Francisco Varela proposed autopoiesis (self-production) to describe the emergence of living systems from complex chemistry. Autopoiesis can be seen as a special case of self-organization (although Maturana disagreed), because molecules self-organize to produce membranes and metabolism. Moreover, it can be argued that living systems also need information handling, self-replication, and evolvability.

There are further examples of self-organization in biology, including firefly synchronization, ant foraging, and collective behavior.

Collective Behavior

Groups of agents can produce global patterns or behaviors through local interactions. Craig Reynolds presented a simple model of boids, where agents follow three simple rules: separation (don’t crash), alignment (head towards the average heading of neighbors), and cohesion (move towards the average position of neighbors). Varying its parameters, this simple model produces dynamic patterns similar to those of flocks, schools, herds, and swarms. It was used to animate bats and penguins in the 1992 film Batman Returns and contributed to earning Reynolds an Oscar in 1998.

A flock of boids self-organizes even with only the alignment rule and added noise. It has been shown that novel properties emerge as the number of boids increases.
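
A minimal sketch of this alignment-only variant, in the spirit of the Vicsek model (the number of boids, noise level, speed, and interaction radius below are illustrative choices):

```python
import math, random

N, L, r, eta, v, steps = 100, 10.0, 1.0, 0.3, 0.05, 500
x = [random.uniform(0, L) for _ in range(N)]
y = [random.uniform(0, L) for _ in range(N)]
theta = [random.uniform(-math.pi, math.pi) for _ in range(N)]

def order_parameter(angles):
    """Average normalized velocity: 1 = fully aligned flock, ~0 = disordered."""
    return math.hypot(sum(math.cos(a) for a in angles),
                      sum(math.sin(a) for a in angles)) / len(angles)

for _ in range(steps):
    new_theta = []
    for i in range(N):
        # Average heading of neighbors within radius r (including the boid itself).
        sx = sy = 0.0
        for j in range(N):
            if (x[i] - x[j]) ** 2 + (y[i] - y[j]) ** 2 <= r * r:
                sx += math.cos(theta[j])
                sy += math.sin(theta[j])
        noise = random.uniform(-eta / 2, eta / 2)
        new_theta.append(math.atan2(sy, sx) + noise)
    theta = new_theta
    for i in range(N):
        x[i] = (x[i] + v * math.cos(theta[i])) % L
        y[i] = (y[i] + v * math.sin(theta[i])) % L

print(order_parameter(theta))  # should be close to 1 for this low noise level
```

For low noise the order parameter approaches 1, meaning the headings have self-organized into a common direction without any leader.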

Slightly more sophisticated models have been used to describe more precisely animal collective behavior.

Furthermore, similar models and rules have been used to study the self-organization of active matter and robots (see below).

Ecology

Species self-organize to produce ecological patterns. These include trophic networks (who eats whom), mutualistic networks (cooperating species), and host-parasite networks.

At the biosphere level, ecosystems also self-organize. This is a central aspect of the Gaia hypothesis, which holds that our planet self-regulates the conditions that allow life to thrive.

Self-organization can also be useful for studying how ecosystems can be robust, resilient, or antifragile.

Communication networks

Self-organization has been useful in telecommunication networks, as it is desirable to have the ability to self-reconfigure based on changing demands. Also, having local rules define global functions makes these networks robust to failures of, or attacks on, central nodes: if a path is not responsive, an alternative is sought. These principles have been used in Internet protocols, peer-to-peer networks, cellular networks, and more.
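
The rerouting idea can be illustrated with a toy graph and plain breadth-first search; real protocols are distributed and far more elaborate, and the small network below is hypothetical:

```python
from collections import deque

# A toy "seek an alternative path" sketch over a hypothetical network.
network = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}

def find_path(graph, src, dst, failed=frozenset()):
    """Breadth-first search that simply ignores failed nodes."""
    queue, seen = deque([[src]]), {src}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == dst:
            return path
        for nxt in graph[node]:
            if nxt not in seen and nxt not in failed:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

print(find_path(network, "A", "E"))                 # e.g. ['A', 'B', 'D', 'E']
print(find_path(network, "A", "E", failed={"B"}))   # reroutes via C: ['A', 'C', 'D', 'E']
```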

Robotics

There has been a broad variety of self-organizing robots: terrestrial, aerial, aquatic, and hybrid (for a review see ref. 26).

A common aspect of self-organizing robots is that there is no leader, and the collective function or pattern is the result of local interactions. Some have been inspired by the collective behavior of animals, and their potential applications are vast.

Artificial Intelligence

As mentioned in the first section of this paper, the study of self-organizing systems originated in cybernetics, which had a strong influence on, and overlap with, the early days of artificial intelligence. Claude Shannon, William Grey Walter, Warren McCulloch, and Walter Pitts contributed to both fields.

If brains can be described as self-organizing, it is no surprise that certain flavors of artificial neural networks have also been described as self-organizing. Independently of the terminology, adjustments to local weights between artificial neurons lead to a reduction of error on the network’s task.
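
The canonical example is Kohonen's self-organizing map. Here is a minimal sketch of a one-dimensional map whose units adapt through purely local weight updates; the sizes, learning rates, and input distribution are illustrative choices:

```python
import math, random

random.seed(0)
n_units, epochs = 10, 2000
weights = [[random.random(), random.random()] for _ in range(n_units)]

def dist2(a, b):
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

for t in range(epochs):
    x = [random.random(), random.random()]            # a random point in the unit square
    bmu = min(range(n_units), key=lambda i: dist2(weights[i], x))  # best-matching unit
    lr = 0.5 * (1 - t / epochs)                        # learning rate decays over time
    sigma = 1 + 2 * (1 - t / epochs)                   # neighborhood width also shrinks
    for i in range(n_units):
        h = math.exp(-((i - bmu) ** 2) / (2 * sigma ** 2))  # closer units move more
        weights[i][0] += lr * h * (x[0] - weights[i][0])
        weights[i][1] += lr * h * (x[1] - weights[i][1])

# After training, neighboring units tend to have neighboring weight vectors,
# so the map has organized itself to cover the input space.
print([[round(w, 2) for w in unit] for unit in weights])
```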

Even though their interpretation is still controversial, large language models have been useful in multiple domains. Whether describing them as self-organizing would bring any benefit remains to be seen.

Linguistics

The statistical study of language became popular after Zipf. Different explanations have been put forward for the statistical regularities found across languages, and in even more general contexts.

Naturally, some of these explanations focus on the evolution of language. It has been shown that a shared vocabulary and grammar can evolve through self-organization: individuals interact locally, leading the population to converge on a shared language. This is useful not only for understanding language evolution, but also for building adaptive artificial systems. Similar mechanisms can be used in other social systems, e.g., to reach consensus.
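
A minimal sketch in the spirit of the naming game: agents with no initial vocabulary invent and exchange words for a single object until the whole population converges on one shared word (the population size and number of interactions are illustrative choices):

```python
import random

random.seed(1)
N, steps = 50, 20000
vocab = [set() for _ in range(N)]   # each agent's set of candidate words
invented = 0

for _ in range(steps):
    speaker, hearer = random.sample(range(N), 2)
    if not vocab[speaker]:
        invented += 1
        vocab[speaker].add(f"word{invented}")   # invent a new word when needed
    word = random.choice(sorted(vocab[speaker]))
    if word in vocab[hearer]:
        vocab[speaker] = {word}                 # success: both drop competing words
        vocab[hearer] = {word}
    else:
        vocab[hearer].add(word)                 # failure: the hearer learns the word

shared = set.union(*vocab)
print(len(shared), shared if len(shared) <= 3 else "...")  # typically a single shared word
```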

Social Science

Individuals in a society interact in different ways. These interactions can lead to social properties, such as norms, fashions, and expectations. In turn, these social properties can guide, constrain, and promote behaviors and states of individuals.

Computers have allowed the simulation of social systems, including systematic explorations of abstract models. Combined with an increase in data availability, computational social science has been increasingly adopted by social scientists. The understanding and implications of self-organization are naturally relevant to this field.

Urbanism

Urbanism is closely related to the scientific study of cities.

For example, city growth can be modeled as a self-organizing process. Similar to the metro case study mentioned above, self-organization has been shown to coordinate traffic lights efficiently and adaptively, and it is promising for regulating interactions among autonomous vehicles.

More generally, urban systems tend to be non-stationary, as conditions change constantly. Thus, self-organization offers a proven alternative for designing urban systems that adapt as fast as their conditions change.

Philosophy

Concepts similar to self-organization can be traced to ancient Greece in Heraclitus and Aristotle and also to Buddhist philosophy.

There has been a long debate about the relationship between mechanistic principles and the purpose of systems. This question was at the origins of cybernetics. It has been argued that self-organization can be used to explain teleology, in accordance with Kant’s attempt from the late XVIIIth century, as purpose can also be described in terms of organization.

Also, self-organization is related to downward causation: when higher-level properties cause changes in lower-level elements. This is still debated, along with other philosophical questions related to self-organization.

Engineering

There have been several examples of self-organization applied to different areas of engineering apart from those already mentioned, such as power grids, computing, sensor networks, supply networks and production systems, bureaucracies, and more.

In general, self-organization has been a promising approach to build adaptive systems, as mentioned above. It might seem counterintuitive to speak about controlling self-organization, since we might think that self-organizing systems are difficult to regulate because of a certain autonomy of their components. Still, we can speak about a balance between control and independence, in what has been called “guided self-organization”.

Conclusions

"We can never be right, we can only be sure when we are wrong” —Richard Feynman

There are many open questions related to the scientific study of self-organizing systems. Even though their potential is promising, they are far from being commonly used to address non-stationary problems. Could it be because of a lack of literacy in concepts related to complex systems? Might there be a conceptual or technical obstacle? Do we need further theories? Independently of the answers, these questions are worth exploring.

For example, we have yet to explore the relationship between self-organization and antifragility: the property of systems that benefit from perturbations. Self-organization seems to be correlated with antifragility, but why and how remain to be investigated. In a similar vein, a systematic exploration of the “slower is faster” effect might be useful for better understanding self-organizing systems, and vice versa.

Many problems and challenges we are facing — climate change, migration, urban growth, social polarization, etc. — are clearly non-stationary. It is not certain that self-organization will improve the situation in all of them. But it is almost certain that with the current tools we have, we will not be able to make much more progress (otherwise we would have made it already). It would be imprudent not to make the effort to use the narrative of self-organization, even if it only slightly improves the situation in one of these challenges.

Chaos and the Three Body Problem

Mike's Notes

This is an excellent video presentation of the three-body problem by theoretical astrophysicist and mathematician Eliza Diggins. It was a Communicating Science Project—Astronomy 3070—at the University of Utah in 2022.

Wikipedia - "In physics, specifically classical mechanics, the three-body problem is to take the initial positions and velocities (or momenta) of three point masses that orbit each other in space and calculate their subsequent trajectories using Newton's laws of motion and Newton's law of universal gravitation.

Unlike the two-body problem, the three-body problem has no general closed-form solution, meaning no equation always solves it. When three bodies orbit each other, the resulting dynamical system is chaotic for most initial conditions. Because there are no solvable equations for most three-body systems, the only way to predict the motions of the bodies is to estimate them using numerical methods.

The three-body problem is a special case of the n-body problem. Historically, the first specific three-body problem to receive extended study involved the Earth, the Moon, and the Sun. In an extended modern sense, a three-body problem is any problem in classical or quantum mechanics that models the motion of three particles."
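
As a toy illustration of why numerical estimation is the only general option, the sketch below integrates three equal masses under (softened) Newtonian gravity with a simple velocity-Verlet scheme, then repeats the run after nudging one initial position by one part in a million; all masses, positions, velocities, and step sizes are illustrative choices, not taken from the video:

```python
import math

def accelerations(pos, soft=0.05):
    """Gravitational accelerations for three unit masses (G = 1), softened so the
    toy integration stays well behaved during close encounters."""
    acc = [[0.0, 0.0] for _ in pos]
    for i in range(3):
        for j in range(3):
            if i == j:
                continue
            dx, dy = pos[j][0] - pos[i][0], pos[j][1] - pos[i][1]
            r3 = (dx * dx + dy * dy + soft) ** 1.5
            acc[i][0] += dx / r3
            acc[i][1] += dy / r3
    return acc

def integrate(pos, vel, dt=0.001, steps=20000):
    pos = [p[:] for p in pos]
    vel = [v[:] for v in vel]
    acc = accelerations(pos)
    for _ in range(steps):                     # velocity-Verlet steps
        for i in range(3):
            for k in range(2):
                pos[i][k] += vel[i][k] * dt + 0.5 * acc[i][k] * dt * dt
        new_acc = accelerations(pos)
        for i in range(3):
            for k in range(2):
                vel[i][k] += 0.5 * (acc[i][k] + new_acc[i][k]) * dt
        acc = new_acc
    return pos

start = [[-1.0, 0.0], [1.0, 0.0], [0.0, 0.5]]
vels = [[0.0, -0.5], [0.0, 0.5], [0.3, 0.0]]
a = integrate(start, vels)
b = integrate([[-1.0 + 1e-6, 0.0], [1.0, 0.0], [0.0, 0.5]], vels)
print(math.dist(a[0], b[0]))   # the tiny nudge typically grows by orders of magnitude
```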

Resources

References


Repository

  • Home > Ajabbi Research > Complex Systems

Last Updated

03/04/2025

Chaos and the Three-Body Problem

By: Eliza Diggins
YouTube: 16/11/2022

Eliza Diggins is a theoretical astrophysicist and mathematician at the University of Utah. Her research is shared between the Department of Physics and Astronomy, where she studies galaxy cluster dynamics and gravitational theory; and the School of Dentistry, where she works on mathematical models of trade-mediated pathogens in complex global trade networks.

On the Biology of a Large Language Model

Mike's Notes

This fascinating article reports on the internal circuits of an LLM. I have only reposted the introduction. It is an excellent read. Transformer Circuits has many valuable reports.

Resources

References


Repository

  • Home > 

Last Updated

03/04/2025

On the Biology of a Large Language Model

By: Jack Lindsey, Wes Gurnee, Emmanuel Ameisen, Brian Chen, Adam Pearce, Nicholas L. Turner, Craig Citro, David Abrahams, Shan Carter, Basil Hosmer, Jonathan Marcus, Michael Sklar, Adly Templeton, Trenton Bricken, Callum McDougall, Hoagy Cunningham, Thomas Henighan, Adam Jermyn, Andy Jones, Andrew Persic, Zhenyi Qi, T. Ben Thompson, Sam Zimmerman, Kelley Rivoire, Thomas Conerly, Chris Olah, Joshua Batson
Transformer Circuits: 27/03/2025

We investigate the internal mechanisms used by Claude 3.5 Haiku — Anthropic's lightweight production model — in a variety of contexts, using our circuit tracing methodology.

Large language models display impressive capabilities. However, for the most part, the mechanisms by which they do so are unknown. The black-box nature of models is increasingly unsatisfactory as they advance in intelligence and are deployed in a growing number of applications. Our goal is to reverse engineer how these models work on the inside, so we may better understand them and assess their fitness for purpose.

The challenges we face in understanding language models resemble those faced by biologists. Living organisms are complex systems which have been sculpted by billions of years of evolution. While the basic principles of evolution are straightforward, the biological mechanisms it produces are spectacularly intricate. Likewise, while language models are generated by simple, human-designed training algorithms, the mechanisms born of these algorithms appear to be quite complex.

Progress in biology is often driven by new tools. The development of the microscope allowed scientists to see cells for the first time, revealing a new world of structures invisible to the naked eye. In recent years, many research groups have made exciting progress on tools for probing the insides of language models (e.g. [1, 2, 3, 4, 5]). These methods have uncovered representations of interpretable concepts – “features” – embedded within models’ internal activity. Just as cells form the building blocks of biological systems, we hypothesize that features form the basic units of computation inside models.

However, identifying these building blocks is not sufficient to understand the model; we need to know how they interact. In our companion paper, Circuit Tracing: Revealing Computational Graphs in Language Models, we build on recent work (e.g. [5, 6, 7, 8]) to introduce a new set of tools for identifying features and mapping connections between them – analogous to neuroscientists producing a “wiring diagram” of the brain. We rely heavily on a tool we call attribution graphs, which allow us to partially trace the chain of intermediate steps that a model uses to transform a specific input prompt into an output response. Attribution graphs generate hypotheses about the mechanisms used by the model, which we test and refine through follow-up perturbation experiments.

In this paper, we focus on applying attribution graphs to study a particular language model – Claude 3.5 Haiku, released in October 2024, which serves as Anthropic’s lightweight production model as of this writing. We investigate a wide range of phenomena. Many of these have been explored before (see § 16 Related Work), but our methods are able to offer additional insight, in the context of a frontier model:

  • Introductory Example: Multi-step Reasoning. We present a simple example where the model performs “two-hop” reasoning “in its head” to identify that “the capital of the state containing Dallas” is “Austin.” We can see and manipulate an internal step where the model represents “Texas”.
  • Planning in Poems. We discover that the model plans its outputs ahead of time when writing lines of poetry. Before beginning to write each line, the model identifies potential rhyming words that could appear at the end. These preselected rhyming options then shape how the model constructs the entire line.
  • Multilingual Circuits. We find the model uses a mixture of language-specific and abstract, language-independent circuits. The language-independent circuits are more prominent in Claude 3.5 Haiku than in a smaller, less capable model.
  • Addition. We highlight cases where the same addition circuitry generalizes between very different contexts.
  • Medical Diagnoses. We show an example in which the model identifies candidate diagnoses based on reported symptoms, and uses these to inform follow-up questions about additional symptoms that could corroborate the diagnosis – all “in its head,” without writing down its steps.
  • Entity Recognition and Hallucinations. We uncover circuit mechanisms that allow the model to distinguish between familiar and unfamiliar entities, which determine whether it elects to answer a factual question or profess ignorance. “Misfires” of this circuit can cause hallucinations.
  • Refusal of Harmful Requests. We find evidence that the model constructs a general-purpose “harmful requests” feature during finetuning, aggregated from features representing specific harmful requests learned during pretraining.
  • An Analysis of a Jailbreak. We investigate an attack which works by first tricking the model into starting to give dangerous instructions “without realizing it,” after which it continues to do so due to pressure to adhere to syntactic and grammatical rules.
  • Chain-of-thought Faithfulness. We explore the faithfulness of chain-of-thought reasoning to the model’s actual mechanisms. We are able to distinguish between cases where the model genuinely performs the steps it says it is performing, cases where it makes up its reasoning without regard for truth, and cases where it works backwards from a human-provided clue so that its “reasoning” will end up at the human-suggested answer.
  • A Model with a Hidden Goal. We also apply our method to a variant of the model that has been finetuned to pursue a secret goal: exploiting “bugs” in its training process. While the model avoids revealing its goal when asked, our method identifies mechanisms involved in pursuing the goal. Interestingly, these mechanisms are embedded within the model’s representation of its “Assistant” persona.

To read the rest, see the original article on Transformer Circuits.

What is an evolutionary algorithm?

Mike's Notes

I gave Gemini 2.5 Pro the prompt: "What is an evolutionary algorithm?"

Resources

References


Repository

  • Home > 

Last Updated

02/04/2025

What is an evolutionary algorithm?

By: Gemini 2.5 Pro
2/03/2025

Okay, let's break down what an Evolutionary Algorithm (EA) is.

In simple terms:

An Evolutionary Algorithm is a type of problem-solving technique inspired by biological evolution, specifically Charles Darwin's theory of natural selection ("survival of the fittest"). It's used primarily for optimization (finding the best possible solution) and search problems, especially when the problem is complex, poorly understood, or has a vast number of potential solutions.

The Core Idea:

Instead of trying just one solution, EAs work with a population of potential solutions (called individuals or chromosomes). Over many generations (iterations), this population "evolves" towards better solutions by applying processes analogous to biological evolution:

  1. Selection: Better solutions (those that perform well according to a defined fitness function) are more likely to be chosen to "reproduce." This is like natural selection where fitter organisms survive and pass on their genes.
  2. Reproduction/Variation: New solutions (offspring) are created from the selected parents using operators like:
    1. Crossover (Recombination): Combines parts of two (or more) parent solutions to create new offspring, mixing their characteristics (like sexual reproduction).
    2. Mutation: Introduces small, random changes to an individual solution (like genetic mutation), potentially creating novel traits.
  3. Replacement: The new offspring often replace less fit individuals in the population, ensuring the overall fitness of the population tends to improve over time.

How it Works (The Basic Loop):

  1. Initialization: Create an initial population of random candidate solutions.
  2. Evaluation: Calculate the "fitness" (how good it is) of each solution in the population using a predefined fitness function tailored to the specific problem.
  3. Selection: Select parent solutions based on their fitness (higher fitness means higher chance of selection).
  4. Reproduction: Apply crossover and mutation operators to the selected parents to create new offspring solutions.
  5. Replacement: Form the next generation's population, often by combining the best parents and the new offspring, potentially discarding the worst solutions.
  6. Termination: Check if a stopping condition is met (e.g., a maximum number of generations reached, a sufficiently good solution found, or no significant improvement observed). If not, go back to step 2 (Evaluation) with the new population.
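
A minimal sketch of this loop in Python: a small genetic algorithm maximizing the number of ones in a bit string (the classic "OneMax" toy problem). The population size, mutation rate, and other parameters are illustrative choices, and the numbered comments map each part of the code back to the steps above.

```python
import random

random.seed(42)
GENES, POP, GENERATIONS, MUT_RATE = 30, 40, 60, 0.02

def fitness(ind):
    return sum(ind)                                   # 2. Evaluation: count the ones

def tournament(pop):
    return max(random.sample(pop, 3), key=fitness)    # 3. Selection (tournament of 3)

def crossover(a, b):
    cut = random.randrange(1, GENES)                  # 4. One-point crossover
    return a[:cut] + b[cut:]

def mutate(ind):
    return [1 - g if random.random() < MUT_RATE else g for g in ind]   # 4. Mutation

population = [[random.randint(0, 1) for _ in range(GENES)] for _ in range(POP)]  # 1. Init

for gen in range(GENERATIONS):
    offspring = [mutate(crossover(tournament(population), tournament(population)))
                 for _ in range(POP)]
    # 5. Replacement: keep the single best parent (elitism), fill the rest with offspring.
    best = max(population, key=fitness)
    population = [best] + offspring[: POP - 1]

# 6. Termination: here we simply stop after a fixed number of generations.
print(fitness(max(population, key=fitness)))          # usually reaches the maximum of 30
```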

Key Components:

  • Population: A set of candidate solutions.
  • Individual/Chromosome: A single candidate solution, often represented as a string, vector, tree, or other structure.
  • Gene: A component part of an individual's representation.
  • Fitness Function: A way to measure how good a solution is. This is crucial and problem-specific.
  • Selection Mechanism: The method used to choose parents (e.g., roulette wheel selection, tournament selection).
  • Genetic Operators: Crossover and Mutation rules.
  • Generation: One iteration of the main loop (evaluation, selection, reproduction, replacement).

Why Use Evolutionary Algorithms?

EAs are particularly useful for:

  • Complex Optimization Problems: When the search space is vast, non-linear, or has many local optima.
  • Problems with No Easy Mathematical Solution: When traditional calculus-based methods fail or are too difficult to apply.
  • Black-Box Problems: Where you can evaluate a solution's quality but don't know the underlying structure of the problem well.
  • Multi-Objective Optimization: Finding solutions that balance several conflicting goals.
  • Creative Applications: Generating designs, art, music, or game strategies.

Types of Evolutionary Algorithms:

While the core idea is the same, there are different "flavors" of EAs, including:

  • Genetic Algorithms (GAs): The most common type, often using binary string representations.
  • Evolution Strategies (ES): Primarily used for real-valued parameter optimization, often emphasizing mutation.
  • Genetic Programming (GP): Evolves computer programs or tree-like structures.
  • Differential Evolution (DE): A popular and often efficient method for numerical optimization.

In Summary:

Evolutionary Algorithms are powerful, nature-inspired search and optimization techniques that mimic the process of evolution. They iteratively refine a population of candidate solutions using selection, crossover, and mutation, making them well-suited for tackling complex problems where traditional methods might struggle. They are a key part of the broader field of Computational Intelligence and Artificial Intelligence.

Data Modeling: Definition, Types, and Challenges

Mike's Notes

More on data contracts.

Resources

References


Repository

  • Home > 

Last Updated

01/04/2025

Data Modeling: Definition, Types, and Challenges

By: Mark Freeman
Gable.ai Blog: 7/12/2024

Class is in session as we break down the fundamentals of data modeling, its different forms, and why it's often a source of contention in the data space.

Serving as a (very real, fully accredited, we swear) 101-level collegiate course, this blog article aims to lay a solid, real-world-based foundation regarding the concept and practice of data modeling. 

As such, the article will include a summary of data modeling’s historical prevalence in data engineering, its more recent dissolution, a definition of the concept, and different methods of use. 

We’ll conclude by exploring why any attempt to discuss the benefits of one type over another consistently equates to booting a hornet’s nest.

This foundation will serve as a gateway for newer data engineers, function as a juicy target of ridicule for the more seasoned, and will act to foster an appreciation for the role data contracts will play in data modeling’s future.

Course schedule:

  • Data modeling: An overview
  • Data modeling defined
  • Common types of data models
  • Causes of controversy in the data modeling space
  • Restoring the model of balance
  • Suggested next steps

1. Data modeling: An overview

At one point in the not-too-distant past, data modeling was the cornerstone of any data management strategy. Due to the technical and business practices that were predominant at the end of the 20th century, data modeling at its zenith placed a strong emphasis on structured, well-defined models.

However, in the late 2000s, the emergence of major cloud service providers like Google Cloud Platform, Microsoft Azure, and Amazon Web Services (AWS) enabled cloud computing to gain traction within business organizations. 

By the end of that same decade, the benefits of scalable, on-demand computing resources led to a proper surge within business organizations, which then led to the proliferation of what is now commonly referred to as the modern data stack—a group of cloud-based tools and technologies used for the collection, storage, processing, and analysis of data. 

Compared to the at-the-push-of-a-button benefits available on demand, data modeling was then seen by a growing number of practitioners as rigid and inflexible. Data modeling takes time. It can get complicated. The costs and overheads associated with the process reflected this. Perhaps most damaging at the time, it became easy to frame data modeling as a bottleneck—dead weight hampering the speed and flexibility of modern data management. 

However, this overemphasis on speed and flexibility, combined with the underutilization of data modeling, wasn’t sustainable. Though there is no specific “breaking point” to point to, by the mid-2010s these issues were increasingly attributed to the diminution of data modeling. 

While far from exhaustive, increasingly common factors helped to precipitate this recalibration in the data space:

  • Data governance challenges: The abundance of cloud-based data storage and processing fueled an explosive increase in the data sources and repositories the average organization had access to. This sudden abundance, in turn, made maintaining data quality, security, and compliance far more demanding, complicating the governance process.
  • Data quality issues: The fevered rate at which cloud-based solutions were adopted led to the neglect of data modeling and proper data architecture, resulting in inconsistencies, data quality issues, and difficulties in data integration. 
  • Lack of standardization: While cloud environments freed teams to use various tools and platforms, the consistency of data management practices degraded, making it harder to ensure consistency and interoperability across an organization.
  • Scalability and performance issues: Without proper data modeling, it became difficult to optimize systems for performance and scalability. Bottlenecks and reduced system efficiency resulted as data volumes grew.
  • Security and compliance risks: Rapid cloud adoption without adequate attention to data modeling and architecture can expose organizations to security vulnerabilities and compliance risks, especially when dealing with sensitive or regulated data.
  • Difficulties in extracting value from data: Without a well-thought-out data model, organizations struggle to extract meaningful insights from data. Inevitably, these organizations found that simply having data in the cloud did not guarantee it was inherently usable or valuable for decision-making.

2. Data modeling defined

Data modeling is the practice or process of creating a conceptual representation of data objects and the relation between them. Data modeling is comparable to architecture, in that the process blueprints how data is stored, managed, and used within and between systems and databases.

In essence, there are three key components of data modeling:

  1. Entities: Entities represent the real-world objects or concepts an organization wants to understand better. Examples of data modeling entities include products, employees, and customers.
  2. Attributes: These are the characteristics or properties of the entities being modeled. Attributes provide details that are used to describe and distinguish instances of an entity—product names, prices, customer names, phone numbers, etc.
  3. Relationships: The connections between entities in a data model are called relationships. They can be one-to-one, many-to-many, or one-to-many. Each entity is represented in a relational database in the typical data modeling process. While each entity has a unique identity, it can have multiple instances. 
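
As a simple illustration, the three components can be expressed directly in code. In the hypothetical sketch below, Customer and Product are entities, their fields are attributes, and Order captures a relationship between them (all names are illustrative, not a standard):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Customer:                    # entity
    customer_id: int               # attribute (unique identity)
    name: str
    phone: str

@dataclass
class Product:                     # entity
    product_id: int
    name: str
    price: float

@dataclass
class Order:                       # relationship: one customer, many products
    order_id: int
    customer: Customer
    products: List[Product] = field(default_factory=list)

alice = Customer(1, "Alice", "555-0100")
order = Order(100, alice, [Product(10, "Notebook", 4.50), Product(11, "Pen", 1.20)])
print(order.customer.name, [p.name for p in order.products])
```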

Traditionally, the role of data modeling primarily focused on designing databases for transactional systems and normalizing data to reduce redundancy, improving database performance. The process itself mainly involved working with structured data in relational databases.

Modern data modeling is highly varied by comparison. And while its practice and process have evolved beyond some of the qualities viewed negatively in the past, others are now increasingly accepted as trade-offs to be balanced.

Data modeling today caters to a wide range of data storage and processing systems, ranging from traditional relational database management systems (RDBMS) to data lakes and NoSQL databases. Data models now facilitate data integration. They can support advanced analytics, data science initiatives, and predictive modeling. Modern models emphasize agility and scalability to quickly adapt to shifting business requirements.

As such, data modeling now also supports efforts in the data space to democratize data, helping to make data more understandable and accessible to a wide range of users.

3. Common types of data models

There are four main types of data models: conceptual, logical, physical, and dimensional. This is true when the goal is to simplify the categorization of data models.

Depending on the business needs of an organization, however, more than these initial four may be considered and utilized. We note the former simply because of the confusion this can sometimes cause within the data space.

Conceptual data models

The purpose of conceptual data models is to establish a macro, business-focused view of an organization’s data structure. Conceptual models are often leveraged in the planning phase of database design or a database management system.

In these cases, a data architect or modeler may work with business stakeholders and analysts to identify relevant entities, attributes, and relationships using unified modeling language (UML) and entity-relationship diagrams (ERDs).

Logical data modeling

Logical data models work to provide a detailed view of organizational data that is independent of specific technologies and physical considerations. By doing so, logical models are free to focus on capturing business requirements and rules without being biased by technical constraints. As a result, they can provide a clearer understanding of data from a business perspective.

The ability of less technical stakeholders to more easily understand logical data models also makes them a particularly useful tool for communicating with technical teams.

Physical data modeling

Alternately, physical data modeling aims to capture and represent the detailed structure and design of a database, taking into account the specific features and constraints of a chosen database management system (DBMS), as well as business requirements for performance, access methods, and storage.

For this reason, the entities database administrators and developers will focus on include physical aspects of a database—indexes, keys, partitioning, stored procedures and triggers, etc.

Dimensional data modeling

For business intelligence and data warehousing applications, dimensional data modeling is often used. This is because a dimensional model employs an efficient, user-friendly, flexible structure that organizes data into fact tables and dimensions to support fast querying and reporting.

Due to this, dimensional data models are well suited to the complex querying, analysis, and reporting needs of such applications.

Object-oriented data modeling

Based on the principles of object-oriented programming, object-oriented data modeling represents data as objects instead of entities. The objects in this type of data modeling encapsulate both data and behavior. This object-oriented approach is key, making object-oriented models highly useful in scenarios where data structures must reflect real-world objects and their relationships.

Common examples of these scenarios include ecommerce and inventory management systems, banking and financial systems, customer relationship management (CRM) systems, and educational software.

Data vault modeling

As the word “vault” implies, data vault modeling is used in data warehousing, but also in business intelligence. Both data warehousing and BI projects benefit from the historical data preservation, scalability, flexibility, and integration capabilities that data vault models provide.

In theory, this makes data vault modeling a potential tool for any organization that needs to integrate data from multiple sources while maintaining data history and lineage (e.g., healthcare organizations, government agencies, and manufacturing companies).

Normalized data modeling

This type of data modeling focuses on two things—reducing data redundancy and improving data integrity. This can be crucial for transactional systems where data integrity and consistency are of prime importance. Normalized models are easier to maintain and update, while they also prevent data anomalies like inconsistencies and duplication.

De-normalized data modeling

Alternately, de-normalized data models involve the intentional introduction of redundancies into a dataset in order to improve performance. Through de-normalized modeling, related data can be stored in the same table or document. This reduces the need for computationally expensive join operations, which can slow down query performance.

Because of how they function, de-normalized data models also harmonize with the principles of NoSQL databases, which prioritize flexibility, scalability, and performance.
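
A toy sketch of the trade-off between the two approaches, using hypothetical records: the normalized form stores customer details once and looks them up by key, while the denormalized form copies them into every order to avoid the lookup.

```python
# Normalized: customer details live in one place; orders reference them by key.
customers = {1: {"name": "Alice", "city": "Wellington"}}
orders_normalized = [
    {"order_id": 100, "customer_id": 1, "total": 5.70},
    {"order_id": 101, "customer_id": 1, "total": 9.99},
]

def report_normalized():
    # The "join": look up each order's customer by its key.
    return [(o["order_id"], customers[o["customer_id"]]["city"], o["total"])
            for o in orders_normalized]

# Denormalized: the city is copied into each order, trading redundancy for faster reads.
orders_denormalized = [
    {"order_id": 100, "customer_city": "Wellington", "total": 5.70},
    {"order_id": 101, "customer_city": "Wellington", "total": 9.99},
]

def report_denormalized():
    return [(o["order_id"], o["customer_city"], o["total"]) for o in orders_denormalized]

print(report_normalized())
print(report_denormalized())   # same answer; updating Alice's city now needs two writes
```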

4. Causes of controversy in the data modeling space

Data scholars agree that discussions around data modeling function similarly to a hornet’s nest in nature—both tend to cause massive amounts of pain when stumbled into. While unfortunate for the stumbler, it helps to understand that, in both cases, the damage results from an attempt to defend what one holds dear.

For hornets, driven to protect the nest’s existing and developing queens, the aggression results from a combination of their innate programming, alarm pheromones, and the instinct to attack in numbers in order to intimidate and dissuade larger foes.

For data practitioners, however, aggressively defending one’s beliefs about the process and practice of data modeling is usually motivated by one or more of the following factors:

  • Diverse perspectives: Data modeling is a field that intersects with numerous disciplines, including data science, software engineering, database design, data analytics, and business intelligence. While sharing varying degrees of overlap, these disparate professional backgrounds act as frames through which the views of “effective data modeling” become wildly divergent in the data space.
  • Complexity and trade-offs: Additionally, data modeling tends to involve near-endless tradeoffs between competing priorities. These tradeoffs include speed vs. governance, normalization vs. performance, and structure vs. flexibility—each with passionate advocates on both sides of the aisle.
  • Organizational context: The “right” data model in one organization may not be the same in another, even when operating within the same industry. Differing business rules and goals, data requirements, schema, information systems, and data maturity all but guarantee that there will never be one true data modeling technique or process.
  • Subjectivity in design: Data modeling itself can be quite subjective. Like many design disciplines, there are often multiple ways to model a given dataset. And data modelers themselves often have legitimate reasons for championing one approach over another. This subjectivity is part of why many find the challenges of data modeling so fulfilling.
  • Evolving technologies: Despite the order and logic practitioners attempt to bring to the table, the exponentially rapid evolution of data technologies—from traditional relational databases to NoSQL, low and no-code platforms, and big data—necessitates approaches to data modeling to continuously diversify.
  • Fluctuating best practices: Due to the ever-evolving modeling landscape, its related best practices invariably need to change. Techniques once considered sacrosanct can find themselves outdated, furthering debates about what the current best approach may be at any given time.
  • Emotional Investment: Data practitioners tend to be curious, persistent, analytical thinkers who benefit from a high attention to detail. As such, those who practice data modeling (or cross paths with it) tend to invest a great deal of intellectual and emotional capital in their work. Occasionally, this can create an environment where critiques or suggestions for alternate approaches can either be delivered as a personal attack, or taken as such. 

5. Restoring the model of balance

The good news is that the tension between the impact of data modeling and the convenience of the modern data stack can be navigated. Organizations looking to strike this balance should consider employing the following:

  1. Adopt a hybrid approach: Consider using structured data modeling for core business entities that require consistency and stability above all. In areas that call for more agility and flexibility, employ modern data technologies that enable rapid iteration.
  2. Harmonize flexibility with standardization: Building on a hybrid approach, look to standardize core data elements and processes. At the same time, allow for flexibility in areas where rapid change can be expected. Embrace constant balancing and rebalancing of the strengths of structured data modeling and the modern data stack. 
  3. Use iterative data modeling: Instead of insisting on extensive upfront data modeling, try an iterative approach. Start with a basic model, then evolve it as needed. Iteration can produce the best of both worlds, maintaining a structured approach while responding to requirements as they change over time.
  4. Leverage data virtualization: Data virtualization provides a helpful layer of abstraction that allows for integrating diverse data sources without extensive modeling. In some organizations, this approach maintains agility while ensuring data is effectively understood and used. 
  5. Focus on metadata management: Bridging the gap between structured modeling and agility usually involves a (sometimes renewed) focus on effective metadata management. Robust metadata curation enables organizational flexibility while maintaining clarity about data structures and relationships.
  6. Emphasize data governance: When individuals are empowered to enact consistent data governance, clear policies and standards for data quality, usage, and security help the data environment remain as agile as possible.
  7. Enable self-service data access: When implemented with appropriate controls, self-service data access supports agility by allowing users to access data as needed while still operating within the framework of the established data model.
  8. Continuous collaboration: Make sure to foster continuous collaboration between your data architects, engineers, and business users. While the passionate data modeling discussions will still take place from time to time, making cross-disciplinary collaboration an important part of the culture helps keep modeling efforts and business needs aligned. 
  9. Implement data contracts: Finally, employ data contracts to provide structured agreements on data formats and interfaces (a minimal sketch follows this list). Their ability to foster communication between data producers and consumers promotes balance just as the other tactics here do, and it also allows that balance to scale.
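
Data contracts can take many forms. The sketch below is a minimal, hypothetical Python illustration of the idea: producer and consumer agree on field names and types, and violations are surfaced before bad data spreads downstream. The field names, types, and rules here are invented for this example and are not drawn from any particular tool.

```python
# Minimal, hypothetical sketch of a data contract enforced in code.
# Field names, types, and rules are illustrative only.
from dataclasses import dataclass

@dataclass(frozen=True)
class FieldRule:
    name: str
    dtype: type
    required: bool = True

# The "contract": a structured agreement between data producer and consumer.
ORDER_CONTRACT = [
    FieldRule("order_id", str),
    FieldRule("customer_id", str),
    FieldRule("amount", float),
    FieldRule("coupon_code", str, required=False),
]

def validate(record, contract):
    """Return a list of contract violations for one record (empty means valid)."""
    violations = []
    for rule in contract:
        if rule.name not in record:
            if rule.required:
                violations.append(f"missing required field: {rule.name}")
            continue
        if not isinstance(record[rule.name], rule.dtype):
            violations.append(f"field {rule.name} should be of type {rule.dtype.__name__}")
    return violations

# A producer ships a record with a stringly-typed amount; the contract flags it.
print(validate({"order_id": "A-1", "customer_id": "C-9", "amount": "12.50"}, ORDER_CONTRACT))
# ['field amount should be of type float']
```

In practice, checks like this would run automatically in a pipeline or CI step, so disagreements between producers and consumers surface as explicit violations rather than silent downstream breakage.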

6. Suggested next steps

As is now abundantly clear, treating data as a product is paramount for any organization looking to succeed in an overwhelmingly data-dependent world. Data contracts are the best way to guarantee the quality of data before it even enters an organization. 

For this reason, we’re offering a transformative approach to retaining, developing, and operationalizing data contracts. Make sure to join our product waitlist to be among the first to experience the benefits of Gable.ai.

Three Hundred Years Later, a Tool from Isaac Newton Gets an Update

Mike's Notes

A fascinating article published this week in Quanta Magazine about a mathematical technique used to solve complex problems.

Resources

References


Repository

  • Home > Ajabbi Research > Library > Subscriptions > Quanta Magazine
  • Home > Ajabbi Research > Library > Mathematics

Last Updated

31/03/2025

Three Hundred Years Later, a Tool from Isaac Newton Gets an Update

By: Kevin Hartnett
Quanta Magazine: 24/03/2025

Kevin Hartnett was the senior writer at Quanta Magazine covering mathematics and computer science. His work has been collected in multiple volumes of the “Best Writing on Mathematics” series. From 2013 to 2016 he wrote “Brainiac,” a weekly column for the Boston Globe’s Ideas section.

A simple, widely used mathematical technique can finally be applied to boundlessly complex problems.

Every day, researchers search for optimal solutions. They might want to figure out where to build a major airline hub. Or to determine how to maximize return while minimizing risk in an investment portfolio. Or to develop self-driving cars that can distinguish between traffic lights and stop signs.

Mathematically, these problems get translated into a search for the minimum values of functions. But in all these scenarios, the functions are too complicated to assess directly. Researchers have to approximate the minimal values instead.

It turns out that one of the best ways to do this is by using an algorithm that Isaac Newton developed over 300 years ago. This algorithm is fairly simple. It’s a little like searching, blindfolded, for the lowest point in an unfamiliar landscape. As you put one foot in front of the other, the only information you need is whether you’re going uphill or downhill, and whether the grade is increasing or decreasing. Using that information, you can get a good approximation of the minimum relatively quickly.

Although enormously powerful — centuries later, Newton’s method is still crucial for solving present-day problems in logistics, finance, computer vision and even pure math — it also has a significant shortcoming. It doesn’t work well on all functions. So mathematicians have continued to study the technique, figuring out different ways to broaden its scope without sacrificing efficiency.

Last summer, three researchers announced the latest improvement to Newton’s method. Amir Ali Ahmadi of Princeton University, along with his former students Abraar Chaudhry (now at the Georgia Institute of Technology) and Jeffrey Zhang (now at Yale University), extended Newton’s method to work efficiently on the broadest class of functions yet.

“Newton’s method has 1,000 different applications in optimization,” Ahmadi said. “Potentially our algorithm can replace it.”


In the 1680s, Isaac Newton developed an algorithm for finding optimal solutions. Three centuries later, mathematicians are still using and honing his method.

Godfrey Kneller/Public Domain

A Centuries-Old Technique

Mathematical functions transform inputs into outputs. Often, the most important feature of a function is its minimum value — the combination of inputs that produces the smallest possible output.

But finding the minimum is hard. Functions can have dozens of variables raised to high powers, defying formulaic analysis; graphs of their solutions form high-dimensional landscapes that are impossible to explore from a bird’s-eye view. In those higher-dimensional landscapes, said Coralia Cartis of the University of Oxford, “We want to find a valley. Some are local valleys; others are the lowest point. You’re trying to find these things, and the question is: What info do you have to guide you to that?”

In the 1680s, Newton recognized that even when you’re dealing with a very complicated function, you’ll still always have access to at least two pieces of information to help you find its deepest valley. First, you can calculate the function’s so-called first derivative, or slope: the steepness of the function at a given point. Second, you can compute the rate at which the slope itself is changing (the function’s second derivative).

Amir Ali Ahmadi sees optimization problems everywhere he looks.

Archives of the Mathematisches Forschungsinstitut Oberwolfach

Say you’re trying to find the minimum of some complicated function. First, choose a point on the function that you think might be close to the true minimum. Compute the function’s first and second derivatives at that point. These derivatives can be used to construct a special quadratic equation — a parabola if your function lives in a 2D plane, and a cuplike shape called a paraboloid if your function is higher dimensional. This quadratic equation, which is called a Taylor approximation, roughly resembles your function at the point you chose.

Now calculate the minimum of the quadratic equation instead of the original — something you can do easily, using a well-known formula. (That’s because quadratic equations are simple; it’s when equations get more complicated that calculating the minimum becomes prohibitive.) You’ll get a point. Then plug the coordinates of that point back into your original function, and you’ll get a new point on the function that is, hopefully, closer to its true minimum. Start the entire process again.
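
As an illustration (not from the article), here is a minimal one-dimensional sketch of that loop in Python: at each step the quadratic Taylor model built from the first and second derivatives is minimized in closed form, and that minimizer becomes the next point. The example function and names are assumptions chosen for demonstration only.

```python
import math

def newton_minimize(f_prime, f_double_prime, x0, steps=50, tol=1e-12):
    """1-D sketch: repeatedly minimize the local quadratic Taylor model."""
    x = x0
    for _ in range(steps):
        step = f_prime(x) / f_double_prime(x)  # closed-form minimizer of the quadratic model
        x -= step
        if abs(step) < tol:                    # stop once the updates become negligible
            break
    return x

# Example: f(x) = x**2 + exp(x), with f'(x) = 2x + exp(x) and f''(x) = 2 + exp(x).
x_min = newton_minimize(lambda x: 2 * x + math.exp(x),
                        lambda x: 2 + math.exp(x),
                        x0=0.0)
print(x_min)  # approximately -0.3517, where the slope 2x + exp(x) vanishes
```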

Newton proved that if you keep on repeating this process, you’ll eventually home in on the minimum value of the original, more complicated function. The method doesn’t always work, especially if you start at a point that’s too far away from the true minimum. But for the most part, it does. And it has some desirable attributes.


Mark Belan/Quanta Magazine; Source: arxiv:2305.07512

Other iterative methods, like gradient descent — the algorithm used in today’s machine learning models — converge toward the true minimum at a linear rate. Newton’s method converges toward it much faster: at a “quadratic” rate. In other words, it can identify the minimum value in fewer iterations than gradient descent. (Each iteration of Newton’s method is more computationally expensive than an iteration of gradient descent, which is why researchers prefer gradient descent for certain applications, like training neural networks. But Newton’s method is still enormously efficient, making it useful in all sorts of contexts.)
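
To make the difference in rates concrete, the toy comparison below (an illustration, not from the article) counts how many iterations each method needs on a simple one-dimensional convex function before the slope is essentially zero. The function and the fixed gradient-descent step size are assumptions chosen for the example.

```python
import math

# f(x) = x**2 + exp(x): a smooth convex function with a single valley.
def f_prime(x):
    return 2 * x + math.exp(x)

def f_double_prime(x):
    return 2 + math.exp(x)

def iterations_until_flat(update, x0=2.0, tol=1e-10, cap=100_000):
    """Count iterations until the slope is numerically zero."""
    x = x0
    for k in range(cap):
        if abs(f_prime(x)) < tol:
            return k
        x = update(x)
    return cap

newton = iterations_until_flat(lambda x: x - f_prime(x) / f_double_prime(x))
gradient_descent = iterations_until_flat(lambda x: x - 0.1 * f_prime(x))  # fixed step size 0.1

print(newton, gradient_descent)  # Newton reaches the tolerance in far fewer iterations
```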

Newton could have written his method to converge toward the true minimum value even faster if, instead of taking just the first and second derivatives at each point, he had also taken, say, the third and fourth derivatives. That would have given him more complicated Taylor approximations, with exponents greater than 2. But the whole crux of his strategy was to transform a complicated function into a simpler one. These more complicated Taylor equations were more than Newton could handle mathematically.

Jeffrey Zhang and his co-authors wiggled functions in just the right way, allowing them to broaden the scope of a powerful optimization technique.

Courtesy of Jeffrey Zhang

“Newton did it for degree 2. He did that because nobody knew how to minimize higher-order polynomials,” Ahmadi said.

In the centuries since, mathematicians have worked to extend his method, to probe how much information they can squeeze out of more complicated Taylor approximations of their functions.

In the 19th century, for instance, the Russian mathematician Pafnuty Chebyshev proposed a version of Newton’s method that approximated functions with cubic equations (which have an exponent of 3). But his algorithm didn’t work when the original function involved multiple variables. Much more recently, in 2021, Yurii Nesterov (now at Corvinus University of Budapest) demonstrated how to approximate functions of any number of variables efficiently with cubic equations. But his method couldn’t be extended to approximate functions using quartic equations, quintics and so on without losing its efficiency. Nevertheless, the proof was a major breakthrough in the field.

Now Ahmadi, Chaudhry and Zhang have taken Nesterov’s result another step further. Their algorithm works for any number of variables and arbitrarily many derivatives. Moreover, it remains efficient for all these cases — something that until now wasn’t possible.

But first, they had to find a way to make a hard math problem a lot easier.

Finding Wiggle Room

There is no fast, general purpose method for finding the minima of functions raised to high exponents. That’s always been the main limitation of Newton’s method. But there are certain types of functions that have characteristics that make them easy to minimize. In the new work, Ahmadi, Chaudhry and Zhang prove that it’s always possible to find approximating equations that have these characteristics. They then show how to adapt these equations to run Newton’s method efficiently.

What properties make an equation easy to minimize? Two things: The first is that the equation should be bowl-shaped, or “convex.” Rather than having many valleys, it has just one — meaning that when you try to minimize it, you don’t have to worry about mistaking an arbitrary valley for the lowest one.

Abraar Chaudhry and two colleagues recently found a way to improve a centuries-old method for finding the minima of functions.

Camille Carpenter Henriquez

The second property is that the equation can be written as a sum of squares. For example, 5x² + 16x + 13 can be written as the sum (x + 2)² + (2x + 3)². In recent years, mathematicians have developed techniques for minimizing equations with arbitrarily large exponents so long as they are both convex and a sum of squares. However, those techniques were of little help when it came to Newton’s method. Most of the time, the Taylor approximation you use won’t have these nice properties.
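
For the concrete polynomial quoted above, both properties are easy to verify by expanding, for instance with sympy (assumed installed) as a quick illustration:

```python
import sympy as sp

x = sp.symbols('x')
original = 5 * x**2 + 16 * x + 13
sum_of_squares = (x + 2)**2 + (2 * x + 3)**2

print(sp.expand(sum_of_squares - original))  # 0: the decomposition matches the original
print(sp.diff(original, x, 2))               # 10, a positive constant, so the function is convex
```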

But Ahmadi, Chaudhry and Zhang figured out how to use a technique called semidefinite programming to wiggle the Taylor approximation just enough to make it both a sum of squares and convex, though not so much that it became unmoored from the original function it was supposed to resemble.

They essentially added a fudge factor to the Taylor expansion, turning it into an equation that had the two desired properties. “We can change the Taylor expansion a bit to make it simpler to minimize. Think of the Taylor expansion, but modified a little bit,” Ahmadi said. He and his colleagues then showed that, using this modified version of the Taylor expansion — which involved arbitrarily many derivatives — their algorithm would still converge on the true minimum of the original function. Moreover, the rate of convergence would scale with the number of derivatives used: Just as using two derivatives allowed Newton to approach the true minimum at a quadratic rate, using three derivatives enabled the researchers to approach it at a cubic rate, and so on.

Ahmadi, Chaudhry and Zhang had created a more powerful version of Newton’s method that could reach the true minimum value of a function in fewer iterations than previous techniques.

Like the original version of Newton’s method, each iteration of this new algorithm is still computationally more expensive than methods such as gradient descent. As a result, for the moment, the new work won’t change the way self-driving cars, machine learning algorithms or air traffic control systems work. The best bet in these cases is still gradient descent.

“Many ideas in optimization take years before they are made fully practical,” said Jason Altschuler of the University of Pennsylvania. “But this seems like a fresh perspective.”

If, over time, the underlying computational technology needed to run Newton’s method becomes more efficient — making each iteration less computationally expensive — then the algorithm developed by Ahmadi, Chaudhry and Zhang could eventually surpass gradient descent for all sorts of applications, including machine learning.

“Our algorithm right now is provably faster, in theory,” Ahmadi said. He’s hopeful, he added, that in 10 to 20 years, it will also be so in practice.

Correction: March 25, 2025

The graphic in this article has been updated.