When you talk about backpropagation, you're talking about neural networks and connectionism. That's not the only kind of artificial intelligence; the other main branch is heuristics, which is closer to traditional programming.
Connectionism is mathematically very simple: it's a matrix multiplication. Backpropagation is adjusting the elements of the multiplier matrix up or down according to what the impact of each multiplier should have been. Basically, the input is a vector that gets multiplied by a multiplier matrix to produce the output vector. This gives a single-layer neural net, which is actually two layers of neurons joined by one matrix.
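A minimal sketch of that idea in NumPy, with gradient descent on squared error standing in for the up/down adjustment; the sizes, learning rate, and target here are illustrative assumptions, not from any particular system:

```python
import numpy as np

# Single-layer net: output = W @ input, where W is the multiplier matrix.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(2, 3))    # 3 inputs -> 2 outputs
lr = 0.1                                  # learning rate

x = np.array([1.0, 0.0, 1.0])             # input vector
target = np.array([1.0, 0.0])             # desired output vector

for _ in range(200):
    y = W @ x                             # forward pass
    error = y - target                    # how far off each output was
    # Backpropagation for one layer: nudge each element of the
    # multiplier matrix opposite to its contribution to the error
    # (the gradient of the squared error).
    W -= lr * np.outer(error, x)
```

After enough adjustments, `W @ x` lands close to the target; with more than one training pair you would loop over the pairs inside each pass.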
The choice of representation is critical. Suppose you were drawing three binary digits, i.e., three draws from pools of two balls. You could have each element of the input and output vectors represent one of the two possibilities in each draw, giving each vector six elements with each pair representing a draw. Or each vector could have three elements, each asserted or not asserted corresponding to the 1 or 0 state of that digit's result. Or you could have eight elements, each representing a composite result, i.e., 000, 001, 010, 011, 100, 101, 110, and 111. The last will allow the neural network to consider the most possibilities, and the threshold requirement to identify the solution is simply the element with the strongest signal. Historical data would be represented by additional elements in the input vector, with the data shifted down the line.
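The three encodings can be sketched concretely for one outcome of the three draws, say 1, 0, 1; the variable names below are illustrative:

```python
import numpy as np

digits = [1, 0, 1]                         # the result of three binary draws

# (a) Two elements per draw, six in total: each pair is a draw,
#     with one element per possible ball.
per_draw = np.zeros(6)
for i, d in enumerate(digits):
    per_draw[2 * i + d] = 1.0              # -> [0,1, 1,0, 0,1]

# (b) One element per digit, asserted for 1, clear for 0.
per_digit = np.array(digits, dtype=float)  # -> [1, 0, 1]

# (c) One element per composite result 000..111, eight in total.
index = digits[0] * 4 + digits[1] * 2 + digits[2]   # 101 -> 5
composite = np.zeros(8)
composite[index] = 1.0

# Reading out (c) is just taking the element with the strongest signal.
decoded = int(np.argmax(composite))        # 5, i.e. binary 101
```

Encoding (c) is what is now usually called a one-hot encoding over composite outcomes, which is why the readout reduces to an argmax.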
These are just the earliest known neural networks and are not considered the state of the art. Most neural nets are known as three-layer neural networks: an input vector, a multiplier matrix, an intermediate hypothesis vector, and another multiplier matrix producing an output vector. The middle layer is seeded with half ones and half zeros, either arbitrarily, at random, or as deemed necessary. Each asserted value in this middle layer allows the neural net to form one hypothesis to consider. Backpropagation is done the same way but traces through both multipliers, hence adjusting the elements of both.
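A sketch of that three-layer arrangement, with backpropagation traced through both multiplier matrices; the sigmoid activation, the XOR task, and all sizes here are my illustrative assumptions (XOR is the classic mapping a single-layer net cannot represent, which is what the middle hypothesis layer buys you):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
W1 = rng.normal(scale=0.5, size=(4, 2))   # input (2) -> hidden (4)
W2 = rng.normal(scale=0.5, size=(1, 4))   # hidden (4) -> output (1)
lr = 0.5

# XOR truth table as training pairs
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)

def total_error():
    # Sum of squared errors over all four input/target pairs.
    return sum(np.sum((sigmoid(W2 @ sigmoid(W1 @ x)) - t) ** 2)
               for x, t in zip(X, T))

initial_error = total_error()
for _ in range(5000):
    for x, t in zip(X, T):
        h = sigmoid(W1 @ x)               # intermediate hypothesis vector
        y = sigmoid(W2 @ h)               # output vector
        dy = (y - t) * y * (1 - y)        # output-layer adjustment
        dh = (W2.T @ dy) * h * (1 - h)    # traced back through W2
        W2 -= lr * np.outer(dy, h)        # adjust both multipliers
        W1 -= lr * np.outer(dh, x)
final_error = total_error()
```

The two delta terms are the "tracing through both multipliers": the output error is pushed back through W2 to assign blame to the hidden layer, and both matrices are nudged accordingly.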
In the late '80s, I proposed a compiler that took standard single-flow computer languages and mapped them into an arbitrary number of layers, much the way a TTL circuit would be built. The unused portion of each layer would be given a small probability of becoming asserted during each backpropagation pass, so that over time additional hypotheses could form in each layer. The idea was to mimic the human learning process, where a few simple rules of thumb are learned by rote to use as a crutch and over time abandoned for the expertise we develop, and of course to leverage existing conventional programming solutions to teach a neural net. The size of the net would have been unbearably large and hence difficult to model on the computers of the day, as well as impossible to build into hardware. Plus, as an undergrad, my ideas weren't given much attention, though I have noticed that the new LRT trains that particular city now buys use principles I outlined in a paper once. I have discussed my ideas with some fairly well-known computer scientists in the decades since, with good acceptance, but we haven't done anything further; some of the platforms it would have been suited to no longer exist today.
You don't see a lot of work on connectionism anymore, and I'm not sure why. It's probably because the field was dominated by some very unusual characters at Thinking Machines, the company I worked for in the mid-'90s, and of course that Manhattan Project-style company is no more. I was there at the tail end, so I never did see the conference room filled with Rubik's Cubes and Lego blocks.