hypersoniq's Blog

The Python to C roadmap...

It looks daunting...

Memory management is the first part that has me concerned. The fix looks to be divide and conquer...

Garbage collection... that should be fun.

Pointers... this will most likely be the mechanism for column traversal.
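To make the pointer idea concrete, here is a rough sketch (in Python, since that is where the working code lives) of the column scan that a C pointer walk would replace. The function name and the tiny data set are my own illustration, not the project's actual code.

```python
# Sketch: column traversal prototyped the way a C pointer walk would work.
# Scanning one digit column is a linear walk over a flat array.

def count_matches(column, predict):
    """Count how often predict[last_digit] equals the next drawn digit.
    `column` is one digit column of the draw history (oldest first);
    `predict` maps each digit 0-9 to its predicted follower."""
    hits = 0
    for i in range(len(column) - 1):   # in C: for (p = col; p < end - 1; p++)
        if predict[column[i]] == column[i + 1]:
            hits += 1
    return hits

# Hypothetical column where 7 is always followed by 3 and vice versa:
col = [7, 3, 7, 3, 7]
mapping = [0] * 10
mapping[7] = 3
mapping[3] = 7
print(count_matches(col, mapping))  # 4 transitions, all 4 match
```

In C the index `i` becomes a pointer that is simply incremented, which is why pointers look like the natural mechanism for column traversal.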

The solution for the test program is to run individual tests with data of the same length. We would start with the pick 2, as it has the fewest operations, and split each game into eve and mid runs as well.

Production software will be 8 programs, with memory allocations matched to the data size of each game.

Runs will most likely be sequential rather than parallel, since that eliminates the possibility of threads vying for the same memory blocks and corrupting the data.

These can be set to execute in sequence using the built-in Linux cron job scheduler.
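For reference, a single crontab entry pointing at one wrapper script keeps the runs strictly sequential; the path and schedule below are placeholders, not the actual setup:

```
# crontab -e entry (illustrative only; path and time are placeholders)
# One wrapper script runs the 8 programs back to back, so each program
# starts only after the previous one finishes:
0 2 * * * /home/pi/blotbot/run_all.sh >> /home/pi/blotbot/cron.log 2>&1
```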

The order will follow the play strategy, which has been modified for the least possible expense.

Pick 2 Mid

Pick 2 Eve

Pick 3 Mid

Pick 3 Eve

Pick 5 Mid (the goal)

Pick 5 Eve (the other goal)

Pick 4 Mid

Pick 4 Eve

The new play strategy...

Pick 2 only until a hit.

On a pick 2 hit, the next 4 play cycles are

Pick 2 x 1

Pick 3 x 5

That is $12 per mid/eve cycle, leaving $2 to get back to just the pick 2

On a pick 3 win, we deal in the pick 4 and pick 5

Pick 2 x 1

Pick 3 x 5

Pick 4 x 1

Pick 5 x 20

That is $54 per mid/eve cycle for the next 4 played cycles... $216 on house money. Then it drops down to 4 cycles of the pick 2 win strategy ($48), plus the $2 pick 2 cycle, for a total cost of $266. Taken from the pick 3 hit profit of $2,500, that leaves over $2,200.
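A quick check of the figures, assuming $1 per play as the cycle costs imply:

```python
# Strategy cost check, assuming $1 per play (my reading of the figures above).
price = 1
pick2_cycle = 1 * 2 * price                      # pick 2 only, mid + eve = $2
pick2_win_cycle = (1 + 5) * 2 * price            # pick 2 x1 + pick 3 x5 = $12
pick3_win_cycle = (1 + 5 + 1 + 20) * 2 * price   # adds pick 4 x1 + pick 5 x20 = $54

house_money = 4 * pick3_win_cycle                # $216 over the 4 post-win cycles
total = house_money + 4 * pick2_win_cycle + pick2_cycle  # $216 + $48 + $2
print(pick2_win_cycle, pick3_win_cycle, house_money, total)  # 12 54 216 266
print(2500 - total)  # 2234 left from a $2,500 pick 3 hit
```

So the $266 total and the "over 2k" remainder both check out under the $1-per-play assumption.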

The ONLY out of pocket expenses will be $2 for the pick 2 cycle! That is a 75% expense reduction in regular play.

It greatly reduces the exposure of the plays on the pick 5, which is the target, but part of the exercise was to develop the system into an entire play strategy that minimizes out-of-pocket expenses while still having the potential for decent profit. If it can't beat the 1:100 odds, then there is not much point in going after 1:100,000.

Entry #291

Double checking the math...

I do not have a way of knowing how many operations to expect, but I can compute clock cycles from the CPU speed and the run time of the test.

As written in Python, the elapsed time of the run was 190 seconds. In one second, a 2.4GHz processor goes through 2,400,000,000 cycles.

So, 190 x 2.4 Billion = 456 billion cycles.

To extrapolate that into the full run, since the test only did the first 100 iterations of 10 billion, we need to multiply that answer by 100,000,000.

The answer is a staggering 45.6 quintillion cycles... again, as written in pure Python.

That is 45,600,000,000,000,000,000!

That straight extrapolation works out to roughly 600 years, in the same ballpark as the 554-year run time estimate.
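Re-running that arithmetic in a few lines shows the straight extrapolation lands at about 45.6 quintillion cycles and roughly 600 years, the same order of magnitude as the 554-year estimate:

```python
# Double-checking the cycle math from this entry.
test_seconds = 190              # measured Python run time for the test slice
hz = 2_400_000_000              # 2.4 GHz
test_cycles = test_seconds * hz
print(f"{test_cycles:,}")       # 456,000,000,000 -> 456 billion

scale = 10_000_000_000 // 100   # the test covered 100 of 10 billion iterations
full_cycles = test_cycles * scale
print(f"{full_cycles:,}")       # 45,600,000,000,000,000,000 -> 45.6 quintillion

years = test_seconds * scale / (60 * 60 * 24 * 365)
print(round(years))             # 602 years for the straight extrapolation
```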

That is why Python, with its seemingly minuscule overhead when running short scripts, is the WRONG tool for the job.

Why?

1. It is interpreted. To have a chance at a full run, this needs a language compiled down to processor-level machine code.

2. Dynamic typing... Python infers the data type at run time. This makes it very flexible, but that flexibility comes at a compute-cycle cost. What this project needs is a static type system where the data types can be set explicitly. That will save a ton of overhead versus the constant re-evaluation of the same variables.

3. The algorithm is tested and optimized as far as I can take it and still get the desired results.

 

The two leaders in a new language for this project are C and Rust. However, C seems like the most likely candidate for the job.

What are the drawbacks of using C?

1. I am not very familiar with C, outside of a few programming exercises in school and using an Arduino, whose sketch programs are C-like.

2. Memory management, including allocation and release, will now be on me rather than an interpreter.

3. I also have to deal with pointers, and with no garbage collector, cleanup is manual.

4. I have to manually create a data frame structure, since I will not have access to the pandas library.

5. I still won't know the run time until the test program is run in C; this could all be for nothing.
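On drawback 4, the program only needs a small slice of what pandas provides: per-column lists of digits. Here is a sketch of that column-store idea in Python (the file layout is illustrative), which a hand-rolled C "data frame", one array per column plus a row count, would mirror:

```python
# Minimal stand-in for the pandas usage: a draw-history csv loaded into
# column-major lists of ints. The sample data is made up for illustration.
import csv
import io

def load_columns(csv_text, n_cols):
    """Read a draw-history csv into one list per digit column."""
    cols = [[] for _ in range(n_cols)]
    for row in csv.reader(io.StringIO(csv_text)):
        for c in range(n_cols):
            cols[c].append(int(row[c]))
    return cols

history = "7,3,9\n2,3,1\n7,8,1\n"   # three hypothetical pick 3 draws
cols = load_columns(history, 3)
print(cols)  # [[7, 2, 7], [3, 3, 8], [9, 1, 1]]
```

The C version would be a struct holding three malloc'd int arrays and a length, which keeps the column scan a plain pointer walk.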

There are many challenges ahead, but many have already been met: a working algorithm exists, and the flow chart will be of great use in converting to C. The rest of the system, the spreadsheets for validation and implementation, already exists.

All I can do now is move toward the next solution. I have no idea how long it will take, but giving up when faced with that ludicrous cycle count is not an option... this is fascinating stuff!

The kicker is that even with a successful run, it will still probably not help pick winning numbers. One plus is that the memory situation was already reduced to the bare minimum when deciding to use the Raspberry Pi 5: the generated csv files are small, and the main loop does not hold data from each pass, only a single incrementing integer variable to count matches. When the next column is scanned, the variables reset.

The decision to completely move from Python to C was not taken lightly, and would not have been made if the algorithm did not work.

Might be on that borderline where hobby meets obsession...

Happy Coding!

Entry #290

A formal statement of the problem the project aims to address.

The BLOTBOT project is a thorough attempt to analyze the per digit replacement system of selecting lottery numbers. One such popular variant is the mirror system.

The null hypothesis: there is no statistical advantage to be gained by studying past draws to predict future results using per digit replacement in a pick N lottery game.

The alternative hypothesis is that there IS a statistical advantage to be gained by studying past draws to predict future results using per digit replacement in a pick N lottery game.

Where we are planning to deviate from the scientific method is by exhaustively testing all possible variations rather than sampling.

There is a second hypothesis also being tested.

For this part...

The null hypothesis is that direct follower data will not be the highest performing list in the 10 billion lists possible for each column of each game.

The alternative hypothesis is that direct follower data WILL be the highest performing list in the 10 billion lists possible for each column of each game.

So, with one massive test, we can make an honest attempt at answering both hypotheses.

If the first null hypothesis is accepted, then there really is no point in continuing on the current path. That would be the indicator to maybe back away from the daily games for good. Not sure yet. It may be the indicator that I am just not smart enough to beat the lottery at their game, and should employ other techniques, like unsupervised machine learning, to help find patterns that I fail to see.

On the other hand, working to get to this point has allowed me the opportunity to put some of the theory I learned in classes like algorithm design and software engineering into direct use. I am not the type to refuse to admit I was wrong; I have been studying the lottery for decades and have spent more time thinking about the problem than actually playing.

It is that burning desire to solve problems that kept the chase alive so far. I want to know, even if it means confirming that the chase is a waste of time and I should move on to something else as a hobby. As always, time will tell.

Entry #289

Other promising ideas to save the project.

All of the times so far are based on pure Python code running on one core. The Raspberry Pi 5 has 4 cores. If I incorporate multiprocessing in some way, that could be an immediate 4x reduction in run time.

Since pandas is a single-threaded library, it might make sense to let one core handle the day and night draws for each game, with 4 separate instances of the same script. At a Python-only level, that should bring the 554 years down to 138.5 years. The low end of the PyPy scale shows a 7x improvement over the same code run in straight Python; that would bring it down to around 20 years... not bad from a starting point of 3,000!
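A minimal sketch of the one-core-per-game idea using Python's standard multiprocessing module; the game names and the worker function are placeholders, not the project's actual script:

```python
# Sketch: 4 worker processes, one per game, each handling that game's
# day + night files. Separate processes means no shared memory to fight over.
from multiprocessing import Pool

def run_game(game):
    # Placeholder for the real per-game column scan; it just echoes the name.
    return f"{game} done"

if __name__ == "__main__":
    games = ["pick2", "pick3", "pick4", "pick5"]
    with Pool(processes=4) as pool:
        print(pool.map(run_game, games))
```

Since each process gets its own interpreter and its own memory, this sidesteps both the GIL and the data-corruption worry from running things in parallel.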

Obviously some research will be required, but that makes some form of C subroutines even more attractive.

Not giving up easily, not giving up at all!

Happy Coding!

Entry #288

The number of operations is staggering...

Just for the pick 3 evening, which has the longest draw history, the code will step through the draw data in each column, over 17,000 draws, 10 billion times. That's per column... 3 columns, plus the evaluation of each outcome against the top 3 list, means that just that game will take an estimated 500 TRILLION operations.
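The estimate checks out; a quick sanity script using the figures from this entry:

```python
# Operation count for the pick 3 evening game.
draws = 17_000           # longest draw history
lists = 10_000_000_000   # replacement lists tested per column
columns = 3
ops = draws * lists * columns
print(f"{ops:,}")        # 510,000,000,000,000 -> roughly 500 trillion
```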

This is why the run time estimate is still in the hundreds of years.

Cython is going to take much more time and work than originally thought. I may end up having to add it to Windows just to do the development.

There is an interim technology I have planned to try called PyPy, a just-in-time Python interpreter that runs in place of CPython and uses Python code directly, no C conversion necessary. It generally offers 7x speed-ups, but in the case of what I am doing, it could be hundreds or thousands of times faster. Or not. At face value, the Python program cannot be run fully, and although I have taken the estimated run time down from 3,000 years to 554 years, it is obviously too ambitious of a project... for now.

As of right now, it was a nice thought exercise, but not practical to run. Not giving up just yet, but the likelihood of running over a quadrillion operations is not in the cards with Python.

Other languages to be considered are obviously C, but also taking a good look at Rust as well.

The Cython output was way too fragmented to learn from, but now that I know the algorithm, it is just a matter of doing the same thing in another language. As a result, this will drag out the process much longer than I had hoped, and now that the next class has started, I might not be able to even look at it for the next 9 weeks.

Oh well, it has been an educational journey so far for sure.

Entry #287

Reframing the big test, and resuming play while waiting.

After thinking about how the follower system is actually calculated, based on the most frequent follower of each digit in each column, I am almost certain that I am already using the data that the test program would produce as its highest count.

As a result, the test may be re-framed to just run the pick 2 and see if that is correct. I still need to run the Cython C conversion, but that was happening anyway.

So the new strategy will be to use these numbers I have now, and make an "additive" strategy...

Going to resume playing, but only the pick 2.

IF there is a hit, the pick 3 will be added UNTIL the free plays from the pick 2 win run out, then back to just the pick 2.

IF there is a pick 3 hit in the pick 2 win window, then the pick 4 and pick 5 will be added.

This keeps out-of-pocket expenses (which they will be, since I used the last of the pick 3 winnings to buy the Raspberry Pi 5 kit) to a $2 per played day maximum. As I usually only play 4 days a week, that will make the system cost $8 per week, rather than $8 per day... a 75% reduction!

I already feel better about the new strategy because it is now as low of a play budget as I can make it while still actually playing.

Cython mastery will come in handy for far more than the lottery, so I will be super motivated to continue learning all about it. It is a common practice for data scientists who use Python to write C subroutines when working with intense calculations on large data sets. That will give me practice writing C code outside the scope of Arduino sketches.

So the new goal is to run the ten billion tests on only the pick 2, for the purpose of testing my hypothesis that the follower data, which I can generate in 90 seconds, will indeed be the highest-matching list of the top 3. If it is not a match, then I am honor bound to run the rest of the tests.

I still need to hope for Cython to make even the pick 2 test runnable in a lifetime... therefore luck is still a factor.

Happy Coding!

Entry #286

Still learning all about Cython

Progress is slower than anticipated. Cython may be the last ditch effort to get the program to run in a reasonable amount of time, but it is not as simple as the examples have shown.

Generating the C file was straightforward enough, but when using gcc to compile it, there was a missing Python.h header that I had to point to in the compile command.

Figured that out; now I am stuck with literally hundreds of "undefined reference" errors thrown by gcc.

From the research so far, it looks like the Pandas library headers must also be included.

I also have to rerun the setup file and include the --embed flag in the cythonize command.

With these hurdles to clear I am left with several options, the two most likely are

1. Continue to resolve reference errors until it works.

2. Write the program directly in C.

I may end up writing directly in C, but Python is the language I solved the problem with... it is my go-to, but it is not very efficient at fast execution because it does not let you explicitly declare data types. The constant type inference on the 19 variables in the code is definitely a bottleneck.

Again, if it were easy everyone would be doing it.

On the Pi itself, awesome little computer! I set up SSH so it can be run without a monitor, keyboard or mouse.

Transferring the files is accomplished with FileZilla over secure FTP, and PuTTY was not necessary, as ssh is built right into Windows PowerShell!

There is a program to allow connecting to the visual part (X Windows), but we are going for low overhead, so the headless setup is functional.

2 classes left, and the next one starts Thursday and looks to be daunting... "Data mining and Machine learning"... so my experiment time will be limited.

I could have given up when the first time test indicated 3,000 years. I could throw in the towel with the current Python estimated run time of 554 years... almost a 6x boost... but that is not the goal.

Ever forward...

Entry #285

Why multiple tickets for the same draw do not divide the odds

It is a common practice to look at a ratio and reduce it. In pure numbers that is just fine, but in the lottery these numbers represent something and therefore cannot be reduced.

Let's use the simplest example possible... a pick 2 game.

Posted odds are 1:100

The one represents your ticket, also known as the favorable outcome.

The 100 represents the draw, or the possible outcomes.

Buy one ticket, such as 45, and your odds are 1 favorable outcome to 100 possible outcomes.

Buy 2 tickets with different numbers... say 45 and 71, and your odds ratio is 2:100

Some would say 1:50, but in every draw, the possible outcomes stay at 100

To extend the example, say you spent $50 to buy half of the possible outcomes... your odds are now 50:100. This is why they only pay out at half of the odds... IF you win, it pays $50 and you earned nothing. And if the draw was in the group of 50 that you did not cover, they took your $50.
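The ratio argument in a few lines, using the pick 2 figures from this entry (the $50 straight payout is the example above):

```python
# Favorable vs possible outcomes for distinct tickets in one pick 2 draw.
def odds(tickets, field=100):
    """More tickets grow the favorable count; the field never shrinks."""
    return tickets, field

print(odds(1))   # (1, 100)
print(odds(2))   # (2, 100), not "1:50": the draw still has 100 outcomes

# Cover half the field at $1 a ticket, with a $50 straight payout:
stake = 50
payout_if_win = 50
print(payout_if_win - stake)  # 0 profit even when you win
```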

This is why I take the more difficult route of trying to pick just one combo to play in a draw.

You win far less, but it feels like you have accomplished something when you get a hit. Plus you spend far less money in the process.

The same holds true for jackpot games.

Throwing extra money at one draw does not significantly increase your odds of winning, it is better to spread the same money out over multiple draws.

Best of luck!

Entry #284

Running out of coding optimizations, moving to the tests on the Pi.

It has been an exhaustive search, and I am at the point now where I cannot further refine and compact the code and still get the desired output.

The next phase is to update the draw histories and move the whole project to the raspberry pi and run the timed test.

The expectation is that the pure Python solution will generate near the same result, which is not a feasible solution. BUT... enter Cython! 

Cython will convert the pure Python code into the C language, compiled specifically for the device it runs on. That removes the overhead of Python, an interpreted high-level language, and reduces it to compiled C, which runs as machine code at the processor level.

The first experiment... run the "cythonized" time reporting version, from this we can accurately predict the total run time.

The second experiment... IF that run time is reasonable, start the production code. As a rough estimate, IF Cython operates as documented, total run time becomes a function of clock speed, and the new ARM processor can handle roughly 2,400,000,000 instructions per second. Holding to the complexity of the code, this could run in approximately 23 days... over 1/4 TRILLION calculations without overclocking! I am willing to accept any run time under 60 days.

I believe I have done the software engineering properly, chosen the correct data structures, and implemented the algorithm as efficiently as possible. Memory management was also considered. By only storing the top 3 lists for each pass of 10 billion, storage requirements are low, and the expensive I/O parts, writing to and reading from files, are kept to the bare minimum (8 reads, 28 writes, 28 screen prints total).

There are no shortcuts. In the replacement genre, where you transform the last draw into another number, there are exactly 10 billion possibilities per column. The game histories vary between 5k and 17k draws.

The order of operations

Pick 5 mid

Pick 5 eve

Pick 3 mid

Pick 3 eve

Pick 2 mid

Pick 2 eve

Pick 4 mid

Pick 4 eve

Hopefully it can launch this week...

Entry #282

Script progress

The script works!

Feels like an accomplishment already, even though the limited test was for the first 100, everything worked as expected!

Every day it seems like new lessons are learned. I removed the innermost function because I learned that in Python, inline code has far less overhead under the hood than function calls.

Last thing to do is run the test script with some timing code to accurately calculate the loop times.

There are 8 csv files containing draw histories and there are 8 output csv files that will hold the top 3 lists for each column for each game.

I decided to scrap the email notifications as well, as I do not currently have the time to learn another library.

Streamlined and simple. Once I have the times, I can continue to seek out more optimizations.

I do not have time info yet, but that is relatively easy to set up.

From the first draft prediction of 9,000 years, I brought the pre optimized code down to 9 years. 1000x sounds impressive, but I need to do better.

I also need to run the test script on the Pi itself to get a true prediction. So I have made huge progress, but not quite ready to launch the program just yet.

There are several factors that have contributed to the optimization so far...

1. Removing the inner function in exchange for inline code. The potential savings are the overhead associated with 280,000,000,000 function calls.

2. Reducing the top 10 list to a top 3. This is a 70% reduction in the size of the heap. Fewer elements mean fewer comparisons.

3. Reducing the program's screen prints to just 28 lines. No need to overwrite, as this is super clean... it just prints out the game name and column name before it starts looping. It is a tradeoff for run time; I will only know which column it is working on, not where it is, progress-wise, within the 10,000,000,000 iterations.

That is where I am at so far... progress has definitely been made. The Raspberry Pi has been set up and is waiting for the files to be sent so it can get going. This last stretch will probably be the most difficult as I attempt to gain a realistic run time. The program works... but can it work faster?

Then there is the ever present reality that none of this makes a difference with playing and winning... but at least I will be able to say I have tried everything!

Happy Coding!

Entry #281

Planning to keep on with the follower system while the program runs across April.

Updated the draw histories and spreadsheets today. Took about 2 hours because I needed to change the lookup tables to run from 0 to 9 rather than 1 to 0, as the former is the output of the new program.

For whatever it may be worth, I noticed something I have never seen before on any single shooter system I have worked on... a back-tested straight hit on the pick 5! 100,000 to 1 odds...

It may end up being I already have the highest optimized lists one could hope for, but the project continues... 

More reasons to believe the code will run in a reasonable time frame... I was looking at Python code on Stack Overflow (the best coding troubleshooting forum in existence) where someone had run a for loop ten billion times in just under an hour!

The complexity of the code I wrote will stretch that out significantly, but it is in line with early per-operation time estimates, where execution would take 32.45 days with my particular code... no Cython or overclock required. The new Pi runs at a respectable 2.4 GHz out of the box.

Entry #280

Raspberry Pi 5 arrived! Checklist for launch...

The TO DO list...

Hardware (RasPi 5)

1. Install the operating system

2. Update the software

3. Install the required Python libraries

4. Enable SSH (secure shell) for headless operation.

5. Use secure FTP to transfer history files and the scripts.

The script...

1. Put the pieces together.

2. Pare down unnecessary operations.

This includes limiting the screen prints and csv writes to a bare minimum. The printing of the current list was scrapped; output will only be the current game name and the column being processed. 28 prints and 56 writes... total for all 8 games.

3. Finish experiments with the email on completion function.

The process looks to be possible with the early speed predictions coming out at about 32 days run time for all 28 columns / 8 games.

Will be using a separate output csv for each game so they can be put into play as they finish rather than wait the entire 32 days to get started.

There is no point in moving to another language mid-stream, as all iterations must be done for completeness here; a C++ optimizer skipping what it deems to be "unnecessary iterations" would defeat the purpose.

The 2 main functions are: checking for hits, by finding the previous draw number's index in a 10-item list, comparing it to the next draw, and incrementing a hit counter; and then appending the counter to the current list and checking whether it makes the top 10 heap structure... as light as I can possibly code it and still get the needed functionality.

I have an idea on a run time guess, but no timeframe on the completion of the script... still a work in progress, but it is cool to be setting up the device that will run my magnum opus...

Entry #279

Won't 280 billion of any operation take too long to realistically execute?

If the core algorithm took even 1 second to run, the process would take almost 8,879 years... fortunately the algorithm, both checking for a match and checking the top 10, can be performed thousands of times in 1 second.

An accurate estimate still needs to be made with a timer on a limited set of maybe 10,000 iterations.
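A sketch of that timing approach: time a small sample, divide to get the per-iteration cost, then multiply out to the full 280 billion operations. The loop body here is a stand-in, not the real algorithm.

```python
# Estimate the full run time from a small timed sample.
import time

def estimate_total(seconds_for_sample, sample_iters, total_iters):
    """Extrapolate sample timing to the full iteration count."""
    per_iter = seconds_for_sample / sample_iters
    return per_iter * total_iters

start = time.perf_counter()
total = 0
for i in range(10_000):          # stand-in for 10,000 test iterations
    total += i
elapsed = time.perf_counter() - start

full_seconds = estimate_total(elapsed, 10_000, 280_000_000_000)
print(f"estimated full run: {full_seconds / 86400:.1f} days")
```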

The time of the algorithm is of utmost importance, so the time-consuming initial operations, like loading a csv file into a pandas data frame, are only done once per run... the actual match and sort are the heartbeat of the operation and are being optimized to run thousands of times per second.

Even with the optimizers in place, the estimate is now measured in months. How many months depends on my ability to make the code as streamlined as possible.

Entry #278

Productive coding session this morning!

Testing the puzzle pieces before assembling the final script.

Verified that the counter list mechanism works. Did that by printing out the first 1,000.

Verified that the draw history file reading does what is expected by using a column of test data with a known number of hits. Also verified that it appends the correct hit count at the end. This was crucial, as it is the back test part. Without this there could be no forward progress!

Currently working with a test script to create a heap data structure that returns the top 10 lists by hit count. This part has to go smoothly, as it will refine 10,000,000,000 tests down to 10 lists. That is how such a large project can stay within memory range to run on a Raspberry Pi 5.
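A sketch of that bounded top-N idea with Python's standard heapq module (shown here with k=3 to match the later top 3 refinement; the hit counts are made up):

```python
# Keep a min-heap of fixed size k: each of the 10 billion candidates costs
# at most one push/replace, and memory use stays flat.
import heapq

def update_top(heap, hits, list_id, k=10):
    """Retain only the k highest hit counts seen so far."""
    if len(heap) < k:
        heapq.heappush(heap, (hits, list_id))
    elif hits > heap[0][0]:              # beats the current k-th best
        heapq.heapreplace(heap, (hits, list_id))

top = []
for list_id, hits in enumerate([5, 42, 17, 99, 3]):
    update_top(top, hits, list_id, k=3)
print(sorted(top, reverse=True))  # [(99, 3), (42, 1), (17, 2)]
```

Because the heap's smallest element is always at index 0, most candidates are rejected with a single comparison, which is what keeps the refinement cheap.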

After that, it will be a simple matter of putting the pieces together.

Entry #277