hypersoniq's Blog

Preliminary classification data from PA Pick3 Evening

So far, there were over 4,200 draws classified as Neutral Neutral Neutral... that is out of 16,800 draws placed into 2,400 groups of 7 classified based on the previous 150 draws.... roughly 1/4 of all draws... all hot HHH represented less than 50 draws in 16,800... all cold less than 60 in 16,800. The runners up had 1,000 to 1,200 draws and were mixes of HNN and CNN.

Ran into one 7 draw group where ALL 7 were NNN draws... no repeat combos!

Still doing a deep dive and will have a percentage by the weekend... looking interesting so far. The longest stretch with no NNNs was 3 groups... 21 draws.
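For anyone following along, the "longest stretch with no NNNs" figure can be pulled out of the classified groups with a few lines. This is only a sketch: the function name and the sample labels are made up for illustration (groups are shortened to 3 draws each), and it assumes each group is just a list of classifier strings.

```python
def longest_gap_without(groups, target="NNN"):
    """Return the longest run of consecutive groups that contain no `target` label."""
    longest = current = 0
    for group in groups:
        if target in group:
            current = 0              # an NNN showed up, so the dry spell ends
        else:
            current += 1
            longest = max(longest, current)
    return longest


# Hypothetical labels, NOT real draw data.
sample_groups = [
    ["NNN", "HNN", "CNN"],
    ["HNN", "CNN", "NNH"],
    ["CNN", "HHN", "NNC"],
    ["NNN", "NNN", "HNN"],
    ["NCN", "NNN", "CNN"],
]
gap_groups = longest_gap_without(sample_groups)
print(gap_groups, "groups in a row without an NNN")
```

Multiply the result by the group size (7 in the real file) to get the stretch in draws.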

What does it mean? Still working on that...

Entry #411

Run time was less than 2 hours!

I ran the time test on the first 70 draws, therefore it needed to load a ton of draws to get to that point, but each pass reduced the offset by 7 draws, so as the pool got smaller, the code ran faster!

I have reviewed the overall file and it does contain 2,400 groups of 7 profiled draws... it only works when there are 7 or more remaining draws... that profiled all but 152 in draw history!

The entire 19,000 row result file will need to be gone through manually, but there are some interesting observations already... the level of repeatability is NOT going to be 100%, as a few groups had 0 NNN draws. These were few and far between, but they existed. I figured they would... I thought I would see more bad ones actually.

I went in looking to see about 70% repeatability of the core observation. Only went through the first few dozen and it already looks like closer to the 80%-90% range. And, there are still segments of 7 where multiple NNN draws have happened... looking to see if multiples are the indicator of a bust segment with 0...

The code ran flawlessly, and in much less time than guessed. Pick 3 mid day with almost 50% fewer draws should run in an hour or less... same with the pick 5 games. But first this file needs to be studied for a few days to get a feel for the data and its peculiarities.

This was a few weeks of planning and preparing to run this file, so that is a win!

Now, do I need to repeat this to see if there is a better day of the week to start? Probably not right now anyway... the idea is to run the last 150 draws, which will NOT have anything to classify, and just go off of that data for a week.

Entry #410

Coding can be fun... honest!

Backtest script is underway for the PA pick 3 eve... the entire history! Will take a bit over 4 hours to run.

I spent the better part of this morning making sure the csv file was writing correctly, and it did. Then I needed a blank row between the classifier output... that was easy as well.

As a test, I wrote the code to process the last 70 draws, 7 at a time. This uses a rolling 150 draws before the classification. All 10 had at least 1 NNN draw! This matches observation.

Here is the fun part. Just to test the multi column functionality, I ran the same exact test on the evening pick 5... I was expecting to see worse results, and that kind of panned out, however... 6 of 10 groups had AT LEAST one NNNNN draw! That changes the approach to the pick 5! That means there can be the hope of catching an NNNNN draw within a 2 week window (rather than one for the pick 3).

When the csv file completes, I will be able to go through one pass and count the groups plus count the groups containing an NNN draw. Total NNN groups / Total Groups = how repeatable the observation is, as a percent.
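A rough sketch of that one-pass count, for the curious. The layout is an assumption: one classifier label per row in the first column, with a blank row between groups. The function names and the tiny sample are hypothetical, not the real result file.

```python
import csv
import io


def read_groups(handle, label_column=0):
    """Split CSV rows into groups, treating a blank row as the group separator."""
    groups, current = [], []
    for row in csv.reader(handle):
        if not row or not any(cell.strip() for cell in row):
            if current:                      # blank row closes the current group
                groups.append(current)
                current = []
        else:
            current.append(row[label_column].strip())
    if current:                              # last group may not end with a blank row
        groups.append(current)
    return groups


def repeatability(groups, target="NNN"):
    """Percent of groups that contain at least one `target` draw."""
    if not groups:
        return 0.0
    hits = sum(1 for group in groups if target in group)
    return 100.0 * hits / len(groups)


# Hypothetical three-group sample (2 draws per group to keep it short).
sample_csv = "NNN\nHNN\n\nCNN\nHHN\n\nNNN\nNNN\n"
groups = read_groups(io.StringIO(sample_csv))
percent = repeatability(groups)
print(f"{percent:.1f}% of {len(groups)} groups had at least one NNN")
```

For the real file, swap the `io.StringIO` sample for `open("your_file.csv", newline="")` and point `label_column` at whichever column holds the classifier.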

Then... I can use the same exact file to look for any commonality in the percentages.

And I will be able to do this for the pick 3 day, as well as the pick 5 games...

So, part code, part spreadsheet.

The first NNNNN draw I saw was using the following neutral count per column... 7,7,7,7,8.

That means the winning combo was in a possible field of 19,208 combos... 80,792 LESS combos than the raw 100,000... and the groups without a NNNNN combo were NOT consecutive...
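The arithmetic is just the product of the neutral counts per column. A quick sketch (the function name is mine, for illustration only):

```python
from math import prod


def pool_size(neutral_counts):
    """Number of straight combos that can be built from the neutral digits per column."""
    return prod(neutral_counts)


pick5_pool = pool_size([7, 7, 7, 7, 8])      # neutral counts from the post
pick5_eliminated = 10 ** 5 - pick5_pool      # out of the raw 100,000
print(pick5_pool, pick5_eliminated)
```

The same function covers the pick 3: `pool_size([7, 7, 7])` gives the 343 combos left out of 1,000.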

This is a big deal for me, as I spent the last 20+ years using back tests to see that systems did NOT work over any lengthy run... 

It is only the first step, but it will prove one thing... there is somewhat of an ORDER to CHAOS...

Even if I have a hard time figuring out the next step, this one was huge! A blanket reduction that has a guaranteed 7 draw window... SAFE eliminations!

Just imagine what this could mean if it holds somewhat similar on jackpot games... power loading an abbreviated wheel... safely eliminating millions of combos... don't want to get too far ahead of development though...

Short term targets are the pick 3 followed by the pick 5.

Longer term, PA Match 6, Cash 4 Life and the big games PB and MM (since they did not change the white ball matrix)

Best part is it will be cheap to play... pick 3 .50 straight and pick 5 $1 straight... $21 a week for all 4 games. However, the pick 5 will not be played without a pick 3 win to fund it... gotta stay cheap!

Entry #409

The plan for the big back test.

After some thought, I think I found a way to have the backtest start at the farthest point in history and move forward, so each group of 7 classified draws will match up exactly with the draw history.

It ends up as simple as starting with a high offset and subtracting 7 rows from the offset for each re run of the function. This will yield one large csv file with the majority of draw history being classified based on the previous 100 draws. This will allow calculating a confidence interval on the appearance of the NNN draws at least once every 7 draws.
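Here is a minimal sketch of the offset idea, not the actual script. It assumes the draw history is ordered oldest to newest and that the offset counts draws back from the newest end; the function name and the index layout are hypothetical. The lookback is a parameter, so it works the same for 100 or 150 draws.

```python
def backtest_windows(n_draws, lookback=150, group_size=7):
    """Yield (window_start, group_start, group_end) index triples, oldest group first.

    The lookback window is rows [window_start, group_start) and the group to
    classify is rows [group_start, group_end).
    """
    offset = n_draws - lookback              # start as far back as the lookback allows
    while offset >= group_size:              # stop when fewer than 7 draws remain
        group_start = n_draws - offset
        yield group_start - lookback, group_start, group_start + group_size
        offset -= group_size                 # each pass moves 7 draws toward the present


# 16,952 is a hypothetical history length (16,800 profiled + 152 unprofiled).
windows = list(backtest_windows(16952, lookback=150))
print(len(windows), "groups;", 16952 - windows[-1][2], "newest draws left unclassified")
```

Because the start point is the oldest usable row, every group of 7 lines up exactly with the draw history, and whatever is left over at the newest end is simply too short to form a full group.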

What I have observed with random sampling is a confidence level of maybe 95% for the mid day pick 3 and 100% for the evening pick 3. The evening is still a mechanical ball draw with the mid day being all PRNG.

As stated earlier, the novelty of the approach so far is being able to almost guarantee a reduction in numbers to pick from to contain at least one draw that comes from the neutral set, after eliminating the hots and colds. So instead of trying to pick 1 combo in 1,000 it would allow me to pick 1 combo from around 300.

This is the base setup for the system... safe reduction... it does not hold for one draw, or two... but it holds for 7. That 7 was found by experimentation. This number may increase when moving to the pick 5, but that game, if the cycle remains 7 games, could narrow the field of selection down from 100,000 to 30,000 or less... that will not be known until this phase is complete.

Phase 2 will also be aided by the back test as percentages from the NNN draws can be analyzed. Here patterns will also be sought. If I can see that the majority of NNNs are based on a specific percentage, that is the key for picking 1 combo from the hundreds of possibilities.

The goal, as with any system I have ever devised, is one best guess... no wheels or trapping, just one pick. Because I take the column at a time approach, there are no interdependencies in positions, therefore the code scales automatically based on the input file. Going from pick 3 to pick 5 simply involves changing the input file name.

Ultimately jackpot style games will require a major rewrite because expectancies and number pools change. Wouldn't it be interesting though if a duration cycle can be found where these jackpots produce NNNNNN or NNNNN + N hits and ~70% of those combos could be eliminated?

Entry #408

Interesting exchange with Chad G. Petey provides some direction...

After having a discussion about what I had noticed, I have been given a next step... verify the results...

I interpret that to mean how many times a draw history can be back tested to see if those neutral numbers make up one in EVERY seven draws, or if there are gaps... I am working on this now. GPT was quick to offer assistance, but I made it this far coding solo, so I passed on that.

After the code is finished, set it up to run on ALL AVAILABLE DATA SETS... pick 2 through pick 5 is what I am hearing.

I will start from the top of the list (first 107 draws, starting in 1977) for the pick 3 eve, and follow up with all of the others.

The reason it gave to run such thorough tests is to see if I can develop a confidence level. When using the classification boundaries I set, is the scenario true 100% of the time? 95%? 90%?

7 draws is the observed range for the pick 3 evening, but this can vary, such as the pick 3 day being 8 draws... and this can totally change when adding or subtracting columns.

For whatever reason, GPT was "excited" by the discovery, appreciating the set up and reasoning behind the tests I ran so far... up to saying if the observations hold, it might be worth it to consider publishing the results... I think Chad may have still been dizzy from 4/20...

In the end, it mostly brought up ideas I had already considered, but the full back test was something I can work out and accomplish, so not a total waste of time.

Entry #407

Now may be the time to turn to AI to discuss interpreting output...

I had to come up with an idea, and I did. (Too many to list over the past 20+ years)

I had to write the code to implement that idea, because that is the only way to truly understand what is being done, also did that.

I am not having much luck in interpreting the code output past spotting the 7 day cycle of NNN classified draws...

Here might be a good time to see what insight Chat GPT could offer...

I can generate data, I can understand the significance, but I am missing the crucial next step of taking that output and turning it into actionable intel.

For added complexity, the word "lottery" will be avoided at all costs... iterations instead of draws, values instead of ball numbers, etc...

This should be a fun exercise, because it is a discussion that will not have anything to do with code, and will instead focus on statistics and truly understanding what the data is trying to say. 

Worst case scenario, it is another dead end, best case... new leads to follow!

To the future...

Entry #406

What I am looking for, simplified...

What I have... a script that can look at short term draws and classify them as hot, cold and neutral. Why? Because eliminating the hots and colds shows that you can reduce the numbers to put into a combo by around 70% and still see a winner from that group within 7 draws.

I created the same exact script, only using follower data (what numbers follow the last draw) instead of just counting how many times each number was drawn.

As a first step in reduction, I have tested it at many places in history and can see that reducing the set of combos from 1,000 to about 300 is a good first step, but where to go next?

THAT is what I am looking for! How to go from 300 possibilities down to 1.

I thought that I could find some correlation in the hots, colds and neutrals between both functions, but they don't hold over multiple tests like the first step does. Either system reduces the picks and has winners in those picks in the short term 7 draw window, but the raw frequency does better and sometimes has up to 4 draws within that group made up of the numbers between hot and cold.

If I cut the high and low neutrals, then I eliminate hits in the middle of the neutral range. If I cut the middle range, then I miss matches in the median area... it is still just as random in the middle as it is using all possible numbers. And still too expensive to play.

Outside of random selection, I am out of ideas to further move towards shrinking that set to a playable size.

Entry #405

Not finding correlations yet...

I am not surprised that there were no apparent correlations between the raw frequency and the follower frequency. Though classification is still about 70% neutral, when looking at the same draw between the two functions, they are rarely lining up as NNN on both.

Now it is time to grab some paper and start rearranging the data and looking for clues. Could a spreadsheet do the same? Sure, but there is something about actually writing stuff down that helps connect to the data. If something looks useful during the manual process, it is most likely simple to automate, but that part is for later.

With the script, I am passing in variables, so it is worth tweaking those, and much easier since I only have to change a variable to call the functions with a new setup. The exciting find of a "scaling factor" to control how much of the standard deviation is used will be helpful in "tuning" the classifiers between the 2 functions.

This factor I am calling neutral_bandwidth, as it scales the classification thresholds by multiplying the standard deviation by a scaling factor. If st_dev = 2, then setting this variable to 0.7 effectively reduces the st_dev to 1.4. By the same token, using 1.3 as a bandwidth factor increases the same st_dev to 2.6.
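A sketch of how a factor like neutral_bandwidth can plug into a classifier. The cutoffs here (mean plus or minus the scaled standard deviation) are an assumption for illustration, as are the function name and the sample counts; the real script may set its boundaries differently.

```python
from statistics import mean, stdev


def classify_counts(counts, neutral_bandwidth=1.0):
    """Label each digit H, C or N from its count in one column.

    Assumed rule: hot is above mean + band, cold is below mean - band,
    where band = st_dev * neutral_bandwidth.
    """
    values = list(counts.values())
    center = mean(values)
    band = stdev(values) * neutral_bandwidth
    labels = {}
    for digit, count in counts.items():
        if count > center + band:
            labels[digit] = "H"
        elif count < center - band:
            labels[digit] = "C"
        else:
            labels[digit] = "N"
    return labels


# Hypothetical counts for one column over 150 draws (they sum to 150).
column_counts = {0: 15, 1: 22, 2: 9, 3: 14, 4: 16, 5: 13, 6: 18, 7: 15, 8: 12, 9: 16}
wide = classify_counts(column_counts, neutral_bandwidth=1.0)
narrow = classify_counts(column_counts, neutral_bandwidth=0.7)
print(sum(v == "N" for v in wide.values()), "neutral at 1.0")
print(sum(v == "N" for v in narrow.values()), "neutral at 0.7")
```

A smaller factor narrows the neutral band, so more digits fall out as hot or cold; a larger one widens it.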

I found this necessary, because even though the numbers have the same chance of being drawn with both functions, they don't seem to distribute in exactly the same way.

This part may take some time...

Entry #404

Adding a simple statistic, interesting findings between methods...

It looks like the neutrals in the raw frequency do not necessarily line up with the neutrals in follower frequency.

I am specifically looking at the case where the raw frequency of a draw is classified as NNN. Sometimes the follower frequency is classified as NNN, but also there can be colds and hots mixed in for the same classified draw.

I already count the number of followers for each column, so the simple add will be getting the percentage of the number of followers to the number of follower samples. If I have 1500 samples and the followers of the last draw first position was 150, then that would be the expected 10%. Maybe the variance in this number might help to decide which classification of follower might end up matching the neutral classification in the raw frequency...

Trying to build guidelines so that there is less guesswork in making a pick.

I am wondering if there is some other way to apply a frequency count so that a third function can be written to sort of "triangulate" the whole thing. Free styling at this point.

Also, I have to cook up a back test routine and capture the data to get a read on long term trends with the follower frequencies matching up to raw frequency data. I have the rough outline... 

1. Set the offset to 7, capture that classifier data to a .csv file. In this way, I have something that exactly matches the draw history.

2. Increment the offset by 7, repeat, but only capture the first 7 in the classifier data.

3. Repeat this until the 1500 draws in the follower frequency run out of room at the start of draw history.

4. Isolate the classifiers in each and put into a spreadsheet where the classifier combo for each draw can be concatenated... such as NNN and NCH would be written as NNNNCH.

5. Find the most frequent combined overall classifier in the resultant list.
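Steps 4 and 5 above could also be done in code rather than a spreadsheet. A small sketch, with made-up labels and a hypothetical function name:

```python
from collections import Counter


def combine_labels(raw_labels, follower_labels):
    """Concatenate the raw and follower classifier for each draw, e.g. NNN + NCH."""
    return [raw + follower for raw, follower in zip(raw_labels, follower_labels)]


# Hypothetical classifier output for four draws, NOT real results.
raw = ["NNN", "NNN", "HNN", "NNN"]
followers = ["NCH", "NNN", "NCN", "NCH"]

combined = combine_labels(raw, followers)
top_label, top_count = Counter(combined).most_common(1)[0]
print(top_label, top_count)
```

`Counter.most_common()` with no argument returns the whole ranking, which is handy for seeing the runners up as well.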

The hope is to have a fair chance at making one good guess by picking from the neutral list of raw frequency and matching it to the right set from the followers. Hopefully this time I am asking the right questions...

Getting bored of not playing anything so I am probably going back to a QP on the PA match 6 until I am ready for live tests. Cost is the same at $14/week.

I will be making a new version for writing output, because V3 works and I don't intend on losing it... so V4 will be the back test version. Important that I kept V2, because it allowed me to verify the process in V3.

Before the Chat GPT era existed, we used to troubleshoot with Stack Overflow. That is where I found my issue with the Pandas series and how to resolve it. Old school!

Entry #403

A breakthrough! Script works.

Now I have the data I was looking for from the start. Learned a bit about getting expected output from the followers... even though there were 1,500 draws, the data of followers was limited to the expectancy, so the formula was not to calculate by the sample size, but rather the sample size divided by 10.

I also learned that while direct standard deviation works for raw frequency, it needs to be constricted a bit for the follower frequency to produce similar results... so that was as easy as st_dev x 0.8

Where I went wrong was in the type... rather than using Pandas series for collecting the data to output, I just needed a list! That problem plagued me for days.

To validate the data, I compared the output of the previous version to make sure the counts agreed, and they did.

I may not be any closer to the winner's circle, but I do enjoy being able to take an idea and completely realize it in code from a clean sheet. Even the bug squishing was enjoyable ultimately because it is an exercise in real world problem solving, and there are no guides or documents for lottery analysis software, you are on your own in this domain.

Now for the really hard part, finding correlations in the output. This will take far longer than writing the script, and there may not be a solution, but at least I have some busy work to keep engaged with the hobby.

Entry #402

Stuck at classifying followers.

Strange error that I am trying to solve. When I put in a diagnostic line to determine the data type of the output, both functions return the type as a Pandas Series.

This was as simple as 

print(type(distribution_list))

Here is the problem... in one function it works, in the other function it tells me the list is unhashable and I should probably change type to a tuple... but that is wrong. Both lists contain similar data from similar calculations...
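For readers who have not hit this one before, here is the general shape of an "unhashable" error in plain Python. This is NOT a diagnosis of the function above, just an illustration of one common way the message appears: anything that has to act as a dictionary key or set member (which is what counting does under the hood) must be hashable, and lists are not.

```python
from collections import Counter

# Hypothetical data: a list whose items are themselves lists.
rows = [[1, 2], [1, 2], [3, 4]]

try:
    Counter(rows)                    # each inner list would have to be a dict key
    message = ""
except TypeError as err:
    message = str(err)               # Python reports the offending type here

# Two ways around it: count the individual values, or count rows as tuples.
flat_counts = Counter(value for row in rows for value in row)
row_counts = Counter(tuple(row) for row in rows)
print(message)
print(flat_counts, row_counts)
```

Which fix is right depends on what is supposed to be counted, the single values or the whole rows.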

It SHOULD work, but does not.

One thing I had noticed while looking at the output of the follower numbers that coincided with the neutrals of the other function is that there are more followers considered Cold matching with raw frequency numbers classified as neutral... could that be the correlation? In the raw frequency neutral set AND in the follower frequency cold set?

I will not know until I solve the problem. Been frustrating at times, but this is really the "fun" part of coding, solving problems!

Worst case scenario, post the function to Chat GPT and have it find where I went wrong. You would be surprised at how much better LLMs work at coding well if you give them sample code. However... that is how coders get lazy. I try DIY first. I even try to keep libraries to a minimum if there is a way to code with straight Python. I started using pandas because it solves many problems out of the box with less code and high utility vs. the huge amount of code to do the same thing. For me, I could not imagine NOT putting lottery data into a Pandas data frame.

I thought the conversion would be easy because I am doing the same thing with the same data type... nothing is ever easy...

Entry #401

The thought behind choosing draw samples

There is one fact about the discrete uniform distribution formed by lottery results in a pick 3 (or 2/4/5): each digit has a 10% chance of being selected in each position.

150 draws is the sample size I use for classification of raw frequency data. Why? Because given the fact above, each digit is expected to be drawn 15 times in each position (10% of 150). The variance can be easily seen by how much above or below 15 any digit has been drawn.

1500 is the sample size for follower frequency data. Why? Because followers are only counted after draws that match the last draw, which is about 10% of the sample (roughly 150 of the 1,500). At that number each digit is still expected to show up 15 times. The variance is still easy to spot.

Since each set, though counting a different metric, is expected to show 15 draws per number, the comparison holds at 15 on the same 10% expectancy.

If I wanted to observe more or less, the variables would have to be adjusted. For 10 expected appearances they would be set to 100 and 1,000 and for 30 they would be set to 300 and 3,000.
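The relationship between the expected appearances and the two sample sizes, as a tiny sketch (the function name is hypothetical):

```python
def sample_sizes(expected_appearances, digits=10):
    """Return (raw_sample, follower_sample) for a target expected count per digit."""
    raw_sample = expected_appearances * digits    # each digit has a 1-in-10 chance
    follower_sample = raw_sample * digits         # only ~1 in 10 draws matches the last draw
    return raw_sample, follower_sample


for target in (10, 15, 30):
    print(target, sample_sizes(target))
```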

The number of classification draws is set to see recurring patterns in as short a period as possible. In the pick 3 evening draw in PA, that works out to 7 draws, which also coincides with how many draws they allow for advanced play.

In order to back test, I simply increase this variable, but only consider the first 7 classification results. This is the key to automation should I want to profile the entire draw history.

The mid day game has an observed period of 8 draws. While most NNN draws are found within 7, it does not always hold like the evening draw.

Modifying the follower function to output similar to the raw frequency function is nearly done.

Since neutrals are usually 70% of the observed frequency, the reduction in possible pick 3 combos is 7x7x7 = 343, thereby safely eliminating 657 straight combos... if extended to the pick 5, using the average of 70%, 7x7x7x7x7 = 16,807, eliminating 83,193 of the possible 100,000 combos... but is it done safely?

It depends on the NNNNN period in the classification... if it is longer than 7 to 14 days, then not so much. I am not in full development of the pick 5 application of this script, only focusing on the pick 3 until something works.

Up next is finishing the modification of the follower function to give similar output as the raw function. Already have the standard deviation and quartile output.

Then the real fun begins, finding ANY correlations between the 2 data sets... with the full knowledge that there may not be any at all.

Entry #400

Coding update.

In the raw frequency function, quartiles were literally 2 lines of code! Thanks to the statistics library, you call its quantiles function on your list and use n=4. This gives you 25% (Q1), 50% (Q2, Median) and 75% (Q3) all on the same line. The second line was added so I could round the output to 2 decimal places for readability.
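Roughly what those two lines can look like. The frequency list is made up, and this uses the default settings of `statistics.quantiles`, which may not match the options used in the real script:

```python
from statistics import quantiles

# Hypothetical frequency counts for digits 0-9 in one column (NOT real data).
frequency = [15, 22, 9, 14, 16, 13, 18, 15, 12, 16]

quartiles = quantiles(frequency, n=4)                # three cut points: Q1, Q2 (median), Q3
q1, q2, q3 = (round(q, 2) for q in quartiles)        # rounded for readability
print(q1, q2, q3)
```

The list passed in has to be the counts, not the digits themselves.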

That was the easy part... changing the follower function to allow classification of the output to match the functionality of the raw frequency data is proving much harder... because counting followers is a 2 step process where you first get a list to show every number that followed the last draw, THEN you have to count that list. If I try to make it a one shot function, it fails when validating the data. So basically the current project is adding the classification part after the second part completes but before it fetches the next list... getting there slowly, but surely.
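The 2 step process, sketched for a single column. This is an illustration under assumptions, not the actual function: the column is taken to be a plain list of digits ordered oldest to newest, and the names are hypothetical.

```python
from collections import Counter


def followers_of(column, last_digit):
    """Step 1: list every digit that came right after `last_digit` in this column."""
    return [nxt for prev, nxt in zip(column, column[1:]) if prev == last_digit]


def follower_counts(column):
    """Step 2: count the step 1 list, using the newest draw as the trigger digit."""
    followed = followers_of(column, column[-1])
    counts = Counter({digit: 0 for digit in range(10)})   # keep digits that never followed
    counts.update(followed)
    return counts, len(followed)


# Hypothetical column of drawn digits, oldest first (NOT real draw data).
column = [3, 1, 3, 5, 3, 1, 7, 3]
counts, samples = follower_counts(column)
print(dict(counts), samples)
```

Classification would then run on `counts` once step 2 finishes, before moving on to the next column.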

Looking at the two sets of data together has no immediate correlation presenting itself, though adding quartiles is showing that the neutral numbers drawn seem to sit close to one of the 3 quartiles, more so than the standard deviation. SD is the classifier, but the quartiles are where I feel some consistency will be found.

One funny error was that I accidentally hooked the quartile calculation to the list of digits rather than the frequency... the output there is ALWAYS the same... 1.5, 4.5 and 7.5 when your list is 0 to 9. Fortunately I recognized exactly where I messed up after I saw the output.

Entry #399

Simple version control

As I begin coding, I start with the V1 working script and make a copy of it (save as whatever_V2)

I also add to the top comments what I am implementing, sort of a checklist. Rather than erase the comment, I simply change "implementing" to "implemented".

This helps tell future forgetful me what present forgetful me was intending.

I have some scripts in double digit versions, but I always know where I was in the development process.

I do something similar with all of the spreadsheets that hold the game history as well. There is the current spreadsheet, and there are also older versions that were analyzing one thing or another. The master data is actually stored in CSV files so they can be easily transferred into any new sheets AND read by Python scripts.

Also, I back up everything. Once at the end of every coding session, including draw history updates, to a second internal hard drive, and separately on a monthly basis to a thumb drive.

Before I started up again, I did go through the painful process of losing everything when the old laptop died. Mostly power ball stuff and a ton of Vtrac sheets.

The version control is super important because you always have a working version of a script as you try to implement new features.

Another way to do this is with free version control software like git, but for the simplicity of what I do, that seems like overkill.

Entry #398

Moving forward with coding changes.

Regardless of whether the mid day PA pick 3 catches an NNN on its last draw tomorrow, I am moving forward with the proposed changes. The only difference with regard to tomorrow's result is whether or not I ditch the mid day game and move forward only on evening draws. NNN in 7 draws was observed on the evening data set every time. The day game was spot checked and HAD shown similar results. Remember, the day game is RNG and the night game is a ball draw... there may be a difference here worth considering. The code base is super flexible with input files, so processing 2 games or 1 is simply done in seconds. Cutting days cuts the play cost in half, so tomorrow's mid day result will tell. To be fair, MOST observed weeks in back testing mid day had the NNN fall later in the list than the evening game, with last week (before the test) arriving on day 7.

Still feel "onto" something here... blocking out 4 hours of coding and testing on my next day off (Tuesday).

Entry #397