Thursday, September 16, 2010

Binary addition of very long numbers


"To non-mathematicians we point out that it is a commonplace of modern algebra for symbols to represent concepts other than the numbers", from An Algebra for Theoretical Genetics, Claude Shannon's 1940 PhD thesis at MIT


Computers represent integers as a series of bits, 32 bits in a row on my machine. Some programming languages offer longer integers. For example, in C++ the longer data types(1) use two of my 32-bit chunks for a total of 64 bits. This allows representation of 2^64 different integers -- perhaps the integers from 0 to 18,446,744,073,709,551,615.

I recently found a good excuse (the Collatz conjecture) to add integers that are much longer than this, perhaps a thousand bits or digits long. No data types this big are provided, and built-in math operations like addition and multiplication only work on the standard data types.

I could have found an arbitrary precision library for the programming language of my choice, but I know it would take me a while to learn the ins and outs of how to use the library. So I wrote my own function that does addition of arbitrarily long strings of characters. This was easy, fun and satisfying -- the sky's the limit when you can add numbers greater than 10^1000. The entire universe contains fewer than 10^100 fundamental particles, so I'll need something larger to count. Maybe I can count the number of ways that the particles of the universe can be arranged. Maybe not; do I have enough memory to store my long strings? After all, a bit of memory must require a few particles. Or does it?

It was easier than I thought. I used strings to represent the integers, one-dimensional arrays of the character (char) data type. They are easy to read (the number 1 reads as '1'), and easy to manipulate with built-in string functions. It is inefficient, because it requires 8 bits to represent each character and a function that can compare 8-bit characters, but that wasn't a concern. It only takes a few thousand operations to add a pair of 1000-bit numbers, and that is plenty fast for my purposes. I figure each addition takes about 0.0001 seconds, or 10,000 additions per second.

What's the algorithm? We learned this in third grade: line up the two numbers, fill in imaginary 0's in empty spots, then, starting from the right, add the single digits (we memorized these single-digit sums in second grade), add any carry digit, and carry a 1 to the left if the sum is 10 or more:

   11   (carry digits)
  1359
+ 0248
  ----
  1607

It would be no big deal to implement this algorithm on a string of characters from the set ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'], but I wanted to see the binary numbers and only needed two characters, '0' and '1'.

The algorithm for binary integers is slightly simpler because the addition table is shorter; we only need to know that:

if there is no carry bit (0):
0 + 0 = 0 carry 0
0 + 1 = 1 carry 0
1 + 0 = 1 carry 0
1 + 1 = 0 carry the 1

and if there is a carry bit (1):
0 + 0 = 1 carry 0
0 + 1 = 0 carry the 1
1 + 0 = 0 carry the 1
1 + 1 = 1 carry the 1

Given two strings that represent binary integers, for example '11011' and '110110', which represent the decimal integers 27 and 54:

 (1111100) carry bits
  0011011
+ 0110110
  -------
  1010001

The pseudocode is something like this:

{ given string arguments str1 and str2
  that represent binary integers

  // get the right-hand character of each input string
     b1 and b2
  // store the results, starting with an empty string
     answer = ''
  // and keep track of the carry bit, starting with 0
     carry = '0'

  while not past the left end of the longer string (treat a missing character in the shorter string as '0') {
     
     If carry == '0' {
          if b1 == '0'{
               if b2 == '0' {
                    append '0' to the left of answer
                    carry = '0'
               }
               else b2 == '1' {
                    append '1' to the left of answer
                    carry = '0'
               }
          }
          else b1 == '1' {
               if b2 == '0' {
                    append '1' to the left of answer
                    carry = '0'
               }
               else b2 == '1' {
                    append '0' to the left of answer
                    carry = '1'
               }
          }
     }
     else carry == '1' {
          if b1 == '0' {
               if b2 == '0' {
                    append '1' to the left of answer
                     carry = '0'
               }
               else b2 == '1' {
                    append '0' to the left of answer
                    carry = '1'
               }
          }
          else b1 == '1'{
               if b2 == '0' {
                     append '0' to the left of answer
                     carry = '1'
               }
               else b2 == '1' {
                     append '1' to the left of answer
                    carry = '1'
               }
          }
     }
                    
         // get the next two bits from the two input strings
         next b1 and b2

   }

   // a final carry of '1' becomes the leftmost bit of the sum
   if carry == '1' { append '1' to the left of answer }

   return answer
}
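
Here is a minimal C++ sketch along these lines (a sketch only: the name addBinary is arbitrary, and it converts characters to small integers for brevity rather than using the pure if/else comparisons above):

    #include <iostream>
    #include <string>

    // Add two strings of '0' and '1' characters and return their sum as a string.
    // Sketch only: assumes the inputs contain nothing but '0' and '1'.
    std::string addBinary(std::string a, std::string b)
    {
        // pad the shorter string with leading '0's so both have the same length
        if (a.size() < b.size()) a.insert(0, b.size() - a.size(), '0');
        if (b.size() < a.size()) b.insert(0, a.size() - b.size(), '0');

        std::string answer;
        char carry = '0';
        // walk from the rightmost character to the leftmost
        for (int i = static_cast<int>(a.size()) - 1; i >= 0; --i) {
            int sum = (a[i] - '0') + (b[i] - '0') + (carry - '0');
            answer.insert(answer.begin(), (sum % 2) ? '1' : '0');  // append to the left
            carry = (sum >= 2) ? '1' : '0';
        }
        if (carry == '1') answer.insert(answer.begin(), '1');      // final carry

        return answer;
    }

    int main()
    {
        std::cout << addBinary("11011", "110110") << "\n";  // 27 + 54 -> 1010001 (81)
    }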

I was introduced to programming in the 1970s, in an era dominated by electronic calculators. For a long while I thought of computation as fundamentally about numbers. How could computers do anything with text, when they understand nothing about the meaning of letters and words? But I'm slowly changing my point of view. Now I think about calculation as pure symbol manipulation, where computers understand nothing about the meaning of numbers or text.

Any meaning, with respect to numbers or text, is supplied by control statements (while..., if...else) and the conditions they check (0 or 1, true or false, same or different). This comparison of two bits, with the result of same or different, is one of the few core things a computer knows how to do without software; it is hardcoded into the CPU's hardware and microcode.

This program has made symbol manipulation more concrete for me: it takes letters (abstract symbols), adds them, and outputs letters (more abstract symbols). I could replace every '1' with 'a' and every '0' with 'b', and the program would still add two binary numbers:

    aabaa
+  aabaab
  -------
= ababbba
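
To make that claim concrete, a small variation of the sketch above takes the two symbols as parameters, so the adder literally never knows which character plays the role of '0' (the name addSymbols is just for illustration):

    #include <iostream>
    #include <string>

    // Add two "numbers" written with arbitrary symbols for zero and one.
    // Illustrative sketch: the loop only compares characters and follows the carry table.
    std::string addSymbols(std::string x, std::string y, char zero, char one)
    {
        if (x.size() < y.size()) x.insert(0, y.size() - x.size(), zero);
        if (y.size() < x.size()) y.insert(0, x.size() - y.size(), zero);

        std::string answer;
        char carry = zero;
        for (int i = static_cast<int>(x.size()) - 1; i >= 0; --i) {
            int ones = (x[i] == one) + (y[i] == one) + (carry == one);
            answer.insert(answer.begin(), (ones % 2) ? one : zero);
            carry = (ones >= 2) ? one : zero;
        }
        if (carry == one) answer.insert(answer.begin(), one);
        return answer;
    }

    int main()
    {
        std::cout << addSymbols("11011", "110110", '0', '1') << "\n";  // 1010001
        std::cout << addSymbols("aabaa", "aabaab", 'b', 'a') << "\n";  // ababbba
    }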

I could just as well replace the letters with two pictures and add those. The meaning of the algorithm would be the same. I think I will choose black and white square images. I'll call them pixels, and add strange arrays of pixels with ambiguous meaning.




(1) These types have many names: doubleword, longword, long long, quad, quadword, int64, or in specific languages, C/C++ long (on 64-bit Unix), C/C++ long long, C/C++ uint64_t, int64_t, C# long, ulong, Delphi Int64, T-SQL bigint.

Sunday, September 12, 2010

Collatz conjecture and binary palindromes

The Collatz conjecture, or 3n + 1 problem, is about unexplained number patterns.
Take any natural number n. If n is even, divide it by 2 to get n / 2; if n is odd, multiply it by 3 and add 1 to obtain 3n + 1. Repeat the process indefinitely. The conjecture is that no matter what number you start with, you will always eventually reach 1.
One wonders if the conjecture is true or false, and why. Might the existence of a path to 1 be undecidable, where we can't know if the process will ever end? [In 2006, researchers Kurtz and Simon, building on earlier work by J.H. Conway in the 1970s, proved that a natural generalization of the Collatz problem is undecidable.]


But the appeal of the conjecture has more to do with the strange patterns that occur in the paths that numbers take on the way to reaching 1. The length of the path, the stopping time, is highly variable and anything but random. What's the pattern?

It is easy to find the stopping time for any small integer. A short Matlab program was able to find the first 10,000,000 stopping times within a few minutes, without any optimization. Jacob programmed a Collatz path calculator in Scratch that shows the intermediate path values.
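
Here is a small C++ sketch of such a stopping time calculation (not the Matlab or Scratch programs mentioned above; it uses ordinary 64-bit integers, so it only handles modest starting values):

    #include <cstdint>
    #include <iostream>

    // Count the Collatz steps needed to reach 1 from n.
    // Sketch only: 64-bit arithmetic is fine for small starting values,
    // but would overflow for the thousand-bit numbers of the post above.
    int stoppingTime(std::uint64_t n)
    {
        int steps = 0;
        while (n != 1) {
            n = (n % 2 == 0) ? n / 2 : 3 * n + 1;
            ++steps;
        }
        return steps;
    }

    int main()
    {
        std::cout << stoppingTime(27) << "\n";       // 111
        std::cout << stoppingTime(3732423) << "\n";  // an extreme stopping time
    }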

The first path that begs for an explanation is 27. Integers lower than 27 have stopping times of at most 23 steps, but at 27 the number of steps jumps to 111! What's so special about 27, or about other jumps in stopping time? Is it because 27 is a power of 3 (1000 base 3)? No; other powers of 3 are easily checked, and they show no obvious extremes or pattern.

Some patterns are easily explained. Any power of 2 (1000... in base 2) will have a small stopping time, because repeated division by 2 (removing 0's from the right repeatedly) brings it to 1 directly, without increases. Similarly, there are special binary palindromes (bit sequences that read the same backwards and forwards) that have short stopping times: any integer of the form 101010...1, with alternating 0's and 1's, immediately precedes a power of 2, so it will also have a small stopping time. The reason for this is that 3n + 1, where n is of this form, is a power of two:


    101010...10      2*n
+    101010...1        n
    -----------
=   1111111...1      3*n
+             1        1
    -----------
=  10000000...0      3*n + 1

Do all binary palindromes have short paths? No. In fact, 27 = 11011 base 2 is a binary palindrome and has a very long path relative to its length. Another binary palindrome, 22 bits long:
3732423 base 10 = 1110001111001111000111 base 2
also has an extreme stopping time, larger than that of any integer before it.

Is this a coincidence, a long stopping time for some palindromes? It seems unlikely. There are 2^11 22-bit palindromes (one for each choice of the leading 11 bits) among the 2^22 22-bit strings, so palindromes occur for only 1 in 2^11 = 1 in 2048 strings. Requiring the first bit to be 1 doesn't change the odds: 2^10 palindromes among 2^21 22-bit integers is still about 1 in 2000. 3732423 is only the 10th extreme stopping time integer. That is long odds -- out of 10 randomly picked 22-bit integers, there is less than a 1% chance that any one of them is a palindrome.
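
A quick sanity check of those counts (an illustrative sketch; the helper name isBinaryPalindrome is mine):

    #include <cstdint>
    #include <iostream>
    #include <string>

    // Is the base 2 representation of n (no leading zeros) a palindrome?
    bool isBinaryPalindrome(std::uint64_t n)
    {
        std::string bits;
        for (; n > 0; n /= 2) bits += (n % 2) ? '1' : '0';  // least significant bit first
        std::string reversed(bits.rbegin(), bits.rend());
        return bits == reversed;
    }

    int main()
    {
        std::cout << isBinaryPalindrome(27) << "\n";       // 1: 11011 reads the same both ways
        std::cout << isBinaryPalindrome(3732423) << "\n";  // 1: 1110001111001111000111

        // Count the palindromes among the 22-bit integers (2^21 through 2^22 - 1).
        int count = 0;
        for (std::uint64_t n = 1u << 21; n < (1u << 22); ++n)
            if (isBinaryPalindrome(n)) ++count;
        std::cout << count << " of " << (1u << 21) << "\n";  // 1024 of 2097152, i.e. 1 in 2048
    }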

Is it also a coincidence that this long stopping time palindrome (1110001111001111000111) also has long palindromes within its two halves (111000111)? Is there a way to predict which palindromes will have long stopping times?

I see the algorithm as a symbolic dynamics system on binary words, the base 2 representations of the integers. Each step transforms a string of 1's and 0's into a new string of 1's and 0's, and the process is repeated until there is a single 1 in the string. For example, multiplication by 2 just shifts the bits left, appending a 0, and division by 2 removes a trailing 0. The conjecture is that any string of 1's and 0's will end up as the single-letter word '1'.

Why are palindromes special with respect to these string operations?

To do: Write a program that shows every step of an integer's stopping sequence as a binary word.
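
One possible sketch of that program in C++ (again using ordinary 64-bit integers rather than the long character strings of the post above, so it only works for small starting values):

    #include <algorithm>
    #include <cstdint>
    #include <iostream>
    #include <string>

    // The base 2 representation of n, most significant bit first.
    std::string toBinaryWord(std::uint64_t n)
    {
        std::string bits;
        for (; n > 0; n /= 2) bits += (n % 2) ? '1' : '0';
        std::reverse(bits.begin(), bits.end());
        return bits.empty() ? "0" : bits;
    }

    int main()
    {
        std::uint64_t n = 27;  // the first integer with a surprisingly long path
        while (n != 1) {
            std::cout << toBinaryWord(n) << "\n";  // the current binary word
            n = (n % 2 == 0) ? n / 2 : 3 * n + 1;
        }
        std::cout << toBinaryWord(n) << "\n";      // ends at the one-letter word 1
    }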

[ From Collatz conjecture as a tag system
The conjecture can also be mapped to a different symbolic dynamics system, a very short 2-tag system with production rules a -> bc, b -> a, c -> aaa. In this system, the positive integer n is represented by a string of n a's, and iteration of the tag operation halts on any word of length less than 2. The Collatz conjecture equivalently states that this tag system, with an arbitrary finite string of a's as the initial word, eventually halts. See Example: Computation of Collatz sequences for a worked example.
 Adapted from De Mol, "Tag systems and Collatz-like functions", Theoretical Computer Science, 390:1, 92-101, January 2008 (Preprint Nr. 314)   ]
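
A tiny C++ sketch of that tag system, transcribing the rules directly (a -> bc, b -> a, c -> aaa), just to watch it run:

    #include <iostream>
    #include <map>
    #include <string>

    int main()
    {
        // Production rules of the 2-tag system described above.
        std::map<char, std::string> rules = {{'a', "bc"}, {'b', "a"}, {'c', "aaa"}};

        std::string word(27, 'a');  // the positive integer n is written as n copies of 'a'
        while (word.size() >= 2) {
            std::string production = rules[word[0]];  // read the first letter
            word.erase(0, 2);                         // remove the first two letters
            word += production;                       // append that letter's production
        }
        std::cout << "halted on \"" << word << "\"\n";
    }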