# Effective data representation

Excerpt from the book “The algorithm of the universe (A new perspective to cognitive AI)”

To introduce the concepts, we need to start with the fundamentals. The first fundamental concept that we need to (re)consider is data representation. In fact, we may also need to redefine what “data” means. For now, though, I will continue to call it data. It is not “information”. It is not “knowledge”. It is not “conclusions or intelligence”. It is similar to the “bits and bytes” that can be combined to carry a meaning. Hence, I will call it data.

We need to acknowledge that computers, with their current digital data representation of bits and bytes, were created with “mathematical computations” in mind. We are currently trying to write systems that mimic “intelligence and logic” with that same bit and byte representation. We have never asked ourselves: can the same data representation and processing capability that is present in our computers be effective for both a mathematical computation application and an intelligence and logic application? If not, what difference in data representation is needed between the two? To judge the effectiveness of a data representation, we need to understand how the two use cases differ. Comparing them, we find that their requirements for data vary in many ways, as I explain below.

In a mathematical computation based system, we already know the algorithm, and many of the computations are wired into the processor of the computer, such as increment, decrement, add and so on. In an intelligent and logic learning based system, we do not necessarily know the algorithm upfront. It has to be learnt. The data that we have to learn from is just “ONE set of possibilities” among many other sets of possibilities. We are searching for intelligence and logic within the data that can be applied to other similar or dissimilar data sets. In such a system, the “intent and context of the data” is more important than the “accuracy” of the value represented by the data.

Mathematical computations are based on operations on discrete, single-valued data, and relations are established by well-defined mathematical functions that can be applied and computed by replacing known variables with different discrete values. So, if I wanted to write a linear regression function, it has an equation of the form Y = a + bX, where ‘a’ is the bias, ‘b’ is the weight, ‘X’ is the independent variable and ‘Y’ is the computed (dependent) variable. Here, discrete values of ‘X’ will produce related discrete values for ‘Y’.

Intelligent and logic based systems require continuous and highly related data with unknown variables. The continuity also represents the strength of the intent, and the relations depict the covariance of parameters in relation to each other. For example, even for an AI to drive a car on an empty stretch of road, the data representing the road needs to have continuity so that the car can follow the contours of the road steadily without missing curves. The continuity has to be broken only when an obstacle appears on the road, in which case the continuity of the obstacle needs to be present for the extent of the obstacle, after which it is replaced by the original road’s continuity. We can certainly treat the road as discrete, very closely sampled data points which we tie together to form the continuity, but we could still miss a curve that falls within a non-sampled interval, which will lead to incorrect conclusions or, at the minimum, to lane drifting. There are many ways to solve the problem, but they are all just workarounds for the continuity problem of the data representation.
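The missed-curve problem can be illustrated with a small sketch. The road shape, the `road_offset` function and the sampling steps are all hypothetical, chosen only to show how a coarse sampling can step over a short curve entirely. It is not a model of any real driving system.

```python
import math


def road_offset(s: float) -> float:
    """Hypothetical lateral offset (in metres) of the road centre at distance s.

    The road is straight except for a short curve between s = 12 and s = 18.
    """
    if 12.0 < s < 18.0:
        return 2.0 * math.sin(math.pi * (s - 12.0) / 6.0)
    return 0.0


def sample(step: float, length: float = 30.0) -> list[float]:
    """Sample the road at regular intervals of `step` metres."""
    count = int(length / step) + 1
    return [road_offset(i * step) for i in range(count)]


coarse = sample(step=10.0)  # samples at s = 0, 10, 20, 30
fine = sample(step=1.0)     # samples at every metre

# The coarse samples never land inside the curve, so the road looks straight.
print("largest offset seen (coarse):", max(coarse))
# The fine samples do see the curve, at the cost of many more data points.
print("largest offset seen (fine):  ", max(fine))
```

Sampling more finely shrinks the gap but never removes it, which is the sense in which such fixes are workarounds.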

Trying to represent continuous and related data with discrete values occupies a lot of space and requires intensive logic to maintain even the simplest relation, such as subsequent or sequential values. Again, we are in a domain where data is not always mathematical in nature. Hence, sequential values do not necessarily imply subsequent values. In computers, when we define data, it carries only one set of meanings that can be correctly associated with it. So, if we declare a variable of type byte and assign it the hex value 0x41, it only represents the number 0x41, or 65. If another program wants to interpret this 0x41 as an ASCII character, it can be interpreted as the character ‘A’. But that is about it. The value remains 0x41; it is merely interpreted differently, and can take on only, say, two discrete meanings. Now, let us add another byte and assign it the hex value 0x42. This again only represents the number 0x42, or 66, which can be interpreted as 66 or ‘B’. Intent-wise, these two bytes are sequential only when looked at in terms of the series of “integer values”, where 66 is the next value after 65, or in terms of the series of “ASCII characters”, where ‘B’ is the next character after ‘A’. When the intent is a meaningful word, ‘A’ and ‘B’ typically do not occur next to each other in a single word. Thus, sequential and subsequent have different meanings.
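The two readings of the same bytes can be made concrete with a minimal sketch. The variable names are illustrative only.

```python
# Two bytes holding the hex values 0x41 and 0x42.
raw = bytes([0x41, 0x42])

# Interpretation 1: a series of integer values.
as_integers = list(raw)

# Interpretation 2: a series of ASCII characters.
as_characters = raw.decode("ascii")

print(as_integers)    # [65, 66]
print(as_characters)  # AB

# The bytes are "sequential" only within each chosen series:
print(as_integers[1] == as_integers[0] + 1)                   # True
print(ord(as_characters[1]) == ord(as_characters[0]) + 1)     # True
```

In both cases the relation "next in the series" comes from the code that interprets the bytes, not from the bytes themselves.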

To represent and understand this relation, we need hardcoded logic in the system that forces the two allocated bytes to be co-located next to each other, either by changing the byte alignment of the class that declares the bytes, or by using an ordered array or some other such elaborate mechanism. Obviously, this ties us to the compiler, the language and all the other parameters that allow us to write such code. In any case, it boils down to the logic coded by the application, rather than the data representation automatically providing us with an inherent capability for relations. The discreteness in our data representation is not just a feature of the way the software is written. It propagates up to the software layer all the way from the hardware representation of data, and it impacts the processor that is processing the logic. Can such a representation be effective for intelligent and logic based systems, where even the simplest logic, such as knowing that a series of data is sequential, needs a lot of code to be written to establish it?

Let us compare this to the “data representation” present in the underlying truth, or what is called “The Brahman” in the Sanskrit literature, which forms the inner workings of reality. Currently, no science has a word or a description for the representation that exists in the underlying truth. Hence I am using the word “data”. We also do not have a word in English for what is called “The Brahman” in Sanskrit. “Brahma” translates to “create or creator”, thus Brahman is “that from which things can be created”. So, as I have said before, I will continue to refer to it as “The Brahman”.

As I have explained in the previous section, “The Brahman” is that continuous whole which is in a state of “an expanse of Shunya or non-existent nebulous something”. In this continuous whole, an “insignificant bit” of this Brahman can take on any property with a value. Hence, from that continuous Shunya, a bit of it can take on the state of activated or inactivated, and so becomes data. Many such insignificant continuous bits, activated for a specific property with different values, come together to form a concentration, which is considered as “continuous data” or a “series of data” of that specific property.

To understand this, let us look at “The Brahman” as a huge painting canvas which is ready to be painted. On this huge canvas, paint of many different colors can be applied in insignificant bits, which when “viewed” together become a whole painting. On this canvas, we find that some paint bits come together to form the main focused picture, such as a portrait of a person, while other paint bits form the background in which the focused portrait is present. The paint bits can thus be split into data that are active, forming the main focused picture, and data that are inactive, forming the background. Bits of the same color can come together to form a concentration of that color, or different colors could have melded or diffused into each other. There can be pieces within these painted bits that have been neither “activated” nor “inactivated” and still have the background canvas visible as is. The data representation that is present in “The Brahman” is similar to this.

There are some major differences between the way we represent data in computers and the way data is represented here. Here, we start with a continuous expanse that is then broken down into bits that represent data, as opposed to having discrete data representations that need to be combined to obtain the continuous expanse. Further, the bits of the continuous expanse can inherently be either active or inactive. This gives us a data representation that inherently has two states, independent of the value the bits take on. Along with this, data is data only if it is associated with a property type and a property value. But the most important piece is the “observer’s role in creating the active or inactive data and associating a property type and value to it”. As I explain later on, data of a specific triple (property type, property’s value, state) is present due to the presence of an observer associating the triple. The same continuous bit can take on different triples based on the observer. Thus each datum is inherently represented by the quadruple (observer, type, value, state).
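As a rough illustration only, the quadruple can be written down as a plain record type. The `Datum` class, its field names and the two example observers are assumptions made for this sketch. The text does not prescribe any concrete encoding, and a record of discrete fields cannot capture the continuity described above.

```python
from typing import NamedTuple


class Datum(NamedTuple):
    """Hypothetical record for the quadruple (observer, type, value, state)."""
    observer: str
    prop_type: str
    value: float
    active: bool  # the inherent two-state: active or inactive


# The same continuous bit, seen by two hypothetical observers, takes on
# a different (type, value, state) triple for each of them.
triples_by_observer = {
    "painter": ("colour", 0.8, True),
    "framer": ("distance_from_edge", 12.0, False),
}

data = [Datum(observer, *triple) for observer, triple in triples_by_observer.items()]

for datum in data:
    print(datum)
```

The point of the sketch is only that the observer is part of the datum itself, rather than something external that reads it.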

What are the advantages of such a “data” representation? We need to realise that, while the expanse is broken down into continuous bits because of the observer’s ability to create the illusion of discreteness, the original continuity present in the Brahman is not lost. So, a series of “discrete bits of data” with subsequent sequential property values can either be seen as just that, i.e., a series of discrete values, or be seen as an analog stretch from the start to the end of the series, with a gap separating this analog stretch from the next one. Thus, the same observer can interpret the same data in different modes based on the need, either by splitting it into a series of bits or by combining the series into continuous bits, depending on what is needed for interpretation.
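The two modes of reading a series can be approximated in conventional code, with the caveat that here the "stretch" view has to be computed by application logic rather than being inherent in the data. The `stretches` function and the sample values are hypothetical.

```python
def stretches(values: list[int], step: int = 1) -> list[tuple[int, int]]:
    """Group an ordered series into (start, end) stretches.

    Neighbouring values that differ by exactly `step` belong to the same
    stretch; any other difference is treated as a gap between stretches.
    """
    runs: list[tuple[int, int]] = []
    for value in values:
        if runs and value - runs[-1][1] == step:
            # Extend the current stretch up to this value.
            runs[-1] = (runs[-1][0], value)
        else:
            # A gap: start a new stretch.
            runs.append((value, value))
    return runs


series = [1, 2, 3, 7, 8, 12]

print(series)             # discrete view: [1, 2, 3, 7, 8, 12]
print(stretches(series))  # stretch view:  [(1, 3), (7, 8), (12, 12)]
```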

To understand this through a comparison: in computers, the members of a data structure are typically aligned to boundaries suited to their size, for faster access and optimization. So, a data structure that is declared with a short (2 bytes) and an integer (4 bytes) as its member variables typically has a size of 8 bytes, with two extra padding bytes added after the short variable. If a pointer to the storage location is traversed from the first byte to the last byte, it will not continuously encounter valid values that can be used by the program. But the data structure can be forced to have a 1-byte alignment when compiled. This forces the structure to have a size of 6 bytes. Here, the same memory location that stores the data structure can be interpreted as a structure with “short” and “integer” values, or as a series of 6 one-byte characters. This is similar to what is shown in the diagram below.
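The padding arithmetic described above can be checked with Python's standard `struct` module, used here as a stand-in for a compiled structure. The field values are arbitrary and chosen so that the bytes spell readable characters. The native size depends on the platform, and 8 is what common platforms report.

```python
import struct

# "@" uses the platform's native sizes and alignment: a 2-byte short,
# padding, then a 4-byte int. On common platforms this is 8 bytes.
native_size = struct.calcsize("@hi")

# "<" uses standard sizes with no padding (and little-endian byte order),
# comparable to forcing a 1-byte alignment: 2 + 4 = 6 bytes.
packed_size = struct.calcsize("<hi")

print(native_size, packed_size)

# Store a short and an int back to back in 6 bytes.
raw = struct.pack("<hi", 0x4241, 0x46454443)

# Interpretation 1: a structure with a short and an integer.
as_structure = struct.unpack("<hi", raw)

# Interpretation 2: a series of 6 one-byte characters.
as_characters = raw.decode("ascii")

print([hex(field) for field in as_structure])  # ['0x4241', '0x46454443']
print(as_characters)                           # ABCDEF
```

The same six bytes support both readings only because the padding was removed. With native alignment, the padding bytes would break the character view.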

This is exactly what the “data representation” in reality allows us to do. It also takes away the discreteness of the single bit/byte that we are forced to work with in computers. This is shown in the above diagram. We can see it as a continuous analog variation of values for a specific property type, which I have represented as a gradient of colors. Since this representation has embedded in it the contiguous blocks of data that are representative of series data, it allows us to fold a data series on itself. Given that data is just a quadruple of (observer, type, value, state), the observer observing the folded data through both the folds gets the various deltas across various values in the series, which can be represented as the same quadruple for the observer, as shown in the below diagram. If we keep moving the fold, we can get the deltas between all values in the series. It also allows us to accumulate the values easily to give us integrals represented in the same format as the data itself, i.e., a quadruple (observer, type, accumulated value, state).
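One possible reading of folding can be sketched on an ordinary list. This is an assumption-laden illustration, not the mechanism the text describes: the fold is taken to be a mirror placed between two neighbouring values, the hypothetical `fold_deltas` function pairs each value with the one lying on top of it after folding, and accumulation uses a running sum.

```python
from itertools import accumulate


def fold_deltas(series: list[int], fold: int) -> list[int]:
    """Fold the series between index fold - 1 and index fold.

    After folding, series[fold - 1 - j] lies over series[fold + j].
    Returns the delta for each overlapping pair, nearest the fold first.
    """
    left = series[:fold][::-1]  # left part, reversed by the fold
    right = series[fold:]
    return [r - l for l, r in zip(left, right)]


series = [1, 4, 9, 16, 25]

# Moving the fold along the series exposes different pairs of values.
for fold in range(1, len(series)):
    print(fold, fold_deltas(series, fold))
# 1 [3]
# 2 [5, 15]
# 3 [7, 21]
# 4 [9]

# Accumulating the values gives the running "integral" of the series.
print(list(accumulate(series)))  # [1, 5, 14, 30, 55]
```

Each delta or accumulated value could in turn be wrapped in the same (observer, type, value, state) form, so that derived data has the same shape as the data it came from.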

This type of representation has embedded in it a perspective which allows the same data to be spun into different perspectives by different observers, giving an opportunity to accumulate the same continuous bits across observers to gain different perspectives. While I have listed the main advantages here, there are many other advantages of such a data representation. I explain these as I describe the various states the data can take and as I explain the algorithm.