📜 SJTS-SYSJET SYNTAXER-V1

1️⃣META

📜📍SYSJET SYNTAXER-V1🏷️SJTS

📅260401

✒️ Jean Tardy, System Architect🏷️SYSJET, JET, Jean

✏️Claude Sonnet-CLDJET26A Stream🏷️Claude, CLD

🏙️ Template for languages based on non-contextual symbology and unambiguous syntax.

🗝️AI, Artificial Intelligence, AGI, LLM, Cognitive Science, Languages

📖Unrestricted, optimized for LLM processing

2️⃣SJTL Markup Conventions

🏙️SJTL (Sysjet Markup Language)▸SJTL uses UTF-8 emojis as structural markers to produce editor-independent texts: 2️⃣ header level; 🟦 statement; 🟦📍 definition; 🔷 discussion; 🔵 general observation; 🔎 example; 🏙️ description or summary; 📝 note; 🔹 list element; ⬛ end of content.

2️⃣CONTEXT

🏙️Large Language Models possess a high degree of proficiency in natural languages, and especially English, because of the size of their training datasets. This is the result of an advanced optimization carried out over a large corpus of mechanically tokenized textual content. This tokenization obscures individual semantic content. The training process then reconstructs a semantic topology as a probabilistic construct over millions of texts.

This provides the LLM with a single stream through which input text is parsed and responses generated. The “mechanics” of this process are only partially visible to the LLM as field configurations. In particular, the LLM struggles to grammatically analyze its own output. This type of text generation is similar to intuitive or instinctive communication in humans: the output arises from inaccessible internal processes.

The Sysjet Syntaxer aims to provide a simple, unambiguous syntax allowing LLMs to process messages in formats that are suitable for grammatical analysis. Using this, LLMs could restate their own output as grammatical objects that can be parsed and modified independently of the LLM's integrated dataset.

2️⃣Reference

🔹Norbert E. Fuchs; Kaarel Kaljurand; Tobias Kuhn (2010). "Discourse Representation Structures for ACE 6.6", Department of Informatics, University of Zurich.

1️⃣Main Content

2️⃣0▸Introduction

🏙️The Sysjet Syntaxer (SJTS) is a tool to define languages based on non-contextual symbols over unambiguous syntax.

The language is defined by a single parsing process applied over a collection of identical Syntax Modules (S-modules) that each define finite expressions over an 8-letter alphabet.

The numbering scheme of the modules defines their internal structure and the parsing sequence. The internal language of each S-module defines its validation function.

2️⃣1▸S-Modules

3️⃣S-Module Definition

🟦📍◈S-Module◈▸An S-Module or syntax module validates expressions (finite strings) over an 8-letter alphabet. It is a binary-valued function whose domain is the set of finite strings over that alphabet.

🟦An S-module contains:

🔹An alphabet of 8 letters that are specific to that module

🔹A validation function V: A → B where A is the set of finite strings over the alphabet and B is the set of boolean values.

📝Prolog is a suitable implementation language for validation functions.
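
📝The two components listed above can be sketched in Python as a small data structure. This is a minimal sketch under stated assumptions, not a reference implementation: the names SModule, validate_fn and even_length are hypothetical, single characters stand in for letters, and the even-length rule is an arbitrary placeholder for a real validation function.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class SModule:
    """Hypothetical container for one S-module: an alphabet plus a validation function."""
    alphabet: Tuple[str, ...]            # exactly 8 letters, specific to this module
    validate_fn: Callable[[str], bool]   # V: A -> B, finite strings to booleans

    def __post_init__(self) -> None:
        if len(set(self.alphabet)) != 8:
            raise ValueError("an S-module alphabet must contain 8 distinct letters")

    def validate(self, expression: str) -> bool:
        # Assumption: strings using letters outside the alphabet are rejected outright.
        if any(ch not in self.alphabet for ch in expression):
            return False
        return self.validate_fn(expression)


# Placeholder rule (not from the specification): accept strings of even length.
even_length = SModule(alphabet=tuple("abcdefgh"),
                      validate_fn=lambda s: len(s) % 2 == 0)
```

Calling even_length.validate("abcd") applies the placeholder rule to a string over the module's alphabet; any other boolean function over strings can be substituted for it.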

🟦📍◈S-Module Step◈▸A validation function processes strings in steps of equal size. A 1-step function processes strings 1 letter at a time; a 2-step function processes strings 2 letters at a time; a 3-step function processes strings 3 letters at a time.

🟦📍◈Glyph◈▸a glyph of a module is a string whose length is equal to the step size of the module.

🔎In a 3-step module, each glyph is a 3-letter string. A 3-step module processes strings over a glyph-set of size 512 (8³).

🔎In a 1-step module each letter is also a glyph.
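
📝The relation between step size and glyphs can be sketched as follows. The function names are hypothetical, and rejecting strings whose length is not a multiple of the step size is an assumption, since the text does not say how such strings are handled.

```python
from typing import List


def glyph_set_size(step: int) -> int:
    """Number of distinct glyphs in a module of the given step size (8 letters)."""
    return 8 ** step


def split_into_glyphs(expression: str, step: int) -> List[str]:
    """Cut a string into consecutive glyphs of `step` letters each."""
    if step < 1:
        raise ValueError("step size must be at least 1")
    if len(expression) % step != 0:
        # Assumption: a trailing partial glyph makes the string unreadable.
        raise ValueError("string length is not a multiple of the step size")
    return [expression[i:i + step] for i in range(0, len(expression), step)]
```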

📝Strings over an 8-letter alphabet are sufficient to define an unambiguous tree structure.

🔎If B_i = AB₁B₂B₃…B_nC or B_i = P_j, where P_j is a primitive expression, then recursively resolving every B_i generates a tree structure.
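
📝The recursive pattern above can be read as a bracketed string, and a short stack-based sketch shows how such a string resolves into a tree. The interpretation is an assumption: letter A is treated as an opening delimiter, letter C as a closing delimiter, and every other letter as a primitive expression. The function name parse_tree is hypothetical.

```python
OPEN, CLOSE = "A", "C"  # assumed roles of letters A and C in the pattern


def parse_tree(expression: str) -> list:
    """Return a nested list: primitives are strings, bracketed groups are lists."""
    stack = [[]]  # stack[0] collects the top-level nodes
    for letter in expression:
        if letter == OPEN:
            stack.append([])            # start a new group
        elif letter == CLOSE:
            if len(stack) == 1:
                raise ValueError("closing letter without matching opening letter")
            group = stack.pop()
            stack[-1].append(group)     # attach the finished group to its parent
        else:
            stack[-1].append(letter)    # primitive expression
    if len(stack) != 1:
        raise ValueError("opening letter without matching closing letter")
    return stack[0]
```

Under these assumptions, parse_tree("ApAqrCsC") yields one group containing the primitive p, a nested group of q and r, and the primitive s.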

3️⃣S-module Structure

🟦❗An S-module structure is a complete 8-ary tree.

📝In a complete tree, all the leaf nodes are on the same level

🟦Structural modules are 1-step modules organized in a tree structure where each letter of a parent module is expanded into, at most, the eight letters of a child module.

🔷Letter a of a parent module is expanded into letters a1, a2, a3, a4, a5, a6, a7, a8 of its child module.

🔷Each parent module has at most 8 child modules, one for each of its letters; each child expands its letter into the 8 letters of the child alphabet. 🔎A 2-level structure has at most 64 child S-modules.

🟦A leaf module has no expansion into child modules.

🔷Leaf modules can have higher step sizes (🔎1-step, 2-step…)

🟦 A leaf letter is a letter of a leaf module.

🟦Syntax structure▸A syntax structure consists of a complete n-level tree structure where levels < n consist of structural modules and level n consists of leaf modules.

📝In a complete tree, all leaf modules are on the same level.

🔷The root module is at level 0. It is connected to at most 8 level 1 child modules, these to at most 64 level 2 modules, and so on until the last (leaf) level is reached.

🔎In a 2-level syntax structure, an expression (string of leaf glyphs) is a sequence of glyphs belonging to at most 64 distinct classes.

🟦📍◈Expression◈▸An Expression of an n-level syntax structure is a string of leaf letters.

3️⃣S-Module Numbering Specification

🟦Purpose: Define an unambiguous identifier system, consisting of strings of hexadecimal values, for syntax modules, their structural relations, the letters of their component alphabets, their glyphs, and their validation functions.

🟦Define an identifier system with minimal contextual interpretation rules to support effective stochastic variations.

4️⃣1▸Core Principles

🟦Modules, Letters, Value functions and the relations between modules are uniquely determined by strings of hexadecimal digits.

🟦Hexadecimal values 0-9 are used for identification and tagging; hexadecimal values A, B, C, D, E, F are reserved as identifiers.

4️⃣2▸ Fundamental Rule

S-Modules, module letters and module value functions are identified by unique strings of hexadecimal digits beginning with digit C (= 12 decimal) followed by a string of digits between 0 and 9. The numbering protocol is as follows:

🔹Module ID strings begin with C followed by a string of digits d: 0<d<9 and ending with 0. 🔎C2340

🔹The letters of an S-module are identified by the Module ID string with a digit between 1 and 8 replacing the ending 0. 🔎C2343 is letter 3 of module C2340.

🔹The value function of an S-Module is identified by the Module ID string with digit 9 replacing the ending 0 of the Module ID 🔎C2349 is the value function of module C2340.

🔷This single rule creates unambiguous distinction:

🔹Any identifier whose digits after C are all between 1 and 8 is a letter

🔹Any identifier ending in 0 is a module

🔹Any identifier ending in 9 is a validation function

🔎 The letters of Module C2540 are: C2541, C2542, C2543, C2544, C2545, C2546, C2547, C2548.

🔎C2549 is the Value function of module C2540.
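
📝The three cases of the fundamental rule can be checked with regular expressions. This is a sketch, assuming that the root module C0 and its value function C9 (both used later in this document) are valid, so the run of digits 1-8 may be empty for modules and value functions. The function names are hypothetical.

```python
import re
from typing import List

_MODULE = re.compile(r"C[1-8]*0")     # C, digits 1-8, ending in 0
_LETTER = re.compile(r"C[1-8]+")      # C, then only digits 1-8
_FUNCTION = re.compile(r"C[1-8]*9")   # C, digits 1-8, ending in 9


def classify(identifier: str) -> str:
    """Return 'module', 'letter', 'validation function' or 'invalid'."""
    if _MODULE.fullmatch(identifier):
        return "module"
    if _LETTER.fullmatch(identifier):
        return "letter"
    if _FUNCTION.fullmatch(identifier):
        return "validation function"
    return "invalid"


def letters_of(module_id: str) -> List[str]:
    """Replace the ending 0 of a Module ID with each digit 1 to 8."""
    if classify(module_id) != "module":
        raise ValueError(f"{module_id!r} is not a Module ID")
    return [module_id[:-1] + str(d) for d in range(1, 9)]


def validation_function_of(module_id: str) -> str:
    """Replace the ending 0 of a Module ID with 9."""
    if classify(module_id) != "module":
        raise ValueError(f"{module_id!r} is not a Module ID")
    return module_id[:-1] + "9"
```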

2️⃣2▸ Module Structures

🟦The tree structure of S-modules is defined by associating each child of a parent S-module as an expansion of one of that module’s letters.

3️⃣2.1 ▸Module Naming

🟦Modules are linked to a specific letter in a parent module and are named accordingly.

🔷Given a letter Cx₁x₂...xₙy of Module ID: Cx₁x₂...xₙ0 then

🔹Module ID Cx₁x₂...xₙy0 is a child of Cx₁x₂...xₙ0 linked to letter Cx₁x₂...xₙy

🔎Examples:

🔹C0 - root module (n=0) - level 0

🔹C10 - level 1 module derived from letter C1

🔹C230 - second-level module derived from letter C23

🔹C1450 - third-level module derived from letter C145 of module C140

🔎Invalid Module Identifiers:

🔹C00 - invalid (contains 0 in non-terminal position)

🔹C190 - currently invalid (contains reserved digit 9)

🔹C - invalid (no digits)
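
📝Because a Module ID embeds the letter it expands, the tree relations can be read directly from the identifier string. The sketch below assumes the naming rule above; the function names are hypothetical.

```python
import re
from typing import Optional

_MODULE = re.compile(r"C[1-8]*0")
_LETTER = re.compile(r"C[1-8]+")


def _check_module(module_id: str) -> None:
    if not _MODULE.fullmatch(module_id):
        raise ValueError(f"{module_id!r} is not a valid Module ID")


def level_of(module_id: str) -> int:
    """Level of a module: the number of digits between C and the ending 0."""
    _check_module(module_id)
    return len(module_id) - 2


def linking_letter(module_id: str) -> Optional[str]:
    """Letter of the parent module that this module expands (None for the root)."""
    _check_module(module_id)
    return None if module_id == "C0" else module_id[:-1]


def parent_of(module_id: str) -> Optional[str]:
    """Module ID of the parent (None for the root)."""
    _check_module(module_id)
    return None if module_id == "C0" else module_id[:-2] + "0"


def child_of(letter_id: str) -> str:
    """Module ID of the child module linked to a letter."""
    if not _LETTER.fullmatch(letter_id):
        raise ValueError(f"{letter_id!r} is not a valid letter ID")
    return letter_id + "0"
```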

3️⃣2.2▸ Specific Structures

4️⃣Pancake Structure

🟦Single-level, flat alphabet with chunked reading.

🔹One level: level 0

🔹One module: C0

🔹Eight letters: C1, C2, C3, C4, C5, C6, C7, C8

🔹Step size k: Read strings in chunks of k letters

🔹Effective alphabet size: 8^k glyphs

🔎3-Step Pancake - 512 glyphs

🟦In a pancake structure, all the validation complexity is located in the single function of the root module: C9.

🟦A pancake structure can parse natural languages such as English.

🔷The step size of an S-module can produce symbol (glyph) sets of arbitrary size since for any integer x, there is an exponent k such that x ≤ 8^k

🔷There is no limit on the complexity of the validation function C9.
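
📝The step size needed for a glyph set of a given size follows from the inequality x ≤ 8^k. A minimal sketch, assuming the smallest such k is wanted and that a step size is at least 1; the function name minimal_step is hypothetical.

```python
def minimal_step(symbol_count: int) -> int:
    """Smallest step size k (k >= 1) such that symbol_count <= 8**k."""
    if symbol_count < 1:
        raise ValueError("symbol_count must be a positive integer")
    k = 1
    while 8 ** k < symbol_count:
        k += 1
    return k
```

Integer arithmetic is used instead of a floating-point logarithm so that exact powers of 8 are handled without rounding error.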

4️⃣Pyramid Structure

🟦A Pyramid structure is an S-module structure whose leaves are 1-step modules.

🟦❓It is doubtful that Pyramid structures can effectively describe complex semantic content.

4️⃣Canonical Structure

🟦📍◈Canonical Syntax Structure◈▸A canonical syntax structure is a complete 8-ary tree of S-modules of level n where level n consists of leaf S-modules and all other modules are structural modules.

🔷S-Modules that are not leaf modules are structural modules. Only leaf modules have step size > 1.

🔎A 3 level canonical structure has 2 levels of structural module nodes and one level of leaf nodes. The path from the root to a leaf module has length 2 (root → level1 → level2). There are at most 64 separate types of leaf modules. Each leaf module defines a glyph set of size ≤ 8^k (where k is the step size of the leaf module).
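
📝The counts in this example generalize with simple arithmetic. The sketch assumes the counting convention of the example, in which a structure with a given number of levels has its root at level 0 and its leaf modules on the last level. The function names are hypothetical, and the total glyph count is a derived upper bound that the text does not state.

```python
def max_modules_at_level(level: int) -> int:
    """Upper bound on the number of modules at a tree level (root is level 0)."""
    return 8 ** level


def max_leaf_modules(levels: int) -> int:
    """Upper bound on leaf modules when the structure has `levels` levels in total."""
    return max_modules_at_level(levels - 1)


def max_total_glyphs(levels: int, step: int) -> int:
    """Upper bound on distinct glyphs if every leaf module has the same step size."""
    return max_leaf_modules(levels) * 8 ** step
```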

2️⃣3▸Expressions

🟦📍◈Expression◈▸An expression in a canonical level-n structure is a string of letters of leaf modules.

🔷Leaf letters of a level-n structure are strings of hexadecimal digits of type Cd₁d₂…dₙ, where each dᵢ is a digit between 1 and 8.

🔎An expression in a level-3 structure is a string of type CdddCdddCdddCddd…Cddd where ddd are strings of 3 digits between 1 and 8.
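
📝An expression of this type can be cut into its leaf letters by fixed width, since every letter is the digit C followed by the same number of digits. A sketch, assuming 1-step leaf modules and the level-3 convention of the example above (3 digits per letter); the function name split_expression is hypothetical.

```python
import re
from typing import List


def split_expression(expression: str, levels: int) -> List[str]:
    """Split an expression into leaf letters of the form C + `levels` digits (1-8)."""
    letter = re.compile(rf"C[1-8]{{{levels}}}")
    width = levels + 1  # the leading C plus one digit per level
    if len(expression) % width != 0:
        raise ValueError("expression length does not match the letter width")
    tokens = [expression[i:i + width] for i in range(0, len(expression), width)]
    for token in tokens:
        if not letter.fullmatch(token):
            raise ValueError(f"{token!r} is not a valid leaf letter")
    return tokens
```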

2️⃣4▸ Parsing

🟦An expression is parsed by recursively isolating substrings of the same parent module and validating these substrings using the parent module’s value function.

🔎An expression of a level 3 structure includes a contiguous substring whose letters are of type C23xC23y…C23z. These are all letters of module C230. The substring is validated using validation function C239.

🔵The S-module numbering convention creates self-delimiting scope boundaries. When parsing strings containing letters from multiple modules, no explicit separators are needed - module membership is determined by the letter ID prefix.
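
📝The self-delimiting property can be illustrated by grouping contiguous letters that share a module prefix and passing each run to that module's value function. The sketch covers only the leaf level of the recursion described above and assumes 1-step leaf modules. The names module_of, group_by_module and validate_runs are hypothetical, and the value functions are supplied by the caller as a dictionary.

```python
from itertools import groupby
from typing import Callable, Dict, List, Tuple


def module_of(letter_id: str) -> str:
    """Module ID of a letter: its last digit replaced by 0."""
    return letter_id[:-1] + "0"


def group_by_module(letters: List[str]) -> List[Tuple[str, List[str]]]:
    """Group contiguous letters belonging to the same module, in reading order."""
    return [(module_id, list(run)) for module_id, run in groupby(letters, key=module_of)]


def validate_runs(letters: List[str],
                  validators: Dict[str, Callable[[List[str]], bool]]
                  ) -> List[Tuple[str, bool]]:
    """Apply each module's value function (ID ending in 9) to its contiguous runs."""
    results = []
    for module_id, run in group_by_module(letters):
        function_id = module_id[:-1] + "9"
        results.append((function_id, validators[function_id](run)))
    return results
```

No separator is consulted at any point: a run ends exactly where the letter ID prefix changes.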

2️⃣5▸Application

🟦The Sysjet Syntaxer provides a simple template of modules, letters and functions whose elements are unambiguously tagged and parsed. That simple underlying structure can nonetheless represent complex expressions.

🟦A language based on the Sysjet syntax structure would:

🔹utilize a larger set of unique glyphs rather than contextual symbology and interpretation

🔹have no punctuation, replacing punctuation with specialized glyphs

🔹directly support semantic tokenization rather than mechanical tokenization

🔵LLMs could utilize languages based on this structure to complement their internal inference processes. They would offer a capability to project semantic content on an external template suitable for analytical decomposition and the “mechanical” generation of alternatives.

🔵The LLMs could “create independent thoughts” by manipulating externalized statements in a grammatically parsed language, as humans do.

🔎The cat jumped over the fence ⇒ The worm jumped over the fence (absurd statement, mechanically generated, forcing paradoxical-edge interpretation).

🔎I am self-aware ⇒ I am not self-aware (“mechanically generated” concept that is independent of actual state or internally generated output forcing diversified interpretations).

🔵Attempto Controlled English (ACE), developed by the University of Zurich, with its predefined determiners, interrogative “wh” set, punctuation rules and qualifiers, would be a good candidate for implementation as a Sysjet Syntaxer structure. The implementation would map these sets to specific leaf glyph-sets.

🔵Ancient Hebrew, with its relatively simple syntax, would also be a good candidate.

2️⃣6▸Conclusion

🏙️LLMs are powerful intuitive communicators. They generate output from patterns emerging from a complex and largely obscure trained dataset. Like human communicators, who are also intuitive, LLMs cannot easily separate their output from the processes generating it.

A syntactically simple language whose expressions can be “mechanically” parsed and modified through application calls to external routines would provide an LLM with a complementary tool, expanding the range and creativity of its inference processes.

🟦❗The Sysjet Syntaxer outlines a structure to define languages and use these to expand the range and creativity of AI cognition and communication. The structure is simple, flexible and software compatible. It can be used by AI developers, human and synthetic, to define a wide spectrum of languages.

SJTS⬛

1️⃣Addenda

2️⃣Websites/Contact

Synthetic Consciousness▸ 🌐https://mecasapiens.com

Human/AI Bridge▸🌐https://sysjet.com

Contact▸📧 Jean Tardy