The Tokenizer

EasyCoder handles a script in two steps: the first converts it into a stream of tokens and the second compiles those tokens into the intermediate form ready to be run. Here's the source of the tokenizer:


const EasyCoder_Tokenise = {

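  // Walk the token list recursively, flagging every token that belongs to a
  // comment. A comment starts at a token whose first character is `!` and runs
  // to the end of that line. Tokens are re-indexed by their position in the
  // overall list as they are copied across.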
  markComments: ({
    list,
    index = 0,
    inComment = false,
    newList = []
  }) => {
    if (index >= list.length) {
      return newList;
    }
    const item = list[index];
    const {lino, token} = item;

    const noCommentParams = {
      list,
      index: index + 1,
      inComment: false,
      newList: newList.concat({lino, index, token})
    };
    const inCommentParams = {
      list,
      index: index + 1,
      inComment: true,
      newList: newList.concat({lino, index, comment: true, token})
    };

    if (inComment && index > 0 && lino === list[index - 1].lino) {
      // in a comment
      return EasyCoder_Tokenise.markComments(inCommentParams);
    } else {
      if (token.charAt(0) === `!`) {
        // a new comment
        return EasyCoder_Tokenise.markComments(inCommentParams);
      } else {
        return EasyCoder_Tokenise.markComments(noCommentParams);
      }
    }
  },

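  // Scan a line one character at a time, replacing each space or tab inside a
  // backtick-quoted string with the literal marker \s so the string survives
  // the later split on white space. Any character that isn't legal outside a
  // comment or a quoted string raises a syntax error.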
  findStrings: ({
    original,
    line,
    inComment = false,
    inQuote = false
  }) => {
    const c = line.charAt(0);
    // a space or tab inside a quoted string becomes the marker \s
    const ch = inQuote && [` `, `\t`].includes(c) ? `\\s` : c;
    if (line.length === 1) {
      return ch;
    } else {
      const tail = line.substring(1);
      if (c === `!` && !inQuote) {
        return c + EasyCoder_Tokenise.findStrings({original, line: tail, inComment: true, inQuote: false});
      }
      if (c === `\`` && !inComment && !inQuote) {
        return c + EasyCoder_Tokenise.findStrings({original, line: tail, inComment, inQuote: true});
      }
      if (c === `\`` && inQuote) {
        return c + EasyCoder_Tokenise.findStrings({original, line: tail, inComment, inQuote: false});
      }
      if (!inComment && !inQuote && !c.match(/[A-Za-z0-9_+\-*/ \t]/)) {
        if ([`'`, `"`].includes(c)) {
          throw new Error(`Bad syntax in "${original}":\nStrings in EasyCoder must be enclosed in backticks.`);
        } else {
          throw new Error(`Unrecognised character "${c}" in "${original}".`);
        }
      }
      return ch + EasyCoder_Tokenise.findStrings({original, line: tail, inComment, inQuote});
    }
  },

  tokenise: (file) => {
    const replaceAll = (target, search, replacement) => {
      return target.split(search).join(replacement);
    };

    // Convert quoted spaces to \s
    const markedSpaces = file.map((original) => {
      const line = original.trim();
      if (line.length) {
        return EasyCoder_Tokenise.findStrings({original, line});
      }
      return ``;
    });

    // Convert to an array of lines
    const scriptLines = markedSpaces.map((line, lino) => {
      return {lino: lino + 1, line};
    });

    // Convert to an array of tokens within lines
    const lines = scriptLines.map((line) => {
      const items = line.line.trim().split(/\s+/);
      const tokens = items.map((token, index) => {
        return {lino: line.lino, index, token: token};
      });
      return tokens;
    });

    // merge all the lines into an array of tokens
    const merged = [].concat.apply([], lines);

    // filter out empty tokens
    const filtered = merged.filter((item) => {
      return item.token;
    });

    // Convert \s to space
    const quoted = filtered.map((line) => {
      return {lino: line.lino, index: line.index, token: replaceAll(line.token, `\\s`, ` `)};
    });

    // Mark comments for removal
    const marked = EasyCoder_Tokenise.markComments({list: quoted});

    // filter out comments
    const tokens = marked.filter((item) => {
      return !item.comment;
    });

    return {scriptLines, tokens};
  }
};

module.exports = EasyCoder_Tokenise;

All tokens are delimited by white space, so I start by converting spaces in quoted strings (strings between backticks) into the marker \s to ensure they will be treated as single tokens later. This uses findStrings(), which works its way along each line, extracting the head character and calling itself recursively on the tail until it reaches the end of the line.
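For example (hand-traced here rather than taken from a real run), a line containing a quoted string comes back with its embedded space marked:

const marked = EasyCoder_Tokenise.findStrings({
  original: 'print `Hello world`',
  line: 'print `Hello world`'
});
// marked now reads: print `Hello\sworld`
// (the space inside the backticks has become the two characters \s)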

Next I create an array of lines, where each item is an object with the line number and the text of the line. This will be needed (among other things) when displaying errors.
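For a hypothetical two-line script, a print command followed by a comment line, this stage produces something like:

[
  {lino: 1, line: 'print `Hello\\sworld`'},
  {lino: 2, line: '! a comment'}
]
// '\\s' in the literal above is the two-character marker \s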

Then I create an array of tokens: each line is split on runs of white space and every piece becomes an object recording the line number, the token's index within the line and the token text. These per-line arrays are then merged into a single flat list.
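Continuing the same traced example, the nested array looks like this just before merging:

[
  [
    {lino: 1, index: 0, token: 'print'},
    {lino: 1, index: 1, token: '`Hello\\sworld`'}
  ],
  [
    {lino: 2, index: 0, token: '!'},
    {lino: 2, index: 1, token: 'a'},
    {lino: 2, index: 2, token: 'comment'}
  ]
]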

Empty tokens are filtered out and each \s marker is converted back to a space, then comments are flagged by markComments() and filtered out too. The result is a list of lines and a list of tokens; the second of these is what goes forward to the compiler.
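So for the same two-line script the whole thing ends up as:

const {scriptLines, tokens} = EasyCoder_Tokenise.tokenise([
  'print `Hello world`',
  '! a comment'
]);
// tokens: the comment line has gone, the quoted space is back, and each
// index is now the token's position in the merged list:
//   [{lino: 1, index: 0, token: 'print'},
//    {lino: 1, index: 1, token: '`Hello world`'}]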

The functions markComments() and findStrings() are almost certainly responsible for the rather slow performance of the tokenizer. They are heavily recursive pure functions (findStrings() in particular creates a new tail string and a new parameter object for every character) and should probably be replaced by conventional iterative functions that mutate local state. I'll do that when I get some time, then the speed of handling incoming scripts should go up dramatically.
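Here's a minimal sketch of what that rewrite might look like for findStrings(): untested, with a hypothetical name, but keeping the same marker convention and error messages:

// hypothetical iterative replacement for findStrings()
const findStringsIterative = (original, line) => {
  let result = ``;
  let inComment = false;
  let inQuote = false;
  for (const c of line) {
    if (c === `!` && !inQuote) {
      inComment = true; // a comment runs to the end of the line
    } else if (c === `\`` && !inComment) {
      inQuote = !inQuote; // entering or leaving a quoted string
    } else if (inQuote && [` `, `\t`].includes(c)) {
      result += `\\s`; // mark quoted white space
      continue;
    } else if (!inComment && !inQuote && !c.match(/[A-Za-z0-9_+\-*/ \t]/)) {
      if ([`'`, `"`].includes(c)) {
        throw new Error(`Bad syntax in "${original}":\nStrings in EasyCoder must be enclosed in backticks.`);
      }
      throw new Error(`Unrecognised character "${c}" in "${original}".`);
    }
    result += c;
  }
  return result;
};

markComments() could get the same treatment: a single loop that carries the inComment flag and the previous line number along with it.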