decoded threading in instruction set translator / emulator

I have written a full-system simulator/emulator of a RISC-style processor (and all its peripherals). It currently uses an indirect-threaded emulation loop, i.e. every instruction handler ends with a footer along these lines:

pc += 4;
inst = loadWord(mem, pc);
instp = decodeTable[opcode(inst)];
goto *instp;

This performs quite well: I get around 70-80 MIPS on a modern machine when booting Linux.

However, I am looking at moving to a direct, predecoded threaded interpreter model, which would look something like this:

tPC += 1;
instp = predecodeMem[tPC].operation;
goto *instp;
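To make the footer above concrete, here is a minimal sketch of a predecoded, direct-threaded loop. It relies on the GNU C "labels as values" extension (GCC and Clang), just like the `goto *instp` in the snippets above. The two-instruction toy ISA, the `PredecodedInst` layout and the `run` function are assumptions made up for illustration, not the emulator's real decoder.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy ISA: low 8 bits = opcode, upper 24 bits = immediate. */
enum { OP_ADDI = 0, OP_HALT = 1, OP_COUNT };

typedef struct {
    void    *operation;  /* handler label address, filled in at predecode time */
    uint32_t imm;        /* operand extracted once, not on every execution */
} PredecodedInst;

static unsigned opcode(uint32_t inst) { return inst & 0xffu; }

/* Predecodes n raw words into pre[], then executes from index 0.
   The program is assumed to end with OP_HALT. Returns the accumulator. */
static uint32_t run(const uint32_t *code, PredecodedInst *pre, size_t n)
{
    /* Label addresses are only visible inside this function, so the
       opcode-to-handler table has to live here as well. */
    static void *const handlers[OP_COUNT] = { &&do_addi, &&do_halt };

    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        unsigned op = opcode(code[i]);
        assert(op < OP_COUNT);
        pre[i].operation = handlers[op];
        pre[i].imm = code[i] >> 8;
    }

    uint32_t acc = 0;
    size_t tPC = 0;
    goto *pre[tPC].operation;

do_addi:
    acc += pre[tPC].imm;
    /* Footer: no fetch from guest memory and no table lookup. */
    tPC += 1;
    goto *pre[tPC].operation;

do_halt:
    return acc;
}
```

Running the three-word program `{ (5u << 8) | OP_ADDI, (7u << 8) | OP_ADDI, OP_HALT }` through `run` adds the two immediates and stops at the halt. The point of the layout is that the footer touches only the shadow array, which is exactly why that array has to be kept coherent with guest memory.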

The predecoding is not much of a problem in itself; it just replaces the existing decoder and adds some shadow memory. My main problem is with self-modifying code (or semi-self-modifying code).

In the simple case, we can allocate the predecode pages lazily, the first time a page that has not been executed before is visited. The software TLB is then purged so that the next write to that page goes through the memory simulation system. Writes to executable pages therefore have to update the decode info as well, which costs performance, but as such writes are rare this should not be a problem. (We can also speed this up by adding sub-page executable bits computed at runtime.)
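A minimal sketch of this simple case, under stated assumptions: a flat 16-page guest RAM, one shadow page pointer per guest page, and a stand-in `Predecoded` record holding only an opcode byte. All names except `loadWord` are hypothetical, and the soft TLB purge is only marked by a comment.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SHIFT     12u
#define PAGE_SIZE      (1u << PAGE_SHIFT)
#define NUM_PAGES      16u                 /* tiny guest RAM, for the sketch only */
#define INSTS_PER_PAGE (PAGE_SIZE / 4u)

/* Stand-in for the real predecode record (handler pointer, operands, ...). */
typedef struct { uint8_t op; } Predecoded;

static uint8_t     guestMem[NUM_PAGES * PAGE_SIZE];
static Predecoded *predecodePage[NUM_PAGES];   /* NULL = page never executed */

static uint32_t loadWord(uint32_t addr)
{
    uint32_t w;
    assert(addr % 4u == 0 && addr < sizeof guestMem);
    memcpy(&w, &guestMem[addr], sizeof w);
    return w;
}

static Predecoded decodeOne(uint32_t inst)
{
    Predecoded d = { (uint8_t)(inst & 0xffu) };  /* hypothetical: opcode in low byte */
    return d;
}

/* Execution path: allocate and fill the shadow page on first visit. */
static Predecoded *fetchPredecoded(uint32_t pc)
{
    uint32_t page = pc >> PAGE_SHIFT;
    assert(pc % 4u == 0 && page < NUM_PAGES);
    if (predecodePage[page] == NULL) {
        Predecoded *p = malloc(INSTS_PER_PAGE * sizeof *p);
        assert(p != NULL);
        for (uint32_t i = 0; i < INSTS_PER_PAGE; i++)
            p[i] = decodeOne(loadWord((page << PAGE_SHIFT) + 4u * i));
        predecodePage[page] = p;
        /* A real emulator would purge the soft TLB here so that later
           writes to this page take the slow path in storeWord(). */
    }
    return &predecodePage[page][(pc & (PAGE_SIZE - 1u)) / 4u];
}

/* Slow write path: keep the decode info coherent if the page has any. */
static void storeWord(uint32_t addr, uint32_t value)
{
    uint32_t page = addr >> PAGE_SHIFT;
    assert(addr % 4u == 0 && page < NUM_PAGES);
    memcpy(&guestMem[addr], &value, sizeof value);
    if (predecodePage[page] != NULL)
        predecodePage[page][(addr & (PAGE_SIZE - 1u)) / 4u] = decodeOne(value);
}
```

Note that nothing in this sketch ever frees a shadow page or resets `predecodePage[page]` to NULL, so once a page has been executed, every later store to it pays for the re-decode.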

The problem is long-term code discovery when pages are reused by the operating system running inside the emulator. For example, a memory page may be allocated by the Linux kernel and assigned as code for one process. The next time a process is created, the same page may be allocated as data, but in the scheme just described this causes problems, since the now pure-data page still has to go through the rather slow predecode update on every byte write.

Some current ideas follow, but I do not find any of them particularly nice, i.e. they all have significant drawbacks:

  • Use mprotect() when executing a page and intercept the protection faults with signal handlers. This slows down writes a lot and makes multithreaded multicore emulation a pain.
  • When writing, instead of updating the predecode info, flush the soft TLB translations related to code execution and set a bit marking the page as dirty. This slows down code execution on that page, but at least reads and writes are fast. The problem is of course that the next time the page is reused as executable instead of data, we have the same problem.
  • Move to a smaller predecode cache that holds predecode info for a limited number of pages. The pages age and are evicted based on an LRU policy or something similar. However, this approach penalises applications that have a fixed memory layout (i.e. many embedded applications).

I find that the literature is seriously lacking in discussion of this topic. What general methods exist for aging pages when they are no longer used, so that we can, for example, clear the execute bit and the predecode memory associated with the page?
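For reference, the third idea in the list above (a bounded predecode cache with LRU eviction) could be sketched roughly as follows. The slot count, the `PredecodeSlot` layout and the `touchPage` function are all hypothetical, and a real version would also store the predecoded instructions in each slot.

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_SLOTS 2u            /* deliberately tiny so eviction is easy to see */
#define NO_PAGE     UINT32_MAX

typedef struct {
    uint32_t guestPage;   /* guest page number held in this slot, or NO_PAGE */
    uint64_t lastUse;     /* value of useCounter at the most recent execution */
    /* the predecoded instructions for the page would live here */
} PredecodeSlot;

static PredecodeSlot slots[CACHE_SLOTS] = { { NO_PAGE, 0 }, { NO_PAGE, 0 } };
static uint64_t useCounter;

/* Called when execution enters guestPage. Returns the slot index now
   holding that page. On a miss, the least recently used slot is reused and
   *evicted receives the page that lost its predecode info (or NO_PAGE). */
static unsigned touchPage(uint32_t guestPage, uint32_t *evicted)
{
    unsigned victim = 0;

    *evicted = NO_PAGE;
    useCounter++;
    for (unsigned i = 0; i < CACHE_SLOTS; i++) {
        if (slots[i].guestPage == guestPage) {   /* hit: just refresh the age */
            slots[i].lastUse = useCounter;
            return i;
        }
        if (slots[i].lastUse < slots[victim].lastUse)
            victim = i;                          /* oldest slot seen so far */
    }
    /* Miss: this is the point where the execute bit of the evicted page
       would be cleared, so that writes to it become fast again. */
    *evicted = slots[victim].guestPage;
    slots[victim].guestPage = guestPage;
    slots[victim].lastUse = useCounter;
    return victim;
}
```

With two slots, touching pages 10, 20, 10 and then 30 evicts page 20, since page 10 was used more recently. This only illustrates the mechanism; it does nothing about the drawback noted above for programs with a fixed memory layout whose code working set exceeds the cache.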
