Nucleus Released

Our Sommarhack 2022 contribution

Posted by spkr/smfx on 8 July 2022

The Context

On July 9th 2022 we released `Nucleus', an Atari Portfolio demo, at our favourite Atari demoscene party: Sommarhack.

This writeup serves as additional content to go with the demo, which can be found on pouet and youtube.

This document is 80% finished, and thus a work in progress. Feedback is welcome ;)

Introduction

The Atari Portfolio... easy money! That's what Atari thought when they licensed the product from DIP in 1989, at the high tide of the pocket computers. The Atari Portfolio is considered the first pocket-sized/palmtop IBM/DOS compatible computer.

Little did they know that some 33-odd years later, the Atari Portfolio would remain one of the less explored demoscene platforms bearing the Atari name.

Sure, the Atari VCS and the Atari Jaguar, and pretty much everything in between, have seen demoscene love to such an extent that there are multi-part demos that push the envelope of those platforms. But the little guy, the Atari Portfolio, has been left out.

The Beginning

Sillyventure gave the machine some attention in 2019 by announcing an Atari Portfolio-only competition. Back then we were occupied with Atari ST demos and did not consider participating, but we awaited the submissions for the compo with much anticipation. However, that compo saw only a few releases, which in our opinion did not do the platform justice.

So SMFX set off in December 2020 to explore the machine. spkr, being a retrocomputer enthusiast, made an effort to gather a bunch of Portfolios from the Dutch market, where they turn up on second-hand sales sites quite often, and got us the needed hardware.

The Hardware

The Portfolio, released in 1989, sports fairly impressive hardware for the time:

  • CPU: Intel 80C88 @ 4.9152 MHz (rather than 4.77 MHz)
  • RAM: 128 KB memory, shared with C:\ drive storage (max 104 KB available to user)
  • Display: 240x64 non-backlit LCD, exact hardware unknown
  • Sound: Signetics PCD3311/12 DTMF generator
  • Storage: Bee memory cards, 32/64/128 KB in size; battery-powered to persist data
  • OS: DIP-DOS, which is a partial implementation of DOS (disassembly)

Many kudos for the known resources go to Klaus Peichl (and his website) and the pofowiki.

In order to keep battery usage low, the hardware features NO programmable interrupt. There is one observable system interrupt, the timer interrupt, which occurs, depending on the configuration, twice every second or once every minute. The other interrupts observable from outside the system are those driven by the expansion interface. This poses some challenges with regard to effect sequencing, timing and playback of sound, which will be elaborated on later.

The Bee memory card is a memory-mapped expansion that can either be accessed directly through the memory map or, more generally, through DOS/BIOS file calls (the equivalent of fopen, fread etc.).
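
For the file-based route, a minimal sketch of loading a file through DOS handle calls could look like the following. It assumes that DIP-DOS implements the standard INT 21h handle functions 3Dh (open), 3Fh (read) and 3Eh (close); the names loadFile, loadBuffer and LOAD_MAX are made up for illustration and are not taken from the Nucleus sources.

```nasm
; Sketch only: loadFile, loadBuffer and LOAD_MAX are hypothetical names.
; Assumes DIP-DOS supports INT 21h functions 3Dh, 3Fh and 3Eh.
LOAD_MAX    equ 4096            ; assumed upper bound on the file size

; in:  ds:dx -> zero-terminated file name
; out: carry clear -> ax = number of bytes read into loadBuffer
;      carry set   -> ax = DOS error code
loadFile:
    mov     ax,3D00h            ; open existing file, read-only
    int     21h
    jc      .done               ; open failed, ax = error code
    mov     bx,ax               ; bx = file handle
    mov     ah,3Fh              ; read from handle
    mov     cx,LOAD_MAX         ; read at most LOAD_MAX bytes
    mov     dx,loadBuffer       ; ds:dx -> destination buffer
    int     21h
    push    ax                  ; keep byte count (or error code)
    pushf                       ; keep the carry flag of the read
    mov     ah,3Eh              ; close the handle in bx
    int     21h
    popf                        ; restore the status of the read
    pop     ax
.done:
    ret

loadBuffer: times LOAD_MAX db 0 ; destination for the file contents
```

The handle is closed whether or not the read succeeded, and the carry flag returned to the caller reflects the read, not the close.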

Our demo targets the stock Portfolio with a 64 KB Bee card. We found that these 64 KB memory cards are the most common.

Regarding the CPU's performance, we consulted both the cycle counting documents found on the internet (ref) and the nice work by Trixter (ref). Traditional IBM 8088 clock frequencies are tied to the display output frequency, which is why they were conveniently chosen to be 4.77 MHz. The Atari Portfolio has no such convenient timing, so we were not sure whether the cycle tables and Trixter's findings would apply to Portfolio development. Unfortunately, Trixter's 8088 benchmark tooling relies on a programmable interrupt/timer, so we were not able to benchmark the Atari Portfolio with the same software. Instead, we did some investigation of our own, based on timed execution of code.

The Limited Benchmark Results

Benchmark setup code:

doBench:                                        ; entry point of the benchmark code
    mov     [timeTicks],word 10                 ; we want to stop execution after 10 observable ticks
    mov     [benchmark],word -1                 ; reset the flag that indicates that time has passed
    call    startTicks                          ; set the timed ticks
    call    waitTick                            ; busy-wait to observe an actual system tick before starting the benchmark
.again:
        call    benchCode                       ; run the code under bench
        add     [effectTicks], word 1           ; increase benchCode execution count by 1
        call    checkTick                       ; check if a tick is observed, which triggers the end of benchmark execution
    cmp     [benchmark], word 0                 ; check whether to continue the bench
    jne     .again                              ; go for it

Subsequently, `benchCode' would be a subroutine:

benchCode:
    %REP 2000
        code
    %ENDREP
    ret

Then, by using different code/operations, one can generate results that allow relative comparison between operations. Herewith the results. For the estimated cycles we assumed that nop takes 4 cycles and scaled every other operation relative to the nop count: est. cycles = 4 * 2966 / count (for example, mul cl: 4 * 2966 / 176 = 67.4):

;       operation               count       byte        est. cycles
;------------------------------------------------------------------
;       mul     cl              ;176        2           67.4
;       mov     [.t],word 1000  ;360        6           33
;       movsw                   ;453
;       mov     [.t],byte 100   ;471        5           25.2
;       mov     al,[di+bx+1234h];555        4
;       add     al,[bx+1234h]   ;555        4
;       and     ax,[bp]         ;555        2       ; mask
;       mov     bx,[cs:bx]      ;610        3           19.4
;       mov     bx,[di+bx]      ;610        2           19.4
;       mov     bx,[bx+si]      ;610        2           19.4
;       mov     bx,[di]         ;678        2           17.5
;       mov     bx,[bp+di]      ;678        2           17.5
;       add     bx,[di]         ;678        2           17.5
;       add     ax,[bx]         ;678        2           17.5
;       add     ax,[si]         ;678        2           17.5
;       and     al,[bp]         ;678        2           17.5
;       es  lodsw               ;678        2           17.5
;       lodsw                   ;717        1           16.5
;       mov     al,[si]         ;869        2           13.6
;       add     bl,[di]         ;869        2           13.6
;       lodsb                   ;936        1           12.6
;       mov     cx,1            ;1012       3           11.7
;       xlat                    ;1012       1           11.7
;       mov     cl,1            ;1509       2           7.8
;       add     al,8            ;1509       2           7.8
;       add     cx,dx           ;1509       2           7.8
;       add     ax,ax           ;1509       2           7.8
;       add     al,al           ;1509       2           7.8
;       add     bx,bx           ;1509       2           7.8
;       sal     bx,1            ;1509       2           7.8
;       xor     ax,ax           ;1509       2           7.8
;       xor     al,al           ;1509       2           7.8
;       xchg    al,ah           ;1509       2           7.8
;       nop                     ;2966       1           4
;       inc     di              ;2966       1           4
;       inc     si              ;2966       1           4
;       xchg    ax,dx           ;2966       1           4

Some observations:

The Video Display Controller

The Portfolio has a 240x64 single-colour LCD display. Instead of sharing memory with the system, the LCD has its own controller with its own video RAM.

The video display controller (VDC) is accessed through ports that are memory mapped. The VDC in turn maintains internal state. By issuing commands following the VDC protocol, it can be set to send or receive data, after the cursor has been set to the video RAM position to read from or write to. Issuing commands in quick succession does not cause a problem, but writing bytes too fast after one another will confuse the VDC and display garbage on (parts of, or the complete) screen.

The minimally required grace period between VDC commands differs over power cycles and Portfolio models. From our testing on different Portfolio models (HPC-004 and HPC-009) we deduced that the HPC-009 has the worst tolerance, which Klaus' test program reports as '27'. The HPC-004 model consistently reports '24'. We asked people in the Atari Portfolio facebook group who own models other than the HPC-004 to run the LCD test program and validate our results. The results we received confirmed the observation that the HPC-009 is consistently less tolerant. For the resulting code to drive the VDC, this means that the HPC-009 requires 1 extra nop compared to the HPC-004 to prevent visual distortion.

In contrast to the Atari ST (and many other micro machines of the time), the Portfolio does not have a vblank. As such, it cannot be controlled directly when a pixel is updated. So instead, for all intents and purposes, we consider pixels to be updated as soon as the VDC command is processed (and probably displayed at a VDC-controlled frequency that is unknown to us). This has implications for how to drive the VDC: either by individual plots to the screen directly, or by using some kind of backbuffer in main RAM that is then written to the screen.

The VDC auto-increments its internal memory cursor after a byte is written to the display. This means that the VDC favours sequential writes, because this avoids setting the cursor for each write and allows the VDC to stay in a state where it can `just receive bytes'. This forces implementation decisions when an effect covers a large visual area but only affects a limited set of bytes (e.g. a line draw effect).

Example code that sets the LCD cursor to the value held in si, sending the low byte first and then the high byte, and then puts the VDC in the state where it receives data bytes:

setVDCCursor:
    mov dx,08011h               ;   LCD controller address register
    mov al,10                   ;   10  Set cursor position low
    out dx,al                   ;   write to VDC port
    dec dx ; mov dx,08010h      ;   dx = $8010, data register
    mov ax,si                   ;   ax = cursor; si iterates as cursor over the screen, per cube copy doing x (30), and then adds offset 240 per y-block
    out dx,al                   ;   write low byte to VDC port

    inc dx ; mov dx,08011h      ;   dx = $8011, address register
    mov al,11                   ;   11  Set cursor position high
    out dx,al                   ;   write to VDC port
    dec dx ; mov dx,08010h      ;   dx = $8010, data register
    mov al,ah                   ;   al = high byte of the cursor
    out dx,al                   ;   write high byte to VDC port

    inc dx ; mov dx,08011h      ;   dx = $8011, address register
    mov al,12                   ;   12  Byte(s) transferred to the LCD
    out dx,al                   ;   write to VDC port
    ret
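
With the cursor set and command 12 issued, the auto-increment makes a sequential run of bytes cheap. A minimal sketch, assuming setVDCCursor above has just been called; writeVDCRun is a hypothetical helper, and the single nop merely stands in for a grace period that has to be tuned on real hardware:

```nasm
; Sketch only: writeVDCRun is a hypothetical helper, not from the demo sources.
; Call after setVDCCursor, so the VDC is already waiting for data bytes.
; in:  ds:si -> source bytes, cx = number of bytes (must be non-zero)
writeVDCRun:
    mov     dx,08010h           ; dx = $8010, data register
    cld                         ; make lodsb walk upwards through memory
.next:
    lodsb                       ; al = next source byte, si = si + 1
    out     dx,al               ; write to VDC, its cursor auto-increments
    nop                         ; placeholder delay, tune per model
    loop    .next               ; repeat cx times
    ret
```

Note that setVDCCursor takes the cursor position in si, so si can only be pointed at the source data after that call. The loop overhead adds further spacing between writes; whether that is enough on a given model has to be checked on the machine itself.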

Another observation about the LCD: the quality of the LCDs differs between the Portfolios we have seen, mainly with regard to the speed at which pixels turn on and off (regardless of VDC speed). We can only guess at the life expectancy the display had in its heyday, but we suspect that time has treated the various Portfolios differently. This implies, however, that if pixel updates follow each other quickly, the pixels become less clear. There is thus a tradeoff between the clarity of an effect and its speed.

This also leads to some interesting side effects. Moving pixels across the screen fast enough provides a `free' motion-blur-like experience. Similarly, certain levels of greyscale can be achieved by alternately setting and unsetting pixels. Depending on the speed and interval of alternating between on and off, various degrees of grey can be achieved. Again, depending on your Portfolio instance, ymmv.

The Music

The Portfolio has sound capabilities, really it does. It has a Dual-Tone Multi-Frequency (DTMF) chip to generate sounds and a little speaker next to the LCD screen to output them. As with the VDC, the sound chip is memory mapped and exposed through a port to which commands are issued. This can be done through int/BIOS calls, or just by writing to the hardware directly.

We took the latter approach, based on the implementation we found in the disassembled BIOS.

The DTMF chip has a frequency range of 2 octaves, from 622.3 Hz to 2489 Hz. When the chip is sent a command to play a tone, it will play this tone until told otherwise. In addition to single tone generation, the chip can generate dual and modem tones.

Now that we could play sounds on the Portfolio, we needed to tackle both the problem of replaying music during the demo and the music creation.

For the replayer we mimicked a tracker-like organization, where a song consists of a sequence of patterns, each consisting of 64 rows that can hold commands. Our sound code interprets a command in one of 2 ways:

  • either play the note directly, or
  • first turn the sound generation off and introduce a slight delay before issuing the sound command

We introduced the latter because we observed that, when alternating a tone with stopping the tone, the difference in generated amplitude gives a click. This way we could not only have a series of tones strung together, but also introduce some fine-grained clicks, and so percussion is born :).
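
The two interpretations can be sketched as a small dispatch routine. Everything specific here is an assumption for illustration: the encoding (bit 7 marks the click variant), the delay constant, and the routines soundTone and soundOff, whose stubs stand in for the actual hardware writes.

```nasm
; Sketch only: encoding, names and delay are assumptions for illustration.
CLICK_FLAG  equ 80h             ; assumed: bit 7 set = stop, wait, then play
CLICK_DELAY equ 200             ; assumed busy-wait length

; in: al = row command (tone code, optionally with CLICK_FLAG set)
; clobbers cx when the click variant is used
playRow:
    test    al,CLICK_FLAG       ; click variant requested?
    jz      .direct             ; no: play the tone straight away
    push    ax                  ; keep the command
    call    soundOff            ; silence the current tone
    mov     cx,CLICK_DELAY
.wait:
    loop    .wait               ; short pause that makes the click audible
    pop     ax
    and     al,7Fh              ; strip the flag, keep the tone code
.direct:
    call    soundTone           ; start the tone, it plays until changed
    ret

soundOff:                       ; placeholder for the real chip write
    ret
soundTone:                      ; placeholder for the real chip write
    ret
```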

The Portfolio has no tracker (yet?), so we implemented a conversion from the maxYMizer format to our sound replayer, basically allowing the musician to compose his track in maxYMizer using a limited set of instruments (and notes!). The file exported by maxYMizer is converted to the Portfolio replayer format. The conversion detects any unsupported commands and so on...

However, in maxYMizer the playback speed can be set per row, while the Portfolio has no programmable timer. The observable timer is so slow that it is unusable to drive effects or a music driver, so we decided not to support conversion of playback speed settings from the maxYMizer generated file.

The Demo

Portfolio coding initially started for Sommarhack 2021, but we did not get to the point where it was ready for release, so we picked it back up in April 2022.

Because the Portfolio has neither a programmable interrupt nor a fine-grained observable counter such as VBLANK, timing of the demo is solely based on the execution time of the code, i.e. the time taken by the code executed in the demoloop.

We built the demosystem/script around the number of demoloop executions, like `do 250 effect frames of this code'. However, different pieces of code have different execution times, and since the music player also has to be called from the demoloop, the music would play at different speeds in different parts of the demo. This leads to two requirements. First, every piece of demoloop code that the demosystem runs for multiple frames must take the same time on every frame. Second, the speed of the music replay has to make up for the differences in execution time between effects. To solve the latter, we throttled the music player based on the execution speed of the code, which in turn was determined by running the code for a fixed amount of time and measuring how many effect frames were executed in that time.

This way all parts of the code could have their execution speed measured, and the relative throttling of the music player could be determined. The overall demosystem is made up of parts, where each part is required to have a constant execution speed. A part is described by:

  • number of effect frames
  • music replay throttle fraction (between 0 and 65535 (~1))
  • musicplayer run by demosystem or by codepart itself
  • part routine 1
  • part routine 2
  • part routine 3

An example from the demosystem, the `smfx we're back' part:

    dw  1,0x00fc,dummy,dummy,twirl_unpack           ; unpack assets
    dw  128,0x6468,doMusic,dummy,tsine_main         ; sine wave transition for 128 effect frames
    dw  1,0x4000,dummy,dummy,twirl_fix_pointers     ; fix the pointers for memory
    dw  1,0x46c7,dummy,dummy,loadTwirlFile          ; load assets from disk 
    dw  1,0x009a,dummy,dummy,twirl_unpack_data      ; unpack assets
    dw  1,0x00e9,dummy,dummy,twirl_precalc_data     ; precalc tables
    dw  1,0x51d0,doMusic,dummy,twirl_color_init     ; initialize the pixel table
    dw  35,0xfd00,doMusic,dummy,twirl_main          ; run the effect for 35 frames
    %REP 3
        dw  10,0xfd00,doMusic,dummy,twirl_main      ; run the effect for 10 frames              
        dw  1,0xfd00,doMusic,twirl_overA,twirl_main ; change the overlay
        dw  19,0xfd00,doMusic,dummy,twirl_main      ; run the effect for 19 frames
        dw  1,0xfd00,doMusic,twirl_norm,twirl_main  ; remove the overlay
        dw  19,0xfd00,doMusic,dummy,twirl_main      ; run the effect for 19 frames
        dw  1,0xfd00,doMusic,twirl_overB,twirl_main ; draw the 2nd overlay
        dw  19,0xfd00,doMusic,dummy,twirl_main      ; run the effect for 19 frames
        dw  1,0xfd00,doMusic,twirl_norm,twirl_main  ; remove overlay
        dw  19,0xfd00,doMusic,dummy,twirl_main      ; run the effect for 19 frames
    %ENDREP
    dw  35,0xfd00,doMusic,contrast_dark,twirl_main  ; run the effect while we fade the screen to black

As you can see, some parts of the demosystem run the music player from the main loop, as indicated by doMusic, while others pass dummy and have the music replayer called from within the loading and depacking rout. The varying word values, e.g. 0x4000 and 0x009a, are the increments added to a 16-bit value; the music player only runs when that value overflows.
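
The overflow-based throttle can be sketched as follows. This is a minimal illustration rather than the actual demosystem code: musicAcc, partThrottle and throttledMusic are hypothetical names, and doMusic is stubbed out.

```nasm
; Sketch only: musicAcc, partThrottle and throttledMusic are hypothetical names.
; Called once per effect frame.
throttledMusic:
    mov     ax,[partThrottle]   ; throttle word of the current part
    add     [musicAcc],ax       ; 16-bit accumulate, carry set on overflow
    jnc     .skip               ; no overflow: skip the music this frame
    call    doMusic             ; overflow: advance the music player
.skip:
    ret

doMusic:                        ; placeholder for the real music player
    ret

musicAcc:       dw 0            ; running fractional accumulator
partThrottle:   dw 0fd00h       ; throttle word, here the value used for twirl_main
```

With a throttle word of 0x4000 the accumulator overflows on every fourth call, so the music advances once per four frames; with 0xfd00 it overflows on 253 out of every 256 calls, so the music runs on nearly every frame.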

Of course, an approach like this would also allow depacking code to run alongside the effect, assuming you make the depacker reentrant and have it depack e.g. a set amount of literals per call. We found disk loading and depacking fast enough for our standards, so we implemented loading and depacking separately from the effect code, while keeping the music player running.

The Setup

Other than the memory card interface, the Portfolio has an expansion bus for connectivity. So one either needs to be able to program Bee cards or should use the expansion bus. For the expansion bus, the two most common adapters are a parallel and a serial expansion interface.

The Portfolio comes with some built-in software in ROM, and file transfer over parallel is supported by default. So for us it was trivial to use a parallel cable. Even though we also acquired a serial interface adapter, we could not get it to work.

To transfer files via parallel cable, one requires another machine with an actual parallel port. We found that USB-to-parallel solutions would not work. Someone in the Atari Portfolio facebook group reported that he cobbled together a hardware solution himself that acts as a wifi-to-parallel host, to which he could push files using curl.

During the development of our demo, we used an old laptop running Linux as a base station, to which we could remotely copy files and from which we could remotely trigger file transfers to the Portfolio. For the actual transfer we used transfolio.

Additionally, to get quick turnaround times, Harekiet suggested implementing the Atari Portfolio VDC in assembly, so I could run the code in DOSBox. So we implemented the VDC state machine for the display (I'm sure it's not completely covered...). By abstracting out dx,al into a macro out_dx_al, whose implementation depends on whether DOS is defined or not, I could assemble 2 versions of the same code: one that targeted the Portfolio, and one that targeted DOSBox. In this manner a quick code > assemble > test cycle could be done, which is especially useful when you are learning a new assembly language and system. There is also a Portfolio debugger available, but the above approach allows you to use the DOSBox debugger.

We didn't implement sound for DOSBox, simply because DOSBox is by no means cycle accurate with regard to the Portfolio, especially considering my very naive DOS emulation of the Atari Portfolio VDC.
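
A minimal sketch of such a switchable out_dx_al macro, assuming NASM as the assembler; emuVDCWrite is a hypothetical name for the entry point of the emulated display, not the name used in our sources:

```nasm
; Sketch only: emuVDCWrite is a hypothetical name for the emulation entry point.
%ifdef DOS
    %macro out_dx_al 0
        call    emuVDCWrite     ; DOSBox build: feed dx/al to the emulated VDC
    %endmacro
%else
    %macro out_dx_al 0
        out     dx,al           ; Portfolio build: write to the real hardware
    %endmacro
%endif
```

Assembling with something like nasm -f bin -DDOS demo.asm -o demo.com (demo.asm being a placeholder file name) would then select the DOSBox variant, and leaving out -DDOS the Portfolio one.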

The sources to the Atari Portfolio display emulation are found here[REF TODO!]

Modding the hardware

The hardware has been around for a while, so various mods have been done for and to this machine.

One of the more common hardware mods for the Atari Portfolio is the backlight mod, sold by backlight4you. Soldergirl/Stephanie Rausch made a wonderful youtube tutorial. Recording video from an Atari Portfolio without backlight turned out to be hard. In order to see the screen well, it ought to be properly lit, which in turn causes a lot of reflection on the glossy Pofo screen. Therefore we recorded Nucleus from a Portfolio with the backlight mod: although it may not be the best experience in real life, we found it to provide the best quality when recorded.

Another mod we did to make the recording was to add an audio jack to the Portfolio speaker. That way we could capture the speaker output directly and merge this with the video, rather than having to record the sound indirectly from the weak speaker.

A few more mods that are available on the Portfolio are increased oscillator speed, effectively boosting the CPU clock; and upgrading the memory, I believe up to a size of 640kb or so.

The effects

A writeup would probably not be complete without some small mention about the effects, so here's a short recount:

Pin Identification

This effect is an ode to the cult status of the Atari Portfolio. John Connor uses the Atari Portfolio to hack an ATM in the Terminator 2 movie, so we wanted to start off with this. The effect is done in text mode, and on the last line we keep rewriting the string data without doing a linefeed. This is the first demo since Motus to which all SMFX members contributed, plus we got together again physically at a party; so good times ensued, bitches! :).

Nucleus 3d

Why save the best for last :). This features 3d object rotation using exponential/logarithmic tables and a DDA linedraw rout. The 3d code and line drawing work in an offscreen buffer of 9*64 bytes (72x64 pixels), which is then copied to screen each frame. Because the VDC is slow and its memory cannot be written to directly, the merging of the background and the backdrop mask (used to make the object more visible) is done while writing to screen. The general approach is to make sure the buffer contains the background and mask, such that the linedraw can be or'ed in. The following code is the inner loop that draws to screen:

; di contains the 3d object buffer
; si contains the image background
; bp contains the object mask
    mov al,[di]         ;12     ;get byte from buffer (bg-mask+object)
    out_dx_al           ;       ;write byte to screen
    lodsb               ;12     ;background byte
    and al,[bp]         ;12     ;background-mask byte
    stosb               ;12     ;write to buffer
    inc bp              ;4      ;increase pointer

Twirl

Not too sure what to write about it. This is just some variation of an offset effect, nothing much else really. Since Dekadence did this type of effect on the WonderSwan, we wanted to do it for the Portfolio too.

Twister

The twister is the first piece of code that tries to exploit the slow decay of pixels being turned off. The effect is your usual set of prerendered twister segments, which are then conveniently picked per scanline to make the twister. To exploit the pixel decay, two different sets of twister source data are used, alternating each frame. This results in some pixels remaining black and some pixels alternating, the latter resulting in a sort of greyscale colour.
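
The per-frame alternation between the two data sets can be sketched like this. The names frameToggle, twisterA, twisterB and pickTwisterSource are hypothetical, and the one-byte data sets are placeholders for the real prerendered segments:

```nasm
; Sketch only: frameToggle, twisterA, twisterB and pickTwisterSource are
; hypothetical names.
; out: si -> twister data to draw this frame
pickTwisterSource:
    xor     byte [frameToggle],1 ; flip 0 <-> 1, sets ZF when the result is 0
    mov     si,twisterA          ; mov leaves the flags untouched
    jz      .done                ; toggle is 0: use set A
    mov     si,twisterB          ; toggle is 1: use set B
.done:
    ret

frameToggle:    db 0
twisterA:       db 0             ; placeholder for the first data set
twisterB:       db 0             ; placeholder for the second data set
```

Pixels that are set in both data sets stay dark, while pixels set in only one of them are switched on every other frame and settle on an in-between shade, depending on how fast the particular LCD decays.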

TO BE FINISHED