SWC Transcription

XTrans annotation

The annotation and transcription were performed on the four-channel headset audio recordings using the XTrans tool (figure above). XTrans writes its output in ".tdf" format by default; for the release, this is converted into ".mlf" and ".stm" formats. The strip information used to define the datasets is added to the ".stm" files. Each format is described below for readers who are not familiar with it.

The .stm file starts with label information lines. This information is read by the NIST scoring toolkit SCTK so that WER can be analysed for each category and each label during scoring. Label information lines start with ";;", while lines of the main transcription do not.

In the main transcription, i.e. the lines without ";;", each line describes one annotated speech utterance, with the fields in the following order:

<session ID> <microphone channel ID> <global speaker ID> <starting time (s)> <end time (s)> <labels> <word transcription>

Note that the second column (microphone channel) is handled differently depending on whether the ASR output being evaluated is based on individual headset microphone (IHM) recordings or on single or multiple distant microphones (SDM/MDM). For SDM and MDM, the second column should have the same value for all utterances. The NIST scoring tool accepts many different strings in this column, as long as the string is identical across all utterances. In the default Kaldi setup, however, there is a validation script that accepts only "A" or "B" as the value of this column.
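As an illustration of the SDM/MDM case, the following minimal sketch rewrites the second column of an .stm line to a constant value. The function name `set_constant_channel` and the default value "A" are assumptions made for this example; they are not part of the release or of any toolkit.

```python
def set_constant_channel(stm_line: str, channel: str = "A") -> str:
    """Return the .stm line with its second column replaced by `channel`.

    Hypothetical helper: label information lines (";;") and blank lines
    are returned unchanged.
    """
    if stm_line.startswith(";;") or not stm_line.strip():
        return stm_line
    # Split off only the first two columns so the rest of the line is untouched.
    session, _old_channel, rest = stm_line.split(maxsplit=2)
    return f"{session} {channel} {rest}"


line = "SWC1-00001 03 mn0003 1.85 2.74 <swc1,A> I'M NUMBER ONE"
print(set_constant_channel(line))
```

Choosing "A" keeps the file usable both for the NIST scoring tool and for the Kaldi validation script mentioned above.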

The label column encloses one or more labels in "< >". It gives the recording ID (swc1/swc2/swc3) and the strip ID (A/B/C), which together determine the dataset an utterance belongs to. Below is an excerpt of the SWC1 transcription for IHM in .stm format.

;;
;;
;; CATEGORY "0" "Test set" ""
;; LABEL "swc1" "swc1" "swc1"
;; LABEL "A" "A" "A"
;; LABEL "B" "B" "B"
;; LABEL "C" "C" "C"
SWC1-00001 01 mn0001 1.10 1.80 <swc1,A> YOU'RE NUMBER TWO
SWC1-00001 03 mn0003 1.85 2.74 <swc1,A> I'M NUMBER ONE
SWC1-00001 02 mn0002 2.79 4.24 <swc1,A> THAT'S OVER THERE
SWC1-00001 01 mn0001 4.29 5.36 <swc1,A> DAN IS NUMBER ONE
SWC1-00001 01 mn0001 8.28 9.50 <swc1,A> NOT PAYING ATTENTION
SWC1-00001 02 mn0002 9.96 12.64 <swc1,A> OH HE'LL BE ON THIS SIDE THEN THAT MAKES SENSE NOW
SWC1-00001 01 mn0001 12.88 15.41 <swc1,A> I HAD TO ASK IT'S
SWC1-00001 01 mn0001 15.56 20.81 <swc1,A> ALSO I HAD TO THEN GO AND STAND THAT FAR AWAY FROM THEM COS I DIDN'T HAVE MY GLASSES ON
SWC1-00001 04 mn0004 19.73 21.10 <swc1,A> THREE BARS
SWC1-00001 01 mn0001 25.41 26.54 <swc1,A> STICK IT IN YOUR POCKET
SWC1-00001 04 mn0004 26.49 28.98 <swc1,A> YEAH WELL THAT'S WAS BUT THEN I (D-) MANAGED TO DISCONNECT IT WHEN I MOVED
SWC1-00001 02 mn0002 28.61 32.88 <swc1,A> PRODUCTION WAR GAMES DIRECTOR FOX SAYS
SWC1-00001 03 mn0003 32.88 34.48 <swc1,A> I JUST NEED A CHAIR NOW IN THE CORNER
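
The field layout described above can be read with a few lines of Python. This is a minimal sketch, assuming fields are separated by whitespace and the label column, when present, is a single "<...>" token with comma-separated labels, as in the excerpt. The names `StmSegment` and `parse_stm_line` are hypothetical.

```python
from typing import NamedTuple, Optional, Tuple


class StmSegment(NamedTuple):
    session: str
    channel: str
    speaker: str
    start: float               # seconds
    end: float                 # seconds
    labels: Tuple[str, ...]    # e.g. ("swc1", "A")
    words: Tuple[str, ...]


def parse_stm_line(line: str) -> Optional[StmSegment]:
    """Parse one .stm line; return None for ';;' lines and blank lines."""
    line = line.strip()
    if not line or line.startswith(";;"):
        return None
    fields = line.split(maxsplit=5)
    session, channel, speaker, start, end = fields[:5]
    rest = fields[5] if len(fields) > 5 else ""
    labels: Tuple[str, ...] = ()
    if rest.startswith("<"):
        # Everything up to the first ">" is the label column.
        label_field, _, rest = rest.partition(">")
        labels = tuple(label_field[1:].split(","))
    return StmSegment(session, channel, speaker,
                      float(start), float(end), labels, tuple(rest.split()))


seg = parse_stm_line("SWC1-00001 01 mn0001 1.10 1.80 <swc1,A> YOU'RE NUMBER TWO")
print(seg.labels, seg.words)
```

With the labels available as a tuple, selecting the utterances of one dataset reduces to a test such as `"swc1" in seg.labels and "A" in seg.labels`.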

The .mlf file is provided for HTK-based systems. Each utterance/segment name has the format

"*/<session ID>_<global speaker ID>_<starting time in 6 digits (10ms)>_<ending time in 6 digits (10ms)>.lab"

Each segment name is followed by the word transcription, one word per line. Below is an excerpt of the SWC1 transcription in .mlf format.

#!MLF!#
"*/SWC1-00001_mn0001_000110_000180.lab"
YOU\'RE
NUMBER
TWO
.
"*/SWC1-00001_mn0003_000185_000274.lab"
I\'M
NUMBER
ONE
.
"*/SWC1-00001_mn0002_000279_000424.lab"
THAT\'S
OVER
THERE
.
"*/SWC1-00001_mn0001_000429_000536.lab"
DAN
IS
NUMBER
ONE
.
"*/SWC1-00001_mn0001_000827_000950.lab"
NOT
PAYING
ATTENTION
.
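
The mapping from one utterance to its .mlf entry can be sketched as follows. This is an illustrative reconstruction from the excerpt, not the conversion script used for the release: the function name `mlf_entry` is hypothetical, and rounding the times to the nearest 10 ms is an assumption. The excerpt shows 000827 for the utterance that starts at 8.28 s in the .stm file, so the release conversion may truncate rather than round, and names produced by this sketch can differ by one unit for some segments.

```python
def mlf_entry(session, speaker, start, end, words):
    """Build the .mlf lines for one utterance (times given in seconds)."""
    # Times are written in units of 10 ms, zero-padded to 6 digits.
    # round() is an assumption; see the note in the text above.
    name = f"{session}_{speaker}_{round(start * 100):06d}_{round(end * 100):06d}"
    lines = [f'"*/{name}.lab"']
    # Apostrophes are escaped with a backslash, as in the excerpt (YOU\'RE).
    lines += [word.replace("'", "\\'") for word in words]
    # Each entry is terminated by a line containing a single period.
    lines.append(".")
    return "\n".join(lines)


print("#!MLF!#")
print(mlf_entry("SWC1-00001", "mn0001", 1.10, 1.80, ["YOU'RE", "NUMBER", "TWO"]))
```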

The .tdf format is the default output format of XTrans. It looks similar to the .stm format, but the fields are separated by tabs ("\t") rather than spaces (" "). The .tdf files are included in the release package so that users can conveniently edit and verify the transcription or annotation alongside the audio (.wav files) if needed.

Below is a piece of transcription for SWC1 in .tdf format.

file;unicode channel;int start;float end;float speaker;unicode speakerType;unicode speakerDialect;unicode transcript;unicode section;int turn;int segment;int sectionType;unicode suType;unicode
;;MM sectionTypes [None, None]
;;MM sectionBoundaries [0.0, 9999999.0]
SWC1-00001 01 1.1 1.8 mn0001 unknown native YOU'RE NUMBER TWO 0 0 8
SWC1-00001 03 1.85 2.74 mn0003 unknown native I'M NUMBER ONE 0 0 8
SWC1-00001 02 2.79 4.24 mn0002 unknown native THAT'S OVER THERE 0 0 8
SWC1-00001 01 4.29 5.36 mn0001 unknown native DAN IS NUMBER ONE 0 0 9
SWC1-00001 01 8.28 9.5 mn0001 unknown native NOT PAYING ATTENTION 0 0 8
SWC1-00001 02 9.96 12.64 mn0002 unknown native OH HE'LL BE ON THIS SIDE THEN THAT MAKES SENSE NOW 0 0 16
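
A .tdf file can be loaded with a short reader such as the sketch below. It assumes that the first line is a tab-separated header of "name;type" pairs with the types int, float, and unicode, that lines starting with ";;" are metadata to be skipped, and that trailing columns may be missing from a row, as in the excerpt. The names `parse_tdf` and `CONVERTERS` are hypothetical.

```python
# Map the type names used in the .tdf header to Python converters.
CONVERTERS = {"int": int, "float": float, "unicode": str}


def parse_tdf(text):
    """Parse .tdf content into a list of dicts keyed by column name."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # Header columns look like "start;float"; split each into name and type.
    columns = [col.partition(";")[::2] for col in lines[0].split("\t")]
    rows = []
    for line in lines[1:]:
        if line.startswith(";;"):   # metadata lines such as ";;MM sectionTypes"
            continue
        row = {}
        # zip() stops at the shorter sequence, so missing trailing
        # columns are simply absent from the row.
        for (name, type_name), value in zip(columns, line.split("\t")):
            convert = CONVERTERS.get(type_name, str)
            row[name] = convert(value) if value != "" else None
        rows.append(row)
    return rows


header = "\t".join([
    "file;unicode", "channel;int", "start;float", "end;float",
    "speaker;unicode", "speakerType;unicode", "speakerDialect;unicode",
    "transcript;unicode", "section;int", "turn;int", "segment;int",
    "sectionType;unicode", "suType;unicode",
])
row = "\t".join(["SWC1-00001", "01", "1.1", "1.8", "mn0001", "unknown",
                 "native", "YOU'RE NUMBER TWO", "0", "0", "8"])
sample = "\n".join([header, ";;MM sectionTypes\t[None, None]", row])
print(parse_tdf(sample)[0]["transcript"])
```

Because the transcript is a single tab-delimited field, it keeps its internal spaces, which is the main practical difference from reading the space-separated .stm file.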