Seb's Blog
Hacking & Life

26th May 2010
Cowon iAudio 9

I need music. I listen to music all day long. I eat music for breakfast. It's a kind of supplement to air for me. That's why I like to have a good portable music player. These are the requirements I have for a player:

  • good sound quality
  • support for a lossless sound codec, preferably FLAC
  • support for MP3; all of it (don't laugh; there were and are some obscure players that didn't support it (fully))
  • use as USB mass storage (MTP just is not it)
  • browse by files / directories
  • optional Ogg Vorbis support
  • optional FM radio support
  • optional FM / mic recording support

As you can see, some of these requirements come from me being a free software / open source geek. I use Linux (specifically Debian) as my main operating system. UMS support is important because MTP support is flaky at best. I also like to encode music as FLAC or Vorbis.

With these requirements you pretty quickly have only a handful of options. If the playback quality is of importance you will have to consider the Cowon players. These are considered to provide the best sound quality in the consumer market (audiophiles really have to take their Marantz turntable with them ;-)).

So, I tested some players and was really blown away by the Cowon iAudio 6 (not available anymore). That was really, really good stuff. Powerful bass, and very dynamic sound throughout the whole frequency spectrum. It was hard drive based but was quite resistant to failure. But after three years the disk died.

Now I needed a new player. Research again brought up only a handful of players that supported my most important requirements. Only the Cowon players support all of these requirements. I chose the player from the middle of Cowon's product spectrum -- again an iAudio.

Enter: Cowon iAudio 9.

This one is even better than the iAudio 6. It has got a brilliant display, is flash storage based, and has playback quality better than any portable media player I ever tested. It doesn't need to shy away from comparison with good stationary equipment. It's just very, very good.

You should buy one ;-) (no, I'm not paid for this; I'm also not affiliated with Cowon).

Permalink | hardware, music.
26th May 2010
Deserializing large XML files

As mentioned in another blog post I needed to parse a huge (as in 1 TB) XML file. Okay, I wasn't really interested in a parse tree. I just wanted to deserialize entries in that file.

Serialization is the process of turning a program's data structure into a representation that can be taken out of the program (e.g. written to a hard disk), so that it can be read back in later and turned back into a practically identical data structure (that's deserialization).

So, a MediaWiki database dump file is not really a database dump, but an XML serialization of a data structure that describes articles, revisions, and some metadata. The format is basically this:
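To make the definition concrete, here is a minimal round trip. It is a Python sketch using JSON from the standard library, not the XML format or the Perl code this post is about; the `revision` dictionary is made up for illustration.

```python
import json

# A small in-memory data structure (illustrative values only).
revision = {"title": "Page 1", "rev_id": 1, "contributor": {"username": "User 4", "id": 4}}

# Serialization: turn the structure into a string that could be written to disk.
serialized = json.dumps(revision)

# Deserialization: read the string back into an equivalent structure.
restored = json.loads(serialized)

print(restored == revision)
```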


  <page>
    <title>Page 1</title>
    <id>1</id>
    <revision>
      <id>1</id>
      <timestamp>2009-07-23T15:31:45Z</timestamp>
      <contributor>
        <username>User 4</username>
        <id>4</id>
      </contributor>
      <text xml:space="preserve">some content

...
end of content</text>
    </revision>
    <revision>
      <id>15</id>
      <timestamp>2009-07-24T02:02:51Z</timestamp>
      <contributor>
        <username>User 8</username>
        <id>8</id>
      </contributor>
      <comment>made some changes</comment>
      <text xml:space="preserve">some changed content

...
end of content</text>
    </revision>
  </page>

This describes one page with two revisions, their respective IDs, some timestamp and contributor metadata, a revision comment and, most importantly, the content of each revision. For levitation's specific application logic I needed to deserialize this XML one revision at a time, but together with the corresponding page metadata. I.e. I would like to have this data:


$VAR1 = {
    title       => 'Page 1',
    page_id     => 1,
    rev_id      => 1,
    timestamp   => '2009-07-23T15:31:45Z',
    contributor => {
        username    => 'User 4',
        id          => 4
    },
    text    => "some content\n\n...\nend of content"
};

and then:


$VAR1 = {
    title       => 'Page 1',
    page_id     => 1,
    rev_id      => 15,
    timestamp   => '2009-07-24T02:02:51Z',
    contributor => {
        username    => 'User 8',
        id          => 8
    },
    comment     => 'made some changes',
    text    => "some changed content\n\n...\nend of content"
};

etc. And I needed this as fast as possible.

There are two predominant XML access models: SAX and DOM. SAX reads the XML piece by piece and calls specific methods depending on the content it just read. DOM reads the whole document into memory and makes its parts available via an API.

As you may see, DOM is a no-go for this application. Holding a 1 TB document in memory just isn't it. So, SAX it is? Well, let's see: SAX calls a method whenever it enters or leaves an element node (a tag) and whenever it hits a text node. That's at least 20 events per revision (ignoring whitespace text nodes). Times 65 million (the number of revisions of de.wikipedia.org), that makes 1.3 billion method calls. If I program in a high-level language like Python or Perl, this means significant overhead. Method call overhead (just the cost of calling the method) can amount to 2-3 ms per call. Even with just 1 ms per call it would cost me 1.3 million seconds (about 15 days) to use an event-driven API. Not to speak of actually doing something in these methods.
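To see where the event count comes from, here is a small Python sketch (not the code used for levitation) that runs the standard library's SAX parser over a shortened version of the sample page and counts the callbacks fired inside `<revision>`. The names `EventCounter`, `count_events` and `estimated_seconds` are made up for this illustration, and the per-call cost is simply a parameter you plug in yourself.

```python
import xml.sax

SAMPLE = """<page>
  <title>Page 1</title>
  <id>1</id>
  <revision>
    <id>1</id>
    <timestamp>2009-07-23T15:31:45Z</timestamp>
    <contributor>
      <username>User 4</username>
      <id>4</id>
    </contributor>
    <text xml:space="preserve">some content</text>
  </revision>
</page>"""


class EventCounter(xml.sax.ContentHandler):
    """Counts the SAX callbacks that fire inside <revision> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0            # > 0 while we are inside a <revision>
        self.element_events = 0   # start + end element callbacks
        self.text_events = 0      # non-whitespace characters() callbacks

    def startElement(self, name, attrs):
        if name == "revision" or self.depth:
            self.depth += 1
            self.element_events += 1

    def endElement(self, name):
        if self.depth:
            self.depth -= 1
            self.element_events += 1

    def characters(self, content):
        # A parser may deliver one text node in several pieces,
        # so this is a lower bound on the number of calls.
        if self.depth and content.strip():
            self.text_events += 1


def count_events(xml_text):
    handler = EventCounter()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler


def estimated_seconds(revisions, events_per_revision, seconds_per_call):
    """Pure call overhead: one callback per event, nothing else."""
    return revisions * events_per_revision * seconds_per_call


counter = count_events(SAMPLE)
print(counter.element_events, "element events,", counter.text_events, "text events")
print(estimated_seconds(65_000_000, 20, 0.001), "seconds of call overhead")
```

A revision with a `<comment>` adds one more element and one more text node, so the per-revision count only goes up from here.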

So I needed something simpler, something more stupid.

Welcome to Perl and the XML::Bare module. Perl lets you read a chunk of a file up to a specific string (by setting the input record separator $/). So I read in chunks ending with </revision>. If I'm at the start of a page, I get a dangling <page> tag at the beginning of the chunk. If I'm in the middle of a page, I get a whole unbroken "revision" subtree. Enter XML::Bare. This is a fast deserializer. It works on small amounts of data (which we now have) and copes with partial XML that may be a little bit corrupt, like having dangling start tags.

I had to code around some small problems, but then I arrived at this primitive MediaWiki XML deserializer that was really awesomely fast.

Permalink | levitation.
25th May 2010
levitation-perl 0.2 released & de.wikipedia.org converted

Levitation is a project to convert a MediaWiki (the wiki engine used by Wikipedia) database dump to a git repository. This project was started by Tim Weber and was first implemented in Python (this old version can be found at this GitHub page).

This blog entry is about my version, to be found at this GitHub levitation-perl repository.

Some time into the implementation process I got quite frustrated with some aspects of the Python version:

  • inefficient persistent storage
  • slow XML parsing
  • various other performance issues

Some of these performance problems were caused by Python's object system, which seems to punish you severely with expensive method calls. Others could have been worked on, but as it wasn't really known whether git could handle large wikis, I just wanted a prototype kind of tool to answer that question. Unfortunately the project leader wanted to incorporate production-quality code only.

So that was a kind of conflict there ;-) Determined to answer my question before Christmas (and no, I didn't say which Christmas), I started to hack on my own version of levitation. In Perl. Just because I'm way more fluent in that than in any other language. And it was meant to be just a prototype to clarify whether it is feasible to have a git repository of a large wiki.

The goal was always to have the German Wikipedia converted. This one has about three million articles (in all namespaces) and 65 million revisions. That's a lot.

The Wikimedia Foundation provides dumps of each Wikipedia in an XML file that includes all revisions of all articles, sorted by article ID and timestamp. Each revision entry holds the whole text of the article as it was at that moment. Not a diff, the whole text. So these dumps tend to be pretty large. For de.wikipedia.org we're clocking in at over 1 TB now.

So, that's part one of the conversion process: parsing XML, fast.

But we have another problem: the sort order. The original sort order would look pretty stupid in a git repository. So we have to sort the entries by timestamp and revision ID somehow. Because of this we persist some information for the commits to disk that can later be retrieved in a sorted order. Meanwhile we store the revision contents as git blobs (via git-fast-import).

In the second step we retrieve the persisted data in the correct order and write git trees and commits. That's a pretty complex process that's expected to take some time.
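The two steps can be sketched as follows. This is a much simplified Python illustration, not how levitation-perl does it: it persists the commit metadata in SQLite (an assumption, any sortable store would do) and writes both blobs and commits as a git fast-import stream. The record keys (`title`, `rev_id`, `timestamp`, `contributor`, `comment`, `text`), the placeholder e-mail address and the one-file-per-title path layout are all made up for this example.

```python
import sqlite3
from datetime import datetime, timezone


def epoch(timestamp):
    """'2009-07-23T15:31:45Z' -> seconds since the Unix epoch (UTC)."""
    parsed = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ")
    return int(parsed.replace(tzinfo=timezone.utc).timestamp())


def convert(records, out, ref="refs/heads/master"):
    """Write a fast-import stream for `records` to the binary stream `out`."""
    db = sqlite3.connect(":memory:")  # a real run would use a file on disk
    db.execute("CREATE TABLE rev (ts TEXT, rev_id INTEGER, mark INTEGER,"
               " title TEXT, user TEXT, comment TEXT)")

    # Step 1: records arrive sorted by page. Store each text as a blob
    # right away and remember only the small metadata for later.
    for mark, rec in enumerate(records, start=1):
        data = rec["text"].encode("utf-8")
        out.write(b"blob\nmark :%d\ndata %d\n" % (mark, len(data)))
        out.write(data + b"\n")
        db.execute("INSERT INTO rev VALUES (?, ?, ?, ?, ?, ?)",
                   (rec["timestamp"], rec["rev_id"], mark, rec["title"],
                    rec["contributor"].get("username", "anonymous"),
                    rec.get("comment", "")))

    # Step 2: fetch the metadata sorted by timestamp and revision ID
    # (ISO timestamps in UTC sort correctly as plain text) and commit.
    query = ("SELECT ts, rev_id, mark, title, user, comment"
             " FROM rev ORDER BY ts, rev_id")
    for ts, rev_id, mark, title, user, comment in db.execute(query):
        message = (comment or "revision %d" % rev_id).encode("utf-8")
        out.write(b"commit %s\n" % ref.encode("utf-8"))
        # Placeholder e-mail; names must not contain '<', '>' or newlines.
        out.write(b"committer %s <nobody@example.invalid> %d +0000\n"
                  % (user.encode("utf-8"), epoch(ts)))
        out.write(b"data %d\n" % len(message))
        out.write(message + b"\n")
        # Naive path: titles starting with '"' would need C-style quoting.
        path = title.replace("/", "%2F").encode("utf-8")
        out.write(b"M 100644 :%d %s\n\n" % (mark, path))
```

The resulting stream could be piped into `git fast-import` inside an empty repository. Because each commit omits `from`, fast-import uses the current tip of the branch as its parent, so the history comes out in timestamp order.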

And it all took some time.

After 2 days and 6 hours I had a git repository of the German Wikipedia. The .git directory contained 92 GB of packed data. I could run git log. Yeah. I also tried to run git checkout. That took another 11 hours of my computer's life. Yep, checking out nearly 3 million files is not fast on any file system.

To conclude: the created git repository is not (yet) usable. It would take days to clone. It takes half a day just to check out.

Still, to celebrate this milestone -- a working, valid repository of a really large wiki -- I declared this version of levitation-perl as 0.2.

Fetch it at this page.

Upcoming blog entries will detail the hacks used to make this conversion possible before (any) Christmas.

Permalink | levitation.
