Erlang Challenge: Level 3

By anders pearson 24 Aug 2006

(Read [[Erlang Challenge: Level 1]], [[Erlang Challenge: Level 2]], and [[Erlang Challenge: Level 2 (2)]] before reading this post or it won’t make sense)

Level 3 involves regular expressions. I was a serious Perl programmer for a few years, so this shouldn’t take too long.

The problem presented is similar to the last level. It’s looking for needles in a 100K haystack of text that looks like:

kAewtloYgcFQaJNhHVGxXDiQmzjfcpYbzxlWrVcqsmUbCunkfxZWDZjUZMiGqhRRiUvGmYmvnJIHEmbT
MUKLECKdCthezSYBpIElRnZugFAxDRtQPpyeCBgBfaRVvvguRXLvkAdLOeCKxsDUvBBCwdpMMWmuELeG
ENihrpCLhujoBqPRDPvfzcwadMMMbkmkzCCzoTPfbRlzBqMblmxTxNniNoCufprWXxgHZpldkoLCrHJq
vYuyJFCZtqXLhWiYzOXeglkzhVJIWmeUySGuFVmLTCyMshQtvZpPwuIbOHNoBauwvuJYCmqznOBgByPw
TDQheAbsaMLjTmAOKmNsLziVMenFxQdATQIjItwtyCHyeMwQTNxbbLXWZnGmDqHhXnLHfEyvzxMhSXzd

The needle we’re looking for is a single lowercase letter with exactly three uppercase letters on either side.

My approach is basically going to boil down to: 1) save the data in a text file again 2) read it into an erlang string 3) use erlang’s regexp module to look for the pattern

(?<![A-Z])[A-Z]{,3}[a-z][A-Z]{,3})(?![A-Z]) 

That regexp is pretty gnarly, so let me break it down for anyone who hasn’t been immersed in Perl or a similarly regexp obsessed language. In the middle is “[a-z]”, which matches the lowercase letter we’re looking for. On either side of it is “[A-Z]{,3}” which matches exactly three uppercase letters. At the outside are “(?<![A-Z])” and “(?![A-Z])” which are zero-width negative lookbehind and lookahead (respectively) assertions. They are there to make sure that it doesn’t match cases with 4 or more uppercase letters since the instructions said “exactly 3”. The even funkier than normal syntax on them is so they’ll match the characters but those characters won’t get included in the results (hence “zero-width”).

Checking out Erlang’s regexp module, a problem appears. Erlang’s regexp support simply isn’t up to the same level as Perl or Python’s. I was checking because zero-width assertions are fancy enough that I was worried it wouldn’t support them. Naturally it doesn’t. But it also appears that it doesn’t support curly braces for specifying exact numbers of matches. That’s unfortunate, but ultimately both of those are just icing on a regexp engine and we can easily rewrite the pattern without them:

[^A-Z][A-Z][A-Z][A-Z][a-z][A-Z][A-Z][A-Z][^A-Z]

The catch is that without zero-width assertions, we’ll get back an extra character on either side of the interesting bits which we’ll have to strip out with Erlang code.

Otherwise, the regexp module is pretty straightforward. There’s a regexp:first_match/2 function which takes a string and a regexp and returns a tuple of ‘match’ or ‘nomatch’ and the start position and length of the first match in the string that it finds. Those we can pass to string:substr/3 to pull out the stuff we’re interested in:

:::erlang
findit(String) ->
    Pattern = "[^A-Z][A-Z][A-Z][A-Z][a-z][A-Z][A-Z][A-Z][^A-Z]",
    {match,Start,Length} = regexp:first_match(String,Pattern),
    string:substr(String,Start + 1,Length - 2).

The ‘Start + 1’ and ‘Length - 2’ are to strip off the extra characters on either end of it that we don’t care about. And again, in the interest of expediency, we’re ignoring the possible ‘nomatch’ condition since we expect our data file to have a match.

We can reuse the read_data/1 function from last time to read in the data (stored in “l3.txt” this time), but instead of copying and pasting and changing the filename, we might as well refactor it so the filename can be passed in as a parameter:

:::erlang
slurp(Filename) ->
    {ok,IoDevice} = file:open(Filename,[read]),
    {ok,Data} = file:read(IoDevice,10000000),
    Data.

Then, we can do:

:::erlang
findit(slurp("l3.txt")).

And it finds what we’re looking for.

Kind of. When I try using it as the key to get to the next level, I notice that I didn’t read the problem closely enough and there should be multiple matches in the data and we need all of them.

So regexp:find_first/2 isn’t going to cut it. Instead, we can swap it out with regexp:matches/2 which returns all the matches as a list of {Start,Length} tuples. It’s a quick change to just put a list comprehension in to handle the list of results instead of a single result:

:::erlang
findit(String) ->
    Pattern = "[^A-Z][A-Z][A-Z][A-Z][a-z][A-Z][A-Z][A-Z][^A-Z]",
    {match,Matches} = regexp:matches(String,Pattern),
    [string:substr(String,Start + 1,Length - 2) || {Start,Length} <- Matches].

And we’re on to Level 4.

I was tempted to see if I could use Erlang’s gen_fsm behavior to write a small finite state machine to find the matches instead of using regexps (since a regexp is essentially just a convenient syntax for defining a certain kind of FSM) but it’s getting late and I should really get back to my packing.

Tags: erlang challenge erlang python challenge programming regular expressions