Hash join: Difference between revisions

Content added Content deleted

Inline

Revision as of 22:42, 30 November 2013

The classic hash join algorithm for an inner join of two relations has the following steps:

Hash phase : Creating a hash table for one of the two relations by applying a hash function to the join attribute of each row. Ideally we should create a hash table for the smaller relation. Thus, optimizing for creation time and memory size of the hash table.
Join phase : Scanning the larger relation and finding the relevant rows by looking in the hash table created before.

The algorithm is as follows:

for each tuple s in S do
   let h = hash on join attributes s(b)
   place s in hash table S_h in bucket keyed by hash value h
for each tuple r in R do
   let h = hash on join attributes r(a)
   if h indicates a nonempty bucket (B) of hash table S_h
      if h matches any s in B
         concatenate r and s
      place relation in Q

Implement the Hash Join algorithm in your programming language (optionally providing a test case as well).

Haskell

The ST monad allows us to utilise mutable memory behind a referentially transparent interface, allowing us to use hashtables (efficiently).

Our hashJoin function takes two lists and two selector functions.

Placing all relations with the same selector value in a list in the hashtable allows us to join many to one/many relations. <lang Haskell>{-# LANGUAGE LambdaCase, TupleSections #-} import qualified Data.HashTable.ST.Basic as H import Data.Hashable import Control.Monad.ST import Control.Monad import Data.STRef

hashJoin :: (Eq k, Hashable k) =>

           [t] -> (t -> k) -> [a] -> (a -> k) -> [(t, a)]

hashJoin xs fx ys fy = runST $ do

 l <- newSTRef []
 ht <- H.new
 forM_ ys $ \y -> H.insert ht (fy y) =<< 
   (H.lookup ht (fy y) >>= \case
     Nothing -> return [y]
     Just v -> return (y:v))
 forM_ xs $ \x -> do
   H.lookup ht (fx x) >>= \case
     Nothing -> return ()
     Just v -> modifySTRef' l ((map (x,)  v) ++) 
 readSTRef l

test = mapM_ print $ hashJoin

   [(1, "Jonah"), (2, "Alan"), (3, "Glory"), (4, "Popeye")]
       snd
   [("Jonah", "Whales"), ("Jonah", "Spiders"), 
     ("Alan", "Ghosts"), ("Alan", "Zombies"), ("Glory", "Buffy")]
       fst

</lang>

λ> test
((3,"Glory"),("Glory","Buffy"))
((2,"Alan"),("Alan","Zombies"))
((2,"Alan"),("Alan","Ghosts"))
((1,"Jonah"),("Jonah","Spiders"))
((1,"Jonah"),("Jonah","Whales"))

The task require hashtables; however, a cleaner and more functional solution would be to use Data.Map (based on binary trees): <lang Haskell>{-# LANGUAGE TupleSections #-} import qualified Data.Map as M import Data.List import Data.Maybe import Control.Applicative

mapJoin xs fx ys fy = joined

 where yMap = foldl' f M.empty ys
       f m y = M.insertWith (++) (fy y) [y] m
       joined = concat . catMaybes . 
                map (\x -> map (x,) <$> M.lookup (fx x) yMap) $ xs

test = mapM_ print $ mapJoin

   [(1, "Jonah"), (2, "Alan"), (3, "Glory"), (4, "Popeye")]
       snd
   [("Jonah", "Whales"), ("Jonah", "Spiders"), 
    ("Alan", "Ghosts"), ("Alan", "Zombies"), ("Glory", "Buffy")]
       fst

</lang>

λ> test
((1,"Jonah"),("Jonah","Spiders"))
((1,"Jonah"),("Jonah","Whales"))
((2,"Alan"),("Alan","Zombies"))
((2,"Alan"),("Alan","Ghosts"))
((3,"Glory"),("Glory","Buffy"))

@@ Line 1: / Line 1: @@
+{{draft task}}
-{{task}}[[wp:Hash Join|Hash Join]]
-The classic hash join algorithm for an inner join of two relations has the following steps:
+The classic [[wp:Hash Join|hash join]] algorithm for an inner join of two relations has the following steps:
 <ul>
 <li>Hash phase : Creating a hash table for one of the two relations by applying a hash
@@ Line 10: / Line 10: @@
 The algorithm is as follows:
-<pre>
-for each tuple s in S do
+ '''for each''' tuple ''s'' '''in''' ''S'' '''do'''
-{ hash on join attributes s(b)
+    '''let''' ''h'' = hash on join attributes ''s''(b)
-  place tuples in hash table based on hash values};
+    '''place''' ''s'' '''in''' hash table ''S<sub>h</sub>'' '''in''' bucket '''keyed by''' hash value ''h''
-for each tuple r do
+ '''for each''' tuple ''r'' '''in''' ''R'' '''do'''
-{ hash on join attributes r(a)
+    '''let''' ''h'' = hash on join attributes ''r''(a)
-  if r hashes in a nonempty bucket of hash table for S
+    '''if''' ''h'' indicates a nonempty bucket (''B'') of hash table ''S<sub>h</sub>''
+       '''if''' ''h'' matches any ''s'' in ''B''
-  then
-    {if r hash key matches any s in bucket concatenate r and s
+          '''concatenate''' ''r'' and ''s''
-    place relation in Q}};</pre>
+       '''place''' relation in ''Q''
 Implement the Hash Join algorithm in your programming language (optionally providing a test case as well).