Monday, March 30, 2009
Math expression in english sentence
Having done the word to number conversion, the next target is to write a series of blogs on how to handle expressions expressed in English sentence -
Part I - lets make a simple expression parser, one which can understand expression with constant numbers. Like "hundred and fifty plus five hundred".
Part II - We'll extend the simple parser to handle symbols. Example - "John has hundred and fifty doller" or "John has twice of what Jimmy has.". We'll talk about generating expression forest & symbol tables.
Part III - Solving. Well, we'll put in a agent based design that would
a. traverse through the expression forest
b. recognize patterns & try to map them to the domain they work against (ex. linear equation could be domain of "Linear Solver" agent)
c. Agents solve part of the unknowns. We'll also define a controller that "talks" to different agents to solve all unknowns (unless its not possible to solve for all)
So watch out ...
Part I - lets make a simple expression parser, one which can understand expression with constant numbers. Like "hundred and fifty plus five hundred".
Part II - We'll extend the simple parser to handle symbols. Example - "John has hundred and fifty doller" or "John has twice of what Jimmy has.". We'll talk about generating expression forest & symbol tables.
Part III - Solving. Well, we'll put in a agent based design that would
a. traverse through the expression forest
b. recognize patterns & try to map them to the domain they work against (ex. linear equation could be domain of "Linear Solver" agent)
c. Agents solve part of the unknowns. We'll also define a controller that "talks" to different agents to solve all unknowns (unless its not possible to solve for all)
So watch out ...
Converting words to number
I returned to Lisp (aka CLisp) after a long gap. Infact, last lisp program i wrote was back in my first year at grad school. Lisp at that time was not interesting to me as i had already started with windows programming & was more into "UI" . Now when i look into it, i'm amazed at the simplicity and power Lisp... especially when you want to try out new algorithms. So, here my first non-trivial lisp program.
Problem statement: Given a number as expressed in words (ex. "two hundred thousand"), convert it to its numeric representation. The test data & expected result i set for myself was:
Solution: One of the algorithm which will work for most of the practical cases would be:
BASIC_UNIT -> one | two | three |...|twenty|thirty|forty|...|ninety
HIGHER_DEGREE -> hundred | thousand | million
N_ATOM -> BASIC_UNIT
N_ATOM -> BASIC_UNIT HIGHER_DEGREE
NUMBER -> N_ATOM+, constraint N_ATOM[i].HIGHER_DEGREE < N_ATOM[j].HIGHER_DEGREE for all j >i
Though the algo is tested in Lisp, converting it to any other language (java, c#) should be straightforward. The constraint basically says that a number is a sequence of N_ATOM(s), given an atom in the sequence, all subsequent atom should have its "higher degree" factor less than it. It basically means that once you have written "thousand" you cannot in subsequent word write "thousand" or higher degree like "million". So sentence like "one thousand thousand" or "one thousand million" is not valid under this grammar (I'll talk about more refined grammar in my subsequent posts).
If you are wondering how to translate this grammar in program, then here's the Lisp code :)
For simplicity sake, i've assigned token to each number word (ideally there should a pre-phase of converting string to token.... but that should be simple). Here's how you'll call for the test cases -
Problem statement: Given a number as expressed in words (ex. "two hundred thousand"), convert it to its numeric representation. The test data & expected result i set for myself was:
- "ninety one" - 91
- "one hundred ninety nine" - 199
- "twenty one hundred" - 2100
- "ninety nine hundred" - 9900 .. notice this should be same as "nine thousand nine hundred"
- "ninety nine thousand nine hundred ninety nine" - 99999
- "nine million ninety nine thousand nine hundred fifty" - 999950
Solution: One of the algorithm which will work for most of the practical cases would be:
BASIC_UNIT -> one | two | three |...|twenty|thirty|forty|...|ninety
HIGHER_DEGREE -> hundred | thousand | million
N_ATOM -> BASIC_UNIT
N_ATOM -> BASIC_UNIT HIGHER_DEGREE
NUMBER -> N_ATOM+, constraint N_ATOM[i].HIGHER_DEGREE < N_ATOM[j].HIGHER_DEGREE for all j >i
Though the algo is tested in Lisp, converting it to any other language (java, c#) should be straightforward. The constraint basically says that a number is a sequence of N_ATOM(s), given an atom in the sequence, all subsequent atom should have its "higher degree" factor less than it. It basically means that once you have written "thousand" you cannot in subsequent word write "thousand" or higher degree like "million". So sentence like "one thousand thousand" or "one thousand million" is not valid under this grammar (I'll talk about more refined grammar in my subsequent posts).
If you are wondering how to translate this grammar in program, then here's the Lisp code :)
For simplicity sake, i've assigned token to each number word (ideally there should a pre-phase of converting string to token.... but that should be simple). Here's how you'll call for the test cases -
- (w2n ninety one) - 91
- (w2n one hundred ninety nine) - 199
- (w2n twenty one hundred - 2100
- (w2n ninety nine hundred) - 9900
- (w2n ninety nine thousand nine hundred ninety nine) - 99999
- (w2n nine million ninety nine thousand nine hundred fifty) - 999950
(setf one 1)
(setf two 2)
(setf three 3)
(setf four 4)
(setf five 5)
(setf six 6)
(setf seven 7)
(setf eight 8)
(setf nine 9)
(setf ten 10)
(setf eleven 11)
(setf twelve 12)
(setf thirteen 13)
(setf fourteen 14)
(setf fifteen 15)
(setf sixteen 16)
(setf seventy 17)
(setf eighteen 18)
(setf nineteen 19)
(setf twenty 20)
(setf thirty 30)
(setf forty 40)
(setf fifty 50)
(setf sixty 60)
(setf seventy 70)
(setf eighty 80)
(setf ninety 90)
(setf hundred 100)
(setf thousand 1000)
(setf million 100000)
;198,149
; BASIC_UNIT := one | two | three |...|twenty|thirty|forty|...|ninety
; HIGHER_DEGREE := hundred | thousand | lakh | crore
; N_ATOM := BASIC_UNIT
; N_ATOM := BASIC_UNIT HIGHER_DEGREE
; N_ATOM := N_ATOM+, constraint N_ATOM[i].HIGHER_DEGREE < N_ATOM[j].HIGHER_DEGREE for all j = i+1 ... N_ATOMS.COUNT
(defun w2n (&rest word-list)
(w2n-aux word-list 0 0 0)
); end of function body
(defun w2n-is-higher-degree (x)
(or (or (equal x hundred) (equal x thousand)) (equal x million))
)
(defun w2n-aux (word-list w2n_result w2n_s0 w2n_h0)
(let ((x (first word-list))
(next_list (cdr word-list))
)
(if (endp word-list) (+ w2n_result w2n_s0)
(if (w2n-is-higher-degree x) (if (or (equal w2n_h0 0) (< x w2n_h0))
(w2n-aux next_list (if (> w2n_s0 0) (+ w2n_result (* w2n_s0 x)) (* w2n_result x)) 0 x)
; else we have issue ... not a well formed wording
(print "Error: words not well formed.")
)
; else
(w2n-aux next_list w2n_result (+ w2n_s0 x) w2n_h0)
)
)
) ; end of let
); end of defun
Subscribe to Posts [Atom]