Apache Spark Features

  - In-memory computation
  - Distributed processing using parallelize
  - Can be used with many cluster managers (Spark Standalone, YARN, Mesos, etc.)
  - Fault-tolerant
  - Immutable
  - Lazy evaluation
  - Cache & persistence
  - Built-in optimization when using DataFrames
  - Supports ANSI SQL
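
A minimal PySpark sketch touching several of the features above (distributed processing via parallelize, lazy evaluation, caching, and DataFrames with SQL). It assumes pyspark is installed and runs against a local master; the application name and the sample data are made up for illustration.

    # Illustrates parallelize, lazy evaluation, caching, and DataFrames/SQL.
    # Assumes pyspark is installed and a local master is used.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("features-demo").getOrCreate()

    # Distributed processing: parallelize a Python range into an RDD
    rdd = spark.sparkContext.parallelize(range(1, 11))

    # Lazy evaluation: map/filter only build a plan, nothing executes yet
    squares = rdd.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

    # Cache & persistence: keep the computed partitions in memory for reuse
    squares.cache()

    # Actions trigger execution
    print(squares.collect())   # [4, 16, 36, 64, 100]
    print(squares.sum())       # reuses the cached partitions

    # DataFrames get built-in (Catalyst) optimization and ANSI SQL support
    df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "label"])
    df.createOrReplaceTempView("t")
    spark.sql("SELECT label, id * 10 AS scaled FROM t WHERE id > 1").show()

    spark.stop()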

Language processing tasks and corresponding NLTK modules

  - Accessing corpora (nltk.corpus): standardized interfaces to corpora and lexicons
  - String processing (nltk.tokenize, nltk.stem): tokenizers, sentence tokenizers, stemmers
  - Collocation discovery (nltk.collocations): t-test, chi-squared, point-wise mutual information
  - Part-of-speech tagging (nltk.tag): n-gram, backoff, Brill, HMM, TnT
  - Classification (nltk.classify, nltk.cluster): decision tree, maximum entropy, naive Bayes, EM, k-means
  - Chunking (nltk.chunk): regular expression, n-gram, named entity
  - Parsing (nltk.parse): chart, feature-based, unification, probabilistic, dependency
  - Semantic interpretation (nltk.sem, nltk.inference): lambda calculus, first-order logic, model checking
  - Evaluation metrics (nltk.metrics): precision, recall, agreement coefficients
  - Probability and estimation (nltk.probability): frequency distributions, smoothed probability distributions
  - Applications (nltk.app, nltk.chat): graphical concordancer, parsers, WordNet browser, chatbots
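
A short sketch exercising a few of these modules (tokenization, stemming, POS tagging, named-entity chunking, and frequency distributions). It assumes nltk is installed and the usual data packages (punkt, averaged_perceptron_tagger, maxent_ne_chunker, words) have already been fetched with nltk.download(); the sample sentence is made up for illustration.

    # Touches nltk.tokenize, nltk.stem, nltk.tag, nltk.chunk, nltk.probability.
    # Assumes the required nltk data packages are already downloaded.
    import nltk
    from nltk.tokenize import sent_tokenize, word_tokenize
    from nltk.stem import PorterStemmer
    from nltk.probability import FreqDist

    text = "Apache Spark was created at UC Berkeley. It runs distributed jobs."

    # nltk.tokenize: sentence and word tokenizers
    sentences = sent_tokenize(text)
    tokens = word_tokenize(text)

    # nltk.stem: reduce tokens to their stems
    stemmer = PorterStemmer()
    stems = [stemmer.stem(t) for t in tokens]

    # nltk.tag: part-of-speech tagging
    tagged = nltk.pos_tag(tokens)

    # nltk.chunk: named-entity chunking over the tagged tokens
    tree = nltk.ne_chunk(tagged)

    # nltk.probability: frequency distribution over the tokens
    fdist = FreqDist(tokens)

    print(sentences)
    print(tagged[:5])
    print(fdist.most_common(3))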