Introducing Gallia: a Scala+Spark library for data manipulation

galliaproject

Hi everyone,

This is an announcement for Gallia, a new library for data manipulation that maintains a schema throughout transformations and can process data at scale by wrapping Spark RDDs.

Here’s a very basic example of usage on an individual object:

  """{"foo": "hello", "bar": 1, "baz": true, "qux": "world"}"""
    .read() // will infer schema if none is provided
      .toUpperCase('foo)
      .increment  ('bar)
      .remove     ('qux)
      .nest       ('baz).under('parent)
      .flip       ('parent |> 'baz)
    .printJson()
    // prints: {"foo": "HELLO", "bar": 2, "parent": { "baz": false }}

Trying to manipulate 'parent |> 'baz as anything other than a boolean fails with a type error at runtime, but before any data is seen:


      .square ('parent |> 'baz ~> 'BAZ) // instead of "flip" earlier
      // ERROR: TypeMismatch (Boolean, expected Number): 'parent |> 'baz
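
To make the "fails before the data is seen" idea concrete, here is a minimal sketch in plain Scala (2.13+). It is not Gallia's implementation or API: FieldType, Op, SchemaCheck and the dotted field path are hypothetical names made up for illustration. The point is only that a plan can be checked against a schema without reading a single row.

```scala
// Hypothetical model, NOT Gallia's API: every name below is an assumption.
sealed trait FieldType
case object Bool extends FieldType
case object Num  extends FieldType
case object Str  extends FieldType

// A planned operation: which field it touches and which type it requires.
final case class Op(name: String, field: String, expected: FieldType)

object SchemaCheck {
  type Schema = Map[String, FieldType]

  // Check one operation against the schema; None means it is fine.
  private def check(schema: Schema, op: Op): Option[String] =
    schema.get(op.field) match {
      case None =>
        Some(s"NoSuchField: ${op.field}")
      case Some(actual) if actual != op.expected =>
        Some(s"TypeMismatch ($actual, expected ${op.expected}): ${op.field}")
      case Some(_) =>
        None
    }

  // Return the first error in the plan, if any. Only the schema is consulted;
  // no data is read at this stage.
  def validate(schema: Schema, plan: Seq[Op]): Option[String] =
    plan.iterator.map(op => check(schema, op)).collectFirst { case Some(e) => e }
}
```

With a schema of Map("parent.baz" -> Bool), validating Op("flip", "parent.baz", Bool) returns None, while Op("square", "parent.baz", Num) returns a TypeMismatch message, all before any record is processed.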

SQL-like processing looks like the following:


  "/data/people.jsonl.gz2"

    // case class Person(name: String, ...)
    .stream[Person]

    // INPUT: [{"name": "John", "age": 20, "city": "Toronto"}, {...

      /* 1. WHERE            */ .filterBy('age).matches(_ < 25)
      /* 2. SELECT           */ .retain('name, 'age)
      /* 3. GROUP BY + COUNT */ .countBy('age)

    .printJsonl()
    // OUTPUT: {"age": 21, "_count": 10}\n{"age": 22, ...
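
For comparison, the same three steps can be sketched with plain Scala collections (2.13+ for groupMapReduce). This is only an analogy for readers who know the standard library, not how Gallia works internally. PersonRow and PeopleCounts are hypothetical names, and the three fields are an assumption taken from the sample input row above.

```scala
// Hypothetical record type; fields assumed from the sample input row.
final case class PersonRow(name: String, age: Int, city: String)

object PeopleCounts {
  // Returns a map from age to the number of people with that age.
  def countByAge(people: Seq[PersonRow]): Map[Int, Int] =
    people
      .filter(_.age < 25)                   // 1. WHERE
      .map(p => (p.name, p.age))            // 2. SELECT name, age
      .groupMapReduce(_._2)(_ => 1)(_ + _)  // 3. GROUP BY age + COUNT
}
```

The difference is that here the field names and types are fixed by the case class at compile time, whereas the Gallia version above works from a schema that can also be inferred from the data.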

More examples:

It’s also possible, but not required, to process data at scale by leveraging Spark RDDs.

A much more thorough tour can be found at https://github.com/galliaproject/gallia-core/blob/init/README.md

I would love to hear whether this is an effort worth pursuing!

Anthony (@anthony_cros)



Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Introducing Gallia: a Scala+Spark library for data manipulation

galliaproject
I posted a quick update on the Scala mailing list
<https://users.scala-lang.org/t/introducing-gallia-a-library-for-data-manipulation/7112/4>,
which mostly discusses Scala 2.13 support, additional examples, and licensing.


